# Confusion Matrix Generation for Zero Shot Classification
## Written by Leah Ryu for CS72 final, 22S

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

We need to test the effectiveness of the zero-shot classification we used to sort review sentences into topics. To do this, we'll generate a confusion matrix, which requires comparing the true label for each sentence with the predicted label. We'll parse the predicted labels out of the confusion files, prompt a user to input a true label for each sentence, and then print the confusion matrix.

In [10]:
from sklearn.metrics import confusion_matrix

In [11]:
# Libraries needed to import files from drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [12]:
# Open all the files we need: pos and neg data for the four companies.
# The helper reads every line and closes the file again once it is done.
def readConfusionFile(name):
  with open("/content/drive/MyDrive/compling_final/" + name, 'r') as f:
    return f.readlines()

riotPos = readConfusionFile("riotPosConfusion.txt")
riotNeg = readConfusionFile("riotNegConfusion.txt")

sonyPos = readConfusionFile("sonyPosConfusion.txt")
sonyNeg = readConfusionFile("sonyNegConfusion.txt")

ubisoftPos = readConfusionFile("ubisoftPosConfusion.txt")
ubisoftNeg = readConfusionFile("ubisoftNegConfusion.txt")

activisionPos = readConfusionFile("activisionPosConfusion.txt")
activisionNeg = readConfusionFile("activisionNegConfusion.txt")

Now that we have the file contents loaded, we'll parse the predicted labels out of the zero-shot classifier output.

In [41]:
# Store the predicted values in a 1D array--order is important
predictedLabels = []
# Also store the sentences which were classified so that the user can later input
# the true label.
sentences = []

# Parse out the top predicted label.
def parseConfusionFiles(theFile):
  for line in theFile:
    # Drop the trailing newline first so the closing brace is stripped too.
    value = line.strip().strip("{}")
    values = value.split(", '")

    # Get the review sentence.
    sentence = values[0].split("'")[-2]
    sentences.append(sentence)

    for item in values:
      if item[:6] == "labels":
        # Now the item we have looks something like 
        # labels': ['diversity and inclusion'
        # Parse out the label
        label = item.split("'")[-2]
        predictedLabels.append(label)
    
parseConfusionFiles(riotPos)
parseConfusionFiles(riotNeg)
parseConfusionFiles(sonyPos)
parseConfusionFiles(sonyNeg)
parseConfusionFiles(ubisoftPos)
parseConfusionFiles(ubisoftNeg)
parseConfusionFiles(activisionPos)
parseConfusionFiles(activisionNeg)
print(len(predictedLabels))
print(len(sentences))


200
200
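
The string splitting above depends on the exact layout of each line. As a rough alternative sketch, assuming each line of a confusion file is the printed form of a Python dictionary with `sequence`, `labels`, and `scores` keys, `ast.literal_eval` can rebuild the dictionary directly. The `parseLine` helper and the sample line below are made up for illustration and are not taken from the project files.

```python
import ast

# Hypothetical helper: turn one line of a confusion file into (sentence, top label).
# Assumes the line is a printed Python dict with 'sequence' and 'labels' keys,
# and that 'labels' is sorted from most to least likely.
def parseLine(line):
  record = ast.literal_eval(line.strip())
  return record['sequence'], record['labels'][0]

# Made-up sample line, not real project data.
sampleLine = "{'sequence': 'Great pay and benefits', 'labels': ['compensation and benefits', 'culture and values'], 'scores': [0.9, 0.1]}\n"

sentence, label = parseLine(sampleLine)
print(sentence)
print(label)
```

One reason to consider this approach: Python prints a string that contains an apostrophe with double quotes, which would confuse splitting on `'` but is handled by `ast.literal_eval`.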


Now we need to manually input values for each sentence as gold labels. We'll take in a number corresponding to the category, then store that category.

'diversity and inclusion' = 1 

'culture and values' = 2 

'work life balance' = 3 

'senior management' = 4

'career opportunities' = 5 

'compensation and benefits' = 6

In [52]:
trueLabels = []

# Map each menu number to its category name.
labelNames = {
  1: 'diversity and inclusion',
  2: 'culture and values',
  3: 'work life balance',
  4: 'senior management',
  5: 'career opportunities',
  6: 'compensation and benefits',
}

# Ask for a gold label for every sentence. Invalid input is asked for again
# instead of being skipped, so trueLabels stays aligned with predictedLabels.
def getTrueLabels():
  for i in range(len(predictedLabels)):
    print("The sentence is \"" + sentences[i] + "\"")
    while True:
      val = input("The true label is? ").strip()
      if val.isdigit() and int(val) in labelNames:
        trueLabels.append(labelNames[int(val)])
        break
      print("ERROR: did not input [1, 6]. Please try again.")


getTrueLabels()

The sentence is "Company which respect it employee and is pleasure to work at "


KeyboardInterrupt: ignored

In [46]:
# DAMAGE CONTROL: if labeling stopped early (for example, after an invalid input), you can uncomment this block
# to get a smaller confusion matrix covering only the sentences labeled so far.
# trueLabelsLength = len(trueLabels)
# print(trueLabelsLength)
# print("Predicted vs true")
# confusion_matrix(trueLabels, predictedLabels[:trueLabelsLength], labels = ['diversity and inclusion', 'culture and values', 'work life balance', 'senior management', 'career opportunities', 'compensation and benefits'])

185
Predicted vs true


array([[ 0,  1,  0,  0,  0,  0],
       [ 2, 26, 16,  1, 13,  3],
       [ 0,  1, 21,  0,  4,  1],
       [ 0,  6,  4,  4,  4,  0],
       [ 1,  6,  5,  0, 18,  2],
       [ 0,  6, 15,  1,  1, 23]])

Now we'll print the confusion matrix. We can interpret it like this:

Example matrix: 

       [[0, 0, 0, 0, 0, 0],
        [1, 0, 5, 0, 4, 3],
        [0, 0, 3, 0, 2, 1],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 4]]

Let's look at row two. Remember our categories are `['diversity and inclusion', 'culture and values', 'work life balance', 'senior management', 'career opportunities', 'compensation and benefits']`. So row two

`[1, 0, 5, 0, 4, 3]`

can be interpreted as, "of all the sentences whose true label was culture and values, 1 was predicted by the model to be diversity and inclusion, 0 were correctly predicted to be culture and values, 5 were predicted to be work life balance, etc." In other words, each row corresponds to a true label and each column to a predicted label.
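
The same reading can be sketched in a few lines of plain Python. The variable names here are only illustrative; the matrix is the example matrix above, with rows as true labels and columns as predicted labels.

```python
labels = ['diversity and inclusion', 'culture and values', 'work life balance',
          'senior management', 'career opportunities', 'compensation and benefits']

# Example matrix from above: rows are true labels, columns are predicted labels.
exampleMatrix = [
  [0, 0, 0, 0, 0, 0],
  [1, 0, 5, 0, 4, 3],
  [0, 0, 3, 0, 2, 1],
  [0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 1, 0],
  [0, 0, 1, 0, 0, 4],
]

# Pick out the row for sentences whose true label is 'culture and values'.
rowIndex = labels.index('culture and values')
row = exampleMatrix[rowIndex]

# Each entry counts how many of those sentences got a given predicted label.
for name, count in zip(labels, row):
  print(str(count) + " predicted as '" + name + "'")

# The row total is the number of sentences with this true label,
# and the diagonal entry is how many of them were predicted correctly.
totalTrue = sum(row)
correct = row[rowIndex]
print("Correct: " + str(correct) + " of " + str(totalTrue))
```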



In [45]:
print("Predicted vs true")
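# Compare only as many predictions as we have true labels; sklearn requires equal lengths.
confusion_matrix(trueLabels, predictedLabels[:len(trueLabels)], labels = ['diversity and inclusion', 'culture and values', 'work life balance', 'senior management', 'career opportunities', 'compensation and benefits'])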

Predicted vs true


array([[ 0,  1,  0,  0,  0,  0],
       [ 2, 26, 16,  1, 13,  3],
       [ 0,  1, 21,  0,  4,  1],
       [ 0,  6,  4,  4,  4,  0],
       [ 1,  6,  5,  0, 18,  2],
       [ 0,  6, 15,  1,  1, 23]])

We will also use `sklearn`'s `precision_recall_fscore_support` function to print these values per label. With `average=None`, the output is four arrays, each with one entry per label: array 1 is precision, array 2 is recall, array 3 is fscore, and array 4 is support (in this case, the number of sentences that were gold labeled as belonging to each category).

Precision = # of sentences that were actually `x` AND were predicted as `x` / # of sentences predicted as `x`

Recall = # of sentences that were actually `x` AND were predicted as `x` / # of sentences that were actually `x`
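
As a hand check on these definitions, here is a minimal sketch that computes the same quantities straight from the confusion matrix printed above: the diagonal entry is the numerator, the column total is the precision denominator, and the row total is the recall denominator. The helper name `perLabelScores` is hypothetical, and treating a zero denominator as a score of 0 is an assumption of this sketch.

```python
labels = ['diversity and inclusion', 'culture and values', 'work life balance',
          'senior management', 'career opportunities', 'compensation and benefits']

# Confusion matrix from above: rows are true labels, columns are predicted labels.
matrix = [
  [ 0,  1,  0,  0,  0,  0],
  [ 2, 26, 16,  1, 13,  3],
  [ 0,  1, 21,  0,  4,  1],
  [ 0,  6,  4,  4,  4,  0],
  [ 1,  6,  5,  0, 18,  2],
  [ 0,  6, 15,  1,  1, 23],
]

# Hypothetical helper: returns one (precision, recall, f1) tuple per label.
def perLabelScores(matrix):
  scores = []
  for i in range(len(matrix)):
    truePositives = matrix[i][i]
    predictedAsLabel = sum(row[i] for row in matrix)  # column total
    actuallyLabel = sum(matrix[i])                    # row total
    # Assumption: a zero denominator gives a score of 0.
    precision = truePositives / predictedAsLabel if predictedAsLabel else 0.0
    recall = truePositives / actuallyLabel if actuallyLabel else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    scores.append((precision, recall, f1))
  return scores

scores = perLabelScores(matrix)
for name, (p, r, f) in zip(labels, scores):
  print(name, round(p, 4), round(r, 4), round(f, 4))
```

For 'culture and values', for example, this gives a precision of 26/46 and a recall of 26/61.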

In [47]:
from sklearn.metrics import precision_recall_fscore_support

In [51]:
precision_recall_fscore_support(trueLabels, predictedLabels[:len(trueLabels)], average=None, labels=['diversity and inclusion', 'culture and values', 'work life balance', 'senior management', 'career opportunities', 'compensation and benefits'])

(array([0.        , 0.56521739, 0.3442623 , 0.66666667, 0.45      ,
        0.79310345]),
 array([0.        , 0.42622951, 0.77777778, 0.22222222, 0.5625    ,
        0.5       ]),
 array([0.        , 0.48598131, 0.47727273, 0.33333333, 0.5       ,
        0.61333333]),
 array([ 1, 61, 27, 18, 32, 46]))