# Task 3

This assignment evaluates the performance of Hearful's predictive model against hand-validated results.

Hearful relies upon a human validator to evaluate the performance of its models. The human validator is presented with the same set of 300 reviews that the model is run against, and produces a similarly structured file containing predictions for the presence of specific themes as well as the sentiment associated with those themes.

For this task, you will be analyzing the human-validated output for the same set of 300 reviews, contained in the file APPAREL_ODOM_1_2019.csv. Your task is to write a Python script to evaluate the performance of the model on the following metrics: precision, recall, accuracy, and f-measure.

The metrics should be computed via standard formulas, using a 2x2 matrix, with true positive, true negative, false positive, and false negative values computed for each theme and each theme’s sentiment independently. The results should be presented in a form similar to the table below.


Please remember that the human validator's result is taken as the gold standard. Therefore, if a certain result is marked as present by the validator but is not present in the model prediction, that is counted as a false negative. Conversely, if the model predicted an outcome not marked by the validator, that is considered a false positive.
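For reference, with TP, FP, TN and FN counted against the validator's labels as described above, the standard definitions are as follows. F-measure is taken here to be the balanced F1 score, which is an assumption; the assignment does not name a specific weighting.

```latex
\begin{aligned}
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall} &= \frac{TP}{TP + FN} \\
\text{accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{aligned}
```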

Please justify any assumptions you make in your computation and submit all code and output files to the GitHub repository.



## Define Positive Classes in Binary Classification Scheme

Positive class for theme_exists columns: 1

Positive class for theme sentiment columns: 'pos'

## Examine Validation and Prediction Sets

In [66]:
import pandas as pd
import numpy as np

#Display full columns in pandas dataframes
pd.set_option("display.max_columns", None)
#None disables column-width truncation (pandas 1.0+; passing -1 is deprecated)
pd.set_option("display.max_colwidth", None)

In [67]:
#Human Validator Set
human_df = pd.read_csv('APPAREL_ODOM_1_2019.csv')

In [68]:
human_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298 entries, 0 to 297
Data columns (total 16 columns):
_id                        298 non-null object
domain_global_string       298 non-null object
review_rating              298 non-null int64
notes                      97 non-null object
review_text                298 non-null object
review_title               278 non-null object
use_sentiment_label        110 non-null object
use_theme_exists           298 non-null int64
fit_sentiment_label        209 non-null object
fit_theme_exists           298 non-null int64
value_sentiment_label      101 non-null object
value_theme_exists         298 non-null int64
style_sentiment_label      120 non-null object
style_theme_exists         298 non-null int64
quality_sentiment_label    189 non-null object
quality_theme_exists       297 non-null float64
dtypes: float64(1), int64(5), object(10)
memory usage: 37.3+ KB


In [69]:
#Prediction Set
model_df = pd.read_csv('APPAREL_ids_1_2019.csv')

In [70]:
model_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 15 columns):
_id                        300 non-null object
domain_global_string       300 non-null object
review_rating              300 non-null int64
review_text                300 non-null object
review_title               278 non-null object
use_sentiment_label        146 non-null object
use_theme_exists           156 non-null float64
fit_sentiment_label        220 non-null object
fit_theme_exists           226 non-null float64
value_sentiment_label      93 non-null object
value_theme_exists         99 non-null float64
style_sentiment_label      180 non-null object
style_theme_exists         183 non-null float64
quality_sentiment_label    201 non-null object
quality_theme_exists       215 non-null float64
dtypes: float64(5), int64(1), object(9)
memory usage: 35.2+ KB


There are two rows in the model predictions that are not in the human validator's file. We will remove those two rows, because we cannot calculate whether the model predicted correctly without knowing the validator's true labels.
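One way to find the unmatched rows in a single step is an outer merge on `_id` with `indicator=True`. The sketch below uses two tiny hypothetical frames (the IDs are placeholders, not values from the CSV files) rather than the real data.

```python
import pandas as pd

# Hypothetical stand-ins for model_df and human_df; only _id matters here.
model_ids = pd.DataFrame({"_id": ["a1", "b2", "c3", "d4"]})
human_ids = pd.DataFrame({"_id": ["a1", "c3"]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
merged = model_ids.merge(human_ids, on="_id", how="outer", indicator=True)

# Rows seen only on the left (model) or only on the right (validator).
only_in_model = merged.loc[merged["_merge"] == "left_only", "_id"].tolist()
only_in_human = merged.loc[merged["_merge"] == "right_only", "_id"].tolist()

print(only_in_model)
print(only_in_human)
```

The same merge answers both directions of the question at once, which the two loops below handle separately.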

In [71]:
#Find ids present in the model predictions but not in the human validation set
ids = model_df.loc[~model_df._id.isin(human_df._id), '_id'].tolist()

In [72]:
ids

['walmart79904828', 'zappos5307009']

In [73]:
#Find ids present in the human validation set but not in the model predictions
ids2 = human_df.loc[~human_df._id.isin(model_df._id), '_id'].tolist()

In [74]:
ids2

[]

In [75]:
#Remove Rows from model_df not in human_df
model_df = model_df[model_df._id.isin(list(human_df._id))]

In [76]:
model_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 298 entries, 0 to 299
Data columns (total 15 columns):
_id                        298 non-null object
domain_global_string       298 non-null object
review_rating              298 non-null int64
review_text                298 non-null object
review_title               278 non-null object
use_sentiment_label        146 non-null object
use_theme_exists           155 non-null float64
fit_sentiment_label        218 non-null object
fit_theme_exists           224 non-null float64
value_sentiment_label      92 non-null object
value_theme_exists         98 non-null float64
style_sentiment_label      179 non-null object
style_theme_exists         182 non-null float64
quality_sentiment_label    200 non-null object
quality_theme_exists       213 non-null float64
dtypes: float64(5), int64(1), object(9)
memory usage: 37.2+ KB


In [77]:
#Reset index of model
model_df.reset_index(inplace=True)

In [78]:
#Check to see if ids match and align in both datasets
#Raises a ValueError if the two indexes differ; a False marks a misaligned row
model_df._id == human_df._id

0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
19     True
20     True
21     True
22     True
23     True
24     True
25     True
26     True
27     True
28     True
29     True
       ... 
268    True
269    True
270    True
271    True
272    True
273    True
274    True
275    True
276    True
277    True
278    True
279    True
280    True
281    True
282    True
283    True
284    True
285    True
286    True
287    True
288    True
289    True
290    True
291    True
292    True
293    True
294    True
295    True
296    True
297    True
Name: _id, Length: 298, dtype: bool
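Scanning 298 boolean values by eye is error-prone, and the notebook display elides rows 30 to 267. A compact alternative is to reduce the comparison to a single answer. The helper below is a hypothetical sketch on plain lists; with the dataframes above, the inputs would be `list(model_df._id)` and `list(human_df._id)`.

```python
def first_mismatch(left_ids, right_ids):
    """Return the first position where the two ID lists differ, or None if they match."""
    for pos, (left, right) in enumerate(zip(left_ids, right_ids)):
        if left != right:
            return pos
    # All shared positions agree; a length difference is still a mismatch.
    if len(left_ids) != len(right_ids):
        return min(len(left_ids), len(right_ids))
    return None

print(first_mismatch(["a", "b", "c"], ["a", "b", "c"]))
print(first_mismatch(["a", "b", "c"], ["a", "x", "c"]))
```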

In [79]:
#Join the dataframes together to better compare the data
df_joined = model_df.merge(human_df, how='inner', on='_id',suffixes=('_pred', '_val'))

In [80]:
#columns to drop so dataframe is easier to read
cols = ['index', 'domain_global_string_pred', 'review_rating_pred', 'review_text_pred',
        'review_title_pred', 'notes', 'review_text_val', 'domain_global_string_val',
        'review_rating_val', 'review_title_val']

In [81]:
df_joined.drop(cols, axis=1, inplace=True)

In [82]:
df_joined.head()

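| | _id | use_sentiment_label_pred | use_theme_exists_pred | fit_sentiment_label_pred | fit_theme_exists_pred | value_sentiment_label_pred | value_theme_exists_pred | style_sentiment_label_pred | style_theme_exists_pred | quality_sentiment_label_pred | quality_theme_exists_pred | use_sentiment_label_val | use_theme_exists_val | fit_sentiment_label_val | fit_theme_exists_val | value_sentiment_label_val | value_theme_exists_val | style_sentiment_label_val | style_theme_exists_val | quality_sentiment_label_val | quality_theme_exists_val |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | academy58403947 | NaN | NaN | NaN | NaN | pos | 1.0 | pos | 1.0 | NaN | NaN | NaN | 0 | neg | 1 | NaN | 0 | pos | 1 | NaN | 0.0 |
| 1 | adidas100338674 | pos | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | pos | 1.0 | NaN | 0 | NaN | 0 | NaN | 0 | pos | 1 | neg | 1.0 |
| 2 | adidas102938471 | pos | 1.0 | pos | 1.0 | pos | 1.0 | pos | 1.0 | NaN | NaN | pos | 1 | pos | 1 | pos | 1 | pos | 1 | NaN | 0.0 |
| 3 | amazonR17DE72WNC7FQM | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | pos | 1.0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 | neg | 1.0 |
| 4 | amazonR19HL1JO6GEKJK | NaN | NaN | neg | 1.0 | pos | 1.0 | NaN | NaN | NaN | NaN | NaN | 0 | neg | 1 | pos | 1 | NaN | 0 | pos | 1.0 |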


In [83]:
#Compare prediction and validation for use_sentiment_label
df_use_sent = df_joined[['use_sentiment_label_pred', 'use_sentiment_label_val']]

In [85]:
df_use_sent.iloc[0, 1]

nan

In [102]:
pd.isnull(df_use_sent.iloc[1, 1])

True

## Computing Metrics

For the sentiment labels, the question is whether a label is positive or negative. Sometimes the validation set has a label where the prediction set does not, and sometimes the prediction set has a label where the validation set does not. In those two cases the pair cannot be classified as a true positive, true negative, false positive, or false negative, so it is left out of the confusion matrix.

True Positive: sent_label_pred = pos & sent_label_val = pos

False Positive: sent_label_pred = pos & sent_label_val = neg

True Negative: sent_label_pred = neg & sent_label_val = neg

False Negative: sent_label_pred = neg & sent_label_val = pos
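Because pairs with a missing label on either side are skipped, the sentiment metrics describe only the reviews where both the model and the validator assigned a label. The sketch below applies the four rules to single pairs and also counts the skipped pairs, so the coverage is visible. The function name and the toy labels are hypothetical, and `None` stands in for a missing value.

```python
from collections import Counter

def sentiment_outcome(pred, val):
    """Classify one (prediction, validation) pair, with 'pos' as the positive class."""
    if pred not in ("pos", "neg") or val not in ("pos", "neg"):
        return "skipped"  # label missing on at least one side
    if pred == "pos":
        return "TP" if val == "pos" else "FP"
    return "TN" if val == "neg" else "FN"

# Toy pairs for illustration only; not taken from the review data.
pairs = [("pos", "pos"), ("pos", "neg"), ("neg", "neg"),
         ("neg", "pos"), ("pos", None), (None, "neg")]
counts = Counter(sentiment_outcome(p, v) for p, v in pairs)
print(counts)
```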

In [103]:
df_use_sent.head()

| | use_sentiment_label_pred | use_sentiment_label_val |
|---|---|---|
| 0 | NaN | NaN |
| 1 | pos | NaN |
| 2 | pos | pos |
| 3 | NaN | NaN |
| 4 | NaN | NaN |


In [154]:
def make_confuse_mat(tp, tn, fp, fn):
    #Returns a 2x2 matrix
    #[[true positive, false positive], [false negative, true negative]]
    return [[tp, fp],
            [fn, tn]]

In [176]:
def confuse_mat_label(df_pred, df_val, label):
    """
    Creates a 2x2 confusion matrix for a sentiment label column
    [[true positive, false positive], [false negative, true negative]]

    Assumes the rows of df_pred and df_val are aligned on _id.
    Pairs with a missing label on either side match no branch and are skipped.
    """
    tp = 0
    tn = 0
    fp = 0
    fn = 0
    
    for pred, val in zip(df_pred[label], df_val[label]):
        if pred == 'pos' and val == 'pos':
            tp += 1
        elif pred == 'neg' and val == 'neg':
            tn += 1
        elif pred == 'pos' and val == 'neg':
            fp += 1
        elif pred == 'neg' and val == 'pos':
            fn += 1
    return make_confuse_mat(tp, tn, fp, fn)

In [126]:
mat = confuse_mat_label(model_df, human_df, 'use_sentiment_label')

In [177]:
confuse_mat_label(model_df, human_df, 'use_sentiment_label')

[[88, 3], [3, 2]]

The theme_exists columns need slightly different rules for false negatives and true negatives, because the model leaves the cell empty (null) when it does not predict a theme, while the validator records 0.

True Positive: theme_exists_pred = 1 & theme_exists_val = 1

False Positive: theme_exists_pred = 1 & theme_exists_val = 0

True Negative: theme_exists_pred = null & theme_exists_val = 0

False Negative: theme_exists_pred = null & theme_exists_val = 1
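The four rules can be sketched on scalar values as follows. `exists_outcome` and `is_missing` are hypothetical helpers; returning `None` for any other combination (for example a missing validator value) is an assumption that corresponds to skipping the row.

```python
import math

def is_missing(value):
    """True for None or a float NaN, the two ways a missing cell may appear."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def exists_outcome(pred, val):
    """Classify one theme_exists pair; a missing prediction means 'theme not predicted'."""
    if is_missing(val):
        return None  # no validator label, so the row cannot be scored
    if pred == 1:
        return "TP" if val == 1 else ("FP" if val == 0 else None)
    if is_missing(pred):
        return "FN" if val == 1 else ("TN" if val == 0 else None)
    return None

print(exists_outcome(1.0, 1), exists_outcome(float("nan"), 1))
```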

In [174]:
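def confuse_mat_exists(df_pred, df_val, label):
    """
    Creates a 2x2 confusion matrix for a theme_exists column
    [[true positive, false positive], [false negative, true negative]]

    A null prediction is treated as 'theme not predicted'.
    Rows with a null validation value match no branch and are skipped.
    """
    tp = 0
    tn = 0
    fp = 0
    fn = 0
    
    for pred, val in zip(df_pred[label], df_val[label]):
        if pred == 1 and val == 1:
            tp += 1
        elif pd.isnull(pred) and val == 0:
            tn += 1
        elif pred == 1 and val == 0:
            fp += 1
        elif pd.isnull(pred) and val == 1:
            fn += 1
    return make_confuse_mat(tp, tn, fp, fn)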

In [152]:
mat2 = confuse_mat_exists(model_df, human_df, 'use_theme_exists')

In [175]:
confuse_mat_exists(model_df, human_df, 'use_theme_exists')

[[100, 55], [10, 133]]

In [158]:
def accuracy_score(mat):
    #takes a 2x2 confusion matrix as its argument, per the Task 3 request
    #[[true positive, false positive], [false negative, true negative]]
    #accuracy = (TP + TN) / (TP + TN + FP + FN)
    
    return (mat[0][0] + mat[1][1]) / (mat[0][0] + mat[1][1] + mat[0][1] + mat[1][0])

In [161]:
def precision_score(mat):
    #takes a 2x2 confusion matrix as its argument, per the Task 3 request
    #[[true positive, false positive], [false negative, true negative]]
    #precision = (TP) / (TP + FP)
    #Answers what proportion of positive identifications were actually correct
    
    return (mat[0][0]) / (mat[0][0] + mat[0][1])

In [163]:
def recall_score(mat):
    #takes a 2x2 confusion matrix as its argument, per the Task 3 request
    #[[true positive, false positive], [false negative, true negative]]
    #recall = (TP) / (TP + FN)
    #Answers what proportion of actual positives was identified correctly
    
    return (mat[0][0]) / (mat[0][0] + mat[1][0])

In [165]:
def f1_score(precision, recall):
    #takes in arguments precision and recall scores
    #f1 = 2 x ((precision x recall)/(precision + recall))
    #harmonic mean of precision and recall
    
    return 2*((precision*recall)/(precision+recall))

In [167]:
acc = accuracy_score(mat2)
prec = precision_score(mat2)
recall = recall_score(mat2)
f1 = f1_score(prec, recall)

In [172]:
print(f'accuracy: {acc:.6}')
print(f'precision: {prec:.6}')
print(f'recall: {recall:.6}')
print(f'f1-score: {f1:.6}')

accuracy: 0.781879
precision: 0.645161
recall: 0.909091
f1-score: 0.754717
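The metric functions above divide without checking the denominator, so a matrix with no predicted positives (TP + FP = 0) or no actual positives (TP + FN = 0) would raise `ZeroDivisionError`. A guarded variant might look like the sketch below. Returning 0.0 for an undefined ratio is a convention chosen here as an assumption, not something the assignment specifies, and `safe_div` and `safe_metrics` are hypothetical names.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0

def safe_metrics(mat):
    """Compute all four metrics from [[TP, FP], [FN, TN]] without raising on zero counts."""
    (tp, fp), (fn, tn) = mat
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# The use_theme_exists matrix from above, then a matrix with no positives at all.
print(safe_metrics([[100, 55], [10, 133]]))
print(safe_metrics([[0, 0], [0, 5]]))
```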


In [178]:
human_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298 entries, 0 to 297
Data columns (total 16 columns):
_id                        298 non-null object
domain_global_string       298 non-null object
review_rating              298 non-null int64
notes                      97 non-null object
review_text                298 non-null object
review_title               278 non-null object
use_sentiment_label        110 non-null object
use_theme_exists           298 non-null int64
fit_sentiment_label        209 non-null object
fit_theme_exists           298 non-null int64
value_sentiment_label      101 non-null object
value_theme_exists         298 non-null int64
style_sentiment_label      120 non-null object
style_theme_exists         298 non-null int64
quality_sentiment_label    189 non-null object
quality_theme_exists       297 non-null float64
dtypes: float64(1), int64(5), object(10)
memory usage: 37.3+ KB


## Metrics for All Themes

In [263]:
def analyze_themes(df_pred, df_val):
    """
    Calculates the precision, recall, accuracy and f1 scores for
    the predictions of a model against its validator.
    
    Prints out results to console.  Does not return any values.
    
    df_pred: Dataframe of the prediction results
    df_val: Dataframe of the validation results
    """
    
    #Get confusion matrices for theme_exists
    fit_exists_mat = confuse_mat_exists(df_pred, df_val, 'fit_theme_exists')
    quality_exists_mat = confuse_mat_exists(df_pred, df_val, 'quality_theme_exists')
    style_exists_mat = confuse_mat_exists(df_pred, df_val, 'style_theme_exists')
    use_exists_mat = confuse_mat_exists(df_pred, df_val, 'use_theme_exists')
    value_exists_mat = confuse_mat_exists(df_pred, df_val, 'value_theme_exists')
    
    #Store matrices in a list
    exists_mats = [fit_exists_mat, quality_exists_mat, style_exists_mat, use_exists_mat, value_exists_mat]
    
    #Get confusion matrices for theme_sentiment_label
    fit_sentiment_mat = confuse_mat_label(df_pred, df_val, 'fit_sentiment_label')
    quality_sentiment_mat = confuse_mat_label(df_pred, df_val, 'quality_sentiment_label')
    style_sentiment_mat = confuse_mat_label(df_pred, df_val, 'style_sentiment_label')
    use_sentiment_mat = confuse_mat_label(df_pred, df_val, 'use_sentiment_label')
    value_sentiment_mat = confuse_mat_label(df_pred, df_val, 'value_sentiment_label')
    
    #Store matrices in a list
    sent_mats = [fit_sentiment_mat, quality_sentiment_mat, style_sentiment_mat, use_sentiment_mat, value_sentiment_mat]
    
    #list of theme labels
    #padded with whitespace so the printout lines up
    theme_labels = ['fit    ', 'quality', 'style  ', 'use    ', 'value  ']
    
    #Print heading with padding
    print("                  Theme_Exists Theme_Sentiment")
    
    #Loop over matrices and theme labels to calculate and print accuracy scores
    for exists_mat, sent_mat, lab in list(zip(exists_mats, sent_mats, theme_labels)):
        exists_acc = accuracy_score(exists_mat)
        sent_acc = accuracy_score(sent_mat)
        print(f'Accuracy {lab} : {exists_acc:.6f}    {sent_acc:.6f}')
    
    #Loop over matrices and theme labels to calculate and print f1 scores
    for exists_mat, sent_mat, lab in list(zip(exists_mats, sent_mats, theme_labels)):
        exists_recall = recall_score(exists_mat)
        sent_recall = recall_score(sent_mat)
        exists_prec = precision_score(exists_mat)
        sent_prec = precision_score(sent_mat)
        exists_f1 = f1_score(exists_prec, exists_recall)
        sent_f1 = f1_score(sent_prec, sent_recall)
        print(f'F-Measure {lab}: {exists_f1:.6f}    {sent_f1:.6f}')
    
    #Loop over matrices and theme labels to calculate and print precision scores
    for exists_mat, sent_mat, lab in list(zip(exists_mats, sent_mats, theme_labels)):
        exists_prec = precision_score(exists_mat)
        sent_prec = precision_score(sent_mat)
        print(f'Precision {lab}: {exists_prec:.6f}    {sent_prec:.6f}')
    
    #Loop over matrices and theme labels to calculate and print recall scores
    for exists_mat, sent_mat, lab in list(zip(exists_mats, sent_mats, theme_labels)):
        exists_recall = recall_score(exists_mat)
        sent_recall = recall_score(sent_mat)
        print(f'Recall {lab}   : {exists_recall:.6f}    {sent_recall:.6f}')

In [262]:
analyze_themes(model_df, human_df)

                  Theme_Exists Theme_Sentiment
Accuracy fit     : 0.859060    0.905263
Accuracy quality : 0.801347    0.803797
Accuracy style   : 0.657718    0.938776
Accuracy use     : 0.781879    0.937500
Accuracy value   : 0.862416    0.930556
F-Measure fit    : 0.901408    0.948571
F-Measure quality: 0.851385    0.874494
F-Measure style  : 0.662252    0.968085
F-Measure use    : 0.754717    0.967033
F-Measure value  : 0.787565    0.961240
Precision fit    : 0.857143    0.917127
Precision quality: 0.797170    0.805970
Precision style  : 0.549451    0.989130
Precision use    : 0.645161    0.967033
Precision value  : 0.775510    0.939394
Recall fit       : 0.950495    0.982249
Recall quality   : 0.913514    0.955752
Recall style     : 0.833333    0.947917
Recall use       : 0.909091    0.967033
Recall value     : 0.800000    0.984127
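Since the assignment asks for results in table form and for output files, the same numbers can also be collected as rows instead of printed. The sketch below uses only the standard library. `metrics_rows`, the column names, and the in-memory buffer are assumptions for illustration, and the single matrix passed in is the `use_theme_exists` matrix computed earlier.

```python
import csv
import io

def metrics_rows(named_mats):
    """Turn {name: [[TP, FP], [FN, TN]]} into a list of row dicts, one per column name."""
    rows = []
    for name, ((tp, fp), (fn, tn)) in named_mats.items():
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        rows.append({
            "column": name,
            "accuracy": round((tp + tn) / (tp + tn + fp + fn), 6),
            "precision": round(precision, 6),
            "recall": round(recall, 6),
            "f_measure": round(2 * precision * recall / (precision + recall), 6),
        })
    return rows

rows = metrics_rows({"use_theme_exists": [[100, 55], [10, 133]]})

# Write to an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```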


There is also a Python script, metrics.py, that can be run from the command prompt by typing:

python metrics.py APPAREL_ids_1_2019.csv APPAREL_ODOM_1_2019.csv
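The contents of metrics.py are not shown here. As a hedged sketch only, a script invoked this way could read its two positional arguments with `argparse`. The argument names `predictions` and `validation` are hypothetical, as is the assumption that the prediction file comes first (it matches the order in the command above).

```python
import argparse

def parse_args(argv=None):
    """Parse the two CSV paths; argv=None falls back to the real command line."""
    parser = argparse.ArgumentParser(
        description="Compare model predictions with human-validated labels.")
    parser.add_argument("predictions", help="CSV file of model predictions")
    parser.add_argument("validation", help="CSV file of human-validated labels")
    return parser.parse_args(argv)

# Passing the arguments explicitly keeps the example runnable outside a shell.
args = parse_args(["APPAREL_ids_1_2019.csv", "APPAREL_ODOM_1_2019.csv"])
print(args.predictions, args.validation)
```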