# **Significance Test**

* We use a paired t-test to determine whether our improved model is significantly better than the benchmark.

* We have seen that the F1-score of the model trained on auxiliary data is higher than the F1-score of the benchmark model, which was trained only on the original training set, without enrichment.

* We calculate the p-value to see whether that improvement is significant.

In [None]:
USE_GOOGLE_DRIVE_FOR_FILES    = False
DATA_FOLDER_PATH              = "./Data/"

import pandas as pd
import numpy as np
from scipy import stats
from sklearn.metrics import confusion_matrix

if USE_GOOGLE_DRIVE_FOR_FILES:
  from google.colab import drive
  drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
def paired_t_test(vector1, vector2):
    """
    Perform a paired t-test between two vectors
    :param vector1: numpy array, first vector
    :param vector2: numpy array, second vector
    :return: t-value, p-value
    """
    t, p = stats.ttest_rel(vector1, vector2)
    return t, p


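def get_confusion_matrix(y_valid, y_pred, labels):
    """
    Build a confusion matrix as a DataFrame (rows: actual, columns: predicted)
    :param labels: list of class labels, fixes the row/column order
    """
    cm = pd.DataFrame(confusion_matrix(y_valid, y_pred, labels=labels))
    cm.columns.name = 'predicted'
    cm.index.name = 'actual'

    return cm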

Load the results of the **benchmark** model

In [None]:
benchmark = pd.read_csv(f'{DATA_FOLDER_PATH}csv_files/test_predicted_proba_0.csv')
benchmark['y_pred_binary'] = (benchmark['y_pred'] > 0.5).astype('int')

In [None]:
get_confusion_matrix(benchmark.y_true.values, benchmark.y_pred_binary.values, [0,1])

| actual \ predicted | 0 | 1 |
| --- | --- | --- |
| 0 | 946 | 265 |
| 1 | 140 | 505 |


Load the results of the **enriched model** after 3 iterations (run 31)

In [None]:
enriched = pd.read_csv(f'{DATA_FOLDER_PATH}csv_files/test_predicted_proba_3.csv')
enriched['y_pred_binary'] = (enriched['y_pred'] > 0.5).astype('int')

In [None]:
get_confusion_matrix(enriched.y_true.values, enriched.y_pred_binary.values, [0,1])

| actual \ predicted | 0 | 1 |
| --- | --- | --- |
| 0 | 975 | 236 |
| 1 | 141 | 504 |
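
As a quick sanity check, the F1-scores can be recomputed from the two confusion matrices shown above. This is a minimal sketch, assuming class 1 (Hate) is the positive class and that F1 is computed for that class only; the notebook does not state which averaging was used. The helper name `f1_from_counts` is hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Counts read from the confusion matrices above (rows: actual, columns: predicted).
# Assumption: class 1 is the positive class.
benchmark_f1 = f1_from_counts(tp=505, fp=265, fn=140)
enriched_f1 = f1_from_counts(tp=504, fp=236, fn=141)

print(f"benchmark F1: {benchmark_f1:.3f}")
print(f"enriched  F1: {enriched_f1:.3f}")
```

Under these assumptions the benchmark scores about 0.714 and the enriched model about 0.728. The gain comes mainly from fewer false positives (236 instead of 265).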


Run the test

In [None]:
t, p = paired_t_test(benchmark.y_pred_binary.values, enriched.y_pred_binary.values)
p

0.04691654808219231

**With a p-value of about 0.047 (< 0.05), we can reject, at the 5% level, the null hypothesis that the two methods are identical. The improvement is significant.**
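
To make explicit what the paired test measures, the sketch below shows that `stats.ttest_rel` is equivalent to a one-sample t-test on the per-example differences. The two arrays are made-up toy predictions, not the notebook's data. On 0/1 predictions the mean difference is simply the difference between the two models' shares of examples predicted as class 1.

```python
import numpy as np
from scipy import stats

# Toy 0/1 predictions from two hypothetical models on the same 10 examples.
a = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
b = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# Paired t-test on the two prediction vectors.
t_rel, p_rel = stats.ttest_rel(a, b)

# Equivalent formulation: one-sample t-test of the differences against 0.
diff = (a - b).astype(float)
t_one, p_one = stats.ttest_1samp(diff, 0.0)

print(t_rel, t_one)
print(p_rel, p_one)
print(diff.mean(), a.mean() - b.mean())
```

Both calls return the same statistic and p-value, since the paired test only looks at the differences between matched examples.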

Examples: which texts did the winner (the enriched model) correctly classify as Not-Hate, while the benchmark wrongly classified them as Hate?

In [None]:
merged_df = benchmark.merge(enriched, on=['Unnamed: 0', 'text', 'y_true'], how='inner', suffixes=['_benchmark', '_winner'])
diff = merged_df[ (merged_df.y_true==0) & (merged_df.y_pred_binary_benchmark==1) & (merged_df.y_pred_binary_winner==0) ]
diff.sample(5)['text'].values

array(["Punk and coward Tom Arnold Back-Up Twitter Account Suspended After TGP Report on Violent Threat Against 'Narcs'",
       'We used to be "The Grand Old Party" GOP, "We are now, "The Grand New Party" and we need to form our own branch, and leave these spineless old Republicans behind...They act more like Dems, then Rep. anyway...all they think about is position, power and money...and we gave them all 3...Not. anymore...and if we want Trump or someone like Trump to lead us from now on, that\'s our decision.. No More Politicians...We need powerful Business minded representation to lead our country. NO more of these leaders who just want to fill their pockets and live in big mansions, but do nothing for the people. ALL PEOPLE, black, white, brown, yellow, gay, straight, male, female...we all deserve to live in nice neighborhoods, have good jobs, and our children deserve higher education...We deserve to own businesses without fear of home grown terrorists burning them down, or lootin