
# Conducting T tests + Pearson's R



Computational Literature Review

Creator: Nancy Xu

Date created: September 26, 2022

Date last modified: November 26, 2022

This notebook:

- conducts two-sample t-tests to check whether the mean of each metric differs significantly between negative and positive data
- conducts Pearson's correlation tests to measure the correlation between each metric and the actual labels

Files used:
- `pos_cult_full_3_metrics_df.csv` and the other CSV files loaded below (containing normalized scores for cwmd, wmd, cos), generated by `embeddings/word_movers_distance/Calculate_WMD_cos_scores_full.ipynb`

In [1]:
import pandas as pd
import numpy as np  # needed for the np.var calls below
import gensim
import pickle
import os
import re
import ast
import multiprocessing
from sklearn import utils
from gensim.models.phrases import Phrases, Phraser
from gensim.test.utils import get_tmpfile, common_texts
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from tqdm import tqdm
from numpy import dot, absolute
from numpy.linalg import norm

cores = multiprocessing.cpu_count()

In [6]:
pos_cult_full = pd.read_csv('pos_cult_full_3_metrics_df.csv')
neg_cult_full = pd.read_csv('neg_cult_full_3_metrics_df.csv')
pos_demog_full = pd.read_csv('pos_demog_full_3_metrics_df.csv')
neg_demog_full = pd.read_csv('neg_demog_full_3_metrics_df.csv')
pos_rela_full = pd.read_csv('pos_rela_full_3_metrics_df.csv')
neg_rela_full = pd.read_csv('neg_rela_full_3_metrics_df.csv')

## T tests

H0: µ1 = µ2 (the population mean of the positive data equals that of the negative data)

HA: µ1 ≠ µ2 (the population mean of the positive data differs from that of the negative data)


Assumptions:

- The data are normally distributed (a large sample from a non-normal distribution is also acceptable)
- The variances of the two groups are approximately equal
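
The equal-variance assumption can also be checked with a formal test before choosing between Student's and Welch's t-test. The following is a minimal sketch on synthetic data, not the procedure used in this notebook (which compares the printed variances by eye). The helper name `choose_ttest`, the use of Levene's test as the decision rule, and the `alpha = 0.05` threshold are all assumptions made for illustration.

```python
import numpy as np
import scipy.stats as stats


def choose_ttest(a, b, alpha=0.05):
    """Run Levene's test, then Student's or Welch's t-test accordingly.

    Hypothetical helper: the Levene-based rule and the alpha threshold
    are assumptions, not part of the original analysis.
    """
    levene_stat, levene_p = stats.levene(a, b)
    # Keep the equal-variance t-test only if Levene's test does not reject
    equal_var = bool(levene_p >= alpha)
    t_stat, t_p = stats.ttest_ind(a=a, b=b, equal_var=equal_var)
    return {"levene_p": levene_p, "equal_var": equal_var, "t": t_stat, "p": t_p}


# Synthetic data: the second pair has clearly different spreads
rng = np.random.default_rng(0)
same_spread = choose_ttest(rng.normal(0, 1, 500), rng.normal(0, 1, 500))
diff_spread = choose_ttest(rng.normal(0, 1, 500), rng.normal(0, 3, 500))
print(same_spread["equal_var"], diff_spread["equal_var"])
```

With real data, `a` and `b` would be the positive and negative columns of a single metric, such as the `normalized_cult_cwmd` columns used in the cells below.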


### cwmd

In [25]:
import scipy.stats as stats
 
# Creating data groups
data_group1 = pos_cult_full['normalized_cult_cwmd']
 
data_group2 = neg_cult_full['normalized_cult_cwmd']
 
# Compare the group variances to check the equal-variance assumption
print(np.var(data_group1), np.var(data_group2))

0.08342599952133542 0.08042570994563532


In [26]:
stats.ttest_ind(a=data_group1, b=data_group2, equal_var=True)

Ttest_indResult(statistic=0.4839403283389464, pvalue=0.6285780792604434)

Since the p-value (0.6286) is greater than alpha = 0.05, we cannot reject the null hypothesis. We do not have sufficient evidence to say that the mean cwmd differs between positive and negative cultural data.

In [27]:
import scipy.stats as stats
 
# Creating data groups
data_group1 = pos_rela_full['normalized_rela_cwmd']
 
data_group2 = neg_rela_full['normalized_rela_cwmd']
 
# Compare the group variances to check the equal-variance assumption
print(np.var(data_group1), np.var(data_group2))

0.10780223433176751 0.09075853360361387


In [28]:
stats.ttest_ind(a=data_group1, b=data_group2, equal_var=True)

Ttest_indResult(statistic=-0.061426352018829476, pvalue=0.9510369773347835)

In [29]:
import scipy.stats as stats
 
# Creating data groups
data_group1 = pos_demog_full['normalized_dem_cwmd']
 
data_group2 = neg_demog_full['normalized_dem_cwmd']
 
# Compare the group variances (the next cell runs Welch's t-test, equal_var=False)
print(np.var(data_group1), np.var(data_group2))

0.08236026875863917 0.05546518320582084


In [30]:
stats.ttest_ind(a=data_group1, b=data_group2, equal_var=False)

Ttest_indResult(statistic=0.45190839832007235, pvalue=0.6515569495236879)

### wmd

In [31]:
import scipy.stats as stats
 
# Creating data groups
data_group1 = pos_cult_full['normalized_cult_wmd']
 
data_group2 = neg_cult_full['normalized_cult_wmd']
 
# Compare the group variances to check the equal-variance assumption
print(np.var(data_group1), np.var(data_group2))

5.016458728379483e-05 3.6215723341453105e-05


In [32]:
stats.ttest_ind(a=data_group1, b=data_group2, equal_var=True)

Ttest_indResult(statistic=14.825503097051833, pvalue=1.692591380579473e-43)

In [33]:
import scipy.stats as stats
 
# Creating data groups
data_group1 = pos_rela_full['normalized_rela_wmd']
 
data_group2 = neg_rela_full['normalized_rela_wmd']
 
# Compare the group variances to check the equal-variance assumption
print(np.var(data_group1), np.var(data_group2))

3.208812808041857e-05 5.248135120590465e-05


In [34]:
stats.ttest_ind(a=data_group1, b=data_group2, equal_var=True)

Ttest_indResult(statistic=-8.0682926184409, pvalue=3.0459430156285046e-15)

In [35]:
import scipy.stats as stats
 
# Creating data groups
data_group1 = pos_demog_full['normalized_dem_wmd']
 
data_group2 = neg_demog_full['normalized_dem_wmd']
 
# Compare the group variances to check the equal-variance assumption
print(np.var(data_group1), np.var(data_group2))

7.87143449593322e-05 1.9751803775596565e-05


In [36]:
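# Note: the two variances printed above differ by roughly a factor of 4, so
# equal_var=False (Welch's t-test) would be the more cautious choice here
stats.ttest_ind(a=data_group1, b=data_group2, equal_var=True)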

Ttest_indResult(statistic=12.461853496369585, pvalue=1.8481072850877612e-32)

### cos

In [37]:
import scipy.stats as stats
 
# Creating data groups
data_group1 = pos_cult_full['normalized_cult_cos']
 
data_group2 = neg_cult_full['normalized_cult_cos']
 
# Compare the group variances to check the equal-variance assumption
print(np.var(data_group1), np.var(data_group2))

0.0009693092183261205 0.0013247059457304079


In [38]:
stats.ttest_ind(a=data_group1, b=data_group2, equal_var=True)

Ttest_indResult(statistic=13.181788527222324, pvalue=1.2295872848401228e-35)

In [39]:
import scipy.stats as stats
 
# Creating data groups
data_group1 = pos_rela_full['normalized_rela_cos']
 
data_group2 = neg_rela_full['normalized_rela_cos']
 
# Compare the group variances to check the equal-variance assumption
print(np.var(data_group1), np.var(data_group2))

0.0005860074198526151 0.001031138345817007


In [40]:
stats.ttest_ind(a=data_group1, b=data_group2, equal_var=True)

Ttest_indResult(statistic=-8.949502023993045, pvalue=3.090385982859387e-18)

In [41]:
import scipy.stats as stats
 
# Creating data groups
data_group1 = pos_demog_full['normalized_dem_cos']
 
data_group2 = neg_demog_full['normalized_dem_cos']
 
# Compare the group variances to check the equal-variance assumption
print(np.var(data_group1), np.var(data_group2))

0.0007820500248570914 0.00040178817427078165


In [42]:
stats.ttest_ind(a=data_group1, b=data_group2, equal_var=True)

Ttest_indResult(statistic=13.36056564204887, pvalue=1.3791371350619354e-36)

#### All metrics except for CWMD have significantly different means for positive and negative labeled data (p < 0.05 in every WMD and cos test).
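
The nine test cells above repeat the same pattern, so the results can be gathered into one table. The sketch below is one possible way to do that. The function name `ttest_summary` and the frames `pos_demo` / `neg_demo` are hypothetical, and the data are synthetic stand-ins rather than the notebook's CSV files. Note also that this sketch applies a single `equal_var` setting to every column, whereas the notebook used `equal_var=False` for demographic cwmd.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats


def ttest_summary(pos_df, neg_df, columns, equal_var=True):
    """Return one row per column with group variances and t-test results."""
    rows = []
    for col in columns:
        pos, neg = pos_df[col], neg_df[col]
        t_stat, p_value = stats.ttest_ind(a=pos, b=neg, equal_var=equal_var)
        rows.append({
            "metric": col,
            "var_pos": np.var(pos),
            "var_neg": np.var(neg),
            "t": t_stat,
            "p": p_value,
        })
    return pd.DataFrame(rows)


# Synthetic stand-ins for pos_cult_full / neg_cult_full (values are made up)
rng = np.random.default_rng(42)
cols = ["normalized_cult_cwmd", "normalized_cult_wmd", "normalized_cult_cos"]
pos_demo = pd.DataFrame({c: rng.normal(0.5, 0.1, 300) for c in cols})
neg_demo = pd.DataFrame({c: rng.normal(0.5, 0.1, 300) for c in cols})
pos_demo["normalized_cult_wmd"] += 0.1  # inject a mean shift in one column

summary = ttest_summary(pos_demo, neg_demo, cols)
print(summary)
```

On the real data, the call would take the loaded frames directly, for example `ttest_summary(pos_cult_full, neg_cult_full, cols)`.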


## Pearson's correlation

### cwmd

In [47]:
pos_cult_full['cult_label']=1
neg_cult_full['cult_label']=0
cult_df = pd.concat([pos_cult_full,neg_cult_full])

In [48]:
stats.pearsonr(cult_df['normalized_cult_cwmd'],cult_df['cult_label'])

(0.01819743548727936, 0.6285780792601551)

In [49]:
pos_demog_full['dem_label']=1
neg_demog_full['dem_label']=0
demog_df = pd.concat([pos_demog_full,neg_demog_full])

In [57]:
stats.pearsonr(demog_df['normalized_dem_cwmd'],demog_df['dem_label'])

(0.017731949317939123, 0.6317304349264263)

In [52]:
pos_rela_full['rela_label']=1
neg_rela_full['rela_label']=0
rela_df = pd.concat([pos_rela_full,neg_rela_full])

In [55]:
stats.pearsonr(rela_df['normalized_rela_cwmd'],rela_df['rela_label'])

(-0.002306908934528322, 0.9510369773353674)
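
The p-values for cultural and relational cwmd above match the corresponding t-test p-values (0.62857... and 0.95103...). This is expected: Pearson's r against a 0/1 label is the point-biserial correlation, and its significance test is equivalent to the equal-variance two-sample t-test, with t = r·sqrt(n − 2) / sqrt(1 − r²). The demographic p-value differs (0.6317 vs. 0.6516) because that t-test was run with `equal_var=False`. The sketch below checks the equivalence on synthetic data; the group sizes and distributions are made up.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
pos = rng.normal(0.55, 0.3, 400)  # synthetic "positive" scores
neg = rng.normal(0.50, 0.3, 350)  # synthetic "negative" scores

# Stack the scores and build the matching 1/0 labels
scores = np.concatenate([pos, neg])
labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

r, p_r = stats.pearsonr(scores, labels)
t, p_t = stats.ttest_ind(a=pos, b=neg, equal_var=True)

# t statistic implied by r: t = r * sqrt(n - 2) / sqrt(1 - r^2)
n = len(scores)
t_from_r = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

print(np.isclose(p_r, p_t), np.isclose(t, t_from_r))
```

In practice this means the two sections of this notebook test the same hypothesis whenever `equal_var=True`; the correlation coefficient adds the effect's direction and strength.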

### wmd

In [63]:
stats.pearsonr(cult_df['normalized_cult_wmd'],cult_df['cult_label'])

(0.48698711642007264, 1.6925913805865138e-43)

In [64]:
stats.pearsonr(demog_df['normalized_dem_wmd'],demog_df['dem_label'])

(0.4185938154119084, 1.8481072850932498e-32)

In [65]:
stats.pearsonr(rela_df['normalized_rela_wmd'],rela_df['rela_label'])

(-0.2899905539098764, 3.0459430156275067e-15)

### cos

In [66]:
stats.pearsonr(cult_df['normalized_cult_cos'],cult_df['cult_label'])

(0.4441663567132837, 1.2295872848420857e-35)

In [67]:
stats.pearsonr(demog_df['normalized_dem_cos'],demog_df['dem_label'])

(0.4430190157260756, 1.3791371350651215e-36)

In [68]:
stats.pearsonr(rela_df['normalized_rela_cos'],rela_df['rela_label'])

(-0.31859181480658466, 3.0903859828597096e-18)

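#### For cultural, WMD has the highest correlation with the labels (r = 0.487). For demographic, cos is the highest (r = 0.443). For relational, none of them are good: CWMD is near zero, and WMD and cos are negatively correlated with the labels (ngram core ratio has R = 0.168408).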
