This test is invalid when the observed or expected frequencies in each category are too small.
A typical rule is that all of the observed and expected frequencies should be at least 5.
According to [3], the total number of samples is recommended to be greater than 13, otherwise exact tests (such as Barnard’s Exact test) should be used because they do not overreject.
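
The frequency rule can be checked before running the test. Below is a minimal sketch, assuming a small hypothetical table of counts (the values are illustrative, not taken from this dataset), that computes the expected counts under independence and counts the cells that fall below 5.

```python
import numpy as np
from scipy.stats import contingency

# Hypothetical 3x3 table of observed counts (illustrative values only).
observed = np.array([
    [12, 7, 1],
    [9, 15, 2],
    [3, 4, 1],
])

# Expected counts under independence: row_total * col_total / grand_total.
expected = contingency.expected_freq(observed)

# Flag the cells that violate the "at least 5" rule of thumb.
small_cells = expected < 5
print(f"total samples: {observed.sum()}")
print(f"cells with expected count below 5: {small_cells.sum()} of {expected.size}")
```

If many cells are flagged, merging rare categories or switching to an exact test are the usual options.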

Also, the sum of the observed and expected frequencies must be the same for the test to be valid; chisquare raises an error if the sums do not agree within a relative tolerance of 1e-8.
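
A common way to satisfy the sum requirement is to derive the expected frequencies from proportions scaled by the observed total. The sketch below assumes hypothetical counts and uniform proportions.

```python
import numpy as np
from scipy import stats

f_obs = np.array([18, 22, 30, 30])                # observed counts (hypothetical), sum = 100
proportions = np.array([0.25, 0.25, 0.25, 0.25])  # assumed expected proportions

# Scale the proportions by the observed total so that both sums agree.
f_exp = proportions * f_obs.sum()

result = stats.chisquare(f_obs, f_exp)
print(result.statistic, result.pvalue)
```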

The default degrees of freedom, k-1, are for the case when no parameters of the distribution are estimated. If p parameters are estimated by efficient maximum likelihood then the correct degrees of freedom are k-1-p.
If the parameters are estimated in a different way, then the dof can be between k-1-p and k-1.
However, it is also possible that the asymptotic distribution is not chi-square, in which case this test is not appropriate.
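
The degrees-of-freedom adjustment corresponds to the `ddof` argument of `chisquare`. The sketch below uses hypothetical counts and assumes that one parameter (`p_est = 1`) was estimated from the data; it shows that the statistic stays the same and only the reference distribution changes.

```python
import numpy as np
from scipy import stats

f_obs = np.array([18, 22, 30, 30])        # hypothetical observed counts
f_exp = np.full(4, f_obs.sum() / 4)       # expected counts with the same total
k = len(f_obs)
p_est = 1  # assumed number of parameters estimated from the data

default = stats.chisquare(f_obs, f_exp)               # dof = k - 1
adjusted = stats.chisquare(f_obs, f_exp, ddof=p_est)  # dof = k - 1 - p_est

# Same statistic, evaluated against a chi-square with fewer degrees of freedom.
manual_p = stats.chi2.sf(adjusted.statistic, df=k - 1 - p_est)
print(default.pvalue, adjusted.pvalue, manual_p)
```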

In [96]:
import bundle.baselines.st3 as st3
import bundle.baselines.st2 as st2
from models.helpers import *
import pandas as pd
from sklearn.metrics import classification_report as report


In [97]:
languages_with_train = ["en", "fr", "ge", "it", "po", "ru"]
languages_without_train = ["es", "gr", "ka"]

paths_st3 = get_paths('en')
paths_st2 = get_paths('en', subtask=2)

st3_train = st3.make_dataframe(paths_st3["train_folder"], paths_st3["train_labels"])

st3_test = st3.make_dataframe(paths_st3["dev_folder"], paths_st3["dev_labels"])

st2_train = st2.make_dataframe(paths_st2["train_folder"], paths_st2["train_labels"])
    
st2_test = st2.make_dataframe(paths_st2["dev_folder"], paths_st2["dev_labels"])

446it [00:00, 5370.60it/s]
90it [00:00, 6837.42it/s]
433it [00:00, 8612.94it/s]
83it [00:00, 8105.03it/s]


In [98]:
st3_train.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 3760 entries, (111111111, 3) to (999001970, 13)
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3760 non-null   object
 1   labels  3760 non-null   object
dtypes: object(2)
memory usage: 223.6+ KB


In [99]:
st2_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 433 entries, 833042063 to 832917778
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    433 non-null    object
 1   frames  433 non-null    object
dtypes: object(2)
memory usage: 26.3+ KB


In [100]:
# Reset index in both dataframes
st3_train = st3_train.reset_index()
st2_train = st2_train.reset_index()

# Convert 'id' column in both dataframes to int
st3_train['id'] = st3_train['id'].astype(int)
st2_train['id'] = st2_train['id'].astype(int)

# Now you can merge
train = st3_train.merge(st2_train, how='inner', on='id')

In [101]:
train = train.drop(columns=['id', 'line', 'text_x', 'text_y'])

In [102]:
train

Unnamed: 0,labels,frames
0,Doubt,"Health_and_safety,Quality_of_life"
1,Appeal_to_Authority,"Health_and_safety,Quality_of_life"
2,Repetition,"Health_and_safety,Quality_of_life"
3,Appeal_to_Fear-Prejudice,"Health_and_safety,Quality_of_life"
4,Appeal_to_Fear-Prejudice,"Health_and_safety,Quality_of_life"
...,...,...
3597,"Exaggeration-Minimisation,Slogans","Morality,Fairness_and_equality,Legality_Consti..."
3598,Exaggeration-Minimisation,"Morality,Fairness_and_equality,Legality_Consti..."
3599,Name_Calling-Labeling,"Morality,Fairness_and_equality,Legality_Consti..."
3600,"Exaggeration-Minimisation,Name_Calling-Labeling","Morality,Fairness_and_equality,Legality_Consti..."


In [103]:
# Count the rows that did not survive the inner merge

train_outer = st3_train.merge(st2_train, how='outer', on='id')

print(f"we have lost {train_outer.shape[0] - train.shape[0]} rows")
# Some ids appear in only one of the two dataframes, which is why rows are lost
print(f"we have {train.shape[0]} rows in the merged dataframe")
# With 3602 rows left in the merged dataframe, the loss is small and there is enough data to run the chi-squared test

we have lost 179 rows
we have 3602 rows in the merged dataframe
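
To see which side the unmatched rows come from, `merge` accepts `indicator=True`. The sketch below uses two small hypothetical dataframes (`left` and `right`) standing in for `st3_train` and `st2_train`; it is not run on the real data.

```python
import pandas as pd

# Hypothetical stand-ins for st3_train and st2_train after reset_index().
left = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "labels": ["Doubt", "Slogans", "Doubt", "Repetition"],
})
right = pd.DataFrame({
    "id": [1, 2, 4],
    "frames": ["Morality", "Morality", "Legality"],
})

outer = left.merge(right, how="outer", on="id", indicator=True)

# "_merge" is "both" for matched rows and "left_only" / "right_only" otherwise.
counts = outer["_merge"].value_counts()
print(counts)
```

The `left_only` and `right_only` counts add up to the difference between the outer and inner merge sizes.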


## Chi-squared test

In [104]:
import scipy.stats as stats

# Run the chi-squared test of independence on the labels x frames contingency table

chi2, p, dof, ex = stats.chi2_contingency(pd.crosstab(train['labels'], train['frames']))

In [105]:
H0 = "There is no significant relationship between the labels and the frames"
H1 = "There is a significant relationship between the labels and the frames"

if p < 0.05:
    print(H1)
else:
    print(H0)

print(f"The values of the chi squared test are: chi2 = {chi2}, p = {p}, dof = {dof}")

There is a significant relationship between the labels and the frames
The values of the chi squared test are: chi2 = 90450.69645384239, p = 4.743877828064794e-132, dof = 80264
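
The `dof` returned by `chi2_contingency` is (rows - 1) * (columns - 1) of the contingency table. In the run above, dof = 80264 means the crosstab has more than 80264 cells for 3602 rows, so the mean expected count per cell is below 0.05. That suggests the small-frequency caveat from the notes at the top applies to this first test. The sketch below reproduces both quantities on a hypothetical toy dataframe (`toy`) that only borrows the column names of `train`.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data with the same column names as `train`.
toy = pd.DataFrame({
    "labels": ["Doubt", "Doubt", "Slogans", "Repetition", "Slogans", "Doubt"],
    "frames": ["Morality", "Legality", "Morality", "Morality", "Legality", "Morality"],
})

table = pd.crosstab(toy["labels"], toy["frames"])
n_rows, n_cols = table.shape

chi2, p, dof, expected = stats.chi2_contingency(table)

# dof equals (rows - 1) * (cols - 1); the expected counts sum to the sample size.
print(dof, (n_rows - 1) * (n_cols - 1))
print(expected.sum() / expected.size)  # mean expected count per cell
```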


Above, we treated each distinct combination of labels as a single category.
Let's see what happens when we explode all the comma-separated labels into separate rows.

Example:

Exaggeration-Minimisation,Slogans | Morality,Fairness_and_equality

turns into:

Exaggeration-Minimisation | Morality  
Exaggeration-Minimisation | Fairness_and_equality  
Slogans                   | Morality  
Slogans                   | Fairness_and_equality  

In [106]:
train_exploded = train.copy()

train_exploded['labels'] = train_exploded['labels'].str.split(',')
train_exploded["frames"] = train_exploded["frames"].str.split(',')

train_exploded = train_exploded.explode('labels')
train_exploded = train_exploded.explode('frames')

train_exploded

Unnamed: 0,labels,frames
0,Doubt,Health_and_safety
0,Doubt,Quality_of_life
1,Appeal_to_Authority,Health_and_safety
1,Appeal_to_Authority,Quality_of_life
2,Repetition,Health_and_safety
...,...,...
3600,Name_Calling-Labeling,Fairness_and_equality
3600,Name_Calling-Labeling,Legality_Constitutionality_and_jurisprudence
3601,Exaggeration-Minimisation,Morality
3601,Exaggeration-Minimisation,Fairness_and_equality


Let's perform the chi-squared test on the exploded dataset:

In [107]:
chi2, p, dof, ex = stats.chi2_contingency(pd.crosstab(train_exploded['labels'], train_exploded['frames']))

H0 = "There is no significant relationship between the labels and the frames"
H1 = "There is a significant relationship between the labels and the frames"

if p < 0.05:
    print(H1)
else:
    print(H0)

print(f"The values of the chi squared test are: chi2 = {chi2}, p = {p}, dof = {dof}")

There is a significant relationship between the labels and the frames
The values of the chi squared test are: chi2 = 660.5448352645097, p = 2.5699618191412425e-42, dof = 234


In [109]:
import scipy
print(scipy.__version__)

1.13.0


The chi-squared test rejects the hypothesis that the two sets of labels are independent (p < 0.05 in both runs), which indicates a significant relationship between them and makes them a good candidate for transfer learning!