# Ancient Texts: Hypothesis Testing
### Purpose
To discover whether modern NLP tools and predictive algorithms can provide insights into ancient text corpora.

### This notebook contains Hypothesis Testing (two-sample t-tests)
Several two-sample hypothesis tests (t-tests) are conducted to determine whether the difference in *means* between the two groups of the target variable `is_hymn` is statistically significant, or plausibly the result of random sampling, for each of the features `word_count`, `whole_polarity`, and `whole_subjectivity`.

### Data
The data used is `ancient_for_ML.csv`, which was finalized in the notebook `ancient_NLP_balancing`.

#### `is_hymn` is divided between

**'is_hymn=1'**
> 'Royal praise poetry and hymns to deities on behalf of rulers', 'Praise poetry and hymns for unknown rulers', 'Hymns addressed to deities', 'Hymns addressed to or concerning temples', 'Other letters and letter-prayers'

**'is_hymn=0'**

> 'Narratives featuring deities', 'Narratives featuring heroes', 'King lists and other compositions', 'City laments', 'Royal correspondence', 'School stories', 'Debate poems', 'Dialogues and diatribes', 'Personal laments', 'Lu-digira compositions', 'Types of song', 'Didactic compositions', 'Short tales', 'Animal fables', 'Offering compositions', 'Proverb collections', 'Other proverbs', 'Reflective compositions', 'Other', 'Lexical compositions'


In [1]:
# Import standard operational packages
import pandas as pd
import numpy as np

# Import additional statistical package
from scipy import stats

# Set Jupyter to display all of the columns (no redaction)
pd.set_option('display.max_columns', None)

In [2]:
# Import data; create df
df0 = pd.read_csv('ancient_for_ML.csv')
df0.head(3)

| | b_category | number | whole_polarity | whole_subjectivity | word_count | tfidf_top_word | sentence_count | is_hymn | summary | lemmatized_summaries |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Narratives featuring deities | 1.1.1 | 0.330364 | 0.559201 | 1440 | oil like | 167 | 0 | pure is dilmun land. pure is dilmun land. pure... | pure dilmun land pure dilmun land pure dilmun ... |
| 1 | Narratives featuring deities | 1.1.2 | 0.1106 | 0.449011 | 825 | could not | 81 | 0 | in those days, in the days when heaven and ear... | day day heaven earth created night night heave... |
| 2 | Narratives featuring deities | 1.1.3 | 0.382723 | 0.596019 | 2247 | placed charge | 236 | 0 | grandiloquent lord of heaven and earth, self-r... | grandiloquent lord heaven earth self-reliant f... |


In [3]:
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   b_category            364 non-null    object 
 1   number                364 non-null    object 
 2   whole_polarity        364 non-null    float64
 3   whole_subjectivity    364 non-null    float64
 4   word_count            364 non-null    int64  
 5   tfidf_top_word        364 non-null    object 
 6   sentence_count        364 non-null    int64  
 7   is_hymn               364 non-null    int64  
 8   summary               364 non-null    object 
 9   lemmatized_summaries  364 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 28.6+ KB


In [4]:
df0.describe().round(2)

| | whole_polarity | whole_subjectivity | word_count | sentence_count | is_hymn |
|---|---|---|---|---|---|
| count | 364.0 | 364.0 | 364.0 | 364.0 | 364.0 |
| mean | 0.28 | 0.58 | 439.93 | 55.2 | 0.58 |
| std | 0.2 | 0.13 | 611.72 | 67.88 | 0.49 |
| min | -0.37 | 0.0 | 31.0 | 1.0 | 0.0 |
| 25% | 0.14 | 0.52 | 124.75 | 17.0 | 0.0 |
| 50% | 0.27 | 0.58 | 217.5 | 29.0 | 1.0 |
| 75% | 0.41 | 0.65 | 438.5 | 60.25 | 1.0 |
| max | 0.84 | 0.91 | 5305.0 | 452.0 | 1.0 |


In [5]:
# Calculate count and % of `is_hymn` in full dataset
print(df0.shape)
print(df0['is_hymn'].value_counts())  # number of rows for each value of `is_hymn`
print(df0['is_hymn'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%')  # normalize=True returns proportions; mul(100) converts them to percentages

(364, 10)
is_hymn
1    210
0    154
Name: count, dtype: int64
is_hymn
1    57.7%
0    42.3%
Name: proportion, dtype: object


## Perform Hypothesis testing: Welch's t-test
Welch's t-test does not assume that the two populations have equal variances (there is no reason to assume the same variance here)\
(Variance: the average of the squared difference of each data point from the mean)
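
To make the test less of a black box, here is a minimal sketch of what Welch's statistic and its Welch-Satterthwaite degrees of freedom look like when computed by hand with the standard library. The function name `welch_t` and the toy numbers are assumptions for illustration only and are not taken from the corpus. The sketch uses the *sample* variance (divisor $n-1$), which is what the t-test works with.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(a), len(b)
    # Squared standard error of each group mean: sample variance / group size
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    # Difference in means divided by its (unpooled) standard error
    t_stat = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Approximate degrees of freedom when the variances are not pooled
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Toy data (hypothetical, not from ancient_for_ML.csv)
group_a = [2.0, 4.0, 6.0, 8.0]
group_b = [1.0, 2.0, 3.0]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.4f}, df = {dof:.4f}")
```

Because the variances are not pooled, the degrees of freedom are generally not a whole number, which is why the `df` values reported by `stats.ttest_ind(..., equal_var=False)` in this notebook are fractional.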

#### Steps
1. State the null hypothesis ($H_0$) and the alternative hypothesis ($H_a$)
    * $H_0$: there is no difference between the group means; any observed difference is due to chance
    * $H_a$: there is a difference between the group means that is not due to chance
2. Choose a significance level: 5%
3. Find the p-value, using the `stats.ttest_ind()` function to perform the test
4. Reject or fail to reject the null hypothesis: reject $H_0$ if the p-value is below the significance level
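
The steps above can be wrapped in a small helper, sketched here under the assumption that a plain comparison of the p-value against the significance level is all that is needed. The name `welch_decision` and the toy arrays are hypothetical and are not part of the original notebook; only `stats.ttest_ind(..., equal_var=False)` is taken from it.

```python
import numpy as np
from scipy import stats


def welch_decision(sample_a, sample_b, alpha=0.05):
    """Run Welch's t-test and state the decision at significance level alpha."""
    # Step 3: compute the test statistic and p-value without pooling variances
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
    # Step 4: compare the p-value with the chosen significance level
    decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    return result.statistic, result.pvalue, decision


# Toy data (hypothetical): one clearly shifted pair and one identical pair
shifted = welch_decision(np.arange(50.0), np.arange(50.0) + 100.0)
identical = welch_decision(np.arange(50.0), np.arange(50.0))
print(shifted[2])
print(identical[2])
```

Note that "fail to reject" is not the same as accepting $H_0$: it only means the data do not provide enough evidence against it at the chosen level.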

In [6]:
# Set significance level
significance_level = 0.05
significance_level

0.05

### Two-sample t-test for statistical significance between `is_hymn` statuses and median and mean `word_count` 
* $H_0$: there is no difference in the *mean* `word_count` between 'is_hymn=0' and 'is_hymn=1'; any observed difference is due to chance
* $H_a$: there is a difference in the *mean* `word_count` between 'is_hymn=0' and 'is_hymn=1' that is not due to chance

In [8]:
# Calculate median `word_count` for each group in `is_hymn`
df0.groupby('is_hymn')['word_count'].median()

is_hymn
0    309.5
1    185.0
Name: word_count, dtype: float64

In [10]:
# Calculate mean `word_count` for each group in `is_hymn`
df0.groupby('is_hymn')['word_count'].mean()

is_hymn
0    642.772727
1    291.180952
Name: word_count, dtype: float64

In [12]:
# Conduct a two-sample t-test to compare means
# Isolate 'is_hymn=0' and 'is_hymn=1'
not_hymn = df0[df0['is_hymn'] == 0]['word_count']
hymn = df0[df0['is_hymn'] == 1]['word_count']

# Perform t-test
stats.ttest_ind(a=not_hymn, b=hymn, equal_var=False) # equal_var=False to not assume population variances are =

TtestResult(statistic=5.07111588277485, pvalue=9.192666985612655e-07, df=194.37129015419953)

#### Results
There is a large difference in both the medians (309.5 vs. 185.0) and the means (642.8 vs. 291.2) of `word_count` between 'is_hymn=0' and 'is_hymn=1'   

The p-value is far below the significance level (9.19e-07 < 0.05), so the null hypothesis can be rejected \
There is a statistically significant difference between the means of `word_count` for 'is_hymn=0' and 'is_hymn=1'
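
The mean of `word_count` in each group is well above its median, which suggests a right-skewed distribution. One possible sensitivity check, sketched here as an assumption and not something the original notebook does, is to repeat Welch's t-test on log-transformed counts. The helper `welch_on_log` and the synthetic log-normal data are hypothetical; the log is only valid because counts are strictly positive (the minimum `word_count` in `describe()` is 31).

```python
import numpy as np
from scipy import stats


def welch_on_log(a, b):
    """Welch's t-test on natural-log values; inputs must be strictly positive."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return stats.ttest_ind(np.log(a), np.log(b), equal_var=False)


# Synthetic right-skewed "word counts" (hypothetical, not the corpus data)
rng = np.random.default_rng(42)
longer_texts = rng.lognormal(mean=6.0, sigma=1.0, size=150)
shorter_texts = rng.lognormal(mean=5.2, sigma=1.0, size=200)

raw = stats.ttest_ind(longer_texts, shorter_texts, equal_var=False)
logged = welch_on_log(longer_texts, shorter_texts)
print(f"raw scale: t = {raw.statistic:.2f}, p = {raw.pvalue:.2e}")
print(f"log scale: t = {logged.statistic:.2f}, p = {logged.pvalue:.2e}")
```

If both versions lead to the same decision, the conclusion is less likely to be driven by a handful of very long texts.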

### Two-sample t-test for statistical significance between `is_hymn` statuses and median and mean `whole_polarity` 
* $H_0$: there is no difference in the *mean* `whole_polarity` between 'is_hymn=0' and 'is_hymn=1'; any observed difference is due to chance
* $H_a$: there is a difference in the *mean* `whole_polarity` between 'is_hymn=0' and 'is_hymn=1' that is not due to chance

In [16]:
# Calculate median `whole_polarity` for each group in `is_hymn`
df0.groupby('is_hymn')['whole_polarity'].median()

is_hymn
0    0.155301
1    0.358964
Name: whole_polarity, dtype: float64

In [17]:
# Calculate mean `whole_polarity` for each group in `is_hymn`
df0.groupby('is_hymn')['whole_polarity'].mean()

is_hymn
0    0.181630
1    0.343882
Name: whole_polarity, dtype: float64

In [18]:
# Conduct a two-sample t-test to compare means
# Isolate 'is_hymn=0' and 'is_hymn=1'
not_hymn = df0[df0['is_hymn'] == 0]['whole_polarity']
hymn = df0[df0['is_hymn'] == 1]['whole_polarity']

# Perform t-test
stats.ttest_ind(a=not_hymn, b=hymn, equal_var=False)

TtestResult(statistic=-8.451312209963868, pvalue=8.579570660847614e-16, df=339.69806614238684)

#### Results
There is a large difference in both the medians (0.155 vs. 0.359) and the means (0.182 vs. 0.344) of `whole_polarity` between 'is_hymn=0' and 'is_hymn=1'    

The p-value is far below the significance level (8.58e-16 < 0.05), so the null hypothesis can be rejected \
There is a statistically significant difference between the means of `whole_polarity` for 'is_hymn=0' and 'is_hymn=1'

### Two-sample t-test for statistical significance between `is_hymn` statuses and median and mean `whole_subjectivity` 
* $H_0$: there is no difference in the *mean* `whole_subjectivity` between 'is_hymn=0' and 'is_hymn=1'; any observed difference is due to chance
* $H_a$: there is a difference in the *mean* `whole_subjectivity` between 'is_hymn=0' and 'is_hymn=1' that is not due to chance

In [19]:
# Calculate median `whole_subjectivity` for each group in `is_hymn`
df0.groupby('is_hymn')['whole_subjectivity'].median()

is_hymn
0    0.539778
1    0.619656
Name: whole_subjectivity, dtype: float64

In [20]:
# Calculate mean `whole_subjectivity` for each group in `is_hymn`
df0.groupby('is_hymn')['whole_subjectivity'].mean()

is_hymn
0    0.537204
1    0.605122
Name: whole_subjectivity, dtype: float64

In [21]:
# Conduct a two-sample t-test to compare means
# Isolate 'is_hymn=0' and 'is_hymn=1'
not_hymn = df0[df0['is_hymn'] == 0]['whole_subjectivity']
hymn = df0[df0['is_hymn'] == 1]['whole_subjectivity']

# Perform t-test
stats.ttest_ind(a=not_hymn, b=hymn, equal_var=False)

TtestResult(statistic=-5.129226919810546, pvalue=4.985454716634919e-07, df=328.22082466301015)

#### Results
There is a smaller but consistent difference in both the medians (0.540 vs. 0.620) and the means (0.537 vs. 0.605) of `whole_subjectivity` between 'is_hymn=0' and 'is_hymn=1'    

The p-value is far below the significance level (4.99e-07 < 0.05), so the null hypothesis can be rejected \
There is a statistically significant difference between the means of `whole_subjectivity` for 'is_hymn=0' and 'is_hymn=1'

### Results
There is a statistically significant difference between the *means* in all cases:
* There is a low p-value for the factor `word_count` by 'is_hymn=1' versus 'is_hymn=0'
* There is a low p-value for the factor `whole_polarity` by 'is_hymn=1' versus 'is_hymn=0'
* There is a low p-value for the factor `whole_subjectivity` by 'is_hymn=1' versus 'is_hymn=0'

A low p-value (here below 0.05) indicates that the observed difference in means is unlikely to be due to chance alone. It does not by itself measure how large the difference is, but it suggests that the feature may be a meaningful predictor.

Thus, `word_count`, `whole_polarity`, and `whole_subjectivity` are likely related to the outcome variable (`is_hymn`).

Keeping these features in the models:
* Will likely improve their performance
* Will help in interpreting the models' decisions by giving insight into patterns
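
The three tests summarised above follow the same pattern, so they could be collected into one table. The following is a minimal sketch under stated assumptions: the helper `compare_groups` is hypothetical, the data frame is a synthetic stand-in (not `ancient_for_ML.csv`), and the effect size shown is Cohen's d standardised by the root mean of the two sample variances, which is only one of several conventions and is not computed in the original notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_groups(df, target, features, alpha=0.05):
    """Welch's t-test and a simple effect size per feature, split by a 0/1 target."""
    rows = []
    for feature in features:
        group0 = df.loc[df[target] == 0, feature]
        group1 = df.loc[df[target] == 1, feature]
        result = stats.ttest_ind(a=group0, b=group1, equal_var=False)
        # Standardiser: root mean of the two sample variances (an assumed convention)
        spread = np.sqrt((group0.var(ddof=1) + group1.var(ddof=1)) / 2)
        rows.append({
            "feature": feature,
            "mean_0": group0.mean(),
            "mean_1": group1.mean(),
            "t": result.statistic,
            "p_value": result.pvalue,
            "cohens_d": (group0.mean() - group1.mean()) / spread,
            "reject_h0": result.pvalue < alpha,
        })
    return pd.DataFrame(rows).set_index("feature")


# Synthetic stand-in data (hypothetical values, only the column names mirror the notebook)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "is_hymn": np.repeat([0, 1], 100),
    "word_count": np.concatenate([rng.normal(600, 200, 100), rng.normal(300, 100, 100)]),
    "whole_polarity": np.concatenate([rng.normal(0.18, 0.2, 100), rng.normal(0.34, 0.2, 100)]),
})
summary = compare_groups(toy, "is_hymn", ["word_count", "whole_polarity"])
print(summary.round(3))
```

Reporting an effect size next to each p-value helps separate "unlikely to be chance" from "large enough to matter", which is the distinction that feature selection depends on.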