<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - ANOVA Preparation - INRS

## What is `ANOVA`?

ANOVA (Analysis of Variance) is a statistical method for comparing the means of three or more groups to determine whether any of them differ to a statistically significant degree. It does this by comparing the variation between the groups with the variation within them, which indicates whether the observed differences are likely to reflect real group differences or just random chance.

Please refer to:
- [Analysis of Variance](https://en.wikipedia.org/wiki/Analysis_of_variance)
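
The comparison of between-group and within-group variation can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `one_way_anova_f` and the three groups of counts are hypothetical and are not part of the study data.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA over a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical counts of one key word in texts from three groups of speakers
group_a = [1, 2, 3]
group_b = [2, 3, 4]
group_c = [5, 6, 7]
f_stat = one_way_anova_f([group_a, group_b, group_c])
print(f_stat)
```

A large F value means that the group means differ by more than the spread inside the groups would suggest. Turning F into a p-value additionally requires the F distribution, which this sketch leaves out.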

## Required Python packages

- pandas
- yake

## Importing the required libraries

In [1]:
import pandas as pd
import re
import shutil
import yake

## Extracting key words from non-lemmatised text

### Importing the raw data into a DataFrame

#### `Republican + Democratic + Independent` data set

In [2]:
df_debates_turns = pd.read_json('cl_st1_inrs_tc/debates_turns.jsonl', lines=True)

In [3]:
df_debates_turns.dtypes

Title           object
Debate          object
Date             int64
Participants    object
Moderators      object
Speaker         object
Text            object
dtype: object

When a DataFrame with a `datetime64[ns]` column is exported to `JSONL`, the dates are written by default as UNIX timestamps (milliseconds since the epoch). When the JSONL file is imported back into a DataFrame, these timestamps are read as integers, which is why `Date` appears as `int64` above. To convert the integers back to `datetime64[ns]`, use the `pd.to_datetime()` function with the `unit` parameter set to `'ms'` (milliseconds).
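
What "milliseconds since the epoch" means can be illustrated with the standard library alone. This is a minimal sketch of the arithmetic, not of how pandas performs the conversion internally; the helper names `to_epoch_ms` and `from_epoch_ms` are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # the UNIX epoch, treated here as a naive datetime

def to_epoch_ms(moment):
    """Convert a naive datetime to whole milliseconds since the epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)

def from_epoch_ms(milliseconds):
    """Convert milliseconds since the epoch back to a naive datetime."""
    return EPOCH + timedelta(milliseconds=milliseconds)

debate_date = datetime(2020, 9, 29)      # date of the first debate in the data set
as_integer = to_epoch_ms(debate_date)    # what ends up in the JSONL file
restored = from_epoch_ms(as_integer)     # what unit='ms' recovers
print(as_integer, restored)
```

Dates before 1970 are stored as negative integers and convert back in the same way.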

In [4]:
df_debates_turns['Date'] = pd.to_datetime(df_debates_turns['Date'], unit='ms')

In [5]:
df_debates_turns.dtypes

Title                   object
Debate                  object
Date            datetime64[ns]
Participants            object
Moderators              object
Speaker                 object
Text                    object
dtype: object

In [6]:
df_debates_turns.head(5)

Unnamed: 0,Title,Debate,Date,Participants,Moderators,Speaker,Text
0,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"Thank you very much, Chris. I will tell you ve..."
1,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,"Well, first of all, thank you for doing this a..."
2,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,The American people have a right to have a say...
3,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,There aren’t a hundred million people with pre...
4,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"During that period of time, during that period..."


##### Checking the number of texts

In [7]:
df_debates_turns.shape

(3478, 7)

##### Inspecting a few texts

In [8]:
inspected_row = 0
print('Speaker:' + df_debates_turns.loc[inspected_row, 'Speaker'])
print('Text:' + df_debates_turns.loc[inspected_row, 'Text'])

Speaker:TRUMP
Text:Thank you very much, Chris. I will tell you very simply. We won the election. Elections have consequences. We have the Senate, we have the White House, and we have a phenomenal nominee respected by all. Top, top academic, good in every way. Good in every way. In fact, some of her biggest endorsers are very liberal people from Notre Dame and other places. So I think she’s going to be fantastic. We have plenty of time. Even if we did it after the election itself. I have a lot of time after the election, as you know. So I think that she will be outstanding. She’s going to be as good as anybody that has served on that court. We really feel that. We have a professor at Notre Dame, highly respected by all, said she’s the single greatest student he’s ever had. He’s been a professor for a long time at a great school. And we won the election and therefore we have the right to choose her, and very few people knowingly would say otherwise. And by the way, the Democrats, they wo

### Creating a test data set

In [9]:
df_debates_turns_test = df_debates_turns.head(10)

In [10]:
df_debates_turns_test

Unnamed: 0,Title,Debate,Date,Participants,Moderators,Speaker,Text
0,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"Thank you very much, Chris. I will tell you ve..."
1,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,"Well, first of all, thank you for doing this a..."
2,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,The American people have a right to have a say...
3,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,There aren’t a hundred million people with pre...
4,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"During that period of time, during that period..."
5,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"Well, you’re certainly going to socialist. You..."
6,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,"Number one, he knows what I proposed. What I p..."
7,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,That’s not what you’ve said and it’s not what ...
8,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,Your party doesn’t say it. Your party wants to...
9,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,"The party is me. Right now, I am the Democrati..."


### Combining all texts into a single text

In [11]:
corpus_test = ' '.join(df_debates_turns_test['Text'].tolist())

In [12]:
type(corpus_test)

str

In [13]:
corpus_test

'Thank you very much, Chris. I will tell you very simply. We won the election. Elections have consequences. We have the Senate, we have the White House, and we have a phenomenal nominee respected by all. Top, top academic, good in every way. Good in every way. In fact, some of her biggest endorsers are very liberal people from Notre Dame and other places. So I think she’s going to be fantastic. We have plenty of time. Even if we did it after the election itself. I have a lot of time after the election, as you know. So I think that she will be outstanding. She’s going to be as good as anybody that has served on that court. We really feel that. We have a professor at Notre Dame, highly respected by all, said she’s the single greatest student he’s ever had. He’s been a professor for a long time at a great school. And we won the election and therefore we have the right to choose her, and very few people knowingly would say otherwise. And by the way, the Democrats, they wouldn’t even think 

### Initialising the `YAKE!` extractor

In [14]:
language = 'en'
max_ngram_size = 1
deduplication_threshold = 0.9
deduplication_algo = 'seqm' # for 'Sequence Matcher', the default
#deduplication_algo = 'levs' # for Levenshtein algorithm
#deduplication_algo = 'jaccard' # for Jaccard algorithm
#windowSize = 2 # the number of words to the left and right of a given word that are considered when calculating the word’s context - the default is 2
windowSize = 1
numOfKeywords = 20
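
The window size described in the comment above can be pictured with a short sketch. It only illustrates which neighbouring words fall inside a window of a given size. It is not YAKE!'s own implementation, and the function name `context_windows` is hypothetical.

```python
def context_windows(tokens, window_size):
    """Pair each token with the tokens up to `window_size` positions away."""
    windows = []
    for i, token in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]      # neighbours to the left
        right = tokens[i + 1:i + 1 + window_size]     # neighbours to the right
        windows.append((token, left, right))
    return windows

tokens = 'we won the election'.split()
for token, left, right in context_windows(tokens, window_size=1):
    print(token, left, right)
```

With a window size of 1, only the immediately adjacent words count as context. A larger value widens the context on both sides.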

In [15]:
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, dedupFunc=deduplication_algo, windowsSize=windowSize, top=numOfKeywords, features=None)

### Extracting key words

In [16]:
key_words = custom_kw_extractor.extract_keywords(corpus_test)

In [17]:
type(key_words)

list

### Displaying the key words

In [18]:
df_key_words = pd.DataFrame(key_words)
df_key_words.columns = ['Key Word', 'Score']

In [19]:
df_key_words

Unnamed: 0,Key Word,Score
0,election,0.032157
1,President,0.03258
2,people,0.033653
3,’re,0.057993
4,Care,0.064828
5,time,0.077241
6,elected,0.080974
7,Affordable,0.081247
8,Senate,0.085748
9,Act,0.08584


### Excluding words that are of no consequence to the analysis

In [20]:
# Row 3 is the contraction fragment ’re; row 14 is beyond the rows displayed above
df_key_words = df_key_words.drop(index=[3, 14])
df_key_words = df_key_words.reset_index(drop=True)
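
Dropping rows by position depends on the order in which the extractor happens to return the key words. As a hedged alternative, the unwanted entries can be filtered by value before the DataFrame is built. The sketch below assumes the `(key word, score)` pairs seen in the output above; the names `key_words_sample` and `unwanted` are hypothetical.

```python
# A few (key word, score) pairs copied from the output displayed above
key_words_sample = [
    ('election', 0.032157),
    ('President', 0.03258),
    ('people', 0.033653),
    ('’re', 0.057993),
    ('Care', 0.064828),
]

# Assumed exclusion list; extend it with any other entries to discard
unwanted = {'’re'}

# Keep every pair whose key word is not in the exclusion list
filtered = [(word, score) for word, score in key_words_sample
            if word not in unwanted]
print(filtered)
```

Filtering by value keeps working if the input text or the extractor settings change and the unwanted word moves to a different row.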

In [21]:
df_key_words

Unnamed: 0,Key Word,Score
0,election,0.032157
1,President,0.03258
2,people,0.033653
3,Care,0.064828
4,time,0.077241
5,elected,0.080974
6,Affordable,0.081247
7,Senate,0.085748
8,Act,0.08584
9,court,0.086206


In [22]:
df_key_words.index = df_key_words.index + 1
df_key_words['Variable ID'] = 'v' + df_key_words.index.astype(str).str.zfill(6)

In [23]:
df_key_words

Unnamed: 0,Key Word,Score,Variable ID
1,election,0.032157,v000001
2,President,0.03258,v000002
3,people,0.033653,v000003
4,Care,0.064828,v000004
5,time,0.077241,v000005
6,elected,0.080974,v000006
7,Affordable,0.081247,v000007
8,Senate,0.085748,v000008
9,Act,0.08584,v000009
10,court,0.086206,v000010


### Exporting to a file

In [24]:
df_key_words[['Variable ID', 'Key Word']].to_csv('selectedwords_nolem', sep=' ', index=False, header=False, encoding='utf-8', lineterminator='\n')

In [25]:
shutil.copy('selectedwords_nolem', 'var_index_nolem.txt')

'var_index_nolem.txt'
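
The layout of the exported file (one `Variable ID` and one key word per line, separated by a space) can be reproduced and read back with the standard library. This is a sketch under stated assumptions: the three key words are taken from the table above, while the temporary file name `selectedwords_sample` and the variable names are hypothetical.

```python
import os
import tempfile

key_words_sample = ['election', 'President', 'people']

# Number the key words from 1 and pad each number to six digits
lines = ['v' + str(position).zfill(6) + ' ' + word
         for position, word in enumerate(key_words_sample, start=1)]

# Write the lines to a temporary file and read the index back as a dictionary
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'selectedwords_sample')
    with open(path, 'w', encoding='utf-8', newline='\n') as handle:
        handle.write('\n'.join(lines) + '\n')
    with open(path, encoding='utf-8') as handle:
        var_index = dict(line.rstrip('\n').split(' ', 1) for line in handle)

print(var_index)
```

Splitting on the first space is enough here because `max_ngram_size = 1` yields single-word key words. Multi-word key words would need a different separator or quoting.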

## Extracting key words from lemmatised text

### Importing the raw data into a DataFrame

#### `Republican + Democratic + Independent` data set

In [26]:
df_debates_turns_lem = pd.read_csv('tweets/tokens.txt', sep='|', header=None)
df_debates_turns_lem.columns = ['Text ID', 'Text']

In [27]:
df_debates_turns_lem

Unnamed: 0,Text ID,Text
0,t000000,c:thank tell win election election consequence...
1,t000001,c:first thank do look president
2,t000002,c:american people right say nominee say occur ...
3,t000003,c:aren people pre exist condition say concern ...
4,t000004,c:period time period time opening elect year e...
...,...,...
3472,t003473,c:say issue stay campaign issue long persist t...
3473,t003474,c:go resolution commit president states suppor...
3474,t003475,c:testimony fifty eight
3475,t003476,c:say ve serve country fourteen year serve war...


In [28]:
# Remove 'c:' prefix
def remove_c_prefix(input_line):
    processed_line = re.sub(r'^c:', '', input_line)
    return processed_line

df_debates_turns_lem['Text'] = df_debates_turns_lem['Text'].apply(remove_c_prefix)
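
The effect of the `^c:` pattern can be checked on single strings. The first sample line is taken from the data set shown above; the second is a hypothetical line added to show that a `c:` appearing elsewhere in the line is left alone.

```python
import re

def remove_c_prefix(input_line):
    """Strip a leading 'c:' marker and leave the rest of the line untouched."""
    return re.sub(r'^c:', '', input_line)

sample = 'c:first thank do look president'
cleaned = remove_c_prefix(sample)
print(cleaned)

# Hypothetical line: 'c:' occurs inside the text, not at the start
untouched = remove_c_prefix('topic:tax')
print(untouched)
```

The `^` anchor restricts the match to the start of the string, so only the prefix is removed. On Python 3.9 or later, `str.removeprefix('c:')` should give the same result without a regular expression.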

In [29]:
df_debates_turns_lem.dtypes

Text ID    object
Text       object
dtype: object

In [30]:
df_debates_turns_lem.head(5)

Unnamed: 0,Text ID,Text
0,t000000,thank tell win election election consequence p...
1,t000001,first thank do look president
2,t000002,american people right say nominee say occur vo...
3,t000003,aren people pre exist condition say concern pe...
4,t000004,period time period time opening elect year ele...


##### Checking the number of texts

In [31]:
df_debates_turns_lem.shape

(3477, 2)

##### Inspecting a few texts

In [32]:
inspected_row = 0
print('Text:' + df_debates_turns_lem.loc[inspected_row, 'Text'])

Text:thank tell win election election consequence phenomenal nominee respect top top academic good way good way fact big endorser liberal people other place think go fantastic plenty time do election lot time election know think outstanding she go good anybody serve court feel professor respect say single great student professor long time great school win election right choose few people say way democrats wouldn think do only difference try do way give problem didn election stop happen reverse happen reverse win election right do 


### Creating a test data set

In [33]:
df_debates_turns_lem_test = df_debates_turns_lem.head(10)

In [34]:
df_debates_turns_lem_test

Unnamed: 0,Text ID,Text
0,t000000,thank tell win election election consequence p...
1,t000001,first thank do look president
2,t000002,american people right say nominee say occur vo...
3,t000003,aren people pre exist condition say concern pe...
4,t000004,period time period time opening elect year ele...
5,t000005,re go socialist re go socialist medicine
6,t000006,number know propose propose expand increase do...
7,t000007,ve say party say
8,t000008,party doesn say party want go socialist medici...
9,t000009,party democratic


### Combining all texts into a single text

In [35]:
corpus_lem_test = ' '.join(df_debates_turns_lem_test['Text'].tolist())

In [36]:
type(corpus_lem_test)

str

In [37]:
corpus_lem_test

'thank tell win election election consequence phenomenal nominee respect top top academic good way good way fact big endorser liberal people other place think go fantastic plenty time do election lot time election know think outstanding she go good anybody serve court feel professor respect say single great student professor long time great school win election right choose few people say way democrats wouldn think do only difference try do way give problem didn election stop happen reverse happen reverse win election right do  first thank do look president  american people right say nominee say occur vote states vote states re go get chance re middle election election start ten thousand people vote thing happen wait wait see outcome election only way american people get express view elect elect stake make clear want get rid run run govern try get rid strip people health insurance go court justice oppose justice seem fine person write go bench right think constitutional other thing cour

### Initialising the `YAKE!` extractor

In [38]:
language = 'en'
max_ngram_size = 1
deduplication_threshold = 0.9
deduplication_algo = 'seqm' # for 'Sequence Matcher', the default
#deduplication_algo = 'levs' # for Levenshtein algorithm
#deduplication_algo = 'jaccard' # for Jaccard algorithm
#windowSize = 2 # the number of words to the left and right of a given word that are considered when calculating the word’s context - the default is 2
windowSize = 1
numOfKeywords = 20

In [39]:
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, dedupFunc=deduplication_algo, windowsSize=windowSize, top=numOfKeywords, features=None)

### Extracting key words

In [40]:
key_words_lem = custom_kw_extractor.extract_keywords(corpus_lem_test)

In [41]:
type(key_words_lem)

list

### Displaying the key words

In [42]:
df_key_words_lem = pd.DataFrame(key_words_lem)
df_key_words_lem.columns = ['Key Word', 'Score']

In [43]:
df_key_words_lem

Unnamed: 0,Key Word,Score
0,election,0.009477
1,people,0.01126
2,elect,0.012225
3,year,0.013663
4,exist,0.014405
5,win,0.016393
6,time,0.01782
7,pre,0.018489
8,condition,0.020675
9,socialist,0.020675


In [44]:
df_key_words_lem.index = df_key_words_lem.index + 1
df_key_words_lem['Variable ID'] = 'v' + df_key_words_lem.index.astype(str).str.zfill(6)

### Exporting to a file

In [45]:
df_key_words_lem[['Variable ID', 'Key Word']].to_csv('selectedwords_lem', sep=' ', index=False, header=False, encoding='utf-8', lineterminator='\n')

In [46]:
shutil.copy('selectedwords_lem', 'var_index_lem.txt')

'var_index_lem.txt'