<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - ANOVA Preparation - INRS

## What is `ANOVA`?

ANOVA (Analysis of Variance) is a statistical method that compares the means of three or more groups to determine whether at least one of them differs significantly from the others. It does so by comparing the variation between the groups with the variation within each group, which indicates whether the observed differences are likely to reflect real group effects or only random chance.

Please refer to:
- [Analysis of Variance](https://en.wikipedia.org/wiki/Analysis_of_variance)
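
As an illustration of the idea, the sketch below computes the F statistic of a one-way ANOVA by hand using only pandas. The toy data, the column names `group` and `score`, and the helper `one_way_anova_f` are hypothetical and are not part of this study; the sketch also assumes there are no missing values.

```python
import pandas as pd

# Hypothetical toy data: one numeric score per observation, labelled by group.
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'score': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

def one_way_anova_f(df, group_col, value_col):
    """Return the F statistic and its degrees of freedom for a one-way ANOVA."""
    grand_mean = df[value_col].mean()
    grouped = df.groupby(group_col)[value_col]
    # Between-group sum of squares: group size times the squared distance
    # of each group mean from the grand mean.
    ss_between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    # Within-group sum of squares: squared distance of each value
    # from the mean of its own group.
    ss_within = ((df[value_col] - grouped.transform('mean')) ** 2).sum()
    df_between = grouped.ngroups - 1
    df_within = len(df) - grouped.ngroups
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df_between, df_within = one_way_anova_f(toy, 'group', 'score')
print(f_stat, df_between, df_within)
```

A large F value means that the group means are far apart relative to the spread inside each group. Turning F into a p-value requires the F distribution, which a statistics package would normally provide.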

## Required Python packages

- pandas

## Importing the required libraries

In [1]:
import pandas as pd

## Defining input variables

In [2]:
input_directory = 'cl_st1_inrs_tc'
output_directory = 'cl_st1_ph4_inrs'

## Preparing data for ANOVA

### Importing the raw data into a DataFrame

#### `Republican + Democratic + Independent` data set

In [3]:
df_debates_turns = pd.read_json(f"{input_directory}/debates_turns.jsonl", lines=True)

In [4]:
df_debates_turns['Date'] = pd.to_datetime(df_debates_turns['Date'], unit='ms')

In [5]:
df_debates_turns.dtypes

Title                   object
Debate                  object
Date            datetime64[ns]
Participants            object
Moderators              object
Speaker                 object
Text                    object
dtype: object

In [6]:
df_debates_turns.head(5)

| | Title | Debate | Date | Participants | Moderators | Speaker | Text |
|---|---|---|---|---|---|---|---|
| 0 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | Thank you very much, Chris. I will tell you ve... |
| 1 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | BIDEN | Well, first of all, thank you for doing this a... |
| 2 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | BIDEN | The American people have a right to have a say... |
| 3 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | There aren’t a hundred million people with pre... |
| 4 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | During that period of time, during that period... |


##### Checking the number of texts

In [7]:
df_debates_turns.shape

(3478, 7)

## Adding the column `Decade` from the column `Date`

The formula extracts the year from each date, then divides the year by 10, discards the remainder (using integer division), and finally multiplies the result back by 10 to get the start year of the decade.

In [8]:
df_debates_turns['Decade'] = (df_debates_turns['Date'].dt.year // 10) * 10
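
The arithmetic can be checked on a few values. The sketch below applies the same integer-division formula to a small series of sample dates; the helper `decade_of` and the series `dates` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def decade_of(year):
    """Return the first year of the decade that contains `year`."""
    # Integer division drops the last digit; multiplying by 10 restores the scale.
    return (year // 10) * 10

# Sample dates used only to illustrate the formula.
dates = pd.Series(pd.to_datetime(['1960-10-21', '1988-09-25', '2020-09-29']))

# Same expression as the one applied to the column `Date`.
decades = (dates.dt.year // 10) * 10
print(decades.tolist())  # [1960, 1980, 2020]
```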

In [9]:
df_debates_turns

| | Title | Debate | Date | Participants | Moderators | Speaker | Text | Decade |
|---|---|---|---|---|---|---|---|---|
| 0 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | Thank you very much, Chris. I will tell you ve... | 2020 |
| 1 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | BIDEN | Well, first of all, thank you for doing this a... | 2020 |
| 2 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | BIDEN | The American people have a right to have a say... | 2020 |
| 3 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | There aren’t a hundred million people with pre... | 2020 |
| 4 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | During that period of time, during that period... | 2020 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3473 | October 21, 1960 Debate Transcript | The Fourth Kennedy-Nixon Presidential Debate | 1960-10-21 | Kennedy-Nixon | QUINCY HOWE | MR. NIXON | I would say that the issue will stay with us a... | 1960 |
| 3474 | October 21, 1960 Debate Transcript | The Fourth Kennedy-Nixon Presidential Debate | 1960-10-21 | Kennedy-Nixon | QUINCY HOWE | MR. KENNEDY | Well, Mr. Nixon, to go back to 1955. The resol... | 1960 |
| 3475 | October 21, 1960 Debate Transcript | The Fourth Kennedy-Nixon Presidential Debate | 1960-10-21 | Kennedy-Nixon | QUINCY HOWE | MR. KENNEDY | And that’s the testimony of uh – General Twini... | 1960 |
| 3476 | October 21, 1960 Debate Transcript | The Fourth Kennedy-Nixon Presidential Debate | 1960-10-21 | Kennedy-Nixon | QUINCY HOWE | MR. KENNEDY | I uh – said that I’ve served this country for ... | 1960 |


## Defining the dictionary that maps each speaker to a political party

In [10]:
candidates_parties = {
    'ADMIRAL STOCKDALE': 'independent', 
    'ANDERSON': 'independent', 
    'BENTSEN': 'Democratic', 
    'BIDEN': 'Democratic', 
    'BUSH': 'Republican', 
    'CHENEY': 'Republican', 
    'CLINTON': 'Democratic', 
    'DOLE': 'Republican', 
    'DUKAKIS': 'Democratic', 
    'EDWARDS': 'Democratic', 
    'FERRARO': 'Democratic', 
    'GORE': 'Democratic', 
    'GOV. RONALD REAGAN': 'Republican', 
    'GOVERNOR CLINTON': 'Democratic', 
    'HARRIS': 'Democratic', 
    'KAINE': 'Democratic', 
    'KEMP': 'Republican', 
    'KERRY': 'Democratic', 
    'LIEBERMAN': 'Democratic', 
    'MCCAIN': 'Republican', 
    'MR. CARTER': 'Democratic', 
    'MR. FORD': 'Republican', 
    'Mr. KENNEDY': 'Democratic', 
    'MR. KENNEDY': 'Democratic', 
    'MR. MONDALE': 'Democratic', 
    'Mr. NIXON': 'Republican', 
    'MR. NIXON': 'Republican', 
    'MR. REAGAN': 'Republican', 
    'MR.FORD': 'Republican', 
    'OBAMA': 'Democratic', 
    'PALIN': 'Republican', 
    'PENCE': 'Republican', 
    'PEROT': 'independent', 
    'PRESIDENT BUSH': 'Republican', 
    'PRESIDENT GEORGE BUSH': 'Republican',
    'QUAYLE': 'Republican', 
    'REAGAN': 'Republican', 
    'REP. JOHN B. ANDERSON': 'independent', 
    'RYAN': 'Republican', 
    'ROMNEHY': 'Republican', 
    'ROMNEY': 'Republican', 
    'ROSS PEROT': 'independent', 
    'SENATOR GORE': 'Democratic', 
    'SENATOR KENNEDY': 'Democratic', 
    'STOCKDALE': 'independent', 
    'THE PRESIDENT': 'Republican', 
    'The President': 'Republican', 
    'TRUMP': 'Republican', 
    'VICE PRESIDENT QUAYLE': 'Republican'
}

## Creating the column `Party` based on the dictionary `candidates_parties`

In [11]:
df_debates_turns['Party'] = df_debates_turns['Speaker'].map(candidates_parties)
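
When `Series.map` receives a dictionary, any value that is not one of its keys becomes `NaN` instead of raising an error. The sketch below shows this on a miniature example and lists the labels that failed to match; the series `speakers`, the dictionary `parties`, and the label `WALLACE` are hypothetical and stand in for the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the speaker column and the party dictionary.
speakers = pd.Series(['TRUMP', 'BIDEN', 'WALLACE'], name='Speaker')
parties = {'TRUMP': 'Republican', 'BIDEN': 'Democratic'}

# Labels that are not keys of the dictionary are mapped to NaN.
mapped = speakers.map(parties)

# Listing the labels that did not match makes gaps in the dictionary easy to spot.
unmatched = sorted(speakers[mapped.isna()].unique())
print(unmatched)  # ['WALLACE']
```

Because the match is exact, variants in case or spacing need their own keys in the dictionary.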

In [12]:
df_debates_turns

| | Title | Debate | Date | Participants | Moderators | Speaker | Text | Decade | Party |
|---|---|---|---|---|---|---|---|---|---|
| 0 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | Thank you very much, Chris. I will tell you ve... | 2020 | Republican |
| 1 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | BIDEN | Well, first of all, thank you for doing this a... | 2020 | Democratic |
| 2 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | BIDEN | The American people have a right to have a say... | 2020 | Democratic |
| 3 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | There aren’t a hundred million people with pre... | 2020 | Republican |
| 4 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | During that period of time, during that period... | 2020 | Republican |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3473 | October 21, 1960 Debate Transcript | The Fourth Kennedy-Nixon Presidential Debate | 1960-10-21 | Kennedy-Nixon | QUINCY HOWE | MR. NIXON | I would say that the issue will stay with us a... | 1960 | Republican |
| 3474 | October 21, 1960 Debate Transcript | The Fourth Kennedy-Nixon Presidential Debate | 1960-10-21 | Kennedy-Nixon | QUINCY HOWE | MR. KENNEDY | Well, Mr. Nixon, to go back to 1955. The resol... | 1960 | Democratic |
| 3475 | October 21, 1960 Debate Transcript | The Fourth Kennedy-Nixon Presidential Debate | 1960-10-21 | Kennedy-Nixon | QUINCY HOWE | MR. KENNEDY | And that’s the testimony of uh – General Twini... | 1960 | Democratic |
| 3476 | October 21, 1960 Debate Transcript | The Fourth Kennedy-Nixon Presidential Debate | 1960-10-21 | Kennedy-Nixon | QUINCY HOWE | MR. KENNEDY | I uh – said that I’ve served this country for ... | 1960 | Democratic |


In [13]:
df_debates_turns.dtypes

Title                   object
Debate                  object
Date            datetime64[ns]
Participants            object
Moderators              object
Speaker                 object
Text                    object
Decade                   int32
Party                   object
dtype: object

### Checking if there are any missing values in the column `Party`

In [14]:
print(df_debates_turns['Party'].isnull().sum())

0
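
Note that `isnull()` counts only `NaN` and `None`; a speaker mapped to an empty string such as `''` is not reported as missing. A minimal sketch of a stricter check follows, assuming a hypothetical series `party` that stands in for the column `Party`.

```python
import pandas as pd

# Hypothetical values standing in for the column `Party`.
party = pd.Series(['Republican', '', None, 'Democratic', '  '], name='Party')

# True where the value is NaN or None.
missing = party.isnull()

# True where the value is present but empty or made only of whitespace.
blank = party.fillna('').str.strip().eq('') & ~missing

print(int(missing.sum()), int(blank.sum()))  # 1 2
```

Running both counts on the real column would confirm that every turn carries a usable party label before the ANOVA is run.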


## Exporting to a file

In [None]:
export_columns = ['Title', 'Debate', 'Date', 'Decade', 'Participants', 'Moderators', 'Speaker', 'Party', 'Text']
df_debates_turns[export_columns].to_json(
    f'{output_directory}/debates_turns_republican.jsonl',
    orient='records',
    lines=True,
)
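
`to_json` does not create the target directory, so the export fails if `output_directory` does not exist yet. The sketch below shows one way to create the directory first and then read the file back as a sanity check. It writes to a temporary location, and the names `sample`, `example_output`, and `sample.jsonl` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame standing in for the real data.
sample = pd.DataFrame({'Speaker': ['TRUMP', 'BIDEN'], 'Decade': [2020, 2020]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'example_output'
    # Create the directory (and any missing parents) before writing.
    out_dir.mkdir(parents=True, exist_ok=True)

    out_file = out_dir / 'sample.jsonl'
    # One JSON object per line, as in the export above.
    sample.to_json(out_file, orient='records', lines=True)

    # Reading the file back confirms that rows and columns survived the export.
    restored = pd.read_json(out_file, lines=True)

print(restored)
```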