<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 3 - INRS

## Prerequisites

Make sure the prerequisites in [CL_LMDA_prerequisites](https://github.com/laelgelc/laelgelc/blob/main/CL_LMDA_prerequisites.ipynb) are satisfied.

## Dataset

Please download the following dataset (Right-click on the link and choose `Save link as` to download the corresponding file):
- [CL_St1_Ph2_INRS.tar.gz](https://github.com/laelgelc/cl_st1_inrs/blob/main/CL_St1_Ph2_INRS.tar.gz)

## Importing the required libraries

In [1]:
import pandas as pd
import demoji
import re
import os
from collections import Counter

## Data wrangling

### Importing the tweet raw data into a dataframe

#### Alternative 1 - Importing data as `JSONL`

In [2]:
df_tweets_raw_data = pd.read_json('cl_st1_inrs_tc/debates_turns.jsonl', lines=True)

In [3]:
df_tweets_raw_data.dtypes

Title           object
Debate          object
Date             int64
Participants    object
Moderators      object
Speaker         object
Text            object
dtype: object

When a DataFrame with a `datetime64[ns]` column is exported to `JSONL`, the dates are converted to UNIX timestamps (milliseconds since the epoch). When you import the JSONL file back into a DataFrame, these timestamps are read as integers. To convert these integers back to `datetime64[ns]` format, you can use the `pd.to_datetime()` function with the `unit` parameter set to 'ms' (milliseconds).

In [4]:
df_tweets_raw_data['Date'] = pd.to_datetime(df_tweets_raw_data['Date'], unit='ms')

In [5]:
df_tweets_raw_data.dtypes

Title                   object
Debate                  object
Date            datetime64[ns]
Participants            object
Moderators              object
Speaker                 object
Text                    object
dtype: object
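
The round trip described above can be reproduced in isolation. The following is a minimal sketch, assuming a two-row illustrative frame and an in-memory buffer instead of the real dataset file. `date_format='epoch'` and `date_unit='ms'` are spelled out explicitly, and `convert_dates=False` is passed so that the integers stay visible regardless of any automatic date detection in `pd.read_json()`.

```python
import io

import pandas as pd

# Illustrative two-row frame; the values are placeholders, not the real dataset
original = pd.DataFrame({
    'Speaker': ['TRUMP', 'MR. NIXON'],
    'Date': pd.to_datetime(['2020-09-29', '1960-10-21']),
})

# Export to JSONL in memory: dates become milliseconds since the epoch
jsonl_text = original.to_json(orient='records', lines=True, date_format='epoch', date_unit='ms')

# Import without automatic date parsing: 'Date' comes back as integers
loaded = pd.read_json(io.StringIO(jsonl_text), lines=True, convert_dates=False)
date_is_integer = pd.api.types.is_integer_dtype(loaded['Date'])

# Convert the integers back to datetimes
loaded['Date'] = pd.to_datetime(loaded['Date'], unit='ms')
restored_dates = loaded['Date'].dt.strftime('%Y-%m-%d').tolist()

print(date_is_integer)
print(restored_dates)
```

Dates before 1970 are stored as negative integers, and `pd.to_datetime()` handles them in the same way.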

In [6]:
df_tweets_raw_data.head(5)

Unnamed: 0,Title,Debate,Date,Participants,Moderators,Speaker,Text
0,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"Thank you very much, Chris. I will tell you ve..."
1,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,"Well, first of all, thank you for doing this a..."
2,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,The American people have a right to have a say...
3,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,There aren’t a hundred million people with pre...
4,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"During that period of time, during that period..."


#### Alternative 2 - Importing data as `TSV`

```
df_tweets_raw_data = pd.read_csv('cl_st1_inrs_tc/debates_turns.tsv', sep='\t')
```

```
df_tweets_raw_data.dtypes
```

The `Date` column is imported with the `object` (string) data type, so it should be converted to `datetime64[ns]`.

```
df_tweets_raw_data['Date'] = pd.to_datetime(df_tweets_raw_data['Date'])
#df_tweets_raw_data['Date'] = df_tweets_raw_data['Date'].astype('datetime64[ns]') # Alternative command
```

```
df_tweets_raw_data.dtypes
```

```
df_tweets_raw_data.head(5)
```

In [7]:
df_tweets_raw_data.shape

(3478, 7)

#### Inspecting a few tweets

In [8]:
inspected_row = 0
print('Speaker:' + df_tweets_raw_data.loc[inspected_row, 'Speaker'])
print('Text:' + df_tweets_raw_data.loc[inspected_row, 'Text'])

Speaker:TRUMP
Text:Thank you very much, Chris. I will tell you very simply. We won the election. Elections have consequences. We have the Senate, we have the White House, and we have a phenomenal nominee respected by all. Top, top academic, good in every way. Good in every way. In fact, some of her biggest endorsers are very liberal people from Notre Dame and other places. So I think she’s going to be fantastic. We have plenty of time. Even if we did it after the election itself. I have a lot of time after the election, as you know. So I think that she will be outstanding. She’s going to be as good as anybody that has served on that court. We really feel that. We have a professor at Notre Dame, highly respected by all, said she’s the single greatest student he’s ever had. He’s been a professor for a long time at a great school. And we won the election and therefore we have the right to choose her, and very few people knowingly would say otherwise. And by the way, the Democrats, they wo

## Sampling the raw data according to filtering expressions

In [9]:
# Bypassing the screening by filtering expressions because it is not relevant in this case;
# a copy is taken so that later steps do not modify the raw dataframe
df_tweets_filtered = df_tweets_raw_data.copy()

In [10]:
df_tweets_filtered

Unnamed: 0,Title,Debate,Date,Participants,Moderators,Speaker,Text
0,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"Thank you very much, Chris. I will tell you ve..."
1,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,"Well, first of all, thank you for doing this a..."
2,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,The American people have a right to have a say...
3,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,There aren’t a hundred million people with pre...
4,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"During that period of time, during that period..."
...,...,...,...,...,...,...,...
3473,"October 21, 1960 Debate Transcript",The Fourth Kennedy-Nixon Presidential Debate,1960-10-21,Kennedy-Nixon,QUINCY HOWE,MR. NIXON,I would say that the issue will stay with us a...
3474,"October 21, 1960 Debate Transcript",The Fourth Kennedy-Nixon Presidential Debate,1960-10-21,Kennedy-Nixon,QUINCY HOWE,MR. KENNEDY,"Well, Mr. Nixon, to go back to 1955. The resol..."
3475,"October 21, 1960 Debate Transcript",The Fourth Kennedy-Nixon Presidential Debate,1960-10-21,Kennedy-Nixon,QUINCY HOWE,MR. KENNEDY,And that’s the testimony of uh – General Twini...
3476,"October 21, 1960 Debate Transcript",The Fourth Kennedy-Nixon Presidential Debate,1960-10-21,Kennedy-Nixon,QUINCY HOWE,MR. KENNEDY,I uh – said that I’ve served this country for ...


### Exporting the filtered data into a file for inspection

In [11]:
df_tweets_filtered.to_csv('tweets_emojified.tsv', sep='\t', index=False)

In [12]:
df_tweets_filtered.to_json('tweets_emojified.jsonl', orient='records', lines=True)

## Replacing emojis

### Demojifying the column `Text`

In [13]:
# Defining a function to demojify a string
def demojify_line(input_line):
    demojified_line = demoji.replace_with_desc(input_line, sep='<em>')
    return demojified_line

df_tweets_filtered['Text'] = df_tweets_filtered['Text'].apply(demojify_line)

#### Exporting the filtered data into a file for inspection

In [14]:
df_tweets_filtered.to_csv('tweets_demojified1.tsv', sep='\t', index=False)

### Separating the demojified strings with spaces

In [15]:
# Defining a function to separate the demojified strings with spaces
def preprocess_line(input_line):
    # Add a space before each delimiter '<em>' that is not already preceded by one
    preprocessed_line = re.sub(r'(?<! )<em>', ' <em>', input_line)
    # Add a space after each delimiter '<em>' that is not already followed by one
    preprocessed_line = re.sub(r'<em>(?! )', '<em> ', preprocessed_line)
    return preprocessed_line

# Separating the demojified strings with spaces
df_tweets_filtered['Text'] = df_tweets_filtered['Text'].apply(preprocess_line)

#### Exporting the filtered data into a file for inspection

In [16]:
df_tweets_filtered.to_csv('tweets_demojified2.tsv', sep='\t', index=False)

### Formatting the demojified strings

In [17]:
# Defining a function to format the demojified string
def format_demojified_string(input_line):
    # Defining a function to format the demojified string using RegEx
    def process_demojified_string(s):
        # Lowercase the string
        s = s.lower()
        # Replace spaces and colons followed by a space with underscores
        s = re.sub(r'(: )| ', '_', s)
        # Add the appropriate prefixes and suffixes
        s = f'EMOJI{s}e'
        return s

    # Use RegEx to find and process each demojified string
    processed_line = re.sub(r'<em>(.*?)<em>', lambda match: process_demojified_string(match.group(1)), input_line)
    return processed_line

# Formatting the demojified strings
df_tweets_filtered['Text'] = df_tweets_filtered['Text'].apply(format_demojified_string)
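
To see what the spacing and formatting steps do to a single string, here is a minimal sketch that uses only `re`. The functions are condensed copies of the ones defined above, and the sample string is hypothetical: `fire` and `flag: Brazil` merely stand in for emoji descriptions wrapped in the `<em>` delimiter.

```python
import re

def preprocess_line(input_line):
    # Same substitutions as above: pad every '<em>' delimiter with spaces
    padded = re.sub(r'(?<! )<em>', ' <em>', input_line)
    return re.sub(r'<em>(?! )', '<em> ', padded)

def format_demojified_string(input_line):
    def process_demojified_string(s):
        # Lowercase, turn ': ' and ' ' into '_', then wrap in 'EMOJI' ... 'e'
        s = re.sub(r'(: )| ', '_', s.lower())
        return f'EMOJI{s}e'
    return re.sub(r'<em>(.*?)<em>', lambda m: process_demojified_string(m.group(1)), input_line)

# Hypothetical demojified input (not taken from the dataset)
sample = 'Great<em>fire<em>news from<em>flag: Brazil<em>today'
spaced = preprocess_line(sample)
formatted = format_demojified_string(spaced)

print(spaced)     # Great <em> fire <em> news from <em> flag: Brazil <em> today
print(formatted)  # Great EMOJI_fire_e news from EMOJI_flag_brazil_e today
```

Because the delimiters are padded with spaces first, each description reaches the formatting step with a leading and a trailing space, which is why the resulting token has an underscore right after `EMOJI` and right before the final `e`.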

### Replacing the `pipe` character by the `-` character in the `Text` column

Later on, a few columns of the dataframe are exported to the file `tweets.txt`, whose columns are delimited by the `pipe` character. Any occurrence of the `pipe` character in the `Text` column is therefore replaced by another character, so that it is not mistaken for a column delimiter.

In [18]:
# Defining a function to replace the 'pipe' character by the '-' character
def replace_pipe_with_hyphen(input_string):
    modified_string = re.sub(r'\|', '-', input_string)
    return modified_string

# Replacing the 'pipe' character by the '-' character
df_tweets_filtered['Text'] = df_tweets_filtered['Text'].apply(replace_pipe_with_hyphen)

#### Exporting the filtered data into a file for inspection

In [19]:
df_tweets_filtered.to_csv('tweets_demojified3.tsv', sep='\t', index=False)

## Tokenising

Please refer to [What is tokenization in NLP?](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/).

In [20]:
# Defining a function to tokenise a string
def tokenise_string(input_line):
    # Replace URLs with placeholders
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+\b'
    placeholder = '<URL>'  # Choose a unique placeholder
    urls = re.findall(url_pattern, input_line)
    tokenised_line = re.sub(url_pattern, placeholder, input_line)  # Replace URLs with placeholders
    
    # Replace curly quotes with straight ones
    tokenised_line = tokenised_line.replace('“', '"').replace('”', '"').replace("‘", "'").replace("’", "'")
    # Separate common punctuation marks with spaces
    tokenised_line = re.sub(r'([.\!?,"\'/()])', r' \1 ', tokenised_line)
    # Add a space before '#' if it is not already preceded by whitespace
    tokenised_line = re.sub(r'(?<!\s)#', r' #', tokenised_line)
    # Collapse runs of whitespace into a single space
    tokenised_line = re.sub(r'\s+', ' ', tokenised_line)
    
    # Replace the placeholders with the respective URLs
    for url in urls:
        tokenised_line = tokenised_line.replace(placeholder, url, 1)
    
    return tokenised_line

# Tokenising the strings
df_tweets_filtered['Text'] = df_tweets_filtered['Text'].apply(tokenise_string)
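
As a quick check of what the tokeniser produces, the sketch below applies a copy of `tokenise_string` to one hypothetical sentence (not taken from the dataset). It shows that punctuation is split off, the curly apostrophe is straightened, a space is inserted before `#`, and the URL is left intact.

```python
import re

def tokenise_string(input_line):
    # Protect URLs with a placeholder so their punctuation is not split
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+\b'
    placeholder = '<URL>'
    urls = re.findall(url_pattern, input_line)
    tokenised_line = re.sub(url_pattern, placeholder, input_line)
    # Straighten curly quotes
    tokenised_line = tokenised_line.replace('“', '"').replace('”', '"').replace('‘', "'").replace('’', "'")
    # Put spaces around common punctuation marks and before '#'
    tokenised_line = re.sub(r'([.\!?,"\'/()])', r' \1 ', tokenised_line)
    tokenised_line = re.sub(r'(?<!\s)#', r' #', tokenised_line)
    # Collapse runs of whitespace into a single space
    tokenised_line = re.sub(r'\s+', ' ', tokenised_line)
    # Restore the URLs in their original order
    for url in urls:
        tokenised_line = tokenised_line.replace(placeholder, url, 1)
    return tokenised_line

# Hypothetical input sentence
sample = 'Well, I don’t think so. See https://example.com/page about jobs#economy'
tokenised = tokenise_string(sample)
print(tokenised)
# Well , I don ' t think so . See https://example.com/page about jobs #economy
```

Note that a string ending in a punctuation mark keeps a single trailing space after tokenisation, because the mark is padded on both sides before whitespace is collapsed.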

## Creating the files `file_index.txt` and `tweets.txt`

### Creating column `text_id`

In [21]:
df_tweets_filtered['text_id'] = 't' + df_tweets_filtered.index.astype(str).str.zfill(6)
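
The expression above builds each identifier from the row index. A minimal sketch on an illustrative three-row frame (the `Text` values are placeholders):

```python
import pandas as pd

# Illustrative frame with a default integer index (0, 1, 2)
df_example = pd.DataFrame({'Text': ['first turn', 'second turn', 'third turn']})

# Convert the index to strings, left-pad with zeros to six digits and prefix with 't'
df_example['text_id'] = 't' + df_example.index.astype(str).str.zfill(6)

text_ids = df_example['text_id'].tolist()
print(text_ids)  # ['t000000', 't000001', 't000002']
```

Since the identifiers follow the index, a dataframe that has been filtered without resetting its index will produce identifiers with gaps.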

### Creating column `conversation`

The column `conversation` maps to the column `Title`.

In [22]:
df_tweets_filtered['conversation'] = 'v:' + df_tweets_filtered['Title']

#### Replacing space by the `_` character

**Important**: Since the strings in the original columns contain spaces, Pandas encloses those fields in `"` when it writes the space-separated `file_index.txt`. This character causes issues when `examples.sh` is executed, so spaces should be replaced by another character such as the underscore.

In [23]:
# Defining a function to replace space by the '_' character
def replace_space_with_underscore(input_string):
    modified_string = re.sub(r' ', '_', input_string)
    return modified_string

In [24]:
# Replacing space by the '_' character
df_tweets_filtered['conversation'] = df_tweets_filtered['conversation'].apply(replace_space_with_underscore)
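
The quoting behaviour that motivates this replacement can be observed on a single illustrative row. The sketch below writes to a string instead of a file and uses `str.replace` in place of the helper function; both choices are assumptions made to keep the example short.

```python
import pandas as pd

# Illustrative row; the values mimic the columns written to 'file_index.txt'
df_example = pd.DataFrame({
    'text_id': ['t000000'],
    'conversation': ['v:September 29, 2020 Debate Transcript'],
})

# With a space as separator, a value containing spaces is wrapped in double quotes
with_spaces = df_example.to_csv(sep=' ', index=False, header=False, lineterminator='\n')

# After replacing spaces with underscores, no quoting is needed
df_example['conversation'] = df_example['conversation'].str.replace(' ', '_', regex=False)
without_spaces = df_example.to_csv(sep=' ', index=False, header=False, lineterminator='\n')

print(with_spaces)     # t000000 "v:September 29, 2020 Debate Transcript"
print(without_spaces)  # t000000 v:September_29,_2020_Debate_Transcript
```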

### Creating column `date`

The column `date` maps to the column `Date`.

In [25]:
# Extract the date part (without time) into a new column 'date'
df_tweets_filtered['date'] = df_tweets_filtered['Date'].dt.date

# Add the prefix 'd:' to the 'date' values
df_tweets_filtered['date'] = 'd:' + df_tweets_filtered['date'].astype(str)

### Creating column `text_url`

The column `text_url` maps to the column `Debate`.

In [26]:
df_tweets_filtered['text_url'] = 'url:' + df_tweets_filtered['Debate']

#### Replacing space by the `_` character

In [27]:
# Replacing space by the '_' character
df_tweets_filtered['text_url'] = df_tweets_filtered['text_url'].apply(replace_space_with_underscore)

### Creating column `user`

The column `user` maps to the column `Speaker`.

In [28]:
df_tweets_filtered['user'] = 'u:' + df_tweets_filtered['Speaker']

#### Replacing space by the `_` character

In [29]:
# Replacing space by the '_' character
df_tweets_filtered['user'] = df_tweets_filtered['user'].apply(replace_space_with_underscore)

### Creating column `content`

The column `content` maps to the column `Text`.

In [30]:
df_tweets_filtered['content'] = 'c:' + df_tweets_filtered['Text']

### Reordering the created columns

Please refer to:
- [Python - List Comprehension 1](https://www.w3schools.com/python/python_lists_comprehension.asp)
- [Python - List Comprehension 2](https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/)

In [31]:
# Reorder the columns (we use list comprehension to create a list of all columns except 'text_id', 'conversation', 'date', 'text_url', 'user' and 'content')
df_tweets_filtered = df_tweets_filtered[['text_id', 'conversation', 'date', 'text_url', 'user', 'content'] + [col for col in df_tweets_filtered.columns if col not in ['text_id', 'conversation', 'date', 'text_url', 'user', 'content']]]
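
The same reordering pattern can be written without repeating the list of leading columns. This is a minimal sketch on an illustrative frame; `leading_columns` and `remaining_columns` are names introduced here for the example only.

```python
import pandas as pd

# Illustrative frame whose columns are in an arbitrary order
df_example = pd.DataFrame({
    'Text': ['hello'],
    'text_id': ['t000000'],
    'Speaker': ['TRUMP'],
    'user': ['u:TRUMP'],
})

# Columns that should come first, in the desired order
leading_columns = ['text_id', 'user']

# Keep the remaining columns in their existing order
remaining_columns = [col for col in df_example.columns if col not in leading_columns]
df_example = df_example[leading_columns + remaining_columns]

column_order = df_example.columns.tolist()
print(column_order)  # ['text_id', 'user', 'Text', 'Speaker']
```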

In [32]:
df_tweets_filtered

Unnamed: 0,text_id,conversation,date,text_url,user,content,Title,Debate,Date,Participants,Moderators,Speaker,Text
0,t000000,"v:September_29,_2020_Debate_Transcript",d:2020-09-29,url:Presidential_Debate_at_Case_Western_Reserv...,u:TRUMP,"c:Thank you very much , Chris . I will tell yo...","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"Thank you very much , Chris . I will tell you ..."
1,t000001,"v:September_29,_2020_Debate_Transcript",d:2020-09-29,url:Presidential_Debate_at_Case_Western_Reserv...,u:BIDEN,"c:Well , first of all , thank you for doing th...","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,"Well , first of all , thank you for doing this..."
2,t000002,"v:September_29,_2020_Debate_Transcript",d:2020-09-29,url:Presidential_Debate_at_Case_Western_Reserv...,u:BIDEN,c:The American people have a right to have a s...,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN,The American people have a right to have a say...
3,t000003,"v:September_29,_2020_Debate_Transcript",d:2020-09-29,url:Presidential_Debate_at_Case_Western_Reserv...,u:TRUMP,c:There aren ' t a hundred million people with...,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,There aren ' t a hundred million people with p...
4,t000004,"v:September_29,_2020_Debate_Transcript",d:2020-09-29,url:Presidential_Debate_at_Case_Western_Reserv...,u:TRUMP,"c:During that period of time , during that per...","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,2020-09-29,Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP,"During that period of time , during that perio..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3473,t003473,"v:October_21,_1960_Debate_Transcript",d:1960-10-21,url:The_Fourth_Kennedy-Nixon_Presidential_Debate,u:MR._NIXON,c:I would say that the issue will stay with us...,"October 21, 1960 Debate Transcript",The Fourth Kennedy-Nixon Presidential Debate,1960-10-21,Kennedy-Nixon,QUINCY HOWE,MR. NIXON,I would say that the issue will stay with us a...
3474,t003474,"v:October_21,_1960_Debate_Transcript",d:1960-10-21,url:The_Fourth_Kennedy-Nixon_Presidential_Debate,u:MR._KENNEDY,"c:Well , Mr . Nixon , to go back to 1955 . The...","October 21, 1960 Debate Transcript",The Fourth Kennedy-Nixon Presidential Debate,1960-10-21,Kennedy-Nixon,QUINCY HOWE,MR. KENNEDY,"Well , Mr . Nixon , to go back to 1955 . The r..."
3475,t003475,"v:October_21,_1960_Debate_Transcript",d:1960-10-21,url:The_Fourth_Kennedy-Nixon_Presidential_Debate,u:MR._KENNEDY,c:And that ' s the testimony of uh – General T...,"October 21, 1960 Debate Transcript",The Fourth Kennedy-Nixon Presidential Debate,1960-10-21,Kennedy-Nixon,QUINCY HOWE,MR. KENNEDY,And that ' s the testimony of uh – General Twi...
3476,t003476,"v:October_21,_1960_Debate_Transcript",d:1960-10-21,url:The_Fourth_Kennedy-Nixon_Presidential_Debate,u:MR._KENNEDY,c:I uh – said that I ' ve served this country ...,"October 21, 1960 Debate Transcript",The Fourth Kennedy-Nixon Presidential Debate,1960-10-21,Kennedy-Nixon,QUINCY HOWE,MR. KENNEDY,I uh – said that I ' ve served this country fo...


### Creating the file `file_index.txt`

In [33]:
df_tweets_filtered[['text_id', 'conversation', 'date', 'text_url']].to_csv('file_index.txt', sep=' ', index=False, header=False, encoding='utf-8', lineterminator='\n')

### Creating the file `tweets.txt`

In [34]:
folder = 'tweets'
try:
    os.mkdir(folder)
    print(f'Folder {folder} created!')
except FileExistsError:
    print(f'Folder {folder} already exists')

Folder tweets created!


Note: The parameters `doublequote=False` and `escapechar=' '` are required to prevent the `content` column from being enclosed in double quotes when a sentence contains a character that would need to be escaped, such as the double quote `"` itself. Such quoting causes a malformed response from TreeTagger.

In [35]:
df_tweets_filtered[['text_id', 'conversation', 'date', 'user', 'content']].to_csv(f'{folder}/tweets.txt', sep='|', index=False, header=False, encoding='utf-8', lineterminator='\n', doublequote=False, escapechar=' ')

## Tagging with TreeTagger

- On Visual Studio Code (VS Code), open the folder where your project is located with `Open Folder...`
- Open a WSL Ubuntu Terminal on VS Code
- **Important**: Activate the `my_env` Python environment by executing `source "$HOME"/my_env/bin/activate`
- Proceed as indicated

Note: To visualise the procedure, download this Jupyter Notebook and open it in JupyterLab (provided as part of the Anaconda Distribution).

Purpose: Annotate the texts in `tweets/tweets.txt` with part-of-speech and lemma information.
- Input
    - `file_index.txt`
    - `tweets/tweets.txt`
- Output
    - `tweets/tagged.txt`

## Processing `tokenstypes`

Purpose: Capture the content tokens (specific occurrences of words) and the content types (general concept of words) from `tweets/tagged.txt`.
- Input
    - `file_index.txt`
    - `tweets/tagged.txt`
- Output
    - `tweets/tokens.txt`
    - `tweets/types.txt`

## Processing `toplemmas`

Purpose: Determine the 1,000 top lemmas. **Important**: This process requires manual inspection. Non-meaningful lemmas should be excluded by updating `stoplist.sed` and repeating the processing.
- Input
    - `tweets/types.txt`
    - `stoplist.sed`: List of rules that allows the exclusion of certain lemmas
- Output
    - `selectedwords` = `var_index.txt`
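
The idea behind this step can be illustrated in Python. The sketch below is not the script used by the pipeline, whose ranking criterion is not shown in this notebook; it assumes lemmas are ranked by how often they occur, and both the lemma lists and the stoplist are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical lemma lists, one per text (placeholders, not real TreeTagger output)
lemmas_per_text = [
    ['election', 'have', 'consequence', 'be'],
    ['election', 'be', 'senate'],
    ['senate', 'have', 'election'],
]

# Hypothetical stoplist of lemmas judged non-meaningful after manual inspection
stoplist = {'be', 'have'}

# Count each remaining lemma across all texts
lemma_counts = Counter(
    lemma for text in lemmas_per_text for lemma in text if lemma not in stoplist
)

# Keep the N most frequent lemmas (N = 1,000 in the procedure above; 2 here)
top_lemmas = [lemma for lemma, _ in lemma_counts.most_common(2)]
print(top_lemmas)  # ['election', 'senate']
```

Adding a lemma to the stoplist and re-running the count mirrors the manual cycle described above.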

## Processing `sas`

Purpose: Prepare input data for processing in SAS.
- Input
    - `tweets/types.txt`
    - `selectedwords`
    - `file_index.txt`
- Output
    - `columns`
    - `sas/data.txt`
    - `sas/dates.txt`
    - `sas/wcount.txt`

## Processing `datamatrix`

Purpose: Prepare input data for calculating the correlation matrix.
- Input
    - `file_index.txt`
    - `columns`
    - `selectedwords`
- Output
    - `file_ids.txt`
    - `data.csv`

## Processing `correlationmatrix`

Purpose: Calculate the correlation matrix.
- Input
    - `data.csv`
- Output
    - `correlation`
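
For readers unfamiliar with the concept, a correlation matrix can be sketched with pandas as follows. This is an illustration only: the layout of `data.csv` is not shown in this notebook, so the presence/absence matrix below (one row per text, one column per selected lemma) is an assumption, and the pipeline script may compute the matrix differently.

```python
import pandas as pd

# Hypothetical presence/absence matrix: one row per text, one column per lemma
data = pd.DataFrame({
    'election': [1, 1, 0, 0],
    'senate':   [1, 1, 0, 0],
    'economy':  [0, 0, 1, 1],
})

# Pearson correlation between every pair of lemma columns
correlation = data.corr()
print(correlation)
```

In this toy matrix, `election` and `senate` always occur together and so correlate perfectly, whereas `economy` only occurs where they are absent and correlates negatively with both.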

## Processing `formats`

Purpose: Prepare input data for processing in SAS.
- Input
    - `data.csv`
    - `selectedwords`
- Output
    - `sas/corr.txt`
    - `sas/word_labels_format.sas`

## Processing the statistical procedures on SAS

- Log in to your [SAS OnDemand for Academics](https://welcome.oda.sas.com/) account
- Proceed as indicated in this [video tutorial](https://youtu.be/I3u9zD3jyOA?si=68uIKVc2iusGG2KY)

## Processing `examples`

Purpose: Extract examples for analysis.
- Input
    - `sas/output_"$project"/loadtable.html`
    - `sas/output_"$project"/"$project"_scores.tsv`
    - `sas/output_"$project"/"$project"_scores_only.tsv`
- Output
    - `examples/factors`
    - `example files`

## Results

Right-click on the link and choose `Save link as` to download the corresponding file.

- [CL_St1_Ph3_INRS_Results.tar.gz](https://laelgelcinrs.s3.sa-east-1.amazonaws.com/CL_St1_Ph3_INRS_Results.tar.gz)