<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - INRS

## Importing the required libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

## Data wrangling

### Importing the texts of `September 29, 2020 Debate Transcript` into a dataframe

In [2]:
url = 'https://www.debates.org/voter-education/debate-transcripts/september-29-2020-debate-transcript/'
response = requests.get(url, timeout=30)  # Give up after 30 seconds instead of waiting indefinitely

if response.status_code == 200:
    page_content = response.content
    # Parse the HTML content
    soup = BeautifulSoup(page_content, 'lxml')

    # Extract relevant tags (e.g., <h1>, <p>, etc.) and store them in a list
    rows = []
    for tag in soup.find_all(['h1', 'p']):
        rows.append(tag.text.strip())  # Remove leading/trailing spaces
    
    # Create a DataFrame from the list of rows
    df = pd.DataFrame(rows, columns=['Text'])

else:
    print(f'Failed to retrieve the page. Status code: {response.status_code}')
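The extraction step in the cell above can be tried offline. The following is a minimal sketch, assuming an invented HTML snippet in place of the live page; `extract_rows` is a hypothetical helper name, and the standard-library `html.parser` backend stands in for `lxml` so that no extra parser is required.

```python
from bs4 import BeautifulSoup


def extract_rows(html, tags=('h1', 'p')):
    """Return the stripped text of every matching tag, in document order."""
    soup = BeautifulSoup(html, 'html.parser')
    return [tag.text.strip() for tag in soup.find_all(list(tags))]


# Invented markup that mimics the structure of a transcript page
sample_html = """
<h1> Sample Debate Transcript </h1>
<div>navigation that should be ignored</div>
<p>WALLACE:&nbsp;Good evening.</p>
<p> BIDEN:&nbsp;How you doing, man? </p>
"""

rows = extract_rows(sample_html)
```

Each `&nbsp;` entity is decoded to the non-breaking space `\xa0`, which is why the speaker pattern used later in this notebook matches `\xa0` rather than a plain space.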

In [3]:
df

Unnamed: 0,Text
0,"September 29, 2020 Debate Transcript"
1,Presidential Debate at Case Western Reserve Un...
2,"September 29, 2020"
3,PARTICIPANTS:\nFormer Vice President Joe Biden...
4,MODERATOR:\nChris Wallace (Fox News)
...,...
874,TRUMP: I want to see an honest ballot count.
875,WALLACE: We’re going to leave it there. . .
876,TRUMP: And I think he does too. . .
877,WALLACE: … to be continued in more debates as ...


### Creating the column `Title`

In [4]:
title = df.at[0, 'Text']
df['Title'] = title

### Creating the column `Debate`

In [5]:
debate = df.at[1, 'Text']
df['Debate'] = debate

### Creating the column `Date`

In [6]:
date = df.at[2, 'Text']
df['Date'] = date

### Creating the column `Participants`

In [7]:
participants = df.at[3, 'Text']
participants = re.sub(r'^\w+:\n', '', participants)
participants = re.sub(r'\n', ' ', participants)
df['Participants'] = participants

### Creating the column `Moderators`

In [8]:
moderators = df.at[4, 'Text']
moderators = re.sub(r'^\w+:\n', '', moderators)
df['Moderators'] = moderators
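The two substitutions used for `Participants` and `Moderators` can be checked in isolation. This is a small sketch on invented strings (the names and labels below are placeholders, not taken from the transcript): the first pattern removes the leading `LABEL:` line, and the second joins the remaining lines with spaces.

```python
import re

# Invented strings shaped like the 'PARTICIPANTS' and 'MODERATOR' rows
participants_raw = 'PARTICIPANTS:\nSpeaker One (X) and\nSpeaker Two (Y)'
moderators_raw = 'MODERATOR:\nSome Moderator (Network)'

# Drop the leading 'LABEL:' line, then replace the remaining line breaks with spaces
participants_clean = re.sub(r'^\w+:\n', '', participants_raw)
participants_clean = re.sub(r'\n', ' ', participants_clean)

# A single-line value only needs the label removed
moderators_clean = re.sub(r'^\w+:\n', '', moderators_raw)
```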

### Dropping the `COPYRIGHT` row

In [9]:
df.loc[878, 'Text']

'© COPYRIGHT 2020 THE COMMISSION ON PRESIDENTIAL DEBATES. ALL RIGHTS RESERVED.'

In [10]:
df = df[~df['Text'].str.contains(df.loc[878, 'Text'], regex=False)]  # Match the copyright notice literally, not as a regular expression
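`str.contains` treats its pattern as a regular expression by default, so characters such as `.` in the copyright notice act as wildcards. The sketch below, on invented rows, compares the default behaviour with a literal match (`regex=False`); the strings are placeholders chosen only to make the difference visible.

```python
import pandas as pd

toy = pd.DataFrame({'Text': [
    'TRUMP: Thank you.',
    'ALL RIGHTS RESERVED.',
    'ALL RIGHTS RESERVED!',
]})
needle = 'RIGHTS RESERVED.'

# Default: '.' is a regex wildcard, so the row ending in '!' is flagged too
regex_mask = toy['Text'].str.contains(needle)

# Literal match: only the row that really contains the needle is flagged
literal_mask = toy['Text'].str.contains(needle, regex=False)

kept = toy[~literal_mask]
```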

In [11]:
df

Unnamed: 0,Text,Title,Debate,Date,Participants,Moderators
0,"September 29, 2020 Debate Transcript","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
1,Presidential Debate at Case Western Reserve Un...,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
2,"September 29, 2020","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
3,PARTICIPANTS:\nFormer Vice President Joe Biden...,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
4,MODERATOR:\nChris Wallace (Fox News),"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
...,...,...,...,...,...,...
873,"WALLACE: Gentlemen, just say that’s the end of...","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
874,TRUMP: I want to see an honest ballot count.,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
875,WALLACE: We’re going to leave it there. . .,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
876,TRUMP: And I think he does too. . .,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)


### Dropping the rows that were used to create the columns

In [12]:
df = df.loc[5:]
df = df.reset_index(drop=True)
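A short sketch of the same two calls on an invented seven-row frame may help here: `.loc[5:]` slices by index label, so it keeps every row from label 5 onwards, and `reset_index(drop=True)` renumbers the remaining rows from 0 without keeping the old labels as a column.

```python
import pandas as pd

# Invented rows: five metadata rows followed by two speech turns
toy = pd.DataFrame({'Text': [
    'title', 'debate', 'date', 'participants', 'moderator',
    'first turn', 'second turn',
]})

body = toy.loc[5:]                   # Label-based slice: keeps labels 5 and 6
body = body.reset_index(drop=True)   # Renumber from 0 and discard the old labels
```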

In [13]:
df

Unnamed: 0,Text,Title,Debate,Date,Participants,Moderators
0,WALLACE: Good evening from the Health Educatio...,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
1,This debate is being conducted under health an...,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
2,"BIDEN: How you doing, man?","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
3,TRUMP: How are you doing?,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
4,BIDEN: I’m well.,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
...,...,...,...,...,...,...
868,"WALLACE: Gentlemen, just say that’s the end of...","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
869,TRUMP: I want to see an honest ballot count.,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
870,WALLACE: We’re going to leave it there. . .,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)
871,TRUMP: And I think he does too. . .,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News)


### Creating the column `Speaker`

In [14]:
df.loc[0, 'Text']

'WALLACE:\xa0Good evening from the Health Education Campus of Case Western Reserve University and the Cleveland Clinic. I’m Chris Wallace of Fox News and I welcome you to the first of the 2020 presidential debates between President Donald J. Trump and former Vice President Joe Biden. This debate is sponsored by the Commission on Presidential Debates. The Commission has designed the format, six roughly 15-minute segments with two-minute answers from each candidate to the first question, then open discussion for the rest of each segment. Both campaigns have agreed to these rules. For the record, I decided the topics and the questions in each topic. I can assure you none of the questions has been shared with the Commission or the two candidates.'

In [15]:
df.at[0, 'Speaker'] = 1  # Create the 'Speaker' column with a placeholder value; the loop below overwrites it
df = df.astype('object')  # Cast every column to object so that string labels can be stored without dtype warnings

for index, row in df.iterrows():
    match = re.match(r'^(\w+):\xa0', row['Text'])
    if match:
        speaker = match.group(1)
        df.at[index, 'Speaker'] = speaker
        text_without_speaker = row['Text'][len(match.group(0)):]
        df.at[index, 'Text'] = text_without_speaker
    else:
        # No speaker label found: leave 'Speaker' empty so that it can be forward-filled in a later step
        pass
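The row-by-row loop above could also be written with pandas string methods. The following is a sketch of one possible vectorised alternative on invented rows, not the method used in this notebook: `str.extract` with two named groups (`Speaker` and `Rest`, names chosen here for illustration) splits the label from the speech, and rows without a label keep their original text.

```python
import re

import pandas as pd

toy = pd.DataFrame({'Text': [
    'WALLACE:\xa0Good evening.',
    'This paragraph has no speaker label.',
    'BIDEN:\xa0How you doing, man?',
]})

# One pass over the column: group 1 is the label, group 2 is the rest of the paragraph
parts = toy['Text'].str.extract(r'^(?P<Speaker>\w+):\xa0(?P<Rest>.*)$', flags=re.DOTALL)

toy['Speaker'] = parts['Speaker']                 # Missing where no label was found
toy['Text'] = parts['Rest'].fillna(toy['Text'])   # Keep the original text for unlabelled rows
```

Unlabelled rows end up with a missing `Speaker`, which is the same state the loop leaves them in before the forward fill.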

In [16]:
df

Unnamed: 0,Text,Title,Debate,Date,Participants,Moderators,Speaker
0,Good evening from the Health Education Campus ...,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),WALLACE
1,This debate is being conducted under health an...,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),
2,"How you doing, man?","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN
3,How are you doing?,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP
4,I’m well.,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN
...,...,...,...,...,...,...,...
868,"Gentlemen, just say that’s the end of it [cros...","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),WALLACE
869,I want to see an honest ballot count.,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP
870,We’re going to leave it there. . .,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),WALLACE
871,And I think he does too. . .,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP


### Checking how many rows in column `Speaker` are empty (continuation paragraphs that belong to the most recent speaker)

In [17]:
print(df['Speaker'].isnull().sum())

18


### Replacing missing values with the last non-empty value in the column (assigning each continuation paragraph to the most recent speaker)

In [18]:
df['Speaker'] = df['Speaker'].ffill()  # Assign the result back instead of filling in place on a column selection
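Forward filling can be illustrated on a small invented series: each missing entry takes the value of the closest non-missing entry above it, which is how continuation paragraphs inherit the speaker of the preceding labelled paragraph.

```python
import pandas as pd

# Invented speaker labels with gaps for continuation paragraphs
speakers = pd.Series(['WALLACE', None, 'BIDEN', None, None, 'TRUMP'], dtype='object')

missing_before = int(speakers.isnull().sum())
filled = speakers.ffill()   # Propagate the last seen label downwards
missing_after = int(filled.isnull().sum())
```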

### Double checking

In [19]:
print(df['Speaker'].isnull().sum())

0


In [20]:
df

Unnamed: 0,Text,Title,Debate,Date,Participants,Moderators,Speaker
0,Good evening from the Health Education Campus ...,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),WALLACE
1,This debate is being conducted under health an...,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),WALLACE
2,"How you doing, man?","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN
3,How are you doing?,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP
4,I’m well.,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),BIDEN
...,...,...,...,...,...,...,...
868,"Gentlemen, just say that’s the end of it [cros...","September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),WALLACE
869,I want to see an honest ballot count.,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP
870,We’re going to leave it there. . .,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),WALLACE
871,And I think he does too. . .,"September 29, 2020 Debate Transcript",Presidential Debate at Case Western Reserve Un...,"September 29, 2020",Former Vice President Joe Biden (D) and Presid...,Chris Wallace (Fox News),TRUMP


### Exporting to a file

#### JSONL format

In [21]:
df[['Title', 'Debate', 'Date', 'Participants', 'Moderators', 'Speaker', 'Text']].to_json('September 29, 2020 Debate Transcript.jsonl', orient='records', lines=True)

#### TSV format

In [22]:
df[['Title', 'Debate', 'Date', 'Participants', 'Moderators', 'Speaker', 'Text']].to_csv('September 29, 2020 Debate Transcript.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')
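A quick way to check exports like the two above is to read the files back. The sketch below does this for an invented two-row frame written to a temporary directory; the file names and the toy columns are assumptions made for illustration only.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'Speaker': ['WALLACE', 'BIDEN'],
    'Text': ['Good evening.', 'How you doing, man?'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = os.path.join(tmp_dir, 'toy.jsonl')
    tsv_path = os.path.join(tmp_dir, 'toy.tsv')

    # Write one JSON object per line, and a tab-separated file without the index
    toy.to_json(jsonl_path, orient='records', lines=True)
    toy.to_csv(tsv_path, sep='\t', index=False, encoding='utf-8')

    # Read both files back while the temporary directory still exists
    from_jsonl = pd.read_json(jsonl_path, lines=True)
    from_tsv = pd.read_csv(tsv_path, sep='\t', encoding='utf-8')
```

When reading the real files back, note that `read_json` has a `convert_dates` option, so it is worth checking whether the `Date` column comes back as text.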