# Corpus Collection and Processing

This Jupyter notebook covers the cleaning and processing of a text corpus.  
The corpus comprises the full speeches delivered by Donald Trump at 35 rallies until 2021.

## 1. Cleaning

### 1.1. Import packages

In [155]:
# Import spacy
import spacy

# Import os to read documents and metadata from disk
import os

# Load spaCy visualizer
from spacy import displacy

# Import pandas for DataFrames
import pandas as pd
# Silence pandas' chained-assignment (SettingWithCopy) warning
pd.options.mode.chained_assignment = None

### 1.2. Read files and extract information from filenames

In [200]:
# Create empty lists for file names and contents
texts = []
file_names = []
date = []
location = []

# Iterate through each file in the folder
for _file_name in os.listdir('archive'):
    # Look for only text files
    if _file_name.endswith('.txt'):
        # Append contents of each text file to text list (the file is closed automatically)
        with open(os.path.join('archive', _file_name), 'r', encoding='utf-8') as f:
            texts.append(f.read())
        # Append name of each file to file name list
        file_names.append(_file_name)
        # Extract and append date and location of each file
        date.append(_file_name[-14:-4])
        location.append(_file_name[0:-14])
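
The fixed-position slices above work as long as every file name ends in a 10-character date followed by `.txt`. As a minimal sketch of a stricter alternative, the date can be matched with a regular expression and parsed into a real date object, which makes sorting by date straightforward. The pattern and the helper name `parse_speech_filename` are assumptions for illustration, not part of the notebook, and `%b` expects English month abbreviations (it depends on the active locale).

```python
import re
from datetime import datetime

# Assumed naming scheme: <Location><Mon><DD>_<YYYY>.txt, e.g. 'CharlotteMar02_2020.txt'
FILENAME_PATTERN = re.compile(
    r"^(?P<location>.+?)(?P<date>[A-Z][a-z]{2}\d{2}_\d{4})\.txt$"
)

def parse_speech_filename(file_name):
    """Split a speech file name into (location, date); hypothetical helper."""
    match = FILENAME_PATTERN.match(file_name)
    if match is None:
        # Fail loudly instead of silently slicing a malformed name
        raise ValueError(f"Unexpected file name: {file_name!r}")
    speech_date = datetime.strptime(match.group("date"), "%b%d_%Y").date()
    return match.group("location"), speech_date

location_name, speech_date = parse_speech_filename("NewHampshireAug15_2019.txt")
print(location_name, speech_date.isoformat())
```

A file that does not follow the scheme raises a `ValueError` here, whereas the slicing approach would store a wrong date and location without complaint.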

### 1.3. Create a dataframe linking each file name to its original text, date, and location

In [201]:
d = {'Filename':file_names,'Date':date, 'Location':location,'Document':texts}

In [202]:
speech_df = pd.DataFrame(d)

In [203]:
speech_df.head()

Unnamed: 0,Filename,Date,Location,Document
0,CharlotteMar02_2020.txt,Mar02_2020,Charlotte,"I want to thank you very much. North Carolina,..."
1,NewHampshireAug15_2019.txt,Aug15_2019,NewHampshire,Thank you very much everybody. Thank you. Wow...
2,ToledoJan09_2020.txt,Jan09_2020,Toledo,"Well, thank you very much. Vice President Mike..."
3,LatrobeSep03_2020.txt,Sep03_2020,Latrobe,"So thank you Pennsylvania, very much. I'm thri..."
4,HendersonSep13_2020.txt,Sep13_2020,Henderson,"Thank you, thank you. Wow. Wow, and I'm thrill..."


### 1.4. Clean the text

In [204]:
nlp = spacy.load('en_core_web_sm') 

In [205]:
# Thanks for this advice in the feedback!
cleaned_texts = []
for text in texts:
    # Remove leading whitespace from the text
    text = text.lstrip()
    # Remove punctuation
    doc = nlp(text)
    text = ' '.join(token.text for token in doc if not token.is_punct)
    # Change text to lower case
    cleaned_text = text.lower()
    cleaned_texts.append(cleaned_text)
    
# Inspect one cleaned text
cleaned_texts[1]

"thank you very much everybody thank you wow i will never ever let you down that i can tell you amazing and i want to thank manchester and new hampshire they 're very special you remember those primaries that primary came around and remember what happened during the primary trump should come in third or fourth and we came in easily number one and that was the beginning that was an easy one i want to thank you all this is incredible and it 's great to be back in a state that i love with thousands of hardworking patriots who are the heart and soul of america and that 's what you are we 're actually here today to officially launch our campaign to win the great state of new hampshire in 2020 and i saw some fake polls put out by the fake news media and it said that i 'm tied with three of the other candidates the democrats i 'm tied in new hampshire i do n't think so one of them has a rally he 's got 100 people so this holds 12,000 we 're full and we could fill it up four times at least we 

In [206]:
speech_df['Text'] = cleaned_texts

In [207]:
speech_df.head()

Unnamed: 0,Filename,Date,Location,Document,Text
0,CharlotteMar02_2020.txt,Mar02_2020,Charlotte,"I want to thank you very much. North Carolina,...",i want to thank you very much north carolina t...
1,NewHampshireAug15_2019.txt,Aug15_2019,NewHampshire,Thank you very much everybody. Thank you. Wow...,thank you very much everybody thank you wow i ...
2,ToledoJan09_2020.txt,Jan09_2020,Toledo,"Well, thank you very much. Vice President Mike...",well thank you very much vice president mike p...
3,LatrobeSep03_2020.txt,Sep03_2020,Latrobe,"So thank you Pennsylvania, very much. I'm thri...",so thank you pennsylvania very much i 'm thril...
4,HendersonSep13_2020.txt,Sep13_2020,Henderson,"Thank you, thank you. Wow. Wow, and I'm thrill...",thank you thank you wow wow and i 'm thrilled ...
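
Because the cleaned text is rebuilt by joining spaCy tokens with spaces, contractions come out split (`they 're`, `do n't`, `i 'm`), as the output above shows. If intact contractions are preferred, one possible alternative is a plain regular-expression cleaner. This is a swapped-in technique rather than the notebook's spaCy-based method, and the function name `clean_keep_contractions` is hypothetical. Note that it also strips commas inside numbers (`12,000` becomes `12000`) and removes curly apostrophes, so it is only a sketch under those assumptions.

```python
import re

def clean_keep_contractions(text):
    """Lowercase text and drop punctuation, keeping straight apostrophes."""
    text = text.strip().lower()
    # Drop every character that is not a word character, whitespace, or apostrophe
    text = re.sub(r"[^\w\s']", "", text)
    # Collapse whitespace runs left behind by the removed punctuation
    return re.sub(r"\s+", " ", text).strip()

sample = "Thank you very much. They're very special, I don't think so!"
print(clean_keep_contractions(sample))
```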


### 1.5. Run the spaCy pipeline to generate Doc objects

In [208]:
def process_text(text):
    return nlp(text)

In [209]:
speech_df['Doc'] = speech_df['Text'].apply(process_text) 

In [210]:
speech_df.head()

Unnamed: 0,Filename,Date,Location,Document,Text,Doc
0,CharlotteMar02_2020.txt,Mar02_2020,Charlotte,"I want to thank you very much. North Carolina,...",i want to thank you very much north carolina t...,"(i, want, to, thank, you, very, much, north, c..."
1,NewHampshireAug15_2019.txt,Aug15_2019,NewHampshire,Thank you very much everybody. Thank you. Wow...,thank you very much everybody thank you wow i ...,"(thank, you, very, much, everybody, thank, you..."
2,ToledoJan09_2020.txt,Jan09_2020,Toledo,"Well, thank you very much. Vice President Mike...",well thank you very much vice president mike p...,"(well, thank, you, very, much, vice, president..."
3,LatrobeSep03_2020.txt,Sep03_2020,Latrobe,"So thank you Pennsylvania, very much. I'm thri...",so thank you pennsylvania very much i 'm thril...,"(so, thank, you, pennsylvania, very, much, i, ..."
4,HendersonSep13_2020.txt,Sep13_2020,Henderson,"Thank you, thank you. Wow. Wow, and I'm thrill...",thank you thank you wow wow and i 'm thrilled ...,"(thank, you, thank, you, wow, wow, and, i, ', ..."


### 1.6. Generate Tokens

In [211]:
def get_token(doc):
    return [token.text for token in doc]

In [212]:
speech_df['Tokens'] = speech_df['Doc'].apply(get_token)

In [213]:
speech_df.head()

Unnamed: 0,Filename,Date,Location,Document,Text,Doc,Tokens
0,CharlotteMar02_2020.txt,Mar02_2020,Charlotte,"I want to thank you very much. North Carolina,...",i want to thank you very much north carolina t...,"(i, want, to, thank, you, very, much, north, c...","[i, want, to, thank, you, very, much, north, c..."
1,NewHampshireAug15_2019.txt,Aug15_2019,NewHampshire,Thank you very much everybody. Thank you. Wow...,thank you very much everybody thank you wow i ...,"(thank, you, very, much, everybody, thank, you...","[thank, you, very, much, everybody, thank, you..."
2,ToledoJan09_2020.txt,Jan09_2020,Toledo,"Well, thank you very much. Vice President Mike...",well thank you very much vice president mike p...,"(well, thank, you, very, much, vice, president...","[well, thank, you, very, much, vice, president..."
3,LatrobeSep03_2020.txt,Sep03_2020,Latrobe,"So thank you Pennsylvania, very much. I'm thri...",so thank you pennsylvania very much i 'm thril...,"(so, thank, you, pennsylvania, very, much, i, ...","[so, thank, you, pennsylvania, very, much, i, ..."
4,HendersonSep13_2020.txt,Sep13_2020,Henderson,"Thank you, thank you. Wow. Wow, and I'm thrill...",thank you thank you wow wow and i 'm thrilled ...,"(thank, you, thank, you, wow, wow, and, i, ', ...","[thank, you, thank, you, wow, wow, and, i, ', ..."


## 2. Processing

### 2.1. Add annotations: Lemmas

In [214]:
def get_lemma(doc):
    return [token.lemma_ for token in doc]

In [215]:
speech_df['Lemmas'] = speech_df['Doc'].apply(get_lemma)

In [216]:
speech_df.head()

Unnamed: 0,Filename,Date,Location,Document,Text,Doc,Tokens,Lemmas
0,CharlotteMar02_2020.txt,Mar02_2020,Charlotte,"I want to thank you very much. North Carolina,...",i want to thank you very much north carolina t...,"(i, want, to, thank, you, very, much, north, c...","[i, want, to, thank, you, very, much, north, c...","[I, want, to, thank, you, very, much, north, c..."
1,NewHampshireAug15_2019.txt,Aug15_2019,NewHampshire,Thank you very much everybody. Thank you. Wow...,thank you very much everybody thank you wow i ...,"(thank, you, very, much, everybody, thank, you...","[thank, you, very, much, everybody, thank, you...","[thank, you, very, much, everybody, thank, you..."
2,ToledoJan09_2020.txt,Jan09_2020,Toledo,"Well, thank you very much. Vice President Mike...",well thank you very much vice president mike p...,"(well, thank, you, very, much, vice, president...","[well, thank, you, very, much, vice, president...","[well, thank, you, very, much, vice, president..."
3,LatrobeSep03_2020.txt,Sep03_2020,Latrobe,"So thank you Pennsylvania, very much. I'm thri...",so thank you pennsylvania very much i 'm thril...,"(so, thank, you, pennsylvania, very, much, i, ...","[so, thank, you, pennsylvania, very, much, i, ...","[so, thank, you, pennsylvania, very, much, I, ..."
4,HendersonSep13_2020.txt,Sep13_2020,Henderson,"Thank you, thank you. Wow. Wow, and I'm thrill...",thank you thank you wow wow and i 'm thrilled ...,"(thank, you, thank, you, wow, wow, and, i, ', ...","[thank, you, thank, you, wow, wow, and, i, ', ...","[thank, you, thank, you, wow, wow, and, I, ', ..."


### 2.2. Add annotations: Part-of-speech (POS) tags

In [217]:
def get_pos(doc):
    return [(token.pos_, token.tag_) for token in doc]

In [218]:
speech_df['POS'] = speech_df['Doc'].apply(get_pos)

In [219]:
speech_df.head()

Unnamed: 0,Filename,Date,Location,Document,Text,Doc,Tokens,Lemmas,POS
0,CharlotteMar02_2020.txt,Mar02_2020,Charlotte,"I want to thank you very much. North Carolina,...",i want to thank you very much north carolina t...,"(i, want, to, thank, you, very, much, north, c...","[i, want, to, thank, you, very, much, north, c...","[I, want, to, thank, you, very, much, north, c...","[(PRON, PRP), (VERB, VBP), (PART, TO), (VERB, ..."
1,NewHampshireAug15_2019.txt,Aug15_2019,NewHampshire,Thank you very much everybody. Thank you. Wow...,thank you very much everybody thank you wow i ...,"(thank, you, very, much, everybody, thank, you...","[thank, you, very, much, everybody, thank, you...","[thank, you, very, much, everybody, thank, you...","[(VERB, VBP), (PRON, PRP), (ADV, RB), (ADV, RB..."
2,ToledoJan09_2020.txt,Jan09_2020,Toledo,"Well, thank you very much. Vice President Mike...",well thank you very much vice president mike p...,"(well, thank, you, very, much, vice, president...","[well, thank, you, very, much, vice, president...","[well, thank, you, very, much, vice, president...","[(INTJ, UH), (VERB, VBP), (PRON, PRP), (ADV, R..."
3,LatrobeSep03_2020.txt,Sep03_2020,Latrobe,"So thank you Pennsylvania, very much. I'm thri...",so thank you pennsylvania very much i 'm thril...,"(so, thank, you, pennsylvania, very, much, i, ...","[so, thank, you, pennsylvania, very, much, i, ...","[so, thank, you, pennsylvania, very, much, I, ...","[(ADV, RB), (VERB, VBP), (PRON, PRP), (NOUN, N..."
4,HendersonSep13_2020.txt,Sep13_2020,Henderson,"Thank you, thank you. Wow. Wow, and I'm thrill...",thank you thank you wow wow and i 'm thrilled ...,"(thank, you, thank, you, wow, wow, and, i, ', ...","[thank, you, thank, you, wow, wow, and, i, ', ...","[thank, you, thank, you, wow, wow, and, I, ', ...","[(VERB, VBP), (PRON, PRP), (VERB, VBP), (PRON,..."
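
Since `Tokens` and `POS` are both built from the same `Doc`, the two lists in a row line up position by position. A minimal sketch of how that alignment could be used to pull out all tokens with a given coarse-grained tag follows. The helper `tokens_with_pos` and the sample row are hypothetical and only mirror the shape of the two columns.

```python
def tokens_with_pos(tokens, pos_pairs, wanted_pos):
    """Return the tokens whose coarse-grained POS tag equals wanted_pos."""
    return [
        token
        for token, (pos, _tag) in zip(tokens, pos_pairs)
        if pos == wanted_pos
    ]

# Hypothetical sample row shaped like the 'Tokens' and 'POS' columns
tokens = ["i", "want", "to", "thank", "you"]
pos_pairs = [
    ("PRON", "PRP"),
    ("VERB", "VBP"),
    ("PART", "TO"),
    ("VERB", "VB"),
    ("PRON", "PRP"),
]

print(tokens_with_pos(tokens, pos_pairs, "VERB"))  # ['want', 'thank']
```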


### 2.3. Add annotations: Proper Nouns

In [220]:
def extract_proper_nouns(doc):
    return [token.text for token in doc if token.pos_ == 'PROPN']

In [221]:
speech_df['Proper_Nouns'] = speech_df['Doc'].apply(extract_proper_nouns)

In [222]:
list(speech_df.loc[[3, 34], 'Proper_Nouns'])

[['arnold',
  'pennsylvania',
  'white',
  'house',
  'joe',
  'biden',
  'china',
  'china',
  'virus',
  'america',
  'america',
  'united',
  'states',
  'military',
  'november',
  'november',
  'china',
  'mr',
  '.',
  'congressman',
  'china',
  'china',
  'china',
  'china',
  'democrat',
  'run',
  'america',
  'america',
  'national',
  'guard',
  'biden',
  'congress',
  'washington',
  'march',
  'portland',
  'mike',
  'schmidt',
  'philadelphia',
  'bernie',
  'sanders',
  'pennsylvania',
  'bernie',
  'bernie',
  'crazy',
  'bernie',
  'i.',
  'biden',
  'florida',
  'ohio',
  'texas',
  'north',
  'carolina',
  'south',
  'carolina',
  'rick',
  'perry',
  'rick',
  'rick',
  'rick',
  'pittsburgh',
  'pennsylvania',
  'ohio',
  'west',
  'virginia',
  'john',
  'hughes',
  'john',
  'john',
  'john',
  'john',
  'john',
  'john',
  'john',
  'john',
  'hey',
  'joe',
  'biden',
  'china',
  'pennsylvania',
  'oil',
  'pennsylvania',
  'pennsylvania',
  'joe',
  'hiden'

### 2.4. Add annotations: Named Entity Labels

In [223]:
def extract_named_entities(doc):
    return [ent.label_ for ent in doc.ents]

In [224]:
speech_df['Named_Entities'] = speech_df['Doc'].apply(extract_named_entities)

In [225]:
speech_df['Named_Entities']

0     [GPE, GPE, CARDINAL, NORP, GPE, DATE, GPE, PER...
1     [GPE, ORDINAL, ORDINAL, CARDINAL, CARDINAL, GP...
2     [PERSON, GPE, GPE, DATE, DATE, ORDINAL, DATE, ...
3     [GPE, DATE, GPE, DATE, ORG, PERSON, GPE, GPE, ...
4     [GPE, CARDINAL, NORP, DATE, GPE, DATE, ORG, DA...
5     [GPE, GPE, CARDINAL, NORP, DATE, GPE, CARDINAL...
6     [GPE, CARDINAL, DATE, NORP, GPE, TIME, EVENT, ...
7     [DATE, GPE, CARDINAL, GPE, DATE, DATE, ORDINAL...
8     [GPE, GPE, GPE, TIME, TIME, DATE, GPE, GPE, GP...
9     [GPE, GPE, CARDINAL, GPE, GPE, NORP, DATE, CAR...
10    [GPE, GPE, GPE, DATE, LOC, CARDINAL, NORP, DAT...
11    [GPE, TIME, GPE, NORP, FAC, NORP, CARDINAL, GP...
12    [CARDINAL, GPE, MONEY, GPE, GPE, NORP, CARDINA...
13    [DATE, CARDINAL, CARDINAL, CARDINAL, TIME, PER...
14    [PERSON, QUANTITY, PERSON, GPE, DATE, QUANTITY...
15    [GPE, GPE, GPE, CARDINAL, NORP, DATE, DATE, DA...
16    [GPE, PERSON, DATE, DATE, GPE, CARDINAL, GPE, ...
17    [GPE, CARDINAL, NORP, GPE, DATE, GPE, DATE
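
Each row of `Named_Entities` is a plain list of label strings, so label frequencies can be tallied with the standard library. The sketch below uses made-up sample lists, and `count_labels` is a hypothetical helper, not something defined in the notebook.

```python
from collections import Counter
from itertools import chain

def count_labels(label_lists):
    """Count how often each entity label occurs across all speeches."""
    return Counter(chain.from_iterable(label_lists))

# Hypothetical sample shaped like the 'Named_Entities' column
sample_labels = [
    ["GPE", "GPE", "CARDINAL", "NORP"],
    ["GPE", "ORDINAL", "ORDINAL"],
]

label_counts = count_labels(sample_labels)
print(label_counts.most_common(2))  # [('GPE', 3), ('ORDINAL', 2)]
```

In the notebook itself the same function could presumably be called as `count_labels(speech_df['Named_Entities'])`, since iterating over the column yields one list per speech.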

### 2.5. Add annotations: Named Entity Objects

In [226]:
def extract_named_entity_spans(doc):
    # Distinct name so the label extractor from 2.4 is not overwritten
    return list(doc.ents)

In [227]:
speech_df['NE_Words'] = speech_df['Doc'].apply(extract_named_entity_spans)

In [228]:
speech_df['NE_Words']

0     [(north, carolina), (charlotte), (thousands), ...
1     [(new, hampshire), (third), (fourth), (number,...
2     [(mike, pence), (toledo), (toledo), (a, year),...
3     [(pennsylvania), (61, days), (pennsylvania), (...
4     [(henderson), (thousands), (american), (51, da...
5     [(toledo), (ohio), (thousands), (american), (4...
6     [(arizona), (thousands), (november), (democrat...
7     [(christmas), (michigan), (thousands), (americ...
8     [(california), (pennsylvania), (pennsylvania),...
9     [(houston), (texas), (one), (america), (india)...
10    [(colorado), (colorado), (colorado), (the, yea...
11    [(dallas), (tonight), (texas), (american), (al...
12    [(two), (japan), (three, 40, billion), (japan)...
13    [(the, day), (20,000), (25,000), (close, to, 1...
14    [(joe, biden), (122, degrees), (joe, biden), (...
15    [(north, carolina), (north, carolina), (winsto...
16    [(oklahoma), (mike), (weeks), (today), (oklaho...
17    [(mexico), (thousands), (american), (ameri

### 2.6. Visualize named entity tagging in a single speech

In [229]:
doc = speech_df['Doc'][1]
displacy.render(doc, style='ent', jupyter=True)

### 2.7. Drop the 'Doc' column

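Once the DataFrame is saved as a CSV or text file, each spaCy Doc object is written out as a plain string and its annotations are lost. The annotations we need are already stored in the other columns, so the 'Doc' column can be dropped.  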
Thanks for the advice in the feedback!

In [230]:
speech_df = speech_df.drop('Doc',axis = 1)

## 3. Saving annotated corpus as a CSV file

In [231]:
speech_df.to_csv('Trump_speeches_with_spaCy_tags.csv')
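
One consequence of saving to CSV is that list-valued columns such as `Tokens` or `POS` are stored as their string representation. A minimal sketch of how they could be turned back into Python lists on reload, using only the standard library and a made-up row, is shown below. It assumes the cells contain plain Python literals (strings, tuples, lists); columns that held spaCy objects, such as `NE_Words`, may not parse this way.

```python
import ast
import csv
import io

# Hypothetical row shaped like the saved DataFrame
rows = [
    {
        "Filename": "ToledoJan09_2020.txt",
        "Tokens": ["well", "thank", "you"],
        "POS": [("INTJ", "UH"), ("VERB", "VBP"), ("PRON", "PRP")],
    }
]

# Write to an in-memory CSV; non-string values are stringified with str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Filename", "Tokens", "POS"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now a string, so parse the list columns
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    row["Tokens"] = ast.literal_eval(row["Tokens"])
    row["POS"] = ast.literal_eval(row["POS"])
    restored.append(row)

print(type(restored[0]["Tokens"]).__name__, restored[0]["POS"][0])
```

With pandas, the same idea can be expressed through the `converters` argument of `pd.read_csv`, for example `converters={'Tokens': ast.literal_eval}`.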

**<font color = blue>Thank you for your time!</font>**