# Introduction

In this project I analyse the US 2020 presidential debates, focusing on the speech styles of the two candidates, Donald Trump and Joe Biden. Analysing how each politician speaks can offer some insight into his character and personality.

Specifically, I'll be walking through:

1. Handling the missing values
2. Making the time consecutive
3. Cleaning the data
4. Document Term Matrix

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

## Context

The US 2020 election sees the incumbent Republican president Donald Trump facing off against his Democrat challenger Joe Biden. Both candidates agreed to debate their political stances in the lead-up to the vote on November 3rd.

The 1st presidential debate took place on September 29th at Case Western Reserve University in Cleveland, Ohio. It was moderated by Fox News anchor Chris Wallace.

The 2nd presidential debate between Biden and Trump took place on October 22nd at Belmont University, Nashville, Tennessee and was moderated by NBC News' Kristen Welker.

## Dataset

This dataset was downloaded from Kaggle and can be found <a href="https://www.kaggle.com/headsortails/us-election-2020-presidential-debates">here</a>.

# Importing Libraries

In [None]:
import pandas as pd
import re
import string
import unicodedata
import nltk
import spacy
nltk.download('stopwords')
from nltk.tokenize.toktok import ToktokTokenizer
from sklearn.feature_extraction.text import CountVectorizer
import pickle
import datetime

nlp = spacy.load('en', parse=True, tag=True, entity=True)
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')
pd.set_option('max_colwidth',100)

import warnings
warnings.filterwarnings('ignore')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Importing Dataset

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
first_debate = pd.read_csv("/content/drive/MyDrive/Data Science/us election presidential debates/datasets/us_election_2020_1st_presidential_debate.csv")
second_debate = pd.read_csv("/content/drive/MyDrive/Data Science/us election presidential debates/datasets/us_election_2020_2nd_presidential_debate.csv")

In [None]:
first_debate.head()

Unnamed: 0,speaker,minute,text
0,Chris Wallace,01:20,Good evening from the Health Education Campus of Case Western Reserve University and the Clevela...
1,Chris Wallace,02:10,This debate is being conducted under health and safety protocols designed by the Cleveland Clini...
2,Vice President Joe Biden,02:49,"How you doing, man?"
3,President Donald J. Trump,02:51,How are you doing?
4,Vice President Joe Biden,02:51,I’m well.


In [None]:
second_debate.head()

Unnamed: 0,speaker,minute,text
0,Kristen Welker,00:18,"Good evening, everyone. Good evening. Thank you so much for being here. It is such an honor for ..."
1,Donald Trump,07:37,How are you doing? How are you?
2,Kristen Welker,07:58,And I do want to say a very good evening to both of you. This debate will cover six major topics...
3,Kristen Welker,08:27,The goal is for you to hear each other and for the American people to hear every word of what yo...
4,Kristen Welker,09:03,… during this next stage of the coronavirus crisis. Two minutes uninterrupted.


# Handling Missing Values

As the outputs below show, the first debate has one missing value in the `minute` column. Inspecting the first debate shows that the missing value marks the point where the timer restarts, so we can assign '00:00' to it.

In [None]:
missing_df = pd.DataFrame(pd.concat([first_debate.isnull().sum(), second_debate.isnull().sum()], axis = 1))
missing_df.columns = ['first debate', 'second debate']
missing_df

Unnamed: 0,first debate,second debate
speaker,0,0
minute,1,0
text,0,0


In [None]:
first_debate.iloc[178:181]

Unnamed: 0,speaker,minute,text
178,President Donald J. Trump,24:25,"You don’t trust Johnson & Johnson, Pfizer?"
179,Chris Wallace:,,"Okay, gentlemen, gentlemen. Let me move on to questions about the future because you both have t..."
180,President Donald J. Trump,00:15,"Well, I’ve spoken to the companies and we can have it a lot sooner. It’s a very political thing ..."


In [None]:
first_debate.loc[first_debate.minute.isnull(), 'minute'] = '00:00'

In [None]:
first_debate.iloc[178:181]

Unnamed: 0,speaker,minute,text
178,President Donald J. Trump,24:25,"You don’t trust Johnson & Johnson, Pfizer?"
179,Chris Wallace:,00:00,"Okay, gentlemen, gentlemen. Let me move on to questions about the future because you both have t..."
180,President Donald J. Trump,00:15,"Well, I’ve spoken to the companies and we can have it a lot sooner. It’s a very political thing ..."


In [None]:
second_debate.iloc[88:91]

Unnamed: 0,speaker,minute,text
88,Donald Trump,32:22,"Now, about your thing last night. I knew all about that. And through John — who is John Ratcliff..."
89,Donald Trump,00:00,They both want you to lose because there has been nobody tougher to Russia between the sanctions...
90,Donald Trump,00:33,"And I’ll tell you, they were so bad. They took over the submarine port, you remember that very w..."


In [None]:
second_debate.iloc[336:339]

Unnamed: 0,speaker,minute,text
336,Joe Biden,39:14,I do. I do. My daughter is a social worker and she’s written a lot about this. She has her gradu...
337,Joe Biden,00:00,"Making sure that you, in fact, if you get pulled over just, yes, sir, no, sir. Hands on top of t..."
338,Kristen Welker,01:06,"President Trump, same question to you, and let me remind you of the question. I would like you t..."


In [None]:
# Let's check unique speakers in first debate
print('Speakers in the first debate:', (first_debate["speaker"].unique()))

Speakers in the first debate: ['Chris Wallace' 'Vice President Joe Biden' 'President Donald J. Trump'
 'Chris Wallace:']


The unique speakers include two spellings of the moderator's name, `Chris Wallace` and `Chris Wallace:`. They are the same person, so I will correct the misspelled entry.

In [None]:
# Let's correct the typo in the name
first_debate["speaker"] = first_debate["speaker"].replace({"Chris Wallace:": "Chris Wallace"})

In [None]:
# Let's change their names for more simplicity and coherence in two datasets
first_debate["speaker"] = first_debate["speaker"].replace({
    "Chris Wallace": "Moderator",
    "Vice President Joe Biden": "Joe Biden",
    "President Donald J. Trump": "Donald Trump",
})

In [None]:
# Let's create a corpus by combining text in all rows and making one text row for each speaker.
dict_dt={'transcript':', '.join(first_debate[first_debate["speaker"]=="Donald Trump"]["text"])}
dt_df = pd.DataFrame(data=dict_dt, index=["Donald Trump"])

moderator={'transcript':', '.join(first_debate[first_debate["speaker"]=="Moderator"]["text"])}
moderator_df = pd.DataFrame(data=moderator, index=["Moderator"])

dict_jb={'transcript':', '.join(first_debate[first_debate["speaker"]=="Joe Biden"]["text"])}
jb_df = pd.DataFrame(data=dict_jb, index=["Joe Biden"])

first_data = pd.concat([dt_df, moderator_df,jb_df])
first_data.head()

Unnamed: 0,transcript
Donald Trump,"How are you doing?, Thank you very much, Chris. I will tell you very simply. We won the election..."
Moderator,Good evening from the Health Education Campus of Case Western Reserve University and the Clevela...
Joe Biden,"How you doing, man?, I’m well., Well, first of all, thank you for doing this and looking forward..."
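
The three per-speaker blocks above can also be written as a single `groupby`. The sketch below is an alternative formulation, not the notebook's own code, and it runs on a tiny hypothetical frame (`toy`) rather than the real transcript. Note that `groupby` sorts speakers alphabetically, so the row order differs from the `pd.concat` version.

```python
import pandas as pd

# Hypothetical three-row sample shaped like the debate data
toy = pd.DataFrame({
    "speaker": ["Joe Biden", "Donald Trump", "Joe Biden"],
    "text": ["How you doing, man?", "How are you doing?", "I’m well."],
})

# Join every utterance of a speaker into one transcript string
corpus = toy.groupby("speaker")["text"].agg(", ".join).to_frame("transcript")
print(corpus)
```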


In [None]:
#Let's check unique speakers in second debate
print('Speakers in the second debate:', (second_debate["speaker"].unique()))

Speakers in the second debate: ['Kristen Welker' 'Donald Trump' 'Joe Biden']


In [None]:
# Let's change their names for more simplicity and coherence in two datasets
second_debate["speaker"] = second_debate["speaker"].replace({"Kristen Welker": "Moderator"})

In [None]:
# Let's create a corpus by combining text in all rows and making one text row for each speaker.
dt_dict={'transcript':', '.join(second_debate[second_debate["speaker"]=="Donald Trump"]["text"])}
df_dt = pd.DataFrame(data=dt_dict, index=["Donald Trump"])

moderator_2={'transcript':', '.join(second_debate[second_debate["speaker"]=="Moderator"]["text"])}
moderator_2_df = pd.DataFrame(data=moderator_2, index=["Moderator"])

jb_dict={'transcript':', '.join(second_debate[second_debate["speaker"]=="Joe Biden"]["text"])}
df_jb = pd.DataFrame(data=jb_dict, index=["Joe Biden"])

second_data = pd.concat([df_dt, moderator_2_df,df_jb])
second_data.head()

Unnamed: 0,transcript
Donald Trump,"How are you doing? How are you?, So as you know, 2.2 million people modeled out, were expected t..."
Moderator,"Good evening, everyone. Good evening. Thank you so much for being here. It is such an honor for ..."
Joe Biden,"220,000 Americans dead. You hear nothing else I say tonight, hear this. Anyone who is responsibl..."


# Making The Time Consecutive

In this part we build one consistent timeline. The transcript timer restarts partway through each debate, so instead of having separate parts we want a single timeline that covers all the speeches. We parse the `minute` column into hours, minutes and seconds, convert it to a number of seconds, and then give every row the same format.

In [None]:
# Let's define a function that calculates seconds of a given minute value.
def seconds(column):
    # Accepts 'mm:ss' or 'hh:mm:ss' and returns the total number of seconds
    parts = [int(x) for x in column.split(':')]
    if len(parts) == 3:
        return parts[-3]*3600 + parts[-2]*60 + parts[-1]
    return parts[-2]*60 + parts[-1]
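
As a quick check of the conversion, here is a small standalone sketch. `to_seconds` is a hypothetical helper written for this illustration (it folds the parts from left to right instead of branching on their count); the sample timestamps are taken from the first debate.

```python
def to_seconds(stamp):
    """Convert 'mm:ss' or 'hh:mm:ss' into a total number of seconds."""
    total = 0
    for part in stamp.split(":"):
        # Each step shifts the running total up one unit (x60) and adds the next part
        total = total * 60 + int(part)
    return total

print(to_seconds("24:25"))     # 24*60 + 25
print(to_seconds("01:10:43"))  # 1*3600 + 10*60 + 43
```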

In [None]:
# First Debate
first_debate["seconds"]=first_debate['minute'].apply(seconds)

In [None]:
first_debate.iloc[178:181]

Unnamed: 0,speaker,minute,text,seconds
178,Donald Trump,24:25,"You don’t trust Johnson & Johnson, Pfizer?",1465
179,Moderator,00:00,"Okay, gentlemen, gentlemen. Let me move on to questions about the future because you both have t...",0
180,Donald Trump,00:15,"Well, I’ve spoken to the companies and we can have it a lot sooner. It’s a very political thing ...",15


The moderator's turn at index 179 restarts at zero. This is an artifact of the `minute` column, not a separate session of the debate. To get one continuous timeline, I add the timestamp of the previous row (1465 seconds) to every row from index 179 to the end.

In [None]:
# Assign through .loc so the change is written to first_debate itself, not to a temporary copy
first_debate.loc[first_debate.index[179:], "seconds"] += 1465

In [None]:
first_debate['minutes'] = first_debate["seconds"].apply(lambda x:x//60)

# We use this format of %h:%m:%s by using the following command
first_debate['hour'] = first_debate["seconds"].apply(lambda x:str(datetime.timedelta(seconds=x)))

In [None]:
first_debate.tail()

Unnamed: 0,speaker,minute,text,seconds,minutes,hour
784,Moderator,01:10:43,"Gentlemen, just say that’s the end of it [crosstalk 01:10:45]. This is the end of this debate-",5708,95,1:35:08
785,Donald Trump,01:10:47,I want to see an honest ballot count.,5712,95,1:35:12
786,Moderator,01:10:48,We’re going to leave it there-,5713,95,1:35:13
787,Donald Trump,01:10:49,And I think he does too-,5714,95,1:35:14
788,Moderator,01:10:50,"… to be continued in more debates as we go on. President Trump, Vice President Biden, it’s been ...",5715,95,1:35:15


After the adjustment, the first debate runs for 1:35:15.

In [None]:
# Second Debate
second_debate["seconds"]=second_debate['minute'].apply(seconds)

In [None]:
second_debate.iloc[88:91]

Unnamed: 0,speaker,minute,text,seconds
88,Donald Trump,32:22,"Now, about your thing last night. I knew all about that. And through John — who is John Ratcliff...",1942
89,Donald Trump,00:00,They both want you to lose because there has been nobody tougher to Russia between the sanctions...,0
90,Donald Trump,00:33,"And I’ll tell you, they were so bad. They took over the submarine port, you remember that very w...",33


In [None]:
second_debate.iloc[336:339]

Unnamed: 0,speaker,minute,text,seconds
336,Joe Biden,39:14,I do. I do. My daughter is a social worker and she’s written a lot about this. She has her gradu...,2354
337,Joe Biden,00:00,"Making sure that you, in fact, if you get pulled over just, yes, sir, no, sir. Hands on top of t...",0
338,Moderator,01:06,"President Trump, same question to you, and let me remind you of the question. I would like you t...",66


I will add the previous row's timestamp (1942 seconds) to the rows from index **89** up to the next restart at index **337**. From index **337** to the end I will add the accumulated offset of both earlier parts (1942 + 2354 = 4296 seconds).

In [None]:
# Assign through .loc so the change is written to second_debate itself, not to a temporary copy
second_debate.loc[second_debate.index[89:337], "seconds"] += 1942

In [None]:
# (1942+2354) = 4296
second_debate.loc[second_debate.index[337:], "seconds"] += 4296
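
The offsets above were found by inspecting the data by hand. As a hedged alternative, the sketch below detects timer restarts automatically: a restart is assumed to be any row whose timestamp is smaller than the previous row's. `make_consecutive` is a hypothetical helper, and the sample values mirror the rows around the two restarts in the second debate.

```python
import pandas as pd

def make_consecutive(seconds):
    """Add a running offset each time the timer drops below the previous value."""
    prev = seconds.shift(1)
    # A restart is a row whose timestamp is smaller than the one before it
    is_reset = seconds < prev
    # At each restart the offset grows by the last timestamp seen before it
    offset = prev.where(is_reset, 0).cumsum()
    return (seconds + offset).astype(int)

sample = pd.Series([1942, 0, 33, 2354, 0, 66])
fixed = make_consecutive(sample)
print(fixed.tolist())
```

This assumes the timer only ever moves forward within a part, which holds for the rows shown in this notebook but is worth checking on other transcripts.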

In [None]:
second_debate['minutes'] = second_debate["seconds"].apply(lambda x:x//60)

# We use this format of %h:%m:%s by using the following command
second_debate['hour'] = second_debate["seconds"].apply(lambda x:str(datetime.timedelta(seconds=x)))

In [None]:
second_debate.iloc[88:91]

Unnamed: 0,speaker,minute,text,seconds,minutes,hour
88,Donald Trump,32:22,"Now, about your thing last night. I knew all about that. And through John — who is John Ratcliff...",1942,32,0:32:22
89,Donald Trump,00:00,They both want you to lose because there has been nobody tougher to Russia between the sanctions...,1942,32,0:32:22
90,Donald Trump,00:33,"And I’ll tell you, they were so bad. They took over the submarine port, you remember that very w...",1975,32,0:32:55


In [None]:
second_debate.iloc[336:339]

Unnamed: 0,speaker,minute,text,seconds,minutes,hour
336,Joe Biden,39:14,I do. I do. My daughter is a social worker and she’s written a lot about this. She has her gradu...,4296,71,1:11:36
337,Joe Biden,00:00,"Making sure that you, in fact, if you get pulled over just, yes, sir, no, sir. Hands on top of t...",4296,71,1:11:36
338,Moderator,01:06,"President Trump, same question to you, and let me remind you of the question. I would like you t...",4362,72,1:12:42


In [None]:
second_debate.tail()

Unnamed: 0,speaker,minute,text,seconds,minutes,hour
507,Moderator,25:49,"All right. Vice President Biden, same question to you: what will you say during your inaugural a...",5845,97,1:37:25
508,Joe Biden,25:57,"I will say, I’m an American President. I represent all of you, whether you voted for me or again...",5853,97,1:37:33
509,Joe Biden,26:19,"We can grow this economy, we can deal with the systemic racism. At the same time, we can make su...",5875,97,1:37:55
510,Moderator,26:53,"All right, I want to thank you both for a very robust hour and a half, a fantastic debate. Reall...",5909,98,1:38:29
511,Joe Biden,27:16,Thank you.,5932,98,1:38:52


After the adjustment, the second debate runs for 1:38:52.

### Speaking duration of each speaker

I would like to find out how many minutes each speaker talked during the debates. To estimate the duration of each speech, I take the difference between each row's `seconds` value and the previous row's with the **diff()** function, then sum these differences per speaker. The first row has no previous row, so `diff()` returns NaN there and I replace it with the row's own timestamp.
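
A minimal sketch of this calculation on a hypothetical four-row frame (`toy`), using the first timestamps of the first debate. As in the notebook, each gap is credited to the row where it ends.

```python
import pandas as pd

toy = pd.DataFrame({
    "speaker": ["Moderator", "Joe Biden", "Donald Trump", "Joe Biden"],
    "seconds": [80, 130, 169, 171],
})

# Difference between each timestamp and the previous one
toy["duration"] = toy["seconds"].diff()

# The first row has no predecessor, so fill its NaN with its own timestamp
toy.loc[toy["duration"].isnull(), "duration"] = toy["seconds"].iloc[0]

# Total seconds per speaker
totals = toy.groupby("speaker")["duration"].sum()
print(totals)
```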

In [None]:
first_debate["duration"] = first_debate["seconds"].diff()

In [None]:
first_debate.iloc[0]["duration"]

nan

In [None]:
first_debate["duration"].isnull().sum()

1

In [None]:
first_debate.loc[first_debate.duration.isnull(), 'duration'] = 80

In [None]:
first_debate["duration"].isnull().sum()

0

In [None]:
second_debate["duration"] = second_debate["seconds"].diff()

In [None]:
second_debate["duration"].isnull().sum()

1

In [None]:
second_debate.loc[second_debate.duration.isnull(), 'duration'] = 18

In [None]:
first_debate.groupby("speaker")["duration"].sum()

speaker
Donald Trump    2172.0
Joe Biden       1689.0
Moderator       1854.0
Name: duration, dtype: float64

In [None]:
first_data["speech_time"]=[(lambda x:x//60)(first_debate[first_debate["speaker"]=="Donald Trump"]["duration"].sum()),
                           (lambda x:x//60)(first_debate[first_debate["speaker"]=="Moderator"]["duration"].sum()),
                           (lambda x:x//60)(first_debate[first_debate["speaker"]=="Joe Biden"]["duration"].sum())]

In [None]:
first_data

Unnamed: 0,transcript,speech_time
Donald Trump,"How are you doing?, Thank you very much, Chris. I will tell you very simply. We won the election...",36.0
Moderator,Good evening from the Health Education Campus of Case Western Reserve University and the Clevela...,30.0
Joe Biden,"How you doing, man?, I’m well., Well, first of all, thank you for doing this and looking forward...",28.0


- **Donald Trump**: 36 mins
- **Joe Biden**: 28 mins
- **Moderator**: 30 mins



In [None]:
second_debate.groupby("speaker")["duration"].sum()

speaker
Donald Trump    1903.0
Joe Biden       1269.0
Moderator       2760.0
Name: duration, dtype: float64

In [None]:
second_data["speech_time"]=[(lambda x:x//60)(second_debate[second_debate["speaker"]=="Donald Trump"]["duration"].sum()),
                            (lambda x:x//60)(second_debate[second_debate["speaker"]=="Moderator"]["duration"].sum()),
                            (lambda x:x//60)(second_debate[second_debate["speaker"]=="Joe Biden"]["duration"].sum())]

In [None]:
second_data

Unnamed: 0,transcript,speech_time
Donald Trump,"How are you doing? How are you?, So as you know, 2.2 million people modeled out, were expected t...",31.0
Moderator,"Good evening, everyone. Good evening. Thank you so much for being here. It is such an honor for ...",46.0
Joe Biden,"220,000 Americans dead. You hear nothing else I say tonight, hear this. Anyone who is responsibl...",21.0


- **Donald Trump**: 31 mins
- **Joe Biden**: 21 mins
- **Moderator**: 46 mins

In [None]:
first_debate = first_debate[["speaker","seconds","minutes","hour","duration","text"]]
second_debate = second_debate[["speaker","seconds","minutes","hour","duration","text"]]

# Cleaning the Data

There are some common data cleaning techniques, which are also known as text pre-processing techniques.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Expanding contractions
* Remove common non-sensical text (`\n`)
* Tokenize text
* Remove stop words
* Stemming / lemmatization

## Defining text pre processing functions

### 1- Expanding Contractions
- Contractions are shortened written or spoken forms of words in the English language.

In [None]:
def expand_contractions(text):
    
    '''Lowercase the text and expand common English contractions.'''
    text = text.lower()
    
    # Expand contractions (the transcripts use the curly apostrophe ’)
    text = re.sub("that’s","that is",text)
    text = re.sub("there’s","there is",text)
    text = re.sub("here’s","here is",text)
    text = re.sub("what’s","what is",text)
    text = re.sub("where’s","where is",text)
    text = re.sub("who’s","who is",text)
    text = re.sub("i’m","i am",text)
    text = re.sub("it’s","it is",text)
    text = re.sub("she’s","she is",text)
    text = re.sub("he’s","he is",text)
    text = re.sub("they’re","they are",text)
    text = re.sub("we’re","we are",text)
    text = re.sub("you’re","you are",text)
    text = re.sub("who’re","who are",text)
    text = re.sub("i’ll","i will",text)
    text = re.sub("you’ll","you will",text)
    text = re.sub("we’ll","we will",text)
    text = re.sub("didn’t","did not",text)
    text = re.sub("doesn’t","does not",text)
    text = re.sub("aren’t","are not",text)
    text = re.sub("don’t","do not",text)
    text = re.sub("i’ve","i have",text)
    text = re.sub("you’ve","you have",text)
    text = re.sub("we’ve","we have",text)
    text = re.sub("they’ve","they have",text)
    text = re.sub("ain’t","am not",text)
    text = re.sub("wouldn’t","would not",text)
    text = re.sub("shouldn’t","should not",text)
    text = re.sub("can’t","can not",text)
    text = re.sub("couldn’t","could not",text)
    text = re.sub("won’t","will not",text)
    
    return text
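
The same idea can be expressed with a lookup table and a single compiled pattern. This is a sketch of an alternative, not the function used in this notebook: `CONTRACTIONS` holds only a few hypothetical sample entries, `expand` is a hypothetical name, and word boundaries are added so that a short key cannot match inside a longer word.

```python
import re

# A few sample entries; the keys use the curly apostrophe found in the transcripts
CONTRACTIONS = {
    "i’m": "i am",
    "it’s": "it is",
    "don’t": "do not",
    "can’t": "can not",
}

# One alternation pattern covering every key, anchored on word boundaries
PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b")

def expand(text):
    """Lowercase the text and replace each known contraction with its long form."""
    text = text.lower()
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0)], text)

print(expand("I’m sure it’s fine, don’t worry"))
```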

### 2- Removing Special Characters
 - Special characters are non-alphanumeric characters that add noise to unstructured text. We remove them and make sure that the text is standardized into ASCII characters.

In [None]:
def text_cleaner(text):
    '''Lowercase text and strip bracketed notes, punctuation, numbers and non-ASCII characters.'''
    text = text.lower()
    
    # Remove text in square brackets, punctuation, and words containing numbers
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    
    # Collapse extra spaces and drop stray single letters
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\s+[a-z]\s+', ' ', text)
    text = re.sub(r'^[a-z]\s+', ' ', text)

    # Remove curly quotes and ellipses, which string.punctuation does not cover
    text = re.sub('[‘’“”…]', '', text)
        
    # Get rid of accented characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    
    return text
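
To see what the main substitutions do, here is a reduced sketch applied to one made-up sentence. `clean` is a hypothetical, simplified variant of the function above: it keeps the bracket, punctuation, digit and curly-quote steps, collapses whitespace last, and skips the single-letter and accent handling.

```python
import re
import string

def clean(text):
    """Simplified cleaning: brackets, punctuation, digits, curly quotes, whitespace."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)                               # bracketed notes
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)   # ASCII punctuation
    text = re.sub(r"\w*\d\w*", "", text)                              # words containing digits
    text = re.sub("[‘’“”…]", "", text)                                # curly quotes, ellipsis
    text = re.sub(r"\s+", " ", text).strip()                          # collapse whitespace
    return text

sample = "Gentlemen [crosstalk 01:10:45], this is 220,000 votes… okay?"
print(clean(sample))
```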

### 3- Lemmatization

The goal of lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. 

In [None]:
def text_lemmatizer(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

### 4- Removing Stopwords
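- Stopwords are words which have little or no significance, such as articles, conjunctions and prepositions. Note that `no` and `not` were removed from the stopword list when it was loaded, so negations are kept in the text.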

In [None]:
def stopwords_remover(text):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    
    return filtered_text
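
The filtering step can be illustrated without NLTK. In the sketch below, `STOPWORDS` is a small hypothetical list (not NLTK's), tokens are split on whitespace, and `no` and `not` are removed from the list in the same way as in the setup cell.

```python
# Hypothetical mini stopword list; negations are taken out so they survive filtering
STOPWORDS = {"how", "are", "you", "doing", "this", "is", "the", "of", "no", "not"}
STOPWORDS -= {"no", "not"}

def remove_stopwords(text):
    """Drop every whitespace-separated token that appears in STOPWORDS."""
    tokens = text.split()
    kept = [token for token in tokens if token.lower() not in STOPWORDS]
    return " ".join(kept)

print(remove_stopwords("this is not the end of the debate"))
```

A short utterance made up entirely of stopwords comes back as an empty string, which is why some rows can end up with an empty `clean_text`.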

## Building a Text Normalizer Function

In [None]:
def normalize_corpus(corpus, contraction_expansion=True,
                     text_lemmatization=True, text_cleaning=True, 
                     stopword_removal=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
         
         # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        
         # text cleaning
        if text_cleaning:
            doc = text_cleaner(doc)   
       
        # lemmatize text
        if text_lemmatization:
            doc = text_lemmatizer(doc)
     
        # remove stopwords
        if stopword_removal:
            doc = stopwords_remover(doc)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus

### Building The Corpus

In [None]:
first_debate['clean_text']=normalize_corpus(first_debate['text'], contraction_expansion=True,
                     text_lemmatization=True, text_cleaning=True, 
                     stopword_removal=True)

first_debate.head()

Unnamed: 0,speaker,seconds,minutes,hour,duration,text,clean_text
0,Moderator,80,1,0:01:20,80.0,Good evening from the Health Education Campus of Case Western Reserve University and the Clevela...,good evening health education campus case western reserve university cleveland clinic chris wall...
1,Moderator,130,2,0:02:10,50.0,This debate is being conducted under health and safety protocols designed by the Cleveland Clini...,debate conduct health safety protocol design cleveland clinic serve health security advisor comm...
2,Joe Biden,169,2,0:02:49,39.0,"How you doing, man?",man
3,Donald Trump,171,2,0:02:51,2.0,How are you doing?,
4,Joe Biden,171,2,0:02:51,0.0,I’m well.,well


In [None]:
first_data['clean_text']=normalize_corpus(first_data['transcript'], contraction_expansion=True,
                     text_lemmatization=True, text_cleaning=True, 
                     stopword_removal=True)

first_data.head()


Unnamed: 0,transcript,speech_time,clean_text
Donald Trump,"How are you doing?, Thank you very much, Chris. I will tell you very simply. We won the election...",36.0,thank much chris tell simply win election election consequence senate white house phenomenal nom...
Moderator,Good evening from the Health Education Campus of Case Western Reserve University and the Clevela...,30.0,good evening health education campus case western reserve university cleveland clinic chris wall...
Joe Biden,"How you doing, man?, I’m well., Well, first of all, thank you for doing this and looking forward...",28.0,man well well first thank look forward mr president american people right say supreme court nomi...


In [None]:
second_debate['clean_text']=normalize_corpus(second_debate['text'], contraction_expansion=True,
                     text_lemmatization=True, text_cleaning=True, 
                     stopword_removal=True)

second_debate.head()

Unnamed: 0,speaker,seconds,minutes,hour,duration,text,clean_text
0,Moderator,18,0,0:00:18,18.0,"Good evening, everyone. Good evening. Thank you so much for being here. It is such an honor for ...",good evening everyone good evening thank much honor moderate debate tonight final debate want we...
1,Donald Trump,457,7,0:07:37,439.0,How are you doing? How are you?,
2,Moderator,478,7,0:07:58,21.0,And I do want to say a very good evening to both of you. This debate will cover six major topics...,want say good evening debate cover six major topic beginning section candidate two minute uninte...
3,Moderator,507,8,0:08:27,29.0,The goal is for you to hear each other and for the American people to hear every word of what yo...,goal hear american people hear every word say ready let start begin fight coronavirus president ...
4,Moderator,543,9,0:09:03,36.0,… during this next stage of the coronavirus crisis. Two minutes uninterrupted.,next stage coronavirus crisis two minute uninterrupted


In [None]:
second_data['clean_text']=normalize_corpus(second_data['transcript'], contraction_expansion=True,
                     text_lemmatization=True, text_cleaning=True, 
                     stopword_removal=True)

second_data.head()

Unnamed: 0,transcript,speech_time,clean_text
Donald Trump,"How are you doing? How are you?, So as you know, 2.2 million people modeled out, were expected t...",31.0,know million people model expect die close great economy world order fight horrible disease come...
Moderator,"Good evening, everyone. Good evening. Thank you so much for being here. It is such an honor for ...",46.0,good evening everyone good evening thank much honor moderate debate tonight final debate want we...
Joe Biden,"220,000 Americans dead. You hear nothing else I say tonight, hear this. Anyone who is responsibl...",21.0,americans dead hear nothing else say tonight hear anyone responsible not take control fact not s...


In [None]:
# Let's pickle them for later use
first_data.to_pickle("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/first_whole_corpus.pkl")
second_data.to_pickle("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/second_whole_corpus.pkl")

first_debate.to_pickle("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/first_debate_corpus.pkl")
second_debate.to_pickle("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/second_debate_corpus.pkl")


# Document Term Matrix

For many of the techniques that we'll use in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.
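
To make the row and column layout concrete, here is a hand-rolled sketch with `collections.Counter` on two made-up documents. It only illustrates the structure of a document-term matrix; it is not what CountVectorizer does internally, and `docs`, `vocab` and `matrix` are hypothetical names.

```python
from collections import Counter

# Two tiny made-up documents, one per speaker
docs = {
    "Donald Trump": "win election win",
    "Joe Biden": "people vote election",
}

# Columns: the sorted vocabulary across all documents
vocab = sorted({word for text in docs.values() for word in text.split()})

# Rows: one list of word counts per document, aligned with vocab
matrix = {
    name: [Counter(text.split())[word] for word in vocab]
    for name, text in docs.items()
}

print(vocab)
for name, counts in matrix.items():
    print(name, counts)
```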


In [None]:
# first_data 
# We are going to create a document-term matrix using CountVectorizer
cv = CountVectorizer()

first_data_cv = cv.fit_transform(first_data['clean_text'])
first_data_dtm = pd.DataFrame(first_data_cv.toarray(), columns=cv.get_feature_names())
first_data_dtm.index = first_data.index
first_data_dtm

Unnamed: 0,ability,able,abolish,abraham,absolutely,absorb,abuse,academic,accept,accompany,accomplish,accord,accountable,acknowledge,acre,across,act,actually,add,addition,additional,address,administration,admission,admit,advantage,advisor,affect,affidavit,afford,affordable,afraid,african,africanamerican,africanamericans,agency,ago,agree,ahead,air,...,whatsoever,wherewithal,whether,whichever,whistle,white,whole,wide,wife,willing,win,wing,winner,wipe,wishful,without,woman,womens,wonder,word,work,worker,workforce,world,worried,worth,would,wrap,write,wrong,wuhan,xenophobic,xi,yapping,yeah,year,yes,york,young,zero
Donald Trump,0,1,0,0,3,0,0,1,1,0,0,3,0,0,1,0,1,1,0,1,0,0,4,0,0,0,0,0,0,1,0,1,0,2,0,0,3,7,5,4,...,1,0,0,0,0,1,5,1,1,1,9,4,0,0,0,0,0,0,1,7,2,0,1,1,0,0,48,1,0,10,0,2,0,0,2,26,5,4,3,0
Moderator,0,1,1,1,0,0,1,0,0,0,0,1,1,0,1,0,0,1,3,0,0,1,2,1,0,0,1,2,0,0,0,0,0,0,0,1,1,10,16,0,...,0,0,2,1,0,3,0,0,0,2,0,1,2,0,0,0,0,0,0,2,0,1,0,0,2,2,20,0,0,2,0,0,0,0,1,20,3,0,0,1
Joe Biden,2,17,0,0,3,1,0,0,5,1,1,3,3,2,0,1,6,1,0,2,1,0,6,0,1,2,0,0,1,0,5,1,2,1,2,0,0,0,0,0,...,0,1,1,0,2,4,4,0,0,0,4,0,1,3,1,1,4,1,0,2,8,0,0,5,2,1,18,0,2,6,1,0,2,1,5,9,5,0,1,1


In [None]:
# Let's pickle it for later use
first_data_dtm.to_pickle("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/first_data_dtm.pkl")

In [None]:
# second_data 
# We are going to create a document-term matrix using CountVectorizer
cv = CountVectorizer()

second_data_cv = cv.fit_transform(second_data['clean_text'])
second_data_dtm = pd.DataFrame(second_data_cv.toarray(), columns=cv.get_feature_names())
second_data_dtm.index = second_data.index
second_data_dtm

Unnamed: 0,abide,ability,able,abraham,abroad,absolutely,abuse,access,accord,account,accountant,accumulate,accurate,accuse,across,act,action,activity,actually,actuary,addition,address,administration,advance,adversary,advisor,advocate,affect,affordable,afghanistan,africanamerican,agent,ago,agree,ahead,air,alabama,alcohol,allow,ally,...,wilmington,win,wind,windmill,window,windshield,winter,wiper,witch,withhold,within,without,woman,wonder,wonderful,word,work,worker,world,worldwide,worried,worry,worth,would,wrap,write,wrong,wuhan,xenophobia,xenophobic,xi,yeah,year,yes,yesterday,yet,york,young,zero,zone
Donald Trump,0,0,4,6,0,1,2,0,1,4,1,0,1,0,0,0,3,0,1,0,0,0,2,1,0,0,0,0,0,0,0,1,14,0,1,4,1,1,2,0,...,0,3,2,2,4,0,1,0,1,0,2,3,1,0,1,2,9,0,10,3,0,2,0,34,1,2,2,0,0,2,0,1,35,2,0,2,7,3,0,4
Moderator,0,0,2,0,1,0,1,0,0,1,2,0,0,1,0,1,0,0,0,0,0,3,7,0,1,1,1,0,1,0,0,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,4,1,1,0,0,1,3,0,0,0,0,2,0,21,0,0,0,0,0,0,0,0,1,1,1,1,0,2,1,0
Joe Biden,1,2,16,2,0,1,0,3,2,1,0,1,0,2,5,5,0,1,2,1,1,0,2,0,0,2,0,1,4,1,1,0,0,2,0,2,0,1,4,1,...,1,1,3,1,1,1,2,1,0,1,3,0,1,1,0,2,4,1,6,0,1,8,1,21,0,1,3,1,1,1,1,0,16,2,0,1,2,2,3,2


In [None]:
# Let's pickle it for later use
second_data_dtm.to_pickle("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/second_data_dtm.pkl")

In [None]:
# cv was re-created above, so this saves the vectorizer fitted on the second debate
with open("/content/drive/MyDrive/Data Science/us election presidential debates/pickles/cv.pkl", "wb") as f:
    pickle.dump(cv, f)