In [2]:
import pandas as pd
import re
debates = pd.read_csv('./debates.csv')

In [13]:
# Looking at summary stats about the size of unknown speaker dialog
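# Non-capturing group: str.contains warns when the pattern has match groups
speaker_or_moderator = re.compile(r'(?:Speaker|Moderator)')
unknown = debates['speaker'].str.contains(speaker_or_moderator)
unknowns_only = debates[unknown]
# Word count per line of dialog (whitespace-separated tokens)
words_in_unknown = unknowns_only['dialog'].str.split().str.len()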
words_in_unknown.describe()

Rows before removing unknown speakers: 6286


count    210.000000
mean      15.519048
std       25.732375
min        0.000000
25%        3.000000
50%        5.000000
75%       15.750000
max      174.000000
Name: dialog, dtype: float64

In [12]:
# comparing to size of known speaker dialog
debate_without_unknowns = debates[~unknown]
words_in_known = debate_without_unknowns['dialog'].str.split().str.len()
words_in_known.describe()

count    6076.000000
mean       47.514319
std        50.714123
min         0.000000
25%         5.000000
50%        24.000000
75%        81.000000
max       304.000000
Name: dialog, dtype: float64

# Unknown Cleanup
Taking a look at the data immediately after scraping, one problem stood out right away: not all of the speakers could be identified. Almost all of them were, but some were marked with `"Speaker <number>"` or `"Moderator <number>"`. While this could have been fixed with a careful examination of the original videos alongside the transcripts, that would have been incredibly time-consuming for 210 rows.

Looking at summary statistics for the number of words in dialog by unknown vs. known speakers, we can see that the mean dialog (pun intended) spoken by an unknown speaker is about one third the length of that of a known speaker (~15.52 words versus ~47.51). It is also more tightly clustered in absolute terms (with an SD of ~25.73 versus the SD of ~50.71 for known speakers), and the median unknown line is only 5 words long, compared with 24 for known speakers.
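
As a quick sanity check on this comparison, the ratios can be recomputed from the two `describe()` outputs. This is only a sketch: the numbers are copied by hand from the outputs above, and the dictionary and variable names are illustrative rather than part of the original notebook.

```python
# Summary values copied by hand from the describe() outputs above
unknown_stats = {"count": 210, "mean": 15.519048, "std": 25.732375, "median": 5.0}
known_stats = {"count": 6076, "mean": 47.514319, "std": 50.714123, "median": 24.0}

# How long is a typical unknown-speaker line relative to a known-speaker line?
mean_ratio = unknown_stats["mean"] / known_stats["mean"]
std_ratio = unknown_stats["std"] / known_stats["std"]
median_ratio = unknown_stats["median"] / known_stats["median"]

print(f"mean ratio:   {mean_ratio:.2f}")    # 0.33, i.e. about one third
print(f"SD ratio:     {std_ratio:.2f}")     # 0.51
print(f"median ratio: {median_ratio:.2f}")  # 0.21
```

The median ratio is smaller than the mean ratio because a handful of long unknown lines (up to 174 words) pull the unknown mean upward.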

Because of the relatively small number of unknown lines (210 of 6,286 rows, or ~3.34% of the total) and the small size of those lines, we are simply removing the rows with unknown speakers. After doing so, our next step will be to clean up the names of the remaining speakers.
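
The removal step itself can be sketched as follows. To keep the example self-contained it runs on a small made-up frame (`toy`) rather than the real `debates` data; the rows, the `is_unknown` mask name, and the `cleaned` frame are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the scraped transcripts; the notebook itself
# loads the real data from debates.csv.
toy = pd.DataFrame({
    "speaker": ["Joe Biden", "Speaker 3", "Moderator 1", "Kamala Harris", "Crowd"],
    "dialog": ["Thank you.", "Yes.", "Next question.",
               "I would like to respond.", "(applause)"],
})

# Flag rows whose speaker label contains "Speaker" or "Moderator".
# The non-capturing group avoids the pandas warning about match groups.
is_unknown = toy["speaker"].str.contains(r"(?:Speaker|Moderator)", regex=True)

print(f"Rows before removing unknown speakers: {len(toy)}")
print(f"Share of unknown rows: {is_unknown.mean():.2%}")

# Keep only identified speakers and renumber the index.
cleaned = toy[~is_unknown].reset_index(drop=True)
print(f"Rows after removing unknown speakers: {len(cleaned)}")
```

Applied to the real data, the same mask is the `unknown` series computed earlier, so the filtered frame corresponds to `debates[~unknown]`.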

In [3]:

print(debates['speaker'].unique())

['Lester Holt' 'Savannah G.' 'Senator Warren' 'Amy Klobachar'
 'Beto O’Rourke' 'Cory Booker' 'Julian Castro' 'Tulsi Gabbard' 'Jose D.B.'
 'Mayor de Blasio' 'John Delaney' 'Jay Inslee' 'Tim Ryan' 'Jose'
 'Bill De Blasio' 'Savannah' 'Amy Klobuchar' 'Steve Kornacki'
 'Rachel Maddow' 'Chuck Todd' 'Elizabeth W.' 'Jose D. B.' 'Savanagh G.'
 'Bernie Sanders' 'Bennett' 'Joe Biden' 'Kamala Harris' 'John H.'
 'Kirsten G.' 'Pete Buttigieg' 'Eric Stalwell' 'Andrew Yang'
 'Eric Swalwell' 'M. Williamson' 'Marianne W.' 'Michael Bennet'
 'Jake Tapper' 'Diana' 'Steve Bullock' 'Dana Bash' 'Ms. Williamson'
 'Don Lemon' 'John H' 'Elizabeth W' 'Female' 'Male' 'Mayor Buttigieg'
 'Elizabeth Warre' 'John Hickenloop' 'Marianne Willia' 'J. Hickenlooper'
 'E. Warren' 'Anderson Cooper' 'John King' 'N. Henderson' 'Bill de Blasio'
 'Crowd' 'Yang' 'Kristen Gillibr' 'Gillibrand' 'Senator Bennet'
 'George S.' 'Voiceover' 'Jorge Ramos' 'David Muir' 'George S'
 'Sen Klobuchar' 'Sec. Castro' 'David ' 'Lindsey' 'Erin Burn