In [1]:
import pandas as pd
import re
debates = pd.read_csv('./../data/raw_debates.csv')

In [2]:
# Looking at summary stats about the size of unknown speaker dialog
# (non-capturing group avoids pandas' "match groups" warning in str.contains)
speaker_or_moderator = re.compile(r'(?:Speaker|Moderator|Audience)')
unknown = debates['speaker'].str.contains(speaker_or_moderator)
unknowns_only = debates[unknown]
words_in_unknown = unknowns_only['dialog'].str.split().str.len()
words_in_unknown.describe()

count    206.000000
mean      15.834951
std       25.881756
min        1.000000
25%        3.000000
50%        5.000000
75%       16.000000
max      174.000000
Name: dialog, dtype: float64

In [3]:
# comparing to size of known speaker dialog
debate_without_unknowns = debates[~unknown]
words_in_known = debate_without_unknowns['dialog'].str.split().str.len()
words_in_known.describe()

count    6045.000000
mean       47.757486
std        50.729882
min         1.000000
25%         6.000000
50%        24.000000
75%        81.000000
max       304.000000
Name: dialog, dtype: float64

# Unknown Cleanup
Taking a look at the data immediately after scraping, there was one problem that immediately stood out: not all of the speakers could be identified. Almost all of them were, but some were marked with `"Speaker <number>"` or `"Moderator <number>"`. While this could have been fixed with a careful examination of the original videos alongside the transcripts, that would have been incredibly time-consuming for over 200 rows.

Looking at summary statistics for the number of words in dialog by unknown vs. known speakers, we can see that the mean dialog (pun intended) spoken by an unknown speaker is about one third the length of that of a known speaker (~15.83 words versus ~47.76), and the median is about one fifth (5 words versus 24). It is also more tightly clustered in absolute terms (with an SD of ~25.88 versus the SD of ~50.73 for known speakers).

Because of the relatively small number of unknown lines (~3% of the total), and the small size of those lines, we are simply removing the rows with unknown speakers. After doing so, our next step will be to clean up the names of the remaining speakers.
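
As a quick check on the ~3% figure, the share of unknown rows can be recomputed from the two `count` values reported above. The second half is a minimal sketch on a made-up four-row frame (the `toy` DataFrame and its contents are hypothetical, not rows from the real data) showing that the mean of a boolean mask gives the same kind of share directly.

```python
import pandas as pd

# Counts taken from the describe() outputs above
unknown_rows = 206
known_rows = 6045
share_unknown = unknown_rows / (unknown_rows + known_rows)
print(f"{share_unknown:.1%} of rows have an unknown speaker")

# Hypothetical toy frame: the mean of a boolean mask is the fraction of True rows
toy = pd.DataFrame({
    'speaker': ['Joe Biden', 'Speaker 3', 'Amy Klobuchar', 'Moderator 1'],
    'dialog': ['thank you', 'yes', 'well first the economy', 'thats time'],
})
toy_mask = toy['speaker'].str.contains(r'Speaker|Moderator|Audience')
toy_share = toy_mask.mean()
```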

In [4]:

# remove unknown speakers (copy so later column assignments don't act on a view)
debates = debate_without_unknowns.copy()
# see all names
unique_names = sorted(debates['speaker'].unique())
print(unique_names)

# write the names, one per line, for manual review
with open('./../data/names.csv', 'w') as text_file:
    text_file.write(',\n'.join(unique_names))



['A. Cooper', 'Abby P', 'Abby Phillip', 'Abby Phillips', 'Adam Sexton', 'Amna', 'Amna Nawaz', 'Amy Klobachar', 'Amy Klobuchar', 'Amy Langenfeld', 'Amy Walter', 'Anderson Cooper', 'Andrea Mitchell', 'Andrew Yang', 'Announcer', 'Ashley Parker', 'B. Pfannenstiel', 'Bennett', 'Bernie Sanders', 'Beto O’Rourke', 'Bill De Blasio', 'Bill Whitaker', 'Bill de Blasio', 'Brianne P', 'Brianne P.', 'Chuck Todd', 'Cory Booker', 'Crowd', 'Dana Bash', 'David', 'David Muir', 'Devin Dwyer', 'Diana', 'Don Lemon', 'Dr. Sanjay Gupta', 'E. Warren', 'Elizabeth W', 'Elizabeth W.', 'Elizabeth Warre', 'Elizabeth Warren', 'Eric Stalwell', 'Eric Swalwell', 'Erin Burnett', 'Female', 'Gayle King', 'George S', 'George S.', 'Gillibrand', 'Hallie Jackson', 'Helen', 'Ilia Calderón', 'J. Hickenlooper', 'Jake Tapper', 'Jay Inslee', 'Joe Biden', 'John Delaney', 'John H', 'John H.', 'John Hickenloop', 'John King', 'Jon Ralston', 'Jorge Ramos', 'Jose', 'Jose D. B.', 'Jose D.B.', 'Judy', 'Judy Woodruff', 'Julian Castro', 'Kam

# Normalizing Names
My next problem was the large variation in names. The same candidate could be referred to by a lot of different names. Some of these were as simple as extra spacing on the end (which I fixed in the data scraper), while others were more complicated.

For example, Elizabeth Warren appeared in the transcripts under 6 different variations of her name, e.g. "Elizabeth Warren,", "E. Warren", and "Senator Warren".

It would have been cool to do this programmatically. One solution I saw to a similar problem was to use k-means clustering by Levenshtein distance, but there were two problems with this approach. One, I would need to know how many clusters (i.e. people) I needed to find, and two, I'd need to take the time to implement that rather heavy way of doing things, in which time I could have done the one-time task several times over. With that in mind, I went through the unique list of names by hand and created a table that would tell me what actual names correspond to names in the transcripts.
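
To give a feel for what a programmatic pass might look like, here is a minimal sketch. It is not the clustering approach mentioned above: it swaps in the standard library's `difflib.SequenceMatcher` similarity ratio and a simple greedy grouping. The function name `group_similar_names` and the `0.8` threshold are assumptions made for illustration only.

```python
from difflib import SequenceMatcher

def group_similar_names(names, threshold=0.8):
    """Greedily put each name into the first group whose first member is similar enough."""
    groups = []  # each group is a list; its first element acts as the representative
    # Longest names first, so full names tend to become the representatives
    for name in sorted(names, key=len, reverse=True):
        for group in groups:
            ratio = SequenceMatcher(None, name.lower(), group[0].lower()).ratio()
            if ratio >= threshold:
                group.append(name)
                break
        else:
            # No existing group was close enough: start a new one
            groups.append([name])
    return groups

sample = ['Elizabeth Warren', 'Elizabeth Warre', 'Elizabeth W.',
          'E. Warren', 'Amy Klobuchar', 'Amy Klobachar']
groups = group_similar_names(sample)
for group in groups:
    print(group)
```

On this sample the sketch pairs up the truncation and the misspelling ("Elizabeth Warre", "Amy Klobachar") but leaves abbreviations such as "E. Warren" and "Elizabeth W." in groups of their own, which is part of why a hand-built table was the more reliable choice here.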

I also used that opportunity to denote which people were candidates, and which invalid entries remained after the validation in the scraping step, e.g. "Crowd", "Male", etc.

You can see the results of this process in `names_conversion.csv`.
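
For reference, here is a small sketch of how such a table can be laid out and split apart. Only the column names (`present_name`, `to_name`, `is_candidate`) come from the code in this notebook; the rows below are hypothetical stand-ins, and the `toy_*` names are made up for this example.

```python
import io
import pandas as pd

# Hypothetical rows: an empty to_name marks an invalid entry
toy_csv = io.StringIO(
    "present_name,to_name,is_candidate\n"
    "E. Warren,Elizabeth Warren,y\n"
    "Elizabeth Warre,Elizabeth Warren,y\n"
    "A. Cooper,Anderson Cooper,n\n"
    "Crowd,,\n"
)
toy_conv = pd.read_csv(toy_csv)

# Rows with a to_name are valid; the rest are entries to drop
valid = toy_conv.dropna(subset=['to_name'])
toy_name_map = dict(zip(valid['present_name'], valid['to_name']))
toy_candidate_map = dict(zip(valid['to_name'], valid['is_candidate'] == 'y'))
toy_invalid = toy_conv.loc[toy_conv['to_name'].isna(), 'present_name'].tolist()

print(toy_name_map)
print(toy_candidate_map)
print(toy_invalid)
```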



In [5]:
# create dicts for name conversion
name_conv = pd.read_csv('../data/names_conversion.csv')
conv_dict = name_conv.set_index('present_name').T.to_dict()

name_dict = {}
candidate_dict = {}
invalid_entries = []
# split csv into dicts and lists to clean different values
for k in conv_dict:
    v = conv_dict[k]
    # invalid values are nan (float), but valid ones are strings 
    if not isinstance(v['to_name'], str):
        invalid_entries.append(k)
        continue

    # turn non-standard names into standard names
    name_dict[k] = v['to_name']

    # determine if a speaker is a candidate
    candidate_dict[v['to_name']] = v['is_candidate'] == 'y'

# remove all invalid entries (copy so the column assignments below don't act on a view)
invalid_mask = debates['speaker'].isin(invalid_entries)
debates = debates[~invalid_mask].copy()

# standardize names
debates['speaker'] = debates['speaker'].map(name_dict)
# add candidate column
debates['is_candidate'] = debates['speaker'].map(candidate_dict)

print(debates.head(40))


speaker                                             dialog  \
0        Lester Holt  comment on every topic but over the course of ...   
1   Savannah Guthrie  all right so with that business out of the way...   
2   Elizabeth Warren                      thank you its good to be here   
3   Savannah Guthrie  you have many plans free college free childcar...   
4   Elizabeth Warren  i think of it this way who is this economy rea...   
5   Elizabeth Warren  its doing great for giant oil companies that w...   
6   Savannah Guthrie  senator klobuchar you called programs like fre...   
7      Amy Klobuchar  well first the economy we know that not everyo...   
8      Amy Klobuchar  secondly id use pell grants id double them fro...   
9   Savannah Guthrie  thats time thank you congressman orourke what ...   
10     Beto O’Rourke  this economy has got to work for everyone and ...   
11  Savannah Guthrie                                congressman orourke   
12     Beto O’Rourke  thats how we eac

# Conclusion

In this notebook we've taken the following steps to ensure that the data is as clean as possible:

- Removed initial invalid entries (i.e. entries with unknown speakers)
- Normalized names so that one speaker is always labelled with the same name (e.g. Michael Bloomberg is always "Michael Bloomberg" and never "Mike Bloomberg")
- Removed more unknown speakers who weren't uncovered in the initial steps
- Added information on who is a candidate and who isn't so that we can filter on that information later