# What's in this notebook?
Here I dive into attempting to label the type of speaker, such as speeches given by Barack Obama labeled as 'politician', so that statistical analysis can be done on if there is any statistically significant differences between speeches given by politicians versus celebrities versus college faculty, etc.

In [1]:
# load data
import pickle

with open('processed_transcripts.pickle', 'rb') as f:
    # The protocol version used is detected automatically, so we do not
    # have to specify it.
    data = pickle.load(f)

In [10]:
import pandas as pd
import numpy as np

In [5]:
data.head()

Unnamed: 0,source,transcript,processed
0,https://www.youtube.com/watch?v=bPv21OyQLkM,I'm pleased to welcome to the platform miss Ca...,"[pleased, welcome, platform, miss, carlton, fi..."
1,https://www.youtube.com/watch?v=ngzIkKtjT6o,author Tom Wolfe addressed the graduating clas...,"[author, tom, wolfe, address, graduate, class,..."
2,https://www.youtube.com/watch?v=y5YvCbOmNxQ,ladies and gentlemen dr. Fred Rogers Wow it's ...,"[lady, gentlemen, dr, fred, rogers, wow, beaut..."
3,https://www.youtube.com/watch?v=Q34H3As2QJA,I'd like to tell you too true stories evening ...,"[like, tell, true, story, evening, together, m..."
4,https://www.youtube.com/watch?v=abo-YcLrnao,thank you and I'm a doctor so pay attention fo...,"[thank, doctor, pay, attention, cry, loud, tha..."


In [8]:
data.apply(lambda x: x.source[0:30], axis=1).value_counts()

https://www.youtube.com/watch?    786
https://www.presidency.ucsb.ed    152
https://www.graduationwisdom.c    141
http://www.humanity.org/voices     18
https://www.americanrhetoric.c     14
https://www.wellesley.edu/even      8
https://www.berklee.edu/commen      6
http://graduationwisdom.com/sp      2
https://news.syr.edu/blog/2012      1
http://gos.sbc.edu/k/khouri.ht      1
dtype: int64

In [56]:
data.index = range(len(data))

## The easy ones
There are about 150 speeches delivered by presidents/vice presidents that I scraped from the American Presidency Project. Because I retained all the sources, I can easily tag all of these as 'politician'.

In [57]:
data['given_by'] = data.apply(lambda x: 'politician' if 'https://www.presidency.ucsb.ed' in x.source else np.nan, axis=1)

## Hand labeling 
Some of the sources are just so infrequent (ie, americanrhetoric.com or berklee.edu) that it's less time consuming just to glance at the speeches or sources and tag them myself, based on their wikipage. I also did this to learn how wikipedia works, so that I could best figure out how to label _not_ by hand when I looked into youtube descriptions (see YouTube section).

In [58]:
for source in data[data.source.apply(lambda x: x[0:30] in ['http://www.humanity.org/voices',
                           'https://www.americanrhetoric.c',
                           'https://www.wellesley.edu/even',
                           'https://www.berklee.edu/commen',
                           'http://graduationwisdom.com/sp',
                           'https://news.syr.edu/blog/2012',
                           'http://gos.sbc.edu/k/khouri.ht'])].source:
    print(source)

http://graduationwisdom.com/speeches/0103-schmidt-inspirational-speech.htm
http://graduationwisdom.com/speeches/0100-sandberg_commencement.htm
https://www.americanrhetoric.com/speeches/ahmedzewailcaltechcommencement.htm
https://www.americanrhetoric.com/speeches/alexandersolzhenitsynharvard.htm
https://www.americanrhetoric.com/speeches/angelamerkelharvardcommencementenglish.htm
https://www.americanrhetoric.com/speeches/barbarabushwellesleycommencement.htm
https://www.americanrhetoric.com/speeches/carlschrammcommencementuil.htm
https://www.americanrhetoric.com/speeches/charleswcolsongeneva.htm
https://www.americanrhetoric.com/speeches/davidmcculloughwellesleyhighschoolcommencement.htm
https://www.americanrhetoric.com/speeches/decaroldavisuscgcommencement.htm
https://www.americanrhetoric.com/speeches/johnrobertscardigancommencement.htm
https://www.americanrhetoric.com/speeches/jonstewartcommencementwilliam&mary.htm
https://www.americanrhetoric.com/speeches/michelleobamatuskegeecommencemen

In [59]:
handlabeled = ['faculty', 'business', 'academic', 'author', 'politician', 'politician', 
    'academic', 'politician', 'faculty', 'student', 'politician', 'celebrity', 
     'politician', 'politician', 'journalist', 'author', 'celebrity', 'celebrity', 
     'politician', 'author', 'politician', 'journalist', 'faculty', 'author', 
     'politician', 'journalist', 'academic', 'author', 'author', 'politician', 
     'journalist', 'journalist', 'politician', 'academic','author', 'author', 
     'journalist', 'author', 'author', 'celebrity', 'author', 
     'author', 'politician', 'journalist', 'celebrity', 'celebrity', 'celebrity', 
     'celebrity', 'celebrity', 'celebrity']

In [61]:
i = 0
for index in data[data.source.apply(lambda x: x[0:30] in ['http://www.humanity.org/voices',
                           'https://www.americanrhetoric.c',
                           'https://www.wellesley.edu/even',
                           'https://www.berklee.edu/commen',
                           'http://graduationwisdom.com/sp',
                           'https://news.syr.edu/blog/2012',
                           'http://gos.sbc.edu/k/khouri.ht'])].index:
    data.iloc[index].given_by = handlabeled[i]
    i+=1

## Graduationwisdom.com labeling
This site lists the title of the speakers under their name on the webpage with the transcript. I'll scrape those pages to get the title, and then categorize accordingly. 

In [219]:
import requests
from bs4 import BeautifulSoup 

given_by_info = [] # name, title
for source in data[data.source.apply(lambda x: x[0:30] == 'https://www.graduationwisdom.c')].source:
    r = requests.get(source)
    soup = BeautifulSoup(r.text, 'html.parser')
    info = []
    for x in soup.find_all('div', {'id':'main-text'})[0]:
        try:
            info.append(x.text)
        except:
            pass
    try:
        given_by_info.append((info[1], info[2]))
    except:
        print(source)
    time.sleep(1)

In [470]:
role = {'statesman': 'politician',
       'congressman': 'politician',
        'congresswoman': 'politician',
        'speechwriter':'journalist',
        'politician':'politician',
        'senator': 'politician',
        'student': 'student',
       'entrepreneur': 'business',
        'businessperson':'business',
       'owner': 'business',
       'executive':'business',
       'billionaire': 'business',
       'ceo': 'business',
       'cfo':'business',
       'coo':'business',
       'cio':'business',
        'Chairman': 'business',
        'founder': 'business',
        'businessman':'business',
        'businesswoman': 'business',
       'musician': 'celebrity',
       'comedian': 'celebrity',
       'actress': 'celebrity',
        'filmmaker': 'celebrity',
        'rapper': 'celebrity',
        'celebrity':'celebrity',
       'actor':'celebrity',
       'singer-songwriter':'celebrity',
       'singer':'celebrity',
       'screenwriter':'celebrity',
       'director':'celebrity',
       'producer':'celebrity',
       'professor': 'faculty',
        'teacher': 'faculty',
        'host': 'celebrity',
        'faculty':'faculty',
       'scientist': 'academic',
       'biologist':'academic',
       'historian':'academic',
        'philosopher':'academic',
        'novelist': 'author',
        'writer':'author',
        'poet': 'author',
        'author':'author',
        'essayist':'author',
        'cartoonist': 'author',
        'fiction':'author',
        'newspaper editor': 'journalist',
        'columnist': 'journalist',
        'journalist': 'journalist',
        'editorial': 'journalist',
        'commentator':'journalist', 
        'soccer': 'athlete', 
        'football': 'athlete',
        'swimming': 'athlete',
        'runner': 'athlete',
        'media': 'business', 
        'creator': 'celebrity',
        'co-founder': 'business',
        'investor': 'business', 
        'mayor': 'politician',
        'governor':'politician',
        'representative': 'politician', 
        'correspondent': 'journalist',
        'dean': 'faculty',
        'broadcaster': 'journalist',
        'activist': 'activist',
        'co-founded':'business',
        'moderator':'journalist',
        'lawyer':'lawyer',
        'justice':'lawyer',
        'astrophysicist': 'academic',
        'astrophysics':'academic',
        'anchor': 'journalist',
        'news':'journalist',
        'editor':'journalist',
        'fashion':'business',
        'attorney':'lawyer',
        'nfl':'athlete',
        'coach':'athlete',
        'sports':'athlete',
        'advocate':'activist',
        'quarterback':'athlete',
        'councilman':'politician',
        'councilwoman':'politician',
        'sportscaster':'journalist',
        'epidemiologist':'academic',
        'reporter':'journalist',
        'rev':'religious',
        'television':'celebrity',
        'tennis':'athlete',
        'book':'author'
       }

In [471]:
given_by_grad_wisdom = []
for speaker in given_by_info:
    assigned=False
    temp_speaker = speaker[1]
    for punc in ['"', "\n", ",", "'", ':', ';', '/']:
        temp_speaker = temp_speaker.replace(punc, ' ').lower()
    for word in temp_speaker.split():
        if word in role.keys():
            given_by_grad_wisdom.append(role[word])
            assigned=True
            break
    if not assigned:
        given_by_grad_wisdom.append(np.nan)

In [472]:
i = 0
for index in data[data.source.apply(lambda x: x[0:30] == 'https://www.graduationwisdom.c')].index:
    data.iloc[index].given_by = given_by_grad_wisdom[i]
    i+=1

## Youtube
This is gonna be a little harder, but not the worst thing in the world -- I'll try and get the titles and descriptions of all the youtube videos (all rows with sources starting with "https://www.youtube.com/watch?"), and use some simple rules to determine the easily identifiable roles (such as "obama" --> politician or "moderator" --> journalist).

In [297]:
from tqdm import tqdm_notebook

In [300]:
yt_titles = []
yt_descriptions = []

for source in tqdm_notebook(data[data.source.apply(lambda x: x[0:30] == 'https://www.youtube.com/watch?')].source):
    r = requests.get(source)
    soup = BeautifulSoup(r.text, 'html.parser')
    try:
        yt_titles.append(soup.find('h1', {'class': 'watch-title-container'}).find('span')['title'])
        
    except:
        print('title issue', source)
    try:
        yt_descriptions.append(soup.find('meta', {'itemprop':'description'})['content'])
        
    except:
        print('description issue', source)
    time.sleep(.5)

HBox(children=(IntProgress(value=0, max=786), HTML(value='')))

title issue https://www.youtube.com/watch?v=SAJH03U7aHM
description issue https://www.youtube.com/watch?v=SAJH03U7aHM
title issue https://www.youtube.com/watch?v=w72F6qzkBNI
description issue https://www.youtube.com/watch?v=w72F6qzkBNI
title issue https://www.youtube.com/watch?v=SJxv3KYnuaI
description issue https://www.youtube.com/watch?v=SJxv3KYnuaI
title issue https://www.youtube.com/watch?v=4JCR4NW1Px4
description issue https://www.youtube.com/watch?v=4JCR4NW1Px4



In [386]:
# I forgot to insert a placeholder for the pages that had issues! Doing that here.
# get correct index
data[(data.source == 'https://www.youtube.com/watch?v=SAJH03U7aHM') | 
    (data.source == 'https://www.youtube.com/watch?v=w72F6qzkBNI')|
    (data.source == 'https://www.youtube.com/watch?v=SJxv3KYnuaI')|
    (data.source == 'https://www.youtube.com/watch?v=4JCR4NW1Px4')].index

yt_titles.insert(23,'Toni Morrison: College Commencement Address (2004 Speech to Students)')
yt_descriptions.insert(23, 'Toni Morrison (born Chloe Ardelia Wofford; February 18, 1931) is an American novelist, editor, and professor. Her novels are known for their epic themes, vivid dialogue, and richly detailed characters.')
yt_titles.insert(231, '2009 Webster University Commencement Address')
yt_descriptions.insert(231,'Tom Willis delivers commencement address at Webster University graduation 2009.')
yt_titles.insert(320,'John Ortberg | 2010 Undergraduate Commencement Address')
yt_descriptions.insert(320,'May 9, 2010')
yt_titles.insert(461, 'Dr. Mehmet Oz, Keynote Speaker | Wharton MBA Graduation Ceremony 2012')
yt_descriptions.insert(461, 'Wharton address the MBA Class of 2012')


In [439]:
yt_role = []

i = 0
for description in yt_descriptions:
    for punc in [',', '.', '?', '!', ':', '-', ';', '_', '/']:
        description = description.replace(punc, ' ')
    description = description.lower()
    roles = []
    for word in description.split():
        if word in role.keys():
            roles.append(role[word])
    yt_role.append(roles)

In [446]:
update_df = []
for role in yt_role:
    if len(role) > 0:
        update_df.append(role[0])
    else: 
        update_df.append(np.nan)

In [447]:
for i in data[data.source.apply(lambda x: x[0:30] == 'https://www.youtube.com/watch?')].index:
    data.given_by.iloc[i] = update_df[i]

In [454]:
for i in range(len(update_df)):
    if type(update_df[i]) == type(np.nan):
        print(yt_descriptions[i])
        print()
        print(yt_titles[i])
        print('---------------------------------------------------------------')

http://www.stvincent.edu | Fred Rogers addresses the graduates of Saint Vincent College at its 2000 Commencement, emphasizing the importance of honoring the ...

Fred Rogers' Commencement Address - 2000
---------------------------------------------------------------
Josh Rubman Oration Dr. Marc Lewis University of Texas Commencement Address 2000

Josh Rubman "University of Texas Commencement Address 2000"
---------------------------------------------------------------
Paul Reiser '77, best known for this starring role on the NBC sitcom Mad About You, was invited to the Harpur College Commencement Ceremonies in 2000 where h...

Paul Reiser '77 Commencement Address to the Class of 2000
---------------------------------------------------------------
Walter Issaacson delivers the Tulane University Class of 2000 Commencement Address. May 20, 2000

Tulane 2000 Commencement Address- Walter Issaacson
---------------------------------------------------------------
President Russell K. Osgood's 

In [505]:
roles2 = {'staring':'celebrity',
         'bush':'politician',
         'clinton':'politician',
         'obama':'politician',
         'bono':'celebrity',
         'military':'military',
         'comedy':'celebrity',
         'mcculough':'academic',
         'science':'academic',
         'television':'celebrity',
         'role':'celebrity',
         'gates':'business',
         'oprah':'business', 
         'bloomberg':'politician',
         'playwright':'author',
         'commentators':'journalist',
         'dr':'academic',
         'chancellor':'politician',
         'pastor':'religious',
         'naval':'military',
         'investors':'business',
         'representative':'politician',
         'rep': 'politician',
         'reverend':'religious',
         'admiral':'military',
         'booker': 'politician',
         'entertainer':'celebrity',
         'archbishop':'religious',
         'justice': 'activist'}

In [509]:
import string
for i in range(len(update_df)):
    # first check title
    if type(update_df[i]) == type(np.nan):
        desc = yt_titles[i][:350].lower()
        for punc in string.punctuation:
            desc.replace(punc, ' ')
        for word in desc.split():
            if word in roles2:
                update_df[i] = roles2[word]
                break
            if word in roles:
                update_df[i] = roles[word]
                break
    # then check description
    if type(update_df[i]) == type(np.nan):
        desc = yt_descriptions[i][:350].lower()
        for punc in string.punctuation:
            desc.replace(punc, ' ')
        for word in desc.split():
            if word in roles2:
                update_df[i] = roles2[word]
                break
            if word in roles:
                update_df[i] = roles[word]
                break
    #then check the beginning of the transcript (for intros)
    if type(update_df[i]) == type(np.nan):
        desc = data.transcript[i][:350].lower()
        for punc in string.punctuation:
            desc.replace(punc, ' ')
        for word in desc.split():
            if word in roles2:
                update_df[i] = roles2[word]
                break
            if word in roles:
                update_df[i] = roles[word]
                break

In [510]:
for i in data[data.source.apply(lambda x: x[0:30] == 'https://www.youtube.com/watch?')].index:
    data.given_by.iloc[i] = update_df[i]

In [529]:
data.given_by.value_counts()

politician    333
celebrity     247
business      158
author        100
academic       76
journalist     64
faculty        45
military       23
athlete        20
religious      19
student        14
lawyer         12
activist        9
artist          5
hournalist      1
facilty         1
other           1
coach           1
Name: given_by, dtype: int64

In [526]:
# hand label the remaining
for i in data[data.given_by.isnull()].index:
    print(i)
    print(data.source.iloc[i])
    print(data.transcript.iloc[i][:350])
    r = requests.get(data.source.iloc[i])
    soup = BeautifulSoup(r.text, 'html.parser')
    try:
        print(soup.find('h1', {'class': 'watch-title-container'}).find('span')['title'])
        
    except:
        print('title issue', source)
    try:
        print(soup.find('meta', {'itemprop':'description'})['content'])
        
    except:
        print('description issue', source)
    data.given_by.iloc[i] = input()
    print()

921
https://www.graduationwisdom.com/speeches/0014-jobs.htm
 "Remembering that you are going to die is the best way I know to avoid the trap of thinking you have something to lose. You are  already naked. There is no reason not to follow your heart."I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is
title issue https://www.youtube.com/watch?v=dqbUuYKip-k
description issue https://www.youtube.com/watch?v=dqbUuYKip-k
business

922
https://www.graduationwisdom.com/speeches/0013-lewis.htm
   "The way to be happy is to like yourself and the way to like yourself is to do only things that make you proud." TRANSCRIPT   I want to tell you three true stories this evening. Together they make a point that I consider one of the great secrets of life and I hope you’ll remember these stories, because I promise you that you’ll need them at some
title issue https://www.youtube.com/watch?v=dqbUuY

In [536]:
# fix some typos in the labeling, and combine some of the smaller categories into one bigger one
fixes = {'coach': 'athlete',
        'facilty': 'faculty',
        'hournalist': 'journalist',
        'artist':'academic',
        'activist':'academic',
        'lawyer':'academic',
        }
data['given_by'] = data['given_by'].apply(lambda x: fixes[x] if x in fixes.keys() else x)

In [537]:
# save our work
with open('labeled_data.pickle', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)