# Problem Statement

The goal is to compare transcripts from various comedians and note their similarities and differences, with a focus on Russell Peters, the comedian who got me interested in stand-up comedy.

# Data Preparation

## Scraping the Data
We will use stand-up comedy transcripts from [Scraps From The Loft](http://scrapsfromtheloft.com/).

To decide which comedians to include, we check IMDb for comedy specials released in the past 6 years with a rating of at least 7.5/10 and more than 1000 votes, then pick the comedians with the most highly rated specials. I also include the Russell Peters Outsourced (2016) transcript, one of his specials that I really like, so I can compare his style with the others and look for comedians who are similar to him. (The IMDb search was done manually via Google Search.)

After curating the URL list, we scrape each transcript and save it to its own file.

In [1]:
import requests
from bs4 import BeautifulSoup
import pickle
import time

In [2]:
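# Scrapes the transcript data from scrapsfromtheloft.com
def transcript_url(url):
    '''Return the transcript at `url` as a list of paragraph strings.'''
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, 'lxml')
    text = [p.text for p in soup.find(class_='post-content').find_all('p')]
    print(f"Scraped {url}")
    time.sleep(0.2)  # short pause between requests to go easy on the server
    return text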

In [3]:
# URLs
urls = ['http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
        'http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2018/07/02/russell-peters-outsourced-transcript/']

In [4]:
# Comedian Names
comedians = ['louis_ck', 'dave_chapelle','ricky_gervais','bo_burnham', 'bill_burr', 'jim_jefferies', 'john_mulaney',
            'hasan_minhaj', 'ali_wong', 'anthony_jeselnik','mike_birbiglia','joe_rogan','russell_peters']

In [5]:
# Request Transcripts
# transcripts = [transcript_url(url) for url in urls]

In [6]:
# test the transcripts
# len(transcripts)

In [7]:
# Pickle the files

# Make a new directory to hold pickled files
# !mkdir transcripts

# for i, c in enumerate(comedians):
#     with open('transcripts/' + c + '.txt', 'wb') as file:
#         pickle.dump(transcripts[i], file)

In [8]:
# Load pickled files
data = {}
for c in comedians:
    with open('transcripts/' + c + '.txt', 'rb') as file:
        data[c] = pickle.load(file)
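
The save and load cells above amount to a pickle round trip with one file per comedian. The sketch below shows that pattern in isolation. It assumes a temporary directory and two made-up transcripts (`names` and `transcripts` are hypothetical stand-ins for the scraped data), so it runs without network access, and it pairs names with transcripts using `zip` instead of indexing by position.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the scraped data: each transcript is a list of paragraphs.
names = ['comedian_a', 'comedian_b']
transcripts = [['Thank you.', 'Good to be here.'], ['Hello, Chicago.']]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / 'transcripts'
    folder.mkdir()

    # Save: one pickle file per comedian.
    for name, transcript in zip(names, transcripts):
        with open(folder / f'{name}.txt', 'wb') as file:
            pickle.dump(transcript, file)

    # Load: rebuild a {name: transcript} dictionary from the files.
    loaded = {}
    for name in names:
        with open(folder / f'{name}.txt', 'rb') as file:
            loaded[name] = pickle.load(file)

print(loaded['comedian_b'])
```

Note that the files hold pickled Python lists rather than plain text, despite the `.txt` extension, so they have to be read back with `pickle.load`.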

In [9]:
# Checking the data
data.keys()

dict_keys(['louis_ck', 'dave_chapelle', 'ricky_gervais', 'bo_burnham', 'bill_burr', 'jim_jefferies', 'john_mulaney', 'hasan_minhaj', 'ali_wong', 'anthony_jeselnik', 'mike_birbiglia', 'joe_rogan', 'russell_peters'])

In [10]:
type(data['dave_chapelle'])

list

## Cleaning the Data

We will execute some common cleaning steps:

__Common data cleaning steps on all text:__

* Lowercase the text
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (e.g. `\n`)
* Tokenize the text
* Remove stop words

__More data cleaning steps after tokenization:__

* Stemming / lemmatization
* Part-of-speech tagging
* Create bi-grams or tri-grams
* Deal with typos
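
In this notebook the tokenization and stop-word steps are left to `CountVectorizer` later on. As a rough illustration of what those two steps do, here is a standard-library sketch. The stop-word set is a tiny made-up list (an assumption, far shorter than a real one), and the token pattern mirrors the `token_pattern` that `CountVectorizer` reports in its settings, which keeps only words of two or more characters.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {'the', 'a', 'to', 'and', 'you', 'for'}

def tokenize(text):
    '''Split already-cleaned, lowercase text into word tokens of two or more characters.'''
    return re.findall(r'\b\w\w+\b', text)

def remove_stop_words(tokens):
    '''Drop every token that appears in the stop-word set.'''
    return [token for token in tokens if token not in STOP_WORDS]

cleaned = 'ladies and gentlemen please welcome to the stage'
tokens = remove_stop_words(tokenize(cleaned))
print(tokens)
```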

In [11]:
# Each value in data is a list of paragraphs; join each list into a single string
def combine_text(list_text):
    text = ' '.join(list_text)
    return text

In [12]:
data_combined = {key: [combine_text(value)] for key, value in data.items()}

### Put the data into dataframe

In [13]:
import pandas as pd

pd.set_option('display.max_colwidth', 150)
df = pd.DataFrame.from_dict(data_combined).transpose()

In [14]:
df.columns = ['transcript']
df = df.sort_index()
df

Unnamed: 0,transcript
ali_wong,"Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ..."
anthony_jeselnik,"Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s..."
bill_burr,"[cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ..."
bo_burnham,Bo What? Old MacDonald had a farm E I E I O And on that farm he had a pig E I E I O Here a snort There a Old MacDonald had a farm E I E I O [Appla...
dave_chapelle,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ..."
hasan_minhaj,"[theme music: orchestral hip-hop] [crowd roars] What’s up? Davis, what’s up? I’m home. I had to bring it back here. Netflix said, “Where do you wa..."
jim_jefferies,"[Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello..."
joe_rogan,"[rock music playing] [audience cheering] [announcer] Ladies and gentlemen, welcome Joe Rogan. [audience cheering and applauding] What the fuck is ..."
john_mulaney,"All right, Petunia. Wish me luck out there. You will die on August 7th, 2037. That’s pretty good. All right. Hello. Hello, Chicago. Nice to see yo..."
louis_ck,Intro\nFade the music out. Let’s roll. Hold there. Lights. Do the lights. Thank you. Thank you very much. I appreciate that. I don’t necessarily a...


In [15]:
df.transcript['russell_peters']

'Ladies and gentlemen, ladies and gentlemen, please, if you say that, gentlemen. My man, [MIXED] Russell Peters! Yeah, brothers know his name. Here he is, guys! Russell Peters! [HOUSE MUSIC CONTINUES] All right. How you doing? All right. All right, look at you, you filthy downloaders. Look at this audience, man. Everybody. This is cool, man. Everybody. We got– clearly we got some Asians in the house. That’s uh… I saw all the Honda Civics in the parking lot. I knew you were here. I thought they were shooting Fast and The Furious Part 3 or something. Oh, man, and then the brown bastards. Look at you, huh? All right. There’s a lot of closed motels in town right now, I tell you that. There’s uh… White people, how you doing? White folks, good to see you. All right, a white guy with a brown girl. Good job, buddy, huh? Her parents must be so happy. Ha ha. There’s a brown man with a white woman. Nice, see? Balance. That’s what I’m talking about. He’s living the American dream. Or at least the 

### Apply first round of text cleaning techniques

In [16]:
import re
import string

In [17]:
def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation, and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

In [18]:
round1 = clean_text_round1

In [19]:
df_clean = pd.DataFrame(df.transcript.apply(round1))
df_clean

Unnamed: 0,transcript
ali_wong,ladies and gentlemen please welcome to the stage ali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get thi...
anthony_jeselnik,thank you thank you thank you san francisco thank you so much so good to be here people were surprised when i told ’em i was gonna tape my special...
bill_burr,all right thank you thank you very much thank you thank you thank you how are you what’s going on thank you it’s a pleasure to be here in the gre...
bo_burnham,bo what old macdonald had a farm e i e i o and on that farm he had a pig e i e i o here a snort there a old macdonald had a farm e i e i o this i...
dave_chapelle,this is dave he tells dirty jokes for a living that stare is where most of his hard work happens it signifies a profound train of thought the alch...
hasan_minhaj,what’s up davis what’s up i’m home i had to bring it back here netflix said “where do you want to do the special la chicago new york” i was like...
jim_jefferies,ladies and gentlemen please welcome to the stage mr jim jefferies hello sit down sit down sit down sit down sit down thank you boston i appre...
joe_rogan,ladies and gentlemen welcome joe rogan what the fuck is going on san francisco thanks for coming i appreciate it god damn put your phone down ...
john_mulaney,all right petunia wish me luck out there you will die on august that’s pretty good all right hello hello chicago nice to see you again thank you...
louis_ck,intro\nfade the music out let’s roll hold there lights do the lights thank you thank you very much i appreciate that i don’t necessarily agree wit...
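
Curly quotes such as `’` and `“` survive the first round because `string.punctuation` contains only ASCII punctuation. The short sketch below shows this on a made-up sample sentence (the helper name `strip_ascii_punctuation` is hypothetical), which is why a second round is needed.

```python
import re
import string

def strip_ascii_punctuation(text):
    '''Remove only the ASCII punctuation characters listed in string.punctuation.'''
    return re.sub('[%s]' % re.escape(string.punctuation), '', text)

# Made-up sample mixing ASCII punctuation with typographic quotes and an ellipsis.
sample = "what’s up, davis? it's “home”…"
cleaned = strip_ascii_punctuation(sample)
print(cleaned)

# The typographic characters are simply not in the ASCII punctuation set.
print('’' in string.punctuation)
```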


### Apply second round of text cleaning

In [20]:
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text'''
    text = re.sub(r'[‘’“”…♪]', '', text)
    # Replace newlines with a space so the words on either side are not glued together
    text = re.sub(r'\n', ' ', text)
    text = text.strip()
    return text

In [21]:
df_clean = pd.DataFrame(df_clean.transcript.apply(lambda x: clean_text_round2(x)))

In [22]:
# Checking
df_clean.transcript['dave_chapelle']

'this is dave he tells dirty jokes for a living that stare is where most of his hard work happens it signifies a profound train of thought the alchemists fire that transforms fear and tragedy into levity and livelihood dave calls that look the trance  play me   buy me   workinonit   tune up   tune   oh   fade me   ahah ahah ahah   in every ghetto   ahah ahah ahah   in every ghetto   ahah ahah ahah   in every ghetto   ahah ahah ahah   in every ghetto   ahah ahah ahah   in every ghetto   ahah ahah ahah   in every ghetto   ahah ahah ahah  thank you thank you very much thank you all oh wow that was exciting wasnt it thank you guys have a seat feel comfortable relax i want to thank everyone in la for a wonderful week its been great here you know what its been ten years since the last time i played los angeles if you can imagine i know i know ive been gone for a very long time and unbeknownst to you it was a difficult ten years im not gonna take you through all the agony ive been through but

__Note:__ We will stop cleaning here for now. If the results of the analysis are unsatisfying, we can come back and refine the cleaning later.
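
One possible refinement, not applied in this notebook, is to collapse the runs of spaces left behind where bracketed cues and `♪` symbols were removed (visible in the output above). The sketch below is an assumption about how that could look; the helper name `collapse_whitespace` is hypothetical. Extra spaces do not affect word counts, so this step is cosmetic.

```python
import re

def collapse_whitespace(text):
    '''Replace every run of whitespace with a single space and trim the ends.'''
    return re.sub(r'\s+', ' ', text).strip()

# Fragment taken from the cleaned Dave Chappelle transcript shown above.
sample = 'look the trance  play me   buy me   workinonit'
result = collapse_whitespace(sample)
print(result)
```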

## Organizing the Data

We will organize the data in two standard text formats:

1. __Corpus__ - a collection of text

2. __Document-Term Matrix__ - word counts in matrix format
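
To make the second format concrete, here is a small standard-library sketch of a document-term matrix. It uses two made-up documents rather than the real transcripts and skips stop-word removal. Each row is a document, each column is a term from the sorted vocabulary, and each cell is the number of times that term occurs in that document.

```python
from collections import Counter

# Two made-up, already-cleaned documents (stand-ins for the transcripts).
docs = {
    'doc_a': 'thank you thank you very much',
    'doc_b': 'hello chicago thank you',
}

# Count the words in each document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary is every distinct word across all documents, sorted alphabetically.
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary term.
matrix = [[counts[name][term] for term in vocabulary] for name in docs]

print(vocabulary)
for name, row in zip(docs, matrix):
    print(name, row)
```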

### Corpus

The corpus already exists: it is the `df` dataframe of raw transcripts. We just add a full-name column to it, so we can use that in visualizations later on.

In [25]:
df

Unnamed: 0,transcript
ali_wong,"Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ..."
anthony_jeselnik,"Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s..."
bill_burr,"[cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ..."
bo_burnham,Bo What? Old MacDonald had a farm E I E I O And on that farm he had a pig E I E I O Here a snort There a Old MacDonald had a farm E I E I O [Appla...
dave_chapelle,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ..."
hasan_minhaj,"[theme music: orchestral hip-hop] [crowd roars] What’s up? Davis, what’s up? I’m home. I had to bring it back here. Netflix said, “Where do you wa..."
jim_jefferies,"[Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello..."
joe_rogan,"[rock music playing] [audience cheering] [announcer] Ladies and gentlemen, welcome Joe Rogan. [audience cheering and applauding] What the fuck is ..."
john_mulaney,"All right, Petunia. Wish me luck out there. You will die on August 7th, 2037. That’s pretty good. All right. Hello. Hello, Chicago. Nice to see yo..."
louis_ck,Intro\nFade the music out. Let’s roll. Hold there. Lights. Do the lights. Thank you. Thank you very much. I appreciate that. I don’t necessarily a...


In [24]:
full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais', 'Russell Peters']

In [26]:
df['full_name'] = full_names

df

Unnamed: 0,transcript,full_name
ali_wong,"Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ...",Ali Wong
anthony_jeselnik,"Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s...",Anthony Jeselnik
bill_burr,"[cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ...",Bill Burr
bo_burnham,Bo What? Old MacDonald had a farm E I E I O And on that farm he had a pig E I E I O Here a snort There a Old MacDonald had a farm E I E I O [Appla...,Bo Burnham
dave_chapelle,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ...",Dave Chappelle
hasan_minhaj,"[theme music: orchestral hip-hop] [crowd roars] What’s up? Davis, what’s up? I’m home. I had to bring it back here. Netflix said, “Where do you wa...",Hasan Minhaj
jim_jefferies,"[Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello...",Jim Jefferies
joe_rogan,"[rock music playing] [audience cheering] [announcer] Ladies and gentlemen, welcome Joe Rogan. [audience cheering and applauding] What the fuck is ...",Joe Rogan
john_mulaney,"All right, Petunia. Wish me luck out there. You will die on August 7th, 2037. That’s pretty good. All right. Hello. Hello, Chicago. Nice to see yo...",John Mulaney
louis_ck,Intro\nFade the music out. Let’s roll. Hold there. Lights. Do the lights. Thank you. Thank you very much. I appreciate that. I don’t necessarily a...,Louis C.K.


In [27]:
# Let's pickle the corpus
df.to_pickle('corpus.pkl')

### Document-Term Matrix

Tokenization breaks the text down into individual words.
We create the document-term matrix with scikit-learn's `CountVectorizer`, which tokenizes each transcript and counts the words, and we exclude common English stop words.

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
df_cv = cv.fit_transform(df_clean.transcript)

In [30]:
df_cv.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [31]:
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
df_dtm = pd.DataFrame(df_cv.toarray(), columns=cv.get_feature_names_out())
df_dtm

Unnamed: 0,aaaaah,aaaaahhhhhhh,aaaaauuugghhhhhh,aaaahhhhh,aaah,aah,aahh,abc,abcs,ability,...,zee,zen,zeppelin,zero,zillion,zombie,zombies,zoning,zoo,éclair
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,1,0,...,0,0,0,1,1,1,1,1,0,0
3,0,1,1,1,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,2,1,0,1,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,0,0,0,0,0,3,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0


In [32]:
df_dtm.index = df_clean.index
df_dtm

Unnamed: 0,aaaaah,aaaaahhhhhhh,aaaaauuugghhhhhh,aaaahhhhh,aaah,aah,aahh,abc,abcs,ability,...,zee,zen,zeppelin,zero,zillion,zombie,zombies,zoning,zoo,éclair
ali_wong,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
anthony_jeselnik,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bill_burr,1,0,0,0,0,0,0,0,1,0,...,0,0,0,1,1,1,1,1,0,0
bo_burnham,0,1,1,1,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
dave_chapelle,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hasan_minhaj,0,0,0,0,0,0,0,0,0,0,...,2,1,0,1,0,0,0,0,0,0
jim_jefferies,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
joe_rogan,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
john_mulaney,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
louis_ck,0,0,0,0,0,3,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0


In [33]:
# pickle the document-term matrix, the cleaned dataframe, and the countvectorizer
df_dtm.to_pickle('dtm.pkl')
df_clean.to_pickle('df_clean.pkl')
with open('cv.pkl', 'wb') as file:  # the context manager closes the file after writing
    pickle.dump(cv, file)

In [34]:
# Check
cv

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)