# Exploratory Data Analysis

Until now, we have had no idea whether our efforts will be in vain. The analysis here should tell us, with reasonable confidence, whether the data is useful.

We do this by analysing the following:
1. **Most common words** - find these and create word clouds
2. **Size of vocabulary** - look at the number of unique words and also how quickly someone speaks
3. **Amount of profanity** - check how much profanity appears among the most common terms

## Load the cleaned data

In [14]:
import pandas as pd

data = pd.read_csv('saves/cleaned_transcripts_df.csv', index_col = 0)
data

Unnamed: 0,Transcript
Lousic C.K.,intro fade the music out let 's roll hold ther...
Dave Chappelle,this be dave he tell dirty joke for a living t...
Ricky Gervais,hello hello how you do great thank you wow cal...
Bo Burham,bo what old macdonald have a farm e I e I o an...
Bill Burr,all right thank you thank you very much thank ...
Jim Jefferies,lady and gentleman please welcome to the stage...
John Mulaney,armed with boyish charm and a sharp wit the fo...
Hasan Minhaj,what be up davis what be up I be home I have t...
Ali Wong,lady and gentleman please welcome to the stage...
Anthony Jeselnik,thank you thank you thank you san francisco th...


## Vectorize the data

In [40]:
# Imports
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Vectorize
vectorizer = CountVectorizer(stop_words = 'english')
vectorized_data = vectorizer.fit_transform(data['Transcript'])

# Convert to DataFrame
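# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available from 1.0)
vectorized_df = pd.DataFrame(vectorized_data.toarray(), columns = vectorizer.get_feature_names_out())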
vectorized_df.index = data.index
vectorized_df

Unnamed: 0,aaaaah,aaaaahhhhhhh,aaaaauuugghhhhhh,aaaahhhhh,aaah,aah,abc,ability,abject,able,...,zealand,zee,zen,zeppelin,zero,zillion,zombie,zone,zoo,éclair
Lousic C.K.,0,0,0,0,0,3,0,0,0,1,...,0,0,0,0,2,0,0,0,0,0
Dave Chappelle,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ricky Gervais,0,0,0,0,0,0,0,1,1,2,...,0,0,0,0,0,0,0,0,1,0
Bo Burham,0,1,1,1,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
Bill Burr,1,0,0,0,0,0,1,0,0,1,...,0,0,0,0,1,1,2,1,0,0
Jim Jefferies,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
John Mulaney,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,1
Hasan Minhaj,0,0,0,0,0,0,0,0,0,1,...,0,2,1,0,1,0,0,0,0,0
Ali Wong,0,0,0,0,0,0,1,0,0,2,...,0,0,0,0,0,0,1,0,0,0
Anthony Jeselnik,0,0,0,0,0,0,0,0,0,0,...,10,0,0,0,0,0,0,0,0,0
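
To make the shape of this table concrete, here is a minimal pure-Python sketch of a document-term count matrix built from two made-up transcripts. It is an illustration only, not scikit-learn's implementation: the speaker names, the texts, and the `build_count_matrix` helper are hypothetical, and tokenization is a plain lowercase whitespace split with no stop-word removal.

```python
from collections import Counter

def build_count_matrix(docs):
    """Return (vocabulary, rows), where rows[name][i] counts vocabulary[i] in that document."""
    # Count tokens per document (simple lowercase + whitespace split)
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    # The vocabulary is the sorted set of all tokens seen in any document
    vocabulary = sorted(set().union(*counts.values()))
    # One row per document, one column per vocabulary word
    rows = {name: [c[word] for word in vocabulary] for name, c in counts.items()}
    return vocabulary, rows

# Hypothetical toy transcripts, not taken from the real data
toy_docs = {
    "Speaker A": "thank you thank you very much",
    "Speaker B": "you tell joke joke joke",
}

vocabulary, rows = build_count_matrix(toy_docs)
print(vocabulary)
for name, row in rows.items():
    print(name, row)
```

Each row plays the role of one comedian and each column one word, which is the same layout as `vectorized_df` above.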


## Save vectorized data

In [42]:
vectorized_df.to_csv('saves/vectorized_transcripts_df.csv')

## Analysis

### Most Common Words

In [44]:
vectorized_df = vectorized_df.transpose()

In [108]:
top_words = {}

for column in vectorized_df.columns:
    # Keep the 30 most frequent words for each comedian
    tokens = vectorized_df[column].sort_values(ascending = False).head(30)
    top_words[column] = tokens

    # Print the top 15 for a quick look
    print(column + ':\n -', ', '.join(tokens.index[:15]))

Lousic C.K.:
 - like, just, know, life, people, thing, say, tit, na, gon, think, good, kid, cause, shit
Dave Chappelle:
 - like, know, say, just, shit, fuck, people, man, time, ahah, black, come, guy, look, fucking
Ricky Gervais:
 - say, right, like, just, know, ve, yeah, thing, fucking, joke, year, think, people, little, ll
Bo Burham:
 - know, like, think, love, bo, just, say, stuff, repeat, want, yeah, right, eye, slut, ve
Bill Burr:
 - like, just, right, know, na, gon, fucking, yeah, shit, come, think, guy, want, make, say
Jim Jefferies:
 - like, right, fucking, know, say, just, come, think, fuck, ve, thing, gun, people, day, oh
John Mulaney:
 - like, know, say, just, walk, clinton, right, time, think, kid, little, hey, look, mom, day
Hasan Minhaj:
 - like, know, dad, say, want, just, look, love, ve, time, hasan, right, come, life, walk
Ali Wong:
 - like, know, just, shit, gon, na, woman, ok, lot, come, oh, day, time, ta, husband
Anthony Jeselnik:
 - say, joke, like, know, thing, gu

### Remove the most common words

Words that nearly every comedian uses carry little information for our analysis, so we treat them as additional stop words.

In [109]:
top_all_words = []

for column in vectorized_df.columns:
    words = list(top_words[column].index)
    
    top_all_words.extend(words)

In [112]:
# Imports
from collections import Counter

# Find words that appear in the top-30 lists of more than 7 comedians (at least 8 of the 10)
add_stop_words = [word for word, count in Counter(top_all_words).most_common() if count > 7]
add_stop_words

['like',
 'just',
 'know',
 'people',
 'say',
 'right',
 'think',
 'time',
 'come',
 'look',
 'fuck',
 'want',
 'thing',
 'na',
 'gon',
 'guy',
 'make']
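
The thresholding step can be checked on a toy example. The sketch below uses three hypothetical speakers with made-up top-word lists and a hypothetical cutoff, `min_speakers`. Because a word appears at most once in each speaker's list, its count equals the number of speakers whose list contains it.

```python
from collections import Counter

# Hypothetical top-word lists, one per speaker (not the real data)
toy_top_words = {
    "Speaker A": ["like", "know", "life"],
    "Speaker B": ["like", "know", "joke"],
    "Speaker C": ["like", "gun", "day"],
}

# Flatten all lists, then count how many lists each word appears in
all_top = [word for words in toy_top_words.values() for word in words]
word_counts = Counter(all_top)

min_speakers = 2  # hypothetical cutoff for this toy example
shared = [word for word, count in word_counts.most_common() if count >= min_speakers]
print(shared)
```

Here only `like` (3 speakers) and `know` (2 speakers) pass the cutoff; words used by a single speaker are kept for the analysis.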

### Reload the cleaned data

In [118]:
clean_df = pd.read_csv('saves/cleaned_transcripts_df.csv', index_col = 0)
clean_df

Unnamed: 0,Transcript
Lousic C.K.,intro fade the music out let 's roll hold ther...
Dave Chappelle,this be dave he tell dirty joke for a living t...
Ricky Gervais,hello hello how you do great thank you wow cal...
Bo Burham,bo what old macdonald have a farm e I e I o an...
Bill Burr,all right thank you thank you very much thank ...
Jim Jefferies,lady and gentleman please welcome to the stage...
John Mulaney,armed with boyish charm and a sharp wit the fo...
Hasan Minhaj,what be up davis what be up I be home I have t...
Ali Wong,lady and gentleman please welcome to the stage...
Anthony Jeselnik,thank you thank you thank you san francisco th...


### Re-vectorize the clean data

In [120]:
# Imports
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

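# Update the stop words list
# (converted to a list because recent scikit-learn versions do not accept a frozenset here)
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Re-vectorize with the extended stop words
vectorizer = CountVectorizer(stop_words = stop_words)
data_cv = vectorizer.fit_transform(clean_df['Transcript'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available from 1.0)
new_vectorized_df = pd.DataFrame(data_cv.toarray(), columns = vectorizer.get_feature_names_out())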
new_vectorized_df.index = clean_df.index
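
As a rough illustration of what extending the stop-word list does, the following pure-Python sketch combines a base set with extra words and filters a token list. The tiny `base_stop_words` set is a hypothetical stand-in for a real English stop-word list, and the extra words and sample sentence are made up.

```python
# Tiny hypothetical stand-in for a built-in English stop-word list
base_stop_words = frozenset({"the", "a", "and", "you"})

# Hypothetical corpus-specific additions
extra_stop_words = ["like", "know"]

# union() accepts any iterable and returns a new frozenset; the base set is unchanged
combined = base_stop_words.union(extra_stop_words)

tokens = "you know the joke and the gun like a joke".split()
kept = [token for token in tokens if token not in combined]
print(kept)
```

Only tokens outside the combined set survive, so words shared by almost every speaker no longer dominate the counts.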

### Save the new vectorized data

In [123]:
new_vectorized_df.to_csv('saves/stopwords_vectorized_df.csv')