# Exploratory Data Analysis

## Introduction

After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data. We are going to find some of the more obvious patterns with EDA before identifying the hidden patterns with machine learning (ML) techniques. We are going to look at the following for each comedian:

1. **Most common words** - find these and create word clouds
2. **Size of vocabulary** - look at the number of unique words and also how quickly someone speaks
3. **Amount of profanity** - most common terms

## Most Common Words

### Analysis

In [6]:
# Read in the document-term matrix
import pandas as pd

#data = pd.read_pickle('dtm_sample.pkl')

data = pd.read_pickle('dtm.pkl')
#data = data.transpose()
data.head(30)


Unnamed: 0,aaaaah,aaaaahhhhhhh,aaaaauuugghhhhhh,aaaahhhhh,aaah,aah,abc,abcs,ability,abject,...,zee,zen,zeppelin,zero,zillion,zombie,zombies,zoning,zoo,éclair
ali,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
anthony,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bill,1,0,0,0,0,0,0,1,0,0,...,0,0,0,1,1,1,1,1,0,0
bo,0,1,1,1,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
dave,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hasan,0,0,0,0,0,0,0,0,0,0,...,2,1,0,1,0,0,0,0,0,0
jim,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
joe,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
john,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
louis,0,0,0,0,0,3,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0


In [3]:
# Find the top 30 words said by each comedian
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c]= list(zip(top.index, top.values))

top_dict

{'aaaaah': [('bill', 1),
  ('ali', 0),
  ('anthony', 0),
  ('bo', 0),
  ('dave', 0),
  ('hasan', 0),
  ('jim', 0),
  ('joe', 0),
  ('john', 0),
  ('louis', 0),
  ('mike', 0),
  ('ricky', 0)],
 'aaaaahhhhhhh': [('bo', 1),
  ('ali', 0),
  ('anthony', 0),
  ('bill', 0),
  ('dave', 0),
  ('hasan', 0),
  ('jim', 0),
  ('joe', 0),
  ('john', 0),
  ('louis', 0),
  ('mike', 0),
  ('ricky', 0)],
 'aaaaauuugghhhhhh': [('bo', 1),
  ('ali', 0),
  ('anthony', 0),
  ('bill', 0),
  ('dave', 0),
  ('hasan', 0),
  ('jim', 0),
  ('joe', 0),
  ('john', 0),
  ('louis', 0),
  ('mike', 0),
  ('ricky', 0)],
 'aaaahhhhh': [('bo', 1),
  ('ali', 0),
  ('anthony', 0),
  ('bill', 0),
  ('dave', 0),
  ('hasan', 0),
  ('jim', 0),
  ('joe', 0),
  ('john', 0),
  ('louis', 0),
  ('mike', 0),
  ('ricky', 0)],
 'aaah': [('dave', 1),
  ('ali', 0),
  ('anthony', 0),
  ('bill', 0),
  ('bo', 0),
  ('hasan', 0),
  ('jim', 0),
  ('joe', 0),
  ('john', 0),
  ('louis', 0),
  ('mike', 0),
  ('ricky', 0)],
 'aah': [('louis', 3),


In [4]:
# Print the top 15 entries stored for each key in top_dict
for comedian, top_words in top_dict.items():
    print(comedian)
    print(', '.join([word for word, count in top_words[:15]]))
    print('---')

aaaaah
bill, ali, anthony, bo, dave, hasan, jim, joe, john, louis, mike, ricky
---
aaaaahhhhhhh
bo, ali, anthony, bill, dave, hasan, jim, joe, john, louis, mike, ricky
---
aaaaauuugghhhhhh
bo, ali, anthony, bill, dave, hasan, jim, joe, john, louis, mike, ricky
---
aaaahhhhh
bo, ali, anthony, bill, dave, hasan, jim, joe, john, louis, mike, ricky
---
aaah
dave, ali, anthony, bill, bo, hasan, jim, joe, john, louis, mike, ricky
---
aah
louis, ali, anthony, bill, bo, dave, hasan, jim, joe, john, mike, ricky
---
abc
ali, anthony, bill, bo, dave, hasan, jim, joe, john, louis, mike, ricky
---
abcs
bill, ali, anthony, bo, dave, hasan, jim, joe, john, louis, mike, ricky
---
ability
bo, ricky, ali, anthony, bill, dave, hasan, jim, joe, john, louis, mike
---
abject
ricky, ali, anthony, bill, bo, dave, hasan, jim, joe, john, louis, mike
---
able
john, ali, joe, ricky, bill, hasan, jim, louis, anthony, bo, dave, mike
---
ablebodied
jim, ali, anthony, bill, bo, dave, hasan, joe, john, louis, mike, ri
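
The matrix loaded above has one row per comedian and one column per word, so looping over `data.columns` visits words, which is why the output lists comedians under each word. A minimal sketch of the per-comedian version follows. It assumes a small made-up matrix in the same orientation (`toy`, `toy_t` and `toy_top` are hypothetical names with invented counts) and transposes it first so that each column is a comedian:

```python
import pandas as pd

# Hypothetical toy document-term matrix: rows are comedians, columns are words
toy = pd.DataFrame(
    {"like": [5, 2], "zoo": [0, 3], "husband": [4, 0]},
    index=["ali", "bo"],
)

# Transpose so that each column holds one comedian's word counts
toy_t = toy.transpose()

# Collect the top 2 (word, count) pairs for each comedian
toy_top = {}
for comedian in toy_t.columns:
    top = toy_t[comedian].sort_values(ascending=False).head(2)
    toy_top[comedian] = list(zip(top.index, top.values))

print(toy_top)
```

On the real data, the same loop would run over `data.transpose()` with `head(30)` instead of `head(2)`.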

**NOTE:** At this point, we could go on and create word clouds. However, by looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.



In [7]:
# Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each comedian
words = []
for comedian in data.columns:
    top = [word for (word, count) in top_dict[comedian]]
    for t in top:
        words.append(t)
        
words

['bill',
 'ali',
 'anthony',
 'bo',
 'dave',
 'hasan',
 'jim',
 'joe',
 'john',
 'louis',
 'mike',
 'ricky',
 'bo',
 'ali',
 'anthony',
 'bill',
 'dave',
 'hasan',
 'jim',
 'joe',
 'john',
 'louis',
 'mike',
 'ricky',
 'bo',
 'ali',
 'anthony',
 'bill',
 'dave',
 'hasan',
 'jim',
 'joe',
 'john',
 'louis',
 'mike',
 'ricky',
 'bo',
 'ali',
 'anthony',
 'bill',
 'dave',
 'hasan',
 'jim',
 'joe',
 'john',
 'louis',
 'mike',
 'ricky',
 'dave',
 'ali',
 'anthony',
 'bill',
 'bo',
 'hasan',
 'jim',
 'joe',
 'john',
 'louis',
 'mike',
 'ricky',
 'louis',
 'ali',
 'anthony',
 'bill',
 'bo',
 'dave',
 'hasan',
 'jim',
 'joe',
 'john',
 'mike',
 'ricky',
 'ali',
 'anthony',
 'bill',
 'bo',
 'dave',
 'hasan',
 'jim',
 'joe',
 'john',
 'louis',
 'mike',
 'ricky',
 'bill',
 'ali',
 'anthony',
 'bo',
 'dave',
 'hasan',
 'jim',
 'joe',
 'john',
 'louis',
 'mike',
 'ricky',
 'bo',
 'ricky',
 'ali',
 'anthony',
 'bill',
 'dave',
 'hasan',
 'jim',
 'joe',
 'john',
 'louis',
 'mike',
 'ricky',
 'ali',
 

In [8]:
# Let's aggregate this list and identify the most common words along with how many routines they occur in
Counter(words).most_common()

[('bill', 7484),
 ('ali', 7484),
 ('anthony', 7484),
 ('bo', 7484),
 ('dave', 7484),
 ('hasan', 7484),
 ('jim', 7484),
 ('joe', 7484),
 ('john', 7484),
 ('louis', 7484),
 ('mike', 7484),
 ('ricky', 7484)]

In [11]:
# If more than half of the comedians have it as a top word, exclude it from the list
add_stop_words = [word for word, count in Counter(words).most_common() if count > 6]
add_stop_words

['bill',
 'ali',
 'anthony',
 'bo',
 'dave',
 'hasan',
 'jim',
 'joe',
 'john',
 'louis',
 'mike',
 'ricky']
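
The "more than half of the comedians" rule can be sketched on its own with a few made-up top-word lists. Everything here (`toy_top_words`, the words and the four comedians used) is a hypothetical illustration, and the threshold is derived from the number of comedians rather than hard-coded:

```python
from collections import Counter

# Hypothetical top-word lists for four comedians (each word appears once per list)
toy_top_words = {
    "ali": ["like", "know", "husband"],
    "bo": ["like", "know", "song"],
    "dave": ["like", "know", "man"],
    "joe": ["like", "dude", "man"],
}

# Flatten the lists; a word's count is then the number of routines
# in which it is a top word
all_words = [w for words in toy_top_words.values() for w in words]
counts = Counter(all_words)

# Keep words that are a top word for more than half of the comedians
threshold = len(toy_top_words) / 2
toy_stop_words = [w for w, c in counts.most_common() if c > threshold]
print(toy_stop_words)
```

With 12 comedians, the same rule gives the `count > 6` cutoff used in the cell above.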

In [14]:
# Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,transcript
ali,ladies and gentlemen please welcome to the sta...
anthony,thank you thank you thank you san francisco th...
bill,all right thank you thank you very much thank...
bo,bo what old macdonald had a farm e i e i o and...
dave,this is dave he tells dirty jokes for a living...
hasan,whats up davis whats up im home i had to bri...
jim,ladies and gentlemen please welcome to the ...
joe,ladies and gentlemen welcome joe rogan wha...
john,all right petunia wish me luck out there you w...
louis,introfade the music out lets roll hold there l...


In [15]:
# If you have a frozenset of stop words
stop_words_set = frozenset({'besides', 'call', 'how', 'seem', 'just', 'at', 'who', 'below', 'others', 'rather', 'you', 'across', 'along', 'thereupon', 'both', 'most', 'become', 'some', 'though', 'everyone', 'part', 'four', 'never', 'she', 'few', 'been', 'their', 'each', 'yeah', 'was', 'beyond', 'but', 'afterwards', 'meanwhile', 'back', 'can', 'were', 'themselves', 'one', 'moreover', 'otherwise', 'nobody', 'him', 'wherever', 'are', 'per', 'sincere', 'again', 'after', 'still', 'right', 'cant', 'whereas', 'about', 'name', 'amongst', 'before', 'and', 'here', 'whoever', 'empty', 'amoungst', 'be', 'several', 'all', 'this', 'seemed', 'same', 'further', 'neither', 'mill', 'together', 'nor', 'perhaps', 'very', 'elsewhere', 'our', 'not', 'toward', 'am', 'indeed', 'un', 'fifty', 'detail', 'under', 'in', 'eight', 'among', 'seems', 'put', 'interest', 'whose', 'front', 'forty', 'five', 'do', 'had', 'i', 'during', 'whether', 'said', 'beside', 'now', 'we', 'time', 'however', 'any', 'against', 'only', 'such', 'might', 'mine', 'done', 'with', 'will', 'from', 'anywhere', 'on', 'me', 'fill', 'once', 'full', 'noone', 'serious', 'thus', 'what', 'there', 'the', 'or', 'whereafter', 'if', 'them', 'nothing', 'people', 'least', 'sometimes', 'third', 'yet', 'latterly', 'should', 'he', 'mostly', 'im', 'whatever', 'alone', 'last', 'take', 'first', 'thru', 'dont', 'bottom', 'throughout', 'yourselves', 'between', 'con', 'over', 'already', 'yours', 'off', 'somewhere', 'within', 'enough', 'none', 'ten', 'youre', 'another', 'so', 'us', 'without', 'behind', 'find', 'until', 'know', 'it', 'nevertheless', 'yourself', 'system', 'something', 'amount', 'whole', 'because', 'bill', 'then', 'latter', 'therefore', 'whenever', 'would', 'while', 'became', 'everything', 'own', 'his', 'hundred', 'anyway', 'more', 'seeming', 'that', 'her', 'describe', 'whereby', 'for', 'cannot', 'up', 'itself', 'hence', 'wherein', 'see', 'ourselves', 'whither', 'hereupon', 'where', 'six', 'by', 'when', 'beforehand', 'de', 
'almost', 'down', 'much', 'always', 'hers', 'even', 'no', 'thereafter', 'nine', 'which', 'found', 'oh', 'former', 'of', 'nowhere', 'your', 'into', 'than', 'fifteen', 'a', 'ever', 'through', 'becomes', 'becoming', 'two', 'herein', 'towards', 'hereafter', 'must', 'whence', 'ours', 'sometime', 'an', 'keep', 'is', 'cry', 'someone', 'although', 'twenty', 'ie', 'due', 'got', 'upon', 'they', 'co', 'myself', 'gonna', 'think', 'out', 'thereby', 'thick', 'well', 'often', 'thence', 'next', 'its', 'anything', 'therein', 'thin', 'why', 'three', 'have', 'many', 'thats', 'except', 'other', 'anyone', 'those', 'couldnt', 'since', 're', 'could', 'every', 'has', 'move', 'may', 'etc', 'these', 'either', 'also', 'too', 'himself', 'namely', 'whereupon', 'above', 'eleven', 'give', 'ltd', 'top', 'go', 'formerly', 'hasnt', 'twelve', 'eg', 'onto', 'less', 'like', 'whom', 'somehow', 'hereby', 'everywhere', 'made', 'please', 'anyhow', 'get', 'herself', 'sixty', 'else', 'fire', 'show', 'my', 'to', 'via', 'being', 'around', 'side', 'inc', 'as'})

# Convert frozenset to list
stop_words = list(stop_words_set)

In [18]:
type(stop_words)

list

In [20]:
# Use the list in CountVectorizer
cv = CountVectorizer(stop_words=stop_words)
cv

In [21]:
# Add new stop words
#stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate document-term matrix
#cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(data_clean.transcript)
data_cv

<12x7468 sparse matrix of type '<class 'numpy.int64'>'
	with 16367 stored elements in Compressed Sparse Row format>

In [22]:
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_stop.index = data_clean.index

In [24]:
data_stop.index

Index(['ali', 'anthony', 'bill', 'bo', 'dave', 'hasan', 'jim', 'joe', 'john',
       'louis', 'mike', 'ricky'],
      dtype='object')

In [25]:
# Pickle it for later use
import pickle

# Use a context manager so the file is closed after writing
with open("cv_stop.pkl", "wb") as f:
    pickle.dump(cv, f)
data_stop.to_pickle("dtm_stop.pkl")

In [26]:
# Let's make some word clouds!
# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)

In [27]:
wc

<wordcloud.wordcloud.WordCloud at 0x215f11bed60>

### Findings

* Ali Wong says the s-word a lot and talks about her husband. I guess that's funny to me.
* A lot of people use the F-word. Let's dig into that later.

## Number of Words

### Analysis

In [None]:
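# Find the number of unique words that each comedian uses

# The document-term matrix has one row per comedian, so count the non-zero items in each row,
# meaning the words that occur at least once
unique_list = []
for comedian in data.index:
    uniques = data.loc[comedian].to_numpy().nonzero()[0].size
    unique_list.append(uniques)

# full_names is not defined earlier in this notebook; as a stand-in, reuse the index labels
# (swap in display names, in the same order as data.index, if you have them)
full_names = list(data.index)

# Create a new dataframe that contains this unique word count
data_words = pd.DataFrame(list(zip(full_names, unique_list)), columns=['comedian', 'unique_words'])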
data_unique_sort = data_words.sort_values(by='unique_words')
data_unique_sort
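
As an alternative to the explicit loop, the unique-word count can be computed in one vectorized step. This is only a sketch on a made-up matrix (`toy` and its counts are hypothetical) with comedians as rows:

```python
import pandas as pd

# Hypothetical toy document-term matrix: rows are comedians, columns are words
toy = pd.DataFrame(
    {"like": [5, 2], "zoo": [0, 3], "husband": [4, 0], "song": [0, 1]},
    index=["ali", "bo"],
)

# A word counts as "used" when its count is above zero;
# summing the booleans across each row gives the vocabulary size
toy_unique = (toy > 0).sum(axis=1)
print(toy_unique.sort_values())
```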

In [None]:
# Calculate the words per minute of each comedian

# Find the total number of words that a comedian uses (one row of the matrix per comedian)
total_list = []
for comedian in data.index:
    totals = data.loc[comedian].sum()
    total_list.append(totals)
    
# Comedy special run times from IMDB, in minutes
run_times = [60, 59, 80, 60, 67, 73, 77, 63, 62, 58, 76, 79]

# Let's add some columns to our dataframe
data_words['total_words'] = total_list
data_words['run_times'] = run_times
data_words['words_per_minute'] = data_words['total_words'] / data_words['run_times']

# Sort the dataframe by words per minute to see who talks the slowest and fastest
data_wpm_sort = data_words.sort_values(by='words_per_minute')
data_wpm_sort
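
The words-per-minute arithmetic is simply total words divided by run time. A small sketch with invented numbers (the counts and run times below are hypothetical, not taken from the real transcripts):

```python
import pandas as pd

# Hypothetical toy document-term matrix: rows are comedians, columns are words
toy = pd.DataFrame(
    {"like": [50, 20], "zoo": [10, 30], "husband": [60, 10]},
    index=["ali", "bo"],
)

# Hypothetical run times in minutes, in the same order as the rows
toy_run_times = pd.Series([60, 20], index=toy.index)

# Total words is the row sum; words per minute divides it by the run time
toy_words = pd.DataFrame({"total_words": toy.sum(axis=1), "run_time": toy_run_times})
toy_words["words_per_minute"] = toy_words["total_words"] / toy_words["run_time"]
print(toy_words.sort_values(by="words_per_minute"))
```

Keeping the run times in a Series indexed by comedian, rather than a bare list, guards against the values drifting out of order relative to the rows.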

In [None]:
# Let's plot our findings
import numpy as np
import matplotlib.pyplot as plt

y_pos = np.arange(len(data_words))

plt.subplot(1, 2, 1)
plt.barh(y_pos, data_unique_sort.unique_words, align='center')
plt.yticks(y_pos, data_unique_sort.comedian)
plt.title('Number of Unique Words', fontsize=20)

plt.subplot(1, 2, 2)
plt.barh(y_pos, data_wpm_sort.words_per_minute, align='center')
plt.yticks(y_pos, data_wpm_sort.comedian)
plt.title('Number of Words Per Minute', fontsize=20)

plt.tight_layout()
plt.show()

### Findings

* **Vocabulary**
   * Ricky Gervais (British comedy) and Bill Burr (podcast host) use a lot of words in their comedy
   * Louis C.K. (self-deprecating comedy) and Anthony Jeselnik (dark humor) have a smaller vocabulary


* **Talking Speed**
   * Joe Rogan (blue comedy) and Bill Burr (podcast host) talk fast
   * Bo Burnham (musical comedy) and Anthony Jeselnik (dark humor) talk slow
   
Ali Wong is somewhere in the middle in both cases. Nothing too interesting here.

## Side Note

What was our goal for the EDA portion of our journey? **To be able to take an initial look at our data and see if the results of some basic analysis made sense.**

My conclusion - yes, it does, for a first pass. There are definitely some things that could be better cleaned up, such as adding more stop words or including bi-grams. But we can save that for another day. The results, especially the profanity findings, are interesting and make general sense, so we're going to move on.

As a reminder, the data science process is an iterative one. It's better to see some non-perfect but acceptable results to help you quickly decide whether your project is a dud or not, instead of having analysis paralysis and never delivering anything.

**Alice's data science (and life) motto: Let go of perfectionism!**

## Additional Exercises

1. What other word counts do you think would be interesting to compare instead of the f-word and s-word? Create a scatter plot comparing them.