# Exploratory Data Analysis

## Goal of this Notebook:
After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data. We are going to find the more obvious patterns with EDA before identifying the hidden patterns with machine learning (ML) techniques.
We are going to look at the following for each communication channel:

Most common words - find these and create word clouds
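
As a minimal sketch of the parallel drawn above between numerical and text EDA, the following uses only the standard library. Both the `values` list and the `message` string are made up for illustration and do not come from this dataset.

```python
import statistics
from collections import Counter

# Illustrative numeric sample (made-up values)
values = [2, 3, 3, 5, 8, 3, 5, 13]
average = statistics.mean(values)      # arithmetic mean of the sample
most_common_value = statistics.mode(values)  # most frequent value
print(average, most_common_value)

# Illustrative text sample (made-up message)
message = "buy the stock now and buy more stock later buy"
word_counts = Counter(message.split())  # word -> number of occurrences
top_two = word_counts.most_common(2)    # the text analogue of the mode
print(top_two)
```

For text, counting words plays the role that the mean and mode play for numbers: it gives a quick sanity check before any modelling.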

In [74]:
import json
import pickle
import pandas as pd
from collections import Counter
from sklearn.feature_extraction import text # Contains the stop word list
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud

In [77]:
data_clean = pd.read_pickle('term_matrix.pickle')
# Transpose because it is easier to operate across columns than across rows.
# We want to aggregate per channel, so the channels should be the columns.
data = data_clean.transpose()
data.head()

              CHATS  EMAILS  SMS
able              1       3    0
abn               0       2    0
accelerated       0       1    0
accelerating      0       1    0
accenture         0       5    0


In [54]:
# Find the top 30 words used in each channel
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c] = list(zip(top.index, top.values))

top_dict

{'CHATS': [('hi', 15),
  ('im', 13),
  ('long', 11),
  ('going', 10),
  ('john', 10),
  ('short', 9),
  ('time', 9),
  ('just', 8),
  ('good', 7),
  ('trading', 7),
  ('like', 6),
  ('risk', 6),
  ('chart', 6),
  ('trade', 6),
  ('yeah', 6),
  ('guys', 5),
  ('dont', 5),
  ('phil', 5),
  ('hello', 4),
  ('today', 4),
  ('eurgbp', 4),
  ('look', 4),
  ('moment', 4),
  ('looking', 4),
  ('right', 4),
  ('later', 4),
  ('steve', 3),
  ('downtrend', 3),
  ('hows', 3),
  ('yes', 3)],
 'EMAILS': [('email', 92),
  ('buy', 80),
  ('information', 54),
  ('phillip', 50),
  ('message', 44),
  ('click', 32),
  ('downgraded', 32),
  ('use', 31),
  ('account', 31),
  ('subject', 28),
  ('thanks', 28),
  ('need', 28),
  ('know', 27),
  ('original', 27),
  ('new', 27),
  ('request', 27),
  ('enron', 27),
  ('password', 26),
  ('strong', 25),
  ('time', 25),
  ('review', 25),
  ('recipient', 24),
  ('like', 23),
  ('sent', 23),
  ('change', 22),
  ('let', 22),
  ('receive', 22),
  ('price', 21),
  ('th

In [59]:
# Print the top 14 words used in each channel (the slice [0:14] keeps 14 items)
for channel, top_words in top_dict.items():
    print(channel)
    print(', '.join([word for word, count in top_words[0:14]]))
    print('---')

CHATS
hi, im, long, going, john, short, time, just, good, trading, like, risk, chart, trade
---
EMAILS
email, buy, information, phillip, message, click, downgraded, use, account, subject, thanks, need, know, original
---
SMS
este, 鞋子全部打湿完了, getting, exemplo, hola, della, um, critical, polskiego, maltija, say, olá, ένα, αυτό
---
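
Because a Python slice excludes its end index, `top_words[0:14]` returns 14 words, while `top_words[:15]` would return 15. A small sketch of the difference, using the first 16 CHATS entries copied from the `top_dict` output above:

```python
# First 16 CHATS entries, copied from the top_dict output
top_words = [('hi', 15), ('im', 13), ('long', 11), ('going', 10), ('john', 10),
             ('short', 9), ('time', 9), ('just', 8), ('good', 7), ('trading', 7),
             ('like', 6), ('risk', 6), ('chart', 6), ('trade', 6), ('yeah', 6),
             ('guys', 5)]

first_14 = [word for word, count in top_words[0:14]]  # indices 0..13
first_15 = [word for word, count in top_words[:15]]   # indices 0..14

print(len(first_14), first_14[-1])
print(len(first_15), first_15[-1])
```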


### By looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.

In [60]:
# Look at the most common top words --> add them to the stop word list

# Let's first create a list that holds each channel's top 30 words (even if repeated)
words = []
for channel in data.columns:
    top = [word for (word, count) in top_dict[channel]]
    for t in top:
        words.append(t)
        
words

['hi',
 'im',
 'long',
 'going',
 'john',
 'short',
 'time',
 'just',
 'good',
 'trading',
 'like',
 'risk',
 'chart',
 'trade',
 'yeah',
 'guys',
 'dont',
 'phil',
 'hello',
 'today',
 'eurgbp',
 'look',
 'moment',
 'looking',
 'right',
 'later',
 'steve',
 'downtrend',
 'hows',
 'yes',
 'email',
 'buy',
 'information',
 'phillip',
 'message',
 'click',
 'downgraded',
 'use',
 'account',
 'subject',
 'thanks',
 'need',
 'know',
 'original',
 'new',
 'request',
 'enron',
 'password',
 'strong',
 'time',
 'review',
 'recipient',
 'like',
 'sent',
 'change',
 'let',
 'receive',
 'price',
 'thank',
 'stock',
 'este',
 '鞋子全部打湿完了',
 'getting',
 'exemplo',
 'hola',
 'della',
 'um',
 'critical',
 'polskiego',
 'maltija',
 'say',
 'olá',
 'ένα',
 'αυτό',
 'γεια',
 'γλώσσας',
 'είναι',
 'ελληνικής',
 'open',
 'języka',
 'σας',
 'questo',
 'thx',
 'italiana',
 'ejemplo',
 'iteresting',
 'portuguesa',
 'língua',
 'saluti',
 'hi']

In [61]:
# Aggregate this list and identify the most common words along with how many channels they occur in
Counter(words).most_common()

[('hi', 2),
 ('time', 2),
 ('like', 2),
 ('im', 1),
 ('long', 1),
 ('going', 1),
 ('john', 1),
 ('short', 1),
 ('just', 1),
 ('good', 1),
 ('trading', 1),
 ('risk', 1),
 ('chart', 1),
 ('trade', 1),
 ('yeah', 1),
 ('guys', 1),
 ('dont', 1),
 ('phil', 1),
 ('hello', 1),
 ('today', 1),
 ('eurgbp', 1),
 ('look', 1),
 ('moment', 1),
 ('looking', 1),
 ('right', 1),
 ('later', 1),
 ('steve', 1),
 ('downtrend', 1),
 ('hows', 1),
 ('yes', 1),
 ('email', 1),
 ('buy', 1),
 ('information', 1),
 ('phillip', 1),
 ('message', 1),
 ('click', 1),
 ('downgraded', 1),
 ('use', 1),
 ('account', 1),
 ('subject', 1),
 ('thanks', 1),
 ('need', 1),
 ('know', 1),
 ('original', 1),
 ('new', 1),
 ('request', 1),
 ('enron', 1),
 ('password', 1),
 ('strong', 1),
 ('review', 1),
 ('recipient', 1),
 ('sent', 1),
 ('change', 1),
 ('let', 1),
 ('receive', 1),
 ('price', 1),
 ('thank', 1),
 ('stock', 1),
 ('este', 1),
 ('鞋子全部打湿完了', 1),
 ('getting', 1),
 ('exemplo', 1),
 ('hola', 1),
 ('della', 1),
 ('um', 1),
 ('criti

In [68]:
# If more than half of the channels have it as a top word, add it to the extra stop word list
add_stop_words = [word for word, count in Counter(words).most_common() if count > len(data.columns) / 2]
add_stop_words

['hi', 'time', 'like']
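
Each channel contributes a given word to `words` at most once, so a word's count equals the number of channels in which it is a top word. The following sketch shows that relationship on shortened toy lists. The names `toy_top` and `min_share` are illustrative and not part of the notebook.

```python
from collections import Counter

# Shortened, illustrative top-word lists (a few entries per channel)
toy_top = {
    'CHATS': ['hi', 'im', 'time', 'like'],
    'EMAILS': ['email', 'buy', 'time', 'like'],
    'SMS': ['este', 'hola', 'hi'],
}

min_share = 0.5  # assumed cutoff: a top word in more than half of the channels

# Each word appears at most once per channel, so this is a channel count
channel_counts = Counter(word for top in toy_top.values() for word in top)

# Keep the words whose share of channels exceeds the cutoff
common = [word for word, count in channel_counts.most_common()
          if count / len(toy_top) > min_share]
print(common)
```

Expressing the cutoff as a share of the channels keeps it meaningful if channels are added or removed later.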

In [72]:
# Combine scikit-learn's English stop words with the additional ones found above
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

In [73]:
# Let's make some word clouds
wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)
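
`WordCloud.generate` expects a single text string, whereas a column of the term matrix holds word counts. The sketch below turns such a column into the word-to-count mapping that `generate_from_frequencies` accepts. That method does not apply the `stopwords` setting, so stop words are filtered by hand here. The helper name `column_to_frequencies` and the stop word set are illustrative. The counts are taken from the CHATS column shown earlier.

```python
def column_to_frequencies(counts, stop_words=frozenset()):
    """Return {word: count} for words that occur and are not stop words."""
    return {word: int(count) for word, count in counts.items()
            if count > 0 and word not in stop_words}

# A few CHATS counts from the outputs above; 'abn' never occurs in CHATS
counts = {'hi': 15, 'im': 13, 'long': 11, 'just': 8, 'abn': 0}

# Illustrative stop word set (the notebook would pass its own stop_words)
freqs = column_to_frequencies(counts, stop_words={'hi', 'just'})
print(freqs)

# With the notebook's objects this would be used roughly as:
# wc.generate_from_frequencies(column_to_frequencies(data['CHATS'], stop_words))
```

The helper only relies on `.items()`, so it works for a plain dict as well as for a pandas Series.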

In [80]:
# Reset the output dimensions
plt.rcParams['figure.figsize'] = [16, 6]

# Create one subplot per channel
for index, chan in enumerate(data.columns):
    # data[chan] holds word counts, not raw text, so build the cloud from frequencies.
    # generate_from_frequencies does not apply the stopwords setting, so filter here.
    freqs = {word: count for word, count in data[chan].items()
             if count > 0 and word not in stop_words}
    wc.generate_from_frequencies(freqs)

    plt.subplot(1, 3, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(chan)

plt.show()