# More Data Cleaning 

We realised that the document-term matrices we created in 2-Data-Cleaning.ipynb using Count Vectorizer and TF-IDF Vectorizer contain many meaningless filler words and common words, such as `'like'`, `'just'`, `'people'` and `'youre'`. 

Therefore, we wish to inspect the matrices further and create a new stop words list in this notebook. 

In [1]:
import pandas as pd
from collections import Counter
import pickle
import re
import string 
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer 
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# read in the document-term matrix formed by Count Vectorizer 
df_cv = pd.read_pickle('/Users/lihuicham/Desktop/Y2S2/BT4222/project/standup-comedy-analysis/main/pickle/cv.pkl')
# transpose to term-document matrix 
df_cv = df_cv.transpose()   

In [3]:
# Find the top 50 words in each transcript 
top_dict = {}
for c in df_cv.columns:
    top = df_cv[c].sort_values(ascending=False).head(50)
    top_dict[c]= list(zip(top.index, top.values))
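
To see what the loop above produces, here is a minimal sketch on a toy term-document matrix. The matrix, its counts and the helper name `top_terms` are made up for illustration; `Series.nlargest` is used as a shorthand for sorting in descending order and taking the head.

```python
import pandas as pd

# Toy term-document matrix: rows are terms, columns are transcripts.
# All counts are invented for illustration.
toy_tdm = pd.DataFrame(
    {"transcript_a": [5, 3, 0, 1], "transcript_b": [2, 0, 4, 7]},
    index=["know", "school", "lawyer", "like"],
)

def top_terms(tdm, n):
    """Return {column: [(term, count), ...]} holding the n largest counts per column."""
    return {
        col: [(term, int(count)) for term, count in tdm[col].nlargest(n).items()]
        for col in tdm.columns
    }

toy_top = top_terms(toy_tdm, n=2)
print(toy_top)
```

The result has the same shape as `top_dict`: one list of `(word, count)` pairs per transcript.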

In [4]:
# Print the top 50 words in each transcript 
# for transcript, top_words in top_dict.items():
#     print(transcript)
#     print(', '.join([word for word, count in top_words]))
#     print('---')

In [5]:
# We add the most common top words to a stop word list.

# First, pull out the top 50 words for each transcript
words = []
for transcript in df_cv.columns:
    top = [word for (word, count) in top_dict[transcript]]
    for t in top:
        words.append(t)

In [7]:
# Aggregate this list and count, for each word, how many transcripts have it among their top 50 words
most_common_words = Counter(words).most_common()

[('get', 404),
 ('go', 404),
 ('know', 401),
 ('dont', 400),
 ('im', 398),
 ('like', 396),
 ('say', 391),
 ('thats', 389),
 ('one', 383),
 ('come', 363),
 ('right', 357),
 ('think', 352),
 ('youre', 346),
 ('people', 338),
 ('see', 332),
 ('look', 327),
 ('want', 321),
 ('time', 315),
 ('make', 307),
 ('na', 295),
 ('gon', 272),
 ('thing', 270),
 ('oh', 263),
 ('take', 251),
 ('good', 249),
 ('guy', 249),
 ('fuck', 243),
 ('would', 227),
 ('yeah', 227),
 ('tell', 227),
 ('well', 225),
 ('he', 197),
 ('shit', 196),
 ('cause', 195),
 ('back', 194),
 ('theyre', 191),
 ('man', 188),
 ('really', 173),
 ('cant', 170),
 ('little', 167),
 ('let', 150),
 ('love', 145),
 ('okay', 136),
 ('give', 133),
 ('never', 130),
 ('day', 129),
 ('even', 127),
 ('didnt', 125),
 ('kid', 120),
 ('mean', 120),
 ('woman', 117),
 ('year', 114),
 ('show', 110),
 ('way', 105),
 ('ive', 105),
 ('♪', 102),
 ('put', 100),
 ('talk', 99),
 ('call', 88),
 ('shes', 84),
 ('ill', 83),
 ('hey', 83),
 ('–', 80),
 ('try', 79

In [9]:
# create our own stop word list based on the top words 
# a word is treated as a stop word if it is a top-50 word in at least 150 transcripts

add_stop_words = [word for word, count in most_common_words if count >= 150]
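
The counting and thresholding steps can be sketched on a small example. The transcripts, words and threshold below are hypothetical. Because each transcript lists a word at most once, the `Counter` tally equals the number of transcripts that have the word among their top words.

```python
from collections import Counter

# Hypothetical top-word lists for three transcripts (made-up data).
toy_top_words = {
    "t1": ["know", "like", "school"],
    "t2": ["know", "like", "lawyer"],
    "t3": ["know", "yoga", "lawyer"],
}

# Flatten the per-transcript lists, then count how many transcripts list each word.
flat = [word for words in toy_top_words.values() for word in words]
transcript_counts = Counter(flat)

def words_at_or_above(counts, threshold):
    """Keep the words that are a top word in at least `threshold` transcripts."""
    return [word for word, count in counts.most_common() if count >= threshold]

toy_stop_words = words_at_or_above(transcript_counts, threshold=2)
print(toy_stop_words)
```

The cut-off is a judgment call: a lower threshold removes more words, including some that may carry topical meaning, so it is worth inspecting the words near the boundary before fixing a value.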

In [11]:
# pickle
with open('pickle/' + 'mostcommonwords-st.pkl', 'wb') as f:
    pickle.dump(most_common_words, f)
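
The cell above assumes that a `pickle/` directory already exists relative to the working directory. A possible sketch of a save/load pair that creates the directory when needed is shown below; the helper names `save_pickle` and `load_pickle` are hypothetical, and a temporary directory stands in for the project folder.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, path):
    """Pickle obj to path, creating parent directories if they are missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load a pickled object back from path."""
    with Path(path).open("rb") as f:
        return pickle.load(f)

toy_counts = [("get", 404), ("go", 404)]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "pickle" / "toy_counts.pkl"  # hypothetical file name
    save_pickle(toy_counts, target)
    restored = load_pickle(target)
```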

In [15]:
# after a few iterations of checking the top words with Count Vectorizer,
# we created a list of additional stop words that need to be removed too

own_stop_words = ['just', 'okay', 'ive', '♪', '–', 'ta', 'uh', 'wan', 'g', 'e', 'ah', 'r', 'mi', 'le']
complete_stop_words = [*add_stop_words, *own_stop_words]

['get',
 'go',
 'know',
 'dont',
 'im',
 'like',
 'say',
 'thats',
 'one',
 'come',
 'right',
 'think',
 'youre',
 'people',
 'see',
 'look',
 'want',
 'time',
 'make',
 'na',
 'gon',
 'thing',
 'oh',
 'take',
 'good',
 'guy',
 'fuck',
 'would',
 'yeah',
 'tell',
 'well',
 'he',
 'shit',
 'cause',
 'back',
 'theyre',
 'man',
 'really',
 'cant',
 'little',
 'let',
 'just',
 'okay',
 'ive',
 '♪',
 '–',
 'ta',
 'uh',
 'wan',
 'g',
 'e',
 'ah',
 'r',
 'mi',
 'le']
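
The list printed above contains no duplicates, but a word could appear twice if the threshold were lowered and a manually added word also passed it. A minimal sketch of an order-preserving merge, using a hypothetical helper `merge_stop_words` and stand-in lists:

```python
def merge_stop_words(*lists):
    """Concatenate word lists, keeping only the first occurrence of each word."""
    return list(dict.fromkeys(word for words in lists for word in words))

frequent = ["get", "go", "okay"]   # stand-in for add_stop_words
manual = ["just", "okay", "uh"]    # stand-in for own_stop_words
merged = merge_stop_words(frequent, manual)
print(merged)
```

Duplicates are harmless once the list is turned into a set for filtering, so this only matters when the list itself is inspected or saved.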

## Helper Functions 
These functions are reused from 2-Data-Cleaning.ipynb. 

In [6]:
# same function as 2-Data-Cleaning 
def get_wordnet_pos(treebank_tag) : 
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # As default pos in lemmatization is Noun
        return wordnet.NOUN

In [7]:
# same function as 2-Data-Cleaning 

lemmatizer = WordNetLemmatizer()

def pos_then_lemmatize(pos_tagged_words) :
    res = []
    for pos in pos_tagged_words : 
        word = pos[0]
        pos_tag = pos[1]

        lem = lemmatizer.lemmatize(word, get_wordnet_pos(pos_tag))
        res.append(lem)
    return res

In [8]:
# new function: tokenize, remove stop words, lemmatize, then remove stop words again
def custom_tokenizer_stop(doc):
    words = word_tokenize(doc.lower())

    # add our own stop word list to scikit-learn's built-in English stop words
    new_stop_words = text.ENGLISH_STOP_WORDS.union(complete_stop_words)

    # first pass: drop stop words before POS tagging
    filtered_words = [w for w in words if w not in new_stop_words]
    pos_tagged_words = nltk.pos_tag(filtered_words)
    pos_lemmatized_words = pos_then_lemmatize(pos_tagged_words)
    # second pass: lemmatization may map an inflected form onto a stop word
    filtered_words_2 = [w for w in pos_lemmatized_words if w not in new_stop_words]

    return filtered_words_2
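
The tokenizer filters stop words twice, before and after lemmatization. The sketch below illustrates why the second pass matters. To keep it runnable without NLTK data, a tiny lookup table stands in for the WordNet lemmatizer and whitespace splitting stands in for `word_tokenize`; both, along with the stop list, are assumptions made for illustration.

```python
# Hypothetical stand-ins for the real stop list and lemmatizer.
toy_stop_words = {"get", "go", "the"}
toy_lemmas = {"gets": "get", "going": "go", "lawyers": "lawyer"}

def toy_tokenizer(doc):
    """Return the tokens left after the first and after the second stop word pass."""
    tokens = doc.lower().split()
    # First pass: only exact stop word forms are removed.
    first_pass = [t for t in tokens if t not in toy_stop_words]
    # Lemmatize (toy lookup): inflected forms collapse to their base form.
    lemmatized = [toy_lemmas.get(t, t) for t in first_pass]
    # Second pass: base forms that are stop words are removed now.
    second_pass = [t for t in lemmatized if t not in toy_stop_words]
    return first_pass, second_pass

after_first, after_second = toy_tokenizer("The lawyers gets going")
print(after_first)
print(after_second)
```

In this toy run, `'gets'` and `'going'` survive the first pass because only their base forms are on the stop list; they are removed once lemmatization has reduced them.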

## An updated Document-Term Matrix 

### Count Vectorizer

In [9]:
# read in the clean data 
df_clean = pd.read_pickle('/Users/lihuicham/Desktop/Y2S2/BT4222/project/standup-comedy-analysis/main/pickle/corpus.pkl')
df_clean.head()

Unnamed: 0,Comedian,Date,Title,Subtitle,Transcript
0,Chris Rock,"March 8, 2023",Selective Outrage (2023) | Transcript,,lets go she said ill do anything you w...
1,Marc Maron,"March 3, 2023",Thinky Pain (2013) | Transcript,Marc Maron returns to his old stomping grounds...,i dont know what you were thinking like im no...
2,Chelsea Handler,"March 3, 2023",Evolution (2020) | Transcript,Chelsea Handler is back and better than ever -...,join me in welcoming the author of six number ...
3,Tom Papa,"March 3, 2023",What A Day! (2022) | Transcript,"Follows Papa as he shares about parenting, his...",premiered on december ladies and gentlemen g...
4,Jim Jefferies,"February 22, 2023",High n’ Dry (2023) | Transcript,Jim Jefferies is back and no topic is off limi...,please welcome to the stage jim jefferies hell...


In [16]:
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range = (1, 1) : unigrams only; use (1, 2) to include bigrams 
# max_features is not set, so every token returned by the tokenizer stays in the vocabulary 
cv = CountVectorizer(ngram_range = (1, 1),
                    tokenizer = custom_tokenizer_stop)
cv_vectors = cv.fit_transform(df_clean['Transcript'])
cv_feature_names = cv.get_feature_names_out()
cv_matrix_stop = pd.DataFrame(cv_vectors.toarray(), columns=cv_feature_names)

### Double Checking

In the code chunk below, we double check whether our `complete_stop_words` list is working.  

In `top_dict_check_cv`, the top words now make more sense and are meaningful for each transcript. The common filler words that carried little meaning have been removed successfully. 

In [19]:
# we double check on the top words in each transcript now. 
cv_matrix_check = cv_matrix_stop.transpose()

top_dict_check_cv = {}
for c in cv_matrix_check.columns:
    top = cv_matrix_check[c].sort_values(ascending=False).head(30)
    top_dict_check_cv[c]= list(zip(top.index, top.values))
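
The visual check can be complemented by a programmatic one. The sketch below uses a hypothetical helper, `leaked_stop_words`, that reports any stop word still present among the top words of a transcript; the data is made up.

```python
def leaked_stop_words(top_dict, stop_words):
    """Map each transcript to the stop words still found among its top words."""
    stop = set(stop_words)
    leaks = {}
    for transcript, pairs in top_dict.items():
        found = [word for word, _ in pairs if word in stop]
        if found:
            leaks[transcript] = found
    return leaks

toy_top_dict = {0: [("kid", 33), ("get", 5)], 1: [("yoga", 3)]}  # made-up data
leaks = leaked_stop_words(toy_top_dict, ["get", "go"])
print(leaks)
```

An empty result would indicate that none of the listed stop words appears among the top words.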

In [52]:
# pickle 
with open('pickle/' + 'common_words_cv.pkl', 'wb') as f:
    pickle.dump(top_dict_check_cv, f)

In [27]:
# we check with the first transcript 
first_transcript_value_cv = list(top_dict_check_cv.values())[0]
first_transcript_value_cv 

[('kid', 33),
 ('black', 33),
 ('woman', 31),
 ('try', 29),
 ('everybody', 26),
 ('school', 26),
 ('white', 25),
 ('love', 25),
 ('motherfucker', 23),
 ('ngga', 23),
 ('need', 22),
 ('talk', 21),
 ('lola', 21),
 ('year', 20),
 ('pussy', 19),
 ('day', 18),
 ('work', 18),
 ('shoe', 18),
 ('aint', 17),
 ('child', 16),
 ('girl', 16),
 ('lawyer', 16),
 ('didnt', 16),
 ('men', 15),
 ('mother', 15),
 ('baby', 15),
 ('accept', 14),
 ('attention', 14),
 ('sell', 12),
 ('bitch', 11)]

In [31]:
# pickle the updated document-term matrix from Count Vectorizer
with open('pickle/' + 'cv_stop.pkl', 'wb') as f:
    pickle.dump(cv_matrix_stop, f)

### TF-IDF

We repeat the same steps with the TF-IDF Vectorizer.  
Output: an updated TF-IDF matrix

In [23]:
# TF-IDF Vectorizer

tf = TfidfVectorizer(ngram_range = (1, 1),
                    tokenizer = custom_tokenizer_stop)
tf_vectors = tf.fit_transform(df_clean['Transcript'])
tf_feature_names = tf.get_feature_names_out()
tfidf_matrix_stop = pd.DataFrame(tf_vectors.toarray(), columns=tf_feature_names)

In [28]:
# we double check on the top words in each transcript now. 
tf_matrix_check = tfidf_matrix_stop.transpose()

top_dict_check_tf = {}
for c in tf_matrix_check.columns:
    top = tf_matrix_check[c].sort_values(ascending=False).head(30)
    top_dict_check_tf[c]= list(zip(top.index, top.values))

In [53]:
# pickle 
with open('pickle/' + 'common_words_tfidf.pkl', 'wb') as f:
    pickle.dump(top_dict_check_tf, f)

In [29]:
first_transcript_value_tf = list(top_dict_check_tf.values())[0]
first_transcript_value_tf 

[('lola', 0.36753783017238584),
 ('ngga', 0.3034142012409467),
 ('lawyer', 0.14154347913241483),
 ('black', 0.1389059795692906),
 ('motherfucker', 0.13089440529085258),
 ('kid', 0.11902042541909003),
 ('nggas', 0.11533644274132217),
 ('oj', 0.11446768159232128),
 ('pussy', 0.11238470741225165),
 ('woman', 0.10993135770144177),
 ('school', 0.10916810655918228),
 ('accept', 0.10699290692932405),
 ('prochoice', 0.10684845293716282),
 ('touché', 0.10684845293716282),
 ('everybody', 0.10622262315393302),
 ('shoe', 0.10273941410638181),
 ('white', 0.10138166293883931),
 ('try', 0.09847239157383815),
 ('kardashian', 0.09768550841638349),
 ('yoga', 0.0970973211812972),
 ('attention', 0.09409720481079092),
 ('aint', 0.09266061803079417),
 ('draymond', 0.09111095879801154),
 ('inlaws', 0.08973820811704201),
 ('abortion', 0.08892078503780722),
 ('love', 0.08591815048902417),
 ('victim', 0.08531639251172383),
 ('trimester', 0.08207771997589927),
 ('spoil', 0.08056287095631871),
 ('elon', 0.0799222

In [30]:
with open('pickle/' + 'tfidf_stop.pkl', 'wb') as f:
    pickle.dump(tfidf_matrix_stop, f)

## Decision Making 

Now, we need to decide which document-term matrix to use for the project.  
1. Count Vectorizer 
2. TF-IDF Vectorizer 

From the top words shown, **TF-IDF** might be the better matrix.  

Reasons : 
* It surfaces more meaningful words that are useful for topic modelling and EDA. For example, distinctive nouns such as `'kardashian'`, `'trimester'` and `'victim'` appear among the top 30 words of the first transcript in the TF-IDF matrix, but not in the Count Vectorizer matrix. 
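
Why TF-IDF promotes such words can be illustrated with a small worked example. It assumes the `TfidfVectorizer` defaults that the notebook leaves unchanged (`smooth_idf=True`, `norm='l2'`, raw term counts), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. All counts below are invented and are not taken from the corpus.

```python
import math

# Made-up numbers for one transcript:
# term -> (count in this transcript, number of transcripts containing the term)
n_docs = 100
toy_terms = {
    "kid": (30, 90),       # frequent here, but present in most transcripts
    "trimester": (10, 1),  # less frequent here, but found in no other transcript
}

def smoothed_idf(doc_freq, n_docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight each raw count by its idf, then L2-normalise the transcript vector.
raw = {t: tf * smoothed_idf(df, n_docs) for t, (tf, df) in toy_terms.items()}
norm = math.sqrt(sum(w * w for w in raw.values()))
tfidf = {t: w / norm for t, w in raw.items()}
print(tfidf)
```

In this toy case `'trimester'` ends up with a higher weight than `'kid'` despite a lower raw count, because it occurs in far fewer transcripts. A count-based ranking would order the two words the other way round.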
