# Data Cleaning

## Text Pre-processing 

We first apply a set of common cleaning steps to all of the text, then apply further cleaning after the text has been tokenized.

Data cleaning can go on indefinitely, so we start simple and iterate: run the common cleaning steps, inspect the results, and add more cleaning later if the results need it.

**Common data cleaning steps on all text:**
* make text lowercase
* remove punctuation
* remove numerical values
* remove common nonsensical text (`\n`)

**Word tokenization:**  
Split a sentence into a list of words.

**More data cleaning steps after tokenization:**
* remove stop words
* lemmatize words to their meaningful root form
* tag parts of speech
* create bi-grams or tri-grams
* deal with typos
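
The bi-gram and tri-gram step from the list above can be sketched in plain Python, assuming the text has already been tokenized into a list of words. The helper name `make_ngrams` is hypothetical and not part of the notebook; when a scikit-learn vectorizer is used, a similar effect can be obtained through its `ngram_range` parameter.

```python
def make_ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    # zip over n shifted views of the list; zip stops at the shortest view
    return [' '.join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = ['pet', 'pug', 'marriage', 'evolve']
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)   # ['pet pug', 'pug marriage', 'marriage evolve']
print(trigrams)  # ['pet pug marriage', 'pug marriage evolve']
```

A list shorter than `n` simply yields no n-grams.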

In [15]:
import pandas as pd 
import pickle
import re
import string 
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer 
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
df = pd.read_csv('/Users/lihuicham/Desktop/Y2S2/BT4222/project/standup-comedy-analysis/main/transcripts.csv')
df = df[df.columns[1:]]
df.head()

Unnamed: 0,Comedian,Date,Title,Subtitle,Transcript
0,Chris Rock,"March 8, 2023",Selective Outrage (2023) | Transcript,,[slow instrumental music playing] [funk drums ...
1,Marc Maron,"March 3, 2023",Thinky Pain (2013) | Transcript,Marc Maron returns to his old stomping grounds...,[siren wailing] I don’t know what you were thi...
2,Chelsea Handler,"March 3, 2023",Evolution (2020) | Transcript,Chelsea Handler is back and better than ever -...,Join me in welcoming the author of six number ...
3,Tom Papa,"March 3, 2023",What A Day! (2022) | Transcript,"Follows Papa as he shares about parenting, his...","Premiered on December 13, 2022 Ladies and gent..."
4,Jim Jefferies,"February 22, 2023",High n’ Dry (2023) | Transcript,Jim Jefferies is back and no topic is off limi...,"Please welcome to the stage, Jim Jefferies! He..."


In [17]:
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation,
    remove quotation marks, remove words containing numbers, remove newlines.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                              # text in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing digits
    text = re.sub('[‘’“”…]', '', text)                               # curly quotes and ellipsis
    text = re.sub(r'\n', '', text)                                   # newline characters
    return text

# alias used when applying the function to a DataFrame column
cleaning = clean_text
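
As a quick check of what these steps do, the sketch below reapplies the same substitutions to a made-up sample string. The sample text and the name `clean_text_demo` are illustrative assumptions, not part of the dataset or the notebook. It also shows a side effect worth knowing about: because newlines are replaced with an empty string, the words on either side of a line break get joined. The optional `newline_repl` argument, again hypothetical, shows how a space could be used instead.

```python
import re
import string

def clean_text_demo(text, newline_repl=''):
    """Same steps as clean_text; newline_repl is a hypothetical extra option."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                              # text in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing digits
    text = re.sub('[\u2018\u2019\u201c\u201d\u2026]', '', text)      # curly quotes and ellipsis
    text = re.sub(r'\n', newline_repl, text)                         # line breaks
    return text

sample = "[music] Hello, World!\nIt\u2019s 2023"
joined = clean_text_demo(sample)                    # original behaviour
spaced = clean_text_demo(sample, newline_repl=' ')  # variant that keeps words apart

print(repr(joined))  # ' hello worldits '
print(repr(spaced))  # ' hello world its '
```

Whether the joined words matter depends on how often line breaks occur inside the transcripts, which is worth checking before changing the cleaning step.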

In [18]:
# apply data cleaning to Transcript column 
df_clean = df.copy()
df_clean['Transcript'] = df_clean['Transcript'].apply(cleaning)

In [19]:
df_clean.head()

Unnamed: 0,Comedian,Date,Title,Subtitle,Transcript
0,Chris Rock,"March 8, 2023",Selective Outrage (2023) | Transcript,,lets go she said ill do anything you w...
1,Marc Maron,"March 3, 2023",Thinky Pain (2013) | Transcript,Marc Maron returns to his old stomping grounds...,i dont know what you were thinking like im no...
2,Chelsea Handler,"March 3, 2023",Evolution (2020) | Transcript,Chelsea Handler is back and better than ever -...,join me in welcoming the author of six number ...
3,Tom Papa,"March 3, 2023",What A Day! (2022) | Transcript,"Follows Papa as he shares about parenting, his...",premiered on december ladies and gentlemen g...
4,Jim Jefferies,"February 22, 2023",High n’ Dry (2023) | Transcript,Jim Jefferies is back and no topic is off limi...,please welcome to the stage jim jefferies hell...


In [20]:
df_clean.shape

# there are 415 transcripts in total 

(415, 5)

## Data Organization 

We will organise the data in three standard text formats:
1. **Corpus:** a collection of texts in which word order is preserved.
2. **Document-Term Matrix:** an implementation of Bag of Words, in which each document is represented by its word counts, disregarding word order, in matrix format.
3. **TF-IDF:** a weighting that reflects how important a word is to a document within a collection or corpus.
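
To make the difference between the first two formats concrete, here is a minimal sketch on a two-document toy corpus using only the standard library. The toy sentences and variable names are assumptions for illustration; the notebook itself builds the matrix with scikit-learn's `CountVectorizer`.

```python
from collections import Counter

# corpus: the documents themselves, word order preserved
docs = ['the cat sat', 'the cat ate the fish']

# document-term matrix: one row per document, one column per vocabulary term
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)  # ['ate', 'cat', 'fish', 'sat', 'the']
for row in dtm:
    print(row)
# [0, 1, 0, 1, 1]
# [1, 1, 1, 0, 2]
```

The second row records that "the" occurs twice, but nothing in the matrix says where in the sentence it occurred.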

## Corpus

In [21]:
# pickle corpus
with open('pickle/' + 'corpus.pkl', 'wb') as f:
    pickle.dump(df_clean, f)
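
Note that opening `'pickle/corpus.pkl'` for writing assumes a `pickle/` directory already exists in the working directory. A minimal sketch of a save-and-reload round trip that creates the directory first is shown below. The temporary directory and the small dictionary stand in for the real project folder and `df_clean`; both are assumptions for illustration only.

```python
import os
import pickle
import tempfile

base_dir = tempfile.mkdtemp()                # stand-in for the project folder
out_dir = os.path.join(base_dir, 'pickle')
os.makedirs(out_dir, exist_ok=True)          # avoids FileNotFoundError on a first run

# stand-in for df_clean
corpus = {'Comedian': ['A', 'B'], 'Transcript': ['hello world', 'good night']}
path = os.path.join(out_dir, 'corpus.pkl')

with open(path, 'wb') as f:
    pickle.dump(corpus, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == corpus)  # True
```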


## Helper Functions 
Create a custom tokenizer for `CountVectorizer` and `TfidfVectorizer`.

In [22]:
def get_wordnet_pos(treebank_tag) : 
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # As default pos in lemmatization is Noun
        return wordnet.NOUN
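
The same mapping can also be written as a dictionary lookup on the first letter of the Penn Treebank tag. This is only an alternative sketch: the names `TAG_TO_WORDNET` and `get_wordnet_pos_lookup` are hypothetical, and the single-letter values are the strings behind NLTK's `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV` constants, written out literally so the snippet runs without the WordNet corpus installed.

```python
# first letter of the Treebank tag -> WordNet POS letter
TAG_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def get_wordnet_pos_lookup(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return TAG_TO_WORDNET.get(treebank_tag[:1], 'n')

for tag in ['VBZ', 'JJ', 'NNS', 'RB', 'IN']:
    print(tag, '->', get_wordnet_pos_lookup(tag))
# VBZ -> v
# JJ -> a
# NNS -> n
# RB -> r
# IN -> n
```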

In [23]:
lemmatizer = WordNetLemmatizer()

def pos_then_lemmatize(pos_tagged_words) :
    res = []
    for pos in pos_tagged_words : 
        word = pos[0]
        pos_tag = pos[1]

        lem = lemmatizer.lemmatize(word, get_wordnet_pos(pos_tag))
        res.append(lem)
    return res

In [24]:
# build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def custom_tokenizer(text):
    words = word_tokenize(text.lower())
    # drop stop words, tag parts of speech, then lemmatize using those tags
    filtered_words = [w for w in words if w not in stop_words]
    pos_tagged_words = nltk.pos_tag(filtered_words)
    tokens = pos_then_lemmatize(pos_tagged_words)

    return tokens

## Document-Term Matrix

In [26]:
# Count Vectorizer - Document-Term Matrix 
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range = (1, 1) : unigrams only; (1, 2) would also include bigrams
# max_features (not set here) would restrict the vocabulary to the most frequent terms, e.g. max_features = 300
cv = CountVectorizer(ngram_range = (1, 1),
                    tokenizer = custom_tokenizer)
cv_vectors = cv.fit_transform(df_clean['Transcript'])
cv_feature_names = cv.get_feature_names_out()
cv_matrix = pd.DataFrame(cv_vectors.toarray(), columns=cv_feature_names)
cv_matrix


Unnamed: 0,aa,aaa,aaaa,aaaaaa,aaaaaaaaaaall,aaaaaaaaah,aaaaaaaahhhhhhh,aaaaaaah,aaaaaaarhhh,aaaaaabout,...,♪with,♪you,♪youse,♪♪,♪♪♪,♫,♫if,♫third,♬,ﬂoor
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
410,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
411,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
412,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
413,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
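
The column names above include tokens such as `♪with`, `♪♪`, and `ﬂoor`, which suggests that music symbols and typographic ligatures survive the cleaning step. One possible follow-up, sketched here under the assumption that only plain a-z words are wanted, is to normalize Unicode compatibility characters and strip everything else before vectorizing. The helper `normalize_token` is hypothetical and not part of the notebook.

```python
import re
import unicodedata

def normalize_token(token):
    """Fold ligatures into plain letters, then keep only the characters a-z."""
    token = unicodedata.normalize('NFKC', token).lower()
    return re.sub(r'[^a-z]', '', token)

# '\u266a' is the music note symbol, '\ufb02' is the 'fl' ligature
raw_tokens = ['\u266awith', '\u266a\u266a', '\ufb02oor', 'aaaaaaah']
cleaned = [t for t in (normalize_token(tok) for tok in raw_tokens) if t]

print(cleaned)  # ['with', 'floor', 'aaaaaaah']
```

Elongated interjections such as `aaaaaaah` pass through unchanged; handling those would need a separate rule.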


In [27]:
# pickle document-term matrix
with open('pickle/' + 'dtm.pkl', 'wb') as f:
    pickle.dump(cv_matrix, f)

## TF-IDF

In [28]:
# TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range = (1, 1),
                    tokenizer = custom_tokenizer)
tf_vectors = tf.fit_transform(df_clean['Transcript'])
tf_feature_names = tf.get_feature_names_out()
tfidf_matrix = pd.DataFrame(tf_vectors.toarray(), columns=tf_feature_names)
tfidf_matrix

Unnamed: 0,aa,aaa,aaaa,aaaaaa,aaaaaaaaaaall,aaaaaaaaah,aaaaaaaahhhhhhh,aaaaaaah,aaaaaaarhhh,aaaaaabout,...,♪with,♪you,♪youse,♪♪,♪♪♪,♫,♫if,♫third,♬,ﬂoor
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
410,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
411,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
412,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
413,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
# pickle tfidf_matrix
with open('pickle/' + 'tfidf.pkl', 'wb') as f:
    pickle.dump(tfidf_matrix, f)


## Testing 

Use one simple sentence to work out how to process words.

In [2]:
sent = 'Follows Papa as he shares about parenting his reliance on modern technology rescuing his pet pug and how his marriage has evolved over time'

In [3]:
# tokenization 

words = word_tokenize(sent.lower())
print(words)

['follows', 'papa', 'as', 'he', 'shares', 'about', 'parenting', 'his', 'reliance', 'on', 'modern', 'technology', 'rescuing', 'his', 'pet', 'pug', 'and', 'how', 'his', 'marriage', 'has', 'evolved', 'over', 'time']


In [4]:
# remove stopwords 

stop_words = set(stopwords.words('english')) 
filtered_words = [w for w in words if not w in stop_words] 
print(filtered_words)

['follows', 'papa', 'shares', 'parenting', 'reliance', 'modern', 'technology', 'rescuing', 'pet', 'pug', 'marriage', 'evolved', 'time']


In [5]:
# lemmatization 

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print(lemmatized_words)

# lemmatization uses noun as the default POS, so we need to be more specific

['follows', 'papa', 'share', 'parenting', 'reliance', 'modern', 'technology', 'rescuing', 'pet', 'pug', 'marriage', 'evolved', 'time']


In [6]:
# stemming

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print(stemmed_words)

# stemming over-truncates the words (e.g. 'relianc', 'technolog'); use lemmatization instead

['follow', 'papa', 'share', 'parent', 'relianc', 'modern', 'technolog', 'rescu', 'pet', 'pug', 'marriag', 'evolv', 'time']


In [9]:
# parts of speech tagging 
pos_tagged_words = nltk.pos_tag(filtered_words)
print(pos_tagged_words)

[('follows', 'VBZ'), ('papa', 'JJ'), ('shares', 'NNS'), ('parenting', 'VBG'), ('reliance', 'NN'), ('modern', 'JJ'), ('technology', 'NN'), ('rescuing', 'VBG'), ('pet', 'JJ'), ('pug', 'JJ'), ('marriage', 'NN'), ('evolved', 'VBD'), ('time', 'NN')]


In [10]:
# define a helper function to map the pos tag to wordnet 

def get_wordnet_pos(treebank_tag) : 
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # As default pos in lemmatization is Noun
        return wordnet.NOUN

In [11]:
# pos tagging -> lemmatization 
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer()

def pos_then_lemmatize(pos_tagged_words) :
    res = []
    for pos in pos_tagged_words : 
        word = pos[0]
        pos_tag = pos[1]

        lem = lemmatizer.lemmatize(word, get_wordnet_pos(pos_tag))
        res.append(lem)
    return res

print(pos_then_lemmatize(pos_tagged_words))

['follow', 'papa', 'share', 'parent', 'reliance', 'modern', 'technology', 'rescue', 'pet', 'pug', 'marriage', 'evolve', 'time']


In [12]:
# join back the processed words 
processed_words = pos_then_lemmatize(pos_tagged_words)
new_sent = ' '.join(processed_words)

In [13]:
# Count Vectorizer - Document-Term Matrix 

cv_test = CountVectorizer(ngram_range = (1, 1), stop_words='english')  # (1, 2) would also include bigrams
cv_vectors_test = cv_test.fit_transform([new_sent])
cv_feature_names_test = cv_test.get_feature_names_out()
cv_matrix_test = pd.DataFrame(cv_vectors_test.toarray(), columns=cv_feature_names_test)
cv_matrix_test

Unnamed: 0,evolve,follow,marriage,modern,papa,parent,pet,pug,reliance,rescue,share,technology,time
0,1,1,1,1,1,1,1,1,1,1,1,1,1


In [14]:
# TF-IDF Vectorizer

tf_test = TfidfVectorizer(ngram_range = (1, 1))
tf_vectors_test = tf_test.fit_transform([new_sent])
tf_feature_names_test = tf_test.get_feature_names_out()
tfidf_matrix_test = pd.DataFrame(tf_vectors_test.toarray(), columns=tf_feature_names_test)
tfidf_matrix_test

Unnamed: 0,evolve,follow,marriage,modern,papa,parent,pet,pug,reliance,rescue,share,technology,time
0,0.27735,0.27735,0.27735,0.27735,0.27735,0.27735,0.27735,0.27735,0.27735,0.27735,0.27735,0.27735,0.27735
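
Every entry equals 0.27735 because, with a single document, each of the 13 terms occurs once and receives the same weight. Assuming scikit-learn's default settings for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), the smoothed IDF is ln((1 + n) / (1 + df)) + 1 and each row is scaled to unit Euclidean length. The sketch below reproduces the number by hand; the variable names are illustrative.

```python
import math

n_docs = 1       # only one sentence was vectorized
doc_freq = 1     # every term appears in that one document
term_count = 1   # and appears exactly once
n_terms = 13     # number of columns in the table above

idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1   # ln(2/2) + 1 = 1.0
raw_weight = term_count * idf                       # 1.0 for every term
l2_norm = math.sqrt(n_terms * raw_weight ** 2)      # sqrt(13)
tfidf_value = raw_weight / l2_norm                  # 1 / sqrt(13)

print(round(tfidf_value, 5))  # 0.27735
```

With only one document the IDF term carries no information, so this test mainly confirms that the pipeline runs end to end rather than showing how TF-IDF separates documents.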
