# Homework 2 (Due 6:29pm PST April 6th, 2021): Word Vectorization, Regex Practice, and Similarity

A. Using the **McDonalds Yelp Review CSV file**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `hamburger` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb` (read the last section, `Vectorization Techniques`).

I do not want redundant features - for instance, I do not want `hamburgers` and `hamburger` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 

In [310]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer, SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

# import nltk
# nltk.download('wordnet')
# nltk.download('stopwords')

In [353]:
mac_df = pd.read_csv("mcdonalds-yelp-negative-reviews.csv", encoding="latin1")

In [354]:
mac_df.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [355]:
vectorizer = CountVectorizer(stop_words="english", binary=True)

X = vectorizer.fit_transform(mac_df["review"])

In [356]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vec_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).T

vec_df["num_count"] = vec_df.sum(axis=1)

In [357]:
vec_df.sort_values("num_count", ascending=False)\
    .head(50)
#     .drop(list(set(stopwords.words('english'))), errors="ignore")

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1516,1517,1518,1519,1520,1521,1522,1523,1524,num_count
food,0,1,0,0,1,0,1,0,1,1,...,1,0,0,0,0,1,1,1,0,574
order,1,0,1,0,1,0,0,1,1,1,...,1,1,1,1,0,0,0,1,1,515
mcdonald,0,0,0,0,1,1,1,0,0,1,...,0,0,0,0,0,0,1,1,0,486
drive,1,0,0,0,0,0,0,1,1,0,...,0,1,0,0,0,0,0,0,0,473
service,0,1,0,0,1,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,423
just,0,1,1,0,0,0,0,0,0,1,...,0,1,0,1,0,0,1,1,1,419
time,1,0,0,0,1,0,0,1,1,1,...,0,1,0,0,0,1,0,1,0,394
mcdonalds,0,1,0,0,0,0,0,0,1,0,...,0,0,0,1,1,1,0,0,0,389
like,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,1,1,361
place,0,0,0,0,1,0,0,0,0,1,...,0,0,0,1,0,0,0,1,1,350


Several of the 50 most frequent words still need cleaning. For example, `mcdonald` and `mcdonalds` are the 3rd and 8th most frequent words, and together they become the most common word in the reviews. Stemming and regex cleaning will resolve this.
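Because the vectorizer above uses `binary=True`, `num_count` is a document frequency (how many reviews contain the word), not a raw occurrence count. Merging two spellings therefore yields the number of reviews containing either one, which can be smaller than the sum of the two separate counts. A minimal sketch of that effect, using made-up reviews and a hypothetical helper `document_frequency` (neither comes from the dataset or the notebook):

```python
import re

# Toy reviews (made up for illustration; not taken from the dataset).
reviews = [
    "mcdonalds was slow and mcdonald staff were rude",
    "this mcdonald is dirty",
    "mcdonalds lost my order",
    "the fries were cold",
]

def document_frequency(docs, term):
    """Count the documents that contain `term` as a whole token."""
    return sum(term in doc.split() for doc in docs)

df_mcdonald = document_frequency(reviews, "mcdonald")
df_mcdonalds = document_frequency(reviews, "mcdonalds")

# Merge the two spellings into one token, then count again.
merged = [re.sub(r"\bmcdonalds?\b", "mcdonald", doc) for doc in reviews]
df_merged = document_frequency(merged, "mcdonald")

print(df_mcdonald, df_mcdonalds, df_merged)  # 2 2 3
```

The first toy review contains both spellings, so it is counted once after merging: 3 rather than 2 + 2.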

## Preliminary processing

In [358]:
# Make all lowercase
mac_df["review"] = mac_df["review"].str.lower()

## Regex Cleaning

In [359]:
# Hamburger Variation
n_burgers = mac_df["review"].str.contains(r"\w*\s*burgers?").sum()
print(f"Number of Burgers: {n_burgers}")

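# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
mac_df["review"] = mac_df["review"].str.replace(r"\w*\s*burgers?", "burger", regex=True)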

Number of Burgers: 177
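It is worth checking what the burger pattern actually matches. The leading `\w*\s*` lets it merge `ham burger` into one token, but it also consumes whatever word precedes a standalone `burger`. The sketch below compares the pattern from the cell above with an assumed alternative, `bounded`, that stays inside a single word. The sample strings are made up.

```python
import re

greedy = r"\w*\s*burgers?"      # pattern used in the cell above
bounded = r"\b\w*burgers?\b"    # assumed alternative: never crosses a space

samples = [
    "two hamburgers",
    "a ham burger",
    "the burger was cold",
]

for text in samples:
    print(
        repr(text), "->",
        repr(re.sub(greedy, "burger", text)), "|",
        repr(re.sub(bounded, "burger", text)),
    )
# 'two hamburgers' -> 'two burger' | 'two burger'
# 'a ham burger' -> 'a burger' | 'a ham burger'
# 'the burger was cold' -> 'burger was cold' | 'the burger was cold'
```

Neither variant is strictly better: the first drops a neighbouring word (`the` here, but it could be a meaningful one), while the second leaves `ham` behind as its own token.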


In [360]:
# Big Macs
n_bigmacs = mac_df["review"].str.contains(r"big\s*macs?").sum()
print(f"Number of Big Macs: {n_bigmacs}")

mac_df["review"] = mac_df["review"].str.replace(r"big\s*macs?", "bigmac", regex=True)

Number of Big Macs: 55


In [361]:
# McDonald's
n_mcds = mac_df["review"].str.contains(r"(?:\bmcdonald(?:'?s?)?\b)|(?:\bmcds?\b)").sum()
print(f"Number of McDonald's: {n_mcds}")

mac_df["review"] = mac_df["review"].str.replace(r"(?:\bmcdonald(?:'?s?)?\b)|(?:\bmcds?\b)", "mcdonald", regex=True)

Number of McDonald's: 891


In [362]:
# Numbers
mac_df["review"] = mac_df["review"].str.replace(r"\d+\S*\d*\w*", "NUM_TOKEN", regex=True)

In [363]:
# Punctuation Removal
# Inside a character class "|" is a literal, so it only needs to be listed once
mac_df["review"] = mac_df["review"].str.replace(r"[!@#$%^&*()+<>?:.,;\"'\\|]", ' ', regex=True)

In [364]:
# Whitespace
mac_df["review"] = mac_df["review"].str.replace(r"\s{2,}", ' ', regex=True)
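The order of the number, punctuation, and whitespace steps matters. Replacing numbers first lets `9:30pm` collapse into a single `NUM_TOKEN`; stripping punctuation first would split it into two numeric pieces. A small sketch of both orders with plain `re`, using a hypothetical helper `clean` and a made-up sample string (the final `.strip()` is an addition that the notebook cells do not perform):

```python
import re

NUM = r"\d+\S*\d*\w*"
PUNCT = r"[!@#$%^&*()+<>?:.,;\"'\\|]"

def clean(text, numbers_first=True):
    """Apply the number and punctuation substitutions, then collapse spaces."""
    steps = [(NUM, "NUM_TOKEN"), (PUNCT, " ")]
    if not numbers_first:
        steps.reverse()
    for pattern, repl in steps:
        text = re.sub(pattern, repl, text)
    # .strip() is not part of the notebook; it only tidies the printed result.
    return re.sub(r"\s{2,}", " ", text).strip()

sample = "came in at 9:30pm, paid $5.99!"
print(clean(sample))
# came in at NUM_TOKEN paid NUM_TOKEN
print(clean(sample, numbers_first=False))
# came in at NUM_TOKEN NUM_TOKEN paid NUM_TOKEN NUM_TOKEN
```

`CountVectorizer` lowercases by default, which is why the token later shows up as `num_token`.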

## Stemming

In [365]:
# stemmer = PorterStemmer()
stemmer = SnowballStemmer("english")

In [366]:
def stmmer_func(review):
    tokens = [stemmer.stem(token) for token in review.split()]
    return ' '.join(tokens)

In [367]:
mac_df["review"] = mac_df["review"].apply(stmmer_func)

## CountVectorizer

In [368]:
stemmed_stopwords = [stemmer.stem(token) for token in list(set(stopwords.words('english')))]

In [369]:
# vectorizer = CountVectorizer(stop_words="english", binary=True)

# vectorizer = CountVectorizer(stop_words="english", binary=True, min_df = 0.05)

vectorizer = CountVectorizer(stop_words="english", binary=True)

X = vectorizer.fit_transform(mac_df["review"])

In [370]:
vec_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).T

vec_df["num_count"] = vec_df.sum(axis=1)

In [371]:
vec_df.drop(stemmed_stopwords, errors="ignore", inplace=True)
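The reason for stemming the stop word list before dropping rows: once the reviews are stemmed, a stop word such as `because` no longer appears in its dictionary form, so an unstemmed list misses it. A minimal sketch of that mismatch. The stems in `assumed_stems` are written by hand for illustration and are an assumption about what a Snowball-style stemmer produces; no stemmer is called here.

```python
# Small illustrative subset of an unstemmed stop word list.
stop_words = {"because", "very", "the"}

# Hand-written stems (assumed, not computed by SnowballStemmer here).
assumed_stems = {"because": "becaus", "very": "veri", "the": "the"}

# Tokens of a toy review after stemming.
stemmed_tokens = ["the", "servic", "veri", "slow", "becaus", "staff"]

# Filtering with the unstemmed list leaves the stemmed stop words behind.
kept_unstemmed = [t for t in stemmed_tokens if t not in stop_words]

# Filtering with the stemmed list removes them.
stemmed_stop_words = {assumed_stems[w] for w in stop_words}
kept_stemmed = [t for t in stemmed_tokens if t not in stemmed_stop_words]

print(kept_unstemmed)  # ['servic', 'veri', 'slow', 'becaus', 'staff']
print(kept_stemmed)    # ['servic', 'slow', 'staff']
```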

In [372]:
print(f"Total number of occurrences: {vec_df.num_count.sum()}")
print(f"Dimension of the CountVector: {vec_df.shape}")

vec_df.sort_values("num_count", ascending=False)\
    .head(50)

Total number of occurrences: 52147
Dimension of the CountVector: (5684, 1526)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1516,1517,1518,1519,1520,1521,1522,1523,1524,num_count
mcdonald,1,1,0,0,1,1,1,0,1,1,...,0,0,0,1,1,1,1,1,0,890
num_token,0,1,1,1,1,0,0,1,1,0,...,1,0,1,0,1,0,0,1,0,714
order,1,0,1,0,1,0,0,1,1,1,...,1,1,1,1,0,0,0,1,1,682
food,0,1,0,0,1,0,1,0,1,1,...,1,0,0,0,0,1,1,1,0,577
time,1,0,0,0,1,0,0,1,1,1,...,0,1,0,0,1,1,0,1,0,503
drive,1,0,0,0,0,0,0,1,1,0,...,0,1,0,0,0,0,0,0,0,486
servic,0,1,0,0,1,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,428
place,0,0,0,0,1,0,0,0,0,1,...,0,0,0,1,0,0,0,1,1,387
like,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,1,1,370
locat,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,331


B. **Stopwords, Stemming, Lemmatization Practice**

Using the `tale-of-two-cities.txt` file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then **count-vectorization**.
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, **remove punctuation**, and then perform **count-vectorization**?

In [243]:
import re

In [250]:
with open("../Week 1/tale-of-two-cities.txt", 'r') as rt:
    text = rt.readlines()
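Note that `readlines()` yields one entry per line of the file, while the prompt asks for one document per sentence. If sentence-level documents are wanted, one possible approach is to join the wrapped lines and split after sentence-ending punctuation. The sketch below uses a hypothetical helper `split_sentences` and toy input. It is a rough heuristic that will mis-split abbreviations such as `Mr.`; a trained tokenizer such as `nltk.sent_tokenize` is likely to handle those better.

```python
import re

def split_sentences(lines):
    """Join wrapped lines, then split after ., ! or ? followed by whitespace."""
    joined = " ".join(line.strip() for line in lines if line.strip())
    return [s for s in re.split(r"(?<=[.!?])\s+", joined) if s]

# Toy input wrapped across lines the way a plain-text book often is.
lines = [
    "It was a dark night. The rain\n",
    "fell hard!\n",
    "\n",
    "Who was there?\n",
]

sentences = split_sentences(lines)
print(len(lines), "lines ->", len(sentences), "sentences")  # 4 lines -> 3 sentences
print(sentences)
# ['It was a dark night.', 'The rain fell hard!', 'Who was there?']
```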

In [248]:
def stemmer_func(review):
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in review.split()]
    return ' '.join(tokens)

In [376]:
def lemmatizer_func(review):
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in review.split()]
    return ' '.join(tokens)
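Both helpers tokenize with `review.split()`, which leaves punctuation attached to words (`times,` rather than `times`). A stemmer or lemmatizer then sees a token it may not recognize and can leave it unchanged. The sketch below shows the effect with a toy lookup table standing in for a real lemmatizer; `toy_lemmas` and `normalize` are hypothetical names used only for illustration.

```python
import re

# Toy lemma table standing in for a real lemmatizer (illustrative only).
toy_lemmas = {"times": "time", "cities": "city"}

def normalize(tokens):
    """Map each token to its lemma if the table knows it, else keep it."""
    return [toy_lemmas.get(tok, tok) for tok in tokens]

line = "the best of times, a tale of two cities."

by_split = normalize(line.split())              # punctuation stays attached
by_regex = normalize(re.findall(r"\w+", line))  # word characters only

print(by_split)
# ['the', 'best', 'of', 'times,', 'a', 'tale', 'of', 'two', 'cities.']
print(by_regex)
# ['the', 'best', 'of', 'time', 'a', 'tale', 'of', 'two', 'city']
```

`CountVectorizer` strips the punctuation later with its own token pattern, but by then the unnormalized forms are already in the text.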

In [251]:
text = pd.Series(text)

In [255]:
text = text.str.lower()

In [258]:
# Case 1
text1 = text.apply(stemmer_func)

In [259]:
text1

0        it wa the best of times, it wa the worst of ti...
1        age of wisdom, it wa the age of foolishness, i...
2        belief, it wa the epoch of incredulity, it wa ...
3        it wa the season of darkness, it wa the spring...
4        winter of despair, we had everyth befor us, we...
                               ...                        
12865    hear him tell the child my story, with a tende...
12866    "it is a far, far better thing that i do, than...
12867    it is a far, far better rest that i go to than...
12868                                                     
12869                                            -the end-
Length: 12870, dtype: object

In [260]:
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(text1)

vec_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).T

In [266]:
print(f"Case 1 has {vec_df.shape[0]} features")

Case 1 has 8749 features


In [267]:
# Case 2
text2 = text.apply(lemmatizer_func)

In [268]:
text2

0        it wa the best of times, it wa the worst of ti...
1        age of wisdom, it wa the age of foolishness, i...
2        belief, it wa the epoch of incredulity, it wa ...
3        it wa the season of darkness, it wa the spring...
4        winter of despair, we had everything before us...
                               ...                        
12865    hear him tell the child my story, with a tende...
12866    "it is a far, far better thing that i do, than...
12867    it is a far, far better rest that i go to than...
12868                                                     
12869                                            -the end-
Length: 12870, dtype: object

In [269]:
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(text2)

vec_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).T

In [270]:
print(f"Case 2 has {vec_df.shape[0]} features")

Case 2 has 9349 features


In [374]:
# Case 3
text3 = text.apply(lemmatizer_func)

# Remove Punctuation
text3 = text3.str.replace(r"[!@#$%^&*()+<>?:.,;\"'\\|]", ' ', regex=True)

# Whitespace Removal  
text3 = text3.str.replace(r"\s{2,}", ' ', regex=True)

In [384]:
vectorizer = CountVectorizer(stop_words="english")

X = vectorizer.fit_transform(text3)

vec_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).T

In [385]:
print(f"Case 3 has {vec_df.shape[0]} features")

Case 3 has 9065 features
