# Homework 2 (Due 6:29pm PST April 6th, 2021): Word Vectorization, Regex Practice, and Similarity

A. Using the **McDonalds Yelp Review CSV file**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `hamburger` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb` (read the last section, `Vectorization Techniques`).

I do not want redundant features - for instance, I do not want `hamburgers` and `hamburger` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 
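
To make the redundancy concern concrete, here is a minimal standard-library sketch. The toy reviews and the helper names `vocabulary` and `normalize` are invented for illustration and are not part of the assignment files. It is not `CountVectorizer` itself; it only shows that every unnormalized variant becomes its own feature.

```python
import re

# Toy reviews, invented for illustration (not taken from the Yelp file).
reviews = [
    "the hamburgers were cold",
    "my hamburger was cold",
    "cold cheeseburger again",
]

def vocabulary(docs):
    """Return the sorted set of lowercase word tokens across all docs."""
    return sorted({tok for doc in docs for tok in re.findall(r"[a-z]+", doc.lower())})

def normalize(doc):
    """Collapse hamburger(s)/cheeseburger(s) into the single token 'burger'."""
    return re.sub(r"\b(?:ham|cheese)?burgers?\b", "burger", doc.lower())

before = vocabulary(reviews)
after = vocabulary([normalize(r) for r in reviews])

print(len(before), before)
print(len(after), after)
```

In this toy case the three burger variants collapse into one column, so the vocabulary shrinks from 9 tokens to 7.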

In [130]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer, SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

# import nltk
# nltk.download('wordnet')

In [131]:
mac_df = pd.read_csv("mcdonalds-yelp-negative-reviews.csv", encoding="latin1")

In [132]:
mac_df.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [133]:
vectorizer = CountVectorizer(stop_words="english", binary=True)

X = vectorizer.fit_transform(mac_df["review"])

In [134]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires 1.0+
vec_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).T

vec_df["num_count"] = vec_df.sum(axis=1)

In [139]:
vec_df.iloc[50:100, :]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1516,1517,1518,1519,1520,1521,1522,1523,1524,num_count
150,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
15dollars,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
15mins,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
15minutes,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
15so,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
16,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
17,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
170,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
179,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
17p,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [68]:
vec_df.sort_values("num_count", ascending=False)
#     .drop(list(set(stopwords.words('english'))), errors="ignore")\
#     .head(50)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1516,1517,1518,1519,1520,1521,1522,1523,1524,num_count
food,0,1,0,0,1,0,1,0,1,1,...,1,0,0,0,0,1,1,1,0,574
order,1,0,1,0,1,0,0,1,1,1,...,1,1,1,1,0,0,0,1,1,515
mcdonald,0,0,0,0,1,1,1,0,0,1,...,0,0,0,0,0,0,1,1,0,486
drive,1,0,0,0,0,0,0,1,1,0,...,0,1,0,0,0,0,0,0,0,473
service,0,1,0,0,1,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,423
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
olivia,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
olmsted,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
discrepancy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
omen,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [69]:
vec_df.loc["burger"]

0             0
1             0
2             0
3             0
4             0
             ..
1521          0
1522          0
1523          0
1524          1
num_count    91
Name: burger, Length: 1526, dtype: int64

Several of the 50 most frequent words still need cleaning. For example, `mcdonald` and `mcdonalds` are the 3rd and 7th most frequent words; combined, they would be the most common word in the reviews. Stemming and regex cleaning will resolve this.
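
Because the vectorizer is run with `binary=True`, `num_count` is the number of reviews that contain a term, not the number of times it occurs. The combined figure for two variants is therefore the number of reviews containing either form, which can be smaller than the sum of the two rows. A small sketch with invented values (not taken from the Yelp data):

```python
# Toy binary document-term rows (1 = term appears in that review).
# Values are invented for illustration, not taken from the Yelp data.
mcdonald = [1, 0, 1, 1, 0]
mcdonalds = [0, 1, 1, 0, 0]

# Document frequency of each variant (what num_count holds with binary=True).
df_mcdonald = sum(mcdonald)
df_mcdonalds = sum(mcdonalds)

# After merging the variants into one token, a review counts once
# if it contains either form, so combine the rows with a logical OR.
merged = [a | b for a, b in zip(mcdonald, mcdonalds)]
df_merged = sum(merged)

print(df_mcdonald + df_mcdonalds, df_merged)
```

Here the naive sum is 5, but only 4 reviews contain either variant, because one review uses both.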

## Preliminary processing

In [70]:
# Make all lowercase
mac_df["review"] = mac_df["review"].str.lower()

## Regex Cleaning

In [71]:
# tp = mac_df.loc[mac_df["review"].str.extract(r"(?:a|the|chicken|cheese|ham)?\s*(burger)s?").notnull()[0], "review"]
# idx = tp.index

# # # mac_df["review"].str.extract(r"(big\s*mac)s?").notnull()

In [72]:
# tp[idx[1]]

# # # tp[idx[0]].findall(r"(a|the|chicken|cheese|ham)?\s*(burger)s?")

In [73]:
# Hamburger Variation
n_burgers = mac_df["review"].str.contains(r"(?:a|the|chicken|cheese|ham)?\s*burgers?").sum()
print(f"Number of Burgers: {n_burgers}")

# mac_df["review"].str.findall(r"(?:(?:chicken)|(?:cheese)|(?:ham))?\s*(burger)s?")\
#     .apply(lambda x: ' '.join(set(x)) if len(x) > 0 else '')\
#     .unique()

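# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
mac_df["review"] = mac_df["review"].str.replace(r"(?:a|the|chicken|cheese|ham)?\s*burgers?", "burger", regex=True)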

Number of Burgers: 177
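
One caveat with the pattern in this cell: `\s*` sits outside the optional prefix, so whitespace in front of a bare `burger` is part of the match and is dropped by the replacement. An unanchored `a` can also match the last letter of the preceding word. The sketch below uses plain `re` on toy strings. The `anchored` pattern is a hypothetical alternative, not what the notebook ran. Both patterns match any string that contains `burger`, so the `str.contains` count should be the same under either.

```python
import re

# Pattern as used in the notebook cell.
original = r"(?:a|the|chicken|cheese|ham)?\s*burgers?"
# Hypothetical alternative: the prefix is anchored at a word boundary and
# whitespace is consumed only together with a matched prefix.
anchored = r"(?:\b(?:a|the|chicken|cheese|ham)\s*)?burgers?"

samples = ["my burgers", "extra burger", "a cheeseburger", "the hamburgers"]

for s in samples:
    print(f"{s!r:18} -> {re.sub(original, 'burger', s)!r:14} | {re.sub(anchored, 'burger', s)!r}")
```

With the original pattern, `my burgers` becomes `myburger` and `extra burger` becomes `extrburger`, which would create new spurious features; the anchored variant leaves the neighbouring word intact.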


In [74]:
# temp = mac_df["review"].str.findall(r"(?:(?:chicken)|(?:cheese)|(?:ham))?\s*burgers?")\
#     .apply(lambda x: 'x '.join(set(x)) if len(x) > 0 else '')

In [75]:
# mac_df.loc[temp != '', "review"]

In [76]:
# idx1 = mac_df.loc[temp != '', "review"].index

In [77]:
# idx1

In [78]:
# Big Macs
n_bigmacs = mac_df["review"].str.contains(r"big\s*macs?").sum()
print(f"Number of Big Macs: {n_bigmacs}")

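# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
mac_df["review"] = mac_df["review"].str.replace(r"big\s*macs?", "bigmac", regex=True)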

Number of Big Macs: 55


In [79]:
# McDonald's
n_mcds = mac_df["review"].str.contains(r"(?:\bmcdonald(?:'?s?)?\b)|(?:\bmcds?\b)").sum()
print(f"Number of McDonald's: {n_mcds}")

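# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
mac_df["review"] = mac_df["review"].str.replace(r"(?:\bmcdonald(?:'?s?)?\b)|(?:\bmcds?\b)", "mcdonald", regex=True)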

Number of McDonald's: 891
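
As a quick check of what the McDonald's pattern covers, the same expression can be run with plain `re` on a few toy strings (the sample list is invented for illustration):

```python
import re

# Same pattern as the cell above, applied with plain `re` to toy strings.
MCD = re.compile(r"(?:\bmcdonald(?:'?s?)?\b)|(?:\bmcds?\b)")

samples = ["mcdonald's", "mcdonalds", "mcdonald", "mcds", "mcd", "mcdouble"]
normalized = [MCD.sub("mcdonald", s) for s in samples]
print(normalized)
```

The word boundaries keep menu items such as `mcdouble` from being rewritten, while the possessive, plural, and abbreviated spellings all collapse to `mcdonald`.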


In [35]:
# ss =mac_df["review"].str.findall(r"(?:(?:chicken)|(?:cheese)|(?:ham))?\s*burgers?")\
#     .apply(lambda x: 'x '.join(set(x)) if len(x) > 0 else '')

# idx2 = mac_df.loc[ss != '', "review"].index

In [36]:
# idx2

In [37]:
# idx3 = mac_df.loc[idx1[~idx1.isin(idx2)], "review"].index

In [38]:
# mac_df.loc[idx1[~idx1.isin(idx2)], "review"][idx3[1]]

In [None]:
# Numbers!!!
# \d{1,2}\s?:?\s?\d{1,2}\s?(?:am|pm|ish|min(?:utes?|s?)|)
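
The first vectorization pass produced columns such as `15mins`, `15minutes`, and `17p`, so numeric tokens are one remaining source of redundant features. The sketch below is one possible way to develop the placeholder regex above, not something the notebook ran. The two patterns and the placeholder tokens `clocktime` and `durationmin` are assumptions made for illustration.

```python
import re

# Hypothetical patterns and placeholder tokens; the notebook does not define these.
# Clock times such as "9:30", "9:30pm" or "9:30 pm".
CLOCK = re.compile(r"\b\d{1,2}\s?:\s?\d{2}(?:\s?(?:am|pm))?\b")
# Durations such as "15mins", "15 min" or "15minutes".
DURATION = re.compile(r"\b\d+\s?(?:minutes?|mins?)\b")

def normalize_numbers(text):
    """Replace clock times and minute durations with single placeholder tokens."""
    text = CLOCK.sub("clocktime", text)
    text = DURATION.sub("durationmin", text)
    return text

print(normalize_numbers("came in at 9:30pm, waited 15mins"))
print(normalize_numbers("at 9:30 we left after 15 minutes"))
```

Keeping the optional `am`/`pm` and its leading space inside one group means a bare `9:30` does not swallow the space that follows it.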

## Stemming

In [80]:
# stemmer = PorterStemmer()
stemmer = SnowballStemmer("english")

In [81]:
def stem_review(review):
    tokens = [stemmer.stem(token) for token in review.split()]
    return ' '.join(tokens)

In [82]:
mac_df["review"] = mac_df["review"].apply(stem_review)

## CountVectorizer

In [83]:
stemmed_stopwords = [stemmer.stem(token) for token in list(set(stopwords.words('english')))]

In [105]:
# vectorizer = CountVectorizer(stop_words="english", binary=True)

# vectorizer = CountVectorizer(stop_words="english", binary=True, min_df = 0.05)

vectorizer = CountVectorizer(stop_words="english", binary=True)

X = vectorizer.fit_transform(mac_df["review"])

In [106]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires 1.0+
vec_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).T

vec_df["num_count"] = vec_df.sum(axis=1)

In [107]:
vec_df.drop(stemmed_stopwords, errors="ignore", inplace=True)

In [108]:
print(f"Total number of occurrences: {vec_df.num_count.sum()}")

vec_df.sort_values("num_count", ascending=False)
#     .head(50)

Total number of occurrences: 53451


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1516,1517,1518,1519,1520,1521,1522,1523,1524,num_count
mcdonald,1,1,0,0,1,1,1,0,1,1,...,0,0,0,1,1,1,1,1,0,891
order,1,0,1,0,1,0,0,1,1,1,...,1,1,1,1,0,0,0,1,1,649
food,0,1,0,0,1,0,1,0,1,1,...,1,0,0,0,0,1,1,1,0,575
drive,1,0,0,0,0,0,0,1,1,0,...,0,1,0,0,0,0,0,0,0,485
time,1,0,0,0,1,0,0,1,1,1,...,0,1,0,0,0,1,0,1,0,475
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
bucktown,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
opposite,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
optim,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
filets,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


B. **Stopwords, Stemming, Lemmatization Practice**

Using the `tale-of-two-cities.txt` file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then **count-vectorization**.
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, **remove punctuation**, and then perform **count-vectorization**?
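
The comparison asked for above amounts to counting distinct features under progressively stronger cleaning. The sketch below shows that structure using only the standard library. It uses two sentences from the novel's opening as a toy corpus, whitespace tokenization, and a tiny hypothetical stopword set. Stemming and lemmatization are not shown here; an NLTK stemmer or lemmatizer and `CountVectorizer` would take their place in the real exercise.

```python
import string

# Toy corpus: each string stands in for one sentence-document.
docs = [
    "It was the best of times, it was the worst of times.",
    "It was the age of wisdom, it was the age of foolishness.",
]

# Tiny hypothetical stopword set; NLTK's English list is much longer.
STOPWORDS = {"it", "was", "the", "of"}

def n_features(docs, strip_punct=False, stopwords=frozenset()):
    """Count distinct whitespace-delimited tokens after optional cleaning."""
    vocab = set()
    for doc in docs:
        doc = doc.lower()
        if strip_punct:
            doc = doc.translate(str.maketrans("", "", string.punctuation))
        vocab.update(tok for tok in doc.split() if tok not in stopwords)
    return len(vocab)

raw = n_features(docs)
no_punct = n_features(docs, strip_punct=True)
cleaned = n_features(docs, strip_punct=True, stopwords=STOPWORDS)
print(raw, no_punct, cleaned)
```

On this toy corpus the feature count drops from 11 to 10 to 6. Note that `CountVectorizer`'s default token pattern already discards punctuation, so explicit punctuation removal may change its feature count less than it does here.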

In [115]:
import re

In [111]:
with open("../Week 1/tale-of-two-cities.txt", 'r') as rt:
    text = rt.readlines()

In [120]:
g = re.search('\n', text[0])
g.group()

'\n'
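
`readlines()` returns physical lines, each ending in `'\n'`, while the task treats each sentence as a document. A minimal sketch of one way to get from lines to sentences follows. The regex-based splitter is an assumption (NLTK's `sent_tokenize` would be an alternative), and the toy lines stand in for the file contents.

```python
import re

# Toy stand-in for the list returned by readlines(); a sentence may span lines.
lines = [
    "It was the best of times,\n",
    "it was the worst of times.\n",
    "\n",
    "Was it the age of wisdom? It was\n",
    "the age of foolishness!\n",
]

def split_sentences(lines):
    """Join physical lines, then split after '.', '!' or '?' followed by whitespace."""
    text = " ".join(line.strip() for line in lines if line.strip())
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

sentences = split_sentences(lines)
print(sentences)
```

A splitter this simple will also break after abbreviations such as `Mr.`, so the resulting document count should be treated as approximate.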