<a href="https://colab.research.google.com/github/kaisun-msba/dso-560-nlp-text-analytics-SPRING-2021/blob/main/sun_kai_HW2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Homework 2

A. Using the **McDonalds Yelp Review CSV file**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `hamburger` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 
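
The regex-cleaning bullet can be illustrated with a small sketch that maps a few spellings of `hamburger` onto a single token. The variant list below is an assumption made for illustration, not something taken from the dataset, so it should be adjusted after inspecting the actual reviews.

```python
import re

# Assumed variants: "burger(s)", "hamburger(s)", "ham burger(s)", "ham-burger(s)".
# The leading \b keeps compounds such as "cheeseburger" untouched.
HAMBURGER_PATTERN = re.compile(r"\b(?:ham[\s-]?)?burgers?\b", flags=re.IGNORECASE)


def normalize_hamburger(text: str) -> str:
    """Replace every matched variant with the single token 'hamburger'."""
    return HAMBURGER_PATTERN.sub("hamburger", text)


sample = "Two Hamburgers, one ham burger and a cold burger."
print(normalize_hamburger(sample))
```

Running a substitution like this before tokenizing means the vectorizer sees a single spelling, so the variants cannot end up as separate columns.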

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb` (read the last section, `Vectorization Techniques`).

I do not want redundant features - for instance, I do not want `hamburgers` and `hamburger` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 
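
To make the redundancy concern concrete, here is a minimal pure-Python sketch of a document-term matrix. It is a simplified stand-in for `CountVectorizer`, and the plural-stripping rule is a toy assumption used only to show how normalization merges columns.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (sorted vocabulary, count rows) for whitespace-tokenized documents."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


def strip_plural(word):
    """Toy rule (assumption): drop a trailing 's' from words longer than 3 letters."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


raw_docs = ["hamburger was cold", "hamburgers were cold"]
vocab_raw, _ = document_term_matrix(raw_docs)

norm_docs = [" ".join(strip_plural(w) for w in d.split()) for d in raw_docs]
vocab_norm, rows_norm = document_term_matrix(norm_docs)

print(len(vocab_raw), vocab_raw)    # 'hamburger' and 'hamburgers' are separate columns
print(len(vocab_norm), vocab_norm)  # the two forms share one column
```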


In [None]:
import pandas as pd
df=pd.read_csv('mcdonalds-yelp-negative-reviews.csv',encoding='latin1')
df.head()

In [None]:
import nltk
nltk.download('punkt') # A popular NLTK sentence tokenizer
nltk.download('stopwords') # library of common English stopwords

In [None]:
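# Stem before count-vectorizing.
# Stemming is used instead of lemmatization because it is rule-based and needs no
# part-of-speech tags, so it merges more inflected forms (for example, `hamburger`
# and `hamburgers` share one stem) and reduces the number of columns more aggressively.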
from nltk.stem.porter import PorterStemmer
stemmer=PorterStemmer()
df['tokenized_review']=df['review'].apply(lambda x:nltk.word_tokenize(x))
df['stemmed_review']=df['tokenized_review'].apply(lambda x: [stemmer.stem(y) for y in x])
df['joined_review']=[' '.join(map(str, l)) for l in df['stemmed_review']]

df.head()

In [None]:
from nltk.corpus import stopwords
nltk_stopwords = list(stopwords.words('english'))

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
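# Stopwords specific to the McDonald's reviews
add_stopwords=['mcdonald','mcdonalds','order','food','restaurant']
# The reviews are already stemmed, so stem the stopwords as well; otherwise an entry
# such as 'restaurant' never matches its stemmed form in the text.
stemmed_stopwords=sorted({stemmer.stem(w) for w in nltk_stopwords+add_stopwords})

vectorizer = CountVectorizer(stop_words=stemmed_stopwords, token_pattern=r'\b[a-zA-Z]{3,}\b', min_df=0.05, max_df=0.4)
X = vectorizer.fit_transform(df['joined_review'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df.shape}")
print(f"Total number of occurrences: {vectorized_df.sum().sum()}")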
#print(f"Word counts: {vectorized_df.sum()}")
vectorized_df.head()
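
The `min_df` and `max_df` arguments used above filter terms by document frequency. The sketch below shows the idea in plain Python, assuming float thresholds are read as proportions of the corpus. It is not scikit-learn's implementation, and the toy reviews and the `filter_by_document_frequency` helper are made up for illustration.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose share of documents lies between min_df and max_df."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats inside it.
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, n in doc_freq.items()
        if min_df * n_docs <= n <= max_df * n_docs
    )


toy_reviews = (
    ["the fries cold"] * 3
    + ["the service slow"] * 3
    + ["the drink"] * 3
    + ["the manager rude"]
)
kept_terms = filter_by_document_frequency(toy_reviews, min_df=0.2, max_df=0.4)
print(kept_terms)
```

Here `the` appears in every review and is dropped by the upper bound, while `manager` and `rude` appear in only one review and are dropped by the lower bound.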

B. Stopwords, Stemming, Lemmatization Practice

Using the tale-of-two-cities.txt file from Week 1:

Count-vectorize the corpus. Treat each sentence as a document.
How many features (dimensions) do you get when you:

1. Perform stemming and then count-vectorization.
2. Perform lemmatization and then count-vectorization.
3. Perform lemmatization, remove stopwords, remove punctuation, and then perform count-vectorization.
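
The comparison asked for above comes down to counting distinct tokens after each preprocessing variant. The following is a rough sketch using only the standard library. `toy_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer, and the sentences and stopword set are illustrative assumptions.

```python
import re


def toy_stem(word):
    """Crude suffix stripping (assumption), not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def vocabulary_size(sentences, normalize=lambda w: w, stop_words=frozenset()):
    """Count distinct alphabetic tokens after normalization and stopword removal."""
    vocab = set()
    for sentence in sentences:
        # Keeping only letter runs also discards punctuation.
        for token in re.findall(r"[a-z]+", sentence.lower()):
            token = normalize(token)
            if token not in stop_words:
                vocab.add(token)
    return len(vocab)


sentences = [
    "It was the best of times.",
    "It was the worst of times.",
    "It was a time of hope.",
]
toy_stop_words = frozenset({"it", "was", "the", "of", "a"})

print(vocabulary_size(sentences))                                   # no normalization
print(vocabulary_size(sentences, normalize=toy_stem))               # normalized tokens
print(vocabulary_size(sentences, toy_stem, toy_stop_words))         # plus stopword removal
```

Each added preprocessing step can only keep or shrink the vocabulary, which is the pattern the three questions are meant to reveal.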

In [None]:
with open("tale-of-two-cities.txt", "r") as text_file:
  lines=text_file.read().replace("\n", " ")
lines

In [None]:
# tokenize the text into sentences
lines=nltk.sent_tokenize(lines)
len(lines)

In [None]:
# 1.Perform stemming and then count-vectorization
import nltk
from nltk.stem.porter import PorterStemmer
stemmer=PorterStemmer()
documents=[]
for line in lines:
  if len(line)>0:
    line=[stemmer.stem(word) for word in nltk.word_tokenize(line)]
    documents.append(line)
documents=[' '.join(word_list) for word_list in documents]
len(documents)

In [None]:
# countVectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
#roman_nums=['xii',	'xiii',	'xiv',	'xix',	'xvi',	'xvii',	'xviii',	'xxi',	'xxii',	'xxiv','xxiii']
#stop_words = set(stopwords.words('english') + roman_nums)
vectorizer_1 = CountVectorizer()
X_1 = vectorizer_1.fit_transform(documents)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorized_df_1 = pd.DataFrame(X_1.toarray(), columns=vectorizer_1.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df_1.shape}")
print(f"Total number of occurrences: {vectorized_df_1.sum().sum()}")
vectorized_df_1

In [None]:
# 2.Perform lemmatization and then count-vectorization.
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# lemmatize() defaults to pos='n', so without POS tags only noun forms are reduced
lemmatizer = WordNetLemmatizer()
documents=[]
for line in lines:
  if len(line)>0:
    line=[lemmatizer.lemmatize(word) for word in nltk.word_tokenize(line)]
    documents.append(line)
documents=[' '.join(word_list) for word_list in documents]

In [None]:
# countVectorizer
#roman_nums=['xii',	'xiii',	'xiv',	'xix',	'xvi',	'xvii',	'xviii',	'xxi',	'xxii',	'xxiv','xxiii']
# stop_words = set(stopwords.words('english') + roman_nums)
vectorizer_2 = CountVectorizer()
X_2 = vectorizer_2.fit_transform(documents)
vectorized_df_2 = pd.DataFrame(X_2.toarray(), columns=vectorizer_2.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df_2.shape}")
print(f"Total number of occurrences: {vectorized_df_2.sum().sum()}")
vectorized_df_2

In [None]:
# 3. Perform lemmatization, remove stopwords, remove punctuation, and then perform count-vectorization.
# `documents` still holds the lemmatized sentences from step 2.
# Chapter numbers appear as Roman numerals, so treat them as stopwords too.
roman_nums=['xii', 'xiii', 'xiv', 'xix', 'xvi', 'xvii', 'xviii', 'xxi', 'xxii', 'xxiv', 'xxiii']
# CountVectorizer expects stop_words as a list; recent scikit-learn versions reject a set
stop_words = stopwords.words('english') + roman_nums
# The token pattern keeps only runs of 3+ letters, which also drops punctuation and digits
vectorizer_3 = CountVectorizer(stop_words=stop_words, token_pattern=r'\b[a-zA-Z]{3,}\b')
X_3 = vectorizer_3.fit_transform(documents)
vectorized_df_3 = pd.DataFrame(X_3.toarray(), columns=vectorizer_3.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df_3.shape}")
print(f"Total number of occurrences: {vectorized_df_3.sum().sum()}")
vectorized_df_3

In [None]:
# The third method yields fewer features than the first two, because stopwords,
# Roman numerals, and tokens shorter than three letters are dropped.