# Homework 2 (Due 11:59pm PST March 31st, 2020): Word Vectorization, Regex Practice, and Similarity

**Name: Sam Dong Wook Ko**

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you just want to be assigned someone to work with, message me on Slack and I will assign you a partner to work with.

A. Using the **McDonalds Yelp Review CSV file**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `hamburger` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb` (read the last section, `Vectorization Techniques`).

I do not want redundant features - for instance, I do not want `hamburgers` and `hamburger` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 

In [1]:
import pandas as pd
from collections import Counter

In [2]:
# Reading the file as UTF-8 failed with: "'utf-8' codec can't decode byte 0xce in position 3125: invalid continuation byte",
# so the file is read as latin-1 instead. The with-block closes the file once it has been read.
with open("mcdonalds-yelp-negative-reviews.csv", "r", encoding="latin-1") as mcdonalds_csv:
    mcdonalds = mcdonalds_csv.read()
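The decoding error mentioned above can be reproduced on a few bytes. This is a minimal sketch with made-up bytes (not taken from the review file): `0xce` starts a two-byte UTF-8 sequence, so it must be followed by a continuation byte, whereas latin-1 maps every byte to a character and never fails. Note that latin-1 succeeding does not guarantee the characters are the ones the author typed.

```python
# Hypothetical bytes for illustration: 0xce is followed by a space,
# which is not a valid UTF-8 continuation byte.
raw = b"caf\xce latte"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # exc.reason holds the short description of what went wrong
    utf8_error = exc.reason

# latin-1 maps each of the 256 byte values to one code point, so this cannot raise
latin1_text = raw.decode("latin-1")

print(utf8_error)
print(latin1_text)
```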

# Removing Stop Words

In [3]:
from typing import List
words: List[str] = mcdonalds.split()
mcdonalds_counter = Counter(words)
mcdonalds_counter.most_common(10)

[('the', 6203),
 ('I', 4077),
 ('and', 4070),
 ('to', 3953),
 ('a', 3426),
 ('of', 1990),
 ('is', 1864),
 ('was', 1771),
 ('in', 1707),
 ('for', 1617)]

In [4]:
# First, I converted all the words to lowercase
mcdonalds = mcdonalds.lower()
words: List[str] = mcdonalds.split()
mcdonalds_counter = Counter(words)
mcdonalds_counter.most_common(10)

[('the', 6903),
 ('i', 4295),
 ('and', 4229),
 ('to', 3977),
 ('a', 3485),
 ('of', 2006),
 ('is', 1886),
 ('was', 1781),
 ('in', 1758),
 ('for', 1653)]

In [5]:
# Next, I wanted to choose my stopwords
# There are 2 different lists of stopwords (according to what we have learned so far + tips & tricks notebook)
# First one from gensim package
from gensim.parsing.preprocessing import STOPWORDS
stopwords_gensim = list(STOPWORDS)
stopwords_gensim.sort()
stopwords_gensim

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'bill',
 'both',
 'bottom',
 'but',
 'by',
 'call',
 'can',
 'cannot',
 'cant',
 'co',
 'computer',
 'con',
 'could',
 'couldnt',
 'cry',
 'de',
 'describe',
 'detail',
 'did',
 'didn',
 'do',
 'does',
 'doesn',
 'doing',
 'don',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eg',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'etc',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifte

In [6]:
# Second one from nltk package
from nltk.corpus import stopwords
stopwords_NLTK = list(stopwords.words("english"))
stopwords_NLTK.sort()
stopwords_NLTK

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [7]:
# I decided to combine these two lists of stopwords

stopwords_combined = stopwords_gensim + stopwords_NLTK
print(len(stopwords_combined))

516


In [8]:
# Then, I removed the duplicates. Note that the length of the list gets reduced greatly. 

stopwords_combined = list(set(stopwords_combined)) 
print(len(stopwords_combined))

390
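The concatenate-then-deduplicate steps above can also be expressed as a set union. The sketch below uses two tiny made-up lists as stand-ins for the gensim and NLTK stopword lists, so the numbers are illustrative only.

```python
# Tiny stand-ins for the two stopword lists (assumed values, not the real lists)
list_a = ["a", "about", "the", "computer"]
list_b = ["a", "the", "don't", "ain"]

# Concatenation keeps duplicates; the union keeps each word once
concatenated = list_a + list_b
combined = sorted(set(list_a) | set(list_b))

print(len(concatenated), len(combined))
print(combined)
```

Sorting the result is optional, but it makes the list order reproducible, which a plain `list(set(...))` does not guarantee.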


In [9]:
# Next, I added some custom stopwords for this specific dataset

custom_stopwords = ['mcdonald', 'mcdonalds', "mcdonald's", "mcdonalds'", 'mcds']
stopwords_combined += custom_stopwords
stopwords_combined

['hereupon',
 "needn't",
 'mine',
 'mightn',
 'beyond',
 'made',
 'none',
 'more',
 'just',
 'nine',
 "wasn't",
 'either',
 'neither',
 'once',
 'twenty',
 'were',
 'there',
 're',
 'under',
 'own',
 'each',
 'hence',
 'besides',
 'would',
 'sixty',
 'over',
 'anyhow',
 "shan't",
 'did',
 'thereafter',
 'hadn',
 'also',
 'hereby',
 'our',
 'still',
 'is',
 'twelve',
 'does',
 'only',
 'eg',
 'us',
 'wherever',
 'interest',
 'wouldn',
 "won't",
 'hers',
 'became',
 'same',
 'through',
 'won',
 'from',
 'most',
 'except',
 'forty',
 'if',
 'not',
 'she',
 "hasn't",
 'which',
 'herself',
 'since',
 'thus',
 'describe',
 'eight',
 'enough',
 'very',
 'between',
 'whence',
 'an',
 'these',
 'whom',
 'five',
 'much',
 'across',
 'how',
 'thick',
 'ourselves',
 'moreover',
 'we',
 'co',
 'whereas',
 'a',
 'his',
 'inc',
 'myself',
 'fifteen',
 'first',
 'however',
 'herein',
 'he',
 'mill',
 'up',
 'in',
 'via',
 'had',
 'almost',
 'latterly',
 'all',
 'do',
 'former',
 'detail',
 'whenever',

In [10]:
# Before removing these stopwords, my next concern was negations, which can change the meaning of a review.
# For example, 'not good', 'never hot', etc.
# To handle this, I decided to replace "(negation) word" with "(negation)_word".
# For example, 'not good' --> 'not_good'
# Joined this way, the negation is no longer a standalone token, so stopword removal leaves it in place.
# In this analysis, I treat the following words as negations: not, no, never, neither, nor
# The test below gives a clearer sense of this process.
test = "good, not good, neither happy nor sad either cold or hot"
test


'good, not good, neither happy nor sad either cold or hot'

In [11]:
import re
test = re.sub(r'\b(not )\b', 'not_' , test)
test = re.sub(r'\b(no )\b', 'no_' , test)
test = re.sub(r'\b(never )\b', 'never_' , test)
test = re.sub(r'\b(neither )\b', 'neither_' , test)
test = re.sub(r'\b(nor )\b', 'nor_' , test)
test

'good, not_good, neither_happy nor_sad either cold or hot'
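The five substitutions can be folded into one pattern. This is a sketch of an equivalent single-pass version on the same test string; the helper name `join_negations` is hypothetical and not part of the notebook.

```python
import re

# One alternation covers all five negation words; the lookahead makes sure
# a word character follows, so only "negation + word" pairs are joined.
NEGATION_PATTERN = re.compile(r"\b(not|no|never|neither|nor)\s+(?=\w)")

def join_negations(text):
    """Replace 'not good' with 'not_good' (likewise for the other negations)."""
    return NEGATION_PATTERN.sub(r"\1_", text)

sample = "good, not good, neither happy nor sad either cold or hot"
joined = join_negations(sample)
print(joined)
```

Because the word boundary `\b` anchors the start of the match, `either` is left alone even though it ends in the same letters as `neither`.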

In [12]:
mcdonalds = re.sub(r'\b(not )\b', 'not_' , mcdonalds)
mcdonalds = re.sub(r'\b(no )\b', 'no_' , mcdonalds)
mcdonalds = re.sub(r'\b(never )\b', 'never_' , mcdonalds)
mcdonalds = re.sub(r'\b(neither )\b', 'neither_' , mcdonalds)
mcdonalds = re.sub(r'\b(nor )\b', 'nor_' , mcdonalds)


# Coming back after further analysis, I noticed phrases like 'not_a huge', where the negation attaches to an article.
# However, given the limited number of such instances, I decided to continue the analysis with this method.

In [13]:
# Finally, removing the stopwords from the reviews
stopwords_expression = '|'.join(stopwords_combined)
stopwords_pattern = f'({stopwords_expression})'

mcdonalds_cleaned = re.sub(rf'\b{stopwords_pattern}\b','', mcdonalds)
mcdonalds_cleaned
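As a sketch of the same idea on a toy sentence, the helper below (a hypothetical name, `remove_stopwords`) builds the alternation with `re.escape`, which guards against stopwords that contain regex metacharacters, and collapses the whitespace left behind. It also shows why joining negations with an underscore protects them: `_` is a word character, so `\bnot\b` does not match inside `not_good`.

```python
import re

def remove_stopwords(text, stopwords):
    """Delete whole-word stopwords from text and tidy the leftover spaces."""
    # Escape each word so characters such as '.' or '+' are matched literally
    alternation = "|".join(re.escape(word) for word in sorted(stopwords))
    pattern = re.compile(rf"\b(?:{alternation})\b")
    stripped = pattern.sub("", text)
    # Collapse the runs of spaces that removal leaves behind
    return re.sub(r"\s+", " ", stripped).strip()

sample = "the food was not_good and the service was slow"
cleaned_sample = remove_stopwords(sample, {"the", "was", "and", "not"})
print(cleaned_sample)
```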



In [14]:
words: List[str] = mcdonalds_cleaned.split()
mcdonalds_counter = Counter(words)
mcdonalds_counter.most_common(10)
# punctuation will be removed in later steps

[("'", 1988),
 ('.', 1690),
 (',', 1143),
 ('food', 620),
 ('order', 609),
 ('drive', 488),
 ('like', 468),
 ('time', 380),
 ('place', 377),
 ('service', 348)]

# Regex Cleaning

Initially, I planned to clean the data once more so that variants such as 'hamburger' and 'hamburgers' would be merged, using a method similar to the one in HW #1. However, I realized that both stemming and lemmatization already reduce such variants to a common root.

For the regex cleaning, I instead removed the periods (.) and commas (,) from the text, so that a word followed by punctuation is counted together with the bare word rather than as a separate token.

For example, after removing the period, **'hamburger'** and **'hamburger.'** are counted as the same word.
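A small illustration of that effect, using a made-up sentence: before cleaning, whitespace splitting yields three different tokens for the same word, and after cleaning they merge. The negative lookahead `(?!\d)` keeps a period or comma that is followed by a digit, so a price such as `1.50` survives.

```python
import re
from collections import Counter

sample = "hamburger. hamburger, hamburger cost 1.50"

# Counts before cleaning: punctuation sticks to the word
before = Counter(sample.split())

# Drop periods and commas unless a digit follows them
cleaned = re.sub(r"[.,](?!\d)", "", sample)
after = Counter(cleaned.split())

print(before)
print(after)
```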

In [15]:
mcdonalds_cleaned = re.sub(r'\.(?!\d)', '', mcdonalds_cleaned) # remove periods that are not followed by a digit
mcdonalds_cleaned = re.sub(r'\,(?!\d)', '', mcdonalds_cleaned) # remove commas that are not followed by a digit
mcdonalds_cleaned = re.sub(r'\b\d{9}\w+\b', '',mcdonalds_cleaned) # remove 9-digit numbers fused with the word that follows once the comma is gone (ex: 679455653atlanta)
mcdonalds_cleaned = re.sub(r'\b\d{9}\b', '',mcdonalds_cleaned) # remove standalone 9-digit numbers

mcdonalds_cleaned



In [16]:
def spellcheck(text):
    import difflib
    from nltk.tokenize import word_tokenize
    # 20k.txt holds one known word per line
    with open("20k.txt") as vocab_file:
        words = {line.rstrip("\n") for line in vocab_file}
    new_tokens = []
    for token in word_tokenize(text):
        # tokens that are already known words are kept without searching
        if token.lower() in words:
            new_tokens.append(token)
            continue
        matches = difflib.get_close_matches(token.lower(), words, n=1, cutoff=0.7)
        # fall back to the original token when no known word is close enough
        new_tokens.append(matches[0] if matches else token)
    return " ".join(new_tokens)

In [17]:
# Commented out and skipped for now to save time
# spellcheck(mcdonalds_cleaned)
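Since the spellcheck was skipped on the full text, here is a minimal sketch of what its matching step does, using a four-word made-up vocabulary in place of `20k.txt`. The helper name `closest_word` is hypothetical.

```python
import difflib

# Tiny stand-in vocabulary (assumed values, not the contents of 20k.txt)
vocabulary = ["hamburger", "fries", "coffee", "service"]

def closest_word(token, cutoff=0.7):
    """Return token if it is known, else its closest known word, else token unchanged."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

for token in ["hamburgr", "servce", "fries"]:
    print(token, "->", closest_word(token))
```

`get_close_matches` compares the token against every candidate, which is one reason the full run is slow: each unknown token costs a pass over the whole vocabulary.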

# Stemming vs. Lemmatization 

First, I split the text into sentence tokens.

In [18]:
#Tokenize sentence
import nltk
from nltk.tokenize import sent_tokenize
mcdonalds_cleaned_tokenized = nltk.sent_tokenize(mcdonalds_cleaned)
mcdonalds_cleaned_tokenized [0:5]

['_unit_idcityreview\n"\' not_a huge  lover  \'   better ones    far  worst  \'   !',
 'filthy inside     drive   completely screw   order  time!',
 'staff  terribly unfriendly     care"\n"terrible customer service  came   9:30pm  stood     register  no_one bothered     help   5 minutes   no_one  waiting   food inside   outside   window   left  went  chickfila  door   greeted      way inside     dirty  floor  covered  dropped food obviously filled  surly  unhappy workers"\n"  ""lost""  order actually  gave       took 20 minutes  figure      waiting   order    asked   needed  replied "" order"" asked   ticket   asst mgr looked   ticket  incompletely filled    ask   check     filled  correctly acted      bothered     asked   begrudgingly checked     fact miss    ticket  22 minutes  finally   breakfast biscuit platter  left  woman approached  identified    manager   dressed      awoken   old -shirt  sweat pants said   heard  happened  said \'  care      intervene   saw   growing annoyed  

Next, I created two functions that stem or lemmatize a string. <br>
1. Both functions first tokenize the string into sentences and then into words. <br>
2. Each word is then stemmed or lemmatized, and the results are joined back into a single space-separated string.

In [19]:
def stem_text (text):
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.stem import PorterStemmer
    porter=PorterStemmer()
    token_sentence = nltk.sent_tokenize(text)
    stemmed_text=[]
    for sentence in token_sentence:
        token_words=nltk.word_tokenize(sentence)
        for word in token_words:
            stemmed_text.append(porter.stem(word))
            stemmed_text.append(" ")
    return "".join(stemmed_text)

In [20]:
def lemmatize_text (text):
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()

    token_sentence = nltk.sent_tokenize(text)
    lemmatized_text=[]
    for sentence in token_sentence:
        token_words=nltk.word_tokenize(sentence)
        for word in token_words:
            lemmatized_text.append(lemmatizer.lemmatize(word))
            lemmatized_text.append(" ")
    return "".join(lemmatized_text)

In [21]:
test = 'hamburger hamburgers fries fry drink drinks drinking'
stem_text(test)

'hamburg hamburg fri fri drink drink drink '

In [22]:
lemmatize_text(test)

'hamburger hamburger fry fry drink drink drinking '

#### For this dataset, I decided to go with **stemming**. Fast-food reviews tend to repeat simple, common words, so the stems remain easy to interpret even when they are not valid English words (for example, `fri` for `fries`). In return, stemming runs faster and produces fewer features.

In [23]:
mcdonalds_cleaned_stemmed = stem_text(mcdonalds_cleaned)
mcdonalds_cleaned_stemmed

"_unit_idcityreview '' ' not_a huge lover ' better one far worst ' ! filthi insid drive complet screw order time ! staff terribl unfriendli care '' '' terribl custom servic came 9:30pm stood regist no_on bother help 5 minut no_on wait food insid outsid window left went chickfila door greet way insid dirti floor cover drop food obvious fill surli unhappi worker '' '' `` '' lost '' '' order actual gave took 20 minut figur wait order ask need repli `` '' order '' '' ask ticket asst mgr look ticket incomplet fill ask check fill correctli act bother ask begrudgingli check fact miss ticket 22 minut final breakfast biscuit platter left woman approach identifi manag dress awoken old -shirt sweat pant said heard happen said ' care interven saw grow annoy incompet ? '' ' not_th give 1 star not_a -25 star ! ! ! ' need ! '' ' know food review reflect sole poor servic locat countless time year consist fail servic end thing order taker tend rude no_smil lot `` '' sigh '' '' `` '' lip smack '' '' tal

# Count-Vectorization

In [24]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
data_corpus = nltk.sent_tokenize(mcdonalds_cleaned_stemmed)
# No separate punctuation-removal pass is applied to data_corpus:
# CountVectorizer's default token_pattern (r"(?u)\b\w\w+\b") only keeps runs of two or more word characters
X = vectorizer.fit_transform(data_corpus)
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [25]:
X.shape

(2547, 6899)

Dimension: 6899
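To make the shape above concrete, here is a pure-Python sketch of what count-vectorization computes. It is a simplified stand-in, not scikit-learn's implementation: it only mimics the default behaviour of lowercasing and keeping tokens of two or more word characters. The function name `count_vectorize` and the three toy documents are made up for illustration.

```python
import re
from collections import Counter

# Same regex as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_vectorize(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in documents[i]."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column = {token: j for j, token in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            row[column[token]] = count
        matrix.append(row)
    return vocabulary, matrix

docs = ["filthi insid drive !", "order time , order wrong", "a 5 minut wait"]
vocab, matrix = count_vectorize(docs)
print(vocab)
print(matrix)
```

The number of rows equals the number of documents (sentences here) and the number of columns equals the vocabulary size, which is the "dimension" reported in this notebook.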

In [26]:
vectorizer.get_feature_names()

['00',
 '000',
 '00am',
 '00mi',
 '00pm',
 '01',
 '0200',
 '03pm',
 '04',
 '04am',
 '05',
 '05i',
 '0600',
 '07',
 '076',
 '07am',
 '08',
 '09',
 '090',
 '09chocol',
 '0food',
 '0overal',
 '0servic',
 '10',
 '100',
 '1000filthi',
 '10131',
 '1030am',
 '10am',
 '10amthank',
 '10min',
 '10minut',
 '10p',
 '10pc',
 '10pm',
 '10th',
 '11',
 '115th',
 '11am',
 '11pm',
 '12',
 '12109',
 '12am',
 '12minut',
 '13',
 '130am',
 '13777',
 '13after',
 '13th',
 '14',
 '15',
 '150',
 '15dollar',
 '15min',
 '15minut',
 '15so',
 '16',
 '17',
 '170',
 '179',
 '17p',
 '18',
 '180',
 '187',
 '18966',
 '18minut',
 '19',
 '195',
 '1960',
 '1979',
 '1980',
 '1982',
 '1999',
 '1cleanli',
 '1i',
 '1in',
 '1overal',
 '1pm',
 '1s',
 '1st',
 '1star',
 '1time',
 '1valu',
 '20',
 '200',
 '2002',
 '2004',
 '2007',
 '2008',
 '2011',
 '2012',
 '2013',
 '2014',
 '205',
 '20min',
 '21',
 '216',
 '21st',
 '22',
 '23',
 '24',
 '244',
 '24h',
 '24hr',
 '24hrs',
 '24oz',
 '25',
 '250',
 '26',
 '27',
 '28',
 '285',
 '28th',

In [27]:
# I have also created a function that returns the outputs shown above for count-vectorization

def count_vec(text):
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()
    data_corpus = nltk.sent_tokenize(text)
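    # No separate punctuation-removal pass is needed: CountVectorizer's default
    # token_pattern (r"(?u)\b\w\w+\b") only keeps runs of two or more word characters
    X = vectorizer.fit_transform(data_corpus)
    X_shape = X.shape
    X_array = X.toarray()
    # get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
    X_feature = vectorizer.get_feature_names()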

    return X_shape, X_array, X_feature

In [28]:
X_shape, X_array, X_feature = count_vec(mcdonalds_cleaned_stemmed)
print('shape: ', X_shape)

print('array: \n', X_array)

print('features: ', X_feature)

shape:  (2547, 6899)
array: 
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
features:  ['00', '000', '00am', '00mi', '00pm', '01', '0200', '03pm', '04', '04am', '05', '05i', '0600', '07', '076', '07am', '08', '09', '090', '09chocol', '0food', '0overal', '0servic', '10', '100', '1000filthi', '10131', '1030am', '10am', '10amthank', '10min', '10minut', '10p', '10pc', '10pm', '10th', '11', '115th', '11am', '11pm', '12', '12109', '12am', '12minut', '13', '130am', '13777', '13after', '13th', '14', '15', '150', '15dollar', '15min', '15minut', '15so', '16', '17', '170', '179', '17p', '18', '180', '187', '18966', '18minut', '19', '195', '1960', '1979', '1980', '1982', '1999', '1cleanli', '1i', '1in', '1overal', '1pm', '1s', '1st', '1star', '1time', '1valu', '20', '200', '2002', '2004', '2007', '2008', '2011', '2012', '2013', '2014', '205', '20min', '21', '216', '21st', '22', '23', '24', '244', '24h', '24hr', '24hrs', '24o

#### I thought about removing the numbers, or any words containing digits, but I didn't want to remove numbers that are potentially useful. Therefore, I counted the words again to see whether any numbers were important, as shown below.

In [29]:
mcdonalds_cleaned = re.sub(r'[^\w\s]','',mcdonalds_cleaned) # removes punctuations
words: List[str] = mcdonalds_cleaned.split()
mcdonalds_counter = Counter(words)
mcdonalds_counter.most_common(20)

[('food', 858),
 ('order', 833),
 ('drive', 678),
 ('time', 524),
 ('service', 519),
 ('like', 482),
 ('place', 468),
 ('location', 383),
 ('people', 349),
 ('vegas', 313),
 ('fries', 299),
 ('got', 294),
 ('ordered', 264),
 ('minutes', 261),
 ('good', 257),
 ('coffee', 254),
 ('window', 251),
 ('right', 238),
 ('went', 226),
 ('line', 224)]
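No number appears among the 20 most common words above. A more direct check, sketched here on a made-up sentence, is to count only the tokens that contain a digit and inspect those on their own.

```python
import re
from collections import Counter

# Made-up sample text standing in for the cleaned reviews
tokens = "waited 20 minutes at 10pm then 20 more minutes for 1 coffee".split()

# Keep only tokens with at least one digit and count them
digit_counts = Counter(token for token in tokens if re.search(r"\d", token))

print(digit_counts.most_common())
```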

B. **Stopwords, Stemming, Lemmatization Practice**

Using the `tale-of-two-cities.txt` file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then **count-vectorization**
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?

**Because each sentence should be treated as a document, it makes more sense to tokenize by sentence than to read the file line by line: a line break in the text file does not mark the end of a sentence. The two cells below show the difference.**

In [30]:
with open("tale-of-two-cities.txt", "r", encoding="UTF-8") as dickens_file:
    dickens_line = dickens_file.readlines()
dickens_line[0:5]

['  IT WAS the best of times, it was the worst of times, it was the\n',
 'age of wisdom, it was the age of foolishness, it was the epoch of\n',
 'belief, it was the epoch of incredulity, it was the season of Light,\n',
 'it was the season of Darkness, it was the spring of hope, it was the\n',
 'winter of despair, we had everything before us, we had nothing\n']

In [31]:
with open("tale-of-two-cities.txt", "r", encoding="UTF-8") as text_file:
    dickens = text_file.read()
token_sentence = nltk.sent_tokenize(dickens)
token_sentence[0:5]

['  IT WAS the best of times, it was the worst of times, it was the\nage of wisdom, it was the age of foolishness, it was the epoch of\nbelief, it was the epoch of incredulity, it was the season of Light,\nit was the season of Darkness, it was the spring of hope, it was the\nwinter of despair, we had everything before us, we had nothing\nbefore us, we were all going direct to Heaven, we were all going\ndirect the other way- in short, the period was so far like the present\nperiod, that some of its noisiest authorities insisted on its being\nreceived, for good or for evil, in the superlative degree of\ncomparison only.',
 'There were a king with a large jaw and a queen with a plain face, on\nthe throne of England; there were a king with a large jaw and a\nqueen with a fair face, on the throne of France.',
 'In both countries\nit was clearer than crystal to the lords of the State preserves of\nloaves and fishes, that things in general were settled for ever.',
 'It was the year of Our Lor
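The difference between the two approaches can be seen on a three-line passage. The sketch below uses a crude regex split on sentence-ending punctuation as a stand-in for `nltk.sent_tokenize`, which handles abbreviations and other edge cases that this regex does not.

```python
import re

# Short passage wrapped across lines, similar to the layout of the text file
passage = (
    "It was the best of times, it was the\n"
    "worst of times. There were a king with\n"
    "a large jaw and a queen with a plain face."
)

# Line-based "documents": cut wherever the file happens to wrap
by_line = passage.splitlines()

# Sentence-based documents: split after '.', '!' or '?' followed by whitespace
by_sentence = [
    sentence.replace("\n", " ")
    for sentence in re.split(r"(?<=[.!?])\s+", passage.strip())
]

print(len(by_line), by_line)
print(len(by_sentence), by_sentence)
```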

## **stemming** and then **count-vectorization**

In [32]:
# stemming using the function defined above
dickens_stemmed = stem_text(dickens)
dickens_stemmed

"IT wa the best of time , it wa the worst of time , it wa the age of wisdom , it wa the age of foolish , it wa the epoch of belief , it wa the epoch of incredul , it wa the season of light , it wa the season of dark , it wa the spring of hope , it wa the winter of despair , we had everyth befor us , we had noth befor us , we were all go direct to heaven , we were all go direct the other way- in short , the period wa so far like the present period , that some of it noisiest author insist on it be receiv , for good or for evil , in the superl degre of comparison onli . there were a king with a larg jaw and a queen with a plain face , on the throne of england ; there were a king with a larg jaw and a queen with a fair face , on the throne of franc . In both countri it wa clearer than crystal to the lord of the state preserv of loav and fish , that thing in gener were settl for ever . It wa the year of our lord one thousand seven hundr and seventy-f . spiritu revel were conced to england a

In [33]:
# count-vectorization using the function above
count_vec(dickens_stemmed)

((7787, 6675), array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]), ['1757',
  '1767',
  '1792',
  '21',
  'aback',
  'abandon',
  'abash',
  'abat',
  'abbay',
  'abbaye',
  'abe',
  'abhorr',
  'abid',
  'abil',
  'abject',
  'abl',
  'ablaz',
  'abneg',
  'aboard',
  'abod',
  'abolish',
  'abolished',
  'abolit',
  'abomin',
  'abound',
  'about',
  'abov',
  'abreast',
  'abridg',
  'abroad',
  'abrupt',
  'abruptli',
  'absenc',
  'absent',
  'absolut',
  'absolv',
  'absorb',
  'absorpt',
  'abstract',
  'abstractedli',
  'absurd',
  'abund',
  'abus',
  'abyss',
  'abyssinia',
  'accent',
  'accept',
  'access',
  'accessori',
  'accid',
  'accident',
  'acclam',
  'accommod',
  'accompani',
  'accomplic',
  'accomplish',
  'accord',
  'accordingli',
  'accost',
  'account',
  'accoutr',
  'accumul',
  'accur',
  'accurs'

Stemming yields far fewer features than lemmatization (6675 versus 8916, as shown below). Stemming strips suffixes more aggressively than lemmatization and ignores how a word is used, so more word forms collapse into the same feature.

#### Dimension: 6675

## **lemmatization** and then **count-vectorization**

In [34]:
# lemmatization using the function defined above
dickens_lemmatized = lemmatize_text(dickens)
dickens_lemmatized



In [35]:
# count-vectorization using the function above
count_vec(dickens_lemmatized)

((7787, 8916), array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]), ['1757',
  '1767',
  '1792',
  '21',
  'aback',
  'abandon',
  'abandoned',
  'abandoning',
  'abandonment',
  'abashed',
  'abate',
  'abated',
  'abbaye',
  'abed',
  'abhorrence',
  'abided',
  'abiding',
  'ability',
  'abject',
  'ablaze',
  'able',
  'abnegating',
  'aboard',
  'abode',
  'abolished',
  'abolishing',
  'abolition',
  'abominable',
  'abounding',
  'about',
  'above',
  'abreast',
  'abridge',
  'abroad',
  'abrupt',
  'abruptly',
  'absence',
  'absent',
  'absolute',
  'absolutely',
  'absolving',
  'absorbed',
  'absorption',
  'abstractedly',
  'abstraction',
  'absurd',
  'abundance',
  'abundant',
  'abuse',
  'abused',
  'abyss',
  'abyssinia',
  'accent',
  'accept',
  'acceptable',
  'acceptation',
  'accepted',
  'access',
  'acces

#### Dimension: 8916

## **lemmatization**, remove **stopwords**, and then **count-vectorization**

In [36]:
# lemmatized from the question above
dickens_lemmatized



In [37]:
# removing stopwords
# refer above for the variables of stopwords 

stopwords_list = stopwords_gensim + stopwords_NLTK

stopwords_expression = '|'.join(stopwords_list)
stopwords_pattern = f'({stopwords_expression})'

dickens_lemmatized_clean = re.sub(rf'\b{stopwords_pattern}\b','', dickens_lemmatized)
dickens_lemmatized_clean



In [38]:
# count-vectorization using the function above
count_vec(dickens_lemmatized_clean)

((7787, 8816), array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]), ['1757',
  '1767',
  '1792',
  '21',
  'aback',
  'abandon',
  'abandoned',
  'abandoning',
  'abandonment',
  'abashed',
  'abate',
  'abated',
  'abbaye',
  'abed',
  'abhorrence',
  'abided',
  'abiding',
  'ability',
  'abject',
  'ablaze',
  'able',
  'abnegating',
  'aboard',
  'abode',
  'abolished',
  'abolishing',
  'abolition',
  'abominable',
  'abounding',
  'about',
  'above',
  'abreast',
  'abridge',
  'abroad',
  'abrupt',
  'abruptly',
  'absence',
  'absent',
  'absolute',
  'absolutely',
  'absolving',
  'absorbed',
  'absorption',
  'abstractedly',
  'abstraction',
  'absurd',
  'abundance',
  'abundant',
  'abuse',
  'abused',
  'abyss',
  'abyssinia',
  'accent',
  'accept',
  'acceptable',
  'acceptation',
  'accepted',
  'access',
  'acces

#### Dimension: 8816

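Removing the stopwords reduced the dimension by only 100 features (8916 to 8816), far fewer than the number of stopwords in the combined list. One likely reason is that the removal regex is case-sensitive while `dickens_lemmatized` still contains capitalized words: a stopword at the start of a sentence, such as `The`, is not removed, and `CountVectorizer` then lowercases it into the feature `the`. This is consistent with stopwords such as `about` and `above` still appearing in the feature list above.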
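One factor that can keep the drop small is letter case: the stopword lists are lowercase, the regex removal is case-sensitive, and the lemmatized text was never lowercased. The sketch below, on a made-up two-sentence sample with a three-word stopword list, compares a case-sensitive removal with one that passes `re.IGNORECASE`. The helper `features` is hypothetical and only mimics lowercasing plus the two-or-more-character token rule.

```python
import re

stopwords = ["the", "it", "was"]  # tiny assumed list for illustration
alternation = "|".join(re.escape(word) for word in stopwords)
text = "It was the best of times. The period was so far like the present period."

# Case-sensitive removal leaves 'It' and 'The' in place
case_sensitive = re.sub(rf"\b(?:{alternation})\b", "", text)
# Case-insensitive removal also catches the capitalized forms
case_insensitive = re.sub(rf"\b(?:{alternation})\b", "", text, flags=re.IGNORECASE)

def features(cleaned_text):
    """Lowercase and keep tokens of two or more word characters."""
    return sorted(set(re.findall(r"\b\w\w+\b", cleaned_text.lower())))

print(features(case_sensitive))
print(features(case_insensitive))
```

Lowercasing the text before removing stopwords, as was done for the reviews in Part A, has the same effect as the `re.IGNORECASE` flag here.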