## Case Study Background
Following up from our previous work, 
after extracting relevant text information from the `review` and `address` columns, 
it's time to "grade" whether a review is positive or negative. 
This is called "sentiment analysis", and it's a widely studied problem in Natural Language Processing (NLP). 
To do this, we need to pre-process our text-based reviews, 
and transform them into a vector representation. 
Note that we won't be building the sentiment analyser, 
but will only practice the steps leading up to it. 
However, there is a bonus at the end of this notebook that gives you a taste of how sentiment analysis works!

## Learning objectives
- Understand the process and usage of various text preprocessing techniques: tokenisation, casefolding, noise & stopwords removal
- Understand the differences between stemming and lemmatisation
- Be aware of the downstream effects of the choices made at each step of the preprocessing pipeline
- Understand the idea of vector representation, sparse matrix, and Bag-of-Words
- Know how to implement these preprocessing and vectorization steps using the `sklearn` and `nltk` libraries

## Workshop Overview
- Apply preprocessing steps to a single review paragraph
- Compare the differences in 2 alternative preprocessing pipelines at each step
- Compare the differences in 2

In [1]:
import re
import pandas as pd
data = pd.read_csv('nlp.csv')
data.head()

Unnamed: 0,review,address,state,captainmarvel
0,Avengers: Endgame is dumb. Very dumb. It's a m...,39 Aaron Place Norwood VIC 5091,VIC,NO MATCH
1,What an unbelievable accomplishment to have sh...,461 Achernar Close Bondi Junction VIC 5125,VIC,NO MATCH
2,"Disclosure: I'm NOT a Marvel superfan, but I'v...",24 Adair Street Bundaberg VIC 2127,VIC,capt. marvel
3,"""Avengers: Endgame"" is about memories, nostalg...",51 Academy Close Floreat VIC 2680,VIC,"""captain marvel"""
4,It feels and watches like a seasin finale of a...,12 Abercorn Crescent Joondalup NSW 3055,NSW,NO MATCH


# Text pre-processing

While there are many techniques you can apply at this phase, we will perform only a few steps in this tutorial and leave the remaining techniques as Challenge questions for you to try in your own time. We will apply each step to a single sample review.

In [2]:
# We'll apply the techniques on this sample review
sample_review = data['review'][37]
print(sample_review)

Go ahead and watch the trailers for Avengers: Endgame, they won't give anything major away. It's amazing for a huge movie to be so self-aware of itself, as well as, the movie genres that they're overtly borrowing from. The minor characters or those not even in Avengers: Infinity War, step up and help set up huge sequences that are highly entertaining and actually answer questions. Avengers: Endgame acknowledges every aspect of the characters emotions in their previous MCU film's and succeeds in the most Meta way possible.Ant-Man is a major reason for this. It's no spoiler to say that he's in the film as he produces some of the biggest laughs from the trailer of him ringing the bell at the Avenger's front gate. It's Paul Rudd's wry jokes, quick timing and fish out of water facial expressions that really assist things.The pacing of Avengers: Endgame is amazing and not for the reasons you might think. It's brilliantly paced, but it throws the entire formula and how MCU films are done on t

## Pre-processing technique 1: Casefolding

<blockquote style="padding: 10px; background-color: #FFD392;">

## Discussion questions
    
1. What's the use of casefolding?
2. In which situations should you **NOT** do casefolding?

In [3]:
lowercased = sample_review.lower()
print(lowercased)

go ahead and watch the trailers for avengers: endgame, they won't give anything major away. it's amazing for a huge movie to be so self-aware of itself, as well as, the movie genres that they're overtly borrowing from. the minor characters or those not even in avengers: infinity war, step up and help set up huge sequences that are highly entertaining and actually answer questions. avengers: endgame acknowledges every aspect of the characters emotions in their previous mcu film's and succeeds in the most meta way possible.ant-man is a major reason for this. it's no spoiler to say that he's in the film as he produces some of the biggest laughs from the trailer of him ringing the bell at the avenger's front gate. it's paul rudd's wry jokes, quick timing and fish out of water facial expressions that really assist things.the pacing of avengers: endgame is amazing and not for the reasons you might think. it's brilliantly paced, but it throws the entire formula and how mcu films are done on t
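To make the effect of casefolding concrete, here is a minimal sketch on a made-up sentence (the text and the naive `split()` tokenisation are assumptions for illustration, not part of the dataset). It counts tokens before and after lowercasing.

```python
from collections import Counter

# Hypothetical example text, not taken from the reviews
text = "Amazing movie. AMAZING cast. The US release was amazing for us."

# Naive tokenisation: drop full stops, then split on whitespace
raw_counts = Counter(text.replace(".", "").split())
folded_counts = Counter(text.lower().replace(".", "").split())

# Without casefolding, the three spellings of "amazing" are separate tokens
print(raw_counts["amazing"], folded_counts["amazing"])

# After casefolding, the country "US" and the pronoun "us" become one token
print(raw_counts["US"], raw_counts["us"], folded_counts["us"])
```

Merging `Amazing`, `AMAZING` and `amazing` is usually helpful, while merging `US` and `us` loses a distinction that may matter for some tasks.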

## Pre-processing technique 2: Noise (punctuation) removal

<blockquote style="padding: 10px; background-color: #FFD392;">

## Exercise
Write a regex to remove all punctuation and numbers. Make sure you handle full stops that are not followed by a space, such as `way possible.ant-man`, so that the result doesn't become `possibleantman`! Store your output in a variable named `no_punct`.

In [4]:
# ANSWER HERE 

# SOLUTION

special_full_stop = re.sub(r'\.(?=\w)','. ', lowercased)  # Add a space after any full stop that is directly followed by a word character
no_punct = re.sub(r'[^A-z\s]', '', special_full_stop)  # Remove everything except letters and whitespace (note: the range A-z also keeps [ \ ] ^ _ `)
print(no_punct)

go ahead and watch the trailers for avengers endgame they wont give anything major away its amazing for a huge movie to be so selfaware of itself as well as the movie genres that theyre overtly borrowing from the minor characters or those not even in avengers infinity war step up and help set up huge sequences that are highly entertaining and actually answer questions avengers endgame acknowledges every aspect of the characters emotions in their previous mcu films and succeeds in the most meta way possible antman is a major reason for this its no spoiler to say that hes in the film as he produces some of the biggest laughs from the trailer of him ringing the bell at the avengers front gate its paul rudds wry jokes quick timing and fish out of water facial expressions that really assist things the pacing of avengers endgame is amazing and not for the reasons you might think its brilliantly paced but it throws the entire formula and how mcu films are done on their head battle loss battle

**Follow-up question:** 

Discuss one disadvantage of this punctuation removal strategy.
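A side note on the character class in the solution: in ASCII, the range `A-z` also spans the six characters that sit between `Z` and `a`, namely ``[ \ ] ^ _ ` ``. The sketch below compares it with the stricter `A-Za-z` on a made-up string (the string is an assumption chosen to contain such characters).

```python
import re

# Hypothetical text containing characters that fall between 'Z' and 'a' in ASCII
text = "good_movie [spoiler] 10/10!"

# Range A-z: digits, '/' and '!' are removed, but '_', '[' and ']' survive
loose = re.sub(r'[^A-z\s]', '', text)

# Range A-Za-z: only letters and whitespace survive
strict = re.sub(r'[^A-Za-z\s]', '', text)

print(repr(loose))
print(repr(strict))
```

Whether this matters depends on the data. If the reviews contain brackets or underscores, they will pass through the `A-z` version untouched.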

## Pre-processing technique 3: Tokenisation

In [5]:
import nltk
# nltk.download('punkt')

In [6]:
tokens = nltk.word_tokenize(no_punct)
print(tokens)

['go', 'ahead', 'and', 'watch', 'the', 'trailers', 'for', 'avengers', 'endgame', 'they', 'wont', 'give', 'anything', 'major', 'away', 'its', 'amazing', 'for', 'a', 'huge', 'movie', 'to', 'be', 'so', 'selfaware', 'of', 'itself', 'as', 'well', 'as', 'the', 'movie', 'genres', 'that', 'theyre', 'overtly', 'borrowing', 'from', 'the', 'minor', 'characters', 'or', 'those', 'not', 'even', 'in', 'avengers', 'infinity', 'war', 'step', 'up', 'and', 'help', 'set', 'up', 'huge', 'sequences', 'that', 'are', 'highly', 'entertaining', 'and', 'actually', 'answer', 'questions', 'avengers', 'endgame', 'acknowledges', 'every', 'aspect', 'of', 'the', 'characters', 'emotions', 'in', 'their', 'previous', 'mcu', 'films', 'and', 'succeeds', 'in', 'the', 'most', 'meta', 'way', 'possible', 'antman', 'is', 'a', 'major', 'reason', 'for', 'this', 'its', 'no', 'spoiler', 'to', 'say', 'that', 'hes', 'in', 'the', 'film', 'as', 'he', 'produces', 'some', 'of', 'the', 'biggest', 'laughs', 'from', 'the', 'trailer', 'of', 

<blockquote style="padding: 10px; background-color: #FFD392;">

## Exercise

Another data scientist in your team suggested an alternative noise removal strategy. Compare your tokenised set of words with this alternative solution. Is there any difference? Which solution is better in this case?

    special_full_stop = re.sub(r'\.(?=\w)','. ', lowercased)
    punct2space = re.sub(r'[^A-z\s]', ' ', special_full_stop)
    no_punct_alt = re.sub(r'\s+', ' ', punct2space)

In [7]:
# ANSWER HERE 

# SOLUTION

special_full_stop = re.sub(r'\.(?=\w)','. ', lowercased)
punct2space = re.sub(r'[^A-z\s]', ' ', special_full_stop)
no_punct_alt = re.sub(r'\s+', ' ', punct2space)
tokens_alt = nltk.word_tokenize(no_punct_alt)
extra_terms_in_tokens = list(set(tokens) - set(tokens_alt))
extra_terms_in_tokens_alt = list(set(tokens_alt)-set(tokens))
print('Extra terms in tokens:', extra_terms_in_tokens)
print('Extra terms in tokens_alt:',extra_terms_in_tokens_alt)

Extra terms in tokens: ['selfaware', 'hes', 'antman', 'its', 'theyre', 'were', 'wouldve', 'youve', 'rudds', 'wont']
Extra terms in tokens_alt: ['ve', 'won', 's', 'would', 't', 'rudd', 'aware', 'man', 'self', 'we', 'avenger', 'ant', 're']
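The same contrast can be reproduced on a short made-up phrase. This is a minimal sketch that assumes plain `str.split()` in place of `nltk.word_tokenize` and uses the stricter `A-Za-z` character class, so it only approximates the cells above.

```python
import re

# Hypothetical phrase with an apostrophe and a hyphen
text = "it's so self-aware, they won't mind"

# Strategy 1: delete punctuation (punct-to-null)
to_null = re.sub(r"[^A-Za-z\s]", "", text).split()

# Strategy 2: replace punctuation with a space (punct-to-space)
to_space = re.sub(r"[^A-Za-z\s]", " ", text).split()

print(to_null)
print(to_space)
```

Punct-to-null glues word parts together (`selfaware`, `wont`), while punct-to-space splits them into fragments (`self`, `aware`, `won`, `t`), some of which are later caught by the stopword list.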


## Pre-processing technique 4: Stopword removal

In [8]:
from nltk.corpus import stopwords
# run the line to download it the first time:
# nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stop_words # Check out the list of stopwords defined in NLTK

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [9]:
no_stopwords = [w for w in tokens if w not in stop_words]
print("Original tokenized size:", len(tokens))
print("After removing stopwords:", len(no_stopwords))
print()
print(no_stopwords)

Original tokenized size: 356
After removing stopwords: 188

['go', 'ahead', 'watch', 'trailers', 'avengers', 'endgame', 'wont', 'give', 'anything', 'major', 'away', 'amazing', 'huge', 'movie', 'selfaware', 'well', 'movie', 'genres', 'theyre', 'overtly', 'borrowing', 'minor', 'characters', 'even', 'avengers', 'infinity', 'war', 'step', 'help', 'set', 'huge', 'sequences', 'highly', 'entertaining', 'actually', 'answer', 'questions', 'avengers', 'endgame', 'acknowledges', 'every', 'aspect', 'characters', 'emotions', 'previous', 'mcu', 'films', 'succeeds', 'meta', 'way', 'possible', 'antman', 'major', 'reason', 'spoiler', 'say', 'hes', 'film', 'produces', 'biggest', 'laughs', 'trailer', 'ringing', 'bell', 'avengers', 'front', 'gate', 'paul', 'rudds', 'wry', 'jokes', 'quick', 'timing', 'fish', 'water', 'facial', 'expressions', 'really', 'assist', 'things', 'pacing', 'avengers', 'endgame', 'amazing', 'reasons', 'might', 'think', 'brilliantly', 'paced', 'throws', 'entire', 'formula', 'mcu', 'f
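One thing to keep in mind for sentiment analysis is that the stopword list contains negations such as `no` and `not`. The sketch below uses a small hand-picked subset of the list (an assumption made so the example runs without downloading NLTK data) and two made-up reviews.

```python
# Hand-picked subset of the NLTK English stopword list (assumption: shortened for illustration)
toy_stop_words = {"the", "was", "is", "a", "no", "not"}

# Two hypothetical reviews with opposite sentiment
review_a = "the movie was not good".split()
review_b = "the movie was good".split()

filtered_a = [w for w in review_a if w not in toy_stop_words]
filtered_b = [w for w in review_b if w not in toy_stop_words]

# After stopword removal the two reviews look identical
print(filtered_a)
print(filtered_b)
```

If negation matters for the task, one option is to remove those words from the stopword set before filtering.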

In [10]:
# Comparison with the alternative strategy
no_stopwords_alt = [w for w in tokens_alt if w not in stop_words]

print("After removing stopwords (different noise removal strategy):", len(no_stopwords_alt))
print()

print('punct-to-null results in words like:')
print([w for w in no_stopwords if w not in no_stopwords_alt])
print()
print('punct-to-space results in words like:')
print([w for w in no_stopwords_alt if w not in no_stopwords])

After removing stopwords (different noise removal strategy): 186

punct-to-null results in words like:
['wont', 'selfaware', 'theyre', 'antman', 'hes', 'rudds', 'youve', 'wouldve']

punct-to-space results in words like:
['self', 'aware', 'ant', 'man', 'avenger', 'rudd', 'would']


## Pre-processing technique 5a: Stemming 

In [11]:
from nltk.stem.porter import PorterStemmer
# Instantiating the stemmer algorithm
porterStemmer = PorterStemmer()
# Stem every token in the stopword-free list
stemmed = [porterStemmer.stem(w) for w in no_stopwords]

In [12]:
for before, after in zip(no_stopwords, stemmed):
    print("Before:", before, "| Stemmed:", after)

Before: go | Stemmed: go
Before: ahead | Stemmed: ahead
Before: watch | Stemmed: watch
Before: trailers | Stemmed: trailer
Before: avengers | Stemmed: aveng
Before: endgame | Stemmed: endgam
Before: wont | Stemmed: wont
Before: give | Stemmed: give
Before: anything | Stemmed: anyth
Before: major | Stemmed: major
Before: away | Stemmed: away
Before: amazing | Stemmed: amaz
Before: huge | Stemmed: huge
Before: movie | Stemmed: movi
Before: selfaware | Stemmed: selfawar
Before: well | Stemmed: well
Before: movie | Stemmed: movi
Before: genres | Stemmed: genr
Before: theyre | Stemmed: theyr
Before: overtly | Stemmed: overtli
Before: borrowing | Stemmed: borrow
Before: minor | Stemmed: minor
Before: characters | Stemmed: charact
Before: even | Stemmed: even
Before: avengers | Stemmed: aveng
Before: infinity | Stemmed: infin
Before: war | Stemmed: war
Before: step | Stemmed: step
Before: help | Stemmed: help
Before: set | Stemmed: set
Before: huge | Stemmed: huge
Before: sequences | Stemmed:

## Pre-processing technique 5b: Lemmatisation

In [13]:
# first time, run the line:
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# Instantiating the lemmatiser algorithm
lemmatizer = WordNetLemmatizer()
# Lemmatise every token in the stopword-free list
lemmatized = [lemmatizer.lemmatize(w) for w in no_stopwords]

In [14]:
for lemma, stem in zip(lemmatized, stemmed):
    print("Lemmatized:", lemma, "| Stemmed:", stem)

Lemmatized: go | Stemmed: go
Lemmatized: ahead | Stemmed: ahead
Lemmatized: watch | Stemmed: watch
Lemmatized: trailer | Stemmed: trailer
Lemmatized: avenger | Stemmed: aveng
Lemmatized: endgame | Stemmed: endgam
Lemmatized: wont | Stemmed: wont
Lemmatized: give | Stemmed: give
Lemmatized: anything | Stemmed: anyth
Lemmatized: major | Stemmed: major
Lemmatized: away | Stemmed: away
Lemmatized: amazing | Stemmed: amaz
Lemmatized: huge | Stemmed: huge
Lemmatized: movie | Stemmed: movi
Lemmatized: selfaware | Stemmed: selfawar
Lemmatized: well | Stemmed: well
Lemmatized: movie | Stemmed: movi
Lemmatized: genre | Stemmed: genr
Lemmatized: theyre | Stemmed: theyr
Lemmatized: overtly | Stemmed: overtli
Lemmatized: borrowing | Stemmed: borrow
Lemmatized: minor | Stemmed: minor
Lemmatized: character | Stemmed: charact
Lemmatized: even | Stemmed: even
Lemmatized: avenger | Stemmed: aveng
Lemmatized: infinity | Stemmed: infin
Lemmatized: war | Stemmed: war
Lemmatized: step | Stemmed: step
Lemmat

<blockquote style="padding: 10px; background-color: #ebf5fb;">

    
## Discussion question
1. What's the difference between lemmatisation and stemming?
2. When would we prefer stemming over lemmatising? And when would we prefer lemmatising over stemming?
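As a starting point for the discussion, here is a deliberately simplified sketch of the two ideas. `toy_stem` and `toy_lemmatize` are hypothetical helpers written for this illustration. They are not the Porter algorithm or the WordNet lemmatiser, and the suffix list and lemma dictionary are made up.

```python
# Toy rule-based stemmer: chop the first matching suffix, no dictionary involved
TOY_SUFFIXES = ("ing", "ers", "es", "s", "e")

def toy_stem(word):
    for suffix in TOY_SUFFIXES:
        # Only strip if at least 3 characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy lemmatiser: look the word up in a (tiny, made-up) dictionary of known forms
TOY_LEMMAS = {"avengers": "avenger", "genres": "genre"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for w in ["avengers", "movie", "genres", "amazing"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based approach is cheap but can produce strings that are not words (`aveng`, `movi`), whereas the dictionary-based approach returns real words but only for forms it knows about.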

# Bag Of Words Vector Representation

<blockquote style="padding: 10px; background-color: #ebf5fb;">

## Class Discussion Questions
    
1. What is text vector representation? 
2. What is a corpus? 
3. What is BoW?

In `sklearn`, we will use the built-in `CountVectorizer` class to encode `str` text data into a vector representation. Its documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
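To see what a Bag-of-Words encoding boils down to, here is a minimal pure-Python sketch on two made-up documents. It assumes whitespace tokenisation and is only meant to illustrate the idea, not to mirror how `CountVectorizer` is implemented.

```python
from collections import Counter

# Hypothetical mini-corpus of two documents
docs = ["great movie great cast", "boring movie"]

# Tokenise each document (naive whitespace split)
tokenised = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all distinct tokens in the corpus
vocabulary = sorted({tok for toks in tokenised for tok in toks})

# One row per document, one column per vocabulary term, each cell a count
matrix = [[Counter(toks)[term] for term in vocabulary] for toks in tokenised]

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is discarded: each document is represented only by how often each vocabulary term occurs in it.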

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
## Note that by default CountVectorizer only considers alphanumeric patterns of at least 2 characters to be a word.  
## You can alter this behavior using the token_pattern parameter.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(data['review'])
bow

<338x5175 sparse matrix of type '<class 'numpy.int64'>'
	with 30388 stored elements in Compressed Sparse Row format>

**Follow-up question:** Why is it called a sparse matrix?
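A quick way to approach this question is to compute how many cells of the matrix are actually non-zero. The arithmetic below uses only the numbers reported in the output above.

```python
# Figures taken from the sparse matrix summary above
n_docs, n_terms, n_stored = 338, 5175, 30388

total_cells = n_docs * n_terms
density = n_stored / total_cells

print(total_cells)
print(f"{density:.2%} of the cells are non-zero")
```

Storing only the non-zero entries is what the Compressed Sparse Row format does.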

In [16]:
vocab = vectorizer.get_feature_names_out()
print(len(vocab)) # The vocabulary holds 5175 distinct "words" across all 338 reviews!
print(vocab[:100]) # Print the first 100 words in the vocabulary

5175
['000' '06' '10' '100' '1000' '10s' '11' '116' '12' '13' '14' '15' '180'
 '1970' '1987' '1990' '1st' '20' '200' '2008' '2010s' '2012' '2014' '2019'
 '21' '21st' '22' '22nd' '23' '250' '2d' '2hr' '2hrs' '2nd' '30' '300'
 '3000' '30h' '3d' '3h2' '3rd' '3seeing' '40' '45' '45mins' '47' '48'
 '4th' '50' '56' '5h' '60' '602' '70s' '75' '80s' '8th' '90' 'aaa'
 'abandon' 'abandoned' 'abandoning' 'abandons' 'abilities' 'ability'
 'able' 'abominably' 'about' 'above' 'abrams' 'abruptly' 'absent'
 'absolute' 'absolutely' 'absorbed' 'absurd' 'absurdly' 'abundance'
 'abysmal' 'abyss' 'accept' 'accepted' 'accident' 'accidentally'
 'accompanied' 'accomplish' 'accomplished' 'accomplishment' 'according'
 'account' 'accounts' 'accurate' 'accuser' 'achieve' 'achieved'
 'achievement' 'achievements' 'achieving' 'acknowledge' 'acknowledges']


In [17]:
# Print the BoW matrix
pd.DataFrame(bow.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5165,5166,5167,5168,5169,5170,5171,5172,5173,5174
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
333,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
334,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
335,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
336,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Follow-up question:** What does the '1' in row 3, column 2 mean?

# Putting everything together: From raw text to cleaned BoW

In [18]:
def text_preprocess(doc, stop_words, stemmer):
    doc = doc.lower()
    doc = re.sub(r'\.(?=\w)','. ', doc)  # Add a space after fullstop without a space
    doc = re.sub(r'[^A-z\s]', ' ', doc)
    doc = re.sub(r'\s+', ' ', doc) 
    tokens = nltk.word_tokenize(doc)
    tokens = [w for w in tokens if w not in stop_words]
    stemmed = [stemmer.stem(w) for w in tokens]
    return ' '.join(stemmed)
    
# Create input parameters for text_preprocess
stop_words = set(stopwords.words('english'))
porterStemmer = PorterStemmer()
reviews = data['review']

# Preprocess each review and collect the outputs in cleaned_reviews
cleaned_reviews = [text_preprocess(review, stop_words, porterStemmer) for review in reviews]

# BoW encoding    
vectorizer1 = CountVectorizer()
bow_cleaned = vectorizer1.fit_transform(cleaned_reviews)
bow_cleaned

<338x3651 sparse matrix of type '<class 'numpy.int64'>'
	with 20190 stored elements in Compressed Sparse Row format>

<blockquote style="padding: 10px; background-color: #ebf5fb;">

## Class Discussion Question
Identify the preprocessing steps in the previous code block.

Alternatively, unless the pre-processing steps required are very specific, we can rely on scikit-learn: its vectorizers have built-in preprocessing options and accept custom `preprocessor` and `tokenizer` functions, which lets us plug in NLTK easily. Check the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html to find out which preprocessing steps are built into (supported by) `CountVectorizer`.

In [19]:
## Alternative approach using CountVectorizer()
from sklearn.feature_extraction.text import CountVectorizer

def pre_process(doc):
    doc_new = doc.lower()
    doc_new = re.sub(r'\.(?=\w)','. ', doc_new)  # Add a space after fullstop without a space
    doc_new = re.sub(r'[^A-z\s]', ' ', doc_new)  # remove punct
    doc_new = re.sub(r'\s+', ' ', doc_new)  # collapse multiple spaces.
    return doc_new

# Build these once rather than on every call to tokeniser()
stop_words = set(stopwords.words('english'))
porterStemmer = PorterStemmer()

def tokeniser(doc):
    tokens = nltk.word_tokenize(doc)
    tokens = [w for w in tokens if w not in stop_words]
    return [porterStemmer.stem(w) for w in tokens]
    
    
vectorizer2 = CountVectorizer(preprocessor=pre_process, tokenizer=tokeniser, analyzer = 'word')
bow2 = vectorizer2.fit_transform(data['review'])
bow2

<338x3667 sparse matrix of type '<class 'numpy.int64'>'
	with 20219 stored elements in Compressed Sparse Row format>

**Follow-up question:** Note the difference in the feature space of the output sparse matrix: 3651 vs 3667 (# of dimensions). Can you explain why?

In [20]:
# Comparison of 3 vectorizers output

features_raw = vectorizer.get_feature_names_out()
features1 = vectorizer1.get_feature_names_out()
features2 = vectorizer2.get_feature_names_out()

print()
print("specific to features1 (not in feature2):")
print(f'{[w for w in features1 if w not in features2]}')
print()
print("specific to features2 (not in feature1):")
print(f'{[w for w in features2 if w not in features1]}')

print()
print("specific to features_raw (not in feature2), selecting the first 100 features:")
print(f'{[w for w in features_raw if w not in features2][:100]}')


specific to features1 (not in feature2):
[]

specific to features2 (not in feature1):
['[', ']', 'b', 'c', 'e', 'f', 'g', 'h', 'l', 'n', 'p', 'q', 'u', 'w', 'x', 'z']

specific to features_raw (not in feature2), selecting the first 100 features:
['000', '06', '10', '100', '1000', '10s', '11', '116', '12', '13', '14', '15', '180', '1970', '1987', '1990', '1st', '20', '200', '2008', '2010s', '2012', '2014', '2019', '21', '21st', '22', '22nd', '23', '250', '2d', '2hr', '2hrs', '2nd', '30', '300', '3000', '30h', '3d', '3h2', '3rd', '3seeing', '40', '45', '45mins', '47', '48', '4th', '50', '56', '5h', '60', '602', '70s', '75', '80s', '8th', '90', 'abandoned', 'abandoning', 'abandons', 'abilities', 'ability', 'able', 'abominably', 'about', 'above', 'abrams', 'abruptly', 'absolute', 'absolutely', 'absorbed', 'absurdly', 'abundance', 'abysmal', 'accepted', 'accidentally', 'accompanied', 'accomplished', 'accomplishment', 'according', 'accounts', 'accurate', 'accuser', 'achieve', 'achieved', 'a

# <u>Challenge questions</u>

1. Normally, we simply need to use `.lower()` to perform casefolding, but Python also has a specific string method called `.casefold()`. What are the differences between these 2 methods, and in which situations do we need to use `.casefold()` over `.lower()`?

2. Notice how the lemmatizer didn't return the root form of some words, like "putting", "watered", and "cheapened"? That's because the function wasn't called with the proper part of speech. Check out the difference by running the code below:
    

    print(lemmatizer.lemmatize("putting")) # Default pos="n": the word is treated as a noun
    print(lemmatizer.lemmatize("putting", pos="v"))

By using the `pos` parameter, we force it to treat putting as a verb. Auto-detecting the part-of-speech of a word is actually another big NLP problem. The following site will give you more insights into how this is done: https://www.nltk.org/book/ch05.html. If you want a challenge, you can try to implement a lemmatizer that is enhanced with an auto part-of-speech tagging to increase the usefulness of the lemmatization (this is beyond the scope of the tutorial and subject, but a very satisfying problem to work on!)
    
3. An alternative vector representation for text data is TF-IDF (covered in the lecture). The `TfidfVectorizer` function works similarly to `CountVectorizer`. Have a look through its documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html and produce a TF-IDF representation for the preprocessed corpus.