# Movie Reviews and Bag-of-Words Modelling

🎯 The goal of this challenge is to play with the ***Bag-of-words*** modelling of texts.

✍️ In the following dataset, we have $2000$ reviews classified either as _"positive"_ or _"negative"_.

In [50]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [51]:
data.shape

(2000, 2)

## 1. Preprocessing

❓ **Question (Cleaning Text)** ❓

- Write a function `preprocessing` that will clean a sentence and apply it to all our reviews. It should:
    - remove whitespace
    - lowercase characters
    - remove numbers
    - remove punctuation
    - tokenize
    - lemmatize
- You can store the cleaned reviews into a column called `cleaned_reviews`.
- Do not remove stopwords in this challenge; we will explain why in section `3. N-gram Modelling`.
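
The character-level steps listed above (everything except NLTK tokenizing and lemmatizing) can be sketched with the standard library alone. The helper name `basic_clean` and the sample sentence are hypothetical, chosen only for illustration, and plain whitespace splitting stands in for a real tokenizer here.

```python
import string

# Translation table that deletes every digit and ASCII punctuation character.
_DROP = str.maketrans("", "", string.digits + string.punctuation)

def basic_clean(sentence):
    """Strip, lowercase, drop digits and punctuation, then split on whitespace."""
    cleaned = sentence.strip().lower().translate(_DROP)
    # str.split() with no argument also collapses runs of whitespace.
    return cleaned.split()

tokens = basic_clean("  The 2 teens' party was GREAT !!  ")
print(tokens)  # ['the', 'teens', 'party', 'was', 'great']
```

`str.translate` removes all unwanted characters in a single pass, which tends to be faster than filtering character by character on long reviews.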

In [7]:
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

In [19]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [52]:
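def preprocessing(sentence):
    sentence = sentence.strip()  # remove leading/trailing whitespace
    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())  # remove numbers
    sentence = ''.join(char for char in sentence if char not in string.punctuation)
    tokens = word_tokenize(sentence)
    # Note: pos="a" treats every token as an adjective, so plural nouns
    # and conjugated verbs are mostly left unchanged.
    lemmatizer = WordNetLemmatizer()  # instantiate once, not once per token
    return ' '.join(lemmatizer.lemmatize(token, pos="a") for token in tokens)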


In [53]:
# Clean reviews (this overwrites the `reviews` column in place; later cells read `data.reviews`)
data["reviews"] = data["reviews"].apply(preprocessing)
data

Unnamed: 0,target,reviews
0,neg,plot two teen couples go to a church party dri...
1,neg,the happy bastards quick movie review damn tha...
2,neg,it is movies like these that make a jaded movi...
3,neg,quest for camelot is warner bros first feature...
4,neg,synopsis a mentally unstable man undergoing ps...
...,...,...
1995,pos,wow what a movie its everything a movie can be...
1996,pos,richard gere can be a commanding actor but hes...
1997,pos,glorystarring matthew broderick denzel washing...
1998,pos,steven spielbergs second epic film on world wa...


❓ **Question (LabelEncoding)** ❓

Label-encode your target and store it in a column called `target_encoded`.
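
As a minimal sketch of what label encoding does, the toy example below (made-up labels, not the real dataset) shows that scikit-learn's `LabelEncoder` sorts the classes and maps each one to its index in that order.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["neg", "pos", "pos", "neg"]  # toy labels for illustration only

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # classes are sorted: neg -> 0, pos -> 1

print(encoder.classes_.tolist())                   # ['neg', 'pos']
print(encoded.tolist())                            # [0, 1, 1, 0]
print(encoder.inverse_transform([1, 0]).tolist())  # ['pos', 'neg']
```

Keeping the fitted encoder around makes it possible to turn predictions back into the original labels with `inverse_transform`.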

In [55]:
data.target.value_counts()

neg    1000
pos    1000
Name: target, dtype: int64

In [56]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

In [57]:
label_encoder.fit(data.target)

In [58]:
label_encoder.classes_

array(['neg', 'pos'], dtype=object)

In [60]:
data['target_encoded'] = label_encoder.transform(data.target)

In [61]:
# Quick check
data.head()

Unnamed: 0,target,reviews,target_encoded
0,neg,plot two teen couples go to a church party dri...,0
1,neg,the happy bastards quick movie review damn tha...,0
2,neg,it is movies like these that make a jaded movi...,0
3,neg,quest for camelot is warner bros first feature...,0
4,neg,synopsis a mentally unstable man undergoing ps...,0


In [62]:
data.target_encoded.value_counts()

0    1000
1    1000
Name: target_encoded, dtype: int64

## 2. Bag-of-Words Modelling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.


In [64]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(data.reviews)
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [66]:
pd.DataFrame(count_vectorizer.get_feature_names_out()).value_counts()

aa                    1
politely              1
political             1
politically           1
politicallycharged    1
                     ..
fourarmed             1
fourcredited          1
fourday               1
fourdollar            1
zycie                 1
Length: 46446, dtype: int64

In [68]:
# vectorized_texts = pd.DataFrame(
#     X.toarray(), 
#     columns = count_vectorizer.get_feature_names_out(),
#     index = data.reviews
# )

# vectorized_texts

In [69]:
from sklearn.naive_bayes import MultinomialNB

In [73]:
from sklearn.model_selection import cross_validate

In [77]:
import numpy as np

In [75]:
y = data.target_encoded

In [78]:
cv_results = cross_validate(MultinomialNB(), X, y, cv = 5, scoring = "f1")
average_f1 = cv_results["test_score"].mean()  # mean F1 score over the 5 folds
np.round(average_f1, 2)

0.81
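
In the cells above, the vectorizer is fitted on all reviews before cross-validation, so the vocabulary is built with the validation folds included. One possible alternative, sketched here on a made-up toy corpus, is to wrap both steps in a scikit-learn pipeline so that the vocabulary is rebuilt on each training fold. The texts and labels below are assumptions for illustration, and scores on such a tiny corpus are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: 6 positive then 6 negative mini-reviews.
texts = [
    "great movie loved it", "wonderful acting great plot",
    "loved the story", "great fun wonderful cast",
    "a wonderful film", "loved every great minute",
    "terrible movie hated it", "awful acting boring plot",
    "hated the story", "boring and awful cast",
    "a terrible film", "hated every boring minute",
]
labels = [1] * 6 + [0] * 6

# The vectorizer is refitted inside each training fold, never on validation text.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
cv_scores = cross_validate(pipeline, texts, labels, cv=3, scoring="f1")["test_score"]
print(len(cv_scores))  # 3
```

On the real dataset, the same pipeline could be passed `data.reviews` and `data.target_encoded` in place of the toy lists.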

## 3. N-gram Modelling

👀 Remember that we asked you not to remove stopwords. Why? 

👉 We will train the Naive Bayes model with bigrams. In a sentence like "I do not like coriander", the negativity is carried by bigrams such as "do not" and "not like". Removing stopwords would typically drop "not" and lose that signal.

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [80]:
vectorizer = CountVectorizer(ngram_range = (2,2))
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data.reviews)

cv_nb = cross_validate(
    naivebayes,
    X_bow,
    data.target_encoded,
    scoring = "accuracy"  # note: the unigram model above was scored with "f1"
)

round(cv_nb['test_score'].mean(),2)

0.84

🏁 Congratulations! You now know how to train a Naive Bayes model on vectorized texts.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!