# Movie Reviews and Bag-of-Words Modeling

🎯 The goal of this challenge is to experiment with ***Bag-of-Words*** modeling of text.

✍️ The following dataset contains $2000$ reviews, each classified as either _"positive"_ or _"negative"_.

In [1]:
import pandas as pd

data = pd.read_csv("reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [2]:
data.shape

(2000, 2)

## 1. Preprocessing

❓ **Question (Cleaning Text)** ❓

- Write a function `preprocessing` that cleans a sentence, and apply it to all our reviews. It should:
    - strip leading and trailing whitespace
    - lowercase characters
    - remove numbers
    - remove punctuation
    - tokenize
    - lemmatize
- You can store the cleaned reviews in a column called `clean_reviews`.
- Do not remove stopwords in this challenge; we explain why in section `3. N-gram modeling`.

In [4]:
import string
from nltk import word_tokenize
from nltk.corpus import stopwords  # missing in the original cell: `stopwords` is used below
from nltk.stem import WordNetLemmatizer

def preprocessing(sentence):
    # Basic cleaning
    sentence = sentence.strip()  # remove leading and trailing whitespace
    sentence = sentence.lower()  # lowercase
    sentence = ''.join(char for char in sentence if not char.isdigit())  # remove numbers

    # Advanced cleaning
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')  # remove punctuation

    tokenized_sentence = word_tokenize(sentence)  # tokenize

    # NOTE: the instructions above ask to keep stopwords, but this solution
    # removes them, and the outputs shown in this notebook reflect that removal.
    stop_words = set(stopwords.words('english'))
    tokenized_sentence_cleaned = [
        w for w in tokenized_sentence if w not in stop_words
    ]

    # Lemmatize every token as a verb (pos="v"), reusing a single lemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatized = [
        lemmatizer.lemmatize(word, pos="v")
        for word in tokenized_sentence_cleaned
    ]

    return ' '.join(lemmatized)

In [5]:
data['clean_reviews'] = data["reviews"].apply(preprocessing)
data.head()

Unnamed: 0,target,reviews,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go church party drink dri...
1,neg,the happy bastard's quick movie review \ndamn ...,happy bastards quick movie review damn yk bug ...
2,neg,it is movies like these that make a jaded movi...,movies like make jade movie viewer thankful in...
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest camelot warner bros first featurelength ...
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis mentally unstable man undergo psychot...


In [6]:
data

Unnamed: 0,target,reviews,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go church party drink dri...
1,neg,the happy bastard's quick movie review \ndamn ...,happy bastards quick movie review damn yk bug ...
2,neg,it is movies like these that make a jaded movi...,movies like make jade movie viewer thankful in...
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest camelot warner bros first featurelength ...
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis mentally unstable man undergo psychot...
...,...,...,...
1995,pos,wow ! what a movie . \nit's everything a movie...,wow movie everything movie funny dramatic inte...
1996,pos,"richard gere can be a commanding actor , but h...",richard gere command actor hes always great fi...
1997,pos,"glory--starring matthew broderick , denzel was...",glorystarring matthew broderick denzel washing...
1998,pos,steven spielberg's second epic film on world w...,steven spielbergs second epic film world war i...


❓ **Question (LabelEncoding)** ❓

LabelEncode your target and store it in a column called `target_encoded`.
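
As a rough illustration of what label encoding does, here is a pure-Python sketch. It is not scikit-learn's implementation, and the toy `labels` list is hypothetical. It relies on one fact about `LabelEncoder`: classes are sorted before being numbered, so `neg` maps to `0` and `pos` to `1`.

```python
# Conceptual sketch of label encoding (not scikit-learn's actual code).
labels = ["neg", "pos", "pos", "neg"]  # hypothetical toy labels

# Classes are sorted, so each class gets its alphabetical rank as its code.
classes = sorted(set(labels))
class_to_index = {label: index for index, label in enumerate(classes)}

encoded = [class_to_index[label] for label in labels]
print(classes)  # ['neg', 'pos']
print(encoded)  # [0, 1, 1, 0]
```

With a fitted `LabelEncoder`, the same mapping can be read from its `classes_` attribute.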

In [7]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
data['target_encoded'] = enc.fit_transform(data['target'])

In [8]:
# Quick check
data.head()

Unnamed: 0,target,reviews,clean_reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go church party drink dri...,0
1,neg,the happy bastard's quick movie review \ndamn ...,happy bastards quick movie review damn yk bug ...,0
2,neg,it is movies like these that make a jaded movi...,movies like make jade movie viewer thankful in...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest camelot warner bros first featurelength ...,0
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis mentally unstable man undergo psychot...,0


## 2. Bag-of-Words modeling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.
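
To make the representation concrete, the sketch below builds a tiny document-term matrix by hand. The two-document corpus is hypothetical, and the code only mimics the idea behind `CountVectorizer`: the real class also lowercases, drops single-character tokens by default, and returns a sparse matrix.

```python
from collections import Counter

docs = ["good movie good plot", "bad movie"]  # hypothetical toy corpus

# Vocabulary: every distinct token, sorted for a stable column order.
vocabulary = sorted({token for doc in docs for token in doc.split()})

def bag_of_words(doc):
    """Return one count per vocabulary word; word order in the doc is ignored."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

matrix = [bag_of_words(doc) for doc in docs]
print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
print(matrix)      # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Each row is one document and each column one vocabulary word, which is the shape of input `MultinomialNB` expects.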

In [9]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the cleaned reviews into unigram counts
X_bow = CountVectorizer().fit_transform(data['clean_reviews'])

# 5-fold cross-validated accuracy
cv_results = cross_validate(MultinomialNB(), X_bow, data['target_encoded'], cv=5, scoring='accuracy')
cv_results['test_score'].mean()

0.8085000000000001
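
For reference, the score above comes from the standard multinomial Naive Bayes decision rule. The notation below is not from the original notebook: $n_{w,d}$ is the count of word $w$ in review $d$, $N_{w,c}$ is the total count of $w$ in the training reviews of class $c$, $N_c = \sum_w N_{w,c}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter (`alpha` in `MultinomialNB`, which defaults to $1.0$).

$$
\hat{c}(d) = \arg\max_{c} \Big[ \log P(c) + \sum_{w \in V} n_{w,d} \, \log P(w \mid c) \Big],
\qquad
P(w \mid c) = \frac{N_{w,c} + \alpha}{N_c + \alpha\,|V|}
$$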

## 3. N-gram modeling

👀 Remember that we asked you not to remove stopwords. Why? 

👉 We will train the Naive Bayes model with bigrams. In a sentence like "I do not like coriander", the bigram "do not" is what signals negativity, and it would be lost if stopwords such as "do" and "not" were removed.
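
The effect can be sketched in a few lines of plain Python. The `ngrams` helper and the three-word `stop_words` set are hypothetical and only for illustration; `CountVectorizer(ngram_range=(2, 2))` produces bigrams in a similar spirit.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i do not like coriander".split()

with_stopwords = ngrams(tokens, 2)
print(with_stopwords)
# ['i do', 'do not', 'not like', 'like coriander']

stop_words = {"i", "do", "not"}  # hypothetical tiny stopword list
kept = [token for token in tokens if token not in stop_words]

without_stopwords = ngrams(kept, 2)
print(without_stopwords)
# ['like coriander']
```

Once the stopwords are gone, the only remaining bigram is "like coriander", which reads as positive.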

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [10]:
vectorizer = CountVectorizer(ngram_range = (2,2))
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data.clean_reviews)

cv_nb = cross_validate(naivebayes,
                       X_bow,
                       data.target_encoded,
                       scoring = "accuracy")

cv_nb['test_score'].mean()

0.7515000000000001

🏁 Congratulations! Now, you know how to train a Naive Bayes model on vectorized texts.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!