# Vectorizer + NaiveBayes Tuning

🎯 The goal of this challenge is to build a Pipeline that chains a Vectorizer with a Naive Bayes model, and then to fine-tune that pipeline.

✍️ Let's reuse the previous dataset of $2000$ movie reviews, each classified as either "positive" or "negative".

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [2]:
from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers (classes are sorted, so "neg" -> 0)
le = LabelEncoder()
data["target_encoded"] = le.fit_transform(data.target)

In [3]:
data.head()

Unnamed: 0,target,reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",0
1,neg,the happy bastard's quick movie review \ndamn ...,0
2,neg,it is movies like these that make a jaded movi...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",0
4,neg,synopsis : a mentally unstable man undergoing ...,0


## Preprocessing

❓ **Question (Cleaning)** ❓

Clean your texts: strip whitespace, lowercase, remove digits and punctuation, then tokenize and lemmatize.

In [4]:
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

def preprocessing(sentence):
    """Return a lowercased, digit-free, punctuation-free, lemmatized sentence."""
    # Requires the NLTK tokenizer and WordNet data (see nltk.download)
    sentence = sentence.strip().lower()

    # Remove digits
    sentence = ''.join(letter for letter in sentence if not letter.isdigit())

    # Remove punctuation
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')

    # Tokenize, then lemmatize each token (default part of speech: noun)
    tokenized = word_tokenize(sentence)
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in tokenized]

    return ' '.join(lemmatized)
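
The character-level part of the cleaning (strip, lowercase, drop digits and punctuation) can also be done in a single pass. The sketch below is an alternative, not the notebook's method: the helper name `strip_digits_and_punctuation` is hypothetical, and it only removes the ASCII digits in `string.digits`, whereas `str.isdigit` also matches some non-ASCII digit characters.

```python
import string

# Translation table that deletes every ASCII digit and punctuation character
DROP_TABLE = str.maketrans("", "", string.digits + string.punctuation)

def strip_digits_and_punctuation(sentence: str) -> str:
    """Strip, lowercase, then delete digits and punctuation in one pass."""
    return sentence.strip().lower().translate(DROP_TABLE)

cleaned = strip_digits_and_punctuation("  Plot: 2 teen couples, 1 party!  ")
print(repr(cleaned))  # 'plot  teen couples  party'
```

The doubled spaces left behind are harmless here, since tokenizing and re-joining with `' '.join(...)` normalizes whitespace afterwards.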

In [5]:
# Clean reviews
data['clean_reviews'] = data['reviews'].apply(preprocessing)

In [6]:
data.head()

Unnamed: 0,target,reviews,target_encoded,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",0,plot two teen couple go to a church party drin...
1,neg,the happy bastard's quick movie review \ndamn ...,0,the happy bastard quick movie review damn that...
2,neg,it is movies like these that make a jaded movi...,0,it is movie like these that make a jaded movie...
3,neg,""" quest for camelot "" is warner bros . ' firs...",0,quest for camelot is warner bros first feature...
4,neg,synopsis : a mentally unstable man undergoing ...,0,synopsis a mentally unstable man undergoing ps...


## Tuning

❓ **Question (Pipelining a Vectorizer and a NaiveBayes Model)** ❓

* Create a Pipeline that chains a vectorizer of your choice with a NaiveBayes model
* Optimize it
* What is your best estimator?

In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config

# Render estimators as HTML diagrams in the notebook
set_config(display="diagram")

# Create Pipeline
pipeline = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB())

# Set parameters to search
params = {
    'tfidfvectorizer__ngram_range': ((1,1), (2,2)),
    'tfidfvectorizer__min_df': (0.01,0.05),
    'tfidfvectorizer__max_df': (0.8,0.9),
    'multinomialnb__alpha': (0.01,0.1,1,10)
}

# Perform grid search on pipeline
grid_search = GridSearchCV(
    pipeline,
    params,
    n_jobs=-1,
    verbose=1,
    scoring='accuracy',
    cv=5,
)


In [13]:
grid_search.fit(data.clean_reviews, data.target_encoded)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
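
The counts in this log follow directly from the grid: 2 n-gram ranges × 2 `min_df` values × 2 `max_df` values × 4 `alpha` values give 32 candidates, and each candidate is fitted once per fold. A minimal sketch to check this before launching a long search, using scikit-learn's `ParameterGrid` on the same dictionary:

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the grid search above
params = {
    "tfidfvectorizer__ngram_range": ((1, 1), (2, 2)),
    "tfidfvectorizer__min_df": (0.01, 0.05),
    "tfidfvectorizer__max_df": (0.8, 0.9),
    "multinomialnb__alpha": (0.01, 0.1, 1, 10),
}

n_candidates = len(ParameterGrid(params))  # every combination of the values
n_folds = 5                                # matches cv=5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 32 160
```

The final refit of the best candidate on the full dataset is not included in this total.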


In [19]:
grid_search.cv_results_

{'mean_fit_time': array([1.76116295, 3.68965454, 1.78130541, 3.6006526 , 1.72901382,
        3.54855251, 1.72667842, 3.6210402 , 1.77535748, 3.63778925,
        1.74424558, 3.54710135, 1.7365078 , 3.56860056, 1.74913001,
        3.51904082, 1.74342337, 3.56573954, 1.71704021, 3.51567516,
        1.72456694, 3.52173066, 1.71128445, 3.52065496, 1.74229522,
        3.58638573, 1.72997518, 3.56286597, 1.75130868, 3.45409565,
        1.7502769 , 2.66720004]),
 'std_fit_time': array([0.02190951, 0.01741907, 0.01606762, 0.03829045, 0.01535483,
        0.0497983 , 0.02555874, 0.0616174 , 0.02623008, 0.05469496,
        0.01504505, 0.01470434, 0.01648612, 0.04363108, 0.01758087,
        0.0350453 , 0.02718465, 0.03390588, 0.01279766, 0.02522043,
        0.02822749, 0.02262334, 0.01399999, 0.04061246, 0.02701035,
        0.0420751 , 0.01504951, 0.0474359 , 0.02882588, 0.11009908,
        0.01388772, 0.33321638]),
 'mean_score_time': array([0.42685785, 0.68052788, 0.43273478, 0.64561019, 0.442078
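
The raw `cv_results_` dictionary is hard to read. One common way to inspect it is to load it into a DataFrame and sort by rank. The sketch below is self-contained, so it fits a small search on a made-up toy corpus (`texts`, `labels` and `toy_search` are illustrative names, not part of the challenge); with the real `grid_search`, only the last three statements would be needed.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical), only here to keep the sketch runnable on its own
texts = ["great movie", "loved the plot", "wonderful acting", "great fun",
         "terrible movie", "hated the plot", "awful acting", "boring mess"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_search = GridSearchCV(
    make_pipeline(TfidfVectorizer(), MultinomialNB()),
    {"multinomialnb__alpha": (0.1, 1)},
    scoring="accuracy",
    cv=2,
)
toy_search.fit(texts, labels)

# One row per candidate, best-ranked candidate first
results = pd.DataFrame(toy_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```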

In [14]:
grid_search.best_params_

{'multinomialnb__alpha': 0.01,
 'tfidfvectorizer__max_df': 0.8,
 'tfidfvectorizer__min_df': 0.01,
 'tfidfvectorizer__ngram_range': (1, 1)}

In [17]:
round(grid_search.best_score_, 2)

0.83

In [20]:
grid_search.best_estimator_
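
With the default `refit=True`, `best_estimator_` is the full pipeline refitted on all the data, so it can classify raw strings directly. The sketch below illustrates this on a made-up toy corpus (`texts`, `targets`, `search` and `new_reviews` are illustrative names, not the challenge data); with the real model, new reviews should presumably go through the same cleaning function as the training texts first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Toy corpus (hypothetical), only here to keep the sketch runnable on its own
texts = ["great movie", "loved the plot", "wonderful acting", "great fun",
         "terrible movie", "hated the plot", "awful acting", "boring mess"]
targets = ["pos"] * 4 + ["neg"] * 4

le = LabelEncoder()
y = le.fit_transform(targets)  # classes are sorted: "neg" -> 0, "pos" -> 1

search = GridSearchCV(
    make_pipeline(TfidfVectorizer(), MultinomialNB()),
    {"multinomialnb__alpha": (0.1, 1)},
    scoring="accuracy",
    cv=2,
)
search.fit(texts, y)

# The refitted pipeline vectorizes and classifies in one call
best_model = search.best_estimator_
new_reviews = ["wonderful plot", "boring acting"]
predicted = le.inverse_transform(best_model.predict(new_reviews))
print(list(zip(new_reviews, predicted)))
```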

🏁 Congratulations! You've managed to chain a Vectorizer with an NLP model and to fine-tune the pipeline!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!