# Vectorizer + NaiveBayes Tuning

🎯 The goal of this challenge is to create a Pipeline combining a Vectorizer + a NaiveBayes algorithm and to fine-tune the pipeline.

✍️ Let's reuse the previous dataset of $2000$ reviews, each classified as either "positive" or "negative".

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [2]:
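# Import the class directly so the name `preprocessing` stays free
# for the cleaning function defined later in the notebook
from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers ("neg" -> 0, "pos" -> 1)
le = LabelEncoder()
data["target_encoded"] = le.fit_transform(data.target)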

In [3]:
data.head()

Unnamed: 0,target,reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",0
1,neg,the happy bastard's quick movie review \ndamn ...,0
2,neg,it is movies like these that make a jaded movi...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",0
4,neg,synopsis : a mentally unstable man undergoing ...,0
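
To double-check which integer each class received, the fitted encoder can be inspected through its `classes_` attribute. The snippet below is a minimal sketch on a hypothetical toy list of labels (`toy_labels` is not part of the dataset), assuming the same two class names as above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for data.target
toy_labels = ["neg", "pos", "neg", "pos"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted, and the position of each class is its integer code
label_mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(label_mapping)  # {'neg': 0, 'pos': 1}
```

Since the classes are sorted alphabetically, "pos" ends up as the positive class `1`, which is what a binary `"f1"` scorer would treat as the class of interest.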


## Preprocessing

❓ **Question (Cleaning)** ❓

Clean your texts

In [31]:
import string
from nltk.corpus import stopwords 
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

In [5]:
stop_words = set(stopwords.words('english'))

In [37]:
def preprocessing(sentence):
    # Remove surrounding whitespace and lowercase the text
    sentence = sentence.strip().lower()
    # Drop digits, then punctuation, character by character
    sentence = ''.join(char for char in sentence if not char.isdigit())
    sentence = ''.join(char for char in sentence if char not in string.punctuation)
    # Tokenize and remove English stop words
    tokens = word_tokenize(sentence)
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize each token; pos="a" treats every token as an adjective
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(word, pos="a") for word in tokens)
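
The two character-level steps (dropping digits and punctuation) could also be done in a single pass with `str.translate`. This is only a sketch of an alternative, not the approach used above: the helper name `strip_digits_and_punctuation` and the sample sentence are hypothetical, and `string.digits` covers ASCII digits only, whereas `char.isdigit()` also matches other Unicode digits.

```python
import string

# Translation table that deletes every ASCII digit and punctuation character
_DELETE_TABLE = str.maketrans("", "", string.digits + string.punctuation)

def strip_digits_and_punctuation(text: str) -> str:
    """Remove ASCII digits and punctuation in one pass."""
    return text.translate(_DELETE_TABLE)

# Hypothetical sample sentence
sample = "  The 2 leads can't act , but it's 100% fun !  "
normalized = sample.strip().lower()

one_pass = strip_digits_and_punctuation(normalized)

# Same result as the two join-based steps for ASCII input
two_pass = ''.join(c for c in normalized if not c.isdigit())
two_pass = ''.join(c for c in two_pass if c not in string.punctuation)
print(one_pass == two_pass)  # True
```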

In [39]:
word = 'hello my name is jefffff'

In [45]:
# Clean reviews
data["reviews"] = data["reviews"].apply(preprocessing)
data

Unnamed: 0,target,reviews,target_encoded
0,neg,plot two teen couples go church party drink dr...,0
1,neg,happy bastards quick movie review damn yk bug ...,0
2,neg,movies like make jaded movie viewer thankful i...,0
3,neg,quest camelot warner bros first featurelength ...,0
4,neg,synopsis mentally unstable man undergoing psyc...,0
...,...,...,...
1995,pos,wow movie everything movie funny dramatic inte...,1
1996,pos,richard gere commanding actor hes always great...,1
1997,pos,glorystarring matthew broderick denzel washing...,1
1998,pos,steven spielbergs second epic film world war i...,1


In [44]:
# Sanity check: tokenize the sample string and drop stop words
sentence = word_tokenize(word)
sentence = [w for w in sentence if w not in stop_words]
' '.join(sentence)

'hello name jefffff'

## Tuning

❓ **Question (Pipelining a Vectorizer and a NaiveBayes Model)** ❓

* Create a Pipeline that chains a vectorizer of your choice with a NaiveBayes model
* Optimize it
* What is your best estimator?

In [46]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config

# `display` must be passed by keyword; the first positional argument is `assume_finite`
set_config(display="diagram")

In [47]:
pipeline = Pipeline([
    ('tdif', TfidfVectorizer()),
    ('m_nb', MultinomialNB()),
])

In [48]:
pipeline

In [50]:
pipeline.get_params()

{'memory': None,
 'steps': [('tdif', TfidfVectorizer()), ('m_nb', MultinomialNB())],
 'verbose': False,
 'tdif': TfidfVectorizer(),
 'm_nb': MultinomialNB(),
 'tdif__analyzer': 'word',
 'tdif__binary': False,
 'tdif__decode_error': 'strict',
 'tdif__dtype': numpy.float64,
 'tdif__encoding': 'utf-8',
 'tdif__input': 'content',
 'tdif__lowercase': True,
 'tdif__max_df': 1.0,
 'tdif__max_features': None,
 'tdif__min_df': 1,
 'tdif__ngram_range': (1, 1),
 'tdif__norm': 'l2',
 'tdif__preprocessor': None,
 'tdif__smooth_idf': True,
 'tdif__stop_words': None,
 'tdif__strip_accents': None,
 'tdif__sublinear_tf': False,
 'tdif__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tdif__tokenizer': None,
 'tdif__use_idf': True,
 'tdif__vocabulary': None,
 'm_nb__alpha': 1.0,
 'm_nb__class_prior': None,
 'm_nb__fit_prior': True}
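
The keys above follow the `<step name>__<parameter>` convention, and that is the naming a grid search expects. As a small sketch, the same keys can be used with `set_params` to change a nested hyperparameter by hand (`demo_pipeline` is a hypothetical copy of the pipeline, built here so the snippet is self-contained).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical copy of the pipeline, reusing the same step names
demo_pipeline = Pipeline([
    ('tdif', TfidfVectorizer()),
    ('m_nb', MultinomialNB()),
])

# "<step>__<param>" reaches into the named step and updates its parameter
demo_pipeline.set_params(tdif__ngram_range=(1, 2), m_nb__alpha=0.5)

print(demo_pipeline.named_steps['tdif'].ngram_range)  # (1, 2)
print(demo_pipeline.get_params()['m_nb__alpha'])      # 0.5
```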

In [54]:
import numpy as np

In [57]:
np.linspace(0,2,10)

array([0.        , 0.22222222, 0.44444444, 0.66666667, 0.88888889,
       1.11111111, 1.33333333, 1.55555556, 1.77777778, 2.        ])

In [60]:
# Set parameters to search
grid_params = {
    'tdif__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tdif__max_features': list(range(20, 200, 20)),
    'tdif__max_df': [0.3, 0.5, 0.8],
}

# Perform grid search on pipeline
grid_search = GridSearchCV(
    pipeline,
    param_grid=grid_params,
    scoring = "f1",
    cv = 5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(data.reviews,data.target_encoded)

Fitting 5 folds for each of 81 candidates, totalling 405 fits
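
The count in this log line comes straight from the grid: 3 n-gram ranges × 9 `max_features` values × 3 `max_df` values = 81 candidates, each fitted on 5 folds. A quick way to verify this before launching a long search is `ParameterGrid`; the sketch below rebuilds the same grid under the hypothetical name `grid_copy`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above, rebuilt so the snippet is self-contained
grid_copy = {
    'tdif__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tdif__max_features': list(range(20, 200, 20)),  # 20, 40, ..., 180
    'tdif__max_df': [0.3, 0.5, 0.8],
}

n_candidates = len(ParameterGrid(grid_copy))
n_folds = 5
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)  # 81 405
```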


In [61]:
# Best score
print(f"Best Score = {grid_search.best_score_}")

# Best params
print(f"Best params = {grid_search.best_params_}")

Best Score = 0.7409191112414165
Best params = {'tdif__max_df': 0.5, 'tdif__max_features': 180, 'tdif__ngram_range': (1, 1)}
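
To answer the "best estimator" question directly, the refitted pipeline is available as `best_estimator_` on the fitted search object. The search above only tunes the vectorizer, so the sketch below also adds the Naive Bayes smoothing parameter `alpha` to the grid. It is an illustration under assumptions: the toy corpus, the `toy_*` names, the `alpha` values, `cv=2`, and the `"accuracy"` scorer are all made up to keep the example small, and they say nothing about what works best on the real reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the cleaned reviews (1 = pos, 0 = neg)
toy_reviews = [
    "great funny movie", "wonderful acting great story",
    "brilliant touching film", "funny clever wonderful script",
    "boring dull movie", "terrible acting awful story",
    "dull predictable film", "awful boring terrible script",
]
toy_targets = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ("tdif", TfidfVectorizer()),
    ("m_nb", MultinomialNB()),
])

# Tune one vectorizer parameter and the Naive Bayes smoothing strength together
toy_search = GridSearchCV(
    toy_pipeline,
    param_grid={
        "tdif__ngram_range": [(1, 1), (1, 2)],
        "m_nb__alpha": [0.1, 0.5, 1.0],  # assumed values, kept strictly positive
    },
    scoring="accuracy",
    cv=2,
)
toy_search.fit(toy_reviews, toy_targets)

# best_estimator_ is the full pipeline refitted on all the data with the best params
best_model = toy_search.best_estimator_
prediction = best_model.predict(["funny wonderful film"])
print(toy_search.best_params_)
print(prediction)
```

On the real data, the equivalent call would be `grid_search.best_estimator_`, which can then be used to predict on new, cleaned reviews.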


🏁 Congratulations! You've chained a Vectorizer with an NLP model and fine-tuned the resulting pipeline!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!