<h2>Hello, it is my notebook with experiments with IMDB Movies Dataset task solving models</h2>

<h4>Import modules and load data</h4>

In [3]:
import pandas as pd

In [4]:
data = pd.read_csv("IMDB Dataset.csv")

In [5]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


<h4>Make labels numerical</h4>

In [6]:
data = pd.DataFrame({"review": data["review"], "sentiment": data["sentiment"].map({"positive": 1, "negative": 0})})

In [7]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


<h4>Split the model into two parts - train set and test set</h4>

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(["sentiment"], axis=1), data["sentiment"], test_size=0.3, random_state=7)

In [10]:
X_train.head()

Unnamed: 0,review
46521,Sam Fuller's excellent PICK UP ON SOUTH STREET...
13908,"If at all possible, try to view all five of th..."
39915,THE 40 YEAR-OLD VIRGIN (2005) **** Steve Carel...
28440,I had seen Rik Mayall in Blackadder and the Ne...
6011,I have seen Maslin Beach a couple of times - b...


In [11]:
y_train.head()

46521    1
13908    0
39915    1
28440    0
6011     1
Name: sentiment, dtype: int64

<h4>Make pipeline component for text normalize</h4>

In [12]:
import re
import string
from spacy.lang.en.stop_words import STOP_WORDS as stop_words
from spacy.lang.en import English

In [13]:
parser = English()
punctuations = string.punctuation

In [14]:
def spacy_text_normalizer(text):
    text = re.sub(r"<.*>", "", text) #Remove all tags
    tokens = parser(text) #Get doc from text
    tokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens ] #Normalize words
    tokens = [ word for word in tokens if word not in stop_words and word not in punctuations ] #Remove stop words and punctuation
    return " ".join(tokens)

In [15]:
from sklearn.base import TransformerMixin

In [16]:
class TextNormalizer(TransformerMixin):
    def __init__(self, text_column_name = "review"):
        self.text_column_name = text_column_name
        
    def transform(self, X, **transform_params):
        return [spacy_text_normalizer(text) for text in X[self.text_column_name]]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

<h4>The next step is to tune text vectorizer. First of all you need to choose binary classification model. I chose logistic regression</h4>

In [17]:
from sklearn.linear_model import LogisticRegression

<h4>Importing required modules</h4>

In [18]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

<h4>Let's make 2 pipelines, tune hyperparameters and choose vectorizer</h4>

In [19]:
count_pipe = Pipeline([
  ("normalizer", TextNormalizer()),
  ("vectorizer", CountVectorizer()),
  ("classifier", LogisticRegression())
])

In [20]:
tfidf_pipe = Pipeline([
  ("normalizer", TextNormalizer()),
  ("vectorizer", TfidfVectorizer()),
  ("classifier", LogisticRegression())
])

<h4>Let's write metrics function</h4>

In [21]:
def get_metrics_values(model, X, y, metric):
  y_pred = model.predict(X)
  return metric(y_pred, y)

<h4>Fit pipelines</h4>

In [22]:
%%time
count_pipe.fit(X_train, y_train)

CPU times: user 48.5 s, sys: 2.74 s, total: 51.2 s
Wall time: 48.1 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(memory=None,
         steps=[('normalizer',
                 <__main__.TextNormalizer object at 0x7f6885a00400>),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=N

In [23]:
%%time
tfidf_pipe.fit(X_train, y_train)

CPU times: user 45.5 s, sys: 1.63 s, total: 47.1 s
Wall time: 45.2 s


Pipeline(memory=None,
         steps=[('normalizer',
                 <__main__.TextNormalizer object at 0x7f6885a006a0>),
                ('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 s...
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual

<h4>Let's import some metrics and check models</h4>

In [24]:
from sklearn import metrics
metrics = {
    "accuracy": metrics.accuracy_score,
    "precision": metrics.precision_score,
    "recall": metrics.recall_score,
    "f1": metrics.f1_score
}

In [25]:
print("CountVectorizer metrics")
for metric_name, metric in metrics.items():
  print(f"{metric_name} value is {get_metrics_values(count_pipe, X_test, y_test, metric)}")

CountVectorizer metrics
accuracy value is 0.8433333333333334
precision value is 0.8471760797342193
recall value is 0.8415841584158416
f1 value is 0.8443708609271523


In [26]:
print("TfIdfVectorizer metrics")
for metric_name, metric in metrics.items():
  print(f"{metric_name} value is {get_metrics_values(tfidf_pipe, X_test, y_test, metric)}")

TfIdfVectorizer metrics
accuracy value is 0.8552
precision value is 0.8699003322259137
recall value is 0.84584571650084
f1 value is 0.8577044025157233


<h4>Let's look at the features count</h4>

In [27]:
len(tfidf_pipe["vectorizer"].get_feature_names())

70246

<h4>TfIdf is a little bit better. Lets tune hyperparameters</h4>

In [28]:
from sklearn.model_selection import StratifiedKFold

In [29]:
skf = StratifiedKFold(n_splits=3)

In [30]:
param_grid = {
    'vectorizer__ngram_range': ((1, 1), (1, 2), (2, 2)),
    'vectorizer__max_features': (10000, 40000, None)
}

<h4>For time economy make new dataset with cleaned data and new pipes</h4>

In [31]:
X_train_clean = TextNormalizer().fit_transform(X_train)

In [32]:
count_logreg_pipe = Pipeline([
  ("vectorizer", CountVectorizer()),
  ("classifier", LogisticRegression())            
])

In [33]:
tfidf_logreg_pipe = Pipeline([
  ("vectorizer", TfidfVectorizer()),
  ("classifier", LogisticRegression())
])

In [34]:
from sklearn.model_selection import GridSearchCV

<h4>Lets tune hyperparameters</h4>

In [35]:
count_search = GridSearchCV(count_logreg_pipe, param_grid, n_jobs=-1)
count_search.fit(X_train_clean, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vectorizer',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                   

In [36]:
tfidf_search = GridSearchCV(tfidf_logreg_pipe, param_grid, n_jobs=-1)
tfidf_search.fit(X_train_clean, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vectorizer',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                 

<h4>Look at the best models</h4>

In [38]:
count_search.best_score_

0.857

In [39]:
count_search.best_params_

{'vectorizer__max_features': None, 'vectorizer__ngram_range': (1, 2)}

In [40]:
tfidf_search.best_score_

0.8604857142857142

In [41]:
tfidf_search.best_params_

{'vectorizer__max_features': 40000, 'vectorizer__ngram_range': (1, 2)}

<h4>Lets check on the test dataset</h4>

In [45]:
X_test_cleaned = TextNormalizer().fit_transform(X_test)

In [47]:
print("CountVectorizer metrics")
for metric_name, metric in metrics.items():
  print(f"{metric_name} value is {get_metrics_values(count_search, X_test_cleaned, y_test, metric)}")

CountVectorizer metrics
accuracy value is 0.8556666666666667
precision value is 0.8623255813953489
recall value is 0.8517983722761879
f1 value is 0.857029650663673


In [48]:
print("TFidfVectorizer metrics")
for metric_name, metric in metrics.items():
  print(f"{metric_name} value is {get_metrics_values(tfidf_search, X_test_cleaned, y_test, metric)}")

TFidfVectorizer metrics
accuracy value is 0.8587333333333333
precision value is 0.8742857142857143
recall value is 0.8486842105263158
f1 value is 0.8612947568239838


<h4>Tfidf a little bit better. Lets tune this hyperparameters more</h4>

In [49]:
param_grid = {
    'vectorizer__max_df': (0.25, 0.5, 0.75),
    'vectorizer__ngram_range': ((1, 2), (1, 3)),
    'vectorizer__max_features': (40000, 100000, None)
}

In [50]:
tfidf_search = GridSearchCV(tfidf_logreg_pipe, param_grid, n_jobs=-1)
tfidf_search.fit(X_train_clean, y_train)



GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vectorizer',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                 

<h4>Lets look at the best parameters and score</h4>

In [51]:
tfidf_search.best_params_

{'vectorizer__max_df': 0.25,
 'vectorizer__max_features': 100000,
 'vectorizer__ngram_range': (1, 3)}

In [52]:
tfidf_search.best_score_

0.8610857142857142

In [53]:
print("TFidfVectorizer metrics")
for metric_name, metric in metrics.items():
  print(f"{metric_name} value is {get_metrics_values(tfidf_search, X_test_cleaned, y_test, metric)}")

TFidfVectorizer metrics
accuracy value is 0.8582
precision value is 0.8742857142857143
recall value is 0.8478092783505154
f1 value is 0.8608439646712462


<h4>Setting did not give meaningful results. Probably need to change the classifier. Lets save the model and tune classifier</h4>

In [55]:
tfidf = tfidf_search.best_estimator_["vectorizer"]

In [56]:
import pickle

In [58]:
pickle.dump(tfidf, open("tfidf.pickle", "wb"))

<h4>Lets tune classifier</h4>

In [60]:
import numpy as np

In [65]:
param_grid = {
    'classifier__penalty' : ['l1', 'l2'],
    'classifier__C' : np.logspace(-4, 4, 5)
}

In [66]:
tfidf_logreg_pipe = Pipeline([
  ("vectorizer", tfidf),
  ("classifier", LogisticRegression())
])
tfidf_search = GridSearchCV(tfidf_logreg_pipe, param_grid, n_jobs=-1)
tfidf_search.fit(X_train_clean, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vectorizer',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=0.25,
                                                        max_features=100000,
                                                        min_df=1,
                                                        ngram_range=(1, 3),
                                              

In [67]:
tfidf_search.best_score_

0.8610857142857142

In [68]:
tfidf_search.best_params_

{'classifier__C': 1.0, 'classifier__penalty': 'l2'}

In [69]:
print("TFidfVectorizer metrics")
for metric_name, metric in metrics.items():
  print(f"{metric_name} value is {get_metrics_values(tfidf_search, X_test_cleaned, y_test, metric)}")

TFidfVectorizer metrics
accuracy value is 0.8582
precision value is 0.8742857142857143
recall value is 0.8478092783505154
f1 value is 0.8608439646712462


In [70]:
logreg = tfidf_search.best_estimator_["classifier"]

<h4>Maybe need to make an ensemble</h4>

In [72]:
from sklearn.ensemble import BaggingClassifier

In [74]:
param_grid = {
    'classifier__n_estimators' : (5, 10, 15)
}

In [75]:
bagging_logreg = BaggingClassifier(logreg, bootstrap = True, random_state = 7)

In [77]:
tfidf_bagging_pipe = Pipeline([
  ("vectorizer", tfidf),
  ("classifier", bagging_logreg)
])
tfidf_bagging_search = GridSearchCV(tfidf_bagging_pipe, param_grid, n_jobs=-1)
tfidf_bagging_search.fit(X_train_clean, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vectorizer',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=0.25,
                                                        max_features=100000,
                                                        min_df=1,
                                                        ngram_range=(1, 3),
                                              

In [78]:
tfidf_bagging_search.best_params_

{'classifier__n_estimators': 15}

In [79]:
tfidf_bagging_search.best_score_

0.8583714285714287

In [80]:
print("Bagging metrics")
for metric_name, metric in metrics.items():
  print(f"{metric_name} value is {get_metrics_values(tfidf_bagging_search, X_test_cleaned, y_test, metric)}")

Bagging metrics
accuracy value is 0.8564666666666667
precision value is 0.8754817275747508
recall value is 0.8441824705279344
f1 value is 0.8595472633570356


<h4>Bagging is worse than a single classifier</h4>

<h3>Summary</h3>

<h4>Maybe we need to work not with BOW models but with sequences models.In other notebook i try to use neural nets for solving this problem.</h4>