<h2>Hello, it is my notebook with experiments with IMDB Movies Dataset task solving models</h2>

<h4>Import modules and load data</h4>

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("IMDB Dataset.csv")

In [3]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


<h4>Make labels numerical</h4>

In [5]:
data = pd.DataFrame({"review": data["review"], "sentiment": data["sentiment"].map({"positive": 1, "negative": 0})})

In [6]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


<h4>Split the model into two parts - train set and test set</h4>

In [8]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(["sentiment"], axis=1), data["sentiment"], test_size=0.3, random_state=7)

In [17]:
X_train.head()

Unnamed: 0,review
46521,Sam Fuller's excellent PICK UP ON SOUTH STREET...
13908,"If at all possible, try to view all five of th..."
39915,THE 40 YEAR-OLD VIRGIN (2005) **** Steve Carel...
28440,I had seen Rik Mayall in Blackadder and the Ne...
6011,I have seen Maslin Beach a couple of times - b...


In [18]:
y_train.head()

46521    1
13908    0
39915    1
28440    0
6011     1
Name: sentiment, dtype: int64

<h4>Make pipeline component for text normalize</h4>

In [23]:
import re
import string
from spacy.lang.en.stop_words import STOP_WORDS as stop_words
from spacy.lang.en import English

In [35]:
parser = English()
punctuations = string.punctuation

In [25]:
def spacy_text_normalizer(text):
    text = re.sub(r"<.*>", "", text) #Remove all tags
    tokens = parser(text) #Get doc from text
    tokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens ] #Normalize words
    tokens = [ word for word in tokens if word not in stop_words and word not in punctuations ] #Remove stop words and punctuation
    return " ".join(tokens)

In [26]:
from sklearn.base import TransformerMixin

In [27]:
class TextNormalizer(TransformerMixin):
    def __init__(self, text_column_name = "review"):
        self.text_column_name = text_column_name
        
    def transform(self, X, **transform_params):
        return [spacy_text_normalizer(text) for text in X[self.text_column_name]]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

<h4>The next step is to tune text vectorizer. First of all you need to choose binary classification model. I chose logistic regression</h4>

In [28]:
from sklearn.linear_model import LogisticRegression

<h4>Importing required modules</h4>

In [29]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

<h4>Let's make 2 pipelines, tune hyperparameters and choose vectorizer</h4>

In [30]:
count_pipe = Pipeline([
  ("normalizer", TextNormalizer()),
  ("vectorizer", CountVectorizer()),
  ("classifier", LogisticRegression())
])

In [32]:
tfidf_pipe = Pipeline([
  ("normalizer", TextNormalizer()),
  ("vectorizer", TfidfVectorizer()),
  ("classifier", LogisticRegression())
])

<h4>Let's write metrics function</h4>

In [33]:
def get_metrics_values(model, X, y, metric):
  y_pred = model.predict(X)
  return metric(y_pred, y)

<h4>Fit pipelines</h4>

In [36]:
%%time
count_pipe.fit(X_train, y_train)

CPU times: user 46.6 s, sys: 2.43 s, total: 49.1 s
Wall time: 46.3 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(memory=None,
         steps=[('normalizer',
                 <__main__.TextNormalizer object at 0x7fbfb145da58>),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=N

In [37]:
%%time
tfidf_pipe.fit(X_train, y_train)

CPU times: user 43.7 s, sys: 1.4 s, total: 45.1 s
Wall time: 43.5 s


Pipeline(memory=None,
         steps=[('normalizer',
                 <__main__.TextNormalizer object at 0x7fbfb1437d30>),
                ('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 s...
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual

<h4>Let's import some metrics and check models</h4>

In [38]:
from sklearn import metrics
metrics = {
    "accuracy": metrics.accuracy_score,
    "precision": metrics.precision_score,
    "recall": metrics.recall_score,
    "f1": metrics.f1_score
}

In [40]:
print("CountVectorizer metrics")
for metric_name, metric in metrics.items():
  print(f"{metric_name} value is {get_metrics_values(count_pipe, X_test, y_test, metric)}")

CountVectorizer metrics
accuracy value is 0.8433333333333334
precision value is 0.8471760797342193
recall value is 0.8415841584158416
f1 value is 0.8443708609271523


In [41]:
print("TfIdfVectorizer metrics")
for metric_name, metric in metrics.items():
  print(f"{metric_name} value is {get_metrics_values(tfidf_pipe, X_test, y_test, metric)}")

TfIdfVectorizer metrics
accuracy value is 0.8552
precision value is 0.8699003322259137
recall value is 0.84584571650084
f1 value is 0.8577044025157233


<h4>Let's look at the features count</h4>

In [51]:
len(tfidf_pipe["vectorizer"].get_feature_names())

70246

<h4>TfIdf is a little bit better. Lets tune hyperparameters</h4>

In [42]:
from sklearn.model_selection import StratifiedKFold

In [44]:
skf = StratifiedKFold(n_splits=3)

In [54]:
param_grid = {
    'vectorizer__ngram_range': ((1, 1), (1, 2), (2, 2)),
    'vectorizer__max_features': (10000, 40000, None)
}

<h4>For time economy make new dataset with cleaned data and new pipes</h4>

In [55]:
X_train_clean = TextNormalizer().fit_transform(X_train)

In [57]:
count_logreg_pipe = Pipeline([
  ("vectorizer", CountVectorizer()),
  ("classifier", LogisticRegression())            
])

In [58]:
tfidf_logreg_pipe = Pipeline([
  ("vectorizer", TfidfVectorizer()),
  ("classifier", LogisticRegression())
])

In [60]:
from sklearn.model_selection import GridSearchCV

In [62]:
count_search = GridSearchCV(count_logreg_pipe, param_grid, n_jobs=-1)
count_search.fit(X_train_clean, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vectorizer',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                   

In [None]:
tfidf_search = GridSearchCV(tfidf_logreg_pipe, param_grid, n_jobs=-1)
tfidf_search.fit(X_train_clean, y_train)