<h3>There is process of making model after analysis and experiments</h3>

<h4>Import modules</h4>

In [1]:
import pandas as pd
import numpy as np
import sklearn
import spacy

<h4>Let's load the data</h4>

In [2]:
from sklearn.model_selection import train_test_split

def get_datasets_from_file(filename, label_column_name, test_size):
    if not (0 <= test_size <= 1):
        raise Exception("train_test_split must be from 0 to 1")
    data = pd.read_csv(filename)
    if label_column_name not in data.columns:
        raise Exception(f"There is no column '{label_column_name}' in the data")
    X = data.drop([label_column_name], axis=1)
    y = data[label_column_name]
    return train_test_split(X, y, test_size=test_size, random_state=42)

In [3]:
X_train, X_test, y_train, y_test = get_datasets_from_file("IMDB_dataset.csv", "sentiment", 0.3)

In [4]:
X_train.head()

Unnamed: 0,review
38094,"As much as I love trains, I couldn't stomach t..."
40624,"This was a very good PPV, but like Wrestlemani..."
49425,Not finding the right words is everybody's pro...
35734,I'm really suprised this movie didn't get a hi...
41708,I'll start by confessing that I tend to really...


In [5]:
y_train.head()

38094    negative
40624    positive
49425    negative
35734    positive
41708    negative
Name: sentiment, dtype: object

In [6]:
y_train, y_test = y_train.map({"positive": 1, "negative": 0}), y_test.map({"positive": 1, "negative": 0})
y_train.head()

38094    0
40624    1
49425    0
35734    1
41708    0
Name: sentiment, dtype: int64

<h4>Copy some functions from experiments</h4>

In [7]:
from sklearn.base import TransformerMixin

In [8]:
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

In [9]:
punctuations = string.punctuation
stop_words = STOP_WORDS
parser = English()

In [10]:
import re

#function from EDA
def spacy_text_normalizer(text):
    tokens = re.sub(r"<.*>", "", text) #Remove all tags
    tokens = parser(text) #Get doc from text
    tokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens ] #Normalize words
    tokens = [ word for word in tokens if word not in stop_words and word not in punctuations ] #Remove stop words and punctuation
    return " ".join(tokens)

In [11]:
class TextNormalizer(TransformerMixin):
    def __init__(self, text_column_name="review"):
        self.text_column_name = text_column_name
        
    def transform(self, X, **transform_params):
        return [spacy_text_normalizer(text) for text in X[self.text_column_name]]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

In [12]:
normalizer = TextNormalizer()

<h4>I will use logistic regression and tfidf vectorization</h4>

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [14]:
feature_extractor = TfidfVectorizer()
classifier = LogisticRegression()

model = Pipeline([
            ("normalizer", normalizer),
            ("tfidf", feature_extractor),
            ("logreg", classifier)
        ])

In [15]:
params = {
    'tfidf__max_features':[100, 2000],
    'tfidf__ngram_range': [(1, 1), (1, 2), (2, 2)],
    
    'logreg__C': np.logspace(-3,3,3),
    'logreg__penalty': ["l1","l2"],
}

In [16]:
from sklearn.model_selection import GridSearchCV

In [17]:
search = GridSearchCV(model, params, n_jobs=-1)

In [18]:
search.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('normalizer',
                                        <__main__.TextNormalizer object at 0x00000223B56FCD88>),
                                       ('tfidf', TfidfVectorizer()),
                                       ('logreg', LogisticRegression())]),
             n_jobs=-1,
             param_grid={'logreg__C': array([1.e-03, 1.e+00, 1.e+03]),
                         'logreg__penalty': ['l1', 'l2'],
                         'tfidf__max_features': [100, 2000],
                         'tfidf__ngram_range': [(1, 1), (1, 2), (2, 2)]})

In [19]:
search.best_params_

{'logreg__C': 1.0,
 'logreg__penalty': 'l2',
 'tfidf__max_features': 2000,
 'tfidf__ngram_range': (1, 1)}

In [20]:
search.best_score_

0.8737428571428572

In [21]:
model = search.best_estimator_

In [22]:
from sklearn import metrics
metrics = {
    "accuracy": metrics.accuracy_score,
    "precision": metrics.precision_score,
    "recall": metrics.recall_score,
    "f1": metrics.f1_score
}

In [23]:
print("Metrics")
y_pred = model.predict(X_test)
for metric_name, metric in metrics.items():
    print(f"{metric_name} value is {metric(y_test, y_pred)}")
print("\n")

Metrics
accuracy value is 0.8766
precision value is 0.8707676402171104
recall value is 0.8878640137040453
f1 value is 0.8792327265609708




<h4>Save model</h4>

In [24]:
import pickle
pickle.dump(model, open("model.pickle", 'wb'))

<h3>Summary</h3>

<h4>Results after tuning hyperparameters are lower than in experiments with this combination of extractor and classifier. Maybe we need to continue looking for optimal hyperparameters. For example look for logreg__C in range from -10 to 10 with step 10 or make more tfidf max_features.</h4>

<h4>There would be a video card, would train an LSTM + 1D convolution neural network on keras. I looked on the Internet for a solution on this dataset, where it gives 94% accuracy</h4>