<h3>Expirements with some models for solving problem of binary classification of movies</h3>

<h4>First of all we need to import all modules</h4>

In [1]:
import pandas as pd
import numpy as np
import sklearn
import spacy

<h4>The next step is to get data</h4>

In [2]:
from sklearn.model_selection import train_test_split

def get_datasets_from_file(filename, label_column_name, test_size):
    if not (0 <= test_size <= 1):
        raise Exception("train_test_split must be from 0 to 1")
    data = pd.read_csv(filename)
    if label_column_name not in data.columns:
        raise Exception(f"There is no column '{label_column_name}' in the data")
    X = data.drop([label_column_name], axis=1)
    y = data[label_column_name]
    return train_test_split(X, y, test_size=test_size, random_state=42)

In [3]:
X_train, X_test, y_train, y_test = get_datasets_from_file("IMDB_dataset.csv", "sentiment", 0.3)

In [4]:
X_train.head()

Unnamed: 0,review
38094,"As much as I love trains, I couldn't stomach t..."
40624,"This was a very good PPV, but like Wrestlemani..."
49425,Not finding the right words is everybody's pro...
35734,I'm really suprised this movie didn't get a hi...
41708,I'll start by confessing that I tend to really...


In [5]:
y_train.head()

38094    negative
40624    positive
49425    negative
35734    positive
41708    negative
Name: sentiment, dtype: object

In [6]:
y_train, y_test = y_train.map({"positive": 1, "negative": 0}), y_test.map({"positive": 1, "negative": 0})
y_train.head()

38094    0
40624    1
49425    0
35734    1
41708    0
Name: sentiment, dtype: int64

<h4>Next, write functions for data processing</h4>

In [7]:
from sklearn.base import TransformerMixin

In [8]:
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

In [9]:
punctuations = string.punctuation
stop_words = STOP_WORDS
parser = English()

In [10]:
import re

#function from EDA
def spacy_text_normalizer(text):
    tokens = re.sub(r"<.*>", "", text) #Remove all tags
    tokens = parser(text) #Get doc from text
    tokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens ] #Normalize words
    tokens = [ word for word in tokens if word not in stop_words and word not in punctuations ] #Remove stop words and punctuation
    return " ".join(tokens)

In [11]:
class TextNormalizer(TransformerMixin):
    def __init__(self, text_column_name):
        self.text_column_name = text_column_name
        
    def transform(self, X, **transform_params):
        return [spacy_text_normalizer(text) for text in X[self.text_column_name]]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

In [12]:
normalizer = TextNormalizer("review")
X_transformed_text = normalizer.fit_transform(X_train);

In [13]:
X_transformed_text[0]

'love trains stomach movie premise steal locomotive drive arkansas chicago hitting train way right impossible plot lines hit board imagine disgruntled nasa employees stealing crawler totes shuttles fro driving new york idea.<br /><br />having said nice try wilford brimely quaker oats best levon helm turns good performance dimwitted meaning sidekick bob balaban suitably wormy corporate guy little guy takes goliath story gets airing'

<h4>Now we are ready to make some experiments with our data</h4>

In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [15]:
classifiers = {"Random forest": RandomForestClassifier(), "Log reg": LogisticRegression()}
feature_extractors = {"Count vectorizer": CountVectorizer(), "Tfidf vectorizer": TfidfVectorizer()}

In [16]:
pipelines = []

for classifier_name, classifier in classifiers.items():
    for feature_extractor_name, feature_extractor in feature_extractors.items():
        pipeline = Pipeline([
            ("text_normalizer", normalizer),
            (feature_extractor_name, feature_extractor),
            (classifier_name, classifier)
        ])
        pipelines.append(pipeline)

In [17]:
for pipeline in pipelines:
    pipeline.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [19]:
from sklearn import metrics
metrics = {
    "accuracy": metrics.accuracy_score,
    "precision": metrics.precision_score,
    "recall": metrics.recall_score,
    "f1": metrics.f1_score
}

In [20]:
for pipeline in pipelines:
    print(list(pipeline.get_params().keys())[4] + " and " + list(pipeline.get_params().keys())[5])
    print("Метрики")
    y_pred = pipeline.predict(X_test)
    for metric_name, metric in metrics.items():
        print(f"{metric_name} value is {metric(y_test, y_pred)}")
    print("\n")

Count vectorizer and Random forest
Метрики
accuracy value is 0.8149333333333333
precision value is 0.843638440668285
recall value is 0.7784951904071683
f1 value is 0.8097587719298246


Tfidf vectorizer and Random forest
Метрики
accuracy value is 0.8592
precision value is 0.8664525625585441
recall value is 0.8532085913822638
f1 value is 0.8597795777453192


Count vectorizer and Log reg
Метрики
accuracy value is 0.8756666666666667
precision value is 0.861454912856782
recall value is 0.898800896033733
f1 value is 0.8797317340555878


Tfidf vectorizer and Log reg
Метрики
accuracy value is 0.8956
precision value is 0.8883301096067053
recall value is 0.9077612333640795
f1 value is 0.8979405630865486




<h3>The results are that logistic regression and tfidf are a first-to-go model. Then you need to try to adjust the hyperparameters of these models, which I will do in another notepad.</h3>