<a href="https://colab.research.google.com/github/harnettd/llm-project/blob/reorg/1-classify-with-scikit-learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of Movie Reviews using `TfidfVectorizer` and Classifiers from scikit-learn

In this notebook, I perform sentiment analysis of movie reviews using classes available in `scikit-learn`.

The dataset consists of 50k highly polarized (*clearly* favourable or unfavourable) movie reviews from IMDB. The set is partitioned into a labelled train set of 25k reviews and a labelled test set of 25k reviews. The reviews are preprocessed by lower-casing, removing HTML tags, and removing punctuation. The reviews are then tokenized: English stop words are removed and the remaining tokens are stemmed. Corpus vectorization is implemented using `TfidfVectorizer`. Multiple classification models from `scikit-learn` are trained on the train set and scored on the test set. The best-performing model is pickled for later deployment.

In [None]:
!git clone https://github.com/harnettd/llm-project.git
%cd llm-project
!git checkout reorg

## Installs and Imports

In [None]:
!pip install datasets

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import pickle

from scipy.stats import loguniform

from nltk import PorterStemmer

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import f1_score, make_scorer

from datasets import load_dataset

from app.cleaner.preprocessor import Preprocessor
from app.cleaner.tokenizer import Tokenizer

## Load the IMDB Dataset

In [None]:
ds = load_dataset('imdb')
train, test = pd.DataFrame(ds['train']), pd.DataFrame(ds['test'])

In the following DataFrame samples, a label of 0 corresponds to a negative review (*i.e.,* thumbs-down) whereas a label of 1 corresponds to a positive review (*i.e.,* thumbs-up).

In [None]:
train.sample(5)

In [None]:
test.sample(5)

## Exploratory Data Analysis

In [None]:
train.info()

I plot the distribution of movie review labels in the train set.

In [None]:
fig, ax = plt.subplots()
train.groupby('label').count().plot(kind='bar', alpha=0.75, ax=ax)
ax.set_ylabel('count')
ax.set_title('Distribution of movie review labels')
ax.legend().set_visible(False)
plt.show()

From the above bar graph, the train set appears to be balanced. To confirm:

In [None]:
train['label'].value_counts()

It is instructive to read a handful of reviews to better understand what is meant by "highly polarized."

In [None]:
thumbs_ups = train[train['label'] == 1]
thumbs_downs = train[train['label'] == 0]

In [None]:
thumbs_up_samples = thumbs_ups['text'].sample(3).to_list()
print('\n\n'.join(thumbs_up_samples))

In [None]:
thumbs_down_samples = thumbs_downs['text'].sample(3).to_list()
print('\n\n'.join(thumbs_down_samples))

Generally, it is pretty clear from reading a particular review whether it is a thumbs-up or thumbs-down.

## Preprocessor

The preprocessor transforms movie reviews by lower-casing, removing HTML tags, and removing punctuation.
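
For readers without the repository at hand, the three steps can be sketched with the standard library alone. This is a minimal illustration and not the repository's `Preprocessor`: the name `preprocess_sketch`, the tag-matching regular expression, and the final whitespace clean-up are all assumptions.

```python
import re
import string

# Assumption: anything between '<' and '>' counts as an HTML tag.
_TAG_RE = re.compile(r'<[^>]+>')
# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def preprocess_sketch(text: str) -> str:
    """Lower-case, strip HTML tags, and strip punctuation (hypothetical helper)."""
    text = text.lower()
    text = _TAG_RE.sub(' ', text)          # replace tags with a space
    text = text.translate(_PUNCT_TABLE)    # drop punctuation
    return ' '.join(text.split())          # collapse runs of whitespace


print(preprocess_sketch('Great movie!<br /><br />LOVED it.'))
# great movie loved it
```

Tags are removed before punctuation here so that the angle brackets are still present when the regular expression runs.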

In [None]:
preprocessor = Preprocessor()

To see the preprocessor in action, pick a random movie review:

In [None]:
doc = train['text'].sample()
doc_preprocessed = preprocessor.transform(doc)

print(doc.to_list()[0])
print()
print(doc_preprocessed[0])

## Tokenizer

The tokenizer removes English stop words and stems the remaining tokens.
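
The idea can be sketched without any third-party packages. Everything below is hypothetical: `TOY_STOP_WORDS` is a tiny stand-in for `ENGLISH_STOP_WORDS`, `strip_plural` is a toy stand-in for a stemmer (it is not the Porter algorithm), and the order of operations (filter first, then stem) is an assumption about how the repository's `Tokenizer` behaves.

```python
from typing import Callable, FrozenSet

# Tiny, hypothetical stop list for illustration only.
TOY_STOP_WORDS = frozenset({'a', 'an', 'and', 'is', 'it', 'of', 'the', 'this', 'was'})


def strip_plural(token: str) -> str:
    """Toy stand-in for a stemmer: drop one trailing 's' from longer tokens."""
    return token[:-1] if token.endswith('s') and len(token) > 3 else token


def tokenize_sketch(
    text: str,
    stem: Callable[[str], str] = strip_plural,
    stop_words: FrozenSet[str] = TOY_STOP_WORDS,
) -> str:
    """Split on whitespace, drop stop words, then stem what is left."""
    kept = [stem(token) for token in text.split() if token not in stop_words]
    return ' '.join(kept)


print(tokenize_sketch('this was a movie of great actors'))
# movie great actor
```

With NLTK installed, `PorterStemmer().stem` could be passed as the `stem` argument in place of the toy function.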

In [None]:
tokenizer = Tokenizer(PorterStemmer(), ENGLISH_STOP_WORDS)

To see the tokenizer in action, transform the previously preprocessed movie review.

In [None]:
doc_tokenized = tokenizer.transform(doc_preprocessed)

print(doc_preprocessed[0])
print()
print(doc_tokenized[0])

## Vectorizer

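I use `TfidfVectorizer` to map documents to vectors. Terms that appear in more than 95% of documents (`max_df`) or in fewer than two documents (`min_df`) are ignored, and the vocabulary is capped at the 10,000 most frequent terms (`max_features`).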

In [None]:
vectorizer = TfidfVectorizer(
    max_df=0.95,
    min_df=2,
    max_features=10_000,
    strip_accents='unicode'
)

cleaner = Pipeline([
    ('preprocessor', preprocessor),
    ('tokenizer', tokenizer),
    ('vectorizer', vectorizer)
])
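
To make the weighting concrete, here is a minimal standard-library sketch of the TF-IDF formula that `TfidfVectorizer` documents for its default settings (`smooth_idf=True`, `norm='l2'`, raw term counts). The function name `tfidf_sketch` and the toy corpus are assumptions, and the sketch ignores `max_df`, `min_df`, `max_features`, and scikit-learn's own token pattern.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so that every non-empty document has unit length.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


vectors = tfidf_sketch(['great film', 'bad film', 'great great cast'])
```

In the second toy document, `bad` occurs in one document and `film` in two, so `bad` receives the larger weight.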

## Classifiers

I train and test logistic regression, random forest, and support vector machine classifiers on the IMDB movie reviews. I score the models using F1-score because the train set is balanced and the consequences of misclassifying a positive review are the same as misclassifying a negative review.
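
As a reminder of what the metric measures, the binary F1-score can be computed by hand from true positives, false positives, and false negatives. This is a sketch only: `f1_sketch` is a hypothetical helper, the labels are made up for illustration, and the positive class is taken to be label 1, which is also the default `pos_label` of `f1_score`.

```python
def f1_sketch(y_true, y_pred, positive=1):
    """Binary F1 = 2*TP / (2*TP + FP + FN), with 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
print(round(f1_sketch(toy_true, toy_pred), 3))
# 0.667
```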

In [None]:
X_train, y_train = train['text'], train['label']
X_test, y_test = test['text'], test['label']

### Logistic Regression

In [None]:
lr = LogisticRegression(
    penalty='l2',
    solver='saga',
    max_iter=500
)

pipe = Pipeline([
    ('cleaner', cleaner),
    ('classifier', lr)
])

param_distributions = {
    'classifier__C': loguniform(1e-2, 1e2)
}

search_lr = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_distributions,
    n_iter=1,
    scoring=make_scorer(f1_score),
    n_jobs=1,
    refit=True
)
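
The `loguniform(1e-2, 1e2)` distribution draws `C` so that its logarithm is uniform, which means each decade between 0.01 and 100 is equally likely. With `n_iter=1`, the search draws a single such value. A minimal standard-library sketch of the sampling idea follows; `sample_loguniform` is a hypothetical helper, not the SciPy implementation.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose natural logarithm is uniform on [ln(low), ln(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_loguniform(1e-2, 1e2, rng) for _ in range(10_000)]

# Roughly half of the draws should fall below 1, the geometric midpoint.
share_below_one = sum(s < 1.0 for s in samples) / len(samples)
```

A plain uniform distribution over the same interval would place about 99% of its draws above 1, which is why a log scale is the usual choice for regularization strengths.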

In [None]:
search_lr.fit(X_train, y_train);

In [None]:
test_score_lr = search_lr.score(X_test, y_test)

print(f'Best parameters: {search_lr.best_params_}')
print(f'Test F1-score: {test_score_lr}')

In [None]:
best_model = search_lr.best_estimator_
best_score = test_score_lr

### Random Forest

In [None]:
rfc = RandomForestClassifier()

pipe = Pipeline([
    ('cleaner', cleaner),
    ('classifier', rfc)
])

param_distributions = {
    'classifier__n_estimators': [10, 30, 100, 300, 1000],
    'classifier__max_depth': list(range(10, 101)),
    'classifier__min_samples_split': list(range(2, 11)),
    'classifier__min_samples_leaf': list(range(1, 11))
}

search_rfc = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_distributions,
    n_iter=1,
    scoring=make_scorer(f1_score),
    n_jobs=1,
    refit=True
)
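
To put `n_iter=1` in perspective, the four lists above span a large discrete grid, of which the search evaluates a single point. The sketch below counts the combinations and draws one candidate with the standard library; it mimics the idea only and does not call scikit-learn. The variable names are assumptions.

```python
import math
import random

# Same value lists as in the search above.
param_lists = {
    'classifier__n_estimators': [10, 30, 100, 300, 1000],
    'classifier__max_depth': list(range(10, 101)),
    'classifier__min_samples_split': list(range(2, 11)),
    'classifier__min_samples_leaf': list(range(1, 11)),
}

# Size of the full grid: 5 * 91 * 9 * 10.
n_combinations = math.prod(len(values) for values in param_lists.values())

# One random candidate, analogous to a single search iteration.
rng = random.Random(0)
candidate = {name: rng.choice(values) for name, values in param_lists.items()}

print(n_combinations)
# 40950
```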

In [None]:
search_rfc.fit(X_train, y_train);

In [None]:
test_score_rfc = search_rfc.score(X_test, y_test)

print(f'Best parameters: {search_rfc.best_params_}')
print(f'Test F1-score: {test_score_rfc}')

In [None]:
if test_score_rfc > best_score:
    best_model = search_rfc.best_estimator_
    best_score = test_score_rfc

### Support Vector Machine

In [None]:
svc = LinearSVC(penalty='l2', max_iter=500)

pipe = Pipeline([
    ('cleaner', cleaner),
    ('classifier', svc)
])

param_distributions = {
    'classifier__C': loguniform(1e-2, 1e2)
}

search_svc = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_distributions,
    n_iter=1,
    scoring=make_scorer(f1_score),
    n_jobs=1,
    refit=True
)

In [None]:
search_svc.fit(X_train, y_train);

In [None]:
test_score_svc = search_svc.score(X_test, y_test)

print(f'Best parameters: {search_svc.best_params_}')
print(f'Test F1-score: {test_score_svc}')

In [None]:
if test_score_svc > best_score:
    best_model = search_svc.best_estimator_
    best_score = test_score_svc

## Conclusion

In [None]:
print('Test F1-scores:')
print(f'    Logistic regression: {test_score_lr}')
print(f'    Random forest: {test_score_rfc}')
print(f'    Support vector machine: {test_score_svc}')

I pickle the best-performing model so that it can be deployed later.

In [None]:
model_dir = 'app/model'
with open(f'{model_dir}/best_model.pkl', 'wb') as file:
    pickle.dump(best_model, file)
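
For later deployment, the file is read back with `pickle.load`. The round trip below is a sketch that uses a plain dictionary as a stand-in for `best_model` and a temporary directory instead of `app/model`; both are assumptions made so that the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; in the notebook this would be the fitted pipeline.
stand_in_model = {'name': 'stand-in for best_model', 'C': 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'best_model.pkl'

    # Write the object to disk in binary mode.
    with open(model_path, 'wb') as file:
        pickle.dump(stand_in_model, file)

    # Read it back, as a deployment script would.
    with open(model_path, 'rb') as file:
        restored = pickle.load(file)

print(restored == stand_in_model)
# True
```

Pickle stores classes by reference, so loading the real pipeline requires the `app.cleaner` modules and `scikit-learn` to be importable in the deployment environment, ideally with the same `scikit-learn` version that was used for training. Only unpickle files from a trusted source.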