# IMDB sentiment classification 

We perform simple sentiment classification on the IMDB review dataset available at https://ai.stanford.edu/~amaas/data/sentiment/.

This notebook demonstrates a quick ML development process.

In [37]:
# data can be found in https://ai.stanford.edu/~amaas/data/sentiment/
import urllib.request
import os
from os import listdir
from os.path import isfile, join
import nltk
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

filename = 'aclImdb_v1.tar.gz'

if not os.path.exists(filename):
    # download the archive to the path stored in `filename`
    urllib.request.urlretrieve('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', filename)

# extract the archive if the aclImdb directory does not exist yet
if not os.path.exists('aclImdb'):
    !tar -xvzf {filename}
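
The `!tar` line above only works inside IPython/Jupyter. As a hedged alternative for plain Python scripts, the same step can be sketched with the standard-library `tarfile` module. The helper name `extract_if_missing` and its parameters are hypothetical and not part of the original notebook, and the sketch assumes the archive comes from a trusted source.

```python
import os
import tarfile


def extract_if_missing(archive_path, target_dir, extract_to='.'):
    """Extract a .tar.gz archive unless `target_dir` already exists.

    Returns True if extraction happened, False if it was skipped.
    """
    if os.path.exists(target_dir):
        return False
    # 'r:gz' opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(extract_to)
    return True
```

In this notebook the call would look like `extract_if_missing('aclImdb_v1.tar.gz', 'aclImdb')`.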

## Evaluation metrics

Our problem is a simple binary classification task (positive or negative review) and the dataset
is balanced, so for the MVP we propose accuracy as the main metric.
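
As a minimal sketch of why accuracy is a reasonable choice on a balanced dataset: a predictor that always outputs one class scores exactly 0.5, so anything above that reflects real signal. The toy labels below are invented for illustration only.

```python
from sklearn.metrics import accuracy_score

# Hypothetical toy labels: 1.0 = positive review, 0.0 = negative review.
y_true = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 0.0, 0.0, 0.0, 1.0]

# Accuracy is the fraction of predictions that match the true labels.
accuracy = accuracy_score(y_true, y_pred)

# On a balanced dataset, always predicting one class gives 0.5 accuracy.
baseline = accuracy_score(y_true, [1.0] * len(y_true))

print(f'accuracy={accuracy:.3f}; baseline={baseline:.3f}')
```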

## Train/valid split

In [8]:
full_train_dir = 'aclImdb/train'
full_train_pos_dir = f'{full_train_dir}/pos'
full_train_neg_dir = f'{full_train_dir}/neg'
full_train_pos_filenames = [f for f in listdir(full_train_pos_dir) if isfile(join(full_train_pos_dir, f))]
full_train_neg_filenames = [f for f in listdir(full_train_neg_dir) if isfile(join(full_train_neg_dir, f))]

print(f'total number of pos files available: {len(full_train_pos_filenames)}')

total number of pos files available: 12500


In [9]:
train_num_files = 10000
train_pos_filenames = full_train_pos_filenames[:train_num_files]
train_neg_filenames = full_train_neg_filenames[:train_num_files]

valid_pos_filenames = full_train_pos_filenames[train_num_files:]
valid_neg_filenames = full_train_neg_filenames[train_num_files:]

print(f"""
train_pos_filenames={len(train_pos_filenames)}
train_neg_filenames={len(train_neg_filenames)}
valid_pos_filenames={len(valid_pos_filenames)}
valid_neg_filenames={len(valid_neg_filenames)}
""")


train_pos_filenames=10000
train_neg_filenames=10000
valid_pos_filenames=2500
valid_neg_filenames=2500
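
The split above takes the first 10,000 filenames in whatever order `listdir` returns them, and that order is arbitrary. A hedged sketch of a reproducible alternative is to sort and then shuffle with a fixed seed before slicing. The helper `split_filenames`, the seed value, and the stand-in filenames are assumptions made for illustration.

```python
import random


def split_filenames(filenames, train_size, seed=42):
    """Deterministically shuffle a copy of `filenames`, then split it."""
    # Sort first so the result does not depend on listdir's ordering.
    shuffled = sorted(filenames)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Hypothetical stand-in for the 12,500 filenames of one class.
example_filenames = [f'{i}_7.txt' for i in range(12500)]
train_part, valid_part = split_filenames(example_filenames, 10000)
print(len(train_part), len(valid_part))
```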



## Feature extraction

In [31]:
def get_content(pos_filenames, neg_filenames):
    # read review texts from aclImdb/train; positives are labelled 1.0, negatives 0.0
    texts = []
    labels = [1.0] * len(pos_filenames) + [0.0] * len(neg_filenames)
    for name in pos_filenames:
        # explicit encoding avoids depending on the platform default
        with open(f'{full_train_dir}/pos/{name}', 'r', encoding='utf-8') as f:
            texts.append(f.read())

    for name in neg_filenames:
        with open(f'{full_train_dir}/neg/{name}', 'r', encoding='utf-8') as f:
            texts.append(f.read())

    return texts, labels

In [32]:
train_texts, train_y = get_content(train_pos_filenames, train_neg_filenames)
valid_texts, valid_y = get_content(valid_pos_filenames, valid_neg_filenames)

In [33]:
print(f'train_texts={len(train_texts)}; train_y={len(train_y)}')

train_texts=20000; train_y=20000


In [34]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)

In [36]:
valid_X = vectorizer.transform(valid_texts)
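
Note that the vectorizer is fit on the training texts only and then reused for the validation texts, so the vocabulary and IDF weights come from the training data alone. A small sketch on an invented toy corpus shows the effect: both matrices share the same number of columns, and words unseen during fitting are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the IMDB data.
toy_train = ['a great movie', 'a terrible movie', 'great acting']
toy_valid = ['terrible acting', 'an unseen word']

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training texts.
toy_train_X = toy_vectorizer.fit_transform(toy_train)
# transform reuses them; tokens outside the learned vocabulary are dropped.
toy_valid_X = toy_vectorizer.transform(toy_valid)

print(toy_train_X.shape, toy_valid_X.shape)
print(sorted(toy_vectorizer.vocabulary_))
```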

## Training

In [41]:
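# Set the parameters by cross-validation.
# Note: of these solvers only liblinear supports the l1 penalty, so the
# (lbfgs, l1) and (sag, l1) combinations fail to fit and score as nan.
tuned_parameters = [{'solver': ['lbfgs', 'liblinear', 'sag'],
                     'penalty': ['l1', 'l2']}]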

In [42]:
scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(
        LogisticRegression(), tuned_parameters, scoring='%s_macro' % score
    )
    clf.fit(X, train_y)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # predict with the best estimator found by the grid search
    valid_preds = clf.predict(valid_X)
    print(classification_report(valid_y, valid_preds))
    print()

# Tuning hyper-parameters for precision



ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.



Best parameters set found on development set:

{'penalty': 'l2', 'solver': 'lbfgs'}

Grid scores on development set:

nan (+/-nan) for {'penalty': 'l1', 'solver': 'lbfgs'}
0.868 (+/-0.014) for {'penalty': 'l1', 'solver': 'liblinear'}
nan (+/-nan) for {'penalty': 'l1', 'solver': 'sag'}
0.885 (+/-0.013) for {'penalty': 'l2', 'solver': 'lbfgs'}
0.885 (+/-0.013) for {'penalty': 'l2', 'solver': 'liblinear'}
0.885 (+/-0.013) for {'penalty': 'l2', 'solver': 'sag'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

              precision    recall  f1-score   support

         0.0       0.90      0.88      0.89      2500
         1.0       0.88      0.90      0.89      2500

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000


# Tuning hyper-parameters for recall



ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.



Best parameters set found on development set:

{'penalty': 'l2', 'solver': 'lbfgs'}

Grid scores on development set:

nan (+/-nan) for {'penalty': 'l1', 'solver': 'lbfgs'}
0.868 (+/-0.014) for {'penalty': 'l1', 'solver': 'liblinear'}
nan (+/-nan) for {'penalty': 'l1', 'solver': 'sag'}
0.884 (+/-0.013) for {'penalty': 'l2', 'solver': 'lbfgs'}
0.884 (+/-0.013) for {'penalty': 'l2', 'solver': 'liblinear'}
0.884 (+/-0.013) for {'penalty': 'l2', 'solver': 'sag'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

              precision    recall  f1-score   support

         0.0       0.90      0.88      0.89      2500
         1.0       0.88      0.90      0.89      2500

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000
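
The `nan` rows in the grid scores above come from pairing the `l1` penalty with `lbfgs` and `sag`, which the error messages show is unsupported. One possible way to avoid those failed fits, sketched here under the assumption that the same solvers and penalties are of interest, is to pass a list of grids that contain only compatible pairs. `GridSearchCV` accepts such a list as its parameter grid; `ParameterGrid` is used below only to enumerate the combinations. The name `compatible_parameters` is hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# Each dict is its own sub-grid, so l1 is only ever paired with liblinear.
compatible_parameters = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2']},
    {'solver': ['lbfgs', 'sag'], 'penalty': ['l2']},
]

combos = list(ParameterGrid(compatible_parameters))
for params in combos:
    print(params)
```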




In [43]:
# How does the best model perform on the training set?
# (clf is the grid search from the last loop iteration, i.e. tuned for recall)
train_preds = clf.predict(X)
print(classification_report(train_y, train_preds))

              precision    recall  f1-score   support

         0.0       0.94      0.93      0.93     10000
         1.0       0.93      0.94      0.93     10000

    accuracy                           0.93     20000
   macro avg       0.93      0.93      0.93     20000
weighted avg       0.93      0.93      0.93     20000



The model reaches 0.93 accuracy on the training set versus 0.89 on the validation set, so it is overfitting the current training set. A few things to note:
1. Even simple logistic regression already overfits. More sophisticated models (MLP, tree-based models) are likely to overfit even further on the current dataset.
2. More data is likely to help the model generalize better.
3. We should try better regularization (the best model above uses L2).
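
One way to explore the regularization point is to vary `C`, which in scikit-learn's `LogisticRegression` is the inverse of the regularization strength (smaller `C` means stronger regularization), and to watch the gap between training and validation accuracy. The sketch below uses synthetic data from `make_classification` as a stand-in, so the data, the `C` values, and the variable names are all assumptions and the numbers say nothing about IMDB.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; these are not the IMDB features).
demo_X, demo_y = make_classification(
    n_samples=400, n_features=50, n_informative=5, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0
)

gaps = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    # Smaller C applies a stronger L2 penalty to the coefficients.
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    valid_acc = model.score(X_va, y_va)
    # The train/validation gap is a rough indicator of overfitting.
    gaps[C] = train_acc - valid_acc
    print(f'C={C}: train={train_acc:.3f} valid={valid_acc:.3f} gap={gaps[C]:.3f}')
```

On the real features, the same loop would use `X`, `train_y`, `valid_X`, and `valid_y` from the cells above.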

## Evaluation