# IMDB sentiment classification 

Using https://ai.stanford.edu/~amaas/data/sentiment/ we perform simple sentiment classification. 

This notebook demonstrates a quick ML development process.

In [44]:
# data can be found in https://ai.stanford.edu/~amaas/data/sentiment/
import urllib.request
import os
from os import listdir
from os.path import isfile, join
import nltk
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

filename = 'aclImdb_v1.tar.gz'

# download the archive only if it is not already present
if not os.path.exists(filename):
    urllib.request.urlretrieve('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', filename)

# extract the archive only if the target directory does not exist yet
if not os.path.exists('aclImdb'):
    !tar -xvzf {filename}

## Evaluation metrics

In our case, the problem is a simple binary classification task (positive or negative review) and our dataset
is balanced, so for the MVP we propose to use accuracy as the main metric.
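
To make the choice of metric concrete, here is a minimal sketch of how accuracy is computed and why it is informative on a balanced dataset. The helper `accuracy` and the toy label lists are made up for illustration; they are not part of the notebook's pipeline.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy balanced labels (made up for illustration): 4 positive, 4 negative.
y_true = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# A trivial baseline that always predicts "positive" scores 0.5 on balanced
# data, so accuracy well above 0.5 reflects real signal.
always_positive = [1.0] * len(y_true)
baseline_acc = accuracy(y_true, always_positive)

# A hypothetical model that gets one review of each class wrong.
model_preds = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
model_acc = accuracy(y_true, model_preds)

print(f"baseline accuracy={baseline_acc:.2f}, model accuracy={model_acc:.2f}")
```

On an imbalanced dataset the same baseline could score much higher, which is why accuracy alone would be less trustworthy there.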

## Train/valid split

In [8]:
full_train_dir = 'aclImdb/train'
full_train_pos_dir = f'{full_train_dir}/pos'
full_train_neg_dir = f'{full_train_dir}/neg'
full_train_pos_filenames = [f for f in listdir(full_train_pos_dir) if isfile(join(full_train_pos_dir, f))]
full_train_neg_filenames = [f for f in listdir(full_train_neg_dir) if isfile(join(full_train_neg_dir, f))]

print(f'total number of pos files available: {len(full_train_pos_filenames)}')

total number of pos files available: 12500


In [49]:
train_num_files = 10000

filenames = full_train_pos_filenames + full_train_neg_filenames
labels = [1.0 for _ in range(len(full_train_pos_filenames))] + [0.0 for _ in range(len(full_train_neg_filenames))]
train_filenames, valid_filenames, train_y, valid_y = train_test_split(filenames, labels, test_size=0.2)

print(f"""
train_filenames={len(train_filenames)}
valid_filenames={len(valid_filenames)}
""")


train_filenames=20000
valid_filenames=5000
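
The split above is random and unseeded, so the class counts in each part can drift slightly from 50/50 and change between runs. A possible variant, sketched here on toy stand-ins for `filenames` and `labels` (the `toy_*` names and the seed value are assumptions for illustration), uses the `stratify` and `random_state` arguments of `train_test_split`:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's `filenames` and `labels` (hypothetical values).
toy_filenames = [f"review_{i}.txt" for i in range(100)]
toy_labels = [1.0] * 50 + [0.0] * 50

tr_files, va_files, tr_y, va_y = train_test_split(
    toy_filenames,
    toy_labels,
    test_size=0.2,
    stratify=toy_labels,  # keep the 50/50 class ratio in both splits
    random_state=42,      # arbitrary seed, chosen only for reproducibility
)

print(Counter(tr_y), Counter(va_y))
```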



## Feature extraction

In [46]:
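def get_content(filenames, labels):
    """Read each review, using its label to locate the pos/neg subdirectory."""
    texts = []
    for name, label in zip(filenames, labels):
        subdir = 'pos' if label == 1.0 else 'neg'
        # assumes the review files are UTF-8 encoded
        with open(f'{full_train_dir}/{subdir}/{name}', 'r', encoding='utf-8') as f:
            texts.append(f.read())

    return texts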

In [50]:
train_texts = get_content(train_filenames, train_y)
valid_texts = get_content(valid_filenames, valid_y)

In [51]:
print(f'train_texts={len(train_texts)}; train_y={len(train_y)}')

train_texts=20000; train_y=20000


In [52]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)

In [53]:
valid_X = vectorizer.transform(valid_texts)

## Training

In [54]:
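# Candidate hyper-parameters, selected by cross-validation.
# Note: lbfgs and sag support only the l2 penalty, so the l1 combinations
# for those solvers fail to fit and score nan in the grid results.
tuned_parameters = [{'solver': ['lbfgs', 'liblinear', 'sag'],
                     'penalty': ['l1', 'l2']}]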
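
One way to avoid searching over solver/penalty pairs that cannot be fitted is to give the search a list of sub-grids, one per compatible group. The sketch below is an assumption about how the grid could be restructured, not what was run in this notebook; `compatible_parameters` is a hypothetical name.

```python
from sklearn.model_selection import ParameterGrid

# One sub-grid per compatible group of solvers and penalties.
compatible_parameters = [
    {'solver': ['lbfgs', 'sag'], 'penalty': ['l2']},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2']},
]

# ParameterGrid enumerates the same combinations GridSearchCV would try.
combinations = list(ParameterGrid(compatible_parameters))
for params in combinations:
    print(params)
```

Passing `compatible_parameters` to `GridSearchCV` in place of `tuned_parameters` would try four combinations instead of six.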

In [55]:
scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(
        LogisticRegression(), tuned_parameters, scoring='%s_macro' % score
    )
    clf.fit(X, train_y)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    valid_preds = clf.predict(valid_X)
    print(classification_report(valid_y, valid_preds))
    print()

# Tuning hyper-parameters for precision



ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.



Best parameters set found on development set:

{'penalty': 'l2', 'solver': 'lbfgs'}

Grid scores on development set:

nan (+/-nan) for {'penalty': 'l1', 'solver': 'lbfgs'}
0.868 (+/-0.008) for {'penalty': 'l1', 'solver': 'liblinear'}
nan (+/-nan) for {'penalty': 'l1', 'solver': 'sag'}
0.883 (+/-0.009) for {'penalty': 'l2', 'solver': 'lbfgs'}
0.883 (+/-0.009) for {'penalty': 'l2', 'solver': 'liblinear'}
0.883 (+/-0.009) for {'penalty': 'l2', 'solver': 'sag'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

              precision    recall  f1-score   support

         0.0       0.89      0.88      0.89      2530
         1.0       0.88      0.89      0.89      2470

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000


# Tuning hyper-parameters for recall



ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.



Best parameters set found on development set:

{'penalty': 'l2', 'solver': 'lbfgs'}

Grid scores on development set:

nan (+/-nan) for {'penalty': 'l1', 'solver': 'lbfgs'}
0.868 (+/-0.008) for {'penalty': 'l1', 'solver': 'liblinear'}
nan (+/-nan) for {'penalty': 'l1', 'solver': 'sag'}
0.882 (+/-0.009) for {'penalty': 'l2', 'solver': 'lbfgs'}
0.882 (+/-0.009) for {'penalty': 'l2', 'solver': 'liblinear'}
0.882 (+/-0.009) for {'penalty': 'l2', 'solver': 'sag'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

              precision    recall  f1-score   support

         0.0       0.89      0.88      0.89      2530
         1.0       0.88      0.89      0.89      2470

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000




In [56]:
# How does the best model perform on the training set?
train_preds = clf.predict(X)
print(classification_report(train_y, train_preds))

              precision    recall  f1-score   support

         0.0       0.94      0.93      0.93      9970
         1.0       0.93      0.94      0.94     10030

    accuracy                           0.93     20000
   macro avg       0.93      0.93      0.93     20000
weighted avg       0.93      0.93      0.93     20000



The model is overfitting the training set: accuracy is 0.93 on the training data but 0.89 on the validation data. A couple of things to note:
1. Even a simple logistic regression overfits here. More sophisticated models (MLP, tree-based models) are likely to overfit the current dataset even further.
2. Having more data is likely to help the model generalize better.
3. We should try better regularization (the best model above used L2).
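
As a rough illustration of the regularization point, the sketch below varies the `C` parameter of `LogisticRegression` (the inverse of regularization strength) on synthetic data. The data comes from `make_classification` and is not the IMDB set, and the `C` values are arbitrary, so this only shows the mechanism, not results for this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the tf-idf features (not the IMDB data).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # The default penalty is L2; a smaller C means stronger regularization.
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_toy, y_toy)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm={coef_norms[C]:.3f}")
```

Stronger regularization shrinks the coefficients, which usually narrows the gap between training and validation scores. Adding `C` to the grid search above would be one way to tune it.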

## Next Steps

Some ideas for improving the performance of the model:
1. Simplify our features; right now tf-idf seems to produce too many features.
2. Can we improve classification accuracy?
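
For the first idea, `TfidfVectorizer` has `min_df` and `max_features` arguments that cap the vocabulary. The sketch below uses a tiny made-up corpus in place of `train_texts`; the thresholds are arbitrary and would need tuning on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for `train_texts`.
toy_texts = [
    "a great movie with a great cast",
    "a terrible movie with a weak plot",
    "great plot and great acting",
    "weak acting and a terrible script",
]

full_vectorizer = TfidfVectorizer()
full_vectorizer.fit(toy_texts)

# min_df=2 drops terms that appear in fewer than 2 documents;
# max_features=5 keeps only the 5 most frequent remaining terms.
small_vectorizer = TfidfVectorizer(min_df=2, max_features=5)
small_vectorizer.fit(toy_texts)

print(len(full_vectorizer.vocabulary_), len(small_vectorizer.vocabulary_))
```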

## Evaluation