# IMDB sentiment classification 

Using the Large Movie Review Dataset (https://ai.stanford.edu/~amaas/data/sentiment/), we perform simple sentiment classification.

This notebook demonstrates a quick ML development process.

In [72]:
# data can be found in https://ai.stanford.edu/~amaas/data/sentiment/
import urllib.request
from time import time
import os
from os import listdir
from os.path import isfile, join
import nltk
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

filename = 'aclImdb_v1.tar.gz'

if not os.path.exists(filename):
    urllib.request.urlretrieve('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', filename)

# unzip if not exists
if not os.path.exists('aclImdb'):
    !tar -xvzf {filename}

## Evaluation metrics

In our case, the problem is a simple binary classification task (positive or negative review) and the dataset
is balanced, so for the MVP we propose to use accuracy as the main metric.
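
As a quick sanity check on that choice, the sketch below computes the class balance and the accuracy by hand and compares them against a majority-class baseline. On a balanced binary problem that baseline is 0.5, which is why accuracy is informative here. The helper names (`class_balance`, `accuracy`) and the toy labels are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of examples that belong to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (hypothetical): 4 positive and 4 negative reviews.
toy_true = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
toy_pred = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

balance = class_balance(toy_true)
# A classifier that always predicts the most common class scores this much.
majority_baseline = max(balance.values())

print(balance)
print(f"baseline={majority_baseline:.2f} accuracy={accuracy(toy_true, toy_pred):.2f}")
```

If the classes were heavily skewed, the baseline would approach 1.0 and accuracy alone would say little, so precision and recall would deserve more weight.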

## Train/valid split

In [8]:
full_train_dir = 'aclImdb/train'
full_train_pos_dir = f'{full_train_dir}/pos'
full_train_neg_dir = f'{full_train_dir}/neg'
full_train_pos_filenames = [f for f in listdir(full_train_pos_dir) if isfile(join(full_train_pos_dir, f))]
full_train_neg_filenames = [f for f in listdir(full_train_neg_dir) if isfile(join(full_train_neg_dir, f))]

print(f'total number of pos files available: {len(full_train_pos_filenames)}')

total number of pos files available: 12500


In [49]:
train_num_files = 10000

filenames = full_train_pos_filenames + full_train_neg_filenames
labels = [1.0 for _ in range(len(full_train_pos_filenames))] + [0.0 for _ in range(len(full_train_neg_filenames))]
train_filenames, valid_filenames, train_y, valid_y = train_test_split(filenames, labels, test_size=0.2)

print(f"""
train_filenames={len(train_filenames)}
valid_filenames={len(valid_filenames)}
""")


train_filenames=20000
valid_filenames=5000
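
The split above draws a different random partition on every run and does not guarantee equal class counts in each part. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for exactly this purpose. The dependency-free sketch below illustrates what a seeded, stratified split does; `stratified_split` and the toy file names are hypothetical and only meant to show the idea.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, test_size=0.2, seed=42):
    """Split (item, label) pairs so each class keeps the same share in both parts."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible

    # Group the items by class label.
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    train, valid = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_valid = round(len(group) * test_size)
        valid += [(item, label) for item in group[:n_valid]]
        train += [(item, label) for item in group[n_valid:]]

    # Shuffle once more so the classes are interleaved.
    rng.shuffle(train)
    rng.shuffle(valid)
    return train, valid


# Toy file names (hypothetical), 10 per class.
toy_names = [f"pos_{i}.txt" for i in range(10)] + [f"neg_{i}.txt" for i in range(10)]
toy_labels = [1.0] * 10 + [0.0] * 10

toy_train, toy_valid = stratified_split(toy_names, toy_labels, test_size=0.2, seed=42)
print(len(toy_train), len(toy_valid))
```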



## Feature extraction

In [46]:
def get_content(filenames, labels):
    texts = []
    for i, name in enumerate(filenames):
        subdir = 'pos' if labels[i] == 1.0 else 'neg'
        with open(f'{full_train_dir}/{subdir}/{name}', 'r', encoding='utf-8') as f:
            texts.append(f.read())
            
    return texts

In [50]:
train_texts = get_content(train_filenames, train_y)
valid_texts = get_content(valid_filenames, valid_y)

In [51]:
print(f'train_texts={len(train_texts)}; train_y={len(train_y)}')

train_texts=20000; train_y=20000


In [52]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)

In [53]:
valid_X = vectorizer.transform(valid_texts)
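
To make the weights concrete, here is a small pure-Python sketch of the computation. It follows the formula scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. Tokenization is a plain whitespace split, which is a simplification (the default `TfidfVectorizer` token pattern keeps only tokens of two or more word characters). The function name and toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        weights = {term: count * idf[term] for term, count in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


toy_docs = ["great movie great cast", "boring movie", "great fun"]
for row in tfidf(toy_docs):
    print({term: round(w, 3) for term, w in row.items()})
```

Terms that occur in fewer documents get a larger idf, so in the second toy document "boring" outweighs the more common "movie".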

## Training

In [66]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])

In [79]:
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__max_features': (None, 50000),
    'vect__ngram_range': ((1, 2),),  # unigrams and bigrams
    'tfidf__norm': ('l2',),
    'clf__max_iter': (20,),
    'clf__penalty': ('l2',),
}

In [80]:
# find the best parameters for both the feature extraction and the
# classifier
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)
t0 = time()
grid_search.fit(train_texts, train_y)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
train_preds = grid_search.predict(train_texts)
print('Training report:')
print(classification_report(train_y, train_preds))


valid_preds = grid_search.predict(valid_texts)
print('Validation report:')
print(classification_report(valid_y, valid_preds))


Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'vect__max_df': (0.5, 0.75, 1.0), 'vect__max_features': (None, 5000, 10000, 50000), 'vect__ngram_range': ((1, 1), (1, 2)), 'tfidf__norm': ('l2',), 'clf__max_iter': (50, 100), 'clf__penalty': ('l2',)}
Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 23.2min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 28.1min finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


done in 1704.964s

Best score: 0.892
Best parameters set:
	clf__max_iter: 50
	clf__penalty: 'l2'
	tfidf__norm: 'l2'
	vect__max_df: 0.5
	vect__max_features: 50000
	vect__ngram_range: (1, 2)
Training report:
              precision    recall  f1-score   support

         0.0       0.95      0.94      0.95      9970
         1.0       0.94      0.96      0.95     10030

    accuracy                           0.95     20000
   macro avg       0.95      0.95      0.95     20000
weighted avg       0.95      0.95      0.95     20000

Validation report:
              precision    recall  f1-score   support

         0.0       0.90      0.89      0.90      2530
         1.0       0.89      0.90      0.90      2470

    accuracy                           0.90      5000
   macro avg       0.90      0.90      0.90      5000
weighted avg       0.90      0.90      0.90      5000



The model is clearly overfitting the training set: accuracy is 0.95 on the training data but 0.90 on the validation data. A couple of things to note:
1. Even with simple logistic regression, we are already overfitting. A more sophisticated model (an MLP or tree-based models) is likely to overfit the current dataset even further.
2. Having more data is likely to help the model generalize better.
3. We should try better regularization (only L2 has been used above).
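
For point 3, one option is to include the regularization strength in the search. In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so smaller values regularize more strongly. The grid below is a hypothetical extension of `parameters`; the `clf__C` values and the raised `clf__max_iter` are assumptions chosen for illustration, not tuned results.

```python
# Hypothetical extension of the grid used above; all new values are illustrative.
regularized_parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__max_features': (None, 50000),
    'vect__ngram_range': ((1, 2),),
    'tfidf__norm': ('l2',),
    'clf__max_iter': (200,),          # assumption: raised to avoid the convergence warning
    'clf__penalty': ('l2',),
    'clf__C': (0.1, 0.3, 1.0, 3.0),   # smaller C means stronger regularization
}

# Number of parameter combinations a grid search would evaluate.
n_candidates = 1
for values in regularized_parameters.values():
    n_candidates *= len(values)
print(n_candidates)
```

Passing `regularized_parameters` to `GridSearchCV` in place of `parameters` would multiply the number of fits by the number of `C` values, so the grid is worth keeping small.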

## Next Steps

Some ideas for improving the performance of the model:
1. Simplify the features; right now the tf-idf representation seems to have too many of them.
2. Can we improve classification accuracy?
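
One way to approach the first idea is to prune rare terms. `CountVectorizer` exposes `min_df` and `max_features` for this. The pure-Python sketch below shows the effect on a toy corpus; `prune_vocabulary` is a hypothetical helper, and unlike scikit-learn (which ranks `max_features` by total term frequency) it ranks by document frequency to keep the example short.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=2, max_features=None):
    """Keep terms that occur in at least min_df documents, most common first."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))

    kept = [(term, count) for term, count in df.items() if count >= min_df]
    # Sort by document frequency (descending), then alphabetically for stable output.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    if max_features is not None:
        kept = kept[:max_features]
    return [term for term, _ in kept]


toy_corpus = [
    "great movie great cast",
    "boring movie",
    "great fun",
    "boring plot and boring cast",
]
full_vocab = prune_vocabulary(toy_corpus, min_df=1)
small_vocab = prune_vocabulary(toy_corpus, min_df=2)
print(len(full_vocab), len(small_vocab), small_vocab)
```

On real reviews, most of the vocabulary consists of terms seen in only a handful of documents, so even a small `min_df` can shrink the feature space considerably.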

## Evaluation