# IMDb Review Classification

In this project, we apply two different methods to classify IMDb reviews as positive (1) or negative (0).

The first approach uses a traditional **NLP** pipeline: text preprocessing, TF-IDF feature extraction, and a logistic regression classifier whose hyperparameters are tuned with a cross-validated grid search. It works well when the whole dataset fits in memory.

The second approach focuses on scalability by using **out-of-core learning**. Reviews are streamed from disk in mini-batches and the classifier is updated incrementally with **stochastic gradient descent**, so the full dataset never has to be loaded at once. This makes the method suitable for datasets that exceed available memory and for models that need to be updated as new data arrives.

Finally, we use Latent Dirichlet Allocation to explore the topics that appear in the reviews.

### Data cleaning and tokenization

In [8]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import GridSearchCV
import re
from nltk.stem.porter import PorterStemmer

In [164]:
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)
    emots = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emots).replace('-', ''))
    return text

preprocessor("</a>This~@ :) is :( a @test :-)!")

'this is a test :) :( :)'
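
To make the steps inside `preprocessor` easier to follow, the sketch below runs them one at a time on the same test string and keeps the intermediate values. The variable names (`no_html`, `emoticons`, `words`, `cleaned`) are illustrative and not part of the notebook.

```python
import re

raw = "</a>This~@ :) is :( a @test :-)!"

# Step 1: drop anything that looks like an HTML tag
no_html = re.sub(r'<[^>]*>', '', raw)

# Step 2: collect emoticons (eyes, optional nose, mouth) before step 3 destroys them
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# Step 3: lowercase and collapse every run of non-word characters into one space
words = re.sub(r'[\W]+', ' ', no_html.lower())

# Step 4: append the emoticons, removing the '-' nose so ':-)' and ':)' coincide
cleaned = words + ' '.join(emoticons).replace('-', '')

print(repr(no_html))
print(emoticons)
print(repr(words))
print(repr(cleaned))
```

One detail worth knowing: the emoticons are appended without a separator. The test string ends in punctuation, which step 3 turns into a trailing space, but a text that ends in a word character (for example `"hi :) there"`) would come out as `'hi there:)'`.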

In [165]:
# Simple tokenizer that splits words
def tokenizer(text):
    return text.split()

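# Create the stemmer once instead of on every call
porter = PorterStemmer()

# Tokenizer that splits on whitespace and reduces each word to its Porter stem
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]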

In [166]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')

df['review'] = df['review'].apply(preprocessor)

df.sample(5)

|       | review | sentiment |
|-------|--------|-----------|
| 40597 | miraculously this is actually quite watchable ... | 0 |
| 23956 | i do not think i am alone when i say that 2005... | 1 |
| 44138 | my take on this at our local festival where pe... | 1 |
| 42982 | a family mother patricia clarkson father jake ... | 1 |
| 34312 | i think that pierre léaud or his character to ... | 1 |


### Classification

In [173]:
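# .loc slices by label and includes both endpoints, so the training slice
# stops at 24999 to keep row 25000 out of the training set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values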

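# TF-IDF vectorizer. Lowercasing and cleaning were already done by preprocessor(),
# and the tokenizer is supplied by the grid search, hence token_pattern=None.
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        token_pattern=None)

# Logistic regression with the liblinear solver, which supports L1 and L2 penalties
# (the scikit-learn docs recommend it for small datasets; 'sag' and 'saga' are faster for large ones)
clf = LogisticRegression(solver='liblinear')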

model = Pipeline(steps=[
    ('vect', tfidf),
    ('clf', clf)
])
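
As a rough illustration of what the `vect` step computes, the sketch below works out TF-IDF weights by hand for a tiny, made-up corpus. It assumes scikit-learn's default settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector); the corpus and the helper `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Three tiny, already-cleaned "reviews" (hypothetical data)
docs = ["the movie was great", "the movie was awful", "great acting"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return {term: weight} using smoothed idf and L2 normalization."""
    tf = Counter(tokens)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_vector(tokenized[1])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}: {weight:.3f}")
```

In the second review, "awful" occurs in only one document and therefore receives a larger weight than "the", "movie", or "was", which each occur in two.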

In [168]:
small_param_grid = [{'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]}]

# Hyperparameter cross-validation
gd_model = GridSearchCV(model,
                        small_param_grid,
                        scoring='accuracy',
                        cv=5,
                        verbose=3,
                        n_jobs=-1)

gd_model.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
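
The "4 candidates, totalling 20 fits" in the log follows directly from the grid: every candidate is one combination of one value per parameter, and each candidate is fitted once per fold. A small sketch of that count, with the tokenizer functions replaced by placeholder strings:

```python
from itertools import product

# Same grid as small_param_grid, with tokenizer functions replaced by their names
grid = {
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': [None],
    'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
    'clf__penalty': ['l2'],
    'clf__C': [1.0, 10.0],
}

# Every candidate is one combination of one value per parameter
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_folds = 5
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate on the full training set after the search.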


In [178]:
print(f'Best parameter set: {gd_model.best_params_}')

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x76e750eef1a0>}


In [176]:
print(f'CV Accuracy: {gd_model.best_score_:.4f}\n')
clf = gd_model.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.4f}')

CV Accuracy: 0.8971

Test Accuracy: 0.8988


## A more efficient way: Online algorithms and out-of-core learning

In [179]:
import numpy as np
import nltk

#nltk.download('stopwords')

### Preprocessing and tokenization

In [180]:
# Load the English stop words as a set for fast membership tests
stop_words = set(nltk.corpus.stopwords.words('english'))

# Clean a raw review and return its tokens with stop words removed
def my_tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text) # Strip HTML tags
    emots = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) # Collect emoticons such as :) :( =D
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons with the '-' nose removed
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emots).replace('-', '')
    tokens = [word for word in text.split() if word not in stop_words]
    return tokens

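# Generator that yields one (review text, label) pair at a time
def stream_documents(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        next(f) # Skip the header row
        for line in f:
            # Each line ends with ',<label>\n': drop those 3 characters from the text
            # and read the single-digit label from the second-to-last character
            text, label = line[:-3], int(line[-2])
            yield text, label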
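
The slicing in `stream_documents` relies on every line ending in a comma, a single-digit label, and a newline. The sketch below shows what the slices return on a made-up line in that layout and, for comparison, how the standard `csv` module would parse the same line. The sample data is hypothetical, and the `csv` variant is an alternative, not what the notebook uses.

```python
import csv
import io

# A hypothetical two-line file in the same layout: header, then "review",label
sample = 'review,sentiment\n"Great film, loved it",1\n'

lines = sample.splitlines(keepends=True)
line = lines[1]

# Slicing as in stream_documents: drop ',1\n' from the text, read the digit before '\n'
text_sliced, label_sliced = line[:-3], int(line[-2])

# Alternative: let the csv module handle quoting and embedded commas
reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row
text_csv, label_csv = next(reader)
label_csv = int(label_csv)

print(repr(text_sliced), label_sliced)
print(repr(text_csv), label_csv)
```

The sliced text keeps the surrounding quote characters. That is harmless here because `my_tokenizer` later replaces all non-word characters with spaces.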

In [181]:
# Collect up to batch_size documents and their labels from the stream
def get_minibatch(document_stream, batch_size):
    documents, y = [], []
    try:
        for _ in range(batch_size):
            text, label = next(document_stream)
            documents.append(text)
            y.append(label)
    except StopIteration:
        # Stream exhausted: keep a partial batch, signal the end only if nothing was read
        if not documents:
            return None, None

    return documents, y

In [182]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

In [183]:
# Test the whole workflow
print(my_tokenizer(next(stream_documents('movie_data.csv'))[0]))

['1974', 'teenager', 'martha', 'moxley', 'maggie', 'grace', 'moves', 'high', 'class', 'area', 'belle', 'greenwich', 'connecticut', 'mischief', 'night', 'eve', 'halloween', 'murdered', 'backyard', 'house', 'murder', 'remained', 'unsolved', 'twenty', 'two', 'years', 'later', 'writer', 'mark', 'fuhrman', 'christopher', 'meloni', 'former', 'la', 'detective', 'fallen', 'disgrace', 'perjury', 'j', 'simpson', 'trial', 'moved', 'idaho', 'decides', 'investigate', 'case', 'partner', 'stephen', 'weeks', 'andrew', 'mitchell', 'purpose', 'writing', 'book', 'locals', 'squirm', 'welcome', 'support', 'retired', 'detective', 'steve', 'carroll', 'robert', 'forster', 'charge', 'investigation', '70', 'discover', 'criminal', 'net', 'power', 'money', 'cover', 'murder', 'murder', 'greenwich', 'good', 'tv', 'movie', 'true', 'story', 'murder', 'fifteen', 'years', 'old', 'girl', 'committed', 'wealthy', 'teenager', 'whose', 'mother', 'kennedy', 'powerful', 'rich', 'family', 'used', 'influence', 'cover', 'murder'

### Classification

In [184]:
hash_vect = HashingVectorizer(decode_error='ignore',
                              n_features=2**21,
                              preprocessor=None,
                              tokenizer=my_tokenizer
                              )

gd_clf = SGDClassifier(loss='log_loss',
                       random_state=13,
                       n_jobs=-1
                       )

doc_stream = stream_documents('movie_data.csv')
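
`HashingVectorizer` needs no fitted vocabulary because it maps each token straight to a column index with a hash function. The sketch below illustrates that idea only: it uses MD5 from the standard library as a stand-in hash, whereas scikit-learn uses a signed 32-bit MurmurHash3 and, by default, also alternates signs and L2-normalizes each row. The helpers `bucket` and `hash_counts` are hypothetical.

```python
import hashlib

N_FEATURES = 2**21  # same number of columns as the notebook's HashingVectorizer

def bucket(token, n_features=N_FEATURES):
    """Map a token to a column index with a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(token.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % n_features

def hash_counts(tokens, n_features=N_FEATURES):
    """Return a sparse {column index: count} representation of one document."""
    counts = {}
    for token in tokens:
        idx = bucket(token, n_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts

doc = ['murder', 'greenwich', 'good', 'tv', 'movie', 'murder']
vector = hash_counts(doc)
print(vector)
```

Because the mapping is fixed in advance, unseen words in later mini-batches need no special handling. The price is that different tokens can collide in the same column, which a large `n_features` makes unlikely.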

In [185]:
import pyprind
import sys

In [186]:
pbar = pyprind.ProgBar(45, title='Training', stream=sys.stderr)

for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, 1000)
    if not X_train:
        break
    X_train = hash_vect.transform(X_train)
    gd_clf.partial_fit(X_train, y_train, classes=np.array([0,1]))
    pbar.update()

Training
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:31
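
Each `partial_fit` call continues from the weights left by the previous call instead of starting over. A minimal sketch of that idea, assuming plain logistic regression with a constant learning rate and no regularization (scikit-learn's `SGDClassifier` adds an L2 penalty and an adaptive learning rate by default); the class `TinySGDLogReg` and the toy data are hypothetical:

```python
import math

class TinySGDLogReg:
    """Hypothetical, minimal logistic regression trained one sample at a time."""

    def __init__(self, n_features, eta=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.eta = eta

    def _proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One pass over the mini-batch, updating the weights after each sample
        for x, target in zip(X, y):
            error = target - self._proba(x)  # negative log-loss gradient w.r.t. z
            self.w = [wi + self.eta * error * xi for wi, xi in zip(self.w, x)]
            self.b += self.eta * error
        return self

    def predict(self, X):
        return [1 if self._proba(x) >= 0.5 else 0 for x in X]

# Toy stream: feature 0 signals a positive review, feature 1 a negative one
batches = [([[1.0, 0.0], [0.0, 1.0]], [1, 0])] * 20

toy_clf = TinySGDLogReg(n_features=2)
for X_batch, y_batch in batches:
    toy_clf.partial_fit(X_batch, y_batch)  # weights persist between batches

print(toy_clf.predict([[1.0, 0.0], [0.0, 1.0]]))
```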


In [187]:
from sklearn.metrics import accuracy_score

In [188]:
# Evaluate the model on the remaining 5,000 documents from the document stream

X_test, y_test = get_minibatch(doc_stream, 5000)
X_test = hash_vect.transform(X_test)

y_pred = gd_clf.predict(X_test)
print(f'Accuracy score: {accuracy_score(y_test, y_pred)}')

Accuracy score: 0.8686


Finally, we update the model with the 5,000 held-out documents so that it has been trained on the whole IMDb dataset:

In [189]:
gd_clf = gd_clf.partial_fit(X_test, y_test)

## Latent Dirichlet Allocation

In [6]:
from sklearn.decomposition import LatentDirichletAllocation

In [12]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')

vect = CountVectorizer(strip_accents=None,
                       stop_words='english',
                       max_features=3500, # keep only the 3,500 most frequent terms
                       max_df=0.1, # ignore terms that occur in more than 10% of the reviews
                       ngram_range=(1, 2), # include unigrams and bigrams
                       )

lda = LatentDirichletAllocation(n_components=7,
                                random_state=13,
                                learning_method='batch', # update model params on entire dataset
                                n_jobs=-1,
                                verbose=2
                                )

X = vect.fit_transform(df['review'])
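
The effect of `ngram_range` and `max_df` can be sketched on a handful of made-up documents. The code below is a simplified, standard-library illustration, not `CountVectorizer` itself; the corpus, the helper `ngrams`, and the threshold of 0.25 are assumptions chosen so that the filter visibly removes something.

```python
from collections import Counter

# Hypothetical mini-corpus, already lowercased
docs = ["low budget horror", "low budget comedy", "great horror movie", "great tv series"]

def ngrams(tokens, n_min=1, n_max=2):
    """All contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

doc_terms = [ngrams(doc.split()) for doc in docs]

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for terms in doc_terms for term in set(terms))

max_df = 0.25  # illustrative threshold; the notebook uses 0.1
vocabulary = sorted(t for t, c in doc_freq.items() if c / len(docs) <= max_df)

print(vocabulary)
```

Terms such as "horror" and "low budget" appear in half of the toy documents and are dropped, while rarer terms such as "tv series" survive.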

In [13]:
X_topics = lda.fit_transform(X)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:   11.2s remaining:   33.6s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:   11.6s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


iteration: 1 of max_iter: 10


[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    9.0s remaining:   27.1s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    9.4s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


iteration: 2 of max_iter: 10


[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    8.5s remaining:   25.4s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    8.6s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


iteration: 3 of max_iter: 10


[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    7.7s remaining:   23.1s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    7.8s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


iteration: 4 of max_iter: 10


[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    7.1s remaining:   21.4s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    7.2s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


iteration: 5 of max_iter: 10


[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    6.7s remaining:   20.2s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    6.8s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


iteration: 6 of max_iter: 10


[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    6.4s remaining:   19.2s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    6.6s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


iteration: 7 of max_iter: 10


[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    6.9s remaining:   20.6s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    7.2s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


iteration: 8 of max_iter: 10


[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    6.0s remaining:   18.0s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    6.1s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


iteration: 9 of max_iter: 10


[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    5.8s remaining:   17.3s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    5.9s finished


iteration: 10 of max_iter: 10


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    5.3s remaining:   15.8s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    5.4s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    5.2s remaining:   15.5s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    5.3s finished


In [14]:
lda.components_.shape

(7, 3500)
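
Each of the 7 rows of `lda.components_` belongs to one topic and each of the 3,500 columns to one vocabulary term. The values can be read as pseudo-counts, and dividing a row by its sum turns it into a probability distribution over terms. A toy sketch with a hypothetical 2-topic, 4-term matrix:

```python
# Hypothetical topic-word matrix: 2 topics over a 4-term vocabulary
toy_terms = ['awful', 'waste', 'comedy', 'fun']
components = [
    [8.0, 6.0, 1.0, 1.0],   # topic 1 puts most weight on negative words
    [0.5, 0.5, 5.0, 4.0],   # topic 2 puts most weight on comedy words
]

# Normalizing each row turns the pseudo-counts into a distribution over terms
topic_word = [[w / sum(row) for w in row] for row in components]

def top_terms(row, n):
    """Terms with the n largest weights, highest first."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [toy_terms[i] for i in order[:n]]

for k, row in enumerate(topic_word, start=1):
    print(f"Topic #{k}:", ", ".join(top_terms(row, 2)))
```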

In [23]:
# Get vocabulary
terms = vect.get_feature_names_out()

topic_word_matrix = lda.components_
n_top_terms = 7

for topic_i, topic in enumerate(topic_word_matrix):
    print(f"Topic #{topic_i + 1}:")
    top_terms_indices = topic.argsort()[-n_top_terms:][::-1] # indices of the highest-weighted terms, largest first
    top_terms = [terms[i] for i in top_terms_indices]
    print(", ".join(top_terms))
    print()

Topic #1:
worst, minutes, waste, awful, money, terrible, script

Topic #2:
budget, low, horror, camera, effects, production, low budget

Topic #3:
series, tv, episode, dvd, shows, book, version

Topic #4:
horror, killer, house, guy, effects, girl, night

Topic #5:
woman, wife, war, father, men, performance, role

Topic #6:
comedy, kids, fun, action, humor, jokes, guy

Topic #7:
music, role, performance, game, play, played, musical



#### Possible categories for each review topic:

1. **Topic #1:** Negative Reviews
2. **Topic #2:** Low-Budget Films
3. **Topic #3:** TV Series and Adaptations
4. **Topic #4:** Horror Films
5. **Topic #5:** Drama, Family and Relationships
6. **Topic #6:** Comedy
7. **Topic #7:** Musicals
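
One possible use of these labels is to tag each review with its dominant topic. The sketch below assumes a document-topic matrix like the `X_topics` returned by `lda.fit_transform`, where each row holds one review's topic weights; the two rows shown are made up and the helper `dominant_category` is hypothetical.

```python
# Category names suggested above, indexed by topic
categories = ['Negative Reviews', 'Low-Budget Films', 'TV Series and Adaptations',
              'Horror Films', 'Drama, Family and Relationships', 'Comedy', 'Musicals']

# Hypothetical rows of a document-topic matrix (each row sums to 1)
doc_topics = [
    [0.70, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.02, 0.10, 0.03, 0.60, 0.05, 0.15, 0.05],
]

def dominant_category(row):
    """Name of the topic with the highest weight in one document."""
    best = max(range(len(row)), key=lambda k: row[k])
    return categories[best]

labels = [dominant_category(row) for row in doc_topics]
print(labels)
```

On the real matrix, `X_topics.argmax(axis=1)` would give the same topic indices for all reviews at once.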