# Film Junky Union Reviews - with BERT and SVM

The Film Junky Union, a new edgy community for classic movie enthusiasts, is developing a system for filtering and categorizing movie reviews. The goal is to train a model to automatically detect negative reviews. You'll be using a dataset of IMDB movie reviews with polarity labelling to build a model for classifying positive and negative reviews. The model needs to reach an F1 score of at least 0.85.

## Initialization

In [None]:
import math

import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

from tqdm.auto import tqdm

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'png'
# the next line provides graphs of better quality on HiDPI screens
%config InlineBackend.figure_format = 'retina'

plt.style.use('seaborn')

In [None]:
# this is to use progress_apply, read more at https://pypi.org/project/tqdm/#pandas-integration
tqdm.pandas()

## Load Data

In [None]:
try:
    df_reviews = pd.read_csv('/datasets/imdb_reviews.tsv', sep='\t', dtype={'votes': 'Int64'})
except FileNotFoundError:
    df_reviews = pd.read_csv('imdb_reviews.tsv', sep='\t', dtype={'votes': 'Int64'})

In [None]:
df_reviews.info()

There are two missing values in a couple columns, but the review column is intact, so I will leave those be.

In [None]:
df_reviews.head()

The end_year column doesn't seem to add any value to the first few observations. Same goes for the title columns, but we will take a look at the distributions for the columns next. Regardless, the focus in this dataset is on reviews and pos. I want to see if there are any duplicate reviews, and whether or not there is a logical reason for them.

In [None]:
dup = df_reviews.review.duplicated()
# index with the boolean mask directly so the lookup does not depend on the index labels
print(df_reviews.loc[dup, 'pos'].value_counts())

There seem to be about 100 duplicate reviews. They do not follow a trend in terms of whether they are all positive or all negative, so I will just remove these observations.
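
Before dropping duplicates, it can be worth checking whether any duplicated text carries conflicting labels. The following is a minimal sketch on a made-up toy frame (the rows are invented; only the `review` and `pos` column names come from this notebook):

```python
import pandas as pd

# Toy stand-in for df_reviews; the rows are invented for illustration.
toy = pd.DataFrame({
    'review': ['great film', 'great film', 'dull plot', 'dull plot', 'fine'],
    'pos':    [1, 1, 0, 1, 1],
})

# keep=False marks every member of a duplicated group, not only the repeats
dups = toy[toy['review'].duplicated(keep=False)]

# number of distinct labels per duplicated text; more than 1 means the copies disagree
labels_per_text = dups.groupby('review')['pos'].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()
print(conflicting)
```

If the list is empty on the real data, the duplicates are plain repeats and dropping them loses nothing.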

In [None]:
df_reviews.drop_duplicates(subset='review', inplace=True)

In [None]:
df_reviews.pos.value_counts(normalize=True)

The two values are limited to 0 and 1 for pos which is correct. There are slightly fewer positive reviews than negative reviews, but the difference is negligible. Class balance looks good overall.

## EDA

Let's check the number of movies and reviews over years.

In [None]:
fig, axs = plt.subplots(2, 1, figsize=(16, 8))

ax = axs[0]

dft1 = df_reviews[['tconst', 'start_year']].drop_duplicates() \
    ['start_year'].value_counts().sort_index()
dft1 = dft1.reindex(index=np.arange(dft1.index.min(), max(dft1.index.max(), 2021))).fillna(0)
dft1.plot(kind='bar', ax=ax)
ax.set_title('Number of Movies Over Years')

ax = axs[1]

dft2 = df_reviews.groupby(['start_year', 'pos'])['pos'].count().unstack()
dft2 = dft2.reindex(index=np.arange(dft2.index.min(), max(dft2.index.max(), 2021))).fillna(0)

dft2.plot(kind='bar', stacked=True, label='#reviews (neg, pos)', ax=ax)

dft2 = df_reviews['start_year'].value_counts().sort_index()
dft2 = dft2.reindex(index=np.arange(dft2.index.min(), max(dft2.index.max(), 2021))).fillna(0)
dft3 = (dft2/dft1).fillna(0)
axt = ax.twinx()
dft3.reset_index(drop=True).rolling(5).mean().plot(color='orange', label='reviews per movie (avg over 5 years)', ax=axt)

lines, labels = axt.get_legend_handles_labels()
ax.legend(lines, labels, loc='upper left')

ax.set_title('Number of Reviews Over Years')

fig.tight_layout()

Most movies look to be from the 1990's/2000's, but there are movies from as early as the 1890's. There are far fewer movies per year up until around 1970, yet these scarcer movies generated almost as many reviews in total as the more numerous recent movies did, so these few movies must be classics. The 1960's see an unexpected drop in movies and movie reviews, but that shouldn't affect our sentiment analysis. It would be interesting to determine whether reviews for movies from before 1990 or so were written in those time periods or in modern times, because the language used may differ and affect the performance of our models. The classes look roughly balanced even within each individual year.

Let's check the distribution of the number of reviews per movie with exact counts and with a KDE (just to see how it may differ from the exact counts).

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(16, 5))

ax = axs[0]
dft = df_reviews.groupby('tconst')['review'].count() \
    .value_counts() \
    .sort_index()
dft.plot.bar(ax=ax)
ax.set_title('Bar Plot of #Reviews Per Movie')

ax = axs[1]
dft = df_reviews.groupby('tconst')['review'].count()
sns.kdeplot(dft, ax=ax)
ax.set_title('KDE Plot of #Reviews Per Movie')

fig.tight_layout()

Most movies only have one or a few reviews apiece. About 400 movies have 30 or more reviews.
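
A count like "movies with 30 or more reviews" can be read off the per-movie counts directly. This is a small sketch on an invented toy frame; `min_reviews` is a hypothetical name and is set to 2 here only so the toy data produces a non-trivial answer:

```python
import pandas as pd

# Toy stand-in for df_reviews with the same 'tconst' and 'review' column names.
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt1', 'tt2', 'tt3', 'tt3'],
    'review': list('abcdef'),
})

reviews_per_movie = toy.groupby('tconst')['review'].count()

min_reviews = 2  # the statement above uses 30 on the real data
n_movies = int((reviews_per_movie >= min_reviews).sum())
print(n_movies)
```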

In [None]:
df_reviews['pos'].value_counts()

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 4))

ax = axs[0]
dft = df_reviews.query('ds_part == "train"')['rating'].value_counts().sort_index()
dft = dft.reindex(index=np.arange(min(dft.index.min(), 1), max(dft.index.max(), 11))).fillna(0)
dft.plot.bar(ax=ax)
ax.set_ylim([0, 5000])
ax.set_title('The train set: distribution of ratings')

ax = axs[1]
dft = df_reviews.query('ds_part == "test"')['rating'].value_counts().sort_index()
dft = dft.reindex(index=np.arange(min(dft.index.min(), 1), max(dft.index.max(), 11))).fillna(0)
dft.plot.bar(ax=ax)
ax.set_ylim([0, 5000])
ax.set_title('The test set: distribution of ratings')

fig.tight_layout()

The classes are balanced very well. I'm glad that reviews that gave a movie a rating of 5 or 6 have been omitted, as these are not obviously positive or negative. 

Distribution of negative and positive reviews over the years for two parts of the dataset

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(16, 8), gridspec_kw=dict(width_ratios=(2, 1), height_ratios=(1, 1)))

ax = axs[0][0]

dft = df_reviews.query('ds_part == "train"').groupby(['start_year', 'pos'])['pos'].count().unstack()
dft.index = dft.index.astype('int')
dft = dft.reindex(index=np.arange(dft.index.min(), max(dft.index.max(), 2020))).fillna(0)
dft.plot(kind='bar', stacked=True, ax=ax)
ax.set_title('The train set: number of reviews of different polarities per year')

ax = axs[0][1]

dft = df_reviews.query('ds_part == "train"').groupby(['tconst', 'pos'])['pos'].count().unstack()
sns.kdeplot(dft[0], color='blue', label='negative', kernel='epa', ax=ax)
sns.kdeplot(dft[1], color='green', label='positive', kernel='epa', ax=ax)
ax.legend()
ax.set_title('The train set: distribution of different polarities per movie')

ax = axs[1][0]

dft = df_reviews.query('ds_part == "test"').groupby(['start_year', 'pos'])['pos'].count().unstack()
dft.index = dft.index.astype('int')
dft = dft.reindex(index=np.arange(dft.index.min(), max(dft.index.max(), 2020))).fillna(0)
dft.plot(kind='bar', stacked=True, ax=ax)
ax.set_title('The test set: number of reviews of different polarities per year')

ax = axs[1][1]

dft = df_reviews.query('ds_part == "test"').groupby(['tconst', 'pos'])['pos'].count().unstack()
sns.kdeplot(dft[0], color='blue', label='negative', kernel='epa', ax=ax)
sns.kdeplot(dft[1], color='green', label='positive', kernel='epa', ax=ax)
ax.legend()
ax.set_title('The test set: distribution of different polarities per movie')

fig.tight_layout()

The classes look very balanced even when looking at individual years. The class distribution curves look about identical between the training and the test sets.

## Evaluation Procedure

Composing an evaluation routine which can be used for all models in this project

In [None]:
import sklearn.metrics as metrics

def evaluate_model(model, train_features, train_target, test_features, test_target):
    
    eval_stats = {}
    
    fig, axs = plt.subplots(1, 3, figsize=(20, 6)) 
    
    for type, features, target in (('train', train_features, train_target), ('test', test_features, test_target)):
        
        eval_stats[type] = {}
    
        pred_target = model.predict(features)
        pred_proba = model.predict_proba(features)[:, 1]
        
        # F1
        f1_thresholds = np.arange(0, 1.01, 0.05)
        f1_scores = [metrics.f1_score(target, pred_proba>=threshold) for threshold in f1_thresholds]
        
        # ROC
        fpr, tpr, roc_thresholds = metrics.roc_curve(target, pred_proba)
        roc_auc = metrics.roc_auc_score(target, pred_proba)    
        eval_stats[type]['ROC AUC'] = roc_auc

        # PRC
        precision, recall, pr_thresholds = metrics.precision_recall_curve(target, pred_proba)
        aps = metrics.average_precision_score(target, pred_proba)
        eval_stats[type]['APS'] = aps
        
        if type == 'train':
            color = 'blue'
        else:
            color = 'green'

        # F1 Score
        ax = axs[0]
        max_f1_score_idx = np.argmax(f1_scores)
        ax.plot(f1_thresholds, f1_scores, color=color, label=f'{type}, max={f1_scores[max_f1_score_idx]:.2f} @ {f1_thresholds[max_f1_score_idx]:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(f1_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'
            ax.plot(f1_thresholds[closest_value_idx], f1_scores[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('threshold')
        ax.set_ylabel('F1')
        ax.legend(loc='lower center')
        ax.set_title(f'F1 Score') 

        # ROC
        ax = axs[1]    
        ax.plot(fpr, tpr, color=color, label=f'{type}, ROC AUC={roc_auc:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(roc_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'            
            ax.plot(fpr[closest_value_idx], tpr[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.plot([0, 1], [0, 1], color='grey', linestyle='--')
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('FPR')
        ax.set_ylabel('TPR')
        ax.legend(loc='lower center')        
        ax.set_title(f'ROC Curve')
        
        # PRC
        ax = axs[2]
        ax.plot(recall, precision, color=color, label=f'{type}, AP={aps:.2f}')
        # setting crosses for some thresholds
        for threshold in (0.2, 0.4, 0.5, 0.6, 0.8):
            closest_value_idx = np.argmin(np.abs(pr_thresholds-threshold))
            marker_color = 'orange' if threshold != 0.5 else 'red'
            ax.plot(recall[closest_value_idx], precision[closest_value_idx], color=marker_color, marker='X', markersize=7)
        ax.set_xlim([-0.02, 1.02])    
        ax.set_ylim([-0.02, 1.02])
        ax.set_xlabel('recall')
        ax.set_ylabel('precision')
        ax.legend(loc='lower center')
        ax.set_title(f'PRC')        

        eval_stats[type]['Accuracy'] = metrics.accuracy_score(target, pred_target)
        eval_stats[type]['F1'] = metrics.f1_score(target, pred_target)
    
    df_eval_stats = pd.DataFrame(eval_stats)
    df_eval_stats = df_eval_stats.round(2)
    df_eval_stats = df_eval_stats.reindex(index=('Accuracy', 'F1', 'APS', 'ROC AUC'))
    
    print(df_eval_stats)
    
    return
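
The F1 panel in the routine above sweeps a decision threshold over the predicted probabilities. A minimal sketch of that sweep on made-up labels and probabilities (the arrays are invented, and the cast to `int` is an extra precaution that the routine itself does not use):

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented ground truth and predicted probabilities for six samples.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_proba = np.array([0.12, 0.47, 0.42, 0.83, 0.91, 0.33])

thresholds = np.arange(0, 1.01, 0.05)
# a sample is predicted positive when its probability reaches the threshold
scores = [
    f1_score(y_true, (y_proba >= t).astype(int), zero_division=0)
    for t in thresholds
]

best = int(np.argmax(scores))  # first threshold that reaches the maximum F1
print(f'best F1 {scores[best]:.2f} at threshold {thresholds[best]:.2f}')
```

Note that `model.predict` corresponds to one fixed point on this curve, so the F1 reported in the table can be lower than the maximum shown in the plot legend.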

## Normalization

We assume all models below accept text in lowercase and without any digits, punctuation marks, etc. To accomplish this, we will make everything lowercase and remove anything that isn't a letter. I will also employ stopword removal and lemmatization to clean up the text.

In [None]:
%%time

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import spacy
import re

nltk.download('punkt')
nltk.download('stopwords')  # required by stopwords.words('english') below
stop_words = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

corpus = df_reviews.review

def clear_text(text):
    
    pattern = r"[^a-zA-Z ]"
    text = re.sub(pattern, ' ', text)
    text = text.split()
    text = ' '.join(text)
    return text

def normalize(text):

    cleared_text = clear_text(text)    
    lowered = cleared_text.lower()
    tokenized = word_tokenize(lowered)
    text_no_stops = [word for word in tokenized if word not in stop_words]
    text_joined = nlp(' '.join(text_no_stops))
    lemmas = [token.lemma_ for token in text_joined]    
    return(' '.join(lemmas))

df_reviews['review_norm'] = df_reviews.review.apply(normalize)

## Train/Test Split & More Feature Engineering

Luckily, the whole dataset is already divided into train and test parts. The corresponding flag is 'ds_part'.

In [None]:
df_reviews_train = df_reviews.query('ds_part == "train"').copy()
df_reviews_test = df_reviews.query('ds_part == "test"').copy()

target_train = df_reviews_train['pos']
target_test = df_reviews_test['pos']

print(df_reviews_train.shape)
print(df_reviews_test.shape)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

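count_tf_idf = TfidfVectorizer()
# fit on the normalized training text so the features match the text the models are later scored on
corpus = df_reviews_train.review_norm
count_tf_idf.fit(corpus)
features_train = count_tf_idf.transform(corpus)
features_test = count_tf_idf.transform(df_reviews_test.review_norm)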

In [None]:
features_train.shape
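
The fit-on-train, transform-on-test pattern used above can be sketched on a toy corpus. The texts below are invented; the point is that the test matrix reuses the training vocabulary, so both matrices have the same number of columns and unseen words are simply ignored:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini corpora standing in for the train and test reviews.
train_texts = ['good movie', 'bad movie', 'good acting']
test_texts = ['good plot', 'terrible movie']

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
X_test = vectorizer.transform(test_texts)        # reuses them; unseen words are dropped

print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_test.shape)
```

Fitting the vectorizer on the test texts as well would leak test vocabulary and IDF statistics into training, which is why only `transform` is called on them.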

## Working with models

### Model 0 - Dummy

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
model_0 = DummyClassifier(strategy='stratified', random_state=0)
model_0.fit(features_train, target_train)

In [None]:
evaluate_model(model_0, features_train, target_train, features_test, target_test)

Using a stratified dummy model yielded scores of 0.5 across the board.
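
A score near 0.5 is what random guessing gives on balanced classes: a stratified dummy predicts the positive class about half the time regardless of the input, so both precision and recall land near 0.5. A small sketch with synthetic labels (the data is generated here, not taken from the reviews):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=20000)  # roughly balanced synthetic labels
X = np.zeros((len(y), 1))           # the dummy model ignores the features

dummy = DummyClassifier(strategy='stratified', random_state=0).fit(X, y)
f1 = f1_score(y, dummy.predict(X))
print(round(f1, 2))
```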

### Model 1 - Logistic Regression

Let's use logistic regression as our first "real" model.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model_1 = LogisticRegression(solver='liblinear', random_state=0)
model_1.fit(features_train, target_train)

In [None]:
evaluate_model(model_1, features_train, target_train, features_test, target_test)

The test F1 score for this logistic regression model is 0.88, which already exceeds the 0.85 requirement.

### Model 2 - CatBoost

In [None]:
import catboost as cb

In [None]:
model_2 = cb.CatBoostClassifier(verbose=100, random_state=0, n_estimators=1000, learning_rate=0.12, max_depth=5, early_stopping_rounds=20)
%time model_2.fit(features_train, target_train, plot=True, eval_set=(features_test, target_test))

In [None]:
evaluate_model(model_2, features_train, target_train, features_test, target_test)

This CatBoost model performs about as well on the test set as the logistic regression model, and takes much longer to train. The model captures the training data excellently.

### Model 3 - Support Vector Machine

I would also like to use an SVM model, which seems to be a popular choice for text classification tasks. I would have liked to focus on two of its hyperparameters: inverse regularization strength and kernel type. However, the model takes a very long time to train and evaluate, so I will simply use its default settings. I already used logistic regression without tweaking hyperparameters, so I am interested in comparing the performance of these two untuned models.

In [None]:
%%time

from sklearn.svm import SVC

model_3 = SVC(random_state=0, probability=True)
model_3.fit(features_train, target_train)

evaluate_model(model_3, features_train, target_train, features_test, target_test)

I expected a better test F1 score from the SVM model than from the logistic regression model, but changing the hyperparameters might have improved it. The training results are impeccable. This model took a very long time to train.
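
If training time is the main obstacle, one possible alternative (not what this notebook ran) is a linear SVM wrapped in a probability calibrator, since `LinearSVC` has no `predict_proba` of its own. The sketch below uses synthetic data from `make_classification`; it is a different model from the default RBF-kernel `SVC`, so its scores would not be directly comparable:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# LinearSVC only exposes decision_function, so the calibrator adds predict_proba
linear_svm = CalibratedClassifierCV(
    LinearSVC(C=1.0, max_iter=5000, random_state=0), cv=3
)
linear_svm.fit(X, y)

proba = linear_svm.predict_proba(X)[:, 1]  # probability of the positive class
print(proba[:3])
```

A model built this way exposes both `predict` and `predict_proba`, which is what `evaluate_model` calls.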

### Model 9 - BERT

In [None]:
import torch
import transformers

In [None]:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
config = transformers.BertConfig.from_pretrained('bert-base-uncased')
model = transformers.BertModel.from_pretrained('bert-base-uncased')

In [None]:
def BERT_text_to_embeddings(texts, max_length=512, batch_size=100, force_device=None, disable_progress_bar=False):
    
    ids_list = []
    attention_mask_list = []

    # text to padded ids of tokens along with their attention masks
    
    for input_text in texts:
        
        ids = tokenizer.encode(
            input_text.lower(),
            add_special_tokens=True,
            truncation=True,
            max_length=max_length,
        )
        
        padded = np.array(ids + [0] * (max_length - len(ids)))
        attention_mask = np.where(padded != 0, 1, 0)
        ids_list.append(padded)
        attention_mask_list.append(attention_mask) 
    
    if force_device is not None:
        device = torch.device(force_device)
    else:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        
    model.to(device)
    if not disable_progress_bar:
        print(f'Using the {device} device.')
    
    # getting embeddings in batches

    embeddings = []

    for i in tqdm(range(math.ceil(len(ids_list)/batch_size)), disable=disable_progress_bar):
            
        ids_batch = torch.LongTensor(ids_list[batch_size*i:batch_size*(i+1)]).to(device)
        
        # the mask must live on the same device as the ids and the model
        attention_mask_batch = torch.LongTensor(attention_mask_list[batch_size*i:batch_size*(i+1)]).to(device)
            
        with torch.no_grad():            
            model.eval()
            batch_embeddings = model(input_ids=ids_batch, attention_mask=attention_mask_batch)   
        embeddings.append(batch_embeddings[0][:,0,:].detach().cpu().numpy())
        
    return np.concatenate(embeddings)
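
The padding and attention-mask step inside the function above can be illustrated without loading BERT. The token ids below are made up for the example (real ids would come from the tokenizer), and `max_length` is shortened to 8 so the arrays are easy to read:

```python
import numpy as np

max_length = 8
# Made-up token ids for two short texts; 0 is used as the padding id.
ids_batch = [[101, 2307, 3185, 102], [101, 102]]

# right-pad every sequence with zeros up to max_length
padded = np.array([ids + [0] * (max_length - len(ids)) for ids in ids_batch])

# 1 for real tokens, 0 for padding, so the model ignores the padded positions
attention_mask = np.where(padded != 0, 1, 0)

print(padded)
print(attention_mask)
```

Deriving the mask from `padded != 0` relies on the padding id being 0 and on no real token sharing that id.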

In [None]:
# Attention! Running BERT on thousands of texts may take a long time on CPU, at least several hours
train_features_9 = BERT_text_to_embeddings(df_reviews_train['review'][:3000], 
#                                           force_device='cuda'
                                          )
test_features_9 = BERT_text_to_embeddings(df_reviews_test['review'][:1000],
                                          # force_device='cuda'
                                         )

In [None]:
target_train_9 = df_reviews_train.pos[:3000]
target_test_9 = df_reviews_test.pos[:1000]

print(df_reviews_train['review_norm'].shape)
print(train_features_9.shape)
print(target_train_9.shape)

In [None]:
%%time
model_9 = LogisticRegression(max_iter=1000, random_state=0)
model_9.fit(train_features_9, target_train_9)

In [None]:
# once you have the embeddings, it's advisable to save them so they can be reloaded later
np.savez_compressed('features_9.npz', train_features_9=train_features_9, test_features_9=test_features_9)

# and load...
# with np.load('features_9.npz') as data:
#     train_features_9 = data['train_features_9']
#     test_features_9 = data['test_features_9']
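
The save-and-reload round trip can be checked on small random arrays before trusting it with hours of embedding work. This is a sketch only: the arrays are random stand-ins for the embeddings, and the file name `features_demo.npz` is hypothetical:

```python
import os
import tempfile

import numpy as np

# Random stand-ins for the train and test embedding matrices.
train_emb = np.random.default_rng(0).normal(size=(4, 768)).astype(np.float32)
test_emb = np.random.default_rng(1).normal(size=(2, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'features_demo.npz')  # hypothetical file name
    np.savez_compressed(path, train=train_emb, test=test_emb)

    # np.load on an .npz returns a lazy archive; read the arrays before it closes
    with np.load(path) as data:
        train_loaded = data['train']
        test_loaded = data['test']

print(np.array_equal(train_emb, train_loaded), np.array_equal(test_emb, test_loaded))
```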

In [None]:
from sklearn.metrics import f1_score

pred_9_train = model_9.predict(train_features_9)
score_9_train = f1_score(target_train_9, pred_9_train)
print('Training F1 score:', score_9_train)

pred_9_test = model_9.predict(test_features_9)
score_9_test = f1_score(target_test_9, pred_9_test)
print('Test F1 score:', score_9_test)

I first computed the BERT embeddings from the fully normalized/lemmatized reviews, but with logistic regression I only got an F1 score as high as 0.77. Next I tried computing the embeddings from the raw reviews, which gave me this test F1 score of 0.84. I got an F1 score of 0.88 using lemmatization and other normalization tricks with logistic regression earlier in the project, so that model/preprocessing combination is still my top pick.

## My Reviews

In [None]:
# feel free to completely remove these reviews and try your models on your own reviews, those below are just examples

my_reviews = pd.DataFrame([
    'I did not simply like it, not my kind of movie.',
    'Well, I was bored and felt asleep in the middle of the movie.',
    'I was really fascinated with the movie',    
    'Even the actors looked really old and disinterested, and they got paid to be in the movie. What a soulless cash grab.',
    'I didn\'t expect the reboot to be so good! Writers really cared about the source material',
    'The movie had its upsides and downsides, but I feel like overall it\'s a decent flick. I could see myself going to see it again.',
    'What a rotten attempt at a comedy. Not a single joke lands, everyone acts annoying and loud, even kids won\'t like this!',
    'Launching on Netflix was a brave move & I really appreciate being able to binge on episode after episode, of this exciting intelligent new drama.',
    'The movie was really good, but it could have been a little better.',
    'I thought the movie was lit.',
    'The movie was not bad, not good.',
    'Watching the movie felt like watching paint dry.',
    "I'm on the fence about this particular movie. The writing was decent, the acting was terrible. I would not watch it again.",
], columns=['review'])

my_reviews['review_norm'] = my_reviews.review.apply(normalize)

my_reviews

### Model 1

In [None]:
texts = my_reviews['review_norm']

my_reviews_pred_prob = model_1.predict_proba(count_tf_idf.transform(texts))[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

### Model 2

In [None]:
texts = my_reviews['review_norm']

my_reviews_pred_prob = model_2.predict_proba(count_tf_idf.transform(texts))[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

### Model 3

In [None]:
texts = my_reviews['review_norm']

my_reviews_pred_prob = model_3.predict_proba(count_tf_idf.transform(texts))[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

### Model 9

In [None]:
texts = my_reviews['review']

my_reviews_features_9 = BERT_text_to_embeddings(texts, disable_progress_bar=True)

my_reviews_pred_prob = model_9.predict_proba(my_reviews_features_9)[:, 1]

for i, review in enumerate(texts.str.slice(0, 100)):
    print(f'{my_reviews_pred_prob[i]:.2f}:  {review}')

Although all four models had F1 scores of 0.84-0.88, which is solid, they definitely gave wrong predictions for some reviews, especially the models that relied on lemmatized data and stopword removal. This style of normalization strips away the context around words, which tends to change their meaning. Personally, I would not be able to answer conclusively about the sentiment of some normalized sentences from this batch, so I cannot blame the models for also being confused. BERT/LR seemed to have the best results, and I believe context is a big reason. Compared to the first LR model, this LR model is more confident on the clear-cut cases and a little less confident on the ambiguous ones, but still gets almost all reviews correct. I am surprised that BERT did not have a better test F1 score, but perhaps training on the full set or using a different model would improve the results. Training models, tuning hyperparameters, and running preprocessing/embedding take so long that I am happy enough with my results.
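
The loss of context is easy to see on negation. The sketch below is a simplified stand-in for the notebook's `normalize` function: it uses a tiny hand-picked stop-word subset (the name `stop_words_demo` is hypothetical, and NLTK's English list is much longer but also contains these words) and skips lemmatization:

```python
import re

# Tiny hand-picked stop-word subset for illustration only.
stop_words_demo = {'the', 'was', 'not', 'i', 'it', 'did'}

def normalize_demo(text):
    """Keep letters only, lowercase, and drop stop words (no lemmatization)."""
    letters_only = re.sub(r'[^a-zA-Z ]', ' ', text).lower()
    return ' '.join(w for w in letters_only.split() if w not in stop_words_demo)

for review in ['The movie was not bad, not good.', 'I did not like it.']:
    print(f'{review!r} -> {normalize_demo(review)!r}')
```

Once "not" is removed, a negative sentence can end up consisting only of positive words, which a bag-of-words model has no way to recover from.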

## Conclusions

We started off with a pretty clean dataset, and we found that the classes were already balanced almost perfectly. I normalized the raw reviews by removing non-letter characters, lowercasing everything, removing stop-words, and lemmatizing the text. From this normalized text, I created features by using TF-IDF, and trained logistic regression, CatBoost, and support vector machines with these features. The requirement for F1 score was 0.85, and all three of these models exceeded that. Logistic regression and SVM had the best F1 scores with 0.88 (though CatBoost was close with 0.87), but logistic regression trained much faster than SVM. It was difficult to tune hyperparameters for CatBoost and SVM due to the amount of time required for training. SVM took over an hour with default settings, so I didn't adjust anything with that model, though I suspect performance could have improved with some tuning. I also trained a logistic regression model with BERT embeddings, though I only used a training set of a few thousand reviews. The F1 score of 0.84 that the LR model yielded would likely have been improved by training on the entire dataset rather than a small fraction. Examining each model's decisions on a few short, specific reviews revealed the strength of using BERT embeddings. 

Logistic regression with lemmatized/TF-IDF features is my choice for top model based on its test F1 score of 0.88 and its training speed.

# Checklist

- [x]  Notebook was opened
- [ ]  The text data is loaded and pre-processed for vectorization
- [ ]  The text data is transformed to vectors
- [ ]  Models are trained and tested
- [ ]  The metric's threshold is reached
- [ ]  All the code cells are arranged in the order of their execution
- [ ]  All the code cells can be executed without errors
- [ ]  There are conclusions