# IMDb Movie Review Classification
The aim of this project is to classify IMDb movie reviews as negative or positive. As a first approach, I use classical ML models. I am using the IMDb review dataset from [Maas et al. 2011](http://www.aclweb.org/anthology/P11-1015), which contains highly polar reviews and their labels (25,000 training reviews, 25,000 test reviews). The reviews are vectorized, and then several classical models are trained and their performance is compared.

## Data loading and EDA
The training and test datasets are loaded into pandas DataFrames. The first few rows of each dataset are printed, and the distribution of negative and positive reviews is plotted to confirm the data structure and class balance.

In [None]:
from imdb_classification.data import load_imdb_data
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
data_dir = r'imdb_classifier/data/' # Replace with the path to your local copy of the dataset
data_train = load_imdb_data(data_dir, subset = 'train')
data_test = load_imdb_data(data_dir, subset = 'test')

In [None]:
print(data_train.head())
print(data_test.head())

In [None]:
fig, axs = plt.subplots(1, 2, figsize = [10, 4], layout = 'tight')
axs[0].set_title('Distribution of negative (0) and positive (1) \nreviews in train data')
axs[1].set_title('Distribution of negative (0) and positive (1) \nreviews in test data')

_ = sns.countplot(data_train, x = 'label', color = plt.cm.viridis(0.), ax = axs[0])
_ = sns.countplot(data_test, x = 'label', color = plt.cm.viridis(0.), ax = axs[1])

## Vectorize the review data 
The text is vectorized using `TfidfVectorizer`, which applies the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme: a term's weight grows with its frequency in a review and shrinks with the number of reviews that contain it. `max_features` is set to 10,000 to limit dimensionality, keeping only the most frequent terms. `stop_words = 'english'` removes common English words. `ngram_range = (1, 2)` captures both one-word and two-word features.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 
vectorizer = TfidfVectorizer(max_features = 10000, stop_words = 'english', 
                             ngram_range = (1, 2)) 

X_train = vectorizer.fit_transform(data_train['review']) 
X_test = vectorizer.transform(data_test['review'])

y_train = data_train.label.values
y_test = data_test.label.values
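
To make the weighting concrete, here is a minimal pure-Python sketch of the computation. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and the L2 row normalization that `TfidfVectorizer` uses by default. The tokenizer (a plain whitespace split) and the four-word `STOP_WORDS` set are simplified stand-ins, not what scikit-learn actually uses, and the three example sentences are made up.

```python
import math
from collections import Counter

# Hypothetical, tiny stand-in for the built-in English stop-word list.
STOP_WORDS = {"the", "was", "a", "and"}

def ngrams(text, n_max=2):
    """Lowercase, drop stop words, then emit all 1..n_max word n-grams."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    term_counts = [Counter(ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for counts in term_counts:
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

docs = ["the movie was great", "the movie was awful", "great cast and great plot"]
weights = tfidf(docs)
for doc, row in zip(docs, weights):
    print(doc, "->", {t: round(w, 3) for t, w in row.items()})
```

In the first sentence, the bigram "movie great" appears in only one document, so it receives a larger weight than the unigram "movie", which appears in two.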

## Logistic Regression
First, a logistic regression model is trained on the vectorized data. This model is simple and does not capture more subtle phenomena (sarcasm, long-range dependencies, etc.), but it is typically effective for this type of problem. The parameter `C` is the inverse of the regularization strength, so smaller values regularize more strongly. I settled on `C = 1` through some trial and error. Overall, this model reaches 88% accuracy, which is a good baseline.

In [None]:
from sklearn.linear_model import LogisticRegression 

clf = LogisticRegression(max_iter = 1000, C = 1) 
clf.fit(X_train, y_train)
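
The effect of `C` can be illustrated with a small pure-Python sketch. It minimizes an L2-penalized objective of the form `0.5 * ||w||^2 + C * sum(log-loss)` by plain gradient descent, which is the general shape of the objective scikit-learn documents for logistic regression. This is not the solver scikit-learn uses; the function `fit_logreg`, the omission of an intercept, and the toy feature values are all assumptions made for illustration.

```python
import math

def fit_logreg(X, y, C, steps=5000):
    """Minimize 0.5 * ||w||^2 + C * sum(log-loss) by gradient descent (no intercept)."""
    # Step size 1/L, where L is an upper bound on the curvature of the objective.
    lr = 1.0 / (1.0 + 0.25 * C * sum(x * x for row in X for x in row))
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = list(w)  # gradient of the penalty term 0.5 * ||w||^2
        for row, label in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, row))))
            for j, xj in enumerate(row):
                grad[j] += C * (p - label) * xj  # gradient of the log-loss term
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Made-up feature values, e.g. TF-IDF weights for "great" and "awful".
X = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.9], [0.0, 0.7]]
y = [1, 1, 0, 0]

norms = {}
for C in (0.01, 1.0, 100.0):
    w = fit_logreg(X, y, C)
    norms[C] = math.sqrt(sum(wj * wj for wj in w))
    print(f"C = {C}: w = {[round(wj, 3) for wj in w]}, ||w|| = {norms[C]:.3f}")
```

A small `C` keeps the weights close to zero (strong regularization), while a large `C` lets them grow to fit the training data more closely.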

In [None]:
print("Train score:", clf.score(X_train, y_train))
print("Test score:", clf.score(X_test, y_test))
y_pred = clf.predict(X_test) 
print(classification_report(y_test, y_pred))

In [None]:
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize = [5, 4], layout = 'tight')
sns.heatmap(cm, annot=True, fmt='d', cmap='viridis', ax = ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
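
The numbers in the classification report follow directly from the confusion matrix. The sketch below shows the arithmetic for a binary problem, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes). The counts are invented for illustration and are not results from this notebook.

```python
# Rows are actual classes, columns are predicted classes.
# These counts are made up for illustration only.
cm = [[40, 10],   # actual negative: 40 true negatives, 10 false positives
      [5, 45]]    # actual positive: 5 false negatives, 45 true positives

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all reviews classified correctly
precision = tp / (tp + fp)                   # of predicted positives, how many are positive
recall = tp / (tp + fn)                      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```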

## Random Forest 
The random forest can capture nonlinear relationships and feature interactions that logistic regression cannot, but it can struggle with sparse, high-dimensional data like text and may require additional tuning to avoid overfitting. The model reaches an accuracy of 85%, which is still decent, though it may not be the best model for this dataset.

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators = 100, random_state = 4)
clf.fit(X_train, y_train)

In [None]:
print("Train score:", clf.score(X_train, y_train))
print("Test score:", clf.score(X_test, y_test))
y_pred = clf.predict(X_test) 
print(classification_report(y_test, y_pred))

In [None]:
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize = [5, 4], layout = 'tight')
sns.heatmap(cm, annot=True, fmt='d', cmap='viridis', ax = ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

## Naive Bayes
The Naive Bayes model makes the "naive" assumption that the features (here, words and two-word phrases) are conditionally independent given the class label. It is simple and efficient, though potentially at the cost of accuracy. This model achieves 85% accuracy, which is decent given its simplicity.

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train, y_train)
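
The independence assumption means that a review's score for each class is just the log prior plus a sum of per-word log-likelihoods. The following is a minimal sketch of a multinomial Naive Bayes classifier in pure Python, using raw word counts and add-one smoothing (`alpha = 1.0`, which is also the `MultinomialNB` default). The notebook itself fits the model on TF-IDF weights rather than raw counts, and the helper names and four toy reviews here are hypothetical.

```python
import math
from collections import Counter

# Made-up training reviews: (text, label) with 1 = positive, 0 = negative.
train_docs = [("great fun great", 1), ("boring plot", 0),
              ("great cast", 1), ("boring awful plot", 0)]

def train_nb(docs, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = sorted({w for text, _ in docs for w in text.split()})
    class_sizes = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in class_sizes}
    for text, label in docs:
        word_counts[label].update(text.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_lik = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(text, log_prior, log_lik):
    """Pick the class with the highest log posterior; unknown words are ignored."""
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log-likelihoods simply add up.
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in text.split()
                                       if w in log_lik[c])
    return max(scores, key=scores.get)

log_prior, log_lik = train_nb(train_docs)
print(predict_nb("great fun", log_prior, log_lik))
print(predict_nb("boring plot", log_prior, log_lik))
```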

In [None]:
print("Train score:", clf.score(X_train, y_train))
print("Test score:", clf.score(X_test, y_test))
y_pred = clf.predict(X_test) 
print(classification_report(y_test, y_pred))

In [None]:
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize = [5, 4], layout = 'tight')
sns.heatmap(cm, annot=True, fmt='d', cmap='viridis', ax = ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

## Support Vector Machine (SVM) 
SVM models tend to be accurate on text classification because they are well suited to high-dimensional sparse data, though they can be more computationally intensive than logistic regression. For this small dataset, the computation time is not an issue, and the model achieves 88% accuracy.

In [None]:
from sklearn.svm import LinearSVC

clf = LinearSVC(C = 0.1)
clf.fit(X_train, y_train)
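
One way to see how a linear SVM differs from logistic regression is to compare their per-example losses as a function of the margin `m = y * f(x)`, with labels `y` in {-1, +1}. The sketch below compares the hinge loss, the squared hinge loss (the default `loss` of `LinearSVC`), and the logistic loss. The function names are chosen here for illustration.

```python
import math

def hinge(m):
    """Classic SVM loss: zero once the margin reaches 1."""
    return max(0.0, 1.0 - m)

def squared_hinge(m):
    """Squared hinge: same zero region, but penalizes violations quadratically."""
    return max(0.0, 1.0 - m) ** 2

def logistic(m):
    """Logistic regression loss: always positive, decays smoothly."""
    return math.log(1.0 + math.exp(-m))

for m in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(f"margin={m:+.1f}  hinge={hinge(m):.3f}  "
          f"squared_hinge={squared_hinge(m):.3f}  logistic={logistic(m):.3f}")
```

Both hinge variants ignore examples that are already classified with a margin of at least 1, whereas the logistic loss keeps a small penalty on every example. In practice the two models often end up with similar accuracy, as they do here.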

In [None]:
print("Train score:", clf.score(X_train, y_train))
print("Test score:", clf.score(X_test, y_test))
y_pred = clf.predict(X_test) 
print(classification_report(y_test, y_pred))

In [None]:
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize = [5, 4], layout = 'tight')
sns.heatmap(cm, annot=True, fmt='d', cmap='viridis', ax = ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

## Summary
These four classical ML algorithms provide a baseline for classifying IMDb reviews. The logistic regression and SVM models both achieve an accuracy of 88%, which is a good baseline for classical models. More careful tuning could increase this accuracy by a few percent. Unsurprisingly, the Naive Bayes and random forest models perform slightly worse, at 85% accuracy. However, this is still a decent result given their limitations. The next step is to move to a deep learning framework.