### A simple model
First, let's build a simple model that uses TF-IDF features and an SVM to see how it performs on the dataset.
- I chose an SVM because SVMs are known to perform well in high-dimensional spaces, and TF-IDF produces a high-dimensional feature space (one dimension per vocabulary term).
- TF-IDF and bag of words are two common ways to turn text into features before fitting a model. Since the model I chose was an SVM, I went with TF-IDF.
- This gives us a baseline that we can try to improve on.
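
To make the dimensionality point concrete, here is a minimal sketch that computes TF-IDF by hand for a toy corpus. It uses the textbook weighting `tf * log(N / df)`; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ. The corpus and the helper name `tfidf_vectors` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using the textbook tf * log(N / df) weighting."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One entry per vocabulary term; terms absent from the document get 0
        vectors.append([
            (counts[term] / len(toks)) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors

# Hypothetical toy corpus
docs = ["a great movie", "a terrible movie", "great acting great plot"]
vocab, vectors = tfidf_vectors(docs)
print(len(vocab), "dimensions:", vocab)
```

Even three short documents already need one dimension per distinct word, and most entries in each vector are zero. On 50,000 reviews the vocabulary is far larger, which is why the feature space is both high-dimensional and sparse.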

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import re
from sklearn.svm import SVC
from joblib import dump
from sklearn.metrics import classification_report




In [2]:
# Download stopwords from nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sudhanva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Load the data
df = pd.read_csv("../data/IMDB Dataset.csv")

In [4]:
df.head()

,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
# Inspect the column types and check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [6]:
# The classes are balanced: 25,000 reviews each
df["sentiment"].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [7]:
# Let's build a simple model using TF-IDF and SVM
X = df['review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
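
The split above is purely random, so the class counts in the test set can drift slightly from 50/50 (the report further down shows 4961 negative and 5039 positive reviews). If an exact class ratio matters, one option is the `stratify` argument of `train_test_split`. The sketch below uses made-up stand-in data, not the IMDB reviews.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the review and sentiment columns (hypothetical data)
reviews = [f"review {i}" for i in range(20)]
labels = ["positive"] * 10 + ["negative"] * 10

# stratify=labels keeps the 50/50 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    reviews, labels, test_size=0.2, random_state=42, stratify=labels
)
print(Counter(y_tr), Counter(y_te))
```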


In [8]:
# Build the stopword set once instead of rebuilding it for every review
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Replace punctuation, digits, and other non-letter characters with spaces
    text = re.sub(r'[^a-zA-Z]', ' ', text)

    # Tokenize (this needs NLTK's punkt tokenizer data to be installed)
    tokens = nltk.word_tokenize(text)

    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]

    # Reconstruct the text
    return ' '.join(filtered_tokens)

# Apply preprocessing to training and test data
X_train_processed = X_train.apply(preprocess_text)
X_test_processed = X_test.apply(preprocess_text)

# Vectorization using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vectors = vectorizer.fit_transform(X_train_processed)
X_test_vectors = vectorizer.transform(X_test_processed)

In [9]:
svm_model = SVC(kernel='linear')

# Train the model
svm_model.fit(X_train_vectors, y_train)

# Predict on the test set
svm_predictions = svm_model.predict(X_test_vectors)

# Evaluate the model
print(classification_report(y_test, svm_predictions))

              precision    recall  f1-score   support

    negative       0.90      0.87      0.88      4961
    positive       0.88      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000
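
`SVC(kernel='linear')` can be slow on tens of thousands of documents, because its training time grows faster than linearly with the number of samples. A possible alternative, not used in this notebook, is `LinearSVC`, which trains a linear SVM with a solver that scales better to large sparse inputs. Its default loss and intercept handling differ from `SVC`, so the scores would not be identical. The sketch below only shows the API shape on a tiny made-up corpus; it makes no claim about results on the IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus, only to show the API shape
texts = [
    "wonderful film loved it",
    "great acting wonderful story",
    "terrible film hated it",
    "awful acting terrible story",
]
labels = ["positive", "positive", "negative", "negative"]

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

preds = model.predict(["wonderful story", "terrible acting"])
print(preds)
```

Wrapping both steps in one pipeline also means a single object can be saved and reused, instead of a separate model and vectorizer.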



In [10]:
# Save the model for inference
dump(svm_model,"../models/svm_1.joblib")

['../models/svm_1.joblib']

In [11]:
# Save the vectorizer for inference
dump(vectorizer,"../models/vectorizer_1.joblib")

['../models/vectorizer_1.joblib']
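
For completeness, here is a minimal sketch of how the two saved artifacts could be loaded and used together at inference time. It trains a tiny stand-in model on made-up text and writes to a temporary directory, so the file names are hypothetical and differ from the `../models/` paths above. In the real notebook, new reviews would also need to go through `preprocess_text` before vectorization.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Tiny made-up corpus standing in for the preprocessed IMDB reviews
texts = ["wonderful film loved", "great acting wonderful",
         "terrible film hated", "awful acting terrible"]
labels = ["positive", "positive", "negative", "negative"]

vectorizer = TfidfVectorizer()
model = SVC(kernel="linear").fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names inside a temporary directory
    model_path = os.path.join(tmp, "svm.joblib")
    vectorizer_path = os.path.join(tmp, "vectorizer.joblib")
    dump(model, model_path)
    dump(vectorizer, vectorizer_path)

    # At inference time, load both artifacts
    loaded_model = load(model_path)
    loaded_vectorizer = load(vectorizer_path)

new_reviews = ["wonderful acting", "terrible film"]
# Use transform, not fit_transform, so the feature columns match training
predictions = loaded_model.predict(loaded_vectorizer.transform(new_reviews))
print(predictions)
```

The vectorizer has to be saved alongside the model because the model's weights are tied to the vocabulary and column order learned during `fit_transform`.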

### Adding word vectors
The simple model gives decent performance. Let's make a few changes to see whether we can improve it:

- Convert the labels into numbers.
- Use word vectors instead of TF-IDF. Word vectors capture semantic information and relationships between words better than TF-IDF.
- Change the model to MultinomialNB, a simple and efficient model that is known to perform well on larger datasets with a medium-dimensional feature space.

In [13]:
df["num_sentiment"] = df["sentiment"].map({"positive":1,"negative": 0})
df.head()

,review,sentiment,num_sentiment
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1


In [14]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [15]:
df["vector"] = df["review"].apply(lambda text: nlp(text).vector)
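
By default, spaCy's `Doc.vector` is the average of the token vectors in the document, so each review is compressed into a single fixed-length vector. The sketch below imitates that averaging with made-up 3-dimensional vectors (real `en_core_web_lg` vectors are much longer) to show one consequence: word order is lost.

```python
import numpy as np

# Made-up 3-dimensional word vectors, purely for illustration
toy_vectors = {
    "not": np.array([-1.0, 0.0, 0.5]),
    "good": np.array([1.0, 1.0, 0.0]),
}

def document_vector(text):
    """Average the word vectors of a whitespace-tokenized text."""
    return np.mean([toy_vectors[tok] for tok in text.split()], axis=0)

a = document_vector("not good")
b = document_vector("good not")
# Both orderings produce the same averaged vector
print(a, np.allclose(a, b))
```

For long reviews the average also blends many unrelated words together, which may wash out the words that carry the sentiment.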

In [16]:
# We can see that some of the vector components are negative. This will cause an issue when using
# MultinomialNB, which expects non-negative features.
df.head()

,review,sentiment,num_sentiment,vector
0,One of the other reviewers has mentioned that ...,positive,1,"[-1.7168683, 0.63682824, -2.2273295, -0.330067..."
1,A wonderful little production. <br /><br />The...,positive,1,"[-1.6359671, 0.3594072, -0.9553732, -0.1634876..."
2,I thought this was a wonderful way to spend ti...,positive,1,"[-2.092113, 1.2885702, -1.5465266, -0.07942136..."
3,Basically there's a family where a little boy ...,negative,0,"[-2.128518, 0.5229247, -2.2172096, -0.5633962,..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1,"[-2.0590715, 1.256359, -1.9560167, -0.3018319,..."


In [17]:
# Creating train and test split
X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values,
    df.num_sentiment,
    test_size=0.2,
    random_state=2022
)

In [18]:
# The model expects a 2D array, so we need to stack the per-review vectors
import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [19]:
# As mentioned above, the negative values will cause an issue.
# We have to scale the vectors into a non-negative range before fitting the model.
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)
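
The sketch below illustrates the issue on synthetic data (random numbers, not real embeddings): `MultinomialNB` raises a `ValueError` on negative inputs, and `MinMaxScaler` rescales each column to the range [0, 1] so that the fit goes through. It also shows `GaussianNB` as one possible alternative that accepts real-valued features directly; that model is not used in this notebook, and the accuracy printed here says nothing about the IMDB data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic dense features with negative values (not real embeddings)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (50, 4)), rng.normal(1.0, 0.5, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

# MultinomialNB rejects negative inputs
try:
    MultinomialNB().fit(X, y)
    raised = False
except ValueError:
    raised = True
print("MultinomialNB raised ValueError:", raised)

# MinMaxScaler rescales each column to [0, 1], which makes the fit possible
X_scaled = MinMaxScaler().fit_transform(X)
mnb = MultinomialNB().fit(X_scaled, y)
print(X_scaled.min(), X_scaled.max())

# GaussianNB models real-valued features directly, so no rescaling is needed
gnb = GaussianNB().fit(X, y)
gnb_accuracy = gnb.score(X, y)
print("GaussianNB training accuracy:", gnb_accuracy)
```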

In [20]:
clf = MultinomialNB()
clf.fit(scaled_train_embed, y_train)

In [21]:
y_pred = clf.predict(scaled_test_embed)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.62      0.60      0.61      5043
           1       0.61      0.63      0.62      4957

    accuracy                           0.62     10000
   macro avg       0.62      0.62      0.62     10000
weighted avg       0.62      0.62      0.62     10000



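As we can observe, this gave us worse results than the SVM model (0.62 vs. 0.89 accuracy). This may be due to the smaller size of the dataset and to the spaCy model we are using for word embeddings. Two other likely factors: each review is represented by the average of its word vectors, which discards word order and blurs long reviews, and MultinomialNB is designed for count-like features rather than rescaled dense embeddings. Because both the features and the classifier changed at once, this experiment cannot tell us which change caused the drop.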