This dataset was taken from Kaggle: IMDB Dataset of 50K Movie Reviews.

## Data Preparation

The goal is to find which ML model is best suited to predict sentiment given a review.

In [1]:
import pandas as pd

df_review = pd.read_csv('IMDB Dataset.csv')
df_review.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


I'm going to take a smaller sample of 10k rows to make processing faster, deliberately keeping it imbalanced (9,000 positive and 1,000 negative reviews).

In [2]:
# 9000 positives
df_positive = df_review[df_review['sentiment']=='positive'][:9000]
# 1000 negatives
df_negative = df_review[df_review['sentiment']=='negative'][:1000]

df_review_imb = pd.concat([df_positive, df_negative])
df_review_imb.value_counts(['sentiment'])

sentiment
positive     9000
negative     1000
Name: count, dtype: int64

In [3]:
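from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)
# fit_resample returns (X_resampled, y_resampled); the chained assignment stores X as the
# new DataFrame and then attaches y to it as the 'sentiment' column.
df_review_bal, df_review_bal['sentiment'] = rus.fit_resample(df_review_imb[['review']],
                                                             df_review_imb['sentiment'])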

In [4]:
print(df_review_imb.value_counts('sentiment'),"\n")
print(df_review_bal.value_counts('sentiment'))

sentiment
positive    9000
negative    1000
Name: count, dtype: int64 

sentiment
negative    1000
positive    1000
Name: count, dtype: int64



We started with 50k rows, sampled 10k of them (9k positive and 1k negative), and finally undersampled the majority class, leaving 2k rows (1k positive and 1k negative).
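
To make the undersampling step concrete, here is a minimal pure-Python sketch of the idea: every class is randomly cut down to the size of the rarest one. This is an illustration under stated assumptions, not imblearn's implementation; the `random_undersample` helper and the synthetic `labels` list are hypothetical, and the rows it picks will not match those chosen by `RandomUnderSampler(random_state=0)`.

```python
import random
from collections import Counter

# Hypothetical stand-in for the imbalanced sample: 9000 positive, 1000 negative labels.
labels = ["positive"] * 9000 + ["negative"] * 1000

def random_undersample(labels, seed=0):
    """Return sorted row indices that keep every class at the size of the rarest class."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # The rarest class sets the target size for all classes.
    target = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        # Sample without replacement, so no row is duplicated.
        kept.extend(rng.sample(rows, target))
    return sorted(kept)

kept = random_undersample(labels)
print(Counter(labels[i] for i in kept))
```

The trade-off is that 8,000 positive reviews are thrown away; undersampling buys balance at the cost of training data.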


## Data Split (Train/Test)

In [5]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df_review_bal, test_size=0.33, random_state=0)
train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']

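The classifiers used below expect numerical input rather than text, so we have to convert the reviews into numerical features. One way to do this is TF-IDF (Term Frequency – Inverse Document Frequency), which gives a high weight to words that are frequent in a review but rare across the whole collection.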

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
train_x_vector = tfidf.fit_transform(train_x)
# only transform the test set: the vocabulary and IDF weights must come from the training data
test_x_vector = tfidf.transform(test_x)
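
As a rough illustration of what the vectorizer computes, the sketch below works out TF-IDF weights by hand on a toy corpus. It assumes the smoothed formula that scikit-learn documents as its default (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation) and skips tokenisation and stop-word removal; the corpus and the `tfidf_vector` helper are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

# Hypothetical toy "training" corpus, already lowercased and tokenised.
docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "good", "plot"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF (assumed to mirror scikit-learn's default, smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(doc):
    """Return L2-normalised TF-IDF weights for one document, ordered like `vocab`."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    if norm == 0:  # document contains no known term
        return raw
    return [w / norm for w in raw]

for doc in docs:
    print(doc, [round(w, 3) for w in tfidf_vector(doc)])

# An unseen "test" document reuses the training vocabulary and IDF values;
# the unknown word "great" is simply ignored.
print([round(w, 3) for w in tfidf_vector(["great", "movie"])])
```

The last call mirrors why the test set is only transformed: words that never appeared in the training data have no column and no IDF weight.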

### ML algorithm -> Supervised learning -> Classification

#### Support Vector Machines (SVM)

In [18]:
from sklearn.svm import SVC

svc = SVC(kernel="linear")
print(svc.fit(train_x_vector, train_y))

SVC(kernel='linear')


In [8]:
### Testing

print(svc.predict(tfidf.transform(['A good movie'])))
print(svc.predict(tfidf.transform(['An excellent movie'])))
print(svc.predict(tfidf.transform(['I did not like this movie at all'])))


['positive']
['positive']
['negative']


#### Decision Tree

In [19]:
from sklearn.tree import DecisionTreeClassifier

dec_tree = DecisionTreeClassifier()
print(dec_tree.fit(train_x_vector, train_y))

DecisionTreeClassifier()


#### Naive Bayes

In [20]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
print(gnb.fit(train_x_vector.toarray(), train_y))

GaussianNB()


#### Logistic Regression

In [21]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
print(log_reg.fit(train_x_vector, train_y))

LogisticRegression()


## Model Evaluation

#### Mean Accuracy

In [12]:
# score() returns the mean accuracy on the given data, not the precision
print(f"SVM model accuracy: {svc.score(test_x_vector, test_y):.3f}")
print(f"Decision tree model accuracy: {dec_tree.score(test_x_vector, test_y):.3f}")
print(f"Gaussian naive Bayes model accuracy: {gnb.score(test_x_vector.toarray(), test_y):.3f}")
print(f"Logistic regression model accuracy: {log_reg.score(test_x_vector, test_y):.3f}")

SVM model accuracy: 0.826
Decision tree model accuracy: 0.670
Gaussian naive Bayes model accuracy: 0.633
Logistic regression model accuracy: 0.812


#### F1 Score

In [13]:
from sklearn.metrics import f1_score

f1_score(test_y, svc.predict(test_x_vector),
         labels=['positive', 'negative'],
         average=None)

array([0.82861401, 0.82280431])

#### Classification report

In [14]:
from sklearn.metrics import classification_report

print(classification_report(test_y, 
                            svc.predict(test_x_vector),
                            labels=['positive', 'negative']))

              precision    recall  f1-score   support

    positive       0.79      0.87      0.83       321
    negative       0.86      0.79      0.82       339

    accuracy                           0.83       660
   macro avg       0.83      0.83      0.83       660
weighted avg       0.83      0.83      0.83       660



In [15]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(test_y, 
                            svc.predict(test_x_vector), 
                            labels=['positive', 'negative'])
conf_mat

array([[278,  43],
       [ 72, 267]])
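
The accuracy, precision, recall and F1 values reported earlier can be re-derived from the four counts in this confusion matrix. The sketch below does that arithmetic in plain Python; the counts are copied from the output above, while the `metrics` dictionary is only a name used for illustration.

```python
# Counts copied from the confusion matrix above.
# Rows are true labels, columns are predicted labels, both ordered ['positive', 'negative'].
conf_mat = [[278, 43],
            [72, 267]]
labels = ["positive", "negative"]

total = sum(sum(row) for row in conf_mat)
# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(conf_mat[i][i] for i in range(len(labels))) / total
print(f"accuracy: {accuracy:.3f}")  # 0.826

metrics = {}
for i, label in enumerate(labels):
    tp = conf_mat[i][i]                        # correctly predicted as this class
    fn = sum(conf_mat[i]) - tp                 # this class, predicted as the other
    fp = sum(row[i] for row in conf_mat) - tp  # other class, predicted as this one
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# positive: precision=0.794 recall=0.866 f1=0.829
# negative: precision=0.861 recall=0.788 f1=0.823
```

The F1 values agree with the `f1_score` output (0.8286 and 0.8228), and the accuracy agrees with the SVM's `score()` result.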

## Model tuning

#### GridSearchCV

In [22]:
from sklearn.model_selection import GridSearchCV

parameters = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
svc = SVC()
svc_grid = GridSearchCV(svc, parameters, cv=5)

print(svc_grid.fit(train_x_vector, train_y))

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']})


In [17]:
print("Best parameters:", svc_grid.best_params_)
print(f"Score: {svc_grid.best_score_:.3f}")

Best parameters: {'C': 4, 'kernel': 'rbf'}
Score: 0.814


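The score we obtain (0.814) is slightly lower than the 0.826 accuracy of the untuned linear SVM, but the two numbers are not directly comparable: `best_score_` is the mean cross-validated accuracy over the five folds of the training set, whereas the earlier figure was measured once on the held-out test set. Because it averages over several splits, the cross-validated score is a more honest estimate and less likely to flatter a model that overfits. The real advantage of grid search is finding hyperparameters that generalize better to new data, even if the reported score looks a little lower. For a like-for-like comparison, the tuned model can be evaluated on the test set with `svc_grid.score(test_x_vector, test_y)`.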
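
To see where the cross-validated score comes from, the sketch below does the bookkeeping for the grid search above: it enumerates the candidate parameter combinations and the size of each fold. It makes two assumptions: the training set has 1,340 rows (2,000 balanced rows minus the 660 test rows shown in the classification report), and the folds are plain equal-sized splits, whereas scikit-learn stratifies the folds by label for classifiers. No model is trained here.

```python
from itertools import product

# Grid copied from the GridSearchCV call above.
parameters = {"C": [1, 4, 8, 16, 32], "kernel": ["linear", "rbf"]}
n_folds = 5
n_train = 1340  # assumption: 2000 balanced rows minus the 660 test rows

# Every combination of C and kernel is one candidate model.
candidates = [dict(zip(parameters, combo)) for combo in product(*parameters.values())]

# Simplification: equal-sized folds, ignoring stratification by label.
fold_sizes = [n_train // n_folds + (1 if i < n_train % n_folds else 0)
              for i in range(n_folds)]

print(f"{len(candidates)} candidates x {n_folds} folds = {len(candidates) * n_folds} fits")
for i, val_size in enumerate(fold_sizes):
    print(f"fold {i}: fit on {n_train - val_size} rows, validate on {val_size} rows")
```

Each candidate's score is the average of its five validation accuracies, and `best_score_` is that average for the winning candidate. Every one of those models is fitted on fewer rows than the full training set, which is one reason the number need not match a single test-set accuracy.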