# Hyperparameter Tuning of Machine Learning Models

Link to notebook: https://colab.research.google.com/drive/1Qz3C2gdh5CypkH3LAlZZDh3CnmoD0neN?usp=sharing

In this notebook, we attempt to address a limitation of the following paper: Rahman A & Hossen MdS (2019) Sentiment analysis on movie review data using Machine Learning Approach. 2019 International Conference on Bangla Speech and Language Processing (ICBSLP). URL: https://ieeexplore.ieee.org/document/9084046

The authors did not tune the hyperparameters of their machine learning models. In this notebook, we explore the available hyperparameters of the following models: Logistic Regression, Multinomial Naive Bayes, and Random Forest Classifier.

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading the data

In [2]:
# This cell reads files from Google Colab. If not using Colab, change the file directories accordingly
from google.colab import drive
drive.mount('/content/drive')

train = pd.read_csv('/content/drive/MyDrive/Datasets/ML_train.csv', index_col = 0)
test = pd.read_csv('/content/drive/MyDrive/Datasets/ML_test.csv', index_col = 0)

Mounted at /content/drive


In [3]:
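def good_bad(rating):
  # Map a numeric review rating to a binary sentiment label:
  # ratings above 4 are positive (1), all others are negative (0)
  if rating > 4:
    return 1
  else:
    return 0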

In [4]:
train['Sentiment'] = train['Sentiment'].apply(good_bad)
test['Sentiment'] = test['Sentiment'].apply(good_bad)
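
The labelling rule above can also be written as a vectorized comparison. The following is a minimal sketch on hypothetical ratings (the `ratings` values are made up for illustration and are not taken from the dataset); it only checks that both forms give the same labels.

```python
import pandas as pd


def good_bad(rating):
    """Return 1 for ratings above 4, else 0 (same rule as in the notebook)."""
    return 1 if rating > 4 else 0


# Hypothetical ratings, used only for illustration
ratings = pd.Series([1, 4, 5, 7, 10])

# Row-by-row version, as used in the notebook
labels_apply = ratings.apply(good_bad)

# Vectorized version: compare the whole column at once, then cast booleans to 0/1
labels_vectorized = (ratings > 4).astype(int)

print(labels_apply.tolist())
print(labels_vectorized.tolist())
```

On large columns the vectorized form is usually faster, but either form should produce identical labels.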

In [5]:
train.head()

Unnamed: 0,Text,Sentiment
0,saw premiered rewatched ifc is great telling m...,1
1,movie is one alltime favorite think sean penn ...,1
2,describing stalingrad war film may bit inaccur...,1
3,tale two sister one creepiest film have seen r...,1
4,well notice imdb offered plot infothat is is p...,0


In [6]:
trial_train = train.copy()
trial_train.head()

Unnamed: 0,Text,Sentiment
0,saw premiered rewatched ifc is great telling m...,1
1,movie is one alltime favorite think sean penn ...,1
2,describing stalingrad war film may bit inaccur...,1
3,tale two sister one creepiest film have seen r...,1
4,well notice imdb offered plot infothat is is p...,0


In [7]:
class2_X_train = trial_train['Text']
class2_y_train = trial_train['Sentiment']

In [8]:
trial_test = test.copy()
trial_test.head()

Unnamed: 0,Text,Sentiment
0,frank horrigan clint eastwood is harassed mitc...,1
1,carly jones elisha curtberth bad boy brother n...,1
2,dig would say anyone even like metallica see k...,1
3,is great premise movie overall plot is origina...,0
4,underground comedy movie is possibly worst tra...,0


In [9]:
class2_X_test = trial_test['Text']
class2_y_test = trial_test['Sentiment']

# Vectorization

BOW Vectorizer

In [10]:
bow_vectorizer = CountVectorizer()
bow_vectorizer.fit(class2_X_train)

bow_X_train = bow_vectorizer.transform(class2_X_train)
bow_X_test = bow_vectorizer.transform(class2_X_test)
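
To make the fit/transform split concrete, here is a small sketch on a hypothetical two-document corpus (the documents are invented for illustration). It shows that the vocabulary is learned from the training text only, so words that appear only in the test text are silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, not taken from the dataset
train_docs = ["great movie great cast", "bad movie"]
test_docs = ["great unseen plot"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)  # the vocabulary comes from the training text only

vocab = sorted(vectorizer.vocabulary_)
train_counts = vectorizer.transform(train_docs).toarray()
test_counts = vectorizer.transform(test_docs).toarray()

print(vocab)         # columns of the count matrix, in alphabetical order
print(train_counts)  # one row per training document
print(test_counts)   # "unseen" and "plot" have no column, so only "great" is counted
```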

TF-IDF Vectorizer

In [11]:
# ngram_range=(1, 3): This tells the vectorizer to consider unigrams, bigrams, and trigrams
# min_df=2: This means an n-gram must appear in at least two documents to be considered. This helps in removing very rare n-grams that might not be useful for modeling.
# max_df=0.85: This means an n-gram appearing in more than 85% of the documents will be ignored, helping in filtering out too common n-grams.

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=2, max_df=0.85)
tfidf_vectorizer.fit(class2_X_train)

tfidf_X_train = tfidf_vectorizer.transform(class2_X_train)
tfidf_X_test = tfidf_vectorizer.transform(class2_X_test)
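
The effect of `min_df` and `max_df` is easier to see on a tiny example. The sketch below uses a hypothetical four-document corpus and, to keep the output short, `ngram_range=(1, 2)` instead of the `(1, 3)` used above; both choices are assumptions made for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not taken from the dataset
docs = [
    "film was great fun",
    "film was great",
    "film was dull",
    "film felt dull",
]

# Same document-frequency filters as in the notebook, but only unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.85)
vectorizer.fit(docs)

kept = sorted(vectorizer.vocabulary_)
print(kept)

# "film" occurs in 4 of 4 documents (100% > 85%), so max_df removes it.
# "fun" and "felt" occur in a single document, so min_df removes them.
# The bigram "film was" survives because it is counted as its own term (3 of 4 documents).
```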

# Function to help us evaluate model performance

In [12]:
def train_and_eval(model, trainX, trainY, testX, testY):

    # training the model
    fitted_model = model.fit(trainX, trainY)

    # getting predictions
    y_preds_train = fitted_model.predict(trainX)
    y_preds_test = fitted_model.predict(testX)

    # evaluating the model
    print()
    print(model)
    print(f"Train accuracy score : {accuracy_score(trainY, y_preds_train)}")
    print(f"Test accuracy score : {accuracy_score(testY, y_preds_test)}")
    print(classification_report(testY, y_preds_test))
    print('\n',40*'-')
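
When several models need to be compared later, it can help to return the scores instead of only printing them. The following is a sketch of such a variant; the name `eval_scores` and the toy data are hypothetical and are not part of the original notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def eval_scores(model, trainX, trainY, testX, testY):
    """Fit the model and return train/test accuracy as a dictionary."""
    fitted = model.fit(trainX, trainY)
    return {
        "train_accuracy": accuracy_score(trainY, fitted.predict(trainX)),
        "test_accuracy": accuracy_score(testY, fitted.predict(testX)),
    }


# Hypothetical, linearly separable toy data (one feature)
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]
X_test = [[-1.0], [4.0]]
y_test = [0, 1]

scores = eval_scores(LogisticRegression(), X_train, y_train, X_test, y_test)
print(scores)
```

A large gap between the two values is a simple first signal of overfitting.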

## Logistic Regression Hyperparameter Tuning


Logistic Regression with BOW


In [None]:
# The grid values were chosen based on an earlier, smaller tuning run that narrowed down useful C values
param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
    'max_iter': [400]
}

In [None]:
# Note: scikit-learn ignores n_jobs when solver='liblinear', so it has no effect for this grid
log_model = LogisticRegression(random_state = 42, n_jobs = -1)
clf = GridSearchCV(log_model, param_grid=param_grid, cv=3, verbose=4, scoring='accuracy')

In [None]:
best_clf = clf.fit(bow_X_train, class2_y_train)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV 1/3] END C=0.1, max_iter=400, penalty=l1, solver=liblinear;, score=0.879 total time=   2.2s
[CV 2/3] END C=0.1, max_iter=400, penalty=l1, solver=liblinear;, score=0.875 total time=   1.5s
[CV 3/3] END C=0.1, max_iter=400, penalty=l1, solver=liblinear;, score=0.878 total time=   1.3s
[CV 1/3] END C=0.1, max_iter=400, penalty=l2, solver=liblinear;, score=0.888 total time=  15.6s
[CV 2/3] END C=0.1, max_iter=400, penalty=l2, solver=liblinear;, score=0.885 total time=   9.5s
[CV 3/3] END C=0.1, max_iter=400, penalty=l2, solver=liblinear;, score=0.890 total time=   7.0s
[CV 1/3] END C=1, max_iter=400, penalty=l1, solver=liblinear;, score=0.874 total time=   1.6s
[CV 2/3] END C=1, max_iter=400, penalty=l1, solver=liblinear;, score=0.876 total time=   2.1s
[CV 3/3] END C=1, max_iter=400, penalty=l1, solver=liblinear;, score=0.876 total time=   1.9s
[CV 1/3] END C=1, max_iter=400, penalty=l2, solver=liblinear;, score=0.883 total t
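
The "6 candidates, totalling 18 fits" reported in the log follows directly from the grid. As a rough sketch in plain Python (it mirrors how an exhaustive grid is enumerated, but it is not scikit-learn's own code), the count can be reproduced like this:

```python
from itertools import product

param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
    'max_iter': [400],
}
cv_folds = 3

# Every combination of the listed values is one candidate
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 6 18
```

With the default `refit=True`, the search also refits the best candidate once on the full training data; that extra fit is not part of the reported total.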

In [None]:
best_clf.best_params_

{'C': 0.1, 'max_iter': 400, 'penalty': 'l2', 'solver': 'liblinear'}

In [None]:
best_clf.best_score_

0.8873499923593785

In [None]:
best_log = LogisticRegression(**best_clf.best_params_)

In [None]:
train_and_eval(best_log, bow_X_train, class2_y_train, bow_X_test, class2_y_test)


LogisticRegression(C=0.1, max_iter=400, solver='liblinear')
Train accuracy score : 0.96795
Test accuracy score : 0.8924892489248925
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      5000
           1       0.89      0.90      0.89      4999

    accuracy                           0.89      9999
   macro avg       0.89      0.89      0.89      9999
weighted avg       0.89      0.89      0.89      9999


 ----------------------------------------


Best Logistic Regression Model with BOW gives us 89.2% accuracy on test data

Logistic Regression with TF-IDF


In [None]:
param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'max_iter': [400]
}

In [None]:
log_model = LogisticRegression(random_state = 42, n_jobs = -1)
clf = RandomizedSearchCV(log_model, param_distributions=param_grid, cv=3, verbose=4, scoring='accuracy', random_state=42, n_iter=20)


In [None]:
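# Note: this run still uses the BOW features (with the extended grid that adds the 'saga' solver);
# the TF-IDF fit follows further below
best_clf = clf.fit(bow_X_train, class2_y_train)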

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV 1/3] END C=0.1, max_iter=400, penalty=l1, solver=liblinear;, score=0.879 total time=   1.3s
[CV 2/3] END C=0.1, max_iter=400, penalty=l1, solver=liblinear;, score=0.875 total time=   1.3s
[CV 3/3] END C=0.1, max_iter=400, penalty=l1, solver=liblinear;, score=0.878 total time=   1.4s
[CV 1/3] END C=0.1, max_iter=400, penalty=l1, solver=saga;, score=0.878 total time= 1.9min
[CV 2/3] END C=0.1, max_iter=400, penalty=l1, solver=saga;, score=0.873 total time= 2.0min
[CV 3/3] END C=0.1, max_iter=400, penalty=l1, solver=saga;, score=0.878 total time= 1.9min
[CV 1/3] END C=0.1, max_iter=400, penalty=l2, solver=liblinear;, score=0.888 total time=   8.6s
[CV 2/3] END C=0.1, max_iter=400, penalty=l2, solver=liblinear;, score=0.884 total time=  10.4s
[CV 3/3] END C=0.1, max_iter=400, penalty=l2, solver=liblinear;, score=0.890 total time=  10.4s
[CV 1/3] END C=0.1, max_iter=400, penalty=l2, solver=saga;, score=0.888 total time=  49.6s
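
The log reports 12 candidates even though `n_iter=20` was requested. When every entry in the search space is a list, the random search samples without replacement, so it cannot return more candidates than the grid contains. The sketch below illustrates this with `ParameterSampler`, the sampling helper in `sklearn.model_selection`; suppressing the warning is only done here to keep the output clean.

```python
import warnings

from sklearn.model_selection import ParameterSampler

param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'max_iter': [400],
}

with warnings.catch_warnings():
    # scikit-learn warns that the search space is smaller than n_iter
    warnings.simplefilter("ignore")
    sampled = list(ParameterSampler(param_grid, n_iter=20, random_state=42))

# Turn each candidate into a hashable form to count distinct combinations
unique = {tuple(sorted(params.items())) for params in sampled}

print(len(sampled), len(unique))  # 12 12
```

In this setting the randomized search is therefore equivalent to an exhaustive grid search over the same 12 combinations.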

In [None]:
log_model = LogisticRegression(random_state = 42, n_jobs = -1)
clf = RandomizedSearchCV(log_model, param_distributions=param_grid, cv=3, verbose=4, scoring='accuracy', random_state=42, n_iter=20)

In [None]:
%%time
best_clf = clf.fit(tfidf_X_train, class2_y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV 1/3] END C=0.1, max_iter=400, penalty=l1, solver=liblinear;, score=0.766 total time=  18.2s
[CV 2/3] END C=0.1, max_iter=400, penalty=l1, solver=liblinear;, score=0.765 total time=  15.9s
[CV 3/3] END C=0.1, max_iter=400, penalty=l1, solver=liblinear;, score=0.772 total time=  13.9s
[CV 1/3] END C=0.1, max_iter=400, penalty=l1, solver=saga;, score=0.765 total time=  20.3s
[CV 2/3] END C=0.1, max_iter=400, penalty=l1, solver=saga;, score=0.765 total time=  17.4s
[CV 3/3] END C=0.1, max_iter=400, penalty=l1, solver=saga;, score=0.772 total time=  16.3s
[CV 1/3] END C=0.1, max_iter=400, penalty=l2, solver=liblinear;, score=0.854 total time=   3.7s
[CV 2/3] END C=0.1, max_iter=400, penalty=l2, solver=liblinear;, score=0.854 total time=   4.7s
[CV 3/3] END C=0.1, max_iter=400, penalty=l2, solver=liblinear;, score=0.859 total time=   4.8s
[CV 1/3] END C=0.1, max_iter=400, penalty=l2, solver=saga;, score=0.854 total time=  15.0s

In [None]:
best_clf.best_params_

{'solver': 'liblinear', 'penalty': 'l2', 'max_iter': 400, 'C': 10}

In [None]:
best_clf.best_score_

0.9037249386311904

In [None]:
best_log = LogisticRegression(**best_clf.best_params_)

In [None]:
train_and_eval(best_log, tfidf_X_train, class2_y_train, tfidf_X_test, class2_y_test)


LogisticRegression(C=10, max_iter=400, solver='liblinear')
Train accuracy score : 0.999175
Test accuracy score : 0.9062906290629063
              precision    recall  f1-score   support

           0       0.91      0.90      0.91      5000
           1       0.90      0.91      0.91      4999

    accuracy                           0.91      9999
   macro avg       0.91      0.91      0.91      9999
weighted avg       0.91      0.91      0.91      9999


 ----------------------------------------


Best Logistic Regression Model with TF-IDF gives us 90.6% accuracy on test data

## Random Forest Hyperparameter Tuning


Random Forest Classifier with BOW

In [None]:
param_grid = {
    'n_estimators': [50, 100, 150, 200, 250],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [20, 30, 50, 75, 100],
    'min_samples_split': [10, 25, 50, 100, 150, 200],
    'min_samples_leaf': [25, 50, 100]
}

In [None]:
rf_model = RandomForestClassifier(random_state = 42, n_jobs = -1)
clf = RandomizedSearchCV(rf_model, param_distributions=param_grid, cv=3, verbose=4, scoring='accuracy', random_state=42)

In [None]:
best_clf = clf.fit(bow_X_train, class2_y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END criterion=log_loss, max_depth=50, min_samples_leaf=50, min_samples_split=100, n_estimators=100;, score=0.847 total time=  23.6s
[CV 2/3] END criterion=log_loss, max_depth=50, min_samples_leaf=50, min_samples_split=100, n_estimators=100;, score=0.835 total time=  19.5s
[CV 3/3] END criterion=log_loss, max_depth=50, min_samples_leaf=50, min_samples_split=100, n_estimators=100;, score=0.843 total time=  17.0s
[CV 1/3] END criterion=entropy, max_depth=100, min_samples_leaf=50, min_samples_split=150, n_estimators=50;, score=0.834 total time=  11.0s
[CV 2/3] END criterion=entropy, max_depth=100, min_samples_leaf=50, min_samples_split=150, n_estimators=50;, score=0.821 total time=   9.4s
[CV 3/3] END criterion=entropy, max_depth=100, min_samples_leaf=50, min_samples_split=150, n_estimators=50;, score=0.838 total time=   7.9s
[CV 1/3] END criterion=log_loss, max_depth=100, min_samples_leaf=50, min_samples_split=10, n_est
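
Unlike the Logistic Regression searches, this random search only sees a small part of the search space. A quick back-of-the-envelope calculation, sketched below in plain Python, shows how small: `RandomizedSearchCV` uses `n_iter=10` by default, which matches the "10 candidates" in the log.

```python
from math import prod

param_grid = {
    'n_estimators': [50, 100, 150, 200, 250],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [20, 30, 50, 75, 100],
    'min_samples_split': [10, 25, 50, 100, 150, 200],
    'min_samples_leaf': [25, 50, 100],
}
n_iter = 10  # RandomizedSearchCV default, since n_iter is not set in the cell above

# Size of the full grid: product of the number of options per hyperparameter
space_size = prod(len(values) for values in param_grid.values())
coverage = n_iter / space_size

print(space_size, f"{coverage:.2%}")  # 1350 0.74%
```

Because fewer than 1% of the combinations are tried, the reported best parameters should be read as the best of the sampled candidates rather than the best of the grid. Note also that scikit-learn documents `'entropy'` and `'log_loss'` as the same Shannon information gain criterion, so those two options do not explore different splits.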

In [None]:
best_clf.best_params_

{'n_estimators': 250,
 'min_samples_split': 10,
 'min_samples_leaf': 50,
 'max_depth': 100,
 'criterion': 'log_loss'}

In [None]:
best_clf.best_score_

0.8492000173110673

In [None]:
best_rf = RandomForestClassifier(**best_clf.best_params_)


In [None]:
train_and_eval(best_rf, bow_X_train, class2_y_train, bow_X_test, class2_y_test)



RandomForestClassifier(criterion='log_loss', max_depth=100, min_samples_leaf=50,
                       min_samples_split=10, n_estimators=250)
Train accuracy score : 0.86275
Test accuracy score : 0.8525852585258525
              precision    recall  f1-score   support

           0       0.87      0.83      0.85      5000
           1       0.83      0.88      0.86      4999

    accuracy                           0.85      9999
   macro avg       0.85      0.85      0.85      9999
weighted avg       0.85      0.85      0.85      9999


 ----------------------------------------


Best Random Forest Classifier with BOW gives us 85.3% accuracy on test data

Random Forest with TF-IDF

In [None]:
param_grid = {
    'n_estimators': [50, 100, 150, 200, 500],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [20, 30, 50, 75, 100],
    'min_samples_split': [10, 25, 50, 100, 150, 200, 250],
    'min_samples_leaf': [25, 50, 100]
}

In [None]:
rf_model = RandomForestClassifier(random_state = 42, n_jobs = -1)
clf = RandomizedSearchCV(rf_model, param_distributions=param_grid, cv=3, verbose=4, scoring='accuracy', random_state=42)

In [None]:
best_clf = clf.fit(tfidf_X_train, class2_y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END criterion=log_loss, max_depth=20, min_samples_leaf=100, min_samples_split=25, n_estimators=100;, score=0.811 total time=  28.1s
[CV 2/3] END criterion=log_loss, max_depth=20, min_samples_leaf=100, min_samples_split=25, n_estimators=100;, score=0.820 total time=  25.4s
[CV 3/3] END criterion=log_loss, max_depth=20, min_samples_leaf=100, min_samples_split=25, n_estimators=100;, score=0.817 total time=  20.1s
[CV 1/3] END criterion=log_loss, max_depth=75, min_samples_leaf=100, min_samples_split=150, n_estimators=500;, score=0.842 total time= 1.4min
[CV 2/3] END criterion=log_loss, max_depth=75, min_samples_leaf=100, min_samples_split=150, n_estimators=500;, score=0.844 total time= 1.5min
[CV 3/3] END criterion=log_loss, max_depth=75, min_samples_leaf=100, min_samples_split=150, n_estimators=500;, score=0.842 total time= 1.4min
[CV 1/3] END criterion=entropy, max_depth=75, min_samples_leaf=25, min_samples_split=150, 

In [None]:
# best parameters
best_clf.best_params_

{'n_estimators': 500,
 'min_samples_split': 250,
 'min_samples_leaf': 25,
 'max_depth': 50,
 'criterion': 'log_loss'}

In [None]:
# highest accuracy score
best_clf.best_score_

0.8643249773309729

In [None]:
best_rf = RandomForestClassifier(**best_clf.best_params_)

In [None]:
train_and_eval(best_rf, tfidf_X_train, class2_y_train, tfidf_X_test, class2_y_test)


RandomForestClassifier(criterion='log_loss', max_depth=50, min_samples_leaf=25,
                       min_samples_split=250, n_estimators=500)
Train accuracy score : 0.884975
Test accuracy score : 0.863986398639864
              precision    recall  f1-score   support

           0       0.89      0.84      0.86      5000
           1       0.85      0.89      0.87      4999

    accuracy                           0.86      9999
   macro avg       0.87      0.86      0.86      9999
weighted avg       0.87      0.86      0.86      9999


 ----------------------------------------


Best Random Forest Classifier with TF-IDF gives us 86.4% accuracy on test data

## Multinomial Naive Bayes Hyperparameter Tuning

Multinomial Naive Bayes with BOW

In [15]:
param_grid = {
    'alpha': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'force_alpha': [True, False],
    'fit_prior': [True, False]
}

In [16]:
nb_model = MultinomialNB()
clf = GridSearchCV(nb_model,
                   param_grid=param_grid,
                   cv=5,
                   verbose=4,
                   scoring='accuracy')

In [17]:
best_clf = clf.fit(bow_X_train, class2_y_train)

Fitting 5 folds for each of 44 candidates, totalling 220 fits
[CV 1/5] END alpha=0.0, fit_prior=True, force_alpha=True;, score=0.738 total time=   0.1s
  self.feature_log_prob_ = np.log(smoothed_fc) - np.log(
[CV 2/5] END alpha=0.0, fit_prior=True, force_alpha=True;, score=0.739 total time=   0.1s
[CV 3/5] END alpha=0.0, fit_prior=True, force_alpha=True;, score=0.735 total time=   0.1s
[CV 4/5] END alpha=0.0, fit_prior=True, force_alpha=True;, score=0.917 total time=   0.1s
[CV 5/5] END alpha=0.0, fit_prior=True, force_alpha=True;, score=0.970 total time=   0.1s
[CV 1/5] END alpha=0.0, fit_prior=True, force_alpha=False;, score=0.860 total time=   0.1s
[CV 2/5] END alpha=0.0, fit_prior=True, force_alpha=False;, score=0.861 total time=   0.1s
[CV 3/5] END alpha=0.0, fit_prior=True, force_alpha=False;, score=0.870 total time=   0.1s
[CV 4/5] END alpha=0.0, fit_prior=True, force_alpha=False;, score=0.945 total time=   0.1s
[CV 5/5] END alpha=0.0, fit_prior=True, force_alpha=False;, score=0.970 total time=   0.1s
[CV 1/5] END alpha=0.0, fit_prior=False, force_alpha=True;, score=0.738 total time=   0.1s
[CV 2/5] END alpha=0.0, fit_prior=False, force_alpha=True;, score=0.739 total time=   0.1s
[CV 3/5] END alpha=0.0, fit_prior=False, force_alpha=True;, score=0.735 total time=   0.1s
[CV 4/5] END alpha=0.0, fit_prior=False, force_alpha=True;, score=0.917 total time=   0.1s
[CV 5/5] END alpha=0.0, fit_prior=False, force_alpha=True;, score=0.970 total time=   0.1s
[CV 1/5] END alpha=0.0, fit_prior=False, force_alpha=False;, score=0.860 total time=   0.1s
[CV 2/5] END alpha=0.0, fit_prior=False, force_alpha=False;, score=0.861 total time=   0.1s
[CV 3/5] END alpha=0.0, fit_prior=False, force_alpha=False;, score=0.870 total time=   0.1s
[CV 4/5] END alpha=0.0, fit_prior=False, force_alpha=False;, score=0.945 total time=   0.1s
[CV 5/5] END alpha=0.0, fit_prior=False, force_alpha=False;, score=0.970 total time=   0.1s
[CV 1/5] END alpha=0.1, fit_prior=True, force_alpha=True;, score=0.897 total time=   0.1s
[CV 2/5] END alpha=0.1, fit_prior=True, force_alpha=True;, score=0.897 total time=   0.1s
[CV 3/5] END alpha=0.1, fit_prior=True, force_alpha=True;, score=0.899 total time=   0.1s
[CV 4/5] END alpha=0.1, fit_prior=True, force_alpha=True;, score=0.919 total time=   0.1s
[CV 5/5] END alpha=0.1, fit_prior=True, force_alpha=True;, score=0.929 total time=   0.1s
[CV 1/5] END alpha=0.1, fit_prior=True, force_alpha=False;, score=0.897 total time=   0.1s
[CV 2/5] END alpha=0.1, fit_prior=True, force_alpha=False;, score=0.897 total time=   0.1s
[CV 3/5] END alpha=0.1, fit_prior=True, force_alpha=False;, score=0.899 total time=   0.1s
[CV 4/5] END alpha=0.1, fit_prior=True, force_alpha=False;, score=0.919 total time=   0.1s
[CV 5/5] END alpha=0.1, fit_prior=True, force_alpha=False;, score=0.929 total time=   0.1s
[CV

In [18]:
# best parameters
best_clf.best_params_

{'alpha': 0.1, 'fit_prior': True, 'force_alpha': True}

In [19]:
# highest accuracy score
best_clf.best_score_

0.9082043574968326

In [20]:
best_nb = MultinomialNB(**best_clf.best_params_)

In [21]:
train_and_eval(best_nb, bow_X_train, class2_y_train, bow_X_test, class2_y_test)


MultinomialNB(alpha=0.1, force_alpha=True)
Train accuracy score : 0.9512122541034959
Test accuracy score : 0.8533535047907211
              precision    recall  f1-score   support

           0       0.83      0.88      0.86      4940
           1       0.87      0.83      0.85      4975

    accuracy                           0.85      9915
   macro avg       0.85      0.85      0.85      9915
weighted avg       0.85      0.85      0.85      9915


 ----------------------------------------


Best Multinomial Naive Bayes model with BOW gives us 85.3% accuracy on test data

Multinomial Naive Bayes with TF-IDF

In [22]:
param_grid = {
    'alpha': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'force_alpha': [True, False],
    'fit_prior': [True, False]
}

In [23]:
nb_model = MultinomialNB()
clf = GridSearchCV(nb_model, param_grid=param_grid, cv=5, verbose=4, scoring='accuracy')

In [24]:
best_clf = clf.fit(tfidf_X_train, class2_y_train)

Fitting 5 folds for each of 44 candidates, totalling 220 fits
  self.feature_log_prob_ = np.log(smoothed_fc) - np.log(
[CV 1/5] END alpha=0.0, fit_prior=True, force_alpha=True;, score=0.699 total time=   1.0s
[CV 2/5] END alpha=0.0, fit_prior=True, force_alpha=True;, score=0.696 total time=   0.9s
[CV 3/5] END alpha=0.0, fit_prior=True, force_alpha=True;, score=0.688 total time=   0.9s
[CV 4/5] END alpha=0.0, fit_prior=True, force_alpha=True;, score=0.939 total time=   0.9s
[CV 5/5] END alpha=0.0, fit_prior=True, force_alpha=True;, score=1.000 total time=   0.9s
[CV 1/5] END alpha=0.0, fit_prior=True, force_alpha=False;, score=0.910 total time=   2.1s
[CV 2/5] END alpha=0.0, fit_prior=True, force_alpha=False;, score=0.915 total time=   3.2s
[CV 3/5] END alpha=0.0, fit_prior=True, force_alpha=False;, score=0.914 total time=   1.7s
[CV 4/5] END alpha=0.0, fit_prior=True, force_alpha=False;, score=0.981 total time=   1.7s
[CV 5/5] END alpha=0.0, fit_prior=True, force_alpha=False;, score=1.000 total time=   1.9s
[CV 1/5] END alpha=0.0, fit_prior=False, force_alpha=True;, score=0.699 total time=   0.9s
[CV 2/5] END alpha=0.0, fit_prior=False, force_alpha=True;, score=0.696 total time=   0.9s
[CV 3/5] END alpha=0.0, fit_prior=False, force_alpha=True;, score=0.688 total time=   0.9s
[CV 4/5] END alpha=0.0, fit_prior=False, force_alpha=True;, score=0.939 total time=   0.9s
[CV 5/5] END alpha=0.0, fit_prior=False, force_alpha=True;, score=1.000 total time=   1.1s
[CV 1/5] END alpha=0.0, fit_prior=False, force_alpha=False;, score=0.910 total time=   1.4s
[CV 2/5] END alpha=0.0, fit_prior=False, force_alpha=False;, score=0.915 total time=   1.4s
[CV 3/5] END alpha=0.0, fit_prior=False, force_alpha=False;, score=0.914 total time=   0.9s
[CV 4/5] END alpha=0.0, fit_prior=False, force_alpha=False;, score=0.981 total time=   0.9s
[CV 5/5] END alpha=0.0, fit_prior=False, force_alpha=False;, score=1.000 total time=   0.9s
[CV 1/5] END alpha=0.1, fit_prior=True, force_alpha=True;, score=0.945 total time=   0.9s
[CV 2/5] END alpha=0.1, fit_prior=True, force_alpha=True;, score=0.945 total time=   0.9s
[CV 3/5] END alpha=0.1, fit_prior=True, force_alpha=True;, score=0.944 total time=   0.9s
[CV 4/5] END alpha=0.1, fit_prior=True, force_alpha=True;, score=0.988 total time=   0.9s
[CV 5/5] END alpha=0.1, fit_prior=True, force_alpha=True;, score=1.000 total time=   0.9s
[CV 1/5] END alpha=0.1, fit_prior=True, force_alpha=False;, score=0.945 total time=   0.9s
[CV 2/5] END alpha=0.1, fit_prior=True, force_alpha=False;, score=0.945 total time=   0.9s
[CV 3/5] END alpha=0.1, fit_prior=True, force_alpha=False;, score=0.944 total time=   0.9s
[CV 4/5] END alpha=0.1, fit_prior=True, force_alpha=False;, score=0.988 total time=   1.2s
[CV 5/5] END alpha=0.1, fit_prior=True, force_alpha=False;, score=1.000 total time=   1.3s
[CV

In [25]:
# best parameters
best_clf.best_params_

{'alpha': 0.2, 'fit_prior': True, 'force_alpha': True}

In [26]:
# highest accuracy score
best_clf.best_score_

0.9650428759010555

In [27]:
best_nb = MultinomialNB(**best_clf.best_params_)

In [28]:
train_and_eval(best_nb, tfidf_X_train, class2_y_train, tfidf_X_test, class2_y_test)


MultinomialNB(alpha=0.2, force_alpha=True)
Train accuracy score : 0.9951290565092094
Test accuracy score : 0.891981845688351
              precision    recall  f1-score   support

           0       0.89      0.90      0.89      4940
           1       0.90      0.89      0.89      4975

    accuracy                           0.89      9915
   macro avg       0.89      0.89      0.89      9915
weighted avg       0.89      0.89      0.89      9915


 ----------------------------------------


Best Multinomial Naive Bayes model with TF-IDF gives us 89.2% accuracy on test data
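
To compare the six tuned models side by side, the test accuracies printed by `train_and_eval` above can be collected and ranked. The sketch below simply copies those printed values into a dictionary; the variable names are hypothetical. One caveat from the outputs themselves: the Naive Bayes runs report 9915 test rows, while the Logistic Regression and Random Forest runs report 9999, so the comparison is approximate.

```python
# Test accuracies copied from the train_and_eval outputs in this notebook
results = {
    ("Logistic Regression", "BOW"): 0.8924892489248925,
    ("Logistic Regression", "TF-IDF"): 0.9062906290629063,
    ("Random Forest", "BOW"): 0.8525852585258525,
    ("Random Forest", "TF-IDF"): 0.863986398639864,
    ("Multinomial Naive Bayes", "BOW"): 0.8533535047907211,
    ("Multinomial Naive Bayes", "TF-IDF"): 0.891981845688351,
}

# Rank from highest to lowest test accuracy
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)

for (model, features), accuracy in ranked:
    print(f"{model:<24} {features:<7} {accuracy:.1%}")

best_model, best_features = ranked[0][0]
```

On these numbers, Logistic Regression with TF-IDF ranks first, and TF-IDF scores higher than BOW for each of the three models.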