# Training on IMDB dataset for Sentiment Analysis - Traditional ML models 

## Loading the dataset

The IMDB dataset contains **50K movie reviews for natural language processing** and text analytics.
It is a dataset for **binary sentiment classification**, with **25K reviews in each class**.

It contains two columns:
* **review** (text)
* **sentiment** (positive or negative)

In [41]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/preprocesseddata/pre_train.txt
/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv


In [42]:
import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 

In [43]:
import re

In [44]:
data = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

In [45]:
data.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [46]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [47]:
data.shape

(50000, 2)

In [48]:
data_X = data['review']
data_Y = data['sentiment']

## Data Preprocessing

The `process_text` function standardizes and cleans text data for natural language processing tasks. It performs the following steps:

- **Contraction Expansion:** Expands common English contractions (e.g., "isn't" to "is not").
- **Special Character Removal:** Cleans special characters, emojis, and undesired unicode symbols.
- **Word Separation:** Splits up concatenated words and normalizes known compound tokens.
- **Lowercasing:** Converts all text to lowercase for consistency.
- **Noise Removal:** Removes hyperlinks, specific prefixes (like 'rt@username'), and HTML breaks.
- **Punctuation Removal:** Eliminates punctuation and special symbols.
- **Stopword Removal:** Removes English stopwords (excluding "not") and words shorter than 3 characters.
- **Stemming:** Reduces words to their root form using the Porter Stemmer.

The output is a cleaned, lowercased, and stemmed string suitable for further NLP tasks such as tokenization or feature extraction.


In [49]:
def process_text(line):
    
    # expand contractions
    line=re.sub("isn't",'is not',line)
    line=re.sub("he's",'he is',line)
    line=re.sub("wasn't",'was not',line)
    line=re.sub("there's",'there is',line)
    line=re.sub("couldn't",'could not',line)
    line=re.sub("won't",'will not',line)
    line=re.sub("they're",'they are',line)
    line=re.sub("she's",'she is',line)
    line=re.sub("There's",'there is',line)
    line=re.sub("wouldn't",'would not',line)
    line=re.sub("haven't",'have not',line)
    line=re.sub("That's",'That is',line)
    line=re.sub("you've",'you have',line)
    line=re.sub("He's",'He is',line)
    line=re.sub("what's",'what is',line)
    line=re.sub("weren't",'were not',line)
    line=re.sub("we're",'we are',line)
    line=re.sub("hasn't",'has not',line)
    line=re.sub("you'd",'you would',line)
    line=re.sub("shouldn't",'should not',line)
    line=re.sub("let's",'let us',line)
    line=re.sub("they've",'they have',line)
    line=re.sub("You'll",'You will',line)
    line=re.sub("i'm",'i am',line)
    line=re.sub("we've",'we have',line)
    line=re.sub("it's",'it is',line)
    line=re.sub("don't",'do not',line)
    line=re.sub("that´s",'that is',line)
    line=re.sub("I´m",'I am',line)
    line=re.sub("it’s",'it is',line)
    line=re.sub("she´s",'she is',line)
    line=re.sub("he’s",'he is',line)
    line=re.sub('I’m','I am',line)
    line=re.sub('I’d','I would',line)
    line=re.sub('there’s','there is',line)
    line = re.sub(r'\'ll', ' will',line)
    
    
    #special characters and emojis
    line=re.sub('\x91the','the',line)
    line=re.sub('\x97','',line)
    line=re.sub('\x84the','the',line)
    line=re.sub('\uf0b7','',line)
    line=re.sub('¡¨','',line)
    line=re.sub('\x95','',line)
    line=re.sub('\x8ei\x9eek','',line)
    line=re.sub('\xad','',line)
    line=re.sub('\x84bubble','bubble',line)
    
    # split concatenated words
    line=re.sub('trivialBoring','trivial Boring',line)
    line=re.sub('Justforkix','Just for kix',line)
    line=re.sub('Nightbeast','Night beast',line)
    line=re.sub('DEATHTRAP','Death Trap',line)
    line=re.sub('CitizenX','Citizen X',line)
    line=re.sub('10Rated','10 Rated',line)
    line=re.sub('_The','_ The',line)
    line=re.sub('1Sound','1 Sound',line)
    line=re.sub('blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah','blah blah',line)
    line=re.sub('ResidentHazard','Resident Hazard',line)
    line=re.sub('iameracing','i am racing',line)
    line=re.sub('BLACKSNAKE','Black Snake',line)
    line=re.sub('DEATHSTALKER','Death Stalker',line)
    line=re.sub('_is_','is',line)
    line=re.sub('10Fans','10 Fans',line)
    line=re.sub('Yellowcoat','Yellow coat',line)
    line=re.sub('Spiderbabe','Spider babe',line)
    line=re.sub('Frightworld','Fright world',line)
    
    line = line.lower()
    line = re.sub(r'^nrt@[a-z]+[\s]*', '', line)
    line = re.sub(r'^rt@[a-z]+[\s]*', '', line)
    line = re.sub(r'<br />', ' ', line)
    
    # remove hyperlinks (only the URL itself, not the rest of the review)
    line = re.sub(r'https?://\S+', '', line)
    
    # remove punctuation
    line = re.sub(r'[.,!"#$%&\'()*+-]', ' ', line)
    line = re.sub(r'[/:;<=>?@\[\]^_`{|}~\\]', ' ', line)
    # optionally remove numbers
    #line = re.sub(r'[0-9+]+','',line)
    line = re.sub(r"\s+", r" ", line).strip()
    # stopword removal (keeping "not", which carries sentiment) and stemming
    stopwords_english = set(stopwords.words('english')) - {'not'}
    stemmer = PorterStemmer()
    words = []
    
    for word in line.split():
        # keep words longer than 2 characters that are not stopwords
        if word not in stopwords_english and len(word)>2:
            words.append(stemmer.stem(word))
    
    return " ".join(words)

In [50]:
process_text("hi.. I am here for a walk. I !won't stop by")

'walk not stop'
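
The long run of `re.sub` calls for contractions can also be written as a lookup table. The sketch below is a hypothetical alternative, not the notebook's code: `CONTRACTIONS`, `APOSTROPHES`, and `expand_contractions` are illustrative names, and the table holds only a handful of the contractions handled above. Unlike `process_text`, it lowercases before matching, so one entry covers both capitalizations.

```python
import re

# Hypothetical lookup table; the notebook hard-codes a longer list of substitutions.
CONTRACTIONS = {
    "isn't": "is not",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

# Map the typographic apostrophe variants seen in the reviews to a plain apostrophe.
APOSTROPHES = str.maketrans({"\u2019": "'", "\u00b4": "'"})

# One compiled alternation with word boundaries, so "it's" inside "bit's" is left alone.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def expand_contractions(text: str) -> str:
    """Lowercase the text, normalize apostrophes, and expand known contractions."""
    text = text.lower().translate(APOSTROPHES)
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(0)], text)


print(expand_contractions("It’s fine, but I won't stop and I´m sure it isn't over."))
# it is fine, but i will not stop and i am sure it is not over.
```

Adding a new contraction then means adding one dictionary entry rather than another `re.sub` line.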

# apply the cleaning function to every review
train = [process_text(t) for t in data_X]

In [51]:
import csv
import os

# define path to the new file
outdirname = '/kaggle/working'
datafile = os.path.join(outdirname, "train.txt")

# write one preprocessed review per row
print("\nwriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8', newline='') as outputfile:
    writer = csv.writer(outputfile, delimiter=',')
    for text in train:
        writer.writerow([text])
print("Done writing to file...")

# print the first two lines as raw bytes to check the output
with open(datafile, 'rb') as file:
    lines = file.readlines()
    for line in lines[:2]:
        print(line)

***READING DATA FROM ALREADY PREPROCESSED FILE...***

In [52]:
pre_data_path = '/kaggle/input/preprocesseddata/pre_train.txt'

In [53]:
# read the file and split it into lines
print("Reading and processing file... please wait.")
with open(pre_data_path, encoding = 'utf-8') as f:
    lines = f.read().strip().split('\n')
train = lines

Reading and processing file... please wait.


In [54]:
len(train[0])

998

### Splitting the data into training and validation sets

In [55]:
from sklearn.model_selection import train_test_split
train_X, valid_X, train_Y, valid_Y = train_test_split(train, data_Y, test_size = 0.2,random_state=0,stratify = data_Y)
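
To illustrate what `stratify = data_Y` does in the call above, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; the helper `stratified_split` and the toy labels are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, valid_idx) with every class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_valid = round(len(indices) * test_size)  # same fraction for each class
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return train_idx, valid_idx


# Toy labels with the same 50/50 balance as the IMDB data.
labels = ["positive", "negative"] * 50
train_idx, valid_idx = stratified_split(labels)
print(Counter(labels[i] for i in valid_idx))  # 10 reviews of each class
```

Because each class is split separately, the validation set keeps the 50/50 balance of the full data, which is why every confusion matrix in this notebook has rows that sum to 5000.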

### Text Vectorization with CountVectorizer

The code below demonstrates how to convert preprocessed text data into numerical feature vectors using scikit-learn’s `CountVectorizer`:

- **Vectorizer Initialization:** A `CountVectorizer` is created to convert a corpus of text documents into a matrix of token counts (Bag-of-Words model).
- **Fitting and Transforming:** The vectorizer learns the vocabulary from the training data (`fit_transform`) and encodes both training and validation sets into sparse matrices.
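
The two bullets above can be made concrete with a small pure-Python sketch of the Bag-of-Words idea. This is an illustration under simplifying assumptions, not how `CountVectorizer` is implemented: it returns dense lists rather than sparse matrices, and `tokenize`, `fit_vocabulary`, and `transform` are hypothetical helper names. The token pattern mirrors the vectorizer's default of keeping tokens with two or more word characters.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in CountVectorizer's default pattern.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    return TOKEN_RE.findall(doc.lower())


def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: col for col, term in enumerate(terms)}


def transform(docs, vocabulary):
    """Return one row of counts per document; tokens outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows


train_docs = ["great film great cast", "bad film"]
valid_docs = ["great plot bad pace"]

vocab = fit_vocabulary(train_docs)
print(vocab)                         # {'bad': 0, 'cast': 1, 'film': 2, 'great': 3}
print(transform(train_docs, vocab))  # [[0, 1, 1, 2], [1, 0, 1, 0]]
print(transform(valid_docs, vocab))  # [[1, 0, 0, 1]]
```

The validation document contains "plot" and "pace", which never appeared in training, so they contribute nothing. This is why the vocabulary is fitted on the training split only and the validation split is merely transformed.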

In [56]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(decode_error = 'replace')
X_train = vectorizer.fit_transform(train_X)
X_valid = vectorizer.transform(valid_X)

y_train = train_Y
y_valid = valid_Y

print("X_train.shape : ", X_train.shape)
print("X_valid.shape : ", X_valid.shape)
print("y_train.shape : ", y_train.shape)
print("y_valid.shape : ", y_valid.shape)

X_train.shape :  (40000, 64332)
X_valid.shape :  (10000, 64332)
y_train.shape :  (40000,)
y_valid.shape :  (10000,)


In [57]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import classification_report

# Naive Bayes Classifier (Binary Classification)

In [58]:
from sklearn.naive_bayes import MultinomialNB

naiveByes_clf = MultinomialNB(alpha = 0.0000000001)

naiveByes_clf.fit(X_train,y_train)

NB_prediction = naiveByes_clf.predict(X_valid)
NB_accuracy = accuracy_score(y_valid,NB_prediction)
print("Training accuracy Score    : ",naiveByes_clf.score(X_train,y_train))
print("Validation accuracy Score : ",NB_accuracy )
# classification_report expects (y_true, y_pred)
print(classification_report(y_valid,NB_prediction))

Training accuracy Score    :  0.936375
Validation accuracy Score :  0.8139
              precision    recall  f1-score   support

    negative       0.81      0.82      0.82      5000
    positive       0.82      0.81      0.81      5000

    accuracy                           0.81     10000
   macro avg       0.81      0.81      0.81     10000
weighted avg       0.81      0.81      0.81     10000



In [59]:
print(y_valid)
print(NB_prediction)

24125    negative
25498    positive
22286    positive
26283    positive
9209     positive
           ...   
34152    negative
45174    negative
36911    negative
36332    negative
4844     negative
Name: sentiment, Length: 10000, dtype: object
['negative' 'positive' 'positive' ... 'negative' 'negative' 'negative']


### Classification performance metrics

In [60]:
accuracy = accuracy_score(y_valid,NB_prediction)
print('Accuracy: %f' % accuracy)
# precision tp / (tp + fp)
precision = precision_score(y_valid,NB_prediction,pos_label='positive')
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_valid,NB_prediction,pos_label='positive')
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_valid,NB_prediction,pos_label='positive')
print('F1 score: %f' % f1)
# Matthews Correlation Coefficient
mcc =  matthews_corrcoef(y_valid,NB_prediction)
print('MCC : %f' % mcc)

Accuracy: 0.813900
Precision: 0.818034
Recall: 0.807400
F1 score: 0.812682
MCC : 0.627853


In [61]:
kappa = cohen_kappa_score(y_valid,NB_prediction)
print('Cohens kappa: %f' % kappa)
# confusion matrix: rows are true labels, columns are predicted labels (order: negative, positive)

print("confusion matrix:")
matrix = confusion_matrix(y_valid,NB_prediction)
print(matrix)

Cohens kappa: 0.627800
confusion matrix:
[[4102  898]
 [ 963 4037]]


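*Interpreting Results:*  
- **Accuracy** is the fraction of reviews classified correctly; it is a fair summary here because the two classes are balanced.  
- High **Precision** means few false positives: most reviews predicted positive really are positive.  
- High **Recall** means few false negatives: most truly positive reviews are found.  
- High **F1 Score** means precision and recall are both high, since it is their harmonic mean.  
- High **MCC** indicates predictions are strongly correlated with the true labels; it stays informative even with imbalanced data.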

- **Cohen’s Kappa**
  Measures the agreement between predicted and actual class labels, adjusting for chance agreement.
  - *Interpretation:* 
    - 1: Perfect agreement
    - 0: No better than random chance
    - Negative: Worse than random
  - Higher values mean better model performance beyond random guessing.
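
All of the metrics described above can be derived from the four cells of a binary confusion matrix. The sketch below is a hand-rolled illustration rather than the scikit-learn implementation; `binary_metrics` is a hypothetical helper, and the counts are taken from the Naive Bayes confusion matrix printed earlier, with "positive" treated as the positive class.

```python
from math import sqrt


def binary_metrics(tn, fp, fn, tp):
    """Derive summary metrics from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for the agreement expected by chance.
    p_chance = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1 score": f1,
        "MCC": mcc,
        "Cohens kappa": kappa,
    }


# Naive Bayes confusion matrix: rows are true labels (negative, positive),
# columns are predicted labels, i.e. [[TN, FP], [FN, TP]].
nb = binary_metrics(tn=4102, fp=898, fn=963, tp=4037)
for name, value in nb.items():
    print(f"{name}: {value:.6f}")
```

The printed values agree with the scikit-learn results reported above for Naive Bayes, for example Precision 0.818034, MCC 0.627853, and Cohen's kappa 0.627800.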

# Support Vector Machine

In [None]:
from sklearn.svm import SVC

svc = SVC(kernel = 'rbf', C=1,gamma=0.01)

svc.fit(X_train, y_train)

svc_prediction = svc.predict(X_valid)
svc_accuracy = accuracy_score(y_valid,svc_prediction)

In [63]:
accuracy = accuracy_score(y_valid,svc_prediction)
print('Accuracy: %f' % accuracy)
# precision tp / (tp + fp)
precision = precision_score(y_valid,svc_prediction,pos_label='positive')
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_valid,svc_prediction,pos_label='positive')
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_valid,svc_prediction,pos_label='positive')
print('F1 score: %f' % f1)
# Matthews Correlation Coefficient
mcc =  matthews_corrcoef(y_valid,svc_prediction)
print('MCC : %f' % mcc)

Accuracy: 0.859700
Precision: 0.824463
Recall: 0.914000
F1 score: 0.866926
MCC : 0.723680


In [64]:
kappa = cohen_kappa_score(y_valid,svc_prediction)
print('Cohens kappa: %f' % kappa)
# confusion matrix: rows are true labels, columns are predicted labels (order: negative, positive)

print("confusion matrix:")
matrix = confusion_matrix(y_valid,svc_prediction)
print(matrix)

Cohens kappa: 0.719400
confusion matrix:
[[4027  973]
 [ 430 4570]]


# Random Forest Classifier

In [65]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=20)

rf_clf.fit(X_train,y_train)

rf_prediction = rf_clf.predict(X_valid)
rf_accuracy = accuracy_score(y_valid,rf_prediction)
print("Training accuracy Score    : ",rf_clf.score(X_train,y_train))
print("Validation accuracy Score : ",rf_accuracy )
# classification_report expects (y_true, y_pred)
print(classification_report(y_valid,rf_prediction))

Training accuracy Score    :  0.99955
Validation accuracy Score :  0.8012
              precision    recall  f1-score   support

    negative       0.79      0.83      0.81      5000
    positive       0.82      0.77      0.80      5000

    accuracy                           0.80     10000
   macro avg       0.80      0.80      0.80     10000
weighted avg       0.80      0.80      0.80     10000



In [66]:
accuracy = accuracy_score(y_valid,rf_prediction)
print('Accuracy: %f' % accuracy)
# precision tp / (tp + fp)
precision = precision_score(y_valid,rf_prediction,pos_label='positive')
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_valid,rf_prediction,pos_label='positive')
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_valid,rf_prediction,pos_label='positive')
print('F1 score: %f' % f1)
# Matthews Correlation Coefficient
mcc =  matthews_corrcoef(y_valid,rf_prediction)
print('MCC : %f' % mcc)

Accuracy: 0.801200
Precision: 0.819338
Recall: 0.772800
F1 score: 0.795389
MCC : 0.603374


In [67]:
kappa = cohen_kappa_score(y_valid,rf_prediction)
print('Cohens kappa: %f' % kappa)
print("confusion matrix:")
matrix = confusion_matrix(y_valid,rf_prediction)
print(matrix)

Cohens kappa: 0.602400
confusion matrix:
[[4148  852]
 [1136 3864]]


# Logistic Regression

In [68]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=0.00005)

logreg.fit(X_train, y_train)

logreg_prediction = logreg.predict(X_valid)
logreg_accuracy = accuracy_score(y_valid,logreg_prediction)
print("Training accuracy Score    : ",logreg.score(X_train,y_train))
print("Validation accuracy Score : ",logreg_accuracy )
# classification_report expects (y_true, y_pred)
print(classification_report(y_valid,logreg_prediction))

Training accuracy Score    :  0.7936
Validation accuracy Score :  0.7925
              precision    recall  f1-score   support

    negative       0.81      0.77      0.79      5000
    positive       0.78      0.82      0.80      5000

    accuracy                           0.79     10000
   macro avg       0.79      0.79      0.79     10000
weighted avg       0.79      0.79      0.79     10000



In [69]:
accuracy = accuracy_score(y_valid,logreg_prediction)
print('Accuracy: %f' % accuracy)
# precision tp / (tp + fp)
precision = precision_score(y_valid,logreg_prediction,pos_label='positive')
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_valid,logreg_prediction,pos_label='positive')
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_valid,logreg_prediction,pos_label='positive')
print('F1 score: %f' % f1)
# Matthews Correlation Coefficient
mcc =  matthews_corrcoef(y_valid,logreg_prediction)
print('MCC : %f' % mcc)

Accuracy: 0.792500
Precision: 0.777778
Recall: 0.819000
F1 score: 0.797857
MCC : 0.585823


In [70]:
kappa = cohen_kappa_score(y_valid,logreg_prediction)
print('Cohens kappa: %f' % kappa)
print("confusion matrix:")
matrix = confusion_matrix(y_valid,logreg_prediction)
print(matrix)

Cohens kappa: 0.585000
confusion matrix:
[[3830 1170]
 [ 905 4095]]


# Stochastic Gradient Descent (SGD) Classifier

In [71]:
from sklearn.linear_model import SGDClassifier

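# loss='log' gives logistic regression; scikit-learn 1.1 renamed it to 'log_loss' and 1.3 removed the old name
sgd_clf = SGDClassifier(loss = 'log', penalty = 'l2', random_state=0,alpha=0.7)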

sgd_clf.fit(X_train,y_train)

sgd_prediction = sgd_clf.predict(X_valid)
sgd_accuracy = accuracy_score(y_valid,sgd_prediction)
print("Training accuracy Score    : ",sgd_clf.score(X_train,y_train))
print("Validation accuracy Score : ",sgd_accuracy )
# classification_report expects (y_true, y_pred)
print(classification_report(y_valid,sgd_prediction))

Training accuracy Score    :  0.788325
Validation accuracy Score :  0.7817
              precision    recall  f1-score   support

    negative       0.77      0.81      0.79      5000
    positive       0.80      0.76      0.78      5000

    accuracy                           0.78     10000
   macro avg       0.78      0.78      0.78     10000
weighted avg       0.78      0.78      0.78     10000



In [72]:
accuracy = accuracy_score(y_valid,sgd_prediction)
print('Accuracy: %f' % accuracy)
# precision tp / (tp + fp)
precision = precision_score(y_valid,sgd_prediction,pos_label='positive')
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_valid,sgd_prediction,pos_label='positive')
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_valid,sgd_prediction,pos_label='positive')
print('F1 score: %f' % f1)
# Matthews Correlation Coefficient
mcc =  matthews_corrcoef(y_valid,sgd_prediction)
print('MCC : %f' % mcc)

Accuracy: 0.781700
Precision: 0.795966
Recall: 0.757600
F1 score: 0.776309
MCC : 0.564056


In [73]:
kappa = cohen_kappa_score(y_valid,sgd_prediction)
print('Cohens kappa: %f' % kappa)
print("confusion matrix:")
matrix = confusion_matrix(y_valid,sgd_prediction)
print(matrix)

Cohens kappa: 0.563400
confusion matrix:
[[4029  971]
 [1212 3788]]


# XGBoost (Binary Classification)

In [74]:
import xgboost as xgb

xgboost_clf = xgb.XGBClassifier(eta=0.03)

xgboost_clf.fit(X_train, y_train)

xgb_prediction = xgboost_clf.predict(X_valid)
xgb_accuracy = accuracy_score(y_valid,xgb_prediction)
print("Training accuracy Score    : ",xgboost_clf.score(X_train,y_train))
print("Validation accuracy Score : ",xgb_accuracy )
# classification_report expects (y_true, y_pred)
print(classification_report(y_valid,xgb_prediction))

Training accuracy Score    :  0.8041
Validation accuracy Score :  0.7872
              precision    recall  f1-score   support

    negative       0.83      0.72      0.77      5000
    positive       0.75      0.86      0.80      5000

    accuracy                           0.79     10000
   macro avg       0.79      0.79      0.79     10000
weighted avg       0.79      0.79      0.79     10000



In [75]:
accuracy = accuracy_score(y_valid,xgb_prediction)
print('Accuracy: %f' % accuracy)
# precision tp / (tp + fp)
precision = precision_score(y_valid,xgb_prediction,pos_label='positive')
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_valid,xgb_prediction,pos_label='positive')
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_valid,xgb_prediction,pos_label='positive')
print('F1 score: %f' % f1)
# Matthews Correlation Coefficient
mcc =  matthews_corrcoef(y_valid,xgb_prediction)
print('MCC : %f' % mcc)

Accuracy: 0.787200
Precision: 0.751930
Recall: 0.857200
F1 score: 0.801121
MCC : 0.580113


In [76]:
kappa = cohen_kappa_score(y_valid,xgb_prediction)
print('Cohens kappa: %f' % kappa)
print("confusion matrix:")
matrix = confusion_matrix(y_valid,xgb_prediction)
print(matrix)

Cohens kappa: 0.574400
confusion matrix:
[[3586 1414]
 [ 714 4286]]


# KNN Classifier

In [77]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)

knn_prediction = neigh.predict(X_valid)
knn_accuracy = accuracy_score(y_valid,knn_prediction)
print("Training accuracy Score    : ",neigh.score(X_train,y_train))
print("Validation accuracy Score : ",knn_accuracy )
# classification_report expects (y_true, y_pred)
print(classification_report(y_valid,knn_prediction))

Training accuracy Score    :  0.7917
Validation accuracy Score :  0.6488
              precision    recall  f1-score   support

    negative       0.70      0.53      0.60      5000
    positive       0.62      0.77      0.69      5000

    accuracy                           0.65     10000
   macro avg       0.66      0.65      0.64     10000
weighted avg       0.66      0.65      0.64     10000



In [78]:
accuracy = accuracy_score(y_valid,knn_prediction)
print('Accuracy: %f' % accuracy)
# precision tp / (tp + fp)
precision = precision_score(y_valid,knn_prediction,pos_label='positive')
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_valid,knn_prediction,pos_label='positive')
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_valid,knn_prediction,pos_label='positive')
print('F1 score: %f' % f1)
# Matthews Correlation Coefficient
mcc =  matthews_corrcoef(y_valid,knn_prediction)
print('MCC : %f' % mcc)

Accuracy: 0.648800
Precision: 0.619807
Recall: 0.769800
F1 score: 0.686708
MCC : 0.306717


In [79]:
kappa = cohen_kappa_score(y_valid,knn_prediction)
print('Cohens kappa: %f' % kappa)
print("confusion matrix:")
matrix = confusion_matrix(y_valid,knn_prediction)
print(matrix)

Cohens kappa: 0.297600
confusion matrix:
[[2639 2361]
 [1151 3849]]


# Final Results

In [80]:
models = pd.DataFrame({
    'Model': [ 'Logistic Regression', 
              'Random Forest', 'Naive Bayes','KNN', 
              'Stochastic Gradient Descent', 'XGBoost'],
    'Validation accuracy': [ logreg_accuracy, 
              rf_accuracy, NB_accuracy,knn_accuracy, 
              sgd_accuracy, xgb_accuracy]})

models.sort_values(by='Validation accuracy', ascending=False)

                         Model  Validation accuracy
2                  Naive Bayes               0.8139
1                Random Forest               0.8012
0          Logistic Regression               0.7925
5                      XGBoost               0.7872
4  Stochastic Gradient Descent               0.7817
3                          KNN               0.6488
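
The table above does not include the support vector machine. As a cross-check, the sketch below recomputes validation accuracy for all seven models directly from the confusion matrices printed in the earlier sections. The dictionary `confusion` and the helper `accuracy` are illustrative names introduced here; the counts are copied from the notebook's outputs.

```python
# Confusion matrices copied from the outputs above, laid out as [[TN, FP], [FN, TP]].
confusion = {
    "Naive Bayes": [[4102, 898], [963, 4037]],
    "Support Vector Machine": [[4027, 973], [430, 4570]],
    "Random Forest": [[4148, 852], [1136, 3864]],
    "Logistic Regression": [[3830, 1170], [905, 4095]],
    "Stochastic Gradient Descent": [[4029, 971], [1212, 3788]],
    "XGBoost": [[3586, 1414], [714, 4286]],
    "KNN": [[2639, 2361], [1151, 3849]],
}


def accuracy(matrix):
    """Fraction of correct predictions: the diagonal over the total."""
    (tn, fp), (fn, tp) = matrix
    return (tn + tp) / (tn + fp + fn + tp)


# Rank the models from highest to lowest validation accuracy.
ranking = sorted(confusion, key=lambda name: accuracy(confusion[name]), reverse=True)
for name in ranking:
    print(f"{name:<30}{accuracy(confusion[name]):.4f}")
```

On these numbers the support vector machine ranks first at 0.8597, ahead of Naive Bayes at 0.8139, while the order of the remaining models matches the table.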
