# **The ninth in-class-exercise (20 points in total, 4/16/2021)**

The purpose of the exercise is to practice different machine learning algorithms for text classification as well as the performance evaluation. In addition, you are requried to conduct *10 fold cross validation (https://scikit-learn.org/stable/modules/cross_validation.html)* in the training. 

The dataset can be download from here: https://github.com/unt-iialab/info5731_spring2021/blob/main/class_exercises/exercise09_datacollection.zip. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data. 

Algorithms:

(1) MultinominalNB

(2) SVM 

(3) KNN 

(4) Decision tree

(5) Random Forest

(6) XGBoost

Evaluation measurement:

(1) Accuracy

(2) Recall

(3) Precison 

(4) F-1 score

In [1]:
# Write your code here

#importing libraries needed

import pandas as pd 
import re 
import nltk 
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer 
from sklearn.feature_extraction.text import CountVectorizer
import sklearn
from sklearn.model_selection import train_test_split
import numpy
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier


In [3]:
#Loading Train and Test data

my_data_train=pd.read_fwf("/content/stsa-train.txt", header=None)
my_data_train= pd.DataFrame(my_data_train)
my_data_test=pd.read_fwf("/content/stsa-test.txt", header=None)
my_data_test= pd.DataFrame(my_data_test)

In [4]:
#Now splitting my_data_train into training and validation data

del my_data_train[2]
my_data_train = my_data_train.rename(columns={0: "Review", 1: "Text"})
del my_data_test[2]
del my_data_test[3]
my_data_test = my_data_test.rename(columns={0: "Review", 1: "Text"})
x_train, x_validate, y_train, y_validate = sklearn.model_selection.train_test_split(my_data_train['Text'], my_data_train['Review'], train_size=0.8, test_size=0.2)
x_train = x_train.to_numpy()
y_train = y_train.to_numpy()

In [5]:
# Defining K-fold

my_kf = KFold(n_splits=10)

In [6]:
#MultinominalNB
pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    MNB_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = MNB_algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])

In [7]:
#Evaluation

my_final = MNB_algorithm.predict(my_data_test['Text'])
print('Accuracy of MultinomialNB :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('Recall of MultinomialNB :', (recall_score(my_data_test['Review'], my_final)*100))
print('Precision of MultinomialNB :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('F1-score of MultinomialNB :', (f1_score(my_data_test['Review'], my_final, average='macro')*100))

Accuracy of MultinomialNB : 80.83470620538165
Recall of MultinomialNB : 87.8987898789879
Precision of MultinomialNB : 80.83470620538165
F1-score of MultinomialNB : 80.742508329129


In [8]:
#SVM
pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC())])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    SVM_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = SVM_algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])

In [9]:
#Evaluation
my_final = SVM_algorithm.predict(my_data_test['Text'])
print(' Accuracy of SVM :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('Recall of SVM :', (recall_score(my_data_test['Review'], my_final)*100))
print(' Precision of SVM :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('F1-score of SVM :', (f1_score(my_data_test['Review'], my_final, average='macro')*100))

 Accuracy of SVM : 79.40691927512356
Recall of SVM : 81.95819581958196
 Precision of SVM : 79.40691927512356
F1-score of SVM : 79.39488941961706


In [10]:
#KNN
pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', KNeighborsClassifier())])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    KNN_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = KNN_algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])

In [11]:
#Evaluation of KNN
my_final = KNN_algorithm.predict(my_data_test['Text'])
print(' Accuracy of KNN :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('Recall of KNN :', (recall_score(my_data_test['Review'], my_final)*100))
print(' Precision of KNN :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('F1-score of KNN :', (f1_score(my_data_test['Review'], my_final, average='macro')*100))

 Accuracy of KNN : 73.47611202635915
Recall of KNN : 77.55775577557755
 Precision of KNN : 73.47611202635915
F1-score of KNN : 73.4345820432595


In [12]:
#Decision Tree
pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', tree.DecisionTreeClassifier())])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    DT_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = DT_algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])

In [13]:
#Evaluation
my_final = DT_algorithm.predict(my_data_test['Text'])
print(' Accuracy of Decision Tree :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('Recall of Decision Tree :', (recall_score(my_data_test['Review'], my_final)*100))
print(' Precision of Decision Tree :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('F1-score of Decision Tree :', (f1_score(my_data_test['Review'], my_final, average='macro')*100))

 Accuracy of Decision Tree : 60.73585941790225
Recall of Decision Tree : 65.8965896589659
 Precision of Decision Tree : 60.73585941790225
F1-score of Decision Tree : 60.63538354511475


In [14]:
#Random Forest:
pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier(n_estimators=100))])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    RF_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = RF_algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])

In [15]:
#Evaluation
my_final = RF_algorithm.predict(my_data_test['Text'])
print(' Accuracy of Random Forest :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('Recall of Random Forest :', (recall_score(my_data_test['Review'], my_final)*100))
print(' Precision of Random Forest :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('F1-score of Random Forest :', (f1_score(my_data_test['Review'], my_final, average='macro')*100))

 Accuracy of Random Forest : 72.32289950576606
Recall of Random Forest : 76.56765676567657
 Precision of Random Forest : 72.32289950576606
F1-score of Random Forest : 72.27587106877202


In [16]:
#XGBoost:
pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', GradientBoostingClassifier(n_estimators=20,verbose=2))])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    
    XGB_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = XGB_algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])

      Iter       Train Loss   Remaining Time 
         1           1.3759            0.59s
         2           1.3684            0.56s
         3           1.3621            0.52s
         4           1.3562            0.49s
         5           1.3509            0.45s
         6           1.3458            0.43s
         7           1.3410            0.40s
         8           1.3368            0.37s
         9           1.3324            0.34s
        10           1.3284            0.31s
        11           1.3244            0.28s
        12           1.3212            0.25s
        13           1.3178            0.21s
        14           1.3140            0.18s
        15           1.3106            0.16s
        16           1.3073            0.12s
        17           1.3047            0.09s
        18           1.3008            0.06s
        19           1.2976            0.03s
        20           1.2943            0.00s
      Iter       Train Loss   Remaining Time 
        

In [17]:
#Evaluation
my_final = XGB_algorithm.predict(my_data_test['Text'])
print(' Accuracy of XG Boost :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('Recall of XG Boost :', (recall_score(my_data_test['Review'], my_final)*100))
print(' Precision of XG Boost :', (accuracy_score(my_data_test['Review'], my_final)*100))
print('F1-score of XG Boost :', (f1_score(my_data_test['Review'], my_final, average='macro')*100))

 Accuracy of XG Boost : 59.30807248764415
Recall of XG Boost : 87.56875687568757
 Precision of XG Boost : 59.30807248764415
F1-score of XG Boost : 55.81511098769867
