<a href="https://colab.research.google.com/github/preethi9999/info5731_spring2021/blob/main/somaraju_inclassexercise9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The purpose of the exercise is to practice different machine learning algorithms for text classification as well as the performance evaluation. In addition, you are requried to conduct 10 fold cross validation (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.

The dataset can be download from here: https://github.com/unt-iialab/info5731_spring2021/blob/main/class_exercises/exercise09_datacollection.zip. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.

Algorithms:

(1) MultinominalNB

(2) SVM

(3) KNN

(4) Decision tree

(5) Random Forest

(6) XGBoost

Evaluation measurement:

(1) Accuracy

(2) Recall

(3) Precison

In [32]:
#Importing the required libraries

import pandas as pd 
import re 
import nltk 
import numpy
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer 
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree

In [33]:
#we are loading the train and test data

data_train=pd.read_fwf("/content/sample_data/stsa-train.txt", header=None)
data_train=pd.DataFrame(data_train)
data_test=pd.read_fwf("/content/sample_data/stsa-test.txt", header=None)
data_test= pd.DataFrame(data_test)

In [34]:
#splitting the train data into training and validation data

del data_train[2]
data_train = data_train.rename(columns={0: "Review", 1: "Text"})
del data_test[2]
del data_test[3]
data_test = data_test.rename(columns={0: "Review", 1: "Text"})
x_train, x_validate, y_train, y_validate = sklearn.model_selection.train_test_split(data_train['Text'], data_train['Review'], train_size=0.8, test_size=0.2)
x_train = x_train.to_numpy()
y_train = y_train.to_numpy()

In [35]:
#Defining Kfold = 10

my_kf = KFold(n_splits=10)

*Analysis of various algorithms*

In [36]:
#MultinominalNB

pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    MNB_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = MNB_algorithm.predict(x_validate)
validation = {'Actual value': y_validate, 'Predicted value': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual value', 'Predicted value'])

In [38]:
#Evaluating MultinominalNB and getting values

final = MNB_algorithm.predict(data_test['Text'])
print('Accuracy of MultinomialNB:', (accuracy_score(data_test['Review'], final)*100))
print('Precision of MultinomialNB:', (accuracy_score(data_test['Review'], final)*100))
print('Recall of MultinomialNB:', (recall_score(data_test['Review'], final)*100))
print('F1-score of MultinomialNB:', (f1_score(data_test['Review'], final, average='macro')*100))

Accuracy of MultinomialNB: 81.10928061504667
Precision of MultinomialNB: 81.10928061504667
Recall of MultinomialNB: 88.66886688668868
F1-score of MultinomialNB: 81.00488323181159


In [43]:
#SVM:

pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC())])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    SVM_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = SVM_algorithm.predict(x_validate)
validation = {'Actual value': y_validate, 'Predicted value': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual value', 'Predicted value'])

In [44]:
#Evaluating SVM and getting values

final = SVM_algorithm.predict(data_test['Text'])
print('Accuracy of SVM :', (accuracy_score(data_test['Review'], final)*100))
print('Precision of SVM :', (accuracy_score(data_test['Review'], final)*100))
print('Recall of SVM :', (recall_score(data_test['Review'], final)*100))
print('F1-score of SVM :', (f1_score(data_test['Review'], final, average='macro')*100))

Accuracy of SVM : 80.23064250411862
Precision of SVM : 80.23064250411862
Recall of SVM : 80.52805280528052
F1-score of SVM : 80.23058884835852


In [45]:
#Decision Tree

pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', tree.DecisionTreeClassifier())])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    DT_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = DT_algorithm.predict(x_validate)
validation = {'Actual value': y_validate, 'Predicted value': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual value', 'Predicted value'])

In [46]:
#Evaluating Decision tree and getting values

final = DT_algorithm.predict(data_test['Text'])
print('Accuracy of Decision Tree :', (accuracy_score(data_test['Review'], final)*100))
print('Precision of Decision Tree :', (accuracy_score(data_test['Review'], final)*100))
print('Recall of Decision Tree :', (recall_score(data_test['Review'], final)*100))
print('F1-score of Decision Tree :', (f1_score(data_test['Review'], final, average='macro')*100))

Accuracy of Decision Tree : 61.44975288303131
Precision of Decision Tree : 61.44975288303131
Recall of Decision Tree : 67.32673267326733
F1-score of Decision Tree : 61.32115548003399


In [47]:
#KNN

pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', KNeighborsClassifier())])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    KNN_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = KNN_algorithm.predict(x_validate)
validation = {'Actual value': y_validate, 'Predicted value': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual value', 'Predicted value'])

In [49]:
#Evaluating KNN and getting values

my_final = KNN_algorithm.predict(data_test['Text'])
print('Accuracy of KNN :', (accuracy_score(data_test['Review'], final)*100))
print('Precision of KNN :', (accuracy_score(data_test['Review'], final)*100))
print('Recall of KNN :', (recall_score(data_test['Review'], final)*100))
print('F1-score of KNN :', (f1_score(data_test['Review'], final, average='macro')*100))

Accuracy of KNN : 61.44975288303131
Precision of KNN : 61.44975288303131
Recall of KNN : 67.32673267326733
F1-score of KNN : 61.32115548003399


In [51]:
#Random Forest

pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier(n_estimators=100))])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    RF_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = RF_algorithm.predict(x_validate)
validation = {'Actual value': y_validate, 'Predicted value': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual value', 'Predicted value'])

In [53]:
#Evaluating Random Forest and getting values

final = RF_algorithm.predict(data_test['Text'])
print('Accuracy of Random Forest :', (accuracy_score(data_test['Review'], final)*100))
print('Precision of Random Forest :', (accuracy_score(data_test['Review'], final)*100))
print('Recall of Random Forest :', (recall_score(data_test['Review'], final)*100))
print('F1-score of Random Forest :', (f1_score(data_test['Review'], final, average='macro')*100))

Accuracy of Random Forest : 71.55409115870401
Precision of Random Forest : 71.55409115870401
Recall of Random Forest : 76.89768976897689
F1-score of Random Forest : 71.47646095452819


In [54]:
#XGBoost

pln = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', GradientBoostingClassifier(n_estimators=20,verbose=2))])
for train_index, test_index in my_kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    
    XGB_algorithm = pln.fit(x_train_k, y_train_k)
pred_validate = XGB_algorithm.predict(x_validate)
validation = {'Actual value': y_validate, 'Predicted value': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual value', 'Predicted value'])

      Iter       Train Loss   Remaining Time 
         1           1.3763            0.61s
         2           1.3695            0.58s
         3           1.3634            0.53s
         4           1.3583            0.50s
         5           1.3536            0.46s
         6           1.3490            0.43s
         7           1.3450            0.41s
         8           1.3409            0.39s
         9           1.3369            0.35s
        10           1.3330            0.32s
        11           1.3293            0.30s
        12           1.3255            0.26s
        13           1.3224            0.23s
        14           1.3194            0.20s
        15           1.3161            0.16s
        16           1.3127            0.13s
        17           1.3091            0.10s
        18           1.3059            0.07s
        19           1.3028            0.03s
        20           1.3000            0.00s
      Iter       Train Loss   Remaining Time 
        

In [55]:
#Evaluating XG Boost and getting values

final = XGB_algorithm.predict(data_test['Text'])
print('Accuracy of XG Boost :',(accuracy_score(data_test['Review'],final)*100))
print('Precision of XG Boost :',(accuracy_score(data_test['Review'], final)*100))
print('Recall of XG Boost :',(recall_score(data_test['Review'], final)*100))
print('F1-score of XG Boost :',(f1_score(data_test['Review'], final, average='macro')*100))


Accuracy of XG Boost : 59.30807248764415
Precision of XG Boost : 59.30807248764415
Recall of XG Boost : 87.12871287128714
F1-score of XG Boost : 55.93239767800719
