## Introduction

In this project, we tackle the problem of text classification. Unlike the classification problems that we've seen in this class, the data used in text classification is non-numeric. Hence, one of the challenges of this project is to find ways to transform our text data into numeric data that classification models such as Logistic Regression can be trained and tested on. One such method is vectorizing the text into a "Bag of Words". However, as we will see throughout this notebook, simply using a Bag of Words is not optimal, and there are many Natural Language Processing (NLP) techniques that we can use to clean our data and improve the performance of our classification.

In [59]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

### Data

For this project, we used software reviews from Amazon. This dataset and other Amazon review datasets can be found at the following link: http://deepyeti.ucsd.edu/jianmo/amazon/index.html?fbclid=IwAR1oMrSAwo2dc48WhzGEzMUdPrpv6Mv0S_zGvQzcgDoyNW3_IvxHCmwD7FY

In [5]:
df = pd.read_json('Software.json',lines=True,)
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,4,True,"03 11, 2014",A240ORQ2LF9LUI,77613252,{'Format:': ' Loose Leaf'},Michelle W,The materials arrived early and were in excell...,Material Great,1394496000,,
1,4,True,"02 23, 2014",A1YCCU0YRLS0FE,77613252,{'Format:': ' Loose Leaf'},Rosalind White Ames,I am really enjoying this book with the worksh...,Health,1393113600,,
2,1,True,"02 17, 2014",A1BJHRQDYVAY2J,77613252,{'Format:': ' Loose Leaf'},Allan R. Baker,"IF YOU ARE TAKING THIS CLASS DON""T WASTE YOUR ...",ARE YOU KIDING ME?,1392595200,7.0,
3,3,True,"02 17, 2014",APRDVZ6QBIQXT,77613252,{'Format:': ' Loose Leaf'},Lucy,This book was missing pages!!! Important pages...,missing pages!!,1392595200,3.0,
4,5,False,"10 14, 2013",A2JZTTBSLS1QXV,77775473,,Albert V.,I have used LearnSmart and can officially say ...,Best study product out there!,1381708800,,


Many features, such as reviewerID, are not relevant to our classification. Hence, we filtered out all of the unverified reviews and dropped the irrelevant features.

In [6]:
df = df.drop(columns = ['reviewTime','reviewerID','asin','style','reviewerName','unixReviewTime','image','vote'])
df = df[df['verified'] == True]
df['overall'].value_counts()

5    156117
1     56416
4     50506
3     27002
2     19304
Name: overall, dtype: int64

One issue that we observed during our testing is that an imbalanced dataset (i.e. one with too many data points of one class and too few of another) significantly decreases the accuracy of our classifications. Since we also wanted to reduce the data size to shorten the runtime, we sampled a balanced subset in which each class has the same number of data points (6,000 reviews per class after the merge described below).

Additionally, looking at our confusion matrix during testing, we noticed that 5-star and 4-star reviews, and likewise 1-star and 2-star reviews, were too similar for our classification models to distinguish. Hence, we merged classes 5 and 4 into class 5, and classes 1 and 2 into class 1. Our dataset therefore has 3 classes: 5 = positive review, 3 = mixed review, and 1 = negative review.

In [7]:
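# Sample 6000 three-star reviews and 3000 from each other rating, so that
# the merged classes below end up with 6000 reviews each.
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it.)
samples = [df[df['overall'] == 3].sample(6000)]
for rating in (1, 2, 4, 5):
    samples.append(df[df['overall'] == rating].sample(3000))
data = pd.concat(samples, ignore_index=True)

# Merge 2-star into class 1 (negative) and 4-star into class 5 (positive).
data.loc[data['overall'] == 2, 'overall'] = 1
data.loc[data['overall'] == 4, 'overall'] = 5
data = data.dropna(subset=['reviewText', 'summary'])
data.head()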

Unnamed: 0,overall,verified,reviewText,summary
0,3,True,Thought this was the product that would solve ...,Not what I was looking for.
1,3,True,times out too often.,Three Stars
2,3,True,Read what is being sold in the fine print care...,Bait & switch seller.
3,3,True,I have used this software for years. This yea...,Like the software but the Amazon Download vers...
4,3,True,"TurboTax did the job of doing the taxes, but ...",TurboTax Can Be Improved


### Data Transformation

In [90]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#### Bag of Words

One of the main challenges of text classification is transforming the text data such that it can be trained and tested by a classification model. One simple method of transforming the data is vectorizing it into a "Bag of Words". A Bag of Words usually consists of two main components: a list of all of the words used in our dataset and a measure of the presence/relevance of each word relative to each data point/review.

In [52]:
X_train, X_test, y_train, y_test = train_test_split(data['reviewText'], data['overall'], test_size = 0.3, random_state=0, shuffle= True, stratify=data['overall'])

One of the simplest forms of a Bag of Words is a matrix that counts the occurrences of each word in each data point/review.

<img src="cbag.PNG">

In [53]:
clf = Pipeline([('count',CountVectorizer()), ('clf', LogisticRegression(max_iter=3000))])

In [54]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[1160,  447,  193],
       [ 448,  915,  437],
       [ 152,  371, 1277]], dtype=int64)

In [55]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.66      0.64      0.65      1800
           3       0.53      0.51      0.52      1800
           5       0.67      0.71      0.69      1800

    accuracy                           0.62      5400
   macro avg       0.62      0.62      0.62      5400
weighted avg       0.62      0.62      0.62      5400



As you can see, simply using a count-based Bag of Words with logistic regression yields an accuracy of 62%, well above the roughly 33% expected from random guessing on three balanced classes. Looking at our confusion matrix, we see that many of the misclassifications involve class 3, which is understandable since 3-star reviews are usually a mixed bag of positive and negative sentiments.
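As a sanity check, the figures in the classification report can be recomputed directly from the confusion matrix printed above. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above (rows: true class 1, 3, 5; columns: predicted).
cm = np.array([[1160, 447, 193],
               [448, 915, 437],
               [152, 371, 1277]])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all reviews truly in the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all reviews predicted as the class
accuracy = cm.diagonal().sum() / cm.sum()   # correct / all reviews

print(np.round(recall, 2))     # [0.64 0.51 0.71]
print(np.round(precision, 2))  # [0.66 0.53 0.67]
print(round(accuracy, 2))      # 0.62
```

Class 3 has the lowest recall and precision, which is what the off-diagonal entries in its row and column suggest.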

#### TF-IDF

Now let's look at how we can improve the performance of our classifications. Instead of using a Bag of Words that simply counts the occurrences of words, we use a Bag of Words with a better measure of the relevance of a word in a review: Term Frequency-Inverse Document Frequency (TF-IDF). The TF-IDF of any word is calculated from the following formula:

<img src= "tfidf.PNG">

Hence, our TF-IDF Bag of Words is just a matrix with the TF-IDF of each word in each review. 

<img src="tfidfbag.png">
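The idea can be sketched by hand on a toy corpus. The snippet below assumes the common textbook form, TF-IDF = (relative frequency of the word in the review) × log(N / number of reviews containing the word), which may differ in detail from the formula in the figure. It is also not exactly what `TfidfVectorizer` computes: by default scikit-learn uses raw counts for TF, a smoothed IDF, and L2-normalizes each row. The toy reviews and the helper `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Toy reviews, invented for illustration only.
toy_reviews = [
    "great software great price",
    "software crashed",
]
docs = [review.split() for review in toy_reviews]
n_docs = len(docs)

# Document frequency: number of reviews that contain each word.
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)          # relative frequency within the review
    idf = math.log(n_docs / doc_freq[word])  # rarer across reviews -> larger weight
    return tf * idf

great_score = tf_idf("great", docs[0])        # 0.5 * ln(2), about 0.35
software_score = tf_idf("software", docs[0])  # 0.0, since every review contains it

print(great_score, software_score)
```

A word that appears in every review, like "software" here, gets zero weight, while a word concentrated in one review is emphasized. A plain count cannot capture this.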

In [49]:
clf2 = Pipeline([('tfidf',TfidfVectorizer()), ('clf2', LogisticRegression(max_iter=3000))])

In [50]:
clf2.fit(X_train, y_train)
y_pred = clf2.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[1252,  400,  148],
       [ 450,  977,  373],
       [ 141,  386, 1273]], dtype=int64)

In [51]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.68      0.70      0.69      1800
           3       0.55      0.54      0.55      1800
           5       0.71      0.71      0.71      1800

    accuracy                           0.65      5400
   macro avg       0.65      0.65      0.65      5400
weighted avg       0.65      0.65      0.65      5400



Using a TF-IDF Bag of Words raised the overall accuracy from 62% to 65% and improved the F1-score of every class.

### Natural Language Processing

Another issue with classifying text data is that text data is messy. It is not uncommon for people to misspell words or put nonsense in their reviews. Natural Language Processing (NLP) tries to understand and analyze the contents of human languages. In our project, we noticed that the number of features/words in our Bag of Words was quite large. Hence, we utilized NLP techniques such as tokenization, lemmatization, stop-word removal, and case normalization, not only to reduce the number of features in our Bag of Words but also to create a cleaner data set. For example, before using NLP, 'run' and 'running' would be considered two different features even though their meanings are essentially interchangeable. Lemmatization solves this issue by transforming every word into its root form.

<img src="nlp.png">
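To illustrate how these steps shrink the vocabulary, here is a small self-contained sketch. It does not use spaCy: `TOY_LEMMAS`, `TOY_STOP_WORDS`, and `toy_clean` are hypothetical, hand-written stand-ins for a real lemmatizer and stop-word list, and the two reviews are invented.

```python
import string

# Hypothetical stand-ins for a real lemmatizer and stop-word list.
TOY_LEMMAS = {"running": "run", "runs": "run", "crashed": "crash", "crashes": "crash"}
TOY_STOP_WORDS = {"the", "it", "and", "is", "a"}

def toy_clean(text):
    # Strip punctuation, lowercase, then map each word to its toy lemma.
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    words = stripped.lower().split()
    lemmas = [TOY_LEMMAS.get(word, word) for word in words]
    # Drop stop words, which carry little sentiment on their own.
    return [word for word in lemmas if word not in TOY_STOP_WORDS]

reviews = ["It runs, and it crashes!", "Running the app crashed it."]

# Vocabulary from naive whitespace splitting vs. after cleaning.
raw_vocab = {word for review in reviews for word in review.split()}
clean_vocab = {word for review in reviews for word in toy_clean(review)}

print(len(raw_vocab), sorted(raw_vocab))
print(len(clean_vocab), sorted(clean_vocab))
```

On this toy input the vocabulary drops from 10 distinct strings to 3 ('app', 'crash', 'run'), because case variants, attached punctuation, inflected forms, and stop words no longer create separate features.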

In [72]:
import spacy
from spacy import displacy
from spacy.lang.en.stop_words import STOP_WORDS
import string

In [73]:
nlp = spacy.load('en_core_web_md')

In [74]:
punct = string.punctuation
stopwords = list(STOP_WORDS)

In [86]:
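def tokenize(sentence):
    """Lemmatize, lowercase, and drop stop words and punctuation."""
    doc = nlp(sentence)
    tokens = []
    for token in doc:
        # spaCy v2 lemmatizes every pronoun to "-PRON-"; keep the original word instead.
        if token.lemma_ != "-PRON-":
            temp = token.lemma_.lower().strip()
        else:
            temp = token.lower_
        tokens.append(temp)

    cleaned_tokens = []
    for token in tokens:
        if token not in stopwords and token not in punct:
            cleaned_tokens.append(token)
    # Return the filtered list; returning `tokens` would skip the stop-word and punctuation removal.
    return cleaned_tokens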

In [87]:
tfidf = TfidfVectorizer(tokenizer = tokenize)
classifier = LogisticRegression(max_iter=3000)
X_train, X_test, y_train, y_test = train_test_split(data['summary'], data['overall'], test_size = 0.3, random_state=0, shuffle= True, stratify=data['overall'])

In [83]:
clf = Pipeline([('tfidf',tfidf), ('clf', classifier)])

In [84]:
clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(tokenizer=<function clean_data at 0x000001FA61B7F0D8>)),
                ('clf', LogisticRegression(max_iter=3000))])

In [85]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.67      0.69      0.68      1800
           3       0.60      0.59      0.59      1800
           5       0.74      0.73      0.73      1800

    accuracy                           0.67      5400
   macro avg       0.67      0.67      0.67      5400
weighted avg       0.67      0.67      0.67      5400



In [88]:
confusion_matrix(y_test, y_pred)

array([[1247,  411,  142],
       [ 417, 1062,  321],
       [ 184,  305, 1311]], dtype=int64)

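Processing our text data with lemmatization, case normalization, and stop-word and punctuation removal raised the overall accuracy from 65% to 67%. Most notably, class 3 improved the most, with its F1-score rising from 0.55 to 0.59. Note that this run also trains on the summary column instead of reviewText, so part of the gain may come from the change of input text rather than from the preprocessing alone.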

### Other Classification Models

Throughout this project, we used logistic regression, but we also tested several other classification models, including Random Forest, Naive Bayes, and Support Vector Machines. Ultimately, logistic regression performed the best. For example, it obtained a 1-2% higher accuracy than the linear SVC across several runs.

In [89]:
svc_classifier = LinearSVC()
clf = Pipeline([('tfidf',tfidf), ('clf', svc_classifier)])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.67      0.68      0.68      1800
           3       0.60      0.57      0.59      1800
           5       0.71      0.73      0.72      1800

    accuracy                           0.66      5400
   macro avg       0.66      0.66      0.66      5400
weighted avg       0.66      0.66      0.66      5400



### Future Improvements

The classifier that we've made performs decently well; however, there is still a lot of room for improvement. For example, there are still many NLP techniques, such as POS tagging and dependency parsing, that were not used in our algorithm. Additionally, it is also possible that we can improve our classification model by employing deep learning and building a neural network. Furthermore, instead of vectorizing our data into a Bag of Words over single words, we can see whether vectorizing it into n-grams improves its performance. Although our algorithm was made specifically for sentiment classification, the NLP techniques and the classification models used in this project have many other applications such as chat bots, speech recognition, and spam filters.