# Amazon Reviews Text Processing and Modeling

In [1]:
import pandas as pd
import numpy as np
import nltk
import string
from nltk.corpus import stopwords
from nltk import PorterStemmer

In [2]:
reviews_df = pd.read_csv('Data/Reviews with Label.csv')

In [3]:
reviews_df.head()

Unnamed: 0,asin,Review ID,Title,Body,Rating,Review Length,Review Label
0,B08CQ4HXHV,R2Y2A5WJ9Q84I9,\nI keep stocked up\n,\nThis is my go to worm on the Little Pigeon R...,5.0,250.0,positive
1,B08CQ4HXHV,RJHR3X7CVOZE8,\nIt just works\n,"\nOne of my most successful soft plastics, the...",4.0,876.0,positive
2,B08CQ4HXHV,R2D40LMXK190YP,\nThese baits catch fish!\n,\nThey catch fish and they’re durable too. I’v...,5.0,132.0,positive
3,B08CQ4HXHV,R1KKR6D1SQ3D4D,\nThese things just catch fish.\n,\nDon't have the action of a Yamamoto but stil...,4.0,132.0,positive
4,B08CQ4HXHV,R1V6NVM2KWFOZ5,\nBass love it\n,"\nIt’s a hit with the bass, but rips easily. ...",4.0,79.0,positive


### Text Normalization

In [4]:
# Process the text of one review and return a list of tokens
def review_process(review):
    # Keep only characters that are not ASCII punctuation marks
    # (string.punctuation does not include curly quotes such as ’)
    no_punc = [char for char in review if char not in string.punctuation]

    # Rejoin the characters into a review without punctuation
    no_punc = ''.join(no_punc)

    # Stem each word in the review
    ps = PorterStemmer()
    stemmed = [ps.stem(word) for word in no_punc.split()]

    # Remove stopwords from the stemmed words and return the rest;
    # stems such as 'thi' no longer match the stopword list, so they are kept
    stop_words = set(stopwords.words('english'))
    return [word for word in stemmed if word.lower() not in stop_words]

Compare the original review with its tokenized form.

In [5]:
reviews_df['Body'][0]

'\nThis is my go to worm on the Little Pigeon River in Tennessee. I saw this on You Tubes Creek Fishing Adventures. When all else fails I turn to this. Many days this was all I needed to haul in large Smallmouth. My current PB was caught on this worm.\n'

In [6]:
reviews_df['Body'].head().apply(review_process)[0]

['thi',
 'go',
 'worm',
 'littl',
 'pigeon',
 'river',
 'tennesse',
 'saw',
 'thi',
 'tube',
 'creek',
 'fish',
 'adventur',
 'els',
 'fail',
 'turn',
 'thi',
 'mani',
 'day',
 'thi',
 'wa',
 'need',
 'haul',
 'larg',
 'smallmouth',
 'current',
 'pb',
 'wa',
 'caught',
 'thi',
 'worm']

In [7]:
reviews_df['Body'].head().apply(review_process)

0    [thi, go, worm, littl, pigeon, river, tennesse...
1    [one, success, soft, plastic, yum, 5inch, stic...
2    [catch, fish, they’r, durabl, i’v, caught, som...
3    [dont, action, yamamoto, still, catch, fish, w...
4    [it’, hit, bass, rip, easili, good, amount, th...
Name: Body, dtype: object
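The tokens above show two side effects of review_process: stems such as 'thi' and 'wa' survive because stopwords are filtered after stemming, and curly apostrophes survive (for example "they’r") because they are not in string.punctuation. The sketch below is one possible alternative ordering, not the function used for the results in this notebook. The small STOP_WORDS set and the normalize name are assumptions for illustration; the notebook itself uses nltk's English stopword list.

```python
import string

# Small illustrative stopword set (assumption); the notebook uses nltk's English list.
STOP_WORDS = {"this", "is", "my", "to", "on", "the", "was", "it", "its",
              "a", "with", "but", "they", "and", "too"}

# Translation table that deletes ASCII punctuation plus curly quotes,
# which string.punctuation does not contain.
_DELETE = str.maketrans("", "", string.punctuation + "‘’“”")

def normalize(review, stem=lambda word: word):
    """Strip punctuation, drop stopwords, then stem the remaining words."""
    cleaned = review.translate(_DELETE).lower()
    kept = [word for word in cleaned.split() if word not in STOP_WORDS]
    return [stem(word) for word in kept]

tokens = normalize("\nThis is my go to worm. It’s a hit with the bass, but rips easily.\n")
print(tokens)
```

Passing a real stemmer, for example stem=PorterStemmer().stem, would stem the words that remain after filtering. Switching to this ordering would change the vocabulary, so the model results below would need to be rerun.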

### Pipeline: Text Vectorization, TF-IDF and Modeling

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,classification_report

In [9]:
reviews_df['Body'].isnull().sum()

18

In [10]:
# Dropping rows with null values for the review body
reviews_df = reviews_df.dropna(subset=['Body'])

In [11]:
X = reviews_df['Body']
y = reviews_df['Review Label']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=55)

In [13]:
np.unique(y_train, return_counts=True)

(array(['negative', 'positive'], dtype=object),
 array([ 289, 1491], dtype=int64))

#### Complement Naive Bayes

Complement Naive Bayes is an adaptation of Multinomial Naive Bayes that estimates the parameters for each class from the training examples that do not belong to it (its complement), and then assigns an instance to the class whose complement it matches worst. This method is better suited to imbalanced datasets, which I have with 2143 positive and 419 negative reviews. If I were to use a Multinomial Naive Bayes classifier, the model would be biased toward the positive class.
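To make the imbalance concrete, the sketch below computes the accuracy of a trivial model that always predicts the majority class, using the training counts printed above. The majority_baseline helper is a hypothetical name used only for this illustration.

```python
# Training class counts taken from the np.unique output above
train_counts = {"negative": 289, "positive": 1491}

def majority_baseline(counts):
    """Return the majority label and the accuracy of always predicting it."""
    label = max(counts, key=counts.get)
    return label, counts[label] / sum(counts.values())

label, accuracy = majority_baseline(train_counts)
print(f"Always predicting '{label}' gives {accuracy:.3f} accuracy")
```

Any accuracy figure reported below is best read against this baseline, which is why the per-class scores for the negative class matter more than overall accuracy.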

In [14]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),  # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', ComplementNB())  # create model on TF-IDF vectors with Complement Naive Bayes
])
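As a rough illustration of what the 'tfidf' step computes, the sketch below reimplements TF-IDF weighting in plain Python, assuming TfidfTransformer's default settings (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The three tokenized documents are hypothetical and the tfidf function is not part of the pipeline.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalization for tokenized docs."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Hypothetical tokenized reviews
docs = [["catch", "fish", "durabl"], ["catch", "fish", "worm"], ["rip", "easili"]]
rows = tfidf(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that appear in fewer documents, such as 'durabl' here, receive a larger weight than terms shared across documents, such as 'catch' and 'fish'.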

In [15]:
pipeline.fit(X_train,y_train)

In [16]:
predictions_NB = pipeline.predict(X_test)

In [17]:
# Predictions are passed first, so rows are predicted labels and columns are true labels
print(confusion_matrix(predictions_NB,y_test))

[[ 23  42]
 [103 596]]


In [18]:
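# With predictions passed first, the precision and recall columns are swapped relative to
# scikit-learn's (y_true, y_pred) convention, and support counts predicted labels
print(classification_report(predictions_NB,y_test))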

              precision    recall  f1-score   support

    negative       0.18      0.35      0.24        65
    positive       0.93      0.85      0.89       699

    accuracy                           0.81       764
   macro avg       0.56      0.60      0.57       764
weighted avg       0.87      0.81      0.84       764
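scikit-learn documents both metric functions with the true labels as the first argument, while the cells above pass the predictions first. The sketch below shows one way to read the printed matrix under that assumption: transpose it to the documented layout and recompute precision and recall for the negative class. The class_metrics helper is a hypothetical name for this illustration.

```python
# Matrix printed above; predictions were passed first, so rows are predicted
# labels and columns are true labels (class order: negative, positive).
printed = [[23, 42],
           [103, 596]]

# Transpose to the documented layout: rows are true labels, columns are predictions.
by_truth = [list(column) for column in zip(*printed)]

def class_metrics(matrix, index):
    """Precision and recall for one class from a rows=true, cols=predicted matrix."""
    true_positive = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    return true_positive / predicted_total, true_positive / actual_total

precision, recall = class_metrics(by_truth, 0)
print(f"negative class: precision={precision:.2f}, recall={recall:.2f}")
```

Under this reading the two values for the negative class trade places compared with the report above, while the F1 score and the accuracy stay the same.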



#### Support Vector Machine

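Support Vector Machine is a binary classifier model that separates points with a hyperplane that maximizes the margin between the two classes. Because the target has two classes, positive and negative, a single decision boundary is enough. The data points themselves lie in the high-dimensional TF-IDF space, with one dimension per vocabulary term, and because SVC uses a Radial Basis Function kernel by default, the boundary is a hyperplane in the kernel's feature space rather than a line. The parameter 'class_weight' is set to 'balanced' to put more emphasis on observations in the negative class.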
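For reference, scikit-learn documents the 'balanced' setting as weighting each class by n_samples / (n_classes * class_count). The sketch below applies that formula to the training counts printed earlier; the balanced_class_weights helper is a hypothetical name, not a scikit-learn function.

```python
def balanced_class_weights(counts):
    """Weights n_samples / (n_classes * class_count), the 'balanced' heuristic."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Training class counts from the np.unique output earlier in the notebook
weights = balanced_class_weights({"negative": 289, "positive": 1491})
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

With these counts a misclassified negative review costs roughly five times as much as a misclassified positive review during training.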

In [19]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),  # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', SVC(class_weight='balanced',random_state=55)),  # create model on TF-IDF vectors with Support Vector Machine
])

In [20]:
pipeline.fit(X_train,y_train)

In [21]:
predictions_SVC = pipeline.predict(X_test)

In [22]:
print(confusion_matrix(predictions_SVC,y_test))

[[ 48  24]
 [ 78 614]]


In [23]:
print(classification_report(predictions_SVC,y_test))

              precision    recall  f1-score   support

    negative       0.38      0.67      0.48        72
    positive       0.96      0.89      0.92       692

    accuracy                           0.87       764
   macro avg       0.67      0.78      0.70       764
weighted avg       0.91      0.87      0.88       764



#### Support Vector Machine with Cross-Validation

Support Vector Machine is a binary classifier model that separates points with a hyperplane that maximizes the margin between the two classes. To separate data that is not linearly separable, I will use a Radial Basis Function kernel and tune it with a grid search over C, which controls the cost of misclassification, and gamma, which controls how far the influence of a single training example reaches. Each parameter combination is scored by accuracy (the default) with 5-fold cross-validation on the training set. The parameter 'class_weight' is set to 'balanced' to put more emphasis on observations in the negative class.
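To give a feel for what gamma does, the sketch below evaluates the RBF kernel, exp(-gamma * ||x - y||^2), for one pair of vectors across the gamma values searched in this section. The two vectors are hypothetical L2-normalized TF-IDF rows, and rbf_kernel is a plain-Python stand-in written for this illustration.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF similarity exp(-gamma * ||x - y||^2) between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two hypothetical L2-normalized TF-IDF rows that share one term
a = [0.6, 0.8, 0.0]
b = [0.0, 0.8, 0.6]

for gamma in [1, 0.1, 0.01, 0.001, 0.0001]:  # same gamma values as the grid search
    print(f"gamma={gamma}: similarity={rbf_kernel(a, b, gamma):.4f}")
```

As gamma shrinks, every pair of reviews looks almost equally similar and the decision boundary becomes smoother; larger values make each training review influence only its close neighbors.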

In [24]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}

In [25]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),  # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', GridSearchCV(SVC(class_weight='balanced',random_state=55),param_grid,refit=True,verbose=3)),  # create model on TF-IDF vectors with Support Vector Machine
])

In [26]:
pipeline.fit(X_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.337 total time=   0.2s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.292 total time=   0.2s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.315 total time=   0.2s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.362 total time=   0.3s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.360 total time=   0.2s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.840 total time=   0.2s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.837 total time=   0.2s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.837 total time=   0.2s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.837 total time=   0.2s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.837 total time=   0.2s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.840 total time=   0.2s
[CV 2/5] END .....C=0.1, gamma=0.01, kernel=rbf

[CV 2/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.851 total time=   0.2s
[CV 3/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.843 total time=   0.2s
[CV 4/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.846 total time=   0.2s
[CV 5/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.829 total time=   0.2s
[CV 1/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.865 total time=   0.1s
[CV 2/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.840 total time=   0.1s
[CV 3/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.831 total time=   0.1s
[CV 4/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.837 total time=   0.1s
[CV 5/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.831 total time=   0.1s
[CV 1/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.848 total time=   0.1s
[CV 2/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.834 total time=   0.1s
[CV 3/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.826 total time=   0.1s
[CV 4/5] END ....C=1000, gam

In [27]:
predictions_SVC_CV = pipeline.predict(X_test)

In [28]:
print(confusion_matrix(predictions_SVC_CV,y_test))

[[ 71  62]
 [ 55 576]]


In [29]:
print(classification_report(predictions_SVC_CV,y_test))

              precision    recall  f1-score   support

    negative       0.56      0.53      0.55       133
    positive       0.90      0.91      0.91       631

    accuracy                           0.85       764
   macro avg       0.73      0.72      0.73       764
weighted avg       0.84      0.85      0.85       764



#### Logistic Regression

Logistic Regression is a binary classifier model that returns the probability of a point belonging to a class and then assigns it to a class with a probability cutoff of 0.5. The parameter 'class_weight' is set to 'balanced' to put more emphasis on observations in the negative class.
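The decision rule described above can be sketched in a few lines: a linear score is passed through the sigmoid to get a probability, and the label follows from the 0.5 cutoff. The scores are hypothetical values of w . x + b, and treating 'positive' as the class whose probability is modeled is an assumption for this illustration.

```python
import math

def predict_label(score, threshold=0.5):
    """Map a linear score to a probability with the sigmoid, then apply a cutoff."""
    probability = 1.0 / (1.0 + math.exp(-score))
    label = "positive" if probability > threshold else "negative"
    return label, probability

for score in (-2.0, 0.0, 1.5):  # hypothetical values of w . x + b
    label, probability = predict_label(score)
    print(f"score={score}: P(positive)={probability:.3f} -> {label}")
```

Lowering the threshold argument would trade more positive predictions for fewer negative ones, which is one more lever for imbalanced data besides 'class_weight'.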

In [30]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),   # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', LogisticRegression(class_weight='balanced',random_state=55)),  # create model on TF-IDF vectors with Logistic Regression
])

In [31]:
pipeline.fit(X_train,y_train)

In [32]:
predictions_log = pipeline.predict(X_test)

In [33]:
print(confusion_matrix(predictions_log,y_test))

[[ 86  96]
 [ 40 542]]


In [34]:
print(classification_report(predictions_log,y_test))

              precision    recall  f1-score   support

    negative       0.68      0.47      0.56       182
    positive       0.85      0.93      0.89       582

    accuracy                           0.82       764
   macro avg       0.77      0.70      0.72       764
weighted avg       0.81      0.82      0.81       764



### Model Evaluation

The Logistic Regression and Support Vector Machine with Cross-Validation (SVM with CV) classifier models performed the best, even though the reported weighted average F1 score was the lowest for the Logistic Regression model. One caveat applies to every report above: the cells pass the predictions as the first argument, while scikit-learn expects the true labels first, so the columns labeled precision and recall are swapped and the support column counts predicted labels. The per-class F1 scores and the accuracy are unaffected, and the values quoted below use the corrected meaning. Because there are many more instances in the 'positive' class, it is easy for a model to favor that class and predict that almost every review is positive. The precision, recall and F1 score for all models on the positive class were high, with all values ranging from 0.85 to 0.96. However, the Naive Bayes and Support Vector Machine classifiers performed much worse on negative reviews (F1 scores of 0.24 and 0.48) than the Logistic Regression and SVM with CV models (0.56 and 0.55). The Support Vector Machine model was fairly reliable when it did predict a negative review, with a precision of 0.67 for the negative class compared to 0.47 for the Logistic Regression model. However, it missed too many negative reviews: 78 of the 126 actual negative reviews were predicted as positive, for a recall of 0.38 compared to 0.68 for the Logistic Regression model.

Between the Logistic Regression and SVM with CV models, the SVM with CV model is more balanced on the negative class, with a precision of 0.53, recall of 0.56 and F1 score of 0.55. While the Logistic Regression model has a slightly higher F1 score of 0.56, its precision is 0.47 and its recall is 0.68. It detects more of the actual negative reviews (86 of 126), but it is less reliable when it predicts that a review is negative, since 96 positive reviews were labeled negative. If I had to choose one model as the best, I would choose the Logistic Regression model because of its relatively high recall on negative reviews. However, both were much more successful than the other models at classifying positive and negative reviews.
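Since the comparison leans on the negative-class F1 score, the sketch below recomputes it for all four models directly from the printed confusion matrices. F1 is symmetric in precision and recall, so the result does not depend on which axis of the matrix holds the true labels. The negative_f1 helper is a hypothetical name for this illustration.

```python
# Confusion matrices as printed above; the first row and column are the negative class
matrices = {
    "Complement NB": [[23, 42], [103, 596]],
    "SVM": [[48, 24], [78, 614]],
    "SVM with CV": [[71, 62], [55, 576]],
    "Logistic Regression": [[86, 96], [40, 542]],
}

def negative_f1(matrix):
    """F1 for the first (negative) class: 2*TP / (2*TP + FP + FN)."""
    true_positive = matrix[0][0]
    off_diagonal = matrix[0][1] + matrix[1][0]  # FP + FN, in either orientation
    return 2 * true_positive / (2 * true_positive + off_diagonal)

f1_scores = {name: negative_f1(matrix) for name, matrix in matrices.items()}
for name, score in f1_scores.items():
    print(f"{name:<20} negative-class F1 = {score:.2f}")
```

The recomputed values match the f1-score column of the four reports, which gives some confidence that the ranking of the models on the negative class does not hinge on rounding.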

### Redefining Positive and Negative

#### Predicting 1 or 5 Stars

Instead of positive being 4 and 5 stars and negative being 1-3 stars, I will predict whether a review is 5 star or 1 star. These reviews should differ the most in the words used, so I expect a more accurate model. For the remaining experiments, I will use the Logistic Regression and SVM with CV classifiers since I found them most successful above.

In [35]:
reviews_df_class = reviews_df[(reviews_df['Rating']==1) | (reviews_df['Rating']==5)]

In [36]:
reviews_df_class['Rating'].value_counts()

5.0    1769
1.0     139
Name: Rating, dtype: int64

In [37]:
X = reviews_df_class['Body']
y = reviews_df_class['Rating']

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=55)

In [39]:
np.unique(y_train, return_counts=True)

(array([1., 5.]), array([  99, 1236], dtype=int64))

#### Logistic Regression

In [40]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),   # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', LogisticRegression(class_weight='balanced',random_state=55)),  # create model on TF-IDF vectors with Logistic Regression
])

In [41]:
pipeline.fit(X_train,y_train)

In [42]:
predictions_log = pipeline.predict(X_test)

In [43]:
print(confusion_matrix(predictions_log,y_test))

[[ 31  29]
 [  9 504]]


In [44]:
print(classification_report(predictions_log,y_test))

              precision    recall  f1-score   support

         1.0       0.78      0.52      0.62        60
         5.0       0.95      0.98      0.96       513

    accuracy                           0.93       573
   macro avg       0.86      0.75      0.79       573
weighted avg       0.93      0.93      0.93       573



#### Support Vector Machine with Cross-Validation

In [45]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}

In [46]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),  # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', GridSearchCV(SVC(class_weight='balanced',random_state=55),param_grid,refit=True,verbose=3)),  # create model on TF-IDF vectors with Support Vector Machine
])

In [47]:
pipeline.fit(X_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.543 total time=   0.0s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.644 total time=   0.1s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.689 total time=   0.1s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.775 total time=   0.1s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.640 total time=   0.1s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.929 total time=   0.1s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.075 total time=   0.1s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.075 total time=   0.1s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.075 total time=   0.1s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.075 total time=   0.0s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.929 total time=   0.1s
[CV 2/5] END .....C=0.1, gamma=0.01, kernel=rbf

[CV 2/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.925 total time=   0.1s
[CV 3/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.929 total time=   0.0s
[CV 4/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.925 total time=   0.0s
[CV 5/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.925 total time=   0.1s
[CV 1/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.940 total time=   0.0s
[CV 2/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.918 total time=   0.0s
[CV 3/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.940 total time=   0.0s
[CV 4/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.933 total time=   0.0s
[CV 5/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.925 total time=   0.0s
[CV 1/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.940 total time=   0.0s
[CV 2/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.921 total time=   0.0s
[CV 3/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.940 total time=   0.0s
[CV 4/5] END ....C=1000, gam

In [48]:
predictions_SVC_CV = pipeline.predict(X_test)

In [49]:
print(confusion_matrix(predictions_SVC_CV,y_test))

[[ 22   9]
 [ 18 524]]


In [50]:
print(classification_report(predictions_SVC_CV,y_test))

              precision    recall  f1-score   support

         1.0       0.55      0.71      0.62        31
         5.0       0.98      0.97      0.97       542

    accuracy                           0.95       573
   macro avg       0.77      0.84      0.80       573
weighted avg       0.96      0.95      0.96       573



#### Predicting Negative as 1 or 2 Stars and Positive as 5 Stars

In [51]:
reviews_df_class = reviews_df[(reviews_df['Rating']==1) | (reviews_df['Rating']==2)| (reviews_df['Rating']==5)].reset_index()

In [52]:
reviews_df_class['Rating'].value_counts()

5.0    1769
1.0     139
2.0      96
Name: Rating, dtype: int64

In [53]:
reviews_df_class.drop(['Review Label'],axis=1,inplace=True)

In [54]:
# Label 1 and 2 star reviews as negative and all remaining reviews as positive
label_list = np.where(reviews_df_class['Rating'].isin([1, 2]), 'negative', 'positive')

In [55]:
reviews_df_class['Review Label'] = label_list

In [56]:
reviews_df_class.head()

Unnamed: 0,index,asin,Review ID,Title,Body,Rating,Review Length,Review Label
0,0,B08CQ4HXHV,R2Y2A5WJ9Q84I9,\nI keep stocked up\n,\nThis is my go to worm on the Little Pigeon R...,5.0,250.0,positive
1,2,B08CQ4HXHV,R2D40LMXK190YP,\nThese baits catch fish!\n,\nThey catch fish and they’re durable too. I’v...,5.0,132.0,positive
2,5,B08CQ4HXHV,R15AR1ET95JFX7,\nGood plastic!\n,\nPlastic last a long time!\n,5.0,27.0,positive
3,7,B08CQ4HXHV,R3RMSZ9M1B6KZ9,\nIt works\n,\nFirst bass I caught using a worm. We usually...,5.0,148.0,positive
4,8,B08CQ4HXHV,RPOORXS4DCCI1,\nThe best bass soft lure around.\n,\nThis is a fantastic bass lure. The Yum Fluor...,5.0,88.0,positive


In [57]:
reviews_df_class['Review Label'].value_counts()

positive    1769
negative     235
Name: Review Label, dtype: int64

In [58]:
X = reviews_df_class['Body']
y = reviews_df_class['Review Label']

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=55)

In [60]:
np.unique(y_train, return_counts=True)

(array(['negative', 'positive'], dtype=object),
 array([ 172, 1230], dtype=int64))

#### Logistic Regression

In [61]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),   # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', LogisticRegression(class_weight='balanced',random_state=55)),  # create model on TF-IDF vectors with Logistic Regression
])

In [62]:
pipeline.fit(X_train,y_train)

In [63]:
predictions_log = pipeline.predict(X_test)

In [64]:
print(confusion_matrix(predictions_log,y_test))

[[ 46  45]
 [ 17 494]]


In [65]:
print(classification_report(predictions_log,y_test))

              precision    recall  f1-score   support

    negative       0.73      0.51      0.60        91
    positive       0.92      0.97      0.94       511

    accuracy                           0.90       602
   macro avg       0.82      0.74      0.77       602
weighted avg       0.89      0.90      0.89       602



#### Support Vector Machine with Cross-Validation

In [66]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}

In [67]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),  # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', GridSearchCV(SVC(class_weight='balanced',random_state=55),param_grid,refit=True,verbose=3)),  # create model on TF-IDF vectors with Support Vector Machine
])

In [68]:
pipeline.fit(X_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.384 total time=   0.1s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.327 total time=   0.1s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.421 total time=   0.1s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.379 total time=   0.1s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.411 total time=   0.1s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.125 total time=   0.1s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.125 total time=   0.1s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.121 total time=   0.1s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.121 total time=   0.1s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.121 total time=   0.1s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.125 total time=   0.1s
[CV 2/5] END .....C=0.1, gamma=0.01, kernel=rbf

[CV 2/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.879 total time=   0.1s
[CV 3/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.896 total time=   0.1s
[CV 4/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.871 total time=   0.1s
[CV 5/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.889 total time=   0.1s
[CV 1/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.890 total time=   0.0s
[CV 2/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.883 total time=   0.0s
[CV 3/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.904 total time=   0.0s
[CV 4/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.900 total time=   0.0s
[CV 5/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.893 total time=   0.0s
[CV 1/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.900 total time=   0.0s
[CV 2/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.879 total time=   0.0s
[CV 3/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.904 total time=   0.0s
[CV 4/5] END ....C=1000, gam

In [69]:
predictions_SVC_CV = pipeline.predict(X_test)

In [70]:
print(confusion_matrix(predictions_SVC_CV,y_test))

[[ 30  16]
 [ 33 523]]


In [71]:
print(classification_report(predictions_SVC_CV,y_test))

              precision    recall  f1-score   support

    negative       0.48      0.65      0.55        46
    positive       0.97      0.94      0.96       556

    accuracy                           0.92       602
   macro avg       0.72      0.80      0.75       602
weighted avg       0.93      0.92      0.92       602



#### Predicting Negative as 1 or 2 Stars and Positive as 4 or 5 Stars

In [72]:
reviews_df_class = reviews_df[(reviews_df['Rating']==1) | (reviews_df['Rating']==2)| (reviews_df['Rating']==4) | (reviews_df['Rating']==5)].reset_index()

In [73]:
reviews_df_class['Rating'].value_counts()

5.0    1769
4.0     360
1.0     139
2.0      96
Name: Rating, dtype: int64

In [74]:
reviews_df_class.drop(['Review Label'],axis=1,inplace=True)

In [75]:
# Label 1 and 2 star reviews as negative and all remaining reviews as positive
label_list = np.where(reviews_df_class['Rating'].isin([1, 2]), 'negative', 'positive')

In [76]:
reviews_df_class['Review Label'] = label_list

In [77]:
reviews_df_class.head()

Unnamed: 0,index,asin,Review ID,Title,Body,Rating,Review Length,Review Label
0,0,B08CQ4HXHV,R2Y2A5WJ9Q84I9,\nI keep stocked up\n,\nThis is my go to worm on the Little Pigeon R...,5.0,250.0,positive
1,1,B08CQ4HXHV,RJHR3X7CVOZE8,\nIt just works\n,"\nOne of my most successful soft plastics, the...",4.0,876.0,positive
2,2,B08CQ4HXHV,R2D40LMXK190YP,\nThese baits catch fish!\n,\nThey catch fish and they’re durable too. I’v...,5.0,132.0,positive
3,3,B08CQ4HXHV,R1KKR6D1SQ3D4D,\nThese things just catch fish.\n,\nDon't have the action of a Yamamoto but stil...,4.0,132.0,positive
4,4,B08CQ4HXHV,R1V6NVM2KWFOZ5,\nBass love it\n,"\nIt’s a hit with the bass, but rips easily. ...",4.0,79.0,positive


In [78]:
reviews_df_class['Review Label'].value_counts()

positive    2129
negative     235
Name: Review Label, dtype: int64

In [79]:
X = reviews_df_class['Body']
y = reviews_df_class['Review Label']

In [80]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=55)

In [81]:
np.unique(y_train, return_counts=True)

(array(['negative', 'positive'], dtype=object),
 array([ 163, 1491], dtype=int64))

#### Logistic Regression

In [82]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),   # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', LogisticRegression(class_weight='balanced',random_state=55)),  # create model on TF-IDF vectors with Logistic Regression
])

In [83]:
pipeline.fit(X_train,y_train)

In [84]:
predictions_log = pipeline.predict(X_test)

In [85]:
print(confusion_matrix(predictions_log,y_test))

[[ 55  45]
 [ 17 593]]


In [86]:
print(classification_report(predictions_log,y_test))

              precision    recall  f1-score   support

    negative       0.76      0.55      0.64       100
    positive       0.93      0.97      0.95       610

    accuracy                           0.91       710
   macro avg       0.85      0.76      0.79       710
weighted avg       0.91      0.91      0.91       710



#### Support Vector Machine with Cross-Validation

In [87]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}

In [88]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=review_process)),  # using text normalizing function to create vector of words
    ('tfidf', TfidfTransformer()),  # calculate weighted TF-IDF scores based on word vectors
    ('classifier', GridSearchCV(SVC(class_weight='balanced',random_state=55),param_grid,refit=True,verbose=3)),  # create model on TF-IDF vectors with Support Vector Machine
])

In [89]:
pipeline.fit(X_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.447 total time=   0.2s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.453 total time=   0.2s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.296 total time=   0.2s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.480 total time=   0.2s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.415 total time=   0.2s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.903 total time=   0.2s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.100 total time=   0.2s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.900 total time=   0.2s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.900 total time=   0.2s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.903 total time=   0.2s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.903 total time=   0.2s
[CV 2/5] END .....C=0.1, gamma=0.01, kernel=rbf

[CV 2/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.900 total time=   0.1s
[CV 3/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.900 total time=   0.1s
[CV 4/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.906 total time=   0.1s
[CV 5/5] END .......C=1000, gamma=1, kernel=rbf;, score=0.903 total time=   0.1s
[CV 1/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.897 total time=   0.0s
[CV 2/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.906 total time=   0.0s
[CV 3/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.918 total time=   0.0s
[CV 4/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.903 total time=   0.0s
[CV 5/5] END .....C=1000, gamma=0.1, kernel=rbf;, score=0.891 total time=   0.0s
[CV 1/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.888 total time=   0.0s
[CV 2/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.912 total time=   0.0s
[CV 3/5] END ....C=1000, gamma=0.01, kernel=rbf;, score=0.912 total time=   0.0s
[CV 4/5] END ....C=1000, gam

In [90]:
predictions_SVC_CV = pipeline.predict(X_test)

In [91]:
print(confusion_matrix(predictions_SVC_CV,y_test))

[[ 35  26]
 [ 37 612]]


In [92]:
print(classification_report(predictions_SVC_CV,y_test))

              precision    recall  f1-score   support

    negative       0.49      0.57      0.53        61
    positive       0.96      0.94      0.95       649

    accuracy                           0.91       710
   macro avg       0.72      0.76      0.74       710
weighted avg       0.92      0.91      0.91       710



### Model Evaluation

Overall, the models performed better with redefined positive and negative reviews, and the Logistic Regression outperformed SVM with CV. As in the earlier reports, the predictions were passed as the first argument to the metric functions, so the precision and recall columns are swapped; the values here use the corrected meaning. When predicting 1 or 5 stars, the two models had the same F1 score of 0.62 for the 1 star class, but reached it in opposite ways. The Logistic Regression model has a precision of 0.52 and a recall of 0.78, and the SVM with CV model has a precision of 0.71 and a recall of 0.55. When I redefined negative reviews as 1 and 2 stars, I thought the models might perform better than when predicting only 1 or 5 stars, since the negative class would have more observations. However, the models performed slightly worse, with negative-class F1 scores of 0.60 and 0.55. Next, when I redefined positive to 4 and 5 star reviews, the Logistic Regression model had the best F1 score for the negative class of all the models investigated, at 0.64. All the Logistic Regression models had an F1 score of 0.60 or greater in this section, while the SVM with CV models fluctuated more, from 0.53 to 0.62.

It seems both models were better able to distinguish 5 star from 1 star reviews than the original positive from negative reviews, which makes sense given 3 star reviews are very likely to have a mix of words used from 5 star and 1 star reviews. Removing those ambiguous reviews from classification allowed the models to better predict positive and negative reviews. An area for further exploration could be a multiclass classifier model, where 1 and 2 star reviews are negative, 3 star reviews are neutral, and 4 and 5 star reviews are positive.
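As a starting point for the multiclass idea, the labeling rule could be sketched as a small function. The rating_to_label name and the 'neutral' label are assumptions for this illustration and are not used elsewhere in this notebook.

```python
def rating_to_label(rating):
    """Map a star rating to a three-class sentiment label."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"

labels = [rating_to_label(rating) for rating in (1.0, 2.0, 3.0, 4.0, 5.0)]
print(labels)
```

The function could be applied to the 'Rating' column with apply to build a new target column, after which the same pipelines could be fitted, since Logistic Regression and SVC both support more than two classes.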

### Conclusion

As I have shown in the previous section, defining a review as positive or negative is subjective, and this definition alters the success of classification. Additionally, since most reviews are 5 star reviews, it is extremely important to have enough observations in the negative class and to ensure the classifier models give more consideration to the negative class. The Logistic Regression models were successful in classifying positive or negative reviews, even with changing definitions. To improve the results further, the most impactful next step is simply gathering more data. Even though I gathered over 2500 reviews, these reviews are heavily skewed towards 5 stars, so the negative class had fewer than 500 observations. However, because soft plastic fishing lures are a relatively narrow product category, I am limited in the amount of data I can collect.