# Noah Miller GitHub Python Project 2: Natural Language Processing of Amazon Product Reviews

### Note: This project is adapted from a project in one of my graduate classes, with ideas and code borrowed from other sources. I cite these sources in the code and at the end of this document.

In this notebook, we will be investigating Amazon product reviews. The dataset in question comes from https://www.kaggle.com/kritanjalijain/amazon-reviews?select=train.csv and consists of a spreadsheet containing a review, the review title, and whether or not the review is positive (denoted by a 2) or negative (denoted by a 1). The business application of this project is to identify terms associated with good reviews, terms associated with bad reviews, and whether or not we can use machine learning methods to identify if a review is good or bad. We can use this information to suggest the sale of an alternative product in the case of a bad review. Likewise, we can use this information to suggest more of the same or similar products to users who leave good reviews.


## Step 1: Data Importing and Preprocessing

In this step, let's first view our data frame before any preprocessing takes place. For the sake of speed and computational resources, we will work with only 500,000 of the original 3.6 million training samples. We will draw these observations at random, with a fixed seed, so that the subset is reproducible and not biased toward any one part of the file.

In [1]:
import warnings
warnings.filterwarnings('ignore')  # Filters out red boxes due to warnings
import pandas as pd  # Great for manipulating data frames
from dask import dataframe as dd  # I found this was faster for reading in csv files
#pd.options.mode.chained_assignment = None  # Disables chained assignments
# The solution above helped me assign one column to be another one and comes from
# https://stackoverflow.com/questions/49728421/pandas-dataframe-settingwithcopywarning-a-value-is-trying-to-be-set-on-a-copy
import numpy as np  # Great for numerical manipulation
import nltk  # Natural language toolkit
from nltk.corpus import stopwords, wordnet  # For preprocessing our data
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # For analyzing sentiments later on
import multiprocess as mp  # Similar to multiprocessING, but just works better for Jupyter Notebook

amazon_reviews = dd.read_csv("/Users/noahmiller/Downloads/train.csv", header=None)  # Reading in through Dask is a little faster
amazon_reviews = amazon_reviews.compute()  # Turns Dask data frame into Pandas data frame
amazon_reviews = amazon_reviews.rename(columns={0: "Sentiment", 1:"Title",2:"Review"})  # Had to assign these manually
amazon_reviews = amazon_reviews.dropna()  # Removes null values
amazon_reviews = amazon_reviews.sample(n=500_000, random_state=123)  # Using 500,000 samples; the entire set was too much, even for my gaming PC with 6 cores and 32 GB of RAM. 
amazon_reviews.head()

Unnamed: 0,Sentiment,Title,Review
91242,1,"""Signs"" is the most hideous sci-fi film I ever...",This was the most disappointing sci-fi film I ...
62299,1,Mukluks,"Comfy, but not made as well as I hoped. Socks ..."
110813,1,Not that great,"Barbara Delinski is a great author, but this b..."
120160,1,Booring Bugggery,Buns! Buggery! Brutality! Booring! The littery...
117321,1,HOUSEBOY,THIS MOVIE IS FAIR. ENTERTAINING BUT NOT A GRE...


With these records read in and in a well-structured format, we can begin to process the data. First, we can reset the index of this data set. Observe that the index of the data frame above has no discernible order.

In [2]:
amazon_reviews = amazon_reviews.reset_index(drop=True)
amazon_reviews.head()

Unnamed: 0,Sentiment,Title,Review
0,1,"""Signs"" is the most hideous sci-fi film I ever...",This was the most disappointing sci-fi film I ...
1,1,Mukluks,"Comfy, but not made as well as I hoped. Socks ..."
2,1,Not that great,"Barbara Delinski is a great author, but this b..."
3,1,Booring Bugggery,Buns! Buggery! Brutality! Booring! The littery...
4,1,HOUSEBOY,THIS MOVIE IS FAIR. ENTERTAINING BUT NOT A GRE...


This data frame is a lot simpler than before. We can make it even simpler, however, by relabeling sentiments of 1 as Negative and sentiments of 2 as Positive. For this, it's easy enough to just use a lambda function on the desired column.

In [3]:
amazon_reviews['Sentiment'] = amazon_reviews['Sentiment'].map(lambda x: "Negative" if x==1 else "Positive")
amazon_reviews.head()

Unnamed: 0,Sentiment,Title,Review
0,Negative,"""Signs"" is the most hideous sci-fi film I ever...",This was the most disappointing sci-fi film I ...
1,Negative,Mukluks,"Comfy, but not made as well as I hoped. Socks ..."
2,Negative,Not that great,"Barbara Delinski is a great author, but this b..."
3,Negative,Booring Bugggery,Buns! Buggery! Brutality! Booring! The littery...
4,Negative,HOUSEBOY,THIS MOVIE IS FAIR. ENTERTAINING BUT NOT A GRE...


Observe that the sentiment column now consists of the words "Negative" and "Positive" to denote the sentiment of a review.

This next cell is where a lot of the preprocessing takes place. The function below will, in order:
* Remove punctuation from the columns
* Remove non-alphanumeric characters from the columns
* Convert the columns to lowercase
* Remove stopwords from the sentence

In addition, I've added some logic that lets Python perform this processing in parallel through the *multiprocess* module. That's not a typo; *multiprocess* is a fork of the *multiprocessing* module which just works better for Jupyter Notebooks, like this one.

In [4]:
# Credit for parts of this InputProcessing function go to 
# https://erleem.medium.com/nlp-complete-sentiment-analysis-on-amazon-reviews-374e4fea9976
# This is where I got the parts for removing non-alphanumeric, converting to lowercase, and removing stopwords


import re
import string

stopwords_list = stopwords.words('english')
def InputProcessing(raw_input):
    # Removing punctuation
    raw_input = raw_input.translate(str.maketrans('', '', string.punctuation))
    # Removing non-alphanumeric characters from string
    # (str.replace treats the pattern as literal text, so a regex substitution is needed here)
    raw_input = re.sub(r'[^a-zA-Z0-9 ]', '', raw_input)
    # Converting to lowercase
    raw_input = raw_input.lower()
    raw_input = raw_input.split()
    # Remove stopwords
    raw_input = [item for item in raw_input if item not in stopwords_list]
    clean_output = ' '.join([str(elem) for elem in raw_input])
    return clean_output

with mp.Pool(mp.cpu_count()) as pool:  # Processing in parallel
    amazon_reviews['Review'] = pool.map(InputProcessing, amazon_reviews['Review'])
    pool.close()
    pool.join()
    
with mp.Pool(mp.cpu_count()) as pool:  # Processing in parallel
    amazon_reviews['Title'] = pool.map(InputProcessing, amazon_reviews['Title'])
    pool.close()
    pool.join()
    
amazon_reviews.head()

Unnamed: 0,Sentiment,Title,Review
0,Negative,signs hideous scifi film ever watched,disappointing scifi film ever watchedyou alien...
1,Negative,mukluks,comfy made well hoped socks different sizes te...
2,Negative,great,barbara delinski great author book one worst c...
3,Negative,booring bugggery,buns buggery brutality booring littery device ...
4,Negative,houseboy,movie fair entertaining great movie looking mo...
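
To make the four cleaning steps concrete, here is a minimal sketch that applies them to a single string. It is only an illustration: `clean_text` is a hypothetical name, and `TOY_STOPWORDS` is a tiny hand-written set standing in for NLTK's English stopword list, so results can differ from `InputProcessing` on other inputs.

```python
import re
import string

# Toy stopword set for illustration only; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"this", "that", "was", "the", "i", "but", "not", "as", "a", "is"}

def clean_text(text):
    # 1. Strip ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Drop anything that is not a letter, digit, or space.
    text = re.sub(r"[^a-zA-Z0-9 ]", "", text)
    # 3. Lowercase, then 4. remove stopwords.
    tokens = [tok for tok in text.lower().split() if tok not in TOY_STOPWORDS]
    return " ".join(tokens)

print(clean_text("THIS MOVIE IS FAIR. Entertaining, but NOT a great movie!"))
# movie fair entertaining great movie
```

Notice that "not" is treated as a stopword, so "not a great movie" collapses to "great movie". The same thing happens in the output above, where the title "Not that great" becomes simply "great", which is worth keeping in mind when we score sentiment later.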


Now that our text is cleaner, let's apply another fundamental preprocessing step: lemmatization. This reduces each word to its dictionary root form (its lemma), so that variants such as "watched" and "watch" are treated as the same term. With fewer distinct terms to learn from, our models may find it easier to determine whether a review is positive or negative.

In [5]:
# All credit for this lemmatization function goes to https://erleem.medium.com/nlp-complete-sentiment-analysis-on-amazon-reviews-374e4fea9976
# I, however, introduced the parallelization to this function
# Note: This may take a VERY long time so parallelization was necessary

# On a side note, I am NOT responsible if your CPU burns itself to death 🔥 

def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = nltk.stem.WordNetLemmatizer()

def get_lemmatizer(input_string):
    return " ".join(lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(input_string))

with mp.Pool(mp.cpu_count()) as pool:  # Processing in parallel
    amazon_reviews['Review'] = pool.map(get_lemmatizer, amazon_reviews['Review'])
    pool.close()
    pool.join()
    
with mp.Pool(mp.cpu_count()) as pool:  # Processing in parallel
    amazon_reviews['Title'] = pool.map(get_lemmatizer, amazon_reviews['Title'])
    pool.close()
    pool.join()
    
amazon_reviews.head()

Unnamed: 0,Sentiment,Title,Review
0,Negative,sign hideous scifi film ever watch,disappoint scifi film ever watchedyou alien te...
1,Negative,mukluks,comfy make well hop sock different size tend t...
2,Negative,great,barbara delinski great author book one bad cou...
3,Negative,booring bugggery,bun buggery brutality booring littery device u...
4,Negative,houseboy,movie fair entertain great movie look movie wo...


Next, let's use a dictionary-based method to score how positive or negative each review and title is. I like the VADER dictionary, which not only classifies text as positive or negative, but also weights each term by the magnitude of its positivity or negativity. We will keep the compound score, which ranges from -1 (most negative) to 1 (most positive). Sometimes you can get unexpected results, as we may see in the next output, where the second review is labeled Negative but scores 0.8625.

In [6]:
sid_obj = SentimentIntensityAnalyzer()
def sentiment_scores(review_text):
    sentiment_dict = sid_obj.polarity_scores(review_text)
    return sentiment_dict['compound']

with mp.Pool(mp.cpu_count()) as pool:  # Processing in parallel
    amazon_reviews['ReviewDictionarySent'] = pool.map(sentiment_scores, amazon_reviews['Review'])
    pool.close()
    pool.join()
    
with mp.Pool(mp.cpu_count()) as pool:  # Processing in parallel
    amazon_reviews['TitleDictionarySent'] = pool.map(sentiment_scores, amazon_reviews['Title'])
    pool.close()
    pool.join()
    
    
amazon_reviews.head()

Unnamed: 0,Sentiment,Title,Review,ReviewDictionarySent,TitleDictionarySent
0,Negative,sign hideous scifi film ever watch,disappoint scifi film ever watchedyou alien te...,-0.9517,0.0
1,Negative,mukluks,comfy make well hop sock different size tend t...,0.8625,0.0
2,Negative,great,barbara delinski great author book one bad cou...,-0.2937,0.6249
3,Negative,booring bugggery,bun buggery brutality booring littery device u...,0.1027,0.0
4,Negative,houseboy,movie fair entertain great movie look movie wo...,0.8271,0.0
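
The compound score can look opaque, so here is a toy sketch of the idea behind a dictionary-based scorer: sum the valence of each known word, then squash the sum into the range -1 to 1. The lexicon `TOY_LEXICON` and the function `toy_compound` are made up for illustration. The squashing formula, x / sqrt(x² + 15), is to my understanding the normalization VADER applies to its summed valence, but the real analyzer also handles negation, intensifiers, punctuation, and capitalization, none of which are modeled here.

```python
import math

# Hypothetical mini-lexicon. The 3.1 for "great" is chosen so that the sketch
# reproduces the 0.6249 title score shown above; the other values are invented.
TOY_LEXICON = {"great": 3.1, "bad": -2.5, "disappoint": -2.0}

def toy_compound(text, alpha=15):
    # Sum the valence of every known word; unknown words contribute 0.
    total = sum(TOY_LEXICON.get(word, 0.0) for word in text.split())
    # Squash the unbounded sum into the open interval (-1, 1).
    return round(total / math.sqrt(total * total + alpha), 4)

print(toy_compound("great"))    # 0.6249
print(toy_compound("mukluks"))  # 0.0
```

This also hints at why some scores look odd: a title such as "great", which started life as "Not that great", scores as clearly positive once the stopword "not" has been stripped out.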


Finally, now that our text is as processed as it can be, the last step is, of course, vectorization. We'll use the TfidfVectorizer from sklearn, which converts each document into a vector of term weights: a term scores higher the more often it appears in a document and the rarer it is across the whole data set. There's a bit of math going on here, so check out https://www.analyticsvidhya.com/blog/2021/11/how-sklearns-tfidfvectorizer-calculates-tf-idf-values/ for more details.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(
                       max_df=0.95,
                       min_df=0.05,
                       ngram_range=(1,3))

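# Each call refits the vectorizer, so reviews and titles get their own vocabularies;
# after the second call, tfidf holds only the vocabulary learned from the titles.
reviews_unprocessed = tfidf.fit_transform(amazon_reviews['Review'])
titles_unprocessed = tfidf.fit_transform(amazon_reviews['Title'])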
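
For a feel of the math, here is a small pure-Python sketch of TF-IDF on three toy documents. It assumes the formula documented for scikit-learn's default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row); `tfidf_rows` and the toy documents are hypothetical, and the sketch ignores n-grams and the `min_df`/`max_df` filtering used in the cell above.

```python
import math
from collections import Counter

# Three toy "reviews", already cleaned; purely illustrative.
docs = ["great book", "bad book", "great great movie"]

def tfidf_rows(documents):
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])  # L2-normalize each row
    return vocab, rows

vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the "bad book" row, "bad" receives a larger weight than "book" because it appears in only one of the three documents, while "book" appears in two.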

The text of the reviews and titles has now been transformed into numeric form, stored as sparse matrices. Lastly, we need to convert these into dense NumPy arrays so they can be concatenated with the sentiment scores.

In [8]:
reviews_processed = reviews_unprocessed.toarray()  # Transforming reviews into NumPy array
titles_processed = titles_unprocessed.toarray()  # Transforming titles into NumPy array
sentiment_array = amazon_reviews[['ReviewDictionarySent','TitleDictionarySent']].to_numpy()

## Step 2: Machine Learning Classifications

Now we can begin to see if we can classify text. First, we will split our data into training and testing sets along an 80/20 split. From there, we will use an AdaBoost classifier, a Random Forest, and an Artificial Neural Network to attempt to make these classifications. Results of these predictions will be reported through confusion matrices. At the end, we will use a stacked ensemble of these three models to see if they can all work well together.

First, let's import some preprocessing modules, separate our features and our target variable, and split our data into a training set and a testing set.

In [9]:
from sklearn.model_selection import train_test_split # Allows us to split into testing and training sets quite easily
from sklearn.preprocessing import StandardScaler # Allows easy scaling for training data
from sklearn.pipeline import make_pipeline  # Allows use of make_pipeline function
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,classification_report,confusion_matrix # Different metrics to score against

features = np.concatenate((reviews_processed, titles_processed, sentiment_array), axis=1)  # Concatenating column-wise
target = amazon_reviews['Sentiment']  # This is the data we are attempting to predict

features_train, features_test, target_train, target_test = train_test_split(
        features, target.values, test_size=0.20, random_state=123) # I am setting an 80/20 split and setting a seed so my results are consistent in the future.

First, let's build our AdaBoost classifier. This model (or should I say models) is an ensemble built by training a sequence of weak learners, here decision trees. Each learner in the sequence learns from the last by focusing on the observations its predecessors misclassified. This model can take a while, so use this time to grab yourself a snack or a cup of coffee.
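
To illustrate what "learning from the last model" means, here is a sketch of a single boosting round for two classes. The labels and predictions are made up, and `boosting_round` is a hypothetical helper. Note that it uses the simpler discrete SAMME update, not the SAMME.R variant configured in the cell that follows, and it is not scikit-learn's implementation.

```python
import math

# Made-up labels and weak-learner predictions for six reviews.
y_true = ["Pos", "Pos", "Neg", "Neg", "Pos", "Neg"]
y_pred = ["Pos", "Neg", "Neg", "Neg", "Pos", "Pos"]

def boosting_round(y_true, y_pred, weights):
    missed = [t != p for t, p in zip(y_true, y_pred)]
    # Weighted error rate of the current weak learner.
    err = sum(w for w, m in zip(weights, missed) if m) / sum(weights)
    # Learner weight: larger when the error is smaller (two-class SAMME).
    alpha = math.log((1 - err) / err)
    # Up-weight misclassified samples, then renormalize to sum to 1.
    new_w = [w * math.exp(alpha) if m else w for w, m in zip(weights, missed)]
    total = sum(new_w)
    return err, alpha, [w / total for w in new_w]

weights = [1 / len(y_true)] * len(y_true)  # start with uniform weights
err, alpha, weights = boosting_round(y_true, y_pred, weights)
print(round(err, 3), round(alpha, 3))
print([round(w, 3) for w in weights])
```

After this round, the two misclassified reviews together carry half of the total weight, so the next learner in the sequence is pushed to get them right.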

In [10]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree

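clf_AB = make_pipeline(StandardScaler(),AdaBoostClassifier(tree.DecisionTreeClassifier(class_weight="balanced"),
                         # Note: recent scikit-learn releases deprecate "SAMME.R"; remove this argument if it raises an error
                         algorithm="SAMME.R",
                         random_state=123))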


#Train
clf_AB.fit(features_train, target_train)

#Validate
target_predicted=clf_AB.predict(features_test)
print(classification_report(target_test, target_predicted))
print(confusion_matrix(target_test, target_predicted))
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(target_test, target_predicted).ravel()
print("True Negatives: ",tn)
print("False Positives: ",fp)
print("False Negatives: ",fn)
print("True Positives: ",tp)
print("Accuracy Score", accuracy_score(target_test, target_predicted))

              precision    recall  f1-score   support

    Negative       0.77      0.76      0.77     49955
    Positive       0.77      0.78      0.77     50045

    accuracy                           0.77    100000
   macro avg       0.77      0.77      0.77    100000
weighted avg       0.77      0.77      0.77    100000

[[38161 11794]
 [11244 38801]]
True Negatives:  38161
False Positives:  11794
False Negatives:  11244
True Positives:  38801
Accuracy Score 0.76962


Okay, not bad. At roughly 77 percent, the overall accuracy is pretty good. The precision, recall, and F1 scores are all consistent with it, and the errors are spread roughly evenly between the two classes.
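
The headline numbers in the classification report can be recomputed directly from the four confusion-matrix counts printed by the AdaBoost cell. A small sketch, treating "Positive" as the positive class (`metrics_from_counts` is a hypothetical helper name):

```python
def metrics_from_counts(tn, fp, fn, tp):
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)  # of predicted positives, how many were right
    recall = tp / (tp + fn)     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts reported by the AdaBoost cell.
acc, prec, rec, f1 = metrics_from_counts(tn=38161, fp=11794, fn=11244, tp=38801)
print(round(acc, 5), round(prec, 2), round(rec, 2), round(f1, 2))
# 0.76962 0.77 0.78 0.77
```

These match the "Positive" row of the report and the printed accuracy score, and the same helper can be applied to the counts from the other models for a quick side-by-side comparison.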

Let's see if we can improve, however. One of my favorite algorithms is the classic Random Forest. There are just so many things I love about this algorithm: its power and its simplicity, its ability to run in parallel, and, honestly, it just has a cool name. Let's see if we can improve our predictive performance over our AdaBoost model.

In [11]:
from sklearn.ensemble import RandomForestClassifier # A random forest is a very large collection of decision trees that votes on the result.
clf_RF = RandomForestClassifier(criterion='gini',
                                     n_estimators=100, # I like a lot of trees in my forest
                                     random_state=123,
                                     n_jobs=-1) # Seeing what happens if I use all processing cores; so much faster


clf_RF.fit(features_train, target_train)

#Validate
target_predicted = clf_RF.predict(features_test)
print(classification_report(target_test, target_predicted))
print(confusion_matrix(target_test, target_predicted))
tn, fp, fn, tp = confusion_matrix(target_test, target_predicted).ravel()
print("True Negatives: ",tn)
print("False Positives: ",fp)
print("False Negatives: ",fn)
print("True Positives: ",tp)
print("Accuracy Score", accuracy_score(target_test, target_predicted))

              precision    recall  f1-score   support

    Negative       0.80      0.81      0.81     49955
    Positive       0.81      0.80      0.80     50045

    accuracy                           0.81    100000
   macro avg       0.81      0.81      0.81    100000
weighted avg       0.81      0.81      0.81    100000

[[40600  9355]
 [10064 39981]]
True Negatives:  40600
False Positives:  9355
False Negatives:  10064
True Positives:  39981
Accuracy Score 0.80581


It looks like the Random Forest did a bit better than the AdaBoost model, coming in at 81 percent accuracy overall with precision, recall, and F1 scores to match. This is our best model so far, but can the artificial neural network perform better?

In [12]:
from sklearn.neural_network import MLPClassifier
clf_ANN = make_pipeline(StandardScaler(),MLPClassifier(solver="adam", learning_rate="adaptive", max_iter=1000,
                                                        random_state=123, hidden_layer_sizes=(20,15,10,5))) # I am making this one more complex to hopefully pick up on more nuance in our data

#Train
clf_ANN.fit(features_train, target_train)

#Validate
target_predicted=clf_ANN.predict(features_test)
print(classification_report(target_test, target_predicted))
print(confusion_matrix(target_test, target_predicted))
tn, fp, fn, tp = confusion_matrix(target_test, target_predicted).ravel()
print("True Negatives: ",tn)
print("False Positives: ",fp)
print("False Negatives: ",fn)
print("True Positives: ",tp)
print("Accuracy Score", accuracy_score(target_test, target_predicted))

              precision    recall  f1-score   support

    Negative       0.81      0.81      0.81     49955
    Positive       0.81      0.81      0.81     50045

    accuracy                           0.81    100000
   macro avg       0.81      0.81      0.81    100000
weighted avg       0.81      0.81      0.81    100000

[[40381  9574]
 [ 9619 40426]]
True Negatives:  40381
False Positives:  9574
False Negatives:  9619
True Positives:  40426
Accuracy Score 0.80807


Yes, it looks like the artificial neural network CAN perform better, and in less time as well. The improvement is only marginal (80.8 percent accuracy versus 80.6 percent), but combined with its speed, it makes the network a better choice than the Random Forest.

Finally, let's combine the models above to see if they can do any better together than they do alone. This can take place through a stacked model, which feeds the predictions of several models into a final estimator (here, a logistic regression) that makes the final call. Therefore, the model below will use our AdaBoost classifier, our Random Forest, and our Artificial Neural Network in one. This can take a very long time, but we may see a better result.

In [13]:
from sklearn.ensemble import StackingClassifier # Allows us to combine completely different models into one
from sklearn.linear_model import LogisticRegression # Makes the final call on a particular observation
learner1 = clf_AB # AdaBoost Classifier from Above
learner2 = clf_RF # Random Forest Model from Above
learner3 = clf_ANN # Artificial Neural Network from Above

estimators = [
     ('ab', learner1),
     ('rf', learner2),
     ('ann', learner3)] # List of our estimators


stacked_learner = StackingClassifier(estimators = estimators, final_estimator = LogisticRegression(),n_jobs=-1)

stacked_learner.fit(features_train, target_train)

#Validate
target_predicted=stacked_learner.predict(features_test)
print(classification_report(target_test, target_predicted))
print(confusion_matrix(target_test, target_predicted))
tn, fp, fn, tp = confusion_matrix(target_test, target_predicted).ravel()
print("True Negatives: ",tn)
print("False Positives: ",fp)
print("False Negatives: ",fn)
print("True Positives: ",tp)
print("Accuracy Score", accuracy_score(target_test, target_predicted))

              precision    recall  f1-score   support

    Negative       0.81      0.81      0.81     49955
    Positive       0.81      0.81      0.81     50045

    accuracy                           0.81    100000
   macro avg       0.81      0.81      0.81    100000
weighted avg       0.81      0.81      0.81    100000

[[40589  9366]
 [ 9443 40602]]
True Negatives:  40589
False Positives:  9366
False Negatives:  9443
True Positives:  40602
Accuracy Score 0.81191


Although our accuracy was slightly better than that of our Artificial Neural Network (81.2 percent versus 80.8 percent), the stacked model took much longer to run. If a company wanted to use a model like this to make predictions in real time, a stacked model may not be worth the slight accuracy increase given the time it takes to complete.

Overall, I hope you found this notebook useful. I'll be uploading more projects like this to my GitHub page (https://github.com/noahmiller-ds) throughout 2022. My next project will be experimenting with the *Ray* framework, which is supposed to help make distributed computing easy. I think the application here will be using Facebook Prophet across multiple machines to predict the next 30 days of stock prices from the top 1000 US stocks. **Please don't use that example as a get-rich-quick oracle; it's more about demonstrating how models can be distributed as opposed to ensuring the predictions are right, in that case.**

## Project Information

Credit for parts of InputProcessing and the entirety of the lemmatizer functions:
https://erleem.medium.com/nlp-complete-sentiment-analysis-on-amazon-reviews-374e4fea9976

Credit to user Olivier Cruchant from StackOverflow for easy parallelization of functions using Base Python:
https://stackoverflow.com/questions/45545110/make-pandas-dataframe-apply-use-all-cores

System Information:
2020 M1 MacBook Pro, 16 GB Unified Memory