# 1. Importing Libraries and Datasets

## Project Description

This project begins with an exploratory data analysis (EDA) of product reviews from the Amazon website. The product of interest is Amazon's Echo in its different colors and designs. The performance of each product variation is analyzed based on star ratings and the sentiment of the review text. Oak Finish and Walnut Finish top the positivity percentage scale with 100% positive reviews each.

Three different ML models are then trained to predict the sentiment of a body of review text. The review texts were prepared for modelling by first cleaning them (removing stop words and punctuation) and then applying NLP techniques such as tokenization. The models trained are Naive Bayes, Logistic Regression and Gradient Boosting Classifier, with 92%, 94% and 90% accuracy respectively.

In [None]:
#importing all relevant packages

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [None]:
amazon_reviews = pd.read_csv('/kaggle/input/amazon-alexa-reviews/amazon_alexa.tsv', sep = '\t')
amazon_reviews.head()


In [None]:
amazon_reviews.info()

From the above, I can see that there are no missing values in any of the five columns. Each has a non-null count of 3150, which is the same as the total number of entries.
However, I am uncomfortable with the **rating** and **feedback** columns having a dtype of int64. I may run into a "***no memory***" error when performing vectorization at a later stage of the project, so I want to keep the DataFrame as small as possible. These columns contain only one-digit values, so a smaller dtype can hold them.
I will also convert the **date** column to a datetime object.
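To get a feel for how much the smaller dtype saves, the memory footprint of a column can be compared before and after the cast. The sketch below uses a synthetic `ratings` Series (an assumption, standing in for the real **rating** column) with the same number of entries as the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the rating column: 3150 one-digit values (1 to 5).
rng = np.random.default_rng(0)
ratings = pd.Series(rng.integers(1, 6, size=3150), dtype="int64")

# Bytes used by the values alone (index excluded) before and after the cast.
bytes_int64 = ratings.memory_usage(index=False)
bytes_int8 = ratings.astype(np.int8).memory_usage(index=False)

print(bytes_int64, bytes_int8)
```

An int8 value takes one byte instead of eight, so the saving per column is real but small at this dataset size.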

# 2. Exploratory Data Analysis

In [None]:
amazon_reviews['date'] = pd.to_datetime(amazon_reviews['date'], dayfirst = True)
amazon_reviews['rating'] = amazon_reviews['rating'].astype(np.int8)
amazon_reviews['feedback'] = amazon_reviews['feedback'].astype(np.int8)
amazon_reviews.info() #to confirm that I have successfully changed the dtypes of the selected columns


In [None]:
print(amazon_reviews.describe()) #to see what the min and max values are for the rating and feedback columns
print('\n')
print(amazon_reviews['date'].agg(['min', 'max'])) #to see what the min and max values are for the date column

Now, I have confirmed that the rating values range from 1 to 5 and the feedback values are 0s and 1s.
I also know that the data was gathered from May 16 to July 31, 2018.

Next, I'd like to investigate the ***variation*** column.

In [None]:
#what are the different Amazon product variations represented in our dataset?
#which ones are more popular?

print(f"There are {amazon_reviews['variation'].nunique()} unique entries in the variation column")
print('\n\n')
amazon_reviews['variation'].value_counts()

We can see that the most popular variations are 'Black Dot', 'Charcoal Fabric' and 'Configuration: Fire TV Stick'.
Next, let's see how this popularity relates to performance. Are the more popular variations also getting good reviews?
I assume that reviews with positive feedback (1) also have high ratings (say 3, 4 or 5). We can check the validity of this assumption with a seaborn catplot.
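The same assumption can also be checked numerically with a cross-tabulation. The sketch below uses a small `toy_reviews` DataFrame with made-up values (an assumption, standing in for `amazon_reviews`) so that it runs on its own.

```python
import pandas as pd

# Toy data standing in for amazon_reviews (hypothetical values).
toy_reviews = pd.DataFrame({
    "rating":   [5, 4, 3, 2, 1, 5, 1, 3],
    "feedback": [1, 1, 1, 0, 0, 1, 0, 1],
})

# Rows are ratings, columns are feedback values, cells are review counts.
rating_vs_feedback = pd.crosstab(toy_reviews["rating"], toy_reviews["feedback"])
print(rating_vs_feedback)

# The assumption holds if feedback is 1 exactly when the rating is 3 or higher.
assumption_holds = ((toy_reviews["rating"] >= 3) == (toy_reviews["feedback"] == 1)).all()
print(assumption_holds)
```

Swapping `toy_reviews` for `amazon_reviews` would run the same check on the real data.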

In [None]:
ax2 = sns.catplot(x="variation", hue="rating", col="feedback",
                data=amazon_reviews, kind="count",
                height=4, aspect = 7/4)
ax2.set_xticklabels(rotation = 90)

The chart above supports the assumption that reviews with positive feedback also have high ratings, so the assumption is valid.
The chart also shows many 5-star ratings for the most popular variations ('Black Dot', 'Charcoal Fabric' and 'Configuration: Fire TV Stick'), but we can't really tell whether these variations are actually the best performing, because their high number of 5-star ratings is a function of their popularity.
I will create a measure that is the percentage of positive feedback for each variation. This new measure removes the popularity bias and shows which variations have a higher ratio of positive feedback.

In [None]:
# Aggregate feedback per variation with named aggregations, which gives flat column names
performance_df = amazon_reviews.groupby('variation')['feedback'].agg(
    total_reviews = 'count',      # number of reviews per variation
    positivity_perc = 'mean',     # share of positive feedback (scaled to a percentage below)
    total_positives = 'sum')      # number of reviews with positive feedback
performance_df['positivity_perc'] = performance_df['positivity_perc'] * 100
performance_df.sort_values('positivity_perc', ascending = False)

Generally, we can see that all product variations have a high percentage of positive reviews (**positivity_perc**).
*Oak Finish* and *Walnut Finish* top the positivity percentage scale with 100% positive reviews each. But note that these are also our least reviewed variations. Is this a coincidence, or is there a relationship between the number of reviews and the positivity percentage?
Let's find out.

In [None]:
sns.scatterplot(x = 'total_reviews', y = 'positivity_perc', data  = performance_df )

It does not look like there is any relationship between the total number of reviews and the positivity percentage of a product variation.
What if we wanted to determine which variation was performing best? The number of reviews alone is not a good measure, because a variation could top the list for most reviewed yet perform only moderately on the positivity scale.
A better way to judge performance would be to combine both the number of reviews and the positivity percentage.
I'll get to this in a later version of the project.
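One possible way to combine the two, sketched here as an assumption and not as the method this project will eventually use, is the lower bound of the Wilson score interval. It discounts a positivity ratio that rests on only a handful of reviews. The function name `wilson_lower_bound` and the review counts below are hypothetical.

```python
import math

def wilson_lower_bound(positives, total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return 0.0
    p = positives / total
    centre = p + z**2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (centre - margin) / (1 + z**2 / total)

# Hypothetical counts: a perfect score on few reviews vs. 90% positive on many reviews.
few_reviews = wilson_lower_bound(10, 10)
many_reviews = wilson_lower_bound(450, 500)
print(round(few_reviews, 3), round(many_reviews, 3))
```

With these made-up counts, the heavily reviewed variation scores higher than the one with a perfect record on only ten reviews, which is the behaviour a combined measure should have.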

In [None]:
plt.figure(figsize = (9,8))
sns.set_style('darkgrid')
ax1 = sns.countplot(x = 'rating', data = amazon_reviews)
for p in ax1.patches:
    ax1.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='center', va='top', color='white', size=10)
ax1.set_title('Count of Different Ratings', fontsize=12)
plt.show()

In [None]:
plt.figure(figsize = (7,5))
ax3= sns.countplot(x = 'feedback', data = amazon_reviews, palette = ['#FF0000', '#00FF00'])
for p in ax3.patches:
    ax3.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='center', va='top', color='white', size=10)
ax3.set_xticklabels(['Negative', 'Positive'])                
ax3.set_title('Count of Negative and Positive Feedback', fontsize=12)    
plt.show()

In [None]:
amazon_reviews['review_length'] = amazon_reviews['verified_reviews'].apply(len)
print(amazon_reviews['review_length'].describe())
print('\n\n')
amazon_reviews[amazon_reviews['review_length'] == 1]

In [None]:
amazon_reviews.hist('review_length', bins = 50)
plt.show()

In [None]:
sns.heatmap(amazon_reviews.isnull(), yticklabels = False, cbar = False, cmap="Blues")

# 3. Plotting the Word Cloud

A word cloud helps you easily visualize the most popular words in a string. I want to see what a word cloud for all the reviews together looks like. I also want to see a word cloud for the positive and negative reviews separately.
The pipeline for creating a simple word cloud is as follows:
1. Get the reviews column and turn it into a list using the *df[col].tolist()* method.
2. Put all the reviews into one large string using *" ".join()* (joining on a space keeps the last word of one review from fusing with the first word of the next).
3. Import WordCloud from wordcloud and plot.
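The choice of separator in step 2 matters for word counts, and therefore for the word cloud. The sketch below uses three made-up reviews (an assumption, standing in for the real review list) to compare joining with an empty string against joining with a space.

```python
from collections import Counter

# Hypothetical reviews standing in for amazon_reviews['verified_reviews'].tolist()
sample_reviews = ["Love my Echo", "Echo works great", "Great sound"]

fused = "".join(sample_reviews)    # no separator: neighbouring reviews run together
spaced = " ".join(sample_reviews)  # space separator: word boundaries are preserved

fused_counts = Counter(fused.split())
spaced_counts = Counter(spaced.lower().split())

print(fused_counts)
print(spaced_counts)
```

Without a separator, "Echo" at the end of one review and "Echo" at the start of the next are counted as the single token "EchoEcho".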


In [None]:
positive_reviews = amazon_reviews[amazon_reviews['feedback'] == 1]
print(positive_reviews[:4])

In [None]:
negative_reviews = amazon_reviews[amazon_reviews['feedback'] == 0]
print(negative_reviews[:4])

In [None]:
all_reviews = amazon_reviews['verified_reviews'].tolist()
len(all_reviews)

In [None]:
# Join on a space so the last word of one review does not fuse with the first word of the next
reviews_as_one_string = " ".join(all_reviews)
#print(reviews_as_one_string) #uncomment to view output
positive_reviews_as_one_string = " ".join(positive_reviews['verified_reviews'].tolist())
#print(positive_reviews_as_one_string) #uncomment to view output
negative_reviews_as_one_string = " ".join(negative_reviews['verified_reviews'].tolist())
#print(negative_reviews_as_one_string) #uncomment to view output

In [None]:
!pip install wordcloud

In [None]:
from wordcloud import WordCloud

plt.figure(figsize=(15,15))
plt.imshow(WordCloud().generate(reviews_as_one_string)) 

In [None]:
plt.figure(figsize=(15,15))
plt.imshow(WordCloud().generate(positive_reviews_as_one_string))

In [None]:
plt.figure(figsize=(15,15))
plt.imshow(WordCloud().generate(negative_reviews_as_one_string))

# 4. Data Cleaning (Removal of Punctuation and Stopwords)

In [None]:
import string
string.punctuation #common punctuation characters we would want to clean out of our reviews

In [None]:
import nltk # Natural Language tool kit 
nltk.download('stopwords')

from nltk.corpus import stopwords
#stopwords.words('english') #uncomment to see common stopwords we would want to remove from our reviews

I would prefer to use a simple function to easily rid any given string of punctuation and stopwords.

In [None]:
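# Build the stop word set once; a set lookup avoids reloading the list for every word
english_stopwords = set(stopwords.words('english'))

def text_cleaner(text):
    # Remove punctuation characters, then rebuild the string
    punc_free_text = [char for char in text if char not in string.punctuation]
    joined_punc_free_text = "".join(punc_free_text)
    # Keep only the words that are not English stop words (case-insensitive)
    cleaned_text = [word for word in joined_punc_free_text.split()
                    if word.lower() not in english_stopwords]
    return cleaned_text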

In [None]:
testing_text = "I am$$ going to use this sample (text) to test my new #function.... Let us see if it works!!!"
text_cleaner(testing_text)
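To see what this kind of cleaner returns without downloading the NLTK data, here is a minimal sketch using a small hand-picked stop word set. Both `SAMPLE_STOPWORDS` and `simple_text_cleaner` are hypothetical names for illustration; NLTK's English list is much longer, so the real function removes more words.

```python
import string

# Small hand-picked stop word set (an assumption; NLTK's English list is much longer).
SAMPLE_STOPWORDS = {"i", "am", "to", "this", "my", "us", "if", "it"}

def simple_text_cleaner(text, stop_words=SAMPLE_STOPWORDS):
    # Drop punctuation characters, then drop stop words (case-insensitive).
    no_punc = "".join(char for char in text if char not in string.punctuation)
    return [word for word in no_punc.split() if word.lower() not in stop_words]

cleaned = simple_text_cleaner("I am$$ going to use this sample (text) to test my new #function....")
print(cleaned)
# ['going', 'use', 'sample', 'text', 'test', 'new', 'function']
```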

# 5. Perform Count Vectorization (Tokenization)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Call the cleaning function we defined earlier
vectorizer = CountVectorizer(analyzer = text_cleaner)
reviews_countvectorizer = vectorizer.fit_transform(amazon_reviews['verified_reviews'])

In [None]:
print(vectorizer.get_feature_names_out()) #these are all the unique tokens in the verified reviews column

In [None]:
print(reviews_countvectorizer.toarray())  
print('\n\n')
reviews_countvectorizer.shape #rows represent total number of reviews, columns represent number of unique tokens

# 6. Train and Evaluate Different Models

In [None]:
X = pd.DataFrame(reviews_countvectorizer.toarray())
y = amazon_reviews['feedback']
print(X.shape, y.shape) 
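If memory becomes a concern, one option is to skip the dense conversion: `train_test_split` and `MultinomialNB` both accept SciPy sparse matrices. The sketch below shows this on a handful of made-up documents and labels (assumptions for illustration), not on the review data.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical documents and labels (1 = positive, 0 = negative).
toy_docs = ["love it", "great sound", "works great", "love the sound",
            "stopped working", "terrible sound", "does not work", "waste of money"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# fit_transform returns a sparse matrix; it is passed on without calling .toarray().
sparse_counts = CountVectorizer().fit_transform(toy_docs)
X_tr, X_te, y_tr, y_te = train_test_split(
    sparse_counts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels)

toy_classifier = MultinomialNB().fit(X_tr, y_tr)
toy_predictions = toy_classifier.predict(X_te)
print(issparse(X_tr), X_tr.shape, len(toy_predictions))
```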

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

## Using the Naive Bayes Classifier Model...

In [None]:
from sklearn.naive_bayes import MultinomialNB

NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# Predicting the Test set results
y_predict_test = NB_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot=True)

In [None]:
print(classification_report(y_test, y_predict_test))

The Naive Bayes model has a 92% precision.
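The classification report lists precision, recall, F1 and accuracy, and these are not interchangeable, especially when one class dominates. The sketch below computes three of them on made-up labels (an assumption, not the model's actual predictions) to show how they can differ.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 8 positive and 2 negative reviews, with one negative misclassified.
y_true_toy = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true_toy, y_pred_toy)                # share of all predictions that are correct
precision_pos = precision_score(y_true_toy, y_pred_toy)          # of predicted positives, share truly positive
recall_neg = recall_score(y_true_toy, y_pred_toy, pos_label=0)   # of true negatives, share that were found

print(accuracy, round(precision_pos, 3), recall_neg)
```

Here accuracy is 0.9 and precision for the positive class is about 0.889, yet only half of the negative reviews are caught, which is why the per-class rows of the report are worth reading.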

## Using the Logistic Regression Model...

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# confusion_matrix expects the true labels first, then the predictions
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True)

print(classification_report(y_test, y_pred))

The Logistic Regression model has a 94% precision.

## Using the Gradient Boosting Classifier Model...

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# confusion_matrix expects the true labels first, then the predictions
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True)

print(classification_report(y_test, y_pred))

The Gradient Boosting Classifier model has a 90% precision.

## The Logistic Regression gives the best precision.