## AI Capstone - Luke J Nonyane

### DESCRIPTION
Problem Statement
•	Amazon is an online shopping website that caters to millions of people worldwide. Over 34,000 consumer reviews of Amazon-brand products such as the Kindle, Fire TV Stick, and more are provided.
•	The dataset has attributes such as brand, categories, primary categories, reviews.title, reviews.text, and sentiment. Sentiment is a categorical variable with three levels: "Positive", "Negative", and "Neutral". The sentiment of unseen data needs to be predicted.
•	You are required to predict the sentiment, or satisfaction, of a purchase based on multiple features and the review text.

#### Project Task: Week 1

##### EDA - Class Imbalance Problem
•	See what a positive, negative, and neutral review looks like

•	Check the class count for each class. It’s a class imbalance problem.


In [1]:
# load data
# train
import pandas as pd
train_data = pd.read_csv('data/train_data.csv')

# test
test_data = pd.read_csv('data/test_data.csv')

print(train_data.shape, test_data.shape)
train_data.head()

(4000, 8) (1000, 7)


Unnamed: 0,name,brand,categories,primaryCategories,reviews.date,reviews.text,reviews.title,sentiment
0,"All-New Fire HD 8 Tablet, 8"" HD Display, Wi-Fi...",Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...",Electronics,2016-12-26T00:00:00.000Z,Purchased on Black FridayPros - Great Price (e...,Powerful tablet,Positive
1,Amazon - Echo Plus w/ Built-In Hub - Silver,Amazon,"Amazon Echo,Smart Home,Networking,Home & Tools...","Electronics,Hardware",2018-01-17T00:00:00.000Z,I purchased two Amazon in Echo Plus and two do...,Amazon Echo Plus AWESOME,Positive
2,Amazon Echo Show Alexa-enabled Bluetooth Speak...,Amazon,"Amazon Echo,Virtual Assistant Speakers,Electro...","Electronics,Hardware",2017-12-20T00:00:00.000Z,Just an average Alexa option. Does show a few ...,Average,Neutral
3,"Fire HD 10 Tablet, 10.1 HD Display, Wi-Fi, 16 ...",Amazon,"eBook Readers,Fire Tablets,Electronics Feature...","Office Supplies,Electronics",2017-08-04T00:00:00.000Z,"very good product. Exactly what I wanted, and ...",Greattttttt,Positive
4,"Brand New Amazon Kindle Fire 16gb 7"" Ips Displ...",Amazon,"Computers/Tablets & Networking,Tablets & eBook...",Electronics,2017-01-23T00:00:00.000Z,This is the 3rd one I've purchased. I've bough...,Very durable!,Positive


In [2]:
test_data.head()

Unnamed: 0,name,brand,categories,primaryCategories,reviews.date,reviews.text,reviews.title
0,"Fire Tablet, 7 Display, Wi-Fi, 16 GB - Include...",Amazon,"Fire Tablets,Computers/Tablets & Networking,Ta...",Electronics,2016-05-23T00:00:00.000Z,Amazon kindle fire has a lot of free app and c...,very handy device
1,Amazon Echo Show Alexa-enabled Bluetooth Speak...,Amazon,"Computers,Amazon Echo,Virtual Assistant Speake...","Electronics,Hardware",2018-01-02T00:00:00.000Z,The Echo Show is a great addition to the Amazo...,Another winner from Amazon
2,"All-New Fire HD 8 Tablet, 8"" HD Display, Wi-Fi...",Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...",Electronics,2017-01-02T00:00:00.000Z,Great value from Best Buy. Bought at Christmas...,simple to use and reliable so far
3,"Brand New Amazon Kindle Fire 16gb 7"" Ips Displ...",Amazon,"Computers/Tablets & Networking,Tablets & eBook...",Electronics,2017-03-25T00:00:00.000Z,"I use mine for email, Facebook ,games and to g...",Love it!!!
4,Amazon Echo Show Alexa-enabled Bluetooth Speak...,Amazon,"Computers,Amazon Echo,Virtual Assistant Speake...","Electronics,Hardware",2017-11-15T00:00:00.000Z,This is a fantastic item & the person I bought...,Fantastic!


In [3]:
# check the class count for each sentiment
train_data['sentiment'].value_counts()

Positive    3749
Neutral      158
Negative      93
Name: sentiment, dtype: int64

The classes are heavily imbalanced: 3,749 of the 4,000 training reviews (about 94%) are Positive, against 158 Neutral and 93 Negative.
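To put numbers on the imbalance, the counts printed above can be turned into class shares. This is a small sketch in plain Python; the counts are copied from the `value_counts()` output, and the names `counts`, `shares`, and `majority_baseline` are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {"Positive": 3749, "Neutral": 158, "Negative": 93}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<8} {n if (n := counts[label]) else 0:>5}  {share:.4f}")

# Accuracy a model would reach by always predicting the most frequent class.
majority_baseline = max(shares.values())
print("majority-class baseline accuracy:", majority_baseline)
```

Any accuracy figure reported later is worth comparing against this baseline, since a classifier that ignores the review text entirely would already score about 0.94 on data with this distribution.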

In [4]:
# text clean up
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('stopwords')
import re
import string
import warnings
warnings.filterwarnings('ignore')
import numpy as np
#words = set(nltk.corpus.words.words())
import matplotlib.pyplot as plt

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
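# text-cleanup method
# relies on the module-level `lemmatizer` and `stop_words` defined in a later cell
def cleanup_text(df, column):
    # replace punctuation/underscores with spaces, tokenize, drop stop words,
    # then lowercase and lemmatize; return the cleaned column
    return df[column].apply(lambda x : ' '.join\
     ([lemmatizer.lemmatize\
      (word.lower()) \
      for word in word_tokenize\
      (re.sub(r'([^\s\w]|_)+', ' ',\
       str(x))) if word not in stop_words]))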

In [6]:
# TFIDF vectorizer method

In [7]:
lemmatizer = WordNetLemmatizer()
stop_words = stopwords.words('english')
stop_words = stop_words + list(string.printable)

# extract tokens from each 'review' of the dataframe using lambda function.
# check whether tokens are stop words, lemmatize them and join them side by side using join function.
# replace anything other than digits, letters, and white spaces with blank spaces.

train_data['cleaned_reviews.text'] = train_data['reviews.text']\
.apply(lambda x : ' '.join\
 ([lemmatizer.lemmatize\
  (word.lower()) \
  for word in word_tokenize\
  (re.sub(r'([^\s\w]|_)+', ' ',\
   str(x))) if word not in stop_words])) 

In [8]:
train_data[['cleaned_reviews.text', 'reviews.text']].head()

Unnamed: 0,cleaned_reviews.text,reviews.text
0,purchased black fridaypros great price even sa...,Purchased on Black FridayPros - Great Price (e...
1,purchased two amazon echo plus two dot plus fo...,I purchased two Amazon in Echo Plus and two do...
2,just average alexa option doe show thing scree...,Just an average Alexa option. Does show a few ...
3,good product exactly wanted good price,"very good product. Exactly what I wanted, and ..."
4,this 3rd one purchased bought one niece no cas...,This is the 3rd one I've purchased. I've bough...
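Rows 2 and 4 above still begin with "just" and "this", although both are common stop words. The likely cause is that the cleaning expression tests `word not in stop_words` before calling `word.lower()`, so capitalised tokens such as "Just" never match the lowercase stop-word list. The sketch below illustrates the effect with the standard library only; `toy_stop_words` and `clean` are hypothetical stand-ins, and lemmatization and `word_tokenize` are left out to keep it self-contained.

```python
import re

# Tiny hypothetical stop-word set; the notebook uses NLTK's English list instead.
toy_stop_words = {"just", "an", "this", "the"}

def clean(text, lowercase_first):
    # Replace punctuation and underscores with spaces, as in the cell above.
    tokens = re.sub(r'([^\s\w]|_)+', ' ', text).split()
    if lowercase_first:
        # Lowercase before the stop-word test so "Just" matches "just".
        tokens = [t.lower() for t in tokens]
    kept = [t.lower() for t in tokens if t not in toy_stop_words]
    return ' '.join(kept)

sample = "Just an average Alexa option."
as_in_notebook = clean(sample, lowercase_first=False)
case_insensitive = clean(sample, lowercase_first=True)

print(as_in_notebook)
print(case_insensitive)
```

Lowercasing first is one possible adjustment; whether it helps the classifier here is not something the outputs above can confirm.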


In [9]:
# remove terms with a length of 1
train_data['extra_cleaned_reviews.text'] = train_data['cleaned_reviews.text']\
.apply(lambda x : ' '.join\
 ([lemmatizer.lemmatize\
  (word.lower()) \
  for word in word_tokenize\
  (re.sub(r'\b\w\b', ' ',\
   str(x)))]))
train_data[['extra_cleaned_reviews.text', 'cleaned_reviews.text']].head()

Unnamed: 0,extra_cleaned_reviews.text,cleaned_reviews.text
0,purchased black fridaypros great price even sa...,purchased black fridaypros great price even sa...
1,purchased two amazon echo plus two dot plus fo...,purchased two amazon echo plus two dot plus fo...
2,just average alexa option doe show thing scree...,just average alexa option doe show thing scree...
3,good product exactly wanted good price,good product exactly wanted good price
4,this 3rd one purchased bought one niece no cas...,this 3rd one purchased bought one niece no cas...


In [10]:
# TFIDF vectorizer with 5000 max features
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_model = TfidfVectorizer(max_features=5000) # 5000 is an arbitrary value.
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(train_data['extra_cleaned_reviews.text']).todense()) # todense() creates matrix
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()

Unnamed: 0,00,10,100,1000,1080,10th,10x,11,11yr,12,...,äù,äú,äúalexa,äúbest,äúdropping,äúdual,äúshow,äúskills,äústar,äúthings
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
# clean test data reviews.text
test_data['cleaned_reviews.text'] = test_data['reviews.text']\
.apply(lambda x : ' '.join\
 ([lemmatizer.lemmatize\
  (word.lower()) \
  for word in word_tokenize\
  (re.sub(r'([^\s\w]|_)+', ' ',\
   str(x))) if word not in stop_words]))

# remove terms with a length of 1
test_data['extra_cleaned_reviews.text'] = test_data['cleaned_reviews.text']\
.apply(lambda x : ' '.join\
 ([lemmatizer.lemmatize\
  (word.lower()) \
  for word in word_tokenize\
  (re.sub(r'\b\w\b', ' ',\
   str(x)))]))
test_data[['extra_cleaned_reviews.text', 'cleaned_reviews.text']].head()

Unnamed: 0,extra_cleaned_reviews.text,cleaned_reviews.text
0,amazon kindle fire lot free app used one want ...,amazon kindle fire lot free app used one want ...
1,the echo show great addition amazon family wor...,the echo show great addition amazon family wor...
2,great value best buy bought christmas sale,great value best buy bought christmas sale
3,use mine email facebook game go line also load...,use mine email facebook game go line also load...
4,this fantastic item person bought love,this fantastic item person bought love


In [12]:
# TFIDF vectorizer for the test set
# NOTE: fit_transform() refits tfidf_model on the test reviews, so the vocabulary
# (and the columns shown below) differs from the training matrix tfidf_df.
from sklearn.feature_extraction.text import TfidfVectorizer
#tfidf_model = TfidfVectorizer(max_features=5000) # 5000 is an arbitrary value.
tfidf_df_test = pd.DataFrame(tfidf_model.fit_transform(test_data['extra_cleaned_reviews.text']).todense()) # todense() creates matrix
tfidf_df_test.columns = sorted(tfidf_model.vocabulary_)
tfidf_df_test.head()

Unnamed: 0,00,10,100,105,11,12,128,128gb,139,15,...,äôre,äôs,äôt,äôve,äù,äùcrestron,äú,äúalexa,äúthings,ôºå
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


TF-IDF representation of the cleaned-up 'reviews.text' column.
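The train and test tables above have different column sets because the vectorizer was fitted separately on each. A model trained on one set of columns cannot score a matrix with another, so the usual pattern is to fit on the training text once and only call `transform` on the test text. The sketch below shows that pattern on made-up miniature corpora; `train_docs`, `test_docs`, and the `_toy` names are illustrative assumptions, not data from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up miniature corpora standing in for the cleaned train and test reviews.
train_docs = ["great tablet great price", "average speaker", "battery died quickly"]
test_docs = ["great speaker", "screen cracked quickly"]

vectorizer = TfidfVectorizer(max_features=5000)
X_train_toy = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test_toy = vectorizer.transform(test_docs)        # reuses them; nothing is refitted

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X_train_toy.shape, X_test_toy.shape)
```

Words that occur only in the test documents ("screen", "cracked") are dropped, and both matrices share the same columns in the same order.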

###### Multinomial Naive Bayes Classifier

In [13]:
# training and validation sets from train_data
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split\
                                     (tfidf_df, train_data['sentiment'],\
                                      test_size=0.2, \
                                      random_state=42, \
                                      stratify=train_data['sentiment'])

In [14]:
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((3200, 4370), (3200,), (800, 4370), (800,))

In [15]:
# classifier method
def clf_model(model_type, X_train, y_train, X_valid):
    model = model_type.fit(X_train, y_train)
    predicted_labels = model.predict(X_valid)
    # column 1 of predict_proba is the probability of the second class in model.classes_
    predicted_probab = model.predict_proba(X_valid)[:,1]
    return [predicted_labels, predicted_probab, model]

In [16]:
# model evaluation method
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import precision_recall_curve, auc, roc_curve
def model_evaluation(actual_values, predicted_values, predicted_probabilities):
    # confusion matrix
    cfn_mat = confusion_matrix(actual_values, predicted_values)
    print("confusion matrix:",cfn_mat)
    print("\naccuracy:",accuracy_score(actual_values, predicted_values))
    print("\nclassification report:",classification_report(actual_values, predicted_values))
    # ROC curve and AUC; roc_curve expects a binary target
    fpr,tpr,threshold = roc_curve(actual_values,predicted_probabilities)
    print("\nArea under ROC curve for validation set:",auc(fpr,tpr))
    fig, ax = plt.subplots(figsize=(6,6))
    ax.plot(fpr,tpr,label='Validation set AUC')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    ax.legend(loc='best')
    plt.show()

In [17]:
def model_evaluation_mnb(actual_values, predicted_values, predicted_probabilities):
    # confusion matrix
    cfn_mat = confusion_matrix(actual_values, predicted_values)
    print("confusion matrix:",cfn_mat)
    print("\naccuracy:",accuracy_score(actual_values, predicted_values))
    print("\nclassification report:",classification_report(actual_values, predicted_values))
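`roc_curve` only accepts a binary target, which is presumably why this second evaluation function leaves the ROC step out for the three-class problem. If an AUC figure is still wanted, one option is scikit-learn's one-vs-rest `roc_auc_score`, which takes the full probability matrix rather than a single column. The sketch below uses synthetic count features, not the review data, and scores on the training rows purely to show the call shape.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Synthetic, made-up count features for three classes (not the review data).
class_names = ["Negative", "Neutral", "Positive"]
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(90, 6)).astype(float)
y = np.array(class_names * 30)

# Give each class one feature with higher counts so the toy problem is learnable.
for i, label in enumerate(class_names):
    X[y == label, 2 * i] += 5

model = MultinomialNB().fit(X, y)
proba = model.predict_proba(X)  # shape (n_samples, 3); columns follow model.classes_

# One-vs-rest AUC, macro-averaged over the three classes.
ovr_auc = roc_auc_score(y, proba, multi_class="ovr", labels=model.classes_)
print(model.classes_, ovr_auc)
```

The macro average weights each class equally, so the minority classes count as much as Positive in the final figure.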

In [18]:
#  MultinomialNB model
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
multinomialNB_results = clf_model(clf, X_train, y_train, X_valid)

In [19]:
# model evaluation
# actual values, pred values, pred probab
model_evaluation_mnb(y_valid, multinomialNB_results[0], multinomialNB_results[1])

confusion matrix: [[  0   0  19]
 [  0   0  31]
 [  0   0 750]]

accuracy: 0.9375

classification report:               precision    recall  f1-score   support

    Negative       0.00      0.00      0.00        19
     Neutral       0.00      0.00      0.00        31
    Positive       0.94      1.00      0.97       750

    accuracy                           0.94       800
   macro avg       0.31      0.33      0.32       800
weighted avg       0.88      0.94      0.91       800
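The confusion matrix shows that the model assigns every validation review to Positive, so the 0.9375 accuracy equals the Positive share of the validation set (750 of 800). The arithmetic can be checked directly from the printed matrix; the variable names below are illustrative.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted).
labels = ["Negative", "Neutral", "Positive"]
cm = [[0, 0, 19],
      [0, 0, 31],
      [0, 0, 750]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

# Per-class recall: correct predictions divided by actual count for that class.
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(3)}

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recall.values()) / len(recall)

print("accuracy:", accuracy)
print("recall:", recall)
print("balanced accuracy:", balanced_accuracy)
```

The balanced accuracy of one third matches the macro-average recall of 0.33 in the classification report and is a more informative summary than plain accuracy for data this skewed.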



##### Tackling Class Imbalance Problem