## Amazon Alexa Reviews

![alexa1.jfif](attachment:2e49f29b-4c17-40d3-904f-ef2f12212032.jfif)

#### Problem Statement

The Amazon Alexa reviews dataset consists of approximately 3,000 customer reviews and the corresponding star ratings for various Alexa products, such as the Alexa Echo and Echo Dot.

Since the customer reviews are available as text along with star ratings (1-5), we can treat this as a multiclass classification problem and build a machine learning model that predicts the rating for any given customer review.

In this notebook, we convert the problem into a binary classification problem by encoding the ratings as follows:
* Rating 0: ratings below 3 (i.e. 1, 2) - Negative Sentiment
* Rating 1: ratings of 3 and above (i.e. 3, 4, 5) - Positive Sentiment
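
The encoding above can be sketched in pandas as follows. This is a minimal illustration on a made-up DataFrame: the column name `sentiment` and the sample ratings are assumptions for the example, not part of the dataset. The notebook itself uses the dataset's `feedback` column as the target further down.

```python
import pandas as pd

# Hypothetical sample of star ratings (not taken from the dataset)
sample_df = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})

# Ratings below 3 become 0 (negative); ratings of 3 and above become 1 (positive)
sample_df['sentiment'] = (sample_df['rating'] >= 3).astype(int)

print(sample_df)
```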

This task is also known as **sentiment analysis or opinion mining**. It relies on natural language processing (NLP) techniques to analyze text and determine whether it expresses positive or negative sentiment.

### Import Libraries

In [None]:
import os

In [None]:
import numpy as np
import pandas as pd

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import string

In [None]:
ROOT_DIR = '../input'
DATA_DIR = os.path.join(ROOT_DIR,'amazon-alexa-reviews')

In [None]:
print(DATA_DIR)

### Import Amazon Alexa Reviews data

In [None]:
# Read the source data present in .tsv format
reviews_df = pd.read_csv(os.path.join(DATA_DIR,'amazon_alexa.tsv'), sep = '\t')

In [None]:
print('Number of observations: ',reviews_df.shape[0])
print('Number of features: ',reviews_df.shape[1])

In [None]:
# Print DataFrame column names
reviews_df.columns

In [None]:
# Display the first 5 instances (observations)
reviews_df.head(n = 5)

In [None]:
# Display bottom 5 instances or observations
reviews_df.tail(n = 5)

In [None]:
reviews_df.index

In [None]:
# To check missing value count per feature
reviews_df.isnull().sum()

In [None]:
# Print the metadata of Pandas DataFrame
reviews_df.info()

### Exploratory Data Analysis (EDA)

In [None]:
reviews_df['rating'].value_counts()

In [None]:
reviews_df['date'].value_counts().head(n = 5)

In [None]:
reviews_df['variation'].value_counts()

In [None]:
reviews_df['feedback'].value_counts()

In [None]:
reviews_df['verified_reviews'].iloc[19]

In [None]:
reviews_df['verified_reviews'].iloc[67]

In [None]:
# Lowercase all reviews; fillna('') guards against any missing review text
reviews_df['verified_reviews'] = reviews_df['verified_reviews'].fillna('').str.lower()

In [None]:
reviews_df['verified_reviews'].iloc[199]

In [None]:
def remove_punctuation(text):
    # Iterate character by character and drop anything listed in string.punctuation
    return "".join([char for char in text if char not in string.punctuation])

In [None]:
reviews_df['verified_reviews'] = reviews_df['verified_reviews'].apply(remove_punctuation)

In [None]:
reviews_df['verified_reviews'].iloc[19]
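
For longer texts, the same result can usually be obtained more quickly with `str.translate`. The sketch below is an alternative, not the notebook's method; the function name `remove_punctuation_fast` and the sample sentence are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_fast(text):
    """Return text with all characters in string.punctuation removed."""
    return text.translate(PUNCT_TABLE)

sample = "love it! works great, would buy again."
cleaned = remove_punctuation_fast(sample)
print(cleaned)
```

Like the list-comprehension version, this only handles the ASCII characters in `string.punctuation`; typographic quotes or emoji would pass through unchanged.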

In [None]:
reviews_df['verified_reviews'] = reviews_df['verified_reviews'].str.strip()

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
# word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt') ('punkt_tab' on newer NLTK releases)
reviews_df['verified_reviews'] = reviews_df['verified_reviews'].apply(word_tokenize)

In [None]:
reviews_df['verified_reviews'].iloc[19][0:9]

In [None]:
from nltk.corpus import stopwords

In [None]:
corpus = []

In [None]:
# Build the stop word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
for tokens in reviews_df['verified_reviews']:
    subset = [x for x in tokens if x not in stop_words]
    corpus.append(' '.join(subset))

In [None]:
corpus[0:4]

In [None]:
from nltk.stem.snowball import SnowballStemmer
ss = SnowballStemmer('english')

In [None]:
s_corpus = []
for sent in corpus:    
    s_text = [ss.stem(word) for word in sent.split()]
    s_text = ' '.join(s_text)
    s_corpus.append(s_text)

In [None]:
s_corpus[0:4]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer(max_features=5000)

In [None]:
X = cv.fit_transform(s_corpus).toarray()
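
To make the bag-of-words step concrete, here is a small sketch on a made-up corpus (the three sentences and the `toy_` names are assumptions, not dataset rows). Each row of the resulting matrix is a document and each column counts one vocabulary term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-cleaned toy corpus
toy_corpus = ["love echo", "echo sound great", "love sound"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each term to its column index; sort the terms by that index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(toy_X)
```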

In [None]:
y = reviews_df['feedback']

In [None]:
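from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

In [None]:
# Split before oversampling so the test set holds only real reviews (avoids data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=33, stratify=y)

In [None]:
# Fit SMOTE on the training data only; random_state makes the resampling reproducible
smote = SMOTE(random_state=33)
X_train, y_train = smote.fit_resample(X_train, y_train)

In [None]:
print(X_train.shape)
print(y_train.shape)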
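
Oversampling is generally safest when it is fitted on the training split only, so that the test set contains nothing but real observations. The sketch below illustrates that order on synthetic data. To stay dependency-light it uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE, and all array names and sizes are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(33)

# Synthetic imbalanced data: 90 positive rows, 10 negative rows
X_toy = rng.integers(0, 5, size=(100, 4))
y_toy = np.array([1] * 90 + [0] * 10)

# 1) Split first; stratify keeps the original class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=33, stratify=y_toy)

# 2) Oversample the minority class in the training part only
minority = X_tr[y_tr == 0]
majority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=33)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.array([1] * len(majority) + [0] * len(minority_up))

print(np.bincount(y_tr_bal))  # class counts in the balanced training set
print(np.bincount(y_te))      # class counts in the untouched test set
```

The training labels end up balanced while the test labels keep the original imbalance, which is what the model will face on new reviews.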

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
nb_model = MultinomialNB().fit(X_train,y_train)

In [None]:
y_pred = nb_model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, classification_report, f1_score, roc_auc_score

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
f1_score(y_test, y_pred)

In [None]:
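# ROC AUC is computed from predicted probabilities rather than hard class labels
roc_auc_score(y_test, nb_model.predict_proba(X_test)[:, 1])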
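
`roc_auc_score` measures how well a model ranks positives above negatives, so it is normally given scores or probabilities. The toy example below uses made-up labels and scores to show how thresholding to hard labels first can change the result.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.9]

# Hard labels obtained with a 0.5 threshold
y_hard = [int(s >= 0.5) for s in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_hard)

print(auc_from_scores, auc_from_labels)
```

Every positive is ranked above every negative here, so the score-based AUC is perfect, while the thresholded labels discard part of that ranking information.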

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
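# Raise max_iter above the default (100) to reduce the chance of a convergence warning
lr = LogisticRegression(max_iter=1000)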

In [None]:
lr_model = lr.fit(X_train,y_train)

In [None]:
y_pred = lr_model.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
f1_score(y_test, y_pred)

In [None]:
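# ROC AUC is computed from predicted probabilities rather than hard class labels
roc_auc_score(y_test, lr_model.predict_proba(X_test)[:, 1])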
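
As an optional refinement that is not part of the original notebook, the vectorizer and classifier can be wrapped in a scikit-learn `Pipeline`, so that the vocabulary is learned from the training texts only and new reviews go through the same steps. The sketch below uses a tiny made-up corpus; all texts, labels, and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical cleaned training reviews with binary sentiment labels
train_texts = ["love it great", "works great love",
               "terrible broken", "stopped working terrible"]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in one call
sentiment_clf = Pipeline([
    ('bow', CountVectorizer()),
    ('nb', MultinomialNB()),
])
sentiment_clf.fit(train_texts, train_labels)

# Raw text goes in; vectorization happens inside the pipeline
preds = sentiment_clf.predict(["great love", "terrible broken"])
print(preds)
```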