## Introduction

This dataset contains consumer reviews of selected online shopping products.

**Description of the data:**

- **`product_review.csv`** contains the dataset. 
- Each observation (row) in this dataset is a review of a particular product by a particular user.
- The **date** column is the date when the review was provided.
- The **product** column is the name of the product reviewed.
- The **category** column is the primary category of the product reviewed.
- The **text** column is the review text.
- The **user** column is the name of the user who wrote the review.
- The **rating** column is the number of stars (1 through 5) that the reviewer assigned to the product. More stars mean a better rating.

**Goal**:
 - Explore the data.
 - Generate training, validation, and test datasets before model building and prediction.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.
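
The tip above can be sketched as a small helper that reports the type and shape of an object. The helper name `describe_object` and the toy DataFrame are assumptions made for illustration; they are not part of the dataset or the assignment.

```python
import pandas as pd

def describe_object(name, obj):
    """Print the name, type, and shape of an array-like object and return the shape."""
    shape = getattr(obj, "shape", None)  # None for objects without a shape attribute
    print(f"{name}: type={type(obj).__name__}, shape={shape}")
    return shape

# Toy data standing in for the real dataset (hypothetical values).
toy_df = pd.DataFrame({
    "text": ["works well", "stopped working after a week", "great value"],
    "rating": [5, 1, 5],
})

toy_shape = describe_object("toy_df", toy_df)
print(toy_df.head())
```

Calling the helper after each step makes it easy to spot an unexpected number of rows or columns early.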

## Import libraries

In [2]:
import math
import warnings

import numpy
import pandas as pd
from scipy.sparse import vstack, hstack
import matplotlib.pyplot as plt
from pandas.core.common import SettingWithCopyWarning

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.dummy import DummyClassifier
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# import other libraries/functions if they are needed in your coding

Read **product_review.csv** into a Pandas DataFrame and find the **user** who has written the most reviews.

In [3]:
product_review_df = pd.read_csv("product_review.csv")
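# Count reviews per user and show the five most active users.
print(product_review_df[['text','user']].groupby('user').count().sort_values(by='text', ascending=False).head())
# 'ByAmazon Customer' looks like a generic placeholder rather than an individual account,
# so the most active named user is reported instead.
print('The username with the most number of written reviews is Mike')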

                   text
user                   
ByAmazon Customer   889
Mike                 63
ByKindle Customer    45
Dave                 44
Chris                38
The username with the most number of written reviews is Mike
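
The per-user count can also be computed with `value_counts`, which makes it easy to exclude generic names before picking the top user. This is a minimal sketch on toy data; the list `placeholder_names` is a hypothetical choice, and whether such names should be excluded is a judgment call.

```python
import pandas as pd

# Toy reviews (hypothetical values) with one generic placeholder username.
toy_reviews = pd.DataFrame({
    "user": ["ByAmazon Customer", "Mike", "ByAmazon Customer", "Mike", "Dave", "ByAmazon Customer"],
    "text": ["ok", "good", "bad", "fine", "great", "meh"],
})

# value_counts() sorts users by number of reviews, most frequent first.
review_counts = toy_reviews["user"].value_counts()

# Hypothetical list of placeholder names to exclude before picking the top user.
placeholder_names = ["ByAmazon Customer", "ByKindle Customer"]
named_counts = review_counts.drop(labels=placeholder_names, errors="ignore")

top_user = named_counts.idxmax()
print(top_user, named_counts.max())
```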



Create another column named `review_length`, which is the number of words in the review text.


In [4]:
product_review_df['review_length'] = product_review_df['text'].apply(lambda row : len(row.split()))
product_review_df.head()

Unnamed: 0,date,product,category,text,user,rating,review_length
0,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,I order 3 of them and one of the item is bad q...,Byger yang,3,31
1,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Bulk is always the less expensive way to go fo...,ByMG,4,13
2,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Well they are not Duracell but for the price i...,BySharon Lambert,5,12
3,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Seem to work as well as name brand batteries a...,Bymark sexson,5,14
4,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,These batteries are very long lasting the pric...,Bylinda,5,10
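
As an alternative to `apply` with a lambda, the word count can be computed with pandas string methods, which also makes it straightforward to guard against missing text. The sketch below uses toy data; whether the real `text` column contains missing values is not stated in the notebook.

```python
import pandas as pd

# Toy reviews (hypothetical values), including an empty and a missing review.
toy_reviews = pd.DataFrame({"text": ["Great batteries", "Lasts  a long time", "", None]})

# fillna("") guards against missing review text;
# str.split() with no argument splits on any run of whitespace.
toy_reviews["review_length"] = toy_reviews["text"].fillna("").str.split().str.len()
print(toy_reviews)
```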


What is the product (or products) with the maximum number of words in a single review?

In [5]:
print("Products with max review length:")
product_review_df[product_review_df['review_length'] == product_review_df['review_length'].max()][['product', 'review_length']]

Products with max review length:


Unnamed: 0,product,review_length
15434,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",1539
15435,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",1539
18411,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",1539
24278,"Fire HD 8 Tablet with Alexa, 8 HD Display, 32 ...",1539


Create a new DataFrame that contains only the products with more than `1000` reviews.

In [6]:
# Count reviews per product once, then keep the products with more than 1000 reviews.
review_counts = product_review_df.groupby('product')['text'].count()
filtered_products = review_counts[review_counts > 1000].index
product_review_df_post = product_review_df[product_review_df['product'].isin(filtered_products)]
product_review_df_post

Unnamed: 0,date,product,category,text,user,rating,review_length
0,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,I order 3 of them and one of the item is bad q...,Byger yang,3,31
1,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Bulk is always the less expensive way to go fo...,ByMG,4,13
2,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Well they are not Duracell but for the price i...,BySharon Lambert,5,12
3,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Seem to work as well as name brand batteries a...,Bymark sexson,5,14
4,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,These batteries are very long lasting the pric...,Bylinda,5,10
...,...,...,...,...,...,...,...
28327,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,I got 2 of these for my 8 yr old twins. My 11 ...,Mom2twinsplus1,5,29
28328,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,I bought this for my niece for a Christmas gif...,fireman21,4,18
28329,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,"Very nice for light internet browsing, keeping...",suzannalicious,5,57
28330,2017-03-06T14:59:43Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",Electronics,This Tablet does absolutely everything I want!...,SandyJ,5,43
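
The same filter can be written in one step with `groupby(...).transform('count')`, which attaches the per-product review count to every row. This is a sketch on toy data with a toy threshold (`min_reviews = 2`) rather than the 1000 used in the notebook.

```python
import pandas as pd

# Toy reviews (hypothetical values): product A has 3 reviews, B has 1, C has 2.
toy_reviews = pd.DataFrame({
    "product": ["A", "A", "A", "B", "C", "C"],
    "text": ["r1", "r2", "r3", "r4", "r5", "r6"],
})

min_reviews = 2  # toy threshold; the notebook uses 1000

# transform('count') returns, for every row, the number of reviews of that row's product.
reviews_per_product = toy_reviews.groupby("product")["text"].transform("count")
popular = toy_reviews[reviews_per_product > min_reviews].copy()
print(popular["product"].unique())
```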


Create a new DataFrame that contains only the reviews with a rating of 1, 2, or 5. Then create a new column `target`, whose value is 1 if the rating is 5 and 0 otherwise.

In [7]:
# .copy() makes the filtered DataFrame independent, which avoids pandas' SettingWithCopyWarning
# when the new column is assigned.
product_review_df_post = product_review_df_post[product_review_df_post['rating'].isin([1,2,5])].copy()
product_review_df_post['target'] = product_review_df_post['rating'].apply(lambda num: 1 if num == 5 else 0)
product_review_df_post.head()

Unnamed: 0,date,product,category,text,user,rating,review_length,target
2,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Well they are not Duracell but for the price i...,BySharon Lambert,5,12,1
3,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Seem to work as well as name brand batteries a...,Bymark sexson,5,14,1
4,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,These batteries are very long lasting the pric...,Bylinda,5,10,1
5,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,Bought a lot of batteries for Christmas and th...,ByPainter Marlow,5,48,1
6,2015-10-30T08:59:32Z,AmazonBasics AAA Performance Alkaline Batterie...,Health & Beauty,ive not had any problame with these batteries ...,ByAmazon Customer,5,17,1
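
A minimal sketch of the same labelling step on toy data, using a boolean comparison instead of a lambda. The DataFrame `toy_reviews` and the name `polar` are hypothetical.

```python
import pandas as pd

# Toy ratings (hypothetical values).
toy_reviews = pd.DataFrame({"rating": [1, 2, 3, 4, 5, 5]})

# Keep only the clearly negative (1, 2) and clearly positive (5) reviews.
# .copy() makes the filtered frame independent of the original.
polar = toy_reviews[toy_reviews["rating"].isin([1, 2, 5])].copy()

# A boolean comparison cast to int gives 1 for five-star reviews and 0 otherwise.
polar["target"] = (polar["rating"] == 5).astype(int)
print(polar)
```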


Define X and y from the new DataFrame, using `text` and `product` as the only features and `target` as the target variable. Then split X and y into training and testing sets.


In [8]:
X = product_review_df_post[['text', 'product']]
y = product_review_df_post['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022, shuffle=True, stratify=y)
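
The goal at the top also mentions a validation set. One common way to obtain it, sketched here on toy data, is to call `train_test_split` twice: once to hold out the test set and once to carve a validation set out of the remainder. The 60/20/20 proportions and all `*_toy` names are assumptions for illustration, not requirements of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y (hypothetical data): 100 samples, 30% positive.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 30 + [0] * 70)

# First hold out 20% of the data as the test set.
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2022, stratify=y_toy)

# Then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 of the total).
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=2022, stratify=y_rest)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))
```

Stratifying both splits keeps the share of positive labels roughly equal across the three sets.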

Use CountVectorizer to create **document-term matrices** from the `text` column of **X_train** and **X_test**.

In [9]:
vectorizer = CountVectorizer()

training_set_matrix = vectorizer.fit_transform(X_train['text'])
test_set_matrix = vectorizer.transform(X_test['text'])

print(training_set_matrix.shape)
print(test_set_matrix.shape)

(13724, 6762)
(3432, 6762)


Use one-hot encoding to process the feature **product**.



In [10]:
enc = OneHotEncoder(handle_unknown='ignore').fit(X_train[['product']])

training_set_one_hot_encoded = enc.transform(X_train[['product']])
test_set_one_hot_encoded = enc.transform(X_test[['product']])

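# get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out().
print(enc.get_feature_names())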
print(training_set_one_hot_encoded.shape)
print(test_set_one_hot_encoded.shape)

['x0_All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi, 16 GB - Includes Special Offers, Black'
 'x0_AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary'
 'x0_AmazonBasics AAA Performance Alkaline Batteries (36 Count)'
 'x0_Fire HD 8 Tablet with Alexa, 8 HD Display, 16 GB, Tangerine - with Special Offers'
 'x0_Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Blue Kid-Proof Case'
 'x0_Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Green Kid-Proof Case'
 'x0_Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Pink Kid-Proof Case'
 'x0_Fire Tablet, 7 Display, Wi-Fi, 16 GB - Includes Special Offers, Black']
(13724, 8)
(3432, 8)
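
The option `handle_unknown='ignore'` matters when the test set contains a product that never appears in the training set: such a row is encoded as all zeros instead of raising an error. A minimal sketch with hypothetical product names:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy product columns (hypothetical values).
train_products = pd.DataFrame({"product": ["Tablet", "Batteries", "Tablet"]})
test_products = pd.DataFrame({"product": ["Batteries", "Headphones"]})  # "Headphones" is unseen

toy_enc = OneHotEncoder(handle_unknown="ignore").fit(train_products)
encoded_test = toy_enc.transform(test_products).toarray()

print(toy_enc.categories_)
print(encoded_test)
```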


Concatenate the feature matrices from **CountVectorizer** and **one-hot encoding** for both the train and test datasets.

In [11]:
training_set = hstack([training_set_matrix, training_set_one_hot_encoded]).toarray()
test_set = hstack([test_set_matrix, test_set_one_hot_encoded]).toarray()
training_set.shape

(13724, 6770)
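
The combined matrix does not have to be converted to a dense array: `hstack` returns a sparse matrix, and many scikit-learn estimators, including `MultinomialNB`, accept sparse input. A minimal sketch on toy matrices (all values hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins for the document-term matrix and the one-hot matrix.
text_features = csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))
product_features = csr_matrix(np.array([[1, 0], [0, 1]]))

# hstack places the columns side by side; tocsr() gives a row-sliceable sparse matrix.
combined = hstack([text_features, product_features]).tocsr()

print(combined.shape)
print(combined.toarray())
```

Keeping the matrix sparse saves memory when most entries are zero, which is typical for word counts.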

Create a chatbot using the concepts of vectorization and cosine similarity. For the chatbot, you will use a repository of questions and answers gathered from an online shopping website for electronic items. Because it is built on Q&A data for electronic items, your chatbot could be deployed as automated Q&A support under the Electronic Items section. The corpus **Electronics_QA.json** is in a JavaScript Object Notation (JSON)-like format. It contains multiple features for each Q&A pair, but you will only use the features **question** and **answer**.

## Import libraries

In [135]:
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

#library for loading json file
import ast
import json

# import other libraries/functions if they are needed in your coding
from nltk.corpus import stopwords
import math
import warnings

In [136]:
# Each line of the file holds one record written as a Python dict literal,
# so it is parsed with ast.literal_eval rather than json.loads.
questions = []
answers = []
with open('Electronics_QA.json', 'r') as file_read:
    for line in file_read:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        qa = ast.literal_eval(line)
        questions.append(qa['question'].lower())
        answers.append(qa['answer'].lower())

print(len(answers))
print(len(questions))


314263
314263
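
The reason `ast.literal_eval` is used instead of `json.loads` can be sketched as follows. The sample lines are hypothetical and assume that the records are Python dict literals with single quotes, which strict JSON does not allow; the helper `parse_line` is not part of the notebook.

```python
import ast
import json

# Hypothetical sample lines in the assumed style of the corpus: dict literals with single quotes.
sample_lines = [
    "{'question': 'Does it come with a charger?', 'answer': 'Yes, it does.'}",
    "{'question': 'Is it waterproof?', 'answer': 'No.'}",
]

def parse_line(line):
    """Parse one record, trying strict JSON first and falling back to a Python literal."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Single-quoted keys and strings are not valid JSON but are valid Python literals.
        return ast.literal_eval(line)

records = [parse_line(line) for line in sample_lines]
print(records[0]["question"])
```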


Use the `CountVectorizer` class of the sklearn library to convert the questions list into a sparse matrix, then apply a TF-IDF transformation. This generates the repository matrix.

In [138]:
# Use scikit-learn's built-in English stop-word list.
vectorizer = CountVectorizer(stop_words='english')
word_count_vector = vectorizer.fit_transform(questions)
transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tf_idf_matrix = transformer.fit_transform(word_count_vector)
tf_idf_matrix

<314263x69189 sparse matrix of type '<class 'numpy.float64'>'
	with 2033712 stored elements in Compressed Sparse Row format>
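
For reference, scikit-learn also offers `TfidfVectorizer`, which combines the counting and TF-IDF steps in one object. The sketch below checks on a toy corpus (hypothetical questions) that both routes give the same matrix with default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# Toy questions (hypothetical values).
toy_questions = [
    "does it have bluetooth",
    "how long does the battery last",
    "is the battery included",
]

# Two-step route, as in the cell above.
counts = CountVectorizer(stop_words="english").fit_transform(toy_questions)
two_step = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(counts)

# One-step route.
one_step = TfidfVectorizer(stop_words="english").fit_transform(toy_questions)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

With the default `norm='l2'`, every row has unit length, so the cosine between two rows reduces to their dot product.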

- Calculate the angle between every row of the repository matrix and the new question vector. Use the sklearn library's `cosine_similarity` function to calculate the cosine between each row and the vector, take the arccosine, and convert the result from radians to degrees with the numpy library's `rad2deg` function.
- Find the row that has the maximum cosine (or the minimum angle) with the new question vector and return the answer corresponding to that question as the response. If the smallest angle between the question vector and every row of the matrix is greater than a threshold value, i.e., 60 degrees, consider the question to be different enough and return a message stating that the chatbot cannot understand the question.
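
The conversion from cosine to angle can be checked on a tiny worked example. The vectors below are hypothetical and chosen so that the angles are 0, 45, and 90 degrees.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy "question" vectors and one query vector (hypothetical values).
repository = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
query = np.array([[1.0, 0.0]])

cosines = cosine_similarity(query, repository)[0]

# Clip to [-1, 1] so floating-point rounding cannot push a value outside arccos's domain.
angles = np.rad2deg(np.arccos(np.clip(cosines, -1.0, 1.0)))
print(np.round(angles, 1))
```

A larger cosine always corresponds to a smaller angle, so taking the maximum cosine and taking the minimum angle select the same row.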

In [143]:
def to_degree_array(cosines):
    # Clip to [-1, 1] so floating-point rounding cannot push a cosine outside
    # the domain of arccos, then convert the angles from radians to degrees.
    return np.rad2deg(np.arccos(np.clip(cosines, -1.0, 1.0)))

def conversation(im):
    # Vectorize the new question with the fitted CountVectorizer and TF-IDF transformer.
    new_im = transformer.transform(vectorizer.transform([im]))
    cos_similarity = cosine_similarity(new_im, tf_idf_matrix)
    degree_array = to_degree_array(cos_similarity[0])
    # The smallest angle identifies the most similar stored question.
    best_match = np.argmin(degree_array)
    if degree_array[best_match] > 60:
        return "chatbot cannot understand the question"
    return answers[best_match]
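
The retrieval logic can be tried end to end without the full corpus. The following is a minimal sketch on three hypothetical Q&A pairs; `toy_conversation` and the other `toy_*` names are made up for illustration, and `TfidfVectorizer` is used here as a shortcut for the two-step vectorization.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical Q&A pairs standing in for the real corpus.
toy_questions = [
    "how long does the battery last",
    "does it come with a charger",
    "is the screen waterproof",
]
toy_answers = ["about ten hours", "yes, a charger is included", "no, keep it dry"]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_questions)

def toy_conversation(query, threshold_degrees=60):
    """Return the answer of the closest stored question, or a fallback message."""
    cosines = cosine_similarity(toy_tfidf.transform([query]), toy_matrix)[0]
    angles = np.rad2deg(np.arccos(np.clip(cosines, -1.0, 1.0)))
    best = int(np.argmin(angles))
    if angles[best] > threshold_degrees:
        return "chatbot cannot understand the question"
    return toy_answers[best]

print(toy_conversation("battery last long"))
print(toy_conversation("what colour is the box"))
```

A query that shares no vocabulary with the stored questions becomes a zero vector, which has a cosine of 0 (an angle of 90 degrees) with every row and therefore triggers the fallback message.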

- The user enters their username and is then greeted by the chatbot.
- The chat is initiated with the user asking questions and the bot providing a response based on the `conversation` function created earlier.
- The chat continues until the user types 'bye'.

- Please demonstrate the interactions with your chatbot using the functions that you have generated.

In [154]:
def main():
    name = input("Please enter your username: ")
    print("Welcome home {name}".format(name=name))
    conversation_getter()

def conversation_getter():
    # Keep answering questions until the user types 'bye'.
    # A loop is used instead of recursion so that long chats cannot hit the recursion limit.
    while True:
        query = input("What would you like to know?")
        if query.lower() == 'bye':
            print("Thank you for coming!")
            return
        print(conversation(query))

In [155]:
main()

Welcome home Noah
you are looking at this item on amazon and asking how much it costs????
no manual comes with the tablet, but i think it is available online.
Thank you for coming!
