# Amazon Phone Customer Reviews

Dataset: https://data.world/promptcloud/amazon-mobile-phone-reviews/workspace/project-summary

Target: Rating
Features: Price, Reviews, Brand Name, Review Votes

As an advertiser (e.g., Samsung) marketing a new phone model on Amazon, can we build a model that uses features such as previous customer ratings, pricing, and reviews to determine what to keep or improve in a new phone model?

The model should accurately predict a high rating (i.e., 4 or 5), with text analysis of customer reviews (and the number of review votes) as an important component of the analysis.
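
One way to make the "high rating" target concrete is to binarize the star rating. The sketch below is illustrative only: the `high_rating` column name, the threshold of 4, and the toy rows are assumptions, not part of the dataset or of the cells that follow.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({"Rating": [5, 4, 3, 1, 2]})

# Hypothetical binary target: 1 for a rating of 4 or 5, 0 otherwise
toy["high_rating"] = (toy["Rating"] >= 4).astype(int)
print(toy)
```

The same expression could be applied to the `Rating` column of the full dataset once it is loaded.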

In [70]:
import pandas as pd
import numpy as np
import re
import string  # needed for string.punctuation in tokenize_text
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [71]:
file = pd.read_csv("Amazon_Unlocked_Mobile.csv")
file.head(5)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [72]:
file.shape

(413840, 6)

In [73]:
vx = file['Reviews'][5]
vx

'I already had a phone with problems... I know it stated it was used, but dang, it did not state that it did not charge. I wish I would have read these comments then I would have not purchased this item.... and its cracked on the side.. damaged goods is what it is.... If trying to charge it another way does not work I am requesting for my money back... AND I WILL GET MY MONEY BACK...SIGNED AN UNHAPPY CUSTOMER....'

In [77]:
# Limit the sample to the ten brands with the most reviews
brand_names_topten = file['Brand Name'].value_counts().head(10).index.tolist()
file_top_brands = file[file['Brand Name'].isin(brand_names_topten)].copy()
file_top_brands.shape

(290019, 6)

In [78]:
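# Create a function to clean up text and return tokenized words.
# Requires the NLTK stopwords corpus and punkt tokenizer data to be downloaded.
stop_words = set(stopwords.words('english'))
def tokenize_text(text):
    if pd.isna(text):  # a missing review would otherwise become the string 'nan'
        return None
    text = str(text)
    if len(text) > 0:
        text = text.lower()  # lowercase
        text = re.sub(r'\d+', '', text)  # remove digits
        text = text.translate(str.maketrans('', '', string.punctuation)).strip()  # remove punctuation and surrounding whitespace
        text = word_tokenize(text)  # split into word tokens
        text = [i for i in text if i not in stop_words]  # drop stop words
    else:
        text = None
    return text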
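
To see what each cleaning step does without needing the NLTK data, here is a minimal stand-alone sketch of the same pipeline. It is not the function above: `demo_clean` and `demo_stop_words` are hypothetical names, whitespace splitting stands in for `word_tokenize`, and the stop-word set is a tiny hand-picked sample rather than NLTK's list.

```python
import re
import string

# Hypothetical, tiny stop-word set used only for this demo (not NLTK's list)
demo_stop_words = {"i", "it", "was", "but", "the", "a", "did", "not"}

def demo_clean(text):
    """Lowercase, strip digits and punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)  # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation)).strip()
    return [w for w in text.split() if w not in demo_stop_words]

tokens = demo_clean("It was USED, but the 2 chargers did not work!!")
print(tokens)  # ['used', 'chargers', 'work']
```

Note that "did not work" collapses to just "work" here. NLTK's English stop-word list also contains negations such as "not", so the real tokenizer may discard negation in the same way, which is worth keeping in mind when the tokens are used to predict ratings.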

In [79]:
file_top_brands['tokenized_review'] = file_top_brands['Reviews'].apply(tokenize_text)

In [80]:
file_top_brands.head(5)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,tokenized_review
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,"[feel, lucky, found, used, phone, us, used, ha..."
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,"[nice, phone, nice, grade, pantach, revue, cle..."
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,[pleased]
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,"[works, good, goes, slow, sometimes, good, pho..."
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,"[great, phone, replace, lost, phone, thing, vo..."
