# Capstone Project: Predict Amazon Book Review Rating 
### Data Cleaning

_By: Jin Park_

---
### Objectives
 
- The overall goal of this project is to predict the Amazon book rating from the text of customers' book reviews.
- The first step is to clean the dataset in preparation for the EDA process, which explores the dataset's structure and patterns.
- Web-scrape Amazon.com to gather book categories.
- Use nltk stopwords to clean the text data.

---
### Project Guide (Data Cleaning)
- [Get Book Categories by Web-Scraping the Amazon.com With Selenium](#Get Book Categories by Web-Scraping the Amazon.com With Selenium)
- [Imports](#Imports)
- [Data Cleaning](#Data Cleaning)
- [Clean Text Data Using Stopwords](#Clean Text Data Using Stopwords)
- [Save Cleaned Data for EDA](#Save Cleaned Data for EDA)


In [1]:
# import gzip
# import pandas as pd
# def parse(path):
#   g = gzip.open(path, 'rb')
#   for l in g:
#     yield eval(l)

# def getDF(path):
#   i = 0
#   df = {}
#   for d in parse(path):
#     df[i] = d
#     i += 1
#   return pd.DataFrame.from_dict(df, orient='index')

# df = getDF('./data/reviews_Books_5.json.gz')
# df = pd.read_csv('data/amazon_book.csv')
# sample = df.sample(frac=0.001)
# sample.to_csv('data/amazon_sample.csv', index=False)

The dataset was obtained from http://jmcauley.ucsd.edu/data/amazon/ and contains Amazon book reviews and metadata from May 1996 to July 2014.

The dataset is 4GB compressed and 9GB after decompression. It is too large to process on a personal laptop because it contains a large amount of text. To work around the problem, renting a cloud computing service from Amazon Web Services (AWS) was one option. However, due to a limited personal budget, an EC2 m5.2xlarge instance was the only size with enough computing power that could be rented. After testing it on several subsets of the data (10%, 5%, and 1%), the EC2 instance did not perform any better than the laptop, and its small storage space was another issue. Therefore, the only viable option was to create a subset below 1% of the data, which still contains enough reviews to work with.
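As a rough illustration of how a small subset can be drawn without loading the full file into memory, the sketch below streams the file line by line and keeps each record with a fixed probability. It assumes each line is a Python dict literal (as the `eval`-based loader above implies) and uses `ast.literal_eval` in place of `eval`. The function names `parse_lines`, `sample_reviews`, and `sample_gzip` are hypothetical and not part of the original notebook.

```python
import ast
import gzip
import random


def parse_lines(lines):
    """Yield one review dict per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            # literal_eval only accepts Python literals, unlike eval
            yield ast.literal_eval(line)


def sample_reviews(lines, frac, seed=0):
    """Keep each parsed review with probability `frac`, one line at a time."""
    rng = random.Random(seed)
    return [rec for rec in parse_lines(lines) if rng.random() < frac]


def sample_gzip(path, frac, seed=0):
    """Stream a gzipped review file without decompressing it to disk first."""
    with gzip.open(path, 'rt', encoding='utf-8') as fh:
        return sample_reviews(fh, frac, seed)
```

For example, `sample_gzip('./data/reviews_Books_5.json.gz', frac=0.001)` would return roughly 0.1% of the reviews as a list of dicts, which can then be passed to `pd.DataFrame`.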

<a id='Get Book Categories by Web-Scraping the Amazon.com With Selenium'></a>
## Get Book Categories by Web-Scraping the Amazon.com With Selenium
The column "asin" contains the Amazon Standard Identification Number (ASIN), a 10-character alphanumeric unique identifier assigned by Amazon.com and its partners for product identification within the Amazon organization. Feature engineering was performed by converting each ASIN into a book category: a web scraper built with Selenium searches Amazon.com for the ASIN and collects the book categories listed for it.
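The lookup URL for each ASIN can be built and checked before the browser is involved. The sketch below is a minimal illustration: the URL prefix is the one used in the scraping cell of this notebook, while the helper names `is_valid_asin` and `build_search_url`, and the strict 10-character check, are assumptions added here.

```python
import re
from urllib.parse import quote

# Search URL prefix copied from the scraping cell in this notebook
SEARCH_PREFIX = ("https://www.amazon.com/s/ref=nb_sb_noss"
                 "?url=search-alias%3Dstripbooks&field-keywords=")

# Assumed ASIN shape: exactly 10 letters or digits
ASIN_PATTERN = re.compile(r'[A-Za-z0-9]{10}')


def is_valid_asin(asin):
    """Return True when the value looks like a 10-character ASIN."""
    return bool(ASIN_PATTERN.fullmatch(str(asin)))


def build_search_url(asin):
    """Build the book-search URL for one ASIN, rejecting malformed values."""
    if not is_valid_asin(asin):
        raise ValueError("not a 10-character alphanumeric ASIN: %r" % (asin,))
    return SEARCH_PREFIX + quote(str(asin))
```

Rejecting malformed identifiers up front avoids spending a page load on a value that cannot match any product.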

In [1]:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

In [2]:
# Load sample data
amazon = pd.read_csv('data/amazon_sample.csv')

In [3]:
asin = amazon['asin']

In [4]:
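%%time
# Path to the local ChromeDriver binary, then start the browser
chrome_path = r"/Users/jinpark/Desktop/chromedriver_mac/chromedriver"
driver = webdriver.Chrome(chrome_path)

# Search Amazon's book section for each ASIN and collect the category text.
# Written against the Selenium 3 API; Selenium 4 replaces
# find_element_by_class_name with driver.find_element(By.CLASS_NAME, ...).
book_categories_list = []
for i in asin:
    try:
        url = "https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=" + str(i)
        driver.get(url)
        #time.sleep(1)
        books = driver.find_element_by_class_name('a-expander-container')
        book_categories_list.append(books.text)
    except Exception:
        # Category block not found or page failed to load: record a placeholder
        book_categories_list.append('unknown')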

CPU times: user 11.8 s, sys: 777 ms, total: 12.6 s
Wall time: 1h 45min 18s
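
Most of the wall time above is spent waiting on page loads. If the sample contains repeated ASINs (the notebook does not show whether it does), one possible saving is to fetch each distinct ASIN only once. The sketch below is an assumption-laden illustration: `scrape_unique` is a hypothetical helper, and `fetch` stands for any callable that returns the category text for one ASIN, such as a wrapper around the Selenium calls in the cell above.

```python
def scrape_unique(asins, fetch):
    """Call `fetch` once per distinct ASIN and reuse the result for repeats."""
    cache = {}
    results = []
    for a in asins:
        if a not in cache:
            try:
                cache[a] = fetch(a)
            except Exception:
                # Same placeholder as the scraping loop uses on failure
                cache[a] = 'unknown'
        results.append(cache[a])
    return results
```

The returned list has one entry per input ASIN in the original order, so it can be assigned to a data frame column in the same way as `book_categories_list`.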


In [5]:
amazon['book_categories'] = book_categories_list

In [6]:
new_book_categories = []
for i in amazon['book_categories']:
    new_book_categories.append(i.split('\n')[0])

In [7]:
amazon['book_category'] = new_book_categories

In [8]:
amazon.drop(['book_categories', 'asin'], axis=1, inplace=True)

In [9]:
# Save the scraped data.
amazon.to_csv('data/amazon_sample.csv', index=False)

In [10]:
amazon.head()

Unnamed: 0,reviewerID,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,book_category
0,A3SHA4Y9DHEK39,Chad Oberholtzer,"[10, 11]",We live in a church culture where holiness is ...,5.0,This is a great one...,1137542400,"01 18, 2006",Christian Books & Bibles
1,A11M98R135HMSY,Paul Skinner,"[2, 2]","I have read most of Philip Craig's books, and ...",5.0,the begining of a beautiful relationship,1205625600,"03 16, 2008","Mystery, Thriller & Suspense"
2,AD20B29YQDZYQ,Amazon Customer,"[0, 0]",The storyline was good. It was the reason I f...,2.0,Just ok,1361836800,"02 26, 2013",Literature & Fiction
3,AZR6CYHTQ9TL,Inez,"[1, 1]","I love most apocalypse stories, and this one r...",5.0,"Oh, no, bugs everywhere!",1373932800,"07 16, 2013",Science Fiction & Fantasy
4,A14BTJRH9VNLJJ,Kurt A. Johnson,"[1, 1]",What is a sociopath? A sociopath is a person w...,4.0,An interesting and thought provoking read,1365897600,"04 14, 2013",Biographies & Memoirs


<a id='Imports'></a>
## Imports

In [2]:
import time
import string
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from pprint import pprint
from ast import literal_eval
from collections import Counter
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS

<a id='Data Cleaning'></a>
## Data Cleaning 

The data cleaning process was straightforward. Checking the overall dataset information first is a useful way to see the dataset size, find missing values, and decide which columns need to be converted to a different type. The column "overall" was converted from float to integer since it contains the product rating, which ranges from 1 to 5. In addition, the column "reviewTime" was converted to datetime so that dates display consistently and are easier to work with.
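
The two type conversions can be sketched on a tiny frame. The three rows below are hypothetical values shaped like the raw columns, and the explicit `format` string and the whole-number check are additions for illustration rather than part of the original cleaning steps.

```python
import pandas as pd

# Hypothetical rows shaped like the raw 'overall' and 'reviewTime' columns
raw = pd.DataFrame({
    'overall': [5.0, 5.0, 2.0],
    'reviewTime': ['01 18, 2006', '03 16, 2008', '02 26, 2013'],
})

# Cast to int only when every rating is a whole number, so nothing is truncated
if not (raw['overall'] % 1 == 0).all():
    raise ValueError("found a non-integer rating")
raw['overall'] = raw['overall'].astype(int)

# Spell out the "month day, year" layout instead of relying on inference
raw['reviewTime'] = pd.to_datetime(raw['reviewTime'], format='%m %d, %Y')
```

Stating the date format explicitly makes a malformed date fail loudly instead of being parsed in an unexpected way.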

The column "helpful" stores two values as a list inside a string. Using literal_eval from ast, each string is parsed back into a list so that the two values can be split into two separate columns and joined to the main data frame. The helpful column was then converted into a binary label, 1 or 0 (helpful and not helpful). Furthermore, all rows with missing values were dropped: they account for only 86 of the 8,898 rows, so removing them should not affect statistical inference.
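
A compact way to picture the "helpful" conversion is a row-by-row comparison of the two parsed values. The sketch below uses three hypothetical strings in the same shape as the raw column and keeps the notebook's column names; the vectorized comparison is one possible formulation, not necessarily the one used in the cells that follow.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical values shaped like the raw 'helpful' column
raw = pd.DataFrame({'helpful': ['[10, 11]', '[2, 2]', '[0, 0]']})

# Parse each string into a real list, then spread it over two columns
pairs = raw['helpful'].map(literal_eval)
votes = pd.DataFrame(pairs.tolist(),
                     columns=['helpful_comment', 'not_helpful_comment'],
                     index=raw.index)  # keep row alignment with `raw`

# 1 when the first value is at least as large as the second, else 0
labels = (votes['helpful_comment'] >= votes['not_helpful_comment']).astype(int)
print(labels.tolist())  # [0, 1, 1]
```

Passing `index=raw.index` matters when the source frame no longer has a default index (for example after rows were dropped), because `pd.concat(..., axis=1)` aligns on index labels.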

After the general data cleaning, all columns were renamed and rearranged according to personal preference.

In [3]:
# Load sample data
amazon = pd.read_csv('data/amazon_sample.csv')

In [4]:
# Check for duplicates
amazon.duplicated().sum()

0

In [5]:
# Column 'overall' needs to be converted to int and 'reviewTime' to datetime
amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8898 entries, 0 to 8897
Data columns (total 9 columns):
reviewerID        8898 non-null object
reviewerName      8876 non-null object
helpful           8898 non-null object
reviewText        8898 non-null object
overall           8898 non-null float64
summary           8897 non-null object
unixReviewTime    8898 non-null int64
reviewTime        8898 non-null object
book_category     8835 non-null object
dtypes: float64(1), int64(1), object(7)
memory usage: 625.7+ KB


In [6]:
# Convert column 'overall' to type integer
amazon['overall'] = amazon.overall.astype(int)

In [7]:
# Convert column 'reviewTime' to datetime
amazon['reviewTime'] = pd.to_datetime(amazon.reviewTime)

In [8]:
# Split column 'helpful', a list stored as a string, into two columns
# 'helpful_comment' and 'not_helpful_comment' by building a separate dataframe
helpful_df = pd.DataFrame(amazon['helpful'].map(literal_eval).tolist(),
                          columns=['helpful_comment', 'not_helpful_comment'])

In [9]:
# Concat two dataframe into one
amazon = pd.concat([amazon, helpful_df], axis=1)

In [10]:
# Convert the two counts into a binary label: 1 (helpful) or 0 (not helpful)
def helpful(df):
    helpful_list = []
    # Compare the two counts row by row
    for h, n in zip(df['helpful_comment'], df['not_helpful_comment']):
        if h >= n:
            helpful_list.append(1)
        else:
            helpful_list.append(0)
    return helpful_list

In [11]:
# Creating a new column to store the helpful result
amazon['helpful'] = helpful(amazon)

In [12]:
# Check for missing values
amazon.isnull().sum()

reviewerID              0
reviewerName           22
helpful                 0
reviewText              0
overall                 0
summary                 1
unixReviewTime          0
reviewTime              0
book_category          63
helpful_comment         0
not_helpful_comment     0
dtype: int64

In [13]:
# Drop all rows with missing data since they are a small share of the dataset
amazon.dropna(inplace=True)

In [14]:
amazon.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8812 entries, 0 to 8897
Data columns (total 11 columns):
reviewerID             8812 non-null object
reviewerName           8812 non-null object
helpful                8812 non-null int64
reviewText             8812 non-null object
overall                8812 non-null int64
summary                8812 non-null object
unixReviewTime         8812 non-null int64
reviewTime             8812 non-null datetime64[ns]
book_category          8812 non-null object
helpful_comment        8812 non-null int64
not_helpful_comment    8812 non-null int64
dtypes: datetime64[ns](1), int64(5), object(5)
memory usage: 826.1+ KB


In [15]:
# Drop all columns that are useless
amazon.drop(['helpful_comment',
             'not_helpful_comment',
             'unixReviewTime'],
            axis=1,
            inplace=True)

In [16]:
amazon.columns

Index(['reviewerID', 'reviewerName', 'helpful', 'reviewText', 'overall',
       'summary', 'reviewTime', 'book_category'],
      dtype='object')

In [17]:
# Rename columns 
new_col = [u'reviewer_id', u'reviewer_name', u'helpful_review', u'review_text',
           u'rating', u'review_summary', u'review_date', u'book_category']
amazon.columns = new_col

In [18]:
# Rearrange columns 
amazon = amazon[['review_date', 'reviewer_id','reviewer_name', 'book_category',
                 'review_text', 'review_summary', 'rating', 'helpful_review']]

In [19]:
amazon.head(2)

Unnamed: 0,review_date,reviewer_id,reviewer_name,book_category,review_text,review_summary,rating,helpful_review
0,2006-01-18,A3SHA4Y9DHEK39,Chad Oberholtzer,Christian Books & Bibles,We live in a church culture where holiness is ...,This is a great one...,5,0
1,2008-03-16,A11M98R135HMSY,Paul Skinner,"Mystery, Thriller & Suspense","I have read most of Philip Craig's books, and ...",the begining of a beautiful relationship,5,1


<a id='Clean Text Data Using Stopwords'></a>
## Clean Text Data Using Stopwords
The column "review_text" contains the detailed book reviews, while the column "review_summary" contains the review summaries. Both columns were cleaned with a function that strips punctuation and with the stopword list from the nltk library. The stopword list holds the most common English words, such as "the", "a", "an", "are", and "be". The words "book", "books", "read", and "reads" were added to the stopwords: because the dataset consists of book reviews, these words appear throughout and add no value.
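
The two cleaning steps can be seen end to end in a small standalone sketch. It uses a short hand-written stopword set as a stand-in for the nltk list (an assumption made so the example runs without downloading the corpus), and `clean_text` is a hypothetical helper that combines both steps.

```python
import string

# Small stand-in for the nltk stopword list plus the four added words
STOP_WORDS = {'the', 'a', 'an', 'and', 'is', 'are', 'be', 'this', "you're",
              'book', 'books', 'read', 'reads'}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_text(text, stop_words=STOP_WORDS):
    """Strip punctuation, lowercase, and drop stopwords."""
    no_punct = text.translate(PUNCT_TABLE)
    kept = [w for w in no_punct.lower().split() if w not in stop_words]
    return ' '.join(kept)


print(clean_text("This is a great book, and a quick read!"))  # great quick
print(clean_text("You're welcome"))                           # youre welcome
```

The second call shows a side effect of the ordering: once the apostrophe is removed, "you're" becomes "youre" and no longer matches the stopword entry "you're". Using a set rather than a list for the stopwords also makes each membership test a constant-time lookup.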

In [20]:
# Function that removes punctuation
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

In [21]:
# Apply the remove_punctuation function to remove punctuation for columns review_text and review_summary
amazon['review_text'] = amazon['review_text'].apply(remove_punctuation)
amazon['review_summary'] = amazon['review_summary'].apply(remove_punctuation)
amazon.head()

Unnamed: 0,review_date,reviewer_id,reviewer_name,book_category,review_text,review_summary,rating,helpful_review
0,2006-01-18,A3SHA4Y9DHEK39,Chad Oberholtzer,Christian Books & Bibles,We live in a church culture where holiness is ...,This is a great one,5,0
1,2008-03-16,A11M98R135HMSY,Paul Skinner,"Mystery, Thriller & Suspense",I have read most of Philip Craigs books and fi...,the begining of a beautiful relationship,5,1
2,2013-02-26,AD20B29YQDZYQ,Amazon Customer,Literature & Fiction,The storyline was good It was the reason I fi...,Just ok,2,1
3,2013-07-16,AZR6CYHTQ9TL,Inez,Science Fiction & Fantasy,I love most apocalypse stories and this one re...,Oh no bugs everywhere,5,1
4,2013-04-14,A14BTJRH9VNLJJ,Kurt A. Johnson,Biographies & Memoirs,What is a sociopath A sociopath is a person wi...,An interesting and thought provoking read,4,1


In [22]:
# Extract the English stopwords from the nltk library
sw = stopwords.words('english')
# Add domain-specific stopwords that add no value in book reviews
new_stopwords = ['book', 'books', 'read', 'reads']
sw.extend(new_stopwords)
sw

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...

In [23]:
# Remove stopwords from a text string; named remove_stopwords so that it
# does not shadow the nltk `stopwords` module imported above
def remove_stopwords(text):
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    return " ".join(text)

In [24]:
# Apply the function above to remove stopwords
amazon['review_text'] = amazon['review_text'].apply(remove_stopwords)
amazon['review_summary'] = amazon['review_summary'].apply(remove_stopwords)
amazon.head()

Unnamed: 0,review_date,reviewer_id,reviewer_name,book_category,review_text,review_summary,rating,helpful_review
0,2006-01-18,A3SHA4Y9DHEK39,Chad Oberholtzer,Christian Books & Bibles,live church culture holiness rarely mentioned ...,great one,5,0
1,2008-03-16,A11M98R135HMSY,Paul Skinner,"Mystery, Thriller & Suspense",philip craigs finally got around first j w jac...,begining beautiful relationship,5,1
2,2013-02-26,AD20B29YQDZYQ,Amazon Customer,Literature & Fiction,storyline good reason finished filled way much...,ok,2,1
3,2013-07-16,AZR6CYHTQ9TL,Inez,Science Fiction & Fantasy,love apocalypse stories one really creeped lik...,oh bugs everywhere,5,1
4,2013-04-14,A14BTJRH9VNLJJ,Kurt A. Johnson,Biographies & Memoirs,sociopath sociopath person little conscience p...,interesting thought provoking,4,1


<a id='Save Cleaned Data for EDA'></a>
## Save Cleaned Data for EDA
After cleaning the overall dataset, the cleaned dataset was saved for exploratory data analysis.
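
Two details of a CSV round trip are worth checking when the file is reloaded for EDA: datetime columns come back as plain strings unless `parse_dates` is given, and an empty string (which a summary made up entirely of stopwords would become) is read back as a missing value by default. The sketch below demonstrates both on a hypothetical two-row frame written to an in-memory buffer; whether any review in the actual sample ends up empty is not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical cleaned rows; the second summary is empty after stopword removal
cleaned = pd.DataFrame({
    'review_date': pd.to_datetime(['2006-01-18', '2008-03-16']),
    'review_summary': ['great one', ''],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on reload
reloaded = pd.read_csv(buffer, parse_dates=['review_date'])

# The empty summary comes back as NaN; count it, then restore the empty string
n_missing = int(reloaded['review_summary'].isna().sum())
reloaded['review_summary'] = reloaded['review_summary'].fillna('')
```

The same `parse_dates=['review_date']` argument could be passed when loading `data/amazon_sample_eda.csv`.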

In [25]:
# Save the cleaned data.
amazon.to_csv('data/amazon_sample_eda.csv', index=False)

In [26]:
# Load clean data
amazon = pd.read_csv('data/amazon_sample_eda.csv')

In [27]:
# Make sure everything looks correct
amazon.head()

Unnamed: 0,review_date,reviewer_id,reviewer_name,book_category,review_text,review_summary,rating,helpful_review
0,2006-01-18,A3SHA4Y9DHEK39,Chad Oberholtzer,Christian Books & Bibles,live church culture holiness rarely mentioned ...,great one,5,0
1,2008-03-16,A11M98R135HMSY,Paul Skinner,"Mystery, Thriller & Suspense",philip craigs finally got around first j w jac...,begining beautiful relationship,5,1
2,2013-02-26,AD20B29YQDZYQ,Amazon Customer,Literature & Fiction,storyline good reason finished filled way much...,ok,2,1
3,2013-07-16,AZR6CYHTQ9TL,Inez,Science Fiction & Fantasy,love apocalypse stories one really creeped lik...,oh bugs everywhere,5,1
4,2013-04-14,A14BTJRH9VNLJJ,Kurt A. Johnson,Biographies & Memoirs,sociopath sociopath person little conscience p...,interesting thought provoking,4,1
