# Book Success Prediction using Machine Learning

Kathryn Hamilton and Frank Shannon

w207 Spring 2018

### Introduction

The team would like to assess the relationship between the synopsis of a novel and its success by constructing a supervised machine learning classifier.

A book's synopsis, a couple of paragraphs traditionally found on the back or inside cover, briefly explains the book's contents and notes any applicable critical acclaim of the author. The team would like to see if this information can be used to reliably predict whether or not the book will be successful. To do this, the team will harness information found on Amazon.com, one of the world's largest e-commerce and cloud computing companies which, fittingly, started as an online bookstore.

We begin by importing the necessary libraries and setting up our document.

In [34]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
import pandas as pd
import gzip
import re
import copy

The data we will be using for this project come from two sources.

The first is an online repository of `.json` files compiled by Julian McAuley, Assistant Professor of Computer Science and Engineering at the University of California, San Diego. These files, which can be found at http://jmcauley.ucsd.edu/data/amazon/, provide us with customer review information and product metadata [1, 2].

The second will be an API the team uses to scrape book synopsis data from http://www.amazon.com/ using the list of product ID numbers included in the dataset of reviews.

Prof. McAuley's papers related to the Amazon dataset are as follows:

[1] R. He, J. McAuley. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016

[2] J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015

### Import and Clean Data

We first explore the datasets provided by McAuley. These files are very large so we are looking to get rid of any information that will not be useful to us.

In addition, we will want to narrow down the data into a subset that seems well suited for the purpose of this project. For starters, this means selecting a category of books (Fiction, Travel, Money & Business, etc.) that has enough examples and a good range of descriptive synopses.

At this point in the project our inputs and outputs are very loosely defined. It is hard to know specifically what data we will need in the end and whether we even have it to begin with. So, some upfront exploration is a good first step toward forming a problem that we can reasonably solve.

In [2]:
# Unpackage McAuley metadata and reviews files, which are currently compressed .json files, using the code supplied on his site.

def parse(path):
  g = gzip.open(path, 'rb')
  for l in g:
    yield eval(l)

def getDF(path):
  i = 0 
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

metadata_import = getDF('meta_Books.json.gz')
#reviews_import = getDF('reviews_Books.json.gz')
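
The `parse` helper above calls `eval` on every line, which executes whatever the file happens to contain. A more defensive variant, sketched below, swaps in `ast.literal_eval`, which accepts only Python literals. This assumes each line of the file is a plain Python dict literal; the name `parse_literal` is hypothetical and is not part of the code supplied on McAuley's site.

```python
import ast
import gzip


def parse_literal(path):
    """Yield one dict per non-blank line of a gzipped file of dict literals."""
    with gzip.open(path, 'rb') as g:
        for raw in g:
            # Lines are read as bytes; decode them before parsing.
            line = raw.decode('utf-8').strip()
            if line:
                # literal_eval only builds literals (dicts, lists, strings,
                # numbers), so code embedded in the file is never executed.
                yield ast.literal_eval(line)
```

It can be dropped into `getDF` in place of `parse` without changing the rest of that function.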

Let's take a look at the first few rows of each file.

In [75]:
print "Length: ", metadata_import.size   # .size is the number of cells (rows x columns), not the number of rows
metadata_import.head(n=5)   # print first 5 rows of dataframe

Length:  21335265


Unnamed: 0,asin,salesRank,imUrl,categories,title,description,related,price,brand
0,1048791,{u'Books': 6334800},http://ecx.images-amazon.com/images/I/51MKP0T4...,[[Books]],"The Crucible: Performed by Stuart Pankin, Jero...",,,,
1,1048775,{u'Books': 13243226},http://ecx.images-amazon.com/images/I/5166EBHD...,[[Books]],Measure for Measure: Complete &amp; Unabridged,William Shakespeare is widely regarded as the ...,,,
2,1048236,{u'Books': 8973864},http://ecx.images-amazon.com/images/I/51DH145C...,[[Books]],The Sherlock Holmes Audio Collection,"&#34;One thing is certain, Sherlockians, put a...","{u'also_viewed': [u'1442300191', u'9626349786'...",9.26,
3,401048,{u'Books': 6448843},http://ecx.images-amazon.com/images/I/41bchvIf...,[[Books]],The rogue of publishers' row;: Confessions of ...,,{u'also_viewed': [u'068240103X']},,
4,1019880,{u'Books': 9589258},http://ecx.images-amazon.com/images/I/61LcHUdv...,[[Books]],Classic Soul Winner's New Testament Bible,,"{u'also_viewed': [u'B003HMB5FC', u'0834004593'...",5.39,


In [76]:
#print "Length: ", reviews_import.size   # print length of dataframe
#reviews_import.head(n=5)   # print first 5 rows of dataframe

Let's start with the `metadata` file, which describes each book.

There are several columns that are of use to us:

* `asin`, which is the unique product identification number used by Amazon.
* `salesRank`, which describes the popularity of the book within the Amazon category "Books".
* `categories`, which describes the category and subcategory under which the book is classified.

There are several columns that are not of use to us:

* `imUrl`, which is a link to the product's photo.
* `related`, which is a list of similar products.
* `brand`, which might describe affiliate companies such as the book's publisher.

We drop the columns that are of no use to us, and drop rows that do not contain information in all of the columns that are of use to us. We also convert column headers to ASCII from Unicode.

In [211]:
# create a duplicate data frame of the imported file
metadata = metadata_import.copy()

# convert column headers from unicode to ascii
metadata = metadata.rename(index=str,columns={u'asin':'asin', u'salesRank':'salesRank', u'imUrl':'imUrl', 
                                              u'categories':'categories', u'title':'title', u'description':'description',
                                              u'related':'related', u'price':'price', u'brand':'brand'})

# drop unrelated columns
metadata = metadata.drop(['imUrl','related','brand'],axis=1)

# drop rows that have NaN in any of: asin, sales rank, or category
metadata = metadata.dropna(axis=0,subset=['asin','salesRank','categories'])
metadata = metadata.reset_index()

print "Length: ", metadata.size   # .size is the number of cells (rows x columns), not the number of rows

Length:  13238141


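Note that `.size` counts cells (rows times columns) rather than rows. Dividing by the column count (nine before cleaning, seven after) shows that roughly 1.89 million books remain, down from roughly 2.37 million in the imported file.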
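
`DataFrame.size` reports the number of cells rather than the number of rows, so the figures printed above need to be divided by the column count before they can be read as book counts. A quick check of that arithmetic, assuming nine columns in the imported frame and seven after the drop and `reset_index` (as read off the table headers in this notebook):

```python
# Cell counts printed by `.size` in the cells above.
size_imported = 21335265
size_cleaned = 13238141

# Column counts read off the table headers: asin, salesRank, imUrl,
# categories, title, description, related, price, brand (9) before cleaning;
# index, asin, salesRank, categories, title, description, price (7) after.
cols_imported = 9
cols_cleaned = 7

rows_imported = size_imported // cols_imported
rows_cleaned = size_cleaned // cols_cleaned

print(rows_imported)  # 2370585
print(rows_cleaned)   # 1891163
```

Calling `len(metadata)` or reading `metadata.shape[0]` would report the row count directly.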

Now, let's check to see how many books fall into each category.

In [155]:
# check how many entries are in each category
metadata2 = metadata.copy()
metadata2['categories'] = metadata2['categories'].astype(str)
metadata2['categories'] = metadata2['categories'].astype('category')
metadata2.drop(['salesRank','title','description','price','asin'],axis=1).groupby(['categories']).count().sort_values('index', ascending=False).head(n=15)

categories,index
[['Books']],1890949
[,56
"[['Books', ""Children's Books""]]",39
"[['Books', 'Teen & Young Adult']]",14
"[['Books', 'Science Fiction & Fantasy', 'Gaming']]",8
"[['Books', 'Reference']]",7
"[['Books', ""Children's Books"", 'Activities, Crafts & Games', 'Games', 'Puzzles']]",5
"[['Books', 'Humor & Entertainment', 'Puzzles & Games', 'Board Games']]",4
"[['Books', 'Crafts, Hobbies & Home', 'Crafts & Hobbies', 'Scrapbooking']]",3
"[['Books', 'Health, Fitness & Dieting', 'Nutrition']]",2


These are the first 15 rows of a larger table, listed in descending number of books per category. We see that about 1.89 million entries are uncategorized (their category is simply `Books`). Because `.size` above counts cells rather than rows, the cleaned frame itself holds only about 1.89 million rows, so nearly every book is uncategorized and only a few hundred carry a more specific category. The second through fifteenth rows of this table do not have enough examples to proceed on their own.

We see that there is an issue here with overcategorization. For example, two of the rows above fall under `Children's Books`, but one of them has been specified further, so in their current format these entries are not bucketed together.

The next logical step is to roll back these classifications to a higher level (for example, change `['Books', 'Crafts, Hobbies & Home', 'Crafts & Hobbies', 'Decorating']` into simply `['Crafts, Hobbies & Home']`), and then regroup the data.
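
The roll-back described above can be sketched in plain Python. The version below assumes each `categories` value is a list of category paths (lists of strings) and keeps the first level below `Books` from the first path; the function name `top_level_category` is hypothetical, and the sample values are copied from the table above.

```python
from collections import Counter


def top_level_category(categories):
    """Return the first level below 'Books', or 'Books' if there is none."""
    # `categories` is a list of paths, e.g. [['Books', 'Reference']].
    if not categories or not isinstance(categories[0], list):
        return 'Books'
    path = categories[0]          # use the first listed path only
    return path[1] if len(path) > 1 else 'Books'


# Sample values in the same shape as the `categories` column.
sample = [
    [['Books']],
    [['Books', "Children's Books"]],
    [['Books', "Children's Books", 'Activities, Crafts & Games', 'Games', 'Puzzles']],
    [['Books', 'Crafts, Hobbies & Home', 'Crafts & Hobbies', 'Scrapbooking']],
]

# Regroup after the roll-back: both Children's Books rows now share a bucket.
counts = Counter(top_level_category(c) for c in sample)
print(counts.most_common())
```

Indexing into the list, rather than splitting its string form on commas, keeps names such as `Crafts, Hobbies & Home` in one piece.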

## -----EVERYTHING IS GOOD ABOVE HERE-----

In [213]:
orig_cats = metadata.categories.values

new_cats = copy.copy(orig_cats)

for i in range(0,len(new_cats)):
    # each entry is a nested list such as [['Books', 'Reference']]; index into it
    # rather than calling re.sub, which raises
    # "TypeError: expected string or buffer" when given a list
    path = new_cats[i][0]
    if len(path)>1:
        new_cats[i] = path[1]
    else:
        new_cats[i] = 'Books'

print(np.unique(new_cats))

# new_cats is aligned row by row with metadata; uncomment to overwrite the column
#metadata['categories'] = new_cats


In [200]:
metadata.head(n=500)   # print first 500 rows of dataframe

Unnamed: 0,index,asin,salesRank,categories,title,description,price
0,0,0001048791,{u'Books': 6334800},[,"The Crucible: Performed by Stuart Pankin, Jero...",,
1,1,0001048775,{u'Books': 13243226},[,Measure for Measure: Complete &amp; Unabridged,William Shakespeare is widely regarded as the ...,
2,2,0001048236,{u'Books': 8973864},[,The Sherlock Holmes Audio Collection,"&#34;One thing is certain, Sherlockians, put a...",9.26
3,3,0000401048,{u'Books': 6448843},[,The rogue of publishers' row;: Confessions of ...,,
4,4,0001019880,{u'Books': 9589258},[,Classic Soul Winner's New Testament Bible,,5.39
5,6,0001148427,{u'Books': 5806769},[,Sonatas - For Piano,,
6,7,0001057170,{u'Books': 9318563},[,Classic Connolly Boxed Set (Vol 1 &amp; 2),[Editor's Note: The following is a combined re...,
7,8,0001047566,{u'Books': 3628249},[,Hand in Glove,,
8,9,0001053396,{u'Books': 12249714},[,War Poems: An Anthology of Poetry from the 18t...,Writing poetry has always been a way to expres...,17.99
9,10,0000913154,{u'Books': 455782},[,The Way Things Work: An Illustrated Encycloped...,,23.26


In [None]:
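# collapse the first 10 rows to their top-level category as a quick test;
# index the nested list directly, since casting to str first would make
# categ[0] the single character '['
for i in range(0,10):
    categ = metadata.iloc[i]['categories'][0]
    categ = categ[1] if len(categ) > 1 else categ[0]
    metadata.at[metadata.index[i], 'categories'] = categ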

In [None]:
metadata['categories'] = metadata['categories'].astype(str)
metadata['categories'] = metadata.categories.astype('category')
metadata.groupby(['categories']).count()

In [187]:
type(metadata)

pandas.core.frame.DataFrame

In [66]:
metadata

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,related,price,brand
0,0001048791,{u'Books': 6334800},http://ecx.images-amazon.com/images/I/51MKP0T4...,0001048791,"The Crucible: Performed by Stuart Pankin, Jero...",,,,
1,0001048775,{u'Books': 13243226},http://ecx.images-amazon.com/images/I/5166EBHD...,0001048775,Measure for Measure: Complete &amp; Unabridged,William Shakespeare is widely regarded as the ...,,,
2,0001048236,{u'Books': 8973864},http://ecx.images-amazon.com/images/I/51DH145C...,0001048236,The Sherlock Holmes Audio Collection,"&#34;One thing is certain, Sherlockians, put a...","{u'also_viewed': [u'1442300191', u'9626349786'...",9.26,
3,0000401048,{u'Books': 6448843},http://ecx.images-amazon.com/images/I/41bchvIf...,0000401048,The rogue of publishers' row;: Confessions of ...,,{u'also_viewed': [u'068240103X']},,
4,0001019880,{u'Books': 9589258},http://ecx.images-amazon.com/images/I/61LcHUdv...,0001019880,Classic Soul Winner's New Testament Bible,,"{u'also_viewed': [u'B003HMB5FC', u'0834004593'...",5.39,
5,0001048813,,http://ecx.images-amazon.com/images/I/41k5u0lr...,0001048813,Archer Christmas 4 Tape Pack,,,,
6,0001148427,{u'Books': 5806769},http://ecx.images-amazon.com/images/I/41tN4KuO...,0001148427,Sonatas - For Piano,,,,
7,0001057170,{u'Books': 9318563},http://ecx.images-amazon.com/images/I/51M65KR8...,0001057170,Classic Connolly Boxed Set (Vol 1 &amp; 2),[Editor's Note: The following is a combined re...,,,
8,0001047566,{u'Books': 3628249},http://ecx.images-amazon.com/images/I/51FWARNT...,0001047566,Hand in Glove,,,,
9,0001053396,{u'Books': 12249714},http://ecx.images-amazon.com/images/I/51WTKK4V...,0001053396,War Poems: An Anthology of Poetry from the 18t...,Writing poetry has always been a way to expres...,,17.99,


In [43]:
metadata['salesRank']=metadata['salesRank'].astype(str)
metadata['salesRank']=metadata['salesRank'].map(lambda x: x.lstrip("{u'Books': ").rstrip("}"))
metadata['salesRank']=metadata['salesRank'].astype(int)
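
The cell above recovers the rank by stripping characters from the string form of each dictionary. `lstrip` removes any leading characters found in its argument rather than a literal prefix, so the approach works for `{u'Books': ...}` but depends on that exact string form. A sketch of reading the value from the dictionary directly, assuming `salesRank` still holds dictionaries keyed by category name (the helper name `books_rank` is hypothetical):

```python
def books_rank(sales_rank, category='Books'):
    """Return the integer rank for `category`, or None if it is unavailable."""
    # Missing values (for example NaN) are not dictionaries, so they map to None.
    if not isinstance(sales_rank, dict):
        return None
    rank = sales_rank.get(category)
    return int(rank) if rank is not None else None


print(books_rank({u'Books': 6334800}))   # 6334800
print(books_rank(float('nan')))          # None
print(books_rank({u'Music': 12}))        # None
```

Applied to the frame, this would read `metadata['salesRank'].apply(books_rank)`.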

In [47]:
metadata['categories'] = metadata['categories'].astype(str)
metadata['categories'] = metadata.categories.astype('category')
metadata.groupby(['categories']).count()

categories,asin,salesRank,title,description,price
"[['Books', ""Children's Books"", 'Activities, Crafts & Games', 'Activity Books']]",1,1,1,1,1
"[['Books', ""Children's Books"", 'Activities, Crafts & Games', 'Games', 'Card Games']]",1,1,1,1,1
"[['Books', ""Children's Books"", 'Activities, Crafts & Games', 'Games', 'Party Games']]",1,1,1,1,1
"[['Books', ""Children's Books"", 'Activities, Crafts & Games', 'Games', 'Puzzles']]",4,4,4,4,4
"[['Books', ""Children's Books"", 'Activities, Crafts & Games', 'Games']]",1,1,1,1,1
"[['Books', ""Children's Books"", 'Education & Reference']]",1,1,1,1,1
"[['Books', ""Children's Books"", 'Geography & Cultures', 'Multicultural Stories', 'Hispanic & Latino']]",1,1,1,1,1
"[['Books', ""Children's Books"", 'Growing Up & Facts of Life', 'Family Life', 'Money']]",1,1,1,1,1
"[['Books', ""Children's Books"", 'Growing Up & Facts of Life', 'Friendship, Social Skills & School Life', 'School']]",1,1,1,1,1
"[['Books', ""Children's Books"", 'Literature & Fiction']]",1,1,1,1,1


In [None]:
from lxml import html
import csv,os,json
import requests
from time import sleep
 
def AmzonParser(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
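    # retry a bounded number of times; the function returns None if every attempt fails
    for attempt in range(5):
        sleep(3)
        # fetch inside the loop so that each retry re-requests the page
        page = requests.get(url,headers=headers)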
        try:
            doc = html.fromstring(page.content)
            XPATH_NAME = '//h1[@id="title"]//text()'
            XPATH_SALE_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
            XPATH_ORIGINAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
            XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]//text()'
            XPATH_AVAILABILITY = '//div[@id="availability"]//text()'
 
            RAW_NAME = doc.xpath(XPATH_NAME)
            RAW_SALE_PRICE = doc.xpath(XPATH_SALE_PRICE)
            RAW_CATEGORY = doc.xpath(XPATH_CATEGORY)
            RAW_ORIGINAL_PRICE = doc.xpath(XPATH_ORIGINAL_PRICE)
            RAw_AVAILABILITY = doc.xpath(XPATH_AVAILABILITY)
 
            NAME = ' '.join(''.join(RAW_NAME).split()) if RAW_NAME else None
            SALE_PRICE = ' '.join(''.join(RAW_SALE_PRICE).split()).strip() if RAW_SALE_PRICE else None
            CATEGORY = ' > '.join([i.strip() for i in RAW_CATEGORY]) if RAW_CATEGORY else None
            ORIGINAL_PRICE = ''.join(RAW_ORIGINAL_PRICE).strip() if RAW_ORIGINAL_PRICE else None
            AVAILABILITY = ''.join(RAw_AVAILABILITY).strip() if RAw_AVAILABILITY else None
 
            if not ORIGINAL_PRICE:
                ORIGINAL_PRICE = SALE_PRICE
 
            # treat any non-200 response as a blocked (captcha) request
            if page.status_code!=200:
                raise ValueError('captcha')
            data = {
                    'NAME':NAME,
                    'SALE_PRICE':SALE_PRICE,
                    'CATEGORY':CATEGORY,
                    'ORIGINAL_PRICE':ORIGINAL_PRICE,
                    'AVAILABILITY':AVAILABILITY,
                    'URL':url,
                    }
 
            return data
        except Exception as e:
            print e
 
def ReadAsin():
    # AsinList = csv.DictReader(open(os.path.join(os.path.dirname(__file__),"Asinfeed.csv")))
    AsinList = ['B0046UR4F4',
    'B00JGTVU5A',
    'B00GJYCIVK',
    'B00EPGK7CQ',
    'B00EPGKA4G',
    'B00YW5DLB4',
    'B00KGD0628',
    'B00O9A48N2',
    'B00O9A4MEW',
    'B00UZKG8QU',]
    extracted_data = []
    for i in AsinList:
        url = "http://www.amazon.com/dp/"+i
        print "Processing: "+url
        extracted_data.append(AmzonParser(url))
        sleep(5)
    # the with-block closes the file once the JSON has been written
    with open('data.json','w') as f:
        json.dump(extracted_data,f,indent=4)
 
 
if __name__ == "__main__":
    ReadAsin()

To begin, the team used preprocessing techniques and a vectorizer to decompose the synopses and ratings.

In [2]:
# partition & discard unwanted data (we just want one genre?)
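
One way to read "decompose the synopses" is a bag-of-words representation, in which each synopsis becomes a vector of word counts over a shared vocabulary. The sketch below shows the idea in plain Python. The tokenizer, the function names, and the two sample synopses are illustrative assumptions; a library vectorizer such as scikit-learn's `CountVectorizer` would normally take the place of this code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Return a list of token counts aligned with the vocabulary columns."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:   # ignore words absent from the vocabulary
            vector[vocabulary[tok]] += n
    return vector


# Made-up synopses, for illustration only.
synopses = ["A detective hunts a killer.", "A killer hunts the detective!"]
vocab = build_vocabulary(synopses)
vectors = [vectorize(s, vocab) for s in synopses]
print(sorted(vocab, key=vocab.get))
print(vectors)
```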

Once the data was imported correctly, the team partitioned the data into a training, development, and test set.

In [None]:
# preprocessing
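
The partition into training, development, and test sets could look like the sketch below. The 80/10/10 proportions, the fixed seed, and the function name `train_dev_test_split` are assumptions chosen for illustration; the notebook does not state the split it used.

```python
import random


def train_dev_test_split(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of `items` and cut it into train, dev, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_dev - n_test
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


train, dev, test = train_dev_test_split(range(100))
print([len(train), len(dev), len(test)])
```

Shuffling before cutting matters here because the metadata file is not guaranteed to be in random order.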

Then, the team operationalized the definition of "success" as it pertains to the books and the available data.

In [None]:
# define a function for success
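
The notebook does not yet say how success is defined. One candidate, shown purely as an assumption, labels a book successful when its Amazon sales rank is at or below a cutoff, since rank 1 is the best seller. Both the function name `is_successful` and the default cutoff of 100,000 are hypothetical.

```python
def is_successful(sales_rank, cutoff=100000):
    """Return 1 if the rank is at or better than `cutoff`, else 0."""
    # Rank 1 is the best seller, so smaller numbers indicate more sales.
    return 1 if sales_rank <= cutoff else 0


# Ranks taken from rows of the metadata table shown earlier.
ranks = [6334800, 455782, 3628249]
print([is_successful(r) for r in ranks])                   # [0, 0, 0]
print([is_successful(r, cutoff=1000000) for r in ranks])   # [0, 1, 0]
```

The cutoff controls the class balance, so it would need to be chosen with the distribution of ranks in the selected category in mind.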

Finally, the team constructed a classifier to determine if success can be predicted from synopsis.

In [None]:
# classifier
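
The notebook does not name the classifier it used. As a stand-in, the sketch below implements multinomial Naive Bayes with Laplace smoothing, a common baseline for text classification, in plain Python. The function names, the smoothing value, and the tiny training set are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)      # label -> token -> count
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {'class_counts': class_counts, 'word_counts': word_counts,
            'vocabulary': vocabulary, 'alpha': alpha}


def predict(model, tokens):
    """Return the label with the highest log posterior for `tokens`."""
    total_docs = sum(model['class_counts'].values())
    vocab_size = len(model['vocabulary'])
    best_label, best_score = None, float('-inf')
    for label, n_docs in model['class_counts'].items():
        counts = model['word_counts'][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / float(total_docs))    # log prior
        for tok in tokens:
            if tok not in model['vocabulary']:
                continue                                # skip unseen words
            # Laplace-smoothed log likelihood of the token given the label
            score += math.log((counts[tok] + model['alpha']) /
                              (total_words + model['alpha'] * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: token lists with 1 = successful, 0 = not.
docs = [['bestselling', 'thriller'], ['award', 'winning', 'thriller'],
        ['obscure', 'manual'], ['dated', 'manual']]
labels = [1, 1, 0, 0]
model = train_naive_bayes(docs, labels)
print(predict(model, ['award', 'thriller']))
print(predict(model, ['manual']))
```

Predictions on a held-out set could then be scored with the `metrics` module imported at the top of the notebook.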

The team found that...

Some alternative methods/explorations...