# Books versus eBooks: The customer's choice

For the same content, which format do people seem to prefer, based on Amazon reviews?

What is the price difference between the two formats, overall and per book category?

Per region, which format is preferred: digital or physical?

In [3]:
import pandas as pd
import re
import numpy as np
import requests
import time
from ast import literal_eval
from bs4 import BeautifulSoup
from nltk.sentiment.vader import SentimentIntensityAnalyzer



In [5]:
metadata_path = 'data/metadata.json'
books_metadata_path = 'data/books_metadata_with_bracket.csv'
ebooks_metadata_title_path = 'data/ebooks_metadata_title.csv'
ebooks_asin = 'data/ebooks_asin.csv'
kindle_5core = 'data/reviews_Kindle_Store_5.json'
books_path = 'data/Books_5.json'
amazon_ebooks = 'data/ebooks_title_from_amazon_complete.csv'
asindb_ebooks = 'data/ebooks_title_from_asindb.csv'

matched_books_path = 'test/matched_books.csv'
matched_ebooks_path = 'test/matched_ebooks.csv'
weighted_scores_books_path = 'test/weighted_scores_books.csv'
weighted_scores_ebooks_path = 'test/weighted_scores_ebooks.csv'


WRITE_BOOKS_METADATA = False
WRITE_EBOOKS_METADATA_TITLE = False
WRITE_EBOOK_ASIN = False
AMAZON_GET_TITLE = False
ASINDB_GET_TITLE = False

Our project needs both book and ebook data, so we chose the Amazon dataset. From this <a href='http://jmcauley.ucsd.edu/data/amazon/'>link</a>, we downloaded the Books and Kindle Store 5-core files. However, those files contain only reviews, so they give no information about an article's title or price.
For that reason, we had to use the metadata file as an intermediate (relationship) table.

TODO : how we obtained it from the cluster.

It is a JSON-like file that the pandas read_json method cannot read. We had to use the Code section from <a href='http://jmcauley.ucsd.edu/data/amazon/'>here</a> to read it. One way to read a limited portion of the file is shown below:

In [54]:
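def read_json(path, limit = 2): 
    # Each line holds one Python dict literal; literal_eval parses it without executing code.
    df = {}
    with open(path, 'r') as g: # the file is closed automatically when the block ends
        for i, l in enumerate(g): 
            if i >= limit:
                break
            df[i] = literal_eval(l)
    return pd.DataFrame.from_dict(df, orient='index')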

def read_csv(path, limit = 2): 
    return pd.read_csv(path, nrows=limit)
            
read_json(metadata_path)

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related
0,1048791,{'Books': 6334800},http://ecx.images-amazon.com/images/I/51MKP0T4...,[[Books]],"The Crucible: Performed by Stuart Pankin, Jero...",,,
1,143561,{'Movies & TV': 376041},http://g-ecx.images-amazon.com/images/G/01/x-s...,"[[Movies & TV, Movies]]","Everyday Italian (with Giada de Laurentiis), V...","3Pack DVD set - Italian Classics, Parties and ...",12.99,"{'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '..."
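Because each line of the metadata file is written as a Python dict literal (single-quoted strings) rather than strict JSON, `ast.literal_eval` can parse it without running arbitrary code. The sketch below illustrates this on a few in-memory lines; the helper name `parse_metadata_lines` and the shortened sample lines are illustrative assumptions, not part of the original notebook.

```python
from ast import literal_eval
from itertools import islice

import pandas as pd


def parse_metadata_lines(lines, limit=2):
    """Parse at most `limit` lines, each holding one Python dict literal."""
    records = [literal_eval(line) for line in islice(lines, limit)]
    return pd.DataFrame(records)


# Shortened, illustrative lines in the same single-quoted style as metadata.json.
sample_lines = [
    "{'asin': 'B000JML1QG', 'price': 0.99}\n",
    "{'asin': '0554319187', 'title': \"Grimm's Fairy Stories\", 'price': 0.99}\n",
    "{'asin': 'B000F83SZQ'}\n",
]

sample_df = parse_metadata_lines(sample_lines)
print(sample_df)
```

Only the first two lines are parsed here because `limit` defaults to 2; keys missing from an entry simply become NaN in the resulting dataframe.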


However, this metadata.json file takes more than 10 GB for 9,430,088 entries (counted with wc -l metadata.json), so it does not fit in memory. Since we will run many tests later, we wanted to create a smaller file containing only the Books metadata (we don't need video game metadata, for example), with a subset of columns. We also write it in CSV format, which is easier to manipulate later.

We use the regex "\[\'books" in case-insensitive mode to keep only entries that have a category tag beginning with ['Books. If we used the plain regex 'book' instead, entries like 0078800242 or B00000078S would also match, although they are not books at all, simply because Books appears in their title or category tag. The '[' is what avoids this behavior.
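The effect of the leading bracket can be checked on two sample lines. This is only a small sketch: the first line is a shortened version of a real entry, and the second is a hypothetical non-book entry invented for the illustration.

```python
import re

category_regex = re.compile(r"\['books", re.IGNORECASE)

# Shortened book entry: a category list starts with 'Books'.
book_line = "{'asin': '0554319187', 'title': \"Grimm's Fairy Stories\", 'categories': [['Books']]}"

# Hypothetical non-book entry that only mentions "Books" in its title.
other_line = "{'asin': 'X', 'title': 'Books on Tape Player', 'categories': [['Electronics']]}"

is_book = bool(category_regex.search(book_line))
is_other = bool(category_regex.search(other_line))
print(is_book, is_other)  # True False
```

Note that the pattern matches any category list that starts with Books, not only the first one, which is why some Kindle Store entries that also carry a Books category pass the filter.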

In [49]:
def read_book_metadata(path, regex): 
    g = open(path, 'r') 
    for l in g: 
        book = regex.search(l)
        if book:
            yield eval(l) 
            
def write_df_books_metadata(from_, to, regex, columns_to_keep): 
    i = 0 
    df = {} 
    for d in read_book_metadata(from_, regex): 
        df[i] = d 
        i += 1 
        if i % 10000 == 0: # Write every 10,000 book entries, then clear the buffer to free memory.
            pd.DataFrame.from_dict(df, orient='index')[columns_to_keep].to_csv(to, header=False,mode='a')
            df = {}
    if df: # Write the remaining entries of the last, partial batch.
        pd.DataFrame.from_dict(df, orient='index')[columns_to_keep].to_csv(to, header=False,mode='a')

COLUMNS_TO_KEEP = ['asin', 'salesRank', 'categories', 'title', 'price']
regex = re.compile(r"\['books", re.IGNORECASE) # raw string: avoids invalid escape sequences

if WRITE_BOOKS_METADATA:
    # Write the header line first, then append the data batch by batch.
    pd.DataFrame(columns=COLUMNS_TO_KEEP).to_csv(books_metadata_path)
    write_df_books_metadata(metadata_path, books_metadata_path, regex, COLUMNS_TO_KEEP)

And if we read what we just wrote :

In [55]:
read_csv(books_metadata_path)

Unnamed: 0.1,Unnamed: 0,asin,salesRank,categories,title,price
0,0,1048791,{'Books': 6334800},[['Books']],"The Crucible: Performed by Stuart Pankin, Jero...",
1,1,1048775,{'Books': 13243226},[['Books']],Measure for Measure: Complete &amp; Unabridged,
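The batching idea used to build this CSV file (buffer a fixed number of entries, append them to the file, clear the buffer, and flush whatever remains at the end) can be sketched in a self-contained way. The function `write_in_batches`, the generated records and the temporary path are illustrative assumptions, not the notebook's own code.

```python
import os
import tempfile
from itertools import islice

import pandas as pd


def write_in_batches(records, path, columns, batch_size=10000):
    """Append `records` (an iterable of dicts) to a CSV file, one batch at a time."""
    records = iter(records)
    first = True
    total = 0
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            break
        # reindex keeps the column order and fills missing keys with NaN.
        frame = pd.DataFrame(batch).reindex(columns=columns)
        # Only the first batch creates the file and writes the header.
        frame.to_csv(path, mode='w' if first else 'a', header=first, index=False)
        first = False
        total += len(batch)
    return total


# 25 fake records written in batches of 10: the last batch holds only 5 entries.
demo_records = ({'asin': str(i), 'price': float(i)} for i in range(25))
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
written = write_in_batches(demo_records, demo_path, ['asin', 'price'], batch_size=10)
demo_df = pd.read_csv(demo_path)
print(written, demo_df.shape)
```

Because the loop only stops on an empty batch, the final partial batch is always written, and only one batch is held in memory at any time.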


We also wanted to obtain the ebooks' titles, prices, etc.
For the category filter, we use the same trick as for Books: "\[\'Kindle". Note that some of the book metadata above is in fact Kindle Store metadata, because an entry's categories can contain both. This is not a problem, since we later merge on the asin column.

However, the ebook metadata raised a problem at this step:

In [71]:
def read_ebook_metadata(path, regex): 
    g = open(path, 'r') 
    for l in g: 
        ebook = regex.search(l)
        if ebook:
            yield eval(l) 
def obtain_df_ebooks_metadata(from_, to, regex): 
    i = 0 
    df = {} 
    count = 0
    for d in read_ebook_metadata(from_, regex): 
        count += 1
        if(d.get('title')):
            df[i] = {'asin': d.get('asin'), 'title': d.get('title')}
            i += 1 
    pd.DataFrame.from_dict(df, orient='index').to_csv(to)
    print('Total ebooks in metadatas:', count)


regex = re.compile(r"\['Kindle", re.IGNORECASE) # raw string: avoids invalid escape sequences

if WRITE_EBOOKS_METADATA_TITLE:
    obtain_df_ebooks_metadata(metadata_path, ebooks_metadata_title_path, regex)

Total ebooks in metadatas: 434702


As shown right below, only 44 entries out of 434702 have a title. This is a serious problem, since we want to merge books and ebooks using the title field.

In [82]:
read_csv(ebooks_metadata_title_path, None).shape[0]

44

Thus, we need to obtain the title field from another source. The first idea was to retrieve this information from Amazon directly, as we wanted to do for the user location. For that, we need a list of the ebook ASINs (Amazon Standard Identification Numbers), which we obtain from the Kindle Store 5-core file.

In [80]:
def read_ebook_5core(path): 
    with open(path, 'r') as g: # the file is closed automatically when the generator ends
        for l in g: 
            yield eval(l) 
def write_ebook_asin(from_, to): 
    i = 0 
    df = {} 
    for d in read_ebook_5core(from_): 
        df[i] = d 
        i += 1 
        if i % 10000 == 0:
            if i % 100000 == 0:
                print(i) # to show the progression
            pd.DataFrame.from_dict(df, orient='index')[['asin']].to_csv(to, header=False,mode='a')
            df = {}
    if df: # write the remaining entries of the last, partial batch
        pd.DataFrame.from_dict(df, orient='index')[['asin']].to_csv(to, header=False,mode='a')

if WRITE_EBOOK_ASIN:
    pd.DataFrame(columns=[['asin']]).to_csv(ebooks_asin)
    write_ebook_asin(kindle_5core, ebooks_asin)

100000
200000
300000
400000
500000
600000
700000
800000
900000


Since the Kindle Store 5-core file holds one line per review, the same asin appears several times. We therefore keep only the unique values when reading.

In [106]:
ebooks_asin_unique = pd.read_csv(ebooks_asin, usecols=[1]).asin.unique() # same path as the one written above
ebooks_asin_unique

array(['B000F83SZQ', 'B000FA64PA', 'B000FA64PK', ..., 'B00M029T4O',
       'B00M0RE7CS', 'B00M13FNSS'], dtype=object)

For every Amazon article with asin *xasinx*, the corresponding web page is https://www.amazon.com/dp/*xasinx*/ref=rdr_kindle_ext_tmb.


In [107]:
prefix = 'https://www.amazon.com/dp/'
suffix = '/ref=rdr_kindle_ext_tmb'

USER_AGENT_CHOICES = [
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
]

We rotate the user-agent so that the bot is less likely to be detected as one, which is why we keep an array of user-agent strings.
At first, this method worked quite well: we obtained 503 tuples (title, category, page number, language) out of 1000 requests, which could suggest that we had solved the problem. However, the distribution of these 503 tuples shows that only the beginning behaves well: there we obtain most of the entries (the missing ones being obsolete entries such as B000JMKU0Y, which only has customer reviews).
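The rotation described above amounts to drawing a new user-agent at a fixed interval and reusing it in between. The sketch below shows that logic only, without sending any request; the names `pick_headers` and `headers_for_requests`, the interval and the seed are assumptions made for the illustration.

```python
import random

# Two user-agent strings taken from the list above, repeated to keep the sketch self-contained.
DEMO_USER_AGENTS = [
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)',
]


def pick_headers(user_agents, rng=random):
    """Build request headers with a randomly chosen user-agent."""
    return {'User-Agent': rng.choice(user_agents)}


def headers_for_requests(n_requests, user_agents, rotate_every=10, seed=0):
    """Return the headers that each of `n_requests` requests would use."""
    rng = random.Random(seed)
    headers = None
    planned = []
    for i in range(n_requests):
        if i % rotate_every == 0:  # draw a new user-agent every `rotate_every` requests
            headers = pick_headers(user_agents, rng)
        planned.append(headers)
    return planned


planned_headers = headers_for_requests(25, DEMO_USER_AGENTS)
print(len(planned_headers))
```

Each returned dict could then be passed as the `headers` argument of `requests.get`.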

In [108]:
if AMAZON_GET_TITLE:
    
    LIMIT = 10
    
    undefined = 0
    defined = 0
    dataframe_original = pd.DataFrame(columns=[['asin', 'title', 'Category', 'PageNum', 'Language']])
    dataframe = dataframe_original.copy()

    dataframe_original.to_csv(amazon_ebooks)

    for i, asin in enumerate(ebooks_asin_unique[:LIMIT]):

        if i%10==0:
            headers = {'User-Agent':USER_AGENT_CHOICES[np.random.randint(0, len(USER_AGENT_CHOICES))]}
            if i > 0:
                print('undefined:', undefined, '/ defined:', defined)
                dataframe.to_csv(amazon_ebooks, header=False,mode='a')
                dataframe = dataframe_original.copy()


        r = requests.get(prefix + asin + suffix, headers=headers)
        page_body = r.text
        soup = BeautifulSoup(page_body, 'html.parser')
        title = soup.find_all('span', id='ebooksProductTitle')
        if(len(title) == 0):
            undefined += 1
        else:
            defined += 1
            title = title[0].text

            ul = soup.find_all('ul', class_='a-unordered-list a-horizontal a-size-small')
            if(len(ul) > 0):
                details = ul[0].find_all('a', class_='a-link-normal a-color-tertiary')
                if(len(details) > 0):
                    category = details[-1].text.strip()
                else:
                    category = ""
            else:
                category = ""

            details = soup.find_all('table', id='productDetailsTable')
            if(len(details) > 0):
                length = details[0].find_all('b', text='Print Length:')
                if(len(length) > 0):
                    page_number = length[0].parent.text.split()[2]
                else:
                    page_number = 0

                length = details[0].find_all('b', text='Language:')
                if(len(length) > 0):
                    language = length[0].parent.text.split()[1]
                else:
                    language = ""
            else:
                page_number = np.nan
                language = ""

            dataframe.loc[asin] = (asin, title, category, page_number, language)

        waiting = np.random.rand()
        time.sleep(waiting+1)

    print(defined,',',undefined)
    dataframe.to_csv(amazon_ebooks, header=False,mode='a')
    dataframe = dataframe_original.copy()

But after some time, we get fewer and fewer entries: Amazon returns a message with the page, saying that it's not a good idea to continue scraping data and that it might be better to go through their API. So there were several options:
- we continue to work with the bot while tweaking its parameters so that it behaves like a normal user (by increasing the waiting time and rotating the user-agent, as described before):

After some online research (https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/,
http://blog.datahut.co/tutorial-how-to-scrape-amazon-using-python-scrapy/,
http://docs.aws.amazon.com/AWSECommerceService/latest/DG/rest-signature.html,
https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python/,
https://blog.hartleybrody.com/scrape-amazon/), we saw that Amazon detects the IP address and can ban it, and that the way to avoid this is to use a proxy crawler. As it costs money, we decided not to use one. Furthermore, as explained here, web scraping is a legal grey area: https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/.

- we try to go through the Amazon API : 

The standard account requires bank account information, so we prefer not to use it. We only realized a few days before the deadline that a student account exists; we may consider this option in the future if needed, but we are still waiting for EPFL to accept or reject the account request.

- we find a field in the metadata, different from the title, that can help us to merge a book with an ebook :

With some manual analysis, we found a pair of book-ebook : 

    {'asin': 'B000JML1QG', 'price': 0.99, 'imUrl': 'http://ecx.images-amazon.com/images/I/41VbZ%2BvxslL._BO2,204,203,200_PIsitb-sticker-v3-big,TopRight,0,-55_SX278_SY278_PIkin4,BottomRight,1,22_AA300_SH20_OU01_.jpg', 'related': {'also_viewed': ['B005LSCQ4Y', 'B0082UXYTE', 'B004TS2B4W'], 'buy_after_viewing': ['B00CS6P31U', 'B005LSCQ4Y', 'B0051EZX8Y', 'B006CRC98G']}, 'categories': [['Books', "Children's Books", 'Fairy Tales, Folk Tales & Myths', 'Anthologies'], ['Books', 'Literature & Fiction'], ['Kindle Store', 'Kindle eBooks', "Children's eBooks", 'Fairy Tales, Folk Tales & Myths', 'Anthologies'], ['Kindle Store', 'Kindle eBooks', "Children's eBooks", 'Fairy Tales, Folk Tales & Myths', 'Collections'], ['Kindle Store', 'Kindle eBooks', 'Literature & Fiction', 'Mythology & Folk Tales'], ['Kindle Store', 'Kindle eBooks', 'Science Fiction & Fantasy', 'Fantasy', 'Fairy Tales']]}


    {'asin': '0554319187', 'title': "Grimm's Fairy Stories", 'price': 0.99, 'imUrl': 'http://ecx.images-amazon.com/images/I/41O2olixwXL.jpg', 'related': {'also_viewed': ['1607103133', '0394709306', '1937994317'], 'buy_after_viewing': ['1607103133', '0394709306', '0393088863', '0385189508']}, 'salesRank': {'Books': 2586251}, 'categories': [['Books']]}

As we can see here, the only field with the same value is the price, and merging on the price is dangerous, as the ebook is often less expensive than the print version of the same content.

- we find another service that can give us the title for a given asin :

This is the option we finally chose. The website http://asindb.com/ does exactly that and, unlike Amazon, has no bot detection. We can't retrieve the price, the category or the number of pages, but at least we can get the title. The result is shown below:

In [119]:
if ASINDB_GET_TITLE:
    
    LIMIT = 20
    
    prefix_asindb = 'http://asindb.com/USA/ASIN/'

    dataframe = pd.DataFrame(columns=[['asin', 'title']])
    notdefined = pd.DataFrame(columns=[['asin','notfound']])
    dataframe.to_csv(asindb_ebooks)

    for i, asin in enumerate(ebooks_asin_unique[:LIMIT]):
        r = requests.get(prefix_asindb + asin, headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0'})
        page_body = r.text
        soup = BeautifulSoup(page_body, 'html.parser')

        if i%10 == 0:
            dataframe.to_csv(asindb_ebooks, header=False,mode='a')
            dataframe = pd.DataFrame(columns=[['asin', 'title']])

        notfound = soup.find_all('h6', text = 'No item found!!!')
        if(len(notfound) > 0):
            notdefined.loc[asin] = (asin,1)
        else:
            title = soup.find_all('th', text='Title')
            if(len(title) > 0 and len(title[0].parent.findChildren()) >= 2):
                dataframe.loc[asin] = (asin, title[0].parent.findChildren()[1].text)
            else:
                print('alert:', asin)

        waiting = np.random.rand()
        time.sleep(waiting+1)

    dataframe.to_csv(asindb_ebooks, header=False,mode='a')
    dataframe = pd.DataFrame(columns=[['asin', 'title']])
    
    print(notdefined.head(2))

                  asin notfound
B000FC26RI  B000FC26RI        1
B000JMKU0Y  B000JMKU0Y        1


The dataframe right above lists the entries that were not found (see below for more explanation).

The cell below shows the kind of output we obtain.

In [116]:
read_csv(asindb_ebooks)

Unnamed: 0.1,Unnamed: 0,asin,title
0,B000FA64PA,B000FA64PA,Saboteur: Star Wars Legends (Darth Maul) (Shor...
1,B000FA64PK,B000FA64PK,Recovery: Star Wars Legends (The New Jedi Orde...


In [117]:
read_csv(asindb_ebooks, None).shape[0] # number of titles retrieved

2741

This solution is of course not the best one: the asindb website does not contain everything. With this technique, we retrieved 2741 titles out of 4000 asins (about 69%), but we avoid any problem with Amazon's bot detection (and a possible ban). The entries that were not retrieved can be seen by printing the notdefined dataframe.

Of course, to retrieve the 2741 entries, we set the LIMIT constant in the code to 4000.

We thus continue our analysis with these titles.

In [7]:
kindle = pd.read_json(kindle_5core, lines=True)
metadata = pd.read_csv(books_metadata_path)
cross = pd.read_csv(asindb_ebooks)

In [10]:
metadata.head()

Unnamed: 0.1,Unnamed: 0,asin,salesRank,categories,title,price
0,0,1048791,{'Books': 6334800},[['Books']],"The Crucible: Performed by Stuart Pankin, Jero...",
1,1,1048775,{'Books': 13243226},[['Books']],Measure for Measure: Complete &amp; Unabridged,
2,2,1048236,{'Books': 8973864},[['Books']],The Sherlock Holmes Audio Collection,9.26
3,3,401048,{'Books': 6448843},[['Books']],The rogue of publishers' row;: Confessions of ...,
4,4,1019880,{'Books': 9589258},[['Books']],Classic Soul Winner's New Testament Bible,5.39


In [11]:
data_kindle = kindle.set_index('asin')

merged_ebooks = pd.merge(kindle, metadata, on='asin')
merged_asins = pd.merge(metadata, cross, on='title')
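Merging on the raw title has two consequences worth keeping in mind: titles are not unique, so one ebook can be paired with several books, and small differences in spelling (such as the `&amp;` entity visible in the metadata titles) prevent a match. The sketch below illustrates both on toy data. The normalization step is an assumption added for the example and is not applied in this notebook, and the second ebook ASIN is hypothetical.

```python
import html

import pandas as pd


def normalize_title(title):
    """Unescape HTML entities, lowercase, and collapse whitespace."""
    return ' '.join(html.unescape(title).lower().split())


books = pd.DataFrame({
    'asin': ['1579660584', '1595143394', '1048775'],
    'title': ['The Space Between', 'The Space Between',
              'Measure for Measure: Complete &amp; Unabridged'],
})
ebooks = pd.DataFrame({
    'asin': ['B002DYJ7DM', 'B0000000XX'],  # the second ASIN is hypothetical
    'title': ['The Space Between', 'Measure for Measure: Complete & Unabridged'],
})

# Exact merge on the raw title: both books named "The Space Between" match the same ebook.
raw_merge = pd.merge(books, ebooks, on='title')

# Merge on a normalized key: the "&amp;" title now matches as well.
books['title_key'] = books['title'].map(normalize_title)
ebooks['title_key'] = ebooks['title'].map(normalize_title)
normalized_merge = pd.merge(books, ebooks, on='title_key')

print(len(raw_merge), len(normalized_merge))
```

Two different works can share a title, so a match on title alone may pair a book with an unrelated ebook; any normalization makes matching more permissive and should be weighed against that risk.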

In [12]:
def find_matched(list_asins, in_path, out_path):
    pd.DataFrame(columns=[['asin','overall', 'summary', 'reviewerID', 'helpful','reviewText', 'reviewerName']]).to_csv(out_path)

    for chunck in pd.read_json(in_path, lines=True, chunksize=50000):
        filtered = chunck[chunck['asin'].isin(list_asins)].dropna(how='all')
        if len(filtered) > 0:
            filtered[['asin','overall', 'summary', 'reviewerID', 'helpful','reviewText', 'reviewerName']].to_csv(out_path,header=False, mode='a')


In [14]:
find_matched(merged_asins['asin_x'].values, books_path, matched_books_path)
find_matched(merged_asins['asin_y'].values, kindle_5core, matched_ebooks_path)

In [15]:
sid = SentimentIntensityAnalyzer()

In [16]:
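def weighted_helpful(x):
    # x is the string form of the review's `helpful` pair, e.g. "[2, 3]".
    # Both values are summed here to get the number of voters. Note that the dataset
    # page describes the pair as [helpful votes, total votes]; under that reading,
    # the denominator should be x[1] alone.
    x = literal_eval(x)
    voters = int(x[0]) + int(x[1])
    # Reviews without any vote get a neutral weight of 0.5.
    return 0.5 if voters == 0 else int(x[0])/voters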

def weighted_score(data):
    
    data['weighted_help'] = (data.helpful.astype(list)).apply(weighted_helpful)
    
    data['weighted_overall'] = data['weighted_help'] * data['overall']
    weighted_score = data.groupby(data.index).sum()
    weighted_score['score'] = weighted_score['weighted_overall']/weighted_score['weighted_help']
    
    return weighted_score['score']

In [17]:
def weighted_scores(data):
    data['weighted_help'] = (data['helpful'].astype(list)).apply(weighted_helpful)
    
    func = lambda x: sid.polarity_scores(x)['compound']
    
    data['sentiment'] = data['reviewText'].apply(func)
    data['weighted_sentiment'] = data['weighted_help'] * data['sentiment']
    data['weighted_overall'] = data['weighted_help'] * data['overall']
    weighted_score = data.groupby(data['asin']).sum()
    weighted_score['sentiment score'] = weighted_score['weighted_sentiment']/weighted_score['weighted_help']
    weighted_score['overall score'] = weighted_score['weighted_overall']/weighted_score['weighted_help']
    
    return weighted_score[['sentiment score', 'overall score']]
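The scores computed above are weighted averages: for each asin, the score is the sum of weight × value over its reviews, divided by the sum of the weights. A small worked example on made-up reviews, with the weights given directly instead of being derived from the helpful votes, shows the arithmetic.

```python
import pandas as pd

# Made-up reviews: product A has two reviews, product B has one.
reviews = pd.DataFrame({
    'asin': ['A', 'A', 'B'],
    'overall': [5.0, 3.0, 4.0],
    'weighted_help': [1.0, 0.5, 0.5],
})

reviews['weighted_overall'] = reviews['weighted_help'] * reviews['overall']

# Sum the weighted ratings and the weights per product, then divide.
sums = reviews.groupby('asin')[['weighted_overall', 'weighted_help']].sum()
scores = sums['weighted_overall'] / sums['weighted_help']
print(scores)
```

For product A this gives (1.0 × 5 + 0.5 × 3) / (1.0 + 0.5) = 6.5 / 1.5 ≈ 4.33, so the more helpful 5-star review pulls the score above the plain mean of 4. If every review of a product has a weight of 0, the division is 0/0 and the score is NaN.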

In [18]:
matched_ebooks= pd.read_csv(matched_ebooks_path)
matched_books = pd.read_csv(matched_books_path)

In [19]:
weighted_scores_books = weighted_scores(matched_books)
weighted_scores_ebooks = weighted_scores(matched_ebooks)

weighted_scores_books.to_csv(weighted_scores_books_path)
weighted_scores_ebooks.to_csv(weighted_scores_ebooks_path)

In [20]:
merged_with_books = pd.merge(merged_asins, weighted_scores_books, left_on='asin_x', right_index=True)
merged_with_all = pd.merge(merged_with_books, weighted_scores_ebooks, left_on='asin_y', right_index=True, suffixes=[' book', ' ebook'])

In [23]:
merged_with_all.head(50)

Unnamed: 0,Unnamed: 0_x,asin_x,salesRank,categories,title,price,Unnamed: 0_y,asin_y,sentiment score book,overall score book,sentiment score ebook,overall score ebook
2,1380542,1579660584,{'Books': 2486827},[['Books']],The Space Between,20.08,B002DYJ7DM,B002DYJ7DM,0.1324,5.0,0.68844,3.8
4,1448217,1595143394,{'Books': 141638},[['Books']],The Space Between,7.13,B002DYJ7DM,B002DYJ7DM,0.784577,4.221992,0.68844,3.8
13,1200133,1463597029,{'Books': 4801158},[['Books']],Epiphany,2.99,B00480OQS0,B00480OQS0,0.802369,4.564737,-0.057837,4.125
14,1258058,1484967887,{'Books': 3543419},[['Books']],Epiphany,3.99,B00480OQS0,B00480OQS0,0.68,4.524272,-0.057837,4.125
20,3606,0007269854,{'Books': 2527081},[['Books']],The Ice Princess,7.59,B003ZUY19I,B003ZUY19I,0.451335,3.587818,0.835556,4.325581
24,1014015,0987930044,{'Books': 9330490},[['Books']],New Beginnings,2.99,B003CT32PQ,B003CT32PQ,0.912843,4.347826,0.863943,4.352699
31,1768373,9769528706,{'Books': 2614811},[['Books']],New Beginnings,2.99,B003CT32PQ,B003CT32PQ,0.801781,4.428571,0.863943,4.352699
32,4298,0007492316,{'Books': 3117732},[['Books']],Sacrifice,3.79,B004GNFWZ0,B004GNFWZ0,0.64645,4.6,0.399021,4.263158
33,161650,0312381867,{'Books': 153575},[['Books']],Sacrifice,2.99,B004GNFWZ0,B004GNFWZ0,0.277431,3.996652,0.399021,4.263158
38,460547,0679764100,{'Books': 458795},[['Books']],Sacrifice,7.99,B004GNFWZ0,B004GNFWZ0,-0.215695,4.509804,0.399021,4.263158
