# Books versus eBooks : The customer's choice

For the questions asked for Milestone 2, please refer to the README file, where you will see a part dedicated to milestone 2. In this Notebook, we don't answer specifically to these questions, but we speak at each step what we technically made and failed.

## Data retrieval

In [1]:
import pandas as pd
import re
import numpy as np
import requests
import time
from bs4 import BeautifulSoup
from ast import literal_eval

#### Warning :
There are several booleans in this cell. They are useful to indicate if we need to perform some key data filtering and transformation in the whole notebook. The first one is set to True as it creates a file that is bigger than 300Mo, so that we can't put it in github. You will have to compute it if you run the notebook (it takes some time). The other files are on github, so all other booleans are set to False.

In [2]:
metadata_path = 'metadata.json'
books_metadata_path = 'books_metadata_with_bracket.csv'
ebooks_metadata_title_path = 'ebooks_metadata_title.csv'
ebooks_asin = 'ebooks_asin.csv'
books_5core_path = 'reviews_Books_5.json'
kindle_5core_path = 'reviews_Kindle_Store_5.json'
amazon_ebooks = 'ebooks_title_from_amazon_complete.csv'
asindb_ebooks = 'ebooks_title_from_asindb.csv'

matched_books_path = 'matched_books.csv'
matched_ebooks_path = 'matched_ebooks.csv'
weighted_scores_books_path = 'weighted_scores_books.csv'
weighted_scores_ebooks_path = 'weighted_scores_ebooks.csv'


WRITE_BOOKS_METADATA = True
WRITE_EBOOKS_METADATA_TITLE = False
WRITE_EBOOK_ASIN = False
AMAZON_GET_TITLE = False
ASINDB_GET_TITLE = False
WRITE_FIND_MATCHED = False
WRITE_WEIGHTED_SCORE = False

As for our project, we need to obtain ebook data and book data, we chose the Amazon dataset. On this <a href='http://jmcauley.ucsd.edu/data/amazon/'>link</a>, we have downloaded the Books and Kindle Store 5-core files. However, those files contain reviews, so that we have no information about the article title or price.
For that reason, we had to use the metadata file, acting as an intermediate table (relationship).

We obtained the metadata.json file from the cluster, as it was not available in the website.
We accessed the cluster using ssh@iccluster060.iccluster.epfl.ch, then using 
```shell
hadoop fs -get /datasets/productGraph/metadata.json /buffer
```
to move the dataset to a folder we could connect to with SCP to download it on our computer.


It's a json file, that is not readable using the pandas read_json method. We had to use the Code part from <a href='http://jmcauley.ucsd.edu/data/amazon/'>here</a> to read it. We can see a way to read the file (a limited portion of it) below :

In [6]:
def read_json(path, limit = 2): 
    g = open(path, 'r') 
    df = {}
    for i, l in enumerate(g): 
        if i < limit:
            df[i] = eval(l)
        else:
            break
    return pd.DataFrame.from_dict(df, orient='index')

def read_csv(path, limit = 2): 
    return pd.read_csv(path, nrows=limit)
            
read_json(metadata_path)

Unnamed: 0,categories,salesRank,asin,imUrl,title,price,related,description
0,[[Books]],{'Books': 6334800},1048791,http://ecx.images-amazon.com/images/I/51MKP0T4...,"The Crucible: Performed by Stuart Pankin, Jero...",,,
1,"[[Movies & TV, Movies]]",{'Movies & TV': 376041},143561,http://g-ecx.images-amazon.com/images/G/01/x-s...,"Everyday Italian (with Giada de Laurentiis), V...",12.99,"{'buy_after_viewing': ['B0036FO6SI', 'B000KL8O...","3Pack DVD set - Italian Classics, Parties and ..."


However, this metadata.json file takes more than 10 Go, for 9430088 entries (obtained by doing a wc -l metadata.json), so it does not fit in memory. Thus, as we will do a lot of tests later, we wanted to create a subfile containing only the Books metadata (we don't need video games metadata for example), with a subset of columns. We also want to write it in the csv format, to manipulate it in an easier way later.

We use the regex "\[\'books" in an ignore case mode, to obtain only entries that have a category tag beginning with [Books. In fact, if we want to use the regex 'book', some entries like 0078800242 or B00000078S are not books at all, even if there is Books in the title or the category tag. The '[' is useful here to avoir this behavior.

In [7]:
def read_book_metadata(path, regex): 
    g = open(path, 'r') 
    for l in g: 
        book = regex.search(l)
        if book:
            yield eval(l) 
            
def write_df_books_metadata(from_, to, regex, columns_to_keep): 
    i = 0 
    df = {} 
    for d in read_book_metadata(from_, regex): 
        df[i] = d 
        i += 1 
        if i % 10000 == 0: # Here, we choose to write everything every 10'000 book entries, and clear the dataframe to free memory.
            pd.DataFrame.from_dict(df, orient='index')[columns_to_keep].to_csv(to, header=False,mode='a')
            df = {}

COLUMNS_TO_KEEP = ['asin', 'salesRank', 'categories', 'title', 'price']
regex = re.compile('\[\'books', re.IGNORECASE)

if WRITE_BOOKS_METADATA:
    pd.DataFrame(columns=[COLUMNS_TO_KEEP]).to_csv(books_metadata_path)
    write_df_books_metadata(metadata_path, books_metadata_path, regex, COLUMNS_TO_KEEP)

And if we read what we just wrote :

In [8]:
read_csv(books_metadata_path)

Unnamed: 0.1,Unnamed: 0,asin,salesRank,categories,title,price
0,0,1048791,{'Books': 6334800},[['Books']],"The Crucible: Performed by Stuart Pankin, Jero...",
1,1,1048775,{'Books': 13243226},[['Books']],Measure for Measure: Complete &amp; Unabridged,


We also wanted to obtain the ebooks titles, price etc..
For the category filter, we have to use the same trick as for the Books one : "\[\'Kindle". Please note that some book metadatas above are in fact kindle store metadatas, because the category can contain both. However, it's not a big deal if we want to do the merge later, because as we will see, pretty much no ebook has a title in the given metadata.

However, for the metadatas for ebooks, there was a problem at that step :

In [9]:
def read_ebook_metadata(path, regex): 
    g = open(path, 'r') 
    for l in g: 
        ebook = regex.search(l)
        if ebook:
            yield eval(l) 
def obtain_df_ebooks_metadata(from_, to, regex): 
    i = 0 
    df = {} 
    count = 0
    for d in read_ebook_metadata(from_, regex): 
        count += 1
        if(d.get('title')):
            df[i] = {'asin': d.get('asin'), 'title': d.get('title')}
            i += 1 
    pd.DataFrame.from_dict(df, orient='index').to_csv(to)
    print('Total ebooks in metadatas:', count)


regex = re.compile('\[\'Kindle', re.IGNORECASE)

if WRITE_EBOOKS_METADATA_TITLE:
    obtain_df_ebooks_metadata(metadata_path, ebooks_metadata_title_path, regex)

We see right below that 44 entries out of 434702 have a title. Of course, it's not good at all, since we want to merge books and ebooks using the title field.

In [10]:
read_csv(ebooks_metadata_title_path, None).shape[0]

44

Thus, we need to obtain the title field from another source. The first idea was to retrieve this information from Amazon directly, as we wanted to do for the user location. For that, we need to have a list of the ebooks asin (Amazon Standard Identification Numbers). We obtain it from the Kindle Store 5-core file.

In [11]:
def read_ebook_5core(path, regex): 
    g = open(path, 'r') 
    for l in g: 
        yield eval(l) 
def write_ebook_asin(from_, to): 
    i = 0 
    df = {} 
    for d in read_ebook_5core(from_, regex): 
        df[i] = d 
        i += 1 
        if i % 10000 == 0:
            if i % 100000 == 0:
                print(i) #to show the progression
            pd.DataFrame.from_dict(df, orient='index')[['asin']].to_csv(to, header=False,mode='a')
            df = {}
    pd.DataFrame.from_dict(df, orient='index')[['asin']].to_csv(to, header=False,mode='a')
    df = {}

if WRITE_EBOOK_ASIN:
    pd.DataFrame(columns=[['asin']]).to_csv(ebooks_asin)
    write_ebook_asin(kindle_5core_path, ebooks_asin)

As we were using the Kindle Store 5-core file, there are asin duplicates. We thus make it unique when we read.

In [12]:
ebooks_asin_unique = pd.read_csv('ebooks_asin.csv',usecols=[1]).asin.unique()
ebooks_asin_unique

array(['B000F83SZQ', 'B000FA64PA', 'B000FA64PK', ..., 'B00M029T4O',
       'B00M0RE7CS', 'B00M13FNSS'], dtype=object)

For every Amazon article with asin *xasinx*, the corresponding web page is https://www.amazon.com/dp/*xasinx*/ref=rdr_kindle_ext_tmb.


In [13]:
prefix = 'https://www.amazon.com/dp/'
suffix = '/ref=rdr_kindle_ext_tmb'

USER_AGENT_CHOICES = [
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
]

We have to rotate the user-agent so that the bot is less likely to be considered as one. This is why we have a User-Agent array.
At the beginning, this method was working quite well : we had obtained 503 tuples (title, category, page number, language) over 1000 requests, which could mean that we had solved the problem. However, when looking at the distribution of the 503 tuples, we could see that at the beginning, everything behaves well, we obtain most of the entries (the other ones being like the B000JMKU0Y one, an obsolete entry, that only has customer reviews).

In [14]:
if AMAZON_GET_TITLE:
    
    LIMIT = 10
    
    undefined = 0
    defined = 0
    dataframe_original = pd.DataFrame(columns=[['asin', 'title', 'Category', 'PageNum', 'Language']])
    dataframe = dataframe_original.copy()

    dataframe_original.to_csv(amazon_ebooks)

    for i, asin in enumerate(ebooks_asin_unique[:LIMIT]):

        if i%10==0:
            headers = {'User-Agent':USER_AGENT_CHOICES[np.random.randint(0, len(USER_AGENT_CHOICES))]}
            if i > 0:
                print('undefined:', undefined, '/ defined:', defined)
                dataframe.to_csv(amazon_ebooks, header=False,mode='a')
                dataframe = dataframe_original.copy()


        r = requests.get(prefix + asin + suffix, headers=headers)
        page_body = r.text
        soup = BeautifulSoup(page_body, 'html.parser')
        title = soup.find_all('span', id='ebooksProductTitle')
        if(len(title) == 0):
            undefined += 1
        else:
            defined += 1
            title = title[0].text

            ul = soup.find_all('ul', class_='a-unordered-list a-horizontal a-size-small')
            if(len(ul) > 0):
                details = ul[0].find_all('a', class_='a-link-normal a-color-tertiary')
                if(len(details) > 0):
                    category = details[-1].text.strip()
                else:
                    category = ""
            else:
                category = ""

            details = soup.find_all('table', id='productDetailsTable')
            if(len(details) > 0):
                length = details[0].find_all('b', text='Print Length:')
                if(len(length) > 0):
                    page_number = length[0].parent.text.split()[2]
                else:
                    page_number = 0

                length = details[0].find_all('b', text='Language:')
                if(len(length) > 0):
                    language = length[0].parent.text.split()[1]
                else:
                    language = ""
            else:
                page_number = pd.np.nan
                language = ""

            dataframe.loc[asin] = (asin, title, category, page_number, language)

        waiting = np.random.rand()
        time.sleep(waiting+1)

    print(defined,',',undefined)
    dataframe.to_csv(amazon_ebooks, header=False,mode='a')
    dataframe = dataframe_original.copy()

But after some time, we get less and less entries : a message is sent by Amazon when retrieving the page, saying that it's not a good idea to continue scraping data, and that it might be a good idea to go through their API. So, there were some options :
- we continue to work with the bot while tweaking the parameters to behave like a normal user for the bot (by increasing the waiting time and rotating the user-agent as said before) :

After some online search (https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/,
http://blog.datahut.co/tutorial-how-to-scrape-amazon-using-python-scrapy/,
http://docs.aws.amazon.com/AWSECommerceService/latest/DG/rest-signature.html,
https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python/,
https://blog.hartleybrody.com/scrape-amazon/), we saw that Amazon was detecting the IP, it could ban it, and the solution to avoid it was to use a proxy crawler. As it costs money, we decided not to use that. Furthermore, as said here, it's a legally speaking grey area : https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/.

- we try to go through the Amazon API : 

For the standard account, we need to put bank account information, so we prefer not to do so. For the student account, as we realized some days before the deadline that it existed, we might consider this option in the future if needed, but we wait for the epfl to accept or not the account request.

- we find a field in the metadata, different from the title, that can help us to merge a book with an ebook :

With some manual analysis, we found a pair of book-ebook : 

    {'asin': 'B000JML1QG', 'price': 0.99, 'imUrl': 'http://ecx.images-amazon.com/images/I/41VbZ%2BvxslL._BO2,204,203,200_PIsitb-sticker-v3-big,TopRight,0,-55_SX278_SY278_PIkin4,BottomRight,1,22_AA300_SH20_OU01_.jpg', 'related': {'also_viewed': ['B005LSCQ4Y', 'B0082UXYTE', 'B004TS2B4W'], 'buy_after_viewing': ['B00CS6P31U', 'B005LSCQ4Y', 'B0051EZX8Y', 'B006CRC98G']}, 'categories': [['Books', "Children's Books", 'Fairy Tales, Folk Tales & Myths', 'Anthologies'], ['Books', 'Literature & Fiction'], ['Kindle Store', 'Kindle eBooks', "Children's eBooks", 'Fairy Tales, Folk Tales & Myths', 'Anthologies'], ['Kindle Store', 'Kindle eBooks', "Children's eBooks", 'Fairy Tales, Folk Tales & Myths', 'Collections'], ['Kindle Store', 'Kindle eBooks', 'Literature & Fiction', 'Mythology & Folk Tales'], ['Kindle Store', 'Kindle eBooks', 'Science Fiction & Fantasy', 'Fantasy', 'Fairy Tales']]}


    {'asin': '0554319187', 'title': "Grimm's Fairy Stories", 'price': 0.99, 'imUrl': 'http://ecx.images-amazon.com/images/I/41O2olixwXL.jpg', 'related': {'also_viewed': ['1607103133', '0394709306', '1937994317'], 'buy_after_viewing': ['1607103133', '0394709306', '0393088863', '0385189508']}, 'salesRank': {'Books': 2586251}, 'categories': [['Books']]}

As we can see here, the only entry that is the same is the price, and it's dangerous to merge on the price as ebooks are often less expensive than the book version for the same content.

- we find another service that can give us the title for a given asin :

This is the option that we finally chose. The website http://asindb.com/ does exactly that. For this website, there is no bot detection as Amazon does. We can't retrieve the price, the category and the number of pages, but at least we can get the title. We can see the result below :

In [15]:
if ASINDB_GET_TITLE:
    
    LIMIT = 20
    
    prefix_asindb = 'http://asindb.com/USA/ASIN/'

    dataframe = pd.DataFrame(columns=[['asin', 'title']])
    notdefined = pd.DataFrame(columns=[['asin','notfound']])
    dataframe.to_csv(asindb_ebooks)

    for i, asin in enumerate(ebooks_asin_unique[:LIMIT]):
        r = requests.get(prefix_asindb + asin, headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0'})
        page_body = r.text
        soup = BeautifulSoup(page_body, 'html.parser')

        if i%10 == 0:
            dataframe.to_csv(asindb_ebooks, header=False,mode='a')
            dataframe = pd.DataFrame(columns=[['asin', 'title']])

        notfound = soup.find_all('h6', text = 'No item found!!!')
        if(len(notfound) > 0):
            notdefined.loc[asin] = (asin,1)
        else:
            title = soup.find_all('th', text='Title')
            if(len(title) > 0 and len(title[0].parent.findChildren()) >= 2):
                dataframe.loc[asin] = (asin, title[0].parent.findChildren()[1].text)
            else:
                print('alerte :', asin)

        waiting = np.random.rand()
        time.sleep(waiting+1)

    dataframe.to_csv(asindb_ebooks, header=False,mode='a')
    dataframe = pd.DataFrame(columns=[['asin', 'title']])
    
    print(notdefined.head(2))

Right above is the undefined entries dataframe (see below for more explanation).

We can see below what kind of output it gives to us.

In [16]:
ebooks_metadata = read_csv(asindb_ebooks, None)
ebooks_metadata.head()

Unnamed: 0.1,Unnamed: 0,asin,title
0,B000FA64PA,B000FA64PA,Saboteur: Star Wars Legends (Darth Maul) (Shor...
1,B000FA64PK,B000FA64PK,Recovery: Star Wars Legends (The New Jedi Orde...
2,B000FA64QO,B000FA64QO,Ylesia: Star Wars Legends (The New Jedi Order)...
3,B000FBFMVG,B000FBFMVG,A Forest Apart: Star Wars Legends (Short Story...
4,B000FC1BN8,B000FC1BN8,Fool's Bargain: Star Wars Legends (Novella) (S...


In [17]:
read_csv(asindb_ebooks, None).shape[0] # number of titles retrieved

2741

This solution is of course not the best one : the asindb website does not contain everything. We have managed to retrieve 2741 titles over 4000 asins by using this technique, but we have no problem with the Amazon bot detection (and possible ban). We can see which entries were not retrieved by printing the notdefined dataframe.

Of course, tu retrieve the 2741 entries, we set the LIMIT constant in the code to be 4000.

We thus continue our analysis by using it.

Now, as we have the title information for books and ebooks, let's merge them. We read the book metadata information in books_metadata and we have the ebook metadata information with title in ebooks_metadata.

In [18]:
books_metadata = pd.read_csv(books_metadata_path)
books_metadata.head()

Unnamed: 0.1,Unnamed: 0,asin,salesRank,categories,title,price
0,0,1048791,{'Books': 6334800},[['Books']],"The Crucible: Performed by Stuart Pankin, Jero...",
1,1,1048775,{'Books': 13243226},[['Books']],Measure for Measure: Complete &amp; Unabridged,
2,2,1048236,{'Books': 8973864},[['Books']],The Sherlock Holmes Audio Collection,9.26
3,3,401048,{'Books': 6448843},[['Books']],The rogue of publishers' row;: Confessions of ...,
4,4,1019880,{'Books': 9589258},[['Books']],Classic Soul Winner's New Testament Bible,5.39


We merge them on title, and we see that we only get 1297 entries. It's not bad, but we can for sure have a better result. As we can remark, there is a lot of time the title The Space Between. We will discuss later about it.

In [19]:
#We merge, and get the head. Everything after is here just to have a nicer representation of the first elements
books_metadata.merge(ebooks_metadata, left_on='title', right_on='title').head().iloc[:,[1,7,2,3,4,5]].rename(columns={'asin_x':'asin_books', 'asin_y':'asin_ebooks'})

Unnamed: 0,asin_books,asin_ebooks,salesRank,categories,title,price
0,2008505,B002DYJ7DM,{'Books': 5587764},[['Books']],The Space Between,4.74
1,615891411,B002DYJ7DM,{'Books': 4145053},[['Books']],The Space Between,2.99
2,1579660584,B002DYJ7DM,{'Books': 2486827},[['Books']],The Space Between,20.08
3,1588515508,B002DYJ7DM,{'Books': 5985767},[['Books']],The Space Between,
4,1595143394,B002DYJ7DM,{'Books': 141638},[['Books']],The Space Between,7.13


In [20]:
books_metadata.merge(ebooks_metadata, left_on='title', right_on='title').shape[0]

1297

As we were saying, 1297 is not a big number, and we can do better. We have done the most basic possible thing to do : we have put the title for books and ebooks in lower form (minuscule).

In [21]:
books_metadata['title'] = books_metadata.title.str.lower()
books_metadata.head()

Unnamed: 0.1,Unnamed: 0,asin,salesRank,categories,title,price
0,0,1048791,{'Books': 6334800},[['Books']],"the crucible: performed by stuart pankin, jero...",
1,1,1048775,{'Books': 13243226},[['Books']],measure for measure: complete &amp; unabridged,
2,2,1048236,{'Books': 8973864},[['Books']],the sherlock holmes audio collection,9.26
3,3,401048,{'Books': 6448843},[['Books']],the rogue of publishers' row;: confessions of ...,
4,4,1019880,{'Books': 9589258},[['Books']],classic soul winner's new testament bible,5.39


In [22]:
ebooks_metadata['title'] = ebooks_metadata.title.str.lower()
ebooks_metadata.head()

Unnamed: 0.1,Unnamed: 0,asin,title
0,B000FA64PA,B000FA64PA,saboteur: star wars legends (darth maul) (shor...
1,B000FA64PK,B000FA64PK,recovery: star wars legends (the new jedi orde...
2,B000FA64QO,B000FA64QO,ylesia: star wars legends (the new jedi order)...
3,B000FBFMVG,B000FBFMVG,a forest apart: star wars legends (short story...
4,B000FC1BN8,B000FC1BN8,fool's bargain: star wars legends (novella) (s...


In [23]:
merged_metadatas = books_metadata.merge(ebooks_metadata, left_on='title', right_on='title')
merged_metadatas = merged_metadatas[['asin_x', 'asin_y', 'title', 'price', 'categories', 'salesRank']]
merged_metadatas = merged_metadatas.rename(columns={'asin_x': 'asin_book', 'asin_y': 'asin_ebook'})

In [24]:
merged_metadatas.head()

Unnamed: 0,asin_book,asin_ebook,title,price,categories,salesRank
0,2008505,B002DYJ7DM,the space between,4.74,[['Books']],{'Books': 5587764}
1,615891411,B002DYJ7DM,the space between,2.99,[['Books']],{'Books': 4145053}
2,1579660584,B002DYJ7DM,the space between,20.08,[['Books']],{'Books': 2486827}
3,1588515508,B002DYJ7DM,the space between,,[['Books']],{'Books': 5985767}
4,1595143394,B002DYJ7DM,the space between,7.13,[['Books']],{'Books': 141638}


As we see, we have a bigger number of entries. We could have tried to increase this number, however as we will see later, we already have some problems with this 'strict' way of doing.

In [25]:
merged_metadatas.shape[0]

1506

So we have merged the two dataframes into one, and it seems that we can do some analysis on it. We have 1506 entries, so it's good for a first analysis in milestone 2 to do so.
But, as said before, there are title duplicates. It corresponds to books (there is only one duplicate entry for ebooks, for the article 'Second Chances') that have the same title.

In [26]:
merged_metadatas.head(7)

Unnamed: 0,asin_book,asin_ebook,title,price,categories,salesRank
0,2008505,B002DYJ7DM,the space between,4.74,[['Books']],{'Books': 5587764}
1,615891411,B002DYJ7DM,the space between,2.99,[['Books']],{'Books': 4145053}
2,1579660584,B002DYJ7DM,the space between,20.08,[['Books']],{'Books': 2486827}
3,1588515508,B002DYJ7DM,the space between,,[['Books']],{'Books': 5985767}
4,1595143394,B002DYJ7DM,the space between,7.13,[['Books']],{'Books': 141638}
5,1601540817,B002DYJ7DM,the space between,10.99,[['Books']],{'Books': 11114107}
6,1625530226,B002DYJ7DM,the space between,6.99,[['Books']],{'Books': 2678708}


In [27]:
ebooks_metadata.title.describe()

count               2741
unique              2739
top       second chances
freq                   2
Name: title, dtype: object

We thus thought that we could consider only pairs that have a unique title in the whole dataframe. In that way, we only have articles that have a unique name, at least in the period in which the dataset has been created, so that we could only have exactly the same content for the book and the ebook. 

We thus drop all elements that have a title that exists more than one time in the dataframe.

It's naive, as we don't have all books and ebooks of Amazon, even for the period given, but we wanted to try.

In [28]:
merged_metadatas = merged_metadatas.drop_duplicates('title',keep=False)
merged_metadatas.head(7)

Unnamed: 0,asin_book,asin_ebook,title,price,categories,salesRank
39,7269854,B003ZUY19I,the ice princess,7.59,[['Books']],{'Books': 2527081}
95,60517689,B0036ZAHDG,in the mood,2.99,[['Books']],{'Books': 2663548}
121,60595620,B00480P58K,the sweetest taboo,8.7,[['Books']],{'Books': 2956119}
173,60813032,B0049H8X86,"dragons from the sea (the strongbow saga, book 2)",3.6,[['Books']],{'Books': 1269204}
200,61084220,B004QS98KU,raven's bride,7.69,[['Books']],{'Books': 2911109}
460,140249249,B003XVYGXK,iced,,[['Books']],{'Books': 1225702}
461,140259678,B003C1R5CA,a timely death,,[['Books']],{'Books': 3077680}


We now only have 148 elements in the merged collection. We reming the reader about the fact that we tried to obtain 4000 ebooks titles, we got only 2741. By merging, we got 1506 entries, and when we drop all elements that have duplicates, we only have 148 elements.

In [29]:
merged_metadatas.shape[0]

148

After some manual analysis, we have unfortunately seen that for most entries, even if there is only one tuple (asin_book, asin_ebook), the two are not on the same content. We have an example for the first entry (the ice princess) : 'https://www.amazon.com/Princess-Patrik-Hedstrom-Erica-Falck/dp/0007269854' and 'https://www.amazon.com/Ice-Princess-Elizabeth-Hoyt-ebook/dp/B003ZUY19I' have the same title but are written by two different people, and have different content.

There are some tuples that match : for the article 'dragons from the sea (the strongbow saga, book 2)', we have the same content.

It's hard for us to quantify the number of such articles that do not match. We have done some by hand, and we have seen that a lot was not matching at all. A way to have good matches automatically could be to obtain the authors. We think that for the same author, it's rare to have two book with the same name. If we forget a minute about problems like the number of authors which is different for the ebook and the book even if it's the same, or different naming conventions (A fictive example : J. F. Brown or J. Brown), we would need the author information for each book and ebook of the tuples that we have merged.

The website we were <a href='http://asindb.com/'>using</a> does not provide this information. We then need to obtain it from somewhere else. 

We have tried to look at the library genesis, using this <a href='http://garbage.world/posts/libgen/'>tutorial</a>. However, there are only two ways to get back information about a book : using a special id (Libgen id), or by date. Furthermore, the asin field exists, but after some tests for which none of the articles had asin, it's hard to say if it's a good option. Thus, we have thought it's not the best solution for us.

We have also been told to look at the Gutenberg project. It could have been a good idea if we could search by asin in the metadatas, that are available. However, it seems that the asin data is not available, so we won't use it neither.

-- --
## Review analysis

Even if we have some problems that will need to be fixed for the milestone 3, we have continued in the analysis. Of course we had to test with wrong data, but we will ensure that in the future, we will make it work on correct data (one with same author and title for books and ebooks). The code is here to show that we have worked on other steps of the pipeline.

In [30]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer



Here you will need to uncomment this and download vader_lexicon as described in the README.

In [31]:
#nltk.download_shell()

Next we are going to fetch the reviews of the books for which we have the ebooks.

As the file for the reviews of the books is quite big, we are going the read it into chunks.

In [32]:
def find_matched(list_asins, in_path, out_path):
    pd.DataFrame(columns=[['asin','overall', 'summary', 'reviewerID', 'helpful','reviewText', 'reviewerName']]).to_csv(out_path)
    
    for chunk in pd.read_json(in_path, lines=True, chunksize=50000):
        
        # look if some asins of the chunk match the ones in the given list
        filtered = chunk[chunk['asin'].isin(list_asins)].dropna(how='all')
        
        if len(filtered) > 0:
            # write the data if any
            filtered[['asin','overall', 'summary', 'reviewerID', 'helpful','reviewText', 'reviewerName']].to_csv(out_path,header=False, mode='a')


In [33]:
if WRITE_FIND_MATCHED:
    find_matched(merged_metadatas['asin_book'].values, books_5core_path, matched_books_path)
    find_matched(merged_metadatas['asin_ebook'].values, kindle_5core_path, matched_ebooks_path)

Now that we have our reviews it is time to analyse their content. To compute the score of a book, we use two approaches:

1) A weighted average of the stars taking into account the helpfulness of the review as described below.

Let $s_{i,j}$ be the $j$th rating of book $i$ and $n$ the number of ratings for this book. As other users can review a review by saying whether it is helpful, let $k_{i,j}$ be the number of the person who reviewed the review $s_{i,j}$ and let $u_{i,j}$ be the number of reviewer who found the review helpful among the $k_{i,j}$ reviewers. Then what we want to do is to give more weight to reviews which are helpful and less weight to reviews that the other users found useless. The maximum weight a review can have is 1 and the minimum is 0, then if a review has never been reviewed or it has been voted equally helpful and not helpful, its weight will be 0.5 which is neutral.
$$w_{i,j}=\cases{\frac{u_{i,j}}{k_{i,j}}, \text{if } k_{i,j} \ne 0 \\0.5, \text{if } k_{i,j}=0}$$

    
The weighted average is then:

$$S_i = \frac{\sum_{j=1}^{n} w_{i,j}s_{i,j}}{\sum_{j=1}^{n} w_{i,j}}$$

2) A weighted average of the sentiment's intensity in the review taking into account the helpfulness with the weight being derived similarly as above.

Here we are going to use the VADER (Valence Aware Dictionary sEntiment Reasoner) sentiment analyzer from the nltk package. VADER is based on lexicons of sentiment-related words and each words is rated as whether it is positive and negative, and how negative or positive it is. For example, the 'excellent' would be treated as more positive than 'good'.

The score Vader returns is between -1 and 1, 1 for a very positive review, -1 for a very negative review, and 0 if it is neutral.

Although Vader is not the most accurate tool and to analyse a piece of text it checks if any of the words in the text are present in the lexicon, therefore its accuracy depends on the coverage of the lexicons. It is the easiest approach we have for the moment as we cannot train a classifier since we don't have a proper training set.

In [34]:
sid = SentimentIntensityAnalyzer()

In [35]:
# compute the weight of a review
def weighted_helpful(x):
    # literal_eval will evaluate '[1 ,1]' as a list
    # we use literal_eval here since using astype(list) on the column was not working
    x = literal_eval(x)
    voters = int(x[0]) + int(x[1])
    return 0.5 if voters == 0 else int(x[0])/voters

In [36]:
# compute the weighted average of the two scores for a book
def weighted_scores(data):
    # get the weight
    data['weighted_help'] = (data['helpful'].astype(list)).apply(weighted_helpful)
    
    func = lambda x: sid.polarity_scores(x)['compound']
    
    data['sentiment_review'] = data['reviewText'].apply(func)
    data['sentiment_summary'] = data['summary'].apply(func)
    
    # multiply the scores with the weight
    data['weighted_sentiment'] = data['weighted_help'] * data['sentiment']
    data['weighted_overall'] = data['weighted_help'] * data['overall']
    
    # sum everything
    weighted_score = data.groupby(data['asin']).sum()
    
    # divide by the sum of the weights to obtain the weighted average
    weighted_score['sentiment score'] = weighted_score['weighted_sentiment']/weighted_score['weighted_help']
    weighted_score['overall score'] = weighted_score['weighted_overall']/weighted_score['weighted_help']
    
    return weighted_score[['sentiment score', 'overall score']]

In [37]:
matched_ebooks= pd.read_csv(matched_ebooks_path)
matched_books = pd.read_csv(matched_books_path)

In [38]:
if WRITE_WEIGHTED_SCORE:
    weighted_scores_books = weighted_scores(matched_books)
    weighted_scores_ebooks = weighted_scores(matched_ebooks)

    weighted_scores_books.to_csv(weighted_scores_books_path)
    weighted_scores_ebooks.to_csv(weighted_scores_ebooks_path)

weighted_scores_books = pd.read_csv(weighted_scores_books_path)
weighted_scores_ebooks = pd.read_csv(weighted_scores_ebooks_path)

Once we have computed the different scores for the books and the ebooks, it is now time to combine those scores with our data.

In [39]:
merged_with_books = pd.merge(merged_metadatas, weighted_scores_books, left_on='asin_book', right_on='asin')
merged_with_all = pd.merge(merged_with_books, weighted_scores_ebooks, left_on='asin_ebook', right_on='asin', suffixes=['_book', '_ebook'])

In [40]:
merged_with_all.head()

Unnamed: 0,asin_book,asin_ebook,title,price,categories,salesRank,asin_book.1,sentiment score_book,overall score_book,asin_ebook.1,sentiment score_ebook,overall score_ebook
0,0007269854,B003ZUY19I,the ice princess,7.59,[['Books']],{'Books': 2527081},0007269854,0.451335,3.587818,B003ZUY19I,0.835556,4.325581
1,0060595620,B00480P58K,the sweetest taboo,8.7,[['Books']],{'Books': 2956119},0060595620,0.861742,3.576512,B00480P58K,0.934593,4.571429
2,0060813032,B0049H8X86,"dragons from the sea (the strongbow saga, book 2)",3.6,[['Books']],{'Books': 1269204},0060813032,0.478786,4.714286,B0049H8X86,0.58028,4.6
3,0140249249,B003XVYGXK,iced,,[['Books']],{'Books': 1225702},0140249249,0.502363,4.75,B003XVYGXK,0.551014,4.231644
4,030788922X,B004N626PY,made in italy,16.14,[['Books']],{'Books': 213697},030788922X,0.88201,4.659259,B004N626PY,0.780861,4.060606


In [41]:
merged_with_all.shape[0]

53

Here we are going to see if the sentiment's score is consistent with the ratings, meaning that reviewers giving high ratings should also be positive in their reviews.

In [91]:
reviews_books_with_sentiment = matched_books.copy()
f_sentiment = lambda x: sid.polarity_scores(x)['compound']
reviews_books_with_sentiment['sentiment_review'] = matched_books['reviewText'].apply(f_sentiment)
reviews_books_with_sentiment['sentiment_summary'] = matched_books['summary'].apply(f_sentiment)
reviews_books_with_sentiment['average'] = (reviews_books_with_sentiment['sentiment_review'] + reviews_books_with_sentiment['sentiment_summary'])*0.5

In [94]:
reviews_books_with_sentiment.groupby('overall').mean()

Unnamed: 0_level_0,Unnamed: 0,sentiment,sentiment_summary,average
overall,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,3683335.0,0.212659,-0.160626,0.026016
2,3191242.0,0.258701,-0.034349,0.112176
3,3451677.0,0.467796,0.138467,0.303131
4,3435581.0,0.67507,0.279179,0.477124
5,3741658.0,0.728469,0.356776,0.542622


In [93]:
reviews_books_with_sentiment[reviews_books_with_sentiment['overall'] == 1]

Unnamed: 0.1,Unnamed: 0,asin,overall,summary,reviewerID,helpful,reviewText,reviewerName,sentiment,sentiment_summary,average
11,23417,0007269854,1,Absurd police ineptitude,A38L3I3R3VRPTT,"[0, 0]","I agree with the complaints about cliches, car...","Bay Area Bibliophile ""marisylvia""",0.6467,0.0000,0.32335
64,23470,0007269854,1,Hated it,A2T1ZGMIFSHWW2,"[5, 5]","Giving up at pg 139 (of 389, paperback). I no ...",In Vino Veritas,0.8126,-0.6369,0.08785
65,23471,0007269854,1,Not so good,A23J25BQYRAIVX,"[13, 15]","If you're a fan of Scandinavian crime fiction,...",Jack Tierney,-0.9862,0.4927,-0.24675
67,23473,0007269854,1,Audiobook CDs poor grade,A3UYDNMNB7VX75,"[0, 0]","My complaint is not with the novel, but with t...","Jane F. Wiedel ""dog lover""",-0.9821,-0.4767,-0.72940
100,23506,0007269854,1,I don't usually write bad reviews.,A3RIGC6OUSKQ8R,"[6, 6]","However, I am 78 pages into this book and I ma...",Mary Ann Moore,-0.7228,0.4310,-0.14590
101,23507,0007269854,1,Poorly written and/or translated,A25UW8MSZUTX1Q,"[2, 2]","I was disappointed with this book, I felt it w...","Mary Brydone Hall ""Satyamurti""",0.8291,0.0000,0.41455
110,23516,0007269854,1,Electrifying? More like agonizing.,A2N45V7IF4CSZU,"[5, 5]",Here's the problem. A thriller can have as man...,Misha,-0.3694,-0.2927,-0.33105
212,1330743,0312869967,1,Buyer Beware,A3JFWAZS1SSPHO,"[1, 5]",This book is just plain horrible and for me to...,Ariel Pawlak,0.9735,0.0000,0.48675
214,1330745,0312869967,1,Knight Errant,AF0VPT3C3W740,"[1, 2]",The first few pages grab my attention however ...,"Donna Marie Chelland ""Adena Evans""",-0.7991,0.0000,-0.39955
224,1330755,0312869967,1,Don't waste your money,A1N6VNFI5MEPK9,"[2, 5]","Like the previous reviews, I too purchased thi...","V. Payne ""seems all I do these days is read!""",0.9019,0.3252,0.61355


In [87]:
reviews_books_with_sentiment.iloc[212]['reviewText']

"This book is just plain horrible and for me to say that is saying a lot.  I read everything and a Danielle Steel or Nora Roberts novel is like Shakespeare compared to this book.  I picked this book up because I loved Michael Chriton's Timeline so much.  I am facinated with the medieval ages and think the concept of someone from today's time comming in contact with the past is an intresting concept.  Unfortunately this book is not intresting in the least.  I went into the book really wanting to like it but the writing is so stiff and awkard.  The dialogue is not believable in the least.  The romance between Edward and Robyn which should be the focal point of the book is boring.  Edward is like a Ken doll, looks cute but is plastic with no personality.  I guess it doesn't matter in the end because he is not around for most of the story anyway.  If you read this book thinking you will learn something about the medieval ages you will be sorely dissappointed.  Edward doesn't talk like he's

This review above has a really positive sentiment score of 0.9735, but if we read the review, we can understand that it is quite the opposite. A second example is "Not so good" which yields a score of 0.4927, showing us the limitation of Vader.

In [88]:
reviews_books_with_sentiment.iloc[276]['reviewText']

"I was stunned at the lack of development of Lora Leigh's short story. There seemed to be all the components of a cool story with great action & sex. Instead it was boring. Most of the action was over in the early pages - leaving the same sex scenes to recycle until the 'story' was mercifully over. It seemed to be more of a teaser for the Bengal story and 'bad boy Devil' was just the vehicle to deliver the teaser.  At the end, I just felt irritated."

We can observe here that the better the rating is, the more positive the text of the review is in average. Even if users gave a rating of 1, the reviews for this rating are not really negative in average. Even if the scores are not always accurate due to the limitation of Vader, they are consistent in average with the ratings.

In [89]:
reviews_ebooks_with_sentiment = matched_ebooks.copy()
reviews_ebooks_with_sentiment['sentiment_review'] = matched_ebooks['reviewText'].apply(f_sentiment)
reviews_ebooks_with_sentiment['sentiment_summary'] = matched_ebooks['summary'].apply(f_sentiment)
reviews_ebooks_with_sentiment['average'] = (reviews_ebooks_with_sentiment['sentiment_review'] + reviews_ebooks_with_sentiment['sentiment_summary'])*0.5

In [90]:
reviews_ebooks_with_sentiment.groupby('overall').mean()

Unnamed: 0_level_0,Unnamed: 0,sentiment,sentiment_summary,average
overall,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,23266.285714,-0.150765,-0.153098,-0.151932
2,28643.590909,0.053472,-0.045726,0.003873
3,28790.469697,0.503538,0.186164,0.344851
4,29388.108069,0.663903,0.268574,0.466238
5,27860.713587,0.688401,0.35537,0.521885


The ratings seem to be consistent with the sentiment's score here again. Compared to books, the sentiment's score is slightly lower and even negative for a rating of 1.

For the same rating, the books reviewer seems to be more positive in their review, this may be explained by the fact that books reviews tend to be longer and more descriptive than ebooks reviews.

This analysis may not be really representative of the reality due to the limitation in size of our final dataset.

Sources:
* *Improving the Amazon Review System by Exploiting the Credibility and Time-Decay of Public Reviews*, Bo-Chun Wang, Wen-Yuan Zhu, and Ling-Jyh Chen
* Hutto, C.J. & Gilbert, E.E. (2014). *VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14)*. Ann Arbor, MI, June 2014.