# Best-Selling Books on Amazon by Category


Here are the steps I followed to build the project:

- Selected https://www.amazon.co.uk/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_nav_0 as the page to scrape.
- Extracted the list of all book categories, storing the `CATEGORY_NAME` and `CATEGORY_PAGE_URL` for each one.
- For each category, created a pandas DataFrame holding the `Book Name`, `Book URL`, `Author Name`, `Number of reviews`, `Rating`, `Format`, and `Cheapest price in which it is available`.
- For each category, wrote all the scraped information to its own CSV file.


In [98]:
import requests
from bs4 import BeautifulSoup

### Part 1: Scraping the list of book categories from Amazon

For this, we are going to use:
- `requests` to download the page and `bs4` to parse it and extract information.
- Note that a `response.status_code` other than `200` indicates that the page did not load successfully. The function below raises an exception in that case.
  (More information on HTTP response status codes: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
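
As a small illustration of the status-code check, the sketch below uses Python's standard `http.HTTPStatus` enum to turn a numeric code into a readable label. The helper names `describe_status` and `is_success` are hypothetical and are not part of this notebook.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a readable label such as '404 Not Found' for a status code."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes
        phrase = 'Unknown status'
    return f'{status_code} {phrase}'


def is_success(status_code):
    """Mirror the notebook's check: only 200 counts as a successful load."""
    return status_code == HTTPStatus.OK


for code in (200, 404, 503):
    print(describe_status(code), '| success:', is_success(code))
```

A label like this can make the failure messages easier to read than the bare URL alone.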

In [99]:
def get_home_page():
    # URL of the best sellers page we want to parse
    home_url = 'https://www.amazon.co.uk/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_nav_0'
    response = requests.get(home_url)
    # Anything other than 200 means the page did not load as expected
    if response.status_code != 200:
        raise Exception(f'Failed to load page {home_url}')
    # Parse the HTML content of the web page
    home_doc = BeautifulSoup(response.text, 'html.parser')
    return home_doc

In [102]:
home_doc = get_home_page()

Now that we have the home page in the variable `home_doc`, we need to identify the section of the page that contains the book categories.

I selected the `div` element with id `zg-left-col`, which holds the entire categories section of the page, and fetched the elements nested inside it with `.findChildren()`.

![](https://i.imgur.com/o3tyrRh.png)

The `div` we selected has `73` child elements, but we only need the first one. I selected all the `a` elements inside that first child. There are exactly 34 of them: 33 correspond to the book categories, and one extra element corresponds to `Any Department`, which we do not need and can slice off.
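
The selection described above can be tried on a tiny, made-up HTML snippet. This is only a sketch: the markup below is hypothetical and far simpler than the real page. To the best of my knowledge, `.findChildren()` is an older alias of `.find_all()` in Beautiful Soup, so it returns every nested tag rather than only the direct children; the example uses `.find_all()` to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the left column of the page
html = (
    '<div id="zg-left-col">'
    '<ul>'
    '<li><a href="/any">Any Department</a></li>'
    '<li><a href="/art">Art</a></li>'
    '<li><a href="/bio">Biography</a></li>'
    '</ul>'
    '<p>Something else</p>'
    '</div>'
)
doc = BeautifulSoup(html, 'html.parser')
section = doc.find('div', {'id': 'zg-left-col'})

# All nested tags (what .findChildren() would also return)
all_nested = section.find_all()
# Only the direct children of the div
direct_children = section.find_all(recursive=False)

# The first nested element holds the links; drop the leading 'Any Department'
category_links = all_nested[0].find_all('a')[1:]
category_link_names = [link.text for link in category_links]

print(len(all_nested), len(direct_children), category_link_names)
```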

In [103]:
def get_categories(home_doc):
    # Get the elements inside the left column of the page, which holds the category list
    selection_id = 'zg-left-col'
    categories_section = home_doc.find('div', {'id': selection_id}).findChildren()
    # The first element contains the links to all the book categories
    book_categories = categories_section[0].find_all('a')
    # Slice off the leading 'Any Department' element
    book_categories = book_categories[1:]
    return book_categories

In [104]:
book_categories = get_categories(home_doc)

In [106]:
len(book_categories) #The length of book_categories should be 33 representing all the 33 categories of books

33

I noticed that one category, `Calendars, Diaries, Annuals & More`, contains only weekly planners and diaries, so its items have no `author` or `book_format`. We could exclude this category by slicing it out of the list with `book_categories[:3]+book_categories[4:]`.

I have not excluded it, however; the scraping code below handles this category too.

In [107]:
category_names=[]
category_urls=[]

I have defined two lists, `category_names` and `category_urls`, to store the category names and their URLs respectively. The helper function below fills them.

The name of the i-th category is `book_categories[i].text` and its URL is `book_categories[i]['href']`.
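
The sketch below shows the same `.text` and `['href']` lookups on two made-up links. The base URL and paths are hypothetical. It also runs each `href` through the standard library's `urljoin`, which leaves absolute URLs unchanged and resolves relative ones against the base; this is an optional safeguard and not something the notebook itself does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.example.com'  # hypothetical base URL

# Two made-up category links: one relative href, one absolute href
html = (
    '<a href="/books/art">Art</a>'
    '<a href="https://www.example.com/books/bio"> Biography </a>'
)
links = BeautifulSoup(html, 'html.parser').find_all('a')

# .text gives the visible label, ['href'] gives the link target
names = [link.text.strip() for link in links]
urls = [urljoin(BASE_URL, link['href']) for link in links]

print(names)
print(urls)
```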

In [108]:
def get_names_and_urls(book_categories):
    # Iterate through the book_categories list to collect each category's name and URL
    for category in book_categories:
        category_names.append(category.text)
        category_urls.append(category['href'])

In [109]:
get_names_and_urls(book_categories) #Running the above function

In [110]:
# Checking if the code is running as expected
print(category_names[:5])
print()
print(len(category_urls))

['Art, Architecture & Photography', 'Biography', 'Business, Finance & Law', 'Calendars, Diaries, Annuals & More', "Children's Books"]

33


This completes Part 1: scraping the list of book categories from Amazon.

### Part 2: Getting top 50 best selling books from each category

- First, we have to get the web page for each category, given its `category_url`.
- I have created a helper function below to do this.

Note:
1. We have 33 category pages to fetch, and some of them may fail to load while the code is running, giving `status_code!=200`.
2. To handle this, I have created a `Queue` that contains all the category URLs that have not been visited yet.
3. We try to fetch each category page, and if `status_code!=200` we push its URL back onto the queue.
4. I have also initialized a second queue that holds the unvisited category names. Whenever a URL is pushed back onto its queue (that is, when `status_code!=200`), the corresponding category name is pushed back onto this queue as well.
5. The `unvisited_category_names` queue is used to name our `csv` files.
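
The retry idea in the notes above can be sketched without any network access. The version below is a variation, not the notebook's implementation: it keeps the name and URL together in a single queue and adds a cap on attempts so that a page which never loads cannot keep the loop running forever. `scrape_with_retries`, `fake_fetch`, `max_attempts`, and the example URLs are all hypothetical.

```python
from queue import Queue


def scrape_with_retries(categories, fetch, max_attempts=3):
    """Fetch every (name, url) pair, re-queueing failures up to max_attempts."""
    pending = Queue()
    for name, url in categories:
        pending.put((name, url, 1))  # third field is the attempt number

    done, failed = [], []
    while not pending.empty():
        name, url, attempt = pending.get()
        page = fetch(url)
        if page is not None:
            done.append(name)
        elif attempt < max_attempts:
            # Push the category back onto the queue for another try
            pending.put((name, url, attempt + 1))
        else:
            failed.append(name)
    return done, failed


# Stand-in for a real download: 'b' fails once, 'c' always fails
calls = {}


def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith('/c'):
        return None
    if url.endswith('/b') and calls[url] == 1:
        return None
    return f'<html>{url}</html>'


categories = [
    ('A', 'https://example.com/a'),
    ('B', 'https://example.com/b'),
    ('C', 'https://example.com/c'),
]
done, failed = scrape_with_retries(categories, fake_fetch)
print(done, failed)
```

Keeping the name and URL in one tuple removes the need to keep two queues in step, at the cost of differing from the two-queue layout used in this notebook.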

In [111]:
from queue import Queue
unvisited_urls = Queue() #Initializing Queue to store all the unvisited urls
unvisited_category_names = Queue() #Initializing Queue to store all the unvisited category names

In [112]:
def get_category_page(category_url):
    #Getting the web page
    response = requests.get(category_url)
    # Checking the status code of the response
    if response.status_code!=200:
        print(f'Failed to load page {category_url}') 
        # I have used 'return 0' to make the calling function notice that there is a `status_code!=200` and the url should be added back to unvisited_queue
        return 0 
    # Parsing the web page using Beautiful Soup
    category_doc = BeautifulSoup(response.text,'html.parser')
    return category_doc

In [113]:
#Checking whether the above function is working
category_doc = get_category_page(category_urls[5])

Failed to load page https://www.amazon.co.uk/Best-Sellers-Books-Comics-Graphic-Novels/zgbs/books/274081


- Next, we have to get all the best-selling books from this category.
- I selected all the `span` elements with class `aok-inline-block zg-item` to get all the books on the page. There are exactly 50 such elements, one for each book on the category page.

This can be done by:
`books = category_doc.find_all('span',{'class':'aok-inline-block zg-item'})`

After this step, the variable `books` holds all 50 best-selling books.
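
One detail worth knowing about the `find_all` call above: when the class value contains a space, Beautiful Soup compares it against the whole `class` attribute as a single string, so the order of the class names matters. The sketch below uses made-up markup to show this, next to a CSS selector (`select`) that matches regardless of order. The book titles are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up markup: same two classes in different orders, plus a partial match
html = (
    '<span class="aok-inline-block zg-item">Book A</span>'
    '<span class="zg-item aok-inline-block">Book B</span>'
    '<span class="zg-item">Book C</span>'
)
doc = BeautifulSoup(html, 'html.parser')

# Matches the class attribute as one exact string
exact = doc.find_all('span', {'class': 'aok-inline-block zg-item'})
# Matches any span that carries both classes, in any order
by_selector = doc.select('span.aok-inline-block.zg-item')

exact_titles = [span.text for span in exact]
selector_titles = [span.text for span in by_selector]
print(exact_titles)
print(selector_titles)
```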

The question is: how do we get the following attributes for each of the 50 books in the variable `books`?
1. Name of the book
2. Author of the book
3. URL of the book
4. Number of reviews for the book
5. Current rating
6. Format of the book (such as hardcover, paperback, Kindle edition, audiobook)
7. Price of the book

Here is something interesting I found: `books[i]` contains the information for the i-th book. Because it is a `span` element, all of the attributes listed above are contained in its children.

There are 4 cases we have to consider while scraping the books.

1. When the length of `books[i].findChildren()` is `18`
2. When the length of `books[i].findChildren()` is `16`
3. When the length of `books[i].findChildren()` is `14`
4. When the length of `books[i].findChildren()` is `13`

The number of children of `books[i]` reflects which of the attributes discussed above are present. If the length of `books[i].findChildren()` is less than `18`, some attributes are missing for that book.

Let us see some examples of each class:
1. When the length of `books[i].findChildren()` is `18`
![](https://i.imgur.com/iz5Emu2.png)

Books with `number of children = 18` contain all 7 attributes that we want, with none missing.

2. When the length of `books[i].findChildren()` is `16`
![](https://i.imgur.com/lF49cm5.png)

Books with `number of children = 16` have 1 missing attribute, `author_name`. Therefore, `author_name` should be `NULL` for these books.

3. When the length of `books[i].findChildren()` is `14`
![](https://i.imgur.com/DuwFIQC.png)

Books with `number of children = 14` have 2 missing attributes: `author_name` and `book_format`. Therefore, `author_name` and `book_format` should be `NULL` for these books.

4. When the length of `books[i].findChildren()` is `13`
![](https://i.imgur.com/Q6vdS3L.png)

Books with `number of children = 13` have 2 missing attributes: `number_of_reviews` and `rating`. Therefore, `number_of_reviews` and `rating` should be `NULL` for these books.

###### How do we find the attributes given the length of `.findChildren()` ?

- This needs to be worked out manually. I checked every attribute individually for each of the cases mentioned above.

- For example, let `book = books[i].findChildren()`
- `book[4]` contains the name of the book irrespective of `len(book)`.
- `book[11]` contains the number of reviews when `len(book)==18`
- `book[9]` contains the number of reviews when `len(book)==16`
- `book[7]` contains the book format when `len(book)==13`

- The remaining positions have to be found manually in the same way.
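
One way to keep these hand-found positions in a single place is a lookup table keyed by the number of children. The sketch below is an alternative layout, not the approach used in this notebook; the positions are copied from the notebook's `get_info_books` function, and `ATTRIBUTE_POSITIONS` and `extract_attributes` are hypothetical names. It works on a plain list of strings so that it can be run without downloading anything.

```python
# Position of each attribute in the list returned by findChildren(),
# keyed by the length of that list. None means the attribute is missing.
ATTRIBUTE_POSITIONS = {
    18: {'book_name': 4, 'book_author': 5, 'rating': 9,
         'number_of_reviews': 11, 'book_format': 12, 'price_of_book': 17},
    16: {'book_name': 4, 'book_author': None, 'rating': 8,
         'number_of_reviews': 9, 'book_format': 11, 'price_of_book': 15},
    14: {'book_name': 4, 'book_author': None, 'rating': 8,
         'number_of_reviews': 9, 'book_format': None, 'price_of_book': 13},
    13: {'book_name': 4, 'book_author': 5, 'rating': None,
         'number_of_reviews': None, 'book_format': 7, 'price_of_book': 12},
}


def extract_attributes(children_texts):
    """Map a list of child texts to a dict of attributes, using 'NULL' for gaps."""
    positions = ATTRIBUTE_POSITIONS.get(len(children_texts))
    if positions is None:
        # A layout that was never checked by hand: fail loudly instead of guessing
        raise ValueError(f'Unexpected number of children: {len(children_texts)}')
    return {
        attribute: 'NULL' if index is None else children_texts[index].strip()
        for attribute, index in positions.items()
    }


# Placeholder texts standing in for the 13 children of a book without reviews
sample = [f' child {i} ' for i in range(13)]
attributes = extract_attributes(sample)
print(attributes)
```

Raising an error for an unknown length makes a new page layout visible immediately, instead of silently treating it as the 13-children case.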


- With all the information needed to get the attributes of the 50 best-selling books on a category page, we can now create a helper function that extracts them and saves them to a file.

- We also need to import pandas to convert our category dictionary to a DataFrame and then to a `.csv` file.

In [114]:
#Importing pandas 
import pandas as pd

In [115]:
# Passing `category_doc` to scrape and `category_name` for naming the csv file after parsing the page
def get_info_books(category_doc,category_name):
    selection_class = 'aok-inline-block zg-item'
    # `books` contains all the information associated with the 50 best selling books in the category page
    books = category_doc.find_all('span',{'class':selection_class})
    base_url = 'https://amazon.co.uk'
    # Creating a dictionary to store the attributes in the current page
    category_books_dict={
        'book_name':[],
        'book_author':[],
        'book_URL':[],
        'rating':[],
        'number_of_reviews':[],
        'book_format':[],
        'price_of_book':[]
    }
    #The length of books is 50. So, we get exactly 50 best selling books
    for i in range(len(books)):
        # 'book' contains all the information associated with the ith book in the page
        book = books[i].findChildren()
        #I have used '.strip()' method to clear out extra spaces in the text. We can achieve our task without this.
        if(len(book)==18): #No missing attributes
            category_books_dict['book_name'].append(book[4].text.strip())
            category_books_dict['book_author'].append(book[5].text.strip())
            current_url = base_url+book[0]['href']
            category_books_dict['book_URL'].append(current_url)
            category_books_dict['rating'].append(book[9].text.strip())
            category_books_dict['number_of_reviews'].append(book[11].text.strip())
            category_books_dict['book_format'].append(book[12].text.strip())
            category_books_dict['price_of_book'].append(book[17].text.strip())
        elif(len(book)==16): #'author_name' is missing
            category_books_dict['book_name'].append(book[4].text.strip())
            category_books_dict['book_author'].append('NULL')
            current_url = base_url+book[0]['href']
            category_books_dict['book_URL'].append(current_url)
            category_books_dict['rating'].append(book[8].text.strip())
            category_books_dict['number_of_reviews'].append(book[9].text.strip())
            category_books_dict['book_format'].append(book[11].text.strip())
            category_books_dict['price_of_book'].append(book[15].text.strip())
        elif(len(book)==14): # 'author_name' and 'book_format' are missing
            category_books_dict['book_name'].append(book[4].text.strip())
            category_books_dict['book_author'].append('NULL')
            current_url = base_url+book[0]['href']
            category_books_dict['book_URL'].append(current_url)
            category_books_dict['rating'].append(book[8].text.strip())
            category_books_dict['number_of_reviews'].append(book[9].text.strip())
            category_books_dict['book_format'].append('NULL')
            category_books_dict['price_of_book'].append(book[13].text.strip())
        else: #len(book)==13. 'num_reviews' and 'rating' are missing
            category_books_dict['book_name'].append(book[4].text.strip())
            category_books_dict['book_author'].append(book[5].text.strip())
            current_url = base_url+book[0]['href']
            category_books_dict['book_URL'].append(current_url)
            category_books_dict['rating'].append('NULL')
            category_books_dict['number_of_reviews'].append('NULL')
            category_books_dict['book_format'].append(book[7].text.strip())
            category_books_dict['price_of_book'].append(book[12].text.strip())
            
        
    # Convert the dictionary to a DataFrame using pandas
    category_books_df = pd.DataFrame(category_books_dict)
    # 'path' is the name of the csv file in which we store all the information for this category
    path = category_name + '.csv'
    # Write the DataFrame to a csv file without the row index
    category_books_df.to_csv(path, index=False)
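
Since the category name is used directly as the file name, names such as `Calendars, Diaries, Annuals & More` end up with commas, spaces, and ampersands in the path. That usually works, but a name containing `/` would not. The sketch below shows one possible way to build a safer file name with the standard `re` module; `category_to_filename` is a hypothetical helper and is not used by the notebook.

```python
import re


def category_to_filename(category_name):
    """Replace every run of non-alphanumeric characters with one underscore."""
    safe_name = re.sub(r'[^A-Za-z0-9]+', '_', category_name).strip('_')
    return safe_name + '.csv'


for name in ('Calendars, Diaries, Annuals & More', "Children's Books"):
    print(category_to_filename(name))
```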

 - We need a helper function that initializes the `unvisited_urls` queue and the `unvisited_category_names` queue before carrying out the next steps.

 - Initially, we fill the queues with all the category URLs and all the category names respectively, since every category is unvisited at the start and needs to be scraped.

In [116]:
#Creating a function to fill the queues with all the 33 category_urls and category_names
def fill_queues():
    # Walk both lists in step so each URL stays paired with its category name
    for url, name in zip(category_urls, category_names):
        unvisited_urls.put(url)
        unvisited_category_names.put(name)

In [117]:
# Calling the fill_queues function
fill_queues()

In [118]:
# Checking if the queues are properly filled. The size of both queues should be `33`

print(f'Size of unvisited_urls queue: {unvisited_urls.qsize()} and size of unvisited_category_names queue: {unvisited_category_names.qsize()}')

Size of unvisited_urls queue: 33 and size of unvisited_category_names queue: 33


- All the helper functions we need are now in place.
- The only remaining part is the final `scraping` function.

In [119]:
# The final scrape function, which ties all the helpers together
def scrape_categories():
    # Keep going until the queue is empty, i.e. no unvisited URL remains
    while not unvisited_urls.empty():
        unvisited_url = unvisited_urls.get()
        unvisited_cat_name = unvisited_category_names.get()
        print(f'Scraping Category: {unvisited_url}')
        category_doc = get_category_page(unvisited_url)
        # get_category_page returns 0 when status_code != 200, so the category must go back to the unvisited queues
        if category_doc == 0:
            print(f'Pushing the category: {unvisited_cat_name} to the unvisited queue')
            unvisited_urls.put(unvisited_url)
            # Also push the category name back to its corresponding queue
            unvisited_category_names.put(unvisited_cat_name)
        else:
            # status_code == 200: parse the page and save the info in the 'category_name.csv' file
            get_info_books(category_doc, unvisited_cat_name)
    print("SCRAPING IS DONE!!!!")

In [120]:
# Calling the above defined scraping function
scrape_categories()

Scraping Category: https://www.amazon.co.uk/Best-Sellers-Books-Arts-Photography/zgbs/books/91
Scraping Category: https://www.amazon.co.uk/Best-Sellers-Books-Biographies-Memoirs/zgbs/books/67
Scraping Category: https://www.amazon.co.uk/Best-Sellers-Books-Business-Finance-Law/zgbs/books/68
Scraping Category: https://www.amazon.co.uk/Best-Sellers-Books-Calendars-Diaries-Annuals/zgbs/books/507848
Scraping Category: https://www.amazon.co.uk/Best-Sellers-Books-Childrens/zgbs/books/69
Failed to load page https://www.amazon.co.uk/Best-Sellers-Books-Childrens/zgbs/books/69
Pushing the category: Children's Books to the unvisited queue
Scraping Category: https://www.amazon.co.uk/Best-Sellers-Books-Comics-Graphic-Novels/zgbs/books/274081
Failed to load page https://www.amazon.co.uk/Best-Sellers-Books-Comics-Graphic-Novels/zgbs/books/274081
Pushing the category: Comics & Graphic Novels to the unvisited queue
Scraping Category: https://www.amazon.co.uk/Best-Sellers-Books-Computing-Internet/zgbs/book

### References and Future Works