# 🔋 ▶ ---------- Project Overview ---------- 💚

⭐ 💻 **Web scraping** is an automated method of obtaining large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications. (Reference: [GeeksforGeeks](https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/))
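
As a rough illustration of turning HTML into structured rows, here is a minimal sketch. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; the `LinkCollector` class and the HTML fragment are made up for this example and are not part of the project code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the text and href of every <a> tag as structured rows."""

    def __init__(self):
        super().__init__()
        self.rows = []              # structured output: one dict per link
        self._current_href = None   # href of the <a> tag being read, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href="..."> tag
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.rows.append({"text": text, "href": self._current_href})
            self._current_href = None


# Made-up HTML fragment, not copied from any real page
sample_html = """
<div><a href="/books/action">Action &amp; Adventure</a></div>
<div><a href="/books/business">Business &amp; Economics</a></div>
"""

collector = LinkCollector()
collector.feed(sample_html)
rows = collector.rows
print(rows)
```

Each dictionary in `rows` is one structured record that could be written to a spreadsheet or database.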

⭐ 💻 **Amazon.in** is a popular e-commerce platform where sellers list their products and customers purchase them online. Books are among the most popular items sold on the platform. Amazon lets customers buy a large variety of books across different categories, such as:
            
  * ▶ Action & Adventure
  * ▶ Arts, Film & Photography
  * ▶ Business & Economics and so on. 

In this project we are going to scrape the details of the books in each category listed on the Amazon website. The scraper will collect the **book name**, **author**, **price**, **number of ratings**, **star rating**, **number of pages** and **language** of each book.

⭐ 💻 **Tools and Technologies**
* ▶ Python
* ▶ requests
* ▶ BeautifulSoup
* ▶ pandas


# 🔋 ▶ ---------- Project Outline  ---------- 💚

* ⏳ We are going to scrape https://www.amazon.in/gp/bestsellers/books/
* ⏳ We will get the list of book categories; for each category we will get the category name and its URL.
* ⏳ For each book category we will get the book names and the URLs of the book details pages.
* ⏳ For each book we will grab the book name, author, price, ratings, stars, pages and language.
* ⏳ For each book category we'll create a CSV file in the following format:

![picture](https://drive.google.com/file/d/1DS8Joyz-bO0W9NhHkh0zaXNeL9vAauJB/view?usp=sharing)
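
The CSV layout can also be sketched in code. This assumes the columns follow the keys of the dictionary built in `get_book_details_df` later in this notebook; the sample row is assembled from the cell outputs shown further down, and the sketch uses the standard `csv` module instead of pandas so it runs without extra packages.

```python
import csv
import io

# Column order assumed from the dictionary in get_book_details_df
COLUMNS = ["name", "author", "language", "price", "ratings", "stars", "pages"]

# One row assembled from the individual cell outputs in this notebook
sample_row = {
    "name": "Learning How to Fly: Life Lessons for the Youth",
    "author": "A.P.J. Abdul  Kalam",
    "language": "English",
    "price": "₹149.47",
    "ratings": "2,762",
    "stars": "4.7",
    "pages": "117",
}

# Write to an in-memory buffer so nothing touches the disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(sample_row)

csv_text = buffer.getvalue()
print(csv_text)
```

Values that contain a comma, such as the ratings count, are quoted automatically by the writer.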


## 🥗 ---------- Use the requests library to download web pages ---------- 💚

In [3]:
import requests
import time
import os

In [4]:
def get_page_content(url):
  '''
    -Returns the page content when the url of the page is passed  
    :param url | String, The url of the page whose content is needed
    :return String of page content, or an empty string if every attempt fails
  '''
  headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
  
  max_attempts = 4

  for attempt in range(max_attempts):
    response = requests.get(url , headers = headers)
    if response.status_code == 200 :
      return response.text
    # Any other status: wait a little and try again
    print("Server is Busy || Waiting 5 seconds....")
    time.sleep(5)

  # Every attempt failed: return an empty string so callers always get a str
  return ""

💻 The **get_page_content** function accepts the page URL as a parameter and returns the page content as a string. It also handles a common failure when using the requests library, where '*the server is not ready to handle the request*': it waits a few seconds and sends the request again, giving up after four attempts.
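
The retry idea can be tried out without touching the network. In the sketch below, `fetch_with_retries` and its `fetch` and `sleep` parameters are hypothetical names for this example: `fetch` stands in for `requests.get` and returns a `(status_code, text)` tuple, and `sleep` is injectable so the waits can be recorded instead of actually pausing.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=4, wait_seconds=5, sleep=time.sleep):
    """Call fetch(url) until it reports status 200 or the attempts run out.

    `fetch` is any callable returning a (status_code, text) tuple; it is a
    hypothetical stand-in for requests.get so the sketch runs offline.
    """
    for attempt in range(1, max_attempts + 1):
        status_code, text = fetch(url)
        if status_code == 200:
            return text
        # Wait before the next attempt, but not after the last one
        if attempt < max_attempts:
            sleep(wait_seconds)
    return ""  # every attempt failed


# Fake server: busy twice, then succeeds
responses = iter([(503, ""), (503, ""), (200, "<html>ok</html>")])
waits = []  # records the requested waits instead of sleeping
page = fetch_with_retries(lambda url: next(responses), "https://example.com",
                          sleep=waits.append)
print(page, waits)
```

Passing the fetch function in as a parameter is only a convenience for testing; the notebook's own function calls `requests.get` directly.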


In [5]:
home_page_url = "https://www.amazon.in/gp/bestsellers/books/"

In [6]:
home_page_content = get_page_content(home_page_url)
home_page_content[:1000]

'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n\n<script type=\'text/javascript\'>var ue_t0=ue_t0||+new Date();</script>\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completion.amazon.com">\n<!-- sp:end-feature:cs-optimization -->\n<script type=\'text/javascript\'>\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\n\nvar ue_csm = window,\n    ue_hob = +new Date();\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.sl

## 🥗 ---------- Use Beautiful Soup to parse and extract information ---------- 💚

In [7]:
from bs4 import BeautifulSoup
import pandas as pd

In [8]:
home_doc = BeautifulSoup(home_page_content , 'html.parser')

###  📗 Scraping ▶ Book Categories and their links

In [9]:
div_selection_class = '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
div_tags = home_doc.find_all('div' , {'class':div_selection_class})

In [10]:
def get_categories_df(base_url=""):
  '''
    -Obtain the Book Category details (category name , url)  
    :param base_url | String, The base url to prepend to each book category url , default : ""
    :return Pandas DataFrame
  '''
  book_categories = []
  category_links = []
  div_selection_class = '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
  div_tags = home_doc.find_all('div' , {'class':div_selection_class}) 
  for div_tag in div_tags:
    category = div_tag.find('a')
    if(category):
      book_categories.append(category.text)
      category_links.append(base_url +category['href'])

  category_details = {"category" : book_categories,"url": category_links}
  return pd.DataFrame(category_details)

In [11]:
df = get_categories_df("https://www.amazon.in")

In [12]:
df.head()

|   | category | url |
|---|---|---|
| 0 | Action & Adventure | https://www.amazon.in/gp/bestsellers/books/131... |
| 1 | Arts, Film & Photography | https://www.amazon.in/gp/bestsellers/books/131... |
| 2 | Biographies, Diaries & True Accounts | https://www.amazon.in/gp/bestsellers/books/131... |
| 3 | Business & Economics | https://www.amazon.in/gp/bestsellers/books/131... |
| 4 | Children's & Young Adult | https://www.amazon.in/gp/bestsellers/books/131... |
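
The category URLs are built by joining the base URL and each link's `href` with `+`. A slightly more defensive option, sketched here as an assumption rather than what the notebook does, is `urllib.parse.urljoin`, which also copes with an `href` that is already absolute. The `absolute_url` helper is a hypothetical name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.in"


def absolute_url(href, base_url=BASE_URL):
    """Resolve a relative or absolute href against the site's base URL."""
    return urljoin(base_url, href)


# A relative href, as found on the category list
relative = absolute_url("/gp/bestsellers/books/1318158031")
# An href that is already absolute is returned unchanged
already_absolute = absolute_url("https://www.amazon.in/gp/bestsellers/books/1318158031")
print(relative)
print(already_absolute)
```

With plain concatenation, an absolute `href` would end up with the base URL prepended twice.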


###  📗 Grabbing ▶ URLs for the Book Details Pages

In [13]:
categories_df = get_categories_df("https://www.amazon.in")

In [14]:
category_page_url = "https://www.amazon.in/gp/bestsellers/books/1318158031"
category_page_content = get_page_content(category_page_url)
category_page_content[:1000]

'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n\n<script type=\'text/javascript\'>var ue_t0=ue_t0||+new Date();</script>\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completion.amazon.com">\n<!-- sp:end-feature:cs-optimization -->\n<script type=\'text/javascript\'>\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\n\nvar ue_csm = window,\n    ue_hob = +new Date();\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.sl

In [15]:
categorized_books_doc = BeautifulSoup(category_page_content , 'html.parser')

In [16]:
def get_book_page_url(category_url):
  '''
    -Obtain the links for the book details pages  
    :param category_url | String, The url of a specific book category page
    :return list of links to the book details pages of that category
  '''
  category_page_content = get_page_content(category_url)
  categorized_books_doc = BeautifulSoup(category_page_content , 'html.parser')
  
  book_card_class = 'zg-grid-general-faceout'
  book_card_divs = categorized_books_doc.find_all('div' ,{'class':book_card_class } )
 
  books_links = []
  base_url = "https://www.amazon.in"
  for book_card in book_card_divs:
    a_tag = book_card.find('a')
    # Skip cards without a usable link instead of failing on a missing tag
    if a_tag and a_tag.get('href'):
      books_links.append(base_url + a_tag['href'])

  return books_links

In [17]:
links = get_book_page_url('https://www.amazon.in/gp/bestsellers/books/1318064031/ref=zg_bs_nav_books_1')
links[:5]

['https://www.amazon.in/My-Journey-Transforming-Dreams-Actions-ebook/dp/B016APQJB2/ref=zg_bs_1318064031_1/259-6009193-9790211?pd_rd_i=B016APQJB2&psc=1',
 'https://www.amazon.in/Think-Grow-Rich-Napoleon-Hill/dp/9353338158/ref=zg_bs_1318064031_2/259-6009193-9790211?pd_rd_i=9353338158&psc=1',
 'https://www.amazon.in/Wish-Could-Tell-Pre-order-signed/dp/9390441617/ref=zg_bs_1318064031_3/259-6009193-9790211?pd_rd_i=9390441617&psc=1',
 'https://www.amazon.in/Theory-Everything-Stephen-Hawking/dp/8179925919/ref=zg_bs_1318064031_4/259-6009193-9790211?pd_rd_i=8179925919&psc=1',
 'https://www.amazon.in/Three-Thousand-Stitches-Ordinary-Extraordinary/dp/0143440055/ref=zg_bs_1318064031_5/259-6009193-9790211?pd_rd_i=0143440055&psc=1']
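
Each link above carries ranking and session information after the product part. As a hedged sketch, assuming the 10-character segment after `/dp/` identifies the book (which is how the sample links look, not something this notebook verifies), the links could be keyed by that segment to spot duplicates. `product_id` and `unique_by_product` are hypothetical helpers.

```python
import re

# Assumption: the product is identified by the 10-character segment after /dp/
_PRODUCT_ID = re.compile(r"/dp/([A-Z0-9]{10})(?:[/?]|$)")


def product_id(url):
    """Return the segment after /dp/, or None if the URL has no such part."""
    match = _PRODUCT_ID.search(url)
    return match.group(1) if match else None


def unique_by_product(links):
    """Keep the first link seen for each product id, preserving order."""
    seen = set()
    unique = []
    for link in links:
        key = product_id(link) or link  # fall back to the full URL
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique


first = "https://www.amazon.in/My-Journey-Transforming-Dreams-Actions-ebook/dp/B016APQJB2/ref=zg_bs_1318064031_1/259-6009193-9790211?pd_rd_i=B016APQJB2&psc=1"
second = "https://www.amazon.in/Think-Grow-Rich-Napoleon-Hill/dp/9353338158/ref=zg_bs_1318064031_2/259-6009193-9790211?pd_rd_i=9353338158&psc=1"
# Made-up variant of the second link, used only to show the de-duplication
second_again = "https://www.amazon.in/dp/9353338158?psc=1"

deduplicated = unique_by_product([first, second, second_again])
print(product_id(first), product_id(second), len(deduplicated))
```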

### 📗 Scraping ▶ The details of a particular Book

In [19]:
book_page_url = links[16]
book_page = get_page_content(book_page_url)
book_doc = BeautifulSoup(book_page , 'html.parser')

In [20]:
def get_book_name(book_doc):
  span_class = 'a-size-extra-large'
  span_tag = book_doc.find('span', {'class': span_class})  
  if(span_tag):
    return span_tag.text.strip()
  else:
    # No title element found
    return ""

In [21]:
book_name = get_book_name(book_doc)
book_name

'Learning How to Fly: Life Lessons for the Youth'

In [22]:
def get_book_author(book_doc):
  author_a_tag_class = 'a-link-normal contributorNameID'
  a_tags_authors =  book_doc.find('a', {'class': author_a_tag_class}) 
  if(a_tags_authors):
    return a_tags_authors.text.strip()
  else:
    return ""

In [23]:
#Author
author = get_book_author(book_doc)
author

'A.P.J. Abdul  Kalam'

In [24]:
def get_book_price(book_doc):
  price_a_tag_class = 'a-size-mini a-link-normal'
  a_tags_price = book_doc.find('a', {'class': price_a_tag_class})
  if(a_tags_price):
    price_text = a_tags_price.text.strip()
    if(price_text):
      return price_text.split(' ')[0].strip()
  # No price element, or its text is empty
  return '0'

In [25]:
#Price 
price = get_book_price(book_doc)
price

'₹149.47'

In [26]:
def get_numof_ratings(book_doc):
  ratings_span_tag_class = 'a-size-base'
  ratings_span_tag_id = 'acrCustomerReviewText'
  span_tags_ratings = book_doc.find_all('span', {'id': ratings_span_tag_id , 'class': ratings_span_tag_class}) 
  if(span_tags_ratings):
    ratings_text = span_tags_ratings[0].text.strip()
    if(ratings_text):
      return ratings_text.split(' ')[0].strip()
  # No ratings element, or its text is empty; keep the return type a string
  return '0'

In [27]:
#Ratings
ratings = get_numof_ratings(book_doc)
ratings

'2,762'

In [28]:
def get_numof_stars(book_doc):
  stars_span_tag_class = 'a-icon-alt'
  span_tags_stars = book_doc.find('span', {'class': stars_span_tag_class})
  if(span_tags_stars):
    stars_text = span_tags_stars.text.strip()
    if(stars_text!=""):
      stars = stars_text.split(' ')[0].strip()
      return stars
  # No stars element, or its text is empty
  return '0'

In [29]:
stars = get_numof_stars(book_doc)
stars

'4.7'
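
The price, ratings and stars helpers all return strings such as `'₹149.47'`, `'2,762'` and `'4.7'`. If numeric columns are wanted later (for sorting or averaging, say), a small conversion step could look like the sketch below. `to_number` is a hypothetical helper, not part of the notebook.

```python
import re

# First run of digits, allowing thousands separators and a decimal part
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def to_number(text):
    """Extract the first number from a scraped string, or None if absent."""
    match = _NUMBER.search(text or "")
    if not match:
        return None
    return float(match.group().replace(",", ""))


# Strings taken from the cell outputs above
price = to_number("₹149.47")
ratings = int(to_number("2,762"))
stars = to_number("4.7")
print(price, ratings, stars)
```

Returning `None` for a missing value keeps "not found" distinct from a genuine zero.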

In [30]:
#Number Of Pages 
def get_book_pages(book_doc):
  pages_div_tag_class = 'a-section a-spacing-small a-text-center rpi-attribute-label'
  div_tags = book_doc.find_all('div', {"class": pages_div_tag_class})

  if(len(div_tags) > 0):
    for tag in div_tags:
      tag_name = tag.text
      if(tag_name.strip() == 'Print length'):
        span_tag = tag.parent.find_all('span')
        pages = span_tag[-1].text.split(' ')[0].strip() 
        return pages
  
  return "0"

In [31]:
pages = get_book_pages(book_doc)
pages

'117'

In [32]:
#Language
def get_book_language(book_doc):
  pages_div_tag_class = 'a-section a-spacing-small a-text-center rpi-attribute-label'
  div_tags = book_doc.find_all('div', {"class": pages_div_tag_class})

  if(len(div_tags) > 0):
    for tag in div_tags:
      tag_name = tag.text
      if(tag_name.strip() == 'Language'):
        span_tag = tag.parent.find_all('span')
        language = span_tag[-1].text.split(' ')[0].strip() 
        return language
  
  return ""

In [33]:
lang  = get_book_language(book_doc)
lang

'English'

## 🥗 ---------- Combining All-Together - Saving Book details to csv for each category ---------- 💚

In [2]:
def get_book_details_df(book_url_list , path):
  
  book_details_dict = {
      "name": [],
      "author": [],
      "language":[],
      "price": [],
      "ratings": [],
      "stars": [],
      "pages" : []
  }
  print("---> Scraping Books Details ||")

  for book_page_url in book_url_list:
    book_page = get_page_content(book_page_url)
    doc = BeautifulSoup(book_page , 'html.parser')
    print("#" , end="")

    book_details_dict['name'].append(get_book_name(doc))
    book_details_dict['author'].append(get_book_author(doc))
    book_details_dict['price'].append(get_book_price(doc))
    book_details_dict['ratings'].append(get_numof_ratings(doc))
    book_details_dict['stars'].append(get_numof_stars(doc))
    book_details_dict['pages'].append(get_book_pages(doc))
    book_details_dict['language'].append(get_book_language(doc))

  print("")
  cat_book_df =  pd.DataFrame(book_details_dict)
  cat_book_df.to_csv(path, index=False)
  print("------------- File Saved --------------------------")


💻 The **get_book_details_df** function accepts a list of URLs of the book pages belonging to a specific category and the path where the CSV file should be saved. Internally it calls the helper functions defined earlier, such as get_book_name and get_numof_ratings, to grab the details of each book.
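
The same "one helper per column" idea can be written as a mapping from column name to extractor function. This is only a sketch of an alternative layout: `build_rows` is a hypothetical name, and plain dictionaries stand in for the parsed pages so that it runs without BeautifulSoup.

```python
def build_rows(docs, extractors):
    """Apply every extractor to every document; return one dict per document."""
    rows = []
    for doc in docs:
        row = {}
        for column, extract in extractors.items():
            try:
                row[column] = extract(doc)
            except (KeyError, AttributeError, IndexError):
                # A missing element leaves the cell empty instead of
                # aborting the whole category
                row[column] = ""
        rows.append(row)
    return rows


# Plain dicts stand in for parsed pages; the second one lacks a language
fake_docs = [
    {"title": " Book A ", "lang": "English"},
    {"title": "Book B"},
]
fake_extractors = {
    "name": lambda doc: doc["title"].strip(),
    "language": lambda doc: doc["lang"],
}

book_rows = build_rows(fake_docs, fake_extractors)
print(book_rows)
```

In the notebook's setting the mapping would hold the real helpers, for example `{"name": get_book_name, "author": get_book_author}`, and the resulting list of dictionaries can be passed straight to `pd.DataFrame`.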

In [37]:
def scrape_categorized_books():
    print('Scraping List of Book Categories')
    categories_df = get_categories_df("https://www.amazon.in")
    # Create the output folder once, before looping over the categories
    os.makedirs('data', exist_ok=True)

    for index, caturl in categories_df.iterrows():
      print("Obtaining Books Details for Category {}".format(caturl["category"]))
      book_links = get_book_page_url(caturl['url'])

      get_book_details_df(book_links, 'data/{}.csv'.format(caturl['category']))
      # Pause between categories to avoid overloading the server
      time.sleep(3)

💻 The **scrape_categorized_books** function first obtains the list of URLs for each book category and then scrapes the details of the books in each category using the get_book_details_df function.
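
Category names such as "Arts, Film & Photography" are used directly as file names. That works on many systems, but a name containing a character like `/` would not. A hedged sketch of a safer file-name builder follows; `category_csv_path` is a hypothetical helper, and the lowercase, underscore-separated naming is just one possible convention.

```python
import re
from pathlib import Path


def category_csv_path(category, folder="data"):
    """Build a file-system-friendly CSV path from a category name."""
    # Replace every run of non-alphanumeric characters with one underscore
    slug = re.sub(r"[^A-Za-z0-9]+", "_", category).strip("_").lower()
    return Path(folder) / f"{slug or 'category'}.csv"


path = category_csv_path("Arts, Film & Photography")
print(path.as_posix())
```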

In [36]:
scrape_categorized_books()

Scraping List of Book Categories
Obtaining Books Details for Category Action & Adventure
---> Scraping Books Details ||
##############################
------------- File Saved --------------------------


📓 The above output shows a sample of scraping the book details for one category.

## 🥗 ---------- References and Future Works ---------- 💚

📗 **Summary :**

* ▶ Scrape the Book Categories listed in https://www.amazon.in/gp/bestsellers/books/
* ▶ Grab the Details of the Books for each category.
* ▶ Save the data into csv files 

📗 **References :**
* ▶ Video Tutorial - https://www.youtube.com/watch?v=RKsLLG-bzEY&t=8236s

📗 **Future Works :**
* ▶ Visualize the collected data in a dashboard
* ▶ Automate the scraping by creating a data pipeline

