# Web Scraping
## Using requests, BeautifulSoup and Pandas

![](https://i.imgur.com/6zM7JBq.png)

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It’s a useful technique for creating datasets for research and learning. There are many cases where scraping lets us build a dataset and automate its collection, e.g. a laptop price scraper covering several websites, a top-rated movies scraper, or a mutual fund NAV scraper. The resulting data can then be used for EDA and machine learning projects.

We are going to scrape the list of best-selling books, along with details about each book, for every category on Amazon’s Best Sellers page. Amazon is a multinational technology company that focuses on e-commerce, cloud computing, digital streaming and AI.

We will retrieve the information from the https://www.amazon.in/gp/bestsellers/books/ page.

### The steps we’ll follow:
* We’re going to scrape https://www.amazon.in/gp/bestsellers/books/
* We’ll get a list of topics (book categories).
* For each topic, we’ll get the topic title and the topic page URL.
* For each topic, we’ll get the top 50 books from the topic page.
* For each book, we’ll grab the book name, book URL, author name, book price, star rating and the number of customer ratings (stored as rating).
* Save the information to a CSV file using the Pandas library.
* By the end of the project we will be able to create a CSV file with the following columns:
#### title, url, book_name, books_url, author_name, book_price, star_rating, rating

### Install and Import important libraries

In [82]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Downloading a web page using requests

When you access a URL using a web browser, it downloads the contents of the web page the URL points to and displays the output on the screen. Before we can extract information from a web page, we need to download the page using Python.

We’ll use a library called requests to download web pages from the internet. We can download a web page using the requests.get function.

In [83]:
topics_url = 'https://www.amazon.in/gp/bestsellers/books/'

In [84]:
response = requests.get(topics_url)

The requests.get function returns a response object containing the page contents and a status code that indicates whether the request was successful. The code is available as response.status_code. If the status code lies between 200 and 299, the request was successful; otherwise it was not.

In [85]:
response.status_code

200
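A successful response has a status code in the 2xx range. The check can be sketched without touching the network; the helper name `is_success` is an assumption made for illustration and is not part of requests. The library itself offers `response.ok` and `response.raise_for_status()` for related checks.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True for any 2xx HTTP status code (hypothetical helper)."""
    return 200 <= status_code <= 299


# Try the helper on a few well-known status codes from the standard library
for code in (HTTPStatus.OK, HTTPStatus.NOT_FOUND, HTTPStatus.SERVICE_UNAVAILABLE):
    print(int(code), code.phrase, is_success(code))
```

With a real response, the same check would read `is_success(response.status_code)`.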

The contents of the web page can be accessed using the .text property of the response.

In [86]:
page_contents = response.text
len(page_contents)    # `len` returns the number of characters in the page contents

327511

### Inspect HTML of the web page
We can view the source code of the web page by right-clicking anywhere on the page and selecting the ‘Inspect’ option. This opens the “Developer Tools” pane, where we can see the source code as a tree. We can expand and collapse various nodes and find the source code for a specific portion of the page.

Here’s what our web page looks like:

![](https://imgur.com/tXEOKKj.png)

As shown above, each topic title is inside a “div” tag with the class “_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8”.

### Extracting information by parsing HTML source code using BeautifulSoup library
To extract information from the HTML source code, we will use the Beautiful Soup library. Beautiful Soup will return an object containing several properties and methods to extract the information from HTML documents.

In [87]:
doc = BeautifulSoup(page_contents, 'html.parser') 

In [88]:
type(doc)

bs4.BeautifulSoup

Let’s create a helper function that downloads a page and parses it with Beautiful Soup.

In [118]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    #Check successful Response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

In [119]:
doc = get_topic_page('https://www.amazon.in/gp/bestsellers/books')
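A slightly more defensive variant of this helper is sketched below. It is not the function used in the rest of this walkthrough: the names `fetch_page`, `DEFAULT_HEADERS` and `timeout_seconds` are hypothetical, and the User-Agent value is only a placeholder. The sketch assumes that some sites answer differently when no browser-like header is sent, and that a timeout is preferable to waiting indefinitely.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical header value; any descriptive User-Agent string could be used.
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; book-scraper-tutorial)'}


def fetch_page(url, timeout_seconds=10):
    """Download a page and parse it, failing loudly on HTTP errors."""
    response = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout_seconds)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

It would be called the same way as `get_topic_page`, for example `fetch_page(topics_url)`.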

Let’s find the topic titles, which are inside div tags with the class set to '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'

In [120]:
sel_class = '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
topic_title_tags = doc.find_all('div',class_=sel_class) 

In [164]:
len(topic_title_tags)  # number of div tags that hold a topic title

36
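When `class_` receives a string with several class names, Beautiful Soup compares it with the exact value of the class attribute, so the order of the names matters. The sketch below illustrates this on a made-up HTML snippet (the class names `nav-item` and `nav-large` are placeholders, not Amazon's) and shows a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the class names are placeholders for illustration only
html = """
<div class="nav-item nav-large"><a href="/gp/bestsellers/books/1">Action &amp; Adventure</a></div>
<div class="nav-large nav-item"><a href="/gp/bestsellers/books/2">Arts</a></div>
<div class="other"><a href="/x">Ignore me</a></div>
"""
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact match on the attribute value: only the first div qualifies
exact = sample_doc.find_all('div', class_='nav-item nav-large')

# CSS selector: matches any div carrying both classes, in any order
by_css = sample_doc.select('div.nav-item.nav-large')

print(len(exact), len(by_css))
print([tag.find('a').text for tag in by_css])
```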

The list above contains 36 book categories. We will parse each category page and get its top 50 best-selling books.

Let’s create helper functions to extract the topic title and the topic URL.
#### Topic Title

In [93]:
def get_topic_titles(topic_title_tags): # this function is created to get the topic title
    topic = topic_title_tags.find('a').text
    return topic

In [98]:
a = get_topic_titles(topic_title_tags[1])

In [99]:
a

'Action & Adventure'

#### Topic URL

In [100]:
def get_topic_urls(topic_title_tags): # this function is created to get the topic title url
    base_url ='https://www.amazon.in/'
    table_tag_href = base_url + topic_title_tags.find('a')['href']
    return table_tag_href

In [101]:
b = get_topic_urls(topic_title_tags[1])

In [102]:
b

'https://www.amazon.in//gp/bestsellers/books/1318158031/ref=zg_bs_nav_books_1'
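The URL above contains a double slash after the domain because the base URL ends with `/` and the `href` value starts with `/`. One way to avoid this, shown here as a sketch rather than as the approach used in the rest of this walkthrough, is `urljoin` from the standard library.

```python
from urllib.parse import urljoin

base_url = 'https://www.amazon.in/'
href = '/gp/bestsellers/books/1318158031/ref=zg_bs_nav_books_1'

# urljoin resolves the path against the base instead of concatenating strings
joined = urljoin(base_url, href)
print(joined)  # https://www.amazon.in/gp/bestsellers/books/1318158031/ref=zg_bs_nav_books_1
```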

Now we have the URLs for all 36 categories. Let’s parse each URL and get the book name, book URL, author name, price and rating of each book.

Let’s find the book name, which is nested in a span tag and a div tag inside the div tag with class zg-grid-general-faceout, using the helper function get_books_name.

#### Book Name

In [103]:
def get_books_name(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    books_name = []
    for i in range(len( books_tag)):
        try:
            # The book name sits in the first div inside the first span of the faceout block
            name_tag = books_tag[i].find('div',class_='zg-grid-general-faceout').find('span').find('div').text
            books_name.append(name_tag)
        except AttributeError:
            books_name.append(None)
    return books_name

In [104]:
books_name = get_books_name(get_topic_page(b)) # b contains the URL for the first topic, "Action & Adventure"

In [105]:
books_name[0:5] # Printing the first 5 book names from Action & Adventure

["Harry Potter and the Philosopher's Stone",
 'THE SILENT PATIENT [Paperback] Michaelides, Alex',
 'THE LION INSIDE',
 'The Housemaid : An addictive psychological thriller with mind-bending twists',
 'Samsara: Enter the Valley of the Gods ("India\'s answer to Harry Potter") | Mythological fiction novel']

#### Book URL

In [106]:
def get_books_url(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    base_url = 'https://www.amazon.in/'
    books_url = []
    for i in range(len( books_tag)):
        url = base_url + books_tag[i].find('a')['href']
        books_url.append(url)
    return books_url

In [107]:
books_url = get_books_url(get_topic_page(b))

In [108]:
books_url[:5] # printing first 5 URLs

['https://www.amazon.in//Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_g_1318158031_d_sccl_1/257-2841279-6502826?psc=1',
 'https://www.amazon.in//Silent-Patient-Alex-Michaelides/dp/1409181634/ref=zg_bs_g_1318158031_d_sccl_2/257-2841279-6502826?psc=1',
 'https://www.amazon.in//LION-INSIDE-Rachel-Bright/dp/1408349043/ref=zg_bs_g_1318158031_d_sccl_3/257-2841279-6502826?psc=1',
 'https://www.amazon.in//Housemaid-addictive-psychological-thriller-mind-bending/dp/014346115X/ref=zg_bs_g_1318158031_d_sccl_4/257-2841279-6502826?psc=1',
 'https://www.amazon.in//Samsara-Valley-Indias-answer-Potter/dp/0143458280/ref=zg_bs_g_1318158031_d_sccl_5/257-2841279-6502826?psc=1']

#### Author Name

In [109]:
def get_author_name(topic_doc):
    books_tag = topic_doc.find_all('div', class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    author_name = []
    
    for i in range(len(books_tag)):
        try:
            author_tag = books_tag[i].find('div', class_='a-row a-size-small').text
            author_name.append(author_tag)
        except AttributeError:
            author_name.append(None)
        except Exception as e:
            author_name.append(None)
            print(f"An error occurred: {e}")
            
    return author_name

In [110]:
author_name = get_author_name(get_topic_page(b))

In [111]:
author_name[:5]

['J.K. Rowling',
 'Alex Michaelides',
 'Rachel Bright',
 'Freida McFadden',
 'Saksham Garg']

#### Book Price

In [186]:
def get_book_price(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    book_price = []
    for i in range(len( books_tag)):
        try:
            price_tag =books_tag[i].find('span',class_='p13n-sc-price').text
            book_price.append(price_tag)
        except AttributeError:
            book_price.append(None)
          
    return book_price

In [187]:
book_price = get_book_price(get_topic_page(b))

In [188]:
book_price[0:5]

['₹313.95', '₹250.88', '₹254.00', '₹310.00', '₹165.00']
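The prices are returned as strings with a currency symbol. If numeric values are needed later, for example for EDA, a small conversion helper could look like the sketch below. `parse_price` is a hypothetical name, and the sketch assumes that prices always use `₹` and may contain commas as thousands separators.

```python
def parse_price(price_text):
    """Convert a price string such as '₹313.95' to a float; pass None through."""
    if price_text is None:
        return None
    # Drop the currency symbol and any thousands separators before converting
    cleaned = price_text.replace('₹', '').replace(',', '').strip()
    return float(cleaned)


raw_prices = ['₹313.95', '₹250.88', None]
numeric_prices = [parse_price(p) for p in raw_prices]
print(numeric_prices)  # [313.95, 250.88, None]
```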

#### Star Rating

In [189]:
def get_star_rating(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    star_rating = []
    for i in range(len( books_tag)):
        try:
            star_tag = books_tag[i].find('div',class_='a-icon-row').text[0:3]
            star_rating.append(star_tag)
        except AttributeError:
            star_rating.append(None)
    return star_rating

In [190]:
star_rating = get_star_rating(get_topic_page(b))

In [191]:
star_rating[26:37]

['4.7', '4.6', '3.5', None, '4.5', '4.6', '4.8', '4.8', '4.6', '4.6', '4.6']

#### Rating

In [192]:
def get_rating(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    rating = []
    for i in range(len( books_tag)):
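        try:
            # Calling a tag like a function is shorthand for find_all()
            rating_tag= books_tag[i].find('div',class_='a-icon-row')('span')[1].text
            rating.append(rating_tag)
        except (TypeError, IndexError):
            # TypeError: the rating row is missing; IndexError: it has fewer than two spans
            rating.append(None)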
    return rating

In [193]:
rating = get_rating(get_topic_page(b))

In [194]:
rating[:5]

['68,429', '325,888', '8,486', '365,191', '1,665']
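Each helper above searches the page separately for the same book cards. An alternative, sketched here on a made-up HTML fragment, is to read all fields from one card in a single pass. The names `parse_book_card` and `text_or_none` and the `card` class are hypothetical; the inner class names are the ones used in the helpers above, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates two book cards; the second has no price or rating
SAMPLE_HTML = """
<div class="card">
  <div class="zg-grid-general-faceout">
    <a href="/Some-Book/dp/0000000000"><span><div>Some Book</div></span></a>
    <div class="a-row a-size-small">Some Author</div>
    <div class="a-icon-row"><span>4.5 out of 5 stars</span><span>1,234</span></div>
    <span class="p13n-sc-price">₹199.00</span>
  </div>
</div>
<div class="card">
  <div class="zg-grid-general-faceout">
    <a href="/Other-Book/dp/1111111111"><span><div>Other Book</div></span></a>
  </div>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_book_card(card):
    """Collect every field for one book card, using None for missing parts."""
    faceout = card.find('div', class_='zg-grid-general-faceout')
    name_span = faceout.find('span') if faceout is not None else None
    name_div = name_span.find('div') if name_span is not None else None
    link = card.find('a')
    icon_row = card.find('div', class_='a-icon-row')
    spans = icon_row.find_all('span') if icon_row is not None else []
    return {
        'book_name': text_or_none(name_div),
        'book_url': link.get('href') if link is not None else None,
        'author_name': text_or_none(card.find('div', class_='a-row a-size-small')),
        'book_price': text_or_none(card.find('span', class_='p13n-sc-price')),
        'star_rating': spans[0].get_text(strip=True)[:3] if spans else None,
        'rating': text_or_none(spans[1]) if len(spans) > 1 else None,
    }


sample_doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')
books = [parse_book_card(card) for card in sample_doc.find_all('div', class_='card')]
for book in books:
    print(book)
```

Because every card yields one dictionary, the fields of a book stay together even when some of them are missing.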

### Let’s create a function to put them together

In [201]:
def scrape_topic_list(main_url):
    main_dict = {
        'title': [],
        'url': [],
        'book_name': [],
        'books_url': [],
        'author_name': [],
        'book_price': [],
        'star_rating': [],
        'rating': []
    }
    
    doc = get_topic_page(main_url)     
    sel_class = '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
    topic_title_tags = doc.find_all('div', class_=sel_class)
    
    for i in topic_title_tags[1:35]:
        title = get_topic_titles(i)
        url = get_topic_urls(i)
        print(f"Fetching details from URL: {url}")  # Debugging URL
        try:
            topic_doc = get_topic_page(url)
            books = get_books_name(topic_doc) or [None]
            authors = get_author_name(topic_doc) or [None]
            prices = get_book_price(topic_doc) or [None]
            star_ratings = get_star_rating(topic_doc) or [None]
            ratings = get_rating(topic_doc) or [None]
            books_urls = get_books_url(topic_doc) or [None]
            
            # Repeat the topic title and URL for the number of books in the topic
            main_dict['title'].extend([title] * len(books))
            main_dict['url'].extend([url] * len(books))
            
            main_dict['book_name'].extend(books)
            main_dict['author_name'].extend(authors)
            main_dict['book_price'].extend(prices)
            main_dict['star_rating'].extend(star_ratings)
            main_dict['rating'].extend(ratings)
            main_dict['books_url'].extend(books_urls)
        except Exception as e:
            print(f"Error fetching URL {url}: {e}")
            continue  # Skip to the next iteration

    # Debugging: Check if all lists have the same length
    for key, value in main_dict.items():
        print(f"Length of {key}: {len(value)}")

    # Creating the DataFrame
    df = pd.DataFrame(main_dict)
    return df
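The length printout matters because pd.DataFrame raises a ValueError when the lists in the dictionary have different lengths. A small check that reports only the offending columns could be sketched as follows; `find_length_mismatches` is a hypothetical helper and the sample data is made up.

```python
def find_length_mismatches(columns):
    """Return {column: length} for columns whose length differs from the first column."""
    lengths = {name: len(values) for name, values in columns.items()}
    if not lengths:
        return {}
    expected = next(iter(lengths.values()))
    return {name: n for name, n in lengths.items() if n != expected}


# Made-up data: 'rating' is one entry short
sample_columns = {
    'title': ['A', 'A'],
    'book_name': ['x', 'y'],
    'rating': ['1'],
}
mismatches = find_length_mismatches(sample_columns)
print(mismatches)  # {'rating': 1}
```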


In [202]:
scrape_df= scrape_topic_list('https://www.amazon.in/gp/bestsellers/books/')

Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/1318158031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/1318052031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/1318064031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/1318068031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/64619755031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/1318104031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/1318105031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/1318118031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/1318161031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/22960344031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/4149751031
Fetching details from URL: https://www.amazon.in//gp/bestsellers/books/1402038031
Fetching detai

In [203]:
scrape_df

| | title | url | book_name | books_url | author_name | book_price | star_rating | rating |
|---|---|---|---|---|---|---|---|---|
| 0 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... | Harry Potter and the Philosopher's Stone | https://www.amazon.in//Harry-Potter-Philosophe... | J.K. Rowling | ₹313.95 | 4.7 | 68429 |
| 1 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... | THE SILENT PATIENT [Paperback] Michaelides, Alex | https://www.amazon.in//Silent-Patient-Alex-Mic... | Alex Michaelides | ₹250.88 | 4.5 | 325888 |
| 2 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... | THE LION INSIDE | https://www.amazon.in//LION-INSIDE-Rachel-Brig... | Rachel Bright | ₹254.00 | 4.7 | 8486 |
| 3 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... | The Housemaid : An addictive psychological thr... | https://www.amazon.in//Housemaid-addictive-psy... | Freida McFadden | ₹310.00 | 4.4 | 365191 |
| 4 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... | Samsara: Enter the Valley of the Gods ("India'... | https://www.amazon.in//Samsara-Valley-Indias-a... | Saksham Garg | ₹165.00 | 4.3 | 1665 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1695 | Textbooks & Study Guides | https://www.amazon.in//gp/bestsellers/books/15... | Cursive Writing Books (Set of 5 Books) (Handwr... | https://www.amazon.in//Cursive-Writing-Books-S... | Maple Press | ₹348.00 | 4.4 | 1703 |
| 1696 | Textbooks & Study Guides | https://www.amazon.in//gp/bestsellers/books/15... | VCP Early Learning Educational Chart Set for K... | https://www.amazon.in//Learning-Educational-LA... | Vidya Chitr Prakashan | ₹199.00 | 4.2 | 258 |
| 1697 | Textbooks & Study Guides | https://www.amazon.in//gp/bestsellers/books/15... | World's Greatest Leaders: Biographies of Inspi... | https://www.amazon.in//Worlds-Greatest-Leaders... | Wonder House Books | ₹109.00 | 4.3 | 3379 |
| 1698 | Textbooks & Study Guides | https://www.amazon.in//gp/bestsellers/books/15... | Princess Colouring Book (Giant Book Series): J... | https://www.amazon.in//Princess-Colouring-Book... | Wonder House Books | ₹149.00 | 4.5 | 2480 |


We have successfully extracted the information from all the topic pages into a Pandas DataFrame built from a dictionary of lists. To make the data easy to reuse, we will save it to a CSV file.

### Save the extracted information into CSV file

In [204]:
scrape_df.to_csv('top_rated_books.csv', index=False)     # Save the final DataFrame 'scrape_df' to a CSV file without the index column
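Leaving the index out of the file keeps the CSV limited to the scraped columns. The sketch below uses a small made-up DataFrame and in-memory text instead of a file to show what comes back when the CSV is read again, with and without the index.

```python
import io

import pandas as pd

# Made-up rows that imitate the scraped columns
sample = {
    'title': ['Action & Adventure', 'Action & Adventure'],
    'book_name': ['Book A', 'Book B'],
    'rating': ['68,429', None],
}
sample_df = pd.DataFrame(sample)

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_without_index = sample_df.to_csv(index=False)
csv_with_index = sample_df.to_csv()

reloaded = pd.read_csv(io.StringIO(csv_without_index))
reloaded_with_index = pd.read_csv(io.StringIO(csv_with_index))

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

When the index is written, reading the file back adds an extra first column named `Unnamed: 0`.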

### Summary
* Install and import libraries
* Download and parse the Best Sellers page source using requests and BeautifulSoup to get the topic (category) titles and URLs
* Extract information from each topic page
* Create a Pandas DataFrame using a function
* Save the information to a CSV file using the Pandas library