We are targeting the Amazon Best Sellers page in the Teaching & Education category. Amazon’s pagination allows us to navigate through multiple pages of results. The base URL for the first page looks like this:

'''https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_1?ie=UTF8&pg=1'''

Notice that the page number appears twice in the URL: as the suffix of the “zg_bs_pg_1” path segment and as the value of the “pg” query parameter. We will increment both values to navigate through the pages.
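As a minimal sketch of that idea, the page number can be substituted into both places with a format string. The helper name `build_page_url` and the named placeholder `{page}` are illustrative choices, not part of the original notebook.

```python
# URL template with the page number in both the path segment and the query string
URL_TEMPLATE = (
    "https://www.amazon.in/gp/bestsellers/books/4149461031/"
    "ref=zg_bs_pg_{page}?ie=UTF8&pg={page}"
)


def build_page_url(page):
    """Return the best sellers URL for a 1-based page number (hypothetical helper)."""
    return URL_TEMPLATE.format(page=page)


# print the URLs for the first three pages
for page in range(1, 4):
    print(build_page_url(page))
```

Using a single named placeholder keeps the two occurrences of the page number in sync, which is easy to get wrong with two positional `{}` fields.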

# Step 1: Set Up the HTTP Request

To scrape content from Amazon, we first send a request to the server and retrieve the HTML of the page. A request without browser-like headers is more likely to be blocked, so we include a User-Agent header that mimics a real browser. Here’s how to set up the HTTP request:

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# base url of the best sellers page for teaching & education books
base_url = "https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_{}?ie=UTF8&pg={}"

# http headers to mimic a browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
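Even with a User-Agent header, a request can still come back with a non-200 status. The sketch below shows one possible way to check the status code and retry a few times. The helper `fetch_html` and its `retries`, `delay`, and `timeout` parameters are assumptions made for illustration; `session` is expected to be any object with a requests-style `.get()` method, such as `requests.Session()`.

```python
import time


def fetch_html(session, url, headers, retries=3, delay=1.0, timeout=10):
    """Return the page HTML, or None if every attempt fails (hypothetical helper).

    `session` is any object with a requests-style .get() method,
    for example requests.Session().
    """
    for attempt in range(retries):
        response = session.get(url, headers=headers, timeout=timeout)
        if response.status_code == 200:
            return response.text
        # non-200 response (for example a block or rate-limit page):
        # wait a little longer after each failed attempt, then retry
        time.sleep(delay * (attempt + 1))
    return None
```

With the names defined above, a call might look like `html = fetch_html(requests.Session(), base_url.format(1, 1), headers)`; a `None` result signals that the page could not be retrieved and should be skipped.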

# Step 2: Iterate Over Pages to Collect Data

Next, we loop through the first three pages to collect data for the top 50 books. This assumes each page displays around 20 items, so three pages yield about 60 entries, enough to cover 50. On each page, we extract the author’s name and rating:

In [28]:
# response = requests.get(url, headers=headers)
# response.content[:1000]
!pip install lxml


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting lxml
  Downloading lxml-5.3.0-cp310-cp310-manylinux_2_28_x86_64.whl (5.0 MB)
Installing collected packages: lxml
Successfully installed lxml-5.3.0


In [30]:
# initialize a list to store book data
book_list = []

# iterate over the first 3 pages to get top 50 books (assuming each page has about 20 items)
for page in range(1, 4):
    # construct the URL for the current page
    url = base_url.format(page, page)
    
    # send a GET request to the url
    response = requests.get(url, headers=headers)
    
    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    
    # find all the book elements
    books = soup.find_all("div", {"class": "zg-grid-general-faceout"})
    
    # iterate over each book element to extract data
    for book in books:
        if len(book_list) >= 50:  # stop once we've collected 50 books
            break

        # look up each element once; fall back to "N/A" when it is missing
        author_tag = book.find("a", class_="a-size-small a-link-child")
        rating_tag = book.find("span", class_="a-icon-alt")

        # append the extracted data to the book_list
        book_list.append({
            "Author": author_tag.get_text(strip=True) if author_tag else "N/A",
            "Rating": rating_tag.get_text(strip=True) if rating_tag else "N/A"
        })
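
One detail of the loop above is that `break` only leaves the inner loop, so any remaining pages are still requested after the 50th book has been stored. A minimal sketch of one way to stop both loops at once is shown below. It assumes a hypothetical callable `get_books_on_page` that returns the parsed items of a single page; here it is replaced by a stand-in that produces placeholder rows.

```python
def collect_books(get_books_on_page, pages, limit=50):
    """Collect items page by page and stop as soon as `limit` is reached."""
    collected = []
    for page in pages:
        for item in get_books_on_page(page):
            collected.append(item)
            if len(collected) >= limit:
                # returning here ends both loops, so no further page is requested
                return collected
    return collected


# stand-in for real scraping: every "page" yields 20 placeholder items
def fake_page(page):
    return [{"Author": "N/A", "Rating": "N/A"} for _ in range(20)]


print(len(collect_books(fake_page, range(1, 4))))
```

Wrapping the loops in a function lets `return` act as the exit for both levels, which avoids extra flag variables.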

In [31]:
book_list

[{'Author': 'Samapti Sinha Mahapatra', 'Rating': '4.6 out of 5 stars'},
 {'Author': 'Ishinna B. Sadana', 'Rating': '4.8 out of 5 stars'},
 {'Author': 'Kriti Sharma', 'Rating': '4.7 out of 5 stars'},
 {'Author': 'Kautilya', 'Rating': '4.5 out of 5 stars'},
 {'Author': 'एम लक्ष्मीकांत', 'Rating': '4.4 out of 5 stars'},
 {'Author': 'Lori Gottlieb', 'Rating': '4.6 out of 5 stars'},
 {'Author': 'PR Yadav', 'Rating': '4.4 out of 5 stars'},
 {'Author': 'Dr. Chhavi Kalra', 'Rating': '4.6 out of 5 stars'},
 {'Author': 'R.K. Gupta', 'Rating': '4.5 out of 5 stars'},
 {'Author': 'Wonder House Books', 'Rating': '4.7 out of 5 stars'},
 {'Author': 'Rajesh Verma', 'Rating': '4.3 out of 5 stars'},
 {'Author': 'EduGorilla PREP EXPERT', 'Rating': '4.0 out of 5 stars'},
 {'Author': 'Wonder House Books', 'Rating': '4.7 out of 5 stars'},
 {'Author': 'Professional Book Publishers', 'Rating': '4.7 out of 5 stars'},
 {'Author': 'N/A', 'Rating': '4.7 out of 5 stars'},
 {'Author': 'N/A', 'Rating': '4.8 out of 5 

# Step 3: Store and Save the Data

After collecting the data, we store it in a pandas DataFrame and save it to a CSV file:

In [32]:
# convert the list of dictionaries into a DataFrame
df = pd.DataFrame(book_list)

print(df.head())

# save the DataFrame to a CSV file
df.to_csv("amazon_top_50_books_authors_ratings.csv", index=False)

                    Author              Rating
0  Samapti Sinha Mahapatra  4.6 out of 5 stars
1        Ishinna B. Sadana  4.8 out of 5 stars
2             Kriti Sharma  4.7 out of 5 stars
3                 Kautilya  4.5 out of 5 stars
4           एम लक्ष्मीकांत  4.4 out of 5 stars
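
The Rating column holds text such as "4.5 out of 5 stars", which cannot be sorted or averaged directly. As a hedged sketch, the numeric score could be pulled out with a regular expression. The helper name `parse_rating` is hypothetical, and the pattern assumes ratings always follow the "X out of 5" wording seen in the output above.

```python
import re


def parse_rating(text):
    """Extract the numeric score from text like '4.6 out of 5 stars' (hypothetical helper).

    Returns None for 'N/A' or any text that does not match the expected wording.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of\s+5", text)
    return float(match.group(1)) if match else None


# two rows in the same shape as book_list, one of them with missing data
sample = [
    {"Author": "Kautilya", "Rating": "4.5 out of 5 stars"},
    {"Author": "N/A", "Rating": "N/A"},
]
for row in sample:
    row["Rating"] = parse_rating(row["Rating"])
print(sample)
```

On the DataFrame itself, the same function could be applied with `df["Rating"].map(parse_rating)` before saving to CSV.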
