# Web Scraping from Amazon with Python
#### This notebook demonstrates how to scrape data from Amazon's Best Sellers page in the Teaching & Education category.


In [2]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [4]:
# Set up the base URL and HTTP headers
# Base URL of the best sellers page for teaching & education books
base_url = "https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_{}?ie=UTF8&pg={}"

# HTTP headers to mimic a browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
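
The URL template contains two `{}` placeholders, one in the `ref` path segment and one in the `pg` query parameter, and both receive the page number. As a small illustrative sketch (the helper name `page_urls` is hypothetical and not part of the original notebook), the expansion can be checked without sending any requests:

```python
# Same template as in the cell above, repeated so this snippet runs on its own
base_url = "https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_{}?ie=UTF8&pg={}"


def page_urls(num_pages):
    """Return the URLs for pages 1..num_pages (hypothetical helper)."""
    # Both placeholders receive the same page number
    return [base_url.format(page, page) for page in range(1, num_pages + 1)]


for url in page_urls(3):
    print(url)
```

Printing the URLs first is a cheap way to confirm the template is filled in as intended before any page is requested.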

In [6]:
# Iterate over pages to collect data
# Initialize a list to store book data
book_list = []

# Iterate over up to 3 pages and stop at 50 books (assuming each page lists about 20 items)
for page in range(1, 4):
    # Stop requesting further pages once 50 books have been collected
    if len(book_list) >= 50:
        break

    # Construct the URL for the current page
    url = base_url.format(page, page)

    # Send a GET request; the timeout keeps a stalled connection from hanging the notebook
    response = requests.get(url, headers=headers, timeout=10)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "lxml")

    # Find all the book elements
    books = soup.find_all("div", {"class": "zg-grid-general-faceout"})

    # Iterate over each book element, keeping at most 50 books in total
    for book in books[:50 - len(book_list)]:
        # Look up each element once; fall back to "N/A" when it is missing
        author_tag = book.find("a", class_="a-size-small a-link-child")
        rating_tag = book.find("span", class_="a-icon-alt")
        author = author_tag.get_text(strip=True) if author_tag else "N/A"
        rating = rating_tag.get_text(strip=True) if rating_tag else "N/A"

        # Append the extracted data to the book_list
        book_list.append({
            "Author": author,
            "Rating": rating
        })

In [7]:
# Store and save the data
# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(book_list)

# Display the first few rows of the DataFrame
print(df.head())

# Save the DataFrame to a CSV file
df.to_csv("amazon_top_50_books_authors_ratings.csv", index=False)

                    Author              Rating
0  Samapti Sinha Mahapatra  4.6 out of 5 stars
1                      N/A  4.4 out of 5 stars
2                 PR Yadav  4.4 out of 5 stars
3           एम लक्ष्मीकांत  4.4 out of 5 stars
4       Subhadra Sen Gupta  4.5 out of 5 stars


In [8]:
# Display a random sample of 10 rows from the DataFrame
print(df.sample(10))

                           Author              Rating
18        EduGorilla Prep Experts  4.4 out of 5 stars
43                Peter Liljedahl  4.8 out of 5 stars
24                            N/A  4.2 out of 5 stars
25                   Aman Kharwal  4.4 out of 5 stars
28         ALLEN Expert Faculties  4.2 out of 5 stars
15                   Rajesh Verma  4.3 out of 5 stars
34  Scholastic Teaching Resources  4.6 out of 5 stars
32        EduGorilla Prep Experts  3.9 out of 5 stars
5               Ishinna B. Sadana  4.7 out of 5 stars
0         Samapti Sinha Mahapatra  4.6 out of 5 stars
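
The `Rating` column holds text such as `4.6 out of 5 stars`, which cannot be sorted or averaged numerically as it stands. The following is a minimal sketch of one way to turn it into a number, assuming every rating follows the "X out of 5 stars" wording seen in the output above. The names `RATING_PATTERN` and `parse_rating` are hypothetical and not part of the original notebook.

```python
import re

# Matches the leading number in strings such as "4.6 out of 5 stars"
RATING_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+out of\s+5\s+stars")


def parse_rating(text):
    """Return the rating as a float, or None when the text has no rating (hypothetical helper)."""
    match = RATING_PATTERN.match(text)
    return float(match.group(1)) if match else None


print(parse_rating("4.6 out of 5 stars"))
print(parse_rating("N/A"))
```

With the DataFrame built earlier, the helper could be applied as `df["Rating"].map(parse_rating)`; rows where the rating was missing ("N/A") would then come back as `None` rather than a number.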


Web scraping extracts data from websites by sending requests to the server, retrieving the pages, and parsing the HTML to obtain the desired information. This notebook showed how to collect author names and ratings from Amazon's Best Sellers page using Python with requests, BeautifulSoup, and pandas. I hope you find it helpful for your own data collection!

Run the cells in order, since each one depends on variables defined in the cells before it.
You can modify the number of pages to scrape or the specific data you want to extract based on your needs.
Always check the website's robots.txt file and terms of service to ensure that web scraping is allowed.
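
The robots.txt check mentioned above can be done programmatically with Python's standard `urllib.robotparser` module. The sketch below parses a made-up robots.txt body (it is not Amazon's actual file, and the `example.com` URLs are placeholders) so that it runs without any network access:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration; it is NOT Amazon's actual file
sample_robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is made here
parser.parse(sample_robots_txt.splitlines())

# Ask whether a crawler with any user agent ("*") may fetch each URL
print(parser.can_fetch("*", "https://example.com/public/page.html"))
print(parser.can_fetch("*", "https://example.com/private/data.html"))
```

For a live site, `set_url()` followed by `read()` on the same parser would download the real file instead. Passing this check covers only robots.txt; the terms of service still need to be read separately.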
