# Web Scraping Tutorial with Beautiful Soup

Web scraping is a technique for extracting data from websites. In this tutorial, we use the **Beautiful Soup** library in Python to scrape product reviews from Amazon, the **requests** library to fetch the web pages, and **pandas** to organize and save the results.

## Prerequisites

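Before we begin, make sure the required libraries are installed. pandas relies on an Excel engine such as `openpyxl` to write `.xlsx` files, so it is included here. You can install everything using pip:

```bash
pip install beautifulsoup4 requests pandas openpyxl
```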

## Importing Libraries

* BeautifulSoup: A library for parsing HTML and XML documents. It builds a parse tree from the page source that makes it easy to extract data.
* requests: A library for making HTTP requests to web servers. It allows us to send GET and POST requests and handle responses.
* pandas: A powerful data manipulation and analysis library that provides data structures like DataFrames.

In [134]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

## Fetching the Web Page
* requests.get(link): Sends an HTTP GET request to the specified URL and returns a response object.
* print(page): Displays the response object, including the HTTP status code.

#### Understanding HTTP Status Codes:
* 200 (OK) = the request succeeded
* 503 (Service Unavailable) = the server did not serve the page; here it usually means the website thinks you're a bot
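
As a small aid for reading status codes, the standard library's `http.HTTPStatus` enum maps a numeric code to its official reason phrase. The helper name `describe_status` is made up for this sketch and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code together with its standard reason phrase."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{status.value} {status.phrase}"


print(describe_status(200))  # 200 OK
print(describe_status(503))  # 503 Service Unavailable
```

With a real response, you could call `describe_status(page.status_code)`.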

To see the raw content of the page, use:
```python
print(page.content)
```


In [135]:
link = 'https://www.amazon.in/Hawkins-Contura-Anodised-Aluminium-Pressure/product-reviews/B00SX03I08/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews'
#link = 'https://www.amazon.in/NutriPro-Bullet-Juicer-Grinder-Blades/product-reviews/B09J2SCVQT'
page = requests.get(link)
print(page)

<Response [200]>
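
If the server answers with 503 instead of 200, a common workaround is to send browser-like headers and to fail loudly on error statuses. The following is a minimal sketch under that assumption: the header values and the helper name `fetch_page` are illustrative, and they may not be enough to get a 200 response from every site.

```python
import requests

# Illustrative header values (assumptions); adjust them for your own client.
DEMO_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) review-tutorial/0.1",
    "Accept-Language": "en-IN,en;q=0.9",
}


def fetch_page(url, session=None, timeout=10):
    """Fetch `url` with custom headers and raise on 4xx/5xx responses."""
    http = session if session is not None else requests.Session()
    response = http.get(url, headers=DEMO_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError, e.g. for 503
    return response
```

In the notebook, `page = fetch_page(link)` would then take the place of the plain `requests.get(link)` call.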


## Parsing HTML Content

* bs(page.content, 'html.parser'): Creates a Beautiful Soup object, parsing the HTML content of the page.
* soup.prettify(): Formats the HTML content with indentation for better readability.
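
To see what parsing produces without fetching anything, here is a small sketch that runs the same two calls on a hand-written HTML string. The markup and the `demo_` names are invented for illustration.

```python
from bs4 import BeautifulSoup

demo_page = "<html><body><p class='greeting'>Hello, <b>world</b></p></body></html>"
demo_tree = BeautifulSoup(demo_page, "html.parser")

# Tags can be reached as attributes of the parsed tree
print(demo_tree.p.get_text())  # Hello, world

# prettify() returns the same markup as an indented string
print(demo_tree.prettify())
```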

In [136]:
soup = bs(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo" lang="en-in">
 <!-- sp:feature:head-start -->
 <head>
  <script>
   var aPageStart = (new Date()).getTime();
  </script>
  <meta charset="utf-8"/>
  <!-- sp:end-feature:head-start -->
  <!-- sp:feature:csm:head-open-part1 -->
  <!-- sp:end-feature:csm:head-open-part1 -->
  <!-- sp:feature:cs-optimization -->
  <meta content="on" http-equiv="x-dns-prefetch-control"/>
  <link href="https://images-eu.ssl-images-amazon.com" rel="dns-prefetch"/>
  <link href="https://m.media-amazon.com" rel="dns-prefetch"/>
  <link href="https://completion.amazon.com" rel="dns-prefetch"/>
  <!-- sp:end-feature:cs-optimization -->
  <!-- sp:feature:csm:head-open-part2 -->
  <!-- sp:end-feature:csm:head-open-part2 -->
  <!-- sp:feature:aui-assets -->
  <link href="https://m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,412sHz-V95L.css,31ASPyl+r4L.css,01GZEvC5WIL.css,11GEPqXartL.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3G

## Extracting Customer Names
To extract customer names from the reviews, we can use the find_all method.

soup.find_all(tag, class_='class_name'): Finds all occurrences of the specified tag with the given class name. In this case, we are looking for `<span>` tags with the class a-profile-name.
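
One detail worth knowing is that `class_` matches a tag if the given name is any one of its classes, which is why a short class name can match tags that carry several classes. A minimal sketch on invented markup (the reviewer names and the `highlighted` class are made up):

```python
from bs4 import BeautifulSoup

demo_html = """
<div>
  <span class="a-profile-name">Reviewer One</span>
  <span class="a-profile-name highlighted">Reviewer Two</span>
  <span class="other">Not a name</span>
</div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Both name spans match, including the one that carries a second class
demo_names = demo_soup.find_all("span", class_="a-profile-name")
print([tag.get_text() for tag in demo_names])  # ['Reviewer One', 'Reviewer Two']
```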


In [137]:
names = soup.find_all('span', class_='a-profile-name')

In [138]:
print(names)

[<span class="a-profile-name">S Ghosh</span>, <span class="a-profile-name">Female</span>, <span class="a-profile-name">S Ghosh</span>, <span class="a-profile-name">Chandra Mohan Sriwas</span>, <span class="a-profile-name">Shraddha More</span>, <span class="a-profile-name">Brinda Devadas</span>, <span class="a-profile-name">Mridul gupta</span>, <span class="a-profile-name">Harish</span>, <span class="a-profile-name">Female</span>, <span class="a-profile-name">arvind acharya</span>, <span class="a-profile-name">Customer711</span>, <span class="a-profile-name">Basudev Guru</span>]


## Storing Customer Names

We'll be storing the extracted names in a list.

get_text(): Retrieves the text content from the HTML element, stripping away any HTML tags.

In [139]:
# Collect the text of every matched tag
cust_name = [name.get_text() for name in names]
cust_name

['S Ghosh',
 'Female',
 'S Ghosh',
 'Chandra Mohan Sriwas',
 'Shraddha More',
 'Brinda Devadas',
 'Mridul gupta',
 'Harish',
 'Female',
 'arvind acharya',
 'Customer711',
 'Basudev Guru']

## Removing Duplicates

In [140]:
# Remove duplicates while preserving the first occurrence
unique_cust_name = []
seen = set()
for cust in cust_name:
    if cust not in seen:
        unique_cust_name.append(cust)
        seen.add(cust)
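
The loop keeps the first occurrence of each name and drops later repeats. Dictionaries preserve insertion order since Python 3.7, so the same result can be had with `dict.fromkeys`. A short sketch on a hand-made sample of the names above:

```python
sample_names = ["S Ghosh", "Female", "S Ghosh", "Harish", "Female"]

# dict keys are unique and keep insertion order, so this drops later repeats
deduped_names = list(dict.fromkeys(sample_names))

print(deduped_names)  # ['S Ghosh', 'Female', 'Harish']
```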

## Extracting Review Titles
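We extract the review titles the same way as the customer names, this time searching for `<a>` tags with the class review-title-content.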

In [141]:
titles = soup.find_all('a', class_='review-title-content')

In [142]:
print(titles)

[<a class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R3FU7DBSVV4H2Y?ASIN=B00SX03I08"><i class="a-icon a-icon-star a-star-5 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">5.0 out of 5 stars</span></i><span class="a-letter-space"></span>
<span>Excellent product</span>
</a>, <a class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R2TCIKKAM8HC5Y?ASIN=B00SX03I08"><i class="a-icon a-icon-star a-star-4 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">4.0 out of 5 stars</span></i><span class="a-letter-space"></span>
<span>Good quality no doubt 1.5 liter is perfect for the 1 person</span>
</a>, <a class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R22TCK8IMU0E2G?ASIN=B00SX

## Cleaning Review Titles
To clean up the review titles by removing leading and trailing whitespace, we can use a list comprehension.

strip(): Removes any leading and trailing whitespace from the string.

In [143]:
# Use a separate loop variable so the list `titles` is not shadowed
review_title = [title.get_text().strip() for title in titles]
review_title

['5.0 out of 5 stars\nExcellent product',
 '4.0 out of 5 stars\nGood quality no doubt 1.5 liter is perfect for the 1 person',
 '5.0 out of 5 stars\nGreat durability, Quality and Fittings, Worth Buying',
 '5.0 out of 5 stars\nSuper',
 '5.0 out of 5 stars\nGreat',
 '4.0 out of 5 stars\nGood cooker',
 '1.0 out of 5 stars\nHawkins brand warranty policy is just a fraud',
 '5.0 out of 5 stars\nSturdy',
 '4.0 out of 5 stars\nFirst time came without rubber',
 '4.0 out of 5 stars\nGood product']
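
Each cleaned string still carries the star rating in front of the title, separated by a newline. If you want the two as separate fields, one option is to split on that newline. This is a sketch that assumes every entry follows the "X out of 5 stars" pattern seen in the output above; `split_rating` is a hypothetical helper name.

```python
sample_titles = [
    "5.0 out of 5 stars\nExcellent product",
    "1.0 out of 5 stars\nHawkins brand warranty policy is just a fraud",
]


def split_rating(raw_title):
    """Split a scraped title into (rating, title text)."""
    rating_text, _, title_text = raw_title.partition("\n")
    rating = float(rating_text.split()[0])  # "5.0 out of 5 stars" -> 5.0
    return rating, title_text.strip()


for raw in sample_titles:
    print(split_rating(raw))
# (5.0, 'Excellent product')
# (1.0, 'Hawkins brand warranty policy is just a fraud')
```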

## Creating a DataFrame
Now that we have customer names and review titles, we can create a DataFrame using pandas.

* pd.DataFrame(): Initializes a new DataFrame.
* df['Column Name'] = data: Assigns data to a specific column in the DataFrame.
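
pandas requires all columns of a DataFrame to have the same length, which is why the list lengths are checked before the DataFrame is built. A minimal sketch with invented sample data shows both the working case and the error raised on a mismatch:

```python
import pandas as pd

demo_names = ["Reviewer One", "Reviewer Two"]
demo_titles = ["Great", "Sturdy"]

# Equal lengths: the DataFrame is built without complaint
demo_df = pd.DataFrame({"Customer Name": demo_names, "Review Title": demo_titles})
print(demo_df.shape)  # (2, 2)

# Unequal lengths: pandas raises a ValueError
try:
    pd.DataFrame({"Customer Name": demo_names, "Review Title": demo_titles[:1]})
except ValueError as exc:
    print(f"Cannot build the DataFrame: {exc}")
```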

In [144]:
# Print lengths for debugging
print(f"Number of unique customers: {len(unique_cust_name)}")
print(f"Number of review titles: {len(review_title)}")

Number of unique customers: 10
Number of review titles: 10


In [145]:
# The lengths match here, but that is not guaranteed.
# Truncate both lists to the shorter length as a safeguard.
min_length = min(len(unique_cust_name), len(review_title))
unique_cust_name = unique_cust_name[:min_length]
review_title = review_title[:min_length]

In [146]:
df = pd.DataFrame()
df['Customer Name'] = unique_cust_name
df['Review Title'] = review_title

In [147]:
df

| | Customer Name | Review Title |
|---|---|---|
| 0 | S Ghosh | 5.0 out of 5 stars\nExcellent product |
| 1 | Female | 4.0 out of 5 stars\nGood quality no doubt 1.5 ... |
| 2 | Chandra Mohan Sriwas | 5.0 out of 5 stars\nGreat durability, Quality ... |
| 3 | Shraddha More | 5.0 out of 5 stars\nSuper |
| 4 | Brinda Devadas | 5.0 out of 5 stars\nGreat |
| 5 | Mridul gupta | 4.0 out of 5 stars\nGood cooker |
| 6 | Harish | 1.0 out of 5 stars\nHawkins brand warranty pol... |
| 7 | arvind acharya | 5.0 out of 5 stars\nSturdy |
| 8 | Customer711 | 4.0 out of 5 stars\nFirst time came without ru... |
| 9 | Basudev Guru | 4.0 out of 5 stars\nGood product |


## Saving the data to Excel
to_excel(file_name, index=False): Saves the DataFrame to an Excel file. The index=False argument prevents pandas from writing row indices to the file.

In [148]:
file_name = 'reviews.xlsx'
df.to_excel(file_name, index=False)
print('DataFrame is written to Excel File successfully')

DataFrame is written to Excel File successfully
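
To see what `index=False` changes without opening Excel, the sketch below writes a one-row DataFrame to an in-memory CSV buffer with and without the index. `to_csv` is used here only because its output is plain text; it accepts the same `index` argument as `to_excel`. The sample row is invented.

```python
import io

import pandas as pd

demo_frame = pd.DataFrame({"Customer Name": ["Reviewer One"], "Review Title": ["Great"]})

with_index = io.StringIO()
demo_frame.to_csv(with_index)  # index=True is the default

without_index = io.StringIO()
demo_frame.to_csv(without_index, index=False)

# The header row gains an empty leading column when the index is written
print(with_index.getvalue().splitlines()[0])     # ,Customer Name,Review Title
print(without_index.getvalue().splitlines()[0])  # Customer Name,Review Title
```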


## But there's a problem!!

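The customer names and review titles were extracted independently and then matched up by position. That only works if the two lists line up one-to-one, and here they do not: find_all returned 12 names for 10 review titles, and removing duplicate names shifts every later name against a title from a different review. For example, the table above pairs Female with the review "Good quality no doubt 1.5 ...", while the per-review extraction below attributes that review to Chandra Mohan Sriwas.

It's a one-to-one mapping, you can't just delete entries like that!

Instead of extracting customer names and review titles separately, iterate through the reviews and extract both the customer name and the review title in a single loop, so each name stays attached to its own review.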
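
The effect is easy to reproduce on a toy example. The names and titles below are hypothetical; the point is only that deduplicating one list and then pairing by position attaches names to the wrong titles.

```python
# Hypothetical page: four reviews, two of them by reviewers named "A"
scraped_names = ["A", "B", "A", "C"]
scraped_titles = ["title 1", "title 2", "title 3", "title 4"]

deduped = list(dict.fromkeys(scraped_names))  # ['A', 'B', 'C']

# Pairing by position now attaches "C" to the third review instead of the fourth
print(list(zip(deduped, scraped_titles)))
# [('A', 'title 1'), ('B', 'title 2'), ('C', 'title 3')]
```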

In [149]:
# Initialize lists to store customer names and review titles
cust_name = []
review_title = []

In [150]:
# Find all review containers; each one holds a single review.
# Adjust the tag and class if the page markup differs.
reviews = soup.find_all('div', class_='review')

In [151]:
for review in reviews:
    # Extract the customer name from within this review only
    name_tag = review.find('span', class_='a-profile-name')
    name = name_tag.get_text().strip() if name_tag else None
    cust_name.append(name)

    # Extract the review title from the same container
    title_tag = review.find('a', class_='review-title-content')
    title = title_tag.get_text().strip() if title_tag else None
    review_title.append(title)
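
The same per-container pattern can be tried offline. The sketch below runs it on invented markup that mimics the assumed structure (one `div` per review); it includes a repeated reviewer name and a review without a name to show that each title keeps its own name, or `None` when the name is missing.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the structure assumed above: one div per review
demo_reviews_html = """
<div class="review">
  <span class="a-profile-name">Reviewer One</span>
  <a class="review-title-content">Sturdy</a>
</div>
<div class="review">
  <span class="a-profile-name">Reviewer One</span>
  <a class="review-title-content">Good cooker</a>
</div>
<div class="review">
  <a class="review-title-content">Anonymous review</a>
</div>
"""
demo_doc = BeautifulSoup(demo_reviews_html, "html.parser")

rows = []
for block in demo_doc.find_all("div", class_="review"):
    # Search inside this review only, so name and title stay paired
    name_tag = block.find("span", class_="a-profile-name")
    title_tag = block.find("a", class_="review-title-content")
    rows.append((
        name_tag.get_text(strip=True) if name_tag else None,
        title_tag.get_text(strip=True) if title_tag else None,
    ))

print(rows)
# [('Reviewer One', 'Sturdy'), ('Reviewer One', 'Good cooker'), (None, 'Anonymous review')]
```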

In [153]:
# Create a DataFrame with the extracted data.
# No deduplication this time: both lists hold exactly one entry per review,
# so removing a repeated name would break the pairing again.
df = pd.DataFrame({
    'Customer Name': cust_name,
    'Review Title': review_title
})

In [154]:
# Display the DataFrame
df

| | Customer Name | Review Title |
|---|---|---|
| 0 | S Ghosh | 5.0 out of 5 stars\nExcellent product |
| 1 | Chandra Mohan Sriwas | 4.0 out of 5 stars\nGood quality no doubt 1.5 ... |
| 2 | Shraddha More | 5.0 out of 5 stars\nGreat durability, Quality ... |
| 3 | Brinda Devadas | 5.0 out of 5 stars\nSuper |
| 4 | Mridul gupta | 5.0 out of 5 stars\nGreat |
| 5 | Harish | 4.0 out of 5 stars\nGood cooker |
| 6 | Female | 1.0 out of 5 stars\nHawkins brand warranty pol... |
| 7 | arvind acharya | 5.0 out of 5 stars\nSturdy |
| 8 | Customer711 | 4.0 out of 5 stars\nFirst time came without ru... |
| 9 | Basudev Guru | 4.0 out of 5 stars\nGood product |


In [155]:
# Print lengths for debugging
print(f"Number of customer names: {len(cust_name)}")
print(f"Number of review titles: {len(review_title)}")

Number of customer names: 10
Number of review titles: 10


## Wrapping Up

In this tutorial, we covered the fundamentals of web scraping using Beautiful Soup and the requests library. We learned how to:

- Fetch a web page and check its status.
- Parse HTML content to navigate the structure of the web page.
- Extract specific data such as customer names and review titles.
- Clean and process the extracted data to remove duplicates and handle formatting.
- Save the cleaned data to an Excel file for further analysis.

Web scraping is a powerful tool for data collection and analysis, enabling you to gather information from various websites efficiently. However, it's essential to use this technique responsibly and ethically, adhering to the website's terms of service and robots.txt rules.

Feel free to modify the code to scrape different data or from different websites. Happy scraping!

### Running the Code

You can copy the code into a Python script or a Jupyter notebook. The cells are meant to be run in order, so you can inspect the result of each step: fetching the page, parsing it, extracting the reviews, and saving them to an Excel file.