# Amazon Review Scraper

### Import libraries

In [251]:
import pandas as pd

import requests
from bs4 import BeautifulSoup

### Set the search query (the product category to extract reviews for)

In [252]:
search_query="gadgets"

### Set the base URL of the website

In [253]:
base_url="https://www.amazon.com/s?k="

In [254]:
url=base_url+search_query
url

'https://www.amazon.com/s?k=gadgets'
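Plain concatenation works for a one-word query such as "gadgets", but a query containing spaces or symbols needs URL encoding. A minimal sketch, assuming a hypothetical helper named `build_search_url` (not part of the original notebook) and using the standard library's `urlencode`:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.com/s"

def build_search_url(search_query):
    """Return the search URL with the query safely URL-encoded."""
    # urlencode escapes spaces and reserved characters in the query value
    return BASE_URL + "?" + urlencode({"k": search_query})

print(build_search_url("gadgets"))
print(build_search_url("kitchen gadgets"))
```

For the single-word query used here, the result is identical to the concatenated URL above.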

### Setting Up HTTP Headers for Web Scraping Amazon Product Listings

This header makes the request look like it came from a browser (here, Chrome on Windows), which reduces the chance that Amazon blocks it while scraping. The User-Agent identifies the browser and operating system, and the referer states the page the request supposedly came from, which in this case is an Amazon search for "gadgets".

In [255]:
header={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36','referer':'https://www.amazon.com/s?k=gadgets&crid=3RX9ZX1COK85L&sprefix=gadgets%2Caps%2C273&ref=nb_sb_noss_1'}
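The referer above is hard-coded for the "gadgets" search. If the query changes, it may be preferable to derive the referer from it. The sketch below assumes a hypothetical helper `make_headers` and a simplified referer that keeps only the `k` parameter; the extra parameters in the original referer (`crid`, `sprefix`, `ref`) are left out, and whether Amazon treats the two forms differently is not established here.

```python
from urllib.parse import urlencode

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36")

def make_headers(search_query):
    """Build request headers whose referer matches the given search query."""
    # Simplified referer: only the search term, URL-encoded
    referer = "https://www.amazon.com/s?" + urlencode({"k": search_query})
    return {"User-Agent": USER_AGENT, "referer": referer}

headers = make_headers("label maker")
print(headers["referer"])
```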

In [256]:
search_response=requests.get(url,headers=header)

### Checking the HTTP Status Code of the Web Scraping Request

This code checks the HTTP status code returned by the server in response to the web scraping request. The status code indicates whether the request was successful (200), if there was a client error like a 403 Forbidden or 404 Not Found, or if there was a server error like 500. This is useful for debugging and ensuring the request was processed correctly.

In [257]:
search_response.status_code

200
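The status ranges described above can be turned into a small helper for logging or debugging. This is an illustrative sketch; `describe_status` is a hypothetical name, not something the notebook defines.

```python
def describe_status(status_code):
    """Map an HTTP status code to a coarse category."""
    if 200 <= status_code < 300:
        return "success"
    if 300 <= status_code < 400:
        return "redirect"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "unknown"

for code in (200, 403, 404, 500):
    print(code, describe_status(code))
```

With requests itself, `search_response.ok` and `search_response.raise_for_status()` offer a built-in way to detect 4xx and 5xx responses.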

### Accessing the HTML Content of the Web Scraping Response

search_response.text is used to retrieve the raw HTML content returned by the server in response to the web scraping request. This HTML can then be parsed to extract the desired data from the webpage.

In [258]:
search_response.text

'<!doctype html><html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n<!-- sp:feature:csm:head-open-part1 -->\n\n<script type=\'text/javascript\'>var ue_t0=ue_t0||+new Date();</script>\n<!-- sp:end-feature:csm:head-open-part1 -->\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-na.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completion.amazon.com">\n<!-- sp:end-feature:cs-optimization -->\n<!-- sp:feature:csm:head-open-part2 -->\n<script type=\'text/javascript\'>\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\n\nvar ue_csm = window,\n    ue_hob = +new Date();\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){

### Retrieving Cookies from the Web Scraping Response
search_response.cookies retrieves the cookies set by the server in response to the web scraping request. These cookies can be useful for maintaining session state or for making subsequent requests that require the same session information.

In [259]:
search_response.cookies

<RequestsCookieJar[Cookie(version=0, name='session-id', value='142-7617801-7235355', port=None, port_specified=False, domain='.amazon.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=True, expires=1755434583, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False), Cookie(version=0, name='session-id-time', value='2082787201l', port=None, port_specified=False, domain='.amazon.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=True, expires=1755434583, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False), Cookie(version=0, name='i18n-prefs', value='USD', port=None, port_specified=False, domain='.amazon.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1755434583, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False), Cookie(version=0, name='sp-cdn', value='"L5Z9:IN"', port=None, port_specified=False,
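To reuse these cookies in later requests, the jar can be converted to a plain dictionary. The sketch below builds a small jar by hand so that it runs without a network call; the two cookie names and values are copied from the output above, and in the notebook the jar would be `search_response.cookies` instead.

```python
from requests.cookies import RequestsCookieJar
from requests.utils import dict_from_cookiejar

# Hand-built jar standing in for search_response.cookies
jar = RequestsCookieJar()
jar.set("session-id", "142-7617801-7235355", domain=".amazon.com", path="/")
jar.set("i18n-prefs", "USD", domain=".amazon.com", path="/")

# Flatten the jar to {name: value}; this shape can be passed as cookies=... to requests.get
cookie = dict_from_cookiejar(jar)
print(cookie)
```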

### Function to Fetch Amazon Search Results Page for a Given Query
This function, getAmazonSearch, takes a search query and builds the corresponding Amazon search URL. It then sends an HTTP GET request with the headers defined earlier so that the request resembles one from a browser. If the request succeeds (status code 200), it returns the response object, whose .content attribute holds the page HTML. Otherwise it returns the string "Error". Callers that read response.content should check for this value first, because a string has no .content attribute.

**Parameters:**

**search_query:** A string representing the search term you want to query on Amazon.

**Returns:**

1. The response object for the Amazon search results page if the request is successful.

2. The string "Error" if the request fails.

The cookie dictionary is initialized empty; request cookies can be added to it for later requests if needed.

In [260]:
cookie={} # insert request cookies within{}
def getAmazonSearch(search_query):
    url="https://www.amazon.com/s?k="+search_query
    print(url)
    page=requests.get(url,headers=header)
    if page.status_code==200:
        return page
    else:
        return "Error"
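Because the function above returns the string "Error" on failure, later code that reads `response.content` would raise an AttributeError for a failed request. One possible alternative, sketched here under the assumption of a hypothetical helper `fetch_page`, is to return `None` on failure and to set a timeout. The `get` parameter exists only so the function can be exercised without a network connection.

```python
import requests

def fetch_page(url, headers, get=requests.get, timeout=10):
    """Return the response on HTTP 200, otherwise None."""
    try:
        page = get(url, headers=headers, timeout=timeout)
    except requests.RequestException:
        # Covers connection errors and timeouts raised by requests
        return None
    # Any status other than 200 is treated as a failure
    return page if page.status_code == 200 else None
```

A caller would then test `if response is not None:` before parsing the page with BeautifulSoup.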

### Function to Retrieve and Extract Content from Amazon 'See All Reviews' Page
This function, Searchreviews, takes the link to a product's "See All Reviews" page and fetches that page. It builds the full URL by appending review_link to the base Amazon URL, then sends an HTTP GET request with the headers and any cookies placed in the cookie dictionary. If the request succeeds (status code 200), it returns the response object; otherwise it returns the string "Error".

**Parameters:**

**review_link:** A string representing the relative URL path to the "See All Reviews" page of a specific product on Amazon.

**Returns:**

1. The response object for the Amazon reviews page if the request is successful.

2. The string "Error" if the request fails.

This function is useful for navigating from a product page to its corresponding reviews page and retrieving the reviews' HTML content for further analysis.

In [261]:
def Searchreviews(review_link):
    url="https://www.amazon.com"+review_link
    print(url)
    page=requests.get(url,cookies=cookie,headers=header)
    if page.status_code==200:
        return page
    else:
        return "Error"

### Extracting Product Names from the First Page of Amazon Search Results

This block of code is used to extract the names of products from the first page of Amazon search results for a given query (in this case, "gadgets"). The process involves:

**Initializing an Empty List:** product_names is initialized as an empty list to store the names of the products found on the search results page.

**Fetching the Search Results Page:** The getAmazonSearch('gadgets') function is called to fetch the HTML content of the Amazon search results page for the query "gadgets."

**Parsing the HTML Content:** The HTML content of the page is parsed using BeautifulSoup to create a soup object, which allows for easy navigation and extraction of elements.

**Finding and Extracting Product Names:**

The soup.findAll method searches for all span elements whose class attribute is exactly 'a-size-base-plus a-color-base a-text-normal'. Because a multi-class string is compared against the whole attribute value, a span with the same classes in a different order would not match.

At the time of scraping, this class string was shared by the product names displayed on the search results page.

The text content of each matching span element (i.e., the product name) is extracted and appended to the product_names list.

In [262]:
# First page: product name extraction
product_names=[]
response=getAmazonSearch('gadgets')
soup=BeautifulSoup(response.content)
for i in soup.findAll("span",{'class':'a-size-base-plus a-color-base a-text-normal'}): # the class string shared by the product-name spans
    product_names.append(i.text) # add the product name to the list

https://www.amazon.com/s?k=gadgets
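As a variation on the extraction above, a CSS selector matches the three classes regardless of the order in which they appear in the attribute. The markup below is a hypothetical, heavily simplified stand-in for a results page, used so that the sketch runs offline; real pages differ and change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the search results HTML
sample_html = """
<div>
  <span class="a-size-base-plus a-color-base a-text-normal">Label Maker</span>
  <span class="a-text-normal a-color-base a-size-base-plus">Meat Thermometer</span>
  <span class="a-color-base">Sponsored</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Each ".name" requires that class to be present; order does not matter
selector = "span.a-size-base-plus.a-color-base.a-text-normal"
product_names = [span.get_text(strip=True) for span in soup.select(selector)]
print(product_names)
```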


In [263]:
product_names

['MEATER Pro / 2 Plus: Wireless Bluetooth Smart Meat Thermometer | 1000°F Open Flame Grilling | Extra Long Range | Multi Sensors | Certified Accuracy | BBQ, Oven, Grill, Smoker, Air Fryer, Deep Fryer',
 'Sync WiFi Wireless Meat Thermometer Digital, 2 Probes, Smart Base, LCD Display, Unlimited Range, Bluetooth 5.4, Improved Stability, NIST-Certified Accuracy, BBQ, Grill, Smoker, Oven, Kitchen',
 'SUPVAN E10 Bluetooth Label Maker Machine with Tape, Continuous Waterproof Label, Versatile App with 35 Fonts and 1k+ Icons, Inkless Labeler for Home, Kitchen, School, Office Organization, Green',
 'Inkbird WiFi Meat Thermometer IBBQ-4T, Wireless WiFi BBQ Thermometer for Smoker, Oven | APP Calibration Temp Graph | Mobile Notification Timer Alarm | Rechargeable Digital Grill Thermometer, 4 Probes',
 'Mueller Pro-Series 10-in-1, 8 Blade Vegetable Chopper, Onion Mincer, Cutter, Dicer, Egg Slicer with Container, French Fry Cutter Potatoe Slicer, Home Essentials & Kitchen Gadgets, Salad Chopper',
 'S

In [264]:
len(product_names)

70

### Function to Fetch Content from Individual Amazon Product Pages Using ASIN

This function, Searchasin, retrieves an individual product page on Amazon using the product's unique identification number, known as the ASIN (Amazon Standard Identification Number). The function builds the product's URL by appending the provided ASIN to "https://www.amazon.com/dp/". It then sends an HTTP GET request to this URL with the specified headers and any cookies in the cookie dictionary. If the request succeeds (status code 200), it returns the response object. Otherwise, it returns the string "Error".

**Parameters:**

**asin:** A string representing the Amazon Standard Identification Number (ASIN) of the product.

**Returns:**

1. The response object for the Amazon product page if the request is successful.

2. The string "Error" if the request fails.

In [265]:
def Searchasin(asin):
    url="https://www.amazon.com/dp/"+asin
    print(url)
    page=requests.get(url,cookies=cookie,headers=header)
    if page.status_code==200:
        return page
    else:
        return "Error"

### Extracting data-asin Numbers from Amazon Search Results

This code extracts data-asin numbers, which are unique identifiers for Amazon products, from the first page of search results for "gadgets":

**Initialize List:** An empty list data_asin is created to store the data-asin numbers.

**Fetch Search Results:** The getAmazonSearch('gadgets') function retrieves the HTML content of the search results.

**Parse HTML:** BeautifulSoup is used to parse the page content.

**Extract data-asin:** The code uses soup.findAll to find all div elements whose class attribute exactly matches the full class string of a search result tile (a long string that includes 's-result-item s-asin') and appends their data-asin values to the data_asin list.

This method captures the unique product identifiers needed for further processing.

In [266]:
data_asin=[]
response=getAmazonSearch('gadgets')
soup=BeautifulSoup(response.content)
for i in soup.findAll("div",{'class':"sg-col-4-of-24 sg-col-4-of-12 s-result-item s-asin sg-col-4-of-16 sg-col s-widget-spacing-small sg-col-4-of-20 gsx-ies-anchor"}):
    data_asin.append(i['data-asin'])

https://www.amazon.com/s?k=gadgets
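The exact class string used above depends on the page layout at the time of scraping. A looser match, sketched here on hypothetical simplified markup, selects any div that carries both `s-result-item` and `s-asin` and has a `data-asin` attribute. Whether this selects the same set of tiles on a live page is an assumption that would need checking.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two product tiles and one non-product result item
sample_html = """
<div class="sg-col s-result-item s-asin" data-asin="B08N9Q24M9"></div>
<div class="s-result-item s-asin sg-col-4-of-24" data-asin="B09CX95397"></div>
<div class="s-result-item" data-asin=""></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
data_asin = [
    div["data-asin"]
    for div in soup.select("div.s-result-item.s-asin[data-asin]")
    if div["data-asin"]  # guard against empty attribute values
]
print(data_asin)
```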


In [267]:
response.status_code

200

In [268]:
data_asin

['B08N9Q24M9',
 'B09CX95397',
 'B0C69688VG',
 'B09XMMZSWW',
 'B08QRWDHVT',
 'B0CGQPSWWB',
 'B087RQZ3GH',
 'B015MS3ICQ',
 'B0C3H9MJQV',
 'B07TPLZY74',
 'B0CSTCYQB9',
 'B0CNZ4BB9P',
 'B0CG6953FL',
 'B08LVSFN4X',
 'B08R9PTZ5G',
 'B0BRXZV6R2',
 'B0CHMM4Y7L',
 'B07PZF3QS3',
 'B0BQ6WCZ9H',
 'B09P1DV7D8',
 'B09XJ1Z8CV',
 'B07H7ZDBV4',
 'B094MN8FT9',
 'B07PVBS863',
 'B0764HS4SL',
 'B0BTNT4225',
 'B09WF796VW',
 '0762462876',
 'B0BFZ84XVZ',
 'B0822G8NCF',
 'B07GCHLDWR',
 'B0C3L93F2Q',
 'B09P6HFCSP',
 'B09QKLHG3Y',
 'B09X9BGVQJ',
 'B0BJZ2PFCV',
 'B076CTTZKX',
 'B0B1VM1C6L',
 'B0BMBC1FJX',
 'B0C9ZGMN9D',
 'B0775GR18G',
 'B0C37DLHWW',
 'B0C6FDRRP3',
 'B08ZJQCZJ4',
 'B071X8S8WQ',
 'B0CJRTVWR2',
 'B0BRKPVZB4',
 'B07DPCCBZT']

In [269]:
len(data_asin)

48

### Extracting 'See All Reviews' Links

This code snippet retrieves 'See All Reviews' links for products:

**Initialize List:** Create an empty list link.

**Fetch Product Pages:** For each data-asin, use Searchasin to get the product page HTML.

**Parse HTML:** Use BeautifulSoup to parse the HTML.

**Extract Links:** Find a tags with data-hook="see-all-reviews-link-foot" and append their href attributes to the link list.

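The link list will contain the relative paths (href values) of the review pages. A product page can contain more than one matching link, so the same path may appear more than once.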

In [270]:
link=[]
for asin in data_asin:
    response=Searchasin(asin)
    soup=BeautifulSoup(response.content)
    for a in soup.findAll("a",{'data-hook':"see-all-reviews-link-foot"}):
        link.append(a['href']) # relative path to the product's review page

https://www.amazon.com/dp/B08N9Q24M9
https://www.amazon.com/dp/B09CX95397
https://www.amazon.com/dp/B0C69688VG
https://www.amazon.com/dp/B09XMMZSWW
https://www.amazon.com/dp/B08QRWDHVT
https://www.amazon.com/dp/B0CGQPSWWB
https://www.amazon.com/dp/B087RQZ3GH
https://www.amazon.com/dp/B015MS3ICQ
https://www.amazon.com/dp/B0C3H9MJQV
https://www.amazon.com/dp/B07TPLZY74
https://www.amazon.com/dp/B0CSTCYQB9
https://www.amazon.com/dp/B0CNZ4BB9P
https://www.amazon.com/dp/B0CG6953FL
https://www.amazon.com/dp/B08LVSFN4X
https://www.amazon.com/dp/B08R9PTZ5G
https://www.amazon.com/dp/B0BRXZV6R2
https://www.amazon.com/dp/B0CHMM4Y7L
https://www.amazon.com/dp/B07PZF3QS3
https://www.amazon.com/dp/B0BQ6WCZ9H
https://www.amazon.com/dp/B09P1DV7D8
https://www.amazon.com/dp/B09XJ1Z8CV
https://www.amazon.com/dp/B07H7ZDBV4
https://www.amazon.com/dp/B094MN8FT9
https://www.amazon.com/dp/B07PVBS863
https://www.amazon.com/dp/B0764HS4SL
https://www.amazon.com/dp/B0BTNT4225
https://www.amazon.com/dp/B09WF796VW
h

In [271]:
len(link)

92
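The count of 92 links for 48 ASINs suggests that some links were collected more than once, which would cause the same reviews to be fetched repeatedly. A small sketch for dropping repeats while keeping the original order; `unique_in_order` is a hypothetical helper, and the four sample entries are copied from the notebook's own output.

```python
def unique_in_order(items):
    """Drop repeated entries while keeping first-seen order."""
    # dict keys preserve insertion order, so the first occurrence of each item wins
    return list(dict.fromkeys(items))

link = [
    "/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "/Ecolution-Microwave-Micro-Pop-Borosilicate-Dishwasher/product-reviews/B09CX95397/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "/Ecolution-Microwave-Micro-Pop-Borosilicate-Dishwasher/product-reviews/B09CX95397/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
]

unique_links = unique_in_order(link)
print(len(link), len(unique_links))
```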

### With the 'See All Reviews' links and a page number, we can extract reviews from any number of pages for every product


In [272]:
link

['/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews',
 '/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews',
 '/Ecolution-Microwave-Micro-Pop-Borosilicate-Dishwasher/product-reviews/B09CX95397/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews',
 '/Ecolution-Microwave-Micro-Pop-Borosilicate-Dishwasher/product-reviews/B09CX95397/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews',
 '/Gaiatop-Portable-Handheld-Foldable-Rechargeable/product-reviews/B0C69688VG/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews',
 '/Gaiatop-Portable-Handheld-Foldable-Rechargeable/product-reviews/B0C69688VG/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews',
 '/Protector-Extender-Addtam-5-Outlet-Splitter/product-reviews/B09XMMZSWW/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews',
 '/Protector-

### Extracting Reviews from Multiple Pages

1. Initialize List: reviews = []
2. Fetch Reviews: For each link, request pageNumber values 0 through 10 (11 requests per link).
3. Parse and Extract: Use BeautifulSoup to extract and append review texts to reviews.
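The page URLs in step 2 can be generated ahead of the loop, which makes the page range easy to inspect. The sketch below assumes that the site's page numbering starts at 1, which the notebook does not confirm; `review_page_urls` is a hypothetical helper.

```python
def review_page_urls(review_link, pages):
    """Return full review-page URLs for page numbers 1..pages."""
    base = "https://www.amazon.com" + review_link
    # The review links already contain a query string, so "&" appends a parameter
    return [base + "&pageNumber=" + str(page) for page in range(1, pages + 1)]

sample_link = ("/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/"
               "B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews")

for url in review_page_urls(sample_link, 3):
    print(url)
```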

In [273]:
reviews=[]
for j in range(len(link)):
    for k in range(11):
        response=Searchreviews(link[j]+'&pageNumber='+str(k))
        soup=BeautifulSoup(response.content)
        for i in soup.findAll("span",{'data-hook':"review-body"}):
            reviews.append(i.text)

https://www.amazon.com/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=0
https://www.amazon.com/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=1
https://www.amazon.com/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=2
https://www.amazon.com/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=3
https://www.amazon.com/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=4
https://www.amazon.com/Mueller-Austria-Chopper-Vegetable-Container/product-reviews/B08N9Q24M9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageN

ConnectTimeout: HTTPSConnectionPool(host='www.amazon.com', port=443): Max retries exceeded with url: /Silicone-Utensil-Rest-Multiple-Heat-Resistant/product-reviews/B07PVBS863/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=9 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000002232FDDEBD0>, 'Connection to www.amazon.com timed out. (connect timeout=None)'))
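The error above reports `connect timeout=None`, meaning no timeout was set on the request; `requests.get` accepts a `timeout=` argument for that. Transient failures like this one are often handled by retrying with a growing delay. The following is a minimal sketch of that idea, not the notebook's method: `with_retries` is a hypothetical helper, and it retries on `OSError`, the base class of the exceptions raised by requests.

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
```

In the review loop, the call could be wrapped as `with_retries(lambda: Searchreviews(url))`, so that a single timeout does not discard the reviews collected so far.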

### Saving Reviews to CSV

**Count Reviews:** len(reviews) determines the number of reviews collected.

**Create DataFrame:** Convert the reviews list into a DataFrame using pd.DataFrame.from_dict.

**Set Column Width:** pd.set_option('max_colwidth', 800) raises the display limit so that up to 800 characters of each review are shown before the text is truncated.

**Save to CSV:** Write the DataFrame to a CSV file named 'Scraping gadgets reviews.csv'.
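The steps above can be sketched end to end on a small sample. Two choices here are assumptions rather than the notebook's behavior: the surrounding newlines are stripped from each review, and `index=False` is passed so that the row index is not written as an extra column (an index written to CSV comes back as an "Unnamed: 0" column when the file is read again). An in-memory buffer stands in for the file on disk, and the two sample reviews are shortened.

```python
import io
import pandas as pd

# Shortened sample standing in for the scraped reviews list
reviews = ["\nPerfect size. Dishwasher safe.\n", "\nHolds up to 4 utensils.\n"]

review_data = pd.DataFrame({"reviews": reviews})
review_data["reviews"] = review_data["reviews"].str.strip()  # drop leading/trailing newlines

buffer = io.StringIO()                    # stands in for 'Scraping gadgets reviews.csv'
review_data.to_csv(buffer, index=False)   # do not write the row index
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```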

In [274]:
len(reviews)

450

In [275]:
rev={'reviews':reviews}

In [276]:
review_data=pd.DataFrame.from_dict(rev)
pd.set_option('max_colwidth',800)

In [277]:
review_data.shape

(450, 1)

In [278]:
review_data.tail()

| | reviews |
|---|---|
| 445 | \nPerfect size. Dishwasher safe. Good quality. Nice color. Lightweight.\n |
| 446 | \nIt’s nice to have one place to hold multiple utensils. It’s big enough to hold my largest tools and has plenty of space to contain all the dripping messes. So much easier than cleaning the stove or counter a bunch of times.\n |
| 447 | \nI really like this. I normally have several pots and/or pans going on the stove at once, so I need more spaces for spoons. Those cute little ceramic one spoon holders look pretty but are not practical for me. Plus,this is easy to clean and you can't accidently break it!\n |
| 448 | \nI am using these to organize diamond art painting pens on my work surface. They are great except they slide down my drawing table even at a small tilt so I am using putty to hold them in place. I was hoping the silicone factor would help prevent creeping but it doesn't. I do like the units and may purchase more in the future.\n |
| 449 | \nIf you cook a lot you will love this utensil holder!! Holds up to 4 utensils, cleans easily and keeps my new stone counter clean.\n |


In [1]:
review_data.to_csv('Scraping gadgets reviews.csv')

NameError: name 'review_data' is not defined