## Building a web scraper that extracts product search results from amazon.com

In this notebook, I build a web scraper for amazon.com. Given a search term, the scraper extracts records from every page of the search results and stores them in a CSV file. As an example, we will use "bike lock" as the search term.

### Installing and importing the packages/libraries

In [1]:
pip install selenium

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install bs4

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.





In [3]:
pip install requests

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.





In [4]:
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import requests

In [5]:
# writing a function that will generate a url based on the search terms provided

def get_url(search_term):
    """Generate a url from search term"""
    template = 'https://www.amazon.com/s?k={}&crid=2LEXIUFP0SED4&sprefix={}%2Caps%2C148&ref=nb_sb_noss_2'
    search_term = search_term.replace(' ', '+')
    return template.format(search_term, search_term)     #since the search term appears twice in the url when tried on amazon.com

In [6]:
# Creating a url for the bike lock search

url = get_url("bike lock")
print(url)

https://www.amazon.com/s?k=bike+lock&crid=2LEXIUFP0SED4&sprefix=bike+lock%2Caps%2C148&ref=nb_sb_noss_2
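
The `replace(' ', '+')` step in `get_url` only handles spaces. As a minimal sketch, assuming the search term may also contain characters such as `&` or `/`, the standard library's `urllib.parse.quote_plus` can encode the whole term. `encode_search_term` is a hypothetical helper name and is not used elsewhere in this notebook.

```python
from urllib.parse import quote_plus

def encode_search_term(search_term):
    """Encode a search term for use in a URL query string (hypothetical helper)."""
    # quote_plus turns spaces into '+' and percent-encodes reserved characters
    return quote_plus(search_term)

print(encode_search_term("bike lock"))
print(encode_search_term("lock & chain"))
```

For plain terms such as "bike lock" this gives the same result as the `replace` call, so it could be dropped into `get_url` without changing the URL shown above.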


In [7]:
#get your user agent from 'whatismybrowser.com' / Detect my settings / What is my user agent.
#copy and paste the user agent in the headers code below

HEADERS = ({'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36', 'Accept-Language':'en-US, en;q=0.5'})

In [8]:
# Sending an HTTP GET request for the search results page
webpage = requests.get(url, headers=HEADERS)

In [9]:
webpage

<Response [200]>

In [10]:
webpage.content

b'<!doctype html><html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n<!-- sp:feature:csm:head-open-part1 -->\n\n<script type=\'text/javascript\'>var ue_t0=ue_t0||+new Date();</script>\n<!-- sp:end-feature:csm:head-open-part1 -->\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-na.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completion.amazon.com">\n<!-- sp:end-feature:cs-optimization -->\n<!-- sp:feature:csm:head-open-part2 -->\n<script type=\'text/javascript\'>\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\n\nvar ue_csm = window,\n    ue_hob = +new Date();\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function()

In [11]:
# this content is in bytes format. 
type(webpage.content)


bytes

### Extracting the content of the page from the html in the background

In [12]:
#We want to parse the raw bytes as HTML using BeautifulSoup
#creating a soup object, which parses the HTML content of the page source
soup = BeautifulSoup(webpage.content, 'html.parser')
soup

<!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo" lang="en-us"><!-- sp:feature:head-start -->
<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>
<!-- sp:end-feature:head-start -->
<!-- sp:feature:csm:head-open-part1 -->
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<!-- sp:end-feature:csm:head-open-part1 -->
<!-- sp:feature:cs-optimization -->
<meta content="on" http-equiv="x-dns-prefetch-control"/>
<link href="https://images-na.ssl-images-amazon.com" rel="dns-prefetch"/>
<link href="https://m.media-amazon.com" rel="dns-prefetch"/>
<link href="https://completion.amazon.com" rel="dns-prefetch"/>
<!-- sp:end-feature:cs-optimization -->
<!-- sp:feature:csm:head-open-part2 -->
<script type="text/javascript">
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=func

In [13]:
results = soup.find_all('div', {'data-component-type': 's-search-result'})
results

[<div class="sg-col-4-of-24 sg-col-4-of-12 s-result-item s-asin sg-col-4-of-16 AdHolder sg-col s-widget-spacing-small sg-col-4-of-20 gsx-ies-anchor" data-asin="B07T3F6JST" data-component-type="s-search-result" data-index="2" data-uuid="42ff5fe0-70df-4c84-84a4-075a305384be"><div class="sg-col-inner"><div cel_widget_id="MAIN-SEARCH_RESULTS-2" class="s-widget-container s-spacing-small s-widget-container-height-small celwidget slot=MAIN template=SEARCH_RESULTS widgetId=search-results_1" data-csa-c-item-id="amzn1.asin.1.B07T3F6JST" data-csa-c-pos="1" data-csa-c-type="item" data-csa-op-log-render="">
 <div class="rush-component s-expand-height" data-component-props='{"percentageShownToFire":"50","batchable":true,"requiredElementSelector":".s-image:visible","url":"https://unagi-na.amazon.com/1/events/com.amazon.eel.SponsoredProductsEventTracking.prod?qualifier=1717874385&amp;id=1516289367278226&amp;widgetName=sp_atf&amp;adId=200022823637231&amp;eventType=1&amp;adIndex=0"}' data-component-type

In [14]:
len(results)

60

### Prototyping the extraction of a single record

In [15]:
item = results[0]

In [16]:
atag = item.h2.a

In [17]:
description = atag.text.strip() 
description

'Amazon Basics 6 ft. Adjustable Bike Cable Key Lock, Black, 1-Pack'

In [18]:
url = 'https://www.amazon.com' + atag.get('href')
url

'https://www.amazon.com/sspa/click?ie=UTF8&spc=MToxNTE2Mjg5MzY3Mjc4MjI2OjE3MTc4NzQzODU6c3BfYXRmOjIwMDAyMjgyMzYzNzIzMTo6MDo6&url=%2FAmazonBasics-Adjustable-Keyed-Cable-1-Pack%2Fdp%2FB07T3F6JST%2Fref%3Dsr_1_1_ffob_sspa%3Fcrid%3D2LEXIUFP0SED4%26dib%3DeyJ2IjoiMSJ9.cRbrbgWYMEO9qTlKHYrfzrNolNMY90QL9DzZ1pmpbpZSG5Ep2Zyc80hPEbE34EcB6Y88LMRg1Te6IXdD7xnlZbC7zVSZM_Is-nkxumq2jooUydel-EPFKJrnxmeqJO-IHuQe8BBOVfBslJMwLzN8HLw4EbM86hVRk5odgtREelGze52xIv0Ybt3EakOW0Bn5WH6tI4IogS-bRpEN0V_mHiM5K3i7RlivtOweF1xZdvJUUNmr82jUw43S9e_JyiIAbxir4-ekQi1KNlnm-ehFDb0TPxBnOmFU2IYs3gmKAyU.HYI-J7nb8haLKbKnkH6vMyU7VTfANah2hk21tC7SzoE%26dib_tag%3Dse%26keywords%3Dbike%2Block%26qid%3D1717874385%26sprefix%3Dbike%2Block%252Caps%252C148%26sr%3D8-1-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1'
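
The first result is a sponsored listing, so its link is a `/sspa/click` redirect that carries the product path in a percent-encoded `url` query parameter, as the output above shows. The following is a minimal sketch, assuming that sponsored links always have this shape, of how the product URL could be recovered with `urllib.parse`. `unwrap_sponsored_url` is a hypothetical helper, and the sample link is a shortened version of the one above.

```python
from urllib.parse import urlsplit, parse_qs

def unwrap_sponsored_url(url, base='https://www.amazon.com'):
    """Return the product URL hidden in a sponsored click link (hypothetical helper)."""
    parts = urlsplit(url)
    # Only links whose path is /sspa/click are treated as sponsored redirects
    if parts.path != '/sspa/click':
        return url
    # parse_qs percent-decodes the values, so 'url' comes back as a plain path
    target = parse_qs(parts.query).get('url')
    return base + target[0] if target else url

# Shortened sample link (assumption: same structure as the full link above)
sponsored = ('https://www.amazon.com/sspa/click?ie=UTF8'
             '&url=%2FAmazonBasics-Adjustable-Keyed-Cable-1-Pack%2Fdp%2FB07T3F6JST%2Fref%3Dsr_1_1')
print(unwrap_sponsored_url(sponsored))
```

Links that are not sponsored redirects are returned unchanged.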

In [19]:
# getting the price
price_parent = item.find('span', 'a-price')
price = price_parent.find('span', 'a-offscreen').text

In [20]:
# getting the ratings. Inspecting the stars shows that ratings are in the i tag

rating = item.i.text

In [21]:
#getting the reviews

item.find('span', {'class':'a-size-base s-underline-text'}).text

'2,487'

### Generalizing the extraction pattern into a function that can be applied to all the records on a page

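Here I add some error handling. The prototype above assumes that every record contains all five attributes (description, price, rating, review count, and URL), but that is not always the case. When an attribute is missing, `find` returns `None`, and accessing `.text` on it raises an `AttributeError`, so this error needs to be accommodated. The function below skips records that have no price and stores an empty string for a missing rating or review count.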
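
As a small illustration of why `AttributeError` is the exception to catch, the sketch below wraps an extraction step in a helper that falls back to a default value. `safe_extract` is a hypothetical name, and `SimpleNamespace` and `None` stand in for what BeautifulSoup returns when a tag is found or not found.

```python
from types import SimpleNamespace

def safe_extract(getter, default=''):
    """Call getter() and fall back to default if an attribute is missing."""
    try:
        return getter()
    except AttributeError:
        return default

# Stand-ins for BeautifulSoup results: find() returns None when nothing matches
found = SimpleNamespace(text=' $12.59 ')
missing = None

price = safe_extract(lambda: found.text.strip())
rating = safe_extract(lambda: missing.text.strip())
print(repr(price), repr(rating))
```

The `try`/`except` blocks in the function below apply the same idea inline.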

In [22]:
def extract_record(item):
    """Extract and return data from a single record"""
   
    # description and url
    atag = item.h2.a
    description = atag.text.strip()
    url = 'https://www.amazon.com' + atag.get('href')
    
    try:
        #price
        price_parent = item.find('span', 'a-price')
        price = price_parent.find('span', 'a-offscreen').text
    except AttributeError:
        return
    
    try:
        #rating & review(rank)
        rating = item.i.text
        review_count = item.find('span', {'class':'a-size-base s-underline-text'}).text
    except AttributeError:
        rating = ''
        review_count = ''
    
    result = (description, price, rating, review_count, url)
    
    return result

In [23]:
# apply the pattern to all records on the page 

records =[]
results = soup.find_all('div', {'data-component-type': 's-search-result'})

for item in results:
    record = extract_record(item)
    if record:   #ie, if record has something in it
        records.append(record)

In [24]:
# viewing the first record
records[0]

('Amazon Basics 6 ft. Adjustable Bike Cable Key Lock, Black, 1-Pack',
 '$12.59',
 '4.6 out of 5 stars',
 '2,487',
 'https://www.amazon.com/sspa/click?ie=UTF8&spc=MToxNTE2Mjg5MzY3Mjc4MjI2OjE3MTc4NzQzODU6c3BfYXRmOjIwMDAyMjgyMzYzNzIzMTo6MDo6&url=%2FAmazonBasics-Adjustable-Keyed-Cable-1-Pack%2Fdp%2FB07T3F6JST%2Fref%3Dsr_1_1_ffob_sspa%3Fcrid%3D2LEXIUFP0SED4%26dib%3DeyJ2IjoiMSJ9.cRbrbgWYMEO9qTlKHYrfzrNolNMY90QL9DzZ1pmpbpZSG5Ep2Zyc80hPEbE34EcB6Y88LMRg1Te6IXdD7xnlZbC7zVSZM_Is-nkxumq2jooUydel-EPFKJrnxmeqJO-IHuQe8BBOVfBslJMwLzN8HLw4EbM86hVRk5odgtREelGze52xIv0Ybt3EakOW0Bn5WH6tI4IogS-bRpEN0V_mHiM5K3i7RlivtOweF1xZdvJUUNmr82jUw43S9e_JyiIAbxir4-ekQi1KNlnm-ehFDb0TPxBnOmFU2IYs3gmKAyU.HYI-J7nb8haLKbKnkH6vMyU7VTfANah2hk21tC7SzoE%26dib_tag%3Dse%26keywords%3Dbike%2Block%26qid%3D1717874385%26sprefix%3Dbike%2Block%252Caps%252C148%26sr%3D8-1-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1')
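
All fields in the record are strings. If numeric values are needed later, for example to sort by price, they can be parsed. The following is a minimal sketch, assuming the formats seen in this record (`'$12.59'`, `'4.6 out of 5 stars'`, `'2,487'`). The three function names are hypothetical, and empty strings are mapped to `None`.

```python
def parse_price(price_text):
    """'$12.59' -> 12.59 (assumes a leading '$' and optional thousands commas)."""
    if not price_text:
        return None
    return float(price_text.replace('$', '').replace(',', ''))

def parse_rating(rating_text):
    """'4.6 out of 5 stars' -> 4.6 (assumes the number comes first)."""
    if not rating_text:
        return None
    return float(rating_text.split()[0])

def parse_review_count(count_text):
    """'2,487' -> 2487 (assumes commas as thousands separators)."""
    if not count_text:
        return None
    return int(count_text.replace(',', ''))

print(parse_price('$12.59'), parse_rating('4.6 out of 5 stars'), parse_review_count('2,487'))
```

Other formats, such as price ranges or other currencies, would need additional handling.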

### Getting the next page

On the Amazon results page, if you click 'Next' and observe the URL of the next page, you will notice a `page` query parameter. We can add this parameter to the URL using string formatting. Amazon search gives a maximum of 20 pages.

In [25]:
def get_url(search_term):
    """Generate a url from search term"""
    template = 'https://www.amazon.com/s?k={}&crid=2LEXIUFP0SED4&sprefix={}%2Caps%2C148&ref=nb_sb_noss_2'
    search_term = search_term.replace(' ', '+')
    
    # add term query to url
    url = template.format(search_term, search_term)
    
    #add page query placeholder
    url += '&page={}'  #this gives a place to insert the next page number (with string formatting)
    
    return url
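
An alternative to string formatting is to build the query string from a dictionary. The following is a minimal sketch, assuming that the `k` and `page` parameters alone are enough for the search (the `crid`, `sprefix`, and `ref` parameters from the template are left out, which has not been verified against the site). `build_page_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def build_page_url(search_term, page):
    """Build a search URL for one results page (hypothetical helper)."""
    # urlencode encodes spaces as '+' and joins the parameters with '&'
    query = urlencode({'k': search_term, 'page': page})
    return 'https://www.amazon.com/s?' + query

for page in range(1, 4):
    print(build_page_url('bike lock', page))
```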

### Putting it all together

In [27]:
import csv
from bs4 import BeautifulSoup
#from selenium import webdriver
import requests


def get_url(search_term):
    """Generate a url from search term"""
    template = 'https://www.amazon.com/s?k={}&crid=2LEXIUFP0SED4&sprefix={}%2Caps%2C148&ref=nb_sb_noss_2'
    search_term = search_term.replace(' ', '+')
    
    # add term query to url
    url = template.format(search_term, search_term)
    
    #add page query placeholder
    url += '&page={}'  #this gives a place to insert the next page number (with string formatting)
    
    return url

def extract_record(item):
    """Extract and return data from a single record"""
   
    # description and url
    atag = item.h2.a
    description = atag.text.strip()
    url = 'https://www.amazon.com' + atag.get('href')
    
    try:
        #price
        price_parent = item.find('span', 'a-price')
        price = price_parent.find('span', 'a-offscreen').text
    except AttributeError:
        return
    
    try:
        #rating & review count
        rating = item.i.text
        review_count = item.find('span', {'class':'a-size-base s-underline-text'}).text
    except AttributeError:
        rating = ''
        review_count = ''
    
    result = (description, price, rating, review_count, url)
    
    return result

def main(search_term):
    """Run main program routine"""
    #startup the webpage
    HEADERS = ({'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36', 'Accept-Language':'en-US, en;q=0.5'})
    
    records = []
    url = get_url(search_term)
    
    for page in range(1, 21):
        #insert the current page number into the url placeholder before requesting it
        webpage = requests.get(url.format(page), headers=HEADERS)
        soup = BeautifulSoup(webpage.content, 'html.parser')
        results = soup.find_all('div', {'data-component-type': 's-search-result'})
        
        for item in results:
            record = extract_record(item)
            if record:
                records.append(record)
                
    webpage.close()
    
    #save data to csv
    with open('results.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Description', 'Price', 'Rating', 'ReviewCount', 'url'])
        writer.writerows(records)
    

In [28]:
main('bike lock')
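
Not every search fills all 20 pages, and requesting pages back to back puts avoidable load on the site. The following is a minimal sketch, assuming that an empty page means there are no further results, of a paging loop that stops early and pauses between requests. `collect_records`, `fetch_page`, and `fake_fetch` are hypothetical names. In this notebook, `fetch_page` would wrap the `requests.get`, `BeautifulSoup`, and `extract_record` steps for one page.

```python
import time

def collect_records(fetch_page, max_pages=20, delay=1.0):
    """Collect records page by page, stopping at the first empty page.

    fetch_page is any callable that takes a page number and returns a list
    of records (hypothetical interface, not part of the notebook).
    """
    records = []
    for page in range(1, max_pages + 1):
        page_records = fetch_page(page)
        if not page_records:
            break  # no results on this page, so assume there are no further pages
        records.extend(page_records)
        time.sleep(delay)  # pause between requests
    return records

# Stand-in for a real fetcher: pretends the search has only three pages
def fake_fetch(page):
    return [('item', page)] if page <= 3 else []

print(len(collect_records(fake_fetch, delay=0)))
```

An empty page can also mean the request was blocked, so a real fetcher may want to check the response status before returning an empty list.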