<a href="https://colab.research.google.com/github/michaelfarayola7/Data-Science-ML-Projects/blob/main/Amazon_Generic_Auto_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Note: This code will run better on JupyterLab or Jupyter Notebook
<br>
We will build an **Automated Web Scraper** to extract data from **amazon.com** that we can use for any data analysis, data science, or machine learning project.

Before we get started, let me make it clear that Amazon has tight security on its platform, and some of the things you can easily do on other web pages won't work on Amazon.

Previously, we could have used **Beautiful Soup and Requests** to easily get titles from the page, but things have changed a little bit. We will still use **Beautiful Soup**, but in a different way.

Let's see how we can do things differently now.

# Installation

We will be using:
* **Selenium**
* **BeautifulSoup**
* **Webdrivers**

In [None]:
# !pip install selenium
# !pip install msedge-selenium-tools
# !pip install bs4

In [None]:
# !pip install chromedriver_binary==114.0.5735.90

In [None]:
from selenium import webdriver
import chromedriver_binary

# #for MSEdge
# from msedge.selenium_tools import Edge, EdgeOptions
# import csv

In [None]:
#for chrome
driver = webdriver.Chrome()

# #for Firefox
# driver = webdriver.Firefox()

# #for MSEdge
# options = EdgeOptions()
# options.use_chromium = True
# driver = Edge(options=options)

In [None]:
url = 'https://www.amazon.com/'
driver.get(url)

In [None]:
def my_url(keyword):
    temp = 'https://www.amazon.com/s?k={}&crid=OVLNC2PDJUBK&sprefix={}%2Caps%2C173&ref=nb_sb_noss_1'
    keyword = keyword.replace(' ','+')
    return temp.format(keyword,keyword)

In [None]:
url = my_url('phone bag')

In [None]:
url

'https://www.amazon.com/s?k=phone+bag&crid=OVLNC2PDJUBK&sprefix=phone+bag%2Caps%2C173&ref=nb_sb_noss_1'
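
The manual `replace(' ', '+')` step handles spaces but not other reserved characters such as `&`. As a minimal sketch, assuming only the `k` and `page` query parameters are needed (dropping `crid`, `sprefix`, and `ref` is an assumption, and `build_search_url` is a hypothetical helper name), the standard library's `urllib.parse.urlencode` can build the query string:

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.amazon.com/s'

def build_search_url(keyword, page=None):
    """Hypothetical helper: build a search URL with a safely encoded keyword."""
    params = {'k': keyword}
    if page is not None:
        params['page'] = page
    # urlencode turns spaces into '+' and percent-encodes reserved characters
    return BASE_URL + '?' + urlencode(params)

print(build_search_url('phone bag'))
print(build_search_url('pens & pencils', page=2))
```

The first call prints `https://www.amazon.com/s?k=phone+bag`, and the second encodes the ampersand as `%26` so it is not mistaken for a parameter separator.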

In [None]:
driver.get(url) #this will open in your browser and return the page for your keyword

The results page is fairly well structured, although a few records need special handling. What we want to do is extract data from each record. There are also multiple pages (e.g. 1-20 pages can be returned for a single keyword search).

We need to access the HTML of the page in order to extract the required data. We will create a Beautiful Soup object for this.

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source,'html.parser')

In [None]:
soup

<html class="a-js a-audio a-video a-canvas a-svg a-drag-drop a-geolocation a-history a-webworker a-autofocus a-input-placeholder a-textarea-placeholder a-local-storage a-gradients a-transform3d a-touch-scrolling a-text-shadow a-text-stroke a-box-shadow a-border-radius a-border-image a-opacity a-transform a-transition a-ember" data-19ax5a9jf="dingo" data-aui-build-date="3.23.1-2023-06-23" lang="en-us"><!-- sp:feature:head-start --><head><script async="" crossorigin="anonymous" src="https://images-na.ssl-images-amazon.com/images/I/31bJewCvY-L.js"></script><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>
<!-- sp:end-feature:head-start -->
<!-- sp:feature:csm:head-open-part1 -->
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<!-- sp:end-feature:csm:head-open-part1 -->
<!-- sp:feature:cs-optimization -->
<meta content="on" http-equiv="x-dns-prefetch-control"/>
<link crossorigin="" href="https://images-na.ssl-images-amazon.com" rel="dns-

In [None]:
soup_results=soup.find_all('div',{'data-component-type':'s-search-result'})

In [None]:
len(soup_results)

60

In [None]:
# we will assign our first result to an obj

obj=soup_results[0]

In [None]:
atag = obj.h2.a # the anchor tag inside the record's h2 heading

In [None]:
des = atag.text.strip()


In [None]:
des #we can see below that we have the title correctly scraped

'MoKo Floating Waterproof Phone Pouch - 9.5" Large Clear Phone Water Protector Pouch Dry Bag Case with Lanyard, Compatible for iPhone 14 13 12 Pro Max, Galaxy S23 S22 Ultra, 2 Pack Black+Mint Green'
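
To see the same lookup without opening a browser, here is a small sketch on made-up HTML. The markup below only mimics the shape of a results page and is not real Amazon HTML; it shows how `find_all` filters on the `data-component-type` attribute and how `h2.a` reaches the title link:

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the shape of two search results (not real Amazon markup)
html = '''
<div data-component-type="s-search-result">
  <h2><a href="/item-one">  First Item  </a></h2>
</div>
<div data-component-type="other-widget"><h2><a href="/ad">Ad</a></h2></div>
<div data-component-type="s-search-result">
  <h2><a href="/item-two">Second Item</a></h2>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Only divs whose data-component-type is exactly 's-search-result' are kept
results = soup.find_all('div', {'data-component-type': 's-search-result'})

# Tag.h2.a walks to the first <a> inside the first <h2> of each result
titles = [r.h2.a.text.strip() for r in results]
links = [r.h2.a.get('href') for r in results]
print(len(results), titles, links)
```

The `other-widget` div is skipped, so only two results come back.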

In [None]:
# build the full product URL from the relative href (the href already starts with '/')

url='https://www.amazon.com'+atag.get('href')

## Get the Price

In [None]:
# let's get the price the same way we found the title: by locating the tag that contains it

# the 'span' with class 'a-price' wraps the price, and its inner 'span' with class 'a-offscreen' holds the actual price text

parent=obj.find('span','a-price')

price=parent.find('span','a-offscreen').text

price

'$19.99'
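
Not every record has a price. `find` returns `None` when no matching tag exists, so chaining a second `.find` on that result raises `AttributeError`. A minimal sketch on made-up HTML (not real Amazon markup), using a hypothetical helper named `get_price`, shows one way to handle a missing price explicitly:

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second record has no price tag (not real Amazon markup)
html = '''
<div class="record"><span class="a-price"><span class="a-offscreen">$19.99</span></span></div>
<div class="record"><span class="a-size-base">Currently unavailable</span></div>
'''

def get_price(record):
    """Hypothetical helper: return the price text, or None if the record has no price."""
    parent = record.find('span', 'a-price')
    if parent is None:
        return None
    offscreen = parent.find('span', 'a-offscreen')
    return offscreen.text if offscreen is not None else None

soup = BeautifulSoup(html, 'html.parser')
prices = [get_price(r) for r in soup.find_all('div', 'record')]
print(prices)
```

Checking for `None` and catching `AttributeError` are interchangeable here; the explicit check just makes the missing-tag case visible.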

## Building an Automated Amazon Scraper Function

In [None]:
from selenium import webdriver
import chromedriver_binary  # adds the chromedriver binary to PATH
from bs4 import BeautifulSoup
import csv


def my_url(keyword):
    """Build a search URL that leaves a '{}' placeholder for the page number."""
    temp = 'https://www.amazon.com/s?k={}&crid=OVLNC2PDJUBK&sprefix={}%2Caps%2C173&ref=nb_sb_noss_1'
    keyword = keyword.replace(' ','+')
    url = temp.format(keyword,keyword)

    # the page number is filled in later with url.format(page)
    url += '&page={}'
    return url

def extract_records(obj):
    """Extract one search result as a tuple, or None if it has no price."""
    atag = obj.h2.a
    description = atag.text.strip()
    # the href already starts with '/', so the base URL has no trailing slash
    url = 'https://www.amazon.com' + atag.get('href')

    try:
        parent = obj.find('span', 'a-price')
        price = parent.find('span', 'a-offscreen').text
    except AttributeError:
        # skip records that have no price
        return

    try:
        rate = obj.find('span', 'a-icon-alt').text
        counts_review = obj.find('span',{'class':'a-size-base','dir':'auto'}).text
    except AttributeError:
        # rating or review count missing: keep the record with blank values
        rate = ' '
        counts_review = ' '

    image = obj.find('img',{'class':'s-image'}).get('src')

    result = (description, price, rate, counts_review, url, image)
    return result


def main(keyword):
    driver = webdriver.Chrome()

    records = []
    url = my_url(keyword)

    try:
        for page in range(1, 21):  # pages 1 to 20
            driver.get(url.format(page))
            soup = BeautifulSoup(driver.page_source,'html.parser')
            soup_results=soup.find_all('div',{'data-component-type':'s-search-result'})

            for item in soup_results:
                record = extract_records(item)
                if record:
                    records.append(record)
    finally:
        driver.quit()  # close the browser even if a page fails

    # write all records once, after the pages have been scraped
    with open('result.csv','w', newline='',encoding = 'utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Description', 'Price', 'Rate', 'Review Counts', 'URL', 'Image'])
        writer.writerows(records)

In [None]:
main('Laptop')

#### You can find the compiled CSV result in your current working directory. You can use `pwd` to find it.

In [None]:
pwd

'C:\\Users\\micha\\Desktop\\Data Science Data'

To view our scraped results, let's use pandas.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('result.csv')

In [None]:
df # This is our scraped data from the Amazon website

Unnamed: 0,Description,Price,Rate,Review Counts,URL,Image
0,"HP Newest 14"" Ultral Light Laptop for Students...",$299.00,,,https://www.amazon.com//sspa/click?ie=UTF8&spc...,https://m.media-amazon.com/images/I/91UBjD+vAf...
1,"HP 2022 New 15 Laptop, 15.6"" HD LED Display, I...",$479.00,,,https://www.amazon.com//sspa/click?ie=UTF8&spc...,https://m.media-amazon.com/images/I/71TK5dRc6i...
2,"jumper 16 Inch Laptop, 16GB RAM 512GB SSD, Qua...",$379.99,,,https://www.amazon.com//jumper-Quad-Core-1920x...,https://m.media-amazon.com/images/I/81v6g1MtqS...
3,"HP 2022 Stream 14"" HD BrightView Laptop, Intel...",$269.99,,,https://www.amazon.com//HP-BrightView-Celeron-...,https://m.media-amazon.com/images/I/61ZmZnBrFq...
4,"jumper Laptop 16 Inch FHD IPS Display (16:10),...",$279.95,,,https://www.amazon.com//jumper-Laptop-Inch-FHD...,https://m.media-amazon.com/images/I/81d2r5VCPk...
...,...,...,...,...,...,...
413,"HP 17.3"" Flagship HD+ Business Laptop, 16GB DD...",$569.00,,,https://www.amazon.com//HP-i3-1125G4-i5-1035G4...,https://m.media-amazon.com/images/I/61DF0nkzgx...
414,"SGIN Laptop 15.6 Inch 4GB DDR4 128GB SSD, Wind...",$259.98,,,https://www.amazon.com//SGIN-Windows-Celeron-E...,https://m.media-amazon.com/images/I/71Qp4khq-1...
415,"Acer 2022 15inch HD IPS Chromebook, Intel Dual...",$162.00,,,https://www.amazon.com//Acer-15inch-Chromebook...,https://m.media-amazon.com/images/I/71-QYvMpwG...
416,"Acer 2023 15"" HD Premium Chromebook, Intel Cel...",$161.99,,,https://www.amazon.com//Acer-15-Chromebook-Pro...,https://m.media-amazon.com/images/I/61wvt5OL8C...


In [None]:
df.head(50)

Unnamed: 0,Description,Price,Rate,Review Counts,URL,Image
0,"HP Newest 14"" Ultral Light Laptop for Students...",$299.00,,,https://www.amazon.com//sspa/click?ie=UTF8&spc...,https://m.media-amazon.com/images/I/91UBjD+vAf...
1,"HP 2022 New 15 Laptop, 15.6"" HD LED Display, I...",$479.00,,,https://www.amazon.com//sspa/click?ie=UTF8&spc...,https://m.media-amazon.com/images/I/71TK5dRc6i...
2,"jumper 16 Inch Laptop, 16GB RAM 512GB SSD, Qua...",$379.99,,,https://www.amazon.com//jumper-Quad-Core-1920x...,https://m.media-amazon.com/images/I/81v6g1MtqS...
3,"HP 2022 Stream 14"" HD BrightView Laptop, Intel...",$269.99,,,https://www.amazon.com//HP-BrightView-Celeron-...,https://m.media-amazon.com/images/I/61ZmZnBrFq...
4,"jumper Laptop 16 Inch FHD IPS Display (16:10),...",$279.95,,,https://www.amazon.com//jumper-Laptop-Inch-FHD...,https://m.media-amazon.com/images/I/81d2r5VCPk...
5,HP ChromeBook 11 G4 11.6 Inch Business Noteboo...,$57.00,,,https://www.amazon.com//HP-ChromeBook-11-Noteb...,https://m.media-amazon.com/images/I/61B-1rox81...
6,Lenovo - 2022 - IdeaPad Flex 5i - 2-in-1 Chrom...,$373.99,,,https://www.amazon.com//sspa/click?ie=UTF8&spc...,https://m.media-amazon.com/images/I/81uDDuBAwp...
7,Dell Latitude 3189 2-in-1 11.6 inch HD Touchsc...,$119.00,,,https://www.amazon.com//Dell-Latitude-3189-Tou...,https://m.media-amazon.com/images/I/41eg0ErloR...
8,"HP Newest 14"" Ultral Light Laptop for Students...",$299.99,,,https://www.amazon.com//HP-Students-Business-Q...,https://m.media-amazon.com/images/I/81divYKpeT...
9,Acer Aspire 3 A315-24P-R7VH Slim Laptop | 15.6...,$329.99,,,https://www.amazon.com//A315-24P-R7VH-Display-...,https://m.media-amazon.com/images/I/71+1lOl1Y1...


# From here, any analysis can be done on the scraped data
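
As one possible first step, the `Price` column is stored as text (for example `$299.00`), so it has to be converted to numbers before any arithmetic. The sketch below rebuilds three rows from the output above by hand instead of reading `result.csv`, and `PriceUSD` is a hypothetical column name chosen for illustration:

```python
import pandas as pd

# Three rows copied from the output above; only the columns needed here
df = pd.DataFrame({
    'Description': ['HP Newest 14" Ultral Light Laptop for Students...',
                    'HP 2022 New 15 Laptop, 15.6" HD LED Display, I...',
                    'jumper 16 Inch Laptop, 16GB RAM 512GB SSD, Qua...'],
    'Price': ['$299.00', '$479.00', '$379.99'],
})

# Strip '$' and thousands separators, then convert the text to numbers
df['PriceUSD'] = pd.to_numeric(df['Price'].str.replace(r'[$,]', '', regex=True))

# With a numeric column, sorting and summary statistics work as expected
cheapest = df.sort_values('PriceUSD').iloc[0]
print(df['PriceUSD'].describe())
print(cheapest['Description'], cheapest['PriceUSD'])
```

The same conversion should apply to the full `df` loaded from `result.csv`, assuming every price follows the `$1,234.56` pattern.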