## AMAZON WEB SCRAPING USING PYTHON

![image](https://miro.medium.com/max/1400/1*kr-TlPT8c7_ZR4iC9QILWg.png)

## 1. Introduction

### 1.1 Overview

Web scraping is an automated way to obtain large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications. There are several ways to scrape a website: using online services, using a site's API, or writing your own scraping code from scratch. Many large websites, such as Google, Twitter, Facebook, and StackOverflow, have APIs that let you access their data in a structured format. When an API is available, it is the best option. Other sites either don't allow users to access large amounts of data in a structured form or are simply not that technologically advanced. In that situation, scraping the website directly is the practical way to get the data.

In this project we will extract attributes of the best-seller books. We will then collect a single product's attributes and create an email alert for when the price of the item goes down.

### 1.2 Problem Statement

Businesses need to track competitor prices, monitor costs in real time, and follow seasonal shifts in order to provide consumers with better product offers. Web scraping lets you extract the relevant data from the Amazon website and save it in a spreadsheet or JSON format. You can also automate the process to update the data on a regular basis, for example weekly or monthly.
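
As a rough illustration of the "spreadsheet or JSON" output mentioned above, the sketch below writes the same records to both formats using only the standard library. The sample records, field names, and the helpers `to_json` and `to_csv` are made up for the example; they are not scraped data or part of the notebook.

```python
import csv
import io
import json

# Hypothetical records standing in for scraped rows (not real Amazon data).
records = [
    {"title": "Example Book A", "price": 199.0},
    {"title": "Example Book B", "price": 349.5},
]

def to_json(rows):
    """Serialize rows as a JSON array; ensure_ascii=False keeps symbols like ₹ readable."""
    return json.dumps(rows, ensure_ascii=False, indent=2)

def to_csv(rows):
    """Serialize rows as CSV text, taking the header from the first row's keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

json_text = to_json(records)
csv_text = to_csv(records)
```

JSON keeps nested or optional fields easily, while CSV opens directly in a spreadsheet; which one fits better depends on how the data will be used afterwards.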



## 2. Loading Dependencies

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import re
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

## 3. Defining base URL and data pulling

We will extract the attributes of the best-seller books from https://www.amazon.in/gp/bestsellers/books/. The data to be collected is:

* Book Name
* Author
* Rating
* Customers Rated
* Price



Python has a library called BeautifulSoup that makes web scraping straightforward. We will use it to scrape product information and save the details in a CSV file.
In the following function we declare request headers and add a user agent, so that the target website is less likely to treat traffic from our program as spam and block it.
To pinpoint our target elements, we take their element classes from the page source and pass them to the script.

In [50]:
no_pages = 2

def get_data(pageNo):  
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

    r = requests.get('https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo), headers=headers)#, proxies=proxies)
    content = r.content
    soup1 = BeautifulSoup(content,"html.parser")
    soup = BeautifulSoup(soup1.prettify(), "html.parser")
    #print(soup)

    alls = []
    for d in soup.find_all('div', attrs={'class':'a-cardui _cDEzb_grid-cell_1uMOS p13n-grid-content'}):
        # The book name is read from the alt text of the image inside the title element
        name = d.find('span', attrs={'class':'_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y'})
        n = name.find_all('img', alt=True) if name is not None else []
        author = d.find('div', attrs={'class':'_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y'})
        rating = d.find('a', attrs={'class':'a-link-normal'})
        users_rated = d.find('span', attrs={'class':'a-size-small'})
        # The price sits in a span such as <span class="a-offscreen">₹4,049.00</span>
        price = d.find('span', attrs={'class':'a-offscreen'})

        all1=[]

        # Fall back to a placeholder when the title element or its image is missing
        if name is not None and n:
            all1.append(n[0]['alt'])
        else:
            all1.append("unknown-product")

        if author is not None:
            #print(author.text)
            all1.append(author.text)
        elif author is None:
            author = d.find('span', attrs={'class':'a-size-small a-color-base'})
            if author is not None:
                all1.append(author.text)
            else:    
                all1.append('0')

        if rating is not None:
            #print(rating.text)
            all1.append(rating.text)
        else:
            all1.append('-1')

        if users_rated is not None:
            #print(price.text)
            all1.append(users_rated.text)
        else:
            all1.append('0')     

        if price is not None:
            #print(price.text)
            all1.append(price.text)
        else:
            all1.append('0')
        alls.append(all1)    
    return alls
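
The class-based lookups in `get_data` can be tried offline before hitting the live site. The sketch below runs the same kind of `find_all` and `find` calls on a small hand-written HTML snippet. The class names (`card`, `title`, `price`) and the helper `extract_rows` are invented for the example; they are not Amazon's real class names, which change over time.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a listing page (hypothetical class names).
SAMPLE_HTML = """
<div class="card"><span class="title">Book A</span><span class="price">₹199.00</span></div>
<div class="card"><span class="title">Book B</span></div>
"""

def extract_rows(html):
    """Return [title, price] rows, using a placeholder when an element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="card"):
        title = card.find("span", class_="title")
        price = card.find("span", class_="price")
        rows.append([
            title.get_text(strip=True) if title is not None else "unknown-product",
            price.get_text(strip=True) if price is not None else "0",
        ])
    return rows

rows = extract_rows(SAMPLE_HTML)
```

Testing against a saved snippet like this makes it easier to see whether a failure comes from the selectors or from the request being blocked.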

In [51]:
results = []
for i in range(1, no_pages + 1):
    results.extend(get_data(i))  # get_data returns one list of rows per page
df = pd.DataFrame(results, columns=['Book Name', 'Author', 'Rating', 'Customers_Rated', 'Price'])
df.to_csv('amazon_products.csv', index=False, encoding='utf-8')

In [52]:
df = pd.read_csv("amazon_products.csv")
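
The scraped columns are still raw text, so they usually need converting before any analysis. The sketch below assumes ratings look like `4.5 out of 5 stars` and that counts and prices may carry thousands separators and a currency symbol. Those formats are assumptions about the page, not something the notebook output confirms, and `first_number` is a hypothetical helper.

```python
import re

def first_number(text, default=None):
    """Return the first number found in text as a float, ignoring thousands separators."""
    if text is None:
        return default
    match = re.search(r"\d[\d,]*(?:\.\d+)?", str(text))
    if match is None:
        return default
    return float(match.group(0).replace(",", ""))

# Example inputs in the assumed formats (not real scraped values).
rating = first_number("4.5 out of 5 stars")
customers = first_number("12,345")
price = first_number("₹199.00")
missing = first_number("", default=0.0)
```

If the assumed formats hold, a column could be converted with something like `df['Rating'].apply(first_number)`.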


**Single item scraping**

We will also scrape a single product's attributes, which can be useful to different businesses.

In [67]:
# Connect to Website and pull in data

URL = 'https://www.amazon.in/CrystalTech-Mitsubishi-Lexus-450-Multi-Layered/dp/B09ZHP54M3/ref=sr_1_1_sspa?crid=3KBXBTUB6P2KT&keywords=lexus&qid=1654459917&s=automotive&sprefix=lex%2Cautomotive%2C268&sr=1-1-spons&vehicle=Lexus%3ALX&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExMUlVMlIyVlBIVzhXJmVuY3J5cHRlZElkPUEwNTM3MjQ0MUozVk40Rk9PUDRRSCZlbmNyeXB0ZWRBZElkPUExMDA5MTk1M040WkhHTFBWMEtHSiZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU='

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

page = requests.get(URL, headers=headers)

soup1 = BeautifulSoup(page.content, "html.parser")

soup2 = BeautifulSoup(soup1.prettify(), "html.parser")

title = soup2.find(id='productTitle').get_text()

# Keep the decimal point; only strip the currency symbol and thousands separators
price = float(soup2.find('span', attrs={'class':'a-offscreen'}).get_text().replace('₹', '').replace(',', '').strip())

print(title)
print(price)


     NEODRIFT 'CrystalTech' Silver Car Body Cover for Mitsubishi Lexus 450 (100% Water Resistant, Tailored Fit, All-Weather Protection, Multi-Layered & Breathable Fabric)
    
4049.0


In [68]:
# Create a Timestamp for your output to track when data was collected

import datetime

today = datetime.date.today()

print(today)

2022-06-05


In [69]:
# Create CSV and write headers and data into the file

import csv 

header = ['Title', 'Price', 'Date']
data = [title, price, today]


with open('AmazonWebScraperDataset.csv', 'w', newline='', encoding='UTF8') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerow(data)

In [70]:
import pandas as pd

df = pd.read_csv(r'AmazonWebScraperDataset.csv')

print(df)

                                               Title   Price        Date
0  \n     NEODRIFT 'CrystalTech' Silver Car Body ...  4049.0  2022-06-05


Next we combine the code above into a function so that we can run it to check the Amazon price every once in a while. It will then be used to create email alerts, so that I know when I can afford that cover.

In [72]:
# Scrapes the current title and price and appends them to your CSV
def check_price():
    URL = 'https://www.amazon.in/CrystalTech-Mitsubishi-Lexus-450-Multi-Layered/dp/B09ZHP54M3/ref=sr_1_1_sspa?crid=3KBXBTUB6P2KT&keywords=lexus&qid=1654459917&s=automotive&sprefix=lex%2Cautomotive%2C268&sr=1-1-spons&vehicle=Lexus%3ALX&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExMUlVMlIyVlBIVzhXJmVuY3J5cHRlZElkPUEwNTM3MjQ0MUozVk40Rk9PUDRRSCZlbmNyeXB0ZWRBZElkPUExMDA5MTk1M040WkhHTFBWMEtHSiZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU='

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

    page = requests.get(URL, headers=headers)

    soup1 = BeautifulSoup(page.content, "html.parser")

    soup2 = BeautifulSoup(soup1.prettify(), "html.parser")

    title = soup2.find(id='productTitle').get_text()
    

    # Keep the decimal point; only strip the currency symbol and thousands separators
    price = float(soup2.find('span', attrs={'class':'a-offscreen'}).get_text().replace('₹', '').replace(',', '').strip())
    

    import datetime

    today = datetime.date.today()
    
    import csv 

    header = ['Title', 'Price', 'Date']
    data = [title, price, today]

    with open('AmazonWebScraperDataset.csv', 'a+', newline='', encoding='UTF8') as f:
        writer = csv.writer(f)
        writer.writerow(data)
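
Because `check_price` only appends rows, it relies on the CSV already existing with a header. A minimal sketch of an append that writes the header itself when the file is new or empty is shown below. `append_price_row` and `HEADER` are hypothetical names, and the demonstration writes to a temporary file with made-up values rather than the real dataset.

```python
import csv
import datetime
import os
import tempfile

HEADER = ["Title", "Price", "Date"]

def append_price_row(path, title, price, date=None):
    """Append one observation, writing the header first if the file is new or empty."""
    date = date or datetime.date.today()
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if needs_header:
            writer.writerow(HEADER)
        # Strip the whitespace that the scraped title carries around it
        writer.writerow([title.strip(), price, date])

# Demonstration against a temporary file with made-up values.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_prices.csv")
append_price_row(demo_path, "  Example Cover ", 4049.0, datetime.date(2022, 6, 5))
append_price_row(demo_path, "Example Cover", 3999.0, datetime.date(2022, 6, 6))

with open(demo_path, newline="", encoding="utf-8") as f:
    saved_rows = list(csv.reader(f))
```

With this approach the function can be run on a fresh machine without first creating the file by hand.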
 


In [None]:
# Runs check_price once a day (86400 seconds) and appends the data to your CSV

while True:
    check_price()
    time.sleep(86400)

In [None]:
# If you want to send yourself an email (just for fun) when the price drops below a certain level,
# you can try this script

import smtplib

def send_mail():
    server = smtplib.SMTP_SSL('smtp.gmail.com',465)
    server.ehlo()
    #server.starttls()
    server.ehlo()
    server.login('mukamijeniffer6@gmail.com','xxxxxxxxxxxxxx')
    
    subject = "Jenny the car cover you want is now available!"
    body = "Jenifer, This is the moment we have been waiting for. Now is your chance to pick up the car cover of your dreams. Don't mess it up! Link here: https://www.amazon.in/CrystalTech-Mitsubishi-Lexus-450-Multi-Layered/dp/B09ZHP54M3/ref=sr_1_1_sspa?crid=3KBXBTUB6P2KT&keywords=lexus&qid=1654459917&s=automotive&sprefix=lex%2Cautomotive%2C268&sr=1-1-spons&vehicle=Lexus%3ALX&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExMUlVMlIyVlBIVzhXJmVuY3J5cHRlZElkPUEwNTM3MjQ0MUozVk40Rk9PUDRRSCZlbmNyeXB0ZWRBZElkPUExMDA5MTk1M040WkhHTFBWMEtHSiZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU="
   
    msg = f"Subject: {subject}\n\n{body}"
    
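    # sendmail needs the sender address, the recipient address, and the message
    server.sendmail(
        'mukamijeniffer6@gmail.com',
        'mukamijeniffer6@gmail.com',
        msg
    )
    server.quit()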

In [None]:
# Sending myself an email (just for fun) when the price drops below a certain level

def send_mail():
    server = smtplib.SMTP_SSL('smtp.gmail.com',465)
    server.ehlo()
    #server.starttls()
    server.ehlo()
    server.login('AlexTheAnalyst95@gmail.com','xxxxxxxxxxxxxx')
    
    subject = "The Shirt you want is below $15! Now is your chance to buy!"
    body = "Alex, This is the moment we have been waiting for. Now is your chance to pick up the shirt of your dreams. Don't mess it up! Link here: https://www.amazon.com/Funny-Data-Systems-Business-Analyst/dp/B07FNW9FGJ/ref=sr_1_3?dchild=1&keywords=data+analyst+tshirt&qid=1626655184&sr=8-3"
   
    msg = f"Subject: {subject}\n\n{body}"
    
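    # sendmail needs the sender address, the recipient address, and the message
    server.sendmail(
        'AlexTheAnalyst95@gmail.com',
        'mukamijeniffer6@gmail.com',
        msg
    )
    server.quit()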
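
The notebook defines `check_price` and `send_mail` separately without showing how a price drop triggers the email. One way to connect them is sketched below: compare the latest price against a target and build the message only when the target is met. `TARGET_PRICE`, `should_alert`, and `build_alert` are hypothetical names, the threshold is arbitrary, and the URL and addresses are placeholders.

```python
from email.message import EmailMessage

TARGET_PRICE = 3500.0  # hypothetical threshold in rupees

def should_alert(price, target=TARGET_PRICE):
    """Return True when the observed price is at or below the target."""
    return price <= target

def build_alert(price, url, sender, recipient):
    """Build the email; EmailMessage handles non-ASCII text such as the rupee sign."""
    msg = EmailMessage()
    msg["Subject"] = f"Price alert: now ₹{price:,.2f}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"The item you are tracking has dropped to ₹{price:,.2f}.\nLink: {url}")
    return msg

latest_price = 3299.0  # stand-in for the value scraped by a price check
alert = None
if should_alert(latest_price):
    alert = build_alert(latest_price, "https://example.com/item",
                        "me@example.com", "me@example.com")
```

An `EmailMessage` built this way can be passed to `server.send_message(alert)` on an `smtplib` connection, which avoids assembling the headers by hand.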

Web scraping with Python is a skill you can use to extract data into a useful form that can then be imported and used in various ways.

Some practical applications of web scraping are:

* Gathering resumes of candidates with a specific skill
* Extracting tweets with specific hashtags from Twitter
* Lead generation in marketing
* Scraping product details and reviews from e-commerce websites

Apart from the above use cases, web scraping is widely used in natural language processing to extract text from websites for training deep learning models.




However, web scraping has its own challenges, for instance longevity. Because web developers keep updating their websites, you cannot rely on one scraper for too long. Even minor modifications can get in the way of fetching the data. Where one is available, a more reliable approach is to use the Application Programming Interfaces (APIs) offered by various websites and platforms.

