## Why Web Scraping?

Web scraping is used to collect large amounts of information from websites. But why would someone need to collect so much data from websites? To answer this, let’s look at the applications of web scraping:

Price Comparison: Services such as ParseHub use web scraping to collect data from online shopping websites and use it to compare the prices of products.

Email Address Gathering: Many companies that use email for marketing use web scraping to collect email addresses and then send bulk emails.

Social Media Scraping: Web scraping is used to collect data from Social Media websites such as Twitter to find out what’s trending.

Research and Development: Web scraping is used to collect a large set of data (Statistics, General Information, Temperature, etc.) from websites, which are analyzed and used to carry out Surveys or for R&D.

Job Listings: Details about job openings and interviews are collected from different websites and listed in one place so that they are easily accessible to the user.

### Find the URL that you want to scrape

We are going to scrape the Flipkart website to extract the price, name, and rating of laptops. The URL for this page is https://www.flipkart.com/laptops/pr?sid=6bo,b5g&marketplace=FLIPKART

Download the Chrome web driver (ChromeDriver) from https://sites.google.com/a/chromium.org/chromedriver/home

### Inspecting the Page

The data is usually nested in tags. So, we inspect the page to see which tag contains the data we want to scrape. To inspect the page, right-click on the element and click on “Inspect”.

When you click on the “Inspect” tab, you will see a “Browser Inspector Box” open.

### Find the data you want to extract

Let’s extract the Price, Name, and Rating, each of which is nested in its own “div” tag.
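
To make the idea of data nested in tags concrete, here is a minimal sketch that parses a small hand-written HTML snippet with BeautifulSoup. The markup and the class names (`product-card`, `product-name`, `product-price`, `product-rating`) are invented for illustration and are not Flipkart's real ones; on a live page you would use whatever class names the inspector shows.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names are made up for this example
# and do not come from the Flipkart page.
html = """
<a class="product-card" href="/laptop-1">
  <div class="product-name">Example Laptop A</div>
  <div class="product-price">₹25,990</div>
  <div class="product-rating">4.2</div>
</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the outer <a> tag first, then look up each nested <div> by class.
card = soup.find("a", attrs={"class": "product-card"})
name = card.find("div", attrs={"class": "product-name"}).text
price = card.find("div", attrs={"class": "product-price"}).text
rating = card.find("div", attrs={"class": "product-rating"}).text

print(name, price, rating)
```

The same two-step pattern (locate the container, then the nested tags) is what the scraping loop later in this notebook relies on.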

In [1]:
#conda install -c anaconda beautiful-soup

In [2]:
#import all the necessary libraries:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
import numpy as np

In [4]:
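#To configure webdriver to use the Chrome browser, we have to set the path to chromedriver
#(newer Selenium 4 releases require the path to be wrapped in a Service object instead of being passed directly)

driver = webdriver.Chrome("chromedriver.exe")
#driver = webdriver.Ie(r'IEDriverServer.exe')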

In [5]:
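#create the lists that will hold the scraped data:

products = [] #List to store the name of each product
actual_prices = [] #List to store the actual price of each product
offer_prices = [] #List to store the offer price of each product
prices = [] #List to store the displayed price of each product
ratings = [] #List to store the rating of each product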

In [6]:
#driver.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniq")
driver.get('https://www.flipkart.com/laptops/pr?sid=6bo,b5g&marketplace=FLIPKART')

Now that we have written the code to open the URL, it’s time to extract the data from the website. As mentioned earlier, the data we want to extract is nested in <div> tags. So, we will find the div tags with the respective class names, extract their text, and store it in the lists created earlier.

In [7]:
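content = driver.page_source
soup = BeautifulSoup(content, 'html.parser') #name the parser explicitly to avoid a bs4 warning

#each product card is an <a> tag; these class names are specific to the page markup when this was written
for a in soup.find_all('a', href=True, attrs={'class':'_31qSD5'}):
    name = a.find('div', attrs={'class':'_3wU53n'})
    price = a.find('div', attrs={'class':'_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class':'hGSR34'})
    products.append(name.text)
    prices.append(price.text)
    try:
        ratings.append(rating.text)
    except AttributeError: #find() returns None when a product has no rating
        ratings.append(np.nan)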
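
The loop above falls back to NaN when a product has no rating, because `find()` returns `None` for a missing tag. As an alternative sketch, the same fallback can be written as an explicit `None` check. The HTML snippet, the class names, and the `text_or_default` helper below are hypothetical and exist only for this illustration.

```python
import math
from bs4 import BeautifulSoup

# Hypothetical markup: the second card deliberately has no rating <div>.
html = """
<a class="card" href="/a"><div class="name">Laptop A</div><div class="rating">4.5</div></a>
<a class="card" href="/b"><div class="name">Laptop B</div></a>
"""

def text_or_default(tag, default=math.nan):
    """Return the tag's text, or `default` when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default

soup = BeautifulSoup(html, "html.parser")
sample_ratings = []
for card in soup.find_all("a", href=True, attrs={"class": "card"}):
    rating_tag = card.find("div", attrs={"class": "rating"})
    sample_ratings.append(text_or_default(rating_tag))

print(sample_ratings)
```

Either style keeps the lists the same length, which matters later when they are combined into one table.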

In [8]:
print(len(products))
products

24


['Asus VivoBook S14 Core i7 8th Gen - (8 GB/1 TB HDD/256 GB SSD/Windows 10 Home/2 GB Graphics) S430FN-EB...',
 'Apple MacBook Air Core i5 5th Gen - (8 GB/128 GB SSD/Mac OS Sierra) MQD32HN/A A1466',
 'HP 14q Core i3 7th Gen - (8 GB/256 GB SSD/Windows 10 Home) 14q-cs0023TU Thin and Light Laptop',
 'Lenovo Ideapad 130 Core i3 7th Gen - (4 GB/1 TB HDD/Windows 10 Home) 130-15IKB Laptop',
 'Asus VivoBook S14 Core i7 8th Gen - (8 GB/1 TB HDD/256 GB SSD/Windows 10 Home/2 GB Graphics) S430FN-EB...',
 'Dell Vostro 3000 Core i3 8th Gen - (4 GB/1 TB HDD/Linux) 3480 Laptop',
 'Lenovo Ideapad 130 Core i5 8th Gen - (8 GB/1 TB HDD/Windows 10 Home/2 GB Graphics) 130-15IKB Laptop',
 'Acer Nitro 5 Core i7 9th Gen - (8 GB/1 TB HDD/256 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA Geforce ...',
 'Lenovo Ideapad 130 APU Dual Core A6 - (4 GB/1 TB HDD/Windows 10 Home) 130-15AST Laptop',
 'HP Pavilion 15-EC Ryzen 5 Quad Core - (8 GB/1 TB HDD/128 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA G...',
 'HP 14q Core i5

In [9]:
print(len(prices))
prices

24


['₹69,990',
 '₹64,990',
 '₹32,490',
 '₹25,990',
 '₹69,990',
 '₹26,990',
 '₹39,990',
 '₹69,990',
 '₹19,990',
 '₹56,990',
 '₹41,990',
 '₹18,999',
 '₹51,990',
 '₹42,990',
 '₹28,490',
 '₹19,990',
 '₹54,990',
 '₹28,990',
 '₹31,490',
 '₹49,990',
 '₹23,990',
 '₹37,990',
 '₹32,990',
 '₹31,490']
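
The prices come back as strings such as '₹69,990', which cannot be sorted or averaged numerically. If numbers are needed, one possible approach is to strip everything except the digits. The `parse_price` helper below is a hypothetical name introduced for this sketch and is not part of the notebook above.

```python
import re

def parse_price(text):
    """Convert a price string such as '₹69,990' to an integer number of rupees.

    Returns None when the string contains no digits.
    """
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits) if digits else None

# A few values copied from the scraped output above.
sample_prices = ['₹69,990', '₹64,990', '₹32,490']
numeric_prices = [parse_price(p) for p in sample_prices]
print(numeric_prices)
```

This assumes whole-rupee prices, as in the output above; a price with a decimal part would need different handling.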

In [10]:
print(len(ratings))
ratings

24


['4.7',
 '4.7',
 '4.3',
 '4',
 '4.7',
 nan,
 '4.2',
 '4.6',
 '3.9',
 '4.5',
 '4.3',
 '3.8',
 '4.8',
 '4.5',
 '4.7',
 '4.1',
 '4.6',
 '3.8',
 '4.1',
 '4.1',
 '4',
 '4.1',
 '4.3',
 '4.1']

In [11]:
driver.quit()

### Store the data in the required format

After extracting the data, you will usually want to store it in a file. The format depends on your requirements. For this example, we will store the extracted data in CSV (Comma-Separated Values) format.

In [12]:
df = pd.DataFrame({'Product Name':products,'Price':prices,'Rating':ratings}) 
df

| | Product Name | Price | Rating |
|---|---|---|---|
| 0 | Asus VivoBook S14 Core i7 8th Gen - (8 GB/1 TB... | ₹69,990 | 4.7 |
| 1 | Apple MacBook Air Core i5 5th Gen - (8 GB/128 ... | ₹64,990 | 4.7 |
| 2 | HP 14q Core i3 7th Gen - (8 GB/256 GB SSD/Wind... | ₹32,490 | 4.3 |
| 3 | Lenovo Ideapad 130 Core i3 7th Gen - (4 GB/1 T... | ₹25,990 | 4.0 |
| 4 | Asus VivoBook S14 Core i7 8th Gen - (8 GB/1 TB... | ₹69,990 | 4.7 |
| 5 | Dell Vostro 3000 Core i3 8th Gen - (4 GB/1 TB ... | ₹26,990 | NaN |
| 6 | Lenovo Ideapad 130 Core i5 8th Gen - (8 GB/1 T... | ₹39,990 | 4.2 |
| 7 | Acer Nitro 5 Core i7 9th Gen - (8 GB/1 TB HDD/... | ₹69,990 | 4.6 |
| 8 | Lenovo Ideapad 130 APU Dual Core A6 - (4 GB/1 ... | ₹19,990 | 3.9 |
| 9 | HP Pavilion 15-EC Ryzen 5 Quad Core - (8 GB/1 ... | ₹56,990 | 4.5 |


In [13]:
df.to_csv('products.csv', index=False)
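
To check what ends up in the CSV, it can help to write a small DataFrame and read it straight back. The sketch below uses an in-memory buffer instead of a file on disk, and the two product rows are made-up sample data rather than scraped results.

```python
import io
import pandas as pd

# Made-up sample rows with the same columns as the scraped DataFrame.
sample = pd.DataFrame({
    "Product Name": ["Example Laptop A", "Example Laptop B"],
    "Price": ["₹69,990", "₹26,990"],
    "Rating": ["4.7", None],  # the second product has no rating
})

# Write to an in-memory text buffer, then read the CSV back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored.dtypes)
```

Passing `index=False` keeps the row index out of the file, so no extra unnamed column appears when the CSV is read back. Note that `read_csv` infers types on load, so a rating column that was written as text is read back as floating-point numbers with NaN for the missing value.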