# Web Scraping
Web scraping is one of the most widely used ways to collect data. In simple terms, it means fetching the HTML code of a webpage and scanning it for data.

![](https://rukminim1.flixcart.com/image/312/312/kfpq5jk0-0/headphone/c/n/6/rockerz-400-rockerz-410-boat-original-imafw45vhyrax3zj.jpeg?q=70)

**[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)** and **[Selenium](https://www.selenium.dev/)** are two of the most widely used packages for scraping data.  
In this notebook, we'll use Beautiful Soup to collect reviews of the **[boAt Rockerz 400](https://www.flipkart.com/boat-rockerz-400-bluetooth-headset/p/itm14d0416b87d55)**.  
**Let's get started.**

## Importing modules

The **[Requests](https://requests.readthedocs.io/en/master/)** module is used to fetch the HTML code for a given URL. Beautiful Soup then parses that HTML, and tqdm displays a progress bar for long-running loops.

**Note**: *Not all webpages can be scraped this way. For example, most social media sites do not allow scraping because of privacy concerns; their data is available only through developer APIs that require special access.*
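One common signal of what a site permits is its `robots.txt` file, although a site's terms of service may add further restrictions. The sketch below uses Python's standard `urllib.robotparser` to check a URL against a set of rules. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the helper name `is_allowed` are all made up for illustration; they are not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
# A real site's rules would be loaded with set_url(...) followed by read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/products/reviews"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

With these sample rules, the first URL is permitted and the second is not.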

In [1]:
import requests 
from bs4 import BeautifulSoup 
from tqdm import tqdm

## Setting variables

In [2]:
URL = "https://www.flipkart.com/boat-rockerz-400-bluetooth-headset/product-reviews/itm14d0416b87d55?pid=ACCEJZXYKSG2T9GS&lid=LSTACCEJZXYKSG2T9GSVY4ZIC&marketplace=FLIPKART&page=1"

### Requesting desired Webpage

In [3]:
r = requests.get(URL)    
soup = BeautifulSoup(r.content, 'html.parser') 
print(soup.prettify()[6000:7000])

>
   </meta>
  </link>
 </head>
 <body>
  <div id="container">
   <div data-reactroot="">
    <div class="_1kfTjk">
     <div class="_1rH5Jn">
      <div class="_1TmfNK">
      </div>
      <div class="_2Xfa2_">
       <div class="_3_C9Hx">
        <div class="_3qX0zy">
         <a href="/">
          <img alt="Flipkart" class="_2xm1JU" src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/flipkart-plus_8d85f4.png" title="Flipkart" width="75"/>
         </a>
         <a class="_21ljIi" href="/plus">
          Explore
          <!-- -->
          <span class="_2FVHGh">
           Plus
          </span>
          <img src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/plus_aef861.png" width="10"/>
         </a>
        </div>
       </div>
       <div class="_1cmsER">
        <form action="/search" class="_2M8cLY header-form-search" method="GET">
         <div class="col-12-12 _2oO9oE">
          <div class="_3OO5Xc">
           <input autocomplete="off" class="_3704LK" name="q" place

If you know HTML, this might look familiar.  
Next we'll see how to get our data.
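Before moving on, note that the request cell above assumes the page loads successfully. A slightly more defensive version might check the HTTP status and set a timeout, as in the sketch below. The helper name `fetch_html`, the `User-Agent` string, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
def fetch_html(url, get=None, timeout=10):
    """Fetch url and return the raw response body, raising on HTTP errors."""
    if get is None:
        # Imported lazily so the function can also be exercised with a stand-in `get`.
        import requests
        get = requests.get

    # Assumption: a descriptive User-Agent; some sites reject the library default.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; review-scraper-demo)"}

    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises an exception for 4xx and 5xx responses
    return response.content
```

The result can be passed straight to the parser, for example `BeautifulSoup(fetch_html(URL), 'html.parser')`.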

## Extracting data

A webpage is made up of many components and sub-components. At times it is a complex nested structure that needs to be decoded.  
1. You can inspect the structure by pressing `Ctrl + Shift + C` in your browser.
2. If you hover over any review, you'll notice that each review block is a `div` with the class selector `col._2wzgFH.K0kLPL`.
![](Images/div-name.png)  

3. Each block is further divided into multiple rows. The first row contains the rating, and the second contains the review text. 
![](Images/rating.png)  
![](Images/review.png)  

We'll follow exactly the same approach to extract the data.
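To make the structure described above concrete, here is a minimal sketch that runs on a small hand-written HTML snippet instead of the live page. The snippet and its review texts are invented for illustration and only imitate the block and row layout. The sketch also uses a CSS selector via `select`, which is an alternative to the `find_all` call used in this notebook.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that imitates the layout described above.
# The real page is far larger, and its class names may change over time.
SAMPLE_HTML = """
<div class="col _2wzgFH K0kLPL">
  <div class="row"><div>5</div></div>
  <div class="row"><div>Great sound quality.</div></div>
</div>
<div class="col _2wzgFH K0kLPL">
  <div class="row"><div>3</div></div>
  <div class="row"><div>Tight on the ears.</div></div>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

reviews = []
# The CSS selector matches any div carrying all three classes.
for block in sample_soup.select("div.col._2wzgFH.K0kLPL"):
    rows = block.find_all("div", class_="row")
    if len(rows) < 2:
        continue  # skip blocks that do not follow the expected layout
    reviews.append({
        "rating": rows[0].find("div").get_text(strip=True),
        "review": rows[1].find("div").get_text(strip=True),
    })

print(reviews)
```

The length check is a small safeguard: indexing `rows[1]` on a block with fewer than two rows would otherwise raise an `IndexError`.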

In [4]:
# Extract all review blocks.
# Note: 'col._2wzgFH.K0kLPL' denotes three classes: 'col', '_2wzgFH' and 'K0kLPL'.
# In the HTML they are written as 'col _2wzgFH K0kLPL' (see step 3 above).
# Passing the full string matches that exact class attribute value.

row = soup.find_all('div',attrs={'class':'col _2wzgFH K0kLPL'})

In [5]:
# list to store data
dataset = []

# iteration over all blocks
for i in row: 
    
    # finding all rows within the block
    sub_row = i.find_all('div',attrs={'class':'row'})
        
    # extracting text from 1st and 2nd row
    rating = sub_row[0].find('div').text
    review = sub_row[1].find('div').text
    
    # appending to data
    dataset.append({'review': review , 'rating' : rating})

dataset[:5]

[{'review': "It was nice produt. I like it's design a lot.  It's easy to carry. And.   Looked stylish.READ MORE",
  'rating': '5'},
 {'review': 'awesome sound....very pretty to see this nd the sound quality was too good I wish to take this product loved this product 😍😍😍READ MORE',
  'rating': '5'},
 {'review': 'awesome sound quality. pros 7-8 hrs of battery life (including 45 mins approx call time)Awesome sound output. Bass and treble are really very clear without equaliser. With equaliser, sound wary depends on the handset sound quality.Weightless to carry and in head tooMic is good, but in traffic it is not too good (3.25/5)3.5mm Option is really important to mention. Really expecting other leading brands to implement this.ConsVery tight in ears. adjusters are ok .. this ll be very tight...READ MORE',
  'rating': '4'},
 {'review': 'I think it is such a good product not only as per the quality but also the design is quite good . I m using this product from January ... In this pandamic
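The output above shows two things worth tidying before analysis: each review ends with the page's "READ MORE" link label, and ratings are stored as strings. A minimal cleaning sketch follows. The helper name `clean_record` is hypothetical, and it assumes every rating is a whole-number string (otherwise `int` raises a `ValueError`).

```python
def clean_record(record):
    """Return a copy of record with a tidied review and an integer rating."""
    suffix = "READ MORE"
    review = record["review"].strip()

    # Drop the trailing link label that was scraped along with the review text.
    if review.endswith(suffix):
        review = review[: -len(suffix)].rstrip()

    return {"review": review, "rating": int(record["rating"])}


sample = {"review": "It's easy to carry. And.   Looked stylish.READ MORE", "rating": "5"}
print(clean_record(sample))
```

It could be applied to the whole list with `[clean_record(r) for r in dataset]`.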

## Iterating over multiple Pages

In [6]:
dataset = []

# iterating over review pages 1 to 49 (range(1, 50) stops before 50)
for i in tqdm(range(1,50)):

    URL = f"https://www.flipkart.com/boat-rockerz-400-bluetooth-headset/product-reviews/itm14d0416b87d55?pid=ACCEJZXYKSG2T9GS&lid=LSTACCEJZXYKSG2T9GSVY4ZIC&marketplace=FLIPKART&page={i}"
    r = requests.get(URL)    
    soup = BeautifulSoup(r.content, 'html.parser') 

    cols = soup.find_all('div',attrs={'class':'col _2wzgFH K0kLPL'})

    for col in cols:
        row = col.find_all('div',attrs={'class':'row'})

        rating = row[0].find('div').text
        review = row[1].find('div').text

        dataset.append({'review': review , 'rating' : rating})
len(dataset)

100%|██████████| 49/49 [02:23<00:00,  2.93s/it]


489
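The loop above requests a fixed number of pages back to back. A gentler variant might pause between requests and stop as soon as a page yields no reviews. The sketch below shows that control flow only. `collect_reviews`, `fetch_page`, and `fake_fetch_page` are hypothetical names, and the stand-in fetcher returns made-up data so the example runs without a network connection.

```python
import time


def collect_reviews(fetch_page, max_pages, delay_seconds=1.0):
    """Collect reviews page by page, stopping at the first empty page.

    fetch_page is a hypothetical callable: it takes a page number and
    returns a list of review dicts (an empty list when nothing is found).
    """
    collected = []
    for page in range(1, max_pages + 1):
        reviews = fetch_page(page)
        if not reviews:
            break  # no review blocks: past the last page, or the layout changed
        collected.extend(reviews)
        time.sleep(delay_seconds)  # pause between requests to reduce server load
    return collected


# Stand-in fetcher so the example runs offline: 3 pages of 10 reviews each.
def fake_fetch_page(page):
    if page > 3:
        return []
    return [{"review": f"review {page}-{n}", "rating": "5"} for n in range(10)]


print(len(collect_reviews(fake_fetch_page, max_pages=50, delay_seconds=0)))
```

In the notebook, `fetch_page` would wrap the request and parsing steps from the cell above for a single page number.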

In [7]:
import os
import pandas as pd
os.makedirs('Data', exist_ok=True)  # to_csv does not create missing directories
pd.DataFrame(dataset).to_csv('Data/data.csv',index=False)