In this project, I'll use the BeautifulSoup library to scrape the first 100 pages of "Books" listings from Flipkart. The data can then be used to create visualizations that study how various factors affect the ratings and discounts of books sold on Flipkart. I'll start by importing the relevant libraries.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import matplotlib.pyplot as plt

Next, the scraping itself is performed with BeautifulSoup. A GET request is sent to each Flipkart listing URL, the HTML from the response is parsed with BeautifulSoup, and the text of the relevant tags is collected into one list per book.

In [2]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
t = []  # one list of field strings per book
for x in range(1, 101):  # pages 1 to 100 inclusive
    url = "https://www.flipkart.com/books/pr?sid=bks&otracker=categorytree&page={}".format(x)
    r = requests.get(url, headers=headers)
    html = r.text
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
    # Tags that carry the fields of each listing
    data = soup.find_all(['a','div', 'span'], class_ = ['_2cLu-l', '_1rcHFq', '_2_KrJI', '_38sUEc', '_1vC4OE', '_3auQ3N', 'VGWI6T'])
    a = []
    for i in data:
        # A title tag starts a new book, so store the record collected so far
        if i.get('class') == ['_2cLu-l']:
            t.append(a)
            a = []
            a.append(i.text)
            continue
        a.append(i.text)
    t.append(a)  # store the last book on the page
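
The grouping step in the loop above can be looked at in isolation with a small sketch. It assumes a hypothetical flat list of (class, text) pairs standing in for the tags returned by find_all, and a hypothetical helper named group_records. The pairing of the non-title classes with particular fields is made up for illustration.

```python
# Class used for book titles in the scraper above.
TITLE_CLASS = "_2cLu-l"


def group_records(tags, title_class=TITLE_CLASS):
    """Split a flat sequence of (css_class, text) pairs into one list per book."""
    records, current = [], []
    for css_class, text in tags:
        # A title starts a new record, so flush whatever was collected before it.
        if css_class == title_class and current:
            records.append(current)
            current = []
        current.append(text)
    # Flush the final record, which has no following title to trigger it.
    if current:
        records.append(current)
    return records


# Invented sample data: which class holds which field is an assumption.
sample = [
    ("_2cLu-l", "Book A"), ("_1rcHFq", "English, Paperback"), ("_1vC4OE", "₹100"),
    ("_2cLu-l", "Book B"), ("_1vC4OE", "₹250"),
]
records = group_records(sample)
print(records)
```

The sketch flushes the final record after the loop and skips empty ones, so no placeholder rows would reach the DataFrame.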

The list is now converted into a DataFrame with descriptive column names.

In [3]:
df = pd.DataFrame(t, columns=['name','type','rating','number_of_ratings', 'final_price', 'original_price', 'discount'])

The following is a snippet of the extracted data:

In [4]:
df.head()

Unnamed: 0,name,type,rating,number_of_ratings,final_price,original_price,discount
0,,,,,,,
1,Ethics (Hindi) - Nitishastra with 1 Disc,"Hindi, Paperback, Sunil Agrahari",4.4,(150),₹449,₹560,19% off
2,A Naturalist's Guide To The Reptiles Of India,"English, Paperback, Das Indraneil",4.3,(30),₹317,₹499,36% off
3,Think Like a Monk - Train Your Mind for Peace ...,"English, Paperback, Jay Shetty",4.8,(807),₹388,₹499,22% off
4,25 Years UPSC IAS/ IPS Prelims Topic-wise Solv...,"English, Paperback, Disha Experts, Mrunal Patel",4.5,"(16,860)",₹271,₹525,48% off


Now that the data has been extracted, I'll remove incomplete rows and convert each column to an appropriate type so that calculations can be performed on it.

In [5]:
df = df.dropna(how='any')

In [6]:
df.isna().sum()

name                 0
type                 0
rating               0
number_of_ratings    0
final_price          0
original_price       0
discount             0
dtype: int64
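
To see why dropna is needed here, the following minimal sketch builds a DataFrame from hypothetical ragged rows (the values are invented, not scraped). pandas pads short rows with missing values, and dropna(how='any') then discards every row that lacks a field.

```python
import pandas as pd

# Invented rows: an empty placeholder, a complete record,
# and a record that is missing its last field.
rows = [
    [],
    ["Book A", 4.4, "₹449"],
    ["Book B", 4.1],
]
demo = pd.DataFrame(rows, columns=["name", "rating", "final_price"])

# Rows shorter than the column list are padded with missing values.
print(demo.isna().sum())

# Keep only the rows in which every column is present.
complete = demo.dropna(how="any")
print(complete)
```

Only the "Book A" row survives, which mirrors how the empty row 0 disappears from the scraped data.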

In [7]:
df.dtypes

name                 object
type                 object
rating               object
number_of_ratings    object
final_price          object
original_price       object
discount             object
dtype: object

In [8]:
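# regex=False treats '(' and ')' as literal characters rather than regex syntax
df['number_of_ratings'] = pd.to_numeric(df['number_of_ratings'].str.replace('(','', regex=False).str.replace(')','', regex=False).str.replace(',','', regex=False))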

In [9]:
df['final_price'] = pd.to_numeric(df['final_price'].str.replace('₹','').str.replace(',',''))

In [10]:
df['original_price'] = pd.to_numeric(df['original_price'].str.replace('₹','').str.replace(',',''))

In [11]:
df['discount'] = pd.to_numeric(df['discount'].str.replace('% off',''))

In [12]:
df['rating'] = pd.to_numeric(df['rating'])
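
The string cleanup in cells 8 to 12 can also be expressed as small pure-Python helpers, which makes the rule for each column easy to check on its own. The function names below are hypothetical, and the inputs are copied from the sample rows shown earlier.

```python
def parse_count(text):
    """'(16,860)' -> 16860: drop the parentheses and thousands separators."""
    return int(text.strip("()").replace(",", ""))


def parse_price(text):
    """'₹449' -> 449: drop the rupee sign and thousands separators."""
    return int(text.replace("₹", "").replace(",", ""))


def parse_discount(text):
    """'19% off' -> 19: keep only the number before the percent sign."""
    return int(text.split("%")[0])


print(parse_count("(16,860)"), parse_price("₹449"), parse_discount("19% off"))
```

Helpers like these could be applied with Series.map, although the vectorized .str methods used above are usually the more idiomatic choice in pandas.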

Let's check the data type of the columns:

In [13]:
df.dtypes

name                  object
type                  object
rating               float64
number_of_ratings      int64
final_price            int64
original_price         int64
discount               int64
dtype: object

Here is the cleaned data:

In [14]:
df.head()

Unnamed: 0,name,type,rating,number_of_ratings,final_price,original_price,discount
1,Ethics (Hindi) - Nitishastra with 1 Disc,"Hindi, Paperback, Sunil Agrahari",4.4,150,449,560,19
2,A Naturalist's Guide To The Reptiles Of India,"English, Paperback, Das Indraneil",4.3,30,317,499,36
3,Think Like a Monk - Train Your Mind for Peace ...,"English, Paperback, Jay Shetty",4.8,807,388,499,22
4,25 Years UPSC IAS/ IPS Prelims Topic-wise Solv...,"English, Paperback, Disha Experts, Mrunal Patel",4.5,16860,271,525,48
5,Word Power Made Easy,"Paperbook, Norman Lewis",4.3,808,170,399,57


Now, the extracted and cleaned data can be saved as a CSV file for ease of access.

In [15]:
df.reset_index(drop=True).to_csv(r'C:\Users\nakausha\Downloads\projects\flipkart-webscraping\flipkart_books.csv')
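
One detail worth noting when writing the file: to_csv writes the index as an extra unnamed column by default, and pandas labels that column "Unnamed: 0" when the file is read back. The minimal sketch below uses in-memory strings and a single row taken from the table above instead of the real path, and shows how index=False avoids the extra column.

```python
import io

import pandas as pd

demo = pd.DataFrame(
    {"name": ["Word Power Made Easy"], "rating": [4.3], "discount": [57]}
)

# With no path given, to_csv returns the CSV text as a string.
csv_with_index = demo.to_csv()
csv_without_index = demo.to_csv(index=False)

# Reading the first version back yields an extra "Unnamed: 0" column.
reread = pd.read_csv(io.StringIO(csv_with_index))
clean = pd.read_csv(io.StringIO(csv_without_index))
print(list(reread.columns))
print(list(clean.columns))
```

Whether to keep the index is a matter of preference. Since the index is reset just before saving, dropping it with index=False loses no information.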