# Problem Statement: Web Scraping Product Information from Flipkart

**Background:**

Flipkart is an e-commerce platform where users can search for and purchase various products. It contains a vast catalog of products with details such as product names, prices, sellers, and additional specifications. To gather data for analysis or other purposes, it can be valuable to extract specific product information from Flipkart's website programmatically.

**Objective:**

The objective of this project is to create a Python script that scrapes Flipkart's website to extract essential information about Samsung mobile phones and stores it for further analysis or use.

In [4]:
# Import necessary libraries
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup

In [30]:
# Define a function to extract the product title
def get_title(soup):

    try:
        # Extract and clean the product title
        title_value = soup.find("span", class_="B_NuCI").text.replace("\xa0", "")
    except AttributeError:
        # find() returned None because the title element is missing; use an empty string
        title_value = ""

    return title_value

In [31]:
# Define a function to extract the product price
def get_price(soup):

    try:
        # Extract the product price
        price_value = soup.find("div", class_="_30jeq3 _16Jk6d").text
    except AttributeError:
        # find() returned None because the price element is missing; use an empty string
        price_value = ""

    return price_value
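
`get_price` returns the price as display text, which is awkward for numeric analysis. The sketch below shows one possible way to convert such a string to an integer. It assumes prices look like `₹12,999` (currency symbol, thousands separators, whole rupees); `parse_price` is a hypothetical helper, not part of the original notebook.

```python
import re


def parse_price(price_text):
    """Convert a price string such as '₹12,999' to an int.

    Returns None when the text contains no digits (for example, the empty
    string that get_price returns when the price element is missing).
    Assumes whole-rupee prices; a decimal part would be merged into the number.
    """
    digits = re.sub(r"[^\d]", "", price_text)
    return int(digits) if digits else None


print(parse_price("₹12,999"))
print(parse_price(""))
```

If used, the helper could be applied to the finished `price` column, leaving the raw text available in case the assumed format turns out to be wrong.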

In [32]:
# Define a function to extract the product rating
def get_rating(soup):

    try:
        # Extract the product rating
        rating_value = soup.find("div", class_="_3LWZlK").text
    except AttributeError:
        # find() returned None because the rating element is missing; use an empty string
        rating_value = ""

    return rating_value

In [33]:
# Define a function to extract the number of reviews
def get_num_review(soup):

    try:
        # Extract the number of reviews (the fourth nested <span>)
        reviews_value = soup.find("span", class_="_2_R_DZ").find_all("span")[3].text.replace("\xa0", "")
    except (AttributeError, IndexError):
        # The container or the nested span is missing; use an empty string
        reviews_value = ""

    return reviews_value

In [34]:
# Define a function to extract the product color
def get_color(soup):

    try:
        # Extract the product color from the fourth specification row
        color_value = soup.find_all("tr", class_="_1s_Smc row")[3].find("li", class_="_21lJbe").text
    except (AttributeError, IndexError):
        # The row or its value cell is missing; use an empty string
        color_value = ""

    return color_value

In [35]:
# Define a function to extract the product display size
def get_display_size(soup):

    try:
        # Extract the display size from the tenth specification row
        display_value = soup.find_all("tr", class_="_1s_Smc row")[9].find("li", class_="_21lJbe").text
    except (AttributeError, IndexError):
        # The row or its value cell is missing; use an empty string
        display_value = ""

    return display_value
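
`get_color` and `get_display_size` pick specification rows by fixed position (`[3]` and `[9]`), which returns the wrong field whenever a product page lists its rows in a different order. One possible hedge is to match on the row label instead. The sketch below assumes the label and value of each row have already been read into `(label, value)` tuples; how to locate the label cell depends on page markup that the notebook does not show. `lookup_spec` and the sample rows are hypothetical.

```python
def lookup_spec(rows, label):
    """Return the value whose label matches `label` (case-insensitive), else ''."""
    wanted = label.strip().lower()
    for row_label, value in rows:
        if row_label.strip().lower() == wanted:
            return value.strip()
    return ""


# Hypothetical rows, as they might look after reading each table row's cells
sample_rows = [
    ("Model Name", "Example Phone"),
    ("Color", "Black "),
    ("Display Size", "6.5 inch"),
]

print(lookup_spec(sample_rows, "color"))
print(lookup_spec(sample_rows, "Weight"))
```

Returning an empty string for a missing label keeps the behaviour in line with the other extraction functions.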

In [36]:
# Set the user-agent headers for HTTP requests
HEADERS = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'}

In [37]:
# Define the URL for Flipkart's Samsung product search
URL = "https://www.flipkart.com/search?q=samsung&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off"

In [38]:
# Send an HTTP GET request to the search page
response = requests.get(URL,headers=HEADERS)

In [39]:
# Display the HTTP response object; a 403 status means the server refused the request
response

<Response [403]>

In [28]:
# Inspect the response body; here it is a reCAPTCHA challenge page, not search results
response.content

b'<!DOCTYPE html><html lang=en><meta charset=UTF-8><meta content="width=device-width,initial-scale=1"name=viewport><title>Flipkart reCAPTCHA</title><link href=https://static-assets-web.flixcart.com/batman-returns/batman-returns/s/recaptcha.css rel=stylesheet><script src="https://www.google.com/recaptcha/api.js?render=6Lc49B0pAAAAAIVgOhfwW8i7t7SRO0KSnlSVZRAq"></script><script src=https://static-assets-web.flixcart.com/batman-returns/batman-returns/s/recaptcha.js async defer></script><div class=container><img alt="Flipkart Logo"class=logo src="https://rukminim1.flixcart.com/www/60/60/promos/14/06/2024/88011666-ce1d-40f0-a8eb-1bac7d164885.png?q=60"><h1 class=header>Are you a human?</h1><p class=subText>Confirming...<div class=loaderContainer><div class=loader></div></div></div>'
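
The two outputs above show that the request was blocked: the status is 403 and the body is a reCAPTCHA challenge page, so the parsing steps that follow find no product links. A minimal sketch of a check that could run before parsing is given below. `looks_blocked` is a hypothetical helper, and matching on the string `reCAPTCHA` is a heuristic based only on the page shown above.

```python
def looks_blocked(status_code, body):
    """Heuristic: True when the response is a 403 or a reCAPTCHA challenge page.

    `body` is the raw response body as bytes.
    """
    return status_code == 403 or b"reCAPTCHA" in body


# Values taken from the outputs shown above
print(looks_blocked(403, b"<title>Flipkart reCAPTCHA</title>"))
```

In the notebook this could be called as `looks_blocked(response.status_code, response.content)`, stopping early instead of producing an empty DataFrame.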

In [15]:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content,"html.parser")

In [16]:
# Find all product links on the search results page
links = soup.find_all("a",class_="s1Q9rs")

In [17]:
# Create an empty list to store product URLs
links_list = []

In [18]:
# Extract product URLs and store them in the list
for link in links:
    links_list.append(link.get("href"))
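
`link.get("href")` returns `None` when an anchor has no `href`, and the same product can appear more than once in the results. The sketch below shows one way to turn the collected values into unique absolute URLs with the standard library's `urljoin`. `absolute_links` and `BASE_URL` are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.flipkart.com"


def absolute_links(hrefs, base=BASE_URL):
    """Build unique absolute URLs from hrefs, skipping missing values.

    Order of first appearance is preserved.
    """
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip None or empty href values
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


print(absolute_links(["/a/p/1", None, "/a/p/1"]))
```

Unlike plain string concatenation, `urljoin` also leaves an already absolute `href` unchanged.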

In [19]:
# Create a dictionary to store scraped data
d = {"title":[],"price":[],"rating":[],"num_review":[],"color":[],"display":[]}

In [20]:
# Iterate through product URLs and scrape data
for link in links_list:
    
    # Create the full URL for the product page
    new_webpage_URL = "https://www.flipkart.com" + link
    
    # Send an HTTP GET request to the product page and parse the HTML content
    new_response = requests.get(new_webpage_URL,headers=HEADERS)
    new_soup = BeautifulSoup(new_response.content,"html.parser")
    
    # Call the defined functions to extract and append data to the dictionary
    d["title"].append(get_title(new_soup))
    d["price"].append(get_price(new_soup))
    d["rating"].append(get_rating(new_soup))
    d["num_review"].append(get_num_review(new_soup))
    d["color"].append(get_color(new_soup))
    d["display"].append(get_display_size(new_soup))
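
The loop above requests every product page back to back and parses whatever comes back, including error pages. A possible variant, sketched below, pauses between requests and skips pages that do not return status 200. The `fetch` and `extract` callables are assumptions: `fetch` is taken to return a `(status_code, body)` pair, for example a thin wrapper around `requests.get`, and `extract` turns a body into one record. `scrape_pages` is a hypothetical helper.

```python
import time


def scrape_pages(urls, fetch, extract, delay_seconds=1.0):
    """Fetch each URL with a pause in between; skip pages that fail to load."""
    records = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        status, body = fetch(url)
        if status != 200:
            continue  # blocked or missing page: nothing to extract
        records.append(extract(body))
    return records


# Stand-in fetcher so the sketch runs without network access
fake_pages = {"u1": (200, "a"), "u2": (403, "b"), "u3": (200, "c")}
print(scrape_pages(["u1", "u2", "u3"], lambda u: fake_pages[u], str.upper, delay_seconds=0))
```

Passing the fetcher in as an argument keeps the loop testable offline; the one-second default delay is an arbitrary choice, not a documented limit.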

In [21]:
# Create a DataFrame from the scraped data
df = pd.DataFrame(d)

# Replace empty strings in the "title" column with NaN, then drop rows with a missing title.
# Assigning the result back avoids the chained inplace call that pandas warns will stop working in 3.0.
df["title"] = df["title"].replace("", np.nan)
df = df.dropna(subset=["title"])


In [22]:
# Display the DataFrame
df

title,price,rating,num_review,color,display


In [20]:
# Save the DataFrame to a CSV file
df.to_csv("Flipkart_samsung_phone_data.csv", index=False)