## Smartphone Market Analysis & Price Prediction using Web Scraping – Flipkart Case Study

### Project Overview

This project focuses on performing live web scraping from Flipkart to extract real-time smartphone data using Python libraries such as Requests and BeautifulSoup. The rapidly evolving smartphone market, characterized by frequent product launches and dynamic pricing, makes automated data collection essential for continuous insights.

After extraction, the data passes through a cleaning and preprocessing pipeline so that product names, prices, ratings, discounts, and key technical specifications (RAM, Storage, Battery, Display, Processor, Camera) are accurate and analysis-ready.

The cleaned dataset is then used to train a machine learning model that predicts smartphone prices based on technical specifications, enabling data-driven insights into pricing patterns and feature impact.

Finally, the results are visualized in Power BI, completing an end-to-end workflow from live data collection to ML-based predictive analysis. The project demonstrates practical expertise in:

- Web Scraping & Automation
- Data Cleaning & Feature Engineering
- Model Training & Evaluation (Price Prediction)
- Interactive Visualization & Insights in Power BI

---

### Phase 1 - Web Scraping


The first step in this project is to collect real-time smartphone data from Flipkart using the Python libraries Requests and BeautifulSoup. The goal is a rich, reliable dataset for downstream analysis and machine learning.  

In this phase, we scrape smartphone listings from **Flipkart** to capture details such as:  
- Brand & Model  
- Price & Discounts  
- Ratings  
- Key Specifications (RAM, ROM, Display, Battery, Processor, Camera, etc.)  

The scraped dataset will be stored in a structured format (CSV) and used in the next phase for transformation and analysis.
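
As a rough illustration of that CSV layout, the sketch below serializes product records with the standard-library `csv` module. The five column names mirror the keys collected by the scraper in this notebook; the helper `rows_to_csv` and the sample listing are hypothetical and exist only to show the shape of one row.

```python
import csv
import io

# Column names mirror the keys the scraper in this notebook collects.
FIELDNAMES = ["Name", "Price", "Discount", "Rating", "Specs"]


def rows_to_csv(rows):
    """Serialize product dicts to CSV text; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Replace None with "" so every row has the same five cells.
        cleaned = {key: "" if row.get(key) is None else row[key] for key in FIELDNAMES}
        writer.writerow(cleaned)
    return buffer.getvalue()


# Hypothetical listing, invented purely to show the layout.
sample = [{
    "Name": "Example Phone (Black, 128 GB)",
    "Price": "14999",
    "Discount": "16% off",
    "Rating": None,
    "Specs": "8 GB RAM | 128 GB ROM",
}]
csv_text = rows_to_csv(sample)
print(csv_text)
```

Because product names often contain commas, the writer quotes that field automatically, which is one reason to prefer a CSV library over manual string joining.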

Importing Required Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time, random

Let's define the scraping function, iterate through all pages, extract the product details, and store the results in a CSV file.

In [3]:
# Base setup
BASE_URL = "https://www.flipkart.com/search?q=smartphones&page={}"
HEADERS = {"User-Agent": "Mozilla/5.0"}
TOTAL_PAGES = 390
OUTPUT_FILE = "flipkart_smartphones.csv"

# Store all products
all_products = []

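def scrape_page(page_num):
    # Build the URL outside the try block so it is always defined in the error message
    url = BASE_URL.format(page_num)
    try:
        # A timeout keeps one stalled request from hanging the whole run
        response = requests.get(url, headers=HEADERS, timeout=10)
        if response.status_code != 200:
            print(f"⚠️ Failed to fetch page {page_num} (status {response.status_code})")
            return []
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return []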
    
    soup = BeautifulSoup(response.text, "html.parser")
    data = []

    # Each product card
    product_cards = soup.find_all("div", {"class": "tUxRFH"})
    for card in product_cards:
        # Name
        name_tag = card.find("div", {"class": "KzDlHZ"})
        name = name_tag.text.strip() if name_tag else None

        # Price
        price_tag = card.find("div", {"class": "Nx9bqj _4b5DiR"})
        price = price_tag.text.strip().replace("₹", "").replace(",", "") if price_tag else None

        # Discount
        dis_tag = card.find("div", {"class": "UkUFwK"})
        discount = dis_tag.text.strip() if dis_tag else None

        # Rating
        rating_tag = card.find("div", {"class": "XQDdHH"})
        rating = rating_tag.text.strip() if rating_tag else None

        # Specs
        specs_container = card.find("div", {"class": "_6NESgJ"})
        if specs_container:
            specs_list = specs_container.find_all("li", {"class": "J+igdf"})
            specs = "; ".join([s.text.strip() for s in specs_list])
        else:
            specs = None

        if name:
            data.append({
                "Name": name,
                "Price": price,
                "Discount": discount,
                "Rating": rating,
                "Specs": specs
            })

    return data


# Main loop
for page in range(1, TOTAL_PAGES + 1):
    print(f"Scraping page {page}/{TOTAL_PAGES}...")
    products = scrape_page(page)
    all_products.extend(products)

    # Save progress every 50 pages
    if page % 50 == 0:
        pd.DataFrame(all_products).to_csv(OUTPUT_FILE, index=False)
        print(f"✅ Progress saved at page {page}")

    # Sleep to avoid blocking
    time.sleep(random.uniform(1, 3))

# Final save
df = pd.DataFrame(all_products)
df.to_csv(OUTPUT_FILE, index=False)
print(f"\n Done! Collected {len(df)} products into {OUTPUT_FILE}")

Scraping page 1/390...
Scraping page 2/390...
Scraping page 3/390...
Scraping page 4/390...
Scraping page 5/390...
Scraping page 6/390...
Scraping page 7/390...
Scraping page 8/390...
Scraping page 9/390...
Scraping page 10/390...
Scraping page 11/390...
Scraping page 12/390...
Scraping page 13/390...
Scraping page 14/390...
Scraping page 15/390...
Scraping page 16/390...
Scraping page 17/390...
Scraping page 18/390...
Scraping page 19/390...
Scraping page 20/390...
Scraping page 21/390...
Scraping page 22/390...
Scraping page 23/390...
Scraping page 24/390...
Scraping page 25/390...
Scraping page 26/390...
Scraping page 27/390...
Scraping page 28/390...
Scraping page 29/390...
Scraping page 30/390...
Scraping page 31/390...
Scraping page 32/390...
Scraping page 33/390...
Scraping page 34/390...
Scraping page 35/390...
Scraping page 36/390...
Scraping page 37/390...
Scraping page 38/390...
Scraping page 39/390...
Scraping page 40/390...
Scraping page 41/390...
Scraping page 42/390...
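
The scraper above skips a page after a single failed request. One possible refinement, sketched here under assumptions, is to retry with exponential backoff. The helper `fetch_with_retries` and its parameters are hypothetical and not part of the original notebook; `fetch` stands for any callable that returns the page text or raises on failure.

```python
import random
import time


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or the attempts run out.

    `fetch` is a hypothetical callable that returns the page text on success
    and raises an exception on failure. Returns None if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == max_attempts:
                print(f"Giving up on {url}: {exc}")
                return None
            # Exponential backoff with jitter: roughly 1s, 2s, 4s, ...
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            sleep(delay)
    return None
```

In this notebook, `fetch` could be a small wrapper that calls `requests.get(url, headers=HEADERS, timeout=10)`, then `response.raise_for_status()`, and returns `response.text`. Passing `sleep` as a parameter keeps the helper testable without real waiting.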

---


### Phase 2 - Data Preparation & Transformation

With the web scraping phase complete, we have successfully gathered raw smartphone listings from Flipkart and saved them as `flipkart_smartphones.csv`.
This dataset now serves as the foundation for both exploratory analysis and machine learning model training.
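
Before any transformation, a quick audit of the raw file can show how much cleaning is needed. The following is a minimal sketch using only the standard library; the `audit_rows` helper is hypothetical, and the inline rows are made up to stand in for `flipkart_smartphones.csv`.

```python
import csv
import io
from collections import Counter


def audit_rows(rows, columns):
    """Count missing cells per column and exact duplicate rows."""
    missing = Counter()
    seen = set()
    duplicates = 0
    for row in rows:
        for col in columns:
            # Treat None, "" and whitespace-only cells as missing.
            if not (row.get(col) or "").strip():
                missing[col] += 1
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


# Made-up rows standing in for the scraped file.
SAMPLE_CSV = """Name,Price,Discount,Rating,Specs
Phone A,14999,16% off,4.3,8 GB RAM | 128 GB ROM
Phone A,14999,16% off,4.3,8 GB RAM | 128 GB ROM
Phone B,9999,,,4 GB RAM | 64 GB ROM
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = audit_rows(rows, ["Name", "Price", "Discount", "Rating", "Specs"])
print(report)
# {'rows': 3, 'missing': {'Discount': 1, 'Rating': 1}, 'duplicates': 1}
```

To run the same audit on the real data, replace the `io.StringIO(SAMPLE_CSV)` argument with an open file handle for the scraped CSV.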

In the Data Preparation & Transformation phase, we will:

- Evaluate and enhance overall data quality
- Handle missing, inconsistent, and duplicate records
- Standardize technical specifications (RAM, ROM, Battery, Display, Processor, Camera) into numeric formats
- Engineer relevant features for model input
- Prepare a clean and structured dataset suitable for Exploratory Data Analysis (EDA) and Price Prediction Modeling
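
The standardization step in the list above can be sketched with regular expressions. This is only an illustration: the `parse_specs` helper and its output field names are hypothetical, and the sample string assumes a spec layout ("8 GB RAM | 128 GB ROM", "(6.67 inch)", "5000 mAh") that may not match every listing. Values given in other units, such as RAM in MB, are not handled here.

```python
import re


def parse_specs(specs):
    """Extract numeric fields from a '; '-joined spec string; None when absent."""

    def first(pattern, cast):
        # Return the first captured number converted with `cast`, or None.
        match = re.search(pattern, specs, flags=re.IGNORECASE)
        return cast(match.group(1)) if match else None

    return {
        "ram_gb": first(r"(\d+)\s*GB\s*RAM", int),
        "rom_gb": first(r"(\d+)\s*GB\s*ROM", int),
        "display_inch": first(r"\((\d+(?:\.\d+)?)\s*inch\)", float),
        "battery_mah": first(r"(\d+)\s*mAh", int),
        # Takes the first megapixel figure, assumed to be the main rear camera.
        "rear_camera_mp": first(r"(\d+(?:\.\d+)?)\s*MP", float),
    }


# Hypothetical spec string in the assumed layout.
sample = (
    "8 GB RAM | 128 GB ROM | Expandable Upto 1 TB; "
    "16.94 cm (6.67 inch) Full HD+ Display; "
    "50MP + 2MP | 16MP Front Camera; 5000 mAh Battery"
)
parsed = parse_specs(sample)
print(parsed)
```

Returning `None` for absent fields keeps missing values explicit, so they can be imputed or dropped deliberately later instead of being silently treated as zero.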

*This phase ensures that the data is accurate, consistent, and machine-learning-ready, forming the backbone of the predictive analytics workflow.*

---