# 01_Data Extraction (Web Scraping)
**Goal:** Build a custom dataset of used car listings from eBay to train my pricing model.

**Key Steps:**
1.  **Download HTML:** Saving pages locally first to respect server load and avoid data loss.
2.  **Parse & Extract:** Using BeautifulSoup to pull specific details (Price, Mileage, Title) from the raw HTML.
3.  **Initial Cleaning:** Removing duplicates and formatting basic numbers.
4.  **Export:** Saving the raw dataset to CSV.

### Import Libraries

In [3]:
import os
import requests
import time
import random
from bs4 import BeautifulSoup
import re
import pandas as pd
from pathlib import Path
print("Libraries are imported")

Libraries are imported


### Configuration & Rate Limiting
- **Output Path:** Defined a local directory to store raw HTML files.
- **Page Limit:** Targeted 100 pages to gather sufficient data for Machine Learning (~6,000 listings).
- **Anti-Blocking Strategy:** Implemented a check to skip already downloaded pages and added delays to prevent IP blocking.
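
The resume check described above can be sketched in isolation. This is a minimal illustration, not the notebook's actual code: the helper name `missing_pages` and the temporary demo directory are assumptions introduced here, while the `<page>.html` naming follows the download loop in this notebook.

```python
import os
import tempfile

def missing_pages(folder, num_pages):
    """Return the page numbers in 1..num_pages with no saved '<page>.html' file."""
    return [
        page for page in range(1, num_pages + 1)
        if not os.path.exists(os.path.join(folder, f"{page}.html"))
    ]

# Demo in a throwaway directory: pretend pages 1 and 2 were already saved.
with tempfile.TemporaryDirectory() as demo_dir:
    for page in (1, 2):
        with open(os.path.join(demo_dir, f"{page}.html"), "w", encoding="utf-8") as f:
            f.write("<html></html>")
    todo = missing_pages(demo_dir, 5)

print(todo)  # only pages that still need a request
```

Computing the list of missing pages up front means an interrupted run only re-requests what is not yet on disk, so repeated runs add no extra load on the server.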

In [7]:
# --- CONFIGURATION ---
OUTPUT_FOLDER = "ebay_htmls"
BASE_DIR = os.path.dirname(os.getcwd())
output_folder = os.path.join(BASE_DIR, OUTPUT_FOLDER)

BASE_URL = "https://www.ebay.com/sch/i.html?_nkw=sedans&_sacat=6001"
NUM_PAGES = 100  # Number of pages to download

# Headers that mimic a regular browser request.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.ebay.com/",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Ensure the folder exists (no error if it is already there)
os.makedirs(output_folder, exist_ok=True)

# --- CREATE A SESSION ---
# A session reuses the connection and keeps cookies and headers across requests
session = requests.Session()
session.headers.update(HEADERS)

print(f"Starting process for {NUM_PAGES} pages...")

for page in range(1, NUM_PAGES + 1):
    
    filename = f"{output_folder}/{page}.html"
    
    # Check if file exists (Resume capability)
    if os.path.exists(filename):
        print(f"Page {page} already exists. Skipping.")
        continue
    
    target_url = f"{BASE_URL}&_pgn={page}"
    
    try:
        print(f"⬇Downloading Page {page}...", end=" ")
        
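        # Use session.get (not requests.get) so cookies and headers persist;
        # the timeout keeps a stalled connection from hanging the loop
        response = session.get(target_url, timeout=30)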
        
        if response.status_code == 200:
            with open(filename, "w", encoding="utf-8") as f:
                f.write(response.text)
            print("Saved.")
        else:
            print(f"Failed (Status: {response.status_code})")
            
    except Exception as e:
        print(f"Error: {e}")

    # SAFETY SLEEP
    if page < NUM_PAGES:
        wait_time = random.uniform(4, 8)
        print(f" Waiting {wait_time:.2f} seconds...")
        time.sleep(wait_time)

print("\nProcess Completed.")

Starting process for 100 pages...
Page 1 already exists. Skipping.
Page 2 already exists. Skipping.
Page 3 already exists. Skipping.
Page 4 already exists. Skipping.
Page 5 already exists. Skipping.
Page 6 already exists. Skipping.
Page 7 already exists. Skipping.
Page 8 already exists. Skipping.
Page 9 already exists. Skipping.
Page 10 already exists. Skipping.
Page 11 already exists. Skipping.
Page 12 already exists. Skipping.
Page 13 already exists. Skipping.
Page 14 already exists. Skipping.
Page 15 already exists. Skipping.
Page 16 already exists. Skipping.
Page 17 already exists. Skipping.
Page 18 already exists. Skipping.
Page 19 already exists. Skipping.
Page 20 already exists. Skipping.
Page 21 already exists. Skipping.
Page 22 already exists. Skipping.
Page 23 already exists. Skipping.
Page 24 already exists. Skipping.
Page 25 already exists. Skipping.
Page 26 already exists. Skipping.
Page 27 already exists. Skipping.
Page 28 already exists. Skipping.
Page 29 already exists.

### Page Navigation & Extraction Logic
- Iterating through each saved HTML file.
- **Structure:** Each page contains up to 60 valid car listings. The parsing logic filters out "Sponsored" and placeholder slots (for example, the "Shop on eBay" cards) that don't contain valid vehicle data.
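
The filtering idea can be sketched on title strings alone, without any HTML. This is a hedged illustration: the function name `clean_title` and the two constants are assumptions introduced here, chosen to mirror the "Shop on eBay" check and the link-suffix cleanup that the main parsing loop applies later in this notebook.

```python
PLACEHOLDER_MARKER = "Shop on eBay"             # text seen on non-listing cards
LINK_SUFFIX = "Opens in a new window or tab"    # accessibility text appended to titles

def clean_title(raw_title):
    """Return a cleaned listing title, or None for placeholder/empty cards."""
    if not raw_title or PLACEHOLDER_MARKER in raw_title:
        return None
    cleaned = raw_title.replace(LINK_SUFFIX, "").strip()
    return cleaned or None

samples = [
    "Shop on eBay",
    "2004 BMW 5-Series 525iOpens in a new window or tab",
    "2007 Honda Accord Opens in a new window or tab",
]
titles = [clean_title(s) for s in samples]
print(titles)
```

Returning `None` for placeholder cards lets the caller skip them with a single truthiness check.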

In [8]:
# Read the locally stored HTML files
INPUT_FOLDER = "ebay_htmls"
input_folder = os.path.join(BASE_DIR, INPUT_FOLDER)

files = [f for f in os.listdir(input_folder) if f.endswith(".html")]
filename = f"{input_folder}/{files[0]}"

print(f" Starting Scraping: {filename}\n")

with open(filename, "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# 1. Finding the main list items (the cards)
cards = soup.find_all("li", class_=lambda x: x and "s-card" in x)

print(f"Found {len(cards)} cards. Now Inspecting...\n")

for card in cards[:62]:
    # 2. CONTENT BLOCK 
    content_block = card.find("div", class_="su-card-container__content")
    
    if not content_block:
        continue # Skip if this card is broken/empty

    print("-" * 50)

    # A. TITLE
    # Inside the 'header' div -> 'a' tag
    header = content_block.find("div", class_="su-card-container__header")
    title = header.text.strip() if header else "N/A"
    print(f"Title:     {title}")

    # B. CONDITION
    # Inside 's-card__subtitle-row'
    subtitle = content_block.find("div", class_="s-card__subtitle-row")
    condition = subtitle.text.strip() if subtitle else "N/A"
    print(f"Condition: {condition}")

    # C. PRICE 
    # Inside 'attributes-primary' -> 's-card__price'
    price_tag = content_block.find("span", class_=lambda x: x and "s-card__price" in x)
    price = price_tag.text.strip() if price_tag else "N/A"
    print(f"Price: {price}")

    # D. DETAILS (Year, Mileage, Bids, Seller)
    # These are all inside 'su-card-container__attributes'
    details_block = content_block.find("div", class_=lambda x: x and "su-card-container__attributes" in x)
    
    seller_info = "N/A"
    bids = "N/A"
    mileage = "N/A"
    
    if details_block:
        # Get all text lines from this block to analyze
        all_text = details_block.get_text(" | ", strip=True) # Join with pipe | to separate
        print(f"Raw Data:  {all_text}")
        
        # Extraction logic
        if "bid" in all_text.lower():
            bids = "Auction/Bids Found"
        if "positive" in all_text.lower():
            seller_info = "Seller Rating Found"
        if "miles" in all_text.lower():
            mileage = "Mileage Found"

    print(f" Bids: {bids}, Seller: {seller_info}")

 Starting Scraping: d:\Used_Cars_Price_Prediction\ebay_htmls/1.html

Found 62 cards. Now Inspecting...

--------------------------------------------------
Title:     Shop on eBayBrand New
Condition: Brand New
Price: $20.00
Raw Data:  $20.00 | or Best Offer
 Bids: N/A, Seller: N/A
--------------------------------------------------
Title:     Shop on eBayBrand New
Condition: Brand New
Price: $20.00
Raw Data:  $20.00 | or Best Offer
 Bids: N/A, Seller: N/A
--------------------------------------------------
Title:     2004 BMW 5-Series 525iOpens in a new window or tabPre-Owned
Condition: Pre-Owned
Price: $1,750.00
Raw Data:  $1,750.00 | 45 bids | · | Time left | 19m left | (Today 11:00 AM) | Shipping not specified | Located in United States | Year: 2004 | Miles: 171,100 | Brand: BMW | justdonated | 99% positive (10.3K)
 Bids: Auction/Bids Found, Seller: Seller Rating Found
--------------------------------------------------
Title:     2007 Honda Accord Opens in a new window or tabPre-Owned


### Data Cleaning
- **Price:** Removing `$` symbols and thousands separators, then converting to floats.
- **Reviews:** Converting shorthand like "1.2k" into `1200` for consistent analysis.
- **Attributes:** Initializing default values for missing fields to prevent script crashes.

In [9]:
# --- HELPER 1: Clean Price ---
def clean_number(text):
    """Returns float or None. Example: '$1,750.00' -> 1750.0"""
    if not text:
        return None
    # Keep only digits and the decimal point
    clean = re.sub(r'[^\d.]', '', text)
    try:
        return float(clean)
    except ValueError:
        return None

# --- HELPER 2: Clean Reviews ---
def clean_reviews(text):
    """Returns int or None. Example: '(10.3K)' -> 10300"""
    if not text:
        return None
    # Remove parens and spaces, keep numbers and 'K'
    clean = re.sub(r'[^\d.K]', '', text.upper())
    multiplier = 1
    if 'K' in clean:
        multiplier = 1000
        clean = clean.replace('K', '')
    try:
        # round() guards against float artifacts before truncating to int
        return int(round(float(clean) * multiplier))
    except ValueError:
        return None

# --- HELPER 3: Parse Attributes (The Master Parser) ---
def parse_attributes_string(raw_text):
    """
    Parses the long string: 
    '$1,750.00 | 45 bids | Ended | ... '
    """
    # 1. Initialize Default Values
    data = {
        "bids": None,
        "year": None,
        "mileage": None,
        "brand": None,
        "seller_score": None,
        "reviews_count": None,
        "type": None,
        "status": "Active" # <--- Default to Active
    }
    
    if not raw_text: return data
    
    # 2. STATUS CHECK
    lower_text = raw_text.lower()
    if "ended" in lower_text or "sold" in lower_text:
        data["status"] = "Sold"

    # 3. Split by Pipe '|' and analyze each part
    parts = [p.strip() for p in raw_text.split('|')]
    
    for p in parts:
        # BIDS
        if "bids" in p.lower():
            bid_match = re.search(r'\d+', p)
            if bid_match:
                data["bids"] = int(bid_match.group(0))
                data["type"] = "Auction"

        # BUY IT NOW
        if "buy it now" in p.lower():
            data["type"] = "Buy It Now"

        # YEAR
        if "Year:" in p:
            year_match = re.search(r'\d+', p)
            if year_match:
                data["year"] = int(year_match.group(0))

        # MILEAGE
        if "Miles:" in p:
            digits = re.sub(r'[^\d]', '', p)
            if digits:
                data["mileage"] = int(digits)
            
        # BRAND
        if "Brand:" in p:
            data["brand"] = p.replace("Brand:", "").strip()
            
        # SELLER INFO
        if "% positive" in p:
            score_match = re.search(r'(\d+(?:\.\d+)?%)', p)
            if score_match:
                data["seller_score"] = score_match.group(1)
            
            count_match = re.search(r'\((.*?)\)', p)
            if count_match:
                data["reviews_count"] = clean_reviews(count_match.group(1))

    return data

### Execution: Parsing & Building the DataFrame
- Looping through all 100 HTML files.
- Extracting features (Year, Mileage, Title, Price) for every car.
- **Deduplication:** Removing repeated records (same Title + Year + Mileage), keeping the first occurrence, to ensure dataset quality.
- Compiling the list of dictionaries into a Pandas DataFrame.
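
To make the deduplication rule concrete, here is a standard-library sketch of what keeping the first row per (`title`, `year`, `mileage`) key does. It is an illustration only: the notebook itself uses pandas `drop_duplicates`, the helper name `drop_duplicate_rows` is an assumption, and the second sample row (with a changed price) is a hypothetical repeat invented for the demo.

```python
def drop_duplicate_rows(rows, key_fields=("title", "year", "mileage")):
    """Keep the first row seen for each combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key in seen:
            continue  # same car already recorded, skip the repeat
        seen.add(key)
        unique.append(row)
    return unique

rows = [
    {"title": "2004 BMW 5-Series 525i", "year": 2004, "mileage": 171100, "price": 1750.0},
    # Hypothetical repeat of the same car captured on a later page with a new price
    {"title": "2004 BMW 5-Series 525i", "year": 2004, "mileage": 171100, "price": 1800.0},
    {"title": "2007 Honda Accord", "year": 2007, "mileage": 256860, "price": 660.0},
]
unique_rows = drop_duplicate_rows(rows)
print(len(rows) - len(unique_rows), "duplicate removed")
```

Because price is not part of the key, a listing whose auction price changed between pages still counts as the same car, and only its first captured price is kept.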

In [10]:
# --- CONFIGURATION ---

# --- HELPER FUNCTIONS ---
def clean_number(text):
    if not text: return None
    clean = re.sub(r'[^\d.]', '', text)
    try: return float(clean)
    except ValueError: return None

def clean_reviews(text):
    if not text: return None
    clean = re.sub(r'[^\d.K]', '', text.upper())
    multiplier = 1000 if 'K' in clean else 1
    clean = clean.replace('K', '')
    try: return int(round(float(clean) * multiplier))
    except ValueError: return None

def parse_attributes_string(raw_text):
    data = {"bids": None, "year": None, "mileage": None, "brand": None, 
            "seller_score": None, "reviews_count": None, "type": None, "status": "Active"}
    if not raw_text: return data
    if "ended" in raw_text.lower() or "sold" in raw_text.lower(): data["status"] = "Sold"
    
    parts = [p.strip() for p in raw_text.split('|')]
    for p in parts:
        lower = p.lower()
        if "bids" in lower:
            match = re.search(r'\d+', p)
            if match:
                data["bids"] = int(match.group(0))
                data["type"] = "Auction"
        if "buy it now" in lower: data["type"] = "Buy It Now"
        if "year:" in lower:
            match = re.search(r'\d+', p)
            if match: data["year"] = int(match.group(0))
        if "miles:" in lower:
            digits = re.sub(r'[^\d]', '', p)
            if digits: data["mileage"] = int(digits)
        # Split on the first colon so the value is extracted regardless of label casing
        if "brand:" in lower: data["brand"] = p.split(":", 1)[1].strip()
        if "% positive" in p:
            score = re.search(r'(\d+(?:\.\d+)?%)', p)
            if score: data["seller_score"] = score.group(1)
            count = re.search(r'\((.*?)\)', p)
            if count: data["reviews_count"] = clean_reviews(count.group(1))
    return data

# --- MAIN PROCESSOR (ETL) ---
def get_clean_dataframe():
    # 1. EXTRACT (Load to Memory)
    files = [f for f in os.listdir(input_folder) if f.endswith(".html")]
    print(f" Processing {len(files)} files into DataFrame...")
    
    all_cars = [] 
    
    for filename in files:
        filepath = f"{input_folder}/{filename}"
        with open(filepath, "r", encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
            
        cards = soup.find_all("li", class_=lambda x: x and "s-card" in x)
        
        for card in cards:
            content = card.find("div", class_="su-card-container__content")
            if not content: continue
            
            # Extract Data
            header = content.find("div", class_="su-card-container__header")
            if not header: continue
            raw_title = header.find("a").text.strip() if header.find("a") else header.text.strip()
            if "Shop on eBay" in raw_title: continue 
            title = raw_title.replace("Opens in a new window or tab", "").strip()
            
            price_tag = content.find("span", class_=lambda x: x and "s-card__price" in x)
            price = clean_number(price_tag.text) if price_tag else None
            
            sub = content.find("div", class_="s-card__subtitle-row")
            condition = sub.text.strip() if sub else None
            
            details = content.find("div", class_=lambda x: x and "su-card-container__attributes" in x)
            attr = parse_attributes_string(details.get_text("|", strip=True)) if details else parse_attributes_string("")

            # Filter Logic
            valid_price = (price is not None and price > 500)
            valid_bids = (attr["bids"] is not None and attr["bids"] > 0)
            
            if title and (valid_price or valid_bids):
                all_cars.append({
                    "title": title,
                    "price": price,
                    "year": attr["year"],
                    "mileage": attr["mileage"],
                    "brand": attr["brand"],
                    "condition": condition,
                    "bids": attr["bids"],
                    "seller_score": attr["seller_score"],
                    "reviews_count": attr["reviews_count"],
                    "link_type": attr["type"],
                    "status": attr["status"]
                })

    # 2. TRANSFORM - Create DataFrame & Clean
    df = pd.DataFrame(all_cars)
    print(f" Raw extracted rows: {len(df)}")
    
    # Drop Duplicates
    df_clean = df.drop_duplicates(subset=['title', 'year', 'mileage'], keep='first')
    print(f" Duplicates removed: {len(df) - len(df_clean)}")
    print(f" Unique cars remaining: {len(df_clean)}")
    
    # 3. RETURN
    return df_clean

# Assign result to a variable
df_cars = get_clean_dataframe()
print("\nDataFrame is ready.")

 Processing 100 files into DataFrame...
 Raw extracted rows: 5777
 Duplicates removed: 1149
 Unique cars remaining: 4628

DataFrame is ready.


### View Data

In [11]:
# Display some data to verify
df_cars.head(3)

| | title | price | year | mileage | brand | condition | bids | seller_score | reviews_count | link_type | status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004 BMW 5-Series 525i | 1750.0 | 2004 | 171100 | BMW | Pre-Owned | 45.0 | 99% | 10300 | Auction | Active |
| 1 | 2007 Honda Accord | 660.0 | 2007 | 256860 | Honda | Pre-Owned | 26.0 | 99% | 10300 | Auction | Active |
| 2 | 2019 Nissan Altima 2.5 Platinum AWD clean carf... | 5100.0 | 2019 | 16550 | Nissan | Great Value Look | 60.0 | 100% | 2300 | Buy It Now | Active |


### Saving to CSV File

In [12]:

# Output file path
file_name="data/ebay.csv"

output_file=os.path.join(BASE_DIR,file_name)

# Export DataFrame to CSV
df_cars.to_csv(output_file, index=False, encoding='utf-8')

# Display export summary
file_size= os.path.getsize(output_file) / 1024

print(f"Data export completed successfully, Size {file_size:.0f} KB.")

Data export completed successfully, Size 428 KB.


### Data Types

In [13]:
# Display data types for each column
print("Data types:")
print(df_cars.dtypes)

print("\nData type summary:")
print(f"  Numeric columns: {df_cars.select_dtypes(include=['int64', 'float64']).columns.tolist()}")
print(f"  Text columns: {df_cars.select_dtypes(include=['object']).columns.tolist()}")

Data types:
title             object
price            float64
year               int64
mileage            int64
brand             object
condition         object
bids             float64
seller_score      object
reviews_count      int64
link_type         object
status            object
dtype: object

Data type summary:
  Numeric columns: ['price', 'year', 'mileage', 'bids', 'reviews_count']
  Text columns: ['title', 'brand', 'condition', 'seller_score', 'link_type', 'status']


### Data Summary

In [14]:
# Display statistical summary of numeric columns
print("Statistical summary:")
df_cars.describe()

Statistical summary:


| | price | year | mileage | bids | reviews_count |
|---|---|---|---|---|---|
| count | 4628.0 | 4628.0 | 4628.0 | 228.0 | 4628.0 |
| mean | 23131.871493 | 2009.363224 | 103448.2 | 5.552632 | 1003.316768 |
| std | 31829.138176 | 20.009067 | 1815358.0 | 10.414825 | 5597.048744 |
| min | 1.25 | 1908.0 | 0.0 | 0.0 | 0.0 |
| 25% | 9250.0 | 2010.0 | 37371.75 | 0.0 | 0.0 |
| 50% | 15844.0 | 2016.0 | 69931.5 | 0.0 | 230.0 |
| 75% | 25000.0 | 2020.0 | 104661.5 | 6.0 | 771.0 |
| max | 849980.0 | 2026.0 | 123456800.0 | 60.0 | 158200.0 |


# Phase 1: Data Extraction Summary

### 1. The Objective
To build a robust Machine Learning model, I needed real-world, current market data. Public datasets (such as those on Kaggle) are often outdated. Therefore, I built a custom web scraper to target eBay's used car listings.

### 2. The Strategy
I utilized Python's scraping ecosystem to iterate through search result pages and extract granular details for each listing.
* **Target:** Used Car Listings
* **Technique:** "Download First, Parse Later." I saved the raw HTML locally first. This allowed me to experiment with different parsing logic without needing to re-request (and potentially get blocked by) the server.
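
The "Download First, Parse Later" pattern boils down to a read-through file cache. The sketch below is a minimal illustration under stated assumptions: `load_or_fetch` and `fake_fetch` are hypothetical names, and the fetch step is a stand-in callable rather than a real HTTP request, so the example runs without network access.

```python
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return cached text from path; call fetch() and cache the result only on a miss."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    text = fetch()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text

calls = []

def fake_fetch():
    # Stand-in for a real HTTP request; records each call so we can count them.
    calls.append(1)
    return "<html>listing page</html>"

with tempfile.TemporaryDirectory() as cache_dir:
    page_path = os.path.join(cache_dir, "1.html")
    first = load_or_fetch(page_path, fake_fetch)   # cache miss: fetch runs
    second = load_or_fetch(page_path, fake_fetch)  # cache hit: read from disk

print(len(calls))  # number of times the "server" was contacted
```

Once a page is on disk, every later parsing experiment reads the local copy, which is what makes iterating on selectors cheap and safe.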

### 3. Data Extraction Points
For every car listing, I successfully extracted the following raw attributes:
* **Title:** (e.g., "2015 BMW 5-Series 80k miles") - *Critical for extracting Year/Make/Model.*
* **Price:** The target variable for predictions.
* **Mileage:** A key depreciation factor.
* **Seller Info:** Ratings and review counts (to analyze seller reliability).
* **Listing Type:** Auction vs. Buy It Now (Can be used in pricing logic later).
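
As a hedged sketch of why the title matters, the snippet below pulls a model year out of a title string with a regular expression. The function name `year_from_title` and the 1900 to 2099 range are assumptions made for illustration; the notebook itself reads the year from the listing's `Year:` attribute.

```python
import re

# First standalone 4-digit number that looks like a year between 1900 and 2099
YEAR_PATTERN = re.compile(r"\b(19\d{2}|20\d{2})\b")

def year_from_title(title):
    """Return the first 4-digit year (1900-2099) found in title, or None."""
    match = YEAR_PATTERN.search(title or "")
    return int(match.group(1)) if match else None

print(year_from_title("2015 BMW 5-Series 80k miles"))
print(year_from_title("2004 BMW 5-Series 525i"))
print(year_from_title("Honda Accord"))
```

A pattern like this can misfire on titles that contain other four-digit numbers (for example a mileage such as "2000 miles"), so it is best treated as a fallback when the structured attribute is missing.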

### 4. Challenges & Solutions
* **Anti-Scraping Logic:** The website flagged high-frequency requests. I mitigated this with random delays of 4 to 8 seconds between requests (`time.sleep`), a persistent session with browser-like headers, and manual handling of CAPTCHA interruptions when necessary.
* **Inconsistent Formatting:** Prices appeared as "$10,000" or "10000". I wrote helper functions that strip symbols and separators and convert these to numeric values (floats) during parsing.
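
The price normalization can be checked with a few inputs. This is a minimal sketch that mirrors the notebook's `clean_number` helper under a different, hypothetical name (`normalize_price`); the price-range string is an invented input used only to show how unparseable text is handled.

```python
import re

def normalize_price(text):
    """Return the price as a float, or None if no single number can be parsed."""
    if not text:
        return None
    digits = re.sub(r"[^\d.]", "", text)  # keep digits and the decimal point
    try:
        return float(digits)
    except ValueError:
        return None

with_symbols = normalize_price("$10,000")
plain = normalize_price("10000")
price_range = normalize_price("$10.00 to $20.00")  # hypothetical range text

print(with_symbols, plain, price_range)
```

Both formats map to the same number, while a range collapses to a string with two decimal points and is rejected as `None` instead of being silently misread.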

### Outcome
I successfully scraped and saved a raw dataset (`data/ebay.csv`) containing **4,628 unique rows**. This raw data is now ready for the **Database Integration** phase.