# Introduction

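The CarDataProcessor class loads and processes used-car datasets from multiple cities. It reads Excel files whose columns hold nested, JSON-like strings and converts them into structured data: each string is normalized (single quotes to double quotes, None to null) and then parsed with json.loads, which avoids the recursion depth issues seen with ast.literal_eval.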
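
As a rough illustration of what "JSON-like" means here, the sketch below parses a made-up cell value (the string is an assumption for illustration, not a row from the dataset). It shows that such a string is a Python literal rather than valid JSON, and compares two ways of reading it: ast.literal_eval, and a naive quote/None normalization followed by json.loads.

```python
import ast
import json

# Hypothetical cell value: a Python-style dict literal stored as text.
raw = "{'ft': 'Petrol', 'ownerNo': 3, 'priceActual': None}"

# json.loads rejects single quotes and None, so the raw string is not valid JSON.
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval understands Python literals directly.
record = ast.literal_eval(raw)

# Naive normalization to JSON. This breaks on values that contain
# apostrophes or the substring "None", so it is only a sketch.
normalized = raw.replace("'", '"').replace("None", "null")
record_from_json = json.loads(normalized)

print(parsed_as_json)
print(record == record_from_json)
```

Both routes give the same dictionary for this simple input; they differ mainly in how they cope with deeply nested or irregular strings.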

# Dataset Structure

The dataset consists of four main sections:

1. New Car Detail:

* it (integer): Ignition type.
* ft (string): Fuel type (e.g., Petrol).
* bt (string): Body type (e.g., Hatchback).
* km (string): Kilometers driven.
* transmission (string): Transmission type (e.g., Manual).
* ownerNo (integer): Number of previous owners.
* owner (string): Ownership details.
* oem (string): Original Equipment Manufacturer (e.g., Maruti).
* model (string): Car model (e.g., Maruti Celerio).
* modelYear (integer): Year of car manufacture.
* centralVariantId (integer): Central variant ID.
* variantName (string): Variant name.
* price (string): Price of the used car.
* priceActual (string): Actual price (if available).
* priceSaving (string): Price saving information (if available).
* priceFixedText (string): Fixed price details.
* trendingText (dictionary): Trending car information.

2. New Car Overview:

* heading (string): Car overview heading.
* top (list of dictionaries): Top details, including keys like registration year, insurance validity, fuel type, etc.
* bottomData (None): Additional bottom data (currently not available).

3. New Car Feature:

* heading (string): Features heading.
* top (list of strings): Top features.
* data (list of dictionaries): Detailed feature information categorized by comfort, interior, exterior, safety, etc.

4. New Car Specs:

* heading (string): Specifications heading.
* top (list of dictionaries): Top specifications like mileage, engine, max power, torque, etc.
* data (list of dictionaries): Detailed engine and transmission information, dimensions, capacity, and miscellaneous details.
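
To make the nested layout of the specs section concrete, here is a minimal sketch that flattens one such dictionary into a single-level mapping. The key names (top, data, list, key, value, heading, subHeading) follow the structure described above; the sample values and the helper name flatten_specs are invented for illustration.

```python
def flatten_specs(specs):
    """Flatten a specs dictionary into a single-level {name: value} mapping."""
    if not isinstance(specs, dict):
        return {}
    flat = {}
    # Top-level highlights: a list of {"key": ..., "value": ...} entries.
    for item in specs.get("top", []):
        if isinstance(item, dict):
            flat[item.get("key", "")] = item.get("value", "")
    # Detailed groups: each group carries its own list of key/value entries.
    for group in specs.get("data", []):
        if isinstance(group, dict):
            for spec in group.get("list", []):
                if isinstance(spec, dict):
                    flat[spec.get("key", "")] = spec.get("value", "")
    return flat


# Hypothetical sample record (values are made up).
sample_specs = {
    "heading": "Specifications",
    "top": [
        {"key": "Mileage", "value": "23.1 kmpl"},
        {"key": "Engine", "value": "998 CC"},
    ],
    "data": [
        {"heading": "Engine and Transmission", "subHeading": "Engine",
         "list": [{"key": "No of Cylinder", "value": "3"}]},
        {"heading": "Dimensions & Capacity", "subHeading": "Dimensions",
         "list": [{"key": "Length", "value": "3600mm"}]},
    ],
}

flat = flatten_specs(sample_specs)
print(flat)
```

Each key of the flattened mapping can then become one column of the final table. If the same key appears in both top and data, the later value overwrites the earlier one.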

In [1]:
import json
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
import sys
import re

class CarDataProcessor:
    def __init__(self, file_paths: dict):
        self.file_paths = file_paths
        sys.setrecursionlimit(5000)  # Increase recursion limit for deep nested structures

    def safe_json_eval(self, val):
        """
        Safely evaluate strings containing JSON-like data.
        This avoids recursion depth issues seen with ast.literal_eval.
        """
        if isinstance(val, str):  # Only attempt to parse strings
            try:
                # Replace single quotes with double quotes
                val = re.sub(r"(?<!^)(?<!\\)'(?!$)", '"', val)
                
                # Replace 'None' with 'null'
                val = val.replace("None", "null")
                
                # Try to parse the string as JSON
                return json.loads(val)
            except (json.JSONDecodeError, RecursionError) as e:
                print(f"Error parsing JSON: {e}, value: {val}")  # Debugging
        return {}

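    def convert_price(self, price_str):
        """
        Convert a price string (rupee sign, commas, optional Lakh/Crore suffix)
        to a float in rupees. Returns None if the value is not a string or
        cannot be parsed.
        """
        if not isinstance(price_str, str):
            return None

        # Remove currency symbol and commas
        price_str = price_str.replace('₹', '').replace(',', '').strip()

        try:
            # Handle Lakh (1 Lakh = 100,000)
            if 'Lakh' in price_str:
                return float(price_str.split(' ')[0]) * 1e5
            # Handle Crore (1 Crore = 10,000,000)
            if 'Crore' in price_str:
                return float(price_str.split(' ')[0]) * 1e7
            # Handle plain number
            return float(price_str)
        except ValueError:
            # Empty or malformed price strings (e.g. the '' default used for missing prices)
            return None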

    def load_and_process_city(self, city, path):
        """
        Load and process the data for a single city.
        """
        df = pd.read_excel(path)

        # Apply safe_json_eval with error handling
        df['new_car_detail'] = df['new_car_detail'].apply(self.safe_json_eval)
        df['new_car_overview'] = df['new_car_overview'].apply(self.safe_json_eval)
        df['new_car_feature'] = df['new_car_feature'].apply(self.safe_json_eval)
        df['new_car_specs'] = df['new_car_specs'].apply(self.safe_json_eval)
        
        # Add city information to the dataframe
        df['City'] = city
        df['City'] = df['City'].str.title()
    
        # Extract details from 'new_car_detail'
        df['FuelType'] = df['new_car_detail'].apply(lambda x: x.get('ft', '') if isinstance(x, dict) else '')
        df['BodyType'] = df['new_car_detail'].apply(lambda x: x.get('bt', '') if isinstance(x, dict) else '')
        df['KmsDriven'] = df['new_car_detail'].apply(lambda x: x.get('km', '') if isinstance(x, dict) else '')
        df['TransmissionType'] = df['new_car_detail'].apply(lambda x: x.get('transmission', '') if isinstance(x, dict) else '')
        df['NumberOwner'] = df['new_car_detail'].apply(lambda x: x.get('ownerNo', '') if isinstance(x, dict) else '')
        df['Manufacturer'] = df['new_car_detail'].apply(lambda x: x.get('oem', '') if isinstance(x, dict) else '')
        df['CarModel'] = df['new_car_detail'].apply(lambda x: x.get('model', '') if isinstance(x, dict) else '')
        df['ModelYear'] = df['new_car_detail'].apply(lambda x: x.get('modelYear', '') if isinstance(x, dict) else '')
        df['CentralVariantId'] = df['new_car_detail'].apply(lambda x: x.get('centralVariantId', '') if isinstance(x, dict) else '')
        df['VariantName'] = df['new_car_detail'].apply(lambda x: x.get('variantName', '') if isinstance(x, dict) else '')
        df['Price'] = df['new_car_detail'].apply(lambda x: x.get('price', '') if isinstance(x, dict) else '')
        df['Price'] = df['Price'].apply(self.convert_price)
        df['Top_key'] = df['new_car_overview'].apply(lambda x: x.get('top', []) if isinstance(x, dict) else [])
        # The first entry of 'top' is read as the registration year, the second as the insurance detail
        df['RegistrationYear'] = df['Top_key'].apply(lambda x: x[0].get('value', '') if isinstance(x, list) and len(x) > 0 and isinstance(x[0], dict) else '')
        df['Insurance'] = df['Top_key'].apply(lambda x: x[1].get('value', '') if isinstance(x, list) and len(x) > 1 and isinstance(x[1], dict) else '')
        
        df['Top specification'] = df['new_car_specs'].apply(lambda x: x.get('top', []) if isinstance(x, dict) else '')
        df['Detailed engine'] = df['new_car_specs'].apply(lambda x: x.get('data', []) if isinstance(x, dict) else '')
                
        # Extract details from 'Specifications'
        def extract_specifications(specs):
            if isinstance(specs, dict):
                top_specs = {item.get('key', ''): item.get('value', '') for item in specs.get('top', []) if isinstance(item, dict)}
                data_specs = {item['heading'] + ' - ' + item['subHeading']: {spec.get('key', ''): spec.get('value', '') for spec in item.get('list', [])} 
                              for item in specs.get('data', []) if isinstance(item, dict)}
                # Flatten the specifications into columns
                flat_specs = {**top_specs, **{k: v for sub_dict in data_specs.values() for k, v in sub_dict.items()}}
                return flat_specs
            return {}

        # Apply the extraction function and create new columns
        specs_df = df['new_car_specs'].apply(extract_specifications).apply(pd.Series)
        df = pd.concat([df, specs_df], axis=1)
        
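        # Drop the raw nested columns and helper columns that are no longer needed;
        # errors='ignore' skips any listed column that a city's file does not contain
        df.drop(columns=['new_car_detail', 'new_car_overview', 'new_car_feature', 'new_car_specs', 'car_links',
                         'Top_key', 'Top specification', 'Detailed engine', 'Ground Clearance Unladen'],
                errors='ignore', inplace=True)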
        
        return df

    def process_data(self):
        """
        Process all city data using threading to speed up the process.
        """
        processed_data = []
        
        with ThreadPoolExecutor() as executor:
            # Use threads to process each city dataset concurrently
            futures = [executor.submit(self.load_and_process_city, city, path) for city, path in self.file_paths.items()]
            for future in futures:
                processed_data.append(future.result())

        # Combine all processed city data into a single DataFrame
        combined_df = pd.concat(processed_data, ignore_index=True)
        return combined_df
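
The positional lookups in load_and_process_city (x[0] for the registration year, x[1] for the insurance detail) assume that the entries of the top list always come in the same order. A more defensive alternative, sketched below, looks entries up by their key name. The helper top_value and the key strings "Registration Year" and "Insurance Validity" are assumptions for illustration; the actual key names should be checked against the data.

```python
def top_value(top, key, default=""):
    """Return the 'value' of the first entry in `top` whose 'key' matches."""
    if not isinstance(top, list):
        return default
    for entry in top:
        if isinstance(entry, dict) and entry.get("key") == key:
            return entry.get("value", default)
    return default


# Hypothetical overview entries, deliberately not in the usual order.
sample_top = [
    {"key": "Insurance Validity", "value": "Third Party insurance"},
    {"key": "Registration Year", "value": "2015"},
]

registration_year = top_value(sample_top, "Registration Year")
insurance = top_value(sample_top, "Insurance Validity")
missing = top_value(sample_top, "Fuel Type")  # key not present, falls back to default

print(registration_year, insurance, repr(missing))
```

With this approach a reordered or shortened top list yields an empty string for the missing field instead of a value from the wrong entry.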

# Attributes:

* file_paths : A dictionary where each key is the name of a city (e.g., 'bangalore', 'kolkata') and each value is the file path to that city's dataset (Excel file).

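* structured_data : Not an attribute of the class. It is the variable that holds the single combined DataFrame returned by process_data(), containing the processed rows of all cities with a City column identifying each row's source.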
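
The way file_paths drives the concurrent loading can be sketched without any Excel files. In the example below, load_city is a stand-in for load_and_process_city that returns one fake row per city, and the two paths are placeholders; only the submit/collect pattern mirrors process_data.

```python
from concurrent.futures import ThreadPoolExecutor


def load_city(city, path):
    """Stand-in loader: returns one fake row instead of reading an Excel file."""
    return [{"City": city.title(), "source": path}]


# Placeholder paths, for illustration only.
file_paths = {
    "bangalore": "bangalore_cars.xlsx",
    "kolkata": "kolkata_cars.xlsx",
}

with ThreadPoolExecutor() as executor:
    # One task per city; tasks may finish in any order.
    futures = {
        city: executor.submit(load_city, city, path)
        for city, path in file_paths.items()
    }
    # Collect results in dictionary order so the combined output is deterministic.
    rows = [row for city in file_paths for row in futures[city].result()]

print(rows)
```

Calling result() on each future also re-raises any exception from that city's task, so a failed load surfaces in the main thread rather than being silently dropped.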

In [2]:
file_paths = {
    'bangalore': r"C:\Users\Lenovo\Downloads\data set\bangalore_cars.xlsx",
    'kolkata': r"C:\Users\Lenovo\Downloads\data set\kolkata_cars.xlsx",        
    'hyderabad': r"C:\Users\Lenovo\Downloads\data set\hyderabad_cars.xlsx",
    'delhi': r"C:\Users\Lenovo\Downloads\data set\delhi_cars.xlsx",
    'jaipur': r"C:\Users\Lenovo\Downloads\data set\jaipur_cars.xlsx",
    'chennai': r"C:\Users\Lenovo\Downloads\data set\chennai_cars.xlsx",
}

processor = CarDataProcessor(file_paths)
structured_data = processor.process_data()
structured_data.head()

| Unnamed: 0 | City | FuelType | BodyType | KmsDriven | TransmissionType | NumberOwner | Manufacturer | CarModel | ModelYear | CentralVariantId | ... | Turning Radius | Front Brake Type | Rear Brake Type | Top Speed | Acceleration | Tyre Type | No Door Numbers | Cargo Volumn | Wheel Size | Alloy Wheel Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bangalore | Petrol | Hatchback | 120000 | Manual | 3 | Maruti | Maruti Celerio | 2015 | 3979 | ... | 4.7 metres | Ventilated Disc | Drum | 150 Kmph | 15.05 Seconds | Tubeless, Radial | 5 | 235-litres | | |
| 1 | Bangalore | Petrol | SUV | 32706 | Manual | 2 | Ford | Ford Ecosport | 2018 | 6087 | ... | 5.3 metres | Ventilated Disc | Drum | | | Tubeless,Radial | 4 | 352-litres | 16.0 | 16.0 |
| 2 | Bangalore | Petrol | Hatchback | 11949 | Manual | 1 | Tata | Tata Tiago | 2018 | 2983 | ... | 4.9 meters | Disc | Drum | 150 kmph | 14.3 Seconds | Tubeless | 5 | 242-litres | 14.0 | 14.0 |
| 3 | Bangalore | Petrol | Sedan | 17794 | Manual | 1 | Hyundai | Hyundai Xcent | 2014 | 1867 | ... | 4.7 metres | Disc | Drum | 172km/hr | 14.2 Seconds | Tubeless,Radial | 4 | 407-litres | 14.0 | 14.0 |
| 4 | Bangalore | Diesel | SUV | 60000 | Manual | 1 | Maruti | Maruti SX4 S Cross | 2015 | 4277 | ... | 5.2 meters | Ventilated Disc | Solid Disc | 190 Kmph | 12 Seconds | Tubeless,Radial | 5 | 353-litres | 16.0 | 16.0 |


# Save Dataset

Save the converted data as a CSV file inside the data folder.

In [3]:
structured_data.to_csv('data\\structured_data_car_price_prediction.csv', index=False)
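
Note that to_csv does not create missing directories, so the data folder has to exist before the call above. A minimal sketch of preparing the output path with pathlib follows; the helper name prepare_output_path is hypothetical, and a temporary directory stands in for the project root so the example leaves nothing behind.

```python
from pathlib import Path
import tempfile


def prepare_output_path(base_dir, filename="structured_data_car_price_prediction.csv"):
    """Create <base_dir>/data if needed and return the full CSV path."""
    out_dir = Path(base_dir) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out_dir / filename


with tempfile.TemporaryDirectory() as tmp:
    out_path = prepare_output_path(tmp)
    # In the notebook this is where the save would go:
    # structured_data.to_csv(out_path, index=False)
    folder_exists = out_path.parent.is_dir()
    file_name = out_path.name

print(folder_exists, file_name)
```

Building the path with pathlib also avoids hard-coding the Windows-style backslash separator, so the same cell works on other operating systems.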