# Exploratory Data Analysis on Used Car Prices

### Project Objective

The objective of this project is to perform Exploratory Data Analysis (EDA) on a used car dataset to understand the factors influencing used car prices.

This analysis aims to:
- Understand the structure and quality of the dataset
- Identify key variables affecting selling price
- Detect data quality issues such as missing values and incorrect data types
- Prepare a strong foundation for data cleaning, visualization, and feature engineering


#### Connecting to the Kaggle Dataset with the Kaggle API

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("sukhmandeepsinghbrar/car-price-prediction-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/sukhmandeepsinghbrar/car-price-prediction-dataset?dataset_version_number=1...


100%|██████████| 141k/141k [00:00<00:00, 46.6MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/sukhmandeepsinghbrar/car-price-prediction-dataset/versions/1





#### Importing the necessary libraries

In [3]:
import pandas as pd
import numpy as np

#### Load the dataset

In [4]:
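# index_col=False tells pandas not to use the first column as the index
data = pd.read_csv(path + '/cardekho.csv', index_col=False)
# read_csv already returns a DataFrame; work on a copy so the raw data stays untouched
df = data.copy()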

#### Preview the dataset

In [5]:
df.shape

(8128, 12)

#### Observations
The table has 8,128 rows and 12 columns.

In [6]:
df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage(km/ltr/kg),engine,max_power,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4,1248.0,74.0,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14,1498.0,103.52,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7,1497.0,78.0,5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0,1396.0,90.0,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1,1298.0,88.2,5.0


In [7]:
df.tail()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage(km/ltr/kg),engine,max_power,seats
8123,Hyundai i20 Magna,2013,320000,110000,Petrol,Individual,Manual,First Owner,18.5,1197.0,82.85,5.0
8124,Hyundai Verna CRDi SX,2007,135000,119000,Diesel,Individual,Manual,Fourth & Above Owner,16.8,1493.0,110.0,5.0
8125,Maruti Swift Dzire ZDi,2009,382000,120000,Diesel,Individual,Manual,First Owner,19.3,1248.0,73.9,5.0
8126,Tata Indigo CR4,2013,290000,25000,Diesel,Individual,Manual,First Owner,23.57,1396.0,70.0,5.0
8127,Tata Indigo CR4,2013,290000,25000,Diesel,Individual,Manual,First Owner,23.57,1396.0,70.0,5.0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   name                8128 non-null   object 
 1   year                8128 non-null   int64  
 2   selling_price       8128 non-null   int64  
 3   km_driven           8128 non-null   int64  
 4   fuel                8128 non-null   object 
 5   seller_type         8128 non-null   object 
 6   transmission        8128 non-null   object 
 7   owner               8128 non-null   object 
 8   mileage(km/ltr/kg)  7907 non-null   float64
 9   engine              7907 non-null   float64
 10  max_power           7913 non-null   object 
 11  seats               7907 non-null   float64
dtypes: float64(3), int64(3), object(6)
memory usage: 762.1+ KB


#### Observations
- The dataset contains a mix of data types:
  - Numerical columns (int64, float64): 6
  - Object (string) columns: 6, including max_power
- There are missing values in:
  - mileage(km/ltr/kg), engine, max_power, seats
- max_power is stored as object even though it holds numeric values, so it needs to be converted to float.
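
The type problem noted for max_power can be illustrated with a minimal sketch. It assumes the non-numeric entries are blanks or stray strings, which the output above does not confirm, and the sample values and the helper name `to_numeric_column` are hypothetical.

```python
import pandas as pd

def to_numeric_column(series: pd.Series) -> pd.Series:
    """Convert an object column to float; entries that cannot be parsed become NaN."""
    return pd.to_numeric(series, errors="coerce")

# Hypothetical sample values; in the notebook this would be df["max_power"].
sample = pd.Series(["74", "103.52", " ", None, "bhp"], dtype="object")
converted = to_numeric_column(sample)

print(converted.dtype)
print(converted.isna().sum(), "values could not be parsed")
```

Coercing instead of raising keeps the rows, so any unparseable values appear as extra missing values that can be reviewed before imputation.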

In [9]:
df.describe(include='all')

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage(km/ltr/kg),engine,max_power,seats
count,8128,8128.0,8128.0,8128.0,8128,8128,8128,8128,7907.0,7907.0,7913.0,7907.0
unique,2058,,,,4,3,2,5,,,320.0,
top,Maruti Swift Dzire VDI,,,,Diesel,Individual,Manual,First Owner,,,74.0,
freq,129,,,,4402,6766,7078,5289,,,377.0,
mean,,2013.804011,638271.8,69819.51,,,,,19.418783,1458.625016,,5.416719
std,,4.044249,806253.4,56550.55,,,,,4.037145,503.916303,,0.959588
min,,1983.0,29999.0,1.0,,,,,0.0,624.0,,2.0
25%,,2011.0,254999.0,35000.0,,,,,16.78,1197.0,,5.0
50%,,2015.0,450000.0,60000.0,,,,,19.3,1248.0,,5.0
75%,,2017.0,675000.0,98000.0,,,,,22.32,1582.0,,5.0


In [10]:
df.nunique()

Unnamed: 0,0
name,2058
year,29
selling_price,677
km_driven,921
fuel,4
seller_type,3
transmission,2
owner,5
mileage(km/ltr/kg),381
engine,121


In [11]:
df.isna().sum()

Unnamed: 0,0
name,0
year,0
selling_price,0
km_driven,0
fuel,0
seller_type,0
transmission,0
owner,0
mileage(km/ltr/kg),221
engine,221


In [12]:
(df.isnull().sum() / len(df)) * 100

Unnamed: 0,0
name,0.0
year,0.0
selling_price,0.0
km_driven,0.0
fuel,0.0
seller_type,0.0
transmission,0.0
owner,0.0
mileage(km/ltr/kg),2.718996
engine,2.718996
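
The count and percentage views above could also be combined into one table. The following is a rough sketch on a small made-up frame; the helper name `missing_summary` and the toy values are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain missing values.
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

# Toy data for illustration only; in the notebook, pass df instead.
toy = pd.DataFrame({
    "engine": [1248.0, np.nan, 1497.0, 1396.0],
    "seats": [5.0, 5.0, 5.0, 5.0],
    "mileage": [np.nan, np.nan, 17.7, 23.0],
})
report = missing_summary(toy)
print(report)
```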


In [16]:
df['fuel'].value_counts()

fuel,count
Diesel,4402
Petrol,3631
CNG,57
LPG,38


In [17]:
df['seller_type'].value_counts()

seller_type,count
Individual,6766
Dealer,1126
Trustmark Dealer,236


In [18]:
df['owner'].value_counts()

owner,count
First Owner,5289
Second Owner,2105
Third Owner,555
Fourth & Above Owner,174
Test Drive Car,5


In [19]:
df['transmission'].value_counts()

transmission,count
Manual,7078
Automatic,1050


In [20]:
df['seats'].value_counts()

seats,count
5.0,6254
7.0,1120
8.0,236
4.0,133
9.0,80
6.0,62
10.0,19
2.0,2
14.0,1
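
Several of the counts above include very small categories (for example, CNG and LPG in fuel, or Test Drive Car in owner). One possible way to handle them is to fold categories below a chosen frequency into a single label. This is only a sketch: the helper `group_rare`, the threshold, and the "Other" label are all assumptions, and whether grouping is appropriate depends on the later analysis.

```python
import pandas as pd

def group_rare(series: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace categories that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with the fallback label.
    return series.where(~series.isin(rare), other_label)

# Toy data for illustration only; in the notebook this would be df["fuel"].
fuel = pd.Series(["Diesel"] * 5 + ["Petrol"] * 4 + ["CNG", "LPG"])
grouped = group_rare(fuel, min_count=2)
print(grouped.value_counts())
```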


### Data Quality Issues Identified

| Issue | Column | Description | Planned Action |
|-----|------|------------|----------------|
| Missing values | mileage, engine, seats, max_power | Missing technical specifications for some vehicles | Impute using median (robust to outliers), after converting max_power to numeric |
| Incorrect data type | max_power | Stored as object | Extract numeric values and convert to float |
| High cardinality | name | Large number of unique car models | Extract brand to reduce dimensionality |
| Potential outliers | selling_price, km_driven | Extreme values may distort analysis | Detect using IQR and treat accordingly |
| Rare categories | owner, fuel, seats | Some categories have very low frequency | Review and group if required during EDA |
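
Three of the planned actions in the table can be sketched as small helpers: median imputation, brand extraction, and IQR-based outlier detection. This is a minimal illustration under stated assumptions, not the final cleaning code. The function names are hypothetical, the brand is assumed to be the first word of `name` (which would truncate any multi-word brand), the 1.5 multiplier is the conventional IQR rule rather than something the source specifies, and the numeric toy values are made up.

```python
import numpy as np
import pandas as pd

def impute_median(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Fill missing values in the given numeric columns with each column's median."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out

def extract_brand(names: pd.Series) -> pd.Series:
    """Take the first whitespace-separated token of the car name as the brand."""
    return names.str.split().str[0]

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data for illustration only; the numeric values are hypothetical.
toy = pd.DataFrame({
    "name": ["Maruti Swift Dzire VDI", "Skoda Rapid 1.5 TDI Ambition",
             "Honda City 2017-2020 EXi", "Hyundai i20 Sportz Diesel",
             "Tata Indigo CR4"],
    "engine": [1248.0, 1498.0, np.nan, 1396.0, 1396.0],
    "selling_price": [450000, 370000, 158000, 225000, 5000000],
})

clean = impute_median(toy, ["engine"])
brands = extract_brand(toy["name"])
mask = iqr_outlier_mask(toy["selling_price"])

print(clean["engine"].tolist())
print(brands.tolist())
print(toy.loc[mask, ["name", "selling_price"]])
```

The mask only flags rows. Whether flagged prices are removed, capped, or kept is a separate decision, in line with the "treat accordingly" note in the table.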
