Goals:
- Deal with null data points
- Clean the data (get rid of the units)
- Split the data (train / test)
- Save the clean data in data/processed

In [1103]:
import pandas as pd
import numpy as np  

In [1104]:
file_loc = "../data/raw/Car details v3.csv"
df = pd.read_csv(file_loc)
#The data source is the following https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho


In [1105]:
df = df.drop_duplicates()

In [1106]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6926 entries, 0 to 8125
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           6926 non-null   object 
 1   year           6926 non-null   int64  
 2   selling_price  6926 non-null   int64  
 3   km_driven      6926 non-null   int64  
 4   fuel           6926 non-null   object 
 5   seller_type    6926 non-null   object 
 6   transmission   6926 non-null   object 
 7   owner          6926 non-null   object 
 8   mileage        6718 non-null   object 
 9   engine         6718 non-null   object 
 10  max_power      6721 non-null   object 
 11  torque         6717 non-null   object 
 12  seats          6718 non-null   float64
dtypes: float64(1), int64(3), object(9)
memory usage: 757.5+ KB


In [1107]:
df.describe()

Unnamed: 0,year,selling_price,km_driven,seats
count,6926.0,6926.0,6926.0,6718.0
mean,2013.4203,517270.7,73995.68,5.434653
std,4.078286,519767.0,58358.1,0.98423
min,1983.0,29999.0,1.0,2.0
25%,2011.0,250000.0,40000.0,5.0
50%,2014.0,400000.0,70000.0,5.0
75%,2017.0,633500.0,100000.0,5.0
max,2020.0,10000000.0,2360457.0,14.0


In [1108]:
df.isnull().mean()

name             0.000000
year             0.000000
selling_price    0.000000
km_driven        0.000000
fuel             0.000000
seller_type      0.000000
transmission     0.000000
owner            0.000000
mileage          0.030032
engine           0.030032
max_power        0.029599
torque           0.030176
seats            0.030032
dtype: float64

Let's check whether the null values overlap, i.e. whether the same rows are missing several columns at once. That would justify dropping those rows.

In [1109]:
#Look at the rows with null values
df[pd.isnull(df).any(axis=1)]

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
13,Maruti Swift 1.3 VXi,2007,200000,80000,Petrol,Individual,Manual,Second Owner,,,,,
31,Fiat Palio 1.2 ELX,2003,70000,50000,Petrol,Individual,Manual,Second Owner,,,,,
78,Tata Indica DLS,2003,50000,70000,Diesel,Individual,Manual,First Owner,,,,,
87,Maruti Swift VDI BSIV W ABS,2015,475000,78000,Diesel,Dealer,Manual,First Owner,,,,,
119,Maruti Swift VDI BSIV,2010,300000,120000,Diesel,Individual,Manual,Second Owner,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7740,Hyundai Santro Xing XG,2004,70000,70000,Petrol,Individual,Manual,Second Owner,,,,,
7996,Hyundai Santro LS zipPlus,2000,140000,50000,Petrol,Individual,Manual,Second Owner,,,,,
8009,Hyundai Santro Xing XS eRLX Euro III,2006,145000,80000,Petrol,Individual,Manual,Second Owner,,,,,
8068,Ford Figo Aspire Facelift,2017,580000,165000,Diesel,Individual,Manual,First Owner,,,,,
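
One way to quantify the overlap, rather than eyeballing the table, is to count how many of the five spec columns are null in each row. The sketch below uses a small made-up frame; the `toy` data and the names `spec_cols`, `nulls_per_row`, and `null_distribution` are illustrative assumptions and not part of the notebook. Applied to `df`, a distribution concentrated at 0 and 5 would suggest that the nulls overlap almost completely.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df (hypothetical values)
toy = pd.DataFrame({
    "mileage": ["23.4 kmpl", np.nan, np.nan, "18.9 kmpl"],
    "engine": ["1248 CC", np.nan, np.nan, "998 CC"],
    "max_power": ["74 bhp", np.nan, "67 bhp", "67.1 bhp"],
    "torque": ["190Nm@ 2000rpm", np.nan, np.nan, "90Nm@ 3500rpm"],
    "seats": [5.0, np.nan, np.nan, 5.0],
})

spec_cols = ["mileage", "engine", "max_power", "torque", "seats"]

# Number of null spec columns in each row
nulls_per_row = toy[spec_cols].isnull().sum(axis=1)

# How many rows have 0, 1, ..., 5 null spec columns
null_distribution = nulls_per_row.value_counts().sort_index()
print(null_distribution)
```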


In [1110]:
#These 209 rows have null values for multiple columns, so we will drop them
df=df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6717 entries, 0 to 8125
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           6717 non-null   object 
 1   year           6717 non-null   int64  
 2   selling_price  6717 non-null   int64  
 3   km_driven      6717 non-null   int64  
 4   fuel           6717 non-null   object 
 5   seller_type    6717 non-null   object 
 6   transmission   6717 non-null   object 
 7   owner          6717 non-null   object 
 8   mileage        6717 non-null   object 
 9   engine         6717 non-null   object 
 10  max_power      6717 non-null   object 
 11  torque         6717 non-null   object 
 12  seats          6717 non-null   float64
dtypes: float64(1), int64(3), object(9)
memory usage: 734.7+ KB


In [1111]:
df.sample(5)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
6798,Tata Indigo CR4,2011,200000,70000,Diesel,Individual,Manual,Second Owner,23.57 kmpl,1396 CC,70 bhp,140Nm@ 1800-3000rpm,5.0
6311,Mahindra Renault Logan 1.5 DLE Diesel,2007,150000,120000,Diesel,Individual,Manual,Second Owner,19.2 kmpl,1461 CC,65 bhp,"16@ 2,000(kgm@ rpm)",5.0
7628,Ford EcoSport 1.5 TDCi Titanium BSIV,2017,934000,101000,Diesel,Individual,Manual,First Owner,22.77 kmpl,1498 CC,98.59 bhp,205Nm@ 1750-3250rpm,5.0
1028,Hyundai i10 Magna 1.1L,2013,310000,70000,Petrol,Individual,Manual,First Owner,19.81 kmpl,1086 CC,68.05 bhp,99.04Nm@ 4500rpm,5.0
5419,Hyundai i10 Sportz Option,2011,220000,70000,Petrol,Individual,Manual,Third Owner,20.36 kmpl,1197 CC,78.9 bhp,111.8Nm@ 4000rpm,5.0


The columns "selling_price", "km_driven", "year", and "seats" are already numeric (int64 or float64), so there is no need to change them. The columns "mileage", "engine", and "max_power" are strings that include units, so we may need to do unit conversion and then convert them to float values. The "torque" column mixes several formats and needs special attention. The columns "fuel", "seller_type", "transmission", and "owner" are categorical features.

First, we will work on the "engine", "mileage", and "max_power" columns.

In [1112]:
units_features = ["engine", "mileage", "max_power"]

for feature in units_features:
    print(df[feature].apply(lambda string : string.split(" ")[1]).unique())


['CC']
['kmpl' 'km/kg']
['bhp']
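
The split-based check above relies on every value containing exactly one space between the number and the unit. A regular expression is a slightly more tolerant alternative. The following is a minimal sketch on made-up values; the series `s`, the frame `parts`, and the group names `value` and `unit` are assumptions for illustration.

```python
import pandas as pd

# Toy values in the same "<number> <unit>" shape as the columns above (hypothetical)
s = pd.Series(["23.4 kmpl", "17.3 km/kg", "1248 CC", "74 bhp"])

# Capture the numeric part and the unit part as two named columns;
# rows that do not match the pattern would come back as NaN
parts = s.str.extract(r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S+)\s*$")
parts["value"] = parts["value"].astype(float)

print(parts["unit"].unique())
```

Rows that fail to match show up as NaN in both columns, which makes malformed entries easy to spot before converting.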


Mileage has two different units. Let's have a look at those.

In [1113]:
#cond1: To see how many rows have km/kg
#cond2: To see how many rows have fuel CNG or LPG  
cond1 = df["mileage"].apply(lambda string : string.split(" ")[1]) == "km/kg"   
cond2 = (df["fuel"] == "CNG") | (df["fuel"] == "LPG")
print(len(df[cond1]))
print(len(df[cond2]))
print(len(df[cond1 & cond2]))
print(len(df[cond1])/len(df))
print(len(df) - len(df[cond1]))

86
86
86
0.012803334822093197
6631


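Only about 1.3% of the rows (86 of 6717) report mileage in km/kg, and these are exactly the CNG and LPG cars. That is too few rows to represent these fuel types well, so we will drop them.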

In [1114]:
#This is only about 1.3% of the data, so we can drop the corresponding rows

df=df.drop(df[cond1].index)

In [1115]:
def drop_units(text : str) -> float :
    """Takes a value with units (eg. 125 bhp). Strips the units part and outputs the value"""

    return float(text.split(" ")[0])

Strip the units from the "engine" and "mileage" columns.

In [1116]:
feature_units = {"mileage" : "kmpl", "engine" : "bhp"}  #mileage is in kmpl; engine is in CC, so the "bhp" suffix (column engine_bhp) is a misnomer

for feature in feature_units.keys():
    df[f"{feature}_{feature_units[feature]}"] = df[feature].apply(drop_units)


df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6631 entries, 0 to 8125
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           6631 non-null   object 
 1   year           6631 non-null   int64  
 2   selling_price  6631 non-null   int64  
 3   km_driven      6631 non-null   int64  
 4   fuel           6631 non-null   object 
 5   seller_type    6631 non-null   object 
 6   transmission   6631 non-null   object 
 7   owner          6631 non-null   object 
 8   mileage        6631 non-null   object 
 9   engine         6631 non-null   object 
 10  max_power      6631 non-null   object 
 11  torque         6631 non-null   object 
 12  seats          6631 non-null   float64
 13  mileage_kmpl   6631 non-null   float64
 14  engine_bhp     6631 non-null   float64
dtypes: float64(3), int64(3), object(9)
memory usage: 828.9+ KB
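
The loop above converts "mileage" and "engine" only; "max_power" still holds strings. It could be converted in a similar way. The sketch below is an assumption about how that might look, not the notebook's own method: the column name `max_power_bhp` and the toy values are hypothetical, and `pd.to_numeric(..., errors="coerce")` is used so that a value with no numeric part (for example a bare " bhp") becomes NaN instead of raising. Whether such values actually occur in `df` is not checked here.

```python
import pandas as pd

# Toy values (hypothetical); the last one has a unit but no number
toy = pd.DataFrame({"max_power": ["74 bhp", "103.52 bhp", " bhp"]})

# Take the first whitespace-separated token and coerce non-numeric tokens to NaN
toy["max_power_bhp"] = pd.to_numeric(
    toy["max_power"].str.split().str[0], errors="coerce"
)

print(toy)
```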


Next, we will extract data from the torque values. Unfortunately, the torque values come with different types of units. At first glance, the torque is given in either Nm (newton-metre) or kgm (kilogram-metre), and there are also rpm values. We would like to know which units we have been given.

The following function takes a string and returns the same string without any numbers.

In [1117]:
def eat_numbers(text:str ) -> str:
    """Takes a string and outputs the same string without numbers. Example: "No 1 DJ" -> "No  DJ", "50kg" -> "kg" """
    return ''.join(["" if char.isdigit() else char for char in text])

assert(eat_numbers("No 1 DJ") == "No  DJ")
assert(eat_numbers("115Nm@ 3500-3600rpm") == "Nm@ -rpm")

import re 

def replace_numbers_with_star(text : str) -> str:
    """Takes a string, replaces each number (including decimals) with *, removes spaces, and lowercases the result. Example: "No 10 DJ" -> "no*dj", "50kg over 500m" -> "*kgover*m" """
    new_text =  re.sub(r'\d+', '*', text)
    return new_text.replace('*.*', '*').replace(' ', '').lower()

replace_numbers_with_star("50kg over 500m")

'*kgover*m'

Let's check what units are there.

In [1118]:
df["torque"].apply(replace_numbers_with_star).unique()

array(['*nm@*rpm', '*nm@*-*rpm', '*@*,*(kgm@rpm)', '*kgmat*-*rpm',
       '*kgm@*rpm', '*nm@*~*rpm', '*nmat*rpm', '*@*-*rpm', '*nm',
       '*kgmat*rpm', '*kgmat*,*rpm', '*@*-*(kgm@rpm)', '*nmat*-*rpm',
       '*@*,*-*,*(kgm@rpm)', '*nm(*kgm)@*rpm', '*nm@*-*', '*nm@*+/-*rpm',
       '*@*,*+/-*(nm@rpm)', '*@*-*', '*(*)@*', '*nm/*rpm', '*@*(kgm@rpm)',
       '*nm@*,*rpm', '*nm@*', '*/*', '*nmat*,*-*,*rpm'], dtype=object)

There are too many different formats. Some have a range and some don't. Some torque values are in Nm whereas others are in kgm. For some rows, either the torque or the rpm value is missing. Therefore, as a group we decided to drop the column. One could extract reasonable information from it, but that is a separate project of its own.
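
For reference, here is a rough sketch of what a partial extraction could look like. It is not used in this notebook. The function name `torque_to_nm` and its rules are assumptions: it takes the first number in the string as the torque, treats the first unit that appears (Nm or kgm) as belonging to it, converts kgm to Nm with 1 kgf·m = 9.80665 N·m, and gives up when no unit is found. It ignores the rpm part entirely and has not been checked against every pattern listed above.

```python
import re
from typing import Optional

KGM_TO_NM = 9.80665  # 1 kgf·m expressed in N·m

def torque_to_nm(text: str) -> Optional[float]:
    """Return the leading torque figure in Nm, or None if no unit is recognisable."""
    lowered = text.lower()
    number = re.search(r"\d+(?:\.\d+)?", lowered)
    if number is None:
        return None
    value = float(number.group())
    nm_pos = lowered.find("nm")
    kgm_pos = lowered.find("kgm")
    if nm_pos == -1 and kgm_pos == -1:
        return None  # no torque unit in the string
    # When both units appear, assume the first one belongs to the leading number
    if kgm_pos != -1 and (nm_pos == -1 or kgm_pos < nm_pos):
        return value * KGM_TO_NM
    return value

# Two strings from the sample above, plus two made-up ones matching other patterns
for raw in ["140Nm@ 1800-3000rpm", "16@ 2,000(kgm@ rpm)",
            "110Nm(11.2kgm)@ 4800rpm", "190@ 1750-3000"]:
    print(raw, "->", torque_to_nm(raw))
```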

In [1119]:
df = df.drop("torque", axis = 1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6631 entries, 0 to 8125
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           6631 non-null   object 
 1   year           6631 non-null   int64  
 2   selling_price  6631 non-null   int64  
 3   km_driven      6631 non-null   int64  
 4   fuel           6631 non-null   object 
 5   seller_type    6631 non-null   object 
 6   transmission   6631 non-null   object 
 7   owner          6631 non-null   object 
 8   mileage        6631 non-null   object 
 9   engine         6631 non-null   object 
 10  max_power      6631 non-null   object 
 11  seats          6631 non-null   float64
 12  mileage_kmpl   6631 non-null   float64
 13  engine_bhp     6631 non-null   float64
dtypes: float64(3), int64(3), object(8)
memory usage: 777.1+ KB


Now let's check the number of unique values for the different categorical variables, starting with the names.

In [1120]:
df["name"].nunique()  #This doesn't seem like a useful feature to include in price prediction

1947

In [1121]:
df["name"].str.split().str[0].unique()

array(['Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford', 'Renault',
       'Mahindra', 'Tata', 'Chevrolet', 'Datsun', 'Jeep', 'Mercedes-Benz',
       'Mitsubishi', 'Audi', 'Volkswagen', 'BMW', 'Nissan', 'Lexus',
       'Jaguar', 'Land', 'MG', 'Volvo', 'Daewoo', 'Kia', 'Fiat', 'Force',
       'Ambassador', 'Ashok', 'Isuzu', 'Opel'], dtype=object)

In [1122]:
df["name"].str.split().str[0].nunique()

31

In [1123]:
df["brand"]=df["name"].str.split().str[0]
df["brand"]=df["brand"].replace({"Land":"Land Rover"})
print(df["brand"].nunique())
print(df["brand"].unique())

31
['Maruti' 'Skoda' 'Honda' 'Hyundai' 'Toyota' 'Ford' 'Renault' 'Mahindra'
 'Tata' 'Chevrolet' 'Datsun' 'Jeep' 'Mercedes-Benz' 'Mitsubishi' 'Audi'
 'Volkswagen' 'BMW' 'Nissan' 'Lexus' 'Jaguar' 'Land Rover' 'MG' 'Volvo'
 'Daewoo' 'Kia' 'Fiat' 'Force' 'Ambassador' 'Ashok' 'Isuzu' 'Opel']


In [1124]:
categorical_features = ["fuel", "seller_type", "transmission", "owner"]

for feature in categorical_features:
    print(df[feature].value_counts())
    print(" ")

fuel
Diesel    3658
Petrol    2973
Name: count, dtype: int64
 
seller_type
Individual          5939
Dealer               665
Trustmark Dealer      27
Name: count, dtype: int64
 
transmission
Manual       6056
Automatic     575
Name: count, dtype: int64
 
owner
First Owner             4126
Second Owner            1861
Third Owner              486
Fourth & Above Owner     153
Test Drive Car             5
Name: count, dtype: int64
 


We can merge some categorical values with others. For example, we can merge "Dealer" and "Trustmark Dealer". I suggest we do this step in EDA/modeling after the data split.
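
As a minimal sketch of such a merge (to be applied later, after the split, as suggested above), a mapping passed to `replace` is enough. The toy frame and the name `seller_map` are assumptions for illustration.

```python
import pandas as pd

# Toy data with the three seller_type values seen above
toy = pd.DataFrame(
    {"seller_type": ["Individual", "Dealer", "Trustmark Dealer", "Dealer"]}
)

# Map the rare category onto the broader one; unmapped values stay unchanged
seller_map = {"Trustmark Dealer": "Dealer"}
toy["seller_type"] = toy["seller_type"].replace(seller_map)

print(toy["seller_type"].value_counts())
```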

In [1125]:
#Knowing that the data was collected in 2020, a variable age is added by subtracting the variable year from 2020.
df['age']=2020-df['year']
#drop returns a new DataFrame; the result is not assigned here, so "year" stays in df
df.drop("year", axis=1)

df

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats,mileage_kmpl,engine_bhp,brand,age
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,5.0,23.40,1248.0,Maruti,6
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,5.0,21.14,1498.0,Skoda,6
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,5.0,17.70,1497.0,Honda,14
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,5.0,23.00,1396.0,Hyundai,10
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,5.0,16.10,1298.0,Maruti,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8121,Maruti Wagon R VXI BS IV with ABS,2013,260000,50000,Petrol,Individual,Manual,Second Owner,18.9 kmpl,998 CC,67.1 bhp,5.0,18.90,998.0,Maruti,7
8122,Hyundai i20 Magna 1.4 CRDi,2014,475000,80000,Diesel,Individual,Manual,Second Owner,22.54 kmpl,1396 CC,88.73 bhp,5.0,22.54,1396.0,Hyundai,6
8123,Hyundai i20 Magna,2013,320000,110000,Petrol,Individual,Manual,First Owner,18.5 kmpl,1197 CC,82.85 bhp,5.0,18.50,1197.0,Hyundai,7
8124,Hyundai Verna CRDi SX,2007,135000,119000,Diesel,Individual,Manual,Fourth & Above Owner,16.8 kmpl,1493 CC,110 bhp,5.0,16.80,1493.0,Hyundai,13
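
Note that "year" is still present in the output above: `df.drop("year", axis=1)` returns a new DataFrame and leaves `df` untouched unless the result is assigned. The sketch below shows the difference on a toy frame (the values and the name `still_has_year` are illustrative only).

```python
import pandas as pd

toy = pd.DataFrame({"year": [2014, 2006], "selling_price": [450000, 158000]})
toy["age"] = 2020 - toy["year"]

# drop() returns a new DataFrame; without assignment the original is unchanged
toy.drop("year", axis=1)
still_has_year = "year" in toy.columns

# Assigning the result actually removes the column
toy = toy.drop(columns="year")

print(still_has_year, list(toy.columns))
```

If "year" is meant to be removed before saving, the assigned form is the one to use. Keeping both "year" and "age" would give two perfectly correlated features.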


In [1126]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6631 entries, 0 to 8125
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           6631 non-null   object 
 1   year           6631 non-null   int64  
 2   selling_price  6631 non-null   int64  
 3   km_driven      6631 non-null   int64  
 4   fuel           6631 non-null   object 
 5   seller_type    6631 non-null   object 
 6   transmission   6631 non-null   object 
 7   owner          6631 non-null   object 
 8   mileage        6631 non-null   object 
 9   engine         6631 non-null   object 
 10  max_power      6631 non-null   object 
 11  seats          6631 non-null   float64
 12  mileage_kmpl   6631 non-null   float64
 13  engine_bhp     6631 non-null   float64
 14  brand          6631 non-null   object 
 15  age            6631 non-null   int64  
dtypes: float64(3), int64(4), object(9)
memory usage: 880.7+ KB
