# Car Price Prediction

**Write meaningful header to project**

## Importing standard libraries

In [87]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt


### Loading the dataset as a pandas DataFrame

In [88]:
dataset = pd.read_csv('./assets/car-details.csv')

Retrieving some information about the dataset

In [89]:
dataset.head(8)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,190Nm@ 2000rpm,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,22.4 kgm at 1750-2750rpm,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,"11.5@ 4,500(kgm@ rpm)",5.0
5,Hyundai Xcent 1.2 VTVT E Plus,2017,440000,45000,Petrol,Individual,Manual,First Owner,20.14 kmpl,1197 CC,81.86 bhp,113.75nm@ 4000rpm,5.0
6,Maruti Wagon R LXI DUO BSIII,2007,96000,175000,LPG,Individual,Manual,First Owner,17.3 km/kg,1061 CC,57.5 bhp,"7.8@ 4,500(kgm@ rpm)",5.0
7,Maruti 800 DX BSII,2001,45000,5000,Petrol,Individual,Manual,Second Owner,16.1 kmpl,796 CC,37 bhp,59Nm@ 2500rpm,4.0


Having seen the shape of the dataset, it is time to see if some of the values are missing.

In [90]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           8128 non-null   object 
 1   year           8128 non-null   int64  
 2   selling_price  8128 non-null   int64  
 3   km_driven      8128 non-null   int64  
 4   fuel           8128 non-null   object 
 5   seller_type    8128 non-null   object 
 6   transmission   8128 non-null   object 
 7   owner          8128 non-null   object 
 8   mileage        7907 non-null   object 
 9   engine         7907 non-null   object 
 10  max_power      7913 non-null   object 
 11  torque         7906 non-null   object 
 12  seats          7907 non-null   float64
dtypes: float64(1), int64(3), object(9)
memory usage: 825.6+ KB


In [91]:
dataset.isna().sum()

name               0
year               0
selling_price      0
km_driven          0
fuel               0
seller_type        0
transmission       0
owner              0
mileage          221
engine           221
max_power        215
torque           222
seats            221
dtype: int64

The output shows that the dataset has 8128 rows and 13 columns.

It also shows missing values in the columns mileage, engine, max_power, torque and seats, between 215 and 222 per column.

#### There is reason to believe some of the columns are categorical.

In [92]:
print("Column: Fuel types")
print(dataset.fuel.value_counts())

print("\nColumn: Seller types")
print(dataset.seller_type.value_counts())

print("\nColumn: Transmission types")
print(dataset.transmission.value_counts())

print("\nColumn: Owner count")
print(dataset.owner.value_counts())

Column: Fuel types
Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: fuel, dtype: int64

Column: Seller types
Individual          6766
Dealer              1126
Trustmark Dealer     236
Name: seller_type, dtype: int64

Column: Transmission types
Manual       7078
Automatic    1050
Name: transmission, dtype: int64


## Cleaning up the data

Having looked at the data, here is the TODO list for cleaning it up.

* Take care of the missing values
* Remove the car model from the brand name
* Encode the categorical values
* Convert max_power, mileage and engine to numeric values
* Extract torque to a usable value
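As an illustration of the numeric conversion item in the list, here is a minimal sketch that pulls the leading number out of strings such as `23.4 kmpl`, `1248 CC` and `74 bhp`. The sample values are copied from the `dataset.head(8)` output; the helper name `leading_number` is hypothetical and not part of the notebook. The sketch ignores units, so `kmpl` and `km/kg` end up on the same scale, which may or may not be acceptable for the model.

```python
import pandas as pd

# Toy values copied from the first rows shown by dataset.head(8).
sample = pd.DataFrame({
    "mileage": ["23.4 kmpl", "21.14 kmpl", "17.3 km/kg"],
    "engine": ["1248 CC", "1498 CC", "1061 CC"],
    "max_power": ["74 bhp", "103.52 bhp", "57.5 bhp"],
})


def leading_number(column: pd.Series) -> pd.Series:
    """Hypothetical helper: extract the first number from strings like '23.4 kmpl'."""
    digits = column.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    # Strings without any number become NaN instead of raising an error.
    return pd.to_numeric(digits, errors="coerce").astype(float)


# Apply the helper column by column; the unit suffixes are discarded.
numeric = sample.apply(leading_number)
print(numeric.dtypes)
```

Torque is left out on purpose: its strings mix Nm and kgm and include rpm ranges, so a single regular expression like this one would not be enough.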

### Taking care of missing values

Going back to the isna() output, it seems that most of the missing values sit on the same rows: the per-column counts are nearly identical (215 to 222).

Therefore we can use the dropna() function to drop the rows with missing values, as it doesn't seem to make sense to try to predict the missing values.
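One way to check that the missing values really share rows is to compare the number of missing cells with the number of rows that contain at least one missing cell. The sketch below uses a small made-up frame rather than the real dataset, so the numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "mileage": ["23.4 kmpl", np.nan, np.nan, "16.1 kmpl"],
    "engine": ["1248 CC", np.nan, "1497 CC", "796 CC"],
    "seats": [5.0, np.nan, 5.0, 4.0],
})

# Total number of NaN cells across all columns.
missing_cells = int(toy.isna().sum().sum())
# Number of rows with at least one NaN, i.e. the rows dropna() would remove.
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(f"{missing_cells} missing cells spread over {rows_with_missing} rows")
print(f"rows left after dropna(): {len(toy.dropna())}")
```

When the second number is much smaller than the first, the gaps are concentrated in a few rows and dropping them costs little data.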

In [93]:
dataset = dataset.dropna()

In [94]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7906 entries, 0 to 8127
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           7906 non-null   object 
 1   year           7906 non-null   int64  
 2   selling_price  7906 non-null   int64  
 3   km_driven      7906 non-null   int64  
 4   fuel           7906 non-null   object 
 5   seller_type    7906 non-null   object 
 6   transmission   7906 non-null   object 
 7   owner          7906 non-null   object 
 8   mileage        7906 non-null   object 
 9   engine         7906 non-null   object 
 10  max_power      7906 non-null   object 
 11  torque         7906 non-null   object 
 12  seats          7906 non-null   float64
dtypes: float64(1), int64(3), object(9)
memory usage: 864.7+ KB


In [95]:
dataset.isna().sum()

name             0
year             0
selling_price    0
km_driven        0
fuel             0
seller_type      0
transmission     0
owner            0
mileage          0
engine           0
max_power        0
torque           0
seats            0
dtype: int64

In [96]:
dataset.reset_index(drop=True, inplace=True)

Now the missing data has been taken care of. If the missing data had affected more than a small share of the rows, it would have been a good idea to try to impute the missing values.

However, in this case only 222 of 8128 rows (about 2.7%) were dropped, so it simply didn't make sense to do so.
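For comparison, a minimal imputation sketch is shown here, assuming a numeric column such as `seats` and the column median as the fill value. This is not what the notebook does, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy column with one gap (made-up values, not taken from the dataset).
toy = pd.DataFrame({"seats": [5.0, np.nan, 7.0, 5.0, 4.0]})

# Median of the observed values; NaN is skipped by default.
median_seats = toy["seats"].median()

# Replace the gap with the median instead of dropping the row.
toy["seats"] = toy["seats"].fillna(median_seats)
print(toy)
```

The median is only one possible choice; for columns still stored as strings (mileage, engine, max_power) the values would have to be converted to numbers first.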

### Removing the car model from the brand name

I'm afraid that keeping the exact car model will make it difficult to predict the price of models that are not in the dataset.

However, keeping the brand name can still be useful, as it can indicate how luxurious a car is.

In [97]:
brands = dataset.name.copy()
dataset = dataset.drop(['name'], axis=1)

In [98]:
# Keep only the first word of each name, which is the brand
brands = brands.str.split(' ').str[0]

Get the number of unique brand names

In [99]:
len(brands.unique())

31

In [100]:
brands.value_counts()

Maruti           2367
Hyundai          1360
Mahindra          758
Tata              719
Honda             466
Toyota            452
Ford              388
Chevrolet         230
Renault           228
Volkswagen        185
BMW               118
Skoda             104
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               41
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Name: name, dtype: int64
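Taking only the first word shortens brands whose names have two words, which is presumably where the entries `Land` and `Ashok` come from. The sketch below shows one way to keep such brands whole. It assumes those entries stand for "Land Rover" and "Ashok Leyland"; the example names and the helper `extract_brand` are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Made-up example names; only the first one appears in the dataset preview.
names = pd.Series([
    "Maruti Swift Dzire VDI",
    "Land Rover Discovery Sport",
    "Ashok Leyland Stile LE",
])

# Assumption: these are the only two-word brands that need special handling.
TWO_WORD_BRANDS = ("Land Rover", "Ashok Leyland")


def extract_brand(name: str) -> str:
    """Hypothetical helper: return the brand, keeping known two-word brands whole."""
    for brand in TWO_WORD_BRANDS:
        if name.startswith(brand + " "):
            return brand
    # Fall back to the first word, as the notebook does.
    return name.split(" ")[0]


brands_fixed = names.map(extract_brand)
print(brands_fixed.tolist())
```

With only seven such rows in the dataset, the effect on a model is likely small, so the simpler first-word rule may well be good enough.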

In [101]:
dataset['brand'] = brands

In [102]:
dataset.head()

Unnamed: 0,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats,brand
0,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,190Nm@ 2000rpm,5.0,Maruti
1,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0,Skoda
2,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0,Honda
3,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,22.4 kgm at 1750-2750rpm,5.0,Hyundai
4,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,"11.5@ 4,500(kgm@ rpm)",5.0,Maruti


**Success!** The car brand names are now added to the dataset, and should be used as a categorical feature.
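One common way to use a column like `brand` as a categorical feature is one-hot encoding. The sketch below uses `pd.get_dummies` on a toy frame built from the first four rows shown above; whether the notebook uses this encoding or another one is an assumption.

```python
import pandas as pd

# Toy frame built from the first rows of the dataset preview.
toy = pd.DataFrame({
    "selling_price": [450000, 370000, 158000, 225000],
    "brand": ["Maruti", "Skoda", "Honda", "Hyundai"],
})

# Replace the brand column with one indicator column per brand.
encoded = pd.get_dummies(toy, columns=["brand"], prefix="brand")
print(encoded.columns.tolist())
```

With 31 brands this adds 31 indicator columns, and several of them would cover only a handful of rows, so grouping rare brands into a single category might be worth considering.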