# Used Car Price Prediction: KNN

### Dataset

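The dataset is a comma-separated file. The raw data has 14 columns; the main ones are described below. The version loaded in this notebook has 12 columns (see the `info()` output further down): it includes `age` and `make` columns and no `Year` column.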

- Location - The location in which the car is being sold or is available for purchase.
- Year - The year or edition of the model.
- KM_Driven - The total distance driven by the previous owner(s), in thousands of kilometers.
- Fuel_Type - The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
- Transmission - The type of transmission used by the car. (Automatic / Manual)
- Owner_Type - The ownership history of the car. (First, Second, Third, or Fourth & Above)
- Mileage - The standard mileage quoted by the car company, in kmpl or km/kg.
- Engine - The displacement volume of the engine in CC.
- Power - The maximum power of the engine in bhp.
- Seats - The number of seats in the car.
- Price - The price of the car (target).

### Load Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [2]:
cars_df = pd.read_csv( "https://drive.google.com/uc?export=download&id=10ABViLN4Q7vgIlLvepCduU4B3C6BneJR" )

In [3]:
cars_df.sample(5)

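|     | Location   | Fuel_Type | Transmission | Owner_Type | Seats | Price | age | KM_Driven | make       | mileage | engine | power |
|-----|------------|-----------|--------------|------------|-------|-------|-----|-----------|------------|---------|--------|-------|
| 55  | Mumbai     | Diesel    | Manual       | First      | 5.0   | 3.6   | 7   | 45        | volkswagen | 22.07   | 1199   | 73.9  |
| 355 | Mumbai     | Petrol    | Manual       | First      | 5.0   | 3.95  | 5   | 57        | volkswagen | 16.47   | 1198   | 74.0  |
| 223 | Coimbatore | Diesel    | Manual       | First      | 5.0   | 10.88 | 2   | 45        | honda      | 25.1    | 1498   | 98.6  |
| 354 | Hyderabad  | Diesel    | Manual       | First      | 5.0   | 3.2   | 6   | 51        | ford       | 20.0    | 1399   | 68.05 |
| 90  | Coimbatore | Diesel    | Manual       | First      | 5.0   | 2.58  | 6   | 59        | tata       | 17.0    | 1405   | 70.0  |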


In [4]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1038 entries, 0 to 1037
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Location      1038 non-null   object 
 1   Fuel_Type     1038 non-null   object 
 2   Transmission  1038 non-null   object 
 3   Owner_Type    1038 non-null   object 
 4   Seats         1037 non-null   float64
 5   Price         1038 non-null   float64
 6   age           1038 non-null   int64  
 7   KM_Driven     1038 non-null   int64  
 8   make          1038 non-null   object 
 9   mileage       1038 non-null   float64
 10  engine        1038 non-null   int64  
 11  power         1038 non-null   float64
dtypes: float64(4), int64(3), object(5)
memory usage: 97.4+ KB


### Feature Set Selection

In [5]:
cars_df.columns

Index(['Location', 'Fuel_Type', 'Transmission', 'Owner_Type', 'Seats', 'Price',
       'age', 'KM_Driven', 'make', 'mileage', 'engine', 'power'],
      dtype='object')

In [6]:
x_features = ['KM_Driven', 'Fuel_Type', 'age',
              'Transmission', 'Owner_Type', 'Seats', 
              'make', 'mileage', 'engine', 
              'power', 'Location']

In [7]:
cat_vars = ['Fuel_Type', 
                'Transmission', 'Owner_Type',
                'make', 'Location']

In [8]:
num_vars = list(set(x_features) - set(cat_vars))

In [9]:
num_vars

['mileage', 'KM_Driven', 'Seats', 'engine', 'power', 'age']

In [10]:
cars_df[x_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1038 entries, 0 to 1037
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   KM_Driven     1038 non-null   int64  
 1   Fuel_Type     1038 non-null   object 
 2   age           1038 non-null   int64  
 3   Transmission  1038 non-null   object 
 4   Owner_Type    1038 non-null   object 
 5   Seats         1037 non-null   float64
 6   make          1038 non-null   object 
 7   mileage       1038 non-null   float64
 8   engine        1038 non-null   int64  
 9   power         1038 non-null   float64
 10  Location      1038 non-null   object 
dtypes: float64(3), int64(3), object(5)
memory usage: 89.3+ KB


### Need for Data Transformation

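1. Data imputation for the Seats column (1037 of 1038 values are non-null, so one value is missing)
    - Mean imputation
2. Categorical encoding for the categorical columns
    - One-hot encoding (OHE)
3. Data scaling for the numerical columns
    - Standard scaling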

### Setting X and y variables

In [11]:
X = cars_df[x_features]
y = cars_df['Price']

### Data Splitting

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 80)

In [14]:
X_train.shape

(830, 11)

In [15]:
X_test.shape

(208, 11)
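
The two shapes above follow directly from the 80% split of 1038 rows. The arithmetic can be sketched as below; the rounding rule in the comment is an assumption about how a fractional `train_size` is turned into row counts, and it is consistent with the shapes reported above.

```python
import math

n_samples = 1038   # rows reported by cars_df.info()
train_size = 0.8   # fraction passed to train_test_split

# Assumption: with a float train_size and no test_size, the training count
# is rounded down and all remaining rows go to the test set.
n_train = math.floor(train_size * n_samples)
n_test = n_samples - n_train

print(n_train, n_test)  # 830 208
```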

### Data Imputation

In [16]:
from sklearn.impute import SimpleImputer

In [17]:
imputed_num_vars = ['Seats']

In [18]:
imputed_num_vars

['Seats']

In [19]:
non_imputed_num_vars = list(set(num_vars) - set(imputed_num_vars))

In [20]:
non_imputed_num_vars

['mileage', 'KM_Driven', 'engine', 'power', 'age']

In [21]:
mean_imputer = SimpleImputer(strategy='mean')
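
To make the `strategy='mean'` choice concrete, here is a minimal plain-Python sketch of what mean imputation does. The `seats` values are hypothetical and not taken from the dataset; the sketch only mirrors the idea of the imputer, not its implementation.

```python
import math
from statistics import mean

# Hypothetical Seats values; NaN marks the missing entry.
seats = [5.0, 7.0, float("nan"), 5.0, 4.0]

# "Fit": learn the mean from the observed (non-missing) values only.
observed = [s for s in seats if not math.isnan(s)]
fill_value = mean(observed)

# "Transform": replace each missing value with the learned mean.
imputed = [fill_value if math.isnan(s) else s for s in seats]

print(fill_value)  # 5.25
print(imputed)     # [5.0, 7.0, 5.25, 5.0, 4.0]
```

Because the imputer sits inside a pipeline, the fill value is learned from the training rows only and then reused on the test rows.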

### Encode Categorical Variables

In [22]:
from sklearn.preprocessing import OneHotEncoder

In [23]:
ohe_encoder = OneHotEncoder(handle_unknown='ignore')
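
The `handle_unknown='ignore'` setting matters when the test set contains a category that never appeared in training. The sketch below imitates that behavior in plain Python with hypothetical `Fuel_Type` values; `one_hot` is an illustrative helper, not a scikit-learn function.

```python
# Hypothetical training and test values for Fuel_Type.
train_values = ["Petrol", "Diesel", "Diesel", "CNG"]
test_values = ["Diesel", "Electric"]  # "Electric" was never seen in training

# "Fit": learn the sorted list of categories from the training data.
categories = sorted(set(train_values))


def one_hot(value, categories):
    """Return a 0/1 indicator vector; an unseen value maps to all zeros."""
    return [1 if value == cat else 0 for cat in categories]


encoded = [one_hot(v, categories) for v in test_values]

print(categories)  # ['CNG', 'Diesel', 'Petrol']
print(encoded)     # [[0, 1, 0], [0, 0, 0]]
```

An all-zero row keeps prediction from failing on unseen categories, at the cost of treating every unseen category identically.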

### Scaling Numerical Variables

In [24]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
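
Scaling matters for KNN because the model compares cars by distance: without it, a column measured in large units such as `engine` (CC) would dominate columns such as `Seats`. A minimal sketch of standard scaling, using hypothetical engine values and the population standard deviation:

```python
from statistics import mean, pstdev

# Hypothetical engine displacements in CC.
train_engine = [1000.0, 1200.0, 1400.0, 1600.0, 1800.0]
test_engine = [1400.0, 2000.0]

# "Fit": learn the mean and population standard deviation from training data.
mu = mean(train_engine)
sigma = pstdev(train_engine)


def standardize(values):
    """Apply z = (x - mu) / sigma using the training statistics."""
    return [(v - mu) / sigma for v in values]


scaled_train = standardize(train_engine)
scaled_test = standardize(test_engine)

print(mu)              # 1400.0
print(scaled_test[0])  # 0.0
```

After scaling, the training column has mean 0 and standard deviation 1; test values are transformed with the training statistics, so they need not have those properties.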

### Creating Pipelines

In [25]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [28]:
imputed_num_transformer = Pipeline( steps = [  
        ('imputation', mean_imputer),
        ('scaler', scaler)])

In [26]:
non_imputed_num_transformer = Pipeline( steps = [('scaler', scaler)])

In [27]:
cat_transformer = Pipeline( steps = [('ohencoder', ohe_encoder)])

In [31]:
preprocessor = ColumnTransformer(
    transformers=[  
        ('num_imputed', imputed_num_transformer, imputed_num_vars),
        ('num_not_imputed', non_imputed_num_transformer, non_imputed_num_vars),
        ('catvars', cat_transformer, cat_vars)])
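
Since `ColumnTransformer` drops any column that is not listed in a transformer by default, it is worth checking that the three column groups cover every feature exactly once. A small sanity check, with the lists copied from the cells above:

```python
x_features = ['KM_Driven', 'Fuel_Type', 'age',
              'Transmission', 'Owner_Type', 'Seats',
              'make', 'mileage', 'engine',
              'power', 'Location']
cat_vars = ['Fuel_Type', 'Transmission', 'Owner_Type', 'make', 'Location']
imputed_num_vars = ['Seats']
non_imputed_num_vars = ['mileage', 'KM_Driven', 'engine', 'power', 'age']

# Columns routed to the three transformers, in the order they are listed.
assigned = imputed_num_vars + non_imputed_num_vars + cat_vars

# Features that no transformer would receive (these would be dropped).
unassigned = set(x_features) - set(assigned)
# Columns that more than one transformer would receive.
duplicated = len(assigned) - len(set(assigned))

print(sorted(unassigned), duplicated)  # [] 0
```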

### KNN (K-Nearest Neighbors)

In [29]:
from sklearn.neighbors import KNeighborsRegressor

In [30]:
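# Alternative with uniform weights (every neighbor counts equally):
# knn = KNeighborsRegressor(n_neighbors=20)

# weights='distance' weights each neighbor by the inverse of its distance,
# so closer cars influence the predicted price more.
knn = KNeighborsRegressor(n_neighbors=20, weights='distance')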
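
The difference between uniform and distance weighting can be illustrated with a toy one-dimensional example. This is a plain-Python sketch of the idea, not the scikit-learn implementation; `knn_predict` and the data are hypothetical.

```python
import math


def knn_predict(train_X, train_y, query, k, weighted=True):
    """Predict as the (optionally inverse-distance-weighted) mean of the k nearest targets."""
    # Euclidean distance from the query to every training point.
    dists = [math.dist(x, query) for x in train_X]
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    if not weighted:
        return sum(train_y[i] for i in nearest) / len(nearest)
    # An exact match would give an infinite weight, so average exact matches instead.
    exact = [train_y[i] for i in nearest if dists[i] == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / dists[i] for i in nearest]
    return sum(w * train_y[i] for w, i in zip(weights, nearest)) / sum(weights)


# Hypothetical scaled feature values and prices.
train_X = [(1.0,), (2.0,), (4.0,)]
train_y = [2.0, 4.0, 8.0]
query = (0.0,)

uniform = knn_predict(train_X, train_y, query, k=3, weighted=False)
weighted = knn_predict(train_X, train_y, query, k=3, weighted=True)

print(round(uniform, 3), round(weighted, 3))  # 4.667 3.429
```

The weighted prediction is pulled toward the target of the closest point, which is why distance weighting tends to be less sensitive to a large `n_neighbors`.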

In [32]:
knn_v1 = Pipeline(steps=[('preprocessor', preprocessor),
                          ('knn', knn)])

In [33]:
knn_v1.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num_imputed',
                                                  Pipeline(steps=[('imputation',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Seats']),
                                                 ('num_not_imputed',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['mileage', 'KM_Driven',
                                                   'engine', 'power', 'age']),
                                                 ('catvars',
                                                  Pipeline(s

In [34]:
from sklearn import set_config
set_config(display='diagram') 

In [35]:
knn_v1

### Predict on Test Set

In [36]:
y_pred = knn_v1.predict(X_test)
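
The predictions are not scored in this notebook. One way to evaluate them is to compare `y_pred` with `y_test` using R² and RMSE. The sketch below defines both metrics in plain Python and runs them on hypothetical values; in practice `sklearn.metrics.r2_score` could be applied to `y_test` and `y_pred` directly.

```python
import math


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical prices and predictions, not taken from the dataset.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 8.0]

r2_demo = r2(y_true_demo, y_pred_demo)
rmse_demo = rmse(y_true_demo, y_pred_demo)

print(round(r2_demo, 3), round(rmse_demo, 3))  # 0.925 0.612
```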

### K-Fold Cross-Validation

In [37]:
from sklearn.model_selection import cross_val_score

In [38]:
scores = cross_val_score( knn_v1,
                          X_train,
                          y_train,
                          cv = 10,
                          scoring = 'r2')

In [39]:
scores

array([0.82494184, 0.71891728, 0.75005726, 0.8216027 , 0.74097026,
       0.76401927, 0.72654669, 0.79012772, 0.84630061, 0.74544216])

In [40]:
scores.mean()

0.7728925800703326

In [None]:
scores.std()
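
The summary statistics can be reproduced from the ten scores printed above. The sketch below also shows how 830 training rows divide into 10 folds; the fold-size rule (the first `n % k` folds get one extra row) is stated as an assumption about the default splitter.

```python
from statistics import mean, pstdev

# R² scores copied from the cross_val_score output above.
scores = [0.82494184, 0.71891728, 0.75005726, 0.8216027, 0.74097026,
          0.76401927, 0.72654669, 0.79012772, 0.84630061, 0.74544216]

cv_mean = mean(scores)
cv_std = pstdev(scores)  # population standard deviation, like NumPy's default std

# Assumption: n rows split into k folds, the first n % k folds get one extra row.
n_rows, k = 830, 10
fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]

print(round(cv_mean, 4))  # 0.7729
print(fold_sizes[0], len(fold_sizes))  # 83 10
```

A standard deviation that is small relative to the mean suggests the model's R² is fairly stable across folds; each fold here holds out 83 rows for validation.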