# Used Car Price Prediction: KNN

### Dataset

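The dataset is a comma-separated (CSV) file with 14 columns in its original form; the version loaded in this notebook has 12 columns (see the `cars_df.info()` output below).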

- Location - The location in which the car is being sold or is available for purchase.
- Year - The year or edition of the model.
- KM_Driven - The total distance driven by the previous owner(s), in thousands of kilometers.
- Fuel_Type - The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
- Transmission - The type of transmission used by the car. (Automatic / Manual)
- Owner_Type - The ownership history of the car. (First, Second, Third, or Fourth & Above)
- Mileage - The standard mileage quoted by the car company, in kmpl or km/kg.
- Engine - The displacement volume of the engine in CC.
- Power - The maximum power of the engine in bhp.
- Seats - The number of seats in the car.
- Price - The price of the car (target).

### Load Dataset

In [88]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [89]:
import wandb
# Start a new wandb run to track this notebook.
# Supply the API key through the WANDB_API_KEY environment variable (set outside the notebook) or `wandb login`; never hardcode it.
wandb.init(
    # set the wandb project where this run will be logged
    project="my-carprice-prediction",

    # track hyperparameters and run metadata for the KNN model trained below
    config={
    "architecture": "KNN",
    "dataset": "cars.csv",
    "n_neighbors": 20,
    "weights": "distance",
    }
)



In [86]:
cars_df = pd.read_csv( "../inputdata/cars.csv" )

In [46]:
cars_df.sample(5)

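|     | Location  | Fuel_Type | Transmission | Owner_Type | Seats | Price | age | KM_Driven | make   | mileage | engine | power |
| --- | --------- | --------- | ------------ | ---------- | ----- | ----- | --- | --------- | ------ | ------- | ------ | ----- |
| 651 | Delhi     | Diesel    | Manual       | First      | 5.0   | 5.5   | 4   | 76        | maruti | 26.59   | 1248   | 74.0  |
| 9   | Hyderabad | Petrol    | Manual       | First      | 5.0   | 2.6   | 5   | 88        | maruti | 24.07   | 998    | 67.1  |
| 534 | Ahmedabad | Diesel    | Manual       | First      | 5.0   | 5.7   | 5   | 66        | honda  | 26.0    | 1498   | 98.6  |
| 958 | Mumbai    | Diesel    | Manual       | Second     | 5.0   | 4.25  | 7   | 74        | maruti | 28.4    | 1248   | 73.75 |
| 202 | Hyderabad | Diesel    | Manual       | First      | 5.0   | 3.0   | 7   | 107       | ford   | 17.8    | 1399   | 67.0  |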


In [47]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1038 entries, 0 to 1037
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Location      1038 non-null   object 
 1   Fuel_Type     1038 non-null   object 
 2   Transmission  1038 non-null   object 
 3   Owner_Type    1038 non-null   object 
 4   Seats         1037 non-null   float64
 5   Price         1038 non-null   float64
 6   age           1038 non-null   int64  
 7   KM_Driven     1038 non-null   int64  
 8   make          1038 non-null   object 
 9   mileage       1038 non-null   float64
 10  engine        1038 non-null   int64  
 11  power         1038 non-null   float64
dtypes: float64(4), int64(3), object(5)
memory usage: 97.4+ KB


### Feature Set Selection

In [48]:
cars_df.columns

Index(['Location', 'Fuel_Type', 'Transmission', 'Owner_Type', 'Seats', 'Price',
       'age', 'KM_Driven', 'make', 'mileage', 'engine', 'power'],
      dtype='object')

In [49]:
x_features = ['KM_Driven', 'Fuel_Type', 'age',
              'Transmission', 'Owner_Type', 'Seats', 
              'make', 'mileage', 'engine', 
              'power', 'Location']

In [50]:
cat_vars = ['Fuel_Type', 
                'Transmission', 'Owner_Type',
                'make', 'Location']

In [51]:
num_vars = list(set(x_features) - set(cat_vars))

In [52]:
num_vars

['mileage', 'Seats', 'engine', 'KM_Driven', 'age', 'power']

In [53]:
cars_df[x_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1038 entries, 0 to 1037
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   KM_Driven     1038 non-null   int64  
 1   Fuel_Type     1038 non-null   object 
 2   age           1038 non-null   int64  
 3   Transmission  1038 non-null   object 
 4   Owner_Type    1038 non-null   object 
 5   Seats         1037 non-null   float64
 6   make          1038 non-null   object 
 7   mileage       1038 non-null   float64
 8   engine        1038 non-null   int64  
 9   power         1038 non-null   float64
 10  Location      1038 non-null   object 
dtypes: float64(3), int64(3), object(5)
memory usage: 89.3+ KB


### Need for Data Transformation

1. Data imputation for the `Seats` column (one missing value)
    - Mean imputation
2. Encoding of categorical columns
    - One-hot encoding (OHE)
3. Scaling of numerical columns
    - Standard scaling
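
As a rough illustration of steps 1 and 3, the sketch below applies mean imputation and standard scaling by hand with NumPy. The `seats` array is made-up toy data, not taken from `cars.csv`, and the variable names are hypothetical.

```python
import numpy as np

# Toy Seats column with one missing value (hypothetical data).
seats = np.array([5.0, 5.0, np.nan, 7.0, 4.0])

# Step 1: mean imputation replaces NaN with the mean of the observed values.
seats_mean = np.nanmean(seats)
seats_imputed = np.where(np.isnan(seats), seats_mean, seats)

# Step 3: standard scaling subtracts the mean and divides by the
# population standard deviation (ddof=0), which is what StandardScaler uses.
seats_scaled = (seats_imputed - seats_imputed.mean()) / seats_imputed.std()

print(seats_imputed)
print(seats_scaled.round(3))
```

After scaling, the column has mean 0 and standard deviation 1, so it contributes to KNN distances on the same footing as the other numerical features.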

### Setting X and y variables

In [54]:
X = cars_df[x_features]
y = cars_df['Price']

### Data Splitting

In [55]:
from sklearn.model_selection import train_test_split

In [56]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 80)

In [57]:
X_train.shape

(830, 11)

In [58]:
X_test.shape

(208, 11)

### Data Imputation

In [59]:
from sklearn.impute import SimpleImputer

In [60]:
imputed_num_vars = ['Seats']

In [61]:
imputed_num_vars

['Seats']

In [62]:
non_imputed_num_vars = list(set(num_vars) - set(imputed_num_vars))

In [63]:
non_imputed_num_vars

['mileage', 'engine', 'KM_Driven', 'age', 'power']

In [64]:
mean_imputer = SimpleImputer(strategy='mean')

### Encode Categorical Variables

In [65]:
from sklearn.preprocessing import OneHotEncoder

In [66]:
ohe_encoder = OneHotEncoder(handle_unknown='ignore')
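
To see what `handle_unknown='ignore'` does, here is a minimal sketch on a toy fuel-type column. The rows are invented for illustration and the variable names are hypothetical.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy fuel-type column (hypothetical rows, not from cars.csv).
train_fuel = [["Petrol"], ["Diesel"], ["CNG"]]
new_fuel = [["Diesel"], ["Electric"]]  # "Electric" was never seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_fuel)

# Learned categories are stored in sorted order: CNG, Diesel, Petrol.
print(encoder.categories_)

# The unseen category becomes an all-zero row instead of raising an error.
encoded = encoder.transform(new_fuel).toarray()
print(encoded)
```

This matters after a train/test split: a category that appears only in the test set would otherwise make `transform` fail.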

### Scaling numerical vars

In [67]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

### Creating Pipelines

In [68]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [69]:
imputed_num_transformer = Pipeline( steps = [  
        ('imputation', mean_imputer),
        ('scaler', scaler)])

In [70]:
non_imputed_num_transformer = Pipeline( steps = [('scaler', scaler)])

In [71]:
cat_transformer = Pipeline( steps = [('ohencoder', ohe_encoder)])

In [72]:
preprocessor = ColumnTransformer(
    transformers=[  
        ('num_imputed', imputed_num_transformer, imputed_num_vars),
        ('num_not_imputed', non_imputed_num_transformer, non_imputed_num_vars),
        ('catvars', cat_transformer, cat_vars)])
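
The sketch below rebuilds a preprocessor of the same shape on a three-column toy frame, to show how the column groups are transformed and concatenated. The frame and the name `toy_preprocessor` are hypothetical; only the step names mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a subset of the notebook's columns (values are made up).
toy_df = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "age": [4, 5, 7, 3],
    "Fuel_Type": ["Diesel", "Petrol", "Diesel", "CNG"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num_imputed",
     Pipeline(steps=[("imputation", SimpleImputer(strategy="mean")),
                     ("scaler", StandardScaler())]),
     ["Seats"]),
    ("num_not_imputed", Pipeline(steps=[("scaler", StandardScaler())]), ["age"]),
    ("catvars",
     Pipeline(steps=[("ohencoder", OneHotEncoder(handle_unknown="ignore"))]),
     ["Fuel_Type"]),
])

transformed = toy_preprocessor.fit_transform(toy_df)
# The result may be a SciPy sparse matrix, depending on its density.
dense = transformed.toarray() if hasattr(transformed, "toarray") else transformed

# 1 Seats column + 1 age column + 3 one-hot Fuel_Type columns = 5 columns.
print(dense.shape)
```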

### KNN (K-Nearest Neighbors)


In [73]:
from sklearn.neighbors import KNeighborsRegressor

In [74]:
#knn = KNeighborsRegressor(n_neighbors=20)
knn = KNeighborsRegressor(n_neighbors=20, weights='distance')
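
With `weights='distance'`, closer neighbors count more: each neighbor's target is weighted by the inverse of its distance. The NumPy sketch below illustrates the idea for a single query point. The function `knn_predict_distance_weighted` and the toy data are hypothetical, and this is not scikit-learn's actual implementation.

```python
import numpy as np

def knn_predict_distance_weighted(X_train, y_train, x_query, k):
    """Predict one target as the inverse-distance-weighted mean of the k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]
    d, y = distances[nearest], y_train[nearest]
    # An exact match would get an infinite weight, so average the exact matches only.
    if np.any(d == 0):
        return float(y[d == 0].mean())
    weights = 1.0 / d
    return float(np.sum(weights * y) / np.sum(weights))

# Toy 1-D data (hypothetical).
X_toy = np.array([[0.0], [1.0], [5.0]])
y_toy = np.array([0.0, 10.0, 50.0])

# Neighbors of 2.0 with k=2: x=1 (distance 1, y=10) and x=0 (distance 2, y=0).
prediction = knn_predict_distance_weighted(X_toy, y_toy, np.array([2.0]), k=2)
print(prediction)
```

With uniform weights the same two neighbors would give 5.0; inverse-distance weighting pulls the prediction toward the closer neighbor.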

In [75]:
knn_v1 = Pipeline(steps=[('preprocessor', preprocessor),
                          ('knn', knn)])

In [76]:
knn_v1.fit(X_train, y_train)

In [77]:
from sklearn import set_config
set_config(display='diagram') 

In [102]:
from joblib import dump

# Persist the fitted pipeline to disk.
dump(knn_v1, '../out/knn_v1.joblib')

# Log the input data and the serialized model as a wandb artifact.
artifact = wandb.Artifact("cardsdatabase", type="database")
artifact.add_file("../inputdata/cars.csv")
artifact.add_file('../out/knn_v1.joblib')
wandb.log_artifact(artifact)
# The run stays open here so that metrics can still be logged below.



### Predict on test set

In [79]:
y_pred = knn_v1.predict(X_test)
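
The notebook does not show how the test-set predictions are scored. One possible way is to compute RMSE and R² from `y_test` and `y_pred`. The helper below is a hypothetical sketch, demonstrated on made-up prices.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {"rmse": float(rmse), "r2": float(1.0 - ss_res / ss_tot)}

# Toy prices (hypothetical); in the notebook one would pass y_test and y_pred.
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
print(metrics)
```

RMSE is in the same units as `Price`, which makes it easier to interpret than R² alone.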

### K Fold Cross Validation

In [80]:
from sklearn.model_selection import cross_val_score

In [81]:
scores = cross_val_score( knn_v1,
                          X_train,
                          y_train,
                          cv = 10,
                          scoring = 'r2')

In [82]:
scores

array([0.82486505, 0.71891728, 0.75005726, 0.8216027 , 0.74097026,
       0.76401927, 0.72654669, 0.79012772, 0.84630204, 0.74544216])

In [83]:
scores.mean()

0.7728850440653854
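
To make explicit what `cross_val_score` computes, the sketch below runs the folds by hand on synthetic data and compares the result with the library call. The data and variable names are hypothetical, and a bare `KNeighborsRegressor` stands in for the full `knn_v1` pipeline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (hypothetical, stands in for X_train / y_train).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(0, 0.5, size=100)

model = KNeighborsRegressor(n_neighbors=5, weights="distance")

# Manual loop: fit on k-1 folds, score R^2 on the held-out fold.
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_toy):
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    fold_pred = fold_model.predict(X_toy[val_idx])
    manual_scores.append(r2_score(y_toy[val_idx], fold_pred))

# For a regressor, an integer cv uses unshuffled KFold, so the scores should match.
auto_scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="r2")
print(np.round(manual_scores, 4))
print(np.round(auto_scores, 4))
```

Passing the whole pipeline to `cross_val_score`, as the notebook does, refits the preprocessing inside each fold, so the held-out fold does not leak into the imputer or scaler.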

In [101]:
# Log the cross-validation summary to wandb, then close the run.
wandb.log({"mean_r2": scores.mean(), "std_r2": scores.std()})
wandb.finish()

In [85]:
#Model experiment 

In [None]:
# 'Model exchange' refers to the process of taking a trained model from the learning system and deploying it on the inference system for predictions.
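
As a minimal sketch of model exchange with joblib, the format the notebook already uses for `knn_v1`, the example below saves a small stand-in pipeline and reloads it as an inference system would. The pipeline, data, and file name are hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline trained on synthetic data (hypothetical, not knn_v1).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy.sum(axis=1)
toy_model = Pipeline(steps=[("scaler", StandardScaler()),
                            ("knn", KNeighborsRegressor(n_neighbors=3))])
toy_model.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")
    dump(toy_model, model_path)          # "learning system" side
    restored_model = load(model_path)    # "inference system" side

# The restored pipeline should reproduce the original predictions.
same_predictions = np.allclose(toy_model.predict(X_toy),
                               restored_model.predict(X_toy))
print(same_predictions)
```

A joblib file is a pickle, so the inference system needs compatible versions of Python and scikit-learn, and such files should only be loaded from trusted sources.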