# Using KNN to Predict Car Prices

The [dataset](https://archive.ics.uci.edu/ml/datasets/automobile) contains a variety of attributes, such as body dimensions, engine specifications, and fuel economy, that can be useful for estimating car prices. These data will provide a foundation for price prediction using k-nearest neighbors (KNN).

# 0. Setting up Dependencies

In [None]:
!conda install -c plotly chart-studio

In [None]:
import pandas as pd

# 1. Exploring the Dataset

In [None]:
# Read dataset into a DataFrame
cars = pd.read_csv('imports-85.data', header = None) # Data has no header

In [None]:
cars.head()

The dataset has no header row, but the column descriptions are available from the source:

* symboling: -3, -2, -1, 0, 1, 2, 3.
* normalized-losses: continuous from 65 to 256.
* make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
* fuel-type: diesel, gas.
* aspiration: std, turbo.
* num-of-doors: four, two.
* body-style: hardtop, wagon, sedan, hatchback, convertible.
* drive-wheels: 4wd, fwd, rwd.
* engine-location: front, rear.
* wheel-base: continuous from 86.6 to 120.9.
* length: continuous from 141.1 to 208.1.
* width: continuous from 60.3 to 72.3.
* height: continuous from 47.8 to 59.8.
* curb-weight: continuous from 1488 to 4066.
* engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
* num-of-cylinders: eight, five, four, six, three, twelve, two.
* engine-size: continuous from 61 to 326.
* fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
* bore: continuous from 2.54 to 3.94.
* stroke: continuous from 2.07 to 4.17.
* compression-ratio: continuous from 7 to 23.
* horsepower: continuous from 48 to 288.
* peak-rpm: continuous from 4150 to 6600.
* city-mpg: continuous from 13 to 49.
* highway-mpg: continuous from 16 to 54.
* price: continuous from 5118 to 45400.

In [None]:
# Create a string with all the header information
header = '''1. symboling: -3, -2, -1, 0, 1, 2, 3.A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.

2. normalized-losses: continuous from 65 to 256.
3. make:
alfa-romero, audi, bmw, chevrolet, dodge, honda,
isuzu, jaguar, mazda, mercedes-benz, mercury,
mitsubishi, nissan, peugot, plymouth, porsche,
renault, saab, subaru, toyota, volkswagen, volvo
4. fuel-type: diesel, gas.
5. aspiration: std, turbo.
6. num-of-doors: four, two.
7. body-style: hardtop, wagon, sedan, hatchback, convertible.
8. drive-wheels: 4wd, fwd, rwd.
9. engine-location: front, rear.
10. wheel-base: continuous from 86.6 120.9.
11. length: continuous from 141.1 to 208.1.
12. width: continuous from 60.3 to 72.3.
13. height: continuous from 47.8 to 59.8.
14. curb-weight: continuous from 1488 to 4066.
15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
16. num-of-cylinders: eight, five, four, six, three, twelve, two.
17. engine-size: continuous from 61 to 326.
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
19. bore: continuous from 2.54 to 3.94.
20. stroke: continuous from 2.07 to 4.17.
21. compression-ratio: continuous from 7 to 23.
22. horsepower: continuous from 48 to 288.
23. peak-rpm: continuous from 4150 to 6600.
24. city-mpg: continuous from 13 to 49.
25. highway-mpg: continuous from 16 to 54.
26. price: continuous from 5118 to 45400.'''

# After observing, split header by '. '
header = header.split('. ')
header

In [None]:
# Extract column names from the list of headers
import re
pat = '[^:]*' # Matches anything that's not ':' therefore stops at first ':'
columns = []
for h in header:
    m = re.search(pat, h)
    if m:
        found = m.group(0) # If the pattern matches, extract the full match
        columns.append(found)

In [None]:
columns, len(columns) # Checking columns result and make sure all of the headers are included
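The split-and-match steps above can be checked on a smaller input. The following is a minimal sketch using a hypothetical three-entry excerpt (the `sample` string is made up for illustration) in the same `N. name: description` layout. It shows why the first element of the result has to be dropped.

```python
import re

# Hypothetical three-entry excerpt in the same layout as the full header string
sample = '''1. symboling: -3, -2, -1, 0, 1, 2, 3.
2. fuel-type: diesel, gas.
3. price: continuous from 5118 to 45400.'''

# Splitting on '. ' leaves the leading '1' as its own element
parts = sample.split('. ')

# '[^:]*' matches everything before the first ':' (or the whole string if there is none)
names = [re.search('[^:]*', part).group(0) for part in parts]

print(names)      # The first element is the stray '1'
print(names[1:])  # The remaining elements are the column names
```

Note that this approach depends on the exact punctuation of the description text: any extra `'. '` inside a description would produce an extra element.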

In [None]:
# Add columns to cars DataFrame and exclude the first element in columns that shouldn't be included
cars.columns = columns[1:]

In [None]:
cars.head()

In [None]:
cars.info()

In [None]:
pd.options.display.max_columns = 26
cars.describe(include = 'all')

In [None]:
cars.columns

After exploring the dataset, we can determine the columns that are numerical and can be used as features as below:

`'symboling', 'normalized-losses', 'num-of-doors', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'num-of-cylinders', 'engine-size', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg'`

Columns 'num-of-doors', 'num-of-cylinders' are not numerical but can be converted to numerical.

Column 'price' will be our target column.

# 2. Data Cleaning

In [None]:
# Keep only selected features and target columns
cars_selected = cars[['symboling', 'normalized-losses', 'num-of-doors', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'num-of-cylinders', 'engine-size', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']].copy()

From data exploration, missing values are encoded as '?' (for example, in column 'normalized-losses'); the following will replace every '?' with null (NaN).

In [None]:
import numpy as np
cars_selected = cars_selected.replace('?', np.nan)

In [None]:
# Convert strings in columns 'num-of-doors', 'num-of-cylinders' to numerical values
cars_selected['num-of-doors'] = cars_selected['num-of-doors'].map({'two':2, 'four':4})
cars_selected['num-of-cylinders'] = cars_selected['num-of-cylinders'].map({'eight': 8,
                             'five':5, 
                             'four':4,
                             'six':6, 
                             'three':3, 
                             'twelve':12, 
                             'two':2})

In [None]:
# Convert all columns in the dataframe to type float 
cars_selected = cars_selected.astype(float)

In [None]:
# Check for missing values in the dataframe
cars_selected.isnull().sum()

Since there are only 4 missing car prices, dropping these rows will not compromise the prediction. 

There are also 2 missing values in the `num-of-doors` column. The original dataframe tells us the make and body-style of these cars, so we can likely figure out the number of doors from there.

For the other columns, it is reasonable to fill the missing values with the column mean.

In [None]:
# Drop rows with missing price
cars_selected.dropna(subset = ['price'], inplace = True)

In [None]:
# Check out the rows with missing num-of-doors value
idx = cars_selected[cars_selected['num-of-doors'].isnull()].index
cars.loc[idx] # idx holds index labels, so use label-based lookup

The Dodge and Mazda sedans are the culprits - a light application of Google-fu and we can find that they are both 4-door models.

In [None]:
# Assign door number values to rows with the missing values 
cars_selected.loc[idx, 'num-of-doors'] = 4

In [None]:
# Fill the remaining missing values with the mean of their column
cars_selected = cars_selected.fillna(cars_selected.mean())
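To make the two cleaning steps concrete (treating `'?'` as missing, then filling with the column mean), here is a minimal plain-Python sketch on a single column. The helper `fill_with_mean` and the sample values are hypothetical and only illustrate the idea; the notebook itself relies on pandas for this.

```python
import math

def fill_with_mean(values):
    """Replace NaN entries with the mean of the non-missing entries."""
    present = [v for v in values if not math.isnan(v)]
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in values]

# Hypothetical raw entries, with '?' marking missing values
raw = ['164', '?', '158', '?', '192']

# Step 1: turn '?' into NaN and everything else into a float
as_float = [float('nan') if v == '?' else float(v) for v in raw]

# Step 2: fill the NaN entries with the mean of the observed values
filled = fill_with_mean(as_float)
print(filled)
```

The mean is computed from the observed values only, so the filled entries do not shift the column average.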

In [None]:
# Ensure no null values remain
cars_selected.isnull().sum()

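Next, the feature columns are normalized so that each has a mean of 0 and a standard deviation of 1. KNN relies on distances, so features on larger scales would otherwise dominate.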

In [None]:
# Normalize features
cars_features = cars_selected.drop('price', axis = 1)
cars_features = (cars_features - cars_features.mean()) / cars_features.std(ddof = 0) # Population std, same as np.std's default
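The normalization above is a z-score: subtract the column mean and divide by the column's population standard deviation. A minimal plain-Python sketch for one column follows; the helper `standardize` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # Population standard deviation (divides by n)
    return [(v - mu) / sigma for v in values]

# Hypothetical horsepower values, for illustration only
horsepower = [48.0, 111.0, 154.0, 288.0]
z = standardize(horsepower)
print(z)
```

After this transformation the column has mean 0 and standard deviation 1 (up to floating-point error), which puts all features on a comparable scale before distances are computed.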

In [None]:
cars_features.head()

In [None]:
cars_clean = pd.concat([cars_features, cars_selected.price], axis = 1)

# 3. Building the Univariate Model

In [None]:
# Import model & validation methods from sklearn 
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Training & validation function
def knn_train_test(feature_col, target_col, df):
    train, test = train_test_split(df, train_size = 0.8, test_size = 0.2, random_state = 1)
    model = KNeighborsRegressor()
    model.fit(train[feature_col], train[target_col])
    predictions = model.predict(test[feature_col])
    mse = mean_squared_error(test[target_col], predictions)
    rmse = np.sqrt(np.abs(mse))
    return rmse
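To make the roles of the model and the metric concrete, here is a minimal plain-Python sketch of one-feature k-nearest-neighbors regression and of RMSE. It averages the targets of the k closest training points, which corresponds to uniform weighting. The names `knn_predict` and `root_mean_squared_error` and the toy numbers are made up for illustration. This is a conceptual sketch, not how scikit-learn implements `KNeighborsRegressor` internally.

```python
import math

def knn_predict(train_x, train_y, query, k=5):
    """Predict a target as the mean target of the k nearest training points."""
    # Sort (feature, target) pairs by absolute distance to the query value
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference between actual and predicted."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Toy training data: one feature and its target
train_x = [1.0, 2.0, 3.0, 10.0]
train_y = [100.0, 200.0, 300.0, 1000.0]

# The three nearest neighbors of 2.2 are 2.0, 3.0 and 1.0
pred = knn_predict(train_x, train_y, query=2.2, k=3)
error = root_mean_squared_error([210.0], [pred])
print(pred, error)
```

Because the prediction is an average over neighbors, a larger k smooths the predictions while a smaller k follows individual training points more closely.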

Next, each numerical feature column is used on its own to train and test a univariate model.

In [None]:
rmses = {}
feature_cols = cars_features.columns

for col in feature_cols:
    rmses[col] = knn_train_test([col], 'price', cars_clean)

rmses

In [None]:
# Get the key of the minimum value in the rmses dictionary 
min(rmses, key=rmses.get)

The result shows that `horsepower` performed the best using the k=5 default.

Modifying the `knn_train_test()` function from above to accept a parameter for k makes it more flexible, since other values of k can then be tested.

In [None]:
# Modify knn_train_test() function 
def knn_train_test(feature_col, target_col, df, k):
    train, test = train_test_split(df, train_size = 0.8, test_size = 0.2, random_state = 1)
    model = KNeighborsRegressor(n_neighbors = k)
    model.fit(train[feature_col], train[target_col])
    predictions = model.predict(test[feature_col])
    mse = mean_squared_error(test[target_col], predictions)
    rmse = np.sqrt(np.abs(mse))
    return rmse

For each selected column, additional k values can be used to create, train, and test a univariate model.

In [None]:
# List of k_values
k_values = range(1,10,2)

# Create a dataframe to store the result
univariate_k_rmse = pd.DataFrame(data = 0.0, index = range(len(k_values)), columns = feature_cols) # Float zeros so RMSE values fit the column dtype
univariate_k_rmse['k_values'] = k_values

In [None]:
for col in feature_cols:
    for n in k_values:
        univariate_k_rmse.loc[univariate_k_rmse.k_values == n, col] = knn_train_test([col], 'price', cars_clean, n)

In [None]:
univariate_k_rmse

And while a table is always nice, a picture can speak a thousand words:

In [None]:
# Visualize with plotly bar graph and slider
import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()

# Initialize a set of colors
colors = ['#30336b',
          '#4834d4', '#686de0',
          '#22a6b3', '#7ed6df']

# Create figure
fig = go.Figure()
i = 0
# Add traces, one for each slider step
for step in np.arange(1, 10, 2):
    fig.add_trace(
        go.Bar(
            visible=False,
            name="k-value = " + str(step),
            x=feature_cols,
            y=univariate_k_rmse.loc[i, feature_cols],
            marker=dict(
                color=colors[i]
            )))
    i+=1
    
    
# Make first trace visible
fig.data[0].visible = True

# Create and add slider
steps = []

for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": "Slider switched to K-Value: " + str(k_values[i])}],
        label = str(k_values[i]) # layout attribute
    )
    step["args"][0]["visible"][i] = True  # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active=0,
    currentvalue={"prefix": "K-Value: "},
    pad={"t": 50},
    steps=steps
)]

fig.layout.update(
    sliders=sliders,
    yaxis=dict(range=[0,1.2e4])
)

fig.show()

# 4. Building the Multivariate Model

The `knn_train_test()` function already accepts a list of feature columns, so it can be reused to build models with more than one feature.

The next step trains and tests models on the best-performing features from the previous step, using the default k value of 5.

In [None]:
# Find the mean RMSE across k values for each feature from the previous step and rank the features
best_five = univariate_k_rmse[feature_cols].mean().sort_values().index[:5]
best_eight = univariate_k_rmse[feature_cols].mean().sort_values().index[:8]
best_eight
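The ranking step averages each feature's RMSE over all tested k values and sorts the features by that average. A minimal plain-Python sketch of the same idea follows. The numbers are hypothetical and are not results from this notebook.

```python
from statistics import mean

# Hypothetical RMSE values per feature for k = 1, 3, 5 (illustration only)
example_rmse_by_k = {
    'horsepower': [4100.0, 3900.0, 4000.0],
    'width': [4600.0, 4400.0, 4500.0],
    'height': [8200.0, 7900.0, 7900.0],
}

# Average each feature's RMSE across the tested k values
mean_rmse = {feature: mean(scores) for feature, scores in example_rmse_by_k.items()}

# Sort features from lowest to highest mean RMSE and keep the best two
ranked = sorted(mean_rmse, key=mean_rmse.get)
best_two = ranked[:2]
print(best_two)
```

Averaging over k keeps a feature from being ranked highly just because it happened to do well at a single k value.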

In [None]:
# Use the best 2 through 8 features from the previous step to train and test multivariate k-nearest neighbors models
for n in range(2, 9):
    features = list(best_eight[:n]) # The same feature list is used for fitting and for the printed label
    rmse = knn_train_test(feature_col = features, target_col = 'price', df = cars_clean, k = 5)
    print('RMSE from default k and feature columns', features, 'is: ', rmse)

The three best-performing models can now be tuned by varying k.

In [None]:
# Initialize a list of k values from 1 to 25
multi_k = range(1,26)

# Initialize a dataframe to store result 
models = ['3_best_features', '4_best_features', '5_best_features']
multivariate_k_rmse = pd.DataFrame(data = 0.0, columns = models, index = range(len(multi_k))) # Float zeros so RMSE values fit the column dtype
multivariate_k_rmse['k_values'] = multi_k

# Fit the best 3 models from the previous step
for i in range(3):
    for n in multi_k:
        rmse = knn_train_test(best_five[:i+3], 'price', cars_clean, n) # i = 0, 1, 2 gives 3, 4, 5 features, matching the labels in models
        multivariate_k_rmse.loc[multivariate_k_rmse.k_values == n, models[i]] = rmse 
    
multivariate_k_rmse
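Once the sweep is complete, the best k for each model is the one with the lowest RMSE. A minimal plain-Python sketch of that lookup follows, using hypothetical numbers rather than the values in the table above.

```python
# Hypothetical RMSE values per model and k (illustration only, not notebook results)
example_sweep = {
    '3_best_features': {1: 3100.0, 2: 2900.0, 3: 3000.0},
    '4_best_features': {1: 2800.0, 2: 2950.0, 3: 3050.0},
}

# For each model, pick the k whose RMSE is lowest
best_k = {model: min(scores, key=scores.get) for model, scores in example_sweep.items()}
print(best_k)
```

In this made-up example the two models prefer different k values, which is the kind of behavior the sweep is meant to reveal.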

Again, the results are easier to compare in a plot:

In [None]:
# Visualize with plotly line graph and slider

# Initialize a set of colors
colors = ['#d54062', '#ffa36c','#799351']

# Create figure
fig = go.Figure()
i = 0
# Add traces, one for each slider step
for step in models:
    fig.add_trace(
        go.Scatter(
            visible=False,
            name="Number of features = " + str(step),
            x=multivariate_k_rmse.k_values,
            y=multivariate_k_rmse[step],
            marker=dict(
                color=colors[i]
            )))
    i+=1
    
    
# Make first trace visible
fig.data[0].visible = True

# Create and add slider
steps = []

for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": "Slider switched to: " + str(models[i])}],
        label = str(models[i]) # layout attribute
    )
    step["args"][0]["visible"][i] = True  # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active = 0,
    currentvalue={"prefix": "Model: "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders,
    yaxis=dict(range=[1500, 3600]),
    xaxis = dict(range = [1,25],
                nticks = 25)
)

fig.show()

# Conclusion:

In this project, KNN was used to predict car prices, and the model was tuned over a range of feature sets and k values. Interactive visualizations were also used to let the reader explore how the results vary.

Some of the findings:
* With the default k = 5, the RMSE first drops as more of the top-ranked features are added, but then rises again. This suggests that adding features beyond a certain point hurts the model rather than helping it; it appears that feature relevance does play a role.
* And much like rings, there is no 'one K to rule them all.' Instead, each feature responds differently to different k-values. Also, different combinations of features respond differently to different k-values.