# Predicting Car Prices using the K Nearest Neighbors Algorithm

K Nearest Neighbors, or KNN, is an algorithm that makes predictions based on the similarity between different observations. In this project, I used KNN to predict the price of a car based on how similar its features are to those of other cars. Towards this end, I applied various machine learning techniques, such as standardization, feature selection, train-test split, hyperparameter optimization, and k-fold cross-validation.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn import model_selection
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from scipy.stats import zscore
import seaborn as sns
import matplotlib.pyplot as plt

plt.show()  # Display the plot

## Step 1 : Data Inspection and Cleaning

In [2]:
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']


In [3]:
cars= pd.read_csv('imports-85.data', names=cols)


In [4]:
cars.head(3)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500


In [5]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized-losses    205 non-null object
make                 205 non-null object
fuel-type            205 non-null object
aspiration           205 non-null object
num-of-doors         205 non-null object
body-style           205 non-null object
drive-wheels         205 non-null object
engine-location      205 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-type          205 non-null object
num-of-cylinders     205 non-null object
engine-size          205 non-null int64
fuel-system          205 non-null object
bore                 205 non-null object
stroke               205 non-null object
compression-rate     205 non-null float64
horsepower           205 non-nul

In [6]:
summary= cars.describe()
print(summary)

        symboling  wheel-base      length       width      height  \
count  205.000000  205.000000  205.000000  205.000000  205.000000   
mean     0.834146   98.756585  174.049268   65.907805   53.724878   
std      1.245307    6.021776   12.337289    2.145204    2.443522   
min     -2.000000   86.600000  141.100000   60.300000   47.800000   
25%      0.000000   94.500000  166.300000   64.100000   52.000000   
50%      1.000000   97.000000  173.200000   65.500000   54.100000   
75%      2.000000  102.400000  183.100000   66.900000   55.500000   
max      3.000000  120.900000  208.100000   72.300000   59.800000   

       curb-weight  engine-size  compression-rate    city-mpg  highway-mpg  
count   205.000000   205.000000        205.000000  205.000000   205.000000  
mean   2555.565854   126.907317         10.142537   25.219512    30.751220  
std     520.680204    41.642693          3.972040    6.542142     6.886443  
min    1488.000000    61.000000          7.000000   13.000000    16.00

Cleaning steps: count the missing values in each column and convert text values such as 'four' into numbers.

In [7]:
cars=cars.replace(r'?', np.nan)
cars.head(5)


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [8]:
null_pct=(cars
          .isnull()
          .sum()
          .divide(cars.shape[0])
         .multiply(100)
         )
null_pct.loc[null_pct>0]

normalized-losses    20.00000
num-of-doors          0.97561
bore                  1.95122
stroke                1.95122
horsepower            0.97561
peak-rpm              0.97561
price                 1.95122
dtype: float64

The table above shows the percentage of missing values in each column that has any. In particular, “normalized-losses” is missing in 20% of the observations, so we will drop this column from the dataset. For the other six columns, the share of missing values is small (under 2%), so we will remove the affected rows using listwise deletion.

In [9]:
cars['num-of-doors'].unique()
# '?' was already replaced with NaN above; values missing from the mapping also become NaN
cars['num-of-doors'] = cars['num-of-doors'].map({'two': 2.0, 'four': 4.0})

In [10]:
cars['fuel-type'].unique()
cars['fuel-type'] = cars['fuel-type'].map({'gas':1.0, 'diesel':2.0})


In [11]:
cars['body-style'].unique()
cars['body-style']=cars['body-style'].map({'convertible':1.0, 'hatchback':2.0,'sedan':3.0, 'wagon':4.0,'hardtop':5.0,})

In [12]:
cars.head(4)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,1.0,std,2.0,1.0,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,1.0,std,2.0,1.0,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,1.0,std,2.0,2.0,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,1.0,std,4.0,3.0,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950


In [13]:
cars['built'] = cars['length'] * cars['height'] * cars['width']

In [14]:
cars = (
    cars
    .drop('normalized-losses', axis=1)
    .dropna(
        subset=[
            "num-of-doors",
            "bore",
            "stroke",
            "horsepower",
            "peak-rpm",
            "price",
        ]
    )
)

num_null = cars.isnull().sum().sum()
# print(f"Total number of missing values: {num_null}")
# print(f"New shape of dataset: {cars.shape}")



## Step 2 : K Nearest Neighbors Algorithm

KNN uses the Euclidean distance as a similarity metric between two observations. A low distance close to 0 means that the observations are very similar to each other. The following formula is used:
\begin{equation}
\text{Euclidean Distance}(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
\end{equation}

Given that we want to predict the price of a car \(q\), K-Nearest Neighbors (KNN) computes the Euclidean distance of \(q\) from every single car in the training set. The cars most similar to \(q\) are its "nearest neighbors."

We then choose a number \(k\), which will determine how many of the nearest neighbors will be selected. For example, if \(k=5\), we select the five most similar cars. Then, we take the mean price of these five cars, and we predict that this is the price of car \(q\).
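The prediction procedure described above can be sketched in plain Python. This is a minimal illustration rather than the implementation used later in this project: the helper names `euclidean_distance` and `knn_predict`, and the toy feature values and prices, are assumptions made up for the example.

```python
import math

def euclidean_distance(q, p):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((qi - pi) ** 2 for qi, pi in zip(q, p)))

def knn_predict(train_features, train_prices, q, k):
    """Predict the price of car q as the mean price of its k nearest neighbors."""
    # Pair every training car's distance to q with that car's price
    distances = [
        (euclidean_distance(q, p), price)
        for p, price in zip(train_features, train_prices)
    ]
    # Smallest distance first, then keep the k most similar cars
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(price for _, price in nearest) / k

# Hypothetical toy data: two already-standardized features per car
train_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_prices = [10000.0, 12000.0, 14000.0, 40000.0]

predicted = knn_predict(train_features, train_prices, [0.1, 0.1], k=3)
print(predicted)  # 12000.0
```

With \(k=3\), the distant fourth car is ignored, so its high price does not affect the prediction.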



## Step 3 : Machine Learning Workflow for Model Preparation

Step 3a : Feature scaling, or data normalisation, is a method used to bring the independent variables (features) of the data onto a common range. Common methods include min-max scaling, mean normalisation, and z-score standardisation.

In this case, we will use the z-score for standardisation, so that each feature contributes equally to the Euclidean distance. The z-score converts each value such that the mean of each feature is 0 and its standard deviation is 1.
The z-score is calculated using the formula:

\[ Z = \frac{{X - \mu}}{{\sigma}} \]

Where:
- \( Z \) is the z-score,
- \( X \) is the value in a feature,
- \( \mu \) is the mean of the feature, and
- \( \sigma \) is the sample standard deviation.
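As a small worked example of the formula, the sketch below standardizes one feature using only the Python standard library. The function name `standardize` and the three curb-weight values are hypothetical and chosen so the arithmetic is easy to follow.

```python
import statistics

def standardize(values):
    """Convert each value to a z-score using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - mean) / sd for x in values]

# Hypothetical curb weights: mean is 2500 and sample standard deviation is 500
curb_weights = [2000.0, 2500.0, 3000.0]
z = standardize(curb_weights)
print(z)  # [-1.0, 0.0, 1.0]
```

After the conversion, the feature has a mean of 0 and a sample standard deviation of 1, regardless of its original units.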


In [15]:
len(cars)  # total rows (cars) remaining in the dataset

193

Convert non-numeric to numeric
The second technique is feature selection. To predict a car's price, we choose relevant features. Only numeric features are suitable for Euclidean distance calculations in KNN. However, you can handle categorical variables with two techniques:

One-Hot Encoding: Convert categories into binary (0/1) dummy variables. For example, 'color' can become 'IsRed,' 'IsGreen,' and 'IsBlue,' enabling distance measurement between categorical variables.

Weighted Distances: Assign different weights to categorical attribute differences to emphasize the importance of specific categories.
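To make the one-hot idea concrete, here is a library-free sketch. The helper `one_hot_encode` and the `is_` column prefix are assumptions for illustration; in a pandas workflow, `pd.get_dummies` is commonly used for the same purpose.

```python
def one_hot_encode(values):
    """Turn a list of category labels into one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]

# Example using the categories found in the 'drive-wheels' column
drive_wheels = ["rwd", "fwd", "4wd", "fwd"]
encoded = one_hot_encode(drive_wheels)

for row in encoded:
    print(row)
```

Each row now contains exactly one 1, so two cars with the same category have a distance of 0 on these columns, while two cars with different categories are equally far apart.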


cars['aspiration'].unique()
cars['aspiration']= cars['aspiration'].map({'std':1, 'turbo':2})

In [16]:
cars.head(4)

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price,built
0,3,alfa-romero,1.0,std,2.0,1.0,rwd,front,88.6,168.8,...,mpfi,3.47,2.68,9.0,111,5000,21,27,13495,528019.904
1,3,alfa-romero,1.0,std,2.0,1.0,rwd,front,88.6,168.8,...,mpfi,3.47,2.68,9.0,111,5000,21,27,16500,528019.904
2,1,alfa-romero,1.0,std,2.0,2.0,rwd,front,94.5,171.2,...,mpfi,2.68,3.47,9.0,154,5000,19,26,16500,587592.64
3,2,audi,1.0,std,4.0,3.0,fwd,front,99.8,176.6,...,mpfi,3.19,3.4,10.0,102,5500,24,30,13950,634816.956


In [17]:
cars.isnull().sum() # checking for null values by columns

symboling           0
make                0
fuel-type           0
aspiration          0
num-of-doors        0
body-style          0
drive-wheels        0
engine-location     0
wheel-base          0
length              0
width               0
height              0
curb-weight         0
engine-type         0
num-of-cylinders    0
engine-size         0
fuel-system         0
bore                0
stroke              0
compression-rate    0
horsepower          0
peak-rpm            0
city-mpg            0
highway-mpg         0
price               0
built               0
dtype: int64

In [18]:
cars = cars.dropna(subset=['price'])  # the target (price) must not contain NaN

In [19]:
cars = cars.fillna(cars.mean(numeric_only=True))  # fill numeric columns with their means

In [20]:
cars['price'].min()

'10198'

## Step 3a. Standardization
In this part, I will individually discuss certain important techniques used in the machine learning workflow. In the next part, I will combine these techniques in order to obtain the optimal KNN model.
The first important technique is standardization. To ensure that each feature contributes equally to the Euclidean distance, we standardize each numeric feature. In other words, each value is converted into a z-score, so that the mean of each feature is 0 and its standard deviation is 1. The following equation is used:


\[ z = \frac{x - \bar{x}}{s} \]



where:

\begin{align*}
z & \text{ is the z-score.} \\
x & \text{ is a value in a feature.} \\
\bar{x} & \text{ is the mean of the feature.} \\
s & \text{ is the sample standard deviation.}
\end{align*}


https://miguelahg.github.io/migs-germar-data-science-blog/posts/2021-12-21-predicting-car-prices-k-nearest-neighbors.html

In [21]:
all_feature_cols = [col for col in cars.columns if col != "price"]

# Series of feature:data type
fdt = cars[all_feature_cols].dtypes

# Identify numeric features
all_numeric_features = fdt.index[fdt != "object"]

# Standardize
cars[all_numeric_features] = cars[all_numeric_features].apply(zscore, axis = 0, ddof = 1)

cars[all_numeric_features].head()

Unnamed: 0,symboling,fuel-type,num-of-doors,body-style,wheel-base,length,width,height,curb-weight,engine-size,compression-rate,city-mpg,highway-mpg,built
0,1.782215,-0.32959,-1.172839,-2.155714,-1.678015,-0.442872,-0.83908,-2.117092,-0.025646,0.045098,-0.287525,-0.677292,-0.555613,-1.165264
1,1.782215,-0.32959,-1.172839,-2.155714,-1.678015,-0.442872,-0.83908,-2.117092,-0.025646,0.045098,-0.287525,-0.677292,-0.555613,-1.165264
2,0.163544,-0.32959,-1.172839,-0.970378,-0.719041,-0.250543,-0.1842,-0.613816,0.496473,0.574066,-0.287525,-0.990387,-0.702307,-0.420946
3,0.97288,-0.32959,0.848214,0.214957,0.14241,0.182198,0.14324,0.17958,-0.426254,-0.459826,-0.03611,-0.207649,-0.115531,0.169087
4,0.97288,-0.32959,0.848214,0.214957,0.077395,0.182198,0.236794,0.17958,0.498371,0.189362,-0.53894,-1.146935,-1.289083,0.193049


## Step 3b. Feature Selection
In our case, we have a regression problem, since we want to predict a continuous variable, car price. Thus, we will use the F-statistic as our score function. According to Frost (2017), the F-statistic indicates the “overall significance” of a linear regression model. In univariate feature selection, we would do the following steps:

- For each feature:
  - Perform linear regression where the independent variable is the feature and the dependent variable is the target (in this case, price).
  - Obtain the F-statistic.
- Compile a list with the F-statistic of each feature.
- Identify the features with the highest F-statistics.

This can be implemented automatically using scikit-learn’s SelectKBest class. It is called SelectKBest because we can set a parameter k, which tells it how many features to select.
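The steps above can also be sketched by hand. For a regression with a single feature, the F-statistic can be computed from the Pearson correlation \(r\) between the feature and the target as \(F = r^2 (n - 2) / (1 - r^2)\), where \(n\) is the number of observations. The sketch below uses that relationship; the function names and the small data set are hypothetical and are not taken from the cars data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def f_statistic(x, y):
    """F-statistic of a simple linear regression of y on the single feature x."""
    r_squared = pearson_r(x, y) ** 2
    return r_squared * (len(x) - 2) / (1 - r_squared)

# Hypothetical data: one feature that tracks price closely and one that does not
features = {
    "engine_size": [1.0, 1.2, 1.6, 2.0, 2.4],
    "height": [54.0, 50.0, 55.0, 51.0, 53.0],
}
prices = [10.0, 12.0, 15.0, 19.0, 25.0]

# Score every feature, then rank from highest to lowest F-statistic
scores = {name: f_statistic(column, prices) for name, column in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Selecting the top k entries of `ranked` mirrors what a univariate selector does: features whose values move together with price receive a high score and are kept.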

In [22]:
from sklearn.feature_selection import SelectKBest, f_regression

# Uses cars and all_numeric_features from the cells above

# Create SelectKBest instance and fit_transform
# Note: .astype(int) truncates the standardized features to whole numbers, which discards
# information; price is cast because it was read in as text
skb = SelectKBest(score_func=f_regression, k=3)
X_new = skb.fit_transform(cars[all_numeric_features].astype(int), cars["price"].astype(int))


# Get the names of the selected features
best_features = cars[all_numeric_features].columns[skb.get_support()].tolist()
print("Top 3 features:", best_features)


Top 3 features: ['length', 'curb-weight', 'engine-size']






## Step 4 : Train-Test Split with Stratification
Prior to model training, the dataset is split into 80% training and 20% testing sets. Ensuring a balanced distribution of the target variable is crucial before this split. The histogram displays the frequency distribution of car prices across the entire dataset.

In [23]:
cars.head()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price,built
0,1.782215,alfa-romero,-0.32959,std,-1.172839,-2.155714,rwd,front,-1.678015,-0.442872,...,mpfi,3.47,2.68,-0.287525,111,5000,-0.677292,-0.555613,13495,-1.165264
1,1.782215,alfa-romero,-0.32959,std,-1.172839,-2.155714,rwd,front,-1.678015,-0.442872,...,mpfi,3.47,2.68,-0.287525,111,5000,-0.677292,-0.555613,16500,-1.165264
2,0.163544,alfa-romero,-0.32959,std,-1.172839,-0.970378,rwd,front,-0.719041,-0.250543,...,mpfi,2.68,3.47,-0.287525,154,5000,-0.990387,-0.702307,16500,-0.420946
3,0.97288,audi,-0.32959,std,0.848214,0.214957,fwd,front,0.14241,0.182198,...,mpfi,3.19,3.4,-0.03611,102,5500,-0.207649,-0.115531,13950,0.169087
4,0.97288,audi,-0.32959,std,0.848214,0.214957,4wd,front,0.077395,0.182198,...,mpfi,3.19,3.4,-0.53894,115,5500,-1.146935,-1.289083,17450,0.193049


In [24]:
cars['price'] = pd.to_numeric(cars['price'])
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(cars["price"], bins=80, kde=True)
plt.title("Frequency Distribution of Car Price")
plt.xlabel("Price (USD)")
plt.ylabel("Number of Cars")

plt.show()

[Figure: histogram titled "Frequency Distribution of Car Price"; x-axis: Price (USD), y-axis: Number of Cars]

We found that the distribution of price is skewed, so we have to ensure that the frequency distribution of the target is similar between the training and testing sets. In this case, we prefer stratified k-fold cross-validation, which keeps the ratio of labels in each fold constant. Because price is a continuous variable, it first has to be grouped into bins that can act as the labels.
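The idea behind stratifying a continuous target can be illustrated without any library. The sketch below groups prices into equally sized rank-based bins and then deals the members of each bin out to the folds in turn. The helpers `quantile_bins` and `stratified_folds` and the price list are hypothetical, and no shuffling is done, so this is only an outline of the mechanism and not a substitute for scikit-learn's `StratifiedKFold`.

```python
def quantile_bins(values, n_bins):
    """Label each value 0..n_bins-1 by rank, so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = rank * n_bins // len(values)
    return labels

def stratified_folds(labels, n_folds):
    """Deal the members of each bin out to the folds in turn."""
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(labels)):
        members = [i for i, current in enumerate(labels) if current == label]
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)
    return folds

# Hypothetical right-skewed prices: many cheap cars and two expensive ones
prices = [5000, 6000, 7000, 8000, 9000, 10000, 30000, 40000]
labels = quantile_bins(prices, n_bins=4)
folds = stratified_folds(labels, n_folds=2)

print(labels)  # [0, 0, 1, 1, 2, 2, 3, 3]
print(folds)   # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Each fold receives one car from every price bin, including one of the two expensive cars, so neither fold is dominated by the cheap end of the distribution.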

In [34]:
# Independent variables: the numeric features chosen by SelectKBest.
# Text columns such as 'fuel-system' cannot be used, because KNN needs numeric inputs.
cars_features = cars[best_features]
# Target variable
cars_target = cars["price"]

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(cars_features, cars_target, test_size=0.2, random_state=42)

# StratifiedKFold needs discrete class labels, so group the continuous prices into quartile bins
price_bins = pd.qcut(y_train, q=4, labels=False)
stratified_kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

In [41]:
# Initialize KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors=2)  # You can adjust the number of neighbors

fold_rmses = []
# Stratify on the price bins, but train and validate on the actual prices
for train_index, val_index in stratified_kfold.split(X_train, price_bins):
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

    # Train the model on the training part of the fold
    model.fit(X_train_fold, y_train_fold)

    # Predict on the validation part of the fold
    y_pred = model.predict(X_val_fold)

    # Calculate RMSE for the current fold
    rmse = np.sqrt(mean_squared_error(y_val_fold, y_pred))
    fold_rmses.append(rmse)
    print(f'RMSE for the current fold: {rmse}')

# Average RMSE across all folds
average_rmse = np.mean(fold_rmses)
print(f'Average RMSE across all folds: {average_rmse}')

## Step 5 : Hyperparameter Optimization
A hyperparameter is a value that influences the behavior of a model and is set before training rather than learned from the data. In the case of KNN, one important hyperparameter is the \(k\) value, or the number of neighbors used to make a prediction. If \(k=5\), we take the mean price of the top five most similar cars and call this our prediction. However, if \(k=3\), we take the top three cars, so the mean price may be different.

We can optimize \(k\) by experimenting with different values and identifying the best-performing model from its RMSE (root mean squared error), which measures how far the predicted values are from the actual values. The model with the lowest RMSE is preferred.
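A minimal sketch of such a search is shown below. It scores each candidate \(k\) on a small held-out validation set using \(\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\). The helper functions, the candidate values of \(k\), and the one-feature toy data are assumptions for illustration, so the selected value says nothing about the best \(k\) for the cars data.

```python
import math

def knn_predict(train_x, train_y, q, k):
    """Mean target of the k training points closest to q (Euclidean distance)."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], q))
    return sum(target for _, target in by_distance[:k]) / k

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical one-feature data, split into training and validation sets
train_x = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
train_y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
val_x = [[2.5], [4.5]]
val_y = [25.0, 45.0]

# Evaluate every candidate k and keep the one with the lowest RMSE
results = {}
for k in [1, 2, 3]:
    predictions = [knn_predict(train_x, train_y, q, k) for q in val_x]
    results[k] = rmse(val_y, predictions)

best_k = min(results, key=results.get)
print(results)  # {1: 5.0, 2: 0.0, 3: 5.0}
print(best_k)   # 2
```

In practice the RMSE for each \(k\) would be averaged over the cross-validation folds instead of a single validation set, which gives a more stable estimate when the data set is small.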