In [126]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## <font color=blue>01 Introduction to the data set</font> 
*  Read __imports-85.data__ ([data set description](https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names)) into a dataframe named __cars__. If you read in the file using [pandas.read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) without specifying any additional parameter values, you'll notice that the column names don't match the ones in the [dataset's documentation](https://archive.ics.uci.edu/ml/datasets/automobile). Why do you think this is and how can you fix this?
*  Determine which columns are numeric and can be used as features and which column is the target column.
*  Display the first few rows of the dataframe and make sure it looks like the data set preview.

In [127]:
headers=['symboling','normalized_losses','make','fuel_type','aspiration','num_of_doors',
         'body_style','drive_wheels','engine_location','wheel_base','length','width',
        'height','curb_weight','engine_type','num_of_cylinders','engine_size','fuel_system',
        'bore','stroke','compression_ratio','horsepower','peak_rpm','city_mpg','highway_mpg',
        'price']
cars = pd.read_csv('imports-85.data.txt', names=headers)

# Select only the columns with continuous values from - https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
continuous_values_cols = ['normalized_losses', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']
numeric_cars = cars[continuous_values_cols].copy()
print(numeric_cars.info())
numeric_cars.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 14 columns):
normalized_losses    205 non-null object
wheel_base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb_weight          205 non-null int64
bore                 205 non-null object
stroke               205 non-null object
compression_ratio    205 non-null float64
horsepower           205 non-null object
peak_rpm             205 non-null object
city_mpg             205 non-null int64
highway_mpg          205 non-null int64
price                205 non-null object
dtypes: float64(5), int64(3), object(6)
memory usage: 22.5+ KB
None


Unnamed: 0,normalized_losses,wheel_base,length,width,height,curb_weight,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,?,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,13495
1,?,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,16500
2,?,94.5,171.2,65.5,52.4,2823,2.68,3.47,9.0,154,5000,19,26,16500
3,164,99.8,176.6,66.2,54.3,2337,3.19,3.4,10.0,102,5500,24,30,13950
4,164,99.4,176.6,66.4,54.3,2824,3.19,3.4,8.0,115,5500,18,22,17450
5,?,99.8,177.3,66.3,53.1,2507,3.19,3.4,8.5,110,5500,19,25,15250
6,158,105.8,192.7,71.4,55.7,2844,3.19,3.4,8.5,110,5500,19,25,17710
7,?,105.8,192.7,71.4,55.7,2954,3.19,3.4,8.5,110,5500,19,25,18920
8,158,105.8,192.7,71.4,55.9,3086,3.13,3.4,8.3,140,5500,17,20,23875
9,?,99.5,178.2,67.9,52.0,3053,3.13,3.4,7.0,160,5500,16,22,?


## <font color=blue>02 Data Cleaning</font>
*  Use the [DataFrame.replace()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) method to replace all of the __?__ values with the __numpy.nan__ missing value.
*  Because __?__ is a string value, columns containing this value were cast to the pandas __object__ data type (instead of a numeric type like __int__ or __float__). After replacing the ? values, determine which columns need to be converted to numeric types. You can use either the [DataFrame.astype()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html) or the [Series.astype()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html) methods to convert column types.
*  Return the number of rows that have a missing value for the __normalized-losses__ column. Determine how you should handle this column. You could:
  *  Replace the missing values using the average values from that column.
  *  Drop the rows entirely (especially if other columns in those rows have missing values).
  *  Drop the column entirely.
*  Explore the missing value counts for the other numeric columns and handle any missing values.
*  Of the columns you decided to keep, normalize the numeric ones so all values range from __0__ to __1__.

In [128]:
print('Convert missing values (?) with np.NaN then set the type to float')
numeric_cars.replace(to_replace='?', value=np.nan, inplace=True)
numeric_cars.astype('float', inplace=True)
numeric_cars.head(10)

Convert missing values (?) with np.NaN then set the type to float


Unnamed: 0,normalized_losses,wheel_base,length,width,height,curb_weight,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,13495.0
1,,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,16500.0
2,,94.5,171.2,65.5,52.4,2823,2.68,3.47,9.0,154,5000,19,26,16500.0
3,164.0,99.8,176.6,66.2,54.3,2337,3.19,3.4,10.0,102,5500,24,30,13950.0
4,164.0,99.4,176.6,66.4,54.3,2824,3.19,3.4,8.0,115,5500,18,22,17450.0
5,,99.8,177.3,66.3,53.1,2507,3.19,3.4,8.5,110,5500,19,25,15250.0
6,158.0,105.8,192.7,71.4,55.7,2844,3.19,3.4,8.5,110,5500,19,25,17710.0
7,,105.8,192.7,71.4,55.7,2954,3.19,3.4,8.5,110,5500,19,25,18920.0
8,158.0,105.8,192.7,71.4,55.9,3086,3.13,3.4,8.3,140,5500,17,20,23875.0
9,,99.5,178.2,67.9,52.0,3053,3.13,3.4,7.0,160,5500,16,22,


In [129]:
print('This shows the percentage of values in each column that are not numberic.')

not_numeric_count = len(numeric_cars) - numeric_cars.count(axis=0, level=None, numeric_only=False)
percentage_not_numeric = (not_numeric_count / len(numeric_cars)) * 100
percentage_not_numeric

This shows the percentage of values in each column that are not numberic.


normalized_losses    20.00000
wheel_base            0.00000
length                0.00000
width                 0.00000
height                0.00000
curb_weight           0.00000
bore                  1.95122
stroke                1.95122
compression_ratio     0.00000
horsepower            0.97561
peak_rpm              0.97561
city_mpg              0.00000
highway_mpg           0.00000
price                 1.95122
dtype: float64

In [130]:
print("Because the column we're trying to predict is 'price', any row were price is NaN will be removed.")
print("After doings so we will again check the percentage of values that are NaN for each column")
numeric_cars.dropna(subset=['price'], inplace=True)
not_numeric_count = len(numeric_cars) - numeric_cars.count(axis=0, level=None, numeric_only=False)
percentage_not_numeric = (not_numeric_count / len(numeric_cars)) * 100
percentage_not_numeric

Because the column we're trying to predict is 'price', any row were price is NaN will be removed.
After doings so we will again check the percentage of values that are NaN for each column


normalized_losses    18.407960
wheel_base            0.000000
length                0.000000
width                 0.000000
height                0.000000
curb_weight           0.000000
bore                  1.990050
stroke                1.990050
compression_ratio     0.000000
horsepower            0.995025
peak_rpm              0.995025
city_mpg              0.000000
highway_mpg           0.000000
price                 0.000000
dtype: float64

In [132]:
print("All remaining NaN's will be filled with the mean of its respective column")
print(numeric_cars.mean())
numeric_cars = numeric_cars.fillna(numeric_cars.mean())
print("Then, verify that all NaN's have been removed by showing the number of NaN's for each column.")

# not_numeric_count = len(numeric_cars) - numeric_cars.count(axis=0, level=None, numeric_only=False)
# percentage_not_numeric = (not_numeric_count / len(numeric_cars)) * 100
# percentage_not_numeric
print(numeric_cars.head(10))

All remaining NaN's will be filled with the mean of its respective column
wheel_base             98.797015
length                174.200995
width                  65.889055
height                 53.766667
curb_weight          2555.666667
compression_ratio      10.164279
city_mpg               25.179104
highway_mpg            30.686567
dtype: float64
Then, verify that all NaN's have been removed by showing the number of NaN's for each column.
   normalized_losses  wheel_base  length  width  height  curb_weight  bore  \
0                NaN        88.6   168.8   64.1    48.8         2548  3.47   
1                NaN        88.6   168.8   64.1    48.8         2548  3.47   
2                NaN        94.5   171.2   65.5    52.4         2823  2.68   
3                164        99.8   176.6   66.2    54.3         2337  3.19   
4                164        99.4   176.6   66.4    54.3         2824  3.19   
5                NaN        99.8   177.3   66.3    53.1         2507  3.19   
6      

## <font color=blue>03 Univariate Model</font>
*  Create a function, named __knn_train_test()__ that encapsulates the training and simple validation process. This function should have 3 parameters -- training column name, target column name, and the dataframe object.
  *  This function should split the data set into a training and test set.
  *  Then, it should instantiate the KNeighborsRegressor class, fit the model on the training set, and make predictions on the test set.
  *  Finally, it should calculate the RMSE and return that value.
*  Use this function to train and test univariate models using the different numeric columns in the data set. Which column performed the best using the default __k__ value?
*  Modify the __knn_train_test()__ function you wrote to accept a parameter for the __k__ value.
  *  Update the function logic to use this parameter.
  *  For each numeric column, create, train, and test a univariate model using the following __k__ values (__1__, __3__, __5__, __7__, and __9__). Visualize the results using a scatter plot and a line plot.

## <font color=blue>04 Mulitvariate Model</font>
*  Modify the __knn_train_test()__ function to accept a list of column names (instead of just a string). Modify the rest of the function logic to use this parameter:
  *  Instead of using just a single column for train and test, use all of the columns passed in.
  *  Use a the default k value from scikit-learn for now (we'll tune the k value in the next step).
*  Use the best 2 features from the previous step to train and test a multivariate k-nearest neighbors model using the default __k__ value.
*  Use the best 3 features from the previous step to train and test a multivariate k-nearest neighbors model using the default __k__ value.
*  Use the best 4 features from the previous step to train and test a multivariate k-nearest neighbors model using the default __k__ value.
*  Use the best 5 features from the previous step to train and test a multivariate k-nearest neighbors model using the default __k__ value.
*  Display all of the RMSE values.

## <font color=blue>05 Hyperparameter Tuning</font>
*  For the top 3 models in the last step, vary the hyperparameter value from __1__ to __25__ and plot the resulting RMSE values.
*  Which __k__ value is optimal for each model? How different are the __k__ values and what do you think accounts for the differences?