## Car Price Prediction Using Scikit-learn

#### (Chapter 3, Challenge Problem 1)

This is a second implementation of the car-price-prediction project, using scikit-learn rather than constructing the algorithms by hand as in the original project. I will skip most of the exploratory analysis in this version, since it should not differ meaningfully from the original implementation. I will also generally limit code explanations to inline comments, as most of the process closely follows the telco-customer-churn project (https://github.com/mbalexander19/telco_customer_churn), which uses scikit-learn from the start.

All code in this notebook is entirely my own.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('data.csv')

In [3]:
#lowercase all column names and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

#select all columns with string values
string_columns = list(df.dtypes[df.dtypes == 'object'].index)

#lowercases and replaces spaces with underscores for all string values
for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')

df.head().T

Unnamed: 0,0,1,2,3,4
make,bmw,bmw,bmw,bmw,bmw
model,1_series_m,1_series,1_series,1_series,1_series
year,2011,2011,2011,2011,2011
engine_fuel_type,premium_unleaded_(required),premium_unleaded_(required),premium_unleaded_(required),premium_unleaded_(required),premium_unleaded_(required)
engine_hp,335.0,300.0,300.0,230.0,230.0
engine_cylinders,6.0,6.0,6.0,6.0,6.0
transmission_type,manual,manual,manual,manual,manual
driven_wheels,rear_wheel_drive,rear_wheel_drive,rear_wheel_drive,rear_wheel_drive,rear_wheel_drive
number_of_doors,2.0,2.0,2.0,2.0,2.0
market_category,"factory_tuner,luxury,high-performance","luxury,performance","luxury,high-performance","luxury,performance",luxury


In [4]:
#convert cylinders and number of doors to strings so they are one-hot encoded as categories rather than treated as numbers
df = df.astype({'engine_cylinders':'string', 'number_of_doors':'string'})
df.dtypes

make                  object
model                 object
year                   int64
engine_fuel_type      object
engine_hp            float64
engine_cylinders      string
transmission_type     object
driven_wheels         object
number_of_doors       string
market_category       object
vehicle_size          object
vehicle_style         object
highway_mpg            int64
city_mpg               int64
popularity             int64
msrp                   int64
dtype: object

In [5]:
#verify that mathematical operations no longer work on transformed columns
for col in ['engine_cylinders', 'number_of_doors']:
    try: #this should fail if conversion was successful
        df[col].sum()
    except TypeError:
        print(f'{col} is a non-numeric column!')

engine_cylinders is a non-numeric column!
number_of_doors is a non-numeric column!


In [6]:
#check NAs
df.isnull().sum()

make                    0
model                   0
year                    0
engine_fuel_type        3
engine_hp              69
engine_cylinders       30
transmission_type       0
driven_wheels           0
number_of_doors         6
market_category      3742
vehicle_size            0
vehicle_style           0
highway_mpg             0
city_mpg                0
popularity              0
msrp                    0
dtype: int64

In [7]:
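#fill NAs in the categorical columns with a 'nodata' placeholder
na_cols = ['engine_fuel_type', 'engine_cylinders', 'number_of_doors', 'market_category']

for col in na_cols:
    #assign back rather than using inplace=True on a column, which newer pandas versions warn about
    df[col] = df[col].fillna('nodata')

#fill NAs in engine_hp with 0
df['engine_hp'] = df['engine_hp'].fillna(0)
df.isnull().sum()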

make                 0
model                0
year                 0
engine_fuel_type     0
engine_hp            0
engine_cylinders     0
transmission_type    0
driven_wheels        0
number_of_doors      0
market_category      0
vehicle_size         0
vehicle_style        0
highway_mpg          0
city_mpg             0
popularity           0
msrp                 0
dtype: int64

In [8]:
#calculate age from year
df['age'] = 2017 - df['year']

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
#split data twice to get 60% training, 20% validation, 20% test
df_train_full, df_test = train_test_split(df, test_size = 0.2, random_state = 1)
df_train, df_val = train_test_split(df_train_full, test_size = 0.25, random_state = 11)

#extract target variable and log transform
y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)
y_test = np.log1p(df_test.msrp.values)

#delete target variables from original data
del df_train['msrp']
del df_val['msrp']
del df_test['msrp']
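
The two-stage split above yields a 60/20/20 partition because the second `test_size` applies to the remaining 80% of the rows, not to the full data set. A minimal arithmetic check using only the standard library (the variable names are illustrative and do not come from the notebook):

```python
from fractions import Fraction

# Fractions passed as test_size in the two train_test_split calls above
first_test_size = Fraction(1, 5)    # 0.2 of the full data -> test set
second_test_size = Fraction(1, 4)   # 0.25 of the remainder -> validation set

test_share = first_test_size
remaining = 1 - first_test_size             # 4/5 left after the first split
val_share = remaining * second_test_size    # 4/5 * 1/4 = 1/5
train_share = remaining - val_share         # 3/5

print(train_share, val_share, test_share)   # 3/5 1/5 1/5
```

The actual row counts are rounded to whole rows, so the realized shares can differ very slightly from these fractions.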

In [11]:
df.dtypes

make                  object
model                 object
year                   int64
engine_fuel_type      object
engine_hp            float64
engine_cylinders      string
transmission_type     object
driven_wheels         object
number_of_doors       string
market_category       object
vehicle_size          object
vehicle_style         object
highway_mpg            int64
city_mpg               int64
popularity             int64
msrp                   int64
age                    int64
dtype: object

In [12]:
#create lists of categories to use in model
categorical = ['make', 'engine_fuel_type', 'engine_cylinders', 'transmission_type', 'driven_wheels',
               'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style']
numeric = ['engine_hp', 'highway_mpg', 'city_mpg', 'popularity', 'age']

In [13]:
#create dictionaries
train_dict = df_train[categorical + numeric].to_dict(orient = 'records')
val_dict = df_val[categorical + numeric].to_dict(orient = 'records')
test_dict = df_test[categorical + numeric].to_dict(orient = 'records')

In [14]:
from sklearn.feature_extraction import DictVectorizer

In [15]:
dv = DictVectorizer(sparse = False)
dv.fit(train_dict)

X_train = dv.transform(train_dict)
len(dv.get_feature_names_out())
#there were 171 columns in the full model in the original implementation vs. 175 here
#this is explained by the 'nodata' placeholder that replaces NA in 4 categorical columns here
#in the original implementation, NAs didn't map to any column at all

175
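
To illustrate why a placeholder such as `'nodata'` adds a column, here is a simplified pure-Python stand-in for the feature naming that `DictVectorizer` performs (string values become `key=value` indicator columns, numeric values keep one column per key). This is a sketch under that assumption, not the library's implementation, and the toy records are made up rather than taken from `data.csv`:

```python
def feature_names(records):
    """Collect one-hot style feature names from a list of dicts.

    String values become 'key=value' indicator columns; numeric values
    keep a single column named after the key.
    """
    names = set()
    for record in records:
        for key, value in record.items():
            if isinstance(value, str):
                names.add(f"{key}={value}")
            else:
                names.add(key)
    return sorted(names)

# Toy records in which a missing value is replaced by a placeholder
with_placeholder = [
    {"make": "bmw", "number_of_doors": "2.0", "engine_hp": 300.0},
    {"make": "toyota", "number_of_doors": "nodata", "engine_hp": 268.0},
]
# The same rows if the missing value were simply left out of the record
without_placeholder = [
    {"make": "bmw", "number_of_doors": "2.0", "engine_hp": 300.0},
    {"make": "toyota", "engine_hp": 268.0},
]

extra = len(feature_names(with_placeholder)) - len(feature_names(without_placeholder))
print(feature_names(with_placeholder))
print(extra)  # 1 extra column: 'number_of_doors=nodata'
```

Each column that receives a placeholder contributes one additional indicator, which is consistent with four placeholder-filled columns accounting for the four extra features.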

In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

In [17]:
#we'll first try a regular, non-regularized linear regression
model = LinearRegression()
model.fit(X_train, y_train)

In [18]:
#run predictions on validation set
X_val = dv.transform(val_dict)

y_pred = model.predict(X_val)

In [19]:
from sklearn.metrics import mean_squared_error

In [20]:
rmse_val = np.sqrt(mean_squared_error(y_val, y_pred))

In [21]:
rmse_val

1574.5724826561748

It's clear that the non-regularized model did not work as intended, though this was the result I expected: we had the same issue with the non-regularized model in the original project. The RMSE here is over 1500, and remember that it is measured on the log of price, so it corresponds to predictions that are off by a factor of roughly $e^{1574}$ relative to the true price.
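
To give a sense of scale, the short check below (standard library only, with illustrative variable names) shows that a factor of $e^{1574}$ cannot even be represented as a 64-bit float:

```python
import math
import sys

log_rmse = 1574.5724826561748  # validation RMSE reported above, on the log scale

# Largest x for which exp(x) still fits in a 64-bit float (about 709.78)
largest_safe_exponent = math.log(sys.float_info.max)

try:
    factor = math.exp(log_rmse)
except OverflowError:
    # math.exp raises instead of returning inf when the result is too large
    factor = math.inf

print(largest_safe_exponent, factor)
```

In other words, the model's predictions on the validation set are numerically meaningless rather than merely inaccurate.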

Now let's try ridge regression.

In [22]:
model = Ridge()
model.fit(X_train, y_train)
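
One plausible reason regularization helps here (an assumption on my part; the notebook does not diagnose the cause) is that one-hot encoded columns are linearly dependent, which makes the matrix $X^TX$ in the normal equations singular or nearly so. Ridge regression adds a constant `alpha` to its diagonal. The toy sketch below uses a made-up two-column design matrix and hypothetical helper functions to show the effect on the determinant:

```python
def gram_2x2(col_a, col_b):
    """Return X^T X for a two-column design matrix given as two lists."""
    aa = sum(a * a for a in col_a)
    ab = sum(a * b for a, b in zip(col_a, col_b))
    bb = sum(b * b for b in col_b)
    return [[aa, ab], [ab, bb]]

def det_2x2(m):
    """Determinant of a 2x2 matrix stored as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Toy example: two perfectly collinear columns (the second is twice the first)
col_a = [1.0, 0.0, 1.0, 1.0]
col_b = [2.0, 0.0, 2.0, 2.0]

gram = gram_2x2(col_a, col_b)
alpha = 1.0  # regularization strength; 1.0 is the default for sklearn's Ridge
ridge_gram = [[gram[0][0] + alpha, gram[0][1]],
              [gram[1][0], gram[1][1] + alpha]]

print(det_2x2(gram))        # 0.0  -> singular, no unique least-squares solution
print(det_2x2(ridge_gram))  # 16.0 -> invertible once alpha is added to the diagonal
```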

In [23]:
y_pred = model.predict(X_val)

In [24]:
rmse_val = np.exp(np.sqrt(mean_squared_error(y_val, y_pred)))
rmse_val

1.5911423775012985

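This is *much* more reasonable, and close to the RMSE we got in the original project. Note that this cell reports $e^{\text{RMSE}}$ rather than the RMSE itself, so the value is a multiplicative error factor and is not on the same scale as the 1574 above. The error value here is about 0.1 higher than in the original model, but it's close enough that we can reasonably attribute most of the difference to using different subsets of the data.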
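
Because the cell above wraps the RMSE in `np.exp`, the reported number can be converted back to the log scale by taking its natural logarithm. A small sketch using the standard library (variable names are illustrative):

```python
import math

reported = 1.5911423775012985        # value printed by the cell above

# Undo the exponentiation to recover the RMSE on the log1p(msrp) scale
log_scale_rmse = math.log(reported)

print(round(log_scale_rmse, 3))      # 0.464
```

Read as a factor, 1.59 means a typical prediction is off from the true price by a multiple of about 1.59 in either direction.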

In [25]:
#let's also run the test set through
X_test = dv.transform(test_dict)

y_pred = model.predict(X_test)

In [26]:
rmse_val = np.exp(np.sqrt(mean_squared_error(y_test, y_pred)))
rmse_val

1.478986264313766

We got a lower error on the test set (1.48 vs. 1.59 on the validation set), which supports the conclusion that the sklearn-based model and the manual model from the original project are of effectively the same quality.

Finally, we can test our 2013 Toyota Venza on the new model.

In [27]:
#rolling the venza back out again - though data is pre-cleaned to run through sklearn properly
ad = {
    'city_mpg': 18,
    'driven_wheels': 'all_wheel_drive',
    'engine_cylinders': '6.0',
    'engine_fuel_type': 'regular_unleaded',
    'engine_hp': 268.0,
    'highway_mpg': 25,
    'make': 'toyota',
    'market_category': 'crossover,performance',
    'model': 'venza',
    'number_of_doors': '4.0',
    'popularity': 2031,
    'transmission_type': 'automatic',
    'vehicle_size': 'midsize',
    'vehicle_style': 'wagon',
    'year': 2013
}

ad['age'] = 2017 - ad['year']

In [28]:
X_ad = dv.transform(ad)

In [29]:
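#note: the target was transformed with np.log1p, whose exact inverse is np.expm1;
#np.exp is kept here, so the result is exactly 1 dollar higher than the exact inverse
venza_price = np.exp(model.predict(X_ad))
venza_price[0].round(2)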

28054.38

In [30]:
(venza_price[0] - 31120).round(2)

-3065.62
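
Since the target was built with `np.log1p`, it is worth checking how `exp` and `expm1` differ as inverse transforms. The sketch below uses the standard-library equivalents `math.log1p`, `math.expm1`, and `math.exp` on the Venza's actual price:

```python
import math

price = 31120                      # actual Venza price used in the cell above
log_price = math.log1p(price)      # same transform as np.log1p on the target

exact = math.expm1(log_price)      # exact inverse of log1p
approx = math.exp(log_price)       # plain exp returns price + 1

print(round(exact, 2), round(approx, 2))
```

The two differ by one dollar, which is negligible at this price level but would matter for targets close to zero.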

The predicted price of the Venza from the sklearn model is 28054 dollars, which is just over 3000 dollars lower than its actual price of 31120 dollars. This error is slightly greater (by about 240 dollars) than that of the initial model from the original project, which predicted a price of 28294 dollars. It does far better for this particular vehicle than the second, fully-factored model, which was over 10000 dollars too high in its prediction.
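
The comparison above can be reproduced with a few lines of arithmetic. The figures are the ones quoted in this notebook; the variable names are illustrative:

```python
actual_price = 31120            # actual price of the Venza
sklearn_prediction = 28054.38   # prediction from this notebook
original_prediction = 28294     # prediction quoted from the original project

sklearn_error = sklearn_prediction - actual_price
original_error = original_prediction - actual_price

print(round(sklearn_error, 2))                              # -3065.62
print(original_error)                                       # -2826
print(round(abs(sklearn_error) - abs(original_error), 2))   # 239.62
print(round(100 * abs(sklearn_error) / actual_price, 1))    # 9.9 (percent)
```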

Our RMSE values - all of which were fairly similar, in the 1.45-1.6 range - are of course better estimates of model quality than a single Venza, so we shouldn't read too much into this specific test.