# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.graphics.tsaplots as tsaplots
from statsmodels.tsa.filters.filtertools import convolution_filter
from statsmodels.tsa.seasonal import _extrapolate_trend
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from warnings import filterwarnings 
filterwarnings('ignore')
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error 
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, TransformedTargetRegressor
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SequentialFeatureSelector, SelectFromModel, RFE

import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px


In [None]:
vehicles = pd.read_csv('/Users/kimberlytulga/Documents/Executive Education Courses/Berkley HAAS - ML and AI Certificate/Module 11/practical_application_II_starter/data/vehicles.csv')

In [None]:
vehicles.sample(10)

In [None]:
vehicles.info()

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
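
One way to start on the quality checks described above is a per-column summary of dtype, missing share, and cardinality. The sketch below is only illustrative: `quality_report` is a hypothetical helper, and the small `demo` frame is made-up data standing in for `vehicles.csv`.

```python
import numpy as np
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, percent missing, and unique-value count per column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing_pct': df.isna().mean().mul(100).round(1),
        'n_unique': df.nunique(dropna=True),
    }).sort_values('missing_pct', ascending=False)

# Made-up rows standing in for the real dataset (assumption, not real data)
demo = pd.DataFrame({
    'price': [4500, 0, 12000, 999999],
    'year': [2012, np.nan, 2018, 2007],
    'condition': [None, None, 'good', None],
})

report = quality_report(demo)
print(report)
```

Columns with a large `missing_pct` are candidates for dropping or imputing, and a very high `n_unique` on a text column hints that one-hot encoding it would be expensive.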

In [None]:
vehicles.describe()

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
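
As a hedged illustration of the transformations mentioned above, the sketch below scales a numeric column, one-hot encodes a categorical column, and fits the regression on a log-transformed target. The synthetic `demo` frame and its price formula are assumptions made purely for the example; they are not drawn from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor, make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (assumption): price grows multiplicatively with year
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'year': rng.integers(2005, 2022, size=n),
    'fuel': rng.choice(['gas', 'diesel', 'electric'], size=n),
})
demo['price'] = 5000 * np.exp(0.08 * (demo['year'] - 2005)) * rng.lognormal(0, 0.1, size=n)

# Scale the numeric column and one-hot encode the categorical column
preprocess = make_column_transformer(
    (StandardScaler(), ['year']),
    (OneHotEncoder(handle_unknown='ignore'), ['fuel']),
)

# Fit on log1p(price); predictions are mapped back to dollars with expm1
model = TransformedTargetRegressor(
    regressor=make_pipeline(preprocess, LinearRegression()),
    func=np.log1p,
    inverse_func=np.expm1,
)
features = demo[['year', 'fuel']]
model.fit(features, demo['price'])
preds = model.predict(features)
r2 = model.score(features, demo['price'])
print(f'R^2 on the synthetic data: {r2:.3f}')
```

One-hot encoding avoids the artificial ordering that integer codes impose on a linear model, at the cost of more columns for high-cardinality features.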

In [None]:
vehicles = vehicles.drop(['VIN', 'odometer'], axis = 1)
vehicles = vehicles.set_index('id')
vehicles['year'] = vehicles['year'].fillna(1899).astype('Int64')
#vehicles['odometer'] = vehicles['odometer'].fillna(9999999)
vehicles.head(10)

In [None]:
ss_vehicles = vehicles[vehicles['price'] <= 58000 ]
ss_vehicles = ss_vehicles[ss_vehicles['price'] >= 500]
#ss_vehicles = ss_vehicles[ss_vehicles['odometer'] <= 150000 ]
ss_vehicles = ss_vehicles[ss_vehicles['year'] >= 2005 ]
ss_vehicles = ss_vehicles[ss_vehicles['title_status'] == 'clean' ]
ss_vehicles.info()


In [None]:
ss_vehicles = ss_vehicles.drop(['size','cylinders', 'condition', 'title_status'], axis = 1)
ss_vehicles.describe()

In [None]:
sns.boxplot(data = ss_vehicles,  y = 'price')
plt.xticks(rotation = 90)

In [None]:
sns.boxplot(data = ss_vehicles,  y = 'year')
plt.xticks(rotation = 90)

In [None]:
# Impute only the non-numeric (categorical) columns so that numeric
# columns such as price and year keep their numeric dtype
columns_to_impute = ss_vehicles.select_dtypes(exclude='number').columns

# Replace missing categories with the explicit label 'unknown'
imputer = SimpleImputer(strategy='constant', fill_value='unknown')
vehicles_imputed = ss_vehicles.copy()
vehicles_imputed[columns_to_impute] = imputer.fit_transform(vehicles_imputed[columns_to_impute])

vehicles_imputed.sample(10)

In [None]:
# Label-encode each remaining categorical column as integer codes
from sklearn.preprocessing import LabelEncoder

categorical_cols = ['region', 'manufacturer', 'model', 'fuel', 'transmission',
                    'drive', 'type', 'paint_color', 'state']

# Fit a fresh encoder per column and keep the integer labels
encoded = {col: LabelEncoder().fit_transform(vehicles_imputed[col]) for col in categorical_cols}

# Drop the original string columns, then append the encoded versions
# (paint_color is stored under the shorter name 'paint')
vehicles_imputed = vehicles_imputed.drop(columns=categorical_cols)
for col, labels in encoded.items():
    vehicles_imputed['paint' if col == 'paint_color' else col] = labels

# Preview the encoded DataFrame
vehicles_imputed.sample(10)


In [None]:
# Checking most correlated features

vehicles_imputed.corr()

In [None]:
correlation_maps = sns.PairGrid(vehicles_imputed)
correlation_maps = correlation_maps.map_diag(plt.hist)
correlation_maps = correlation_maps.map_offdiag(plt.scatter)
correlation_maps.add_legend()

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
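
A minimal sketch of the cross-validation step mentioned above, assuming a scaled Ridge pipeline and 5-fold splitting. The synthetic `X_demo` and `y_demo` arrays are placeholders for the prepared features and price; the fold count and `alpha` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption): two informative features plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so MSE is reported as a negative number
neg_mse = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-neg_mse)
print(f'CV RMSE: {cv_rmse.mean():.3f} +/- {cv_rmse.std():.3f}')
```

Reporting the spread across folds alongside the mean gives a sense of how stable the estimate is before any comparison against the held-out test set.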

In [None]:
X = vehicles_imputed.drop('price', axis = 1)
y = vehicles_imputed['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
#Building Sequential Selector Model

selector_pipe = Pipeline([ ('scaler', StandardScaler()),
                         ('selector', SequentialFeatureSelector(LinearRegression())),
                          ('model', LinearRegression())])

selector_pipe

In [None]:
param_dict = {'selector__n_features_to_select': [ 3, 4, 5]}
selector_grid = GridSearchCV(selector_pipe, param_grid=param_dict)
selector_grid.fit(X_train, y_train)
train_preds = selector_grid.predict(X_train)
test_preds = selector_grid.predict(X_test)
selector_train_mse = mean_squared_error(y_train, train_preds)
selector_test_mse = mean_squared_error(y_test, test_preds)


# ANSWER CHECK
print(f'Train MSE: {selector_train_mse}')
print(f'Test MSE: {selector_test_mse}')

In [None]:
R_selector = selector_grid.score(X_test, y_test)
R_selector

In [None]:
px.scatter(x = y_train, y = train_preds, title ='Train Price Predictions vs Real', labels=dict(x = 'Real Price', y = 'Predicted Price'))


In [None]:
best_estimator = selector_grid.best_estimator_
best_selector = best_estimator.named_steps['selector']
best_model = selector_grid.best_estimator_.named_steps['model']
feature_names = X_train.columns[best_selector.get_support()]
coefs = best_model.coef_

print(best_estimator)
print(f'Features from best selector: {feature_names}.')
print('Coefficient values: ')
print('===================')
pd.DataFrame([coefs.T], columns = feature_names, index = ['model'])

In [None]:
selector_df = pd.DataFrame({'feature': feature_names, 'coef': coefs})
selector_sorted = selector_df.reindex(selector_df['coef'].abs().sort_values(ascending=False).index)
selector_sorted 

In [None]:
# Building a Ridge Model

ridge_param_dict = {'ridge__alpha': np.logspace(0, 10, 50)}
ridge_pipe = Pipeline([('scaler', StandardScaler()), 
                      ('ridge', Ridge())])
ridge_grid = GridSearchCV(ridge_pipe, param_grid=ridge_param_dict)
print(ridge_pipe.named_steps)
ridge_pipe

In [None]:
ridge_grid.fit(X_train, y_train)
ridge_train_preds = ridge_grid.predict(X_train)
ridge_test_preds = ridge_grid.predict(X_test)
ridge_train_mse = mean_squared_error(y_train, ridge_train_preds)
ridge_test_mse = mean_squared_error(y_test, ridge_test_preds)
print(f'Train MSE: {ridge_train_mse}')
print(f'Test MSE: {ridge_test_mse}')

In [None]:
R_ridge = ridge_grid.score(X_test, y_test)
R_ridge

In [None]:
best_ridge_estimator = ridge_grid.best_estimator_
best_ridge_selector = best_ridge_estimator.named_steps['ridge']
best_ridge_selector

In [None]:
best_ridge_model = ridge_grid.best_estimator_.named_steps['ridge']
ridge_coefs = best_ridge_model.coef_

ridge_df = pd.DataFrame({'feature': X_train.columns, 'coef': ridge_coefs})
ridge_df = ridge_df.loc[ridge_df['coef'] != 0]
ridge_sorted = ridge_df.reindex(ridge_df['coef'].abs().sort_values(ascending=False).index)
ridge_sorted 

In [None]:
r = permutation_importance(ridge_grid, X_test, y_test, n_repeats=30, random_state=0)

# Walk the features from most to least important and report those whose
# mean importance and standard deviation are both positive
for i in r.importances_mean.argsort()[::-1]:
    if r.importances_mean[i] * r.importances_std[i] > 0:
        print(f"{X_train.columns[i]:<10}"
              f"{r.importances_mean[i]:.3f}"
              f" +/- {r.importances_std[i]:.3f}")

In [None]:
#Lasso Model
auto_pipe = Pipeline([('scaler', StandardScaler()),
                     ('lasso', Lasso(random_state = 42))])
auto_pipe.fit(X_train, y_train)
lasso_coefs = auto_pipe.named_steps['lasso'].coef_

print(type(lasso_coefs))
auto_pipe

In [None]:
lasso_train_mse = mean_squared_error(y_train, auto_pipe.predict(X_train))
lasso_test_mse = mean_squared_error(y_test, auto_pipe.predict(X_test))

print(lasso_train_mse)
print(lasso_test_mse)

In [None]:
R_lasso = auto_pipe.score(X_test, y_test)
R_lasso

In [None]:
lasso_df = pd.DataFrame({'feature': X_train.columns, 'coef': lasso_coefs})
lasso_df =lasso_df.loc[lasso_df['coef'] != 0]
lasso_sorted = lasso_df.reindex(lasso_df['coef'].abs().sort_values(ascending=False).index)
lasso_sorted 

In [None]:
# Lasso as a feature selector (SelectFromModel is imported at the top of the notebook)

lasso_selector_pipe = Pipeline([('scaler', StandardScaler()),
                                ('selector', SelectFromModel(Lasso())),
                                    ('linreg', LinearRegression())])
lasso_selector_pipe.fit(X_train, y_train)
lasso_selector_train_mse = mean_squared_error(y_train, lasso_selector_pipe.predict(X_train))
lasso_selector_test_mse = mean_squared_error(y_test, lasso_selector_pipe.predict(X_test))

print(lasso_selector_train_mse)
print(lasso_selector_test_mse)

In [None]:
R_lasso_selector = lasso_selector_pipe.score(X_test, y_test)
R_lasso_selector

In [None]:
# Find the Lasso coefficient for each selected feature
feature_names = X_train.columns

lasso_selected_feature = lasso_selector_pipe.named_steps['selector']
# Get the support mask of selected features
lasso_selected_feature_mask = lasso_selected_feature.get_support()
lasso_selected_feature_names = feature_names[lasso_selected_feature_mask]
lasso_selector_model = lasso_selected_feature.estimator_
# The fitted Lasso holds one coefficient per input feature, so keep only
# the selected ones to match the length of the feature-name list
lasso_selector_coefs = lasso_selector_model.coef_[lasso_selected_feature_mask]


lasso_selector_df = pd.DataFrame({'feature': lasso_selected_feature_names, 'coef': lasso_selector_coefs})
lasso_selector_sorted = lasso_selector_df.reindex(lasso_selector_df['coef'].abs().sort_values(ascending=False).index)
lasso_selector_sorted 

In [None]:
# Define RFE with Lasso

lasso = Lasso() 
rfe = RFE(estimator=lasso, n_features_to_select=5)

# Define a pipeline with scaling and RFE
rfe_lasso_pipe = Pipeline([
                    ('scaler', StandardScaler()),  # Scale features
                    ('feature_selection', rfe),    # Perform RFE
                    ('model', lasso)               # Final model
                    ])
rfe_lasso_pipe.fit(X_train, y_train)

# Get the support mask of selected features
rfe_lasso_selected_features = rfe_lasso_pipe.named_steps['feature_selection'].support_

# Indices and names of the selected features
feature_names = X.columns
rfe_lasso_selected_feature_names = feature_names[rfe_lasso_selected_features]


rfe_lasso_train_mse = mean_squared_error(y_train, rfe_lasso_pipe.predict(X_train))
rfe_lasso_test_mse = mean_squared_error(y_test, rfe_lasso_pipe.predict(X_test))


print(f'RFE Lasso Selector Train MSE: {rfe_lasso_train_mse}')
print(f'RFE Lasso Selector Test MSE: {rfe_lasso_test_mse}')

In [None]:
rfe_lasso_selected_coefs = rfe_lasso_pipe.named_steps['model'].coef_


rfe_lasso_selected_df = pd.DataFrame({'feature': rfe_lasso_selected_feature_names, 'coef': rfe_lasso_selected_coefs})
rfe_lasso_selected_sorted = rfe_lasso_selected_df.reindex(rfe_lasso_selected_df['coef'].abs().sort_values(ascending=False).index)
rfe_lasso_selected_sorted 

In [None]:
R_rfe = rfe_lasso_pipe.score(X_test, y_test)
R_rfe 

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [None]:
selector_features_name = selector_sorted.reset_index(drop=True)['feature']
selector_features_list = ', '.join(map(str, selector_features_name))

ridge_features_name = ridge_sorted.reset_index(drop=True)['feature'][:5]
ridge_features_list = ', '.join(map(str, ridge_features_name))

lasso_features_name = lasso_sorted.reset_index(drop=True)['feature'][:5]
lasso_features_list = ', '.join(map(str, lasso_features_name))

lasso_selector_features_name = lasso_selector_sorted.reset_index(drop=True)['feature'][:5]
lasso_selector_features_list = ', '.join(map(str, lasso_selector_features_name))


rfe_lasso_features_name = rfe_lasso_selected_sorted.reset_index(drop=True)['feature']
rfe_lasso_features_list = ', '.join(map(str, rfe_lasso_features_name))

In [None]:


selection_methods = ['Sequential Selector', 'Ridge', 'Lasso', 'SelectFromModel with Lasso', 'Recursive Feature Elimination with Lasso']
comparison_columns = ['Train MSE', 'Test MSE', 'R^2 Score', 'Selected Features']
comparison_table = pd.DataFrame( index = selection_methods, columns = comparison_columns)
comparison_table['Train MSE'] = [f"{selector_train_mse:,.0f}", f"{ridge_train_mse:,.0f}", f"{lasso_train_mse:,.0f}", f"{lasso_selector_train_mse:,.0f}", f"{rfe_lasso_train_mse:,.0f}"]
comparison_table['Test MSE'] = [f"{selector_test_mse:,.0f}", f"{ridge_test_mse:,.0f}", f"{lasso_test_mse:,.0f}",f"{lasso_selector_test_mse:,.0f}", f"{rfe_lasso_test_mse:,.0f}"]
comparison_table['R^2 Score'] = ["{:.1%}".format(R_selector),"{:.1%}".format(R_ridge),"{:.1%}".format(R_lasso),"{:.1%}".format(R_lasso_selector),"{:.1%}".format(R_rfe)]
comparison_table['Selected Features'] = [selector_features_list, ridge_features_list,  lasso_features_list , lasso_selector_features_list, rfe_lasso_features_list ]
comparison_table

In [None]:
px.box(ss_vehicles, x = "year", y ="price")

In [None]:
px.box(ss_vehicles, x = "transmission", y ="price")

In [None]:
px.box(ss_vehicles, x = "fuel", y ="price")

In [None]:
px.box(ss_vehicles, x = "drive", y ="price")

In [None]:
px.box(ss_vehicles, x = "manufacturer", y ="price")

## Comments

Given that all five models predict price poorly, I believe there are too many unknowns to build an effective prediction model. I will cut the sample size and remove outliers in price, year, and odometer: I would limit the price to under $58,000, drop cars older than model year 2005, limit the odometer to 150,000 miles, and then re-run my analysis.
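
The trimming described above can be sketched as a single boolean mask. The thresholds come from the text and the earlier filtering cell, including its $500 price floor. The `demo` frame is made-up data, and the sketch assumes the `odometer` column has been kept rather than dropped.

```python
import pandas as pd

# Made-up rows for illustration (assumption, not real listings)
demo = pd.DataFrame({
    'price': [300, 8000, 25000, 75000, 12000],
    'year': [2010, 2003, 2015, 2019, 2012],
    'odometer': [90000, 120000, 40000, 10000, 210000],
})

# Keep rows that satisfy all three limits at once
mask = (
    demo['price'].between(500, 58000)
    & (demo['year'] >= 2005)
    & (demo['odometer'] <= 150000)
)
trimmed = demo[mask]
print(trimmed)
```

Combining the conditions in one mask keeps the cutoffs visible in a single place, which makes them easier to adjust when re-running the analysis.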

In [None]:
sns.set(rc={"figure.figsize":(16, 6)}) #width=16, #height=6
fig = sns.countplot(ss_vehicles, x = 'manufacturer' , order=ss_vehicles['manufacturer'].value_counts().index).set(title='Most Popular Manufacturer')
plt.xticks(rotation = 60)

fig

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.