# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

The objective is to recommend to the client (a car dealership) which factors make a car more or less expensive. This in turn will help the client buy and stock the right inventory to maximize sales.

For this project, we will use a dataset from Kaggle. The original dataset contained information on 3 million used cars. The dataset used for this analysis contains information on 426K cars and is provided as a .csv file. Price is one of the columns. The other columns include region, year, manufacturer, model, condition, cylinders, fuel, odometer, title_status, transmission, VIN, drive, size, type, paint_color and state. In data terms, this is a supervised regression task with price as the target: we will analyze these columns to determine which ones influence the price the most. The result will be a recommendation to stock used cars based on the top 5 features that influence pricing.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

We import the relevant libraries and load the input data file, which is provided in .csv format. The dataset has columns for price, region, manufacturer, model, condition, cylinders, fuel, title_status, transmission, drive, size, type, paint_color, state, VIN, year, odometer and id. We will analyze all the available columns to evaluate which ones affect the price of a car the most, and then recommend the top 5 factors that influence used car pricing.
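
A first pass at spotting quality issues could look like the sketch below. It is only an illustration: `toy` is a small made-up frame standing in for `vehicles.csv`, and the variable names are assumptions rather than part of the original analysis. The same calls can be run on `df` once it is loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for vehicles.csv (values are made up).
toy = pd.DataFrame({
    "price": [4000, 0, 9000, 8950, np.nan],
    "year": [2002, 1995, np.nan, 2011, 1972],
    "odometer": [155000, 110661, 56700, np.nan, 88100],
    "condition": ["excellent", "fair", None, "excellent", "fair"],
})

# Column dtypes show which features are numeric and which are categorical.
dtypes = toy.dtypes

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Count of exact zeros in the numeric columns (for example, a price of 0).
zero_counts = (toy.select_dtypes("number") == 0).sum()

# Number of fully duplicated rows.
n_duplicates = int(toy.duplicated().sum())

print(dtypes, missing_share, zero_counts, n_duplicates, sep="\n\n")
```

Looking at missing shares per column before cleaning helps decide whether dropping rows or dropping a sparse column loses less data.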

In [10]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from scipy.linalg import svd
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Read the input file
df = pd.read_csv('vehicles.csv')

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.

First, we clean the input data by dropping rows that contain null values, rows that contain a zero in any column, and duplicate rows. Second, we separate the categorical columns from the numerical columns. By plotting a heat map, or price against individual columns, we can see which inputs have little to no effect on price. We identify and drop those columns (id, region, manufacturer, model, state, VIN). Finally, we convert the remaining categorical variables into dummy/indicator variables.
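
To make the effect of each cleaning step concrete, here is a minimal sketch on a made-up frame. The column names mirror the dataset, but `toy`, `clean` and `encoded` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "price": [4000, 0, 9000, 9000, 2500],
    "odometer": [155000, 110661, 56700, 56700, np.nan],
    "fuel": ["gas", "gas", "diesel", "diesel", "gas"],
})

clean = toy.dropna()                         # drops the row with a missing odometer
clean = clean.loc[(clean != 0).all(axis=1)]  # drops the row with price == 0
clean = clean.drop_duplicates()              # drops the repeated diesel row

# One indicator column per category level.
encoded = pd.get_dummies(clean, columns=["fuel"]).astype(int)
print(encoded)
```

Of the five toy rows, two survive. Note that `dropna()` without arguments removes a row if any single column is missing, so on real data it is worth checking how many rows remain after this step.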

In [11]:
# Drop rows with nulls, zeros, and duplicates
df.dropna(inplace=True)
df = df.loc[(df != 0).all(axis=1)]
df.drop_duplicates(inplace=True)

# Separate categorical and numeric columns
cat_features = ['region', 'manufacturer', 'model', 'condition', 'cylinders',
                'fuel', 'title_status', 'transmission', 'drive', 'size', 'type',
                'paint_color', 'state', 'VIN']
num_features = ['year', 'odometer', 'id', 'price']

for feature in num_features:
    df[feature] = df[feature].astype('float')

#Drop columns that do not have high correlation with Price
df.drop(columns=['id','region', 'manufacturer', 'model', 'state', 'VIN'],
        inplace=True)

#Convert categorical variables
df_encoded = pd.get_dummies(df, columns=['condition', 'cylinders', 'fuel',
                                         'title_status', 'transmission', 'drive',
                                         'size', 'type', 'paint_color'])
df_encoded = df_encoded.fillna(0)
df_encoded = df_encoded.astype(int)
df_encoded.head()

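| | price | year | odometer | condition_excellent | condition_fair | condition_good | condition_like new | condition_new | condition_salvage | cylinders_10 cylinders | ... | paint_color_brown | paint_color_custom | paint_color_green | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 215 | 4000 | 2002 | 155000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 219 | 2500 | 1995 | 110661 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 268 | 9000 | 2008 | 56700 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 337 | 8950 | 2011 | 164000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 338 | 4000 | 1972 | 88100 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |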


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

First, we separate the data into training and test sets. `x` holds the input features and `y` holds the target (the price of the used car). We allocate 70% of the available data to training and the remaining 30% to testing. Setting `random_state=0` makes the split identical every time the code runs.
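
The effect of a fixed `random_state` can be checked with a small sketch. The arrays `x_demo` and `y_demo` are hypothetical stand-ins for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed.
split_a = train_test_split(x_demo, y_demo, train_size=0.7, random_state=0)
split_b = train_test_split(x_demo, y_demo, train_size=0.7, random_state=0)

x_train_a, x_test_a, y_train_a, y_test_a = split_a
x_train_b, x_test_b, y_train_b, y_test_b = split_b

print(len(x_train_a), len(x_test_a))          # sizes of the 70/30 split
print(np.array_equal(y_train_a, y_train_b))   # same rows selected both times
```

Without a fixed seed, each run would shuffle the rows differently, and metrics such as MSE would vary between runs for reasons unrelated to the model.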

A `Pipeline` chains multiple data processing steps (transformers) and a final estimator (model) together. It simplifies the workflow and ensures that the same transformations are applied to both training and testing data.

The first step, `PolynomialFeatures`, creates polynomial features from the original features in our data. `degree=2` generates all terms up to the second degree (for two features x and y: x, y, x^2, xy, y^2). `include_bias=False` means that a constant (bias) column is not added to the features; `LinearRegression` fits its own intercept.

`LinearRegression` is the final estimator in the pipeline.

In summary, our pipeline does the following:

- **Polynomial Feature Expansion:** It takes the original features in our dataset and creates new features by squaring them and multiplying them pairwise (degree 2 in this case). This allows the model to capture non-linear relationships and interactions between the features and the target variable.
- **Linear Regression:** It fits a linear regression model to the transformed data, which now includes the polynomial features.

In [12]:
# separate data into target & independent variables
x = df_encoded.drop(['price'], axis=1)
x = x.astype(int)
y = df_encoded['price']

#split dataset into training and test
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7,
                                                    random_state=0)

# Create a Pipeline
pipe_model = Pipeline([
    ('PolyFeatures', PolynomialFeatures(degree=2, include_bias=False)),
    ('LinModel', LinearRegression()),
])


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

We have built a supervised regression model. Calling `fit` on the pipeline applies each transformation to the training data (`x_train`) and trains the final estimator on the transformed features and the corresponding target values (`y_train`). Calling `predict` on the pipeline applies the same pre-processing steps to its input and then uses the fitted estimator to make predictions. We calculate the Mean Squared Error (MSE) between the true target values (`y_train`) and the predicted values (`y_train_pred`) for the training data, and repeat the same calculation on the test data.

We use the `permutation_importance` function to calculate the importance of each feature in our ML model. It evaluates the trained model (`pipe_model`) on the test data (`x_test`, `y_test`), shuffles one feature at a time, and records how much the score drops. The importance of a feature is the average decrease in score across multiple permutations. Because no `scoring` argument is passed, the score is the pipeline's default, which is R² for a regression estimator.
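
How permutation importance separates useful from useless features can be seen on synthetic data. The sketch below is an illustration only: it uses a plain `LinearRegression` instead of the full pipeline, and the columns `signal` and `noise` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: the target depends on "signal" only.
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)
model = LinearRegression().fit(X_tr, y_tr)

# Shuffle each column 10 times on the held-out data and average the score drop.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = pd.Series(result.importances_mean, index=X_demo.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

Shuffling `signal` destroys most of the model's predictive power, while shuffling `noise` changes the score very little, so `signal` ranks first. Passing `n_repeats` and `random_state` explicitly makes the ranking reproducible.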

In [13]:
#Fit the Pipeline
pipe_model.fit(x_train,y_train)

#Make predictions
y_train_pred = pipe_model.predict(x_train)
y_test_pred = pipe_model.predict(x_test)

# Calculate MSEs for training data and test data
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
print(f'MSE for training data: {mse_train}')
print(f'MSE for test data: {mse_test}')

# Calculate the permutation importance
results = permutation_importance(pipe_model, x_test, y_test)
importances = pd.DataFrame(data=results.importances_mean, index=x.columns,
                           columns=['Importance']).sort_values(by='Importance',
                                                               ascending=False)

MSE for training data: 43578176.75006912
MSE for test data: 715356212.8152697
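
MSE is expressed in squared price units, which makes the two numbers hard to read. As a rough aid, the sketch below converts the reported values to root mean squared error (RMSE), which is in the same units as price, and compares the two. The variable names are ours; the two MSE values are copied from the output above.

```python
import math

# MSE values copied from the notebook output.
mse_train = 43578176.75006912
mse_test = 715356212.8152697

# RMSE is the square root of MSE and is in the same units as price.
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

# How many times larger the test error is than the training error.
ratio = mse_test / mse_train

print(f"RMSE (train): {rmse_train:,.0f}")
print(f"RMSE (test):  {rmse_test:,.0f}")
print(f"Test MSE / train MSE: {ratio:.1f}")
```

A test error this much larger than the training error is commonly read as a sign of overfitting, which suggests revisiting the modeling phase (for example, a lower polynomial degree or regularization) before relying heavily on the results.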


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

We recommend that the dealership weigh the five highest-ranked features when deciding which used cars to stock. By permutation importance, these are automatic transmission, manual transmission, rear-wheel drive, four-wheel drive and front-wheel drive.

In [14]:

importances.head(5)

| | Importance |
|---|---|
| transmission_automatic | 29499230.0 |
| transmission_manual | 28991760.0 |
| drive_rwd | 26774200.0 |
| drive_4wd | 17719760.0 |
| drive_fwd | 17349040.0 |
