# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

> As a used car dealership, we need to understand the appropriate market price for the cars we sell. This dataset contains a large number of used cars, with various details about each one and the price it sold for. Using it, we can fit a regression model with price as the target and the remaining details as features, and estimate how much each feature contributes to a car's market value. Ideally, the model can then predict, with reasonable accuracy, the expected price of cars that are not in the dataset and have not yet been sold. Prices informed by such a model can track the market closely, which in turn supports business profits.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [1]:
import pandas as pd
import numpy as np

In [2]:
vehicles = pd.read_csv('data/vehicles.csv')

In [3]:
vehicles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [4]:
vehicles.condition.unique()

array([nan, 'good', 'excellent', 'fair', 'like new', 'new', 'salvage'],
      dtype=object)
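
A few further checks would round out this first look: the share of missing values per column, the number of distinct values, fully duplicated rows, and prices that are zero or negative. The sketch below is one possible way to collect them. It runs on a tiny made-up frame (`toy`) rather than the real data, and the helper name `quality_report` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the vehicles data (values are made up).
toy = pd.DataFrame({
    "price": [0, 15000, 15000, 7200],
    "year": [2012.0, 2018.0, 2018.0, np.nan],
    "condition": ["good", "excellent", "excellent", None],
})

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, share of missing values, and cardinality per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
        "n_unique": df.nunique(),
    })

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())             # fully repeated rows
n_nonpositive_price = int((toy["price"] <= 0).sum())   # suspicious prices
print(report)
print(f"duplicated rows: {n_duplicates}, non-positive prices: {n_nonpositive_price}")
```

On the real data the same calls would be applied to `vehicles` directly.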

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

In [6]:
vehicles.nunique()

id              426880
region             404
price            15655
year               114
manufacturer        42
model            29649
condition            6
cylinders            8
fuel                 5
odometer        104870
title_status         6
transmission         3
VIN             118246
drive                3
size                 4
type                13
paint_color         12
state               51
dtype: int64

In [7]:
vehicles.type.value_counts()

type
sedan          87056
SUV            77284
pickup         43510
truck          35279
other          22110
coupe          19204
hatchback      16598
wagon          10751
van             8548
convertible     7731
mini-van        4825
offroad          609
bus              517
Name: count, dtype: int64

First, we can drop the 'id' and 'VIN' columns, as they are simply identifiers that tell us nothing about the car itself. Next, we notice that 'model' has roughly 30k distinct values (29,649), which is far too many to incorporate in a linear regression model, so we will drop that column as well. The 'type' column, with only 13 distinct values, is a reasonable and much more workable substitute.

In [8]:
vehicles1 = vehicles.drop(['id', 'VIN', 'model'], axis='columns')

Next, we notice that many of the columns are missing data. We can use the following code to print out how much data is missing from each column:

In [9]:
len(vehicles1) - vehicles1.count()

region               0
price                0
year              1205
manufacturer     17646
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
drive           130567
size            306361
type             92858
paint_color     130203
state                0
dtype: int64
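
An equivalent and arguably more direct spelling uses `isna()`. The sketch below, run on a made-up frame (`toy`, not the real data), produces the same counts as the `len(df) - df.count()` idiom and also gives the share of missing values per column, which is easier to compare across columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with missing values (values are made up).
toy = pd.DataFrame({
    "year": [2012.0, np.nan, 2018.0, 2015.0],
    "size": [None, None, "compact", None],
    "state": ["ca", "ny", "tx", "wa"],
})

# Count of missing values per column.
missing_counts = toy.isna().sum()

# Fraction of rows missing per column, worst column first.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing_counts)
print(missing_share)
```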

For the columns with missing data, two options come to mind. First, we could drop the rows that have missing values in those columns. Second, we could assign the missing values to an 'unknown' category in the categorical columns. It is not obvious which is the better approach: the first leaves a cleaner dataset but discards a lot of data (the 'size' column alone is missing in 306,361 of 426,880 rows), while the second keeps every row at the cost of a less informative category.
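
To make the trade-off concrete, one could compare how many rows survive under each strategy. This is a minimal sketch on a made-up frame (`toy`, not the real data); the column list `categorical` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame (values are made up).
toy = pd.DataFrame({
    "price": [5000, 12000, 8000, 20000, 3000],
    "odometer": [90000.0, np.nan, 60000.0, 15000.0, 150000.0],
    "condition": ["good", None, None, "like new", None],
    "drive": ["fwd", "4wd", None, None, "rwd"],
})
categorical = ["condition", "drive"]

# Strategy 1: drop every row that has any missing value.
dropped = toy.dropna()

# Strategy 2: keep the rows and label missing categories explicitly.
filled = toy.copy()
filled[categorical] = filled[categorical].fillna("unknown")

print(f"rows kept after dropna: {len(dropped)} of {len(toy)}")
print(f"rows kept after fillna: {len(filled)} of {len(toy)}")
```

Note that the second strategy leaves missing numerical values untouched, so those still need separate handling.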

I am going to go ahead with the second approach and put the missing values into an 'unknown' category for the categorical columns. The only exceptions are the numerical columns 'year' and 'odometer', for which I will drop the rows that are missing a value. Fortunately, these two columns are missing only a few thousand entries (1,205 and 4,400 respectively) in a dataset of over 400k rows.

In [10]:
vehicles2 = vehicles1.dropna(subset=['year', 'odometer'])

In [11]:
vehicles2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 421344 entries, 27 to 426879
Data columns (total 15 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   region        421344 non-null  object 
 1   price         421344 non-null  int64  
 2   year          421344 non-null  float64
 3   manufacturer  405077 non-null  object 
 4   condition     250851 non-null  object 
 5   cylinders     246585 non-null  object 
 6   fuel          419172 non-null  object 
 7   odometer      421344 non-null  float64
 8   title_status  413986 non-null  object 
 9   transmission  419649 non-null  object 
 10  drive         292495 non-null  object 
 11  size          119732 non-null  object 
 12  type          329562 non-null  object 
 13  paint_color   293254 non-null  object 
 14  state         421344 non-null  object 
dtypes: float64(2), int64(1), object(12)
memory usage: 51.4+ MB


In [12]:
len(vehicles2) - vehicles2.count()

region               0
price                0
year                 0
manufacturer     16267
condition       170493
cylinders       174759
fuel              2172
odometer             0
title_status      7358
transmission      1695
drive           128849
size            301612
type             91782
paint_color     128090
state                0
dtype: int64

In [13]:
vehicles3 = vehicles2.fillna('unknown')

In [14]:
vehicles3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 421344 entries, 27 to 426879
Data columns (total 15 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   region        421344 non-null  object 
 1   price         421344 non-null  int64  
 2   year          421344 non-null  float64
 3   manufacturer  421344 non-null  object 
 4   condition     421344 non-null  object 
 5   cylinders     421344 non-null  object 
 6   fuel          421344 non-null  object 
 7   odometer      421344 non-null  float64
 8   title_status  421344 non-null  object 
 9   transmission  421344 non-null  object 
 10  drive         421344 non-null  object 
 11  size          421344 non-null  object 
 12  type          421344 non-null  object 
 13  paint_color   421344 non-null  object 
 14  state         421344 non-null  object 
dtypes: float64(2), int64(1), object(12)
memory usage: 51.4+ MB


In [15]:
len(vehicles3) - vehicles3.count()

region          0
price           0
year            0
manufacturer    0
condition       0
cylinders       0
fuel            0
odometer        0
title_status    0
transmission    0
drive           0
size            0
type            0
paint_color     0
state           0
dtype: int64

In [20]:
for column in ['region', 'manufacturer', 'condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color', 'state']:
    print(vehicles3[column].value_counts())
    print('\n---\n')

region
columbus                   3608
jacksonville               3522
spokane / coeur d'alene    2985
new hampshire              2979
sarasota-bradenton         2975
                           ... 
meridian                     28
southwest MS                 14
kansas city                  11
fort smith, AR                9
west virginia (old)           8
Name: count, Length: 404, dtype: int64

---

manufacturer
ford               70313
chevrolet          54414
toyota             33790
honda              21026
nissan             18818
jeep               18814
ram                18227
gmc                16609
unknown            16267
bmw                14609
dodge              13546
mercedes-benz      11595
hyundai            10223
subaru              9422
volkswagen          9252
kia                 8352
lexus               8132
audi                7528
cadillac            6871
chrysler            5956
acura               5938
buick               5452
mazda               5373
infiniti

Additionally, I am going to drop the 'state' and 'region' columns. I assume they are largely unrelated to the value of a car, and with 51 and 404 distinct values they would add a lot of complexity and noise to our model. The same might be true of paint color, but some colors may be preferred over others strongly enough to be worth including in the model, so I will keep that column for now.

In [21]:
vehicles4 = vehicles3.drop(['state', 'region'], axis='columns')

In [22]:
vehicles4.info()

<class 'pandas.core.frame.DataFrame'>
Index: 421344 entries, 27 to 426879
Data columns (total 13 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   price         421344 non-null  int64  
 1   year          421344 non-null  float64
 2   manufacturer  421344 non-null  object 
 3   condition     421344 non-null  object 
 4   cylinders     421344 non-null  object 
 5   fuel          421344 non-null  object 
 6   odometer      421344 non-null  float64
 7   title_status  421344 non-null  object 
 8   transmission  421344 non-null  object 
 9   drive         421344 non-null  object 
 10  size          421344 non-null  object 
 11  type          421344 non-null  object 
 12  paint_color   421344 non-null  object 
dtypes: float64(2), int64(1), object(10)
memory usage: 45.0+ MB


Finally, it is time to encode our categorical columns using one-hot encoding. We should also consider whether to create polynomial or logarithmic features from our numerical columns (price, year, and odometer), and whether to apply scaling and normalization.

For price, my intuition is to use a logarithmic transformation.

For year, my intuition is to convert it into the age of the car, i.e. the current year minus the value in that column.

For odometer, I am not sure whether to keep as is (linear) or use a logarithmic transformation. Perhaps both?
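
These three ideas could be sketched as follows. The frame `toy`, the new column names `age`, `log_price`, and `log_odometer`, and the fixed `reference_year` are all assumptions for illustration. `np.log1p` is used in place of a plain logarithm so that zero values map to 0 rather than `-inf`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame (values are made up).
toy = pd.DataFrame({
    "price": [4500, 12000, 0, 31000],
    "year": [2008.0, 2015.0, 2019.0, 2021.0],
    "odometer": [160000.0, 85000.0, 0.0, 12000.0],
})

reference_year = 2024  # assumed; ideally the year the data was collected

features = toy.assign(
    age=reference_year - toy["year"],         # years since manufacture
    log_price=np.log1p(toy["price"]),         # log(1 + price)
    log_odometer=np.log1p(toy["odometer"]),   # log(1 + odometer)
)
print(features)
```

Keeping both 'odometer' and `log_odometer` is one way to act on the "perhaps both?" idea: a regularized model can then weight whichever form fits better.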

In [27]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

In [29]:
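from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

categorical_columns = ['manufacturer', 'condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color']

# Impute, then encode, in one pipeline so each categorical column is transformed only once.
categorical_pipeline = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='unknown'),
    OneHotEncoder(handle_unknown='ignore'),
)
column_transformer = make_column_transformer(
    (categorical_pipeline, categorical_columns),
    remainder='passthrough'
)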

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
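
One possible starting point is a ridge regression on a log-transformed price, with the regularization strength chosen by cross-validation. The sketch below is not a result from the vehicles data: the synthetic frame `toy`, its made-up price rule, the feature names, and the `alpha` grid are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor, make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data only, generated so the example runs on its own.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(1, 20, size=n).astype(float),
    "odometer": rng.uniform(5_000, 200_000, size=n),
    "fuel": rng.choice(["gas", "diesel", "unknown"], size=n),
})
# Made-up price rule: older, higher-mileage cars are cheaper.
toy["price"] = 30_000 * np.exp(
    -0.08 * toy["age"] - 4e-6 * toy["odometer"] + rng.normal(0, 0.1, size=n)
)
X, y = toy.drop(columns="price"), toy["price"]

preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
    (StandardScaler(), ["age", "odometer"]),
)
# Fit on log(1 + price) and map predictions back to the original scale.
model = TransformedTargetRegressor(
    regressor=make_pipeline(preprocess, Ridge()),
    func=np.log1p,
    inverse_func=np.expm1,
)

# 5-fold cross-validation over a small grid of regularization strengths.
search = GridSearchCV(
    model,
    param_grid={"regressor__ridge__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Scoring by mean absolute error keeps the cross-validated score in the same units as price, which is easier to explain to a dealership than an error on the log scale.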

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
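
One way to connect a fitted model back to the question of price drivers is to check its error on held-out data and then compare standardized coefficients. The sketch below assumes a ridge regression on log price and uses synthetic data with a made-up price rule, including a `noise` feature that is deliberately unrelated to price. None of the numbers it prints say anything about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data only; the price rule below is made up.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "age": rng.integers(1, 20, size=n).astype(float),
    "odometer": rng.uniform(5_000, 200_000, size=n),
    "noise": rng.normal(size=n),  # deliberately unrelated to price
})
log_price = 10.3 - 0.08 * X["age"] - 4e-6 * X["odometer"] + rng.normal(0, 0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, log_price, random_state=0)

# Scale using training data only, then fit on the scaled features.
scaler = StandardScaler().fit(X_train)
ridge = Ridge(alpha=1.0).fit(scaler.transform(X_train), y_train)

# Error in price units on data the model has not seen.
pred_price = np.exp(ridge.predict(scaler.transform(X_test)))
test_mae = mean_absolute_error(np.exp(y_test), pred_price)

# Standardized coefficients: larger magnitude means a stronger effect on log price.
coefs = pd.Series(ridge.coef_, index=X.columns).sort_values(key=np.abs, ascending=False)
print(f"test MAE: {test_mae:,.0f}")
print(coefs)
```

Because the features are standardized, the coefficients are comparable with one another, and a feature with a coefficient near zero is a candidate for dropping in a later iteration.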

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.