# What drives the price of a car?

## Business Understanding 
A car dealership is a volume-based business, so pricing a car to match demand is key. When there is a large supply of used cars across dealerships, it makes little sense for dealers to mark up prices significantly. Recently, due to the war in Ukraine and the chip shortage, there was a dearth of new cars on the market, and used cars of all kinds were in demand. Timing is therefore a factor in the used car market. The specific question in this assignment, however, is which characteristics of the car itself tend to influence the price most. For example, a luxury car tends to be more expensive than a non-luxury car with the same mechanical specs. Let's dive in and learn more about the used car domain.

## Data Understanding
Before looking at the dataset, here is the approach I would like to pursue:
1. Review the dataset to see which attributes are included
2. Review how the columns correlate with the price using the techniques we have learned
3. Resolve quality issues: drop rows with missing values (`dropna`) and drop unusable columns
4. Remove columns that don't seem to have much effect on the price
5. If there is historical pricing data for the same car before it is sold, we could use autocorrelation to see how earlier asking prices relate to the final sale price
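
Step 2 of this plan can be sketched roughly as follows. This is a minimal illustration on a toy frame with invented values, not the real `vehicles.csv` data. It assumes `year` and `odometer` are the numeric columns of interest; categorical columns would need encoding before a correlation is meaningful.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "price":    [30000, 24000, 18000, 12000, 6000],
    "year":     [2021, 2019, 2016, 2012, 2006],
    "odometer": [10000, 40000, 80000, 120000, 190000],
})

# Pearson correlation of each numeric column with price,
# ordered by absolute strength so the strongest relationship comes first.
price_corr = (
    toy.select_dtypes("number")
       .corr()["price"]
       .drop("price")
       .sort_values(key=abs, ascending=False)
)
print(price_corr)
```

In this toy frame newer cars cost more and high-mileage cars cost less, so `year` correlates positively and `odometer` negatively with `price`.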

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [3]:
# read data
data = pd.read_csv('data/vehicles.csv')
data.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc


In [4]:
# check missing values
data.isnull().sum()

id                   0
region               0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
VIN             161042
drive           130567
size            306361
type             92858
paint_color     130203
state                0
dtype: int64
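
Raw counts are hard to compare, so one option is to look at the share of missing values per column and drop columns above a cutoff. The sketch below uses a hypothetical helper, `drop_sparse_columns`, and an assumed cutoff of 50%; neither comes from the assignment, and the toy frame is invented.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.5):
    """Return a copy of df without columns whose missing share exceeds max_missing.

    Hypothetical helper; the 0.5 default is an assumption, not a rule.
    """
    missing_share = df.isnull().mean()  # fraction of NaN per column
    keep = missing_share[missing_share <= max_missing].index
    return df[keep].copy()

# Toy frame with invented values: 'size' is 75% missing, 'fuel' is 25% missing.
toy = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "size": [np.nan, np.nan, np.nan, "compact"],
    "fuel": ["gas", np.nan, "gas", "diesel"],
})

print(toy.isnull().mean())
trimmed = drop_sparse_columns(toy)
print(list(trimmed.columns))
```

On the real data, `data.isnull().mean()` gives the same per-column shares, and the cutoff is a judgment call that trades lost columns against lost rows.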

In [5]:
# show all columns
pd.set_option('display.max_columns', None)
data.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc


In [6]:
# distinct values in each column
data.nunique()


id              426880
region             404
price            15655
year               114
manufacturer        42
model            29649
condition            6
cylinders            8
fuel                 5
odometer        104870
title_status         6
transmission         3
VIN             118246
drive                3
size                 4
type                13
paint_color         12
state               51
dtype: int64
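
The `model` column has 29,649 distinct values, which would be unwieldy to one-hot encode. One possible way to handle it, which this notebook does not do, is to collapse infrequent categories into a single bucket. The helper name `collapse_rare` and the `min_count` cutoff below are assumptions for illustration only.

```python
import pandas as pd

def collapse_rare(series, min_count=2, other="other"):
    """Replace categories seen fewer than min_count times with `other`.

    Hypothetical helper; missing values are left untouched.
    """
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    keep_mask = series.isin(frequent) | series.isnull()
    return series.where(keep_mask, other)

# Invented example values, not taken from the dataset.
models = pd.Series(["f-150", "f-150", "civic", "civic", "civic", "some rare trim"])
collapsed = collapse_rare(models, min_count=2)
print(collapsed.tolist())
```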

In [8]:
# VIN identifies an individual vehicle, so it is not useful as a general price driver
# for this analysis we assume location is not important, so drop region and state as well
data = data.drop(columns=['VIN', 'region', 'state'])

In [9]:
data.head()

Unnamed: 0,id,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color
0,7222695916,6000,,,,,,,,,,,,,
1,7218891961,11900,,,,,,,,,,,,,
2,7221797935,21000,,,,,,,,,,,,,
3,7222270760,1500,,,,,,,,,,,,,
4,7210384030,4900,,,,,,,,,,,,,


## Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
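
A minimal sketch of what a cross-validated regression with a parameter search could look like, using scikit-learn. The data here is synthetic and the price formula is invented. Ridge regression, the `alpha` grid, and the choice of `year`, `odometer`, and `fuel` as features are all assumptions for illustration, not the notebook's final model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned dataset (invented relationship).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "year": rng.integers(2000, 2022, size=n),
    "odometer": rng.uniform(5_000, 200_000, size=n),
    "fuel": rng.choice(["gas", "diesel", "electric"], size=n),
})
toy["price"] = (
    15000
    + 800 * (toy["year"] - 2000)
    - 0.05 * toy["odometer"]
    + 3000 * (toy["fuel"] == "electric")
    + rng.normal(0, 1000, size=n)
)

numeric = ["year", "odometer"]
categorical = ["fuel"]

# Impute and scale numeric columns; impute and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([("prep", preprocess), ("ridge", Ridge())])

# 5-fold cross-validated search over the regularization strength.
search = GridSearchCV(
    model,
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(toy[numeric + categorical], toy["price"])
best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse))
```

Keeping the preprocessing inside the pipeline means imputation and scaling are refit on each training fold, so the validation folds do not leak into the fitted transformers.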

## Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
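
One way to connect a fitted model back to the question of price drivers is permutation importance on held-out data: shuffle one feature at a time and measure how much the score drops. The sketch below is an illustration under stated assumptions. The data is synthetic, `listing_hour` is a hypothetical and deliberately irrelevant column, and plain linear regression stands in for whichever model is finally chosen.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with an invented price relationship.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "year": rng.integers(2000, 2022, size=n),
    "odometer": rng.uniform(5_000, 200_000, size=n),
    "listing_hour": rng.integers(0, 24, size=n),  # hypothetical, unrelated to price
})
y = 800 * (X["year"] - 2000) - 0.05 * X["odometer"] + rng.normal(0, 1000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Mean drop in R^2 on the test set when each feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
importance = pd.Series(result.importances_mean, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Features whose shuffling barely changes the score, like the irrelevant column here, are candidates for removal. Note that strongly correlated features can share importance between them, so low values should be read with care.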

## Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.