# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

### Answer

The target variable is the car's price. The predictor variables include make, model, year, mileage, vehicle condition and location. The goal of the assignment is to build a predictive model that quantifies the impact of these features on the car's price. That insight can then be turned into actionable recommendations for the used car dealership.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

### Answer

The first step I would take is to load the dataset into a pandas DataFrame and get an overview of the data with the `head` method. I would then use the `describe`, `info` and `isnull` methods to further evaluate what is in the dataset and determine what level of data preparation will be required. Lastly, I would use the `nunique` method to count the unique values in each column and determine their cardinality.
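The checks described above can also be collected into one per-column summary. The following is a minimal sketch, not part of the original analysis: the helper name `summarize_columns` and the toy values are made up for illustration.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, fraction of missing values and unique-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isnull().mean(),   # share of NaN values in each column
        "n_unique": df.nunique(),             # NaN is not counted as a value
    }).sort_values("missing_frac", ascending=False)

# Toy frame standing in for the vehicles data (values are invented)
toy = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "manufacturer": ["ford", None, "toyota", "ford"],
    "size": [None, None, None, "compact"],
})

summary = summarize_columns(toy)
print(summary)
```

Sorting by `missing_frac` puts the columns that need the most preparation at the top, and `n_unique` flags high-cardinality columns early.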

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

## Pandas Data Investigation
The following code shows the investigation performed.

In [4]:
# Import Pandas and read in the csv into a dataframe
import pandas as pd

df = pd.read_csv('data/vehicles.csv')

In [6]:
# Look at the columns from the dataframe

print(df.head())

# Shows a lot of NaN values

           id                  region  price  year manufacturer model  \
0  7222695916                prescott   6000   NaN          NaN   NaN   
1  7218891961            fayetteville  11900   NaN          NaN   NaN   
2  7221797935            florida keys  21000   NaN          NaN   NaN   
3  7222270760  worcester / central MA   1500   NaN          NaN   NaN   
4  7210384030              greensboro   4900   NaN          NaN   NaN   

  condition cylinders fuel  odometer title_status transmission  VIN drive  \
0       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
1       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
2       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
3       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
4       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   

  size type paint_color state  
0  NaN  NaN         NaN    az  
1  NaN  NaN         NaN    ar  
2 

In [31]:
# Inspect the variation 

print(df.describe())

                 id         price           year      odometer
count  4.268800e+05  4.268800e+05  426880.000000  4.268800e+05
mean   7.311487e+09  7.519903e+04    2011.240173  9.791454e+04
std    4.473170e+06  1.218228e+07       9.439234  2.127801e+05
min    7.207408e+09  0.000000e+00    1900.000000  0.000000e+00
25%    7.308143e+09  5.900000e+03    2008.000000  3.813000e+04
50%    7.312621e+09  1.395000e+04    2013.000000  8.554800e+04
75%    7.315254e+09  2.648575e+04    2017.000000  1.330000e+05
max    7.317101e+09  3.736929e+09    2022.000000  1.000000e+07
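
The summary above shows prices ranging from 0 to roughly 3.7 billion, which suggests placeholder or erroneous listings. One possible way to handle this, sketched here on invented values, is to keep only prices inside a plausible range and model the logarithm of price. The bounds `low` and `high` are hypothetical and would need to be chosen after looking at the real distribution.

```python
import numpy as np
import pandas as pd

# Toy prices loosely shaped like the summary (values are invented)
toy = pd.DataFrame({"price": [0, 5900, 13950, 26485, 3_700_000_000]})

# Hypothetical plausibility bounds, not taken from the original analysis
low, high = 500, 150_000
cleaned = toy[toy["price"].between(low, high)].copy()

# log1p compresses the long right tail of the price distribution
cleaned["log_price"] = np.log1p(cleaned["price"])
print(cleaned)
```

If a model is trained on `log_price`, predictions can be mapped back to dollars with `np.expm1`.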


In [30]:
# Identify Missing Values and get the data types of the columns

print(df.info())
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Columns: 29762 entries, id to paint_color_yellow
dtypes: bool(29756), float64(2), int64(2), object(2)
memory usage: 11.8+ GB
None
id                     0
region                 0
price                  0
year                   0
odometer               0
                      ..
paint_color_red        0
paint_color_silver     0
paint_color_unknown    0
paint_color_white      0
paint_color_yellow     0
Length: 29762, dtype: int64


In [12]:
# Establish the unique values for each column

for col in df.select_dtypes(include=['object']).columns:
    print(f'{col}: {df[col].nunique()} unique values')

region: 404 unique values
manufacturer: 42 unique values
model: 29649 unique values
condition: 6 unique values
cylinders: 8 unique values
fuel: 5 unique values
title_status: 6 unique values
transmission: 3 unique values
VIN: 118246 unique values
drive: 3 unique values
size: 4 unique values
type: 13 unique values
paint_color: 12 unique values
state: 51 unique values


In [28]:
# After inspecting values VIN number can't help the business case, dropping that column
df = df.drop('VIN', axis=1)

In [14]:
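# Drop columns where more than 60% of the values are missing
# (thresh is the minimum number of non-null values a column needs to be kept)
missing_threshold = 0.6
df = df.dropna(thresh=int(len(df) * (1 - missing_threshold)), axis=1)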
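
Because `thresh` counts non-null values rather than missing ones, the arithmetic is easy to get backwards. The sketch below, on an invented five-row frame, shows which columns survive a 60% missing-value threshold.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500, 4900],
    "size": [np.nan, np.nan, np.nan, np.nan, "compact"],   # 80% missing
    "drive": ["4wd", np.nan, "fwd", np.nan, "rwd"],        # 40% missing
})

missing_threshold = 0.6
# A column is kept only if it has at least this many non-null values
min_non_null = int(len(toy) * (1 - missing_threshold))
kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))
```

Here `size` is dropped because only one of five values is present, while `drive` stays with three.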

In [16]:
# Fill Empty Values
df['year'] = df['year'].fillna(df['year'].median())
df['manufacturer'] = df['manufacturer'].fillna('unknown')
df['model'] = df['model'].fillna('unknown')
df['condition'] = df['condition'].fillna('unknown')
df['cylinders'] = df['cylinders'].fillna('unknown')
df['fuel'] = df['fuel'].fillna('unknown')
df['odometer'] = df['odometer'].fillna(df['odometer'].median())
df['title_status'] = df['title_status'].fillna('unknown')
df['transmission'] = df['transmission'].fillna('unknown')
df['drive'] = df['drive'].fillna('unknown')
df['type'] = df['type'].fillna('unknown')
df['paint_color'] = df['paint_color'].fillna('unknown')
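
The same imputation rule (median for numeric columns, `'unknown'` for categorical ones) could also be written as a loop over dtypes. This is a sketch on invented data, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "year": [2013.0, np.nan, 2017.0],
    "odometer": [85000.0, 38000.0, np.nan],
    "fuel": ["gas", np.nan, "diesel"],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric columns: fill with the column median
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Categorical columns: make missingness an explicit category
        filled[col] = filled[col].fillna("unknown")

print(filled)
```

Note that a median computed on the full dataset also sees rows that later end up in the test set; computing it on the training split only would avoid that small leak.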


In [18]:
# Re-inspect the dataframe after some clean up
for col in df.select_dtypes(include=['object']).columns:
    print(f'{col}: {df[col].nunique()} unique values')

region: 404 unique values
manufacturer: 43 unique values
model: 29649 unique values
condition: 7 unique values
cylinders: 9 unique values
fuel: 6 unique values
title_status: 7 unique values
transmission: 4 unique values
VIN: 118246 unique values
drive: 4 unique values
type: 14 unique values
paint_color: 13 unique values
state: 51 unique values


In [20]:
# One Hot Encode Columns
df = pd.get_dummies(df, columns=['manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'type', 'paint_color'])
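
One-hot encoding `model` directly creates one column per distinct value, and the earlier inspection reported 29,649 of them. A possible mitigation, sketched here as an assumption rather than the author's method, is to collapse infrequent values into a single `'other'` level before encoding. The helper `collapse_rare` and the `min_count` cutoff are hypothetical.

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int, other: str = "other") -> pd.Series:
    """Replace values that occur fewer than min_count times with a shared label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), other)

# Invented model names for illustration
toy = pd.DataFrame({
    "model": ["f-150", "f-150", "camry", "camry", "camry", "rare trim a", "rare trim b"]
})

toy["model"] = collapse_rare(toy["model"], min_count=2)
encoded = pd.get_dummies(toy, columns=["model"])
print(encoded.columns.tolist())
```

The cutoff trades detail for size: a higher `min_count` yields fewer columns but merges more models together.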

In [34]:
# Perform a train and test split

X = df.drop(['price'], axis=1)
y = df['price']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [36]:
# Import the required modules and initialize four regression models
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Random Forest': RandomForestRegressor()
}
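
The models above use default hyperparameters. To explore different parameters with cross-validation, one option is `GridSearchCV`. The sketch below tunes Ridge's `alpha` on synthetic data from `make_regression`; the grid values are arbitrary assumptions, not tuned for the vehicles dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Candidate regularization strengths (arbitrary illustrative grid)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

best_alpha = search.best_params_["alpha"]
best_rmse = (-search.best_score_) ** 0.5   # scores are negated MSE
print(f"best alpha: {best_alpha}, CV RMSE: {best_rmse:.2f}")
```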

In [38]:
# Loop through each model and calculate the cross-validated and test MSE
results = {}

for name, model in models.items():
    # Cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    mean_cv_score = -np.mean(cv_scores)
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predictions on the test set
    y_pred = model.predict(X_test)
    
    # Calculate test MSE
    test_mse = mean_squared_error(y_test, y_pred)
    
    results[name] = {
        'CV Mean MSE': mean_cv_score,
        'Test MSE': test_mse
    }

# Print the results
for model_name, metrics in results.items():
    print(f"{model_name}:")
    print(f"  Cross-Validation Mean MSE: {metrics['CV Mean MSE']}")
    print(f"  Test MSE: {metrics['Test MSE']}")
    print()


MemoryError: Unable to allocate 7.54 GiB for an array with shape (29649, 273203) and data type bool
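
The error comes from a dense boolean array with one column per encoded category. A possible way around it, offered as a sketch and not as the author's solution, is to encode inside a scikit-learn pipeline: `OneHotEncoder` returns a sparse matrix by default, and `Ridge` can be fit on sparse input. The data below is synthetic, and the column lists `numeric` and `categorical` are assumptions about which features to keep.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned vehicles data
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "year": rng.integers(2000, 2022, n),
    "odometer": rng.integers(5_000, 200_000, n),
    "manufacturer": rng.choice(["ford", "toyota", "honda"], n),
    "fuel": rng.choice(["gas", "diesel"], n),
})
toy["price"] = 1000 * (toy["year"] - 1999) - 0.03 * toy["odometer"] + rng.normal(0, 500, n)

numeric = ["year", "odometer"]
categorical = ["manufacturer", "fuel"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    # Sparse output by default; unseen categories in a fold are ignored
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
pipeline = Pipeline([("prep", preprocess), ("model", Ridge(alpha=1.0))])

scores = cross_val_score(
    pipeline, toy[numeric + categorical], toy["price"],
    cv=5, scoring="neg_mean_squared_error",
)
cv_rmse = np.sqrt(-scores.mean())
print(f"CV RMSE: {cv_rmse:.2f}")
```

Fitting the encoder and scaler inside the pipeline also means each cross-validation fold is preprocessed using only its own training portion.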

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
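
One way to connect a fitted model back to price drivers is to compare coefficients on standardized features. The sketch below uses synthetic data in which `year` and `odometer` affect price by construction and `noise_feature` does not; it illustrates the technique only and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known relationship to price
rng = np.random.default_rng(1)
n = 300
X_toy = pd.DataFrame({
    "year": rng.integers(2000, 2022, n).astype(float),
    "odometer": rng.integers(5_000, 200_000, n).astype(float),
    "noise_feature": rng.normal(size=n),
})
y_toy = 900 * (X_toy["year"] - 2000) - 0.05 * X_toy["odometer"] + rng.normal(0, 300, n)

# Standardizing puts coefficients on a comparable scale
X_scaled = StandardScaler().fit_transform(X_toy)
model = Ridge(alpha=1.0).fit(X_scaled, y_toy)

# Rank features by absolute coefficient size
coefs = pd.Series(model.coef_, index=X_toy.columns).sort_values(key=np.abs, ascending=False)
print(coefs)
```

The sign of each coefficient gives the direction of the effect, and its magnitude gives the price change per standard deviation of the feature.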

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.