# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

In my experience, the price of a used car depends on several factors. Is it old enough to be considered collectible? Does the age of the vehicle cause consumers to worry about its reliability? How many miles does the vehicle have on it? What are the make and model of the vehicle? What is the market for the vehicle in the state?

I will want to keep the features "manufacturer", "model", "condition", "odometer", "title_status", and "state" to determine "price".

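Framed as a data task, this is a supervised regression problem: predict "price" from the vehicle attributes above, then determine which features most influence the predicted price of a car.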

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
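A few first checks can be sketched on a small made-up frame before running them on the full file. The rows below are invented for illustration (they are not taken from `vehicles.csv`), and the cut-offs for "implausible" values are assumptions that would need tuning against the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; every value here is invented.
toy = pd.DataFrame({
    "manufacturer": ["ford", "bmw", None, "ford", "volvo"],
    "condition": ["good", None, None, "excellent", "good"],
    "odometer": [120000.0, 45000.0, np.nan, 9999999.0, 30000.0],
    "price": [6000, 0, 21000, 1500, 30590],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

# Number of distinct non-null values in each categorical column.
cardinality = toy[["manufacturer", "condition"]].nunique()

# Rows with implausible values; both thresholds are assumed, not derived.
suspect = toy[(toy["price"] <= 0) | (toy["odometer"] > 1_000_000)]

print(missing_share)
print(cardinality)
print(suspect)
```

The same three checks applied to the real frame would show how much of each column is missing, how many one-hot columns a categorical feature would produce, and how many listings carry prices or mileages that are probably data-entry errors.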

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
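As a rough illustration of the cleaning and transformation steps named above, the sketch below filters bad prices, fills missing values, and log-transforms the target on a toy frame. The data are invented, and the choices (drop non-positive prices, median imputation, an "unknown" label) are assumptions rather than the only reasonable options.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; every value here is invented.
toy = pd.DataFrame({
    "manufacturer": ["ford", "bmw", None, "ford"],
    "odometer": [120000.0, 50000.0, np.nan, 40000.0],
    "price": [6000, 0, 21000, 1500],
})

# Drop rows whose price is not positive (assumed to be bad listings).
clean = toy[toy["price"] > 0].copy()

# Fill gaps: median for the numeric column, an explicit label for the text column.
clean["odometer"] = clean["odometer"].fillna(clean["odometer"].median())
clean["manufacturer"] = clean["manufacturer"].fillna("unknown")

# Log-transform the skewed target; np.expm1 undoes it after prediction.
clean["log_price"] = np.log1p(clean["price"])

print(clean)
```

Filling missing values row by row keeps the columns intact, so a numeric feature such as "odometer" can stay in the modeling frame alongside the one-hot columns.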

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv(r'C:\Users\ibaca\Downloads\practical_application_II_starter (2)\data\vehicles.csv')

In [3]:
df = df[["manufacturer",  "condition", "odometer",  "price"]]

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   manufacturer  409234 non-null  object 
 1   condition     252776 non-null  object 
 2   odometer      422480 non-null  float64
 3   price         426880 non-null  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 13.0+ MB


In [5]:
from sklearn.preprocessing import OneHotEncoder

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Select columns to encode
columns_to_encode = ['manufacturer', 'condition']

# Fit and transform the data
encoded_data = encoder.fit_transform(df[columns_to_encode])

# Get the encoded column names
encoded_columns = encoder.get_feature_names_out(columns_to_encode)

# Create a DataFrame with the encoded data
encoded_df = pd.DataFrame(encoded_data, columns=encoded_columns)
encoded_df['price'] = df['price']

Original DataFrame:
       manufacturer condition  odometer  price
0               NaN       NaN       NaN   6000
1               NaN       NaN       NaN  11900
2               NaN       NaN       NaN  21000
3               NaN       NaN       NaN   1500
4               NaN       NaN       NaN   4900
...             ...       ...       ...    ...
426875       nissan      good   32226.0  23590
426876        volvo      good   12029.0  30590
426877     cadillac      good    4174.0  34990
426878        lexus      good   30112.0  28990
426879          bmw      good   22716.0  30590

[426880 rows x 4 columns]


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
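One way to combine preprocessing, a regression model, and cross-validated parameter search is a `Pipeline` wrapped in `GridSearchCV`. The sketch below uses `Ridge` on synthetic data; the model choice, the `alpha` grid, and the price formula that generates the toy target are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: price falls with mileage and carries a brand premium.
rng = np.random.default_rng(0)
n = 90
toy = pd.DataFrame({
    "manufacturer": rng.choice(["ford", "bmw", "volvo"], size=n),
    "odometer": rng.uniform(5_000, 200_000, size=n),
})
toy["price"] = (
    30_000
    - 0.1 * toy["odometer"]
    + 5_000 * (toy["manufacturer"] == "bmw")
    + rng.normal(0, 500, size=n)
)

# Encode the categorical column and scale the numeric one inside the pipeline,
# so each cross-validation fold is preprocessed using its own training rows.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["manufacturer"]),
    ("num", StandardScaler(), ["odometer"]),
])
pipe = Pipeline([("prep", preprocess), ("model", Ridge())])

search = GridSearchCV(
    pipe,
    param_grid={"model__alpha": [0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",  # regression metric, higher is better
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(toy[["manufacturer", "odometer"]], toy["price"])

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

A linear model like this trains quickly on hundreds of thousands of rows, which makes it a convenient first baseline before trying slower estimators.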

In [6]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
pd.set_option("display.max_rows", 10)

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
import warnings

In [8]:
encoded_df.head()

Unnamed: 0,manufacturer_acura,manufacturer_alfa-romeo,manufacturer_aston-martin,manufacturer_audi,manufacturer_bmw,manufacturer_buick,manufacturer_cadillac,manufacturer_chevrolet,manufacturer_chrysler,manufacturer_datsun,...,manufacturer_volvo,manufacturer_nan,condition_excellent,condition_fair,condition_good,condition_like new,condition_new,condition_salvage,condition_nan,price
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6000
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,11900
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,21000
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1500
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4900


In [9]:
encoded_df.dropna(axis=1, inplace=True)

In [10]:
encoded_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 51 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   manufacturer_acura            426880 non-null  float64
 1   manufacturer_alfa-romeo       426880 non-null  float64
 2   manufacturer_aston-martin     426880 non-null  float64
 3   manufacturer_audi             426880 non-null  float64
 4   manufacturer_bmw              426880 non-null  float64
 5   manufacturer_buick            426880 non-null  float64
 6   manufacturer_cadillac         426880 non-null  float64
 7   manufacturer_chevrolet        426880 non-null  float64
 8   manufacturer_chrysler         426880 non-null  float64
 9   manufacturer_datsun           426880 non-null  float64
 10  manufacturer_dodge            426880 non-null  float64
 11  manufacturer_ferrari          426880 non-null  float64
 12  manufacturer_fiat             426880 non-nul

In [11]:
# Create 2 dataframes, one for target and one for features
X = encoded_df.drop(columns=['price'])
y = encoded_df['price']

# Normalize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)



In [12]:
# Define the parameter grid.
# Price is continuous, so this is a regression problem: use SVR rather than SVC.
# ('probability' is an SVC-only option, so it is not part of this grid.)
from sklearn.svm import SVR

param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
    'degree': [2, 3, 4],
    'coef0': [0.0, 0.1, 0.5],
    'shrinking': [True, False]
}

# Initialize the support vector regressor
svm = SVR()

In [13]:
# Set up the grid search with KFold cross-validation
kf = KFold(n_splits=3)

In [14]:
# Score with negative mean squared error (higher is better), a regression metric
grid_search = GridSearchCV(estimator=svm, param_grid=param_grid, scoring='neg_mean_squared_error', cv=kf)

In [None]:
# Fit the grid search to the data
grid_search.fit(X_scaled, y)

In [None]:
# Print the best parameters and the best score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validated MSE: ", -grid_search.best_score_)

In [None]:
# Calculate permutation importance
results = permutation_importance(grid_search.best_estimator_, X_scaled, y, scoring='neg_mean_squared_error')

# Create a DataFrame for permutation importance results.
# Feature names come from X, whose columns match the columns of X_scaled.
perm_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': results.importances_mean
})



In [None]:
# Sort the DataFrame by importance
perm_importance_df = perm_importance_df.sort_values(by='Importance', ascending=False)

# Display the most influential attributes
print(perm_importance_df.head(10))

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
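One simple check of model quality is to compare held-out error against a naive baseline that always predicts the mean price. The sketch below does this on synthetic data; the data-generating formula, the 25% test split, and the use of `LinearRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: price falls linearly with mileage, plus noise.
rng = np.random.default_rng(1)
odometer = rng.uniform(5_000, 200_000, size=200)
price = 30_000 - 0.1 * odometer + rng.normal(0, 1_000, size=200)

X_toy = odometer.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, price, test_size=0.25, random_state=0
)

def holdout_rmse(model):
    """Fit on the training split and return RMSE on the held-out split."""
    model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

baseline_rmse = holdout_rmse(DummyRegressor(strategy="mean"))
model_rmse = holdout_rmse(LinearRegression())

print("Baseline RMSE:", baseline_rmse)
print("Model RMSE:", model_rmse)
```

If a fitted model does not clearly beat the mean-price baseline on held-out data, that is a sign the earlier phases (feature selection, cleaning, or model choice) need another pass before reporting findings to the client.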

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.