# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

From a data perspective, our task is to develop a predictive model to identify the key determinants that influence the price of used cars. We will analyze a dataset containing features such as year, manufacturer, model, condition, cylinders, fuel, odometer, transmission, drive, size, type, paint_color, and region, and aim to understand how these variables contribute to the price of the vehicles. Our goal is to leverage statistical and machine learning techniques to uncover relationships and patterns within the data that can inform our pricing predictions.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

### Steps to Get Familiar with the Dataset and Identify Quality Issues

1. **Load the Data**
    - Import the dataset into a data manipulation tool such as Python (using pandas) or R.
    - Display the first few rows to get an initial sense of the data structure.

2. **Data Overview**
    - Check the dimensions of the dataset (number of rows and columns).
    - Review column names and data types to ensure they align with the expected features.

3. **Assess Missing Values**
    - Identify columns with missing values and the proportion of missing data in each column.

4. **Descriptive Statistics**
    - Calculate summary statistics (mean, median, mode, standard deviation) for numerical columns to understand the distribution of the data.
    - For categorical columns, count the occurrences of each category.

5. **Data Cleaning**
    - Drop rows or columns with large amounts of missing data, impute values where that is reasonable, and remove or cap implausible outliers (for example, extreme `price` values).
    - Standardize the formats of any categorical columns (e.g., ensure all `manufacturer` names are consistently formatted).

6. **Identify Duplicates**
    - Check for and remove any duplicate rows that may skew analysis.

7. **Data Transformation**
    - Convert categorical variables into numerical formats if necessary (e.g., using one-hot encoding).
    - Normalize or scale numerical variables to ensure they are on a comparable scale for modeling.

8. **Exploratory Data Analysis (EDA)**
    - Create visualizations such as histograms, box plots, and scatter plots to identify trends, patterns, and outliers in the data.
    - Analyze correlations between variables to identify potential drivers of the target variable (`price`).

9. **Feature Engineering**
    - Create new features that may be relevant, such as `car_age` (derived from `year`).
    - Consider aggregating or combining features that might have a joint impact on the target variable.

10. **Data Quality Check**
    - Validate data consistency, ensuring no discrepancies or logical errors within the columns.
    - Document any data quality issues and the steps taken to address them.
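
Several of the checks listed above can be sketched in a few lines of pandas. The snippet below is a minimal illustration, not the analysis itself: the `sample` frame and its values are made up, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the real listings (illustrative values only)
sample = pd.DataFrame({
    "price": [6000, 11900, 0, 11900, 21000],
    "year": [2012.0, 2014.0, np.nan, 2014.0, 2018.0],
    "manufacturer": [None, "chevrolet", "Toyota", "chevrolet", "toyota"],
    "odometer": [120000.0, 85000.0, np.nan, 85000.0, 30000.0],
})

# Steps 1-2: dimensions and data types
print(sample.shape)
print(sample.dtypes)

# Step 3: share of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)

# Step 4: summary statistics and category counts
# (lower-casing first so 'Toyota' and 'toyota' are counted together)
numeric_summary = sample.describe()
manufacturer_counts = sample["manufacturer"].str.lower().value_counts()
print(manufacturer_counts)

# Step 6: number of fully duplicated rows
n_duplicates = int(sample.duplicated().sum())

# Step 10: a simple logical check, listings with a non-positive price
n_bad_prices = int((sample["price"] <= 0).sum())
print(n_duplicates, n_bad_prices)
```

On the real data, the same calls would be applied to `df` after loading it.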



In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/vehicles.csv')


In [2]:
print(df.head())

# df.info() prints its summary directly and returns None, so print() is not needed
df.info()


           id                  region  price  year manufacturer model  \
0  7222695916                prescott   6000   NaN          NaN   NaN   
1  7218891961            fayetteville  11900   NaN          NaN   NaN   
2  7221797935            florida keys  21000   NaN          NaN   NaN   
3  7222270760  worcester / central MA   1500   NaN          NaN   NaN   
4  7210384030              greensboro   4900   NaN          NaN   NaN   

  condition cylinders fuel  odometer title_status transmission  VIN drive  \
0       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
1       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
2       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
3       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
4       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   

  size type paint_color state  
0  NaN  NaN         NaN    az  
1  NaN  NaN         NaN    ar  
2 

In [3]:
# Drop 'id' (a listing identifier) and 'state' (location is already captured by 'region')
df.drop(columns=['id', 'state'], inplace=True)
# Drop columns with fewer than 400,000 non-null values
df.drop(columns=['condition', 'cylinders', 'fuel', 'VIN',
                 'drive', 'size', 'type', 'paint_color'], inplace=True)
# df.info() prints its summary directly and returns None, so call it without print()
print('before dropna')
df.info()
# Drop rows with missing values
df.dropna(inplace=True)
df.info()

before dropna
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   region        426880 non-null  object 
 1   price         426880 non-null  int64  
 2   year          425675 non-null  float64
 3   manufacturer  409234 non-null  object 
 4   model         421603 non-null  object 
 5   odometer      422480 non-null  float64
 6   title_status  418638 non-null  object 
 7   transmission  424324 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 26.1+ MB
<class 'pandas.core.frame.DataFrame'>
Index: 391154 entries, 27 to 426879
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   region        391154 non-null  object 
 1   price         391154 non-null  int64  
 2   year          391154 non-null  float64
 3   manufacturer  391154 non-null  object

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
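
As one possible illustration of the cleaning and logarithm transformations mentioned above, the sketch below filters implausible prices and log-transforms the rest. The `prices` values and the `LOW`/`HIGH` bounds are hypothetical; sensible cut-offs would need to come from the data and from the dealership's context.

```python
import numpy as np
import pandas as pd

# Made-up prices, including a zero and an implausibly large entry
prices = pd.Series([0, 1500, 4900, 11900, 21000, 3_700_000_000], name="price")

# Hypothetical plausibility bounds (assumptions, not values taken from the dataset)
LOW, HIGH = 500, 150_000

# Keep only listings whose price falls inside the bounds (inclusive on both ends)
kept = prices[prices.between(LOW, HIGH)]

# log1p compresses the long right tail; expm1 maps values back to dollars
log_price = np.log1p(kept)
recovered = np.expm1(log_price)

print(kept.tolist())
print(log_price.round(3).tolist())
```

A model fitted on the log scale predicts log prices, so its predictions would need `np.expm1` applied before being reported in dollars.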

In [4]:
# Create 'car_age' feature: vehicle age in years, measured against a fixed reference year of 2025
df['car_age'] = 2025 - df['year']


In [5]:
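# Identify numerical and categorical columns.
# Columns not listed here ('year', 'manufacturer') are dropped by ColumnTransformer,
# whose default is remainder='drop'.
num_features = ['odometer', 'car_age']
cat_features = ['model', 'title_status', 'transmission', 'region']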

# Create transformers for numerical and categorical data
num_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

cat_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ])


In [6]:
# Define target variable and features
X = df.drop(columns=['price'])
y = df['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [7]:
# Fit and transform the training data
X_train = preprocessor.fit_transform(X_train)

# Transform the testing data
X_test = preprocessor.transform(X_test)


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
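
One way to cross-validate is to bundle preprocessing and the estimator in a single `Pipeline`, so the scaler and encoder are refit on the training part of each fold. The sketch below assumes synthetic data: `toy`, `toy_price`, and `toy_preprocessor` are hypothetical names, and the numbers carry no meaning for the real listings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (purely illustrative)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "odometer": rng.uniform(10_000, 200_000, n),
    "car_age": rng.integers(1, 25, n),
    "transmission": rng.choice(["automatic", "manual"], n),
})
toy_price = (30_000 - 0.08 * toy["odometer"] - 500 * toy["car_age"]
             + rng.normal(0, 1_000, n))

# Same kind of preprocessing as in the notebook, on the toy columns
toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["odometer", "car_age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["transmission"]),
])

# Preprocessing + model in one estimator: each CV fold fits both on its own training split
model = Pipeline([("prep", toy_preprocessor), ("ridge", Ridge(alpha=1.0))])

scores = cross_val_score(model, toy, toy_price, cv=5,
                         scoring="neg_root_mean_squared_error")
cv_rmse = -scores  # scores are negated RMSE values, so flip the sign
print("CV RMSE per fold:", cv_rmse.round(1))
```

The same pipeline object can be passed to `GridSearchCV`, with parameter names prefixed by the step name (for example `ridge__alpha`).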

In [8]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error

# Linear Regression
linear_reg = LinearRegression()

# Ridge Regression
ridge_reg = Ridge()

# Lasso Regression
lasso_reg = Lasso()

# Decision Tree Regression
tree_reg = DecisionTreeRegressor()


In [9]:
# Perform cross-validation for Linear Regression
linear_scores = cross_val_score(linear_reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
linear_rmse_scores = np.sqrt(-linear_scores)
print("Linear Regression RMSE:", linear_rmse_scores.mean())

# Perform cross-validation for Ridge Regression
ridge_scores = cross_val_score(ridge_reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
ridge_rmse_scores = np.sqrt(-ridge_scores)
print("Ridge Regression RMSE:", ridge_rmse_scores.mean())

# # Perform cross-validation for Decision Tree Regression
# tree_scores = cross_val_score(tree_reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
# tree_rmse_scores = np.sqrt(-tree_scores)
# print("Decision Tree Regression RMSE:", tree_rmse_scores.mean())

# # Perform cross-validation for Lasso Regression
# lasso_scores = cross_val_score(lasso_reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
# lasso_rmse_scores = np.sqrt(-lasso_scores)
# print("Lasso Regression RMSE:", lasso_rmse_scores.mean())



Linear Regression RMSE: 9834046.542543653
Ridge Regression RMSE: 9775864.85015646


In [10]:
# Ridge Regression hyperparameter tuning
ridge_param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
ridge_grid_search = GridSearchCV(ridge_reg, ridge_param_grid, cv=5, scoring='neg_mean_squared_error')
ridge_grid_search.fit(X_train, y_train)
print("Best Ridge Parameters:", ridge_grid_search.best_params_)
print("Best Ridge RMSE:", np.sqrt(-ridge_grid_search.best_score_))

# # Decision Tree Regression hyperparameter tuning
# tree_param_grid = {'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 10, 20]}
# tree_grid_search = GridSearchCV(tree_reg, tree_param_grid, cv=5, scoring='neg_mean_squared_error')
# tree_grid_search.fit(X_train, y_train)
# print("Best Decision Tree Parameters:", tree_grid_search.best_params_)
# print("Best Decision Tree RMSE:", np.sqrt(-tree_grid_search.best_score_))

# # Lasso Regression hyperparameter tuning
# lasso_param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
# lasso_grid_search = GridSearchCV(lasso_reg, lasso_param_grid, cv=5, scoring='neg_mean_squared_error')
# lasso_grid_search.fit(X_train, y_train)
# print("Best Lasso Parameters:", lasso_grid_search.best_params_)
# print("Best Lasso RMSE:", np.sqrt(-lasso_grid_search.best_score_))



Best Ridge Parameters: {'alpha': 100}
Best Ridge RMSE: 12596165.971528497
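
The RMSE reported here is not directly comparable with the cross-validated Ridge RMSE printed earlier. The earlier cell averages the per-fold RMSE values, whereas `best_score_` is the mean of the per-fold negative MSE, so its square root is the root of the averaged MSE. The sketch below, using made-up fold errors, shows that the second quantity is never smaller than the first and can be much larger when one fold contains extreme values.

```python
import numpy as np

# Made-up per-fold MSE values; the last fold is dominated by an outlier
fold_mse = np.array([1.0e12, 4.0e12, 9.0e12, 1.6e13, 2.5e14])

# Aggregation used when averaging np.sqrt(-scores) from cross_val_score
mean_of_rmse = np.sqrt(fold_mse).mean()

# Aggregation implied by np.sqrt(-grid_search.best_score_)
rmse_of_mean = np.sqrt(fold_mse.mean())

print(f"mean of per-fold RMSE: {mean_of_rmse:,.0f}")
print(f"RMSE of mean MSE:      {rmse_of_mean:,.0f}")
```

Using `scoring='neg_root_mean_squared_error'` in both places would put the two numbers on the same footing.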


In [11]:
# Final evaluation for Linear Regression
linear_reg.fit(X_train, y_train)
linear_predictions = linear_reg.predict(X_test)
print("Linear Regression Test RMSE:", np.sqrt(mean_squared_error(y_test, linear_predictions)))

# Final evaluation for Ridge Regression
# best_estimator_ has already been refit on the full training set (GridSearchCV uses refit=True by default)
best_ridge_reg = ridge_grid_search.best_estimator_
ridge_predictions = best_ridge_reg.predict(X_test)
print("Ridge Regression Test RMSE:", np.sqrt(mean_squared_error(y_test, ridge_predictions)))

# # Final evaluation for Decision Tree Regression
# best_tree_reg = tree_grid_search.best_estimator_
# best_tree_reg.fit(X_train, y_train)
# tree_predictions = best_tree_reg.predict(X_test)
# print("Decision Tree Regression Test RMSE:", np.sqrt(mean_squared_error(y_test, tree_predictions)))

# # Final evaluation for Lasso Regression
# best_lasso_reg = lasso_grid_search.best_estimator_
# best_lasso_reg.fit(X_train, y_train)
# lasso_predictions = best_lasso_reg.predict(X_test)
# print("Lasso Regression Test RMSE:", np.sqrt(mean_squared_error(y_test, lasso_predictions)))



Linear Regression Test RMSE: 4657010.1293513095
Ridge Regression Test RMSE: 4452942.552492419


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [14]:
# Inspect the coefficients of the tuned Ridge model (best_ridge_reg)

# Get feature names after preprocessing
feature_names = preprocessor.transformers_[0][2] + list(preprocessor.transformers_[1][1].named_steps['onehot'].get_feature_names_out(cat_features))

# Get coefficients
coefficients = best_ridge_reg.coef_

# Create a DataFrame to display feature importance
importance_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
importance_df['Absolute Coefficient'] = importance_df['Coefficient'].abs()
importance_df = importance_df.sort_values(by='Absolute Coefficient', ascending=False)
print(importance_df)


                                Feature   Coefficient  Absolute Coefficient
3632                    model_benz s430  2.572334e+07          2.572334e+07
3524                    model_benz e320  1.867571e+07          1.867571e+07
9009       model_f350 super duty lariat  5.771308e+06          5.771308e+06
20284                  region_frederick  4.490146e+06          4.490146e+06
1651                      model_4runner  3.972500e+06          3.972500e+06
...                                 ...           ...                   ...
7508    model_expedition eddie bauer4x4  5.938823e-01          5.938823e-01
9486             model_forester 2.5 awd -3.282600e-01          3.282600e-01
10763                    model_hse spor  2.647360e-01          2.647360e-01
5324   model_cooper hardtop hatchback f  2.510420e-01          2.510420e-01
8952            model_f350 diesel dualy -1.167850e-01          1.167850e-01

[20571 rows x 3 columns]


In [14]:
# # Assuming best_forest_reg is the trained Random Forest model
# importances = best_forest_reg.feature_importances_

# # Create a DataFrame to display feature importance
# importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
# importance_df = importance_df.sort_values(by='Importance', ascending=False)
# print(importance_df)


In [15]:
# # Assuming lasso_reg is the trained model
# coefficients = lasso_reg.coef_

# # Create a DataFrame to display feature importance
# importance_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
# importance_df = importance_df[importance_df['Coefficient'] != 0]
# importance_df['Absolute Coefficient'] = importance_df['Coefficient'].abs()
# importance_df = importance_df.sort_values(by='Absolute Coefficient', ascending=False)
# print(importance_df)


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.