# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

#### Business Objective
 Background: The used car market in the United States is a surprisingly complex ecosystem. Several interconnected factors, such as the variety of products, information asymmetry between buyer and seller, and fluctuating market dynamics, make it challenging for both buyers and sellers to navigate.
 Here, we build a predictive model to identify which factors most influence used car prices, so that dealers can take the necessary actions to maximize their sales.

#### Business Success Criteria
Perform predictive analysis to provide recommendations to used car dealers on ways they can maximize their sales.

Resources: sufficient historical data, including features that influence car prices.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

-  Identify and handle null values.
-  Handle outliers and convert categorical features to numeric.
-  Inspect the dataset for interesting details and trends (for example, are there values that look like estimates?).
-  Scale the data to normalize the magnitude of different features.
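
The first of these checks can be sketched as a small per-column report. This is a minimal illustration on a made-up toy frame, not the actual listings; the helper name `quality_report` and the toy values are assumptions introduced here.

```python
import numpy as np
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, percentage of nulls, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_pct": df.isna().mean() * 100,
        "n_unique": df.nunique(),
    }).sort_values("null_pct", ascending=False)

# Toy frame standing in for the real listings (values are made up).
toy = pd.DataFrame({
    "price": [0, 4500, 12000, 987654321],
    "year": [2012.0, np.nan, 2018.0, 2005.0],
    "fuel": ["gas", "gas", None, None],
})
report = quality_report(toy)
print(report)
```

Sorting by `null_pct` puts the sparsest columns first, which makes candidates for dropping or imputation easy to spot.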

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from warnings import filterwarnings 
filterwarnings('ignore')




In [None]:
%pip install --upgrade category_encoders

In [None]:
import category_encoders as ce


In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import recall_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.decomposition import PCA


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import root_mean_squared_error
from sklearn.metrics import r2_score
from sklearn.inspection import permutation_importance
import plotly.express as px
from scipy.linalg import svd

In [None]:
car_data = pd.read_csv('data/vehicles.zip', compression = 'zip')

In [None]:
car_data.head(20)


In [None]:
car_data.isna().sum()

In [None]:
print(car_data.dtypes)

In [None]:
#Checking to see how many rows have null in most of the columns except id, price, state and region
car_data[car_data.isna().sum(axis=1) >= 14].count()


In [None]:
#counting number of rows with price=0
car_data[car_data['price'] == 0 ].count()

In [None]:
## dropping rows with null values in most of the features.
car_data = car_data[car_data.isna().sum(axis=1) < 14]

In [None]:
## checking what percentage of values is unique, non-null, and null in each feature
for col in car_data.columns:
    print(f"{col}:unique:{((car_data[col].nunique()/car_data[col].size)*100)}%  NotNull:{((car_data[col].count()/car_data[col].size)*100)}%  Null:{((car_data[col].isna().sum()/car_data[col].size)*100)}%:{car_data[col].dtype}")
print("total columns:",car_data.columns.shape)

### Dropping unnecessary columns

- `id` is a unique identifier for each record. It is not contiguous, so it cannot serve as an index, and it is not useful for PCA.
- `VIN` is not useful for PCA.
- `condition`, `drive`, and `paint_color` are not useful because they are missing for more than 25% of the data.
- `cylinders` is not useful because it is missing for about 40% of the data.
- `region` is less useful since we have `state`.
- `size` is not useful because it is missing for about 72% of the data.
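
The missing-value part of this reasoning can be sketched as a threshold rule. This is only an illustration on a toy frame: the helper name `columns_above_null_threshold` and the 25% cutoff are assumptions, and the notebook's actual drop list is chosen by hand and also covers identifier-like columns.

```python
import numpy as np
import pandas as pd

def columns_above_null_threshold(df: pd.DataFrame, threshold: float = 0.25) -> list:
    """Return names of columns whose share of missing values exceeds threshold."""
    null_share = df.isna().mean()
    return null_share[null_share > threshold].index.tolist()

# Toy frame (made-up values) with different levels of missingness.
toy = pd.DataFrame({
    "price": [4500, 12000, 8000, 15000],
    "size": [np.nan, np.nan, np.nan, "compact"],     # 75% missing
    "condition": ["good", np.nan, np.nan, "fair"],   # 50% missing
    "state": ["ca", "ny", "tx", "ca"],               # complete
})
to_drop = columns_above_null_threshold(toy, threshold=0.25)
toy_clean = toy.drop(columns=to_drop)
print(to_drop)
```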

In [None]:
dropped_features = ['id','VIN','condition','cylinders','size','drive','paint_color','region']

In [None]:
car_data_clean = car_data.drop(dropped_features, axis=1)

In [None]:
car_data_clean.info()

In [None]:
##checking for outliers
sns.boxplot(data=car_data_clean, x="price")
plt.show()

### Removing outliers

In [None]:
def remove_outliers_iqr(df, column):
    """
    Identifies and removes outliers from a Pandas DataFrame column using the IQR method.

    Args:
        df (pd.DataFrame): The input DataFrame.
        column (str): The name of the column to check for outliers.

    Returns:
        pd.DataFrame: A new DataFrame with outliers removed.
    """
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    print("lower_bound",lower_bound)
    upper_bound = Q3 + 1.5 * IQR
    print("upper_bound",upper_bound)
    
    filtered_df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return filtered_df
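
As a worked example of the same IQR rule that `remove_outliers_iqr` applies, here is a sketch on a small made-up price series. The numbers are illustrative only.

```python
import pandas as pd

# Made-up prices with one extreme value.
prices = pd.Series([5000, 7000, 9000, 11000, 13000, 15000, 17000, 1_000_000])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences [lower, upper].
kept = prices[(prices >= lower) & (prices <= upper)]
print(lower, upper, len(kept))
```

In this toy series the lower fence is negative, so a price of 0 would pass the filter. If that also happens on the real data, zero-priced listings need a separate check.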

In [None]:
car_data_clean = remove_outliers_iqr(car_data_clean, 'price')

In [None]:
##checking for outliers
sns.boxplot(data=car_data_clean, x="price")
plt.show()

In [None]:
car_data_clean['price'].describe()

In [None]:
pd.set_option('display.max_rows', None)
car_data_clean['model']

### Encoding and scaling the data to run PCA and determine correlation

In [None]:
car_data_encode = car_data_clean.drop('price',axis=1)

In [None]:
# M-estimate encoding of every column against price.
# Note: the next cell overwrites car_data_encoded with the TargetEncoder output.
m_estimator = ce.MEstimateEncoder(cols=car_data_encode.columns)
car_data_encoded = m_estimator.fit_transform(car_data_encode, car_data_clean['price'])

In [None]:
targ_enc = ce.TargetEncoder(cols=car_data_encode.columns)
car_data_encoded = targ_enc.fit_transform(car_data_encode, car_data_clean['price'])
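
To make the idea behind these target-based encoders concrete, here is a hand-rolled sketch of an m-estimate style encoding: each category is replaced by its mean target value, shrunk toward the global mean with weight `m`. The function `m_estimate_encode` and the toy data are assumptions for illustration; this is not the `category_encoders` implementation, which also handles unknown and missing categories.

```python
import pandas as pd

def m_estimate_encode(categories: pd.Series, target: pd.Series, m: float = 1.0) -> pd.Series:
    """Replace each category with (sum_of_target + m * prior) / (count + m)."""
    prior = target.mean()  # global mean of the target
    stats = target.groupby(categories).agg(["sum", "count"])
    smoothed = (stats["sum"] + m * prior) / (stats["count"] + m)
    return categories.map(smoothed)

# Toy data (made-up values).
toy = pd.DataFrame({
    "manufacturer": ["ford", "ford", "ford", "bmw"],
    "price": [10000.0, 12000.0, 14000.0, 30000.0],
})
encoded = m_estimate_encode(toy["manufacturer"], toy["price"], m=1.0)
print(encoded.tolist())
```

The single `bmw` row is pulled strongly toward the global mean of 16500, while the three `ford` rows move less. This shrinkage is what keeps rare categories from being encoded by a noisy one-row average.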

In [None]:
car_data_encoded['price'] = car_data_clean['price']
print(car_data_encoded.shape)
print(type(car_data_encoded))
print(car_data_encoded.columns)

#### Examining the Correlations

In [None]:
# target: unscaled price, used below to color the PCA plots
target = car_data_clean['price']
# standardize every encoded column, including price
scaler = StandardScaler()
car_data_encoded[car_data_encoded.columns] = scaler.fit_transform(car_data_encoded[car_data_encoded.columns])

In [None]:
highest_corr = car_data_encoded.corr()[['price']].nlargest(columns = 'price', n = 2).index[1]

print("highest correlation:",highest_corr)

In [None]:
corr_matrix = car_data_encoded.corr()
plt.figure(figsize=(10, 8)) # Adjust size as needed
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap')
plt.show()

In [None]:
pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(car_data_encoded)

In [None]:
#plot in 3D with Matplotlib
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_pca_3d[:, 0], X_pca_3d[:, 1], X_pca_3d[:, 2], c=target, cmap='viridis', edgecolor='k')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
plt.title('3D PCA')
legend1 = ax.legend(*scatter.legend_elements(), title='Price')
ax.add_artist(legend1)
plt.show()

In [None]:
#pca to 2 dimensions
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(car_data_encoded)

In [None]:
#plot in 2D with Matplotlib
plt.figure(figsize=(8, 6))
plt.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=target, cmap='viridis', edgecolor='k', s=25)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('2D PCA')
plt.colorbar(label='Price')
plt.show()
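
Before reading much into 2D or 3D PCA plots, it helps to check how much variance the retained components explain. The sketch below uses synthetic data rather than the car features; `explained_variance_ratio_` is the fitted scikit-learn `PCA` attribute, and everything else is made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two strongly correlated columns plus two independent ones.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA(n_components=2).fit(X_scaled)

# Share of total variance captured by each retained component.
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the ratios sum to well below 1, the plot shows only part of the structure in the data and should be interpreted with that in mind.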

In [None]:
#px.scatter(data_frame=car_data, x='price', y='year')
variance = car_data_encoded.var()
#high_variance_features = variance[variance > 10] 
print(variance)
#X_train_encoded.boxplot(column=high_variance_features.index) 
#plt.boxplot(variance)

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

- `M-Estimate Encoder`, `CatBoost Encoder`, and `James-Stein Encoder` will be used to encode categorical columns as numeric values.
- `LinearRegression` and `Ridge` will then be used to make predictions.
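
One way to cross-validate such a model is to pass a whole pipeline to `cross_val_score`, so that every preprocessing step is re-fit inside each fold. The sketch below uses synthetic numeric data and omits the encoder step; the data, coefficients, and fold count are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the encoded features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])

# 5-fold cross-validation; each fold fits the scaler and model on its own training part.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```

With a target-based encoder as the first pipeline step, the same pattern keeps the encoder from seeing the held-out fold's prices.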


In [None]:
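target_feature = 'price'
car_data_clean = car_data_clean.fillna('missing')
X = car_data_clean.drop(target_feature, axis=1)
# Note: car_data_encoded was standardized above, so y is the standardized price
# and the MSE and MAE reported below are on that scale rather than in dollars.
y = car_data_encoded['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)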

In [None]:
print(X_train.shape)
print(y_train.shape)


In [None]:

#for cat in categorical_columns:
ml_estimator = ce.MEstimateEncoder(cols=X_train.columns)
mestimator_linear_pipeline = Pipeline([
    ('mtransformer', ml_estimator), 
    ('mscalor',StandardScaler()),
    ('mlinreg', LinearRegression())])
mestimator_linear_pipeline

In [None]:

#for cat in categorical_columns:
mr_estimator = ce.MEstimateEncoder(cols=X_train.columns)
mestimator_ridge_pipeline = Pipeline([
    ('mrtransformer', mr_estimator), 
    ('mrscalor',StandardScaler()),
    ('mridge', Ridge(alpha=1000))])
mestimator_ridge_pipeline

In [None]:

#for cat in categorical_columns:
js_estimator = ce.JamesSteinEncoder(cols=X_train.columns)
js_linear_pipeline = Pipeline([
    ('jtransformer', js_estimator), 
    ('jscalor',StandardScaler()),
    ('jlinreg', LinearRegression())])
js_linear_pipeline

In [None]:

#for cat in categorical_columns:
jsr_estimator = ce.JamesSteinEncoder(cols=X_train.columns)
js_ridge_pipeline = Pipeline([
    ('jrtransformer', jsr_estimator), 
    ('jrscalor',StandardScaler()),
    ('jridge', Ridge())])
js_ridge_pipeline

In [None]:

#for cat in categorical_columns:
c_estimator = ce.CatBoostEncoder(cols=X_train.columns)
c_linear_pipeline = Pipeline([
    ('ctransformer', c_estimator), 
    ('cscalor',StandardScaler()),
    ('clinreg', LinearRegression())])
c_linear_pipeline

In [None]:

#for cat in categorical_columns:
cr_estimator = ce.CatBoostEncoder(cols=X_train.columns)
cr_ridge_pipeline = Pipeline([
    ('crtransformer', cr_estimator), 
    ('crscalor',StandardScaler()),
    ('cridge', Ridge(alpha=1000))])
cr_ridge_pipeline

In [None]:
model_results = pd.DataFrame(columns=['loss','MEstimator_Linear','MEstimator_Ridge','JStein_Linear','JStein_Ridge','CBoost_Linear','CBoost_Ridge'])
model_results['loss']=['MSE_Train','MSE_Test','MAE_Train','MAE_Test','R2_Train','R2_Test']
model_results = model_results.set_index('loss')
model_results.head(6)

Model with MEstimate Encoder, StandardScaler, and Linear Regression

In [None]:
mestimator_linear_pipeline.fit(X_train,y_train)
y_train_pred = mestimator_linear_pipeline.predict(X_train)
y_test_pred = mestimator_linear_pipeline.predict(X_test)
train_mse = float(mean_squared_error(y_train,y_train_pred))
test_mse = float(mean_squared_error(y_test,y_test_pred))
train_mae = float(mean_absolute_error(y_train,y_train_pred))
test_mae = float(mean_absolute_error(y_test,y_test_pred))

# Compute R² using Scikit-Learn
R2_test = r2_score(y_test, y_test_pred)
R2_train = r2_score(y_train, y_train_pred)
model_results['MEstimator_Linear']=[train_mse,test_mse,train_mae,test_mae,R2_train,R2_test]

Model with MEstimate Encoder, StandardScaler, and Ridge Regression

In [None]:
mestimator_ridge_pipeline.fit(X_train,y_train)
y_train_pred = mestimator_ridge_pipeline.predict(X_train)
y_test_pred = mestimator_ridge_pipeline.predict(X_test)
train_mse = float(mean_squared_error(y_train,y_train_pred))
test_mse = float(mean_squared_error(y_test,y_test_pred))
train_mae = float(mean_absolute_error(y_train,y_train_pred))
test_mae = float(mean_absolute_error(y_test,y_test_pred))

# Compute R² using Scikit-Learn
R2_test = r2_score(y_test, y_test_pred)
R2_train = r2_score(y_train, y_train_pred)
model_results['MEstimator_Ridge']=[train_mse,test_mse,train_mae,test_mae,R2_train,R2_test]

Model with James-Stein Encoder, StandardScaler, and Linear Regression

In [None]:
js_linear_pipeline.fit(X_train,y_train)
y_train_pred = js_linear_pipeline.predict(X_train)
y_test_pred = js_linear_pipeline.predict(X_test)
train_mse = float(mean_squared_error(y_train,y_train_pred))
test_mse = float(mean_squared_error(y_test,y_test_pred))
train_mae = float(mean_absolute_error(y_train,y_train_pred))
test_mae = float(mean_absolute_error(y_test,y_test_pred))

# Compute R² using Scikit-Learn
R2_test = r2_score(y_test, y_test_pred)
R2_train = r2_score(y_train, y_train_pred)
model_results['JStein_Linear']=[train_mse,test_mse,train_mae,test_mae,R2_train,R2_test]

Model with James-Stein Encoder, StandardScaler, and Ridge Regression

In [None]:
js_ridge_pipeline.fit(X_train,y_train)
y_train_pred = js_ridge_pipeline.predict(X_train)
y_test_pred = js_ridge_pipeline.predict(X_test)
train_mse = float(mean_squared_error(y_train,y_train_pred))
test_mse = float(mean_squared_error(y_test,y_test_pred))
train_mae = float(mean_absolute_error(y_train,y_train_pred))
test_mae = float(mean_absolute_error(y_test,y_test_pred))

# Compute R² using Scikit-Learn
R2_test = r2_score(y_test, y_test_pred)
R2_train = r2_score(y_train, y_train_pred)
model_results['JStein_Ridge']=[train_mse,test_mse,train_mae,test_mae,R2_train,R2_test]

Model with CatBoost Encoder, StandardScaler, and Linear Regression

In [None]:
c_linear_pipeline.fit(X_train,y_train)
y_train_pred = c_linear_pipeline.predict(X_train)
y_test_pred = c_linear_pipeline.predict(X_test)
train_mse = float(mean_squared_error(y_train,y_train_pred))
test_mse = float(mean_squared_error(y_test,y_test_pred))
train_mae = float(mean_absolute_error(y_train,y_train_pred))
test_mae = float(mean_absolute_error(y_test,y_test_pred))

# Compute R² using Scikit-Learn
R2_test = r2_score(y_test, y_test_pred)
R2_train = r2_score(y_train, y_train_pred)
model_results['CBoost_Linear']=[train_mse,test_mse,train_mae,test_mae,R2_train,R2_test]

Model with CatBoost Encoder, StandardScaler, and Ridge Regression

In [None]:
cr_ridge_pipeline.fit(X_train,y_train)
y_train_pred = cr_ridge_pipeline.predict(X_train)
y_test_pred = cr_ridge_pipeline.predict(X_test)
train_mse = float(mean_squared_error(y_train,y_train_pred))
test_mse = float(mean_squared_error(y_test,y_test_pred))
train_mae = float(mean_absolute_error(y_train,y_train_pred))
test_mae = float(mean_absolute_error(y_test,y_test_pred))

# Compute R² using Scikit-Learn
R2_test = r2_score(y_test, y_test_pred)
R2_train = r2_score(y_train, y_train_pred)
model_results['CBoost_Ridge']=[train_mse,test_mse,train_mae,test_mae,R2_train,R2_test]
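
The six cells above repeat the same fit, predict, and score steps. A possible refactoring is a single helper that returns the metrics in the row order of the results table. The sketch below runs on synthetic numeric data with plain estimators; the helper name `evaluate` and the data are assumptions for illustration, and the notebook's pipelines could be passed in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

METRIC_ROWS = ['MSE_Train', 'MSE_Test', 'MAE_Train', 'MAE_Test', 'R2_Train', 'R2_Test']

def evaluate(model, X_tr, X_te, y_tr, y_te):
    """Fit the model and return metrics in the same order as METRIC_ROWS."""
    model.fit(X_tr, y_tr)
    pred_tr, pred_te = model.predict(X_tr), model.predict(X_te)
    return [
        float(mean_squared_error(y_tr, pred_tr)),
        float(mean_squared_error(y_te, pred_te)),
        float(mean_absolute_error(y_tr, pred_tr)),
        float(mean_absolute_error(y_te, pred_te)),
        float(r2_score(y_tr, pred_tr)),
        float(r2_score(y_te, pred_te)),
    ]

# Synthetic numeric data standing in for the encoded car features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {'Linear': LinearRegression(), 'Ridge': Ridge(alpha=1000)}
demo_results = pd.DataFrame(
    {name: evaluate(model, X_tr, X_te, y_tr, y_te) for name, model in models.items()},
    index=METRIC_ROWS,
)
print(demo_results)
```

Tying the metric order to one list of row labels removes the risk of writing train and test values into the wrong rows.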

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [None]:
model_results.head(6)

In [None]:
# Calculate the permutation importance

results = permutation_importance(mestimator_ridge_pipeline, X_test, y_test,n_repeats=10)
#importances = pd.DataFrame(data=results.importances_mean, index=X.columns, columns=['Importance']).sort_values(by='Importance', ascending=False)


In [None]:
df = pd.DataFrame(results['importances'])
df = df.T
df.columns = X_test.columns
px.box(data_frame=df, orientation='h', title = 'Feature importance for price prediction')
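
Alongside the box plot, the permutation results can be condensed into a ranked table using the `importances_mean` and `importances_std` fields that `permutation_importance` returns. The sketch below uses a synthetic dataset; the column names and coefficients are made up for illustration and say nothing about the real car data.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

# Synthetic data: the target depends strongly on one column, weakly on another.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=["odometer", "year", "noise"])
y_demo = -4.0 * X_demo["odometer"] + 2.0 * X_demo["year"] + rng.normal(scale=0.3, size=400)

model = Ridge().fit(X_demo, y_demo)
result = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=0)

# Rank features by the mean drop in score when each one is shuffled.
summary = pd.DataFrame(
    {"mean": result.importances_mean, "std": result.importances_std},
    index=X_demo.columns,
).sort_values("mean", ascending=False)
print(summary)
```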

In [None]:

#for cat in categorical_columns:
mr_estimator = ce.MEstimateEncoder(cols=X_train.columns)
X_train_encoded =  mr_estimator.fit_transform(X_train,y_train)


In [None]:
def get_parameters_for_given_alpha(alpha):
    lm_with_ridge_model = Ridge(alpha = alpha)
    lm_with_ridge_model.fit(X_train_encoded,y_train)
    training_mse = mean_squared_error(lm_with_ridge_model.predict(X_train_encoded),y_train)
    return alpha, *lm_with_ridge_model.coef_, training_mse
    

In [None]:
param_df = pd.DataFrame([get_parameters_for_given_alpha(alpha) for alpha in [0.01, 0.1, 1, 10, 100, 1000, 10000,100000]],
                        columns = ["alpha", *X_train_encoded.columns,"Training MSE"])
param_df

In [None]:
fig = px.line(param_df, x = "alpha", y = "Training MSE", log_x = True, markers = True)
#fig.write_image("MSE_vs_alpha_most_basic.png", scale = 3)
fig.show()
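
For Ridge, training MSE does not decrease as `alpha` grows, so a training-error curve alone will always favor the smallest `alpha`. A cross-validated search is one way to choose it instead. The sketch below runs `GridSearchCV` over `alpha` on synthetic data; the data and the grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the encoded training set.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
alphas = [0.01, 0.1, 1, 10, 100, 1000]

# Score each alpha by 5-fold cross-validated (negative) MSE.
grid = GridSearchCV(pipe, {"ridge__alpha": alphas},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_demo, y_demo)

best_alpha = grid.best_params_["ridge__alpha"]
print(best_alpha, -grid.best_score_)
```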

In [None]:
parameters = {'fit_intercept': [False, True]}

lr_model_finder = GridSearchCV(LinearRegression(),
                               parameters,
                               scoring = "neg_mean_squared_error",
                               cv=3)

lr_model_finder.fit(X_train_encoded, y_train)

In [None]:
lr_model_finder.cv_results_

All model and encoder combinations give fairly consistent results, with R² scores ranging from 70% to 78%.

Of all the features, odometer, model, and year contribute the most to predicting the price of a car, as the feature-importance box plot above shows.

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

## Final Result

The box plot of feature importances shows a strong relationship between odometer, model, and year and the sale price of a car. Customers are willing to pay more for cars that are relatively new and have lower mileage.