# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

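Framed as a data problem, the task is a supervised regression: model `price` as a function of the vehicle attributes in the dataset and identify which features contribute most to the prediction.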

We first assess the situation: we take inventory of which used cars the dealership most wants to sell, note the assumption that this analysis will ultimately help sell them, and weigh the risks, costs, and benefits of acting on its findings.

Using a dataset of over 400,000 entries describing the characteristics of used cars, we will clean and refine the data and analyze it with modeling techniques to determine which attributes, such as brand, damage, cleanliness, and year, are the biggest determinants of a car's price.
We will also generate graphs to compare categories visually, for example the relationship between brand and year.

By the end of this analysis, the goal is for the dealership to use the results to increase profits and used car sales and to know which characteristics to highlight when making a sale to customers, all of which will increase revenue for the business.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [1]:
import pandas as pd

In [2]:
# Load the data
df = pd.read_csv("data/vehicles.csv")

In [3]:
# Get dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: float64(2), int64(2), object(14)

In [4]:
# Get head of dataset (First 5 rows)
df.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc


In [5]:
# Get tail of dataset (Last 5 rows)
df.tail()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
426875,7301591192,wyoming,23590,2019.0,nissan,maxima s sedan 4d,good,6 cylinders,gas,32226.0,clean,other,1N4AA6AV6KC367801,fwd,,sedan,,wy
426876,7301591187,wyoming,30590,2020.0,volvo,s60 t5 momentum sedan 4d,good,,gas,12029.0,clean,other,7JR102FKXLG042696,fwd,,sedan,red,wy
426877,7301591147,wyoming,34990,2020.0,cadillac,xt4 sport suv 4d,good,,diesel,4174.0,clean,other,1GYFZFR46LF088296,,,hatchback,white,wy
426878,7301591140,wyoming,28990,2018.0,lexus,es 350 sedan 4d,good,6 cylinders,gas,30112.0,clean,other,58ABK1GG4JU103853,fwd,,sedan,silver,wy
426879,7301591129,wyoming,30590,2019.0,bmw,4 series 430i gran coupe,good,,gas,22716.0,clean,other,WBA4J1C58KBM14708,rwd,,coupe,,wy


In [6]:
# Describe dataset
df.describe()

Unnamed: 0,id,price,year,odometer
count,426880.0,426880.0,425675.0,422480.0
mean,7311487000.0,75199.03,2011.235191,98043.33
std,4473170.0,12182280.0,9.45212,213881.5
min,7207408000.0,0.0,1900.0,0.0
25%,7308143000.0,5900.0,2008.0,37704.0
50%,7312621000.0,13950.0,2013.0,85548.0
75%,7315254000.0,26485.75,2017.0,133542.5
max,7317101000.0,3736929000.0,2022.0,10000000.0


In [7]:
# Get count of duplicates
df.duplicated().sum()

np.int64(0)
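
A count of zero is expected here, because every row carries its own `id`, so no two rows can match exactly. The sketch below uses a small made-up frame, not the real data, to show how repeated listings could be counted once identifier columns are ignored. The helper name `count_duplicate_listings` and the toy values are illustrative assumptions.

```python
import pandas as pd

def count_duplicate_listings(frame, id_cols=("id",)):
    """Count rows that repeat an earlier row once identifier columns are ignored."""
    compare_cols = [c for c in frame.columns if c not in id_cols]
    return int(frame.duplicated(subset=compare_cols).sum())

# Toy data (hypothetical): rows 0 and 1 describe the same listing under different ids.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "region": ["wyoming", "wyoming", "prescott"],
    "price": [23590, 23590, 6000],
    "odometer": [32226.0, 32226.0, None],
})

n_exact = int(toy.duplicated().sum())      # ids differ, so no exact duplicates
n_listing = count_duplicate_listings(toy)  # ignoring id, one listing is repeated
print(n_exact, n_listing)
```

On the real data, `VIN` could be passed in `id_cols` as well, or used on its own as the comparison key. Which choice is appropriate depends on how listings were reposted.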

In [8]:
# Get count of null entries
df.isnull().sum()

Unnamed: 0,0
id,0
region,0
price,0
year,1205
manufacturer,17646
model,5277
condition,174104
cylinders,177678
fuel,3013
odometer,4400
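
Raw null counts are easier to act on as a share of all rows. In the `df.info()` output, `size` has 120,519 non-null values out of 426,880, so roughly 72% of that column is missing. A minimal sketch of such a summary follows; the helper name `missing_summary` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def missing_summary(frame):
    """Return missing-value counts and percentages per column, largest first."""
    summary = pd.DataFrame({
        "missing": frame.isnull().sum(),
        "percent": (frame.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Toy data (hypothetical values) with different amounts of missingness per column.
toy = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "condition": ["good", None, None, "fair"],
    "size": [None, None, None, "compact"],
})
summary = missing_summary(toy)
print(summary)
```

Columns with a very high missing share may be better dropped than imputed, since filling most of a column with its mode adds little information.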


### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Drop the id and VIN columns: they are identifiers and carry no information about price
df = df.drop(columns=['id', 'VIN'], errors='ignore')

# Fill missing values: median for numeric columns, mode for categorical columns
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

object_cols = df.select_dtypes(include=['object']).columns
df[object_cols] = df[object_cols].apply(lambda x: x.fillna(x.mode()[0]))

# Add the age of the car as a new column
df['age'] = 2025 - df['year']
df = df.drop(columns=['year'])

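# Apply a log transformation to 'price'. log1p (log(1 + price)) keeps zero-price
# listings finite, whereas np.log would yield -inf and break model fitting.
df['price_log'] = np.log1p(df['price'])
df = df.drop(columns=['price'])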

# One-hot encode object columns
object_features = list(object_cols)
encoder = OneHotEncoder(handle_unknown='ignore', drop='first')
encoded_features = encoder.fit_transform(df[object_features]).toarray()
df_onehot = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(object_features))

# Concatenate one-hot features
df = df.drop(columns=object_features).reset_index(drop=True)
df = pd.concat([df, df_onehot], axis=1)

# Scaling
scaler = StandardScaler()
num_features = ['odometer', 'age']
df[num_features] = scaler.fit_transform(df[num_features])

# Split dataset into training set and test set
X = df.drop(columns=['price_log'])
y = df['price_log']


X = X.astype(np.float32).to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

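
The cell above fits the scaler and encoder on the full dataset before splitting, so the test rows influence the preprocessing. One possible alternative, sketched here on synthetic data, wraps the same steps in a `Pipeline` with a `ColumnTransformer` so they are fit on the training split only. `OneHotEncoder` returns a sparse matrix by default, which also avoids the memory cost of `.toarray()` on high-cardinality columns. The toy columns and coefficients are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned listings (all values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "odometer": rng.uniform(0, 200_000, n),
    "age": rng.integers(1, 25, n).astype(float),
    "fuel": rng.choice(["gas", "diesel", "hybrid"], n),
})
toy["price_log"] = (
    10.5 - 0.04 * toy["age"] - 4e-6 * toy["odometer"] + rng.normal(0, 0.1, n)
)

X = toy.drop(columns=["price_log"])
y = toy["price_log"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numeric columns and one-hot encode categorical ones inside the pipeline,
# so both transformers learn their parameters from the training rows only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["odometer", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])
pipe = Pipeline([("prep", preprocess), ("model", Ridge(alpha=1.0))])
pipe.fit(X_train, y_train)

test_r2 = pipe.score(X_test, y_test)
print(f"Test R² on synthetic data: {test_r2:.3f}")
```

Passing the whole pipeline to `cross_val_score` would then refit the preprocessing within each fold as well.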


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [None]:
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Fit Lasso and Ridge regression and evaluate each on the test set
models = {
    "Lasso Regression": Lasso(alpha=0.1),
    "Ridge Regression": Ridge(alpha=1.0)
}
results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r_squared = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    results.append({"Model": name, "R² Score": r_squared, "RMSE": rmse})
    print(f"{name}: R² = {r_squared:.4f}, RMSE = {rmse:.2f}")


In [None]:
results = pd.DataFrame(results)
print("\nModel Performance Summary:")
print(results)

In [None]:
for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    print(f"\n{name} - Cross-Validation R² Scores: {cv_scores}")
    print(f"Mean R² Score: {np.mean(cv_scores):.4f}")
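
The cross-validation above scores each model at a single fixed `alpha`. To explore different parameters, one option is a grid search over the regularization strength. The sketch below runs `GridSearchCV` on synthetic data; the grid values and the data are assumptions, not tuned choices for the car dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (made-up coefficients) standing in for X_train, y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=300)

# Evaluate each candidate alpha with 5-fold cross-validation on R².
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["alpha"])
print(f"Best mean CV R²: {search.best_score_:.4f}")
```

The same pattern applies to `Lasso` by swapping the estimator. `search.cv_results_` holds the per-fold scores for every candidate.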

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

feature_names = df.drop(columns=['price_log']).columns  # Extract column names from the original DataFrame
# Predict Lasso and Ridge regression
y_pred_lasso = models["Lasso Regression"].predict(X_test)
y_pred_ridge = models["Ridge Regression"].predict(X_test)

# Scatter Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Lasso Regression
axes[0].scatter(y_test, y_pred_lasso, alpha=0.5, color='blue')
axes[0].plot(y_test, y_test, color='red', linestyle='dashed')
axes[0].set_title("Lasso Regression: Actual vs. Predicted Prices")
axes[0].set_xlabel("Actual Log Price")
axes[0].set_ylabel("Predicted Log Price")

# Ridge Regression
axes[1].scatter(y_test, y_pred_ridge, alpha=0.5, color='green')
axes[1].plot(y_test, y_test, color='red', linestyle='dashed')
axes[1].set_title("Ridge Regression: Actual vs. Predicted Prices")
axes[1].set_xlabel("Actual Log Price")
axes[1].set_ylabel("Predicted Log Price")

plt.tight_layout()
plt.show()

# Residuals Histogram
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
sns.histplot(y_test - y_pred_lasso, bins=30, kde=True, ax=axes[0], color='blue')
axes[0].set_title("Lasso Regression Residuals")
axes[0].set_xlabel("Residuals (Actual - Predicted)")
axes[0].set_ylabel("Frequency")
sns.histplot(y_test - y_pred_ridge, bins=30, kde=True, ax=axes[1], color='green')
axes[1].set_title("Ridge Regression Residuals")
axes[1].set_xlabel("Residuals (Actual - Predicted)")
axes[1].set_ylabel("Frequency")

plt.tight_layout()
plt.show()

# Ridge coefficients
ridge_coefficients = np.abs(models["Ridge Regression"].coef_)

# Create a DataFrame for feature importance
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': ridge_coefficients})
importance_df = importance_df.sort_values(by='Importance', ascending=False).head(20)

# Plot settings
plt.figure(figsize=(12, 6))
sns.barplot(x="Importance", y="Feature", data=importance_df, palette="viridis")
plt.xlabel("Feature Importance (Ridge Coefficients)")
plt.ylabel("Feature Name")
plt.title("Top 20 Feature Importances in Ridge Regression")
plt.show()

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

Of the two models, Ridge Regression performs better and is more stable than Lasso Regression.

However, Ridge Regression explains only around 30% of the variation in price. Lasso Regression performs worse: its Root Mean Squared Error (RMSE) is considerably higher, and its predictions show a large margin of error.

In terms of business impact, most of the columns in the dataset do not appear to have a meaningful effect on car prices. It may be beneficial to add further categories of data to the dataset. That said, we found no noticeable problems in the data preprocessing phase.

Because Ridge Regression retains more variables and performs better than Lasso, it is the better modeling technique for this task, with odometer reading and the age of the car as strong predictors. The Lasso results, on the other hand, indicate that most of the columns in the dataset are poor predictors of car prices.
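
Because the models are trained on log-transformed prices, their RMSE is in log units rather than dollars. For a natural-log target, an RMSE of r corresponds to a typical multiplicative error of about exp(r). If `log1p` was used for the transform, `np.expm1` is the matching inverse. The sketch below illustrates the conversion with hypothetical prices, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices on the natural-log scale.
y_true_log = np.log(np.array([5000.0, 12000.0, 20000.0, 35000.0]))
y_pred_log = np.log(np.array([6000.0, 10000.0, 24000.0, 30000.0]))

# RMSE in log units, and the multiplicative error it implies.
rmse_log = np.sqrt(mean_squared_error(y_true_log, y_pred_log))
typical_factor = np.exp(rmse_log)

# RMSE in dollars after undoing the log transform.
rmse_dollars = np.sqrt(
    mean_squared_error(np.exp(y_true_log), np.exp(y_pred_log))
)

print(f"log-scale RMSE: {rmse_log:.3f} (about a factor of {typical_factor:.2f})")
print(f"dollar-scale RMSE: {rmse_dollars:,.0f}")
```

Reporting the dollar-scale error alongside R² may make the margin of error easier for the dealership to interpret.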

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.


**Summary**

Using a dataset of over 400,000 entries describing the characteristics of used cars, we cleaned and refined the data and analyzed it with modeling techniques to determine which attributes, such as brand, damage, cleanliness, and year, are the biggest determinants of a car's price. We generated graphs to compare categories visually, for example the relationship between brand and year.

With this analysis, the dealership can use the results to increase profits and used car sales and to know which characteristics to highlight when making a sale to customers, all of which will increase revenue for the business.

**Results**

- Ridge Regression was the better modeling technique, explaining around 30% of the variation in car prices.

- Lasso Regression did not perform well with respect to the age of the car and the odometer readings.

- Some unusual data entries affected the results. For example, there are outliers such as luxury cars in higher price brackets. This suggests that more refined data cleaning may have been needed.
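
The `df.describe()` output shows why outliers matter here: `price` ranges from 0 to about 3.7 billion, and its mean (75,199) is far above its median (13,950). One simple form of cleaning is to keep only rows within a chosen quantile range, sketched below. The helper name `filter_price_range` and the cutoffs are illustrative assumptions, not values validated on this dataset.

```python
import pandas as pd

def filter_price_range(frame, lower_q=0.01, upper_q=0.99, column="price"):
    """Keep rows whose `column` lies between the given quantiles (inclusive)."""
    low, high = frame[column].quantile([lower_q, upper_q])
    return frame[frame[column].between(low, high)]

# Toy prices (hypothetical sample) including a zero and an extreme listing.
toy = pd.DataFrame({"price": [0, 5900, 13950, 26485, 30590, 3_736_929_000]})

# Wide cutoffs are used here only because the toy sample is tiny.
trimmed = filter_price_range(toy, lower_q=0.1, upper_q=0.9)
print(trimmed["price"].tolist())
```

An IQR-based rule or fixed domain limits (for example a minimum plausible sale price) would be alternative choices with different trade-offs.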

**Important Factors for Categories**

- Age of the Car – Newer cars correlate with higher prices.
- Odometer Reading – Higher odometer readings correlate with lower prices.
- Manufacturer – Some higher-end cars have large price differences.
- Fuel Type – Hybrid and electric cars correlate with higher prices.
- Transmission Type – Automatic transmissions are somewhat correlated with higher prices.
- Vehicle Condition – Better vehicle condition correlates with higher prices.
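
Directional statements like these depend on the sign of each coefficient, whereas the importance plot in the Modeling section ranks features by absolute value only. A minimal sketch of a signed coefficient table follows, fitted on synthetic data with made-up effects. The feature names mirror the notebook's, but the numbers are not results from the car dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic, already-standardized features with assumed effects on log price.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "age": rng.normal(size=n),
    "odometer": rng.normal(size=n),
    "fuel_hybrid": rng.integers(0, 2, n).astype(float),
})
y_demo = (
    9.5
    - 0.4 * demo["age"]
    - 0.3 * demo["odometer"]
    + 0.2 * demo["fuel_hybrid"]
    + rng.normal(scale=0.1, size=n)
)

ridge = Ridge(alpha=1.0).fit(demo, y_demo)

# Rank by magnitude but keep the sign, so the direction of each effect is visible.
coef_table = (
    pd.DataFrame({"Feature": demo.columns, "Coefficient": ridge.coef_})
    .assign(Magnitude=lambda t: t["Coefficient"].abs())
    .sort_values("Magnitude", ascending=False)
    .drop(columns="Magnitude")
)
print(coef_table)
```

A negative coefficient indicates that larger values of the feature go with lower predicted log prices, and a positive one the reverse.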


**What does this mean for the business?**
- Collect additional categories of data that could affect car prices.
- In tandem, clean the data more carefully, for example by handling outliers such as luxury car brands.
- Examine combinations of factors; some relationships can be reasoned out directly, such as the age of a vehicle correlating with its odometer reading.

**What to focus on for higher profits**
- Newer Cars
- Lower Odometer Readings
- Be cognizant of outlier models such as exotic/luxury cars
- Hybrid and electric cars are considered high-end and premium
- Better physical condition of the car