# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

Our goal is to analyze the used-car listings to understand which factors drive price.<br>We treat `price` as the target variable of a supervised regression problem and use attributes such as car age, mileage, manufacturer, condition, and fuel type as predictor variables.<br>By fitting and comparing regression models, we look for the features most strongly associated with a higher price. This will help dealerships set better prices and choose the right cars for their inventory.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

##### Here are my steps for data understanding 


1. Load and inspect the data: read the dataset, look at the first few rows, identify the columns and their data types, and count the missing values in each column.

2. Identify key variables such as price, year, odometer, manufacturer, and condition, then assess how many records have missing or incomplete data. Decide whether to fill, drop, or ignore missing values based on each column's importance.

3. Detect outliers: unrealistic prices (such as $0 or extremely high values), odometer readings that are implausibly low or high, and incorrect years (future years or implausibly old cars).

4. Understand feature distributions and relationships: draw histograms and box plots for the numerical variables price and odometer, then use scatter plots to examine the relationship between car age, mileage, and price.

5. Assess data consistency and ensure that all records have valid state and region values.

6. Summarize insights for the business understanding, identifying which features appear to have the greatest impact on price.
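
The steps above can be sketched as a small audit helper. This is only a minimal sketch on a made-up five-row frame: the function name `audit_listings`, the sample values, and the year threshold are illustrative assumptions, and the price bounds simply mirror the filter applied later in this notebook.

```python
import numpy as np
import pandas as pd


def audit_listings(df, current_year=2024, min_price=500, max_price=100_000):
    """Return per-column missing shares and counts of suspicious rows.

    The thresholds are illustrative assumptions; tune them to the data.
    """
    # Share of missing values per column, worst first
    missing_share = df.isna().mean().sort_values(ascending=False)
    flags = {
        # Prices at or outside the plausible bounds
        "price_out_of_range": int(
            ((df["price"] <= min_price) | (df["price"] >= max_price)).sum()
        ),
        # Model years later than the reference year
        "future_year": int((df["year"] > current_year).sum()),
        # Negative mileage cannot be real
        "negative_odometer": int((df["odometer"] < 0).sum()),
    }
    return missing_share, flags


# Hypothetical sample, not taken from vehicles.csv
sample = pd.DataFrame({
    "price": [0, 6000, 11900, 250000, 4900],
    "year": [2013.0, np.nan, 2030.0, 2008.0, 2015.0],
    "odometer": [85548.0, 120000.0, np.nan, -10.0, 43000.0],
    "condition": ["good", None, None, "fair", None],
})

missing_share, flags = audit_listings(sample)
print(missing_share)
print(flags)
```

Running the same helper on the full dataset would give a quick view of which columns need imputation and how many rows the outlier rules would remove.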

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
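
As a small illustration of the logarithm transformation mentioned above, the sketch below applies `np.log1p` to five prices that appear in the first rows printed later in this notebook and checks that `np.expm1` recovers them. The max/min ratio is just one informal way to show how the transform compresses the range; it is not a measure used elsewhere in this analysis.

```python
import numpy as np

# Prices from the first rows shown later in this notebook
prices = np.array([1500.0, 4900.0, 6000.0, 11900.0, 21000.0])

log_prices = np.log1p(prices)      # log(1 + price), defined even at price == 0
recovered = np.expm1(log_prices)   # exp(x) - 1, the inverse of log1p

# Informal view of how much the transform compresses the range
raw_spread = prices.max() / prices.min()
log_spread = log_prices.max() / log_prices.min()

print("log1p prices:", log_prices.round(3))
print(f"max/min ratio: raw {raw_spread:.1f}, log {log_spread:.2f}")
print("round trip OK:", np.allclose(recovered, prices))
```

Because the transform is invertible, predictions made on the log scale can always be mapped back to dollars for reporting.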

In [3]:
import pandas as pd
file_path = r"C:\Users\user\Downloads\practical_application_II_starter\data\vehicles.csv"
df = pd.read_csv(file_path)
print(df.info())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [4]:
# Treat missing categories as their own "Unknown" level
categorical_columns = ['manufacturer', 'model', 'condition', 'cylinders', 'fuel', 
                       'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color']
df[categorical_columns] = df[categorical_columns].fillna("Unknown")
# Impute numeric gaps with the median (plain assignment avoids chained inplace fillna)
df['year'] = df['year'].fillna(df['year'].median())
df['odometer'] = df['odometer'].fillna(df['odometer'].median())
# Keep only listings with a plausible price
df = df[(df['price'] > 500) & (df['price'] < 100000)]
df.to_csv("vehicles_cleaned.csv", index=False)
print("\nCleaned dataset saved as 'vehicles_cleaned.csv'.")


Cleaned dataset saved as 'vehicles_cleaned.csv'.


In [8]:
import pandas as pd
import numpy as np

file_path = r"C:\Users\user\Downloads\practical_application_II_starter\cleaned\vehicles_cleaned.csv"  
df = pd.read_csv(file_path)

print(df.info())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 383068 entries, 0 to 383067
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            383068 non-null  int64  
 1   region        383068 non-null  object 
 2   price         383068 non-null  int64  
 3   year          383068 non-null  float64
 4   manufacturer  383068 non-null  object 
 5   model         383068 non-null  object 
 6   condition     383068 non-null  object 
 7   cylinders     383068 non-null  object 
 8   fuel          383068 non-null  object 
 9   odometer      383068 non-null  float64
 10  title_status  383068 non-null  object 
 11  transmission  383068 non-null  object 
 12  VIN           235330 non-null  object 
 13  drive         383068 non-null  object 
 14  size          383068 non-null  object 
 15  type          383068 non-null  object 
 16  paint_color   383068 non-null  object 
 17  state         383068 non-null  object 
dtypes: f

In [5]:
df['car_age'] = 2024 - df['year']

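# include_lowest=True keeps odometer == 0 in 'Low' instead of turning it into NaN
df['odometer_category'] = pd.cut(
    df['odometer'], 
    bins=[0, 50000, 100000, np.inf], 
    labels=['Low', 'Medium', 'High'],
    include_lowest=True
)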

df.drop(columns=['year'], inplace=True)
print(df[['car_age', 'odometer', 'odometer_category']].head())

   car_age  odometer odometer_category
0     11.0   85548.0            Medium
1     11.0   85548.0            Medium
2     11.0   85548.0            Medium
3     11.0   85548.0            Medium
4     11.0   85548.0            Medium


In [6]:
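from sklearn.preprocessing import LabelEncoder

one_hot_cols = ['fuel', 'transmission', 'drive', 'odometer_category']
label_cols = ['manufacturer', 'model', 'condition']

# One-hot encode the low-cardinality columns
df = pd.get_dummies(df, columns=one_hot_cols, drop_first=True)

# Integer-encode the remaining columns; the codes follow sorted label order
# and do not express any meaningful ranking
label_encoders = {}
for col in label_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])  
    label_encoders[col] = le  # kept so codes can be mapped back to labels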


print(df.head())

           id                  region  price  manufacturer  model  condition  \
0  7222695916                prescott   6000             0   6642          0   
1  7218891961            fayetteville  11900             0   6642          0   
2  7221797935            florida keys  21000             0   6642          0   
3  7222270760  worcester / central MA   1500             0   6642          0   
4  7210384030              greensboro   4900             0   6642          0   

  cylinders  odometer title_status  VIN  ... fuel_hybrid fuel_other  \
0   Unknown   85548.0      Unknown  NaN  ...           0          0   
1   Unknown   85548.0      Unknown  NaN  ...           0          0   
2   Unknown   85548.0      Unknown  NaN  ...           0          0   
3   Unknown   85548.0      Unknown  NaN  ...           0          0   
4   Unknown   85548.0      Unknown  NaN  ...           0          0   

  transmission_automatic transmission_manual  transmission_other  \
0                      0

In [7]:
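from sklearn.preprocessing import StandardScaler
import numpy as np

num_features = ['car_age', 'odometer', 'price']
# Log-transform the target; np.expm1 reverses this
df['price'] = np.log1p(df['price'])
# Note: the target is standardized together with the features, so every metric
# reported later is in standardized log-price units. The scaler is also fit on
# all rows before the train/test split, which leaks test-set statistics.
scaler = StandardScaler()
df[num_features] = scaler.fit_transform(df[num_features])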

print(df.head())

           id                  region     price  manufacturer  model  \
0  7222695916                prescott -0.947981             0   6642   
1  7218891961            fayetteville -0.174938             0   6642   
2  7221797935            florida keys  0.466292             0   6642   
3  7222270760  worcester / central MA -2.512585             0   6642   
4  7210384030              greensboro -1.176594             0   6642   

   condition cylinders  odometer title_status  VIN  ... fuel_hybrid  \
0          0   Unknown  -0.06814      Unknown  NaN  ...           0   
1          0   Unknown  -0.06814      Unknown  NaN  ...           0   
2          0   Unknown  -0.06814      Unknown  NaN  ...           0   
3          0   Unknown  -0.06814      Unknown  NaN  ...           0   
4          0   Unknown  -0.06814      Unknown  NaN  ...           0   

  fuel_other transmission_automatic transmission_manual  transmission_other  \
0          0                      0                   0      

In [8]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['price'])  
y = df['price']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}, {y_train.shape}")
print(f"Testing set: {X_test.shape}, {y_test.shape}")

Training set: (306454, 27), (306454,)
Testing set: (76614, 27), (76614,)




In [17]:
# Identify categorical columns in the dataset
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

# Define high-cardinality columns that should not be One-Hot Encoded
high_cardinality_cols = ['region', 'VIN', 'state']  # Too many unique values

# Keep only categorical columns that are not in high-cardinality list
categorical_cols = [col for col in categorical_cols if col not in high_cardinality_cols]

# Apply One-Hot Encoding to the selected categorical columns
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

print("One-Hot Encoding applied successfully to limited features!")
print("Updated feature set:", X.shape)


One-Hot Encoding applied successfully to limited features!
Updated feature set: (383068, 65)




In [17]:
from sklearn.preprocessing import LabelEncoder

label_cols = ['region', 'VIN', 'state']  # High-cardinality columns

for col in label_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])  # Convert to numeric values

print("Label Encoding applied to high-cardinality columns!")

Label Encoding applied to high-cardinality columns!


In [18]:
top_regions = X['region'].value_counts().index[:10]  # Keep top 10 regions
X['region'] = X['region'].apply(lambda x: x if x in top_regions else 'Other')

# Now apply One-Hot Encoding safely
X = pd.get_dummies(X, columns=['region'], drop_first=True)

print("One-Hot Encoding applied with category limits!")

One-Hot Encoding applied with category limits!


In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Initialize models
lr_model = LinearRegression()
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train models
lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
y_pred_xgb = xgb_model.predict(X_test)

print("Models trained and predictions made successfully!")

Models trained and predictions made successfully!


In [21]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def evaluate_model(y_true, y_pred, model_name):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    
    print(f"🔹 Model: {model_name}")
    print(f"   MAE: {mae:.4f}")
    print(f"   RMSE: {rmse:.4f}")
    print(f"   R² Score: {r2:.4f}")
    print("-" * 40)

# Evaluate each model
evaluate_model(y_test, y_pred_lr, "Linear Regression")
evaluate_model(y_test, y_pred_rf, "Random Forest Regressor")
evaluate_model(y_test, y_pred_xgb, "XGBoost Regressor")

🔹 Model: Linear Regression
   MAE: 0.4619
   RMSE: 0.6774
   R² Score: 0.5446
----------------------------------------
🔹 Model: Random Forest Regressor
   MAE: 0.1879
   RMSE: 0.3814
   R² Score: 0.8556
----------------------------------------
🔹 Model: XGBoost Regressor
   MAE: 0.3010
   RMSE: 0.4923
   R² Score: 0.7595
----------------------------------------


In [21]:
print("Columns with non-numeric data in X_train:")
print(X_train.select_dtypes(include=['object']).columns)

print("Columns with non-numeric data in X_test:")
print(X_test.select_dtypes(include=['object']).columns)

Columns with non-numeric data in X_train:
Index(['region', 'cylinders', 'title_status', 'VIN', 'size', 'type',
       'paint_color', 'state'],
      dtype='object')
Columns with non-numeric data in X_test:
Index(['region', 'cylinders', 'title_status', 'VIN', 'size', 'type',
       'paint_color', 'state'],
      dtype='object')


In [22]:
from sklearn.preprocessing import LabelEncoder

# Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

# Apply Label Encoding
for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])

# Ensure train-test split is updated with numerical values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("All categorical columns have been encoded!")
print("Updated feature set:", X_train.shape)

All categorical columns have been encoded!
Updated feature set: (306454, 65)


In [24]:
from sklearn.model_selection import GridSearchCV


param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}


rf_model = RandomForestRegressor(random_state=42)

grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid,
                           scoring='neg_mean_absolute_error', cv=3, verbose=2, n_jobs=-1)

grid_search.fit(X_train, y_train)
print("Best Parameters Found:", grid_search.best_params_)


best_rf_model = RandomForestRegressor(**grid_search.best_params_, random_state=42)
best_rf_model.fit(X_train, y_train)
y_pred_best_rf = best_rf_model.predict(X_test)
evaluate_model(y_test, y_pred_best_rf, "Tuned Random Forest Regressor")

Fitting 3 folds for each of 81 candidates, totalling 243 fits


KeyboardInterrupt: 

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
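
One way to cross-validate without leaking information between folds is to put the preprocessing and the model in a single scikit-learn `Pipeline`, so the scaler and encoder are refit on the training part of each fold. The sketch below is only an illustration under stated assumptions: the data is synthetic, the column names `car_age`, `odometer`, and `fuel` are borrowed from this notebook, and the coefficients used to generate the toy target are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned listings (assumption, not real data)
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "car_age": rng.integers(1, 25, size=n).astype(float),
    "odometer": rng.uniform(5_000, 250_000, size=n),
    "fuel": rng.choice(["gas", "diesel", "hybrid"], size=n),
})
# Invented log-price: falls with age and mileage, plus noise
log_price = (
    10.5
    - 0.05 * toy["car_age"]
    - 3e-6 * toy["odometer"]
    + rng.normal(0, 0.2, size=n)
)

# Scale numeric columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["car_age", "odometer"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

# Each fold refits the whole pipeline, so no fold sees held-out statistics
scores = cross_val_score(
    model, toy, log_price, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -scores.mean()
print(f"Cross-validated MAE (log-price units): {cv_mae:.4f}")
```

Leaving the target unscaled, as here, also keeps the error in plain log-price units, which are easier to translate back to dollars.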

In [27]:
from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)
X = df.drop(columns=['price'])  # Exclude price column (target)
y = df['price']  # Price is the target variable

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data split completed successfully!")
print(f"Training set: {X_train.shape}, {y_train.shape}")
print(f"Testing set: {X_test.shape}, {y_test.shape}")

Data split completed successfully!
Training set: (306454, 27), (306454,)
Testing set: (76614, 27), (76614,)


In [29]:
# Check for object (categorical) columns
non_numeric_cols = X_train.select_dtypes(include=['object']).columns.tolist()
print(" Non-numeric columns found:", non_numeric_cols)

 Non-numeric columns found: ['region', 'cylinders', 'title_status', 'VIN', 'size', 'type', 'paint_color', 'state']


In [30]:
# One-hot encode the remaining low-cardinality categorical columns
categorical_cols = ['cylinders', 'title_status', 'size', 'type', 'paint_color']
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

In [31]:
from sklearn.preprocessing import LabelEncoder

high_cardinality_cols = ['region', 'VIN', 'state']  # Columns with too many unique values
le = LabelEncoder()

for col in high_cardinality_cols:
    X[col] = le.fit_transform(X[col])

print(" Label Encoding applied to high-cardinality features!")


 Label Encoding applied to high-cardinality features!


In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(" Data ready for modeling!")

 Data ready for modeling!


In [33]:
print("Remaining non-numeric columns:", X_train.select_dtypes(include=['object']).columns.tolist())

Remaining non-numeric columns: []


In [34]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
import numpy as np

# Define models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost Regressor": XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
}

# Function to evaluate models using cross-validation
def cross_validate_model(model, X_train, y_train):
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error')
    return np.mean(-scores)  # Convert negative MAE to positive

# Train & Evaluate Models
results = {}
for name, model in models.items():
    mae = cross_validate_model(model, X_train, y_train)
    results[name] = mae
    print(f"🔹 {name} - Cross-Validated MAE: {mae:.4f}")

# Display results
print("\n Model Performance Summary:")
for model, score in sorted(results.items(), key=lambda x: x[1]):
    print(f" {model}: {score:.4f} MAE")

🔹 Linear Regression - Cross-Validated MAE: 0.4632
🔹 Random Forest Regressor - Cross-Validated MAE: 0.2034
🔹 XGBoost Regressor - Cross-Validated MAE: 0.3032

 Model Performance Summary:
 Random Forest Regressor: 0.2034 MAE
 XGBoost Regressor: 0.3032 MAE
 Linear Regression: 0.4632 MAE


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
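
To connect a fitted model back to the question of what drives price, one option is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below is a minimal illustration on synthetic data in which `car_age` is constructed to matter most and `noise` is unrelated to the target, so the ranking it prints says nothing about the real listings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): age has the largest effect by construction
rng = np.random.default_rng(0)
n = 400
X_toy = pd.DataFrame({
    "car_age": rng.uniform(0, 25, size=n),
    "odometer": rng.uniform(0, 250_000, size=n),
    "noise": rng.normal(size=n),  # unrelated to the target
})
y_toy = (
    10.5
    - 0.08 * X_toy["car_age"]
    - 2e-6 * X_toy["odometer"]
    + rng.normal(0, 0.1, size=n)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Mean drop in R^2 on held-out data when each column is shuffled
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
importances = pd.Series(
    result.importances_mean, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

Applied to the trained Random Forest from this notebook, the same call would rank the real features, with the caveat that identifier-like columns such as `id` and `VIN` can receive importance without being meaningful price drivers.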

In [39]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Initialize models
lr_model = LinearRegression()
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
print(" Models initialized successfully!")

 Models initialized successfully!


In [40]:
# Train models
lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)

print(" Models trained successfully!")

 Models trained successfully!


In [41]:
# Make predictions
y_pred_lr = lr_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
y_pred_xgb = xgb_model.predict(X_test)

print("Predictions generated successfully!")

Predictions generated successfully!


In [42]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Function to evaluate models
def evaluate_model(y_true, y_pred, model_name):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    
    print(f"🔹 Model: {model_name}")
    print(f"   MAE: {mae:.4f}")
    print(f"   RMSE: {rmse:.4f}")
    print(f"   R² Score: {r2:.4f}")
    print("-" * 40)

# Evaluate all models
evaluate_model(y_test, y_pred_lr, "Linear Regression")
evaluate_model(y_test, y_pred_rf, "Random Forest Regressor")
evaluate_model(y_test, y_pred_xgb, "XGBoost Regressor")

🔹 Model: Linear Regression
   MAE: 0.4622
   RMSE: 0.6780
   R² Score: 0.5438
----------------------------------------
🔹 Model: Random Forest Regressor
   MAE: 0.1917
   RMSE: 0.3822
   R² Score: 0.8550
----------------------------------------
🔹 Model: XGBoost Regressor
   MAE: 0.3032
   RMSE: 0.4956
   R² Score: 0.7562
----------------------------------------
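
The metrics above are in standardized log-price units, because `price` was passed through `np.log1p` and then `StandardScaler` during data preparation. The sketch below shows one rough way to translate such an MAE into a multiplicative error on price. The notebook never prints the fitted scaler's parameters, so the scale used here comes from five example prices and is purely an assumption; the resulting numbers are illustrative, not results for this dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Example prices only; the real scaler was fit on all cleaned listings
prices = np.array([1500.0, 4900.0, 6000.0, 11900.0, 21000.0])
log_prices = np.log1p(prices).reshape(-1, 1)
scaler = StandardScaler().fit(log_prices)

log_sd = scaler.scale_[0]   # one standardized unit, expressed in log1p units
mae_std = 0.1917            # Random Forest MAE reported above
mae_log = mae_std * log_sd  # the same error in log1p units

# An error of mae_log on the log scale corresponds to this factor on (price + 1)
error_factor = np.exp(mae_log)

print(f"MAE in log1p units: {mae_log:.3f}")
print(f"Rough multiplicative error on (price + 1): x{error_factor:.2f}")
```

With the real scaler, `scaler.scale_` at the position of `price` in `num_features` would play the role of `log_sd`.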


In [44]:
print("Is best_rf_model defined?", "best_rf_model" in globals())

Is best_rf_model defined? False


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize model
rf_model = RandomForestRegressor(random_state=42)

# Perform Grid Search
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, 
                           scoring='neg_mean_absolute_error', cv=3, verbose=2, n_jobs=-1)

grid_search.fit(X_train, y_train)

# Assign the best model
best_rf_model = grid_search.best_estimator_

print("Best Random Forest Model trained successfully!")

Fitting 3 folds for each of 81 candidates, totalling 243 fits
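
The exhaustive grid requires 243 fits, and the earlier attempt in this notebook ended in a `KeyboardInterrupt`. A cheaper alternative, shown here as a sketch rather than as what the notebook ran, is `RandomizedSearchCV`, which samples a fixed number of candidates from the same parameter lists. The data below is synthetic and `n_iter=4` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (assumption), small enough to run quickly
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 4))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(0, 0.1, size=500)

# Same candidate values as the grid above
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,  # 4 sampled candidates instead of all 81
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=42,
)
search.fit(X_toy, y_toy)

# best_estimator_ is already refit on all the data passed to fit
best_rf_model = search.best_estimator_
print("Best parameters:", search.best_params_)
```

On the full training set, tuning on a random subsample of rows first is another way to keep the run time manageable, at the cost of a noisier estimate of the best settings.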


In [None]:
import joblib

# Load a previously saved model; this assumes best_rf_model.pkl was written
# earlier with joblib.dump, which no cell in this notebook does
best_rf_model = joblib.load("best_rf_model.pkl")

print("Model loaded successfully!")
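
For the load above to work, the model must first be written to disk. The sketch below shows the full round trip with `joblib.dump` and `joblib.load` on a tiny synthetic model; the temporary directory and the toy data are assumptions made so the example is self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic model standing in for the tuned Random Forest
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy[:, 0] + rng.normal(0, 0.1, size=100)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_rf_model.pkl")
    joblib.dump(model, path)      # save the fitted model
    restored = joblib.load(path)  # load it back

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print("Round trip preserved predictions:", same_predictions)
```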

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.