# Optimizing Used Car Pricing: A Machine Learning Solution for Rusty Bargain

Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines this value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

# Introduction

In today’s competitive used car market, accurate pricing is essential for attracting potential buyers while ensuring fair value for sellers. Rusty Bargain, a used car sales service, is developing an app to predict the market value of vehicles based on historical data, enabling users to quickly assess the worth of their cars. This project involves building and evaluating machine learning models to predict car prices using a range of technical specifications and attributes. By comparing multiple models, including Linear Regression, Decision Tree, Random Forest, and LightGBM, this analysis aims to identify the best approach in terms of prediction accuracy, training speed, and computational efficiency. The insights gained will help Rusty Bargain deploy an efficient, data-driven pricing tool to improve user experience and streamline the buying and selling process.

## Data preparation

We start by importing the necessary libraries. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, MaxAbsScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import time
from lightgbm import LGBMRegressor



This code cell performs essential data exploration and cleaning steps on the used car dataset. First, it loads the data and displays its structure and data types to understand the dataset’s composition. Next, it checks for any missing values and duplicate rows, reporting the counts for each. If duplicate rows are present, they are removed. The cell then provides a statistical summary of numerical columns to help identify data distributions and potential outliers, followed by a sample of rows to gain a quick overview of the dataset's content. This preliminary exploration helps ensure data quality before further analysis.

In [2]:
df = pd.read_csv('/datasets/car_data.csv')


print("Initial Data Info:")
print(df.info())


print("\nMissing Values:")
print(df.isnull().sum())


print("\nDuplicate Rows:", df.duplicated().sum())


df = df.drop_duplicates()


print("\nSummary Statistics:")
print(df.describe())


print("\nSample Rows:")
print(df.head())


Initial Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes


This code cell continues the data preparation process by handling missing values. It first calls drop_duplicates again (duplicates were already removed in the previous cell, so this is a safeguard) and prints the dataset shape. For the five categorical columns with missing values (VehicleType, Gearbox, Model, FuelType, and NotRepaired), missing entries are replaced with "Unknown" to retain all records without imputing potentially misleading values. For the numerical Power column, zeros (which are not plausible engine powers) are first converted to missing values, and all missing values are then replaced with the median of the remaining entries to provide a realistic default. Finally, the cell verifies that no missing values remain, preparing the data for further analysis.


In [3]:
df = df.drop_duplicates()
print("Duplicates removed. New shape:", df.shape)


categorical_cols = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'NotRepaired']
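for col in categorical_cols:
    # Assign the result back; inplace=True on a column selection is deprecated in recent pandas
    df[col] = df[col].fillna('Unknown')


# Treat a power of 0 as missing, then fill with the median of the valid values
df['Power'] = df['Power'].replace(0, np.nan)
df['Power'] = df['Power'].fillna(df['Power'].median())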

print("\nMissing Values After Handling:")
print(df.isnull().sum())


Duplicates removed. New shape: (354107, 16)

Missing Values After Handling:
DateCrawled          0
Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
DateCreated          0
NumberOfPictures     0
PostalCode           0
LastSeen             0
dtype: int64
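
The order of operations for Power matters: the zeros are converted to NaN before the median is taken, so the placeholders do not pull the median down. A minimal sketch on a toy column (the values below are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 0 stands in for an unrecorded engine power.
toy = pd.DataFrame({"Power": [0, 75, 0, 105, 150]})

# Median with the zeros left in place: the placeholders drag it down.
median_with_zeros = toy["Power"].median()

# Convert zeros to NaN first; Series.median() skips NaN by default.
power = toy["Power"].replace(0, np.nan)
median_valid = power.median()

# Fill the gaps with the median of the valid entries and assign back.
toy["Power"] = power.fillna(median_valid)

print(median_with_zeros, median_valid)
print(toy["Power"].tolist())
```

In this toy column the median is 75 with the zeros included and 105 once they are treated as missing, which is the value the two placeholder rows receive.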


## Model training


This code cell prepares the data and trains a Linear Regression model to predict car prices. First, it separates the target (Price) and drops columns that are not used as features: the three date columns, PostalCode, and Model. The remaining categorical features are converted to numeric format with one-hot encoding. The data is then split into training and testing sets (70% and 30%), and the model is trained on the training data. Finally, predictions are made on the test set, and the model's accuracy is assessed using RMSE, yielding a result of approximately 3349.52. RMSE is expressed in the same units as Price, and lower values represent better performance.

In [4]:
X = df.drop(['Price', 'DateCrawled', 'DateCreated', 'LastSeen', 'PostalCode', 'Model'], axis=1)
y = df['Price']

X = pd.get_dummies(X, drop_first=True)  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)


y_pred = lin_reg.predict(X_test)


rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Linear Regression RMSE: {rmse:.2f}')


Linear Regression RMSE: 3349.52
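
To make the encoding step concrete, here is a small sketch of what pd.get_dummies(..., drop_first=True) does to a frame with one categorical and one numeric column. The toy rows are hypothetical; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame (made-up rows) with one categorical and one numeric feature.
toy = pd.DataFrame({
    "Gearbox": ["manual", "auto", "Unknown", "manual"],
    "Mileage": [150000, 90000, 125000, 30000],
})

# Each category becomes its own indicator column; drop_first=True removes
# the first category (in sorted order) so it acts as the baseline.
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The numeric column passes through unchanged, and the dropped category (here `Unknown`) is represented by all remaining indicators being false. Dropping one level avoids perfectly collinear columns, which mainly matters for the linear model.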


This code cell trains a Decision Tree Regression model to predict car prices. After fitting the model on the training data, predictions are made on the test set, and the model's accuracy is evaluated using the RMSE metric. The resulting RMSE for the Decision Tree model is 2272.99. Because RMSE is the square root of the mean squared error, it measures the typical size of a prediction error in price units while weighting large misses more heavily than a plain average of absolute errors would. This value is about 32% lower than the Linear Regression model's RMSE, suggesting that the Decision Tree fits this dataset better.

In [5]:
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)


y_pred_tree = tree_reg.predict(X_test)


rmse_tree = np.sqrt(mean_squared_error(y_test, y_pred_tree))
print(f'Decision Tree RMSE: {rmse_tree:.2f}')


Decision Tree RMSE: 2272.99
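
The RMSE values in this notebook come from np.sqrt(mean_squared_error(...)). The sketch below computes the same quantity by hand, next to the mean absolute error, to show how a single large miss affects each metric. The prices are hypothetical and the helper names rmse and mae are introduced here only for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error: mean(|y - y_hat|)."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)

# Hypothetical prices: three predictions are off by 100, one by 1000.
y_true = [1000, 2000, 3000, 4000]
y_pred = [1100, 1900, 3100, 3000]

print(f"RMSE: {rmse(y_true, y_pred):.2f}")
print(f"MAE:  {mae(y_true, y_pred):.2f}")
```

With these numbers the single 1000-unit miss pushes RMSE well above MAE, which is why RMSE should be read as an error measure that penalizes outliers rather than as a simple average error.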



In this code cell, a Random Forest Regression model is trained to predict car prices, using 100 trees in the ensemble. After training on the X_train and y_train data, the model makes predictions on the test set (X_test). Evaluated with the RMSE metric, the model yields an error of 1794.22. This is about 21% lower than the Decision Tree model's RMSE and well below that of Linear Regression, suggesting that the Random Forest model provides more accurate predictions for this dataset.

In [6]:
forest_reg = RandomForestRegressor(random_state=42, n_estimators=100)
forest_reg.fit(X_train, y_train)


y_pred_forest = forest_reg.predict(X_test)

rmse_forest = np.sqrt(mean_squared_error(y_test, y_pred_forest))
print(f'Random Forest RMSE: {rmse_forest:.2f}')


Random Forest RMSE: 1794.22



In this code cell, a LightGBM Regression model is applied to predict car prices with 100 boosting iterations. The model is trained on X_train and y_train and then used to predict prices for X_test. The model's performance is evaluated using RMSE, resulting in a score of 1882.59. This RMSE is lower than the Decision Tree model's but about 5% higher than the Random Forest model's, indicating that LightGBM with these settings is effective but marginally less accurate than the Random Forest on this dataset.

In [7]:

lgbm_reg = LGBMRegressor(random_state=42, n_estimators=100)
lgbm_reg.fit(X_train, y_train)


y_pred_lgbm = lgbm_reg.predict(X_test)


rmse_lgbm = np.sqrt(mean_squared_error(y_test, y_pred_lgbm))
print(f'LightGBM RMSE: {rmse_lgbm:.2f}')


LightGBM RMSE: 1882.59


## Model analysis

In this code cell, a helper function evaluate_model is used to measure and compare the performance of four models: Linear Regression, Decision Tree, Random Forest, and LightGBM. For each model, the function refits it on the training data and records the training time, then times prediction on X_test and calculates the RMSE. The results show that the Random Forest model achieves the lowest RMSE (1794.22) but is by far the slowest, at roughly 110 s to train and 3.9 s to predict. The Linear Regression model trains fastest (about 1.3 s) but has the highest RMSE (3349.52). LightGBM offers a balance: its RMSE (1882.59) is within about 5% of the Random Forest's, while it trains in about 2.7 s and predicts in about 0.6 s.

In [8]:

def evaluate_model(model, X_train, y_train, X_test, y_test):
    
    start_time = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start_time


    start_time = time.time()
    y_pred = model.predict(X_test)
    predict_time = time.time() - start_time


    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    return rmse, train_time, predict_time


models = {
    'Linear Regression': lin_reg,
    'Decision Tree': tree_reg,
    'Random Forest': forest_reg,
    'LightGBM': lgbm_reg
}

results = []

for model_name, model in models.items():
    rmse, train_time, predict_time = evaluate_model(model, X_train, y_train, X_test, y_test)
    results.append({
        'Model': model_name,
        'RMSE': rmse,
        'Training Time (s)': train_time,
        'Prediction Time (s)': predict_time
    })



results_df = pd.DataFrame(results)
display(results_df)


| Model | RMSE | Training Time (s) | Prediction Time (s) |
|---|---|---|---|
| Linear Regression | 3349.524577 | 1.313624 | 0.088403 |
| Decision Tree | 2272.994898 | 1.826055 | 0.062442 |
| Random Forest | 1794.220075 | 109.573451 | 3.926888 |
| LightGBM | 1882.585436 | 2.723067 | 0.601568 |
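
The timings above come from time.time() differences around fit and predict. As a possible refinement, time.perf_counter() is a monotonic clock intended for measuring short durations. The sketch below shows the same measurement pattern with it. MeanModel and timed are hypothetical stand-ins written for this example, not part of the notebook or of scikit-learn.

```python
import time

class MeanModel:
    """Hypothetical stand-in for an estimator with fit/predict methods."""

    def fit(self, X, y):
        # "Training" here is just storing the mean of the targets.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # Predict the stored mean for every row.
        return [self.mean_ for _ in X]

def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Made-up data, only to exercise the timing helper.
X = [[1], [2], [3]]
y = [1000, 2000, 3000]

model = MeanModel()
_, train_time = timed(model.fit, X, y)
preds, predict_time = timed(model.predict, X)

print(f"train: {train_time:.6f} s, predict: {predict_time:.6f} s")
```

Single-run timings vary from run to run, so repeating the measurement a few times and reporting a typical value gives a more stable comparison.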


This project successfully developed and evaluated machine learning models for predicting car prices to support Rusty Bargain's new customer app. Using a dataset of historical vehicle data, the analysis involved thorough data preparation to address missing values and duplicates, followed by the training and evaluation of several models, including Linear Regression, Decision Tree, Random Forest, and LightGBM. Each model was assessed on prediction accuracy (using RMSE), training time, and prediction speed, aligning with Rusty Bargain’s priorities.

The results highlighted the Random Forest model as the most accurate (RMSE 1794.22), though it was also the slowest, taking about 110 s to train and 3.9 s to predict on the test set. LightGBM came close in accuracy (RMSE 1882.59) while training in under 3 s and predicting in about 0.6 s, a favorable balance between speed and accuracy that makes it a viable choice for Rusty Bargain’s app. These insights provide Rusty Bargain with a solid basis for implementing an effective, user-friendly price evaluation feature in their app, helping attract and engage users with accurate, data-driven pricing.
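
One way to turn this trade-off into an explicit rule is to set time budgets and pick the most accurate model that fits within them. The sketch below uses the rounded figures from the results table. The function pick_model and the budget values are assumptions made for illustration; Rusty Bargain's actual limits are not stated in the task.

```python
# Rounded figures from the results table in this notebook.
results = [
    {"model": "Linear Regression", "rmse": 3349.52, "train_s": 1.31, "predict_s": 0.09},
    {"model": "Decision Tree", "rmse": 2272.99, "train_s": 1.83, "predict_s": 0.06},
    {"model": "Random Forest", "rmse": 1794.22, "train_s": 109.57, "predict_s": 3.93},
    {"model": "LightGBM", "rmse": 1882.59, "train_s": 2.72, "predict_s": 0.60},
]

def pick_model(results, max_train_s, max_predict_s):
    """Return the lowest-RMSE model within both time budgets, or None."""
    eligible = [
        r for r in results
        if r["train_s"] <= max_train_s and r["predict_s"] <= max_predict_s
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r["rmse"])["model"]

# Hypothetical budgets: a tight one and a generous one.
print(pick_model(results, max_train_s=10, max_predict_s=1))
print(pick_model(results, max_train_s=600, max_predict_s=5))
```

Under the tight budget the rule selects LightGBM, and under the generous one it selects Random Forest, which mirrors the conclusion above.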






