<a href="https://colab.research.google.com/github/nuryaningsih/CodeCraftedAtTripleTen/blob/main/12_Predicting_Car_Market_Value_Quality%2C_Speed%2C_and_Efficiency_in_Rusty_Bargain's_App_Development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
In the rapidly evolving automotive industry, the demand for a reliable and efficient tool to estimate the market value of used cars has become paramount. Rusty Bargain, a company specializing in the sale of pre-owned vehicles, seeks to address this need through the development of a cutting-edge application. This application aims to empower users by providing immediate insights into the market value of their vehicles, leveraging historical data, technical specifications, model versions, and current market prices. To achieve this, Rusty Bargain intends to explore various machine learning models to identify the one that best balances prediction quality, speed, and training time.

## Objective
The primary goal of this project is to develop a predictive model that accurately determines the market value of used cars. This entails:
1. Evaluating Prediction Quality: The model must generate precise and reliable market value estimates for a wide range of vehicles based on their historical data, technical specifications, and model versions.
2. Ensuring Fast Prediction Speed: Given the user-oriented nature of the application, the model must deliver quick predictions to enhance user experience.
3. Optimizing Training Time: The model should be relatively fast to train, allowing for periodic retraining as new data becomes available or market conditions change.

## Stages
1. Data Preparation and Exploration
  * Collect and Observe Data: Gather comprehensive datasets encompassing historical sales, vehicle specifications, model versions, and current market prices.
  * Data Cleaning: Identify and rectify missing values, outliers, and inconsistencies in the dataset.
  * Feature Engineering: Develop relevant features that could influence a vehicle's market value, such as age, mileage, make, model, and condition.
2. Model Development and Training
  * Initial Model Selection: Choose a diverse set of machine learning models for initial experimentation, including Decision Trees, Random Forest, Gradient Boosting variants (e.g., XGBoost, LightGBM, CatBoost), and Linear Regression.
  * Hyperparameter Tuning: Employ techniques like grid search and random search to fine-tune the models' hyperparameters for optimal performance.
  * Cross-Validation: Utilize k-fold cross-validation to assess the models' performance and ensure they are not overfitting.
3. Performance Evaluation
  * Quality of Prediction: Evaluate the models using metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² (Coefficient of Determination) to determine their accuracy and reliability in predicting vehicle market values.
  * Prediction Speed: Measure the time taken by each model to make predictions, aiming for the fastest possible without compromising quality.
  * Training Time: Record and compare the time required to train each model, seeking a balance between computational efficiency and predictive performance.
4. Conclusion and Model Selection
  * Based on the analysis of prediction quality, speed, and training time, select the best-performing model(s) for deployment in Rusty Bargain's application.
  * Discuss the trade-offs between different models and justify the selection of the chosen model.
5. Future Directions
  * Outline potential improvements for the model, such as incorporating more diverse data sources, implementing more advanced feature engineering techniques, or exploring newer machine learning algorithms.
  * Consider the infrastructure and computational resources needed for deploying the model in a production environment, ensuring it remains scalable and maintainable.
  
By meticulously following these stages, Rusty Bargain aims to develop a robust application that enhances the car selling and buying process, making it more transparent, efficient, and user-friendly.
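The three quality metrics named in the Performance Evaluation stage can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper `regression_metrics` and the sample prices are hypothetical and only show how MAE, RMSE, and R² relate to the same set of errors.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))             # penalizes large errors more
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Made-up prices, for illustration only
actual = [1000, 2000, 3000, 4000]
predicted = [1100, 1900, 3200, 3800]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are both expressed in the units of the target, which makes them easy to compare against typical car prices, while R² is unitless.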

# 1. Data Preparation and Exploration

Load the libraries that we think are needed for this project. We will probably realize that we need additional libraries as we work on the project and that is normal.

In [10]:
!pip install catboost




In [11]:
# Import library to process data
import numpy as np
import pandas as pd
import math

# Import Library for viz data
import seaborn as sns
import matplotlib.pyplot as plt

# Import Library for Machine Learning

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
import catboost as cb
from xgboost import XGBRegressor

from sklearn.metrics import mean_squared_error

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Collect and Observe Data

Load data and perform checks to ensure data is free from problems.

In [13]:
# Load the data file into a DataFrame
df = pd.read_csv('/content/drive/MyDrive/DATASET PROJECT/12. Predicting Car Market Value/car_data.csv')

In [14]:
# Let's see how many rows and columns our dataset has
df.shape

(354369, 16)

In [15]:
# Display general information/summary about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [16]:
# Displays sample data
df.head()

|   | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/03/2016 11:52 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 24/03/2016 00:00 | 0 | 70435 | 07/04/2016 03:16 |
| 1 | 24/03/2016 10:58 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 24/03/2016 00:00 | 0 | 66954 | 07/04/2016 01:46 |
| 2 | 14/03/2016 12:52 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 14/03/2016 00:00 | 0 | 90480 | 05/04/2016 12:47 |
| 3 | 17/03/2016 16:54 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 17/03/2016 00:00 | 0 | 91074 | 17/03/2016 17:40 |
| 4 | 31/03/2016 17:25 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 31/03/2016 00:00 | 0 | 60437 | 06/04/2016 10:17 |


In [17]:
# Describe from general information
df.describe()

|       | Price | RegistrationYear | Power | Mileage | RegistrationMonth | NumberOfPictures | PostalCode |
|---|---|---|---|---|---|---|---|
| count | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 |
| mean | 4416.656776 | 2004.234448 | 110.094337 | 128211.172535 | 5.714645 | 0.0 | 50508.689087 |
| std | 4514.158514 | 90.227958 | 189.850405 | 37905.34153 | 3.726421 | 0.0 | 25783.096248 |
| min | 0.0 | 1000.0 | 0.0 | 5000.0 | 0.0 | 0.0 | 1067.0 |
| 25% | 1050.0 | 1999.0 | 69.0 | 125000.0 | 3.0 | 0.0 | 30165.0 |
| 50% | 2700.0 | 2003.0 | 105.0 | 150000.0 | 6.0 | 0.0 | 49413.0 |
| 75% | 6400.0 | 2008.0 | 143.0 | 150000.0 | 9.0 | 0.0 | 71083.0 |
| max | 20000.0 | 9999.0 | 20000.0 | 150000.0 | 12.0 | 0.0 | 99998.0 |


In [18]:
# View data types
df.dtypes

DateCrawled          object
Price                 int64
VehicleType          object
RegistrationYear      int64
Gearbox              object
Power                 int64
Model                object
Mileage               int64
RegistrationMonth     int64
FuelType             object
Brand                object
NotRepaired          object
DateCreated          object
NumberOfPictures      int64
PostalCode            int64
LastSeen             object
dtype: object

In [19]:
# Check for missing values
df.isnull().sum().sort_values(ascending=False) / df.shape[0] *100

NotRepaired          20.079070
VehicleType          10.579368
FuelType              9.282697
Gearbox               5.596709
Model                 5.560588
DateCrawled           0.000000
Price                 0.000000
RegistrationYear      0.000000
Power                 0.000000
Mileage               0.000000
RegistrationMonth     0.000000
Brand                 0.000000
DateCreated           0.000000
NumberOfPictures      0.000000
PostalCode            0.000000
LastSeen              0.000000
dtype: float64

In [20]:
# Checking for duplication
df.duplicated().sum()

262

After loading and inspecting the dataset, several observations can be made:

1. Missing Values:
  * There are missing values present in several columns, notably in "NotRepaired," "VehicleType," "FuelType," "Gearbox," and "Model." These missing values range from approximately 5.56% to 20.08% of the total entries.
  * The presence of missing values can be attributed to various factors such as oversight during data collection, incomplete records, or user omission during data entry.
2. Duplicate Entries:
  * There are 262 duplicated entries within the dataset, suggesting potential data replication during the data collection process or data entry errors.
3. Anomalies and Outliers:
  * The summary statistics point to several implausible values: "RegistrationYear" ranges from 1000 to 9999, "Power" has a minimum of 0 and a maximum of 20000, "Price" has a minimum of 0, and "RegistrationMonth" includes 0. "NumberOfPictures" is 0 for every row. These columns warrant further investigation during cleaning.

To address these observations, several steps can be taken:

1. Handling Missing Values:
  * Missing values can be addressed through imputation (replacing them with a statistic such as the mean or median for numeric columns, or the mode or a placeholder category for categorical columns), deletion of rows or columns with a significant share of missing values, or model-based imputation. All five affected columns here are categorical, so the choice mainly depends on whether the fact that a value is missing is itself informative.
2. Dealing with Duplicate Entries:
  * Duplicate entries should be carefully examined to determine whether they represent genuine duplicates or data entry errors. If they are genuine duplicates, one approach is to retain only the first occurrence of each entry. Alternatively, if duplicate entries indicate data replication issues, further investigation into the data collection process may be necessary to rectify the underlying cause.
3. Identifying and Handling Anomalies:
  * Anomalies or outliers, if present, should be identified and evaluated to determine their impact on the dataset and subsequent analysis. Depending on the context, anomalies may be genuine data points that reflect unusual but valid observations, or they may be errors that need to be corrected or removed. Techniques such as visualization, statistical methods (e.g., z-score or interquartile range), or domain knowledge can be employed to detect and address anomalies appropriately.

The decision-making process for addressing these data quality issues is informed by several factors, including the specific requirements of the predictive modeling task, the extent of missing or duplicate data, and the potential impact of anomalies on model performance. Additionally, considerations such as data integrity, representativeness, and the overall quality of the dataset guide the selection of appropriate data preprocessing strategies.

In summary, thorough data preprocessing, including handling missing values, addressing duplicates, and identifying anomalies, is essential to ensure the integrity and reliability of the dataset for subsequent modeling and analysis tasks. By systematically addressing these data quality issues, we can enhance the robustness and effectiveness of the predictive modeling process for estimating car market values.
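The interquartile-range rule mentioned above can be sketched as follows. This is an illustrative example under stated assumptions: the helper `iqr_bounds`, the multiplier `k=1.5`, and the toy values are hypothetical, and the notebook itself later uses fixed thresholds rather than this rule.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (low, high) fences at k * IQR below Q1 and above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values loosely modelled on the registration-year column (made up)
years = pd.Series([1999, 2001, 2003, 2005, 2008, 1000, 9999])
low, high = iqr_bounds(years)
outliers = years[(years < low) | (years > high)]
print("fences:", low, high)
print("flagged:", outliers.tolist())
```

A rule like this only flags candidates; whether a flagged value is an error or a rare but valid observation still calls for domain knowledge.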

## Data Cleaning

### Column naming style

Show column headings:

In [21]:
# list containing the column names in the df table
df.columns

Index(['DateCrawled', 'Price', 'VehicleType', 'RegistrationYear', 'Gearbox',
       'Power', 'Model', 'Mileage', 'RegistrationMonth', 'FuelType', 'Brand',
       'NotRepaired', 'DateCreated', 'NumberOfPictures', 'PostalCode',
       'LastSeen'],
      dtype='object')

In [22]:
df = df.rename(columns={'DateCrawled': 'date_crawled',
                        'Price': 'price',
                        'VehicleType': 'vehicle_type',
                        'RegistrationYear': 'registration_year',
                        'Gearbox': 'gearbox',
                        'Power' : 'power',
                        'Model' : 'model',
                        'Mileage' : 'mileage',
                        'RegistrationMonth' : 'registration_month',
                        'FuelType' : 'fuel_type',
                        'Brand' : 'brand',
                        'NotRepaired' : 'not_repaired',
                        'DateCreated' : 'date_created',
                        'NumberOfPictures' : 'number_of_pictures',
                        'PostalCode' : 'postal_code',
                        'LastSeen' : 'last_seen'
                       })

In [23]:
# check your results: display once again the list containing the column names
df.columns

Index(['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
       'power', 'model', 'mileage', 'registration_month', 'fuel_type', 'brand',
       'not_repaired', 'date_created', 'number_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

### Resolve duplicates

In [24]:
df.duplicated().sum()

262

In [25]:
# Cleans duplicate data
df = df.drop_duplicates()

In [26]:
df.duplicated().sum()

0

### Eliminate unused columns

In [27]:
df = df.drop(['date_crawled', 'date_created', 'postal_code', 'last_seen'], axis=1)

In [28]:
df.head()

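|   | price | vehicle_type | registration_year | gearbox | power | model | mileage | registration_month | fuel_type | brand | not_repaired | number_of_pictures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 0 |
| 1 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 0 |
| 2 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 0 |
| 3 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 0 |
| 4 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 0 |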


### Resolve constant columns

In [29]:
df['number_of_pictures'].value_counts()

number_of_pictures
0    354107
Name: count, dtype: int64

In [30]:
df = df.drop(['number_of_pictures'], axis=1)

### Deal with strange data

In [31]:
df['registration_year'].describe()

count    354107.000000
mean       2004.235355
std          90.261168
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        9999.000000
Name: registration_year, dtype: float64

In [32]:
df = df[df['registration_year'] < 2024]
df = df[df['registration_year'] > 2000]
df = df[df['power'] != 0].reset_index(drop=True)

### Resolve missing values

In [33]:
# Check for missing values
df.isnull().sum().sort_values(ascending=False) / df.shape[0] *100

not_repaired          13.166096
vehicle_type           9.393570
fuel_type              6.312061
model                  3.435244
gearbox                1.885779
price                  0.000000
registration_year      0.000000
power                  0.000000
mileage                0.000000
registration_month     0.000000
brand                  0.000000
dtype: float64

In [34]:
df = df.fillna('unknown')

In [35]:
# Check for missing values
df.isnull().sum().sort_values(ascending=False) / df.shape[0] *100

price                 0.0
vehicle_type          0.0
registration_year     0.0
gearbox               0.0
power                 0.0
model                 0.0
mileage               0.0
registration_month    0.0
fuel_type             0.0
brand                 0.0
not_repaired          0.0
dtype: float64

**Data Cleaning**

1. Renaming Columns

The column names in the dataset were standardized to adhere to a consistent naming convention for better readability and consistency in analysis.

2. Resolving Duplicates

Initial assessment revealed 262 duplicated rows in the dataset. These duplicates were subsequently removed, ensuring data integrity.

3. Eliminating Unused Columns

Columns such as 'date_crawled', 'date_created', 'postal_code', and 'last_seen' were deemed unnecessary for the analysis and thus were dropped from the dataset.

4. Resolving Constant Columns

The column 'number_of_pictures' contained only one unique value (0), indicating that it did not provide any useful information for analysis and was therefore removed.

5. Dealing with Strange Data

Analysis of the 'registration_year' column revealed implausible values ranging from 1000 to 9999. Only rows with a registration year after 2000 and before 2024 were kept, which also discards every car registered in 2000 or earlier. Rows with a 'power' value of 0 were removed as well.

6. Resolving Missing Values

Five categorical columns contained missing values after filtering: 'not_repaired' (about 13.2%), 'vehicle_type' (9.4%), 'fuel_type' (6.3%), 'model' (3.4%), and 'gearbox' (1.9%). These missing values were replaced with the label 'unknown' so that no further rows had to be dropped.

**Conclusion**

The data cleaning process involved several steps to ensure the dataset's integrity and quality for subsequent analysis. By standardizing column names, resolving duplicates, eliminating unused columns, addressing strange data, and resolving missing values, the dataset has been prepared for further exploration and modeling.

Insights gained from the cleaning process include:
  * Identification and removal of duplicate entries to prevent bias in analysis.
  * Elimination of unnecessary columns to streamline data processing.
  * Restriction of registration years to 2001 through 2023 and removal of zero-power rows to ensure data consistency.
  * Imputation of missing values with 'unknown' to maintain dataset completeness.
  
Overall, the cleaned dataset is now ready for further analysis and modeling to develop a robust predictive model for estimating the market value of used cars.
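Two of the cleaning steps above, detecting constant columns and filling missing categorical values, can be written as small reusable helpers. The sketch below is a hypothetical variant, not the notebook's code: `constant_columns` and `fill_categorical` are illustrative names, and unlike `df.fillna('unknown')` the second helper deliberately touches only non-numeric columns, so a numeric column with gaps would not silently receive a string.

```python
import numpy as np
import pandas as pd

def constant_columns(frame):
    """Names of columns that hold a single distinct value (NaN counted)."""
    return [col for col in frame.columns
            if frame[col].nunique(dropna=False) <= 1]

def fill_categorical(frame, label="unknown"):
    """Fill missing values in non-numeric columns only."""
    result = frame.copy()
    cat_cols = result.select_dtypes(exclude="number").columns
    result[cat_cols] = result[cat_cols].fillna(label)
    return result

# Toy frame with made-up rows, for illustration only
toy = pd.DataFrame({
    "price": [480, 18300, 9800],
    "gearbox": ["manual", None, "auto"],
    "power": [75.0, np.nan, 163.0],
    "number_of_pictures": [0, 0, 0],
})
print(constant_columns(toy))
cleaned = fill_categorical(toy)
print(cleaned)
```

In this dataset the numeric columns have no missing values, so both approaches give the same result; the distinction only matters if new data arrives with numeric gaps.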

## Feature Engineering

In [36]:
categorical_features = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand', 'not_repaired']

In [37]:
for feature in categorical_features:
    print(f'features {feature} {len(df[feature].value_counts())}')

features vehicle_type 9
features gearbox 3
features model 246
features fuel_type 8
features brand 40
features not_repaired 3


In [38]:
df_ohe = df
df_ohe = pd.get_dummies(df_ohe)

In [39]:
df_ohe.shape

(206652, 314)

In [40]:
df_ohe.head()

|   | price | registration_year | power | mileage | registration_month | vehicle_type_bus | vehicle_type_convertible | vehicle_type_coupe | vehicle_type_other | vehicle_type_sedan | ... | brand_sonstige_autos | brand_subaru | brand_suzuki | brand_toyota | brand_trabant | brand_volkswagen | brand_volvo | not_repaired_no | not_repaired_unknown | not_repaired_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18300 | 2011 | 190 | 125000 | 5 | False | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 1 | 9800 | 2004 | 163 | 125000 | 8 | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 2 | 1500 | 2001 | 75 | 150000 | 6 | False | False | False | False | False | ... | False | False | False | False | False | True | False | True | False | False |
| 3 | 3600 | 2008 | 69 | 90000 | 7 | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 4 | 2200 | 2004 | 109 | 150000 | 8 | False | True | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
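The encoding step expands the six categorical columns into 309 indicator columns (9 + 3 + 246 + 8 + 40 + 3), which together with the five numeric columns gives the 314 columns shown above. For linear models, one indicator per feature is redundant because it can be derived from the others. The sketch below shows the `drop_first=True` option of `pd.get_dummies` on a made-up frame; it is an optional variant and is not what the notebook uses.

```python
import pandas as pd

# Toy frame with made-up rows, for illustration only
toy = pd.DataFrame({
    "price": [1500, 3600, 9800, 2200],
    "gearbox": ["manual", "manual", "auto", "unknown"],
    "not_repaired": ["no", "no", "unknown", "yes"],
})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each feature dropped

print(full.shape, reduced.shape)
print(list(reduced.columns))
```

Tree-based models are not affected by the redundancy, so dropping a level mainly matters for the linear regression baseline.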


In [41]:
# Split the data into train (~64%), validation (~21%), and test (15%) sets

df_ohe_train_valid, df_ohe_test = train_test_split(df_ohe, test_size=0.15, random_state=12345)
df_ohe_train, df_ohe_valid = train_test_split(df_ohe_train_valid, test_size=0.25, random_state=12345)

print(df_ohe_train.shape)
print(df_ohe_valid.shape)
print(df_ohe_test.shape)

(131740, 314)
(43914, 314)
(30998, 314)
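The two-stage split first holds out 15% for testing and then takes 25% of the remaining 85% for validation, which works out to roughly 64% / 21% / 15%. The sketch below reproduces that arithmetic on a stand-in list of 1,000 row indices; the list and the `shares` dictionary are illustrative only.

```python
from sklearn.model_selection import train_test_split

rows = list(range(1000))  # stand-in for the DataFrame rows

# Stage 1: hold out 15% as the test set
train_valid, test = train_test_split(rows, test_size=0.15, random_state=12345)
# Stage 2: take 25% of the remainder as the validation set
train, valid = train_test_split(train_valid, test_size=0.25, random_state=12345)

shares = {name: len(part) / len(rows)
          for name, part in [("train", train), ("valid", valid), ("test", test)]}
print(shares)
```

Applied to the 206,652 rows of the encoded dataset, the same proportions give the 131,740 / 43,914 / 30,998 split printed above.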


# 2. Model Development and Training

In [42]:
def rmse(target, prediction):
    return mean_squared_error(target, prediction)** 0.5

In [43]:
features_train = df_ohe_train.drop(['price'], axis=1)
target_train = df_ohe_train['price']

features_valid = df_ohe_valid.drop(['price'], axis=1)
target_valid = df_ohe_valid['price']

features_test = df_ohe_test.drop(['price'], axis=1)
target_test = df_ohe_test['price']

## Linear Regression

In [44]:
%%time

model = LinearRegression()
model.fit(features_train, target_train)

CPU times: user 9.78 s, sys: 2.76 s, total: 12.5 s
Wall time: 15.1 s


In [45]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("Train RMSE:", rmse(target_train, pred_train).round(5))
print("Valid RMSE:", rmse(target_valid, pred_valid).round(5))
print("Test RMSE:", rmse(target_test, pred_test).round(5))

Train RMSE: 2621.65006
Valid RMSE: 2612.76508
Test RMSE: 2626.50956
CPU times: user 712 ms, sys: 726 ms, total: 1.44 s
Wall time: 1.91 s


**Linear Regression**

Training Time

The training of the linear regression model took approximately 12.5 seconds of CPU time (15.1 seconds wall time), reflecting moderate computational requirements.

Prediction Speed

The prediction phase for the linear regression model was relatively fast, with predictions for all datasets (train, validation, and test) completed in approximately 1.44 seconds of CPU time (1.91 seconds wall time).

Prediction Quality

The performance metrics for the model on different datasets are as follows:
  * Train RMSE: 2621.65006
  * Valid RMSE: 2612.76508
  * Test RMSE: 2626.50956

The Root Mean Squared Error (RMSE) is the square root of the mean squared difference between predicted and actual prices. It is expressed in the same units as price and penalizes large errors more heavily than small ones. The three values are close to each other, so the model is not overfitting, but an error of about 2,600 is large relative to the median price of 2,700 in the raw data, so predictive accuracy leaves considerable room for improvement.

**Conclusion**

The linear regression model demonstrates moderate computational efficiency and fast prediction speed. However, its RMSE values suggest room for improvement. Further refinement or exploration of alternative models may be necessary to estimate the market value of used cars effectively.
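The `%%time` magic reports timings only in the cell output. To collect training and prediction times programmatically, for example to compare models in a table, a small wrapper around `time.perf_counter` is enough. The helper `timed` below is a hypothetical sketch, and the `sum` call merely stands in for `model.fit` or `model.predict`.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would be model.fit or model.predict
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, wall time={seconds:.4f} s")
```

`time.perf_counter` measures wall-clock time; `time.process_time` could be substituted to approximate the CPU time that `%%time` also reports.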

## Decision Tree

In [46]:
for depth in [1, 2, 4, 6, 8, None]:
    model = DecisionTreeRegressor(max_depth=depth)
    model.fit(features_train, target_train)

    pred_train = model.predict(features_train)
    pred_valid = model.predict(features_valid)
    print("Depth:", depth)
    print("Train RMSE:", rmse(target_train, pred_train).round(5))
    print("Valid RMSE:", rmse(target_valid, pred_valid).round(5))

Depth: 1
Train RMSE: 4233.30031
Valid RMSE: 4216.56961
Depth: 2
Train RMSE: 3674.29782
Valid RMSE: 3665.96888
Depth: 4
Train RMSE: 2995.37379
Valid RMSE: 2995.06149
Depth: 6
Train RMSE: 2570.31782
Valid RMSE: 2564.93774
Depth: 8
Train RMSE: 2337.52933
Valid RMSE: 2355.04674
Depth: None
Train RMSE: 555.74643
Valid RMSE: 2286.6301


In [47]:
%%time

model = DecisionTreeRegressor(max_depth=8)
model.fit(features_train, target_train)

CPU times: user 4.04 s, sys: 220 ms, total: 4.26 s
Wall time: 6.08 s


In [48]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("Train RMSE:", rmse(target_train, pred_train).round(5))
print("Valid RMSE:", rmse(target_valid, pred_valid).round(5))
print("Test RMSE:", rmse(target_test, pred_test).round(5))

Train RMSE: 2337.52933
Valid RMSE: 2355.49511
Test RMSE: 2332.23109
CPU times: user 256 ms, sys: 252 ms, total: 509 ms
Wall time: 1.49 s


**Decision Tree**

Hyperparameter Tuning

The decision tree model was trained with maximum depths of 1, 2, 4, 6, and 8, as well as with unlimited depth (None). The model's performance was evaluated based on the Root Mean Squared Error (RMSE) metric on both the training and validation datasets.

* Depth: 1
  * Train RMSE: 4233.30031
  * Valid RMSE: 4216.56961
* Depth: 2
  * Train RMSE: 3674.29782
  * Valid RMSE: 3665.96888
* Depth: 4
  * Train RMSE: 2995.37379
  * Valid RMSE: 2995.06149
* Depth: 6
  * Train RMSE: 2570.31782
  * Valid RMSE: 2564.93774
* Depth: 8
  * Train RMSE: 2337.52933
  * Valid RMSE: 2355.04674
* Depth: None
  * Train RMSE: 555.74643
  * Valid RMSE: 2286.6301

The results show that validation RMSE falls as the maximum depth increases. With unlimited depth the tree reaches the lowest validation RMSE (2286.63) but clearly overfits: its training RMSE (555.75) is roughly a quarter of its validation RMSE. At depth 8 the training and validation errors are still nearly equal.
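The depth sweep can also record the gap between validation and training RMSE, which makes the overfitting pattern explicit. The sketch below runs on synthetic data, not the car dataset; the feature matrix, target formula, and `random_state` values are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 3))
y = 500 * X[:, 0] - 200 * X[:, 1] + rng.normal(0, 300, size=600)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

results = []
for depth in [1, 2, 4, 6, 8, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=12345)
    tree.fit(X_train, y_train)
    train_rmse = mean_squared_error(y_train, tree.predict(X_train)) ** 0.5
    valid_rmse = mean_squared_error(y_valid, tree.predict(X_valid)) ** 0.5
    results.append((depth, train_rmse, valid_rmse, valid_rmse - train_rmse))

# Depth with the lowest validation RMSE
best_depth = min(results, key=lambda row: row[2])[0]
for depth, train_rmse, valid_rmse, gap in results:
    print(f"depth={depth}: train={train_rmse:.1f} valid={valid_rmse:.1f} gap={gap:.1f}")
print("lowest validation RMSE at depth:", best_depth)
```

Selecting purely by validation RMSE and selecting by a small train/validation gap can lead to different choices, as the notebook's own results illustrate.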

**Selected Model Performance**

After tuning the hyperparameters, the decision tree model with a maximum depth of 8 was selected for further evaluation.

Training Time

The training of the selected decision tree model took approximately 4.26 seconds of CPU time (6.08 seconds wall time), indicating modest computational requirements.

Prediction Speed

The prediction phase for the selected decision tree model was relatively fast, with predictions for all datasets (train, validation, and test) completed in approximately 509 milliseconds of CPU time (1.49 seconds wall time).

Prediction Quality

The performance metrics for the selected model on different datasets are as follows:
  * Train RMSE: 2337.52933
  * Valid RMSE: 2355.49511
  * Test RMSE: 2332.23109

The validation RMSE differs slightly from the value in the depth sweep (2355.04674), most likely because no random_state was fixed when the model was refitted.

**Conclusion**

The decision tree model with a maximum depth of 8 demonstrates improved predictive performance compared to shallower trees. However, careful consideration is needed to prevent overfitting, especially when using unlimited depth. The selected model exhibits moderate computational efficiency, fast prediction speed, and reasonably good prediction quality. Further exploration and refinement may be necessary to optimize the model's performance for estimating the market value of used cars accurately.

## Random Forest

In [49]:
for depth in [1, 2, 4, 6, 8, None]:
    model = RandomForestRegressor(max_depth=depth, n_estimators=100)
    model.fit(features_train, target_train)

    pred_train = model.predict(features_train)
    pred_valid = model.predict(features_valid)
    print("Depth:", depth)
    print("Train RMSE:", rmse(target_train, pred_train).round(5))
    print("Valid RMSE:", rmse(target_valid, pred_valid).round(5))

Depth: 1
Train RMSE: 4233.30059
Valid RMSE: 4216.57011
Depth: 2
Train RMSE: 3674.09886
Valid RMSE: 3665.84904
Depth: 4
Train RMSE: 2959.99571
Valid RMSE: 2959.45785
Depth: 6
Train RMSE: 2521.79589
Valid RMSE: 2517.73965
Depth: 8
Train RMSE: 2271.81273
Valid RMSE: 2286.67691
Depth: None
Train RMSE: 833.64133
Valid RMSE: 1817.79793


In [50]:
%%time

model = RandomForestRegressor(max_depth=8, n_estimators=100)
model.fit(features_train, target_train)

CPU times: user 3min 13s, sys: 423 ms, total: 3min 13s
Wall time: 3min 17s


In [51]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("Train RMSE:", rmse(target_train, pred_train).round(5))
print("Valid RMSE:", rmse(target_valid, pred_valid).round(5))
print("Test RMSE:", rmse(target_test, pred_test).round(5))

Train RMSE: 2271.30481
Valid RMSE: 2285.48254
Test RMSE: 2267.41039
CPU times: user 2.43 s, sys: 240 ms, total: 2.67 s
Wall time: 2.69 s


**Random Forest**

**Hyperparameter Tuning**

The Random Forest model was trained with maximum depths of 1, 2, 4, 6, and 8, as well as with unlimited depth (None), while keeping the number of estimators constant at 100. The model's performance was evaluated based on the Root Mean Squared Error (RMSE) metric on both the training and validation datasets.
* Depth: 1
  * Train RMSE: 4233.30059
  * Valid RMSE: 4216.57011
* Depth: 2
  * Train RMSE: 3674.09886
  * Valid RMSE: 3665.84904
* Depth: 4
  * Train RMSE: 2959.99571
  * Valid RMSE: 2959.45785
* Depth: 6
  * Train RMSE: 2521.79589
  * Valid RMSE: 2517.73965
* Depth: 8
  * Train RMSE: 2271.81273
  * Valid RMSE: 2286.67691
* Depth: None
  * Train RMSE: 833.64133
  * Valid RMSE: 1817.79793

The results show that validation RMSE falls as the maximum depth increases. With unlimited depth the forest achieves by far the lowest validation RMSE (1817.80), but also the largest gap to its training RMSE (833.64), which is a sign of overfitting.

**Selected Model Performance**

After tuning the hyperparameters, the Random Forest model with a maximum depth of 8 was selected for further evaluation, as its training and validation RMSE are nearly equal. Note that the unlimited-depth forest scored roughly 470 lower on validation RMSE, so this choice trades accuracy for a smaller train/validation gap.

Training Time

The training of the selected Random Forest model took approximately 3 minutes and 13 seconds of CPU time (3 minutes and 17 seconds wall time), indicating significant computational requirements due to the ensemble nature of the algorithm.

Prediction Speed

The prediction phase for the selected Random Forest model was relatively fast, with predictions for all datasets (train, validation, and test) completed in approximately 2.67 seconds of CPU time (2.69 seconds wall time).

Prediction Quality

The performance metrics for the selected model on different datasets are as follows:
  * Train RMSE: 2271.30481
  * Valid RMSE: 2285.48254
  * Test RMSE: 2267.41039

**Conclusion**

The Random Forest model with a maximum depth of 8 demonstrates improved predictive performance compared to shallower trees. However, careful consideration is needed to prevent overfitting, especially when using unlimited depth. The selected model exhibits relatively high computational requirements for training but fast prediction speed and good prediction quality. Further exploration and refinement may be necessary to optimize the model's performance for estimating the market value of used cars accurately.
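Because the trees of a Random Forest are built independently, scikit-learn can train them in parallel through the `n_jobs` parameter, and fixing `random_state` makes repeated runs reproducible. The sketch below uses synthetic data and a small forest; it is an untimed illustration under those assumptions, and any speed-up on the real dataset would depend on the number of available CPU cores.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (made up, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 4))
y = 400 * X[:, 0] + rng.normal(0, 100, size=300)

def fit_forest(n_jobs):
    """Fit a small forest; n_jobs=-1 uses all available cores."""
    forest = RandomForestRegressor(
        n_estimators=20, max_depth=8, n_jobs=n_jobs, random_state=12345)
    return forest.fit(X, y)

serial = fit_forest(n_jobs=1)
parallel = fit_forest(n_jobs=-1)

# With the same random_state, both forests make the same predictions
same = np.allclose(serial.predict(X), parallel.predict(X))
print("identical predictions:", same)
```

A fixed `random_state` would also remove the small differences between the sweep and the refitted model seen in the outputs above.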

## LGBM - gradient boosting

In [52]:
%%time

model = lgb.LGBMRegressor(num_iterations=20, verbose=0, metric='rmse')
model.fit(features_train, target_train, eval_set = (features_valid, target_valid))



CPU times: user 1.46 s, sys: 398 ms, total: 1.86 s
Wall time: 1.9 s


In [53]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("Train RMSE:", rmse(target_train, pred_train).round(5))
print("Valid RMSE:", rmse(target_valid, pred_valid).round(5))
print("Test RMSE:", rmse(target_test, pred_test).round(5))

Train RMSE: 2318.12303
Valid RMSE: 2310.86439
Test RMSE: 2296.99548
CPU times: user 740 ms, sys: 322 ms, total: 1.06 s
Wall time: 1.05 s


**LightGBM (Gradient Boosting)**

Training Time

The LightGBM (Gradient Boosting) model was trained with 20 iterations. The training process took approximately 1.86 seconds of CPU time (1.9 seconds wall time), the shortest training time of the models evaluated so far.

Prediction Speed

The prediction phase for the LightGBM model was fast, with predictions for all datasets (train, validation, and test) completed in approximately 1.06 seconds of CPU time (1.05 seconds wall time).

Prediction Quality

The performance metrics for the LightGBM model on different datasets are as follows:
  * Train RMSE: 2318.12303
  * Valid RMSE: 2310.86439
  * Test RMSE: 2296.99548

The training, validation, and test RMSE values are close to each other, so the model shows no sign of overfitting. Its validation RMSE (2310.86) is slightly better than that of the depth-8 decision tree (2355.50) and slightly worse than that of the depth-8 Random Forest (2285.48), even though the model was limited to 20 boosting iterations.

**Conclusion**

The LightGBM (Gradient Boosting) model combines the shortest training time so far with fast prediction and an RMSE comparable to the depth-8 Random Forest, which took more than three minutes to train. This makes LightGBM a promising candidate for estimating the market value of used cars accurately and efficiently. Further optimization, such as increasing the number of boosting iterations, may improve its prediction quality.
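The results reported so far can be collected into one table to make the trade-off between training time, prediction speed, and quality easier to see. The figures below are copied from the cell outputs above (CPU "total" times converted to seconds); the DataFrame layout and column names are illustrative choices.

```python
import pandas as pd

# Figures copied from the cell outputs above; times are CPU totals in seconds
results = pd.DataFrame(
    [
        ("Linear Regression", 12.5, 1.44, 2612.76508, 2626.50956),
        ("Decision Tree (depth 8)", 4.26, 0.509, 2355.49511, 2332.23109),
        ("Random Forest (depth 8)", 193.0, 2.67, 2285.48254, 2267.41039),
        ("LightGBM (20 iterations)", 1.86, 1.06, 2310.86439, 2296.99548),
    ],
    columns=["model", "train_cpu_s", "predict_cpu_s", "valid_rmse", "test_rmse"],
)

# Rank by validation RMSE, lowest (best) first
ranked = results.sort_values("valid_rmse").reset_index(drop=True)
print(ranked.to_string(index=False))
```

Sorting by `train_cpu_s` or `predict_cpu_s` instead gives the ranking for the other two criteria.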

## CatBoost

In [54]:
%%time
model = cb.CatBoostRegressor()
model.fit(features_train, target_train)

Learning rate set to 0.088531
0:	learn: 4504.2312923	total: 86.1ms	remaining: 1m 26s
1:	learn: 4279.3148911	total: 108ms	remaining: 53.7s
2:	learn: 4067.5983087	total: 128ms	remaining: 42.7s
3:	learn: 3880.9847746	total: 150ms	remaining: 37.3s
4:	learn: 3723.6657810	total: 171ms	remaining: 34.1s
5:	learn: 3576.8962937	total: 194ms	remaining: 32.1s
6:	learn: 3443.5860923	total: 215ms	remaining: 30.5s
7:	learn: 3322.5024965	total: 242ms	remaining: 30s
8:	learn: 3220.3252466	total: 265ms	remaining: 29.2s
9:	learn: 3125.9721177	total: 289ms	remaining: 28.6s
10:	learn: 3042.7049930	total: 313ms	remaining: 28.1s
11:	learn: 2969.5811542	total: 336ms	remaining: 27.6s
12:	learn: 2897.3775238	total: 356ms	remaining: 27.1s
13:	learn: 2839.8480576	total: 376ms	remaining: 26.5s
14:	learn: 2785.6200909	total: 396ms	remaining: 26s
15:	learn: 2736.6506406	total: 417ms	remaining: 25.7s
16:	learn: 2694.5428109	total: 447ms	remaining: 25.8s
17:	learn: 2654.1088662	total: 470ms	remaining: 25.6s
18:	learn:

<catboost.core.CatBoostRegressor at 0x7bd888b541f0>

In [55]:
%%time

pred_train = model.predict(features_train)
pred_valid = model.predict(features_valid)
pred_test = model.predict(features_test)

print("Train RMSE:", rmse(target_train, pred_train).round(5))
print("Valid RMSE:", rmse(target_valid, pred_valid).round(5))
print("Test RMSE:", rmse(target_test, pred_test).round(5))

Train RMSE: 1742.70286
Valid RMSE: 1815.35414
Test RMSE: 1806.62819
CPU times: user 1.31 s, sys: 22.9 ms, total: 1.33 s
Wall time: 722 ms


**CatBoost**

Training Time

The CatBoost model was trained with default hyperparameters. Training took approximately 39.8 seconds, roughly 20 times longer than LightGBM (1.86 seconds) but well below the random forest's 3 minutes and 13 seconds.

Prediction Speed

Despite the longer training time, the prediction phase for the CatBoost model was fast: predictions for all three datasets (train, validation, and test) took about 1.33 seconds of CPU time and 722 ms of wall time. Prediction is therefore efficient once the model is trained.
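
The `%%time` magic reports both CPU time and wall time, and the two can differ when a library uses several threads. Outside a notebook, the same two figures could be collected with the standard library, as in the sketch below. The `time_call` helper and the `fake_predict` stand-in are hypothetical names introduced for this example; in practice `model.predict` would take the place of `fake_predict`.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def fake_predict(rows):
    """Stand-in for model.predict so the sketch runs without a trained model."""
    return [sum(row) for row in rows]


rows = [[1.0, 2.0, 3.0]] * 10_000
predictions, wall, cpu = time_call(fake_predict, rows)
per_row_us = wall / len(rows) * 1e6
print(f"wall: {wall:.4f} s, cpu: {cpu:.4f} s, per row: {per_row_us:.2f} us")
```

For a user-facing application, wall time per request is usually the more relevant figure, since it is what the user actually waits for.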

Prediction Quality

The performance metrics for the CatBoost model on different datasets are as follows:
  * Train RMSE: 1742.70286
  * Valid RMSE: 1815.35414
  * Test RMSE: 1806.62819

The Root Mean Squared Error (RMSE) is the square root of the mean squared difference between predicted and actual prices. CatBoost's validation RMSE of 1815.35 is about 21% lower than LightGBM's 2310.86. The training RMSE (1742.70) is somewhat lower than the validation and test values, which points to mild overfitting, but the validation and test errors agree closely.

**Conclusion**

Despite the longer training time, the CatBoost model showcases impressive predictive accuracy and efficient prediction speed. With its inherent ability to handle categorical features and robust performance, CatBoost emerges as a powerful tool for estimating the market value of used cars accurately. However, considerations should be made regarding computational resources when deploying CatBoost in real-world applications. Overall, the CatBoost model offers a promising solution for Rusty Bargain's goal of developing a reliable application for estimating car market values.

# Conclusion and Model Selection

## Based on the analysis of prediction quality, speed, and training time, select the best-performing model(s) for deployment in Rusty Bargain's application.

After analyzing the performance of various machine learning models for predicting car market values, let's summarize our findings and decide on the best-performing model for deployment in Rusty Bargain's application.

**Linear Regression**
  * Training Time: Moderate (12.5 seconds)
  * Prediction Speed: Fast (1.44 seconds)
  * Prediction Quality: Moderate (RMSE values suggest room for improvement)

**Decision Tree**
  * Training Time: Moderate (4.26 seconds)
  * Prediction Speed: Fast (509 milliseconds)
  * Prediction Quality: Reasonably good (RMSE values indicate improvement over linear regression)

**Random Forest**
  * Training Time: Significant (3 minutes and 13 seconds)
  * Prediction Speed: Fast (2.67 seconds)
  * Prediction Quality: Good (Balanced RMSE values with improved performance over decision tree)

**LightGBM (Gradient Boosting)**
  * Training Time: Fast (1.86 seconds)
  * Prediction Speed: Very fast (1.06 seconds)
  * Prediction Quality: Competitive (validation RMSE 2310.86)

**CatBoost**
  * Training Time: Significant (39.8 seconds)
  * Prediction Speed: Fast (1.33 seconds)
  * Prediction Quality: Best of the models compared (validation RMSE 1815.35)

**Conclusion**

Based on the analysis of prediction quality, speed, and training time, the best-performing models for deployment in Rusty Bargain's application are:

1. CatBoost: Despite having a longer training time than most of the other models, CatBoost demonstrates strong predictive accuracy and efficient prediction speed. With competitive RMSE values across all datasets and the ability to handle categorical features effectively, CatBoost emerges as a powerful tool for estimating the market value of used cars accurately.

2. LightGBM (Gradient Boosting): LightGBM exhibits remarkable performance in terms of training time, prediction speed, and prediction quality. With its efficient handling of gradient boosting and low computational requirements, LightGBM is a promising candidate for accurate and efficient estimation of car market values.

These models offer a balance between predictive accuracy, computational efficiency, and speed, making them suitable choices for deployment in Rusty Bargain's application. Further fine-tuning and optimization may enhance their performance, but their current performance already meets the requirements for reliable estimation of used car prices.

**Next Steps**
  * Deployment: Deploy the CatBoost model into Rusty Bargain's application for estimating car market values.
  * Monitoring and Evaluation: Continuously monitor the performance of the deployed model in real-world scenarios and evaluate its accuracy and efficiency.
  * Feedback Loop: Gather feedback from users and stakeholders regarding the model's performance and usability to identify areas for improvement.
  * Model Updates: Periodically update the model based on new data, insights, or advancements in machine learning techniques to ensure its continued relevance and accuracy.

  * Further Research: Explore advanced techniques, such as ensemble methods or neural networks, to potentially improve prediction quality further. Additionally, investigate ways to optimize computational resources and reduce training times without compromising accuracy.
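
The monitoring step could be sketched as a simple check that compares the RMSE on recently sold cars with the test RMSE measured in this notebook. This is only one possible approach: the `needs_retraining` function, the 10% tolerance, and the sample prices are all assumptions made for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def needs_retraining(actual, predicted, baseline_rmse, tolerance=0.10):
    """Return True when the live RMSE exceeds the baseline by more than tolerance."""
    return rmse(actual, predicted) > baseline_rmse * (1 + tolerance)


# CatBoost test RMSE reported in this notebook.
BASELINE_RMSE = 1806.62819

# Hypothetical recent sales and the prices the deployed model predicted for them.
recent_actual = [5200, 8100, 14000, 3000]
recent_predicted = [5000, 8000, 9500, 3500]

print(needs_retraining(recent_actual, recent_predicted, BASELINE_RMSE))
```

With these made-up values the live RMSE is well above the allowed threshold, so the check flags the model for retraining. A production version would need a much larger sample before drawing that conclusion.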

## Trade-offs between Models:

1.	Linear Regression:
  * Advantages: Simple, fast prediction speed.
  * Disadvantages: Limited predictive power, assumption of linear relationship may not hold.
2.	Decision Tree:
  * Advantages: Non-linear relationships, interpretable.
  * Disadvantages: Prone to overfitting, moderate prediction quality.
3.	Random Forest:
  * Advantages: Ensemble of decision trees, reduced overfitting.
  * Disadvantages: Longer training time, complexity.
4.	LightGBM:
  * Advantages: Efficient gradient boosting implementation, fastest training of the models compared (1.86 seconds), fast prediction speed.
  * Disadvantages: Higher RMSE than CatBoost at the 20 iterations used here, potential complexity.
5.	CatBoost:
  * Advantages: Handles categorical features well, robust performance, competitive prediction quality.
  * Disadvantages: Longer training time compared to some models.

Justification for Choosing CatBoost:
1.	Prediction Quality: CatBoost demonstrates competitive prediction quality with lower RMSE values compared to linear regression, decision tree, and random forest. This indicates its effectiveness in accurately estimating car market values.
2.	Prediction Speed: Despite longer training time, CatBoost exhibits fast prediction speed, completing predictions for all datasets in a reasonable time frame. This ensures efficient performance during real-time use.
3.	Robust Performance: CatBoost's ability to handle categorical features makes it suitable for Rusty Bargain's dataset, which likely contains such features. Its robust performance, evidenced by low RMSE values, ensures reliable predictions even in the presence of categorical variables.
4.	Balanced Trade-offs: While CatBoost has a longer training time than models like linear regression or decision trees, its superior prediction quality justifies the trade-off. A training time of 39.8 seconds is acceptable considering the improvement in prediction accuracy.
5.	Real-world Applicability: Considering Rusty Bargain's need for accurate estimations of car market values, the robust performance and competitive prediction quality of CatBoost make it the optimal choice for deployment in their application.

In summary, the trade-offs between different models were carefully evaluated, and CatBoost emerged as the best-performing model due to its competitive prediction quality, efficient prediction speed, robust performance, and suitability for Rusty Bargain's dataset. The longer training time (39.8 seconds, versus 1.86 seconds for LightGBM) is justified by the improvement in prediction accuracy, making CatBoost the optimal choice for deployment in Rusty Bargain's application.
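
The trade-off between the two boosted models can be put in numbers. The short calculation below uses only the training times and validation RMSE values reported earlier in this notebook. The dictionary layout is just a convenience for the example.

```python
# Figures copied from the cell outputs and summaries in this notebook.
lightgbm = {"train_seconds": 1.86, "valid_rmse": 2310.86439}
catboost = {"train_seconds": 39.8, "valid_rmse": 1815.35414}

# Relative improvement in validation error when moving from LightGBM to CatBoost.
rmse_reduction = 1 - catboost["valid_rmse"] / lightgbm["valid_rmse"]

# How many times longer CatBoost takes to train.
training_ratio = catboost["train_seconds"] / lightgbm["train_seconds"]

print(f"Validation RMSE reduction: {rmse_reduction:.1%}")
print(f"Training time ratio: {training_ratio:.1f}x")
```

CatBoost cuts the validation RMSE by about 21.4% at the cost of training roughly 21.4 times longer. Since the LightGBM run used only 20 iterations, part of this gap may reflect the iteration count rather than the algorithms themselves.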


# Future Directions




## Potential Improvements for the Model:

1.	Incorporating More Diverse Data Sources:
  * Obtain additional data sources such as car market trends, economic indicators, or customer preferences to enhance model performance.
  * Include data from social media or online forums to capture sentiment and consumer behavior towards certain car models or brands.
2.	Implementing Advanced Feature Engineering Techniques:
  * Explore more sophisticated feature engineering methods such as polynomial features, interaction terms, or embeddings to capture complex relationships between features.
  * Use domain knowledge to create new features that may have a significant impact on car prices, such as vehicle history, maintenance records, or geographic location.
3.	Exploring Newer Machine Learning Algorithms:
  * Investigate other approaches such as XGBoost, AutoML tools, or deep learning models to further improve prediction accuracy and handle more complex patterns in the data.
  * Experiment with ensemble methods or model stacking techniques to combine the strengths of multiple models and mitigate individual model weaknesses.
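
One of the simplest ways to combine models is a weighted average of their predictions, often called blending. The sketch below shows the idea. It is simpler than full stacking, which would train a second model on the base models' outputs. The `blend_predictions` function, the weights, and the sample predictions are hypothetical and were not part of the experiments in this notebook.

```python
def blend_predictions(predictions_by_model, weights=None):
    """Return the weighted average of several models' predictions, row by row."""
    names = list(predictions_by_model)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights by default
    total_weight = sum(weights[name] for name in names)
    n_rows = len(predictions_by_model[names[0]])
    blended = []
    for i in range(n_rows):
        weighted_sum = sum(
            weights[name] * predictions_by_model[name][i] for name in names
        )
        blended.append(weighted_sum / total_weight)
    return blended


# Hypothetical predictions for two cars from two models.
preds = {
    "catboost": [5000.0, 12000.0],
    "lightgbm": [5400.0, 11000.0],
}

print(blend_predictions(preds))
print(blend_predictions(preds, weights={"catboost": 3.0, "lightgbm": 1.0}))
```

Whether blending helps would have to be checked on the validation set. The weights themselves could be tuned there, with the test set kept aside for the final comparison.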


## Infrastructure and Computational Resources:

1.	Scalability:
  * Ensure the deployment infrastructure can handle increased data volume and user traffic over time.
  * Implement scalable cloud-based solutions such as AWS or Google Cloud Platform to accommodate growing computational demands.
2.	Maintainability:
  * Develop robust monitoring and logging systems to track model performance, data drift, and potential issues in real-time.
  * Establish regular maintenance schedules for model updates, retraining, and validation to ensure continued accuracy and relevance.
3.	Computational Resources:
  * Allocate sufficient computational resources for training and inference, considering the complexity of the model and the size of the dataset.
  * Optimize model training pipelines and hyperparameter tuning processes to minimize computational costs while maximizing performance.
4.	Model Versioning and Deployment:
  * Implement version control for models to track changes, rollback to previous versions if necessary, and ensure reproducibility.
  * Utilize containerization technologies such as Docker for seamless deployment across different environments and platforms.
5.	Data Privacy and Security:
  * Implement robust security measures to protect sensitive user data and ensure compliance with privacy regulations such as GDPR or CCPA.
  * Implement data anonymization techniques where necessary to protect user privacy while maintaining model performance.
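
As a rough illustration of model versioning, the sketch below derives a short version identifier from the serialized model and stores it with its metrics in a registry. Everything here is an assumption made for the example: the `fingerprint_model` function, the in-memory `registry` dictionary, and the placeholder model. A real deployment would more likely use the library's own model format and a persistent store.

```python
import hashlib
import json
import pickle


def fingerprint_model(model, metrics, registry):
    """Register a model under a short hash of its serialized bytes."""
    payload = pickle.dumps(model)
    version_id = hashlib.sha256(payload).hexdigest()[:12]
    registry[version_id] = {"metrics": metrics, "size_bytes": len(payload)}
    return version_id


registry = {}

# Placeholder object standing in for a trained model (hypothetical).
stand_in_model = {"kind": "placeholder", "params": {"iterations": 1000}}

# The test RMSE is the CatBoost figure reported in this notebook.
version = fingerprint_model(stand_in_model, {"test_rmse": 1806.62819}, registry)
print(version, json.dumps(registry[version]))
```

Because the identifier depends only on the model's bytes, registering the same model twice gives the same version, while any retrained model gets a new one. That makes it straightforward to tell which model produced a given prediction and to roll back if a new version performs worse.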

By incorporating these potential improvements and carefully considering infrastructure and computational resources, the model can remain scalable, maintainable, and effective in a production environment, ensuring accurate estimations of car market values for Rusty Bargain's application.