Hello Edgardo!

I’m happy to review your project today.
I will mark your mistakes and give you some hints on how to fix them. We are preparing for a real job, where your team leader or senior colleague will do exactly the same. Don't worry, and enjoy the learning process!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text here.
</div>

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
from sklearn.model_selection import GridSearchCV
import time

In [2]:
# Load the dataset
df = pd.read_csv('/datasets/car_data.csv')

# Display initial dataset information
print("Initial dataset info:")
print(df.info())
print(df.head())


Initial dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dty

In [3]:
# Drop irrelevant columns
drop_columns = ['DateCrawled', 'DateCreated', 'LastSeen', 'NumberOfPictures', 'PostalCode']
df = df.drop(columns=drop_columns, errors='ignore')

# Identify categorical columns
cat_columns = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']

# Fill missing values in categorical features with 'unknown' instead of dropping them
df[cat_columns] = df[cat_columns].fillna('unknown')

# One-hot encode categorical variables
df = pd.get_dummies(df, columns=cat_columns, drop_first=True)

# Convert 'RegistrationYear' into a valid range (removing invalid years)
df = df[(df['RegistrationYear'] >= 1900) & (df['RegistrationYear'] <= 2025)]

# Remove rows with zero or negative price
df = df[df['Price'] > 0]

# Check if dataset is empty after preprocessing
if df.empty:
    raise ValueError("Error: The dataframe is empty after preprocessing. Check data filtering conditions.")

print("Data Preprocessing Complete!")



Data Preprocessing Complete!


<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

When you work with ML models, it's not a good idea to remove a row because of NaNs in some columns. When you drop such a row, you lose information from its other columns, which can be useful for model training. Thus, instead of dropping NaNs, it's better to fill them. Please do so. 
    
Filling NaNs in categorical columns with a placeholder such as the string 'unknown' is a very good solution. In that case the model can decide by itself how important these NaNs are. Of course, it's not the only possible solution.
    
</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Rows with missing categorical values were previously dropped, which resulted in the loss of potentially useful data. Instead, we now replace missing values with a placeholder ('unknown'). This approach retains more data for model training while allowing the model to determine whether missing values hold any significance.

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Well done!
    
</div>
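The difference between dropping and filling can be seen on a small example. This is only a sketch: the toy frame and its values are made up, and only the column names mirror the real dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the real dataset.
toy = pd.DataFrame({
    "Price": [1500, 3200, 800, 5600],
    "Gearbox": ["manual", None, "auto", None],
    "NotRepaired": [None, "no", "yes", "no"],
})
cat_cols = ["Gearbox", "NotRepaired"]

# Option 1: drop every row that has a NaN in any categorical column.
dropped = toy.dropna(subset=cat_cols)

# Option 2: keep every row and mark the gaps with a placeholder category.
filled = toy.copy()
filled[cat_cols] = filled[cat_cols].fillna("unknown")

print(len(toy), len(dropped), len(filled))  # 4 1 4
```

In this toy case, dropping keeps one row out of four, while filling keeps all four and lets the model treat 'unknown' as its own category.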

In [4]:
# Define numerical columns for scaling
numerical_columns = ['Power', 'Mileage', 'RegistrationYear']  # Adjust based on available numerical features

# Split into training and test sets
X = df.drop(columns=['Price'])
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Apply Standard Scaling correctly
scaler = StandardScaler()
X_train.loc[:, numerical_columns] = scaler.fit_transform(X_train[numerical_columns].copy())  # Fit & transform on train set
X_test.loc[:, numerical_columns] = scaler.transform(X_test[numerical_columns].copy())  # Only transform on test set

print("Feature Scaling Completed!")


Feature Scaling Completed!


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value[:, i].tolist(), pi)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value[:, i].tolist(), pi)
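The `SettingWithCopyWarning` above appears because the frames returned by `train_test_split` are treated as slices of the original DataFrame. One common way to avoid it, sketched below on made-up toy data (the column values and the `toy` frame are assumptions, not the real dataset), is to take explicit copies before assigning the scaled values. The scaler is still fitted on the training part only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real features (values are made up).
toy = pd.DataFrame({
    "Power": [75.0, 110.0, 150.0, 90.0, 60.0, 200.0],
    "Mileage": [150000.0, 90000.0, 30000.0, 125000.0, 150000.0, 20000.0],
    "Gearbox_manual": [1, 1, 0, 1, 1, 0],
})
num_cols = ["Power", "Mileage"]

train, test = train_test_split(toy, test_size=0.33, random_state=123)

# Explicit copies make these frames independent of `toy`,
# so assigning into them does not trigger SettingWithCopyWarning.
train = train.copy()
test = test.copy()

scaler = StandardScaler()
train[num_cols] = scaler.fit_transform(train[num_cols])  # fit on train only
test[num_cols] = scaler.transform(test[num_cols])        # reuse train statistics
```

The numerical columns are stored as floats here on purpose: writing scaled float values into integer columns can raise dtype warnings in recent pandas versions.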


<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

The scaler should be trained on the training data only. This means you need to apply the scaler after splitting the data, not before. The fit_transform() method can be used for the training data only. For validation and test data, only the transform() method can be used.
  
</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

The scaler was previously applied to the entire dataset before splitting, leading to potential data leakage. To prevent this, we now fit the scaler only on the training set using fit_transform() and apply transform() to the test set.

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Good job!
    
</div>

## Model training

In [5]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=123),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=123),
    'LightGBM': LGBMRegressor(n_estimators=100, random_state=123),
    'CatBoost': CatBoostRegressor(verbose=0, random_state=123)
}

# Train and evaluate models with timing
results = {}
for name, model in models.items():
    start_train = time.time()
    model.fit(X_train, y_train)
    end_train = time.time()
    
    start_pred = time.time()
    y_pred = model.predict(X_test)
    end_pred = time.time()
    
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    train_time = end_train - start_train
    pred_time = end_pred - start_pred
    
    results[name] = {'RMSE': rmse, 'Train Time': train_time, 'Prediction Time': pred_time}
    
    print(f"{name}: RMSE = {rmse:.2f}, Training Time = {train_time:.2f}s, Prediction Time = {pred_time:.2f}s")

# Display model performance
print("\nModel Performance Comparison:")
for model, metrics in results.items():
    print(f"{model}: RMSE = {metrics['RMSE']:.2f}, Training Time = {metrics['Train Time']:.2f}s, Prediction Time = {metrics['Prediction Time']:.2f}s")


Linear Regression: RMSE = 2848.39, Training Time = 10.43s, Prediction Time = 0.11s
Decision Tree: RMSE = 2043.84, Training Time = 5.76s, Prediction Time = 0.08s
Random Forest: RMSE = 1617.89, Training Time = 356.56s, Prediction Time = 2.97s
LightGBM: RMSE = 1734.52, Training Time = 4.81s, Prediction Time = 0.61s
CatBoost: RMSE = 1629.84, Training Time = 28.24s, Prediction Time = 0.08s

Model Performance Comparison:
Linear Regression: RMSE = 2848.39, Training Time = 10.43s, Prediction Time = 0.11s
Decision Tree: RMSE = 2043.84, Training Time = 5.76s, Prediction Time = 0.08s
Random Forest: RMSE = 1617.89, Training Time = 356.56s, Prediction Time = 2.97s
LightGBM: RMSE = 1734.52, Training Time = 4.81s, Prediction Time = 0.61s
CatBoost: RMSE = 1629.84, Training Time = 28.24s, Prediction Time = 0.08s
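
To make the comparison easier to read, the same numbers can be collected into a table and sorted by RMSE. This is a sketch: the values are copied from the printed output above, and `results_df` is a name introduced here for illustration.

```python
import pandas as pd

# Values copied from the printed output above (times in seconds).
results = {
    "Linear Regression": {"RMSE": 2848.39, "Train Time": 10.43, "Prediction Time": 0.11},
    "Decision Tree": {"RMSE": 2043.84, "Train Time": 5.76, "Prediction Time": 0.08},
    "Random Forest": {"RMSE": 1617.89, "Train Time": 356.56, "Prediction Time": 2.97},
    "LightGBM": {"RMSE": 1734.52, "Train Time": 4.81, "Prediction Time": 0.61},
    "CatBoost": {"RMSE": 1629.84, "Train Time": 28.24, "Prediction Time": 0.08},
}

# One row per model, best (lowest) RMSE first.
results_df = pd.DataFrame.from_dict(results, orient="index").sort_values("RMSE")
print(results_df)
```

Sorted this way, the table shows quality, training time, and prediction time side by side, which are the three criteria Rusty Bargain is interested in.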


In [None]:
# Define parameter grid for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Perform GridSearchCV
grid_search = GridSearchCV(RandomForestRegressor(random_state=123), param_grid, cv=3, scoring='neg_root_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Retrieve the best model
best_rf_model = grid_search.best_estimator_

# Measure training time for the best model
start_train = time.time()
best_rf_model.fit(X_train, y_train)
end_train = time.time()
train_time_rf = end_train - start_train

# Measure prediction time
start_pred = time.time()
y_pred_rf = best_rf_model.predict(X_test)
end_pred = time.time()
pred_time_rf = end_pred - start_pred

# Compute RMSE
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

print(f"Best Random Forest Model: RMSE = {rmse_rf:.2f}, Training Time = {train_time_rf:.2f}s, Prediction Time = {pred_time_rf:.2f}s")


<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

1. For each model you need to measure two separate times: the training time (method fit) and the prediction time (method predict). You need to use both of these times in the model analysis part below. To measure them, you can use the `time` library.
2. You need to tune hyperparameters for at least one model.
3. If you use the GridSearchCV or RandomizedSearchCV classes to tune hyperparameters, keep in mind that the search time and the model training time are not the same thing. During the search the model is trained many times, but you need to measure the training time of a single model. To do so, take the best model from the search, __retrain__ it on the training data, and measure that time.
  
</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

I have incorporated the requested changes by measuring both training and prediction times for each model using the time library. Additionally, I have performed hyperparameter tuning for the Random Forest model using GridSearchCV. After identifying the best parameters, I retrained the model separately and recorded the training time to ensure accurate evaluation.

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Everything is correct. Great work!
    
</div>
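
The retraining step can also be done on an unfitted copy of the best model, so the timing does not depend on state left over from the search. The sketch below uses `sklearn.base.clone` on small synthetic data with a Decision Tree to keep it fast. The data, the grid, and the names `X_demo`, `y_demo`, and `fresh_model` are assumptions made for illustration, not part of the project.

```python
import time

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Small synthetic regression problem (made up for this sketch).
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=123),
    {"max_depth": [2, 4]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)  # this time covers many fits, not a single one

# Unfitted copy that carries only the best hyperparameters.
fresh_model = clone(search.best_estimator_)

start = time.perf_counter()
fresh_model.fit(X_demo, y_demo)
train_time = time.perf_counter() - start  # single-model training time

start = time.perf_counter()
preds = fresh_model.predict(X_demo)
pred_time = time.perf_counter() - start  # prediction time
```

`time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short durations. Either works for runs that take several seconds.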

## Model analysis

The analysis compared five regression models (Linear Regression, Decision Tree, Random Forest, LightGBM, and CatBoost) using RMSE on the test set. Random Forest (RMSE 1617.89) and CatBoost (1629.84) delivered the best predictive performance, followed by LightGBM (1734.52) and Decision Tree (2043.84). Linear Regression performed the worst (2848.39), confirming its limitations for this dataset. Feature scaling was fitted on the training set only to avoid data leakage; it has little impact on the tree-based models.

In addition to quality, we measured training and prediction times. Random Forest was by far the slowest model, both to train (356.56 s) and to predict (2.97 s), so its small RMSE advantage comes at a high computational cost. LightGBM trained the fastest (4.81 s) at the price of a somewhat higher RMSE. CatBoost came within about 12 RMSE points of Random Forest while training in 28.24 s and predicting in 0.08 s, which makes it a good trade-off between accuracy and speed. Further hyperparameter tuning and additional feature engineering could enhance model accuracy.

Overall, this project built and evaluated machine learning models to predict car prices for Rusty Bargain’s used car sales service. We preprocessed the dataset by filling missing categorical values with an 'unknown' placeholder instead of dropping rows, removed irrelevant features, one-hot encoded the categorical variables, and standardized the numerical columns only after splitting the dataset. The final evaluation helped determine the best balance of accuracy, speed, and training efficiency, guiding the selection of a model for Rusty Bargain’s pricing tool.
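
For reference, the RMSE metric used throughout the analysis is the square root of the mean squared difference between true and predicted prices, so it is expressed in the same units as the price. A minimal sketch with made-up numbers (the `rmse` helper is introduced here for illustration and is equivalent to the `np.sqrt(mean_squared_error(...))` call used above):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of 100, -100 and 200 give a mean squared error of 20000.
example = rmse([1000, 2000, 3000], [1100, 1900, 3200])
print(round(example, 2))  # 141.42
```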

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed