Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

## Data preparation

In [1]:
# import libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

from sklearn.metrics import mean_squared_error

from IPython.display import display

In [2]:
# load data
df = pd.read_csv('/datasets/car_data.csv')

# view
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [3]:
# describe numerical data
df.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


From looking at the numerical data, it seems that `RegistrationYear` has some strange values - I will remove values that are outside the 1900-2022 range. It also looks like `NumberOfPictures` has 0 for every value, so that column can be removed.

I am also suspicious about the 0 minimum values for `Price` and `Power`, so I will look into those more.

In [4]:
# view info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [5]:
# store original length before we clean
original_len = len(df)

There are a total of 16 columns in this dataset, and there are 354,369 total rows. I will rename the columns from title case to snake case. Then, I will remove unnecessary columns. The three date columns (`date_crawled`, `date_created`, and `last_seen`), `number_of_pictures`, and `postal_code` will not provide useful information for our models, so they can be dropped. `registration_month` will also be removed given that we already have the year and do not need to get any more specific.

I will also need to address missing values, as `vehicle_type`, `gearbox`, `model`, `fuel_type`, and `not_repaired` have null values.



### Rename + Drop Columns

In [6]:
# rename columns
df.columns = df.columns.str.lower()
df = df.rename(columns={'datecrawled': 'date_crawled', 'vehicletype': 'vehicle_type', 'registrationyear': 'registration_year', 'registrationmonth': 'registration_month', 'fueltype': 'fuel_type', 'notrepaired': 'not_repaired', 'datecreated': 'date_created', 'numberofpictures': 'number_of_pictures', 'postalcode': 'postal_code', 'lastseen': 'last_seen'})

In [7]:
# drop unnecessary columns
df = df.drop(['date_crawled', 'last_seen', 'date_created', 'number_of_pictures', 'postal_code', 'registration_month'], axis=1)
df.columns

Index(['price', 'vehicle_type', 'registration_year', 'gearbox', 'power',
       'model', 'mileage', 'fuel_type', 'brand', 'not_repaired'],
      dtype='object')

Columns have been updated to snake_case, and five columns have been removed.

### Update Values

`registration_year`: The rows with registration years that don't make sense will be dropped. 

In [8]:
# drop rows with nonsensical years
df = df.query('registration_year >= 1900 and registration_year <= 2022')

`not_repaired`: This column has three values: `NaN`, `yes`, and `no`. We can update this column to a boolean where `yes` values are 1 and `no` values are 0.

In [9]:
# update values to numeric
df.loc[df['not_repaired'] == 'yes', 'not_repaired'] = 1
df.loc[df['not_repaired'] == 'no', 'not_repaired'] = 0

# check values have been updated
df['not_repaired'].value_counts()

0    247146
1     36045
Name: not_repaired, dtype: int64

### Address Missing Values

In [10]:
# view data
df.head(10)

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,mileage,fuel_type,brand,not_repaired
0,480,,1993,manual,0,golf,150000,petrol,volkswagen,
1,18300,coupe,2011,manual,190,,125000,gasoline,audi,1.0
2,9800,suv,2004,auto,163,grand,125000,gasoline,jeep,
3,1500,small,2001,manual,75,golf,150000,petrol,volkswagen,0.0
4,3600,small,2008,manual,69,fabia,90000,gasoline,skoda,0.0
5,650,sedan,1995,manual,102,3er,150000,petrol,bmw,1.0
6,2200,convertible,2004,manual,109,2_reihe,150000,petrol,peugeot,0.0
7,0,sedan,1980,manual,50,other,40000,petrol,volkswagen,0.0
8,14500,bus,2014,manual,125,c_max,30000,petrol,ford,
9,999,small,1998,manual,101,golf,150000,,volkswagen,


In [11]:
# get percentage of missing values
columns = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'not_repaired']
num_rows = len(df)

for column in columns:
    num_missing = df[column].isnull().sum()
    percent_missing = num_missing / num_rows * 100
    print(f'{column}: {percent_missing:.3}% of data is missing')

vehicle_type: 10.5% of data is missing
gearbox: 5.56% of data is missing
model: 5.54% of data is missing
fuel_type: 9.25% of data is missing
not_repaired: 20.0% of data is missing


#### Model + Gearbox
Both these columns have under 6% of data missing, so I will simply remove the missing data.

In [12]:
# drop missing model rows
df = df.dropna(subset=['model', 'gearbox']).reset_index(drop=True)

In [13]:
# check to make sure data has been filled
df[['model', 'gearbox']].isnull().sum()

model      0
gearbox    0
dtype: int64

#### Fuel Type

9.25% of fuel type data is missing. Rather than dropping these rows, the fuel type is likely related to a car's brand and model, so I will fill the null values with the most common fuel type for each.

In [14]:
# fill fuel type
df['fuel_type'] = df.groupby(['brand', 'model'])['fuel_type'].apply(lambda x:x.fillna(x.value_counts().index.tolist()[0]))

In [15]:
# check data was filled
df['fuel_type'].isnull().sum()

0

#### Vehicle Type
For missing values, `vehicle_type` has 10.5% rows with null values. This is a significant amount of values, but the type of vehicle is likely related to `model` and `fuel_type` again, so I will fill missing values using those. 

In [16]:
# fill vehicle type
df['vehicle_type'] = df.groupby(['brand', 'model'])['vehicle_type'].apply(lambda x: x.fillna(x.value_counts().index.tolist()[0]))

In [17]:
# check data was filled
df['vehicle_type'].isnull().sum()

0

#### Not Repaired

20% of this data is missing. Previously, the values in this column were updated from strings to numeric values of 0 or 1. I'm thinking that the missing data was likely a user error, and that the null values result from people not selecting that their vehicle has not been repaired. So, we will assume that "0" (yes) is the norm and will fill the null values with 0s.

In [18]:
# fill missing values
df['not_repaired'] = df['not_repaired'].fillna(0)

In [19]:
# check data was filled
df['not_repaired'].isnull().sum()

0

### Address 0s in Power and Price

In [20]:
# get percentage of 0s in power and price
columns = ['power', 'price']

for column in columns:
    num_zeros = len(df.loc[df[column] == 0])
    percent_zeros = num_zeros / num_rows * 100
    print(f'{column}: {percent_zeros:.3}% of data is 0')

power: 6.72% of data is 0
price: 2.02% of data is 0


#### Power

6.72% of data has a power of 0. This does not make sense. I will fill power with the average power of a brand/model.

In [21]:
# replace 0 with nulls
df['power'] = df['power'].replace(0, np.NaN)

# fill vehicle type
df['power'] = df['power'].fillna(df.groupby(['brand', 'model'])['power'].transform('median'))

In [22]:
# check data was filled
df['power'].isnull().sum()

1

In [23]:
# drop remaining
df = df.dropna(subset=['power']).reset_index(drop=True)

#### Price

Price is our target value. It doesn't make sense that these cars would be selling for 0 euros, so these rows will be dropped.

In [24]:
# drop remaining
df = df.dropna(subset=['price']).reset_index(drop=True)

### Conclusion

In [25]:
new_len = len(df)
print(f'Original length: {original_len}')
print(f'New length: {new_len}')

print(f'{new_len / original_len * 100:.3}% of data was retained.')

Original length: 354369
New length: 318936
90.0% of data was retained.


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 318936 entries, 0 to 318935
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   price              318936 non-null  int64  
 1   vehicle_type       318936 non-null  object 
 2   registration_year  318936 non-null  int64  
 3   gearbox            318936 non-null  object 
 4   power              318936 non-null  float64
 5   model              318936 non-null  object 
 6   mileage            318936 non-null  int64  
 7   fuel_type          318936 non-null  object 
 8   brand              318936 non-null  object 
 9   not_repaired       318936 non-null  int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 24.3+ MB


In [27]:
df.isnull().sum()

price                0
vehicle_type         0
registration_year    0
gearbox              0
power                0
model                0
mileage              0
fuel_type            0
brand                0
not_repaired         0
dtype: int64

We have now finished cleaning the data. 10% of data was removed, which leaves us with 318,936 rows and 10 columns. Missing values have been filled, and the 0 values in power and price have been addressed. We are now ready to move into the model training.

## Model training
Five different models will be tested: Linear Regression, Random Forest, LightGBM, CatBoost, and XGBoost. Linear Regression will be our sanity test.

### Model Preparation

The data will be split into training, testing, and validation sets with a 60%/20%/20% split. The training set will be used to train the model and the validation set will be used to calculate the RMSE for each model. The testing set will be set aside to provide an unbiased final estimate for the best model.

The numerical data will be scaled using a standard scaler.

#### Split Data

In [28]:
# get features and target
features = df.drop(['price'], axis=1)
target = df['price']

In [29]:
# reserving 20% of the data for the test dataset
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.20, random_state=12345
)

# splitting the 80% of the source dataset into the training dataset (60%) and validation dataset (20%)  
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345
)

In [30]:
print(f'Size of training set: {len(features_train) / new_len:.3}%')
print(f'Size of validation set: {len(features_valid) / new_len:.3}%')
print(f'Size of testing set: {len(target_valid) / new_len:.3}%')

Size of training set: 0.6%
Size of validation set: 0.2%
Size of testing set: 0.2%


#### Scale Numeric Data

In [31]:
# scale numeric values
numeric = ['registration_year', 'power', 'mileage']

scaler = StandardScaler()
scaler.fit(features_train[numeric])

features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features_test[numeric] = scaler.transform(features_test[numeric])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value[:, i].tolist(), pi)


### Linear Regression

First, we'll do linear regression for a sanity check. Linear regression requires categorical variables to be encoded.

In [32]:
# define categorical columns
categorical = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand']

In [33]:
# create ColumnTransformer to encode columns
ct = ColumnTransformer(
    [('one-hot-encoder', OneHotEncoder(drop='first'), categorical)], remainder='passthrough')

# fit on training set
ct.fit(features_train)

ColumnTransformer(remainder='passthrough',
                  transformers=[('one-hot-encoder', OneHotEncoder(drop='first'),
                                 ['vehicle_type', 'gearbox', 'model',
                                  'fuel_type', 'brand'])])

In [34]:
features_train_ohe = ct.transform(features_train)
features_valid_ohe = ct.transform(features_valid)

In [35]:
# empty dataframe to store model results
rmse_scores = pd.DataFrame(columns=['model', 'rmse'])

In [36]:
%%time

lr = LinearRegression()
lr.fit(features_train_ohe, target_train)
predicted_value = lr.predict(features_valid_ohe)
rmse = mean_squared_error(predicted_value, target_valid) ** 0.5

# add score to dataframe
rmse_scores = rmse_scores.append({'model': 'linear regression', 'rmse': rmse}, ignore_index=True)

print(f'Linear Regression RMSE: {rmse}')

Linear Regression RMSE: 2952.2088798096925
CPU times: user 5.05 s, sys: 10.4 s, total: 15.5 s
Wall time: 15.4 s


For our sanity check, Linear Regression has a RMSE of 2952. It also ran in 16 seconds.

### Random Forest

The next model to test will be random forest. OHE does not work well with trees, so I will use the OrdinalEncoder for it.

In [37]:
# encode data
encoder = OrdinalEncoder()

# create copies
features_train_encoded = features_train.copy()
features_valid_encoded = features_valid.copy()
features_test_encoded = features_test.copy()

# only fit encoder on the training set
encoder.fit(features_train_encoded[categorical])

# transform training and validation sets
features_train_encoded[categorical] = encoder.transform(features_train[categorical].astype('str'))
features_valid_encoded[categorical] = encoder.transform(features_valid_encoded[categorical].astype('str'))
features_test_encoded[categorical] = encoder.transform(features_test_encoded[categorical].astype('str'))

In [38]:
%%time

best_rmse = float('inf')
best_depth = 0
best_est = 0

# looping through max depths
for depth in range(6, 12):

    # looping through number of estimators
    for est in range(10, 121, 10): 
        rf = RandomForestRegressor(random_state=12345, n_estimators=est, max_depth=depth)
        
        rf.fit(features_train_encoded, target_train) 
        predictions_valid = rf.predict(features_valid_encoded)
        
        rmse = mean_squared_error(predictions_valid, target_valid) ** 0.5

        if rmse < best_rmse:
            print(f'{depth}, {est}, {rmse}')
            best_rmse = rmse
            best_depth = depth
            best_est = est

# add score to dataframe
rmse_scores = rmse_scores.append({'model': 'random forest', 'rmse': best_rmse}, ignore_index=True)

print(f'Random Forest - Depth: {best_depth}; Est: {best_est}; RMSE: {best_rmse}')

6, 10, 2338.9663978328035
6, 20, 2337.531975145627
6, 30, 2331.980581208885
6, 40, 2329.5166085573364
7, 10, 2215.2286475917035
7, 30, 2211.0047001033377
7, 40, 2210.369380414424
8, 10, 2126.720344928382
8, 20, 2124.394011751589
8, 30, 2119.8316463829615
8, 40, 2118.9934002453456
9, 10, 2053.7878153748384
9, 20, 2051.6692514209963
9, 30, 2046.8685544263187
9, 40, 2045.9575988713998
9, 50, 2045.7694470186568
9, 60, 2045.5401693955887
10, 10, 1986.1242965267031
10, 20, 1982.4764001809826
10, 30, 1977.462050826856
10, 40, 1976.2163271027432
10, 50, 1975.1937001762462
10, 60, 1974.649182034377
10, 70, 1974.4572544347466
10, 110, 1974.4268961721114
10, 120, 1973.9870829441536
11, 10, 1927.087630136927
11, 20, 1919.937232878869
11, 30, 1914.8156559765614
11, 40, 1912.7531102821858
11, 50, 1911.3674485462802
11, 60, 1911.0513135765198
11, 70, 1910.7161781126167
11, 120, 1910.6903192571663
Random Forest - Depth: 11; Est: 120; RMSE: 1910.6903192571663
CPU times: user 22min 3s, sys: 1.4 s, total

The best Random Forest model has a depth of 11 and 120 estimators. Its RMSE is 1910. It took 23 minutes to do all the hyperparameter tuning. 

### LightGBM

In [39]:
features_train_cat = features_train.copy()
features_valid_cat = features_valid.copy()
features_test_cat = features_test.copy()

for col in categorical:
    features_train_cat[col] = features_train_cat[col].astype('category')
    features_valid_cat[col] = features_valid_cat[col].astype('category')
    features_test_cat[col] = features_test_cat[col].astype('category')

In [40]:
%%time

best_rmse = float('inf')
best_leaves = 0

for leaves in range(80, 150, 10):
    lgbm = LGBMRegressor(num_leaves=leaves)
    lgbm.fit(features_train_cat, target_train)
    predictions_valid = lgbm.predict(features_valid_cat)
    
    rmse = mean_squared_error(predictions_valid, target_valid) ** 0.5
    
    if rmse < best_rmse:
        print(f'{leaves}, {rmse}')
        best_rmse = rmse
        best_leaves = leaves

# add score to dataframe
rmse_scores = rmse_scores.append({'model': 'lgbm', 'rmse': best_rmse}, ignore_index=True)

print(f'LightGBM - Leaves: {leaves}; RMSE: {best_rmse}')

80, 1695.4317703182624
90, 1686.9185008196268
100, 1683.8286701788215
110, 1681.7235149959824
120, 1681.0034149749874
130, 1678.3765722923017
140, 1675.136354361288
LightGBM - Leaves: 140; RMSE: 1675.136354361288
CPU times: user 45.3 s, sys: 323 ms, total: 45.7 s
Wall time: 46.1 s


One hyperparameter (number of leaves) was tested for LightGBM. 140 leaves gives the best RSME of 1675, and it took 45 seconds to run.

### CatBoost

In [41]:
%%time 

cat = CatBoostRegressor(loss_function="RMSE", iterations=150, random_seed=12345)
cat.fit(features_train, target_train, cat_features=categorical, verbose=10)
predictions_valid = cat.predict(features_valid)
rmse = mean_squared_error(predictions_valid, target_valid) ** 0.5

rmse_scores = rmse_scores.append({'model': 'catboost', 'rmse': rmse}, ignore_index=True)

print(f'CatBoost - RMSE: {rmse}')

Learning rate set to 0.439087
0:	learn: 3411.4679383	total: 198ms	remaining: 29.5s
10:	learn: 2014.2576874	total: 1.55s	remaining: 19.6s
20:	learn: 1922.3951436	total: 2.81s	remaining: 17.3s
30:	learn: 1871.5760373	total: 4.05s	remaining: 15.6s
40:	learn: 1839.6077846	total: 5.29s	remaining: 14.1s
50:	learn: 1813.4444476	total: 6.55s	remaining: 12.7s
60:	learn: 1797.4121125	total: 7.77s	remaining: 11.3s
70:	learn: 1781.6463410	total: 9.03s	remaining: 10s
80:	learn: 1771.0059688	total: 10.3s	remaining: 8.74s
90:	learn: 1763.1138702	total: 11.5s	remaining: 7.46s
100:	learn: 1754.0685931	total: 12.8s	remaining: 6.19s
110:	learn: 1743.9297492	total: 14s	remaining: 4.92s
120:	learn: 1734.8075529	total: 15.3s	remaining: 3.65s
130:	learn: 1726.7915499	total: 16.5s	remaining: 2.39s
140:	learn: 1720.2409035	total: 17.7s	remaining: 1.13s
149:	learn: 1713.8256325	total: 18.9s	remaining: 0us
CatBoost - RMSE: 1776.2748385926488
CPU times: user 19.2 s, sys: 88 ms, total: 19.3 s
Wall time: 19.5 s


CatBoost's best RMSE is 1776, and it took 19 seconds to run.

### XGBoost
XGBoost can't handle categorical values, so I will use the encoded version of the data.

In [42]:
%%time

xgb = XGBRegressor()
xgb.fit(features_train_encoded, target_train)
predictions_valid = xgb.predict(features_valid_encoded)
rmse = mean_squared_error(predictions_valid, target_valid) ** 0.5

rmse_scores = rmse_scores.append({'model': 'xgboost', 'rmse': rmse}, ignore_index=True)

print(f'XGBoost - RMSE: {rmse}')

XGBoost - RMSE: 1750.6822436860396
CPU times: user 28.1 s, sys: 137 ms, total: 28.2 s
Wall time: 28.4 s


XGBoost took 28 seconds to run and has an RMSE of 1745.

## Model analysis

### Time Analysis

* **Linear Regression:** 16 seconds
* **Random Forest:** 23 minutes
* **LightGBM:** 45 seconds
* **CatBoost:** 19 seconds
* **XGBoost:** 28 seconds

RandomForest took the longest time to run because there were so many hyperparameters that were tested. The runtime of it could be reduced by narrowing the scope of the hyperparameters. The other models were very quick, each running in under a minute. If additional hyperparameters were tuned in LightGBM, CatBoost, and XGBoost, then their runtimes would have increased.

### RMSE Analysis

In [43]:
rmse_scores

Unnamed: 0,model,rmse
0,linear regression,2952.20888
1,random forest,1910.690319
2,lgbm,1675.136354
3,catboost,1776.274839
4,xgboost,1750.682244


Linear Regression had the worst RMSE of 2952, and LightGBM had the best RMSE of 1675. The other Gradient Boosting algorithms were close, and Random Forest still had a significant improvement over Linear Regression.

### Testing LightGBM
I will now test the LightGBM model with 140 leaves on the testing dataset to confirm the RMSE.

In [44]:
%%time

lgbm = LGBMRegressor(num_leaves=140)
lgbm.fit(features_train_cat, target_train)
predictions_test = lgbm.predict(features_test_cat)

rmse = mean_squared_error(predictions_test, target_test) ** 0.5

print(f'LightGBM - Leaves: 140; RMSE: {rmse}')

LightGBM - Leaves: 140; RMSE: 1662.361851812816
CPU times: user 7.29 s, sys: 92.8 ms, total: 7.38 s
Wall time: 7.48 s


With LightGBM, the test set has a RMSE of 1662.

# Conclusion

Rusty Bargain is interested in the quality, speed, and training time for the models. I would recommend that Rusty Bargain not consider Linear Regression nor Random Forest for their model, as they had the worst RMSE scores. 

The Gradient Boosting algorithms (CatBoost, LightGBM, and XGBoost) are the models to consider, and of these I would suggest Rusty Bargain consider both CatBoost and LightGBM. LightGBM had the best RMSE of all the models tested, but it took the most time of the Gradient Boosting algorithms due to testing different hyperparameters. If Rusty Bargain scales and has even more data, the runtime will increase. CatBoost had a slightly lower RMSE but was the quickest Gradient Boosting model, so if Rusty Bargain expects that they will scale in the future then it is also worth considering.