[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nilsjennissen/machine-learning/blob/main/notebooks/regression_models.ipynb)

# Regression Models in Practice
### Predicting used car prices with the best model possible

**Business Objective:** We have been hired by a used car dealer (Cars Cars Cars) to create a pricing model they can use in their operations. They have provided as a file (attached) with car prices and several features we can use to create a model to predict these prices.
car data.csv Download car data.csv
The dependent variable is column "selling price" in the file. "Present price" describes the current price of the car if we were going to buy it new. The rest of the columns are self explanatory.

Your job is to use all regression and data preparation techniques you know to create the best model for our customer.

You have to deliver a visually appealing notebook that explores all alternatives, describes all relevant steps and presents conclusions in a clear way. We have been hired by Cars Cars Cars CTO, and eventhough she has basic knowledge about regression techniques and a technical background she might need some context in some of the steps to fully understand our deliverable.

Of course after presenting the different alternatives with its performance metrics we have to describe the one that is obtaining the best results.

This is the first work we do for this customer and we want to create a great impression so we can become their data science partners for the rest of their projects, so use all you've got to create a great model with a performance better than the competition!!

## **Project Outline**
#### **1. Data Explorartion**
#### **2. Data Preprocessing**
* **2.1 Data Cleaning**
* **2.2 Summary Report**
* **2.3 Model Functions**
#### **3. Model Comparison**
* **3.1 Linear Regression**
* **3.2  Support Vector Machines**
* **3.3 Linear SVR**
* **3.4 MLP Regressor**
* **3.5 Decision Tree Regressor**
* **3.6 Stochastic Gradient Descent**
* **3.7 Random Forest with GridSearchCV**
* **3.8 XGB**
* **3.9 LGBM**
* **3.10 GradientBoostingRegressor with HyperOpt**
* **3.11 RidgeRegressor**
* **3.12 BaggingRegressor**
* **3.13 ExtraTreesRegressor**
* **3.14 AdaBoostRegressor**
* **3.15 VotingRegressor**
#### **4. Model Selection**
#### **5. Prediction**

In [4]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
import ydata_profiling as pp

# Models
from sklearn.linear_model import LinearRegression, SGDRegressor, RidgeCV
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, VotingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
import sklearn.model_selection
from sklearn.model_selection import cross_val_predict as cvp
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import xgboost as xgb
import lightgbm as lgb

# Model tuning
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe, space_eval

# Error handling
import warnings
warnings.filterwarnings("ignore")

# Data visualization
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

## **1. Data Exploration**

In the first step we will explore the data and get a better understanding of the data. We will look at the data types, missing values, and the distribution of the data. We will also look at the correlation between the variables and the target variable.

In [5]:
train0 = pd.read_csv('../data/car_data.csv')
train0.head(5)

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


In [6]:
# Regression for all variables vs Selling_Price in plotly dark theme

fig = px.scatter_matrix(train0, dimensions=["Present_Price", "Kms_Driven", "Owner", "Year", "Selling_Price"], color="Selling_Price", title="Regression  for all variables vs Selling_Price in plotly dark theme")
fig.update_traces(diagonal_visible=False, marker=dict(size=2, line=dict(width=1, color='DarkSlateGrey')), selector=dict(mode='markers'))
fig.update_layout(width=1000, height=600)
# invert the colorscale for the y-axis
fig.update_layout(coloraxis_colorscale='sunsetdark')

fig.show()

**Hint:** For visibility purposes the color is prechoosen. To adjust to audience needs, use one of the following color scales
for continuous data.


In [7]:
named_colorscales = list(px.colors.named_colorscales())
named_colorscales

['aggrnyl',
 'agsunset',
 'blackbody',
 'bluered',
 'blues',
 'blugrn',
 'bluyl',
 'brwnyl',
 'bugn',
 'bupu',
 'burg',
 'burgyl',
 'cividis',
 'darkmint',
 'electric',
 'emrld',
 'gnbu',
 'greens',
 'greys',
 'hot',
 'inferno',
 'jet',
 'magenta',
 'magma',
 'mint',
 'orrd',
 'oranges',
 'oryel',
 'peach',
 'pinkyl',
 'plasma',
 'plotly3',
 'pubu',
 'pubugn',
 'purd',
 'purp',
 'purples',
 'purpor',
 'rainbow',
 'rdbu',
 'rdpu',
 'redor',
 'reds',
 'sunset',
 'sunsetdark',
 'teal',
 'tealgrn',
 'turbo',
 'viridis',
 'ylgn',
 'ylgnbu',
 'ylorbr',
 'ylorrd',
 'algae',
 'amp',
 'deep',
 'dense',
 'gray',
 'haline',
 'ice',
 'matter',
 'solar',
 'speed',
 'tempo',
 'thermal',
 'turbid',
 'armyrose',
 'brbg',
 'earth',
 'fall',
 'geyser',
 'prgn',
 'piyg',
 'picnic',
 'portland',
 'puor',
 'rdgy',
 'rdylbu',
 'rdylgn',
 'spectral',
 'tealrose',
 'temps',
 'tropic',
 'balance',
 'curl',
 'delta',
 'oxy',
 'edge',
 'hsv',
 'icefire',
 'phase',
 'twilight',
 'mrybm',
 'mygbm']

### **2. Data Preprocessing**
In the second step we will preprocess the data. We will look at the data types, missing values, and the distribution of the data. We will also look at the correlation between the variables and the target variable.

In [8]:
# Determination categorical features
numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categorical_columns = []
features = train0.columns.values.tolist()
for col in features:
    if train0[col].dtype in numerics: continue
    categorical_columns.append(col)
# Encoding categorical features
for col in categorical_columns:
    if col in train0.columns:
        le = LabelEncoder()
        le.fit(list(train0[col].astype(str).values))
        train0[col] = le.transform(list(train0[col].astype(str).values))

In [9]:
train0['Year'] = (train0['Year']-1900).astype(int)
train0['Selling_Price'] = train0['Selling_Price'].astype(int)
train0['Present_Price'] = train0['Present_Price'].astype(int)

In [10]:
train0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Car_Name       301 non-null    int64
 1   Year           301 non-null    int64
 2   Selling_Price  301 non-null    int64
 3   Present_Price  301 non-null    int64
 4   Kms_Driven     301 non-null    int64
 5   Fuel_Type      301 non-null    int64
 6   Seller_Type    301 non-null    int64
 7   Transmission   301 non-null    int64
 8   Owner          301 non-null    int64
dtypes: int64(9)
memory usage: 21.3 KB


In [11]:
train0['Selling_Price'].value_counts()

Selling_Price
0     78
4     34
5     30
1     29
3     28
2     24
6     17
7     15
8     10
9      8
11     6
23     4
14     4
10     3
18     2
19     2
12     2
16     1
33     1
35     1
20     1
17     1
Name: count, dtype: int64

In [12]:
train0.corr()

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
Car_Name,1.0,0.017265,0.492795,0.479768,0.064453,-0.371446,-0.829718,-0.059855,-0.081192
Year,0.017265,1.0,0.229302,-0.049269,-0.524342,-0.053643,-0.039896,0.000394,-0.182104
Selling_Price,0.492795,0.229302,1.0,0.877518,0.032597,-0.503878,-0.542332,-0.367935,-0.083877
Present_Price,0.479768,-0.049269,0.877518,1.0,0.206896,-0.439934,-0.515092,-0.349275,0.007089
Kms_Driven,0.064453,-0.524342,0.032597,0.206896,1.0,-0.166801,-0.101419,-0.16251,0.089216
Fuel_Type,-0.371446,-0.053643,-0.503878,-0.439934,-0.166801,1.0,0.352415,0.080466,0.055705
Seller_Type,-0.829718,-0.039896,-0.542332,-0.515092,-0.101419,0.352415,1.0,0.06324,0.124269
Transmission,-0.059855,0.000394,-0.367935,-0.349275,-0.16251,0.080466,0.06324,1.0,-0.050316
Owner,-0.081192,-0.182104,-0.083877,0.007089,0.089216,0.055705,0.124269,-0.050316,1.0


In [13]:
# Correlation matrix
corrmat = train0.corr()
fig = ff.create_annotated_heatmap(z=corrmat.values, x=list(corrmat.columns), y=list(corrmat.index), annotation_text=corrmat.round(2).values, colorscale='tropic')
fig.update_layout(title_text='Correlation matrix with rounded numbers in size 10x10', width=1000, height=600)
fig.show()

In [14]:
train0.describe()

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
count,301.0,301.0,301.0,301.0,301.0,301.0,301.0,301.0,301.0
mean,62.571429,113.627907,4.215947,7.033223,36947.20598,1.787375,0.352159,0.86711,0.043189
std,25.573535,2.891554,5.098681,8.663654,38886.883882,0.425801,0.478439,0.340021,0.247915
min,0.0,103.0,0.0,0.0,500.0,0.0,0.0,0.0,0.0
25%,47.0,112.0,0.0,1.0,15000.0,2.0,0.0,1.0,0.0
50%,69.0,114.0,3.0,6.0,32000.0,2.0,0.0,1.0,0.0
75%,82.0,116.0,6.0,9.0,48767.0,2.0,1.0,1.0,0.0
max,97.0,118.0,35.0,92.0,500000.0,2.0,1.0,1.0,3.0


### **2.2. Summary Report**
The summary report delivers a quick overview of the data. It is a great way to get a quick overview of the data and to identify potential problems. The report is generated using the ydata_profiling library.

In [15]:
train0 = train0.dropna()
# pp.ProfileReport(train0)

In [16]:
target_name = 'Selling_Price'
train_target0 = train0[target_name]
train0 = train0.drop([target_name], axis=1)

In [17]:
# Synthesis test0 from train0
train0, test0, train_target0, test_target0 = train_test_split(train0, train_target0, test_size=0.2, random_state=0)

In [18]:
# For boosting model
train0b = train0
train_target0b = train_target0
# Synthesis valid as test for selection models
trainb, testb, targetb, target_testb = train_test_split(train0b, train_target0b, test_size=0.2, random_state=0)

In [19]:
#For models from Sklearn
scaler = StandardScaler()
train0 = pd.DataFrame(scaler.fit_transform(train0), columns = train0.columns)

In [20]:
train0.head(3)

Unnamed: 0,Car_Name,Year,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,0.598144,0.147092,0.10899,1.05706,-1.880246,-0.754074,0.37073,-0.139347
1,0.519998,-0.531795,-0.336246,-0.352927,0.482367,-0.754074,0.37073,-0.139347
2,0.754436,1.165424,3.225646,-0.774061,-1.880246,-0.754074,-2.697381,-0.139347


In [21]:
len(train0)

240

In [22]:
# Synthesis valid as test for selection models
train, test, target, target_test = train_test_split(train0, train_target0, test_size=0.2, random_state=0)

In [23]:
train.head(3)

Unnamed: 0,Car_Name,Year,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
110,0.871655,-0.871239,-0.113628,-0.165177,0.482367,-0.754074,0.37073,-0.139347
239,-1.902534,0.147092,-0.781483,-0.590949,0.482367,1.32613,0.37073,-0.139347
63,-1.433657,0.486536,-0.781483,-0.757414,0.482367,1.32613,0.37073,-0.139347


In [24]:
test.head(3)

Unnamed: 0,Car_Name,Year,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
109,0.754436,-0.531795,2.557791,1.57327,-1.880246,-0.754074,-2.697381,-0.139347
71,0.949802,0.82598,2.001245,-0.227103,-1.880246,-0.754074,-2.697381,-0.139347
37,-1.98068,1.165424,-0.670174,-0.888209,0.482367,1.32613,0.37073,-0.139347


In [25]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 192 entries, 110 to 172
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       192 non-null    float64
 1   Year           192 non-null    float64
 2   Present_Price  192 non-null    float64
 3   Kms_Driven     192 non-null    float64
 4   Fuel_Type      192 non-null    float64
 5   Seller_Type    192 non-null    float64
 6   Transmission   192 non-null    float64
 7   Owner          192 non-null    float64
dtypes: float64(8)
memory usage: 13.5 KB


In [26]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48 entries, 109 to 7
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       48 non-null     float64
 1   Year           48 non-null     float64
 2   Present_Price  48 non-null     float64
 3   Kms_Driven     48 non-null     float64
 4   Fuel_Type      48 non-null     float64
 5   Seller_Type    48 non-null     float64
 6   Transmission   48 non-null     float64
 7   Owner          48 non-null     float64
dtypes: float64(8)
memory usage: 3.4 KB


In [27]:
acc_train_r2 = []
acc_test_r2 = []
acc_train_d = []
acc_test_d = []
acc_train_rmse = []
acc_test_rmse = []
acc_train_mae = []
acc_test_mae = []

### **2.3 Model Functions**
Packing our model functions in a single function allows us to easily call them later for multiple models. We will use the following functions:

In [28]:
def acc_d(y_meas, y_pred):
    # Relative error between predicted y_pred and measured y_meas values
    return mean_absolute_error(y_meas, y_pred)*len(y_meas)/sum(abs(y_meas))

def acc_rmse(y_meas, y_pred):
    # RMSE between predicted y_pred and measured y_meas values
    return (mean_squared_error(y_meas, y_pred))**0.5

In [29]:
def acc_boosting_model(num,model,train,test,num_iteration=0):
    # Calculation of accuracy of boosting model by different metrics

    global acc_train_r2, acc_test_r2, acc_train_d, acc_test_d, acc_train_rmse, acc_test_rmse, acc_train_mae, acc_test_mae

    if num_iteration > 0:
        ytrain = model.predict(train, num_iteration = num_iteration)
        ytest = model.predict(test, num_iteration = num_iteration)
    else:
        ytrain = model.predict(train)
        ytest = model.predict(test)

    print('target = ', targetb[:5].values)
    print('ytrain = ', ytrain[:5])

    acc_train_r2_num = round(r2_score(targetb, ytrain) * 100, 2)
    print('acc(r2_score) for train =', acc_train_r2_num)
    acc_train_r2.insert(num, acc_train_r2_num)

    acc_train_d_num = round(acc_d(targetb, ytrain) * 100, 2)
    print('acc(relative error) for train =', acc_train_d_num)
    acc_train_d.insert(num, acc_train_d_num)

    acc_train_rmse_num = round(acc_rmse(targetb, ytrain) * 100, 2)
    print('acc(rmse) for train =', acc_train_rmse_num)
    acc_train_rmse.insert(num, acc_train_rmse_num)

    acc_train_mae_num = round(mean_absolute_error(targetb, ytrain) * 100, 2)
    print('acc(mae) for train =', acc_train_mae_num)
    acc_train_mae.insert(num, acc_train_mae_num)

    print('target_test =', target_testb[:5].values)
    print('ytest =', ytest[:5])

    acc_test_r2_num = round(r2_score(target_testb, ytest) * 100, 2)
    print('acc(r2_score) for test =', acc_test_r2_num)
    acc_test_r2.insert(num, acc_test_r2_num)

    acc_test_d_num = round(acc_d(target_testb, ytest) * 100, 2)
    print('acc(relative error) for test =', acc_test_d_num)
    acc_test_d.insert(num, acc_test_d_num)

    acc_test_rmse_num = round(acc_rmse(target_testb, ytest) * 100, 2)
    print('acc(rmse) for test =', acc_test_rmse_num)
    acc_test_rmse.insert(num, acc_test_rmse_num)

    acc_test_mae_num = round(mean_absolute_error(target_testb, ytest) * 100, 2)
    print('acc(mae) for test =', acc_test_mae_num)
    acc_test_mae.insert(num, acc_test_mae_num)


In [30]:
def acc_model(num,model,train,test):
    # Calculation of accuracy of model акщь Sklearn by different metrics

    global acc_train_r2, acc_test_r2, acc_train_d, acc_test_d, acc_train_rmse, acc_test_rmse, acc_train_mae, acc_test_mae

    ytrain = model.predict(train)
    ytest = model.predict(test)

    print('target = ', target[:5].values)
    print('ytrain = ', ytrain[:5])

    acc_train_r2_num = round(r2_score(target, ytrain) * 100, 2)
    print('acc(r2_score) for train =', acc_train_r2_num)
    acc_train_r2.insert(num, acc_train_r2_num)

    acc_train_d_num = round(acc_d(target, ytrain) * 100, 2)
    print('acc(relative error) for train =', acc_train_d_num)
    acc_train_d.insert(num, acc_train_d_num)

    acc_train_rmse_num = round(acc_rmse(target, ytrain) * 100, 2)
    print('acc(rmse) for train =', acc_train_rmse_num)
    acc_train_rmse.insert(num, acc_train_rmse_num)

    acc_train_mae_num = round(mean_absolute_error(target, ytrain) * 100, 2)
    print('acc(mae) for train =', acc_train_mae_num)
    acc_train_mae.insert(num, acc_train_mae_num)

    print('target_test =', target_test[:5].values)
    print('ytest =', ytest[:5])

    acc_test_r2_num = round(r2_score(target_test, ytest) * 100, 2)
    print('acc(r2_score) for test =', acc_test_r2_num)
    acc_test_r2.insert(num, acc_test_r2_num)

    acc_test_d_num = round(acc_d(target_test, ytest) * 100, 2)
    print('acc(relative error) for test =', acc_test_d_num)
    acc_test_d.insert(num, acc_test_d_num)

    acc_test_rmse_num = round(acc_rmse(target_test, ytest) * 100, 2)
    print('acc(rmse) for test =', acc_test_rmse_num)
    acc_test_rmse.insert(num, acc_test_rmse_num)

    acc_test_mae_num = round(mean_absolute_error(target_test, ytest) * 100, 2)
    print('acc(mae) for test =', acc_test_mae_num)
    acc_test_mae.insert(num, acc_test_mae_num)


## 3. Model Comparison
Now, that we have prepared our data and defined our functions, we can start to compare different models. We will use the above modelsnamed in the outline.

### 3.1 Linear Regression
Linear Regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. Reference Wikipedia.

Note the confidence score generated by the model based on our training dataset.

Model Improvement:
1. Feature Selection: Choose the most important features that have a high correlation with the target variable. Removing unnecessary or irrelevant features can improve the model's performance.

2. Data Cleaning: Remove outliers, missing values, and duplicates that can influence the model's accuracy.

3. Normalization: Normalizing the data can improve the model's performance by reducing the effect of large or small values.

4. Regularization: Regularization techniques such as L1 and L2 can help to reduce overfitting by adding a penalty term to the cost function.

5. Bias-Variance tradeoff: Balancing model bias and variance can help to improve the model's performance by reducing underfitting and overfitting.

6. Cross-validation: Use cross-validation techniques to estimate the model's performance and avoid overfitting.

7. Hyperparameter tuning: Experiment with different hyperparameters such as learning rates and regularization parameters to find the optimal values that improve the model's performance.



In [31]:
# Linear Regression

linreg = LinearRegression()
linreg.fit(train, target)
acc_model(0,linreg,train,test)

target =  [2 0 0 0 7]
ytrain =  [2.64426551 0.49514931 0.83348214 0.87416613 6.57241738]
acc(r2_score) for train = 86.53
acc(relative error) for train = 30.5
acc(rmse) for train = 187.52
acc(mae) for train = 126.29
target_test = [14 20  1  9  0]
ytest = [16.60386604 16.41219025  2.1424285   9.64345065  3.77632914]
acc(r2_score) for test = 82.07
acc(relative error) for test = 42.93
acc(rmse) for test = 214.93
acc(mae) for test = 164.57


In [32]:
# Linear Regression with Bagging
# Get params for Linear Regression
params = linreg.get_params()
params

{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}

In [33]:
# Linear Regression Grid Search
# Create the parameter grid based on the results of random search
parameters = {
    'copy_X': [True, False],
    'fit_intercept': [True, False],
    'positive': [True, False],
    'n_jobs': [-1, 1, 2, 3, 4],
}

grid_search = GridSearchCV(estimator=linreg, param_grid=parameters, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(train, target)
grid_search.best_params_

Fitting 5 folds for each of 40 candidates, totalling 200 fits


{'copy_X': True, 'fit_intercept': True, 'n_jobs': -1, 'positive': True}

### 3.2 Support Vector Machines
Support Vector Machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier. Obtained from Wikipedia.

Model Improvement:
1. Choosing optimal parameters: SVMs have several parameters such as the regularization parameter (C), kernel function, kernel parameters, and gamma parameter. Choosing the optimal values for these parameters can improve the performance of SVMs.

2. Feature selection: Feature selection can help to reduce the dimensionality of the data and focus on the most relevant features. This can improve the performance of SVMs by reducing overfitting.

3. Choosing the right kernel function: SVMs use kernel functions to map the data into a higher-dimensional space. Choosing the right kernel function can improve the performance of SVMs.

4. Data preprocessing: Data preprocessing techniques such as normalization, scaling, and outlier removal can help to improve the performance of SVMs.

5. Using ensemble methods: Ensemble methods such as bagging and boosting can improve the performance of SVMs by building multiple classifiers and combining their predictions.

6. Using kernel methods: Kernel methods such as linear SVM and kernel ridge regression can improve the performance of SVMs by providing a better way of handling linear and nonlinear relationships between variables.


In [105]:
# Support Vector Machines

svr = SVR()
svr.fit(train, target)
acc_model(1,svr,train,test)

target =  [2 0 0 0 7]
ytrain =  [ 2.50442859 -0.0289507   0.10061086  0.61474311  6.00538421]
acc(r2_score) for train = 48.61
acc(relative error) for train = 31.39
acc(rmse) for train = 366.3
acc(mae) for train = 129.99
target_test = [14 20  1  9  0]
ytest = [7.26926331 8.78543825 0.34673698 5.98769573 0.46354626]
acc(r2_score) for test = 63.97
acc(relative error) for test = 36.2
acc(rmse) for test = 304.66
acc(mae) for test = 138.75


In [163]:
# GridSearch for Linear SVR

parameters = {'C': [0.1, 1, 10, 100], 'gamma': [10, 1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']}
grid = GridSearchCV(SVR(), parameters, refit=True, verbose=2)
grid.fit(train, target)
grid.best_params_

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV] END ........................C=0.1, gamma=10, kernel=rbf; total time=   0.0s
[CV] END ........................C=0.1, gamma=10, kernel=rbf; total time=   0.0s
[CV] END ........................C=0.1, gamma=10, kernel=rbf; total time=   0.0s
[CV] END ........................C=0.1, gamma=10, kernel=rbf; total time=   0.0s
[CV] END ........................C=0.1, gamma=10, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.

{'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}

### 3.3 Linear Support Vector Regression
Linear SVR is a similar to SVM method. Its also builds on kernel functions but is appropriate for unsupervised learning. Reference Wikipedia.

Model Improvement:
1. Feature engineering: By selecting the right set of features or transforming them, one can improve the accuracy of the model significantly. This involves identifying the relevant features and removing irrelevant ones as well as creating new composite features that capture the underlying patterns in the data.

2. Regularization: By adding regularization terms to the objective function, one can prevent overfitting and improve the generalization of the model. A popular approach is to add L1 or L2 regularization terms, which encourage the coefficients to be small.

3. Hyperparameter tuning: Linear SVR has several hyperparameters such as the regularization parameter, kernel parameter, and penalty term, that can be fine-tuned to improve the performance of the model. Grid search or randomized search can be used to find the optimal hyperparameters.

4. Scaling inputs: Since Linear SVR uses a distance metric to measure similarity between data points, it is sensitive to the scale of the input features. By scaling the input features to a common range, Linear SVR can perform better and converge faster.

5. Collecting more data: When there is insufficient data, the model may not generalize well. Collecting more data of the right quality can lead to a more reliable and robust model.

6. Use a non-linear model: If the relationship between the input features and target variable is non-linear, a non-linear SVR algorithm such as RBF kernel can be used to improve the performance.


In [106]:
# Linear SVR

linear_svr = LinearSVR()
linear_svr.fit(train, target)
acc_model(2,linear_svr,train,test)

target =  [2 0 0 0 7]
ytrain =  [2.97532147 0.2284961  0.41958991 0.42128161 5.34079231]
acc(r2_score) for train = 83.77
acc(relative error) for train = 27.75
acc(rmse) for train = 205.85
acc(mae) for train = 114.9
target_test = [14 20  1  9  0]
ytest = [15.22096721 14.13512563  1.41598373  9.43410964  1.86992038]
acc(r2_score) for test = 84.15
acc(relative error) for test = 36.12
acc(rmse) for test = 202.05
acc(mae) for test = 138.47


In [161]:
# GridSearch for Linear SVR

linear_svr = LinearSVR()
parameters = {'C': [0.01, 0.1, 1, 10, 100, 1000], 'max_iter': [1000, 10000]}

clf = GridSearchCV(linear_svr, parameters, cv=5)
clf.fit(train, target)
clf.best_params_


{'C': 1, 'max_iter': 10000}

### 3.4 MLP Regressor
MLPRegressor is a multi-layer Perceptron regressor. This model optimizes the squared-loss using LBFGS or stochastic gradient descent. Reference Scikit-learn.

Model Improvement:
1. Feature engineering: Feature engineering is the process of selecting and transforming relevant variables in your dataset to improve model performance. Look for variables that have a strong relationship with your target variable and engineer new variables by combining or transforming existing variables.

2. Model architecture: Change the architecture of your MLP Regressor. This includes increasing the number of layers, neurons, and tuning the activation function.

3. Regularization: Regularization is a technique used to reduce overfitting by adding a penalty term to the loss function. L1 and L2 regularization can be used to prevent overfitting.

4. Hyperparameter optimization: Hyperparameters are the variables that control the behavior of the model. Tune the hyperparameters of the MLP Regressor, such as the learning rate, batch size, number of epochs, etc. to improve model performance.

5. Data preprocessing: Preprocess your data by scaling or normalizing it, and dealing with missing or outlier values. This can help improve model performance.

6. Ensemble learning: Ensemble learning involves combining multiple models to make more accurate predictions. Consider using a combination of MLP Regressors or other models to improve overall performance.

7. Increase training data: Ensure you have enough training data to train and validate the model effectively. Increasing the amount of training data will help the model learn better patterns and relationships.


In [107]:
# MLPRegressor

mlp = MLPRegressor()
param_grid = {'hidden_layer_sizes': [i for i in range(2,20)],
              'activation': ['relu'],
              'solver': ['adam'],
              'learning_rate': ['constant'],
              'learning_rate_init': [0.01],
              'power_t': [0.5],
              'alpha': [0.0001],
              'max_iter': [1000],
              'early_stopping': [True],
              'warm_start': [False]}
mlp_GS = GridSearchCV(mlp, param_grid=param_grid,
                      cv=10, verbose=True, pre_dispatch='2*n_jobs')
mlp_GS.fit(train, target)
acc_model(3,mlp_GS,train,test)

Fitting 10 folds for each of 18 candidates, totalling 180 fits
target =  [2 0 0 0 7]
ytrain =  [3.13541036 0.12141531 0.36879656 1.00992005 6.24039399]
acc(r2_score) for train = 87.88
acc(relative error) for train = 29.94
acc(rmse) for train = 177.86
acc(mae) for train = 123.97
target_test = [14 20  1  9  0]
ytest = [17.37889276 19.03160785  0.43155143  8.20112166  1.69732631]
acc(r2_score) for test = 86.05
acc(relative error) for test = 32.42
acc(rmse) for test = 189.55
acc(mae) for test = 124.28


In [159]:
# GridSearch for MLPRegressor

mlp = MLPRegressor(max_iter=100)

parameter = {
    'hidden_layer_sizes': [(10,30,10),(20,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

mlp_GS = GridSearchCV(mlp, parameter, cv=10, verbose=True, pre_dispatch='2*n_jobs')
mlp_GS.fit(train, target)
mlp_GS.best_params_

Fitting 10 folds for each of 32 candidates, totalling 320 fits


{'activation': 'relu',
 'alpha': 0.0001,
 'hidden_layer_sizes': (10, 30, 10),
 'learning_rate': 'constant',
 'solver': 'sgd'}

### 3.5 Stochastic Gradient Descent
Stochastic gradient descent (often shortened to SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Reference Wikipedia.

Model Improvement:
1. Use a smaller learning rate: Stochastic Gradient Descent (SGD) can become unstable if the learning rate is too large. Reducing the learning rate can make the algorithm converge more slowly but it can prevent it from overshooting the optimal solution or getting trapped in poor local minima.

2. Increase the batch size: By increasing the batch size, you can get better and more stable updates to the weights. Larger batches can help smooth out the noise in the gradients, which can make the convergence more steady.

3. Implement better initialization techniques: Proper initialization of the model’s parameters can significantly improve the optimization process of the algorithm. Some techniques include Xavier Initialization or variations of it, He Initialization or Batch Normalization.

4. Regularize the model: Regularization techniques like L1 or L2 regularization or Dropout can help to prevent overfitting, which can help to prevent the model from becoming unstable and avoid getting stuck in poor local minima.

5. Use adaptive learning rate methods such as Adam or RMSprop which adjusts the learning rate dynamically based on the estimated second moment of the gradient, and thus can handle varying gradient magnitudes more effectively.

6. Train for more epochs: By training the model for more epochs, the algorithm has more time to learn the underlying patterns in the data and provides more stable convergence. It is important to stop the training at an appropriate time though in order to avoid overfitting.

By following the above tips, you can significantly improve the performance of a Stochastic Gradient Descent algorithm.


In [108]:
# Stochastic Gradient Descent

sgd = SGDRegressor()
sgd.fit(train, target)
acc_model(4,sgd,train,test)

target =  [2 0 0 0 7]
ytrain =  [2.64748499 0.43895775 0.81075472 0.93583095 6.56794136]
acc(r2_score) for train = 86.53
acc(relative error) for train = 30.63
acc(rmse) for train = 187.57
acc(mae) for train = 126.82
target_test = [14 20  1  9  0]
ytest = [16.60725345 16.4513307   2.09702845  9.61476803  3.81247281]
acc(r2_score) for test = 82.05
acc(relative error) for test = 42.94
acc(rmse) for test = 215.06
acc(mae) for test = 164.6


In [158]:
# GridSearch
parameters = {
    'alpha': 10.0 ** -np.arange(1, 7),
    'loss': ['squared_loss', 'huber', 'epsilon_insensitive'],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'learning_rate': ['constant', 'optimal', 'invscaling'],
}
grid_search = GridSearchCV(sgd, parameters, cv=10, verbose=True, pre_dispatch='2*n_jobs')
grid_search.fit(train, target)
grid_search.best_params_


Fitting 10 folds for each of 162 candidates, totalling 1620 fits


{'alpha': 0.01,
 'learning_rate': 'constant',
 'loss': 'epsilon_insensitive',
 'penalty': 'l2'}

### 3.6 Descision Tree Regressor
This model uses a Decision Tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Reference Wikipedia.

Model Improvement:
1. Increase the number of trees: Increasing the number of trees in the forest can help improve the model's performance. However, there is a tradeoff between model performance and computational complexity.

2. Tune hyperparameters: Hyperparameters such as the number of features to consider when splitting, the maximum depth of the tree, and the minimum number of samples required to split a node can greatly affect the performance of the model. Grid search or random search can be used to find the best hyperparameters.

3. Feature selection: Feature selection can help improve the model's performance by removing irrelevant or redundant features. Feature selection methods such as mutual information or recursive feature elimination can be used to identify the most important features.

4. Ensemble methods: Ensemble methods such as bagging or boosting can be used to improve the performance of the model. Bagging involves creating multiple models on different subsets of the data, while boosting involves iteratively creating models that focus on the misclassified samples.

5. Handling imbalanced data: If the data is imbalanced, meaning one class has significantly more samples than the other, techniques such as oversampling or undersampling can be used to balance the data and improve the performance of the model.

6. Preprocessing: Preprocessing the data can also help improve the performance of the model. Techniques such as normalization or scaling can help improve the accuracy of the model by ensuring that all features are on the same scale.

7. Regularization: Regularization techniques such as L1 or L2 regularization can help prevent overfitting of the model and improve its generalization performance.


In [125]:
# Decision Tree Regression
# Individual Parameters for Decision Tree
param_grid = {'criterion': ['mse', 'friedman_mse', 'mae'],
                'splitter': ['best', 'random'],
                'max_depth': [i for i in range(2,20)],
                'min_samples_split': [i for i in range(2,20)],
                'min_samples_leaf': [i for i in range(2,20)],
                'min_weight_fraction_leaf': [0.0],
                'max_features': ['auto', 'sqrt', 'log2', None],
                'random_state': [None],
                'max_leaf_nodes': [None],
                'min_impurity_decrease': [0.0],
                'min_impurity_split': [None],
                'presort': [False]}

decision_tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=15, min_samples_split=10)
decision_tree.fit(train, target)
acc_model(5,decision_tree,train,test)

target =  [2 0 0 0 7]
ytrain =  [3.2962963  0.         0.         0.         5.72413793]
acc(r2_score) for train = 70.64
acc(relative error) for train = 30.9
acc(rmse) for train = 276.88
acc(mae) for train = 127.94
target_test = [14 20  1  9  0]
ytest = [16.26666667 16.26666667  0.77272727 16.26666667  0.        ]
acc(r2_score) for test = 64.29
acc(relative error) for test = 42.79
acc(rmse) for test = 303.32
acc(mae) for test = 164.03


### 3.7 Random Forest
Random Forest is one of the most popular model. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (n_estimators= [100, 300]) at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Reference Wikipedia.

Model Improvement:
1. Increasing the number of trees: Gradient boosting works by combining the predictions of multiple weak learners (i.e., trees). Adding more trees can improve the model's accuracy, but may also increase its complexity and lead to overfitting.

2. Tuning the learning rate: The learning rate determines the contribution of each tree to the final prediction. A smaller learning rate results in a more robust model but requires more trees to achieve the same level of accuracy.

3. Regularizing the model: Regularization techniques, such as L1/L2 regularization, can help prevent overfitting by penalizing large coefficients. Another option is early stopping, which stops training once the performance on a validation set starts to degrade.

4. Addressing missing values: Gradient boosting algorithms cannot handle missing values, so it is important to either impute them or exclude them from the dataset.

5. Feature engineering: Feature engineering involves creating new features or transforming existing ones to better capture the underlying patterns in the data. This can help improve model performance by providing more relevant information to the algorithm.

6. Using a more advanced algorithm: Gradient boosting is just one of many regression algorithms available. It may be worth exploring other algorithms, such as neural networks or support vector machines, to see if they provide better performance on your dataset.


In [149]:
# Random Forest

#random_forest = GridSearchCV(estimator=RandomForestRegressor(), param_grid={'n_estimators': [100, 1000]}, cv=5)
random_forest = RandomForestRegressor()
random_forest.fit(train, target)
acc_model(6,random_forest,train,test)

target =  [2 0 0 0 7]
ytrain =  [2.52 0.   0.   0.   6.85]
acc(r2_score) for train = 98.32
acc(relative error) for train = 7.04
acc(rmse) for train = 66.23
acc(mae) for train = 29.16
target_test = [14 20  1  9  0]
ytest = [15.25 19.12  0.85  6.76  0.  ]
acc(r2_score) for test = 96.83
acc(relative error) for test = 15.38
acc(rmse) for test = 90.39
acc(mae) for test = 58.96


In [150]:
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
print(random_forest.get_params())

Parameters currently in use:

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


In [153]:
# Tuning the parameters
parameters = {'bootstrap': [True],
              'max_depth': [ 70, 80, 90],
              'max_features': [2,3,4,5,6,7],
              'min_samples_leaf': [1, 2, 4],
              'min_samples_split': [2, 5, 10],
              'n_estimators': [100, 200, 300]}
# GridSearch
grid_search = GridSearchCV(estimator = random_forest, param_grid = parameters, cv = 5, n_jobs = -1, verbose = 2)
grid_search = grid_search.fit(train, target)
grid_search.best_params_

Fitting 5 folds for each of 486 candidates, totalling 2430 fits


{'bootstrap': True,
 'max_depth': 90,
 'max_features': 7,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 200}

### 3.8 XGBoost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples. Reference Wikipedia.

Model Improvement:
1. Feature Selection: By selecting only the relevant features, you can reduce the complexity of the model and prevent overfitting.

2. Regularization Strength: By adjusting the regularization strength parameter (alpha), you can control the trade-off between fitting the training data well and avoiding overfitting.

3. Scaling: Scaling the data can help to improve the performance of the model, especially when the features have vastly different ranges.

4. Cross-Validation: Cross-validation can help to evaluate model performance and select the best hyperparameters.

5. Grid Search: Grid search can be used to identify the optimal hyperparameters for the model by trying different combinations of hyperparameters.

6. Ensemble Methods: You can use ensemble methods like bagging or boosting to improve the performance of the model by combining multiple models.


In [111]:
xgb_clf = xgb.XGBRegressor(objective = 'reg:squarederror')
parameters = {'n_estimators': [60, 100, 120, 140],
              'learning_rate': [0.01, 0.1],
              'max_depth': [5, 7],
              'reg_lambda': [0.5]}
xgb_reg = GridSearchCV(estimator=xgb_clf, param_grid=parameters, cv=5, n_jobs=-1).fit(trainb, targetb)
print("Best score: %0.3f" % xgb_reg.best_score_)
print("Best parameters set:", xgb_reg.best_params_)
acc_boosting_model(7,xgb_reg,trainb,testb)

Best score: 0.863
Best parameters set: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 140, 'reg_lambda': 0.5}
target =  [2 0 0 0 7]
ytrain =  [ 2.038697    0.00718174  0.01788602 -0.01223567  6.882671  ]
acc(r2_score) for train = 99.97
acc(relative error) for train = 1.44
acc(rmse) for train = 8.69
acc(mae) for train = 5.96
target_test = [14 20  1  9  0]
ytest = [ 1.4035552e+01  2.2608608e+01  9.0486073e-01  8.2483301e+00
 -2.8453700e-03]
acc(r2_score) for test = 97.64
acc(relative error) for test = 11.93
acc(rmse) for test = 77.93
acc(mae) for test = 45.74


### 3.9 LightGBM
LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages: Faster training speed and higher efficiency. Lower memory usage. Better accuracy. Support of parallel and GPU learning. Capable of handling large-scale data. Reference Wikipedia.

Model Improvement:
1. Tune hyperparameters: XGBoost has many hyperparameters that can affect model performance, such as the learning rate, number of trees, max depth, min child weight, and subsample rate. Experimenting with different combinations of hyperparameters can often lead to significant improvements in performance. You can use techniques like grid search or random search to find the optimal hyperparameters for your specific problem.

2. Feature engineering: Creating new features or transforming existing features can improve the predictive power of your model. For example, you can try binning continuous variables, creating interaction terms, or encoding categorical variables.

3. Increase the number of trees: The number of trees in an XGBoost model can greatly affect its performance. You can experiment with increasing the number of trees until you start to see diminishing returns in performance.

4. Use early stopping: XGBoost allows for early stopping, which can help prevent overfitting and improve model performance. You can monitor the performance of your model on a validation set and stop training once the performance stops improving.

5. Regularization: XGBoost offers several forms of regularization, such as L1 and L2 regularization, which can help prevent overfitting and improve model performance. You can experiment with different regularization parameters to find the optimal setting for your model.

6. Ensemble learning: You can combine multiple XGBoost models into an ensemble to improve performance. Ensemble methods like bagging, boosting, and stacking can help reduce variance and improve model accuracy.


In [112]:
Xtrain, Xval, Ztrain, Zval = train_test_split(trainb, targetb, test_size=0.2, random_state=0)
train_set = lgb.Dataset(Xtrain, Ztrain, silent=False)
valid_set = lgb.Dataset(Xval, Zval, silent=False)

In [113]:

params = {
    'boosting_type':'gbdt',
    'objective': 'regression',
    'num_leaves': 31,
    'learning_rate': 0.1,
    'max_depth': -1,
    'subsample': 0.8,
    'bagging_fraction' : 1,
    'max_bin' : 5000 ,
    'bagging_freq': 20,
    'colsample_bytree': 0.6,
    'metric': 'rmse',
    'min_split_gain': 0.5,
    'min_child_weight': 1,
    'min_child_samples': 10,
    'scale_pos_weight':1,
    'zero_as_missing': False,
    'seed':0,
    'lambda_l2': 0.1,
}
modelL = lgb.train(params, train_set = train_set, num_boost_round=1000,
                   early_stopping_rounds=800,verbose_eval=500, valid_sets=valid_set)

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 116
[LightGBM] [Info] Number of data points in the train set: 153, number of used features: 7
[LightGBM] [Info] Start training from score 4.215686
Training until validation scores don't improve for 800 rounds
[500]	valid_0's rmse: 1.83175
[1000]	valid_0's rmse: 1.83175
Did not meet early stopping. Best iteration is:
[286]	valid_0's rmse: 1.82767


In [114]:
# Use function to evaluate model
acc_boosting_model(8,modelL,trainb,testb)

target =  [2 0 0 0 7]
ytrain =  [ 1.53290397  0.05197494 -0.05307475 -0.44752419  7.32934543]
acc(r2_score) for train = 93.5
acc(relative error) for train = 17.04
acc(rmse) for train = 130.32
acc(mae) for train = 70.55
target_test = [14 20  1  9  0]
ytest = [16.77426511 21.00969488  0.45413043 13.27340528  1.18569097]
acc(r2_score) for test = 91.67
acc(relative error) for test = 28.96
acc(rmse) for test = 146.53
acc(mae) for test = 111.02


In [147]:
# Parameter Tuning
parameters = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [5, 7],
    'num_leaves': [31, 63],
    'min_child_samples': [10, 20],
    'min_child_weight': [1, 2],
    'min_split_gain': [0.5, 1],
    'subsample': [0.8, 1],
    'colsample_bytree': [0.6, 1],
    'reg_lambda': [0.1, 1],
    'n_estimators': [100, 200]
}

lgbm = lgb.LGBMRegressor()
grid_search = GridSearchCV(estimator=lgbm, param_grid=parameters, cv=5, n_jobs=-1).fit(trainb, targetb)
grid_search.best_params_

{'colsample_bytree': 1,
 'learning_rate': 0.1,
 'max_depth': 7,
 'min_child_samples': 10,
 'min_child_weight': 1,
 'min_split_gain': 0.5,
 'n_estimators': 100,
 'num_leaves': 31,
 'reg_lambda': 0.1,
 'subsample': 0.8}

### 3.10 Gradient Boosting Regressor with HyperOpt
Gradient Boosting builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage nclasses regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced. The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed. Reference sklearn documentation.

Model Improvement:
1. Increasing the number of trees: Gradient boosting works by combining the predictions of multiple weak learners (i.e., trees). Adding more trees can improve the model's accuracy, but may also increase its complexity and lead to overfitting.

2. Tuning the learning rate: The learning rate determines the contribution of each tree to the final prediction. A smaller learning rate results in a more robust model but requires more trees to achieve the same level of accuracy.

3. Regularizing the model: Regularization techniques, such as L1/L2 regularization, can help prevent overfitting by penalizing large coefficients. Another option is early stopping, which stops training once the performance on a validation set starts to degrade.

4. Addressing missing values: Gradient boosting algorithms cannot handle missing values, so it is important to either impute them or exclude them from the dataset.

5. Feature engineering: Feature engineering involves creating new features or transforming existing ones to better capture the underlying patterns in the data. This can help improve model performance by providing more relevant information to the algorithm.

6. Using a more advanced algorithm: Gradient boosting is just one of many regression algorithms available. It may be worth exploring other algorithms, such as neural networks or support vector machines, to see if they provide better performance on your dataset.


In [115]:
def hyperopt_gb_score(params):
    clf = GradientBoostingRegressor(**params)
    current_score = cross_val_score(clf, train, target, cv=10).mean()
    print(current_score, params)
    return current_score

space_gb = {
    'n_estimators': hp.choice('n_estimators', range(100, 1000)),
    'max_depth': hp.choice('max_depth', np.arange(2, 10, dtype=int))
}

best = fmin(fn=hyperopt_gb_score, space=space_gb, algo=tpe.suggest, max_evals=10)
print('best:')
print(best)

0.8354951061318934                                    
{'max_depth': 2, 'n_estimators': 439}                 
0.8323609222143945                                                              
{'max_depth': 6, 'n_estimators': 314}                                           
0.8116292707075103                                                              
{'max_depth': 8, 'n_estimators': 690}                                           
0.8407208415503119                                                              
{'max_depth': 2, 'n_estimators': 504}                                           
0.8388431883735775                                                              
{'max_depth': 2, 'n_estimators': 868}                                           
0.8127082863895939                                                              
{'max_depth': 9, 'n_estimators': 679}                                           
0.8577011374451102                                                              

In [116]:
params = space_eval(space_gb, best)
params

{'max_depth': 8, 'n_estimators': 690}

In [117]:
# Gradient Boosting Regression

gradient_boosting = GradientBoostingRegressor(**params)
gradient_boosting.fit(train, target)
acc_model(9,gradient_boosting,train,test)

target =  [2 0 0 0 7]
ytrain =  [ 2.00000000e+00 -3.79195170e-10  6.04709765e-09  1.98372188e-08
  6.99999997e+00]
acc(r2_score) for train = 100.0
acc(relative error) for train = 0.0
acc(rmse) for train = 0.0
acc(mae) for train = 0.0
target_test = [14 20  1  9  0]
ytest = [ 1.40000000e+01  2.29999190e+01  1.00001947e+00  7.96416511e+00
 -1.11907730e-07]
acc(r2_score) for test = 95.92
acc(relative error) for test = 15.33
acc(rmse) for test = 102.48
acc(mae) for test = 58.76


### 3.11 Ridge Regressor
Tikhonov Regularization, colloquially known as Ridge Regression, is the most commonly used regression algorithm to approximate an answer for an equation with no unique solution. This type of problem is very common in machine learning tasks, where the "best" solution must be chosen using limited data. If a unique solution exists, algorithm will return the optimal value. However, if multiple solutions exist, it may choose any of them. Reference Brilliant.org.

Model Improvement:
1. Feature Selection: By selecting only the relevant features, you can reduce the complexity of the model and prevent overfitting.

2. Regularization Strength: By adjusting the regularization strength parameter (alpha), you can control the trade-off between fitting the training data well and avoiding overfitting.

3. Scaling: Scaling the data can help to improve the performance of the model, especially when the features have vastly different ranges.

4. Cross-Validation: Cross-validation can help to evaluate model performance and select the best hyperparameters.

5. Grid Search: Grid search can be used to identify the optimal hyperparameters for the model by trying different combinations of hyperparameters.

6. Ensemble Methods: You can use ensemble methods like bagging or boosting to improve the performance of the model by combining multiple models.


In [118]:
# Ridge Regressor

ridge = RidgeCV(cv=5)
ridge.fit(train, target)
acc_model(10,ridge,train,test)

target =  [2 0 0 0 7]
ytrain =  [2.80302314 0.42630154 0.79777369 0.973802   6.57824184]
acc(r2_score) for train = 86.34
acc(relative error) for train = 30.68
acc(rmse) for train = 188.83
acc(mae) for train = 127.02
target_test = [14 20  1  9  0]
ytest = [16.19815843 16.05746353  1.96791578  9.49766861  3.76771824]
acc(r2_score) for test = 81.88
acc(relative error) for test = 43.05
acc(rmse) for test = 216.04
acc(mae) for test = 165.03


In [146]:
# Hyperparameter tuning
parameters = {'alpha_per_target': [0.1, 1, 10, 100],
                'fit_intercept': [True,False]
              }
ridge = RidgeCV()
clf = GridSearchCV(ridge, parameters)
clf.fit(train, target)
clf.best_params_

{'alpha_per_target': 1, 'fit_intercept': True}

### 3.12 Bagging Regressor
Bootstrap aggregating, also called Bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. Bagging leads to "improvements for unstable procedures", which include, for example, artificial neural networks, classification and regression trees, and subset selection in linear regression. On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors. Reference Wikipedia.

Model Improvement:
1. Increase the number of estimators: The bagging regressor averages the predictions from multiple base estimators. Hence, increasing the number of estimators can help in capturing a wider range of variability in the data and improve overall performance.

2. Increase the sample size: Increasing the sample size can help in improving the diversity of samples used for building base models, thereby improving the bagging regressor's performance.

3. Tuning Hyperparameters: Hyperparameters such as the maximum number of features and the minimum number of samples split may be tuned to optimize the model's performance.

4. Feature Selection: Feature selection techniques can be employed to remove irrelevant or redundant features, which can lead to improved performance.

5. Resampling: Resampling techniques such as oversampling or undersampling can be used to balance the distribution of the target variable, which can help improve the model's performance.

6. Using a different base estimator: Switching to a different base estimator, such as support vector machines (SVMs) or random forests, may yield better performance in certain datasets.


In [119]:
# Bagging Regressor

bagging = BaggingRegressor()
bagging.fit(train, target)
acc_model(11,bagging,train,test)

target =  [2 0 0 0 7]
ytrain =  [2.7 0.  0.  0.  6.9]
acc(r2_score) for train = 97.02
acc(relative error) for train = 8.44
acc(rmse) for train = 88.2
acc(mae) for train = 34.95
target_test = [14 20  1  9  0]
ytest = [14.8 16.9  1.   7.   0. ]
acc(r2_score) for test = 95.53
acc(relative error) for test = 17.5
acc(rmse) for test = 107.28
acc(mae) for test = 67.08


In [137]:
parameters = {'n_estimators': [10, 50, 100, 200],
              'max_samples': [0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
              'max_features': [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]}
grid_search = GridSearchCV(estimator = bagging,
                           param_grid = parameters,
                           scoring = 'neg_mean_squared_error',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(train, target)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
best_parameters

Best Accuracy: -338.74 %


{'max_features': 1.0, 'max_samples': 0.9, 'n_estimators': 10}

## 3.13 Extra Trees Regressor
ExtraTreesRegressor implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values. Reference sklearn documentation.

In extremely randomized trees, randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows to reduce the variance of the model a bit more, at the expense of a slightly greater increase in bias. Reference sklearn documentation.

Model Improvement:
1. Hyperparameter Tuning: The performance of the Extra Trees Regressor can be significantly improved by tuning its hyperparameters. Some of the important hyperparameters to tune are the number of trees in the forest (n_estimators), the maximum depth of each tree (max_depth), the minimum number of samples required to split an internal node (min_samples_split), the minimum number of samples required to be at a leaf node (min_samples_leaf), among others. You can use techniques such as Grid Search or Random Search to find the best combination of hyperparameters.

2. Feature Selection: You can try to improve the performance of the Extra Trees Regressor by selecting only the most important features. This can be achieved by using techniques such as Recursive Feature Elimination (RFE) or Feature Importance Ranking.

3. Ensemble Learning: You can also improve the performance of the Extra Trees Regressor by combining it with other regression models such as Random Forests, Gradient Boosting, or AdaBoost. This can be achieved using techniques such as Stacking or Boosting.

4. Data Preprocessing: Data preprocessing techniques such as normalization, standardization, or scaling can also help improve the performance of the Extra Trees Regressor.

5. Cross-validation: It is important to evaluate the performance of the Extra Trees Regressor using cross-validation techniques such as k-fold cross-validation to ensure that the model is not overfitting or underfitting the data.

6. Increasing the Size of the Dataset: Increasing the size of the dataset used for training can also help improve the performance of the Extra Trees Regressor by reducing the effect of noise and outliers.

In [120]:
# Extra Trees Regressor

etr = ExtraTreesRegressor()
etr.fit(train, target)
acc_model(12,etr,train,test)

target =  [2 0 0 0 7]
ytrain =  [2. 0. 0. 0. 7.]
acc(r2_score) for train = 100.0
acc(relative error) for train = 0.0
acc(rmse) for train = 0.0
acc(mae) for train = 0.0
target_test = [14 20  1  9  0]
ytest = [14.14 22.91  0.93  9.11  0.  ]
acc(r2_score) for test = 96.78
acc(relative error) for test = 13.07
acc(rmse) for test = 91.04
acc(mae) for test = 50.1


In [136]:
# GridSearch with Extra Trees Regressor
etr = ExtraTreesRegressor()
param_grid = {'n_estimators': [100, 200, 500],
              'min_samples_leaf': [5, 10, 20, 40],
              'max_features': [2,3,4,5,6,7]}
grid_search = GridSearchCV(etr, param_grid, cv=5)
grid_search.fit(train, target)
print(grid_search.best_params_)

{'max_features': 7, 'min_samples_leaf': 5, 'n_estimators': 500}


### 3.14 AdaBoost Regressor
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data modifications at each so-called boosting iteration consist of applying N weights to each of the training samples. Initially, those weights are all set to 1/N, so that the first step simply trains a weak learner on the original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data. At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence. Reference sklearn documentation.

Model Improvement:
1. Adjusting the learning rate- the learning rate determines how much the contribution of each classifier affects the final outcome. If the learning rate is too high, the model may become very complex and overfit the data. If it's too low, the model may fail to capture the underlying patterns in the data. tuning the learning rate can improve the overall performance.

2. Increasing the number of estimators- the more estimators there are, the more opportunities the model has to learn the underlying patterns in the data. Increasing the number of estimators can improve the overall performance of the model.

3. Removing outliers- Ada Boost Regressor is sensitive to outliers because it is fitted using a weighted version of the data. Removing outliers from the dataset can improve the performance of the model.

4. Feature selection- the model may be overfitting the data because it's using too many features. Removing irrelevant features can improve the performance of the model.

5. Cross-validation- using cross-validation to evaluate the performance of the model can help identify the optimal hyperparameters, such as the learning rate and number of estimators. Cross-validation can provide insight into how well the model will perform on new data.


In [121]:
# AdaBoost Regression

Ada_Boost = AdaBoostRegressor()
Ada_Boost.fit(train, target)
acc_model(13,Ada_Boost,train,test)

target =  [2 0 0 0 7]
ytrain =  [3.88       0.44444444 0.44444444 0.44444444 7.6       ]
acc(r2_score) for train = 96.64
acc(relative error) for train = 17.67
acc(rmse) for train = 93.61
acc(mae) for train = 73.17
target_test = [14 20  1  9  0]
ytest = [15.33333333 23.          0.64        7.625       0.44444444]
acc(r2_score) for test = 94.03
acc(relative error) for test = 25.3
acc(rmse) for test = 124.04
acc(mae) for test = 96.98


### 3.15 Voting Regressor
VotingRegressor is an ensemble meta-estimator that fits base regressors each on the whole dataset. It, then, averages the individual predictions to form a final prediction. Such a meta-estimator can be useful for a set of equally well performing model in order to balance out their individual weaknesses. Reference sklearn documentation.

Model Improvement:
1. Use a combination of diverse algorithms: A Voting Regressor combines the predictions of multiple base algorithms. Using a diverse set of algorithms with different strengths and weaknesses can improve the overall performance.

2. Hyperparameter Tuning: Fine-tuning the hyperparameters of the base algorithms and the Voting Regressor itself can boost the performance. You can use techniques like Grid Search or Random Search for hyperparameter tuning.

3. Feature Engineering: Building new features or selecting relevant features from the available dataset can improve the performance of the Voting Regressor.

4. Outlier Detection: Outliers can have a significant impact on the performance of the model. Removing or treating outliers can improve the performance of the Voting Regressor.

5. Ensemble Learning with Bagging or Boosting: Using Ensemble Learning techniques like Bagging or Boosting along with Voting Regressors can also improve the performance.


In [127]:
Voting_Reg = VotingRegressor(estimators=[('lin', linreg), ('ridge', ridge), ('sgd', sgd)])
Voting_Reg.fit(train, target)
acc_model(14,Voting_Reg,train,test)

target =  [2 0 0 0 7]
ytrain =  [2.69792504 0.46518294 0.8222522  0.9302291  6.56720124]
acc(r2_score) for train = 86.51
acc(relative error) for train = 30.57
acc(rmse) for train = 187.7
acc(mae) for train = 126.58
target_test = [14 20  1  9  0]
ytest = [16.45926634 16.28914939  2.07721834  9.58136894  3.78603899]
acc(r2_score) for test = 82.02
acc(relative error) for test = 42.98
acc(rmse) for test = 215.24
acc(mae) for test = 164.78


In [129]:
# Parameter Tuning
params = {'weights': [[0.1, 0.2, 0.7], [0.2, 0.3, 0.5], [0.3, 0.4, 0.3], [0.4, 0.3, 0.3], [0.5, 0.2, 0.3], [0.6, 0.2, 0.2], [0.7, 0.1, 0.2], [0.8, 0.1, 0.1], [0.9, 0.05, 0.05]]}
grid = GridSearchCV(Voting_Reg, param_grid=params, cv=5, scoring='r2')
grid.fit(train, target)
print(grid.best_params_)

{'weights': [0.3, 0.4, 0.3]}


In [130]:
Voting_Reg = VotingRegressor(estimators=[('lin', linreg), ('ridge', ridge), ('sgd', sgd)], weights=[0.3, 0.4, 0.3])
Voting_Reg.fit(train, target)
acc_model(14,Voting_Reg,train,test)

target =  [2 0 0 0 7]
ytrain =  [2.7070612  0.46078849 0.82092719 0.9389384  6.56603074]
acc(r2_score) for train = 86.5
acc(relative error) for train = 30.58
acc(rmse) for train = 187.77
acc(mae) for train = 126.62
target_test = [14 20  1  9  0]
ytest = [16.42306567 16.25932207  2.06717397  9.56819602  3.78169031]
acc(r2_score) for test = 82.03
acc(relative error) for test = 42.97
acc(rmse) for test = 215.17
acc(mae) for test = 164.7


## 4. Modell Comparison
To compare all models we derive the metrics for all models and compare them.

*R2 Score*
The R-squared test, also known as the coefficient of determination, is a statistical measure used to evaluate the goodness of fit of a regression model. It measures the proportion of the variation in the dependent variable that is explained by the independent variable(s).

The R-squared value ranges from 0 to 1, where 0 indicates that the model does not fit the data and 1 indicates that the model perfectly fits the data. Typically, a higher R-squared value indicates a better fit of the model to the data.

However, it is important to note that a high R-squared value does not necessarily mean that the model is accurate or that it can predict future values. Additionally, a low R-squared value does not necessarily mean that the model is not useful or accurate.

Therefore, it is important to use additional methods such as residual plots, significance tests, and cross-validation to evaluate the performance of the regression model.


*Durbin Watson Test:*
The d-test or the Durbin-Watson test is used to check for the presence of autocorrelation in a regression model. Autocorrelation occurs when the residuals of a regression model are correlated, meaning that the errors of one observation are similar to the errors of its neighboring observations. This can cause problems with the statistical inference of the regression model, leading to biased estimates of the regression coefficients and incorrect standard errors.

The d-test computes a test statistic that measures the degree of autocorrelation present in the residuals of a regression model. The test statistic, d, ranges from 0 to 4, with d = 2 indicating no autocorrelation, d < 2 indicating positive autocorrelation, and d > 2 indicating negative autocorrelation.

To interpret the d-test, you can use the following general guidelines:

- If d is close to 2, then the regression model does not exhibit significant autocorrelation.
- If d is less than 2, then the regression model exhibits positive autocorrelation, meaning that the residuals of the model are positively correlated.
- If d is greater than 2, then the regression model exhibits negative autocorrelation, meaning that the residuals of the model are negatively correlated.

Typically, a d-test value between 1.5 and 2.5 suggests that the model is reasonably free from autocorrelation. However, the exact threshold for a significant d value can vary depending on the context of the analysis and other factors such as sample size and data structure. Therefore, it is important to use the d-test in conjunction with other diagnostic tests and visualizations to assess the validity of a regression model.

*RMSE*
The RMSE test (root mean squared error) evaluates the accuracy of a regression model by measuring the difference between the predicted values and the actual values. It is calculated by taking the square root of the average squared difference between the predicted and actual values.

A lower RMSE value indicates that the model has better accuracy in predicting the outcome variable. However, the interpretation of the RMSE value depends on the scale of the outcome variable. For instance, an RMSE of 10 for a variable that ranges from 0 to 100 is different from an RMSE of 10 for a variable that ranges from 0 to 10. Therefore, it is essential to consider the context of the problem and the scale of the outcome variable when interpreting the RMSE value.

*MAE*
The MAE (Mean Absolute Error) test is a measure of the average absolute differences between the actual values and the predicted values made by a regression model. It is calculated by taking the sum of the absolute differences between actual and predicted values, and dividing this by the number of observations.

Interpreting the MAE test result involves looking at the scale of the data being analyzed. The closer the MAE score is to zero, the smaller the error and the better the model's predictive power. A larger MAE score, on the other hand, indicates a larger difference between the actual and predicted values, suggesting that the model is less accurate. Therefore, a low MAE score is desirable, while a high MAE score indicates that the model is not accurate and may require further improvements.



In [123]:
models = pd.DataFrame({
    'Model': ['Linear Regression',
              'Support Vector Machines',
              'Linear SVR',
              'MLPRegressor',
              'Stochastic Gradient Decent',
              'Decision Tree Regressor',
              'Random Forest',
              'GradientBoostingRegressor',
              'RidgeRegressor',
              'XGBoost',
              'LightGBM',
              'BaggingRegressor',
              'ExtraTreesRegressor',
              'AdaBoostRegressor',
              'VotingRegressor'],

    'r2_train': acc_train_r2,
    'r2_test': acc_test_r2,
    'd_train': acc_train_d,
    'd_test': acc_test_d,
    'rmse_train': acc_train_rmse,
    'rmse_test': acc_test_rmse,
    'mae_train': acc_train_mae,
    'mae_test': acc_test_mae
})


In [124]:
models.Model.to_list()

['Linear Regression',
 'Support Vector Machines',
 'Linear SVR',
 'MLPRegressor',
 'Stochastic Gradient Decent',
 'Decision Tree Regressor',
 'Random Forest',
 'GradientBoostingRegressor',
 'RidgeRegressor',
 'XGBoost',
 'LightGBM',
 'BaggingRegressor',
 'ExtraTreesRegressor',
 'AdaBoostRegressor',
 'VotingRegressor']

In [48]:
pd.options.display.float_format = '{:,.2f}'.format

In [49]:
print('Prediction accuracy for models by R2 criterion - r2_test')
models.sort_values(by=['r2_test', 'r2_train'], ascending=False)

Prediction accuracy for models by R2 criterion - r2_test


Unnamed: 0,Model,r2_train,r2_test,d_train,d_test,rmse_train,rmse_test,mae_train,mae_test
11,BaggingRegressor,97.71,97.97,6.88,11.68,77.29,72.33,28.49,44.79
7,GradientBoostingRegressor,99.97,97.64,1.44,11.93,8.69,77.93,5.96,45.74
6,Random Forest,97.88,96.66,7.47,15.58,74.38,92.79,30.94,59.71
12,ExtraTreesRegressor,100.0,96.62,0.0,13.77,0.0,93.27,0.0,52.77
9,XGBoost,100.0,95.19,0.0,16.5,0.0,111.35,0.0,63.24
13,AdaBoostRegressor,95.99,92.3,20.35,30.02,102.38,140.84,84.28,115.09
8,RidgeRegressor,93.5,91.67,17.04,28.96,130.32,146.53,70.55,111.02
3,MLPRegressor,90.55,88.02,25.88,31.12,157.11,175.7,107.14,119.3
2,Linear SVR,83.76,84.15,27.75,36.12,205.9,202.06,114.9,138.46
0,Linear Regression,86.53,82.07,30.5,42.93,187.52,214.93,126.29,164.57


In [50]:
print('Prediction accuracy for models by relative error - d_test')
models.sort_values(by=['d_test', 'd_train'], ascending=True)

Prediction accuracy for models by relative error - d_test


Unnamed: 0,Model,r2_train,r2_test,d_train,d_test,rmse_train,rmse_test,mae_train,mae_test
11,BaggingRegressor,97.71,97.97,6.88,11.68,77.29,72.33,28.49,44.79
7,GradientBoostingRegressor,99.97,97.64,1.44,11.93,8.69,77.93,5.96,45.74
12,ExtraTreesRegressor,100.0,96.62,0.0,13.77,0.0,93.27,0.0,52.77
6,Random Forest,97.88,96.66,7.47,15.58,74.38,92.79,30.94,59.71
9,XGBoost,100.0,95.19,0.0,16.5,0.0,111.35,0.0,63.24
8,RidgeRegressor,93.5,91.67,17.04,28.96,130.32,146.53,70.55,111.02
13,AdaBoostRegressor,95.99,92.3,20.35,30.02,102.38,140.84,84.28,115.09
3,MLPRegressor,90.55,88.02,25.88,31.12,157.11,175.7,107.14,119.3
2,Linear SVR,83.76,84.15,27.75,36.12,205.9,202.06,114.9,138.46
1,Support Vector Machines,48.61,63.97,31.39,36.2,366.3,304.66,129.99,138.75


In [51]:
print('Prediction accuracy for models by RMSE - rmse_test')
models.sort_values(by=['rmse_test', 'rmse_train'], ascending=True)

Prediction accuracy for models by RMSE - rmse_test


Unnamed: 0,Model,r2_train,r2_test,d_train,d_test,rmse_train,rmse_test,mae_train,mae_test
11,BaggingRegressor,97.71,97.97,6.88,11.68,77.29,72.33,28.49,44.79
7,GradientBoostingRegressor,99.97,97.64,1.44,11.93,8.69,77.93,5.96,45.74
6,Random Forest,97.88,96.66,7.47,15.58,74.38,92.79,30.94,59.71
12,ExtraTreesRegressor,100.0,96.62,0.0,13.77,0.0,93.27,0.0,52.77
9,XGBoost,100.0,95.19,0.0,16.5,0.0,111.35,0.0,63.24
13,AdaBoostRegressor,95.99,92.3,20.35,30.02,102.38,140.84,84.28,115.09
8,RidgeRegressor,93.5,91.67,17.04,28.96,130.32,146.53,70.55,111.02
3,MLPRegressor,90.55,88.02,25.88,31.12,157.11,175.7,107.14,119.3
2,Linear SVR,83.76,84.15,27.75,36.12,205.9,202.06,114.9,138.46
0,Linear Regression,86.53,82.07,30.5,42.93,187.52,214.93,126.29,164.57


In [52]:
# Plot with plotly dark theme
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=models['Model'], y=models['r2_train'], name='r2_train'))
fig.add_trace(go.Scatter(x=models['Model'], y=models['r2_test'], name='r2_test'))
fig.update_layout(title='R2-criterion for 15 popular models for train and test datasets',
                   xaxis_title='Models',
                   yaxis_title='R2-criterion, %')
fig.update_layout(template='plotly_dark')
# update size
fig.show()


In [53]:
# Plot with plotly dark theme
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=models['Model'], y=models['d_train'], name='d_train'))
fig.add_trace(go.Scatter(x=models['Model'], y=models['d_test'], name='d_test'))
fig.update_layout(title='Relative errors for 15 popular models for train and test datasets',
                   xaxis_title='Models',
                   yaxis_title='Relative error, %')
fig.update_layout(template='plotly_dark')
fig.show()

In [54]:
# Plot with plotly dark theme
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=models['Model'], y=models['rmse_train'], name='rmse_train'))
fig.add_trace(go.Scatter(x=models['Model'], y=models['rmse_test'], name='rmse_test'))
fig.update_layout(title='RMSE for 15 popular models for train and test datasets',
                   xaxis_title='Models',
                   yaxis_title='RMSE, %')
fig.update_layout(template='plotly_dark')
fig.show()


In [55]:
# Plot with plotly dark theme
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=models['Model'], y=models['mae_train'], name='mae_train'))
fig.add_trace(go.Scatter(x=models['Model'], y=models['mae_test'], name='mae_test'))
fig.update_layout(title='MAE for 15 popular models for train and test datasets',
                   xaxis_title='Models',
                   yaxis_title='MAE, %')
fig.update_layout(template='plotly_dark')
fig.show()


In [56]:
models

Unnamed: 0,Model,r2_train,r2_test,d_train,d_test,rmse_train,rmse_test,mae_train,mae_test
0,Linear Regression,86.53,82.07,30.5,42.93,187.52,214.93,126.29,164.57
1,Support Vector Machines,48.61,63.97,31.39,36.2,366.3,304.66,129.99,138.75
2,Linear SVR,83.76,84.15,27.75,36.12,205.9,202.06,114.9,138.46
3,MLPRegressor,90.55,88.02,25.88,31.12,157.11,175.7,107.14,119.3
4,Stochastic Gradient Decent,86.53,81.97,30.59,43.0,187.57,215.53,126.67,164.83
5,Decision Tree Regressor,70.64,64.29,30.9,42.79,276.88,303.32,127.94,164.03
6,Random Forest,97.88,96.66,7.47,15.58,74.38,92.79,30.94,59.71
7,GradientBoostingRegressor,99.97,97.64,1.44,11.93,8.69,77.93,5.96,45.74
8,RidgeRegressor,93.5,91.67,17.04,28.96,130.32,146.53,70.55,111.02
9,XGBoost,100.0,95.19,0.0,16.5,0.0,111.35,0.0,63.24


In [75]:
models.Model.to_list()

['GradientBoostingRegressor',
 'BaggingRegressor',
 'ExtraTreesRegressor',
 'Random Forest',
 'XGBoost',
 'AdaBoostRegressor',
 'RidgeRegressor',
 'MLPRegressor',
 'Linear Regression',
 'Linear SVR',
 'Stochastic Gradient Decent',
 'VotingRegressor',
 'LightGBM',
 'Decision Tree Regressor',
 'Support Vector Machines']

In [57]:
n_models = len(models)
models['r2_train_rank'] = models['r2_train'].rank(method='dense', ascending=False)
models['r2_test_rank'] = models['r2_test'].rank(method='dense', ascending=False)
models['d_train_rank'] = models['d_train'].rank(method='dense', ascending=False)
models['d_test_rank'] = models['d_test'].rank(method='dense', ascending=False)
models['rmse_train_rank'] = models['rmse_train'].rank(method='dense')
models['rmse_test_rank'] = models['rmse_test'].rank(method='dense')
models['mae_train_rank'] = models['mae_train'].rank(method='dense')
models['mae_test_rank'] = models['mae_test'].rank(method='dense')
models['test_ranksum'] = (models['r2_train_rank'] + models['r2_test_rank'] +
                          models['d_train_rank'] + models['d_test_rank'] +
                          models['rmse_train_rank'] + models['rmse_test_rank'] +
                          models['mae_train_rank'] + models['mae_test_rank'])
models = models.sort_values(by=['test_ranksum'], ascending=True)
models = models.sort_values(by=['test_ranksum'], ascending=True)
models = models.reset_index(drop=True)
models_rank = models[['Model', 'test_ranksum', 'r2_train_rank', 'r2_test_rank', 'd_train_rank', 'd_test_rank', 'rmse_train_rank', 'rmse_test_rank', 'mae_train_rank', 'mae_test_rank']]
models_rank

Unnamed: 0,Model,test_ranksum,r2_train_rank,r2_test_rank,d_train_rank,d_test_rank,rmse_train_rank,rmse_test_rank,mae_train_rank,mae_test_rank
0,GradientBoostingRegressor,39.0,2.0,2.0,13.0,14.0,2.0,2.0,2.0,2.0
1,BaggingRegressor,41.0,4.0,1.0,12.0,15.0,4.0,1.0,3.0,1.0
2,ExtraTreesRegressor,41.0,1.0,4.0,14.0,13.0,1.0,4.0,1.0,3.0
3,Random Forest,43.0,3.0,3.0,11.0,12.0,3.0,3.0,4.0,4.0
4,XGBoost,43.0,1.0,5.0,14.0,11.0,1.0,5.0,1.0,5.0
5,AdaBoostRegressor,53.0,5.0,6.0,9.0,9.0,5.0,6.0,6.0,7.0
6,RidgeRegressor,57.0,6.0,7.0,10.0,10.0,6.0,7.0,5.0,6.0
7,MLPRegressor,61.0,7.0,8.0,8.0,8.0,7.0,8.0,7.0,8.0
8,Linear Regression,67.0,8.0,10.0,6.0,4.0,8.0,10.0,9.0,12.0
9,Linear SVR,72.0,11.0,9.0,7.0,7.0,12.0,9.0,8.0,9.0


In [58]:
# The three overall best performing models
print(f'----- The best performing model is ----- \n\n The {models.Model[0]} \n \n r2_train = {models.r2_train[0]} r2_test = {models.r2_test[0]} \n d_train = {models.d_train[0]} d_test = {models.d_test[0]} \n rmse_train = {models.rmse_train[0]} rmse_test = {models.rmse_test[0]} \n mae_train = {models.mae_train[0]} mae_test = {models.mae_test[0]}\n')

print(f'----- The second best performing model is ----- \n\n The {models.Model[1]} \n \n r2_train = {models.r2_train[1]} r2_test = {models.r2_test[1]} \n d_train = {models.d_train[1]} d_test = {models.d_test[1]} \n rmse_train = {models.rmse_train[1]} rmse_test = {models.rmse_test[1]} \n mae_train = {models.mae_train[1]} mae_test = {models.mae_test[1]}\n')

print(f'----- The third best performing model is ----- \n\n The {models.Model[2]} \n \n r2_train = {models.r2_train[2]} r2_test = {models.r2_test[2]} \n d_train = {models.d_train[2]} d_test = {models.d_test[2]} \n rmse_train = {models.rmse_train[2]} rmse_test = {models.rmse_test[2]} \n mae_train = {models.mae_train[2]} mae_test = {models.mae_test[2]}\n')


----- The best performing model is ----- 

 The GradientBoostingRegressor 
 
 r2_train = 99.97 r2_test = 97.64 
 d_train = 1.44 d_test = 11.93 
 rmse_train = 8.69 rmse_test = 77.93 
 mae_train = 5.96 mae_test = 45.74

----- The second best performing model is ----- 

 The BaggingRegressor 
 
 r2_train = 97.71 r2_test = 97.97 
 d_train = 6.88 d_test = 11.68 
 rmse_train = 77.29 rmse_test = 72.33 
 mae_train = 28.49 mae_test = 44.79

----- The third best performing model is ----- 

 The ExtraTreesRegressor 
 
 r2_train = 100.0 r2_test = 96.62 
 d_train = 0.0 d_test = 13.77 
 rmse_train = 0.0 rmse_test = 93.27 
 mae_train = 0.0 mae_test = 52.77



# 5. Prediction

In [60]:
test0.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61 entries, 223 to 92
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Car_Name       61 non-null     int64
 1   Year           61 non-null     int64
 2   Present_Price  61 non-null     int64
 3   Kms_Driven     61 non-null     int64
 4   Fuel_Type      61 non-null     int64
 5   Seller_Type    61 non-null     int64
 6   Transmission   61 non-null     int64
 7   Owner          61 non-null     int64
dtypes: int64(8)
memory usage: 4.3 KB


In [61]:
test0

Unnamed: 0,Car_Name,Year,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
223,94,115,9,61381,1,0,1,0
150,52,111,0,6000,2,1,1,0
226,82,115,5,24678,2,0,1,0
296,69,116,11,33988,1,0,1,0
52,86,117,19,15000,1,0,0,0
...,...,...,...,...,...,...,...,...
137,20,113,0,16000,2,1,1,0
227,83,111,4,57000,2,0,1,0
26,92,113,5,55138,2,0,1,0
106,40,114,3,16500,2,1,1,1


In [62]:
#For models from Sklearn
testn = pd.DataFrame(scaler.transform(test0), columns = test0.columns)

In [63]:
#Linear Regression model for basic train
linreg.fit(train0, train_target0)
linreg.predict(testn)[:3]

array([ 6.67129475, -0.8974948 ,  3.73727157])

In [64]:
#Ridge Regressor model for basic train
ridge.fit(train0, train_target0)
ridge.predict(testn)[:3]

array([ 6.75784942, -0.76673981,  3.80798505])

## Prediction functions

In [68]:
# Function to predict with Gradient Boosting Regressor model
def predict_gbr(train, train_target, test):
    # Initiate the model
    gbr = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.05, max_depth=4, max_features='sqrt',
                                    min_samples_leaf=15, min_samples_split=10, loss='huber', random_state=42)
    gbr.fit(train, train_target)
    gbr_pred = gbr.predict(test)
    return gbr_pred

In [69]:
# Function to predict with Bagging Regressor model
def predict_br(train, train_target, test):
    # Initiate the model
    br = BaggingRegressor(n_estimators=1000, max_samples=0.5, max_features=0.5, random_state=42)
    br.fit(train, train_target)
    br_pred = br.predict(test)
    return br_pred

In [70]:
# Function to predict with Extra Trees Regressor model
def predict_etr(train, train_target, test):
    # Initiate the model
    etr = ExtraTreesRegressor(n_estimators=1000, max_depth=6, max_features='sqrt', min_samples_leaf=15,
                              min_samples_split=10, random_state=42)
    etr.fit(train, train_target)
    etr_pred = etr.predict(test)
    return etr_pred

In [71]:
df_pred = pd.DataFrame({'Actual': test_target0, 'predict_br': predict_br(train0, train_target0, testn),
                        'predict_gbr': predict_gbr(train0, train_target0, testn),
                        'predict_etr': predict_etr(train0, train_target0, testn)})
df_pred

Unnamed: 0,Actual,predict_br,predict_gbr,predict_etr
223,8,6.18,6.45,5.86
150,0,0.19,0.16,1.78
226,5,4.62,3.67,4.75
296,9,7.86,8.44,6.16
52,18,15.55,22.60,7.98
...,...,...,...,...
137,0,0.33,0.15,1.51
227,2,2.75,1.92,4.35
26,4,3.42,2.75,4.47
106,1,1.40,1.26,1.81



Thank you very much for your attention. We present the three best performing models: Gradient Boosting Regressor, Bagging Regressor and Extra Trees Regressor. We will use the Gradient Boosting Regressor model for further predictions. The relevant statistics for the model are presented below.
