# Project 3
Xingwang Yu

# 1. Introduction

`Supervised learning`, also known as supervised machine learning, is a subcategory of machine learning. It is defined by its use of labeled datasets to train algorithms that classify data or predict outcomes. In other words, the machine is trained on data in which each example is already tagged with the correct answer. The trained algorithm is then given new, unseen data and predicts an outcome based on what it learned from the labeled examples.

`Supervised learning` problems fall into two categories:
<br>`Classification`: the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.
<br>`Regression`: the output variable is a real value, such as a price in dollars or a weight.



`Data Set Information`:

To fit supervised learning models, I used a dataset from Kaggle: https://www.kaggle.com/datasets/aungpyaeap/fish-market?resource=download.
This dataset records 7 common fish species sold in a fish market. It includes measurements of weight, three lengths (Length1, Length2, Length3), height and width.

With this dataset, a predictive model can be built to estimate the weight of a fish from its body measurements.

`Objective`:

Build several classes of supervised regression models to predict the weight of a fish.

Let's import the data first and convert it to a Spark DataFrame.

In [31]:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [60]:
# read the data 
dat = pd.read_csv("Fish.csv", header = 0)

dat.head()

   Species  Weight  Length1  Length2  Length3   Height   Width
0    Bream   242.0     23.2     25.4     30.0    11.52    4.02
1    Bream   290.0     24.0     26.3     31.2    12.48  4.3056
2    Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3    Bream   363.0     26.3     29.0     33.5    12.73  4.4555
4    Bream   430.0     26.5     29.0     34.0   12.444   5.134


In [61]:
#convert to spark df
fish = spark.createDataFrame(dat)
fish.show(5)

+-------+------+-------+-------+-------+-------+------+
|Species|Weight|Length1|Length2|Length3| Height| Width|
+-------+------+-------+-------+-------+-------+------+
|  Bream| 242.0|   23.2|   25.4|   30.0|  11.52|  4.02|
|  Bream| 290.0|   24.0|   26.3|   31.2|  12.48|4.3056|
|  Bream| 340.0|   23.9|   26.5|   31.1|12.3778|4.6961|
|  Bream| 363.0|   26.3|   29.0|   33.5|  12.73|4.4555|
|  Bream| 430.0|   26.5|   29.0|   34.0| 12.444| 5.134|
+-------+------+-------+-------+-------+-------+------+
only showing top 5 rows



# 2. Splitting the Data, Metrics, and Models
## Evaluation metrics
Evaluation metrics quantify the performance of a model, and a useful metric should discriminate between the results of different models. Several metrics could be used here. I chose two commonly used ones, `RMSE` and `MAE`, to evaluate the regression models.

`RMSE`, Root Mean Squared Error, is the square root of the mean of the squared differences between predicted and actual values. Because the errors are averaged over all observations, it summarizes model performance over the whole dataset. A benefit of RMSE is that it is expressed in the units of the target. For example, the RMSE of a house price model is an error in currency units, which end users can interpret directly.

`MAE`, Mean Absolute Error, is the mean of the absolute differences between actual and predicted values. Like RMSE, it summarizes performance over the whole dataset. It is popular because it is easy to interpret: the value is on the same scale as the target.

`Similarities` between `MAE` and `RMSE`:
<br>Aside from both being error metrics for regression models, they share these properties:
1. The error is expressed in the units of the target variable
2. The lower the value, the more accurate the model
3. The values range from 0 to infinity

`Differences` between `RMSE` and `MAE`:
<br>Although both measure regression model error, there are some key differences:
1. RMSE is more sensitive to outliers
2. RMSE penalises large errors more than MAE because errors are squared before they are averaged
3. MAE is easier to interpret because it is simply the average absolute error
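
The difference in outlier sensitivity can be seen with a small calculation. The following is a minimal sketch in plain Python; the weights are hypothetical numbers chosen for illustration and are not taken from the fish data. Both sets of predictions have the same total absolute error, but in the second set the error is concentrated in one observation.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute residuals."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical weights in grams (illustration only).
actual = [100.0, 200.0, 300.0, 400.0]
even_errors = [110.0, 190.0, 310.0, 390.0]  # every prediction is off by 10
one_outlier = [100.0, 200.0, 300.0, 440.0]  # a single prediction is off by 40

print("even errors:", rmse(actual, even_errors), mae(actual, even_errors))
print("one outlier:", rmse(actual, one_outlier), mae(actual, one_outlier))
```

MAE is 10 in both cases, while RMSE rises from 10 to 20 when the same total error sits in a single observation.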

## Splitting data
A train/test split divides the data into a training set, used to fit the models, and a testing set, used to measure their accuracy on data the models have not seen.

There are several ways to split, but the most common is a single random partition, for example 80% for training and 20% for testing. Because rows are assigned at random, both sets should be broadly representative of the whole dataset, although with a small dataset the two sets can still differ by chance.

In [97]:
#split dataset into 80/20 training and testing sets
train, test = fish.randomSplit([0.8,0.2], seed = 1)
print(train.count(), test.count())

129 30
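
The counts above are not an exact 80/20 division of the rows. That is expected when each row is assigned to a set by an independent random draw, so the weights only hold on average. The sketch below illustrates the idea in plain Python; `random_split` is a hypothetical helper written for this illustration, not Spark's implementation, and it will not reproduce the exact counts shown above.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one part by an independent random draw.

    The weights are normalised, so part sizes match them only on average.
    """
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)  # cumulative upper bound of each part

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last part also catches any floating-point rounding gap.
            if u < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts

rows = list(range(159))  # stand-in row ids; 159 is the total printed above
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=1)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which matters when several models are compared on the same test set.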


## Models
The following regression models will be used to predict fish weight.
1. `Linear regression`
<br>Linear regression is a supervised learning algorithm that predicts a dependent variable (the target) from one or more independent variables. It does so by fitting a linear relationship between the target and the predictors, hence the name.
2. `LASSO Regression`
<br>Lasso regression is a type of linear regression that uses shrinkage: the coefficient estimates are shrunk towards zero by an L1 penalty, and some can become exactly zero. The lasso procedure therefore encourages simple, sparse models (i.e. models with fewer parameters). It is well suited to data with high levels of multicollinearity, or when you want to automate parts of model selection such as variable selection/parameter elimination.
3. `ElasticNet Regression`
<br>Elastic net is a combination of the two most popular regularized variants of linear regression: ridge and lasso. Ridge uses an L2 penalty and lasso uses an L1 penalty; elastic net applies a weighted mix of both.
4. `Decision Tree`
<br>A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while the associated tree is developed incrementally. The final result is a tree with decision nodes and leaf nodes.
5. `Gradient-Boosted Trees (GBTs)`
<br>Gradient-boosted trees combine many weak learners, here individual decision trees, into one strong learner. The trees are connected in series, and each tree tries to reduce the error left by the trees before it. Because of this sequential construction, boosting is usually slow to train but often highly accurate. In statistical learning, models that learn slowly often generalize better.
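
The sequential idea behind boosting can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Spark's `GBTRegressor`: it uses a single feature, squared-error loss, and one-split trees (stumps). The helper names `fit_stump`, `boost` and `mse`, and the data, are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both leaves non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees, learning_rate):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean of the target
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [10.0, 20.0, 40.0, 80.0, 160.0, 320.0]
one_tree = boost(xs, ys, n_trees=1, learning_rate=0.5)
many_trees = boost(xs, ys, n_trees=50, learning_rate=0.5)
print(mse(one_tree, xs, ys), mse(many_trees, xs, ys))
```

The training error shrinks as trees are added, since each stump corrects part of what the earlier ones missed. The learning rate controls how much of each correction is kept.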

# 3. Model fitting
Now, let's define the data transformations used in model fitting. I will create two versions: one without interaction or polynomial terms (`sqlTrans_1`, `assembler_1`), used for the tree-based models, and one with them (`sqlTrans_2`, `assembler_2`), used for the linear models.

In [78]:
from pyspark.ml.feature import SQLTransformer, VectorAssembler

sqlTrans_1 = SQLTransformer(
    statement = """
                SELECT Length1, Length2, Length3, Height, log(Width) as log_width,
                Weight as label FROM __THIS__
                """
)

sqlTrans_2 = SQLTransformer(
    statement = """
                SELECT Length1, Length2, Length3, Height, log(Width) as log_width,
                (Length1 * Length2) as interacted,
                (Height * Height) as poly_height,
                Weight as label FROM __THIS__
                """
)
df = sqlTrans_2.transform(fish)


assembler_1 = VectorAssembler(inputCols = ["Length1", "Length2", "Length3", "Height", "log_width"], outputCol = "features", handleInvalid = 'keep')
assembler_2 = VectorAssembler(inputCols = ["Length1", "Length2", "Length3", "Height", "log_width", "interacted", "poly_height"], outputCol = "features", handleInvalid = 'keep')

assembler_2.transform(df).show(5)

+-------+-------+-------+-------+------------------+-----------------+------------------+-----+--------------------+
|Length1|Length2|Length3| Height|         log_width|       interacted|       poly_height|label|            features|
+-------+-------+-------+-------+------------------+-----------------+------------------+-----+--------------------+
|   23.2|   25.4|   30.0|  11.52|1.3912819026309295|           589.28|          132.7104|242.0|[23.2,25.4,30.0,1...|
|   24.0|   26.3|   31.2|  12.48|1.4599165009905044|            631.2|          155.7504|290.0|[24.0,26.3,31.2,1...|
|   23.9|   26.5|   31.1|12.3778|1.5467323770179757|633.3499999999999|153.20993284000002|340.0|[23.9,26.5,31.1,1...|
|   26.3|   29.0|   33.5|  12.73|1.4941392880706375|            762.7|162.05290000000002|363.0|[26.3,29.0,33.5,1...|
|   26.5|   29.0|   34.0| 12.444|1.6358850824489488|            768.5|154.85313600000003|430.0|[26.5,29.0,34.0,1...|
+-------+-------+-------+-------+------------------+-----------------+------------------+-----+--------------------+
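
The derived columns can be checked by hand. The following sketch repeats the arithmetic of `sqlTrans_2` in plain Python for the first row of the data. The function name `engineer_features` is hypothetical, and the sketch assumes that `log` in the SQL statement is the natural logarithm, which matches the printed `log_width` values.

```python
import math

# First row of the fish data, as printed earlier in this notebook.
row = {"Length1": 23.2, "Length2": 25.4, "Length3": 30.0,
       "Height": 11.52, "Width": 4.02}

def engineer_features(r):
    """Per-row equivalent of the derived columns in sqlTrans_2."""
    return {
        "log_width": math.log(r["Width"]),           # natural log of Width
        "interacted": r["Length1"] * r["Length2"],   # interaction term
        "poly_height": r["Height"] * r["Height"],    # squared Height
    }

features = engineer_features(row)
print(features)
```

The results agree with the first row of the table above: `log_width` is about 1.3913, `interacted` is about 589.28 and `poly_height` is about 132.7104.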

## 3.1. Linear Regression

In [101]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

#create initial LinearRegression model
lr = LinearRegression()
#create ParamGrid for cross validation
paramGrid_lr = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.05,0.1, 1]) \
    .build()
pipeline_lr = Pipeline(stages = [sqlTrans_2, assembler_2, lr])

#create cross validator for two different metrics
lr_cv_1 = CrossValidator(estimator = pipeline_lr,
                          estimatorParamMaps = paramGrid_lr,
                          evaluator = RegressionEvaluator(metricName='rmse'),
                          numFolds=5)
lr_cv_2 = CrossValidator(estimator = pipeline_lr,
                          estimatorParamMaps = paramGrid_lr,
                          evaluator = RegressionEvaluator(metricName='mae'),
                          numFolds=5)
# Run cross-validation, and choose the best set of parameters.
lr_cv_model1 = lr_cv_1.fit(train)
lr_cv_model2 = lr_cv_2.fit(train)

#print cv results for model selection
lr_list = []
for i in range(len(paramGrid_lr)):
    lr_list.append([lr_cv_model1.avgMetrics[i], lr_cv_model2.avgMetrics[i], paramGrid_lr[i].values()])
lr_list

[[70.08353755479689, 51.23363828168392, dict_values([0.0])],
 [70.02820192663172, 51.2905742348669, dict_values([0.01])],
 [70.17034520438938, 52.014761408171424, dict_values([0.05])],
 [70.42731139753221, 52.46821092605671, dict_values([0.1])],
 [74.25880502828645, 53.361536846519094, dict_values([1.0])]]

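RMSE and MAE select slightly different parameters here: RMSE is lowest at `regParam` = 0.01 (70.03), while MAE is lowest at `regParam` = 0 (51.23). The differences between these two settings are small, so I use the model selected by RMSE as the best linear regression model.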
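
One way to read the preferred setting for each metric off the printed list is to take the minimum of each column. A minimal sketch in plain Python, with the values copied from the cross-validation output above; the variable names are chosen for this illustration.

```python
# (regParam, average CV RMSE, average CV MAE), copied from the output above.
cv_results = [
    (0.0,  70.08353755479689, 51.23363828168392),
    (0.01, 70.02820192663172, 51.2905742348669),
    (0.05, 70.17034520438938, 52.014761408171424),
    (0.1,  70.42731139753221, 52.46821092605671),
    (1.0,  74.25880502828645, 53.361536846519094),
]

# Lower is better for both metrics, so take the row with the smallest value.
best_by_rmse = min(cv_results, key=lambda r: r[1])
best_by_mae = min(cv_results, key=lambda r: r[2])

print("regParam preferred by RMSE:", best_by_rmse[0])
print("regParam preferred by MAE: ", best_by_mae[0])
```

The same check can be applied to the result lists of the other models in this section.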

In [102]:
lr_best = lr_cv_model1

## 3.2. LASSO Regression

In [103]:
#create initial LASSO model
la = LinearRegression()
#create ParamGrid for cross validation
paramGrid_la = ParamGridBuilder() \
    .addGrid(la.regParam, [0, 0.01, 0.05,0.1, 1]) \
    .addGrid(la.elasticNetParam, [1]) \
    .build()
pipeline_la = Pipeline(stages = [sqlTrans_2, assembler_2, la])

#create cross validator for two different metrics
la_cv_1 = CrossValidator(estimator = pipeline_la,
                          estimatorParamMaps = paramGrid_la,
                          evaluator = RegressionEvaluator(metricName='rmse'),
                          numFolds=5)
la_cv_2 = CrossValidator(estimator = pipeline_la,
                          estimatorParamMaps = paramGrid_la,
                          evaluator = RegressionEvaluator(metricName='mae'),
                          numFolds=5)
# Run cross-validation, and choose the best set of parameters.
la_cv_model1 = la_cv_1.fit(train)
la_cv_model2 = la_cv_2.fit(train)

#print cv results for model selection
la_list = []
for i in range(len(paramGrid_la)):
    la_list.append([la_cv_model1.avgMetrics[i], la_cv_model2.avgMetrics[i], paramGrid_la[i].values()])
la_list

[[70.08353755479689, 51.23363828168392, dict_values([0.0, 1.0])],
 [72.98550353440467, 52.554758583731044, dict_values([0.01, 1.0])],
 [73.63979991378065, 52.39935214770154, dict_values([0.05, 1.0])],
 [73.28490862017793, 52.55172277962646, dict_values([0.1, 1.0])],
 [76.17374891266998, 54.173500298453746, dict_values([1.0, 1.0])]]

RMSE and MAE agree across the parameter grid: both are lowest at `regParam` = 0, i.e. with no penalty applied. Either cross-validated model can therefore be taken as the best LASSO model.

In [104]:
la_best = la_cv_model1

## 3.3. ElasticNet Regression
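
The two grid parameters used in this section combine into a single penalty on the coefficients. The sketch below follows the form given in the Spark MLlib documentation for linear methods, `regParam * (alpha * L1 + (1 - alpha) / 2 * L2^2)`, where `alpha` is `elasticNetParam`. The function name and the coefficient values are hypothetical and only illustrate how the mix behaves.

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Elastic net penalty: a weighted mix of the L1 and squared L2 norms."""
    l1 = sum(abs(w) for w in coefficients)
    l2_squared = sum(w * w for w in coefficients)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

# Hypothetical coefficients for illustration only.
coefficients = [3.0, -4.0]

print(elastic_net_penalty(coefficients, 0.1, 1.0))  # pure lasso (L1 only)
print(elastic_net_penalty(coefficients, 0.1, 0.0))  # pure ridge (L2 only)
print(elastic_net_penalty(coefficients, 0.1, 0.5))  # equal mix of both
print(elastic_net_penalty(coefficients, 0.0, 0.5))  # no penalty at all
```

With `regParam` = 0 the penalty vanishes whatever the mix, which is why the grid rows with `regParam` = 0 give identical metrics for every `elasticNetParam`.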

In [105]:
#create initial ElasticNet model
en = LinearRegression()
#create ParamGrid for cross validation
paramGrid_en = ParamGridBuilder() \
    .addGrid(en.regParam, [0, 0.01, 0.05,0.1, 1]) \
    .addGrid(en.elasticNetParam, [0, 0.5, 0.8, 0.9, 1]) \
    .build()
pipeline_en = Pipeline(stages = [sqlTrans_2, assembler_2, en])

#create cross validator for two different metrics
en_cv_1 = CrossValidator(estimator = pipeline_en,
                          estimatorParamMaps = paramGrid_en,
                          evaluator = RegressionEvaluator(metricName='rmse'),
                          numFolds=5)
en_cv_2 = CrossValidator(estimator = pipeline_en,
                          estimatorParamMaps = paramGrid_en,
                          evaluator = RegressionEvaluator(metricName='mae'),
                          numFolds=5)
# Run cross-validation, and choose the best set of parameters.
en_cv_model1 = en_cv_1.fit(train)
en_cv_model2 = en_cv_2.fit(train)

#print cv results for model selection
en_list = []
for i in range(len(paramGrid_en)):
    en_list.append([en_cv_model1.avgMetrics[i], en_cv_model2.avgMetrics[i], paramGrid_en[i].values()])
en_list

[[70.08353755479689, 51.23363828168392, dict_values([0.0, 0.0])],
 [70.08353755479689, 51.23363828168392, dict_values([0.0, 0.5])],
 [70.08353755479689, 51.23363828168392, dict_values([0.0, 0.8])],
 [70.08353755479689, 51.23363828168392, dict_values([0.0, 0.9])],
 [70.08353755479689, 51.23363828168392, dict_values([0.0, 1.0])],
 [70.02820192663172, 51.2905742348669, dict_values([0.01, 0.0])],
 [73.44043576980751, 52.28176419490781, dict_values([0.01, 0.5])],
 [73.23251364312262, 52.58175372187534, dict_values([0.01, 0.8])],
 [73.2339282113476, 52.58282421892039, dict_values([0.01, 0.9])],
 [72.98550353440467, 52.554758583731044, dict_values([0.01, 1.0])],
 [70.17034520438938, 52.014761408171424, dict_values([0.05, 0.0])],
 [72.43687679328559, 52.432669906336955, dict_values([0.05, 0.5])],
 [71.09293708483156, 52.396290552994984, dict_values([0.05, 0.8])],
 [72.84846501574646, 52.62634198661726, dict_values([0.05, 0.9])],
 [73.63979991378065, 52.39935214770154, dict_values([0.05, 1.0])]

In this fit, RMSE and MAE select different best parameters. As mentioned above, RMSE is more sensitive to outliers, so I use RMSE as the primary metric and select model 1 as the best ElasticNet regression model.

In [106]:
en_best = en_cv_model1

## 3.4. Decision Tree

In [92]:
from pyspark.ml.regression import DecisionTreeRegressor

#create initial DecisionTree model
dt = DecisionTreeRegressor()
#create ParamGrid for cross validation
paramGrid_dt = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [1, 2, 5, 10]) \
    .addGrid(dt.maxBins, [4, 5, 8, 10]) \
    .build()
pipeline_dt = Pipeline(stages = [sqlTrans_1, assembler_1, dt])

#create cross validator for two different metrics
dt_cv_1 = CrossValidator(estimator = pipeline_dt,
                          estimatorParamMaps = paramGrid_dt,
                          evaluator = RegressionEvaluator(metricName='rmse'),
                          numFolds=5)
dt_cv_2 = CrossValidator(estimator = pipeline_dt,
                          estimatorParamMaps = paramGrid_dt,
                          evaluator = RegressionEvaluator(metricName='mae'),
                          numFolds=5)
# Run cross-validation, and choose the best set of parameters.
dt_cv_model1 = dt_cv_1.fit(train)
dt_cv_model2 = dt_cv_2.fit(train)

#print cv results for model selection
dt_list = []
for i in range(len(paramGrid_dt)):
    dt_list.append([dt_cv_model1.avgMetrics[i], dt_cv_model2.avgMetrics[i], paramGrid_dt[i].values()])
dt_list

[[205.5271650344436, 153.80191051980339, dict_values([1, 4])],
 [201.95860860517513, 145.38820283023335, dict_values([1, 5])],
 [197.4496302810485, 152.47646818934797, dict_values([1, 8])],
 [201.18204097353976, 150.13400760153553, dict_values([1, 10])],
 [133.03507721541718, 96.555292328788, dict_values([2, 4])],
 [155.60801536656044, 105.41755708445444, dict_values([2, 5])],
 [142.7960942020221, 105.08760797053988, dict_values([2, 8])],
 [138.72181102738412, 96.85440822497256, dict_values([2, 10])],
 [106.02549090920866, 70.55277440789473, dict_values([5, 4])],
 [111.3864869289024, 72.68564840097247, dict_values([5, 5])],
 [102.43959080948294, 63.70976667317384, dict_values([5, 8])],
 [95.55481511234447, 59.05176390005563, dict_values([5, 10])],
 [102.58416821775408, 67.58016496021523, dict_values([10, 4])],
 [109.66238057416153, 70.99627729900513, dict_values([10, 5])],
 [102.86740776100396, 60.545728362200144, dict_values([10, 8])],
 [107.68285191260543, 61.19526261748122, dict_val

RMSE and MAE select the same best parameters (`maxDepth` = 5, `maxBins` = 10), so either cross-validated model can be taken as the best decision tree model.

In [96]:
dt_best = dt_cv_model1

## 3.5. Gradient-Boosted Trees (GBTs)

In [93]:
from pyspark.ml.regression import GBTRegressor

#create initial GBTs model
gbt = GBTRegressor()
#create ParamGrid for cross validation
paramGrid_gbt = ParamGridBuilder() \
    .addGrid(gbt.maxDepth, [1, 2, 5, 10]) \
    .addGrid(gbt.maxBins, [4, 5, 8, 10]) \
    .build()
pipeline_gbt = Pipeline(stages = [sqlTrans_1, assembler_1, gbt])

#create cross validator for two different metrics
gbt_cv_1 = CrossValidator(estimator = pipeline_gbt,
                          estimatorParamMaps = paramGrid_gbt,
                          evaluator = RegressionEvaluator(metricName='rmse'),
                          numFolds=5)
gbt_cv_2 = CrossValidator(estimator = pipeline_gbt,
                          estimatorParamMaps = paramGrid_gbt,
                          evaluator = RegressionEvaluator(metricName='mae'),
                          numFolds=5)
# Run cross-validation, and choose the best set of parameters.
gbt_cv_model1 = gbt_cv_1.fit(train)
gbt_cv_model2 = gbt_cv_2.fit(train)

#print cv results for model selection
gbt_list = []
for i in range(len(paramGrid_gbt)):
    gbt_list.append([gbt_cv_model1.avgMetrics[i], gbt_cv_model2.avgMetrics[i], paramGrid_gbt[i].values()])
gbt_list

[[156.36913725524218, 111.90174726256305, dict_values([1, 4])],
 [162.45438598897243, 113.0311278129147, dict_values([1, 5])],
 [147.0625434284022, 105.92187880257805, dict_values([1, 8])],
 [151.41591979283132, 104.12063177489956, dict_values([1, 10])],
 [117.05876055865777, 79.1402497312001, dict_values([2, 4])],
 [117.31475219718337, 78.41860869144597, dict_values([2, 5])],
 [110.16540161729506, 77.06563233695758, dict_values([2, 8])],
 [102.21056917559324, 70.69748510595282, dict_values([2, 10])],
 [104.3634759529403, 68.66846119146763, dict_values([5, 4])],
 [110.72172185687107, 71.45111925931445, dict_values([5, 5])],
 [103.39034729727173, 61.460600348751655, dict_values([5, 8])],
 [92.18896441649525, 55.540543923198435, dict_values([5, 10])],
 [102.58416821775408, 67.58016496021523, dict_values([10, 4])],
 [109.66238057416153, 70.99627729900511, dict_values([10, 5])],
 [102.86516975990864, 60.545103362200145, dict_values([10, 8])],
 [107.68193496247541, 61.184996070802114, dict_

RMSE and MAE select the same best parameters (`maxDepth` = 5, `maxBins` = 10), so either cross-validated model can be taken as the best gradient-boosted trees model.

In [95]:
gbt_best = gbt_cv_model1

# 4. Model testing

Finally, I evaluate the best model from each class on the test set and compare their test errors.

In [108]:
# linear regression model
lr_best.transform(test).show(5)
# RegressionEvaluator() uses RMSE as its default metric
lr_test_error = RegressionEvaluator().evaluate(lr_best.transform(test))

+-------+-------+-------+-------+------------------+------------------+------------------+-----+--------------------+------------------+
|Length1|Length2|Length3| Height|         log_width|        interacted|       poly_height|label|            features|        prediction|
+-------+-------+-------+-------+------------------+------------------+------------------+-----+--------------------+------------------+
|   26.3|   29.0|   33.5|  12.73|1.4941392880706375|             762.7|162.05290000000002|363.0|[26.3,29.0,33.5,1...|334.28851347878185|
|   29.1|   31.5|   36.4|13.7592| 1.474305238442604| 916.6500000000001|      189.31558464|500.0|[29.1,31.5,36.4,1...|448.08479850158983|
|   29.4|   32.0|   37.2|14.9544|1.6430274154276556|             940.8|      223.63407936|600.0|[29.4,32.0,37.2,1...| 568.8604897057808|
|   29.4|   32.0|   37.2| 15.438|1.7191887763932197|             940.8|238.33184400000002|600.0|[29.4,32.0,37.2,1...| 634.4453093327111|
|   30.4|   33.0|   38.3|14.8604| 1.66494

In [109]:
# Lasso regression model
la_best.transform(test).show(5)
la_test_error = RegressionEvaluator().evaluate(la_best.transform(test))

+-------+-------+-------+-------+------------------+------------------+------------------+-----+--------------------+-----------------+
|Length1|Length2|Length3| Height|         log_width|        interacted|       poly_height|label|            features|       prediction|
+-------+-------+-------+-------+------------------+------------------+------------------+-----+--------------------+-----------------+
|   26.3|   29.0|   33.5|  12.73|1.4941392880706375|             762.7|162.05290000000002|363.0|[26.3,29.0,33.5,1...|330.2763140532298|
|   29.1|   31.5|   36.4|13.7592| 1.474305238442604| 916.6500000000001|      189.31558464|500.0|[29.1,31.5,36.4,1...|447.5664645354469|
|   29.4|   32.0|   37.2|14.9544|1.6430274154276556|             940.8|      223.63407936|600.0|[29.4,32.0,37.2,1...|569.2546668753753|
|   29.4|   32.0|   37.2| 15.438|1.7191887763932197|             940.8|238.33184400000002|600.0|[29.4,32.0,37.2,1...|635.9476129169128|
|   30.4|   33.0|   38.3|14.8604| 1.664948302361

In [110]:
# ElasticNet Regression
en_best.transform(test).show(5)
en_test_error = RegressionEvaluator().evaluate(en_best.transform(test))

+-------+-------+-------+-------+------------------+------------------+------------------+-----+--------------------+------------------+
|Length1|Length2|Length3| Height|         log_width|        interacted|       poly_height|label|            features|        prediction|
+-------+-------+-------+-------+------------------+------------------+------------------+-----+--------------------+------------------+
|   26.3|   29.0|   33.5|  12.73|1.4941392880706375|             762.7|162.05290000000002|363.0|[26.3,29.0,33.5,1...|334.28851347878185|
|   29.1|   31.5|   36.4|13.7592| 1.474305238442604| 916.6500000000001|      189.31558464|500.0|[29.1,31.5,36.4,1...|448.08479850158983|
|   29.4|   32.0|   37.2|14.9544|1.6430274154276556|             940.8|      223.63407936|600.0|[29.4,32.0,37.2,1...| 568.8604897057808|
|   29.4|   32.0|   37.2| 15.438|1.7191887763932197|             940.8|238.33184400000002|600.0|[29.4,32.0,37.2,1...| 634.4453093327111|
|   30.4|   33.0|   38.3|14.8604| 1.66494

In [111]:
# Decision tree
dt_best.transform(test).show(5)
dt_test_error = RegressionEvaluator().evaluate(dt_best.transform(test))

+-------+-------+-------+-------+------------------+-----+--------------------+----------+
|Length1|Length2|Length3| Height|         log_width|label|            features|prediction|
+-------+-------+-------+-------+------------------+-----+--------------------+----------+
|   26.3|   29.0|   33.5|  12.73|1.4941392880706375|363.0|[26.3,29.0,33.5,1...|     390.0|
|   29.1|   31.5|   36.4|13.7592| 1.474305238442604|500.0|[29.1,31.5,36.4,1...|     390.0|
|   29.4|   32.0|   37.2|14.9544|1.6430274154276556|600.0|[29.4,32.0,37.2,1...|     485.9|
|   29.4|   32.0|   37.2| 15.438|1.7191887763932197|600.0|[29.4,32.0,37.2,1...|     610.0|
|   30.4|   33.0|   38.3|14.8604| 1.664948302361668|700.0|[30.4,33.0,38.3,1...|     485.9|
+-------+-------+-------+-------+------------------+-----+--------------------+----------+
only showing top 5 rows



In [112]:
# Gradient-Boosted Trees (GBTs)
gbt_best.transform(test).show(5)
gbt_test_error = RegressionEvaluator().evaluate(gbt_best.transform(test))

+-------+-------+-------+-------+------------------+-----+--------------------+------------------+
|Length1|Length2|Length3| Height|         log_width|label|            features|        prediction|
+-------+-------+-------+-------+------------------+-----+--------------------+------------------+
|   26.3|   29.0|   33.5|  12.73|1.4941392880706375|363.0|[26.3,29.0,33.5,1...|392.82534762144195|
|   29.1|   31.5|   36.4|13.7592| 1.474305238442604|500.0|[29.1,31.5,36.4,1...| 404.8448709134235|
|   29.4|   32.0|   37.2|14.9544|1.6430274154276556|600.0|[29.4,32.0,37.2,1...| 499.6066411595096|
|   29.4|   32.0|   37.2| 15.438|1.7191887763932197|600.0|[29.4,32.0,37.2,1...| 613.7742114756664|
|   30.4|   33.0|   38.3|14.8604| 1.664948302361668|700.0|[30.4,33.0,38.3,1...|496.51787724051997|
+-------+-------+-------+-------+------------------+-----+--------------------+------------------+
only showing top 5 rows



In [113]:
print(lr_test_error, la_test_error, en_test_error, dt_test_error, gbt_test_error)

69.72791190392779 70.32578975740446 69.72791190392779 100.52487007272099 97.91809918197725


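Linear regression and ElasticNet regression show the lowest test error (RMSE of about 69.73) and can be considered the best of the five models. Their test errors are identical, which is consistent with cross-validation having selected the same hyperparameters for both. LASSO is close behind (70.33), while the decision tree (100.52) and gradient-boosted trees (97.92) perform noticeably worse on this test set.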
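
The comparison can be made explicit by ranking the test errors and expressing each one relative to the best. A minimal sketch in plain Python, with the values copied from the printed output above; the dictionary keys are labels chosen for this illustration.

```python
# Test RMSE per model, copied from the printed output above.
test_rmse = {
    "linear regression": 69.72791190392779,
    "lasso": 70.32578975740446,
    "elastic net": 69.72791190392779,
    "decision tree": 100.52487007272099,
    "gradient-boosted trees": 97.91809918197725,
}

# Sort from lowest to highest error; ties keep their original order.
ranking = sorted(test_rmse.items(), key=lambda kv: kv[1])
best_rmse = ranking[0][1]

for name, value in ranking:
    gap = 100 * (value - best_rmse) / best_rmse  # percent above the best
    print(f"{name:24s} {value:8.2f}  (+{gap:.1f}% vs best)")
```

On this split the tree-based models have a test RMSE roughly 40% higher than the linear models.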