<img src=http://fd.perso.eisti.fr/Logos/TORUS2.png>

In this section, we present some Machine Learning algorithms. There are many, but the three below are among the most popular:

- 1/ Regression - Linear Regression
- 2/ Classification - Random Forest
- 3/ Clustering - KMeans

This notebook focuses on the first one: we take a dataset and build a linear regression model on it.

"Linear regression is the most basic type of regression and commonly used predictive analysis.  The overall idea of regression is to examine two things: (1) does a set of predictor variables do a good job in predicting an outcome variable?  Is the model using the predictors accounting for the variability in the changes in the dependent variable? (2) Which variables in particular are significant predictors of the dependent variable?  And in what way do they--indicated by the magnitude and sign of the beta estimates--impact the dependent variable?  These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables. (3) What is the regression equation that shows how the set of predictor variables can be used to predict the outcome?  The simplest form of the equation with one dependent and one independent variable is defined by the formula y = c + b*x, where y = estimated dependent score, c = constant, b = regression coefficients, and x = independent variable."

(source : http://www.statisticssolutions.com/what-is-linear-regression/)
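
To make the formula y = c + b*x concrete, here is a minimal plain-Scala sketch of an ordinary least squares fit with a single predictor. The helper name `fitSimpleRegression` and the toy data are assumptions made for illustration; they are not part of Spark or of the dataset used below.

```scala
// Fit y = c + b*x by ordinary least squares on a single predictor.
// Returns (c, b): the intercept and the slope.
def fitSimpleRegression(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
  require(xs.size == ys.size && xs.size > 1, "need at least two paired observations")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  // Slope = sum of cross-deviations / sum of squared deviations of x
  val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
  val b = sxy / sxx
  // The fitted line always passes through the point (meanX, meanY)
  val c = meanY - b * meanX
  (c, b)
}

// Hypothetical toy data lying exactly on y = 1 + 2x
val (intercept, slope) = fitSimpleRegression(Seq(1.0, 2.0, 3.0, 4.0), Seq(3.0, 5.0, 7.0, 9.0))
println(s"c = $intercept, b = $slope")
```

With several predictors, as in the rest of this notebook, the same idea applies but the coefficients are solved jointly, which is what Spark's `LinearRegression` does.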

### Read dataset (csv format) from HDFS

Here we use the dataset from http://www.statsci.org/data/general/water.html 

The target variable is monthly water usage (gallons), and the explanatory variables are:
- Average monthly temperature (F)
- Amount of production (M pounds)
- Number of plant operating days in the month
- Number of persons on the monthly plant payroll

In [ ]:
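import org.apache.spark.sql._
// Reuse the notebook's Spark session if one exists, otherwise create it
val spark = SparkSession.builder().getOrCreate()

// Read the CSV with a header row and let Spark infer the column types
val data = spark.read.format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/formation4_ML/water.csv")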

In [ ]:
data.show()

+-----------+----------+----+-------+-----+
|Temperature|Production|Days|Persons|Water|
+-----------+----------+----+-------+-----+
|       58.8|      7107|  21|    129| 3067|
|       65.2|      6373|  22|    141| 2828|
|       70.9|      6796|  22|    153| 2891|
|       77.4|      9208|  20|    166| 2994|
|       79.3|     14792|  25|    193| 3082|
|       81.0|     14564|  23|    189| 3898|
|       71.9|     11964|  20|    175| 3502|
|       63.9|     13526|  23|    186| 3060|
|       54.5|     12656|  20|    190| 3211|
|       39.5|     14119|  20|    187| 3286|
|       44.5|     16691|  22|    195| 3542|
|       43.6|     14571|  19|    206| 3125|
|       56.0|     13619|  22|    198| 3022|
|       64.7|     14575|  22|    192| 2922|
|       73.0|     14556|  21|    191| 3950|
|       78.9|     18573|  21|    200| 4488|
|       79.4|     15618|  22|    200| 3295|
+-----------+----------+----+-------+-----+



In [ ]:
data.printSchema()

root
 |-- Temperature: double (nullable = true)
 |-- Production: integer (nullable = true)
 |-- Days: integer (nullable = true)
 |-- Persons: integer (nullable = true)
 |-- Water: integer (nullable = true)



In [ ]:
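// describe() only builds the summary DataFrame; call .show() on the result to print the statistics
data.describe()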

res23: org.apache.spark.sql.DataFrame = [summary: string, Temperature: string ... 4 more fields]


### Some descriptions of data

#### Statistics summary 

In [ ]:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}


In [ ]:
// Convert the DataFrame to an RDD of tuples so that the RDD-based
// Statistics.colStats API (MultivariateStatisticalSummary) can be used.
val rdd = data.rdd.map(row => (row.getDouble(0), row.getInt(1).toDouble, row.getInt(2).toDouble,
                               row.getInt(3).toDouble, row.getInt(4).toDouble))

In [ ]:
rdd.take(2)

res40: Array[(Double, Double, Double, Double, Double)] = Array((58.8,7107.0,21.0,129.0,3067.0), (65.2,6373.0,22.0,141.0,2828.0))


In [ ]:
// Convert the RDD of tuples to an RDD of dense vectors
val observations = rdd.map(l => Vectors.dense(l._1, l._2, l._3, l._4, l._5))

observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[63] at map at <console>:86


In [ ]:
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println("Column means: " + summary.mean)
println("Column variances: " + summary.variance)
println("Number of non-zero values per column: " + summary.numNonzeros)
println()

Column means: [64.8529411764706,12900.470588235294,21.470588235294116,181.8235294117647,3303.705882352941]
Column variances: [182.52264705882362,1.2438223764705881E7,2.1397058823529416,483.7794117647057,199539.47058823524]
Number of non-zero values per column: [17.0,17.0,17.0,17.0,17.0]

summary: org.apache.spark.mllib.stat.MultivariateStatisticalSummary = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@6c85eb06


#### Correlations of variables 

In [ ]:
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD


In [ ]:
// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method
// If a method is not specified, Pearson's method will be used by default.
val correlMatrix: Matrix = Statistics.corr(observations, "pearson")
println(correlMatrix.toString)

1.0                   -0.02410741870356305  0.43762975958335126   ... (5 total)
-0.02410741870356305  1.0                   0.10573054707596519   ...
0.43762975958335126   0.10573054707596519   1.0                   ...
-0.08205777488270032  0.9184797375869633    0.03188119325449726   ...
0.28575755805713965   0.6307494802500775    -0.08882582642644302  ...
correlMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0                   -0.02410741870356305  0.43762975958335126   ... (5 total)
-0.02410741870356305  1.0                   0.10573054707596519   ...
0.43762975958335126   0.10573054707596519   1.0                   ...
-0.08205777488270032  0.9184797375869633    0.03188119325449726   ...
0.28575755805713965   0.6307494802500775    -0.08882582642644302  ...
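
To make one entry of this matrix concrete, here is a minimal plain-Scala sketch of Pearson's coefficient applied to the Temperature and Water columns, with values copied from the `data.show()` output above. The helper name `pearson` is an assumption for illustration, not a Spark API, and the last digits may differ slightly from Spark's result.

```scala
// Pearson correlation: cross-deviations divided by the product of the deviation norms.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.nonEmpty, "columns must be non-empty and of equal length")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  val cross = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val normX = math.sqrt(xs.map(x => (x - meanX) * (x - meanX)).sum)
  val normY = math.sqrt(ys.map(y => (y - meanY) * (y - meanY)).sum)
  cross / (normX * normY)
}

// Temperature (column 1) and Water (column 5) from the dataset shown earlier
val temperature = Seq(58.8, 65.2, 70.9, 77.4, 79.3, 81.0, 71.9, 63.9, 54.5,
                      39.5, 44.5, 43.6, 56.0, 64.7, 73.0, 78.9, 79.4)
val water = Seq(3067.0, 2828.0, 2891.0, 2994.0, 3082.0, 3898.0, 3502.0, 3060.0, 3211.0,
                3286.0, 3542.0, 3125.0, 3022.0, 2922.0, 3950.0, 4488.0, 3295.0)

val r = pearson(temperature, water)
println(f"corr(Temperature, Water) = $r%.4f")
```

The result should agree with the first entry of the last row of the matrix (Water against Temperature).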


In this example there are only four explanatory variables, so we assume that all of them can be used to build the regression model. Note, however, that Production and Persons are strongly correlated with each other (about 0.92 in the matrix above). With more variables, we would need a feature selection step to keep those that most affect the target variable. The appropriate method depends on the type of the variables. Spark provides some basic tools for this, for example https://spark.apache.org/docs/latest/ml-features.html#feature-selectors

###  Vector Assembler

To build a linear regression with the ML library, the estimator expects a label column and a features vector column (named "label" and "features" by default). To get there, we put all the explanatory variables into a single vector column named "features" and rename the target variable column to "label".

In [ ]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors


In [ ]:
val assembler = new VectorAssembler()
  .setInputCols(Array("Temperature", "Production", "Days", "Persons"))
  .setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_dc3154249315


In [ ]:
val training = assembler.transform(data)
                        .select("Water", "features")
                        .withColumnRenamed("Water", "label")

training: org.apache.spark.sql.DataFrame = [label: int, features: vector]


In [ ]:
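// Note: this splits the raw `data` DataFrame, not `training`; the model below is fit on all of `training`.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))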

train: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Temperature: double, Production: int ... 3 more fields]
test: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Temperature: double, Production: int ... 3 more fields]


In [ ]:
test.show()

+-----------+----------+----+-------+-----+
|Temperature|Production|Days|Persons|Water|
+-----------+----------+----+-------+-----+
|       54.5|     12656|  20|    190| 3211|
|       56.0|     13619|  22|    198| 3022|
|       58.8|      7107|  21|    129| 3067|
|       79.3|     14792|  25|    193| 3082|
|       79.4|     15618|  22|    200| 3295|
+-----------+----------+----+-------+-----+



In [ ]:
training.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
| 3067|[58.8,7107.0,21.0...|
| 2828|[65.2,6373.0,22.0...|
| 2891|[70.9,6796.0,22.0...|
| 2994|[77.4,9208.0,20.0...|
| 3082|[79.3,14792.0,25....|
| 3898|[81.0,14564.0,23....|
| 3502|[71.9,11964.0,20....|
| 3060|[63.9,13526.0,23....|
| 3211|[54.5,12656.0,20....|
| 3286|[39.5,14119.0,20....|
| 3542|[44.5,16691.0,22....|
| 3125|[43.6,14571.0,19....|
| 3022|[56.0,13619.0,22....|
| 2922|[64.7,14575.0,22....|
| 3950|[73.0,14556.0,21....|
| 4488|[78.9,18573.0,21....|
| 3295|[79.4,15618.0,22....|
+-----+--------------------+



### Build a linear regression model 

To get the best model, we can vary hyperparameters such as the maximum number of iterations, the regularization parameter, etc. The parameters supported by Spark are listed at: https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.ml.regression.LinearRegression
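
As a rough illustration of how the two regularization parameters interact, the sketch below computes the elastic net penalty that Spark's documentation describes for `LinearRegression`: `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`. The function name `elasticNetPenalty` and the coefficient vector are assumptions for illustration. Spark also standardizes features internally by default, so the penalty it actually applies during training is not computed on the raw coefficients.

```scala
// Elastic net penalty for a coefficient vector.
// elasticNetParam = 1.0 gives a pure L1 (Lasso) penalty, 0.0 a pure L2 (Ridge) penalty.
def elasticNetPenalty(coefficients: Seq[Double], regParam: Double, elasticNetParam: Double): Double = {
  require(regParam >= 0.0, "regParam must be non-negative")
  require(elasticNetParam >= 0.0 && elasticNetParam <= 1.0, "elasticNetParam must be in [0, 1]")
  val l1 = coefficients.map(math.abs).sum
  val l2Squared = coefficients.map(w => w * w).sum
  regParam * (elasticNetParam * l1 + (1.0 - elasticNetParam) / 2.0 * l2Squared)
}

// Hypothetical coefficients, evaluated with the values used in the next cell (0.3 and 0.8)
val penalty = elasticNetPenalty(Seq(3.0, -4.0), regParam = 0.3, elasticNetParam = 0.8)
println(s"penalty = $penalty")
```

A larger `regParam` shrinks the coefficients more strongly, while `elasticNetParam` shifts the balance between sparsity (L1) and smooth shrinkage (L2).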

In [ ]:
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

lr: org.apache.spark.ml.regression.LinearRegression = linReg_f4f69c802e71
lrModel: org.apache.spark.ml.regression.LinearRegressionModel = linReg_f4f69c802e71


In [ ]:
// Print the coefficients and intercept for linear regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

Coefficients: [14.327053237030215,0.1953629027869717,-128.50694198665602,-18.495546085518836] Intercept: 5976.326064543363


### Evaluation of model 

Other metrics that can be computed are listed at: https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.ml.regression.LinearRegressionTrainingSummary

In [ ]:
// Summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")

numIterations: 11
objectiveHistory: [0.5000000000000036,0.4615658864385346,0.25876938382285175,0.23163114088245704,0.22711753479176247,0.1808987988418104,0.16024337290053345,0.13534375971375456,0.1271681430235667,0.12544451188702493,0.1208688800121053]
+-------------------+
|          residuals|
+-------------------+
|-55.629718236040844|
| 107.52700670548484|
| 228.17084840174948|
|   -249.74210402269|
|-137.95550009320868|
| 367.23518292388235|
| -34.90355653024835|
| -78.47215188658129|
|  65.62923234827531|
| 14.232465869832595|
| 101.10206637405372|
| -70.90405082350026|
|  71.98242976633992|
|-450.40314497328154|
| 315.39172024014533|
| 150.54924041607228|
| -343.8099664802862|
+-------------------+

RMSE: 211.22957297313167
r2: 0.7624201711079546
trainingSummary: org.apache.spark.ml.regression.LinearRegressionTrainingSummary = org.apache.spark.ml.regression.LinearRegressionTrainingSummary@6c4cbb05
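
For reference, the two headline metrics printed above can be sketched in plain Scala as follows. The helper names `computeRmse` and `computeR2` and the toy values are assumptions for illustration; the formulas are the standard ones (root of the mean squared residual, and one minus the ratio of residual to total sum of squares).

```scala
// Root mean squared error between actual labels and predictions.
def computeRmse(labels: Seq[Double], predictions: Seq[Double]): Double = {
  require(labels.size == predictions.size && labels.nonEmpty, "inputs must be non-empty and of equal length")
  val sumSquaredResiduals = labels.zip(predictions).map { case (y, p) => (y - p) * (y - p) }.sum
  math.sqrt(sumSquaredResiduals / labels.size)
}

// Coefficient of determination: share of the label variance explained by the predictions.
def computeR2(labels: Seq[Double], predictions: Seq[Double]): Double = {
  require(labels.size == predictions.size && labels.nonEmpty, "inputs must be non-empty and of equal length")
  val meanLabel = labels.sum / labels.size
  val ssResidual = labels.zip(predictions).map { case (y, p) => (y - p) * (y - p) }.sum
  val ssTotal = labels.map(y => (y - meanLabel) * (y - meanLabel)).sum
  1.0 - ssResidual / ssTotal
}

// Hypothetical labels and predictions
val actual = Seq(3.0, 5.0, 7.0, 9.0)
val predicted = Seq(2.5, 5.5, 7.0, 9.0)
println(s"RMSE = ${computeRmse(actual, predicted)}, r2 = ${computeR2(actual, predicted)}")
```

RMSE is expressed in the units of the target (gallons here), whereas r2 is unitless, which is why the two are usually reported together.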


### Conclusion

Without any tuning, the model fits the training data reasonably well (r2 = 0.76). Note that this value is computed on the same 17 observations used for fitting, so it is optimistic about performance on new data. In practice, we can try to improve this indicator by removing anomalies, selecting the most important features, adding more observations or variables, and varying the training parameters.

### Note :

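Models created with the RDD-based `spark.mllib` API can be saved by doing:

* model.save(sc, "file:///Apps/spark/data/mllib/testModelPath")

To load the model for future use:

* val sameModel = SVMModel.load(sc, "file:///Apps/spark/data/mllib/testModelPath")

In this example the model is an SVM model, hence SVMModel.load. The `file:///` prefix writes to the local filesystem; use an `hdfs://` path to store the model in HDFS. Models from the DataFrame-based `spark.ml` API, such as the LinearRegressionModel trained above, are saved with model.write.save(path) and reloaded with LinearRegressionModel.load(path).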

In addition, some models can be exported to PMML format. If you are not familiar with PMML, you can read about it here: https://www.ibm.com/developerworks/library/ba-ind-PMML1/index.html.

The list of models that support PMML export in Spark is here: https://spark.apache.org/docs/2.0.0-preview/mllib-pmml-model-export.html

# Exercise
## How can this code be turned into a pipeline?
## How can the model be improved with cross-validation?

### Pre-processing
Prepare the `assembler` stage and add a data `Standardization` stage.
Two objects therefore need to be created:
- VectorAssembler
- StandardScaler (https://spark.apache.org/docs/3.2.1/ml-features.html#standardscaler)

Look up the required functions in the documentation: https://spark.apache.org/docs/3.2.1/ml-guide.html

We will call these two objects (i.e. immutable "val" variables) *assembler* and *scaler*. These first two pipeline stages transform and format the data for the model and standardize the numeric variables so that they are comparable with one another.
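
To illustrate what standardization does to a single column, here is a minimal plain-Scala sketch that divides each value by the column's sample standard deviation without centering, which is the behaviour documented for `StandardScaler` with `withStd = true` and `withMean = false`. The helper name `scaleToUnitStd` is an assumption for illustration, and this sketch is not the Spark object the exercise asks for.

```scala
// Scale a column to unit standard deviation without centering it.
// Uses the corrected sample standard deviation (division by n - 1).
def scaleToUnitStd(column: Seq[Double]): Seq[Double] = {
  require(column.size > 1, "need at least two values to estimate a standard deviation")
  val mean = column.sum / column.size
  val sampleStd = math.sqrt(column.map(v => (v - mean) * (v - mean)).sum / (column.size - 1))
  require(sampleStd > 0.0, "a constant column cannot be scaled to unit standard deviation")
  column.map(_ / sampleStd)
}

// First four Production values from the dataset shown earlier
val productionSample = Seq(7107.0, 6373.0, 6796.0, 9208.0)
val scaledProduction = scaleToUnitStd(productionSample)
println(scaledProduction.mkString(", "))
```

After scaling, Production (thousands) and Days (around 20) are on comparable scales, so the regularization penalty treats their coefficients more evenly.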

In [ ]:
val assembler =

In [ ]:
val scaler =

### Model
Create the linear regression model of your choice (plain linear, Lasso, Ridge, ElasticNet).
Create the training and test sets.

In [ ]:
val lr =

In [ ]:
val Array(train, test) =

### Pipeline
Create the pipeline with its different `stages`.
Then train the model.

In [ ]:
val pipeline =

In [ ]:
val lrModel =

### Metrics
Compute the usual error metrics: the RMSE and the coefficient of determination (denoted r2).
The coefficient of determination measures the quality of a linear regression's predictions. It is the proportion of the total variance of the actual variable that is explained by the regression (i.e. by its predictions). It therefore reflects how strongly the predictions correlate with the actual values. The closer it is to 1, the better the model performs.

**! Warning:**
Calling `lrModel.summary` is not enough here, because `lrModel` is a pipeline model. The component representing the regression model must first be extracted from the pipeline.
To do so, you will need the following:
- the function .stages(*stage_number*) --> specifying the index of the stage to extract
- the function .asInstanceOf[*stage_type*] --> specifying the type of the object to extract from the pipeline
- the import for the *stage_type* to extract: import org.apache.spark.ml.regression.LinearRegressionModel


In [ ]:
val trainingSummary =

### Test
Apply the trained model to the test set. To do so, use the `.transform` function, which applies the model and automatically runs the required pipeline stages.

Then evaluate the predictions with the **RMSE** metric. You will need the following:
- the library *org.apache.spark.ml*, which contains *evaluation.RegressionEvaluator*, i.e. *org.apache.spark.ml.evaluation.RegressionEvaluator*
- an instance, called *evaluator*, that computes the RMSE between the actual variable `Water` (as LabelCol) and the predicted variable `prediction` (as PredictionCol).

Follow this documentation: https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator

Finally, apply the *evaluator* object to the *predictions* object (the DataFrame containing the predictions made on the test set) using the *.evaluate* function.
Then print the resulting RMSE.

In [ ]:
val predictions =

In [ ]:
val evaluator =

In [ ]:
val rmse =
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

### Cross validation
The goal is to use **cross-validation** to tune the parameters of the linear regression.
The parameters to tune are:
- regParam
- elasticNetParam

To do this, we first create a search grid.
For example, here are the values we will test for each parameter:
- For `regParam`: 0.1, 0.01, 0.2, 0.3
- For `elasticNetParam`: 0.1, 0.8

Once the grid is created, create the cross-validation model, then fit it on the training set.

Using the documentation, build the following objects:
- paramGrid
- cv
- cvModel

Finally, test the new model `cvModel` on the test set.
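
To get a feel for the size of the search, the plain-Scala sketch below enumerates the grid described above and counts the model fits that cross-validation would run. The fold count of 3 is an assumption chosen for illustration, and the variable names are hypothetical; this is not the Spark `ParamGridBuilder` the exercise asks for.

```scala
// Candidate values for each hyperparameter, as listed in the exercise
val regParamValues = Seq(0.1, 0.01, 0.2, 0.3)
val elasticNetParamValues = Seq(0.1, 0.8)

// The grid is the Cartesian product of the two value lists
val searchGrid = for {
  regParam <- regParamValues
  elasticNetParam <- elasticNetParamValues
} yield (regParam, elasticNetParam)

// Assumption: 3 folds. Each combination is fit once per fold.
val assumedNumFolds = 3
val totalFits = searchGrid.size * assumedNumFolds

println(s"${searchGrid.size} parameter combinations, $totalFits fits during cross-validation")
searchGrid.foreach { case (regParam, elasticNetParam) =>
  println(s"regParam = $regParam, elasticNetParam = $elasticNetParam")
}
```

The cost grows multiplicatively with each added parameter value and each extra fold, which matters little on 17 rows but quickly adds up on larger datasets.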

In [ ]:
val paramGrid =

In [ ]:
val cv =

In [ ]:
val cvModel = 

In [ ]:
val predictionsCvModel =

In [ ]:
val evaluatorCvModel =

In [ ]:
val rmse =

# Solution

In [ ]:
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.regression.LinearRegressionModel

// Instantiate the vector assembler
val assembler = new VectorAssembler() 
                    .setInputCols(Array("Temperature", "Production", "Days", "Persons")) 
                    .setOutputCol("assembled_features")
val scaler = new StandardScaler() 
  .setInputCol("assembled_features") 
  .setOutputCol("features") 
  .setWithStd(true) 
  .setWithMean(false)

val lr = new LinearRegression() 
  .setMaxIter(10) 
  .setRegParam(0.3) 
  .setElasticNetParam(0.8) 
  .setLabelCol("Water")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 11L)

// Chain the stages of the pipeline
val pipeline = new Pipeline() 
  .setStages(Array(assembler, scaler, lr))
val lrModel = pipeline.fit(train)

// Summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.stages(2).asInstanceOf[LinearRegressionModel].summary 
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")

numIterations: 11
objectiveHistory: [0.5,0.4628383730928235,0.2829957345590899,0.2543568688091208,0.29653911937073074,0.18868639292818604,0.16783880945958807,0.12182703917661503,0.11158188874690116,0.11052877064931642,0.10549294538646442]
+-------------------+
|          residuals|
+-------------------+
| 18.624537188468366|
| 12.414151923516783|
|  48.32058006373518|
|   137.497254411021|
|-39.917993971117085|
| -48.89329149249215|
| -411.0259980437604|
|  185.1370180846202|
|-101.66802303226223|
|  160.0380896405577|
|-110.47243822587825|
| -269.7091729679487|
| 419.65528642152367|
+-------------------+

RMSE: 201.14093389878042
r2: 0.7935707380603677

In [ ]:
import org.apache.spark.ml.evaluation.RegressionEvaluator

val predictions = lrModel.transform(test)

// Select (prediction, true label) and compute test error. 
val evaluator = new RegressionEvaluator() 
  .setLabelCol("Water") 
  .setPredictionCol("prediction") 
  .setMetricName("rmse")

val rmse = evaluator.evaluate(predictions) 
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

Root Mean Squared Error (RMSE) on test data = 273.1306867994529
predictions: org.apache.spark.sql.DataFrame = [Temperature: double, Production: int ... 6 more fields]
evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_ca1c8e73fc01
rmse: Double = 273.1306867994529


In [ ]:
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Instantiate the vector assembler
val assembler = new VectorAssembler() 
                    .setInputCols(Array("Temperature", "Production", "Days", "Persons")) 
                    .setOutputCol("assembled_features")
val scaler = new StandardScaler() 
  .setInputCol("assembled_features") 
  .setOutputCol("features") 
  .setWithStd(true) 
  .setWithMean(false)

val lr = new LinearRegression() 
  .setMaxIter(10) 
  .setRegParam(0.3) 
  .setElasticNetParam(0.8) 
  .setLabelCol("Water")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 11L)

// Chain the stages of the pipeline
val pipeline = new Pipeline() 
  .setStages(Array(assembler, scaler, lr))
val lrModel = pipeline.fit(train)

// Build the grid of parameter values to test
val paramGrid = new ParamGridBuilder() 
  .addGrid(lr.regParam, Array(0.1, 0.01, 0.2, 0.3))
  .addGrid(lr.elasticNetParam, Array(0.1, 0.8)) 
  .build()
val cv = new CrossValidator() 
  .setEstimator(pipeline) 
  .setEvaluator(evaluator) 
  .setEstimatorParamMaps(paramGrid) 
  .setNumFolds(3)  // Use 3 or more folds in practice
val cvModel = cv.fit(train)

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_0a246ac0fbea
scaler: org.apache.spark.ml.feature.StandardScaler = stdScal_654560b4188d
lr: org.apache.spark.ml.regression.LinearRegression = linReg_bfaf34af3926
train: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Temperature: double, Production: int ... 3 more fields]
test: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Temperature: double, Production: int ... 3 more fields]
pipeline: org.apache.spark.ml.Pipeline = pipeline_beab22694683
lrModel: org.apache.spark.ml.PipelineModel = pipeline_beab22694683
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	linReg_bfaf34af3926-elasticNetParam: 0.1,
	linReg_bfaf34af3926...

In [ ]:
// Make predictions on the test set. cvModel applies the best model found during cross-validation.
val predictionsCvModel = cvModel.transform(test)

// Evaluate the predictions against the true label column
val evaluatorCvModel = new RegressionEvaluator()
  .setLabelCol("Water")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluatorCvModel.evaluate(predictionsCvModel)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

Root Mean Squared Error (RMSE) on test data = 249.8005072117436
predictionsCvModel: org.apache.spark.sql.DataFrame = [Temperature: double, Production: int ... 6 more fields]
evaluatorCvModel: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_d70de646ad94
rmse: Double = 249.8005072117436
