This notebook contains the solution to exercise 1.

### Read dataset (json format) from HDFS

Here we use the dataset from hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/geoVN.json

The target variable is the daily mean temperature, and the explanatory variables are:
- High cloud cover daily mean
- Low cloud cover daily mean
- Mean Sea Level Pressure daily mean
- Medium cloud cover daily mean
- Relative humidity daily mean
- Shortwave Radiation - backwards daily sum
- Total Precipitation daily sum
- Total cloud cover daily mean

In [ ]:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession
  .builder()
  .appName("linearRegModel_solution")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import sparkSession.implicits._

val df = sparkSession.read.json("hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/geoVN.json")

import org.apache.spark.sql.SparkSession
sparkSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@7fcc5885
import sparkSession.implicits._
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]


In [ ]:
df.show()

+--------------------+
|     _corrupt_record|
+--------------------+
|                   [|
|                   {|
|       "Year": 2017,|
|         "Month": 8,|
|          "Day": 23,|
|    "City": "Ho C...|
|      "Lat": 10.823,|
|      "Lon": 106.63,|
|    "timestamp": ...|
|    "Temperature ...|
|    "Relative hum...|
|    "Mean Sea Lev...|
|    "Total Precip...|
|    "Total cloud ...|
|    "High cloud c...|
|    "Medium cloud...|
|    "Low cloud co...|
|    "Shortwave Ra...|
|                  },|
|                   {|
+--------------------+
only showing top 20 rows



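### The result is strange! It's not what you were expecting, right?

Spark's JSON reader expects one JSON object per line. Here the file is a single JSON array spread over many lines, so every line is treated as a separate record, fails to parse and ends up in the _corrupt_record column. Instead, we can read the whole file as one string and pass it to the JSON reader: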

In [ ]:
val df = sparkSession.read.json(sparkContext.wholeTextFiles("hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/geoVN.json").values)  

df: org.apache.spark.sql.DataFrame = [City: string, Day: bigint ... 14 more fields]
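
On Spark 2.2 or later, the JSON reader also has a `multiLine` option that parses a file holding one multi-line JSON document, without going through `wholeTextFiles`. The sketch below assumes such a newer environment (the PMML header later in this notebook reports Spark 2.1.1, which does not have this option). The temporary file, the two sample records and the names `spark`, `sample` and `multiLineDf` are hypothetical and only serve the illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; master is only used for a fresh local session
val spark = SparkSession.builder()
  .appName("multiLineJsonSketch")
  .master("local[*]")
  .getOrCreate()

// A small JSON array spread over several lines, shaped like geoVN.json (sample values only)
val sample = """[
  {"City": "Ho Chi Minh", "Day": 23},
  {"City": "Ho Chi Minh", "Day": 24}
]"""
val path = Files.createTempFile("geoVN_sample", ".json")
Files.write(path, sample.getBytes("UTF-8"))

// multiLine = true parses the whole file as one JSON document (Spark 2.2+)
val multiLineDf = spark.read.option("multiLine", true).json(path.toUri.toString)
multiLineDf.show()
```

With this option, each element of the top-level array becomes one row, so the result should match what the `wholeTextFiles` approach produces.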


In [ ]:
df.take(2)

res5: Array[org.apache.spark.sql.Row] = Array([Ho Chi Minh,23,85.08,10.823,106.63,42.42,1007.0,89.46,8,86.92,2094.17,26.27,9.4,89.92,2017,1503439200], [Ho Chi Minh,24,44.92,10.823,106.63,31.29,1007.35,70.12,8,82.5,3365.09,27.16,2.6,72.88,2017,1503525600])


In [ ]:
// List the distinct cities in the dataset
df.select("City").distinct.take(50)

res7: Array[org.apache.spark.sql.Row] = Array([Da Nang], [Can Tho], [Hue], [Ha Noi], [Ho Chi Minh])


In [ ]:
df.printSchema()

root
 |-- City: string (nullable = true)
 |-- Day: long (nullable = true)
 |-- High cloud cover daily mean [high cld lay]: double (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Lon: double (nullable = true)
 |-- Low cloud cover daily mean [low cld lay]: double (nullable = true)
 |-- Mean Sea Level Pressure daily mean [MSL]: double (nullable = true)
 |-- Medium cloud cover daily mean [mid cld lay]: double (nullable = true)
 |-- Month: long (nullable = true)
 |-- Relative humidity daily mean [2 m above gnd]: double (nullable = true)
 |-- Shortwave Radiation - backwards daily sum [sfc]: double (nullable = true)
 |-- Temperature daily mean [2 m above gnd]: double (nullable = true)
 |-- Total Precipitation daily sum [sfc]: double (nullable = true)
 |-- Total cloud cover daily mean [sfc]: double (nullable = true)
 |-- Year: long (nullable = true)
 |-- timestamp: long (nullable = true)



### Description of data 

In [ ]:
// Number of rows in the dataset
df.count()

res9: Long = 40


#### Statistics summary 

In [ ]:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}


In [ ]:
// Convert df to an RDD so that we can use MultivariateStatisticalSummary.
// Keep only the variables we need and put the target (temperature, index 6) first.
val rdd = df.drop("City").drop("Day").drop("Year").drop("Month").drop("timestamp").drop("Lat").drop("Lon")
            .map(l => (l(6).asInstanceOf[Double], l(1).asInstanceOf[Double], l(2).asInstanceOf[Double],
                       l(3).asInstanceOf[Double], l(4).asInstanceOf[Double], l(5).asInstanceOf[Double],
                       l(0).asInstanceOf[Double], l(7).asInstanceOf[Double], l(8).asInstanceOf[Double])).rdd

rdd: org.apache.spark.rdd.RDD[(Double, Double, Double, Double, Double, Double, Double, Double, Double)] = MapPartitionsRDD[17] at rdd at <console>:81


In [ ]:
rdd.take(2)

res9: Array[(Double, Double, Double, Double, Double, Double, Double, Double, Double)] = Array((26.27,42.42,1007.0,89.46,86.92,2094.17,85.08,9.4,89.92), (27.16,31.29,1007.35,70.12,82.5,3365.09,44.92,2.6,72.88))


In [ ]:
// Convert the RDD of tuples to an RDD of vectors
val observations = rdd.map(l => Vectors.dense(l._1, l._2, l._3, l._4, l._5, l._6, l._7, l._8, l._9))

observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[24] at map at <console>:80


In [ ]:
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println("Mean of each column: " + summary.mean)
println("Variance of each column: " + summary.variance)
// numNonzeros counts the nonzero entries of each column, not the non-null ones
println("Number of nonzero values in each column: " + summary.numNonzeros)
println()

Mean of each column: [27.372999999999998,41.60125,1005.7774999999998,67.84125,84.66675,2915.84025,69.06050000000002,9.390000000000002,73.85525]
Variance of each column: [2.256365128205129,678.8604830128204,4.685562820512852,788.4269907051283,51.62532506410257,3008952.402069167,561.1052356410256,192.21579487179488,626.8092255769232]
Number of nonzero values in each column: [40.0,40.0,40.0,40.0,40.0,40.0,40.0,35.0,40.0]

summary: org.apache.spark.mllib.stat.MultivariateStatisticalSummary = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2d11d3c5


#### Correlations of variables 

In [ ]:
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD


In [ ]:
// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method
// If a method is not specified, Pearson's method will be used by default.
val correlMatrix: Matrix = Statistics.corr(observations, "pearson")
//println(correlMatrix.toString)

correlMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0                   -0.6071271749991745  -0.7769139178723304    ... (9 total)
-0.6071271749991745   1.0                  0.15213518644495438    ...
-0.7769139178723304   0.15213518644495438  1.0                    ...
-0.6317688824653139   0.5795801157147833   0.3987737633967863     ...
-0.8783589380133455   0.7848585544175861   0.5355160905617118     ...
0.7788360980837538    -0.7321669112163407  -0.5401236564390007    ...
-0.07984925706043007  0.34275011990685916  -0.054919658480978115  ...
-0.3896111580334033   0.6302167222500573   0.17145792931429643    ...
-0.6190037706669987   0.6470804476922558   0.3584018763482733     ...
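
For reference, this is the quantity the "pearson" method computes. For two columns $x$ and $y$ with $n$ observations and means $\bar{x}$ and $\bar{y}$:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The rows and columns follow the order of the observation vectors, so the first column holds the correlation of each variable with temperature: for example, about -0.88 with relative humidity and about 0.78 with shortwave radiation.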


### Convert to an RDD of LabeledPoint

The model could be built with either the ML or the MLlib library. Because we want to export the model to PMML at the end, it is better to work with MLlib: the ML library does not support the PMML converter yet.

In [ ]:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD


In [ ]:
val dataLabeled = rdd.map(l => LabeledPoint(l._1, Vectors.dense(l._2, l._3, l._4, l._5, l._6, l._7, l._8, l._9)))                  

dataLabeled: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[28] at map at <console>:88


In [ ]:
dataLabeled.take(5)

res22: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((26.27,[42.42,1007.0,89.46,86.92,2094.17,85.08,9.4,89.92]), (27.16,[31.29,1007.35,70.12,82.5,3365.09,44.92,2.6,72.88]), (26.68,[42.5,1007.31,84.71,86.17,2028.31,78.04,10.0,85.28]), (26.49,[60.21,1005.54,66.5,87.83,1682.99,62.83,14.4,73.17]), (26.42,[61.5,1005.94,97.12,86.46,2837.32,49.08,4.3,97.12]))


### Build a linear regression model 

In [ ]:
// Building the model
val numIterations = 100
val stepSize = 0.00000000001
val model = LinearRegressionWithSGD.train(dataLabeled, numIterations, stepSize)

       val model = LinearRegressionWithSGD.train(dataLabeled, numIterations, stepSize)
                   ^
numIterations: Int = 100
stepSize: Double = 1.0E-11
model: org.apache.spark.mllib.regression.LinearRegressionModel = org.apache.spark.mllib.regression.LinearRegressionModel: intercept = 0.0, numFeatures = 8
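
The step size matters a lot here because the features are not scaled: shortwave radiation is in the thousands while precipitation is in single digits. The sketch below is not MLlib's SGD. It is a simplified, plain gradient descent on a single training row copied from the output above, with no intercept, meant only to show how far 100 iterations move the prediction for two step sizes. The helper names `predict` and `descend` and the second step size `1e-7` are illustrative assumptions.

```scala
// First labeled point from the notebook output: temperature and its 8 features
val label = 26.27
val features = Array(42.42, 1007.0, 89.46, 86.92, 2094.17, 85.08, 9.4, 89.92)

// Linear prediction without intercept: dot product of weights and features
def predict(weights: Array[Double]): Double =
  weights.zip(features).map { case (w, x) => w * x }.sum

// Plain gradient descent on the squared error 0.5 * (prediction - label)^2,
// starting from all-zero weights
def descend(stepSize: Double, iterations: Int): Array[Double] = {
  var weights = Array.fill(features.length)(0.0)
  for (_ <- 1 to iterations) {
    val error = predict(weights) - label
    weights = weights.zip(features).map { case (w, x) => w - stepSize * error * x }
  }
  weights
}

val withTinyStep = predict(descend(1e-11, 100))   // stays far below the label
val withLargerStep = predict(descend(1e-7, 100))  // reaches the label on this row

println(s"step 1e-11: $withTinyStep, step 1e-7: $withLargerStep")
```

On this single row, the tiny step leaves the prediction well under 1, while the larger step fits the row. On the full, unscaled dataset a step that suits one feature can be too large for another, which is why scaling the features before training is a common remedy.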


### Evaluation of model 

In [ ]:
// Evaluate model on training examples and compute training error
val valuesAndPreds = dataLabeled.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)

training Mean Squared Error = 751.2262320762441
valuesAndPreds: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[35] at map at <console>:95
MSE: Double = 751.2262320762441
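
One way to read this number is to compare it with a trivial baseline. As a rough check, the sketch below computes the MSE of a model that always predicts 0, using only the temperature mean and variance printed in the statistics summary. It assumes that the reported variance is the unbiased sample variance (divided by n - 1); the name `mseOfZeroPredictor` is illustrative.

```scala
// Values copied from the statistics summary (first column = temperature)
val n = 40
val meanTemperature = 27.372999999999998
val varianceTemperature = 2.256365128205129

// For a constant prediction of 0: MSE = mean(y^2) = mean^2 + (n - 1) / n * sampleVariance
val mseOfZeroPredictor =
  meanTemperature * meanTemperature + (n - 1).toDouble / n * varianceTemperature

println(s"MSE of always predicting 0: $mseOfZeroPredictor")
```

The baseline comes out at about 751.5, almost the same as the training MSE reported above, which suggests that the trained model does little better than predicting 0.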


### Predict a new value

In [ ]:
// Build a test input; predict expects an RDD[Vector]
val test = sc.parallelize(List("47.42,1017.0,91.46,80.92,2194.17,95.08,4.4,79.92")).map(l => l.split(","))
              .map(l => Vectors.dense(l(0).toDouble, l(1).toDouble, l(2).toDouble, l(3).toDouble, 
                                      l(4).toDouble, l(5).toDouble, l(6).toDouble, l(7).toDouble))

test: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[485] at map at <console>:86


In [ ]:
val predictResult = model.predict(test)

predictResult: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[486] at mapPartitions at GeneralizedLinearAlgorithm.scala:70


In [ ]:
predictResult.collect()

res66: Array[Double] = Array(0.003554256277734528)


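### Conclusion

With no tuning and a small dataset (40 observations), the quality of the model is poor. The MSE measures the mean squared difference between the real and the predicted values, so it should be small. Here it is about 751, roughly the square of the mean temperature (27.37), which means the model predicts values close to zero; the test prediction of 0.0036 shows the same thing. A likely cause is the very small step size (1e-11) combined with unscaled features and no intercept (intercept = 0.0), so that 100 iterations barely move the weights. To improve the model, we can tune the training parameters (step size, number of iterations), scale the features, add more observations or more explanatory variables (the ones used here were picked only as an example), filter out anomalies in the dataset, or apply variable selection.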

### Convert to PMML 

In [ ]:
// Export the model as a string in PMML format
println("PMML Model:\n" + model.toPMML)

PMML Model:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
    <Header description="linear regression">
        <Application name="Apache Spark MLlib" version="2.1.1"/>
        <Timestamp>2017-08-30T09:46:03</Timestamp>
    </Header>
    <DataDictionary numberOfFields="9">
        <DataField name="field_0" optype="continuous" dataType="double"/>
        <DataField name="field_1" optype="continuous" dataType="double"/>
        <DataField name="field_2" optype="continuous" dataType="double"/>
        <DataField name="field_3" optype="continuous" dataType="double"/>
        <DataField name="field_4" optype="continuous" dataType="double"/>
        <DataField name="field_5" optype="continuous" dataType="double"/>
        <DataField name="field_6" optype="continuous" dataType="double"/>
        <DataField name="field_7" optype="continuous" dataType="double"/>
        <DataField name="target" optype="continuous" dataType="doubl

### Steps to create an API in Hupi - Interface

1/ After printing the PMML code, copy it (from <PMML version... to </PMML>).

2/ Go to Hupi - Link and click on "Predict" to see all predict endpoints. Click on "Create predict endpoint", fill in the method name, then click on "Add query object". A "Query Object" entry appears: click on it and paste the PMML code into "Modele". In the Query Object, besides "Modele", also fill in: Name, Description (optional), Predict engine (choose OpenScoring here) and Export type (csv, etc.). Then click on "Submit". To check that Hupi has understood the model, reopen "Query Object" after submitting: it should now contain a "Summary" that describes the model, such as:

{
"data":{
"id":"someId",
"miningFunction":"regression",
....
}
}

If you see something like this, the API is ready! If not, something went wrong, or the PMML code was not copied properly.

### Steps to make the predict result visible in Hupi - interface 

1/ After creating the predict endpoint with PMML, we can create a widget. Click on "Widgets", then "Create widget". Enter a "Name", then click on "Details": in "URI", choose the predict endpoint that you just created, and set "Render Type" to csv. Next, click on "Options" to add the explanatory variables of the model. Be careful: the names must be the same as in the PMML code, where you can see DataField name="field_0" and so on; field_0 is the name of your first explanatory variable. Click on "add API params", enter "field_0" on the left and its value on the right. We have 8 variables here, so we need to create 8 such lines. When you are done, click on "Submit"!

2/ Then, on the left, click on "Widget Preview" to see the output that will be shown in Hupi - Front.

3/ All that is left is to create a dashboard, add it to a theme, and then add the widget to the dashboard.
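
When filling in the API params in step 1, it helps to know which variable each PMML field stands for. The sketch below is a plain Scala lookup derived from the order of the features in the LabeledPoint built earlier. It assumes that the PMML exporter numbers the fields in that same order; the names `featureNames` and `fieldToVariable` are illustrative.

```scala
// Feature order used when building the LabeledPoint (l._2 ... l._9)
val featureNames = Seq(
  "Low cloud cover daily mean",
  "Mean Sea Level Pressure daily mean",
  "Medium cloud cover daily mean",
  "Relative humidity daily mean",
  "Shortwave Radiation - backwards daily sum",
  "High cloud cover daily mean",
  "Total Precipitation daily sum",
  "Total cloud cover daily mean"
)

// Pair each variable with its PMML field name: field_0, field_1, ...
val fieldToVariable: Seq[(String, String)] =
  featureNames.zipWithIndex.map { case (name, i) => (s"field_$i", name) }

fieldToVariable.foreach { case (field, name) => println(s"$field -> $name") }
```

Note that this order differs from the bullet list at the top of the notebook, because the target was moved to the front and the high cloud cover column was moved after shortwave radiation when the RDD was built.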

### To create the data visualizations in Hupi - Front

To visualize the dataset in Hupi - Front, we first save it to a MongoDB database, then create an endpoint that uses this collection, and finally create widgets that display the data as charts.

To run the code below, add the MongoDB Spark connector as a dependency under "Edit" -> "Edit notebook metadata" -> "customDeps". See https://github.com/spark-notebook/spark-notebook/blob/master/docs/metadata.md for details.

In [ ]:
import com.mongodb.spark._
import com.mongodb.spark.config._
import org.bson.Document

import com.mongodb.spark._
import com.mongodb.spark.config._
import org.bson.Document


In [ ]:
df.write.format("com.mongodb.spark.sql").option("uri", "mongodb://hupi:Hupi123\\!@54.188.232.161:27017/hupi.dataset_climateVn")
.mode("overwrite").save()

com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting to connect. Client view of cluster state is {type=UNKNOWN, servers=[{address=54.188.232.161:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.SocketTimeoutException: connect timed out}}]
  at com.mongodb.connection.BaseCluster.getDescription(BaseCluster.java:163)
  at com.mongodb.Mongo.getClusterDescription(Mongo.java:411)
  at com.mongodb.Mongo.getServerAddressList(Mongo.java:404)
  at com.mongodb.spark.connection.MongoClientCache$$anonfun$logClient$1.apply(MongoClientCache.scala:161)
  at com.mongodb.spark.connection.MongoClientCache$$anonfun$logClient$1.apply(MongoClientCache.scala:161)
  at com.mongodb.spark.LoggingTrait$class.logInfo(LoggingTrait.scala:48)
  at com.mongodb.spark.Logging.logInfo(Logging.scala:24)
  at com.mongodb.spark.connection.MongoClientCache.logClient(MongoClientCache.scala:161)
  at com.mongodb.spark.co