<img src=http://fd.perso.eisti.fr/Logos/TORUS2.png>




# Exercise time!

Now that you have gone through the lessons, it is your turn to practice building a linear regression model in Spark!

### Exercise 1 : 

On the HDFS of factory02, at "hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/geoVN.json", there is a small dataset (40 observations) describing the climate of several cities in Vietnam (source: https://www.meteoblue.com/en/weather/).

Try to build a linear regression model that predicts the daily mean temperature from the 8 variables below:
- High cloud cover daily mean
- Low cloud cover daily mean
- Mean Sea Level Pressure daily mean
- Medium cloud cover daily mean
- Relative humidity daily mean
- Shortwave Radiation - backwards daily sum
- Total Precipitation daily sum
- Total cloud cover daily mean

To keep things simple, we only consider these numerical variables. The dataset would also allow us to study the relationship between the daily mean temperature and the city (a categorical variable).
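
Written out, the model this exercise asks for can be sketched as follows. The symbols are notation introduced here for illustration, not taken from the dataset: $\hat{y}$ is the predicted temperature daily mean, $x_1, \dots, x_8$ are the eight variables listed above, and $\beta_0, \dots, \beta_8$ are the coefficients to be estimated.

$$
\hat{y} = \beta_0 + \sum_{j=1}^{8} \beta_j x_j
$$

Fitting the model means choosing the $\beta_j$ that minimize the squared error $\sum_{i=1}^{40} (y_i - \hat{y}_i)^2$ over the 40 observations.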

### Exercise 2 : 

Create an API that serves the model's predictions and displays the result in the Hupi interface. (Suggestion: convert the model to PMML format -> create an API in Hupi - Predict -> create a Widget...)

### Exercise 3 :

Using the dataset geoVN.json, create some visualizations in Hupi!

In [ ]:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession
  .builder()
  .appName("linearRegModel_solution")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import sparkSession.implicits._

val df = sparkSession.read.json("hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/geoVN.json")

import org.apache.spark.sql.SparkSession
sparkSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@726f6ffc
import sparkSession.implicits._
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]


In [ ]:
df.show()

+--------------------+
|     _corrupt_record|
+--------------------+
|                   [|
|                   {|
|       "Year": 2017,|
|         "Month": 8,|
|          "Day": 23,|
|    "City": "Ho C...|
|      "Lat": 10.823,|
|      "Lon": 106.63,|
|    "timestamp": ...|
|    "Temperature ...|
|    "Relative hum...|
|    "Mean Sea Lev...|
|    "Total Precip...|
|    "Total cloud ...|
|    "High cloud c...|
|    "Medium cloud...|
|    "Low cloud co...|
|    "Shortwave Ra...|
|    "datestamp":"...|
|                  },|
+--------------------+
only showing top 20 rows



In [ ]:
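// geoVN.json is one multi-line JSON array, but read.json expects one JSON record per line (hence _corrupt_record above), so read the whole file as a single string
val df = sparkSession.read.json(sparkSession.sparkContext.wholeTextFiles("hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/geoVN.json").values)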

df: org.apache.spark.sql.DataFrame = [City: string, Day: bigint ... 15 more fields]


In [ ]:
df.show()

+-----------+---+------------------------------------------+-------+-------+----------------------------------------+----------------------------------------+-------------------------------------------+-----+--------------------------------------------+-----------------------------------------------+--------------------------------------+-----------------------------------+----------------------------------+----+----------+----------+
|       City|Day|High cloud cover daily mean [high cld lay]|    Lat|    Lon|Low cloud cover daily mean [low cld lay]|Mean Sea Level Pressure daily mean [MSL]|Medium cloud cover daily mean [mid cld lay]|Month|Relative humidity daily mean [2 m above gnd]|Shortwave Radiation - backwards daily sum [sfc]|Temperature daily mean [2 m above gnd]|Total Precipitation daily sum [sfc]|Total cloud cover daily mean [sfc]|Year| datestamp| timestamp|
+-----------+---+------------------------------------------+-------+-------+----------------------------------------+-----

In [ ]:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}


In [ ]:
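// Convert df to an RDD so that MultivariateStatisticalSummary can be used, keeping only the numeric variables we need.
// After the drops the columns are in alphabetical order; the tuple puts temperature (index 6) first, followed by the 8 predictors.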
val rdd = df.drop("City").drop("Day").drop("Year").drop("Month").drop("timestamp").drop("Lat").drop("Lon").drop("datestamp")
            .map(l => (l(6).asInstanceOf[Double], l(1).asInstanceOf[Double], l(2).asInstanceOf[Double],
                       l(3).asInstanceOf[Double], l(4).asInstanceOf[Double], l(5).asInstanceOf[Double],
                       l(0).asInstanceOf[Double], l(7).asInstanceOf[Double], l(8).asInstanceOf[Double])).rdd

rdd: org.apache.spark.rdd.RDD[(Double, Double, Double, Double, Double, Double, Double, Double, Double)] = MapPartitionsRDD[17] at rdd at <console>:83


In [ ]:
// Convert the RDD of tuples to an RDD of dense vectors
val observations = rdd.map(l => Vectors.dense(l._1, l._2, l._3, l._4, l._5, l._6, l._7, l._8, l._9))

observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[18] at map at <console>:82


In [ ]:
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println("Column means: " + summary.mean)
println("Column variances: " + summary.variance)
// numNonzeros counts the non-zero entries per column (not the non-null ones)
println("Non-zero counts per column: " + summary.numNonzeros)
println()

Column means: [27.372999999999998,41.60125,1005.7774999999998,67.84125,84.66675,2915.84025,69.06050000000002,9.390000000000002,73.85525]
Column variances: [2.256365128205129,678.8604830128204,4.685562820512852,788.4269907051283,51.62532506410257,3008952.402069167,561.1052356410256,192.21579487179488,626.8092255769232]
Non-zero counts per column: [40.0,40.0,40.0,40.0,40.0,40.0,40.0,35.0,40.0]

summary: org.apache.spark.mllib.stat.MultivariateStatisticalSummary = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2983b5fe
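
To make these three statistics concrete, here is a minimal plain-Scala sketch (no Spark needed) that computes them for a single column. It assumes that `colStats` reports the unbiased sample variance (dividing by n - 1). The temperature values are the first six Ho Chi Minh rows printed by `df.take(10)`; the precipitation array is hypothetical and only there to show a non-zero count that is smaller than the column length.

```scala
// Temperature daily mean of the first six Ho Chi Minh rows (taken from df.take(10)).
val temperatureSample = Array(26.27, 27.16, 26.68, 26.49, 26.42, 26.32)
// Hypothetical precipitation values with one dry day, for illustration only.
val precipitationSample = Array(9.4, 0.0, 2.6)

// Arithmetic mean of a column.
def columnMean(xs: Array[Double]): Double = xs.sum / xs.length

// Unbiased sample variance: squared deviations divided by (n - 1).
def columnVariance(xs: Array[Double]): Double = {
  val m = columnMean(xs)
  xs.map(x => (x - m) * (x - m)).sum / (xs.length - 1)
}

// Number of entries that are not exactly zero.
def columnNonzeros(xs: Array[Double]): Int = xs.count(_ != 0.0)

println("mean = " + columnMean(temperatureSample))
println("variance = " + columnVariance(temperatureSample))
println("non-zero precipitation days = " + columnNonzeros(precipitationSample))
```

Read this way, the 35.0 in the precipitation column above means that 5 of the 40 observations have zero precipitation, not that values are missing.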


In [ ]:
df.take(10)

res13: Array[org.apache.spark.sql.Row] = Array([Ho Chi Minh,23,85.08,10.823,106.63,42.42,1007.0,89.46,8,86.92,2094.17,26.27,9.4,89.92,2017,23-08-2017,1503439200], [Ho Chi Minh,24,44.92,10.823,106.63,31.29,1007.35,70.12,8,82.5,3365.09,27.16,2.6,72.88,2017,24-08-2017,1503525600], [Ho Chi Minh,25,78.04,10.823,106.63,42.5,1007.31,84.71,8,86.17,2028.31,26.68,10.0,85.28,2017,25-08-2017,1503612000], [Ho Chi Minh,26,62.83,10.823,106.63,60.21,1005.54,66.5,8,87.83,1682.99,26.49,14.4,73.17,2017,26-08-2017,1503698400], [Ho Chi Minh,27,49.08,10.823,106.63,61.5,1005.94,97.12,8,86.46,2837.32,26.42,4.3,97.12,2017,27-08-2017,1503784800], [Ho Chi Minh,28,84.62,10.823,106.63,71.67,1007.61,81.42,8,90.33,1815.6,26.32,16.2,87.58,2017,28-08-2017,1503871200], [Ho Chi Minh,29,43.88,10.823,106.63,54.21,1008.26,4...

In [ ]:
df.select("City").distinct.take(10)

res15: Array[org.apache.spark.sql.Row] = Array([Da Nang], [Can Tho], [Hue], [Ha Noi], [Ho Chi Minh])


In [ ]:
rdd.take(2)

res21: Array[(Double, Double, Double, Double, Double, Double, Double, Double, Double)] = Array((26.27,42.42,1007.0,89.46,86.92,2094.17,85.08,9.4,89.92), (27.16,31.29,1007.35,70.12,82.5,3365.09,44.92,2.6,72.88))


In [ ]:
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD


In [ ]:
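// Calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
// If a method is not specified, Pearson's method will be used by default.
// Row/column 0 is the temperature, so the first column gives its correlation with each predictor.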
val correlMatrix: Matrix = Statistics.corr(observations, "pearson")
//println(correlMatrix.toString)

correlMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0                   -0.6071271749991745  -0.7769139178723304    ... (9 total)
-0.6071271749991745   1.0                  0.15213518644495438    ...
-0.7769139178723304   0.15213518644495438  1.0                    ...
-0.6317688824653139   0.5795801157147833   0.3987737633967863     ...
-0.8783589380133455   0.7848585544175861   0.5355160905617118     ...
0.7788360980837538    -0.7321669112163407  -0.5401236564390007    ...
-0.07984925706043007  0.34275011990685916  -0.054919658480978115  ...
-0.3896111580334033   0.6302167222500573   0.17145792931429643    ...
-0.6190037706669987   0.6470804476922558   0.3584018763482733     ...
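
As a rough illustration of what each entry of this matrix measures, the following plain-Scala sketch computes Pearson's correlation by hand for one pair of columns. The helper `pearson` is a name introduced here, not part of Spark. The inputs are the temperature and relative humidity of the first six Ho Chi Minh rows printed by `df.take(10)`, so the result will not equal the value of about -0.88 that the full 40-row matrix gives for this pair (row 4, column 0).

```scala
// Temperature and relative humidity for the first six Ho Chi Minh rows (from df.take(10)).
val tempSix = Array(26.27, 27.16, 26.68, 26.49, 26.42, 26.32)
val humiditySix = Array(86.92, 82.5, 86.17, 87.83, 86.46, 90.33)

// Pearson correlation: co-deviation divided by the product of the spreads.
// The 1/n normalization factors cancel, so plain sums are used.
def pearson(xs: Array[Double], ys: Array[Double]): Double = {
  require(xs.length == ys.length && xs.nonEmpty, "inputs must be non-empty and of equal length")
  val meanX = xs.sum / xs.length
  val meanY = ys.sum / ys.length
  val crossSum = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val spreadX = math.sqrt(xs.map(x => (x - meanX) * (x - meanX)).sum)
  val spreadY = math.sqrt(ys.map(y => (y - meanY) * (y - meanY)).sum)
  crossSum / (spreadX * spreadY)
}

val r = pearson(tempSix, humiditySix)
println(f"Pearson correlation on six rows: $r%.3f")
```

A value close to -1 means that, in this sample, warmer days tend to be less humid, which matches the sign seen in the full matrix.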


In [ ]:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD


In [ ]:
// Label = temperature daily mean (first tuple element); features = the 8 predictors
val dataLabeled = rdd.map(l => LabeledPoint(l._1, Vectors.dense(l._2, l._3, l._4, l._5, l._6, l._7, l._8, l._9)))

In [ ]:
// Build the model with stochastic gradient descent (SGD)
val numIterations = 100
val stepSize = 0.00000000001
val model = LinearRegressionWithSGD.train(dataLabeled, numIterations, stepSize)
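
The step size used above is extremely small. A plausible reason, stated here as an assumption rather than something the notebook says, is that the features are not standardized (shortwave radiation has a variance of about 3 million), so larger steps would make gradient descent diverge. The plain-Scala sketch below, which does not use Spark, shows the same idea on a single feature: once the feature is standardized, an ordinary step size such as 0.1 converges. The helpers `standardize` and `fit` and the chosen step size are illustrative, and the six data points are again taken from the `df.take(10)` output.

```scala
// Relative humidity (feature) and temperature (label) for the first six Ho Chi Minh rows.
val humidityFeature = Array(86.92, 82.5, 86.17, 87.83, 86.46, 90.33)
val temperatureLabel = Array(26.27, 27.16, 26.68, 26.49, 26.42, 26.32)

// Rescale a feature to zero mean and unit (population) standard deviation.
def standardize(xs: Array[Double]): Array[Double] = {
  val mean = xs.sum / xs.length
  val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.length)
  xs.map(x => (x - mean) / std)
}

// Batch gradient descent for y = intercept + slope * x with mean squared-error loss.
def fit(xs: Array[Double], ys: Array[Double], stepSize: Double, numIterations: Int): (Double, Double) = {
  var intercept = 0.0
  var slope = 0.0
  val n = xs.length
  for (_ <- 0 until numIterations) {
    val residuals = xs.zip(ys).map { case (x, y) => intercept + slope * x - y }
    val gradIntercept = residuals.sum / n
    val gradSlope = residuals.zip(xs).map { case (res, x) => res * x }.sum / n
    intercept -= stepSize * gradIntercept
    slope -= stepSize * gradSlope
  }
  (intercept, slope)
}

val (fittedIntercept, fittedSlope) =
  fit(standardize(humidityFeature), temperatureLabel, stepSize = 0.1, numIterations = 500)
println(f"intercept = $fittedIntercept%.4f, slope = $fittedSlope%.4f")
```

With a standardized feature the intercept converges to the mean temperature of the sample, and the slope is the change in temperature per standard deviation of humidity. Standardizing the eight predictors before training would be one way to avoid hand-tuning a tiny step size.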