### Preparing HDFS
Using magic

Create input folder on HDFS if not exists

Copy from data from local

<span style="color:red"><em style=font-size:40px;>!</em>The following cells are basically a copy and paste of our assessment 1 submission

In [1]:
!pwd
! hadoop fs -rm -R /tmp/rs_input
! hadoop fs -mkdir -p  /tmp/rs_input
! hadoop fs -put   -p  ./../data-raw/Melbourne_housing_FULL.csv             /tmp/rs_input/raw.csv
! hadoop fs -ls        /tmp/rs_input/

/home/big-data-realestate-master/scripts

20/06/02 14:05:14 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.


Deleted /tmp/rs_input


Found 1 items


-rwxrwxrwx   1 1000 staff    5018236 2020-06-02 10:54 /tmp/rs_input/raw.csv




In [2]:
//load raw into df
val df_raw = spark
    .read
    .format("csv")
    .option("header", "true")
    .load("hdfs://localhost:9000/tmp/rs_input/raw.csv")

Intitializing Scala interpreter ...

Spark Web UI available at http://a56a95a92af2:4041
SparkContext available as 'sc' (version = 2.4.5, master = local[*], app id = local-1591106723251)
SparkSession available as 'spark'


df_raw: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]


In [3]:
//only select columns we need now
var df_working= df_raw.select("Price",
                          "Method",
                          "Type",
                          "Distance",
                          "Rooms",
                          "Bathroom",
                          "Car",
                          "Landsize",
                          "Lattitude",
                          "Longtitude",    
                          "Suburb",
                          "Address",
                          "Date")


//add meaningful to column names
df_working = df_working.withColumnRenamed("Method","MethodOfSale")
    .withColumnRenamed("Distance","DistanceFromCBD")
    .withColumnRenamed("Type","PropertyType")
    .withColumnRenamed("Lattitude","Latitude")

df_working: org.apache.spark.sql.DataFrame = [Price: string, MethodOfSale: string ... 11 more fields]
df_working: org.apache.spark.sql.DataFrame = [Price: string, MethodOfSale: string ... 11 more fields]


In [4]:
//when profiling there are a number of columns with a "#N/A" which need to be removed
df_working = df_working.filter($"DistanceFromCBD" =!= "#N/A")

df_working: org.apache.spark.sql.DataFrame = [Price: string, MethodOfSale: string ... 11 more fields]


In [5]:
//refactored to remove the for column loop
df_working = df_working.withColumn("Price",col("Price").cast("Double"))
    .withColumn("Rooms",col("Rooms").cast("Int"))
    .withColumn("DistanceFromCBD",col("DistanceFromCBD").cast("Double"))
    .withColumn("Bathroom",col("Bathroom").cast("Int"))
    .withColumn("Car",col("Car").cast("Int"))
    .withColumn("Landsize",col("Landsize").cast("Double"))
    .withColumn("Latitude",col("Latitude").cast("Double"))
    .withColumn("Longtitude",col("Longtitude").cast("Double"))
    

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 11 more fields]


In [6]:
//convert categorically values to ints
//make sure the categorical type is upper
df_working = df_working.withColumn("PropertyType", upper(col("PropertyType")))

df_working = df_working.withColumn("PropertyType",
when(col("PropertyType") === "H", "1")
.when(col("PropertyType") === "U", "2")
.when(col("PropertyType") ==="T", "3")
.otherwise("0"))

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 11 more fields]
df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 11 more fields]


In [7]:
//convert categorically values to ints
//make sure the categorical type is upper
df_working = df_working.withColumn("MethodOfSale", upper(col("MethodOfSale")))

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 11 more fields]


In [8]:
df_working = df_working.withColumn("MethodOfSale",
when(col("MethodOfSale") === "S", "1")
.when(col("MethodOfSale") === "SP", "2")
.when(col("MethodOfSale") === "PI", "3")
.when(col("MethodOfSale") === "PN", "4")
.when(col("MethodOfSale") === "SN", "5")
.when(col("MethodOfSale") === "VB", "6")
.when(col("MethodOfSale") === "W", "7")
.when(col("MethodOfSale") === "SA", "8")
.when(col("MethodOfSale") === "SS", "9")                                
.otherwise("0"))

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 11 more fields]


In [9]:
//cast categorical values to Ints
//Not strictly needed with the one hot encoding later
df_working = df_working.withColumn("PropertyType",col("PropertyType").cast("Int"))
    .withColumn("MethodOfSale",col("MethodOfSale").cast("Int"))

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


In [10]:
// make first letter of suburb upper case
df_working= df_working.withColumn("Suburb", initcap(col("Suburb")))

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


In [11]:
//split address on Street name and Suffix
df_working = df_working.withColumn("StreetName",split(col("Address")," ").getItem(1)).
    withColumn("StreetSuffix",split(col("Address")," ").getItem(2)).drop("Address")

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 12 more fields]


In [12]:
//fix "The Parade, The *** adddresses"
df_working = df_working.withColumn("StreetName",
when(col("StreetName").like("The"), concat(lit("The "),col("StreetSuffix")))
.otherwise(col("StreetName")))

//remove the street suffix
df_working = df_working.withColumn("StreetSuffix",
when(col("StreetName").contains("The"), lit(""))
.otherwise(col("StreetSuffix")))

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 12 more fields]
df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 12 more fields]


In [13]:
//Rebuild street name from cleaned tokens
//this is approach was due to the legacy code we already had tokenizing the address so i just rejoined them
//one could write a regex replace to remove the street numbers unit numbers etc. bu i know these columns are clean
df_working = df_working.withColumn("StreetName", concat(col("StreetName"), lit(" "),col("StreetSuffix"))).drop("StreetSuffix")

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


In [14]:
// make first letter of Street upper case
df_working= df_working.withColumn("StreetName", initcap(col("StreetName")))

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


In [15]:
//drop all properties with land area less than 12 sqm 
df_working = df_working.filter(!($"Landsize"<12))

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


In [16]:
//drop rows where type = h and landsize < 50 sqm
df_working = df_working.filter(!($"Landsize"<50 && $"PropertyType"===1))

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


In [17]:
val df_not_null = df_working.na.drop()

df_not_null: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


In [18]:
df_not_null.printSchema()

root
 |-- Price: double (nullable = true)
 |-- MethodOfSale: integer (nullable = true)
 |-- PropertyType: integer (nullable = true)
 |-- DistanceFromCBD: double (nullable = true)
 |-- Rooms: integer (nullable = true)
 |-- Bathroom: integer (nullable = true)
 |-- Car: integer (nullable = true)
 |-- Landsize: double (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longtitude: double (nullable = true)
 |-- Suburb: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- StreetName: string (nullable = true)



In [19]:
! hadoop fs -rm -R /tmp/output
! hadoop fs -mkdir -p /tmp/output

20/06/02 14:05:37 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.


Deleted /tmp/output




In [20]:
//this coalesce and write out isn't necessary if doing e2e
val df_output = df_not_null.coalesce(1)
   .write
   .format("csv")
   .option("header","true")
   .mode("overwrite").option("sep",",")
   .save("hdfs://localhost:9000/tmp/output")

df_output: Unit = ()


In [21]:
! hadoop fs -mkdir -p /tmp/output
! rm ./../data-clean/cleanMelbourneData.csv
! hadoop fs -copyToLocal /tmp/output/\*.csv ./../data-clean/cleanMelbourneData.csv

##  <span style="color:red"><em style=font-size:40px;>!</em>Assessment 2 Modelling etc. Starts from here


## Import Data
Here you have the option to just start from the outputed file from the wrangling if that has been ran before. 
Or can go e2e

I think ideally we would want to move from output to a new input folder somewhere on HDFS

In [22]:
//use the set wrangled from above
var df_clean = df_not_null

df_clean: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


### Load from saved local file

ctrl + / or cmd + / should can comment / uncomment selections

In [23]:
// ! hadoop fs -mkdir -p  /tmp/output
// ! hadoop fs -put   -p  ./../data-clean/cleanMelbourneData.csv  /tmp/output  

In [24]:
// // Load Clean Dataset into a DataFrame from HDFS after wrangling is completed
// var df_clean = spark
//     .read
//     .format("csv")
//     .option("header", "true")
//     .load("hdfs://localhost:9000/tmp/input/*.csv")
// df_clean.cache()

// //when we use the output from the wrangle 
// //this can be removed as types should be fine
// df_clean = df_clean.withColumn("Price",col("Price").cast("Double"))
//     .withColumn("Rooms",col("Rooms").cast("Int"))
//     .withColumn("DistanceFromCBD",col("DistanceFromCBD").cast("Double"))
//     .withColumn("MethodOfSale",col("MethodOfSale").cast("Int"))
//     .withColumn("PropertyType",col("PropertyType").cast("Int"))
//     .withColumn("Bathroom",col("Bathroom").cast("Int"))
//     .withColumn("Car",col("Car").cast("Int"))
//     .withColumn("Landsize",col("Landsize").cast("Double"))
//     .withColumn("Latitude",col("Latitude").cast("Double"))
//     .withColumn("Longtitude",col("Longtitude").cast("Double"))


df_clean: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]
df_clean: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


In [25]:
//debug check file load
df_clean.cache()
df_clean.printSchema()
df_clean.count()

root
 |-- Price: double (nullable = true)
 |-- MethodOfSale: integer (nullable = true)
 |-- PropertyType: integer (nullable = true)
 |-- DistanceFromCBD: double (nullable = true)
 |-- Rooms: integer (nullable = true)
 |-- Bathroom: integer (nullable = true)
 |-- Car: integer (nullable = true)
 |-- Landsize: double (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longtitude: double (nullable = true)
 |-- Suburb: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- StreetName: string (nullable = true)



res1: Long = 15728


 ### Check Spark Parameters

In [26]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf,SparkContext}

val cs = spark.sparkContext.getConf
sc.getConf.getAll.foreach { println }

(spark.driver.host,a56a95a92af2)
(spark.repl.class.uri,spark://a56a95a92af2:38611/classes)
(spark.rdd.compress,True)
(spark.repl.class.outputDir,/tmp/tmpgpucxy3c)
(spark.serializer.objectStreamReset,100)
(spark.master,local[*])
(spark.executor.id,driver)
(spark.driver.port,38611)
(spark.submit.deployMode,client)
(spark.app.id,local-1591106723251)
(spark.app.name,spylon-kernel)
(spark.ui.showConsoleProgress,true)


import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
cs: org.apache.spark.SparkConf = org.apache.spark.SparkConf@1fa4ca92


### Construct vectors from attributes
#### Transform Sale Date into a numeric value

In [27]:
//does this don't do much for Pat's LR model
//so i removed it, but i have left it here incase others used it
df_clean = df_clean.withColumn("Date",unix_timestamp($"Date", "dd/mm/yyyy"))

df_clean: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


In [28]:
import org.apache.spark.ml.feature.{FeatureHasher,OneHotEncoder,StandardScaler,VectorAssembler}

//PK i have refactored the separte instances of hashers, encoders and assemblers to just using the feature hasher
//this will handle all the categorical items and and one hot encoding in single call

//set the feature names
val featureColumnNames= Array("MethodOfSale",
            "PropertyType",
            "DistanceFromCBD",
            "Rooms",
            "Bathroom",
            "Car",
            "Landsize",
            "Latitude",
            "Longtitude",
            "Suburb",
            "Date",
            "StreetName")

//set the categorical names
val categoricalFeatureColumnNames= Array("MethodOfSale",
            "PropertyType",
            "Suburb",
            "StreetName")

//define hasher
val hasher = new FeatureHasher()
  .setInputCols(featureColumnNames)
  .setCategoricalCols(categoricalFeatureColumnNames)
  .setOutputCol("hashedFeatures")


import org.apache.spark.ml.feature.{FeatureHasher, OneHotEncoder, StandardScaler, VectorAssembler}
featureColumnNames: Array[String] = Array(MethodOfSale, PropertyType, DistanceFromCBD, Rooms, Bathroom, Car, Landsize, Latitude, Longtitude, Suburb, Date, StreetName)
categoricalFeatureColumnNames: Array[String] = Array(MethodOfSale, PropertyType, Suburb, StreetName)
hasher: org.apache.spark.ml.feature.FeatureHasher = featureHasher_45faf028dfeb


### Split Data into a Training and a Testing Set

In [114]:
//define function to take a data frame and set training and test samples
//stratafied sample on propertype type
//original random sample may have been skewing the model
//used propertype as the prices vary wildly based on the type 

//idea is split into training and testing - then stratify the training set so its balanced.

//https://maet3608.github.io/nuts-ml/tutorial/split_stratify.html#id1

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._


def train_test_split(data: DataFrame) = {
    
    //split into 80% 20%
    var Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 30)
    
    //Stratified sampler
    //want to over sample the other propertyTypes
    val fractions = Map(1 -> 0.8,2 ->0.9, 3 -> 0.9)

     train = train.stat.sampleBy("PropertyType",fractions,36L)
    
    (train, test)
}

val (train, test) = train_test_split(df_clean)

train.cache()
test.cache()

//debug sample counts
println(train.count())
println(test.count())

10273
3108


import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
train_test_split: (data: org.apache.spark.sql.DataFrame)(org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], org.apache.spark.sql.Dataset[org.apache.spark.sql.Row])
train: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Price: double, MethodOfSale: int ... 11 more fields]
test: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Price: double, MethodOfSale: int ... 11 more fields]


In [115]:
train.groupBy("PropertyType").count().show()
test.groupBy("PropertyType").count().show()

+------------+-----+
|PropertyType|count|
+------------+-----+
|           1| 8430|
|           3|  834|
|           2| 1009|
+------------+-----+

+------------+-----+
|PropertyType|count|
+------------+-----+
|           1| 2602|
|           3|  223|
|           2|  283|
+------------+-----+



In [116]:
df_clean.groupBy("PropertyType").count().show()

+------------+-----+
|PropertyType|count|
+------------+-----+
|           1|13160|
|           3| 1155|
|           2| 1413|
+------------+-----+



### 1. Apply Linear Regression


###  Useful Specs for this section 

Feel free to move to references i guess

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.LinearRegression

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.FeatureHasher

https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression

https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#regression-model-evaluation

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/DataFrameStatFunctions.html#sampleBy(java.lang.String,%20java.util.Map,%20long)








In [117]:
//imports for linear regression code in one spot
//to be moved with other imports to start

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.Pipeline


In [118]:
//function for getting execution time from start and end times
def getExecutionTime(start: Long , end : Long) = {
    val duration:Long = (end - start) / 1000
    (duration)
}

getExecutionTime: (start: Long, end: Long)Long


### Run first LR model with default Params
Assess the training set

In [119]:
//Assess the above LR
val startTimeMillis = System.currentTimeMillis()

//just give a LR a go with default settings values
val lr = new LinearRegression()
    .setLabelCol("Price")
    .setFeaturesCol("hashedFeatures")
    .setPredictionCol("Predicted Price")
    .setMaxIter(50)

//make a dataset for testing the model and printing its summarry
val featurized = hasher.transform(train)
val lrModel = lr.fit(featurized)

// Print the coefficients and intercept for linear regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// Summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.summary

println("\n")

println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
println("\n")

trainingSummary.residuals.show()

println("\n")
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
println("\n")

//print runtime
val endTimeMillis = System.currentTimeMillis()

print("Model was executed "
      + getExecutionTime(startTimeMillis,endTimeMillis))


Coefficients: (262144,[23,38,54,85,200,209,238,240,246,403,441,457,507,508,510,562,589,603,629,651,706,709,845,894,943,950,1004,1025,1031,1059,1179,1249,1272,1340,1364,1372,1438,1512,1603,1750,1764,1815,1819,1867,1874,1972,2065,2079,2090,2158,2242,2284,2307,2354,2389,2398,2467,2476,2620,2630,2669,2695,2702,2760,2764,2807,2835,2933,2940,3036,3093,3101,3161,3187,3208,3298,3351,3380,3447,3556,3640,3667,3677,3735,3748,3817,3846,3849,3894,3963,3972,3987,4080,4093,4127,4157,4160,4223,4266,4278,4345,4381,4396,4397,4411,4447,4462,4496,4549,4566,4614,4652,4694,4780,4804,4843,4853,4878,4890,4914,4917,4960,5030,5125,5147,5154,5160,5166,5177,5196,5409,5417,5442,5510,5516,5540,5569,5632,5646,5700,5753,5777,5852,6144,6186,6204,6239,6349,6366,6453,6463,6484,6503,6509,6512,6637,6645,6701,6732,6756,6904,6943,6947,6951,6983,6998,7013,7116,7145,7210,7248,7265,7326,7329,7333,7369,7385,7393,7408,7458,7560,7564,7591,7612,7617,7634,7645,7665,7783,7785,7903,7916,7918,7968,7995,8039,8135,8154,8159,8177,8189,82

7845,127912,127951,127962,127971,127998,128024,128028,128109,128179,128214,128260,128290,128293,128338,128350,128371,128372,128373,128390,128442,128453,128559,128564,128770,128803,128884,128970,129035,129067,129071,129102,129105,129203,129336,129546,129559,129560,129811,129859,129938,129999,130078,130093,130147,130148,130173,130186,130230,130259,130286,130381,130423,130507,130510,130514,130521,130581,130688,130723,130785,130819,130877,130897,130903,130951,131012,131028,131101,131104,131229,131288,131457,131491,131516,131536,131566,131710,131722,131724,131736,131747,131795,131828,131850,131890,131905,131947,131965,131997,132007,132026,132102,132192,132275,132329,132335,132351,132370,132396,132437,132457,132528,132553,132586,132610,132668,132671,132689,132728,132776,132808,132882,132885,132902,132943,132985,132988,133007,133089,133091,133112,133199,133216,133218,133253,133254,133344,133361,133384,133391,133444,133462,133464,133469,133500,133526,133548,133598,133605,133877,133897,133956,1

,239452,239498,239501,239509,239519,239586,239590,239681,239727,239770,239886,239892,239936,240028,240054,240190,240218,240301,240325,240386,240419,240434,240468,240616,240632,240633,240661,240684,240727,240732,240749,240841,240883,240906,240912,240922,241010,241108,241246,241254,241255,241301,241374,241432,241480,241492,241572,241588,241611,241687,241728,241753,241788,241863,241889,241964,241995,242079,242089,242106,242126,242127,242129,242156,242222,242244,242252,242282,242283,242328,242340,242343,242367,242399,242427,242470,242508,242514,242557,242589,242713,242819,242852,243009,243060,243074,243105,243126,243133,243162,243164,243222,243235,243244,243367,243405,243472,243522,243534,243545,243793,243799,243801,243810,243839,243843,243860,243965,243976,243983,243992,244051,244056,244072,244103,244123,244140,244154,244200,244207,244252,244328,244338,244352,244357,244389,244431,244449,244460,244468,244479,244517,244561,244574,244577,244581,244583,244622,244627,244639,244642,244691,24478

94.631957229685,-105351.38652019898,-686416.2949883492,251364.77327814145,143034.0016193169,283947.55699549767,-232107.34278846483,55687.28583920882,319375.5118417378,202454.3769670063,311714.77901291155,-171431.34875643297,406980.2053221618,-129566.78572940026,35006.879887464565,200398.86132644812,28592.587289834904,-313832.9557386239,-174025.58943905425,-191586.77784542457,-28214.1436482989,-44995.83497144939,76674.18442963524,-81771.52975361675,64325.87164293021,-402835.88867088064,-304270.00102415064,-325443.60552530934,109887.94012634692,8645.277571525357,-64498.70032566854,168977.88278928204,-394578.1518944938,-251894.480254779,-529110.8667880578,-427107.7072808105,-138800.79064961968,-244611.5938948206,-258091.34217253534,1456955.794694138,-323140.3181197952,197338.38553131226,-340081.4401682048,143224.73153809045,241010.16002579205,-753857.1432747939,123036.66296245053,252373.40736847473,-39973.8638351435,-269599.31121432927,251318.80243646476,-304585.02412978397,20407.40008517

36975.41600266882,116803.4508351746,-216752.94727066305,1155448.9230948559,-45731.49765563104,-32151.0536468973,-434708.01690383663,204070.13767110242,-315132.1628665098,220668.1848442239,-245923.89011808022,-151350.6189895174,-350333.78552637354,140058.10376376627,653752.1005965851,-137021.98287302515,-60367.63824824973,82449.50396837243,205527.67887176975,50684.786631512434,326881.3653379352,-368484.9102474471,-51737.19907510888,-15712.670289040543,-132222.4404167573,179600.7410543008,-184123.91854294625,202320.19208903436,-242265.98032224557,-245627.4061583595,600841.311114372,-27101.623027731388,138638.0239709315,110510.4006682328,-67072.27374382319,-167698.50158377827,-278032.6851056337,-380436.3400691596,-223176.26591279675,594716.9909126469,-677892.2007836606,-83727.985822288,220260.66237950945,64303.08178567357,0.0022466891121900004,-424748.87892724056,-255963.18232314836,213178.2225946541,39170.383345501185,1517396.305581802,-266398.7351718261,-129905.87843232616,-224403.11788

6,103580.620958171,-368453.9188414116,-129972.97916727733,38189.96326890995,-149354.0396849474,-11705.945015140527,-126.97727345923806,-75141.32031439105,55608.96430321707,232677.2761535849,344331.36876532534,333079.5433298157,-475241.09809094726,-520301.4311058087,23532.168410880335,-408767.8165957876,-450244.74464209744,-195689.7906543386,756371.0980096933,-184923.9595139436,-203627.12585252165,57711.610899987245,-357401.8647242227,-24856.235902508695,-54263.442929728044,224212.071652204,-258277.12405929578,-594629.8102889367,79908.3881883879,235517.56605148411,1410993.047039548,91678.4351875785,278517.9560351472,-27007.429169908984,-76608.85034308856,-124278.33813169823,581510.9693727903,-7074.196807199812,-94652.11571962539,-54768.199600177635,-1434730.115044083,228869.85578995518,205809.6023548966,-61700.58976024948,-351748.7008368668,20493.22788896349,217465.7624198998,123898.56465280677,57530.27881181271,76104.5138251775,1771260.1942720674,-52340.776956114045,-222297.51635620807

96773,-83704.73560739405,-131507.74445649734,1026331.7632881631,-302852.89313290507,-647212.9524949691,36662.77328128072,-4966.355199062868,-332488.07015400316,236470.6183460766,-364910.9500249778,-81568.00255343867,931083.3082866682,-398818.1034492645,-482382.7250322303,59284.55140680122,-349119.81685110135,-663877.7503095716,-79268.36113229186,1901.9787556065019,-450501.89690821036,224717.58840072952,-16459.96974612706,-164060.36322190764,-171447.6838047353,15404.25417113253,48577.13277818021,-554670.8832577026,85251.44568319233,71230.61401615324,-4444.58569405301,-507314.58430647897,-670294.2971474967,34413.97036408891,-350763.4777778561,504218.3526177744,-79942.4945490173,1127907.8246040116,504713.3837500387,-533362.4018712844,-246446.36311085845,-183245.39113788857,-1936.3470778586193,-179823.715454834,107061.98627230724,-147533.45761828634,49716.553710961125,85734.17295429013,77095.31039920269,-77098.76783654459,354814.45420454687,-87476.89576392616,759.2176321579847,-342121.7958

.61857439156,283211.51692666416,337196.4708180904,493044.05175082776,255851.17309573584,63893.933837726836,-171711.6636562693,-72171.80921838929,-37989.20465383729,-291637.73193408066,-14886.29899040423,-481536.33049194573,-555141.8911461651,-345438.5350749559,-135869.1412630539,-9131.350161404333,-136904.6253481785,-3482.201624380965,-18498.12672987224,-144637.36929051194,284316.06326022465,35526.436561962524,269269.22466616024,-575513.3031103954,-164014.54839280547,78909.3626003322,-115434.22483054985,-74002.54632279222,118498.7502314577,-224583.93237890405,-487423.7733369999,-12507.423597691373,12453.974070287688,-28753.104236098945,-82885.52533481181,1268746.8900277887,-176150.4864178791,331379.35795844713,-147827.23842212465,-60842.330875779044,565552.8688807827,-431120.0216986986,-595882.627792843,362762.09156592574,-142782.67095753274,167451.52650722364,597141.7053819329,-20871.855466946476,22983.468075605633,-88479.87418984676,64172.45320556843,287993.04402332846,-450883.700840

7888,80217.24418818258,-280586.7937164755,-144112.5654966034,-85063.07120974947,71728.71408624755,160860.4577079213,63699.42248883291,196747.27699949496,-289351.63036593684,160589.5653599724,150873.33726343824,1225581.3306056058,-421939.401422438,726858.2864975749,1078823.682940675,-121240.25025710114,175960.5306311771,-306906.2234471558,-198922.54278224707,128459.65187682227,-597920.283987507,-146236.11031890803,-66241.2517440943,-747717.6105784192,504638.59692590515,61722.42579197432,115671.89936858452,-150324.09836016115,139000.18012082184,-323860.8526392616,-65570.5426112456,-44681.15690257717,-569869.7348277194,28855.56219295129,147199.3424553658,-153983.527165325,545191.815760448,-295198.23254054587,-443527.5815945553,-266037.3811889529,253998.7921796395,-128008.20728495096,-183509.31077877266,-376885.95378444664,1253937.1645436992,-276791.8418716923,484698.27581162547,274597.77057150844,-163520.0630841403,-223178.00321190304,-248156.68859348743,-684663.6687301904,-264760.0689731

+-------------------+
|          residuals|
+-------------------+
| 412.15276976674795|
| -767895.0032207184|
|  131277.2134665437|
| -634876.1045000292|
|-102938.24788755924|
| 193757.02808455005|
| 207213.82493167743|
| 118055.83231917024|
| 308584.27891164646|
|  306426.0624011792|
| 360900.56870378554|
|   52488.2821607776|
| 141071.52487400174|
|-193584.34176416695|
|   308282.288923122|
|-172392.29499951378|
| 130767.25863386318|
|-222046.03844140843|
|-53801.391875989735|
| -4.722123693674803|
+-------------------+
only showing top 20 rows



RMSE: 256722.38750802382
r2: 0.8493933892352158


Model was executed 3

startTimeMillis: Long = 1591108657739
lr: org.apache.spark.ml.regression.LinearRegression = linReg_17fa852a9b35
featurized: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 12 more fields]
lrModel: org.apache.spark.ml.regression.LinearRegressionModel = linReg_17fa852a9b35
trainingSummary: org.apache.spark.ml.regression.LinearRegressionTrainingSummary = org.apache.spark.ml.regression.LinearRegressionTrainingSummary@7eae722c
endTimeMillis: Long = 1591108661605


In [120]:
// define an evaluator for the cross validation
def evaluate ( predictions: DataFrame, metric: String) = {
    val eval =  new RegressionEvaluator()
       .setLabelCol("Price")
       .setPredictionCol("Predicted Price")
       .setMetricName(metric)
println("Root Mean Squared Error "+  metric.toUpperCase()
        +" on test data = " + eval.evaluate(predictions))
}

evaluate: (predictions: org.apache.spark.sql.DataFrame, metric: String)Unit


### Run Crossvalidator on LR model
Construct paramgrid of regParam and ElasticNet

Estimate the performance using RMSE

Get best Params to apply in final pipeline

One could add this to the pipepline but this takes the longest to run
as its assessing 4*4 combinations - so instead the learnings of this are passed to pipeline
Then this wouldn't need to run

<span style="color:red"><em style=font-size:40px;>!</em>Runtime of next cell is around 330 seconds or so, on my VM with like 10gig of RAM and 4 CPUs</span>



In [121]:
//Creates a crossvalidator on only the LR model
//Had some issues retreiving the params when using on pipeline
//and the pipeline only has one estimator in it so this was easier than traversing the stages of the pipline

val startTimeMillis = System.currentTimeMillis()

// make some featurised sets for just running the model thru the Crossvalidator
val featurized_training = hasher.transform(train)
val featurized_test = hasher.transform(test)

//set LR with 100 max iter
val lr = new LinearRegression()
    .setLabelCol("Price")
    .setFeaturesCol("hashedFeatures")
    .setPredictionCol("Predicted Price")
    .setMaxIter(100)


// We use a ParamGridBuilder to construct a grid of parameters to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0,0.1,0.5,1))
  .addGrid(lr.elasticNetParam, Array(0,0.1,0.5,1))
  .build()

// We now treat the model as an Estimator, wrapping it in a CrossValidator instance.
// This will allow us to choose best params for the model
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator()
  .setLabelCol("Price")
  .setPredictionCol("Predicted Price")
  .setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)  
  .setParallelism(2)

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(featurized_training)

// Make predictions on test documents. 
//cvModel uses the best model found.
cvModel.transform(featurized_test)
  .select("Price", "Predicted Price")
  .show()

//print runtime
val endTimeMillis = System.currentTimeMillis()

print("Model was executed "
      + getExecutionTime(startTimeMillis,endTimeMillis))


+--------+-------------------+
|   Price|    Predicted Price|
+--------+-------------------+
|170000.0|-165805.71619416028|
|280000.0| -52954.83617828041|
|280500.0|  421654.2463353276|
|283000.0| -250676.2730807215|
|290000.0|  737490.8979906812|
|300000.0|  229874.6147556901|
|300000.0|  417832.1629817188|
|305000.0|  650712.7257229611|
|310000.0|  417717.0282851383|
|316000.0|  55748.07751482725|
|320000.0|  317543.3528402597|
|320000.0|  261896.0108422488|
|320000.0| 132878.09726958722|
|320000.0|  626981.9097635373|
|325000.0|  -48903.9248617962|
|333000.0|-121796.17546715587|
|340000.0| 153495.73643434793|
|345000.0|  746647.0338011608|
|348000.0| -53608.77498348057|
|350000.0| -82522.69145133346|
+--------+-------------------+
only showing top 20 rows

Model was executed 285

startTimeMillis: Long = 1591108662397
featurized_training: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 12 more fields]
featurized_test: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 12 more fields]
lr: org.apache.spark.ml.regression.LinearRegression = linReg_f0fa089a12f2
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	linReg_f0fa089a12f2-elasticNetParam: 0.0,
	linReg_f0fa089a12f2-regParam: 0.0
}, {
	linReg_f0fa089a12f2-elasticNetParam: 0.1,
	linReg_f0fa089a12f2-regParam: 0.0
}, {
	linReg_f0fa089a12f2-elasticNetParam: 0.5,
	linReg_f0fa089a12f2-regParam: 0.0
}, {
	linReg_f0fa089a12f2-elasticNetParam: 1.0,
	linReg_f0fa089a12f2-regParam: 0.0
}, {
	linReg_f0fa089a12f2-elasticNetParam: 0.0,
	linReg_f0fa089a12f2-regParam: 0.1
...

### Extract best Parameters from CrossValidation

In [122]:
//print out the params used for the best model
val bestModel = cvModel.bestModel

//save as ParamMap to pass into pipeline
val bestParamMap = bestModel.extractParamMap

//these the bestParams determined from CV
println(bestModel.extractParamMap) 

{
	linReg_f0fa089a12f2-aggregationDepth: 2,
	linReg_f0fa089a12f2-elasticNetParam: 1.0,
	linReg_f0fa089a12f2-epsilon: 1.35,
	linReg_f0fa089a12f2-featuresCol: hashedFeatures,
	linReg_f0fa089a12f2-fitIntercept: true,
	linReg_f0fa089a12f2-labelCol: Price,
	linReg_f0fa089a12f2-loss: squaredError,
	linReg_f0fa089a12f2-maxIter: 100,
	linReg_f0fa089a12f2-predictionCol: Predicted Price,
	linReg_f0fa089a12f2-regParam: 1.0,
	linReg_f0fa089a12f2-solver: auto,
	linReg_f0fa089a12f2-standardization: true,
	linReg_f0fa089a12f2-tol: 1.0E-6
}


bestModel: org.apache.spark.ml.Model[_] = linReg_f0fa089a12f2
bestParamMap: org.apache.spark.ml.param.ParamMap =
{
	linReg_f0fa089a12f2-aggregationDepth: 2,
	linReg_f0fa089a12f2-elasticNetParam: 1.0,
	linReg_f0fa089a12f2-epsilon: 1.35,
	linReg_f0fa089a12f2-featuresCol: hashedFeatures,
	linReg_f0fa089a12f2-fitIntercept: true,
	linReg_f0fa089a12f2-labelCol: Price,
	linReg_f0fa089a12f2-loss: squaredError,
	linReg_f0fa089a12f2-maxIter: 100,
	linReg_f0fa089a12f2-predictionCol: Predicted Price,
	linReg_f0fa089a12f2-regParam: 1.0,
	linReg_f0fa089a12f2-solver: auto,
	linReg_f0fa089a12f2-standardization: true,
	linReg_f0fa089a12f2-tol: 1.0E-6
}


In [123]:
//define new LR instance for using the best params in pipeline
val bestlr = new LinearRegression()
    .setLabelCol("Price")
    .setFeaturesCol("hashedFeatures")
    .setPredictionCol("Predicted Price")

bestlr: org.apache.spark.ml.regression.LinearRegression = linReg_6ad1528346e1


### Define Pipeline 

In [124]:
// add linear regression to stages
// I removed the scaler as the performance for gain was terrible
val lrStages = Array(
    hasher,
    bestlr
)

lrStages: Array[org.apache.spark.ml.PipelineStage with org.apache.spark.ml.util.DefaultParamsWritable{def copy(extra: org.apache.spark.ml.param.ParamMap): org.apache.spark.ml.PipelineStage with org.apache.spark.ml.util.DefaultParamsWritable{def copy(extra: org.apache.spark.ml.param.ParamMap): org.apache.spark.ml.PipelineStage with org.apache.spark.ml.util.DefaultParamsWritable}}] = Array(featureHasher_45faf028dfeb, linReg_6ad1528346e1)


In [125]:
val startTimeMillis = System.currentTimeMillis()

//define a pipleine
val lrPipe = new Pipeline().setStages(lrStages)

// We fit our DataFrame into the pipeline to generate a model
// pass best ParamMap from cross validation
val lrModel = lrPipe.fit(train,bestParamMap)


// Make predictions using the model and the test data
// pass best ParamMap from cross validation
val predictions = lrModel.transform(test,bestParamMap)

val endTimeMillis = System.currentTimeMillis()

print("Pipeline was executed "
      + getExecutionTime(startTimeMillis,endTimeMillis))


Pipeline was executed 6

startTimeMillis: Long = 1591108949801
lrPipe: org.apache.spark.ml.Pipeline = pipeline_b26e2ad9a747
lrModel: org.apache.spark.ml.PipelineModel = pipeline_b26e2ad9a747
predictions: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 13 more fields]
endTimeMillis: Long = 1591108956299


In [126]:
def getExecutionTime(start: Long , end : Long) = {
    val duration:Long = (end - start) / 1000
    (duration)
}

getExecutionTime: (start: Long, end: Long)Long


In [127]:
//finesse the output of predicted price and price to aid visual compare
predictions.withColumn("Predicted Price", round($"Predicted Price", 0))
    .select("Price","Predicted Price").show()

+--------+---------------+
|   Price|Predicted Price|
+--------+---------------+
|170000.0|      -135846.0|
|280000.0|       -35190.0|
|280500.0|       426359.0|
|283000.0|      -181652.0|
|290000.0|       746044.0|
|300000.0|       223785.0|
|300000.0|       444185.0|
|305000.0|       697371.0|
|310000.0|       462258.0|
|316000.0|        24641.0|
|320000.0|       318811.0|
|320000.0|       247288.0|
|320000.0|       183254.0|
|320000.0|       604481.0|
|325000.0|       -68539.0|
|333000.0|      -121701.0|
|340000.0|       701511.0|
|345000.0|       770691.0|
|348000.0|       -56040.0|
|350000.0|       -46910.0|
+--------+---------------+
only showing top 20 rows



#### Regression metrics

**Mean squared error (MSE)** -- the average of squared differences between the predicted outcome and the true outcome.

**R2 coefficient** -- the proportion of variance in the outcome that our model is capable of predicting based on its features.

In [128]:
evaluate(predictions,"rmse")
evaluate(predictions,"r2")

Root Mean Squared Error RMSE on test data = 420273.740044464
Root Mean Squared Error R2 on test data = 0.5752747994605538


Looks like we go down from our training set:

`RMSE: 256722.38750802382
r2: 0.8493933892352158`


to our test set with best params:

`420273.740044464
0.5752747994605538`


### 2. Apply K-means Model  

This modelling of this section is based on (Sarkar, 2017)

#### Training


Pipeline Estimator

In [44]:
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(df)

<console>: 39: error: not found: value df

#### Prediction

#### Testing/Evaluation

Pipeline Model Transformer

Evaluate the quality of clustering by "elbowing" the **Within SEt Sum of Squared Errors (WSSSE)** graph.

In [None]:
val WSSSE = model.computeCost(df)

println("WSSSE error + $WSSSE")

In [None]:
model.clusterCenters.foreach(println)

In [None]:
val transformed = model.transform(df)

In [None]:
//compute the diff between the labels abd the predicated values on data set

transformed.select("prediction").groupBy("prediction")count().orderBy().show("prediction")

In [None]:
val y1df = transformed.select($"label",$"prediction").where($"label!=prediction")

In [None]:
y1df.count()

In [None]:
transformed.filter("prediction = 0").show()

In [None]:
transformed.filter("prediction = 1").show()

In [None]:
transformed.filter("prediction = 0").describe.show()

In [None]:
transformed.filter("prediction = 1").describe.show()

In [None]:
println("No of mis-matche between predictions and labels ="+y1df.count() +"\nTotal no. of records "+transformed.count()+ \
         "\nCorrect predictions = "+ (1-y1df.count()).toDouble/transformed.count()+"\nMismatch "+\
          (1-y1df.count()).toDouble/transformed.count())

In [None]:
//Feed test input records for the model to predict their cluster
val testData = spark.createDataFrame(Seq().toDF("colName","features"))
model.transform(testData).show()

### 3. Apply Random Forest Regression

**Build Random Forest model**
Specify maxDepth, maxBins, auto and seed parameters.

**maxDepth** -- Maximum depth of a tree. Increasing the depth makes the model more powerful, but deep trees take longer to train.

**maxBins** -- Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node.

**auto** -- Automatically select the number of features to consider for splits at each tree node

**seed** -- Use a random seed number , allowing to repeat the results


If the number of trees is 1, then no bootstrapping is used at all. However, if the number of trees is > 1, then the bootstrapping is accomplished. Where, the parameter featureSubsetStrategy signifies the number of features to be considered for splits at each node. The supported values of featureSubsetStrategy are "auto", "all", "sqrt", "log2" and "on third". The supported numerical values, on the other hand, are (0.0-1.0] and [1-n]. However, if featureSubsetStrategy is chosen as "auto", the algorithm chooses the best feature subset strategy automatically


If the numTrees == 1, the featureSubsetStrategy is set to be "all". However, if the numTrees > 1 (i.e., forest), featureSubsetStrategy is set to be "onethird" for regression


Moreover, if a real value "n" is in the range (0, 1.0] is set, n*number_of_features is used consequently. However, if an integer value "n" is in the range (1, the number of features) is set, only n features are used alternatively


The parameter categoricalFeaturesInfo which is a map is used for storing arbitrary of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1,...,k-1}
The impurity criterion used for information gain calculation. The supported values are “gini" and “variance”. The former is the only supported value for classification. The latter is used for regression


The maxDepth is the maximum depth of the tree. (e.g., depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). However, the suggested value is 4 to get a better result


The maxBins signifies the maximum number of bins used for splitting the features; where the suggested value is 100 to get better results


Finally, the random seed is used for bootstrapping and choosing feature subsets to avoid the random nature of the results.

In [None]:
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.tuning.CrossValidator
//import org.apache.spark.ml.Pipeline

val seed = 5043

val rf = new RandomForestRegressor()
  .setMaxBins(100)
  .setMaxDepth(6)
  .setNumTrees(10)
  .setFeatureSubsetStrategy("onethird")
  .setSeed(seed)
  .setLabelCol("Price")

In [None]:
val rfPredictions = time{predictions(rf, train, test)}
rfPredictions.cache()

In [None]:
rfPredictions.columns

In [None]:
rfPredictions.withColumn("prediction", round($"prediction", 0)).select("Price","prediction").show()

#### Regression metrics


In [None]:
evaluate(rfPredictions,"rmse")

In [None]:
evaluate(rfPredictions,"r2")

#### Testing/Evaluation/ Parameter Tuning

Cross-validation
<span style="color:red">
TO DO: 
* finish implementation for Cross-validation 
* check if finish run in reasonable time
</span>

In [None]:
import org.apache.spark.ml.regression.RandomForestRegressor

// Models hypoparameters
val numTrees = Seq(5,10,30)
val maxBins = Seq(50,100)
val maxDepth = Seq(2,3,5)
val impurity = Seq("gini","entropy","variance")
val featureSubsetStrategy = Seq("sqrt","onethird")

val rf = new RandomForestRegressor()
  .setLabelCol("Price")
  .setFeaturesCol("features")
  .setPredictionCol("prediction")


val rfParamMap = new ParamGridBuilder()
  .addGrid(rf.numTrees, numTrees)
  .addGrid(rf.maxDepth, maxDepth)
  .addGrid(rf.maxBins, maxBins)
  .addGrid(rf.featureSubsetStrategy, featureSubsetStrategy)
  .build()

val t0 = System.nanoTime()
val best_model = train_eval(rf, rfParamMap, train, test)
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0)/(1000000000) + " s")


#### Prediction

In [None]:
rfPredictions.withColumn("prediction", round($"prediction", 0)).select("Price","prediction").show()

// this will add new columns rawPrediction, probability and prediction
val predictionDf = randomForestModel.transform(testData)
predictionDf.show(10)

#### Tuning

#### Bias vs Variance Graph of Error (validation error and training error) versus training set size. 


<span style="color:red">
TO DO: 
produce graph -- validation error and training error should converge
</span>


### References

Apache Spark (n.d.). Spark ML Programming Guide. Retrieved from https://spark.apache.org/docs/1.2.2/ml-guide.html

Gorczynski M. (2017). Introduction to machine learning with spark and mllib (dataframe API). Retrieved from https://scalac.io/scala-spark-ml-machine-learning-introduction/

Hydrospheredata (2020). Program creek. Scala Code Examples. Scaler Retrieved from https://www.programcreek.com/scala/org.apache.spark.ml.feature.StandardScaler

Jen G. (2020) FeatureHasher. Retrieved from https://george-jen.gitbook.io/data-science-and-apache-spark/featurehasher

Johnson S (2019). From sckit-learn to Spark ML. Retrieved from https://towardsdatascience.com/from-scikit-learn-to-spark-ml-f2886fb46852

Johnson S (2019). Housing Prices - Spark ML Project Retrieved from https://github.com/scottdjohnson/HousingPricePredictions/blob/master/HousingPrices-SparkML.ipynb

Masri A. (2019). FeatureTransformation. Retrieved from
https://towardsdatascience.com/apache-spark-mllib-tutorial-7aba8a1dce6e

Sarkar A. (2017). Learning Spark SQL. Implementing a Spark ML clustering model. Packt Publishing.

Scala Doc (n.d.) Retrieved from https://docs.scala-lang.org


(2019) Random Forest Classifier with Apache Spark Retireved from https://medium.com/rahasak/random-forest-classifier-with-apache-spark-c63b4a23a7cc

Wagle M.(2020) _Predicting House Prices using Machine Learning_. Retrieved from https://medium.com/@manilwagle/predicting-house-prices-using-machine-learning-cab0b82cd3f