### Preparing HDFS
Using the `!` shell magic:

Create the target folder on HDFS if it does not already exist.

Copy the cleaned data from the local filesystem.

In [1]:
! hadoop fs -mkdir -p  /tmp/output
! hadoop fs -put   -p  ./../data-clean/cleanMelbourneData.csv  /tmp/output             

put: `/tmp/output/cleanMelbourneData.csv': File exists




In [73]:
// Load Clean Dataset into a DataFrame from HDFS after wrangling is completed
var df_clean = spark
    .read
    .format("csv")
    .option("header", "true")
    .load("hdfs://localhost:9000/tmp/output/*.csv")

df_clean: org.apache.spark.sql.DataFrame = [Price: string, MethodOfSale: string ... 11 more fields]


In [74]:
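// Cast the columns to numeric types: the CSV reader above loaded every column as a string.
// This cell can be removed once the wrangling output arrives with the correct types.
import org.apache.spark.sql.functions.col

df_clean = df_clean.withColumn("Price",col("Price").cast("Double"))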
    .withColumn("Rooms",col("Rooms").cast("Int"))
    .withColumn("DistanceFromCBD",col("DistanceFromCBD").cast("Double"))
    .withColumn("MethodOfSale",col("MethodOfSale").cast("Int"))
    .withColumn("PropertyType",col("PropertyType").cast("Int"))
    .withColumn("Bathroom",col("Bathroom").cast("Int"))
    .withColumn("Car",col("Car").cast("Int"))
    .withColumn("Landsize",col("Landsize").cast("Double"))
    .withColumn("Latitude",col("Latitude").cast("Double"))
    .withColumn("Longtitude",col("Longtitude").cast("Double"))


df_clean: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]


In [75]:
df_clean.printSchema()

root
 |-- Price: double (nullable = true)
 |-- MethodOfSale: integer (nullable = true)
 |-- PropertyType: integer (nullable = true)
 |-- DistanceFromCBD: double (nullable = true)
 |-- Rooms: integer (nullable = true)
 |-- Bathroom: integer (nullable = true)
 |-- Car: integer (nullable = true)
 |-- Landsize: double (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longtitude: double (nullable = true)
 |-- Suburb: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- StreetName: string (nullable = true)



### Change attributes into vectors
#### Transform Sale Date into a numeric value

In [107]:
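// Convert the sale date to epoch seconds so it can be used as a numeric feature.
// "MM" is month-of-year; lowercase "mm" would be parsed as minutes.
df_clean = df_clean.withColumn("Date",unix_timestamp($"Date", "dd/MM/yyyy"))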

df_clean: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]
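The pattern letters in the date format matter: in the Java date-format conventions, `MM` is month-of-year and `mm` is minute-of-hour. The following is a minimal sketch in plain Scala (no Spark needed) that parses the same text with both patterns using `java.text.SimpleDateFormat`. The names `DatePatternDemo` and `monthAndMinute` are hypothetical and used only for illustration.

```scala
import java.text.SimpleDateFormat
import java.util.{Calendar, TimeZone}

object DatePatternDemo {
  private val utc = TimeZone.getTimeZone("UTC")

  // Parse `text` with `pattern` and return (month 1-12, minute) in UTC.
  def monthAndMinute(text: String, pattern: String): (Int, Int) = {
    val fmt = new SimpleDateFormat(pattern)
    fmt.setTimeZone(utc)
    val cal = Calendar.getInstance(utc)
    cal.setTime(fmt.parse(text))
    (cal.get(Calendar.MONTH) + 1, cal.get(Calendar.MINUTE))
  }

  def main(args: Array[String]): Unit = {
    // "mm" reads the middle field as minutes; the month falls back to January.
    println(monthAndMinute("03/12/2016", "dd/mm/yyyy")) // (1,12)
    // "MM" reads the middle field as the month.
    println(monthAndMinute("03/12/2016", "dd/MM/yyyy")) // (12,0)
  }
}
```

With the lowercase pattern every sale would land in January of its year, so the numeric date feature would carry the year and day but not the month.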


In [108]:
import org.apache.spark.ml.feature.{FeatureHasher,OneHotEncoder,StandardScaler,VectorAssembler}

// PK: refactored the items below to use only the FeatureHasher,
// which handles all the categorical columns and their one-hot-style encoding in a single call

//set the feature names
val featureColumnNames= Array("MethodOfSale",
            "PropertyType",
            "DistanceFromCBD",
            "Rooms",
            "Bathroom",
            "Car",
            "Landsize",
            "Latitude",
            "Longtitude",
            "Suburb",
            "Date",
            "StreetName")

//set the categorical names
val categoricalFeatureColumnNames= Array("MethodOfSale",
            "PropertyType",
            "Suburb",
            "StreetName")


val hasher = new FeatureHasher()
  .setInputCols(featureColumnNames)
  .setCategoricalCols(categoricalFeatureColumnNames)
  .setOutputCol("hashedFeatures")


import org.apache.spark.ml.feature.{FeatureHasher, OneHotEncoder, StandardScaler, VectorAssembler}
featureColumnNames: Array[String] = Array(MethodOfSale, PropertyType, DistanceFromCBD, Rooms, Bathroom, Car, Landsize, Latitude, Longtitude, Suburb, Date, StreetName)
categoricalFeatureColumnNames: Array[String] = Array(MethodOfSale, PropertyType, Suburb, StreetName)
hasher: org.apache.spark.ml.feature.FeatureHasher = featureHasher_949c013a48ce
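To make the hashing step less opaque, here is a rough sketch of the hashing trick in plain Scala. It follows the scheme described in the Spark `FeatureHasher` documentation (string columns are hashed as `column=value` with an indicator of 1.0, numeric columns are hashed by column name and keep their value), but it is an illustration only: it uses `scala.util.hashing.MurmurHash3`, so the bucket indices will not match the ones Spark produces, and `HashingTrickDemo`, `bucket` and `hashRow` are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object HashingTrickDemo {
  // Map a feature name to a bucket in [0, numFeatures).
  def bucket(name: String, numFeatures: Int): Int = {
    val h = MurmurHash3.stringHash(name) % numFeatures
    if (h < 0) h + numFeatures else h
  }

  // Build a sparse row (index -> value) from categorical and numeric columns.
  def hashRow(categorical: Map[String, String],
              numeric: Map[String, Double],
              numFeatures: Int): Map[Int, Double] = {
    // Categorical: indicator 1.0 at hash("column=value").
    val cat = categorical.toSeq.map { case (c, v) => bucket(s"$c=$v", numFeatures) -> 1.0 }
    // Numeric: the raw value at hash("column").
    val num = numeric.toSeq.map { case (c, v) => bucket(c, numFeatures) -> v }
    // Entries that collide in the same bucket are summed.
    (cat ++ num).groupBy(_._1).map { case (i, kv) => i -> kv.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val row = hashRow(Map("Suburb" -> "Richmond"), Map("Rooms" -> 3.0), 262144)
    println(row)
  }
}
```

The design trade-off is that no vocabulary has to be fitted, at the cost of a very wide vector (262144 slots by default) and occasional collisions between unrelated features.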


In [109]:
// This step still hangs.
// Likely cause: the hashed features form a very wide sparse vector, and setWithMean(true)
// makes StandardScaler build a dense output vector for every row.
// This may be a disadvantage of hashing the features compared with how Irina assembled them.

var scaler = new StandardScaler()
      .setInputCol("hashedFeatures")
      .setOutputCol("features")
      .setWithStd(true).setWithMean(true)


scaler: org.apache.spark.ml.feature.StandardScaler = stdScal_2548dbf07f2d
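A back-of-envelope sketch of why mean-centring is costly on hashed features, assuming 8 bytes per double and ignoring JVM overhead. The row count 12,524 is the training-set size reported by `train.count()` in this notebook, and 262,144 is the hashed vector size shown in the coefficient output. `DenseCostDemo`, `denseBytes` and `centre` are hypothetical names for illustration.

```scala
object DenseCostDemo {
  // Bytes needed to hold `rows` dense vectors of `dim` doubles (8 bytes each).
  def denseBytes(rows: Long, dim: Long): Long = rows * dim * 8L

  // Centre a sparse row (index -> value) against per-feature means.
  // Every slot whose mean is non-zero becomes non-zero, so the result is dense.
  def centre(sparseRow: Map[Int, Double], means: Array[Double]): Array[Double] =
    means.indices.map(i => sparseRow.getOrElse(i, 0.0) - means(i)).toArray

  def main(args: Array[String]): Unit = {
    val gib = denseBytes(12524L, 262144L) / math.pow(1024, 3)
    println(f"Dense training matrix: about $gib%.1f GiB")
    // One stored entry in, three non-zero entries out.
    println(centre(Map(1 -> 2.0), Array(0.5, 1.0, 0.25)).mkString(", "))
  }
}
```

At 2 MiB per row, the dense form of the training set comes to roughly 24 GiB, which is consistent with the scaler appearing to hang. Scaling by standard deviation only (without centring) keeps zeros as zeros and so preserves sparsity.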


### Split Data into a Training and a Testing Set

In [110]:
// Define a function that splits a DataFrame into training and test samples.
// Uses a stratified sample on PropertyType: prices vary widely by property type,
// so a plain random sample may have been skewing the model.
// Note: the two sampleBy calls draw independently, so train and test can share rows.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._


def train_test_split(data: DataFrame) = {
    
    val train_fractions = Map(1 -> 0.8,2 ->0.8, 3 -> 0.8)
    val test_fractions = Map(1 -> 0.2,2 ->0.2, 3 -> 0.2)

    
    val train = data.stat.sampleBy("PropertyType",train_fractions,36L)
    val test = data.stat.sampleBy("PropertyType",test_fractions,1L)
    
    //random sampler
    //val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 30)
    
    (train, test)
}

val (train, test) = train_test_split(df_clean)
train.cache()
test.cache()

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
train_test_split: (data: org.apache.spark.sql.DataFrame)(org.apache.spark.sql.DataFrame, org.apache.spark.sql.DataFrame)
train: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]
test: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 11 more fields]
res56: test.type = [Price: double, MethodOfSale: int ... 11 more fields]


In [130]:
train.count()

res68: Long = 12524


In [131]:
test.count()

res69: Long = 3206
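The split above draws the training and test sets with two independent `sampleBy` calls. One alternative, sketched here on plain Scala collections rather than DataFrames, is to shuffle each stratum once and cut it at the 80% mark, so that the two sets cannot overlap and together cover every row. This is an illustration under those assumptions, not the notebook's method; `StratifiedSplitDemo` and `split` are hypothetical names.

```scala
import scala.util.Random

object StratifiedSplitDemo {
  // Split rows into disjoint train/test sets, keeping `trainFraction`
  // of each stratum (as defined by `stratum`) in the training set.
  def split[A, K](rows: Seq[A], trainFraction: Double, seed: Long)
                 (stratum: A => K): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    val parts = rows.groupBy(stratum).values.toSeq.map { group =>
      val shuffled = rng.shuffle(group)
      val nTrain = math.round(group.size * trainFraction).toInt
      shuffled.splitAt(nTrain) // first part -> train, remainder -> test
    }
    (parts.flatMap(_._1), parts.flatMap(_._2))
  }

  def main(args: Array[String]): Unit = {
    // Toy rows: (id, propertyType) with property types 1, 2 and 3.
    val rows = (1 to 100).map(id => (id, id % 3 + 1))
    val (train, test) = split(rows, 0.8, seed = 36L)(_._2)
    println(s"train=${train.size} test=${test.size} overlap=${train.intersect(test).size}")
  }
}
```

On a DataFrame the same idea would need a unique row identifier so that the test set can be defined as "everything not in train".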


### 1. Apply Linear Regression

#### Training


https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.LinearRegression

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.FeatureHasher

https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression

https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#regression-model-evaluation

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/DataFrameStatFunctions.html#sampleBy(java.lang.String,%20java.util.Map,%20long)








In [111]:
//imports for linear regression code in one spot
//to be cleaned
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.Predictor
import org.apache.spark.ml.PredictionModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.Pipeline



import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.Predictor
import org.apache.spark.ml.PredictionModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.Pipeline


In [112]:

// Baseline linear regression with no tuned regularization parameters
val lr = new LinearRegression()
    .setLabelCol("Price")
    .setFeaturesCol("hashedFeatures")
    .setPredictionCol("Predicted Price")
    .setMaxIter(50)



lr: org.apache.spark.ml.regression.LinearRegression = linReg_8eb0d0ecaaa8


In [113]:

val startTimeMillis = System.currentTimeMillis()

// Featurize the training set to fit the model and print its summary
val featurized = hasher.transform(train)
val lrModel = lr.fit(featurized)

// Print the coefficients and intercept for linear regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// Summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.summary

println("\n")

println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
println("\n")

trainingSummary.residuals.show()

println("\n")
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
println("\n")

//print runtime
val endTimeMillis = System.currentTimeMillis()
val durationSeconds = (endTimeMillis - startTimeMillis) / 1000
print("Model was executed "+durationSeconds)


Coefficients: (262144,[23,38,54,85,117,200,209,238,240,246,...],[...])

[Output truncated: the coefficients are a sparse vector of size 262144, one slot per hashed feature; the remaining indices and values are cut off in the captured output.]

+-------------------+
|          residuals|
+-------------------+
| 124643.22977177799|
| -4320.831734955311|
|  45852.67703692615|
| 225754.79120358825|
|-18257.075094625354|
|  444561.2278931886|
|  30402.38388106227|
|  273434.4455059469|
| 359521.61268754303|
|  35907.54964736104|
|-163877.12408967316|
|-360812.05324706435|
| -264993.1869752407|
| -90918.73420293629|
| 122154.03254805505|
|  354102.4649517834|
| 271414.07931022346|
| -603138.3618478626|
|  79914.57188279927|
|-273378.59052853286|
+-------------------+
only showing top 20 rows



RMSE: 263870.6324993336
r2: 0.8404008764684372


Model was executed 4

startTimeMillis: Long = 1591087889892
featurized: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 12 more fields]
lrModel: org.apache.spark.ml.regression.LinearRegressionModel = linReg_8eb0d0ecaaa8
trainingSummary: org.apache.spark.ml.regression.LinearRegressionTrainingSummary = org.apache.spark.ml.regression.LinearRegressionTrainingSummary@464732c8
endTimeMillis: Long = 1591087894030
durationSeconds: Long = 4


In [122]:
// Define a helper that scores predictions against the actual price for a given metric

def evaluate ( predictions: DataFrame, metric: String) = {
    val eval =  new RegressionEvaluator()
       .setLabelCol("Price")
       .setPredictionCol("Predicted Price")
       .setMetricName(metric)
    // Label the result with the metric that was actually computed (e.g. RMSE or R2)
    println(metric.toUpperCase() + " on test data = " + eval.evaluate(predictions))
    
}

evaluate: (predictions: org.apache.spark.sql.DataFrame, metric: String)Unit


In [123]:
// Create a CrossValidator on the LR model only.
// Retrieving the params was problematic when cross-validating the whole pipeline,
// and the pipeline has only one estimator, so this was easier than traversing the pipeline stages.

val startTimeMillis = System.currentTimeMillis()

// Featurize the training and test sets so the model can be run through the CrossValidator directly
val featurized_training = hasher.transform(train)
val featurized_test = hasher.transform(test)

val lr = new LinearRegression()
    .setLabelCol("Price")
    .setFeaturesCol("hashedFeatures")
    .setPredictionCol("Predicted Price")
    .setMaxIter(100)


// We use a ParamGridBuilder to construct a grid of parameters to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0,0.1,0.5,1))
  .addGrid(lr.elasticNetParam, Array(0,0.1,0.5,1))
  .build()

// We now treat the model as an Estimator, wrapping it in a CrossValidator instance.
// This allows us to choose parameters for the model.
// A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator()
    .setLabelCol("Price")
    .setPredictionCol("Predicted Price")
    .setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)  // 3-fold cross-validation
  .setParallelism(2)  // evaluate up to 2 candidate models in parallel

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(featurized_training)


// Make predictions on the test rows. cvModel uses the best model found during cross-validation.
cvModel.transform(featurized_test)
  .select("Price", "Predicted Price")
  .show()


//print runtime
val endTimeMillis = System.currentTimeMillis()
val durationSeconds = (endTimeMillis - startTimeMillis) / 1000
print("Model was executed "+durationSeconds)


+---------+------------------+
|    Price|   Predicted Price|
+---------+------------------+
|1480000.0|1369526.1622787267|
|1465000.0|1354873.2180548385|
|1876000.0|1436733.7498542443|
|1097000.0|1066517.5303966776|
| 955000.0| 1320762.935670115|
| 890000.0|1018575.7513571382|
|1135000.0|1200742.7617844567|
|1290000.0|1896127.2463711426|
|1290000.0|1213006.3669830933|
|1195000.0|1467259.5847418606|
|1030000.0|1026758.0009696409|
| 700000.0| 938885.1424624473|
| 785000.0| 898991.2641843781|
| 725000.0|1671942.9998553544|
| 515000.0|251047.99763536453|
| 440000.0|105952.36475855857|
| 830000.0| 839097.2289405391|
| 670000.0|  698949.629578501|
| 805000.0| 762312.6240490228|
| 510000.0|157605.34881892055|
+---------+------------------+
only showing top 20 rows

Model was executed 311

startTimeMillis: Long = 1591088802701
featurized_training: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 12 more fields]
featurized_test: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 12 more fields]
lr: org.apache.spark.ml.regression.LinearRegression = linReg_db13a97d237e
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	linReg_db13a97d237e-elasticNetParam: 0.0,
	linReg_db13a97d237e-regParam: 0.0
}, {
	linReg_db13a97d237e-elasticNetParam: 0.0,
	linReg_db13a97d237e-regParam: 0.1
}, {
	linReg_db13a97d237e-elasticNetParam: 0.0,
	linReg_db13a97d237e-regParam: 0.5
}, {
	linReg_db13a97d237e-elasticNetParam: 0.0,
	linReg_db13a97d237e-regParam: 1.0
}, {
	linReg_db13a97d237e-elasticNetParam: 0.1,
	linReg_db13a97d237e-regParam: 0.0
...
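The long runtime of this cell is easier to understand once the number of model fits is counted. The sketch below, in plain Scala, enumerates the same 4 x 4 grid and counts the fits for 3 folds, assuming one fit per candidate per fold plus a final refit of the winning candidate on the full training set. `GridSearchDemo`, `grid` and `totalFits` are hypothetical names for illustration.

```scala
object GridSearchDemo {
  // Cartesian product of the two hyperparameter lists: one pair per candidate model.
  def grid(regParams: Seq[Double], elasticNetParams: Seq[Double]): Seq[(Double, Double)] =
    for (e <- elasticNetParams; r <- regParams) yield (r, e)

  // One fit per (candidate, fold), plus one final refit of the best candidate.
  def totalFits(candidates: Int, folds: Int): Int = candidates * folds + 1

  def main(args: Array[String]): Unit = {
    val values = Seq(0.0, 0.1, 0.5, 1.0)
    val candidates = grid(values, values)
    println(s"candidates=${candidates.size} fits=${totalFits(candidates.size, 3)}")
  }
}
```

Every added grid value or fold multiplies the work, so a coarse grid is a reasonable first pass before refining around the best candidate.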

In [124]:
//print out the params used for the best model
val bestModel = cvModel.bestModel
println(bestModel.extractParamMap)  

{
	linReg_db13a97d237e-aggregationDepth: 2,
	linReg_db13a97d237e-elasticNetParam: 1.0,
	linReg_db13a97d237e-epsilon: 1.35,
	linReg_db13a97d237e-featuresCol: hashedFeatures,
	linReg_db13a97d237e-fitIntercept: true,
	linReg_db13a97d237e-labelCol: Price,
	linReg_db13a97d237e-loss: squaredError,
	linReg_db13a97d237e-maxIter: 100,
	linReg_db13a97d237e-predictionCol: Predicted Price,
	linReg_db13a97d237e-regParam: 1.0,
	linReg_db13a97d237e-solver: auto,
	linReg_db13a97d237e-standardization: true,
	linReg_db13a97d237e-tol: 1.0E-6
}


bestModel: org.apache.spark.ml.Model[_] = linReg_db13a97d237e
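To read the two tuned values: `regParam` sets the overall strength of the penalty and `elasticNetParam` (alpha) mixes L1 and L2, with alpha = 1.0 being a pure L1 (lasso) penalty and alpha = 0.0 a pure L2 (ridge) penalty. A minimal sketch of the penalty term as given in the Spark linear-methods documentation, written in plain Scala; `ElasticNetPenaltyDemo` and `penalty` are hypothetical names.

```scala
object ElasticNetPenaltyDemo {
  // penalty = regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
  def penalty(w: Seq[Double], regParam: Double, alpha: Double): Double = {
    val l1 = w.map(x => math.abs(x)).sum
    val l2Squared = w.map(x => x * x).sum
    regParam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2Squared)
  }

  def main(args: Array[String]): Unit = {
    val w = Seq(3.0, -4.0)
    println(penalty(w, regParam = 1.0, alpha = 1.0)) // pure L1: 7.0
    println(penalty(w, regParam = 1.0, alpha = 0.0)) // pure L2: 12.5
  }
}
```

An L1 penalty tends to drive many coefficients to exactly zero, which can be useful with a very wide hashed feature vector.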


In [125]:
// Define the final model. Cross-validation above selected regParam = 1.0 and
// elasticNetParam = 1.0, whereas this cell sets both to 0.5.
val bestlr = new LinearRegression()
    .setLabelCol("Price")
    .setFeaturesCol("hashedFeatures")
    .setPredictionCol("Predicted Price")
    .setMaxIter(100)
    .setRegParam(0.5)
    .setElasticNetParam(0.5)


bestlr: org.apache.spark.ml.regression.LinearRegression = linReg_33414493d48d


In [126]:
// add linear regression to stages
val lrStages = Array(
            hasher,
            //scaler,
            bestlr
)

lrStages: Array[org.apache.spark.ml.PipelineStage with org.apache.spark.ml.util.DefaultParamsWritable{def copy(extra: org.apache.spark.ml.param.ParamMap): org.apache.spark.ml.PipelineStage with org.apache.spark.ml.util.DefaultParamsWritable{def copy(extra: org.apache.spark.ml.param.ParamMap): org.apache.spark.ml.PipelineStage with org.apache.spark.ml.util.DefaultParamsWritable}}] = Array(featureHasher_949c013a48ce, linReg_33414493d48d)


In [127]:
// Define a pipeline
val startTimeMillis = System.currentTimeMillis()

val lrPipe = new Pipeline().setStages(lrStages)

// Fit the pipeline on the training DataFrame to generate a model
val lrModel = lrPipe.fit(train)


//Make predictions using the model and the test data
val predictions = lrModel.transform(test)

val endTimeMillis = System.currentTimeMillis()
val durationSeconds = (endTimeMillis - startTimeMillis) / 1000
print("pipeline was executed "+durationSeconds)


pipeline was executed 11

startTimeMillis: Long = 1591089115440
lrPipe: org.apache.spark.ml.Pipeline = pipeline_a0461b738a59
lrModel: org.apache.spark.ml.PipelineModel = pipeline_a0461b738a59
predictions: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: int ... 13 more fields]
endTimeMillis: Long = 1591089126580
durationSeconds: Long = 11


In [128]:
// Round the predicted price so it is easier to compare visually with the actual price
predictions.withColumn("Predicted Price", round($"Predicted Price", 0)).select("Price","Predicted Price").show()

+---------+---------------+
|    Price|Predicted Price|
+---------+---------------+
|1480000.0|      1369244.0|
|1465000.0|      1354288.0|
|1876000.0|      1437297.0|
|1097000.0|      1067049.0|
| 955000.0|      1321311.0|
| 890000.0|      1018996.0|
|1135000.0|      1201490.0|
|1290000.0|      1895986.0|
|1290000.0|      1213165.0|
|1195000.0|      1467420.0|
|1030000.0|      1027372.0|
| 700000.0|       938908.0|
| 785000.0|       897102.0|
| 725000.0|      1671212.0|
| 515000.0|       250236.0|
| 440000.0|       104906.0|
| 830000.0|       838576.0|
| 670000.0|       699074.0|
| 805000.0|       761578.0|
| 510000.0|       156872.0|
+---------+---------------+
only showing top 20 rows



#### Regression metrics

**Root mean squared error (RMSE)** -- the square root of the average squared difference between the predicted outcome and the true outcome, expressed in the same units as the price.

**R2 coefficient** -- the proportion of variance in the outcome that the model explains from its features.
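As a concrete reading of the two definitions, here is a minimal sketch in plain Scala that computes both metrics from paired actual and predicted values. It is an illustration of the standard formulas, not the `RegressionEvaluator` source; `RegressionMetricsDemo`, `rmse` and `r2` are hypothetical names, and the three sample rows are the first three rows of the predictions table above.

```scala
object RegressionMetricsDemo {
  // Root mean squared error: sqrt(mean((actual - predicted)^2)).
  def rmse(actual: Seq[Double], predicted: Seq[Double]): Double = {
    require(actual.nonEmpty && actual.size == predicted.size)
    val squaredErrors = actual.zip(predicted).map { case (y, p) => (y - p) * (y - p) }
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }

  // R2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
  def r2(actual: Seq[Double], predicted: Seq[Double]): Double = {
    require(actual.nonEmpty && actual.size == predicted.size)
    val mean = actual.sum / actual.size
    val ssRes = actual.zip(predicted).map { case (y, p) => (y - p) * (y - p) }.sum
    val ssTot = actual.map(y => (y - mean) * (y - mean)).sum
    1.0 - ssRes / ssTot
  }

  def main(args: Array[String]): Unit = {
    val actual = Seq(1480000.0, 1465000.0, 1876000.0)
    val predicted = Seq(1369244.0, 1354288.0, 1437297.0)
    println(s"RMSE = ${rmse(actual, predicted)}")
    println(s"R2   = ${r2(actual, predicted)}")
  }
}
```

R2 can be negative on a small or unrepresentative sample, which simply means the predictions are worse than always guessing the mean.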

In [129]:
evaluate(predictions,"rmse")
evaluate(predictions,"r2")


Root Mean Squared Error RMSE on test data = 315695.0402335918
Root Mean Squared Error R2 on test data = 0.7645970865560703


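Performance drops from the training set:

`RMSE: 263870.6324993336`
`r2: 0.8404008764684372`

to the test set with the best params:

`RMSE: 315689.4604548912`
`R2: 0.7646054077799563`

RMSE rises by roughly 20% and R2 falls from about 0.84 to about 0.76, so the model generalises somewhat worse than it fits the training data.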



### 2. Apply KNN

#### Training


Pipeline Estimator

#### Prediction

#### Testing/Evaluation

Pipeline Model Transformer

### 3. Apply Random Forest Regression

#### Training

Pipeline Estimator

#### Prediction

#### Testing/Evaluation

Pipeline Model Transformer

### References

Apache Spark (n.d.). _Spark ML Programming Guide._ Retrieved from https://spark.apache.org/docs/1.2.2/ml-guide.html

Hydrospheredata (n.d.). _org.apache.spark.ml.feature.StandardScaler Scala Examples._ Retrieved from https://www.programcreek.com/scala/org.apache.spark.ml.feature.StandardScaler

Johnson S. (2019). _From scikit-learn to Spark ML._ Retrieved from https://towardsdatascience.com/from-scikit-learn-to-spark-ml-f2886fb46852

Masri A. (2019). _FeatureTransformation._ Retrieved from https://towardsdatascience.com/apache-spark-mllib-tutorial-7aba8a1dce6e
