## Go (Even More) Hands On!

__You have already learned enough to do some SparkML modeling__

So let's dive into a real data set...

#### UCI ML Repository Combined Cycle Power Plant Data Set

Adapted from: https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant

The data consist of four hourly average ambient variables plus the target, the net hourly electrical energy output:

* Temperature (T; column `AT` in the data) in the range 1.81°C to 37.11°C
* Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
* Relative Humidity (RH) in the range 25.56% to 100.16%
* Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
* Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. 

The variables are given without normalization.

*Let's try to predict (model) the power output (PE) as a function of the other four variables.*

To make it easier to get started, especially since DBFS and the Spark DataFrame readers may be new to you, we'll walk through finding and ingesting the data:

In [4]:
%sh ls /dbfs/databricks-datasets/power-plant/data/

In [5]:
%sh head /dbfs/databricks-datasets/power-plant/data/Sheet1.tsv

In [6]:
spark.read.text("dbfs:/databricks-datasets/power-plant/data/").show(truncate=False)

In [7]:
data = spark.read.option("delimiter", "\t") \
          .option("header", True) \
          .option("inferSchema", True) \
          .csv("dbfs:/databricks-datasets/power-plant/data/")

data.show()

In [8]:
from pyspark.ml.feature import VectorAssembler

# Combine the four predictor columns into a single vector column named "features"
assembler = VectorAssembler(inputCols=["AT","V","AP","RH"], outputCol="features")

assembler.transform(data).show()
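
To make the `features` column less abstract: conceptually, `VectorAssembler` takes the values of the listed input columns, in the order given, and packs them into one vector per row. The plain-Python sketch below mimics that for a single row. The helper name `assemble` and the sample values are made up for illustration (chosen to fall within the ranges listed above), and Spark itself produces an ML `Vector` rather than a Python list.

```python
# Plain-Python illustration of what VectorAssembler does for a single row.
# `assemble` is a hypothetical helper, not part of PySpark.

def assemble(row, input_cols):
    """Return the values of input_cols, in order, as one feature list."""
    return [float(row[col]) for col in input_cols]

# Illustrative values only (within the documented ranges), not read from the file.
row = {"AT": 14.96, "V": 41.76, "AP": 1024.07, "RH": 73.17, "PE": 463.26}

# The label column PE is deliberately left out of the feature vector.
features = assemble(row, ["AT", "V", "AP", "RH"])
print(features)  # [14.96, 41.76, 1024.07, 73.17]
```

Note that the order of `inputCols` determines the order of the vector entries, which matters later when reading the model's coefficients.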

In [9]:
from pyspark.ml.regression import LinearRegression

# featuresCol defaults to "features", which matches the assembler's outputCol
lr = LinearRegression(labelCol="PE")

In [10]:
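# randomSplit treats the list as weights, so the 75/25 split sizes are approximate
train, test = data.randomSplit([0.75, 0.25])

# Fit on the assembled training rows
lrModel = lr.fit(assembler.transform(train))

# Score the held-out rows; transform() adds a "prediction" column
lrModel.transform(assembler.transform(test)).show()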
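
What the fitted model does with each assembled vector is simple arithmetic: a linear regression prediction is the intercept plus the dot product of the coefficients with the features. In PySpark, the fitted `LinearRegressionModel` exposes these as `lrModel.coefficients` and `lrModel.intercept`. The sketch below reproduces the calculation in plain Python; `predict` is a hypothetical helper and the numbers are placeholders, not fitted values.

```python
# Plain-Python sketch of a linear regression prediction.
# `predict` is a hypothetical helper, not part of PySpark.

def predict(features, coefficients, intercept):
    """Linear prediction: intercept + sum(coefficient_i * feature_i)."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Placeholder numbers chosen for easy arithmetic, NOT real fitted coefficients.
example = predict([1.0, 2.0], coefficients=[3.0, 4.0], intercept=5.0)
print(example)  # 5 + 3*1 + 4*2 = 16.0
```

With a real model, applying the same arithmetic to a row's `features` vector with `lrModel.coefficients` and `lrModel.intercept` should agree with the `prediction` column up to floating-point rounding.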

In [11]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[assembler, lr])


In [12]:
model = pipeline.fit(train)

In [13]:
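# The last pipeline stage is the fitted LinearRegressionModel; its summary reports training-set metrics
summary = model.stages[-1].summary
print(summary.r2)
print(summary.rootMeanSquaredError)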

In [14]:
from pyspark.ml.evaluation import RegressionEvaluator

# The default metric is RMSE (metricName="rmse")
evaluator = RegressionEvaluator(labelCol="PE", predictionCol="prediction")

In [15]:
predictions = model.transform(test)
evaluator.evaluate(predictions)
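
By default, `RegressionEvaluator` reports root mean squared error (RMSE), which is in the same units as the label (MW here); the training summary in the earlier cell also printed R². As a reminder of what those two numbers measure, here is a small plain-Python sketch. The function names and toy values are assumptions for illustration, not output from the power plant model.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Toy values, not real model output: only the last prediction is off (by 2).
labels = [0.0, 0.0, 4.0, 4.0]
predictions = [0.0, 0.0, 4.0, 6.0]

print(rmse(labels, predictions))  # 1.0
print(r2(labels, predictions))    # 0.75
```

Lower RMSE is better, while R² closer to 1 is better. Comparing the test-set RMSE with the training-set value from the summary is a quick check for overfitting.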

In [16]:
data.describe().show()

In [17]:
# basic steps are...

# split off a test set

# assemble predictor columns into a vector column

# fit a LinearRegression to the training set data frame

# see how well it worked using an Evaluator and the test data