ScaDaMaLe Course
[site](https://lamastex.github.io/scalable-data-science/sds/3/x/) and
[book](https://lamastex.github.io/ScaDaMaLe/index.html)

### Power Plant ML Pipeline Application - DataFrame Part

This is the Spark SQL part of an end-to-end example of using a number
of different machine learning algorithms to solve a supervised
regression problem.

This is a breakdown of the *Power Plant ML Pipeline Application* from
Databricks.

**This will be a recurring example in the sequel**

##### Table of Contents

-   **Step 1: Business Understanding**
-   **Step 2: Load Your Data**
-   **Step 3: Explore Your Data**
-   **Step 4: Visualize Your Data**
-   *Step 5: Data Preparation*
-   *Step 6: Data Modeling*
-   *Step 7: Tuning and Evaluation*
-   *Step 8: Deployment*

*We are trying to predict power output given a set of readings from
various sensors in a gas-fired power generation plant. Power generation
is a complex process, and understanding and predicting power output is
an important element in managing a plant and its connection to the power
grid.*

-   Given this business problem, we need to translate it to a Machine
    Learning task (actually a *Statistical* Machine Learning task).  
-   The ML task here is *regression* since the label (or target) we will
    be trying to predict takes a *continuous numeric* value
    -   Note: if the labels took values from a finite discrete set, such
        as, `Spam`/`Not-Spam` or `Good`/`Bad`/`Ugly`, then the ML task
        would be *classification*.

**Today, we will only cover Steps 1, 2, 3 and 4 above**. You need
introductions to linear algebra, stochastic gradient descent and
decision trees before we can accomplish the **applied ML task** with
some intuitive understanding. If you can't wait for ML then **check out
the [Spark MLlib Programming
Guide](https://spark.apache.org/docs/latest/mllib-guide.html) for
coming attractions!**

The example data is provided by UCI at [UCI Machine Learning Repository
Combined Cycle Power Plant Data
Set](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant)

You can read the background on the UCI page, but in summary:

-   we have collected a number of readings from sensors at a Gas Fired
    Power Plant (also called a Peaker Plant) and
-   want to use those sensor readings to predict how much power the
    plant will generate in a couple weeks from now.
-   Again, today we will just focus on Steps 1-4 above that pertain to
    DataFrames.

More information about Peaker or Peaking Power Plants can be found on
Wikipedia:
<https://en.wikipedia.org/wiki/Peaking_power_plant>.

In [None]:
//This allows easy embedding of publicly available information into any other notebook
//when viewing in git-book just ignore this block - you may have to manually chase the URL in frameIt("URL").
//Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))
def frameIt( u:String, h:Int ) : String = {
      """<iframe 
 src=""""+ u+""""
 width="95%" height="""" + h + """"
 sandbox>
  <p>
    <a href="http://spark.apache.org/docs/latest/index.html">
      Fallback link for browsers that, unlikely, don't support frames
    </a>
  </p>
</iframe>"""
   }
displayHTML(frameIt("https://en.wikipedia.org/wiki/Peaking_power_plant",300))

In [None]:
displayHTML(frameIt("https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant",500))

In [None]:
sc.version.replace(".", "").toInt

In [None]:
// a good habit: check that the notebook is attached to a suitable version of Spark - this first check only asks for 1.4.0+, but the SparkSession object spark used further down needs Spark 2.x
require(sc.version.replace(".", "").toInt >= 140, "Spark 1.4.0+ is required to run this notebook. Please attach it to a Spark 1.4.0+ cluster.")
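
The check above concatenates the digits of the version string. That throws a `NumberFormatException` on suffixes such as `-SNAPSHOT` and can miscompare versions with extra components. A minimal sketch of a more forgiving comparison follows; `isAtLeast` is a hypothetical helper, not part of Spark, and it only looks at the leading major and minor numbers.

```scala
// Hypothetical helper (not part of Spark): compare the leading major.minor
// of a version string such as sc.version against a required minimum.
def isAtLeast(version: String, major: Int, minor: Int): Boolean = {
  // Keep only the leading digits of each dot-separated part, so "0-SNAPSHOT" becomes "0".
  val parts = version.split("\\.").toList.map(_.takeWhile(_.isDigit))
  val actualMajor = parts.headOption.filter(_.nonEmpty).map(_.toInt).getOrElse(0)
  val actualMinor = parts.lift(1).filter(_.nonEmpty).map(_.toInt).getOrElse(0)
  actualMajor > major || (actualMajor == major && actualMinor >= minor)
}

// Possible use in the notebook:
// require(isAtLeast(sc.version, 2, 2), "Spark 2.2.0+ is required to run this notebook.")
```

Comparing the parts as numbers rather than as one concatenated integer also keeps a version like `1.6.0.1` from passing a `2.2` requirement.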

  

Step 1: Business Understanding
------------------------------

The first step in any machine learning task is to understand the
business need.

As described in the overview we are trying to predict power output given
a set of readings from various sensors in a gas-fired power generation
plant.

The problem is a regression problem since the label (or target) we are
trying to predict is numeric.

Step 2: Load Your Data
----------------------

Now that we understand what we are trying to do, we need to load our
data and describe it, explore it and verify it.

The data has already been downloaded as these five tab-separated values
(TSV) files.

In [None]:
display(dbutils.fs.ls("/databricks-datasets/power-plant/data")) // Ctrl+Enter

  

Now let us load the data from the tab-separated values (TSV) text
file into an `RDD[String]` using the familiar `textFile` method.

In [None]:
val powerPlantRDD = sc.textFile("/databricks-datasets/power-plant/data/Sheet1.tsv") // Ctrl+Enter

In [None]:
powerPlantRDD.take(5).foreach(println) // Ctrl+Enter to print first 5 lines
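
Each element of the `RDD[String]` is one raw line of text, so the numbers still have to be parsed. As a rough sketch of what that involves, the following plain Scala turns a tab-separated line into a typed record. `PowerPlantRow` and `parseLine` are hypothetical names, the column order AT, V, AP, RH, PE is an assumption taken from the schema described later, and the sample values are made up.

```scala
import scala.util.Try

// Hypothetical record type mirroring the five columns AT, V, AP, RH, PE.
case class PowerPlantRow(AT: Double, V: Double, AP: Double, RH: Double, PE: Double)

// Parse one tab-separated line; returns None for the header or for malformed lines.
def parseLine(line: String): Option[PowerPlantRow] =
  line.split("\t", -1).map(_.trim) match {
    case Array(at, v, ap, rh, pe) =>
      Try(PowerPlantRow(at.toDouble, v.toDouble, ap.toDouble, rh.toDouble, pe.toDouble)).toOption
    case _ => None
  }

// Illustrative inputs (made-up values, not quoted from Sheet1.tsv).
val header = "AT\tV\tAP\tRH\tPE"
val sample = "20.0\t50.0\t1010.0\t70.0\t450.0"
println(parseLine(header)) // None
println(parseLine(sample)) // Some(PowerPlantRow(20.0,50.0,1010.0,70.0,450.0))
```

In the notebook, something like `powerPlantRDD.flatMap(parseLine)` should then give an RDD of typed rows, although the DataFrame reader used next does this work for us.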

In [None]:
// let us make sure we are using Spark 2.2 or later - SparkSession (the spark object) needs Spark 2.0+, while older versions only offer SQLContext
require(sc.version.replace(".", "").toInt >= 220, "Spark 2.2.0+ is required to run this notebook. Please attach it to a Spark 2.2.0+ cluster.")

In [None]:
// this reads the tsv file and turns it into a dataframe
val powerPlantDF = spark.read // use 'sqlContext.read' instead if you want to use older Spark version > 1.3  see 008_ notebook
    .format("csv") // use the CSV data source, which is built into Spark 2.0+
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .option("delimiter", "\t") // Specify the delimiter as Tab or '\t'
    .load("/databricks-datasets/power-plant/data/Sheet1.tsv")

In [None]:
powerPlantDF.printSchema // print the schema of the DataFrame that was inferred
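
Schema inference needs an extra pass over the data. When the column types are known in advance, the schema can be supplied explicitly instead. The sketch below assumes that all five columns (AT, V, AP, RH, PE) hold doubles; `powerPlantSchema` and `readPowerPlant` are hypothetical names introduced here for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Assumption: the five columns are AT, V, AP, RH, PE and all hold doubles.
val powerPlantSchema = StructType(
  Seq("AT", "V", "AP", "RH", "PE").map(name => StructField(name, DoubleType, nullable = true))
)

// Read a tab-separated file with the explicit schema instead of inferSchema.
def readPowerPlant(spark: SparkSession, path: String): DataFrame =
  spark.read
    .format("csv")
    .option("header", "true")  // still skip the header line
    .option("delimiter", "\t") // tab-separated values
    .schema(powerPlantSchema)  // column names and types come from the schema
    .load(path)

// Possible use in the notebook:
// val typedDF = readPowerPlant(spark, "/databricks-datasets/power-plant/data/Sheet1.tsv")
```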

In [None]:
powerPlantDF.count

  

### 2.1. Alternatively, load data via the upload GUI feature in databricks

USE THIS FOR OTHER SMALLish DataSets you want to import to your CE
------------------------------------------------------------------

Since the dataset is relatively small, we will use the upload feature in
Databricks to upload the data as a table.

First download the Data Folder from [UCI Machine Learning Repository
Combined Cycle Power Plant Data
Set](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant)

The file is a multi-tab Excel document so you will need to save each tab
as a Text file export.

I prefer exporting as a Tab-Separated-Values (TSV) since it is more
consistent than CSV.

Call each file Folds5x2\_pp&lt;Sheet 1..5&gt;.tsv and save to your
machine.

Go to the Databricks Menu &gt; Tables &gt; Create Table

Select Datasource as "File"

Upload *ALL* 5 files at once.

See screenshots below (but refer to
<https://docs.databricks.com/user-guide/importing-data.html> for the latest
methods to import data):

**2.1.1. Create Table**

When you import your data, name your table `power_plant`, specify all of
the columns with the datatype `Double` and make sure you check the
`First row is header` box.

![alt
text](http://training.databricks.com/databricks_guide/1_4_ML_Power_Plant_Import_Table.png)

**2.1.2. Review Schema**

Your table schema and preview should look like this after you click
`Create Table`:

![alt
text](http://training.databricks.com/databricks_guide/1_4_ML_Power_Plant_Import_Table_Schema.png)

Now that your data is loaded let's explore it.

Step 3: Explore Your Data
-------------------------

Now that our data is loaded, we need to describe it, explore it and
verify it.

#### Viewing the table as text

By using the `.show` method we can see some of the contents of the table in
plain text.

This works in pure Apache Spark, say in `Spark-Shell`, without any
notebook layer on top of Spark like Databricks, Zeppelin or Jupyter.

It is a good idea to use this method when possible.

In [None]:
powerPlantDF.show(10) // try putting 1000 here instead of 10

  

#### Viewing as DataFrame

Using `display` on the DataFrame is close to essential for a data
scientist who wants visual insights into all pair-wise relationships
between the several (3 to 6 or so) variables in question.

In [None]:
display(powerPlantDF) 

In [None]:
powerPlantDF.count() // count the number of rows in DF

  

#### Viewing as Table via SQL

Let us look at what tables are already available, as follows:

In [None]:
sqlContext.tables.show() // Ctrl+Enter to see available tables

  

We can also access the list of tables and databases using
`spark.catalog` methods as explained here:

-   <https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html>

In [None]:
spark.catalog.listTables.show(false)

In [None]:
spark.catalog.listDatabases.show(false)

  

We need to create a temporary view of the DataFrame as a table before
being able to access it via SQL.

In [None]:
powerPlantDF.createOrReplaceTempView("power_plant_table") // Shift+Enter

In [None]:
sqlContext.tables.show() 

  

Note that table names are in lower-case only!

**You Try!**

In [None]:
//sqlContext // uncomment and put . after sqlContext and hit Tab to see what methods are available

In [None]:
//sqlContext.dropTempTable("power_plant_table") // uncomment and Ctrl+Enter if you want to remove the table!

  

The following SQL statement simply selects all the columns (due to `*`)
from `power_plant_table`.

In [None]:
-- Ctrl+Enter to query the rows via SQL
SELECT * FROM power_plant_table
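
The same query can also be issued from Scala through `spark.sql`, which returns an ordinary DataFrame. The following is a small sketch; `selectAll` is a hypothetical helper, and the view name is interpolated directly into the SQL text, so it should only be given trusted names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Run "SELECT *" against a registered table or temporary view from Scala.
def selectAll(spark: SparkSession, viewName: String): DataFrame =
  spark.sql(s"SELECT * FROM $viewName")

// Possible use in the notebook:
// display(selectAll(spark, "power_plant_table"))
// spark.catalog.dropTempView("power_plant_table") // removes the temporary view again
```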

  

Note that the output of the above command is the same as
`display(powerPlantDF)` we did earlier.

We can use the SQL `desc` command to describe the schema. This is the
SQL equivalent of `powerPlantDF.printSchema` we saw earlier.

In [None]:
desc power_plant_table

  

**Schema Definition**

Our schema definition from UCI appears below:

-   AT = Atmospheric Temperature in °C
-   V = Exhaust Vacuum Speed
-   AP = Atmospheric Pressure
-   RH = Relative Humidity
-   PE = Power Output

PE is our label or target. This is the value we are trying to predict
given the measurements.

*Reference [UCI Machine Learning Repository Combined Cycle Power Plant
Data
Set](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant)*

Let's do some basic statistical analysis of all the columns.

We can use the describe function with no parameters to get some basic
stats for each column like count, mean, max, min and standard deviation.
More information can be found in the [Spark API
docs](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame)

In [None]:
display(powerPlantDF.describe())
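
To make the five numbers reported per column concrete, here is a plain Scala sketch that computes them for a small sequence. `Summary` and `summarize` are hypothetical names, the input values are made up, and the standard deviation is the sample version (dividing by n - 1), which is how Spark's `describe` computes it.

```scala
// Hypothetical helper: the five statistics that describe reports, for one column.
case class Summary(count: Long, mean: Double, stddev: Double, min: Double, max: Double)

def summarize(xs: Seq[Double]): Summary = {
  require(xs.nonEmpty, "need at least one value")
  val n = xs.size
  val mean = xs.sum / n
  // Sample standard deviation (divide by n - 1); NaN when there is a single value.
  val stddev =
    if (n > 1) math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (n - 1))
    else Double.NaN
  Summary(n.toLong, mean, stddev, xs.min, xs.max)
}

// Made-up readings, for illustration only.
println(summarize(Seq(1.0, 2.0, 3.0, 4.0, 5.0)))
```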

  

Step 4: Visualize Your Data
---------------------------

To understand our data, we will look for correlations between features
and the label. This can be important when choosing a model. E.g., if
features and a label are linearly correlated, a linear model like Linear
Regression can do well; if the relationship is very non-linear, more
complex models such as Decision Trees or neural networks can be better.
We use the Databricks built-in visualization to view each of our
predictors in relation to the label column as a scatter plot to see the
correlation between the predictors and the label.

In [None]:
select AT as Temperature, PE as Power from power_plant_table

  

From the above plot, it looks like there is strong linear correlation
between temperature and Power Output!
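
"Linear correlation" can be quantified with the Pearson correlation coefficient, which lies between -1 and 1 and is close to either end when the points fall near a straight line. The following is a minimal plain Scala sketch of that formula on made-up values; `pearson` is a hypothetical helper and the numbers are not taken from the dataset.

```scala
// Pearson correlation coefficient r of two equally long samples.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.size > 1, "need two samples of equal size > 1")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  val cov  = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
  val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
  cov / math.sqrt(varX * varY)
}

// Made-up values: y falls exactly linearly as x rises, so r is -1.
val x = Seq(10.0, 20.0, 30.0)
val y = Seq(480.0, 460.0, 440.0)
println(pearson(x, y)) // -1.0
```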

In [None]:
select V as ExhaustVacuum, PE as Power from power_plant_table;

  

The linear correlation is not as strong between Exhaust Vacuum Speed and
Power Output but there is some semblance of a pattern.

In [None]:
select AP as Pressure, PE as Power from power_plant_table;

In [None]:
select RH as Humidity, PE as Power from power_plant_table;

  

...and atmospheric pressure and relative humidity seem to have little to
no linear correlation.
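
The visual impressions can be backed by numbers, since Spark DataFrames expose a Pearson correlation through `stat.corr`. The sketch below wraps it in a hypothetical helper, `correlationsWithLabel`, that reports the correlation of each feature column with the label; the column names in the usage comment are those of the schema above.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: Pearson correlation of each feature column with the label column.
def correlationsWithLabel(df: DataFrame, features: Seq[String], label: String): Seq[(String, Double)] =
  features.map(feature => feature -> df.stat.corr(feature, label))

// Possible use in the notebook:
// correlationsWithLabel(powerPlantDF, Seq("AT", "V", "AP", "RH"), "PE").foreach(println)
```

Values near -1 or 1 would agree with the strong linear pattern seen in a scatter plot, while values near 0 would agree with the plots that show little linear structure.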

These pairwise plots can also be done directly using `display` on
`select`ed columns of the DataFrame `powerPlantDF`.

In general **we will shy away from SQL as much as possible** to focus on ML
pipelines written with DataFrames and DataSets with occasional
in-and-out of RDDs.

The illustrations in `%sql` above are mainly to reassure those with an
RDBMS and SQL background that their SQL can be used directly
in Apache Spark and in Databricks notebooks.

In [None]:
display(powerPlantDF.select($"RH", $"PE"))

  

Furthermore, you can interactively start playing with `display` on the
full DataFrame!

In [None]:
display(powerPlantDF) // just as we did for the diamonds dataset

  

We will do the following steps in the sequel.

-   *Step 5: Data Preparation*
-   *Step 6: Data Modeling*
-   *Step 7: Tuning and Evaluation*
-   *Step 8: Deployment*

Datasource References:

-   Pinar Tüfekci, Prediction of full load electrical power output of a
    base load operated combined cycle power plant using machine learning
    methods, International Journal of Electrical Power & Energy Systems,
    Volume 60, September 2014, Pages 126-140, ISSN 0142-0615, [Web
    Link](http://www.journals.elsevier.com/international-journal-of-electrical-power-and-energy-systems/)
-   Heysem Kaya, Pinar Tüfekci, Sadik Fikret Gürgen: Local and Global
    Learning Methods for Predicting Power of a Combined Gas & Steam
    Turbine, Proceedings of the International Conference on Emerging
    Trends in Computer and Electronics Engineering ICETCEE 2012, pp.
    13-18 (Mar. 2012, Dubai) [Web
    Link](http://www.cmpe.boun.edu.tr/~kaya/kaya2012gasturbine.pdf)

### We will continue later with ML pipelines to do prediction with a fitted model from this dataset