<div><img src="http://www.stevinsonauto.net/assets/Icon_Brake.png" width="270" height="270" align="right">

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/51/IBM_logo.svg/640px-IBM_logo.svg.png" width="90" height="90" align="right" style="margin:0px 25px"></div>

# Data Science for Automotive:  Classifying Brake Events

_________________

In this lab you will explore braking data to differentiate between driver profiles using Python and Apache Spark.  Along the way you'll learn how to manipulate, visualize, and ultimately model data sets in IBM Data Science Experience (DSX).


#### Table of Contents

1. Problem Statement

2. Load Data from IBM Object Storage

3. Exploratory Data Analysis with SparkSQL and PixieDust

4. Machine Learning with SparkML

5. Conclusion

______________________

### Problem Statement

The service bays at dealerships have seen an increase in warranty claims related to brakes. However, it may not always make sense to honor a warranty claim.  Sometimes the brake issue may be due to vehicle malfunction; other times it can be due to aggressive driving style.  

>Using historical telematics data of known driver types, can we classify the driving style of customers making warranty claims?  If so, we'll be able to provide a service to dealerships that allows them to classify the brake event and driver type for customers making a warranty claim.  

To do this you will need to analyze connected car data to discover the various patterns associated with different drivers.  *Please keep in mind that the hints often contain the solution to the exercise, so consider that before using them.*

__Note:__  Before we dive in, double click on this text.  You'll notice that the cell has changed and you should see some formatting syntax.  This is what a Markdown cell looks like before it is executed.  Now click on the play button in the menu bar above.  Voila!  Nicely rendered text.  The same thing happens whenever you accidentally double click on a Markdown cell, so just execute the cell again with the play button and it will render.  

Ok, let's get started!

_______

### Load Data

There are many ways to connect to data sources in DSX.  For this lab we will be using the "Files/Connections" panel located in the top right of your DSX console.  The icon for this panel is "10/01".  

After opening the panel click on the "Insert to code" button for the `historical_brake_events.csv` data asset.  You should see a drop down menu with several options.  **Click on the code cell below** and then insert a SparkSession DataFrame using the drop down menu.

In [None]:
## Insert SparkSession DataFrame in this cell using the Files/Connections feature in DSX


What's happening in this code cell?  First, we are importing the library `ibmos2spark` (IBM Object Storage to Spark) which allows us to take a file in Object Storage and immediately push it to Spark.  Then we have the credentials for our Object Storage instance defined for us.  This is needed to provide access to the data set.  At the end of the cell we define the name of our Spark DataFrame (default is df_data_1) and take the first couple records to inspect them.  

You can execute the code in this cell by pressing the "Play" button in the Notebook menu above.  The output will be displayed immediately below the code cell.

#### Check SparkSession

Before loading the data into Spark, a SparkSession had to be created.  Think of this as the gateway to many parts of the Spark API.  The default variable name for the SparkSession is simply `spark`.  Check the `spark` variable now and remember that it's there - this will be important later.

In [None]:
## Check SparkSession by typing `spark` on the next line then clicking the 'play' button in the menu bar


#### Rename the DataFrame

We will explore the data in just a moment, but first give the DataFrame a better name.  `df_data_1` is uninspired and doesn't help us keep track of what we're working with later.  The convention in Spark programming is to add a 'DF' at the end of the name to signify 'DataFrame'.
<br><br>
 <div class="panel-group" id="accordion-1">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-1" href="#collapse1-1">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-1" class="panel-collapse collapse">
      <div class="panel-body">
You can reassign a variable by using the `=` operator.  For example, the command "renamedDF = df_data_1" will let you work with the new variable `renamedDF`. </div>
    </div>
  </div>
</div>

In [None]:
## Give `df_data_1` a more appropriate name for this project - something that better reflects the data set


____________

### Exploratory Data Analysis

A preview of the data set was shown when we loaded it from Object Storage.  The format - a list of `Row` objects - was a bit difficult to interpret visually.  There are several methods to display the data in a tabular format.

`df_data_1.show()`

`df_data_1.toPandas()`

Try these now, but replace `df_data_1` with the name of your renamed DataFrame.
<br><br>

 <div class="panel-group" id="accordion-2">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-1" href="#collapse1-2">
        Advanced Optional</a>
      </h4>
    </div>
    <div id="collapse1-2" class="panel-collapse collapse">
      <div class="panel-body">You can limit the number of records shown in the `.show()` method by passing an additional parameter, `n = <int>`.  For example, `df_data_1.show(n = 10)` will display the first ten records.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-2" href="#collapse2-2">
        Advanced Optional 2</a>
      </h4>
    </div>
    <div id="collapse2-2" class="panel-collapse collapse">
      <div class="panel-body">Unlike `.show()`, the `.toPandas()` method doesn't take any additional parameters.  However, you can call the `.limit()` method before calling `.toPandas()` to limit the number of rows displayed. These commands can be chained using the '.' syntax.</div>
    </div>
  </div>
</div>

In [None]:
## Display your DataFrame as a table using one of the two methods above.


Take some time to look at the data and become acquainted with the various fields.  You can also check the schema by using the `.printSchema()` method on the DataFrame.

In [None]:
## Inspect the schema of your DataFrame



Notice anything strange?  The data types are all coded as 'string'.  Before we can compute any statistics or aggregations we'll need to cast these into the proper types.  Welcome to data analysis, where the data is dirty and the work is cut out for you! 

#### SparkSQL

There are different ways to do this in Spark but perhaps the simplest and most intuitive way is to use SparkSQL.  While not entirely ANSI compliant, SparkSQL is a powerful way for SQL developers to work on big data.  Let's start with the basics and then work toward changing the types.

##### Querying a Temporary View

In order to run queries in familiar SQL syntax you'll need to create a temporary view.  To do this, call the `.createOrReplaceTempView()` method on your existing DataFrame.  Note that the table name is one of the parameters and must be in quotes.
<br><br>
<div class="panel-group" id="accordion-1">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-32" href="#collapse1-32">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-32" class="panel-collapse collapse">
      <div class="panel-body">Try: df_data_1.createOrReplaceTempView("yourtablename"), but replace 'yourtablename' with what you want to call your table. </div>
    </div>
  </div>
</div>

In [None]:
## Create the temporary view here



Good.  Now we can access the SparkSession (remember?) and run queries against the temporary table.

The syntax for this is,

> `SparkSession.sql("*insert SQL statement here*")`

Keep in mind you are running queries against the **temp table** and not the DataFrame.  Try a simple `"SELECT * FROM yourtablename"` statement and see what is returned.  You may have to add a command that you've already learned to display the data. 

<br>
 <div class="panel-group" id="accordion-31">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-1" href="#collapse1-31">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-31" class="panel-collapse collapse">
      <div class="panel-body">Try chaining `.show()` or `.toPandas()` after your `.sql()` method.
      </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-31" href="#collapse2-31">
        Hint 2</a>
      </h4>
    </div>
    <div id="collapse2-31" class="panel-collapse collapse">
      <div class="panel-body">Try `spark.sql("SELECT * FROM tablenamehere").limit(5).toPandas()`</div>
    </div>
  </div>
</div>

In [None]:
## Try a simple select statement using the SparkSession and display the results



##### Casting Columns as Different Data Types

Now that you've built a gateway to writing SQL statements in Spark, it's time to cast columns into the correct type.  For example, `column1` can be cast as a floating point type using the following statement:

> `SELECT cast(column1 as float) FROM yourtablename`

In order to confirm that this worked you'll have to check the schema again.  
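
If you want to see the cast-then-check pattern in isolation first, here is a minimal sketch that does not need Spark at all.  It uses Python's built-in `sqlite3` module as a stand-in for SparkSQL, and the table name `brake_events` and the sample values are made up for illustration.  SQLite reports types as `text`, `real`, and `integer` rather than Spark's `string`, `float`, and `int`, but the idea is the same: cast the column, then inspect the type.

```python
import sqlite3

# Toy stand-in for the temporary view: every value is stored as text,
# mirroring the all-string schema described above.  The table name and
# the sample values are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE brake_events (brake_time_sec TEXT, abs_event TEXT)")
conn.executemany(
    "INSERT INTO brake_events VALUES (?, ?)",
    [("3.5", "0"), ("1.25", "1")],
)

# typeof() plays the role of checking the schema before and after the cast.
before = conn.execute(
    "SELECT typeof(brake_time_sec), typeof(abs_event) FROM brake_events LIMIT 1"
).fetchone()
after = conn.execute(
    """
    SELECT typeof(CAST(brake_time_sec AS float)),
           typeof(CAST(abs_event AS int))
    FROM brake_events
    LIMIT 1
    """
).fetchone()

print("before cast:", before)
print("after cast: ", after)
```

In the lab itself the equivalent check is the schema of the DataFrame returned by `spark.sql(...)`, not `typeof()`.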
<br>
<br>
 <div class="panel-group" id="accordion-5">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-5" href="#collapse1-5">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-5" class="panel-collapse collapse">
      <div class="panel-body">Chain `.printSchema()` onto `.sql()`.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-5" href="#collapse2-5">
        Advanced Optional</a>
      </h4>
    </div>
    <div id="collapse2-5" class="panel-collapse collapse">
      <div class="panel-body">Use <i>limit</i> (i.e. <i>limit 2</i>) within your SQL statement to limit the number of rows returned.   Use `.show()` to display the values in the console.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-5" href="#collapse3-5">
        Advanced Optional 2</a>
      </h4>
    </div>
    <div id="collapse3-5" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
      `spark.sql("SELECT cast(brake_time_sec as float) FROM yourtablename LIMIT 2").show()`<br></div>
    </div>
  </div>
</div> 


In [None]:
## Convert one column to float and then print the schema to confirm the type has changed



Excellent.  **Now cast every column whose values contain no letters to the proper type and store the result in a new DataFrame.**  Columns with decimal values should be `float`, while the others can be `int`. 
<br><br>
<div class="panel-group" id="accordion-51">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-51" href="#collapse1-51">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-51" class="panel-collapse collapse">
      <div class="panel-body">You can assign the output of a SparkSQL command to a variable in the same way you renamed your original DataFrame.  For example, 
      <br><br> `cleanTypesDF = spark.sql("...")` </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-52" href="#collapse1-52">
        Hint 2</a>
      </h4>
    </div>
    <div id="collapse1-52" class="panel-collapse collapse">
      <div class="panel-body">Try: 
      <br><br> `cleanTypesDF = spark.sql("SELECT cast(... as ...), cast(... as ...) FROM yourtablename")` </div>
    </div>
  </div>
  </div>



In [None]:
## Use SparkSQL to create a DataFrame with the proper types and assign it to a new variable.
## Note: If you wrap your SQL statement in triple quotes - """ """ - the string can span multiple lines, making your 
## statement easier to read.



## After changing the types and putting them in a new DF, print the schema again



Your schema should now have `VIN`, `type`, and `road_type` as strings; `brake_time_sec` and `brake_distance_ft` as floats, and the rest as integers.  Update the temporary view with the correct types and continue to the next step.

<br>
<div class="panel-group" id="accordion-21">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-21" href="#collapse1-21">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-21" class="panel-collapse collapse">
      <div class="panel-body">You can update the temporary view by using `.createOrReplaceTempView("yourtablename")` on your new DataFrame with the proper types.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-22" href="#collapse1-22">
        Hint 2</a>
      </h4>
    </div>
    <div id="collapse1-22" class="panel-collapse collapse">
      <div class="panel-body">Try: 
      <br><br> `cleanTypesDF.createOrReplaceTempView("tablenamehere")` </div>
    </div>
  </div>
  </div>

In [None]:
## Update the temporary view based on your cleanTypesDF



##### Summary Statistics

With the data converted to the proper type, we can now group and aggregate some of the fields to get a better understanding of what's going on.  

As an introduction to this technique let's start by taking a simple average of a few columns - `braking_score`, `travel_speed`, and `brake_distance_ft`.  This is done by including an aggregate function in your SQL statement.  For example, 

> `spark.sql("SELECT AVG(column1) as avg_column1 FROM yourtablename")`

Now it's your turn.  Try to get all averages returned in one statement.

<br>
<div class="panel-group" id="accordion-54">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-54" href="#collapse1-54">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-54" class="panel-collapse collapse">
      <div class="panel-body">Like selecting columns themselves, aggregate functions are separated by commas until you reach 'FROM ...'.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-55" href="#collapse1-55">
        Hint 2</a>
      </h4>
    </div>
    <div id="collapse1-55" class="panel-collapse collapse">
      <div class="panel-body">Try: 
      <br><br> `spark.sql("SELECT AVG(braking_score), AVG(travel_speed), AVG(brake_distance_ft) FROM yourtablename")` </div>
    </div>
  </div>
  </div>

In [None]:
## Get the averages here.



We can also get the min, max, count, and other values from each column using aggregate functions.  More importantly, to better understand the variation between different brake events, group the data by the `type` column and _then_ apply the aggregate functions.  This will provide a more meaningful result for our purposes.  

Try using the GROUP BY clause now to group the data by `type` and find the AVG `braking_score` and SUM of `abs_event`.
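
To see what grouping does to the shape of a result before you run it on the real data, here is a small sketch on a toy table.  It again uses Python's built-in `sqlite3` module rather than Spark, and the table name and every value in it are invented, so the numbers say nothing about the lab data.  The query returns one row per `type`, with the aggregates computed inside each group.

```python
import sqlite3

# Toy table with invented values; only the column names follow the lab's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE brake_events (type TEXT, braking_score INTEGER, abs_event INTEGER)"
)
conn.executemany(
    "INSERT INTO brake_events VALUES (?, ?, ?)",
    [
        ("quality", 90, 0),
        ("quality", 80, 0),
        ("aggressive", 40, 1),
        ("aggressive", 60, 0),
        ("distracted", 70, 1),
        ("distracted", 50, 1),
    ],
)

# One output row per distinct value of `type`; aliases give readable column names.
rows = conn.execute(
    """
    SELECT type,
           AVG(braking_score) AS avg_score,
           SUM(abs_event)     AS abs_events
    FROM brake_events
    GROUP BY type
    ORDER BY type
    """
).fetchall()

for row in rows:
    print(row)
```

Six input rows collapse into three output rows, one for each event type.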

<br>
<div class="panel-group" id="accordion-70">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-70" href="#collapse1-70">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-70" class="panel-collapse collapse">
      <div class="panel-body">'GROUP BY' statements are placed after specifying the table being queried. </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-71" href="#collapse1-71">
        Hint 2</a>
      </h4>
    </div>
    <div id="collapse1-71" class="panel-collapse collapse">
      <div class="panel-body">Try: 
      <br><br> `spark.sql("SELECT type, AVG(braking_score), SUM(abs_event) FROM yourtablename GROUP BY type").toPandas()`
 </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-72" href="#collapse1-72">
        Advanced Optional</a>
      </h4>
    </div>
    <div id="collapse1-72" class="panel-collapse collapse">
      <div class="panel-body">
      To improve the column names of your results, provide an alias for the aggregate fields.  You can do this by adding "... as avg_score," when selecting a column.  For example, <br> <br>
      `"SELECT AVG(column1) as avgValue FROM yourtablename"`
 </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-78" href="#collapse1-78">
        Advanced Optional 2</a>
      </h4>
    </div>
    <div id="collapse1-78" class="panel-collapse collapse">
      <div class="panel-body">Explore the dataset using other aggregate functions besides AVG and SUM.  What do you find?
 </div>
    </div>
  </div>
  </div>

In [None]:
## Select the average braking_score and sum of abs_events from your table, but make sure to group the data
## by type.



In [None]:
## Empty cell for Advanced Option 2 



What do you see?  What sort of conclusions would you draw from these summary statistics?

Besides having a brake event type (the `type` column) there is a `road_type` column as well.  **For the final exercise of this section, group the data by type _and_ road type, then compute the same statistics.  This time include average `brake_time_sec`, `brake_distance_ft` and `travel_speed`.**
<br><br>
<div class="panel-group" id="accordion-57">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-57" href="#collapse1-57">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-57" class="panel-collapse collapse">
      <div class="panel-body">You can group by multiple columns by separating them with a comma.  For example, <br><br>
      `"SELECT column1, column2, AVG(column3) FROM tablenamehere GROUP BY column1, column2"`</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-58" href="#collapse1-58">
        Advanced Option</a>
      </h4>
    </div>
    <div id="collapse1-58" class="panel-collapse collapse">
      <div class="panel-body">Try finding the sum of each `brake_pressure` field in your groups.  Does this tell you anything?
 </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-60" href="#collapse1-60">
        Advanced Option 2</a>
      </h4>
    </div>
    <div id="collapse1-60" class="panel-collapse collapse">
      <div class="panel-body">
      Sort your result by adding an `ORDER BY` clause after the `GROUP BY` clause.  You can pick `type` or `road_type`.
 </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-59" href="#collapse1-59">
        Advanced Option 3</a>
      </h4>
    </div>
    <div id="collapse1-59" class="panel-collapse collapse">
      <div class="panel-body">
      Create a new column in your data set (not the aggregated one) that displays the ratio between brake distance and brake time.  No hints here!
 </div>
    </div>
  </div>
  </div>

In [None]:
## Create a more complete summary stats table, this time with the data grouped by type and road type



In [None]:
## Empty cell for Advanced Option 3



This table has some really informative descriptive statistics in it. You should be able to draw some preliminary conclusions about how the different brake events express themselves in the various columns.  Save this aggregate table in a new DataFrame because you're going to use it for plotting.  Remember to leave out any display methods like `.show()` - you just want the DataFrame.

In [None]:
## Run the same command as before, but this time save it in a new DF, 'aggDF'.



________

#### Data Visualization with PixieDust

Python has capable plotting libraries, but they come with a fairly steep learning curve.  Luckily, some enterprising folks at IBM Watson Data Lab have created an easier alternative called PixieDust.  Even better, **IBM has open sourced this code and made the library available to everyone.**  The following cells are taken straight from the PixieDust tutorial and will make sure the library works in this Notebook.

Install PixieDust by accessing the UNIX system behind this Notebook with the '!' operator.  This will make sure the latest version is on the system.

In [None]:
## To confirm you have the latest version of PixieDust on your system, run this cell
!pip install --user --upgrade pixiedust

Now that you have PixieDust installed and up to date on your system, import the library into the Notebook.  Don't worry if you see a warning - as long as your version is greater than 1.0.2 you are good to go.  Also, now is a good time to import the `bokeh` and `seaborn` libraries because they are much prettier than the default `matplotlib`.  Thank me later.

In [None]:
import pixiedust
import bokeh
import seaborn

To create a visualization all you have to do is pass a DataFrame to the `display()` function and PixieDust will output an inline, interactive graphical display that you can configure based on what you want to see.  Here's the basic command:

> `display(myDF)`

It's that easy.  Try passing a DataFrame created by one of your queries to that function and see what happens!

> `display(spark.sql("..."))`

This will likely give you a table of the resulting DataFrame, but clicking on the 'Chart' icon in the top left corner allows you to select which variables and plot type to render.  Make sure you select 'Bokeh' as the rendering option on the right.

Remember that we want to understand how variables express themselves across event types.  Try to come up with some visualizations that give insight into the data.  Here's a tip: **Keys** are your X-axis and **Values** are your Y-axis.

<br>
 <div class="panel-group" id="accordion-62">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-1" href="#collapse1-62">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-62" class="panel-collapse collapse">
      <div class="panel-body">
      Try a scatter plot using `bokeh` with `brake_distance_ft` as value and `brake_time_sec` as key.  What happens if you color the data by type?  What pattern do you see?</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-62" href="#collapse2-62">
       Advanced Optional</a>
      </h4>
    </div>
    <div id="collapse2-62" class="panel-collapse collapse">
      <div class="panel-body">Discover the breakdown of ABS events by type.  What does this tell you?<br></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-62" href="#collapse3-62">
        Advanced Optional 2</a>
      </h4>
    </div>
    <div id="collapse3-62" class="panel-collapse collapse">
      <div class="panel-body">Try coming up with another creative way to visualize the data beyond bar and scatter plots.  Maybe try plotting some of the data from `aggDF`.  If you find something cool, share it with others!  <br></div>
    </div>
  </div>
</div> 

In [None]:
## Pass a DataFrame to the `display()` function from PixieDust here.



When you've finished exploring the data visually and have begun to draw some conclusions, record them in the Markdown cell immediately below this one.  Simply click inside the cell and you can edit the text inside.  You can write a paragraph or you can use the list format that I've prepared.  Similar to a code cell, you render the Markdown by pressing 'Play'.

##### Double Click Here to Add Your Conclusions

- Conclusion A:

- Conclusion B:

- Conclusion C:

_________
<br>
If you've made it this far there are several things that should have become clear from the data.  To avoid spoilers I have hidden the obvious conclusions in the 'spoiler' tab below.  Once you've thought about it feel free to open up the tab and compare notes.

<br>
 <div class="panel-group" id="accordion-66">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-1" href="#collapse1-66">
        Conclusions <b>(Spoilers!)</b></a>
      </h4>
    </div>
    <div id="collapse1-66" class="panel-collapse collapse">
      <div class="panel-body">Quality drivers have brake events with longer brake times, distances, and higher brake scores.  Aggressive drivers are the opposite.  Distracted drivers fall somewhere in between but have a noticeably higher number of ABS events.  These results should approximate what one would expect - distracted drivers probably slam on the brakes more often, aggressive drivers drive faster and brake more quickly, while quality drivers keep their distance and are gentler on the brake system.  </div>
    </div>
  </div>
</div>
Having arrived at these conclusions as a result of exploratory data analysis and data visualization, we can now proceed to modeling the relationship between brake event type and the rest of the variables.

_____

### Machine Learning

To make this exercise simpler, in this section you will simply need to specify the fields you want to include in your model and then execute the code cells as per the instructions. 

#### Classification vs. Regression

The way you frame a problem in machine learning often determines model choice, and many prediction problems can be characterized as either regression or classification.  When the variable you want to understand or predict is continuous or numeric (such as price or braking score), it is generally considered a regression problem.  However, if you want to classify a data point as belonging to a particular group (such as aggressive, quality, or distracted), then it makes more sense to use a classification model.  In this exercise we will use a variant of the decision tree model - Random Forest - for classification.

#### SparkML

Apache Spark has a rich set of algorithms and data transformations included in the API that can be used for machine learning.  That's the great part!  The tricky part is that the models are built to accept the data in a particular format so some additional data preparation is required.  

##### Feature Selection

The first thing we are going to do is select the columns - features - we want to include in our model.  These features will be modeled against the variable we want to understand, also referred to as the label.  In other words we are going to provide the model with a list of features and their corresponding label (quality, aggressive, etc.).  It will learn the relationship between each feature and the different labels, allowing us to make predictions as to what the label should be given new values for the features.

In the code cell below, specify the names of the columns you want to include as your **features**.  Store the result in a new variable, `inputColumns`.

**Note:  For the sake of simplicity, leave out any columns of datatype 'string' from your features.**

<br>
 <div class="panel-group" id="accordion-67">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-1" href="#collapse1-67">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-67" class="panel-collapse collapse">
      <div class="panel-body">Try: <br>
      `inputColumns = ["brake_distance_ft", "brake_time_sec", "abs_event"]` <br>
      Make sure to include any columns you think affect the event type.</div>
    </div>
  </div>
</div>

In [None]:
## Specify columns to include here.  If you need to check the columns use .printSchema() on your cleanTypesDF
## Use the format ["col1", "col2", "col3"], etc.



##### Transformations

> _VectorAssembler_

Recall how I mentioned some of these transformations could be tricky?  We are going to perform two.  First, we'll consolidate all of our features from their respective columns into a single vector.  A features vector. 
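
If the idea of a features vector is new, this plain-Python sketch shows what the consolidation amounts to.  It is only an illustration of the concept, not Spark's `VectorAssembler`; the helper `assemble` and the sample values are made up, and only the column names come from the lab's schema.

```python
# Hypothetical rows: column names follow the lab's schema, values are invented.
rows = [
    {"type": "quality", "brake_time_sec": 4.0, "brake_distance_ft": 120.0, "abs_event": 0},
    {"type": "aggressive", "brake_time_sec": 1.5, "brake_distance_ft": 60.0, "abs_event": 1},
]
input_columns = ["brake_distance_ft", "brake_time_sec", "abs_event"]


def assemble(row, columns):
    """Return a copy of the row with the chosen columns gathered, in order, into one list."""
    assembled = dict(row)
    assembled["features"] = [float(row[name]) for name in columns]
    return assembled


assembled_rows = [assemble(row, input_columns) for row in rows]
for row in assembled_rows:
    print(row["type"], row["features"])
```

Note that the order of the values in each vector follows the order of the column names you supply, so keep that list handy when you interpret the model later.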

Fill in the blank space in the code cell below with your input columns variable name, then run the cell to see a sample output of what a features vector looks like.

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler, StringIndexer

assembler = VectorAssembler(
    inputCols = putInputColumnVariableHere,  ## Add your input column variable here
    outputCol="features")

## Use the 'transform' method to build the features vector, then see a couple rows from the 'type' and new 'features' columns.
## Make sure you are doing this on your DataFrame with the proper data types (i.e., cleanTypesDF)



> _StringIndexer_

Next, we'll transform our `type` column - the one we want to be able to classify or predict.  Since it's a string, we'll have to convert it to another data type in order to use it with the implementation of Random Forest in Spark.
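
To make the conversion concrete, here is a plain-Python sketch of the idea behind string indexing.  By default Spark's `StringIndexer` orders labels by descending frequency, so the most common label receives index 0.0.  The helper `fit_label_index`, the sample labels, and the alphabetical tie-break are assumptions made for this illustration only.

```python
from collections import Counter

# Invented sample of the `type` column.
labels = ["quality", "aggressive", "quality", "distracted", "quality", "aggressive"]


def fit_label_index(values):
    """Map each distinct label to a numeric index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; the alphabetical tie-break is an assumption of this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


label_index = fit_label_index(labels)
indexed = [label_index[label] for label in labels]

print(label_index)
print(indexed)
```

The mapping is learned from the data once (the "fit" step) and then applied to every row (the "transform" step), which is the same two-step pattern the code cell below follows.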

<br>
 <div class="panel-group" id="accordion-8">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-8" href="#collapse1-8">
        Advanced Optional</a>
      </h4>
    </div>
    <div id="collapse1-8" class="panel-collapse collapse">
      <div class="panel-body">Transform the `cleanTypesDF` with your StringIndexer, then select the type and indexedLabel columns and send the resulting DataFrame to Pandas to display results in a table.<br></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-8" href="#collapse2-8">
        Advanced Optional 2</a>
      </h4>
    </div>
    <div id="collapse2-8" class="panel-collapse collapse">
      <div class="panel-body">You can try predicting another string type column, 'road_type', and see what kind of results you get.</div>
    </div>
  </div>
 </div>

In [None]:
## Make sure you are fitting the StringIndexer on your DataFrame with the proper data types!
labelIndexer = StringIndexer(inputCol="type", outputCol="indexedLabel").fit(cleanTypesDF)

## Display the results of this transformation with Pandas
labelIndexer.transform(cleanTypesDF).select("type", "indexedLabel").limit(5).toPandas()

With the transformations properly configured, you are in a position to build the Random Forest classifier. 

##### Pipelines

In SparkML, a [pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html) is an ordered sequence of stages (transformations followed by a model) that turns input data into an output.  Just as you needed two transformers here, other workflows may chain several transformations and algorithms to arrive at a final output.  Think of the ML Pipeline as a cleaner, streamlined API for machine learning programs.  If you are interested in doing machine learning with Spark, the Pipelines construct is worth exploring.
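
If the idea of chaining stages feels abstract, the toy sketch below may help. It is a deliberately simplified, plain-Python illustration: `ToyPipeline`, `Center`, and `IsPositive` are hypothetical names invented for this example, and unlike Spark (where fitting returns a separate model object) the stages here simply store what they learn on themselves.

```python
class Center(object):
    """Toy stage that learns the mean during fit and subtracts it in transform."""
    def __init__(self):
        self.mean = 0.0

    def fit(self, data):
        self.mean = sum(data) / float(len(data))

    def transform(self, data):
        return [x - self.mean for x in data]


class IsPositive(object):
    """Toy stage with nothing to learn: flags values greater than zero."""
    def fit(self, data):
        pass

    def transform(self, data):
        return [1 if x > 0 else 0 for x in data]


class ToyPipeline(object):
    """Run stages in order; each stage sees the output of the one before it."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)
        return self

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


toy = ToyPipeline([Center(), IsPositive()]).fit([1.0, 2.0, 6.0])
result = toy.transform([1.0, 2.0, 6.0])
print(result)
```

The point to take away is the ordering: whatever is learned during fitting is reused, stage by stage, when new data is transformed later.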

First define the model and its parameters, then wrap the transformations and model in a Pipeline:

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

## Define model and its parameters.
rf = RandomForestClassifier(labelCol = "indexedLabel", featuresCol = "features")

## Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages = [assembler, labelIndexer, rf])

Before fitting the model to the data, split it up into training and test sets.  We'll train the model on 70% of the data, then use the other 30% to test the accuracy.
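
One detail worth knowing: a random split assigns each row independently, so the resulting sizes are only approximately 70% and 30%, and they change from run to run unless a seed is fixed. The plain-Python sketch below illustrates that behavior with a hypothetical `random_split` helper. It is a conceptual illustration, not Spark's implementation; in Spark, `randomSplit` also accepts an optional `seed` argument if you want a repeatable split.

```python
import random

def random_split(rows, train_fraction, seed):
    """Send each row to the training set with probability train_fraction."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))  # stand-in for the rows of a DataFrame
train_rows, test_rows = random_split(rows, 0.7, seed=42)

print("Rows in training data: %d" % len(train_rows))
print("Rows in test data: %d" % len(test_rows))
```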

In [None]:
## If your DataFrame with proper types isn't `cleanTypesDF`, make sure to insert the correct DF name here.
(trainingDF, testDF) = cleanTypesDF.randomSplit([0.7, 0.3])

print("Rows in training data: %d" % trainingDF.count())
print("Rows in test data: %d" % testDF.count())

In [None]:
## Fit the pipeline on the training DataFrame.  This could take a minute or two while the model is trained.

RFmodel = 

Alright!  Your random forest classifier is now ready to make predictions.  

#### Prediction and Evaluation

Use the test data set to estimate the accuracy of your model.  The concept here is that the model has generalized the relationship between your labels and the features, but only based on the data points you provided. Feeding the model new rows of data it has never seen before and then checking to see if its predictions were correct will give you a sense of how well it has captured the relationship between the features and their labels.
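
Accuracy is simply the fraction of test rows whose predicted label matches the actual label, and the test error is one minus that fraction. Here is a minimal plain-Python sketch of the arithmetic, using made-up labels and a hypothetical `accuracy_score` helper:

```python
def accuracy_score(actual, predicted):
    """Return the fraction of positions where the prediction matches the label."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / float(len(actual))

# Made-up indexed labels, for illustration only.
actual_labels = [0.0, 1.0, 2.0, 1.0, 0.0]
predicted_labels = [0.0, 1.0, 1.0, 1.0, 0.0]

acc = accuracy_score(actual_labels, predicted_labels)
print("Accuracy = %g, Test Error = %g" % (acc, 1.0 - acc))
```

Four of the five made-up predictions match, so the accuracy here is 0.8.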

<br>
 <div class="panel-group" id="accordion-99">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-99" href="#collapse2-99">
        Advanced Optional</a>
      </h4>
    </div>
    <div id="collapse2-99" class="panel-collapse collapse">
      <div class="panel-body">See if you can show the predicted types alongside the actual types and the features column.  Use Pandas to display the table nicely.</div>
    </div>
  </div>
 </div>


In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

## Predict on test set
predictionsDF = RFmodel.transform(testDF)

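# Compare predictions with the true labels and compute accuracy on the test set.
# metricName must be set explicitly: the evaluator's default metric is F1, not accuracy.
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictionsDF)

## Show test error rate
print("Test Error = %g" % (1.0 - accuracy))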

In [None]:
## Cell for Advanced Optional



You should have achieved a fairly accurate result.  Nicely done!  
_____

### Export Pipeline to Object Storage

It's one thing to build a model and it's quite another to operationalize it.  In this case the business may want to provide an application that service bays can use to check the recent driving behavior of customers.  Imagine providing customized service at the dealership based on the driving tendencies of the customer.  The possibilities for improving customer experience and increasing profit are real, but first you need to make this predictive model available to others in your organization. 

As your final task, save your pipeline to the local file system, then push it to Object Storage.

##### Save and Zip Pipeline

When you save your pipeline it will create a directory in the local file system on DSX.  Before we can put it in Object Storage we'll have to zip the contents into a single file.

Saving a model or pipeline in Spark is pretty easy.  Call the `.save("./modelname")` method on your fitted pipeline and it will be written to the local file system.

In [None]:
## Save your pipeline and model to the system

RFmodel.save("./brakeEventModel")

Now run the next cell to zip the directory and its contents into a single file.  This will generate a fair amount of output and could take a minute.

In [None]:
## Zip the directory

!tar -zcvf RFmodel.tar.gz brakeEventModel/  

Check the local file system for the presence of your zipped file using `!ls` or `!ls -ls`.
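
If you would rather check from Python than from the shell, the sketch below uses only the standard library to report the archive's size and count its entries. The `describe_archive` helper is a hypothetical name, and the path assumes you kept the `RFmodel.tar.gz` filename from the previous cell.

```python
import os
import tarfile

def describe_archive(path):
    """Return (size_in_bytes, member_names) for a gzip-compressed tar archive."""
    with tarfile.open(path, "r:gz") as archive:
        names = archive.getnames()
    return os.path.getsize(path), names

archive_path = "RFmodel.tar.gz"  # assumes the filename used in the tar command

if os.path.exists(archive_path):
    size, names = describe_archive(archive_path)
    print("%s: %d bytes, %d entries" % (archive_path, size, len(names)))
else:
    print("%s not found" % archive_path)
```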

In [None]:
## Confirm model was written to local file system here.



##### Define Your Object Storage Credentials

In your Files/Connections tab, click on 'insert to code' to display the drop down menu.  You should see an option to 'insert credentials'.  These are the credentials to your Object Storage instance.  Add them in the cell below, run it, then run the following cell to define the function that will put the files in Object Storage.

In [None]:
## Insert credentials in this cell.
## Make sure the 'filename' at the end matches the zipped file you just created!!



In [None]:
## All you have to do here is run the cell - no changes needed.
## Define the function to put a file in Object Storage
from io import BytesIO  
import requests  
import json  

def put_file(credentials, local_file_name):
    """Upload a local file to Bluemix Object Storage V3 and print the HTTP response."""
    ## Read the file as bytes; a .tar.gz archive is binary data
    with open(local_file_name, 'rb') as f:
        my_data = f.read()
    ## Request an authentication token from the identity service
    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],'domain': {'id': credentials['domain_id']},
            'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    ## Find the public object-store endpoint in the 'dallas' region and build the target URL
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == 'dallas':
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', local_file_name])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    ## Upload the file contents
    resp2 = requests.put(url=url2, headers=headers2, data=my_data)
    print(resp2)

Now use the `put_file()` function: pass it your credentials and the filename of the zipped file containing your pipeline and model.  A 'Response [201]' output indicates that the write was successful.

<br>
 <div class="panel-group" id="accordion-93">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-93" href="#collapse2-93">
        Hint</a>
      </h4>
    </div>
    <div id="collapse2-93" class="panel-collapse collapse">
      <div class="panel-body">The `put_file()` function takes two parameters - credentials and the name of the file in the local file system.  
      <br>Try: <br>
      `put_file(credentials_1, "modelname.tar.gz")`</div>
    </div>
  </div>
 </div>

In [None]:
## Use this cell to put your zipped file containing the model into Object Storage.



____

## Conclusion

Well done!  You've built a Random Forest classifier that can accurately predict the type of brake event based on features surrounding braking distance, time, and others.  This model can help better understand the tendencies of various drivers and optimize operations in different parts of the automotive business.  A real world data set will likely not be as straightforward as the one you worked with here (it almost certainly won't).  However, the methodology and basic techniques have been laid out to help you on your way towards building data products infused with machine learning.  

Before you go, please take a moment to review the material that you have completed in this lab!  Going through the workflow one more time should help concretize the terms, flow, and methodology that you learned.

- **Problem Statement:**
    * Defined the question being asked of the data


- **Load Data from IBM Object Storage:**
    * Created Spark DataFrames, renamed them, and checked the SparkSession


- **Exploratory Data Analysis with SparkSQL and PixieDust:**
    - Viewed schemas and displayed DataFrames with Pandas
    - Created temporary views and queried them
    - Cleaned up data types with SQL statements
    - Created new DataFrames with aggregate functions
    - Visualized data with PixieDust
    - Summarized conclusions in Markdown


- **Machine Learning with SparkML:**
    - Selected features to learn relationship to labels
    - Built ML Pipeline with VectorAssembler, StringIndexer and Random Forest
    - Predicted on unseen data and evaluated accuracy
    - Exported model to Object Storage for future use

_____

#### Additional Resources

[Official Apache Spark Documentation](https://spark.apache.org)


Questions?  Contact <rafi.kurlansik@ibm.com> or tweet me @kurlare.<br>
Interested in learning more?  Explore the Community Tiles in DSX for more tutorials and data sets.  Or, head over to [CognitiveClass.ai](http://www.cognitiveclass.ai/) for free classes on Apache Spark, machine learning, and more.
_______


<div><br><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/51/IBM_logo.svg/640px-IBM_logo.svg.png" width = 200 height = 200>
</div><br>