# Model bike sharing data with SPSS
This Scala 2.10 notebook shows you how to create a predictive model of bike sharing trends by using IBM SPSS Algorithms on Apache Spark. You'll learn how to create a generalized linear model with the SPSS ML API, and how to view the model with the SPSS Model Viewer.

The generalized linear model (GLM) is an analytical algorithm for different types of data. It includes statistical models such as linear regression for normally distributed targets, logistic models for binary or multinomial targets, and log linear models for count data. In addition to building a model, the GLM provides features such as variable selection, automatic selection of the distribution and link function, and model evaluation statistics. The GLM has options for regularization, such as LASSO, ridge regression, and elastic net, and can handle a wide variety of data.
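
The three model families mentioned above differ mainly in the link function that connects the linear predictor to the mean of the target. The following is a minimal sketch in plain Scala, independent of the SPSS API, of the inverse link for each family. The object name `InverseLink` and the coefficient values are hypothetical and chosen only for illustration.

```scala
// Inverse link functions map the linear predictor eta = b0 + b1 * x
// back to the scale of the target's mean.
object InverseLink {
  // Linear regression (normal target): the mean equals eta.
  def identity(eta: Double): Double = eta
  // Logistic model (binary target): the mean is a probability in (0, 1).
  def logit(eta: Double): Double = 1.0 / (1.0 + math.exp(-eta))
  // Log-linear model (count target): the mean is always positive.
  def log(eta: Double): Double = math.exp(eta)
}

// Hypothetical coefficients and input, for illustration only.
val intercept = 1.0
val slope = 0.5
val x = 2.0
val eta = intercept + slope * x

println("identity: " + InverseLink.identity(eta))
println("logit:    " + InverseLink.logit(eta))
println("log:      " + InverseLink.log(eta))
```

Because daily rental totals are non-negative counts, a log link is a natural candidate for this data, although the notebook leaves that choice to the algorithm.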

The bike sharing model will:
 - Identify what affects the number of bike rentals.
 - Predict future daily bike rental counts based on date, weather, and season.

This notebook runs on Scala 2.10 with Spark 1.6. Some familiarity with Scala is recommended.

## Table of contents 
This notebook contains these main sections:

1. [Overview of the bike sharing data](#overview)
1. [Prepare the data](#prepare)
1. [Configure the generalized linear model](#configure) 
1. [View the model](#view)
1. [Summary and next steps](#next)

<a id="overview"></a>
# 1. Overview of the bike sharing data

You'll be looking at the daily count of bike rentals in 2011 and 2012 in the Capital Bikeshare system, with corresponding weather and seasonal information. The [Capital Bikeshare](https://www.capitalbikeshare.com/home) system provides bicycles at over 400 stations in Washington, D.C. and neighboring cities in Virginia and Maryland.

The data set that you'll use has the following fields:

- instant: the record ID
- dteday: the date
- season: the season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
- yr: the year (0 = 2011, 1 = 2012)
- mnth: the month (1 - 12)
- hr: the hour (0 - 23); this field appears only in the hourly file `hour.csv`, not in `day.csv`
- holiday: 0 = not a holiday, 1 = a holiday 
- weekday: the day of the week (Sunday = 0 - Saturday = 6)
- workingday: 0 = a weekend or holiday, 1 = a work day
- weathersit: the weather conditions 
   - 1 = Clear or partly cloudy
   - 2 = Mist or clouds
   - 3 = Light precipitation
   - 4 = Heavy precipitation
- temp: the normalized temperature for the day in degrees Celsius (minimum = -8, maximum = +39) 
- atemp: the normalized feels-like temperature in degrees Celsius (minimum = -16, maximum = +50)
- hum: the normalized humidity (maximum = 100%)
- windspeed: the normalized wind speed in knots (maximum = 67)
- casual: the count of bikes rented to casual users
- registered: the count of bikes rented to registered users
- cnt: the total count of rented bikes (casual + registered)
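
To read the normalized weather fields in physical units, you can reverse the scaling. The following sketch assumes min-max scaling for `temp` and `atemp` with the ranges listed above, and division by the listed maximum for `hum` and `windspeed`. Those scaling rules are assumptions inferred from the field descriptions, so check them against the readme file in the downloaded archive. The function names and input values are hypothetical.

```scala
// Reverse min-max scaling: normalized = (value - min) / (max - min).
def denormalize(normalized: Double, min: Double, max: Double): Double =
  min + normalized * (max - min)

// Ranges taken from the field list above (assumed scaling rules).
def tempCelsius(temp: Double): Double = denormalize(temp, -8.0, 39.0)
def atempCelsius(atemp: Double): Double = denormalize(atemp, -16.0, 50.0)
def humidityPercent(hum: Double): Double = hum * 100.0
def windSpeed(windspeed: Double): Double = windspeed * 67.0

// Hypothetical normalized readings, not rows from the data set.
println(tempCelsius(0.5))      // 15.5
println(atempCelsius(0.5))     // 17.0
println(humidityPercent(0.25)) // 25.0
println(windSpeed(0.5))        // 33.5
```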


<a id="prepare"></a>
# 2. Prepare the data
To prepare the bike sharing data:  

1. [Get the data into your notebook](#load)
1. [Create a Spark DataFrame](#df)
1. [Enrich the DataFrame](#enrich)

<a id="load"></a>
## 2.1. Get the data into your notebook
To get the data and load it into your notebook:

1. Download the `Bike-Sharing-Dataset.zip` file from this website: [https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).
1. Extract the file.
1. Load the `day.csv` file into the notebook by clicking the __Add and Find Data__ icon on the notebook action bar. Drop the file into the box or browse to select the file.

The file is loaded to your object storage. The data set appears in the __Files__ list in the notebook and also in the __Data Assets__ section of the project.

<a id="df"></a>
## 2.2. Create a Spark DataFrame
After you create an SQLContext and insert your credentials, you can create a Spark DataFrame.

Run this cell to create an SQLContext for your DataFrame and to define a helper function that passes your object storage credentials to Hadoop:

In [None]:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

// Register the object storage (Swift) credentials with Spark's Hadoop configuration.
def setHadoopConfig(credentials: collection.mutable.Map[String, String]) = {
    // All settings for one Swift service share the same key prefix.
    val prefix = "fs.swift.service." + credentials("name")
    val hconf = sc.hadoopConfiguration
    hconf.set(prefix + ".auth.url", credentials("auth_url") + "/v3/auth/tokens")
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials("project_id"))
    hconf.set(prefix + ".username", credentials("user_id"))
    hconf.set(prefix + ".password", credentials("password"))
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials("region"))
    hconf.setBoolean(prefix + ".public", true)
}
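
To see which configuration keys `setHadoopConfig` produces without touching Spark, you can mirror its key construction in plain Scala. This is only a sketch: the `swiftSettings` helper and every credential value below are hypothetical, and it covers just three of the settings.

```scala
import scala.collection.mutable

// Hypothetical credentials; real values come from the Insert to code action.
val sampleCredentials = mutable.Map(
  "name" -> "demo",
  "auth_url" -> "https://identity.example.com",
  "project_id" -> "my-project",
  "region" -> "my-region")

// Mirror the key construction in setHadoopConfig for a subset of settings.
def swiftSettings(credentials: mutable.Map[String, String]): Seq[(String, String)] = {
  val prefix = "fs.swift.service." + credentials("name")
  Seq(
    (prefix + ".auth.url", credentials("auth_url") + "/v3/auth/tokens"),
    (prefix + ".tenant", credentials("project_id")),
    (prefix + ".region", credentials("region")))
}

swiftSettings(sampleCredentials).foreach {
  case (key, value) => println(key + " = " + value)
}
```

The service name in the prefix must match the name used later in the `swift://` URL, which is why the notebook sets `credentials_1("name")` before it calls the function.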

Insert your object storage credentials for the data set. Put your cursor in the following cell and click __Insert to code__, which appears after the `day.csv` file. If the number after `credentials` in the first line of code is not 1, edit the code to say `credentials_1`.

Run this cell to set your credentials with Spark:

In [None]:
credentials_1("name") = "gle"
setHadoopConfig(credentials_1)

Create the Spark DataFrame:

In [None]:
val filePath = "swift://" + credentials_1("container") + "." + credentials_1("name") + "/"
val fileName = credentials_1("filename")
val df = sqlContext.read.format("com.databricks.spark.csv").
    option("header", "true").option("inferschema", "true").load(filePath + fileName)
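
The path that Spark reads has the form `swift://<container>.<name>/<filename>`. As a quick illustration of the concatenation in the cell above, the following sketch builds the same string from hypothetical credential values. Only the `gle` service name comes from this notebook; the container name is invented.

```scala
// Hypothetical credential values for illustration only.
val exampleCredentials = Map(
  "container" -> "notebooks",
  "name" -> "gle",
  "filename" -> "day.csv")

// Same concatenation as in the notebook cell.
val examplePath = "swift://" + exampleCredentials("container") + "." +
  exampleCredentials("name") + "/" + exampleCredentials("filename")

println(examplePath) // swift://notebooks.gle/day.csv
```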


<a id="enrich"></a>
## 2.3. Enrich the DataFrame

The generalized linear model algorithm requires generated properties for the fields in the DataFrame so that they have proper data types, measurement levels, and roles.

Run the SPSS DataFrame assistant `enrich` function to generate those properties:

In [None]:
import com.ibm.spss.ml.utils.DataFrameImplicits._
val df2 = df.enrich

Show the first three rows of the DataFrame:

In [None]:
df2.show(3)

<a id="configure"></a>
# 3. Configure the generalized linear model 

Configure the generalized linear model with the `GeneralizedLinear()` method to analyze what conditions affect the number of rented bikes. 

First, import the SPSS generalized linear model algorithm package:

In [None]:
import com.ibm.spss.ml.classificationandregression.GeneralizedLinear
import com.ibm.spss.ml.classificationandregression.params._

Now, run the `GeneralizedLinear()` method. You set the `TargetField` parameter to `cnt` and the `Effects` list to the fields that describe the type of day, the season, and the weather conditions. By specifying `UNKNOWN` for the distribution and link function, the model automatically chooses the most appropriate settings for the data.

In [None]:
val gle = GeneralizedLinear().
  setTargetField("cnt").
  setInputFieldList(Array("season","yr","mnth","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed","casual","registered")).
  setEffects(List(
    Effect(List("season"), List(0)), 
    Effect(List("mnth"), List(0)),
    Effect(List("holiday"), List(0)),
    Effect(List("weekday"), List(0)),
    Effect(List("workingday"), List(0)),
    Effect(List("weathersit"), List(0)),
    Effect(List("temp"), List(0)),
    Effect(List("atemp"), List(0)),
    Effect(List("hum"), List(0)),
    Effect(List("windspeed"), List(0)))).
  setDistribution("UNKNOWN").
  setLinkFunction("UNKNOWN").      
  setUseVariableSelection(true).
  setVariableSelectionMethod("FORWARD_STEPWISE").
  setDetectTwoWayInteraction(true).
  setTargetSortOrder("DESCENDING")

val gle_model = gle.fit(df2)
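
With `setDetectTwoWayInteraction(true)`, the algorithm can consider interactions between pairs of fields in addition to the main effects. How SPSS searches for them is not described here, but a plain Scala sketch can show how large the space of candidate pairs is for the ten effect fields. The value names are hypothetical.

```scala
// The ten fields listed in setEffects.
val effectFields = List("season", "mnth", "holiday", "weekday", "workingday",
  "weathersit", "temp", "atemp", "hum", "windspeed")

// Every unordered pair of distinct fields is a candidate two-way interaction.
val candidatePairs = effectFields.combinations(2).map(p => (p(0), p(1))).toList

println(candidatePairs.size) // 45
println(candidatePairs.head) // (season,mnth)
```

Ten fields already yield 45 candidate pairs, which is one reason to combine interaction detection with variable selection rather than adding every pair by hand.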

<a id="view"></a>
# 4. View the model 

View the model with the SPSS Model Viewer. The visualization for the generalized linear model includes tests of model effects, statistics for each parameter, and a table and chart of standardized deviance residuals.

## 4.1 Generate a project token

Before you can run the model viewer, you need to generate a project token:

1. In the **My Projects** banner, click the **More** icon and then click **Insert project token**. The project token is inserted into the first cell of the notebook, before the title.
2. Copy the text, which appears at the beginning of the notebook, into the following cell and run it.

## 4.2 Start the model viewer

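Run the code in the following cell to start the SPSS Model Viewer, where you can see a visualization of the model along with its statistics and other characteristics. The `pc` value in the code is assumed to be the project context that the project token cell defines.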

In [None]:
import com.ibm.spss.scala.ModelViewer
kernel.magics.html(ModelViewer.toHTML(pc, gle_model))

<a id="next"></a>
# 5. Summary and next steps
You have created a generalized linear model of the bike sharing data. Now you can:
 - Create a different model and compare model evaluations, such as the tests of model effects and the residuals. See the [SPSS documentation](https://apsportal.ibm.com/docs/content/kc_gen/integrations-gen2.html).
 - Predict future bike rental counts for incoming data.

## Authors

Kang Jiangbo and Yu Wenpei are SPSS Algorithm Engineers at IBM.

### Data citations
Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 

Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, [doi:10.1007/s13748-013-0040-3](https://doi.org/10.1007/s13748-013-0040-3).

Copyright © 2017 IBM. This notebook and its source code are released under the terms of the MIT License.