# How to use a Jupyter Notebook in Watson Studio
* Each executable cell of the notebook is preceded by an `In [ ]` indicator.
* When the bracket is empty, the cell has not been executed yet.
* To execute a cell, either click the Run button above or press the Shift-Enter key combination.
* The bracket turns to `In [*]` while the code is running, and then to a sequence number once execution has completed.
* The output of a code cell is displayed below that cell, in either text or graphical format.
* Cells can be executed in any order, but code dependencies usually require sequential execution.

## Initialize some of the libraries we will need
A set of Python packages is available in the Jupyter notebook environment; they need to be imported into the current execution namespace:
* Python standard libraries (`sys`, `types`, `re`)
* `numpy` for numerical/mathematical functions in Python
* `pandas` is the Python data analysis library, providing table-like objects known as pandas DataFrames
* `matplotlib` is a library for plotting mathematical functions
* `bokeh` is another graphical library for interactive plotting and graphing

In [None]:
import sys,types
import re

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import bokeh.plotting as bok


# Use Cloud Object Storage (COS) API to access data
This API is documented at https://console.bluemix.net/docs/services/cloud-object-storage/libraries/python.html#python   

**First pass instructions:**
Here we will access files of historical data points that were collected and stored by a Watson IoT to COS stream similar to the one you built earlier in this hands-on session.
This earlier run collected data over about 5 days, showing the daily cycles in the sensor readings.

In the first pass of the lab, we will access the historical data. In the second pass, you will access your own data, which is currently being collected by the WIOTP->COS data stream.

**Second pass instructions:**
Once you have reached the bottom of this notebook, we will re-run it on your own dataset, and run a few extra steps to generate a concatenated data file for Machine Learning ingestion.

So, we will need to insert your own COS credentials below, in place of the ones for the historical data:
* Select the code cell below, and comment out the whole cell contents (add a `#` in front of each line, you can use the Ctrl-/ key shortcut for this)
* Position the cursor after the commented section
* On the top right side of the notebook, select the 1001 icon, which opens the Files/Connections drawer
* From the `Connections` tab, locate the connection to your COS, which should be named with a `cloud-object-storage-` prefix
* From the connection, select `Insert to code`
* This will insert a new `credentials_n = {}` code block.
* **Make sure** that the variable is suffixed with `1`, i.e. `credentials_1`

In [None]:
# Credentials for accessing the historical data stored in COS
# @hidden_cell
credentials_1 = {
  'iam_url':'https://iam.ng.bluemix.net/oidc/token',
  'api_key':'wwnkueE80Sv8jYLAHl7YLi9WIdz9dEUL3Ca5H6NN4JAz',
  'resource_instance_id':'crn:v1:bluemix:public:cloud-object-storage:global:a/7f9dc5344476457f2c0f53244a1825db:ea11a568-b1e1-4743-85a4-ba297aa46370::',
  'url':'https://s3-api.us-geo.objectstorage.service.networklayer.com'
}
# @hidden_cell
# credentials_1 = {
#   'iam_url':'https://iam.ng.bluemix.net/oidc/token',
#   'api_key':'rZDp4xWnLHvEWLrk1J_FBRJyY3g4vtt6CXWa59yvNlhu',
#   'resource_instance_id':'crn:v1:bluemix:public:cloud-object-storage:global:a/6edb55179ec85cbb11b70fed0e12c861:15f5b9b1-9cf9-4d16-99a5-7c408282997c::',
#   'url':'https://s3-api.us-geo.objectstorage.service.networklayer.com'
# }

**First pass:** We have configured the stream flow to store in the `raspilamp` bucket, with a `raspiLamp1/lampdata_[0-9]*_[0-9]*.csv` file path pattern, i.e. CSV files with a timestamp suffix

**Second pass:** change the variables below to reflect the COS file path structure you have set up in the flow:
``` python
bucketName='raspilamp-20180420-x' # where x is your lamp number
dataFileRegExp='lampdata_[0-9]*_[0-9]*.csv'
```

In [None]:
# Data used for the Raspi Lamp 1 instance
bucketName='raspilamp'
#bucketName='raspilamp-20180420-x'

dataFileRegExp='raspiLamp1/lampdata_[0-9]*_[0-9]*.csv'
#dataFileRegExp='lampdata_[0-9]*_[0-9]*.csv'

This generic code cell uses the credentials that were filled in above, in the `credentials_1` variable.

In [None]:
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
#cos = ibm_boto3.client(service_name='s3',
cos = ibm_boto3.client('s3',
    ibm_api_key_id=credentials_1['api_key'],
    ibm_auth_endpoint=credentials_1['iam_url'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials_1['url'])

Now list the files from the given bucket whose names match the regular expression defined above

In [None]:
dataFilePrefix='' if len(dataFileRegExp.split('/'))==1 else dataFileRegExp.split('/')[0]
bucketFiles=cos.list_objects(Bucket=bucketName,Prefix=dataFilePrefix,MaxKeys=1000000)
keys = [bucketFile['Key'] for bucketFile in bucketFiles['Contents'] if re.match(dataFileRegExp,bucketFile['Key'])]
print("{0} of the {1} files listed in bucket {2} match the lampdata regexp".format(len(keys),len(bucketFiles['Contents']),bucketName))
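
As a small illustration of the filtering step, the sketch below applies the same pattern to a hypothetical list of object keys (the key names are made up for the example). Note that `re.match` only anchors at the start of the string and that the unescaped `.` matches any character, so a stricter variant using `re.fullmatch` and `\.` is shown as well. The stricter pattern is an assumption for illustration, not something the lab requires.

```python
import re

# Same pattern as dataFileRegExp in the cell above
pattern = 'raspiLamp1/lampdata_[0-9]*_[0-9]*.csv'
# Stricter variant (illustrative assumption): escaped dot, at least one digit per group
strict_pattern = r'raspiLamp1/lampdata_[0-9]+_[0-9]+\.csv'

# Hypothetical keys, for illustration only
sample_keys = [
    'raspiLamp1/lampdata_20180420_101500.csv',
    'raspiLamp1/lampdata_20180420_101500.csv.bak',
    'raspiLamp1/readme.txt',
    'other/lampdata_20180420_101500.csv',
]

# re.match accepts any key that starts with a match of the pattern
loose = [k for k in sample_keys if re.match(pattern, k)]
# re.fullmatch requires the whole key to match
strict = [k for k in sample_keys if re.fullmatch(strict_pattern, k)]

print(loose)
print(strict)
```

With these sample keys, the loose filter also keeps the `.csv.bak` entry, while the strict one keeps only the plain CSV file.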

Now we actually read the files from the bucket.   
This may be a lengthy operation depending on how many files have been collected.   
The `In [*]:` in front of the cell will turn to a sequence number once completed

In [None]:
# Your data file was loaded into a botocore.response.StreamingBody object.
# Please read the documentation of ibm_boto3 and pandas to learn more about your possibilities to load the data.
# ibm_boto3 documentation: https://ibm.github.io/ibm-cos-sdk-python/
# pandas documentation: http://pandas.pydata.org/
objs=[cos.get_object(Bucket=bucketName, Key=key) for key in keys]
print("Loaded {0} lampdata files".format(len(objs)))

In [None]:
# We collect streaming bodies objects for all files, and add a __iter__ method to allow file-like operations
strbodies=[obj['Body'] for obj in objs if obj['Body']]
# add missing __iter__ method, so pandas accepts body as file-like object
for strb in strbodies: strb.__iter__= types.MethodType(__iter__,strb) 
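
An alternative to patching `__iter__`, sketched here under the assumption that each file is small enough to hold in memory, is to read the bytes of the body and wrap them in `io.BytesIO`, which pandas accepts as a file-like object. `FakeBody` is a hypothetical stand-in for the streaming body so that the example runs without COS access; only its `read()` method is used, and the CSV content is made up.

```python
import io
import pandas as pd

class FakeBody:
    """Hypothetical stand-in for a streaming body: exposes read() only."""
    def __init__(self, data):
        self._data = data

    def read(self):
        return self._data

body = FakeBody(b"ts,solar,ldr\n1524218100,0.5,120\n1524218160,0.6,130\n")

# Read all bytes once, then hand pandas a seekable in-memory buffer
df_one = pd.read_csv(io.BytesIO(body.read()))
print(df_one)
```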

## Read data files into pandas and concatenate into a single dataframe
Now that we have obtained handles to the files in COS storage, we need to read them into objects that can be used by Data Science libraries.   
Here we use the pandas library.   

The following cell loads and converts the contents of the CSV files into pandas DataFrames, which are then concatenated to form a single one.

In [None]:
# Read from CSV files input, dropping duplicates
dfs = [pd.read_csv(strb).drop_duplicates() for strb in strbodies]
df=pd.concat(dfs)

#print("Loaded lamp data rows count:\n{0}\n{1}".format(df.count(),df.dtypes))
# Show types and non-N/A counts
pd.DataFrame({'types':df.dtypes,'count':df.count()})
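
One point worth keeping in mind: `drop_duplicates()` is applied to each file separately, so rows duplicated across files survive the concatenation, and the row labels of each file are kept as they are. The sketch below shows this on two small made-up CSV strings. The `ignore_index=True` option and the second `drop_duplicates()` are optional additions for illustration, not part of the lab.

```python
import io
import pandas as pd

# Two tiny made-up CSV "files"; the row (2, 0.6) appears in both
csv_a = "ts,solar\n1,0.5\n1,0.5\n2,0.6\n"
csv_b = "ts,solar\n2,0.6\n3,0.7\n"

# Per-file de-duplication, as in the cell above
frames = [pd.read_csv(io.StringIO(text)).drop_duplicates() for text in (csv_a, csv_b)]

# Plain concatenation keeps the cross-file duplicate and repeats index labels
as_in_lab = pd.concat(frames)

# Optional variant: renumber rows and de-duplicate across files
deduped = pd.concat(frames, ignore_index=True).drop_duplicates().reset_index(drop=True)

print(len(as_in_lab), len(deduped))
```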

### Clean the data for empty values
The following code cell drops rows that contain empty cells (although in the stored samples there should be none).

In [None]:
# Remove rows with N/A cells
df=df.dropna()

# Adjust the data frame column types
* The pandas CSV reader has not necessarily identified the column types accurately.
* In the following code cell, we force some columns to a numeric format, and we add a `date` column built from the absolute timestamp column `ts` that was added by the streams flow.
* Some of the timestamp fields are known to be large integers but have been converted to floats, so we also convert them back to `int64`.

In [None]:
# change data types, somehow the integer columns are not seen as numeric
df[['ldr','leds','rawTemp','solar']] = df[['ldr','leds','rawTemp','solar']].apply(pd.to_numeric)

# Add a date column from the ts column
df['date'] = pd.to_datetime(df['ts'],unit='s')

# Convert back to int64
df['ts']=df['ts'].astype(np.int64)
df['dts']=df['dts'].astype(np.int64)
df['rawTemp']=df['rawTemp'].astype(np.int64)
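
The same conversions can be tried on a toy frame. This is a minimal sketch with made-up values: `ldr` arrives as text, and `ts` arrives as a float holding seconds since the epoch.

```python
import numpy as np
import pandas as pd

# Made-up sample: numbers read as strings, timestamps read as floats
toy = pd.DataFrame({
    'ts': [1524218100.0, 1524218160.0],
    'ldr': ['120', '130'],
})

# Text to numbers
toy['ldr'] = pd.to_numeric(toy['ldr'])

# Seconds since the epoch to a datetime column
toy['date'] = pd.to_datetime(toy['ts'], unit='s')

# Float timestamps back to 64-bit integers
toy['ts'] = toy['ts'].astype(np.int64)

print(toy.dtypes)
```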

# Change the index for the timestamp date
We are operating on timeseries, so we will now use the absolute `date` as an index. In order to simplify further handling, we duplicate the `date` column and use the `dtidx` copy as index.

In [None]:
# index on date
df['dtidx']=df['date']
df=df.set_index('dtidx')
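
One benefit of a datetime index is that rows can be selected with date strings. The sketch below uses a made-up three-row frame and selects all rows of a single day through `.loc`; the `date` column remains available because only its `dtidx` copy became the index.

```python
import pandas as pd

# Made-up readings spanning two days
toy = pd.DataFrame({
    'date': pd.to_datetime(['2018-04-20 09:55', '2018-04-20 23:10', '2018-04-21 08:00']),
    'solar': [0.5, 0.0, 0.4],
})
toy['dtidx'] = toy['date']
toy = toy.set_index('dtidx')

# Partial-string selection on the datetime index: every row of April 20th
day_one = toy.loc['2018-04-20']
print(day_one)
```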

#### Display modified DataFrame's types
This shows the final column types and non-null counts of the dataframe after type conversion and cleansing.

In [None]:
pd.DataFrame({'types':df.dtypes,'count':df.count()})

# Explore the IoT data
We are now ready to start analysing the IoT data loaded from COS storage.

pandas DataFrames have many built-in methods to display statistics. Let's start by getting a glimpse of the data by printing the first and last 5 rows:

In [None]:
# display 5 first and last lines
pd.concat([df.head(5),df.tail(5)])

Use the `describe()` function to print statistics on numerical columns

In [None]:
df.describe()

# Generate visualizations for various aspects of the data
The first tools used to explore a dataset are usually graphical representations.

Using the `plot()` method, it is possible to quickly graph several attributes of a DataFrame.

We'll start by plotting time series graphs of the sensor values.

In [None]:
# Individual graphs of sensors
p_sens=df.plot(kind='line',x='date',y=['ldr','solar','humidity','pressure','temp','temperature','rawTemp'],figsize=(20,15),subplots=True)

# Explore potential relationships of sensors
We are now interested in finding out which sensors have correlated values.   
The graphs plotted above seem to show that `ldr` and `solar` are strongly correlated, but that the other values are not much related to each other. We'll verify this through a correlation matrix.

In [None]:
# Select data frame with sensor readings only
df_sens=df[['ldr','solar','humidity','pressure','temp','temperature','rawTemp']]

## Generate correlation matrix
The correlation matrix yields a first-level evaluation of how the attributes correlate with each other.
We will compute the matrix and display it.   
The correlation coefficient is a value in the range [-1, +1] that indicates how strongly two parameters are related. The closer its absolute value is to 1, the stronger the correlation; a negative sign means that one value decreases as the other increases.

In [None]:
#Compute correlation matrix
df_corr=df_sens.corr()
df_corr
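
To make the numbers in the matrix less of a black box, the sketch below computes one coefficient by hand and compares it with the result of `corr()`, which uses the Pearson method by default. The toy values and the `pearson` helper are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up readings: ldr roughly follows solar, noise does not
toy = pd.DataFrame({
    'solar': [0.1, 0.4, 0.5, 0.9],
    'ldr':   [10.0, 45.0, 48.0, 95.0],
    'noise': [3.0, 1.0, 4.0, 1.0],
})

def pearson(a, b):
    """Pearson coefficient: covariance divided by the product of the standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

by_hand = pearson(toy['solar'], toy['ldr'])
from_pandas = toy.corr().loc['solar', 'ldr']
print(by_hand, from_pandas)
```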

### Display of correlation matrix
To make things easier to read than an array of numbers, we can plot the matrix as a heatmap.   
Red indicates a high correlation, while blue marks the lowest correlation values found in the matrix.

In [None]:
# Plot correlation matrix as heatmap, Red is high correlation, yellow is medium, blue is low
axcorr=plt.matshow(df_corr,cmap=plt.cm.rainbow).axes
axcorr.set_xticklabels([' ']+df_corr.columns.tolist())
axcorr.set_yticklabels([' ']+df_corr.columns.tolist())
for label in axcorr.xaxis.get_ticklabels(): label.set_rotation(45)

# Focus on LDR vs solar relationship
Now that these early investigations have shown that there is a strong relationship between the solar and LDR values, we'll focus on finding out more about it.  
We'll start by plotting the two values on the same graph

In [None]:
# First plot the two data lines superimposed on the same graph
plight=df.plot(kind='line',x='date',y=['ldr','solar'],figsize=(20,3))

Since the values are not within the same range, we'll normalize them to the [0, 1] range and plot again.

In [None]:
# Now graph ldr and solar along time, after normalizing the data to the 0-1 range
dfSolLDR=df[['solar','ldr']]
dfSolLDRNorm=(dfSolLDR - dfSolLDR.min())/(dfSolLDR.max() - dfSolLDR.min())
dfSolLDRNorm['date']=df['date']
pSolLDR=dfSolLDRNorm.plot(kind='line',x='date',y=['ldr','solar'],figsize=(20,3))
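
The normalization used above is a min-max scaling, `(x - min) / (max - min)`, applied column by column. A minimal sketch on made-up values follows; note that a column whose values are all identical would produce a division by zero and therefore NaN.

```python
import pandas as pd

# Made-up readings on very different scales
toy = pd.DataFrame({'solar': [0.0, 2.5, 5.0], 'ldr': [100.0, 400.0, 900.0]})

# min() and max() return one value per column; pandas aligns them with the columns
toy_norm = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_norm)
```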

## Display dependency relationship of ldr and solar values
The normalized plot above shows that the `solar` and `ldr` values track each other, but do not follow exactly the same pattern all the time. We will analyze this further through a scatter plot.

For this we will use a scatter graph of `solar` vs `ldr`.   
In order to study the influence of temperature, we vary the intensity of the dots color based on the temperature

In [None]:
# Show relationship between ldr and solar, color by temperature
p=df.plot(kind='scatter',x='solar',y='ldr',c='temperature',figsize=(20,10),colormap='Greens')

## Plot scatter matrix of all sensors
To confirm the sensor relationships, we can also display a full cross-map of the sensor relationships graphically.   
This takes a while to compute, but yields exhaustive results that show how the sensors relate to each other:

In [None]:
# Plot scatter matrix of sensors, display kde on the diagonal 
from pandas.plotting import scatter_matrix
pscat=scatter_matrix(df_sens, alpha=0.2, figsize=(20,20), diagonal='kde')

## Optional: Interactive graphs plotting with `bokeh`
`bokeh` is another plotting library, which adds interactive capabilities to what matplotlib can do.   
Being interactive, it will work better on smaller datasets, so we will first aggregate data by the minute

In [None]:
# Aggregate by minute. We use a grouper on the date, and retain the average value over the window
# This generates a new DataFrame with by-minute averages
dfMin=df.groupby([pd.Grouper(key='date',freq='min')]).mean()
dfMin.describe()
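
The effect of the grouper can be seen on a handful of made-up readings. In this sketch, two readings fall in the same minute and are averaged, and a minute without any reading still appears in the result with a NaN mean, which is why NaN temperatures have to be handled when coloring the points later on.

```python
import pandas as pd

# Made-up readings: two in minute 09:55, none in 09:56, one in 09:57
toy = pd.DataFrame({
    'date': pd.to_datetime(['2018-04-20 09:55:05', '2018-04-20 09:55:35',
                            '2018-04-20 09:57:10']),
    'solar': [0.4, 0.6, 0.9],
})

# One row per minute between the first and last reading
by_minute = toy.groupby(pd.Grouper(key='date', freq='min')).mean()
print(by_minute)
```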

Display the bokeh graph with tools enabled. You will be able to interact with the graph using the tools palette on the right.

In [None]:
# Single-out the axis from the by-minute data frame 
xSolar = dfMin['solar']
yLdr = dfMin['ldr']

# Compute min and max temperatures
tMin=dfMin['temperature'].min()
tMax=dfMin['temperature'].max()
print("temperature range: {0} - {1}".format(tMin,tMax))

# Generate a list of hex RGB color values: red when t is NaN, otherwise green with an intensity that decreases linearly from tMin (brightest) to tMax (black)
c = [ "#%02x%02x00" % (0xff if np.isnan(t) else 0x00, 0x00 if np.isnan(t) else int(0xff*(tMax-t)/(tMax-tMin))) for t in dfMin['temperature']]

## We could generate varying size dot through a radius array in the same fashion as for colors
#radii = np.random.random(size=N) * 1.5
#colors = [
#    "#%02x%02x%02x" % (int(r), int(g), 150) for r, g in zip(50+2*x, 30+2*y)
#]

# Enable some of the bokeh interactive tools
#ALL_TOOLS="hover,crosshair,pan,wheel_zoom,zoom_in,zoom_out,box_zoom,undo,redo,reset,tap,save,box_select,poly_select,lasso_select,"
TOOLS="hover,crosshair,pan,wheel_zoom,zoom_in,zoom_out,box_zoom,reset,save,box_select,poly_select,lasso_select"
p = bok.figure(tools=TOOLS)

#p.scatter(x, y, radius=radii,fill_color=colors, fill_alpha=0.6,line_color=None)
p.scatter(xSolar, yLdr,fill_alpha=0.6,color=c,name="scatter")

# Instruct bokeh to output to the notebook
bok.output_notebook()

bok.show(p)
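
The color computation in the cell above is compact, so here is the same logic unrolled into a small helper. `temp_to_hex` is a hypothetical name used only for this sketch, the temperatures are made up, and the helper assumes that `t_max` is greater than `t_min`.

```python
import math

def temp_to_hex(t, t_min, t_max):
    """Return '#rrgg00': pure red for NaN, otherwise green scaled by (t_max - t)."""
    if math.isnan(t):
        return "#ff0000"
    green = int(0xff * (t_max - t) / (t_max - t_min))
    return "#%02x%02x00" % (0x00, green)

# Coldest, mid-range, warmest and missing temperature
colors = [temp_to_hex(t, 10.0, 30.0) for t in (10.0, 20.0, 30.0, float('nan'))]
print(colors)
```

The coldest point gets the brightest green, the warmest point comes out black, and missing temperatures stand out in red.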

# A glimpse at ML techniques: Classification and Clustering

Clustering is a Machine Learning technique whereby algorithms are applied to data sets to attempt to group data points together.   
This is often referred to as one of the 'Unsupervised Machine Learning' techniques.

In this lab section, we will show how to apply clustering to our lamp's [solar, ldr] cartesian points.

The following code is based on the SciKit Learn ![](http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)   
Machine Learning library, see http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html

To show the impact of the choice of clustering algorithm on the effectiveness of classification, we will add our RaspiLamp dataset to the samples provided by SciKit-Learn.
The sample code runs 9 algorithms and charts them together. This gives a feeling for the effectiveness of a given algorithm depending on the shape and structure of the data points.

In [None]:
## Adapted from http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
import time
import warnings

import numpy as np
import matplotlib.pyplot as plt

from sklearn import cluster, datasets, mixture
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from itertools import cycle, islice

np.random.seed(0)

# ============
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
# ============
n_samples = 1500

noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.05)

noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None

# Anisotropically distributed data
random_state = 170
X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(X, transformation)
aniso = (X_aniso, y)

# blobs with varied variances
varied = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[1.0, 2.5, 0.5],
                             random_state=random_state)

## Prepare Lamp Data: Reduce volume of data to analyze through aggregation
We will reduce the number of data points for our lamps to run through the algorithms

In [None]:
# Add a row of graph for our lamp data, create n_samples aggregates of averaged values
dfSubMeans=df[['date','solar','ldr']]
dfSubMeans=df.groupby(pd.cut(dfSubMeans['date'],n_samples)).mean()

# Convert the solar and ldr columns to a NumPy 2-D array, and pair it with a dummy zeroed label array
npArrLamp=(dfSubMeans[['solar','ldr']].values,np.zeros(n_samples))
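
The aggregation relies on `pd.cut` to split the time span into equal-width intervals, and then averages the readings that fall in each interval. The sketch below reproduces the idea on six made-up hourly readings and three bins; `n_bins` plays the role of `n_samples`. With real data, intervals that contain no reading would yield NaN means, which may need to be dropped before clustering.

```python
import pandas as pd

# Six made-up hourly readings
toy = pd.DataFrame({
    'date': pd.date_range('2018-04-20', periods=6, freq='60min'),
    'solar': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    'ldr': [0.0, 20.0, 40.0, 60.0, 80.0, 100.0],
})

n_bins = 3  # plays the role of n_samples in the lab

# Assign each row to one of n_bins equal-width time intervals
bins = pd.cut(toy['date'], n_bins)

# Average the sensor columns per interval, then convert to a 2-D array
toy_means = toy[['solar', 'ldr']].groupby(bins, observed=False).mean()
points = toy_means.to_numpy()
print(points.shape)
```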

In [None]:
# Put all datasets into an array to iterate over
datasets = [
    (npArrLamp,{'eps': .15, 'n_neighbors': 2}),
    (noisy_circles, {'damping': .77, 'preference': -240,
                     'quantile': .2, 'n_clusters': 2}),
    (noisy_moons, {'damping': .75, 'preference': -220, 'n_clusters': 2}),
    (varied, {'eps': .18, 'n_neighbors': 2}),
    (aniso, {'eps': .15, 'n_neighbors': 2}),
    (blobs, {}),
    (no_structure, {})]

# Set up plot details
plt.figure(figsize=(9 * 2 + 3, 12.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)

# Default model parameters
default_base = {'quantile': .3,
                'eps': .3,
                'damping': .9,
                'preference': -200,
                'n_neighbors': 10,
                'n_clusters': 3}

# iterate over datasets and algorithms to plot graphs
plot_num = 1
for i_dataset, (dataset, algo_params) in enumerate(datasets):
    # update parameters with dataset-specific values
    params = default_base.copy()
    params.update(algo_params)

    X, y = dataset

    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)

    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(X, quantile=params['quantile'])

    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(
        X, n_neighbors=params['n_neighbors'], include_self=False)
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)

    # ============
    # Create cluster objects
    # ============
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    two_means = cluster.MiniBatchKMeans(n_clusters=params['n_clusters'])
    ward = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='ward',
        connectivity=connectivity)
    spectral = cluster.SpectralClustering(
        n_clusters=params['n_clusters'], eigen_solver='arpack',
        affinity="nearest_neighbors")
    dbscan = cluster.DBSCAN(eps=params['eps'])
    affinity_propagation = cluster.AffinityPropagation(
        damping=params['damping'], preference=params['preference'])
    average_linkage = cluster.AgglomerativeClustering(
        linkage="average", affinity="cityblock",
        n_clusters=params['n_clusters'], connectivity=connectivity)
    birch = cluster.Birch(n_clusters=params['n_clusters'])
    gmm = mixture.GaussianMixture(
        n_components=params['n_clusters'], covariance_type='full')

    clustering_algorithms = (
        ('MiniBatchKMeans', two_means),
        ('AffinityPropagation', affinity_propagation),
        ('MeanShift', ms),
        ('SpectralClustering', spectral),
        ('Ward', ward),
        ('AgglomerativeClustering', average_linkage),
        ('DBSCAN', dbscan),
        ('Birch', birch),
        ('GaussianMixture', gmm)
    )

    for name, algorithm in clustering_algorithms:
        t0 = time.time()

        # catch warnings related to kneighbors_graph
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                message="the number of connected components of the " +
                "connectivity matrix is [0-9]{1,2}" +
                " > 1. Completing it to avoid stopping the tree early.",
                category=UserWarning)
            warnings.filterwarnings(
                "ignore",
                message="Graph is not fully connected, spectral embedding" +
                " may not work as expected.",
                category=UserWarning)
            algorithm.fit(X)

        t1 = time.time()
        if hasattr(algorithm, 'labels_'):
            y_pred = algorithm.labels_.astype(int)
        else:
            y_pred = algorithm.predict(X)

        plt.subplot(len(datasets), len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a',
                                             '#f781bf', '#a65628', '#984ea3',
                                             '#999999', '#e41a1c', '#dede00']),
                                      int(max(y_pred) + 1))))
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred])

        plt.xlim(-2.5, 2.5)
        plt.ylim(-2.5, 2.5)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1

plt.show()
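
The normalization step in the loop above matters because parameters such as `eps` are distances, and they only make sense if both axes are on a comparable scale. The sketch below shows a column-wise z-score in plain NumPy on made-up points, which should correspond to what `StandardScaler` does with its default settings.

```python
import numpy as np

# Made-up points whose two columns have very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Column-wise z-score: subtract the mean, divide by the population standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)
```

After scaling, each column has a mean of 0 and a standard deviation of 1, so a single `eps` value applies equally along both axes.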

## How to interpret the output
You can see on the graphs that, depending on the structure of the data, some algorithms work better than others.

For our dataset, DBSCAN appears to be the best option.

*********
# This is the end of the first pass.
For the second pass run on your own data, we will need to reset the notebook:
* From the `Kernel` menu at the top, select `Restart & Clear Output`
* All cell results should be cleared, and the execution status of every cell should be back to `In [ ]:`
* Go back to the top of the notebook and re-run it following second-pass instructions
*********

# This is to be executed only for the second pass on your own data

## Machine Learning setup
In this section, we will just write back the full concatenated data frame to Object Storage, so that it can be fed to the Machine Learning process.

In [None]:
# Decide on the name of the file object in COS
allSensorFileName='lampdata_All.csv'

In [None]:
# Convert a subset of the columns to CSV format
csvStr=df.to_csv(columns=['ts','solar','ldr','humidity','pressure','temperature'],encoding='utf-8')

In [None]:
# Write out the CSV data to a file object in COS
import io
cos.upload_fileobj(io.BytesIO(csvStr.encode('utf-8')),bucketName,allSensorFileName)
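
The write-back can be rehearsed locally without COS. This sketch builds a made-up two-row frame indexed by `dtidx`, serializes it the same way, and reads the bytes back with pandas. It shows that the index is written as the first CSV column, which is why a `dtidx` column appears when the file is reloaded; passing `index=False` to `to_csv` would omit it.

```python
import io
import pandas as pd

# Made-up frame with a named datetime index, similar in shape to the lab's df
toy = pd.DataFrame(
    {'ts': [1524218100, 1524218160], 'solar': [0.5, 0.6]},
    index=pd.Index(pd.to_datetime([1524218100, 1524218160], unit='s'), name='dtidx'),
)

# Serialize to CSV text, then to the bytes buffer that would be uploaded
csv_text = toy.to_csv(columns=['ts', 'solar'])
payload = io.BytesIO(csv_text.encode('utf-8'))

# Read the buffer back, as the verification cell does with the COS object
round_trip = pd.read_csv(payload)
print(list(round_trip.columns))
```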

In [None]:
# Optional: Verify the dataset by loading it back into a new dataframe
allObjs=cos.get_object(Bucket=bucketName, Key=allSensorFileName) 
allStrbod=allObjs['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
allStrbod.__iter__= types.MethodType(__iter__,allStrbod)

dfAllSens = pd.read_csv(allStrbod)
dfAllSens.describe()

In [None]:
# Convert the timestamp to a date and set it as the index
dfAllSens['date'] = pd.to_datetime(dfAllSens['ts'],unit='s')
dfAllSens['dtidx']=dfAllSens['date']
dfAllSens=dfAllSens.set_index('dtidx')

dfAllSens.dtypes

## Optional: run DBSCAN clustering

In [None]:
# Extract the two light-related sensors
dfLightSens=dfAllSens[['solar','ldr']].copy()

# Access sklearn clustering
import sklearn.cluster

# Select one of the clustering algorithms
#skClust = sklearn.cluster.KMeans(n_clusters=3)
#skClust = sklearn.cluster.SpectralClustering(n_clusters=3)
skClust = sklearn.cluster.DBSCAN()

# Run the algorithm on the data (parameters such as eps are set in the DBSCAN constructor, not in fit())
run=skClust.fit(dfLightSens.values)

# Add clustering result column to light sensors dataframe
dfLightSens['cluster']=skClust.labels_

dfLightSens.describe()

In [None]:
# Display relationship between ldr and solar, colored by cluster
pLightSens=dfLightSens.plot(kind='scatter',x='solar',y='ldr',c='cluster',figsize=(20,10),colormap='hsv')

# END