# Machine learning optimization using the cognitive assistant function

This notebook demonstrates the cognitive assistant that is part of Jupyter Notebook running a Python kernel. In this sample, you will use the cognitive assistant function to optimize machine learning pipelines for a notebook user's dataframe.

The cognitive assistant currently employs Apache® Spark (PySpark) and SKLearn (Python scikit-learn) analytics on Spark 2.1 and Python 2, optionally with feature selection, for classification and regression.

The notebook remains interactive during the optimizations. The execution counter and the kernel activity indicator reflect the activity of both the notebook user and the cognitive assistant.

## Contents

This notebook has the following main sections:
1. [Download the data](#download)
1. [Format the data](#format)
1. [Run the cognitive assistant optimization](#cads)

<a id="download"></a>
## 1. Download the data

Because the cognitive assistant is especially suited to large data sets, in the following step you'll retrieve a large (2.6 GB) compressed .csv file, the HIGGS data set, which contains simulated data on Higgs boson particles.

From the description of the data set on the UCI Web site: "The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks are presented in the original paper. The last 500,000 examples are used as a test set."
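
The column layout described in the quotation can be written down as index ranges. The following is a minimal sketch, assuming zero-based positions within each CSV row and the class label in the first column (as the formatting step later in this notebook states). The constant and function names are illustrative only and are not part of any library.

```python
# Column layout of one HIGGS row, using zero-based positions (assumed).
LABEL_INDEX = 0                      # class label (column 1)
LOW_LEVEL = list(range(1, 22))       # 21 kinematic features (columns 2-22)
HIGH_LEVEL = list(range(22, 29))     # 7 derived features (columns 23-29)
N_COLUMNS = 1 + len(LOW_LEVEL) + len(HIGH_LEVEL)


def split_row(values):
    """Split one parsed row into (label, low_level, high_level)."""
    if len(values) != N_COLUMNS:
        raise ValueError("expected %d values, got %d" % (N_COLUMNS, len(values)))
    return (values[LABEL_INDEX],
            [values[i] for i in LOW_LEVEL],
            [values[i] for i in HIGH_LEVEL])


# Hypothetical row: a label followed by 28 placeholder feature values.
label, low, high = split_row([1.0] + [0.5] * 28)
print("%d low-level and %d high-level features" % (len(low), len(high)))
```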

In [None]:
%%bash
#Download a portion of a popular compressed csv dataset.  Target kilobytes of data are specified after '-ge' nine or so lines below
cat << 'EOF' > limited.sh
# optionally you can limit the download rate to wget via --limit-rate=2m
wget -nv  -O limited.csv.gz https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz &
WGET_PID=$!
while [ `ps $WGET_PID | wc -l ` -eq 2 ] ; do
    du -k --apparent-size limited.csv.gz
    sleep 1
    #modify the target kilobyte download size after -ge below, or comment out the line below to get the whole file
    if [ -e limited.csv.gz ] ; then if [ `du -k --apparent-size limited.csv.gz | awk '{print $1}'` -ge 10000  ] ; then kill -15 $WGET_PID; fi; fi
done
EOF
cat limited.sh
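
The script above stops `wget` once the partial file reaches the target size. The same idea can be sketched in Python as a chunked copy that stops after a byte limit. This is only an illustration under stated assumptions: `copy_limited` and its parameters are hypothetical names, and an in-memory stream stands in for the network source.

```python
import io


def copy_limited(src, dst, limit_bytes, chunk_size=64 * 1024):
    """Copy from src to dst in chunks and stop once at least limit_bytes
    have been written or src is exhausted. Returns the bytes written."""
    written = 0
    while written < limit_bytes:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        written += len(chunk)
    return written


# Stand-in source of 1,000,000 bytes; the limit here is 100 KB.
source = io.BytesIO(b"x" * 1000000)
target = io.BytesIO()
copied = copy_limited(source, target, limit_bytes=100 * 1024)
print("copied %d bytes" % copied)
```

Like the shell loop, this sketch checks the size only between chunks, so it can overshoot the limit by up to one chunk.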

In [None]:
!echo "expect an unexpected-end-of-file-message, that is ok if you specified a limited download size"
!bash limited.sh
#sed will swallow incomplete last line which will likely occur in the case of a partial download
!gunzip -c -q limited.csv.gz | sed -e 's/rarelyseen/rarelyseen/'> limited.csv
!echo `wc -l limited.csv` ' complete lines of csv retrieved'
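
A partial download leaves a truncated gzip file whose last decompressed line is probably incomplete. As a hedged alternative to the shell pipeline, the sketch below uses the standard `zlib` module to decompress whatever is available and keep only the lines that end with a newline. The function name is hypothetical and the sample data is synthetic.

```python
import gzip
import io
import zlib


def complete_lines_from_truncated_gzip(raw_bytes):
    """Decompress as much of a possibly truncated gzip byte string as
    possible and return only the lines that are newline-terminated."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decompressor.decompress(raw_bytes)
    # Everything after the last newline is an incomplete line: drop it.
    last_newline = data.rfind(b"\n")
    if last_newline < 0:
        return []
    return data[:last_newline].split(b"\n")


# Build a small gzip payload in memory, then cut it short to mimic
# a partial download.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    for i in range(2000):
        gz.write(("%d,%d\n" % (i, (i * i) % 977)).encode("ascii"))
payload = buffer.getvalue()

lines = complete_lines_from_truncated_gzip(payload[: len(payload) // 2])
print("%d complete lines recovered" % len(lines))
```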

<a id="format"></a>
## 2. Format the data

Format the data into a PySpark SQL dataframe consisting of a numeric `label` column and a vector `features` column.

In [None]:
df = sqlContext.read.format('com.databricks.spark.csv')\
  .options(header='false', inferschema='true')\
  .load("limited.csv")

### 2.1 Remove compressed file

You can free space on your filesystem by removing the compressed *limited.csv.gz* file. Do not delete the uncompressed *limited.csv* file yet: Spark defers computation, so the file is not actually read until a later step needs the data.

In [None]:
!rm -f limited.csv.gz
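
Why a file that is read lazily must stay in place can be illustrated with a plain-Python analogy. This is not Spark code: a generator stands in for a deferred read, and the file name and contents are hypothetical.

```python
import os
import tempfile


def lazy_lines(path):
    """Generator: the file is not opened until the first item is requested."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


# Create a small stand-in file.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as handle:
    handle.write("1.0,0.5\n0.0,0.7\n")

rows = lazy_lines(path)   # nothing has been read yet
os.remove(path)           # deleting the file now breaks the deferred read

try:
    next(rows)
    outcome = "read succeeded"
except (IOError, OSError):
    outcome = "read failed: file was removed before it was used"
print(outcome)
```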

In [None]:
allNames = [f.name for f in df.schema.fields]
# The FIRST column is the label (unlike many csv files, where it is the last)
labelName = allNames[0]
print('labelName ' + labelName)
# Every remaining column is a feature
featureNames = [ f.name for f in df.schema.fields if f.name != labelName ]
print('featureNames=' + str(featureNames))

In [None]:
from pyspark.ml.feature import VectorAssembler

# Combine all feature columns into a single vector column named "features".
featureAssembler = VectorAssembler(
    inputCols=featureNames,
    outputCol="features")

In [None]:
#create features (vector) column then select label and features column, with renaming.
newDF=featureAssembler.transform(df).selectExpr(labelName + " as label","features as features")
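
Conceptually, each row of `newDF` holds a numeric label and one vector of features. The plain-Python sketch below illustrates that shape only: it uses a list instead of a Spark vector, the `assemble` helper is hypothetical, and the column names `_c0`, `_c1`, ... are an assumption about how Spark names columns when the CSV has no header.

```python
def assemble(row, label_name, feature_names):
    """Return a dict with a numeric 'label' and a list-valued 'features'."""
    return {
        "label": float(row[label_name]),
        "features": [float(row[name]) for name in feature_names],
    }


# Hypothetical three-column row; the real data set has 29 columns.
all_names = ["_c0", "_c1", "_c2"]
label_name = all_names[0]
feature_names = [name for name in all_names if name != label_name]

row = {"_c0": 1, "_c1": 0.869, "_c2": -0.635}
assembled = assemble(row, label_name, feature_names)
print("label=%s features=%s" % (assembled["label"], assembled["features"]))
```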


In [None]:
print(newDF)

<a id="cads"></a>
## 3. Run the cognitive assistant optimization

Before you can use the cognitive assistant, you must import the cognitive_assistant package and then start the assistant.

In [None]:
import cognitive_assistant as cognitive

In [None]:
cognitive.assistant.startAssistant()

### 3.1 Invoke the cognitive assistant function

You invoke the cognitive assistant function by providing it the following information:

-  the name of the data frame
-  the prediction type
    -  For a classification algorithm, you specify predictionType='CLASSIFICATION' or goalTags="CADS_FS_JY_CL"
    -  For a regression algorithm, you specify predictionType='REGRESSION' or goalTags="CADS_FS_JY_RG"
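
The correspondence between the two ways of stating the goal can be kept in one place. The mapping below is taken from the list above; the `goal_tag_for` helper is a hypothetical convenience and is not part of the cognitive_assistant package.

```python
# Prediction type -> goal tag, as listed above.
GOAL_TAGS = {
    "CLASSIFICATION": "CADS_FS_JY_CL",
    "REGRESSION": "CADS_FS_JY_RG",
}


def goal_tag_for(prediction_type):
    """Return the goal tag for a prediction type (case-insensitive)."""
    key = prediction_type.strip().upper()
    if key not in GOAL_TAGS:
        raise ValueError("unknown prediction type: %r" % prediction_type)
    return GOAL_TAGS[key]


print(goal_tag_for("classification"))
```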

In [None]:
cognitive.assistant.startOptimization(newDF, goalTags="CADS_FS_JY_CL")

### 3.2 Check on the progress

You can check on the progress of the optimization by running the following code. Re-run it as often as you like to view the current results. An error message can be expected while the optimization is starting up, so you may need to wait until the process is fully running before checking the progress.

In [None]:
cognitive.assistant.visualizeProgress()
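
Because the progress check can fail while the optimization is still starting, one option is to retry it a few times. The sketch below is a generic retry wrapper, assuming the startup error surfaces as a Python exception (the notebook does not say whether it does). `call_when_ready` and `fake_progress` are hypothetical names used for illustration.

```python
import time


def call_when_ready(func, attempts=5, delay_seconds=2.0):
    """Call func(); if it raises, wait and retry, up to `attempts` calls."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise RuntimeError("still not ready after %d attempts: %s"
                       % (attempts, last_error))


# Stand-in for a progress call that fails twice before succeeding.
calls = {"count": 0}


def fake_progress():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("optimization still starting")
    return "progress available"


result = call_when_ready(fake_progress, attempts=5, delay_seconds=0.0)
print(result)
```

Under the same assumption, the wrapper could be given `cognitive.assistant.visualizeProgress` in place of `fake_progress`.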

### 3.3 Do a quick check of CPU - Optional

To further check on the progress of your optimization, you can run the following command to display CPU activity and the number of processes that are running.

In [None]:
!top -b -u $USER -n 1
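
A rough view of CPU activity is also available from Python's standard library. This minimal sketch assumes a Unix-like system, because `os.getloadavg()` is not available on Windows.

```python
import multiprocessing
import os

# Number of CPUs visible to the operating system.
cpu_count = multiprocessing.cpu_count()

# System load averages over the last 1, 5, and 15 minutes (Unix only).
load_1min, load_5min, load_15min = os.getloadavg()

print("CPUs: %d, load averages (1/5/15 min): %.2f %.2f %.2f"
      % (cpu_count, load_1min, load_5min, load_15min))
```

A 1-minute load average close to the CPU count suggests that the optimization is keeping the machine busy.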

### 3.4 Stop the cognitive assistant process

When optimization has completed or progressed to an acceptable point, the cognitive assistant can be stopped by running the following command:

In [None]:
cognitive.assistant.stopAssistant()

## Summary

You downloaded and formatted a publicly available data set and then used the cognitive assistant to optimize the pipeline. You're probably feeling pretty good about yourself right now. You ROCK!

## Authors

**Peter D. Kirchner**, PhD (Electrical Engineering), is a Research Scientist pursuing computer science research in machine learning and cloud computing at the IBM Thomas J. Watson Research Center. He is presently engaged in cognitive automation of data science workflow to assist data scientists, focused on cloud-based deployments and scalability.

**Mike Sochka** is a content designer focusing on IBM Data Science Experience and Watson Machine Learning. 

***
### References

Baldi, P., P. Sadowski, and D. Whiteson. “Searching for Exotic Particles in High-energy Physics with Deep Learning.” Nature Communications 5 (July 2, 2014).

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Copyright © 2017 IBM. This notebook and its source code are released under the terms of the MIT License.