# Using sklearn's Iris Dataset with neon

Tony Reina<br>
28 JUNE 2017

Here's an example of how we can load one of the standard [sklearn](http://scikit-learn.org/stable/index.html) datasets into a neon model. We'll be using the [iris dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), a classification problem in which we try to predict the species of an iris flower (Setosa, Versicolour, or Virginica) from 4 continuous features: sepal length, sepal width, petal length, and petal width. The dataset comes from Ronald Fisher's 1936 paper describing [Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Iris_flower_data_set). It is now considered one of the standard benchmarks for evaluating a new classification method.

In this notebook, we'll walk through loading the data from sklearn into neon's ArrayIterator class and then passing that to a simple multi-layer perceptron model. We should get a misclassification rate of 2% to 8%.

>Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at

>     http://www.apache.org/licenses/LICENSE-2.0

> Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

## Load the iris dataset from sklearn

In [1]:
from sklearn import datasets

In [2]:
iris = datasets.load_iris()
X = iris.data  
Y = iris.target

nClasses = len(iris.target_names)  # Setosa, Versicolour, and Virginica iris species

## Use sklearn to split the data into training and testing sets

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)  # roughly 67% training, 33% testing
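
To make the split less of a black box, here is a minimal NumPy sketch of a shuffled split. It is not sklearn's implementation; `shuffle_split` is a hypothetical helper, the arrays are placeholders with the same shape as the iris data, and rounding the test count up is an assumption that gives 50 test rows and 100 training rows for 150 samples.

```python
import numpy as np

def shuffle_split(X, y, test_size=0.33, seed=0):
    """Hypothetical helper: shuffle the row indices, then cut off a test slice."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))  # assumption: round the test count up
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Placeholder data with the same shape as iris: 150 rows, 4 features, 3 classes
X_demo = np.arange(150 * 4).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

X_tr, X_te, y_tr, y_te = shuffle_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

Fixing the seed makes the split repeatable, which helps when comparing runs.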

## Make sure that the features are scaled to mean of 0 and standard deviation of 1

This is standard pre-processing for multi-layered perceptron inputs.

In [4]:
from sklearn.preprocessing import StandardScaler

scl = StandardScaler()

X_train = scl.fit_transform(X_train)
X_test = scl.transform(X_test)
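
The important detail above is that the scaler is fit on the training data only and then reused on the test data. Here is a rough NumPy sketch of the same arithmetic. The `train` and `test` arrays are made-up stand-ins for `X_train` and `X_test`, not the real iris values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for X_train / X_test: 4 features on different scales
loc = [5.8, 3.0, 3.7, 1.2]
scale = [0.8, 0.4, 1.7, 0.7]
train = rng.normal(loc=loc, scale=scale, size=(100, 4))
test = rng.normal(loc=loc, scale=scale, size=(50, 4))

# Statistics come from the training set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuse the training statistics; do not refit on test data

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled test set will usually not have a mean of exactly 0 or a standard deviation of exactly 1. That is expected, since it is scaled with the training statistics.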

## Generate a backend for neon to use

This sets up either our GPU or CPU connection to neon. We need to do this first because ArrayIterator won't run without a backend.

We're asking neon to use the CPU, but we can change that to a GPU if one is available. Batch size refers to how many data points are processed at a time, i.e. per weight update. Here's a primer on [Gradient Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent).

>Technical note:  Your batch size must always be much less than the number of points in your data. So if you have 50 points, then set your batch size to something much less than 50. I'd suggest setting the batch size to no more than 10% of the number of data points. You can always just set your batch size to 1. In that case, you are no longer performing mini-batch gradient descent, but are performing the standard stochastic gradient descent.

In [5]:
from neon.data import ArrayIterator
from neon.backends import gen_backend

be = gen_backend(backend='cpu', batch_size=X_train.shape[0]//10)  # Change to 'gpu' if you have gpu support 
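
As a quick sanity check on the batch size, here is a small sketch of how it translates into batches per epoch. It assumes 100 training rows (what a 0.33 test split of 150 samples leaves) and that a final partial batch still counts as a batch. `batches_per_epoch` is a hypothetical helper, not a neon function.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Hypothetical helper: number of mini-batches needed to cover the data once."""
    return math.ceil(n_samples / batch_size)

n_train = 100                 # assumed number of training rows
batch_size = n_train // 10    # same rule as the gen_backend call above

print(batch_size, batches_per_epoch(n_train, batch_size))  # 10 10
print(batches_per_epoch(n_train, 1))                       # 100 (one update per data point)
```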

## Let's pass the data to neon

We pass our data (both features and labels) into neon's ArrayIterator class.  By default, ArrayIterator one-hot encodes the labels (which saves us a step). Once we get our ArrayIterators, then we can pass them directly into neon models.

In [6]:
training_data = ArrayIterator(X=X_train, y=y_train, nclass=nClasses, make_onehot=True)
testing_data = ArrayIterator(X=X_test, y=y_test, nclass=nClasses, make_onehot=True)
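
If one-hot encoding is new to you, here is a minimal NumPy sketch of the idea. It only illustrates the encoding; it says nothing about how neon lays out the array internally, and `one_hot` is a hypothetical helper.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Hypothetical helper: row i is all zeros except a 1 in column labels[i]."""
    return np.eye(n_classes, dtype=int)[labels]

y_demo_labels = np.array([0, 2, 1, 2])   # integer class labels, like iris.target
encoded = one_hot(y_demo_labels, 3)
print(encoded)
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [0 0 1]]
```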

In [7]:
print('I am using this backend: {}'.format(be))

I am using this backend: <neon.backends.nervanacpu.NervanaCPU object at 0x7fa55f41b450>


## Import the neon libraries we need for this MLP

In [8]:
from neon.initializers import GlorotUniform, Gaussian 
from neon.layers import GeneralizedCost, Affine, Dropout
from neon.models import Model 
from neon.optimizers import GradientDescentMomentum
from neon.transforms import Softmax, CrossEntropyMulti, Rectlin, Tanh
from neon.callbacks.callbacks import Callbacks 
from neon.transforms import Misclassification 

## Initialize the weights and bias variables

We could use numbers from the Gaussian distribution ($\mu=0, \sigma=0.3$) to initialize the weights and bias terms for our model. However, we can also use other initializations like [GlorotUniform](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).

In [9]:
init = GlorotUniform()  # Alternative: Gaussian(loc=0, scale=0.3)
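
For intuition, here is a sketch of the uniform initialization described in the Glorot paper linked above: weights are drawn from a uniform range whose width shrinks as the layer gets wider. This is my own NumPy illustration of the formula, not neon's source code, and `glorot_uniform` is a hypothetical helper.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Hypothetical helper: sample weights from U(-limit, limit),
    with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = glorot_uniform(4, 8, rng)   # e.g. a layer with 4 inputs and 8 outputs
print(W.shape, np.abs(W).max() <= np.sqrt(6.0 / 12))
```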

## Define a multi-layered perceptron (MLP) model

We just use a simple Python list to add our different layers to the model. The nice thing is that we've already put our data into a neon ArrayIterator. That means the model will automatically know how to handle the input layer.

I've just thrown together a model haphazardly. In this model, the input layer feeds into a 4-neuron affine layer with a rectified linear unit (ReLU) activation. That feeds into an 8-neuron hyperbolic tangent layer (with 50% dropout). Finally, that outputs to a softmax over the nClasses. We'll predict based on the argmax of the softmax layer.

In [10]:
layers = [ 
          Affine(nout=4, init=init, bias=init, activation=Rectlin()), # Affine layer with 4 neurons (ReLU activation)
          Affine(nout=8, init=init, bias=init, activation=Tanh()), # Affine layer with 8 neurons (Tanh activation)
          Dropout(0.5),  # Dropout layer
          Affine(nout=nClasses, init=init, bias=init, activation=Softmax()) # Affine layer with softmax
         ] 

In [11]:
mlp = Model(layers=layers) 
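
To see what this stack of layers computes, here is a rough NumPy sketch of a forward pass with the same layer sizes (4 inputs, 4 ReLU units, 8 tanh units, 3 softmax outputs). The weights are random placeholders and dropout is left out, so this only illustrates the shapes and activations. It is not how neon runs the model.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, params):
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(X @ W1 + b1)          # 4-neuron ReLU layer
    h2 = np.tanh(h1 @ W2 + b2)      # 8-neuron tanh layer (dropout omitted in this sketch)
    return softmax(h2 @ W3 + b3)    # class probabilities

rng = np.random.default_rng(0)
sizes = [4, 4, 8, 3]                # inputs, hidden 1, hidden 2, classes
params = [(rng.normal(scale=0.3, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

X_demo_batch = rng.normal(size=(5, 4))   # 5 made-up, already-scaled samples
probs = forward(X_demo_batch, params)
pred = probs.argmax(axis=1)              # predicted class = argmax of the softmax
print(probs.shape, pred.shape)
```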

## Cost function

How "close" is the model's prediction to the true value? For multi-class prediction we typically use [Cross Entropy](https://en.wikipedia.org/wiki/Cross_entropy).

In [12]:
cost = GeneralizedCost(costfunc=CrossEntropyMulti()) 
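
Here is a small NumPy sketch of multi-class cross entropy on made-up predictions, just to show how the number behaves. The averaging over samples and the small `eps` guard are my assumptions for the illustration; neon may normalize its reported cost differently.

```python
import numpy as np

def cross_entropy(probs, targets, eps=1e-12):
    """Mean over samples of -sum(target * log(prob)); eps guards against log(0)."""
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=1)))

targets = np.array([[1, 0, 0],
                    [0, 1, 0]])                 # one-hot true labels
confident = np.array([[0.90, 0.05, 0.05],
                      [0.10, 0.80, 0.10]])      # mostly right, fairly sure
uniform = np.full((2, 3), 1.0 / 3.0)            # no idea: equal probability per class

print(round(cross_entropy(confident, targets), 3))
print(round(cross_entropy(uniform, targets), 3))
```

A model that guesses uniformly over 3 classes scores $\ln 3 \approx 1.10$, so a cost near that value early in training suggests the model has not learned much yet.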

## Gradient descent

All of our models will use [gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent). We will iteratively update the model weights and biases in order to minimize the cost of the model.

In [13]:
optimizer = GradientDescentMomentum(0.1, momentum_coef=0.2) 

callbacks = Callbacks(mlp, eval_set=training_data)
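
The two numbers passed to the optimizer are the learning rate (0.1) and the momentum coefficient (0.2). Here is a sketch of the classic momentum update on a toy one-dimensional problem. It is a textbook form of the rule, which I am assuming is close to what neon does, and `momentum_step` is a hypothetical helper.

```python
def momentum_step(w, velocity, grad, learning_rate=0.1, momentum_coef=0.2):
    """Hypothetical helper: keep a fraction of the previous step, then add the new gradient step."""
    velocity = momentum_coef * velocity - learning_rate * grad
    return w + velocity, velocity

# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, v = 0.0, 0.0
for _ in range(50):
    grad = 2.0 * (w - 3.0)
    w, v = momentum_step(w, v, grad)

print(round(w, 4))  # close to the minimum at w = 3
```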

## Run the model

This starts gradient descent. The number of epochs is how many times we want to perform gradient descent on our entire training dataset. So 100 epochs means that we repeat gradient descent on our data 100 times in a row.

In [14]:
mlp.fit(training_data, optimizer=optimizer, num_epochs=100, cost=cost, callbacks=callbacks) 

Epoch 0   [Train |████████████████████|   10/10   batches, 1.17 cost, 0.03s]
Epoch 1   [Train |████████████████████|   10/10   batches, 1.14 cost, 0.03s]
Epoch 2   [Train |████████████████████|   10/10   batches, 1.05 cost, 0.03s]
Epoch 3   [Train |████████████████████|   10/10   batches, 1.01 cost, 0.03s]
Epoch 4   [Train |████████████████████|   10/10   batches, 0.99 cost, 0.03s]
Epoch 5   [Train |████████████████████|   10/10   batches, 0.90 cost, 0.03s]
Epoch 6   [Train |████████████████████|   10/10   batches, 0.85 cost, 0.03s]
Epoch 7   [Train |████████████████████|   10/10   batches, 0.79 cost, 0.02s]
Epoch 8   [Train |████████████████████|   10/10   batches, 0.76 cost, 0.03s]
Epoch 9   [Train |████████████████████|   10/10   batches, 0.67 cost, 0.03s]
Epoch 10  [Train |████████████████████|   10/10   batches, 0.68 cost, 0.04s]
Epoch 11  [Train |████████████████████|   10/10   batches, 0.63 cost, 0.04s]
Epoch 12  [Train |████████████████████|   10/10   batches, 0.60 cost, 0.04s]

## Run the model on the testing data

Let's run the model on the testing data and get the predictions. We can then compare those predictions with the true values to see how well our model has performed.

In [15]:
results = mlp.get_outputs(testing_data) 
prediction = results.argmax(1) 

error_pct = 100 * mlp.eval(testing_data, metric=Misclassification())[0]
print('The model misclassified {:.1f}% of the test data.'.format(error_pct))

The model misclassified 2.0% of the test data.
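
The misclassification rate is just the fraction of test points where the argmax prediction differs from the true label. Here is a tiny NumPy sketch with made-up softmax outputs and labels (these are not the notebook's results).

```python
import numpy as np

# Made-up softmax outputs for 4 samples and 3 classes
demo_outputs = np.array([[0.8, 0.1, 0.1],
                         [0.2, 0.5, 0.3],
                         [0.3, 0.3, 0.4],
                         [0.6, 0.3, 0.1]])
demo_labels = np.array([0, 1, 2, 1])           # made-up true classes

demo_pred = demo_outputs.argmax(axis=1)        # predicted class per row: [0, 1, 2, 0]
demo_error_pct = 100.0 * np.mean(demo_pred != demo_labels)

print('Misclassified {:.1f}% of the samples.'.format(demo_error_pct))  # Misclassified 25.0% of the samples.
```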
