# Iris-classification 

This side project is my **first** machine learning side project in which I can practice the framework of Gluon. I choose this problem because this is a very basic classification problem, which is a good starting point for myself. 

details about this project: https://www.kaggle.com/uciml/iris;  Info about MXNET and Gluon: http://gluon.mxnet.io/

<cite>Data description from Kaggle</cite>
>The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

>It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other."

## Outline

1. Preliminaries: Loading required libraries and importing data
2. Preparing data: Scaling, cleaning and converting data
3. Define Classifier: simple linear regression model
4. Validate performance of classifer

## (1) Preliminaries

### Load Required Libraries

In [1]:
from mxnet import gluon
from mxnet import autograd
from mxnet import ndarray as nd
import pandas as pd
import numpy as np

### Import Data 

The columns in this dataset are:

- **Id**: unique index of each sample
- **SepalLengthCm**: feature of sepal length
- **SepalWidthCm**: feature of sepal width
- **PetalLengthCm**: feature of petal length
- **PetalWidthCm**: feature of petal width
- **Species**: label of Iris species     (integer indicates: 0 - Iris-setosa; 1 - Iris-versicolor; 2 - Iris-virginic)


In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,0
1,2,4.9,3.0,1.4,0.2,0
2,3,4.7,3.2,1.3,0.2,0
3,4,4.6,3.1,1.5,0.2,0
4,5,5.0,3.6,1.4,0.2,0


## (2) Prepare the data

### 1. merge *feature columns* of both training and testing dataset


In [4]:
all_X = pd.concat((train.loc[:, 'SepalLengthCm':'PetalWidthCm'],
        test.loc[:,'SepalLengthCm':'PetalWidthCm']))

### 2. scale the features to avoid skewness as:

$$x =  \frac{x - E_x}{ \sigma {x}} $$
where $E_x$ is the expectation of $x$, and $\sigma{x}$ is the standard deviation of $x$


In [6]:
numeric_feas = all_X.dtypes[all_X.dtypes != "object"].index
all_X[numeric_feas] = all_X[numeric_feas].apply(lambda x: (x - x.mean()) / (x.std()))

### 3. convert categorical variable into indicator variables


In [7]:
all_X = pd.get_dummies(all_X, dummy_na=True)

### 4. convert format into matrix as input of ndarray and mxnet

In [9]:
# number of samples in training set
num_train = train.shape[0]

X_train = all_X[:num_train].as_matrix()
y_train = train.Species.as_matrix()
# load training data into ndarray
X_train = nd.array(X_train)
y_train = nd.array(y_train)
y_train.reshape((num_train, 1))

# number of samples in testing set
num_test = test.shape[0]
X_test = all_X[num_train:].as_matrix()
y_test = test.Species.as_matrix()
# load testing data into ndarray
X_test = nd.array(X_test)
y_test = nd.array(y_test)
y_test.reshape((num_test, 1))



[[ 2.]
 [ 2.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 1.]
 [ 1.]]
<NDArray 8x1 @cpu(0)>

## (3) Define Classifier

* loss function
* model
* data iterator
* trainer instance

### 1. define loss function

In [10]:
square_loss = gluon.loss.L2Loss()

### 2. define model

In [11]:
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(1))
net.initialize()

### 3. define data iterator

In [12]:
batch_size = 10
dataset = gluon.data.ArrayDataset(X_train, y_train)
data_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True)

### 4. define trainer instance

Note the learning rate here is very important! It affects the convergence rate and accuracy of classification. Try to tune it and see the result change! 😁

In [13]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
net.collect_params().initialize(force_reinit=True)

## (4) Train the model and validate its performance

In [16]:
epochs = 10
for e in range(epochs):
    total_loss = 0
    total_test_loss = 0
    total_sample = 0
    for data, label in data_iter:
        with autograd.record():
            output = net(data)
            loss = square_loss(output, y_train)
        loss.backward()
        total_sample += batch_size
        trainer.step(batch_size)
        # training error
        total_loss += nd.sum(loss).asscalar()
    print("-----------------------------------------")
    print("Epoch %d, average training loss: %f" % (e, total_loss/total_sample))
    # testing error
    if X_test is not None:
        test_output = net(X_test)
        test_loss = square_loss(test_output, y_test)
        total_test_loss += nd.sum(test_loss).asscalar()
    print("Epoch %d, average testing loss: %f" % (e, total_test_loss/total_sample))

-----------------------------------------
Epoch 0, average training loss: 0.000018
Epoch 0, average testing loss: 0.035717
-----------------------------------------
Epoch 1, average training loss: 0.000018
Epoch 1, average testing loss: 0.035707
-----------------------------------------
Epoch 2, average training loss: 0.000017
Epoch 2, average testing loss: 0.035714
-----------------------------------------
Epoch 3, average training loss: 0.000016
Epoch 3, average testing loss: 0.035708
-----------------------------------------
Epoch 4, average training loss: 0.000015
Epoch 4, average testing loss: 0.035708
-----------------------------------------
Epoch 5, average training loss: 0.000014
Epoch 5, average testing loss: 0.035729
-----------------------------------------
Epoch 6, average training loss: 0.000014
Epoch 6, average testing loss: 0.035718
-----------------------------------------
Epoch 7, average training loss: 0.000013
Epoch 7, average testing loss: 0.035729
----------------