# Iris flower classification using KNN

This example uses the [classic Iris flower species](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/) dataset.
It records sepal length and width, petal length and width, and the known class for 150 samples: 50 from each of three species.

We'll use [Tablesaw](https://jtablesaw.github.io/tablesaw/) to store and manipulate our data
and the KNN classifier from the [Smile](http://haifengl.github.io/) machine learning library.
First, we add those libraries to the classpath.

In [1]:
%%classpath add mvn
tech.tablesaw tablesaw-beakerx 0.38.2
tech.tablesaw tablesaw-aggregate 0.38.2
com.github.haifengl smile-core 2.5.3

And add some associated imports.

In [2]:
%import smile.classification.KNN
%import smile.validation.ConfusionMatrix
%import smile.validation.CrossValidation
%import tech.tablesaw.api.StringColumn
%import tech.tablesaw.api.Table
%import static tech.tablesaw.aggregate.AggregateFunctions.*

We'll also enable a BeakerX display widget for Tablesaw tables.

In [3]:
tech.tablesaw.beakerx.TablesawDisplayer.register()
OutputCell.HIDDEN

We start by loading the data and checking its shape, its structure, and the number of samples per class.

In [4]:
rows = Table.read().csv("../resources/iris_data.csv")

In [5]:
rows.shape()

150 rows X 5 cols

In [6]:
rows.structure()

In [7]:
rows.xTabCounts('Class')

Let's also check the mean, minimum, and maximum of each feature column, grouped by class.

In [8]:
def cols = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width']
// Print mean, min and max of each feature, grouped by species
cols.each { col ->
    println rows.summarize(col, mean, min, max).by('Class')
}
OutputCell.HIDDEN

                                  iris_data.csv summary                                  
      Class       |  Mean [Sepal length]  |  Min [Sepal length]  |  Max [Sepal length]  |
-----------------------------------------------------------------------------------------
     Iris-setosa  |                5.006  |                 4.3  |                 5.8  |
 Iris-versicolor  |                5.936  |                 4.9  |                   7  |
  Iris-virginica  |                6.588  |                 4.9  |                 7.9  |
                                iris_data.csv summary                                 
      Class       |  Mean [Sepal width]  |  Min [Sepal width]  |  Max [Sepal width]  |
--------------------------------------------------------------------------------------
     Iris-setosa  |               3.418  |                2.3  |                4.4  |
 Iris-versicolor  |                2.77  |                  2  |                3.4  |
  Iris-virginica  |  ...

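Next, we convert the data into the form Smile expects (a `double[][]` of features and an `int[]` of class indices) and evaluate a 3-nearest-neighbour classifier using 10-fold cross-validation.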

In [9]:
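species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
// Convert the Tablesaw table to a Smile DataFrame
def iris = rows.smile().toDataFrame()
// Feature matrix: every column except the class label
def features = iris.drop("Class").toArray()
classes = iris.column("Class").toStringArray()
// Encode each class name as its index in the species list
classIndexs = classes.collect{ species.indexOf(it) } as int[]
// 10-fold cross-validation of a KNN classifier with k = 3
predictions = CrossValidation.classification(10, features, classIndexs) { x, y -> KNN.fit(x, y, 3) }
OutputCell.HIDDEN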
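
To give a feel for what the cell above does conceptually, here is a minimal plain-Groovy sketch of k-nearest-neighbour voting combined with k-fold cross-validation. It is an illustration only, not Smile's implementation. The helper names (`distance`, `knnPredict`, `crossValPredict`) and the toy points are made up for this example, and the folds are assigned round-robin rather than by whatever splitting strategy Smile uses.

```groovy
// Illustrative sketch only: NOT Smile's implementation. All names are hypothetical.

// Euclidean distance between two equal-length feature vectors.
double distance(List a, List b) {
    double sum = 0
    for (int i = 0; i < a.size(); i++) {
        double d = a[i] - b[i]
        sum += d * d
    }
    Math.sqrt(sum)
}

// Predict the class of `query` by majority vote among its k nearest training points.
int knnPredict(List trainX, List trainY, List query, int k) {
    // Training indices ordered from nearest to farthest, truncated to the first k.
    def nearest = (0..<trainX.size()).toList().sort { distance(trainX[it], query) }.take(k)
    // Tally the labels of those neighbours and return the most common one.
    def votes = nearest.countBy { trainY[it] }
    votes.max { it.value }.key
}

// k-fold cross-validation with a simple round-robin split (sample i goes to fold i % folds).
// Each sample is predicted by a model that never saw it during training.
List crossValPredict(List xs, List ys, int folds, int k) {
    def predictions = new ArrayList(Collections.nCopies(xs.size(), null))
    def all = (0..<xs.size()).toList()
    (0..<folds).each { fold ->
        def testIdx = all.findAll { it % folds == fold }
        def trainIdx = all.findAll { it % folds != fold }
        def trainX = trainIdx.collect { xs[it] }
        def trainY = trainIdx.collect { ys[it] }
        testIdx.each { predictions[it] = knnPredict(trainX, trainY, xs[it], k) }
    }
    predictions
}

// Hand-made toy points (two features each), NOT rows from the Iris dataset.
def toyX = [[1.4, 0.2], [1.5, 0.2], [1.3, 0.3],
            [4.7, 1.4], [4.5, 1.5], [4.9, 1.5],
            [6.0, 2.5], [5.9, 2.1], [5.6, 1.8]]
def toyY = [0, 0, 0, 1, 1, 1, 2, 2, 2]

println knnPredict(toyX, toyY, [4.6, 1.4], 3)
println crossValPredict(toyX, toyY, 3, 3)
```

With k = 3, each prediction is decided by a vote among the three closest training points, so a single stray neighbour from another class is outvoted.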

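We then label each sample with its predicted species, or with `predicted/actual` when the prediction is wrong, and display the confusion matrix.

In [10]:
def results = []
predictions.eachWithIndex{ predictedClass, idx ->
    def (actual, predicted) = [classes[idx], species[predictedClass]]
    // Record the species if correct, otherwise "predicted/actual"
    results << (actual == predicted ? predicted : "$predicted/$actual".toString())
}
// Add the outcome as a new column so it can be used when plotting
rows = rows.addColumns(StringColumn.create('Result', results))
ConfusionMatrix.of(classIndexs, predictions)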

ROW=truth and COL=predicted
class  0 |      50 |       0 |       0 |
class  1 |       0 |      47 |       3 |
class  2 |       0 |       2 |      48 |
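
The matrix above shows 145 of the 150 samples on the diagonal, i.e. classified correctly. As a rough sketch of how to read accuracy and per-class recall off such a matrix, the plain-Groovy snippet below recomputes them from the numbers shown; the variable names are chosen for this example and the snippet does not use Smile's own metric classes.

```groovy
// Confusion matrix copied from the output above: rows are true classes, columns are predictions.
def matrix = [[50, 0, 0],
              [0, 47, 3],
              [0, 2, 48]]

// Correct predictions lie on the diagonal.
int correct = (0..<matrix.size()).sum { matrix[it][it] }
int total = matrix.flatten().sum()
double accuracy = correct / total

// Recall per class: diagonal entry divided by that row's total.
def recall = (0..<matrix.size()).collect { matrix[it][it] / matrix[it].sum() }

println "correct = $correct of $total"
println "accuracy = ${accuracy.round(4)}"
println "recall per class = $recall"
```

All of the errors are confusions between classes 1 and 2 (Iris-versicolor and Iris-virginica); class 0 (Iris-setosa) is always classified correctly.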

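Finally, we plot petal width against petal length, with one series per result, so the misclassified samples stand out.

In [11]:
// One series per distinct result, e.g. a species name or "predicted/actual"
def predictedClasses = rows.column('Result').toList().toUnique()
plot = new Plot(title: 'Iris predicted[/actual] species', xBound: [0, 7.5], yBound: [0, 3],
                xLabel: 'Petal length', yLabel: 'Petal width')
predictedClasses.each { prediction ->
    // Select the petal measurements of the rows with this result
    def xs = rows.column('Petal length').where(rows.column('Result').isEqualTo(prediction)).toList()
    def ys = rows.column('Petal width').where(rows.column('Result').isEqualTo(prediction)).toList()
    plot << new Points(x: xs, y: ys, displayName: prediction)
}
plot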