# Assignment: Cluster Iris flowers

In this assignment you are going to build an unsupervised learning app that clusters Iris flowers into discrete groups. 

There are three types of Iris flowers: Versicolor, Setosa, and Virginica. Each flower has two sets of leaves: the inner Petals and the outer Sepals.

Your goal is to build an app that can identify an Iris flower by its sepal and petal size.

![Iris flowers](./assets/flowers.png)

Your challenge is that you're not going to use the dataset labels. Your app has to recognize patterns in the dataset and cluster the flowers into three groups without any help. 

Clustering is an example of **unsupervised learning** where the data science model has to figure out the labels on its own. 

The first thing you will need for your app is a data file with Iris flower petal and sepal sizes. You can use this [CSV file](https://github.com/mdfarragher/DSC/blob/master/Clustering/IrisFlower/iris-data.csv). 

The file has already been downloaded and is available to your code as **iris-data.csv**. It looks like this:

![Data file](./assets/data.png)

It’s a CSV file with 5 columns:

* The length of the Sepal in centimeters
* The width of the Sepal in centimeters
* The length of the Petal in centimeters
* The width of the Petal in centimeters
* The type of Iris flower

You are going to build a clustering data science model that reads the data and then guesses the label for each flower in the dataset.

Of course the app won't know the real names of the flowers, so it's just going to number them: 1, 2, and 3.

## Get started

Let’s get started. You'll need to install the ML.NET package first:

In [1]:
#r "nuget:Microsoft.ML"

Now you are ready to add code:

In [2]:
using Microsoft.ML;
using Microsoft.ML.Data;
using System;
using System.Linq;

You will also need two classes: one to hold a flower and one to hold your model prediction:

In [3]:
public class IrisData
{
    [LoadColumn(0)] public float SepalLength;
    [LoadColumn(1)] public float SepalWidth;
    [LoadColumn(2)] public float PetalLength;
    [LoadColumn(3)] public float PetalWidth;
    [LoadColumn(4)] public string Label;
}

public class IrisPrediction
{
    [ColumnName("PredictedLabel")]
    public uint ClusterID;

    [ColumnName("Score")]
    public float[] Score;
}

The **IrisData** class holds one single flower. Note how the fields are tagged with the **LoadColumn** attribute that tells ML.NET how to load the data from the data file.

We are loading the label in the 5th column, but we won't be using the label during training because we want the model to figure out the iris flower types on its own.

There's also an **IrisPrediction** class which will hold a prediction for a single flower. The prediction consists of the ID of the cluster that the flower belongs to. Clusters are numbered from 1 upwards. And notice how the score field is an array? Each individual score value represents the distance of the flower to one specific cluster.  
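To make the relation between the two fields concrete, here is a minimal sketch in plain C# with no ML.NET dependency. It assumes, as described above, that each score is a distance and that the predicted cluster is the one with the smallest distance. The helper name `NearestCluster` and the score values are made up for illustration.

```csharp
using System;

// Hypothetical helper: returns the 1-based ID of the cluster
// whose score (distance) is the smallest.
uint NearestCluster(float[] scores)
{
    int best = 0;
    for (int i = 1; i < scores.Length; i++)
    {
        if (scores[i] < scores[best]) best = i;
    }
    return (uint)(best + 1); // cluster IDs start at 1
}

// Made-up distances of one flower to clusters 1, 2, and 3.
var exampleScores = new float[] { 4.2f, 0.3f, 1.7f };
Console.WriteLine(NearestCluster(exampleScores)); // prints 2
```

The flower sits closest to the second cluster, so under this assumption its cluster ID would be 2.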

## Loading the data

Next you'll need to load the data in memory:

In [7]:
var mlContext = new MLContext();

// read the iris flower data from a text file
Console.Write("Loading data...");
var data = mlContext.Data.LoadFromTextFile<IrisData>(
    path: "iris-data.csv", 
    hasHeader: false, 
    separatorChar: ',');

// split the data into a training and testing partition
var partitions = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
Console.WriteLine("done");

Loading data...done


This code uses the **LoadFromTextFile** method to load the CSV data directly into memory, and then calls **TrainTestSplit** to split the dataset into an 80% training partition and a 20% test partition.

Let's see if that worked. We're going to deserialize the training data into an enumeration of **IrisData** instances and do a quick visual check of the data:

In [8]:
// get an array of IrisData instances
var trainingFlowers = mlContext.Data.CreateEnumerable<IrisData>(partitions.TrainSet, reuseRowObject: false).ToArray();

// display the first 10 flowers
display(trainingFlowers.Take(10));

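| index | SepalLength | SepalWidth | PetalLength | PetalWidth | Label |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 6 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 7 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 8 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 9 | 4.8 | 3.4 | 1.6 | 0.2 | Iris-setosa |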


That looks great. For every flower we have the width and length of the petals and sepals.

## Training the model

Now let’s build the data science pipeline:

In [9]:
// set up a learning pipeline
// step 1: concatenate features into a single column
var pipeline = mlContext.Transforms.Concatenate(
        "Features", 
        "SepalLength", 
        "SepalWidth", 
        "PetalLength", 
        "PetalWidth")

    // step 2: use k-means clustering to find the iris types
    .Append(mlContext.Clustering.Trainers.KMeans(
        featureColumnName: "Features",
        numberOfClusters: 3));

// train the model on the data file
Console.Write("Training model....");
var model = pipeline.Fit(partitions.TrainSet);
Console.WriteLine("done");

Training model....done


Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.

This pipeline has two components:

* **Concatenate** which combines the four input columns (SepalLength, SepalWidth, PetalLength, and PetalWidth) into a single vector column called Features. This is a required step because ML.NET can only train on a single input column.
* A **KMeans** component which performs K-Means Clustering on the data and tries to find all Iris flower types. 

With the pipeline fully assembled, the code trains the model with a call to **Fit**.
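To make the **KMeans** step less of a black box, here is a minimal sketch of one iteration of the classic k-means loop in plain C#: assign every point to its nearest centroid, then move each centroid to the mean of its points. This is not ML.NET's implementation, which handles initialization and convergence for you. The function names and the toy data (two features and two clusters) are made up for illustration.

```csharp
using System;
using System.Linq;

// Squared Euclidean distance between two points of equal dimension.
double SquaredDistance(double[] a, double[] b) =>
    a.Zip(b, (x, y) => (x - y) * (x - y)).Sum();

// Index of the centroid closest to the given point.
int Nearest(double[] point, double[][] centroids) =>
    Enumerable.Range(0, centroids.Length)
        .OrderBy(c => SquaredDistance(point, centroids[c]))
        .First();

// One k-means iteration: assign points, then recompute each centroid.
double[][] Step(double[][] points, double[][] centroids)
{
    var assignment = points.Select(p => Nearest(p, centroids)).ToArray();
    return centroids.Select((old, c) =>
    {
        var members = points.Where((p, i) => assignment[i] == c).ToArray();
        if (members.Length == 0) return old; // leave an empty cluster where it is
        return Enumerable.Range(0, old.Length)
            .Select(d => members.Average(m => m[d]))
            .ToArray();
    }).ToArray();
}

// Toy data: made-up (petal length, petal width) pairs.
var toyPoints = new[]
{
    new[] { 1.4, 0.2 }, new[] { 1.5, 0.2 },
    new[] { 4.5, 1.5 }, new[] { 4.7, 1.4 }
};
var toyCentroids = new[] { new[] { 1.0, 0.0 }, new[] { 5.0, 2.0 } };

// The centroids move to about (1.45, 0.20) and (4.60, 1.45).
var updated = Step(toyPoints, toyCentroids);
foreach (var centroid in updated)
{
    Console.WriteLine(string.Join(", ", centroid.Select(v => v.ToString("F2"))));
}
```

Repeating `Step` until the centroids stop moving gives the basic algorithm. The number of centroids you start with plays the role of the `numberOfClusters` argument in the pipeline.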

## Evaluating the model

You now have a fully trained model. Now it's time to take the test set, assign each flower to a cluster, and calculate the quality metrics of the model:

In [11]:
// evaluate the model
Console.WriteLine("Evaluating model:");
var predictions = model.Transform(partitions.TestSet);
var metrics = mlContext.Clustering.Evaluate(
    predictions, 
    scoreColumnName: "Score", 
    featureColumnName: "Features");
Console.WriteLine($"   Average distance:     {metrics.AverageDistance}");
Console.WriteLine($"   Davies Bouldin index: {metrics.DaviesBouldinIndex}");

Evaluating model:
   Average distance:     0.5095654442196801
   Davies Bouldin index: 0.6106797026421757


This code calls **Transform** to set up predictions for every flower in the test set, and **Evaluate** to evaluate the predictions and automatically calculate two metrics:

* **AverageDistance**: this is the average distance of a flower to the center point of its cluster, averaged over all clusters in the dataset. It is a measure for the 'tightness' of the clusters. Lower values are better and mean more concentrated clusters. 
* **DaviesBouldinIndex**: this metric is the average 'similarity' of each cluster with its most similar cluster. Similarity is defined as the ratio of within-cluster distances to between-cluster distances. So in other words, clusters which are farther apart and more concentrated will result in a better score. Low values indicate better clustering.

So Average Distance measures how concentrated the clusters are in the dataset, and the Davies Bouldin Index measures both concentration and how far apart the clusters are spaced. For both metrics lower is better, with zero being the perfect score.
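For reference, the usual textbook definition of the Davies Bouldin Index for $k$ clusters is sketched below. The symbols are introduced here for illustration and are not part of the ML.NET API: $s_i$ is the average distance of the points in cluster $i$ to its centroid $c_i$, and $d(c_i, c_j)$ is the distance between two centroids. Whether ML.NET's implementation matches this definition in every detail is an assumption.

$$
\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d(c_i, c_j)}
$$

Each cluster contributes the ratio for its most similar neighbor, which is why tight clusters (small $s_i$) that are far apart (large $d$) push the index toward zero.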

We're getting an average distance of 0.51. Since all input features are in centimeters, this distance is also in units of centimeters. So what this means is that when we create a 4-dimensional solution space out of the 4 input features, on average every flower is 0.51 centimeters away from its cluster centroid.

So is that good or bad?

It's impossible to say actually. We would have to know the total extent of the solution space and see how far the cluster centroids are spaced apart.

A more informative metric is the Davies Bouldin Index, which measures the ratio of the average distances inside each cluster to the distances between clusters. Its values start at 0 and have no upper bound: 0 means super-concentrated clusters spaced far apart, and increasing values mean sparser clusters that start to overlap.

We get a Davies Bouldin Index value of 0.61, which means the clusters are reasonably spread out and non-overlapping, and that the quality of this clustering model is fair.
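To get a feel for how the index reacts to cluster spacing, here is a small sketch in plain C# that computes it by hand for two toy datasets. It uses the common textbook definition: for each cluster, take the worst ratio of summed within-cluster scatter to centroid distance, then average over all clusters. The function names and data are made up, and ML.NET's own calculation may differ in detail.

```csharp
using System;
using System.Linq;

// Euclidean distance between two points of equal dimension.
double Distance(double[] a, double[] b) =>
    Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

// Mean of all points in a cluster.
double[] Centroid(double[][] cluster) =>
    Enumerable.Range(0, cluster[0].Length)
        .Select(d => cluster.Average(p => p[d]))
        .ToArray();

// Textbook Davies Bouldin index; needs at least two non-empty clusters.
double DaviesBouldin(double[][][] clusters)
{
    var centroids = clusters.Select(c => Centroid(c)).ToArray();
    // scatter[i]: average distance of the points in cluster i to its centroid
    var scatter = clusters
        .Select((c, i) => c.Average(p => Distance(p, centroids[i])))
        .ToArray();
    return Enumerable.Range(0, clusters.Length)
        .Select(i => Enumerable.Range(0, clusters.Length)
            .Where(j => j != i)
            .Max(j => (scatter[i] + scatter[j]) / Distance(centroids[i], centroids[j])))
        .Average();
}

// Two tight clusters that are far apart...
var separated = new[]
{
    new[] { new[] { 0.0, 0.0 }, new[] { 2.0, 0.0 } },
    new[] { new[] { 10.0, 0.0 }, new[] { 12.0, 0.0 } }
};
// ...and the same clusters moved close together.
var crowded = new[]
{
    new[] { new[] { 0.0, 0.0 }, new[] { 2.0, 0.0 } },
    new[] { new[] { 3.0, 0.0 }, new[] { 5.0, 0.0 } }
};

Console.WriteLine(DaviesBouldin(separated)); // about 0.2
Console.WriteLine(DaviesBouldin(crowded));   // about 0.67
```

The scatter inside each cluster is identical in both datasets. Only the spacing changes, and the index rises from roughly 0.2 to roughly 0.67 as the clusters move together.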

## Making a prediction

To wrap up, let’s use the model to make predictions.

You will pick three arbitrary flowers from the test set, run them through the model, and compare the predictions with the labels provided in the data file.

Here’s how to do it:

In [12]:
// show predictions for a couple of flowers
Console.WriteLine("Predicting 3 flowers from the test set....");
var flowers = mlContext.Data.CreateEnumerable<IrisData>(partitions.TestSet, reuseRowObject: false).ToArray();
var flowerPredictions = mlContext.Data.CreateEnumerable<IrisPrediction>(predictions, reuseRowObject: false).ToArray();
foreach (var i in new int[] { 0, 10, 20 })
{
    Console.WriteLine($"   Flower: {flowers[i].Label}, prediction: {flowerPredictions[i].ClusterID}");
}

Predicting 3 flowers from the test set....
   Flower: Iris-setosa, prediction: 2
   Flower: Iris-versicolor, prediction: 3
   Flower: Iris-virginica, prediction: 1


This code calls the **CreateEnumerable** method to convert the test partition into an array of **IrisData** instances, and the model predictions into an array of **IrisPrediction** instances. 

Then the code picks three flowers for testing. For each flower it writes the label and the cluster ID (= a number between 1 and 3) to the console. 

The first flower is an Iris-Setosa and is assigned the label '2' by the model. The second flower is an Iris-Versicolor and gets the label '3', and the third flower is an Iris-Virginica and gets the label '1'. The model can tell all three flowers apart.

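This is a good result, because the Iris dataset is harder to cluster than it looks: Setosa separates cleanly, but Versicolor and Virginica overlap in petal and sepal size. Keep in mind that three flowers are only a spot check, not a full evaluation.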
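Because cluster IDs are arbitrary numbers, checking more than a handful of flowers requires mapping each cluster to the species it mostly contains. The sketch below does this with a majority vote in plain C#. The `results` array is made-up data standing in for the `flowers[i].Label` and `flowerPredictions[i].ClusterID` pairs from the code above, and the variable names are hypothetical.

```csharp
using System;
using System.Linq;

// Made-up (true label, predicted cluster ID) pairs for six flowers.
var results = new (string Label, uint ClusterID)[]
{
    ("Iris-setosa", 2u), ("Iris-setosa", 2u),
    ("Iris-versicolor", 3u), ("Iris-versicolor", 3u),
    ("Iris-virginica", 1u), ("Iris-virginica", 3u)
};

// For each cluster, find the most common true label.
var majority = results
    .GroupBy(r => r.ClusterID)
    .ToDictionary(
        g => g.Key,
        g => g.GroupBy(r => r.Label)
              .OrderByDescending(l => l.Count())
              .First().Key);

// Fraction of flowers whose label matches the majority label of their cluster.
double purity = results.Count(r => majority[r.ClusterID] == r.Label)
                / (double)results.Length;

foreach (var pair in majority.OrderBy(p => p.Key))
{
    Console.WriteLine($"Cluster {pair.Key}: {pair.Value}");
}
Console.WriteLine($"Matching flowers: {purity:P0}");
```

In this made-up sample one Virginica lands in the Versicolor cluster, so five out of six flowers match their cluster's majority label.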

## Further improvements

How do you think this model can be improved even more?