# Clustering the Iris dataset into species using k-means
The value of $k$ follows directly from the problem: we want exactly one cluster centre for each species of Iris, so $k = 3$ (Setosa, Versicolor, Virginica). The goal is for the model to group each flower with the others of its species, much as a human expert would, by looking at four measurements: sepal length, sepal width, petal length and petal width.
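
To make the idea of "one cluster centre per species" concrete, here is a minimal sketch of the Lloyd iteration that `algorithm="lloyd"` refers to, written in plain Python. It is an illustration only, not scikit-learn's implementation: the function names `squared_distance` and `lloyd_kmeans`, the `seed` parameter and the toy points are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, max_iter=300, seed=0):
    """Hypothetical helper: alternate assignment and centre updates until stable."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centres = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centres would not move either
        assignment = new_assignment
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, assignment


# Toy 2-D data (made up for illustration): three loose groups of three points.
points = [[0, 0], [0, 1], [1, 0],
          [10, 10], [10, 11], [11, 10],
          [20, 0], [20, 1], [21, 0]]
centres, assignment = lloyd_kmeans(points, k=3)
print("Centres:", centres)
print("Assignment:", assignment)
```

Because the starting centres are random, the iteration can settle on different cluster numberings, or even on a poorer grouping, from one seed to the next.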

In [8]:
from sklearn.cluster import KMeans
from csv import reader
from collections import Counter

In [12]:
# Similar to the load functions in linReg.ipynb; taken from svm.ipynb.
def load_data(path: str, inputs: list[str], output: str):
    """Read a CSV file into a list of {"inputs": [floats], "output": label} dicts."""
    inputs = [name.lower() for name in inputs]
    output = output.lower()
    with open(path, newline="") as f:
        rows = reader(f)
        titles = [title.lower() for title in next(rows)]  # the csv reader already yields a list
        # Column matching is case-insensitive; input columns keep the file's column order.
        indices = [i for i, title in enumerate(titles) if title in inputs]
        outindex = titles.index(output)  # raises ValueError if the output column is missing
        return [{
            "inputs": [float(row[i]) for i in indices],
            "output": row[outindex]
        } for row in rows]

In [21]:
params = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
output = "species"
dataset = load_data("iris.csv", params, output)
print("Dataset loaded.")
X = [example["inputs"] for example in dataset]
y = [example["output"] for example in dataset]

model = KMeans(n_clusters=3, max_iter=300, tol=1e-4, algorithm="lloyd")
model.fit(X)  # The model does not need y to group the examples: k-means is unsupervised, so it clusters them without being given the labels.
print("Model trained.")

prediction = model.predict(X)
print("Prediction =", prediction)  # An array of 0s, 1s and 2s. Cluster numbering is arbitrary: because the initial cluster centres are random, a rerun may swap the numbers (all 0s becoming 1s, etc.).

results = sorted(Counter(zip(y, prediction)).items(), key=lambda item: item[1], reverse=True)

print("Grouping results:", results)
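# Treat the three most frequent (species, cluster) pairs as the correct groupings.
# This is only valid when those pairs cover three distinct species and three distinct clusters, as they do here.
correct = sum(count for _, count in results[:3])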
print(f"\nModel correctly grouped {correct} examples from {len(X)} testcases.")

# From the results we see that the model grouped all 50 Setosa flowers together and almost all Versicolor flowers (2 Versicolors landed in the Virginica cluster). However, only 36 of the 50 Virginica flowers landed in their own cluster; the other 14 were grouped with Versicolor. We conclude that, in the input space, Setosa flowers lie far from the other two species (their features are quite different), while Versicolor and Virginica lie fairly close to each other.

Dataset loaded.
Model trained.
Prediction = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 0 0 0 2 0 0 0 0
 0 0 2 2 0 0 0 0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 2 0 0 0 2 0
 0 2]
Grouping results: [(('Iris-setosa', 1), 50), (('Iris-versicolor', 2), 48), (('Iris-virginica', 0), 36), (('Iris-virginica', 2), 14), (('Iris-versicolor', 0), 2)]

Model correctly grouped 134 examples from 150 testcases.
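
The count above relies on the three most frequent (species, cluster) pairs happening to cover three different clusters. A slightly more explicit alternative, sketched below, labels each cluster with the species that appears in it most often and then counts the agreements. The helper names `map_clusters_to_labels` and `count_correct` are hypothetical, and the label and cluster lists are rebuilt from the grouping results printed above rather than by re-running the model.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(labels, clusters):
    """Hypothetical helper: give each cluster the label of its majority species."""
    votes = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        votes[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def count_correct(labels, clusters):
    """Count examples whose cluster's majority species matches their own label."""
    mapping = map_clusters_to_labels(labels, clusters)
    return sum(mapping[cluster] == label for label, cluster in zip(labels, clusters))


# Rebuilt from the "Grouping results" output above.
grouping = [
    (("Iris-setosa", 1), 50),
    (("Iris-versicolor", 2), 48),
    (("Iris-virginica", 0), 36),
    (("Iris-virginica", 2), 14),
    (("Iris-versicolor", 0), 2),
]
labels, clusters = [], []
for (label, cluster), count in grouping:
    labels += [label] * count
    clusters += [cluster] * count

mapping = map_clusters_to_labels(labels, clusters)
correct = count_correct(labels, clusters)
print("Cluster -> species:", mapping)
print(f"Correctly grouped {correct} of {len(labels)} examples.")
```

On these counts the majority-vote mapping gives the same total as the notebook, 134 of 150. Note that nothing in this sketch stops two clusters from receiving the same species label, so it is a sanity check rather than a strict one-to-one matching.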
