## Finding underlying structures in flowers

(The "hello world" example of clustering algorithms)

In this example we explore a dataset of flowers for underlying structure. We examine the data, try to guess the number of different classes, and then apply the k-means clustering algorithm (http://scikit-learn.org/stable/modules/clustering.html#k-means).

### First explore the data set

In [None]:
# Importing necessary libraries
from sklearn import datasets
import matplotlib.pyplot as plt
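# The 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6; fall back to the old name on older versions
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')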
import pandas as pd
from sklearn.cluster import KMeans

In [None]:
# Loading a flowers dataset
iris_data = datasets.load_iris()

In [None]:
# Inspect the data numerically
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df.head()

In [None]:
# Inspect the data visually with two relevant features
# No cmap here: a colormap only has an effect when point colors are passed via c=
plt.scatter(iris_df.iloc[:, 0], iris_df.iloc[:, 2])
plt.xlabel('sepal length (cm)')
plt.ylabel('petal length (cm)')
plt.show()

### TODO: Estimate the number of classes with the elbow method

TODO: Execute the following code cell. We try to find the right number of clusters in the dataset by calculating, for each candidate number of clusters, the sum of the squared distances between the points and the center of their cluster. Smaller sums mean more compact clusters. This sum always shrinks as clusters are added, so look for the "elbow": after how many clusters does the curve flatten out, meaning that additional clusters barely reduce the distances any more?

In [None]:
# Calculate the sum of squared distances for 1 to 15 clusters
squared_distances = []
number_of_clusters = range(1, 16)
for k in number_of_clusters:
    # n_init and random_state are fixed so that repeated runs give the same curve
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km = km.fit(iris_df.iloc[:, [0, 2]])
    squared_distances.append(km.inertia_)

# Plot these distances for comparison
plt.plot(number_of_clusters, squared_distances, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal Number of Clusters')
plt.show()
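
To make the quantity on the y-axis concrete, here is a minimal sketch that recomputes the sum of squared distances by hand and compares it with the `inertia_` attribute of a fitted model. The choice of 2 clusters is arbitrary and only for illustration, and `n_init` and `random_state` are assumptions added for reproducibility.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

# Same two features as in the cell above: sepal length and petal length
X = datasets.load_iris().data[:, [0, 2]]

# k=2 is an arbitrary illustrative choice, not the answer to the exercise
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# For every point, look up the center of the cluster it was assigned to
assigned_centers = km.cluster_centers_[km.labels_]

# Sum of squared distances between each point and its cluster center
manual_inertia = float(np.sum((X - assigned_centers) ** 2))

print(manual_inertia, km.inertia_)
```

Both printed numbers should agree up to floating-point rounding, which is why `inertia_` can be read directly as "sum of squared distances" in the elbow plot.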

### TODO: Predict the classes with k-means

TODO: Initialize a KMeans model, set the number of clusters (n_clusters), train the model, and make the predictions.
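
If the scikit-learn workflow is new to you, the following sketch shows the same three steps (initialize, fit, predict) on a synthetic toy dataset rather than on the flowers. The data from `make_blobs` and the names `X_toy`, `toy_model` and `toy_labels` are hypothetical and chosen only for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic toy data (not the flowers): 60 two-dimensional points around 2 centers
X_toy, _ = make_blobs(n_samples=60, centers=2, cluster_std=0.5, random_state=0)

# Step 1: initialize the model with the assumed number of clusters
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)

# Steps 2 and 3: fit the model and get one cluster label per data point
toy_labels = toy_model.fit_predict(X_toy)

print(toy_labels[:10])             # cluster index of the first ten points
print(toy_model.cluster_centers_)  # one row of coordinates per cluster center
```

`fit_predict` combines training and prediction in one call; calling `fit` followed by `predict` on the same data is the two-step alternative.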

In [None]:
# Create the model and initialize it with the supposed number of clusters
# TODO: Initialize a KMeans model and set the parameter n_clusters to the number of clusters

# TODO: Fit the model and predict the classes (predicted_classes)
# predicted_classes = model.fit_predict(iris_df)

### Compare predictions with the real clusters in the dataset

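Now compare the clustering of the flowers with the target variable in the dataset. How did you do? Keep in mind that k-means numbers its clusters arbitrarily, so the colors in the two plots need not match; compare how the points are grouped rather than which color each group has.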

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6), sharey=True)
ax1.scatter(iris_df.iloc[:, 0], iris_df.iloc[:, 2], c=predicted_classes, cmap='viridis')
ax2.scatter(iris_df.iloc[:, 0], iris_df.iloc[:, 2], c=iris_data.target, cmap='viridis')
ax1.set_ylabel('petal length')
ax1.set_xlabel('sepal length')
ax2.set_xlabel('sepal length')
plt.show()
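
The visual comparison can be complemented with numbers. The sketch below is one possible way to do this, assuming three clusters (matching the three species in the target variable) and all four features. It cross-tabulates species against cluster labels and computes the adjusted Rand index, a score that is unaffected by how the clusters are numbered. The names `iris`, `labels`, `table` and `ari` are chosen for this illustration only.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()

# Assumption: 3 clusters, fitted on all four features
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Rows: true species, columns: cluster index assigned by k-means
table = pd.crosstab(
    iris.target_names[iris.target],
    labels,
    rownames=['species'],
    colnames=['cluster'],
)
print(table)

# 1.0 means identical groupings; values near 0 mean agreement no better than chance
ari = adjusted_rand_score(iris.target, labels)
print(ari)
```

A species whose row is concentrated in a single column was recovered cleanly; rows spread over two columns show where the clusters mix species.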

Congratulations! You have applied your first clustering algorithm. Now move on to the next exercise.