# 0. Introduction

The aim of this lab is to get familiar with **clustering** using **K-means**.

For this lab, we will be using the [iris dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-dataset).

In [None]:
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import cluster
from sklearn import datasets
from sklearn import metrics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import colors as clr
import seaborn as sn
from scipy.spatial.distance import cdist
from scipy.cluster.hierarchy import dendrogram
from IPython import display

import typing
%matplotlib inline

In [None]:
iris = datasets.load_iris()
print(iris.DESCR)

For simplicity, we will use only `petal length` and `petal width` in the first part of the lab, standardized to zero mean and unit variance.

In [None]:
X = iris.data[:, 2:]
Y = iris.target
scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X)

print(X.shape, Y.shape)

Plotting the data together with their ground-truth labels already reveals some clusters.

In [None]:
marker_list = ['+', '.', 'x']
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111)
ax.set_aspect('equal')

for l in [0, 1, 2]:
  ax.scatter(
      X[Y == l, 0], 
      X[Y == l, 1],
      marker=marker_list[l], 
      s=70, 
      color='black',
      label='{:d} ({:s})'.format(l, iris.target_names[l])
      )

ax.legend(fontsize=12)
ax.set_xlabel(iris.feature_names[2], fontsize=14)
ax.set_ylabel(iris.feature_names[3], fontsize=14)
ax.grid(alpha=0.3)
ax.set_xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
ax.set_ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
plt.show()

# 1. K-means algorithm

K-means is an iterative algorithm that partitions $N$ observations into $K$ clusters.

The two main parts of the algorithm are:

1. E-step: assign each data point to its closest centroid
2. M-step: recompute each centroid as the mean of the points assigned to it
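
The two steps can be read as alternating minimization of a single cost. The notation below ($J$, $r_{nk}$, $\mu_k$) is introduced here for illustration and is not used elsewhere in the lab; $r_{nk}$ is 1 when point $x_n$ belongs to cluster $k$ and 0 otherwise.

$$
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2
$$

$$
\text{E-step:} \quad r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_{j} \lVert x_n - \mu_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}
\qquad
\text{M-step:} \quad \mu_k = \frac{\sum_{n} r_{nk} \, x_n}{\sum_{n} r_{nk}}
$$

Neither step can increase $J$: the E-step picks the closest centroid for fixed $\mu_k$, and the M-step picks the mean, which minimizes the summed squared distance for fixed $r_{nk}$.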

In [None]:
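def distance(x: np.ndarray, mu: np.ndarray) -> np.ndarray:
  """Pairwise Euclidean distances between the rows of x and the rows of mu."""
  # Uses the expansion ||x - mu||^2 = ||x||^2 - 2 x.mu + ||mu||^2
  x2 = np.sum(x**2, axis=1).reshape(-1, 1)  # shape (N, 1)
  mu2 = np.sum(mu**2, axis=1)               # shape (K,)
  xmu = np.matmul(x, mu.T)                  # shape (N, K)
  dist = x2 - 2*xmu + mu2
  # Round-off can leave tiny negative values; clip them so sqrt does not return NaN
  return np.sqrt(np.maximum(dist, 0))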

def estep(centroids: np.ndarray, x_data: np.ndarray) -> np.ndarray:
  distances = distance(x_data, centroids)
  allocation = np.argmin(distances, axis=1)
  return allocation

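def mstep(x_data: np.ndarray, allocation: np.ndarray) -> np.ndarray:
  """Return the mean of the points assigned to each cluster.

  Clusters with no assigned points are absent from the result.
  """
  _ndx = np.argsort(allocation)  # order the points by cluster id
  # _pos: index where each cluster starts in the sorted order; g_count: cluster sizes
  _id, _pos, g_count = np.unique(allocation[_ndx], return_index=True, return_counts=True)
  cluster_sums = np.add.reduceat(x_data[_ndx], _pos, axis=0)  # per-cluster sums
  return cluster_sums / g_count[:, None]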
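
To make the two steps concrete, here is a minimal sketch of a single E-M iteration on four hand-picked points. The arrays `points` and `centroids` are hypothetical toy values, and the code uses plain broadcasting instead of the helper functions above so that it runs on its own.

```python
import numpy as np

# Hypothetical toy data: two points near x=0 and two points near x=10
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
centroids = np.array([[1.0, 0.0], [9.0, 0.0]])

# E-step: distance from every point to every centroid, then pick the closest
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
allocation = dists.argmin(axis=1)

# M-step: each centroid moves to the mean of the points assigned to it
new_centroids = np.array(
    [points[allocation == k].mean(axis=0) for k in range(len(centroids))]
)

print(allocation)     # [0 0 1 1]
print(new_centroids)  # rows [0, 0.5] and [10, 0.5]
```

After one iteration the centroids already sit at the centers of the two groups, so a second iteration would leave the allocation unchanged.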

In [None]:
def visualize_clusters(X, Y, centroids, allocation):
  display.clear_output(wait=True)
  colors = ['#CBAACB', '#F6EAC2', '#97C1A9']
  fig = plt.figure(figsize=(7, 7))
  ax = fig.add_subplot(111)
  ax.set_aspect('equal')
  for l in [0, 1, 2]:
      idx = Y == l
      c = allocation[idx]
      cmap = clr.ListedColormap(colors, N=3)
      ranges = np.linspace(0, 2, len(colors)+1)
      norm = clr.BoundaryNorm(ranges, cmap.N)
      # print(X.shape, Y.shape, c.shape)
      ax.scatter(
          X[idx, 0], 
          X[idx, 1],
          marker=marker_list[l], 
          s=70, 
          c=c,
          label='{:d} ({:s})'.format(l, iris.target_names[l]),
          cmap=cmap,
          norm=norm
          )
  #plot centroids
  plt.scatter(
      centroids[:, 0], 
      centroids[:, 1], 
      marker='^', 
      c=colors, 
      s=70,
      )
  #plot lines
  for idx, datum in enumerate(X):
    calloc = allocation[idx]
    cen = centroids[calloc]
    x_dat = [datum[0], cen[0]]
    y_dat = [datum[1], cen[1]]
    plt.plot(x_dat, y_dat, alpha=0.2, c=colors[calloc])

  ax.legend(fontsize=12)
  ax.set_xlabel(iris.feature_names[2], fontsize=14)
  ax.set_ylabel(iris.feature_names[3], fontsize=14)
  ax.grid(alpha=0.3)
  ax.set_xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
  ax.set_ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
  plt.show()

Putting everything together:

* Initialize $K$ random centroids
* Repeat the E and M steps for $I$ iterations

In [None]:
iterations = 10
K = 3
dimensions = X.shape[1]

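# Initialize K random centroids, uniformly in [-2, 2)
# (if a centroid attracts no points, mstep drops it and fewer than K
# centroids remain; re-run the cell to draw a new initialization)
centroids = np.random.rand(K, dimensions) * 4 - 2
# Repeat the E and M steps for I iterations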
for i in range(iterations):
  allocation = estep(centroids, X)
  centroids = mstep(X, allocation)
  visualize_clusters(X, Y, centroids, allocation)

# 2. Selecting $K$ (Elbow curve)
In the iris example, prior knowledge about the species informs our choice of $K$. Such knowledge is not always available, so we need a way of selecting $K$ from the data.

For this part of the lab, we will use `sklearn`'s built-in methods.
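
As a rough illustration of the idea, the sketch below fits K-means for several values of $K$ on synthetic data with three known groups and prints how much the cost drops at each step. The blob centers and the name `toy` are hypothetical, and the cost here is scikit-learn's `inertia_` (sum of squared distances to the closest centroid), which is not the same quantity as the mean distance used in the lab cell.

```python
import numpy as np
from sklearn import cluster, datasets

# Hypothetical toy data: three well-separated blobs
toy, _ = datasets.make_blobs(
    n_samples=150,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)

inertias = []
for k in range(1, 7):
    model = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    inertias.append(model.inertia_)

# Drop in cost when going from K-1 to K clusters
drops = -np.diff(inertias)
for k, drop in zip(range(2, 7), drops):
    print(f"K={k}: cost decreased by {drop:.1f}")
```

The drops should be large up to the true number of groups and small afterwards; the point where they flatten out is the "elbow".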

In [None]:
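K = range(1, 10)
cost = list()
for k in K:
  # n_init and random_state are set explicitly so that results are reproducible
  model = cluster.KMeans(n_clusters=k, n_init=10, random_state=0)
  model.fit(X)
  # cost: mean distance from each point to its closest centroid
  cost.append(
      np.mean(np.min(cdist(X, model.cluster_centers_, 'euclidean'), axis=1))
  )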

Plot the cost against the number of clusters. What is the optimal value of $K$?

In [None]:
### your code here

Now, using all four attributes, repeat the elbow method to select the optimal $K$. Is it the same as before?

In [None]:
X = iris.data
### your code here

# 3. Hierarchical Clustering
In this section, we will look at a hierarchical clustering algorithm.
We will be using the [AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html?highlight=agglomerative) class from sklearn.

First, spend some time reading the API documentation to understand the model parameters.
Then, fit a model to $X$.
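
One possible setup, shown here on hypothetical toy data rather than on $X$, is to build the full merge tree by passing `n_clusters=None` together with `distance_threshold=0`. This is only a sketch of one parameter choice; other linkages and thresholds are worth trying.

```python
import numpy as np
from sklearn import cluster

rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated groups of 5 points each
toy = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(5, 0.3, (5, 2))])

# distance_threshold=0 with n_clusters=None builds the full tree and
# makes scikit-learn store the merge distances in model.distances_
model = cluster.AgglomerativeClustering(n_clusters=None, distance_threshold=0)
model.fit(toy)

print(model.children_.shape)   # one row per merge: (n_samples - 1, 2)
print(model.distances_.shape)  # one distance per merge: (n_samples - 1,)
```

Each row of `children_` records which two nodes were merged, and the matching entry of `distances_` records how far apart they were at that merge.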


In [None]:
### your code here

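Using the method below, plot the dendrogram. What do you observe? Note that the helper reads `model.distances_`, which scikit-learn only computes when the model is created with `compute_distances=True` or with a `distance_threshold`.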

In [None]:
def plot_dendrogram(model, **kwargs):
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
    dendrogram(linkage_matrix, **kwargs)


### your code here

# 4. Segmentation Example

For the remainder of the lab, we will use the `ifood_df.csv` dataset, which contains data on 2206 customers of company XYZ, covering:

*    Customer profiles
*    Product preferences
*    Campaign successes/failures
*    Channel performance

First, upload the CSV file to the Colab session and load it as a DataFrame.

*Note: files uploaded this way are ephemeral and must be re-uploaded when the runtime is reset.*

In [None]:
from google.colab import files
files.upload()
### load in a data frame


Some transformations are required:

*    yearbirth -> customer_age: age in years, `date.today().year - df['Year_Birth']` (requires `from datetime import date`)
*    dtCustomer -> customer_days: days since the customer's enrollment, `(pd.to_datetime("now") - pd.to_datetime(df['Dt_Customer'])) // np.timedelta64(1,'D')`
*    Total amount -> `df['MntTotal'] = df.loc[:,['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts']].sum(axis=1)`
*    Regular products amount -> `df['MntRegularProds'] = df.loc[:,'MntTotal'] - df.loc[:,'MntGoldProds']`
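
The sketch below applies these transformations to a two-row toy DataFrame so that the results can be checked by hand. The values are made up for illustration, and `.dt.days` is used as an alternative spelling of the floor division by one day.

```python
from datetime import date

import pandas as pd

# Hypothetical toy rows using the column names from the list above
toy = pd.DataFrame({
    "Year_Birth": [1970, 1985],
    "Dt_Customer": ["2013-05-01", "2014-01-15"],
    "MntWines": [100, 20],
    "MntFruits": [10, 5],
    "MntMeatProducts": [50, 10],
    "MntFishProducts": [5, 0],
    "MntSweetProducts": [5, 5],
    "MntGoldProds": [30, 10],
})

# Age in years and days since enrollment
toy["customer_age"] = date.today().year - toy["Year_Birth"]
toy["customer_days"] = (pd.Timestamp.now() - pd.to_datetime(toy["Dt_Customer"])).dt.days

# Spending totals
amount_cols = ["MntWines", "MntFruits", "MntMeatProducts",
               "MntFishProducts", "MntSweetProducts"]
toy["MntTotal"] = toy[amount_cols].sum(axis=1)
toy["MntRegularProds"] = toy["MntTotal"] - toy["MntGoldProds"]

print(toy[["customer_age", "customer_days", "MntTotal", "MntRegularProds"]])
```

For the first row, `MntTotal` is 100 + 10 + 50 + 5 + 5 = 170 and `MntRegularProds` is 170 - 30 = 140.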



We will only use the following columns:
`['Income', 'Age', 'Recency', 'MntWines',
       'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntRegularProds','MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'MntTotal'
       ]`

In [None]:
### your code here

We will train a K-means clustering model.

Using the elbow method, plot the elbow curve and find the optimal value of $K$.

Train a model with that value of $K$.

In [None]:
### elbow curve


### train model for optimal K
k = 0

Add a column to the DataFrame named `cluster` and store the allocated cluster.
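
As a hedged sketch of the mechanics, the example below clusters a hypothetical six-row DataFrame and stores the fitted model's `labels_` in a `cluster` column. The toy values, the choice of two clusters, and the standardization step are assumptions for illustration; standardization is carried over from the iris part of the lab so that `Income` does not dominate the distances.

```python
import pandas as pd
from sklearn import cluster, preprocessing

# Hypothetical toy data: two obvious spending groups
toy = pd.DataFrame({
    "Income": [20000, 22000, 21000, 80000, 82000, 81000],
    "MntTotal": [50, 60, 55, 900, 950, 920],
})

# Standardize the features before computing distances (assumption, see above)
features = preprocessing.StandardScaler().fit_transform(toy)

model = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
toy["cluster"] = model.labels_

# Per-cluster averages give a first description of each segment
print(toy.groupby("cluster").mean())
```

The cluster ids themselves are arbitrary labels; only the grouping they induce is meaningful.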

In [None]:
### your code here

Using matplotlib and [boxplot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html), visualize the clusters along each dimension.

In [None]:
### your code here

Now looking at the clusters along each dimension:
* Are all dimensions equally important in separating the customers into clusters?
* Write a small paragraph describing each customer profile.


Answer: