# Lesson 3: Clustering 


### Review

**Question** What is clustering?

**Answer** Clustering is a machine learning task in which data is categorized into groups where items in each group are similar.

**Question** Do we have correct labels available for clustering?

**Answer** No, we do not have labels available for clustering data. Our goal is just to group datapoints based on their similarity, without access to the correct group number for each datapoint.

Let's review the example we saw last time. The code below reads our data points from a file and then creates a scatterplot.

In [None]:
%matplotlib inline
# import necessary packages
import matplotlib.pyplot as plt 
import pandas
import numpy as np

# Read the data, and put it into a variable. 
data = pandas.read_csv("../assets/clustering-synthetic.csv") 

# show the first five data points
# in the data, "r" stands for red, and "b" stands for blue
print(data.head()) 

In [None]:
# Create a scatter plot
plt.scatter(x=data["x"], y=data["y"])
# Function to draw a circle around a group of points.
def encircle2(x, y, l, ax=None, **kw):
    if ax is None:
        ax = plt.gca()
    p = np.c_[x, y]
    # The centroid is the mean of the points
    mean = np.mean(p, axis=0)
    d = p - mean
    # Radius: distance from the centroid to the farthest point
    r = np.max(np.sqrt(d[:, 0]**2 + d[:, 1]**2))
    circ = plt.Circle(mean, radius=1.25*r, **kw)
    # Label the centroid and mark it in red
    ax.annotate(l, xy=mean, fontsize=15)
    xmean, ymean = mean
    ax.scatter(xmean, ymean, c='red')
    ax.add_patch(circ)

# Logic to define the clusters for the synthetic data
# (points with x exactly equal to 5 are placed in the second group)
idx1 = np.where(data["x"] > 5)
idx2 = np.where(data["x"] <= 5)
x = np.array(data["x"])
y = np.array(data["y"])

# plot the clusters
plt.scatter(x, y)
encircle2(x[idx1], y[idx1], "A", ec="k", fc="gold", alpha=0.2)
encircle2(x[idx2], y[idx2], "B", ec="k", fc="blue", alpha=0.2)
plt.gca().relim()
plt.gca().autoscale_view()
plt.show()

### K-Means Clustering

K-means clustering is a commonly used method to find clusters within data. The method accepts an argument **k**, which specifies how many clusters we want to find in the data.

The method returns the centroid (center) of each of the k clusters. Each centroid is the mean of the points in its cluster, which is why the method is called K-Means.

K-means computes the distance from each datapoint to each of the centroids. Each point is assigned to the cluster whose centroid is closest to that point.
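
The assignment rule above, together with recomputing each centroid as the mean of its points, can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used by scikit-learn: the function name `kmeans_sketch`, the toy points, and the fixed number of iterations are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Toy k-means: alternate between assigning points and updating centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Assignment step: index of the nearest centroid for each point
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                       [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

The first three points end up in one cluster and the last three in the other. Which of the two clusters is numbered 0 depends on the random starting centroids.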


In [None]:
# plot the clusters
plt.scatter(x, y)
encircle2(x[idx1], y[idx1], "A", ec="k", fc="gold", alpha=0.2)
encircle2(x[idx2], y[idx2], "B", ec="k", fc="blue", alpha=0.2)
plt.scatter(4.9, 11, marker = 'x', c = "black")
plt.gca().relim()
plt.gca().autoscale_view()
plt.show()

### Activity 1

Let's suppose there are two clusters and their centroids are points A and B, as shown above.

**Question** - Which centroid is closest to the new point marked 'x' in the plot above? Which cluster does this new point belong to?

**Answer** - Centroid B is closest to the new point, so the point belongs to the blue cluster.

### Clustering example: Handwritten Digits
In the figures below, we can see clusters of handwritten digits. Figure 1 (a) shows a single cluster for the handwritten digit 8, and Figure 1 (b) shows the clusters for handwritten digits 0 - 9.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import rcParams
from PIL import Image

# figure size in inches (optional)
rcParams['figure.figsize'] = 20, 20

# read images
img_A = Image.open('../assets/mnist_8.png')
img_B = Image.open('../assets/mnist_all.png') 
img_B = img_B.resize(img_A.size)
# display images
fig, ax = plt.subplots(1,2)


ax[0].imshow(img_A);
ax[0].axis('off')
ax[0].set_title("Figure 1(a): Single cluster of handwritten digit 8", fontsize = 20)
ax[1].imshow(img_B);
ax[1].axis('off')
a = ax[1].set_title("Figure 1(b): Clusters of handwritten digits 0 - 9", fontsize = 20)


### Activity 2

**Question:** In Figure 1 (a), why are images of the digit '3' part of the cluster for '8'?

**Answer:** The shapes of the digits 3 and 8 are very similar, so some of the 3-shaped digits are included in the cluster containing the digit '8'.

**Question:** In Figure 1 (b), why is the distance between the clusters of 4 and 7 smaller than the distance between the clusters of 7 and 0?

**Answer:** The shape of the digit 7 is more similar to the digit 4 than to the digit 0, so the clusters of 4 and 7 are closer together than the clusters of 7 and 0.


# Clustering for World Happiness Dataset

In [None]:
# Load World Happiness Data
import pandas as pd
df = pd.read_csv("../assets/happinessDataset/2015.csv")

# View first five rows of the dataset
df.head()


# Data Normalization For Clustering

### Importance of data normalization
For clustering, it is recommended to rescale the dataset so that all the columns have values in a comparable range, for example by scaling each column to values between 0 and 1 or, as in this lesson, by standardizing each column to mean 0 and standard deviation 1. For example, suppose a medical dataset contains the heights of people, ranging between 100 and 200 cm, and their blood haemoglobin levels, ranging between 4 and 10 g/dL.

If we cluster people into groups based on height and haemoglobin level using the original data (without rescaling), height will have more say in the distance calculation than haemoglobin, because its values span a larger range. People with similar heights will then be grouped in the same cluster even when their haemoglobin levels are quite different, because the small haemoglobin differences are effectively ignored.
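
To make the height and haemoglobin argument concrete, here is a minimal sketch with four made-up people (the numbers are hypothetical, not taken from a real dataset). It compares Euclidean distances before and after standardizing each column with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: columns are height (cm) and haemoglobin (g/dL)
people = np.array([[150.0, 5.0],
                   [152.0, 9.5],   # similar height to person 0, different haemoglobin
                   [165.0, 5.2],   # similar haemoglobin to person 0, different height
                   [190.0, 9.0]])

def dist(a, b):
    # Euclidean distance between two rows
    return float(np.linalg.norm(a - b))

# Distances from person 0 on the raw values
raw_01 = dist(people[0], people[1])
raw_02 = dist(people[0], people[2])

# Distances from person 0 after standardizing each column
scaled = StandardScaler().fit_transform(people)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])

print("raw:   ", raw_01, raw_02)
print("scaled:", scaled_01, scaled_02)
```

On the raw values, person 0 is closer to person 1 (similar height) than to person 2. After standardization the order reverses, because the haemoglobin difference now carries as much weight as the height difference.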

In [None]:
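# Normalize the dataset so that all columns are on a comparable scale

# import the module which standardizes the data by subtracting the mean
# and dividing by the standard deviation of each column
# (the result has mean 0 and standard deviation 1, not values between 0 and 1)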
from sklearn.preprocessing import StandardScaler

# Print columns and their data types

df.dtypes

### Activity 3
**Question:** Which columns in world happiness data have non-numerical values (like strings)?

**Answer:** Country and Region

**Question:** Do we need to normalize columns with non-numerical values?

**Answer:** No, we don't need to normalize columns with values as strings (e.g. Country).
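
As a small sketch of how the numerical and non-numerical columns could be separated programmatically, pandas offers `select_dtypes`. The miniature table below is hypothetical and only mimics the kinds of columns found in the happiness data.

```python
import pandas as pd

# Hypothetical miniature table with the same kinds of columns as the happiness data
toy_df = pd.DataFrame({
    "Country": ["A-land", "B-land", "C-land"],
    "Region": ["North", "South", "North"],
    "Happiness Score": [7.5, 5.1, 3.9],
    "Freedom": [0.66, 0.41, 0.23],
})

# Columns that hold numbers (candidates for normalization)
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
# Columns that hold anything else, such as strings
text_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```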


In [None]:
# See the range of values of different columns for world happiness data
df.describe()

In [None]:

# Select relevant numerical columns and normalize them using standard scaler
data = df[["Happiness Score", 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual']]
ss = StandardScaler()

# Fit standard scaler to the data
clustering_data = ss.fit_transform(data)

# View normalized dataset 
clustering_data = pd.DataFrame(clustering_data, columns= data.columns)
clustering_data.head(5)



### Undo Normalization

You can uncomment the code below for the last activity in this lesson if you want to see the effect of clustering the unnormalized dataset.

In [None]:
# Remove normalization by uncommenting the code below
# (a copy is used so that the original data is not modified by later cells)
# clustering_data = data.copy()

In [None]:
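# K-Means Clustering on World Happiness Data 
from sklearn.cluster import KMeans

def doKmeans(X, nclust=2):
    # Fit k-means with nclust clusters and return the cluster label of
    # each row together with the cluster centroids
    model = KMeans(n_clusters=nclust)
    clust_labels = model.fit_predict(X)
    cent = model.cluster_centers_
    return (clust_labels, cent)

# Cluster the data into two groups. Label columns left over from an
# earlier run are dropped first so that this cell can be re-run safely.
features = clustering_data.drop(columns=['kmeans', 'Country'], errors='ignore')
clust_labels, cent = doKmeans(features, nclust = 2)
kmeans = pd.DataFrame(clust_labels)
clustering_data['kmeans'] = clust_labels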

In [None]:
# Plot the clusters obtained using k means using scatter plot 
# between happiness score and economy GDP
fig = plt.figure(figsize= (10,8))
ax = fig.add_subplot(111)
scatter = ax.scatter(clustering_data['Happiness Score'],clustering_data['Economy (GDP per Capita)'],
                     c=kmeans[0],s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('Happiness Score')
ax.set_ylabel('Economy (GDP per Capita)')
cbar = plt.colorbar(scatter)
cbar.set_label("Cluster Group")

### Activity 4
**Question:** Which cluster represents the higher happiness score? (Yellow or Purple)

**Answer:** Purple

**Question:** Which cluster represents the lower GDP per capita? (Yellow or Purple)

**Answer:** Yellow
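
K-means numbers its clusters arbitrarily, so the colors in the plot can be swapped from one run to the next. One way to check which cluster is which is to average each column per cluster label. The sketch below uses made-up values standing in for two columns of the happiness data and a `kmeans` label column. None of the numbers come from the real dataset.

```python
import pandas as pd

# Hypothetical values standing in for two happiness columns plus cluster labels
toy = pd.DataFrame({
    "Happiness Score": [7.4, 7.1, 6.9, 4.2, 3.8, 4.5],
    "Economy (GDP per Capita)": [1.40, 1.30, 1.25, 0.40, 0.30, 0.55],
    "kmeans": [1, 1, 1, 0, 0, 0],
})

# Average of every column within each cluster
cluster_means = toy.groupby("kmeans").mean()
print(cluster_means)

# Cluster label with the highest average happiness score
happiest_cluster = cluster_means["Happiness Score"].idxmax()
# Cluster label with the lowest average GDP per capita
lowest_gdp_cluster = cluster_means["Economy (GDP per Capita)"].idxmin()
print(happiest_cluster, lowest_gdp_cluster)
```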



In [None]:
# Visualize clusters on geographical map

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

# Enable plotly rendering inside the notebook
init_notebook_mode(connected=True)

# Work on a copy so that country names are not added to the clustering data
wh1 = clustering_data.copy()
if 'Country' not in wh1.columns:
    wh1.insert(0, 'Country', df['Country'])
# A separate name keeps the "data" DataFrame defined earlier intact
map_data = [dict(type='choropleth',
                 locations = wh1['Country'],
                 locationmode = 'country names',
                 z = wh1['kmeans'],
                 text = wh1['Country'],
                 colorbar = {'title':'Cluster Group'})]
layout = dict(title='Clustering of Countries based on K-Means',
              geo=dict(showframe = False,
                       projection = {'type':'mercator'}))
map1 = go.Figure(data = map_data, layout=layout)
iplot(map1)

### Activity 5: Change number of clusters
Change the number of clusters to three and generate the clustering graphs again.
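
In this notebook the number of clusters is set through the `nclust` argument of `doKmeans`. As a standalone sketch of what asking for three clusters looks like, the example below runs scikit-learn's `KMeans` on hypothetical synthetic points. The data, the seed, and the `n_init` value are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three tight, well-separated groups of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Ask k-means for three clusters instead of two
toy_model = KMeans(n_clusters=3, n_init=10, random_state=0)
toy_labels3 = toy_model.fit_predict(toy_X)

print(np.unique(toy_labels3))            # the distinct cluster labels
print(toy_model.cluster_centers_.shape)  # one centroid per cluster
```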


### Activity 6: Data Normalization and Clustering 
Uncomment the line in the "Undo Normalization" cell, rerun the cells that follow it, and see the effect of removing data normalization on the clustering results.