# Lesson 3: Clustering 

Clustering is an unsupervised machine learning task in which data points are grouped so that items within a group are more similar to each other than to items in other groups.

The figures below show clusters of handwritten digits. Figure 1(a) shows a single cluster for the handwritten digit 8, and Figure 1(b) shows the clusters for handwritten digits 0 - 9.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import rcParams
from PIL import Image

%matplotlib inline

# figure size in inches optional
rcParams['figure.figsize'] = 20,20

# read images
img_A = Image.open('../assets/mnist_8.png')
img_B = Image.open('../assets/mnist_all.png') 
img_B = img_B.resize(img_A.size)
# display images side by side
fig, ax = plt.subplots(1,2)

ax[0].imshow(img_A)
ax[0].axis('off')
ax[0].set_title("Figure 1(a): Single cluster of handwritten digit 8", fontsize = 20)
ax[1].imshow(img_B)
ax[1].axis('off')
ax[1].set_title("Figure 1(b): Clusters of handwritten digits 0 - 9", fontsize = 20)
plt.show()


# Activity 1

Question 1 - In Figure 1(a), why are images of the digit '3' part of the cluster for '8'?


Question 2 - In Figure 1(b), why is the distance between the clusters for 4 and 7 smaller than the distance between the clusters for 7 and 0?





# World Happiness Dataset


The World Happiness dataset covers 155 countries, ranked by their happiness levels. The happiness scores are based on answers to the main life evaluation question asked in a poll. The columns that follow the happiness score estimate the extent to which each of six factors (economic production, social support, life expectancy, freedom, absence of corruption, and generosity) contributes to making life evaluations higher in each country than they are in Dystopia, a hypothetical country whose values equal the world's lowest national averages for each of the six factors. These factor estimates help explain why some countries rank higher than others.


In this lesson, we will do the following tasks:
- Load the World Happiness data.
- Observe the available columns and their data types.
- Visualize the dataset to understand how different countries rank relative to each other.
- Learn about the K-Means clustering algorithm, which is used to cluster data into similar groups.
- Use K-Means clustering to form clusters of countries with high, medium and low happiness scores.
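
Before using the library version, it may help to see the idea behind K-Means in a few lines. The sketch below is only an illustration, assuming plain NumPy and made-up toy data; the function name `kmeans_sketch` and its parameters are hypothetical, and this is not the scikit-learn implementation. It alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Toy data: two well-separated blobs (made up for illustration).
rng = np.random.default_rng(42)
toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans_sketch(toy, k=2)
print(centroids)
```

The cluster numbers themselves are arbitrary: which blob is called 0 and which is called 1 depends on the random starting points.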



In [None]:
# Load World Happiness Data
import pandas as pd
df = pd.read_csv("../assets/happinessDataset/2015.csv")


In [None]:
# View first five rows of the dataset
df.head()

In [None]:
# Observe the statistics of Columns
df.describe()

In [None]:
# Observe the data types of columns
df.dtypes

In [None]:
# Visualize the distribution of the happiness scores
plt.figure(figsize=(10,6))
n_bins = 10  # number of histogram bins
plt.hist(df["Happiness Score"], bins=n_bins, label='2015', alpha=0.3, color='red')
plt.xlabel('Happiness Score', size=13)
plt.ylabel('No. of countries', size=13)
plt.legend(loc='upper right')
plt.title("Distribution of Happiness scores", size=16)
plt.show()


# Questions related to distribution of the happiness score 

Question 1 - How many countries have a happiness score of about 7?
- Less than 5
- Less than 15
- 10 - 15
- Greater than 20

In [None]:
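# Average happiness score for each region
import seaborn as sns  # needed here; the notebook has not imported seaborn before this cell
reg = pd.DataFrame(df.groupby(['Region'])["Happiness Score"].mean())
plt.figure(figsize=(10,7))
plt.title('Happiness Scores across different regions')
sns.barplot(x='Happiness Score',y=reg.index,data=reg,palette='mako')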

# Questions related to happiness score across different regions 

Question 1 - Which region has the highest average happiness score?
- Western Europe
- Southern Asia
- North America
- Middle East and Northern Africa

In [None]:
# Correlation between happiness score and economy, family, health, and freedom.
import seaborn as sns
df_copy = df.copy()
df_copy.drop(['Country', 'Region', 'Happiness Rank', 'Standard Error', 'Generosity', 'Dystopia Residual', 'Trust (Government Corruption)'], axis=1,inplace=True)
c2 = df_copy.corr(method = "pearson")
plt.figure(figsize=(10,6))
sns.heatmap(c2,annot=True)
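
Each cell of the heatmap is a Pearson correlation coefficient. As a rough illustration of what `corr(method="pearson")` computes, the sketch below works it out by hand on a small made-up table and compares the result with pandas. The column names and values are invented for the example and are not taken from the happiness dataset; `pearson` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up values, for illustration only.
toy = pd.DataFrame({
    "score": [7.5, 6.8, 5.9, 5.1, 4.2, 3.4],
    "gdp":   [1.40, 1.30, 1.00, 0.90, 0.50, 0.30],
    "noise": [0.3, 0.9, 0.1, 0.8, 0.2, 0.7],
})

r_manual = pearson(toy["score"], toy["gdp"])
r_pandas = toy.corr(method="pearson").loc["score", "gdp"]
print(r_manual, r_pandas)
```

A value near 1 means the two columns rise together almost linearly, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.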

# Questions related to correlation with the happiness score

Question 1 - Which factor has the strongest correlation with the happiness score?
- Economy(GDP per Capita)
- Family
- Health (Life Expectancy)
- Freedom

In [None]:
df.columns

In [None]:
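# Standardize the features used for K-Means clustering
from sklearn.preprocessing import StandardScaler
# .copy() gives an independent DataFrame, so adding a column later does not affect df
clustering_data = df[["Happiness Score", 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual']].copy()
ss = StandardScaler()
# fit_transform returns a new array and leaves clustering_data unchanged, so keep the result
scaled_data = ss.fit_transform(clustering_data)



In [None]:
# K-Means Clustering on World Happiness Data 
from sklearn.cluster import KMeans
def doKmeans(X, nclust=2):
    # A fixed random_state and an explicit n_init make the cluster labels reproducible
    model = KMeans(n_clusters=nclust, n_init=10, random_state=0)
    clust_labels = model.fit_predict(X)
    cent = model.cluster_centers_
    return (clust_labels, cent)

# Cluster on the standardized features; cent is therefore in standardized units
clust_labels, cent = doKmeans(scaled_data, 3)
kmeans = pd.DataFrame(clust_labels, index=clustering_data.index)
clustering_data['kmeans'] = clust_labels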
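
K-Means relies on Euclidean distance, so a column with a much larger numeric range can dominate the result. That is the motivation for `StandardScaler`. The sketch below uses a small made-up array (not the happiness data) to show that the scaler returns a new array whose columns each have mean 0 and standard deviation 1, while the input is left unchanged.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second column has a far larger range than the first.
toy = np.array([
    [7.5, 60000.0],
    [6.0, 45000.0],
    [4.5, 12000.0],
    [3.0,  2000.0],
])

scaled = StandardScaler().fit_transform(toy)  # returns a new array

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)
```

Because the result is returned rather than written back into the input, it has to be stored in a variable and passed on to the clustering step.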

In [None]:
#Plot the clusters obtained using k means
fig = plt.figure(figsize= (10,8))
ax = fig.add_subplot(111)
scatter = ax.scatter(clustering_data['Happiness Score'],clustering_data['Economy (GDP per Capita)'],
                     c=kmeans[0],s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('Happiness Score')
ax.set_ylabel('Economy (GDP per Capita)')
cbar = plt.colorbar(scatter)
cbar.set_label("Cluster Group")

# Questions related to Clustering

Question 1 - Which cluster represents the highest happiness scores: 0, 1 or 2?

# Activity 
Change the number of clusters (the code above uses three; try two or four, for example) and generate the clustering graphs again.
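
When experimenting with the number of clusters, one common heuristic is to compare the inertia (the sum of squared distances from each point to its centroid) for several values of k and look for the point where the improvement levels off. The sketch below is an illustration on made-up toy data with three blobs, not on the happiness dataset; the `inertia_` attribute is part of scikit-learn's `KMeans`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated blobs (made up for illustration).
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.4, size=(40, 2)),
    rng.normal(loc=(4, 0), scale=0.4, size=(40, 2)),
    rng.normal(loc=(2, 4), scale=0.4, size=(40, 2)),
])

# Fit K-Means for several values of k and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this, the inertia falls sharply up to the true number of blobs and only slowly afterwards. Real datasets often show a less obvious bend, so treat the result as a hint rather than a rule.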



In [None]:
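# Show the cluster of each country on a world map
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# Work on a copy so that clustering_data keeps only the clustering columns
wh1 = clustering_data.copy()
if 'Country' not in wh1.columns:
    # Country names come from the original DataFrame, which shares the same row index
    wh1.insert(0,'Country',df['Country'])
data = [dict(type='choropleth',
             locations = wh1['Country'],
             locationmode = 'country names',
             z = wh1['kmeans'],
             text = wh1['Country'],
             colorbar = {'title':'Cluster Group'})]
layout = dict(title='Clustering of Countries based on K-Means',
              geo=dict(showframe = False,
                       projection = {'type':'mercator'}))
map1 = go.Figure(data = data, layout=layout)
iplot(map1)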