In [None]:
import pandas as pd
import numpy as np
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz as gp

# Part 1: K-means on the happiness index

We use the 2019 happiness index dataset, available here: https://www.kaggle.com/unsdsn/world-happiness
We have removed the columns giving the ranking and the score of each country and kept only the bare-bones indicators. The goal is to group the countries into clusters with similar attributes.

In [None]:
happiness=pd.read_csv("countries_indicators.csv")
happiness

In [None]:
from sklearn.cluster import KMeans

### A. Introduction to K-means

We start with a basic version of K-means to get used to the set-up in Python. We set the number of clusters to 3 and do no pre-processing. We drop the country or region column as well as the country code, since K-means works only on numerical data.


In [None]:
happiness_quant=happiness.drop(columns=["Country or region","Code"])

1. Run the code below to run K-means

In [None]:
kmeans = KMeans(n_clusters=3).fit(happiness_quant)

2. Run the two snippets of code below. What do you think they are giving us?

In [None]:
kmeans.labels_

In [None]:
kmeans.cluster_centers_

3. Run the code below to plot the three clusters on a world map, then take a look at the result.

In [None]:
labels=kmeans.labels_
df=pd.DataFrame(labels,columns=['Cluster'])
df["Code"]=happiness.Code

from pygal.maps.world import World
from IPython.display import SVG, display

wm = World()
wm.force_uri_protocol = 'http'

cluster0=pd.Series.to_numpy(df[df["Cluster"]==0]["Code"])
cluster1=pd.Series.to_numpy(df[df["Cluster"]==1]["Code"])
cluster2=pd.Series.to_numpy(df[df["Cluster"]==2]["Code"])

wm.add('Cluster 0', cluster0)
wm.add('Cluster 1',cluster1)
wm.add('Cluster 2',cluster2)
display(SVG(wm.render()))

### B. Scaling the data

We now preprocess the data by scaling it.
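
As a rough illustration of what scaling does, the sketch below standardizes a small array column by column (subtract the mean, divide by the standard deviation) and compares the result with `preprocessing.scale`. The toy values are made up for illustration and are not taken from the happiness dataset.

```python
import numpy as np
from sklearn import preprocessing

# Toy data (made up): two features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

# Standardize by hand: zero mean and unit (population) standard deviation per column.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# preprocessing.scale performs the same column-wise standardization.
X_scaled = preprocessing.scale(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns contribute on a comparable footing to the Euclidean distances that K-means relies on.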

1. Run the code below. In terms of absolute values, which features dominate?

In [None]:
happiness.hist(figsize=[5,8])
plt.show()

2. Rerun the clustering with the scaling code below. Which countries have changed clusters? Can you explain why?

In [None]:
from sklearn import preprocessing

happiness_quant=preprocessing.scale(happiness_quant)
kmeans = KMeans(n_clusters=3).fit(happiness_quant)

#plotting
labels=kmeans.labels_
df=pd.DataFrame(labels,columns=['Cluster'])
df["Code"]=happiness.Code

wm = World()
wm.force_uri_protocol = 'http'

cluster0=pd.Series.to_numpy(df[df["Cluster"]==0]["Code"])
cluster1=pd.Series.to_numpy(df[df["Cluster"]==1]["Code"])
cluster2=pd.Series.to_numpy(df[df["Cluster"]==2]["Code"])

wm.add('Cluster 0', cluster0)
wm.add('Cluster 1',cluster1)
wm.add('Cluster 2',cluster2)
display(SVG(wm.render()))

In [None]:
happiness.loc[happiness['Country or region'] == "Saudi Arabia"]

In [None]:
happiness.loc[happiness['Country or region'] == "United Kingdom"]

### C. Impact of random initialization

1. Run the code below twice, saving the map under a different name each time (the output maps are written to the same folder as the source file). Are the two maps the same?

In [None]:
happiness_quant=preprocessing.scale(happiness_quant)
kmeans = KMeans(n_clusters=4,n_init=1).fit(happiness_quant)

#plotting
labels=kmeans.labels_
df=pd.DataFrame(labels,columns=['Cluster'])
df["Code"]=happiness.Code

wm = World()
wm.force_uri_protocol = 'http'

cluster0=pd.Series.to_numpy(df[df["Cluster"]==0]["Code"])
cluster1=pd.Series.to_numpy(df[df["Cluster"]==1]["Code"])
cluster2=pd.Series.to_numpy(df[df["Cluster"]==2]["Code"])
cluster3=pd.Series.to_numpy(df[df["Cluster"]==3]["Code"])

wm.add('Cluster 0', cluster0)
wm.add('Cluster 1',cluster1)
wm.add('Cluster 2',cluster2)
wm.add('Cluster 3',cluster3)
# change the file name between the two runs so that the first map is not overwritten
wm.render_to_file('map2.svg')

2. The code above is exactly the same in both cases. What leads to the differences observed?
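
If reproducible runs are wanted, one option is to fix the `random_state` argument of `KMeans`, which seeds the choice of initial centroids. The sketch below uses made-up random points (an assumption, not the happiness dataset) and checks that two fits with the same seed return the same labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: 60 random points in 2D (not the happiness dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))

# Same seed and a single initialization: both runs start from the same centroids.
labels_a = KMeans(n_clusters=4, n_init=1, random_state=42).fit(X).labels_
labels_b = KMeans(n_clusters=4, n_init=1, random_state=42).fit(X).labels_

print((labels_a == labels_b).all())
```

The seed value 42 is arbitrary; leaving `random_state` unset lets each run draw a fresh initialization.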

### D. Choosing the right K

1. Run the code below. What are we getting?

In [None]:
inertia_K=[]
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(happiness_quant)
    inertia_K.append(kmeanModel.inertia_)

2. Plot inertia_K as a function of K using plt.plot. What value would you choose for K?

In [None]:
plt.plot(K,inertia_K)
plt.show()
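
For reference, scikit-learn defines `inertia_` as the sum of squared distances from each sample to its closest cluster center. The sketch below recomputes that quantity by hand on made-up data (an assumption, not the happiness dataset) so it can be compared with the attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: 50 random points with 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the center assigned to each point, then sum the squared
# Euclidean distances between points and their centers.
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, model.inertia_)
```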

# Part 2: Hierarchical clustering on the happiness index

Exceptionally, we use scipy rather than scikit-learn, as scikit-learn does not provide a simple way to draw dendrograms.


In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

### A. Without scaling

We start off by importing the dataset once again.

In [None]:
happiness=pd.read_csv("countries_indicators.csv")
happiness_quant=happiness.drop(columns=["Country or region","Code"])

1. Draw a dendrogram with "average" linkage. How many clusters do you think we should have?

In [None]:
Z = linkage(happiness_quant,method='average')

dendrogram(Z)
plt.show()

2. Draw another dendrogram, this time with "complete" linkage. How many clusters do you think we should have now?

In [None]:
Z = linkage(happiness_quant,method='complete')

dendrogram(Z)
plt.show()

3. Use the code below to find the cluster assignment for complete linkage that gives the most balanced graph.

In [None]:
# returns a flat clustering with at most 3 clusters;
# a maximum distance can be specified instead with criterion='distance'
labels=fcluster(Z, 3, criterion='maxclust')
labels
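
`fcluster` can also cut the tree at a maximum distance rather than at a fixed number of clusters. The sketch below shows both options on made-up one-dimensional data (an assumption chosen so that the groups are obvious); the threshold 1.0 is arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 1D data forming three well-separated groups.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [10.0]])

Z = linkage(X, method='complete')

# Option 1: ask for at most 3 clusters.
labels_k = fcluster(Z, 3, criterion='maxclust')

# Option 2: cut the dendrogram at height 1.0, so that merges made at a
# larger complete-linkage distance are undone.
labels_d = fcluster(Z, 1.0, criterion='distance')

print(labels_k)
print(labels_d)
```

On this toy data both calls recover the same three groups; on real data the two criteria generally need to be tuned separately by reading the dendrogram.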

4. Use the code below to obtain the world map for this clustering. What do you think of the quality of the clusters?

In [None]:
df=pd.DataFrame(labels,columns=['Cluster'])
df["Code"]=happiness.Code

from pygal.maps.world import World
from IPython.display import SVG, display
wm = World()
wm.force_uri_protocol = 'http'

cluster1=pd.Series.to_numpy(df[df["Cluster"]==1]["Code"])
cluster2=pd.Series.to_numpy(df[df["Cluster"]==2]["Code"])
cluster3=pd.Series.to_numpy(df[df["Cluster"]==3]["Code"])

wm.add('Cluster 1',cluster1)
wm.add('Cluster 2',cluster2)
wm.add('Cluster 3', cluster3)

display(SVG(wm.render()))

### B. Using scaling

We now scale the dataset and repeat the steps above.

In [None]:
happiness=pd.read_csv("countries_indicators.csv")
happiness_quant=happiness.drop(columns=["Country or region","Code"])

from sklearn import preprocessing
happiness_quant=preprocessing.scale(happiness_quant)

1. Draw a dendrogram in the case where the linkage is average and when it is complete. How many clusters do you think we should have for both cases? Which type of linkage would you prefer to use?

In [None]:
Z = linkage(happiness_quant,method='average')

dendrogram(Z)
plt.show()

In [None]:
Z = linkage(happiness_quant,method='complete')

dendrogram(Z)
plt.show()

2. Use the code below to find the cluster assignment for complete linkage that gives the most balanced graph.

In [None]:
labels=fcluster(Z, 3, criterion='maxclust')
labels

3. Use the code below to obtain the world map for this clustering. How does the quality of the clusters compare with K-means and with hierarchical clustering without scaling?

In [None]:
df=pd.DataFrame(labels,columns=['Cluster'])
df["Code"]=happiness.Code

from pygal.maps.world import World
wm = World()
wm.force_uri_protocol = 'http'

cluster1=pd.Series.to_numpy(df[df["Cluster"]==1]["Code"])
cluster2=pd.Series.to_numpy(df[df["Cluster"]==2]["Code"])
cluster3=pd.Series.to_numpy(df[df["Cluster"]==3]["Code"])

wm.add('Cluster 1',cluster1)
wm.add('Cluster 2',cluster2)
wm.add('Cluster 3', cluster3)

display(SVG(wm.render()))

# Part 3: the Daily Kos dataset (optional)

For this exercise, we consider a dataset called the `dailykos` dataset. It contains data on 3,430 news articles or blog posts published on *Daily Kos*, an American political blog that publishes news and opinion articles written from a progressive point of view. These articles were posted in 2004, leading up to the United States presidential election. The leading candidates were incumbent President George W. Bush (Republican) and John Kerry (Democrat). Foreign policy was a dominant topic of the election, specifically the 2003 invasion of Iraq. Our goal is to cluster the articles that appear in the dataset.

Note: Each observation is a news article (3,430 in total) and each feature is a word that appears in at least 50 of these articles (1,545 words in total). Each value is the number of times the given word appears in the article.

In [None]:
dailykos=pd.read_csv("dailykos.csv")
dailykos

### A. Hierarchical clustering

1. Do we need to scale the dataset here? Why/why not?

2. We use `scipy` to obtain the dendrogram for this section. We are going to use `method="ward"` here as it gives rise to the best results. Generate the dendrogram for this dataset (note: due to the number of words and articles, this may take a while). In light of the application and of the dendrogram, how many clusters would you pick?

3. Generate the labels of each datapoint. Which one is the largest cluster? Which one is the smallest?

4. Add a new column to dailykos which contains the labels. Then, filter the dataset based on these labels: for example, restrict yourselves to those rows which correspond to `labels==1`. For those rows, take a look at the 5 words that appear most often on average. Are there some clusters that stand out in terms of topic? How many observations are there in each one? The code for the first label is provided:
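
A minimal sketch of this step for the first label, shown on a made-up word-count table rather than the real `dailykos` data. It assumes the cluster assignments are stored in a column named `label` (a name chosen here for illustration) and that every other column is a word count.

```python
import pandas as pd

# Made-up word counts for 4 articles and 6 words (not the real dataset).
toy = pd.DataFrame({
    "iraq":  [5, 4, 0, 0],
    "war":   [3, 2, 0, 1],
    "bush":  [2, 2, 1, 0],
    "poll":  [0, 0, 4, 5],
    "kerry": [1, 0, 3, 4],
    "dean":  [0, 2, 2, 0],
})

# Hypothetical cluster assignments, e.g. the output of fcluster.
toy["label"] = [1, 1, 2, 2]

# Restrict to cluster 1, drop the label column, and average each word's count.
cluster1 = toy[toy["label"] == 1].drop(columns="label")
top_words = cluster1.mean().sort_values(ascending=False).head(5)

print(top_words)
print(len(cluster1))  # number of articles in this cluster
```

The same pattern can be repeated for the other labels, or wrapped in a loop over the distinct values of the `label` column.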

### B. K-means clustering

We now give K-means clustering a try. We keep the same number (7) of clusters.

In [None]:
dailykos=pd.read_csv("dailykos.csv")

1. Using the code introduced before, run a K-means algorithm on dailykos.

In [None]:
kmeans = KMeans(n_clusters=7).fit(dailykos)

2. Add a "label" column to dailykos again. Note that K-means numbers its clusters 0-6, whereas hierarchical clustering gave us clusters 1-7.

3. Using a similar method to the one above, obtain the top 5 words for each cluster. For the clusters to which you had assigned a topic, can you find them again among the K-means clusters? Can you perhaps even map other K-means clusters to the hierarchical clusters above?
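
One way to compare the two clusterings is a contingency table of hierarchical labels against K-means labels. The sketch below uses `pd.crosstab` on made-up label vectors (assumptions for illustration); with the real data, the inputs would be the `fcluster` output and `kmeans.labels_`.

```python
import numpy as np
import pandas as pd

# Made-up label vectors for 8 articles (hypothetical, for illustration only).
hier_labels = np.array([1, 1, 1, 2, 2, 3, 3, 3])    # hierarchical: starts at 1
kmeans_labels = np.array([2, 2, 2, 0, 0, 1, 1, 0])  # K-means: starts at 0

# Rows: hierarchical clusters; columns: K-means clusters; cells: article counts.
table = pd.crosstab(pd.Series(hier_labels, name="hierarchical"),
                    pd.Series(kmeans_labels, name="kmeans"))
print(table)

# For each hierarchical cluster, the K-means cluster it overlaps with most.
best_match = table.idxmax(axis=1)
print(best_match)
```

A row whose counts concentrate in a single column suggests that the two methods found roughly the same group; rows spread across several columns indicate clusters that do not map cleanly.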