# K-means

## Data

Download the [World Values Survey](http://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp) data and read the accompanying questionnaire and codebook files to understand what the dataset contains.

## Overarching research question

What kind of groups can we identify among survey respondents?
* Choose some variables in the data that might be relevant
* Run clustering
* Interpret results

## Tools

K-means clustering is available in many tools. Here we use [scikit-learn](https://scikit-learn.org/0.16/modules/clustering.html#clustering).

In [None]:
import csv
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

In [None]:
# Create new data frame for analysis

# Check the questionnaire and codebook and modify these as you like.
selected_keys = ['V4', 'V5', 'V6', 'V7', 'V8', 'V9']

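# Read data; the with-block closes the file once reading is done
data = []
with open('data/wvs.csv', newline='') as f:
    for line in csv.DictReader(f, delimiter=';'):
        # Keep only the selected variables for each respondent
        data.append([line[key] for key in selected_keys])

# Create data frame; csv returns strings, so convert the answers to numbers
df = pd.DataFrame(data, columns=selected_keys).apply(pd.to_numeric)
df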

In [None]:
kmeans = KMeans( 10, n_init=10, random_state=100 ) # Set random state for reproducible results
kmeans.fit( df ) # Cluster labels are stored in kmeans.labels_

# Check number of respondents per cluster
clusters, counts = np.unique( kmeans.labels_, return_counts=True )
pd.DataFrame( counts, columns=['Number of respondents per cluster'] ).T

Now we have created a model with **ten clusters**.

How do we know if it is any good?

What would be different if we created a **five cluster** model instead?

Let's examine the mean values of each variable per cluster.

In [None]:
pd.DataFrame( kmeans.cluster_centers_, columns=selected_keys )
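Each row of `cluster_centers_` is the mean of the points assigned to that cluster, which is why the table above can be read as per-cluster variable means. The sketch below checks this on made-up data with a pandas `groupby`. The `toy` frame and its column names are illustrative only (they are not WVS variables), and `tol=0` is set here only so that the fit runs until the assignments stop changing.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up data standing in for the survey answers
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["A", "B", "C"])

# tol=0: iterate until cluster assignments no longer change
model = KMeans(3, n_init=10, random_state=100, tol=0).fit(toy)

# Cluster centers as reported by the model
centers = pd.DataFrame(model.cluster_centers_, columns=toy.columns)

# Per-cluster means computed directly from the data and the labels
means = toy.groupby(model.labels_).mean()

print(np.allclose(centers.values, means.values))
```

On the survey data, the same `groupby` call with `df` and `kmeans.labels_` should give a table comparable to the one above.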

## Tasks

* Run the above code and explain to yourself what it does.
* Response values -1, -2 and -3 indicate missing data (for example, respondents answering "I don't know"). Remove these values from the dataset and rerun the analysis.
* Modify the variables used for clustering and the number of clusters and examine how the results change.
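One possible way to approach the cleaning task, sketched on a toy frame: convert the answers to numbers and drop every respondent who has a missing-data code in any selected variable. The helper name `drop_missing` and the toy values are made up for illustration. Dropping whole rows is an assumption, and another strategy may suit your analysis better.

```python
import pandas as pd

def drop_missing(frame, missing_codes=(-1, -2, -3)):
    """Return a numeric copy of frame without rows that contain a missing-data code."""
    # Answers read with the csv module are strings, so convert them first
    numeric = frame.apply(pd.to_numeric, errors="coerce")
    has_code = numeric.isin(missing_codes).any(axis=1)
    has_nan = numeric.isna().any(axis=1)
    return numeric[~(has_code | has_nan)]

# Toy example with two variables and four respondents
toy = pd.DataFrame({"V4": ["1", "-1", "2", "3"], "V5": ["2", "3", "-2", "4"]})
clean = drop_missing(toy)
print(clean)
```

In the notebook, the equivalent call would be `df = drop_missing(df)` before fitting the model again.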

## Evaluating the results

One way to evaluate the quality of a clustering is the ["Elbow method"](https://en.wikipedia.org/wiki/Elbow_method_(clustering)), which provides a visual approach to selecting the number of clusters. Other tools exist as well, such as the [Silhouette method](https://en.wikipedia.org/wiki/Silhouette_(clustering)). The elbow method is a simple approach to model selection in k-means, but it does not always give a clear answer.

The elbow method measures the distance between data points and their cluster centroids using the sum of squared errors (SSE). SSE ranges from 0 (every point coincides with its centroid) towards positive infinity (points are scattered far from their centroids). As the number of clusters (k) increases, the SSE decreases, so the smallest SSE is not the goal in itself. Instead, look for the "elbow": the value of k after which adding clusters yields only small improvements. This balances model fit against model complexity and the interpretability of the results.
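To make the SSE concrete, here is a minimal sketch on made-up data that computes it by hand, as the sum of squared distances from each point to its assigned centroid, and compares it with the `inertia_` attribute of a fitted `KMeans` model. The array `X` is random and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up two-dimensional data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

model = KMeans(3, n_init=10, random_state=100).fit(X)

# Look up the centroid of the cluster each point was assigned to
assigned_centers = model.cluster_centers_[model.labels_]

# Sum of squared distances between points and their centroids
sse_manual = ((X - assigned_centers) ** 2).sum()

print(np.isclose(sse_manual, model.inertia_))
```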

In [None]:
sse = {}

# Run k-means for k = 1, ..., 9 clusters and store the SSE (inertia_) of each model
for k in range(1, 10):
    kmeans = KMeans( k, n_init=10, random_state=100 )
    kmeans.fit( df )
    sse[k] = kmeans.inertia_

In [None]:
import matplotlib.pyplot as plt
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters K")
plt.ylabel("SSE")
plt.show()
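As an alternative to reading the elbow plot, the silhouette method mentioned earlier gives one score per model, where higher values indicate better separated clusters. The sketch below uses synthetic, well-separated blobs instead of the survey data, so the blob centers and all variable names are assumptions made for illustration. The silhouette score is only defined for two or more clusters, which is why the loop starts at k = 2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three clearly separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(k, n_init=10, random_state=100).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the number of clusters with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print("Best k:", best_k)
```

Computing the silhouette score compares all pairs of points, so on a large survey dataset it can be slow. The `sample_size` parameter of `silhouette_score` can be used to score a random subsample instead.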

## Things to think about and try out

* Try to run k-means using different ranges of K and use the Elbow method to select a model. Note that running a large range of models can take a long time.
* Inspect the results and try to interpret what the per-cluster variable means tell you about each group.
* What similarities can you find between k-means and factor analysis?
* How does k-means differ from factor analysis? 