# K-means

## Data

Download the [World Values Survey](http://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp) data and read the corresponding questionnaire and codebook files to understand what the dataset contains.

## Overarching research question

What kind of groups can we identify among survey respondents?
* Choose some variables in the data that might be relevant
* Run clustering
* Interpret results

## Tools

[K-means](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans) is built into R.

In [None]:
# Create new data frame for analysis

# Check the questionnaire and codebook and modify these as you like.
selected_keys <- c('V4', 'V5', 'V6', 'V7', 'V8', 'V9')

full_data <- read.csv('data/wvs.csv', sep=';')

data <- full_data[,selected_keys ]

print( nrow( data ) )
head(data)

In [None]:
set.seed(100) # Set random seed for reproducible results

kmeans_results <- kmeans( data, centers = 10 )

## Check the number of respondents per cluster
table( kmeans_results$cluster )

Now we have created a **ten-cluster** model.
How do we know if it is any good?

What would be different if we created a **five-cluster** model instead?

Let's examine the mean value of each variable within each identified cluster and plot the results.

In [None]:
kmeans_results$centers
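
The cluster means can also be shown as a grouped bar chart. The following is a minimal sketch on simulated data, not on the survey itself: the data frame `toy`, its 1-4 response scale and the choice of three clusters are assumptions made only to keep the example self-contained.

```r
# Simulated stand-in for the survey data
# (assumption: three items answered on a 1-4 scale).
set.seed(100)
toy <- data.frame(
  V4 = sample(1:4, 200, replace = TRUE),
  V5 = sample(1:4, 200, replace = TRUE),
  V6 = sample(1:4, 200, replace = TRUE)
)

toy_results <- kmeans(toy, centers = 3)

# centers is a matrix: one row per cluster, one column per variable.
centers <- toy_results$centers

# Grouped bar chart: one group of bars per variable, one bar per cluster.
barplot(
  centers,
  beside = TRUE,
  legend.text = paste("Cluster", rownames(centers)),
  xlab = "Variable",
  ylab = "Cluster mean"
)
```

To try the same plot on the survey data, replace `centers` with `kmeans_results$centers`.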

## Task

* Run the above code and explain to yourself what it does.
* Response values -1, -2 and -3 indicate missing data (for example, respondents answering "I don't know"). Remove these values from the dataset and rerun the analysis.
* Modify the variables used for clustering and the number of clusters and examine how the results change.
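
One possible way to approach the cleaning step is sketched below on a small made-up data frame. The name `toy` and its values are assumptions for illustration, and the list of codes follows the task description above; check the codebook in case your chosen variables use further missing-data codes. Because `kmeans` does not accept missing values, the sketch drops every respondent who has at least one missing answer.

```r
# Toy stand-in for the survey data; the values are made up for illustration.
toy <- data.frame(
  V4 = c(1, 2, -1, 3, 4),
  V5 = c(2, -2, 1, 1, 3),
  V6 = c(1, 1, 2, -3, 4)
)

missing_codes <- c(-1, -2, -3)  # codes named in the task above

# Recode the missing-data codes to NA, column by column.
toy_na <- toy
toy_na[] <- lapply(toy_na, function(col) replace(col, col %in% missing_codes, NA))

# Keep only respondents with a valid answer to every selected variable.
toy_clean <- na.omit(toy_na)

nrow(toy)        # respondents before cleaning
nrow(toy_clean)  # respondents after cleaning
```

Dropping incomplete rows is only one design choice: it is simple, but it can discard many respondents when many variables are selected.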

## Evaluating the results

One way to evaluate the quality of a clustering is the ["Elbow method"](https://en.wikipedia.org/wiki/Elbow_method_(clustering)), which provides a visual approach to selecting the number of clusters. Other tools exist as well, such as the [Silhouette method](https://en.wikipedia.org/wiki/Silhouette_(clustering)). The elbow method is a simple approach to model selection in k-means, but it does not always provide clear answers.

The elbow method measures the distance between data points and their cluster centroids using the sum of squared errors (SSE). The metric ranges from 0 (every point coincides with its centroid) to positive infinity (points are scattered far from their centroids). As the number of clusters (k) increases, the SSE decreases. The goal of the elbow method is to find a balance: enough clusters to keep the SSE low, but few enough that the results remain understandable and interpretable. The "elbow" is the point where the curve bends and additional clusters bring only small reductions in SSE.
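
To make the SSE definition concrete, the sketch below computes it by hand and compares it with the `tot.withinss` value that `kmeans` reports. The simulated data frame `toy` and the choice of three clusters are assumptions for illustration only.

```r
set.seed(100)

# Simulated two-dimensional data (illustrative, not survey data).
toy <- data.frame(x = rnorm(100), y = rnorm(100))
fit <- kmeans(toy, centers = 3)

# Look up the centroid assigned to each data point (one row per point).
assigned_centers <- fit$centers[fit$cluster, ]

# Sum of squared distances between each point and its own centroid.
sse_by_hand <- sum((as.matrix(toy) - assigned_centers)^2)

# Should agree with the value stored in the kmeans result.
all.equal(sse_by_hand, fit$tot.withinss)
```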

In [None]:
set.seed(100)

sse <- c()

for(k in 2:10) {
    result <- kmeans( data, centers = k )
    sse <- c( sse, result$tot.withinss ) ## Growing a vector in a loop is slow in R, but fine for nine values.
}

In [None]:
plot( 2:10, sse, type="b", xlab="Number of clusters (k)", ylab="Total within-cluster SSE" )
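
The silhouette method mentioned earlier can be tried in a similar loop. This is a rough sketch under stated assumptions: it uses the `silhouette` function from the `cluster` package (a recommended package included in standard R installations), and it runs on a simulated data frame `toy` rather than on the survey data. Higher average silhouette widths suggest better separated clusters.

```r
library(cluster)  # provides silhouette()

set.seed(100)

# Simulated two-dimensional data (illustrative, not survey data).
toy <- data.frame(x = rnorm(100), y = rnorm(100))

# Pairwise distances between all points, computed once and reused.
d <- dist(toy)

# Average silhouette width for each candidate number of clusters.
avg_sil <- sapply(2:6, function(k) {
  fit <- kmeans(toy, centers = k)
  sil <- silhouette(fit$cluster, d)
  mean(sil[, "sil_width"])
})

plot(2:6, avg_sil, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Average silhouette width")
```

Note that `dist` stores one distance for every pair of rows, so on the full survey data it may be necessary to work with a random sample of respondents.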

## Things to think and try out

* Run k-means over different ranges of k and use the elbow method to select a model. Note that fitting a large range of models can take a long time.
* Inspect the results and try to interpret what the per-cluster variable means tell you about each group.
* What similarities can you find between k-means and factor analysis?
* How does k-means differ from factor analysis? 