# K-means

## Data

Use the [World Values Survey](http://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp) data files and the corresponding questionnaire and codebook files to understand what is in the data.

## Overarching research question

What kinds of respondent groups emerge from the survey responses, and do they correspond to nationalities?
* Choose some relevant measurements
* Run the analysis
* Interpret the results

## Method

Many tools can be used for this; here we apply [scikit-learn](https://scikit-learn.org/0.16/modules/clustering.html#clustering).


In [3]:
import csv
import sklearn.cluster
import sklearn
import collections

In [None]:
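## create new data matrix for k-means analysis

selected_keys = ['V4', 'V5', 'V6', 'V7', 'V8', 'V9']

data = []

## read the survey file and close it when done
with open('data/wvs.csv') as f:
    for line in csv.DictReader( f ):
        ## responses are read as strings; convert them to integers for k-means
        data.append( [ int( line[key] ) for key in selected_keys ] )

## number of responders (rows)
print( len( data ) )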

In [None]:
clustering_machine = sklearn.cluster.KMeans( n_clusters = 10 )
clustering_results = clustering_machine.fit_predict( data )

## number of responders per cluster
print( collections.Counter( clustering_results ) )

Now we have created a **ten-cluster** model.
How do we know if it is any good?

What would be different if we created a **five-cluster** model instead?

Let's examine the mean values for each of the identified clusters.

In [None]:
## clustering_results with row ID numbers
all_clusters = set( clustering_results  )
clustering_results_with_rows = set( enumerate( clustering_results ) )

for cluster in all_clusters:
    
    ## select entries in this cluster
    this_cluster_rows = filter( lambda cr: cr[1] == cluster, clustering_results_with_rows )
    this_cluster_values = []
    
    for entry in this_cluster_rows:
        row = entry[0]
        this_cluster_values.append( data[ row ] )
        
    print( "Cluster", cluster )
    
    ## compute means per cluster
    for i, name in enumerate( selected_keys ):
        
        ## use a list, not a set: a set would drop duplicate values and distort the mean
        dd = [ int( x[i] ) for x in this_cluster_values ]
        print( name , sum( dd ) / len( dd ) )
        
## TODO: this is a very manual way of doing this. Pandas could do all of it for you automatically.
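
As the TODO above suggests, pandas can compute per-cluster means with a single `groupby`. The following is a minimal sketch, not the notebook's own method: `toy_keys`, `toy_data`, and `toy_labels` are made-up stand-ins for `selected_keys`, `data`, and `clustering_results`, and the values are not survey data.

```python
import pandas as pd

# Hypothetical stand-ins for selected_keys, data and clustering_results
toy_keys = ['V4', 'V5']
toy_data = [[1, 2], [3, 4], [1, 4], [2, 2]]
toy_labels = [0, 1, 0, 1]

# One row per responder, one column per selected variable
frame = pd.DataFrame(toy_data, columns=toy_keys).astype(int)

# Attach the cluster label of each row, then average within each cluster
frame['cluster'] = toy_labels
cluster_means = frame.groupby('cluster').mean()

print(cluster_means)
```

With the notebook's variables, the same idea would presumably read `pd.DataFrame(data, columns=selected_keys).astype(int)` with `clustering_results` as the `cluster` column.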

## Task

* Run the above code and explain to yourself what is done.
* Response values -1, -2 and -3 indicate missing data (for example, people answering "I don't know"). Remove these values from the dataset and redo your analysis.
* Choose suitable variables using the codebook and your understanding and intuition.
* Modify the number of clusters and examine how results change.
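
One possible way to approach the cleaning task is to drop every row that contains any of the missing-data codes. This is only a sketch under that assumption (dropping whole rows is one choice among several): the helper `drop_rows_with_missing` and the `toy_rows` values are hypothetical, and rows are assumed to hold strings as read by `csv.DictReader`.

```python
# Codes that the task list describes as missing data
MISSING_CODES = {-1, -2, -3}

def drop_rows_with_missing(rows, missing_codes=MISSING_CODES):
    """Return rows as integer lists, skipping rows with any missing code."""
    cleaned = []
    for row in rows:
        values = [int(v) for v in row]  # CSV values arrive as strings
        if not any(v in missing_codes for v in values):
            cleaned.append(values)
    return cleaned

# Made-up example rows, not survey data
toy_rows = [['1', '2', '1'], ['-1', '2', '3'], ['2', '-2', '1'], ['3', '1', '4']]
clean_rows = drop_rows_with_missing(toy_rows)

print(len(toy_rows), "rows before,", len(clean_rows), "rows after")
```

Dropping rows shrinks the dataset, so it is worth comparing the row counts before and after cleaning.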

## Looking inside K-means

We often prefer a data-driven approach to choosing the number of clusters. One option is the [elbow method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)), where we visually inspect a plot to pick the number of clusters. Other tools exist as well, such as the [silhouette method](https://en.wikipedia.org/wiki/Silhouette_(clustering)). The elbow method is simple and easy to understand, but the elbow is not always clear, so other methods are often preferred.
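
For comparison, the silhouette method can be sketched with scikit-learn's `silhouette_score`. The example below is illustrative only: it uses made-up, well-separated toy points (`toy_points`) rather than the survey data, and the range of k values is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data (an assumption for illustration): three well-separated 2-D groups
rng = np.random.default_rng(0)
toy_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_points = np.vstack(
    [center + rng.normal(scale=0.5, size=(30, 2)) for center in toy_centers]
)

silhouette_by_k = {}
for k in range(2, 6):  # the silhouette score needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_points)
    silhouette_by_k[k] = silhouette_score(toy_points, labels)

# A higher score means tighter, better separated clusters
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k)
print("Highest silhouette score at k =", best_k)
```

On real survey data the scores are usually much closer to each other, so the highest value should be read as a hint rather than a definitive answer.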

The elbow method measures how far the items in each cluster are from their cluster's centroid, summarized as the sum of squared errors (SSE): the sum of squared distances between each item and its centroid. SSE ranges from 0 (every item sits exactly on its centroid) upward without bound (items are scattered far from their centroids). Increasing the number of clusters (k) generally decreases SSE, so this is a balancing act: how do you trade a lower SSE against the added complexity and reduced explainability of having more clusters?
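
To make the SSE concrete, here is a small sketch that computes it by hand with NumPy. The function name `sum_of_squared_errors` and the toy points are assumptions made for illustration; scikit-learn reports the same quantity for a fitted model as `inertia_`.

```python
import numpy as np

def sum_of_squared_errors(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    centroids = np.asarray(centroids, dtype=float)
    # centroids[labels] picks, for every point, the centroid of its cluster
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

# Made-up example: four points in two clusters
toy_points = [[0, 0], [0, 2], [10, 0], [10, 4]]
toy_labels = [0, 0, 1, 1]
toy_centroids = [[0, 1], [10, 2]]

sse_two_clusters = sum_of_squared_errors(toy_points, toy_labels, toy_centroids)
print(sse_two_clusters)  # 10.0

# With a single cluster the centroid is the overall mean, and SSE is larger
one_centroid = [np.mean(toy_points, axis=0)]
sse_one_cluster = sum_of_squared_errors(toy_points, [0, 0, 0, 0], one_centroid)
print(sse_one_cluster)  # 111.0
```

Going from one cluster to two drops the SSE from 111 to 10 in this toy case, which is the kind of sharp drop the elbow plot is meant to reveal.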

In [None]:
sse = {}

for k in range(1, 10):
    kmeans = sklearn.cluster.KMeans(n_clusters=k).fit(data)
    sse[k] = kmeans.inertia_

In [None]:
import matplotlib.pyplot as plt
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

## Tasks

* Draw three different k-means clusterings with centroids and related values and organize them by their SSE.
* Use the elbow method to optimize your model.
* What similarities can you find between k-means and factor analysis?
* How does k-means differ from factor analysis? 