# $k$-means clustering

$k$-means clustering aims to partition the instance space into $k$ clusters, so that each observation
belongs to the cluster with the nearest mean (the cluster centroid).
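
To make the "nearest mean" rule concrete, here is a minimal sketch of a single assignment-and-update iteration on toy data, written with NumPy. It is only an illustration and not how `sklearn` implements the algorithm internally; the function name `lloyd_step` and the toy points are assumptions made for this example.

```python
import numpy as np

def lloyd_step(X, centroids):
    """One k-means iteration: assign points to the nearest centroid, then recompute means."""
    # pairwise squared Euclidean distances, shape (n_samples, k)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # index of the nearest centroid for every point
    labels = dists.argmin(axis=1)
    # new centroid = mean of the assigned points (keep the old one if a cluster is empty)
    new_centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids

# toy data (hypothetical): two well-separated groups in 2D
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
init_centroids = np.array([[1.0, 1.0], [9.0, 9.0]])
toy_labels, toy_centroids = lloyd_step(X_toy, init_centroids)
print(toy_labels)
print(toy_centroids)
```

Repeating this step until the labels stop changing gives the basic iterative procedure; the result depends on the initial centroids, which is why several executions with different seeds are compared later.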

## Implementation

### Load datasets

Here, the train and test datasets will be used together. Since $k$-means is an unsupervised technique,
the target column will be removed from the features and kept aside for later comparison:

In [None]:
import pandas as pd
# dataset paths
paths = ['../datasets/covertype_norm_train.csv',
         '../datasets/covertype_norm_test.csv']
# load train and test datasets together (a fresh index avoids duplicate row labels)
dataset = pd.concat([pd.read_csv(f) for f in paths], ignore_index=True)
# keep targets for future comparative purposes
targets = dataset['cover_type']
# remove target column
dataset.drop('cover_type', inplace=True, axis=1)
# check shape
print("[INFO] Dataset shape: ", dataset.shape)
# check head
dataset.head()

### Methodology

#### Varying $k$

$k$ is the number of clusters built in each execution. It is important to vary it in order to
search for better partitions. In this work, $k \in \{2, 3, \ldots, 13\}$. Notice that the number of classes
lies in the middle of this interval.

#### Number of executions

For each value of $k$, $5$ executions will be performed.
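
As a rough illustration of the size of this experiment, the sketch below enumerates every ($k$, execution) pair implied by the two settings above. The variable names `ks_full`, `runs` and `grid` are assumptions made for this example only.

```python
import itertools

ks_full = range(2, 14)   # k in {2, ..., 13}
runs = range(5)          # 5 executions per k

# every (k, execution) pair that the methodology calls for
grid = list(itertools.product(ks_full, runs))
print(len(grid))   # 60
print(grid[:3])    # [(2, 0), (2, 1), (2, 2)]
```

The full methodology therefore amounts to 60 model fits.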

#### Davies-Bouldin (DB) index

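Every clustering will then be evaluated using the Davies-Bouldin (DB) index, where $s_i$ is the average distance between the points of cluster $i$ and its centroid, and $d_{ij}$ is the distance between the centroids of clusters $i$ and $j$. Lower values indicate more compact and better separated clusters: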

$$
R_{ij} = \frac{s_i + s_j}{d_{ij}}
$$

$$
D_i = \max_{j\neq i}R_{ij}
$$

$$
DB = \frac 1 k \sum_i D_i
$$
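
The three formulas can be traced with a short NumPy sketch. It assumes Euclidean distances, with $s_i$ taken as the mean distance from the points of cluster $i$ to its centroid and $d_{ij}$ as the distance between centroids; the function name `db_index` and the toy data are hypothetical. In the notebook itself the index is computed with `davies_bouldin_score` from `sklearn`.

```python
import numpy as np

def db_index(X, labels):
    """Davies-Bouldin index computed directly from R_ij, D_i and DB."""
    ids = np.unique(labels)
    # centroid of every cluster
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # s_i: mean distance from the points of cluster i to its centroid
    s = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    k = len(ids)
    D = np.zeros(k)
    for i in range(k):
        # R_ij for every j != i, then D_i = max_j R_ij
        R = [
            (s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        D[i] = max(R)
    # DB = (1 / k) * sum_i D_i
    return D.mean()

# toy data (hypothetical): two tight clusters that are far apart
X_db = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels_db = np.array([0, 0, 1, 1])
db_value = db_index(X_db, labels_db)
print(db_value)
```

In this toy case $s_1 = s_2 = 1$ and $d_{12} = 10$, so $R_{12} = 0.2$ and the index is $0.2$.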

### Performing $k$-means

Here, the `sklearn` library will be used for performing the executions described in the methodology:

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
import itertools
import time
import numpy as np
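# k values: the methodology uses range(2, 14);
# a reduced range is used here for a quicker run
#ks = list(range(2,14))
ks = list(range(2,5))
# number of executions per k: the methodology calls for 5, i.e. range(5);
# a reduced number is used here for a quicker run
ns = list(range(2))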
# store execution results
results_execs = []
# store models
models = {}
# iterate over ns
for n in ns:
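    # seed: current time in milliseconds, wrapped into the range
    # accepted by random_state, so that it differs between executions
    ms = int(time.time() * 1000) % (2**32 - 1)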
    # for each k
    for k in ks:
        print("[INFO] k =", k, ", n =", n)
        # perform k-means
        kmeans = KMeans(n_clusters=k, 
                        random_state=ms).fit(dataset)
        # store model
        models[(k, n)] = kmeans
        # retrieve labels
        labels = kmeans.labels_
        # compute DB index
        db = davies_bouldin_score(dataset, labels)
        # info
        print("seed", ms, ", db = ", db)
        # store results
        results_execs.append([k, n, ms, db])

### Computing the average DB per $k$

In [None]:
# make dataframe from results
results_execs = pd.DataFrame(results_execs, 
                             columns=['k', 'iteration', 
                                      'random_seed','db_index'])
# take the average DB index per k
results_execs_avg = results_execs.groupby('k')['db_index'].mean()
# get the k with min db average
best_k = results_execs_avg.idxmin()
# take the row label of the partition with best_k clusters and min db
best_partition_idx = results_execs.loc[results_execs['k'] == best_k, 'db_index'].idxmin()
# take the partition (idxmin returns a label, so index with .loc)
best_partition = results_execs.loc[best_partition_idx]
print("[INFO] Best partition:")
best_partition
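
The selection logic can be checked by hand on a small table. The sketch below uses made-up DB values (they are not results from this dataset) and `toy_`-prefixed names that are assumptions for this example.

```python
import pandas as pd

# hypothetical results: two executions for each of k = 2 and k = 3
toy = pd.DataFrame(
    [[2, 0, 11, 1.30], [2, 1, 12, 1.10], [3, 0, 11, 0.90], [3, 1, 12, 1.00]],
    columns=['k', 'iteration', 'random_seed', 'db_index'],
)
# average DB per k: 1.20 for k = 2 and 0.95 for k = 3
toy_avg = toy.groupby('k')['db_index'].mean()
# k with the smallest average
toy_best_k = toy_avg.idxmin()
# row label of the best single execution among those with toy_best_k clusters
toy_best_idx = toy.loc[toy['k'] == toy_best_k, 'db_index'].idxmin()
print(toy_best_k, toy_best_idx)
```

Here $k = 3$ has the lowest average, and among its executions the row with label 2 has the lowest DB index.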

### Saving the best partition

In [None]:
# sklearn.externals.joblib was removed from scikit-learn; use the joblib package directly
import joblib
# recover the model
model = models[(int(best_partition['k']), 
                int(best_partition['iteration']))]
# dump it
joblib.dump(model, '../models/best_kmeans.save') 
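
A saved model can later be restored with `joblib.load`. The sketch below shows the round trip on a toy model written to a temporary directory; the toy data and the file name `toy_kmeans.save` are assumptions for this example, and the actual model is the one dumped in the cell above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans

# toy stand-in for the fitted model (hypothetical data)
X_saved = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_saved)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_kmeans.save')  # hypothetical file name
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# the restored model assigns points to the same clusters as the original
same = np.array_equal(restored.predict(X_saved), toy_model.predict(X_saved))
print(same)
```

Files written by `joblib.dump` are pickle-based, so they should be loaded with a compatible `sklearn` version and only from trusted sources.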

### Saving the results

In [None]:
results_execs.to_csv('../results/kmeans_exec_results.csv',
                     index=False)
results_execs_avg.to_csv('../results/kmeans_avg_db.csv')