In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

from scripts import *

# Gap statistic

## What is it?

The gap statistic was introduced by Stanford researchers Tibshirani, Walther and Hastie in their 2001 paper. The idea behind their approach is to compare cluster compactness on the data with its expected value under a null reference distribution, i.e. a distribution with no obvious clustering. Their estimate for the optimal number of clusters is the value of $k$ for which the compactness measure on the original data falls the farthest below this reference curve. This is expressed by the following formula for the gap statistic:

$$\mathrm{Gap}_n(k)=E^*_n\{\log(W_k)\}-\log(W_k)$$

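where

$E^*_n$ = expectation under a sample of size $n$ drawn from the reference distribution.

$W_k$ = compactness of our clustering into $k$ clusters, based on the *Within-Cluster Sum of Squared Errors* (WSS).

WSS is calculated as:
$$W_k=\sum_{r=1}^{k}\frac{1}{2n_r}D_r,\qquad D_r=2n_r\sum_{x_i\in C_r}||x_i - \mu_r||^2$$

Here $D_r$ is the sum of squared Euclidean distances over all pairs of points in cluster $C_r$, $n_r$ is the number of points in $C_r$ and $\mu_r$ is its centroid. Dividing by $2n_r$ therefore turns each $D_r$ into the sum of squared distances from the points of $C_r$ to their cluster centre.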
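In the original paper, $D$ for a single cluster is defined as the sum of squared distances over all ordered pairs of its points, and the factor $2n$ links it to the sum of squares around the cluster mean. The sketch below checks that identity numerically on a synthetic cluster; the data and variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(30, 3))   # points of one hypothetical cluster
n = len(C)

# D: sum of squared Euclidean distances over all ordered pairs (i, i')
diffs = C[:, None, :] - C[None, :, :]
D = (diffs ** 2).sum()

# Sum of squared distances from each point to the cluster mean
ss_around_mean = ((C - C.mean(axis=0)) ** 2).sum()

# The two quantities differ exactly by the factor 2n
print(np.isclose(D, 2 * n * ss_around_mean))
```

Dividing $D$ by $2n$ recovers this cluster's contribution to the WSS.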

In scikit-learn, the WSS is available as the `inertia_` attribute of a fitted `sklearn.cluster.KMeans` model. It is obtained as follows:
- The squared distance of each point from the centre of its cluster is computed (the Squared Errors)
- The WSS score is the sum of these Squared Errors over all points
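The two steps above can be reproduced by hand and compared with `inertia_`. This is a minimal sketch on synthetic two-blob data (the data and variable names are assumptions, not part of this notebook's dataset).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs, for illustration only
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 1: squared distance of each point from its assigned cluster centre
squared_errors = ((X - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1)

# Step 2: sum the squared errors over all points
manual_wss = squared_errors.sum()

print(np.isclose(manual_wss, km.inertia_, rtol=1e-6))
```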


source:
https://towardsdatascience.com/cheat-sheet-to-implementing-7-methods-for-selecting-optimal-number-of-clusters-in-python-898241e1d6ad

## Calculating Gap statistic
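Before working with the country data, here is a minimal sketch of how the gap statistic could be computed with `KMeans`. It assumes a reference distribution that is uniform over the bounding box of the data (one of the options described in the paper); `gap_statistic`, `k_max`, `n_refs` and the synthetic blobs are hypothetical names and data chosen for illustration, not functions from `scripts`.

```python
import numpy as np
from sklearn.cluster import KMeans


def gap_statistic(X, k_max=6, n_refs=10, seed=0):
    """Return Gap(k) and s_k for k = 1..k_max (illustrative helper)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        # log(W_k) on the observed data
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        log_wk = np.log(km.inertia_)
        # log(W_k) on reference datasets drawn uniformly over the bounding box
        ref_logs = np.array([
            np.log(
                KMeans(n_clusters=k, n_init=10, random_state=seed)
                .fit(rng.uniform(lo, hi, size=X.shape))
                .inertia_
            )
            for _ in range(n_refs)
        ])
        gaps.append(ref_logs.mean() - log_wk)
        # Standard deviation of the reference values, corrected for simulation error
        s.append(ref_logs.std() * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(s)


# Synthetic data: three well-separated blobs (an assumption for the demo)
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + rng.normal(0, 0.5, (50, 2)) for c in centres])

gaps, s = gap_statistic(X)

# Selection rule from the paper: smallest k with Gap(k) >= Gap(k+1) - s_{k+1}
candidates = [k for k in range(1, len(gaps)) if gaps[k - 1] >= gaps[k] - s[k]]
k_hat = candidates[0] if candidates else len(gaps)
print(gaps.round(3), k_hat)
```

The number of reference datasets is kept small here to keep the run short; a larger `n_refs` gives a smoother reference curve at the cost of more `KMeans` fits.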

In [2]:
df = pd.read_csv("../data/country-data.csv")
df_pca_clusters = pd.read_csv("../data/country-data-pca-w-clusters.csv")

In [7]:
from scripts import *

# calculating intra cluster variance for each cluster
cluster_list, IVS = intra_cluster_variance(df_pca_clusters)
IVS

[8.56563649479922,
 1.4499552968213443,
 14.224225630782398,
 0.8977424168599871,
 3.326600138959439]