# Sample DiviK run

## Sample data

In [1]:
from sklearn.datasets import make_blobs

In [2]:
X, _ = make_blobs(
    n_samples=1_000,
    n_features=2,
    centers=7,
    random_state=42,
)

## DiviK instance building

`DiviK` requires a `kmeans` instance that implements the `AutoKMeans` interface.

The `AutoKMeans` interface covers the methods that automatically tune the number of clusters in `scikit-learn`-compatible K-Means implementations.

The `divik` library provides two implementations of the `AutoKMeans` interface: `DunnSearch` and `GAPSearch`, both in the `divik.cluster` package.
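
To make "automatically tune the number of clusters" more concrete, here is a minimal sketch of a Dunn-style selection in plain Python. It uses the classic Dunn index (smallest distance between points of different clusters, divided by the largest distance within a cluster) and picks the candidate partition with the highest score. The `dunn_index` helper and both candidate partitions are hypothetical and only illustrate the idea; `DunnSearch` may use a different variant of the index.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    # Smallest distance between two points that belong to different clusters.
    separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between two points that share a cluster.
    diameter = max(
        dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return separation / diameter


# Hypothetical candidate partitions of the same six 2-D points.
two_groups = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
three_groups = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0)],
    [(11.0, 10.0)],
]

candidates = {2: two_groups, 3: three_groups}
scores = {k: dunn_index(partition) for k, partition in candidates.items()}
best_k = max(scores, key=scores.get)  # the partition into 2 groups wins here
```

Splitting the second blob in `three_groups` places two neighbouring points in different clusters, so the separation term collapses and the score drops well below that of `two_groups`.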

In [3]:
from divik.cluster import (
    DiviK,
    DunnSearch,
    GAPSearch,
    KMeans
)
from sklearn.cluster import KMeans as sklKMeans

### Minimal `DiviK` example

In [4]:
minimal_divik = DiviK(
    kmeans=DunnSearch(  # we want to use Dunn's method for finding the optimal number of clusters
        kmeans=KMeans(
            n_clusters=2,  # it is required, like in scikit-learn, but you can provide any number here,
                           # DunnSearch will override it anyway
        ),
        max_clusters=5,  # for the sake of the example I'll keep it low
    ),
    minimal_size=100,  # for the sake of the example, I won't split clusters with fewer than 100 elements
    filter_type='none',  # we have 2 features in sample data, feature selection would be pointless
)

In [5]:
minimal_divik.fit(X)

DiviK(filter_type='none',
      kmeans=DunnSearch(kmeans=KMeans(n_clusters=2), max_clusters=5),
      minimal_size=100)

In [6]:
minimal_divik.n_clusters_

22

In the above case, the only stop criterion for the algorithm is a subgroup size falling below `minimal_size`, which is a naive approach.
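
Why a size-only stop criterion over-segments can be shown with a toy sketch. The code below is not the DiviK algorithm: `split_until_small` is a hypothetical helper that simply halves a sorted list of 1-D values in place of a real clustering step, and it stops only when a group has fewer than `minimal_size` elements.

```python
def split_until_small(points, minimal_size):
    """Recursively halve a sorted group; stop only when it is smaller than minimal_size."""
    if len(points) < minimal_size:
        return [points]  # too small to split: this group becomes a final cluster
    middle = len(points) // 2
    # Stand-in for a real clustering step: cut the sorted values into two halves.
    return (
        split_until_small(points[:middle], minimal_size)
        + split_until_small(points[middle:], minimal_size)
    )


# 1,000 evenly spaced values: one homogeneous group with no cluster structure.
homogeneous = [i / 1000 for i in range(1000)]
leaves = split_until_small(homogeneous, minimal_size=100)
n_leaves = len(leaves)  # 16
largest_leaf = max(len(leaf) for leaf in leaves)  # 63
```

Even though the toy data contains no structure at all, the recursion keeps splitting until every group is smaller than 100 elements and ends up with 16 groups.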

### `DiviK` example with data heterogeneity check

In [7]:
divik = DiviK(
    kmeans=DunnSearch(  # we want to use Dunn's method for finding the optimal number of clusters
        kmeans=KMeans(
            n_clusters=2,  # it is required, like in scikit-learn, but you can provide any number here,
                           # DunnSearch will override it anyway
        ),
        max_clusters=5,  # for the sake of the example I'll keep it low
    ),
    fast_kmeans=GAPSearch(  # this one is for assessment, if we should split a subregion
        kmeans=KMeans(
            n_clusters=2,  # as above
        ),
        max_clusters=2,  # For the sake of heterogeneity check, it should be 2 for GAP index.
                         # GAP index always looks for "first feasible", so if one cluster
                         # does not yield the right solution, we split.
    ),
    minimal_size=100,  # for the sake of the example, I won't split clusters with fewer than 100 elements
    filter_type='none',  # we have 2 features in sample data, feature selection would be pointless
)

In [8]:
divik.fit(X)

DiviK(fast_kmeans=GAPSearch(kmeans=KMeans(n_clusters=2), max_clusters=2),
      filter_type='none',
      kmeans=DunnSearch(kmeans=KMeans(n_clusters=2), max_clusters=5),
      minimal_size=100)

In [9]:
divik.n_clusters_

7

With the heterogeneity check, the discovered structure is more likely to correspond to the actual data, because far fewer artificial clusters are created: 7 here, matching the 7 centers used to generate the sample data, versus 22 in the minimal example.
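
The effect of checking heterogeneity before each split can be sketched with a toy recursion on 1-D values. This is an illustration under simplifying assumptions, not the DiviK algorithm: `looks_heterogeneous` is a hypothetical stand-in for a real check such as the GAP index, and it only tests whether the group contains a wide empty stretch.

```python
def largest_gap(points):
    """Largest difference between neighbouring values in a sorted group."""
    return max(b - a for a, b in zip(points, points[1:]))


def looks_heterogeneous(points, threshold=1.0):
    # Hypothetical stand-in for a real check such as the GAP index:
    # the group counts as heterogeneous if it contains a wide empty stretch.
    return largest_gap(points) > threshold


def split_if_heterogeneous(points, minimal_size):
    """Split a sorted group at its widest gap, but only if the check asks for it."""
    if len(points) < minimal_size or not looks_heterogeneous(points):
        return [points]
    gaps = [b - a for a, b in zip(points, points[1:])]
    cut = gaps.index(max(gaps)) + 1
    return (
        split_if_heterogeneous(points[:cut], minimal_size)
        + split_if_heterogeneous(points[cut:], minimal_size)
    )


# Three well-separated groups of 300 evenly spaced values each.
data = sorted(
    center + i / 1000
    for center in (0.0, 10.0, 20.0)
    for i in range(300)
)
leaves = split_if_heterogeneous(data, minimal_size=100)
n_leaves = len(leaves)  # 3
```

Each of the three groups is far larger than `minimal_size`, yet none is split further, because the check finds nothing left to separate inside it.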

### `scikit-learn` K-Means implementation

You can also use the `scikit-learn` implementation of the K-Means algorithm.

In [10]:
skl_kmeans = sklKMeans(
    n_clusters=2,  # any value works here, as explained above
    random_state=42,
)

skl_divik = DiviK(
    kmeans=DunnSearch(  # we want to use Dunn's method for finding the optimal number of clusters
        kmeans=skl_kmeans,
        max_clusters=5,  # for the sake of the example I'll keep it low
    ),
    fast_kmeans=GAPSearch(  # this one is for assessment, if we should split a subregion
        kmeans=skl_kmeans,
        max_clusters=2,  # For the sake of heterogeneity check, it should be 2 for GAP index.
                         # GAP index always looks for "first feasible", so if one cluster
                         # does not yield the right solution, we split.
    ),
    minimal_size=100,  # for the sake of the example, I won't split clusters with fewer than 100 elements
    filter_type='none',  # we have 2 features in sample data, feature selection would be pointless
)

In [11]:
skl_divik.fit(X)

DiviK(fast_kmeans=GAPSearch(kmeans=KMeans(n_clusters=2, random_state=42),
                            max_clusters=2),
      filter_type='none',
      kmeans=DunnSearch(kmeans=KMeans(n_clusters=2, random_state=42),
                        max_clusters=5),
      minimal_size=100)

In [12]:
skl_divik.n_clusters_

7
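
The `n_clusters=2` placeholder can be any value because a search wrapper is free to replace it for each candidate. Below is a minimal sketch of how a `scikit-learn`-compatible wrapper could do that with the standard `clone` and `set_params` calls. Whether `DunnSearch` and `GAPSearch` work this way internally is an assumption, not something this notebook confirms.

```python
from sklearn.base import clone
from sklearn.cluster import KMeans as sklKMeans

base = sklKMeans(n_clusters=2, random_state=42)

# Build one unfitted copy per candidate number of clusters.
# clone() copies the parameters only, so the base estimator stays untouched.
candidates = [clone(base).set_params(n_clusters=k) for k in range(1, 6)]

candidate_sizes = [model.n_clusters for model in candidates]
```

Because every candidate is a fresh copy, passing the same `skl_kmeans` instance to both `DunnSearch` and `GAPSearch` would be safe under this pattern, and other parameters such as `random_state` carry over to each copy.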