# Scikit-Learn Clustering Experimentation

#### Date Created: 11/8/2020

#### This notebook builds experience with the datasets and clustering algorithms available in Scikit-Learn, with plotting them in matplotlib, and with synthetic data generation. The ultimate goal is to recreate some of the experiments conducted in:

Nathalie Barbosa Roa, Louise Travé-Massuyès, Victor Hugo Grisales. DyClee: Dynamic clustering for tracking evolving environments. Pattern Recognition, Elsevier, 2019, 94, pp. 162-186. doi:10.1016/j.patcog.2019.05.024. hal-02135580

**PLEASE NOTE that the code below is informed by the Scikit-Learn documentation and examples (primarily the third link below) and by the following book:**

Aurélien Géron. *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems* (Sebastopol, CA: O'Reilly Media, Inc., 2019).

### Useful links:
- https://scikit-learn.org/stable/datasets/index.html#sample-generators
- https://scikit-learn.org/stable/modules/clustering.html
- **CLUSTER COMPARISON NOTEBOOK** https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html

### Imports

In [None]:
from itertools import cycle, islice

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import MiniBatchKMeans, AgglomerativeClustering, AffinityPropagation, DBSCAN, Birch
from sklearn.datasets import make_moons, make_circles, make_blobs 

## Data:

### Circles, Moons, Blobs and Random

In [None]:
num_samples = 1500 # As used in DyClee paper (page 18)

# The dataset parameters below are taken from the CLUSTER COMPARISON NOTEBOOK,
# since the DyClee authors appear to have used the same parameters
np.random.seed(0)  # seeds the global RNG used by make_circles, make_moons and np.random.rand

# sklearn.datasets.make_circles(n_samples=100, *, shuffle=True, noise=None,
# random_state=None, factor=0.8)
X_circles, Y_true_circles = make_circles(num_samples, factor=.5, noise=.05)

# sklearn.datasets.make_moons(n_samples=100, *, shuffle=True, noise=None,
# random_state=None)
X_moons, Y_true_moons = make_moons(num_samples, noise=.05)

# sklearn.datasets.make_blobs(n_samples=100, n_features=2, *, centers=None,
# cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None,
# return_centers=False)
X_blobs, Y_true_blobs = make_blobs(num_samples, random_state=8)

# Random data: uniform noise with no cluster structure, so every point gets label 0
X_random = np.random.rand(num_samples, 2)
Y_true_random = np.zeros(num_samples, dtype=int)  # 1-D, like the other label arrays
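The cluster comparison example linked above standardizes each dataset before clustering. The cell below is a minimal sketch of that step, assuming `StandardScaler` is the desired scaler; the names `X_demo` and `X_demo_scaled` are hypothetical, and the data is regenerated so the cell runs on its own.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Regenerate a moons dataset so this cell is self-contained
# (random_state=0 is an assumption made for reproducibility)
X_demo, _ = make_moons(n_samples=1500, noise=0.05, random_state=0)

# Rescale each feature to zero mean and unit variance
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Per-feature mean and standard deviation after scaling
print(X_demo_scaled.mean(axis=0).round(3))
print(X_demo_scaled.std(axis=0).round(3))
```

Scaling matters mostly for distance-based parameters such as the `eps` of `DBSCAN`, which would otherwise need retuning for each dataset.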

In [None]:
# The color code below is taken from the CLUSTER COMPARISON NOTEBOOK
colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a',
                                    '#f781bf', '#a65628', '#984ea3',
                                    '#999999', '#e41a1c', '#dede00']),
                                    int(max(Y_true_blobs) + 1))))
# Append black for outliers (if any): a label of -1 indexes this last entry
colors = np.append(colors, ["#000000"])

# Plot the synthetic data, one dataset per subplot, colored by true label
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
fig.suptitle("Circles, Moons, Blobs and Random")
axs[0, 0].scatter(X_circles[:, 0], X_circles[:, 1], s=10, color=colors[Y_true_circles])
axs[0, 0].set_title("Circles")
axs[0, 1].scatter(X_moons[:, 0], X_moons[:, 1], s=10, color=colors[Y_true_moons])
axs[0, 1].set_title("Moons")
axs[1, 0].scatter(X_blobs[:, 0], X_blobs[:, 1], s=10, color=colors[Y_true_blobs])
axs[1, 0].set_title("Blobs")
axs[1, 1].scatter(X_random[:, 0], X_random[:, 1], s=10, color=colors[0])
axs[1, 1].set_title("Random")
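The color lookup above relies on NumPy integer-array indexing: each label selects one palette entry, and a label of `-1` wraps around to the black entry appended at the end. The sketch below illustrates this with a made-up label array; `palette`, `labels`, `colors_demo` and `point_colors` are hypothetical names used only here.

```python
from itertools import cycle, islice

import numpy as np

palette = ['#377eb8', '#ff7f00', '#4daf4a']

# Hypothetical cluster labels; -1 marks an outlier
labels = np.array([0, 1, -1, 2, -1])

# One color per cluster label, cycling through the palette if needed
colors_demo = np.array(list(islice(cycle(palette), int(labels.max()) + 1)))
# Black goes last, so that index -1 selects it
colors_demo = np.append(colors_demo, ["#000000"])

# Integer-array indexing maps every label to its color in one step
point_colors = colors_demo[labels]
print(point_colors)
# ['#377eb8' '#ff7f00' '#000000' '#4daf4a' '#000000']
```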

## Algorithms:
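As a starting point for this section, the cell below is a minimal sketch of running one of the imported algorithms, `DBSCAN`, on a moons dataset. The value `eps=0.3` and the standardization step are assumptions borrowed from the cluster comparison example, not settings confirmed by the DyClee paper, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Regenerate and standardize a moons dataset so this cell is self-contained
X_demo, _ = make_moons(n_samples=1500, noise=0.05, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# eps=0.3 is an assumed neighborhood radius; min_samples keeps its default of 5
labels_demo = DBSCAN(eps=0.3).fit_predict(X_demo)

# DBSCAN labels outliers as -1, so exclude that value when counting clusters
n_clusters_demo = len(set(labels_demo)) - (1 if -1 in labels_demo else 0)
n_outliers_demo = int(np.sum(labels_demo == -1))

print("clusters found:", n_clusters_demo)
print("outliers found:", n_outliers_demo)
```

The resulting labels can be passed to the `colors[...]` lookup from the plotting cell above, where `-1` maps to black.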

## Real-Time Plotting:

## Time-Series Data: