### Compute the Adaptive Dimension Reduction Expectation Maximization (ADR-EM) algorithm

Ding C, He X, Zha H, Simon HD (2002). “Adaptive Dimension Reduction for
Clustering High Dimensional Data.” In *Proceedings 2002 IEEE International Conference on Data
Mining*, 147–154.
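
As a rough orientation, the loop this notebook runs alternates two steps: fit a Gaussian mixture in the current subspace, then re-project the features with LDA using the predicted cluster IDs as classes. The sketch below reproduces that alternation on synthetic blobs. The helper name `adr_em_sketch`, the synthetic data, and the stopping rule (stop once two consecutive partitions agree) are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture


def adr_em_sketch(X, n_clusters, max_iter=20, seed=0):
    """Alternate GMM clustering and LDA re-projection (hypothetical helper)."""
    # Start in a PCA subspace with n_clusters - 1 dimensions.
    Z = PCA(n_components=n_clusters - 1).fit_transform(X)
    prev_labels = None
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
        labels = gmm.fit_predict(Z)
        # Assumed stopping rule: the partition matches the previous one.
        if (prev_labels is not None
                and adjusted_rand_score(prev_labels, labels) >= 1.0 - 1e-9):
            break
        prev_labels = labels
        # LDA needs at least two classes to define a projection.
        if np.unique(labels).size < 2:
            break
        # Re-project the current features using the predicted labels as classes.
        Z = LinearDiscriminantAnalysis().fit_transform(Z, labels)
    return labels, n_iter


# Synthetic stand-in data, not the sketch features used in this notebook.
X_demo, y_demo = make_blobs(n_samples=300, n_features=10, centers=4, random_state=0)
demo_labels, demo_iters = adr_em_sketch(X_demo, n_clusters=4)
print(demo_iters, adjusted_rand_score(y_demo, demo_labels))
```

Stopping on an unchanged partition avoids running a fixed number of iterations when the assignments have already settled.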

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.metrics.cluster import fowlkes_mallows_score
from sklearn.preprocessing import LabelEncoder

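# Scale and visualize the embedding vectors.
# Note: the label of each point is read from the global array `y`,
# which must be assigned before this function is called.
def plot_embedding(X, title=None):
    # Min-max scale each axis to [0, 1] so the text labels fill the plot.
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure(figsize=(8, 6))
    plt.subplot(111)
    for i in range(X.shape[0]):
        # Draw each point as its label; labels of 10 or more all map to
        # the last color of the Set1 colormap.
        plt.text(X[i, 0], X[i, 1], str(y[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})
    plt.xticks([])
    plt.yticks([])
    if title is not None:
        plt.title(title)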

In [2]:
raw_npca = np.load('data/sketches_raw_nopca.npy')
fc6_npca = np.load('data/sketches_fc6_nopca.npy')
raw_pca  = np.load('data/sketches_raw_pca.npy')
fc6_pca  = np.load('data/sketches_fc6_pca.npy')
metadata = pd.read_csv('data/sketches_metadata.csv')
metadata['category_factored'] = LabelEncoder().fit_transform(metadata.category)

feature_set = [raw_npca, fc6_npca, raw_pca, fc6_pca]
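# One label vector per remaining metadata column: the rows are sorted by that
# column before the encoded categories are read off (presumably so that the
# labels line up with the row order of the matching feature array).
labels_set  = [metadata.sort_values(col).category_factored.values
               for col in metadata.drop(columns=['category', 'category_factored']).columns]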

In [3]:
K = 32     # number of clusters
r = K - 1  # dimension of the reduced subspace
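
The choice `r = K - 1` matches the number of axes LDA can return: with its default `n_components=None`, scikit-learn's `LinearDiscriminantAnalysis` keeps `min(n_classes - 1, n_features)` components. A minimal check on synthetic data (the blobs and the `*_demo` names are illustrative assumptions, not part of the notebook's data):

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in: 4 classes in 10 dimensions.
X_demo, y_demo = make_blobs(n_samples=200, n_features=10, centers=4, random_state=0)

# Default n_components=None keeps min(n_classes - 1, n_features) = 3 axes.
Z_demo = LinearDiscriminantAnalysis().fit_transform(X_demo, y_demo)
print(Z_demo.shape)  # (200, 3)
```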

In [4]:
# Use the last feature set (fc6_pca) together with the last label vector.
labels = labels_set[-1]
features = feature_set[-1]
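
These labels are later compared with predicted cluster IDs through `fowlkes_mallows_score`. That score depends only on which points are grouped together, so the cluster IDs do not need to match the encoded category values. A small illustration with toy label lists (hypothetical values, chosen only to show the behavior):

```python
from sklearn.metrics import fowlkes_mallows_score

truth = [0, 0, 1, 1]
same_partition_renamed = [1, 1, 0, 0]  # same grouping, different IDs
all_separate = [0, 1, 2, 3]            # every point in its own cluster

print(fowlkes_mallows_score(truth, same_partition_renamed))  # 1.0
print(fowlkes_mallows_score(truth, all_separate))            # 0.0
```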

In [5]:
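# Standardize each dimension to mean 0 and (approximately) variance 1,
# then reduce to r dimensions with PCA.
# The small constant in the denominator guards against zero-variance columns.
features = features - features.mean(axis=0)
features = features / (features.std(axis=0) + 1e-6)
features = PCA(n_components=r).fit_transform(features)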
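
The standardize-then-PCA step can also be written with scikit-learn's `StandardScaler`, which leaves zero-variance columns at zero instead of dividing by zero. The following is a sketch on random data with one deliberately constant column; the array shapes and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
X_demo[:, 3] = 5.0  # constant column: its standard deviation is zero

# StandardScaler centers every column and rescales the non-constant ones.
X_scaled_demo = StandardScaler().fit_transform(X_demo)

# Project the standardized data onto its first 4 principal components.
X_reduced_demo = PCA(n_components=4).fit_transform(X_scaled_demo)
print(np.isfinite(X_scaled_demo).all(), X_reduced_demo.shape)
```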

In [6]:
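for i in range(100):
    # Cluster in the current subspace, then use the predicted cluster IDs
    # as classes for LDA to obtain the subspace for the next iteration.
    pred_labels = GaussianMixture(n_components=K, random_state=0).fit_predict(features)
    features = LinearDiscriminantAnalysis().fit_transform(features, pred_labels)
    print(f'On iteration {i} | score: {fowlkes_mallows_score(labels, pred_labels)}')
    # Optional: visualize each iteration with t-SNE (plot_embedding reads the global `y`).
#     features_TSNE = TSNE(n_components=2,random_state=0).fit_transform(features)
#     y=pred_labels
#     plot_embedding(features_TSNE, f"t-SNE embedding on iteration {i+1}")
#     plt.show()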


On iteration 0 | score: 0.03436982437844208
On iteration 1 | score: 0.034309462657735185
On iteration 2 | score: 0.03344988328969336
On iteration 3 | score: 0.03358678062246045
On iteration 4 | score: 0.033463462134777536
On iteration 5 | score: 0.03390613475726522
On iteration 6 | score: 0.03317221294299476
On iteration 7 | score: 0.033369081873519016
On iteration 8 | score: 0.03401252153887691
On iteration 9 | score: 0.03317245066302933
On iteration 10 | score: 0.03443579427974891
On iteration 11 | score: 0.0344635299357771
On iteration 12 | score: 0.03471251726948786
On iteration 13 | score: 0.03346215719753251
On iteration 14 | score: 0.03370646932954293
On iteration 15 | score: 0.0329305227724656
On iteration 16 | score: 0.033829239338318544
On iteration 17 | score: 0.03349232133322897
On iteration 18 | score: 0.03340476068905491
On iteration 19 | score: 0.03373636762994072
On iteration 20 | score: 0.0329854410485013
On iteration 21 | score: 0.03409499754332015
On iteration 22 | s