In this notebook we train a k-means clustering model on scikit-learn's load_digits dataset. We follow scikit-learn's own k-means example for this dataset: the features are standardized, the model is fitted, and the resulting clusters are scored against the true digit labels.

In [1]:
import numpy as np
from time import time
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)

n_samples, n_features = data.shape
n_digits = len(np.unique(y_digits))
labels = y_digits

sample_size = 1000

print("n_digits: %d, \t n_samples %d, \t n_features %d"
      % (n_digits, n_samples, n_features))

print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')

def bench_k_means(estimator, name, data):
    """Fit `estimator` on `data` and print one row of timing and quality metrics.

    The label-based scores compare the fitted clusters against the
    module-level `labels`; the silhouette score is estimated on a random
    subset of `sample_size` samples.
    """
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

n_digits: 10, 	 n_samples 1797, 	 n_features 64
init		time	inertia	homo	compl	v-meas	ARI	AMI	silhouette
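
Most of the columns above (homo, compl, v-meas, ARI, AMI) compare cluster assignments with the true digit labels, even though the cluster ids that k-means assigns are arbitrary. As an illustration of why that works, here is a minimal plain-Python sketch of the adjusted Rand index. It is a re-implementation for explanation only, not scikit-learn's code, and the names `pairs` and `adjusted_rand_index` are hypothetical.

```python
from collections import Counter


def pairs(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Count pairs of samples that share a cell, a true class, or a predicted cluster.
    sum_cells = sum(pairs(c) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(pairs(c) for c in Counter(labels_true).values())
    sum_pred = sum(pairs(c) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / pairs(n)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)


# Same grouping with swapped ids still scores as a perfect match.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# A grouping that cuts across the true classes scores below zero.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the score depends only on which pairs of samples are grouped together, renaming the clusters leaves it unchanged, which is what lets the benchmark compare cluster ids against digit labels directly.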


## Non-private Baseline
We now use scikit-learn's native KMeans class to establish a non-private baseline for our experiments. We run it twice, once with k-means++ initialization and once with random initialization.

In [2]:
bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=100),
              name="k-means++", data=data)

bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=100),
              name="random", data=data)

k-means++	3.76s	69408	0.603	0.651	0.626	0.467	0.622	0.144
random   	2.24s	69408	0.599	0.648	0.623	0.463	0.619	0.145
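
The inertia column is the estimator's `inertia_` attribute, which scikit-learn documents as the sum of squared distances of samples to their closest cluster center. The following is a minimal sketch of that quantity on toy data, written in plain Python for illustration; the function name `inertia` and the toy points are assumptions, not part of the benchmark.

```python
def inertia(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        # Squared distance from p to every center; keep the smallest.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centers = [(0.0, 1.0), (10.0, 0.0)]
print(inertia(points, centers))  # 2.0
```

Lower inertia means tighter clusters. In the table above both initializations report the same inertia after 100 restarts each.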


## Differentially Private K-means Clustering
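
The cell below repeats the benchmark with diffprivlib's KMeans, which takes a privacy budget `epsilon` and feature `bounds`. As background on why a differentially private model needs bounds at all, here is a minimal sketch of a noisy one-dimensional mean using the Laplace mechanism. It is an illustration under stated assumptions (add-or-remove-one-record neighbouring datasets, an even split of epsilon between sum and count), not diffprivlib's algorithm, and the names `laplace_noise` and `dp_mean` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    if scale <= 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, rng):
    """Noisy mean of values clipped to [lower, upper] (illustrative only)."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Assumption: one record added or removed changes the sum by at most
    # max(|lower|, |upper|) and the count by at most 1.
    sum_sensitivity = max(abs(lower), abs(upper))
    half_eps = epsilon / 2.0  # half the budget for the sum, half for the count
    noisy_sum = sum(clipped) + laplace_noise(sum_sensitivity / half_eps, rng)
    noisy_count = len(clipped) + laplace_noise(1.0 / half_eps, rng)
    return noisy_sum / max(noisy_count, 1.0)


rng = random.Random(0)
values = [0.2, 0.4, 0.6, 0.8] * 250  # 1000 toy values with true mean 0.5
print(dp_mean(values, 0.0, 1.0, epsilon=1.0, rng=rng))
```

The noise scale grows with the width of the bounds and shrinks as epsilon grows, so wide bounds or a small privacy budget both make a noisy centroid less accurate.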

In [3]:
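!pip install diffprivlib
# NOTE: this import shadows the sklearn.cluster.KMeans imported in the first cell.
from diffprivlib.models import KMeans

# bounds=None leaves the feature bounds to be inferred from the data, which the
# diffprivlib documentation describes as leaking privacy (PrivacyLeakWarning).
bench_k_means(KMeans(epsilon=1.0, bounds=None, n_clusters=n_digits,
                     init='k-means++', n_init=100),
              name="dp_k-means", data=data)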

Collecting diffprivlib
  Downloading https://files.pythonhosted.org/packages/fe/b8/852409057d6acc060f06cac8d0a45b73dfa54ee4fbd1577c9a7d755e9fb6/diffprivlib-0.3.0.tar.gz (70kB)
Building wheels for collected packages: diffprivlib
  Building wheel for diffprivlib (setup.py) ... done
  Created wheel for diffprivlib: filename=diffprivlib-0.3.0-cp36-none-any.whl size=138999 sha256=3b8e8c5d85d73ccc67b32a5a02cee627b08a74bb340371bb070cf4797c2f6714
  Stored in directory: /root/.cache/pip/wheels/64/68/62/617183f73d3fe

