# Differentially Private Histograms

## Plotting the distribution of ages in `Adult`

In [None]:
import numpy as np
import pandas as pd
from diffprivlib import tools as dp
import matplotlib.pyplot as plt
import datetime
import random

from dq0sdk.data.utils.dp_methods import _dp_stats

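We first read in the Adult UCI dataset. All columns are loaded (`usecols=None`); the ages are in the first column. A few synthetic columns (timedelta, datetime and boolean) are appended so that several data types are covered.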

In [None]:
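usecols = None  # e.g. [0, 2, 4, 12] to load only a subset of columns
ages_adult_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                            usecols=usecols, header=None)

# Append synthetic columns to cover more data types
ts = pd.Timestamp(datetime.datetime.now())
# Timedelta column: a single random offset of 0-9 days, repeated for every row
ages_adult_df[15] = pd.Timedelta('1 days') * int(np.random.sample(1)[0] * 10)
# Datetime column: the offset added to the current timestamp
ages_adult_df[16] = ages_adult_df[15] + ts
# Boolean column: one random bit per row
ages_adult_df[17] = [bool(random.getrandbits(1)) for _ in range(ages_adult_df.shape[0])]
ages_adult_df[0] = ages_adult_df[0].astype(float)       # ages as float
ages_adult_df[1] = ages_adult_df[1].astype('category')  # second column as categorical

print(ages_adult_df.dtypes)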

In [None]:
content = ages_adult_df
mean, std, hist = _dp_stats(content, epsilon=1, randomize_range=False)


### All the warnings above result from not setting the range

### All the warnings below are from the histograms for non-numeric data, where `range` is set to None

In [None]:
epsilon = 0.005

dp_mean, dp_std, dp_hist = _dp_stats(content, epsilon)

nrows = (len(dp_hist) + 1) // 2  # two plots per row, rounded up
# squeeze=False keeps `axes` two-dimensional even when there is only one row
fig, axes = plt.subplots(nrows, 2, figsize=(16, nrows * 6), squeeze=False)
for ii, (_dp_hist, _hist) in enumerate(zip(dp_hist, hist)):
    ax = axes[ii // 2, ii % 2]
    ax.bar(_hist[1][:-1], _hist[0], width=(_hist[1][1] - _hist[1][0]) * 0.9)
    ax.bar(_dp_hist[1][:-1], _dp_hist[0], width=(_dp_hist[1][1] - _dp_hist[1][0]) * 0.9, alpha=0.5)
    ax.set_title(str(content.dtypes.values[ii]))
axes[0, 0].legend(['no dp', 'dp'])

## The original content from `diffprivlib`

In [None]:
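# Only the first column is read, so a single-character delimiter is sufficient
# (recent NumPy versions reject multi-character delimiters such as ", ").
ages_adult = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=0, delimiter=",")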

Using NumPy's native `histogram` function, we can find the distribution of ages over ten equally spaced bins, which `histogram` computes from the data.

In [None]:
hist, bins = np.histogram(ages_adult)
hist = hist / hist.sum()

Using `matplotlib.pyplot`, we can plot the histogram distribution as a bar chart.

In [None]:
plt.bar(bins[:-1], hist, width=(bins[1]-bins[0]) * 0.9)
plt.show()

## Differentially private histograms

Using `diffprivlib`, we can calculate a differentially private version of the histogram. For this example, we use the default settings:
- `epsilon` is 1.0
- `range` is not specified, so it is calculated by the function on the fly. This raises a warning, because it leaks information about the data (from `dp_bins`, we learn that the dataset contains people aged 17 and 90).

In [None]:
dp_hist, dp_bins = dp.histogram(ages_adult)
dp_hist = dp_hist / dp_hist.sum()

plt.bar(dp_bins[:-1], dp_hist, width=(dp_bins[1] - dp_bins[0]) * 0.9)
plt.show()

**Privacy Leak:** In this setting, we know for sure that at least one person in the dataset is aged 17, and another is aged 90.

In [None]:
dp_bins[0], dp_bins[-1]
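To see why a data-derived range is a leak, the following minimal sketch uses plain NumPy on a small made-up array (`toy_ages` is hypothetical, not the Adult data). When no range is given, the outermost bin edges are exactly the minimum and maximum of the input, so publishing the edges publishes those two values without any noise.

```python
import numpy as np

# Hypothetical toy data standing in for a private column of ages.
toy_ages = np.array([17, 23, 35, 35, 41, 58, 62, 90], dtype=float)

# With no explicit range, the bin edges are derived from the data itself.
_, toy_bins = np.histogram(toy_ages)

# The first and last edges coincide with the true minimum and maximum.
leaks_min = toy_bins[0] == toy_ages.min()
leaks_max = toy_bins[-1] == toy_ages.max()
print(leaks_min, leaks_max)
```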

**Mirroring the behaviour of `np.histogram`:** We can see that the bins returned by `diffprivlib.tools.histogram` are identical to those given by `numpy.histogram`.

In [None]:
np.all(dp_bins == bins)

**Error:** We see very little difference in the values of the histogram: the aggregate absolute error across all bins is of the order of 0.01%. This is expected, given the large size of the dataset (`n=32561` rows in `adult.data`).

In [None]:
print("Total histogram error: %f" % np.abs(hist - dp_hist).sum())

**Effect of `epsilon`:** If we decrease `epsilon` (i.e. **increase** the privacy guarantee), the error will increase.

In [None]:
dp_hist, dp_bins = dp.histogram(ages_adult, epsilon=0.001)
dp_hist = dp_hist / dp_hist.sum()

print("Total histogram error: %f" % np.abs(hist - dp_hist).sum())
plt.bar(dp_bins[:-1], dp_hist, width=(dp_bins[1] - dp_bins[0]) * 0.9)
plt.show()
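One way to picture why a smaller `epsilon` increases the error: in the textbook Laplace mechanism, each bin count is perturbed with noise of scale `1/epsilon` (assuming each person changes one bin count by at most one). The sketch below uses hypothetical counts and plain NumPy. It only illustrates the trend and should not be read as `diffprivlib`'s internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical bin counts standing in for a true histogram.
true_counts = np.array([500, 1200, 2100, 1800, 900], dtype=float)
true_hist = true_counts / true_counts.sum()

def noisy_hist_error(epsilon):
    """Add Laplace noise of scale 1/epsilon to each count and return the L1 error."""
    noise = rng.laplace(scale=1.0 / epsilon, size=true_counts.size)
    noisy = np.clip(true_counts + noise, 0, None)  # counts cannot be negative
    noisy_hist = noisy / noisy.sum()
    return np.abs(true_hist - noisy_hist).sum()

# Smaller epsilon means larger noise and hence a larger total error.
errors = {eps: noisy_hist_error(eps) for eps in (1.0, 0.1, 0.01)}
for eps, err in errors.items():
    print("epsilon=%g  total error=%f" % (eps, err))
```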

In [None]:
# Return both values as a tuple so that the notebook displays each of them
dp.mean(ages_adult, epsilon=0.001), dp.std(ages_adult, epsilon=0.001)
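For intuition about what a private mean involves, here is a minimal sketch of a Laplace-mechanism mean. It assumes publicly known bounds `lower` and `upper`, a known dataset size, and neighbouring datasets that differ in one record, so that the sensitivity is `(upper - lower) / n`. The function name `laplace_mean` and the toy data are hypothetical; `diffprivlib` may compute its mean differently.

```python
import numpy as np

def laplace_mean(values, epsilon, lower, upper, rng):
    """Illustrative Laplace-mechanism mean; not diffprivlib's implementation."""
    clipped = np.clip(values, lower, upper)        # enforce the public bounds
    sensitivity = (upper - lower) / len(clipped)   # max effect of changing one record
    noise = rng.laplace(scale=sensitivity / epsilon)
    return clipped.mean() + noise

rng = np.random.default_rng(0)
toy_ages = rng.integers(17, 91, size=1000).astype(float)  # hypothetical ages

for eps in (1.0, 0.001):
    estimate = laplace_mean(toy_ages, eps, lower=17, upper=100, rng=rng)
    print("epsilon=%g  true mean=%.3f  private mean=%.3f" % (eps, toy_ages.mean(), estimate))
```

With `epsilon=1.0` the noise scale here is 0.083, so the estimate stays close to the true mean; with `epsilon=0.001` the scale is 83 and the estimate can be far off.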


## Deciding on the `range` parameter

We know from the [dataset description](https://archive.ics.uci.edu/ml/datasets/adult) that everyone in the dataset is at least 17 years of age. We don't know off-hand what the upper bound is, so for this example we'll set the upper bound to `100`. As of 2019, less than 0.005% of the world's population is [aged over 100](https://en.wikipedia.org/wiki/Centenarian), so this is an appropriate simplification. Values in the dataset above 100 will be excluded from calculations.
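The effect of a fixed range can be sketched with plain NumPy on made-up data (`toy_ages` and `public_range` are hypothetical names). The bin edges then depend only on the chosen range, and values outside it are ignored.

```python
import numpy as np

# Hypothetical toy data; 104 falls outside the chosen public range.
toy_ages = np.array([19, 23, 35, 41, 58, 62, 88, 104], dtype=float)

public_range = (17, 100)  # chosen from domain knowledge, not from the data

counts, edges = np.histogram(toy_ages, range=public_range)

# The edges come from public_range alone; the out-of-range value is dropped.
print(edges[0], edges[-1], counts.sum())
```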

Even an `epsilon` of 0.005 (the value used below) still preserves the broad structure of the histogram.

In [None]:
epsilon = 0.005
age_range = (17, 100)  # renamed to avoid shadowing the built-in `range`

print('data size: %d' % len(ages_adult))

# Use the same range as the private histogram so that the bins are comparable
hist, bins = np.histogram(ages_adult, range=age_range)
hist = hist / hist.sum()

dp_hist2, dp_bins2 = dp.histogram(ages_adult, epsilon=epsilon, range=age_range)
dp_hist2 = dp_hist2 / dp_hist2.sum()

dp_mean2 = dp.mean(ages_adult, epsilon=epsilon, range=age_range[1])
dp_std2 = dp.std(ages_adult, epsilon=epsilon, range=age_range[1])

print("mean: {:8.6f} +- {:8.6f}".format(dp_mean2, np.abs(np.mean(ages_adult) - dp_mean2)))
print("std: {:8.6f} +- {:8.6f}".format(dp_std2, np.abs(np.std(ages_adult) - dp_std2)))

print("Total histogram error: %f" % np.abs(hist - dp_hist2).sum())
plt.bar(dp_bins2[:-1], dp_hist2, width=(dp_bins2[1] - dp_bins2[0]) * 0.9)
plt.show()

## Error for smaller datasets

Let's repeat the first experiments above with a smaller dataset, this time the [Cleveland heart disease dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) from the UCI Repository. This dataset has 303 samples, a small fraction of the Adult dataset processed previously.

In [None]:
ages_heart = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
                        usecols=0, delimiter=",")

We first find the histogram distribution using `numpy.histogram`.

In [None]:
heart_hist, heart_bins = np.histogram(ages_heart)
heart_hist = heart_hist / heart_hist.sum()

We then find the histogram distribution using `diffprivlib.tools.histogram`, with the defaults as before (and the accompanying warning).

In [None]:
dp_heart_hist, dp_heart_bins = dp.histogram(ages_heart)
dp_heart_hist = dp_heart_hist / dp_heart_hist.sum()

And double-check that the bins are the same.

In [None]:
np.all(heart_bins == dp_heart_bins)

The error this time is around 3%, roughly two orders of magnitude larger than for the Adult dataset.

In [None]:
print("Total histogram error: %f" % np.abs(heart_hist - dp_heart_hist).sum())
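A back-of-the-envelope estimate makes the dependence on dataset size explicit. Assuming Laplace-like noise with a mean absolute value of `1/epsilon` per bin, the total absolute noise over `k` bins is about `k/epsilon`, and after normalising by `n` samples the error is roughly `k / (epsilon * n)`. The sizes below are hypothetical round numbers on the scale of the two datasets, and the formula ignores the small effect of renormalising the noisy counts.

```python
def approx_relative_error(n_samples, n_bins=10, epsilon=1.0):
    """Rough normalised L1 error, assuming mean absolute noise 1/epsilon per bin."""
    return n_bins / (epsilon * n_samples)

# Hypothetical sizes: one large dataset and one that is 100 times smaller.
for label, n in (("large dataset", 30000), ("small dataset", 300)):
    print("%-14s n=%5d  approx. error=%.3f%%" % (label, n, 100 * approx_relative_error(n)))
```

Under these assumptions the error scales as `1/n`, so a dataset 100 times smaller gives an error about 100 times larger.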

## Mirroring NumPy's behaviour

We can evaluate `diffprivlib.tools.histogram` without any privacy by setting `epsilon = float("inf")`. This should give exactly the same result as running `numpy.histogram`.

In [None]:
heart_hist, _ = np.histogram(ages_heart)
dp_heart_hist, _ = dp.histogram(ages_heart, epsilon=float("inf"))

np.all(heart_hist == dp_heart_hist)
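The same behaviour can be seen in a toy Laplace-style sketch: a noise scale of `1/epsilon` becomes zero when `epsilon` is infinite, so the counts are returned unchanged. The counts below are hypothetical, and this illustrates the idea only, not how `diffprivlib` treats an infinite `epsilon` internally.

```python
import numpy as np

rng = np.random.default_rng(0)
true_counts = np.array([3, 7, 12, 5], dtype=float)  # hypothetical bin counts

scale = 1.0 / float("inf")  # evaluates to 0.0

# Unit-scale Laplace noise, rescaled by the (zero) noise scale.
noise = scale * rng.laplace(size=true_counts.size)
noisy_counts = true_counts + noise

print(scale, np.all(noisy_counts == true_counts))
```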