# Basic PUMS Analysis with WhiteNoise

This notebook is a brief tutorial on data analysis within the WhiteNoise system.

We start by setting up our environment: loading the necessary libraries and defining the basics we need before loading the data (the file path and variable names).

In [1]:
# load libraries
import os
import whitenoise
import whitenoise.components as op

# establish data information
data_path = os.path.join('..', 'data', 'PUMS_california_demographics_1000', 'data.csv')
var_names = ["age", "sex", "educ", "race", "income", "married"]


Let's say that we have access to the PUMS codebook, and thus know some basic information about the possible values for the variables in the data. Many differentially private algorithms require information of this type (a range for a continuous variable and a set of feasible values for a categorical variable). It is not necessary to set these up front, but we will do so for the sake of clarity.

We also need to give an estimate of the sample size of the data in question. In general, this could be based on true knowledge of the data, an educated guess, or a differentially private estimate. We know, by construction of the data set, that this is a 1,000 person sample.
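To make the idea of a differentially private sample-size estimate concrete, here is a minimal sketch that uses only the Python standard library, not the WhiteNoise API. It assumes neighboring datasets differ by adding or removing one record, so a count has sensitivity 1. The names `laplace_noise` and `dp_count` are hypothetical, and the sampler is for illustration only (it is not hardened against floating-point attacks).

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(records, epsilon, rng):
    # Hypothetical helper: adding or removing one record changes the count
    # by at most 1, so Laplace noise with scale 1 / epsilon is added.
    sensitivity = 1.0
    return len(records) + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)          # seeded only to make the sketch reproducible
records = list(range(1000))     # stand-in for 1,000 rows of data
n_estimate = dp_count(records, epsilon=0.1, rng=rng)
print(round(n_estimate))
```

Any epsilon spent on such an estimate counts against the overall privacy budget of the analysis.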

In [2]:
# set sample size
n = 1_000

# set ranges/feasible values
age_range = (0., 100.)
sex_vals = [0, 1]
educ_vals = list(range(1, 17))
race_vals = list(range(1, 7))
income_range = (0., 500_000.)
married_vals = [0, 1]

Now we can perform a basic analysis. Let's start by considering a differentially private mean of `age`. We will begin with a few failed attempts to build intuition for the required steps.

In [3]:
# attempt 1 - fails because of nullity
with whitenoise.Analysis() as analysis:
    # load data
    data = whitenoise.Dataset(path = data_path, column_names = var_names)
    
    ''' get mean age '''
    # establish data 
    age_dt = op.cast(data['age'], 'FLOAT')
    
    # calculate differentially private mean of age
    age_mean = op.dp_mean(data = age_dt, privacy_usage={'epsilon': .65})

analysis.release()

RuntimeError: 
  Error: node specification LaplaceMechanism(LaplaceMechanism { privacy_usage: [PrivacyUsage { distance: Some(DistancePure(DistancePure { epsilon: 0.65 })) }] }):
Caused by: node specification Mean(Mean):
Caused by: data may contain nullity when non-nullity is required


Notice that `dp_mean` requires the data to have the property `nullity = False`.
We can get around this by using `impute`. We will impute from a `Gaussian(mean = 45, sd = 10)` distribution, truncated so that no values fall outside the age range we established earlier.
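As a rough illustration of what imputing from a truncated Gaussian means, the following standard-library sketch replaces missing entries with draws from `Gaussian(45, 10)` restricted to `[0, 100]`. It uses rejection sampling, which is an assumption made for simplicity; how WhiteNoise implements `impute` internally may differ. The helper names are hypothetical.

```python
import math
import random

def sample_truncated_gaussian(shift, scale, lower, upper, rng):
    # Rejection sampling: redraw until the value lands inside [lower, upper].
    while True:
        draw = rng.gauss(shift, scale)
        if lower <= draw <= upper:
            return draw

def is_missing(value):
    # Treat None and NaN as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))

def impute_missing(values, shift, scale, lower, upper, rng):
    # Hypothetical helper: keep observed values, replace missing ones.
    return [
        sample_truncated_gaussian(shift, scale, lower, upper, rng)
        if is_missing(v) else v
        for v in values
    ]

rng = random.Random(0)
ages = [23.0, None, 61.0, float("nan"), 40.0]
imputed = impute_missing(ages, shift=45.0, scale=10.0,
                         lower=0.0, upper=100.0, rng=rng)
print(imputed)
```

After this step every entry is a real number inside the age range, which is the non-nullity property that `dp_mean` asks for.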

In [None]:
# attempt 2 - fails because of undefined min/max
with whitenoise.Analysis() as analysis:
    # load data
    data = whitenoise.Dataset(path = data_path, column_names = var_names)
    
    ''' get mean age '''
    # establish data 
    age_dt = op.cast(data['age'], 'FLOAT')
    
    # impute missing values
    age_dt = op.impute(data = age_dt, distribution = 'Gaussian',
                                      min = age_range[0], max = age_range[1],
                                      shift = 45., scale = 10.)
    
    # calculate differentially private mean of age
    age_mean = op.dp_mean(data = age_dt, privacy_usage={'epsilon': .65})
     
analysis.release()

This attempt fails because `dp_mean` needs to know the `min` value (in fact, it also needs to know `max`). We provide both with `clamp`, which we parameterize with the minimum and maximum values of age we established at the beginning.
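Why does a private mean need bounds at all? With every value clamped to `[min, max]` and `n` fixed, replacing one record can move the mean by at most `(max - min) / n`, and that bound sets the noise scale. The sketch below shows this with a plain Laplace mechanism in standard-library Python. It is not the WhiteNoise implementation; `clamp`, `laplace_noise`, and `dp_mean_sketch` are hypothetical names, and the sampler is illustrative rather than hardened.

```python
import random

def clamp(values, lower, upper):
    # Force every value into [lower, upper].
    return [min(max(v, lower), upper) for v in values]

def laplace_noise(scale, rng):
    # Difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_mean_sketch(values, lower, upper, epsilon, rng):
    # Assumes n is public and neighboring datasets differ in one record.
    clamped = clamp(values, lower, upper)
    n = len(clamped)
    sensitivity = (upper - lower) / n   # largest change one record can cause
    return sum(clamped) / n + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
ages = [float(20 + i % 50) for i in range(1000)]   # synthetic ages, mean 44.5
estimate = dp_mean_sketch(ages, lower=0.0, upper=100.0, epsilon=0.65, rng=rng)
print(estimate)
```

Without known bounds the sensitivity would be unbounded, and no finite amount of noise could guarantee privacy.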

In [None]:
# attempt 3 - fails because of undefined n
with whitenoise.Analysis() as analysis:
    # load data
    data = whitenoise.Dataset(path = data_path, column_names = var_names)
    
    ''' get mean age '''
    # establish data 
    age_dt = op.cast(data['age'], 'FLOAT')
    
    # clamp data to range and impute missing values
    age_dt = op.clamp(data = age_dt, min = age_range[0], max = age_range[1])
    age_dt = op.impute(data = age_dt, distribution = 'Gaussian',
                                      min = age_range[0], max = age_range[1],
                                      shift = 45., scale = 10.)
    
    # calculate differentially private mean of age
    age_mean = op.dp_mean(data = age_dt, privacy_usage={'epsilon': .65})

    
analysis.release()

WhiteNoise requires `n` to be specified before a release can be considered valid.
We know the true `n` in this case, but this will not always be the case. We call `resize` to ensure that the data are consistent with the `n` we provide.
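One plausible reading of what a resize step does is sketched below: if the data have fewer than `n` rows, pad with synthetic draws from the imputation distribution; if they have more, keep a random subset of `n` rows. This is an assumption for illustration, written with the standard library only, and WhiteNoise's actual `resize` may behave differently. All function names are hypothetical.

```python
import random

def truncated_gaussian(shift, scale, lower, upper, rng):
    # Rejection sampling from a Gaussian restricted to [lower, upper].
    while True:
        draw = rng.gauss(shift, scale)
        if lower <= draw <= upper:
            return draw

def resize_sketch(values, n, shift, scale, lower, upper, rng):
    # Too many rows: keep a uniformly random subset of size n.
    if len(values) >= n:
        return rng.sample(values, n)
    # Too few rows: pad with synthetic values until there are exactly n.
    padding = [truncated_gaussian(shift, scale, lower, upper, rng)
               for _ in range(n - len(values))]
    return values + padding

rng = random.Random(0)
padded = resize_sketch([30.0, 45.0, 60.0], 1000, 45.0, 10.0, 0.0, 100.0, rng)
trimmed = resize_sketch([50.0] * 1200, 1000, 45.0, 10.0, 0.0, 100.0, rng)
print(len(padded), len(trimmed))
```

Either way, the downstream statistic can rely on the row count matching the `n` used to calibrate its noise.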

In [4]:
# attempt 4 - succeeds!
with whitenoise.Analysis() as analysis:
    # load data
    data = whitenoise.Dataset(path = data_path, column_names = var_names)
    
    ''' get mean age '''
    # establish data 
    age_dt = op.cast(data['age'], 'FLOAT')
    
    # clamp data to range and impute missing values
    age_dt = op.clamp(data = age_dt, min = age_range[0], max = age_range[1])
    age_dt = op.impute(data = age_dt, distribution = 'Gaussian',
                                      min = age_range[0], max = age_range[1],
                                      shift = 45., scale = 10.)
    
    # ensure data are consistent with proposed n
    age_dt = op.resize(data = age_dt, n = n, distribution = 'Gaussian',
                       min = age_range[0], max = age_range[1],
                       shift = 45., scale = 10.)
    
    # calculate differentially private mean of age
    age_mean = op.dp_mean(data = age_dt, privacy_usage={'epsilon': .65})
        
    ''' get variance of age '''
    # calculate differentially private variance of age
    age_var = op.dp_variance(data = age_dt, privacy_usage={'epsilon': .35})
    
analysis.release()

# print differentially private estimates of mean and variance of age
print(age_mean.value)
print(age_var.value)

44.7431753302894
242.451769137952


Notice that we asked for an extra `dp_variance` at the end without having to use `clamp`, `impute`, or `resize`. Because these functions update the properties of `age_dt` as they are called, `dp_variance` has everything it needs from `age_dt` when we call it.

Now that we have a sense of how to build up a statistic step by step, we can run through a much quicker version. We simply provide `data_min`, `data_max`, and `data_n`, and the `clamp`, `impute`, and `resize` steps are all performed implicitly. You'll notice that we don't even provide a `distribution` argument, even though it is needed for `impute`. For some arguments, we have (what we believe to be) reasonable defaults that are used if not provided explicitly.

In [7]:
with whitenoise.Analysis() as analysis:
    # load data
    data = whitenoise.Dataset(path = data_path, column_names = var_names)

    # get mean of age
    age_mean = op.dp_mean(data = op.cast(data['age'], type="FLOAT"),
                          privacy_usage = {'epsilon': .65},
                          data_min = 0.,
                          data_max = 100.,
                          data_n = 1000
                         )
    # get variance of age
    age_var = op.dp_variance(data = op.cast(data['age'], type="FLOAT"),
                             privacy_usage = {'epsilon': .35},
                             data_min = 0.,
                             data_max = 100.,
                             data_n = 1000
                            )
analysis.release()

print("DP mean of age: {0}".format(age_mean.value))
print("DP variance of age: {0}".format(age_var.value))
print("Privacy usage: {0}".format(analysis.privacy_usage))

DP mean of age: 44.87714677948631
DP variance of age: 300.67225224189036
Privacy usage: distance_pure {
  epsilon: 1.0
}



We see that the two DP releases within our analysis compose in a simple way: the individual epsilons we set add together for a total privacy usage of 1.
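The additive rule used here is basic sequential composition for pure differential privacy: releases with budgets ε₁, ..., ε_k on the same data together satisfy (ε₁ + ... + ε_k)-DP. A small standard-library sketch of that bookkeeping follows. The helpers `compose_pure` and `within_budget` are hypothetical and are not part of WhiteNoise.

```python
import math

def compose_pure(epsilons):
    # Basic sequential composition: pure-DP epsilons simply add.
    return math.fsum(epsilons)

def within_budget(epsilons, budget, tol=1e-12):
    # Check a planned set of releases against a total budget,
    # with a small tolerance for floating-point rounding.
    return compose_pure(epsilons) <= budget + tol

total = compose_pure([0.65, 0.35])   # mean and variance releases above
print(total)
print(within_budget([0.65, 0.35], budget=1.0))
```

Checking planned releases against a budget before running them avoids spending more epsilon than intended.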