# Basic PUMS Analysis with OpenDP

This notebook is a brief tutorial on doing data analysis within the OpenDP system.

We start by setting up our environment: loading the necessary libraries and defining what we need to know
before loading our data (the file path and variable names).

In [1]:
# load libraries
import os

from opendp.trans import *
from opendp.meas import *
from opendp.core import *
from opendp.mod import enable_features

enable_features("floating-point")

# establish data information
data_path = os.path.join('.', 'data', 'PUMS_california_demographics_1000', 'data.csv')
var_names = ["age", "sex", "educ", "race", "income", "married"]

# TODO: Remove column headers
with open(data_path) as input_data:
    data = input_data.read()

### Properties

*TODO* OpenDP architecture description here

Let's examine how we can read and process data within a computation chain. We already have the data as a string, read in
from a file.
Here are the first 6 lines of the data:

In [2]:
print('\n'.join(data.split('\n')[:6]))

59,1,9,1,0,1
31,0,1,3,17000,0
36,1,11,1,0,1
54,1,11,1,9100,1
39,0,5,3,37000,0
34,0,9,1,0,1


Now let's read this data into a computation chain and select the age column. Notice that we pass the column names to
`make_split_dataframe` through `col_names`. `make_select_column` then gives us the column matching `key` as a particular
type. In this case, it gives us the age column as strings:

In [3]:
preprocessor = (
    # Convert data into Vec<Vec<String>>
    make_split_dataframe(separator=",", col_names=var_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="age", T=str)
)
res = preprocessor(data)
print(type(res))
print(res)

<class 'list'>
['59', '31', '36', '54', '39', '34', '93', '69', '40', '27', '59', '31', '73', '89', '39', '51', '32', '52', '24', '48', '51', '43', '29', '44', '87', '27', '58', '32', '74', '28', '70', '35', '36', '63', '21', '29', '44', '35', '43', '59', '53', '42', '32', '50', '18', '40', '42', '52', '56', '45', '39', '28', '46', '45', '32', '22', '52', '21', '60', '77', '38', '40', '34', '48', '69', '46', '40', '26', '37', '30', '70', '42', '24', '31', '20', '33', '47', '19', '33', '66', '23', '51', '23', '47', '48', '43', '31', '47', '74', '93', '55', '29', '33', '50', '28', '29', '38', '42', '50', '77', '37', '40', '30', '19', '50', '82', '22', '63', '48', '46', '45', '37', '76', '25', '40', '34', '56', '43', '42', '22', '45', '32', '23', '19', '52', '44', '43', '34', '32', '38', '40', '82', '31', '42', '47', '66', '30', '42', '50', '26', '53', '22', '28', '25', '30', '86', '33', '62', '23', '36', '84', '29', '29', '18', '44', '61', '72', '59', '33', '51', '35', '47', '50', '30', 
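
As a rough mental model of the two steps above, the following plain-Python sketch parses delimited text into named columns and then picks one of them. It is not OpenDP's implementation; `split_dataframe` and `select_column` are hypothetical helper names used only for illustration.

```python
def split_dataframe(text, col_names, separator=","):
    """Parse delimited text into a dict mapping column name -> list of strings."""
    columns = {name: [] for name in col_names}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(separator)
        for name, field in zip(col_names, fields):
            columns[name].append(field)
    return columns


def select_column(columns, key):
    """Return one column of the parsed data by name (values stay strings)."""
    return columns[key]


# Three rows copied from the preview of the data above
sample = "59,1,9,1,0,1\n31,0,1,3,17000,0\n36,1,11,1,0,1\n"
names = ["age", "sex", "educ", "race", "income", "married"]
ages = select_column(split_dataframe(sample, names), "age")
print(ages)  # ['59', '31', '36']
```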

Age doesn't make sense as a string for our purposes, so let's cast it to an integer:

In [4]:
# Create a chained computation
preprocessor = (
    # Convert data into Vec<Vec<String>>
    make_split_dataframe(separator=",", col_names=var_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="age", T=str)
)
# Cast the column to integers, then impute values that fail to cast with 0
chain = preprocessor >> make_cast(TI=str, TO=int) >> make_impute_constant(0)
print(chain(data)[:10])

[59, 31, 36, 54, 39, 34, 93, 69, 40, 27]
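
The cast-then-impute pattern can be sketched in plain Python as below. This assumes that a failed cast is treated as a missing value which the imputation step then fills in; `cast_to_int` and `impute_constant` are hypothetical names, not OpenDP functions.

```python
def cast_to_int(values):
    """Cast each string to int; values that fail to parse become None."""
    out = []
    for v in values:
        try:
            out.append(int(v))
        except ValueError:
            out.append(None)
    return out


def impute_constant(values, constant):
    """Replace every missing (None) entry with a fixed constant."""
    return [constant if v is None else v for v in values]


# Made-up input with two unparseable entries
raw = ["59", "31", "", "n/a", "39"]
cleaned = impute_constant(cast_to_int(raw), 0)
print(cleaned)  # [59, 31, 0, 0, 39]
```

Imputing with a constant keeps the output length equal to the input length, which matters later when a statistic depends on the number of records.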


Now that we have an age column as integers and we've imputed any missing values, let's clamp the values to a defined
range so that we can quantify our sensitivity for future computations:

In [5]:
# Create a chained computation
preprocessor = (
    # Convert data into Vec<Vec<String>>
    make_split_dataframe(separator=",", col_names=var_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="age", T=str) >>
    # Cast the column as Vec<Int>
    make_cast(TI=str, TO=int) >>
    # Impute missing values to 0
    make_impute_constant(0) >>
    # Clamp age values to the range [20, 50]
    make_clamp(20, 50)
)
res = preprocessor(data)
print(res[:10])

[50, 31, 36, 50, 39, 34, 50, 50, 40, 27]
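
Clamping itself is a simple element-wise operation. A minimal plain-Python sketch (the `clamp` helper is hypothetical, not the OpenDP function) applied to the first ten ages shown earlier:

```python
def clamp(values, lower, upper):
    """Force every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


ages = [59, 31, 36, 54, 39, 34, 93, 69, 40, 27]
clamped = clamp(ages, 20, 50)
print(clamped)  # [50, 31, 36, 50, 39, 34, 50, 50, 40, 27]
```

After clamping, every record is guaranteed to lie in [20, 50] regardless of what the raw data contained, which is what makes a bound on sensitivity possible.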


Time to compute our first aggregate statistic. Suppose we want to know the sum of the ages in our dataset.
We can add one more step to our previous computation chain: make_bounded_sum. This will take the result of make_clamp,
and calculate the sum over the domain [20, 50].

In [6]:
preprocessor = (
    # Convert data into Vec<Vec<String>>
    make_split_dataframe(separator=",", col_names=var_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="age", T=str) >>
    # Cast the column as Vec<Int>
    make_cast(TI=str, TO=int) >>
    # Impute missing values to 0
    make_impute_constant(0) >>
    # Clamp age values
    make_clamp(20, 50) >>
    make_bounded_sum(lower=20, upper=50)
)

res = preprocessor(data)
print(res)


39650
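
To see why the bounds matter, consider how much the sum can change when one record is added to or removed from the data. The sketch below works this out under the assumption of add/remove-one neighboring datasets; `bounded_sum_sensitivity` is a hypothetical helper, and the short list of values is made up for illustration.

```python
def bounded_sum_sensitivity(lower, upper):
    """Largest change in a sum of values in [lower, upper]
    when a single record is added or removed."""
    return max(abs(lower), abs(upper))


clamped = [50, 31, 36, 50, 39]
neighbor = clamped + [50]  # one extra record at the upper bound (worst case)

change = abs(sum(neighbor) - sum(clamped))
sensitivity = bounded_sum_sensitivity(20, 50)
print(change, sensitivity)  # 50 50
```

Without clamping, a single extreme record could shift the sum by an arbitrary amount, so no finite noise scale would suffice.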


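We may be more interested in the mean age of the data. Then we can take the result of the clamp, resize the data to a
known length with `make_resize_bounded` (here 100 records), and call `make_bounded_mean` over the same domain.
The final step, `make_base_laplace`, adds Laplace noise, so the value printed below is a noisy release rather than the exact mean.

Our DP bounded mean age is: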

In [7]:
chain = (
    # Convert data into Vec<Vec<String>>
    # TODO: how to remove column headers?
    make_split_dataframe(separator=",", col_names=var_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="age", T=str) >>
    # Cast the column as Vec<float>
    make_cast(TI=str, TO=float) >>
    # Impute missing values to 0.
    make_impute_constant(0.) >>
    # Clamp age values
    make_clamp(20., 50.) >>
    # Resize the data to a known length of 100
    make_resize_bounded(lower=20., upper=50., length=100, constant=20.) >>
    # Compute the mean over the bounded domain
    make_bounded_mean(lower=20., upper=50., n=100, T=float) >>
    # Add Laplace noise to privatize the mean
    make_base_laplace(scale=1.0)
)

res = chain(data)
print(res)

39.74152375796441
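
The last two steps of this chain can be illustrated with a plain-Python sketch of the Laplace mechanism on a bounded mean. It assumes change-one neighboring datasets with a known size `n`, so the mean's sensitivity is `(upper - lower) / n`. The helper names are hypothetical, and this naive floating-point sampler is for intuition only; it is not a hardened implementation like the one a DP library provides.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_sensitivity(lower, upper, n):
    """Max change in the mean of n bounded values when one record is replaced."""
    return (upper - lower) / n


def dp_mean(values, lower, upper, scale, rng):
    """Clamp, average, and add Laplace noise."""
    clamped = [min(max(v, lower), upper) for v in values]
    return sum(clamped) / len(clamped) + laplace_noise(scale, rng)


rng = random.Random(0)       # seeded only to make the sketch repeatable
values = [59.0, 31.0, 36.0, 54.0, 39.0] * 20   # 100 made-up records
release = dp_mean(values, 20.0, 50.0, scale=1.0, rng=rng)

# Under the assumptions above, epsilon = sensitivity / scale
epsilon = mean_sensitivity(20.0, 50.0, len(values)) / 1.0
print(release, epsilon)
```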


### Privatizing sums

(TODO: Explain geometric mechanism and how it privatizes the sum)

Let's use the geometric mechanism to create a DP release of the sum of all values
in the age column:

In [8]:
preprocessor = (
    # Convert data into Vec<Vec<String>>
    make_split_dataframe(separator=",", col_names=var_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="age", T=str) >>
    # Cast the column as Vec<Int>
    make_cast(TI=str, TO=int) >>
    # Impute missing values to 0
    make_impute_constant(0) >>
    # Clamp age values
    make_clamp(20, 50) >>
    make_bounded_sum(lower=20, upper=50) >>
    make_base_geometric(scale=1.0)
)

res = preprocessor(data)
print("DP Sum: ", res)

DP Sum:  39650
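
For intuition about the noise added in the last step, here is a minimal sketch of two-sided geometric noise, the integer-valued analogue of Laplace noise. It assumes the noise distribution is proportional to `alpha ** abs(k)` with `alpha = exp(-1 / scale)`; the function names are hypothetical and the sampler is not OpenDP's implementation.

```python
import math
import random


def geometric(alpha, rng):
    """Sample k >= 0 with P(k) = (1 - alpha) * alpha**k via inverse CDF."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return math.floor(math.log(u) / math.log(alpha))


def two_sided_geometric(scale, rng):
    """Difference of two geometric draws: P(k) proportional to alpha**abs(k)."""
    alpha = math.exp(-1.0 / scale)
    return geometric(alpha, rng) - geometric(alpha, rng)


rng = random.Random(0)   # seeded only to make the sketch repeatable
exact_sum = 39650        # the exact clamped sum computed earlier
noisy_sum = exact_sum + two_sided_geometric(1.0, rng)
print(noisy_sum)
```

Because the noise is integer-valued and concentrated near zero at scale 1.0, a release equal to the exact sum is a plausible outcome.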


### Privatizing counts

Sometimes you will want to make a DP release of the total number of elements in a data set. For example: How many rows
does our data set have? Below, we will use `make_count` to calculate this number, and `make_base_geometric` to
privatize the value.

In [9]:
preprocessor = (
    # Convert data into Vec<Vec<String>>
    make_split_dataframe(separator=",", col_names=var_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="income", T=str) >>
    # Cast the column as Vec<Int>
    make_cast(TI=str, TO=int) >>
    # Impute missing values to 0
    make_impute_constant(0) >>
    make_count(TIA=int) >>
    make_base_geometric(scale=1.0)
)


res = preprocessor(data)
print("DP count of income rows: ", res)

DP count of income rows:  1000
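
A count changes by exactly 1 when one row is added or removed, so its sensitivity is 1. The sketch below checks numerically that, for two-sided geometric noise with `alpha = exp(-1 / scale)` (an assumed parameterization), the probability of any given output changes by a factor of at most `exp(1 / scale)` between two neighboring counts. `geometric_pmf` is a hypothetical helper.

```python
import math


def geometric_pmf(k, scale):
    """Probability that two-sided geometric noise equals the integer k."""
    alpha = math.exp(-1.0 / scale)
    return (1 - alpha) / (1 + alpha) * alpha ** abs(k)


scale = 1.0
count_a, count_b = 1000, 999   # neighboring datasets differ by one row

# Ratio of output probabilities over a window of possible releases
worst_ratio = max(
    geometric_pmf(out - count_a, scale) / geometric_pmf(out - count_b, scale)
    for out in range(980, 1021)
)
bound = math.exp(1.0 / scale)
print(worst_ratio <= bound + 1e-9)  # True
```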
