# Privacy, Confidentiality and Ethics

This notebook contains the exercises for the Privacy, Confidentiality and Ethics session. It walks through the disclosure review process and what you have to keep in mind when you want to export output. In addition, it provides examples of how research results can be biased (degradation) when working with modified (topcoded) data instead of the original underlying information.

### The notebook is outlined as follows: 
- [Data Preparation for Exercises](#Data-Preparation-for-Exercises)
- [Exercise 1: Disclosure Review](#Exercise-1:-Disclosure-Review)
- [How to Export Files](#How-to-Export-Files)
- [Exercise 2: Degradation](#Exercise-2:-Degradation)

In [None]:
# Load packages
%pylab inline
from __future__ import print_function
import os
import pandas as pd
import numpy as np
import scipy
import sklearn
import psycopg2
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
import sqlalchemy
import statsmodels.api as sm
import statsmodels.formula.api as smf

## Data Preparation for Exercises
This section creates variables and combines different datasets which are needed for the exercises later. From the IDES data we create a measure of total earnings and number of jobs people have in the first quarter of 2015. In a second step we censor these measures. From the IDOC data we will load individual characteristics.

In [None]:
mypath = (os.getcwd())
print(mypath)

In [None]:
# Database connection
db_name = "appliedda"
hostname = "10.10.2.10"
conn = psycopg2.connect(database=db_name, host = hostname) 

### IDOC data person file
We need this data to obtain demographics for (ex-)offenders. These characteristics are standardized in the person table. We want to know when (ex-)offenders were born, as well as their race and sex. In addition, we get the hashed SSN to identify their earnings records in the IDES data.

In [None]:
# Get data from person table (sex and race)
query2 = 'SELECT ssn_hash, birth_year, race, sex FROM class1.person;'

In [None]:
# Save query in dataframe
df_ind = pd.read_sql( query2, con = conn )

In [None]:
# Check dataframe
df_ind.head()

### IDES earnings file
From the IDES earnings file we get all earnings for the first quarter of 2015. This allows us to look at the entire working population of Illinois in that quarter, as well as at all (ex-)offenders who have jobs in the first quarter of 2015, by merging this information with the IDOC data. Thus, we need the hashed SSN and the wage variable.

In [None]:
# Select all spells in 1st quarter of 2015 & variables needed
query = 'SELECT ssn, wage FROM ides.il_wage WHERE year = 2015 AND quarter = 1;'

In [None]:
# Save query in dataframe
df_wage = pd.read_sql( query, con = conn )

In [None]:
# Let's check the data frame (sort_values returns a new frame, so assign it)
df_wage = df_wage.sort_values(by='ssn')
df_wage.head()

In [None]:
# Close database connection
conn.close()

#### Calculate Number of Jobs 
We calculate the number of jobs each reported person has in the first quarter of 2015. Every job is reported separately so we just have to count the entries per ssn. Then we generate a variable number of jobs censored which we need for the degradation exercise. We censor the jobcount at three or more jobs.

In [None]:
# Define function for retrieving counts by group
def get_count(group):
    return{'count': group.count()}
df_jobs=df_wage['wage'].groupby(df_wage['ssn']).apply(get_count).unstack()

In [None]:
# Reset index and rename
df_jobs = df_jobs.reset_index()
df_jobs.rename(columns={'count':'nofjobs'}, inplace=True)

In [None]:
# Duplicate job variable info so we can censor one of them
df_jobs['nofjobs_cens'] = df_jobs['nofjobs']

In [None]:
# Topcode number of jobs at 3
df_jobs.loc[df_jobs['nofjobs_cens'] > 2, 'nofjobs_cens']=3

In [None]:
# Check dataframe (sort_values returns a new frame, so assign it)
df_jobs = df_jobs.sort_values(by='ssn')
df_jobs.head(15)

In [None]:
# How many jobs do people have in the reported quarter when using the original measure?
df_jobs['nofjobs'].value_counts()

In [None]:
# How many jobs do people have in the reported quarter when using the topcoded measure?
df_jobs['nofjobs_cens'].value_counts()
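
The counting and topcoding steps above can also be sketched more compactly. The snippet below is only an illustration on made-up data (`toy_wage` and its values are hypothetical, not IDES records); it uses `groupby(...).size()` to count entries per `ssn` and `clip(upper=3)` to topcode at three jobs. Note that `size()` counts every row, whereas the `count()` used above skips rows with a missing wage.

```python
import pandas as pd

# Hypothetical toy data standing in for df_wage (one row per reported job)
toy_wage = pd.DataFrame({
    'ssn': ['a', 'a', 'b', 'c', 'c', 'c', 'c'],
    'wage': [100, 200, 300, 50, 60, 70, 80],
})

# Count the number of job entries per person
toy_jobs = toy_wage.groupby('ssn').size().reset_index(name='nofjobs')

# Topcode the job count: every value above 3 is set to 3
toy_jobs['nofjobs_cens'] = toy_jobs['nofjobs'].clip(upper=3)
print(toy_jobs)
```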

#### Calculate Wages
Now we calculate the total earnings in the first quarter of 2015 for every person because we want to find out more about earnings inequality in Illinois. In addition, we generate a censored wage variable. Wages are censored at total earnings of ### USD per person per quarter. As we will be taking the log of earnings in the regressions later, we replace earnings of 0 with 0.0001 so that the log is defined.

In [None]:
# Define function to retrieve sum of attribute by group
def get_sum(group):
    return{'sum': group.sum()}
df_wage_tot=df_wage['wage'].groupby(df_wage['ssn']).apply(get_sum).unstack()

In [None]:
# Reset index and rename
df_wage_tot = df_wage_tot.reset_index()
df_wage_tot.rename(columns={'sum':'wage'}, inplace=True)

In [None]:
# Duplicate wage info so we can censor one of the variables
df_wage_tot['wage_cens'] = df_wage_tot['wage']
# You cannot take the log of 0
df_wage_tot.loc[df_wage_tot['wage'] == 0, 'wage']=0.0001

In [None]:
# Topcode earnings at #### USD
df_wage_tot.loc[df_wage_tot['wage_cens'] > (###), 'wage_cens']=###
# You cannot take the log of 0
df_wage_tot.loc[df_wage_tot['wage_cens'] == 0, 'wage_cens']=0.0001

In [None]:
# Let's look at the dataframe
df_wage_tot.head()

#### Construct final data for regressions
This dataset contains all the information (earnings, jobcount, and demographics) for the population of ex-offenders who are employed in the first quarter of 2015.

In [None]:
# Merge the wage and number of jobs info
df_ides = pd.merge(left=df_wage_tot,right=df_jobs, how='inner', 
                   left_on=['ssn'], right_on=['ssn'])
# Merge IDES data to IDOC person data
df_reg = pd.merge(left=df_ind,right=df_ides, how='inner', left_on=['ssn_hash'],
                  right_on=['ssn'])

In [None]:
# Check dataframe
df_reg.head()

## Exercise 1: Disclosure Review
This exercise explains how to prepare research output for disclosure control. It outlines the different kinds of output and what information is needed for disclosure review. In general you can export any kind of format. The most popular formats are tables, graphs, regression output and aggregated data. The only thing you can't export is content in a Jupyter notebook. During the export process you have to provide the code for every output. Every result you would like to export needs to be saved in either .csv, .txt or graph format. It is not possible to do export reviews in a Jupyter notebook.

General Rules:
- The disclosure review is based on the underlying observations. Every statistic you want to export should be based on at least 10 individual data points.
- Document your code so the reviewer can follow your data work. Assessing re-identification risks depends heavily on the context. Thus it is important that you provide context for your analysis to the reviewer.
- Save the requested output with the corresponding code in the git repo for your project's data exports. Make sure the code is executable. The code should produce exactly the output you requested.
- In case you are exporting PowerPoint slides that show project results, you have to provide the code that produces the output in the slides. 
- Export results only when they are final and you need them for your report/work/presentation.

### Tables
For tables of any kind you need to provide the underlying counts of the statistics presented in the table. Make sure you provide all counts. If you calculate ratios, for example re-incarceration rates, you need to provide the number of individuals who are incarcerated and the number of those who are not. 

Let's assume we are interested in demographic characteristics of ex-offenders and want to know how race and sex are distributed over birth year. We can get this information from the person table (df_ind, the dataframe which was created earlier).

In [None]:
# Now let's tabulate sex and race characteristics across birth_year
print(pd.crosstab([df_ind.birth_year.fillna('missing'), df_ind.race.fillna('missing')], df_ind.sex.fillna('missing'), margins=True))

#### Problematic Output
We can see that we have a lot of small numbers here. This table won't be released. In this case, disclosure review would mean deleting all cells with counts of less than 10. In addition, secondary suppression has to take place: the disclosure reviewer has to delete as many additional cells as needed to make it impossible to recalculate the suppressed values from the remaining cells and the margins. 
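
To make the first step concrete, here is a minimal sketch of primary suppression on a toy crosstab. The data and the helper name `suppress_small_cells` are hypothetical and only serve as an illustration; the threshold of 10 is taken from the general rules above. The sketch does not perform secondary suppression, which still has to be done by hand.

```python
import numpy as np
import pandas as pd

MIN_CELL = 10  # minimum cell size from the general rules

def suppress_small_cells(table, min_cell=MIN_CELL):
    """Return a copy of a count table with cells below min_cell set to NaN."""
    # mask() replaces values where the condition is True with NaN
    return table.mask(table < min_cell)

# Hypothetical toy data, not drawn from the IDOC tables
toy = pd.DataFrame({
    'race': ['A'] * 12 + ['B'] * 15 + ['C'] * 3,
    'sex':  ['F'] * 2 + ['M'] * 10 + ['F'] * 4 + ['M'] * 11 + ['M'] * 3,
})
counts = pd.crosstab(toy['race'], toy['sex'])
masked = suppress_small_cells(counts)
print(masked)
```

Cells with a count of zero are masked as well in this sketch, because they also fall below the threshold.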

#### How to do it better
Instead of asking for export of a table like this, prepare your tables in advance so that every cell is based on at least 10 observations. In our example we can do this by grouping birth years, for instance. 

In [None]:
# Aggregate birth year variable
cohort = []
for row in df_ind['birth_year']:
    if row < 1940:
        cohort.append('up to 1939')
    elif row > 1939 and row < 1960:
        cohort.append('1940 to 1959')        
    elif row > 1959 and row < 1980:
        cohort.append('1960 to 1979')      
    elif row > 1979 and row < 2000:
        cohort.append('1980 to 1999')
    else:
        # Missing birth years and years from 2000 onwards get an empty label
        cohort.append('')
df_ind['cohort']=cohort
df_ind.head()

In [None]:
# Now let's tabulate sex and race characteristics across cohort
print(pd.crosstab([df_ind.cohort.fillna('missing'), df_ind.race.fillna('missing')], df_ind.sex.fillna('missing'), margins=True))

There are still some small cell sizes, so this table needs further aggregation. It would make sense to aggregate the race variable. For example, you can combine the races with small cell sizes into a group "other" or remove them from the table completely.
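
Combining rare categories into an "other" group could look like the sketch below. The helper `collapse_rare`, the label `OTHER` and the toy codes `X1`/`X2` are assumptions made for illustration and are not part of the IDOC data.

```python
import pandas as pd

def collapse_rare(series, min_count=10, other_label='OTHER'):
    """Replace categories seen fewer than min_count times with other_label."""
    freq = series.value_counts()
    rare = freq[freq < min_count].index
    # Keep values that are not rare, replace the rest
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy data
toy_race = pd.Series(['WHI'] * 20 + ['BLK'] * 15 + ['X1'] * 6 + ['X2'] * 5)
collapsed = collapse_rare(toy_race)
print(collapsed.value_counts())
```

The combined group can itself still be too small, so the counts should be checked again after collapsing.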

### CSV files: Tables and Aggregated Data
You can save any dataframe as a csv file and export this csv file. The only thing you have to keep in mind is that, besides the statistic X you are interested in, you have to include a variable with the count of X so we can see how many observations the statistic is based on. For example, if you aggregate by industry, we need to know how many observations are in each industry (after the aggregation each industry will be only one data point). This applies to all exported tables, aggregations, etc. For now let's save the table above in a .csv file to export the result.
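
As an illustration of the industry example, the following sketch aggregates a toy dataframe and keeps the number of observations next to the statistic. The column names (`industry`, `mean_wage`, `n_obs`) and the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per worker
toy = pd.DataFrame({
    'industry': ['retail'] * 12 + ['health'] * 11,
    'wage': [1000.0] * 12 + [2000.0] * 11,
})

# Compute the statistic and the number of observations it is based on
agg = toy.groupby('industry')['wage'].agg(['mean', 'count']).reset_index()
agg = agg.rename(columns={'mean': 'mean_wage', 'count': 'n_obs'})

# to_csv() without a path returns the csv content as a string
csv_text = agg.to_csv(index=False)
print(csv_text)
```

Passing a file name to `to_csv` instead would write the same content to disk for the export request.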

In [None]:
# Only include the races with enough observations
output = df_ind[(df_ind.race == 'BLK') | (df_ind.race == 'HSP') | (df_ind.race == 'WHI')]
# Produce crosstab
export = pd.crosstab([output.cohort, output.race], output.sex, margins=True)
print(export)
# write to csv because notebook output cannot be exported
export.to_csv('demographics.csv')

### Graphs
It is important that every point which is plotted in a graph is based on at least 10 observations. Thus scatterplots for example cannot be released. In case you are interested in a histogram you have to change the bin size to make sure that every bin contains at least 10 people. In addition to the graph you have to provide the ADRF with the underlying table in a .csv or .txt file. This file should have the same name as the graph so ADRF can directly see which files go together. 
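
Before exporting a histogram, one way to check the bin sizes is sketched below. The helper `bins_are_safe` and the toy values are hypothetical; the check simply counts observations per bin with `np.histogram` and compares them with the threshold of 10.

```python
import numpy as np

def bins_are_safe(values, bin_edges, min_count=10):
    """Return (ok, counts); ok is True if every bin holds at least min_count values."""
    counts, _ = np.histogram(values, bins=bin_edges)
    return bool((counts >= min_count).all()), counts

# Hypothetical toy data: 200 evenly spaced values between 0 and 995
toy_values = np.arange(200) * 5.0

# 100 narrow bins: far too few observations per bin
ok_fine, counts_fine = bins_are_safe(toy_values, np.linspace(0, 1000, 101))
# 4 wide bins: every bin is well above the threshold
ok_coarse, counts_coarse = bins_are_safe(toy_values, np.linspace(0, 1000, 5))
print(ok_fine, ok_coarse, counts_coarse)
```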

In [None]:
# Produce plot for job count
histjobs=df_jobs['ssn'].groupby(df_jobs['nofjobs']).apply(get_count).unstack()
myplot = histjobs.plot(kind='bar', legend=None)
myplot.set_ylabel("Total number of Workers")
myplot.set_xlabel("Number of Jobs held")

From this plot we can see that only a few people had more than 3 jobs. Thus we have to make sure that we only plot the cells which contain at least 10 workers.

In [None]:
# Let's look at the dataframe
histjobs

In [None]:
# Look at the dataframe with 10 plus observations
histjobs_cens=histjobs[histjobs['count']>9]
histjobs_cens

In [None]:
# Generate plot for export
myplot2 = histjobs_cens.plot(kind='bar', legend=None)
myplot2.set_ylabel("Total number of Workers")
myplot2.set_xlabel("Number of Jobs held")

In [None]:
# Save plot to file
fig = myplot2.get_figure()
fig.savefig('nofjobs_hist.png')
# Don't forget to save the counts as csv with the same name as the graph
histjobs_cens.to_csv('nofjobs_hist.csv')

### Regressions
You need to provide the ADRF with the number of observations which are included in the regression. When using the statsmodels package you can print the necessary information with the summary method. Regression output should be written to a .txt file. If you are including dummies in the regression you need to provide the number of observations for each dummy included in the regression.
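
Independently of the regression package, the number of observations behind each dummy can be checked directly on the categorical variables. The sketch below uses a hypothetical helper `small_categories` and toy data; it lists every category with fewer than 10 observations.

```python
import pandas as pd

def small_categories(df, columns, min_count=10):
    """Map each column to its categories with fewer than min_count observations."""
    result = {}
    for col in columns:
        counts = df[col].value_counts()
        result[col] = counts[counts < min_count].to_dict()
    return result

# Hypothetical toy data
toy = pd.DataFrame({
    'sex': ['F'] * 12 + ['M'] * 18,
    'race': ['A'] * 25 + ['B'] * 5,
})
flagged = small_categories(toy, ['sex', 'race'])
print(flagged)
```

An empty dictionary for a column means that all of its categories meet the threshold.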

In [None]:
# Run regression on wages (not censored)
# np.log is used explicitly so the formula does not depend on %pylab
model = smf.ols('np.log(wage) ~ C(sex) + C(race)', data=df_reg)
results = model.fit()
res = results.summary()
print(res)

In [None]:
# We need to find out the number of observations for each dummy
# (list() makes the pairs printable under Python 3 as well)
counts = list(zip(model.exog.sum(0), model.exog_names))
print(counts)

In [None]:
# Write results to a txt file
with open('OLS_results.txt', 'w') as outfile:
    outfile.write("%s\n%s" % (res, counts))

## How to Export Files
Every project has a Git repository which is used only for export requests. The name of the repository is your project_name/export. You need to save every file that is included in an export request in this repository. After setting up the export git repo in your workspace, please create two folders: input and output. 
- Save the code and any additional documentation for all the results you want to export in the input folder.
- Save the files that you want to export in the output folder. If you want to export a Jupyter notebook, please make sure that there is no data in the notebook before you put it in the output folder.

When you are ready for export, commit your changes to the repo, go to the gitlab interface, find your export repository and submit a merge request.
- Press new merge request
- Select the source branch: export-yyyy-mm-dd
- Select master as the target branch
- Fill out the necessary information:
    - Title: export-yyyy-mm-dd-#-files
    - Description: Provide us with a description of what is in the export request, what the purpose of the export is, what kind of files you are requesting and what kind of analysis you performed. You should use this field to provide information in a way that a person who does not know what your project is about will be able to understand your output. The more detailed your description is, the easier it is for the reviewer to understand what you did.

Once you have submitted the merge request, please notify us on the slack channel export-request that you want to request a disclosure review. Your slack message should only contain the project name and date of request. We will then look at the files and be in touch with you about the export. 

ADRF reviews for confidentiality. Please keep in mind that some data providers might review your output too. Thus, the export process might take some time. To minimize our and the data providers' work, please only commit and export final results you need for your presentation and limit the export of intermediate results.

### Command Line: set up Git Repo and Commit/Push Files
You need to clone the git repo and set up a branch. In a second step you need to create a subfolder input and a subfolder output in the export folder. Now you can add the files you want to export to the folders. Then you can commit and push your changes. 

![easylink](./terminal1.jpg)
![easylink](./terminal2.jpg)
![easylink](./terminal3.jpg)

### Gitlab: Submit Merge Request
After you have committed and pushed your files to the export repo you can open gitlab and follow the procedure outlined below:

Find your export repository for your project project-project_name/export.
![easylink](./gitlab1.jpg)

Click on merge request.
![easylink](./gitlab2.jpg)

Generate a new merge request. The source branch should be your project export repo, which you label export-date. The target branch is master.
![easylink](./gitlab3.jpg)

Press continue and you will get to the following page. Here you need to provide us with information on your export request. Please fill out the fields before submitting your merge request. The assignee to enter is Daniela.
![easylink](./gitlab4.jpg)

After submitting the request you will see the page below (without the accept button, which is only available in the administrator interface). You have successfully submitted output for disclosure review. 
![easylink](./gitlab5.jpg)

Please keep in mind that you will not receive notifications through Git. After you have submitted your merge request you need to send a message to the slack channel #export-request and let us know that output is waiting for us. Once the disclosure review is done, ADRF will get in touch with you and send you the information needed to retrieve your output. 

In case your output doesn't pass the disclosure review, we will let you know why and ask you to make changes to your results. You can commit these changes as an amendment to the same branch. 

## Exercise 2: Degradation
One way of protecting confidentiality in data is to transform the data in such a way that information on the observed unit is reduced. Oftentimes this is done by adding noise, grouping individual attributes, or topcoding. It is a common procedure used by data providers to generate public use files which can be freely accessed on the internet. These methods reduce the risk of re-identification; however, this data manipulation can also lead to wrong inference, as the results do not necessarily represent the true underlying population. 
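
A small numeric illustration of topcoding, using made-up earnings and a hypothetical cap of 5000 (neither is taken from the IDES data): capping the single large value pulls the mean down sharply, while the median is unaffected.

```python
import numpy as np

# Hypothetical toy earnings with one very high earner
toy_earnings = np.array([1000., 2000., 3000., 4000., 50000.])
cap = 5000.  # hypothetical topcoding threshold

# Topcoding: every value above the cap is replaced by the cap
topcoded = np.minimum(toy_earnings, cap)

print('mean original:  ', toy_earnings.mean())
print('mean topcoded:  ', topcoded.mean())
print('median original:', np.median(toy_earnings))
print('median topcoded:', np.median(topcoded))
```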

The goal of this exercise is to test how data manipulation can affect the results of a research project. We can demonstrate this by using the topcoded variables we created for earnings and number of jobs. We run two different kinds of analyses, once with the original and once with the censored variable, and find out how much the results differ.

The exercise is as follows: 
- How does manipulation of data affect regression output?
- How does it affect the creation of an inequality measure?

### Regression Output: Original Value vs. Topcoded Value
We will look at how regression output changes when we use the two different measures we constructed for the number of jobs and wages. We are interested in the relationship between number of jobs/total earnings and personal characteristics. This uses the combined dataset, so we are looking at ex-offenders only. As they probably don't show earnings at the upper threshold, we expect smaller differences when we regress on earnings than when we regress on number of jobs. 

In [None]:
# Wages:
# Run regression using the original measure
results = smf.ols('np.log(wage) ~ C(sex) + C(race)', data=df_reg).fit()
print(results.summary())
# Run regression using the topcoded measure
results = smf.ols('np.log(wage_cens) ~ C(sex) + C(race)', data=df_reg).fit()
print(results.summary())

In [None]:
# Jobs:
# Run regression using the original measure
results = smf.ols('nofjobs ~ C(sex) + C(race)', data= df_reg).fit()
print(results.summary())
# Run regression using the topcoded measure
results = smf.ols('nofjobs_cens ~ C(sex) + C(race)', data= df_reg).fit()
print(results.summary())

### Gini Coefficient: Original value vs. topcoded value
The Gini coefficient is a measure of statistical dispersion that describes the distribution of residents' income, wealth or earnings and is often used as a measure of inequality. A Gini of 0 denotes perfect equality, a Gini of 1 perfect inequality. We can calculate the Gini coefficient for the Illinois population and the population of ex-offenders. For the same reason as above we don't expect the Gini to differ much between the censored and the original variable within ex-offenders.
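
For reference, the `gini` function defined in the next cell corresponds to the following expression, where $x_{(1)} \le \dots \le x_{(n)}$ are the earnings sorted in ascending order (the notation is introduced here for illustration only):

$$G = \frac{2\sum_{i=1}^{n} i\,x_{(i)}}{n\sum_{i=1}^{n} x_{i}} - \frac{n+1}{n}$$

With this finite-sample form, the value for a population in which one person holds all earnings is $(n-1)/n$, which approaches 1 as $n$ grows.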

In [None]:
# Define function to calculate Gini Coefficient
def gini(x):
    # Work on a sorted NumPy array so the ranking is positional, not label-based
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    x_sum = x.sum()
    n_x_sum = n * x_sum
    # Twice the rank-weighted sum of the sorted values
    r_x = (2. * np.arange(1, n + 1) * x).sum()
    return (r_x - n_x_sum - x_sum) / n_x_sum

In [None]:
# Calculate Gini based on original value
print('Gini Coefficient original earnings (all): ', gini(df_wage_tot['wage']))

# Calculate Gini based on original value for ex-offenders
print('Gini Coefficient original earnings (ex-offenders): ', gini(df_reg['wage']))

In [None]:
# Calculate Gini based on censored value
print('Gini Coefficient censored earnings (all): ', gini(df_wage_tot['wage_cens']))

# Calculate Gini based on censored value for ex-offenders
print('Gini Coefficient censored earnings (ex-offenders): ', gini(df_reg['wage_cens']))

We can see that the inequality measure differs quite a bit between the original reporting and the censored version of earnings for the working population of Illinois in the first quarter of 2015. In this case we would underreport earnings inequality, which might lead to wrong policy advice. 