# 04. Case Study - Privacy in Practice

In this notebook, we'll explore the possibilities for data privacy on a new dataset. You'll be asked to work in small groups, so make friends with someone seated near you.

Your challenge: you are working with a health care provider that would like to apply machine learning to this dataset to identify preventative measures, so that fewer patients are admitted to the hospital for related care or so that their stays are shorter. The goal is to give more potentially affected patients access to primary care physicians and to regular medication or visits that can keep them out of the hospital for long stays. The study focuses on blood-sugar-related illnesses, including but not limited to diabetes.

Using this dataset, we'll walk through a few possible scenarios and apply what we have learned today about data privacy to this new use case.

## Part One: Determining What's Useful and What's Sensitive

- Data completeness
- Potential sensitive columns
- Potential useful features
- What should we use (or not use)? Why?

In [None]:
%matplotlib inline
import pandas as pd

df = pd.read_csv('../data/health_data.csv')

In [None]:
df.head()

In [None]:
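from datetime import datetime, timedelta

# Shift timestamps dated after 2018 back by 8*12*30 days (about eight years); format the result so every value stays a string
df.admitted_ts = df.admitted_ts.map(lambda x: (datetime.strptime(x, '%Y-%m-%d %H:%M:%S') - timedelta(days=8*12*30)).strftime('%Y-%m-%d %H:%M:%S') if datetime.strptime(x, '%Y-%m-%d %H:%M:%S').year > 2018 else x)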

In [None]:
df.to_csv('../data/health_data.csv', index=False)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# If you'd like, keep exploring column distributions and values, or plot a few columns that interest you.
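
One way to organize this exploration is a per-column summary of completeness and cardinality. The sketch below is only an illustration: the toy values are invented (only the column names `ssn`, `age`, and `marital_status` follow this dataset), and `profile_columns` is a hypothetical helper, not part of the workshop code.

```python
import pandas as pd

def profile_columns(frame):
    """Summarize each column: share of missing values and number of distinct values."""
    return pd.DataFrame({
        'missing_share': frame.isna().mean(),
        'n_unique': frame.nunique(),                 # NaN is not counted as a value
        'unique_share': frame.nunique() / len(frame),
    })

# Invented toy data standing in for the real health dataset
toy = pd.DataFrame({
    'ssn': ['111', '222', '333', '444'],
    'age': [34, 34, 51, None],
    'marital_status': ['single', 'married', 'married', 'single'],
})

profile = profile_columns(toy)
print(profile)
```

Columns whose `unique_share` is close to 1 (such as `ssn` here) are likely direct identifiers, while a high `missing_share` suggests the column may be of limited use as a feature.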

### Discussion

- What columns should we use? 
- Which ones should we remove?
- Are there columns which we should protect but not remove? 

For each choice, give a brief justification!
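
For columns that are useful but sensitive, one option is to generalize them rather than drop them. The following is a minimal sketch, assuming `age` is numeric and `admitted_ts` is a `'%Y-%m-%d %H:%M:%S'` string as in this dataset. The toy values, the bin edges, and the new column names `age_band` and `admitted_month` are illustrative assumptions.

```python
import pandas as pd

# Invented toy rows; only the column names follow the dataset
toy = pd.DataFrame({
    'age': [23, 37, 64, 81],
    'admitted_ts': ['2016-03-14 08:15:00', '2016-03-29 22:40:10',
                    '2017-11-02 13:05:45', '2018-01-20 06:30:00'],
})

# Replace exact ages with coarse bands (right edge excluded, so 40 falls in '40-59')
bins = [0, 20, 40, 60, 80, 120]
labels = ['0-19', '20-39', '40-59', '60-79', '80+']
toy['age_band'] = pd.cut(toy['age'], bins=bins, labels=labels, right=False)

# Keep only the year and month of the admission timestamp
toy['admitted_month'] = pd.to_datetime(toy['admitted_ts']).dt.strftime('%Y-%m')

# Drop the precise originals once the generalized versions exist
generalized = toy.drop(columns=['age', 'admitted_ts'])
print(generalized)
```

Coarser bands give more protection but less signal for the model, so the bin width is itself a decision worth justifying in your group.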

In [None]:
cols_to_drop = ['id', 'patient_name', 'ssn']

df = df.drop(columns=cols_to_drop)

In [None]:
df.columns

## Part Two: Determining the Approach for Protecting the Columns

- Scenario One: You are an employee of the company which produced this data. You have full access and can use full variables for building your model; however, you want to respect privacy and ensure your model is valuable without leaking private information.

- Scenario Two: You are the database manager at the health care provider asked to prepare the data to send to a machine learning consultant who will help give you a more detailed analysis. The consultant has signed all the necessary NDAs, but you have instructions to keep the private or potentially sensitive data to a minimum.

- Scenario Three: You suggested releasing the dataset to Kaggle. It will be uploaded so hundreds of Kagglers can participate. For the sake of avoiding a long legal argument, let's say all patients were part of a study in which they signed a waiver that their records could be released publicly. That said, you still want to avoid a PR nightmare and protect the data as much as possible. 

Based on the scenario for your team above, what do you do?

In [None]:
# any investigation code to see what approach you might use

In [None]:
df.has_diabetes.value_counts()

In [None]:
df.private_insurance.value_counts()

In [None]:
df.no_primary_dr.value_counts()

In [None]:
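# numeric_only=True skips text columns, which recent pandas versions no longer drop automatically
df.corr(numeric_only=True)['no_primary_dr']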

In [None]:
df.marital_status.value_counts()

In [None]:
df.age.hist(bins=70)

### Discussion

- What methods will be most effective in the scenario you have? 
- Have you considered potential data leakage within the *non-sensitive* columns?
- Is there other sensitive or secret data we should address given the scenario?
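
To make the leakage question concrete, one rough check is how small the rarest combination of seemingly harmless columns is, which is the idea behind k-anonymity. This sketch uses invented toy rows; `smallest_group` and `rows_below_k` are hypothetical helpers, and the choice of `hospital` and `marital_status` as quasi-identifiers is an assumption for illustration.

```python
import pandas as pd

def smallest_group(frame, quasi_identifiers):
    """Return the size of the rarest combination of quasi-identifier values."""
    return int(frame.groupby(quasi_identifiers).size().min())

def rows_below_k(frame, quasi_identifiers, k):
    """Return the rows whose quasi-identifier combination occurs fewer than k times."""
    sizes = frame.groupby(quasi_identifiers)[quasi_identifiers[0]].transform('size')
    return frame[sizes < k]

# Invented toy data; only the column names follow the dataset
toy = pd.DataFrame({
    'hospital': ['A', 'A', 'A', 'B', 'B'],
    'marital_status': ['single', 'single', 'married', 'married', 'married'],
    'has_diabetes': [1, 0, 1, 0, 1],
})

quasi_ids = ['hospital', 'marital_status']
k = smallest_group(toy, quasi_ids)
at_risk = rows_below_k(toy, quasi_ids, k=2)
print(k)
print(at_risk)
```

A smallest group of size 1 means at least one patient is unique on those columns alone, so their `has_diabetes` value could be learned by anyone who knows their hospital and marital status.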

## Part Three: Implement Data Protection for the Dataset

Now it's time to code! Feel free to utilize code from the previous notebooks to implement protection of at least two of the columns you chose as sensitive. Are there ways to make these applications more Pandas-friendly or easy to use? 

In [None]:
# implement protection for the columns you are keeping -- you may use code from previous notebooks in this workshop
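
As a possible starting point, here is a Pandas-friendly sketch of keyed-hash pseudonymization. It is a stand-in technique built on Python's standard library (HMAC-SHA256), not the method taught in the earlier notebooks. The key, the toy hospital names, and the `pseudonymize` helper are all assumptions made for illustration.

```python
import hashlib
import hmac

import pandas as pd

# Assumption: in real use the key comes from a secret store, never from the notebook itself
SECRET_KEY = b'replace-with-a-randomly-generated-secret'

def pseudonymize(value, key=SECRET_KEY, length=16):
    """Map a value to a stable keyed-hash token (HMAC-SHA256, truncated hex digest)."""
    digest = hmac.new(key, str(value).encode('utf-8'), hashlib.sha256).hexdigest()
    return digest[:length]

# Invented toy data; only the column name follows the dataset
toy = pd.DataFrame({'hospital': ['St. Mary', 'General', 'St. Mary']})

# Series.map applies the helper to every value, so it works on any column
toy['hospital'] = toy['hospital'].map(pseudonymize)
print(toy)
```

Equal inputs map to equal tokens, so grouping and joins still work. The flip side is that value frequencies are preserved, which means a column with few distinct values can still be re-identified by counting.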



In [None]:
# scenario three: possible implementation (to hide as solution for hints)

# pseudonymize hospital, age and admitted timestamp
import json
import requests


SHARED_KEY = '42a2d3fc1cc449e2a27ddd457e056012'

# one dict per row, which is the shape sent under "items" below
item_list = df.to_dict(orient='records')

actions = [
    {
        "name": "pseudonymize-hospital",
        "transform-value" : {
            "key": "hospital",
            "pseudonymize" : {
                "method": "merengue",
                "key": "89f7dklnvkldhiwokdljklsnm,qip72", 
            }
        }
    },
    {
        "name": "pseudonymize-age",
        "transform-value": {
            "key": "age",
            "pseudonymize": {
                "method": "structured",
                "key": "320fidsjkl8wy8uiofme#908",
                "type": "integer",
                "format": "raw",
                "typeParams": {
                    "min": 16,
                    "max": 100
                }
            }
        }
    },
    {
        "name": "pseudonymize-admitted-ts",
        "transform-value": {
            "key": "admitted_ts",
            "pseudonymize": {
                "method": "structured",
                "key": "320fidsjkl8wy8uiofme#908",
                "type": "date",
                "preservePrefix": True,
                "format": "%(2000-2019)Y-%m-%d %H:%M:%S"
            }
        }
    }
]

pseudonymized_data = requests.post(
    'https://api.kiprotect.com/v1/transform', 
    data = json.dumps(
        {"actions": actions, "items": item_list}, 
        allow_nan=False),
    headers = {'Authorization': 'Bearer {}'.format(
        SHARED_KEY)}
)


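# fail loudly on an HTTP error instead of hitting a confusing KeyError on 'items'
pseudonymized_data.raise_for_status()
protected_df = pd.DataFrame(pseudonymized_data.json()['items'])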

In [None]:
protected_df.head()

In [None]:
# TODO: add differential privacy for private_insurance or no_primary_dr, and has_diabetes
# Possible next step: explore k-anonymity?
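
One candidate for the differential privacy step on a yes/no column is randomized response. The sketch below assumes the column holds 0/1 values; the toy data, the helper names, and `p_truth=0.75` are illustrative assumptions. With this scheme a true 1 is reported as 1 with probability 0.875 and a true 0 with probability 0.125, so each released value satisfies epsilon = ln(7), roughly 1.95.

```python
import numpy as np
import pandas as pd

def randomized_response(series, p_truth=0.75, seed=None):
    """Keep each 0/1 answer with probability p_truth, otherwise replace it with a fair coin flip."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(series)) < p_truth
    coin = rng.integers(0, 2, size=len(series))
    return pd.Series(np.where(keep, series.to_numpy(), coin), index=series.index)

def estimate_true_rate(noisy, p_truth=0.75):
    """Invert the noise, using E[noisy] = p_truth * rate + (1 - p_truth) * 0.5."""
    return (noisy.mean() - (1 - p_truth) * 0.5) / p_truth

# Invented toy column: 30% of 10,000 patients have the condition
toy = pd.Series([1] * 3000 + [0] * 7000, name='has_diabetes')

noisy = randomized_response(toy, p_truth=0.75, seed=0)
estimate = estimate_true_rate(noisy, p_truth=0.75)
print(round(estimate, 3))
```

Individual rows become deniable, while the overall rate can still be estimated from the noisy column. The estimate gets less reliable as the dataset shrinks or as `p_truth` moves toward 0.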

### Discussion

- What was difficult to decide and implement?
- How might this relate to real problems in machine learning with sensitive data? 
- Does this apply to your work? How? What can you take away?

In [None]:
protected_df.to_csv('../data/health_data_protected.csv', index=False)