In [1]:
# To run videos from YouTube etc.
from IPython.display import HTML, Image

Acknowledgement to the [Safe Data Access Professionals Working Group](https://securedatagroup.org).  
Please review the [SDC handbook](https://doi.org/10.6084/m9.figshare.9958520).  

<div>
<img src="figs/sdc-handbook-v1.0-pp1.png" align="right" width="500"/>
</div>

# Working with data at UCLH

## Five Safes

We follow the ['Five Safes'](http://www2.uwe.ac.uk/faculties/BBS/Documents/1601.pdf) approach to managing data and information security. This means that we don't rely on just the 'safety' of the data but also take into account the following:

### Safe People

- all individuals have substantive contracts or educational relationships with higher education or NHS institutions
- those working with the data need evidence of experience with such data (e.g. previous training, previous work with ONS, data safe havens etc.) or a supervisor who has similar experience
- those working need to undergo training in information governance and issues with statistical disclosure control (SDC)

### Safe Projects

- projects must 'serve the public good' 
- projects must meet relevant HRA and UCLH research and ethics approvals
- service delivery work must be mandated through the usual Trust processes

### Safe Settings

- working at UCLH in the NHS on approved infrastructure
- UCLH local and remote desktops
- UCLH Data Science Desktop
- Generic Application Development Environment (GADE)

### Safe Outputs

- outputs (e.g. reports, figures and tables) must be non-disclosing 
- outputs should remain on NHS systems initially
- a copy of every output released externally (e.g. documents) should be stored in one central location so that it is visible to all

### Safe Data

- direct identifiers (hospital numbers, NHS numbers, names etc) should be masked unless there is an explicit justification for their use
- data releases are proportionate (e.g. limited by calendar periods, by patient cohort etc.)
- further work to obscure or mask the data is not necessary given the other safeguards (as per the recommendation of the UK Data Service)

In [7]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/Mln9T52mwj0?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>

## Five safes at UCLH

In practice, because we judge safety on _more_ than just the qualities of the data, we are able to work with data that would otherwise be considered unsafe. The plot below demonstrates this by comparing our approach with the effort we would have to expend on data safety if we wanted to release the data on the internet, where we would lose the other four safes.

![](../figs/five-safes-uclh-radar.png)

# Safe data without the other 4 safes

- Anonymisation `<-`
- Pseudonymisation
- De-identification 

## Anonymisation (is _really_ hard)

Methods include Generative Adversarial Networks, differentially private Bayesian generative models, and Statistical Disclosure Control.

### Statistical Disclosure Control

Set thresholds for

- *k-anonymity*: counts the number of individuals identified by the intersection of key variables
- *l-diversity*: counts how varied other sensitive fields are within a k-anonymous group

Then define

- direct identifiers
- key variables (indirect identifiers)
- sensitive fields
- non-identifying variables

![](figs/cchic-anon-1.png)

![](figs/cchic-anon-2.png)
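
The two measures above can be sketched in a few lines of Python. This is a minimal illustration rather than a production SDC tool: the function names, the field names (`age_band`, `sex`, `diagnosis`) and the records are invented for the example, and real thresholds would be set by the project.

```python
from collections import defaultdict

def k_anonymity(records, key_vars):
    """Smallest number of records sharing one combination of key variables."""
    groups = defaultdict(int)
    for rec in records:
        groups[tuple(rec[v] for v in key_vars)] += 1
    return min(groups.values())

def l_diversity(records, key_vars, sensitive):
    """Smallest number of distinct sensitive values within any key-variable group."""
    groups = defaultdict(set)
    for rec in records:
        groups[tuple(rec[v] for v in key_vars)].add(rec[sensitive])
    return min(len(values) for values in groups.values())

# Invented toy records: age_band and sex are key variables, diagnosis is sensitive
records = [
    {"age_band": "60-69", "sex": "F", "diagnosis": "sepsis"},
    {"age_band": "60-69", "sex": "F", "diagnosis": "pneumonia"},
    {"age_band": "60-69", "sex": "M", "diagnosis": "sepsis"},
    {"age_band": "60-69", "sex": "M", "diagnosis": "sepsis"},
    {"age_band": "70-79", "sex": "F", "diagnosis": "stroke"},
]

k = k_anonymity(records, ["age_band", "sex"])               # 1: one woman aged 70-79 is unique
l = l_diversity(records, ["age_band", "sex"], "diagnosis")  # 1: all men aged 60-69 share a diagnosis
```

A result below the chosen threshold suggests coarsening the key variables (e.g. wider age bands) before release.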

# Safe outputs

Some terminology

- Rules-based
> Users are given a set of fixed rules about what can and cannot be released; if the statistical output presented by the user meets the criteria, it is released.
- Principles-based
> An assessment of risk takes place, and a decision is made as to whether the statistical output presented is ‘safe’ to release or not (in accordance with the Five Safes ‘Safe Output’ element).

- Primary disclosure
> Inferring the identity of, and/or information about, a data subject from a single source of data, e.g. spontaneous recognition.
- Secondary disclosure (‘attribute’)
> Deriving the identity of, and/or information about, a data subject by combining two or more sources of information.

## Frequency tables

Rules-based
- Minimum cell count 
- All counts should be unweighted

Principles-based
- Threshold is a ‘rule-of-thumb’
- The units and data being presented should be considered

Frequencies can be presented in many different ways including tables, histograms, pie charts, bar charts. The guidance for frequency tables will also apply for these.
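
A rules-based minimum cell count could be applied along these lines. The sketch assumes a threshold of 10 purely for illustration (the actual threshold is a project decision), and the function name and ward labels are hypothetical. It does not check whether a suppressed cell could be recovered from row or column totals.

```python
from collections import Counter

def frequency_table(values, min_count=10):
    """Count unweighted frequencies and mask any cell below min_count."""
    counts = Counter(values)
    return {
        category: (n if n >= min_count else f"<{min_count}")
        for category, n in counts.items()
    }

# Invented example: number of admissions per ward
wards = ["ICU"] * 25 + ["HDU"] * 12 + ["PACU"] * 3
table = frequency_table(wards)
# {'ICU': 25, 'HDU': 12, 'PACU': '<10'}
```

Under a principles-based approach the same code is only a starting point: whether a small cell is disclosive depends on the units and the data being presented.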


![](figs/five-safes-freq-table-1.png)

![](figs/five-safes-freq-table-2.png)

## Graphs and Figures

Example issues include

- histograms: counts are often low in the tails of the distribution, and the maximum and minimum values may also be revealed
- scatter plots (and residual plots): by definition these plot individuals; consider grouping
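
One way to handle sparse histogram tails is to fold them into the neighbouring bin before plotting. The sketch below is an assumption about how that might be done, not a method from the handbook: the function name, bin width and threshold are illustrative, and sparse bins in the middle of the distribution are not handled.

```python
from collections import Counter

def tail_merged_histogram(values, bin_width, min_count=10):
    """Bin values, then merge sparse tail bins inward until each tail meets min_count.

    Returns a list of (lower_edge, upper_edge, count) tuples. Edges are
    multiples of bin_width, so exact minimum and maximum values are not shown.
    """
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    bins = [[lo, lo + bin_width, n] for lo, n in sorted(counts.items())]

    # Fold the lowest bin into its neighbour while it is too small
    while len(bins) > 1 and bins[0][2] < min_count:
        lo, _, n = bins.pop(0)
        bins[0][0] = lo
        bins[0][2] += n

    # Fold the highest bin into its neighbour while it is too small
    while len(bins) > 1 and bins[-1][2] < min_count:
        _, hi, n = bins.pop()
        bins[-1][1] = hi
        bins[-1][2] += n

    return [tuple(b) for b in bins]

# Invented ages with sparse tails
ages = [1] * 1 + [12] * 3 + [25] * 12 + [35] * 15 + [48] * 2
hist = tail_merged_histogram(ages, bin_width=10, min_count=10)
# [(0, 30, 16), (30, 50, 17)]
```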

![](figs/five-safes-freq-scatter-1.png)

![](figs/five-safes-freq-scatter-2.png)

## Four eyes principle

# Practical tips

![](figs/safe-ide.png)

## `.gitignore` is your friend

```gitignore
# Jupyter Notebook
.ipynb_checkpoints

# secret
secret*
.env

# scratches
scratch.ipynb
Untitled.ipynb
labbooks/*
tmp/*
```

## Do not 'hard code' usernames and passwords

Use a `.env` file (or `secrets` or similar) and exclude it from git via `.gitignore`.

```sh
# edit this file and replace with actual usernames and passwords
# then save WITH the dot prefix e.g. env --> .env

# DO NOT SAVE as env (without the dot prefix) 
# else you will publish your secrets to github

# environment variables
EMAP_DB_USER=YOUR_USERNAME_HERE
EMAP_DB_PASSWORD=YOUR_PASSWORD_HERE
```

### Load your environment variables in Python

In [None]:
from dotenv import load_dotenv
load_dotenv('.env')
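
Once the variables are loaded, it can help to fail early if one is missing rather than pass `None` to a database driver. A minimal sketch, where `require_env` is a hypothetical helper and not part of `python-dotenv`:

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise KeyError(f"Environment variable {name} is not set; check your .env file")
    return value

# Usage, assuming load_dotenv() has already been called:
# user = require_env("EMAP_DB_USER")
```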

This is a template Jupyter Notebook for working with data at UCLH. The following features of the notebook and its associated files are documented here to minimise the risk of data leaks or other incidents.

- Usernames and passwords are stored in a `.env` file that is excluded from version control. The example `env` file at `./config/env` should be edited and saved as `./config/.env`. A utility function `load_env_vars()` is provided that will confirm this file exists and load the configuration into the working environment.
- `.gitattributes` is set to strip outputs from Jupyter Notebook cells so that they are not pushed to GitHub

In [None]:
import os
from dotenv import load_dotenv
from pathlib import Path

def load_env_vars(
    ENV_FILE_ID='rainy.fever.song',
    dotenv_path='./config/.env',
):
    """
    Load environment variables from dotenv_path.

    The .env file is expected to define ENV_FILE_ID; if the loaded value does
    not match, the file was probably not found and FileNotFoundError is raised.
    """
    dotenv_path = Path(dotenv_path)
    load_dotenv(dotenv_path=dotenv_path)

    # The sentinel value confirms that the expected .env file was actually read
    if os.getenv('ENV_FILE_ID') != ENV_FILE_ID:
        raise FileNotFoundError(f"""
        IMPORTANT
        An environment file holding the ENV_FILE_ID variable equal to '{ENV_FILE_ID}'
        should have been found at the {dotenv_path} path.

        Is the script being run from the repository root (emap-helper/)?
        Did you convert the example 'env' file to the '.env' file?

        Please check the above and try again
        """)
    return True

In [None]:
from sqlalchemy import create_engine

def make_emap_engine(db):
    # Load environment variables
    load_env_vars()
    
    if db == 'uds':
        # expects to run in HYLODE so these are part of this env
        host = os.getenv('EMAP_DB_HOST')
        name = os.getenv('EMAP_DB_NAME')
        port = os.getenv('EMAP_DB_PORT')
        user = os.getenv('EMAP_DB_USER')
        passwd = os.getenv('EMAP_DB_PASSWORD')
    elif db == 'ids':
        host = os.getenv('IDS_DB_HOST')
        name = os.getenv('IDS_DB_NAME')
        port = os.getenv('IDS_DB_PORT')
        user = os.getenv('IDS_DB_USER')
        passwd = os.getenv('IDS_DB_PASSWORD')
    else:
        raise ValueError("db is not recognised; should be one of 'uds' or 'ids'")
        
    # Construct the PostgreSQL connection
    emapdb_engine = create_engine(f'postgresql://{user}:{passwd}@{host}:{port}/{name}')
    return emapdb_engine
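
Passwords containing characters such as `@` or `/` will break a connection URL built by plain string formatting. One hedged alternative is to URL-encode the credentials first; `build_postgres_url` below is a hypothetical helper for illustration and is not part of SQLAlchemy.

```python
from urllib.parse import quote_plus

def build_postgres_url(user, passwd, host, port, name):
    """Build a PostgreSQL URL with the username and password URL-encoded."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(passwd)}"
        f"@{host}:{port}/{name}"
    )

# Invented credentials for illustration only
url = build_postgres_url("me", "p@ss/word", "localhost", 5432, "uds")
# 'postgresql://me:p%40ss%2Fword@localhost:5432/uds'
```

The resulting string can be passed to `create_engine` in place of the f-string used above.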


## Do not leak from Jupyter Notebooks

### `.gitattributes` to strip outputs from cells in a notebook

```sh
*.ipynb filter=strip-notebook-output
```
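
The `.gitattributes` line only names the filter; git also needs a `filter.strip-notebook-output.clean` command in its configuration, which is not shown here. As an illustration of what such a command has to do, the following sketch clears outputs from the notebook JSON using only the standard library. The function names are hypothetical, and a real setup might use an existing tool instead.

```python
import json

def strip_outputs(notebook):
    """Clear outputs and execution counts from every code cell (modifies in place)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook

def clean_filter(in_stream, out_stream):
    """Read a notebook from in_stream and write the stripped version to out_stream.

    A git clean filter would call this with sys.stdin and sys.stdout.
    """
    notebook = json.load(in_stream)
    json.dump(strip_outputs(notebook), out_stream, indent=1, ensure_ascii=False)
    out_stream.write("\n")
```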

## For R ...

Use `.Renviron` as you would `.env`; R reads it automatically when it starts.

```R
# IMPORTANT
# DO NOT ADD THE .Renviron VERSION OF THIS FILE TO VERSION CONTROL
# RENAME from dotrenviron to .Renviron in place and then update the
# environment variables with actual values

# IDS access
IDS_PWD=foo
IDS_HOST=bar
IDS_USER=me

# UDS access
UDS_PWD=foo
UDS_HOST=bar
UDS_USER=me

# Internet access
http_proxy=http://my-hospital.nhs.uk:1234/
HTTP_PROXY=http://my-hospital.nhs.uk:1234/
https_proxy=http://my-hospital.nhs.uk:1234/
HTTPS_PROXY=http://my-hospital.nhs.uk:1234/


# https://rstudio.github.io/renv/articles/docker.html
RENV_PATHS_CACHE=/home/rstudio/renv
```

# The End
