# Data access policy
____________________________________

In this tutorial, we are going to present the various options available to a data owner to set up a policy for their data.

Privacy policies define the kind of operations that can be run on a dataframe. 

These are useful to ensure that data scientists are unable to fetch individual rows or run any operation that leaks information from the dataset. The policy must be set based on the sensitivity of the dataset.

We also cover sanitization of columns in this tutorial. Sanitization allows the data owner to specify columns which should not be returned as a result of any query. 

If the result of a query contains a sanitized column, the entire column will be replaced with `null` values before being returned to the data scientist.

## Pre-requisites

____________________________________

### Technical requirements

To start this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [pip](https://pypi.org/project/pip/), the Python package manager
- [Docker](https://www.docker.com/)

*Here's a [Docker tutorial](https://docker-curriculum.com/) to set it up on your computer.*

To run this notebook, you will also need to install Polars, BastionLab and our easy-install `bastionlab_server` pip package. You can install all of these by running the following code block.

In [None]:
! pip install polars
! pip install bastionlab
! pip install bastionlab_server

Then we run the server:

In [29]:
import bastionlab_server

srv = bastionlab_server.start()

BastionLab server (version 0.3.5) already installed
Libtorch (version 1.12.1) already installed
TLS certificates already generated
Bastionlab server is now running on port 50056


[2022-12-16T15:55:15Z INFO  bastionlab] Authentication is disabled.
[2022-12-16T15:55:15Z INFO  bastionlab] Telemetry is enabled.
[2022-12-16T15:55:15Z INFO  bastionlab] BastionLab server listening on 0.0.0.0:50056.
[2022-12-16T15:55:15Z INFO  bastionlab] Server ready to take requests


## Privacy policy options
_______________________________

A privacy policy is a set of rules describing the kind of operations that can be run on your data.

Policies are defined at the level of the remote object (the `RemoteDataFrame`, BastionLab's main object), and the outputs of computations (resulting dataframes) inherit their policies from their inputs. When there is more than one input, the new policy combines all the input policies using the AND combinator.
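To make the AND combination concrete, here is a minimal, library-free sketch. It models a rule as a plain Python predicate; the names `min_aggregation` and `combine_with_and` are hypothetical and chosen for illustration only, and this is not how BastionLab represents policies internally.

```python
from typing import Callable, Dict, List

# A rule is modelled as a predicate over facts about a query result.
# Illustrative stand-in only, not BastionLab's internal representation.
Rule = Callable[[Dict[str, int]], bool]


def min_aggregation(min_rows: int) -> Rule:
    """Hypothetical rule: the result must aggregate at least `min_rows` rows."""
    return lambda result: result["aggregated_rows"] >= min_rows


def combine_with_and(rules: List[Rule]) -> Rule:
    """The combined rule passes only if every input rule passes."""
    return lambda result: all(rule(result) for rule in rules)


# Two inputs with different thresholds feed one computation.
combined = combine_with_and([min_aggregation(10), min_aggregation(20)])

print(combined({"aggregated_rows": 15}))  # False: the stricter input rule fails
print(combined({"aggregated_rows": 25}))  # True: both input rules pass
```

Under this model, the output of a computation is always at least as restricted as its most restricted input.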

A policy has two sections:

- `safe_zone`: contains the rules specifying whether the result of a query is safe to return to a data scientist.

- `unsafe_handling`: specifies the action that must be taken if a query breaks the rules of the safe zone.

Now, let's import all the elements we'll demonstrate in this tutorial.

In [30]:
from bastionlab import Identity, Connection
from bastionlab.polars.policy import (
    Policy,
    AtLeastNof,
    Aggregation,
    UserId,
    Log,
    Review,
    Reject,
)

### `safe_zone`

The safe zone contains the rules specifying whether the result of a query is safe to return to a user. Ignore the `unsafe_handling` part for the moment; we'll explain it next.

#### `Aggregation()`
The `Aggregation()` rule ensures that the returned dataframe aggregates, at minimum, the specified number of rows from the original dataset.

In the following example, if the result of a query does not aggregate at least 10 rows, it will violate the safe zone.

In [31]:
policy = Policy(safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log())
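As a rough mental model of what the rule above checks, the sketch below counts how many input rows fall into each group of a group-by and compares the smallest group with the threshold. This is a simplification under the assumption that every output row must summarise at least `min_agg_size` input rows; the helper names are hypothetical and the server's actual bookkeeping is not shown in this tutorial.

```python
from collections import Counter
from typing import Hashable, Iterable


def smallest_group_size(keys: Iterable[Hashable]) -> int:
    """Return the size of the smallest group produced by grouping on `keys`."""
    counts = Counter(keys)
    return min(counts.values()) if counts else 0


def satisfies_min_agg_size(keys: Iterable[Hashable], min_agg_size: int) -> bool:
    """True if every group contains at least `min_agg_size` input rows."""
    return smallest_group_size(keys) >= min_agg_size


# 30 rows split into three groups of 10 rows each.
coarse_keys = [1] * 10 + [2] * 10 + [3] * 10
# 30 rows where one group contains a single row.
fine_keys = [1] * 29 + [2]

print(satisfies_min_agg_size(coarse_keys, 10))  # True
print(satisfies_min_agg_size(fine_keys, 10))    # False
```

Grouping on a column with many distinct values tends to produce small groups, which is why such queries are more likely to leave the safe zone.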

#### `UserId()`
The `UserId()` rule lets a data owner grant access to a dataframe to one particular user. The `user_id` is the hash of the user's public key.

The following code block shows how a user can obtain their `user_id`: `data_scientist.pubkey` returns the public key from the `Identity`, `hash` returns the hash of that public key, and `hex()` converts the hash to a hexadecimal string.

The user should share their `user_id` with the data owner, who can then add it to the safe zone as shown in the last line of the block.

 > *Note - We explain what `Identities` are and how they work in our [Authentication tutorial](https://github.com/mithril-security/bastionlab/blob/master/docs/docs/tutorials/authentication.ipynb).* 

In [32]:
data_scientist = Identity.create("./data_scientist.pem")

user_id = data_scientist.pubkey.hash.hex()

policy = Policy(safe_zone=UserId(user_id), unsafe_handling=Log())

> ***Important** - `UserId()` will only work if authentication is enabled on the server.*
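The "hash of the public key, rendered as hex" idea can be illustrated with the standard library alone. In this sketch the key bytes are a placeholder rather than a real key, and SHA-256 is an assumption made for illustration; the tutorial does not state which hash function BastionLab uses.

```python
import hashlib

# Placeholder bytes standing in for a serialized public key (not a real key).
public_key_bytes = b"example-public-key-bytes"

# Assumption: SHA-256 is used here purely for illustration.
digest = hashlib.sha256(public_key_bytes).digest()

# bytes.hex() turns the raw digest into a lowercase hexadecimal string.
user_id = digest.hex()

print(len(digest))   # 32 (bytes)
print(len(user_id))  # 64 (hexadecimal characters)
```

Because the identifier is derived from the public key only, the user can share it with the data owner without exposing their private key.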

#### `AtLeastNof()`

`AtLeastNof()` wraps a collection of rules and requires the result of a query to pass at least `n` of them.

You can use this to specify, for example, different rules for different users. 

A possible scenario would be:

- User 1 (data_scientist) is trusted by the data owner. They can run any query they want on the dataset and retrieve the results.
- Other users are untrusted by the data owner. They must aggregate a minimum of 20 rows in the resulting dataframe.

When a query is run on a dataframe with this policy, the `AtLeastNof` rule checks that at least `n` of the rules listed in `of` are matched.

Here `n` is 1 and the `of` parameter is a list containing 2 rules, so a query that matches at least 1 of the 2 rules will execute and return successfully.

That is, if the `data_scientist` runs a query, they match the `UserId` rule set in the policy and the query executes successfully.

If another user runs a query, its result must aggregate a minimum of 20 rows to be considered safe and be returned.

In [33]:
data_scientist = Identity.create("./data_scientist.pem")

data_scientist_id = data_scientist.pubkey.hash.hex()

policy = Policy(
    safe_zone=AtLeastNof(n=1, of=[UserId(data_scientist_id), Aggregation(min_agg_size=20)]),
    unsafe_handling=Log(),
)

If you changed `n` to 2 in the code above, the Policy would enforce that both rules match. In this case, access is only allowed for the `data_scientist` and no other user. The queries run by the `data_scientist` would also need to aggregate a minimum of 20 rows.
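The "n of these rules" logic can be sketched in a few lines of plain Python. The helpers `user_is`, `aggregates_at_least` and `at_least_n_of` are hypothetical stand-ins for the real rule classes, and the user identifiers are made-up strings; the sketch only mirrors the counting behaviour described in this section.

```python
from typing import Any, Callable, Dict, List

Query = Dict[str, Any]
Rule = Callable[[Query], bool]


def user_is(expected_id: str) -> Rule:
    """Hypothetical stand-in for a user identity rule."""
    return lambda query: query["user_id"] == expected_id


def aggregates_at_least(min_rows: int) -> Rule:
    """Hypothetical stand-in for a minimum aggregation rule."""
    return lambda query: query["aggregated_rows"] >= min_rows


def at_least_n_of(n: int, rules: List[Rule]) -> Rule:
    """Pass when at least `n` of the given rules are satisfied."""
    return lambda query: sum(1 for rule in rules if rule(query)) >= n


rules = [user_is("trusted-id"), aggregates_at_least(20)]
one_of_two = at_least_n_of(1, rules)
two_of_two = at_least_n_of(2, rules)

trusted_small = {"user_id": "trusted-id", "aggregated_rows": 1}
other_small = {"user_id": "other-id", "aggregated_rows": 5}
other_large = {"user_id": "other-id", "aggregated_rows": 25}

print(one_of_two(trusted_small))  # True: the identity rule matches
print(one_of_two(other_small))    # False: neither rule matches
print(one_of_two(other_large))    # True: the aggregation rule matches
print(two_of_two(trusted_small))  # False: only one of the two rules matches
```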

### `unsafe_handling`

The `unsafe_handling` parameter is where the data owner specifies the action that must be taken when a query violates the safe zone.

#### `Log()`
The `Log()` action logs every query that violates the safe zone. It is the *default* action.

For example, if the following policy (which requires a minimum of 10 aggregated rows) is violated because an operation only aggregates 5 rows, the server will log that query.

> **Important - This action is unsafe!** It is only suitable for development and testing. The server will return the dataframe that violates the safe zone to the user.

In [7]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Log())

#### `Review()`
The `Review()` action requires the data owner's approval before any dataframe that violates the safe zone is returned. The data owner can then review the operation and either accept or reject the query.

If approved, the dataframe is returned to the user.

If rejected, the user will be notified that the data owner has rejected their query.


In [8]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Review())

#### `Reject()`

The `Reject()` action will automatically reject any query that violates the safe zone.

In [9]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Reject())
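The three actions can be summarised as a small decision function. The sketch below is a conceptual model of the behaviour described in this section; the `UnsafeHandling` enum, the `resolve` function and the returned strings are hypothetical and are not part of the BastionLab API.

```python
from enum import Enum


class UnsafeHandling(Enum):
    """Hypothetical labels mirroring the three actions described above."""
    LOG = "Log"
    REVIEW = "Review"
    REJECT = "Reject"


def resolve(in_safe_zone: bool, handling: UnsafeHandling, owner_approves: bool = False) -> str:
    """Return what happens to a query result under the given handling."""
    if in_safe_zone:
        return "returned"
    if handling is UnsafeHandling.LOG:
        # Unsafe: the result still reaches the user, the violation is only logged.
        return "returned and logged"
    if handling is UnsafeHandling.REVIEW:
        # The data owner decides whether the result may be returned.
        return "returned after approval" if owner_approves else "rejected by owner"
    # Reject: no result is returned.
    return "rejected"


print(resolve(True, UnsafeHandling.REJECT))                       # returned
print(resolve(False, UnsafeHandling.LOG))                         # returned and logged
print(resolve(False, UnsafeHandling.REVIEW, owner_approves=True)) # returned after approval
print(resolve(False, UnsafeHandling.REJECT))                      # rejected
```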

## Set up a privacy policy
___________________________________

Now that we know how all the rules work, let's play with an example. We'll use the [Titanic dataset](https://www.kaggle.com/competitions/titanic/data), which contains information relating to the passengers aboard the Titanic. We can download it by running the code block below. 

In [10]:
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

--2022-12-16 15:25:42--  https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘titanic.csv’


2022-12-16 15:25:42 (14.2 MB/s) - ‘titanic.csv’ saved [60302/60302]



We'll require a minimum of 10 aggregated rows for any query and reject the queries that don't follow this rule.

In [34]:
import polars as pl

connection = Connection("localhost", 50056)

df = pl.read_csv("titanic.csv")

policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Reject())

rdf = connection.client.polars.send_df(df, policy=policy)

{"safe_zone":{"Aggregation":10},"unsafe_handling":"Reject"}


[2022-12-16T15:55:34Z INFO  bastionlab_polars] Succesfully sent dataframe c802e0d5-32c7-4da4-8d4a-b8d21868e07f to server
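The policy line printed after the cell above is plain JSON, so it can be inspected with the standard library if you want to double-check what was sent. This sketch copies that printed line into a string; it does not call BastionLab.

```python
import json

# The serialized policy exactly as printed in the cell output above.
serialized = '{"safe_zone":{"Aggregation":10},"unsafe_handling":"Reject"}'

policy_dict = json.loads(serialized)

# The safe zone maps a rule name to its argument.
rule_name, rule_arg = next(iter(policy_dict["safe_zone"].items()))

print(rule_name, rule_arg)             # Aggregation 10
print(policy_dict["unsafe_handling"])  # Reject
```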


To test it, let's run a safe query that aggregates at least 10 rows:

In [35]:
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
    .fetch()
)

print(per_class_rates)

shape: (3, 2)
┌────────┬──────────┐
│ Pclass ┆ Survived │
│ ---    ┆ ---      │
│ i64    ┆ f64      │
╞════════╪══════════╡
│ 1      ┆ 0.62963  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 0.472826 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ 0.242363 │
└────────┴──────────┘


[2022-12-16T15:55:42Z INFO  bastionlab_polars] Succesfully ran query on 5d4b2d0c-c948-49f2-948b-bab7d6e82518


Let's now try an unsafe query that doesn't aggregate the minimum number of rows.

In [36]:
unsafe_df = (
    rdf.select([pl.col("Age"), pl.col("Survived")])
    .groupby(pl.col("Age"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
    .fetch()
)

print(unsafe_df)

Safe zone violation: a DataFrame has been non-privately fetched.
Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.
The query has been rejected by the data owner.
None


[2022-12-16T15:55:46Z INFO  bastionlab_polars] Succesfully ran query on b8efd1d9-6a18-400f-a86d-c7273028b255


### Sanitization of columns

We're now handling aggregated queries, but some columns are never safe to expose - for example, a column of names. We want to make sure those are removed from the dataframe when the dataframe is fetched. 

We can do this by using the `sanitized_columns` parameter in the `send_df()` call:

In [37]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Log())

rdf = connection.client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf.head(5).collect().fetch()

{"safe_zone":{"Aggregation":10},"unsafe_handling":"Log"}
Safe zone violation: a DataFrame has been non-privately fetched.
Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.
Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

This incident will be reported to the data owner.


[2022-12-16T15:55:48Z INFO  bastionlab_polars] Succesfully sent dataframe 65082778-f3cf-4b9a-a862-d7b0537f7d85 to server
[2022-12-16T15:55:48Z INFO  bastionlab_polars] Succesfully ran query on 92629d27-7b63-4a24-9a08-e077fa982012


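| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | i64 | i64 | str | str | f64 | i64 | i64 | str | f64 | str | str |
| 1 | 0 | 3 | null | "male" | 22.0 | 1 | 0 | "A/5 21171" | 7.25 | null | "S" |
| 2 | 1 | 1 | null | "female" | 38.0 | 1 | 0 | "PC 17599" | 71.2833 | "C85" | "C" |
| 3 | 1 | 3 | null | "female" | 26.0 | 0 | 0 | "STON/O2. 31012... | 7.925 | null | "S" |
| 4 | 1 | 1 | null | "female" | 35.0 | 1 | 0 | "113803" | 53.1 | "C123" | "S" |
| 5 | 0 | 3 | null | "male" | 35.0 | 0 | 0 | "373450" | 8.05 | null | "S" |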


When we print the first five rows of the dataset, the `'Name'` column has been replaced by `null` values!
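The effect of sanitization can be pictured with a small, library-free sketch: every value in a sanitized column is replaced with `None` (Python's counterpart of `null`) while the other columns are left untouched. The `sanitize` helper and the placeholder names are hypothetical; in BastionLab the replacement happens on the server when the dataframe is fetched.

```python
from typing import Dict, List, Optional

Table = Dict[str, List[Optional[object]]]


def sanitize(table: Table, sanitized_columns: List[str]) -> Table:
    """Return a copy of `table` where sanitized columns contain only None."""
    return {
        name: [None] * len(values) if name in sanitized_columns else list(values)
        for name, values in table.items()
    }


# Tiny illustrative table; the names are placeholders, not real passenger data.
table: Table = {
    "PassengerId": [1, 2],
    "Name": ["Passenger A", "Passenger B"],
    "Sex": ["male", "female"],
}

fetched = sanitize(table, ["Name"])

print(fetched["Name"])         # [None, None]
print(fetched["PassengerId"])  # [1, 2]
```

Note that the column itself is kept, so the shape of the dataframe does not change; only its values are hidden.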

We have successfully set up a privacy policy. Now let's terminate the connection and stop the server:

In [28]:
connection.close()
bastionlab_server.stop(srv)

Stopping BastionLab's server...


True