# Data Access Policy Tutorial

This tutorial presents the options available to a data owner for setting up an access policy on their data.

Policies are defined at the remote object level (e.g. a remote DataFrame), and the outputs of computations inherit their policies from the inputs. When there is more than one input, the new policy is the combination of all the input policies using the AND combinator.
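
To make the AND combination concrete, here is a minimal, library-independent sketch. It models each policy as a plain predicate over a hypothetical summary of the query, and the combined policy passes only when every input policy passes. The names `combine_with_and`, `min_rows_10` and `owner_only` are illustrative assumptions and are not part of BastionLab.

```python
from typing import Any, Callable, Dict, List

# Toy model: a "policy" is a predicate over a summary of the query.
# This illustrates AND-combination only; it is not BastionLab's implementation.
QueryInfo = Dict[str, Any]
PolicyCheck = Callable[[QueryInfo], bool]


def combine_with_and(policies: List[PolicyCheck]) -> PolicyCheck:
    """Return a policy that passes only if every input policy passes."""
    def combined(info: QueryInfo) -> bool:
        return all(policy(info) for policy in policies)
    return combined


# Two hypothetical input policies, one per input dataframe.
def min_rows_10(info: QueryInfo) -> bool:
    return info["min_agg_size"] >= 10


def owner_only(info: QueryInfo) -> bool:
    return info["user"] == "alice"


joined_policy = combine_with_and([min_rows_10, owner_only])

print(joined_policy({"min_agg_size": 12, "user": "alice"}))  # True
print(joined_policy({"min_agg_size": 12, "user": "bob"}))    # False
```

In this model, a result derived from two inputs is only as permissive as the strictest of its inputs.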

## Installing BastionLab Client and Starting the server

These steps are very similar to the [Quick Tour](https://github.com/mithril-security/bastionlab/blob/master/docs/docs/quick-tour/quick-tour.ipynb).

First we download and install the BastionLab client and test server using pip:

In [None]:
!pip install bastionlab bastionlab_server

Then, we start the test server:

In [29]:
import bastionlab_server

srv = bastionlab_server.start()

BastionLab server (version 0.3.5) already installed
Libtorch (version 1.12.1) already installed
TLS certificates already generated
Bastionlab server is now running on port 50056


[2022-12-16T15:55:15Z INFO  bastionlab] Authentication is disabled.
[2022-12-16T15:55:15Z INFO  bastionlab] Telemetry is enabled.
[2022-12-16T15:55:15Z INFO  bastionlab] BastionLab server listening on 0.0.0.0:50056.
[2022-12-16T15:55:15Z INFO  bastionlab] Server ready to take requests


## Why do you need to use a policy?

A policy is a set of rules describing the kind of operations that can be run on your data.

A policy has two sections:

1) Safe zone: The safe zone contains the rules specifying whether the result of a query is safe to return to a data scientist.

2) Unsafe handling: The unsafe handling parameter specifies the action that must be taken if a query breaks the rules of the safe zone.

To avoid scattering imports throughout the tutorial, let's import everything we need up front.

In [30]:
from bastionlab import Identity, Connection
from bastionlab.polars.policy import (
    Policy,
    AtLeastNof,
    Aggregation,
    UserId,
    Log,
    Review,
    Reject,
)

### Safe Zone

Three rules are currently supported. Ignore the unsafe_handling argument for the moment; it is covered later in this tutorial.

##### Aggregation:
The aggregation rule ensures that the returned dataframe aggregates, at minimum, the specified number of rows from the original dataset.

In the following example, if the result of a query does not aggregate at least 10 rows, it violates the safe zone.

In [31]:
policy = Policy(safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log())
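
As a rough intuition for what "aggregating at least 10 rows" means, the sketch below counts how many input rows feed each output row of a group-by and compares the smallest group with the threshold. It is a simplified stand-in written in plain Python. How the server actually tracks aggregation sizes is not described in this tutorial, and `smallest_group_size` and `satisfies_min_agg_size` are hypothetical helpers.

```python
from collections import Counter
from typing import Hashable, Sequence


def smallest_group_size(keys: Sequence[Hashable]) -> int:
    """Size of the smallest group produced by grouping rows on `keys`."""
    counts = Counter(keys)
    return min(counts.values()) if counts else 0


def satisfies_min_agg_size(keys: Sequence[Hashable], min_agg_size: int) -> bool:
    """True if every output row would aggregate at least `min_agg_size` rows."""
    return smallest_group_size(keys) >= min_agg_size


# 30 rows split evenly across 3 classes: every output row aggregates 10 rows.
pclass = [1, 2, 3] * 10
# 30 rows, but one age value appears only once: that group aggregates 1 row.
age = [22.0] * 15 + [38.0] * 14 + [80.0]

print(satisfies_min_agg_size(pclass, 10))  # True
print(satisfies_min_agg_size(age, 10))     # False
```

The second case shows why grouping on a fine-grained column can break the rule even when the dataset as a whole is large.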

##### UserId:
The user id rule lets a data owner grant access to a dataframe to one particular user only.
The user id is the hash of the public key of the user.

You may be familiar with how identities work from the [authentication tutorial](https://github.com/mithril-security/bastionlab/blob/master/docs/docs/tutorials/authentication.ipynb). The following code demonstrates how a user can obtain their user id.

The user should share their user id with the data owner, who can then add it to the safe zone as follows.

In [32]:
data_scientist = Identity.create("./data_scientist.pem")

user_id = data_scientist.pubkey.hash.hex()

policy = Policy(safe_zone=UserId(user_id), unsafe_handling=Log())
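
The idea of deriving a user id from a public key can be illustrated with the standard library alone. The sketch below hashes placeholder key bytes and renders the digest as hex. The choice of SHA-256 and the placeholder byte strings are assumptions made for illustration; the tutorial does not state which hash function or key serialization BastionLab uses, and `toy_user_id` is a hypothetical helper.

```python
import hashlib


def toy_user_id(public_key_bytes: bytes) -> str:
    """Hex digest of the key bytes (SHA-256 is assumed for illustration)."""
    return hashlib.sha256(public_key_bytes).hexdigest()


# Placeholder bytes standing in for two serialized public keys.
alice_key = b"example public key bytes for alice"
bob_key = b"example public key bytes for bob"

alice_id = toy_user_id(alice_key)

print(len(alice_id))                       # 64 hex characters
print(alice_id == toy_user_id(alice_key))  # True: same key, same id
print(alice_id == toy_user_id(bob_key))    # False: different key, different id
```

Because the id is derived from the public key only, a user can share it with the data owner without revealing their private key.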

##### Please be aware that UserIds will only work if authentication is enabled on the server.

##### AtLeastNof:

This rule is a collection of rules. As the rule's name suggests, the result of a query must pass at least 'N' rules of the total number of rules.

You can use this to specify, for example, different rules for different users. 

A possible scenario is:

User 1 is trusted by the data owner and can run any query they want on the dataset and retrieve the results.

User 2 is unknown to the data owner and must aggregate a minimum of 20 rows in the resulting dataframe.

Such a policy would look like this:

In [33]:
data_scientist = Identity.create("./data_scientist.pem")

user_id = data_scientist.pubkey.hash.hex()

policy = Policy(
    safe_zone=AtLeastNof(n=1, of=[UserId(user_id), Aggregation(min_agg_size=20)]),
    unsafe_handling=Log(),
)

If you changed 'n' to 2 above, both rules would have to pass: only the 'data_scientist' user could access the data, and the queries they run would still have to aggregate a minimum of 20 rows.
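
The effect of 'n' can be sketched with a small threshold combinator in plain Python. This is a toy model under stated assumptions, not the server's logic: rules are predicates over a hypothetical query summary, and `at_least_n_of`, `is_trusted_user` and `aggregates_20` are illustrative names.

```python
from typing import Any, Callable, Dict, List

Query = Dict[str, Any]
Rule = Callable[[Query], bool]


def at_least_n_of(n: int, rules: List[Rule]) -> Rule:
    """Return a rule that passes when at least `n` of `rules` pass."""
    def check(query: Query) -> bool:
        passed = sum(1 for rule in rules if rule(query))
        return passed >= n
    return check


def is_trusted_user(query: Query) -> bool:
    return query["user_id"] == "trusted-id"


def aggregates_20(query: Query) -> bool:
    return query["min_agg_size"] >= 20


either = at_least_n_of(1, [is_trusted_user, aggregates_20])
both = at_least_n_of(2, [is_trusted_user, aggregates_20])

trusted_small = {"user_id": "trusted-id", "min_agg_size": 1}
unknown_large = {"user_id": "someone-else", "min_agg_size": 25}

print(either(trusted_small), either(unknown_large))  # True True
print(both(trusted_small), both(unknown_large))      # False False
```

With n equal to the number of rules the combinator behaves like AND, and with n equal to 1 it behaves like OR.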

### Unsafe Handling

Unsafe handling is where the data owner specifies the action that must be taken when a query violates the safe zone.

There are currently 3 unsafe handling actions:

##### Log():
The Log action logs every query that violates the safe zone. This is the default action.

For example, if the policy in the cell below is violated because an operation only aggregates 5 rows, the server will log that query.

##### This action is unsafe! It is only suitable for development and testing. The server will return the dataframe that violates the safe zone to the user.

In [7]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Log())

##### Review():
The Review action will require the data owner's approval to return any dataframes that violate the safe zone.

The data owner can then review the operation and either accept or reject the query.

If approved, the dataframe is returned to the user.

If rejected, the user will be notified that the data owner has rejected their query.


In [8]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Review())

##### Reject():
The Reject action will automatically reject any query that violates the safe zone.

In [9]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Reject())
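
The three actions can be summarized as a small decision function. The sketch below is a conceptual model in plain Python: `UnsafeAction` and `release` are hypothetical names, and the owner's review is reduced to a boolean for brevity, whereas a real review is an interactive approval step.

```python
from enum import Enum
from typing import Any, List, Optional


class UnsafeAction(Enum):
    LOG = "Log"
    REVIEW = "Review"
    REJECT = "Reject"


def release(result: Any, is_safe: bool, action: UnsafeAction,
            owner_approves: bool, audit_log: List[str]) -> Optional[Any]:
    """Return `result` if it may be released to the user, otherwise None."""
    if is_safe:
        return result
    if action is UnsafeAction.LOG:
        # The violation is recorded, but the result is still returned.
        audit_log.append("safe zone violation")
        return result
    if action is UnsafeAction.REVIEW:
        # The data owner decides; modeled here as a boolean.
        return result if owner_approves else None
    return None  # UnsafeAction.REJECT


audit_log: List[str] = []
print(release("df", False, UnsafeAction.LOG, False, audit_log))     # df
print(release("df", False, UnsafeAction.REVIEW, False, audit_log))  # None
print(release("df", False, UnsafeAction.REJECT, True, audit_log))   # None
print(audit_log)  # ['safe zone violation']
```

Only Log releases an unsafe result without the owner's involvement, which is why it is reserved for development and testing.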

### Setting a policy for a dataframe

Let's download a dataset to begin.

In [10]:
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

--2022-12-16 15:25:42--  https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘titanic.csv’


2022-12-16 15:25:42 (14.2 MB/s) - ‘titanic.csv’ saved [60302/60302]



The data owner may attach a policy to a dataframe as follows:

In [34]:
import polars as pl

connection = Connection("localhost", 50056)

df = pl.read_csv("titanic.csv")

policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Reject())

rdf = connection.client.polars.send_df(df, policy=policy)

{"safe_zone":{"Aggregation":10},"unsafe_handling":"Reject"}


[2022-12-16T15:55:34Z INFO  bastionlab_polars] Succesfully sent dataframe c802e0d5-32c7-4da4-8d4a-b8d21868e07f to server
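
The first line printed by the cell above appears to be the policy serialized as JSON. As a small sketch using only the standard library, the string (copied verbatim from that output) can be parsed to double-check what was attached to the dataframe. Treating this line as a stable serialization format is an assumption.

```python
import json

# String copied from the cell output above.
printed_policy = '{"safe_zone":{"Aggregation":10},"unsafe_handling":"Reject"}'

policy_dict = json.loads(printed_policy)

print(policy_dict["safe_zone"])        # {'Aggregation': 10}
print(policy_dict["unsafe_handling"])  # Reject
```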


Let's run a safe query that aggregates at least 10 rows.

In [35]:
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
    .fetch()
)

print(per_class_rates)

shape: (3, 2)
┌────────┬──────────┐
│ Pclass ┆ Survived │
│ ---    ┆ ---      │
│ i64    ┆ f64      │
╞════════╪══════════╡
│ 1      ┆ 0.62963  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 0.472826 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ 0.242363 │
└────────┴──────────┘


[2022-12-16T15:55:42Z INFO  bastionlab_polars] Succesfully ran query on 5d4b2d0c-c948-49f2-948b-bab7d6e82518


Let's now try an unsafe query that doesn't aggregate the minimum number of rows.

In [36]:
unsafe_df = (
    rdf.select([pl.col("Age"), pl.col("Survived")])
    .groupby(pl.col("Age"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
    .fetch()
)

print(unsafe_df)

Safe zone violation: a DataFrame has been non-privately fetched.
Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.
The query has been rejected by the data owner.
None


[2022-12-16T15:55:46Z INFO  bastionlab_polars] Succesfully ran query on b8efd1d9-6a18-400f-a86d-c7273028b255


### Sanitization of columns

Some columns may never be safe to expose, such as a column of names.

These can be sanitized using the sanitized_columns parameter in the send_df call.

The values of sanitized columns are replaced by nulls when the dataframe is fetched.

In [37]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Log())

rdf = connection.client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf.head(5).collect().fetch()

{"safe_zone":{"Aggregation":10},"unsafe_handling":"Log"}
Safe zone violation: a DataFrame has been non-privately fetched.
Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.
Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

This incident will be reported to the data owner.


[2022-12-16T15:55:48Z INFO  bastionlab_polars] Succesfully sent dataframe 65082778-f3cf-4b9a-a862-d7b0537f7d85 to server
[2022-12-16T15:55:48Z INFO  bastionlab_polars] Succesfully ran query on 92629d27-7b63-4a24-9a08-e077fa982012


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,,"""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,,"""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,,"""female""",26.0,0,0,"""STON/O2. 31012...",7.925,,"""S"""
4,1,1,,"""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,,"""male""",35.0,0,0,"""373450""",8.05,,"""S"""



As you can see, the values in the 'Name' column have been replaced by nulls.
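
The observable effect of sanitization can be mimicked locally with a few lines of plain Python. This sketch represents a table as a dictionary of column lists and blanks out the chosen columns. It only models the behavior seen in the output above and is not how the server implements it; `sanitize` and the sample rows are hypothetical.

```python
from typing import Any, Dict, List

Table = Dict[str, List[Any]]


def sanitize(table: Table, sanitized_columns: List[str]) -> Table:
    """Return a copy of `table` with the sanitized columns set to None."""
    hidden = set(sanitized_columns)
    return {
        name: [None] * len(values) if name in hidden else list(values)
        for name, values in table.items()
    }


passengers = {
    "PassengerId": [1, 2],
    "Name": ["Passenger A", "Passenger B"],  # placeholder values
    "Age": [22.0, 38.0],
}

print(sanitize(passengers, ["Name"]))
# {'PassengerId': [1, 2], 'Name': [None, None], 'Age': [22.0, 38.0]}
```

The column keeps its name and length, so downstream code that expects the schema continues to work while the sensitive values are never exposed.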

Now let's terminate the connection and stop the server.

In [28]:
connection.close()
bastionlab_server.stop(srv)

Stopping BastionLab's server...


True