<div id="colab_button">
  <h1>Defining the privacy policy</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/defining_policy_privacy.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

_____________________________________________________________________

When collaborating with data scientists, data owners often have to manually sanitize the extracts of the datasets they share. This is unsafe due to the high risk of human error, and it costs a lot of time and resources.

Implementing a data access policy is the solution we found to automate this process, while making it safer and less of a headache. Our privacy policy defines the kind of operations that can be run on a `RemoteDataFrame` (the main object you'll manipulate with BastionLab). It ensures that data scientists are unable to fetch individual rows or run any operation that leaks information. The policy must be set based on the sensitivity of the dataset.

In this tutorial, we'll show **how it works**, which **options you can customize** to your needs, and **how to implement it** on your dataset. 

## Pre-requisites

____________________________________

### Installation
In order to run this notebook, we need to:
- Have [Python 3.7](https://www.python.org/downloads/) (or greater) and [pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)

We'll do so by running the code block below. 

>If you are running this notebook on your machine instead of [Google Colab](https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.6/docs/docs/tutorials/data_cleaning.ipynb), you can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [None]:
# pip packages
!pip install bastionlab
!pip install bastionlab_server

### Launch the server

In [None]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

## Privacy policy options
_______________________________

A privacy policy is a set of rules describing the kind of operations that can be run on your data. 

Technically, policies are defined at the `RemoteDataFrame` level (*BastionLab's main object*), which means that every `RemoteDataFrame` produced as output inherits its policy from its input. When there is more than one input, the new policy combines all the input policies with the `AND` combinator. In this section, we'll cover the different options you can define.

A policy has various sections:

- `safe_zone`: it contains the rules specifying whether the result of a query is safe to return to a data scientist.

- `unsafe_handling`: this parameter specifies the type of action that must be taken if a query breaks the rules of the safe zone. 

- `savable`: this parameter accepts a boolean value. If `True`, the `RemoteDataFrame` itself and all the `RemoteDataFrame`s derived from it can be saved on the server.
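To make the `AND` combination of input policies more concrete, here is a minimal plain-Python sketch. It does not use BastionLab: `min_rows` and `and_rules` are hypothetical helpers, and the `min_group_size` field is an invented summary of a query. It only illustrates the idea that a result built from two inputs must satisfy the safe zone of both. How `unsafe_handling` and `savable` are combined is not covered here.

```python
from typing import Callable, Dict

# A "rule" here is just a function that inspects a toy query summary.
# These names are illustrative and are not part of the BastionLab API.
Rule = Callable[[Dict[str, int]], bool]


def min_rows(n: int) -> Rule:
    """Toy rule: every output row must aggregate at least `n` input rows."""
    return lambda query: query["min_group_size"] >= n


def and_rules(*rules: Rule) -> Rule:
    """Combine rules so that the result is safe only if every rule passes."""
    return lambda query: all(rule(query) for rule in rules)


left_input_rule = min_rows(10)   # safe zone of the first input
right_input_rule = min_rows(20)  # safe zone of the second input
joined_rule = and_rules(left_input_rule, right_input_rule)

print(joined_rule({"min_group_size": 15}))  # False: passes 10 but not 20
print(joined_rule({"min_group_size": 25}))  # True: passes both
```

In this sketch the stricter of the two inputs effectively decides the outcome, which is what you would expect from an `AND`.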

Now, let's import all the options a policy can take, which we'll demonstrate in this tutorial:

In [1]:
from bastionlab import Identity, Connection
from bastionlab.polars.policy import (
    Policy,
    AtLeastNof,
    Aggregation,
    UserId,
    Log,
    Review,
    Reject,
)

### `safe_zone`

The safe zone contains the rules specifying whether the result of a query is safe to return to a user. 

#### `Aggregation()`
The `Aggregation()` rule ensures that the returned dataframe aggregates, at minimum, the specified number of rows from the original dataset.

In the following example, if the result of a query does not aggregate at least 10 rows, it will violate the safe zone.

In [2]:
policy = Policy(
    safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log(), savable=False
)
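To build intuition for what "aggregating at least N rows" means, here is a plain-Python sketch. It is not how the server enforces the rule. It assumes that each output row of a grouped query must be built from at least `min_agg_size` input rows. `smallest_group` and `passes_min_agg` are hypothetical helpers and the rows are toy data.

```python
from collections import Counter


def smallest_group(rows, key):
    """Return the size of the smallest group when `rows` are grouped by `key`."""
    counts = Counter(row[key] for row in rows)
    return min(counts.values())


def passes_min_agg(rows, key, min_agg_size):
    """Toy check: every group must contain at least `min_agg_size` rows."""
    return smallest_group(rows, key) >= min_agg_size


# Toy data: 12 rows in class 1, 10 rows in class 2, 4 rows in class 3.
rows = [{"Pclass": 1}] * 12 + [{"Pclass": 2}] * 10 + [{"Pclass": 3}] * 4

print(passes_min_agg(rows, "Pclass", 10))  # False: class 3 only has 4 rows
print(passes_min_agg(rows, "Pclass", 4))   # True: the smallest group has 4 rows
```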

#### `UserId()`

The `UserId()` rule lets a data owner grant access to a dataframe to one particular user. The `user_id` is the hash of the user's public key.

> *Note - We explain what `Identities` are and how they work in our [Authentication tutorial](https://github.com/mithril-security/bastionlab/blob/master/docs/docs/tutorials/authentication.ipynb).* 

The workflow is as follows: on one side, the data scientist (or user) should create their Identity and obtain their `user_id`. They then share it with the other side, the data owner, so that the data owner can add it to the safe zone.

Here's how:

In [3]:
# Data scientist side
data_scientist = Identity.create("./data_scientist")
user_id = (
    data_scientist.pubkey.hash.hex()
)  # hex-encoded hash of the Identity's public key

# Data owner side
policy = Policy(safe_zone=UserId(user_id), unsafe_handling=Log(), savable=False)

> ***Important** - `UserId()` will only work if authentication is enabled on the server.*

#### `AtLeastNof()`

`AtLeastNof()` wraps a collection of rules and requires the result of a query to pass at least `n` of them.

You can use this to specify, for example, different rules for different users. 

A possible scenario would be:

- Our main user, a data scientist, is trusted by the data owner. They can run any query they want on the dataset and retrieve the results.
- Other users are untrusted by the data owner. They must aggregate a minimum of 20 rows in the resulting dataframe.

When a query is run on the dataframe with this policy, the `AtLeastNof()` rule will check that at least `n` of the rules listed in `of` are matched. Another way of understanding it is that ***either*** the user connecting is the `trusted_data_scientist`, ***or*** they have to aggregate a minimum of 20 rows.

In [4]:
data_scientist = Identity.create("./data_scientist")

trusted_data_scientist_id = data_scientist.pubkey.hash.hex()

policy = Policy(
    safe_zone=AtLeastNof(
        n=1, of=[UserId(trusted_data_scientist_id), Aggregation(min_agg_size=20)]
    ),
    unsafe_handling=Log(),
    savable=False,
)

If you changed `n` to 2 in the code above, the policy would enforce that both rules match: access would only be allowed for the `trusted_data_scientist` ***and*** their queries would also need to aggregate a minimum of 20 rows.
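The counting logic behind `n` and `of` can be sketched in a few lines of plain Python. This is only an illustration of the "at least `n` of these rules" idea, not BastionLab's implementation. `at_least_n_of` is a hypothetical helper, and the two booleans stand in for the outcome of the `UserId()` and `Aggregation()` rules.

```python
def at_least_n_of(n, results):
    """Return True when at least `n` of the rule results are True."""
    return sum(bool(result) for result in results) >= n


# Toy situation: an untrusted user runs a query that aggregates 20+ rows.
is_trusted_user = False
aggregates_20_rows = True
results = [is_trusted_user, aggregates_20_rows]

print(at_least_n_of(1, results))  # True: one matching rule is enough
print(at_least_n_of(2, results))  # False: both rules would have to match
```

With `n=1` the rules behave like an `OR`, and with `n` equal to the number of rules they behave like an `AND`.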

### `unsafe_handling`

The `unsafe_handling` parameter is where the data owner specifies the action that must be taken when a query violates the safe zone.

#### `Log()`

> **Important - This action is unsafe!** It is only suitable for development and testing. The server will return the dataframe that violates the safe zone to the user.

The `Log()` action logs every query that violates the safe zone. It is the ***default*** action.

For example, if the following policy (which requires a minimum of 10 aggregated rows) is violated because an operation only aggregates 5 rows, the server will log that query.

In [5]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Log(), savable=False)

#### `Review()`
The `Review()` action will require the data owner's approval to return any dataframes that violate the safe zone. Then the data owner can review the operation and either accept or reject the query.

If approved, the dataframe is returned to the user. If rejected, the user will be notified that the data owner has rejected their query.


In [6]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Review(), savable=False)

#### `Reject()`

The `Reject()` action will automatically reject any query that violates the safe zone.

In [7]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Reject(), savable=False)
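The three actions can be compared side by side with a small plain-Python sketch. It models only the outcome described above for a query that has already violated the safe zone. `handle_unsafe` and its string arguments are hypothetical and are not part of the BastionLab API.

```python
def handle_unsafe(action, owner_approves=False):
    """Return True if a query that violated the safe zone still yields its result."""
    if action == "log":
        return True  # the violation is logged, but the data is returned anyway
    if action == "review":
        return owner_approves  # returned only if the data owner accepts
    if action == "reject":
        return False  # never returned
    raise ValueError(f"unknown action: {action}")


for action in ("log", "review", "reject"):
    print(action, handle_unsafe(action))
```

Seen this way, `Log()` is the only action that lets an unsafe result through without any human decision, which is why it is reserved for development and testing.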

### `savable`

The `savable` parameter is where the data owner specifies whether the current (*this*) RemoteDataFrame can be saved and allowed to remain on the server even after a server restart. 

If set to `True`, any user can save this RemoteDataFrame and any RemoteDataFrames derived from it.

If set to `False`, neither this RemoteDataFrame nor any RemoteDataFrames resulting from it can be saved.

In [8]:
# this dataframe can be saved
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Reject(), savable=True)

## Set up a privacy policy
___________________________________

Now that we know how all the rules work, let's play with an example and see how to implement it when uploading our dataset. We'll use the [Titanic dataset](https://www.kaggle.com/competitions/titanic/data), which contains information relating to the passengers aboard the Titanic. We can download it by running the code block below: 

In [None]:
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

We'll require a minimum of 10 aggregated rows for any query and reject the ones that don't follow this rule:

In [9]:
import polars as pl

# we open the connection to BastionLab server
connection = Connection("localhost")

# we create a dataframe with the dataset
df = pl.read_csv("titanic.csv")

policy = Policy(
    safe_zone=Aggregation(10), unsafe_handling=Reject(), savable=False
)  # we define the policy

# we upload our dataset AND the policy rules which returns a RemoteDataFrame instance
rdf = connection.client.polars.send_df(df, policy=policy)

To test that it works, let's run a safe query that aggregates at least 10 rows:

In [10]:
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
    .fetch()
)

print(per_class_rates)

shape: (3, 2)
┌────────┬──────────┐
│ Pclass ┆ Survived │
│ ---    ┆ ---      │
│ i64    ┆ f64      │
╞════════╪══════════╡
│ 1      ┆ 0.62963  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 0.472826 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ 0.242363 │
└────────┴──────────┘


Let's now try an unsafe query that doesn't aggregate the minimum number of rows:

In [11]:
unsafe_df = (
    rdf.select([pl.col("Age"), pl.col("Survived")])
    .groupby(pl.col("Age"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
    .fetch()
)

print(unsafe_df)

The query has been rejected by the data owner.
None
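As the output above shows, the rejected query printed `None` instead of a dataframe. Client code may therefore want to check for that case before using the result. Here is a minimal sketch of such a guard. `describe_result` is a hypothetical helper, and the list passed in the second call merely stands in for a fetched dataframe.

```python
def describe_result(result):
    """Summarize what a fetch returned, treating None as a rejected query."""
    if result is None:
        return "rejected: no dataframe was returned"
    return f"received a {type(result).__name__}"


print(describe_result(None))    # rejected: no dataframe was returned
print(describe_result([1, 2]))  # received a list
```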


### Sanitization of columns

We're now handling many options automatically, but what about columns which are ***never*** safe to expose? For example, a column of names... We want to make sure those are removed from the dataframe when the dataframe is fetched. 

We can do this by using the `sanitized_columns` parameter in the `send_df()` call:

In [12]:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Log(), savable=False)

# We add a step in the send_df() call:
rdf = connection.client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf.head(5).collect().fetch()

Reason: Cannot fetch a result DataFrame that does not aggregate at least 10 rows of DataFrame a3f0a488-7da6-4874-9ba0-933aad1f41a9.

This incident will be reported to the data owner.


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,,"""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,,"""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,,"""female""",26.0,0,0,"""STON/O2. 31012...",7.925,,"""S"""
4,1,1,,"""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,,"""male""",35.0,0,0,"""373450""",8.05,,"""S"""


When we print the first five rows of the dataset, the `'Name'` column has been replaced by `null` values!
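The effect of sanitization on fetched rows can be mimicked in plain Python. This sketch only reproduces the visible result (the listed columns come back empty). It says nothing about how the server implements it. `sanitize` is a hypothetical helper and the rows are made-up placeholders.

```python
def sanitize(rows, sanitized_columns):
    """Return copies of `rows` with the listed columns replaced by None."""
    hidden = set(sanitized_columns)
    return [
        {col: (None if col in hidden else value) for col, value in row.items()}
        for row in rows
    ]


# Placeholder rows, not real passenger data.
rows = [
    {"PassengerId": 1, "Name": "Passenger A", "Sex": "male"},
    {"PassengerId": 2, "Name": "Passenger B", "Sex": "female"},
]

clean = sanitize(rows, ["Name"])
print(clean[0])  # {'PassengerId': 1, 'Name': None, 'Sex': 'male'}
```

Note that the column itself is kept, so the shape of the dataframe does not change. Only its values are hidden.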

We have successfully set up a privacy policy. Now let's terminate the connection and stop the server:

In [13]:
connection.close()
bastionlab_server.stop(srv)