# Quick Tour of BastionLab

## Why BastionLab?
Data owners often need, or want, remote data scientists to work on their datasets. A hospital, for instance, might want to extract value from its data by sharing it with external parties such as startups and labs, or to get help from external experts. 
The problem is that the most popular solution is to give those users access to a Jupyter Python notebook installed on the data owner's infrastructure. 


![](https://github.com/mithril-security/bastionlab/blob/master/docs/assets/current_solution.png?raw=true)

This is dangerous because it exposes the dataset to serious data leaks. Jupyter was not designed for this task, and exfiltrating data through it is easy.
That is why we built BastionLab, a data science framework for remote and secure Exploratory Data Analysis. Data scientists can remotely run queries on data frames without seeing the original data or intermediary results, according to the strict privacy policies defined by the data owner.


![](https://github.com/mithril-security/bastionlab/blob/master/docs/assets/proposed_solution.png?raw=true)

BastionLab features include:
- Showing only aggregated results, with a minimum sample size, to preserve the anonymity of each individual.
- When individual rows have to be displayed, showing only a minimal amount of information, and recording and tracking all shared data.

Differential Privacy will be integrated transparently in the future.

Technically, the framework uses the lazy API of polars (a DataFrame library written in Rust, comparable to pandas) to construct queries locally. Once built, the queries are sent to the remote BastionLab server and executed if they pass the privacy policy rules defined by the data owner. BastionLab supports most data wrangling operations, such as selects, groupbys and joins.

## Tutorial’s Introduction

In the following notebook tutorial, we will show you how to install BastionLab and use a few basic functionalities. We’ll use a mock example in which the data owner puts a Titanic passengers dataset at the disposal of the data scientist. 

The “Titanic - Machine Learning from Disaster” dataset can be found on Kaggle and downloaded with a free user account: https://www.kaggle.com/competitions/titanic/data


This notebook is divided into three parts:
- Installation of BastionLab Client and Server
- The Data Owner's Side
- The Data Scientist's Side


By the end, the data scientist will be able to do Exploratory Data Analysis remotely, under the constraints defined by the data owner.


### Technical Requirements

To start this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [pip](https://pypi.org/project/pip/), the Python package installer
- [Docker](https://www.docker.com/) 

*Here's a [Docker tutorial](https://docker-curriculum.com/) to set it up on your computer.*


## Installing BastionLab Client

In [None]:
!pip install bastionlab

## Installing BastionLab Server

### Using the official Docker image

In [None]:
!docker pull mithrilsecuritysas/bastionlab:latest

*To install the BastionLab Client or Server locally, refer to our more detailed [Installation Tutorial](docs/).*

## Setting up the keys

BastionLab only accepts requests from authenticated users. Authentication relies on asymmetric cryptography: the data owner provides a list of authorized public keys to the server at startup, and every user must provide the corresponding private key to the client when connecting to the server. The client then transparently creates a session for the user.
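
To make the allow-list idea concrete, here is a minimal sketch in plain Python of how a server could decide whether a presented public key is authorized. It is not BastionLab's implementation: the function names `load_authorized_keys` and `is_authorized` are hypothetical, keys are compared as plain text, and the signature check that proves the user actually holds the matching private key is deliberately left out.

```python
from pathlib import Path
import tempfile


def load_authorized_keys(directory):
    """Read every *.pem file in `directory` and return the set of key texts."""
    return {p.read_text().strip() for p in Path(directory).glob("*.pem")}


def is_authorized(presented_key, authorized_keys):
    """Return True if the presented public key is in the allow-list."""
    return presented_key.strip() in authorized_keys


# Demo with placeholder strings standing in for real PEM-encoded public keys.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "data_owner.pem").write_text("PLACEHOLDER-OWNER-KEY\n")
    Path(tmp, "data_scientist.pem").write_text("PLACEHOLDER-SCIENTIST-KEY\n")
    allowed = load_authorized_keys(tmp)

owner_ok = is_authorized("PLACEHOLDER-OWNER-KEY", allowed)
stranger_ok = is_authorized("PLACEHOLDER-UNKNOWN-KEY", allowed)
print(owner_ok, stranger_ok)
```

A key that is not in the directory is rejected, which mirrors why the data scientist's public key has to be placed next to the data owner's before the server starts.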

BastionLab provides a utility module to manage the keys. We will use it to create the public and private keys for a single user.

First, create two directories to store the keys:

In [None]:
!mkdir pubkeys
!mkdir privkeys
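
`mkdir` reports an error if a directory already exists, which happens when the notebook is re-run. As an alternative, the following minimal sketch uses Python's standard `pathlib` module to create the same two directories idempotently. The helper name `make_key_dirs` is made up for this example.

```python
from pathlib import Path


def make_key_dirs(base="."):
    """Create the pubkeys/ and privkeys/ directories under `base` if missing."""
    created = []
    for name in ("pubkeys", "privkeys"):
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


key_dirs = make_key_dirs()  # creates ./pubkeys and ./privkeys
```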

Then create the public and private keys using the BastionLab library. This may be done in the interpreter:

In [10]:
from bastionlab import SigningKey

# We create the Data Owner's private and public keys.
data_owner_signing_key = SigningKey.from_pem_or_generate("privkeys/data_owner.key.pem")
data_owner_public_key = data_owner_signing_key.pubkey.save_pem("pubkeys/data_owner.pem")

# To authenticate the data scientist(s), the data owner has to start an instance
# of the BastionLab server with all the *allowed* public keys: in this tutorial,
# the data owner's public key and the data scientist's public key.
#
# For the purpose of this tutorial, the data scientist's public and private keys
# are created right below.

data_scientist_signing_key = SigningKey.from_pem_or_generate("privkeys/data_scientist.key.pem")
data_scientist_public_key = data_scientist_signing_key.pubkey.save_pem("pubkeys/data_scientist.pem")

Now, we start the instance of the BastionLab server by running the following docker command while binding the `pubkeys` directory created above to the server. 

## Running BastionLab Server

### Using the official Docker image

In [None]:
!docker run -p 50056:50056 --mount type=bind,source=$(pwd)/pubkeys,target=/app/bin/keys -d mithrilsecuritysas/bastionlab:latest

## Data Owner's Side
In this part of the notebook, the data owner sets up the BastionLab server in their infrastructure and shares a dataset with a privacy policy in place.
It is divided into two steps: 
- Uploading the dataset to BastionLab 
- Choosing a privacy policy regarding data exposure to the data scientist


### Upload the data frame to the BastionLab Server
Once you have downloaded the Titanic dataset, you can use polars to load it as a DataFrame:


In [None]:
import polars as pl

df = pl.read_csv("titanic_train.csv")

To upload the DataFrame to BastionLab, first open a connection to the server by providing its hostname and port, along with the data owner's signing key. 


In [None]:
from bastionlab import Connection

connection = Connection('localhost', 50056, signing_key=data_owner_signing_key)

Using the BastionLab client, you can now upload your data to the server in a secure and private fashion.

In [None]:
connection.client.send_df(
    df,
    policy={"aggregation": "accept", "rows_default_behavior": "approval"},
    k=20,
    protected_columns=["Sex"],
)

### Privacy Policy Options
BastionLab offers many options to fine-tune your privacy policy.

You can choose one of three actions for each type of query: `accept`, `reject`, and `approval`.

In the previous example, the data owner accepts aggregation queries automatically but requires approval for all other types of queries.

Change the parameter `k` to set the minimum number of rows per group in an aggregation. If a query does not have at least this minimum number of rows per group, it is *not* considered an aggregation, and the request is denied. This prevents queries that would isolate a specific individual and thereby make them easier to identify (a process known as deanonymization in data privacy).
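
The effect of `k` can be illustrated with a small plain-Python sketch. It only approximates the idea described above and is not BastionLab's actual policy engine: the function `min_group_size_ok` and the sample rows are invented for this example.

```python
from collections import Counter


def min_group_size_ok(rows, group_key, k):
    """Return True if every group defined by `group_key` has at least k rows."""
    sizes = Counter(row[group_key] for row in rows)
    return all(size >= k for size in sizes.values())


# Invented sample: 3 passengers in class 1, 1 passenger in class 2.
rows = [
    {"Pclass": 1, "Survived": 1},
    {"Pclass": 1, "Survived": 0},
    {"Pclass": 1, "Survived": 1},
    {"Pclass": 2, "Survived": 0},
]

allowed_k1 = min_group_size_ok(rows, "Pclass", k=1)
allowed_k2 = min_group_size_ok(rows, "Pclass", k=2)
print(allowed_k1, allowed_k2)  # True False
```

With `k=2`, the single passenger in class 2 would be isolated by the group-by, so under the rule above the query would no longer count as an aggregation.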

You can also protect a list of sensitive columns with `protected_columns`, which prevents users from displaying them except in aggregation contexts.
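
As a rough illustration of column protection, the sketch below lists the protected columns that a query would display outside an aggregation. It is a simplified model written for this tutorial, not BastionLab code: `exposed_protected_columns` and its `is_aggregation` flag are assumptions.

```python
def exposed_protected_columns(output_columns, protected_columns, is_aggregation):
    """List protected columns that a query would display outside an aggregation."""
    if is_aggregation:
        return []
    return [c for c in output_columns if c in protected_columns]


protected = ["Sex"]

# Displaying raw rows that include "Sex" exposes a protected column.
raw_rows = exposed_protected_columns(["Name", "Sex", "Age"], protected, is_aggregation=False)
# Using "Sex" inside an aggregation does not.
aggregated = exposed_protected_columns(["Sex", "Survived"], protected, is_aggregation=True)
print(raw_rows, aggregated)  # ['Sex'] []
```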


### Reference Code

In [None]:
import polars as pl
from bastionlab import Connection

df = pl.read_csv("titanic_train.csv")

with Connection("localhost", 50056, signing_key=data_owner_signing_key) as client:
    client.send_df(
        df,
        policy={"aggregation": "accept", "default": "approval"},
        k=20,
        protected_columns=["Sex"],
    )

## Data Scientist’s Side
In this part, we’ll show how the data scientist can access the Data Owner’s dataset, run queries, fetch the results, and display them.

This part is divided into four steps:
- Accessing the data owner’s dataset
- Running queries
- Fetching the results
- Using data visualization functions

### Access the Data Owner’s Dataset 

You'll encounter two core objects to access the dataset’s DataFrame in BastionLab: the RemoteLazyFrame and the FetchableLazyFrame.

First, you’ll need a RemoteLazyFrame, which is a reference to the DataFrame uploaded by the data owner, along with some metadata such as the names and types of the columns. 

This reference allows you to remotely run queries on the DataFrame without the need to download it and without the ability to see the initial data or intermediary results. 

As we do not know the unique identifier of the DataFrame uploaded by the data owner, we start by asking the server to list all available DataFrames.



In [None]:
from bastionlab import Connection

# Authenticate as the data scientist with the signing key created earlier
connection = Connection("localhost", 50056, signing_key=data_scientist_signing_key)

client = connection.client

all_rdfs = client.list_dfs()

rdf = all_rdfs[0]

The server returns a list of FetchableLazyFrames, a specific kind of RemoteLazyFrame, that we can inspect. In our case, we can simply take the first one, as the data owner has only uploaded one DataFrame so far.

### Running Queries

Now that you have a RemoteLazyFrame corresponding to the data owner’s DataFrame, it is time to run some queries on it.

To define these queries, you can directly use all the methods provided by polars’ lazy API. Here, the adjective lazy means that no computation will be run unless explicitly needed. This allows the data scientist to build queries with a Pythonic approach from the RemoteLazyFrame, and when an operation needs to be executed on the data, the query is serialized and sent to the server. 

Execution is triggered by the **collect() method**, which runs all the recorded operations on the server.


In [None]:
rdf1 = rdf.head(5)
rdf2 = rdf1.collect()

In this example, the first line returns a new RemoteLazyFrame that records the head operation; nothing happens on the server. In the second line, however, the call to collect() sends a query to the server instructing it to perform the head operation, and the server runs it right away.


The key point to understand is that **every call to collect() creates a new DataFrame on the server side that contains the result**.

**On the client side, collect() returns a new FetchableLazyFrame that references the result on the server.**
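
The lazy pattern can be mimicked in a few lines of plain Python. The sketch below is a toy model, not BastionLab or polars code: `ToyLazyFrame` and `ToyServer` are invented names, and the "server" is just a local object. It shows that `head()` only records an operation, while `collect()` executes the recorded plan and stores a new result on the server.

```python
class ToyServer:
    """Stands in for the remote server: stores frames under integer ids."""

    def __init__(self):
        self.frames = {}

    def store(self, rows):
        frame_id = len(self.frames)
        self.frames[frame_id] = rows
        return frame_id

    def run(self, frame_id, plan):
        rows = self.frames[frame_id]
        for op, arg in plan:
            if op == "head":
                rows = rows[:arg]
        return self.store(rows)  # each run creates a new frame on the server


class ToyLazyFrame:
    """Holds a reference to a server-side frame plus a list of pending operations."""

    def __init__(self, server, frame_id, plan=()):
        self.server = server
        self.frame_id = frame_id
        self.plan = tuple(plan)

    def head(self, n):
        # Nothing is executed here: the operation is only recorded.
        return ToyLazyFrame(self.server, self.frame_id, self.plan + (("head", n),))

    def collect(self):
        # The recorded plan is sent to the server and executed there.
        new_id = self.server.run(self.frame_id, self.plan)
        return ToyLazyFrame(self.server, new_id)


server = ToyServer()
rdf = ToyLazyFrame(server, server.store(list(range(100))))

rdf1 = rdf.head(5)
frames_before = len(server.frames)  # still 1: head() did not touch the server
rdf2 = rdf1.collect()
frames_after = len(server.frames)   # 2: collect() created a result frame
```

In this toy model, `rdf2` plays the role of the reference returned by collect(): it points at the new server-side result and carries no pending operations.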

### Fetching Results

At some point in your process, you will need to download the results to use them locally or to display them. This can be achieved with the fetch method.

The fetch method is defined on the FetchableLazyFrame class which extends the RemoteLazyFrame class. 

Recall that we’ve already seen the two ways of getting FetchableLazyFrames: by listing available DataFrames on the server, and by calling collect() on any RemoteLazyFrame. 
In practice, this means that fetch() may only be called on references to DataFrames already available on the server or after a call to collect(). As no computation has run before you call collect(), it wouldn’t actually make sense to fetch() the result because it does not yet exist! 


In [None]:
rdf.head(5).collect().fetch()

In addition, fetch() downloads the result DataFrame after performing some checks on how it was obtained and what the data owner authorizes you to do in their policy. These checks allow BastionLab to uphold a decent level of privacy without too big an impact on your workflow. If you need more guarantees, we plan to support differential privacy in the future as an optional feature.

In our case, the data owner has set up a policy which allows downloading aggregated DataFrames right away but requires approval for all others. This means that, for example, we cannot directly print out rows from the original DataFrame because it wouldn’t count as an aggregation. Instead, the server requires the data owner’s approval first. This explains the message printed in the terminal when we try to fetch() the result of the head operation. 

Here, the data owner has agreed to disclose the data, which allows us to download and display it. 


Let’s now try a more involved query: we compute the survival rates of the passengers on the Titanic based on their ticket class.

In [None]:
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
    .fetch()
)
per_class_rates

Once again, we must use:
- collect() to run the computation on the server
- fetch() to retrieve the result locally 

In this case, no message appears because the query involves an aggregation step.
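
To see what this query computes, here is the same aggregation written in plain Python on a handful of invented rows (not the real Titanic data). The function name `survival_rate_per_class` is made up for this sketch, and everything runs locally rather than on a BastionLab server.

```python
from collections import defaultdict


def survival_rate_per_class(rows):
    """Group rows by Pclass, average Survived, and sort by rate, highest first."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Pclass"]].append(row["Survived"])
    rates = {pclass: sum(vals) / len(vals) for pclass, vals in groups.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Invented rows for illustration only.
rows = [
    {"Pclass": 1, "Survived": 1},
    {"Pclass": 1, "Survived": 1},
    {"Pclass": 2, "Survived": 1},
    {"Pclass": 2, "Survived": 0},
    {"Pclass": 3, "Survived": 0},
    {"Pclass": 3, "Survived": 0},
]

rates = survival_rate_per_class(rows)
print(rates)  # [(1, 1.0), (2, 0.5), (3, 0.0)]
```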

### Data Visualization Functions

The data scientist can also use plotting functions to visualize data while still upholding data privacy. For example, the data is aggregated by default, and although the bin size can be changed when calling the functions, BastionLab checks that the chosen value provides sufficient anonymity.
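
As a rough illustration of why bin size matters, the sketch below groups ages into fixed-width bins and reports the bins that contain fewer than a minimum number of rows. It is plain Python written for this tutorial, not the check BastionLab performs: `small_bins`, the threshold and the sample ages are all assumptions.

```python
from collections import Counter


def small_bins(values, bin_width, min_count):
    """Return the lower bounds of bins that hold fewer than `min_count` values."""
    counts = Counter((int(v) // bin_width) * bin_width for v in values)
    return sorted(start for start, n in counts.items() if n < min_count)


# Invented ages for illustration only.
ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20]

too_small_10 = small_bins(ages, bin_width=10, min_count=2)
too_small_30 = small_bins(ages, bin_width=30, min_count=2)
print(too_small_10, too_small_30)  # [10] []
```

Narrow bins leave a single person alone in the 10 to 19 range, while wider bins put every age in a group of several people.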

#### Barplot

You can generate a barplot visualization of the data showing the number of survivors per age category.


In [None]:
rdf.barplot(col_x="Age", col_y="Survived", bins=10, palette="bright")

![](https://github.com/mithril-security/bastionlab/blob/master/docs/assets/barplot.png?raw=true)

#### Scatterplot

You can generate a scatterplot visualization of this relationship.


In [None]:
rdf.scatterplot(col_x="Age", col_y="Survived", bins=2, color="orange")

![](https://github.com/mithril-security/bastionlab/blob/master/docs/assets/scatterplot.png?raw=true)

#### Curveplot

You can generate a curveplot to create a regression best-fit curve visualization of this relationship.

In [None]:
rdf.curveplot(col_x="Age", col_y="Survived", bins=10)

![](https://github.com/mithril-security/bastionlab/blob/master/docs/assets/curveplot.png?raw=true)


For more information on these functions, check out our data visualization tutorial [here](../tutorials/visualization.ipynb).