# BastionLab Analytics Demo

BastionLab lets data scientists run queries on remote data frames without seeing the original data or any intermediate results.
The user combines polars' lazy API with BastionLab's objects to define a *Composite Plan*: an abstract object that represents all the instructions to be run lazily on the server. BastionLab's *Composite Plan* supports most polars operations, from selects to groupbys to joins.
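
The idea of a lazily built plan can be illustrated without BastionLab or polars. The sketch below is a toy under stated assumptions: `LazyPlan` and its methods are hypothetical names invented for this illustration, not BastionLab's API. It only shows that building a plan records steps, and that nothing runs until `collect` is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Step = Tuple[str, Callable[[List[Row]], List[Row]]]


@dataclass(frozen=True)
class LazyPlan:
    """Hypothetical stand-in for a composite plan: an immutable sequence of steps."""

    steps: Tuple[Step, ...] = ()

    def select(self, columns: List[str]) -> "LazyPlan":
        # Record a projection step; no data is touched here.
        def step(rows: List[Row]) -> List[Row]:
            return [{c: r[c] for c in columns} for r in rows]

        return LazyPlan(self.steps + (("select", step),))

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyPlan":
        # Record a filtering step; again, nothing is evaluated yet.
        def step(rows: List[Row]) -> List[Row]:
            return [r for r in rows if predicate(r)]

        return LazyPlan(self.steps + (("filter", step),))

    def describe(self) -> List[str]:
        # The plan can be inspected without running it.
        return [name for name, _ in self.steps]

    def collect(self, rows: List[Row]) -> List[Row]:
        # Only now are the recorded steps applied, in order.
        for _, step in self.steps:
            rows = step(rows)
        return rows


# Toy rows, made up for the illustration (not taken from the Titanic dataset).
toy_rows = [
    {"Pclass": 1, "Survived": 1, "Age": 30},
    {"Pclass": 3, "Survived": 0, "Age": 20},
]

plan = LazyPlan().filter(lambda r: r["Survived"] == 1).select(["Pclass"])
print(plan.describe())
print(plan.collect(toy_rows))
```

In BastionLab the plan is sent to the server and executed there; in this toy the "server side" is simply the `collect` call on local rows.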

We intentionally forbid map operations with user-defined functions to avoid remote code execution. As a replacement, we provide the `apply_udf` method on our `RemoteLazyFrame` type, which lets the user apply a custom TorchScript-compatible function to a subset of series. Internally, the provided plain Python function is compiled into TorchScript and applied on the server to the matching series, which are cast to libtorch tensors.

## Data Loading

We load the CSV dataset as a plain polars data frame, open a connection to the server, and upload the data frame.
The server sends back some metadata (a reference and the schema of the data frame), which is wrapped in a `FetchableLazyFrame`.
`FetchableLazyFrame` is a subtype of `RemoteLazyFrame` that can be used not only to define the computation graph through polars' lazy API but also to retrieve the data (we'll add a mechanism to enforce security and privacy rules in the future).

In order to test the implementation, we use `SigningKey` and `PublicKey` to create three keys:
- Data owner
- Data scientist 1
- Data scientist 2

NB: The difference between Data Scientists 1 and 2 is that the public key of Data Scientist 1 is shared with the data owner, who starts the server. The BastionLab server needs the keys of both the data owner and the data scientist in order to start.

In [None]:
from bastionlab import SigningKey, PublicKey

data_owner_key = SigningKey.from_pem_or_generate("./data_owner.key.pem")
data_scientist_key = SigningKey.from_pem_or_generate("./data_scientist.key.pem")

data_scientist_pubkey = data_scientist_key.pubkey.save_pem("./data_scientist.pem")
data_owner_pubkey = data_owner_key.pubkey.save_pem("./data_owner.pem")

fake_data_scientist = SigningKey.from_pem_or_generate("./fake_scientist.key.pem")
fake_scientist_pubkey = fake_data_scientist.pubkey.save_pem("./fake_scientist.pem")
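
The note above says the server is started with the keys of the data owner and Data Scientist 1, but not Data Scientist 2. One way to picture this is as an allow-list lookup. The sketch below is a toy under stated assumptions: `ToyServer`, `fingerprint`, and the placeholder byte strings are all hypothetical, and it models only the allow-list check. A real server would also verify that the client holds the matching private key, which this sketch does not attempt.

```python
import hashlib
from typing import Iterable


def fingerprint(public_key_pem: bytes) -> str:
    """Hash the public key bytes so keys can be compared by a short identifier."""
    return hashlib.sha256(public_key_pem).hexdigest()


class ToyServer:
    """Hypothetical server that only knows the keys it was started with."""

    def __init__(self, authorized_pems: Iterable[bytes]) -> None:
        self._authorized = {fingerprint(pem) for pem in authorized_pems}

    def accepts(self, public_key_pem: bytes) -> bool:
        # Allow-list lookup only; no signature verification is modeled here.
        return fingerprint(public_key_pem) in self._authorized


# Placeholder bytes standing in for the contents of the three .pem files.
owner_pem = b"placeholder: data_owner.pem"
scientist_pem = b"placeholder: data_scientist.pem"
fake_scientist_pem = b"placeholder: fake_scientist.pem"

# The server is started with the owner's and Data Scientist 1's keys only.
server = ToyServer([owner_pem, scientist_pem])

print(server.accepts(scientist_pem))
print(server.accepts(fake_scientist_pem))
```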


In [None]:
import polars as pl
from bastionlab import Connection

df = pl.read_csv("titanic_train.csv").limit(50)

connection = Connection("localhost", 50056, license_key=data_owner_key)
client = connection.client

rdf = client.send_df(df)

original_rdf = rdf


The `RemoteLazyFrame`s (and their subtypes) can be used to query the metadata without performing any actual request to the server.

In [None]:
rdf.columns


In [None]:
connection.close()


To create separate viewpoints, we close the connection initially created by the Data Owner.
We then use Data Scientist 1's signing key to establish a connection with the BastionLab server.

The user may also invoke polars' lazy API to construct query plans.

In the following cell, we compute the survival rates of the passengers of the Titanic with respect to their ticket class.

Note that issuing such aggregated queries without seeing the original dataset or intermediate results already provides some degree of privacy. We plan to add a budgeting mechanism, such as Differential Privacy, to let the data owner control how their data can be used.

After closing the Data Owner's connection, we lose access to the remote data frame `rdf`, but we can use `list_dfs` to fetch it back.

The cell below lists all the remote data frames available on the BastionLab server and selects the first one as our `rdf`.

In [None]:
connection = Connection("localhost", 50056, license_key=data_scientist_key)
client = connection.client

rdfs = client.list_dfs()

rdf = rdfs[0]


In [None]:
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
    .fetch()
)

per_class_rates
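
To make explicit what the query above computes, here is a plain-Python sketch of the same group-by, mean, and descending sort. The helper `mean_by` is a hypothetical name, and the rows are toy values made up for the illustration; they are not taken from the Titanic dataset and say nothing about the real survival rates.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mean_by(rows: List[Dict[str, int]], key: str, value: str) -> List[Tuple[int, float]]:
    """Group rows by `key`, average `value` per group, sort by the average (descending)."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum, count]
    for row in rows:
        acc = totals[row[key]]
        acc[0] += row[value]
        acc[1] += 1
    rates = [(group, total / count) for group, (total, count) in totals.items()]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


# Toy rows, invented for this sketch only.
toy_rows = [
    {"Pclass": 1, "Survived": 1},
    {"Pclass": 1, "Survived": 1},
    {"Pclass": 2, "Survived": 1},
    {"Pclass": 2, "Survived": 0},
    {"Pclass": 3, "Survived": 0},
    {"Pclass": 3, "Survived": 0},
    {"Pclass": 3, "Survived": 1},
    {"Pclass": 3, "Survived": 0},
]

toy_rates = mean_by(toy_rows, "Pclass", "Survived")
print(toy_rates)  # [(1, 1.0), (2, 0.5), (3, 0.25)]
```

Because `Survived` is 0 or 1, its mean per group is the survival rate of that group, which is why the remote query uses `.mean()`.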


To show that authentication works, we first close the connection created with Data Scientist 1's signing key and then create another with Data Scientist 2's signing key.

Refer to [Data Loading](#data-loading) to see the difference between both data scientists.

In [None]:
# Closing Data Scientist 1's connection.
connection.close()


Again, we fetch the list of data frames and select the first one as our `rdf`, since we lost access to it when the connection was closed.

In [None]:
# Setting up connection for Data Scientist 2
connection = Connection("localhost", 50056, license_key=fake_data_scientist)
client = connection.client

rdfs = client.list_dfs()

rdf = rdfs[0]


In [None]:
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
    .fetch()
)

per_class_rates

connection.close()


To continue with the rest of the demo, we set up the connection once more with Data Scientist 1 (the authorized data scientist) and set up `rdf` again.

This time, instead of selecting from the list as before, we retrieve `rdf` by passing the identifier of the remote data frame to `get_df`.

In [9]:
connection = Connection("localhost", 50056, license_key=data_scientist_key)
client = connection.client

rdf = client.get_df(original_rdf.identifier)


GRPCException: Remote resource not found: code=StatusCode.NOT_FOUND message=Could not find dataframe: identifier=50fc551e-ee48-40b1-81f6-91477ff2265d

We can do the same with respect to the sex of the passengers.

In [None]:
per_sex_rates = (
    rdf.select([pl.col("Sex"), pl.col("Survived")])
    .groupby(pl.col("Sex"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
    .fetch()
)

per_sex_rates


Aggregated queries may be used locally to make plots.

In [None]:
import seaborn as sns
import torch
from bastionlab import RemoteLazyFrame


def bar_plot(rdf: RemoteLazyFrame, col_x: str, col_y: str) -> None:
    def f(x):
        bins = 10 * torch.ones_like(x)
        return x // bins * bins

    data = (
        rdf.select([pl.col(col_x), pl.col(col_y)])
        .apply_udf([col_x], f)
        .groupby(pl.col(col_x))
        .agg(pl.col(col_y).count())
        .sort(col_x)
        .collect()
        .fetch()
    )
    sns.barplot(x=col_x, y=col_y, data=data.to_pandas())


# Drop rows with a missing age before binning.
rdf = rdf.filter(pl.col("Age").is_not_null())
bp = bar_plot(rdf, "Age", "Survived")
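
The function `f` passed to `apply_udf` above rounds each value down to a multiple of 10, so ages fall into decade-wide bins before the group-by. The sketch below reproduces that arithmetic on a plain Python list rather than a tensor. The helper name `bin_to_width` and the sample ages are made up for the illustration, and the equivalence with the tensor version is assumed for non-negative values only.

```python
from collections import Counter
from typing import List


def bin_to_width(values: List[float], width: float = 10) -> List[float]:
    """Round each value down to the nearest multiple of `width`."""
    return [v // width * width for v in values]


# Sample ages invented for this sketch.
ages = [22.0, 38.0, 4.0, 71.5, 29.0]

binned = bin_to_width(ages)
print(binned)  # [20.0, 30.0, 0.0, 70.0, 20.0]

# Counting rows per bin mirrors the groupby(...).agg(count) step of the plot.
counts = Counter(binned)
print(sorted(counts.items()))  # [(0.0, 1), (20.0, 2), (30.0, 1), (70.0, 1)]
```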


Finally, we close the connection to the server.

In [None]:
connection.close()
