<div id="colab_button">
  <h1>Saving dataframes</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/saving_dataframes.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

________________________

In this tutorial, we will save a dataframe and then restart the server to test whether the dataframe remains available on the server.

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Download [the dataset](https://www.kaggle.com/competitions/titanic) we will be using in this tutorial.

We'll do so by running the code block below. 

>If you are running this notebook on your machine instead of [Google Colab](https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.6/docs/docs/tutorials/data_cleaning.ipynb), you can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [28]:
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# download the Titanic dataset
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

Our dataset is based on the Titanic dataset, one of the most popular datasets used for understanding machine learning which contains information relating to the passengers aboard the Titanic.

### Launch and connect to the server

In [None]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

It's important to note that in a typical workflow, the data owner would send a set of keys to the server, so that authorization can be required for all users at the point of connection. **BastionLab offers the authorization feature**, but as it's not the focus of this visualization tutorial, we will not use it. You can refer to the [authentication tutorial](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/authentication/) if you want to set it up.

In [1]:
# connecting to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

### Upload the dataframe to the server

We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. It will allows us to demonstrate features without having to approve any data access requests. You can check out how to define a safe privacy policy [here](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/).

In [2]:
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("modified_titanic_data.csv")

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

FetchableLazyFrame(identifier=43eabca3-e2e9-4600-b0f2-fb09e3422548)

<div class="admonition warning">
    <p class="admonition-title">Important!</p>
    <p class=""><b>This policy is not suitable for production.</b> Please note that we <i>only</i> use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial. <br></p>
</div>
<br>

We'll check that we're properly connected and that we have the authorizations by running a simple query:

In [3]:
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
)

## Saving the data frame
_______________________________________

Now we save the resulting dataframe from our previous operation.

In [4]:
per_class_rates.save()
saved_identifier = per_class_rates.identifier
print(saved_identifier)

4c6edbf1-acd7-49ae-82fe-95eab559536e


Let us also fetch the rdf so we can compare it to the reloaded dataframe later.

In [5]:
per_class_rates.fetch()

Pclass,Survived
i64,f64
1,0.633028
2,0.475676
3,0.24187


We will now restart the server and check which dataframes persist in the server.

## Testing persistence of data frames

<b>Terminate the running bastionlab server, then restart it. </b>

>If you are not running this Notebook in Colab or if you do not use the pip packaged server, you can kill the server by issuing Ctrl+C in your terminal and then launch it again using the same command you used to start it.

In [None]:
bastionlab_server.stop(srv)
srv = bastionlab_server.start()

Reconnect to the server and list the available dataframes.

In [6]:
connection = Connection("localhost")
client = connection.client

client.polars.list_dfs()

[FetchableLazyFrame(identifier=4c6edbf1-acd7-49ae-82fe-95eab559536e)]

As you can see, the saved dataframe persists on the server.

You can print the dataframe to be sure it's the same one you saved.

In [7]:
retrieved_rdf = client.polars.get_df(saved_identifier)
retrieved_rdf.fetch()

Pclass,Survived
i64,f64
1,0.633028
2,0.475676
3,0.24187


Finally, close the connection.

In [8]:
connection.close()
bastionlab_server.stop(srv)