# Saving dataframes
__________________________________

<a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.6/docs/docs/tutorials/saving_dataframes.ipynb">
  <p align="center"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></p>
</a>

In this tutorial, we will save a dataframe and then restart the server to test whether the dataframe remains available on the server.

## Pre-requisites
___________________________________________

### Technical requirements

To start this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [pip](https://pypi.org/project/pip/), the Python package manager

### Pip packages

To run this notebook, you will also need to install the polars, bastionlab and bastionlab_server packages by running the following code block.

In [28]:
!pip install polars
!pip install bastionlab
!pip install bastionlab_server

>*Note that the bastionlab_server package we install here was created for testing purposes. You can alternatively install the BastionLab server using our Docker image or from source. Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

### Dataset

Next, we download the dataset used in this tutorial. It is based on the [Titanic dataset](https://www.kaggle.com/competitions/titanic/data), a popular introductory machine learning dataset that contains information about the passengers aboard the Titanic. This version has been modified by data analyst June Tao Ching to contain some values that need cleaning up before we can start running queries.

In [1]:
!wget 'https://raw.githubusercontent.com/chingjunetao/medium-article/master/simple-guide-to-data-cleaning/modified_titanic_data.csv'

--2022-12-16 17:47:13--  https://raw.githubusercontent.com/chingjunetao/medium-article/master/simple-guide-to-data-cleaning/modified_titanic_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80253 (78K) [text/plain]
Saving to: ‘modified_titanic_data.csv’


2022-12-16 17:47:14 (535 KB/s) - ‘modified_titanic_data.csv’ saved [80253/80253]



>*The original, unmodified Titanic dataset is also available with a free user account at https://www.kaggle.com/competitions/titanic/data*

### Launching the server

The next step is to launch the server. The server exposes port `50056` for gRPC communication with clients and uses a default configuration (no authentication, default settings). For the purpose of this tutorial, these settings are sufficient and we won't change them. To launch the server, we use `bastionlab_server`'s `start()` method.

In [None]:
import bastionlab_server

srv = bastionlab_server.start()
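
Before connecting, it can be handy to confirm that something is listening on the server port. The following is a minimal sketch using only Python's standard library; `port_is_open` is a hypothetical helper name, not part of BastionLab, and a successful TCP connection only shows that a process accepts connections on that port, not that it is a working BastionLab server.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 50056 is the port named in this tutorial; adjust it if your setup differs.
    print("server reachable:", port_is_open("localhost", 50056))
```

If the check fails right after `start()`, waiting briefly and retrying may help, since the server may need a moment to begin accepting connections.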

### Uploading our dataset to the server

The final step is to connect to our server and send it our dataset.

First, we read in the dataset using Polars' `read_csv()` function, which returns a Polars DataFrame instance containing the dataset.

Then, we connect to the server by creating an instance of `Connection()` and supplying the constructor with the server's host, here `localhost`.

>*In a typical workflow, the data owner would send a set of keys to the server, so that authorization can be required for all users at the point of connection. **BastionLab offers the authorization feature**, but as it's not the focus of this tutorial, we will not use it. You can refer to the [authentication tutorial](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/authentication/) if you want to set it up.*

In [1]:
from bastionlab import Connection
import polars as pl

df = pl.read_csv("modified_titanic_data.csv")

connection = Connection("localhost")
client = connection.client

Finally, we send the Polars DataFrame instance to the server using BastionLab's `polars.send_df()` method. It returns a `FetchableLazyFrame` instance, a reference to the uploaded DataFrame that we will be working with throughout this tutorial.

For the sake of this tutorial, we specify an unsafe policy which disables all checks. We set the `safe_zone` parameter to `TrueRule()` to allow all requests. Since no request is then treated as unsafe, the `unsafe_handling` parameter can be anything; we set it to `Log()` in the following example.

>***Important note - This unsafe policy is not suited for production.***

In [2]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log())
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

FetchableLazyFrame(identifier=36c04b11-90dc-42a9-9d2c-cf69a6643dfc)

Let's run a simple operation on the dataframe to verify that everything has been set up properly:

In [3]:
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
)
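
To make clear what this query computes, here is a plain-Python sketch of the same steps (group by `Pclass`, average `Survived`, sort by the average in descending order). It runs locally on a few made-up rows, not on the tutorial dataset, and `survival_rate_by_class` is a hypothetical helper name used only for illustration.

```python
from collections import defaultdict

# Made-up sample rows of (Pclass, Survived); NOT taken from the tutorial dataset.
sample_rows = [
    (1, 1), (1, 1), (1, 0),
    (2, 1), (2, 0),
    (3, 0), (3, 0), (3, 1), (3, 0),
]


def survival_rate_by_class(rows):
    """Group by Pclass, average Survived, and sort by that average, highest first."""
    totals = defaultdict(lambda: [0, 0])  # Pclass -> [survivors, passengers]
    for pclass, survived in rows:
        totals[pclass][0] += survived
        totals[pclass][1] += 1
    rates = [(pclass, s / n) for pclass, (s, n) in totals.items()]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


for pclass, rate in survival_rate_by_class(sample_rows):
    print(pclass, round(rate, 3))
```

Because `Survived` holds 0 or 1, its mean per group is the fraction of passengers in that class who survived.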

## Saving the data frame
_______________________________________

Now we save the resulting dataframe from our previous operation.

In [4]:
per_class_rates.save()
saved_identifier = per_class_rates.identifier
print(saved_identifier)

cd36cc93-fbc8-4d6f-b006-d704d11ad082
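
The identifier is what lets us find the dataframe again later, and here it only lives in the `saved_identifier` variable. If the notebook kernel might be restarted too, one option is to write it to disk. The sketch below is an assumption-laden illustration using only the standard library: `store_identifier`, `load_identifier` and the file name are hypothetical, and the UUID check assumes identifiers keep the UUID shape seen in the output above.

```python
import json
import tempfile
import uuid
from pathlib import Path


def store_identifier(identifier: str, path: Path) -> None:
    """Check that the identifier parses as a UUID, then write it to a JSON file."""
    uuid.UUID(identifier)  # raises ValueError if the string is not UUID-shaped
    path.write_text(json.dumps({"saved_identifier": identifier}))


def load_identifier(path: Path) -> str:
    """Read the identifier back from the JSON file written by store_identifier."""
    return json.loads(path.read_text())["saved_identifier"]


# Round trip in a temporary directory; the file name is a hypothetical choice.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "saved_df.json"
    store_identifier("cd36cc93-fbc8-4d6f-b006-d704d11ad082", path)
    restored = load_identifier(path)

print(restored)
```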


Let us also fetch the contents of this dataframe so we can compare them with the reloaded dataframe later.

In [5]:
per_class_rates.fetch()

Pclass,Survived
i64,f64
1,0.633028
2,0.475676
3,0.24187


We will now restart the server and check which dataframes persist on the server.

## Testing persistence of data frames

<b>Terminate the running bastionlab server, then restart it. </b>

In [None]:
bastionlab_server.stop(srv)
srv = bastionlab_server.start()

Reconnect to the server and list the available dataframes.

In [6]:
connection = Connection("localhost")
client = connection.client

client.polars.list_dfs()

[FetchableLazyFrame(identifier=cd36cc93-fbc8-4d6f-b006-d704d11ad082)]

As you can see, the saved dataframe persists on the server.

You can print the dataframe to be sure it's the same one you saved.

In [7]:
retrieved_rdf = client.polars.get_df(saved_identifier)
retrieved_rdf.fetch()

Pclass,Survived
i64,f64
1,0.633028
2,0.475676
3,0.24187
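
Rather than comparing the two printouts by eye, the check can be done programmatically. The following is a minimal sketch that compares two lists of row tuples with a tolerance for floating-point cells; `same_rows` is a hypothetical helper, and the rows are copied from the outputs shown in this tutorial. With Polars DataFrames, such row lists could presumably be obtained with `DataFrame.rows()`, depending on the installed Polars version.

```python
import math


def same_rows(expected, actual, rel_tol=1e-9):
    """Compare two lists of row tuples, using a tolerance for float cells."""
    if len(expected) != len(actual):
        return False
    for row_a, row_b in zip(expected, actual):
        if len(row_a) != len(row_b):
            return False
        for a, b in zip(row_a, row_b):
            if isinstance(a, float) or isinstance(b, float):
                if not math.isclose(a, b, rel_tol=rel_tol):
                    return False
            elif a != b:
                return False
    return True


# Values copied from the two outputs shown in this tutorial.
before_restart = [(1, 0.633028), (2, 0.475676), (3, 0.24187)]
after_restart = [(1, 0.633028), (2, 0.475676), (3, 0.24187)]

print(same_rows(before_restart, after_restart))
```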


Finally, close the connection and stop the server.

In [8]:
connection.close()
bastionlab_server.stop(srv)