# Saving dataframes
__________________________________

In this tutorial, we will save a dataframe and then restart the server to test whether the dataframe remains available on the server.

## Pre-requisites
___________________________________________

### Technical requirements

>If you've done the [Quick tour](https://bastionlab.readthedocs.io/en/latest/docs/quick-tour/quick-tour/), you can skip ahead to the "Pip packages and dataset" section.

Otherwise, we assume you have:
- Python 3.7 or greater *(get the latest version of Python at [https://www.python.org/downloads/](https://www.python.org/downloads/) or with your operating system’s package manager)*
- [Python Pip](https://pypi.org/project/pip/) (PyPI), the package manager
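
If you are unsure which interpreter your notebook is running, the version requirement above can be checked from Python itself. This is a minimal sketch using only the standard library; the helper name `meets_minimum` is hypothetical and not part of BastionLab.

```python
import sys

# Minimum version stated in the requirements above.
MIN_VERSION = (3, 7)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Only print a message when the running interpreter is too old.
if not meets_minimum():
    print("Python 3.7 or greater is required, found", sys.version.split()[0])
```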

### Pip packages and dataset

To run this notebook, you will also need to install the Polars and BastionLab pip packages. You can install both by running the following code block.

In [28]:
!pip install polars
!pip install bastionlab

Next we will download the dataset that we will be using in this tutorial.

In [1]:
!wget 'https://raw.githubusercontent.com/chingjunetao/medium-article/master/simple-guide-to-data-cleaning/modified_titanic_data.csv'

--2022-12-16 17:47:13--  https://raw.githubusercontent.com/chingjunetao/medium-article/master/simple-guide-to-data-cleaning/modified_titanic_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80253 (78K) [text/plain]
Saving to: ‘modified_titanic_data.csv’


2022-12-16 17:47:14 (535 KB/s) - ‘modified_titanic_data.csv’ saved [80253/80253]



### Upload the dataset to the server

The final step is to connect to our server and send our dataset.

First, we read in the dataset using Polars' `read_csv()` function, which returns a Polars DataFrame instance containing the dataset. Then, we connect to the server using BastionLab's `Connection()` class constructor.

In [2]:
from bastionlab import Connection
import polars as pl

df = pl.read_csv("modified_titanic_data.csv")

connection = Connection("localhost")
client = connection.client

Finally, we send the Polars DataFrame instance to the server using BastionLab's `polars.send_df()` method, which returns a `RemoteLazyFrame` instance: a reference to the uploaded DataFrame that we will work with throughout this tutorial.

For the sake of this tutorial, we specify an unsafe policy that disables all checks. We set the `safe_zone` parameter to `TrueRule()` to allow all requests. In this case, the `unsafe_handling` parameter can be anything (as there are no unsafe requests); we set it to `Log()` in the following example.

Note that this is done purely so that we can focus on demonstrating how to save dataframes in BastionLab without having to worry about approving any data access requests. This policy **is not suited for production**.

In [3]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log())
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

FetchableLazyFrame(identifier=68fb7492-4c8b-49ca-b9a9-649b3ced9eb3)

Let's run a simple operation on the dataframe to verify that everything has been set up properly:

In [4]:
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
)
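
To make clear what the query above computes, here is a plain-Python sketch of the same logic: group rows by class, take the mean of the survival flag, and sort by that mean in descending order. The rows below are made-up toy values chosen for illustration, not the Titanic dataset, and the function name `per_class_mean` is hypothetical.

```python
from collections import defaultdict


def per_class_mean(rows):
    """Group (pclass, survived) pairs by class and average the survived flag.

    Returns a list of (pclass, mean) tuples sorted by mean, highest first.
    """
    groups = defaultdict(list)
    for pclass, survived in rows:
        groups[pclass].append(survived)
    rates = [(pclass, sum(vals) / len(vals)) for pclass, vals in groups.items()]
    return sorted(rates, key=lambda item: item[1], reverse=True)


# Toy (Pclass, Survived) rows, invented for this example only.
toy_rows = [(1, 1), (1, 0), (2, 1), (2, 0), (2, 0), (3, 0), (3, 0), (3, 1), (3, 0)]
toy_rates = per_class_mean(toy_rows)
print(toy_rates)
```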

## Saving the data frame
_______________________________________

Now we save the resulting dataframe from our previous operation.

In [5]:
per_class_rates.save()
saved_identifier = per_class_rates.identifier
print(saved_identifier)

c2ee85c9-2a14-451e-81df-53a83d0d06d5
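
The identifier is the only handle you have on the saved dataframe, and a Python variable is lost if the notebook kernel restarts along with the server. One option is to write it to a small file. The sketch below uses only the standard library; the helper names and the file name `saved_identifier.json` are hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def store_identifier(identifier: str, path: Path) -> None:
    """Write the dataframe identifier to a small JSON file."""
    path.write_text(json.dumps({"identifier": identifier}))


def load_identifier(path: Path) -> str:
    """Read the dataframe identifier back from the JSON file."""
    return json.loads(path.read_text())["identifier"]


# Round-trip the identifier printed above through a file.
with tempfile.TemporaryDirectory() as tmp:
    id_file = Path(tmp) / "saved_identifier.json"  # hypothetical file name
    store_identifier("c2ee85c9-2a14-451e-81df-53a83d0d06d5", id_file)
    restored = load_identifier(id_file)

print(restored)
```

In a real notebook you would pass `saved_identifier` and a path outside the temporary directory, then call `load_identifier()` after reconnecting.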


Let us also fetch the saved dataframe so we can compare it to the reloaded dataframe later.

In [6]:
per_class_rates.fetch()

Pclass,Survived
i64,f64
1,0.633028
2,0.475676
3,0.24187


We will now restart the server and check which dataframes persist on the server.

## Testing persistence of data frames

<b>Terminate the running BastionLab server (Ctrl+C), then restart it.</b>

Reconnect to the server and list the available dataframes.

In [8]:
connection = Connection("localhost")
client = connection.client

client.polars.list_dfs()

[FetchableLazyFrame(identifier=c2ee85c9-2a14-451e-81df-53a83d0d06d5)]
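
Rather than reading the list by eye, you can check for the saved identifier programmatically. The sketch below assumes each object returned by `client.polars.list_dfs()` exposes an `identifier` attribute, as `per_class_rates.identifier` does earlier in this tutorial. The `FrameRef` class is a hypothetical stand-in so the example runs without a server.

```python
from dataclasses import dataclass


@dataclass
class FrameRef:
    """Hypothetical stand-in for a remote dataframe reference.

    Only the `identifier` attribute is modelled here.
    """
    identifier: str


def is_listed(frames, identifier):
    """Return True if any frame in `frames` carries the given identifier."""
    return any(frame.identifier == identifier for frame in frames)


# Mirrors the listing shown above: only the saved dataframe is present.
listed = [FrameRef("c2ee85c9-2a14-451e-81df-53a83d0d06d5")]

saved_found = is_listed(listed, "c2ee85c9-2a14-451e-81df-53a83d0d06d5")
unsaved_found = is_listed(listed, "68fb7492-4c8b-49ca-b9a9-649b3ced9eb3")
print(saved_found)    # True: the saved result is listed
print(unsaved_found)  # False: the original upload was never saved
```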

As you can see, the saved dataframe persists on the server.

You can print the dataframe to be sure it's the same one you saved.

In [9]:
retrieved_rdf = client.polars.get_df(saved_identifier)
retrieved_rdf.fetch()

Pclass,Survived
i64,f64
1,0.633028
2,0.475676
3,0.24187
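
Comparing two printed tables by eye works for three rows but not for larger results. The following sketch compares two tables held as lists of row tuples, using a tolerance for numeric values. It is a standard-library illustration only: the function name `same_table` is hypothetical, and how you turn a fetched dataframe into row tuples depends on your Polars version.

```python
import math
from numbers import Real


def same_table(left, right, rel_tol=1e-9):
    """Return True if both tables have equal rows, comparing numbers with a tolerance."""
    if len(left) != len(right):
        return False
    for row_l, row_r in zip(left, right):
        if len(row_l) != len(row_r):
            return False
        for a, b in zip(row_l, row_r):
            if isinstance(a, Real) and isinstance(b, Real):
                # Numeric cells: allow for tiny floating-point differences.
                if not math.isclose(a, b, rel_tol=rel_tol):
                    return False
            elif a != b:
                return False
    return True


# Rows copied from the two outputs shown in this tutorial.
before_restart = [(1, 0.633028), (2, 0.475676), (3, 0.24187)]
after_restart = [(1, 0.633028), (2, 0.475676), (3, 0.24187)]
tables_match = same_table(before_restart, after_restart)
print(tables_match)  # True
```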


Finally, close the connection.

In [10]:
connection.close()