<div id="colab_button">
  <h1>Data Conversion within BastionLab</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/SQL_queries.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>
__________________________________________________

For BastionLab to be a one-stop shop for data exploration, deep learning training and machine learning model fitting, we need to be able to convert remote data into the representation each task requires, so that we can exploit BastionLab's full capabilities.

This tutorial shows how to convert `RemoteDataFrame`s into `RemoteTensor`s so that they can be used for deep learning model training.

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python 3.7](https://www.python.org/downloads/) (or greater) and [pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Download [the dataset](https://www.kaggle.com/competitions/titanic) we will be using in this tutorial
- Have PyTorch [1.13.1](https://pypi.org/project/torch/) installed

We'll do so by running the code block below. 

>If you are running this notebook on your machine instead of [Google Colab](https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.6/docs/docs/tutorials/data_cleaning.ipynb), you can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [1]:
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# download the dataset
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

Our dataset is based on the [Titanic dataset](https://www.kaggle.com/c/titanic), one of the most popular resources for learning machine learning, which contains information about the passengers aboard the Titanic.

### Launch and connect to the server

In [2]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

In [1]:
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client



### Upload the dataframe to the server

Before we upload the dataset to the server, we'll create a custom privacy policy which allows all operations to be done on the dataframe. *You can check out how to define a privacy policy [here](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/).* 

We also send only the first 100 rows of the dataset to the server, because this tutorial covers data conversion rather than data exploration.

In [2]:
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("titanic.csv")
policy = Policy(safe_zone=TrueRule, unsafe_handling=Log(), savable=False)
rdf = client.polars.send_df(df.limit(100), policy=policy)

rdf

FetchableLazyFrame(identifier=600f8b6d-2a7c-4363-97cb-cce0055cf65d)

## Convert RemoteDataFrames to RemoteArray
----

`RemoteArray`s are BastionLab's internal intermediate representation. They are akin to NumPy arrays, but each one is essentially a pointer to a `DataFrame` on the server. Calling `to_tensor` on a `RemoteArray` converts that `DataFrame` into a `Tensor` on the server.

In [3]:
rdf.to_array()

TypeError: Utf8 column cannot be converted into RemoteArray

We notice that we had an error when we tried to convert the `RemoteDataFrame` into a `RemoteArray`. The error message is _"TypeError: Utf8 column cannot be converted into RemoteArray"_.

This means we need to make sure our dataframe contains only numerical columns (integers and floats) before we convert it into a `RemoteArray`.

In [4]:
rdf = rdf.select(pl.col([pl.Float64, pl.Float32, pl.Int64, pl.Int32]))

rdf.to_array()

TypeError: DataTypes for all columns should be the same

Again, our `to_array` method raises an error, and this time it says that _"DataTypes for all columns should be the same"_.

This means we need to cast all our columns to a single data type before converting the dataframe into an array.

We choose `Float64`, which can hold the values of both the integer and the floating-point columns.

In [7]:
rdf = rdf.select(pl.all().cast(pl.Float64))
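
A side note on this cast: `Float64` represents integers exactly only up to 2^53, so casting `Int64` columns is lossless for small values such as passenger IDs and counts, but not for every possible 64-bit integer. The sketch below illustrates this locally with plain Python floats, which use the same IEEE 754 double-precision format. The helper name `survives_float64` is hypothetical and is not part of BastionLab or polars.

```python
# Python's float is an IEEE 754 double, the same format as Float64.
LIMIT = 2 ** 53  # every integer up to this magnitude is exactly representable


def survives_float64(n: int) -> bool:
    """Return True if integer n is unchanged by a round trip through a 64-bit float."""
    return int(float(n)) == n


print(survives_float64(891))        # a small, ID-sized value
print(survives_float64(LIMIT))      # still exact at the limit
print(survives_float64(LIMIT + 1))  # first integer that gets rounded
```

For columns holding very large integers (for example, hashed identifiers), it may be safer to check value ranges before casting.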

Finally, we can convert our `RemoteDataFrame` into a `RemoteArray`.

In [8]:
rdf.to_array()

RemoteArray(identifier=091d9529-2369-45b2-bc2b-151ab9eb5eb5)

We can see right above that our `RemoteDataFrame` has been converted into a `RemoteArray`.
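
The two error messages met in this section suggest two conditions a dataframe has to satisfy before `to_array` succeeds: every column is numerical, and all columns share one data type. The sketch below is a purely local model of those conditions, not BastionLab's implementation. The names `check_convertible` and `try_check` and the use of plain strings for data types are hypothetical, and the schema is an assumption about what polars would typically infer for `titanic.csv`.

```python
# Hypothetical, local-only model of the two checks seen in this section.
# Data types are plain strings here; this is not BastionLab's API.
NUMERIC_DTYPES = {"Int32", "Int64", "Float32", "Float64"}

# Assumed schema of titanic.csv (an assumption, not read from the server).
TITANIC_SCHEMA = {
    "PassengerId": "Int64", "Survived": "Int64", "Pclass": "Int64",
    "Name": "Utf8", "Sex": "Utf8", "Age": "Float64",
    "SibSp": "Int64", "Parch": "Int64", "Ticket": "Utf8",
    "Fare": "Float64", "Cabin": "Utf8", "Embarked": "Utf8",
}


def check_convertible(schema):
    """Raise TypeError if the schema breaks either rule, reusing the messages above."""
    for dtype in schema.values():
        if dtype not in NUMERIC_DTYPES:
            raise TypeError(f"{dtype} column cannot be converted into RemoteArray")
    if len(set(schema.values())) > 1:
        raise TypeError("DataTypes for all columns should be the same")


def try_check(schema):
    """Return 'ok' or the error message, so the three stages are easy to compare."""
    try:
        check_convertible(schema)
        return "ok"
    except TypeError as err:
        return str(err)


# Stage 1: raw schema. Stage 2: numeric columns only. Stage 3: all cast to Float64.
numeric_only = {n: d for n, d in TITANIC_SCHEMA.items() if d in NUMERIC_DTYPES}
all_float64 = {name: "Float64" for name in numeric_only}

for stage in (TITANIC_SCHEMA, numeric_only, all_float64):
    print(try_check(stage))
print(len(all_float64), "numeric columns remain")
```

Under the assumed schema, the three stages mirror the three `to_array` calls in this section: the first two fail with the same messages and the third passes, leaving seven numerical columns.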

## Convert RemoteArrays to RemoteTensors
----

To train deep learning and machine learning models on `RemoteDataFrame`s, we first have to convert them into `RemoteTensor`s.

The previous section demonstrated how to convert a `RemoteDataFrame` into a `RemoteArray`.

In this section, we take the next step and convert a `RemoteArray` into a `RemoteTensor`.

In [9]:
# Converts `RemoteArray` into `RemoteTensor`
remote_tensor = rdf.to_array().to_tensor()

Once the `RemoteTensor` has been created, we can go ahead and print the available properties, which are `dtype` and `shape`.

In [10]:
print(remote_tensor)

RemoteTensor(identifier=67094d5d-2be3-4ef6-bec6-64e83827d0cb, dtype=torch.float64, shape=torch.Size([100, 7]))


> `RemoteTensor` limits the access to the tensor stored on the server by only providing you with a single API to update the `dtype` of the corresponding `Tensor` stored on the server.

## Updating the `dtype` of our `RemoteTensor`

We import torch and call the `to` method provided on the `RemoteTensor` to update the `dtype` of the `RemoteTensor`.

In [11]:
import torch

remote_tensor.to(torch.int64)

RemoteTensor(identifier=67094d5d-2be3-4ef6-bec6-64e83827d0cb, dtype=torch.int64, shape=torch.Size([100, 7]))

In [None]:
print(remote_tensor)

In the print message above, we can observe that the `dtype` for the RemoteTensor has been updated to `int64`.
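
Keep in mind that changing a floating-point `dtype` to an integer one is not value-preserving: in PyTorch, a float-to-integer conversion discards the fractional part (truncation toward zero). The plain-Python sketch below models that behavior on a few illustrative values. It assumes the server-side conversion behaves like a local `torch.Tensor.to`, which this tutorial does not confirm, and the helper name `cast_like_int64` is hypothetical.

```python
import math


def cast_like_int64(values):
    """Model a float-to-integer cast that drops the fractional part (truncation toward zero)."""
    return [math.trunc(v) for v in values]


# Illustrative fare-like values; nothing here is read from the server.
fares = [7.25, 71.2833, 8.05]
print(cast_like_int64(fares))

# Truncation moves toward zero, so negative values are not floored.
print(cast_like_int64([-1.9, 1.9]))
```

If fractional values matter for training, it is probably better to keep a floating-point `dtype` such as `torch.float32`.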

Let's now close the connection and shut down the server.

In [None]:
connection.close()
bastionlab_server.stop(srv)