<div id="colab_button">
  <h1>Data conversion</h1>
  <a target="_blank" href="https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.7/docs/docs/tutorials/data_conversion.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

__________________________________________________

To use BastionLab all the way from data exploration to deep learning training and machine learning model fitting, data scientists need to be able to convert remote data between the representations each step expects.

This tutorial shows how to convert a `RemoteDataFrame` to a `RemoteTensor`, using a `RemoteArray` as an intermediate step, so that you can use it for deep learning model training.

## Pre-requisites
___________________________________________

### Installation and dataset

In order to run this notebook, we need to:
- Have [Python 3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed
- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)
- Install PyTorch [1.13.1](https://pypi.org/project/torch/)
- Download [the dataset](https://www.kaggle.com/competitions/titanic) we will be using in this tutorial.

We'll do so by running the code block below. 

>If you are running this notebook on your machine instead of [Google Colab](https://colab.research.google.com/github/mithril-security/bastionlab/blob/v0.3.6/docs/docs/tutorials/data_conversion.ipynb), you can see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) to find the installation method that best suits your needs.

In [1]:
# pip packages
!pip install bastionlab
!pip install bastionlab_server
!pip install torch

# download the dataset
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

Our dataset is based on the [Titanic dataset](https://www.kaggle.com/c/titanic), one of the most popular resources for learning machine learning, which contains information about the passengers aboard the Titanic.

### Launch and connect to the server

In [2]:
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

>*Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

In [3]:
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client


### Upload the dataframe to the server

We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. This allows us to demonstrate features without having to approve any data access requests. *You can check out how to define a privacy policy [here](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/).* 

We'll also limit the size of the dataset sent to the server with Polars' `df.limit()` method, so that this tutorial runs faster and uses fewer resources, since we are only performing data conversion and not full data exploration.

In [4]:
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("titanic.csv")
policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=False)
rdf = client.polars.send_df(df.limit(100), policy=policy)

rdf

FetchableLazyFrame(identifier=b14b0823-c94b-4019-a56e-72dbb8a95466)

## Convert `RemoteDataFrame` to `RemoteArray`
----

To convert BastionLab's main data exploration object, the `RemoteDataFrame`, to its main AI training object, the `RemoteTensor`, we'll need to go through an intermediate step: the `RemoteArray`.

Since the [NumPy](https://numpy.org/) library's `array`s are commonly used in machine learning training, we decided to make our user interface and experience similar. What we'll show in this tutorial is as straightforward as fitting a [Scikit-learn](https://scikit-learn.org/stable/) `LinearRegression` model on a NumPy `array`.
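```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
# array: NumPy array of features, target: NumPy array of values to predict
lr.fit(array, target)
```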

Except that, in BastionLab, `array` will be a `RemoteArray`, which is a pointer to a `RemoteDataFrame`. When `to_tensor()` is called on it, the `RemoteDataFrame` it points to is converted to a `RemoteTensor`.

In [None]:
# Converting a RemoteDataFrame to a RemoteArray
rdf.to_array()

TypeError: Utf8 column cannot be converted into RemoteArray

Oh but wait. It didn't work! We got an error message: _`TypeError: Utf8 column cannot be converted into RemoteArray`_.

This means we need to make sure our `RemoteDataFrame` only has numerical fields (_ints, floats_) before we convert it into a `RemoteArray`. This makes sense because tensors only accept numerical values, and arrays are here to prepare that next conversion step.
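
To make the rule concrete, here is a minimal local sketch of the kind of check described above. It is an illustration only and does not call BastionLab or Polars: `NUMERIC_DTYPES` and `non_numeric_columns` are hypothetical names, the dtypes are plain strings that mimic Polars type names, and the schema is hand-written with assumed dtypes for a few Titanic columns.

```python
# Hypothetical illustration only: this does not call BastionLab or Polars.
NUMERIC_DTYPES = {"Int32", "Int64", "Float32", "Float64"}


def non_numeric_columns(schema):
    """Return the names of the columns whose dtype is not numerical."""
    return [name for name, dtype in schema.items() if dtype not in NUMERIC_DTYPES]


# Hand-written schema with assumed dtypes for a few Titanic columns
schema = {"PassengerId": "Int64", "Name": "Utf8", "Age": "Float64", "Sex": "Utf8"}

# Columns that would have to be dropped (or encoded) before the conversion
blocking = non_numeric_columns(schema)
print(blocking)
```

Any column reported by such a check is a text column and has to be left out before the conversion can succeed.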

In [5]:
# We use Polars' pl.col() method to keep only the numerical columns
rdf = rdf.select(pl.col([pl.Float64, pl.Float32, pl.Int64, pl.Int32]))

Let's try to convert our `RemoteDataFrame` once more to a `RemoteArray`.

In [4]:
# Converting RemoteDataFrame to RemoteArray
rdf.to_array()

TypeError: DataTypes for all columns should be the same

Again, our `to_array()` method raises an error! _`TypeError: DataTypes for all columns should be the same`_.

This means we need to cast all our columns to the same type before converting them into an array. Here, we'll choose `Float64` so that every numerical value can be represented.

> *It is very important that we cast all our columns into a single datatype to make our `RemoteArray` compatible with other libraries and machine learning applications - as arrays are supposed to be a collection of objects of the same type.*
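
As a local analogy for this single-type rule, the sketch below uses Python's standard-library `array` module rather than BastionLab or NumPy. The variable names are made up for the example. It shows that a typed array holds exactly one element type: integers are widened when the array stores floats, and a float is rejected when the array stores integers.

```python
from array import array

# "d" is the type code for C doubles: the integers are stored as floats
as_floats = array("d", [1, 2, 3.5])
print(as_floats)

# "q" is the type code for signed 64-bit integers: a float is rejected
try:
    array("q", [1, 2, 3.5])
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```

This mirrors why casting every column to `Float64` is a safe choice here: integer columns can be represented as floats, while the reverse would not hold for columns with fractional values.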

In [6]:
# Converting all values of the RemoteDataFrame to Float64
rdf = rdf.select(pl.all().cast(pl.Float64))

We'll try again to convert `RemoteDataFrame` into `RemoteArray`.

In [7]:
# Converting RemoteDataFrame to RemoteArray
rdf.to_array()

RemoteArray(identifier=66657395-e506-4934-907f-025396635d93)

It's a success! 

## Convert `RemoteArray` to `RemoteTensor`
____________________________________________________


Now that we have converted our `RemoteDataFrame` to a `RemoteArray`, we'll convert the `RemoteArray` to a `RemoteTensor` so that we can train our model. This shouldn't run into problems, since the `RemoteArray` step has already taken care of potential conversion issues.


In [8]:
# Converts the `RemoteDataFrame` into a `RemoteTensor`
# (going through the intermediate `RemoteArray` step)
remote_tensor = rdf.to_array().to_tensor()

Once the `RemoteTensor` has been created, we can go ahead and print its available properties, which are `dtype` and `shape`.

In [None]:
print(remote_tensor)

We chose to only show you those two properties (the type of the tensor and its shape) to protect the privacy of the data - but still give you the vital information you need to train your model. 

>*You can refer to our [Covid 19 deep learning](https://github.com/mithril-security/bastionlab/blob/master/docs/docs/how-to-guides/covid_19_deep_learning_cleaning.ipynb) how-to guide to see how we use `RemoteTensor`s to train a PyTorch linear regression model.*

### Updating the `dtype` of `RemoteTensor`

Changing the `dtype` is the only operation you can perform on a `RemoteTensor`, because we need to limit access to guarantee the privacy of the stored data.

We'll use `to()`, just like with a regular torch tensor, to change the `dtype` of the tensor.
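
For comparison, here is how `to()` behaves on a regular, local torch tensor. This sketch only uses PyTorch and does not touch BastionLab. The tensor values and the names `local_tensor` and `converted` are made up for the example.

```python
import torch

# A small local tensor, unrelated to the remote data
local_tensor = torch.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=torch.float64)

# to() returns a converted tensor and leaves the original one untouched
converted = local_tensor.to(torch.int64)

print(local_tensor.dtype)
print(converted.dtype)
```

With a local tensor, the converted values are only available through the object returned by `to()`, which is why the result is assigned to a new variable here.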

In [10]:
import torch

# Using the to() method to update the dtype of the RemoteTensor
remote_tensor.to(torch.int64)

RemoteTensor(identifier=fa80806f-38dc-4923-9035-8d9e94b94526, dtype=torch.int64, shape=torch.Size([100, 7]))

In [11]:
print(remote_tensor)

RemoteTensor(identifier=fa80806f-38dc-4923-9035-8d9e94b94526, dtype=torch.float64, shape=torch.Size([100, 7]))


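The `RemoteTensor` returned by `to()` has its `dtype` set to `int64`! Note that printing `remote_tensor` afterwards still reports `torch.float64` in the output above: as with a regular torch tensor, assign the result of `to()` to a variable if you want to keep working with the converted tensor.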

You now know how to convert a `RemoteDataFrame` to a `RemoteTensor`. All that's left to do now is to close your connection to the server and stop the server:

In [None]:
connection.close()
bastionlab_server.stop(srv)