# Combining Datasets Tutorial

In this tutorial, we are going to explore how we can combine remote datasets securely in BastionLab. Combining datasets allows us to train models or run queries on multiple datasets from different parties, which can lead to much more powerful results. 

Let's take an example: say 100 different hospitals around the world want to take part in a project to train a machine learning model that determines whether a patient has Covid-19 based on a chest X-ray scan. A model trained on a combined dataset from all 100 hospitals will almost certainly be more accurate, and relevant to a more varied range of patients, than one trained on data from any single hospital.

The vital advantage of doing this with a RemoteLazyFrame is that a data scientist can combine all of these datasets without having direct access to any one of them, enabling a level of collaboration which may previously have been deemed too risky in terms of data privacy.

So let's take a look at the steps required to combine datasets.

## Pre-requisites
### Technical Requirements

To follow along with this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [pip](https://pypi.org/project/pip/), the Python package installer

### Pip packages

To run this notebook, you will also need to install the Polars, BastionLab and bastionlab_server packages by running the following code block.

In [1]:
!pip install bastionlab polars bastionlab_server

>*Note that the bastionlab_server package we install here was created for testing purposes. You can alternatively install the BastionLab server using our Docker image or from source. Check out our [Installation Tutorial](../tutorials/installation.md) for more details.*

## Launching the server

The next step is to launch the server. The server exposes port 50056 for gRPC communication with clients and uses a default configuration (no authentication, default settings). For the purpose of this tutorial, these settings are sufficient and we won't change them. To launch the server, we simply use bastionlab_server's start method.

In [None]:
import bastionlab_server

srv = bastionlab_server.start()
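
Once the server has started, it can be useful to confirm that something is actually listening on port 50056 before trying to connect. The following is a minimal sketch using only the Python standard library; `is_port_open` is a hypothetical helper written for this tutorial, not part of BastionLab, and a successful TCP connection only shows that the port is reachable, not that the process behind it is a BastionLab server.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Hypothetical helper: True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises an OSError subclass on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 50056 is the gRPC port mentioned above; host and timeout are assumptions.
print("Port 50056 reachable:", is_port_open("localhost", 50056))
```

If this prints `False` right after starting the server, waiting a moment and retrying is usually enough, since the server may still be initializing.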

## Connecting to the server and uploading our data frames

We will connect to the server via the Connection() constructor.

In [None]:
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

Next, we will create three short Polars dataframes which we will then use to demonstrate how to combine datasets. Our three dataframes have an "Element" column containing elements and a "Melting Point (K)" column with their corresponding melting points.

In [4]:
import polars as pl

df1 = pl.DataFrame(
    {
        "Element": ["Copper", "Silver", "Gold"],
        "Melting Point (K)": [1357.77, 1234.93, 1337.33],
    }
)

df2 = pl.DataFrame(
    {"Element": ["Platinum", "Palladium"], "Melting Point (K)": [2041.4, 1828.05]}
)

df3 = pl.DataFrame({"Element": ["Titanium"], "Melting Point (K)": [1945.0]})

We now need to send these Polars dataframes to the server to get back our RemoteLazyFrame instances. In a real-life scenario, it may well be the data owner who sends over the dataframes and a data scientist who connects with authentication and retrieves these RemoteLazyFrame instances.

Normally, we would use the default policy or a safe custom policy so that anyone working with our datasets cannot run queries that compromise the privacy of our data. In this tutorial, however, we specify a policy that gives us full access to the data, purely so that we can print the datasets in full without requesting access. To do this, we set the `safe_zone` parameter to `TrueRule()`, which treats every request as safe. Since no request can then be unsafe, the `unsafe_handling` parameter can be anything; we set it to `Log()` in the following example.

>*Important note - This unsafe policy is not suited for production.*

We now send our three dataframes to the server, using our custom policy and get back three RemoteLazyFrame instances.

In [5]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log())

rdf1 = client.polars.send_df(df1, policy=policy)
rdf2 = client.polars.send_df(df2, policy=policy)
rdf3 = client.polars.send_df(df3, policy=policy)

## Appending datasets

We can now move on to exploring how to append RemoteLazyFrames using vstack. vstack can append any RemoteLazyFrame to another, provided their column names and types match.

>*You can learn about dataset preparation, including changing column types, names and adding and removing columns, in our [Data cleaning tutorial](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/data_cleaning/)*

We call the vstack method on the first RemoteLazyFrame and then give the RemoteLazyFrame we want to append to it as an argument. Vstack returns the resulting combined RemoteLazyFrame.

In the following cell, rdf2 (Platinum and Palladium) is appended to rdf1 (Copper, Silver and Gold). We assign the combined RemoteLazyFrame returned by vstack back to rdf1, so when we call .collect().fetch() on rdf1, we see the combined dataset.

In [6]:
rdf1 = rdf1.vstack(rdf2)
rdf1.collect().fetch()

| Element | Melting Point (K) |
| --- | --- |
| str | f64 |
| "Copper" | 1357.77 |
| "Silver" | 1234.93 |
| "Gold" | 1337.33 |
| "Platinum" | 2041.4 |
| "Palladium" | 1828.05 |


Next, we append our third RemoteLazyFrame, containing Titanium, to rdf1 twice.

Because vstack does not remove duplicate rows, the result contains all the previous elements followed by two Titanium rows.

In [7]:
rdf1 = rdf1.vstack(rdf3)
rdf1 = rdf1.vstack(rdf3)
rdf1.collect().fetch()

| Element | Melting Point (K) |
| --- | --- |
| str | f64 |
| "Copper" | 1357.77 |
| "Silver" | 1234.93 |
| "Gold" | 1337.33 |
| "Platinum" | 2041.4 |
| "Palladium" | 1828.05 |
| "Titanium" | 1945.0 |
| "Titanium" | 1945.0 |


That's it for this tutorial. All that remains is to close the connection.

In [8]:
connection.close()