# Combining datasets
______________________________________________________

When several parties each hold a dataset describing the same kind of information, combining those datasets lets you query them, or train models on them, as a whole rather than one party at a time.

In this tutorial, we are going to see how to combine datasets in BastionLab by adding the rows of one dataset to the end of another. To do so, we will use the `vstack()` method, which joins RemoteLazyFrames vertically: the rows of the second RemoteLazyFrame are stacked under the rows of the first.

The only requirement is that the datasets have the same column names and types.
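
The column requirement can be sketched as a check run before any data is combined. The helpers `schema_of` and `schema_mismatches` below are hypothetical names written for this illustration; they are not part of BastionLab or Polars, and they describe types using Python's own type names rather than Polars dtypes.

```python
# Hypothetical helpers for illustration; not part of BastionLab or Polars.
def schema_of(columns):
    """Map each column name to the Python type name of its first value.

    Assumes every column holds at least one value.
    """
    return {name: type(values[0]).__name__ for name, values in columns.items()}


def schema_mismatches(left, right):
    """Return a list of readable differences between two schemas."""
    problems = []
    for name in sorted(set(left) | set(right)):
        if name not in left or name not in right:
            problems.append(f"column {name!r} is missing from one dataset")
        elif left[name] != right[name]:
            problems.append(
                f"column {name!r} is {left[name]} in one dataset "
                f"and {right[name]} in the other"
            )
    return problems


metals = {"Element": ["Copper", "Silver"], "Melting Point (K)": [1357.77, 1234.93]}
more_metals = {"Element": ["Platinum"], "Melting Point (K)": [2041.4]}
# Same column names, but the melting point was stored as text.
as_text = {"Element": ["Gold"], "Melting Point (K)": ["1337.33"]}

compatible = schema_mismatches(schema_of(metals), schema_of(more_metals))
incompatible = schema_mismatches(schema_of(metals), schema_of(as_text))
print(compatible)
print(incompatible)
```

An empty list means the two datasets share the same column names and types; any entry points at a column that would need to be renamed or converted first.
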

>You can learn about dataset preparation, including changing column types, names and adding and removing columns, in our [Data cleaning tutorial](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/data_cleaning/).

## Pre-requisites
___________________________________

### Technical Requirements

To start this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [pip](https://pypi.org/project/pip/), the Python package manager
- [Docker](https://www.docker.com/) 

*Here's a [Docker tutorial for beginners](https://docker-curriculum.com/) to help you set it up on your computer.*

### Pip packages and dataset

To run this notebook, you will also need to install Polars and BastionLab by running the following code block.

In [1]:
!pip install polars bastionlab



Next, we will start the server using BastionLab's official Docker image.

In [None]:
!docker run -it -p 50056:50056 -d mithrilsecuritysas/bastionlab:latest

Then, we will connect to the server by creating a `Connection` object and retrieving its client.

In [None]:
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

To keep this tutorial short, we will build a small example dataset ourselves instead of downloading one. It lists a few metals alongside their melting points in kelvin. We split it across three Polars DataFrames, as if each part had come from a different party. All three share the same two columns: "Element" and "Melting Point (K)".

In [4]:
import polars as pl

df1 = pl.DataFrame(
    {
        "Element": ["Copper", "Silver", "Gold"],
        "Melting Point (K)": [1357.77, 1234.93, 1337.33],
    }
)

df2 = pl.DataFrame(
    {"Element": ["Platinum", "Palladium"], "Melting Point (K)": [2041.4, 1828.05]}
)

df3 = pl.DataFrame({"Element": ["Titanium"], "Melting Point (K)": [1945.0]})

### Upload the dataset to the server

Now, let's send the Polars DataFrames to the server. We'll use BastionLab's `polars.send_df()` method, which returns a `RemoteLazyFrame` instance: a reference to the DataFrame uploaded to the server. We will be working with these references throughout this tutorial.

For the sake of this tutorial, we specify an **unsafe policy which disables all checks**. We set the `safe_zone` parameter to `TrueRule()` to allow all requests. In this case, the `unsafe_handling` parameter can be anything (as there are no unsafe requests), so we set it to `Log()` in the following example.

>**Important note** - This unsafe policy is used so that we can focus on demonstrating how to combine datasets in BastionLab, without having to worry about approving any data access requests. However, this policy is **not** suited for production.


In [5]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log())

rdf1 = client.polars.send_df(df1, policy=policy)
rdf2 = client.polars.send_df(df2, policy=policy)
rdf3 = client.polars.send_df(df3, policy=policy)

## Append datasets
____________________________________________

We now have three RemoteLazyFrames on the server, each holding part of our list of metals. The goal of this section is to bring them together into one dataset.

To do this, we will use the `vstack()` method. It is called on one RemoteLazyFrame and takes another RemoteLazyFrame as its argument. It returns a RemoteLazyFrame in which the rows of the argument are placed under the rows of the frame it was called on. Both frames must have the same column names and types.

Let's start by adding `rdf2`, which holds Platinum and Palladium, to `rdf1`, which holds Copper, Silver and Gold. Because `vstack()` returns its result, we assign that result back to `rdf1`.

When we `.collect().fetch()` the result, the rows of the second dataset appear at the bottom of the first one.

In [6]:
rdf1 = rdf1.vstack(rdf2)
rdf1.collect().fetch()

| Element | Melting Point (K) |
| --- | --- |
| str | f64 |
| "Copper" | 1357.77 |
| "Silver" | 1234.93 |
| "Gold" | 1337.33 |
| "Platinum" | 2041.4 |
| "Palladium" | 1828.05 |

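As a mental model of what vertical stacking does, here is a minimal plain-Python sketch that treats a table as a dictionary of columns. The function `vstack_columns` is a hypothetical name used only for this illustration; it is not how BastionLab or Polars implement `vstack()`, and its check on column names and their order is an assumption of this sketch.

```python
# Plain-Python model of vertical stacking; an illustration only,
# not how BastionLab or Polars implement vstack().
def vstack_columns(top, bottom):
    """Return a new table with the rows of `bottom` placed under `top`."""
    # Assumption of this sketch: column names must match, in the same order.
    if list(top) != list(bottom):
        raise ValueError("both tables must have the same column names")
    # Build new lists so that neither input table is modified.
    return {name: top[name] + bottom[name] for name in top}


first = {
    "Element": ["Copper", "Silver", "Gold"],
    "Melting Point (K)": [1357.77, 1234.93, 1337.33],
}
second = {
    "Element": ["Platinum", "Palladium"],
    "Melting Point (K)": [2041.4, 1828.05],
}

stacked = vstack_columns(first, second)
print(stacked["Element"])
print(len(first["Element"]), len(stacked["Element"]))
```

The model returns its result instead of changing `first`, which mirrors the pattern used in the cell above: the stacked frame only replaces the original once it is assigned back to the same variable.
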

You can do this as many times as you want. For example, let's add our third RemoteLazyFrame, which contains Titanium, to `rdf1` twice.

After running the cell below, `rdf1` contains all the previous elements, plus two Titanium rows at the end.

In [7]:
rdf1 = rdf1.vstack(rdf3)
rdf1 = rdf1.vstack(rdf3)
rdf1.collect().fetch()

| Element | Melting Point (K) |
| --- | --- |
| str | f64 |
| "Copper" | 1357.77 |
| "Silver" | 1234.93 |
| "Gold" | 1337.33 |
| "Platinum" | 2041.4 |
| "Palladium" | 1828.05 |
| "Titanium" | 1945.0 |
| "Titanium" | 1945.0 |

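When there are many datasets to combine, the repeated calls can be folded into one helper. The sketch below is a suggestion, not part of BastionLab: `stack_all` is a hypothetical function that works with any object exposing a `vstack(other)` method, and `RowList` is a hypothetical stand-in class so that the example runs without a server.

```python
from functools import reduce


def stack_all(frames):
    """Stack every frame in `frames`, in order, using each object's vstack()."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one frame to stack")
    # reduce() calls vstack() pairwise: ((f0.vstack(f1)).vstack(f2)) and so on.
    return reduce(lambda stacked, frame: stacked.vstack(frame), frames)


class RowList:
    """Hypothetical stand-in for a frame so the sketch runs without a server."""

    def __init__(self, rows):
        self.rows = list(rows)

    def vstack(self, other):
        # Return a new object holding this object's rows followed by the other's.
        return RowList(self.rows + other.rows)


combined = stack_all(
    [
        RowList(["Copper", "Silver", "Gold"]),
        RowList(["Platinum", "Palladium"]),
        RowList(["Titanium"]),
        RowList(["Titanium"]),
    ]
)
print(combined.rows)
```

Since the helper relies only on the `vstack()` method shown in this tutorial, passing it a list of freshly uploaded RemoteLazyFrames should give the same result as the cells above, although this has not been run against a BastionLab server here.
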

And that's it for this tutorial! You now know **how to combine multiple datasets using BastionLab**. We can now close the connection.

In [8]:
connection.close()