# Combining Datasets Tutorial

In this tutorial, we are going to see how to use BastionLab's `vstack` function to join RemoteLazyFrames vertically. This can be particularly useful if you want to query or train models on a combination of datasets from different parties. The only requirement is that the datasets must have the same column names and types. You can learn about dataset preparation, including changing column types and names and adding or removing columns, in our data cleaning tutorial.
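
To make the "same column names/types" requirement concrete, here is a minimal plain-Python sketch of the kind of check it implies. It is not BastionLab's or Polars' implementation: `schema_of` and `can_vstack` are hypothetical helpers, each "frame" is just a dict of column lists, and we assume column order matters as well.

```python
# Plain-Python illustration only: a "frame" is a dict mapping column name -> list of values.
# `schema_of` and `can_vstack` are hypothetical helpers, not part of BastionLab or Polars.


def schema_of(frame):
    """Return the ordered (column name, value type) pairs of a non-empty frame."""
    return [(name, type(values[0]).__name__) for name, values in frame.items()]


def can_vstack(top, bottom):
    """True when both frames have the same column names and types, in the same order."""
    return schema_of(top) == schema_of(bottom)


metals = {"Element": ["Copper", "Silver"], "Melting Point (K)": [1357.77, 1234.93]}
more_metals = {"Element": ["Platinum"], "Melting Point (K)": [2041.4]}
symbols = {"Element": ["Gold"], "Symbol": ["Au"]}

print(can_vstack(metals, more_metals))  # True: same names and types
print(can_vstack(metals, symbols))      # False: second column differs
```

If a check like this fails, the column names or types need to be aligned first, which is what the data cleaning tutorial covers.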

## Pre-requisites
### Technical Requirements

To start this tutorial, ensure the following are already installed in your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [Python Pip](https://pypi.org/project/pip/) (PyPi), the package manager
- [Docker](https://www.docker.com/) 

*Here's the [Docker official tutorial](https://docker-curriculum.com/) to set it up on your computer.*

### Pip packages and dataset

To run this notebook, you will also need to install Polars and BastionLab by running the following code block.

In [1]:
!pip install bastionlab

First, we will start the BastionLab server. The cell below does this through the `bastionlab_server` Python package.

In [None]:
import bastionlab_server

srv = bastionlab_server.start()

### Uploading our data frames to the server

Next, we will connect to the server by creating a `Connection` object and retrieving its client.

In [None]:
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

Now we will create three simple Polars dataframes with "Element" and "Melting Point (K)" columns.

In [4]:
import polars as pl

df1 = pl.DataFrame(
    {
        "Element": ["Copper", "Silver", "Gold"],
        "Melting Point (K)": [1357.77, 1234.93, 1337.33],
    }
)

df2 = pl.DataFrame(
    {"Element": ["Platinum", "Palladium"], "Melting Point (K)": [2041.4, 1828.05]}
)

df3 = pl.DataFrame({"Element": ["Titanium"], "Melting Point (K)": [1945.0]})

Now we need to send these Polars dataframes to the server and get back the RemoteLazyFrame instances that we will be working with in the rest of the tutorial.

For the sake of this tutorial, we specify an unsafe policy which disables all checks. We set the `safe_zone` parameter to `TrueRule()` to allow all requests. In this case, the `unsafe_handling` parameter can be anything (no request is ever classed as unsafe); we set it to `Log()` in the following example.

Note that this is purely done so that we can focus on demonstrating the vstack feature in BastionLab without having to worry about approving any data access requests. This policy is not suitable for production.

In [5]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log())

rdf1 = client.polars.send_df(df1, policy=policy)
rdf2 = client.polars.send_df(df2, policy=policy)
rdf3 = client.polars.send_df(df3, policy=policy)

Now we will test out `vstack` by appending the second dataframe, with the Platinum and Palladium elements, to our first one, with Copper, Silver and Gold. We simply reassign `rdf1` to the RemoteLazyFrame returned by `vstack`.

As you can see when we call `.collect().fetch()` on our RemoteLazyFrame, the second dataframe has now been appended to the bottom of the first one.

In [6]:
rdf1 = rdf1.vstack(rdf2)
rdf1.collect().fetch()

Element,Melting Point (K)
str,f64
"""Copper""",1357.77
"""Silver""",1234.93
"""Gold""",1337.33
"""Platinum""",2041.4
"""Palladium""",1828.05


Next we will add our third RemoteLazyFrame containing Titanium to rdf1, two times!

As you can see, we now have all the previous elements, plus two lots of Titanium at the end.

In [7]:
rdf1 = rdf1.vstack(rdf3)
rdf1 = rdf1.vstack(rdf3)
rdf1.collect().fetch()

Element,Melting Point (K)
str,f64
"""Copper""",1357.77
"""Silver""",1234.93
"""Gold""",1337.33
"""Platinum""",2041.4
"""Palladium""",1828.05
"""Titanium""",1945.0
"""Titanium""",1945.0


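### `join()`

Next, we will look at `join()`, which combines two RemoteLazyFrames on one or more shared columns. To set up the examples, we print the current contents of `rdf1` and upload a second dataframe, `new_rdf`, which shares the `"Element"` column with it.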

In [8]:
print("rdf1:")
print(rdf1.collect().fetch())

new_df = pl.DataFrame(
    {
        "Element": ["Magnesium", "Silver", "Gold", "Platinum"],
        "Symbol": ["Mg", "Ag", "Au", "Pt"],
        "Number": [12, 47, 79, 78],
    }
)
new_rdf = client.polars.send_df(new_df, policy=policy)
print("new_rdf")
new_rdf.collect().fetch()

rdf1:
shape: (8, 2)
┌───────────┬───────────────────┐
│ Element   ┆ Melting Point (K) │
│ ---       ┆ ---               │
│ str       ┆ f64               │
╞═══════════╪═══════════════════╡
│ Copper    ┆ 1357.77           │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Silver    ┆ 1234.93           │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Silver    ┆ 1234.93           │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Gold      ┆ 1337.33           │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Platinum  ┆ 2041.4            │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Palladium ┆ 1828.05           │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Titanium  ┆ 1945.0            │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Titanium  ┆ 1945.0            │
└───────────┴───────────────────┘
new_rdf


Element,Symbol,Number
str,str,i64
"""Magnesium""","""Mg""",12
"""Silver""","""Ag""",47
"""Gold""","""Au""",79
"""Platinum""","""Pt""",78


For our first example, we'll join the two RemoteLazyFrames on the column they have in common, `"Element"`, using an `inner` join. This leaves us with a combined table, but drops any elements that were not found in both tables.

In [9]:
join = rdf1.join(new_rdf, on="Element", how="inner")
join.collect().fetch()

Element,Melting Point (K),Symbol,Number
str,f64,str,i64
"""Silver""",1234.93,"""Ag""",47
"""Silver""",1234.93,"""Ag""",47
"""Gold""",1337.33,"""Au""",79
"""Platinum""",2041.4,"""Pt""",78


For our second example, we will use the `anti` join, which gives us only the elements from the left-hand table which **do not match** any element in the `other` table.

In [10]:
join = rdf1.join(new_rdf, on="Element", how="anti")
join.collect().fetch()

Element,Melting Point (K)
str,f64
"""Copper""",1357.77
"""Palladium""",1828.05
"""Titanium""",1945.0
"""Titanium""",1945.0

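To make the difference between the two join types explicit, here is a small plain-Python sketch of their semantics on a shortened version of the tables above. It only mirrors the behaviour; `inner_join` and `anti_join` are hypothetical helpers, not BastionLab or Polars functions.

```python
# Plain-Python illustration of join semantics; rows are dicts.
# `inner_join` and `anti_join` are hypothetical helpers, not library functions.

left = [
    {"Element": "Copper", "Melting Point (K)": 1357.77},
    {"Element": "Silver", "Melting Point (K)": 1234.93},
    {"Element": "Gold", "Melting Point (K)": 1337.33},
    {"Element": "Platinum", "Melting Point (K)": 2041.4},
    {"Element": "Titanium", "Melting Point (K)": 1945.0},
]
right = [
    {"Element": "Magnesium", "Symbol": "Mg"},
    {"Element": "Silver", "Symbol": "Ag"},
    {"Element": "Gold", "Symbol": "Au"},
    {"Element": "Platinum", "Symbol": "Pt"},
]


def inner_join(left_rows, right_rows, on):
    """Keep every left/right pair whose `on` values are equal, merging their columns."""
    return [
        {**lrow, **rrow}
        for lrow in left_rows
        for rrow in right_rows
        if lrow[on] == rrow[on]
    ]


def anti_join(left_rows, right_rows, on):
    """Keep only the left rows whose `on` value never appears in the right table."""
    right_keys = {rrow[on] for rrow in right_rows}
    return [lrow for lrow in left_rows if lrow[on] not in right_keys]


inner_elements = [row["Element"] for row in inner_join(left, right, "Element")]
anti_elements = [row["Element"] for row in anti_join(left, right, "Element")]
print(inner_elements)  # ['Silver', 'Gold', 'Platinum']
print(anti_elements)   # ['Copper', 'Titanium']
```

Note that the anti join returns left-hand rows only, with no columns taken from the right-hand table, and that Magnesium appears in neither result because it exists only on the right.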

### `join_asof()`

`join_asof()` works similarly to a `left-join`, except that we match **on nearest key rather than equal keys**. For this to work, both RemoteLazyFrames must be sorted by the `join_asof` key. 

Like `join`, `join_asof` relies on Polars' own `join_asof` function for LazyFrames, and can be performed on RemoteLazyFrames that live on the same server.

The arguments accepted are the same as those accepted by the [Polars LazyFrame `join_asof` method](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.join_asof.html#polars.LazyFrame.join_asof).

- `other (mandatory argument)`: The other RemoteLazyFrame you want to join your current RemoteLazyFrame with.
- `left_on`: The name(s) of the left join column(s). *(Note: you must specify either the `left_on` and `right_on` columns or one `on` column.)*
- `right_on`: The name(s) of the right join column(s).
- `on`: Name(s) of join columns for both RemoteLazyFrames.
- `by_left`: Join on these columns before doing asof join.
- `by_right`: Join on these columns before doing asof join.
- `by`: Join on these columns before doing asof join.
- `strategy`: Join strategy: `'backward'` or `'forward'`. *See the next section, 'strategy' for more details.*
- `suffix`: Suffix to append to columns with a duplicate name.
- `tolerance`: Numeric tolerance. By setting this the join will only be done if the near keys are within this distance.
- `allow_parallel`: Boolean value for allowing the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.
- `force_parallel`: Boolean value for forcing the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.

### strategy

- If you select `backward`, search selects the last row in the right DataFrame whose `on` key is **less than or equal** to the left’s key.
- If you select `forward`, search selects the first row in the right DataFrame whose `on` key is **greater than or equal** to the left’s key.
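
The two strategies, together with `tolerance`, can be sketched in plain Python with a binary search over the sorted right-hand keys. This is only an illustration of the matching rule, not how BastionLab or Polars implement it; `asof_match` is a hypothetical helper, and the tolerance is assumed to be inclusive.

```python
from bisect import bisect_left, bisect_right

# `asof_match` is a hypothetical helper that illustrates the matching rule only.


def asof_match(sorted_keys, key, strategy="backward", tolerance=None):
    """Return the key in `sorted_keys` that an asof join would match, or None."""
    if strategy == "backward":
        idx = bisect_right(sorted_keys, key) - 1  # last right key <= left key
    elif strategy == "forward":
        idx = bisect_left(sorted_keys, key)  # first right key >= left key
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if idx < 0 or idx >= len(sorted_keys):
        return None  # no candidate on that side
    match = sorted_keys[idx]
    # Assumption: a match exactly `tolerance` away is still accepted.
    if tolerance is not None and abs(match - key) > tolerance:
        return None
    return match


right_keys = [1, 10, 25, 50]  # must be sorted
left_keys = [7, 16, 24, 49]

backward = [asof_match(right_keys, k) for k in left_keys]
forward = [asof_match(right_keys, k, strategy="forward") for k in left_keys]
within_6 = [asof_match(right_keys, k, tolerance=6) for k in left_keys]

print(backward)  # [1, 10, 10, 25]
print(forward)   # [10, 25, 25, 50]
print(within_6)  # [1, 10, None, None]
```

The binary search is also why both sides must be sorted by the join key: on unsorted keys, "last key less than or equal" has no well-defined position.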

### Examples

This was a lot of theory, now let's dive into an example. 

First, we will create two RemoteLazyFrames. They both have a `distance` column that they can join on. The first dataframe has a `name` column, with the runners' names, and a `distance` column, with how far they have run over a week. The second dataframe has set "levels" associated with having run a certain distance over a week: e.g. running 50+ km is classed as 'Pro' level.

In [31]:
d1 = pl.DataFrame(
    {
        "distance": [7, 16, 24, 49],
        "name": ["Laura", "Charles", "Kwabena", "Shannon"],
    }
)
rd1 = client.polars.send_df(d1, policy=policy)
d2 = pl.DataFrame(
    {
        "distance": [1, 10, 25, 50],
        "level unlocked": ["Amateur", "Intermediate", "Excellent", "Pro"],
    }
)
rd2 = client.polars.send_df(d2, policy=policy)
print(rd1.collect().fetch())
rd2.collect().fetch()

shape: (4, 2)
┌──────────┬─────────┐
│ distance ┆ name    │
│ ---      ┆ ---     │
│ i64      ┆ str     │
╞══════════╪═════════╡
│ 7        ┆ Laura   │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 16       ┆ Charles │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 24       ┆ Kwabena │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 49       ┆ Shannon │
└──────────┴─────────┘


distance,level unlocked
i64,str
1,"""Amateur"""
10,"""Intermediate"""
25,"""Excellent"""
50,"""Pro"""


We can use the `join_asof` function to join the two RemoteLazyFrames, giving each runner an associated 'level unlocked' value based on how far they ran.

In [33]:
joined = rd1.join_asof(rd2, on="distance")
joined.collect().fetch()

distance,name,level unlocked
i64,str,str
7,"""Laura""","""Amateur"""
16,"""Charles""","""Intermediate"""
24,"""Kwabena""","""Intermediate"""
49,"""Shannon""","""Excellent"""


For further examples of `join_asof`, check out the examples in the [Polars User Guide](https://pola-rs.github.io/polars-book/user-guide/howcani/combining_data/joining.html?highlight=join_asof#asof-join)!

With this, you now know multiple ways of **how to combine datasets using BastionLab**. We can close the connection and stop the server.

In [None]:
connection.close()