# Combining datasets
______________________________________________________

Combining datasets allows us to train models or run queries on multiple datasets from different parties, which can lead to much more powerful results. 

Let's take an example: say 100 different hospitals around the world want to take part in a project to train a machine learning model to determine if a patient has Covid-19 based on a chest X-ray scan. The model will almost certainly be much more accurate, and relevant to a more varied range of patients, if it is trained on a combined dataset from all 100 hospitals rather than on data from any one of these hospitals.

In this tutorial, we are going to explore how we can combine remote datasets securely in BastionLab. The vital advantage of doing this with our remote privacy features is that a data scientist can combine all of these datasets **without having direct access** to any one of them, enabling a level of collaboration which may previously have been deemed too risky in terms of data privacy.

So let's take a look at the steps required to combine datasets.

## Pre-requisites
___________________________________

### Technical requirements

To follow along with this tutorial, ensure the following are already installed on your system:
- Python 3.7 or greater (get the latest version of Python at https://www.python.org/downloads/ or with your operating system’s package manager)
- [pip](https://pypi.org/project/pip/), the Python package manager

### Pip packages

In order to run this notebook, you will also need to install the Polars, BastionLab and bastionlab_server packages by running the following code block.

In [None]:
!pip install bastionlab polars bastionlab_server

>*Note that the bastionlab_server package we install here was created for testing purposes. You can alternatively install the BastionLab server using our Docker image or from source. Check out our [Installation Tutorial](../getting-started/installation.md) for more details.*

### Launching the server

The next step is to launch the server. The server exposes port `50056` for gRPC communication with clients and uses a default configuration (no authentication, default settings). For the purpose of this tutorial, these settings are sufficient and we won't change them. To launch the server, we simply use bastionlab_server's start method.

In [30]:
import bastionlab_server

srv = bastionlab_server.start()

### Uploading our data frames to the server

We will connect to the server via the `Connection()` constructor.

In [31]:
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

Next, we will create three short Polars dataframes which we will then use to demonstrate how to combine datasets. Our three dataframes have an `"Element"` column containing elements and a `"Melting Point (K)"` column with their corresponding melting points.

In [32]:
import polars as pl

df1 = pl.DataFrame(
    {
        "Element": ["Copper", "Silver", "Silver", "Gold"],
        "Melting Point (K)": [1357.77, 1234.93, 1234.93, 1337.33],
    }
)

df2 = pl.DataFrame(
    {"Element": ["Platinum", "Palladium"], "Melting Point (K)": [2041.4, 1828.05]}
)

df3 = pl.DataFrame({"Element": ["Titanium"], "Melting Point (K)": [1945.0]})


We now need to send these Polars dataframes to the server to get back our `RemoteLazyFrame` instances. In a real-life scenario, it may well be the data owner who sends over the dataframes and a data scientist who connects with authentication and retrieves these RemoteLazyFrame instances.

We would normally use the default policy or a safe customized policy to ensure that anyone who works with our datasets from now on cannot run queries that compromise the privacy of our data. However, in this tutorial we are going to specify a policy which allows us full access to the data. We do this so that we can easily print out the datasets in full for illustrative purposes, without needing to request access. To do this, we set the `safe_zone` parameter to `TrueRule()`, which treats every request as safe. In this case, the `unsafe_handling` parameter can be anything (as there are no unsafe requests); we set it to `Log()` in the following example.

>***Important note - This unsafe policy is not suited for production.***

We now send our three dataframes to the server, using our custom policy, and get back three RemoteLazyFrame instances.

In [33]:
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log())

rdf1 = client.polars.send_df(df1, policy=policy)
rdf2 = client.polars.send_df(df2, policy=policy)
rdf3 = client.polars.send_df(df3, policy=policy)

## Appending datasets
__________________________________________________________

We can now move on to exploring how to append RemoteLazyFrames using `vstack`. `vstack()` can be used to **append any RemoteLazyFrame to another where the column names and types match**. 

>*You can learn about dataset preparation, including changing column types, names and adding and removing columns, in our [Data cleaning tutorial](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/data_cleaning/)*

We call the `vstack()` method on the first RemoteLazyFrame and then give the RemoteLazyFrame we want to append to it as an argument. `vstack` returns the resulting combined RemoteLazyFrame.

Here, for example, `rdf2`, containing Platinum and Palladium, is appended to `rdf1`, containing Copper, Silver and Gold. We reassign `rdf1` to the combined RemoteLazyFrame returned by `vstack`, so when we call `.collect().fetch()` on `rdf1`, we see the resulting combined dataset.

In [34]:
rdf1 = rdf1.vstack(rdf2)
rdf1.collect().fetch()

Element,Melting Point (K)
str,f64
"""Copper""",1357.77
"""Silver""",1234.93
"""Silver""",1234.93
"""Gold""",1337.33
"""Platinum""",2041.4
"""Palladium""",1828.05


You can do this as many times as you want. For example, let's add our third RemoteLazyFrame, containing Titanium, to our first RemoteLazyFrame, twice.

Once the following cell has run, `rdf1` has all the previous elements, plus two Titanium rows at the end.

In [35]:
rdf1 = rdf1.vstack(rdf3)
rdf1 = rdf1.vstack(rdf3)
rdf1.collect().fetch()

Element,Melting Point (K)
str,f64
"""Copper""",1357.77
"""Silver""",1234.93
"""Silver""",1234.93
"""Gold""",1337.33
"""Platinum""",2041.4
"""Palladium""",1828.05
"""Titanium""",1945.0
"""Titanium""",1945.0


## Joining datasets
______________________________________

We also implement `join` and `join_asof` to enable users to combine rows from two or more tables, based on a related column between the tables.

### `join()`

Let's start by examining the `join` function further. The `join` function makes use of Polars' own `join` function for LazyFrames and allows joins on RemoteLazyFrames on the same server. 

The arguments accepted by `join` are the same as those accepted by [Polars LazyFrame](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.join.html#polars.LazyFrame.join) `join` method:

- `other (mandatory argument)`: The other RemoteLazyFrame you want to join your current RemoteLazyFrame with.
- `left_on`: The name or list of names of the left join column(s). *(Note: you must specify either both the `left_on` and `right_on` columns or one `on` column.)*
- `right_on`: The name or list of names of the right join column(s).
- `on`: The name or list of names of join columns for both RemoteLazyFrames.
- `how`: The join strategy. *See the 'Strategies' section below for more info on the available options.*
- `suffix`: Suffix to append to columns with a duplicate name.
- `allow_parallel`: Boolean value for allowing the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.
- `force_parallel`: Boolean value for forcing the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.


#### Strategies
The following strategies are supported as values of the `how` keyword. Each one defines which rows and columns end up in the joined table.

- `inner (default)`: Returns records that have matching values in both tables.
- `left`: Returns all records from the left table, and the matched records from the right table.
- `outer`: Returns all records from both tables, matching them up where possible and leaving nulls where there is no match.
- `semi`: Similar to the `inner` strategy but returns results with only the columns from the left-hand RemoteLazyFrame.
- `anti`: The opposite of the `semi` join. It returns only the left-hand rows that **do not match** any right-hand RemoteLazyFrame values (based on the `on` column).
- `cross`: Returns a paired combination of each row of the first table with each row of the second table.

> To learn more about the more common `inner`, `left` and `outer` join methods, check out this [article](https://www.w3schools.com/sql/sql_join.asp) or this [visual helper tool](https://joins.spathon.com/).
>
> To learn more about `cross` joins, check out this [article](https://www.sqlshack.com/sql-cross-join-with-examples/).
>
> To learn more about the `semi` and `anti` join strategies, check out the example in the [Polars User Guide](https://pola-rs.github.io/polars-book/user-guide/howcani/combining_data/joining.html?highlight=join#joins).

#### Examples

We now know about join's arguments and join strategies, so let's take a look at a couple of examples!

Let's start by creating a new RemoteLazyFrame which will contain an `Element` column to join with our previous `rdf1` RemoteLazyFrame on, plus a `Symbol` and a `Number` column. It will include some of the elements already in our `rdf1` RemoteLazyFrame, plus some new elements.

In [36]:
print("rdf1:")
print(rdf1.collect().fetch())

new_df = pl.DataFrame(
    {
        "Element": ["Magnesium", "Silver", "Gold", "Platinum"],
        "Symbol": ["Mg", "Ag", "Au", "Pt"],
        "Number": [12, 47, 79, 78],
    }
)
new_rdf = client.polars.send_df(new_df, policy=policy)
print("new_rdf")
new_rdf.collect().fetch()

rdf1:
shape: (8, 2)
┌───────────┬───────────────────┐
│ Element   ┆ Melting Point (K) │
│ ---       ┆ ---               │
│ str       ┆ f64               │
╞═══════════╪═══════════════════╡
│ Copper    ┆ 1357.77           │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Silver    ┆ 1234.93           │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Silver    ┆ 1234.93           │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Gold      ┆ 1337.33           │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Platinum  ┆ 2041.4            │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Palladium ┆ 1828.05           │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Titanium  ┆ 1945.0            │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Titanium  ┆ 1945.0            │
└───────────┴───────────────────┘
new_rdf


Element,Symbol,Number
str,str,i64
"""Magnesium""","""Mg""",12
"""Silver""","""Ag""",47
"""Gold""","""Au""",79
"""Platinum""","""Pt""",78


For our first example, we'll join the two RemoteLazyFrames by their column in common, `"Element"`, using the `inner` method. This will leave us with a combined table, but will drop any elements which were not found in both tables.

In [37]:
join = rdf1.join(new_rdf, on="Element", how="inner")
join.collect().fetch()

Element,Melting Point (K),Symbol,Number
str,f64,str,i64
"""Silver""",1234.93,"""Ag""",47
"""Silver""",1234.93,"""Ag""",47
"""Gold""",1337.33,"""Au""",79
"""Platinum""",2041.4,"""Pt""",78


For our second example, we will use the `semi` join, which gives us only the rows from the left-hand table (`rdf1`) that **do match** an element in the `other` table (`new_rdf`), keeping only the left-hand columns.

In [38]:
join = rdf1.join(new_rdf, on="Element", how="semi")
join.collect().fetch()

Element,Melting Point (K)
str,f64
"""Silver""",1234.93
"""Silver""",1234.93
"""Gold""",1337.33
"""Platinum""",2041.4


### `join_asof()`

`join_asof()` works similarly to a `left-join`, except that we match **on nearest key rather than equal keys**. For this to work, both RemoteLazyFrames must be sorted by the `join_asof` key. 

As with `join`, `join_asof` makes use of Polars' own `join_asof` function for LazyFrames and allows `join_asof` to be performed on RemoteLazyFrames on the same server. 

The arguments accepted are the same as those accepted by the [Polars LazyFrame](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.join_asof.html#polars.LazyFrame.join_asof) `join_asof` method.

- `other (mandatory argument)`: The other RemoteLazyFrame you want to join your current RemoteLazyFrame with.
- `left_on`: The name(s) of the left join column(s). *(Note: you must specify either the `left_on` and `right_on` columns or one `on` column.)*
- `right_on`: The name(s) of the right join column(s).
- `on`: Name(s) of join columns for both RemoteLazyFrames.
- `by_left`: Join on these columns before doing asof join.
- `by_right`: Join on these columns before doing asof join.
- `by`: Join on these columns before doing asof join.
- `strategy`: Join strategy: `'backward'` or `'forward'`. *See the next section, 'strategy' for more details.*
- `suffix`: Suffix to append to columns with a duplicate name.
- `tolerance`: Numeric tolerance. By setting this the join will only be done if the near keys are within this distance.
- `allow_parallel`: Boolean value for allowing the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.
- `force_parallel`: Boolean value for forcing the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.

#### Strategy

- If you select `backward`, search selects the last row in the right DataFrame whose `on` key is **less than or equal** to the left’s key.
- If you select `forward`, search selects the first row in the right DataFrame whose `on` key is **greater than or equal** to the left’s key.
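
As a mental model for these two strategies, the following plain-Python sketch finds the matching right-hand key for a given left-hand key. The `asof_match` function is hypothetical and is not how Polars or BastionLab implement the join; it also assumes that a gap exactly equal to the tolerance still counts as a match.

```python
from bisect import bisect_left, bisect_right


def asof_match(left_key, right_keys, strategy="backward", tolerance=None):
    """Hypothetical helper: index of the matching key in sorted right_keys, or None."""
    if strategy == "backward":
        # Last right key that is less than or equal to the left key
        i = bisect_right(right_keys, left_key) - 1
    elif strategy == "forward":
        # First right key that is greater than or equal to the left key
        i = bisect_left(right_keys, left_key)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if i < 0 or i >= len(right_keys):
        return None  # no key on that side of the left key
    if tolerance is not None and abs(right_keys[i] - left_key) > tolerance:
        return None  # nearest key is too far away (assumed inclusive bound)
    return i


right_keys = [1, 10, 15]  # must be sorted, like the join_asof key
left_keys = [5, 10, 20]

backward = [asof_match(k, right_keys, "backward") for k in left_keys]
forward = [asof_match(k, right_keys, "forward") for k in left_keys]
backward_tol = [asof_match(k, right_keys, "backward", tolerance=3) for k in left_keys]

print(backward)      # [0, 1, 2]
print(forward)       # [1, 1, None]
print(backward_tol)  # [None, 1, None]
```

The binary search is also the reason both sides need to be sorted by the join key: without that ordering, "last key before" and "first key after" are not well defined.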

#### Examples

That was a lot of theory, so let's dive into an example. 

First, we will create two RemoteLazyFrames. They both have a `time` column that they can join on, but the second dataframe only has times recorded to the nearest hour, whereas the first one has times that are precise to the minute. So we can use the `join_asof` function to merge the two dataframes, based on the closest `time` match.

In [42]:
from datetime import datetime

# Creating and uploading the first data frame
d1 = pl.DataFrame(
    {
        "time": [
            datetime.strptime("09:10", "%H:%M").time(),
            datetime.strptime("10:15", "%H:%M").time(),
            datetime.strptime("11:05", "%H:%M").time(),
            datetime.strptime("15:34", "%H:%M").time(),
        ],
        "Visitor name": ["Charles", "Kwabena", "Laura", "Shannon"],
    }
)
rd1 = client.polars.send_df(d1, policy=policy)

# Creating and uploading the second data frame
d2 = pl.DataFrame(
    {
        "time": [
            datetime.strptime("09:00", "%H:%M").time(),
            datetime.strptime("10:00", "%H:%M").time(),
            datetime.strptime("11:00", "%H:%M").time(),
        ],
        "Verdict": ["OK", "Good", "Average"],
    }
)
rd2 = client.polars.send_df(d2, policy=policy)

# Printing and collecting the data frames
print(rd1.collect().fetch())
rd2.collect().fetch()

shape: (4, 2)
┌──────────┬──────────────┐
│ time     ┆ Visitor name │
│ ---      ┆ ---          │
│ time     ┆ str          │
╞══════════╪══════════════╡
│ 09:10:00 ┆ Charles      │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10:15:00 ┆ Kwabena      │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 11:05:00 ┆ Laura        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15:34:00 ┆ Shannon      │
└──────────┴──────────────┘


time,Verdict
time,str
09:00:00,"""OK"""
10:00:00,"""Good"""
11:00:00,"""Average"""


In [51]:
# We join_asof() both RemoteLazyFrames, matching each visit with the closest earlier `time`
joined2 = rd1.join_asof(
    rd2,
    on="time",
    strategy="backward",
    tolerance=datetime.strptime("01:00", "%H:%M").time(),
)
joined2.collect().fetch()

time,Visitor name,Verdict
time,str,str
09:10:00,"""Charles""","""OK"""
10:15:00,"""Kwabena""","""Good"""
11:05:00,"""Laura""","""Average"""
15:34:00,"""Shannon""",


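Since we set the tolerance to one hour, each row in the left table is only matched with a row in the right table whose `time` is at most one hour earlier. We get the combined results for Charles, Kwabena and Laura, but Shannon's `Verdict` column is null because her 15:34 visit is more than an hour after the latest recorded time, 11:00.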
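
The gaps behind this result can be checked by hand. The sketch below recomputes them with the standard library only, using the times from the example above; it illustrates the tolerance check and does not call BastionLab or Polars.

```python
from datetime import datetime, timedelta


def to_dt(hhmm: str) -> datetime:
    """Parse an HH:MM string onto a common placeholder date so times can be subtracted."""
    return datetime.strptime(hhmm, "%H:%M")


visits = {"Charles": "09:10", "Kwabena": "10:15", "Laura": "11:05", "Shannon": "15:34"}
recorded = ["09:00", "10:00", "11:00"]
tolerance = timedelta(hours=1)

within_tolerance = {}
for name, visit in visits.items():
    # Backward strategy: the latest recorded time that is not after the visit
    candidates = [r for r in recorded if to_dt(r) <= to_dt(visit)]
    gap = to_dt(visit) - to_dt(max(candidates, key=to_dt))
    within_tolerance[name] = gap <= tolerance
    print(name, gap, within_tolerance[name])
```

Shannon's gap is 4 hours and 34 minutes, well beyond the one-hour tolerance, which is why no `Verdict` is joined to her row.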

With this, you now know **how to combine datasets using BastionLab**. We can close the connection.

In [None]:
connection.close()