# Vertical Federated Learning

The easiest way to get started is to run the example in a hosted Python environment on Google Colab. To open the example on Google Colab, click the "Open in Colab" button below. If you choose Google Colab, you will need to set some variables at the beginning of the notebook. The easiest way to do this is to click the copy button below, which copies all variables so that you can paste them at the same position in the notebook on Google Colab.


In [None]:
PYTHON_VERSION = "{PYTHON_VERSION}"  # noqa: F821
ARTIFACT_USER = "{ARTIFACT_USER}"  # noqa: F821
ARTIFACT_KEY = "{ARTIFACT_KEY}"  # noqa: F821
PYPI_REGISTRY = "{PYPI_REGISTRY}"  # noqa: F821
ORGANIZATION_ID = "{ORGANIZATION_ID}"  # noqa: F821
SERVER_ADDRESS = "{SERVER_ADDRESS}"  # noqa: F821
TOKEN_URL = "{TOKEN_URL}"  # noqa: F821

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/katulu-io/examples/blob/{PLATFORM_VERSION}/examples/workbook_vfl.ipynb)

This example demonstrates how to use Vertical Federated Learning. The demo covers:
* A short introduction to the concept of Vertical Federated Learning
* A demonstration of how to install and onboard an agent
* How to access the data and prepare it for training
* The training of a model using federated learning
* How to evaluate the model


## Introduction to Vertical Federated Learning

Vertical Federated Learning is a machine learning technique that lets multiple parties train a model together when each party holds different features of the same data records. The parties can train the model without sharing their data with each other. For example, if we want to train a model along the value chain, we will usually need data from different companies. In this case, each company trains a part of the model using its own data, and the parts are virtually combined into a global model. This is done by cutting a global model into two parts, one for each company, so that only the output of one part is communicated to the other.
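
The split can be illustrated with a minimal sketch in plain Python. The passive party turns its private features into an intermediate vector, and only that vector is sent to the active party, which finishes the prediction. The function names, weights, and layer sizes are illustrative assumptions and not part of the Katulu SDK.

```python
import math

# Hypothetical fixed weights; in practice both parts are trained jointly.
PASSIVE_WEIGHTS = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # 2 outputs x 3 features
ACTIVE_WEIGHTS = [1.0, -1.0]  # 1 output x 2 inputs


def passive_forward(features):
    """Runs at the passive party: private features -> intermediate output."""
    hidden = [sum(w * x for w, x in zip(row, features)) for row in PASSIVE_WEIGHTS]
    return [max(0.0, h) for h in hidden]  # ReLU


def active_forward(embedding):
    """Runs at the active party: intermediate output -> probability."""
    logit = sum(w * e for w, e in zip(ACTIVE_WEIGHTS, embedding))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


private_features = [1.0, 2.0, 3.0]  # never leaves the passive party
embedding = passive_forward(private_features)  # the only values that are sent
probability = active_forward(embedding)
print(embedding, probability)
```

The active party never sees `private_features`, only `embedding`. During training, gradients with respect to `embedding` would flow back the same way.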

One example of such a scenario is predicting the quality of a product based on the raw materials used in its production. In this case, the raw material supplier has data on the quality of the raw materials and the manufacturer has data on the quality of the final product. Each dataset on its own offers limited value for improving the process. However, if the producer can use the data from the supplier without ever seeing it, she can produce higher quality at lower cost, and both can participate in the value created.

## Installing the agent

For this demo we will install the agents locally and use an already prepared dataset.

We will run the agents directly in Python. Please make sure that you have Python 3.10.14 or later installed. First, we will create a virtual environment called `platform_demo` and activate it. If you don't want to use a virtual environment, you can skip this step.
```bash
python -m venv platform_demo
source platform_demo/bin/activate
```

Next, we will install the required packages. 
```bash
pip install katulu-agent=={PYTHON_VERSION} -U --extra-index-url https://{ARTIFACT_USER}:{ARTIFACT_KEY}@{PYPI_REGISTRY}
```

More details on agent installation and different options can be found in the [agent documentation](/docs/agent/installation).

### Download the datasets 

After installing the agent we will need to download the datasets. This can be done with the `gcloud` CLI.
If you don't have the `gcloud` CLI installed, you can install it by following the instructions [here](https://cloud.google.com/sdk/docs/install).

Next, we will download the dataset and store it in a folder called `data`.

```bash
mkdir data
echo {ARTIFACT_KEY} \
| base64 --decode | gcloud auth activate-service-account --key-file=-
gcloud storage cp gs://demo-agent-data-files/* data/
```

### Configure and start the agents

Now we will create two agents on the Platform. To do this, we need to open the [agents page](/{ORGANIZATION_ID}/agents) and click on the `Create Agent` button. This will open a dialog where we need to set a name for the agent and can define some labels to better identify the agent, e.g., location, hardware, etc. We will call the agents `agent_1` and `agent_2`. After entering a name and optionally some labels, we will click `Create Agent`. This will open a new page where we can download the first part of the configuration file by clicking the `Download (agent_1.yml)` (respectively `agent_2.yml`) button in the `Configuration File(...)` section.

The configuration file should be placed in the current directory. Next, we need to append the data-specific configuration to it.

For the first agent, we will append the following information to the configuration file:
```yaml
datasets:
  - name: chemical_passive
    type:
      parquet:
        file: .//data/ungrouped_passive.parquet
    collaboration:
      - secret:
          password:
            value: "12345"
    privacy_level: 0
```

and for the second agent:
```yaml
datasets:
  - name: chemical_active
    type:
      parquet:
        file: .//data/active.parquet
    collaboration:
      - secret:
          password:
            value: "12345"
    privacy_level: 0
```

This tells the agents to use the data from the `ungrouped_passive.parquet` and `active.parquet` files. The `collaboration` section defines a key that the agents use to encrypt identifiers. The server needs to run a private set intersection protocol to find the common identifiers between the agents. To keep the identifiers private, they are first encrypted on the agents through password-based encryption (PBKDF2 HMAC + PRF blake2b) before they are shared with the server. To ensure that the encrypted IDs are the same, the agents need to use the same pre-shared password.
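
The idea of matching identifiers through a shared password can be sketched with the Python standard library. This is a simplified illustration, not the Platform's actual protocol: the salt, the iteration count, the digest size, and the helper names `derive_key` and `blind_ids` are all assumptions.

```python
import hashlib

SHARED_PASSWORD = b"12345"  # the pre-shared password from the configuration
SALT = b"demo-salt"         # assumption: a salt both agents agree on
ITERATIONS = 100_000        # assumption: illustrative work factor


def derive_key(password: bytes) -> bytes:
    """Derive a 32-byte key from the password with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password, SALT, ITERATIONS, dklen=32)


def blind_ids(ids, key: bytes) -> dict:
    """Map keyed BLAKE2b digest -> original id; only the digests are shared."""
    return {
        hashlib.blake2b(str(i).encode(), key=key, digest_size=32).hexdigest(): i
        for i in ids
    }


key = derive_key(SHARED_PASSWORD)
agent_1 = blind_ids([1, 2, 3, 4], key)
agent_2 = blind_ids([3, 4, 5], key)

# The server only sees the digests and intersects them.
common_digests = set(agent_1) & set(agent_2)

# Each agent maps the common digests back to its own ids locally.
common_ids = sorted(agent_1[d] for d in common_digests)
print(common_ids)
```

If the two agents used different passwords, the derived keys and therefore the digests would differ, and the intersection would come out empty.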

The `privacy_level` is set to `0`, which means that we will not use any additional privacy-preserving techniques on the data. In a real-world scenario, you would want to use a higher privacy level, e.g., `1` or `2`, to ensure that the data cannot be reconstructed from the model.

The full configuration file should look like this:
```yaml
id: {AGENT_ID}
server_url: https://{SERVER_ADDRESS}
credentials:
    token_url: {TOKEN_URL}
    client_id: {CLIENT_ID}
    client_secret: {CLIENT_SECRET}
datasets:
  - name: chemical_{active/passive}
    type:
      parquet:
        file: .//data/{active/ungrouped_passive}.parquet
    collaboration:
      - secret:
          password:
            value: "12345"
    privacy_level: 0
```
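
Because YAML is indentation-sensitive, appending the dataset block by hand is easy to get wrong. As a sketch, a small helper could generate the `datasets` section and append it to the downloaded file. The helper `append_dataset` is hypothetical and not part of the agent tooling, and the nesting of the keys is an assumption about the intended structure.

```python
import tempfile
from pathlib import Path

# Assumed layout of the datasets section; keys are taken from the snippets above.
DATASET_TEMPLATE = """\
datasets:
  - name: {name}
    type:
      parquet:
        file: {file}
    collaboration:
      - secret:
          password:
            value: "{password}"
    privacy_level: {privacy_level}
"""


def append_dataset(config_path, name, file, password, privacy_level=0):
    """Append a datasets section to an existing agent configuration file."""
    path = Path(config_path)
    existing = path.read_text() if path.exists() else ""
    if "datasets:" in existing:
        raise ValueError(f"{config_path} already contains a datasets section")
    if existing and not existing.endswith("\n"):
        existing += "\n"
    block = DATASET_TEMPLATE.format(
        name=name, file=file, password=password, privacy_level=privacy_level
    )
    path.write_text(existing + block)
    return block


# Demonstration on a throwaway file instead of a real downloaded configuration.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "agent_1.yml"
    cfg.write_text("id: example-agent-id\n")
    append_dataset(cfg, "chemical_passive", ".//data/ungrouped_passive.parquet", "12345")
    result = cfg.read_text()
print(result)
```

The helper refuses to run twice on the same file, so a configuration cannot end up with two `datasets` sections.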


More information on the configuration file can be found in the [agent documentation](/docs/agent/configuration).

Now we are all set to start the agents. We can do this by opening two terminals and running the following commands, one in each terminal:
```bash
source platform_demo/bin/activate
katulu-agent agent_1.yml
```
and
```bash
source platform_demo/bin/activate
katulu-agent agent_2.yml
``` 

The agents have started successfully when you see three lines in the terminal starting with:
```
Starting agent
Retrieving server version
Schemas registered with server 
```

If you are not seeing these lines, please check the configuration file and the [troubleshooting guide](/docs/agent/troubleshooting).
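
If the agent output is captured to a file, the check can be automated. The following sketch assumes that the log lines begin with exactly the three prefixes listed above; the helper name `missing_startup_lines` and the sample log are hypothetical.

```python
EXPECTED_PREFIXES = (
    "Starting agent",
    "Retrieving server version",
    "Schemas registered with server",
)


def missing_startup_lines(log_text: str) -> list:
    """Return the expected prefixes that no line of the log starts with."""
    lines = [line.strip() for line in log_text.splitlines()]
    return [
        prefix
        for prefix in EXPECTED_PREFIXES
        if not any(line.startswith(prefix) for line in lines)
    ]


# Hypothetical log of an agent that has not registered its schemas yet.
sample_log = """Starting agent
Retrieving server version
"""
print(missing_startup_lines(sample_log))
```

An empty list means that all three startup lines were found.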

Now we can move to the next step and explore the data and prepare it for training.

## Exploring the SDK

To interact with the Platform we will use the Katulu SDK. The SDK is a Python package that provides an easy way to interact with the Platform. The SDK can be installed by running the following command:

In [None]:
!pip install katulu-sdk=={PYTHON_VERSION} -U --extra-index-url https://download.pytorch.org/whl/cpu --extra-index-url https://{ARTIFACT_USER}:{ARTIFACT_KEY}@{PYPI_REGISTRY}

You can write the SDK code in a Python script and call it with `python script.py`, or you can use a Jupyter notebook. For this demo we will use a Jupyter notebook. You can use a hosted Jupyter service such as Google Colab or Kaggle Notebooks. We will use a local installation in this demo. To start a Jupyter notebook, run the following commands:
```bash
pip install jupyter
jupyter notebook
```

This installs and starts a Jupyter notebook server. You can now open a browser and navigate to `http://localhost:8888` to open the Jupyter notebook interface. There you can create a new notebook and start writing code.

### Notebook setup

First we will import the necessary libraries and set up the environment.

In [2]:
# ruff: noqa: PLE1142
import torch

from katulu.sdk import (
    Adam,
    BinaryAccuracy,
    BinaryCrossEntropyWLogitsLoss,
    JobSpecConfig,
    VerticalJobSpec,
    build_vfl_job_spec,
    connect,
    model_from_torch,
)
from katulu.sdk.pipeline import (
    Alias,
    Avg,
    Cast,
    CastType,
    Features,
    GroupBy,
    Max,
    MinMaxScaler,
    PSIJoinOn,
    Select,
    Source,
    Targets,
)

### Project setup

Now we will need to create a new project on the Platform. To do this, we need to open the [projects page](/{ORGANIZATION_ID}/projects) and click on the `Create Project` button. This will open a dialog where we need to set a name for the project. We will call the project `chemical_quality_prediction`. After entering the name, we will click `Create Project`. This will open a new page that shows the `PROJECT_ID` at the top, next to the project name. We will need this `PROJECT_ID` to connect to the project in the SDK. We can insert it into the code below.

In [1]:
PROJECT_ID = "{PROJECT_ID}"  # noqa: F821
CLIENT_ID = "{CLIENT_ID}"  # noqa: F821
CLIENT_SECRET = "{CLIENT_SECRET}"  # noqa: F821

### Client setup

To get the `CLIENT_ID` and `CLIENT_SECRET`, we will create a new Access Credential associated with our profile. Open the [Access Credentials page](/profile/access-credentials) and click on the `Create Access Credential` button. This will open a dialog where we need to set a name for the credential. We will call the credential `chemical_quality_credentials`. After entering the name, we will click `Create Access Credential`. This will open a new page that shows all the information we need to fill in the missing pieces in the code above.

This is all the information required to connect to the Platform. Now we can start exploring the data and prepare it for training.

We will define the dataset names that we want to retrieve from the Platform. The name should be the same as the name in the configuration file. In this case, we will use `chemical_active` and `chemical_passive` and create an empty dictionary to store the data.

In [4]:
DATASET_NAMES = ["chemical_active", "chemical_passive"]
datasets = {}

Now we can connect to the Platform and retrieve the data and print their schemas.

In [5]:
datasets = {}
# Connect to the Platform to get the dataset
async with connect(
    project_id=PROJECT_ID,
    organization_id=ORGANIZATION_ID,
    server_address=SERVER_ADDRESS,
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    token_url=TOKEN_URL,
) as session:
    for dataset_name in DATASET_NAMES:
        ds = (await session.find_sources(name=dataset_name))[0]
        datasets[dataset_name] = ds
        print(f"Found dataset: {ds.id}")
        print(ds.schema)

2024-07-15 08:47:03 [debug    ] Issuing new access token       source=katulu.core.auth.jwt
Found dataset: d84d1772da1373ad9d919a18762ef37d08f014d89502d2d343aeb1edf35b8abd
╒═════════╤════════╕
│ Field   │ Type   │
╞═════════╪════════╡
│ quality │ Int64  │
├─────────┼────────┤
│ id      │ Int64  │
╘═════════╧════════╛
Found dataset: 69d9977aee4b2f18687aca9f0bc6f2c3b19ffc91523bf480c26cd797c7d793ef
╒══════════════════════╤═════════╕
│ Field                │ Type    │
╞══════════════════════╪═════════╡
│ fixed acidity        │ Float64 │
├──────────────────────┼─────────┤
│ volatile acidity     │ Float64 │
├──────────────────────┼─────────┤
│ citric acid          │ Float64 │
├──────────────────────┼─────────┤
│ residual sugar       │ Float64 │
├──────────────────────┼─────────┤
│ chlorides            │ Float64 │
├──────────────────────┼─────────┤
│ free sulfur dioxide  │ Float64 │
├──────────────────────┼─────────┤
│ total sulfur dioxide │ Float

The schema contains the available columns and their types. We can use this information to prepare the data for training.

To do this we define which columns we want to use for training and which column we want to predict. In this case, we will use all columns except the `quality` column to train the model and we will use the `quality` column to predict the quality of the chemical solution. The features will be put into a list called `fields` and the target column will be stored in a list called `target`. Additionally, we will need to define a `psi_column` which is used to match the data between the agents. In this case, we will use the `id` column.

In [6]:
fields = [
    "fixed acidity",
    "volatile acidity",
    "citric acid",
    "residual sugar",
    "chlorides",
    "free sulfur dioxide",
    "total sulfur dioxide",
    "density",
    "pH",
    "sulphates",
    "alcohol",
]

target = ["quality"]
psi_column = ["id"]
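
Instead of typing the feature list by hand, it could also be derived from a schema by excluding the target and the join column. The sketch below uses a plain dictionary as a stand-in for the schema. The dictionary, the subset of fields in it, the type of `id`, and the helper `feature_columns` are illustrative assumptions and do not reflect the SDK's schema object.

```python
# Hypothetical plain-dict view of part of the passive schema (field -> type).
schema = {
    "fixed acidity": "Float64",
    "volatile acidity": "Float64",
    "citric acid": "Float64",
    "id": "Int64",  # assumption: the join column is an integer
}


def feature_columns(schema: dict, exclude: set) -> list:
    """Return all schema fields, in order, that are not explicitly excluded."""
    return [name for name in schema if name not in exclude]


derived_fields = feature_columns(schema, exclude={"quality", "id"})
print(derived_fields)
```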

The pipeline defines the data transformations that need to be applied to the data. In this case, we know from production that multiple measurements are taken per batch. Therefore, we will first group the data by the batch id and aggregate the values with the `MAX` function. This will give us a single row per batch with the maximum value of each feature.

In the vertical federated learning setting, we need to define two pipelines. One for the passive agent and one for the active agent. The active agent will have the `quality` column and the `id` column and provides the targets for the training, while the passive agent will have all the other columns and provides the features.

The active pipeline defines the `PSIJoinOn` column and the `Targets`. The `Targets` are the `quality` values cast to `Float32`.

The passive pipeline defines the `PSIJoinOn` column and the `Features`. The `Features` are all columns except the `quality` column, cast to `Float32` and normalized. Because multiple measurements are taken per batch, the passive pipeline first groups the data by the batch id and aggregates the values with the `MAX` function, which gives a single row per batch with the maximum value of each feature.
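
What the grouping and scaling steps do to the data can be sketched in plain Python. The measurements and the helpers `group_by` and `min_max_scale` are made up for illustration. The SDK operators run on the agents and may behave differently in detail.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical measurements: (batch id, pH); several readings per batch.
rows = [(1, 3.0), (1, 3.4), (1, 9.9), (2, 3.1), (2, 3.3)]


def group_by(rows, aggregate):
    """Collapse all readings of a batch into one value using `aggregate`."""
    groups = defaultdict(list)
    for batch_id, value in rows:
        groups[batch_id].append(value)
    return {batch_id: aggregate(values) for batch_id, values in groups.items()}


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


by_max = group_by(rows, max)    # one row per batch, largest reading
by_mean = group_by(rows, mean)  # one row per batch, average reading
print(by_max)
print(by_mean)
print(min_max_scale(list(by_max.values())))
```

In this toy data, the maximum for batch 1 is determined entirely by the single reading of 9.9, while the mean is pulled towards the other two readings.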

In [7]:
# Define the pipeline
# Define an active and a passive pipeline
# fmt: off
active_pipeline = (
    Source(datasets["chemical_active"])
    + Targets(Select(target) | Cast(target, CastType.Float32))
    + PSIJoinOn(Select(psi_column))
)

passive_pipeline = (
    Source(datasets["chemical_passive"])
    | GroupBy(["id"], [Alias(Max(f), f) for f in fields])
    + PSIJoinOn(Select(psi_column))
    + Features(Select(fields) + Cast(fields, CastType.Float32) + MinMaxScaler(fields))
)
# fmt: on

Next, we will define two simple models that we will train on the data. 

The passive model accepts the features as its input. It is a simple feed-forward neural network with two hidden layers and no output layer. The output of the passive model is the input to the active model, and the active model predicts the quality of the chemical solution.

In [8]:
# Define the models
PASSIVE_MODEL = torch.nn.Sequential(
    torch.nn.BatchNorm1d(len(fields)),
    torch.nn.Linear(len(fields), 20),
    torch.nn.Linear(20, 20),
    torch.nn.ReLU(),
)

ACTIVE_MODEL = torch.nn.Sequential(
    torch.nn.Linear(20, 20),
    torch.nn.ReLU(),
    torch.nn.LayerNorm(20),
    torch.nn.Linear(20, 20),
    torch.nn.ReLU(),
    torch.nn.Linear(20, 1),
)

Finally, we put all the pieces together and define a training `Job`. It contains the pipelines, the two models, and a configuration for the training.

We will train the model for 5 rounds with a batch size of 256. For the optimizer, we will use `Adam` with a learning rate of `0.001`. We will use `BinaryCrossEntropyWLogitsLoss` as the loss function, because we want to predict whether the quality of the chemical solution meets a binary quality criterion.

As a metric to track during training we use `BinaryAccuracy`. This metric tells us what fraction of the predictions are correct.
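
For intuition, the loss and the metric can be written out in plain Python using the textbook formulas. This is a sketch and not the SDK's implementation; in particular, applying the accuracy threshold to the raw logit is an assumption.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy computed on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def binary_accuracy(logits, targets, threshold=0.0):
    """Fraction of predictions (logit > threshold) that match the 0/1 targets."""
    hits = sum(
        (logit > threshold) == (target == 1.0)
        for logit, target in zip(logits, targets)
    )
    return hits / len(targets)


# Made-up model outputs and labels.
logits = [2.0, -1.0, 0.5, -3.0]
targets = [1.0, 0.0, 0.0, 0.0]

loss = sum(bce_with_logits(x, y) for x, y in zip(logits, targets)) / len(targets)
accuracy = binary_accuracy(logits, targets)
print(loss, accuracy)
```

A logit of 0 corresponds to a probability of 0.5 under the sigmoid, so a threshold of `0.0` on logits plays the same role as a threshold of `0.5` on probabilities.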

In [9]:
# Create the job spec
job_spec = build_vfl_job_spec(
    name="chemical_vfl",
    passive=VerticalJobSpec(
        pipeline=passive_pipeline,
        model=model_from_torch(PASSIVE_MODEL),
    ),
    active=VerticalJobSpec(
        pipeline=active_pipeline,
        model=model_from_torch(ACTIVE_MODEL),
    ),
    config=JobSpecConfig(
        num_rounds=5,
        batch_size=256,
        optimizer=Adam(learning_rate=1e-3),
        loss_function=BinaryCrossEntropyWLogitsLoss(),
        metrics=[BinaryAccuracy()],
    ),
)

Finally, we are all set and can start the training job. We will print the training progress and also have the chance to track the training metrics on the platform by following the link in the output.

In [10]:
# Start the training
async with connect(
    project_id=PROJECT_ID,
    organization_id=ORGANIZATION_ID,
    server_address=SERVER_ADDRESS,
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    token_url=TOKEN_URL,
) as session:
    job_id = await session.fit(job_spec)  # noqa: PLE1142
    print(f"Job ID: {job_id}")
    print(f"/{ORGANIZATION_ID}/{PROJECT_ID}/jobs")
    round_number = 1  # avoid shadowing the built-in round()
    async for metrics in session.metrics(job_id):
        print("Round:", round_number, {m.name: metrics.items[f"val_{m.name}"] for m in job_spec.config.metrics})
        round_number += 1

2024-07-15 08:47:16 [debug    ] Issuing new access token       source=katulu.core.auth.jwt
Job ID: a012baf0-49eb-4d67-b540-b5a29afd6c58
https://https://platform.katulu.io/ca2bf2b8-3cd1-427b-8b39-c1b09c72c0ed/bcd538b2-422f-400e-a9fd-18aa54f92ec5/jobs
Round: 1 {'accuracy': 0.6272125593401641}
Round: 2 {'accuracy': 0.5276281356591342}
Round: 3 {'accuracy': 0.5831922422523992}
Round: 4 {'accuracy': 0.6041249804484288}
Round: 5 {'accuracy': 0.621517623206625}


The training shows that the model did not learn much from the data: the accuracy fluctuates around 0.6 and does not improve over the rounds. We possibly made a mistake in the data preparation. Therefore, we will go back and check the pipeline. We will change the aggregation function from `MAX` to `MEAN` (`Avg` in the SDK).

In [11]:
# Use MEAN aggregation
# fmt: off
active_pipeline = (
    Source(datasets["chemical_active"])
    + Targets(Select(target) | Cast(target, CastType.Float32))
    + PSIJoinOn(Select(["id"]))
)

passive_pipeline = (
    Source(datasets["chemical_passive"])
    | GroupBy(["id"], [Alias(Avg(f), f) for f in fields])
    + PSIJoinOn(Select(psi_column))
    + Features(Select(fields) + Cast(fields, CastType.Float32) + MinMaxScaler(fields))
)
# fmt: on

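We create a new Job with the updated pipeline and start the training. This time the model should be able to learn from the data and we should see the accuracy increase over time. Note that the job spec below also passes `threshold=0.0` to `BinaryAccuracy`, which differs from the first run; for a model that outputs raw logits, a logit of 0 corresponds to a probability of 0.5.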

In [12]:
# Create the job spec
job_spec = build_vfl_job_spec(
    name="chemical_vfl",
    passive=VerticalJobSpec(
        pipeline=passive_pipeline,
        model=model_from_torch(PASSIVE_MODEL),
    ),
    active=VerticalJobSpec(
        pipeline=active_pipeline,
        model=model_from_torch(ACTIVE_MODEL),
    ),
    config=JobSpecConfig(
        num_rounds=5,
        batch_size=256,
        optimizer=Adam(learning_rate=1e-3),
        loss_function=BinaryCrossEntropyWLogitsLoss(),
        metrics=[BinaryAccuracy(threshold=0.0)],
    ),
)

In [13]:
# Start the training
async with connect(
    project_id=PROJECT_ID,
    organization_id=ORGANIZATION_ID,
    server_address=SERVER_ADDRESS,
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    token_url=TOKEN_URL,
) as session:
    job_id = await session.fit(job_spec)  # noqa: PLE1142
    print(f"Job ID: {job_id}")
    print(f"/{ORGANIZATION_ID}/{PROJECT_ID}/jobs")
    round_number = 1  # avoid shadowing the built-in round()
    async for metrics in session.metrics(job_id):
        print("Round:", round_number, {m.name: metrics.items[f"val_{m.name}"] for m in job_spec.config.metrics})
        round_number += 1

2024-07-15 08:50:32 [debug    ] Issuing new access token       source=katulu.core.auth.jwt
Job ID: 04a2dcd4-f555-4404-97de-2fd608afead6
https://https://platform.katulu.io/ca2bf2b8-3cd1-427b-8b39-c1b09c72c0ed/bcd538b2-422f-400e-a9fd-18aa54f92ec5/jobs
Round: 1 {'accuracy': 0.6647683544050313}
Round: 2 {'accuracy': 0.6941665382096817}
Round: 3 {'accuracy': 0.7163306139919857}
Round: 4 {'accuracy': 0.7249499768023309}
Round: 5 {'accuracy': 0.7249499768023309}


We see that the accuracy increased by more than 10 percentage points compared to the first run, and the model now predicts the quality of the chemical solution with an accuracy of around 72%.
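
The comparison can be reproduced from the final-round values printed in the two training logs above. The variable names are illustrative.

```python
# Final-round accuracies copied from the two training logs.
final_accuracy_max = 0.621517623206625    # first run, MAX aggregation
final_accuracy_mean = 0.7249499768023309  # second run, MEAN aggregation

# Absolute gain in percentage points and gain relative to the first run.
gain_points = (final_accuracy_mean - final_accuracy_max) * 100
relative_gain = (final_accuracy_mean / final_accuracy_max - 1) * 100

print(f"{gain_points:.1f} percentage points, {relative_gain:.1f}% relative")
```

The absolute gain in percentage points and the relative gain are different quantities, so it helps to state which one is meant when reporting an improvement.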

In this demo we have seen how to install and onboard agents, how to access the data and prepare it for training, how to train a model using federated learning, and how to evaluate the model. Now it is time to explore the platform and try it out yourself.