# integrate.ai API Sample Notebook for VFL-GLM

This notebook demonstrates how to use an AWS task runner to run a PRL session that determines the overlap between two datasets, followed by a VFL-GLM training and prediction session (logistic), and a second VFL-GLM training example that uses Tweedie regression.

For details about required setup and configuration for task runners, see [Using integrate.ai](https://documentation.integrateai.net/#using-integrate-ai).

## Setup
### Set environment variables (or replace inline) with your IAI credentials
Generate and manage this token on the Tokens page in the UI.

In [None]:
from integrate_ai_sdk.api import connect
import os
import json
import pandas as pd

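IAI_TOKEN = os.environ.get("IAI_TOKEN", "")  # Reads the IAI_TOKEN environment variable, or replace "" with your token inline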
client = connect(token=IAI_TOKEN)

## Prerequisites

### Download the sample data

You can download sample data from the integrate.ai sample bucket:

For PRL and VFL: [https://s3.ca-central-1.amazonaws.com/public.s3.integrate.ai/integrate_ai_examples/vfl.zip](https://s3.ca-central-1.amazonaws.com/public.s3.integrate.ai/integrate_ai_examples/vfl.zip)

### Create a task runner in your workspace

For instructions on how to create an AWS task runner, [see the documentation](https://documentation.integrateai.net/#create-an-aws-task-runner).

### Upload the sample data to the S3 bucket created for your task runner

**Important: By default the task runner expects your data to be in the bucket that was created when the task runner was provisioned.**

The bucket name takes the form: `s3://{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai`

For example: `s3://myworkspace-mytaskrunner.integrate.ai`


In [None]:
aws_taskrunner_profile = "staging" # This is your workspace name
aws_taskrunner_name = "shay911" # Task runner name - must match what was supplied in UI to create task runner

base_aws_bucket = f'{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai'

base_aws_bucket #Prints the base_aws_bucket name for reference

### Register the sample datasets

For instructions on how to register datasets, [see the documentation](https://documentation.integrateai.net/#register-a-dataset-aws).

**In your workspace, register the sample datasets with your task runner with the following names. Replace {base_aws_bucket} with the bucket name for your environment.**

active_train = s3://{base_aws_bucket}/vfl/active_train.parquet

passive_train = s3://{base_aws_bucket}/vfl/passive_train.parquet

active_test = s3://{base_aws_bucket}/vfl/active_test.parquet

passive_test = s3://{base_aws_bucket}/vfl/passive_test.parquet


**Note:** If you use other datasets or change the names, you **must** update the dataset names in the code examples below to run a session successfully.
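
The code cells later in this notebook refer to several variables (`active_train_path`, `passive_train_path`, `active_test_path`, `passive_test_path`, `aws_storage_path`, `vfl_predict_active_storage_path`, and `vfl_predict_passive_storage_path`) that must be defined before those cells run. The following is a minimal sketch of one way to define them. It assumes that the registered dataset names listed above can be passed where a path is expected, and it uses hypothetical S3 locations for the model and prediction output. The helper `example_paths` and the `model` and `vfl_predict` prefixes are illustrative and are not part of the SDK.

```python
def example_paths(base_aws_bucket: str) -> dict:
    """Map the variable names used in later cells to assumed example values."""
    output_root = f"s3://{base_aws_bucket}"
    return {
        # Assumption: registered dataset names are accepted in place of paths.
        "active_train_path": "active_train",
        "passive_train_path": "passive_train",
        "active_test_path": "active_test",
        "passive_test_path": "passive_test",
        # Hypothetical output locations inside the task runner bucket.
        "aws_storage_path": f"{output_root}/model",
        "vfl_predict_active_storage_path": f"{output_root}/vfl_predict/active_predictions.csv",
        "vfl_predict_passive_storage_path": f"{output_root}/vfl_predict/passive_predictions.csv",
    }


# Example bucket name; in the notebook, pass your own base_aws_bucket value instead.
paths = example_paths("myworkspace-mytaskrunner.integrate.ai")

active_train_path = paths["active_train_path"]
passive_train_path = paths["passive_train_path"]
active_test_path = paths["active_test_path"]
passive_test_path = paths["passive_test_path"]
aws_storage_path = paths["aws_storage_path"]
vfl_predict_active_storage_path = paths["vfl_predict_active_storage_path"]
vfl_predict_passive_storage_path = paths["vfl_predict_passive_storage_path"]
```

If you registered the datasets under different names, or want the output stored elsewhere, change the values in the dictionary accordingly.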

### Set up the taskbuilder 


In [None]:
from integrate_ai_sdk.taskgroup.taskbuilder.integrate_ai import IntegrateAiTaskBuilder
from integrate_ai_sdk.taskgroup.base import SessionTaskGroup

iai_tb_aws = IntegrateAiTaskBuilder(client=client,
   task_runner_id=aws_taskrunner_name)


## Create a PRL Session for linking two or more datasets

To create a PRL session, specify a data configuration dictionary (`prl_data_config` below) that lists the client names and the columns to use as identifiers when linking the datasets. The number of expected clients is inferred from the number of entries under `clients` (two in this example). These client names are referenced by the tasks in the PRL session and by any downstream sessions that use the PRL session.

In this session, two clients provide data. They are named `active_client` and `passive_client`, and their datasets are linked on the `id` column.

Detailed information about PRL is available in the [documentation](https://documentation.integrateai.net/#private-record-linkage-prl-sessions).

In [None]:
# Specify PRL dataset configuration 

prl_data_config = {
    "clients": {
        "active_client": {"id_columns": ["id"]},
        "passive_client": {"id_columns": ["id"]},
    }
}
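
The expected number of clients follows directly from this dictionary. As a small sketch, the client names and their count can be read back from the configuration before the session is created. The helper `expected_clients` is hypothetical and is not part of the SDK.

```python
prl_data_config = {
    "clients": {
        "active_client": {"id_columns": ["id"]},
        "passive_client": {"id_columns": ["id"]},
    }
}


def expected_clients(config: dict) -> list:
    """Return the client names, checking that each one lists at least one ID column."""
    clients = config["clients"]
    for name, settings in clients.items():
        if not settings.get("id_columns"):
            raise ValueError(f"{name} has no id_columns")
    return list(clients)


names = expected_clients(prl_data_config)
print(len(names), names)
```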

In [None]:
# Create and start PRL session

prl_session = client.create_prl_session(
    name="Testing notebook - VFL GLM",
    description="I am testing PRL for VFL GLM",
    data_config=prl_data_config,
).start()

prl_session.id #Prints the session ID for reference

In [None]:
# Create a task group with one task for each of the clients joining the session

prl_task_group = (SessionTaskGroup(prl_session)\
    .add_task(iai_tb_aws.prl(train_path=active_train_path, test_path=active_test_path, client_name="active_client"))\
    .add_task(iai_tb_aws.prl(train_path=passive_train_path, test_path=passive_test_path, client_name="passive_client"))
)

prl_task_group_context = prl_task_group.start()

In [None]:
#Check the status of the task group

for i in prl_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

prl_task_group_context.monitor_task_logs()

In [None]:
# Wait for the tasks to complete (success = True)

prl_task_group_context.wait(60*5, 2)

### PRL Session Complete!
Now you can view the overlap statistics for the datasets.

In [None]:
# View PRL session metrics

metrics = prl_session.metrics().as_dict()
metrics

## Create a VFL GLM Training Session using the PRL session

To create a VFL train session, specify the `prl_session_id` indicating the session you just ran to link the datasets together. 

For more information about vertical federated learning with a Generalized Linear Model (GLM) strategy, see VFL GLM Model Training in the documentation.
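
As a conceptual illustration only (this is not the integrate.ai implementation, whose protocol is not described in this notebook), a logistic GLM over vertically partitioned data can be viewed as each client computing a partial linear predictor from its own features, with the sum passed through a sigmoid. The weights, feature values, and function names below are made up for the example.

```python
import math


def partial_linear_predictor(weights, features):
    """Dot product of one client's weights with its own feature values."""
    return sum(w * x for w, x in zip(weights, features))


def sigmoid(z):
    """Logistic function, the inverse link of a logistic GLM."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up values: the active client holds two features, the passive client one.
z_active = partial_linear_predictor([0.5, -0.25], [2.0, 4.0])
z_passive = partial_linear_predictor([1.0], [0.0])
bias = 0.0

# The full linear predictor is the sum of the per-client parts plus the bias.
probability = sigmoid(z_active + z_passive + bias)
print(probability)
```

The point of the sketch is that the linear predictor decomposes by feature partition, so neither client needs the other's raw feature values to contribute its part.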

In [None]:
model_config = {
    "strategy": {"name": "VflGlm", "params": {}},
    "model": {
        "passive_client": {"params": {"input_size": 7, "output_activation": "sigmoid"}},
        "active_client": {"params": {"input_size": 8, "output_activation": "sigmoid"}},
    },
    "ml_task": {
        "type": "logistic",
        "params": {},
    },
    "optimizer": {"name": "SGD", "params": {"learning_rate": 0.2, "momentum": 0.0}},
    "seed": 23,  # for reproducibility
}

data_config = {
        "passive_client": {
            "label_client": False,
            "predictors": ["x1", "x3", "x5", "x7", "x9", "x11", "x13"],
            "target": None,
        },
        "active_client": {
            "label_client": True,
            "predictors": ["x0", "x2", "x4", "x6", "x8", "x10", "x12", "x14"],
            "target": "y",
        },
    }
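
In the configuration above, each client's `input_size` equals the number of entries in its `predictors` list (7 for the passive client and 8 for the active client). Assuming this correspondence is required, the sketch below checks it before a session is created. The helper `input_size_mismatches` is hypothetical, and the `sample_` dictionaries are trimmed copies of the configuration so that the example runs on its own.

```python
sample_model_config = {
    "model": {
        "passive_client": {"params": {"input_size": 7, "output_activation": "sigmoid"}},
        "active_client": {"params": {"input_size": 8, "output_activation": "sigmoid"}},
    },
}

sample_data_config = {
    "passive_client": {
        "label_client": False,
        "predictors": ["x1", "x3", "x5", "x7", "x9", "x11", "x13"],
        "target": None,
    },
    "active_client": {
        "label_client": True,
        "predictors": ["x0", "x2", "x4", "x6", "x8", "x10", "x12", "x14"],
        "target": "y",
    },
}


def input_size_mismatches(model_config: dict, data_config: dict) -> dict:
    """Return {client: (declared input_size, number of predictors)} for clients that disagree."""
    mismatches = {}
    for client, settings in data_config.items():
        declared = model_config["model"][client]["params"]["input_size"]
        actual = len(settings["predictors"])
        if declared != actual:
            mismatches[client] = (declared, actual)
    return mismatches


print(input_size_mismatches(sample_model_config, sample_data_config))
```

An empty dictionary means every client's `input_size` matches its predictor count.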

The `vfl_mode` must be set to `train`.

In [None]:
# Create and start a VFL training session

vfl_train_session = client.create_vfl_session(
    name="Testing notebook - VFL GLM Train",
    description="I am testing VFL GLM training session creation through a notebook",
    prl_session_id=prl_session.id,
    vfl_mode='train',
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_glm",
    data_config=data_config,
    model_config=model_config
).start()


vfl_train_session.id   #Prints the session ID for reference

In [None]:
# Specify the storage path for the training output.

storage_path = f"{aws_storage_path}/vfl/{vfl_train_session.id}"

# Create and start a task group with one task for each of the clients joining the session
# This example uses registered dataset names. 

vfl_task_group_context = (SessionTaskGroup(vfl_train_session)\
    .add_task(iai_tb_aws.vfl_train(train_path=active_train_path, 
                                    test_path=active_test_path, 
                                    batch_size=1024,
                                    client_name="active_client", 
                                    storage_path=aws_storage_path))\
    .add_task(iai_tb_aws.vfl_train(train_path=passive_train_path, 
                                    test_path=passive_test_path, 
                                    batch_size=1024, 
                                    client_name="passive_client", 
                                    storage_path=aws_storage_path))\
    .start())


In [None]:
# Check the status of the tasks

for i in vfl_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

vfl_task_group_context.monitor_task_logs()

In [None]:
# Wait for the tasks to complete (success = True)

vfl_task_group_context.wait(60*8, 2)

### VFL Session Complete!
Now you can view and plot the VFL training metrics and start making predictions.

In [None]:
metrics = vfl_train_session.metrics().as_dict()
metrics

In [None]:
fig = vfl_train_session.metrics().plot()

## Make a Prediction on the trained VFL model

To create a VFL predict session, specify the `prl_session_id` indicating the session you ran to link the datasets together. You also need the `training_session_id`, which is the ID of the VFL train session that was run using the same `prl_session_id`.

The `vfl_mode` must be set to `predict`.

In [None]:
# Create and start a VFL predict session

vfl_predict_session = client.create_vfl_session(
    name="Testing notebook - VFL-GLM Predict",
    description="I am testing VFL-GLM prediction session creation through a notebook",
    prl_session_id=prl_session.id,
    training_session_id=vfl_train_session.id,
    vfl_mode="predict",
    data_config=data_config,
).start()

vfl_predict_session.id  # Prints the session ID for reference

In [None]:
# Create and start a task group with one task for each of the clients joining the session.
# Each task writes its predictions to the storage path given for that client.

vfl_predict_task_group_context = (SessionTaskGroup(vfl_predict_session)
    .add_task(iai_tb_aws.vfl_predict(
        client_name="active_client",
        dataset_path=active_test_path,
        raw_output=True,
        batch_size=1024,
        storage_path=vfl_predict_active_storage_path))
    .add_task(iai_tb_aws.vfl_predict(
        client_name="passive_client",
        dataset_path=passive_test_path,
        raw_output=True,
        batch_size=1024,
        storage_path=vfl_predict_passive_storage_path))
    .start())


In [None]:
# Check the status of the tasks

for i in vfl_predict_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

vfl_predict_task_group_context.monitor_task_logs()

In [None]:
# Wait for the tasks to complete (success = True)

vfl_predict_task_group_context.wait(60*8, 2)

### VFL Predict Session Complete!

Now you can view the VFL predictions and evaluate the performance.

In [None]:
# Retrieve the metrics

metrics = vfl_predict_session.metrics().as_dict()
metrics

In [None]:
presigned_result_urls = vfl_predict_session.prediction_result()

print(vfl_predict_active_storage_path)
df_pred = pd.read_csv(presigned_result_urls.get(vfl_predict_active_storage_path))

df_pred.head()
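
To evaluate performance, the predictions can be compared with the known labels for the test set. The sketch below uses plain Python with made-up labels and probabilities, and assumes that the predictions for the logistic task are probabilities between 0 and 1. The column names in `df_pred` are not shown in this notebook, so extracting the two lists from the DataFrame is left to you.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of rows whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels, with clipped probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Made-up example values, not output from the session.
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.1]

print(accuracy(labels, probabilities))
print(log_loss(labels, probabilities))
```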

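## Create a VFL GLM training session using Tweedie regression

To use Tweedie regression, set the `ml_task` type to `tweedie` and specify `power` in its `params`. The recommended `output_activation` is `None` for `power <= 0` and `exp` for `power > 0`. This matches the automatic link selection of the sklearn [TweedieRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TweedieRegressor.html), which uses the identity link for `power <= 0` and the log link for `power > 0`.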
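
The rule above can be written as a small helper. This is a sketch only: `recommended_output_activation` is a hypothetical function, not part of the SDK, and the table of named special cases follows the scikit-learn documentation for `TweedieRegressor`.

```python
# Named special cases of the Tweedie family, keyed by power.
TWEEDIE_SPECIAL_CASES = {0: "Normal", 1: "Poisson", 2: "Gamma", 3: "Inverse Gaussian"}


def recommended_output_activation(power):
    """Apply the rule stated above: None for power <= 0, "exp" for power > 0."""
    return None if power <= 0 else "exp"


for power in (0, 1, 1.5, 2):
    name = TWEEDIE_SPECIAL_CASES.get(power, "other Tweedie distribution")
    print(power, name, recommended_output_activation(power))
```

With `power` set to 0, as in the configuration that follows, the helper returns `None`, which corresponds to the identity link of an ordinary least-squares fit.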

In [None]:
model_config = {
    "strategy": {"name": "VflGlm", "params": {}},
    "model": {
        "passive_client": {"params": {"input_size": 7, "output_activation": None}},
        "active_client": {"params": {"input_size": 8, "output_activation": None}},
    },
    "ml_task": {
        "type": "tweedie",
        "params": {"power": 0},
    },
    "optimizer": {"name": "SGD", "params": {"learning_rate": 0.01, "momentum": 0.0}},
    "seed": 23,  # for reproducibility
}

data_config = {
        "passive_client": {
            "label_client": False,
            "predictors": ["x1", "x3", "x5", "x7", "x9", "x11", "x13"],
            "target": None,
        },
        "active_client": {
            "label_client": True,
            "predictors": ["x0", "x2", "x4", "x6", "x8", "x10", "x12", "x14"],
            "target": "y",
        },
    }

In [None]:
# Create and start a VFL training session

tweedie_train_session = client.create_vfl_session(
    name="Testing notebook - VFL GLM Train with Tweedie",
    description="I am testing VFL GLM training session creation through a notebook",
    prl_session_id=prl_session.id,
    vfl_mode='train',
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_glm",
    data_config=data_config,
    model_config=model_config
).start()

tweedie_train_session.id    #Prints the session ID for reference

In [None]:
# Specify the storage path for the training output.

storage_path = f"{aws_storage_path}/vfl/{tweedie_train_session.id}"

# Create and start a task group with one task for each of the clients joining the session
# This example uses registered dataset names. 

tweedie_task_group_context = (SessionTaskGroup(tweedie_train_session)\
    .add_task(iai_tb_aws.vfl_train(train_path=active_train_path, 
                                    test_path=active_test_path, 
                                    batch_size=1024,
                                    client_name="active_client", 
                                    storage_path=aws_storage_path))\
    .add_task(iai_tb_aws.vfl_train(train_path=passive_train_path, 
                                    test_path=passive_test_path, 
                                    batch_size=1024, 
                                    client_name="passive_client", 
                                    storage_path=aws_storage_path))\
    .start())


In [None]:
# Check the status of the tasks

for i in tweedie_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

tweedie_task_group_context.monitor_task_logs()

In [None]:
# Wait for the tasks to complete (success = True)

tweedie_task_group_context.wait(60*5, 2)

### VFL Session Complete!
Now you can view and plot the VFL training metrics.

In [None]:
metrics = tweedie_train_session.metrics().as_dict()
metrics

In [None]:
fig = tweedie_train_session.metrics().plot()