# integrate.ai API Sample Notebook: Running HFL Tasks with an AWS Task Runner


This example notebook demonstrates how to create task builders and run tasks using an AWS task runner.
For details about required setup and configuration for task runners, see [Using integrate.ai](https://documentation.integrateai.net/#using-integrate-ai).

## Setup
### Set environment variables (or replace inline) with your IAI credentials
Generate and manage your access token on the Tokens page of the UI.

In [None]:
from integrate_ai_sdk.api import connect
import os
import json
import pandas as pd

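# Read the token from the environment (the variable name IAI_TOKEN is an assumption),
# or replace the "" fallback with your token inline.
IAI_TOKEN = os.environ.get("IAI_TOKEN", "")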
client = connect(token=IAI_TOKEN)

The documentation section [Understanding Models](https://documentation.integrateai.net/#understanding-models) provides additional context on the parameters used when creating a training session.

This session uses two training clients and two rounds.

You can find the model config and data schema details in the documentation.

In [None]:
# Specify the model and data configurations

model_config = {
    "experiment_name": "test_synthetic_tabular",
    "experiment_description": "test_synthetic_tabular",
    "strategy": {"name": "FedAvg", "params": {}},
    "model": {"params": {"input_size": 15, "hidden_layer_sizes": [6, 6, 6], "output_size": 2}},
    "balance_train_datasets": False,
    "ml_task": {
        "type": "classification",
        "params": {
            "loss_weights": None,
        },
    },
    "optimizer": {"name": "SGD", "params": {"learning_rate": 0.2, "momentum": 0.0}},
    "differential_privacy_params": {"epsilon": 4, "max_grad_norm": 7},
    "save_best_model": {
        "metric": "loss",  # to disable this and save model from the last round, set to None
        "mode": "min",
    },
    "seed": 23,  # for reproducibility
}

data_schema = {
    "predictors": ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"],
    "target": "y",
}
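
In this example the model's `input_size` (15) matches the number of predictor columns in the data schema. A quick check before creating the session can catch a mismatch early. The helper `check_input_size` below is hypothetical, and the rule that `input_size` must equal the predictor count is inferred from this example rather than stated by the SDK.

```python
def check_input_size(model_config: dict, data_schema: dict) -> None:
    """Raise ValueError if the model's input_size differs from the predictor count."""
    expected = model_config["model"]["params"]["input_size"]
    actual = len(data_schema["predictors"])
    if expected != actual:
        raise ValueError(
            f"input_size is {expected}, but data_schema lists {actual} predictors"
        )


# Trimmed copies of the configs above, kept here so the sketch runs on its own.
example_model_config = {"model": {"params": {"input_size": 15}}}
example_data_schema = {"predictors": [f"x{i}" for i in range(15)], "target": "y"}

check_input_size(example_model_config, example_data_schema)  # passes silently
```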

In [None]:
# Create and start the training session

hfl_session = client.create_fl_session(
    name="Testing notebook - HFL",
    description="I am testing session creation with a task runner through a notebook",
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_ffnet",
    model_config=model_config,
    data_config=data_schema
).start()

hfl_session.id # Prints the training session ID for reference
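
For intuition about the `FedAvg` strategy named in the model config, the sketch below shows the standard federated averaging step: the server combines client parameters, weighting each client by its number of training samples. This is the textbook algorithm in plain Python, not the integrate.ai implementation; the function name and the sample numbers are illustrative assumptions.

```python
from typing import List, Sequence


def fed_avg(client_weights: Sequence[Sequence[float]], client_sizes: Sequence[int]) -> List[float]:
    """Average parameter vectors, weighting each client by its sample count."""
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client and at least one client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        for j in range(n_params):
            averaged[j] += weights[j] * size / total
    return averaged


# Two clients, as in this session: client A holds 100 samples, client B holds 300.
global_weights = fed_avg([[1.0, 2.0], [3.0, 6.0]], [100, 300])
print(global_weights)  # [2.5, 5.0]
```

In a multi-round session, each round repeats this step after the clients train locally on the current global parameters.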

In [None]:
# Create a task group with one task for each of the clients joining the session
# This example uses registered dataset names
# NOTE: `SessionTaskGroup` and the AWS task builder `iai_tb_aws` are not defined
# in the cells above; import/create them as described in the SDK documentation
# before running this cell.

task_group = (
    SessionTaskGroup(hfl_session)
    .add_task(iai_tb_aws.hfl(train_dataset_name="train_one", test_dataset_name="test_parquet", use_gpu=False))
    .add_task(iai_tb_aws.hfl(train_dataset_name="train_two", test_dataset_name="test_parquet", use_gpu=False))
)

task_group_context = task_group.start()

In [None]:
# Monitor the submitted tasks

for i in task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

task_group_context.monitor_task_logs()

In [None]:
# Wait for the tasks to complete (success = True)

task_group_context.wait(60*8, 2)
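
The `wait` call above blocks until the tasks finish. As a rough illustration of that pattern, and assuming its two arguments are a timeout and a polling interval in seconds (the SDK signature is not shown in this notebook), a generic polling loop might look like the following. `wait_until` and its parameters are hypothetical.

```python
import time
from typing import Callable


def wait_until(is_done: Callable[[], bool], timeout_s: float, poll_s: float) -> bool:
    """Poll is_done() every poll_s seconds; return True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Stand-in for a task status check: reports completion on the third poll.
polls = {"count": 0}


def fake_task_finished() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


print(wait_until(fake_task_finished, timeout_s=5, poll_s=0.01))  # True
```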

### HFL Session Complete!
Now you can view the training metrics and start making predictions.

In [None]:
# Retrieve the session metrics

hfl_session.metrics().as_dict()

In [None]:
# Plot the session metrics

fig = hfl_session.metrics().plot()

In [None]:
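# Retrieve the trained model from the completed session.
# Assumption: `iai_ffnet` is a neural-network package, so the PyTorch accessor
# (also used later in this notebook) is used here instead of `as_sklearn()`.
model = hfl_session.model().as_pytorch()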

# Task 2: Create a linear inference session (HFL)

The built-in model package `iai_linear_inference` trains a bundle of linear models for the target of interest against a specified list of predictors. It obtains the coefficients and variance estimates, and also calculates the p-values from the corresponding hypothesis tests. Linear inference is particularly useful for genome-wide association studies (GWAS), to identify genomic variants that are statistically associated with a risk for a disease or a particular trait.

This is a horizontal federated learning (HFL) model package.

For more information, see the [documentation](https://documentation.integrateai.net/#linear-inference-sessions).

### Sample model config and data config
To be compatible with the `iai_linear_inference` package, set the strategy in `model_config` to `LogitRegInference` if the target of interest is binary, or to `LinearRegInference` if it is continuous.
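
The choice of strategy can be sketched as a small helper that inspects the target values. `choose_strategy` is a hypothetical function, and treating a target as binary when it holds at most two distinct values is an assumption made for illustration.

```python
from typing import Iterable


def choose_strategy(target_values: Iterable[float]) -> str:
    """Return the strategy name for a binary or continuous target."""
    distinct = set(target_values)
    # Assumption: two or fewer distinct values means a binary target.
    return "LogitRegInference" if len(distinct) <= 2 else "LinearRegInference"


example_config = {
    "strategy": {"name": choose_strategy([0, 1, 1, 0, 1]), "params": {}},
    "seed": 23,
}
print(example_config["strategy"]["name"])  # LogitRegInference
```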

The `data_config` dictionary should include the following three fields (columns in every field can be specified either as names/strings or as indices/integers):
- `target`: the target column of interest;
- `shared_predictors`: predictor columns that should be included in all linear models (e.g., confounding factors such as age and gender in GWAS);
- `chunked_predictors`: predictor columns that should be included in the linear model one at a time (e.g., the gene expressions in GWAS).

With the example data config below, the session will train 4 logistic regression models with `y` as the target, and `x1, x2` plus any one of `x0, x3, x10, x11` as predictors.
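
To make the bundling concrete, this sketch lists the predictor set of each model implied by a data config: the shared predictors plus one chunked predictor at a time. `predictor_sets` is a hypothetical helper that mirrors the description above, not the package internals.

```python
from typing import Dict, List


def predictor_sets(data_config: Dict) -> List[List[str]]:
    """Return one predictor list per model: shared predictors plus one chunked predictor."""
    shared = list(data_config["shared_predictors"])
    return [shared + [chunk] for chunk in data_config["chunked_predictors"]]


example_data_config = {
    "target": "y",
    "shared_predictors": ["x1", "x2"],
    "chunked_predictors": ["x0", "x3", "x10", "x11"],
}

for predictors in predictor_sets(example_data_config):
    print(example_data_config["target"], "~", " + ".join(predictors))
# y ~ x1 + x2 + x0
# y ~ x1 + x2 + x3
# y ~ x1 + x2 + x10
# y ~ x1 + x2 + x11
```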

In [None]:
# Specify the model and data configurations

model_config_logit = {
    "strategy": {"name": "LogitRegInference", "params": {}},
    "seed": 23,  # for reproducibility
}

data_config_logit = {
    "target": "y",
    "shared_predictors": ["x1", "x2"],
    "chunked_predictors": ["x0", "x3", "x10", "x11"]
}

In [None]:
# Create and start a linear inference session 

training_session_logit = client.create_fl_session(
    name="Testing linear inference session",
    description="I am testing linear inference session creation using a task runner through a notebook",
    min_num_clients=2,
    num_rounds=5,
    package_name="iai_linear_inference",
    model_config=model_config_logit,
    data_config=data_config_logit
).start()

training_session_logit.id

In [None]:
# Create a task group
# This example uses registered dataset names

task_group_context = (
    SessionTaskGroup(training_session_logit)
    .add_task(iai_tb_aws.hfl(train_dataset_name="train_one", test_dataset_name="test_parquet"))
    .add_task(iai_tb_aws.hfl(train_dataset_name="train_two", test_dataset_name="test_parquet"))
    .start()
)


In [None]:
# Check the status of the tasks

for i in task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))
task_group_context.monitor_task_logs()

In [None]:
# Wait for the tasks to complete (success = True)

task_group_context.wait(60*5, 2)

## Inference Session Complete!
Now we can view the training metrics and model details, such as the model coefficients and p-values. Because the session trains a bundle of models, the metrics below are averages across all of the models.

In [None]:
training_session_logit.metrics().as_dict()
training_session_logit.metrics().plot()

### Trained models are accessible from the completed session

The `LinearInferenceModel` object can be retrieved using the model's `as_pytorch` method. Relevant information, such as the p-values, can then be accessed directly from the model object.


In [None]:
model_logit = training_session_logit.model().as_pytorch()
pv = model_logit.p_values()
pv

The `.summary` method fetches the coefficient, standard error and p-value of the model corresponding to the specified predictor.

In [None]:
summary_x0 = model_logit.summary("x0")
summary_x0
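
For intuition on how the three quantities in the summary relate, a two-sided Wald z-test derives the p-value from a coefficient and its standard error. Whether `iai_linear_inference` uses exactly this test is an assumption; the function and the numbers below are illustrative only.

```python
import math


def wald_p_value(coefficient: float, std_error: float) -> float:
    """Two-sided p-value for H0: coefficient == 0 under a normal approximation."""
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    z = coefficient / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up numbers: an estimate 1.96 standard errors away from zero.
print(round(wald_p_value(0.49, 0.25), 3))  # 0.05
```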

It is also possible to make predictions with the resulting bundle of models when the data is loaded by the `ChunkedTabularDataset` from the `iai_linear_inference` package. For an example, see the `integrateai_linear_inference.ipynb` notebook and the [documentation](https://documentation.integrateai.net/#making-predictions-from-a-linear-inference-session).