# integrate.ai API Sample Notebook

## Set environment variables (or replace inline) with your IAI credentials
### Generate and manage this token in the UI, on the Tokens page

In [None]:
import os

IAI_TOKEN = os.environ.get("IAI_TOKEN")
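
`os.environ.get` returns `None` when the variable is unset, which would only surface later as an authentication failure. A minimal sketch of failing fast instead is shown below. `require_env` and the `IAI_DEMO_TOKEN` variable are hypothetical names used for illustration; they are not part of the integrate.ai SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable, set here only so the example runs on its own.
os.environ["IAI_DEMO_TOKEN"] = "example-token"
demo_token = require_env("IAI_DEMO_TOKEN")
print(demo_token)
```

In the notebook itself, the same helper could be called as `require_env("IAI_TOKEN")` once the real token has been exported.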

## Authenticate to the integrate.ai API client

In [None]:
from integrate_ai_sdk.api import connect

client = connect(token=IAI_TOKEN)

## Create an EDA Session for exploring the datasets

To create an EDA session, we specify a `dataset_config` dictionary that lists the columns to explore for each dataset. An empty list `[]` means that all columns are included. The number of expected datasets is inferred from the number of entries in `dataset_config` (here, two). Alternatively, it can be set explicitly with the optional `num_datasets` argument of `client.create_eda_session()`.

For more information on how to configure an EDA session from scratch, see the [documentation](https://integrate-ai.gitbook.io/integrate.ai-user-documentation/tutorials/exploratory-data-analysis-eda).

In [None]:
dataset_config = {"dataset_one": [], "dataset_two": []}
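
As a sketch of the alternative described above, a config can list column names for one dataset and leave the other open. The column names `x1` and `x10` come from the sample data used later in this notebook; `selected_config` is an illustrative name and is not used by the cells that follow.

```python
# Explore only two columns of the first dataset and every column of the second.
selected_config = {"dataset_one": ["x1", "x10"], "dataset_two": []}

# One dataset is expected per key; this is the count that is inferred
# when `num_datasets` is not passed explicitly.
expected_datasets = len(selected_config)

for name, columns in selected_config.items():
    print(name, "->", columns if columns else "all columns")
```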

In [None]:
eda_session = client.create_eda_session(
    name="Testing notebook - EDA",
    description="I am testing EDA session creation through a notebook",
    data_config=dataset_config,
).start()

eda_session.id

## Start an EDA Session using IAI client
Follow the documentation for directions on installing the [integrate_ai](https://pypi.org/project/integrate-ai/) package and downloading the [sample data](https://integrate-ai.gitbook.io/integrate.ai-user-documentation/tutorials/end-user-tutorials/model-training-with-a-sample-local-dataset#prerequisites).<br/>
Unzip the sample data to your `~/Downloads` directory; otherwise, update the `data_path` below to point to the sample data.

In [None]:
import subprocess

data_path = "~/Downloads/synthetic"

dataset_1 = subprocess.Popen(
    f"iai client eda --token {IAI_TOKEN} --session {eda_session.id} --dataset-path {data_path}/train_silo0.parquet --dataset-name dataset_one --remove-after-complete",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

dataset_2 = subprocess.Popen(
    f"iai client eda --token {IAI_TOKEN} --session {eda_session.id} --dataset-path {data_path}/train_silo1.parquet --dataset-name dataset_two --remove-after-complete",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
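
The commands above are interpolated into a shell string, so a token or path that contains spaces or shell metacharacters could be misparsed. One possible alternative, sketched below, builds the same arguments as a list, which `subprocess.Popen` accepts without `shell=True`. `build_eda_command` is a hypothetical helper; the flags are copied from the command above, and because no shell is involved, `~` is expanded in Python.

```python
import os


def build_eda_command(token, session_id, dataset_path, dataset_name):
    """Return the `iai client eda` invocation as an argument list (no shell quoting needed)."""
    return [
        "iai", "client", "eda",
        "--token", token,
        "--session", session_id,
        "--dataset-path", os.path.expanduser(dataset_path),
        "--dataset-name", dataset_name,
        "--remove-after-complete",
    ]


# Placeholder token and session ID, for illustration only.
cmd = build_eda_command(
    "example-token",
    "example-session",
    "~/Downloads/synthetic/train_silo0.parquet",
    "dataset_one",
)
print(cmd)
```

The resulting list could then be passed as `subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)`.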

## Poll for session status

You can log whatever you like about the session during this time. For now, we only check for session completion. If you want to access the logs later, you can use the `iai client log` command.

In [None]:
import time

current_status = None
while dataset_1.poll() is None or dataset_2.poll() is None:
    output1 = dataset_1.stdout.readline().decode("utf-8").strip()
    output2 = dataset_2.stdout.readline().decode("utf-8").strip()
    if output1:
        print("silo1: ", output1)
    if output2:
        print("silo2: ", output2)

    # poll for status
    if current_status != eda_session.status:
        print("Session status: ", eda_session.status)
        current_status = eda_session.status
    time.sleep(1)

output1, error1 = dataset_1.communicate()
output2, error2 = dataset_2.communicate()

print(
    "dataset_1 finished with return code: %d\noutput: %s\n  %s"
    % (dataset_1.returncode, output1.decode("utf-8"), error1.decode("utf-8"))
)
print(
    "dataset_2 finished with return code: %d\noutput: %s\n  %s"
    % (dataset_2.returncode, output2.decode("utf-8"), error2.decode("utf-8"))
)
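
Because `readline()` blocks until a full line arrives, the loop above can wait on a quiet process while the other one already has output ready. A minimal sketch of one way around this uses a reader thread per process. The child processes here are stand-in Python one-liners rather than `iai` clients, and `stream_lines` and `run_and_collect` are hypothetical helper names.

```python
import subprocess
import sys
import threading
from queue import Queue


def stream_lines(label, pipe, sink):
    """Forward each line read from `pipe` into `sink`, tagged with `label`."""
    for raw in iter(pipe.readline, b""):
        sink.put((label, raw.decode("utf-8").rstrip()))
    pipe.close()


def run_and_collect(commands):
    """Start every command, drain its output on its own thread, and return (lines, return codes)."""
    sink = Queue()
    procs, threads = {}, []
    for label, cmd in commands.items():
        # stderr is merged into stdout so that a single reader drains both streams.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        thread = threading.Thread(target=stream_lines, args=(label, proc.stdout, sink))
        thread.start()
        procs[label] = proc
        threads.append(thread)
    for thread in threads:
        thread.join()
    codes = {label: proc.wait() for label, proc in procs.items()}
    lines = []
    while not sink.empty():
        lines.append(sink.get())
    return lines, codes


demo_commands = {
    "silo1": [sys.executable, "-c", "print('hello from silo1')"],
    "silo2": [sys.executable, "-c", "print('hello from silo2')"],
}
lines, codes = run_and_collect(demo_commands)
for label, text in lines:
    print(f"{label}: {text}")
```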

## EDA Session Complete!
Now you can analyze the datasets.

The results object is a dataset collection, which is composed of multiple datasets that can be retrieved by name.

Each dataset is composed of columns, which can be retrieved by column name.

The same base analysis functions can be performed at the collection, dataset, or column level.

In [None]:
results = eda_session.results()
results

The `.describe()` method can be used to retrieve a standard set of descriptive statistics.

In this example, columns `x10` to `x14` are categorical and no statistics outside of `count` will be computed for these columns.

If a statistical function is invalid for a column (for example, `mean` requires a continuous column and `x10` is categorical), or if a column from one dataset is not present in the other, the result is shown as `NaN`.

In [None]:
results.describe()

In [None]:
results["dataset_one"].describe()
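
To illustrate the `NaN` behaviour described above, here is a small standalone sketch in plain Python. It is not the SDK's implementation: `describe_column` is a hypothetical function, the values are made up, and the use of the sample standard deviation is an assumption.

```python
import math
import statistics


def describe_column(values, categorical=False):
    """Return count, mean, and std for a column; numeric statistics stay NaN for categorical columns."""
    summary = {"count": len(values), "mean": math.nan, "std": math.nan}
    if not categorical and len(values) > 1:
        summary["mean"] = statistics.fmean(values)
        summary["std"] = statistics.stdev(values)  # sample standard deviation
    return summary


numeric_summary = describe_column([1.0, 2.0, 3.0])
categorical_summary = describe_column(["a", "b", "a"], categorical=True)
print(numeric_summary)
print(categorical_summary)
```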

For categorical columns, other statistics like `unique_count`, `mode`, and `uniques` can be used for further exploration.

In [None]:
results["dataset_one"][["x10", "x11"]].uniques()

Functions like `.mean()`, `.median()`, and `.std()` can also be called individually.

In [None]:
results["dataset_one"].mean()

In [None]:
results["dataset_one"]["x1"].mean()

Histogram plots can be created using the `.plot_hist()` function.

In [None]:
saved_dataset_two_hist_plots = results["dataset_two"].plot_hist()

In [None]:
single_hist = results["dataset_two"]["x1"].plot_hist()

## Sample model config and data schema
You can find the model config and data schema in the [integrate.ai end user tutorial](https://integrate-ai.gitbook.io/integrate.ai-user-documentation/tutorials/end-user-tutorials/model-training-with-a-sample-local-dataset).

In [None]:
model_config = {
    "experiment_name": "test_synthetic_tabular",
    "experiment_description": "test_synthetic_tabular",
    "strategy": {"name": "FedAvg", "params": {}},
    "model": {"params": {"input_size": 15, "hidden_layer_sizes": [6, 6, 6], "output_size": 2}},
    "balance_train_datasets": False,
    "ml_task": {
        "type": "classification",
        "params": {
            "loss_weights": None,
        },
    },
    "optimizer": {"name": "SGD", "params": {"learning_rate": 0.2, "momentum": 0.0}},
    "differential_privacy_params": {"epsilon": 4, "max_grad_norm": 7},
    "save_best_model": {
        "metric": "loss",  # to disable this and save model from the last round, set to None
        "mode": "min",
    },
    "seed": 23,  # for reproducibility
}

data_schema = {
    "predictors": ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"],
    "target": "y",
}
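
The model's `input_size` has to match the number of predictor columns (15 in both cases here). A short sketch of a sanity check follows, using standalone copies of the relevant values so that it runs on its own; `predictors` and `model_params` are illustrative names.

```python
# Same predictor names as in `data_schema`, generated instead of typed out.
predictors = [f"x{i}" for i in range(15)]

# Copy of the "model" -> "params" entry from `model_config`.
model_params = {"input_size": 15, "hidden_layer_sizes": [6, 6, 6], "output_size": 2}

# The input layer needs one unit per predictor column.
if len(predictors) != model_params["input_size"]:
    raise ValueError(
        f"input_size is {model_params['input_size']} but {len(predictors)} predictors are listed"
    )
print("input_size matches", len(predictors), "predictors")
```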

## Create a Training Session

The documentation for [creating a session](https://integrate-ai.gitbook.io/integrate.ai-user-documentation/tutorials/end-user-tutorials/model-training-with-a-sample-local-dataset#create-and-start-the-session) provides more context on the parameters used when creating a training session.<br />
This session uses two training clients and two rounds.

In [None]:
training_session = client.create_fl_session(
    name="Testing notebook",
    description="I am testing session creation through a notebook",
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_ffnet",
    model_config=model_config,
    data_config=data_schema,
).start()

training_session.id
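
The model config selects the `FedAvg` strategy. As general background, the textbook form of federated averaging combines client parameters in each round as a mean weighted by each client's number of training samples. The sketch below shows that idea in plain Python with made-up numbers; it is not integrate.ai's implementation, which may differ (for example, because of the differential privacy settings), and `federated_average` is a hypothetical name.

```python
def federated_average(client_updates):
    """Average parameter vectors, weighting each client by its number of training samples."""
    total_samples = sum(n for _, n in client_updates)
    num_params = len(client_updates[0][0])
    averaged = [0.0] * num_params
    for params, n in client_updates:
        weight = n / total_samples
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged


# Made-up parameter vectors and sample counts for two clients.
updates = [([1.0, 3.0], 100), ([3.0, 5.0], 300)]
global_params = federated_average(updates)
print(global_params)
```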

## Start a training session using iai client
Make sure that the sample data you downloaded in [Start an EDA Session](#Start-an-EDA-Session-using-IAI-client) is saved to your `~/Downloads` directory; otherwise, update the `data_path` below to point to the sample data.

In [None]:
import subprocess

data_path = "~/Downloads/synthetic"

client_1 = subprocess.Popen(
    f"iai client train --token {IAI_TOKEN} --session {training_session.id} --train-path {data_path}/train_silo0.parquet --test-path {data_path}/test.parquet --batch-size 1024 --client-name client-1 --remove-after-complete",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

client_2 = subprocess.Popen(
    f"iai client train --token {IAI_TOKEN} --session {training_session.id} --train-path {data_path}/train_silo1.parquet --test-path {data_path}/test.parquet --batch-size 1024 --client-name client-2 --remove-after-complete",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

## Poll for session status

You can log whatever you like about the session during this time. For now, we log the current round and the session status. If you want to access the logs later, you can use the `iai client log` command.

In [None]:
import time

current_round = None
current_status = None
while client_1.poll() is None or client_2.poll() is None:
    output1 = client_1.stdout.readline().decode("utf-8").strip()
    output2 = client_2.stdout.readline().decode("utf-8").strip()
    if output1:
        print("silo1: ", output1)
    if output2:
        print("silo2: ", output2)

    # poll for status and round
    if current_status != training_session.status:
        print("Session status: ", training_session.status)
        current_status = training_session.status
    if current_round != training_session.round and training_session.round > 0:
        print("Session round: ", training_session.round)
        current_round = training_session.round
    time.sleep(1)

output1, error1 = client_1.communicate()
output2, error2 = client_2.communicate()

print(
    "client_1 finished with return code: %d\noutput: %s\n  %s"
    % (client_1.returncode, output1.decode("utf-8"), error1.decode("utf-8"))
)
print(
    "client_2 finished with return code: %d\noutput: %s\n  %s"
    % (client_2.returncode, output2.decode("utf-8"), error2.decode("utf-8"))
)
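
The status and round checks above both follow the same pattern: print a value only when it differs from the last one seen. A minimal standalone sketch of that pattern is shown below. `log_changes` is a hypothetical helper, and the status strings are made-up placeholders rather than values documented by the SDK.

```python
_MISSING = object()  # sentinel that never equals a real observation


def log_changes(label, observations):
    """Return one message for each change in a sequence of observed values."""
    messages = []
    last = _MISSING
    for value in observations:
        if value != last:
            messages.append(f"{label}: {value}")
            last = value
    return messages


# Placeholder status values, as if sampled once per second.
status_messages = log_changes("Session status", ["created", "running", "running", "completed"])
for message in status_messages:
    print(message)
```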

## Session Complete!
Now you can view the training metrics and start making predictions.

In [None]:
training_session.metrics().as_dict()

In [None]:
fig = training_session.metrics().plot()

## Trained model parameters are accessible from the completed session

Model parameters can be retrieved using the model's `state_dict()` method. These parameters can then be saved with `torch.save()`.

In [None]:
import torch

model = training_session.model().as_pytorch()

save_state_dict_folder = "./saved_models"
# PyTorch conventional file type
file_name = f"{training_session.id}.pt"
os.makedirs(save_state_dict_folder, exist_ok=True)
saved_state_dict_path = os.path.join(save_state_dict_folder, file_name)

# torch.save opens the target file itself, so no separate open() call is needed
torch.save(model.state_dict(), saved_state_dict_path)

## Load the saved model

To load a model saved previously, a model object needs to be initialized first. This can be done by directly importing one of the IAI-supported packages (e.g., FFNet) or using the model class defined in a custom package. 

In [None]:
from integrate_ai_sdk.packages.FFNet.nn_model import FFNet

model = FFNet(input_size=15, output_size=2, hidden_layer_sizes=[6, 6, 6])

# use torch.load to unpickle the state_dict
target_state_dict = torch.load(saved_state_dict_path)

model.load_state_dict(target_state_dict)

## Load test data

In [None]:
import pandas as pd

test_data = pd.read_parquet(f"{data_path}/test.parquet")
test_data.head()

## Convert test data to tensors

In [None]:
Y = torch.tensor(test_data["y"].values)

In [None]:
X = torch.tensor(
    test_data[["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"]].values,
    dtype=torch.float32,  # PyTorch layers use float32 weights by default
)

## Run model predictions

In [None]:
model(X)

In [None]:
# index of the highest-scoring class for each row
labels = model(X).argmax(dim=1)
labels
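
The label for each row is the index of its largest output value. The plain-Python sketch below shows the same step on made-up numbers, along with a softmax that turns the scores into probabilities. It assumes the model returns raw scores (logits) with one value per class, which this notebook does not state explicitly; `softmax` and `argmax` here are illustrative helpers, not SDK functions.

```python
import math


def softmax(scores):
    """Convert one row of raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def argmax(scores):
    """Return the index of the largest score, i.e. the predicted class label."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Made-up scores for two rows and two classes.
example_outputs = [[2.0, 0.5], [-1.0, 1.0]]
example_labels = [argmax(row) for row in example_outputs]
example_probs = [softmax(row) for row in example_outputs]
print(example_labels)
print(example_probs)
```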