# Federated Learning Workshop 

In this workshop you will learn about federated learning, its core concepts, and how it works with the help of PySyft, a Python library for federated learning. We will walk through the main steps of the data science workflow and see how federated learning enables data science on non-public data without obtaining or seeing a copy of the data itself.

## Scenario

We will use a Breast Cancer Study scenario involving two parties. Rachel is the Data Scientist, who is carrying out machine learning research using cancer data. To do so, Rachel would like to use the (non-public) “Breast Cancer Biomarker” dataset that has been made available on the Cancer Research Centre Datasite.

Owen is the Data Owner. The data cannot be made public for legal reasons. Nonetheless, Owen is very keen on allowing researchers to feature the “Breast Cancer Biomarker” dataset in their projects, so Owen sets up a PySyft Datasite hosting the dataset. As Data Owner, Owen will need to:

* upload the data
* manage credentials and user profiles
* review any project proposal submitted by external data scientists.

### Workflow

Step 1. Owen sets up the new Cancer Research Centre Datasite by (a) uploading the non-public “Breast Cancer Biomarker” dataset, and (b) configuring login credentials for Rachel to access the Datasite.

Step 2. Rachel connects to the Cancer Research Centre Datasite, prepares their machine learning code to work with the “Breast Cancer Biomarker” dataset, and submits their research study to the Datasite.

Step 3. Owen, as the data owner of the Datasite, receives the request and reviews Rachel’s code for approval.

Step 4. Once approved, Rachel is able to remotely execute their code on the Datasite and get the results of their machine learning study using the “Breast Cancer Biomarker” dataset.
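
The four steps amount to a small approval workflow: a request starts as pending, and the submitted code can only run on the real data once the data owner approves it. The sketch below is a plain-Python toy model of that idea, not PySyft code; `Status`, `CodeRequest`, and `run_on_real_data` are hypothetical names introduced only for illustration, and the numbers are made up.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass
class CodeRequest:
    """Toy stand-in for a code request submitted to a Datasite (hypothetical)."""

    submitted_by: str
    code: Callable[[list], float]
    status: Status = Status.PENDING

    def approve(self) -> None:
        # Step 3: the data owner reviews the code and approves it.
        self.status = Status.APPROVED

    def run_on_real_data(self, real_data: list) -> float:
        # Step 4: execution on the real data is only allowed after approval.
        if self.status is not Status.APPROVED:
            raise PermissionError("request has not been approved yet")
        return self.code(real_data)


def mean(values: list) -> float:
    return sum(values) / len(values)


# Step 1: the owner holds the non-public data (made-up numbers).
real_data = [1.0, 2.0, 3.0, 4.0]

# Step 2: the data scientist submits code, which starts as PENDING.
request = CodeRequest(submitted_by="rachel", code=mean)

try:
    request.run_on_real_data(real_data)
    ran_before_approval = True
except PermissionError:
    ran_before_approval = False

request.approve()
result = request.run_on_real_data(real_data)
```

In this toy model the first call is rejected and only the call made after `approve()` returns a value, which mirrors the order of steps 3 and 4.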

## CODE

### Part 1: Datasets and Assets

First, install the packages below, then restart the session: in the menu, go to Runtime > Restart session. Once done, continue with the code.

In [None]:
!pip install syft
!pip install ucimlrepo

Let’s first import `syft` as `sy` (we will use this coding convention throughout the tutorial, ed.):

In [None]:
import syft as sy

The `syft.orchestra.launch` function runs a special local Datasite server that is only intended for development purposes. Each server is identified by its unique `name`, which PySyft uses to restore the server's internal state after a restart. We will use the `reset=True` option to make sure that the server instance starts from a clean state, as if initialised for the first time. Once the server is up and running, we log in to the Datasite:

In [None]:
data_site = sy.orchestra.launch(name="cancer-research-centre", reset=True)
client = data_site.login(email="info@openmined.org", password="changethis")

Now we are going to download the dataset we will be using for this workshop. Use the following code block to do so.

In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas dataframes)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets

# metadata
metadata = breast_cancer_wisconsin_diagnostic.metadata
# variable information
variables = breast_cancer_wisconsin_diagnostic.variables

This dataset contains `569` samples, described by `30` clinical features (i.e. `X`). Each sample has a single categorical target identifying the outcome of the tumour: `B` for benign, and `M` for malignant:

In [None]:
X.head(n=5)  # n specifies how many rows we want in the preview
X.shape
y.sample(n=5, random_state=10)
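
The sample count, feature count, and class balance can also be checked offline, because scikit-learn bundles a copy of the Breast Cancer Wisconsin (Diagnostic) data. The sketch below assumes scikit-learn is installed and uses its `load_breast_cancer` loader rather than `ucimlrepo`; note that this copy encodes the target as integers (`0` for malignant, `1` for benign) instead of the `M`/`B` letters.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Bundled copy of the Breast Cancer Wisconsin (Diagnostic) dataset.
bunch = load_breast_cancer()

n_samples, n_features = bunch.data.shape

# Count samples per class; index 0 is malignant, index 1 is benign.
class_counts = dict(
    zip(bunch.target_names.tolist(), np.bincount(bunch.target).tolist())
)

print(n_samples, n_features)
print(class_counts)
```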

Now we have the real dataset. Owen, as the data owner, needs to create a mock version of it. Data scientists can download and inspect the mock data to see whether the dataset is the right fit for their study.

In [None]:
import numpy as np

# fix seed for reproducibility
SEED = 12345
np.random.seed(SEED)

X_mock = X.apply(lambda s: s + np.mean(s) + np.random.uniform(size=len(s)))
y_mock = y.sample(frac=1, random_state=SEED).reset_index(drop=True)

The clinical features `X_mock` are obtained from the original `X` by adding the *arithmetic mean* of each corresponding column, plus some random noise drawn from a uniform distribution. The categorical targets `y_mock` are created by simply shuffling their original values. In this way, the data types as well as the *class distribution* remain unchanged, whilst any relationship between the features and the targets is destroyed.
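
To see why this recipe preserves data types and class distribution, here is a small self-contained sketch on a made-up table. The toy values, the column names, and the use of `numpy.random.default_rng` (instead of the global seed used above) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

SEED = 12345
rng = np.random.default_rng(SEED)

# Made-up stand-ins for the real features and targets.
X_toy = pd.DataFrame(
    {"radius": [10.0, 12.5, 14.1, 20.3], "texture": [15.2, 18.9, 21.0, 25.4]}
)
y_toy = pd.DataFrame({"Diagnosis": ["B", "B", "M", "B"]})

# Shift every column by its mean and add uniform noise in [0, 1).
X_toy_mock = X_toy.apply(lambda s: s + s.mean() + rng.uniform(size=len(s)))
# Shuffle the targets so they no longer line up with the features.
y_toy_mock = y_toy.sample(frac=1, random_state=SEED).reset_index(drop=True)

same_shape = X_toy_mock.shape == X_toy.shape
same_dtypes = X_toy_mock.dtypes.equals(X_toy.dtypes)
same_class_counts = (
    y_toy_mock["Diagnosis"].value_counts().to_dict()
    == y_toy["Diagnosis"].value_counts().to_dict()
)
print(same_shape, same_dtypes, same_class_counts)
```

The three checks confirm that the mock table has the same shape, the same column types, and the same number of `B` and `M` labels as the original, even though none of the feature values match.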

Now that we have both real and mock data, we are ready to create the corresponding assets in PySyft, each identified by their unique `name` within the Datasite.

In [None]:
features_asset = sy.Asset(
    name="Breast Cancer Data: Features",
    data = X,      # real data
    mock = X_mock  # mock data
)

targets_asset = sy.Asset(
    name="Breast Cancer Data: Targets",
    data = y,      # real data
    mock = y_mock  # mock data
)

Notice how each asset holds a reference to `data` and `mock`, which are also two properties of a `syft.Asset` object that we can inspect:

In [None]:
features_asset.data.head(n=3)

In [None]:
features_asset.mock.head(n=3)

OK, so we have got two assets, `features_asset` and `targets_asset`, and now we’re ready to upload them to the Datasite server, right? Well, not quite! There’s a problem: on their own, the assets carry no metadata describing the data they hold.

For this reason, PySyft expects each asset to be stored as part of a `syft.Dataset` object. Each dataset in PySyft is identified by its unique name, and contains additional metadata (e.g. `description`, `citation`, `contributors`) that further describes the core data included in its assets.

Let’s now collect our metadata, and then use it to create our `Dataset` object:


In [None]:
# Metadata
description = f'{metadata["abstract"]}\n{metadata["additional_info"]["summary"]}'

paper = metadata["intro_paper"]
citation = f'{paper["authors"]} - {paper["title"]}, {paper["year"]}'

summary = "The Breast Cancer Wisconsin dataset can be used to predict whether the cancer is benign or malignant."

# Dataset creation
breast_cancer_dataset = sy.Dataset(
    name="Breast Cancer Biomarker",
    description=description,
    summary=summary,
    citation=citation,
    url=metadata["dataset_doi"],
)

Finally, we can add the two assets to the dataset:

In [None]:
breast_cancer_dataset.add_asset(features_asset)

breast_cancer_dataset.add_asset(targets_asset)

Let’s finally have a look at the newly created `breast_cancer_dataset` object, using the default rich representation offered by PySyft:

In [None]:
breast_cancer_dataset

To upload a new dataset to the Datasite, we can call the `upload_dataset` function from the available client:

In [None]:
client.upload_dataset(dataset=breast_cancer_dataset)

Well done! 👏

The dataset has finally reached the Datasite 🎉.

To verify that, we could explore all the datasets accessible through our client object:

In [None]:
client.datasets

Once we are done uploading the dataset, we can shut down the running server using the `land` function:

In [None]:
data_site.land()

Congrats on completing Part 1 🎉

### Part 2: Clients and Datasite Access

Once our new `cancer-research-centre` Datasite has been set up with the newly created dataset (and its assets), the next step for Owen is to configure the access credentials and policies, so that Rachel can operate on the Datasite as a Data Scientist.

At the end of Part 1, after uploading the “Breast Cancer Biomarker” dataset to the Datasite, we called the `data_site.land()` function to shut down the server. To reconnect, we can call the `syft.orchestra.launch` function again, using the same value for the `name` parameter, namely `name="cancer-research-centre"`.

However, this time we explicitly pass `reset=False` (the default for this parameter) to make sure that the persisted state is restored. In other words, when we reconnect to the Datasite, we expect to find the “Breast Cancer Biomarker” dataset already uploaded.

In [None]:
data_site = sy.orchestra.launch(name="cancer-research-centre", reset=False)

# logging in as root client with default credentials
client = data_site.login(email="info@openmined.org", password="changethis")

Let’s quickly double-check that the `Breast Cancer Biomarker` dataset is present and accessible through the available `datasets`:

In [None]:
client.datasets

At the beginning, Owen logged in using the default credentials provided by PySyft. As part of setting up the `cancer-research-centre` Datasite, it is now time for Owen to set their own credentials and update their profile information.

To update email, and password, we can use the functions `client.account.set_email([new_email])` and `client.account.set_password([new_password])`, respectively.

To update profile information, we can use `client.account.update([name, institution, website, role])`.

In [None]:
OWEN_EMAIL = "owen@cancer-research.science"
OWEN_PASSWD = "cancer_research_syft_admin"

client.account.set_email(OWEN_EMAIL)

# we can bypass the confirmation by using the confirm=False parameter
client.account.set_password(OWEN_PASSWD, confirm=False)

Let’s now change Owen’s profile information:

In [None]:
client.account.update(name="Owen, the Data Owner",
                 institution="Cancer Research Centre")

Let’s now immediately test our new credentials by instantiating a new (root) client, and accessing registered users info:

In [None]:
client = data_site.login(email=OWEN_EMAIL, password=OWEN_PASSWD)
client.users

As expected, the new credentials worked, and all the information in Owen’s profile has been updated accordingly! From now on, whenever Owen connects to the Datasite, they will use this new set of credentials.

The last problem Owen needs to solve is to allow Rachel to connect to the Datasite! In other words, Owen needs to add a new user on the Datasite, to be registered with the role of Data Scientist!

We can use the `client.users.create()` function, which expects the following parameters:

* `name` (type: str): mandatory
* `email` (type: str): mandatory
* `password` (type: str): mandatory
* `password_verify` (type: str): mandatory
* `institution` (type: str): optional
* `website` (type: str): optional
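
The role of `password_verify` is to catch typos before an account is created. As a rough illustration, the sketch below mirrors the parameter list above with a local pre-check; `validate_new_user` is a hypothetical helper written for this example and is not part of PySyft.

```python
from typing import Optional


def validate_new_user(
    name: str,
    email: str,
    password: str,
    password_verify: str,
    institution: Optional[str] = None,
    website: Optional[str] = None,
) -> dict:
    """Hypothetical pre-check mirroring the parameter list above (not PySyft code)."""
    # name, email and password must be non-empty.
    for field_name, value in [("name", name), ("email", email), ("password", password)]:
        if not value:
            raise ValueError(f"{field_name} is mandatory")
    # password_verify guards against typos: it must repeat the password exactly.
    if password != password_verify:
        raise ValueError("password and password_verify do not match")
    return {"name": name, "email": email, "institution": institution, "website": website}


user = validate_new_user(
    name="Dr. Rachel Science",
    email="rachel@datascience.inst",
    password="syftrocks",
    password_verify="syftrocks",
)

try:
    validate_new_user("Dr. Rachel Science", "rachel@datascience.inst", "syftrocks", "syftrokcs")
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```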

Let’s use this function to create a new account for Rachel:

In [None]:
rachel_account_info = client.users.create(
    email="rachel@datascience.inst",
    name="Dr. Rachel Science",
    password="syftrocks",
    password_verify="syftrocks",
    institution="Data Science Institute",
    website="https://datascience_institute.research.data"
)

print(f"New User: {rachel_account_info.name} ({rachel_account_info.email}) registered as {rachel_account_info.role}")

The function returns a `UserView` instance, which includes read-only information about the newly created account.

By default, the new account for Rachel has been registered on the Datasite with the data scientist role.

To verify that the account has been successfully added to the Datasite, we can see again the list of available users:

In [None]:
client.users

Congrats on completing Part 2 🎉

### Part 3: Propose the Research Study

For now we are finished with Owen, the data owner. This section focuses on Rachel, the data scientist.

First, let’s make sure that the local development Datasite is running. If not, `syft.orchestra.launch` will bootstrap the server instance once again.

In [None]:
data_site = sy.orchestra.launch(name="cancer-research-centre")



Now it is time for Rachel to log in to the Datasite using the new credentials, which Owen sent separately:

In [None]:
client = data_site.login(email="rachel@datascience.inst", password="syftrocks")

After logging in to the Datasite, Rachel, as a data scientist, can explore the available datasets. We can easily do so by accessing `client.datasets`:

In [None]:
client.datasets

As expected, the Datasite contains one dataset, named `Breast Cancer Biomarker`, which includes 2 assets.

Once we have identified the dataset we are interested in, we can access it either by index or by its unique `name`:

In [None]:
bc_dataset = client.datasets["Breast Cancer Biomarker"]

We obtained `bc_dataset`, which is a pointer to the remote dataset.

In [None]:
bc_dataset

Using a pointer to a remote dataset, we can access its internal assets either by `index` or by their unique names. In our example, we can create a pointer to the *features asset*, and the *targets asset*:

In [None]:
features, targets = bc_dataset.assets  # using Python tuple unpacking

Let’s now validate the assumption that only `mock` data is accessible to a data scientist, and that `data` is not. We will do so by using the `features` and `targets` variables, which are pointers to their corresponding remote assets.

In [None]:
features.mock.head(n=3)  # pandas.DataFrame

In [None]:
targets.mock.head(n=3)

And what about data?

In [None]:
features.data

In [None]:
targets.data

As expected, Rachel, as a data scientist, does not have read permissions (nor any other permissions, ed.) on the non-public information stored in the remote asset.

This clear distinction between the main components of an asset has the following advantages:

1. mock data is open-access and imposes no risks to the data owner for sharing publicly non-public information;

2. it creates a staging environment for the data scientist to simulate their intended study in a realistic way;

3. reduces liability for the data scientist, who is not responsible anymore for storing safely non-public data;

4. enables the data owner to control how non-public assets can be used by data scientists for their study.
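
The split between mock and real data can be pictured with a toy model in plain Python. This is only a sketch of the idea, not how PySyft implements permissions; `ToyAsset`, its `role` argument, and the values are hypothetical.

```python
class ToyAsset:
    """Hypothetical asset holding real and mock data side by side."""

    def __init__(self, name: str, data: list, mock: list) -> None:
        self.name = name
        self._data = data
        self.mock = mock  # mock data is readable by anyone

    def data_for(self, role: str) -> list:
        # Only the data owner may read the real, non-public values.
        if role != "data_owner":
            raise PermissionError(
                f"role '{role}' cannot read real data of '{self.name}'"
            )
        return self._data


asset = ToyAsset(name="features", data=[17.99, 20.57], mock=[32.1, 35.0])

# A data scientist can always work with the mock values...
scientist_view = asset.mock

# ...but is refused when asking for the real ones.
try:
    asset.data_for("data_scientist")
    scientist_saw_real_data = True
except PermissionError:
    scientist_saw_real_data = False

owner_view = asset.data_for("data_owner")
```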

Access to the mock data gives a general understanding of what the non-public data looks like, so we can use it to start preparing the code that will eventually run on the real data.

Rachel decides to study the breast cancer data by running a simple supervised machine learning experiment using the scikit-learn library. The dataset is represented as a `pandas.DataFrame`, and the features are already in the format expected by machine learning models: a `samples x features` matrix. This conclusion has been derived by looking at the mock data, so we assume it also applies to the real data.

In [None]:
X, y = features.mock, targets.mock

In short, these are the steps of the machine learning experiment that Rachel has in mind:

1. use the train_test_split function to generate training and testing partitions;

2. apply StandardScaler to normalise features;

3. train a LogisticRegression model;

4. calculate accuracy_score on training, and testing data.

For simplicity, let’s wrap the whole pipeline into a single Python function. In this way it will be easier to prepare our code request to send to PySyft for execution.

In [None]:
def ml_experiment_on_breast_cancer_data(features_data, labels, seed: int = 12345) -> tuple[float, float]:
    # include the necessary imports in the main body of the function
    # to prepare for what PySyft would expect for submitted code.
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = features_data, labels.values.ravel()
    # 1. Data Partition
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=seed, stratify=y)
    # 2. Data normalisation
    scaler = StandardScaler()
    scaler.fit(X_train, y_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    # 3. Model training
    model = LogisticRegression().fit(X_train, y_train)
    # 4. Metrics Calculation
    acc_train = accuracy_score(y_train, model.predict(X_train))
    acc_test = accuracy_score(y_test, model.predict(X_test))

    return acc_train, acc_test
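
Step 2 of the function fits the scaler on the training partition only and reuses those statistics for the test partition. As a worked illustration, the NumPy sketch below reproduces the computation that `StandardScaler` performs with its default settings, z = (x - mean) / std with the population standard deviation; the small arrays are made up.

```python
import numpy as np

# Made-up feature matrices: rows are samples, columns are features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0], [5.0, 50.0]])

# Statistics come from the training partition only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

X_train_scaled = (X_train - mean) / std
# The test partition is transformed with the *training* mean and std.
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, each training column has zero mean and unit standard deviation. The test columns generally do not, because they are expressed relative to the training statistics, which keeps information about the test partition out of the fitted pipeline.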

Let’s call the function on the mock data, to check that everything works:

In [None]:
ml_experiment_on_breast_cancer_data(features_data=features.mock, labels=targets.mock)

We have verified that `ml_experiment_on_breast_cancer_data` runs successfully on the mock data locally. Now we would like to run that function on the real data. In particular, we need to transform our (local) Python function into a remote code request: a function that can be processed and executed remotely on the Datasite, where the real data is stored.

In [None]:
remote_user_code = sy.syft_function_single_use(features_data=features, labels=targets)(ml_experiment_on_breast_cancer_data)
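
The call above uses a decorator factory directly: `factory(**inputs)(function)` is the same as writing `@factory(**inputs)` on top of the function definition. The plain-Python sketch below illustrates that pattern, together with the single-use idea suggested by the name; `toy_single_use` is a hypothetical decorator written for this example and is not part of PySyft.

```python
import functools
from typing import Any, Callable


def toy_single_use(**bound_inputs: Any) -> Callable:
    """Hypothetical decorator factory: bind inputs to a function, allow one call."""

    def decorator(func: Callable) -> Callable:
        already_used = False

        @functools.wraps(func)
        def wrapper() -> Any:
            nonlocal already_used
            if already_used:
                raise RuntimeError(f"{func.__name__} can only be executed once")
            already_used = True
            # The inputs were fixed at decoration time, not at call time.
            return func(**bound_inputs)

        return wrapper

    return decorator


def add(a: int, b: int) -> int:
    return a + b


# Same shape as the call above: factory(inputs)(function).
bound_add = toy_single_use(a=2, b=3)(add)
first_result = bound_add()

try:
    bound_add()
    second_call_allowed = True
except RuntimeError:
    second_call_allowed = False
```

Binding the inputs when the function is wrapped, rather than when it is called, is what lets a reviewer see exactly which assets the code will touch before approving it.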

Rachel must now submit her project for Owen to approve. In essence, a Project (i.e. `syft.Project`) is composed of one or more code requests, and includes a short description to communicate the intent of the study to the data owner.

In [None]:
description = """
    The purpose of this study will be to run a machine learning
    experimental pipeline on breast cancer data.
    As a first attempt, the pipeline includes a normalisation step for
    features using a StandardScaler.
    The selected ML model is Logistic regression, with the intent
    to gather the accuracy scores on both training, and testing
    data partitions, randomly generated.
"""

# Create a project

research_project = client.create_project(
    name="Breast Cancer ML Project",
    description=description,
    user_email_address="rachel@datascience.inst"
)


We can access the list of available projects through our client:

In [None]:
client.projects

We can use the `create_code_request` method to attach our new code request to our `syft.Project` instance, i.e. `research_project`:

In [None]:
code_request = research_project.create_code_request(remote_user_code, client)
code_request

We can now check that the code request has reached the project by accessing `client.code`. We can see we do indeed have a code request, in PENDING status. Similarly, we can review our existing requests, by accessing `client.requests`:

In [None]:
client.code
client.requests

Let’s say Rachel is very impatient and tries to force the execution of a request that has not yet been reviewed or approved.

In [None]:
client.code.ml_experiment_on_breast_cancer_data(features_data=features, labels=targets)

As expected, if we try to execute a code request that has not yet been approved, a `SyftError` is returned!

Congrats on completing Part 3 🎉

### Part 4: Review Code Request

As always, the first step is to log in to the Datasite. This time, we will log in using Owen’s credentials as the data owner. Then, we can get access to existing projects through our `client` instance:

In [None]:
data_site = sy.orchestra.launch(name="cancer-research-centre")

client = data_site.login(email="owen@cancer-research.science", password="cancer_research_syft_admin")
client.projects

As expected, the Datasite currently includes a request from Rachel for her “Breast Cancer ML Project”. Looking at the description, Owen can get a general understanding of what to expect in the incoming code request.

Let’s get access to the request, so that it can be inspected further. Existing requests can be accessed by `index`:

In [None]:
request = client.requests[0]
request

Starting from the `request` object, we can immediately get a reference to the code associated with it. This is the code submitted by the data scientist and attached to the original project.

Before proceeding to test the code execution, the data owner can review the code, and double check that the expectations set in the project description are met:

In [None]:
request.code

After reviewing the code, the next step for Owen is to execute it on both the mock and the real data of the assets specified in the submitted code. From Rachel’s code, we can see that the function expects a features asset and a labels asset, both available in the “Breast Cancer Biomarker” dataset.

First, let’s get the reference to the specific function along with access to the required assets:

In [None]:
syft_function = request.code
bc_dataset = client.datasets["Breast Cancer Biomarker"]
features, labels = bc_dataset.assets

At this point, the data owner can first run the `syft_function` on `features.mock` and `labels.mock`, and then repeat the same for `features.data` and `labels.data`:

In [None]:
result_mock_data = syft_function.run(features_data=features.mock, labels=labels.mock)
result_mock_data

Having checked that the code runs on the mock data, we can run it on the real data and gather the results Rachel is waiting for:

In [None]:
result_real_data = syft_function.run(features_data=features.data, labels=labels.data)
result_real_data

Now that we have reviewed, checked, and tested Rachel’s function on the selected assets, and also gathered the result on the real non-public data, Owen can proceed to approve the code request:

In [None]:
request.approve()
client.requests

As expected, the status of Rachel’s request is now Approved.

Congrats on completing Part 4 🎉

### Part 5: Retrieving Results

As expected, the very first thing to do is always to log in to the Datasite, making sure that the local development server is up and running.

In [None]:
data_site = sy.orchestra.launch(name="cancer-research-centre")

client = data_site.login(email="rachel@datascience.inst", password="syftrocks")

To check the status of our request, we can access `client.requests`:

In [None]:
client.requests

🎉 Whoot whoot!

Our request has been approved by the Data Owner. All we need to do now is execute our code and gather the expected results.

First, we need to get a reference to the `syft.Dataset` we intend to use. In our scenario, we will use the two assets, i.e. features and labels, included in the “Breast Cancer Biomarker” dataset.

In [None]:
bc_dataset = client.datasets["Breast Cancer Biomarker"]
features, labels = bc_dataset.assets

We can now compute the long-awaited result:

In [None]:
result = client.code.ml_experiment_on_breast_cancer_data(features_data=features, labels=labels).get()
result