In [1]:
# Optional NoteXbook Theme
%load_ext notexbook
%texify

# Introduction to PySyft

In a nutshell, PySyft helps organisations securely collaborate with external (untrusted) individuals. By using PySyft, organisations can enable external auditors to use their private assets, in order to conduct a study with a specific, known purpose. 

Besides this, the workflow prevents the external auditor from using the private assets for any other purpose. 
This is called **secure external access**.

There are two types of assets that an organisation can have:
- private datasets
- private models: already-trained ML models

## The Workflow

<img src="./syft_workflow.png" />

In a nutshell:

- The organization (i.e. Data Owner, `DO`) uses PySyft to host **private data** along with a **mock** version of the dataset.
- The Data Scientist (`DS`) gets access to the PySyft domain, and starts tinkering with the _mock_ dataset.
    - Private data **cannot** be accessed at this stage by the DS.
- Once ready, the DS submits their proposal and code to work on private data on the PySyft node.
- The DO receives and reviews the request, checking it against the organization's data policies and legal requirements.
- If compliant, the DO runs the code against the private data, and deposits the result.
- The DS receives and downloads the result.
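The steps above can be sketched as a tiny request lifecycle. This is only a toy model written in plain Python: `ToyDomain`, `ToyRequest` and `Status` are hypothetical names made up for illustration and are **not** part of the PySyft API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class ToyRequest:
    # Hypothetical request object: the submitted code plus its review state
    code: Callable[[Any], Any]
    status: Status = Status.PENDING
    result: Optional[Any] = None
    reason: Optional[str] = None


@dataclass
class ToyDomain:
    # Hypothetical domain: hosts private data next to its mock version
    private: Any
    mock: Any
    requests: list = field(default_factory=list)

    def submit(self, code):
        # DS side: hand the code over for review, nothing is executed yet
        request = ToyRequest(code=code)
        self.requests.append(request)
        return request

    def approve(self, request):
        # DO side: the only place where the private data is touched
        request.result = request.code(self.private)
        request.status = Status.APPROVED

    def deny(self, request, reason):
        # DO side: refuse the request with a written reason
        request.status = Status.DENIED
        request.reason = reason


domain = ToyDomain(private=[3, 5, 8], mock=[1, 1, 1])

draft = sum(domain.mock)         # the DS prototypes on mock data only
request = domain.submit(sum)     # the DS submits the code for review
pending_result = request.result  # still None: nothing has run on private data

domain.approve(request)          # the DO reviews, runs and deposits the result
print(request.status, request.result)
```

The point of the sketch is the ordering: the result only exists after the Data Owner has approved the request.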

---

## Use Case: Using PySyft to study Breast Cancer

### Preamble 

Role placeholders used throughout the notebook:

**Owen**, the Data Owner 🧝‍♂️ 

**Rachel**, the Data Scientist 🧙‍♀️

### Owen 🧝‍♂️: Set up a Domain Node

In PySyft terminology, a **Domain Node** is a (network) node that contains assets (i.e. `syft.Asset`) to be consumed by external Data Scientists.

These assets will become available as part of a Dataset (i.e. `syft.Dataset`) hosted on the domain node. 

However, the dataset may contain "non-public" information, therefore (a) no data can be shared or released without restrictions; and (b) data cannot leave the original site.

**PySyft** makes it possible to overcome these issues, enabling a new paradigm of **Remote Data Science**.

First we'll create a **private** dataset and a **mock** dataset, both to be uploaded to the domain node.

- **Private** data represents the "non-public" / not-accessible part of the dataset. This data will never be visible or accessible to external Data Scientists.
- **Mock** data is created explicitly so that data scientists can familiarise themselves with the data (e.g. its format or structure) and prepare their analysis, which is later submitted for execution on the real (private) data!

In this notebook, we will consider the Breast Cancer Dataset (available via the `sklearn.datasets.load_breast_cancer` function) as the reference data to be uploaded to the Domain node.

In [None]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer

In [None]:
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

In [None]:
X.head()

In this simple case, let's create a mock version of the true (private) data by adding to each value the _mean_ of its feature (i.e. column).

In [None]:
mock_X = X.apply(lambda s: s+np.mean(s))

# let's attach to the DataFrame the corresponding ML labels as last column
X["y"] = y
mock_X["y"] = y

private_data = X
mock_data = mock_X
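To see what this transformation does, here is a small worked example on a made-up table. The column names and values below are invented for illustration; only the `apply` call mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real feature table (values are made up)
toy_X = pd.DataFrame({"radius": [1.0, 2.0, 3.0], "texture": [10.0, 20.0, 30.0]})

# Same transformation as above: shift every column by its own mean
toy_mock = toy_X.apply(lambda s: s + np.mean(s))

# The mock keeps the schema (columns, shape, dtypes) of the original ...
same_schema = bool(
    list(toy_mock.columns) == list(toy_X.columns)
    and toy_mock.shape == toy_X.shape
    and (toy_mock.dtypes == toy_X.dtypes).all()
)

# ... while, in this toy table, no individual value equals the original one
no_shared_values = not bool((toy_mock.values == toy_X.values).any())

print(toy_mock)
print(same_schema, no_shared_values)
```

A check like `same_schema` is what matters for the Data Scientist: code written against the mock should run unchanged against the private data.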

---

First, let's launch the domain node

In [None]:
import syft as sy

domain_node = sy.orchestra.launch(port="8083", name="pet-test-domain", reset=True)

In [None]:
client = sy.login(port="8083", email="info@openmined.org", password="changethis")
client

#### Uploading data to the domain node

In [None]:
# First create an instance of the Dataset
dataset = sy.Dataset(
    name="Winsconsin Breast Cancer Data",
    description="Breast cancer wisconsin (diagnostic) dataset",
    citation=(
        "O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and "
        "prognosis via linear programming. Operations Research, 43(4), pages 570-577, "
        "July-August 1995."
    ),
    url="https://goo.gl/U2Uwz2"
)

# Create the assets to be attached to the Dataset
data_asset = sy.Asset(
    name="Breast Cancer Data", 
    data=private_data,
    mock=mock_data)

dataset.add_asset(data_asset)

In [None]:
dataset

In [None]:
client.upload_dataset(dataset)

In [None]:
client.datasets

#### Create Data Scientist Account

Next, we'll create an account with the Data Scientist role, which can only log into our domain.

In [None]:
from syft.service.user.user_roles import ServiceRole
from syft.service.user.user import UserCreate

In [None]:
ds_profile = UserCreate(
    email="rachel@datascience.inst",
    name="Rachel Science",
    role=ServiceRole.DATA_SCIENTIST,
    password="abc123",
    password_verify="abc123",
    institution="Data Science Institute",
    website="datascience.inst",
)

client.users.create(ds_profile)

---

### Rachel 🧙‍♀️: Submitting requests to the domain 

Let's switch hats and pretend we are now the Data Scientist: we connect to the domain using our own credentials.

In [None]:
scientist_domain = domain_node.client

In [None]:
scientist_client = scientist_domain.login(email="rachel@datascience.inst", password="abc123")
scientist_client

Now, **Rachel** can access the datasets and assets on the domain:

In [None]:
scientist_client.datasets

In [None]:
scientist_client.datasets[0]

In [None]:
assets = scientist_client.datasets[0].assets

# Success!

In [None]:
list(assets.keys())

Once we have the assets, we can in turn access both the `mock` and the actual `data` (_not really_, ed.).

In [None]:
asset = assets[0]

Here is the mock data, to be used to prepare our code

In [None]:
asset.mock

and here is the data

In [None]:
asset.data

As expected, `data` is **not** accessible, since it is meant to stay private. The only available information is the `mock` dataset!
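The behaviour we just observed can be mimicked with a few lines of plain Python. `ToyAsset` is a hypothetical class written for this illustration; it is not how PySyft implements permissions (in a real deployment the check happens on the node, and the private data never reaches the scientist's machine).

```python
class ToyAsset:
    """Hypothetical stand-in for an asset; not PySyft's implementation."""

    def __init__(self, data, mock, viewer_is_owner):
        self._data = data
        self.mock = mock  # always readable
        self._viewer_is_owner = viewer_is_owner

    @property
    def data(self):
        # Private data is only handed out to the data owner
        if not self._viewer_is_owner:
            raise PermissionError("private data is only visible to the data owner")
        return self._data


rachel_view = ToyAsset(data=[3, 5, 8], mock=[1, 1, 1], viewer_is_owner=False)
print(rachel_view.mock)

try:
    rachel_view.data
    access_granted = True
except PermissionError as err:
    access_granted = False
    print("denied:", err)
```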

#### Using Syft and Machine learning on Data

Now that Rachel is familiar with the dataset, she would like to conduct her study on the dataset available on the node. 

To do so, the first step will be to start tinkering with mock data, in order to prepare her code request to submit to the `DO` for review.

<img src="./syft_ds_workflow.png" />

Let's first gather information about the structure of the mock data

In [None]:
asset.mock.columns

In [None]:
asset.mock.head()

We now have a general understanding of what the data looks like. 

We can start preparing the code to run on this data.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def run_logistic_regression_model(data):
    y = data["y"]
    X = data.drop(columns=["y"])

    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    model = LogisticRegression().fit(X_train,y_train)

    acc_train = accuracy_score(y_train, model.predict(X_train))
    acc_test = accuracy_score(y_test, model.predict(X_test))
    return acc_train, acc_test
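The function returns a pair of accuracies, one on the training split and one on the test split. As a reminder, accuracy is simply the fraction of correctly classified samples, which is what `accuracy_score` computes by default. Here is a minimal sketch with made-up labels (the `accuracy` helper below is written for illustration only):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: three out of four predictions are correct
toy_true = [1, 0, 1, 1]
toy_pred = [1, 0, 0, 1]
toy_acc = accuracy(toy_true, toy_pred)
print(toy_acc)
```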

Let's test that it works as expected on the mock data. 

In [None]:
import warnings

warnings.filterwarnings("ignore")

In [None]:
result = run_logistic_regression_model(data=asset.mock)
result

Amazing! 

Now let's create a Syft function and a **project** to be sent to Owen to get our results based on the real data.

### Creating a Syft Function

To turn a (local) function into a **Syft** function, all we need to do is use the `syft_function_single_use` decorator:

In [None]:
@sy.syft_function_single_use(data=asset)
def run_logistic_regression_model(data):
    # keep the imports inside the function body so that the function is self-contained
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    
    y = data["y"]
    X = data.drop(columns=["y"])

    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    model = LogisticRegression().fit(X_train,y_train)

    acc_train = accuracy_score(y_train, model.predict(X_train))
    acc_test = accuracy_score(y_test, model.predict(X_test))
    return acc_train, acc_test
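Why do the imports live inside the function body? The function is meant to run somewhere else, where the imports of this notebook are not available. The sketch below illustrates the idea with plain Python by executing function source in a fresh namespace. It is a toy illustration under that assumption, not a description of how PySyft ships or executes code; `call_from_source` is a hypothetical helper.

```python
def call_from_source(source, func_name, *args):
    """Define a function from its source in an empty namespace, then call it."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


# Imports travel with the function body: works anywhere
SELF_CONTAINED = '''
def average(values):
    from statistics import mean
    return mean(values)
'''

# Relies on an import made by the caller: breaks in a fresh namespace
RELIES_ON_CALLER = '''
def average(values):
    return mean(values)
'''

ok_result = call_from_source(SELF_CONTAINED, "average", [1, 2, 3])

try:
    call_from_source(RELIES_ON_CALLER, "average", [1, 2, 3])
    broke = False
except NameError:
    broke = True

print(ok_result, broke)
```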

We can also inspect the properties of this syft function like so:

In [None]:
run_logistic_regression_model.kwargs

In [None]:
run_logistic_regression_model.input_policy_type

In [None]:
run_logistic_regression_model.output_policy_type

In [None]:
run_logistic_regression_model

#### Submitting a request to the domain


Now that we've created our syft function and it has all of the properties we would expect, we're ready to submit it to a project! Let's create a new project for our logistic regression study:

In [None]:
scientist_client.create_project(
    name="Logistic Regression Study",
    description="Running a LogReg Model",
    user_email_address="rachel@datascience.inst"
)

In [None]:
scientist_client.projects

In [None]:
ml_project = scientist_client.get_project(name="Logistic Regression Study")
ml_project

In [None]:
ml_project.create_code_request(run_logistic_regression_model, scientist_client)

In [None]:
scientist_client.code

Let's see if the result is already accessible (it should not be, as Owen has not reviewed the request yet).

In [None]:
scientist_client.code.run_logistic_regression_model(data=asset)

---


### Owen 🧝‍♂️ : Review and Execute Code Requests on the domain node

Let's switch hats and pretend we are now _Owen, the Data Owner_ so we can:
-  Connect to the domain to check for incoming requests
-  Review the code requests
-  Connect to the domain to execute approved requests on the private data
-  Submit the result back on the domain

Recall once again that the work of the data owner and data scientist would take place on different machines and separate notebooks, but we opted to present this "hat exchange" to be able to show all of our code in a single notebook.

#### Connect to the domain and check incoming requests

In [None]:
do_client = sy.login(port="8083", email="info@openmined.org", password="changethis")
do_client.requests

We see that a request is pending, so let's inspect, review and answer it.

In [None]:
request = do_client.requests[0]
request

When a request is received, the Data Owner has two options:

- answer by depositing a response computed on the private counterpart of the data
- deny, by providing a written reason

Before that, the Data Owner would:

- inspect that the code is not malicious
- retrieve a callable reference to the method
- run the method against mock data for safety
- in case of approval, run the method against the private data
- in case of denial, specify the reason

In [None]:
request.code

Let's assume the code is not malicious and we would like to answer the request. 
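Reading the code remains the Data Owner's job, but a small script can help with the inspection step. The sketch below uses Python's standard `ast` module to list the modules imported by a piece of submitted source and flags those outside an allow-list. The `submitted` string and the `allowed` set are made up for illustration, and how the source text is obtained from the request is not shown here. This is an aid for the reviewer, not a security guarantee.

```python
import ast


def list_imports(source):
    """Return the top-level module names imported anywhere in `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules


# Made-up submission that sneaks in an unexpected import
submitted = '''
def run(data):
    from sklearn.linear_model import LogisticRegression
    import os
    return os.listdir(".")
'''

allowed = {"sklearn", "numpy", "pandas"}  # hypothetical allow-list
found = list_imports(submitted)
unexpected = sorted(found - allowed)
print(unexpected)
```

Anything that ends up in `unexpected` deserves a closer look before the code is run, even on mock data.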

#### Get a callable method to run on the private asset

In [None]:
callable_method = request.code.unsafe_function

In [None]:
do_client.datasets.get_all()

In [None]:
do_client.datasets["Winsconsin Breast Cancer Data"].assets[0].mock

In [None]:
do_asset = do_client.datasets["Winsconsin Breast Cancer Data"].assets[0]

Owen tests the function submitted by Rachel on both `mock` and `private` data assets

In [None]:
mock_result = callable_method(data=do_asset.mock)
mock_result

In [None]:
private_data_result = callable_method(data=do_asset.data)
private_data_result

#### Deposit the real result back to the domain

Once the results on the true (private) data have been collected, they can be deposited so that they become accessible to the data scientist who requested them: Rachel, in this case!

In [None]:
request.accept_by_depositing_result(private_data_result)

---

### Rachel 🧙‍♀️: Pull the result from the domain

Let's switch hats for one last time, now as a Data Scientist, so we can pull the final result from the domain.

In [None]:
scientist_client = scientist_domain.login(email="rachel@datascience.inst", password="abc123")

In [None]:
scientist_client.code

In [None]:
ptr = scientist_client.code.run_logistic_regression_model(data=scientist_client.datasets[0].assets[0])

In [None]:
ptr.get()

We've got our results without needing to access or even see the true, private data asset!!! 🎉🥳🎉