# Turbofan POC Part 1: Dataloading
CAA 20/07/2020

In this notebook, we will use PyGrid and PySyft to train a model with differential privacy and multi-party computation using a federated approach.

Dependencies for this notebook:
- miniconda3 or anaconda3 (for environment management)
- Python >= 3.7
- PySyft 0.2.7

NOTE: Before running this notebook, ensure that you have run `bash prep_turbofan.sh`.

## 1.1 Run PyGridNetwork and PyGridNode in the background

In this example, we distribute Turbofan data from a local directory to two workers. The processes run at the following addresses:

1. Instance 1 (3.19.72.20)
    1. PyGridNetwork at port 5000
    1. This Jupyter Notebook at port 8000. The notebook can run on any server that hosts a PyGridNetwork, or a PyGridNode associated with that PyGridNetwork.
1. Instance 2 (18.221.43.195)
    1. Worker Bob: PyGridNode at port 3000
    1. Worker Alice: PyGridNode at port 3001
    


To allow communication *between* training workers and any coordinating servers, run PyGridNetwork on a server of your choice as follows:

1. Clone [PyGridNetwork](https://github.com/OpenMined/PyGridNetwork)
1. Change into the cloned PyGridNetwork directory
1. Create and activate a `conda` environment (can be shared by PyGridNetwork and PyGridNode)
1. Install dependencies: `pip install openmined.gridnetwork`
1. Run PyGridNetwork: `python -m gridnetwork --port DESIRED_PORT --start_local_db`


To let workers communicate with the PyGridNetwork process, start one PyGridNode per desired worker on each server. Repeat the following steps for every worker:

1. Clone [PyGridNode](https://github.com/OpenMined/PyGridNode)
1. Change into the cloned PyGridNode directory
1. Create and activate a `conda` environment (can be shared by PyGridNetwork and PyGridNode)
1. Install dependencies: `pip install .`
1. Run PyGridNode with a unique `--id` per worker (here `alice`): `python -m gridnode --id alice --port DESIRED_PORT --host SERVER_IPV4_ADDRESS --gateway_url HTTPS_URL_OF_PYGRIDNETWORK_SERVER`

(NOTE: PyGridNode will be deprecated and its functionality moved into the PySyft library.)
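
Before moving on, it can help to confirm that each process is actually listening. The sketch below is not part of PyGrid or PySyft: `is_port_open` and `report` are hypothetical helpers that only use Python's standard `socket` module to attempt a TCP connection. The addresses are the ones listed in this section and should be replaced with your own.

```python
import socket

# Addresses from the layout in this section; replace with your own instances.
ENDPOINTS = {
    "PyGridNetwork": ("3.19.72.20", 5000),
    "Worker Bob": ("18.221.43.195", 3000),
    "Worker Alice": ("18.221.43.195", 3001),
}

def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # refused, unreachable, or timed out
        return False

def report(endpoints, timeout=2.0):
    """Map each endpoint name to whether its port accepted a connection."""
    return {
        name: is_port_open(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }
```

Calling `report(ENDPOINTS)` returns a dictionary such as `{"PyGridNetwork": True, ...}`. A `False` entry only means the TCP connection failed; it does not say whether the cause is a stopped process or a firewall rule.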



## 1.2 Populate nodes with data
It is possible to use a coordinating server to distribute data to the nodes, either from a remote source or from a local directory.

IMPORTANT! Before running this section, make sure to clone the [OpenMined Turbofan POC](https://github.com/matthiaslau/Turbofan-Federated-Learning-POC) repository and follow its instructions for downloading and preprocessing the dataset.

### Import dependencies

In [35]:
import syft as sy
from syft.grid.clients.dynamic_fl_client import DynamicFLClient
import torch
import pandas as pd
from numpy.random import laplace
from math import floor

from federated_trainer.helper.data_helper import (
    _load_data,
    WINDOW_SIZE,
    _drop_unnecessary_columns,
    _transform_to_windowed_data,
    get_data_loader,
    _clip_rul,
)

### Define helper functions

In [36]:
def add_rul_to_train_data(train_data):
    """ Calculate and add the RUL to all rows in the given training data.

    :param train_data: The training data
    :return: The training data with added RULs
    """
    # retrieve the last recorded cycle per engine; RUL is measured relative to it
    train_rul = pd.DataFrame(train_data.groupby('engine_no')['time_in_cycles'].max()).reset_index()

    # merge the RULs into the training data
    train_rul.columns = ['engine_no', 'max']
    train_data = train_data.merge(train_rul, on=['engine_no'], how='left')

    # add the current RUL for every cycle
    train_data['RUL'] = train_data['max'] - train_data['time_in_cycles']
    train_data.drop('max', axis=1, inplace=True)

    return train_data

def round_to_multiple(x, base):
    '''
    Round x down to multiple of base
    '''
    return base * floor(x/base)

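def batch(tensor, batch_size, perm=None):
    '''
    Shuffle tensor along dim 0 and reshape it into minibatches.
    perm: optional precomputed permutation, so that several tensors
        can be shuffled in the same order
    '''
    features_size = tensor.shape[1:]
    # shuffle and batch
    if perm is None:
        perm = torch.randperm(tensor.shape[0])
    # drop the final chunk, which is undersized unless the row count is a
    # multiple of batch_size (a full-size final chunk is dropped as well)
    out = tensor[perm].split(batch_size)[:-1]
    out = torch.cat(out).view(-1, batch_size, *features_size)
    return out

def tuple_batch(tensors, batch_size):
    '''
    tensors: tuple of tensors with the same first dimension
    '''
    # share one permutation so that features and labels stay aligned
    perm = torch.randperm(tensors[0].shape[0])
    return (batch(t, batch_size, perm) for t in tensors)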
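
To make the RUL calculation in `add_rul_to_train_data` concrete, here is a dependency-free sketch of the same arithmetic on a toy table. The rows and the helper name `add_rul` are made up for illustration; the real function operates on a pandas DataFrame with `engine_no` and `time_in_cycles` columns.

```python
def add_rul(rows):
    """Attach RUL = (last recorded cycle of the engine) - (current cycle) to each row."""
    # last recorded cycle per engine
    max_cycle = {}
    for row in rows:
        engine = row["engine_no"]
        max_cycle[engine] = max(max_cycle.get(engine, 0), row["time_in_cycles"])
    # remaining useful life for every row
    return [
        dict(row, RUL=max_cycle[row["engine_no"]] - row["time_in_cycles"])
        for row in rows
    ]

# Toy data (hypothetical): engine 1 runs for 3 cycles, engine 2 for 2 cycles.
toy = [
    {"engine_no": 1, "time_in_cycles": 1},
    {"engine_no": 1, "time_in_cycles": 2},
    {"engine_no": 1, "time_in_cycles": 3},
    {"engine_no": 2, "time_in_cycles": 1},
    {"engine_no": 2, "time_in_cycles": 2},
]
ruls = [row["RUL"] for row in add_rul(toy)]
print(ruls)  # [2, 1, 0, 1, 0]
```

Each engine's RUL counts down to zero at its last recorded cycle, which is the label the model is later trained to predict.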

### Set up configs

In [37]:
DATA_PATH = "./data"
DATA_NAME = "train_data_initial.txt"
MINIBATCH_SIZE = 4
NOISE = 0.2
DP_TYPE = 'local'

### Set up network

In [38]:
# Hook Torch
hook = sy.TorchHook(torch)

nodes = ["ws://18.221.43.195:3000/",
         "ws://18.221.43.195:3001/"]

compute_nodes = []
for node in nodes:
    compute_nodes.append(DynamicFLClient(hook, node))



### Load dataset
The code below will load prepared data from the Turbofan POC repository.

In [39]:
data = _load_data(DATA_NAME, DATA_PATH)
data_dropcol = _drop_unnecessary_columns(data)
data_rul = add_rul_to_train_data(data_dropcol)
x, y = _transform_to_windowed_data(data_rul, WINDOW_SIZE)
y = _clip_rul(y)
# transform to torch tensors
tensor_x = torch.Tensor(x)
tensor_y = torch.Tensor(y)

1209 features with shape (80, 11)
1209 labels with shape (1209, 1)


#### Optional: Add differential privacy to data
We can add noise to the data at this point if we want to simulate the addition of noise by distributed data owners.

In [40]:
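def laplacian_mechanism(input_tensor, sensitivity=0.5, epsilon=0.05):
    '''
    Add Laplace noise with scale sensitivity / epsilon to input_tensor.
    sensitivity and epsilon are arbitrarily
    chosen for now
    '''
    beta = sensitivity / epsilon
    # draw independent noise for every element (a single draw would shift
    # the whole tensor by one constant) and match the input dtype
    noise = torch.tensor(laplace(0, beta, tuple(input_tensor.shape)), dtype=input_tensor.dtype)
    return input_tensor + noise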

def add_noise(input_tensor, p_noise):
    '''
    input_tensor: input tensor
    p_noise: probability that an element is kept unchanged; each element is
        replaced by its noisy version with probability 1 - p_noise
    '''
    be_honest = (torch.rand(input_tensor.shape) < p_noise).float()
    tensor_artificial = laplacian_mechanism(input_tensor)
    # keep the honest elements and replace the rest with their noisy counterparts
    mod_tensor = input_tensor.float() * be_honest + (1 - be_honest) * tensor_artificial
    return mod_tensor.type(torch.float32)

if DP_TYPE=='local':
    tensor_x = add_noise(tensor_x, NOISE)
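
For reference, the Laplace mechanism draws noise from a Laplace distribution centred at zero with scale `sensitivity / epsilon`. The sketch below uses only the standard library to show what that scale implies for the default arguments above (0.5 / 0.05 = 10). `laplace_scale` and `sample_laplace` are hypothetical helpers, not functions from PySyft or NumPy; the sampler relies on the fact that the difference of two independent exponential draws with mean `scale` follows a Laplace(0, `scale`) distribution.

```python
import random

def laplace_scale(sensitivity, epsilon):
    """Scale (often written b or beta) of the Laplace mechanism."""
    return sensitivity / epsilon

def sample_laplace(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

scale = laplace_scale(0.5, 0.05)
rng = random.Random(0)  # seeded only to make the sketch repeatable
samples = [sample_laplace(scale, rng) for _ in range(20000)]

# The expected absolute value of Laplace(0, b) noise equals b.
mean_abs = sum(abs(s) for s in samples) / len(samples)
print(scale)     # 10.0
print(mean_abs)  # close to the scale
```

A smaller `epsilon` (stronger privacy) or a larger `sensitivity` increases the scale and therefore the typical magnitude of the noise added to each value.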

#### Create dataloader

In [41]:
dataset_train = torch.utils.data.TensorDataset(tensor_x, tensor_y)
trainloader = torch.utils.data.DataLoader(
    dataset_train,
    # split data equally among nodes with shuffle
    batch_size=len(dataset_train) // len(compute_nodes),
    shuffle=True,
    drop_last=True,
    # pin_memory=True,  # for faster dataloading to CUDA
)
dataiter = iter(trainloader)
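
The sizes that result from this split follow from integer arithmetic. Below is a short sketch using the 1209 samples reported above, 2 nodes, and `MINIBATCH_SIZE = 4`; the function name `split_sizes` is only for illustration. It mirrors two behaviours of the code in this notebook: `drop_last=True` discards samples that do not fill a whole per-node batch, and the `batch` helper discards the final minibatch chunk.

```python
def split_sizes(n_samples, n_nodes, minibatch_size):
    """Return (samples per node, samples lost to drop_last, minibatches per node)."""
    per_node = n_samples // n_nodes                    # DataLoader batch_size
    dropped_samples = n_samples - per_node * n_nodes   # leftover batch removed by drop_last=True
    chunks = -(-per_node // minibatch_size)            # ceiling division: chunks from Tensor.split
    minibatches = chunks - 1                           # the batch helper drops the final chunk
    return per_node, dropped_samples, minibatches

print(split_sizes(1209, 2, 4))  # (604, 1, 150)
```

The 150 minibatches of size 4 per node agree with the `torch.Size([150, 4, 80, 11])` shape reported for the tensors sent to each worker.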

### Tag and send split datasets to each worker

In [42]:

shared_x = []
shared_y = []
for node in compute_nodes:
    # create minibatches
    worker_batch = next(dataiter)
    sensors_train_tfan, labels_train_tfan = tuple_batch(worker_batch, MINIBATCH_SIZE)
    print(sensors_train_tfan.shape, labels_train_tfan.shape)
    # Tag tensors (allows them to be retrieved later)
    if DP_TYPE == 'local':
        tagged_sensors = sensors_train_tfan.tag("#X", "#localdp", "#turbofan", "#dataset").describe("The input datapoints to the Turbofan dataset.")
    else:
        # no local differential privacy applied, so omit the #localdp tag
        tagged_sensors = sensors_train_tfan.tag("#X", "#turbofan", "#dataset").describe("The input datapoints to the Turbofan dataset.")
    tagged_label = labels_train_tfan.tag("#Y", "#turbofan", "#dataset").describe("The input labels to the Turbofan dataset.")
    
    shared_x.append(tagged_sensors.send(node))
    shared_y.append(tagged_label.send(node))

Exception ignored in: <function ObjectPointer.__del__ at 0x7fc9f16960e0>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/syft/lib/python3.7/site-packages/syft/generic/pointers/object_pointer.py", line 346, in __del__
    self.owner.send_msg(ForceObjectDeleteMessage(self.id_at_location), self.location)
  File "/home/ubuntu/anaconda3/envs/syft/lib/python3.7/site-packages/syft/workers/base.py", line 309, in send_msg
    bin_response = self._send_msg(bin_message, location)
  File "/home/ubuntu/anaconda3/envs/syft/lib/python3.7/site-packages/syft/workers/virtual.py", line 16, in _send_msg
    return location._recv_msg(message)
  File "/home/ubuntu/anaconda3/envs/syft/lib/python3.7/site-packages/syft/workers/websocket_client.py", line 106, in _recv_msg
    response = self._forward_to_websocket_server_worker(message)
  File "/home/ubuntu/anaconda3/envs/syft/lib/python3.7/site-packages/syft/grid/clients/dynamic_fl_client.py", line 155, in _forward_to_websocket_server_wor

In [43]:
print("X tensor pointers: ", shared_x)
print("Y tensor pointers: ", shared_y)

X tensor pointers:  [(Wrapper)>[PointerTensor | me:56856317696 -> alice:75639148278]
	Tags: #localdp #X #turbofan #dataset 
	Shape: torch.Size([150, 4, 80, 11])
	Description: The input datapoints to the Turbofan dataset...., (Wrapper)>[PointerTensor | me:9037347334 -> bob:28395798793]
	Tags: #localdp #X #turbofan #dataset 
	Shape: torch.Size([150, 4, 80, 11])
	Description: The input datapoints to the Turbofan dataset....]
Y tensor pointers:  [(Wrapper)>[PointerTensor | me:658473828 -> alice:93578425169]
	Tags: #Y #turbofan #dataset 
	Shape: torch.Size([150, 4, 1])
	Description: The input labels to the Turbofan dataset...., (Wrapper)>[PointerTensor | me:82313452988 -> bob:21619885372]
	Tags: #Y #turbofan #dataset 
	Shape: torch.Size([150, 4, 1])
	Description: The input labels to the Turbofan dataset....]


### Disconnect nodes

Close the connections so that the training process in the Part 2 notebook, if it runs on the same server, does not use cached or local data for training.

In [44]:
for node in compute_nodes:
    node.close()