Instructions

* the gradients are never visible "locally"; they exist only on the workers
* all gradients are collected on one of the workers and "aggregated" (averaged?) there
* only the averaged gradients are then brought back locally
* use `move` to collect all gradients on one of the workers
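
The aggregation step in the list above can be sketched without any remote workers. This is a minimal plain-Python illustration that assumes "aggregated" means an element-wise mean; the function name `average_gradients` and the gradient values are made up for the example and are not part of PySyft.

```python
def average_gradients(worker_gradients):
    """Element-wise mean of equally sized gradient vectors, one per worker."""
    num_workers = len(worker_gradients)
    # zip(*...) groups the j-th component of every worker's gradient together.
    return [sum(component) / num_workers for component in zip(*worker_gradients)]


# Hypothetical gradients computed by three workers for a 3-parameter model.
worker_gradients = [
    [0.5, -1.0, 2.0],
    [1.5, 1.0, 0.0],
    [1.0, 0.0, 1.0],
]

averaged = average_gradients(worker_gradients)
print(averaged)
```

Only `averaged` would need to leave the aggregating worker; the individual entries of `worker_gradients` stay remote.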

## Set up workers

In [1]:
import torch
import syft



In [2]:
hook = syft.TorchHook(torch)

num_workers = 10
workers = [syft.VirtualWorker(hook, id=i) for i in range(num_workers)]

In [3]:
workers

[<VirtualWorker id:0 #tensors:0>,
 <VirtualWorker id:1 #tensors:0>,
 <VirtualWorker id:2 #tensors:0>,
 <VirtualWorker id:3 #tensors:0>,
 <VirtualWorker id:4 #tensors:0>,
 <VirtualWorker id:5 #tensors:0>,
 <VirtualWorker id:6 #tensors:0>,
 <VirtualWorker id:7 #tensors:0>,
 <VirtualWorker id:8 #tensors:0>,
 <VirtualWorker id:9 #tensors:0>]

We will have a dedicated worker that does the aggregation and will use the remaining ones for calculating gradients based on individual data items.

In [4]:
aggregator = workers[0]
differentiators = workers[1:]

## Training data

Prepare the training data and distribute it among the differentiators. We will try to learn a 10-dimensional linear model (plus a bias term). To make things more interesting, the data for one of the dimensions will be available to only one of the workers.

In [5]:
model_dim = 10
true_coefficients = torch.tensor(range(1,model_dim+2)).float() # includes bias term
num_examples_per_differentiator = 100

X_ptrs = []
y_ptrs = []

for i in range(len(differentiators)):
    X = torch.cat((torch.rand((num_examples_per_differentiator, model_dim)),
                   torch.ones((num_examples_per_differentiator, 1))), # additional column of ones for the bias
                  dim=1)
    # only differentiators[3] (worker id 4) has non-zero data for the 3rd parameter
    if i != 3:
        X[:, 2] = 0
    y = (torch.matmul(X, true_coefficients)).view((num_examples_per_differentiator, 1))
    X_ptrs.append(X.send(differentiators[i]))
    y_ptrs.append(y.send(differentiators[i]))

## Model

In [19]:
def mk_model(template=None):
    if template is not None:
        return template.clone().detach().requires_grad_(True)
    else:
        return torch.zeros((model_dim + 1, 1), requires_grad=True)

## Train

The instructions said to use `move` to get the gradients to a single worker. However, [there is a bug](https://openmined.slack.com/archives/C6DEWA4FR/p1562282512181200?thread_ts=1562282454.181100&cid=C6DEWA4FR) where the gradients don't get moved alongside the tensor. We will therefore apply the gradients on each differentiator to a copy of the model, then collect and average all copies of the model on the aggregator.
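
For a single gradient step per worker, averaging the updated model copies gives the same result as applying the averaged gradient. A short derivation, using notation introduced here for illustration: $w$ is the current model, $g_i$ the gradient computed on differentiator $i$, $\eta$ the learning rate and $K$ the number of differentiators.

```latex
\frac{1}{K}\sum_{i=1}^{K}\left(w - \eta\, g_i\right)
  = w - \eta \cdot \frac{1}{K}\sum_{i=1}^{K} g_i
```

The equality relies on every worker starting from the same $w$ and taking exactly one step; with several local steps per round the two approaches would generally differ.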

In [26]:
epochs = 1000
learning_rate = 0.001

model = mk_model()

for epoch in range(epochs):

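    # Accumulator for the sum of the locally updated models; it lives on the aggregator.
    agg_model_ptr = mk_model().requires_grad_(False).send(aggregator)

    for i in range(len(differentiators)):
        worker = differentiators[i]
        # Send a fresh copy of the current model to the worker.
        model_ptr = mk_model(model).send(worker)
        pred_ptr = X_ptrs[i].mm(model_ptr)
        # Summed squared error over this worker's examples.
        loss_ptr = ((pred_ptr - y_ptrs[i])**2).sum()
        loss_ptr.backward()
        # One gradient-descent step, taken remotely on the worker.
        model_ptr.data.sub_(model_ptr.grad * learning_rate)
        # Move the updated copy to the aggregator and add it to the running sum.
        model_ptr.move(aggregator)
        agg_model_ptr += model_ptr

    # Retrieve the sum and divide by the number of differentiators to get the average.
    model = agg_model_ptr.get() / len(differentiators)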
    if epoch % 100 == 0:
        print(model.view(1, -1).data)

tensor([[3.7783, 3.7443, 0.4085, 3.7935, 3.7636, 3.7254, 3.9816, 3.7842, 3.8019,
         3.8849, 7.4163]])
tensor([[ 1.8421,  2.6393,  1.7548,  4.2687,  5.2593,  6.0991,  6.6403,  7.5424,
          8.4513,  9.2471, 11.0668]])
tensor([[ 1.1512,  2.1121,  2.3329,  4.0354,  5.0716,  6.0538,  6.8793,  7.8972,
          8.9156,  9.8638, 11.0466]])
tensor([[ 1.0252,  2.0182,  2.6399,  3.9995,  5.0158,  6.0160,  6.9548,  7.9713,
          8.9901,  9.9755, 11.0371]])
tensor([[ 1.0019,  2.0011,  2.8050,  3.9952,  5.0015,  6.0027,  6.9797,  7.9889,
          8.9999,  9.9952, 11.0280]])
tensor([[ 0.9981,  1.9982,  2.8942,  3.9958,  4.9984,  5.9992,  6.9894,  7.9942,
          9.0003,  9.9985, 11.0201]])
tensor([[ 0.9979,  1.9981,  2.9425,  3.9969,  4.9982,  5.9987,  6.9939,  7.9964,
          8.9998,  9.9990, 11.0138]])
tensor([[ 0.9984,  1.9985,  2.9687,  3.9979,  4.9986,  5.9989,  6.9964,  7.9977,
          8.9996,  9.9992, 11.0093]])
tensor([[ 0.9989,  1.9989,  2.9829,  3.9986,  4.9990,  5.99

As expected, the 3rd dimension converges more slowly than the others, due to only one worker having the data for it.
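
The slowdown can be reproduced on a toy problem without PySyft. The sketch below is plain Python with made-up data, not the notebook's setup: three simulated workers fit a 2-parameter linear model, every worker sees feature 0, and only worker 0 has an example with a non-zero feature 1. All names (`local_gradient`, `train`, `worker_data`) are hypothetical.

```python
def local_gradient(weights, examples):
    """Gradient of the summed squared error over one worker's examples."""
    grad = [0.0] * len(weights)
    for features, target in examples:
        residual = sum(w * x for w, x in zip(weights, features)) - target
        for j, x in enumerate(features):
            grad[j] += 2.0 * residual * x
    return grad


def train(worker_data, num_params, learning_rate, epochs):
    """Gradient descent where each step uses the mean of the workers' gradients."""
    weights = [0.0] * num_params
    for _ in range(epochs):
        grads = [local_gradient(weights, data) for data in worker_data]
        mean_grad = [sum(g[j] for g in grads) / len(grads) for j in range(num_params)]
        weights = [w - learning_rate * g for w, g in zip(weights, mean_grad)]
    return weights


true_weights = [1.0, 2.0]
# Each example is (features, target) with target = dot(features, true_weights).
worker_data = [
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],  # worker 0: the only one with feature 1
    [([1.0, 0.0], 1.0)],                     # worker 1: feature 1 is always zero
    [([1.0, 0.0], 1.0)],                     # worker 2: feature 1 is always zero
]

weights = train(worker_data, num_params=2, learning_rate=0.1, epochs=20)
errors = [abs(w - t) for w, t in zip(weights, true_weights)]
print(errors)  # the error of the second weight is much larger than that of the first
```

In this toy setup the gradient for the second weight is zero on two of the three workers, so averaging shrinks its update by a factor of three: its error decays by about 0.933 per epoch, against 0.8 for the first weight. The same effect, with a factor of nine, would be consistent with the slow 3rd coefficient in the output above.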