# Strategies in Federated Learning

Welcome to the next part of the federated learning tutorial. In the previous part of this tutorial, we introduced federated learning with PyTorch and Flower ([part 1](https://flower.dev/docs/tutorial/Flower-1-Intro-to-FL-PyTorch.html)).

In this notebook, we'll begin to customize the federated learning system we built in the introductory notebook (again, using [Flower](https://flower.dev/) and [PyTorch](https://pytorch.org/)).

> Join the Flower community on Slack to connect, ask questions, and get help: [Join Slack](https://flower.dev/join-slack) 🌻 We'd love to hear from you in the `#introductions` channel! If anything is unclear, head over to the `#questions` channel.

Let's move beyond FedAvg with Flower Strategies!

## Preparation

Before we begin with the actual code, let's make sure that we have everything we need.

### Installing dependencies

First, we install the necessary packages:

In [1]:
# %pip install -q flwr[simulation] torch torchvision

Now that we have all dependencies installed, we can import everything we need for this tutorial:

In [2]:
from collections import OrderedDict
from typing import Dict, List, Optional, Tuple

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import CIFAR10

import flwr as fl

DEVICE = torch.device("cpu")  # Try "cuda" to train on GPU
print(
    f"Training on {DEVICE} using PyTorch {torch.__version__} and Flower {fl.__version__}"
)

Training on cpu using PyTorch 1.13.1+cu117 and Flower 1.3.0


It is possible to switch to a runtime that has GPU acceleration enabled (on Google Colab: `Runtime > Change runtime type > Hardware accelerator: GPU > Save`). Note, however, that Google Colab is not always able to offer GPU acceleration. If you see an error related to GPU availability in one of the following sections, consider switching back to CPU-based execution by setting `DEVICE = torch.device("cpu")`. If the runtime has GPU acceleration enabled, you should see the output `Training on cuda`; otherwise it'll say `Training on cpu`.

### Data loading

Let's now load the CIFAR-10 training and test sets, partition the training set into ten smaller datasets (each split into a training and a validation set), and wrap each of them in its own `DataLoader`. We introduce a new parameter `num_clients`, which allows us to call `load_datasets` with different numbers of clients.

In [3]:
NUM_CLIENTS = 10


def load_datasets(num_clients: int):
    # Download and transform CIFAR-10 (train and test)
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )
    trainset = CIFAR10("./dataset", train=True, download=True, transform=transform)
    testset = CIFAR10("./dataset", train=False, download=True, transform=transform)

    # Split training set into `num_clients` partitions to simulate different local datasets
    partition_size = len(trainset) // num_clients
    lengths = [partition_size] * num_clients
    datasets = random_split(trainset, lengths, torch.Generator().manual_seed(42))

    # Split each partition into train/val and create DataLoader
    trainloaders = []
    valloaders = []
    for ds in datasets:
        len_val = len(ds) // 10  # 10 % validation set
        len_train = len(ds) - len_val
        lengths = [len_train, len_val]
        ds_train, ds_val = random_split(ds, lengths, torch.Generator().manual_seed(42))
        trainloaders.append(DataLoader(ds_train, batch_size=32, shuffle=True))
        valloaders.append(DataLoader(ds_val, batch_size=32))
    testloader = DataLoader(testset, batch_size=32)
    return trainloaders, valloaders, testloader


trainloaders, valloaders, testloader = load_datasets(NUM_CLIENTS)

Files already downloaded and verified
Files already downloaded and verified
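
As a quick sanity check on the split logic in `load_datasets`, the arithmetic can be reproduced without downloading any data. The sketch below assumes the standard CIFAR-10 training set size of 50,000 images; the helper name `partition_sizes` is hypothetical and not part of Flower or PyTorch.

```python
from typing import List, Tuple


def partition_sizes(
    num_samples: int, num_clients: int, val_denominator: int = 10
) -> List[Tuple[int, int]]:
    """Return (train, val) sizes per client, mirroring `load_datasets`."""
    partition_size = num_samples // num_clients  # samples per client
    sizes = []
    for _ in range(num_clients):
        len_val = partition_size // val_denominator  # 10 % validation set
        len_train = partition_size - len_val
        sizes.append((len_train, len_val))
    return sizes


# Assumption: the CIFAR-10 training set holds 50,000 images.
sizes = partition_sizes(num_samples=50_000, num_clients=10)
print(sizes[0])  # (4500, 500)
```

Because `random_split` expects the requested lengths to sum to the dataset size, a value of `num_clients` that does not divide the dataset evenly (for example 3) would need the remainder handled before calling it.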


### Model training/evaluation

Let's continue with the usual model definition (including `set_parameters` and `get_parameters`), training and test functions:

In [4]:
class Net(nn.Module):
    def __init__(self) -> None:
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


def get_parameters(net) -> List[np.ndarray]:
    return [val.cpu().numpy() for _, val in net.state_dict().items()]


def set_parameters(net, parameters: List[np.ndarray]):
    params_dict = zip(net.state_dict().keys(), parameters)
    state_dict = OrderedDict({k: torch.Tensor(v) for k, v in params_dict})
    net.load_state_dict(state_dict, strict=True)


def train(net, trainloader, epochs: int):
    """Train the network on the training set."""
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters())
    net.train()
    for epoch in range(epochs):
        correct, total, epoch_loss = 0, 0, 0.0
        for images, labels in trainloader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            optimizer.zero_grad()
            outputs = net(images)
            loss = criterion(outputs, labels)  # Reuse the forward pass computed above
            loss.backward()
            optimizer.step()
            # Metrics
            epoch_loss += loss.item()  # Accumulate a float, not a graph-attached tensor
            total += labels.size(0)
            correct += (torch.max(outputs.data, 1)[1] == labels).sum().item()
        epoch_loss /= len(trainloader.dataset)
        epoch_acc = correct / total
        print(f"Epoch {epoch+1}: train loss {epoch_loss}, accuracy {epoch_acc}")


def test(net, testloader):
    """Evaluate the network on the entire test set."""
    criterion = torch.nn.CrossEntropyLoss()
    correct, total, loss = 0, 0, 0.0
    net.eval()
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            outputs = net(images)
            loss += criterion(outputs, labels).item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    loss /= len(testloader.dataset)
    accuracy = correct / total
    return loss, accuracy
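
Both `train` and `test` sum the per-batch *mean* loss and then divide by the number of samples, so the reported loss is roughly the usual per-sample cross-entropy divided by the batch size. The sketch below works through that arithmetic for an untrained ten-class model, whose expected cross-entropy is about ln(10). The helper `reported_loss` is a hypothetical name used only for illustration, and it assumes every batch has the same mean loss.

```python
import math


def reported_loss(mean_batch_loss: float, num_samples: int, batch_size: int) -> float:
    """Loss as reported by `test`: sum of per-batch means divided by sample count."""
    num_batches = math.ceil(num_samples / batch_size)
    return num_batches * mean_batch_loss / num_samples


# An untrained 10-class classifier predicts roughly uniformly: loss ~ ln(10).
chance_loss = math.log(10)

# One client's validation split: 500 samples in batches of 32 (16 batches).
print(reported_loss(chance_loss, num_samples=500, batch_size=32))  # about 0.0737
```

This is close to the loss of about 0.0738 that the server-side evaluation of the freshly initialized model reports later in this notebook, which suggests the small loss values in the logs reflect this scaling rather than an unusually good model.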

### Flower client

To implement the Flower client, we (again) create a subclass of `flwr.client.NumPyClient` and implement the three methods `get_parameters`, `fit`, and `evaluate`. Here, we also pass the `cid` to the client and use it to log additional details:

In [5]:
class FlowerClient(fl.client.NumPyClient):
    def __init__(self, cid, net, trainloader, valloader):
        self.cid = cid
        self.net = net
        self.trainloader = trainloader
        self.valloader = valloader

    def get_parameters(self, config):
        print(f"[Client {self.cid}] get_parameters")
        return get_parameters(self.net)

    def fit(self, parameters, config):
        print(f"[Client {self.cid}] fit, config: {config}")
        set_parameters(self.net, parameters)
        train(self.net, self.trainloader, epochs=1)
        return get_parameters(self.net), len(self.trainloader), {}

    def evaluate(self, parameters, config):
        print(f"[Client {self.cid}] evaluate, config: {config}")
        set_parameters(self.net, parameters)
        loss, accuracy = test(self.net, self.valloader)
        return float(loss), len(self.valloader), {"accuracy": float(accuracy)}


def client_fn(cid) -> FlowerClient:
    net = Net().to(DEVICE)
    trainloader = trainloaders[int(cid)]
    valloader = valloaders[int(cid)]
    return FlowerClient(cid, net, trainloader, valloader)
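
The second value returned by `fit` is the weight that `FedAvg` gives to this client's update during aggregation. As a rough illustration of that weighted average, here is a simplified sketch over plain Python lists. It is not Flower's actual implementation, and `weighted_average` is a hypothetical helper.

```python
from typing import List, Tuple


def weighted_average(results: List[Tuple[List[float], int]]) -> List[float]:
    """Average flat parameter vectors, weighting each by its example count."""
    total_examples = sum(num_examples for _, num_examples in results)
    num_params = len(results[0][0])
    averaged = [0.0] * num_params
    for params, num_examples in results:
        weight = num_examples / total_examples
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged


# Two toy clients: the second one reports three times as many examples.
results = [([1.0, 2.0], 100), ([5.0, 6.0], 300)]
print(weighted_average(results))  # [4.0, 5.0]
```

Note that `fit` above returns `len(self.trainloader)`, which is the number of batches rather than the number of samples. Since every client here holds a partition of the same size, all clients receive equal weight either way; with unequal partitions, `len(self.trainloader.dataset)` would give the sample count.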

## Strategy customization

So far, everything should look familiar if you've worked through the introductory notebook. With that, we're ready to introduce a number of new features. 

### Server-side parameter **initialization**

Flower, by default, initializes the global model by asking one random client for the initial parameters. In many cases, we want more control over parameter initialization though. Flower therefore allows you to directly pass the initial parameters to the Strategy:

In [6]:
# Create an instance of the model and get the parameters
params = get_parameters(Net())

# Pass parameters to the Strategy for server-side parameter initialization
strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.3,
    fraction_evaluate=0.3,
    min_fit_clients=3,
    min_evaluate_clients=3,
    min_available_clients=NUM_CLIENTS,
    initial_parameters=fl.common.ndarrays_to_parameters(params),
)

# Specify client resources if you need GPU (defaults to 1 CPU and 0 GPU)
client_resources = None
if DEVICE.type == "cuda":
    client_resources = {"num_gpus": 1}

# Start simulation
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=NUM_CLIENTS,
    config=fl.server.ServerConfig(num_rounds=3),  # Just three rounds
    strategy=strategy,
    client_resources=client_resources,
)

INFO flwr 2023-02-27 15:47:42,851 | app.py:145 | Starting Flower simulation, config: ServerConfig(num_rounds=3, round_timeout=None)
2023-02-27 15:47:58,512 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
E0227 15:48:09.053167700   14900 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509289.052203500","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
INFO flwr 2023-02-27 15:48:09,445 | app.py:179 | Flower VCE: Ray initialized with resources: {'node:10.246.68.42': 1.0, 'memory': 734097408.0, 'object_store_memory': 367048704.0, 'CPU': 8.0}
INFO flwr 2023-02-27 15:48:09,456 | server.py:86 | Initializing global parameters
INFO flwr 2023-02-27 15:48:09,497 | server.py:266 | Using initial parameters provided by strategy
INFO flwr

(launch_and_fit pid=15343) [Client 8] fit, config: {}

(raylet) E0227 15:48:34.826798100   15990 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509314.826763900","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}

(launch_and_fit pid=15343) Epoch 1: train loss 0.06502389162778854, accuracy 0.22955555555555557
(launch_and_fit pid=15343) [Client 6] fit, config: {}
(launch_and_fit pid=15345) [Client 0] fit, config: {}
(launch_and_fit pid=15345) Epoch 1: train loss 0.0649348720908165, accuracy 0.2371111111111111
(launch_and_fit pid=15345) [Client 6] fit, config: {}

DEBUG flwr 2023-02-27 15:48:54,495 | server.py:229 | fit_round 1 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:48:54,519 | server.py:165 | evaluate_round 1: strategy sampled 3 clients (out of 10)

(launch_and_fit pid=15345) Epoch 1: train loss 0.06501290202140808, accuracy 0.232

(raylet) [2023-02-27 15:48:58,422 E 15232 15232] (raylet) node_manager.cc:3097: 6 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 6af4d8029283a981137c11393934d1bd5fe3e6cdbe5cd3d6bfb447e9, IP: 10.246.68.42) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.246.68.42`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

(launch_and_evaluate pid=15347) [Client 2] evaluate, config: {}
(launch_and_evaluate pid=15347) [Client 8] evaluate, config: {}
(launch_and_evaluate pid=15347) [Client 5] evaluate, config: {}

DEBUG flwr 2023-02-27 15:49:11,505 | server.py:179 | evaluate_round 1 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:49:11,509 | server.py:215 | fit_round 2: strategy sampled 3 clients (out of 10)

(launch_and_fit pid=15351) [Client 5] fit, config: {}
(launch_and_fit pid=15351) Epoch 1: train loss 0.05815477296710014, accuracy 0.316
(launch_and_fit pid=15351) [Client 8] fit, config: {}
(launch_and_fit pid=15351) Epoch 1: train loss 0.05778171867132187, accuracy 0.32066666666666666
(launch_and_fit pid=15351) [Client 9] fit, config: {}

DEBUG flwr 2023-02-27 15:49:42,322 | server.py:229 | fit_round 2 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:49:42,339 | server.py:165 | evaluate_round 2: strategy sampled 3 clients (out of 10)

(launch_and_fit pid=15351) Epoch 1: train loss 0.058538973331451416, accuracy 0.31955555555555554
(launch_and_evaluate pid=15346) [Client 0] evaluate, config: {}
(launch_and_evaluate pid=15346) [Client 1] evaluate, config: {}

(raylet) E0227 15:49:55.026685400   16404 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509395.026657100","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
DEBUG flwr 2023-02-27 15:49:55,189 | server.py:179 | evaluate_round 2 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:49:55,190 | server.py:215 | fit_round 3: strategy sampled 3 clients (out of 10)

(launch_and_evaluate pid=15346) [Client 7] evaluate, config: {}

(raylet) [2023-02-27 15:49:58,423 E 15232 15232] (raylet) node_manager.cc:3097: 2 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 6af4d8029283a981137c11393934d1bd5fe3e6cdbe5cd3d6bfb447e9, IP: 10.246.68.42) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.246.68.42`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

(launch_and_fit pid=15346) [Client 8] fit, config: {}

(raylet) Spilled 2206 MiB, 28 objects, write throughput 122 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.

(launch_and_fit pid=15346) Epoch 1: train loss 0.05325465276837349, accuracy 0.3804444444444444
(launch_and_fit pid=15346) [Client 0] fit, config: {}
(launch_and_fit pid=15346) Epoch 1: train loss 0.05432221665978432, accuracy 0.3668888888888889
(launch_and_fit pid=15346) [Client 3] fit, config: {}

DEBUG flwr 2023-02-27 15:50:11,544 | server.py:229 | fit_round 3 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:50:11,564 | server.py:165 | evaluate_round 3: strategy sampled 3 clients (out of 10)

(launch_and_fit pid=15346) Epoch 1: train loss 0.05534721538424492, accuracy 0.3526666666666667
(launch_and_evaluate pid=15346) [Client 2] evaluate, config: {}
(launch_and_evaluate pid=15346) [Client 1] evaluate, config: {}

DEBUG flwr 2023-02-27 15:50:21,974 | server.py:179 | evaluate_round 3 received 3 results and 0 failures
INFO flwr 2023-02-27 15:50:21,975 | server.py:144 | FL finished in 132.40289959999973
INFO flwr 2023-02-27 15:50:21,977 | app.py:202 | app_fit: losses_distributed [(1, 0.06310507273674011), (2, 0.05618654195467632), (3, 0.05283888673782348)]
INFO flwr 2023-02-27 15:50:21,978 | app.py:203 | app_fit: metrics_distributed {}
INFO flwr 2023-02-27 15:50:21,979 | app.py:204 | app_fit: losses_centralized []
INFO flwr 2023-02-27 15:50:21,981 | app.py:205 | app_fit: metrics_centralized {}

(launch_and_evaluate pid=15346) [Client 5] evaluate, config: {}


History (loss, distributed):
	round 1: 0.06310507273674011
	round 2: 0.05618654195467632
	round 3: 0.05283888673782348

Passing `initial_parameters` to the `FedAvg` strategy prevents Flower from asking one of the clients for the initial parameters. If we look closely, we can see that the logs do not show any calls to the `FlowerClient.get_parameters` method.
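
The logs above also contain lines such as `fit_round 2: strategy sampled 3 clients (out of 10)`. That count follows from `fraction_fit` and `min_fit_clients`. The sketch below reflects our understanding of how `FedAvg` derives it: take the configured fraction of the available clients, but never fewer than the minimum. The function name is hypothetical, and the exact rounding in your Flower version may differ.

```python
def num_clients_to_sample(num_available: int, fraction: float, min_clients: int) -> int:
    """Number of clients to sample: a fraction of those available, floored at a minimum."""
    return max(int(num_available * fraction), min_clients)


# Values used in this notebook: 10 clients, fraction_fit=0.3, min_fit_clients=3.
print(num_clients_to_sample(10, 0.3, 3))  # 3
# With a smaller fraction, the minimum takes over.
print(num_clients_to_sample(10, 0.1, 3))  # 3
# With a larger fraction, the fraction takes over.
print(num_clients_to_sample(10, 0.5, 3))  # 5
```

The same reasoning applies to `fraction_evaluate` and `min_evaluate_clients` for the evaluation rounds.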

### Starting with a customized strategy

We've seen the function `start_simulation` before. It accepts a number of arguments, amongst them the `client_fn` used to create `FlowerClient` instances, the number of clients to simulate (`num_clients`), the number of rounds (`num_rounds`, passed via `ServerConfig`), and the strategy.

The strategy encapsulates the federated learning approach/algorithm, for example, `FedAvg` or `FedAdagrad`. Let's try to use a different strategy this time:

In [7]:
# Create FedAdagrad strategy
strategy = fl.server.strategy.FedAdagrad(
    fraction_fit=0.3,
    fraction_evaluate=0.3,
    min_fit_clients=3,
    min_evaluate_clients=3,
    min_available_clients=NUM_CLIENTS,
    initial_parameters=fl.common.ndarrays_to_parameters(get_parameters(Net())),
)

# Start simulation
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=NUM_CLIENTS,
    config=fl.server.ServerConfig(num_rounds=3),  # Just three rounds
    strategy=strategy,
    client_resources=client_resources,
)

INFO flwr 2023-02-27 15:50:22,257 | app.py:145 | Starting Flower simulation, config: ServerConfig(num_rounds=3, round_timeout=None)
2023-02-27 15:50:37,970 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO flwr 2023-02-27 15:50:44,312 | app.py:179 | Flower VCE: Ray initialized with resources: {'object_store_memory': 1141410201.0, 'memory': 2282820404.0, 'CPU': 8.0, 'node:10.246.68.42': 1.0}
INFO flwr 2023-02-27 15:50:44,315 | server.py:86 | Initializing global parameters
INFO flwr 2023-02-27 15:50:44,317 | server.py:266 | Using initial parameters provided by strategy
INFO flwr 2023-02-27 15:50:44,320 | server.py:88 | Evaluating initial parameters
INFO flwr 2023-02-27 15:50:44,324 | server.py:101 | FL starting
DEBUG flwr 2023-02-27 15:50:44,327 | server.py:215 | fit_round 1: strategy sampled 3 clients (out of 10)

(launch_and_fit pid=16961) [Client 1] fit, config: {}
(launch_and_fit pid=16963) [Client 9] fit, config: {}
(launch_and_fit pid=16962) [Client 8] fit, config: {}
(launch_and_fit pid=16963) Epoch 1: train loss 0.06593215465545654, accuracy 0.22088888888888888
(launch_and_fit pid=16961) Epoch 1: train loss 0.06524761766195297, accuracy 0.2222222222222222

DEBUG flwr 2023-02-27 15:51:09,997 | server.py:229 | fit_round 1 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:51:10,039 | server.py:165 | evaluate_round 1: strategy sampled 3 clients (out of 10)

(launch_and_fit pid=16962) Epoch 1: train loss 0.06476739794015884, accuracy 0.21288888888888888
(launch_and_evaluate pid=16961) [Client 1] evaluate, config: {}
(launch_and_evaluate pid=16963) [Client 6] evaluate, config: {}

(raylet) E0227 15:51:28.906780400   17492 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509488.906747200","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
(raylet) E0227 15:51:28.942513900   17507 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509488.942477300","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
(raylet) E0227 15:51:30.526282900   17508 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509490.526099000","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/i

(launch_and_evaluate pid=16945) [Client 7] evaluate, config: {}

DEBUG flwr 2023-02-27 15:51:32,948 | server.py:179 | evaluate_round 1 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:51:32,953 | server.py:215 | fit_round 2: strategy sampled 3 clients (out of 10)
(raylet) [2023-02-27 15:51:37,883 E 16823 16823] (raylet) node_manager.cc:3097: 2 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 8e76133d0765063e4f71166921b50bf0c8c576e43bf796615c9beca4, IP: 10.246.68.42) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.246.68.42`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_us

(launch_and_fit pid=16963) [Client 7] fit, config: {}
(launch_and_fit pid=16961) [Client 6] fit, config: {}
(launch_and_fit pid=16961) Epoch 1: train loss 0.6562023162841797, accuracy 0.2782222222222222
(launch_and_fit pid=16963) Epoch 1: train loss 0.6462485790252686, accuracy 0.2942222222222222

(raylet) E0227 15:51:50.488486200   17606 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509510.488448400","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}

(launch_and_fit pid=16959) [Client 0] fit, config: {}

DEBUG flwr 2023-02-27 15:52:07,655 | server.py:229 | fit_round 2 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:52:07,690 | server.py:165 | evaluate_round 2: strategy sampled 3 clients (out of 10)

(launch_and_fit pid=16959) Epoch 1: train loss 0.6790484189987183, accuracy 0.29844444444444446
(launch_and_evaluate pid=16961) [Client 5] evaluate, config: {}
(launch_and_evaluate pid=16963) [Client 0] evaluate, config: {}

DEBUG flwr 2023-02-27 15:52:13,458 | server.py:179 | evaluate_round 2 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:52:13,459 | server.py:215 | fit_round 3: strategy sampled 3 clients (out of 10)

(launch_and_evaluate pid=16959) [Client 1] evaluate, config: {}
(launch_and_fit pid=16961) [Client 7] fit, config: {}
(launch_and_fit pid=16959) [Client 0] fit, config: {}
(launch_and_fit pid=16963) [Client 1] fit, config: {}
(launch_and_fit pid=16961) Epoch 1: train loss 0.08753269910812378, accuracy 0.17644444444444443
(launch_and_fit pid=16959) Epoch 1: train loss 0.09109263867139816, accuracy 0.17177777777777778

DEBUG flwr 2023-02-27 15:52:24,003 | server.py:229 | fit_round 3 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:52:24,021 | server.py:165 | evaluate_round 3: strategy sampled 3 clients (out of 10)

(launch_and_fit pid=16963) Epoch 1: train loss 0.08876537531614304, accuracy 0.184
(launch_and_evaluate pid=16963) [Client 8] evaluate, config: {}
(launch_and_evaluate pid=16961) [Client 9] evaluate, config: {}
(launch_and_evaluate pid=16959) [Client 1] evaluate, config: {}

(raylet) Spilled 2058 MiB, 25 objects, write throughput 118 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
DEBUG flwr 2023-02-27 15:52:30,545 | server.py:179 | evaluate_round 3 received 3 results and 0 failures
INFO flwr 2023-02-27 15:52:30,546 | server.py:144 | FL finished in 106.2191026999999
INFO flwr 2023-02-27 15:52:30,547 | app.py:202 | app_fit: losses_distributed [(1, 6.097165802001953), (2, 0.43599525769551595), (3, 0.14549784088134768)]
INFO flwr 2023-02-27 15:52:30,549 | app.py:203 | app_fit: metrics_distributed {}
INFO flwr 2023-02-27 15:52:30,550 | app.py:204 | app_fit: losses_centralized []
INFO flwr 2023-02-27 15:52:30,551 | app.py:205 | app_fit: metrics_centralized {}


History (loss, distributed):
	round 1: 6.097165802001953
	round 2: 0.43599525769551595
	round 3: 0.14549784088134768
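
`FedAdagrad` differs from `FedAvg` in what the server does with the averaged client update: instead of adopting the average directly, it treats the difference from the current global model as a pseudo-gradient and applies an Adagrad-style step. The following is a minimal sketch of that server-side update as described in the adaptive federated optimization literature, written over plain Python lists. The function name and the hyperparameter values (`eta`, `tau`) are illustrative assumptions, not Flower's defaults, and Flower's implementation may differ in details such as momentum.

```python
import math
from typing import List, Tuple


def fedadagrad_step(
    global_params: List[float],
    averaged_client_params: List[float],
    accumulator: List[float],
    eta: float = 0.1,  # server learning rate (illustrative value)
    tau: float = 1e-9,  # small constant for numerical stability (illustrative)
) -> Tuple[List[float], List[float]]:
    """Apply one Adagrad-style server update; returns (new_params, new_accumulator)."""
    new_params, new_accumulator = [], []
    for x, avg, v in zip(global_params, averaged_client_params, accumulator):
        delta = avg - x  # pseudo-gradient: where the clients want to move
        v_new = v + delta * delta  # accumulate squared pseudo-gradients
        new_params.append(x + eta * delta / (math.sqrt(v_new) + tau))
        new_accumulator.append(v_new)
    return new_params, new_accumulator


params, acc = [0.0, 1.0], [0.0, 0.0]
params, acc = fedadagrad_step(params, [0.5, 0.0], acc)
print(params, acc)
```

On the first step, each coordinate moves by roughly `eta` toward the clients' average regardless of how large the pseudo-gradient is, so results can be sensitive to the server learning rate.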

### Server-side parameter **evaluation**

Flower can evaluate the aggregated model on the server side or on the client side. Both approaches measure the quality of the same aggregated model, but they differ in where the evaluation data lives and in how stable the results are from round to round.

**Centralized Evaluation** (or *server-side evaluation*) is conceptually simple: it works the same way that evaluation in centralized machine learning does. If there is a server-side dataset that can be used for evaluation purposes, then that's great. We can evaluate the newly aggregated model after each round of training without having to send the model to clients. We're also fortunate in the sense that our entire evaluation dataset is available at all times.

**Federated Evaluation** (or *client-side evaluation*) is more complex, but also more powerful: it doesn't require a centralized dataset and allows us to evaluate models over a larger set of data, which often yields more realistic evaluation results. In fact, many scenarios require us to use **Federated Evaluation** if we want to get representative evaluation results at all. But this power comes at a cost: once we start to evaluate on the client side, we should be aware that our evaluation dataset can change over consecutive rounds of learning if those clients are not always available. Moreover, the dataset held by each client can also change over consecutive rounds. This can lead to unstable evaluation results: even if we did not change the model, we would see the results fluctuate from round to round.
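
To make that instability concrete, the toy simulation below keeps the model fixed (each client has a constant loss) and only varies which three of ten clients are sampled for evaluation each round. The per-client losses are made-up numbers and `federated_eval_loss` is a hypothetical helper; the point is only that the aggregated loss can move between rounds even though nothing about the model changes.

```python
import random
from typing import Dict, List


def federated_eval_loss(client_losses: Dict[int, float], sampled: List[int]) -> float:
    """Average the losses of the sampled clients (equal-sized client datasets)."""
    return sum(client_losses[cid] for cid in sampled) / len(sampled)


# Made-up, fixed per-client losses: the "model" never changes.
client_losses = {cid: 0.05 + 0.002 * cid for cid in range(10)}

rng = random.Random(0)  # seeded for reproducibility
for server_round in range(1, 4):
    sampled = rng.sample(sorted(client_losses), k=3)
    loss = federated_eval_loss(client_losses, sampled)
    print(f"round {server_round}: clients {sampled}, loss {loss:.4f}")
```

A centralized evaluation on a fixed dataset would return the same number every round for this unchanged model.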

We've seen how federated evaluation works on the client side (i.e., by implementing the `evaluate` method in `FlowerClient`). Now let's see how we can evaluate aggregated model parameters on the server-side:

In [8]:
# The `evaluate` function will be called by Flower after every round
def evaluate(
    server_round: int,
    parameters: fl.common.NDArrays,
    config: Dict[str, fl.common.Scalar],
) -> Optional[Tuple[float, Dict[str, fl.common.Scalar]]]:
    net = Net().to(DEVICE)
    valloader = valloaders[0]
    set_parameters(net, parameters)  # Update model with the latest parameters
    loss, accuracy = test(net, valloader)
    print(f"Server-side evaluation loss {loss} / accuracy {accuracy}")
    return loss, {"accuracy": accuracy}

In [9]:
strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.3,
    fraction_evaluate=0.3,
    min_fit_clients=3,
    min_evaluate_clients=3,
    min_available_clients=NUM_CLIENTS,
    initial_parameters=fl.common.ndarrays_to_parameters(get_parameters(Net())),
    evaluate_fn=evaluate,  # Pass the evaluation function
)

fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=NUM_CLIENTS,
    config=fl.server.ServerConfig(num_rounds=3),  # Just three rounds
    strategy=strategy,
    client_resources=client_resources,
)

INFO flwr 2023-02-27 15:52:30,898 | app.py:145 | Starting Flower simulation, config: ServerConfig(num_rounds=3, round_timeout=None)
2023-02-27 15:52:48,259 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO flwr 2023-02-27 15:52:54,731 | app.py:179 | Flower VCE: Ray initialized with resources: {'memory': 3021282510.0, 'node:10.246.68.42': 1.0, 'CPU': 8.0, 'object_store_memory': 1510641254.0}
INFO flwr 2023-02-27 15:52:54,742 | server.py:86 | Initializing global parameters
INFO flwr 2023-02-27 15:52:54,747 | server.py:266 | Using initial parameters provided by strategy
INFO flwr 2023-02-27 15:52:54,752 | server.py:88 | Evaluating initial parameters
INFO flwr 2023-02-27 15:52:55,361 | server.py:91 | initial parameters (loss, other metrics): 0.07379200744628907, {'accuracy': 0.096}
INFO flwr 2023-02-27 15:52:55,364 | server.py:101 | FL starting
DEBUG flwr 2023-02-27 15:52:55,375 | server.py:215 | fit_round 1: strategy sampled

Server-side evaluation loss 0.07379200744628907 / accuracy 0.096
(launch_and_fit pid=18133) [Client 8] fit, config: {}
(launch_and_fit pid=18134) [Client 0] fit, config: {}
(launch_and_fit pid=18135) [Client 2] fit, config: {}

DEBUG flwr 2023-02-27 15:53:12,852 | server.py:229 | fit_round 1 received 3 results and 0 failures

(launch_and_fit pid=18133) Epoch 1: train loss 0.06419975310564041, accuracy 0.23044444444444445
(launch_and_fit pid=18134) Epoch 1: train loss 0.06415960192680359, accuracy 0.24155555555555555
(launch_and_fit pid=18135) Epoch 1: train loss 0.06351277977228165, accuracy 0.23644444444444446

INFO flwr 2023-02-27 15:53:13,068 | server.py:116 | fit progress: (1, 0.0616036479473114, {'accuracy': 0.31}, 17.692429400000037)
DEBUG flwr 2023-02-27 15:53:13,069 | server.py:165 | evaluate_round 1: strategy sampled 3 clients (out of 10)

Server-side evaluation loss 0.0616036479473114 / accuracy 0.31
(launch_and_evaluate pid=18135) [Client 1] evaluate, config: {}
(launch_and_evaluate pid=18133) [Client 2] evaluate, config: {}

DEBUG flwr 2023-02-27 15:53:19,531 | server.py:179 | evaluate_round 1 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:53:19,533 | server.py:215 | fit_round 2: strategy sampled 3 clients (out of 10)

(launch_and_evaluate pid=18134) [Client 0] evaluate, config: {}
(launch_and_fit pid=18133) [Client 1] fit, config: {}
(launch_and_fit pid=18134) [Client 0] fit, config: {}
(launch_and_fit pid=18135) [Client 6] fit, config: {}

DEBUG flwr 2023-02-27 15:53:28,714 | server.py:229 | fit_round 2 received 3 results and 0 failures
INFO flwr 2023-02-27 15:53:28,909 | server.py:116 | fit progress: (2, 0.053530490636825565, {'accuracy': 0.376}, 33.53404150000006)
DEBUG flwr 2023-02-27 15:53:28,910 | server.py:165 | evaluate_round 2: strategy sampled 3 clients (out of 10)

(launch_and_fit pid=18133) Epoch 1: train loss 0.05504786968231201, accuracy 0.35644444444444445
(launch_and_fit pid=18134) Epoch 1: train loss 0.05558769404888153, accuracy 0.3453333333333333
(launch_and_fit pid=18135) Epoch 1: train loss 0.05689386650919914, accuracy 0.33155555555555555
Server-side evaluation loss 0.053530490636825565 / accuracy 0.376
(launch_and_evaluate pid=18133) [Client 7] evaluate, config: {}
(launch_and_evaluate pid=18134) [Client 8] evaluate, config: {}

(raylet) E0227 15:53:38.473243800   18512 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509618.473205700","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
(raylet) E0227 15:53:38.490064700   18511 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509618.490030900","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
(raylet) E0227 15:53:38.742873000   18507 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509618.742844600","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/i

(launch_and_evaluate pid=18130) [Client 2] evaluate, config: {}

DEBUG flwr 2023-02-27 15:53:39,952 | server.py:179 | evaluate_round 2 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:53:39,954 | server.py:215 | fit_round 3: strategy sampled 3 clients (out of 10)
(raylet) E0227 15:53:40.467910800   18600 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509620.467884100","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}

(launch_and_fit pid=18133) [Client 2] fit, config: {}
(launch_and_fit pid=18134) [Client 4] fit, config: {}
(launch_and_fit pid=18130) [Client 1] fit, config: {}

(raylet) [2023-02-27 15:53:48,188 E 18038 18038] (raylet) node_manager.cc:3097: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: e733e0509a7835e3c6781a480005aa97080ba577eefd8322d41c1712, IP: 10.246.68.42) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.246.68.42`
[2m[33m(raylet)[0m 
[2m[33m(raylet)[0m Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
DEBUG flwr 2023-02-27 15:53:50,416 | server.py:229 | fit_round 3 received 3 re

[2m[36m(launch_and_fit pid=18134)[0m Epoch 1: train loss 0.05138608440756798, accuracy 0.38755555555555554
[2m[36m(launch_and_fit pid=18133)[0m Epoch 1: train loss 0.05228734388947487, accuracy 0.39844444444444443


INFO flwr 2023-02-27 15:53:50,612 | server.py:116 | fit progress: (3, 0.05148048067092895, {'accuracy': 0.428}, 55.23696760000075)
DEBUG flwr 2023-02-27 15:53:50,613 | server.py:165 | evaluate_round 3: strategy sampled 3 clients (out of 10)


[2m[36m(launch_and_fit pid=18130)[0m Epoch 1: train loss 0.05163895711302757, accuracy 0.3933333333333333
Server-side evaluation loss 0.05148048067092895 / accuracy 0.428
[2m[36m(launch_and_evaluate pid=18130)[0m [Client 8] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=18134)[0m [Client 2] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=18132)[0m [Client 1] evaluate, config: {}


DEBUG flwr 2023-02-27 15:54:01,253 | server.py:179 | evaluate_round 3 received 3 results and 0 failures
INFO flwr 2023-02-27 15:54:01,255 | server.py:144 | FL finished in 65.87967420000041
INFO flwr 2023-02-27 15:54:01,256 | app.py:202 | app_fit: losses_distributed [(1, 0.06166497111320496), (2, 0.05276294088363647), (3, 0.05091049361228942)]
INFO flwr 2023-02-27 15:54:01,257 | app.py:203 | app_fit: metrics_distributed {}
INFO flwr 2023-02-27 15:54:01,258 | app.py:204 | app_fit: losses_centralized [(0, 0.07379200744628907), (1, 0.0616036479473114), (2, 0.053530490636825565), (3, 0.05148048067092895)]
INFO flwr 2023-02-27 15:54:01,259 | app.py:205 | app_fit: metrics_centralized {'accuracy': [(0, 0.096), (1, 0.31), (2, 0.376), (3, 0.428)]}


History (loss, distributed):
	round 1: 0.06166497111320496
	round 2: 0.05276294088363647
	round 3: 0.05091049361228942
History (loss, centralized):
	round 0: 0.07379200744628907
	round 1: 0.0616036479473114
	round 2: 0.053530490636825565
	round 3: 0.05148048067092895
History (metrics, centralized):
{'accuracy': [(0, 0.096), (1, 0.31), (2, 0.376), (3, 0.428)]}

## Sending/receiving arbitrary values to/from clients

In some situations, we want to configure client-side execution (training, evaluation) from the server side. One example is the server asking the clients to train for a certain number of local epochs. Flower provides a way to send configuration values from the server to the clients using a dictionary. Let's look at an example where the clients receive values from the server through the `config` parameter in `fit` (`config` is also available in `evaluate`). The `fit` method can then read values from this dictionary. In this example, it reads `server_round` and `local_epochs` and uses those values to improve the logging and configure the number of local training epochs:

In [10]:
class FlowerClient(fl.client.NumPyClient):
    def __init__(self, cid, net, trainloader, valloader):
        self.cid = cid
        self.net = net
        self.trainloader = trainloader
        self.valloader = valloader

    def get_parameters(self, config):
        print(f"[Client {self.cid}] get_parameters")
        return get_parameters(self.net)

    def fit(self, parameters, config):
        # Read values from config
        server_round = config["server_round"]
        local_epochs = config["local_epochs"]

        # Use values provided by the config
        print(f"[Client {self.cid}, round {server_round}] fit, config: {config}")
        set_parameters(self.net, parameters)
        train(self.net, self.trainloader, epochs=local_epochs)
        return get_parameters(self.net), len(self.trainloader), {}

    def evaluate(self, parameters, config):
        print(f"[Client {self.cid}] evaluate, config: {config}")
        set_parameters(self.net, parameters)
        loss, accuracy = test(self.net, self.valloader)
        return float(loss), len(self.valloader), {"accuracy": float(accuracy)}


def client_fn(cid) -> FlowerClient:
    net = Net().to(DEVICE)
    trainloader = trainloaders[int(cid)]
    valloader = valloaders[int(cid)]
    return FlowerClient(cid, net, trainloader, valloader)
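
Note that `fit` above indexes `config` directly, so it raises a `KeyError` whenever the strategy sends an empty dictionary (as in the earlier runs, where the clients logged `fit, config: {}`). A minimal sketch of a more defensive read is shown below. The helper name `read_fit_config`, the local `Scalar` alias, and the fallback values are assumptions made for illustration; they are not part of Flower or of this notebook.

```python
from typing import Dict, Tuple, Union

# Local alias standing in for the scalar values a config dict may carry
# (an assumption for this sketch, defined here to keep the block self-contained).
Scalar = Union[bool, bytes, float, int, str]


def read_fit_config(config: Dict[str, Scalar]) -> Tuple[int, int]:
    """Hypothetical helper: read round and epoch count, with fallbacks."""
    # Fall back to round 0 ("unknown") and a single local epoch when the
    # server did not send these keys.
    server_round = int(config.get("server_round", 0))
    local_epochs = int(config.get("local_epochs", 1))
    return server_round, local_epochs


print(read_fit_config({}))
print(read_fit_config({"server_round": 2, "local_epochs": 3}))
```

Inside `fit`, the two dictionary lookups could then be replaced by `server_round, local_epochs = read_fit_config(config)`, so the same client also works with strategies that do not set a config function.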

So how can we send this config dictionary from the server to the clients? The built-in Flower Strategies provide a way to do this, and it works similarly to the way server-side evaluation works. We provide a function to the strategy, and the strategy calls this function for every round of federated learning:

In [11]:
def fit_config(server_round: int):
    """Return training configuration dict for each round.

    Perform the first round of training with one local epoch, increase to two
    local epochs afterwards.
    """
    config = {
        "server_round": server_round,  # The current round of federated learning
        "local_epochs": 1 if server_round < 2 else 2,  # One epoch in round 1, two afterwards
    }
    return config
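
To check what the strategy will hand to the clients, the function can simply be called by hand for each round. The sketch below repeats a compact copy of `fit_config` so that it runs on its own; the loop over rounds 1 to 3 is only for illustration.

```python
def fit_config(server_round: int):
    """Compact copy of the config function above, repeated for a standalone run."""
    return {
        "server_round": server_round,
        "local_epochs": 1 if server_round < 2 else 2,
    }


# Build the config for the three rounds used in this notebook
configs = [fit_config(server_round) for server_round in range(1, 4)]
for config in configs:
    print(config)
```

Only the first round uses one local epoch; rounds two and three use two.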

Next, we'll just pass this function to the FedAvg strategy before starting the simulation:

In [12]:
strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.3,
    fraction_evaluate=0.3,
    min_fit_clients=3,
    min_evaluate_clients=3,
    min_available_clients=NUM_CLIENTS,
    initial_parameters=fl.common.ndarrays_to_parameters(get_parameters(Net())),
    evaluate_fn=evaluate,
    on_fit_config_fn=fit_config,  # Pass the fit_config function
)

fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=NUM_CLIENTS,
    config=fl.server.ServerConfig(num_rounds=3),  # Just three rounds
    strategy=strategy,
    client_resources=client_resources,
)

INFO flwr 2023-02-27 15:54:01,832 | app.py:145 | Starting Flower simulation, config: ServerConfig(num_rounds=3, round_timeout=None)
2023-02-27 15:54:12,711	INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at [1m[32m127.0.0.1:8265 [39m[22m
INFO flwr 2023-02-27 15:54:17,699 | app.py:179 | Flower VCE: Ray initialized with resources: {'node:10.246.68.42': 1.0, 'memory': 3294825678.0, 'object_store_memory': 1647412838.0, 'CPU': 8.0}
INFO flwr 2023-02-27 15:54:17,701 | server.py:86 | Initializing global parameters
INFO flwr 2023-02-27 15:54:17,702 | server.py:266 | Using initial parameters provided by strategy
INFO flwr 2023-02-27 15:54:17,703 | server.py:88 | Evaluating initial parameters
INFO flwr 2023-02-27 15:54:17,948 | server.py:91 | initial parameters (loss, other metrics): 0.07383857822418213, {'accuracy': 0.102}
INFO flwr 2023-02-27 15:54:17,949 | server.py:101 | FL starting
DEBUG flwr 2023-02-27 15:54:17,950 | server.py:215 | fit_round 1: strategy sampled

Server-side evaluation loss 0.07383857822418213 / accuracy 0.102
[2m[36m(launch_and_fit pid=18904)[0m [Client 2, round 1] fit, config: {'server_round': 1, 'local_epochs': 1}
[2m[36m(launch_and_fit pid=18908)[0m [Client 6, round 1] fit, config: {'server_round': 1, 'local_epochs': 1}
[2m[36m(launch_and_fit pid=18909)[0m [Client 1, round 1] fit, config: {'server_round': 1, 'local_epochs': 1}


DEBUG flwr 2023-02-27 15:54:30,967 | server.py:229 | fit_round 1 received 3 results and 0 failures


[2m[36m(launch_and_fit pid=18904)[0m Epoch 1: train loss 0.06377512961626053, accuracy 0.24488888888888888
[2m[36m(launch_and_fit pid=18908)[0m Epoch 1: train loss 0.06503230333328247, accuracy 0.22088888888888888


INFO flwr 2023-02-27 15:54:31,216 | server.py:116 | fit progress: (1, 0.06176053500175476, {'accuracy': 0.304}, 13.26582099999996)
DEBUG flwr 2023-02-27 15:54:31,217 | server.py:165 | evaluate_round 1: strategy sampled 3 clients (out of 10)


[2m[36m(launch_and_fit pid=18909)[0m Epoch 1: train loss 0.06389319151639938, accuracy 0.23955555555555555
Server-side evaluation loss 0.06176053500175476 / accuracy 0.304
[2m[36m(launch_and_evaluate pid=18908)[0m [Client 4] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=18909)[0m [Client 8] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=18904)[0m [Client 3] evaluate, config: {}


DEBUG flwr 2023-02-27 15:54:36,531 | server.py:179 | evaluate_round 1 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:54:36,533 | server.py:215 | fit_round 2: strategy sampled 3 clients (out of 10)


[2m[36m(launch_and_fit pid=18904)[0m [Client 4, round 2] fit, config: {'server_round': 2, 'local_epochs': 2}
[2m[36m(launch_and_fit pid=18908)[0m [Client 7, round 2] fit, config: {'server_round': 2, 'local_epochs': 2}
[2m[36m(launch_and_fit pid=18909)[0m [Client 0, round 2] fit, config: {'server_round': 2, 'local_epochs': 2}
[2m[36m(launch_and_fit pid=18904)[0m Epoch 1: train loss 0.05752306431531906, accuracy 0.318
[2m[36m(launch_and_fit pid=18908)[0m Epoch 1: train loss 0.05670631676912308, accuracy 0.32822222222222225
[2m[36m(launch_and_fit pid=18909)[0m Epoch 1: train loss 0.05715209245681763, accuracy 0.33466666666666667


DEBUG flwr 2023-02-27 15:54:48,500 | server.py:229 | fit_round 2 received 3 results and 0 failures


[2m[36m(launch_and_fit pid=18904)[0m Epoch 2: train loss 0.05319346860051155, accuracy 0.37377777777777776
[2m[36m(launch_and_fit pid=18908)[0m Epoch 2: train loss 0.052408479154109955, accuracy 0.38466666666666666
[2m[36m(launch_and_fit pid=18909)[0m Epoch 2: train loss 0.053283073008060455, accuracy 0.3811111111111111


INFO flwr 2023-02-27 15:54:48,725 | server.py:116 | fit progress: (2, 0.05338016629219055, {'accuracy': 0.38}, 30.775509900000543)
DEBUG flwr 2023-02-27 15:54:48,726 | server.py:165 | evaluate_round 2: strategy sampled 3 clients (out of 10)


Server-side evaluation loss 0.05338016629219055 / accuracy 0.38
[2m[36m(launch_and_evaluate pid=18904)[0m [Client 1] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=18908)[0m [Client 9] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=18909)[0m [Client 2] evaluate, config: {}


DEBUG flwr 2023-02-27 15:54:56,357 | server.py:179 | evaluate_round 2 received 3 results and 0 failures
DEBUG flwr 2023-02-27 15:54:56,362 | server.py:215 | fit_round 3: strategy sampled 3 clients (out of 10)
[2m[33m(raylet)[0m E0227 15:54:57.981341300   19281 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509697.981300100","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:54:58.288603100   19279 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509698.288569100","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:54:58.2932

[2m[36m(launch_and_fit pid=18904)[0m [Client 2, round 3] fit, config: {'server_round': 3, 'local_epochs': 2}
[2m[36m(launch_and_fit pid=18908)[0m [Client 8, round 3] fit, config: {'server_round': 3, 'local_epochs': 2}
[2m[36m(launch_and_fit pid=18909)[0m [Client 6, round 3] fit, config: {'server_round': 3, 'local_epochs': 2}
[2m[36m(launch_and_fit pid=18904)[0m Epoch 1: train loss 0.0522049255669117, accuracy 0.386
[2m[36m(launch_and_fit pid=18909)[0m Epoch 1: train loss 0.05285734310746193, accuracy 0.37977777777777777
[2m[36m(launch_and_fit pid=18908)[0m Epoch 1: train loss 0.05055960267782211, accuracy 0.412


DEBUG flwr 2023-02-27 15:55:11,958 | server.py:229 | fit_round 3 received 3 results and 0 failures


[2m[36m(launch_and_fit pid=18904)[0m Epoch 2: train loss 0.0494501069188118, accuracy 0.4146666666666667
[2m[36m(launch_and_fit pid=18909)[0m Epoch 2: train loss 0.04971221089363098, accuracy 0.4151111111111111


INFO flwr 2023-02-27 15:55:12,158 | server.py:116 | fit progress: (3, 0.04911686301231384, {'accuracy': 0.45}, 54.208160099999986)
DEBUG flwr 2023-02-27 15:55:12,159 | server.py:165 | evaluate_round 3: strategy sampled 3 clients (out of 10)


[2m[36m(launch_and_fit pid=18908)[0m Epoch 2: train loss 0.04801885783672333, accuracy 0.44066666666666665
Server-side evaluation loss 0.04911686301231384 / accuracy 0.45
[2m[36m(launch_and_evaluate pid=18904)[0m [Client 5] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=18908)[0m [Client 8] evaluate, config: {}


DEBUG flwr 2023-02-27 15:55:20,202 | server.py:179 | evaluate_round 3 received 3 results and 0 failures
INFO flwr 2023-02-27 15:55:20,203 | server.py:144 | FL finished in 62.2529013000003
INFO flwr 2023-02-27 15:55:20,204 | app.py:202 | app_fit: losses_distributed [(1, 0.06204264259338379), (2, 0.05329331096013387), (3, 0.04731542666753133)]
INFO flwr 2023-02-27 15:55:20,204 | app.py:203 | app_fit: metrics_distributed {}
INFO flwr 2023-02-27 15:55:20,205 | app.py:204 | app_fit: losses_centralized [(0, 0.07383857822418213), (1, 0.06176053500175476), (2, 0.05338016629219055), (3, 0.04911686301231384)]
INFO flwr 2023-02-27 15:55:20,207 | app.py:205 | app_fit: metrics_centralized {'accuracy': [(0, 0.102), (1, 0.304), (2, 0.38), (3, 0.45)]}


[2m[36m(launch_and_evaluate pid=18908)[0m [Client 7] evaluate, config: {}


History (loss, distributed):
	round 1: 0.06204264259338379
	round 2: 0.05329331096013387
	round 3: 0.04731542666753133
History (loss, centralized):
	round 0: 0.07383857822418213
	round 1: 0.06176053500175476
	round 2: 0.05338016629219055
	round 3: 0.04911686301231384
History (metrics, centralized):
{'accuracy': [(0, 0.102), (1, 0.304), (2, 0.38), (3, 0.45)]}

As we can see, the client logs now include the current round of federated learning (which they read from the `config` dictionary). We can also see that local training runs for one epoch during the first round of federated learning, and then for two epochs during the second and third rounds, exactly as `fit_config` specifies.
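The `evaluate` calls in the logs still show `config: {}`, because only a fit config function was passed to the strategy. The same mechanism is available for evaluation: `FedAvg` also accepts an `on_evaluate_config_fn` argument. A minimal sketch of such a function follows; the name `evaluate_config` and the choice to send only the round number are assumptions for illustration.

```python
from typing import Dict


def evaluate_config(server_round: int) -> Dict[str, int]:
    """Illustrative evaluation config: only forwards the current round."""
    return {"server_round": server_round}


# Inspect what clients would receive in `evaluate` for rounds 1 to 3
for server_round in range(1, 4):
    print(evaluate_config(server_round))
```

Passing `on_evaluate_config_fn=evaluate_config` when constructing the strategy should make this dictionary show up as `config` in the clients' `evaluate` method.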

Clients can also return arbitrary values to the server. To do so, they return a dictionary from `fit` and/or `evaluate`. We have seen and used this concept throughout this notebook without mentioning it explicitly: our `FlowerClient` returns a dictionary containing a custom key/value pair as the third return value in `evaluate`.
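The logs above report `metrics_distributed {}` even though every client returns an `accuracy` value. The strategy does not know how to combine custom metrics unless we tell it. A minimal sketch of one way to do this is a weighted average over the per-client results. The function name `weighted_average` and the sample numbers are assumptions for illustration; the input shape (a list of `(num_examples, metrics)` pairs) is what we understand Flower to pass to a function given as `evaluate_metrics_aggregation_fn`.

```python
from typing import Dict, List, Tuple


def weighted_average(metrics: List[Tuple[int, Dict[str, float]]]) -> Dict[str, float]:
    """Average client accuracies, weighted by each client's example count."""
    total_examples = sum(num_examples for num_examples, _ in metrics)
    weighted_sum = sum(num_examples * m["accuracy"] for num_examples, m in metrics)
    return {"accuracy": weighted_sum / total_examples}


# Made-up results from two clients: (number of examples, returned metrics)
example_results = [(10, {"accuracy": 0.5}), (30, {"accuracy": 0.9})]
print(weighted_average(example_results))
```

With `evaluate_metrics_aggregation_fn=weighted_average` passed to the strategy, the aggregated accuracy would be expected to appear under `metrics_distributed` instead of an empty dictionary.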

## Scaling federated learning

As a last step in this notebook, let's see how we can use Flower to experiment with a large number of clients.

In [13]:
NUM_CLIENTS = 1000

trainloaders, valloaders, testloader = load_datasets(NUM_CLIENTS)

Files already downloaded and verified
Files already downloaded and verified


We now have 1000 partitions, each holding 45 training and 5 validation examples. Given that the number of training examples on each client is quite small, we should probably train the model a bit longer, so we configure the clients to perform 3 local training epochs. We should also adjust the fraction of clients selected during each round (we don't want all 1000 clients participating in every round). We set `fraction_fit` to `0.025`, which means that only 2.5% of available clients (so 25 clients) will be selected for training each round, and `fraction_evaluate` to `0.05`, so that 50 clients are selected for evaluation each round.
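The numbers in this setup can be checked with a little arithmetic. The sketch below makes two assumptions: that the 50,000 CIFAR-10 training images are split evenly with 10% of each partition held out for validation, and that the strategy samples `max(int(available * fraction), min_clients)` clients per round. The helper names are hypothetical; the resulting counts agree with the `strategy sampled ... clients` lines in the logs.

```python
def partition_sizes(num_train_images: int, num_clients: int, val_ratio: float = 0.1):
    """Assumed split: equal partitions, val_ratio of each held out for validation."""
    partition = num_train_images // num_clients
    num_val = int(partition * val_ratio)
    return partition - num_val, num_val


def num_sampled_clients(num_available: int, fraction: float, min_clients: int) -> int:
    """Assumed sampling rule: a fraction of the available clients, floored at a minimum."""
    return max(int(num_available * fraction), min_clients)


print(partition_sizes(50_000, 1000))         # (train, val) examples per client
print(num_sampled_clients(1000, 0.025, 20))  # clients per fit round
print(num_sampled_clients(1000, 0.05, 40))   # clients per evaluate round
print(num_sampled_clients(10, 0.3, 3))       # the earlier 10-client setup
```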


In [14]:
def fit_config(server_round: int):
    config = {
        "server_round": server_round,
        "local_epochs": 3,
    }
    return config


strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.025,  # Train on 25 clients (each round)
    fraction_evaluate=0.05,  # Evaluate on 50 clients (each round)
    min_fit_clients=20,
    min_evaluate_clients=40,
    min_available_clients=NUM_CLIENTS,
    initial_parameters=fl.common.ndarrays_to_parameters(get_parameters(Net())),
    on_fit_config_fn=fit_config,
)

fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=NUM_CLIENTS,
    config=fl.server.ServerConfig(num_rounds=3),  # Just three rounds
    strategy=strategy,
    client_resources=client_resources,
)

INFO flwr 2023-02-27 15:55:24,654 | app.py:145 | Starting Flower simulation, config: ServerConfig(num_rounds=3, round_timeout=None)
2023-02-27 15:55:36,960	INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at [1m[32m127.0.0.1:8265 [39m[22m
INFO flwr 2023-02-27 15:55:41,067 | app.py:179 | Flower VCE: Ray initialized with resources: {'object_store_memory': 1663985664.0, 'CPU': 8.0, 'memory': 3327971328.0, 'node:10.246.68.42': 1.0}
INFO flwr 2023-02-27 15:55:41,077 | server.py:86 | Initializing global parameters
INFO flwr 2023-02-27 15:55:41,078 | server.py:266 | Using initial parameters provided by strategy
INFO flwr 2023-02-27 15:55:41,080 | server.py:88 | Evaluating initial parameters
INFO flwr 2023-02-27 15:55:41,081 | server.py:101 | FL starting
DEBUG flwr 2023-02-27 15:55:41,083 | server.py:215 | fit_round 1: strategy sampled 25 clients (out of 1000)


[2m[36m(launch_and_fit pid=19674)[0m [Client 823, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19674)[0m Epoch 1: train loss 0.10284201055765152, accuracy 0.06666666666666667
[2m[36m(launch_and_fit pid=19674)[0m Epoch 2: train loss 0.10172911733388901, accuracy 0.1111111111111111
[2m[36m(launch_and_fit pid=19674)[0m Epoch 3: train loss 0.10133899748325348, accuracy 0.2


[2m[33m(raylet)[0m E0227 15:56:06.904156700   20025 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509766.904120400","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:56:06.934880100   20026 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509766.934844000","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:56:06.966421600   20021 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509766.966383100","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/i

[2m[36m(launch_and_fit pid=19675)[0m [Client 724, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19675)[0m Epoch 1: train loss 0.10198123753070831, accuracy 0.08888888888888889
[2m[36m(launch_and_fit pid=19672)[0m [Client 498, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19673)[0m [Client 962, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19676)[0m [Client 47, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19675)[0m Epoch 2: train loss 0.1017126739025116, accuracy 0.17777777777777778
[2m[36m(launch_and_fit pid=19675)[0m Epoch 3: train loss 0.10045036673545837, accuracy 0.2222222222222222
[2m[36m(launch_and_fit pid=19672)[0m Epoch 1: train loss 0.10291380435228348, accuracy 0.06666666666666667
[2m[36m(launch_and_fit pid=19673)[0m Epoch 1: train loss 0.1025865226984024, accuracy 0.1333333333333333

[2m[33m(raylet)[0m E0227 15:56:19.929306200   20184 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509779.929271400","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:56:19.938908800   20182 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509779.938870700","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:56:19.965374500   20183 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509779.965342500","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/i

[2m[36m(launch_and_fit pid=19675)[0m [Client 144, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19675)[0m Epoch 1: train loss 0.10193154960870743, accuracy 0.13333333333333333
[2m[36m(launch_and_fit pid=19675)[0m Epoch 2: train loss 0.10134268552064896, accuracy 0.2
[2m[36m(launch_and_fit pid=19675)[0m Epoch 3: train loss 0.1008644551038742, accuracy 0.2
[2m[36m(launch_and_fit pid=19676)[0m [Client 742, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19676)[0m Epoch 1: train loss 0.1021660789847374, accuracy 0.15555555555555556
[2m[36m(launch_and_fit pid=19676)[0m Epoch 2: train loss 0.10189219564199448, accuracy 0.13333333333333333
[2m[36m(launch_and_fit pid=19676)[0m Epoch 3: train loss 0.1014726310968399, accuracy 0.13333333333333333


[2m[33m(raylet)[0m [2023-02-27 15:56:37,550 E 19577 19577] (raylet) node_manager.cc:3097: 3 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 23822b6a08ac13c06b82c01db60956472938564608a9004253685752, IP: 10.246.68.42) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.246.68.42`
[2m[33m(raylet)[0m 
[2m[33m(raylet)[0m Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.


[2m[36m(launch_and_fit pid=19675)[0m [Client 137, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19675)[0m Epoch 1: train loss 0.1023029088973999, accuracy 0.13333333333333333
[2m[36m(launch_and_fit pid=19675)[0m Epoch 2: train loss 0.10124348849058151, accuracy 0.15555555555555556
[2m[36m(launch_and_fit pid=19675)[0m Epoch 3: train loss 0.10075262933969498, accuracy 0.24444444444444444
[2m[36m(launch_and_fit pid=19676)[0m [Client 819, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19676)[0m Epoch 1: train loss 0.10252483934164047, accuracy 0.13333333333333333
[2m[36m(launch_and_fit pid=19676)[0m Epoch 2: train loss 0.10155557096004486, accuracy 0.17777777777777778
[2m[36m(launch_and_fit pid=19676)[0m Epoch 3: train loss 0.10141512751579285, accuracy 0.2222222222222222
[2m[36m(launch_and_fit pid=19671)[0m [Client 449, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}


[2m[36m(raylet)[0m Spilled 2654 MiB, 34 objects, write throughput 121 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
[2m[33m(raylet)[0m E0227 15:56:43.272918400   20321 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509803.272881200","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:56:43.297888400   20327 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509803.297849800","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:56:43.312528300   20329 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"c

[2m[36m(launch_and_fit pid=20327)[0m [Client 401, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=20327)[0m Epoch 1: train loss 0.1026434376835823, accuracy 0.08888888888888889
[2m[36m(launch_and_fit pid=20327)[0m Epoch 2: train loss 0.10209785401821136, accuracy 0.17777777777777778
[2m[36m(launch_and_fit pid=19671)[0m [Client 446, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=20327)[0m Epoch 3: train loss 0.10097341239452362, accuracy 0.2
[2m[36m(launch_and_fit pid=19671)[0m Epoch 1: train loss 0.1026061400771141, accuracy 0.08888888888888889
[2m[36m(launch_and_fit pid=19671)[0m Epoch 2: train loss 0.10198001563549042, accuracy 0.17777777777777778
[2m[36m(launch_and_fit pid=19671)[0m Epoch 3: train loss 0.10168256610631943, accuracy 0.17777777777777778
[2m[36m(launch_and_fit pid=19670)[0m [Client 730, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch

[2m[33m(raylet)[0m E0227 15:56:56.812492600   20440 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509816.812473200","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
DEBUG flwr 2023-02-27 15:56:57,682 | server.py:229 | fit_round 1 received 25 results and 0 failures


[2m[36m(launch_and_fit pid=20321)[0m [Client 79, round 1] fit, config: {'server_round': 1, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=20321)[0m Epoch 1: train loss 0.10251282900571823, accuracy 0.15555555555555556
[2m[36m(launch_and_fit pid=20321)[0m Epoch 2: train loss 0.10206764936447144, accuracy 0.17777777777777778
[2m[36m(launch_and_fit pid=20321)[0m Epoch 3: train loss 0.10134074091911316, accuracy 0.17777777777777778


DEBUG flwr 2023-02-27 15:56:58,058 | server.py:165 | evaluate_round 1: strategy sampled 50 clients (out of 1000)


[2m[36m(launch_and_evaluate pid=20321)[0m [Client 363] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20321)[0m [Client 274] evaluate, config: {}


[2m[36m(raylet)[0m Spilled 4426 MiB, 64 objects, write throughput 133 MiB/s.


[2m[36m(launch_and_evaluate pid=19676)[0m [Client 94] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19669)[0m [Client 482] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20327)[0m [Client 90] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 129] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19676)[0m [Client 468] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19669)[0m [Client 734] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20327)[0m [Client 503] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 907] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20440)[0m [Client 408] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20526)[0m [Client 7] evaluate, config: {}


[2m[33m(raylet)[0m E0227 15:57:32.203798400   20521 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509852.203768500","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:57:32.232329600   20526 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509852.232226600","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:57:32.337754900   20527 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509852.337718700","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/i

[2m[36m(launch_and_evaluate pid=19670)[0m [Client 997] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 272] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19676)[0m [Client 582] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20327)[0m [Client 112] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 357] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19669)[0m [Client 823] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19669)[0m [Client 493] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19669)[0m [Client 875] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19669)[0m [Client 397] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20638)[0m [Client 948] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19669)[0m [Client 485] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19669)[0m [Client 285] evaluate, config: {}


[2m[33m(raylet)[0m E0227 15:58:02.518128500   20638 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509882.518096200","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}


[2m[36m(launch_and_evaluate pid=19669)[0m [Client 407] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20638)[0m [Client 120] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 150] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20638)[0m [Client 511] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20638)[0m [Client 950] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19676)[0m [Client 343] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20638)[0m [Client 44] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 22] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20327)[0m [Client 292] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 566] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 901] evaluate, config: {}


[2m[36m(raylet)[0m Spilled 8848 MiB, 108 objects, write throughput 163 MiB/s.


[2m[36m(launch_and_evaluate pid=19670)[0m [Client 463] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 642] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 774] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 637] evaluate, config: {}


[2m[33m(raylet)[0m [2023-02-27 15:58:37,554 E 19577 19577] (raylet) node_manager.cc:3097: 4 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 23822b6a08ac13c06b82c01db60956472938564608a9004253685752, IP: 10.246.68.42) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.246.68.42`
[2m[33m(raylet)[0m 
[2m[33m(raylet)[0m Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.


[2m[36m(launch_and_evaluate pid=20327)[0m [Client 101] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20638)[0m [Client 495] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 131] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19676)[0m [Client 605] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20327)[0m [Client 588] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19670)[0m [Client 456] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20638)[0m [Client 675] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19676)[0m [Client 756] evaluate, config: {}


[2m[33m(raylet)[0m E0227 15:58:52.112224100   20698 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509932.112200200","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:58:52.162060000   20703 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509932.162033000","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:58:52.188031200   20699 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509932.187987500","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/i

[2m[36m(launch_and_evaluate pid=20698)[0m [Client 882] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20703)[0m [Client 293] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20699)[0m [Client 921] evaluate, config: {}
[2m[36m(launch_and_fit pid=20703)[0m [Client 599, round 2] fit, config: {'server_round': 2, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=20703)[0m Epoch 1: train loss 0.10148078203201294, accuracy 0.13333333333333333
[2m[36m(launch_and_fit pid=20703)[0m Epoch 2: train loss 0.10104537755250931, accuracy 0.24444444444444444
[2m[36m(launch_and_fit pid=20703)[0m Epoch 3: train loss 0.09978871047496796, accuracy 0.2222222222222222
[2m[36m(launch_and_fit pid=20699)[0m [Client 3, round 2] fit, config: {'server_round': 2, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=20699)[0m Epoch 1: train loss 0.10245810449123383, accuracy 0.06666666666666667
[2m[36m(launch_and_fit pid=20699)[0m Epoch 2: train loss 0.10116440802812576, accuracy 0.177777

[2m[33m(raylet)[0m [2023-02-27 15:59:37,555 E 19577 19577] (raylet) node_manager.cc:3097: 2 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 23822b6a08ac13c06b82c01db60956472938564608a9004253685752, IP: 10.246.68.42) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.246.68.42`
[2m[33m(raylet)[0m 
[2m[33m(raylet)[0m Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.


[2m[36m(launch_and_fit pid=19670)[0m [Client 480, round 2] fit, config: {'server_round': 2, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19670)[0m Epoch 1: train loss 0.10202839225530624, accuracy 0.1111111111111111
[2m[36m(launch_and_fit pid=19670)[0m Epoch 2: train loss 0.10137329250574112, accuracy 0.2
[2m[36m(launch_and_fit pid=19670)[0m Epoch 3: train loss 0.10073395073413849, accuracy 0.35555555555555557
[2m[36m(launch_and_fit pid=20694)[0m [Client 212, round 2] fit, config: {'server_round': 2, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=20698)[0m [Client 262, round 2] fit, config: {'server_round': 2, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=20694)[0m Epoch 1: train loss 0.10224051028490067, accuracy 0.13333333333333333
[2m[36m(launch_and_fit pid=20698)[0m Epoch 1: train loss 0.10184305161237717, accuracy 0.17777777777777778
[2m[36m(launch_and_fit pid=20694)[0m Epoch 2: train loss 0.10190055519342422, accuracy 0.15555555555555556
[2m[36m(launc

[2m[33m(raylet)[0m E0227 15:59:56.129624400   20886 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509996.129589000","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 15:59:56.147463800   20906 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677509996.147420200","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
DEBUG flwr 2023-02-27 16:00:00,553 | server.py:229 | fit_round 2 received 25 results and 0 failures


[2m[36m(launch_and_fit pid=20906)[0m [Client 978, round 2] fit, config: {'server_round': 2, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=20906)[0m Epoch 1: train loss 0.10216633230447769, accuracy 0.13333333333333333
[2m[36m(launch_and_fit pid=20906)[0m Epoch 2: train loss 0.10124514997005463, accuracy 0.1111111111111111
[2m[36m(launch_and_fit pid=20906)[0m Epoch 3: train loss 0.10028143227100372, accuracy 0.3333333333333333


DEBUG flwr 2023-02-27 16:00:00,684 | server.py:165 | evaluate_round 2: strategy sampled 50 clients (out of 1000)


[2m[36m(launch_and_evaluate pid=20638)[0m [Client 235] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20638)[0m [Client 698] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20886)[0m [Client 399] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20886)[0m [Client 930] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20694)[0m [Client 742] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20327)[0m [Client 45] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20327)[0m [Client 372] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20327)[0m [Client 476] evaluate, config: {}


[2m[33m(raylet)[0m [2023-02-27 16:00:37,558 E 19577 19577] (raylet) node_manager.cc:3097: 4 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 23822b6a08ac13c06b82c01db60956472938564608a9004253685752, IP: 10.246.68.42) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.246.68.42`
[2m[33m(raylet)[0m 
[2m[33m(raylet)[0m Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.


[2m[36m(launch_and_evaluate pid=20327)[0m [Client 795] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20694)[0m [Client 210] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19676)[0m [Client 552] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20698)[0m [Client 653] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19676)[0m [Client 283] evaluate, config: {}


[2m[33m(raylet)[0m E0227 16:00:52.746526300   21033 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677510052.746487700","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 16:00:52.756731800   21035 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677510052.756696600","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 16:00:52.805626000   21029 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677510052.805573300","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/i

[2m[36m(launch_and_evaluate pid=19676)[0m [Client 486] evaluate, config: {}


[2m[36m(raylet)[0m Spilled 16667 MiB, 213 objects, write throughput 171 MiB/s.


[2m[36m(launch_and_evaluate pid=19676)[0m [Client 970] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20698)[0m [Client 31] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20694)[0m [Client 620] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21033)[0m [Client 540] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21035)[0m [Client 457] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21029)[0m [Client 50] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21029)[0m [Client 186] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21029)[0m [Client 585] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21033)[0m [Client 896] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21035)[0m [Client 137] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21029)[0m [Client 760] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21033)[0m [Client 410] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21035)[0m [Client 2

[2m[33m(raylet)[0m [2023-02-27 16:01:37,559 E 19577 19577] (raylet) node_manager.cc:3097: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 23822b6a08ac13c06b82c01db60956472938564608a9004253685752, IP: 10.246.68.42) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.246.68.42`
[2m[33m(raylet)[0m 
[2m[33m(raylet)[0m Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.


[2m[36m(launch_and_evaluate pid=20698)[0m [Client 445] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20694)[0m [Client 982] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20698)[0m [Client 6] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20698)[0m [Client 134] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19676)[0m [Client 654] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20694)[0m [Client 830] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21035)[0m [Client 304] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20698)[0m [Client 401] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21029)[0m [Client 136] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21029)[0m [Client 89] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21035)[0m [Client 419] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=19676)[0m [Client 764] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20698)[0m [Client 76

DEBUG flwr 2023-02-27 16:01:54,424 | server.py:179 | evaluate_round 2 received 50 results and 0 failures
DEBUG flwr 2023-02-27 16:01:54,428 | server.py:215 | fit_round 3: strategy sampled 25 clients (out of 1000)


[2m[36m(launch_and_evaluate pid=20327)[0m [Client 561] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20694)[0m [Client 861] evaluate, config: {}
[2m[36m(launch_and_fit pid=20694)[0m [Client 377, round 3] fit, config: {'server_round': 3, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=20694)[0m Epoch 1: train loss 0.10097963362932205, accuracy 0.15555555555555556
[2m[36m(launch_and_fit pid=20694)[0m Epoch 2: train loss 0.09817608445882797, accuracy 0.26666666666666666
[2m[36m(launch_and_fit pid=20694)[0m Epoch 3: train loss 0.09615800529718399, accuracy 0.28888888888888886
[2m[36m(launch_and_fit pid=19676)[0m [Client 384, round 3] fit, config: {'server_round': 3, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=19676)[0m Epoch 1: train loss 0.10175963491201401, accuracy 0.08888888888888889
[2m[36m(launch_and_fit pid=19676)[0m Epoch 2: train loss 0.10012194514274597, accuracy 0.15555555555555556
[2m[36m(launch_and_fit pid=19676)[0m Epoch 3: train loss 0.09

[2m[33m(raylet)[0m E0227 16:02:32.937896800   21188 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677510152.937863300","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 16:02:32.938174000   21199 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677510152.938141300","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
[2m[33m(raylet)[0m E0227 16:02:32.949628100   21198 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677510152.949593400","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/i

[2m[36m(launch_and_fit pid=21188)[0m [Client 909, round 3] fit, config: {'server_round': 3, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=21188)[0m Epoch 1: train loss 0.1013437882065773, accuracy 0.13333333333333333
[2m[36m(launch_and_fit pid=21199)[0m [Client 721, round 3] fit, config: {'server_round': 3, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=21199)[0m Epoch 1: train loss 0.1020599752664566, accuracy 0.06666666666666667
[2m[36m(launch_and_fit pid=21198)[0m [Client 401, round 3] fit, config: {'server_round': 3, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=21198)[0m Epoch 1: train loss 0.10167784243822098, accuracy 0.13333333333333333
[2m[36m(launch_and_fit pid=21188)[0m Epoch 2: train loss 0.09942928701639175, accuracy 0.24444444444444444
[2m[36m(launch_and_fit pid=21188)[0m Epoch 3: train loss 0.0985245630145073, accuracy 0.2222222222222222
[2m[36m(launch_and_fit pid=21199)[0m Epoch 2: train loss 0.09925474226474762, accuracy 0.24444444444444444
[

DEBUG flwr 2023-02-27 16:02:52,322 | server.py:229 | fit_round 3 received 25 results and 0 failures


[2m[36m(launch_and_fit pid=20698)[0m Epoch 2: train loss 0.10044965893030167, accuracy 0.2
[2m[36m(launch_and_fit pid=20698)[0m Epoch 3: train loss 0.09850294142961502, accuracy 0.24444444444444444
[2m[36m(launch_and_fit pid=21199)[0m Epoch 3: train loss 0.09857697039842606, accuracy 0.2222222222222222
[2m[36m(launch_and_fit pid=21198)[0m [Client 946, round 3] fit, config: {'server_round': 3, 'local_epochs': 3}
[2m[36m(launch_and_fit pid=21198)[0m Epoch 1: train loss 0.10195352882146835, accuracy 0.13333333333333333
[2m[36m(launch_and_fit pid=21198)[0m Epoch 2: train loss 0.10066024214029312, accuracy 0.24444444444444444
[2m[36m(launch_and_fit pid=21198)[0m Epoch 3: train loss 0.09983587265014648, accuracy 0.28888888888888886


DEBUG flwr 2023-02-27 16:02:52,555 | server.py:165 | evaluate_round 3: strategy sampled 50 clients (out of 1000)
[2m[33m(raylet)[0m E0227 16:02:55.997637600   21331 socket_utils_common_posix.cc:223] check for SO_REUSEPORT: {"created":"@1677510175.997601500","description":"Protocol not available","errno":92,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":202,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}


[2m[36m(launch_and_evaluate pid=21198)[0m [Client 398] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21198)[0m [Client 850] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20698)[0m [Client 200] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21198)[0m [Client 652] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21331)[0m [Client 123] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21331)[0m [Client 797] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21198)[0m [Client 154] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21198)[0m [Client 851] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21198)[0m [Client 803] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21331)[0m [Client 12] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21331)[0m [Client 299] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=20698)[0m [Client 497] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21199)[0m [Client 

DEBUG flwr 2023-02-27 16:04:34,750 | server.py:179 | evaluate_round 3 received 50 results and 0 failures
INFO flwr 2023-02-27 16:04:34,770 | server.py:144 | FL finished in 533.6758999000003
INFO flwr 2023-02-27 16:04:34,772 | app.py:202 | app_fit: losses_distributed [(1, 0.4597410078048705), (2, 0.45852434730529773), (3, 0.4515949525833129)]
INFO flwr 2023-02-27 16:04:34,774 | app.py:203 | app_fit: metrics_distributed {}
INFO flwr 2023-02-27 16:04:34,776 | app.py:204 | app_fit: losses_centralized []
INFO flwr 2023-02-27 16:04:34,777 | app.py:205 | app_fit: metrics_centralized {}


[2m[36m(launch_and_evaluate pid=21199)[0m [Client 265] evaluate, config: {}
[2m[36m(launch_and_evaluate pid=21198)[0m [Client 138] evaluate, config: {}


History (loss, distributed):
	round 1: 0.4597410078048705
	round 2: 0.45852434730529773
	round 3: 0.4515949525833129

[2m[33m(raylet)[0m [2023-02-27 16:10:37,631 E 19577 19577] (raylet) node_manager.cc:3097: 3 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 23822b6a08ac13c06b82c01db60956472938564608a9004253685752, IP: 10.246.68.42) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.246.68.42`
[2m[33m(raylet)[0m 
[2m[33m(raylet)[0m Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
[2m[33m(raylet)[0m [2023-02-27 16:11:37,638 E 19577 19577] (raylet) node_ma

## Recap

In this notebook, we've seen how we can gradually enhance our system by customizing the strategy, initializing parameters on the server side, choosing a different strategy, and evaluating models on the server side. That's quite a bit of flexibility with so little code, right?

In the later sections, we've seen how we can communicate arbitrary values between server and clients to fully customize client-side execution. With that capability, we built a large-scale Federated Learning simulation using the Flower Virtual Client Engine and ran an experiment involving 1000 clients in a single workload, all in a Jupyter Notebook! In the rounds logged above, the strategy sampled 25 of those clients for training and 50 for evaluation.
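As a compact reminder of the two ideas in this recap, here is a minimal sketch in plain Python (no Flower import needed). `fit_config` builds the kind of dictionary that shows up in the log above (`{'server_round': 2, 'local_epochs': 3}`); keeping `local_epochs` fixed at 3 for every round is an assumption based on the rounds visible in that log. `num_sampled_clients` is a hypothetical helper that mirrors how a FedAvg-style strategy turns a sampling fraction and a minimum into a client count; the fractions and minimums in the usage lines are assumed values picked only because they reproduce the 25 and 50 clients reported in the log.

```python
from typing import Dict, Union

# Value types that can travel in a config dictionary (illustrative alias).
Scalar = Union[bool, bytes, float, int, str]


def fit_config(server_round: int) -> Dict[str, Scalar]:
    """Build the config dictionary sent to each sampled client for one training round."""
    return {
        "server_round": server_round,  # lets the client log which round it is in
        "local_epochs": 3,  # assumption: constant, as seen in the logged rounds
    }


def num_sampled_clients(num_available: int, fraction: float, min_clients: int) -> int:
    """Hypothetical helper: how many clients a FedAvg-style strategy samples per round."""
    # Take the requested fraction of available clients, but never fewer than the minimum.
    return max(int(num_available * fraction), min_clients)


if __name__ == "__main__":
    print(fit_config(2))
    # Assumed settings that reproduce the sampled counts in the log above.
    print(num_sampled_clients(1000, fraction=0.025, min_clients=20))
    print(num_sampled_clients(1000, fraction=0.05, min_clients=40))
```

In Flower, a function like `fit_config` is typically handed to a strategy through its `on_fit_config_fn` argument, and the fractions correspond to `fraction_fit` and `fraction_evaluate`; treat the concrete numbers here as placeholders to adapt to your own experiment.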

## Next steps

Before you continue, make sure to join the Flower community on Slack: [Join Slack](https://flower.dev/join-slack/)

There's a dedicated `#questions` channel if you need help, but we'd also love to hear who you are in `#introductions`!

The [Flower Federated Learning Tutorial - Part 3 [WIP]](https://flower.dev/docs/tutorial/Flower-3-Building-a-Strategy-PyTorch.html) shows how to build a fully custom `Strategy` from scratch.