In [1]:

import nest_asyncio
nest_asyncio.apply()

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
import tensorflow_federated as tff


In [2]:
tff.__version__

'0.50.0'

## Create Binary Classification data with sklearn

In [3]:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

n = 100000
d = 50
noise_factor = 0.05
test_size = 0.1 # % of n

# Create (noisy) testing data for binary classification.
X, y = make_classification(
    n_samples=n, 
    n_features=d,
    n_informative=d,
    n_redundant=0, 
    n_classes=2,
    class_sep=-1,
    flip_y=noise_factor
)

# We will work with label values -1, +1 and not 0, +1 (convert)
y[y == 0] = -1

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)


## Convert to Tensors

In [4]:

# Convert the data to TensorFlow tensors
X_train_tensor = tf.constant(X_train, dtype=tf.float32)
y_train_tensor = tf.constant(y_train, dtype=tf.float32)
X_test_tensor = tf.constant(X_test, dtype=tf.float32)
y_test_tensor = tf.constant(y_test, dtype=tf.float32)

## Prepare data for Tensorflow Federated

We have the training and testing Tensors holding our data. TFF expects for each client an `OrderedDict` containing `y` and `x` data. Hence, we preprocess our Tensors to follow this convention.

In [5]:

NUM_CLIENTS = 8
BATCH_SIZE = 32
SHUFFLE_BUFFER = 96

In [6]:

import collections

# Create a dictionary with the slices for each client
client_slices_train = {}
slices_test = {}

n_test = int(n - n*test_size)

for i in range(NUM_CLIENTS):
    # Compute the indices for this client's slice
    start_idx = int(i * n_test / NUM_CLIENTS)
    end_idx = int((i + 1) * n_test / NUM_CLIENTS)

    # Get the slice for this client
    X_client_train = X_train_tensor[start_idx:end_idx]
    y_client_train = y_train_tensor[start_idx:end_idx]
    
    client_data_train = collections.OrderedDict([('y', y_client_train), ('x', X_client_train)])
    
    # Combine the slices into a single dataset
    client_slices_train[f'client_{i}'] = client_data_train

slices_test = collections.OrderedDict([('y', y_test_tensor), ('x', X_test_tensor)])

For a sanity check let's see inside `client_slices_train` for the first x,y tuple of the 'first' client

In [37]:
client_slices_train['client_0']['x'][0]

<tf.Tensor: shape=(50,), dtype=float32, numpy=
array([-1.7185166 , -5.0518475 ,  0.02975657,  4.507718  , -2.9012241 ,
       -1.3904775 , -0.6722793 , -5.201269  ,  0.4005102 ,  5.250195  ,
       -5.3100247 ,  0.3369487 ,  4.662774  ,  5.5472927 ,  3.9250505 ,
        2.3702579 ,  4.892519  ,  1.3388315 ,  2.2262855 , -1.2632746 ,
       -1.6392467 ,  3.9341595 ,  0.9980433 , -0.2969859 , -1.6189998 ,
        2.2584572 , -1.639607  , -2.4005718 ,  2.3321502 , -4.670214  ,
        3.8024392 ,  5.2460275 , -1.852771  ,  0.7435165 ,  0.54761195,
       -5.3764625 , -0.12647538, -3.522802  ,  8.834553  ,  8.7223625 ,
       -3.9466345 ,  4.4238544 ,  3.8179338 , -0.90905523, -0.07614926,
       -4.1437864 ,  2.8217857 ,  0.72625434, -0.56228954, -0.7264684 ],
      dtype=float32)>

In [38]:
client_slices_train['client_0']['y'][0]

<tf.Tensor: shape=(), dtype=float32, numpy=1.0>

Now, a client with `client_id` has it's single Tensor holding instances in`client_slices_train[client_id]['x']` and labels in `client_slices_train[client_id]['y']`. Let's take a step back from TFF. Having this data scheme, we can create a client's Tensorflow dataset using `from_tensor_slices` function passing the client's id as follows

In [9]:

def create_tf_dataset_for_client(client_id):
    return tf.data.Dataset.from_tensor_slices(client_slices_train[client_id]).shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)

def create_tf_dataset_for_test():
    return tf.data.Dataset.from_tensor_slices(slices_test).batch(BATCH_SIZE)

For TFF we need to construct Federated data for clients, i.e., `tff.simulation.datasets.ClientData`. We can use the `from_clients_and_tf_fn` function that takes as argument the `client_ids` : a list of strings corresponding to client ids, and a `serializable_dataset_fn` : a function that takes a `client_id` from the above list, and returns a `tf.data.Dataset`. It's obvious how we proceed with the code (using the above function)

In [10]:

preprocessed_train_federated_dataset = tff.simulation.datasets.ClientData.from_clients_and_tf_fn(
    client_ids=list(client_slices_train.keys()),
    serializable_dataset_fn=lambda client_id: create_tf_dataset_for_client(client_id)
)

In [11]:
preprocessed_train_federated_dataset.client_ids

['client_0',
 'client_1',
 'client_2',
 'client_3',
 'client_4',
 'client_5',
 'client_6',
 'client_7']

**Note**: Cross-device federated learning does not use client IDs or perform any tracking of clients. However in simulation experiments using centralized test data the experimenter may select specific clients to be processed per round. The concept of a client ID is only available at the preprocessing stage when preparing input data for the simulation and is not part of the TensorFlow Federated core APIs.

Now, `preprocessed_train_federated_dataset` holds logic on how each client constructs its dataset. Note that so `client_slices_train` has already been materialized and lies in this context's memory.

One way (the simplest) to feed federated data to TFF in a simulation is simply as a Python list, with each element of the list holds the data of an individual client, whether as a list or preferably as a `tf.data.Dataset`. Since we already created an interface that provides the latter we will use it. Here is a helper function that will construct a list of datasets from the set of users.

In [12]:

def create_federated_data():    
    return [
        preprocessed_train_federated_dataset.create_tf_dataset_for_client(client)
        for client in preprocessed_train_federated_dataset.client_ids
    ]

**Important Note**: Firstly, we used `sklearn` to create the binary classification data eagerly, i.e., we were forced to materialize it into memory. In simulation, in general it is more sound to push preprocessing logic into each client, i.e., each client constructs its own dataset (from the same underlying distribution) or reads from a file or something else and he, himself processes the data as needed. This is the best approach and uses the TFF distributed engine the best way. But in our case this was illogical to happen since we are forced to construct the dataset in memory anyway. For example, we could have stored each client's data inside some serialized file (`client_0.tfrecord` for the first client and so on) and push logic where each clients diserializes and processes its own data but this would be silly and slower when testing. For a small example that showcases this scenario see *TFF - Introduction - Federated Core API - Part 3(examples).ipynb*.

In [13]:
#https://stackoverflow.com/questions/60265798/tff-how-define-tff-simulation-clientdata-from-clients-and-fn-function

## TFF Types

First, let's define the type of input as a TFF named tuple. Since the size of data batches may vary, we set the batch dimension to None to indicate that the size of this dimension is unknown.

In [14]:

BATCH_SPEC = collections.OrderedDict(
    y=tf.TensorSpec(shape=[None], dtype=tf.float32),
    x=tf.TensorSpec(shape=[None, d], dtype=tf.float32)
)
BATCH_TYPE = tff.to_type(BATCH_SPEC)

In [15]:
str(BATCH_TYPE)

'<y=float32[?],x=float32[?,50]>'

Every client holds a sequence of batches so the we define the client data type as follows

In [16]:

LOCAL_DATA_TYPE = tff.SequenceType(BATCH_TYPE)

In [17]:
str(LOCAL_DATA_TYPE)

'<y=float32[?],x=float32[?,50]>*'

Let's now define the TFF type of the model which is simply a `tf.Variable` with shape (d, 1)

In [18]:

MODEL_TYPE = tff.TensorType(dtype=tf.float32, shape=(d, 1))

In [19]:
str(MODEL_TYPE)

'float32[50,1]'

Since the server holds the 'global' model we need to create the Federated Type, defined as the tuple of a member: An instance of `tff.Type`, and a placement: The specification of placement of the member comonents (where this type is hosted at, for example, at `tff.SERVER` or `tff.CLIENTS`).

In [20]:

SERVER_MODEL_TYPE = tff.type_at_server(MODEL_TYPE)

In [21]:
str(SERVER_MODEL_TYPE)

'float32[50,1]@SERVER'

Following, the same logic, we create the Federated Type of each client's data.

In [22]:

CLIENT_DATA_TYPE = tff.type_at_clients(LOCAL_DATA_TYPE)

In [23]:
str(CLIENT_DATA_TYPE)

'{<y=float32[?],x=float32[?,50]>*}@CLIENTS'

## Accuracy Testing

In [24]:

@tf.function
def accuracy(model, dataset):
    
    @tf.function
    def _batch_accuracy(model, batch):
        x_batch, y_batch = batch['x'], tf.expand_dims(batch['y'], axis=1)

        # dot(w, x) for the batch (each instance of x in x_batch) with with shape=(batchsize, 1)
        weights_dot_x_batch = tf.matmul(x_batch, model)

        # Prediction batch with shape=(batchsize, 1)
        y_pred_batch = tf.sign(weights_dot_x_batch)

        accuracy = tf.reduce_mean(tf.cast(tf.equal(y_pred_batch, y_batch), tf.float32))

        return accuracy
    
    # We take advantage of AutoGraph (convert Python code to TensorFlow-compatible graph code automatically)
    acc, num_batches = 0., 0.
    for batch in dataset:
        acc += _batch_accuracy(model, batch)
        num_batches += 1
        
    acc = acc / num_batches
    
    return acc

In [25]:

@tff.tf_computation(MODEL_TYPE, LOCAL_DATA_TYPE)
def accuracy_fn(model, dataset):
    model = tf.Variable(initial_value=model)
    return accuracy(model, dataset)

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


In [26]:
str(accuracy_fn.type_signature)

'(<model=float32[50,1],dataset=<y=float32[?],x=float32[?,50]>*> -> float32)'

## Federated Training

### Server Update

The server update takes as input the *average* of the client's models and creates its model as follows

In [27]:

@tff.tf_computation(MODEL_TYPE)
def server_update_fn(mean_client_model):
    model = tf.Variable(initial_value=mean_client_model)
    return model

**Note**: This abstraction for this simple jupyter (where the model is a `tf.Variable`) is not necessary. We create this abstraction since it is common practice generally.

### Client train

Each client trains on its own dataset (which is a sequence of batches). Hence, we create the training process, currently a PA-1 Classifier. The input of `client_train` is the client model materialized inside its client and its dataset.

In [28]:

@tf.function
def client_train(model, dataset):
    
    @tf.function
    def _train_on_batch(model, batch, C=0.01):

        x_batch, y_batch = batch['x'], tf.expand_dims(batch['y'], axis=1)

        # dot(w, x) for the batch (each instance of x in x_batch) with with shape=(batchsize, 1)
        weights_dot_x_batch = tf.matmul(x_batch, model)

        # Prediction batch with shape=(batchsize, 1)
        y_pred_batch = tf.sign(weights_dot_x_batch)

        # Suffer loss for each prediction (of instance) in the batch with shape=(batchsize,1)
        loss_batch = tf.maximum(0., 1. - tf.multiply(y_batch, weights_dot_x_batch))

        # shape=(batchsize,1) where each instance is ||x||^2, x in x_batch
        norm_batch = tf.expand_dims(tf.reduce_sum(tf.square(x_batch), axis=1), axis=1)

        # PA-1 : Learning rate t for each instance x, with shape=(batchsize,1)
        t_batch = tf.maximum(C, tf.divide(loss_batch, norm_batch))

        # each instance is y*t*x, where y,t scalars and x in x_batch. shape=(batchsize,d)
        t_y_x_batch = tf.multiply(t_batch, tf.multiply(y_batch, x_batch))

        # !!!! Update with mean t*y*x
        t_y_x_update = tf.expand_dims(tf.reduce_mean(t_y_x_batch, axis=0) ,axis=1)

        # Update
        model.assign_add(t_y_x_update)
    
    for batch in dataset:
        _train_on_batch(model, batch)
        
    return model

Using the functions decorated with `tf.function` (context inside Tensorflow) we create the `client_train_fn` with context inside TFF. `client_train_fn` takes as input the `initial_model` which is the model broadcasted from the server to each client and the client dataset. Notice that each client first creates it's own model using the server model.

In [29]:

@tff.tf_computation(MODEL_TYPE, LOCAL_DATA_TYPE)
def client_train_fn(initial_model, dataset):
    model = tf.Variable(initial_value=initial_model)
    return client_train(model, dataset)

In [30]:
str(client_train_fn.type_signature)

'(<initial_model=float32[50,1],dataset=<y=float32[?],x=float32[?,50]>*> -> float32[50,1])'

### Training Round

Remember the 4 elements of an FL round:

1. A server-to-client broadcast of the weights.
2. A local client training 'step' on its own data.
3. A client-to-server upload step (returning the trained weights).
4. A server update step.

In [31]:

@tff.federated_computation(SERVER_MODEL_TYPE, CLIENT_DATA_TYPE)
def run_one_round(server_model, federated_dataset):
    
    # 1. Broadcast the current server model to the clients
    server_model_at_client = tff.federated_broadcast(server_model)
    
    # 2. 3. Train the client models on their respective datasets
    client_models = tff.federated_map(client_train_fn, (server_model_at_client, federated_dataset))
    
    # 4. Compute the mean of the client weights
    mean_client_model = tff.federated_mean(client_models)
    
    # 4. Update the server model
    server_model = tff.federated_map(server_update_fn, mean_client_model)
    
    return server_model

In [32]:
str(run_one_round.type_signature)

'(<server_model=float32[50,1]@SERVER,federated_dataset={<y=float32[?],x=float32[?,50]>*}@CLIENTS> -> float32[50,1]@SERVER)'

## Training.

In [33]:

NUM_ROUNDS = 20

# Initial model of zeros (in Python context, to be passed to server)
model = tf.Variable(tf.zeros(shape=(d, 1)), trainable=True, name='weights', dtype=tf.float32)

In [34]:

train_federated_data = create_federated_data()

In [35]:

test_dataset = create_tf_dataset_for_test()

In [36]:

for r in range(NUM_ROUNDS):
    model = run_one_round(model, train_federated_data)
    print(f"Round: {r}  Server Model Accuracy: {accuracy_fn(model, test_dataset)}")

Round: 0  Server Model Accuracy: 0.8102036714553833
Round: 1  Server Model Accuracy: 0.8213858008384705
Round: 2  Server Model Accuracy: 0.8266773223876953
Round: 3  Server Model Accuracy: 0.8282747864723206
Round: 4  Server Model Accuracy: 0.8309704661369324
Round: 5  Server Model Accuracy: 0.8332667946815491
Round: 6  Server Model Accuracy: 0.8334664702415466
Round: 7  Server Model Accuracy: 0.8340654969215393
Round: 8  Server Model Accuracy: 0.8343650102615356
Round: 9  Server Model Accuracy: 0.834664523601532
Round: 10  Server Model Accuracy: 0.8343650102615356
Round: 11  Server Model Accuracy: 0.8344648480415344
Round: 12  Server Model Accuracy: 0.8343650102615356
Round: 13  Server Model Accuracy: 0.8344648480415344
Round: 14  Server Model Accuracy: 0.8347643613815308
Round: 15  Server Model Accuracy: 0.8345646858215332
Round: 16  Server Model Accuracy: 0.8342651724815369
Round: 17  Server Model Accuracy: 0.8342651724815369
Round: 18  Server Model Accuracy: 0.8340654969215393
Roun