##### Copyright 2020 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [1]:
#@test {"skip": true}
!pip install --quiet --upgrade tensorflow_federated
!pip install --quiet --upgrade nest_asyncio

import nest_asyncio
nest_asyncio.apply()



In [2]:
import collections
import io

import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff

from google.colab import files

# Test that TFF is working:
tff.federated_computation(lambda: 'Hello, World!')()

b'Hello, World!'

In [27]:
from IPython.core.magic import register_cell_magic

@register_cell_magic
def norun_except_tests(*args, **kwargs):
    return

**NOTE**: This colab has been verified to work with the [latest released version](https://github.com/tensorflow/federated#compatibility) of the `tensorflow_federated` pip package. The TensorFlow Federated project is still in pre-release development, so this colab may not work on `main`.


## Overview

In the [image classification](federated_learning_for_image_classification.ipynb) and
[text generation](federated_learning_for_text_generation.ipynb) tutorials, we learned how to set up model and data pipelines for Federated Learning using datasets provided by TFF.

In order to try Federated Learning for different applications, you may want to provide your own dataset. This tutorial shows you how to load a CSV file into a `tff.simulation.ClientData` for use in federated computations.


## Creating and Downloading a CSV File with Shakespeare Data

Before we can demonstrate loading a CSV file for use in TFF, we need to create two CSV files: one with training data, and one with testing data. We will load a Shakespeare dataset from the `tff.simulation.datasets` package and convert the data into CSV format. This is the same dataset used in the [text generation](federated_learning_for_text_generation.ipynb) tutorial.

In [9]:
train_data, test_data = tff.simulation.datasets.shakespeare.load_data()

Downloading data from https://storage.googleapis.com/tff-datasets-public/shakespeare.tar.bz2


The TFF dataset is partitioned by client ID, where each client corresponds to a dataset on a particular device that might participate in federated learning. In the case of the Shakespeare dataset, each client is a character from Shakespeare, and the `client_id` is a character's name.

To get a `tf.data.Dataset` for a particular client, we can use the `create_tf_dataset_for_client` method. For this dataset, each `tf.data.Dataset` consists of multiple lines (`snippets`) spoken by that Shakespeare character. We create one row in the CSV file for each snippet, with the `client_id` in the `character` column and the text in the `snippets` column. Thus, the same character's name can appear in many rows.


In [10]:
def write_data_to_csv_file(tff_dataset, f):
  f.write('"character","snippets"\n')
  # Use a subset of the clients to speed up execution.
  for client_id in tff_dataset.client_ids[200:]:
    tf_dataset = tff_dataset.create_tf_dataset_for_client(client_id)
    for element in tf_dataset.as_numpy_iterator():
      # The CSV standard specifies that double quotes in the data must be escaped by preceding them with another double quote.
      f.write('"' + client_id + '","' + str(element['snippets'], 'ascii').replace('"', '""') + '"\n')
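
The manual quote-doubling above can also be delegated to Python's standard `csv` module. The following is a minimal sketch of that alternative, not the approach this tutorial uses. The helper name `write_rows_with_csv_module` and the sample rows are hypothetical and are not taken from the Shakespeare dataset.

```python
import csv
import io


def write_rows_with_csv_module(rows, f):
    """Write (character, snippet) pairs to `f`, quoting every field."""
    # QUOTE_ALL wraps every field in double quotes, and the writer doubles
    # any embedded double quote, which matches the manual escaping above.
    writer = csv.writer(f, quoting=csv.QUOTE_ALL, lineterminator='\n')
    writer.writerow(['character', 'snippets'])
    for character, snippet in rows:
        writer.writerow([character, snippet])


# Hypothetical sample rows, including an embedded quote and a newline.
sample_rows = [
    ('PLAY_A_SPEAKER', 'He said "hello"\nand left.'),
    ('PLAY_A_SPEAKER', 'A plain line.'),
]
buffer = io.StringIO()
write_rows_with_csv_module(sample_rows, buffer)
csv_text = buffer.getvalue()

# Reading the text back with csv.reader recovers the original snippets.
parsed_rows = list(csv.reader(io.StringIO(csv_text)))
```

Because the reader undoes the quote doubling, the snippet with the embedded quotes and newline comes back as a single field.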

We will create two separate files, one for training data and one for testing data.

In [11]:
#@test {"skip": true}
filenames = ['shakespeare_train.csv', 'shakespeare_test.csv']
for filename, data in zip(filenames, [train_data, test_data]):
  with open(filename, 'w') as f:
    write_data_to_csv_file(data, f)

Let's see what the first few lines of each file look like. The client keys consist of the name of the play joined with
the name of the character, so for example `MUCH_ADO_ABOUT_NOTHING_OTHELLO` corresponds to the lines for the character Othello in the play *Much Ado About Nothing*.

In [12]:
#@test {"skip": true}
for filename in filenames:
  with open(filename, 'r') as f:
    print("Reading file " + filename)
    for i in range(10):
      print(f.readline())

Reading file shakespeare_train.csv
"character","snippets"

"PERICLES__PRINCE_OF_TYRE_EXTON","Both have I spill'd. O, would the deed were good!

For now the devil, that told me I did well,

Says that this deed is chronicled in hell.

This dead King to the living King I'll bear.

Take hence the rest, and give them burial here.       Exeunt

Great King, within this coffin I present"

"PERICLES__PRINCE_OF_TYRE_EXTON","'Have I no friend will rid me of this living fear?'

Was it not so?

'Have I no friend?' quoth he. He spake it twice"

Reading file shakespeare_test.csv
"character","snippets"

"PERICLES__PRINCE_OF_TYRE_EXTON","Didst thou not mark the King, what words he spake?"

"PERICLES__PRINCE_OF_TYRE_FIRST_CITIZEN","Give you good morrow, sir."

"PERICLES__PRINCE_OF_TYRE_FIRST_CITIZEN","mother.

Come, come, we fear the worst; all will be"

"PERICLES__PRINCE_OF_TYRE_FIRST_HERALD","Harry of Hereford, Lancaster, and Derby,"

"PERICLES__PRINCE_OF_TYRE_FIRST_MURDERER","done.

Where's thy consc

Now we can download the files we just created. You may need to click "Allow" on a popup in your browser asking for permission to download the files from Colab.

In [13]:
#@test {"skip": true}
for filename in filenames:
  files.download(filename)


Take a look in your downloads folder to ensure that the files `shakespeare_train.csv` and `shakespeare_test.csv` were downloaded. Now we can delete the files from Colab, since we will be uploading them from our local filesystem.

In [14]:
#@test {"skip": true}
!rm "shakespeare_train.csv"
!rm "shakespeare_test.csv"

## Uploading files into Colab

Now we can upload the CSV files, as if we had the data in CSV format locally in the first place. The `files.upload()` function should bring up a "Choose Files" button. When you click this button, you should be able to choose files from your filesystem to upload. Choose the `shakespeare_train.csv` and `shakespeare_test.csv` files that you downloaded in the previous step.

In [15]:
#@test {"skip": true}
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving shakespeare_train.csv to shakespeare_train.csv
Saving shakespeare_test.csv to shakespeare_test.csv
User uploaded file "shakespeare_train.csv" with length 2181298 bytes
User uploaded file "shakespeare_test.csv" with length 320054 bytes


## Converting CSV Data to a TFF Dataset

We will now write a function that returns a federated dataset (`tff.simulation.ClientData`) given the CSV file contents. 

We will use Pandas to read in the CSV data. Then we implement `create_tf_dataset_for_client_fn`, which takes a client ID and returns a TensorFlow dataset for that client.

In [16]:
#@test {"skip": true}
# Create Pandas dataframes from the CSV files.
train_load_csv = pd.read_csv(io.StringIO(uploaded['shakespeare_train.csv'].decode("ascii")))
test_load_csv = pd.read_csv(io.StringIO(uploaded['shakespeare_test.csv'].decode("ascii")))
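
As a small check of the parsing step above, the sketch below builds a hand-made stand-in for the dictionary that `files.upload()` returns (file name mapped to raw bytes) and parses it the same way. The name `uploaded_example`, the file name `example.csv`, and the rows are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hand-built stand-in for the dict returned by `files.upload()`:
# each key is a file name and each value is the file content as bytes.
uploaded_example = {
    'example.csv': (
        b'"character","snippets"\n'
        b'"PLAY_A_SPEAKER","He said ""hello""\nand left."\n'
        b'"PLAY_B_SPEAKER","A plain line."\n'
    ),
}

# Decode the bytes to text and let pandas parse the quoted fields.
example_df = pd.read_csv(
    io.StringIO(uploaded_example['example.csv'].decode('ascii')))
```

The quoted field that spans two physical lines is read as one cell, and the doubled quotes are collapsed back to single quotes.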

In [28]:
# Running this cell will call the write_data_to_csv_file
# function to write the data from the Shakespeare datasets
# to in-memory buffers, and then load the contents of
# the buffers into Pandas dataframes.
# If the previous cell executed successfully, there is
# no need to run this one.
# If you would like to skip the steps above involving
# downloading and re-uploading the CSV files,
# you can comment out the line below to run this cell.
%%norun_except_tests

train_stream = io.StringIO()
write_data_to_csv_file(train_data, train_stream)
train_stream.seek(0)
train_load_csv = pd.read_csv(train_stream)

test_stream = io.StringIO()
write_data_to_csv_file(test_data, test_stream)
test_stream.seek(0)
test_load_csv = pd.read_csv(test_stream)

In [18]:
client_id_colname = 'character' # the column that represents client ID
# The signature of the `tf.data.Dataset` that will be output for each client.
output_signature = {'snippets': tf.TensorSpec(shape=(), dtype=tf.string)}
# Rows for which any of the columns in this list are null will be discarded.
notnull_cols = list(output_signature.keys())

def create_tff_dataset_for_csv_file(df):
  # Collect unique character names.
  client_ids = df[client_id_colname].unique().tolist()

  # Define a function that takes client ID and returns a tf.data.Dataset for
  # that client. The tf.data.Dataset should contain a dictionary for each
  # line spoken by the character, where the single key in each dictionary is
  # "snippets" and the value is a string tensor containing the text of the line.
  def create_tf_dataset_for_client_fn(client_id):
    # Retrieve only the rows corresponding to this client.
    client_data = df[df[client_id_colname] == client_id]
    # Filter out any rows without a snippet of text spoken by the character.
    client_data = client_data[client_data[notnull_cols].notnull().all(axis=1)]
    # Select the data columns, discarding the client id column.
    client_data = client_data[output_signature.keys()]
    # Convert to a list of dictionaries, one per row, in the format
    # [{column1: value1, column2: value2}, ...]
    records = client_data.to_dict('records')

    # Define a generator that outputs a map for each row with column names as
    # keys and row contents as values. In this example there is only one column,
    # 'snippets', but this approach is shown to demonstrate how one might
    # load a CSV file with more columns.
    def dataset_gen():
      for row in records:
        yield row
    # Generate a dataset for the client, specifying the output type explicitly
    # as otherwise TensorFlow expects a tensor as the top-level type.
    return tf.data.Dataset.from_generator(
        dataset_gen,
        output_types={k:v.dtype for k,v in output_signature.items()},
        output_shapes={k:v.shape for k,v in output_signature.items()}
    )

  # Now that we have a list of client IDs and a function to generate a dataset
  # for each client, we can use from_clients_and_fn to create the federated
  # dataset.
  return tff.simulation.ClientData.from_clients_and_fn(
      client_ids=client_ids,
      create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
  )
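
The pandas steps inside `create_tf_dataset_for_client_fn` can be tried in isolation, without TensorFlow. This is a minimal sketch on a toy dataframe. The helper `records_for_client` and the data are hypothetical and only mirror the row selection, null filtering, and record conversion shown above.

```python
import pandas as pd

# Toy data: client 'A' has one row with a missing snippet.
toy_df = pd.DataFrame({
    'character': ['A', 'A', 'B', 'A'],
    'snippets': ['first line', None, 'other speaker', 'second line'],
})
data_cols = ['snippets']


def records_for_client(df, client_id):
    """Return one dict per non-null row belonging to `client_id`."""
    # Keep only the rows for this client.
    rows = df[df['character'] == client_id]
    # Drop rows in which any data column is null.
    rows = rows[rows[data_cols].notnull().all(axis=1)]
    # Keep the data columns and convert each row to a dict.
    return rows[data_cols].to_dict('records')


toy_client_ids = toy_df['character'].unique().tolist()
records_a = records_for_client(toy_df, 'A')
```

Each dictionary in `records_a` is what the generator would yield as one dataset element for that client.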

In [19]:
test_data = create_tff_dataset_for_csv_file(test_load_csv)
train_data = create_tff_dataset_for_csv_file(train_load_csv)

Each client dataset we just created is a sequence of maps from the key `snippets` to
string `Tensors`, one for each line spoken by a particular character in a
Shakespeare play.

We can get the data for a particular client by calling `create_tf_dataset_for_client` with that `client_id`. Note that in a real federated learning scenario
clients are never identified or tracked by IDs, but for simulation it is useful
to work with keyed datasets.

Let's take a look at lines spoken by Exton from *Pericles, Prince of Tyre*. We saw these lines printed above in the CSV file, so we can make sure they were loaded faithfully into the dataset.

In [20]:
# Here the play is "Pericles, Prince of Tyre" and the character is "Exton".
raw_example_train_dataset = train_data.create_tf_dataset_for_client(
    'PERICLES__PRINCE_OF_TYRE_EXTON')
# Each entry x is a dictionary with a single key 'snippets' which contains the
# text. If you import your own dataset, you may have more keys in each
# dictionary corresponding to different features.
for x in raw_example_train_dataset.take(2):
  print(x)

{'snippets': <tf.Tensor: shape=(), dtype=string, numpy=b"Both have I spill'd. O, would the deed were good!\nFor now the devil, that told me I did well,\nSays that this deed is chronicled in hell.\nThis dead King to the living King I'll bear.\nTake hence the rest, and give them burial here.       Exeunt\nGreat King, within this coffin I present">}
{'snippets': <tf.Tensor: shape=(), dtype=string, numpy=b"'Have I no friend will rid me of this living fear?'\nWas it not so?\n'Have I no friend?' quoth he. He spake it twice">}


To make sure the test data loaded correctly, we can look at some lines spoken by the King in *The Tragedy of King Lear*:

In [21]:
# Here the play is "The Tragedy of King Lear" and the character is "King".
raw_example_test_dataset = test_data.create_tf_dataset_for_client(
    'THE_TRAGEDY_OF_KING_LEAR_KING')
# Each entry x is a dictionary with a single key 'snippets' which contains the
# text. If you import your own dataset, you may have more keys in each
# dictionary corresponding to different features.
for x in raw_example_test_dataset.take(2):
  print(x)

{'snippets': <tf.Tensor: shape=(), dtype=string, numpy=b'Sir, I will pronounce your sentence: you shall fast a week'>}
{'snippets': <tf.Tensor: shape=(), dtype=string, numpy=b'Teach us, sweet madam, for our rude transgression'>}


Now that the data from the CSV file has been loaded into a `tff.simulation.ClientData`, we can use `tf.data.Dataset` transformations to prepare the data for training. Refer to the [text generation](federated_learning_for_text_generation.ipynb) tutorial for instructions on how to transform the data and train a model using Federated Learning.


## Modifying the Code to Work With Your Dataset

If you have a CSV file you would like to use for Federated Learning, you can try modifying the code above to load your file into a `tff.simulation.ClientData`. The code was written to be easily modifiable to work with different datasets, but there are a few things you will need to consider:

* `client_id_colname`: What column from your CSV file will be used as the `client_id`? This is a fundamental question about how you want to partition your data into client datasets. For realistic simulation of the challenges of Federated Learning, you may *not* want your data to be independent and identically distributed (IID) across clients.

* `output_signature`: What is the type of the elements of the `tf.data.Dataset` that will be generated for each client_id? In this example, we had only one data column, `snippets`, but your dataset may have multiple data columns corresponding to different features and labels, possibly with different types.

* `notnull_cols`: The current implementation filters out rows in which any of the data columns is null. Because pandas treats NaN as null, rows with NaN in those columns are already dropped. You might want to change this if null values are tolerable for some of the columns. Are there other conditions you want to filter on, such as empty strings?
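
To make these three choices concrete, here is a hedged sketch for an imagined reviews dataset with more than one data column. The column names (`user_id`, `text`, `rating`), the helper `rows_for_client`, and the data are all hypothetical. A matching `output_signature` would also need one `tf.TensorSpec` entry per data column with a suitable dtype, which is not shown here.

```python
import pandas as pd

# Hypothetical column choices for an imagined reviews dataset.
client_id_col = 'user_id'      # column used to partition rows into clients
data_cols = ['text', 'rating'] # columns that become dataset features
required_cols = ['text']       # only these columns must be non-null

reviews_df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u2'],
    'text': ['great', None, '', 'fine'],
    'rating': [5.0, 3.0, 1.0, None],
})


def rows_for_client(df, client_id):
    """Return one dict per usable row belonging to `client_id`."""
    rows = df[df[client_id_col] == client_id]
    # Null values are tolerated in `rating` but not in `text`.
    rows = rows[rows[required_cols].notnull().all(axis=1)]
    # Extra condition beyond null checks: drop rows with empty text.
    rows = rows[rows['text'].str.len() > 0]
    return rows[data_cols].to_dict('records')


u1_rows = rows_for_client(reviews_df, 'u1')
u2_rows = rows_for_client(reviews_df, 'u2')
```

In this sketch the row with missing text and the row with empty text are dropped, while the row with a missing rating is kept and carries NaN in that field.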
