##### Copyright 2021 Google LLC. All Rights Reserved.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# **RLDS: Performance best practices**
This colab provides some performance hints for RL dataset pipelines. If you are looking for an introduction to RLDS, see the [RLDS tutorial](https://colab.research.google.com/github/google-research/rlds/blob/main/rlds/examples/rlds_tutorial.ipynb) in Google Colab.


<table class="tfo-notebook-buttons" align="left">
  <td>
    <a href="https://colab.research.google.com/github/google-research/rlds/blob/main/rlds/examples/rlds_performance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Run In Google Colab"/></a>
  </td>
</table>

# Remarks

This Colab offers a few ideas for speeding up RL dataset transformations. The examples work for many datasets, but they are not generic, and performance can degrade on datasets with unusual characteristics. We advise using the existing [RLDS transformations](https://github.com/google-research/rlds/tree/main/rlds/transformations) when possible.


# Install Modules

In [2]:
!pip install rlds[tensorflow]
!pip install tfds-nightly --upgrade
!pip install envlogger
!apt-get install libgmp-dev

Collecting rlds
  Downloading rlds-0.1.1-py3-none-manylinux2010_x86_64.whl (35 kB)
Installing collected packages: rlds
Successfully installed rlds-0.1.1
Collecting tfds-nightly
  Downloading tfds_nightly-4.4.0.dev202109210107-py3-none-any.whl (4.0 MB)
Installing collected packages: tfds-nightly
Successfully installed tfds-nightly-4.4.0.dev202109210107
Collecting envlogger
  Downloading envlogger-1.0.5-cp37-cp37m-manylinux2010_x86_64.whl (5.5 MB)
Collecting mock
  Downloading mock-4.0.3-py3-none-any.whl (28 kB)
Collecting dm-env
  Downloading dm_env-1.5-py3-none-any.whl (26 kB)
Installing collected packages: mock, dm-env, envlogger
Successfully installed dm-env-1.5 envlogger-1.0.5 mock-4.0.3
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libgmpxx4ldbl

## Import Modules

In [3]:
import functools
import numpy as np
import rlds
import tensorflow as tf
import tensorflow_datasets as tfds

# Experimental dataset

Our performance experiments use the *d4rl_mujoco_halfcheetah* dataset. To keep the execution time of the benchmarked code short, we restrict the analysis to 50 episodes by default.

In [4]:
dataset = tfds.load('d4rl_mujoco_halfcheetah/v0-medium')['train'].take(50)

Downloading and preparing dataset 82.92 MiB (download: 82.92 MiB, generated: 98.43 MiB, total: 181.34 MiB) to /root/tensorflow_datasets/d4rl_mujoco_halfcheetah/v0-medium/1.1.0...



Dataset d4rl_mujoco_halfcheetah downloaded and prepared to /root/tensorflow_datasets/d4rl_mujoco_halfcheetah/v0-medium/1.1.0. Subsequent calls will reuse this data.


# Benchmarking

To measure the performance of our pipelines, we will use the following helper, which reports both wall-clock time and CPU time:

In [5]:
import time

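def benchmark(f, dataset):
  """Runs f(dataset) once and prints its result, wall-clock time and CPU time (seconds)."""
  start_time = time.monotonic()
  start_cpu = time.process_time()
  result = f(dataset)
  wall_time = time.monotonic() - start_time
  # process_time() adds up CPU time across all threads, so it can exceed wall time.
  cpu_time = time.process_time() - start_cpu
  print(f'Result: {result}, Execution time: {wall_time}, CPU: {cpu_time}')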
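
A single timing can be noisy, and the first call of a pipeline may include one-off setup costs. The following is a minimal sketch of a repeated variant that keeps the best wall-clock time. The name `benchmark_repeated` and its `repeats` parameter are hypothetical and not part of RLDS or of this colab; a plain Python callable stands in for a dataset pipeline.

```python
import time

def benchmark_repeated(f, data, repeats=3):
  """Hypothetical helper: runs f(data) several times, reports the best wall time."""
  best_wall = float('inf')
  result = None
  for _ in range(repeats):
    start = time.monotonic()
    result = f(data)
    best_wall = min(best_wall, time.monotonic() - start)
  print(f'Result: {result}, best wall time over {repeats} runs: {best_wall}')
  return result, best_wall

# Usage with the built-in sum standing in for a dataset pipeline.
result, best_wall = benchmark_repeated(sum, list(range(1000)))
```

Taking the minimum rather than the mean is one possible design choice; it reduces the influence of warm-up and background load, at the cost of hiding variance.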


# Size of the dataset

To get a sense of how big the dataset is, let's first compute the number of episodes and steps.

In [6]:
episodes = 0
steps = 0
for episode in dataset:
  episodes += 1
  steps += episode[rlds.STEPS].cardinality()

print(f'Episodes: {episodes}, steps: {steps}')

Episodes: 50, steps: 49950
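
The count above relies on `cardinality()` being known for the nested step datasets. For some pipelines (for example, after a `filter`) TensorFlow reports `tf.data.UNKNOWN_CARDINALITY` instead. As a sketch under that assumption, a fallback can count the elements explicitly. The helper name `count_elements` is hypothetical, and the sketch does not handle infinite datasets.

```python
import tensorflow as tf

def count_elements(ds):
  """Hypothetical helper: element count, counted explicitly if cardinality is unknown."""
  n = ds.cardinality()
  if n == tf.data.UNKNOWN_CARDINALITY:
    # Fall back to one pass over the dataset.
    n = ds.reduce(tf.constant(0, dtype=tf.int64), lambda count, _: count + 1)
  return int(n)

known = tf.data.Dataset.range(10)
unknown = tf.data.Dataset.range(10).filter(lambda x: x % 2 == 0)

known_count = count_elements(known)
unknown_count = count_elements(unknown)
print(known_count, unknown_count)
```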


# Computing the total reward

We will experiment with RL dataset pipeline performance by computing the sum of the step rewards over all episodes of the example dataset. The starting point is a simple Python double loop over episodes and steps:

In [7]:
def compute_return(episode_dataset):
  result = 0
  for episode in episode_dataset:
    for step in episode[rlds.STEPS]:
      result += step[rlds.REWARD]
  return result

benchmark(compute_return, dataset)

Result: 189805.9375, Execution time: 11.76337768500025, CPU: 10.168450590999981


# Prefetching
 
The double loop above is very simple, yet its execution time is significant given the total number of steps in the dataset (about 12 seconds for 49,950 steps). One might expect that the slowness comes from retrieving elements from the dataset. If so, prefetching the dataset could help.



In [8]:
def compute_return(episode_dataset):
  result = 0
  for episode in episode_dataset.prefetch(2):
    for step in episode[rlds.STEPS].prefetch(2):
      result += step[rlds.REWARD]
  return result

benchmark(compute_return, dataset)

Result: 189805.9375, Execution time: 20.498231358999874, CPU: 15.270387427999992


It turns out, however, that not only does CPU time increase because of the additional work, but execution time increases as well. In our dataset, per-step information is small, and the overhead of thread synchronization exceeds the latency of retrieving dataset elements synchronously. The main cost at this point is iterating over the dataset directly from Python. The Python loop can be replaced with a [tf.data.Dataset.reduce](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#reduce) operation.
 



In [9]:
def episode_return_sum(episode):
  return episode[rlds.STEPS].reduce(np.float32(0), lambda x, step: step[rlds.REWARD] + x)

def compute_return(episode_dataset):
  return episode_dataset.reduce(np.float32(0), lambda x, episode: episode_return_sum(episode) + x)

benchmark(compute_return, dataset)

Result: 189806.734375, Execution time: 2.5693010330001016, CPU: 2.9349800629999834


This significantly improves execution time and reduces CPU usage. With the previous bottleneck eliminated, prefetching might help at this point. Try adding prefetching to see what the outcome is.
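
One way to set up that experiment is sketched below on a small synthetic dataset, so that it runs without downloading anything. The string keys `'steps'` and `'reward'` are assumed to mirror `rlds.STEPS` and `rlds.REWARD`, `make_episodes` and `compute_return_prefetched` are hypothetical names, and `tf.data.AUTOTUNE` is just one possible buffer size. Whether prefetching helps on the real dataset has to be measured.

```python
import tensorflow as tf

def make_episodes(num_episodes=5, steps_per_episode=20):
  """Builds a small synthetic episode dataset; every step has reward 1.0."""
  def make_episode(_):
    steps = tf.data.Dataset.from_tensor_slices(
        {'reward': tf.ones([steps_per_episode], dtype=tf.float32)})
    return {'steps': steps}
  return tf.data.Dataset.range(num_episodes).map(make_episode)

def episode_return_sum(episode):
  # Prefetch the steps of one episode, then sum their rewards.
  return episode['steps'].prefetch(tf.data.AUTOTUNE).reduce(
      tf.constant(0.0), lambda total, step: total + step['reward'])

def compute_return_prefetched(episode_dataset):
  # Prefetch episodes, then add up the per-episode sums.
  return episode_dataset.prefetch(tf.data.AUTOTUNE).reduce(
      tf.constant(0.0),
      lambda total, episode: total + episode_return_sum(episode))

total = compute_return_prefetched(make_episodes())
print(float(total))
```

Passing `compute_return_prefetched` and the real `dataset` to `benchmark` would give timings comparable to the ones above.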

# Using built-in methods whenever possible
 
Executing custom lambda functions on many elements of a [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) is expensive compared to using built-in methods, so avoiding them whenever possible benefits performance. In our example, we can replace the outer *reduce* with a [tf.data.Dataset.flat_map](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map), which lets us apply a simpler lambda that can be executed more efficiently.
 



In [10]:
def compute_return(episode_dataset):
  return episode_dataset.flat_map(lambda x: x[rlds.STEPS]).reduce(np.float32(0), lambda x, step: step[rlds.REWARD] + x)

benchmark(compute_return, dataset)

Result: 189805.9375, Execution time: 2.0682176840000466, CPU: 2.899168076999956
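
The totals printed so far are not bit-identical: the nested *reduce* reports 189806.734375, while the Python loop and the *flat_map* version report 189805.9375. A plausible explanation is float32 rounding, since the nested version adds per-episode subtotals while the other two keep one running total over all steps. The NumPy sketch below illustrates the effect on hypothetical random rewards (50 episodes of 999 steps, not the real data).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for per-step rewards: 50 episodes of 999 steps each.
rewards = rng.normal(loc=4.0, scale=1.0, size=(50, 999)).astype(np.float32)

# Order 1: a single running float32 total over all steps.
flat_total = np.float32(0)
for r in rewards.ravel():
  flat_total = np.float32(flat_total + r)

# Order 2: per-episode float32 totals first, then the sum of those totals.
nested_total = np.float32(0)
for episode in rewards:
  episode_total = np.float32(0)
  for r in episode:
    episode_total = np.float32(episode_total + r)
  nested_total = np.float32(nested_total + episode_total)

print(flat_total, nested_total)
```

Both orders are valid; the two totals agree to within float32 rounding error but need not match exactly.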


# Eliminating lambda calls with batching
 
As elements in many RLDS datasets are small, the cost of invoking *lambdas* exceeds the cost of additional data copies and moves. It is therefore worthwhile to reduce the number of *lambda* calls using [batching](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch).
 



In [11]:
def compute_return(episode_dataset):
  return episode_dataset.flat_map(lambda x: x[rlds.STEPS]).batch(100).reduce(np.float32(0), lambda x, step: tf.math.reduce_sum(step[rlds.REWARD]) + x)

benchmark(compute_return, dataset)

Result: 189806.75, Execution time: 0.34771978399976433, CPU: 0.36866338800001586
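
The step count of a dataset is usually not a multiple of the batch size. By default `batch` keeps the final, smaller batch, so the batched sum still covers every step. The sketch below checks this on a hypothetical flat dataset of 250 steps with reward 1.0; the key `'reward'` is assumed to mirror `rlds.REWARD`, and `batched_return` is a hypothetical name.

```python
import tensorflow as tf

# Hypothetical flat step dataset: 250 steps, each with reward 1.0.
steps = tf.data.Dataset.from_tensor_slices(
    {'reward': tf.ones([250], dtype=tf.float32)})

def batched_return(step_dataset, batch_size):
  # One lambda call per batch instead of one per step.
  return step_dataset.batch(batch_size).reduce(
      tf.constant(0.0),
      lambda total, batch: total + tf.math.reduce_sum(batch['reward']))

# 250 is not a multiple of 100, so the last batch holds the remaining steps.
batch_sizes = [int(b['reward'].shape[0]) for b in steps.batch(100)]
total = batched_return(steps, 100)
print(batch_sizes, float(total))
```

Passing `drop_remainder=True` to `batch` would discard that final partial batch and change the total, so it is left at its default here.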


# Vectorized transformations

The examples analyzed so far focused on computing aggregated statistics for a given dataset. Sometimes custom per-step modifications of the dataset are required instead. For that purpose, RLDS provides the *map_nested_steps* operation, which maintains the episodic structure. In this example, we will implement a simple transformation ourselves using [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) operators. Let's implement a transformation that turns a given episode dataset into a collection of steps with doubled reward values.

In [12]:
def double_reward(step):
  step[rlds.REWARD] *= 2
  return step

double_reward_dataset = dataset.flat_map(lambda x: x[rlds.STEPS]).map(double_reward)

Let's now measure the performance of the new dataset:

In [13]:
def compute_return(step_dataset):
  return step_dataset.batch(100).reduce(np.float32(0), lambda x, step: tf.math.reduce_sum(step[rlds.REWARD]) + x)

benchmark(compute_return, double_reward_dataset)

Result: 379613.5, Execution time: 1.55345633800016, CPU: 2.552577060000033


As in the previous examples, the main bottleneck is the per-step call of the *double_reward* function. We can reduce that overhead by first batching multiple steps, then applying a vectorized version of *double_reward*, and finally un-batching the result.

In [14]:
def vectorized_double_reward(steps):
  return tf.vectorized_map(double_reward, steps)

double_reward_dataset = dataset.flat_map(lambda x: x[rlds.STEPS]).batch(100).map(vectorized_double_reward).unbatch()

benchmark(compute_return, double_reward_dataset)

Result: 379613.5, Execution time: 0.6719628939999893, CPU: 0.9763153779999811


Using [tf.vectorized_map](https://www.tensorflow.org/api_docs/python/tf/vectorized_map) results in a significantly faster implementation: execution time drops from about 1.55 s to about 0.67 s.
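
As a side note, doubling a reward is an elementwise operation, so this particular function should also accept a whole batch directly. `tf.vectorized_map` matters more for functions written for a single step that would not accept a leading batch dimension. The sketch below compares both routes on a hypothetical ten-step dataset; the key `'reward'` is assumed to mirror `rlds.REWARD`, and this `double_reward` returns a new dict rather than modifying its input.

```python
import tensorflow as tf

def double_reward(step):
  # Returns a new dict instead of mutating the input.
  return {**step, 'reward': step['reward'] * 2}

# Hypothetical flat step dataset with rewards 0.0, 1.0, ..., 9.0.
steps = tf.data.Dataset.from_tensor_slices(
    {'reward': tf.range(10, dtype=tf.float32)})

# Route 1: vectorize the per-step function over each batch.
via_vectorized_map = steps.batch(4).map(
    lambda batch: tf.vectorized_map(double_reward, batch)).unbatch()

# Route 2: call the function on the batched dict directly.
via_direct_call = steps.batch(4).map(double_reward).unbatch()

a = [float(s['reward']) for s in via_vectorized_map]
b = [float(s['reward']) for s in via_direct_call]
print(a == b)
```

Which route is faster on a real dataset is not shown here; timing both with `benchmark` would settle it.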
