License

```
Copyright (c) Facebook, Inc. and its affiliates.

This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
```

# Using CompilerGym environments with RLlib

In this notebook we will use [RLlib](https://docs.ray.io/en/master/rllib.html) to train an agent for CompilerGym's [LLVM environment](https://facebookresearch.github.io/CompilerGym/llvm/index.html). RLlib is a popular library for scalable reinforcement learning, built on [Ray](https://docs.ray.io/en/master/index.html). It provides distributed implementations of several standard reinforcement learning algorithms.

Our goal is not to produce the best agent, but to demonstrate how to integrate CompilerGym with RLlib. It will take about 20 minutes to work through. Let's get started!

## Installation

This notebook requires the `compiler_gym` and `ray` packages. With both installed, we'll begin by printing the versions of the libraries in use:

In [1]:
# Print the versions of the libraries that we are using:
import compiler_gym
import ray
import gym

print("compiler_gym version:", compiler_gym.__version__)
print("ray version:", ray.__version__)
print("gym version:", gym.__version__)
!python3 --version

compiler_gym version: 0.1.9
ray version: 1.4.1
gym version: 0.18.0
Python 3.8.10


## Defining an Environment

Next we will define the environment to use for our experiments. For the purposes of a simple demo we will apply two simplifying constraints to CompilerGym's LLVM environment:

1. We will use only a small subset of the command line flag action space.
2. We will clip the length of episodes to a maximum number of steps.

To keep things simple we will define a `make_env()` helper function to create our environment, and use the [compiler_gym.wrappers](https://facebookresearch.github.io/CompilerGym/compiler_gym/wrappers.html) API to implement these constraints. There is quite a lot going on in this cell, so be sure to read through the comments for an explanation of each step.

In [2]:
from ray import tune
from compiler_gym.wrappers import ConstrainedCommandline, TimeLimit

def make_env() -> compiler_gym.envs.CompilerEnv:
    """Make the reinforcement learning environment for this experiment."""
    # We will use LLVM as our base environment. Here we specify the observation
    # space from this paper: https://arxiv.org/pdf/2003.00671.pdf and the total
    # IR instruction count as our reward space, normalized against the 
    # performance of LLVM's -Oz policy.
    env = gym.make(
        "llvm-v0",
        observation_space="Autophase",
        reward_space="IrInstructionCountOz",
    )
    # Here we constrain the action space of the environment to use only a 
    # handful of command line flags from the full set. We do this to speed up
    # learning by pruning the action space by hand. This also limits the 
    # potential improvements that the agent can achieve compared to using the 
    # full action space.
    
    env = ConstrainedCommandline(env, flags=[
        "-break-crit-edges",
        "-early-cse-memssa",
        "-gvn-hoist",
        "-gvn",
        "-instcombine",
        "-instsimplify",
        "-jump-threading",
        "-loop-reduce",
        "-loop-rotate",
        "-loop-versioning",
        "-mem2reg",
        "-newgvn",
        "-reg2mem",
        "-simplifycfg",
        "-sroa",
    ])
    # Finally, we impose a time limit on the environment so that every episode
    # runs for 5 steps or fewer. This is because the environment's task is
    # continuous and no action is guaranteed to result in a terminal state.
    # Adding a time limit means we don't have to worry about learning when an
    # agent should stop, though again this limits the potential improvements
    # that the agent can achieve compared to using an unbounded maximum episode
    # length.
    env = TimeLimit(env, max_episode_steps=5)
    
    return env

In [3]:
# Let's create an environment and print a few attributes just to check that we 
# have everything set up the way that we would like.
with make_env() as env:
    print("Action space:", env.action_space)
    print("Observation space:", env.observation_space)
    print("Reward space:", env.reward_space)

Action space: Commandline([-break-crit-edges -early-cse-memssa -gvn-hoist -gvn -instcombine -instsimplify -jump-threading -loop-reduce -loop-rotate -loop-versioning -mem2reg -newgvn -reg2mem -simplifycfg -sroa])
Observation space: Box(0, 9223372036854775807, (56,), int64)
Reward space: IrInstructionCountOz
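
To make the effect of the episode length limit concrete, the sketch below applies the same idea to a toy environment that never terminates on its own. `CountingEnv` and `StepLimit` are hypothetical stand-ins written for this illustration; they only mimic the gym-style `reset()`/`step()` interface and are not the `TimeLimit` wrapper itself.

```python
class CountingEnv:
    """A toy environment whose episodes never end on their own."""

    def reset(self):
        self.total = 0
        return self.total

    def step(self, action):
        self.total += action
        # Returns (observation, reward, done, info); done is always False.
        return self.total, float(action), False, {}


class StepLimit:
    """Force done=True once max_episode_steps steps have been taken."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed_steps = 0

    def reset(self):
        self.elapsed_steps = 0
        return self.env.reset()

    def step(self, action):
        observation, reward, done, info = self.env.step(action)
        self.elapsed_steps += 1
        if self.elapsed_steps >= self.max_episode_steps:
            done = True
        return observation, reward, done, info


env = StepLimit(CountingEnv(), max_episode_steps=5)
env.reset()
done, steps = False, 0
while not done:
    _, _, done, _ = env.step(1)
    steps += 1
print("Episode length:", steps)
```

Without the wrapper the `while not done` loop would never exit, which is why the agent loop later in this notebook relies on the time limit.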


## Datasets

Now that we have an environment, we will need a set of programs to train on. In CompilerGym, these programs are called *benchmarks*. CompilerGym ships with [several sets of benchmarks](https://facebookresearch.github.io/CompilerGym/llvm/index.html#datasets). Here we will take a handful of benchmarks from the `npb-v0` dataset for training. We will then further divide this set into training and validation sets. We will use `chstone-v0` as a holdout test set.

In [7]:
from itertools import islice

with make_env() as env:
  # The two datasets we will be using:
  npb = env.datasets["npb-v0"]
  chstone = env.datasets["chstone-v0"]

  # Each dataset has a `benchmarks()` method that returns an iterator over the
  # benchmarks within the dataset. Here we will use iterator slicing to grab a
  # handful of benchmarks for training and validation.
  train_benchmarks = list(islice(npb.benchmarks(), 55))
  train_benchmarks, val_benchmarks = train_benchmarks[:50], train_benchmarks[50:]
  # We will use the entire chstone-v0 dataset for testing.
  test_benchmarks = list(chstone.benchmarks())

print("Number of benchmarks for training:", len(train_benchmarks))
print("Number of benchmarks for validation:", len(val_benchmarks))
print("Number of benchmarks for testing:", len(test_benchmarks))

Number of benchmarks for training: 50
Number of benchmarks for validation: 5
Number of benchmarks for testing: 12
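
The slicing pattern used above works on any iterator, including ones that are very large or unbounded, because `islice` stops consuming items once it has enough. Here is a small self-contained sketch of the same split; `fake_benchmarks()` and its URIs are hypothetical stand-ins for a dataset's `benchmarks()` iterator.

```python
from itertools import islice


def fake_benchmarks():
    """Stand-in for a dataset's benchmark iterator (hypothetical URIs)."""
    n = 0
    while True:  # Unbounded on purpose: islice() never exhausts it.
        yield f"benchmark://example-v0/{n}"
        n += 1


# Take only the first 55 items; the rest of the iterator is never consumed.
subset = list(islice(fake_benchmarks(), 55))

# Split the subset into 50 training and 5 validation benchmarks.
train, val = subset[:50], subset[50:]
print(len(train), len(val))
```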


## Registering the environment with RLlib

Now that we have our environment and training benchmarks, we can register the environment for use with RLlib. To do this we will define a second helper, `make_training_env()`, that uses the [CycleOverBenchmarks](https://facebookresearch.github.io/CompilerGym/compiler_gym/wrappers.html#compiler_gym.wrappers.CycleOverBenchmarks) wrapper to ensure that the environment uses all of the training benchmarks. We then call `tune.register_env()`, assigning the environment a name.

In [8]:
from compiler_gym.wrappers import CycleOverBenchmarks

def make_training_env(*args) -> compiler_gym.envs.CompilerEnv:
  """Make a reinforcement learning environment that cycles over the
  set of training benchmarks in use.
  """
  del args  # Unused env_config argument passed by ray
  return CycleOverBenchmarks(make_env(), train_benchmarks)

tune.register_env("compiler_gym", make_training_env)

In [9]:
# Let's cycle through a few calls to reset() to demonstrate that this environment
# selects a new benchmark for each episode.
with make_training_env() as env:
  env.reset()
  print(env.benchmark)
  env.reset()
  print(env.benchmark)
  env.reset()
  print(env.benchmark)

benchmark://npb-v0/1
benchmark://npb-v0/2
benchmark://npb-v0/3
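
The behavior shown above, where each call to `reset()` advances to the next benchmark and wraps around at the end of the list, can be sketched in a few lines. `CyclingEnv` is a hypothetical toy class written for this illustration and is not how `CycleOverBenchmarks` is implemented.

```python
from itertools import cycle


class CyclingEnv:
    """Toy environment that moves to the next benchmark on every reset()."""

    def __init__(self, benchmarks):
        # cycle() repeats the sequence indefinitely.
        self._benchmarks = cycle(benchmarks)
        self.benchmark = None

    def reset(self):
        self.benchmark = next(self._benchmarks)
        return self.benchmark


env = CyclingEnv(["a", "b", "c"])
seen = [env.reset() for _ in range(4)]
print(seen)  # The fourth reset wraps back around to "a".
```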


## Run the training loop

Now that we have the environment set up, let's run a training loop. Here we will use RLlib's [Proximal Policy Optimization](https://docs.ray.io/en/master/rllib-algorithms.html#ppo) implementation, and run a very short training loop purely for demonstration purposes.

In [10]:
import ray
from ray.rllib.agents.ppo import PPOTrainer

# (Re)Start the ray runtime.
if ray.is_initialized():
  ray.shutdown()
ray.init(include_dashboard=False, ignore_reinit_error=True)

tune.register_env("compiler_gym", make_training_env)

analysis = tune.run(
    PPOTrainer,
    checkpoint_at_end=True,
    stop={
        "episodes_total": 500,
    },
    config={
        "seed": None,
        "framework": "torch",
        "num_workers": 8,
        # Specify the environment to use, where "compiler_gym" is the name we 
        # passed to tune.register_env().
        "env": "compiler_gym",
        # Reduce the size of the batch/trajectory lengths to match our short 
        # training run.
        "rollout_fragment_length": 40,
        "train_batch_size": 40,
        "sgd_minibatch_size": 40,
    }
)

Trial name,status,loc
PPO_compiler_gym_25894_00000,PENDING,


[2m[36m(pid=5637)[0m 2021-08-08 17:01:15,505	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Result for PPO_compiler_gym_25894_00000:
  agent_timesteps_total: 40
  custom_metrics: {}
  date: 2021-08-08_17-01-20
  done: false
  episode_len_mean: 5.0
  episode_media: {}
  episode_reward_max: 0.9186046511627907
  episode_reward_mean: 0.8066860465116279
  episode_reward_min: 0.372093023255814
  episodes_this_iter: 8
  episodes_total: 8
  experiment_id: bca003b6cdbb4523933474353d3f5c1f
  hostname: phesse
  info:
    learner:
      default_policy:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.2
          cur_lr: 5.0e-05
          entropy: 2.6838157176971436
          entropy_coeff: 0.0
          kl: 0.024068277329206467
          policy_loss: -0.1702536791563034
          total_loss: -0.013393327593803406
          vf_explained_var: 0.1914040446281433
          vf_loss: 0.15204669535160065
    num_agent_steps_sampled: 40
    num_steps_sampled: 40
    num_steps_trained: 40
  iterations_since_restore: 1
  node_ip: 192.168.1.18
  num_healthy_workers:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_compiler_gym_25894_00000,RUNNING,192.168.1.18:5637,1,0.369661,40,0.806686,0.918605,0.372093,5


Result for PPO_compiler_gym_25894_00000:
  agent_timesteps_total: 600
  custom_metrics: {}
  date: 2021-08-08_17-01-25
  done: false
  episode_len_mean: 5.0
  episode_media: {}
  episode_reward_max: 13.431818181818183
  episode_reward_mean: 1.5202102148724421
  episode_reward_min: -0.39999999999999997
  episodes_this_iter: 8
  episodes_total: 120
  experiment_id: bca003b6cdbb4523933474353d3f5c1f
  hostname: phesse
  info:
    learner:
      default_policy:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.278125
          cur_lr: 5.0e-05
          entropy: 2.156273126602173
          entropy_coeff: 0.0
          kl: 0.019648782908916473
          policy_loss: -0.12628258764743805
          total_loss: 0.023693528026342392
          vf_explained_var: 0.7793610095977783
          vf_loss: 0.10521373897790909
    num_agent_steps_sampled: 600
    num_steps_sampled: 600
    num_steps_trained: 600
  iterations_since_restore: 15
  node_ip: 192.168.1.18
  num_he

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_compiler_gym_25894_00000,RUNNING,192.168.1.18:5637,15,5.10867,600,1.52021,13.4318,-0.4,5


Result for PPO_compiler_gym_25894_00000:
  agent_timesteps_total: 1200
  custom_metrics: {}
  date: 2021-08-08_17-01-30
  done: false
  episode_len_mean: 5.0
  episode_media: {}
  episode_reward_max: 148.0
  episode_reward_mean: 4.469694483576115
  episode_reward_min: -39.0
  episodes_this_iter: 8
  episodes_total: 240
  experiment_id: bca003b6cdbb4523933474353d3f5c1f
  hostname: phesse
  info:
    learner:
      default_policy:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 2.562890625
          cur_lr: 5.0e-05
          entropy: 2.0724945068359375
          entropy_coeff: 0.0
          kl: 0.006857914384454489
          policy_loss: -0.07182396203279495
          total_loss: -0.023833811283111572
          vf_explained_var: 0.6915101408958435
          vf_loss: 0.03041406348347664
    num_agent_steps_sampled: 1200
    num_steps_sampled: 1200
    num_steps_trained: 1200
  iterations_since_restore: 30
  node_ip: 192.168.1.18
  num_healthy_workers: 8
  o

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_compiler_gym_25894_00000,RUNNING,192.168.1.18:5637,30,10.0923,1200,4.46969,148,-39,5


Result for PPO_compiler_gym_25894_00000:
  agent_timesteps_total: 1840
  custom_metrics: {}
  date: 2021-08-08_17-01-36
  done: false
  episode_len_mean: 5.0
  episode_media: {}
  episode_reward_max: 139.0
  episode_reward_mean: 11.810474284824522
  episode_reward_min: 0.05555555555555555
  episodes_this_iter: 8
  episodes_total: 368
  experiment_id: bca003b6cdbb4523933474353d3f5c1f
  hostname: phesse
  info:
    learner:
      default_policy:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.4416259765625001
          cur_lr: 5.0e-05
          entropy: 1.6911900043487549
          entropy_coeff: 0.0
          kl: 0.022438909858465195
          policy_loss: -0.09711308777332306
          total_loss: 0.4777809977531433
          vf_explained_var: -0.6321806907653809
          vf_loss: 0.5425456166267395
    num_agent_steps_sampled: 1840
    num_steps_sampled: 1840
    num_steps_trained: 1840
  iterations_since_restore: 46
  node_ip: 192.168.1.18
  num_hea

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_compiler_gym_25894_00000,RUNNING,192.168.1.18:5637,46,15.0277,1840,11.8105,139,0.0555556,5


Result for PPO_compiler_gym_25894_00000:
  agent_timesteps_total: 2520
  custom_metrics: {}
  date: 2021-08-08_17-01-39
  done: true
  episode_len_mean: 5.0
  episode_media: {}
  episode_reward_max: 12.045454545454545
  episode_reward_mean: 1.8014903542005625
  episode_reward_min: 0.42118863049095606
  episodes_this_iter: 8
  episodes_total: 504
  experiment_id: bca003b6cdbb4523933474353d3f5c1f
  hostname: phesse
  info:
    learner:
      default_policy:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.3684184074401857
          cur_lr: 5.0e-05
          entropy: 1.7659261226654053
          entropy_coeff: 0.0
          kl: 0.010017195716500282
          policy_loss: -0.04969358816742897
          total_loss: 13.75989818572998
          vf_explained_var: 0.4776380658149719
          vf_loss: 13.79588508605957
    num_agent_steps_sampled: 2520
    num_steps_sampled: 2520
    num_steps_trained: 2520
  iterations_since_restore: 63
  node_ip: 192.168.1.18


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_compiler_gym_25894_00000,TERMINATED,,63,18.5488,2520,1.80149,12.0455,0.421189,5


2021-08-08 17:01:40,722	INFO tune.py:549 -- Total run time: 27.43 seconds (26.49 seconds for the tuning loop).
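
The counters in the final result can be checked by hand. The sketch below assumes that every episode runs for exactly 5 steps (as `episode_len_mean: 5.0` suggests) and that each training iteration consumes one batch of 40 steps, which is consistent with the `num_steps_sampled` values in the log.

```python
import math

train_batch_size = 40   # Steps consumed per training iteration.
episode_length = 5      # Every episode is clipped to 5 steps.
target_episodes = 500   # The "episodes_total" stopping criterion.

episodes_per_iteration = train_batch_size // episode_length
# Training stops at the first iteration that reaches the episode target.
iterations = math.ceil(target_episodes / episodes_per_iteration)
total_episodes = iterations * episodes_per_iteration
total_timesteps = iterations * train_batch_size

print(episodes_per_iteration, iterations, total_episodes, total_timesteps)
```

This gives 8 episodes per iteration, 63 iterations, 504 episodes, and 2520 timesteps, matching `episodes_this_iter`, `iterations_since_restore`, `episodes_total`, and `agent_timesteps_total` in the last result. The stopping criterion is only checked between iterations, which is why the run ends at 504 episodes rather than exactly 500.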


## Evaluate the agent

After running the training loop we can create a new agent that has exploration disabled, restore it from the training checkpoint, and then use it for running inference tests.

In [11]:
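agent = PPOTrainer(
    env="compiler_gym",
    config={
        "num_workers": 1,
        "seed": None,
        # Use the same framework as in training. A checkpoint written by a
        # torch policy may not restore cleanly into an agent built with the
        # default TensorFlow framework; that mismatch is the likely cause of
        # the AttributeError in the output recorded below.
        "framework": "torch",
        # For inference we disable the stochastic exploration that is used
        # during training.
        "explore": False,
    },
)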

# We only made a single checkpoint at the end of training, so restore that. In
# practice we may have many checkpoints that we will select from using 
# performance on the validation set.
checkpoint = analysis.get_best_checkpoint(
    metric="episode_reward_mean", 
    mode="max", 
    trial=analysis.trials[0]
)

agent.restore(checkpoint)

2021-08-08 17:08:08,912	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2021-08-08 17:08:08,913	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


AttributeError: 'list' object has no attribute 'items'

In [None]:
# Let's define a helper function to make it easy to evaluate the agent's
# performance on a set of benchmarks.

def run_agent_on_benchmarks(benchmarks):
  """Run agent on a list of benchmarks and return a list of cumulative rewards."""
  with make_env() as env:
    rewards = []
    for i, benchmark in enumerate(benchmarks, start=1):
        observation, done = env.reset(benchmark=benchmark), False
        while not done:
            action = agent.compute_action(observation)
            observation, _, done, _ = env.step(action)
        rewards.append(env.episode_reward)
        print(f"[{i}/{len(benchmarks)}] {env.state}")

  return rewards

# Evaluate agent performance on the validation set.
val_rewards = run_agent_on_benchmarks(val_benchmarks)

In [None]:
# Evaluate agent performance on the holdout test set.
test_rewards = run_agent_on_benchmarks(test_benchmarks)

In [None]:
# Finally, let's plot our results to see how we did!
from matplotlib import pyplot as plt

def plot_results(x, y, name, ax):
  plt.sca(ax)
  plt.bar(range(len(y)), y)
  plt.ylabel("Reward (higher is better)")
  plt.xticks(range(len(x)), x, rotation = 90)
  plt.title(f"Performance on {name} set")

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(13, 3)
plot_results(val_benchmarks, val_rewards, "val", ax1)
plot_results(test_benchmarks, test_rewards, "test", ax2)
plt.show()
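
Alongside the plots, it can be handy to reduce each list of rewards to a few summary numbers. The helper below is a minimal sketch: `summarize_rewards()` is a hypothetical name, and the values passed to it are placeholders for illustration only, not results produced by the agent. In the notebook you would pass `val_rewards` or `test_rewards` instead.

```python
from statistics import mean


def summarize_rewards(rewards):
    """Return the min, mean, and max of a list of cumulative episode rewards."""
    return {"min": min(rewards), "mean": mean(rewards), "max": max(rewards)}


# Placeholder values for illustration only; these are not agent results.
example_rewards = [0.5, 1.0, 1.5]
summary = summarize_rewards(example_rewards)
print(summary)
```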

That's it for this demonstration! Check out the [documentation site](https://facebookresearch.github.io/CompilerGym/) for more details and the API reference. If you encounter any problems, please [file an issue](https://github.com/facebookresearch/CompilerGym/issues).