# Crossing the Freeway street with Proximal Policy Optimization and Random Network Distillation

This Jupyter notebook contains the main routine used to train an agent on the Atari game Freeway, together with the experiment on Venture.

The directory also contains two modules:
- **agent.py**: the code related to the agent, including action selection in the environment, network creation and optimization.
- **utilities.py**: four helper classes:
  - EnvWrapper: a simple wrapper that manages the input dimensions.
  - Normalizer: a wrapper around Gym's RunningMeanStd class that normalizes the states and the rewards in the RND setting.
  - OptimizationBatch: collects the rollout data and creates the minibatches during the optimization phase.
  - InfoWriter: writes the TensorBoard logs.
  
All the graphs shown in the report are available in the directory "./logs" and can be visualized with TensorBoard. The same directory also contains the graphs of the entropy and of the loss functions, which were used in the debugging phase.
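To give an idea of what the Normalizer listed above relies on, the following is a minimal NumPy sketch of a running mean and variance tracker that merges batch statistics incrementally. It is a simplified stand-in written for illustration, not the code in utilities.py nor Gym's class; the names `RunningStats` and `normalize`, and the clipping range of 5, are assumptions.

```python
import numpy as np


class RunningStats:
    """Hypothetical stand-in for a running mean/variance tracker."""

    def __init__(self, shape=(), epsilon=1e-4):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = epsilon  # small prior count avoids division by zero

    def update(self, batch):
        """Merge the statistics of a batch (first axis = samples)."""
        batch = np.asarray(batch, dtype=np.float64)
        b_mean, b_var, b_count = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_count
        new_mean = self.mean + delta * b_count / total
        # Combine the two sums of squared deviations (parallel variance update)
        m2 = self.var * self.count + b_var * b_count + delta**2 * self.count * b_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, x, clip=5.0):
        """Standardize x with the running statistics, then clip it."""
        return np.clip((x - self.mean) / np.sqrt(self.var + 1e-8), -clip, clip)


rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=3.0, size=(10_000, 3))
stats = RunningStats(shape=(3,))
for chunk in np.array_split(data, 10):
    stats.update(chunk)  # statistics are refined chunk by chunk
normalized = stats.normalize(data)
```

After all chunks have been merged, the running statistics should closely match the mean and variance computed on the whole dataset at once.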

In [None]:
%load_ext tensorboard

## Note: code transparency
To ensure complete transparency about the following code, here is a list of the material that inspired its implementation:
- [CleanRL RND implementation](https://docs.cleanrl.dev/rl-algorithms/ppo-rnd/) 
  - This code was consulted for inspiration on how to implement the RND algorithm. In particular, it influenced the choice to use EnvPool and Gym's normalization class [RunningMeanStd](https://gymnasium.farama.org/main/_modules/gymnasium/wrappers/normalize/); CleanRL in turn follows the original RND code, whose authors used the same class.
- [Machine Learning with Phil - Proximal Policy Optimization (PPO) | Full PPO Tutorial](https://www.youtube.com/watch?v=hlv79rcHws0)
  - This helpful video was consulted to verify the correctness of the implemented code during the debugging phase.
- Various other articles found online, none of which had a significant influence on this work.

It is important to note that no lines of code were directly copied from these resources. They primarily served as a reference to provide a "direction" during the implementation and, in particular, during the debugging phase.

## Settings and imports

It is possible to load and train on every Atari game available in the EnvPool collection.
The Freeway experiment with the RND bonus is performed using the PPO hyperparameters (rnd_hyperparms=False) with the exploration bonus activated (rnd_bonus=True).

In [None]:
log                 = True
seeds               = [42]
name                = "Freeway-v5"
log_info            = ""

# Activate the exploration bonus 
rnd_bonus           = True

# Select the RND hyperparameters (True) or the PPO ones (False)
# This RND implementation uses only 32 parallel environments (instead of 128) and 8000 rollouts (instead of 30000) because of hardware and time constraints.
rnd_hyperparms      = False

# RND settings: (~131 million frames => 8000 rollouts * 128 steps * 32 actors * 4 frames per step)
# PPO settings: (~41 million frames => 10000 rollouts * 128 steps * 8 actors * 4 frames per step)
tot_rlls            = 8000 if rnd_hyperparms else 10000

# Optional early stop: move to the next seed if, after thresh_next_seed rollouts, the mean reward is still <= thresh_reward
thresh_next_seed    = 30000
thresh_reward       = 5
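The frame budgets quoted in the comments above can be checked with a few lines of arithmetic. The helper name `total_frames` is hypothetical, and the sketch assumes that the factor 4 is the number of emulator frames consumed per agent step.

```python
def total_frames(rollouts, steps_per_rollout, actors, frames_per_step=4):
    """Emulator frames consumed by a run (assumes 4 frames per agent step)."""
    return rollouts * steps_per_rollout * actors * frames_per_step


rnd_frames = total_frames(8000, 128, 32)   # RND settings
ppo_frames = total_frames(10000, 128, 8)   # PPO settings
print(f"RND: {rnd_frames / 1e6:.1f}M frames, PPO: {ppo_frames / 1e6:.1f}M frames")
# prints "RND: 131.1M frames, PPO: 41.0M frames"
```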

In [None]:
import os
#os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

In [None]:
from tensorflow     import keras
import envpool
import numpy        as np
import tensorflow   as tf
from utilities      import *
from tqdm.notebook  import tqdm 
from agent          import Agent
from datetime       import datetime

### Hyperparameters

In [None]:
def set_seed(seed):
    tf.keras.utils.set_random_seed(seed=seed)

In [None]:
steps_rll       = 128    
num_envs        = 32    if rnd_hyperparms else 8
num_minib       = 4 
steps_init      = num_envs * steps_rll
steps_opt       = 4     if rnd_hyperparms else 3

############################################
ext_rwd_coeff   = 2 
int_rwd_coeff   = 1 
lr              = 1e-4  if rnd_hyperparms else 2.5e-4 
optimizer       = keras.optimizers.legacy.Adam(learning_rate=lr)
gae_lambda      = 0.95 
entropy_coeff   = 0.001 if rnd_hyperparms else 0.01
ext_rwd_gamma   = 0.999 if rnd_hyperparms else 0.99
int_rwd_gamma   = 0.99
clip_epsilon    = 0.1
vf_coeff        = 1 


############################################
# An episode lasts at most 18000 frames; every step covers 4 frames
max_episode_steps = 18000 // 4 
max_steps       = 10e6 // (steps_rll * num_envs)
batch_size  = steps_rll * num_envs
minib_size  = batch_size // num_minib

############################################
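As an illustration of how `batch_size` and `minib_size` relate, the sketch below splits one rollout batch into shuffled minibatches of indices. Shuffling once per optimization epoch is an assumption about how OptimizationBatch behaves (it is the usual PPO practice), not a description of the actual class.

```python
import numpy as np

steps_rll, num_envs, num_minib = 128, 8, 4   # PPO settings
batch_size = steps_rll * num_envs            # 1024 transitions per rollout
minib_size = batch_size // num_minib         # 256 transitions per minibatch

rng = np.random.default_rng(0)
indices = rng.permutation(batch_size)        # shuffle once per optimization epoch
minibatches = [indices[start:start + minib_size]
               for start in range(0, batch_size, minib_size)]
```

Every transition of the rollout ends up in exactly one minibatch, so each optimization epoch visits the whole batch once.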

### Environment settings

EnvPool is used to manage multiple parallel environments:
https://envpool.readthedocs.io/en/latest/env/atari.html

In [None]:
def get_env():
    parallel_env = envpool.make(name,
                        env_type    = "gymnasium",
                        num_envs    = num_envs,
                        seed        = 42,    # The default envpool seed
                        frame_skip  = 4,
                        img_height  = 84,
                        img_width   = 84,
                        stack_num   = 4,
                        gray_scale  = True,
                        reward_clip = True,
                        max_episode_steps           = max_episode_steps if rnd_hyperparms else 27000,
                        repeat_action_probability   = 0.25              if rnd_hyperparms else 0,
                        )

    return parallel_env

### Resources initialization

Create the agent and initialize the optimization batch and the normalizers.

In [None]:
def init_resources():
        global running_int_returns, int_returns_continous
        parallel_env = get_env()
        env = EnvWrapper(parallel_env, name) 

        action_space_n  = env.action_n
        state_shape     = env.observation_shape

        policy_inshape  = (*state_shape, 1)
        policy_outshape = action_space_n
        
        rnd_inshape     = (1, *policy_inshape[1:]) 
        rnd_outshape    = policy_outshape

        player = Agent(
                env             = env,
                n_envs          = num_envs,
                action_space_n  = action_space_n,
                int_rwd_gamma   = int_rwd_gamma,
                ext_rwd_gamma   = ext_rwd_gamma,
                gae_lambda      = gae_lambda,
                batch_size      = batch_size,
                minib_s         = minib_size,
                rnd_bonus       = rnd_bonus,
                policy_inshape  = policy_inshape,
                policy_outshape = policy_outshape,
                rnd_inshape     = rnd_inshape,
                rnd_outshape    = rnd_outshape,
                clip_epsilon    = clip_epsilon,
                entropy_coeff   = entropy_coeff,
                vf_coeff        = vf_coeff,
                optimizer       = optimizer,
                verbose         = False
                )

        batch = OptimizationBatch(steps_rll, num_envs, policy_inshape)
        
        
        running_int_returns = np.zeros((steps_rll, num_envs))
        int_returns_continous = np.zeros((num_envs))

        norm_obs = Normalizer("state", rnd_inshape)
        norm_rwd = Normalizer("rwd", (num_envs))

        return player, batch, norm_obs, norm_rwd

In [None]:
if log:
    date_str    = datetime.now().strftime("%Y%m%d-%H%M%S")
    log_dir     = "logs/scalars/" + name + "/" + date_str + f"_{num_envs}"

In [None]:
seed = seeds.pop()
set_seed(seed)
player, batch, norm_obs, norm_rwd = init_resources()
if log:
    info_writer = InfoWriter(log_dir, num_envs)
    info_writer.log_seed(seed, log_info) 

### Play for timesteps 

In [None]:
def play_for_steps(state, tot_steps, random = False):

    ended_this_play = 0

    for _ in range(1, tot_steps + 1):
        next_state, ext_rwd, done, action, action_logit, ext_value, int_value = player.play_one_step(state, random=random)
        
        if not random:
            if rnd_bonus:
                norm_state  = norm_obs(next_state)
                int_rwd     = player.compute_int_rwd(norm_state)
            else: int_rwd = np.zeros_like(ext_rwd)
            
            ended = np.nonzero(done)[0]
            if np.size(ended) > 0:
                next_state[ended] = player.env.reset_ended(ended)      
            
            # Add experience to the buffer queue
            batch.append(state, action, action_logit, ext_rwd, int_rwd, ext_value, int_value, done)

            if log:
                info_writer.update_ext_rewards(ext_rwd, int_rwd)
                for idx in ended:
                    ended_this_play += 1
                    info_writer.update_ended(idx, ended_this_play)

        else: 
            # Update the observation normalizer during the random steps
            norm_obs.update(state)

        state = next_state

    if log and ended_this_play > 0:
        info_writer.update_last_reward()                  
    return state
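The intrinsic reward computed in the loop above is the RND prediction error: a predictor network is trained to match the output of a fixed, randomly initialized target network, and states where it still fails are treated as novel. The following is a toy NumPy sketch of that idea with linear "networks"; it is not the code in agent.py, and all names (`w_target`, `w_pred`, `intrinsic_reward`, `distill_step`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim = 16, 8

# Fixed, randomly initialized target network (never trained)
w_target = rng.normal(size=(obs_dim, feat_dim))
# Predictor network, trained to imitate the target on visited states
w_pred = rng.normal(size=(obs_dim, feat_dim))


def intrinsic_reward(norm_states, w_pred):
    """Per-state mean squared error between predictor and target features."""
    error = norm_states @ w_pred - norm_states @ w_target
    return np.mean(error**2, axis=-1)


def distill_step(norm_states, w_pred, lr=0.5):
    """One gradient-descent step on the mean squared distillation loss."""
    error = norm_states @ w_pred - norm_states @ w_target
    grad = 2.0 * norm_states.T @ error / error.size
    return w_pred - lr * grad


states = rng.normal(size=(32, obs_dim))   # stand-in for normalized observations
before = intrinsic_reward(states, w_pred).mean()
for _ in range(200):
    w_pred = distill_step(states, w_pred)
after = intrinsic_reward(states, w_pred).mean()
```

The bonus shrinks on states the predictor has been trained on, which is what pushes the agent towards states it has not visited yet.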

### Intrinsic reward normalization

In [None]:
def discount_int_rwds(int_rwds):
    global running_int_returns, int_returns_continous

    for step, rwd in reversed(list(enumerate(int_rwds))):
        running_int_returns[step] = rwd + int_rwd_gamma * int_returns_continous
        int_returns_continous = running_int_returns[step]

    return running_int_returns


In [None]:
def normalize_int_rwds(int_rwds):
    # Normalize the intrinsic rewards with a running estimate of the intrinsic returns, carried over across rollouts
    updated_int_returns = discount_int_rwds(int_rwds)
    
    # Update norm rwd params
    norm_rwd.update_reward(updated_int_returns)

    # Normalize int rwds
    norm_int_rwds = norm_rwd(int_rwds)
    return norm_int_rwds
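The two functions above can be summarized without globals as follows. This sketch mirrors the reverse-order accumulation of `discount_int_rwds`, but it makes two simplifying assumptions: the standard deviation is taken over a single rollout instead of a running estimate, and the rewards are only divided by it (no mean subtraction), as described in the RND paper. The function name `discounted_running_returns` is hypothetical.

```python
import numpy as np


def discounted_running_returns(int_rwds, carry, gamma=0.99):
    """Accumulate discounted intrinsic returns over a (steps, envs) rollout.

    carry has shape (envs,) and is the accumulator left over from the
    previous rollout, since the intrinsic stream is not reset at episode end.
    """
    returns = np.zeros_like(int_rwds, dtype=np.float64)
    for step in reversed(range(len(int_rwds))):
        carry = int_rwds[step] + gamma * carry
        returns[step] = carry
    return returns, carry


rng = np.random.default_rng(0)
int_rwds = rng.uniform(0.0, 1.0, size=(128, 8))
returns, carry = discounted_running_returns(int_rwds, carry=np.zeros(8))

# Simplification: std of this rollout only, instead of a running estimate
norm_int_rwds = int_rwds / (returns.std() + 1e-8)
```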

### Train routine

In [None]:
def train_networks(step):
    # info_writer only exists when logging is enabled
    if log:
        info_writer.reset_train_infos()

    for opt in range(1, steps_opt + 1):
        # Optimize theta_pi wrt PPO loss on batch, R and A using Adam
        sums_info = player.training_step_ppo(num_minib, batch)
        
        if rnd_bonus:
            # Optimize theta_f^ wrt distillation loss on batch using Adam        
            rnd_train_states    = norm_obs(batch.states)
            rnd_loss            = player.training_step_distill(num_minib, rnd_train_states)
        else: rnd_loss = 0
        
        sums_info = sums_info + (rnd_loss,)

        if log:
            info_writer.incremental_mean_train(opt, sums_info)
            
    if log:
        info_writer.write_infos(step)
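For reference, the PPO loss optimized in the training step is built around the clipped surrogate objective. The NumPy sketch below shows only that policy term (no value or entropy terms) and is not the code of `training_step_ppo`; the function name `ppo_clip_loss` is hypothetical.

```python
import numpy as np


def ppo_clip_loss(new_logp, old_logp, advantages, clip_epsilon=0.1):
    """Clipped surrogate objective, returned as a loss to minimize."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    # Take the pessimistic (smaller) objective for every sample
    return -np.mean(np.minimum(unclipped, clipped))


old_logp = np.log(np.array([0.5, 0.5, 0.5]))
new_logp = np.log(np.array([0.5, 0.8, 0.2]))
advantages = np.array([1.0, 1.0, -1.0])
loss = ppo_clip_loss(new_logp, old_logp, advantages)
```

In this example the probability ratios are 1.0, 1.6 and 0.4. The second sample is clipped to 1.1, and the third contributes the more pessimistic clipped value 0.9 * (-1), so the policy gains nothing from moving far away from the old one.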

### Main routine

The clip epsilon and the learning rate are linearly annealed only with the PPO settings.

In [None]:
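def scale_alpha(step):
    # Linear annealing towards zero. The clamp is needed because tot_rlls can exceed
    # max_steps, which would otherwise make the learning rate and clip epsilon negative
    alpha = max(0.0, (max_steps - step) / max_steps)
    player.clip_epsilon     = clip_epsilon * alpha
    player.optimizer.learning_rate = lr * alpha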
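The schedule can be inspected in isolation with the sketch below, which uses the PPO values from the hyperparameter cell. The helper name `linear_alpha` is hypothetical; the clamp keeps the scaled values non-negative when the rollout index exceeds `max_steps`, which happens here because 10000 rollouts are run while `max_steps` evaluates to 9765.

```python
def linear_alpha(step, max_steps):
    """Fraction of the schedule still remaining, clamped to [0, 1]."""
    return min(1.0, max(0.0, (max_steps - step) / max_steps))


lr, clip_epsilon = 2.5e-4, 0.1      # PPO settings
max_steps = 10e6 // (128 * 8)       # 9765.0 rollouts with 8 environments

for step in (1, 5000, 9765, 10000):
    alpha = linear_alpha(step, max_steps)
    print(step, lr * alpha, clip_epsilon * alpha)
```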

In [None]:
def check_end_conditions(step):
    # Checkpoint names (date_str) and the mean reward (info_writer) both require logging
    if not log:
        return False

    if (step % 500 == 0):
        print("Saving checkpoints...")
        player.save_checkpoints(date_str)

    reward_mean = info_writer.get("ext_rwd")
    if step >= thresh_next_seed and reward_mean <= thresh_reward:
        print("Moving to the next seed")
        return True

    return False

In [None]:
def ppo_rnd_algorithm():
    
    state = player.reset_env()
    if rnd_bonus:
        # Initialize normalization parameters playing randomly
        state = play_for_steps(state, steps_init, random=True)
        
    for step in tqdm(range(1, tot_rlls + 1)):
        
        # Reset the batch
        batch.reset()

        # Play steps_rll steps and collect the data in the batch
        next_state = play_for_steps(state, steps_rll)
        _ , next_ext_value, next_int_value = player._forward_policy(next_state)
        state = next_state

        # Compute returns and advantages for extrinsic and intrinsic rewards
        advs_ext, returns_ext = player.calculate_advs_and_returns("ext", batch.ext_rwds, batch.ext_values, next_ext_value, batch.dones)
        
        if rnd_bonus:
            batch.int_rwds = normalize_int_rwds(batch.int_rwds)
            if log:
                info_writer.update_int_rewards(batch.int_rwds)
            advs_int, returns_int = player.calculate_advs_and_returns("int", batch.int_rwds, batch.int_values, next_int_value, np.zeros(batch.values_shape))       
            # Combine advs and rewards
            advs_combined = ext_rwd_coeff * advs_ext + int_rwd_coeff * advs_int
        else: 
            returns_int = np.zeros_like(returns_ext)
            advs_combined = advs_ext

        if log:
            info_writer.update_advs_returns(advs_combined, returns_ext, returns_int)

        # Add the calculated advantages and returns to the batch and resize the batch 
        batch.add_advs_returns(advs_combined, returns_ext, returns_int)
        batch.resize_for_minibatch()

        if not rnd_hyperparms:
            scale_alpha(step)

        if rnd_bonus:
            # Update obs normalization using the batch
            norm_obs.update(batch.states)
        
        # Perform the training steps
        train_networks(step)

        if check_end_conditions(step):
            break   
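The advantage computation used in the main routine can be sketched with Generalized Advantage Estimation on a (steps, envs) rollout. This is an illustration under stated assumptions, not the code of `calculate_advs_and_returns`: the function name `gae` is hypothetical, `dones[t]` is assumed to mark an episode ending at step t, and the data is random.

```python
import numpy as np


def gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a (steps, envs) rollout."""
    steps = len(rewards)
    advs = np.zeros_like(rewards, dtype=np.float64)
    last_adv = np.zeros_like(next_value, dtype=np.float64)
    for t in reversed(range(steps)):
        not_done = 1.0 - dones[t]
        value_tp1 = next_value if t == steps - 1 else values[t + 1]
        delta = rewards[t] + gamma * value_tp1 * not_done - values[t]
        last_adv = delta + gamma * lam * not_done * last_adv
        advs[t] = last_adv
    return advs, advs + values  # advantages and returns


rng = np.random.default_rng(0)
steps, envs = 128, 8
ext_rwds, int_rwds = rng.random((steps, envs)), rng.random((steps, envs))
ext_values, int_values = rng.random((steps, envs)), rng.random((steps, envs))
dones = (rng.random((steps, envs)) < 0.01).astype(np.float64)
next_ext, next_int = rng.random(envs), rng.random(envs)

advs_ext, returns_ext = gae(ext_rwds, ext_values, next_ext, dones)
# The intrinsic stream is treated as non-episodic: dones are all zero
advs_int, returns_int = gae(int_rwds, int_values, next_int, np.zeros_like(dones))
advs_combined = 2 * advs_ext + 1 * advs_int   # ext_rwd_coeff = 2, int_rwd_coeff = 1
```

Passing zeros as `dones` for the intrinsic stream lets the novelty signal propagate across episode boundaries, while the extrinsic stream is cut at every episode end.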

In [None]:
# Can be manually used to restore a checkpoint

def restore_session():
    date = "2023_06_23"
    player.restore_checkpoints(date_str = date)

In [None]:
print("##################### Starting main routine #####################")
print(f"Total steps:\t\t{tot_rlls} ")
print(f"Logging:\t\t{log}")
print(f"RND bonus:\t\t{rnd_bonus}")
if rnd_hyperparms:
    print("RND hyperparameters")
else:
    print("PPO hyperparameters")

# restore_session()

while True:
    print(f"Seed:\t\t\t{seed}")
    ppo_rnd_algorithm()
    if log:
        print("Saving the model")
        path = f"./saved_model/{name}/" + date_str + f"_{seed}"
        player.policy.save(path + "/policy")
    if not seeds:
        break
    else:
        seed = seeds.pop()
        set_seed(seed)
        player, batch, norm_obs, norm_rwd = init_resources()
        if log:
            info_writer = InfoWriter(log_dir, num_envs)
            info_writer.log_seed(seed, log_info) 

To see the graphs in real time:

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs/scalars