<a href="https://colab.research.google.com/github/julia-lina-tan/rl-policy-fusion/blob/main/rl_policy_fusion_new.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

This notebook trains agents from Stable Baselines3 in the "Reacher" Gym environment.

1. An agent will be trained to reach for rewards in the 1st quadrant of the workspace
2. An agent will be trained to reach for rewards in the 2nd quadrant of the workspace
3. A fused agent will be initialised from these two previous agents using optimal transport (OT) techniques
4. The fused agent will relearn to reach for rewards in both the 1st and 2nd quadrants of the workspace
5. The effectiveness of initialising via fusion (as compared to relearning from the original models or creating a new model from scratch) will be evaluated
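
As a rough illustration of step 3, the sketch below fuses a single layer from two models by aligning their neurons with an entropy-regularised OT plan (Sinkhorn iterations) and then averaging. This is an assumption about what "OT techniques" could look like, not the method this notebook uses: the function names `sinkhorn_plan` and `fuse_layer`, the squared Euclidean cost, and the `reg` value are all hypothetical choices made for the example.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.05, n_iters=200):
    """Entropy-regularised OT plan between two uniform distributions."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over the neurons of model A
    b = np.full(m, 1.0 / m)  # uniform mass over the neurons of model B
    scale = cost.max() or 1.0  # avoid dividing by zero for identical layers
    kernel = np.exp(-cost / (reg * scale))
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

def fuse_layer(w_a, w_b, reg=0.05):
    """Align the neurons (rows) of w_b to those of w_a, then average."""
    # Squared Euclidean distance between every pair of neurons.
    cost = ((w_a[:, None, :] - w_b[None, :, :]) ** 2).sum(axis=-1)
    plan = sinkhorn_plan(cost, reg=reg)
    # Each row of the plan sums to 1/n, so rescale it into a soft permutation.
    w_b_aligned = (plan * w_a.shape[0]) @ w_b
    return 0.5 * (w_a + w_b_aligned)

# Toy check: model B holds the same neurons as model A in a different order.
w_a = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w_b = w_a[[2, 0, 1]]
fused = fuse_layer(w_a, w_b)
print(np.round(fused, 3))
```

Averaging `w_a` and `w_b` directly would blend unrelated neurons, whereas aligning first recovers something close to `w_a` in this toy case. The sketch handles one layer in isolation; fusing a full network would also need the alignment carried through to the input columns of the next layer.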

# Setup

Install Stable Baselines and other dependencies.

In [None]:
!apt install swig cmake
!pip install stable-baselines3[extra] box2d box2d-kengz

In [None]:
# Additional installations/imports for rendering Gym environment

!apt-get install -y xvfb x11-utils
!pip install gym[box2d]==0.17.* pyvirtualdisplay==0.2.* PyOpenGL==3.1.* PyOpenGL-accelerate==3.1.* 
!apt-get install imagemagick

import pyvirtualdisplay
_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
                                    size=(1400, 900))
_ = _display.start()

Import RL policy, RL agents and wrappers.

In [14]:
import gym
import numpy as np
import torch as pt
import matplotlib.pyplot as plt

from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor

Install [pybullet-gym](https://github.com/benelot/pybullet-gym) and import Reacher environment.

In [None]:
!git clone https://github.com/openai/gym.git
!cd gym && pip install -e .

In [None]:
!git clone https://github.com/benelot/pybullet-gym.git
!cd pybullet-gym && pip install -e .

In [None]:
import gym  # OpenAI Gym
import pybulletgym  # register PyBullet environments with OpenAI Gym

env = gym.make('ReacherPyBulletEnv-v0')
# env.render() # call this before env.reset, if you want a window showing the environment
env.reset()  # should return a state vector if everything worked

## Rendering agent in environment

In [11]:
import os

import cv2
from google.colab.patches import cv2_imshow
from matplotlib import animation

def save_frames_as_gif(frames, path='../content', filename='gym_animation.gif'):
    """Save a list of RGB frames as an animated GIF at ``path/filename``."""

    # Size the figure to match the frame dimensions (in pixels at 72 dpi)
    plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi=72)

    patch = plt.imshow(frames[0])
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=50)
    # Join directory and file name explicitly; plain concatenation drops the separator
    anim.save(os.path.join(path, filename), writer='imagemagick', fps=60)

def render_agent_in_env(agent, env, n_eval_episodes=5, path='../content', filename='gym_animation'):
    """Roll out ``agent`` in ``env`` and save one GIF per evaluation episode."""
    for i in range(n_eval_episodes):
        frames = []
        obs = env.reset()
        for t in range(500):

            # Render to frames buffer
            frame = np.array(env.render('rgb_array'))
            cv2.putText(frame, text=f'Episode {i+1}', org=(50,50), fontFace=cv2.FONT_HERSHEY_DUPLEX, fontScale=0.8, color=(0,0,0))
            frames.append(frame)
            # Query the agent passed as an argument rather than the global model
            action, _states = agent.predict(obs)
            obs, reward, done, info = env.step(action)
            if done:
                break
        save_frames_as_gif(frames, path=path, filename=f'{filename}-ep{i+1}.gif')

# Train agent

In [None]:
env = Monitor(env)
model = PPO('MlpPolicy', env, verbose=1, seed=1)

# TODO: configure the model

model.learn(total_timesteps=10_000)  # total_timesteps is expected to be an int

## Extracting policy from model parameters

In [None]:
model_params = model.get_parameters()

def get_policy(model_params, net='action'):
    """
    Get either action or value net representing the actor-critic policy.

    :param model_params: (dict) the model parameters
    :param net: (str) the net type to return; either ``action`` or ``value``
    """
    if net != 'action' and net != 'value':
        raise ValueError('Must be either action net or value net')
    return model_params.get('policy').get(net+'_net.weight')

action_net = get_policy(model_params, net='action')
print(action_net)
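
To make the extracted weights less abstract: for continuous action spaces, the action net in a Stable Baselines3 actor-critic policy is, to the best of my understanding, a single linear layer that maps the latent policy features to the mean of the action distribution. The sketch below mimics that mapping with NumPy stand-ins. The sizes (64 latent features, 2 action dimensions) and the random values are assumptions for illustration; check `action_net.shape` for the real dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, action_dim = 64, 2  # assumed sizes, not read from the trained model

# Stand-ins for the 'action_net.weight' and 'action_net.bias' entries.
weight = rng.normal(size=(action_dim, latent_dim))
bias = np.zeros(action_dim)

# Stand-in for the latent features produced by the policy MLP for one observation.
latent = rng.normal(size=latent_dim)

# A linear layer computes W @ x + b, giving one mean per action dimension.
action_mean = weight @ latent + bias
print(action_mean.shape)
```

Each row of the weight matrix corresponds to one action dimension, which is why the rows are the natural units to align when two policies are fused.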

# Test agent

In [22]:
from stable_baselines3.common.evaluation import evaluate_policy

def plot_rewards(ep_rewards, title=None):
    """Plot the total reward obtained in each evaluation episode."""
    episodes = list(range(1, len(ep_rewards)+1))
    plt.figure(figsize=(10,5))
    plt.title(title)
    plt.xlabel('Episode')
    plt.ylabel('Episode reward')
    plt.xticks(episodes)
    plt.plot(episodes, ep_rewards, marker='o')
    plt.show()

In [None]:
ep_rewards, ep_steps = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True, return_episode_rewards=True)

print(f'mean reward={(sum(ep_rewards)/len(ep_rewards)):.2f} +/- {np.std(ep_rewards):.2f}')
plot_rewards(ep_rewards, title='Rewards over evaluation episodes')
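
For the final comparison between initialisation strategies, the same mean and standard deviation summary can be computed per agent. The following is a minimal sketch using only the standard library; `summarise_rewards` is a hypothetical helper, and the reward lists are placeholder values chosen to keep the arithmetic obvious, not results from this notebook.

```python
from statistics import mean, pstdev

def summarise_rewards(rewards_by_agent):
    """Map each agent name to (mean, population std) of its episode rewards."""
    return {name: (mean(r), pstdev(r)) for name, r in rewards_by_agent.items()}

# Placeholder rewards for illustration only; replace with evaluate_policy output.
placeholder_rewards = {
    'fused': [1.0, 3.0],
    'scratch': [0.0, 2.0],
}

summary = summarise_rewards(placeholder_rewards)
for name, (avg, std) in summary.items():
    print(f'{name}: mean reward={avg:.2f} +/- {std:.2f}')
```

`pstdev` is the population standard deviation, which matches the default behaviour of `np.std` used above.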

Visualise the performance of the agent over a number of evaluation episodes.

In [None]:
import os

os.makedirs('../content/agent1', exist_ok=True)
render_agent_in_env(model, env, n_eval_episodes=5, path='../content/agent1', filename='test')