In [1]:
import gym
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

USE_GPU = torch.cuda.is_available()
CUDA_DEVICE = torch.device('cuda')
CPU_DEVICE = torch.device('cpu')
DEVICE = CUDA_DEVICE if USE_GPU else CPU_DEVICE

### Deep Q Learning
Train an agent with the deep Q-learning or double deep Q-learning algorithm. \
Watch a pretrained agent solve LunarLander.
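
The difference between the two variants comes down to how the bootstrap target is built. The snippet below is a minimal NumPy sketch of that step only; it is not taken from `DQAgent`, and the function name `td_targets` and its arguments are hypothetical. It assumes the Q-values of the next states have already been computed by an online and a target network.

```python
import numpy as np

def td_targets(rewards, dones, q_next_online, q_next_target,
               gamma=0.99, double_qn=False):
    """One-step TD targets for a batch of transitions (illustrative sketch).

    q_next_online / q_next_target have shape (batch, n_actions) and hold the
    next-state Q-values from the online and the target network.
    """
    if double_qn:
        # Double DQN: the online network picks the action,
        # the target network scores it.
        best = np.argmax(q_next_online, axis=1)
        next_q = q_next_target[np.arange(len(best)), best]
    else:
        # Vanilla DQN: the target network both picks and scores the action.
        next_q = q_next_target.max(axis=1)
    # Terminal transitions (done == 1) do not bootstrap.
    return rewards + gamma * (1.0 - dones) * next_q

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_online = np.array([[0.2, 0.9], [0.5, 0.1]])
q_target = np.array([[1.0, 0.4], [0.3, 0.8]])

dqn = td_targets(rewards, dones, q_online, q_target, gamma=0.5)
ddqn = td_targets(rewards, dones, q_online, q_target, gamma=0.5, double_qn=True)
```

In this toy batch the first transition gets a target of 1.5 with vanilla DQN and 1.2 with double DQN, because the online network prefers an action that the target network values less.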

In [2]:
from dqAgent import DQAgent
env = gym.make("LunarLander-v2")
ag_q = DQAgent(env, double_qn=False, device = DEVICE)
#ag_q.train()
ag_q.load_models('dq1')
ag_q.visualize(ep = 4)

[251.33790420922693, 271.5718362863971, 264.98726907083756, 273.66507730860036]

### Proximal Policy Optimization
Train an agent with PPO, an algorithm based on policy gradients and the actor-critic paradigm. \
Watch a pretrained agent solve LunarLander as well as the DQAgent most of the time. \
Training requires several environments with which the agent interacts simultaneously. \
An important hyperparameter is the number of rollout steps, which defaults to 1.
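
At the core of PPO is the clipped surrogate objective, which keeps each policy update close to the policy that collected the rollout. The following is a minimal NumPy sketch of that loss, not the code inside `PPOAgent`; the function name `ppo_clip_loss` and the default `clip_eps=0.2` are assumptions chosen for illustration.

```python
import numpy as np

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss of PPO (illustrative sketch, to be minimized)."""
    # Probability ratio between the updated policy and the rollout policy.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (smaller) objective, then negate it to get a loss.
    return -np.mean(np.minimum(unclipped, clipped))

old_lp = np.log(np.array([0.5, 0.5]))
new_lp = np.log(np.array([0.75, 0.25]))
adv = np.array([1.0, -1.0])
loss = ppo_clip_loss(new_lp, old_lp, adv)
```

Here the first ratio (1.5) is clipped to 1.2, so the policy gains nothing from pushing a good action's probability further than the clip range allows.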

In [9]:
from ppoAgent import PPOAgent
num_envs = 16
env = [gym.make("LunarLander-v2") for _ in range(num_envs)]
ag_ppo = PPOAgent(env, num_env=num_envs, device=DEVICE)
#ag_ppo.train()
ag_ppo.load_models('ppo')
ag_ppo.visualize(ep = 4)

[267.2439062563842, 253.55319460616138, 290.63272832621664]

## Imitation Learning
For the following two algorithms we need demonstrations of optimal interaction with the environment. \
We collect this data from the interactions of a pretrained RL agent such as DQAgent.

In [3]:
ex = ag_q.get_experiences(num_exp=10*1024)

### Behavioural Cloning
We train an agent to imitate the actions of an expert. \
We collect tuples of states and actions from the interactions of an RL agent such as DQAgent and perform supervised learning on that dataset.
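
To make the supervised step concrete, here is a minimal NumPy sketch that fits a linear softmax policy to state-action pairs with cross-entropy. It is not the `BCAgent` implementation: the data is a synthetic stand-in for expert experience (hypothetical `states`, `actions`, `true_w`), and only the shapes (8-dimensional states, 4 discrete actions) match LunarLander-v2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for expert data: the "expert" is a random linear policy.
states = rng.normal(size=(512, 8))
true_w = rng.normal(size=(8, 4))
actions = np.argmax(states @ true_w, axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(w):
    # Mean negative log-likelihood of the expert actions under the policy.
    probs = softmax(states @ w)
    return -np.mean(np.log(probs[np.arange(len(actions)), actions] + 1e-12))

w = np.zeros((8, 4))
loss_before = cross_entropy(w)
onehot = np.eye(4)[actions]
for _ in range(200):
    # Gradient of the mean cross-entropy with respect to the weights.
    probs = softmax(states @ w)
    grad = states.T @ (probs - onehot) / len(states)
    w -= 0.5 * grad
loss_after = cross_entropy(w)
accuracy = np.mean(np.argmax(states @ w, axis=1) == actions)
```

The cloned policy only ever sees the expert's actions, never the reward, so its quality depends entirely on how well the dataset covers the states it will visit.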

In [4]:
from bcAgent import BCAgent
env = gym.make("LunarLander-v2")
ag_bc = BCAgent(env, experience=ex, device=DEVICE)
#ag_bc.train()
ag_bc.load_models('bc6', 'cpu')
ag_bc.visualize(ep = 4)

[-330.5044019014285,
 -263.29283521425646,
 -316.69186146196773,
 -259.34746298495253]

In [2]:
from dqAgent import DQAgent
env = gym.make("LunarLander-v2")
ag_dq = DQAgent(env, layers = [128,32], eps = 1, double_qn=True)
#ag_dq.load_models('dq1')
ex = ag_dq.get_experiences(10)

### Generative Adversarial Imitation Learning
We leverage the policy gradient paradigm and adversarial training techniques for imitation learning. \
We use a PPOAgent as a generator that outputs actions given states, and a simple MLP classifier as a discriminator. The discriminator aims to distinguish between the state-action pairs produced by the generator and the state-action pairs provided by the expert. \
The PPOAgent uses the discriminator's output as its reward.
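
The two signals involved can be sketched as follows. This is an illustration under assumptions, not the `GAILAgent` code: the discriminator is assumed to output a logit where a high value means "looks like the expert", and the reward uses the common `-log(1 - D)` form. Other sign conventions exist, and the function names here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(expert_logits, policy_logits, eps=1e-8):
    """Binary cross-entropy: expert pairs are labeled 1, generator pairs 0."""
    expert_term = -np.mean(np.log(sigmoid(expert_logits) + eps))
    policy_term = -np.mean(np.log(1.0 - sigmoid(policy_logits) + eps))
    return expert_term + policy_term

def gail_reward(policy_logits, eps=1e-8):
    """Reward for the generator: higher when a pair looks expert-like."""
    return -np.log(1.0 - sigmoid(policy_logits) + eps)

# An undecided discriminator (all logits zero) gives a loss of 2 * log(2).
loss0 = discriminator_loss(np.zeros(4), np.zeros(4))
# Pairs that fool the discriminator more receive a larger reward.
rewards = gail_reward(np.array([-2.0, 2.0]))
```

The environment reward is never used in this loop; the generator is trained with PPO on `gail_reward` alone.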


In [3]:
from gailAgent import GAILAgent
num_envs = 16
env = [gym.make("LunarLander-v2") for _ in range(num_envs)]
ag_gail = GAILAgent(env, expert_demos=ex, rollout_steps=64, device=DEVICE)
#ag_gail.train()
ag_gail.load_models('gail', 'cpu')
ag_gail.visualize(ep = 4)       

### Goal Conditioned Supervised Learning
For this imitation learning algorithm no expert demonstrations are needed. The agent learns a goal-conditioned policy by exploiting its own interactions with the environment. The main idea of the algorithm is that even if a trajectory is not optimal for the task, it is optimal for reaching its own final state from its own initial state.
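
The relabeling idea can be sketched in a few lines of plain Python. This is an illustration, not the `GCSLAgent` code: the function name `hindsight_samples` and the dictionary layout are hypothetical, and it assumes that any later state of the trajectory (not only the final one) may serve as the goal.

```python
import random

def hindsight_samples(states, actions, num_samples, rng):
    """Turn one trajectory into supervised (state, goal, horizon) -> action data.

    states has length T + 1 and actions has length T.
    """
    samples = []
    T = len(actions)
    for _ in range(num_samples):
        t = rng.randrange(T)             # step whose action is imitated
        h = rng.randrange(t + 1, T + 1)  # later step whose state becomes the goal
        samples.append({
            "state": states[t],
            "goal": states[h],
            "horizon": h - t,
            "action": actions[t],
        })
    return samples

rng = random.Random(0)
states = [(0.0, 1.0), (0.1, 0.8), (0.1, 0.5), (0.0, 0.0)]
actions = [2, 0, 3]
batch = hindsight_samples(states, actions, num_samples=5, rng=rng)
```

Each sample is a valid demonstration by construction: taking `action` in `state` was part of a trajectory that did reach `goal` within `horizon` steps. The policy is then trained on these samples with ordinary supervised learning.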


As we can see, the agent manages to reach its goal: getting to point (0, 0) of the environment with both rocket feet touching the moon. However, because it never sees the reward function, it doesn't learn to land the rocket smoothly.

In [2]:
from gcslAgent import GCSLAgent
env = gym.make('LunarLander-v2')
ag_gcsl = GCSLAgent(env)
ag_gcsl.load_models('gcsl')
ag_gcsl.visualize()


final state:  [ 0.05221357 -0.04342753  1.          1.        ]
final state:  [-0.05183087 -0.04321696  1.          1.        ]
final state:  [ 0.09947443 -0.01601993  0.          1.        ]
final state:  [ 0.00568418 -0.03415377  1.          1.        ]
