# Imitation Learning via Reinforcement Learning (ILRL)

This project uses a policy gradient method (PPO) to imitate an expert policy. *Given trajectories from an Expert Policy as input, the GAIL algorithm uses a policy gradient method (PPO in this case) to learn a policy, and in most cases the learned policy ends up performing better than the Expert Policy.*

### Steps:
1. **Run PPO algorithm** - run the PPO algorithm on an environment
    1. *Create Actor-Critic* architecture which represents the two policy networks
    2. *Code the PPO algorithm*
    3. *Train an agent using the PPO algorithm*
2. **Sample trajectories** - Sample some trajectories from the Expert Policy; these are used later to train the imitation-learning agent
    1. *Restore the agent policy network weights*
    2. *Sample some state and action using the expert policy*
    3. *Save the sampled states and actions into csv files*
3. **Test Expert Policy** - Test the learned expert policy to see if it satisfies the criteria for solving the environment (render the runs if you want)  
4. **Train agent using GAIL for imitation learning** - given the expert trajectories as input we use Generative Adversarial Imitation Learning to train the agent
    1. *Create a Discriminator* that differentiates between the Expert Policy and Generated Policy (same as in a conventional Generative Adversarial Network)
    2. *Train the agent* to learn by imitating the given expert policy (uses GAIL algorithm)
5. **Run Baseline implementations of PPO and TRPO to compare performance with our implementations**
6. **Observe reward plots on Tensorboard** - the tensorboard contains the following plots:
    1. Our PPO implementation Reward and Lengths
    2. Expert Policy Testing plot
    3. GAIL reward and lengths plot
    4. Baseline reward, length and loss plots
    
**Note** - Other policy gradient methods such as TRPO can also be used, both to generate the expert policy and as the policy-update algorithm inside GAIL

In [1]:
import tensorflow as tf
import gym
import numpy as np
import copy
import time

## A2C

Actor-Critic Architecture

In [2]:
class A2C:
    def __init__(self, name : str, obs_space, action_space, sess):
        self.observation = obs_space
        self.action = action_space
        self.scope_name = name
        self.sess = sess
        
        with(tf.variable_scope(self.scope_name)):
            # placeholder for inputs to the network
            self.inputs = tf.placeholder(shape = [None] + list(self.observation.shape) , dtype = tf.float32)
        
            # build the two networks
            self.build_network()
            
            # stochastic action
            self.act = tf.multinomial(tf.log(self.act_probs),1)
            self.act = tf.reshape(self.act, shape = [-1])
            
    def build_network(self):
        # critic network gives out value prediction for the given inputs
        with tf.variable_scope('critic', reuse = tf.AUTO_REUSE):
            cout = tf.layers.dense(self.inputs, units = 16, activation = tf.tanh)
            cout = tf.layers.dense(cout, units = 32, activation = tf.tanh)
            self.value = tf.layers.dense(cout, units = 1, activation = None)
        
        # actor network outputs action probabilities
        with tf.variable_scope('actor', reuse = tf.AUTO_REUSE):
            aout = tf.layers.dense(self.inputs, 16,activation = tf.tanh)
            aout = tf.layers.dense(aout, units = 32, activation = tf.tanh)
            aout = tf.layers.dense(aout, units = 16, activation = tf.tanh)
            self.act_probs = tf.layers.dense(aout, self.action.n , activation = tf.nn.softmax)
                    
    # get actions based on the given inputs
    def get_action(self, inputs):
        return self.sess.run(self.act, feed_dict = {self.inputs : inputs})
    
    # get value prediction for the given inputs
    def get_value(self, inputs):
        return self.sess.run(self.value, feed_dict = {self.inputs : inputs})
    
    # get all trainable variables required for policy update later
    def trainable_vars(self):
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,self.scope_name)
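
The `act` op in the class above draws one action index per input row from the categorical distribution given by `act_probs`. A minimal pure-Python sketch of that sampling step follows, assuming a two-action environment such as CartPole. The helpers `softmax` and `sample_action` and the example scores are illustrative and not part of the notebook.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = softmax([2.0, 0.0])      # hypothetical actor scores before the softmax
rng = random.Random(0)           # seeded so the run is repeatable
counts = [0, 0]
for _ in range(10_000):
    counts[sample_action(probs, rng)] += 1

# The empirical frequency of action 0 should be close to probs[0] (about 0.88).
print(probs, [c / 10_000 for c in counts])
```

Sampling instead of taking the most probable action keeps some exploration in the policy while it is being trained.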

## PPO

In [3]:
class PPO:
    
    def __init__(self, env, sess, eps = 0.2, gamma = 0.95, clip1=1, clip2=0.01, learning_rate = 5e-5):
        self.sess = sess
        self.eps = eps
        self.gamma = gamma
        self.learning_rate = learning_rate
        self.clip1 = clip1
        self.clip2 = clip2
        self.act_clip_max = 1
        self.act_clip_min = 1e-10
        
        self.pi = A2C("pi", env.observation_space, env.action_space, self.sess)
        self.old_pi = A2C("old_pi", env.observation_space, env.action_space, self.sess)
        
        self.pi_trainable_params = self.pi.trainable_vars()
        self.old_pi_trainable_params = self.old_pi.trainable_vars()
        
        with tf.variable_scope('update_policy'):
            self.update_ops = [old_pi_vals.assign(pi_vals) for pi_vals, old_pi_vals in zip(self.pi_trainable_params, self.old_pi_trainable_params)]
        
        with tf.variable_scope('training_inputs'):
            self.actions = tf.placeholder(shape = [None], dtype=tf.int32)
            self.rewards = tf.placeholder(shape = [None], dtype=tf.float32)
            self.v_next = tf.placeholder(shape = [None], dtype=tf.float32)
            self.adv = tf.placeholder(shape = [None], dtype=tf.float32)
            
        act_probs = self.hotify_action(self.pi.act_probs)
        act_old_probs = self.hotify_action(self.old_pi.act_probs)
            
        with tf.variable_scope("loss"):
        
            # loss calculations
            clipped_act_probs = tf.log(tf.clip_by_value(act_probs, self.act_clip_min, self.act_clip_max))
            clipped_old_act_probs = tf.log(tf.clip_by_value(act_old_probs, self.act_clip_min, self.act_clip_max))
            
            ratio = tf.exp(clipped_act_probs - clipped_old_act_probs)
    
            clipped_ratio = tf.clip_by_value(ratio, 1 -self.eps, 1 + self.eps)
            surrogate = tf.multiply(ratio, self.adv)
            surrogate_clipped = tf.multiply(clipped_ratio, self.adv)
            
            clipped_loss = tf.minimum(surrogate, surrogate_clipped)
            clipped_loss = tf.reduce_mean(clipped_loss)
            
            entropy = -tf.reduce_sum(self.pi.act_probs * tf.log(tf.clip_by_value(self.pi.act_probs, self.act_clip_min, self.act_clip_max)), 1)
            entropy = tf.reduce_mean(entropy, 0)
            
            value = self.pi.value
            error = self.rewards + self.gamma * self.v_next
            loss_value = tf.squared_difference(error, value)
            loss_value = tf.reduce_sum(loss_value)
            
            self.loss = -(clipped_loss - self.clip1 * loss_value + self.clip2 * entropy)
            self.loss_plot = tf.summary.scalar('loss', self.loss)
        
        opt = tf.train.AdamOptimizer(self.learning_rate, epsilon=1e-5)
        self.gradients = opt.compute_gradients(self.loss, var_list = self.pi_trainable_params)
        self.train_op = opt.minimize(self.loss, var_list = self.pi_trainable_params)

    # get action given state
    def get_action(self, inputs):
        return self.pi.get_action(inputs)

    # get value estimate given state
    def get_value(self, inputs):
        return self.pi.get_value(inputs)
    
    # update old policy network to the new network parameters    
    def update_old_policy(self):
        self.sess.run(self.update_ops)
    
    def train_policy(self, inputs, actions, rewards, v_next, advantages):
        self.sess.run(self.train_op, feed_dict = {self.pi.inputs : inputs,
                                                  self.old_pi.inputs : inputs,
                                                  self.actions: actions,
                                                  self.rewards: rewards,
                                                  self.v_next: v_next, 
                                                  self.adv: advantages})
    
    # get advantage estimates
    def get_gaes(self, rewards, v_preds, v_preds_next):
        deltas = [r_t + self.gamma * v_next - v for r_t, v_next, v in zip(rewards, v_preds_next, v_preds)]
        gaes = copy.deepcopy(deltas)
        for t in reversed(range(len(gaes) - 1)):
            gaes[t] = gaes[t] + self.gamma * gaes[t + 1]
        return gaes
    
    def hotify_action(self, action):
        action *= tf.one_hot(self.actions, action.shape[1])
        action = tf.reduce_sum(action, 1)
        return action
    
    def get_entropy(self, act_probs):
        entropy = -tf.reduce_sum(act_probs * tf.log(tf.clip_by_value(act_probs, self.act_clip_min, self.act_clip_max)), 1)
        return tf.reduce_mean(entropy, 0)        
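
The policy part of the loss in the class above is the PPO clipped surrogate: the probability ratio between the new and old policy is clipped to `[1 - eps, 1 + eps]`, and the smaller of the clipped and unclipped terms is kept. A small worked sketch in plain Python follows, with made-up probabilities and advantages. `clipped_surrogate` is an illustrative helper and not a function from the notebook.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
gain = clipped_surrogate(math.log(0.6), math.log(0.4), advantage=2.0)

# Ratio 0.5 with a negative advantage: the clipped term (0.8 * A) is the smaller one.
penalty = clipped_surrogate(math.log(0.2), math.log(0.4), advantage=-1.0)

print(gain, penalty)
```

In both cases the objective stops rewarding changes that move the ratio further outside the clipping range, which is what keeps each policy update small.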

In [4]:
def epoch_train(num_epochs, ppo, obs, actions, adv, rewards, v_preds_next):
    transitions = [obs, actions, adv, rewards, v_preds_next]
    
    for epochs in range(num_epochs):
        # random sampling
        index = np.random.randint(0, obs.shape[0], size = 32)
        samples = [np.take(transition, index, axis=0) for transition in transitions]

        # training
        ppo.train_policy(inputs = samples[0],
                     actions = samples[1],
                     advantages = samples[2],
                     rewards = samples[3],
                     v_next = samples[4])
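
`epoch_train` samples row indices with replacement and applies the same indices to every array, so each observation stays paired with its own action, advantage and reward. The sketch below shows that idea in plain Python on a toy episode. `sample_minibatch` and the toy data are illustrative assumptions.

```python
import random


def sample_minibatch(columns, batch_size, rng=random):
    """Pick `batch_size` row indices with replacement and apply them to every column."""
    n_rows = len(columns[0])
    idx = [rng.randrange(n_rows) for _ in range(batch_size)]
    return [[col[i] for i in idx] for col in columns]


# Toy episode of 5 steps: observations and the actions taken in them.
obs = ["s0", "s1", "s2", "s3", "s4"]
actions = [0, 1, 1, 0, 1]

batch_obs, batch_actions = sample_minibatch(
    [obs, actions], batch_size=8, rng=random.Random(1)
)

# Every sampled observation is still paired with its own action.
for o, a in zip(batch_obs, batch_actions):
    assert actions[obs.index(o)] == a
print(batch_obs, batch_actions)
```

Because the indices are drawn with replacement, a batch of 32 taken from an episode shorter than 32 steps necessarily contains repeated transitions.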

In [5]:
def preprocess(input):
    return np.stack([input])

In [6]:
def z_score(input):
    return (input - input.mean())/input.std()
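
`z_score` standardizes the advantages to zero mean and unit variance before they are used in the PPO update. The plain-Python sketch below mirrors it on a short list. The `eps` term is an added assumption, not present in the notebook's function, that avoids a division by zero when all values are equal.

```python
import math


def z_score(values, eps=1e-8):
    """Standardize values to zero mean and unit variance (population std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]


normalized = z_score([1.0, 2.0, 3.0, 4.0])
print(normalized)

# A constant input would divide by zero without eps; here it maps to all zeros.
constant = z_score([5.0, 5.0, 5.0])
print(constant)
```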

## Hyperparameters

In [7]:
iterations = 8000
num_of_epochs = 6
success_threshold = 195

## PPO runner

In [8]:
tf.reset_default_graph()  

with tf.Session() as sess:
    tensor_plot = tf.summary.FileWriter('log/ppo', graph = sess.graph)
    env = gym.make('CartPole-v0')
    ppo = PPO(env, sess)
    sess.run(tf.global_variables_initializer())
    state = env.reset()
    succ_run = 0
    
    # save trained model
    saver = tf.train.Saver()
    
    for i in range(iterations+1):
        obs = []
        actions = []
        rewards = []
        values = []
        length = 0
        
        if i % 1000 == 0:
            print('Episode number: {}'.format(i))
             
        while True:
            length += 1
            
            state = preprocess(state)
            
            action = ppo.get_action(state)
            action = action.item()  # np.asscalar is deprecated and was removed in NumPy 1.23
            
            value = ppo.get_value(state)
            value = value.item()
            
            next_state, reward, done, _ = env.step(action)
            
            obs.append(state)
            actions.append(action)
            rewards.append(reward)
            values.append(value)
               
            if done:
                next_state = preprocess(next_state)
                next_value = ppo.get_value(next_state)
                next_value = next_value.item()
                v_preds_next = values[1:] + [next_value]
                state = env.reset()
                break
            else:
                state = next_state
        
        tensor_plot.add_summary(tf.Summary(value = [tf.Summary.Value(tag="my_ppo_episode_rewards", simple_value = sum(rewards))]), i)
        tensor_plot.add_summary(tf.Summary(value = [tf.Summary.Value(tag="my_ppo_episode_length", simple_value = length)]), i)
                
        if sum(rewards) >= success_threshold:
            succ_run += 1
            
            if succ_run >= 100:
                path = saver.save(sess, "trained_model/ppo.ckpt")
                print("Threshold reached so we end training early. Model saved at {}".format(path))
                tensor_plot.close()
                break
        else:
            succ_run = 0
            
        adv = ppo.get_gaes(rewards, values, v_preds_next)

        obs = np.reshape(obs, newshape=(-1,) + env.observation_space.shape)
        
        rewards = np.array(rewards)
        v_preds_next = np.array(v_preds_next)
        actions = np.array(actions)
        adv = z_score(np.array(adv))
                
        ppo.update_old_policy()

        epoch_train(num_of_epochs, ppo, obs, actions, adv, rewards, v_preds_next)
    
    path = saver.save(sess, "trained_model/ppo.ckpt")
    print("Model saved at {}".format(path))
    tensor_plot.close()
env.close()

print("Training complete. Check the tensorboard for plots")

Episode number: 0
Episode number: 1000
Episode number: 2000
Episode number: 3000
Episode number: 4000
Episode number: 5000
Episode number: 6000
Episode number: 7000
Threshold reached so we end training early. Model saved at trained_model/ppo.ckpt
Model saved at trained_model/ppo.ckpt
Training complete. Check the tensorboard for plots
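
The advantages fed to training in the runner above come from `get_gaes`, which accumulates discounted TD residuals backwards through the episode; this corresponds to generalized advantage estimation with lambda fixed at 1. The worked sketch below restates that function without the class and runs it on a made-up 3-step episode. The reward and value numbers are assumptions chosen for easy checking.

```python
def get_gaes(rewards, v_preds, v_preds_next, gamma=0.95):
    """Discounted sum of TD residuals (GAE with lambda fixed at 1)."""
    deltas = [r + gamma * v_next - v
              for r, v_next, v in zip(rewards, v_preds_next, v_preds)]
    gaes = list(deltas)
    for t in reversed(range(len(gaes) - 1)):
        gaes[t] = gaes[t] + gamma * gaes[t + 1]
    return gaes


# Hypothetical 3-step episode with reward 1 per step, as in CartPole.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3]          # critic estimates V(s_0), V(s_1), V(s_2)
next_value = 0.0                  # assumed value of the state after the last step
v_preds_next = values[1:] + [next_value]

advantages = get_gaes(rewards, values, v_preds_next)
print(advantages)
```

With a final bootstrap value of 0, each advantage equals the discounted return from that step minus the critic's estimate for it. For the first step that is (1 + 0.95 + 0.9025) - 0.5 = 2.3525.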


## Extract Expert Trajectories

In [8]:
def save_to_csv(file_path, data):
    with open(file_path, 'ab') as f_handle:
        np.savetxt(f_handle, data, fmt='%s')

In [10]:
tf.reset_default_graph()
num_samples = 20

with tf.Session() as sess:
    env = gym.make('CartPole-v0')
    ppo = PPO(env, sess)
    # initialize after the graph is built; the restore below overwrites these values
    sess.run(tf.global_variables_initializer())
    
    saver = tf.train.Saver()
    
    saver.restore(sess, "trained_model/ppo.ckpt")
    print("Model restored!")
    
    state = env.reset()
    
    for itr in range(num_samples):
        obs = []
        actions = []
        length = 0
        done = False
        
        while True:
            length += 1
            state = preprocess(state)
            
            action = ppo.get_action(state)
            action = action.item()
            
            # take action
            next_state, reward, done, _ = env.step(action)
            
            obs.append(state)
            actions.append(action)
            
            if done:
                state = env.reset()
                break
            else:
                state = next_state
        
        obs = np.reshape(obs, newshape=[-1] + list(env.observation_space.shape))
        actions = np.array(actions)
        
        save_to_csv("trajectories/expert_obs.csv", obs)
        save_to_csv("trajectories/expert_actions.csv", actions)
        
env.close()    

INFO:tensorflow:Restoring parameters from trained_model/ppo.ckpt
Model restored!


# Render trained model (Demo)
## Condition for environment to be considered as solved:
Considered solved when the average reward is **greater than or equal to 195.0 over 100 consecutive trials**.
So, we check our test for the above condition and break the testing once this condition is satisfied.

Observe the rewards obtained in the demo on tensorboard under the tag *'test_episode_rewards'*
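
The solved condition stated above is a rolling average over the last 100 episodes. A minimal sketch of that check in plain Python follows; `first_solved_episode` and the reward history are made up for illustration. The test cell in this notebook uses a stricter variant: it counts consecutive episodes that each score at least 195, which implies the average condition.

```python
from collections import deque


def first_solved_episode(episode_rewards, threshold=195.0, window=100):
    """Return the index of the first episode at which the mean reward of the
    last `window` episodes reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, reward in enumerate(episode_rewards):
        recent.append(reward)
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None


# Made-up reward history: 20 weak episodes followed by 150 perfect ones.
history = [50.0] * 20 + [200.0] * 150
solved_at = first_solved_episode(history)
print(solved_at)
```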

In [11]:
tf.reset_default_graph()
success_threshold = 195

with tf.Session() as sess:
    tensor_plot = tf.summary.FileWriter('log/ppo', graph = sess.graph)
    env = gym.make('CartPole-v0')
    ppo = PPO(env, sess)
    # initialize after the graph is built; the restore below overwrites these values
    sess.run(tf.global_variables_initializer())
    
    saver = tf.train.Saver()
    succ_runs = 0
    
    saver.restore(sess, "trained_model/ppo.ckpt")
    print("Model restored!")
            
    state = env.reset()
    done = False
        
    for itr in range(1001):
        rewards = []
        
        if itr % 50 == 0:
            print('Episode: {}'.format(itr))
            
        while True:
            state = preprocess(state)
            # to render the solved environment, uncomment env.render() below and the
            # time.sleep call further down, which slows the rendering
            # env.render()

            action = ppo.get_action(state)
            action = action.item()

            state,reward,done,_ = env.step(action)
            rewards.append(reward)
            
            # to demonstrate slow rendering
            # time.sleep(0.025)
            
            if done:
                state = env.reset()
                break
                
        tensor_plot.add_summary(tf.Summary(value = [tf.Summary.Value(tag="test_episode_rewards", simple_value = sum(rewards))]), itr) 
        
        if sum(rewards) >= success_threshold:
            succ_runs  += 1
            
            if succ_runs > 100:
                print("Solved at Episode: {}".format(itr))
                break
        else:
            succ_runs = 0
                    
env.close()

INFO:tensorflow:Restoring parameters from trained_model/ppo.ckpt
Model restored!
Episode: 0
Episode: 50
Episode: 100
Solved at Episode: 100


# Discriminator
The discriminator plays the same role as in a Generative Adversarial Network: it learns to tell the expert's state-action pairs apart from those generated by the agent.

In [13]:
class Discriminator:
    def __init__(self, name : str, env, sess):
        self.sess = sess
        self.env = env
        self.scope_name = name
        
        # scope the variables under the given name so that trainable_vars() can find them
        with tf.variable_scope(self.scope_name):
            with(tf.variable_scope('expert_inputs')):
                self.expert_act = tf.placeholder(shape = [None], dtype=tf.int32)
                self.expert_state = tf.placeholder(shape = [None] + list(self.env.observation_space.shape), dtype=tf.float32)
                self.expert_one_hot_a = tf.one_hot(self.expert_act, depth = self.env.action_space.n)
                # add noise to the one-hot actions rather than overwriting them
                self.expert_one_hot_a += tf.random_normal(tf.shape(self.expert_one_hot_a), mean = 0.2, stddev=0.1, dtype = tf.float32)/1.2
                self.expert_SA = tf.concat([self.expert_state, self.expert_one_hot_a], axis=1)

            with(tf.variable_scope('agent_inputs')):
                self.agent_act = tf.placeholder(shape = [None], dtype=tf.int32)
                self.agent_state = tf.placeholder(shape = [None] + list(self.env.observation_space.shape), dtype=tf.float32)
                self.agent_one_hot_a = tf.one_hot(self.agent_act, depth = self.env.action_space.n)
                self.agent_one_hot_a += tf.random_normal(tf.shape(self.agent_one_hot_a), mean = 0.2, stddev = 0.1, dtype = tf.float32)/1.2
                self.agent_SA = tf.concat([self.agent_state, self.agent_one_hot_a], axis=1)

            with tf.variable_scope('network') as scope:
                self.prob1 = self.build_network(input = self.expert_SA)
                scope.reuse_variables()
                self.prob2 = self.build_network(input = self.agent_SA)

            with tf.variable_scope('loss'):            
                expert_loss = tf.reduce_mean(tf.log(tf.clip_by_value(self.prob1, 0.01, 1)))
                agent_loss = tf.reduce_mean(tf.log(tf.clip_by_value(1 - self.prob2, 0.01, 1)))
                
                # as we have to perform maximizing
                loss = -(expert_loss + agent_loss)
                # gradient ascent is achieved by minimizing negative of the loss which is the same as maximizing the calculated loss
                self.train_op = tf.train.AdamOptimizer().minimize(loss)

            self.rewards = tf.log(tf.clip_by_value(self.prob2, 1e-10, 1))
        
    
    def train_dis(self, expert_s, expert_a, agent_s, agent_a):
        return self.sess.run(self.train_op, feed_dict = {self.expert_state: expert_s,
                                                     self.expert_act: expert_a,
                                                     self.agent_state: agent_s,
                                                     self.agent_act: agent_a})
    
    def get_rewards(self, agent_s, agent_a):
        return self.sess.run(self.rewards, feed_dict = {self.agent_state: agent_s,
                                                       self.agent_act: agent_a})
    
    def build_network(self, input):
        prob = tf.layers.dense(input, units = 20,activation = tf.nn.leaky_relu, name = 'layer1')
        prob = tf.layers.dense(prob, units = 20, activation = tf.nn.leaky_relu, name = 'layer2')
        prob = tf.layers.dense(prob, units = 20, activation = tf.nn.leaky_relu, name = 'layer3')
        prob = tf.layers.dense(prob, units = 1, activation = tf.sigmoid, name = 'prob')       
        return prob
    
    def trainable_vars(self):
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,self.scope_name)
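
The loss and reward defined in the class above can be checked by hand on a few numbers. The sketch below recomputes them in plain Python from hypothetical discriminator outputs, using the same clipping bounds as the class. `discriminator_loss` and `imitation_rewards` are illustrative names and not part of the notebook.

```python
import math


def clip(x, lo, hi):
    """Limit x to the interval [lo, hi]."""
    return min(max(x, lo), hi)


def discriminator_loss(expert_probs, agent_probs):
    """Negative of mean log D(expert) + mean log(1 - D(agent))."""
    expert_term = sum(math.log(clip(p, 0.01, 1.0)) for p in expert_probs) / len(expert_probs)
    agent_term = sum(math.log(clip(1.0 - p, 0.01, 1.0)) for p in agent_probs) / len(agent_probs)
    return -(expert_term + agent_term)


def imitation_rewards(agent_probs):
    """Reward log D(s, a): closer to 0 the more expert-like a pair looks."""
    return [math.log(clip(p, 1e-10, 1.0)) for p in agent_probs]


# Hypothetical discriminator outputs D(s, a) in (0, 1).
good = discriminator_loss(expert_probs=[0.9, 0.8], agent_probs=[0.1, 0.2])
bad = discriminator_loss(expert_probs=[0.5, 0.5], agent_probs=[0.5, 0.5])
agent_rewards = imitation_rewards([0.1, 0.9])

print(good, bad)        # the well-separated case has the lower loss
print(agent_rewards)    # the second pair earns the higher reward
```

The agent is rewarded for state-action pairs the discriminator scores as expert-like, so improving the policy and fooling the discriminator become the same objective.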

# Generative Adversarial Imitation Learning

In [14]:
# hyperparameters
iterations = 15000
dis_epochs = 3
success_threshold = 195

In [15]:
def epoch_train(num_epochs, ppo, obs, actions, adv, rewards, v_preds_next):
    transitions = [obs, actions, adv, rewards, v_preds_next]
    
    for epochs in range(num_epochs):
        # random sampling
        index = np.random.randint(0, obs.shape[0], size = 32)
        samples = [np.take(transition, index, axis=0) for transition in transitions]

        # training
        ppo.train_policy(inputs = samples[0],
                     actions = samples[1],
                     advantages = samples[2],
                     rewards = samples[3],
                     v_next = samples[4])

In [16]:
tf.reset_default_graph()  

with tf.Session() as sess:
    tensor_plot = tf.summary.FileWriter('log/ppo', graph = sess.graph)
    env = gym.make('CartPole-v0')

    ppo = PPO(env, sess)
    gan = Discriminator("Discriminator", env, sess)

    sess.run(tf.global_variables_initializer())
    
    expert_state = np.genfromtxt('trajectories/expert_obs.csv')
    expert_act = np.genfromtxt('trajectories/expert_actions.csv', dtype = np.int32)
    
    state = env.reset()
    succ_runs = 0
    
    # save trained model
    saver = tf.train.Saver()
    
    for i in range(iterations+1):
        obs = []
        actions = []
        values = []
        length = 0
        rewards = []
        
        if i % 1000 == 0:
            print('Episode number: {}'.format(i))
             
        while True:
            length += 1
            
            state = preprocess(state)
            
            action = ppo.get_action(state)
            action = action.item()  # np.asscalar is deprecated and was removed in NumPy 1.23
            
            value = ppo.get_value(state)
            value = value.item()
            
            next_state, reward, done, _ = env.step(action)
            
            obs.append(state)
            actions.append(action)
            values.append(value)
            rewards.append(reward)
            
            if done:
                next_state = preprocess(next_state)
                next_value = ppo.get_value(next_state)
                next_value = next_value.item()
                v_preds_next = values[1:] + [next_value]
                state = env.reset()
                break
            else:
                state = next_state
                
        tensor_plot.add_summary(tf.Summary(value = [tf.Summary.Value(tag="gail_episode_rewards", simple_value = sum(rewards))]), i)
        tensor_plot.add_summary(tf.Summary(value = [tf.Summary.Value(tag="gail_episode_length", simple_value = length)]), i)
        
        if sum(rewards) >= success_threshold:
            succ_runs  += 1
            
            if succ_runs > 100:
                print("Solved at Episode: {}".format(i))
                print("GAIL training completed early!")
                break
        else:
            succ_runs = 0
            
        # preprocess inputs
        obs = np.reshape(obs, newshape=(-1,) + env.observation_space.shape)
        v_preds_next = np.array(v_preds_next)
        actions = np.array(actions)
        
        # train discriminator
        for _ in range(dis_epochs):
            gan.train_dis(expert_state, expert_act, obs, actions)
        
        gan_rewards = gan.get_rewards(obs, actions)
        gan_rewards = np.reshape(gan_rewards, newshape=(-1,))
        
        adv = ppo.get_gaes(gan_rewards, values, v_preds_next)
                
        ppo.update_old_policy()

        epoch_train(dis_epochs, ppo, obs, actions, adv, gan_rewards, v_preds_next)
    
    print("GAIL training complete!")
env.close()

Episode number: 0
Episode number: 1000
Episode number: 2000
Episode number: 3000
Episode number: 4000
Episode number: 5000
Episode number: 6000
Episode number: 7000
Episode number: 8000
Episode number: 9000
Episode number: 10000
Episode number: 11000
Episode number: 12000
Episode number: 13000
Episode number: 14000
Episode number: 15000
GAIL training complete!


## Baselines Implementation of PPO

The PPO1 implementation from the stable-baselines repo is run here so that its performance can be compared with my implementation.

In [17]:
from stable_baselines import PPO1

model = PPO1('MlpPolicy', 'CartPole-v0', verbose=1, tensorboard_log="log/ppo/")
model.learn(total_timesteps=10000)

Creating environment from the given name, wrapped in a DummyVecEnv.
********** Iteration 0 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
     -0.00179 |      -0.00693 |      76.23267 |       0.00015 |       0.69303
     -0.00902 |      -0.00692 |      74.91623 |       0.00110 |       0.69216
     -0.01602 |      -0.00690 |      73.50521 |       0.00339 |       0.68998
     -0.02153 |      -0.00685 |      72.08931 |       0.00834 |       0.68524
Evaluating losses...
     -0.02223 |      -0.00682 |      71.12242 |       0.01187 |       0.68188
----------------------------------
| EpLenMean       | 21.9         |
| EpRewMean       | 21.9         |
| EpThisIter      | 11           |
| EpisodesSoFar   | 11           |
| TimeElapsed     | 0.374        |
| TimestepsSoFar  | 256          |
| ev_tdlam_before | -0.0104      |
| loss_ent        | 0.6818789    |
| loss_kl         | 0.011868857  |
| loss_pol_entpen | -0.006818789 |
| loss_p

<stable_baselines.ppo1.pposgd_simple.PPO1 at 0x7f857872e400>

## Baselines Implementation of TRPO

The TRPO implementation from the stable-baselines repo is run here so that its performance can be compared with my implementation.

In [None]:
from stable_baselines import TRPO

model = TRPO('MlpPolicy', 'CartPole-v0', verbose=1, tensorboard_log="log/trpo/")
model.learn(total_timesteps=10000)

Creating environment from the given name, wrapped in a DummyVecEnv.


********** Iteration 0 ************
Optimizing Policy...
sampling


## Monitor Tensorboard for plots

1. Change directory into this project directory
2. Execute the following command
    `tensorboard --logdir=log/`
3. Visit the localhost page with the provided port number to monitor tensorboard

### Results on CartPole-v0 environment:

## Our PPO implementation Rewards
![Our PPO implementation Rewards](/plots/my_ppo.png)

## GAIL learned agent Rewards
![GAIL rewards](/plots/gail_rewards.png)

## Baseline PPO rewards
![Baselines PPO rewards](/plots/baseline_ppo.png)