# Solving the CartPole task with policy-based RL algorithms

**Description of the task:**
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import gym
from importlib import reload

In [3]:
def play_episode(agent, env, return_states=False):
    # Reset environment (start of an episode)
    state = env.reset()
    rewards = []
    log_probs = []
    done = []
    
    if return_states:
        states = [state]
        
        
    steps = 0
    while True:
        action, log_prob = agent.get_action(state, return_log = True)
        new_state, reward, terminal, info = env.step(action) # gym standard step's output
        
        if return_states:
            states.append(new_state)
            
        if terminal and 'TimeLimit.truncated' not in info:
            # give -1 if cartpole falls but not if episode is truncated
            reward = -1 
            
        rewards.append(reward)
        log_probs.append(log_prob)
        done.append(terminal)
        
        if terminal:
            break
            
        state = new_state
       
    rewards = np.array(rewards)
    done = np.array(done)
    
    if return_states:
        return rewards, log_probs, np.array(states), done
    else:
        return rewards, log_probs, done

In [4]:
def render_test_episode(agent):
    # Create environment
    env = gym.make("CartPole-v1")
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n
    state = env.reset()
    while True:
        env.render()
        action = agent.get_action(state, return_log = False)
        new_state, reward, terminal, info = env.step(action) # gym standard step's output
        if terminal: 
            break
        else: 
            state = new_state
    env.close()

# Vanilla Policy-Gradient 
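
The `PolicyGradient` module is not reproduced in this notebook. As a rough sketch of the quantity a REINFORCE-style update is usually built on, the snippet below computes the discounted returns $G_t = \sum_{k \ge t} \gamma^{k-t} r_k$ from a reward array like the one returned by `play_episode`. The helper name `discounted_returns` and the optional normalization are assumptions made for illustration, not the module's actual code.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99, normalize=False):
    """Hypothetical helper: G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Accumulate backwards so each step reuses the return of the following step
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    if normalize and len(returns) > 1:
        # Common variance-reduction trick; an assumption, not taken from the notebook
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns

# Toy episode: two +1 steps followed by the -1 given when the pole falls
G = discounted_returns([1.0, 1.0, -1.0], gamma=0.5)
print(G)  # G[2] = -1, G[1] = 1 + 0.5*(-1) = 0.5, G[0] = 1 + 0.5*0.5 = 1.25
```

Under these assumptions, the policy loss would be the negative sum over the episode of each log-probability multiplied by its return $G_t$.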

In [None]:
import PolicyGradient

In [None]:
reload(PolicyGradient)

In [None]:
def train_cartpole(n_episodes = 100, lr = 0.01, gamma = 0.99):
    # Create environment
    env = gym.make("CartPole-v1")
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n
    # Init agent
    agent = PolicyGradient.PolicyGrad(observation_space, action_space, lr, gamma)
    performance = []
    losses = []
    for e in range(n_episodes):
        rewards, log_probs, _ = play_episode(agent, env)
        performance.append(np.sum(rewards))
        if (e+1)%10 == 0:
            print("Episode %d - reward: %.0f"%(e+1, np.mean(performance[-10:])))
        
        loss = agent.update(rewards, log_probs)
        losses.append(loss)
    return agent, np.array(performance), np.array(losses)

In [None]:
%%time
trained_agentPG, cumulative_rewardPG, lossesPG = train_cartpole(n_episodes = 500, lr=5e-3)

In [None]:
T = False
if T:
    n_runs = 30
    results_v0 = []
    for i in range(n_runs):
        trained_agentPG, cumulative_rewardPG, lossesPG = train_cartpole(n_episodes = 500, lr=5e-3)
        results_v0.append(cumulative_rewardPG)

In [None]:
if T:
    np.save('Results/REINFORCE_perf', results_v0)

In [None]:
episodes = np.arange(1,len(cumulative_rewardPG)+1)
plt.plot(episodes, cumulative_rewardPG)
plt.show()

In [None]:
plt.plot(episodes, lossesPG)
plt.show()

In [None]:
render_test_episode(trained_agentPG) 

# Advantage Actor-Critic - trajectory version
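
`ActorCritic.A2C` comes from a module that is not reproduced here, so the following is only a sketch of the two advantage estimates that the `TD` flag presumably switches between: a one-step temporal-difference estimate and a Monte Carlo estimate. The function names and the `values` array (critic predictions for every visited state, including the final one, matching the `T + 1` states returned by `play_episode`) are hypothetical.

```python
import numpy as np

def td_advantages(rewards, values, done, gamma=0.99):
    """One-step TD advantage: r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)        # length T + 1
    not_done = 1.0 - np.asarray(done, dtype=float)  # no bootstrapping after the last step
    targets = rewards + gamma * values[1:] * not_done
    return targets - values[:-1]

def mc_advantages(rewards, values, gamma=0.99):
    """Monte Carlo advantage: discounted return G_t minus V(s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values[:-1]

# Toy numbers chosen by hand, only to show the shapes involved
rewards = [1.0, 1.0, -1.0]
done = [False, False, True]
values = [0.5, 0.2, -0.4, 0.0]
adv_td = td_advantages(rewards, values, done, gamma=0.9)
adv_mc = mc_advantages(rewards, values, gamma=0.9)
print(adv_td, adv_mc)
```

The TD estimate leans on the critic's own prediction for the next state, while the Monte Carlo estimate uses only observed rewards. The usual expectation is that the first has lower variance but is biased while the critic is still inaccurate.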

In [5]:
import ActorCritic

In [6]:
from importlib import reload
reload(ActorCritic)

<module 'ActorCritic' from '/home/nicola/Nicola_unipd/MasterThesis/Policy-based-RL/ActorCritic.py'>

In [7]:
def train_cartpole_A2C(n_epochs = 100, lr = 0.01, gamma = 0.99, TD=True, twin=False, tau=1.,**kwargs):
    # Create environment
    env = gym.make("CartPole-v1")
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n
    # Init agent
    agent = ActorCritic.A2C(observation_space, action_space, lr, gamma, 
                            TD=TD, discrete=False, twin=twin, tau=tau, **kwargs)
    performance = []
    score = []
    for e in range(n_epochs):
        rewards, log_probs, states, done = play_episode(agent, env, return_states=True)
        performance.append(np.sum(rewards))
        if (e+1)%10 == 0:
            print("Episode %d - reward: %.0f"%(e+1, np.mean(performance[-10:])))
        #print("rewards.shape ", rewards.shape)
        #print("log_probs ", log_probs)
        #print("states.shape ", states.shape)
        #print("done.shape ", done.shape)
        #print("done ", done)
        agent.update(rewards, log_probs, np.array([states]) , done)
        
    return agent, np.array(performance)

In [9]:
%%time
HPs = dict(n_epochs=5000, lr=1e-3, twin=True,tau=0.1, debug=True, hiddens=[64,32,16])
agent_TD, performance_TD = train_cartpole_A2C(**HPs)

Discount factor:  0.99
Learning rate:  0.001
Action space:  2
Discrete state space:  False
Temporal Difference learning:  True
Twin networks:  True
Update critic target factor:  0.1
Device used:  cpu


Actor architecture: 
 Actor(
  (net): Sequential(
    (0): Linear(in_features=4, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=32, bias=True)
    (3): ReLU()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): ReLU()
    (6): Linear(in_features=16, out_features=2, bias=True)
    (7): LogSoftmax()
  )
)
Critic architecture: 
 Critic(
  (net1): BasicCritic(
    (net): Sequential(
      (0): Linear(in_features=4, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=32, bias=True)
      (3): ReLU()
      (4): Linear(in_features=32, out_features=16, bias=True)
      (5): ReLU()
      (6): Linear(in_features=16, out_features=1, bias=True)
    )
  )
  (net2): BasicCritic(
    (net): Sequential(
 

KeyboardInterrupt: 

In [None]:
episodes = np.arange(1,len(performance_TD)+1)
plt.scatter(episodes, performance_TD, s=2)

In [None]:
render_test_episode(agent_TD) 

In [None]:
%%time
agent_MC, performance_MC = train_cartpole_A2C(n_epochs=1500, lr=5e-3, TD=False)

In [None]:
episodes = np.arange(1,len(performance_MC)+1)
plt.scatter(episodes, performance_MC, s=2)

In [None]:
render_test_episode(agent_MC) 

## Reward shaping

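Try to make the reward more informative.
Idea: store the whole trajectory, then subtract $\epsilon \left(\frac{t}{T}\right)^{p}$ from each reward, where $t$ is the step at which the reward was obtained, $T$ is the total number of steps in the episode, and $\epsilon$ and $p$ correspond to the `eps` and `power` arguments below (with $p = 1$ this reduces to $\frac{\epsilon \cdot t}{T}$).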
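
As a small worked example, assuming a hypothetical five-step episode that ends with the pole falling, the snippet below evaluates the penalty `eps * (t / T) ** power` applied by `shape_rewards` in the next cell. The name `shaping_penalty` and the toy numbers are illustrative only.

```python
import numpy as np

def shaping_penalty(T, eps, power=1):
    """Hypothetical helper: per-step penalty eps * (t / T) ** power for t = 1..T."""
    t = np.arange(1, T + 1)
    return eps * (t / T) ** power

# Toy episode: four +1 steps, then the -1 assigned when the pole falls
rewards = np.array([1.0, 1.0, 1.0, 1.0, -1.0])
penalty = shaping_penalty(len(rewards), eps=0.1, power=2)
shaped = rewards - penalty
print(penalty)  # grows from 0.004 at the first step to 0.1 at the last one
print(shaped)
```

With `power=2` the penalty is concentrated on the last few steps, so the actions closest to the failure are penalized most.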

In [None]:
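def shape_rewards(r, eps, power=1):
    # Penalize each reward by eps*(t/T)**power, so later steps are penalized more
    T = len(r)
    t = np.arange(1, T+1)
    # Work on a float copy: in-place subtraction fails on integer arrays
    # and would otherwise modify the caller's array
    r = np.asarray(r, dtype=float) - eps*(t/T)**power
    return r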

In [None]:
def train_cartpole_A2C_shaped(n_epochs = 100, n_batches = 1, lr = 0.01, gamma = 0.99, TD=True, eps=1, power=1):
    # Create environment
    env = gym.make("CartPole-v1")
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n
    # Init agent
    agent = ActorCritic.A2C(observation_space, action_space, lr, gamma, TD=TD, discrete=False)
    performance = []
    for e in range(n_epochs):
        r_list = []
        logp_list = []
        s_list = []
        done_list = []
        score = []
        
        for b in range(n_batches):
            rewards, log_probs, states, done = play_episode(agent, env, return_states=True)
            # Shape only episodes that ended in failure, not those truncated
            # at the 500-step limit of CartPole-v1
            if done[-1] and len(done) != 500:
                rewards = shape_rewards(rewards, eps, power)
            r_list.append(rewards)
            logp_list.append(log_probs)
            s_list.append(states)
            done_list.append(done)
            score.append(np.sum(rewards))
            
        performance.append(np.mean(score))
        if (e+1)%10 == 0:
            print("Episode %d - reward: %.0f"%(e+1, np.mean(performance[-10:])))
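        # NOTE: experience_buffer is neither defined nor imported in this notebook;
        # it must already be available in the session for this cell to run
        exp_buff = experience_buffer(r_list, logp_list, s_list, done_list)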
        rewards, log_probs, states, done = exp_buff.get_exp()
        #print("rewards.shape ", rewards.shape)
        #print("log_probs ", log_probs)
        #print("states.shape ", states.shape)
        #print("done.shape ", done.shape)
        #print("done ", done)
        agent.update(rewards, log_probs[0], states, done)
        
    return agent, np.array(performance)

In [None]:
%%time
agent_TD_sh, performance_TD_sh = train_cartpole_A2C_shaped(n_epochs=1500, lr=5e-3, power=2, eps=0.01)

In [None]:
episodes = np.arange(1,len(performance_TD_sh)+1)
plt.scatter(episodes, performance_TD_sh, s=2)

In [None]:
render_test_episode(agent_TD_sh) 

In [None]:
%%time
agent_MC_sh, performance_MC_sh = train_cartpole_A2C_shaped(n_epochs=1500, lr=5e-3, power=2, eps=0.1)

In [None]:
episodes = np.arange(1,len(performance_MC_sh)+1)
plt.scatter(episodes, performance_MC_sh, s=2)

In [None]:
render_test_episode(agent_MC_sh) 

**Final discussion:** It is indisputable that the A2C setup is much better than a random policy, so it definitely learns something correctly. I found it very unstable with respect to the learning rate, and I suspect that better tuning would require separate learning rates for the critic and the actor.
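
One way to try separate learning rates, sketched here under the assumption that the actor and the critic are distinct `torch.nn.Module` objects, is to give each its own parameter group in a single optimizer. The small networks and the two learning-rate values below are placeholders, not the ones used by `ActorCritic.A2C`.

```python
import torch
from torch import nn

# Placeholder networks with CartPole-like sizes (4 inputs, 2 actions); hypothetical
actor = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2), nn.LogSoftmax(dim=-1))
critic = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

# One parameter group per network, each with its own learning rate
optimizer = torch.optim.Adam([
    {"params": actor.parameters(), "lr": 1e-3},   # illustrative value
    {"params": critic.parameters(), "lr": 5e-3},  # illustrative value
])

for group in optimizer.param_groups:
    print(group["lr"])
```

Two independent optimizers would work equally well; a single optimizer with two groups just keeps the update step in one place.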

Since each training procedure is stochastic, the result differs from run to run, all other things being equal. What I say next is therefore not strongly supported by the data, but it could be verified, given enough time, by training an ensemble of agents and averaging their performance at each epoch.
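
A minimal sketch of that averaging, assuming every run produces a reward curve of the same length (as the `results_v0` list built in the REINFORCE section would), could look like this. The helper name `summarize_runs` is hypothetical and the numbers are placeholders, not measured results.

```python
import numpy as np

def summarize_runs(runs):
    """Hypothetical helper: mean and standard deviation per episode across runs."""
    runs = np.asarray(runs, dtype=float)  # expected shape: (n_runs, n_episodes)
    if runs.ndim != 2:
        raise ValueError("all runs must have the same number of episodes")
    return runs.mean(axis=0), runs.std(axis=0)

# Placeholder curves for two runs of three episodes each; not real results
toy_runs = [[10, 20, 30], [20, 40, 60]]
mean_curve, std_curve = summarize_runs(toy_runs)
print(mean_curve, std_curve)
```

Plotting `mean_curve` with a band of one standard deviation on either side (for example with `plt.fill_between`) would show whether the differences between configurations exceed the run-to-run spread.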

What I observed is that:
- the TD agent is more unstable than the MC one;
- shaping the reward function changes the results a lot. I did this in two ways. The first is to give -1 instead of +1 as the last reward of an episode if the episode ends with the pole falling. This lets the agent distinguish between an episode ended by truncation (good: it scored the maximum possible) and one in which it took a sequence of non-optimal actions (bad: it could have done better). The second is similar but more sophisticated and is based on the idea of smoothing the reward, so that the responsibility for failing the task is shared in a weighted way among the last actions taken (with polynomial decay, whose power is a parameter of the model). I found that with these two changes the A2C with Monte Carlo estimation reached the maximum possible reward and held that performance much more stably than all other configurations.

Even so, some source of instability always remains, causing sudden drops in performance during training.