# Train RL to Balance Cartpole

This notebook is part of the [AI for Beginners Curriculum](http://aka.ms/ai-beginners). It is inspired by the [official PyTorch tutorial](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html) and [this Cartpole PyTorch implementation](https://github.com/yc930401/Actor-Critic-pytorch).

In this example, we will use RL to train a model that balances a pole on a cart, which can move left and right along a horizontal axis. We will use the [OpenAI Gym](https://www.gymlibrary.ml/) environment to simulate the pole.

> **Note**: You can run this lesson's code locally (for example, from Visual Studio Code), in which case the simulation opens in a new window. When running the code online, you may need to make some adjustments to the code, as described [here](https://towardsdatascience.com/rendering-openai-gym-envs-on-binder-and-google-colab-536f99391cc7).

We will start by making sure Gym is installed:


In [None]:
import sys
!{sys.executable} -m pip install gym

Now let's create the CartPole environment and see how to operate on it. An environment has the following properties:

* **Action space** is the set of possible actions that we can perform at each step of the simulation
* **Observation space** is the space of observations that we can make


In [None]:
import gym

env = gym.make("CartPole-v1")

print(f"Action space: {env.action_space}")
print(f"Observation space: {env.observation_space}")

Let's see how the simulation works. The loop below runs the simulation until `env.step` returns the termination flag `done` set to `True`. We choose actions at random using `env.action_space.sample()`, which means the experiment will probably fail very quickly (the CartPole environment terminates when the cart position or the pole angle goes outside certain limits).

> The simulation will open in a new window. You can run the code several times and see how it behaves.


In [None]:
env.reset()

done = False
total_reward = 0
while not done:
   env.render()
   obs, rew, done, info = env.step(env.action_space.sample())
   total_reward += rew
   print(f"{obs} -> {rew}")
print(f"Total reward: {total_reward}")

You can notice that each observation contains 4 numbers. They are:
- Position of the cart
- Velocity of the cart
- Angle of the pole
- Rotation rate of the pole

`rew` is the reward we receive at each step. In the CartPole environment you are rewarded 1 point for each simulation step, and the goal is to maximize the total reward, that is, the time the CartPole is able to balance without falling.
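
To make the link between observations and episode length concrete, here is a minimal sketch of a bounds check on an observation. The limits (2.4 units for the cart position, 12 degrees for the pole angle) are assumptions taken from Gym's CartPole description, and `is_out_of_bounds` is a hypothetical helper, not part of the Gym API.

```python
import math

# Assumed limits, based on Gym's CartPole description
X_LIMIT = 2.4                    # maximum absolute cart position
THETA_LIMIT = math.radians(12)   # maximum absolute pole angle, in radians

def is_out_of_bounds(obs):
    """Return True if the cart position or pole angle is outside the limits."""
    position, velocity, angle, rotation_rate = obs
    return abs(position) > X_LIMIT or abs(angle) > THETA_LIMIT

print(is_out_of_bounds([0.0, 0.5, 0.05, -0.1]))  # small angle, cart near the centre
print(is_out_of_bounds([0.0, 0.5, 0.30, -0.1]))  # pole tilted too far
```

Every step for which such a check stays false adds 1 to the total reward, so a longer balance means a higher score.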

In reinforcement learning, our goal is to train a **policy** $\pi$ that tells us which action $a$ to take in each state $s$, so essentially $a = \pi(s)$.

If you want a probabilistic solution, you can think of the policy as returning a set of probabilities, one per action: $\pi(a|s)$ is the probability that we should take action $a$ in state $s$.
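
As a small illustration of a probabilistic policy, the sketch below samples actions according to given probabilities using only the standard library. The probabilities in `pi_s` are made up for the example, and `sample_action` is a hypothetical helper.

```python
import random

def sample_action(action_probs, rng):
    """Draw an action index according to the given probabilities."""
    actions = list(range(len(action_probs)))
    return rng.choices(actions, weights=action_probs, k=1)[0]

# Hypothetical policy output pi(a|s) for one state:
# action 0 with probability 0.3, action 1 with probability 0.7
pi_s = [0.3, 0.7]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_action(pi_s, rng) for _ in range(10_000)]
freq_right = sum(samples) / len(samples)
print(f"Empirical frequency of action 1: {freq_right:.2f}")
```

Over many draws, the empirical frequency of each action approaches its probability under the policy.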

## Policy Gradient Method  

In the simplest RL algorithm, called **Policy Gradient**, we train a neural network to predict the next action.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch

num_inputs = 4
num_actions = 2

model = torch.nn.Sequential(
    torch.nn.Linear(num_inputs, 128, bias=False, dtype=torch.float32),
    torch.nn.ReLU(),
    torch.nn.Linear(128, num_actions, bias = False, dtype=torch.float32),
    torch.nn.Softmax(dim=1)
)

We will train the network by running many experiments and updating the network after each run. Let's define a function that runs the experiment and returns the results (the so-called **trace**): all states, actions (and their recommended probabilities), and rewards:


In [None]:
def run_episode(max_steps_per_episode = 10000,render=False):    
    states, actions, probs, rewards = [],[],[],[]
    state = env.reset()
    for _ in range(max_steps_per_episode):
        if render:
            env.render()
        action_probs = model(torch.from_numpy(np.expand_dims(state,0)))[0]
        action = np.random.choice(num_actions, p=np.squeeze(action_probs.detach().numpy()))
        nstate, reward, done, info = env.step(action)
        if done:
            break
        states.append(state)
        actions.append(action)
        probs.append(action_probs.detach().numpy())
        rewards.append(reward)
        state = nstate
    return np.vstack(states), np.vstack(actions), np.vstack(probs), np.vstack(rewards)

You can run one episode with the untrained network and observe that the total reward (that is, the length of the episode) is very low:


In [None]:
s, a, p, r = run_episode()
print(f"Total reward: {np.sum(r)}")

One tricky aspect of the policy gradient algorithm is the use of **discounted rewards**. The idea is to compute, for each step of the game, the total reward collected from that step onward, with rewards further in the future discounted by a coefficient $\gamma$. We also normalize the resulting vector, because we will use it as a weight that affects our training:


In [None]:
eps = 0.0001

def discounted_rewards(rewards,gamma=0.99,normalize=True):
    ret = []
    s = 0
    for r in rewards[::-1]:
        s = r + gamma * s
        ret.insert(0, s)
    if normalize:
        ret = (ret-np.mean(ret))/(np.std(ret)+eps)
    return ret
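
A tiny worked example may help here. The sketch below repeats the same backward recursion in plain Python on three rewards of 1 with $\gamma = 0.5$ (a value chosen only to keep the arithmetic easy to follow), then normalizes the result. `discounted_returns` is a hypothetical name used for this illustration.

```python
import statistics

def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1}, walking the episode backwards."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return returns

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]

# Normalize to zero mean and (roughly) unit standard deviation
mean = sum(returns) / len(returns)
std = statistics.pstdev(returns)  # population std, like np.std
normalized = [(g - mean) / (std + 1e-4) for g in returns]
print(normalized)
```

Early steps get the largest value because more of the episode still lies ahead of them. After normalization, steps with above-average returns have positive weights and the rest have negative ones.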

Now let's do the actual training! We will run 300 episodes, and in each episode we will do the following:

1. Run the experiment and collect the trace.
2. Calculate the difference (`gradients`) between the actions taken and the predicted probabilities. The smaller the difference, the more confident we are that we took the right action.
3. Calculate the discounted rewards and multiply the gradients by them. This makes sure that steps with higher rewards have more effect on the final result than steps with lower rewards.
4. The expected target actions for our neural network are taken partly from the probabilities predicted during the run, and partly from the calculated gradients. We use the `alpha` parameter to determine how much the gradients and rewards are taken into account; this is called the *learning rate* of the reinforcement algorithm.
5. Finally, we train our network on the states and the expected actions, and repeat the process.
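
Steps 2 to 4 above can be sketched on a made-up two-step trace using plain Python lists. All numbers here are hypothetical, and `alpha` is set to 0.1 rather than the much smaller value used in training so that the shift in the targets is visible.

```python
def make_targets(actions, probs, weights, alpha):
    """Shift predicted probabilities towards (or away from) the taken actions."""
    targets = []
    for a, p, w in zip(actions, probs, weights):
        one_hot = [1.0 if i == a else 0.0 for i in range(len(p))]
        # Step 2 and 3: difference to the taken action, weighted by the reward
        gradient = [(o - q) * w for o, q in zip(one_hot, p)]
        # Step 4: blend the prediction with the weighted gradient
        targets.append([q + alpha * g for q, g in zip(p, gradient)])
    return targets

actions = [0, 1]                    # actions taken at each step
probs = [[0.6, 0.4], [0.5, 0.5]]    # probabilities predicted at each step
weights = [1.0, -1.0]               # normalized discounted rewards

targets = make_targets(actions, probs, weights, alpha=0.1)
print(targets)
```

With a positive weight, the target moves towards the action that was taken. With a negative weight, it moves away from it. Each target row still sums to 1, because the gradient components cancel out.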


In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def train_on_batch(x, y):
    x = torch.from_numpy(x)
    y = torch.from_numpy(y)
    optimizer.zero_grad()
    predictions = model(x)
    loss = -torch.mean(torch.log(predictions) * y)
    loss.backward()
    optimizer.step()
    return loss
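
To see what the loss in `train_on_batch` measures, here is a plain-Python sketch of the same formula, the negative mean of `log(prediction) * target` over all elements. The inputs are made up, and `weighted_log_loss` is a hypothetical helper that is not used by the training code.

```python
import math

def weighted_log_loss(predictions, targets):
    """Negative mean of log(prediction) * target over all elements."""
    terms = [
        math.log(p) * t
        for row_p, row_t in zip(predictions, targets)
        for p, t in zip(row_p, row_t)
    ]
    return -sum(terms) / len(terms)

target = [[1.0, 0.0]]  # all target mass on action 0
unsure = weighted_log_loss([[0.5, 0.5]], target)
confident = weighted_log_loss([[0.9, 0.1]], target)
print(unsure, confident)
```

The loss drops as the predicted probabilities move closer to the target, which is what the optimizer step pushes the network towards.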

In [None]:
alpha = 1e-4

history = []
for epoch in range(300):
    states, actions, probs, rewards = run_episode()
    one_hot_actions = np.eye(2)[actions.T][0]
    gradients = one_hot_actions-probs
    dr = discounted_rewards(rewards)
    gradients *= dr
    target = alpha*np.vstack([gradients])+probs
    train_on_batch(states,target)
    history.append(np.sum(rewards))
    if epoch%100==0:
        print(f"{epoch} -> {np.sum(rewards)}")

plt.plot(history)

Now let's run the episode with rendering to see the result:


In [None]:
_ = run_episode(render=True)

Hopefully, you can see that the pole now balances pretty well!

## Actor-Critic Model

The Actor-Critic model is a further development of policy gradients, in which we build a neural network to learn both the policy and the estimated rewards. The network has two outputs (or you can view it as two separate networks):
* **Actor** recommends the action to take by giving us a probability distribution over actions for the given state, as in the policy gradient model
* **Critic** estimates what the reward from those actions would be. It returns the total estimated future reward for the given state.

Let's define such a model:


In [None]:
from itertools import count
import torch.nn.functional as F

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env = gym.make("CartPole-v1")

state_size = env.observation_space.shape[0]
action_size = env.action_space.n
lr = 0.0001

class Actor(torch.nn.Module):
    def __init__(self, state_size, action_size):
        super(Actor, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.linear1 = torch.nn.Linear(self.state_size, 128)
        self.linear2 = torch.nn.Linear(128, 256)
        self.linear3 = torch.nn.Linear(256, self.action_size)

    def forward(self, state):
        output = F.relu(self.linear1(state))
        output = F.relu(self.linear2(output))
        output = self.linear3(output)
        distribution = torch.distributions.Categorical(F.softmax(output, dim=-1))
        return distribution


class Critic(torch.nn.Module):
    def __init__(self, state_size, action_size):
        super(Critic, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.linear1 = torch.nn.Linear(self.state_size, 128)
        self.linear2 = torch.nn.Linear(128, 256)
        self.linear3 = torch.nn.Linear(256, 1)

    def forward(self, state):
        output = F.relu(self.linear1(state))
        output = F.relu(self.linear2(output))
        value = self.linear3(output)
        return value

We will need to slightly modify our `discounted_rewards` and `run_episode` functions:


In [None]:
def discounted_rewards(next_value, rewards, masks, gamma=0.99):
    R = next_value
    returns = []
    for step in reversed(range(len(rewards))):
        R = rewards[step] + gamma * R * masks[step]
        returns.insert(0, R)
    return returns

def run_episode(actor, critic, n_iters):
    optimizerA = torch.optim.Adam(actor.parameters())
    optimizerC = torch.optim.Adam(critic.parameters())
    for iter in range(n_iters):
        state = env.reset()
        log_probs = []
        values = []
        rewards = []
        masks = []
        entropy = 0

        for i in count():
            env.render()
            state = torch.FloatTensor(state).to(device)
            dist, value = actor(state), critic(state)

            action = dist.sample()
            next_state, reward, done, _ = env.step(action.cpu().numpy())

            log_prob = dist.log_prob(action).unsqueeze(0)
            entropy += dist.entropy().mean()

            log_probs.append(log_prob)
            values.append(value)
            rewards.append(torch.tensor([reward], dtype=torch.float, device=device))
            masks.append(torch.tensor([1-done], dtype=torch.float, device=device))

            state = next_state

            if done:
                print('Iteration: {}, Score: {}'.format(iter, i))
                break


        next_state = torch.FloatTensor(next_state).to(device)
        next_value = critic(next_state)
        returns = discounted_rewards(next_value, rewards, masks)

        log_probs = torch.cat(log_probs)
        returns = torch.cat(returns).detach()
        values = torch.cat(values)

        advantage = returns - values

        actor_loss = -(log_probs * advantage.detach()).mean()
        critic_loss = advantage.pow(2).mean()

        optimizerA.zero_grad()
        optimizerC.zero_grad()
        actor_loss.backward()
        critic_loss.backward()
        optimizerA.step()
        optimizerC.step()
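
The role of `masks` and `next_value` is easier to see on plain numbers. The sketch below repeats the recursion from `discounted_rewards` on Python floats instead of tensors, with $\gamma = 0.5$ and made-up rewards and critic values. `bootstrapped_returns` is a hypothetical name used only here.

```python
def bootstrapped_returns(next_value, rewards, masks, gamma=0.99):
    """Same recursion as discounted_rewards above, on plain floats."""
    R = next_value
    returns = []
    for step in reversed(range(len(rewards))):
        R = rewards[step] + gamma * R * masks[step]
        returns.insert(0, R)
    return returns

rewards = [1.0, 1.0, 1.0]

# Episode ended on the last step: mask 0 discards the critic's estimate
terminal = bootstrapped_returns(5.0, rewards, [1.0, 1.0, 0.0], gamma=0.5)
print(terminal)  # [1.75, 1.5, 1.0]

# Episode still running: the critic's estimate of the next state is kept
ongoing = bootstrapped_returns(4.0, rewards, [1.0, 1.0, 1.0], gamma=0.5)
print(ongoing)  # [2.25, 2.5, 3.0]

# Advantage = return actually obtained minus the critic's estimate
values = [2.0, 1.5, 0.5]  # hypothetical critic outputs for each step
advantage = [g - v for g, v in zip(terminal, values)]
print(advantage)  # [-0.25, 0.0, 0.5]
```

A positive advantage means the step went better than the critic expected, so the actor loss raises the probability of that action. A negative advantage lowers it.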


Now we will run the main training loop. We use a manual training process: we compute the appropriate loss functions and update the network parameters ourselves:


In [None]:

actor = Actor(state_size, action_size).to(device)
critic = Critic(state_size, action_size).to(device)
run_episode(actor, critic, n_iters=100)

Finally, let's close the environment.


In [None]:
env.close()

## Takeaway

We have seen two RL algorithms in this demo: simple policy gradient, and the more sophisticated actor-critic. You can see that these algorithms operate with abstract notions of state, action and reward, so they can be applied to very different environments.

Reinforcement learning allows us to learn the best strategy for solving a problem just by looking at the final reward. Because we do not need labelled datasets, we can repeat simulations many times to optimize our models. However, there are still many challenges in RL, which you may learn about if you decide to focus more on this interesting area of AI.


---

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
