## Reinforcement Learning

Reinforcement learning is concerned with sequential decision making. An agent acts in an environment by choosing one action at a time. The environment accepts each action, transitions to its next state, and returns that state to the agent together with a reward.

<img src="rl.png" width=600 />
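
The interaction loop described above can be sketched without any RL library. The snippet below is only an illustration: `CoinFlipEnv` and `run_episode` are hypothetical names invented for this example, and the environment is a toy, not the one used later in the notebook.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses the next coin flip."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the step index

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0  # reward for a correct guess
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done

def run_episode(env, policy):
    """Run one episode: the agent acts, the environment returns state and reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward

env = CoinFlipEnv()
episode_return = run_episode(env, policy=lambda state: 0)  # always guess 0
```

The `reset`/`step` shape mirrors the pattern the notebook relies on later, with the policy reduced to a plain function of the state.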

### The REINFORCE algorithm
REINFORCE is a policy gradient algorithm for deep reinforcement learning. The key idea is that, during learning, actions that result in good outcomes should become more probable (positively reinforced) and actions that result in bad outcomes should become less probable.
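
One common way to write this idea down, offered here as a sketch rather than as the notebook's own derivation, is the policy gradient estimate below. The symbols are assumptions introduced for illustration: $\pi_\theta$ is the policy with parameters $\theta$, $\tau$ is a sampled trajectory of length $T$, $r_t$ is the reward at step $t$, and $\gamma$ is the discount factor.

$$
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} R_t(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right],
\qquad
R_t(\tau) = \sum_{t'=t}^{T-1} \gamma^{\,t'-t}\, r_{t'}
$$

A large return $R_t(\tau)$ pushes up the log-probability of the action taken at step $t$, while a small or negative one pushes it down.
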
___

In this notebook we will use the following libraries:
- [torch](https://pytorch.org/docs/stable/torch.html)
- [numpy](https://numpy.org/doc/stable/)
- [torch.nn](https://pytorch.org/docs/stable/nn.html)
- [torch.optim](https://pytorch.org/docs/stable/optim.html)
- [gym](https://gym.openai.com/docs/)

In [1]:
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import gym

In [2]:
class PolicyNetwork(nn.Module):
    
    def __init__(self, in_dim, hidden_layer_size, out_dim):
        super(PolicyNetwork, self).__init__()
        layers = [nn.Linear(in_dim, hidden_layer_size),
                  nn.ReLU(),
                  nn.Linear(hidden_layer_size, out_dim),]
        self.model = nn.Sequential(*layers)
        self.onpolicy_reset()
        self.train()
        
    def onpolicy_reset(self):
        self.log_probs = []
        self.rewards = []
        
    def forward_pass(self, x):
        pdparams = self.model(x)
        return pdparams
    
    def act(self, state):
        x = torch.from_numpy(state.astype(np.float32))
        pdparams = self.forward_pass(x)
        pd = torch.distributions.Categorical(logits = pdparams)
        action = pd.sample() # sample a ~ pi(a|s)
        log_prob = pd.log_prob(action) # log pi(a|s)
        self.log_probs.append(log_prob)
        return action.item()
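
To make the sampling step in `act` concrete, here is a minimal pure-Python sketch of what a categorical distribution built from logits does: softmax the logits into probabilities, sample an action, and keep its log-probability. The helper names `softmax` and `sample_action` are hypothetical and exist only for this illustration; the notebook itself relies on `torch.distributions.Categorical`.

```python
import math
import random

def softmax(logits):
    """Turn unnormalized logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    """Sample an action index and return it with its log-probability."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

rng = random.Random(0)
probs = softmax([0.0, 0.0])  # equal logits give a uniform distribution
action, log_prob = sample_action([2.0, 0.5], rng)
```

Because the log-probability is the only quantity the loss needs from each step, storing it at sampling time is enough to train at the end of the episode.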
    

In [3]:
def train(pi, optimizer, gamma):
    
    T = len(pi.rewards)
    trajectory_returns = np.empty(T, dtype = np.float32)
    future_returns = 0.0
    
    for t in reversed(range(T)):
        future_returns = pi.rewards[t] + gamma * future_returns
        trajectory_returns[t] = future_returns
        
    trajectory_returns = torch.tensor(trajectory_returns)
    log_probs = torch.stack(pi.log_probs)
    loss = -log_probs * trajectory_returns
    loss = torch.sum(loss)
    optimizer.zero_grad()
    loss.backward() # backpropagate to compute the gradients
    optimizer.step() # gradient descent on the negated objective, i.e. gradient ascent on the expected return
    return loss
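
The backward loop in `train` computes the discounted return from every time step in a single pass. A small worked example, written as a standalone sketch with a hypothetical helper name `discounted_returns`, shows the arithmetic for three rewards of 1.0 and a discount of 0.5.

```python
def discounted_returns(rewards, gamma):
    """Return R_t = r_t + gamma * R_{t+1} for every step, computed back to front."""
    returns = [0.0] * len(rewards)
    future_return = 0.0
    for t in reversed(range(len(rewards))):
        future_return = rewards[t] + gamma * future_return
        returns[t] = future_return
    return returns

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
```

Working backwards, the last step is worth 1.0, the middle step 1.0 + 0.5 * 1.0 = 1.5, and the first step 1.0 + 0.5 * 1.5 = 1.75.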

In [4]:
def main(baseline=False, gamma=0.99, swing=False):
    env = gym.make('CartPole-v0')
    in_dim = env.observation_space.shape[0]
    out_dim = env.action_space.n
    pi = PolicyNetwork(in_dim, 64, out_dim)
    optimizer = optim.Adam(pi.parameters(), lr = 0.01)
    
    for epi in range(300):
        state = env.reset()
        for t in range(200):
            action = pi.act(state)
            state, reward, done, _ = env.step(action)
            pi.rewards.append(reward)
            env.render()
            # With swing=True, ignore `done` and stop only when the cart nears the edge of the track.
            if swing:
                if state[0] < -4.7 or state[0] > 4.7:
                    break
            elif done:
                break
        # Record the episode score from the raw rewards, before any baseline is subtracted.
        total_reward = sum(pi.rewards)
        if baseline:
            b = sum(pi.rewards) / len(pi.rewards)
            pi.rewards = [(r - b) for r in pi.rewards]
            
        loss = train(pi, optimizer, gamma)
        solved = total_reward > 195.0
        pi.onpolicy_reset()
        print(f'Episode: {epi}')
        print(f'Total Reward: {total_reward}')
        print(f'Solved: {solved}')
        print('')
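
One thing to keep in mind about the `baseline` option: CartPole gives a reward of 1.0 on every step, so subtracting the mean per-step reward leaves all zeros, and mean-centered rewards always sum to zero, so any episode score has to come from the raw rewards. The sketch below illustrates this and, as an assumption on my part rather than the notebook's method, shows a common alternative of centering the discounted returns instead. The helper names are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return R_t = r_t + gamma * R_{t+1} for every step."""
    returns, future_return = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        future_return = rewards[t] + gamma * future_return
        returns[t] = future_return
    return returns

def center(values):
    """Subtract the mean from every value (a simple constant baseline)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

rewards = [1.0, 1.0, 1.0, 1.0]  # CartPole-style: +1 on every step
centered_rewards = center(rewards)  # all zeros: no learning signal left
centered_returns = center(discounted_returns(rewards, gamma=0.5))
print(centered_rewards)  # [0.0, 0.0, 0.0, 0.0]
print(centered_returns)  # [0.34375, 0.21875, -0.03125, -0.53125]
```

Centering the returns keeps a signal: early steps, which were followed by more reward, end up above the baseline and late steps below it.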
        

In [5]:
main(gamma=0.99, swing=True)



Episode: 0
Total Reward: 13.0
Solved: False

Episode: 1
Total Reward: 11.0
Solved: False

Episode: 2
Total Reward: 10.0
Solved: False

Episode: 3
Total Reward: 13.0
Solved: False

Episode: 4
Total Reward: 30.0
Solved: False

Episode: 5
Total Reward: 12.0
Solved: False

Episode: 6
Total Reward: 14.0
Solved: False

Episode: 7
Total Reward: 10.0
Solved: False

Episode: 8
Total Reward: 9.0
Solved: False

Episode: 9
Total Reward: 19.0
Solved: False

Episode: 10
Total Reward: 10.0
Solved: False

Episode: 11
Total Reward: 11.0
Solved: False

Episode: 12
Total Reward: 9.0
Solved: False

Episode: 13
Total Reward: 11.0
Solved: False

Episode: 14
Total Reward: 13.0
Solved: False

Episode: 15
Total Reward: 17.0
Solved: False

Episode: 16
Total Reward: 10.0
Solved: False

Episode: 17
Total Reward: 19.0
Solved: False

Episode: 18
Total Reward: 9.0
Solved: False

Episode: 19
Total Reward: 13.0
Solved: False

Episode: 20
Total Reward: 17.0
Solved: False

Episode: 21
Total Reward: 20.0
Solved: False

E