### Reinforce 

This is a TensorFlow implementation of Reinforce as presented in Reinforcement Learning: An Introduction (Sutton and Barto).

![Image](images/reinforce_performance.png)

### Imports

First, we import the necessary libraries:

- TensorFlow for training.
- NumPy for handling matrices.
- Matplotlib for graphs.
- Gym for the environment we are going to interact with.

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plot
import gym
# Fix the random seeds for reproducibility (TensorFlow 1.x API).
tf.random.set_random_seed(0)
np.random.seed(0)

### Reinforce Implementation 

The init function specifies the model, which is a network of size (4,24,24,2) with relu activations for the hidden layers and softmax for the output. Softmax is used for the output because Reinforce learns a probabilistic mapping from states to actions, and softmax turns the network outputs into a probability distribution over the actions.
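
To make the role of softmax concrete, here is a small NumPy sketch, separate from the Keras model, that turns made-up output scores into action probabilities and samples an action from them. The `softmax` helper, the `rng` generator and the `logits` values are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

rng = np.random.RandomState(0)

# Made-up raw scores for the two CartPole actions.
logits = np.array([1.0, 2.0])
probs = softmax(logits)

# The probabilities are positive and sum to one, so they can be used
# directly to sample an action.
action = rng.choice(len(probs), p=probs)
```

Sampling from the probabilities, rather than always taking the most likely action, is what keeps the policy stochastic.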

The optimizer used is AdamOptimizer, the reason being that it works. Once I know more about optimizers, I'll add more here.

Note that the loss specified is the categorical cross entropy. This might seem weird, but mathematically it works out the same as the policy gradient theorem when the action we took and want to update is represented as a one hot encoding.
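
As a quick numerical check of that claim, the sketch below computes the categorical cross entropy by hand in NumPy for a single made-up action distribution. The function name `categorical_cross_entropy` and the example probabilities are illustrative assumptions, not part of the notebook. With a one hot target only the term for the chosen action survives, which leaves -log π(a|s).

```python
import numpy as np

def categorical_cross_entropy(one_hot_action, action_probs):
    # Hypothetical helper: cross entropy between a one hot target and
    # the policy's action probabilities, -sum_i y_i * log(p_i).
    return -np.sum(one_hot_action * np.log(action_probs))

# Made-up policy output for one state with two actions.
action_probs = np.array([0.25, 0.75])
action = 1
one_hot_action = np.eye(2)[action]

loss = categorical_cross_entropy(one_hot_action, action_probs)
log_prob = np.log(action_probs[action])

# The cross entropy equals the negative log-probability of the action taken.
print(np.isclose(loss, -log_prob))
```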

The return is passed to model.fit as a sample weight, and it is kept positive. model.fit aims to minimise the loss, and the categorical cross entropy of a one hot action is already -log π(a|s), so weighting it by the return G gives the loss -G log π(a|s). Minimising that is the same as maximising G log π(a|s), which is what Reinforce wants. Negating the return as well would flip the sign a second time and lower the probability of actions that led to positive returns.

In [None]:
class Reinforce(object):
    # Uses the TensorFlow 1.x API (tf.train.AdamOptimizer).
    
    def __init__(self, state_size, action_size, lr , y ):
        self.state_size = state_size
        self.action_size = action_size
        self.actions = np.array(range(self.action_size))
        # Policy network: state -> probability of each action.
        self.model = tf.keras.Sequential([
            tf.keras.layers.Dense(24, activation = tf.keras.activations.relu, input_shape = self.state_size),
            tf.keras.layers.Dense(24, activation = tf.keras.activations.relu),
            tf.keras.layers.Dense(self.action_size, activation = tf.keras.activations.softmax),
        ])
        
        self.model.compile(optimizer = tf.train.AdamOptimizer(lr),
                          loss = tf.keras.losses.CategoricalCrossentropy(),
                          metrics = ['accuracy'])
        self.y = y #our discount factor
        
    def train(self, experience):
        # experience is [states, actions, rewards] for one episode.
        # The actions are one hot encoded so that the cross entropy is
        # -log pi(a|s), and each sample is weighted by its return.
        return self.model.fit(experience[0],
                       self.one_hot(experience[1]), 
                       sample_weight = np.squeeze(self.reward_to_return(experience[2])),
                       batch_size = experience[0].shape[0],
                       verbose = 1)
        
    def reward_to_return(self, reward):
        # Discounted return for every step, computed backwards from the
        # end of the episode: G_t = r_t + y * G_{t+1}.
        rtn = np.zeros((reward.shape))
        current_rtn = 0
        for i in reversed(range(np.max(reward.shape))):       
            rtn[i] = reward[i] + (self.y * current_rtn)
            current_rtn = rtn[i]
        return rtn
    
    def one_hot(self, actions):
        one_hot = np.zeros((np.max(actions.shape),self.action_size))
        for num,i in enumerate(actions):
            one_hot[num][i] = 1.0
        return one_hot     
    
    def predict(self,state):
        return np.random.choice(self.actions, p = self.model.predict(state)[0])
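
To see what the recursion in `reward_to_return` computes, here is a standalone NumPy sketch of the same backward pass, G_t = r_t + y * G_{t+1}, applied to a short made-up episode. The function `discounted_returns` is a hypothetical stand-in written for this example, it returns the plain discounted returns, and the rewards are invented for illustration.

```python
import numpy as np

def discounted_returns(rewards, discount):
    # Hypothetical standalone version of the backward recursion:
    # walk the episode from the last step to the first.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

# Three steps with reward 1 each and a discount factor of 0.5.
rewards = np.array([1.0, 1.0, 1.0])
returns = discounted_returns(rewards, 0.5)
# From first step to last, the returns are 1.75, 1.5 and 1.0.
```

The recursion relies on the rewards being in chronological order, with the last step of the episode at the end of the array.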
        


### Memory
This buffer stores the states, actions and rewards observed during an episode.

In [None]:
class MemoryBuffer(object):
    
    def __init__(self):
        self.first_run = True 
    
    def add_to_buffer(self, e):
        """
        The experience (e) added is in form: 
        [state, action, reward]
        Each element is stacked row-wise in the order it was observed,
        so row t of every array belongs to time step t of the episode.
        """
        if not self.first_run:
            for num, i in enumerate(e):
                # Append below the existing rows to keep chronological order,
                # which reward_to_return relies on.
                self.experience[num] = np.vstack([self.experience[num], i])
        else:
            # First step of an episode: start fresh arrays.
            self.experience = [np.array(e[0]),np.array(e[1]),
                               np.array(e[2]),
                              ]
            self.first_run = False
        
    def get_batch(self):
        self.first_run = True 
        return self.experience
    


### Value network
This is a placeholder for when I add in baselines. The actual algorithm dictates that the value network should be updated based on the observed return. I know this kind of update is problematic for DQN, so I might create an experience replay buffer to aid the stability of learning the value network.
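
For context, a baseline in Reinforce is subtracted from the return before the return weights the policy update. The sketch below shows only that subtraction in NumPy with made-up numbers. The `values` array is a hypothetical stand-in for the predictions a value network might produce, and none of these names come from the notebook.

```python
import numpy as np

# Made-up discounted returns for a three step episode.
returns = np.array([1.75, 1.5, 1.0])
# Hypothetical value estimates V(s_t) for the same three states.
values = np.array([1.5, 1.5, 1.5])

# With a baseline, each step is weighted by how much better or worse
# the return was than expected, rather than by the raw return.
advantages = returns - values
```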

In [None]:
class ValueNetwork(object):
    
    def __init__(self, state_size, lr):
        # output is 1 because regression
        self.model = tf.keras.Sequential([
            tf.keras.layers.Dense(24, activation = tf.keras.activations.relu, input_shape = state_size),
            tf.keras.layers.Dense(24, activation = tf.keras.activations.relu),
            tf.keras.layers.Dense(1, activation = tf.keras.activations.linear)
        ])
        self.model.compile(optimizer = tf.train.RMSPropOptimizer(lr),
                          loss = tf.losses.huber_loss,
                          metrics = ['accuracy'])
    
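    def train(self, experience, update):
        # update holds the regression targets, i.e. the observed returns
        # for the states in experience[0].
        return self.model.fit(experience[0],update, verbose = 0 , batch_size = 32)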
        


### running_average

This is just to remove noise from the observed returns. Makes it easier to see if the algorithm is improving when doing hyperparameter tuning. 

In [None]:
def running_average(data):
    new_data =[]
    for i in range(len(data)):
        new_data.append(np.average(data[max(0, i - 10):i+10]))
    return new_data
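
As a small illustration of what this smoothing does, the sketch below repeats the function so that it runs on its own and applies it to a made-up, noisy sequence. Note that the window for index i covers indices i-10 to i+9, so it is a centred moving average rather than a trailing one. The `noisy` data is invented for illustration.

```python
import numpy as np

def running_average(data):
    # Same windowing as the function above: indices i-10 up to i+9,
    # clipped at the start and end of the data.
    new_data = []
    for i in range(len(data)):
        new_data.append(np.average(data[max(0, i - 10):i + 10]))
    return new_data

# Made-up returns that alternate between 0 and 2, so the mean is 1.
noisy = [0.0, 2.0] * 20
smoothed = running_average(noisy)

# The smoothed values stay close to 1 even though the raw data
# jumps between 0 and 2.
```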

### Run the environment. 
Here the environment is run with the hyperparameters that produced the results at the top of the page.

In [None]:
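env = gym.make('CartPole-v0')
env.seed(0)
# Arguments: state size, number of actions, learning rate, discount factor.
policy = Reinforce((4,),2, 0.0005, 0.5)
episode_memory = MemoryBuffer()
rewards = []  # total reward of every episode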

In [None]:
episodes = 4000

for _ in range(episodes):
    done = False
    reward = 0  # total reward collected in this episode
    state = np.array([env.reset()])
    actions = [0,0]  # how often each action was chosen
    while not done: 
        action = policy.predict(state)
        actions[action] += 1
        next_state, r, done, _ = env.step(action)
        reward += r
        episode_memory.add_to_buffer([state,action,r])
        state = np.array([next_state])
    # Update the policy once per episode, using the whole episode.
    experience = episode_memory.get_batch()
    policy.train(experience)
    print(reward)
    print(actions)
    rewards.append(reward)



    
    

In [None]:
plot.plot(running_average(rewards))
plot.ylabel('Averaged reward')
plot.xlabel('Episodes')
plot.title('Reinforce for CartPole')
plot.show()