# A2C Solution to OpenAI Gym LunarLander-v2 environment

## Introduction:
    
### OpenAI Gym Environment Description:
"Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine."

### Discussion
Because the state space is continuous and varied, discretising it at a sufficiently high resolution would produce an impractically large number of possible states. A conventional Q-table solution is therefore impractical.
An actor-critic method is used instead, training two neural networks: an 'actor' network that chooses the action, and a 'critic' network that estimates the value (the expected discounted future reward) of a state.
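
To give a rough feel for why a Q-table is impractical, the sketch below counts table entries for a uniform discretisation. It assumes the 8-element state consists of six continuous components and two binary leg-contact flags; the helper name `table_size` and the bin counts are illustrative only.

```python
def table_size(bins_per_dim, continuous_dims=6, binary_dims=2, num_actions=4):
    """Number of Q-table entries for a uniform discretisation.

    Assumption: six continuous state components and two binary
    leg-contact flags, with four discrete actions.
    """
    num_states = (bins_per_dim ** continuous_dims) * (2 ** binary_dims)
    return num_states * num_actions


for bins in (5, 10, 20):
    print(bins, "bins per dimension ->", table_size(bins), "Q-values")
```

Even a coarse 20-bin grid already needs about a billion entries, most of which would never be visited during training.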

Keras, as a frontend for Tensorflow, is used to create and train the neural networks. 

As 'solved' is considered 200 points averaged over 100 episodes, the networks will be trained and optimized until the average over the last 100 episodes exceeds this value. 
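
The 'solved' criterion can be sketched as a small check on the list of per-episode rewards. The helper names `rolling_average` and `is_solved` are hypothetical and are not used elsewhere in this notebook.

```python
def rolling_average(rewards, window=100):
    """Mean of the most recent `window` rewards, or None if too few episodes."""
    if len(rewards) < window:
        return None
    recent = rewards[-window:]
    return sum(recent) / window


def is_solved(rewards, threshold=200.0, window=100):
    """True once the rolling average over `window` episodes exceeds `threshold`."""
    average = rolling_average(rewards, window)
    return average is not None and average > threshold


print(is_solved([250.0] * 99))   # fewer than 100 episodes so far
print(is_solved([250.0] * 100))
```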


In [1]:
#Import the gym, keras, numpy and other libraries we will require

import gym
import gym.spaces
import gym.wrappers
import numpy as np
import matplotlib.pyplot as plt
import random
import pickle
import time

from collections import deque
from keras.layers import Flatten, Dense
from keras import backend as K
from keras.models import Sequential, Model, load_model
from keras import optimizers
from keras.layers.advanced_activations import LeakyReLU
from multiprocessing import Pool, freeze_support

Using TensorFlow backend.


### Creating the models

The model-building functions take the layer sizes as a parameter, so that networks of different sizes can be compared. 
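
When comparing network sizes it can help to know how many trainable parameters each configuration has. The sketch below counts weights and biases for a stack of fully connected layers; `dense_param_count` is a hypothetical helper, not part of Keras (calling `model.summary()` on a built model reports the same kind of totals).

```python
def dense_param_count(num_inputs, hidden_sizes, num_outputs):
    """Weights plus biases for a fully connected network."""
    layer_sizes = [num_inputs, *hidden_sizes, num_outputs]
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Actor (4 outputs) and critic (1 output) for two candidate sizes
for size in ([256], [64, 64, 64]):
    print(size, dense_param_count(8, size, 4), dense_param_count(8, size, 1))
```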

Adam is used as the optimizer, as it has proven efficient on prior problems.


In [2]:
def build_model_critic(num_input_nodes, num_output_nodes, lr = 0.001, size = [256]):
	
	model = Sequential()
	
	model.add(Dense(size[0], input_shape = (num_input_nodes,), activation = 'relu'))
	
	for i in range(1,len(size)):
		model.add(Dense(size[i], activation = 'relu'))
	
    
	model.add(Dense(num_output_nodes, activation = 'linear')) 
	
	adam = optimizers.Adam(lr=lr, beta_1=0.9, beta_2=0.999)
	
	model.compile(loss = 'mse', optimizer = adam)
	
	#print('Critic Model Summary:')
	#model.summary()
	
	return model

def build_model_actor(num_input_nodes, num_output_nodes, lr = 0.001, size = [256]):
	
	model = Sequential()
	
	model.add(Dense(size[0], input_shape = (num_input_nodes,), activation = 'relu'))
	
	for i in range(1, len(size)):
		model.add(Dense(size[i], activation = 'relu'))
	
	model.add(Dense(num_output_nodes, activation = 'softmax')) 
	
	adam = optimizers.Adam(lr=lr, beta_1=0.9, beta_2=0.999)
	
	model.compile(loss = 'categorical_crossentropy', optimizer = adam)
	
	#print('Actor Model Summary:')
	#model.summary()
	
	return model



### Deciding on an Action

The action space is very simple: one of 4 possible actions (do nothing, or fire the left, right or main engine). The action is sampled at random from the 4 actions, with each action's probability of being chosen equal to the probability the actor network assigns to it being the optimal action. This inherently encourages exploration in the early stages of training, and shifts towards exploitation as the network becomes more confident. 


In [3]:
def decide_action(actor, state):

	flat_state = np.reshape(state, [1,8])
	action = np.random.choice(4, 1, p = actor.predict(flat_state)[0])[0]
	
	return(action)
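
To illustrate how this sampling moves from exploration to exploitation, the sketch below draws actions from two made-up probability vectors using only the standard library (`random.choices` stands in for `np.random.choice`). The `early_policy` and `late_policy` values are invented for illustration and are not actor outputs.

```python
import random
from collections import Counter


def sample_action(probabilities, rng):
    """Draw one action index with probability proportional to `probabilities`."""
    actions = list(range(len(probabilities)))
    return rng.choices(actions, weights=probabilities, k=1)[0]


rng = random.Random(0)  # seeded so the run is repeatable
early_policy = [0.25, 0.25, 0.25, 0.25]  # untrained: all actions equally likely
late_policy = [0.02, 0.03, 0.90, 0.05]   # confident: one action dominates

for name, policy in (("early", early_policy), ("late", late_policy)):
    counts = Counter(sample_action(policy, rng) for _ in range(10000))
    freqs = [counts[a] / 10000 for a in range(4)]
    print(name, freqs)
```

With the uniform vector every action is tried roughly equally often, while the peaked vector selects the preferred action most of the time but still occasionally tries the others.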




### Running episodes

The simulation is run for a predefined number of episodes.

For each step, the state, action, reward, resulting state and whether or not the step completed the episode (the boolean 'done') are saved in a list 'memory'.

For each episode, the total reward is saved in an array 'totrewardarray'.

Each episode is limited to 1000 timesteps, to cut short scenarios where the lander (which has infinite fuel) refuses to land in the early stages of training.

The episodes run until the predefined number of episodes is completed; the problem is considered solved once the average total reward of the last 100 episodes exceeds 200. 





In [4]:
def run_episode(env, actor, r = False):
    
    memory = []
    
    bestyet = float('-inf')
            
    state = env.reset()

    episode_reward = 0

    cnt = 0 

    done = False

    while not done and cnt <1000:

        cnt += 1

        if r:
            env.render()

        action = decide_action(actor, state)
        observation, reward, done, _ = env.step(action)  

        episode_reward += reward

        state_new = observation 

        memory.append((state, action, reward, state_new, done))

        state = state_new 

    return(memory, episode_reward)


### Training the Networks

Now the memory list gathered from running an episode is passed to a training function, which trains the networks. 

The training data is shuffled so it is not presented to the networks in order. 

The discount factor, 'gamma', is another hyperparameter that will need to be optimised. 
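
The role of gamma can be seen in a small worked example of the one-step target and advantage that the training function computes for each stored step. This is a minimal sketch with plain floats; `td_target_and_advantage` is a hypothetical helper and the numbers are invented.

```python
def td_target_and_advantage(reward, value, next_value, gamma, done):
    """One-step TD target for the critic and advantage for the actor.

    On a terminal step there is no next state, so the target is just
    the reward; otherwise the discounted next-state value is added.
    """
    target = reward if done else reward + gamma * next_value
    advantage = target - value
    return target, advantage


# Mid-episode step: the critic expected 10.0, the next state looks worth 12.0
print(td_target_and_advantage(1.0, 10.0, 12.0, gamma=0.999, done=False))

# Terminal step (for example a crash with reward -100)
print(td_target_and_advantage(-100.0, -20.0, 0.0, gamma=0.999, done=True))
```

A positive advantage means the action turned out better than the critic expected, a negative one that it turned out worse.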



In [5]:
def train_models(actor, critic, memory, gamma):

	random.shuffle(memory)
	
	for i in range(len(memory)):

		state, action, reward, state_new, done = memory[i]
			
		flat_state_new = np.reshape(state_new, [1,8])
		flat_state = np.reshape(state, [1,8])
		
		target = np.zeros((1, 1))
		advantages = np.zeros((1, 4))

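		# critic.predict returns a (1, 1) array; take the scalar estimates
		value = critic.predict(flat_state)[0][0]
		next_value = critic.predict(flat_state_new)[0][0]

		if done:
			# Terminal step: there is no future reward to bootstrap from
			advantages[0][action] = reward - value
			target[0][0] = reward
		else:
			# TD target: immediate reward plus discounted value of the next state
			advantages[0][action] = reward + gamma * (next_value) - value
			target[0][0] = reward + gamma * next_value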
		
		actor.fit(flat_state, advantages, epochs=1, verbose=0)
		critic.fit(flat_state, target, epochs=1, verbose=0)		




### Running episodes without training

Sometimes we might want to run episodes without saving data for training, for instance to render a few episodes of the trained network, or to assess the performance of a trained network. This is simply a modification of the 'run_episode' function. 

It includes a render option (boolean 'r') which turns on or off rendering the episode. 


In [6]:
def play_game(iters, r = True):
    env = gym.make('LunarLander-v2')
    totalrewardarray = []
    for i in range(iters):
    
        state = env.reset()
        totalreward = 0
        cnt = 0

        done = False

        while not done and cnt <1000:

            cnt += 1

            if r:
                import PIL
                PIL.Image.fromarray(env.render(mode='rgb_array')).resize((320, 420))

            action = decide_action(actor, state)

            observation, reward, done, _ = env.step(action)  

            totalreward += reward

            state_new = observation 

            state = state_new
            
        totalrewardarray.append(totalreward)

    return totalrewardarray



### Putting it all together

With the necessary building blocks in place, it is time to run some episodes and see how it performs. 
This function runs the episodes, trains the models on the episode data, and calculates the average performance over the previous 100 episodes. If the average performance is the best yet, it saves the models. Finally, it plots the average reward against the number of episodes used for training. 




In [None]:
def run_train_plot(alr, clr, gamma, numepisodes):
    
    env = gym.make('LunarLander-v2')
  
    i = 0

    actor = build_model_actor(num_input_nodes = 8, num_output_nodes = 4, lr = alr, size = [64,64,64])
    critic = build_model_critic(num_input_nodes = 8, num_output_nodes = 1, lr= clr, size = [64,64,64])

    totrewardarray = [] #For storing the total reward from each episode

    best = float('-inf') #For storing the best rolling average reward

    episodes = len(totrewardarray) #Counting how many episodes have passed

    while episodes < numepisodes:   

        i+= 1

        memory, episode_reward = run_episode(env, actor, r = False)

        totrewardarray.append(episode_reward)

        episodes = len(totrewardarray)

        if episodes >= 100:
            score = np.average(totrewardarray[-100:])  # average over the last 100 episodes
            if score > best:
                best = score
                actor.save('actormodel.h5')
                critic.save('criticmodel.h5')
            if episodes%500==0:
                print('ALR:', alr, ' CLR:', clr, 'episode ', episodes, 'of',numepisodes, 'Average Reward (last 100 eps)= ', score)

        train_models(actor, critic, memory, gamma)

    # Build the rolling-average curve once training has finished
    avgarray = []
    cntarray = []

    for i in range(100,len(totrewardarray),10):
        avgarray.append(np.average(totrewardarray[i-100:i]))
        cntarray.append(i)

    plt.plot(cntarray, avgarray, label = 'Best 100 ep av. reward = '+str(best))
        
    plt.title('Rolling Average (previous 100) vs Episodes')
    plt.xlabel('Episodes')
    plt.ylabel('Reward')
    plt.legend(loc='best')
    
    plt.show()
        
    



A grid search (not shown) found that the following hyperparameters produced an average per-episode reward of more than 200 over 100 episodes after fewer than 5000 episodes of training:

- Actor learning rate: 5e-6
- Critic learning rate: 5e-4
- Gamma: 0.999
- Neural network size: [64,64,64] (both networks)
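
The grid search itself is not shown in this notebook. One possible shape for such a search is sketched below, under stated assumptions: `grid_search` and `toy_score` are hypothetical, and `toy_score` is an arbitrary stand-in objective, not a measured result. In practice the objective would train a model and return its best rolling average, which `run_train_plot` as written does not return.

```python
import itertools
import math


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return the best-scoring one."""
    names = list(grid)
    best_score, best_params = float("-inf"), None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score


def toy_score(alr, clr, gamma):
    """Arbitrary stand-in objective, peaked at alr=1e-5, clr=1e-3, gamma=0.99."""
    return -((math.log10(alr) + 5) ** 2
             + (math.log10(clr) + 3) ** 2
             + (gamma - 0.99) ** 2)


grid = {
    "alr": [1e-6, 1e-5, 1e-4],
    "clr": [1e-4, 1e-3, 1e-2],
    "gamma": [0.9, 0.99, 0.999],
}
best_params, best_score = grid_search(toy_score, grid)
print(best_params)
```

An exhaustive grid grows multiplicatively with each hyperparameter (27 training runs here), which matters when a single run takes thousands of episodes.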

Note: If you do not wish to train the model from scratch, do not run the next cell. Skip ahead to the 'Reviewing Performance' section, which loads the best saved weights (included in the repository). 

In [None]:
run_train_plot(5e-6, 5e-4, 0.999, 5000)


WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
ALR: 5e-06  CLR: 0.0005 episode  500 of 5000 Average Reward (last 100 eps)=  -186.472727999
ALR: 5e-06  CLR: 0.0005 episode  1000 of 5000 Average Reward (last 100 eps)=  -191.85582657
ALR: 5e-06  CLR: 0.0005 episode  1500 of 5000 Average Reward (last 100 eps)=  -139.905073712
ALR: 5e-06  CLR: 0.0005 episode  2000 of 5000 Average Reward (last 100 eps)=  -116.953608546


It is clear that the performance of the model reached a plateau after approximately 4000 episodes; it is unlikely that training over more episodes would improve performance further. 

### Reviewing Performance

Now let's assess the performance of the trained model. Firstly, let's load the weights of the model with the best recorded performance during training:

In [None]:
#Load the saved model at its best performance
actor=load_model('actormodel.h5')
critic=load_model('criticmodel.h5')

Now let's test the model over many episodes and view the distribution of rewards received per episode. 
Rendering has been disabled as it is very slow; if you wish to watch, setting r=True will render the graphics (you may want to reduce the number of iterations). Note that rendering may require additional or different libraries to be installed. 

In [None]:
rewards = play_game(iters = 1000, r = False)
plt.hist(rewards, 40, rwidth=0.8)
plt.title('Performance of trained models')
plt.xlabel('Episode reward')
plt.ylabel('Number of Occurrences')
plt.show()

Note that while the average performance is reasonably good, there is still a small fraction of episodes which result in unsatisfactory landings. 
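
To put a number on that fraction, the rewards list can be summarised as sketched below. `summarise_rewards` is a hypothetical helper, treating anything under 200 points as 'unsatisfactory' is an assumption, and `example_rewards` is placeholder data rather than output from the trained model.

```python
def summarise_rewards(rewards, threshold=200.0):
    """Mean, worst case, and fraction of episodes scoring below `threshold`."""
    count = len(rewards)
    below = sum(1 for reward in rewards if reward < threshold)
    return {
        "mean": sum(rewards) / count,
        "min": min(rewards),
        "fraction_below": below / count,
    }


# Placeholder values only; in the notebook, pass the list returned by play_game
example_rewards = [250.0, 230.0, -50.0, 280.0]
summary = summarise_rewards(example_rewards)
print(summary)
```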