**Initialise**

In [3]:
# The final submission is a deep-learning Advantage Actor-Critic network (A2C). I also attempted a genetic RL algorithm as in
# https://arxiv.org/pdf/1712.06567.pdf, which didn't work out as well as in the paper given the platform constraints, but I have
# left the code in for posterity (see the notes at the bottom of this file for more on that).
# I chose A2C because it combines value-based learning (estimating the expected return from the current state) with policy-based
# learning (building up knowledge of a policy). Rewards in Gravitar are sparse, so plain policy-gradient approaches aren't great:
# the agent can take bad (inefficient or stupid) actions and still end up with a decent score at the end of the run, so
# differentiating between action types isn't that easy.
# I originally attempted a Duelling DQN as a value-based learner but found it incredibly slow, and its high variability caused
# issues. It took a very long time to train and didn't seem to work well with the preprocessing code that was added to try to
# speed up learning.
# The critic estimates the value of the current state (one scalar per state in this implementation) and the actor modifies its
# policy based on that. Using an advantage (the TD target minus the critic's estimate for the state) subtracts a per-state
# baseline, which reduces the variability.
# My implementation uses the image-based Gravitar and therefore a convolutional net rather than linear layers only. The CNN has
# been kept quite small to keep training time reasonable, but this may be at the expense of long-term performance.
# My implementation has a couple of strange quirks. Firstly, using Softmax as in the referenced code caused a lot of numerical
# stability issues with the categorical distribution. Tweaking this to Softmax(x - max(x)) to try to prevent underflow or overflow
# didn't work. I also added gradient clipping and lowered the learning rate, and the issues persisted, which suggested this wasn't
# gradient explosion. Eventually the fix was to use Sigmoid, which has roughly the same shape but constrains each output to
# between 0 and 1, so the input to the categorical distribution is never < 0.
# Secondly, I used MSE loss instead of smooth L1 loss. This was mostly while experimenting with the gradient issues, but I found
# that it produced more consistent results too.
# The aim of the preprocessing is as in https://www.nature.com/articles/nature14236: prevent sprite flickering (an Atari
# limitation) by taking the max over consecutive frames, downsample the Atari screen to make processing faster (and make it
# grayscale since the net doesn't benefit from colour), and stack frames so the agent can learn from motion, which is useful
# because a lot of movement occurs in Gravitar. Also, OpenAI Gym annoyingly returns numpy arrays as HWC rather than the CHW
# order PyTorch expects, so the axes are rearranged.
# The A2C core code is heavily based on https://github.com/seungeunrho/minimalRL/blob/master/actor_critic.py, MIT License
# Preprocessing code is based on https://github.com/philtabor/Deep-Q-Learning-Paper-To-Code/blob/master/DQN/utils.py, MIT License
# ESDQN code is based on https://github.com/atgambardella/pytorch-es/blob/master/model.py, MIT License

# imports
import gym
import collections
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import cv2
import math
import copy

# hyperparameters
learning_rate = 0.00005
gamma         = 0.98
buffer_limit  = 10000
batch_size    = 32
video_every   = 25
print_every   = 5
n_rollout = 10

"""
Start preprocessing code. This is very generic stuff so it's basically just a stripped back version of the repository mentioned in the header
"""
class MaxFrameWrapper(gym.Wrapper):
  def __init__(self, env, repeat):
      super(MaxFrameWrapper, self).__init__(env)
      self.repeat = repeat
      self.shape = env.observation_space.low.shape
      self.frame_buffer = np.zeros((2, ) + self.shape, dtype = np.uint8)

  def step(self, action):
      t_reward = 0.0
      done = False
      for i in range(self.repeat):
          obs, reward, done, info = self.env.step(action)
          t_reward += reward
          self.frame_buffer[i % 2] = obs
          if done:
              break

      max_frame = np.maximum(self.frame_buffer[0], self.frame_buffer[1])
      return max_frame, t_reward, done, info

  def reset(self):
      obs = self.env.reset()
      self.frame_buffer = np.zeros((2, ) + self.shape, dtype = np.uint8)
      self.frame_buffer[0] = obs
      return obs

class PreprocessFrameWrapper(gym.ObservationWrapper):
  def __init__(self, env, shape):
      super(PreprocessFrameWrapper, self).__init__(env)
      self.shape = (shape[2], shape[0], shape[1])
      self.observation_space = gym.spaces.Box(low = 0.0, high = 1.0, shape = self.shape, dtype = np.float32)

  def observation(self, obs):
      new_frame = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
      resized_screen = cv2.resize(new_frame, self.shape[1:], interpolation=cv2.INTER_AREA)
      # Convert to float32 before scaling so the result matches the float32 observation_space
      new_obs = np.array(resized_screen, dtype=np.float32).reshape(self.shape)
      new_obs = new_obs / 255.0
      return new_obs

class StackFramesWrapper(gym.ObservationWrapper):
  def __init__(self, env, repeat):
      super(StackFramesWrapper, self).__init__(env)
      self.observation_space = gym.spaces.Box(env.observation_space.low.repeat(repeat, axis=0), env.observation_space.high.repeat(repeat, axis=0), dtype=np.float32)
      self.stack = collections.deque(maxlen = repeat)

  def reset(self):
      self.stack.clear()
      observation = self.env.reset()
      for _ in range(self.stack.maxlen):
          self.stack.append(observation)

      return np.array(self.stack).reshape(self.observation_space.low.shape)

  def observation(self, observation):
      self.stack.append(observation)

      return np.array(self.stack).reshape(self.observation_space.low.shape)

def wrap_env(env):
  env = MaxFrameWrapper(env, 4)
  env = PreprocessFrameWrapper(env, (84,84,1))
  env = StackFramesWrapper(env, 4)
  return env
"""
End preprocessing code
"""
class ActorCritic(nn.Module):
  def __init__(self, env, learn_rate, gamma):
    super(ActorCritic, self).__init__()
    self.learn_rate = learn_rate
    self.gamma = gamma
    self.input_size = env.observation_space.shape
    self.output_size = env.action_space.n
    self.conv_block = nn.Sequential(
            nn.Conv2d(self.input_size[0], 16, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),
            nn.LeakyReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1),
            nn.LeakyReLU()
        )
    self.feature_size = self.conv_block(torch.autograd.Variable(torch.zeros(1, *self.input_size))).view(1, -1).shape[1]
    self.critic_linear = nn.Linear(self.feature_size, 1)
    self.actor_linear = nn.Linear(self.feature_size, self.output_size)
    self.sigmoid = nn.Sigmoid()
    self.data = []
    self.optimizer = torch.optim.Adam(self.parameters(), lr=learn_rate)

  def load(self, checkpoint):
    params = torch.load('drive/My Drive/rl_training/ac_' + checkpoint + '.chkpt')
    self.load_state_dict(params['model'])
    self.optimizer.load_state_dict(params['optimizer'])

  def save(self, checkpoint):
    torch.save({'model': self.state_dict(), 'optimizer': self.optimizer.state_dict()}, 'drive/My Drive/rl_training/ac_' + checkpoint + '.chkpt')

  def actor(self, x):
      x = self.conv_block(x)
      x = x.view(x.shape[0], -1)
      x = self.actor_linear(x)
      prob = self.sigmoid(x)
      return prob
  
  def critic(self, x):
      x = self.conv_block(x)
      x = x.view(x.shape[0], -1)
      r = self.critic_linear(x)
      return r
  
  def put_data(self, transition):
      self.data.append(transition)
      
  """
  make_batch and train_net are basically the same as in the original cited in the header, except for my changes to the loss function and gradient clipping
  """
  def make_batch(self):
      s_lst, a_lst, r_lst, s_prime_lst, done_lst = [], [], [], [], []
      for transition in self.data:
          s,a,r,s_prime,done = transition
          s_lst.append(s)
          a_lst.append([a])
          r_lst.append([r/100.0])
          s_prime_lst.append(s_prime)
          done_mask = 0.0 if done else 1.0
          done_lst.append([done_mask])
      
      s_batch, a_batch, r_batch, s_prime_batch, done_batch = torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst), \
                                                              torch.tensor(r_lst, dtype=torch.float), torch.tensor(s_prime_lst, dtype=torch.float), \
                                                              torch.tensor(done_lst, dtype=torch.float)
      self.data = []
      return s_batch, a_batch, r_batch, s_prime_batch, done_batch

  def train_net(self):
      s, a, r, s_prime, done = self.make_batch()
      td_target = r + self.gamma * self.critic(s_prime) * done
      delta = td_target - self.critic(s)
      
      pi = self.actor(s)
      pi_a = pi.gather(1,a)
      loss = -torch.log(pi_a) * delta.detach() + F.mse_loss(self.critic(s), td_target.detach())

      self.optimizer.zero_grad()
      loss.mean().backward()
      # Clip before the optimizer step; clipping after step() has no effect on the update
      torch.nn.utils.clip_grad_norm_(self.parameters(), 5)
      self.optimizer.step()


class GenomeQNet(nn.Module):
  def __init__(self, input_size, output_size):
    super(GenomeQNet, self).__init__()
    self.input_size = (input_size[2], input_size[0], input_size[1])
    self.output_size = output_size
    self.conv_block = nn.Sequential(
            nn.Conv2d(self.input_size[0], 16, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1),
            nn.ReLU()
        )
    self.feature_size = self.conv_block(torch.autograd.Variable(torch.zeros(1, *self.input_size))).view(1, -1).shape[1]
    self.linear_layer = nn.Sequential(
        nn.Linear(self.feature_size, 256),
        nn.ReLU()
    )
    self.out = nn.Linear(256, self.output_size)
    self.scores = []

  def forward(self, state):
    z = self.get_from_frame(state)
    features = self.conv_block(z)
    features = features.view(features.shape[0], -1)
    linear = self.linear_layer(features)
    out = self.out(linear)
    return out

  def get_genome(self):
    genome = torch.Tensor()
    for p in self.parameters():
      genome = torch.cat((genome, p.flatten()), 0)
    return genome

  def get_score(self):
    return torch.Tensor(self.scores).mean().item()

  def update_params(self, params):
    for i, p in enumerate(self.parameters()):
      p.data.copy_(params[i])

  def add_score(self, score):
    self.scores.append(score)

  def clear_scores(self):
    self.scores = []

  def sample_action(self, obs, epsilon):
    coin = random.random()
    if epsilon != -1 and coin < epsilon:
        # Explore over the whole action space, not just actions 0 and 1
        return random.randrange(self.output_size)
    else:
      with torch.no_grad():
        return self.forward(obs).argmax().item()

  def get_from_frame(self, frame):
    frame = torch.from_numpy(np.ascontiguousarray(frame, dtype=np.float32))
    return frame.unsqueeze(0).permute(0, 3, 1, 2).contiguous()

"""
This QNet is based on an implementation that was used in Evolutionary Strategies (ES)
here: https://github.com/atgambardella/pytorch-es/blob/master/model.py so I thought it might be more successful
with my GA due to the introduction of the LSTM cells for memory. It was better than the GenomeQNet but not enough
"""
class ESQNet(GenomeQNet):
  def __init__(self, input_size, output_size):
    super(ESQNet, self).__init__(input_size, output_size)
    self.input_size = (input_size[2], input_size[0], input_size[1])
    self.output_size = output_size
    self.conv_block = nn.Sequential(
            nn.Conv2d(self.input_size[0], 16, kernel_size=3, stride=2, padding=1),
            nn.SELU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),
            nn.SELU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1),
            nn.SELU()
        )
    self.feature_size = self.conv_block(torch.autograd.Variable(torch.zeros(1, *self.input_size))).view(1, -1).shape[1]
    self.lstm = nn.LSTMCell(self.feature_size, 256)
    self.out = nn.Linear(256, self.output_size)
    self.scores = []
    self.cx = torch.autograd.Variable(torch.zeros(1, 256))
    self.hx = torch.autograd.Variable(torch.zeros(1, 256))

  def forward(self, state):
    z = self.get_from_frame(state)
    features = self.conv_block(z)
    features = features.view(-1, self.feature_size)
    self.hx, self.cx = self.lstm(features, (self.hx, self.cx))
    features = self.hx
    out = F.softmax(self.out(features), dim = -1)
    return out

"""
Here's the container for my not so good GA algorithm.
I originally tried it with GenomeQNet but found I got better results by using
ESQNet which is similar except with the introduction of LSTMCells between the convolutional
and the action layer.
"""
class GeneticQNetContainer():
  def __init__(self, input_size, output_size, hyperparameters = {}):
    self._input_size = input_size
    self._output_size = output_size
    self._initial_qnet = ESQNet(input_size, output_size)
    self._genome_param_sizes = self.init_genome_param_sizes(self._initial_qnet)
    self._population = []
    self._best_individual = None
    self._evaluations_per_individual = 1
    self._mutation_chance = 0.1
    self._mutation_factor = 0.02
    self._tournament_size = 20
    self._cull_after = 0.5
    self._population_size = 100
    self._elapsed_generations = 0
    self._use_crossover = True
    self._hyperparameter_keys = {
        'evaluations_per_individual': None,
        'mutation_chance': None,
        'mutation_factor': None,
        'tournament_size': None,
        'cull_after': None,
        'population_size': None,
        'use_crossover': None
    }
    self.set_hyperparameters(hyperparameters)
    self.reset_remaining_tasks()
  
  def set_hyperparameters(self, hyperparameters):
    for p in hyperparameters:
      if p in self._hyperparameter_keys:
        setattr(self, '_' + p, hyperparameters[p])

  def no_tasks_left(self):
    return len(self._tasks_remaining_in_generation) == 0

  def reset_remaining_tasks(self):
    self._tasks_remaining_in_generation = []
    for _ in range(self._evaluations_per_individual):
      self._tasks_remaining_in_generation += list(range(0, self._population_size))
    random.shuffle(self._tasks_remaining_in_generation)

  def init_genome_param_sizes(self, qnet):
    s = []
    params = qnet.parameters()
    for p in params:
      s.append(p.shape)
    return s

  def get_genome_size(self):
    size = 0
    for s in self._genome_param_sizes:
      size += s.numel()
    return size

  """
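  Breed 2 nets using crossover on the flattened genomes: the child takes net0's genes
  between two random cut points and net1's genes everywhere else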
  """
  def breed(self, net0_params, net1_params):
    child = torch.zeros(net0_params.shape.numel())
    a = random.randrange(0, net0_params.shape.numel() + 1)
    b = random.randrange(0, net1_params.shape.numel() + 1)
    min_ab = min(a, b)
    max_ab = max(a, b)
    child[min_ab:max_ab] = net0_params[min_ab:max_ab]
    gapped = net1_params.clone()
    gapped[min_ab:max_ab] = torch.zeros(net1_params.shape.numel())[min_ab:max_ab]
    child = child + gapped
    return child

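  def tournament_pick(self, choosable_genomes):
    # choosable_genomes is ordered best-first (it is built from the ranked, culled population),
    # so the lowest sampled index is the fittest contestant. Indexing self._population here
    # would compare the wrong individuals because that list is not in ranked order.
    tournament_size = min(self._tournament_size, len(choosable_genomes))
    best = None
    for _ in range(tournament_size):
        candidate = random.randrange(0, len(choosable_genomes))
        if best is None or candidate < best:
            best = candidate
    return choosable_genomes[best]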

  def rank_population(self, population):
    sorted_pop = sorted(population, key=lambda x: x.get_score(), reverse=True)
    return sorted_pop

  def cull_individuals_with_worst_genomes(self, ordered_population):
    cull_point = int(math.ceil(len(ordered_population) * self._cull_after))
    return ordered_population[:cull_point]

  def get_elite(self, ordered_population):
    return ordered_population[0]

  def breed_population_genomes(self, choosable_genomes, number):
    children = []
    for i in range(number):
        parent0 = self.tournament_pick(choosable_genomes)
        parent1 = self.tournament_pick(choosable_genomes)
        children.append(self.breed(parent0, parent1))
    return children

  """
  mutation_chance is a hyperparameter. Not in the paper but I think it might be useful to have the option
  mutation_factor is a hyperparameter. Default 0.02 as here: https://arxiv.org/pdf/1712.06567.pdf
  """
  def mutate_noise(self, genome):
    size = genome.shape.numel()
    noise = torch.randn(size) * self._mutation_factor
    mutation_attempts = torch.rand(size)
    threshold_met = mutation_attempts < self._mutation_chance
    mutated_whole = genome + noise
    new_genome = torch.where(threshold_met, mutated_whole, genome)
    return new_genome

  def mutate_population_genomes(self, genomes):
    mutated = []
    for g in genomes:
      g_m = self.mutate_noise(g)
      mutated.append(g_m)
    return mutated

  def build_params_from_genome(self, genome):
    start = 0
    params = []
    for size in self._genome_param_sizes:
      flat_size = size.numel()
      genome_block = genome[start:start + flat_size].clone()
      reshaped = genome_block.reshape(size)
      params.append(reshaped)
      start += flat_size
    return params

  def init_random_population(self):
    self._population = []
    for i in range(self._population_size):
      qnet = ESQNet(self._input_size, self._output_size)
      self._population.append(qnet)

  def next_generation(self):
    genomes = []
    population_score = []
    for ind in self._population:
      population_score.append(ind.get_score())
    generation_mean_score = torch.Tensor(population_score).mean().item()
    print('Generation ' + str(self._elapsed_generations) + ' completed. Mean score: ' + str(generation_mean_score))
    ranked_population = self.rank_population(self._population)
    elite = self.get_elite(ranked_population)
    if self._use_crossover:
      culled_population = self.cull_individuals_with_worst_genomes(ranked_population)
      population_size = len(self._population)
      for individual in culled_population:
        genomes.append(individual.get_genome())
      bred_genomes = self.breed_population_genomes(genomes, population_size - 1)
      genomes = bred_genomes
    else:
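      # No crossover: clone the top 5 genomes enough times to refill every slot except the elite's
      top_genomes = [individual.get_genome() for individual in ranked_population[:5]]
      repeats = int(math.ceil((len(self._population) - 1) / len(top_genomes)))
      genomes = (top_genomes * repeats)[:len(self._population) - 1]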
    mutated_genomes = self.mutate_population_genomes(genomes)
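    if self._best_individual is None or self._best_individual.get_score() < elite.get_score():
      # deepcopy so the saved best does not share parameter tensors with the live population
      self._best_individual = copy.deepcopy(elite)
    # Reorder so the elite sits at index 0 and only indices 1..N-1 are overwritten below.
    # Assigning the elite to slot 0 without reordering left it aliased at its old index.
    self._population = ranked_population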
    params = None
    for i, g in enumerate(mutated_genomes):
      params = self.build_params_from_genome(g)
      self._population[i + 1].update_params(params)
    params = None
    for ind in self._population:
      ind.clear_scores()
    self._elapsed_generations += 1

  def get_population_dict(self):
    param_dict = {}
    for i, individual in enumerate(self._population):
      param_dict['individual_' + str(i)] = individual.state_dict()
    return param_dict

  def load(self, generation):
    self.init_random_population()
    params = torch.load('drive/My Drive/rl_training/generation' + str(generation) + '.chkpt')
    for key in params:
      if key in self._hyperparameter_keys:
        setattr(self, '_' + key, params[key])
      elif key == '_elapsed_generations':
        self._elapsed_generations = params[key]
      else:
        # Strip the whole 'individual_' prefix so only the index digits are parsed
        self._population[int(key[len('individual_'):])].load_state_dict(params[key])
    params = None

  def save(self):
    params = self.get_population_dict()
    for key in self._hyperparameter_keys:
      # Hyperparameters are stored on the instance with a leading underscore
      params[key] = getattr(self, '_' + key)
    params['_elapsed_generations'] = self._elapsed_generations
    torch.save(params, 'drive/My Drive/rl_training/generation' + str(self._elapsed_generations) + '.chkpt')

  def run_episode(self, env, obs, epsilon):
    individual_to_run = self._tasks_remaining_in_generation.pop(0)
    q = self._population[individual_to_run]
    done = False
    score = 0
    while True:
      a = q.sample_action(obs, -1)
      new_obs, r, done, info = env.step(a)

      score += r
      obs = new_obs
      if done:
          break
    q.add_score(score)
    if self.no_tasks_left():
      self.next_generation()
      self.reset_remaining_tasks()
    return score
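The one-step TD target and advantage described in the header comment, and computed with tensors in `ActorCritic.train_net`, can be sketched in plain Python as follows. This is only an illustration: the function name `td_targets_and_advantages` is hypothetical, and the `values` and `next_values` lists are made-up numbers standing in for critic outputs, not results from this notebook.

```python
# Sketch only: plain-Python version of the one-step TD target and advantage.
# `values` / `next_values` are made-up stand-ins for critic(s) and critic(s_prime).
GAMMA = 0.98          # same discount as the notebook's `gamma`
REWARD_SCALE = 100.0  # make_batch divides each reward by 100


def td_targets_and_advantages(rewards, values, next_values, dones, gamma=GAMMA):
    """Return (targets, advantages) for a short rollout."""
    targets, advantages = [], []
    for r, v, v_next, done in zip(rewards, values, next_values, dones):
        mask = 0.0 if done else 1.0  # no bootstrapping past a terminal state
        target = r / REWARD_SCALE + gamma * v_next * mask
        targets.append(target)
        advantages.append(target - v)  # how much better than the critic expected
    return targets, advantages


targets, advantages = td_targets_and_advantages(
    rewards=[0.0, 250.0, 0.0],
    values=[0.5, 1.0, 0.2],
    next_values=[1.0, 0.2, 0.7],
    dones=[False, False, True],
)
for t, adv in zip(targets, advantages):
    print("target: {:.3f}, advantage: {:.3f}".format(t, adv))
```

A positive advantage pushes the actor towards the action that was taken and a negative one pushes it away, while the critic is regressed towards the target.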

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
"""
BELOW IS THE GENETIC RUN CODE I WAS USING (it wasn't as effective as I hoped), along with some tests I wrote along the way
to check that my genetic algorithm worked properly. By tweaking settings, it got a mean score of 198 over 1000 episodes and a max score of 850.
"""
"""
# setup the Gravitar ram environment, and record a video every 50 episodes. You can use the non-ram version here if you prefer
env = gym.make('Gravitar-v0')
env = gym.wrappers.Monitor(env, "./video", video_callable=lambda episode_id: (episode_id%video_every)==0,force=True)

# reproducible environment and action spaces, do not change lines 6-11 here (tools > settings > editor > show line numbers)
seed = 742
torch.manual_seed(seed)
env.seed(seed)
random.seed(seed)
np.random.seed(seed)
env.action_space.seed(seed)
#g = GenomeQNet(env.observation_space.shape, env.action_space.n)
#c = GeneticQNetContainer(env.observation_space.shape, env.action_space.n)
#print(g.get_genome()[:100].shape)
#print(c.get_genome_size())
#params = c.build_params_from_genome(torch.zeros(c.get_genome_size()))
#g.update_params(params)
#print(g.get_genome()[:100].shape)
#print(c.mutate_noise(torch.Tensor([0.5, 0.6, 0.7, 0.8, 0.9, 1.5,]), 0.5))
#for i in range(100):
#  print(c.breed(torch.Tensor([1, 2, 3, 4, 5, 6, 7, 8, 9]), torch.Tensor([10, 20, 30, 40, 50, 60, 70, 80, 90])))


genetic_qnet = GeneticQNetContainer(env.observation_space.shape, env.action_space.n, {
    'population_size': 50,
    'cull_after': 0.6,
    'mutation_chance': 0.9,
    'use_crossover': False
})
genetic_qnet.init_random_population()
score    = 0.0
marking  = []
#optimizer = optim.Adam(q.parameters(), lr=learning_rate)
n_episodes = 101

for n_episode in range(n_episodes):
    epsilon = max(0.01, 0.08 - 0.01*(n_episode/200)) # linear annealing from 8% to 1%
    s = env.reset()
    score = genetic_qnet.run_episode(env, s, epsilon)
    # do not change lines 44-48 here, they are for marking the submission log
    marking.append(score)
    if n_episode%100 == 0:
        print("marking, episode: {}, score: {:.1f}, mean_score: {:.2f}, std_score: {:.2f}".format(
            n_episode, score, np.array(marking).mean(), np.array(marking).std()))
        marking = []

    # you can change this part, and print any data you like (so long as it doesn't start with "marking")
    if n_episode%print_every==0 and n_episode!=0:
        print("episode: {}, score: {:.1f}, epsilon: {:.2f}".format(n_episode, score, epsilon))
"""


**Train**

← You can download the videos from the videos folder in the files on the left

In [4]:
# set up the image-based Gravitar environment (Gravitar-v0, not the RAM variant) and record a video every `video_every` episodes
env = gym.make('Gravitar-v0')
env = gym.wrappers.Monitor(env, "./video", video_callable=lambda episode_id: (episode_id%video_every)==0,force=True)
env = wrap_env(env)
# reproducible environment and action spaces, do not change lines 6-11 here (tools > settings > editor > show line numbers)
seed = 742
torch.manual_seed(seed)
env.seed(seed)
random.seed(seed)
np.random.seed(seed)
env.action_space.seed(seed)

AC_NET = ActorCritic(env, learning_rate, gamma)

score    = 0.0
marking  = []

MAX_EPISODES = int(1e32)

for n_episode in range(MAX_EPISODES):
    score = 0.0
    done = False
    s = env.reset()
    while not done:
        for t in range(n_rollout):
            prob = AC_NET.actor(torch.from_numpy(s).float().unsqueeze(0))
            m = torch.distributions.Categorical(prob)
            a = m.sample().item()
            s_prime, r, done, info = env.step(a)
            AC_NET.put_data((s,a,r,s_prime,done))
            
            s = s_prime
            score += r
            
            if done:
              break                     
        
        AC_NET.train_net()

    # do not change lines 44-48 here, they are for marking the submission log
    marking.append(score)
    if n_episode%100 == 0:
        print("marking, episode: {}, score: {:.1f}, mean_score: {:.2f}, std_score: {:.2f}".format(
            n_episode, score, np.array(marking).mean(), np.array(marking).std()))
        marking = []

    # you can change this part, and print any data you like (so long as it doesn't start with "marking")
    if n_episode%print_every==0 and n_episode!=0:
        running_mean = -1
        if len(marking) > 0:
          running_mean = np.array(marking).mean()
        print("episode: {}, score: {:.1f}, mean: {:.2f}".format(n_episode, score, running_mean))

marking, episode: 0, score: 500.0, mean_score: 500.00, std_score: 0.00
episode: 5, score: 0.0, mean: 210.00
episode: 10, score: 250.0, mean: 180.00
episode: 15, score: 0.0, mean: 183.33
episode: 20, score: 250.0, mean: 160.00
episode: 25, score: 250.0, mean: 176.00
episode: 30, score: 0.0, mean: 185.00
episode: 35, score: 100.0, mean: 171.43
episode: 40, score: 200.0, mean: 167.50
episode: 45, score: 0.0, mean: 151.11
episode: 50, score: 500.0, mean: 153.00
episode: 55, score: 0.0, mean: 159.09
episode: 60, score: 0.0, mean: 157.50
episode: 65, score: 250.0, mean: 163.85
episode: 70, score: 100.0, mean: 174.29
episode: 75, score: 100.0, mean: 172.00
episode: 80, score: 500.0, mean: 175.00
episode: 85, score: 250.0, mean: 172.94
episode: 90, score: 300.0, mean: 172.78
episode: 95, score: 0.0, mean: 172.11
marking, episode: 100, score: 250.0, mean_score: 173.00, std_score: 185.53
episode: 100, score: 250.0, mean: -1.00
episode: 105, score: 350.0, mean: 240.00
episode: 110, score: 0.0, me

KeyboardInterrupt: ignored
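The training loop above passes sigmoid outputs straight into `torch.distributions.Categorical`, which rescales the `probs` it is given so that they sum to 1. The sketch below reproduces that step in plain Python to show why independent sigmoid outputs still give a valid sampling distribution. The logits are hypothetical values, and `sigmoid` and `normalise` are illustrative helpers rather than code from this notebook.

```python
import math
import random


def sigmoid(x):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalise(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


logits = [2.0, -1.0, 0.0, 4.0]           # hypothetical actor_linear outputs
weights = [sigmoid(x) for x in logits]   # each in (0, 1), but they do not sum to 1
probs = normalise(weights)               # what the sampler effectively draws from

rng = random.Random(742)
action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print("weights:", [round(w, 3) for w in weights])
print("probs:  ", [round(p, 3) for p in probs])
print("sampled action:", action)
```

One consequence worth keeping in mind: `train_net` takes the log of the raw sigmoid output for the chosen action, not of the normalised probability, so the quantity in the loss differs from the probability the action was actually sampled with.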

**Information about the Genetic Algorithm (GA)**

I think GAs are beautiful because they learn using methods proven in nature, and I thought they might work well for Gravitar given its sparse rewards. I found this paper, https://arxiv.org/pdf/1712.06567.pdf, which used GAs to solve Atari games (and showed that GAs are not just random search in this space). It looked promising, so I wrote this container for running genome operations on neural nets.

However, their population size was 1K and mine could only be 50, because I didn't use their compression technique and was running out of RAM (and I didn't have time to run circa 1000 episodes for one generation and risk little improvement over hours). Their implementation just added noise, scaled by a factor, to each weight. I tried to improve on this by adding ordered crossover of the flattened neural nets, but this didn't seem to work well, possibly because it broke the correlation between parts of the net, so I ended up just using their noise method with a high mutation chance.

Interestingly, the use of LSTMCells seemed to make the agent perform better, but this may be coincidental. If they actually do help, it could be because the GA is gradient-free (hence its fast run time per episode) and the LSTM somehow acts as inter-generational memory. The paper did use the preprocessing methods that I ended up using for the A2C, but they didn't work properly with my GA implementation, so the GA was slower than it should have been.

Even so, the GA reached a mean score competitive with my A2C and peaked at a score of 850-900. Unfortunately the score then started to fall again, which suggests the mutation rate needed adjusting, especially considering I added an aggressive culling scheme and mutated only from the most elite population members to try to improve convergence. I couldn't be sure that my implementation would actually converge, so I abandoned it in favour of a more traditional method (Duelling DQN first, then A2C).
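The noise-only scheme described above (keep the elite, clone a handful of top genomes, add scaled Gaussian noise to each weight with some probability) can be sketched on plain Python lists as follows. This is a simplified illustration under stated assumptions, not the container's actual code: genomes are short lists of floats instead of flattened network weights, the names `mutate` and `next_generation` are reused only for readability, and the defaults of 5 parents, 0.9 mutation chance and 0.02 noise factor mirror the settings that appear in the notebook.

```python
import random


def mutate(genome, rng, chance=0.9, factor=0.02):
    """Add Gaussian noise (std = factor) to each weight with probability `chance`."""
    return [w + rng.gauss(0.0, factor) if rng.random() < chance else w
            for w in genome]


def next_generation(population, scores, rng, n_parents=5):
    """Keep the best genome unchanged and fill the rest with mutated top genomes."""
    ranked = [g for _, g in sorted(zip(scores, population),
                                   key=lambda pair: pair[0], reverse=True)]
    elite = ranked[0]
    parents = ranked[:n_parents]
    # Cycle through the parents until every non-elite slot is filled
    children = [mutate(parents[i % len(parents)], rng)
                for i in range(len(population) - 1)]
    return [elite] + children


rng = random.Random(742)
population = [[float(i)] * 4 for i in range(10)]  # toy genomes, 4 weights each
scores = [float(i) for i in range(10)]            # toy fitness: higher is better
new_population = next_generation(population, scores, rng)
print("elite kept unchanged:", new_population[0])
print("first child:", [round(w, 3) for w in new_population[1]])
```

Because every child descends from only a few parents, diversity depends entirely on the mutation noise, which is one plausible reason the score could rise and then drift back down.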