## DQN Training
This is the first experiment using the CartPole environment. The experiments in CartPole act as a tutorial on how to use our code and all the other experiments are based upon them, so this provides a good starting point if you want to run your own experiments. 
The tutorials build on each other, so we recommend working through them in this order:
1. DQN-Training (How to train a conventional DQN and a spiking DQN using Surrogate Gradients (DSQN).)
2. Load-DQN (How to load a previously saved D(S)QN and how to save a replay dataset.)
3. Train-Classifier (How to train a spiking or non-spiking classifier on the saved replay data set.)
4. SNN-Conversion (How to convert a DQN and a Classifier to an SNN.)
5. Load in NEST (How to load a converted or directly trained spiking network in NEST.)
6. Conversion in pyNN with NEST or SpyNNaker (How to load a spiking network in pyNN using NEST or SpyNNaker as backend.)

In this first part we train a DQN (non-spiking and spiking) on the CartPole problem. At the same time this file serves as a tutorial for training DQNs with any environment.

In [1]:
import torch
import torch.optim as optim
import os
import sys
import random

import numpy as np
import matplotlib.pyplot as plt
# hack to perform relative imports
sys.path.append('../../')
from datetime import date
from Code import train_agent, SQN, FullyConnected

We start with setting up the result directory.
<div class="alert alert-block alert-warning">
<b>Attention:</b> If a directory with the specified name already exists, this will throw an error. Specify a different name or delete the old directory. If this happens, you should also restart the kernel, because the result directory is given as a relative path whose meaning changes every time this cell is run.
</div>

In [2]:
# switch to the Result Directory
#os.chdir('../../Results/')
# choose the name of the result directory
#result_directory = 'CartPole-Experiment2-DQN-Training'
# create the result directory (throws an error if the directory already exists)
#os.makedirs(result_directory)
#os.chdir(result_directory)
# for the first experiment we create an additional sub folder
#os.makedirs('DQN')
#os.chdir('DQN')

In [3]:
# Create Results Directory
dirs = os.listdir('.')
if not any('result' in d for d in dirs):
    result_id = 1
else:
    results = [d for d in dirs if 'result' in d]
    result_id = len(results) + 1

# Get today's date and add it to the results directory
d = date.today()
# zero-padded date (YYYYMMDD) keeps the directory name unambiguous
result_dir = 'result_{}_DQN_{}'.format(result_id, d.strftime('%Y%m%d'))
os.mkdir(result_dir)
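
The numbering scheme above can also be wrapped in a small helper that never changes the working directory, which avoids the restart-the-kernel issue mentioned in the warning. The following is a minimal sketch and not part of the repository's `Code` package; the function name `make_result_dir` and its arguments are hypothetical.

```python
import tempfile
from datetime import date
from pathlib import Path


def make_result_dir(base, label='DQN'):
    """Create and return base/result_<id>_<label>_<YYYYMMDD> (hypothetical helper)."""
    base = Path(base)
    # count the result directories that already exist to obtain the next id
    existing = [p for p in base.iterdir() if p.is_dir() and 'result' in p.name]
    result_id = len(existing) + 1
    name = 'result_{}_{}_{}'.format(result_id, label, date.today().strftime('%Y%m%d'))
    path = base / name
    # raises FileExistsError instead of silently reusing an old directory
    path.mkdir()
    return path


# demo in a throwaway directory so nothing is written next to the notebook
with tempfile.TemporaryDirectory() as tmp:
    first = make_result_dir(tmp)
    second = make_result_dir(tmp)
    created = [first.name.split('_')[1], second.name.split('_')[1]]
    print(created)
```

Because the helper returns a full path, later cells can write to `path / 'scores_0.npy'` and similar without relying on `os.chdir`.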

We set the seeds, hyperparameters and initial weights to values that successfully reached the OpenAI Gym standard. This should make the results reproducible if you install the virtual environment specified in requirements.txt.
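
To illustrate why fixing a seed makes a run repeatable, the sketch below re-seeds Python's `random` module and checks that the same draws come back. It only covers the standard-library generator; `torch.manual_seed` and the gym seed play the same role for their own generators. The helper name `draw_actions` is made up for this illustration.

```python
import random


def draw_actions(seed, n=5, n_actions=2):
    """Seed the global generator and draw n random actions (illustrative helper)."""
    random.seed(seed)
    return [random.randrange(n_actions) for _ in range(n)]


first = draw_actions(795)
second = draw_actions(795)
# identical seeds reproduce the identical sequence of exploratory actions
print(first == second)
```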

In [4]:
#torch_seed = 135
#torch.manual_seed(torch_seed)
#random_seed = 795
#random.seed(random_seed)
#gym_seed = 975

Next, we define the environment and all the hyperparameters for a non-spiking Q-network.

In [5]:
# CartPole
env = 'CartPole-v0'

# hyperparameters
batch_size = 128
discount_factor = 0.999
epsilon_start = 1.0
epsilon_end = 0.05
epsilon_decay = 0.999
target_update_frequency = 10
learning_rate = 0.001
replay_memory_size = 4*10**4

# minimum size of the replay memory before the training starts
initial_replay_size = 0

# the gym standard for CartPole ("solving" it) is to achieve a 100-episode average of >=195 for 100 consecutive episodes
GYM_TARGET_AVG = 195
GYM_TARGET_DURATION = 100

max_steps = 1000
num_episodes = 1000
n_runs = 5

double_q = False
gradient_clipping = False
render = False

architecture = [4,16,16,2]


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
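
How `epsilon_start`, `epsilon_end` and `epsilon_decay` interact is easiest to see in a few lines. The sketch assumes a multiplicative decay applied once per environment step and clipped at `epsilon_end`. This is a common schedule, but whether `train_agent` decays per step or per episode is an assumption here, so check `Code/train_agent` for the actual rule.

```python
import math

epsilon_start = 1.0
epsilon_end = 0.05
epsilon_decay = 0.999


def decay_epsilon(epsilon):
    """One multiplicative decay step, clipped at the floor (assumed schedule)."""
    return max(epsilon_end, epsilon * epsilon_decay)


# count how many decay steps it takes to reach the floor
epsilon = epsilon_start
steps_to_floor = 0
while epsilon > epsilon_end:
    epsilon = decay_epsilon(epsilon)
    steps_to_floor += 1

# closed form for comparison: smallest n with epsilon_start * decay**n <= epsilon_end
closed_form = math.ceil(math.log(epsilon_end / epsilon_start) / math.log(epsilon_decay))
print(steps_to_floor, closed_form)
```

Under this assumption the exploration rate reaches its floor after roughly 3000 decay steps, so most of the training runs at `epsilon_end`.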

Next we initialize the neural network for the problem: we use a fully connected network with two hidden layers of 16 neurons each and ReLU activations. The size of the input and output layers is determined by the environment.
The target net is initially a copy of the policy net.
Then we set up the optimizer: we use Adam with the specified learning rate and the default values for all other parameters.
The seed fixes the initial weights. Alternatively, we could save and reload the initial weights of a network (see the commented lines below the definition of the policy net).
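
For readers new to DQN, the role of the target net and of the `double_q` flag can be summarised by the bootstrap target each transition is regressed towards. The sketch below uses plain Python lists in place of network outputs and follows the textbook DQN and Double DQN definitions. It is not taken from `Code/train_agent`, and the function name `td_target` is hypothetical.

```python
def td_target(reward, done, q_next_target, q_next_policy=None, discount_factor=0.999):
    """Bootstrap target for one transition.

    q_next_target: Q-values of the next state from the target net
    q_next_policy: Q-values of the next state from the policy net (Double DQN only)
    """
    if done:
        # terminal transitions have no future reward to bootstrap from
        return reward
    if q_next_policy is None:
        # standard DQN: the target net both selects and evaluates the action
        next_value = max(q_next_target)
    else:
        # Double DQN: the policy net selects, the target net evaluates
        best_action = max(range(len(q_next_policy)), key=lambda a: q_next_policy[a])
        next_value = q_next_target[best_action]
    return reward + discount_factor * next_value


dqn = td_target(1.0, False, q_next_target=[2.0, 3.0])
double_dqn = td_target(1.0, False, q_next_target=[2.0, 3.0], q_next_policy=[5.0, 4.0])
terminal = td_target(1.0, True, q_next_target=[2.0, 3.0])
print(dqn, double_dqn, terminal)
```

In the toy numbers the Double DQN target is lower because the policy net prefers action 0, which the target net values less; this decoupling is what reduces overestimation.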

In [6]:
# initialize policy and target net
# this creates a network with 4 input neurons, two hidden layers with 16 neurons each and two output neurons

#policy_net = FullyConnected(architecture).to(device)
# initial weights are actually fixed through the seed
# load the fixed initial weights, remove this line to get random initial weights, not necessary if torch seed is specified
#policy_net.load_state_dict(torch.load('./../../CartPole-v0/DQN/initial/model.pt'))

#target_net = FullyConnected(architecture).to(device)
#target_net.load_state_dict(policy_net.state_dict())

# initialize optimizer
#optimizer = optim.Adam(policy_net.parameters(), lr=learning_rate)

We now train the agent with the specified hyperparameters using the function from Code/train_agent. In the plot, blue shows the individual reward on each episode, while orange shows the average reward over the last 100 episodes.

In [None]:
smoothed_scores_all = []
torch_seeds = [829, 861, 632, 130, 743]
random_seeds = [440, 791, 721, 563, 950]
gym_seeds = [127, 691, 157, 534, 217]
for i_run in range(n_runs):
    #torch_seed = random.randint(0, 1000)
    torch_seed = torch_seeds[i_run]
    torch.manual_seed(torch_seed)
    #random_seed = random.randint(0, 1000)
    random_seed = random_seeds[i_run]
    random.seed(random_seed)
    #gym_seed = random.randint(0, 1000)
    gym_seed = gym_seeds[i_run]
    
    policy_net = FullyConnected(architecture).to(device)
    target_net = FullyConnected(architecture).to(device)
    target_net.load_state_dict(policy_net.state_dict())
    optimizer = optim.Adam(policy_net.parameters(), lr=learning_rate)

    scores, smoothed_scores = train_agent(env, policy_net, target_net, batch_size,
                                          discount_factor, epsilon_start, epsilon_end,
                                          epsilon_decay, target_update_frequency, optimizer,
                                          learning_rate, replay_memory_size, device, i_run,
                                          result_dir, num_episodes=num_episodes,
                                          max_steps=max_steps, render=render,
                                          double_q_learning=double_q,
                                          gradient_clipping=gradient_clipping,
                                          initial_replay_size=initial_replay_size,
                                          gym_seed=gym_seed, torch_seed=torch_seed,
                                          random_seed=random_seed)
    np.save(result_dir + '/scores_{}'.format(i_run), scores)
    np.save(result_dir + '/smoothed_scores_{}'.format(i_run), smoothed_scores)
    
    # save smoothed scores in list to plot later
    smoothed_scores_all.append(smoothed_scores)



Episode 100	Average Score: 54.85	 Epsilon: 0.05
Episode 200	Average Score: 185.87	 Epsilon: 0.05
Episode 300	Average Score: 203.53	 Epsilon: 0.05
Episode 400	Average Score: 180.20	 Epsilon: 0.05
Episode 500	Average Score: 142.80	 Epsilon: 0.05
Episode 600	Average Score: 136.24	 Epsilon: 0.05
Episode 700	Average Score: 198.65	 Epsilon: 0.05
Episode 800	Average Score: 393.05	 Epsilon: 0.05
Episode 900	Average Score: 710.61	 Epsilon: 0.05
Episode 1000	Average Score: 671.23	 Epsilon: 0.05
Best 100 episode average:  753.11  reached at episode  909 . Model saved in folder best.
Complete


<Figure size 432x288 with 0 Axes>

Episode 100	Average Score: 50.07	 Epsilon: 0.05
Episode 200	Average Score: 165.08	 Epsilon: 0.05
Episode 300	Average Score: 139.09	 Epsilon: 0.05
Episode 400	Average Score: 136.53	 Epsilon: 0.05
Episode 500	Average Score: 125.20	 Epsilon: 0.05
Episode 600	Average Score: 143.40	 Epsilon: 0.05
Episode 700	Average Score: 141.67	 Epsilon: 0.05
Episode 800	Average Score: 634.08	 Epsilon: 0.05
Episode 900	Average Score: 342.42	 Epsilon: 0.05
Episode 1000	Average Score: 751.89	 Epsilon: 0.05
Best 100 episode average:  751.89  reached at episode  1000 . Model saved in folder best.
Complete


<Figure size 432x288 with 0 Axes>

Episode 100	Average Score: 71.84	 Epsilon: 0.05
Episode 200	Average Score: 198.97	 Epsilon: 0.05
Episode 300	Average Score: 152.35	 Epsilon: 0.05
Episode 400	Average Score: 97.55	 Epsilon: 0.055
Episode 500	Average Score: 101.96	 Epsilon: 0.05
Episode 600	Average Score: 164.70	 Epsilon: 0.05
Episode 700	Average Score: 135.30	 Epsilon: 0.05
Episode 800	Average Score: 466.86	 Epsilon: 0.05
Episode 900	Average Score: 886.24	 Epsilon: 0.05
Episode 1000	Average Score: 818.22	 Epsilon: 0.05
Best 100 episode average:  902.37  reached at episode  985 . Model saved in folder best.
Complete


<Figure size 432x288 with 0 Axes>

Episode 100	Average Score: 115.66	 Epsilon: 0.05
Episode 200	Average Score: 359.61	 Epsilon: 0.05
Episode 300	Average Score: 36.52	 Epsilon: 0.055
Episode 400	Average Score: 134.56	 Epsilon: 0.05
Episode 500	Average Score: 189.07	 Epsilon: 0.05
Episode 600	Average Score: 162.33	 Epsilon: 0.05
Episode 700	Average Score: 317.18	 Epsilon: 0.05
Episode 800	Average Score: 989.17	 Epsilon: 0.05
Episode 900	Average Score: 757.34	 Epsilon: 0.05
Episode 1000	Average Score: 290.12	 Epsilon: 0.05
Best 100 episode average:  996.37  reached at episode  825 . Model saved in folder best.
Complete


<Figure size 432x288 with 0 Axes>

Episode 100	Average Score: 125.61	 Epsilon: 0.05
Episode 154	Average Score: 216.73	 Epsilon: 0.05

The figure above shows the training progress of the model. Once the OpenAI Gym standard is reached, the model is saved in the result directory as trained/model.pt. Additionally, the initial weights and the hyperparameters are saved before the training starts.
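
The stopping criterion can be written down compactly. The sketch below keeps a 100-episode window and reports the first episode at which the windowed average reaches `GYM_TARGET_AVG`. It is a simplified reading of the gym standard (a single full window at or above 195) and not the check implemented in `train_agent`; `first_solved_episode` is a hypothetical name.

```python
from collections import deque

GYM_TARGET_AVG = 195
WINDOW = 100


def first_solved_episode(scores, target=GYM_TARGET_AVG, window=WINDOW):
    """Return the 1-based episode at which the windowed average first reaches target, else None."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        # only a full window counts, so early lucky episodes cannot trigger it
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# synthetic scores: 50 weak episodes followed by 150 episodes of 200 steps each
scores = [20] * 50 + [200] * 150
print(first_solved_episode(scores))
```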

In [None]:
# Plot scores of individual runs
for i in range(len(smoothed_scores_all)):
    fig = plt.figure()
    plt.plot(smoothed_scores_all[i])
    plt.ylim(0, 1000)

In [None]:
# Plot results (median)
fig = plt.figure()
plt.plot(range(len(smoothed_scores_all[0])), np.nanmedian(smoothed_scores_all, axis=0))
plt.fill_between(range(len(smoothed_scores_all[0])), np.nanpercentile(smoothed_scores_all, 2, axis=0),
                 np.nanpercentile(smoothed_scores_all, 97, axis=0), alpha=0.25)
plt.show()

In [None]:
# Plot results (mean)
mean_smoothed_scores = np.mean(smoothed_scores_all, axis=0)
fig = plt.figure()
plt.plot(range(len(smoothed_scores_all[0])), mean_smoothed_scores)
plt.fill_between(range(len(smoothed_scores_all[0])), np.nanpercentile(smoothed_scores_all, 2, axis=0),
                 np.nanpercentile(smoothed_scores_all, 97, axis=0), alpha=0.25)
plt.ylim(0, 1000)
plt.savefig(result_dir + '/DQN_training.png', dpi=1000)
plt.show()
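
What `np.nanmedian(..., axis=0)` computes in the cells above is a per-episode median across runs that skips missing values. The pure-Python sketch below spells this out for equally long runs; the function name `median_across_runs` and the toy numbers are illustrative only.

```python
import math
from statistics import median


def median_across_runs(runs):
    """Per-episode median over runs, ignoring NaN entries (like np.nanmedian with axis=0)."""
    result = []
    # zip(*runs) walks over episodes, yielding one score per run
    for scores_at_episode in zip(*runs):
        valid = [s for s in scores_at_episode if not math.isnan(s)]
        result.append(median(valid) if valid else math.nan)
    return result


runs = [
    [10.0, 50.0, 200.0],
    [20.0, 40.0, math.nan],  # a run with a missing value
    [30.0, 90.0, 100.0],
]
print(median_across_runs(runs))
```

The median is less sensitive than the mean to a single run that collapses or takes off, which is why the median and mean plots can look different.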


### Training of a spiking DQN or DSQN
Next, we train a DSQN using the same hyperparameters as far as possible. That is, all hyperparameters are the same, but we need to define some additional hyperparameters. We adapted the surrogate gradient algorithm we use for the direct training from the SpyTorch jupyter notebooks (available from https://github.com/fzenke/spytorch as of 06.12.2019).

In [None]:
# First we set up a new sub directory
os.chdir('./..')
os.makedirs('DSQN')
os.chdir('DSQN')
# We use a non-leaky integrate-and-fire neuron
ALPHA = 0
BETA = 1
# Simulation time is chosen relatively short, such that the network does not need too much time to run, but not too short,
# such that it can still learn something
SIMULATION_TIME = 20
# We also have to define the input/output and reset methods. To our knowledge, SpyTorch supports only potential outputs
# and reset-by-subtraction. As input method we use constant input currents. It would be interesting to see whether SpyTorch
# can also use reset-to-zero, as this would make it more similar to the iaf_delta models in NEST and SpyNNaker
ENCODING = 'constant'
DECODING = 'potential'
RESET = 'subtraction'
# SpyTorch uses a fixed threshold of one; we didn't test other thresholds, but they should be possible
THRESHOLD = 1
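
To make these settings concrete, here is a minimal pure-Python sketch of a single non-leaky integrate-and-fire neuron driven by a constant input current, with reset-by-subtraction and a threshold of one. It is a simplified stand-in for what happens inside the SQN, not the SpyTorch implementation: the synaptic current is taken to be the input itself (which is what `ALPHA = 0` suggests), a spike is emitted when the potential reaches the threshold, and `simulate_if_neuron` is a hypothetical name.

```python
BETA = 1              # membrane decay factor; 1 means no leak
THRESHOLD = 1
SIMULATION_TIME = 20


def simulate_if_neuron(input_current, steps=SIMULATION_TIME, beta=BETA, threshold=THRESHOLD):
    """Return the spike train (list of 0/1) of one neuron under constant input."""
    membrane = 0.0
    spikes = []
    for _ in range(steps):
        # integrate: with beta = 1 the potential simply accumulates the input
        membrane = beta * membrane + input_current
        if membrane >= threshold:
            spikes.append(1)
            # reset-by-subtraction keeps any charge above the threshold
            membrane -= threshold
        else:
            spikes.append(0)
    return spikes


spike_train = simulate_if_neuron(0.25)
print(sum(spike_train))
```

With a constant input of 0.25 the neuron fires every fourth step, so the spike count over the simulation window grows with the input strength. A longer `SIMULATION_TIME` gives a finer-grained rate at the cost of runtime.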

We again set the seeds such that the gym standard is reached. We obtained these seeds from our experiments, in which we saved the seeds upon successfully reaching the gym standard.

In [None]:
# alternatively use the seeds gym: 240, torch: 18, random: 626 and a learning rate of 0.0005 
# to get the same results as figure 4.9 in the thesis
torch.manual_seed(467)
random.seed(208)
gym_seed = 216

Now we set up the neural network. Note that SpyTorch does not support biases, so we instead add a constant input to each observation (equivalent to first-layer biases) and add one additional neuron to each hidden layer to compensate for the missing biases.
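
The equivalence between a first-layer bias and a constant input can be checked numerically. The sketch below uses made-up weights and a made-up observation; it illustrates the idea behind `add_bias_as_observation` and says nothing about how the SQN class implements it.

```python
def weighted_sum(weights, inputs):
    """Plain dot product, standing in for one neuron's input current."""
    return sum(w * x for w, x in zip(weights, inputs))


observation = [0.5, -1.0, 0.25, 2.0]   # hypothetical CartPole observation
weights = [1.0, 0.5, -2.0, 0.25]       # hypothetical first-layer weights of one neuron
bias = 0.75

# variant 1: neuron with an explicit bias term
with_bias = weighted_sum(weights, observation) + bias

# variant 2: bias-free neuron; the observation gets a constant 1 appended
# and the former bias becomes the weight of that extra input
augmented = weighted_sum(weights + [bias], observation + [1.0])

print(with_bias, augmented)
```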

In [None]:
architecture = [4,17,17,2]
policy_net = SQN(architecture,device,alpha=ALPHA,beta=BETA,simulation_time=SIMULATION_TIME,add_bias_as_observation=True,
                  encoding=ENCODING,decoding=DECODING,reset=RESET,threshold=THRESHOLD)
# load the fixed initial weights, remove this line to get random initial weights
#policy_net.load_state_dict(torch.load('./../../CartPole-v0/DSQN-Surrogate-Gradients/initial/model.pt'))

target_net = SQN(architecture,device,alpha=ALPHA,beta=BETA,simulation_time=SIMULATION_TIME,add_bias_as_observation=True,
                  encoding=ENCODING,decoding=DECODING,reset=RESET,threshold=THRESHOLD)
target_net.load_state_dict(policy_net.state_dict())

# initialize optimizer
# learning_rate is the lowercase name defined in the hyperparameter cell above
optimizer = optim.Adam(policy_net.parameters(), lr=learning_rate)

In the next cell, the agent is trained. The function we use is exactly the same as for the DQN; the only difference is that the model we pass is now an instance of our SQN class rather than of the PyTorch neural network base class.

In [None]:
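# hyperparameters use the lowercase names defined in the hyperparameter cell above
# NOTE: the DQN call above passes i_run and result_dir after device; if your version of
# train_agent expects those, replace GYM_TARGET_AVG and GYM_TARGET_DURATION accordingly
train_agent(env,policy_net,target_net,batch_size,discount_factor,epsilon_start,
            epsilon_end,epsilon_decay,target_update_frequency,optimizer,learning_rate,
            replay_memory_size,device,GYM_TARGET_AVG,GYM_TARGET_DURATION,num_episodes=num_episodes,
            max_steps=max_steps,render=render,double_q_learning=double_q,gradient_clipping=gradient_clipping,
            initial_replay_size=initial_replay_size,gym_seed=gym_seed)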

The plot again shows the rewards in each episode (blue) and the average reward over the last 100 episodes (orange).

The next experiment in this series is Load-DQN.