<a href="https://colab.research.google.com/github/hsscholte/ccnTest/blob/master/CCN19_Examlab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning to remember: T-Maze

Many behavioral tasks require that some past information be kept in memory so that the correct decision can be made at a later time. Think of reading the room number for a lecture and keeping it in mind until you have reached that room.

When we give such tasks to monkeys, we find that specific subsets of neurons exhibit persistent activity from the moment the to-be-remembered object is seen until the end of the trial. This is believed to reflect some aspect of working memory.

In behavioral science, a T-maze (or its variant, the Y-maze) is a simple maze used in animal cognition experiments. It is shaped like the letter T (or Y), giving the subject, typically a rodent, a straightforward choice. T-mazes have been used with rodents since the early 20th century to study memory and spatial learning under various stimuli. Tasks such as left-right discrimination and forced alternation are mainly used to test reference and working memory.

Here, we study how a T-Maze task can be learned through reinforcement learning, using a recurrent neural network that *learns to* remember the information needed for the task.











## Recalling the LSTM: Long Short-Term Memory

Here, we recall a recurrent neural network unit specifically designed for learning to keep information active: the LSTM unit, where LSTM stands for Long Short-Term Memory. The LSTM was introduced by Sepp Hochreiter and Juergen Schmidhuber in 1997, and it addresses specific needs that make learning memory in neural networks feasible. The key unit in the LSTM is a neuron with a self-recurrent connection; through this self-recurrence, past information is maintained as persistent (or decaying) activity in the unit.

With the LSTM, at every timestep, an input $x_t$ is put into the neural unit, and this input affects the internal state of the unit in three ways: on forgetting the internal state, on the value of the internal state, and on the output of the internal state. An LSTM unit thus has a *forget gate*, an *input gate*, and an *output gate*. Can you see the gates in this figure? Hint: the top horizontal line corresponds to the internal state. 

![LSTM](https://drive.google.com/thumbnail?id=1rrUy5QUIn4WhA_2tYZMeZ64DFKuCpkLt&sz=w600)

From left to right: the new input $x_t$, together with the unit's output at the previous time step $h_{t-1}$, acts on the internal state through a multiplication. This is effectively the *forget gate*. The next two blocks ($\sigma$ and `tanh`) determine how much of a function of the input is added to the internal state, through the *input gate*. The last $\sigma$ block is the *output gate*: it determines how much of the internal state is passed on as output, given the present input.
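
To make the three gates concrete, here is a minimal sketch of one time step of a single, standard LSTM unit in plain Python. The function name `lstm_unit_step`, the weight layout, and the hand-picked weight values are illustrative assumptions; they are not taken from the tutorial's own code, which may implement the unit differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_unit_step(x_t, h_prev, c_prev, w):
    """One time step of a single standard LSTM unit (illustrative sketch).

    x_t    : list of input values at time t
    h_prev : output of the unit at time t-1
    c_prev : internal state of the unit at time t-1
    w      : dict mapping a block name to (input weights, recurrent weight, bias)
    """
    def pre(block):
        w_x, w_h, b = w[block]
        return sum(wi * xi for wi, xi in zip(w_x, x_t)) + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # forget gate: how much old state is kept
    i = sigmoid(pre("input"))         # input gate: how much new content is let in
    g = math.tanh(pre("candidate"))   # candidate content computed from the input
    o = sigmoid(pre("output"))        # output gate: how much state is shown
    c_t = f * c_prev + i * g          # update of the internal state
    h_t = o * math.tanh(c_t)          # gated output
    return h_t, c_t

# Hypothetical hand-picked weights: the input gate opens only when the first
# input (think "road sign") is active; the forget gate stays close to 1.
weights = {
    "forget":    ([0.0, 0.0, 0.0], 0.0, 6.0),
    "input":     ([12.0, 0.0, 0.0], 0.0, -6.0),
    "candidate": ([4.0, 0.0, 0.0], 0.0, 0.0),
    "output":    ([0.0, 0.0, 0.0], 0.0, 6.0),
}

h, c = 0.0, 0.0
# One step with the cue present, followed by five steps without it.
for x in [[1.0, 0.0, 0.0]] + [[0.0, 0.0, 0.0]] * 5:
    h, c = lstm_unit_step(x, h, c, weights)
    print(round(c, 3), round(h, 3))
```

With these weights the internal state jumps when the cue is present and then decays only slowly, which is the persistent activity described above. In the tutorial the weights are learned instead of set by hand.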

## The Working-Memory T-Maze task

The T-Maze is a working-memory task adapted from the one used in (Bakker, 2001). At the end of a corridor, the agent has to move *Left* or *Right*, according to information given at the beginning of the task: a road sign, which can appear on either the left or the right side at the start of the corridor. The agent has to remember this information to select the optimal action among the four possible ones: *Up, Down, Left, Right*. The length of the corridor $N$ can be varied, and the agent moves with a step size of 1. A move into a wall returns a negative reward of $r_{wall} = -0.1$. At the T-junction, a correct choice is rewarded with $r_{goal} = 4$, while an incorrect choice returns $r_{wrong} = -0.1$. Learning is stopped when the agent has made $90\%$ correct choices, for each condition, in the last $50$ trials.
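
The reward rules above can be sketched as a tiny environment. This is not the tutorial's `TaskTmaze` class: the class name `TinyTMaze`, the position bookkeeping, the zero reward for ordinary corridor steps, and the choice to end the trial after any *Left*/*Right* move at the junction are all assumptions made for illustration.

```python
UP, DOWN, LEFT, RIGHT = range(4)

class TinyTMaze:
    """Minimal T-maze sketch with the rewards described in the text."""

    R_WALL, R_GOAL, R_WRONG = -0.1, 4.0, -0.1

    def __init__(self, corridor_length, sign_left):
        self.n = corridor_length
        self.sign_left = sign_left   # True: the road sign points to the left arm
        self.pos = 0                 # 0 = start of corridor, n = T-junction

    def step(self, action):
        """Apply one action and return (reward, done)."""
        if self.pos < self.n:                    # inside the corridor
            if action == UP:
                self.pos += 1
                return 0.0, False                # assumed: free moves cost nothing
            if action == DOWN and self.pos > 0:
                self.pos -= 1
                return 0.0, False
            return self.R_WALL, False            # bumped into a wall
        # at the T-junction
        if action in (LEFT, RIGHT):
            correct = (action == LEFT) == self.sign_left
            return (self.R_GOAL if correct else self.R_WRONG), True
        if action == DOWN:
            self.pos -= 1
            return 0.0, False
        return self.R_WALL, False                # UP at the junction hits the wall

# Walk straight down a corridor of length 3 and turn towards the sign.
env = TinyTMaze(corridor_length=3, sign_left=True)
total = 0.0
for a in [UP, UP, UP, LEFT]:
    reward, done = env.step(a)
    total += reward
print(total, done)
```

Note that nothing in the observations at the junction tells the agent which arm is correct; only the road sign seen at the start does, which is why memory is required.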

![T-Maze](https://drive.google.com/thumbnail?id=1xHLjEg76NwtrhxapoIGs2wF3q_FPfiMf&sz=w600)

The T-Maze can be used to illustrate the need for gating in Working Memory: when we add noise to the representation of the maze, the agent needs to actively protect its memory of the road sign from deteriorating due to the variable content of the observations $X_t$. 

## Implementation

The T-Maze task uses an agent that receives as input an abstract representation of its present location in the T-Maze. For most of a trial this is the corridor, which looks identical at every step.

The network consists of three input neurons, describing the current observations in terms of what the agent can see to the left, front and right. There is a single hidden layer with (here) 12 "normal" neurons and 3 LSTM units. These 15 units project to 4 output neurons corresponding to the four possible actions that the agent can take, *Up, Down, Left, Right*. 

![NetworkLSTM](https://drive.google.com/thumbnail?id=1BE7r1ewJflWdeq1zqEFmGU1kr1nPL-xG&sz=w400)

The output of each output neuron is interpreted as a Q-value; the aim of the network is to learn the action sequence that maximizes the total reward. For this, it needs to learn to remember the road sign, move straight along the corridor, and turn left or right, depending on the road sign, at the end of the corridor, when it observes that there are no walls to the left or right but there is a wall in front of it.
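
How Q-values drive behavior and learning can be sketched as follows. The training routine of the tutorial is not shown in this notebook, so this block assumes standard one-step Q-learning with epsilon-greedy exploration; the function names and the example numbers are hypothetical.

```python
import random

ACTIONS = ["Up", "Down", "Left", "Right"]

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def td_target(reward, next_q_values, discount_factor, done):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward                     # no future reward after the trial ends
    return reward + discount_factor * max(next_q_values)

def td_error(q_values, action, target):
    """Difference between the target and the current estimate for `action`."""
    return target - q_values[action]

# Example with made-up Q-values for the four output neurons.
q_now = [0.1, 0.2, 0.9, 0.3]
q_next = [1.0, 2.0, 0.5, 0.0]
a = epsilon_greedy(q_now, epsilon=0.0)    # epsilon = 0: always greedy
target = td_target(-0.1, q_next, discount_factor=0.5, done=False)
print(ACTIONS[a], target, td_error(q_now, a, target))
```

The discount factor sets how strongly the reward at the junction is propagated back to the early corridor steps, which is one reason it interacts with the corridor length.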

## The agent

We put together a working-memory agent as described above.

In [0]:
import os

# Download the tutorial code and the pre-trained weights if they are not present yet.
if not os.path.exists('TMaze_main_tutorial_2019.py'):
  ! wget https://github.com/isapome/teaching/raw/master/TMaze_main_tutorial_2019.py
if not os.path.exists('TMaze_task_tutorial_2019.py'):
  ! wget https://github.com/isapome/teaching/raw/master/TMaze_task_tutorial_2019.py
if not os.path.exists('weights_trial_10_noiseless.pkl'):
  ! wget https://github.com/isapome/teaching/raw/master/weights_trial_10_noiseless.pkl
if not os.path.exists('weights_trial_10_noisy.pkl'):
  ! wget https://github.com/isapome/teaching/raw/master/weights_trial_10_noisy.pkl
if not os.path.exists('weights_trial_20_noiseless.pkl'):
  ! wget https://github.com/isapome/teaching/raw/master/weights_trial_20_noiseless.pkl


import TMaze_task_tutorial_2019 as TMaze
from TMaze_task_tutorial_2019 import TaskTmaze
import pickle
import matplotlib.pyplot as plt

The code below allows you to run the agent in various modes.

In [0]:
def plot_qvalues(predictions, save_name=None):
    """Plot the predicted Q-values and optionally save the figure to `save_name`."""
    ax = predictions.plot(lw=2)
    ax.set_ylabel('Q-values', fontsize=18)
    ax.set_xlabel('Action #', fontsize=18)
    ax.tick_params(labelsize=14)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
              ncol=4, mode="expand", borderaxespad=0., fontsize=12)

    plt.tight_layout()
    # Save before showing: after plt.show() the figure may already be closed,
    # in which case savefig would write an empty image.
    if save_name:
        plt.savefig(save_name, bbox_inches='tight')
    plt.show()

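def experiment(mode, save_weights=False, filename=None):
    """Train (mode='train') or load (mode='load') the weights, then test the network.

    Note: this function reads the notebook-level variables seed, corridor_length,
    learning_rate, n_trainings, discount_factor and noise; set them before calling.
    """
    str_noise = 'noisy' if noise else 'noiseless'
    print("TMaze "+str_noise+" experiment. Corridor length: ", corridor_length, "\n")
    if filename and not filename.endswith('.pkl'):
        raise ValueError('The filename needs to end with .pkl')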
    if mode=='train':
        print("Training process:")
        weights = TMaze.start_training(seed, corridor_length, learning_rate, n_trainings, discount_factor, noise=noise)
        if save_weights:
            if filename:
                with open(filename, 'wb') as file:
                    pickle.dump(weights, file)
            else:
                raise ValueError('For save mode the name of the weight file needs to be provided.')  
    elif mode=='load':
        print("Loading the weights...")
        if filename:
            with open(filename, 'rb') as file:
                weights = pickle.load(file, encoding='latin1')
                print("...weights loaded.")
        else:
            raise ValueError('For load mode the name of the weight file needs to be provided.') 
    else:
        raise ValueError('Unknown mode. Choose \'train\' or \'load\'.')
        
    print(" ")
    print("Testing the network:")
    Results = {}
    # a number < 0.5 puts the road sign on the left, a number >= 0.5 puts it on the right
    for i in [0.1,  0.9]:
        print( " ")
        results = TMaze.testing(weights, i, corridor_length, noise=noise)
        plot_qvalues(results['predictions'])
        Results['results'+str(i)] = results
    return Results



### Task

<font color=blue>
Given the seed you received, compare three different combinations of learning rate and discount factor and report your findings (without noise!). Which combination works best? </font>

In [0]:
# Code to train a network with a set of hyper-parameters. 
#----Train network and save weights (if save_weights==True)
seed = 1234 # enter the seed that will be handed out to you
corridor_length = 12
learning_rate = 0.0005
discount_factor = -0.4 # note: discount factors are conventionally chosen between 0 and 1
noise = None #choose None for noiseless, anything else for noisy
n_trainings = 50000 
weights_filename = 'MYweights_tutorial.pkl'
results = experiment('train', save_weights=False, filename=weights_filename)
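
One possible way to organize the comparison asked for in the task is sketched below. The helper names `compare_settings` and `run_tmaze` are hypothetical, and the three (learning rate, discount factor) pairs are arbitrary placeholders, not recommended values. Because `experiment` reads `learning_rate` and `discount_factor` from the notebook namespace, the wrapper sets them before each call.

```python
def compare_settings(run_fn, settings):
    """Call run_fn(learning_rate, discount_factor) for each pair; collect the results."""
    outcomes = {}
    for lr, gamma in settings:
        outcomes[(lr, gamma)] = run_fn(lr, gamma)
    return outcomes

def run_tmaze(lr, gamma):
    """Wrapper around the notebook's `experiment`, which relies on global variables."""
    global learning_rate, discount_factor
    learning_rate, discount_factor = lr, gamma
    return experiment('train')

# Placeholder combinations; replace them with the three you want to compare.
settings = [(0.0005, 0.9), (0.001, 0.9), (0.0005, 0.5)]

# In the notebook, after running the cells above:
# all_results = compare_settings(run_tmaze, settings)
```

Keeping `seed`, `corridor_length`, `n_trainings` and `noise` fixed across the three runs makes the differences attributable to the two hyper-parameters being compared.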