# Experiments with fast / slow learning updates and lifelong learning as a form of long-term memory

The idea here is that the weights of a neural network represent a kind of memory of the agent's previous experience. By using two networks, one with slow updates and one with fast updates, we gain the ability to retain long-term information as well as respond to short-term changes.

**Model Architecture**
* todo...

**Experiments**
* Maze4x4
* T-Maze
* Random NxN partial information mazes

**Notes**
The fast and slow parts require backprop during *testing* as well as training. The muxer, however, should be frozen.
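
A minimal sketch of what this note could look like in PyTorch, assuming freezing is done with `requires_grad_(False)` and that a test-time update is an ordinary gradient step on the remaining parameters. `TinyFastSlow`, the layer sizes, the SGD learning rate, and the random batch are all hypothetical stand-ins, not the model used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class TinyFastSlow(nn.Module):
    """Hypothetical stand-in with one fast layer, one slow layer and a muxer."""

    def __init__(self, input_dims=4, hidden=8, output_dims=3):
        super().__init__()
        self.fast = nn.Linear(input_dims, hidden)
        self.slow = nn.Linear(input_dims, hidden)
        self.mux = nn.Linear(hidden * 2, output_dims)

    def forward(self, x):
        fast = F.relu(self.fast(x))
        slow = F.relu(self.slow(x))
        return F.log_softmax(self.mux(torch.cat((fast, slow), dim=1)), dim=1)


model = TinyFastSlow()

# Freeze the muxer: its parameters no longer receive gradients.
for p in model.mux.parameters():
    p.requires_grad_(False)

# Hand the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)  # lr is an assumed value

mux_before = model.mux.weight.clone()
fast_before = model.fast.weight.clone()

# One "test-time" update on a dummy batch (random data, for illustration only).
data = torch.randn(16, 4)
target = torch.randint(0, 3, (16,))
optimizer.zero_grad()
loss = F.nll_loss(model(data), target)
loss.backward()
optimizer.step()

mux_unchanged = torch.equal(model.mux.weight, mux_before)
fast_changed = not torch.equal(model.fast.weight, fast_before)
```

After the step, the muxer weights are identical to their earlier values while the fast (and slow) weights have moved, which is the behaviour the note asks for.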

In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision import datasets, transforms

import matplotlib.pyplot as plt

## 1. Set up our model

In [7]:
# global parameters
USE_CUDA = False
BATCH_SIZE = 32

In [None]:
# Our Fast / Slow model
class Net(nn.Module):
    
    def __init__(self, input_dims, output_dims):
        super().__init__()
        
        internal_dims = 64
        
        # muxer: combines the concatenated fast and slow features
        self.mux1 = nn.Linear(internal_dims*2, internal_dims)
        self.mux2 = nn.Linear(internal_dims, output_dims)

        # fast part
        self.fast1 = nn.Linear(input_dims, internal_dims)
        self.fast2 = nn.Linear(internal_dims, internal_dims)

        # slow part
        self.slow1 = nn.Linear(input_dims, internal_dims)
        self.slow2 = nn.Linear(internal_dims, internal_dims)

    def fast_part(self, x):
        x = F.relu(self.fast1(x))
        x = F.relu(self.fast2(x))
        return x

    def slow_part(self, x):
        x = F.relu(self.slow1(x))
        x = F.relu(self.slow2(x))
        return x

    def muxer(self, slow, fast):
        """ Combine fast and slow part """
        x = torch.cat((fast, slow), dim=1)
        x = F.relu(self.mux1(x))
        x = self.mux2(x)
        return F.log_softmax(x, dim=1)
    
    def forward(self, x):
        """ Run input through network. """        
        slow = self.slow_part(x)
        fast = self.fast_part(x)
        return self.muxer(slow, fast)
        
# the training loop
def train(model, env, device, train_loader, optimizer, epoch, verbose=True):
    model.train()
    for i, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if verbose and i % 250 == 0:
            print("Train Epoch: {} [{:.0f}%]\tLoss: {:.6f}".format(
                epoch, 100 * i / len(train_loader), loss.item()))            
            


## 2. Experiments

### 2.1 4x4 Maze with complete information

This is a sanity check to make sure the algorithm is working.

### 2.2 4x4 Maze with no Observation

A 4x4 maze where the agent is given no useful observations.

### 2.3 T-Maze

T-Maze demonstrates the agent's ability to retain important information at the start of an episode and apply it later on.

LSTM units typically get ~70 steps on this task. See this [paper](https://papers.nips.cc/paper/1953-reinforcement-learning-with-long-short-term-memory.pdf)
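
To make the task concrete, here is a rough sketch of a T-maze environment. The `TMaze` class, its observation tuples, the action encoding, and the reward values (+1 / -1) are all assumptions chosen for illustration; the environment used in these experiments may differ. The goal cue is shown only in the first cell, so the agent has to carry it down the corridor to the junction.

```python
import random


class TMaze:
    """Hypothetical minimal T-maze: a corridor of `length` cells ending in a junction."""

    UP, DOWN, FORWARD = 0, 1, 2

    def __init__(self, length=10, seed=None):
        self.length = length
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.pos = 0
        self.goal = self.rng.choice([self.UP, self.DOWN])
        return self._observe()

    def _observe(self):
        # The goal cue is visible only in the very first cell.
        if self.pos == 0:
            return ("cue", self.goal)
        if self.pos == self.length:
            return ("junction", None)
        return ("corridor", None)

    def step(self, action):
        """Return (observation, reward, done)."""
        if self.pos < self.length:
            # In the corridor only FORWARD moves the agent; UP / DOWN hit a wall.
            if action == self.FORWARD:
                self.pos += 1
            return self._observe(), 0.0, False
        if action == self.FORWARD:
            # Dead end straight ahead at the junction.
            return self._observe(), 0.0, False
        # Turning at the junction ends the episode (reward values are assumed).
        reward = 1.0 if action == self.goal else -1.0
        return self._observe(), reward, True


# A hand-written agent that remembers the cue, to show the intended solution.
env = TMaze(length=5, seed=0)
obs = env.reset()
remembered = obs[1]
done, steps = False, 0
while not done:
    action = remembered if obs[0] == "junction" else TMaze.FORWARD
    obs, reward, done = env.step(action)
    steps += 1
```

With a corridor of length 5 the scripted agent finishes in 6 steps with the positive reward. A learning agent has to discover the same "remember, walk, turn" behaviour from the reward alone.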

In [None]:
# Code for the policy update.
# This is a fragment: `self`, `obs_tensor`, `action_tensor` and `return_tensor` are defined elsewhere.

# Compute Values and Probability Distribution
values, prob = self.ac_net(obs_tensor)

# Compute Policy Gradient (Log probability x Action value)
advantages = return_tensor - values
action_log_probs = prob.log().gather(1, action_tensor)
actor_loss = -(advantages.detach() * action_log_probs).mean()

# Compute L2 loss for values
critic_loss = advantages.pow(2).mean()

# Backward Pass
loss = actor_loss + critic_loss
loss.backward()
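
The fragment above depends on objects defined elsewhere, so here is a self-contained sketch of the same loss computation that can be run as-is. The `ActorCritic` module is a hypothetical stand-in for `self.ac_net`, and the batch size, layer sizes, tensor shapes, and random data are assumptions made only so the example executes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class ActorCritic(nn.Module):
    """Hypothetical stand-in for `self.ac_net`: returns (values, action probabilities)."""

    def __init__(self, obs_dims=4, n_actions=3, hidden=16):
        super().__init__()
        self.body = nn.Linear(obs_dims, hidden)
        self.value_head = nn.Linear(hidden, 1)
        self.policy_head = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        h = F.relu(self.body(obs))
        return self.value_head(h), F.softmax(self.policy_head(h), dim=1)


ac_net = ActorCritic()

# Dummy batch. Assumed shapes: actions and returns are (N, 1), matching `values`.
obs_tensor = torch.randn(8, 4)
action_tensor = torch.randint(0, 3, (8, 1))
return_tensor = torch.randn(8, 1)

# Compute values and probability distribution
values, prob = ac_net(obs_tensor)

# Policy gradient term: log probability of the taken action times its advantage.
# The advantage is detached so the actor loss does not train the value head.
advantages = return_tensor - values
action_log_probs = prob.log().gather(1, action_tensor)
actor_loss = -(advantages.detach() * action_log_probs).mean()

# L2 loss for the value estimates
critic_loss = advantages.pow(2).mean()

loss = actor_loss + critic_loss
loss.backward()
```

One thing to watch for: if `return_tensor` had shape `(N,)` while `values` has shape `(N, 1)`, the subtraction would broadcast to `(N, N)` without raising an error, so it is worth checking the shapes before computing the advantage.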

In [None]:
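# Custom learning rates for the network, using per-parameter-group options.
# NOTE: `fc`, `agroupoflayer` and `lastlayer` are placeholder attribute names,
# not layers of the Net class above.

# Named `optimizer` so that it does not shadow the `torch.optim` module imported as `optim`.
optimizer = optim.Adam(
    [
        {"params": model.fc.parameters(), "lr": 1e-3},
        {"params": model.agroupoflayer.parameters()},  # falls back to the default lr below
        {"params": model.lastlayer.parameters(), "lr": 4e-2},
    ],
    lr=5e-4,
)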
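
Applied to the fast / slow idea, the same mechanism could look roughly like the sketch below. The module layout, the layer sizes, and both learning rates are assumed values for illustration, not settings taken from the experiments. The muxer is simply left out of the optimizer, so the optimizer never updates it.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in for the fast, slow and muxer parts of the model.
model = nn.ModuleDict({
    "fast": nn.Linear(4, 8),
    "slow": nn.Linear(4, 8),
    "mux": nn.Linear(16, 3),
})

FAST_LR = 1e-2  # assumed value
SLOW_LR = 1e-4  # assumed value

optimizer = optim.Adam(
    [
        # The fast part overrides the default learning rate.
        {"params": model["fast"].parameters(), "lr": FAST_LR},
        # The slow part has no "lr" key, so it uses the default given below.
        {"params": model["slow"].parameters()},
    ],
    lr=SLOW_LR,
)

# Each parameter group ends up with its own learning rate.
group_lrs = [group["lr"] for group in optimizer.param_groups]
```

Here `group_lrs` holds the fast rate followed by the slow rate, and the muxer parameters do not appear in any group.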