# Reinforcement Learning

To train an RL agent, we need (i) an environment and (ii) a learning method. In this work, we define a foraging environment where the goal of the agent is to find as many targets as possible in a given time. We consider environments with non-destructive (replenishable) targets, which we implement by displacing the agent a distance $l_\textrm{c}$ from the center of the found target.
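As a rough illustration of the replenishable-target rule, the sketch below moves an agent to a point at distance $l_\textrm{c}$ from the target center. The helper name `kick_agent`, the uniformly random kick direction and the periodic wrapping are assumptions made for this example; they are not taken from the `TargetEnv` implementation.

```python
import numpy as np

def kick_agent(target_pos, lc, world_size, rng):
    """Return a position at distance lc from target_pos, in a random direction.

    Hypothetical helper: the random direction and the periodic wrapping
    are assumptions of this sketch.
    """
    angle = rng.uniform(0.0, 2.0 * np.pi)
    displacement = lc * np.array([np.cos(angle), np.sin(angle)])
    # wrap the new position into the square world [0, world_size)
    return (np.asarray(target_pos, dtype=float) + displacement) % world_size

rng = np.random.default_rng(0)
target = np.array([50.0, 50.0])
new_pos = kick_agent(target, lc=1.0, world_size=100.0, rng=rng)
distance = np.linalg.norm(new_pos - target)  # equals lc when no wrapping occurs
```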

As for the agent, we use Projective Simulation (PS) to model its decision making process and learning method. However, other algorithms that work with stochastic policies can also be used.

First, we import the classes that define the environment (`TargetEnv`), the forager dynamics (`Forager`), and its learning method.

In [10]:
import numpy as np
from rl_opts.rl_framework import TargetEnv, Forager
from tqdm.notebook import tqdm

Note: as currently written, the class `Forager` inherits its decision-making and learning methods from a PS agent. Other learning algorithms can be implemented directly by changing this inheritance. The learning algorithm should provide a method for decision making, called `deliberate`, which takes a state as input, and another one for updating the policy, called `learn`, which takes a reward as input.
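To make this interface concrete, here is a minimal sketch of a learner that exposes `deliberate` and `learn`. The class `ToyLearner` and its update rule (adding the reward to the preference of the last decision) are hypothetical and deliberately simplistic; they only illustrate the two required methods and are not part of `rl_opts`.

```python
import numpy as np

class ToyLearner:
    """Hypothetical stand-in for a learning algorithm with the interface described above."""

    def __init__(self, num_actions, num_states, seed=0):
        # unnormalized preferences, one column per state
        self.h = np.ones((num_actions, num_states))
        self.rng = np.random.default_rng(seed)
        self.last = None  # (action, state) of the most recent decision

    def deliberate(self, state):
        """Sample an action from the stochastic policy of the given state."""
        probs = self.h[:, state] / self.h[:, state].sum()
        action = int(self.rng.choice(len(probs), p=probs))
        self.last = (action, state)
        return action

    def learn(self, reward):
        """Reinforce the most recent (action, state) pair by the received reward."""
        if self.last is not None:
            self.h[self.last] += reward

learner = ToyLearner(num_actions=2, num_states=5)
action = learner.deliberate(state=3)
learner.learn(reward=1.0)
```

Whether such a class can replace the PS parent of `Forager` without further changes depends on which other attributes `Forager` expects from it, which this sketch does not cover.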

We set up the parameters defining the length of the episodes (number of RL steps) and the number of episodes.

In [5]:
TIME_EP = 200 #time steps per episode
EPISODES = 1200 #number of episodes

We initialize the environment.

In [3]:
#Environment parameters
Nt = 100 #number of targets
L = 100 #world size
r = 0.5 #target detection radius
lc = 1 #cutoff length

#Initialize environment
env = TargetEnv(Nt, L, r, lc)

We initialize the agent. As its state, the agent perceives the value of an internal counter that keeps track of the number of small steps it has taken without turning. The possible actions are to continue walking in the same direction or to turn. In either case, the agent takes a small step of length $d=1$ after making a decision. Let's define the parameters of the PS forager agent and initialize it:

In [4]:
NUM_ACTIONS = 2 # continue in the same direction, turn
STATE_SPACE = [np.linspace(0, TIME_EP-1, TIME_EP), np.arange(1), np.arange(1)] # one state per value that the counter may possibly have within an episode.
#--the last two entries are just placeholders here, but the code is general enough to implement ensembles of interacting agents that forage together.--
GAMMA = 0.00001 #forgetting parameter in PS
ETA_GLOW = 0.1 #glow damping parameter in PS
INITIAL_DISTR = [] #set a different initialization policy
for percept in range(TIME_EP): 
    INITIAL_DISTR.append([0.99, 0.01]) 
    

#Initialize agent
agent = Forager(num_actions=NUM_ACTIONS,
                state_space=STATE_SPACE,
                gamma_damping=GAMMA,
                eta_glow_damping=ETA_GLOW,
                initial_prob_distr=INITIAL_DISTR)

We run the learning process.

In [11]:
for e in tqdm(range(EPISODES)):
        
    #restart environment and agent's counter and g matrix
    env.init_env()
    agent.agent_state = 0
    agent.reset_g()

    for t in range(TIME_EP):
        
        #step to set counter to its min. value n=1
        if t == 0 or env.kicked[0]:
            #do one step with random direction (no learning in this step)
            env.update_pos(1)
            #check boundary conditions
            env.check_bc()
            #reset counter
            agent.agent_state = 0
            #set kicked value to false again
            env.kicked[0] = 0
            
        else:
            #get perception
            state = agent.get_state()
            #decide
            action = agent.deliberate(state)
            #act (update counter)
            agent.act(action)
            
            #update positions
            env.update_pos(action)
            #check if target was found + kick if it is
            reward = env.check_encounter()
                
            #check boundary conditions
            env.check_bc()
            #learn
            agent.learn(reward)


Note: the code can directly accommodate environments with several interacting agents. For this reason, both the environment class `TargetEnv` and the forager class `Forager` contain methods for agents that have visual cones and can perceive the presence of other agents in their surroundings. However, these features are not used in this work.
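In the learning loop above, `agent.act(action)` updates the internal counter. The sketch below shows one plausible form of that update. It assumes that action 0 means "continue" and action 1 means "turn", and that turning resets the counter to 0. The function `update_counter` is hypothetical and is not the `Forager` method itself.

```python
def update_counter(counter, action):
    """Return the new counter: +1 when continuing (action 0), reset to 0 when turning (action 1)."""
    return counter + 1 if action == 0 else 0

counter = 0
history = []
for action in [0, 0, 1, 0]:
    counter = update_counter(counter, action)
    history.append(counter)
# history is [1, 2, 0, 1]
```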

## Reproduction of results

Here, we explain how to reproduce the results of the paper that concern the training of RL agents in the foraging environment.

### Training

First, we run the training as detailed above, and we save the agent's memory periodically, together with other useful data. The results of the training are saved by default in the directory 'results/learning/'.

You can reproduce the training by running the following command line:

```python 
python run_learning.py
```

The code first imports a configuration file that contains the parameters to initialize both the environment and the agent. For each set of parameters we ran, there is an experiment name (by default 'learning'), and an identifier of the form "exp_numconfig" (e.g. exp_0) that uniquely identifies the config file and the folder that contains the saved data. The config files for the experiments that give the results of the paper can be found in the directory 'configurations/learning/'. 
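The naming convention can be summarized in a small helper. The function `experiment_paths` is hypothetical; it only reproduces the string concatenations used on this page for the config file name and the results folder.

```python
def experiment_paths(experiment, num_config):
    """Build the config file name and results folder for one experiment (hypothetical helper)."""
    identifier = 'exp_' + str(num_config)
    config_file = identifier + '.cfg'
    results_path = 'results/' + experiment + '/' + identifier + '/'
    return config_file, results_path

config_file, results_path = experiment_paths('learning', 8)
# config_file  -> 'exp_8.cfg'
# results_path -> 'results/learning/exp_8/'
```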

The experiment name (--experiment), configuration number (--num_config) and agent identifier (--run) can be passed directly as command-line arguments when running the file. The following snippet of the code shows which arguments you can set.

In [12]:
from sys import argv
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--experiment', type=str, default='learning', help='type of experiment')        
parser.add_argument('--run', type=int, default=0, help='run id')
parser.add_argument('--num_config', type=int, default=0, help='number of the configuration file')


Then, the configuration is imported.

In [13]:
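from rl_opts.utils import get_config
# `experiment` and `num_config` hold the values of the command-line arguments --experiment and --num_config
config = get_config('exp_'+str(num_config)+'.cfg', experiment)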


The parameters are extracted from the config dictionary as follows:

In [None]:
NUM_ACTIONS = config['NUM_ACTIONS']

These are the parameters that you can find in the config files:

`NUM_TARGETS` : number of targets \
`WORLD_SIZE` : side of the square that defines the world (with periodic boundary conditions) \
`r` : target detection radius \
`lc` : cutoff length \
`MAX_STEP_L` : maximum value of the step counter (which coincides with the number of RL steps per episode) \
`NUM_BINS` : number of bins in which the state space is split. This is set to have one state per value of the counter \
`NUM_ACTIONS` : number of actions \
`GAMMA` : forgetting parameter $\gamma$ in PS \
`ETA_GLOW` : glow damping parameter $\eta_g$ in PS \
`PI_INIT` : policy initialization $\pi_0$ ($\forall n$). Note that it is given as $\pi_0(\uparrow|n)$ \
`NUM_EPISODES` : number of episodes
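As a sketch of how these parameters could map onto the agent's initialization, the snippet below builds the state space and the initial policy in the same way as in the first part of this page, with `TIME_EP` replaced by the config entries. This mapping is an assumption of the example and may differ from what the training script actually does.

```python
import numpy as np

# subset of the config entries, with the values of the exp_0 configuration
config = {'MAX_STEP_L': 20000, 'NUM_BINS': 20000, 'NUM_ACTIONS': 2, 'PI_INIT': 0.99}

# one state per counter value, plus two placeholder dimensions
state_space = [np.linspace(0, config['MAX_STEP_L'] - 1, config['NUM_BINS']),
               np.arange(1), np.arange(1)]

# same initial policy for every counter value: [pi_0(continue|n), pi_0(turn|n)]
initial_distr = [[config['PI_INIT'], 1 - config['PI_INIT']]
                 for _ in range(config['NUM_BINS'])]
```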


Throughout the training, we periodically save the PS agent's memory (h matrix) by running:

In [None]:
# path of the results folder
results_path = 'results/'+experiment+'/'+'exp_'+str(num_config)+'/'
# save data
np.save(results_path+'memory_agent_'+str(run)+'_episode_'+str(e+1)+'.npy', agent.h_matrix)

We study foraging in environments with different cutoff lengths $l_\textrm{c}$. Exp_0 corresponds to $l_\textrm{c}=0.6$, and exp_1 to exp_10 correspond to $l_\textrm{c}=1,\dots,10$, respectively. In experiments exp_0 to exp_10, the initialization policy is $\pi_0(\Rsh|n)=0.01$ $\forall n$. Exp_11 and exp_12 correspond to experiments where the initialization policy is $\pi_0(\Rsh|n)=0.5$ $\forall n$. Each experiment is run with 10 independent agents (--run 0 to 9).
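For bookkeeping, the experiment list above can be written down as follows. The names `cutoff_lengths` and `turning_prob_init` are hypothetical. The cutoff lengths of exp_11 and exp_12 are not stated here, so they are left out of the dictionary.

```python
# exp number -> cutoff length l_c, as listed above (exp_11 and exp_12 omitted)
cutoff_lengths = {0: 0.6}
cutoff_lengths.update({k: float(k) for k in range(1, 11)})

def turning_prob_init(num_config):
    """Initial turning probability pi_0(turn|n) for a given experiment number."""
    return 0.5 if num_config in (11, 12) else 0.01

# one command per (experiment, agent) pair for exp_0 to exp_10 and runs 0 to 9
commands = ['python run_learning.py --num_config %d --run %d' % (k, run)
            for k in cutoff_lengths for run in range(10)]
```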

As an example, you can train an agent with identifier 3 in experiment exp_8 by running the command line:

```python
python run_learning.py --num_config 8 --run 3
```
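To see what values these flags produce, the argument parser shown earlier can be fed the same options as an explicit list. How `run_learning.py` reads the parsed values internally is assumed here; the snippet only demonstrates the standard `argparse` behavior.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--experiment', type=str, default='learning', help='type of experiment')
parser.add_argument('--run', type=int, default=0, help='run id')
parser.add_argument('--num_config', type=int, default=0, help='number of the configuration file')

# parse an explicit list instead of sys.argv so that the example also runs in a notebook
args = parser.parse_args(['--num_config', '8', '--run', '3'])
experiment, num_config, run = args.experiment, args.num_config, args.run
# experiment -> 'learning' (default), num_config -> 8, run -> 3
```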

Alternatively, you can also separately define (i) a config dictionary with the parameters detailed above, (ii) a results path and (iii) an agent identifier and run:

In [None]:
from rl_opts.utils import get_config

# (i) config dictionary: here we load the exp_0 configuration; a hand-written dictionary with the same keys also works
config = get_config('exp_0.cfg')
config

{'NUM_TARGETS': 100,
 'WORLD_SIZE': 100,
 'r': 0.5,
 'lc': 0.6,
 'MAX_STEP_L': 20000,
 'NUM_BINS': 20000,
 'NUM_ACTIONS': 2,
 'GAMMA': 1e-05,
 'ETA_GLOW': 0.1,
 'PI_INIT': 0.99,
 'NUM_EPISODES': 12000}

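In [None]:
from rl_opts.learn_and_bench import learning
# (ii) results path and (iii) agent identifier (example values)
results_path, run = 'results/learning/exp_0/', 0
learning(config, results_path, run)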

### Postlearning analysis

To fairly compare the performance of the RL agents throughout the training with that of the benchmark models (Fig. 2), we need to run the same number of walks. During training, the agent's policy changes from one episode to the next, so the efficiency of a single episode (i.e. one walk) is not enough, since we consider $10^4$ walks for the benchmark policies. Thus, we save the agent's policy at different stages of the training and then, in a postlearning analysis, we run $10^4$ walks with that frozen policy to get a more accurate evaluation of its performance.
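The benefit of averaging over many walks can be illustrated with the standard error of the mean, which shrinks as $1/\sqrt{N}$ with the number of walks $N$. The efficiencies below are synthetic numbers drawn from a normal distribution purely for illustration. They are not results from the paper, and the distribution is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic per-walk efficiencies, for illustration only
efficiencies = rng.normal(loc=0.01, scale=0.002, size=10_000)

mean_efficiency = efficiencies.mean()
spread_single_walk = efficiencies.std(ddof=1)
# uncertainty of the mean over all walks: 100 times smaller for 10^4 walks
standard_error = spread_single_walk / np.sqrt(efficiencies.size)
```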

To reproduce the results of the postlearning analysis for, say, agent with identifier 3 in experiment exp_8, you can run the following command line:

```python
python run_statistics_postlearning.py --num_config 8 --run 3
```

By default, the performance of the agent is evaluated every $2000$ episodes (--episode_interval), by taking the policy that the agent had in that episode, freezing it and letting the agent do $10^4$ independent walks (--num_walks) following that policy.
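Under the assumption that policies are evaluated at every multiple of the episode interval up to the last training episode, the evaluated episodes can be listed as follows. The helper `evaluation_episodes` is hypothetical, and the value 12000 is the `NUM_EPISODES` entry of the exp_0 configuration.

```python
def evaluation_episodes(num_episodes, episode_interval):
    """Episodes whose saved policy is evaluated (assumes checkpoints at multiples of the interval)."""
    return list(range(episode_interval, num_episodes + 1, episode_interval))

episodes = evaluation_episodes(num_episodes=12000, episode_interval=2000)
# episodes -> [2000, 4000, 6000, 8000, 10000, 12000]
```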

Alternatively, you can first define (i) the experiment name, (ii) config number, (iii) agent identifier, (iv) number of walks and (v) an episode interval. Parameters (i) to (iii) are needed to identify the agent that you want to analyze, and load its previously saved policies from the correct folder. Then, you run:

In [1]:
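from rl_opts.learn_and_bench import agent_efficiency
# example values: agent 3 of exp_8, with the default number of walks and episode interval
experiment_name, num_config, run, num_walks, episode_interval = 'learning', 8, 3, 10000, 2000
agent_efficiency(experiment_name, num_config, run, num_walks, episode_interval)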


Essentially, this analysis is carried out by the function `walk_from_policy`, which takes a fixed policy as input and runs the walks in parallel. It outputs a list (`rewards`) with the efficiency achieved in each walk.

In [None]:
from rl_opts.utils import get_config, get_policy
from rl_opts.utils_env import walk_from_policy

# path of the results folder
results_path = 'results/'+experiment_name+'/'+'exp_'+str(num_config)+'/'

# get policy from the stored h matrix at the given training_episode
frozen_policy = get_policy('results/'+experiment_name+'/', 'exp_'+str(num_config), run, training_episode)
            
# run the 10^4 walks (in parallel) with the same policy
rewards = walk_from_policy(policy=frozen_policy, 
                           time_ep=config['MAX_STEP_L'], 
                           n=num_walks, 
                           L=config['WORLD_SIZE'], 
                           Nt=config['NUM_TARGETS'], 
                           r=config['r'], 
                           lc=config['lc'])
# save results
np.save(results_path+'performance_post_training_agent_'+str(run)+'_episode_'+str(training_episode)+'.npy', rewards)
        

Note that the method `get_policy` takes the saved agent's memory at the given `training_episode` and transforms it into a policy. 
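A minimal sketch of such a transformation is shown below. It assumes the standard PS rule, in which the probability of an action is its h-value divided by the sum of h-values for that state, and it assumes an h matrix of shape (actions, states) with "continue" in row 0. The function `policy_from_h_matrix` is hypothetical and is not the `get_policy` implementation.

```python
import numpy as np

def policy_from_h_matrix(h_matrix):
    """Normalize each column of an (actions x states) h matrix into probabilities."""
    h_matrix = np.asarray(h_matrix, dtype=float)
    return h_matrix / h_matrix.sum(axis=0, keepdims=True)

# toy memory with 2 actions (rows) and 3 counter values (columns)
h = np.array([[99.0, 3.0, 1.0],
              [1.0, 1.0, 1.0]])
policy = policy_from_h_matrix(h)
prob_continue = policy[0]  # pi(continue | n), assuming row 0 is "continue"
# prob_continue -> [0.99, 0.75, 0.5]
```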

Note: in the code, the policies are always given as $\pi(\uparrow|n)$.