# Recap

- We tested the single agent setting for a different task.
- The landmarks were always fixed at time of the environment initialization.
- It was shown that the agent learned to navigate to any part of the map.
- In particular, overflow would happen when the agent's starting coordinate was kept fixed. Random restart is an essential part of exploration.
- The optimal policies are not deterministic -- the temperature parameter $\tau$ that regulates the entropy was tested for **1**, **2**, **3**, **5** and **10**.
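
To illustrate how a temperature parameter regulates entropy, here is a minimal sketch. It assumes the policy is a Boltzmann (softmax) distribution over action preferences, which the recap does not state explicitly. The `preferences` values and the function names are hypothetical.

```python
import math


def softmax_policy(preferences, tau):
    """Boltzmann distribution over actions; a larger tau flattens it."""
    scaled = [p / tau for p in preferences]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical preferences for 5 actions (not taken from the experiments).
preferences = [2.0, 1.0, 0.0, -1.0, 0.5]

for tau in (1, 2, 3, 5, 10):
    probs = softmax_policy(preferences, tau)
    print(f"tau={tau:>2}  entropy={entropy(probs):.3f}")
```

Under this assumption, entropy grows with $\tau$ and approaches $\log 5$ (the uniform policy over 5 actions) as $\tau$ becomes large.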

## Findings

1. The most useful task variant is the one that randomly restarts the landmarks.
2. Regularization, via parameter clipping, improved learning.
3. The optimal value was $\tau = 5.0$.


# Duo Task

## Goal:

### Agents must learn how to navigate to a target landmark, while avoiding other agents.

- Both agents and landmarks are restarted at the beginning of each episode. Each agent is assigned a landmark it must navigate to, and it must find out through trial and error which landmark it was assigned.
- States are the coordinates to the other agent and to both landmarks.
- Reward is defined by the distance from an agent to its assigned landmark. If the agents collide, both receive an extra reward of -1.
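
The reward rule above could be sketched as follows. This is an illustration under stated assumptions, not the environment's actual code: the reward is taken to be the negative Euclidean distance, and `COLLISION_RADIUS` is a hypothetical threshold that the notebook does not specify.

```python
import math

COLLISION_PENALTY = -1.0  # from the text: extra reward of -1 on collision
COLLISION_RADIUS = 0.1    # hypothetical threshold, not given in the notebook


def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def duo_rewards(agents, landmarks, assignment):
    """Per-agent rewards for the two-agent task.

    agents, landmarks: lists of (x, y) positions.
    assignment[i]: index of the landmark assigned to agent i.
    """
    # Assumption: reward is the negative distance to the assigned landmark.
    rewards = [
        -distance(agents[i], landmarks[assignment[i]])
        for i in range(len(agents))
    ]
    # Both agents are penalized when they collide.
    if distance(agents[0], agents[1]) < COLLISION_RADIUS:
        rewards = [r + COLLISION_PENALTY for r in rewards]
    return rewards


print(duo_rewards([(0.0, 0.0), (3.0, 4.0)], [(0.0, 1.0), (3.0, 4.0)], [0, 1]))
```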



The objective of this notebook is to compare three learning settings.

1. Centralized Actor Critic

    - Single agent.
    - Fully observable setting.
    - Learns using the average reward from both players.

2. Cooperative Actor Critic

    - Independent agents.
    - Fully observable setting.
    - Learns using the average reward from both players.

3. Independent Learners Actor Critic

    - Independent agents.
    - Fully observable setting.
    - Individual rewards.
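
The three settings differ in how many learners there are and which reward each learner sees. A minimal sketch of that routing, assuming two players and using the hypothetical names `learning_signals`, `"central"`, `"cooperative"` and `"independent"`:

```python
def learning_signals(setting, rewards):
    """Return the reward signal seen by each learner.

    rewards: per-player rewards for one timestep.
    """
    mean_reward = sum(rewards) / len(rewards)
    if setting == "central":
        # One learner controls both players and sees the average reward.
        return [mean_reward]
    if setting == "cooperative":
        # One learner per player, all trained on the average reward.
        return [mean_reward] * len(rewards)
    if setting == "independent":
        # One learner per player, each trained on its own reward.
        return list(rewards)
    raise ValueError(f"unknown setting: {setting}")


for name in ("central", "cooperative", "independent"):
    print(name, learning_signals(name, [-1.0, -3.0]))
```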

## Section A: First attempt.


## Settings


1. We compare the three models above. 
2. Initially, $\tau = 100$, and it decays linearly with the number of episodes (`explore_episodes=975`).
3. Each test dataframe consists of the `DataFrame.describe()` statistics from **N** = 30 independent random trials, each consisting of a rollout of `M=100` timesteps, with $\tau$ set to a predetermined value.

Parameters:
```
ALPHA = 0.5  # ALPHA:
BETA = 0.3  # BETA:
TAU = 5.0   # Final TAU
EXPLORE_EPISODES = 975
EPISODES = 1000
EXPLORE = True
BASE_PATH = 'data/16_duo/03_tau05/'

N_WORKERS = 6
N_AGENTS = 2
```
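
The linear temperature decay described in the settings might look like the sketch below. It assumes that $\tau$ starts at 100 on episode 0, reaches the final `TAU` after `EXPLORE_EPISODES` episodes, and stays there for the remaining episodes. The function name `tau_at` is hypothetical.

```python
TAU_START = 100.0       # initial temperature, from the settings
TAU_FINAL = 5.0         # final TAU, from the parameters
EXPLORE_EPISODES = 975  # episodes over which tau decays


def tau_at(episode, tau_start=TAU_START, tau_final=TAU_FINAL,
           explore_episodes=EXPLORE_EPISODES):
    """Temperature for a given 0-based episode index."""
    if episode >= explore_episodes:
        # Assumption: tau is held at its final value after exploration.
        return tau_final
    fraction = episode / explore_episodes
    return tau_start + fraction * (tau_final - tau_start)


for episode in (0, 195, 975, 999):
    print(episode, tau_at(episode))
```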

## 1) Central Agent

`BASE_PATH = 'data/16_duo/03_tau05/00_central/02'`

### 1.1 Rollout Simulation

GIF from the best performing training.

![pipeline-central-simulation](16_duo/03_tau05/00_central/02/simulation-pipeline-best.gif)

### 1.2 Rollout Graph


![pipeline-central-simulation](16_duo/03_tau05/00_central/02/evaluation_rollout_num17.png)

### 1.3 Train<a name="A-1.3"></a> 



![pipeline-central-train-30](16_duo/03_tau05/00_central/02/train_pipeline_m30.png)
![rollout-central-train-30](16_duo/03_tau05/00_central/02/train_rollout_m30.png)

## 2) Cooperative Actor Critic

### 2.1 Rollout Simulation

GIF from the best performing training.


![pipeline-joint-simulation](16_duo/03_tau05/01_joint_learners/02/simulation-pipeline-best.gif)

### 2.2 Rollout Graph


![pipeline-joint-rollout](16_duo/03_tau05/01_joint_learners/02/evaluation_rollout_num1.png)

### 2.3 Train<a name="A-2.3"></a> 


![pipeline-central-train-30](16_duo/03_tau05/1000/00_central/02/train_pipeline_m30.png)
![pipeline-joint-train-30](16_duo/03_tau05/01_joint_learners/02/train_pipeline_m30.png)
![pipeline-independent-train-30](16_duo/03_tau05/1000/02_independent_learners/02/train_pipeline_m30.png)
![rollout-joint-train-30](16_duo/03_tau05/01_joint_learners/02/train_rollout_m30.png)

## 3) Independent Learners Actor Critic

### 3.1 Rollout Simulation

GIF from the best performing training.


![pipeline-independent-simulation](16_duo/03_tau05/02_independent_learners/02/simulation-pipeline-best.gif)

### 3.2 Rollout Graph


![pipeline-independent-rollout](16_duo/03_tau05/02_independent_learners/02/evaluation_rollout_num1.png)

### 3.3 Train <a name="A-3.3"></a> 



![pipeline-independent-train-30](16_duo/03_tau05/02_independent_learners/02/train_pipeline_m30.png)
![rollout-independent-train-30](16_duo/03_tau05/02_independent_learners/02/train_rollout_m30.png)

## 4) Leaderboard 1000<a name="A-leaderboard"></a> 

In [15]:
import pandas as pd
BASE_PATH = '16_duo/03_tau05/'

central_df = pd.read_csv(BASE_PATH + '00_central/02/pipeline.csv', sep=',', index_col=0)
joint_df = pd.read_csv(BASE_PATH + '01_joint_learners/02/pipeline.csv', sep=',', index_col=0)
indep_df = pd.read_csv(BASE_PATH + '02_independent_learners/02/pipeline.csv', sep=',', index_col=0)

def describe(dataframe: pd.DataFrame, label: str) -> pd.DataFrame:
    """Summarizes the per-trial mean return across trials.

    Parameters
    ----------
    dataframe: pd.DataFrame
        The result of df.describe() over N independent rollouts,
        each consisting of M timesteps.
        Trials are in the columns and rows are statistics.
    label: str
        Name given to the resulting column.

    Returns
    -------
    dataframe: pd.DataFrame
        A single-column description of the average return.

    """
    # Keep only the per-trial 'mean', 'min' and 'max' rows.
    df = dataframe.drop(['std', 'count', '25%', '50%', '75%'], axis=0)
    # Describe across trials, then keep the statistics of the per-trial mean.
    ts = df.T.describe()['mean']
    ts.name = label
    return ts.to_frame()

In [17]:
dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(joint_df, label='joint'))
dataframes.append(describe(indep_df, label='independent'))
noregdf = pd.concat(dataframes, axis=1)
noregdf

|       | central   | joint     | independent |
|-------|-----------|-----------|-------------|
| count | 30.0      | 30.0      | 30.0        |
| mean  | -0.902543 | -0.771013 | -0.730448   |
| std   | 0.214082  | 0.250752  | 0.215126    |
| min   | -1.48868  | -1.427638 | -1.28006    |
| 25%   | -1.061837 | -0.915154 | -0.819776   |
| 50%   | -0.85727  | -0.712802 | -0.702597   |
| 75%   | -0.748428 | -0.589685 | -0.604472   |
| max   | -0.589694 | -0.39237  | -0.413223   |
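
To make explicit what each leaderboard column summarizes, here is a stdlib-only sketch of the same aggregation: take the mean return of each trial, then describe those per-trial means across trials. The function name `leaderboard_column` is hypothetical, and the data are synthetic random numbers used only to exercise the code; they are not experimental results.

```python
import random
import statistics


def leaderboard_column(trials):
    """Summarize per-trial mean returns across trials.

    trials: list of N trials, each a list of M per-timestep returns.
    """
    per_trial_mean = [statistics.mean(t) for t in trials]
    return {
        "count": len(per_trial_mean),
        "mean": statistics.mean(per_trial_mean),
        # Sample standard deviation, matching the pandas default (ddof=1).
        "std": statistics.stdev(per_trial_mean),
        "min": min(per_trial_mean),
        "max": max(per_trial_mean),
    }


# Synthetic stand-in for N=30 trials of M=100 timesteps each.
rng = random.Random(0)
synthetic_trials = [[-rng.random() for _ in range(100)] for _ in range(30)]
summary = leaderboard_column(synthetic_trials)
print(summary)
```

The `mean` row of the leaderboard is therefore an average of averages: first over the timesteps of a trial, then over the 30 trials.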


1. The first thing to note is that, in the evaluation rollouts, the central agent shows one collision, the joint action learner shows two, and the independent learner shows three. This suggests that **cooperation** is a helpful means to avoid collisions and that **coordination** is an effective way to achieve that.
2. However, the [Leaderboard 1000](#A-leaderboard) table shows that the central agent underperforms. The training plots hint at the reason: [Plot 1.3 Central Training](#A-1.3), [Plot 2.3 Joint Training](#A-2.3), [Plot 3.3 Independent Training](#A-3.3). It seems that the central agent has not had a chance to learn: it must learn over 125 actions, while each of the other agents has to learn over only 5 actions.


## Section B: Successful Attempt<a name="B-section"></a> 

We set `episodes=5000` and `explore_episodes=4975`.

## 1) Central Agent

`BASE_PATH = 'data/16_duo/03_tau05/5000/00_central/02'`

### 1.1 Rollout Simulation

GIF from the best performing training.

![pipeline-central-simulation](16_duo/03_tau05/5000/00_central/02/simulation-pipeline-best.gif)

### 1.2 Rollout Graph


![pipeline-central-simulation](16_duo/03_tau05/5000/00_central/02/evaluation_rollout_num06.png)

### 1.3 Train<a name="B-1.3"></a> 



![pipeline-central-train-30](16_duo/03_tau05/5000/00_central/02/train_pipeline_m30.png)
![rollout-central-train-30](16_duo/03_tau05/5000/00_central/02/train_rollout_m30.png)

## 2) Cooperative Actor Critic

### 2.1 Rollout Simulation

GIF from the best performing training.


![pipeline-joint-simulation](16_duo/03_tau05/5000/01_joint_learners/02/simulation-pipeline-best.gif)

### 2.2 Rollout Graph


![pipeline-joint-rollout](16_duo/03_tau05/5000/01_joint_learners/02/evaluation_rollout_num01.png)

### 2.3 Train<a name="B-2.3"></a> 



![pipeline-joint-train-30](16_duo/03_tau05/5000/01_joint_learners/02/train_pipeline_m30.png)
![rollout-joint-train-30](16_duo/03_tau05/5000/01_joint_learners/02/train_rollout_m30.png)

## 3) Independent Learners Actor Critic

### 3.1 Rollout Simulation

GIF from the best performing training.


![pipeline-independent-simulation](16_duo/03_tau05/5000/02_independent_learners/02/simulation-pipeline-best.gif)

### 3.2 Rollout Graph


![pipeline-independent-rollout](16_duo/03_tau05/5000/02_independent_learners/02/evaluation_rollout_num01.png)

### 3.3 Train <a name="B-3.3"></a> 



![pipeline-independent-train-30](16_duo/03_tau05/5000/02_independent_learners/02/train_pipeline_m30.png)
![rollout-independent-train-30](16_duo/03_tau05/5000/02_independent_learners/02/train_rollout_m30.png)

## 4) Leaderboard 5000<a name="B-leaderboard"></a> 

In [34]:
import pandas as pd
BASE_PATH = '16_duo/03_tau05/5000/'

central_df = pd.read_csv(BASE_PATH + '00_central/02/pipeline.csv', sep=',', index_col=0)
joint_df = pd.read_csv(BASE_PATH + '01_joint_learners/02/pipeline.csv', sep=',', index_col=0)
indep_df = pd.read_csv(BASE_PATH + '02_independent_learners/02/pipeline.csv', sep=',', index_col=0)

In [35]:
dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(joint_df, label='joint'))
dataframes.append(describe(indep_df, label='independent'))
noregdf = pd.concat(dataframes, axis=1)
noregdf

|       | central   | joint     | independent |
|-------|-----------|-----------|-------------|
| count | 30.0      | 30.0      | 30.0        |
| mean  | -0.707014 | -0.936407 | -0.88491    |
| std   | 0.174991  | 0.160264  | 0.137636    |
| min   | -1.11946  | -1.24096  | -1.264698   |
| 25%   | -0.870943 | -1.041689 | -0.953044   |
| 50%   | -0.645521 | -0.929262 | -0.854733   |
| 75%   | -0.591891 | -0.831467 | -0.782187   |
| max   | -0.475165 | -0.637744 | -0.684747   |


In [37]:
BASE_PATH = '16_duo/01_tau02/5000/'

central_df = pd.read_csv(BASE_PATH + '00_central/02/pipeline.csv', sep=',', index_col=0)
central_df.T.describe()

|       | count | mean      | std      | min       | 25%       | 50%       | 75%       | max       |
|-------|-------|-----------|----------|-----------|-----------|-----------|-----------|-----------|
| count | 30.0  | 30.0      | 30.0     | 30.0      | 30.0      | 30.0      | 30.0      | 30.0      |
| mean  | 100.0 | -0.789356 | 0.262101 | -1.614022 | -0.910463 | -0.745515 | -0.628947 | -0.363709 |
| std   | 0.0   | 0.19135   | 0.084206 | 0.25024   | 0.245691  | 0.216647  | 0.205738  | 0.185291  |
| min   | 100.0 | -1.176989 | 0.103623 | -2.108609 | -1.345804 | -1.216842 | -1.130174 | -0.76106  |
| 25%   | 100.0 | -0.920646 | 0.213262 | -1.781672 | -1.037202 | -0.884421 | -0.740292 | -0.514763 |
| 50%   | 100.0 | -0.775608 | 0.246014 | -1.672889 | -0.890028 | -0.740801 | -0.600437 | -0.333291 |
| 75%   | 100.0 | -0.674831 | 0.317402 | -1.450856 | -0.757286 | -0.600869 | -0.510171 | -0.198981 |
| max   | 100.0 | -0.390616 | 0.436406 | -0.989545 | -0.443964 | -0.335705 | -0.284563 | -0.060358 |
