# Recap: Duo Task.

- The landmarks are not fixed: they are randomly repositioned at the start of each episode.
- The assignment of landmarks to players is kept hidden from the agents.
- For `episodes=1000` the central agent underperforms.
- For `episodes=5000` the central agent overperforms.

## Findings

1. It takes more episodes to properly train the central agent.

# Duo Task

## Goal:

### Agents must learn how to navigate to a target landmark, while avoiding other agents.

- Both agents and landmarks are reset at the beginning of each episode. Each agent is assigned a landmark it must navigate to, and it must discover through trial and error which landmark that is.
- States are the coordinates to the other agent and to both landmarks.
- Reward is defined by the distance from an agent to its assigned landmark. If the agents collide, both receive an extra `reward=-1`.
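
The reward rule above can be sketched as follows. This is only an illustration, not the environment's actual code: it assumes the distance term enters as a negative Euclidean distance (which would be consistent with the negative rollout averages reported later) and that a collision is detected when the two agents are closer than a hypothetical `collision_radius`. The function name and all parameter names are made up for this sketch.

```python
import math


def duo_rewards(agents, landmarks, assignment,
                collision_radius=0.1, collision_reward=-1.0):
    """Per-agent rewards for one timestep (illustrative sketch).

    agents, landmarks: lists of (x, y) positions.
    assignment: assignment[i] is the index of the landmark given to agent i.
    collision_radius: hypothetical distance below which two agents collide.
    """
    rewards = []
    for i, (ax, ay) in enumerate(agents):
        lx, ly = landmarks[assignment[i]]
        # Assumption: reward is the negative distance to the assigned landmark.
        rewards.append(-math.hypot(ax - lx, ay - ly))

    (x1, y1), (x2, y2) = agents
    if math.hypot(x1 - x2, y1 - y2) < collision_radius:
        # Both agents receive the extra collision reward.
        rewards = [r + collision_reward for r in rewards]
    return rewards
```

Passing `collision_reward=-2.0` would correspond to the higher collision cost explored in the leaderboard section.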

### General MDP

$$\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2$$
$$r = r_1(x_1) + r_2(x_2)$$

#### TODO: Describe the states, actions and rewards.
* (POSX, POSY, landmark  
### Central Learner

The central agent solves the general MDP above.

<table>
<tr>
<th>Central Agent</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_1 \times \mathcal{A}_2$$
$$r_1(x_1) + r_2(x_2)$$
</td>
</tr>
</table>
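
To make the product structure $\mathcal{A}_1 \times \mathcal{A}_2$ concrete, the sketch below enumerates a joint action space. The five per-agent moves are an assumption made for illustration; these notes do not list the actual action set, and `joint_action_space` is a hypothetical helper.

```python
from itertools import product

# Hypothetical per-agent action set (not specified in these notes).
AGENT_ACTIONS = ("stay", "left", "right", "down", "up")


def joint_action_space(n_agents, agent_actions=AGENT_ACTIONS):
    """Cartesian product of the per-agent action sets."""
    return list(product(agent_actions, repeat=n_agents))


duo = joint_action_space(2)
trio = joint_action_space(3)
print(len(duo), len(trio))  # 25 125
```

Under this assumption the joint action space grows exponentially with the number of agents, which may be one reason the central agent needs more episodes to train.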

### Distributed Learners

The distributed agents have full observability but learn
independently.
<table>
<tr>
<th>Agent 1</th>
<th>Agent 2</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_1 $$
$$r_1(x_1) + r_2(x_2)$$
</td>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_2 $$
$$ r_1(x_1) + r_2(x_2)$$
</td>
</tr>
</table>

### Independent Learner

The independent agents have partial observability and learn
independently.
<table>
<tr>
<th>Agent 1</th>
<th>Agent 2</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1$$
$$\mathcal{A}_1$$
$$r_1(x_1)$$
</td>
<td>
$$\mathcal{X}_2$$
$$\mathcal{A}_2 $$
$$ r_2(x_2)$$
</td>
</tr>
</table>
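
The three tables above differ only in which state and which reward each learner receives. A minimal sketch of that difference, using a hypothetical helper `learner_views` (actions are omitted for brevity):

```python
def learner_views(x1, x2, r1, r2):
    """Return, per learner type, the (state, reward) pair each learner sees.

    x1, x2: local states of agent 1 and agent 2.
    r1, r2: individual rewards r_1(x_1) and r_2(x_2).
    """
    joint_state = (x1, x2)
    team_reward = r1 + r2
    return {
        # One learner, joint state, summed reward.
        "central": [(joint_state, team_reward)],
        # Two learners, each with the joint state and the summed reward.
        "distributed": [(joint_state, team_reward), (joint_state, team_reward)],
        # Two learners, each with its own state and its own reward.
        "independent": [(x1, r1), (x2, r2)],
    }


views = learner_views("x1", "x2", -0.5, -0.25)
print(views["independent"])  # [('x1', -0.5), ('x2', -0.25)]
```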


## Settings


1. We compare the three models above. 
2. Initially, $\tau = 100$ and it falls linearly with the number of episodes (`explore_episodes=4975`). 
3. Each test dataframe consists of the `DataFrame.describe()` statistics from **N** = 30 independent random trials, each consisting of a rollout of `M=100` timesteps, with $\tau$ set to a predetermined value.
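
The layout of such a test dataframe can be illustrated with synthetic numbers. The rewards below are random placeholders, not experimental data; the only point is the shape: one column per trial and one row per `describe()` statistic.

```python
import numpy as np
import pandas as pd

N_TRIALS = 30   # N independent random trials
M_STEPS = 100   # M timesteps per rollout

rng = np.random.default_rng(0)
# Synthetic per-timestep rewards (placeholder values):
# rows are timesteps, columns are trials.
rewards = pd.DataFrame(-rng.random((M_STEPS, N_TRIALS)))

# describe() summarizes each column, i.e. each trial.
summary = rewards.describe()
print(summary.shape)  # (8, 30)
```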

Parameters:
```
"""Configuration"""
ALPHA = 0.5  # ALPHA:
BETA = 0.3  # BETA:
TAU = 5.0   # Final TAU
EXPLORE_EPISODES = 4975
EPISODES = 5000
EXPLORE = True  # WHETHER OR NOT WE USE EXPLORATION

SEED = 1
BASE_PATH = 'data/01_duo/5000'

N_WORKERS = 6
N_AGENTS = 2
AGENT_TYPE = 'ActorCriticIndependent'
```
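
A linear schedule for $\tau$ consistent with the settings above could look like the sketch below. The exact formula used in the experiments is not given in these notes, so the interpolation and the choice to hold $\tau$ at its final value after `explore_episodes` are assumptions; `linear_tau` is a hypothetical name.

```python
def linear_tau(episode, tau_start=100.0, tau_final=5.0, explore_episodes=4975):
    """Linearly interpolate tau from tau_start to tau_final.

    Assumption: after explore_episodes, tau is held at tau_final.
    """
    if episode >= explore_episodes:
        return tau_final
    fraction = episode / explore_episodes
    return tau_start + fraction * (tau_final - tau_start)


print(linear_tau(0), linear_tau(4975), linear_tau(5000))  # 100.0 5.0 5.0
```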


## 1) Central Agent

BASE_PATH = 'data/01_duo/00_central/02'

### 1.1 Rollout Simulation

GIF from the best performing training.

![pipeline-central-simulation](01_duo/5000/tau03/00_central/02/simulation-pipeline-best.gif)

### 1.2 Rollout Graph


![pipeline-central-simulation](01_duo/5000/tau03/00_central/02/evaluation_rollout_n2_num06.png)

### 1.3 Train<a name="A-1.3"></a> 



![pipeline-central-train-30](01_duo/5000/tau03/00_central/02/train_pipeline_m30.png)
![rollout-central-train-30](01_duo/5000/tau03/00_central/02/train_rollout_m30.png)

**Results:**
    
- Number of collisions: **9**
- General behaviour: Reaches one landmark and tries to reach the other.
- Training: always increasing.
- Rollouts: Average about **-0.75**.

**Takeaways:**

- The central agent does not seem to do a good job of avoiding collisions.
- Perhaps increasing the training time will help.

## 2) Distributed Actor Critic

### 2.1 Rollout Simulation

GIF from the best performing training.


![pipeline-joint-simulation](01_duo/5000/tau03/01_distributed_learners/02/simulation-pipeline-best.gif)

### 2.2 Rollout Graph


![pipeline-joint-rollout](01_duo/5000/tau03/01_distributed_learners/02/evaluation_rollout_n2_num27.png)

### 2.3 Train<a name="A-2.3"></a> 


![pipeline-distributed-train-30](01_duo/5000/tau03/01_distributed_learners/02/train_pipeline_m30.png)
![rollout-distributed-train-30](01_duo/5000/tau03/01_distributed_learners/02/train_rollout_m30.png)

**Results:**
    
- Number of collisions: **5**
- General behaviour: Both agents oscillate around the middle.
- Training: Decreases by the end.
- Rollouts: Average about **-0.9**.

**Takeaways:**

- Fewer collisions, and a less greedy drive to actually stop on the landmark.
- Perhaps selecting the best policy instead of the latest policy will help.
- Fine-tuning the $\tau$ factor might help.

## 3) Independent Learners Actor Critic

### 3.1 Rollout Simulation

GIF from the best performing training.


![pipeline-independent-simulation](01_duo/5000/tau03/02_independent_learners/02/simulation-pipeline-best.gif)

### 3.2 Rollout Graph


![pipeline-independent-rollout](01_duo/5000/tau03/02_independent_learners/02/evaluation_rollout_n2_num01.png)

### 3.3 Train <a name="A-3.3"></a> 



![pipeline-independent-train-30](01_duo/5000/tau03/02_independent_learners/02/train_pipeline_m30.png)
![rollout-independent-train-30](01_duo/5000/tau03/02_independent_learners/02/train_rollout_m30.png)

**Results:**
    
- Number of collisions: **8**
- General behaviour: Both agents try to reach both landmarks.
- Training: Flattens by the end.
- Rollouts: Average about **-0.7**.

**Takeaways:**

- About as many collisions as the central learner.
- It seems that avoiding collisions is not that important for accomplishing the task.

## 4) Leaderboard 5000<a name="A-leaderboard"></a> 

```python
import pandas as pd

BASE_PATH = '01_duo/5000/tau03/'
filename = 'pipeline-rollouts-summary.csv'
# Each CSV holds the describe() statistics (rows) for every trial (columns).
central_df = pd.read_csv('{0}00_central/02/{1}'.format(BASE_PATH, filename), sep=',', index_col=0)
joint_df = pd.read_csv('{0}01_distributed_learners/02/{1}'.format(BASE_PATH, filename), sep=',', index_col=0)
indep_df = pd.read_csv('{0}02_independent_learners/02/{1}'.format(BASE_PATH, filename), sep=',', index_col=0)


def describe(dataframe: pd.DataFrame, label: str) -> pd.DataFrame:
    """Describes the average return across trials.

    Parameters
    ----------
    dataframe: pd.DataFrame
        A dataframe describing N independent rollouts,
        each consisting of M timesteps.
        Trials are in the columns and rows are statistics
        (the result of df.describe()).
    label: str
        Name given to the resulting column.

    Returns
    -------
    dataframe: pd.DataFrame
        A description of the average return.

    """
    # Keep only the per-trial 'mean', 'min' and 'max' rows.
    df = dataframe.drop(['std', 'count', '25%', '50%', '75%'], axis=0)
    # Transpose so that trials are rows, then describe the per-trial means.
    ts = df.T.describe()['mean']
    ts.name = label
    return ts.to_frame()
```

```python
dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(joint_df, label='distributed'))
dataframes.append(describe(indep_df, label='independent'))
noregdf = pd.concat(dataframes, axis=1)
noregdf
```

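<table>
<tr><th></th><th>central</th><th>distributed</th><th>independent</th></tr>
<tr><th>count</th><td>30.0</td><td>30.0</td><td>30.0</td></tr>
<tr><th>mean</th><td>-0.721073</td><td>-0.916111</td><td>-0.70563</td></tr>
<tr><th>std</th><td>0.196567</td><td>0.163963</td><td>0.172653</td></tr>
<tr><th>min</th><td>-1.080272</td><td>-1.47824</td><td>-1.107668</td></tr>
<tr><th>25%</th><td>-0.899719</td><td>-0.975314</td><td>-0.822749</td></tr>
<tr><th>50%</th><td>-0.711999</td><td>-0.880165</td><td>-0.664297</td></tr>
<tr><th>75%</th><td>-0.594784</td><td>-0.811225</td><td>-0.566666</td></tr>
<tr><th>max</th><td>-0.398747</td><td>-0.652587</td><td>-0.460799</td></tr>
</table>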


1. The first thing to note is that the evaluation rollouts show **9** collisions for the central agent, **5** for the distributed (joint action) learner and **8** for the independent learner. This indicates one of the following:
    - The central agent hasn't had time to train --> Increase the number of episodes.
    - The problem is relatively simple. For two agents it pays off to greedily travel to one landmark --> Increase the constraints by adding more agents.
    - The length of the episode is too short.

## 4.1) Leaderboard 5000<a name="A-leaderboard-1"></a>  

### 4.1.1) Collision cost


Raising the collision cost from `reward=-1.0` to `reward=-2.0` tips the scale in favor of the central algorithm.

```python
BASE_PATH = '03_duo_collisions/5000/'
filename = 'pipeline-rollouts-summary.csv'
central_df = pd.read_csv('{0}00_central/02/{1}'.format(BASE_PATH, filename), sep=',', index_col=0)
joint_df = pd.read_csv('{0}01_distributed_learners/02/{1}'.format(BASE_PATH, filename), sep=',', index_col=0)
indep_df = pd.read_csv('{0}02_independent_learners/02/{1}'.format(BASE_PATH, filename), sep=',', index_col=0)

dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(joint_df, label='distributed'))
dataframes.append(describe(indep_df, label='independent'))
df = pd.concat(dataframes, axis=1)
df
```

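<table>
<tr><th></th><th>central</th><th>distributed</th><th>independent</th></tr>
<tr><th>count</th><td>30.0</td><td>30.0</td><td>30.0</td></tr>
<tr><th>mean</th><td>-0.748398</td><td>-0.98129</td><td>-0.751045</td></tr>
<tr><th>std</th><td>0.199766</td><td>0.167453</td><td>0.215294</td></tr>
<tr><th>min</th><td>-1.134656</td><td>-1.504635</td><td>-1.162195</td></tr>
<tr><th>25%</th><td>-0.900333</td><td>-1.05597</td><td>-0.887457</td></tr>
<tr><th>50%</th><td>-0.746279</td><td>-0.944458</td><td>-0.737118</td></tr>
<tr><th>75%</th><td>-0.58277</td><td>-0.885843</td><td>-0.609881</td></tr>
<tr><th>max</th><td>-0.428029</td><td>-0.761828</td><td>-0.32449</td></tr>
</table>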


**Takeaways:**

- Central now overtakes the independent learner as measured by average reward, although only marginally (-0.748 vs. -0.751).
- The central agent has a lower standard deviation than the independent learner (0.200 vs. 0.215).
- Curiously, the distributed learner has the lowest standard deviation of all three agents (0.167).
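
As a rough sanity check on the ranking above (not part of the original analysis), the gap between two means can be compared with its standard error using Welch's t statistic, computed directly from the summary table. The helper name `welch_t` is hypothetical, and the check treats the 30 trials per agent as independent samples.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic from summary statistics (sample std, ddof=1)."""
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Values copied from the collision-cost leaderboard above.
t_central_vs_indep = welch_t(-0.748398, 0.199766, 30, -0.751045, 0.215294, 30)
t_central_vs_distr = welch_t(-0.748398, 0.199766, 30, -0.98129, 0.167453, 30)
print(round(t_central_vs_indep, 2), round(t_central_vs_distr, 2))
```

With these numbers the central-vs-independent statistic is far below 1 in magnitude, so that gap is small relative to the trial-to-trial spread. The central-vs-distributed gap is several standard errors wide.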

# Trio Task<a name="B-section"></a> 

We set `n_runs=12`.

Parameters:
```
"""Configuration"""
ALPHA = 0.5  # ALPHA:
BETA = 0.3  # BETA:
TAU = 3.0   # Final TAU
EXPLORE_EPISODES = 14975
EPISODES = 15000

EXPLORE = True  # WHETHER OR NOT WE USE EXPLORATION

SEED = 1
BASE_PATH = 'data/02_trio/15000'

N_WORKERS = 6
N_AGENTS = 3
AGENT_TYPE = 'ActorCriticIndependent'
```


## 1) Central Agent

BASE_PATH = '02_trio/15000/00_central/02'

### 1.1 Rollout Simulation

GIF from the best performing training.

![pipeline-central-simulation](02_trio/15000/00_central/02/simulation-pipeline-best.gif)

### 1.2 Rollout Graph


![pipeline-central-simulation](02_trio/15000/00_central/02/evaluation_rollout_n3_num04.png)

### 1.3 Train<a name="A-1.3"></a> 



![pipeline-central-train-30](02_trio/15000/00_central/02/train_pipeline_m12.png)
![rollout-central-train-30](02_trio/15000/00_central/02/train_rollout_m12.png)

## 2) Distributed Actor Critic

### 2.1 Rollout Simulation

GIF from the best performing training.


![pipeline-joint-simulation](02_trio/15000/01_distributed_learners/02/simulation-pipeline-best.gif)

### 2.2 Rollout Graph


![pipeline-joint-rollout](02_trio/15000/01_distributed_learners/02/evaluation_rollout_n3_num10.png)

### 2.3 Train<a name="B-2.3"></a> 



![pipeline-joint-train-30](02_trio/15000/01_distributed_learners/02/train_pipeline_m12.png)
![rollout-joint-train-30](02_trio/15000/01_distributed_learners/02/train_rollout_m12.png)

## 3) Independent Actor Critic

### 3.1 Rollout Simulation

GIF from the best performing training.


![pipeline-independent-simulation](02_trio/15000/02_independent_learners/02/simulation-pipeline-best.gif)

### 3.2 Rollout Graph


![pipeline-independent-rollout](02_trio/15000/02_independent_learners/02/evaluation_rollout_n3_num10.png)

### 3.3 Train<a name="B-3.3"></a> 



![pipeline-independent-train-30](02_trio/15000/02_independent_learners/02/train_pipeline_m12.png)
![rollout-independent-train-30](02_trio/15000/02_independent_learners/02/train_rollout_m12.png)

## 4) Leaderboard 15000<a name="B-leaderboard"></a> 

```python
import pandas as pd
BASE_PATH = '02_trio/15000/'
filename = 'pipeline-rollouts-summary.csv'
central_df = pd.read_csv('{0}00_central/02/{1}'.format(BASE_PATH, filename), sep=',', index_col=0)
joint_df = pd.read_csv('{0}01_distributed_learners/02/{1}'.format(BASE_PATH, filename), sep=',', index_col=0)
indep_df = pd.read_csv('{0}02_independent_learners/02/{1}'.format(BASE_PATH, filename), sep=',', index_col=0)
```


```python
dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(joint_df, label='distributed'))
dataframes.append(describe(indep_df, label='independent'))
noregdf = pd.concat(dataframes, axis=1)
noregdf
```

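<table>
<tr><th></th><th>central</th><th>distributed</th><th>independent</th></tr>
<tr><th>count</th><td>12.0</td><td>12.0</td><td>12.0</td></tr>
<tr><th>mean</th><td>-0.83282</td><td>-0.968436</td><td>-0.845123</td></tr>
<tr><th>std</th><td>0.171572</td><td>0.141458</td><td>0.19545</td></tr>
<tr><th>min</th><td>-1.095576</td><td>-1.197982</td><td>-1.132111</td></tr>
<tr><th>25%</th><td>-0.945631</td><td>-1.048738</td><td>-0.980796</td></tr>
<tr><th>50%</th><td>-0.835888</td><td>-0.983141</td><td>-0.872567</td></tr>
<tr><th>75%</th><td>-0.707972</td><td>-0.83227</td><td>-0.740878</td></tr>
<tr><th>max</th><td>-0.592599</td><td>-0.761953</td><td>-0.488058</td></tr>
</table>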


## 4.2) Central 25000

Is the problem a lack of training episodes for the centralized agent?

```python
BASE_PATH = '02_trio/25000/'
filename = 'pipeline-rollouts-summary.csv'
central_df = pd.read_csv('{0}00_central/02/{1}'.format(BASE_PATH, filename), sep=',', index_col=0)
describe(central_df, label='central')
```

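<table>
<tr><th></th><th>central</th></tr>
<tr><th>count</th><td>5.0</td></tr>
<tr><th>mean</th><td>-0.839612</td></tr>
<tr><th>std</th><td>0.232856</td></tr>
<tr><th>min</th><td>-1.11629</td></tr>
<tr><th>25%</th><td>-1.042792</td></tr>
<tr><th>50%</th><td>-0.783479</td></tr>
<tr><th>75%</th><td>-0.683122</td></tr>
<tr><th>max</th><td>-0.572376</td></tr>
</table>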


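**Result**: It doesn't seem so. With 25000 episodes the central agent averages -0.840 over 5 trials, versus -0.833 with 15000 episodes over 12 trials.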