# Trio Task

## Goal:

### Agents must learn how to navigate to a target landmark, while avoiding other agents.

- Both agents and landmarks are reset at the beginning of each episode. Each agent is assigned a landmark it must navigate to, and it must find out through trial and error which landmark it was assigned.
- States are the coordinates to the other agents and to the landmarks.
- The reward is defined by the distance from an agent to its assigned landmark. If two agents collide, both receive an extra reward of -1.
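
As a rough illustration of the reward rule described in the list above, the following sketch computes one reward per agent. It assumes the reward is the *negative* Euclidean distance to the assigned landmark, and it uses a hypothetical `COLLISION_RADIUS`; neither the sign convention nor the collision threshold is stated in this document, and the function names are illustrative only.

```python
import math

COLLISION_PENALTY = -1.0  # from the text: extra reward of -1 on collision
COLLISION_RADIUS = 0.1    # hypothetical threshold, not given in this document


def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def rewards(agents, landmarks, assignment):
    """Return one reward per agent.

    agents:     list of (x, y) agent positions
    landmarks:  list of (x, y) landmark positions
    assignment: assignment[i] is the index of the landmark assigned to agent i
    """
    # Assumption: reward is the negative distance to the assigned landmark.
    out = [-distance(a, landmarks[assignment[i]]) for i, a in enumerate(agents)]
    # Each pair of agents closer than the radius counts as one collision,
    # and both agents of the pair receive the penalty.
    for i in range(len(agents)):
        for j in range(i + 1, len(agents)):
            if distance(agents[i], agents[j]) < COLLISION_RADIUS:
                out[i] += COLLISION_PENALTY
                out[j] += COLLISION_PENALTY
    return out
```

For example, `rewards([(0, 0), (3, 4)], [(0, 0), (0, 0)], [0, 1])` gives a reward of zero magnitude to the first agent, which sits on its landmark, and `-5.0` to the second.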


### General MDP

$$\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2\times \mathcal{X}_3$$
$$\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2\times \mathcal{A}_3$$
$$r = r_1(x_1) + r_2(x_2) + r_3(x_3)$$

#### States


$$\mathcal{X}_1 = (\alpha^1_x, \alpha^1_y, v^1_x, v^1_y, l^1_x, l^1_y, l^2_x, l^2_y) $$
$$\mathcal{X}_2 = (\alpha^2_x, \alpha^2_y, v^2_x, v^2_y, l^1_x, l^1_y, l^2_x, l^2_y) $$
$$\mathcal{X}_3 = (\alpha^3_x, \alpha^3_y, v^3_x, v^3_y, l^1_x, l^1_y, l^3_x, l^3_y) $$

#### Actions

$$\mathcal{A}_1 = (0, 1, 2, 3, 4) $$
$$\mathcal{A}_2 = (0, 1, 2, 3, 4) $$
$$\mathcal{A}_3 = (0, 1, 2, 3, 4) $$


#### Rewards

 `TODO`

### Central Learner

The central agent solves the general MDP above.

- Single agent.
- Fully observable setting.
- Learns using the average reward of the agents.

<table>
<tr>
<th>Central Agent</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_1 \times \mathcal{A}_2$$
$$r_1(x_1) + r_2(x_2)$$
</td>
</tr>
</table>
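
Because the central agent acts on the joint action space, a single choice has to be mapped back to one action per agent. The sketch below shows one possible encoding for the three-agent case of the general MDP, where each agent has the five actions `(0, 1, 2, 3, 4)`. The names `JOINT_ACTIONS`, `encode` and `decode` are hypothetical and are not taken from the project code.

```python
import itertools

N_AGENTS = 3
N_ACTIONS = 5  # each agent chooses from (0, 1, 2, 3, 4)

# Enumerate the joint action space A_1 x A_2 x A_3 as tuples.
JOINT_ACTIONS = list(itertools.product(range(N_ACTIONS), repeat=N_AGENTS))


def decode(joint_index):
    """Map a single index chosen by the central agent to per-agent actions."""
    return JOINT_ACTIONS[joint_index]


def encode(actions):
    """Map per-agent actions back to the joint index (base-5 positional code)."""
    index = 0
    for a in actions:
        index = index * N_ACTIONS + a
    return index
```

With this encoding the central agent picks among `5 ** 3 = 125` joint actions, which illustrates how the central formulation grows with the number of agents.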

### Distributed Learners

The distributed agents have full observability but learn
independently.

- Independent agents.
- Fully observable setting.
- Learn using the average reward of the agents.

<table>
<tr>
<th>Agent 1</th>
<th>Agent 2</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_1 $$
$$r_1(x_1) + r_2(x_2)$$
</td>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_2 $$
$$ r_1(x_1) + r_2(x_2)$$
</td>
</tr>
</table>

### Independent Learners

The independent agents have partial observability and learn
independently.

- Independent agents.
- Partially observable setting.
- Individual rewards.

<table>
<tr>
<th>Agent 1</th>
<th>Agent 2</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1$$
$$\mathcal{A}_1$$
$$r_1(x_1)$$
</td>
<td>
$$\mathcal{X}_2$$
$$\mathcal{A}_2 $$
$$ r_2(x_2)$$
</td>
</tr>
</table>
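
The three information structures above differ in what a learner observes and in which reward it is trained on. The following sketch makes that difference explicit for plain Python tuples. It follows the bullet lists and uses the *average* reward for the central and distributed cases (the tables write the sum, which differs only by a constant factor). All function names are hypothetical.

```python
def joint(states):
    """Concatenate per-agent state tuples into one joint state."""
    return tuple(x for s in states for x in s)


def average(rewards):
    """Average reward over all agents."""
    return sum(rewards) / len(rewards)


def central_view(states, rewards):
    """Single learner: joint state and averaged reward."""
    return joint(states), average(rewards)


def distributed_view(states, rewards, agent):
    """Learner `agent` sees the joint state and the averaged reward.

    The view is the same for every agent; only the action space
    (the agent's own actions) differs from the central case.
    """
    return joint(states), average(rewards)


def independent_view(states, rewards, agent):
    """Learner `agent` sees only its own state and its own reward."""
    return tuple(states[agent]), rewards[agent]
```

For instance, with `states = [(1, 2), (3, 4), (5, 6)]` and `rewards = [-1.0, -2.0, -3.0]`, the central and distributed views return the six-dimensional joint state with reward `-2.0`, while `independent_view(states, rewards, 2)` returns `((5, 6), -3.0)`.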

## Settings


1. We compare the three information structures above.
2. Initially, $\tau = 100$; it then falls linearly with the number of episodes until it reaches its final value (`TAU = 5.0`) after `EXPLORE_EPISODES = 24975` episodes.
3. Each test dataframe holds the `DataFrame.describe()` statistics from **N** = 30 independent random trials, each consisting of a rollout of `M=100` timesteps, with $\tau$ set to a predetermined value.
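
The linear decay of $\tau$ can be sketched as follows. The code assumes that $\tau$ is the temperature of a Boltzmann (softmax) exploration policy, which this document does not state explicitly. The constants mirror the values in the text and in the parameter listing, and the function names are hypothetical.

```python
import math
import random

TAU_INITIAL = 100.0       # from the text: initial value of tau
TAU_FINAL = 5.0           # from the parameters: final TAU
EXPLORE_EPISODES = 24975  # from the parameters


def tau_at(episode, explore_episodes=EXPLORE_EPISODES):
    """Linearly interpolate tau from TAU_INITIAL to TAU_FINAL."""
    frac = min(max(episode / explore_episodes, 0.0), 1.0)
    return TAU_INITIAL + frac * (TAU_FINAL - TAU_INITIAL)


def boltzmann_probs(preferences, tau):
    """Softmax over action preferences at temperature tau (assumed policy)."""
    m = max(preferences)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp((p - m) / tau) for p in preferences]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(preferences, tau, rng=random):
    """Draw one action index according to the Boltzmann distribution."""
    probs = boltzmann_probs(preferences, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Under this assumption, a large $\tau$ early in training yields nearly uniform action probabilities, and the policy becomes greedier as $\tau$ approaches its final value.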

Parameters:
```
ALPHA = 0.5  # ALPHA:
BETA = 0.3  # BETA:
TAU = 5.0   # Final TAU
EXPLORE_EPISODES = 24975
EPISODES = 25000
EXPLORE = True  # Whether or not we use exploration

SEED = 1
BASE_PATH = 'data/01_trio/02_tau05_25000/'

N_WORKERS = 6
N_AGENTS = 3
AGENT_TYPE = 'ActorCriticCentral'
```

## 1) Central Agent

`BASE_PATH = '01_trio/02_tau05_25000'`

### 1.1 Rollout Simulation

GIF from the best performing training.

![pipeline-central-simulation](01_trio/02_tau05_25000/00_central/02/simulation-pipeline-best.gif)

### 1.2 Rollout Graph



![pipeline-central-simulation](01_trio/02_tau05_25000/00_central/02/evaluation_rollout_n3_num05.png)

### 1.3 Train<a name="A-1.3"></a> 



![pipeline-central-train-12](01_trio/02_tau05_25000/00_central/02/train_pipeline_m12.png)
![rollout-central-train-12](01_trio/02_tau05_25000/00_central/02/train_rollout_m12.png)

## 2) Distributed Actor Critic

`BASE_PATH = '01_trio/02_tau05_25000/02_distributed_learners/02/'`

### 2.1 Rollouts distributed learners

GIF from the best performing training.

![pipeline-distributed-simulation](01_trio/02_tau05_25000/02_distributed_learners/02/simulation-pipeline-best.gif)

### 2.2 Rollout Graph


![pipeline-distributed-simulation](01_trio/02_tau05_25000/02_distributed_learners/02/evaluation_rollout_n3_num02.png)

### 2.3 Train<a name="A-2.3"></a> 


![pipeline-distributed-train-12](01_trio/02_tau05_25000/02_distributed_learners/02/train_pipeline_m12.png)
![rollout-distributed-train-12](01_trio/02_tau05_25000/02_distributed_learners/02/train_rollout_m12.png)

## 3) Independent Learners Actor Critic

`BASE_PATH = '01_trio/02_tau05_25000/02_independent_learners/02/'`

### 3.1 Rollouts independent learners

GIF from the best performing training.

![pipeline-independent-simulation](01_trio/02_tau05_25000/02_independent_learners/02/simulation-pipeline-best.gif)

### 3.2 Rollout Graph


![pipeline-independent-rollout](01_trio/02_tau05_25000/02_independent_learners/02/evaluation_rollout_n3_num10.png)

### 3.3 Train <a name="A-3.3"></a> 



![pipeline-independent-train-30](01_trio/02_tau05_25000/02_independent_learners/02/train_pipeline_m12.png)
![rollout-independent-train-30](01_trio/02_tau05_25000/02_independent_learners/02/train_rollout_m12.png)

## 4) Leaderboard 25000<a name="A-leaderboard"></a> 

```python
import pandas as pd

BASE_PATH = '01_trio/02_tau05_25000/'

# Each CSV holds the DataFrame.describe() output of the test rollouts:
# statistics in the rows, one column per trial.
central_df = pd.read_csv(BASE_PATH + '00_central/02/pipeline-rollouts-summary.csv', sep=',', index_col=0)
distributed_df = pd.read_csv(BASE_PATH + '02_distributed_learners/02/pipeline-rollouts-summary.csv', sep=',', index_col=0)
independent_df = pd.read_csv(BASE_PATH + '02_independent_learners/02/pipeline-rollouts-summary.csv', sep=',', index_col=0)


def describe(dataframe: pd.DataFrame, label: str) -> pd.DataFrame:
    """Describes the dataframe.

    Parameters
    ----------
    dataframe: pd.DataFrame
        A dataframe describing N independent rollouts,
        each consisting of M timesteps.
        Trials are in the columns and rows are statistics
        (the result of df.describe()).
    label: str
        Name given to the resulting column.

    Returns
    -------
    dataframe: pd.DataFrame
        A description of the average return.

    """
    # Keep only the 'mean', 'min' and 'max' rows.
    df = dataframe.drop(['std', 'count', '25%', '50%', '75%'], axis=0)
    # Summarize the per-trial means across trials.
    ts = df.T.describe()['mean']
    ts.name = label
    return ts.to_frame()
```

```python
dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(distributed_df, label='distributed'))
dataframes.append(describe(independent_df, label='independent'))
noregdf = pd.concat(dataframes, axis=1)
noregdf
```

| statistic | central | distributed | independent |
|---|---|---|---|
| count | 12.0 | 12.0 | 12.0 |
| mean | -0.814162 | -3.652366 | -0.811222 |
| std | 0.13399 | 2.801902 | 0.211808 |
| min | -0.996588 | -9.712648 | -1.213041 |
| 25% | -0.926152 | -5.96953 | -0.957589 |
| 50% | -0.834718 | -2.090906 | -0.802586 |
| 75% | -0.721824 | -1.681394 | -0.677412 |
| max | -0.59797 | -0.69528 | -0.429482 |
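
Each leaderboard column summarizes, across trials, the mean value of every individual trial. The sketch below reproduces that two-stage aggregation using only the Python standard library, as a dependency-free illustration of what the statistics mean. The data are synthetic and are not experimental results; the function names are hypothetical, and the quartiles use the `inclusive` method of `statistics.quantiles`, which is assumed to correspond to linear interpolation.

```python
import random
import statistics


def trial_mean(values):
    """Mean over the M timesteps of a single trial."""
    return statistics.fmean(values)


def summarize(trial_means):
    """Summary statistics across N trials, named as in DataFrame.describe()."""
    q1, q2, q3 = statistics.quantiles(trial_means, n=4, method="inclusive")
    return {
        "count": float(len(trial_means)),
        "mean": statistics.fmean(trial_means),
        "std": statistics.stdev(trial_means),  # sample standard deviation
        "min": min(trial_means),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(trial_means),
    }


# Synthetic values, for illustration only; these are not experimental results.
rng = random.Random(0)
trials = [[rng.uniform(-1.0, 0.0) for _ in range(100)] for _ in range(12)]
summary = summarize([trial_mean(t) for t in trials])
```

Reading the table this way, `std` measures how much the per-trial mean varies between trials rather than the variation inside a single rollout.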
