# Recap

- We test three different information structures.
    * Centralized learner
    * Distributed learner
    * Independent learner

- Each structure is tested under the Duo and Trio task settings.

## Findings

1. Each information structure induced a **particular** policy.
    * The centralized agent approaches one landmark and then tries to approach another, incurring bumps.
    * Distributed agents do not seek to settle on a particular landmark but oscillate around landmarks.
    * Independent learners greedily seek to settle on a landmark regardless of bumps: high risk, high reward.
2. The centralized agent needed more steps to learn properly.
3. Open question: determine why the distributed agents fail to learn in the later parts of the episode.


# Duo Task

## Goal:

### Agents must learn how to navigate to a target landmark, while avoiding other agents.

- Both agents and landmarks are reset at the beginning of each episode. Each agent is assigned a landmark it must navigate to, and it must discover through trial and error which landmark that is.
- States are the coordinates to the other agent and to both landmarks.
- Reward is defined by the distance from an agent to its assigned landmark. If the agents collide, both receive an additional reward of -1.
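
The reward rule above can be sketched as follows. This is a minimal illustration, not the environment's actual code: it assumes the reward is the *negative* Euclidean distance to the assigned landmark (consistent with the negative returns in the leaderboards) and that a collision is detected with a hypothetical `collision_radius`. The function name `duo_rewards` is also made up for this example.

```python
import numpy as np


def duo_rewards(agent_pos, landmark_pos, collision_radius=0.1):
    """Per-agent rewards for one timestep (illustrative sketch).

    agent_pos, landmark_pos: array-likes of shape (2, 2); row i holds the
    (x, y) position of agent i and of the landmark assigned to agent i.
    """
    agent_pos = np.asarray(agent_pos, dtype=float)
    landmark_pos = np.asarray(landmark_pos, dtype=float)

    # Assumption: reward is the negative distance to the assigned landmark.
    rewards = -np.linalg.norm(agent_pos - landmark_pos, axis=1)

    # Assumption: agents collide when closer than `collision_radius`.
    if np.linalg.norm(agent_pos[0] - agent_pos[1]) < collision_radius:
        rewards = rewards - 1.0  # both agents receive the extra reward of -1
    return rewards
```

Calling `duo_rewards([[0, 0], [1, 0]], [[0, 1], [1, 1]])` gives each agent a reward equal to minus its distance to its landmark; moving the agents within `collision_radius` of each other lowers both rewards by 1.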


The objective of this notebook:
* Leaderboard: `episodes=5000` to `episodes=10000`.
* Test the new version of the distributed learners.
    - `TODO: CRITIC BEFORE`
    - `TODO: CRITIC AFTER`

### General MDP

$$\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2$$
$$r = r_1(x_1) + r_2(x_2)$$
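
As a rough illustration of the product structure above, the joint action space can be enumerated as a Cartesian product and the team reward computed as a sum. The per-agent action labels below are hypothetical, since the notebook still lists the action set as `TODO`.

```python
from itertools import product

# Hypothetical per-agent action labels; the actual action set is not given here.
AGENT_ACTIONS = ("stay", "left", "right", "down", "up")

# Joint action space A = A_1 x A_2: one pair of individual actions per joint action.
JOINT_ACTIONS = list(product(AGENT_ACTIONS, AGENT_ACTIONS))


def joint_reward(r1: float, r2: float) -> float:
    """Team reward r = r_1(x_1) + r_2(x_2)."""
    return r1 + r2
```

With five actions per agent, the joint space has 25 elements, which is one reason a single learner over $\mathcal{A}_1 \times \mathcal{A}_2$ has more to explore than a learner over $\mathcal{A}_1$ alone.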

#### States

`TODO`

#### Actions

`TODO`

#### Rewards

 `TODO`

### Central Learner

The central agent solves the general MDP above.

- Single agent.
- Fully observable setting.
- Learns using the average reward from both players.

<table>
<tr>
<th>Central Agent</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_1 \times \mathcal{A}_2$$
$$r_1(x_1) + r_2(x_2)$$
</td>
</tr>
</table>

### Distributed Learners

The distributed agents have full observability but learn independently.

- Independent agents.
- Fully observable setting.
- Learn using the average reward from both players.

<table>
<tr>
<th>Agent 1</th>
<th>Agent 2</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_1 $$
$$r_1(x_1) + r_2(x_2)$$
</td>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_2 $$
$$ r_1(x_1) + r_2(x_2)$$
</td>
</tr>
</table>

### Independent Learner

The independent agents have partial observability and learn independently.

- Independent agents.
- Partially observable setting.
- Individual rewards.

<table>
<tr>
<th>Agent 1</th>
<th>Agent 2</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1$$
$$\mathcal{A}_1$$
$$r_1(x_1)$$
</td>
<td>
$$\mathcal{X}_2$$
$$\mathcal{A}_2 $$
$$ r_2(x_2)$$
</td>
</tr>
</table>
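
The three tables differ only in what each learner observes and which reward it receives. The sketch below makes that mapping explicit. It is an illustration under assumptions: `learner_inputs` is a hypothetical helper, local states are plain tuples of floats, and the shared signal is the sum $r_1(x_1) + r_2(x_2)$ as written in the tables (the bullet lists call it an average, which differs from the sum only by a constant factor).

```python
from typing import List, Sequence, Tuple

Observation = Tuple[float, ...]


def learner_inputs(
    structure: str,
    x: Sequence[Observation],
    r: Sequence[float],
) -> List[Tuple[Observation, float]]:
    """Return one (observation, reward) pair per learner (illustrative)."""
    joint_x = tuple(value for x_i in x for value in x_i)  # element of X_1 x X_2
    team_r = sum(r)  # r_1(x_1) + r_2(x_2)
    if structure == "central":
        # A single learner sees everything and selects the joint action.
        return [(joint_x, team_r)]
    if structure == "distributed":
        # One learner per agent; each sees the joint state and the team reward.
        return [(joint_x, team_r) for _ in x]
    if structure == "independent":
        # One learner per agent; each sees only its own state and reward.
        return [(tuple(x_i), r_i) for x_i, r_i in zip(x, r)]
    raise ValueError(f"unknown information structure: {structure!r}")
```

For example, `learner_inputs("independent", [(0.1, 0.2), (0.3, 0.4)], [-1.0, -0.5])` returns two pairs, each holding only that agent's own state and reward.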

## Settings


1. We compare the three information structures above.
2. Initially, $\tau = 100$, and it falls linearly with the number of episodes (`explore_episodes=9975`).
3. Each test dataframe consists of the `DataFrame.describe()` statistics from **N** = 30 independent random trials, each consisting of a rollout of `M=100` timesteps with $\tau$ set to a predetermined value.
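
To make the shape of a test dataframe concrete, the sketch below builds one from synthetic data. The uniform random values are placeholders for whatever per-timestep quantity the real rollouts record, and the `trial_XX` column names are made up; only the layout (trials in columns, `describe()` statistics in rows) follows the description above.

```python
import numpy as np
import pandas as pd

N_TRIALS = 30  # N in the text
M_STEPS = 100  # M in the text

rng = np.random.default_rng(0)

# Placeholder data: one column per trial, one row per timestep.
rewards = rng.uniform(-1.5, 0.0, size=(M_STEPS, N_TRIALS))
rollouts = pd.DataFrame(
    rewards, columns=[f"trial_{n:02d}" for n in range(N_TRIALS)]
)

# Rows become count, mean, std, min, 25%, 50%, 75%, max; columns stay trials.
summary = rollouts.describe()
```

The resulting `summary` has one column per trial and eight statistic rows, which is the layout the leaderboard helper expects as input.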

Parameters:
```
ALPHA = 0.5  # ALPHA:
BETA = 0.3  # BETA:
TAU = 5.0   # Final TAU
EXPLORE_EPISODES = 975
EPISODES = 1000
EXPLORE = True
BASE_PATH = 'data/00_duo/01_tau05_10000'

N_WORKERS = 6
N_AGENTS = 2
```
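
The linear decay of $\tau$ described in the settings can be sketched with these parameters. This is a minimal version under assumptions: `linear_tau` is a hypothetical helper, $\tau$ is assumed to reach the final `TAU` exactly at `EXPLORE_EPISODES`, and it is assumed to stay at that value for the remaining episodes.

```python
def linear_tau(
    episode: int,
    tau_init: float = 100.0,
    tau_final: float = 5.0,
    explore_episodes: int = 975,
) -> float:
    """Value of tau at a given episode under a linear decay (sketch)."""
    if episode >= explore_episodes:
        # Assumption: tau is held at its final value once exploration ends.
        return tau_final
    fraction = episode / explore_episodes  # share of the decay completed
    return tau_init + fraction * (tau_final - tau_init)
```

The defaults mirror the parameter block above; pass a different `explore_episodes` to reproduce the longer or shorter runs.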

## 1) Central Agent

`BASE_PATH = '00_duo/01_tau05_10000/00_central/02'`

### 1.1 Rollout Simulation

GIF from the best-performing training run.

![pipeline-central-simulation](00_duo/01_tau05_10000/00_central/02/simulation-pipeline-best.gif)

### 1.2 Rollout Graph


![pipeline-central-simulation](00_duo/01_tau05_10000/00_central/02/evaluation_rollout_n2_num11.png)

### 1.3 Train<a name="A-1.3"></a> 



![pipeline-central-train-12](00_duo/01_tau05_10000/00_central/02/train_pipeline_m12.png)
![rollout-central-train-12](00_duo/01_tau05_10000/00_central/02/train_rollout_m12.png)

## 2) Distributed Actor Critic

`BASE_PATH = '00_duo/01_tau05_10000/02_distributed_learners/02'`

### 2.1 Rollouts distributed learners

GIF from the best-performing training run.

![pipeline-distributed-simulation](00_duo/01_tau05_10000/02_distributed_learners/02/simulation-pipeline-best.gif)

### 2.2 Rollout Graph


![pipeline-distributed-simulation](00_duo/01_tau05_10000/02_distributed_learners/02/evaluation_rollout_n2_num06.png)

### 2.3 Train<a name="A-2.3"></a> 


![pipeline-distributed-train-12](00_duo/01_tau05_10000/02_distributed_learners/02/train_pipeline_m12.png)
![rollout-distributed-train-12](00_duo/01_tau05_10000/02_distributed_learners/02/train_rollout_m12.png)

## 3) Independent Learners Actor Critic

`BASE_PATH = '00_duo/01_tau05_10000/03_independent_learners/02'`

### 3.1 Rollouts independent learners

GIF from the best-performing training run.

![pipeline-independent-simulation](00_duo/01_tau05_10000/02_independent_learners/02/simulation-pipeline-best.gif)

### 3.2 Rollout Graph


![pipeline-independent-rollout](00_duo/01_tau05_10000/02_independent_learners/02/evaluation_rollout_n2_num05.png)

### 3.3 Train <a name="A-3.3"></a> 



![pipeline-independent-train-30](00_duo/01_tau05_10000/02_independent_learners/02/train_pipeline_m12.png)
![rollout-independent-train-30](00_duo/01_tau05_10000/02_independent_learners/02/train_rollout_m12.png)

## 4) Leaderboard 10000<a name="A-leaderboard"></a> 

```python
import pandas as pd

BASE_PATH = '00_duo/01_tau05_10000/'

# Each CSV holds the describe() statistics of the rollouts: one column per trial.
central_df = pd.read_csv(BASE_PATH + '00_central/02/pipeline-rollouts-summary.csv', sep=',', index_col=0)
distributed_df = pd.read_csv(BASE_PATH + '02_distributed_learners/02/pipeline-rollouts-summary.csv', sep=',', index_col=0)
independent_df = pd.read_csv(BASE_PATH + '02_independent_learners/02/pipeline-rollouts-summary.csv', sep=',', index_col=0)

def describe(dataframe: pd.DataFrame, label: str) -> pd.DataFrame:
    """Describe the distribution of per-trial means across trials.

    Parameters
    ----------
    dataframe: pd.DataFrame
        The result of df.describe() on N independent rollouts,
        each consisting of M timesteps.
        Trials are in the columns and statistics are in the rows.
    label: str
        Name given to the returned column.

    Returns
    -------
    dataframe: pd.DataFrame
        A description of the average return.

    """
    # Keep only the 'mean', 'min' and 'max' rows of the per-trial summary.
    df = dataframe.drop(['std', 'count', '25%', '50%', '75%'], axis=0)
    # Transpose so trials are rows, then describe the per-trial means.
    ts = df.T.describe()['mean']
    ts.name = label
    return ts.to_frame()
```

```python
dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(distributed_df, label='distributed'))
dataframes.append(describe(independent_df, label='independent'))
noregdf = pd.concat(dataframes, axis=1)
noregdf
```

| statistic | central | distributed | independent |
|---|---|---|---|
| count | 12.0 | 12.0 | 12.0 |
| mean | -0.726005 | -0.711079 | -0.866688 |
| std | 0.224352 | 0.17939 | 0.135822 |
| min | -1.195369 | -1.084424 | -1.061007 |
| 25% | -0.87125 | -0.764382 | -0.95829 |
| 50% | -0.673064 | -0.671396 | -0.900654 |
| 75% | -0.60394 | -0.635131 | -0.758374 |
| max | -0.416687 | -0.388977 | -0.66126 |


## 5) Leaderboard 5000<a name="B-leaderboard"></a> 

We set `episodes=5000` and `explore_episodes=4975`.

```python
import pandas as pd

BASE_PATH = '16_duo/03_tau05/5000/'

central_df = pd.read_csv(BASE_PATH + '00_central/02/pipeline.csv', sep=',', index_col=0)
joint_df = pd.read_csv(BASE_PATH + '01_joint_learners/02/pipeline.csv', sep=',', index_col=0)
indep_df = pd.read_csv(BASE_PATH + '02_independent_learners/02/pipeline.csv', sep=',', index_col=0)
```

```python
dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(joint_df, label='joint'))
dataframes.append(describe(indep_df, label='independent'))
noregdf = pd.concat(dataframes, axis=1)
noregdf
```

| statistic | central | joint | independent |
|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 |
| mean | -0.707014 | -0.936407 | -0.88491 |
| std | 0.174991 | 0.160264 | 0.137636 |
| min | -1.11946 | -1.24096 | -1.264698 |
| 25% | -0.870943 | -1.041689 | -0.953044 |
| 50% | -0.645521 | -0.929262 | -0.854733 |
| 75% | -0.591891 | -0.831467 | -0.782187 |
| max | -0.475165 | -0.637744 | -0.684747 |
