# Recap

- We test three different information structures.
    * Centralized learner
    * Distributed learner
    * Independent learner

- Under the Duo and Trio task settings.


## Findings

1. Each information structure induced a **particular** policy.
    * The centralized agent approaches a landmark and then tries to approach another, incurring bumps.
    * Distributed agents do not seek to settle on a particular landmark but oscillate around landmarks.
    * Independent learners greedily seek to settle on a landmark, regardless of bumps: high risk, high reward.
2. The centralized agent needed more steps to learn properly.
3. Open question: determine why the distributed agents fail to learn in the latter part of the episode.


# Duo Task

## Goal:

### Agents must learn how to navigate to a target landmark, while avoiding other agents.

- Both agents and landmarks are reset at the beginning of each episode. Each agent is assigned a landmark to navigate to and must discover, through trial and error, which landmark it was assigned.
- Each agent's state is the coordinates to the other agent and to both landmarks.
- The reward is defined by the distance from an agent to its assigned landmark. If the agents collide, both receive an extra reward of -1.
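
A minimal sketch of the reward described above. It assumes 2-D positions, that "defined by the distance" means the negative Euclidean distance, and a hypothetical collision radius `COLLISION_DIST`; none of these details are stated in this notebook, and the function name `duo_rewards` is illustrative only.

```python
import math

COLLISION_DIST = 0.1      # hypothetical threshold; the notebook does not say when agents "collide"
COLLISION_PENALTY = -1.0  # extra reward given to both agents on collision (from the text)


def duo_rewards(agent_positions, landmark_positions, assignment):
    """Return one reward per agent.

    agent_positions, landmark_positions: lists of (x, y) tuples.
    assignment[i] is the index of the landmark assigned to agent i.
    """
    # Assumption: the reward is the negative distance to the assigned landmark.
    rewards = [
        -math.dist(agent_positions[i], landmark_positions[assignment[i]])
        for i in range(len(agent_positions))
    ]
    # Both agents receive the extra penalty when they are too close.
    if math.dist(agent_positions[0], agent_positions[1]) < COLLISION_DIST:
        rewards = [r + COLLISION_PENALTY for r in rewards]
    return rewards
```

Under these assumptions, an agent sitting exactly on its landmark earns 0, and a collision lowers both agents' rewards by 1 regardless of how close each is to its target.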


The objective of this notebook:
* Leaderboard: increase training from `episodes=5000` to `episodes=10000`.
* Test the new version of the distributed learners.
    - **Before**
    \begin{align*}
    \delta_t &\leftarrow \bar{r}_{t+1} - \mu_t + V(x(s_{t+1}); \omega_t) - V(x(s_t); \omega_t)\\
    \omega_{t+1} &\leftarrow \omega_t + \alpha  \delta_t x(s_t) 
    \end{align*}
    - **After**
    \begin{align*}
    \delta_t &\leftarrow \bar{r}_{t+1} - \mu_t + Q(x(s_{t+1}), a_{t+1} ; \omega_t) - Q(x(s_t), a_t; \omega_t)\\
    \omega^{a_t}_{t+1} &\leftarrow \omega^{a_t}_t + \alpha  \delta_t x(s_t) 
    \end{align*}
 
In summary, the critic now estimates action values $Q$ instead of state values $V$, so the agent also learns from the actions it takes.
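
The two critic updates above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: it assumes linear function approximation, $V(x;\omega)=\omega^\top x$ and $Q(x,a;\omega)=(\omega^{a})^\top x$ with one weight vector per action, and it takes the average-reward estimate $\mu_t$ as an input because its update rule is not shown here. The function names are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def critic_update_v(omega, x_t, x_next, r_bar, mu, alpha):
    """Before: differential TD error on a linear state-value function."""
    delta = r_bar - mu + dot(omega, x_next) - dot(omega, x_t)
    new_omega = [w + alpha * delta * xi for w, xi in zip(omega, x_t)]
    return new_omega, delta


def critic_update_q(omega, x_t, a_t, x_next, a_next, r_bar, mu, alpha):
    """After: same error on action values, with one weight vector per action."""
    delta = r_bar - mu + dot(omega[a_next], x_next) - dot(omega[a_t], x_t)
    new_omega = [list(w) for w in omega]
    # Only the weights of the action actually taken, omega^{a_t}, are updated.
    new_omega[a_t] = [w + alpha * delta * xi for w, xi in zip(omega[a_t], x_t)]
    return new_omega, delta
```

The only structural difference is the indexing by action: the temporal-difference error compares $Q$ at consecutive state-action pairs, and the step is applied to the weight vector of $a_t$ alone.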
### General MDP

$$\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2$$
$$r = r_1(x_1) + r_2(x_2)$$
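
To make the factorization concrete, the sketch below builds the joint spaces as Cartesian products and the joint reward as a sum of local rewards. The tiny discrete spaces and the local reward functions `r1` and `r2` are hypothetical placeholders; the actual task uses coordinates as states.

```python
from itertools import product

# Hypothetical toy local spaces, for illustration only.
X1, X2 = ["near", "far"], ["near", "far"]
A1, A2 = ["stay", "move"], ["stay", "move"]

joint_states = list(product(X1, X2))    # X = X1 x X2
joint_actions = list(product(A1, A2))   # A = A1 x A2


def r1(x1):
    """Hypothetical local reward of agent 1."""
    return 0.0 if x1 == "near" else -1.0


def r2(x2):
    """Hypothetical local reward of agent 2."""
    return 0.0 if x2 == "near" else -1.0


def joint_reward(x):
    """r = r1(x1) + r2(x2) for a joint state x = (x1, x2)."""
    x1, x2 = x
    return r1(x1) + r2(x2)
```

With two local states and two local actions per agent, the joint problem already has four states and four actions, which illustrates how the joint spaces grow with the number of agents.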

#### States

`TODO`

#### Actions

`TODO`

#### Rewards

 `TODO`

### Central Learner

The central agent solves the general MDP above.

- Single agent.
- Fully observable setting.
- Learns using the average reward from both players.
<table>
<tr>
<th>Central Agent</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_1 \times \mathcal{A}_2$$
$$r_1(x_1) + r_2(x_2)$$
</td>
</tr>
</table>

### Distributed Learners

The distributed agents have full observability but learn
independently.

- Independent agents.
- Fully observable setting.
- Learn using the average reward from both players.

<table>
<tr>
<th>Agent 1</th>
<th>Agent 2</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_1 $$
$$r_1(x_1) + r_2(x_2)$$
</td>
<td>
$$\mathcal{X}_1 \times \mathcal{X}_2$$
$$\mathcal{A}_2 $$
$$ r_1(x_1) + r_2(x_2)$$
</td>
</tr>
</table>

### Independent Learner

The independent agents have partial observability and learn
independently.

- Independent agents.
- Partially observable setting.
- Individual rewards.

<table>
<tr>
<th>Agent 1</th>
<th>Agent 2</th>
</tr>
<tr>
<td>
$$\mathcal{X}_1$$
$$\mathcal{A}_1$$
$$r_1(x_1)$$
</td>
<td>
$$\mathcal{X}_2$$
$$\mathcal{A}_2 $$
$$ r_2(x_2)$$
</td>
</tr>
</table>
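
The three tables can be read as a rule for what each learner observes and which reward it receives. The sketch below follows the tables (summed reward); the bullet lists mention an average reward, which for two agents differs from the sum only by a factor of 1/2. The function name `learner_views` and its argument layout are hypothetical.

```python
def learner_views(structure, x1, x2, r1, r2):
    """Return a list of (observation, reward) pairs, one per learner."""
    if structure == "central":
        # One learner sees the joint state and the summed reward.
        return [((x1, x2), r1 + r2)]
    if structure == "distributed":
        # Two learners, each with the joint state and the shared reward.
        return [((x1, x2), r1 + r2), ((x1, x2), r1 + r2)]
    if structure == "independent":
        # Two learners, each with only its own state and its own reward.
        return [(x1, r1), (x2, r2)]
    raise ValueError(f"unknown structure: {structure}")
```

The central and distributed structures share observations and rewards and differ only in how many learners there are and which actions each one controls; the independent structure restricts both the observation and the reward to the local ones.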

## Settings


1. We compare the three information structures above.
2. Initially, $\tau = 100$, and it decreases linearly with the number of episodes (`explore_episodes=9975`).
3. Each test dataframe holds the `DataFrame.describe()` statistics from **N** = 30 independent random trials, each consisting of a rollout of `M=100` timesteps with $\tau$ set to a predetermined value.
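
A minimal sketch of the linear schedule in item 2, assuming that $\tau$ is the temperature of a softmax (Boltzmann) policy, which this notebook does not state explicitly. The default arguments reuse the final `TAU` and `EXPLORE_EPISODES` values from the configuration listed below; the function names are hypothetical.

```python
import math


def temperature(episode, tau_init=100.0, tau_final=5.0, explore_episodes=9000):
    """Linearly anneal tau from tau_init to tau_final over explore_episodes."""
    if episode >= explore_episodes:
        return tau_final
    frac = episode / explore_episodes
    return tau_init + frac * (tau_final - tau_init)


def softmax_policy(preferences, tau):
    """Boltzmann action probabilities; a larger tau gives a more uniform policy."""
    scaled = [p / tau for p in preferences]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Under this assumption, early episodes with a large $\tau$ sample actions almost uniformly, and the policy becomes more selective as $\tau$ approaches its final value.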

Parameters:
```
"""Configuration"""
ALPHA = 0.5  # ALPHA:
BETA = 0.3  # BETA:
TAU = 5.0   # Final TAU
EXPLORE_EPISODES = 9000
EPISODES = 10000
EXPLORE = True  # Whether or not we use exploration

SEED = 0
BASE_PATH = 'data/00_duo/01_tau05_10000/'

N_WORKERS = 6
N_AGENTS = 2
AGENT_TYPE = 'ActorCriticCentral'

PIPELINE_SEEDS = [
    47,
    48,
    49,
    50,
    51,
    52,
    53,
    54,
    55,
    56,
    57,
    58,
    # 59,
    # 60,
    # 61,
    # 62,
    # 63,
    # 64,
    # 65,
    # 66,
    # 67,
    # 68,
    # 69,
    # 70,
    # 71,
    # 72,
    # 73,
    # 74,
    # 75,
    # 76
]

```

##  Adversary

In [7]:
import pandas as pd
BASE_PATH = '00_duo/01_tau05_10000/'

central_df = pd.read_csv(BASE_PATH + '00_central/02/pipeline-rollouts-summary.csv', sep=',', index_col=0)
distributed_df = pd.read_csv(BASE_PATH + '01_distributed_learners2/02/pipeline-rollouts-summary.csv', sep=',', index_col=0)
independent_df = pd.read_csv(BASE_PATH + '02_independent_learners/02/pipeline-rollouts-summary.csv', sep=',', index_col=0)
consensus_df = pd.read_csv(BASE_PATH + '03_consensus_learners/02/pipeline-rollouts-summary.csv', sep=',', index_col=0)

def describe(dataframe: pd.DataFrame, label: str) -> pd.DataFrame:
    """Summarizes the mean return across trials.

    Parameters
    ----------
    dataframe: pd.DataFrame
        The result of df.describe() on N independent rollouts,
        each consisting of M timesteps.
        Trials are in the columns and statistics are in the rows.
    label: str
        The name given to the resulting column.

    Returns
    -------
    dataframe: pd.DataFrame
        A single-column description (count, mean, std, ...) of the
        per-trial mean return.

    """
    # Keep only the 'mean', 'min' and 'max' rows of each trial.
    df = dataframe.drop(['std', 'count', '25%', '50%', '75%'], axis=0)
    # Describe the per-trial means across trials.
    ts = df.T.describe()['mean']
    ts.name = label
    return ts.to_frame()
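
To make the aggregation concrete, the sketch below applies the same `describe` helper to a small, made-up dataset. The raw rewards and the `seed_*` column names are hypothetical, and it is an assumption that the summary CSVs have the layout produced by `raw.describe()` (statistics in rows, trials in columns).

```python
import pandas as pd


def describe(dataframe: pd.DataFrame, label: str) -> pd.DataFrame:
    """Same helper as in the notebook, repeated so the example is self-contained."""
    df = dataframe.drop(['std', 'count', '25%', '50%', '75%'], axis=0)
    ts = df.T.describe()['mean']
    ts.name = label
    return ts.to_frame()


# Hypothetical raw rewards: 4 timesteps (rows) for 3 trials (columns).
raw = pd.DataFrame({
    'seed_47': [-1.0, -0.8, -0.6, -0.4],
    'seed_48': [-2.0, -1.5, -1.0, -0.5],
    'seed_49': [-0.9, -0.9, -0.9, -0.9],
})
summary = raw.describe()              # statistics in rows, trials in columns
toy = describe(summary, label='toy')  # statistics of the per-trial means
print(toy)
```

In the resulting column, `count` is the number of trials, and `mean` is the average of the per-trial mean returns, which is how the leaderboard tables in this notebook are read.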

In [9]:
dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(distributed_df, label='distributed v'))
dataframes.append(describe(independent_df, label='independent'))
dataframes.append(describe(consensus_df, label='consensus'))
noregdf = pd.concat(dataframes, axis=1)
noregdf

| statistic | central | distributed v | independent | consensus |
|---|---|---|---|---|
| count | 12.0 | 12.0 | 12.0 | 12.0 |
| mean | -0.93616 | -0.778078 | -0.921831 | -1.775816 |
| std | 0.203765 | 0.287804 | 0.144684 | 1.847337 |
| min | -1.4053 | -1.416441 | -1.192741 | -6.583758 |
| 25% | -1.003028 | -0.915378 | -0.979425 | -1.531485 |
| 50% | -0.914324 | -0.793019 | -0.911007 | -1.012742 |
| 75% | -0.8716 | -0.533962 | -0.828651 | -0.813505 |
| max | -0.501187 | -0.41423 | -0.690237 | -0.624688 |


# SECTION A: Hyperparameter Exploration

- $\tau \in \left\{3.0, 5.0, 10.0\right\}$
- $(\alpha,\beta) \in \left\{(0.05, 0.01), (0.10, 0.05), (0.90, 0.50)\right\}$
- $\zeta \in \left\{0.001, 0.01, 0.05\right\}$

## 1) Central Agent

In [1]:
import pandas as pd

def describe(dataframe: pd.DataFrame, label: str) -> pd.DataFrame:
    """Summarizes the mean return across trials.

    Parameters
    ----------
    dataframe: pd.DataFrame
        The result of df.describe() on N independent rollouts,
        each consisting of M timesteps.
        Trials are in the columns and statistics are in the rows.
    label: str
        The name given to the resulting column.

    Returns
    -------
    dataframe: pd.DataFrame
        A single-column description (count, mean, std, ...) of the
        per-trial mean return.

    """
    # Keep only the 'mean', 'min' and 'max' rows of each trial.
    df = dataframe.drop(['std', 'count', '25%', '50%', '75%'], axis=0)
    # Describe the per-trial means across trials.
    ts = df.T.describe()['mean']
    ts.name = label
    return ts.to_frame()

In [2]:
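PREFIX_PATH = '00_duo_hyperparameter_search'
SUFFIX_PATH = '00_central/02/pipeline-rollouts-summary.csv'

def get_path(x):
    """Builds the path to the summary CSV of experiment folder `x`."""
    return '%s/%s/%s' % (PREFIX_PATH, x, SUFFIX_PATH)

def get_csv(x):
    """Loads the rollouts summary of experiment folder `x`."""
    return pd.read_csv(get_path(x), sep=',', index_col=0)
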
tau_03_df = get_csv('00_tau03')
tau_05_df = get_csv('01_tau05')
tau_10_df = get_csv('02_tau10')
alpha_005_df = get_csv('03_alpha005_beta001')
alpha_050_df = get_csv('04_alpha010_beta005')  # NOTE: holds the alpha=0.10 run despite the variable name
alpha_090_df = get_csv('05_alpha090_beta050')
zeta_0001_df = get_csv('06_zeta0001')
zeta_001_df = get_csv('07_zeta001')
zeta_005_df = get_csv('08_zeta005')


### Leaderboard<a name="A-leaderboard"></a> 

In [3]:
dataframes = []
dataframes.append(describe(tau_03_df, label='Tau = 3'))
dataframes.append(describe(tau_05_df, label='Tau = 5'))
dataframes.append(describe(tau_10_df, label='Tau = 10'))
dataframes.append(describe(alpha_005_df, label='Alpha = 0.05'))
dataframes.append(describe(alpha_050_df, label='Alpha = 0.10'))
dataframes.append(describe(alpha_090_df, label='Alpha = 0.90'))
dataframes.append(describe(zeta_0001_df, label='Zeta = 0.001'))
dataframes.append(describe(zeta_001_df, label='Zeta = 0.01'))
dataframes.append(describe(zeta_005_df, label='Zeta = 0.05'))
noregdf = pd.concat(dataframes, axis=1)
noregdf

| statistic | Tau = 3 | Tau = 5 | Tau = 10 | Alpha = 0.05 | Alpha = 0.10 | Alpha = 0.90 | Zeta = 0.001 | Zeta = 0.01 | Zeta = 0.05 |
|---|---|---|---|---|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | -1.027446 | -1.030752 | -0.801755 | -1.121143 | -0.862657 | -0.926981 | -1.00296 | -0.980269 | -1.030918 |
| std | 0.222353 | 0.203084 | 0.164323 | 0.293388 | 0.17448 | 0.206394 | 0.205614 | 0.179683 | 0.198961 |
| min | -1.679605 | -1.543148 | -1.078303 | -1.683968 | -1.339847 | -1.568402 | -1.618359 | -1.485941 | -1.63687 |
| 25% | -1.13811 | -1.123768 | -0.921898 | -1.37178 | -0.944237 | -1.055936 | -1.09592 | -1.070249 | -1.110846 |
| 50% | -0.983555 | -0.991086 | -0.770042 | -1.061032 | -0.834386 | -0.871808 | -1.001346 | -0.941784 | -1.036429 |
| 75% | -0.852044 | -0.888892 | -0.690146 | -0.92055 | -0.746351 | -0.79697 | -0.852489 | -0.852608 | -0.889156 |
| max | -0.731049 | -0.624713 | -0.488327 | -0.558211 | -0.592805 | -0.598535 | -0.649731 | -0.712581 | -0.742197 |


![pipeline-central-train-30](00_duo_hyperparameter_search/09_best_combination/00_central/02/train_pipeline_m30.png)

### All Agents<a name="B-leaderboard"></a> 

We set `episodes=5000` and `explore_episodes=4975`.

In [4]:
PREFIX_PATH = '00_duo_hyperparameter_search/10_best_combination_double_runs'
SUFFIX_PATH = '02/pipeline-rollouts-summary.csv'

def get_path(x):
    return '%s/%s/%s' % (PREFIX_PATH, x, SUFFIX_PATH)
def get_csv(x):
    return pd.read_csv(get_path(x), sep=',', index_col=0)

central_df = get_csv('00_central')
distributed_df = get_csv('01_distributed_learners2')
independent_df = get_csv('02_independent_learners')
consensus_df = get_csv('03_consensus_learners')


In [5]:
dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(distributed_df, label='distributed'))
dataframes.append(describe(independent_df, label='independent'))
dataframes.append(describe(consensus_df, label='consensus'))
noregdf = pd.concat(dataframes, axis=1)
noregdf

| statistic | central | distributed | independent | consensus |
|---|---|---|---|---|
| count | 30.0 | 30.0 | 12.0 | 30.0 |
| mean | -0.91975 | -0.744186 | -0.653909 | -1.164424 |
| std | 0.194393 | 0.214785 | 0.146739 | 0.351595 |
| min | -1.410282 | -1.36405 | -0.98112 | -1.736371 |
| 25% | -1.036102 | -0.779072 | -0.698352 | -1.485933 |
| 50% | -0.866742 | -0.678873 | -0.60469 | -1.184939 |
| 75% | -0.790471 | -0.621297 | -0.573968 | -0.898429 |
| max | -0.624046 | -0.506707 | -0.454609 | -0.543161 |


### Central

![pipeline-central-train-30](00_duo_hyperparameter_search/10_best_combination_double_runs/00_central/02/train_pipeline_m30.png)

### Distributed Learners

![pipeline-distributed-train-30](00_duo_hyperparameter_search/10_best_combination_double_runs/01_distributed_learners2/02/train_pipeline_m30.png)

### Independent 

![pipeline-independent-train-12](00_duo_hyperparameter_search/10_best_combination_double_runs/02_independent_learners/02/train_pipeline_m12.png)

### Consensus

![pipeline-consensus-train-30](00_duo_hyperparameter_search/10_best_combination_double_runs/03_consensus_learners/02/train_pipeline_m30.png)

## 6) Consensus Learners


BASE_PATH = '00_duo/01_tau05_10000/04_consensus_learners/02'

### 6.1 Rollouts
GIF of the best-performing training run.

![pipeline-consensus-simulation](00_duo/01_tau05_10000/03_consensus_learners/02/simulation-pipeline-best.gif)

### 6.2 Rollout Graph


![pipeline-consensus-rollout](00_duo/01_tau05_10000/03_consensus_learners/02/evaluation_rollout_n2_num07.png)

### 6.3 Train <a name="A-3.3"></a> 


![pipeline-consensus-train-12](00_duo/01_tau05_10000/03_consensus_learners/02/train_pipeline_m12.png)
![rollout-consensus-train-12](00_duo/01_tau05_10000/03_consensus_learners/02/train_rollout_m12.png)