# Recap

- We tested the single agent setting for a different task.
- The landmarks were always fixed at time of the environment initialization.
- It was shown that the agent learned to navigate to any part of the map.
- In particular, when the agent's starting coordinate was kept fixed, overflow would happen. Random restart is an essential part of exploration.
- The optimal policies are not deterministic -- the temperature parameter $\tau$ that regulates the entropy was tested for **1**, **2**, **3**, **5** and **10**.
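
To illustrate how $\tau$ regulates entropy, the sketch below assumes a Boltzmann (softmax) policy over action preferences. The notebook does not show the policy's functional form, so `softmax_policy`, `entropy` and the preference values are hypothetical.

```python
import numpy as np


def softmax_policy(preferences: np.ndarray, tau: float) -> np.ndarray:
    """Boltzmann distribution over actions (assumed form, not from the notebook)."""
    # Subtract the max before exponentiating for numerical stability.
    z = (preferences - preferences.max()) / tau
    p = np.exp(z)
    return p / p.sum()


def entropy(p: np.ndarray) -> float:
    """Shannon entropy in nats."""
    return float(-(p * np.log(p)).sum())


prefs = np.array([1.0, 0.5, 0.0, -0.5])  # hypothetical action preferences
for tau in (1, 2, 3, 5, 10):
    p = softmax_policy(prefs, tau)
    print(f"tau={tau:>2}  probs={np.round(p, 3)}  entropy={entropy(p):.3f}")
```

Under this assumption, a larger $\tau$ flattens the distribution, so the policy stays stochastic and its entropy grows with $\tau$.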

## Findings

1. The most useful task is to randomly restart the landmarks.
2. Regularization, via parameter clipping, improved learning.
3. The optimal value was $\tau = 5.0$.


# Duo Task

## Goal:

### Agents must learn how to navigate to a target landmark, while avoiding other agents.

- Both agents and landmarks are restarted at the beginning of each episode. Each agent is assigned a landmark it must navigate to, and it must find out through trial and error which landmark it was assigned.
- States are the coordinates to the other agent and to both landmarks.
- Reward is defined by the distance from an agent to its assigned landmark. If the agents collide, both receive an extra reward of -1.
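
The reward described above can be sketched as follows. This is a minimal sketch under assumptions: the reward is taken to be the negative Euclidean distance (consistent with the negative returns in the result tables), and `COLLISION_RADIUS` and `duo_rewards` are hypothetical names, not taken from the environment code.

```python
import numpy as np

COLLISION_RADIUS = 0.1    # hypothetical threshold; the real environment may differ
COLLISION_PENALTY = -1.0  # extra reward given to both agents on collision


def duo_rewards(agents: np.ndarray, landmarks: np.ndarray, assigned: list) -> np.ndarray:
    """Per-agent rewards for one timestep.

    agents: (2, 2) array of agent coordinates.
    landmarks: (2, 2) array of landmark coordinates.
    assigned: index of the landmark assigned to each agent.
    """
    # Assumed form: negative Euclidean distance to the assigned landmark.
    rewards = -np.linalg.norm(agents - landmarks[assigned], axis=1)
    # Both agents are penalized when they collide.
    if np.linalg.norm(agents[0] - agents[1]) < COLLISION_RADIUS:
        rewards = rewards + COLLISION_PENALTY
    return rewards


landmarks = np.array([[0.0, 1.0], [1.0, 1.0]])
apart = np.array([[0.0, 0.0], [1.0, 0.0]])
print(duo_rewards(apart, landmarks, [0, 1]))
```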



The objective of this notebook is to compare three learning settings.

1. Centralized Actor Critic

    - Single agent.
    - Fully observable setting.
    - Learns using the average reward from both players.
2. Cooperative Actor Critic

    - Independent agents.
    - Fully observable setting.
    - Learns using the average reward from both players.

3. Independent Learners Actor Critic

    - Independent agents.
    - Fully observable setting.
    - Individual rewards.
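
The three settings differ in how many learners there are and in which reward each learner sees. The sketch below only illustrates that difference; `learning_signal`, the setting names and the reward values are hypothetical and do not come from the training code.

```python
import numpy as np


def learning_signal(rewards: np.ndarray, setting: str) -> np.ndarray:
    """Reward used for the update(s) under each setting."""
    if setting == "central":
        # One learner controls both agents and sees a single averaged reward.
        return np.array([rewards.mean()])
    if setting == "cooperative":
        # Two learners, each updated with the same averaged reward.
        return np.full(rewards.shape, rewards.mean())
    if setting == "independent":
        # Two learners, each updated with its own reward.
        return rewards.copy()
    raise ValueError(f"unknown setting: {setting!r}")


rewards = np.array([-0.4, -1.2])  # hypothetical per-agent rewards
for setting in ("central", "cooperative", "independent"):
    print(setting, learning_signal(rewards, setting))
```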

## Settings

1. We compare the three models above. 
2. Initially, $\tau = 100$ and it falls linearly with the number of episodes (`explore_episodes=975`). 
3. Each test dataframe consists of the `DataFrame.describe()` statistics from **N** = 30 independent random trials, each consisting of a rollout of `M=100` timesteps, with $\tau$ set to a predetermined value.
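
The linear decay of $\tau$ can be sketched as below. The function name `tau_schedule` and the exact endpoint handling are assumptions; the default values mirror the initial $\tau = 100$, the final `TAU` and `EXPLORE_EPISODES` used in this notebook.

```python
def tau_schedule(episode: int, tau_init: float = 100.0, tau_final: float = 5.0,
                 explore_episodes: int = 975) -> float:
    """Linearly anneal tau from tau_init to tau_final, then hold it fixed."""
    if episode >= explore_episodes:
        return tau_final
    frac = episode / explore_episodes
    return tau_init + frac * (tau_final - tau_init)


for episode in (0, 250, 500, 975, 999):
    print(episode, round(tau_schedule(episode), 2))
```

With these defaults, the last 25 of the 1000 episodes would run at the final temperature.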

Parameters:
```
ALPHA = 0.5  # ALPHA:
BETA = 0.3  # BETA:
TAU = 5.0   # Final TAU
EXPLORE_EPISODES = 975
EPISODES = 1000
EXPLORE = True
BASE_PATH = 'data/16_duo/03_tau05/'

N_WORKERS = 6
N_AGENTS = 2
```

## 1) Central Agent

`BASE_PATH = 'data/16_duo/03_tau05/00_central/02'`

### 1.1 Rollout Simulation

GIF from the best performing training.

![pipeline-central-simulation](16_duo/03_tau05/00_central/02/simulation-pipeline-best.gif)

### 1.2 Rollout Graph


![pipeline-central-rollout](16_duo/03_tau05/00_central/02/evaluation_rollout_num17.png)

### 1.3 Train 



![pipeline-central-train-30](16_duo/03_tau05/00_central/02/train_pipeline_m30.png)
![rollout-central-train-30](16_duo/03_tau05/00_central/02/train_rollout_m30.png)

## 2) Cooperative Actor Critic

### 2.1 Rollout Simulation

GIF from the best performing training.


![pipeline-joint-simulation](16_duo/03_tau05/01_joint_learners/02/simulation-pipeline-best.gif)

### 2.2 Rollout Graph


![pipeline-joint-rollout](16_duo/03_tau05/01_joint_learners/02/evaluation_rollout_num1.png)

### 2.3 Train 


![pipeline-joint-train-30](16_duo/03_tau05/01_joint_learners/02/train_pipeline_m30.png)
![rollout-joint-train-30](16_duo/03_tau05/01_joint_learners/02/train_rollout_m30.png)

## 3) Independent Learners Actor Critic

### 3.1 Rollout Simulation

GIF from the best performing training.


![pipeline-independent-simulation](16_duo/03_tau05/02_independent_learners/02/simulation-pipeline-best.gif)

### 3.2 Rollout Graph


![pipeline-independent-rollout](16_duo/03_tau05/02_independent_learners/02/evaluation_rollout_num1.png)

### 3.3 Train 


![pipeline-independent-train-30](16_duo/03_tau05/02_independent_learners/02/train_pipeline_m30.png)
![rollout-independent-train-30](16_duo/03_tau05/02_independent_learners/02/train_rollout_m30.png)

# 2) [Leaderboard 1000](#leaderboard-untrained)

In [15]:
import pandas as pd
BASE_PATH = '16_duo/03_tau05/'

central_df = pd.read_csv(BASE_PATH + '00_central/02/pipeline.csv', sep=',', index_col=0)
joint_df = pd.read_csv(BASE_PATH + '01_joint_learners/02/pipeline.csv', sep=',', index_col=0)
indep_df = pd.read_csv(BASE_PATH + '02_independent_learners/02/pipeline.csv', sep=',', index_col=0)

def describe(dataframe: pd.DataFrame, label: str) -> pd.DataFrame:
    """Summarize the per-trial statistics across trials.

    Parameters
    ----------
    dataframe: pd.DataFrame
        The result of df.describe() for N independent rollouts,
        each consisting of M timesteps.
        Trials are in the columns and rows are statistics.
    label: str
        Name given to the resulting column.

    Returns
    -------
    dataframe: pd.DataFrame
        A description of the average return.

    """
    df = dataframe.drop(['std', 'count', '25%', '50%', '75%'], axis=0)
    ts = df.T.describe()['mean']
    ts.name = label
    return ts.to_frame()
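
To make the expected layout concrete, here is a self-contained sketch that builds a synthetic input and applies the same reduction as `describe`. The random data is purely illustrative, and the assumption that `pipeline.csv` stores statistics in rows and trials in columns is taken from the docstring above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N_TRIALS, M_STEPS = 30, 100  # N and M from the Settings section

# Synthetic per-timestep rewards: one column per trial (hypothetical data).
raw = pd.DataFrame(rng.normal(loc=-0.8, scale=0.3, size=(M_STEPS, N_TRIALS)))
raw.columns = raw.columns.astype(str)

# Statistics in rows, trials in columns.
per_trial = raw.describe()

# Same reduction as describe(): keep the mean/min/max rows, then summarize
# the per-trial means across the N trials.
kept = per_trial.drop(['std', 'count', '25%', '50%', '75%'], axis=0)
summary = kept.T.describe()['mean'].rename('synthetic').to_frame()
print(summary)
```

The resulting frame has one column and eight rows, where `count` is the number of trials and `mean` is the average of the per-trial mean rewards.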

In [17]:
dataframes = []
dataframes.append(describe(central_df, label='central'))
dataframes.append(describe(joint_df, label='joint'))
dataframes.append(describe(indep_df, label='independent'))
noregdf = pd.concat(dataframes, axis=1)
noregdf

| | central | joint | independent |
|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 |
| mean | -0.902543 | -0.771013 | -0.730448 |
| std | 0.214082 | 0.250752 | 0.215126 |
| min | -1.48868 | -1.427638 | -1.28006 |
| 25% | -1.061837 | -0.915154 | -0.819776 |
| 50% | -0.85727 | -0.712802 | -0.702597 |
| 75% | -0.748428 | -0.589685 | -0.604472 |
| max | -0.589694 | -0.39237 | -0.413223 |


1. The first thing to note is that the Evaluation Rollouts show one collision for the central agent, two collisions for the joint action learner and three collisions for the independent learner. This indicates that **cooperation** is a helpful means to avoid collisions and that **coordination** is an effective way to achieve that.
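2. However, we see from the table above that the mean return ranks the settings in the opposite order: the independent learners score highest (-0.730), followed by the joint learners (-0.771) and the central agent (-0.903).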


In [28]:
# BASE_PATH = '17_longer_run/03_tau05/5000'
BASE_PATH = '17_longer_run/03_tau05/5000/'

central_df = pd.read_csv(BASE_PATH + '00_central/02/pipeline.csv', sep=',', index_col=0)
central_df.T.describe()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | 100.0 | -0.707014 | 0.296763 | -1.672751 | -0.797873 | -0.646329 | -0.525754 | -0.298819 |
| std | 0.0 | 0.174991 | 0.057792 | 0.195276 | 0.184996 | 0.188454 | 0.177706 | 0.160034 |
| min | 100.0 | -1.11946 | 0.18023 | -1.986225 | -1.183472 | -1.080297 | -0.984967 | -0.686302 |
| 25% | 100.0 | -0.870943 | 0.253144 | -1.824943 | -0.983946 | -0.786415 | -0.646241 | -0.336848 |
| 50% | 100.0 | -0.645521 | 0.300217 | -1.689882 | -0.754098 | -0.575584 | -0.485484 | -0.279884 |
| 75% | 100.0 | -0.591891 | 0.331558 | -1.500907 | -0.648895 | -0.517344 | -0.382428 | -0.182048 |
| max | 100.0 | -0.475165 | 0.405116 | -1.351337 | -0.535516 | -0.378655 | -0.25858 | -0.069 |


In [29]:
# BASE_PATH = '17_longer_run/03_tau05/5000'
BASE_PATH = '17_longer_run/03_tau05/5000/'

joint_df = pd.read_csv(BASE_PATH + '01_joint_learners/02/pipeline.csv', sep=',', index_col=0)
joint_df.T.describe()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | 100.0 | -0.936407 | 0.308908 | -1.967238 | -1.095861 | -0.90794 | -0.716741 | -0.437647 |
| std | 0.0 | 0.160264 | 0.074822 | 0.373565 | 0.183909 | 0.171209 | 0.164616 | 0.212473 |
| min | 100.0 | -1.24096 | 0.099821 | -2.774708 | -1.543693 | -1.244703 | -1.03249 | -0.819416 |
| 25% | 100.0 | -1.041689 | 0.267717 | -2.225959 | -1.224728 | -1.035687 | -0.822974 | -0.625647 |
| 50% | 100.0 | -0.929262 | 0.31688 | -1.953062 | -1.086125 | -0.904756 | -0.699624 | -0.41383 |
| 75% | 100.0 | -0.831467 | 0.356348 | -1.786281 | -1.004771 | -0.788876 | -0.601372 | -0.242508 |
| max | 100.0 | -0.637744 | 0.466036 | -0.986472 | -0.767684 | -0.569236 | -0.429004 | -0.078584 |


In [30]:
# BASE_PATH = '17_longer_run/03_tau05/5000'
BASE_PATH = '17_longer_run/03_tau05/5000/'

indep_df = pd.read_csv(BASE_PATH + '02_independent_learners/02/pipeline.csv', sep=',', index_col=0)
indep_df.T.describe()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | 100.0 | -0.88491 | 0.327003 | -2.060122 | -1.019062 | -0.842407 | -0.6672 | -0.39169 |
| std | 0.0 | 0.137636 | 0.05798 | 0.220231 | 0.131686 | 0.143866 | 0.156306 | 0.186736 |
| min | 100.0 | -1.264698 | 0.176039 | -2.587743 | -1.341829 | -1.207865 | -1.040477 | -0.877021 |
| 25% | 100.0 | -0.953044 | 0.302329 | -2.209233 | -1.084983 | -0.947738 | -0.742387 | -0.473645 |
| 50% | 100.0 | -0.854733 | 0.328507 | -2.064571 | -0.988348 | -0.808419 | -0.629233 | -0.380767 |
| 75% | 100.0 | -0.782187 | 0.360295 | -1.959333 | -0.931329 | -0.735808 | -0.553192 | -0.241369 |
| max | 100.0 | -0.684747 | 0.453961 | -1.687859 | -0.789959 | -0.648902 | -0.445923 | -0.132889 |


## 1.3. In-Sample Simulations


### 1.3.1 Best rollout (Tau=01)

> tau01_df['64']
```
count    100.000000
mean      -0.127377
std        0.096640
min       -0.669727
25%       -0.145477
50%       -0.115342
75%       -0.077520
max       -0.019049
Name: 64, dtype: float64
```

![t01s64](07_no_clip_restart/00_tau01/02/simulation-pipeline-best.gif)

### 1.3.2 Best mean (Tau=05)

> tau05_df['67']
```
count    100.000000
mean      -0.159647
std        0.080587
min       -0.469426
25%       -0.202637
50%       -0.149833
75%       -0.108500
max       -0.014875
Name: 67, dtype: float64
```

![t05s67](07_no_clip_restart/03_tau05/02/simulation-pipeline-best.gif)


# 2. Tau with Regularization


1. We further regularize the variables $\delta_t$ and $\omega_t$ by applying a technique called parameter clipping.

2. Other parameters are kept at their previous values.
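
Parameter clipping as described above can be sketched with `np.clip`. This is only an illustration under assumptions: $\delta_t$ is read as the TD error and $\omega_t$ as the critic weights (the usual notation, but not stated here), and the bounds `DELTA_CLIP` and `OMEGA_CLIP`, the step size, the discount `gamma` and the linear TD(0) update are all hypothetical, since the notebook does not show the update rule or the thresholds it uses.

```python
import numpy as np

DELTA_CLIP = 1.0   # hypothetical bound on the TD error
OMEGA_CLIP = 10.0  # hypothetical bound on each critic weight


def clipped_critic_step(omega, features, next_features, reward,
                        step_size=0.3, gamma=0.99):
    """One linear TD(0) critic update with clipping applied to delta and omega."""
    # TD error of a linear value estimate (assumed form).
    delta = reward + gamma * (omega @ next_features) - omega @ features
    delta = float(np.clip(delta, -DELTA_CLIP, DELTA_CLIP))
    # Semi-gradient step, then clip the weights element-wise.
    omega = omega + step_size * delta * features
    omega = np.clip(omega, -OMEGA_CLIP, OMEGA_CLIP)
    return omega, delta


omega = np.zeros(3)
phi = np.array([1.0, 0.0, 0.0])
phi_next = np.array([0.0, 1.0, 0.0])
omega, delta = clipped_critic_step(omega, phi, phi_next, reward=-5.0)
print(omega, delta)
```

Clipping bounds the size of a single update, so one large negative reward (for example a collision) cannot blow up the weights.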

Parameters:
```
ALPHA = 0.5  # ALPHA:
BETA = 0.3  # BETA:
TAU = 1.0   # Final TAU. ONLY active if EXPLORE=True
EXPLORE_EPISODES = 475
EPISODES = 500
EXPLORE = True  # WHETHER OR NOT WE USE EXPLORATION
RESTART = True
```

In [5]:
PATH = '08_clipping_restart'
tau01_df = pd.read_csv(PATH + '/00_tau01/02/pipeline.csv', sep=',', index_col=0)
tau02_df = pd.read_csv(PATH + '/01_tau02/02/pipeline.csv', sep=',', index_col=0)
tau03_df = pd.read_csv(PATH + '/02_tau03/02/pipeline.csv', sep=',', index_col=0)
tau05_df = pd.read_csv(PATH + '/03_tau05/02/pipeline.csv', sep=',', index_col=0)
tau10_df = pd.read_csv(PATH + '/04_tau10/02/pipeline.csv', sep=',', index_col=0)


In [6]:
dataframes = []
dataframes.append(describe(tau01_df, label='tau01'))
dataframes.append(describe(tau02_df, label='tau02'))
dataframes.append(describe(tau03_df, label='tau03'))
dataframes.append(describe(tau05_df, label='tau05'))
dataframes.append(describe(tau10_df, label='tau10'))
regdf = pd.concat(dataframes, axis=1)
regdf

| | tau01 | tau02 | tau03 | tau05 | tau10 |
|---|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | -0.506445 | -0.386484 | -0.331849 | -0.34209 | -0.417652 |
| std | 0.235331 | 0.160066 | 0.094523 | 0.071565 | 0.090178 |
| min | -1.014563 | -0.737667 | -0.643118 | -0.500263 | -0.59855 |
| 25% | -0.72075 | -0.450062 | -0.373464 | -0.387874 | -0.497402 |
| 50% | -0.559567 | -0.338805 | -0.320137 | -0.344609 | -0.414975 |
| 75% | -0.298594 | -0.268294 | -0.266351 | -0.271376 | -0.339129 |
| max | -0.121159 | -0.171241 | -0.175994 | -0.237887 | -0.281009 |


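We find that regularization further helps: in the comparison table at the end of this section it raises the mean return for every $\tau$ except $\tau = 10$ and reduces the standard deviation across trials.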

In [7]:
df = pd.merge(
    noregdf, 
    regdf, 
    how='inner',  
    left_index=True, 
    right_index=True, 
    suffixes=('_noreg', '_regul'), 
    copy=True).T.sort_index()

In [8]:
df

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| tau01_noreg | 30.0 | -0.826094 | 1.341297 | -7.614286 | -0.709591 | -0.516601 | -0.356736 | -0.127377 |
| tau01_regul | 30.0 | -0.506445 | 0.235331 | -1.014563 | -0.72075 | -0.559567 | -0.298594 | -0.121159 |
| tau02_noreg | 30.0 | -0.479021 | 0.304314 | -1.469546 | -0.509073 | -0.378446 | -0.26734 | -0.168934 |
| tau02_regul | 30.0 | -0.386484 | 0.160066 | -0.737667 | -0.450062 | -0.338805 | -0.268294 | -0.171241 |
| tau03_noreg | 30.0 | -0.442162 | 0.29002 | -1.139234 | -0.50144 | -0.315756 | -0.253503 | -0.171007 |
| tau03_regul | 30.0 | -0.331849 | 0.094523 | -0.643118 | -0.373464 | -0.320137 | -0.266351 | -0.175994 |
| tau05_noreg | 30.0 | -0.345316 | 0.119567 | -0.714682 | -0.410162 | -0.321267 | -0.261022 | -0.159647 |
| tau05_regul | 30.0 | -0.34209 | 0.071565 | -0.500263 | -0.387874 | -0.344609 | -0.271376 | -0.237887 |
| tau10_noreg | 30.0 | -0.37029 | 0.090621 | -0.565207 | -0.412021 | -0.325997 | -0.302905 | -0.250193 |
| tau10_regul | 30.0 | -0.417652 | 0.090178 | -0.59855 | -0.497402 | -0.414975 | -0.339129 | -0.281009 |
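
To read the effect of clipping off the table above, the sketch below recomputes the per-$\tau$ change in mean and the ratio of standard deviations. The numbers are copied from the `mean` and `std` columns of that table; the names `means`, `stds`, `gain` and `std_ratio` are only illustrative.

```python
import pandas as pd

# Mean and std columns copied from the merged table above.
means = pd.DataFrame(
    {
        "noreg": [-0.826094, -0.479021, -0.442162, -0.345316, -0.370290],
        "regul": [-0.506445, -0.386484, -0.331849, -0.342090, -0.417652],
    },
    index=["tau01", "tau02", "tau03", "tau05", "tau10"],
)
stds = pd.DataFrame(
    {
        "noreg": [1.341297, 0.304314, 0.290020, 0.119567, 0.090621],
        "regul": [0.235331, 0.160066, 0.094523, 0.071565, 0.090178],
    },
    index=means.index,
)

# Positive gain: clipping raised the mean return for that tau.
gain = (means["regul"] - means["noreg"]).rename("mean_gain")
# Ratio below 1: clipping reduced the spread across trials.
std_ratio = (stds["regul"] / stds["noreg"]).rename("std_ratio")
print(pd.concat([gain, std_ratio], axis=1).round(3))
```

The gain is largest at `tau01`, nearly zero at `tau05` and negative only at `tau10`, while the standard deviation ratio stays below 1 for every $\tau$.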
