# Quick Notes on MARL
These notes explore a few concepts from multi-agent reinforcement learning (MARL).

## General Class Structure for Hierarchical Actions

Hierarchical actions are decisions taken in sequence. The actions are often of mixed type, e.g. both binary and continuous.

The goal is to reduce the action space, i.e. to constrain the model to evaluate only action values that would make sense in the real world.
See the abstract structure below.
Ideally, hierarchical actions should have decoupled rewards.

In [1]:
import torch
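def somefunc(*args):
    # placeholder for a reward function; returns a dummy scalar so the code below can run
    return torch.tensor(0.0)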
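class Agent:
    def __init__(self, decision_policy, offer_policy):
        # keep both sub-policies: the binary decision and the continuous offer
        self.decision_policy = decision_policy
        self.offer_policy = offer_policy

    def act(self, observations):
        # placeholder decision: a random projection of a 1-D float observation vector
        decision = torch.rand_like(observations) @ observations
        return decision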
        


class PolicyBinaryDecision(torch.nn.Module):
    def __init__(self):
        super().__init__()  # required before a torch.nn.Module can be used

    def act(self, observations):
        actions = torch.zeros_like(observations)
        return actions

class PolicyOffer(torch.nn.Module):
    def __init__(self):
        super().__init__()  # required before a torch.nn.Module can be used

    def forward(self, observations):
        actions = torch.zeros_like(observations)
        return actions
    
    
    
    
class Environment():
    def __init__(self):
        pass
    
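    def step(self, actions):
        decision = actions[0]      # binary decision
        offer_value = actions[1:]  # continuous offer

        reward_binary = somefunc(decision)
        reward_offer = somefunc(offer_value)

        reward_binary = reward_binary if reward_binary > reward_offer else -reward_offer

        # placeholders: a real environment would compute the next observations and info here
        observations, info = None, {}
        return observations, (reward_binary, reward_offer), info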
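
One possible way to make the abstract structure above concrete is sketched below. Every name in it (`HierarchicalAgent`, `decision_head`, `offer_head`, the layer sizes, the softplus on the offer) is a hypothetical choice and not part of the original notes. The only point is that the binary decision is sampled first, and the continuous offer is kept only for the samples where the decision was 1.

```python
import torch


class HierarchicalAgent(torch.nn.Module):
    """Hypothetical two-level policy: binary decision first, continuous offer second."""

    def __init__(self, n_features):
        super().__init__()
        self.decision_head = torch.nn.Linear(n_features, 1)  # logits of the binary decision
        self.offer_head = torch.nn.Linear(n_features, 1)     # raw continuous offer

    def act(self, observations):
        # Level 1: sample the binary decision (1 = make an offer, 0 = pass).
        decision_prob = torch.sigmoid(self.decision_head(observations))
        decision = torch.bernoulli(decision_prob)
        # Level 2: the offer is kept only where the decision is 1.
        offer = torch.nn.functional.softplus(self.offer_head(observations))
        offer = torch.where(decision.bool(), offer, torch.zeros_like(offer))
        return decision, offer


torch.manual_seed(0)
agent = HierarchicalAgent(n_features=4)
decision, offer = agent.act(torch.rand(8, 4))  # both have shape 8 x 1
```

Returning the two levels separately is what would let an environment pay out one reward per level, in line with the decoupled rewards mentioned earlier.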
    
    

## Composite Actions

If the previous approach proves too computationally expensive, one can enforce the constraints inside the forward pass of the policy network, or inside the environment, by computing a composite action that combines both the decision and the offer value.
The reward is then no longer decoupled per action, but the neural network still evaluates only valid combinations of actions, e.g. when passing, the offer price is always 0.
An abstract example of such a composite action policy is found below:

In [2]:
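def policy_action():
    # nn_policy_logits, nn_quantity_output and reservation are placeholders
    # for the policy networks and the reservation price
    logits = nn_policy_logits() # batch_size x 1
    decision_prob = torch.sigmoid(logits) # (0,1)

    price = torch.relu(nn_quantity_output(...) - reservation) + reservation # seller: price >= reservation
    # price = - torch.relu(reservation - nn_quantity_output(...)) + reservation # buyer: price <= reservation

    # torch.bernoulli takes the probabilities as its input tensor; a sampled 0 means "pass"
    return torch.bernoulli(decision_prob)*price

# but the reward is computed from the composite action as a whole
composite_action = torch.rand([5, 1]) # a continuous composite action generated by policy action
reward = somefunc(composite_action)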

    
    

If we use the composite action logic and do not decouple the reward, we reduce the action space but may still need many samples: the RL algorithm has to "learn" to disentangle the reward values and assign credit to the proper actions.
In RL this is commonly known as the credit assignment problem, and it is a major challenge in reward design.
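
To make the composite action concrete, here is a minimal runnable sketch of the seller variant. The class name `CompositePolicy`, the two linear layers and the value `reservation=2.0` are assumptions made for illustration only; the notes do not specify the networks behind the policy.

```python
import torch


class CompositePolicy(torch.nn.Module):
    """Hypothetical seller policy that emits one composite action per sample."""

    def __init__(self, n_features, reservation):
        super().__init__()
        self.decision_net = torch.nn.Linear(n_features, 1)
        self.price_net = torch.nn.Linear(n_features, 1)
        self.reservation = reservation

    def forward(self, observations):
        decision_prob = torch.sigmoid(self.decision_net(observations))  # in (0, 1)
        # Seller constraint: the price never falls below the reservation price.
        price = torch.relu(self.price_net(observations) - self.reservation) + self.reservation
        # 0 means "pass"; any non-zero value is the offered price.
        return torch.bernoulli(decision_prob) * price


torch.manual_seed(0)
policy = CompositePolicy(n_features=3, reservation=2.0)
composite_action = policy(torch.rand(5, 3))  # shape 5 x 1
```

Every entry of `composite_action` is either 0 or at least the reservation price, so invalid combinations such as "pass with a non-zero price" can never reach the environment.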

## Centralized Environment and Decentralized Agents
Here we evaluate how a single environment can be parallelized to handle many samples (mini-batches, agents and timesteps).

Does having a centralized shared environment allow for decentralized agent architectures?
To explore this we use a small example of 3 agents playing the following game:
- Each agent outputs a continuous-valued action.
- The agent that outputs the median value wins the given timestep.
- The agent that wins the most timesteps wins the game.

From an RL perspective, the agents can learn by interacting with several instances of the same environment at a time.
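
The rules of the game can be sketched directly with tensor operations. The action values below are made up for illustration, and the variable names (`step_winner`, `wins`, `game_winner`) are hypothetical.

```python
import torch

# One row per timestep, one column per agent (made-up values).
actions = torch.tensor([[3.0, 2.0, 1.0],
                        [4.0, 3.0, 5.0],
                        [0.1, 0.9, 0.5],
                        [1.0, 2.0, 3.0]])
n_agents = actions.shape[-1]

# With an odd number of agents the median is always one of the submitted actions.
median = actions.median(dim=-1)
step_winner = median.indices                            # agent holding the median at each timestep
wins = torch.bincount(step_winner, minlength=n_agents)  # timesteps won per agent
game_winner = wins.argmax().item()                      # agent with the most wins

print(step_winner.tolist())  # [1, 0, 2, 1]
print(wins.tolist())         # [1, 2, 1]
print(game_winner)           # 1
```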

In [4]:
# if we had 3 agents that are "linear regressors", and each agent takes the mean over other agent's actions as an observation

n_samples = 6
n_agents = 3


agent_adjacency = torch.tensor([
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0]
], dtype=torch.float)

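# TODO: There is an error in the observation calculation: the product below broadcasts to
# [sample, i, j] = agent_adjacency[i, j] * actions[sample, i], so the mean over j returns each
# agent's own action scaled by 2/3, not the mean over the other agents' actions. Double check if important.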
class Env:
    def __init__(self):
        pass
    
    def step(self, actions): 
        """
        param actions: an n_samples x n_agents tensor
        """
        median_action = actions.median(-1).values # over all agents not samples

        reward = -((actions - median_action.unsqueeze(-1))**2) # we want broadcast over agent dimension
        observation = (agent_adjacency.unsqueeze(0)*actions.unsqueeze(-1)).mean(-1) # we match samples and agents dimension for broadcast
        
        # detach below means we do not allow gradient propagation through environment calculations!
        return observation.detach(), reward.detach(), {}
    
    def reset(self, n_samples = 10): # or we can call it n_environments 
        return torch.rand([n_samples, n_agents])
    

Given the above, an environment step looks like this:

In [13]:
env = Env()
env.step(torch.tensor(     #agent_1, agent_2, agent_3
                           [[       3,       2,       1],  # env 1
                            [       4,       3,       5] ] # env 2
                ))


(tensor([[2.0000, 1.3333, 0.6667],
         [2.6667, 2.0000, 3.3333]]),
 tensor([[-1,  0, -1],
         [ 0, -1, -1]]),
 {})
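
Note that the observations printed above are each agent's own action scaled by 2/3 (e.g. 3 becomes 2.0), not the mean over the other agents' actions. A minimal sketch of an observation that does match that description is given below. `mean_of_others` is a hypothetical helper, not part of the original `Env`, and it assumes every agent has at least one neighbour in `agent_adjacency`.

```python
import torch

agent_adjacency = torch.tensor([
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
], dtype=torch.float)


def mean_of_others(actions, adjacency):
    """Mean action of each agent's neighbours; actions is n_samples x n_agents."""
    # (n_samples, n_agents) @ (n_agents, n_agents):
    # entry [s, i] sums actions[s, j] over the neighbours j of agent i.
    neighbour_sum = actions @ adjacency.T
    return neighbour_sum / adjacency.sum(-1)  # divide by each agent's neighbour count


actions = torch.tensor([[3.0, 2.0, 1.0],
                        [4.0, 3.0, 5.0]])
observation = mean_of_others(actions, agent_adjacency)
print(observation)  # rows: [1.5, 2.0, 2.5] and [4.0, 4.5, 3.5]
```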

Now let's play with some agents that do linear regression and optimize a q-value loss over time:

In [6]:
n_features = 1

agent_1 = torch.nn.Linear(in_features=n_features, out_features=1)
agent_2 = torch.nn.Linear(in_features=n_features, out_features=1)
agent_3 = torch.nn.Linear(in_features=n_features, out_features=1)

We collect initial observations for 10 samples:
    

In [7]:
initial_observation =  env.reset()
initial_observation

tensor([[0.2929, 0.2356, 0.4162],
        [0.3818, 0.3747, 0.2764],
        [0.2781, 0.2811, 0.5638],
        [0.5451, 0.6116, 0.9397],
        [0.0962, 0.5979, 0.0512],
        [0.9486, 0.3938, 0.1681],
        [0.4565, 0.6818, 0.0627],
        [0.9360, 0.9369, 0.5536],
        [0.8793, 0.7154, 0.0522],
        [0.7321, 0.5410, 0.2515]])

Then we calculate first step actions:

In [21]:
action_agent_1 = agent_1(initial_observation[:, 0].unsqueeze(-1))
action_agent_2 = agent_2(initial_observation[:, 1].unsqueeze(-1))
action_agent_3 = agent_3(initial_observation[:, 2].unsqueeze(-1))
all_actions = torch.cat([action_agent_1, action_agent_2, action_agent_3], dim=-1) # concatenate agents on last dimension
all_actions

tensor([[ 0.4671, -0.9128,  0.3898],
        [ 0.3975, -1.0193,  0.3646],
        [ 0.4787, -0.9477,  0.4163],
        [ 0.2697, -1.2007,  0.4839],
        [ 0.6210, -1.1902,  0.3241],
        [-0.0462, -1.0340,  0.3451],
        [ 0.3390, -1.2544,  0.3262],
        [-0.0363, -1.4497,  0.4145],
        [ 0.0081, -1.2801,  0.3243],
        [ 0.1233, -1.1466,  0.3601]], grad_fn=<CatBackward>)

The observations and rewards for the next step would be:

In [24]:
obs, rew, _ = env.step(all_actions)
print(obs) # shape should be n_samples x n_agents
print(rew) # shape should be n_samples x n_agents

tensor([[ 0.3114, -0.6085,  0.2598],
        [ 0.2650, -0.6795,  0.2431],
        [ 0.3191, -0.6318,  0.2775],
        [ 0.1798, -0.8004,  0.3226],
        [ 0.4140, -0.7935,  0.2161],
        [-0.0308, -0.6893,  0.2301],
        [ 0.2260, -0.8363,  0.2174],
        [-0.0242, -0.9665,  0.2763],
        [ 0.0054, -0.8534,  0.2162],
        [ 0.0822, -0.7644,  0.2401]])
tensor([[-5.9737e-03, -1.6967e+00, -0.0000e+00],
        [-1.0825e-03, -1.9153e+00, -0.0000e+00],
        [-3.8875e-03, -1.8605e+00, -0.0000e+00],
        [-0.0000e+00, -2.1619e+00, -4.5906e-02],
        [-8.8161e-02, -2.2931e+00, -0.0000e+00],
        [-0.0000e+00, -9.7566e-01, -1.5313e-01],
        [-1.6445e-04, -2.4983e+00, -0.0000e+00],
        [-0.0000e+00, -1.9977e+00, -2.0322e-01],
        [-0.0000e+00, -1.6595e+00, -9.9960e-02],
        [-0.0000e+00, -1.6128e+00, -5.6075e-02]])


To start training, we create one optimizer per agent:

In [25]:
optimizers = [torch.optim.Adam(agent_1.parameters()),
              torch.optim.Adam(agent_2.parameters()),
              torch.optim.Adam(agent_3.parameters())]

all_agents = [agent_1, agent_2, agent_3]

Suppose we want to backpropagate through a simple learning loss, e.g. the discounted sum of rewards times the actions. The learning loop in torch then looks as follows.
First we sample actions, observations and rewards based on the current agent (neural network) parameters.

In [26]:
# A game of 100 steps begins
obs =  env.reset()

# rollout experience collection
all_actions = []
all_rewards = []
all_obs = []
for step in range(100):
    all_obs.append(obs)  # store the observation that produces this step's actions and rewards
    actions = torch.cat([agent_1(obs[:, 0].unsqueeze(-1)),
                         agent_2(obs[:, 1].unsqueeze(-1)),
                         agent_3(obs[:, 2].unsqueeze(-1))
                        ],
                        dim=-1) # concatenate agents on last dimension
    obs, rew, _ = env.step(actions)
    all_actions.append(actions)
    all_rewards.append(rew)


Then we use these experiences to backpropagate and update the parameters.
Once the parameters are updated, we can repeat the previous step, which gives the training loop.
The backpropagation update uses the rollout to compute a learning loss based on the discounted rewards.
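
As a point of reference, the discounted rewards mentioned above are commonly computed as a reward-to-go, $G_t = \sum_{k \ge t} \gamma^{k-t} r_k$. The sketch below shows that computation on a toy rollout. `discounted_returns` is a hypothetical helper and is not the loss used in the next cell.

```python
import torch


def discounted_returns(rewards, gamma):
    """rewards: n_steps x n_samples x n_agents; returns the reward-to-go, same shape."""
    returns = torch.zeros_like(rewards)
    running = torch.zeros_like(rewards[0])
    # Walk backwards so that G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = torch.ones(3, 2, 3)  # toy rollout: 3 steps, 2 samples, 3 agents
returns = discounted_returns(rewards, gamma=0.5)
print(returns[:, 0, 0])  # tensor([1.7500, 1.5000, 1.0000])
```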

In [1]:
gamma = 0.99
discount_coefficients = (gamma)**torch.arange(len(all_obs))
for agent in range(n_agents):
    for i, obs in enumerate(all_obs):
        optimizers[agent].zero_grad() # clear gradients left over from the previous update
        agent_action = all_agents[agent](obs[:, agent].unsqueeze(-1))
        
        # here we use a simple loss; a q-value and the Bellman equation could replace it...
        loss = (-torch.stack(all_rewards[i:])[:, :, agent].sum(-1)*discount_coefficients[i:]*agent_action**2).sum() # discount factor goes here
        loss.backward() # gradient calculation
        optimizers[agent].step() # parameter update
        print(loss) # we should see this dropping as we repeat the training loop.


## Extra tips
If each agent received $n$ features, the linear layers would receive an input of size $n \times m$, where $m$ is the number of agents. The observation inputs would then have to be reshaped accordingly in torch.
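
One reading of this tip is sketched below: the observations carry an extra feature dimension, and the agent and feature dimensions are flattened before the linear layer. The shapes and the three-dimensional observation layout are assumptions made for illustration.

```python
import torch

n_samples, n_agents, n_features = 10, 3, 4
observations = torch.rand(n_samples, n_agents, n_features)

# Flatten the agent and feature dimensions so each sample becomes one vector of length n * m.
flat_observations = observations.reshape(n_samples, n_agents * n_features)

agent = torch.nn.Linear(in_features=n_agents * n_features, out_features=1)
action = agent(flat_observations)  # shape: n_samples x 1
```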

### Broadcasting and dummy dimensions

We can use dummy dimensions to apply broadcasting in torch, in a similar fashion to numpy. 
More about broadcasting semantics can be found here:
https://pytorch.org/docs/stable/notes/broadcasting.html
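
As a small illustration, the reward calculation in `Env.step` relies on exactly this trick. The sketch below repeats it on the two-environment example from earlier, written with float values.

```python
import torch

actions = torch.tensor([[3.0, 2.0, 1.0],
                        [4.0, 3.0, 5.0]])   # n_samples x n_agents
median_action = actions.median(-1).values    # shape: n_samples

# unsqueeze(-1) adds a dummy agent dimension: (n_samples,) -> (n_samples, 1),
# which then broadcasts against (n_samples, n_agents).
deviation = actions - median_action.unsqueeze(-1)

# Indexing with None is the numpy-style spelling of the same thing.
same_deviation = actions - median_action[:, None]

print(deviation)  # rows: [1., 0., -1.] and [0., -1., 1.]
```

Without the dummy dimension, subtracting a tensor of shape `(n_samples,)` from one of shape `(n_samples, n_agents)` would either fail or silently align the medians with the agent dimension when the two sizes happen to match.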