# Halite challenge - advanced topics

We have seen how to solve the reinforcement learning problem of a single ship that has to optimize its halite collection. We now want to extend the learning process to multiple ships. This adds two new tasks to the one already solved for the single-ship case. The first is teaching the shipyard to spawn ships efficiently. The second is coordinating all the ships, both at each turn and during learning.

# Multi-agent framework

There are two classes of agents: ship and shipyard. There is only one instance of the shipyard agent, but many instances of the ship agent.

### Challenges:

1. Interface all the ship agents so that they can learn without crashing into each other;
2. Define, for each class of agents, the (partial) state that they observe at each turn;
3. Shape the reward for multiple ships (one for all or one for each?) and for the shipyard (just a single reward for the whole episode?);
4. Choose how to train the two classes iteratively without mixing up their responsibilities in a way that could ruin the training procedure, e.g. assigning a penalty to the shipyard for a mistake made by a ship agent.


# Ship class with tabular method

The simplest and perhaps most straightforward way to generalize from one ship to many is to treat the ships as separate, independent entities that share as little as possible. Their learning then stays very close to the single-ship case and requires less experience. The alternative, taking the union of the states of all ships as the state of the system, makes the number of states grow exponentially with the number of ships and is infeasible even for two ships. For example, a single ship on a $7 \times 7$ map has 142,884 possible states, so two ships have $(142{,}884)^2 \approx 20.4\cdot10^9$ joint states. Each joint state has to be multiplied by the 25 possible combined actions of the two ships, and each Q-value takes 64 bits (8 bytes), so the Q-value table would need $4.08\cdot 10^{12}$ bytes = 4.08 TB, far more than the RAM of an ordinary machine.
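
The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `q_table_bytes` is introduced here for illustration, and the 5 actions per ship are inferred from the 25 combined actions quoted in the text.

```python
# Size of a tabular Q-function with one 64-bit value per state-action pair.
SINGLE_SHIP_STATES = 142_884   # states of one ship on a 7x7 map (from the text)
ACTIONS_PER_SHIP = 5           # inferred: 25 combined actions = 5 * 5
BYTES_PER_VALUE = 8            # 64 bits per Q-value


def q_table_bytes(n_states: int, n_actions: int) -> int:
    """Bytes needed to store one Q-value per state-action pair."""
    return n_states * n_actions * BYTES_PER_VALUE


single = q_table_bytes(SINGLE_SHIP_STATES, ACTIONS_PER_SHIP)
joint = q_table_bytes(SINGLE_SHIP_STATES ** 2, ACTIONS_PER_SHIP ** 2)

print(f"single ship:            {single / 1e6:.2f} MB")
print(f"two ships, joint state: {joint / 1e12:.2f} TB")
```

The joint table is larger than the single-ship one by a factor of $142{,}884 \times 5$, which is why the independent-agent approach is pursued in the rest of this section.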

## State representation

Given that a shared state is infeasible, we can instead encode, as compactly as possible, the information a ship needs to avoid collisions. Two ships can close a gap of at most 2 cells in a single turn (one move each), and several ships may be close to a given ship, so we can encode the positions of the other ships inside a $5\times5$ square centred on the ship under consideration. That square, however, also includes cells that are more than two moves away from the centre, because diagonal moves aren't allowed.
In the grid below, `___` marks the cells within two moves of the ship and `***` marks the cells that are out of range.

| row / col | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | `***` | `***` | `___` | `***` | `***` |
| 1 | `***` | `___` | `___` | `___` | `***` |
| 2 | `___` | `___` | ship | `___` | `___` |
| 3 | `***` | `___` | `___` | `___` | `***` |
| 4 | `***` | `***` | `___` | `***` | `***` |

We therefore only need information about 13 of the 25 cells. Each of these cells can be occupied (1) or empty (0), giving $2^{13} = 8192$ possible combinations. The total number of state-action pairs becomes $5.85\cdot 10^9$, which takes about 47 GB of memory. Again, this is too much.
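
The cell count and the resulting table size can be checked with a short sketch. The `(row, col)` offset convention relative to the ship and the helper name `encode_neighbourhood` are assumptions introduced here for illustration; they are not part of the original code base.

```python
from itertools import product

RADIUS = 2  # two single-cell moves, no diagonals

# Offsets (row, col) relative to the ship with Manhattan distance <= RADIUS.
DIAMOND = [(dr, dc)
           for dr, dc in product(range(-RADIUS, RADIUS + 1), repeat=2)
           if abs(dr) + abs(dc) <= RADIUS]


def encode_neighbourhood(occupied_offsets):
    """Pack the occupancy of the in-range cells into one integer, one bit per cell."""
    occupied = set(occupied_offsets)
    code = 0
    for bit, offset in enumerate(DIAMOND):
        if offset in occupied:
            code |= 1 << bit
    return code


n_configs = 2 ** len(DIAMOND)
table_bytes = 142_884 * n_configs * 5 * 8  # states * configurations * actions * bytes

print(len(DIAMOND), n_configs)             # cells in range, occupancy patterns
print(f"{table_bytes / 1e9:.1f} GB")
```

Cells outside the diamond are simply ignored by `encode_neighbourhood`, which mirrors the `***` cells of the grid.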

The purpose of this information is simply to tell which directions are dangerous and which are safe. Since we are forced to compress it, the minimal information that is still useful is a single direction: the safest one. We label a direction as dangerous if moving in that direction has a non-zero probability of colliding with another ship. If just one direction is dangerous, the safest direction is the opposite one; if two are dangerous, we choose one of the two safe directions at random; if three are dangerous, only one direction remains; finally, if all four directions are dangerous, the safest option is to stay still (this doesn't guarantee the safety of the ship, but it should work if all the ships follow the same optimal policy). We can therefore use a variable `safe_dir` $\in \{0, 1, 2, 3, 4\}$ to encode this information. The total number of state-action pairs becomes $3.57\cdot 10^6$, which takes about 28.6 MB of memory.
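
The rule above can be sketched as a small function. The numeric encoding of the directions (0 to 3 for the four moves, 4 for staying still) and the name `safe_direction` are assumptions made for this example. The text does not say what to do when no direction is dangerous; picking a random direction in that case is also an assumption.

```python
import random

# Assumed encoding (not given in the text): 0=N, 1=E, 2=S, 3=W, 4=stay still.
N, E, S, W, STAY = range(5)
OPPOSITE = {N: S, S: N, E: W, W: E}


def safe_direction(dangerous, rng=random):
    """Compress the set of dangerous directions into a single `safe_dir` value."""
    dangerous = set(dangerous)
    safe = [d for d in (N, E, S, W) if d not in dangerous]
    if not safe:                  # all four directions are dangerous
        return STAY
    if len(dangerous) == 1:       # one threat: move directly away from it
        return OPPOSITE[next(iter(dangerous))]
    # Two or three threats: pick among what is left (a single option for three).
    # Zero threats is not covered by the text; a random pick is an assumption.
    return rng.choice(safe)


print(safe_direction({N}))             # the direction opposite to N
print(safe_direction({N, E, S, W}))    # stay still
```

Passing a seeded `random.Random` instance as `rng` makes the random choice reproducible during training.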

## Reward

Since we chose to use independent agents for the ships, it makes sense to decompose the reward into the contributions of the individual ships. A ship that doesn't deposit halite contributes 0 and receives only the baseline reward (e.g. $-0.01$). A ship that deposits halite receives the amount deposited divided by 1000, minus the baseline reward.
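
A minimal sketch of this per-ship reward follows. It assumes one particular reading of the rule, namely that the baseline acts as a cost of 0.01 charged every turn, so that a depositing ship gets `halite / 1000 - 0.01`. The names `ship_reward`, `BASELINE` and `HALITE_SCALE` are hypothetical.

```python
BASELINE = -0.01      # baseline reward per turn (example value from the text)
HALITE_SCALE = 1000   # deposited halite is divided by this factor


def ship_reward(halite_deposited: float) -> float:
    """Reward of one ship for one turn.

    Assumed reading: the baseline is applied every turn, so a ship that
    deposits nothing gets BASELINE and a depositing ship gets its scaled
    deposit with the same baseline applied.
    """
    return halite_deposited / HALITE_SCALE + BASELINE


# Each ship is rewarded only for its own deposit.
deposits = {"ship_0": 0, "ship_1": 500}
rewards = {name: ship_reward(h) for name, h in deposits.items()}
print(rewards)
```

Because each ship sees only its own contribution, a ship is never rewarded or penalized for what another ship deposited.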

# Shipyard class

Dealing with the shipyard is a completely different problem, for several reasons:

1. It only has a binary choice (spawn a ship or stay still);
2. It's not directly responsible for the final result of the episode, but its choices can influence it greatly;
3. Its choices depend on the current time of the episode more than those of the ships (spawning a ship in the last few turns is definitely a bad idea);
4. It receives feedback on its actions much less frequently than the ship agents, because the only metric for evaluating its policy is the final amount of halite collected and how that amount changes across episodes while the policy of the ships is kept fixed.

If we ask what information the shipyard needs in order to decide whether to spawn a ship, we find:

1. Current number of ships `N`;
2. Number of turns left until the end `t_left` (this keeps it flexible with respect to the total number of turns);
3. Halite available `h_tot`;
4. Position of the ships on the map, or some variable related to it (for example, we don't want to spawn a ship while another one is dropping halite at the shipyard, because that would cause a shipwreck).

Since the number of variables is low and the choice to be made is binary, it seems natural to use a function rather than a table to approximate the Q-values of the two choices. In other words, instead of storing the Q-value $Q(s,a)$ for each state $s$ and action $a$ in a table, we can use a function $f$ with the interpretation $f(s) = P[Q(s,1) > Q(s,0)]$, where by convention $a=1$ means spawning a ship and $a=0$ means staying still.
We then have to reformulate the task of the shipyard in order to understand how to determine the function $f$.
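
To make the idea of a function $f$ concrete, here is a minimal sketch that assumes a logistic form for $f$ over a handful of features. Nothing in the text fixes this choice: the logistic parametrization, the class `ShipyardState`, the boolean `yard_blocked` (a stand-in for the ship-position information) and the feature scaling are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ShipyardState:
    n_ships: int         # N: current number of ships
    t_left: int          # number of turns left until the end
    h_tot: float         # halite available
    yard_blocked: bool   # hypothetical: a ship is about to enter the shipyard

    def features(self):
        # A constant 1.0 is prepended so that weights[0] acts as a bias term.
        return [1.0, float(self.n_ships), float(self.t_left),
                self.h_tot / 1000.0, float(self.yard_blocked)]


def spawn_probability(state, weights):
    """f(s): assumed logistic model of P[Q(s,1) > Q(s,0)]."""
    z = sum(w * x for w, x in zip(weights, state.features()))
    if z >= 0:                       # numerically stable sigmoid
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


state = ShipyardState(n_ships=2, t_left=300, h_tot=4000.0, yard_blocked=False)
print(spawn_probability(state, [0.0] * 5))   # untrained weights: no preference
```

With all weights at zero the function returns 0.5, i.e. the shipyard has no preference between spawning and staying still; how the weights should be learned is the question the reformulated task has to answer.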