# Playing Starcraft and Saving the World Using Multi-agent Reinforcement Learning!

<img src="https://static.starcraft2.com/dist/images/content/f2p-cards/img-f2p-campaign.jpg" />

Image source: https://starcraft2.com/en-gb/

Powered By:
<img src="https://raw.githubusercontent.com/instadeepai/Mava/develop/docs/images/mava.png" />

and [SMAC - StarCraft Multi-Agent Challenge](https://github.com/oxwhirl/smac).


<a href="https://colab.research.google.com/github/instadeepai/IndabaX-SA-2021/blob/main/IndabaX_SA_Playing_Starcraft_and_Saving_the_World_Using_Multi_agent_Reinforcement_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## 1. Installation
**Installation of StarCraft II can take between 5 and 25 minutes.**

In [None]:
#@title Timing Tool (Run Cell)
%%capture
!pip install ipython-autotime

%load_ext autotime

time: 2.07 ms (started: 2021-10-29 10:21:01 +00:00)


In [None]:
#@title Install Mava and Some Supported Environments (Run Cell)
%%capture
!pip install git+https://github.com/instadeepai/Mava@feature/smac-env-upgrades#egg=id-mava[reverb,tf,launchpad,envs]

In [None]:
#@title Installs and Imports for Agent Visualization (Run Cell)
%%capture
!pip install git+https://github.com/instadeepai/Mava#egg=id-mava[record_episode]
!apt-get update -y && apt-get install -y xvfb python-opengl ffmpeg && pip install pyvirtualdisplay

import os
from IPython.display import HTML
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1280,720))
display.start()
os.environ["DISPLAY"] = ":" + str(display.display)

In [None]:
#@title Install Starcraft II (Run Cell)
%%capture

# Install Smac
!pip install git+https://github.com/oxwhirl/smac.git

# Install and Download Starcraft 2
!wget https://raw.githubusercontent.com/instadeepai/Mava/develop/install_sc2.sh
!chmod +x install_sc2.sh && ./install_sc2.sh

# https://github.com/deepmind/pysc2/issues/327
!apt-get remove libtcmalloc*

import os
os.environ['SC2PATH'] = "/content/3rdparty/StarCraftII"
os.environ['LD_PRELOAD'] = ''

# Give the agents more time per episode: raise every episode limit of 60
# in smac_maps.py (this includes 3m) to 180, so we can train on Colab.
!sudo apt install -y vim
!vim -c "%s/: 60/: 180/g|wq" /usr/local/lib/python3.7/dist-packages/smac/env/starcraft2/maps/smac_maps.py

## 2. Import Modules

In [None]:
#@title Imports Modules (Run Cell)
import functools
from datetime import datetime
from typing import Any, Dict, Mapping, Sequence, Union

import launchpad as lp
import numpy as np
import sonnet as snt
import tensorflow as tf
from absl import app, flags
from acme import types
from mava.components.tf import networks
from acme.tf import utils as tf2_utils


from mava import specs as mava_specs
from mava.systems.tf import vdn
from mava.utils import lp_utils
from mava.utils.environments import debugging_utils
from mava.wrappers import MonitorParallelEnvironmentLoop
from mava.components.tf import architectures
from mava.utils.loggers import logger_utils
from mava.components.tf.modules.exploration import LinearExplorationScheduler
from mava.components.tf.modules.exploration import LinearExplorationTimestepScheduler

# Seed random variables
magic_random_seed=42

import random
import os
def seed_data(seed: int) -> None:
    print(f"Using seed {seed}")
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ["TF_DETERMINISTIC_OPS"] = "1"
    os.environ["TF_CUDNN_DETERMINISTIC"] = "1"

seed_data(magic_random_seed)

## 3. SMAC: The StarCraft Multi-Agent Challenge

### SMAC Environment

In [None]:
#@title SMAC: The StarCraft Multi-Agent Challenge Video (Run Cell)
from IPython.display import YouTubeVideo
# Video Source https://www.youtube.com/watch?v=VZ7zmQ_obZ0
YouTubeVideo('VZ7zmQ_obZ0',width=1024, height=576)

SMAC is a popular environment/benchmark for cooperative MARL systems. 

It provides:
- Partial Observability.
- Challenging Dynamics.
- High-dimensional observation spaces.

Each unit is an **independent RL agent**.

#### Observations, Actions and Rewards

<img src="http://whirl.cs.ox.ac.uk/blog/wp-content/uploads/2019/01/smac_agent_obs-768x546.jpg"  width="600"  />

**Observation**

Each agent receives local observations about the allied and enemy units within its sight range:
- distance
- relative x
- relative y
- health
- shield
- unit type
- last action (only for allied units)
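As a rough illustration of partial observability, the sketch below builds a local observation from the units that fall within an agent's sight range. The function name, the dictionary layout and the `SIGHT_RANGE` value are assumptions made for this example; they are not SMAC's actual implementation.

```python
import math

SIGHT_RANGE = 9.0  # assumed sight radius, for illustration only


def local_observation(agent_pos, units, sight_range=SIGHT_RANGE):
    """Return features for every unit within `sight_range` of the agent."""
    ax, ay = agent_pos
    visible = []
    for unit in units:
        dx, dy = unit["x"] - ax, unit["y"] - ay
        distance = math.hypot(dx, dy)
        if distance <= sight_range:
            visible.append(
                {
                    # Normalise by the sight range so features lie in [-1, 1].
                    "distance": distance / sight_range,
                    "relative_x": dx / sight_range,
                    "relative_y": dy / sight_range,
                    "health": unit["health"],
                }
            )
    return visible


units = [
    {"x": 3.0, "y": 4.0, "health": 1.0},   # 5 units away: visible
    {"x": 20.0, "y": 0.0, "health": 0.5},  # 20 units away: hidden
]
obs = local_observation((0.0, 0.0), units)
print(len(obs), "unit(s) visible")
```

Units outside the sight range simply do not appear in the observation, which is why agents must act on incomplete information.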

**Actions** 

Agents can take the following discrete actions:
- move[direction] (four directions: north, south, east, or west)
- attack[enemy id]
- heal[agent id] (only for Medivacs)
- stop

**Rewards**

The reward signal is calculated from the hit-point damage dealt and received by agents, with a positive (negative) reward when enemy (allied) units are killed and/or a positive (negative) bonus for winning (losing) the battle ([calculation](https://github.com/oxwhirl/smac/blob/503a678c7378a08d5c55233cfa7ef967acfc93c5/smac/env/starcraft2/starcraft2.py#L824)).
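The structure of such a shaped reward can be sketched as follows. This is a simplified illustration, not SMAC's code: the function name, the `kill_bonus`, `win_bonus` and `negative_scale` parameters and their values are all hypothetical. See the linked calculation for the real implementation.

```python
def shaped_reward(
    damage_dealt,
    damage_received,
    enemies_killed,
    allies_killed,
    outcome,               # "win", "loss" or None while the battle is ongoing
    kill_bonus=10.0,       # hypothetical bonus per unit killed
    win_bonus=200.0,       # hypothetical bonus for the battle outcome
    negative_scale=0.5,    # hypothetical weight on the negative terms
):
    """Combine damage, kill and outcome terms into one team reward."""
    positive = damage_dealt + kill_bonus * enemies_killed
    negative = negative_scale * (damage_received + kill_bonus * allies_killed)
    reward = positive - negative
    if outcome == "win":
        reward += win_bonus
    elif outcome == "loss":
        reward -= win_bonus
    return reward


# One enemy killed, some damage traded, battle won on this step.
print(shaped_reward(30.0, 10.0, 1, 0, "win"))
```

All agents receive the same team reward, so no individual agent is told how much it contributed. This is the credit-assignment problem that value decomposition methods address.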

### Select Map 
SMAC has 22 predefined combat scenarios ranging from **3 to 27** units. The full list of maps is available [here](https://github.com/oxwhirl/smac/blob/master/docs/smac.md).

In [None]:
from mava.utils.environments import smac_utils

map_name = '3m' #@param {type:"string"}


# Environment factory for the selected map
environment_factory = functools.partial(
    smac_utils.make_environment, map_name=map_name, random_seed=magic_random_seed)

## 4. Train a `VDN` System to Play StarCraft II 

### Run Multi-Agent VDN System.

[VDN](https://arxiv.org/abs/1706.05296)

The Value Decomposition Network architecture (VDN) learns to **decompose
the team value function into agent-wise value functions** from the team reward
signal, by back-propagating the total Q gradient through deep neural networks
representing the individual component value functions. VDN tries to alleviate
the problem of spurious reward signals that emerge in purely independent
learners. Moreover, since the value function learned by each agent depends only
on local observations, it is more easily learned than the centralised joint value function. 


VDN fits into the popular multi-agent RL paradigm of centralised training with decentralised execution, i.e., agents learn in a centralised
fashion at training time, but can be deployed individually.

The main assumption made by VDN is that the joint action-value function for the system can
be **additively** decomposed into value functions across agents,

$$
Q\left(\left(o^{1}, o^{2}, \ldots, o^{n}\right),\left(a^{1}, a^{2}, \ldots, a^{n}\right)\right) \approx \sum_{i=1}^{n} \tilde{Q}_{i}\left(o^{i}, a^{i}\right), 
$$

where $n$ is the number of agents, $o^i$ is the local observation of agent $i$ and $a^i$ is its action.
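A practical consequence of the additive form is that each agent can pick its action greedily from its own value function, and the resulting joint action still maximises the summed value. The NumPy sketch below checks this by brute force on random per-agent values. The array `agent_q` and the helper `joint_q` are illustrative names and are not part of Mava.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

# Stand-in for the per-agent values Q_i(o^i, a^i): one row per agent.
agent_q = rng.normal(size=(n_agents, n_actions))


def joint_q(actions):
    """Additive joint value: the sum over agents of Q_i(o^i, a^i)."""
    return float(agent_q[np.arange(n_agents), list(actions)].sum())


# Decentralised execution: every agent maximises its own value function.
greedy_actions = tuple(int(a) for a in agent_q.argmax(axis=1))

# Centralised check: exhaustive search over all joint actions.
best_joint = max(
    itertools.product(range(n_actions), repeat=n_agents), key=joint_q
)

print(greedy_actions == best_joint)
```

The brute-force search scales as `n_actions ** n_agents`, whereas the decentralised argmax scales linearly in the number of agents. This is what makes decentralised execution cheap.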

#### Specify logging and checkpointing config. 

In [None]:
#@title Logging config. (Run Cell)
# Directory to store checkpoints and log data. 
base_dir = "/root/mava/"

# File name 
mava_id = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")

# Log every [log_every] seconds
log_every = 0.0
logger_factory = functools.partial(
    logger_utils.make_logger,
    directory=base_dir,
    to_terminal=True,
    to_tensorboard=True,
    time_stamp=mava_id,
    time_delta=log_every,
)

# Checkpointer appends "Checkpoints" to checkpoint_dir
checkpoint_dir = f"{base_dir}/{mava_id}"


### Define Q Networks
We will use the default Q networks for the `VDN` system.

More details on the architecture are available [here](https://github.com/instadeepai/Mava/blob/develop/mava/systems/tf/vdn/networks.py) and [here](https://github.com/instadeepai/Mava/blob/develop/mava/systems/tf/madqn/networks.py).

In [None]:
network_factory = lp_utils.partial_kwargs(vdn.make_default_networks,seed=magic_random_seed)

#### Create VDN System.

A description of the system and its variables is available [here](https://github.com/instadeepai/Mava/blob/feature/smac-env-upgrades/mava/systems/tf/vdn/system.py).

In [None]:
system =  vdn.VDN(
        environment_factory=environment_factory,
        network_factory=network_factory,
        logger_factory=logger_factory,
        
        # Number of parallel experience generating processes
        num_executors=1,
        
        # Epsilon decay parameters
        exploration_scheduler_fn=LinearExplorationTimestepScheduler,
        epsilon_min=0.05,
        epsilon_decay_steps=50000,
        
        # Optimizer for Q networks
        optimizer=snt.optimizers.RMSProp(
            learning_rate=0.0005, epsilon=0.00001, decay=0.99
        ),
        batch_size=512,

        checkpoint_subpath=checkpoint_dir,
        executor_variable_update_period=1,
        target_update_period=100,
        max_gradient_norm=10.0,
        eval_loop_fn=MonitorParallelEnvironmentLoop,
        eval_loop_fn_kwargs={"path": checkpoint_dir, "record_every": 50, "figsize":(720,1280)},
        max_executor_steps=100000,
        samples_per_insert=None
    ).build()
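The `epsilon_min` and `epsilon_decay_steps` arguments above control epsilon-greedy exploration. A minimal sketch of a linear schedule is shown below, assuming epsilon starts at 1.0 and decays per timestep. The function `linear_epsilon` and the `epsilon_start` parameter are hypothetical and may differ from how Mava's `LinearExplorationTimestepScheduler` is implemented.

```python
def linear_epsilon(
    step,
    epsilon_start=1.0,           # assumed starting value
    epsilon_min=0.05,            # same value as in the cell above
    epsilon_decay_steps=50_000,  # same value as in the cell above
):
    """Linearly anneal epsilon from epsilon_start down to epsilon_min."""
    fraction = min(step / epsilon_decay_steps, 1.0)
    return epsilon_start + fraction * (epsilon_min - epsilon_start)


for step in (0, 25_000, 50_000, 100_000):
    print(step, round(linear_epsilon(step), 3))
```

Under these assumptions the agents act almost entirely at random early on and settle at 5% random actions after 50,000 steps, which is half of `max_executor_steps`.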

In [None]:
%%capture
#@title Kill old runs. (Run Cell)

!pkill -9 launchpad
!pkill -9 3rdpary
!pkill -9 Main_Thread
!pkill -9 process_entry

In [None]:
# Ensure only trainer runs on gpu, while other processes run on cpu. 
local_resources = lp_utils.to_device(program_nodes=system.groups.keys(),nodes_on_gpu=["trainer"])

lp.launch(
    system,
    lp.LaunchType.LOCAL_MULTI_PROCESSING,
    terminal="output_to_files",
    local_resources=local_resources,
)

### Logs and Outputs

#### View outputs from the evaluator process.
*You might need to wait a few moments after launching the run.*
The `CUDA_ERROR_NO_DEVICE` error is expected since the GPU is only used by the trainer. 

In [None]:
!cat /tmp/launchpad_out/evaluator/0

#### View Stored Data 
*You might need to wait a few moments after launching the run.*

In [None]:
! ls $base_dir/$mava_id

### Tensorboard
*You might need to wait a few moments after launching the run.*

While waiting you can look at a previous [run](https://tensorboard.dev/experiment/osGaqNCEQl2tA6wNGmBlEw/).

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
%tensorboard --logdir $base_dir/$mava_id/tensorboard/evaluator

### View Agent Recording
Once the agents are trained, we should see behaviours emerge such as *focus fire*, *learned formations based on armour types* (not in 3m) and making enemy *units give chase* while keeping enough distance that little or no damage is taken.

#### Behaviour Examples

In [None]:
#@title Poorly trained agents - Agents running away. (RUN ME)
import IPython
IPython.display.HTML(url='https://raw.githubusercontent.com/instadeepai/Mava/feature/smac-env-upgrades/docs/images/runaway.html')

In [None]:
#@title Trained Agents - Emergence of focus fire. (RUN ME)
import IPython
IPython.display.HTML(url='https://raw.githubusercontent.com/instadeepai/Mava/feature/smac-env-upgrades/docs/images/focus_fire.html')

#### Our Agents

Check if any agent recordings are available. 

In [None]:
! ls $base_dir/$mava_id/recordings

#### View agent recordings.

In [None]:
import glob
import os 
import IPython

# Recordings
list_of_files = glob.glob(f"{base_dir}/{mava_id}/recordings/*.html")

if not list_of_files:
    print("No recordings are available yet. Please wait, or run the cells under 'Run Multi-Agent VDN System.' if you haven't already done so.")
else:
    latest_file = max(list_of_files, key=os.path.getctime)
    first_file = min(list_of_files, key=os.path.getctime)
    print("Run the next cell to visualize your agents!")

In [None]:
#@title Our agents early in training. (RUN ME)
IPython.display.HTML(filename=first_file)

In [None]:
#@title Our most recent agent. (RUN ME)
IPython.display.HTML(filename=latest_file)

## 5. What's next?
- Scaling.
- Run a MARL system with larger Q networks or different hyperparameters.


### Scaling
Mava allows for simple scaling of MARL systems. 



In [None]:
%%capture
#@title Kill old runs. (Run Cell)

!pkill -9 launchpad
!pkill -9 3rdpary
!pkill -9 Main_Thread
!pkill -9 process_entry

In [None]:
#@title Logging config. (Run Cell)
# Directory to store checkpoints and log data. 
base_dir = "/root/mava/"

# File name 
mava_id = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")

# Log every [log_every] seconds
log_every = 0.0
logger_factory = functools.partial(
    logger_utils.make_logger,
    directory=base_dir,
    to_terminal=True,
    to_tensorboard=True,
    time_stamp=mava_id,
    time_delta=log_every,
)

# Checkpointer appends "Checkpoints" to checkpoint_dir
checkpoint_dir = f"{base_dir}/{mava_id}"

Simply increase **num_executors**.

In [None]:
network_factory = lp_utils.partial_kwargs(vdn.make_default_networks,seed=magic_random_seed)

system =  vdn.VDN(
        environment_factory=environment_factory,
        network_factory=network_factory,
        logger_factory=logger_factory,
        
        # Number of parallel experience generating processes
        num_executors=2,
        
        # Epsilon decay parameters
        exploration_scheduler_fn=LinearExplorationTimestepScheduler,
        epsilon_min=0.05,
        epsilon_decay_steps=50000,
        
        # Optimizer for Q networks
        optimizer=snt.optimizers.RMSProp(
            learning_rate=0.0005, epsilon=0.00001, decay=0.99
        ),
        batch_size=512,

        checkpoint_subpath=checkpoint_dir,
        executor_variable_update_period=1,
        target_update_period=100,
        max_gradient_norm=10.0,
        eval_loop_fn=MonitorParallelEnvironmentLoop,
        eval_loop_fn_kwargs={"path": checkpoint_dir, "record_every": 50, "figsize":(720,1280)},
        max_executor_steps=100000,
        samples_per_insert=None
    ).build()


# Ensure only trainer runs on gpu, while other processes run on cpu. 
local_resources = lp_utils.to_device(program_nodes=system.groups.keys(),nodes_on_gpu=["trainer"])

lp.launch(
    system,
    lp.LaunchType.LOCAL_MULTI_PROCESSING,
    terminal="output_to_files",
    local_resources=local_resources,
)

##### View logs
*You might need to wait a few moments after launching the run.*

In [None]:
!cat /tmp/launchpad_out/evaluator/0

#### Tensorboard
*You might need to wait a few moments after launching the run.*

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
%tensorboard --logdir $base_dir/$mava_id/tensorboard/evaluator

### Run a MARL System with Larger Q Networks or Different Hyperparameters



In [None]:
%%capture
#@title Kill old runs. (Run Cell)

!pkill -9 launchpad
!pkill -9 3rdpary
!pkill -9 Main_Thread
!pkill -9 process_entry

In [None]:
#@title Logging config. (Run Cell)
# Directory to store checkpoints and log data. 
base_dir = "/root/mava/"

# File name 
mava_id = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")

# Log every [log_every] seconds
log_every = 0.0
logger_factory = functools.partial(
    logger_utils.make_logger,
    directory=base_dir,
    to_terminal=True,
    to_tensorboard=True,
    time_stamp=mava_id,
    time_delta=log_every,
)

# Checkpointer appends "Checkpoints" to checkpoint_dir
checkpoint_dir = f"{base_dir}/{mava_id}"

Change the hyperparameters: a larger Q network, a different optimizer and learning-rate decay.
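The learning-rate schedule in the next cell is piecewise constant. To make the boundary behaviour concrete, here is a plain-Python sketch that mirrors the documented semantics of `tf.keras.optimizers.schedules.PiecewiseConstantDecay`: a step equal to a boundary still uses the earlier value. The helper `piecewise_constant` is a hypothetical stand-in written for illustration, not the TensorFlow implementation.

```python
import bisect


def piecewise_constant(step, boundaries, values):
    """Return values[i], where i is the number of boundaries strictly below step."""
    # bisect_left keeps a step that equals a boundary in the earlier segment.
    return values[bisect.bisect_left(boundaries, step)]


boundaries = [4000, 10000]
lr = 0.01
values = [lr, lr * 0.1, lr * 0.01]

for step in (0, 4000, 4001, 10000, 10001):
    segment = bisect.bisect_left(boundaries, step)
    print(f"step {step:>5}: segment {segment}, lr {piecewise_constant(step, boundaries, values):.6f}")
```

Steps 0 to 4000 fall in the first segment (4001 steps), steps 4001 to 10000 in the second (6000 steps), and every later step in the third.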

In [None]:
network_factory = lp_utils.partial_kwargs(vdn.make_default_networks,seed=magic_random_seed,policy_networks_layer_sizes=(512, 512, 256))

# Use lr for the first 4001 steps, lr * 0.1 for the next 6000 steps,
# and lr * 0.01 for any additional steps.

boundaries = [4000, 10000]
lr = 0.01
values = [lr, lr * 0.1, lr * 0.01]
learning_rate_scheduler_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries, values
)

system =  vdn.VDN(
        environment_factory=environment_factory,
        network_factory=network_factory,
        logger_factory=logger_factory,
        
        # Number of parallel experience generating processes
        num_executors=1,
        
        # Epsilon decay parameters
        exploration_scheduler_fn=LinearExplorationTimestepScheduler,
        epsilon_min=0.05,
        epsilon_decay_steps=50000,
        
        # Optimizer for Q networks
        optimizer=snt.optimizers.SGD(learning_rate=lr),
        learning_rate_scheduler_fn=learning_rate_scheduler_fn,
        batch_size=512,

        checkpoint_subpath=checkpoint_dir,
        executor_variable_update_period=1,
        target_update_period=100,
        max_gradient_norm=10.0,
        eval_loop_fn=MonitorParallelEnvironmentLoop,
        eval_loop_fn_kwargs={"path": checkpoint_dir, "record_every": 50, "figsize":(720,1280)},
        max_executor_steps=100000,
        samples_per_insert=None
    ).build()


# Ensure only trainer runs on gpu, while other processes run on cpu. 
local_resources = lp_utils.to_device(program_nodes=system.groups.keys(),nodes_on_gpu=["trainer"])

lp.launch(
    system,
    lp.LaunchType.LOCAL_MULTI_PROCESSING,
    terminal="output_to_files",
    local_resources=local_resources,
)

##### View logs
*You might need to wait a few moments after launching the run.*

In [None]:
!cat /tmp/launchpad_out/evaluator/0

#### Tensorboard
*You might need to wait a few moments after launching the run.*

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
%tensorboard --logdir $base_dir/$mava_id/tensorboard/evaluator

## For more examples using different systems, environments and architectures, visit our [GitHub page](https://github.com/instadeepai/Mava/tree/develop/examples).