# **Final Project: Exploring Reward Sharing Strategies for Effective Cooperative Multi-Agent Task Completion**

### **Due Date**: 05/10/2025 at 11:59 PM

#### **Project Proposal (Graded copy)**: https://drive.google.com/file/d/1xSh_ITemfAfa_ivwwx9Q2IUzGsIaF9dD/view?usp=sharing
#### **Final Report**: https://docs.google.com/document/d/1Lul-CaBZPpzwnsBR4dFv6vvO8a7qrXCR_R_F0lgOIqc/edit?usp=sharing

#### **Video Presentation**:

In [1]:
# WHAT TO RECORD:
# See Final report. It's there.

# **Introduction**

Welcome to our Final Project for CS 4756/5756. In this project, we train multiple agents in the Simple Spread environment using MAPPO to investigate how different reward structures (individual, shared, and partially shared) affect learning dynamics and group behavior.

We will use the [**Simple Spread**](https://pettingzoo.farama.org/environments/mpe/simple_spread/) environment from the [PettingZoo](https://pettingzoo.farama.org/content/basic_usage/) library for this project.


**Note**: Our code adapts [the official MAPPO implementation](https://github.com/marlbenchmark/on-policy) to support different reward schemes.
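
To make the three reward structures concrete, here is a minimal sketch of one way they could be computed from per-agent rewards. The function name `share_rewards`, the mixing weight `alpha`, and the use of the team mean are illustrative assumptions; the actual definitions live in the modified on-policy code and may differ. The `original` scheme (the environment's unmodified reward) is not covered here.

```python
from typing import List


def share_rewards(rewards: List[float], reward_type: str, alpha: float = 0.5) -> List[float]:
    """Return per-agent training rewards under a hypothetical sharing scheme.

    rewards:     one raw reward per agent for a single environment step
    reward_type: "individual", "shared", or "partially_shared"
    alpha:       weight on an agent's own reward (assumed, only used when
                 reward_type is "partially_shared")
    """
    team_mean = sum(rewards) / len(rewards)
    if reward_type == "individual":
        # Each agent keeps only its own reward.
        return list(rewards)
    if reward_type == "shared":
        # Every agent receives the same team-level signal.
        return [team_mean] * len(rewards)
    if reward_type == "partially_shared":
        # Blend the agent's own reward with the team-level signal.
        return [alpha * r + (1.0 - alpha) * team_mean for r in rewards]
    raise ValueError(f"unknown reward_type: {reward_type}")


if __name__ == "__main__":
    raw = [-1.0, -2.0, -3.0]
    for scheme in ("individual", "shared", "partially_shared"):
        print(scheme, share_rewards(raw, scheme))
```

With `alpha = 1.0` the partially shared scheme reduces to individual rewards, and with `alpha = 0.0` it reduces to fully shared rewards, so in this sketch the three settings sit on one continuum.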

# **Setup**

## Mount Google Drive and Change Working Directory
Mount Google Drive in Colab and change the working directory to the personal drive.

In [2]:
import os

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change to personal drive
os.chdir('/content/drive/MyDrive')

Mounted at /content/drive


## Install Necessary Libraries

In [3]:
import sys
USING_COLAB = 'google.colab' in sys.modules

if USING_COLAB:
    !apt-get -qq update
    !apt-get -qq install -y libosmesa6-dev libgl1-mesa-glx libglfw3 libgl1-mesa-dev libglew-dev patchelf
    !apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
else:
    !pip install torch torchvision torchaudio
    !pip install numpy
    !pip install tqdm
    !pip install opencv-python

!pip install matplotlib
!pip install pettingzoo
!pip install "ray[rllib]" torch gymnasium
!pip install supersuit

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Selecting previously unselected package libglx-dev:amd64.
(Reading database ... 126102 files and directories currently installed.)
Preparing to unpack .../00-libglx-dev_1.4.0-1_amd64.deb ...
Unpacking libglx-dev:amd64 (1.4.0-1) ...
Selecting previously unselected package libgl-dev:amd64.
Preparing to unpack .../01-libgl-dev_1.4.0-1_amd64.deb ...
Unpacking libgl-dev:amd64 (1.4.0-1) ...
Selecting previously unselected package libegl-dev:amd64.
Preparing to unpack .../02-libegl-dev_1.4.0-1_amd64.deb ...
Unpacking libegl-dev:amd64 (1.4.0-1) ...
Selecting previously unselected package libgles1:amd64.
Preparing to unpack .../03-libgles1_1.4.0-1_amd64.deb ...
Unpacking libgles1:amd64 (1.4.0-1) ...
Selecting previously unselected package libgles-dev:amd64.
Preparing to unpack .../04-libgles-dev_1.4.0-1_amd64

## Environment Variables

In [4]:
os.environ["JUPYTER_PLATFORM_DIRS"] = "1"
os.environ["PYTHONWARNINGS"] = "ignore::DeprecationWarning"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

## Clone on-policy with rewards modifications
Clone the repository into the personal drive, edit the requirements, and install it.

In [5]:
# # Update the git repository for on-policy
# %cd on-policy
# !git pull
# %cd ..

In [6]:
# If not cloned:
if not os.path.isdir('on-policy'):
    try:
        print(f"Getting Fresh copy")
        # Clone the repo
        !git clone https://github.com/juji-lau/on-policy.git
        sys.path.append('/content/drive/MyDrive/on-policy')

        # Edit requirements.txt:
        # !sed -i '/^absl-py==0\.9\.0$/d' on-policy/requirements.txt #(1, 1)
        # !sed -i '/^atari-py==0\.2\.6$/d' on-policy/requirements.txt #(2, 7)
        # # !sed -i 'contextvars' on-policy/requirements.txt
        # # !sed -i 'enum34' on-policy/requirements.txt

        # # Remove all version requirements (let pip figure it out)
        # !sed -i '/^\s*#/! s/[<>=!~].*$//' on-policy/requirements.txt

        # Install the working requirements.txt
        !pip install -r on-policy/requirements.txt
        # Install on-policy
        !pip install -e ./on-policy

    except Exception as e:
        print(f"Failed to clone, modify, and/or install the requirements.")
        print(f"Removing any attempts at cloning.")
        !rm -r on-policy
        print(f"Got error message: {e}")
else:
    print(f"Using existing copy")
    # Just pip install the (modified) requirements.txt (they disappear every time)
    sys.path.append(os.path.join(os.getcwd(), 'on-policy'))
    # Install the working requirements.txt
    !pip install -r on-policy/requirements.txt
    # Install on-policy
    !pip install -e ./on-policy

Using existing copy
Collecting aioredis==1.3.1 (from -r on-policy/requirements.txt (line 1))
  Downloading aioredis-1.3.1-py3-none-any.whl.metadata (22 kB)
Collecting astor==0.8.0 (from -r on-policy/requirements.txt (line 2))
  Downloading astor-0.8.0-py2.py3-none-any.whl.metadata (4.2 kB)
Collecting async-timeout==3.0.1 (from -r on-policy/requirements.txt (line 3))
  Downloading async_timeout-3.0.1-py3-none-any.whl.metadata (4.0 kB)
Collecting atari-py==0.2.6 (from -r on-policy/requirements.txt (line 4))
  Downloading atari-py-0.2.6.tar.gz (790 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 790.2/790.2 kB 44.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting atomicwrites==1.2.1 (from -r on-policy/requirements.txt (line 5))
  Downloading atomicwrites-1.2.1-py2.py3-none-any.whl.metadata (5.5 kB)
Collecting blessings==1.7 (from -r on-policy/requirements.txt (line 6))
  Downloading blessings-1.7-py3-none-any.wh

# **Training and Visualizing MAPPO Environments with Different Reward Schemes**

## Imports

In [7]:
import numpy as np
import torch
from pettingzoo.mpe import simple_spread_v2
# change to the following later: from pettingzoo.mpe2 import simple_spread_v2
import supersuit as ss

  from pettingzoo.mpe import simple_spread_v2


## Hyperparameters

In [8]:
ALGORITHM = "mappo"  #already mappo by default
ENVIRONMENT = "MPE"
USE_WANDB = False
SCENARIO = "simple_spread"
REWARD_TYPES = ["individual", "partially_shared", "shared", "original"]

# Run specifiers
NUM_ENV_STEPS = 100000
EPISODE_LENGTH = 25
NUM_TRAINING_THREADS = 1
NUM_ROLLOUT_THREADS = 32
N_EVAL_ROLLOUT_THREADS = 1
SEED = 42

# Simulation specifiers
NUM_AGENTS = 3
NUM_LANDMARKS = 3

# Eval parameters
USE_EVAL = True

# Render parameters
SAVE_GIFS = True
USE_RENDER = True
RENDER_EPISODES = 5
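
As a rough sanity check on the training budget, the sketch below counts the policy updates implied by the values above. It assumes that each update consumes one episode of `EPISODE_LENGTH` steps from every rollout thread, which is how the upstream on-policy runner behaves to our knowledge; `count_updates` is a hypothetical helper, not part of the repository.

```python
def count_updates(num_env_steps: int, episode_length: int, n_rollout_threads: int) -> int:
    """Number of policy updates, assuming one episode per thread per update."""
    steps_per_update = episode_length * n_rollout_threads
    return num_env_steps // steps_per_update


if __name__ == "__main__":
    # Values copied from the hyperparameter cell.
    updates = count_updates(num_env_steps=100000, episode_length=25, n_rollout_threads=32)
    print(f"steps per update: {25 * 32}, total updates: {updates}")
```

Under that assumption, 100000 environment steps give 800 steps per update and 125 updates in total, which is a small budget for comparing reward schemes.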


## Train MAPPO for Reward Types

In [9]:
# Train MAPPO using on-policy for a given reward type
def train_mappo(reward_type):
    assert (os.getcwd() == '/content/drive/MyDrive')

    run_dir = f"./runs/{SCENARIO}_{reward_type}"
    os.makedirs(run_dir, exist_ok=True)
    print(f"Train MAPPO: {reward_type}, Saving to: {run_dir}")

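    # Omit --share_policy so that share_policy stays True.
    # Note: the Namespace printed during training shows use_valuenorm=False and
    # use_feature_normalization=False, so passing these two flags appears to disable them.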

    !python on-policy/onpolicy/scripts/train/train_mpe.py \
        --algorithm_name $ALGORITHM \
        --experiment_name $reward_type \
        --seed $SEED \
        --num_env_steps $NUM_ENV_STEPS \
        --n_training_threads $NUM_TRAINING_THREADS \
        --n_rollout_threads $NUM_ROLLOUT_THREADS \
        --n_eval_rollout_threads $N_EVAL_ROLLOUT_THREADS \
        --use_wandb $USE_WANDB \
        --env_name $ENVIRONMENT \
        --episode_length $EPISODE_LENGTH \
        --use_eval $USE_EVAL \
        --save_gifs $SAVE_GIFS \
        --use_render $USE_RENDER \
        --render_episodes $RENDER_EPISODES \
        --scenario_name $SCENARIO \
        --num_landmarks $NUM_LANDMARKS \
        --num_agents $NUM_AGENTS \
        --reward_type $reward_type \
        --ppo_epoch 30 \
        --clip_param 0.1 \
        --entropy_coef 0.005 \
        --lr 5e-4 \
        --critic_lr 5e-4 \
        --use_valuenorm \
        --use_feature_normalization \
        --hidden_size 128 \
        --layer_N 2


In [11]:
# Train and render for each reward type
for reward in REWARD_TYPES:
    print(f"\n--- Training with {reward} rewards ---\n", flush=True)
    train_mappo(reward)


--- Training with individual rewards ---

Train MAPPO: individual, Saving to: ./runs/simple_spread_individual
  import imp
u are choosing to use mappo, we set use_recurrent_policy & use_naive_recurrent_policy to be False
choose to use gpu...
Saved at /content/drive/MyDrive/runs/simple_spread_individual/run12.
Namespace(algorithm_name='mappo', experiment_name='individual', seed=42, cuda=True, cuda_deterministic=True, n_training_threads=1, n_rollout_threads=32, n_eval_rollout_threads=1, n_render_rollout_threads=1, num_env_steps=100000, user_name='marl', use_wandb=False, env_name='MPE', use_obs_instead_of_state=False, episode_length=25, share_policy=True, use_centralized_V=True, stacked_frames=1, use_stacked_frames=False, hidden_size=128, layer_N=2, use_ReLU=True, use_popart=False, use_valuenorm=False, use_feature_normalization=False, use_orthogonal=True, gain=0.01, use_naive_recurrent_policy=False, use_recurrent_policy=False, recurrent_N=1, data_chunk_length=10, lr=0.0005, critic_lr=0.0

## Logging the Metrics
Copy the console output of `train_mappo` for each reward type into `on-policy/logging/{reward_type}.txt`, then run `log_metrics.py` on each file.

In [24]:
for reward in REWARD_TYPES:
    text_file = f"on-policy/logging/{reward}.txt"
    !python on-policy/logging/log_metrics.py --log_file $text_file --reward_type $reward
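
For readers without access to `log_metrics.py`, here is a minimal sketch of the kind of parsing such a script might do. It assumes the training log contains lines of the form `average episode rewards is <number>`; that format, the helper `extract_rewards`, and the sample text are assumptions for illustration only, and the real script in the fork may work differently.

```python
import re
from typing import List

# Assumed log format: "average episode rewards is -123.45"
REWARD_LINE = re.compile(
    r"average episode rewards is\s+(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def extract_rewards(log_text: str) -> List[float]:
    """Collect every reward value that matches the assumed log line format."""
    return [float(match.group(1)) for match in REWARD_LINE.finditer(log_text)]


if __name__ == "__main__":
    # Synthetic sample text, not real training output.
    sample = (
        "some unrelated line\n"
        "average episode rewards is -120.5\n"
        "another unrelated line\n"
        "average episode rewards is -98.25\n"
    )
    print(extract_rewards(sample))
```

The resulting list can be passed straight to a plotting call such as `matplotlib.pyplot.plot` to draw one learning curve per reward type.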

## Render MAPPO for all Reward Types

In [25]:
# Render MAPPO using on-policy for a given reward type
def render_mappo_5(reward_type, run_number):
    assert (os.getcwd() == '/content/drive/MyDrive')

    model_dir = f"runs/{SCENARIO}_{reward_type}"

    # omitting --share_policy so that share_policy is set to True
    !python on-policy/onpolicy/scripts/render/render_mpe.py \
        --algorithm_name $ALGORITHM \
        --experiment_name $reward_type \
        --seed $SEED \
        --num_env_steps $NUM_ENV_STEPS \
        --n_training_threads $NUM_TRAINING_THREADS \
        --n_rollout_threads 1 \
        --n_eval_rollout_threads $N_EVAL_ROLLOUT_THREADS \
        --use_wandb $USE_WANDB \
        --env_name $ENVIRONMENT \
        --episode_length $EPISODE_LENGTH \
        --use_eval $USE_EVAL \
        --save_gifs $SAVE_GIFS \
        --use_render $USE_RENDER \
        --render_episodes $RENDER_EPISODES \
        --model_dir $model_dir \
        --scenario_name $SCENARIO \
        --num_landmarks $NUM_LANDMARKS \
        --num_agents $NUM_AGENTS \
        --reward_type $reward_type \
        --run_number $run_number \
        --use_valuenorm \
        --use_feature_normalization \
        --hidden_size 128 \
        --layer_N 2

In [27]:
!apt-get install -y xvfb
!pip install pyvirtualdisplay
!pip install pyglet==1.4.10 # need to downgrade pyglet to 1.4.10 otherwise this won't work

import traceback
from pyvirtualdisplay import Display

# Start a virtual display so rendering works on a headless machine
display = Display(visible=False, size=(1400, 900))
display.start()

run_num = -1
try:
    for reward in REWARD_TYPES:
        render_mappo_5(reward, run_num)
except Exception:
    traceback.print_exc()
finally:
    # Always stop the virtual display, even if rendering fails
    display.stop()


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
xvfb is already the newest version (2:21.1.4-2ubuntu1.7~22.04.14).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
  import imp
u are choosing to use mappo, we set use_recurrent_policy & use_naive_recurrent_policy to be False
choose to use gpu...
DEBUGGING ALL ARGS: runs/simple_spread_original/run1/models 
 (render_mpe.py)
Run dir: /content/drive/MyDrive/runs/simple_spread_original/run1
/content/drive/MyDrive/runs/simple_spread_original/run1
obs_space:  [Box(-inf, inf, (18,), float32), Box(-inf, inf, (18,), float32), Box(-inf, inf, (18,), float32)]
share_obs_space:  [Box(-inf, inf, (54,), float32), Box(-inf, inf, (54,), float32), Box(-inf, inf, (54,), float32)]
act_space:  [Discrete(5), Discrete(5), Discrete(5)]
See here for more information: https://www.gymlibrary.ml/content/api/
  deprecation(
  arr = np.fromstring(image_data.get_data(), dtype=np.uint8, sep='')
See her
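
The observation shapes printed in the render output, `(18,)` per agent and `(54,)` for the shared observation, can be reproduced with a short calculation. The sketch below assumes the Simple Spread observation layout described in the PettingZoo documentation (own velocity, own position, relative landmark positions, relative positions of other agents, and communication from other agents) and assumes that the shared observation is the concatenation of all agents' observations. `simple_spread_obs_dim` is a hypothetical helper.

```python
def simple_spread_obs_dim(num_agents: int, num_landmarks: int,
                          dim_p: int = 2, dim_c: int = 2) -> int:
    """Per-agent observation size under the assumed Simple Spread layout.

    dim_p: dimensionality of positions and velocities (2D world)
    dim_c: dimensionality of each communication vector
    """
    self_vel = dim_p
    self_pos = dim_p
    landmark_rel_pos = num_landmarks * dim_p
    other_agent_rel_pos = (num_agents - 1) * dim_p
    communication = (num_agents - 1) * dim_c
    return self_vel + self_pos + landmark_rel_pos + other_agent_rel_pos + communication


if __name__ == "__main__":
    num_agents, num_landmarks = 3, 3
    obs_dim = simple_spread_obs_dim(num_agents, num_landmarks)
    # Assumed: the centralized critic sees all agents' observations concatenated.
    share_obs_dim = num_agents * obs_dim
    print(obs_dim, share_obs_dim)
```

For 3 agents and 3 landmarks this gives 2 + 2 + 6 + 4 + 4 = 18 per agent and 3 × 18 = 54 for the shared observation, matching the printed spaces.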

## Git Commands

In [35]:
def set_remote():
    import os
    import getpass

    # 🔒 Ask for token securely
    token = getpass.getpass('Enter your GitHub PAT: ')

    # ✅ Set your git identity
    !git config --global user.email ""
    !git config --global user.name ""

    # ✅ Set the new remote using token auth
    remote_url = f"https://{token}@github.com/juji-lau/on-policy.git"
    !git remote set-url origin {remote_url}

# set_remote()
# # ✅ Add, commit, and push
# !git add .
# !git commit -m "Final changes to on-policy fork."
# !git push origin main

[main c0d345e] Final changes to on-policy fork.
 33 files changed, 727 insertions(+), 65 deletions(-)
 create mode 100644 logging/individual.png
 create mode 100644 logging/individual.txt
 create mode 100644 logging/log_metrics.py
 create mode 100644 logging/original.png
 create mode 100644 logging/original.txt
 create mode 100644 logging/partially_shared.png
 create mode 100644 logging/partially_shared.txt
 create mode 100644 logging/shared.png
 create mode 100644 logging/shared.txt
 mode change 100755 => 100644 onpolicy/scripts/train_mpe_scripts/train_mpe_comm.sh
 mode change 100755 => 100644 onpolicy/scripts/train_mpe_scripts/train_mpe_reference.sh
 mode change 100755 => 100644 onpolicy/scripts/train_mpe_scripts/train_mpe_spread.sh
 mode change 100755 => 100644 onpolicy/scripts/train_smacv2_scripts/train_protoss_10v10.sh
 mode change 100755 => 100644 onpolicy/scripts/train_smacv2_scripts/train_protoss_10v11.sh
 mode change 100755 => 100644 onpolicy/scripts/train_smacv2_scripts/train



---

