# Unity ml-agents

Unity has a toolkit called `ml-agents` which allows you to use a Unity game as a training environment: <https://github.com/Unity-Technologies/ml-agents>

Documentation: <https://unity-technologies.github.io/ml-agents/>

My task is to add an environment wrapper class to torchrl that calls into `ml-agents` environments.

## How to use ml-agents

Installation instructions: <https://unity-technologies.github.io/ml-agents/Installation/>

Install the python interface: <https://unity-technologies.github.io/ml-agents/ML-Agents-Envs-README/>

Getting started guide: <https://unity-technologies.github.io/ml-agents/Getting-Started/>

`ml-agents` currently has several different interfaces to its environments, and we'll need to use one of them:
* gym: <https://unity-technologies.github.io/ml-agents/Python-Gym-API/>
  - This doesn't support multi-agent environments, so it's probably not the best option
* pettingzoo: <https://unity-technologies.github.io/ml-agents/Python-PettingZoo-API/>
* native API: <https://unity-technologies.github.io/ml-agents/Python-LLAPI/>
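
As a rough sketch of what reading specs and steps through the native API can look like, the helper below collects a few fields per behavior. It assumes `env` behaves like an `mlagents_envs` `BaseEnv` that has already been reset, and it only touches `behavior_specs`, `get_steps`, and the spec and step attributes shown. The name `summarize_env` is made up for this note.

```python
def summarize_env(env):
    """Return a plain-dict summary of every behavior in an ml-agents env.

    `env` is assumed to behave like `mlagents_envs.base_env.BaseEnv`
    after `env.reset()` has been called.
    """
    summary = {}
    for name, spec in env.behavior_specs.items():
        # get_steps returns a (DecisionSteps, TerminalSteps) pair.
        decision_steps, terminal_steps = env.get_steps(name)
        summary[name] = {
            "observation_shapes": [tuple(o.shape) for o in spec.observation_specs],
            "continuous_size": spec.action_spec.continuous_size,
            "discrete_branches": tuple(spec.action_spec.discrete_branches),
            "decision_agent_ids": [int(i) for i in decision_steps.agent_id],
            "terminal_agent_ids": [int(i) for i in terminal_steps.agent_id],
        }
    return summary


# Usage against the Unity editor. This needs the `mlagents_envs` package and
# a running editor, so it is left commented out here:
#
#   from mlagents_envs.environment import UnityEnvironment
#   env = UnityEnvironment(file_name=None)  # None = connect to the editor
#   env.reset()
#   print(summarize_env(env))
#   env.close()
```

Keeping the helper free of direct `mlagents_envs` imports means it can be exercised with a stand-in object when no Unity editor is available.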


The `ml-agents` environments are defined here: <https://github.com/Unity-Technologies/ml-agents/tree/develop/ml-agents-envs/mlagents_envs>

## Specs

Every agent in `ml-agents` has an associated behavior spec. Several agents can share the same behavior spec, or each agent can have its own.
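
To make the agent-to-behavior relationship concrete, here is a small sketch that inverts a behavior-to-agents mapping into an agent-to-behavior lookup. The agent IDs are copied from the decision steps in the WallJump dump further down. The helper name `behavior_of_agent` is made up for this note. Agent IDs do not repeat across behaviors in the dumps here, but treating that as always true is an assumption, so the helper raises if it is violated.

```python
# Decision-step agent IDs copied from the WallJump dump in this document.
decision_agents = {
    "SmallWallJump?team=0": [0, 1, 4, 5, 7, 11, 12, 13, 15, 17, 20, 22],
    "BigWallJump?team=0": [2, 3, 6, 8, 9, 10, 14, 16, 18, 19, 21, 23],
}


def behavior_of_agent(agents_by_behavior):
    """Invert {behavior_name: [agent_id, ...]} into {agent_id: behavior_name}."""
    lookup = {}
    for behavior_name, agent_ids in agents_by_behavior.items():
        for agent_id in agent_ids:
            # Assumption: an agent ID belongs to exactly one behavior.
            if agent_id in lookup:
                raise ValueError(f"agent {agent_id} appears under two behaviors")
            lookup[agent_id] = behavior_name
    return lookup


agent_behavior = behavior_of_agent(decision_agents)
print(len(agent_behavior))  # 24
print(agent_behavior[2])    # BigWallJump?team=0
```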



### 3DBall

```
behavior_specs:
  behavior_name: 3DBall?team=0
    observation_spec[0]:
      name: VectorSensor_size8
      shape: (8,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 2, Discrete: ()
      is_continuous: True
      continuous_size: 2
      is_discrete: False
      discrete_branches: ()
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: []
      continuous: [[0.47754392 0.9049918 ]]
current steps:
  behavior_name: 3DBall?team=0
    decision step:
      agent_id: [ 0  1  2  3  4  5  6  7  8  9 10 11]
      group_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
      observation_shapes: [(12, 8)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask: None
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 8)]
      reward: []
      group_reward: []
```

### StrikersVsGoalie

```
behavior_specs:
  behavior_name: Striker?team=0
    observation_spec[0]:
      name: StackingSensor_size3_BlueRayPerceptionSensor
      shape: (231,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[1]:
      name: StackingSensor_size3_BlueRayPerceptionSensorReverse
      shape: (63,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 0, Discrete: (3, 3, 3)
      is_continuous: False
      continuous_size: 0
      is_discrete: True
      discrete_branches: (3, 3, 3)
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: [[2 0 0]]
      continuous: []
  behavior_name: Goalie?team=1
    observation_spec[0]:
      name: StackingSensor_size3_PurpleGoalieRayPerceptionSensor
      shape: (738,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 0, Discrete: (3, 3, 3)
      is_continuous: False
      continuous_size: 0
      is_discrete: True
      discrete_branches: (3, 3, 3)
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: [[2 1 2]]
      continuous: []
current steps:
  behavior_name: Striker?team=0
    decision step:
      agent_id: [ 0  2  3  5  6  8  9 11 12 14 15 17 18 20 21 23]
      group_id: [1, 1, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11, 13, 13, 15, 15]
      observation_shapes: [(16, 231), (16, 63)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask_shapes: [(16, 3), (16, 3), (16, 3)]
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 231), (0, 63)]
      reward: []
      group_reward: []
  behavior_name: Goalie?team=1
    decision step:
      agent_id: [ 1  4  7 10 13 16 19 22]
      group_id: [2, 4, 6, 8, 10, 12, 14, 16]
      observation_shapes: [(8, 738)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask_shapes: [(8, 3), (8, 3), (8, 3)]
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 738)]
      reward: []
      group_reward: []
```

### SoccerTwos

```
behavior_specs:
  behavior_name: SoccerTwos?team=1
    observation_spec[0]:
      name: StackingSensor_size3_PurpleRayPerceptionSensor
      shape: (264,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[1]:
      name: StackingSensor_size3_PurpleRayPerceptionSensorReverse
      shape: (72,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 0, Discrete: (3, 3, 3)
      is_continuous: False
      continuous_size: 0
      is_discrete: True
      discrete_branches: (3, 3, 3)
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: [[0 1 1]]
      continuous: []
  behavior_name: SoccerTwos?team=0
    observation_spec[0]:
      name: StackingSensor_size3_BlueRayPerceptionSensor
      shape: (264,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[1]:
      name: StackingSensor_size3_BlueRayPerceptionSensorReverse
      shape: (72,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 0, Discrete: (3, 3, 3)
      is_continuous: False
      continuous_size: 0
      is_discrete: True
      discrete_branches: (3, 3, 3)
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: [[1 2 1]]
      continuous: []
current steps:
  behavior_name: SoccerTwos?team=1
    decision step:
      agent_id: [ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30]
      group_id: [2, 2, 4, 4, 6, 6, 8, 8, 10, 10, 12, 12, 14, 14, 16, 16]
      observation_shapes: [(16, 264), (16, 72)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask_shapes: [(16, 3), (16, 3), (16, 3)]
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 264), (0, 72)]
      reward: []
      group_reward: []
  behavior_name: SoccerTwos?team=0
    decision step:
      agent_id: [ 1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31]
      group_id: [1, 1, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11, 13, 13, 15, 15]
      observation_shapes: [(16, 264), (16, 72)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask_shapes: [(16, 3), (16, 3), (16, 3)]
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 264), (0, 72)]
      reward: []
      group_reward: []
```

### Walker

```
behavior_specs:
  behavior_name: Walker?team=0
    observation_spec[0]:
      name: VectorSensor_size243
      shape: (243,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 39, Discrete: ()
      is_continuous: True
      continuous_size: 39
      is_discrete: False
      discrete_branches: ()
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: []
      continuous: [[ 0.8072419   0.7803087   0.01151393  0.20298165  0.9033102  -0.65760577
   0.5174855  -0.588782   -0.03862659 -0.0283867  -0.5692337   0.8303417
  -0.06972238 -0.49072945  0.90130824 -0.8372041  -0.7015494  -0.7628687
  -0.72954524  0.3241689   0.32257587 -0.60499096 -0.9076488  -0.64096147
   0.17179826  0.6753697  -0.64479506  0.2613258  -0.83917886  0.8475276
  -0.939543    0.85671264  0.18680607 -0.29868662  0.24222809  0.09251133
   0.03180926 -0.5663932  -0.556755  ]]
current steps:
  behavior_name: Walker?team=0
    decision step:
      agent_id: [0 1 2 3 4 5 6 7 8 9]
      group_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
      observation_shapes: [(10, 243)]
      reward: [0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
 0.0000000e+00 5.1639726e-14 0.0000000e+00 0.0000000e+00 0.0000000e+00]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask: None
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 243)]
      reward: []
      group_reward: []
```

### Hallway

```
behavior_specs:
  behavior_name: Hallway?team=0
    observation_spec[0]:
      name: StackingSensor_size3_RayPerceptionSensor
      shape: (105,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[1]:
      name: StackingSensor_size3_VectorSensor_size1
      shape: (3,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 0, Discrete: (5,)
      is_continuous: False
      continuous_size: 0
      is_discrete: True
      discrete_branches: (5,)
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: [[1]]
      continuous: []
current steps:
  behavior_name: Hallway?team=0
    decision step:
      agent_id: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
      group_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
      observation_shapes: [(16, 105), (16, 3)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask_shapes: [(16, 5)]
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 105), (0, 3)]
      reward: []
      group_reward: []
```

### Pyramids

```
behavior_specs:
  behavior_name: Pyramids?team=0
    observation_spec[0]:
      name: RayPerceptionSensor
      shape: (56,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[1]:
      name: RayPerceptionSensor1
      shape: (56,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[2]:
      name: RayPerceptionSensor2
      shape: (56,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[3]:
      name: VectorSensor_size4
      shape: (4,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 0, Discrete: (5,)
      is_continuous: False
      continuous_size: 0
      is_discrete: True
      discrete_branches: (5,)
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: [[3]]
      continuous: []
current steps:
  behavior_name: Pyramids?team=0
    decision step:
      agent_id: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
      group_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
      observation_shapes: [(16, 56), (16, 56), (16, 56), (16, 4)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask_shapes: [(16, 5)]
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 56), (0, 56), (0, 56), (0, 4)]
      reward: []
      group_reward: []
```

### DungeonEscape

```
behavior_specs:
  behavior_name: DungeonEscape?team=0
    observation_spec[0]:
      name: RBSensor
      shape: (10,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[1]:
      name: StackingSensor_size3_RayPerceptionSensor
      shape: (360,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[2]:
      name: VectorSensor_size1
      shape: (1,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 0, Discrete: (7,)
      is_continuous: False
      continuous_size: 0
      is_discrete: True
      discrete_branches: (7,)
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: [[0]]
      continuous: []
current steps:
  behavior_name: DungeonEscape?team=0
    decision step:
      agent_id: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35]
      group_id: [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 12]
      observation_shapes: [(36, 10), (36, 360), (36, 1)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask_shapes: [(36, 7)]
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 10), (0, 360), (0, 1)]
      reward: []
      group_reward: []
```

### GridFoodCollector

```
behavior_specs:
  behavior_name: GridFoodCollector?team=0
    observation_spec[0]:
      name: GridSensor-OneHot
      shape: (5, 40, 40)
      dimension_property: (<DimensionProperty.NONE: 1>, <DimensionProperty.TRANSLATIONAL_EQUIVARIANCE: 2>, <DimensionProperty.TRANSLATIONAL_EQUIVARIANCE: 2>)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 3, Discrete: (2,)
      is_continuous: False
      continuous_size: 3
      is_discrete: False
      discrete_branches: (2,)
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: [[1]]
      continuous: [[0.04317868 0.8829177  0.15289028]]
      continuous.dtype: float32
current steps:
  behavior_name: GridFoodCollector?team=0
    decision step:
      agent_id: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
      group_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
      observation_shapes: [(20, 5, 40, 40)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask_shapes: [(20, 2)]
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 5, 40, 40)]
      reward: []
      group_reward: []
```

### WallJump

```
behavior_specs:
  behavior_name: SmallWallJump?team=0
    observation_spec[0]:
      name: StackingSensor_size6_OffsetRayPerceptionSensor
      shape: (210,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[1]:
      name: StackingSensor_size6_RayPerceptionSensor
      shape: (210,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[2]:
      name: StackingSensor_size6_VectorSensor_size4
      shape: (24,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 0, Discrete: (3, 3, 3, 2)
      is_continuous: False
      continuous_size: 0
      is_discrete: True
      discrete_branches: (3, 3, 3, 2)
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: [[0 2 1 1]]
      continuous: []
      continuous.dtype: float32
  behavior_name: BigWallJump?team=0
    observation_spec[0]:
      name: StackingSensor_size6_OffsetRayPerceptionSensor
      shape: (210,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[1]:
      name: StackingSensor_size6_RayPerceptionSensor
      shape: (210,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    observation_spec[2]:
      name: StackingSensor_size6_VectorSensor_size4
      shape: (24,)
      dimension_property: (<DimensionProperty.NONE: 1>,)
      observation_type: ObservationType.DEFAULT
    action_spec: Continuous: 0, Discrete: (3, 3, 3, 2)
      is_continuous: False
      continuous_size: 0
      is_discrete: True
      discrete_branches: (3, 3, 3, 2)
    random_action:
      discrete_dtype: <class 'numpy.int32'>
      discrete: [[0 0 1 1]]
      continuous: []
      continuous.dtype: float32
current steps:
  behavior_name: SmallWallJump?team=0
    decision step:
      agent_id: [ 0  1  4  5  7 11 12 13 15 17 20 22]
      group_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
      observation_shapes: [(12, 210), (12, 210), (12, 24)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask_shapes: [(12, 3), (12, 3), (12, 3), (12, 2)]
    terminal step:
      agent_id: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
      group_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
      observation_shapes: [(24, 210), (24, 210), (24, 24)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  behavior_name: BigWallJump?team=0
    decision step:
      agent_id: [ 2  3  6  8  9 10 14 16 18 19 21 23]
      group_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
      observation_shapes: [(12, 210), (12, 210), (12, 24)]
      reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      group_reward: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      action_mask_shapes: [(12, 3), (12, 3), (12, 3), (12, 2)]
    terminal step:
      agent_id: []
      group_id: []
      observation_shapes: [(0, 210), (0, 210), (0, 24)]
      reward: []
      group_reward: []
```

## Problems

### New agents

DungeonEscape and Basic add new agents during a run! The torchrl wrapper needs to support this somehow.
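
One possible first step, sketched below, is to detect when an unseen agent ID shows up in a step. This is plain Python bookkeeping, not part of `ml-agents` or torchrl. The class name `AgentRegistry` and agent ID 36 are made up for illustration, and the sketch does not address how the wrapper's specs or TensorDict layout would grow to fit the new agent.

```python
class AgentRegistry:
    """Track which agent IDs have been seen so far, per behavior."""

    def __init__(self):
        self._seen = {}  # behavior_name -> set of agent IDs

    def update(self, behavior_name, agent_ids):
        """Record `agent_ids` and return the previously unseen ones, sorted."""
        known = self._seen.setdefault(behavior_name, set())
        new_ids = sorted({int(i) for i in agent_ids} - known)
        known.update(new_ids)
        return new_ids


registry = AgentRegistry()
# The first step reports agents 0..35, as in the DungeonEscape dump.
first = registry.update("DungeonEscape?team=0", range(36))
# A later step reports a made-up agent 36 alongside known agents.
later = registry.update("DungeonEscape?team=0", [0, 1, 36])
print(len(first), later)  # 36 [36]
```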

### New observation? Agent ID reassigned?

Running the `TestUnityMLAgents.test_env_with_editor` test with DungeonEscape gives this error:

```
  File "/home/endoplasm/develop/torchrl-mlagents/torchrl/envs/libs/unity_mlagents.py", line 301, in _make_td_out
    agent_dict[obs_name] = tensordict_in[group_name, agent_name, obs_name]
...
KeyError: 'key "RBSensor" not found in TensorDict with keys [\'discrete_action\']'
```

Either RBSensor is a new observation that got added to the behavior spec, or the agent ID in question was reassigned to a different agent whose behavior differs from the one it had before.
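
To tell these two hypotheses apart, one option is to snapshot the agent-to-behavior mapping and the observation names on consecutive steps and compare them. The sketch below is plain Python. Both function names, the behavior name `Hypothetical?team=0`, and the "earlier" snapshots are made up for illustration, and nothing here is confirmed to match what Unity actually does.

```python
def changed_behaviors(previous, current):
    """Compare two {agent_id: behavior_name} snapshots.

    Returns {agent_id: (old_behavior, new_behavior)} for every agent ID
    present in both snapshots whose behavior name differs.
    """
    return {
        agent_id: (previous[agent_id], behavior)
        for agent_id, behavior in current.items()
        if agent_id in previous and previous[agent_id] != behavior
    }


def added_observations(previous_names, current_names):
    """Return observation names in `current_names` missing from `previous_names`."""
    return [name for name in current_names if name not in previous_names]


# Hypothesis: the agent ID was reassigned (made-up data).
prev_agents = {7: "DungeonEscape?team=0"}
cur_agents = {7: "Hypothetical?team=0"}
print(changed_behaviors(prev_agents, cur_agents))

# Hypothesis: RBSensor was added to the spec (the "before" list is made up,
# the "after" list matches the DungeonEscape dump).
before = ["StackingSensor_size3_RayPerceptionSensor", "VectorSensor_size1"]
after = ["RBSensor", "StackingSensor_size3_RayPerceptionSensor", "VectorSensor_size1"]
print(added_observations(before, after))  # ['RBSensor']
```

If both checks come back empty on the failing step, neither hypothesis explains the `KeyError` and the cause lies elsewhere.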