
Introduction

In the NERO game, the user trains intelligent agents to perform well in battle. It is a machine-learning game: the focus is on designing a set of challenges that allows the agents to learn the necessary skills step by step. Learning takes place in real time, while the user observes the game and changes the environment and behavioral objectives on the fly. The challenge for the player is to develop as proficient a team as possible.

The NERO game in OpenNERO is a simpler research and education version of the original NERO game, focusing on demonstrating learning algorithms interactively in order to make it clear how they work. The game environment is described first, followed by the two methods for training the agents (neuroevolution and reinforcement learning), how a team can be put together for battle, and the battle mode itself. Ways of extending the learning methods and hand-coding teams, as well as differences from the original NERO, are described at the end. To get a quick introduction to NERO, watch the video below.


Click to play video.

NERO Training

In the training mode, the user selects one of the two training methods (neuroevolution or reinforcement learning) and manipulates the environment and the behavioral goals in order to train the agents to do what he or she wants.

Typically, training starts by deploying either an rtNEAT team or a Q-learning team and then setting some of the goals (or fitness coefficients) in the parameter window (the sliders become active after a Deploy button is pressed). They are listed below, followed by a sketch of how the coefficients combine the reward components:

  • Stand Ground:
    • Positive: Punished for nonzero movement velocity
    • Negative: Rewarded for nonzero movement velocity
  • Stick Together:
    • Positive: Rewarded for small distance to center of mass of teammates.
    • Negative: Rewarded for large distance to center of mass of teammates.
  • Approach Enemy:
    • Positive: Rewarded for small distance to closest enemy agent
    • Negative: Rewarded for large distance to closest enemy agent
  • Approach Flag:
    • Positive: Rewarded for small distance to flag
    • Negative: Rewarded for large distance to flag
  • Hit Target:
    • Positive: Rewarded for hitting enemy agents
    • Negative: Punished for hitting enemy agents
  • Avoid Fire:
    • Positive: Punished for having hit points reduced
    • Negative: Rewarded for having hit points reduced
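
Conceptually, these sliders define a weighted combination of per-objective reward components. The following is a minimal sketch of that idea in Python; the component formulas and attribute names are illustrative assumptions, not OpenNERO's actual reward code.

# Illustrative sketch only: how slider coefficients might combine per-objective
# reward components into a single reward/fitness signal. The agent attributes
# below are hypothetical, not OpenNERO's API.

def combined_reward(agent, sliders):
    """Weighted sum of reward components; `sliders` maps each objective
    name to its coefficient, mirroring the training sliders above."""
    components = {
        'stand_ground':   -abs(agent.speed),             # positive slider punishes movement
        'stick_together': -agent.dist_to_team_center,    # positive slider rewards staying close
        'approach_enemy': -agent.dist_to_nearest_enemy,  # positive slider rewards proximity to enemy
        'approach_flag':  -agent.dist_to_flag,           # positive slider rewards proximity to flag
        'hit_target':      agent.hits_scored_this_step,  # positive slider rewards landing hits
        'avoid_fire':     -agent.damage_taken_this_step, # positive slider punishes taking damage
    }
    return sum(sliders[name] * value for name, value in components.items())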

There are also a number of parameters that affect learning and should be set appropriately (the default values are usually a good starting point):

  • Explore/Exploit: This slider has no effect on neuroevolution. With Reinforcement Learning, it determines the percentage of actions taken greedily (i.e. those with the best Q-values) vs. actions taken to explore the environment (i.e. randomly selected actions)
  • Lifetime: The number of action steps each agent gets to perform before being removed from the simulation (and restarted, or replaced by offspring).
  • Hitpoints: The amount of damage an agent can take before dying (Note: being hit by an enemy is 1 point of damage). In training, an agent that dies is removed from the field and respawned at the spawn location with its lifetime and hitpoints reset.

The second part of the initialization is to set up the environment. An initial environment is already provided, and it is the same as the battle environment. The user can, however, add objects to it to design the training curriculum by right-clicking with the mouse:

When right-clicking on empty space: http://www.cs.utexas.edu/users/risto/pictures/menu1.png

  • Add Wall: Generates a standard wall where you clicked
  • Place Flag: Generates a flag at the place where you clicked, or moves the flag there if one already exists. The flag has the appearance of a blue pole. Flags are useful for demonstration purposes, but not necessary when training for battle.
  • Place Turret: Generates an enemy at the location where you clicked. The enemy rotates and fires at anything in its line of fire, using the same probabilistic method as the agents themselves. It does not die no matter how many times it is hit.
  • Set Spawn Location: Moves the location around which the agents are created to the location where you clicked. The locations and orientations of the agents are randomly chosen within a small circle around that point. The team is blue in training; in battle there are red and blue teams.

When right-clicking on an object (i.e. a wall or a turret) that you placed: http://www.cs.utexas.edu/users/risto/pictures/menu2.png

  • Rotate Object: Rotates the object around the z-axis until the user left clicks.
  • Scale Object: Scales the object until the user left clicks.
  • Move Object: Lets you move the object until you left click.
  • Remove Object: Removes the object.

Trees are sensed as small walls; in the current version, however, they cannot be created or modified.

Over-head display

By hitting the F2 key, you can cycle through additional information about each agent that may be useful during training. This "over-head" display shows up as a bit of text above each agent on the field. When an over-head display is active, the window title changes to say what is being displayed. Some of the information is specific to neuroevolution (rtNEAT), and some is specific to reinforcement learning (RL).

  • fitness
    • for RL, this is the cumulative reward over the agent's lifetime. Because the meaning of the reward values can change with the adjustment of sliders, the exact meaning and units of this value depend on the current slider setting.
    • for rtNEAT, this is the relative fitness of the organism compared to the rest of the population. It is calculated as the weighted sum of the Z-scores (the number of standard deviations above or below the population average) of the agent in each of the fitness slider categories (a sketch of this computation appears after this list).
  • time alive
    • for RL, this is simply the number of steps on the field that the current individual has been trained for. Generally, the longer an agent is fielded, the more experience it has, and, if using a hash table, the larger its representation in the team file.
    • for rtNEAT, this is the total time (in frames) that the phenotype has been on the field. Note that this can be larger than a single lifetime because the same network can be "re-spawned" several times if it is considered good enough, because rtNEAT is an elitist steady-state algorithm.
  • id
    • for RL, this is the body id of the individual, allowing you to keep track of its behavior over time.
    • for rtNEAT, this is the genome id, which you can use to track the behavior and to extract individuals from saved populations for use in combat teams.
  • species id
    • for RL, this shows the value 'q' for the default q-learner to allow you to distinguish RL agents from rtNEAT ones.
    • for rtNEAT, this shows the unique species number that the individual belongs to. rtNEAT uses speciation and fitness sharing in order to protect diversity within the evolving population.
  • champion:
    • not available for RL
    • for rtNEAT, this shows the label 'champ!' above the highest-ranked individual within the current population, allowing you to quickly check what the best behavior so far is according to the current fitness profile.
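
To illustrate how such a relative fitness could be computed, here is a minimal sketch of a weighted Z-score sum in Python; the data structures and names are assumptions for illustration, not the actual OpenNERO implementation.

import statistics

def relative_fitness(agent_scores, population_scores, slider_weights):
    """Weighted sum of Z-scores across the fitness slider categories.

    agent_scores:      maps category name -> this agent's raw score
    population_scores: maps category name -> list of raw scores in the population
    slider_weights:    maps category name -> current slider coefficient
    (Illustrative sketch only.)
    """
    total = 0.0
    for category, weight in slider_weights.items():
        scores = population_scores[category]
        mean = statistics.mean(scores)
        stdev = statistics.pstdev(scores) or 1.0   # guard against zero variance
        z_score = (agent_scores[category] - mean) / stdev
        total += weight * z_score
    return total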

Neuroevolution (rtNEAT)

The rtNEAT neuroevolution algorithm is a method for evolving (through genetic algorithms) a population of neural networks to control the agents. See the paper on rtNEAT for more details.

When you press the "Deploy rtNEAT" button, a population of 50 agents is created and spawned on the field. Each agent is controlled by a simply neural network connecting the input sensors directly to outputs, with random weights. Over their lifetime, fitness is accumulated based on the behavior objectives specified with the sliders: if e.g. the approach enemy is rewarded, the time they spend near the enemy is multiplied by a large constant and added to the fitness.

After their lifetime expires, agents are removed from the field one at a time. If an agent's fitness was low, it is simply discarded. If its fitness was high, it is put back into the field, and in addition a new agent is generated by mutating its neural network (i.e. adding nodes and connections and/or changing connection weights) and crossing over its representation with that of another high-fitness network. A balance of about 50% new individuals and 50% repeats is maintained in the field in the steady state (the Explore/Exploit slider has no effect on evolution). In this manner, evolution runs incrementally in the background, constantly evaluating and reproducing individuals.
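
The following is a highly simplified sketch of this steady-state replacement step in Python. The callbacks and the median-based cutoff are illustrative assumptions; the real rtNEAT implementation also handles speciation, fitness sharing, and the balance of new versus repeated individuals.

def on_lifetime_expired(expired, population, crossover, mutate, respawn):
    """Steady-state replacement, run when an agent's lifetime runs out.

    population: agents currently on the field, each with .genome and .fitness
    crossover, mutate, respawn: operators supplied by the caller (hypothetical)
    (Illustrative sketch only, not the actual rtNEAT code.)
    """
    population.remove(expired)
    fitnesses = sorted(agent.fitness for agent in population)
    median = fitnesses[len(fitnesses) // 2]
    if expired.fitness < median:
        return                                    # low fitness: simply discard
    respawn(expired.genome)                       # high fitness: field it again
    parent = max(population, key=lambda a: a.fitness)
    offspring = mutate(crossover(expired.genome, parent.genome))
    respawn(offspring)                            # evaluate the new individual too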

Over time, evolution is thus likely to come up with more complex networks, including ones with recurrent connections. Recurrence is useful, for example, when an agent needs to pursue an enemy around a corner: even though the enemy has disappeared from view, activation in a recurrent network retains that information. In other words, it allows disambiguating the state in a POMDP problem (where the state is only partially observable).
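
As a toy illustration of this point (hypothetical code, not part of OpenNERO), a single self-connected unit can keep a decaying trace of the last enemy sighting, so the controller still has usable information after the enemy leaves the field of view:

def enemy_memory(enemy_visible, previous_activation, decay=0.9):
    """Toy recurrent unit: returns 1.0 when the enemy is currently visible,
    and a decaying trace of the last sighting afterwards (the self-connection
    with weight `decay` is what retains the information)."""
    return 1.0 if enemy_visible else decay * previous_activation

# e.g. the trace stays high for a while after the enemy rounds a corner:
# a = 0.0
# for visible in [True, True, False, False, False]:
#     a = enemy_memory(visible, a)    # 1.0, 1.0, 0.9, 0.81, 0.729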

When the population is saved, the genomes of all agents are written into a text file. That file can be edited to form composite teams, reloaded for further training, or loaded into battle.

The rtNEAT algorithm is parameterized using the file neat-params.dat; you can edit it in order to experiment with different versions of the method (such as mutation and speciation rates, the balance of old and new agents, etc.).

Reinforcement Learning (Q-learning)

The reinforcement learning method in NERO is a version of Q-learning (familiar from the Q-learning demo), using either static, linear discretization or a tile-coding function approximator. The agents learn during their lifetime to optimize the behavioral objectives.

When you press the "Deploy Q-learning" button, a Q-learning agent is created according to the specs in the file mods/_NERO/data/shapes/character/steve_blue_qlearning.xml. The <Python agent="NERO.agent.QLearningAgent()"> XML element can be changed to include keyword arguments that will be passed to the QLearning constructor. These parameters are:

  • gamma - reinforcement learning discount factor
  • alpha - learning rate
  • epsilon - exploration factor for epsilon-greedy action selection, note that this can also be changed during the NERO simulation by manipulating the "Exploit/Explore" slider
  • action_bins - discretize each continuous dimension in the action space into this many linear bins (there are 2 action dimensions in NERO: turning and moving)
  • state_bins - discretize each continuous dimension in the state space into this many linear bins (there are about 15 state dimensions in NERO)
  • num_tiles - the number of tiles you want to use in the tile-coding approximator
  • num_weights - the amount of memory reserved for storing function approximations in the tile-coding approximator

The last four parameters specify how the state and action dimensions are discretized so that the Q-function can be represented as a table of Q-values, one for each state/action pair (these values are initialized to zero). If you choose to use the tile-coding approximator, be sure to set action_bins and state_bins to 0; conversely, if you wish to use the static bins, be sure to set num_tiles and num_weights to 0. The default Q-learning agents are created with action_bins set to 3 and state_bins set to 5.
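
For example, to pass explicit values for the static discretization, the XML element could be edited as follows (the gamma, alpha, and epsilon values shown here are purely illustrative; the values in your copy of the file may differ):

<Python agent="NERO.agent.QLearningAgent(gamma=0.8, alpha=0.8, epsilon=0.1,
    action_bins=3, state_bins=5, num_tiles=0, num_weights=0)">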

The population for the game is generated by cloning this agent 50 times; each agent gets its own Q-table to update, so different agents can learn different Q-values depending on their experiences.

Q-learning progresses as usual during the lifetime of these individuals, modifying the values in the table. Using the Explore/Exploit slider you can adjust the fraction of actions taken greedily (i.e. those with the best Q-values) vs. actions taken to explore the environment (i.e. randomly selected actions). When the lifetime of an agent expires, it is respawned and continues from the spawn location with its current Q-table.
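
Conceptually, each step of such an agent combines a tabular Q-value update with epsilon-greedy action selection, roughly as in the following sketch (the helper structure is an assumption for illustration; the actual OpenNERO implementation uses the approximators described above):

import random

def q_learning_step(q_table, state, action, reward, next_state, actions,
                    alpha=0.8, gamma=0.8, epsilon=0.1):
    """One tabular Q-learning update plus epsilon-greedy selection of the
    next action. States and actions are assumed already discretized into
    bins as described above. (Illustrative sketch only.)"""
    # Update Q(state, action) toward reward plus discounted best next value
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    old_value = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old_value + alpha * (reward + gamma * best_next - old_value)

    # Epsilon-greedy: the Explore/Exploit slider effectively adjusts epsilon
    if random.random() < epsilon:
        return random.choice(actions)                                     # explore
    return max(actions, key=lambda a: q_table.get((next_state, a), 0.0))  # exploit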

When the population is saved, the Q-table of each individual is saved together with its learning and function-approximation parameters, so that the team can be loaded for further training or for battle.

Training strategy

The game consists of trying to come up with a sequence of increasingly demanding goals so that the agents perform well in the end. It is a good idea to start with something simple, such as approaching the enemy. Once the agents learn that, place the enemy behind a wall so they learn to go around it. Then reward the agents for hitting the enemy as well. Then start penalizing them for getting hit. Introduce more enemies, and walls behind which the agents can take cover. You can also explore the effects of staying close to or apart from teammates, and of standing ground or moving a lot. In this manner, you can create agents with distinctly different personalities, possibly serving different roles in battle.

Achieving each objective will take some time. Within a couple of minutes you should see some of the agents perform the task occasionally; within 10-15 minutes, almost the entire team may converge. Using the F2 displays, you can follow the behavior of the current champion, see which agents are drawing fire and which are avoiding it, and, with rtNEAT, observe which agents are new and which are old and how speciation is progressing. Note that it is not always good to converge completely, because it may then be difficult to learn new skills. The trick is to discover a sequence where later skills build on earlier ones, so that little has to be unlearned between them.

It is a good idea to train several teams, and then test them in the battle mode. In this manner, you can develop an understanding of what works and why, and can focus your training better. Based on that knowledge you can also decide how to put a good team together from several different trained teams, as will be described next.

Composing a Team for Battle

Note that you can train several different teams to perform different behaviors, for instance a team of attackers, defenders, snipers, etc. It may then be useful to combine agents with such different behaviors into a single team. Because the save files are simply text, you can form such composite teams by editing them by hand. You can also "clone" agents by copying them multiple times. You can even combine agents created by neuroevolution and reinforcement learning into a single team. The first 50 agents in the save file will be used in the battle; if there are fewer than 50 agents in the file, they will be copied until 50 are created in battle.

The basic structure of the file is like this for rtNEAT teams:

genomestart 120
trait 1 0.112808 0.000000 0.305447 0.263380 0.991934 0.000000 0.306283 0.855288
...
node 1 1 1 1 FriendRadarSensor 3 90 -90 15 0
...
node 21 1 1 3
...
gene 1 1 22 0.041885 0 1.000000 0.041885 1
...
genomeend 120

In words, a population consists of one or more genomes. Each genome
starts with a genomestart (followed by its ID) line
and ends with a genomeend line. Between these lines,
there are one or more trait lines followed by one or more input
(sensor) lines, followed by some other node lines, followed by the
gene lines.
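
For example, a composite rtNEAT team can be assembled with a short script that copies selected genome blocks, identified by the genome IDs shown in the over-head display, from one or more saved populations into a new file. The sketch below is illustrative (the file names and IDs in the usage comment are hypothetical) and relies only on the genomestart/genomeend markers shown above.

def extract_genomes(filename, wanted_ids):
    """Return the text blocks of the genomes with the given IDs, relying
    only on the 'genomestart <id>' ... 'genomeend <id>' markers."""
    blocks, current, keep = [], [], False
    with open(filename) as f:
        for line in f:
            if line.startswith('genomestart'):
                keep = int(line.split()[1]) in wanted_ids
                current = []
            if keep:
                current.append(line)
            if keep and line.startswith('genomeend'):
                blocks.append(''.join(current))
                keep = False
    return blocks

# e.g. combine the champion of one run with two individuals from another:
# team = extract_genomes('attackers.txt', {120}) + extract_genomes('defenders.txt', {42, 77})
# with open('combo_team.txt', 'w') as out:
#     out.write(''.join(team))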

For RL teams, the file looks like this:

22 serialization::archive 5 0 0 0.8 0.8 0.1 3 3 ... 1 7 27 OpenNero::TableApproximator 1 0
0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 ...

22 serialization::archive 5 0 0 0.8 0.8 0.1 3 3 ... 1 7 27 OpenNero::TableApproximator 1 0
0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 ...

...

Each team member is represented by a block of numbers encoding the
agent's stored Q-table. Unlike rtNEAT teams, RL agents in this file
are separated by a single blank line.

Either way, you will probably want to pick and choose the
individual agents from your training episodes that perform the best
for the tasks you anticipate. You should assemble these agents into
one file for the battle.

(Note: If you include reinforcement learning agents, you need to separate all agents in your submission file with one blank line. Also note: if you form a team by combining individuals from different rtNEAT runs, you currently cannot train such a combo team further, because rtNEAT training depends on historical markings that would then not match.)

Before you submit to the tournament, you should test your file by loading it into NERO_Battle and making sure it runs correctly.
If you want, you can test your team e.g. against this sample team.

NERO Battle

In the NERO Battle environment the user first loads the two teams: one
is identified as Red and the other as Blue, based on how the tops of
the robots' heads are painted. By default the teams spawn on opposite
sides of the central wall in the standard environment (the environment
and the spawn locations can be changed as in training mode).

The Hitpoints slider specifies how many times each agent can be hit
before it dies and is removed from the battle. The game ends when one
team is completely eliminated or when the time runs out, in which case the team that has more hits on the opponent wins. The current hitpoints are displayed in the title bar of the NERO window; the agent that delivered the winning shot will jump up and down in jubilation :-).

The game starts when the user presses the Continue button. The agents
are spawned only once, and they then have to move around in the
environment and engage the other team. This is where the training pays
off: the agents need to respond appropriately to the opponents'
actions, employing different skills in different situations, such as
attacking, retreating, sniping, ambushing, sometimes perhaps working
together with teammates and sometimes independently of them. There is
no a priori winning strategy; the performance of the team depends on
the ingenuity of its creator!

To see how the battle mode works, or see how well your team is doing, you can use this sample team.

NERO Tournament

A fun event in e.g. AI or machine learning courses is to organize a
NERO tournament. The students develop teams, and the teams are then
played against each other in a round-robin or a double-elimination
tournament. One such tournament was held in Fall 2011 for the Stanford
Online AI course; the tournament assignment is here.


Extending NERO Methods

The ingenuity is not limited to simply training the agents with the
methods that have been implemented in OpenNERO. The game is open
source, and you can modify all aspects of it by changing the Python
code (and in some cases, the C++ code). The main files are...

For instance, you can implement more sophisticated versions of the
sensors and effectors, or entirely new ones such as line-of-fire
sensors, or sending and receiving signals between the agents. You can
implement more sophisticated function approximators for reinforcement
learning, and even other neuroevolution and reinforcement learning
algorithms. If you so desire, you can also program the agent behaviors
entirely by hand.

Note that many such changes will require making corresponding changes
to the battle mode as well, and therefore it will not be possible to
use them in the NERO Tournament. However,
note that as long as your team is represented in terms of genomes and
Q-tables, it doesn't matter how that representation is created. That
is, if your changes apply to training only, and your team can still be
saved in the existing format, the team can be entered into the
tournament. For instance, you can express behaviors in terms of rules
and finite state automata based on the sensors and effectors in NERO,
and then mechanically translate them into neural networks (see
e.g. this paper). Those networks can then be represented as genomes and entered into the tournament.

Differences between OpenNERO and Original NERO

The NERO game in OpenNERO differs from the original NERO game in
several important ways. First of all, whereas the original NERO was
based on the Torque game engine, OpenNERO is entirely open source
(based on the Irrlicht game engine and many other open-source
components). This design makes it a good platform for research and
education, i.e. it is possible for the users to extend it and to
understand it fully.

Second, the original NERO was designed to demonstrate that machine
learning games can be viable. It therefore aimed to be a more
substantial game, and included many features such as more advanced
graphics, sound, and user interface, as well as more detailed
environments that made gameplay more enjoyable. The 2.0 version of
NERO also included interactive battle where the human players
specified targets and composed teams dynamically.

Third, OpenNERO includes reinforcement learning as an alternative
method of learning for NERO agents. The idea is to demonstrate more
broadly how learning can take place in intelligent agents, both for
research and education.

Fourth, the original NERO included several features that have not yet
been implemented in OpenNERO, but could in the future. They include a
sensor for line-of-fire (which may help develop more sophisticated
behaviors); taking friendly fire into account; collisions among NERO
agents; different types of turrets in training; a button that
converges a population to a single individual, and a button that
removes an individual from the population. We invite users to
implement such features, and perhaps others, in the game, and to
contribute them to OpenNERO!

Fifth, much of OpenNERO is written in Python (instead of C++), making
it easier to understand and modify, again supporting research and
education. Unfortunately, this slows down the simulation by an order
of magnitude. However, we believe that researchers and students have
the patience it takes to "play" OpenNERO in order to gain better
insight into the learning taking place in it.

Software Issues

OpenNERO is academic software under (hyper)active development. It is
possible that you will come across a bug, or think of a feature that
should be implemented. If so, please report it here, so that everyone
can see it and track it (please first check whether it has already
been reported).
