Leaderboard


This page tracks the performance of user algorithms for various tasks in gym. Previously, users could submit their scores directly to gym.openai.com/envs, but it has been decided that a simpler wiki might do this task more efficiently.

This is a community-driven page: anyone can edit it and add to it. We encourage you to contribute your scores, along with links to write-ups and code that reproduce your results. We also encourage you to add tasks that use the gym interface but live outside the core gym library (such as roboschool).

Links to videos are optional but encouraged. Videos can be hosted on YouTube, Instagram, Twitter, or any other public link. Write-ups should explain how to reproduce the result, and can be a simple gist, blog post, or GitHub repo.

We have begun to copy the previous performance scores and write-up links over from the previous page. This is an ongoing effort, and we could use some help.

Environments

Classic control

CartPole-v0

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
  • Environment Details

  • CartPole-v0 defines "solving" as getting average reward of 195.0 over 100 consecutive trials.

  • This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson [Barto83].
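For reference, a minimal sketch of interacting with the environment through the gym API (old-style reset/step signature, random actions only; this will not come close to the 195-point bar):

```python
import gym

env = gym.make('CartPole-v0')
for episode in range(5):
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()   # 0 = push cart left, 1 = push cart right
        obs, reward, done, info = env.step(action)
        total_reward += reward               # +1 for every step the pole stays up
    print('episode %d: return %.0f' % (episode, total_reward))
env.close()
```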

| User | Episodes before solve | Write-up | Video |
|---|---|---|---|
| Tom | 9 | writeup | video |
| Udacity DRLND Team | 13 | writeup | gif |
| TeaPearce, nanastassacos | 16 | writeup | video |
| MisterTea, econti | 24 | writeup | |
| yingzwang | 32 | writeup | |
| SurenderHarsha | 40 | writeup | |
| n1try | 85 | writeup | |
| khev | 96 | writeup | video |
| ceteke | 99 | writeup | |
| manikanta | 100 | writeup | video |
| JamesUnicomb | 145 | writeup | video |
| Harshit Singh Lodha | 265 | writeup | gif |
| mbalunovic | 306 | writeup | |
| ruippeixotog | 933 | writeup | |

MountainCar-v0

A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
  • Environment details
  • MountainCar-v0 defines "solving" as getting average reward of -110.0 over 100 consecutive trials.
  • This problem was first described by Andrew Moore in his PhD thesis [Moore90].
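A hedged baseline sketch: the classic hand-coded policy that always pushes in the direction of the car's current velocity. It typically reaches the flag within the 200-step limit, though it is not tuned to hit the -110 average required for "solving":

```python
import gym

env = gym.make('MountainCar-v0')
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    position, velocity = obs
    action = 2 if velocity > 0 else 0        # 2 = push right, 0 = push left
    obs, reward, done, _ = env.step(action)
    total_reward += reward                   # -1 per step until the goal is reached
print('episode return:', total_reward)
env.close()
```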
| User | Episodes before solve | Write-up | Video |
|---|---|---|---|
| Harshit Singh Lodha | 643 | writeup | gif |
| jing582 | 1119 | | |
| DaveLeongSingapore | 1967 | | |

MountainCarContinuous-v0

A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. This is the continuous-action version of the task, and the reward is greater the less energy you spend to reach the goal.

  • Environment details
  • MountainCarContinuous-v0 defines "solving" as getting average reward of 90.0 over 100 consecutive trials.
  • This problem was first described by Andrew Moore in his PhD thesis [Moore90].
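The continuous variant takes a single real-valued force in [-1, 1] instead of a discrete push, and the reward penalises large actions, so flooring the throttle is not free. A minimal bang-bang sketch along the velocity sign (illustrative only, not a leaderboard entry):

```python
import gym
import numpy as np

env = gym.make('MountainCarContinuous-v0')
print(env.action_space)                      # Box(1,) with values in [-1, 1]
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    force = 1.0 if obs[1] > 0 else -1.0      # push along the current velocity
    obs, reward, done, _ = env.step(np.array([force]))
    total_reward += reward
print('episode return: %.1f' % total_reward)
env.close()
```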
| User | Episodes before solve | Write-up | Video |
|---|---|---|---|
| Ashioto | 1 | writeup | |
| camigord | 18 | writeup | |
| Tobias Steidle | 32 | writeup | video |
| lirnli | 33 | writeup | |
| khev | 130 | writeup | video |

Pendulum-v0

The inverted pendulum swingup problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright.
  • Environment details
  • Pendulum-v0 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
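Because there is no solve threshold, the leaderboard reports the best window of 100 consecutive episodes as mean ± standard deviation. A sketch of that bookkeeping around an arbitrary agent (random actions here; substitute your own policy):

```python
import gym
import numpy as np

env = gym.make('Pendulum-v0')
returns = []
for _ in range(1000):
    obs, done, ep_return = env.reset(), False, 0.0
    while not done:
        action = env.action_space.sample()   # torque in [-2, 2]
        obs, reward, done, _ = env.step(action)
        ep_return += reward
    returns.append(ep_return)

# Best (highest-mean) window of 100 consecutive episodes.
windows = [returns[i:i + 100] for i in range(len(returns) - 99)]
best = max(windows, key=np.mean)
print('best 100-episode performance: %.2f ± %.2f' % (np.mean(best), np.std(best)))
```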
| User | Best 100-episode performance | Write-up | Video |
|---|---|---|---|
| msinto93 | -123.79 ± 6.90 | writeup | |
| heerad | -134.48 ± 9.07 | writeup | |
| ThyrixYang | -136.16 ± 11.97 | writeup | |
| lirnli | -152.24 ± 10.87 | writeup | |

Acrobot-v1

The acrobot system includes two joints and two links, where the joint between the two links is actuated. Initially, the links are hanging downwards, and the goal is to swing the end of the lower link up to a given height.

  • Acrobot-v1 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
  • The acrobot was first described by Sutton [Sutton96]. We are using the version from RLPy [Geramifard15], which uses Runge-Kutta integration for better accuracy.
| User | Best 100-episode performance | Write-up | Video |
|---|---|---|---|
| mallochio | -42.37 ± 4.83 | taken down | |
| marunowskia | -59.31 ± 1.23 | | |
| MontrealAI | -60.82 ± 0.06 | | |
| Daniel Barbosa | -67.18 | writeup | |
| lirnli | -72.09 ± 1.15 | | |
| tsdaemon | -77.87 ± 1.54 | | |
| a7b23 | -80.68 ± 1.18 | | |
| DaveLeongSingapore | -84.02 ± 1.46 | | |
| loicmarie | -99.18 ± 2.60 | | |
| simonoso | -113.66 ± 5.15 | | |
| alebac | -427.26 ± 15.02 | | |
| mehdimerai | -500.00 ± 0.00 | | |

Box2D

LunarLander-v2

The landing pad is always at coordinates (0,0), and the coordinates are the first two numbers in the state vector. Reward for moving from the top of the screen to the landing pad at zero speed is about 100..140 points. If the lander moves away from the landing pad, it loses that reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points. Each leg's ground contact is worth +10. Firing the main engine costs -0.3 points each frame. Solved is 200 points. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions are available: do nothing, fire the left orientation engine, fire the main engine, fire the right orientation engine.
  • LunarLander-v2 defines "solving" as getting average reward of 200 over 100 consecutive trials.
  • by @olegklimov
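The "Episodes before solve" column counts episodes until the average return over the last 100 episodes first reaches 200. A sketch of that bookkeeping (the random policy below is a stand-in and will not actually reach the threshold; plug in your own agent):

```python
import gym
from collections import deque

env = gym.make('LunarLander-v2')
recent = deque(maxlen=100)                   # returns of the last 100 episodes
for episode in range(1, 5001):
    obs, done, ep_return = env.reset(), False, 0.0
    while not done:
        action = env.action_space.sample()   # replace with your agent's action
        obs, reward, done, _ = env.step(action)
        ep_return += reward
    recent.append(ep_return)
    if len(recent) == 100 and sum(recent) / 100 >= 200:
        print('solved after %d episodes' % episode)
        break
```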
| User | Episodes before solve | Write-up | Video |
|---|---|---|---|
| Christoph Powazny | 658 | writeup | gif |
| Daniel Barbosa | 674 | writeup | gif |
| Ruslan Miftakhov | 814 | writeup | gif |
| Ollie Graham | 987 | writeup | gif |
| Nikhil Barhate | 1500 | writeup | gif |
| Udacity DRLND Team | 1504 | writeup | gif |
| JamesUnicomb | 2100 | writeup | video |
| ksankar | 2148 | Working on it | |
| koltafrickenfer | 499474 | writeup | youtube |

LunarLanderContinuous-v2

The landing pad is always at coordinates (0,0), and the coordinates are the first two numbers in the state vector. Reward for moving from the top of the screen to the landing pad at zero speed is about 100..140 points. If the lander moves away from the landing pad, it loses that reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points. Each leg's ground contact is worth +10. Firing the main engine costs -0.3 points each frame. Solved is 200 points. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. The action is a vector of two real values in [-1, +1]. The first controls the main engine: -1..0 is off, 0..+1 throttles from 50% to 100% power (the engine cannot run below 50% power). The second controls the orientation engines: -1.0..-0.5 fires the left engine, +0.5..+1.0 fires the right engine, and -0.5..0.5 is off.
  • LunarLanderContinuous-v2 defines "solving" as getting average reward of 200 over 100 consecutive trials.
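A short sketch of the action encoding described above: a length-2 vector in [-1, 1], where the first component throttles the main engine and the second selects the orientation engines:

```python
import gym
import numpy as np

env = gym.make('LunarLanderContinuous-v2')
print(env.action_space)                      # Box(2,) with values in [-1, 1]
obs = env.reset()
action = np.array([0.8, 0.0])                # main engine at high power, side engines off
obs, reward, done, _ = env.step(action)
```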
| User | Episodes before solve | Write-up | Video |
|---|---|---|---|

BipedalWalker-v2

Reward is given for moving forward, for a total of 300+ points up to the far end. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so a more optimal agent will get a better score. The state consists of hull angle, angular velocity, horizontal speed, vertical speed, positions of the joints and their angular speeds, legs contact with the ground, and 10 lidar rangefinder measurements. There are no coordinates in the state vector.
  • Environment Details
  • BipedalWalker-v2 defines "solving" as getting average reward of 300 over 100 consecutive trials.
  • by @olegklimov
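A quick way to inspect the spaces described above: the 24-dimensional state (hull angle and velocities, joint positions and speeds, leg contacts, 10 lidar readings) and a 4-dimensional torque action in [-1, 1] for the hip and knee joints:

```python
import gym

env = gym.make('BipedalWalker-v2')
print(env.observation_space.shape)           # (24,)
print(env.action_space)                      # Box(4,) with torques in [-1, 1]
obs = env.reset()
obs, reward, done, _ = env.step(env.action_space.sample())
```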
| User | Episodes before solve | Write-up | Video |
|---|---|---|---|
| mayurmadnani | 1000 | Write-up | Youtube |
| chitianqilin | 47956 | writeup | Youtube |
| koltafrickenfer | N/A | writeup | youtube |
| alirezamika | N/A | writeup | |
| 404akhan | N/A | writeup | |
| Udacity DRLND Team | N/A | writeup | gif |

BipedalWalkerHardcore-v2

Hardcore version with ladders, stumps, and pitfalls. The time limit is increased due to the obstacles. Reward is given for moving forward, for a total of 300+ points up to the far end. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so a more optimal agent will get a better score. The state consists of hull angle, angular velocity, horizontal speed, vertical speed, positions of the joints and their angular speeds, legs contact with the ground, and 10 lidar rangefinder measurements. There are no coordinates in the state vector.
  • BipedalWalkerHardcore-v2 defines "solving" as getting average reward of 300 over 100 consecutive trials.
| User | Episodes before solve | Write-up | Video |
|---|---|---|---|
| DollarAkshay | N/A | writeup | |
| ryogrid | N/A | writeup | |

CarRacing-v0

The easiest continuous control task to learn from pixels: a top-down racing environment. Discrete control is reasonable in this environment as well; on/off discretisation is fine. The state consists of 96x96 pixels. Reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, if you finish in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points. The episode finishes when all tiles are visited. Some indicators are shown at the bottom of the window and in the state RGB buffer: from left to right, true speed, four ABS sensors, steering wheel position, and gyroscope.
  • by @olegklimov
  • CarRacing-v0 defines "solving" as getting average reward of 900 over 100 consecutive trials.
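Since on/off discretisation works here, a common approach is to map a handful of discrete choices onto the continuous (steer, gas, brake) action. A minimal sketch (this particular action set is illustrative, not a recommendation):

```python
import gym
import numpy as np

DISCRETE_ACTIONS = [
    np.array([0.0, 1.0, 0.0]),   # accelerate
    np.array([0.0, 0.0, 0.8]),   # brake
    np.array([-1.0, 0.0, 0.0]),  # steer left
    np.array([1.0, 0.0, 0.0]),   # steer right
]

env = gym.make('CarRacing-v0')
obs = env.reset()                            # 96x96x3 RGB frame
for _ in range(100):
    action = DISCRETE_ACTIONS[np.random.randint(len(DISCRETE_ACTIONS))]
    obs, reward, done, _ = env.step(action)
    if done:
        break
env.close()
```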
| User | Episodes before solve | Write-up | Video |
|---|---|---|---|
| lmclupr | N/A | writeup | |
| IPAM-AMD | 900 | writeup | Video |

MuJoCo

Walker2d-v1

Make a two-dimensional bipedal robot walk forward as fast as possible.
  • Walker2d-v1 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
  • The robot model is based on work by Erez, Tassa, and Todorov [Erez11].
| User | 100-Episode Average Score | Write-up | Video |
|---|---|---|---|
| pat-coady | 7167.24 | link | video |
| joschu | 5594.75 | link | video |
| songrotek | 1222.12 | link | video |
| your name | your score | link | video |

Ant-v1

Make a four-legged creature walk forward as fast as possible.
  • Ant-v1 defines "solving" as getting average reward of 6000.0 over 100 consecutive trials.
  • This task originally appeared in [Schulman15].
| User | Episodes before solve | Write-up | Video |
|---|---|---|---|
| pat-coady | 69154 | writeup | |
| joschu | N/A | writeup | |

PyGame Learning Environment

FlappyBird-v0

This environment adapts a game from the PyGame Learning Environment (PLE) (http://pygame-learning-environment.readthedocs.io/en/latest/). To run it, you will need to install gym-ple from https://github.com/lusob/gym-ple.

Flappybird is a side-scrolling game where the agent must successfully navigate through gaps between pipes. The up arrow causes the bird to accelerate upwards. If the bird makes contact with the ground or pipes, or goes above the top of the screen, the game is over. For each pipe it passes through it gains a positive reward of +1. Each time a terminal state is reached it receives a negative reward of -1.

  • FlappyBird-v0 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
  • by @lusob
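A hedged sketch of getting the environment running, assuming that importing gym_ple registers the PLE environments with gym (check the gym-ple repo linked above for the exact usage):

```python
import gym
import gym_ple  # noqa: F401 -- importing is assumed to register FlappyBird-v0 and other PLE envs

env = gym.make('FlappyBird-v0')
obs = env.reset()                            # RGB screen buffer
done, total_reward = False, 0.0
while not done:
    obs, reward, done, _ = env.step(env.action_space.sample())
    total_reward += reward                   # +1 per pipe passed, -1 on game over
print('episode return:', total_reward)
```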
| User | Best 100-episode performance | Write-up | Video |
|---|---|---|---|
| andreimuntean | 261.12 ± 2.61 | writeup | |
| chuchro3 | 62.26 ± 7.81 | writeup | |
| warmar | 11.28 ± 14.25 | writeup | video1 video2 |

Atari Games

Breakout-v0

| User | Best 100-episode performance | Write-up |
|---|---|---|
| ppwwyyxx | 760.07 ± 18.37 | writeup |

Pong-v0

| User | Best 100-episode performance | Write-up |
|---|---|---|
| ppwwyyxx | 20.81 ± 0.04 | writeup |

MsPacman-v0

| User | Best 100-episode performance | Write-up |
|---|---|---|
| ppwwyyxx | 5738.30 ± 171.99 | writeup |

Seaquest-v0

| User | Best 100-episode performance | Write-up |
|---|---|---|
| ppwwyyxx | 50209 ± 2440.07 | writeup |

Snake-v0

Snake is a game where the agent must maneuver a line that grows in length each time its head touches food. The line follows the path previously taken, which eventually becomes an obstacle the agent must avoid.

The food spawns at a random location within the valid window, checked so that it does not overlap the snake's body.

| User | Best 100-episode performance | Write-up | Video |
|---|---|---|---|
| carsonprindle | 0.44 ± 0.04 | writeup | |

Doom

Toy text

Simple text environments to get you started.

Taxi-v2

This task was introduced in [Dietterich2000] to illustrate some issues in hierarchical reinforcement learning. There are 4 locations (labeled by different letters) and your job is to pick up the passenger at one location and drop him off in another. You receive +20 points for a successful dropoff, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.

[Dietterich2000] T. G. Dietterich, "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition", 2000.
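Taxi-v2 is small enough (500 discrete states, 6 actions) for tabular Q-learning; a minimal sketch with illustrative, untuned hyperparameters:

```python
import gym
import numpy as np

env = gym.make('Taxi-v2')
q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1       # learning rate, discount, exploration

for episode in range(5000):
    state, done = env.reset(), False
    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q[state]))
        next_state, reward, done, _ = env.step(action)
        # one-step Q-learning update toward the bootstrapped target
        q[state, action] += alpha * (reward + gamma * np.max(q[next_state]) - q[state, action])
        state = next_state
```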

| User | 100 Episodes Best Average Reward | Write-up | Video |
|---|---|---|---|
| giskmov | 9.700 | writeup | |
| jo4x962k7JL | 9.600 | writeup | |
| ksankar | 9.530 | writeup | |
| Tom Roth | 9.500 | writeup | |
| Herimiaina ANDRIA-NTOANINA | 9.446 | writeup | |
| Cihan Soylu | 9.423 | writeup | |
| mostoo45 | 9.492 | writeup | |
| Mahaveer Jain | 9.296 | writeup | |
| Brungi Vishwa Sourab | 9.23 | writeup | |