<a href="https://colab.research.google.com/github/AI4Finance-Foundation/ElegantRL/blob/master/tutorial_BipedalWalker_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BipedalWalker-v3 Example in ElegantRL**






# **Task Description**

[BipedalWalker-v3](https://gym.openai.com/envs/BipedalWalker-v2/) is a robotics task in OpenAI Gym that exercises one of the most fundamental skills: locomotion. The goal is to get a 2D bipedal walker to walk across rough terrain. BipedalWalker is a difficult task with a continuous action space, and only a few RL implementations reach the target reward.

# **Part 1: Install ElegantRL**

In [None]:
# install elegantrl library
!pip install git+https://github.com/AI4Finance-LLC/ElegantRL.git

# **Part 2: Import Packages**


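*   **elegantrl**: the DRL library this tutorial uses to configure and train the agent.
*   **OpenAI Gym**: a toolkit for developing and comparing reinforcement learning algorithms, imported here through its maintained fork `gymnasium`.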



In [1]:
import gymnasium as gym
from elegantrl.agents import AgentPPO
from elegantrl.train.config import get_gym_env_args, Config
from elegantrl.train.run import *

gym.logger.set_level(40)  # 40 = ERROR level: suppress warnings

# **Part 3: Get environment information**

In [2]:
get_gym_env_args(gym.make("BipedalWalker-v3"), if_print=False)

{'env_name': 'BipedalWalker-v3',
 'num_envs': 1,
 'max_step': 1600,
 'state_dim': 24,
 'action_dim': 4,
 'if_discrete': False}
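
A hand-written `env_args` dictionary is easy to get subtly wrong, so it can help to compare its keys against the ones reported above before passing it on. The sketch below is plain Python and does not call ElegantRL. The helper name `missing_env_keys` and the sample dictionary with a deliberately misspelled key are hypothetical, for illustration only.

```python
# Keys reported by get_gym_env_args for BipedalWalker-v3 (copied from the output above).
REPORTED_ENV_INFO = {
    "env_name": "BipedalWalker-v3",
    "num_envs": 1,
    "max_step": 1600,
    "state_dim": 24,
    "action_dim": 4,
    "if_discrete": False,
}


def missing_env_keys(env_args: dict, reference: dict = REPORTED_ENV_INFO) -> list:
    """Return the reference keys that are absent from env_args (hypothetical helper)."""
    return sorted(key for key in reference if key not in env_args)


# A hand-written dictionary with a deliberately misspelled key ("env_num").
candidate = {
    "env_num": 1,
    "env_name": "BipedalWalker-v3",
    "max_step": 1600,
    "state_dim": 24,
    "action_dim": 4,
    "if_discrete": False,
}

print(missing_env_keys(candidate))          # ['num_envs']
print(missing_env_keys(REPORTED_ENV_INFO))  # []
```

Extra keys such as `target_return` are not flagged by this check; it only reports reference keys that are missing.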

# **Part 4: Specify Agent and Environment**

*   **agent**: the agent (DRL algorithm) to use, chosen from the set of agents in the [directory](https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/elegantrl/agents). This tutorial uses `AgentPPO`.
*   **env_func**: the function that creates the environment. Here, `gym.make` creates BipedalWalker-v3.
*   **env_args**: the environment information.


In [8]:
env_func = gym.make
env_args = {
    "num_envs": 1,  # same key name as in the get_gym_env_args output above
    "env_name": "BipedalWalker-v3",
    "max_step": 1600,
    "state_dim": 24,
    "action_dim": 4,
    "if_discrete": False,
    "target_return": 300,
    "id": "BipedalWalker-v3",
}
# env = build_env(env_class=env_func, env_args=env_args)
args = Config(AgentPPO, env_class=env_func, env_args=env_args)

# **Part 5: Specify hyper-parameters**
A list of hyper-parameters is available [here](https://elegantrl.readthedocs.io/en/latest/api/config.html).

In [9]:
args.target_step = args.max_step * 4  # 4 * 1600 = 6400 steps
args.gamma = 0.98  # discount factor for future rewards
args.eval_times = 2**2  # 4 episodes per evaluation
args.repeat_times = 8  # how many times the collected data is reused per update

# **Part 6: Train and Evaluate the Agent**






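Before launching a run, it can help to see what the discount factor chosen for this tutorial (`args.gamma = 0.98`) implies. With discount factor γ, a reward received k steps in the future is weighted by γ^k, and these weights sum to 1 / (1 - γ), which is often read as an effective planning horizon. The sketch below is plain Python and independent of ElegantRL; the function names are illustrative.

```python
def discount_weight(gamma: float, k: int) -> float:
    """Weight applied to a reward received k steps in the future."""
    return gamma ** k


def effective_horizon(gamma: float) -> float:
    """Sum of the geometric series 1 + gamma + gamma^2 + ... = 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards, gamma: float) -> float:
    """Discounted sum of a reward sequence, accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


gamma = 0.98  # value used in this tutorial
print(round(effective_horizon(gamma), 2))       # about 50 steps
print(round(discount_weight(gamma, 100), 3))    # weight of a reward 100 steps ahead
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With γ = 0.98 the horizon is roughly 50 steps, which is short relative to the 1600-step episode limit of BipedalWalker-v3.
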
In [10]:
train_agent(args)

| Arguments Remove cwd: ./BipedalWalker-v3_PPO_0
| Evaluator:
| `step`: Number of samples, or total training steps, or running times of `env.step()`.
| `time`: Time spent from the start of training to this moment.
| `avgR`: Average value of cumulative rewards, which is the sum of rewards in an episode.
| `stdR`: Standard dev of cumulative rewards, which is the sum of rewards in an episode.
| `avgS`: Average of steps in an episode.
| `objC`: Objective of Critic network. Or call it loss function of critic network.
| `objA`: Objective of Actor network. It is the average Q value of the critic network.
################################################################################
ID     Step    Time |    avgR   stdR   avgS  stdS |    expR   objC   objA   etc.
0  2.05e+03       4 | -105.90    6.8    160     5 |   -5.64   1.22   0.06  -0.00
0  2.25e+04      40 | -101.19    0.3    156    32 |   -5.63   0.91   0.06  -0.00
0  4.30e+04      77 | -105.62    0.2    142     5 |   -5.65   1.96   0.06  -0.00
0  6.35e+04     116 | -106.94    0.1     96     2 |   -5.63   0.06   0.07  -0.00
0  8.40e+04     155 |  -76.43    0.8   1600     0 |   -5.69   0.08   0.05  -0.00
0  1.04e+05     199 |  -72.13    0.1   1600     0 |   -5.62   0.07   0.05  -0.01
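
To work with these rows programmatically, for example to plot `avgR` against `Step`, the whitespace-separated columns can be parsed with plain Python. The sketch below assumes the column layout shown in the header line of the log and uses two rows from this run as sample input. `parse_log_row` is a hypothetical helper, not an ElegantRL function.

```python
# Column names taken from the header line of the evaluator log.
COLUMNS = ["ID", "Step", "Time", "avgR", "stdR", "avgS", "stdS", "expR", "objC", "objA", "etc"]


def parse_log_row(line: str) -> dict:
    """Split one evaluator row into a {column: float} dict, ignoring the '|' separators."""
    values = [float(token) for token in line.replace("|", " ").split()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return dict(zip(COLUMNS, values))


sample_rows = [
    "0  2.05e+03       4 | -105.90    6.8    160     5 |   -5.64   1.22   0.06  -0.00",
    "0  1.04e+05     199 |  -72.13    0.1   1600     0 |   -5.62   0.07   0.05  -0.01",
]

records = [parse_log_row(row) for row in sample_rows]
for record in records:
    # Print the training step count next to the average episode return.
    print(int(record["Step"]), record["avgR"])
```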


Understanding the above results:
*   **Step**: the total number of training steps.
*   **avgR**: the average cumulative reward per episode.
*   **stdR**: the standard deviation of the cumulative rewards.
*   **avgS**: the average number of steps in an episode.
*   **objC**: the objective function value (loss) of the Critic Network (Value Network).
*   **objA**: the objective function value of the Actor Network (Policy Network).
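
As a concrete illustration of `avgR`, `stdR`, and `avgS`, the sketch below computes them from a handful of evaluation episodes using only the standard library. The episode data is made up for illustration, and whether ElegantRL uses the population or the sample standard deviation is not stated in this tutorial, so the population form here is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical evaluation episodes: (cumulative reward, number of steps).
episodes = [(-100.0, 150), (-110.0, 170), (-105.0, 160), (-105.0, 160)]

returns = [episode_return for episode_return, _ in episodes]
steps = [episode_steps for _, episode_steps in episodes]

avg_r = mean(returns)    # average cumulative reward per episode
std_r = pstdev(returns)  # spread of the cumulative rewards (population std, an assumption)
avg_s = mean(steps)      # average episode length in steps

print(f"avgR={avg_r:.2f} stdR={std_r:.2f} avgS={avg_s:.0f}")
```

A small `stdR` alongside a low `avgR`, as in the early rows of the log, suggests the agent is failing in a consistent way rather than occasionally succeeding.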