<a href="https://colab.research.google.com/github/AI4Finance-LLC/ElegantRL/blob/master/BipedalWalker_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BipedalWalker-v3 Example in ElegantRL**






# **Part 1: Testing Task Description**

[BipedalWalker-v3](https://gym.openai.com/envs/BipedalWalker-v2/) is a classic robotics task because it exercises one of the most fundamental skills: locomotion. The goal is to make a 2D bipedal walker cross rough terrain. BipedalWalker is a difficult task with a continuous action space, and only a few RL implementations reach the target reward.

In [None]:
from IPython.display import HTML
HTML(f"""<video src={"https://gym.openai.com/videos/2019-10-21--mqt8Qj1mwo/BipedalWalker-v2/original.mp4"} width=500 controls/>""") # the random demonstration of the task from OpenAI Gym

# **Part 2: Install ElegantRL**

In [1]:
# install elegantrl library
!pip install git+https://github.com/AI4Finance-LLC/ElegantRL.git

Collecting git+https://github.com/AI4Finance-LLC/ElegantRL.git
  Cloning https://github.com/AI4Finance-LLC/ElegantRL.git to /tmp/pip-req-build-yykg2fyl
  Running command git clone -q https://github.com/AI4Finance-LLC/ElegantRL.git /tmp/pip-req-build-yykg2fyl
Collecting pybullet
  Downloading https://files.pythonhosted.org/packages/e6/9c/7b76db10cdaa69c840b211fe21ce6f31fb80b611b198fe18a64ddb8f374e/pybullet-3.1.0-cp37-cp37m-manylinux1_x86_64.whl (88.7MB)
     |████████████████████████████████| 88.7MB 65kB/s 
Collecting box2d-py
  Downloading https://files.pythonhosted.org/packages/87/34/da5393985c3ff9a76351df6127c275dcb5749ae0abbe8d5210f06d97405d/box2d_py-2.3.8-cp37-cp37m-manylinux1_x86_64.whl (448kB)
     |████████████████████████████████| 450kB 43.1MB/s 
Building wheels for collected packages: elegantrl
  Building wheel for elegantrl (setup.py) ... done
  Created wheel for elegantrl: filename=elegantrl-0.3.1-cp37-none-any.whl size=35699 sha256=a9b4488eda0a

# **Part 3: Import Packages**


*   **elegantrl**
*   **OpenAI Gym**: a toolkit for developing and comparing reinforcement learning algorithms.
*   **PyBullet Gym**: an open-source implementation of the OpenAI Gym MuJoCo environments.



In [2]:
from elegantrl.run import *
from elegantrl.agent import AgentPPO
from elegantrl.env import PreprocessEnv
import gym
gym.logger.set_level(40)  # suppress gym warnings (40 is the ERROR level)

# **Part 4: Specify Agent and Environment**

*   **args.agent**: selects the DRL algorithm to use; any agent defined in agent.py can be chosen.
*   **args.env**: creates and preprocesses the environment; users can either supply a custom environment or preprocess an OpenAI Gym or PyBullet Gym environment with the tools in env.py.


> Before finishing the initialization of **args**, see Arguments() in run.py for details on the adjustable hyper-parameters.




In [4]:
args = Arguments(if_on_policy=True)  # PPO is an on-policy algorithm
args.agent = AgentPPO()  # off-policy alternatives (with if_on_policy=False): AgentSAC(), AgentTD3(), AgentDDPG()
args.env = PreprocessEnv(env=gym.make('BipedalWalker-v3'))
args.reward_scale = 2 ** -1  # RewardRange: -200 < -150 < 300 < 334
args.gamma = 0.95
args.rollout_num = 2 # the number of rollout workers (larger is not always faster)


| env_name:  BipedalWalker-v3, action space if_discrete: False
| state_dim:   24, action_dim: 4, action_max: 1.0
| max_step:  1600, target_reward: 300


# **Part 5: Train and Evaluate the Agent**

> Training and evaluation both run inside the function **train_and_evaluate_mp()**, whose only parameter is **args**. **args** includes the fundamental objects in DRL:

*   agent,
*   environment.

> **args** also includes the parameters that control training:

*   batch_size,
*   target_step,
*   reward_scale,
*   gamma, etc.

> It also includes the parameters that control evaluation:

*   break_step,
*   random_seed, etc.
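
As an illustration of how parameters such as `break_step` and the target reward can bound a training run, the following is a minimal sketch of a stopping rule in plain Python. It is not ElegantRL's implementation: `should_stop` and `run_until_done` are hypothetical helpers, the default `break_step` value is an assumption, and the evaluation scores are toy numbers.

```python
def should_stop(avg_reward, total_step, target_reward, break_step):
    """Stop once the score reaches the target or the step budget is used up."""
    return avg_reward >= target_reward or total_step >= break_step


def run_until_done(eval_scores, steps_per_eval, target_reward=300, break_step=int(1e6)):
    """Replay a list of evaluation scores and report when training would stop.

    break_step=1e6 is an assumed default for this sketch only.
    """
    total_step = 0
    for score in eval_scores:
        total_step += steps_per_eval  # steps collected between evaluations
        if should_stop(score, total_step, target_reward, break_step):
            reason = "target_reward" if score >= target_reward else "break_step"
            return total_step, reason
    return total_step, "scores_exhausted"


toy_scores = [-92.1, 50.0, 310.0, 320.0]  # illustrative scores only
print(run_until_done(toy_scores, steps_per_eval=4096))
print(run_until_done(toy_scores, steps_per_eval=4096, break_step=8000))
```

The first call stops because a score passes the target of 300; the second stops earlier because the assumed step budget is exhausted first.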






In [None]:
train_and_evaluate_mp(args) # the training process will terminate once it reaches the target reward.

| multiprocessing, act_workers: 2
| multiprocessing, None:
| GPU id: 0, cwd: ./AgentPPO/BipedalWalker-v3_0
| Remove history
ID      Step      MaxR |    avgR      stdR       objA      objC
0   0.00e+00    -92.10 |


Understanding the above results:
*   **Step**: the total number of training steps.
*   **MaxR**: the maximum reward.
*   **avgR**: the average of the rewards.
*   **stdR**: the standard deviation of the rewards.
*   **objA**: the objective function value of the Actor Network (Policy Network).
*   **objC**: the objective function value (Q-value) of the Critic Network (Value Network).
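
The reward columns can be reproduced from a batch of evaluation episodes. The sketch below uses only the Python standard library; `summarize_returns` is a hypothetical helper, the episode returns are toy values, and it assumes that MaxR tracks the best average return seen so far, which may differ from how the library computes it.

```python
import statistics


def summarize_returns(episode_returns, best_so_far=float("-inf")):
    """Compute (avgR, stdR, MaxR) for one evaluation round.

    Assumption: MaxR is the best average return observed across rounds.
    """
    avg_r = statistics.mean(episode_returns)    # avgR
    std_r = statistics.pstdev(episode_returns)  # stdR (population std)
    max_r = max(best_so_far, avg_r)             # MaxR under the assumption above
    return avg_r, std_r, max_r


toy_returns = [280.0, 300.0, 320.0]  # illustrative episode returns
avg_r, std_r, max_r = summarize_returns(toy_returns, best_so_far=-92.10)
print(f"MaxR {max_r:8.2f} | avgR {avg_r:8.2f}  stdR {std_r:8.2f}")
```

A small stdR relative to avgR suggests the policy behaves consistently across evaluation episodes.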