
Reinforcement Learning: Twin Delayed Deep Deterministic Policy Gradient #3512

Merged: 2 commits merged into mlpack:master from the td3 branch on Jul 21, 2023

Conversation

tareknaser (Member)

Description

This pull request implements the TD3 (Twin Delayed Deep Deterministic Policy Gradient) algorithm, along with 2 test cases.

Implementation details

TD3 (Twin Delayed Deep Deterministic Policy Gradient) is a reinforcement learning algorithm designed for continuous action spaces. It builds upon DDPG and introduces twin critics and delayed updates to improve stability and performance.

Implemented 6 networks:

  • policyNetwork (actor network)
  • targetPNetwork (target actor network)
  • learningQ1Network (first critic network)
  • targetQ1Network (first target critic network)
  • learningQ2Network (second critic network)
  • targetQ2Network (second target critic network)
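
To make the roles of these networks concrete, here is a minimal, self-contained sketch of the TD3 critic target for a single transition. The numeric values are placeholders standing in for network outputs; this is an illustrative C++ snippet, not code from the mlpack implementation.

```cpp
#include <algorithm>
#include <iostream>
#include <random>

int main()
{
  std::mt19937 rng(42);

  // Target policy smoothing: perturb the target actor's action with clipped noise.
  const double noiseClip = 0.5, actionLow = -2.0, actionHigh = 2.0;
  std::normal_distribution<double> noiseDist(0.0, 0.2);
  const double targetAction = 1.3;   // stand-in for targetPNetwork(nextState)
  const double noise = std::clamp(noiseDist(rng), -noiseClip, noiseClip);
  const double smoothedAction = std::clamp(targetAction + noise, actionLow, actionHigh);

  // Clipped double-Q learning: both target critics evaluate the smoothed action,
  // and the smaller value is used, which curbs overestimation.
  const double targetQ1 = -4.1;      // stand-in for targetQ1Network(nextState, smoothedAction)
  const double targetQ2 = -3.7;      // stand-in for targetQ2Network(nextState, smoothedAction)
  const double reward = -0.8, discount = 0.99;
  const bool terminal = false;
  const double y = reward + (terminal ? 0.0 : discount * std::min(targetQ1, targetQ2));

  // learningQ1Network and learningQ2Network both regress toward y; the actor
  // (policyNetwork) and the target networks are only updated every few critic
  // updates (the "delayed" part of TD3).
  std::cout << "smoothed action: " << smoothedAction
            << ", critic target: " << y << std::endl;
}
```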

How Has This Been Tested?

  • Included a Pendulum test that passes with different configuration values:
    • TargetNetworkSyncInterval = 1 → -1081.52
    • TargetNetworkSyncInterval = 2 → -508.788
    • TargetNetworkSyncInterval = 3 → -1209.31
  • Additionally, added a test for continuous action spaces, which also passes.

The networks used in the two tests are the same as those used in the DDPG and SAC tests, so the results can be compared.
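
For context, the Pendulum test roughly follows the pattern of mlpack's existing SAC/DDPG tests. The sketch below is my approximation of that setup, not a copy of the merged test: the class and method names (FFN, EmptyLoss, GaussianInitialization, RandomReplay, TrainingConfig, TD3, Pendulum) follow mlpack's RL API as I remember it, and the layer sizes and hyperparameter values are illustrative assumptions.

```cpp
#include <mlpack.hpp>
#include <iostream>

using namespace mlpack;

int main()
{
  // Actor: state -> action; TanH keeps the output in a bounded range.
  FFN<EmptyLoss, GaussianInitialization> policyNetwork(
      EmptyLoss(), GaussianInitialization(0, 0.1));
  policyNetwork.Add<Linear>(128);
  policyNetwork.Add<ReLU>();
  policyNetwork.Add<Linear>(1);
  policyNetwork.Add<TanH>();

  // Critic: (state, action) -> Q value; TD3 keeps two of these plus targets.
  FFN<EmptyLoss, GaussianInitialization> qNetwork(
      EmptyLoss(), GaussianInitialization(0, 0.1));
  qNetwork.Add<Linear>(128);
  qNetwork.Add<ReLU>();
  qNetwork.Add<Linear>(1);

  // Replay buffer and training configuration (values are illustrative).
  RandomReplay<Pendulum> replayMethod(32, 10000);
  TrainingConfig config;
  config.StepSize() = 0.01;
  config.DiscountFactor() = 0.99;
  config.TargetNetworkSyncInterval() = 1;   // the value varied in the tests above
  config.UpdateInterval() = 3;

  TD3<Pendulum, decltype(qNetwork), decltype(policyNetwork), ens::AdamUpdate>
      agent(config, qNetwork, policyNetwork, replayMethod);

  // Run a few episodes; the unit test instead stops once the environment is solved.
  for (size_t i = 0; i < 5; ++i)
    std::cout << "episode return: " << agent.Episode() << std::endl;
}
```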

Signed-off-by: Tarek <tareknaser360@gmail.com>
@tareknaser (Member, Author)

I tested a Python implementation of TD3 with two different networks (all other hyperparameters the same) on Gym Pendulum-v1, and these are the results I got:

1. With 2 hidden layers: score -723.92
2. With 3 hidden layers: score -910.24

This is similar to what I saw in the unit tests. But if I train them for longer I get very different results, and the bigger network performs better; it is a bit flaky, and there are a lot of hyperparameters.
The unit tests use the testAgent function, which stops as soon as the environment is solved, so they are only meant to validate that the agent can solve the environment; the TD3 agent does so with different network architectures.
We can further train this in the examples repository.
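
As a rough illustration of that stop-when-solved idea, something like the loop below is what the tests rely on. This is only a sketch of the concept; the helper name, threshold, and window size are assumptions, not mlpack's actual testAgent code.

```cpp
#include <deque>
#include <iostream>
#include <numeric>

// Train until a running average of episode returns clears a threshold, or give up.
template<typename AgentType>
bool TrainUntilSolved(AgentType& agent,
                      const double threshold,    // e.g. an average return considered "solved"
                      const size_t windowSize,   // number of episodes in the running average
                      const size_t maxEpisodes)
{
  std::deque<double> returns;
  for (size_t episode = 0; episode < maxEpisodes; ++episode)
  {
    returns.push_back(agent.Episode());
    if (returns.size() > windowSize)
      returns.pop_front();

    const double average =
        std::accumulate(returns.begin(), returns.end(), 0.0) / returns.size();
    std::cout << "episode " << episode << ", average return " << average << "\n";

    // Stop as soon as the running average clears the threshold: the test only
    // checks that the agent can solve the environment, not how well it converges.
    if (returns.size() == windowSize && average > threshold)
      return true;
  }
  return false;
}
```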

@zoq (Member) left a comment:

All good from my side, tested it locally as well, without any test failures.

@mlpack-bot (bot) left a comment:

Second approval provided automatically after 24 hours. 👍

@zoq merged commit 24ad24b into mlpack:master on Jul 21, 2023
9 of 17 checks passed
@tareknaser deleted the td3 branch on July 26, 2023 at 16:41
@rcurtin mentioned this pull request on Sep 5, 2023