
Reinforcement Learning: Twin Delayed Deep Deterministic Policy Gradient #3512

Merged: 2 commits merged into mlpack:master from the td3 branch on Jul 21, 2023

Conversation

tareknaser (Member)

Description

This pull request implements the TD3 (Twin Delayed Deep Deterministic Policy Gradient) algorithm, along with 2 test cases.

Implementation details

TD3 (Twin Delayed Deep Deterministic Policy Gradient) is a reinforcement learning algorithm designed for continuous action spaces. It builds upon DDPG and introduces twin critics and delayed updates to improve stability and performance.

Implemented 6 networks:

  • policyNetwork (actor network)
  • targetPNetwork (target actor network)
  • learningQ1Network (first critic network)
  • targetQ1Network (first target critic network)
  • learningQ2Network (second critic network)
  • targetQ2Network (second target critic network)
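
To make the roles of these networks concrete, here is a minimal, self-contained sketch of the TD3 critic target for a single transition. The numeric values are placeholders standing in for network outputs; this is an illustrative C++ snippet, not code from the mlpack implementation.

```cpp
#include <algorithm>
#include <iostream>
#include <random>

int main()
{
  std::mt19937 rng(42);

  // Target policy smoothing: perturb the target actor's action with clipped noise.
  const double noiseClip = 0.5, actionLow = -2.0, actionHigh = 2.0;
  std::normal_distribution<double> noiseDist(0.0, 0.2);
  const double targetAction = 1.3;   // stand-in for targetPNetwork(nextState)
  const double noise = std::clamp(noiseDist(rng), -noiseClip, noiseClip);
  const double smoothedAction = std::clamp(targetAction + noise, actionLow, actionHigh);

  // Clipped double-Q learning: both target critics evaluate the smoothed action,
  // and the smaller value is used, which curbs overestimation.
  const double targetQ1 = -4.1;      // stand-in for targetQ1Network(nextState, smoothedAction)
  const double targetQ2 = -3.7;      // stand-in for targetQ2Network(nextState, smoothedAction)
  const double reward = -0.8, discount = 0.99;
  const bool terminal = false;
  const double y = reward + (terminal ? 0.0 : discount * std::min(targetQ1, targetQ2));

  // learningQ1Network and learningQ2Network both regress toward y; the actor
  // (policyNetwork) and the target networks are only updated every few critic
  // updates (the "delayed" part of TD3).
  std::cout << "smoothed action: " << smoothedAction
            << ", critic target: " << y << std::endl;
}
```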

How Has This Been Tested?

  • Included a Pendulum test that passes with different configuration values:
    • TargetNetworkSyncInterval = 1 → -1081.52
    • TargetNetworkSyncInterval = 2 → -508.788
    • TargetNetworkSyncInterval = 3 → -1209.31
  • Additionally, added a test for continuous action spaces, which also passes.

The networks used in the two tests are the same as those used in the DDPG and SAC tests, so the results can be compared.
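
For context, the Pendulum test roughly follows the pattern of mlpack's existing SAC/DDPG tests. The sketch below is my approximation of that setup, not a copy of the merged test: the class and method names (FFN, EmptyLoss, GaussianInitialization, RandomReplay, TrainingConfig, TD3, Pendulum) follow mlpack's RL API as I remember it, and the layer sizes and hyperparameter values are illustrative assumptions.

```cpp
#include <mlpack.hpp>
#include <iostream>

using namespace mlpack;

int main()
{
  // Actor: state -> action; TanH keeps the output in a bounded range.
  FFN<EmptyLoss, GaussianInitialization> policyNetwork(
      EmptyLoss(), GaussianInitialization(0, 0.1));
  policyNetwork.Add<Linear>(128);
  policyNetwork.Add<ReLU>();
  policyNetwork.Add<Linear>(1);
  policyNetwork.Add<TanH>();

  // Critic: (state, action) -> Q value; TD3 keeps two of these plus targets.
  FFN<EmptyLoss, GaussianInitialization> qNetwork(
      EmptyLoss(), GaussianInitialization(0, 0.1));
  qNetwork.Add<Linear>(128);
  qNetwork.Add<ReLU>();
  qNetwork.Add<Linear>(1);

  // Replay buffer and training configuration (values are illustrative).
  RandomReplay<Pendulum> replayMethod(32, 10000);
  TrainingConfig config;
  config.StepSize() = 0.01;
  config.DiscountFactor() = 0.99;
  config.TargetNetworkSyncInterval() = 1;   // the value varied in the tests above
  config.UpdateInterval() = 3;

  TD3<Pendulum, decltype(qNetwork), decltype(policyNetwork), ens::AdamUpdate>
      agent(config, qNetwork, policyNetwork, replayMethod);

  // Run a few episodes; the unit test instead stops once the environment is solved.
  for (size_t i = 0; i < 5; ++i)
    std::cout << "episode return: " << agent.Episode() << std::endl;
}
```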

Signed-off-by: Tarek <tareknaser360@gmail.com>
@tareknaser (Member, Author)

I tested a Python implementation of TD3 with two different networks (all other hyperparameters the same) on Gym Pendulum-v1, and these are the results I got:

1. With 2 hidden layers: score -723.92
2. With 3 hidden layers: score -910.24

This is similar to what I saw in the unit tests. But if I train them for longer I get very different results, and the bigger network performs better; it is a bit flaky, and there are a lot of hyperparameters.
The unit tests use the testAgent function, which stops as soon as the environment is solved, so they are only meant to validate that the agent can solve the environment; the TD3 agent does so with different network architectures.
We can further train this in the examples repository.
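
As a rough illustration of that stop-when-solved idea, something like the loop below is what the tests rely on. This is only a sketch of the concept; the helper name, threshold, and window size are assumptions, not mlpack's actual testAgent code.

```cpp
#include <deque>
#include <iostream>
#include <numeric>

// Train until a running average of episode returns clears a threshold, or give up.
template<typename AgentType>
bool TrainUntilSolved(AgentType& agent,
                      const double threshold,    // e.g. an average return considered "solved"
                      const size_t windowSize,   // number of episodes in the running average
                      const size_t maxEpisodes)
{
  std::deque<double> returns;
  for (size_t episode = 0; episode < maxEpisodes; ++episode)
  {
    returns.push_back(agent.Episode());
    if (returns.size() > windowSize)
      returns.pop_front();

    const double average =
        std::accumulate(returns.begin(), returns.end(), 0.0) / returns.size();
    std::cout << "episode " << episode << ", average return " << average << "\n";

    // Stop as soon as the running average clears the threshold: the test only
    // checks that the agent can solve the environment, not how well it converges.
    if (returns.size() == windowSize && average > threshold)
      return true;
  }
  return false;
}
```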

@zoq (Member) left a comment:

All good from my side, tested it locally as well, without any test failures.

@mlpack-bot (bot) left a comment:

Second approval provided automatically after 24 hours. 👍

@zoq merged commit 24ad24b into mlpack:master on Jul 21, 2023
9 of 17 checks passed
@tareknaser deleted the td3 branch on July 26, 2023 at 16:41
@rcurtin mentioned this pull request on Sep 5, 2023