feature(zjow): add Implicit Q-Learning #821

Open
wants to merge 6 commits into base: main
Conversation

zjowowen (Collaborator)
Add Implicit Q-Learning (IQL) algorithm.

@zjowowen added the algo (Add new algorithm or improve old one) label on Jul 29, 2024
@PaParaZz1 changed the title from "feature(zjow): Add Implicit Q-Learning" to "feature(zjow): add Implicit Q-Learning" on Jul 29, 2024
),
collect=dict(data_type='d4rl', ),
eval=dict(evaluator=dict(eval_freq=5000, )),
other=dict(replay_buffer=dict(replay_buffer_size=2000000, ), ),
Member:

Why is a replay buffer configured here? IQL trains offline from the fixed D4RL dataset.
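
A sketch of what this comment seems to ask for (field names copied from the diff above; dropping the buffer is a suggestion, not the author's change): since the pipeline reads a fixed D4RL dataset, the other.replay_buffer block can simply be removed.

collect=dict(data_type='d4rl', ),             # transitions are read from the fixed D4RL dataset
eval=dict(evaluator=dict(eval_freq=5000, )),  # evaluate every 5000 training iterations
# no other.replay_buffer entry: offline training samples batches directly from the dataset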

config = Path(__file__).absolute().parent.parent / 'config' / args.config
config = read_config(str(config))
config[0].exp_name = config[0].exp_name.replace('0', str(args.seed))
serial_pipeline_offline(config, seed=args.seed)
Member:

Why not add max_train_iter here?
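
A minimal sketch of the suggestion, assuming serial_pipeline_offline accepts a max_train_iter keyword like the other DI-engine serial pipelines and that the script exposes a matching command-line argument (args.max_train_iter is hypothetical here):

config = Path(__file__).absolute().parent.parent / 'config' / args.config
config = read_config(str(config))
config[0].exp_name = config[0].exp_name.replace('0', str(args.seed))
# bound the number of training iterations instead of running indefinitely
serial_pipeline_offline(config, seed=args.seed, max_train_iter=args.max_train_iter)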

@@ -114,6 +114,38 @@ def __init__(self, cfg: dict) -> None:
except (KeyError, AttributeError):
# do not normalize
pass
if hasattr(cfg.env, "reward_norm"):
if cfg.env.reward_norm == "normalize":
dataset['rewards'] = (dataset['rewards'] - dataset['rewards'].mean()) / dataset['rewards'].std()
Member:

Add an eps to the std denominator to avoid division by zero.
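
One way to apply the suggestion (the eps value 1e-8 is an arbitrary choice, not from the PR): add a small constant to the standard deviation so a constant-reward dataset does not divide by zero.

if hasattr(cfg.env, "reward_norm"):
    if cfg.env.reward_norm == "normalize":
        eps = 1e-8  # guards against std() == 0 for constant rewards
        dataset['rewards'] = (dataset['rewards'] - dataset['rewards'].mean()) / (dataset['rewards'].std() + eps)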

@@ -0,0 +1,654 @@
from typing import List, Dict, Any, Tuple, Union
Member:

Add this policy to the algorithm table in the README.

# (str type) action_space: Use reparameterization trick for continuous action
action_space='reparameterization',
# (int) Hidden size for actor network head.
actor_head_hidden_size=512,
Member:

Add more detailed comments for each argument.
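
An illustrative rewrite of the two arguments above with fuller comments (the wording is a suggestion, not the author's final text):

# (str) action_space: use the reparameterization trick for continuous actions,
#     i.e. sample a = tanh(mu + sigma * noise) so gradients can flow through the sampled action.
action_space='reparameterization',
# (int) actor_head_hidden_size: width of the hidden layers in the actor network head.
actor_head_hidden_size=512,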

'policy_grad_norm': policy_grad_norm,
}

def _get_policy_actions(self, data: Dict, num_actions: int = 10, epsilon: float = 1e-6) -> List:
Member:

Where is this method used?

# 9. update policy network
self._optimizer_policy.zero_grad()
policy_loss.backward()
policy_grad_norm = torch.nn.utils.clip_grad_norm_(self._model.actor.parameters(), 1)
Member:

Make this clipping threshold a configurable argument instead of hard-coding it.
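
A sketch of the suggestion, assuming a grad_clip_norm field is added to the learn config (the field name is hypothetical):

# 9. update policy network
self._optimizer_policy.zero_grad()
policy_loss.backward()
# clipping threshold read from config instead of the hard-coded value 1
policy_grad_norm = torch.nn.utils.clip_grad_norm_(
    self._model.actor.parameters(), self._cfg.learn.grad_clip_norm
)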

transforms=[TanhTransform(cache_size=1),
AffineTransform(loc=0.0, scale=1.05)]
)
next_action = next_obs_dist.rsample()
Member:

Why rsample rather than sample here?
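
For context on the question: in PyTorch, rsample() draws a reparameterized sample through which gradients can flow back to the distribution parameters, while sample() is treated as non-differentiable. For a target term that is never backpropagated through, the two are numerically equivalent, so sample() (or rsample() under torch.no_grad()) avoids building an unused graph. A minimal illustration:

import torch
from torch.distributions import Normal

mu = torch.zeros(3, requires_grad=True)
dist = Normal(mu, torch.ones(3))
a1 = dist.rsample()  # a1.requires_grad is True: gradients can reach mu
a2 = dist.sample()   # a2.requires_grad is False: detached from the graph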

log_prob = dist.log_prob(action)

eval_data = {'obs': obs, 'action': action}
new_value = self._learn_model.forward(eval_data, mode='compute_critic')
Member:

Maybe you can wrap this in torch.no_grad() here.
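
A sketch of the suggestion: if new_value is only used as a weight or target in the policy loss and no gradient should flow into the critic here, the forward pass can be wrapped to save memory and compute.

eval_data = {'obs': obs, 'action': action}
with torch.no_grad():
    new_value = self._learn_model.forward(eval_data, mode='compute_critic')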

with torch.no_grad():
(mu, sigma) = self._collect_model.forward(data, mode='compute_actor')['logit']
dist = Independent(Normal(mu, sigma), 1)
action = torch.tanh(dist.rsample())
Member:

For an offline RL algorithm, you may opt to leave the collect-related methods empty.
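
One common pattern for this (an illustration of the reviewer's point, not code from the PR; the method names follow the DI-engine Policy interface implied by the snippet above):

def _init_collect(self) -> None:
    # IQL is trained purely offline, so no collect model or exploration noise is needed
    pass

def _forward_collect(self, data: dict, **kwargs) -> dict:
    raise NotImplementedError("IQL is an offline algorithm; online data collection is not supported.")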

Labels
algo (Add new algorithm or improve old one)
Projects
None yet
Development
None yet
2 participants