
trpo_try #567

Closed

Conversation

ShangYizhan

Description

Linked issue(s)/Pull request(s)

Type of Change

  • Non-breaking bug fix
  • Breaking bug fix
  • New feature
  • Test
  • Doc update
  • Docker update

Related Component

  • Simulation toolkit
  • RL toolkit
  • Distributed toolkit

Has Been Tested

  • OS:
    • Windows
    • Mac OS
    • Linux
  • Python version:
    • 3.7
    • 3.8
    • 3.9
  • Key information snapshot(s):

Needs Follow Up Actions

  • New release package
  • New docker image

Checklist

  • Add/update the related comments
  • Add/update the related tests
  • Add/update the related documentations
  • Update the dependent downstream modules usage

@@ -1,7 +1,7 @@
 # Copyright (c) Microsoft Corporation.
 # Licensed under the MIT license.

-from .rl_component_bundle import rl_component_bundle
+from rl_component_bundle import rl_component_bundle
Contributor
Revert this change; otherwise `run_rl_example.py` won't work.

@@ -8,7 +8,7 @@
 from maro.rl.rollout import AbsEnvSampler, CacheElement
 from maro.simulator.scenarios.cim.common import Action, ActionType, DecisionEvent

-from .config import action_shaping_conf, port_attributes, reward_shaping_conf, state_shaping_conf, vessel_attributes
+from config import action_shaping_conf, port_attributes, reward_shaping_conf, state_shaping_conf, vessel_attributes
Contributor
Revert this change; otherwise `run_rl_example.py` won't work.

from .algorithms.ppo import get_ppo, get_ppo_policy
from examples.cim.rl.config import action_num, algorithm, env_conf, reward_shaping_conf, state_dim
from examples.cim.rl.env_sampler import CIMEnvSampler
from algorithms.ac import get_ac, get_ac_policy
Contributor
Revert this change; otherwise `run_rl_example.py` won't work.

@@ -383,3 +383,5 @@ def to_device(self, device: torch.device) -> None:
     def _to_device_impl(self, device: torch.device) -> None:
         """Implementation of `to_device`."""
         raise NotImplementedError
+
+
Contributor
Remove the unnecessary blank lines (you may run `pre-commit run --all` to auto-format).

@@ -6,6 +6,7 @@
 from .dqn import DQNParams, DQNTrainer
 from .maddpg import DiscreteMADDPGParams, DiscreteMADDPGTrainer
 from .ppo import DiscretePPOWithEntropyTrainer, PPOParams, PPOTrainer
+from .trpo import *
Contributor
Do not use `from XXX import *`, since it is ambiguous.
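For example, the wildcard could be replaced with explicit names; `TRPOParams` and `TRPOTrainer` below are assumed names inferred from the `DQNParams`/`DQNTrainer` naming pattern, not identifiers confirmed by this PR.

# Explicit imports keep the package namespace unambiguous.
# TRPOParams and TRPOTrainer are assumed names following the DQN/PPO pattern.
from .trpo import TRPOParams, TRPOTrainer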

        # mask -> a vector of ones, one entry per action
        batch = self._get_batch()
        # trpo_main.update_params(batch)
        for _ in range(self._params.grad_iters):
Contributor
According to the pseudocode, should we update the actor first, then update the critic?
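For reference, a minimal sketch of that ordering (actor step first, using advantages from the current critic, then the critic fit). The helper names `_update_actor` and `_update_critic` are assumptions, not methods from this PR.

    def train_step(self) -> None:
        batch = self._get_batch()
        # Suggested ordering: take the constrained policy (actor) step using
        # advantages computed from the current critic, then refit the critic.
        self._update_actor(batch)  # assumed helper wrapping the actor update
        for _ in range(self._params.grad_iters):
            self._update_critic(batch)  # assumed helper wrapping the critic update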

        Args:
            batch (TransitionBatch): Batch.
        """
        self._v_critic_net.step(self._get_critic_loss(batch))
Contributor
Call `self._v_critic_net.train()` before updating the critic net.
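A minimal sketch of the suggested fix, assuming the update is wrapped in a helper named `_update_critic`:

    def _update_critic(self, batch: TransitionBatch) -> None:
        # Switch the critic network to training mode before the gradient step.
        self._v_critic_net.train()
        self._v_critic_net.step(self._get_critic_loss(batch))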

"""
loss = self._get_actor_loss(batch)

self._policy.train_step(loss)
Contributor
Call `self._policy.train()` before updating the policy.
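Likewise for the policy side, a sketch assuming a helper named `_update_actor`:

    def _update_actor(self, batch: TransitionBatch) -> None:
        # Switch the policy to training mode before applying the gradient step.
        self._policy.train()
        loss = self._get_actor_loss(batch)
        self._policy.train_step(loss)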

    def _get_actor_loss(self, batch: TransitionBatch):
        assert isinstance(self._policy, DiscretePolicyGradient) or isinstance(self._policy, ContinuousRLPolicy)
        self._policy.train()
        rewards = ndarray_to_tensor(batch.rewards)
Contributor
Specify the device in `ndarray_to_tensor`.
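A sketch of the suggested call, assuming `ndarray_to_tensor` accepts a device argument and that the trainer's device is held in `self._device` (set via `to_device`):

        # Create the tensor directly on the trainer's device to avoid mixing
        # CPU and GPU tensors in the loss computation.
        rewards = ndarray_to_tensor(batch.rewards, device=self._device)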

        prev_value = 0
        prev_advantage = 0

        for i in reversed(range(rewards.size(0))):
Contributor
Should this be done in `preprocess_batch`?
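For context, a sketch of moving the return/advantage recursion into a batch-preprocessing step. The method name `_preprocess_batch`, the attributes `self._reward_discount` and `self._lam`, the `v_values` call, and the batch fields `returns`/`advantages` are assumptions here; numpy is assumed to be imported as `np`, and the recursion ignores episode boundaries just like the snippet above.

    def _preprocess_batch(self, batch: TransitionBatch) -> TransitionBatch:
        # Compute discounted returns and GAE-style advantages once per batch,
        # before the gradient iterations, rather than inside the loss function.
        values = self._v_critic_net.v_values(
            ndarray_to_tensor(batch.states, device=self._device)
        ).detach().cpu().numpy()
        returns = np.zeros_like(batch.rewards)
        advantages = np.zeros_like(batch.rewards)
        prev_return = prev_value = prev_advantage = 0.0
        for i in reversed(range(len(batch.rewards))):
            returns[i] = batch.rewards[i] + self._reward_discount * prev_return
            delta = batch.rewards[i] + self._reward_discount * prev_value - values[i]
            advantages[i] = delta + self._reward_discount * self._lam * prev_advantage
            prev_return, prev_value, prev_advantage = returns[i], values[i], advantages[i]
        batch.returns = returns
        batch.advantages = advantages
        return batch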

Jinyu-W deleted the branch microsoft:refine_rl_component_bundle on December 27, 2022 08:28
Jinyu-W closed this on Dec 27, 2022
ShangYizhan deleted the maro_trpo branch on January 16, 2023 02:33
Labels: None yet
Projects: None yet
Linked issues: None yet
3 participants