Question on formula of the continuous action #1
Thanks for this awesome reimplementation! It really gave me a good intuition for adapting the paper to my work. I agree with @dbsxdbsx about the weighting term in the continuous actor loss: mathematically, the current implementation seems to "square" the weighting of the entropy bonus term in the actor loss. I'd like to hear from the code author @nisheeth-golakiya about this. Thank you!
Oh, now I get the implementation. Thus, the original implementation seems to be correct. @dbsxdbsx, you may verify this.
@dbsxdbsx Apologies for my late response. @mch5048 and @dbsxdbsx, I am glad that you're interested in my implementation! @mch5048 has got it right. To get the expected value of a quantity, we usually approximate it by sampling (recall that for continuous actions, we don't have the exact probability value of a given action). But for discrete actions, the probability value of the action is readily available. Hence, instead of sampling, we directly multiply by the probability value. The first section of the appendix of the paper reveals many implementation details. In the following line, `next_state_prob_d` plays exactly that role:

```python
min_qf_next_target = next_state_prob_d * (torch.min(qf1_next_target, qf2_next_target) - alpha * next_state_prob_d * next_state_log_pi_c - alpha_d * next_state_log_pi_d)
```
For the policy loss of the continuous component,

```python
policy_loss_c = (prob_d * (alpha * prob_d * log_pi_c - min_qf_pi)).sum(1).mean()
```

please have a look at the fourth point of the first section of the appendix.
Lastly, please let me know if you need more clarification.
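
For illustration, here is a minimal sketch of the sampling-vs-weighting point above (the tensor names, shapes, and the quantity being averaged are my own assumptions, not code from this repo): the discrete expectation can be computed exactly as a probability-weighted sum, while the continuous expectation is approximated by sampling.

```python
import torch

# Hypothetical shapes: batch of 4 states, 3 discrete actions.
batch_size, n_discrete = 4, 3

# f(s, d): some per-action quantity, e.g. a Q-value estimate (random here).
f_values = torch.randn(batch_size, n_discrete)

# Discrete case: the action probabilities are available in closed form,
# so the expectation is an exact probability-weighted sum (no sampling needed).
probs_d = torch.softmax(torch.randn(batch_size, n_discrete), dim=1)
expectation_discrete = (probs_d * f_values).sum(dim=1)          # shape [batch_size]

# Continuous case: we cannot enumerate the actions, so the expectation of some
# function of the action (here E[a^2]) is approximated by Monte-Carlo sampling.
dist_c = torch.distributions.Normal(torch.zeros(batch_size), torch.ones(batch_size))
samples = dist_c.sample((100,))                                  # 100 samples per state
expectation_continuous = samples.pow(2).mean(dim=0)              # shape [batch_size]
```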
Regarding the calculation of the discrete part of the policy loss, I would like to point out that the third point in the first section of the appendix says it is optimized by minimizing the KL-divergence between the distribution of discrete actions and the softmax of the Q-values with a temperature parameter. I have taken the liberty to modify this calculation according to the implementation found in discrete-SAC. If you try the KL-divergence approach, do share the results :)
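
As an illustration only (not code from this repo; the tensor names, shapes, and the use of `alpha_d` as the softmax temperature are my assumptions), a minimal sketch of a KL-divergence-style discrete policy update could look like this:

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors: policy logits and Q-values for every discrete action,
# both of shape [batch_size, n_discrete].
batch_size, n_discrete = 4, 3
logits_pi = torch.randn(batch_size, n_discrete, requires_grad=True)
q_values = torch.randn(batch_size, n_discrete)   # treated as fixed targets here
alpha_d = 0.2                                    # assumed temperature

log_pi_d = F.log_softmax(logits_pi, dim=1)
# Target distribution: softmax of the Q-values with temperature alpha_d.
target = F.softmax(q_values / alpha_d, dim=1)

# KL(pi_d || target) per state, averaged over the batch.
kl = (log_pi_d.exp() * (log_pi_d - torch.log(target + 1e-8))).sum(dim=1)
policy_loss_d = kl.mean()
policy_loss_d.backward()
```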
Thanks for the clarification. I'll delve into the KL-divergence minimization view of the algorithm soon.
@nisheeth-golakiya @mch5048, thanks for your attention and explanation on this topic.

**critic loss**

First, let's take a look at the original calculation of `min_qf_next_target`:

```python
min_qf_next_target = next_state_prob_d * (torch.min(qf1_next_target, qf2_next_target) - alpha * next_state_prob_d * next_state_log_pi_c - alpha_d * next_state_log_pi_d)
next_q_value = torch.Tensor(s_rewards).to(device) + (1 - torch.Tensor(s_dones).to(device)) * args.gamma * (min_qf_next_target.sum(1)).view(-1)
```

If the env only contained discrete actions, then no doubt it should be:

```python
min_qf_next_target = next_state_prob_d * (torch.min(qf1_next_target, qf2_next_target) - alpha_d * next_state_log_pi_d)
next_q_value = torch.Tensor(s_rewards).to(device) + (1 - torch.Tensor(s_dones).to(device)) * args.gamma * (min_qf_next_target.sum(1)).view(-1)
```

Now let me put it in a more sensible way:

```python
min_pred_next_state_score = torch.min(qf1_next_target, qf2_next_target)
d_entropy_bonus = -alpha_d * next_state_log_pi_d  # [batch_size, d_dim]; each element of d_dim is the entropy bonus of the discrete action at that index
score_of_all_actions = min_pred_next_state_score + d_entropy_bonus
next_state_expected_score = (next_state_prob_d * score_of_all_actions).sum(1)  # sum(1) turns the shape into [batch_size,], i.e. an expectation scalar for each sample in the batch
next_q_value = torch.Tensor(s_rewards).to(device) + (1 - torch.Tensor(s_dones).to(device)) * args.gamma * next_state_expected_score.view(-1)
```

(Here I also changed some variable names; I hope that doesn't bother you.) At present, everything is fine, right? Then, adding the continuous part back, I would write:

```python
min_pred_next_state_score = torch.min(qf1_next_target, qf2_next_target)
d_entropy_bonus = -alpha_d * next_state_log_pi_d  # [batch_size, d_dim]; each element of d_dim is the entropy bonus of the discrete action at that index
c_entropy_bonus = -alpha * next_state_log_pi_c    # [batch_size, d_dim]; each element is the extra continuous entropy within that discrete action
score_of_all_actions = min_pred_next_state_score + d_entropy_bonus + c_entropy_bonus  # each element is the entropy-augmented value of one discrete action, predicted at the next state
next_state_expected_score = (next_state_prob_d * score_of_all_actions).sum(1)  # sum(1) turns the shape into [batch_size,], i.e. an expectation scalar for each sample in the batch
next_q_value = torch.Tensor(s_rewards).to(device) + (1 - torch.Tensor(s_dones).to(device)) * args.gamma * next_state_expected_score.view(-1)
```

Intuitively, the continuous entropy bonus should be treated just like the discrete one: it only needs to be weighted by `next_state_prob_d` once, through the outer expectation. Doesn't that make sense? Then, with this modification, I tested it in the env and it still converges. And I mean no offense, but the original version seems complex, and I still don't follow the meaning of the inner `next_state_prob_d` weighting.
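
As a worked equation (my own rendering of the target discussed above, with $\pi_d$ the discrete distribution, $\pi_c$ the conditional continuous density, and $a_c$ one sampled continuous action per discrete slot), the soft value of the next state under a hybrid policy can be written as

$$
V(s') = \sum_{d} \pi_d(d \mid s')\Big[\min_{i=1,2} Q_i(s', d, a_c) - \alpha_d \log \pi_d(d \mid s') - \alpha \log \pi_c(a_c \mid s', d)\Big], \qquad a_c \sim \pi_c(\cdot \mid s', d),
$$

in which each entropy bonus is weighted by $\pi_d$ exactly once through the outer expectation; the original code applies an extra $\pi_d$ factor to the continuous term inside the bracket.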
**actor loss**

(I will quote the original code here for reference):

```python
actions_c, actions_d, log_pi_c, log_pi_d, prob_d = pg.get_action(s_obs, device)

qf1_pi = qf1.forward(s_obs, actions_c, device)
qf2_pi = qf2.forward(s_obs, actions_c, device)
min_qf_pi = torch.min(qf1_pi, qf2_pi)

policy_loss_d = (prob_d * (alpha_d * log_pi_d - min_qf_pi)).sum(1).mean()
policy_loss_c = (prob_d * (alpha * prob_d * log_pi_c - min_qf_pi)).sum(1).mean()
policy_loss = policy_loss_d + policy_loss_c

policy_optimizer.zero_grad()
policy_loss.backward()
policy_optimizer.step()
```

Still, we have no doubt about the pure discrete action part; the question comes from the jointly continuous part. Suddenly, I realized that treating the actor loss as the negative soft state value is not a bad interpretation: trying to maximize the soft state value is a reasonable objective for the actor, and the soft state value is exactly what we used to calculate `next_q_value` above. So I would rewrite the actor loss as:

```python
d_entropy_bonus = -alpha_d * log_pi_d
c_entropy_bonus = -alpha * log_pi_c
policy_bonus = (prob_d * (min_qf_pi + d_entropy_bonus + c_entropy_bonus)).sum(1)
policy_loss = -policy_bonus.mean()  # average over the batch
```

From my test, still with fixed alpha, the result seems slightly slower than the original version. Now, for both the actor and the critic loss parts, I can't help asking myself why both the original version and my new version make SAC converge. My answer is that, to a certain degree, the extra inner weighting only rescales the entropy bonus, so both versions still optimize essentially the same objective.
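
To make the difference concrete (my own algebra, not from the thread; here $p_d$ stands for `prob_d`, $\log\pi_c$ for `log_pi_c`, and $Q$ for `min_qf_pi`), expanding both versions of the continuous policy loss for a single state gives

$$
\sum_d p_d\big(\alpha\, p_d \log\pi_c - Q\big) \;-\; \sum_d p_d\big(\alpha \log\pi_c - Q\big) \;=\; \alpha \sum_d p_d\,(p_d - 1)\,\log\pi_c ,
$$

so the two losses agree on the Q-value term and differ only in how strongly the continuous entropy term is weighted; the original version effectively applies the discrete probability to it twice, which is the "squared" weighting mentioned earlier in the thread.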
**alpha loss**

Following the hybrid-action SAC paper, since the target entropies for the discrete and continuous parts are different, I think it is better to keep these as two separate losses, as in the original version. But I still have doubts about the continuous part, so I would change:

```python
alpha_loss = (-log_alpha * p_d * (p_d * lpi_c + target_entropy)).sum(1).mean()
a_optimizer.zero_grad()
alpha_loss.backward()
a_optimizer.step()
alpha = log_alpha.exp().detach().cpu().item()
```

to:

```python
alpha_loss = (-log_alpha * p_d * (lpi_c + target_entropy)).sum(1).mean()
a_optimizer.zero_grad()
alpha_loss.backward()
a_optimizer.step()
alpha = log_alpha.exp().detach().item()
```

The reasoning is the same as in standard SAC. In the pure continuous case (dropping the batch mean), the alpha loss is

```python
alpha_loss = log_alpha * (-lpi_c - target_entropy)
           = log_alpha * (alpha_entropy - target_entropy)   # where alpha_entropy = -lpi_c is the current entropy estimate
```

From this modified alpha loss, it is easy to see that, to minimize it, `log_alpha` is pushed up when the current entropy is below the target and pushed down when it is above, which is exactly the intended temperature adjustment. And because each element of `p_d` is a non-negative probability, weighting the loss by `p_d` outside the parentheses keeps this behaviour:

```python
alpha_loss = log_alpha * p_d * (-lpi_c - target_entropy)
           = log_alpha * p_d * (alpha_entropy - target_entropy)
```

But what about the `p_d` inside the parentheses? While the outer `p_d` can be read as the expectation weight over discrete actions, the inner one does not have a clear interpretation to me. At least in my version, because I calculate

```python
# copied from the actor loss; a similar insight applies to the critic loss part
d_entropy_bonus = -alpha_d * log_pi_d
c_entropy_bonus = -alpha * log_pi_c
policy_bonus = (prob_d * (min_qf_pi + d_entropy_bonus + c_entropy_bonus)).sum(1)
```

here `log_pi_c` (i.e. `lpi_c`) is weighted by `prob_d` only once, so the alpha loss should weight it only once as well. Finally, after testing with the modified alpha, the modified SAC still converges.
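
For completeness, a self-contained sketch of the continuous-alpha update proposed above (tensor names, shapes, and the target entropy value are assumptions for illustration, not the repository's code), showing the single `p_d` weighting in a runnable form:

```python
import torch

batch_size, d_dim = 4, 3

# Hypothetical inputs: discrete action probabilities and, for each discrete slot,
# the log-density of the sampled continuous action.
p_d = torch.softmax(torch.randn(batch_size, d_dim), dim=1)
lpi_c = torch.randn(batch_size, d_dim)

target_entropy = -2.0                     # assumed target for the continuous part
log_alpha = torch.zeros(1, requires_grad=True)
a_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

# Modified continuous alpha loss: lpi_c is weighted by p_d exactly once,
# through the expectation over discrete actions.
alpha_loss = (-log_alpha * p_d * (lpi_c + target_entropy)).sum(1).mean()

a_optimizer.zero_grad()
alpha_loss.backward()
a_optimizer.step()
alpha = log_alpha.exp().detach().item()
```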
@dbsxdbsx Oh, your point seems convincing. I also found a work that formulates hybrid SAC very similarly to this paper, and it has an official code implementation. In that implementation, the formulation follows your thoughts exactly.
First, thank you for the code related to the paper Discrete and Continuous Action Representation for Practical RL in Video Games.

Second, according to your code, all of the action spaces of the environments you used for this project are based on the 5th action-space architecture stated in the paper: only ONE dimension of discrete action space, plus a continuous action space for EACH action from the discrete action space (if I am wrong, please tell me).

My question is about the loss formula related to the continuous part. In each code formula, you multiply the distribution objects (`next_state_prob_d`, `prob_d` and `p_d`) with the continuous objects (`next_state_log_pi_c`, `log_pi_c` and `lpi_c`), even though the same distribution object already appears outside the parentheses. Intuitively, I think there is no need to multiply the distribution object with the corresponding continuous objects inside the parentheses. I don't know whether I am wrong mathematically, so I am asking this question.
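
For concreteness, a small runnable illustration of the pattern in question (the tensor contents are random placeholders; only the formulas mirror the thread):

```python
import torch

batch_size, d_dim = 4, 3
alpha = 0.2
prob_d = torch.softmax(torch.randn(batch_size, d_dim), dim=1)  # discrete distribution
log_pi_c = torch.randn(batch_size, d_dim)                      # continuous log-probs per discrete slot
min_qf_pi = torch.randn(batch_size, d_dim)                     # min of the two Q estimates

# Form used in the repository: the distribution also multiplies log_pi_c inside the parentheses.
policy_loss_c_repo = (prob_d * (alpha * prob_d * log_pi_c - min_qf_pi)).sum(1).mean()

# Form this question suggests: weight by the distribution only once, via the outer factor.
policy_loss_c_alt = (prob_d * (alpha * log_pi_c - min_qf_pi)).sum(1).mean()
```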