Proximal Policy Optimization (PPO)

  • advantages: Tell the actor how much better its actions turned out than the critic's expectations (see the sketch below).
    • This quantity resembles the critic's loss term, but the state values predicted by the critic are detached during the actor's back-prop in order to decouple the actor and the critic.
  • clipping: Bounds the size of each actor update by limiting how much the action probabilities under the current policy can differ from the log probabilities recorded when the samples were collected; the probability ratio is clipped to [1 - epsilon, 1 + epsilon].
  • critic: Tries to minimize the difference between its predicted state values and the actual discounted returns.
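
A minimal PyTorch sketch of these three loss terms (the function name `ppo_losses` and its tensor arguments are illustrative placeholders, not identifiers from this repository; `epsilon` is the clip range, commonly 0.2):

import torch
import torch.nn.functional as F

def ppo_losses(new_log_probs, old_log_probs, values, disc_returns, epsilon=0.2):
    # Advantage: how much better the observed return was than the critic's estimate.
    # detach() keeps the actor's gradient from flowing into the critic.
    advantages = (disc_returns - values).detach()

    # Probability ratio between the current policy and the policy that collected the data.
    ratios = torch.exp(new_log_probs - old_log_probs)

    # Clipped surrogate objective: bound the ratio to [1 - eps, 1 + eps] so a single
    # update cannot move the policy too far from the one that gathered the samples.
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    actor_loss = -torch.min(surr1, surr2).mean()

    # Critic regresses its value predictions toward the observed discounted returns.
    critic_loss = F.mse_loss(values, disc_returns)

    return actor_loss, critic_loss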

Pseudo-code

for each episode:
    memory = record_samples(env, policy=actor)
    disc_returns = compute_discounted_returns(memory, gamma)

    for epoch in range(num_epochs):
        for observations, actions, old_log_probs, disc_returns in minibatch(memory, batch_size):
            values = critic(observations)

            // Detach values so the actor's gradient does not flow into the critic
            advantages = disc_returns - detach(values)
            
            // Compute new log probs
            new_log_probs = actor(observations, actions)
            
            // Calculate ratios
            ratios = exp(new_log_probs - old_log_probs)
            
            // Actor loss
            surr1 = ratios * advantages
            surr2 = clip(ratios, 1 - epsilon, 1 + epsilon) * advantages
            actor_loss = -mean(min(surr1, surr2))
            
            // Critic loss
            critic_loss = mean((disc_returns - values)^2)
            
            // Step actor backprop: zero grads, backprop actor_loss, optimizer step
            // Step critic backprop: zero grads, backprop critic_loss, optimizer step
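
The pseudo-code leaves two pieces abstract: the discounted-return computation and the two back-prop steps. Below is a sketch of both, assuming PyTorch and a separate optimizer per network (all names are illustrative, and `ppo_losses` refers to the sketch shown above):

import torch

def compute_discounted_returns(rewards, gamma):
    # Walk the episode backwards, accumulating gamma-discounted reward sums.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return torch.tensor(returns)

def step_backprop(actor_opt, critic_opt, actor_loss, critic_loss):
    # Step actor backprop
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step critic backprop
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

Here `actor_opt` and `critic_opt` could be, for example, `torch.optim.Adam(actor.parameters(), lr=3e-4)` and the critic equivalent. Using separate optimizers, together with detaching the values inside the advantage term, is what keeps the actor and critic updates decoupled as described above.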