Deep Reinforcement learning

Part 1: Q Value based

Name Paper
Baseline DQN: Deep Q Learning 2013
Improv. 1 Double DQN (DDQN) 2015
Improv. 2 Prioritized DQN 2015
Improv. 3 Dueling DQN 2015
Improv. 4 A3C 2016
Improv. 5 Noisy DQN 2017
Improv. 6 Distributional DQN (C51) 2017
Combine 6 Rainbow 2017

Part 2: Policy Gradient based

Name Paper
VPG: Vanilla Policy Gradient (aka REINFORCE) 1992
TRPO: Trust Region Policy Optimization 2015
DDPG: Deep Deterministic Policy Gradients 2015
A2C: Advantage Actor Critic
A3C: Asynchronous Advantage Actor Critic 2016
PPO: Proximal Policy Optimization 2017
TD3: Twin Delayed Deep Deterministic Policy Gradients 2018
SAC: Soft Actor-Critic 2018
SAC-Discrete: Soft Actor-Critic for Discrete Actions 2019

What are Policy Gradient Methods?

  • Policy methods search directly for the optimal policy, without simultaneously maintaining a value function.
  • Policy gradient methods are a subtype of policy methods that estimate the optimal policy through gradient ascent.

Problem: Maximize the expected return U(θ) = ∑_τ P(τ; θ) R(τ)

  • τ: the trajectory, a state-action sequence.
  • R(τ): the total reward (return) collected along the trajectory. (How good the actions were)
  • P(τ; θ): the probability of that trajectory under the policy with parameters θ. (How likely the policy was to pick those actions)

It is like the loss in deep learning, but instead of minimizing it, you maximize it with gradient ascent.
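To make the gradient-ascent step concrete, here is the standard likelihood-ratio (REINFORCE) estimator of that gradient, written with the same symbols as above (π_θ denotes the policy network; this is the textbook derivation, not code from the notebooks):

```latex
\nabla_\theta U(\theta)
  = \sum_\tau P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\, R(\tau)
  \approx \frac{1}{N}\sum_{i=1}^{N}\Big(\sum_{t}\nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\Big)\, R\big(\tau^{(i)}\big)
```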

VPG: Vanilla Policy Gradient (aka REINFORCE)

  1. Use the policy π_θ (network) to collect N trajectories τ (episodes).
  2. Use the trajectories to estimate the gradient of the expected return U(θ).
  3. Update the weights of the network with gradient ascent: θ ← θ + α∇_θ U(θ).
  4. Loop over steps 1-3 (a minimal sketch follows below).
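A minimal sketch of one loop of steps 1-3 for a discrete-action environment, assuming PyTorch and a Gymnasium-style env; the names (policy, optimizer, n_trajectories) are illustrative, not the repo's actual code:

```python
import torch
from torch.distributions import Categorical

def vpg_iteration(policy, optimizer, env, n_trajectories=8, gamma=0.99):
    """One VPG/REINFORCE iteration: collect trajectories, estimate ∇U(θ), ascend."""
    losses = []
    for _ in range(n_trajectories):
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:                                   # 1. collect one trajectory with π_θ
            probs = policy(torch.as_tensor(obs, dtype=torch.float32))  # softmax output
            dist = Categorical(probs)
            action = dist.sample()                        # stochastic action picking
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        ret = sum(gamma**t * r for t, r in enumerate(rewards))    # discounted return R(τ)
        losses.append(-torch.stack(log_probs).sum() * ret)        # 2. -log π_θ · R(τ) estimator
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                      # 3. ascent on U(θ) = descent on -U(θ)
```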

PPO: Proximal Policy Optimization
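The full algorithm lives in the notebook; the key idea on top of vanilla policy gradients is the clipped surrogate objective from the 2017 paper. A minimal sketch, assuming PyTorch and that log-probabilities and advantage estimates are already computed (the function and argument names are illustrative):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss (to be minimized): limits how far the updated policy
    can move away from the policy that collected the data."""
    ratio = torch.exp(new_log_probs - old_log_probs)              # π_new(a|s) / π_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # negative: ascent via a minimizer
```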

Part 3: Multi agent RL

Extra: AlphaGo → AlphaGo Zero → AlphaZero → MuZero

Extra 2: Dopamine

A fruitful relationship between neuroscience and AI

Actions

  • Discrete: (action probabilities)
    • Only one action at a time: Softmax
    • Multiple simultaneous actions: Sigmoid
    • Action picking:
      • Deterministic: always the most probable action.
      • Stochastic: sampled at random according to the probabilities.
  • Continuous: (action values)
    • [0, 1]: Sigmoid (e.g. throttle)
    • [-1, 1]: Tanh (e.g. steering wheel)
    • [0, ∞): ReLU
    • (-∞, ∞): no activation (identity)
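A minimal sketch of these output heads, assuming PyTorch; the observation/action sizes and variable names are illustrative only:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 8, 4
obs = torch.randn(obs_dim)

# Discrete, only one action at a time: softmax head -> action probabilities
discrete_head = nn.Sequential(nn.Linear(obs_dim, n_actions), nn.Softmax(dim=-1))
probs = discrete_head(obs)
deterministic_action = probs.argmax()             # deterministic: always the most probable
stochastic_action = Categorical(probs).sample()   # stochastic: sampled according to the probabilities

# Continuous: pick the activation that matches the valid range of the action value
throttle = torch.sigmoid(torch.randn(1))   # [0, 1], e.g. throttle
steering = torch.tanh(torch.randn(1))      # [-1, 1], e.g. steering wheel
speed    = torch.relu(torch.randn(1))      # [0, ∞)
raw      = torch.randn(1)                  # (-∞, ∞): no activation
```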

References
