Improvement | Paper | Year
---|---|---
Baseline | DQN: Deep Q Learning | 2013
Improv. 1 | Double DQN (DDQN) | 2015
Improv. 2 | Prioritized DQN | 2015
Improv. 3 | Dueling DQN | 2015
Improv. 4 | A3C | 2016
Improv. 5 | Noisy DQN | 2017
Improv. 6 | Distributional DQN (C51) | 2017
Combine 6 | Rainbow | 2017
Paper | Year
---|---
VPG: Vanilla Policy Gradient (aka REINFORCE) | 1992 |
TRPO: Trust Region Policy Optimization | 2015 |
DDPG: Deep Deterministic Policy Gradients | 2015 |
A2C: Advantage Actor Critic | |
A3C: Asynchronous Advantage Actor Critic | 2016 |
PPO: Proximal Policy Optimization | 2017 |
TD3: Twin Delayed Deep Deterministic Policy Gradients | 2018 |
SAC: Soft Actor-Critic | 2018 |
SAC-Discrete: Soft Actor-Critic for Discrete Actions | 2019 |
- Policy methods search directly for the optimal policy, without simultaneously maintaining a value function.
- Policy gradient methods are a subtype of policy methods that estimate the optimal policy through gradient ascent.
- τ: the trajectory, a state-action sequence.
- R(τ): the total reward collected along the trajectory (how good my actions were).
- P(τ;θ): the probability of that trajectory under the policy with parameters θ (how confident I was).
The expected return U(θ) plays the role of the loss in deep learning, but instead of minimizing it you maximize it with gradient ascent.
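For reference, a standard way to write this objective and the REINFORCE (likelihood-ratio) estimate of its gradient, using the notation above:

```latex
% Expected return: sum over trajectories, weighted by their probability under the policy
U(\theta) = \sum_{\tau} P(\tau;\theta)\, R(\tau)

% REINFORCE gradient estimate from N sampled trajectories of horizon H
\nabla_\theta U(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
  \left( \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \right)
  R\!\left(\tau^{(i)}\right)
```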
- Use the policy π (network) to collect N trajectories τ (episodes)
- Use the trajectories to estimate the gradient of the expected return U(θ)
- Update the weights of the network (gradient ascent: θ = θ+α∇U(θ))
- Loop over steps 1-3.
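A minimal sketch of this loop in PyTorch on Gymnasium's CartPole-v1 (the network size, learning rate, and episode count are illustrative choices, not from these notes; each update here uses a single trajectory, i.e. N = 1):

```python
# Minimal REINFORCE sketch (assumes: pip install torch gymnasium)
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")

# Small policy network: state -> action probabilities (softmax head, discrete actions)
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
    nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):                      # step 4: loop over steps 1-3
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:                             # step 1: collect a trajectory with the current policy
        probs = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()                  # stochastic action picking
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # step 2: estimate the gradient of U(θ); minimizing -R(τ)·Σ log π is gradient ascent on U(θ)
    ret = sum(rewards)                          # R(τ): total reward of the trajectory
    loss = -ret * torch.stack(log_probs).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # step 3: update the weights
```

Averaging the loss over N collected episodes before calling `optimizer.step()` recovers the batched version described in the steps above.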
Extra: AlphaGo → AlphaGo Zero → AlphaZero → MuZero
Extra 2: Dopamine
A fruitful relationship between neuroscience and AI
- Discrete: (action probabilities)
  - Only one action at a time: Softmax
  - Multiple simultaneous actions: Sigmoid
  - Action picking:
    - Deterministic: always take the most probable action.
    - Stochastic: sample randomly according to the probabilities.
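A tiny illustration of the two picking strategies, assuming a PyTorch softmax output (the probability values are made up):

```python
import torch

# Made-up softmax output for 3 discrete actions
probs = torch.tensor([0.2, 0.5, 0.3])

# Deterministic: always the most probable action
action_det = torch.argmax(probs).item()                              # -> 1

# Stochastic: sample an action according to the probabilities
action_sto = torch.distributions.Categorical(probs).sample().item()
```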
- Continuous: (action values)
  - [0, 1]: Sigmoid (e.g., throttle)
  - [-1, 1]: Tanh (e.g., steering wheel)
  - [0, inf]: ReLU
  - [-inf, inf]: Nothing (linear output)
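A small illustrative head for a bounded continuous action, e.g. a steering value squashed into [-1, 1] with Tanh (the 4-dimensional state and layer sizes are arbitrary choices for the example):

```python
import torch
import torch.nn as nn

# Example head: map a 4-dimensional state to one steering value in [-1, 1]
steering_head = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Tanh(),   # bounds the output to [-1, 1]
)

steering = steering_head(torch.randn(4))   # single value in [-1, 1]
```

Swapping `nn.Tanh()` for `nn.Sigmoid()` gives a [0, 1] output (e.g. throttle), and leaving the last layer without an activation covers the unbounded case.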