Improvement | Paper | Year
---|---|---
Baseline | DQN: Deep Q Learning | 2013
Improv. 1 | Double DQN (DDQN) | 2015
Improv. 2 | Prioritized DQN | 2015
Improv. 3 | Dueling DQN | 2015
Improv. 4 | A3C | 2016
Improv. 5 | Noisy DQN | 2017
Improv. 6 | Distributional DQN (C51) | 2017
Combine 6 | Rainbow | 2017
Paper | Year
---|---
VPG: Vanilla Policy Gradient (aka REINFORCE) | 1992 |
TRPO: Trust Region Policy Optimization | 2015 |
DDPG: Deep Deterministic Policy Gradients | 2015 |
A2C: Advantage Actor Critic | |
A3C: Asynchronous Advantage Actor Critic | 2016 |
PPO: Proximal Policy Optimization | 2017 |
TD3: Twin Delayed Deep Deterministic Policy Gradients | 2018 |
SAC: Soft Actor-Critic | 2018 |
SAC-Discrete: Soft Actor-Critic for Discrete Actions | 2019 |
- Policy methods search directly for the optimal policy, without simultaneously maintaining a value function.
- Policy gradient methods are a subtype of policy methods that estimate the optimal policy through gradient ascent.
- τ: the trajectory, a state-action sequence.
- R(τ): the total reward collected along the trajectory (how good my actions were).
- P(τ;θ): the probability of that trajectory under the policy with parameters θ (how confident I was).
The expected return U(θ) plays the role of the loss in deep learning, but instead of minimizing it you maximize it with gradient ascent.
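For reference, a standard way to write this objective and the REINFORCE (likelihood-ratio) estimate of its gradient, using the notation above:

```latex
% Expected return: sum over trajectories, weighted by their probability under the policy
U(\theta) = \sum_{\tau} P(\tau;\theta)\, R(\tau)

% REINFORCE gradient estimate from N sampled trajectories of horizon H
\nabla_\theta U(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
  \left( \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \right)
  R\!\left(\tau^{(i)}\right)
```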
- Use the policy π (network) to collect N trajectories τ (episodes)
- Use the trajectories to estimate the gradient of the expected return U(θ)
- Update the weights of the network (gradient ascent: θ = θ+α∇U(θ))
- Loop over steps 1-3.
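A minimal sketch of this loop in PyTorch on Gymnasium's CartPole-v1 (the network size, learning rate, and episode count are illustrative choices, not from these notes; each update here uses a single trajectory, i.e. N = 1):

```python
# Minimal REINFORCE sketch (assumes: pip install torch gymnasium)
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")

# Small policy network: state -> action probabilities (softmax head, discrete actions)
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
    nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):                      # step 4: loop over steps 1-3
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:                             # step 1: collect a trajectory with the current policy
        probs = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()                  # stochastic action picking
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # step 2: estimate the gradient of U(θ); minimizing -R(τ)·Σ log π is gradient ascent on U(θ)
    ret = sum(rewards)                          # R(τ): total reward of the trajectory
    loss = -ret * torch.stack(log_probs).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # step 3: update the weights
```

Averaging the loss over N collected episodes before calling `optimizer.step()` recovers the batched version described in the steps above.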
Extra: AlphaGo → AlphaGo Zero → AlphaZero → MuZero
Extra 2: Dopamine
A fruitful relationship between neuroscience and AI
- Discrete: (action probabilities)
  - Only one action at a time: Softmax
  - Multiple simultaneous actions: Sigmoid
  - Action picking:
    - Deterministic: always take the most probable action.
    - Stochastic: sample randomly according to the probabilities.
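A tiny illustration of the two picking strategies, assuming a PyTorch softmax output (the probability values are made up):

```python
import torch

# Made-up softmax output for 3 discrete actions
probs = torch.tensor([0.2, 0.5, 0.3])

# Deterministic: always the most probable action
action_det = torch.argmax(probs).item()                              # -> 1

# Stochastic: sample an action according to the probabilities
action_sto = torch.distributions.Categorical(probs).sample().item()
```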
- Continuous: (action values)
  - [0, 1]: Sigmoid (e.g., throttle)
  - [-1, 1]: Tanh (e.g., steering wheel)
  - [0, inf]: ReLU
  - [-inf, inf]: Nothing (linear output)
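A small illustrative head for a bounded continuous action, e.g. a steering value squashed into [-1, 1] with Tanh (the 4-dimensional state and layer sizes are arbitrary choices for the example):

```python
import torch
import torch.nn as nn

# Example head: map a 4-dimensional state to one steering value in [-1, 1]
steering_head = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Tanh(),   # bounds the output to [-1, 1]
)

steering = steering_head(torch.randn(4))   # single value in [-1, 1]
```

Swapping `nn.Tanh()` for `nn.Sigmoid()` gives a [0, 1] output (e.g. throttle), and leaving the last layer without an activation covers the unbounded case.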