Reinforcement Learning modules for PyTorch.
- Policy Gradient Loss
- CLIP loss (PPO)
- Entropy Loss
- Action - discrete (Categorical/Multinomial)
- Action - continuous (Normal/OUNoise (https://github.com/vitchyr/rlkit/blob/master/rlkit/exploration_strategies/ou_strategy.py))
- Reward - no documentation yet
- RewardHistory - no documentation yet
- ExperienceReplay
- RND (Random Network Distillation) - adding curiosity to your agent
- GAE (Generalized Advantage Estimation)
- Solving OpenAI Gym CartPole in less than 30 lines of code using RL_modules
- Breaking down the code
Requirements:
- PyTorch 1.1
- numpy 1.16
The gradient of the loss function is defined by the policy gradient:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\right]$
As with any PyTorch loss function, we declare the loss function at the beginning and use it later, e.g.:
import RL_modules as rl
#beginning of the code
PGloss_func = rl.PGloss()
#backprop:
loss = PGloss_func(log_pi, Q)
#or
loss = PGloss_func(actions.log_prob, Q)
#or
loss = PGloss_func(actions, Q)
#or
loss = PGloss_func(actions, rewards)
where log_pi and actions.log_prob are the log-probabilities of the sampled actions, actions is an Action object, and rewards is a Reward object.
- IMPORTANT: This function is signed so that the gradients ascend (as they should) even though optimizers perform gradient descent. Use the function as is.
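For intuition, here is a minimal sketch of what PGloss presumably computes (the sum reduction is an assumption, chosen to match the Entropyloss described later; the actual module may differ):
import torch

def pg_loss(log_pi, Q):
    #negated policy-gradient objective: minimizing it with a gradient-descent optimizer ascends the expected return
    return -(log_pi * Q).sum()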
A summary of the PPO algorithm: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#ppo
Example:
import RL_modules as rl
CLIPloss_func = rl.CLIPloss() #option A
#or
CLIPloss_func = rl.CLIPloss(epsilon=0.1) #option B
#backprop:
loss = CLIPloss_func(action_old, action, advantage, epsilon=0.3) #goes well with option A
#or
loss = CLIPloss_func(action_old, action, advantage) #goes well with option B
#or
loss = CLIPloss_func(action_old, action, advantage, epsilon=0.3) #with option B: epsilon=0.1 from initialization will be ignored
#or
loss = CLIPloss_func(action_old, action, advantage) #with option A: the default epsilon=0.2 will be used
where
- action_old is the output Action object of the old policy network (the policy before the update).
- action is the output Action object of the current policy network.
- advantage is usually the output of the old critic model.
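For reference, a minimal sketch of the PPO clipped surrogate objective that CLIPloss implements (variable names here are illustrative, not the module's internals):
import torch

def clip_loss(log_prob_old, log_prob, advantage, epsilon=0.2):
    #probability ratio pi(a|s) / pi_old(a|s)
    ratio = torch.exp(log_prob - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
    #negated so that minimizing the loss maximizes the clipped objective
    return -torch.min(unclipped, clipped).sum()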
The entropy loss tries to maximize the entropy of the policy distribution to increase exploration.
Example of combining PGloss and the entropy loss:
import RL_modules as rl
#beginning of the code
PGloss_func = rl.PGloss()
Entropyloss_func = rl.Entropyloss()
#backprop:
loss = PGloss_func(actions, Q) + beta * Entropyloss_func(actions)
- beta is a regularization factor for the Entropy loss.
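As described in the CartPole breakdown below, Entropyloss(actions) returns -actions.entropy.sum(), so a minimal sketch of it is:
def entropy_loss(action):
    #negated summed entropy: minimizing this maximizes the policy's entropy and encourages exploration
    return -action.entropy.sum()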
Choosing a discrete action in RL requires many steps:
1. Getting the linear output of the Policy network.
2. Softmaxing it.
3. Sampling from the softmax Policy distribution.
4. Saving log_pi [log(sampled action probability)].
5. Getting the entropy of the Policy distribution for later minimization/maximization.
6. Getting the chosen action in a one-hot representation.
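For reference, here is a hedged sketch of doing steps 1-6 manually with torch.distributions (PolicyNet and state are placeholders); Action automates all of this:
import torch
import torch.nn.functional as F

y = PolicyNet(state)                                    #1. linear output of the Policy network
probs = F.softmax(y, dim=-1)                            #2. softmax -> Policy distribution
dist = torch.distributions.Categorical(probs)
idx = dist.sample()                                     #3. sampling from the distribution
log_pi = dist.log_prob(idx)                             #4. log(sampled action probability)
entropy = dist.entropy()                                #5. entropy of the Policy distribution
one_hot = F.one_hot(idx, num_classes=probs.size(-1))    #6. one-hot representation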
An Action object is like a numpy.array or a torch.tensor specially tailored for reinforcement learning.
Just convert the output of the Policy network to an Action by:
import RL_modules as rl
y = PolicyNet(state)
action = rl.Action(y) #converting PolicyNet output to an Action
and Action will execute and save the results of steps 2-6.
Action automatically checks if the output of the Policy network is linear or a distribution and acts accordingly.
import RL_modules as rl
#beginning of the code: initializing an empty Action object.
action = rl.Action([])
#Convert the Policy network's output to an Action, and add the new Action to the cumulative Action.
action += rl.Action(PolicyNet(state))
#outputs the last sampled action to the environment.
next_state, reward, done, info = env.step(action())
That's it!
where PolicyNet is the policy network, state is the policy network's input and action is an object containing useful information for training.
- action() -> last sampled action index.
- action.probs -> a tensor of action probabilities.
- action.entropy -> a tensor of policy distribution entropy.
- action.idx -> a tensor of sampled action indices.
- action.prob -> a tensor of sampled probability.
- action.one_hot -> a tensor of sampled actions' one hot representation.
- action.log_prob -> a tensor of log(sampled probability). This is the famous log_pi term from the Policy Gradient loss, which is very useful when training an agent with a PG method.
- action(0) -> sampled index of the 1st action.
- action(-1) -> sampled index of the last action (equivalent to action()).
- action(n) -> sampled index of the n-th action.
- action([]) -> an empty action.
- action[-5:] -> a new Action object with the last 5 actions only.
- action[b:n] -> a new Action object with actions b to n.
- action.size() -> a tuple with the number of sampled actions in index 0 and the number of possible actions in index 1.
- action.size(n) -> n-th index of action.size().
- len(action) -> the same as action.size(0).
To combine action and new_action, both must be Action objects (all combination methods are equivalent):
- action = action + new_action
- action += new_action
- action.append(new_action)
- action.push(new_action)
i.e.:
import torch
import RL_modules as rl
x = rl.Action(torch.randn(5))
print(x.size())
---> (1, 5)
x += rl.Action(torch.randn((20, 5)))
print(x.size())
---> (21, 5)
The goal of the RND module is to add curiosity to your agent. It can also be used to find anomalies in data, recognize familiar paths, and much more. A good explanation of the algorithm can be found here: https://towardsdatascience.com/reinforcement-learning-with-exploration-by-random-network-distillation-a3e412004402
Before calling the RND module, you must initialize 2 networks (architecture issues are discussed in the next part):
- RNDnet - a network that will remain frozen and will only be used for calculations. No gradients will pass through this network.
- PRDnet - a predictor network that will try to guess the output of the RND network and learn to be more like RNDnet.
- IMPORTANT: It is recommended to attach your own optimizer to PRDnet in advance, e.g.:
PRDnet.optimizer = optim.Adam(PRDnet.parameters(), lr=1e-3)
Initializing (after RNDnet and PRDnet exist):
import RL_modules as rl
rnd = rl.RND(RNDnet, PRDnet, memory_capacity=5000)
In this case, the RND module will save up to 5000 inputs to learn from. When reaching the memory_capacity, random inputs will be deleted. This should prevent catastrophic forgetting.
Then use one of the two:
- Getting the intrinsic RND curiosity reward (immune to the noisy-TV problem):
next_state = rl.np2torch(next_state)
intrinsic_reward = rnd(next_state)
OR
- Getting the Next-State Prediction (NSP) reward (NOT immune to the noisy-TV problem):
state = rl.np2torch(state)
intrinsic_reward = rnd(torch.cat((state, action.one_hot), dim=-1))
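Under the hood, the intrinsic reward is presumably the prediction error of PRDnet against the frozen RNDnet; a minimal sketch (x stands for the module's input, and the exact reduction is an assumption):
import torch

with torch.no_grad():
    target = RNDnet(x)        #frozen target features, no gradients
prediction = PRDnet(x)        #the predictor tries to match the target
intrinsic_reward = ((target - prediction) ** 2).mean(dim=-1)    #per-sample MSE as curiosity signal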
RNDnet and PRDnet can have the same architecture (this is recommended), but they must be initialized with different weights.
The networks must have the same input and output dimensions; only the hidden layers may differ.
Notice from the last part that:
- If you use the RND curiosity reward, the input to the RND module is a state, so the networks' input dimension should equal the state dimension.
- If you use the NSP reward, the input to the RND module is a state concatenated with an action, so the networks' input dimension should be the sum of the state and action dimensions.
You can choose the output dimensions of RNDnet and PRDnet with no constraints. Yet, you must remember:
- The length of the diagonal of a 1x1 square is $\sqrt{2}$, and the length of the diagonal of a 1x1x1 cube is $\sqrt{3}$. The higher the dimension, the more volume the shape has and the farther away random points within the shape can be.
- A higher dimension in the output layer might slow down the calculations and the learning process.
A recommended number of dimensions for the output is 2-4.
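For example, a minimal pair of networks satisfying these constraints (make_net and state_dim are hypothetical; any architecture with matching input/output dimensions works):
import torch.nn as nn

def make_net(in_dim, out_dim=3):
    #same architecture for both networks; each call draws fresh random weights
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

RNDnet = make_net(state_dim)    #frozen target network (state_dim is your state dimension)
PRDnet = make_net(state_dim)    #predictor network, gets its own optimizer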
Each time you use the RND module to get the RND reward (i.e. RND_reward = rnd(next_state)), the input is saved but not yet learned. To start learning from the saved inputs, write:
rnd.learn(n_epochs=10, chunk_size=1000)
This line says that the inputs will be processed in chunks of 1000 (bigger chunks need more memory but make the calculations faster). n_epochs is the number of epochs in the learning process (the number of optimizer steps).
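A possible placement in a training loop, as a hedged sketch (the dummy random states stand in for environment outputs; the loop length is arbitrary):
import torch

for step in range(100):
    next_state = torch.randn(4)            #dummy state, stands in for the environment's output
    intrinsic_reward = rnd(next_state)     #the input is saved internally for later learning
rnd.learn(n_epochs=10, chunk_size=1000)    #train PRDnet on all saved inputs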
RND can also be used with Encoder-Decoder or Autoencoder networks. Use the Encoder part of the network as the RNDnet: let the bottleneck of the Autoencoder be the output layer of RNDnet, and initialize PRDnet with the same architecture as the Encoder. Since the Encoder changes as it learns, you should update the RNDnet whenever that happens, e.g.:
VAE.optimizer.step()
rnd.RNDnet = VAE.Encoder
Using the REINFORCE algorithm and plotting the results. Example:
import torch
import gym
from NetworkModule import Network
import RL_modules as rl
import matplotlib.pyplot as plt
env = gym.make('CartPole-v1') #create the environment
#Parameters and Hyperparameters
n_episodes = 200
lr = 1e-2 #learning_rate
beta = 1e-6 #entropy loss coefficient
gamma = 0.99 #discount factor
torch.manual_seed(41)
PolicyNet = Network(L=[4,*1*[8],2], lr=lr, optimizer='RMSprop', dropout=0)
PolicyNet.PGloss = rl.PGloss()
PolicyNet.Entropyloss = rl.Entropyloss()
reward_history = rl.RewardHistory(running_steps=10)
for episode in range(n_episodes):
    state = env.reset()
    actions = rl.Action([])
    rewards = rl.Reward([])
    done = False
    while not done:
        actions += rl.Action(PolicyNet(rl.np2torch(state)))
        next_state, reward, done, info = env.step(actions())
        rewards.append(rl.Reward(reward))
        state = next_state
        env.render()
    loss = PolicyNet.PGloss(actions, rewards(gamma, norm=True)) + beta*PolicyNet.Entropyloss(actions)
    PolicyNet.optimizer.zero_grad()
    loss.backward()
    PolicyNet.optimizer.step()
    reward_history += rewards
    if episode % 10 == 0:
        print("Episode #", episode, " score: ", rewards.sum().item())
reward_history.plot()
plt.show()
Creating the CartPole environment from gym package:
env = gym.make('CartPole-v1') #create the environment
Defining parameters and hyperparameters:
lr = 1e-2 #learning_rate
beta = 1e-6 #entropy loss coefficient
gamma = 0.99 #discount factor
n_episodes = 200 #number of episodes
Initializing a policy network using NetworkModule (https://github.com/omrijsharon/NetworkModule). The Policy network is supposed to take a state as input and output a number for each action (these numbers become the action distribution later on). The network architecture:
- an input layer with 4 nodes - because the state shape is 4.
- 1 hidden layer with 8 nodes - these numbers are hyperparameters.
- an output layer with 2 nodes - because the action space has 2 discrete actions.
torch.manual_seed(41)
PolicyNet = Network(L=[4,*1*[8],2], lr=lr, optimizer='RMSprop', dropout=0)
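For orientation, a rough torch.nn equivalent of Network(L=[4, 8, 2]) might look like the following (the actual NetworkModule internals, e.g. its nonlinearity, may differ):
import torch.nn as nn

policy = nn.Sequential(
    nn.Linear(4, 8),    #input layer (state dim 4) -> hidden layer (8 nodes)
    nn.ReLU(),          #assumed nonlinearity
    nn.Linear(8, 2),    #hidden layer -> output layer (2 discrete actions)
)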
Setting the loss functions:
PolicyNet.PGloss = rl.PGloss()
PolicyNet.Entropyloss = rl.Entropyloss()
Initializing a RewardHistory module. This module saves the cumulative reward of each episode and calculates the mean and standard deviation of the last 10 cumulative rewards (running mean and running std).
reward_history = rl.RewardHistory(running_steps=10)
Looping over the environment for n_episodes and initializing the environment:
for episode in range(n_episodes):
    state = env.reset()
Initializing empty Action and Reward modules:
    actions = rl.Action([])
    rewards = rl.Reward([])
Setting done to False and starting to interact with the environment. The environment sets done to True at the end of an episode.
    done = False
    while not done:
The following bulleted commands are all contained in this line:
        actions += rl.Action(PolicyNet(rl.np2torch(state)))
- np2torch converts a numpy array to a torch tensor:
state_tensor = rl.np2torch(state_array)
- Running the state tensor through the Policy network, which outputs a tensor of size 2:
output_tensor = PolicyNet(state_tensor)
- Converting the Policy network's output to an Action object and appending it to the accumulated actions:
actions += rl.Action(output_tensor)
actions() returns an integer representing the last sampled action. env.step(actions()) gives the environment an action and gets its response as a state and a reward. If done is True, the episode ends. state = next_state updates the state for the next loop iteration.
        next_state, reward, done, info = env.step(actions())
        rewards.append(rl.Reward(reward))
        state = next_state
Rendering the environment. For faster performance, delete this line or comment it out with #.
        env.render()
rewards(gamma, norm=True) returns the normalized discounted reward.
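A hypothetical reference implementation of what rewards(gamma, norm=True) presumably computes (discounted reward-to-go, optionally normalized; the Reward module's internals may differ):
import torch

def discounted_returns(rewards, gamma=0.99, norm=True):
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G    #accumulate the discounted reward-to-go
        out.append(G)
    returns = torch.tensor(list(reversed(out)))
    if norm:
        #normalize to zero mean and unit std for stabler gradients
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns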
PGloss takes 2 arguments:
- a log_pi tensor, actions.log_prob, or just an Action object. PGloss knows how to handle an Action object and gets its log_prob attribute automatically.
- a discounted_rewards tensor, rewards(gamma, norm=True), or rewards. PGloss knows how to handle a Reward object. Passing rewards alone (without the brackets) will use the default arguments.
Entropyloss takes an Action object and returns -action.entropy.sum(). Minimizing this loss results in more exploration and less certainty in the actions the agent chooses. This effect increases with beta.
    loss = PolicyNet.PGloss(actions, rewards(gamma, norm=True)) + beta*PolicyNet.Entropyloss(actions)
Zeroing the gradients, backpropagating the loss, and walking one step in the gradient direction:
    PolicyNet.optimizer.zero_grad()
    loss.backward()
    PolicyNet.optimizer.step()
Adding/appending the cumulative reward from the last episode to the RewardHistory object so we can plot it and see the agent's learning progress:
    reward_history += rewards
Printing the cumulative reward of the last episode every 10 episodes:
    if episode % 10 == 0:
        print("Episode #", episode, " score: ", rewards.sum().item())
Plotting the raw cumulative reward, its running mean and its running standard deviation:
reward_history.plot()
plt.show()