<h1 align="center"> Deep RL - A Deep Reinforcement Learning Framework </h1>

![](imgs/logo.png)

# Set the path of DeepRL

Before using any component of the library, we need to add the framework to the Python interpreter's module search path. To do so, we append the location of the framework's root folder to sys.path. In my case it is at the path shown below; replace it with the location on your machine. After setting the path, we import the RLExp class, which helps us create a Deep RL experiment.

In [1]:
FRAMEWORK_PATH = "/home/mayank/Documents/Codes/ValueBased_DeepRL/"

In [2]:
import sys
import os
sys.path.append(FRAMEWORK_PATH)
from RLExp import RLExp
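
A hard-coded absolute path only works on one machine. Below is a minimal sketch of a more portable setup, assuming the framework root can be supplied through an environment variable. The variable name `DEEPRL_PATH` and the helper `add_framework_to_path` are hypothetical and not part of the framework.

```python
import os
import sys


def add_framework_to_path(default_path, env_var="DEEPRL_PATH"):
    """Append the framework root to sys.path once and return the path used.

    DEEPRL_PATH is a hypothetical variable name chosen for this sketch;
    if it is not set, default_path is used instead.
    """
    root = os.path.abspath(os.environ.get(env_var, default_path))
    if not os.path.isdir(root):
        raise FileNotFoundError("Framework folder not found: " + root)
    # Avoid adding the same entry again when the cell is re-executed.
    if root not in sys.path:
        sys.path.append(root)
    return root
```

With this helper, the cell above would become `add_framework_to_path(FRAMEWORK_PATH)` followed by `from RLExp import RLExp`.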

# Querying available implementations

The framework is under heavy development, and we add new algorithms regularly once they have been tested. The list of all available implementations of algorithms, exploration strategies, and experience replays can be queried as demonstrated below.

In [3]:
exp = RLExp()

## 1. Get list of all available algorithms

In [4]:
exp.getAvailableAlgorithms()

['DQN', 'DDQN']

## 2. Get list of all implemented experience replays

In [5]:
exp.getAvailableReplays()

['Uniform-Sampling', 'Prioritized-Sampling']

## 3. Get list of all implemented exploration strategies

In [6]:
exp.getAvailableExplorePolicies()

['epsilon-greedy', 'annealing-ep-greedy']
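
The returned lists can also be used to validate a configuration before it is passed to the framework. Here is a minimal sketch, assuming the three query methods return plain lists of strings as in the outputs above. `check_choice` is a hypothetical helper, not a framework function.

```python
def check_choice(kind, choice, available):
    """Return `choice` if it is in `available`, else raise a descriptive ValueError."""
    if choice not in available:
        raise ValueError(
            "Unknown %s %r; available options: %s" % (kind, choice, ", ".join(available))
        )
    return choice


# Lists copied from the outputs shown above. In a live session they would come
# from exp.getAvailableAlgorithms(), exp.getAvailableReplays() and
# exp.getAvailableExplorePolicies().
algorithms = ['DQN', 'DDQN']
replays = ['Uniform-Sampling', 'Prioritized-Sampling']
policies = ['epsilon-greedy', 'annealing-ep-greedy']

algorithm = check_choice("algorithm", "DQN", algorithms)
replay = check_choice("replay", "Uniform-Sampling", replays)
policy = check_choice("exploration policy", "annealing-ep-greedy", policies)
```

A misspelled name then fails early with a message that lists the valid options.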

# Setting Up RL Experiment

Now we set up an RL experiment. To do that, we call the setup_exp method of the RLExp class, which takes the arguments listed below. In this step we define the algorithm, exploration policy, replay method, and neural network architecture we would like to use for our experiment.

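Arguments of the setup_exp function
-  algorithm - one of the algorithms returned by the getAvailableAlgorithms method
-  explore_policy - one of the exploration policies returned by the getAvailableExplorePolicies method
-  replay - one of the replays returned by the getAvailableReplays method
-  observation_shape - shape of the observation that is fed to the input of the neural network
-  num_actions - the number of actions the agent can perform in the environment
-  exp_name - a name for the experiment (the call further below uses 'cartpole')
-  save_interval - kept as exp.save_interval; the training loop further below exports the logged data to CSV every save_interval episodes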

Further, to demonstrate, we will train our agent on the simple CartPole environment from OpenAI Gym. Since Gym exposes the action and observation spaces of each environment, it is easy to determine the observation shape and the number of actions required as arguments to the setup_exp method.



In [7]:
import gym
from gym import wrappers
env = gym.make("CartPole-v0")
#env = wrappers.Monitor(env, 'data/cartpole-experiment-1')
observation_size = env.observation_space.shape[0] # observation should be numpy array/vector
num_actions = env.action_space.n # number of discrete actions as an int, not the space object
print("Observation Shape : ",observation_size)
print("Number of actions : ",num_actions)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Observation Shape :  4
Number of actions :  2


## Defining Neural Network Architecture

When we call the setup_exp method, it looks for the definition of a neural network architecture that is compatible with the provided observation_shape and num_actions. The network must be defined in the file **q_network.py** inside the "network" folder. Please do not change the file name. The network is defined using PyTorch. The example given below demonstrates how to define a PyTorch network.

In [8]:
! cd ..; tree network;

network
├── __pycache__
│   └── q_network.cpython-36.pyc
└── q_network.py

1 directory, 2 files


In [9]:
! cd ..; cd network; cat q_network.py

import torch.nn as nn
import torch

# Define your network here
# input output format should be same as provided here.
class q_network(nn.Module):

    def __init__(self,input_size,out_size):
        super(q_network,self).__init__()

        self.input_size = input_size
        self.out_size = out_size

        self.layer1 = nn.Linear(out_features=24,in_features=self.input_size)
        self.layer2 = nn.Linear(out_features=48,in_features=24)
        self.layer3 = nn.Linear(out_features=self.out_size,in_features=48)

        self.relu = nn.ReLU()

    def forward(self,x,bsize):

        x = x.view(bsize,self.input_size)
        q_out = self.relu(self.layer1(x))
        q_out = self.relu(self.layer2(q_out))
        q_out = self.layer3(q_out)

        return q_out


## Calling setup_exp method to create an experiment

Finally, once all the steps given above are completed, we can call the setup_exp function to create the experiment.

In [10]:
exp.setup_exp(algorithm="DQN",explore_policy="annealing-ep-greedy",replay="Uniform-Sampling",observation_shape=observation_size,num_actions=2,exp_name='cartpole',save_interval=100)

Device Set to :  cpu
q_network(
  (layer1): Linear(in_features=4, out_features=24, bias=True)
  (layer2): Linear(in_features=24, out_features=48, bias=True)
  (layer3): Linear(in_features=48, out_features=2, bias=True)
  (relu): ReLU()
)
Algorithm Setup Done. Please set hyper-parameters.
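
As a quick sanity check on the printed architecture, the number of trainable parameters can be worked out by hand: a fully connected layer with a bias has `in_features * out_features + out_features` parameters. The sketch below does this arithmetic in plain Python for the layer sizes printed above. `count_linear_params` is a hypothetical helper.

```python
def count_linear_params(in_features, out_features, bias=True):
    """Weights plus (optional) biases of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# (in_features, out_features) for layer1, layer2 and layer3 as printed above.
layers = [(4, 24), (24, 48), (48, 2)]
per_layer = [count_linear_params(i, o) for i, o in layers]
total = sum(per_layer)
print(per_layer, total)  # [120, 1200, 98] 1418
```

Assuming exp.main_model holds the q_network instance, the same total should be obtainable in PyTorch with `sum(p.numel() for p in exp.main_model.parameters())`.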


# Setting hyper-parameters

An RL experiment has many hyper-parameters that need to be tuned in order to get the best performance from the algorithms. Deep RL exposes all of these hyper-parameters, and they can be queried and changed at any time during the experiment. The following methods can be used to get all the hyper-parameters of each component, followed by how to change them.

## Querying and Setting Algorithm hyper-parameters

In [11]:
exp.algorithm_object.getHyperParams()

Hyper Paramters (DQN)					Current Value
-------------------------------------------------------------------
1. batch_size							None
2. gamma							None
3. freeze_steps							None
4. max_steps							None
5. max_episodes							None
6. optimizer							None
7. learn rate							None
8. loss func							None


In [12]:
exp.algorithm_object.setHyperParams(batch_size=32,gamma=0.99,freeze_steps=10000,max_steps=10**6,max_episodes=10000,optimizer='adam',lr=0.001,loss="huber")
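
To make the roles of `gamma` and `loss="huber"` concrete, here is a minimal sketch of the standard one-step DQN target and the Huber loss in plain Python. It illustrates the textbook formulas and is not taken from the framework's source. The function names and the `delta=1.0` threshold are assumptions.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q_target(s', a').

    For a terminal transition the bootstrap term is dropped.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def huber_loss(prediction, target, delta=1.0):
    """Quadratic for errors up to delta, linear beyond it (assumed delta=1.0)."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)


# Example: reward 1.0, target-network Q-values [0.5, 2.0] for the next state.
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False, gamma=0.99)
loss = huber_loss(prediction=1.0, target=target)
```

In this example the target is 1.0 + 0.99 * 2.0 = 2.98, so the error of 1.98 falls in the linear region of the loss.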

## Querying and Setting Exploration Policy hyper-parameters

In [13]:
exp.explore_policy_object.getHyperParams()

Hyper Paramters (annealed_ep_greedy)			Current Value
-------------------------------------------------------------------
1. current_epsilon					None
2. intital_epsilon					None
3. final_epsilon					None
4. episodes_to_anneal					None


In [14]:
exp.explore_policy_object.setHyperParams(episodes_to_anneal=7000,initial_epsilon=1.0,final_epsilon=0.01)
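
These three parameters describe an epsilon schedule that starts at `initial_epsilon` and reaches `final_epsilon` after `episodes_to_anneal` episodes. Here is a minimal sketch, assuming the decay is linear in the episode index; the framework's actual schedule may differ. `annealed_epsilon` is a hypothetical function.

```python
def annealed_epsilon(episode, initial_epsilon=1.0, final_epsilon=0.01,
                     episodes_to_anneal=7000):
    """Linearly interpolate epsilon, then hold it at final_epsilon."""
    # Fraction of the annealing period that has elapsed, clipped to [0, 1].
    fraction = min(max(episode / episodes_to_anneal, 0.0), 1.0)
    return initial_epsilon + fraction * (final_epsilon - initial_epsilon)


for ep in (0, 3500, 7000, 10000):
    print(ep, round(annealed_epsilon(ep), 4))
```

Under this assumption the agent still takes a random action about half of the time at episode 3500, and exploration stays at 1% for the final 3000 of the 10000 training episodes.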

## Querying and Setting Replay hyper-parameters

In [15]:
exp.replay_object.getHyperParams()

Hyper Paramters (uniform-sampling)			Current Value
-------------------------------------------------------------------
1. capacity						None


In [16]:
exp.replay_object.setHyperParams(capacity=1000000)
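
For intuition about what `capacity` controls, here is a minimal sketch of a uniform-sampling replay buffer built on `collections.deque`. It is an illustration under assumptions, not the framework's implementation: the class `UniformReplay`, the `Transition` fields, and the error message are all hypothetical.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one environment transition.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class UniformReplay:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item once capacity is reached.
        self.memory = deque(maxlen=capacity)

    def add_sample(self, sample):
        self.memory.append(sample)

    def sample(self, batch_size):
        if len(self.memory) < batch_size:
            raise ValueError("Replay holds %d samples, need at least %d"
                             % (len(self.memory), batch_size))
        # random.sample draws without replacement, each item equally likely.
        # Copying to a list keeps the sketch simple but costs O(len(memory)).
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

A larger capacity keeps older experience available for sampling, at the cost of more memory.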

## Printing Final values of hyper-parameters

In [17]:
exp.algorithm_object.getHyperParams()

Hyper Paramters (DQN)					Current Value
-------------------------------------------------------------------
1. batch_size							32
2. gamma							0.99
3. freeze_steps							10000
4. max_steps							1000000
5. max_episodes							10000
6. optimizer							adam
7. learn rate							0.001
8. loss func							huber


In [18]:
exp.explore_policy_object.getHyperParams()

Hyper Paramters (annealed_ep_greedy)			Current Value
-------------------------------------------------------------------
1. current_epsilon					1.0
2. intital_epsilon					1.0
3. final_epsilon					0.01
4. episodes_to_anneal					7000.0


In [19]:
exp.replay_object.getHyperParams()

Hyper Paramters (uniform-sampling)			Current Value
-------------------------------------------------------------------
1. capacity						1000000


# Training Loop

Control of the training loop is left to the programmer, who therefore needs to pass the current step and episode number whenever calling into the RLExp class so that the data can be logged.

## Fill the memory with random transitions

Filling the replay memory before running the training loop is a good idea, and hence we force the user to fill the replay with at least a batch's worth of transitions. If the replay memory is smaller than the batch size, an error is raised.

In [20]:
for episode in range(0,1000):
    
    prev_state = env.reset()
    
    for step in range(0,exp.algorithm_object.params["max_steps"]):
        
        action = env.action_space.sample()
        
        next_state,reward,done,_ = env.step(action)
        
        sample = exp.pack_sample(prev_state,action,reward,next_state,done)
        
        exp.replay_object.add_sample(sample)
        
        prev_state = next_state
        
        if done:
            break

## Important Note

Please do not rerun this block of code once it has been executed. To train again, restart from the beginning.

In [None]:
for epsiode in range(0,exp.algorithm_object.params["max_episodes"]):
    
    episode_reward = 0.0
    prev_state = env.reset()
    #env.render()
    
    for step in range(0,exp.algorithm_object.params["max_steps"]):
        
        # get an action
        action = exp.explore_policy_object.exploreAction(state=prev_state,curr_epsiode=epsiode)
        
        # perform action in environment
        next_state,reward,done,_ = env.step(action)
        
        # pack the data into a named tuple provided by the framework
        sample = exp.pack_sample(prev_state,action,reward,next_state,done)
        
        # add sequence to experience replay
        exp.replay_object.add_sample(sample)
        
        episode_reward += reward
        prev_state = next_state
        
        exp.algorithm_object.update()
        
        # important
        exp.time_steps_elapsed += 1
        
        # copy weights
        if (exp.time_steps_elapsed % exp.algorithm_object.params["freeze_steps"]) == 0:
            exp.algorithm_object.copytoTarget()
        
        # update epsilon
        exp.explore_policy_object.updateEpsilon()
        
        if done:
            #print(done,step+1)
            break
    
    # store episode reward
    exp.algorithm_object.store_episode_reward(episode_reward)
    
    
    if (epsiode+1) % exp.save_interval == 0:
        exp.dumpData_object.export_csv()
    
    print('Current Exploration Rate : %.4f'%(exp.explore_policy_object.params["current_epsilon"]))
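
The exploreAction call in the loop above chooses between exploring and exploiting. Here is a minimal sketch of epsilon-greedy selection over a list of Q-values, assuming the usual definition. `epsilon_greedy` and its arguments are hypothetical and do not mirror the framework's signature.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon return a random action index, else the argmax."""
    # rng.random() lies in [0, 1), so epsilon=0.0 never explores
    # and epsilon=1.0 always explores.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

For CartPole, `q_values` would hold the two outputs of the Q-network for the current state.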

In [27]:
import torch
import numpy as np

# Performance of the Agent

In [40]:
# Calculate Performance of the agent
test_episodes = 100
reward_list = []

In [41]:
for episode in range(0,test_episodes):
    
    prev_state = env.reset()
    total_reward = 0
    
    for step in range(0,1000):
        
        with torch.no_grad():
            torch_x = torch.from_numpy(prev_state).to(exp.device).float()
            out = exp.main_model.forward(torch_x,bsize=1)
            val, action = out.max(dim=1)
            action = int(action)
        
        next_state,reward,done,_ = env.step(action)
        
        total_reward += reward
        
        prev_state = next_state
        
        if done:
            break
    reward_list.append(total_reward)
print('Test Done')

Test Done


In [42]:
reward_list = np.array(reward_list)
print('Mean Reward : ',reward_list.mean())
print('Std. Deviation : ',reward_list.std())
print('Max Reward : ',reward_list.max())
print('Min Reward : ',reward_list.min())

Mean Reward :  94.61
Std. Deviation :  3.158781410607578
Max Reward :  103.0
Min Reward :  88.0
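
Note that NumPy's `std` uses the population formula (`ddof=0`) by default. Here is a minimal sketch of the same summary using only the standard library, where `statistics.pstdev` corresponds to that default and `statistics.stdev` gives the sample estimate. `summarize` and the example rewards are made up for illustration.

```python
import statistics


def summarize(rewards):
    """Summary statistics for a list of per-episode rewards."""
    return {
        "mean": statistics.mean(rewards),
        "std_population": statistics.pstdev(rewards),  # matches numpy's default std
        "std_sample": statistics.stdev(rewards),       # divides by n - 1
        "max": max(rewards),
        "min": min(rewards),
    }


example = [90.0, 95.0, 100.0]  # made-up rewards, not results from the run above
print(summarize(example))
```

With 100 test episodes the two standard deviations differ only slightly, but the distinction matters for small evaluation runs.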
