<a href="https://colab.research.google.com/github/lblogan14/reinforcement_learning_with_tensorflow/blob/master/ch2_training_RL_using_openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [2]:
%cd /content/drive/My' 'Drive/Colab' 'Notebooks/Reinforcement_Learning_with_TensorFlow/

/content/drive/My Drive/Colab Notebooks/Reinforcement_Learning_with_TensorFlow


#Understand an OpenAI Gym Environment
Use **Frozen Lake** as an example.

Load the Frozen Lake environment

In [0]:
import gym

In [6]:
# make function of gym loads the specified environment
env = gym.make('FrozenLake-v0')

  result = entry_point.load(False)


Next, reset the environment.

In [8]:
s = env.reset()
# reset the environment and returns the state state as a value
print('Initial state is',s)


Initial state is 0


While performing a RL task, an agent undergoes learning through multiple episodes. As a result, at the start of each episode, the environment needs to be **reset** so that it comes to its initial situation and the agent begins from the start state.

After taking each action, it may be a requirement to show the status of the agent in the environment. This can be visualized by executing:

In [9]:
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


This output shows that this is an environment with 4x4 girds, that is, 16 states arranged in the preceding manner where S, H, F, and G represent different forms of a state where:
* **S**: Start block
* **F**: Frozen block
* **H**: Block has hole
* **G**: Goal block

The environment features cannot be modified directly, but this can be done by unwrapping the environment parameters with:

In [0]:
env = env.unwrapped

Each environment is defined by the **state spaces** and **action spaces** for the agent to perform. To access the type (discrete or continuous) and size of state spaces and action spaces:

**Action Space**:

In [12]:
print('Type of Action Space:', env.action_space)

Type of Action Space: Discrete(4)


In [13]:
print('Size of Action Space:', env.action_space.n)

Size of Action Space: 4


`Discrete(4)` means the action space of the Frozen Lake environemnt is a discrete set of values and has four distinct actions that can be performed by the agent.

**State Space**:

In [14]:
print('Type of State Space:', env.observation_space)

Type of State Space: Discrete(16)


In [15]:
print('Size of State Space:', env.observation_space.n)

Size of State Space: 16


`Discrete(16)` means that the observation (state) space of the Frozen
Lake environment is a discrete set of values and has 16 different states to be explored by the
agent.

#Program an Agent Using an OpenAI Gym Environment
This toy example environment consists of 4x4 grids representing a lake. Thus, there are 16 grid blocks, where each block can be a start block(S), frozen block(F), goal block(G), or a hole block(H). Therefore, the objective of the agent is to learn to navigate from start to goal without falling in the hole.

Reset the environment and start a new environment:

In [16]:
# load the environment FrozenLake-v0
env = gym.make('FrozenLake-v0')
env.render()
# output the environment and position of the agent


[41mS[0mFFF
FHFH
FFFH
HFFG


  result = entry_point.load(False)


At any given state, an agent has four actions to perform, which are up, down, left, and right.
The reward at each step is 0 except the one leading to the goal state, then the reward would
be 1. Start from the S state and the goal is to reach the G state without landing up in the H state in the most optimized path through the F states.

##Q-Learning
Q-learning consists of a Q-table that contains Q-values for each state-action pair.

The number of rows
in the table is equal to the number of states in the environment and the number of columns
equals the number of actions.

In this example, the number of states is 16 and the number of actions is 4, the Q-table for this environment consists of 16 rows and 4 columns.

In [18]:
print('Number of actions : ', env.action_space.n)
print('Number of states : ', env.observation_space.n)

Number of actions :  4
Number of states :  16


Steps in Q-learning:
1. Initialize the Q-table with zeros (eventually, updating will happen with a reward received for each action taken during learning).
2. Updating of a Q value for a state-action pair, that is, $Q(s,a)$ is given by:
$$Q(s,a)\leftarrow Q(s,a)+\alpha[r+\gamma\max_{a'}Q(s',a')-Q(s,a)]$$
where
 * $s$ = current state
 * $a$ = action taken (choosing new action through epsilon-greedy approach)
 * $s'$ = resulted new state
 * $a'$ = action for the new state
 * $r$ = reward received for the action $a$
 * $\alpha$= learning rate, that is, the rate at which the learning of the agent converges towards minimized error
 * $\gamma$= discount factor, that is, discounts the future reward to get an
idea of how important that future reward is with regards to the
current reward
3. By updating the Q-values as per the formula mentioned in step 2, the table
converges to obtain accurate values for an action in a given state.

###Epsilon-Greedy approach
The Epsilon-Greedy is a widely used solution to the **explore-exploit** dilemma.

**Exploration** is all about searching and exploring new options through experimentation and research to generate new values, while **exploitation** is all about refining existing options by repeating those options and improving their values.

Implementation:

    epsilon = 0.05 or 0.1 # any small value between 0 to 1 
    #epsilon is the probability of exploration
    
    p = random number between 0 and 1
    if p <= epsilon:
      pull a random action
    else:
      pull current best action

Eventually, after several iterations, the best actions are obtained among all at each state
because it gets the option to explore new random actions as well as exploit the existing
actions and refine them.

In the FrozenLake example, the following implementation of a basic Q-learning algorithm is to make an agent learn how to navigate across this frozen lake of 16 grids, from the start to the goal without falling into the hole.

In [0]:
# import dependencies
from __future__ import print_function
import gym
import numpy as np
import time

In [20]:
# load the environment
env = gym.make('FrozenLake-v0')
s = env.reset()
print('Initial state : ', s)
print()

env.render()
print()

print(env.action_space) # number of actions
print(env.observation_space) # number of states
print()

print('Number of actions : ', env.action_space.n)
print('Number of states : ', env.observation_space.n)
print()

Initial state :  0


[41mS[0mFFF
FHFH
FFFH
HFFG

Discrete(4)
Discrete(16)

Number of actions :  4
Number of states :  16



  result = entry_point.load(False)


In [0]:
# Epsilon-Greedy
def epsilon_greedy(Q, s, na):
  epsilon = 0.3
  p = np.random.uniform(low=0, high=1)
  #print('p = ', p)
  if p > epsilon:
    return np.argmax(Q[s,:])
    '''for each state consider the action having highest Q-value...'''
  else:
    return env.action_space.sample()

In [0]:
# ## Q-learning implementation

# initializing Q-table with zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])

# set hyperparameters
lr = 0.5 # learning rate
y = 0.9 # discount factor
eps = 100000 # total episodes being 100000


for i in range(eps):
  s = env.reset()
  t = False
  while(True):
    a = epsilon_greedy(Q, s, env.action_space.n)
    s_, r, t, _ = env.step(a)
    if (r==0):
      if t==True:
        r = -5 # give negative rewards when holes turn up
        Q[s_] = np.ones(env.action_space.n) * r # in terminal state Q value equals the reward
      else:
        r = -1 # give negative rewards to avoid long routes
    if (r==1):
      r = 100
      Q[s_] = np.ones(env.action_space.n) * r # in terminal state Q value equals the reward
    
    Q[s,a] = Q[s,a] + lr * (r + y*np.max(Q[s_,a]) - Q[s,a])
    s = s_
    if (t==True):
      break
      
      

In [33]:
print('Q-table')
print(Q)

Q-table
[[ -9.83778557  -3.24135919  -9.63678007 -10.        ]
 [ -9.78785909  -7.98648554  -9.58707493 -10.        ]
 [ -9.75532542  -6.00764762  -9.58265224 -10.        ]
 [ -9.73179677  -9.15491916  -9.5942822  -10.        ]
 [ -9.80030994  -5.81278331  -9.55178373  -9.91776723]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.76763265  -3.00126116  -9.38367652  -9.69605172]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.6958445    1.61616862  -9.41833013  -9.66906161]
 [ -9.58194434  11.99208401  -8.18834419  -9.57363767]
 [ -9.63871314  73.21798673  -9.43075185  -9.56714924]
 [ -5.          -5.          -5.          -5.        ]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.55076316  84.58387031  -7.28222463  -7.389242  ]
 [ -9.61050912 146.88923396   7.32480692  -2.15727352]
 [100.         100.         100.         100.        ]]


In [38]:
print('Output after learning')
print()
# learning ends with the end of the above loop of several episodes above
s = env.reset()
env.render()
while(True):
  a = np.argmax(Q[s])
  s_, r, t, _ = env.step(a)
  print('====================')
  env.render()
  s = s_
  if (t==True):
    break

Output after learning


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG


##Q-Network for Real-World Applications
Q-table is a good option for the real world problem since the states become infinite. This is where neural network acts a function approximator, which is trained over data of different state information and their corresponding Q-values for all action, thereby, they are abl to predict Q-values for any new state information input.

The neural network used to predict Q-values instead of using a Q-table is called **Q-network**.

In the toy example, `FrozenLake-v0` environment, a single neural network taking state information as input is used, where the state information is represented as a **one-hot encoded** vector of the **1 x number of states** shape (here, 1 x 16) and outputs a vecotr of the **1 x number of actions** shape (here, 1 x 4).

For example, consider  there are 16 states numbered from state 0 to state 15, then state number 4 will be represented in one-hot encoded vector as

`input_state = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`

Advantage of Q-network over Q-table is to add more hidden layers and different activation functions. The Q-values in a Q-network are updated by minimizing the loss through backpropagation. The loss functions is given by:
$$\mathrm{Loss}=\sum(Q_{target}-Q_{predicted})^2$$
and
$$Q(s,a)_{target}=r+\gamma\max_{a'}Q(s',a')$$

The following code is to implement a basic Q-Network to make an agent learn to navigate across this frozen lake of 16 grids from the
start to the goal without falling into the hole:

In [0]:
# import dependencies
import gym
import numpy as np
import tensorflow as tf
import random
from matplotlib import pyplot as plt

In [46]:
# load the environment
env = gym.make('FrozenLake-v0')

  result = entry_point.load(False)


In [0]:
# ## Q-Network Implementation

# ### Create neural network

tf.reset_default_graph()

# tensors for inputs, weights, biases, Qtarget
inputs = tf.placeholder(shape=[None, env.observation_space.n], 
                        dtype=tf.float32)
W = tf.get_variable(name='W', 
                    dtype=tf.float32,
                    shape=[env.observation_space.n, env.action_space.n],
                    initializer=tf.contrib.layers.xavier_initializer())
b = tf.Variable(tf.zeros(shape=[env.action_space.n]),
                dtype=tf.float32)

qpred = tf.add(tf.matmul(inputs, W), b)
apred = tf.argmax(qpred, 1)

qtar = tf.placeholder(shape=[1, env.action_space.n], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(qtar - qpred))

train = tf.train.AdamOptimizer(learning_rate=0.001)
minimizer = train.minimize(loss)

In [50]:
# ### Train neural network

init = tf.global_variables_initializer()

# learning parameters
y = 0.5 # discount factor
e = 0.3 # epsilon value for epsilon-greedy task
episodes = 10000 # total number of episodes

# list to capture total steps and rewards per episodes
slist = []
rlist = []

with tf.Session() as sess:
  sess.run(init)
  for i in range(episodes):
    s = env.reset() # reset the environment at the start of each episode
    r_total = 0 # calculate the sum of rewards in the current episode
    while(True):
      # run the Q-Network created above
      a_pred, q_pred = sess.run([apred, qpred], 
                                feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
      '''a_pred is the action prediction by the neural network
         q)pred contains q_values of the actions at current state s'''
      
      if np.random.uniform(low=0, high=1) < e:
        a_pred[0] = env.action_space.sample()
        '''explore different action by randomly assigning them as the next action'''
      
      s_, r, t, _ = env.step(a_pred[0]) 
      '''action taken and new state s_ is encountered with a feedback reward r'''
      
      if r==0:
        if t==True:
          r = -5 # if hole, make the reward more negative
        else:
          r = -1 # if block is fine/frozen then give slight negative reward to optimize the path
      if r==1:
          r = 5 # good positive GOAT state reward
          
      q_pred_new = sess.run(qpred,
                            feed_dict={inputs:np.identity(env.observation_space.n)[s_:s_+1]})
      '''q_pred_new contains q_values of the actions at the new state'''
      
      # update the Q-target value for action taken
      targetQ = q_pred
      max_qpredn = np.max(q_pred_new)
      targetQ[0, a_pred[0]] = r + y*max_qpredn
      '''This gives the targetQ'''
      
      # train the neural network to minimize the loss
      _ = sess.run(minimizer,
                   feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1],
                              qtar:targetQ})
      r_total += r
      
      s = s_
      if t==True:
        break
      
    if i%200==0:
      print('Training #{} is completed.'.format(i))
        
  '''learning ends with the end of the loop of several episodes above
     Check how much the agent has learned'''
  print('Output after learning')
  print()
  s = env.reset()
  env.render()
  while(True):
    a = sess.run(apred,
                 feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
    s_, r, t, _ = env.step(a[0])
    print('================')
    env.render()
    s = s_
    if t==True:
      break

Training #0 is completed.
Training #200 is completed.
Training #400 is completed.
Training #600 is completed.
Training #800 is completed.
Training #1000 is completed.
Training #1200 is completed.
Training #1400 is completed.
Training #1600 is completed.
Training #1800 is completed.
Training #2000 is completed.
Training #2200 is completed.
Training #2400 is completed.
Training #2600 is completed.
Training #2800 is completed.
Training #3000 is completed.
Training #3200 is completed.
Training #3400 is completed.
Training #3600 is completed.
Training #3800 is completed.
Training #4000 is completed.
Training #4200 is completed.
Training #4400 is completed.
Training #4600 is completed.
Training #4800 is completed.
Training #5000 is completed.
Training #5200 is completed.
Training #5400 is completed.
Training #5600 is completed.
Training #5800 is completed.
Training #6000 is completed.
Training #6200 is completed.
Training #6400 is completed.
Training #6600 is completed.
Training #6800 is com

There is a cost of stability associated with both Q-learning and Q-networks. There will be
cases when with the given set of hyperparameters of the Q-values are not converge, but
with the same hyperparameters, sometimes converging is witnessed.

This is because of the
instability of these learning approaches. In order to tackle this, a better initial policy should
be defined (here, the maximum Q-value of a given state) if the state space is small.
Moreover, hyperparameters, especially learning rate, discount factors, and epsilon value,
play an important role. Therefore, these values must be initialized properly.

Q-networks provide more flexibility compared to Q-learning, owing to increasing state
spaces. A deep neural network in a Q-network might lead to better learning and
performance.