# Preliminaries

This notebook lets you import a gym environment and set up an agent that acts within it. Your task is then to implement some of the classical RL algorithms: Value Iteration and Policy Iteration. Pay attention to how you are going to evaluate your agents.

First, we make sure that all dependencies are met.

In [1]:
!pip install gym > /dev/null 2>&1

# Testing the Gym environments

Our next step is to import the gym package, create an environment, and make sure that we can use it.

In [9]:
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

#create a cliff-walker
env = gym.make('CliffWalking-v0')

#set the start state
state = env.reset()
#and take some random actions
for i in range(4):
  #render the environment
  env.render()
  
  #select a random action
  action = env.action_space.sample()
  #take a step and record next state, reward and termination
  state, reward, done, _ = env.step(action)
  print("Acted: {}".format(action))
  print("State: {}".format(state))
  print("Reward: {}".format(reward))
  if done:
    #this environment only terminates once the goal is reached
    print("Done.")
    break

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T

Acted: 3
State: 36
Reward: -1
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T

Acted: 3
State: 36
Reward: -1
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T

Acted: 3
State: 36
Reward: -1
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T

Acted: 3
State: 36
Reward: -1


In [47]:

env.P[47]

{0: [(1.0, 35, -1, False)],
 1: [(1.0, 47, -1, True)],
 2: [(1.0, 47, -1, True)],
 3: [(1.0, 36, -100, False)]}
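
Each entry of `env.P[state][action]` is a list of `(probability, next_state, reward, done)` tuples. As a minimal sketch of how such an entry can be turned into action values, the snippet below copies the dictionary printed above as a literal so that it runs without gym. The helper name `one_step_backup` and the all-zero value table `V` are assumptions made for illustration, not part of the notebook.

```python
# Transition entries for state 47, copied from the output above.
# Each tuple is (probability, next_state, reward, done).
P_47 = {
    0: [(1.0, 35, -1, False)],
    1: [(1.0, 47, -1, True)],
    2: [(1.0, 47, -1, True)],
    3: [(1.0, 36, -100, False)],
}

def one_step_backup(transitions, V, gamma):
    """Expected return of one action: sum over outcomes of p * (r + gamma * V[s'])."""
    total = 0.0
    for prob, next_state, reward, done in transitions:
        # No future value is added after a terminal transition.
        future = 0.0 if done else gamma * V[next_state]
        total += prob * (reward + future)
    return total

V = [0.0] * 48  # placeholder value table (assumption: all zeros)
q_values = {a: one_step_backup(P_47[a], V, gamma=0.99) for a in P_47}
print(q_values)
```

With an all-zero value table the result is just the immediate reward of each action: -1 for actions 0, 1 and 2, and -100 for action 3, which, judging by the reward and the next state 36, steps onto the cliff and is sent back to the start.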

# Defining an agent

The next step is to define a class for our agents. We will later derive Value Iteration, Policy Iteration and Monte Carlo control agents from this class. The base class provides only basic functionality: random action selection and a single-episode evaluation.

In [90]:
class Agent:
  def __init__(self, env, discount_factor):
    self.env = env
    self.gamma = discount_factor
  
  def act(self, state):
    return self.env.action_space.sample() #returns a random action

  def evaluate(self):
    #run a single episode with the agent's policy and return the summed reward
    n_steps = 100 #maximum number of steps per episode

    s = self.env.reset()
    episode_reward = 0
    
    for i in range(n_steps):
      s, r, d, _ = self.env.step(self.act(s))
      episode_reward += r
      if d:
        break
    return episode_reward

#test simple evaluation function
random_agent = Agent(env,0.99)
episode_reward=random_agent.evaluate()
print("Episode return {}".format(episode_reward))

Episode return -991


# Value Iteration Agent

In this section you will implement an agent that solves the environment using Value Iteration.
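
For reference, value iteration in its standard textbook form repeatedly applies the Bellman optimality backup to every state. The notation below is an assumption of this note rather than something defined in the notebook: $p(s', r \mid s, a)$ corresponds to the probability stored in `env.P[s][a]`, $\gamma$ to `discount_factor`, and $\theta$ to `theta`.

$$
V(s) \leftarrow \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V(s')\bigr]
$$

The sweeps stop once the largest change $\max_s |V_{\text{new}}(s) - V_{\text{old}}(s)|$ within a sweep falls below $\theta$. In the transitions printed above, every action has a single outcome with probability 1, so the sum reduces to one term there.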

In [92]:
class ValueAgent(Agent):
  def __init__(self,env,discount_factor,theta):
    super().__init__(env,discount_factor)
    #theta is an approximation error threshold
    self.theta = theta
    self.V = np.random.rand(self.env.observation_space.n)
    #set terminal state to 0
    #self.V[-11:-1] = -1000 
    self.V[-1] = 0
  
  def act(self, state):
    #choose the action with the highest one-step lookahead value
    values = []
    for a in range(self.env.nA):
      #each action has a single outcome here, so we only read the first transition
      prob, next_state, reward, done = self.env.P[state][a][0]
      values.append(reward + self.gamma*self.V[next_state])

    return int(np.argmax(values))

   

  def iterate(self):
    while True:
      delta = 0.0
      #sweep over all states except the terminal one (the last index)
      for state in range(self.env.nS-1):
        v = self.V[state]
        action = self.act(state)
        prob, next_state, reward, done = self.env.P[state][action][0]

        self.V[state] = prob * (reward + self.gamma*self.V[next_state])
        delta = max(delta, np.abs(v-self.V[state]))
      print(delta)
      #stop once the largest update in a sweep falls below the threshold
      if delta < self.theta:
        print(delta)
        break


agent = ValueAgent(env,0.99,0.001)
print(agent.V[:12])
print(agent.V[12:24])
print(agent.V[24:36])
print(agent.V[36:])
#perform value iteration
agent.iterate()
#evaluate agent and print relevant quantities
episode_reward=agent.evaluate()
print("Episode return {}".format(episode_reward))
np.set_printoptions(precision=3, linewidth=200)
print(agent.V[:12])
print(agent.V[12:24])
print(agent.V[24:36])
print(agent.V[36:])

[0.973 0.137 0.582 0.797 0.941 0.984 0.975 0.036 0.883 0.534 0.198 0.094]
[0.444 0.485 0.765 0.721 0.705 0.466 0.246 0.906 0.864 0.603 0.95  0.425]
[0.927 0.895 0.412 0.393 0.335 0.179 0.617 0.869 0.147 0.095 0.142 0.139]
[0.077 0.205 0.133 0.208 0.267 0.584 0.98  0.245 0.978 0.796 0.847 0.   ]
2.874456488614557
1.7060187236901483
1.6889585364532465
1.672068951088714
1.6553482615778266
1.6387947789620485
1.6224068311724285
1.4254172414026236
1.4111630689885972
1.3970514382987114
0.91328134112843
0.9041485277171457
0.8951070424399745
0.886155972015576
0.8548599526441283
0.0
0.0
Episode return -13
[-13.125 -12.248 -11.362 -10.466  -9.562  -8.648  -7.726  -6.793  -5.852  -4.901  -3.94   -2.97 ]
[-12.248 -11.362 -10.466  -9.562  -8.648  -7.726  -6.793  -5.852  -4.901  -3.94   -2.97   -1.99 ]
[-11.362 -10.466  -9.562  -8.648  -7.726  -6.793  -5.852  -4.901  -3.94   -2.97   -1.99   -1.   ]
[-12.248 -11.362 -10.466  -9.562  -8.648  -7.726  -6.793  -5.852  -4.901  -3.94   -1.      0.   ]
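
The converged values can be cross-checked by hand. The episode return printed above is undiscounted (13 steps at -1 each gives -13), whereas `V` holds discounted returns. As a small sketch, with `discounted_return` being a hypothetical helper that is not part of the notebook, 13 rewards of -1 discounted with $\gamma = 0.99$ should reproduce the value printed for the start state (index 36, the first entry of the last row):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    total = 0.0
    # Accumulate backwards so each step is G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# 13 steps from the start state to the goal, each with reward -1.
g = discounted_return([-1] * 13, gamma=0.99)
print(round(g, 3))
```

This prints -12.248, matching the entry for state 36 in the output above.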


# Policy Iteration Agent
Follow the same procedure to implement a Policy Iteration agent.
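
As a starting point, the sketch below shows policy iteration in its standard textbook form (alternating policy evaluation and greedy policy improvement) on a tiny hand-made chain rather than on CliffWalking. Everything in it is an assumption made for illustration: the 3-state transition table `P_toy` mimics the `env.P` format, and the function names `q_value`, `evaluate_policy` and `policy_iteration` are hypothetical. It is not necessarily the structure expected for the agent class in this notebook.

```python
import numpy as np

# Hypothetical 3-state chain in the same format as env.P:
# P[state][action] -> list of (probability, next_state, reward, done).
# Action 0 moves left, action 1 moves right; state 2 is terminal.
P_toy = {
    0: {0: [(1.0, 0, -1, False)], 1: [(1.0, 1, -1, False)]},
    1: {0: [(1.0, 0, -1, False)], 1: [(1.0, 2, -1, True)]},
    2: {0: [(1.0, 2, 0, True)], 1: [(1.0, 2, 0, True)]},
}

def q_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + (0.0 if done else gamma * V[s2]))
               for p, s2, r, done in P[s][a])

def evaluate_policy(P, policy, gamma, theta):
    """Iterative policy evaluation: sweep until the largest change is below theta."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_iteration(P, gamma=0.9, theta=1e-8):
    n_actions = len(P[0])
    policy = np.zeros(len(P), dtype=int)  # start with "always action 0"
    while True:
        V = evaluate_policy(P, policy, gamma, theta)
        stable = True
        for s in P:
            # Greedy improvement with respect to the current value estimate.
            best = int(np.argmax([q_value(P, V, s, a, gamma)
                                  for a in range(n_actions)]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V

policy, V = policy_iteration(P_toy)
print(policy, V)
```

On this toy chain the procedure settles on action 1 (right) in both non-terminal states, with values of about -1.9 and -1. Unlike value iteration, the policy is stored explicitly and the loop ends when an improvement step leaves it unchanged.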

In [None]:
#code here

# Monte Carlo control agent
Follow the same procedure to implement a Monte Carlo control agent.

In [None]:
#code here