# Preliminaries

This notebook lets you import a gym environment and set up an agent that acts within the environment. Your tasks is to then implement some of the classical RL algorithms: Value iteration and Policy iteration. Play attention to how you are going to evaluate your agents.

First, we make sure that all dependencies are met

In [1]:
!pip install gym > /dev/null 2>&1

# Testing the Gym environments

Our next step is to import the gym package, create an environment, and make sure that we can use it.

In [9]:
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

#create a cliff-walker
env = gym.make('CliffWalking-v0')

#set the start state
state = env.reset()
#and take some random actions
for i in range(4):
  #render the environment
  env.render()
  
  #select a random action
  env.action_space.sample()
  #take a step and record next state, reward and termination
  state, reward, done, _ = env.step(action)
  print("Acted: {}".format(action))
  print("State: {}".format(state))
  print("Reward: {}".format(reward))
  if done:
    #this environment only terminates once the goal is reached
    print("Done.")
    break

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T

Acted: 3
State: 36
Reward: -1
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T

Acted: 3
State: 36
Reward: -1
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T

Acted: 3
State: 36
Reward: -1
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T

Acted: 3
State: 36
Reward: -1


In [47]:

env.P[47]

{0: [(1.0, 35, -1, False)],
 1: [(1.0, 47, -1, True)],
 2: [(1.0, 47, -1, True)],
 3: [(1.0, 36, -100, False)]}

# Defining an agent

The next step is to define a class for our agents. We will derive from this class to later implement a Value Iteration, Policy Iteration and Monte Carlo control agent. The base class will only provide simple functionality.

In [4]:
class Agent :
  def __init__(self,env,discount_factor):
    self.env = env
    self.gamma = discount_factor
  
  def act(self, state):
    return self.env.action_space.sample() #returns a random action

  def evaluate(self):
    # now let's test our random action agent
    n_steps = 100 #number of steps per episode

    s = env.reset()
    episode_reward = 0
    
    for i in range(n_steps):
      s, r, d, _ = env.step(self.act(s))
      episode_reward += r
      if done:
        break
    return episode_reward

#test simple evaluation function
random_agent = Agent(env,0.99)
episode_reward=random_agent.evaluate()
print("Episode return {}".format(episode_reward))

Episode return -1387


# Value Iteration Agent

In this section you are to implement an agent that solves the environment, using Value Iteration

In [48]:
class ValueAgent(Agent):
  def __init__(self,env,discount_factor,theta):
    super().__init__(env,discount_factor)
    #theta is an approximation error threshold
    self.theta = theta
    self.V = np.random.rand(self.env.observation_space.n)
    #set terminal state to 0
    self.V[-1] = 0 
  
  def act(self, state): 
    #here choose action that would bring us to state with highest value
    # Select the action that has highest expected value
 
    values=[]
    for i in range(self.env.nA):
      _,next_state,_,_ = env.P[state][i][0]
      values.append(self.V[next_state])
    
    action = np.argmax(values)
    #print(action)
    return action

   

  def iterate(self):
    while(True):
    #for i in range(5):
      #print(self.V) 
      delta = 0.0
      for state in range(self.env.nS-1):
        v = self.V[state]
        action = self.act(state)
        prob, next_state, reward, done = self.env.P[state][action][0]

        #if not done:
        self.V[state] = prob * (reward + self.gamma*self.V[next_state])
        delta = max([delta, np.abs(v-self.V[state])])
      print(delta)
      if (delta < self.theta):
        print(delta)
        break


agent = ValueAgent(env,0.99,0.001)
#perform value iteration
agent.iterate()
#evaluate agent and plot relevant qualities
episode_reward=agent.evaluate()
print("Episode return {}".format(episode_reward))
print(agent.V)

100.79795732681274
98.0562179234899
100.68085875665723
100.68085875665723
100.64052158641118
100.56518805330413
100.54953617277108
100.52488860386308
100.50963971782446
100.49454332064622
98.2567250796206
0.9041130297750186
0.8950718994772675
0.8861211804824958
0.8514155908691734
0.0
0.0
Episode return -100
[-13.12541872 -12.2478977  -11.36151283 -10.46617457  -9.5617925
  -8.64827525  -7.72553056  -6.79346521  -5.85198506  -4.90099501
  -3.940399    -2.9701     -12.2478977  -11.36151283 -10.46617457
  -9.5617925   -8.64827525  -7.72553056  -6.79346521  -5.85198506
  -4.90099501  -3.940399    -2.9701      -1.99       -11.36151283
 -10.46617457  -9.5617925   -8.64827525  -7.72553056  -6.79346521
  -5.85198506  -4.90099501  -3.940399    -2.9701      -1.99
  -1.         -12.2478977  -11.36151283 -10.46617457  -9.5617925
  -8.64827525  -7.72553056  -6.79346521  -5.85198506  -4.90099501
  -3.940399    -1.           0.        ]


# Policy Iteration Agent
Follow the same procedure for implementing a policy iteration agent

In [None]:
#code here

#Monte Carlo control agent
Follow the same procedure for implementing a Monte Carlo control agent

In [None]:
#code here