<img src="skoltech_logo.png" alt="Skoltech" width=60% height=60% />
<h1 style="color:#333333; text-align:center; line-height: 0;">Reinforcement Learning</h1>
<h5 style="color:#333333; text-align:center;">Course MA030422</h5>

<h2 style="color:#A7BD3F;">Homework 1</h2>

***

Welcome to homework 1 of the *Reinforcement Learning* course at Skoltech CDISE! In this homework we will be covering two types of classical approaches to RL problems: **value iteration** and **policy iteration**.

### Components

* **Section 1**: Review of relevant concepts
* **Section 2**: OpenAI FrozenLake environment
* **Section 3**: Introduction to algorithms in DP for finite MDPs
    * Exercise 1
        * Problem 1.1 - Value Iteration (15 points)
        * Problem 1.2 - Policy Iteration (15 points)
        * Problem 1.3 - FrozenLake8x8 (2 points)
        
Total points: 32

<h2 style="color:#A7BD3F;">Section 1</h2>

***

### Background material 

Before we begin, let's refresh our memory: what is the core problem that reinforcement learning solves?

At the simplest level, the problem is to teach an agent to behave *optimally* in a specific environment. An example might be teaching a robot to bounce a ball for some period of time, or programming a helicopter to hold its altitude in unpredictable winds.

What counts as *optimal* is defined numerically by the **objective function**, which is typically maximized. In the context of MDPs (Markov Decision Processes), this objective function is known as the **value function**, aka the *total expected reward*, which is received over sequential state transitions.

The goal then, is to teach the agent to maximize the *total expected reward* it earns over some time horizon (theoretically, it is an infinite time horizon) by selecting the best action as dictated by the value function. The agent learns to maximize rewards as it transitions from state to state, taking actions in each state. In a deterministic case, the transitioning process is diagrammed as follows:

$$\text{State 1}\xrightarrow[Action]{}\text{State 2} \xrightarrow[Action]{}\text{State 3}\xrightarrow[Action]{}\dots$$

### Markov Decision Processes

Let's formalize the key components of the RL problem in the context of MDPs:

An MDP is defined by: $(S, A, P, R, S_0, \gamma)$
* S = set of states (state-space)
* A = set of actions (action-space)
* P = state transition probabilities
* R = reward for taking an action $a\in\text{A}$ in state $s\in\text{S}$
* $S_0$ = starting state
* $\gamma$ = discount rate
    
In more detail:
* **States** - states can be discrete/finite (imagine cells in a grid world) or continuous/infinite (position on a road).
    * Referred to as the *state space* (i.e. discrete state space or continuous state space)
* **Actions** - actions can also be discrete (moving up/down/left/right in a grid world cell) or continuous (how many degrees to turn a steering wheel when driving a car).
    * Referred to as the *action space* (i.e. discrete action space or continuous action space)
* **Rewards** - rewards are issued by a reward function $\rho : S \times A \rightarrow \mathbb{R}$. The reward function is a property of the environment.
* **Transition probabilities** - in MDPs, these are denoted by $P_{s,a}$. The transition probability is the probability that taking some action $a$ in state $s$ leads to state $s^\prime$ (the prime denotes the next time step), written as $p(s^\prime|s,a)$.
* **Discount factor** - the discount factor is a number greater than or equal to 0 and less than 1 that is used to discount rewards received over sequential time-steps. It is denoted as $\gamma \in [0, 1)$.
* **Value function** - one of the primary functions learned by the agent: the value function gives either the value of a state or the value of an action. More on this below.
* **Policy function** - one of the primary functions learned by the agent: the policy maps states to actions. More below.
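
As a rough illustration of the tuple $(S, A, P, R, S_0, \gamma)$, the sketch below stores a tiny MDP in plain Python dictionaries and checks that the transition probabilities form proper distributions. The two-state "cold/warm" MDP, its numbers, and the helper name `is_valid_mdp` are all invented for this example and are not part of the homework code.

```python
# Hypothetical two-state MDP, invented purely for illustration.
states = ["cold", "warm"]            # S
actions = ["stay", "move"]           # A

# P[(s, a)] maps each next state s' to p(s' | s, a).
P = {
    ("cold", "stay"): {"cold": 1.0},
    ("cold", "move"): {"warm": 0.8, "cold": 0.2},
    ("warm", "stay"): {"warm": 1.0},
    ("warm", "move"): {"cold": 1.0},
}

# R[(s, a)] is the reward for taking action a in state s.
R = {
    ("cold", "stay"): 0.0,
    ("cold", "move"): 0.0,
    ("warm", "stay"): 1.0,
    ("warm", "move"): 0.0,
}

start_state = "cold"                 # S_0
gamma = 0.95                         # discount rate


def is_valid_mdp(states, actions, P, R):
    """Check that every (s, a) pair has a reward and a distribution summing to 1."""
    for s in states:
        for a in actions:
            if (s, a) not in P or (s, a) not in R:
                return False
            if abs(sum(P[(s, a)].values()) - 1.0) > 1e-9:
                return False
    return True


print(is_valid_mdp(states, actions, P, R))
```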

#### Other useful definitions
* **Experience** - $\big(\text{State}_{t}$, $\text{Action}_{t}$, $\text{Reward}_{t}\big)$ tuple
* **Trajectory** - A sequence of *experiences* through time, represented as: $\tau$ (tau)
* **Episode** - A trajectory that ends in a terminal state
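
To make these definitions concrete, here is a minimal sketch that stores a trajectory as a list of experience tuples and discounts its rewards with $\gamma$. It assumes the return of a trajectory is the discounted sum $\sum_t \gamma^t r_t$; the `Experience` tuple, the function name, and the integer state/action labels are illustrative choices, not part of the homework code.

```python
from collections import namedtuple

# One experience: (State_t, Action_t, Reward_t).
Experience = namedtuple("Experience", ["state", "action", "reward"])


def discounted_return(trajectory, gamma):
    """Sum of gamma**t * reward_t over the experiences in a trajectory."""
    return sum((gamma ** t) * exp.reward for t, exp in enumerate(trajectory))


# Illustrative episode: five steps with zero reward, then a final reward of 1.
trajectory = [
    Experience(0, 1, 0.0),
    Experience(4, 1, 0.0),
    Experience(8, 2, 0.0),
    Experience(9, 1, 0.0),
    Experience(13, 2, 0.0),
    Experience(14, 2, 1.0),
]

print(discounted_return(trajectory, gamma=0.95))
```

Because the only reward arrives on the sixth step, it is discounted five times, so the result is $0.95^5$.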


### Policy

The agent's learning process can be thought of as finding a mapping from states to actions, $a = \pi(s)$, that maximizes expected reward over an episode. This mapping is known as the **policy:** $\pi: S \rightarrow A$. 

Note:
* The agent needs to explore and interact with its environment to learn which actions earn the maximum rewards.
* Actions in the current time step affect rewards in future time steps.
* There is a trade-off between the frequency of sampling the environment and the frequency of taking actions.
* A policy can be defined over discrete or continuous state-spaces (and the same holds for action-spaces).
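
For a discrete state-space, a policy can be sketched as a simple lookup table. The example below shows a deterministic policy $a = \pi(s)$ next to a stochastic one that stores a distribution $\pi(a|s)$ per state. The state names, action names, and the `select_action` helper are hypothetical.

```python
import random

# Deterministic policy: exactly one action per state.
deterministic_policy = {"s0": "right", "s1": "down", "s2": "right"}

# Stochastic policy: a probability for each action in each state.
stochastic_policy = {
    "s0": {"right": 0.9, "down": 0.1},
    "s1": {"right": 0.5, "down": 0.5},
    "s2": {"right": 1.0},
}


def select_action(policy, state, rng=random):
    """Return the policy's action for `state`, sampling if the entry is a distribution."""
    choice = policy[state]
    if isinstance(choice, dict):
        candidates = list(choice)
        weights = [choice[a] for a in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]
    return choice


print(select_action(deterministic_policy, "s1"))
print(select_action(stochastic_policy, "s2"))
```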

### Value functions

The **worthiness** of a policy is calculated by the aforementioned *value function*. There are various forms of value functions. First, the value of a state <sup>[3]</sup>:

<img src="state_to_action.png" width="65%" height="65%" />

Second, the value of an action <sup>[3]</sup>:

<img src="action_to_state.png" width="55%" height="55%" />

Let's review the notation of these value functions in terms of *expected value/reward* (aka expectation notation) <sup>[3][4]</sup>. Note:
* $\tau$ = trajectory

#### <font color="#1DAE00">On-policy ($\pi$) state-value function</font> :

$$V^\pi{(s)} = \displaystyle \mathop{\mathbb{E}}_{\tau\sim \pi}\big[R(\tau)\thinspace|\thinspace s_0 = s\big]$$

* **Definition**: the expected reward of trajectory $\tau$ (sampled from policy $\pi$) starting from state $s$ *over an infinite time horizon*.

#### <font color="#1DAE00">Optimal ($*$) state-value function</font> :

$$V^*{(s)} = \max_{\pi}{\displaystyle \mathop{\mathbb{E}}_{\tau\sim \pi}\big[R(\tau)\thinspace|\thinspace s_0 = s\big]}$$

* **Definition**: the <font color="red">maximum (over all policies)</font> expected reward of trajectory $\tau$ (sampled from policy $\pi$) starting from state $s$ *over an infinite time horizon*.

#### <font color="#5A00AE">On-policy ($\pi$) action-value function</font> :

$$Q^\pi{(s, a)} = \displaystyle \mathop{\mathbb{E}}_{\tau\sim \pi}\big[R(\tau)\thinspace|\thinspace s_0 = s, a_0 = a\big]$$

* **Definition**: the expected reward of trajectory $\tau$ (sampled from policy $\pi$) starting from state $s$ and taking action $a$ *over an infinite time horizon*.

#### <font color="#5A00AE">Optimal ($*$) action-value function</font> :

$$Q^*{(s,a)} = \max_{\pi}{\displaystyle \mathop{\mathbb{E}}_{\tau\sim \pi}\big[R(\tau)\thinspace|\thinspace s_0 = s, a_0 = a\big]}$$

* **Definition**: the <font color="red">maximum (over all policies)</font> expected reward of trajectory $\tau$ (sampled from policy $\pi$) starting from state $s$ and taking action $a$ *over an infinite time horizon*.


#### Bellman equations:

Each of the value functions above has a slightly more explicit form that defines how value is calculated recursively. These are known as **Bellman's equations** <sup>[3][4]</sup>: 

#### <font color="#DE001B">Bellman's on-policy ($\pi$) state-value function</font> :

$$V^\pi{(s)} = \displaystyle \mathop{\mathbb{E}}_{\substack{a\sim\pi\\s^\prime\sim P}}\big[R(s,a) + \gamma V^\pi{(s^\prime)}\big]$$

#### <font color="#DE001B">Bellman's optimal ($*$) state-value function</font> :

$$V^*{(s)} = \displaystyle \max_{a}{\mathop{\mathbb{E}}_{s^\prime\sim{P}}\big[R(s,a) + \gamma V^*{(s^\prime)}\big]}$$

#### <font color="#A1AE00">Bellman's on-policy ($\pi$) action-value function</font> :

$$Q^\pi{(s,a)} = \displaystyle \mathop{\mathbb{E}}_{s^\prime\sim P}\big[R(s,a) + \gamma \mathop{\mathbb{E}}_{a^\prime\sim{\pi}}[Q^\pi{(s^\prime, a^\prime)}]\big]$$

#### <font color="#A1AE00">Bellman's optimal ($*$) action-value function</font> :

$$Q^*{(s,a)} = \displaystyle \mathop{\mathbb{E}}_{s^\prime\sim{P}}\big[R(s,a) + \gamma \max_{a^\prime}Q^*{(s^\prime, a^\prime)}\big]$$


### Expanded notations of Bellman's functions

Going further, all of the value functions above can be expressed more explicitly. These are the equations that you need to know for programming purposes <sup>[3][2]</sup>:

#### <font color="#00DED1">Bellman's on-policy ($\pi$) state-value function</font> :

$$V^\pi{(s)} = \sum_{a\in\text{A}} \pi(a|s) \sum_{s^\prime\in\text{S}}P{(s^\prime|s,a)} \big[R{(s,a,s^\prime)} + \gamma V^\pi{(s^\prime)}\big]$$
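
The expanded on-policy state-value equation can be applied repeatedly as an update rule until the values stop changing. The sketch below does this on a hypothetical three-state chain under a uniform random policy; the chain, its rewards, and the name `evaluate_policy` are assumptions made for illustration only.

```python
def evaluate_policy(states, actions, P, R, pi, gamma, theta=1e-10):
    """Apply the expanded Bellman expectation equation until changes fall below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # sum over actions a and next states s' of pi(a|s) * P(s'|s,a) * [R + gamma * V(s')]
            new_value = sum(
                pi[s][a] * prob * (R.get((s, a, s_next), 0.0) + gamma * V[s_next])
                for a in actions
                for s_next, prob in P[(s, a)].items()
            )
            delta = max(delta, abs(new_value - V[s]))
            V[s] = new_value
        if delta < theta:
            return V


# Hypothetical chain 0 <-> 1 -> 2, where state 2 is terminal (absorbing).
states = [0, 1, 2]
actions = ["left", "right"]
P = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 1.0},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 1.0},
    (2, "left"): {2: 1.0}, (2, "right"): {2: 1.0},
}
R = {(1, "right", 2): 1.0}  # every other transition pays 0
pi = {s: {"left": 0.5, "right": 0.5} for s in states}  # uniform random policy

V = evaluate_policy(states, actions, P, R, pi, gamma=0.9)
print(V)
```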

#### <font color="#00DED1">Bellman's optimal ($*$) state-value function</font> :

$$V^*{(s)} = \max_{a\in\text{A}} \sum_{s^\prime\in\text{S}}P{(s^\prime|s,a)} \big[R{(s,a,s^\prime)} + \gamma V^*{(s^\prime)}\big]$$

#### <font color="#DE008A">Bellman's on-policy ($\pi$) action-value function</font> :

$$Q^\pi{(s,a)} = \sum_{s^\prime\in\text{S}} P{(s^\prime|s,a)} \big[R{(s,a,s^\prime)} + \gamma \sum_{a^\prime\in\text{A}} \pi{(a^\prime|s^\prime)} Q^\pi{(s^\prime,a^\prime)}\big]$$

#### <font color="#DE008A">Bellman's optimal ($*$) action-value function</font> :

$$Q^*{(s,a)} = \sum_{s^\prime\in\text{S}} P{(s^\prime|s,a)}  \big[R{(s,a,s^\prime)} + \gamma \max_{a^\prime} Q^*{(s^\prime,a^\prime)}\big]$$
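
Similarly, the expanded optimal action-value equation can be iterated to a fixed point, after which a greedy policy is read off by taking the best action in each state. This is a minimal sketch on a hypothetical three-state chain; the function names and the MDP are illustrative assumptions, not the homework's `MDP` class.

```python
def q_value_iteration(states, actions, P, R, gamma, theta=1e-10):
    """Iterate the Bellman optimality equation for Q until changes fall below theta."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                new_q = sum(
                    prob * (R.get((s, a, s_next), 0.0)
                            + gamma * max(Q[(s_next, a_next)] for a_next in actions))
                    for s_next, prob in P[(s, a)].items()
                )
                delta = max(delta, abs(new_q - Q[(s, a)]))
                Q[(s, a)] = new_q
        if delta < theta:
            return Q


def greedy_policy(Q, states, actions):
    """Pick, in every state, the action with the largest Q-value."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


# Hypothetical chain 0 <-> 1 -> 2; entering the terminal state 2 pays 1.
states = [0, 1, 2]
actions = ["left", "right"]
P = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 1.0},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 1.0},
    (2, "left"): {2: 1.0}, (2, "right"): {2: 1.0},
}
R = {(1, "right", 2): 1.0}

Q = q_value_iteration(states, actions, P, R, gamma=0.9)
policy = greedy_policy(Q, states, actions)
print(policy)
```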




### Goal of this homework

In this homework we will practice applying these value functions in classical iterative algorithms for RL (specifically, for finite MDPs). Be sure that you have installed OpenAI Gym for Python; see the documentation [here](https://gym.openai.com/docs/).

<h2 style="color:#A7BD3F;">Section 2 - Environment</h2>

***

### Intro to OpenAI <i style="color:blue;">FrozenLake</i> environment

For this homework we will be exploring agent training in a grid world environment called *FrozenLake*. Read more about it [here](https://gym.openai.com/envs/FrozenLake-v0/). To summarize:
* FrozenLake is a 2D grid world
* There are 2 variants of the environment: 4x4 and 8x8. We will try both.
* There are 4 types of grid cells (S, F, H, G). *S* is the starting point, *G* is the goal, *F* is a frozen surface and *H* is a hole.
* If the agent steps onto a slippery surface, it may slip and end up in a different state from the one its action was aiming for (think transition probabilities!).
* The rewards are sparse: the agent receives a reward of 1 when reaching the goal and 0 otherwise. If the agent falls into a hole, the episode is over.
* <font color="red">Note: we will not be using the slippery version of the environment</font>, to keep the introduction to the core concepts simple.
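
To get a feel for the non-slippery dynamics without installing anything, the sketch below re-creates them in plain Python. It is a simplified stand-in rather than Gym's implementation: it assumes the 4x4 layout that appears in the rendered output later in this notebook, and Gym's FrozenLake action numbering (0 = Left, 1 = Down, 2 = Right, 3 = Up). The names `LAKE`, `MOVES`, and `step` are made up for this example.

```python
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # action -> (row delta, col delta)


def step(state, action, lake=LAKE):
    """Non-slippery transition: return (next_state, reward, done)."""
    n_rows, n_cols = len(lake), len(lake[0])
    row, col = divmod(state, n_cols)
    d_row, d_col = MOVES[action]
    # Moving into the border leaves the agent where it is.
    row = min(max(row + d_row, 0), n_rows - 1)
    col = min(max(col + d_col, 0), n_cols - 1)
    cell = lake[row][col]
    next_state = row * n_cols + col
    reward = 1.0 if cell == "G" else 0.0   # sparse reward: only the goal pays
    done = cell in "HG"                    # holes and the goal end the episode
    return next_state, reward, done


state, total_reward = 0, 0.0
for action in [1, 1, 2, 1, 2, 2]:  # Down, Down, Right, Down, Right, Right
    state, reward, done = step(state, action)
    total_reward += reward
print(state, total_reward, done)
```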

<h2 style="color:#A7BD3F;">Section 3 - Dynamic Programming algorithms for MDPs</h2>

***

In environments that have ❗<font color="red">discrete</font> state **and** action spaces, such as FrozenLake or Gym-Minigrid (which we will see in lab 2), two classical algorithms can be used to train the RL agent to solve the environment's goal.

The first is **value iteration** <sup>[3]</sup>:

<img src="value_iteration.png" alt="Value Iteration" width=65% height=65% />

The second is **policy iteration** <sup>[3]</sup>:

<img src="policy_iteration.png" alt="Policy Iteration" width=65% height=65% />

Below, we have implemented a class called `MDP` that combines operations common to **both** of these algorithms. 

#### <font color="red">First</font>, read pages 74 through 83 in Chapter 4 of your class text <sup>[3]</sup>.

#### Next, examine the code below:

In [1]:
import gym
import numpy as np
import collections
import sys
from tqdm import tqdm
from IPython.display import clear_output
import time

"""
DO NOT MODIFY
"""

class MDP:
    def __init__(self, env_name, is_slippery=False):
        self.env = gym.make(env_name, is_slippery=is_slippery)
        self.rewards = collections.defaultdict(float)
        self.transits = collections.defaultdict(collections.Counter)
        self.values = collections.defaultdict(float)
        self.gamma = 0.95
        self.theta = 0.0005
        
    def return_rewards(self):
        return 1.0 in self.rewards.values()

    def return_state_values(self):
        return tuple(self.values.items())

    def _model_transits_rewards(self, num_steps):
        """

        Description: step through the environment to model rewards and transits for all states. Also called "filling a buffer".

        Args:
            * num_steps - num steps to take through env. Should be sufficient to stochastically achieve goal of 1.
        """
        current_state = self.env.reset()

        print("Modeling rewards and transition probabilities ...")
    
        for i in tqdm(range(num_steps)):
            # sample random action
            action = self.env.action_space.sample()

            # take step
            new_state, reward, is_done, _ = self.env.step(action)
            
            # assign rewards for new state
            self.rewards[(current_state, action, new_state)] = reward

            # log transit from state to new state
            self.transits[(current_state, action)][new_state] += 1
            
            if is_done:
                current_state = self.env.reset()
            else:
                current_state = new_state

    def _get_action_value(self, current_state, action):
        """ 

        Description: Get the value of action

        Args:
            * State
            * Action

        Returns:
            * Value of taking the action in the current state

        """
        next_state_counts = self.transits[(current_state, action)]
        total_transits = sum(next_state_counts.values())
        action_value = 0.0
        
        for next_state, n_transits in next_state_counts.items():
            reward = self.rewards[(current_state, action, next_state)]
            transit_prob = (n_transits / total_transits)
            action_value += transit_prob * (reward + self.gamma * self.values[next_state])
        
        return action_value

    def _get_best_action(self, state):
        """ 

        Description: get best action for current state

        Args:
            * state

        Returns:
            * best action for current state

        """
        action_values = {}

        for action in range(self.env.action_space.n):
            action_value = self._get_action_value(state, action)
            action_values[action] = action_value

        # pick the action with the highest estimated value
        best_action = max(action_values, key=action_values.get)
        
        return best_action
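
The core of `_get_action_value` is turning transition counts into empirical probabilities and then taking a probability-weighted backup. The standalone sketch below mirrors that logic outside the class so it can be run without Gym. The counts, rewards, and values are hypothetical numbers chosen to mimic a slippery transition, and the free function `action_value` is an illustrative re-statement, not a replacement for the method.

```python
import collections


def action_value(next_state_counts, rewards, values, state, action, gamma):
    """Estimate Q(state, action) from observed transition counts."""
    total = sum(next_state_counts.values())
    value = 0.0
    for next_state, count in next_state_counts.items():
        prob = count / total  # empirical estimate of p(s' | s, a)
        reward = rewards[(state, action, next_state)]
        value += prob * (reward + gamma * values[next_state])
    return value


# Hypothetical log: from state 14, action 2 reached state 15 eight times and state 10 twice.
counts = collections.Counter({15: 8, 10: 2})
rewards = collections.defaultdict(float, {(14, 2, 15): 1.0})
values = collections.defaultdict(float, {10: 0.95})

print(action_value(counts, rewards, values, 14, 2, gamma=0.95))
```

Note that a state-action pair that was never observed has an empty counter, so the loop body never runs and the estimate is simply 0.0. This is one reason the exploration phase needs enough steps.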

### <font color="blue">Exercise 1: Classical MDP algorithms</font>

### <font color="blue">Problem 1.1 - Value Iteration</font>

#### 🎯 Task:  Implement Value Iteration on the FrozenLake environment.

Guidance/hints:
* Read Chapter 4 from the class text.
* Be sure you understand the algorithm (value iteration in this case).
* Test taking steps through the FrozenLake environment by making your own script and executing it in new code cells
* Explore the MDP class above
* To get it to work, you'll need to make sure your algorithm is properly calculating the value of states. Call the `return_state_values` method to check the value of states. The value of the final state (aka the goal state) is always 0.

In [30]:
"""
ADD YOUR CODE BETWEEN THE COMMENTS BELOW
"""
class ValueIteration(MDP):
    def __init__(self, env_name, is_slippery=False):
        super().__init__(env_name, is_slippery=is_slippery)
        
    def _value_iteration(self):
        """ 

        Description: Perform Value Iteration

        """
        #print("Performing value iteration ...")
        self.values = {state: 1. for state in range(self.env.env.nS)}
        self.values[self.env.env.nS] = 0.
        while True:
            delta = 0
            
            ### YOUR SOLUTION BELOW

            for state in range(self.env.env.nS):
                action_values_list = []
                old_value = self.values[state]
                for action in range(self.env.env.nA):
                    action_values_list.append(self._get_action_value(state, action))

                self.values[state] = max(action_values_list)
                delta = max(delta, abs(old_value - self.values[state]))
            ### YOUR SOLUTION ABOVE
            if delta < self.theta:
                break
                
    def _run_episode(self, render=True):
        """

        Description: perform an episode on the environment

        Args
            * Render - render env to screen?

        """

        clear_output()
        episode_reward = 0.0
        current_state = self.env.reset()

        # render initial state
        if render:
            self.env.render()
        
        while True:
            action = self._get_best_action(current_state)
            new_state, step_reward, is_done, _ = self.env.step(action)
            
            self.rewards[(current_state, action, new_state)] = step_reward
            self.transits[(current_state, action)][new_state] += 1
            
            episode_reward += step_reward
            
            if render:
                self.env.render()

            if is_done:
                self.env.reset()
                break
            
            current_state = new_state

        print(f"...Episode completed.")

        return episode_reward
        
    def run_simulation(self, num_steps = 1000, render=True):
        """ Run training simulation """
        try:
            ### YOUR SOLUTION BELOW
            #raise NotImplementedError
            self._model_transits_rewards(num_steps)
            self._value_iteration()
            ### YOUR SOLUTION ABOVE
            episode_reward = self._run_episode(render=render)

            if episode_reward > 0.85:
                print(f"Environment solved.")
            else:
                clear_output()
                print(f"Failed to solve environment.")

        except KeyboardInterrupt:
            print(f"...Cancelling...")

        except NotImplementedError:
            print(f"Your solution is incomplete")

### Run simulation

Execute the cell below. If your code is correct, you will see the environment render the agent taking steps. The final cell will be 'G'. 

* **Note**: Running the simulation below with a large enough `num_steps` hyperparameter is vital to the value iteration algorithm working successfully. This is because the modeling process, which is stochastic (via `_model_transits_rewards`), requires sufficient iterations to reach the goal-state and acquire the reward in the environment, as well as to make enough transits to all possible states to make value iteration numerically stable.
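
To see why `num_steps` matters, the sketch below runs a random walk on a plain-Python stand-in for the non-slippery 4x4 lake and counts how many distinct (state, action) pairs have been tried and how often the goal was reached. It is an illustration under assumptions (a hand-coded map and Gym's Left/Down/Right/Up action order), not the homework's `_model_transits_rewards`, and the function name `explore` is made up.

```python
import random

LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # Left, Down, Right, Up


def explore(num_steps, seed=0):
    """Random walk; return the set of visited (state, action) pairs and the goal count."""
    rng = random.Random(seed)
    seen, goal_hits, state = set(), 0, 0
    for _ in range(num_steps):
        action = rng.randrange(4)
        seen.add((state, action))
        row, col = divmod(state, 4)
        row = min(max(row + MOVES[action][0], 0), 3)
        col = min(max(col + MOVES[action][1], 0), 3)
        cell = LAKE[row][col]
        if cell == "G":
            goal_hits += 1
        # Holes and the goal end the episode, so restart from S.
        state = 0 if cell in "HG" else row * 4 + col
    return seen, goal_hits


for n in (10, 100, 3000):
    seen, goal_hits = explore(n)
    print(f"{n} steps: {len(seen)} of 44 state-action pairs tried, goal reached {goal_hits} times")
```

The map has 11 non-terminal cells with 4 actions each, hence the 44 pairs. Any pair that is never tried contributes an action value of 0 to the later backups.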

In [3]:
# DO NOT MODIFY

start_time = time.time()

agent1 = ValueIteration("FrozenLake-v0")
agent1.run_simulation(num_steps = 3000)

end_time = time.time()
print(f"Duration of execution: {end_time-start_time:.5f} seconds")


[S]FFF
FHFH
FFFH
HFFG
  (Down)
SFFF
[F]HFH
FFFH
HFFG
  (Down)
SFFF
FHFH
[F]FFH
HFFG
  (Right)
SFFF
FHFH
F[F]FH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[F]FG
  (Right)
SFFF
FHFH
FFFH
HF[F]G
  (Right)
SFFF
FHFH
FFFH
HFF[G]
...Episode completed.
Environment solved.
Duration of execution: 1.61627 seconds


### State values?

In a 4x4 grid of FrozenLake, there are 16 states. Now let's look at their values:

In [4]:
state_values = agent1.return_state_values()
state_values

((0, 0.7737809374999999),
 (1, 0.8145062499999999),
 (2, 0.8573749999999999),
 (3, 0.8145062499999999),
 (4, 0.8145062499999999),
 (5, 0.0),
 (6, 0.9025),
 (7, 0.0),
 (8, 0.8573749999999999),
 (9, 0.9025),
 (10, 0.95),
 (11, 0.0),
 (12, 0.0),
 (13, 0.95),
 (14, 1.0),
 (15, 0.0),
 (16, 0.0))
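
These values have a simple structure in the non-slippery case: a state that is $k$ moves away from the goal is worth $\gamma^{k-1}$, because the single reward of 1 arrives on the $k$-th move and is discounted $k-1$ times. The short check below reproduces the non-zero values with $\gamma = 0.95$. The move counts were read off the 4x4 map by hand, so treat the dictionary as an assumption of this sketch.

```python
gamma = 0.95

# Fewest moves from each non-terminal state to G, avoiding holes (counted by hand).
moves_to_goal = {0: 6, 1: 5, 2: 4, 3: 5, 4: 5, 6: 3, 8: 4, 9: 3, 10: 2, 13: 2, 14: 1}

expected_values = {state: gamma ** (k - 1) for state, k in moves_to_goal.items()}

for state in sorted(expected_values):
    print(state, round(expected_values[state], 6))
```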

### <font color="orange">Auto-grading</font>
Run this cell to track your answers and to save your answer for problem 1.1. Make sure you have defined the necessary variable above to avoid a `NameError` below.

In [5]:
### GRADING DO NOT MODIFY
from grading_utilities import AnswerTracker
hw1_answers = AnswerTracker()
rewards_values = agent1.return_rewards()
hw1_answers.record('problem_1-1', {'state_values': state_values, 'rewards': rewards_values})

### <font color="blue">Problem 1.2 - Policy Iteration</font>

Next, we'll take on policy iteration.

Look at the policy iteration algorithm at the start of section 3. How is it different from value iteration? Let's see:

<img src="vipi_comparison.png" width="90%" height="90%" />

Indeed, these algorithms have a very similar value-of-state estimation cycle. The difference is that the *value iteration* algorithm iterates over **every** action, calculates the expected reward of $s^\prime$, and selects the maximum-value action, while policy iteration calculates the expected reward of $s^\prime$ for the **single** action **specified by the policy**.
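
The contrast can be sketched for a single state as follows. Given the action values of that state, value iteration backs up the maximum, policy evaluation backs up only the policy's action, and policy improvement then switches the policy to the greedy action. The function names and the numbers are hypothetical and serve only to illustrate the difference.

```python
def value_iteration_backup(action_values):
    """Value iteration: back up the best action's value."""
    return max(action_values.values())


def policy_evaluation_backup(action_values, policy_action):
    """Policy evaluation: back up only the action the current policy prescribes."""
    return action_values[policy_action]


def policy_improvement_step(action_values):
    """Policy improvement: switch the policy to the greedy action."""
    return max(action_values, key=action_values.get)


# Hypothetical action values for one state (keys are action ids).
action_values = {0: 0.0, 1: 0.857375, 2: 0.9025, 3: 0.0}
current_policy_action = 1

print(value_iteration_backup(action_values))
print(policy_evaluation_backup(action_values, current_policy_action))
print(policy_improvement_step(action_values))
```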

#### 🎯 Task:  Implement Policy Iteration on the FrozenLake environment.

In [14]:
class PolicyIteration(MDP):
    def __init__(self, env_name, is_slippery=False):
        super().__init__(env_name, is_slippery=is_slippery)
        self.policy = collections.defaultdict(int)
        
    def return_policy(self):
        return tuple(self.policy.items())
        
    def _policy_iteration(self):
        """ 
        
        Description: Perform policy iteration. Consists of 2 parts: policy evaluation and policy improvement. See Sutton & Barto, RL: An Introduction, page 80.
        
        """
        # Start from all-ones state values and an arbitrary policy (action 1 = Down everywhere).
        self.values = {state: 1. for state in range(self.env.observation_space.n)}
        self.policy = {state: 1 for state in range(self.env.env.nS)}

        while True:
            #-------------------#
            # policy evaluation #
            #-------------------#
            while True:
                delta = 0
                for current_state in range(self.env.env.nS):
                    old_value = self.values[current_state]
                    # back up only the action prescribed by the current policy
                    self.values[current_state] = self._get_action_value(current_state, self.policy[current_state])
                    delta = max(delta, abs(old_value - self.values[current_state]))

                if delta < self.theta:
                    break

            #--------------------#
            # policy improvement #
            #--------------------#
            policy_stable = self.policy_improve()
            if policy_stable.all():
                break

        return policy_stable

    def policy_improve(self):
        """ Make the policy greedy with respect to the current values; flag the states whose action did not change """
        policy_stable = np.zeros((self.env.env.nS), dtype=bool)
        for current_state in range(self.env.env.nS):
            old_policy = self.policy[current_state]
            self.policy[current_state] = self._get_best_action(current_state)
            if old_policy == self.policy[current_state]:
                policy_stable[current_state] = True

        return policy_stable
    
    def _run_episode(self, render=True):
        """

        Description: runs an episode on the environment after policy iteration.

        Args:
            * Render - render env to screen?

        """
        clear_output()
        episode_reward = 0.0
        current_state = self.env.reset()

        if render:
            self.env.render()

        while True:
            action = self.policy[current_state]
            
            #print('action', action)
            new_state, step_reward, is_done, _ = self.env.step(action)
            
            self.rewards[(current_state, action, new_state)] = step_reward
            self.transits[(current_state, action)][new_state] += 1
            
            episode_reward += step_reward
            
            if render:
                self.env.render()

            if is_done:
                self.env.reset()
                break
            
            current_state = new_state
            
        print(f"...Episode completed.")

        return episode_reward
        
    def run_simulation(self, num_steps = 2000, render=True):
        """ Run training simulation """
        try:
            self._model_transits_rewards(num_steps)

            while True:
                policy_stable = self._policy_iteration()

                if policy_stable.all():
                    print('OK! policy stable')
                    break

            episode_reward = self._run_episode(render=render)
            
            if episode_reward > 0.85:
                print(f"Environment solved.")
            else:
                clear_output()
                print(f"Failed to solve environment.")
        
        except KeyboardInterrupt:
            print(f"...Cancelling...")
            
        except NotImplementedError:
            print(f"Your solution is incomplete")

### Run simulation

As discussed above, beware of the `num_steps` hyperparameter.

In [16]:
start_time = time.time()

agent2 = PolicyIteration("FrozenLake-v0")
agent2.run_simulation(num_steps = 3000)

end_time = time.time()
print(f"Duration of execution: {end_time-start_time:.5f} seconds")


[S]FFF
FHFH
FFFH
HFFG
  (Down)
SFFF
[F]HFH
FFFH
HFFG
  (Down)
SFFF
FHFH
[F]FFH
HFFG
  (Right)
SFFF
FHFH
F[F]FH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[F]FG
  (Right)
SFFF
FHFH
FFFH
HF[F]G
  (Right)
SFFF
FHFH
FFFH
HFF[G]
...Episode completed.
Environment solved.
Duration of execution: 0.15379 seconds


### <font color="orange">Auto-grading</font>
Run this cell to track your answers and to save your answer for problem 1.2. Make sure you have defined the necessary variable above to avoid a `NameError` below.

In [17]:
state_values = agent2.return_state_values()
policy_values = agent2.return_policy()

### GRADING DO NOT MODIFY
hw1_answers.record('problem_1-2', {'state_values': state_values, 'policy_values': policy_values})

In [18]:
agent2.return_state_values()

((0, 0.7737809374999999),
 (1, 0.8145062499999999),
 (2, 0.8573749999999999),
 (3, 0.8145062499999999),
 (4, 0.8145062499999999),
 (5, 0.0),
 (6, 0.9025),
 (7, 0.0),
 (8, 0.8573749999999999),
 (9, 0.9025),
 (10, 0.95),
 (11, 0.0),
 (12, 0.0),
 (13, 0.95),
 (14, 1.0),
 (15, 0.0))

### <font color="blue">Problem 1.3 - FrozenLake8x8</font>

Let's try the 8x8 version of FrozenLake and compare which algorithm is faster.

In [31]:
agent3 = ValueIteration("FrozenLake8x8-v0")
agent3._model_transits_rewards(50000)


Modeling rewards and transition probabilities ...


100%|██████████| 50000/50000 [00:01<00:00, 30374.49it/s]


In [32]:
%%timeit
agent3._value_iteration()

9.46 ms ± 572 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
start_time = time.time()

# value iteration
agent3 = ValueIteration("FrozenLake8x8-v0")
agent3.run_simulation(render = False, num_steps = 50000)

end_time = time.time()
print(f"VI: Duration of execution: {end_time-start_time:.5f} seconds")

...Episode completed.
Environment solved.
VI: Duration of execution: 1.65080 seconds


In [28]:
agent4 = PolicyIteration("FrozenLake8x8-v0")
agent4._model_transits_rewards(50000)


Modeling rewards and transition probabilities ...


100%|██████████| 50000/50000 [00:01<00:00, 30853.47it/s]


In [29]:

%%timeit
agent4._policy_iteration()

18.2 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [13]:
start_time = time.time()

# policy iteration
agent4 = PolicyIteration("FrozenLake8x8-v0")
agent4.run_simulation(render = False, num_steps = 50000)

#end_time = time.time()
#print(f"PI: Duration of execution: {end_time-start_time:.5f} seconds")

action 1
action 1
action 1
action 2
action 2
action 2
action 2
action 1
action 2
action 2
action 2
action 1
action 1
action 1
...Episode completed.
Environment solved.


In [98]:
agent4 = PolicyIteration("FrozenLake8x8-v0")

agent4._model_transits_rewards(50000)
agent4._policy_iteration()
#agent4._model_transits_rewards
agent4._run_episode()


[S]FFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
action 1
  (Down)
SFFFFFFF
[F]FFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
action 1
  (Down)
SFFFFFFF
FFFFFFFF
[F]FFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
action 1
  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
[F]FFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
action 2
  (Right)
SFFFFFFF
FFFFFFFF
FFFHFFFF
F[F]FFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
action 2
  (Right)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FF[F]FFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
action 2
  (Right)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFF[F]FHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
action 2
  (Right)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFF[F]HFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
action 1
  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFH[F]FFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
action 1
  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHF[F]FHF
FHFFHFHF
FFFHFFFG
action 2
  (Right)
SFFF

1.0

### Which algorithm is faster?

Which algorithm do you think is faster, on average, policy iteration (PI) or value iteration (VI)? The answer should be clear (if you implemented the algorithms correctly), but I encourage you to think about *why* (there are 2 reasons).

In [99]:
"""
Change None to 'PI' or 'VI' below
"""
### YOUR ANSWER BELOW
problem_13_answer = None
### YOUR ANSWER ABOVE

### <font color="orange">Auto-grading</font>
Run this cell to track your answers and to save your answer for problem 1.3. Make sure you have defined the necessary variable above to avoid a `NameError` below.

In [106]:
### GRADING DO NOT MODIFY
hw1_answers.record('problem_1-3', problem_13_answer)

### <font color="orange">Auto-grading: Submit your answers</font>
Enter your first and last name in the cell below and then run it to save your answers for this lab to a JSON file. The file is saved to the same directory as this notebook. After the file is created, upload the JSON file to the assignment page on Canvas.

In [37]:
hw1_answers.print_answers()

{'problem_1-1': {'state_values': ((0, 0.7737809374999999), (1, 0.8145062499999999), (2, 0.8573749999999999), (3, 0.8145062499999999), (4, 0.8145062499999999), (5, 0.0), (6, 0.9025), (7, 0.0), (8, 0.8573749999999999), (9, 0.9025), (10, 0.95), (11, 0.0), (12, 0.0), (13, 0.95), (14, 1.0), (15, 0.0), (16, 0.0)), 'rewards': True}, 'problem_1-2': {'state_values': ((0, 0.7737809374999999), (1, 0.8145062499999999), (2, 0.8573749999999999), (3, 0.8145062499999999), (4, 0.8145062499999999), (5, 0.0), (6, 0.9025), (7, 0.0), (8, 0.8573749999999999), (9, 0.9025), (10, 0.95), (11, 0.0), (12, 0.0), (13, 0.95), (14, 1.0), (15, 0.0)), 'policy_values': ((0, 1), (1, 2), (2, 1), (3, 0), (4, 1), (5, 0), (6, 1), (7, 0), (8, 2), (9, 1), (10, 1), (11, 0), (12, 0), (13, 2), (14, 2), (15, 0))}}


In [108]:
assignment_name = "hw_1"
first_name = "NAME" # Use proper capitalization
last_name = "LASTNAME" # Use proper capitalization

hw1_answers.save_to_json(assignment_name, first_name, last_name)

# Time test with `%%timeit`

In [33]:
agent3 = ValueIteration('FrozenLake8x8-v0')
agent3._model_transits_rewards(50000)


Modeling rewards and transition probabilities ...


100%|██████████| 50000/50000 [00:01<00:00, 32005.86it/s]


In [34]:
%%timeit
agent3._value_iteration()

10 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [35]:
agent4 = PolicyIteration("FrozenLake8x8-v0")
agent4._model_transits_rewards(50000)


Modeling rewards and transition probabilities ...


100%|██████████| 50000/50000 [00:01<00:00, 28433.14it/s]


In [36]:
%%timeit
agent4._policy_iteration()

16.3 ms ± 955 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
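
Outside a notebook, a comparable measurement can be sketched with the standard-library `timeit` module, which the `%%timeit` magic is built on. The `busy_work` function below is only a stand-in workload invented for this example; in practice it would be replaced by a call such as an agent's iteration method.

```python
import timeit


def busy_work():
    """Stand-in workload; swap in the function you actually want to time."""
    return sum(i * i for i in range(1000))


loops = 100
# Each entry of `runs` is the total time, in seconds, for `loops` calls.
runs = timeit.repeat(busy_work, repeat=7, number=loops)
per_loop_ms = [1000 * total / loops for total in runs]

print(f"best of {len(runs)} runs: {min(per_loop_ms):.3f} ms per loop")
```

Reporting the minimum rather than the mean is a common choice, since slower runs usually reflect interference from other processes rather than the code itself.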


## Questions?

Reach out to Eli Bolotin on Piazza (you can find it in Canvas on the left-most menu).

## Sources

***

<sup>[1]</sup> Ng, A. Stanford University, CS229 Notes: Reinforcement Learning and Control.

<sup>[2]</sup> Barnabás Póczos, Carnegie Mellon, Introduction To Machine Learning: Reinforcement Learning (Course).

<sup>[3]</sup> Sutton, R. S., Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press. 

<sup>[4]</sup> OpenAI: Spinning Up. Retrieved from https://spinningup.openai.com/en/latest/spinningup/rl_intro.html