### **Week-8:** MDP and Dynamic Programming
### **Name:** Atyam V V R Manoj
### **Reg No. | Sec:** 200968108

### Use the Frozen Lake environment.

https://www.gymlibrary.dev/environments/toy_text/frozen_lake/
#### **Importing required libraries**

In [1]:
import sys
import gym
import numpy as np

#### Intializing the environment

In [2]:
env = gym.make('FrozenLake-v1',is_slippery=True)

  and should_run_async(code)
  deprecation(
  deprecation(


#### Defining a **Helper function to calculate a state-value**

In [3]:
# Calculate a state-value function
def one_step_lookahead(env, state, V, discount):
    '''
    V: 2-D tensor
        Matrix of size nSxnA, each cell represents 
        a probability of taking actions
    '''
    
    n_actions = env.action_space.n
    action_values = np.zeros(shape=n_actions)
    for action in range(n_actions):
        for prob, next_state, reward, done in env.env.P[state][action]:
            action_values[action] += prob * (reward + discount * V[next_state])
    ''' 
    return:
        A vector of length nA containing the expected value of each action
    '''
    return action_values



#### Defining a function to **implement an optimal policy** for the Frozen Lake environment.

In [4]:
def policy_eval(policy, env, discount=1.0, theta=1e-9, max_iter=1000):
    """    
    policy: 2-D tensor
        Matrix of size nSxnA, each cell represents 
        a probability of taking actions
    """
    n_states = env.observation_space.n
    eval_iters = 1

    # value function
    V = np.zeros(shape=n_states)

    # repeat until value change is below the threshold
    for i in range(int(max_iter)):
        delta = 0
        for state in range(n_states):
            # init a new value of current state
            v = 0
            # Try all possible actions which can be taken from this state
            for action, action_prob in enumerate(policy[state]):
                for state_prob, next_state, reward, done in env.P[state][action]:
                    # calculate the expected value
                    v += action_prob * state_prob * (reward + discount * V[next_state])
            # calculate the absolute change of value function
            delta = max(delta, np.abs(V[state] - v))
            # update value function
            V[state] = v
        eval_iters += 1
        
        if delta < theta:
            print(f'Policy evaluation terminated after {eval_iters} iterations.\n')
            return V

### **1.Create a Policy Iteration function with the following parameters**
* policy: 2D array of a size n(S) x n(A), each cell represents a probability of taking action a in state s.
* environment: Initialized OpenAI gym environment object
* discount_factor: MDP discount factor.
* theta:  A  threshold  of  a  value  function  change.  Once  the  update  to  value function is below this number
* max_iterations: Maximum number of iterations 

In [5]:
def policy_iteration(env, discount=1.0, max_iter=1000):

    n_states = env.observation_space.n
    n_actions = env.action_space.n
    
    # start with random policy = n_states x n_actions / n_actions
    policy = np.ones(shape=[n_states, n_actions]) / n_actions

    # counter of evaluated policies
    evaluated_policies = 1
    
    # repeat until convergence or critical number of iterations reached
    for i in range(int(max_iter)):
        stable_policy = False

        # Evaluate current policy
        V = policy_eval(policy, env, discount)
        
        # go through each state & try to improve actions that were taken
        for state in range(n_states):
            curr_action = np.argmax(policy[state])
            # look one step ahead and evaluate if curr_action is optimal
            # will try every possible action in a curr_state
            action_value = one_step_lookahead(env, state, V, discount)
            # select best aciton 
            best_action = np.argmax(action_value)
            # if action didn't change
            if curr_action != best_action:
                stable_policy = True
            # Greedy policy update
            policy[state] = np.eye(n_actions)[best_action]
        evaluated_policies += 1
        # if the algo converged & policy is not changing anymore
        if stable_policy:
            print(f'Found stable policy after {evaluated_policies:,} evaluations.\n')
            return policy, V

### **2.Create a Value Iteration function with the following parameters**
* environment: Initialized OpenAI gym environment object
* discount_factor: MDP discount factor
* theta:  A  threshold  of  a  value  function  change.  Once  the  update  to  value function is below this numberd.
* max_iterations: Maximum number of iterations 

In [6]:
# defining Value iteration algorithm to solve MDP.

def value_iteration(env, discount=1e-1, theta=1e-9, max_iter=1e4):

    # initalized state-value function with zeros for each env state
    V = np.zeros(env.observation_space.n)
    
    for i in range(int(max_iter)):
        # early stopping condition
        delta = 0

        for state in range(env.observation_space.n):

            # Do a one-step lookahead to calculate state-action values
            action_value = one_step_lookahead(env, state, V, discount)

            # select best action to perform based on the highest state-action values
            best_action_value = np.max(action_value)
          
            # calculate change in value
            delta = max(delta, np.abs(V[state] - best_action_value))
            
            # update the value function for current state
            V[state] = best_action_value
            
        # checking the condition to exit the loop

        if delta < theta:
            print(f'\nValue iteration converged at iteration #{i+1:,}')
            break
    
    policy = np.zeros(shape=[env.observation_space.n, env.action_space.n])
    
    for state in range(env.observation_space.n):
        # one step lookahead to find the best action for this state
        action_value = one_step_lookahead(env, state, V, discount)
        #select the best action based on the highest state-action value
        best_action = np.argmax(action_value)
        # update the policy to perform a better action at a current state
        policy[state, best_action] = 1.0
    
    return policy, V

### Defining a function to implement eacdh episode

In [7]:
def play_episodes(env, episodes, policy, max_action=100, render=False):

    wins = 0
    total_reward, total_action = 0, 0
    
    for episode in range(episodes):
        state = env.reset()
        done, max_a = False, 0 # max_a are the no. of actions taken
        while max_a < max_action:
            action = np.argmax(policy[state])
            next_state, reward, done, _ = env.step(action)
            if render:
                env.render()
            max_a += 1
            total_reward += reward  # increment reward received
            state = next_state  # set current state to next state

            # terminate if we're done and increment `wins`
            if done and reward == 1:
                wins += 1
                break
        
        total_action += max_a

    print(f'Total rewards: {total_reward:,}\tMax action: {max_a:,}')
    
    avg_reward = total_reward / episodes
    avg_action = total_action / episodes
    print('')
    '''
    return: tuple(wins, total_reward, average_reward)
        - wins: The total number of wins the agent has
        - total_reward: The agent's total accumulated reward
        - average_reward: The agent's average reward
    '''
    return wins, total_reward, avg_reward, avg_action

### Implementing the **agent** in the given environment for 1000 episodes

In [8]:
episodes = 1000

def agent(env):

    rewards = []
  
    action_mapping = {
        0: '\u2191',  # up
        1: '\u2192',  # right
        2: '\u2193',  # down
        3: '\u2190'   # left
    }
    
    policies = [
        ('Policy Iteration', policy_iteration),
        ('Value Iteration', value_iteration)
    ]
    
    for iter_name, iter_func in policies:
        policy, V = iter_func(env)
        
        print(f'Final policy using {iter_name}:')
        print(' '.join([action_mapping[action] for action in np.argmax(policy, axis=1)]))
        
        wins, total_reward, avg_reward, avg_action = play_episodes(env, episodes, policy)
        rewards.append(total_reward)
        
        print(f'number of wins = {wins:,}')
        print(f'average reward = {avg_reward:.2f}')
        print(f'average action = {avg_action:.2f}')

    return rewards




In [9]:
rewards = agent(env)

Policy evaluation terminated after 66 iterations.

Found stable policy after 2 evaluations.

Final policy using Policy Iteration:
↑ ← ↑ ← ↑ ↑ ↑ ↑ ← → ↑ ↑ ↑ ↓ → ↑
Total rewards: 742.0	Max action: 19

number of wins = 742
average reward = 0.74
average action = 52.82

Value iteration converged at iteration #8
Final policy using Value Iteration:
→ ← ↓ ← ↑ ↑ ↑ ↑ ← → ↑ ↑ ↑ ↓ → ↑
Total rewards: 444.0	Max action: 18

number of wins = 444
average reward = 0.44
average action = 68.73


In [10]:
env.close()

In [11]:
print(rewards)

[742.0, 444.0]


### **3.Compare  the number of  wins, average  return  after  1000  episodes and  comment  on which method performed better.**
1. Based on the above output, we can say the ***Policy Iteration performs better.***
2. No. of wins (no. of rewards) is always higher while using the Policy Iteration method(742) than the Value Iteration method(444) and so is the average return.
* In **policy iteration**, we start by choosing an random policy ,then we iteratively evaluate and improve the policy until convergence. While in **value iteration**, we start by computing a random state value function and iteratively update the estimate value.
* Both are guaranteed to converge but policy iteration algorithm is faster, requires lesser iterations and cheaper to compute.

 