# Exam Planning-Lab with solutions

## Exercise 1

Consider the following environment:

<img src="images/env1.png" alt="ex1" style="zoom: 40%;" />

The agent starts in cell $(0, 0)$ and has to reach the treasure in $(2, 3)$. In addition to the walls of the previous environments, the floor is covered with lava. However, the agent can withstand high temperatures and uses the heat to recharge its batteries, so it receives a positive reward for being on a lava cell. The environment is deterministic and the agent must avoid the black pits of death (cells $(0,3)$, $(1, 3)$, $(1,1)$).
Assume that you do not have access to the motion model or to the reward function, and that the problem is undiscounted (i.e., $\gamma = 1$).

In [1]:
import os, sys
module_path = os.path.abspath(os.path.join('tools'))
if module_path not in sys.path:
    sys.path.append(module_path)

import gym, envs
from utils.ai_lab_functions import *
from timeit import default_timer as timer
from tqdm import tqdm as tqdm
import numpy as np

#POSSIBLE SOLUTION Q-LEARNING OR SARSA  
def epsilon_greedy(q, state, epsilon):
    """
    Epsilon-greedy action selection function
    
    Args:
        q: q table
        state: agent's current state
        epsilon: epsilon parameter
    
    Returns:
        action id
    """
    an = q.shape[1]
    probs = np.full(an, epsilon / an)
    probs[q[state].argmax()] += 1 - epsilon
    return np.random.choice(an, p=probs)


env = gym.make('LavaFloor-v0')
env.render()

[['S' 'L' 'L' 'P']
 ['L' 'P' 'L' 'P']
 ['L' 'L' 'L' 'G']]


### 1.1) Given the environment reported above, find a policy with a suitable algorithm of your choice. Print the resulting policy (you can use the provided code for this).


In [2]:
def q_learning(environment, episodes, alpha, gamma, expl_func, expl_param):
    """
    Performs the Q-Learning algorithm for a specific environment
    
    Args:
        environment: OpenAI Gym environment
        episodes: number of episodes for training
        alpha: alpha parameter
        gamma: gamma parameter
        expl_func: exploration function (epsilon_greedy, softmax)
        expl_param: exploration parameter (epsilon, T)
    
    Returns:
        (policy, rewards, lengths): final policy, rewards for each episode [array], length of each episode [array]
    """
    
    q = np.zeros((environment.observation_space.n, environment.action_space.n))  # Q(s, a)
    rews = np.zeros(episodes)
    lengths = np.zeros(episodes)
    env = environment
    for e in range(episodes):
        state = env.reset()
        done = False
        leng = 0
        while not done:
            # choose action
            action = expl_func(q, state, expl_param)
            
            new_state, reward, done, _ = env.step(action)
            rews[e] += reward
            
            # Greedy (off-policy) target: best Q-value available in the new state
            max_ = q[new_state].max()
                
            q[state, action] = q[state, action] + alpha * (reward + gamma * max_ - q[state, action])
            state = new_state
            leng+=1
        
        lengths[e] = leng
            
    policy = q.argmax(axis=1)
    return policy, rews, lengths


In [3]:
episodes = 10
alpha = .3
gamma = 0.99

epsilon = .1

# Q-Learning epsilon greedy
env = gym.make('LavaFloor-v0')
policy, rewards, lengths = q_learning(env, episodes, alpha, gamma, epsilon_greedy, epsilon)
print("Policy:\n {} \n".format(np.vectorize(env.actions.get)(policy.reshape(env.shape))))

Policy:
 [['L' 'L' 'L' 'L']
 ['U' 'L' 'L' 'L']
 ['U' 'L' 'L' 'L']] 



In [29]:
def sarsa(environment, episodes, alpha, gamma, expl_func, expl_param):
    """
    Performs the SARSA algorithm for a specific environment
    
    Args:
        environment: OpenAI gym environment
        episodes: number of episodes for training
        alpha: alpha parameter
        gamma: gamma parameter
        expl_func: exploration function (epsilon_greedy, softmax)
        expl_param: exploration parameter (epsilon, T)
    
    Returns:
        (policy, rewards, lengths): final policy, rewards for each episode [array], length of each episode [array]
    """

    q = np.zeros((environment.observation_space.n, environment.action_space.n))  # Q(s, a)
    rews = np.zeros(episodes)
    lengths = np.zeros(episodes)
    for i in range(episodes):
        state = environment.reset()  # Reset the environment
        action = expl_func(q, state, expl_param)  # Select first action
        el = 0
        
        while True:
            next_state, reward, done, _ = environment.step(action)  # Execute a step
            action_next = expl_func(q, next_state, expl_param)  # Select next action
            q[state, action] += alpha * (reward + gamma * q[next_state, action_next] - q[state, action])  # Temporal difference
            rews[i] += reward
            el += 1
            
            if done:
                break
                
            state = next_state  # Update state
            action = action_next  # Update action
            
        lengths[i] = el
        
    policy = q.argmax(axis=1) # q.argmax(axis=1) automatically extract the policy from the q table
    return policy, rews, lengths

In [22]:
episodes = 500
alpha = .3
gamma = .99
epsilon = .1

# SARSA epsilon greedy
env = gym.make('LavaFloor-v0')
policy, rewards, lengths = sarsa(env, episodes, alpha, gamma, epsilon_greedy, epsilon)
print("Policy:\n {} \n".format(np.vectorize(env.actions.get)(policy.reshape(env.shape))))


Policy:
 [['U' 'L' 'L' 'L']
 ['U' 'L' 'U' 'L']
 ['L' 'L' 'L' 'L']] 



### 1.2) Find a way to force the agent to choose the shortest path towards the goal. You can change these parameters: episodes, alpha, gamma, exploration function and exploration parameter.

##### Hint: you should tune a parameter to consider the immediate reward rather than the long-term one... 
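To see why the discount factor is the parameter the hint points to, the sketch below compares the discounted return of a short and a long path to the goal. The reward values (`+1` per lava cell, `+10` for the goal) and the helper `discounted_return` are assumptions made for illustration only; the actual rewards of `LavaFloor-v0` are not stated here.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical rewards: +1 for each lava cell entered, +10 for the goal
short_path = [1, 1, 1, 1, 10]        # 5 moves, the minimum from (0, 0) to (2, 3)
long_path = [1, 1, 1, 1, 1, 1, 10]   # goal reached after a 2-move detour on lava

for g in (0.99, 0.1):
    g_short = discounted_return(short_path, g)
    g_long = discounted_return(long_path, g)
    better = "short" if g_short > g_long else "long"
    print(f"gamma={g}: short={g_short:.4f}, long={g_long:.4f} -> {better} path preferred")
```

Under these assumed rewards, a discount factor close to 1 makes the detour attractive because the extra lava rewards are barely discounted, while a small discount factor favours reaching the goal as early as possible.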

In [5]:
episodes = 300
alpha = .3

# TUNING DISCOUNT FACTOR
gamma = .1
epsilon = .3

# Q-Learning epsilon greedy
env = gym.make('LavaFloor-v0')
p, r, l = q_learning(env, episodes, alpha, gamma, epsilon_greedy, epsilon)
print("Policy:\n {} \n".format(np.vectorize(env.actions.get)(p.reshape(env.shape))))

Policy:
 [['D' 'L' 'L' 'L']
 ['D' 'L' 'D' 'L']
 ['R' 'R' 'R' 'L']] 



In [31]:
episodes = 1000
alpha = .4
gamma = .1
epsilon = .3

# SARSA epsilon greedy
env = gym.make('LavaFloor-v0')
policy, rewards, lengths = sarsa(env, episodes, alpha, gamma, epsilon_greedy, epsilon)
print("Policy:\n {} \n".format(np.vectorize(env.actions.get)(policy.reshape(env.shape))))


Policy:
 [['D' 'L' 'L' 'L']
 ['D' 'L' 'D' 'L']
 ['R' 'R' 'R' 'L']] 



### 1.3) Consider an environment with states {A, B, C, D}, actions {r, l} where states {A, D} are terminal. Consider the following sequence of learning episodes:
* E1: (B, r, C, −1),(C, r, C, −1),(C, r, D, +1)
* E2: (B, r, C, −1),(C, r, D, +1)
* E3: (B, l, A, +5)
* E4: (B, l, B, −1),(B, l, B, −1),(B, l, A, +5)
### Compute v(s) for all non-terminal states by using a sample-based evaluation approach (i.e., computing the values with the function reported below). Assume $\alpha$ = .5 and $\gamma$ = 1.
### $V(s) = (1- \alpha) \cdot V(s) + \alpha \cdot (r + \gamma \cdot V(s'))$ 
### where $s$ is the state under consideration (first element of each tuple), $a$ is the action performed (second element), $s'$ is the state reached from $s$ by performing $a$ (third element), and $r$ is the reward (last element).

In [6]:
episodes = {1: [('B', 'r', 'C', -1), ('C', 'r', 'C', -1),('C', 'r', 'D', 1)], 
            2: [('B', 'r', 'C', -1), ('C', 'r', 'D', 1)], 
            3: [('B', 'l', 'A', 5)], 
            4: [('B', 'l', 'B', -1),('B', 'l', 'B', -1),('B', 'l', 'A', 5)]}
v = {'A': 5, 'B': 0, 'C': 0, 'D': 1}
alpha = 0.5
gamma = 1

for episode, values in episodes.items():
    for tuple in values:
        v[tuple[0]] = (1-alpha)*v[tuple[0]] + alpha*(tuple[-1]+ gamma*v[tuple[2]])
      
print(v)

{'A': 5, 'B': 6.90625, 'C': 1.375, 'D': 1}
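The run above initializes the terminal states with $V(A) = 5$ and $V(D) = 1$, so a transition into a terminal state counts the terminal reward both in $r$ and in $V(s')$. A common alternative convention fixes the value of terminal states at 0. The sketch below applies the same update rule under that assumption; it is a variant for comparison, not the reference solution, and the names `td_episodes`, `td_alpha`, `td_gamma` and `v_zero` are introduced only for this example.

```python
# Same episodes as above: (state, action, next_state, reward)
td_episodes = {
    1: [('B', 'r', 'C', -1), ('C', 'r', 'C', -1), ('C', 'r', 'D', 1)],
    2: [('B', 'r', 'C', -1), ('C', 'r', 'D', 1)],
    3: [('B', 'l', 'A', 5)],
    4: [('B', 'l', 'B', -1), ('B', 'l', 'B', -1), ('B', 'l', 'A', 5)],
}
td_alpha, td_gamma = 0.5, 1

# Assumption: terminal states A and D have value 0
v_zero = {'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': 0.0}

for transitions in td_episodes.values():
    for s, _a, s_next, r in transitions:
        # Sample-based update: move V(s) towards r + gamma * V(s')
        v_zero[s] = (1 - td_alpha) * v_zero[s] + td_alpha * (r + td_gamma * v_zero[s_next])

print(v_zero)
```

Under this convention the same episodes give $V(B) = 3.09375$ and $V(C) = 0.625$.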


## Exercise 2

Consider the figure below, where **S=$(1,2)$** and **G=$(3,1)$** are the starting and goal positions respectively. Consider the problem of finding a minimum-cost path from S to G, assuming the agent can move in the four directions (if there is no obstacle) and that each movement has unit cost. The environment is deterministic. Answer the following questions:

<img src="images/ex2.png" alt="ex2" style="zoom: 40%;" />


In [7]:
env = gym.make('SmallMaze-v0')
env.render()

[['C' 'C' 'C' 'C' 'C']
 ['C' 'C' 'S' 'W' 'C']
 ['C' 'W' 'W' 'W' 'C']
 ['C' 'G' 'C' 'C' 'C']]


### 2.1) Verify, using the code developed in the lab, that the Manhattan distance (l1_norm) is a *consistent* heuristic in this environment. In particular, implement a function that checks whether the consistency condition holds for every pair of states $(s,s')$, where $s'$ is a successor of $s$. The function should return true if the heuristic is consistent and false otherwise. Recall that every action has a cost of 1...


#### Solution:

A heuristic is consistent if, for every node $n$ and every successor $n'$ reached with action $a$, $h(n) \le h(n') + c(n, a, n')$. Here every move costs 1 and changes the Manhattan distance (l1_norm) to the goal by at most 1, so $h(n) - h(n') \le 1 = c(n, a, n')$ holds for every node and successor, and the heuristic is consistent.


In [9]:
def check_euristic(environment):
    goalpos = environment.state_to_pos(environment.goalstate)
    for s in range(environment.observation_space.n): #for every state
        #print('s = {}'.format(environment.state_to_pos(s)))
        
        # skip the check if the state is a wall
        if environment.grid[s] == "W":
            continue
            
        for action in range(environment.action_space.n): #Look around
            #print('a = {}'.format(action))
            
            # get next state
            s1 = environment.sample(s, action)
            
            #compute heuristic for s
            Hn = Heu.l1_norm(environment.state_to_pos(s), goalpos) # node value (heuristic)
            
            #compute heuristic for s'
            Hn1 = Heu.l1_norm(environment.state_to_pos(s1), goalpos) # node value (heuristic)
            
            # check consistency
            test = ((Hn - Hn1) <= 1)
            
            if not test:
                print("The heuristic is not consistent!")
                print('s = {} \t s1 = {}'.format(environment.state_to_pos(s),environment.state_to_pos(s1)))
                print('h(n)= {} \t h(n1) = {} \t res = {}'.format(Hn,Hn1,test))
                return False
            
          
    # Reached only after every non-wall state has passed the check
    print("The heuristic is consistent!")
    return True


env = gym.make('SmallMaze-v0')
check_euristic(env)

The heuristic is consistent!


True

### 2.2) Consider the A* algorithm and assume we want to achieve optimality. Based on the consistent heuristic of Section 2.1, state whether it is best to use a graph search or tree search strategy. Motivate your answer and show the results of A* execution and the differences between the two versions in terms of the computed solution, number of nodes generated and maximum number of nodes in memory.


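A* is guaranteed to be optimal with tree search + an admissible heuristic, or with graph search + a consistent heuristic; in all other situations we cannot guarantee optimality. Since the heuristic here is consistent (and therefore also admissible), both versions return an optimal solution. Graph search is the better choice because it avoids re-expanding repeated states: in the run below it generates 29 nodes instead of 125 and keeps at most 10 nodes in memory instead of 94.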

In [10]:
def present_with_higher_cost(queue, node):
    if node.state in queue:
        if queue[node.state].value > node.value: 
            return True
    return False


def astar_tree_search(environment):
    """
    A* Tree search
    
    Args:
        environment: OpenAI Gym environment
        
    Returns:
        (path, time_cost, space_cost): solution as a path and stats.
    """

    goalpos = environment.state_to_pos(environment.goalstate)
    queue = PriorityQueue()
    queue.add(Node(environment.startstate, value=Heu.l1_norm(environment.state_to_pos(environment.startstate), goalpos)))
    time_cost = 1
    space_cost = 1
    
    while True:        
        if queue.is_empty(): 
            return None, time_cost, space_cost
    
        node = queue.remove()
        if node.state == environment.goalstate: 
            return build_path(node), time_cost, space_cost
        
        for action in range(environment.action_space.n): #Look around
            child_state = environment.sample(node.state, action)
            child = Node( 
                          child_state, # node state
                          node, # node parent
                          node.pathcost + 1, # incremental path cost
                          node.pathcost + 1 + Heu.l1_norm(environment.state_to_pos(child_state), goalpos) # node value (heuristic)
                         ) 
            
            time_cost += 1
            queue.add(child)
            
        space_cost = max(space_cost, len(queue))

def astar_graph_search(environment):
    """
    A* Graph Search
    
    Args:
        environment: OpenAI Gym environment
        
    Returns:
        (path, time_cost, space_cost): solution as a path and stats.
    """
    
    goalpos = environment.state_to_pos(environment.goalstate)
    queue = PriorityQueue()
    queue.add(Node(environment.startstate, value=Heu.l1_norm(environment.state_to_pos(environment.startstate), goalpos)))
    time_cost = 1
    space_cost = 1
    explored = set()
    
    while True:
        if queue.is_empty(): return None, time_cost, space_cost
        
        node = queue.remove()
        if node.state == environment.goalstate: 
            return build_path(node), time_cost, space_cost
        explored.add(node.state)
            
        for action in range(environment.action_space.n):  #Look around
            child_state = environment.sample(node.state, action)
            child = Node( 
                      child_state, # node state
                      node, # node parent
                      node.pathcost + 1, # incremental path cost
                      node.pathcost + 1 + Heu.l1_norm(environment.state_to_pos(child_state), goalpos) # node value (heuristic)
                     ) 
            time_cost += 1
            
            if child.state not in queue and child.state not in explored:
                #if child.state == environment.goalstate: return build_path(node), time_cost, space_cost
                queue.add(child)
            elif present_with_higher_cost(queue, child):
                queue.replace(child)
                
        space_cost = max(space_cost, len(queue) + len(explored))


env = gym.make('SmallMaze-v0')
solution_ts, time_ts, memory_ts = astar_tree_search(env)
solution_gs, time_gs, memory_gs = astar_graph_search(env)
print("Solution A*(tree-search): {}".format(solution_2_string(solution_ts, env)))
print("N° of nodes explored: {}".format(time_ts))
print("Max n° of nodes in memory: {}\n".format(memory_ts))
print('='*65)
print("\nSolution A*(graph-search): {}".format(solution_2_string(solution_gs, env)))
print("N° of nodes explored: {}".format(time_gs))
print("Max n° of nodes in memory: {}\n".format(memory_gs))

Solution A*(tree-search): [(1, 1), (1, 0), (2, 0), (3, 0), (3, 1)]
N° of nodes explored: 125
Max n° of nodes in memory: 94


Solution A*(graph-search): [(1, 1), (1, 0), (2, 0), (3, 0), (3, 1)]
N° of nodes explored: 29
Max n° of nodes in memory: 10



### 2.3) Consider BFS and show the path of the optimal solution to the goal (avoiding the repetition of states). In this scenario, can we guarantee that the returned solution is the least-cost one? If we used Iterative Deepening Search (IDS) with the same methodology used for BFS, could we guarantee a least-cost solution? Justify your answer and show the results of the different strategies by using the code developed during the lab. 

Yes. BFS expands nodes in order of increasing depth and every move has unit cost, so depth equals path cost and BFS with graph search always returns a least-cost solution.


In [11]:
def BFS_GraphSearch(problem):
    """
    Graph Search BFS
    
    Args:
        problem: OpenAI Gym environment
        
    Returns:
        (path, time_cost, space_cost): solution as a path and stats.
    """
    
    node = Node(problem.startstate, None)
    time_cost = 1
    space_cost = 1
    
    if problem.goalstate == node.state: 
        return build_path(node), time_cost, space_cost
    
    frontier = NodeQueue()
    frontier.add(node)
    explored = set()
    
    while not frontier.is_empty():
    
        node = frontier.remove() # Retrieve the first node of the fringe
        explored.add(node.state)
        
        for action in range(problem.action_space.n): # Look around
            
            child = Node(problem.sample(node.state, action), node) # Child node
            time_cost += 1
            
            if child.state not in frontier and child.state not in explored: # Add if not in list and not explored
                if problem.goalstate == child.state: # Goal state check
                    return build_path(child), time_cost, space_cost
                frontier.add(child)
                
        space_cost = max(space_cost, len(frontier) + len(explored)) 
        
    return None, time_cost, space_cost


solution_gs, time_gs, memory_gs = BFS_GraphSearch(env)
print("Solution (BFS graph-search): {}".format(solution_2_string(solution_gs, env)))
print("N° of nodes explored: {}".format(time_gs))
print("Max n° of nodes in memory: {}\n".format(memory_gs))

Solution (BFS graph-search): [(1, 1), (1, 0), (2, 0), (3, 0), (3, 1)]
N° of nodes explored: 39
Max n° of nodes in memory: 11



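### Using IDS+graph search in this instance we obtain the least-cost solution. However, we cannot guarantee this in general: the explored set is shared by all branches of a depth-limited pass, so a state first reached through a longer path is never revisited through a shorter one.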

In [12]:
# IDS + graph search: optimality is not guaranteed in general


def DLS(problem, limit, RDLS_Function):
    """
    DLS
    
    Args:
        problem: OpenAI Gym environment
        limit: depth limit for the exploration, negative number means 'no limit'
        RDLS_Function: recursive depth-limited search function to use
        
    Returns:
        (path, time_cost, space_cost): solution as a path and stats.
    """
        
    node = Node(problem.startstate, None)
    return RDLS_Function(node, problem, limit, set())

def Recursive_DLS_GraphSearch(node, problem, limit, explored):
    """
    Recursive DLS
    
    Args:
        node: node to explore
        problem: OpenAI Gym environment
        limit: depth limit for the exploration, negative number means 'no limit'
        explored: completely explored nodes
        
    Returns:
        (path, time_cost, space_cost): solution as a path and stats.
    """
    
    if problem.goalstate == node.state: # Goal state check
        return build_path(node), 1, node.depthcost
    elif limit == 0: # Limit budget check for cutoff
        return "cut_off", 1, node.depthcost
    
    explored.add(node.state)
    space_cost = node.depthcost
    time_cost = 1 
    
    cut_off_occurred = False
    for action in range(problem.action_space.n): # Look around
        child = Node(problem.sample(node.state, action), node, node.pathcost+1) # Child node
        
        if child.state in explored: # Add if not explored
            continue
            
        result = Recursive_DLS_GraphSearch(child, problem, limit-1, explored) # Recursive call
        time_cost += result[1]
        space_cost = max(space_cost, result[2])    
        
        if result[0] == "cut_off":
            cut_off_occurred = True
        elif result[0] != "failure": # Solution found
            return result[0], time_cost, space_cost
        
    if cut_off_occurred: # Solution not found but cutoff occurred
        return "cut_off", time_cost, space_cost
    else: # No solution and no cutoff: failure
        return "failure", time_cost, space_cost


def IDS(env, DLS_Function):
    """
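    Iterative deepening search (IDS)
    
    Args:
        env: OpenAI Gym environment
        DLS_Function: recursive depth-limited search function to use
        
    Returns:
        (path, time_cost, space_cost, iterations): solution as a path, stats and the depth limit at which the search stopped.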
    """
    total_time_cost = 0
    total_space_cost = 1
    
    for i in zero_to_infinity():
        solution_dls, time_dls, memory_dls = DLS(env, i, DLS_Function)
        total_time_cost += time_dls
        total_space_cost = max(total_space_cost, memory_dls)
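        # Stop as soon as the depth-limited search no longer reports a cutoff,
        # i.e. it either found a path or failed for good.
        # (type(tuple) is only the tuple class if the name `tuple` has been shadowed.)
        if not (isinstance(solution_dls, str) and solution_dls == "cut_off"):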
            break
            
    return solution_dls, total_time_cost, total_space_cost, i


solution_gs, time_gs, memory_gs, iterations_gs = IDS(env, Recursive_DLS_GraphSearch)
print("Solution (IDS graph-search): {}".format(solution_2_string(solution_gs, env)))
print("N° of nodes explored: {}".format(time_gs))
print("Max n° of nodes in memory: {}\n".format(memory_gs))

Solution (IDS graph-search): [(1, 1), (1, 0), (2, 0), (3, 0), (3, 1)]
N° of nodes explored: 41
Max n° of nodes in memory: 5
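
The maze run above happens to return a least-cost path. The sketch below shows a case where it does not, using a small hypothetical directed graph with unit edge costs (not one of the lab environments). It mirrors the structure of `Recursive_DLS_GraphSearch`, with one explored set shared across branches, and compares the result with a plain BFS; all names in it are assumptions for illustration.

```python
from collections import deque
from itertools import count

# Hypothetical graph; successors are tried in list order
graph = {'S': ['A', 'B'], 'A': ['B'], 'B': ['C'], 'C': ['G'], 'G': []}

def dls_graph(state, goal, limit, explored, path):
    """Depth-limited search sharing one explored set across all branches."""
    if state == goal:
        return path
    if limit == 0:
        return "cut_off"
    explored.add(state)
    cut_off_occurred = False
    for child in graph[state]:
        if child in explored:
            continue
        result = dls_graph(child, goal, limit - 1, explored, path + [child])
        if result == "cut_off":
            cut_off_occurred = True
        elif result != "failure":
            return result
    return "cut_off" if cut_off_occurred else "failure"

def ids_graph(start, goal):
    """Increase the depth limit until the search stops reporting a cutoff."""
    for limit in count():
        result = dls_graph(start, goal, limit, set(), [start])
        if result != "cut_off":
            return result

def bfs(start, goal):
    """Breadth-first graph search returning the path as a list of states."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for child in graph[path[-1]]:
            if child not in seen:
                seen.add(child)
                frontier.append(path + [child])
    return None

ids_path = ids_graph('S', 'G')
bfs_path = bfs('S', 'G')
print("IDS + graph search:", ids_path)
print("BFS:", bfs_path)
```

With limit 3, `B` is first reached through `A` and marked as explored before the cutoff, which blocks the direct edge `S -> B`; the goal is then found only at limit 4 through the longer path `S, A, B, C, G`, while BFS returns `S, B, C, G`.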



## Exercise 3

Consider the environment displayed in the figure below, where states $(0, 3)$ and $(1, 3)$ are terminal states with reward $+1$ and $−1$ respectively. The agent can move in the four directions. The transition model states that for every state and action the agent has a $0.8$ probability of moving in the chosen direction and a $0.1$ probability of moving in each of the two orthogonal directions. The reward model states that for every state, action and successor state the agent pays $−0.04$. Assume that the discount factor is set to $0.9$. Answer the following questions: 

<img src="images/ex3.1.png" alt="ex3" style="zoom: 40%;" />


In [13]:
env = gym.make('VeryBadLavaFloor-v0')
env.render()

[['S' 'L' 'L' 'G']
 ['L' 'W' 'L' 'P']
 ['L' 'L' 'L' 'L']]


### 3.1) Use one of the methods developed in the lab to compute the value function for the left diagram.

In [14]:
def value_iteration(environment, maxiters=500, discount=0.9, max_error=1e-3):
    """
    Performs the value iteration algorithm for a specific environment
    
    Args:
        environment: OpenAI Gym environment
        maxiters: timeout for the iterations
        discount: gamma value, the discount factor for the Bellman equation
        max_error: the maximum error allowed in the utility of any state
        
    Returns:
        (policy, values): 1-d array of action identifiers where index `i` corresponds to state id `i`, and the 1-d array of state utilities
    """
    
    U_1 = [0 for _ in range(environment.observation_space.n)] # vector of utilities for states S
    delta = 0 # maximum change in the utility of any state in an iteration
    
    while True:
        maxiters -= 1
        U = U_1.copy()
        delta = 0
        for state in range(environment.observation_space.n):
            
            max_array = [0 for _ in range(environment.action_space.n)] # for each possible action
            for action in range(environment.action_space.n):
                for next_state in range(environment.observation_space.n):
                    max_array[action] += environment.T[state, action, next_state] * U[next_state]
                    
            if environment.grid[state] == "P" or environment.grid[state] == "G":
                U_1[state] = environment.RS[state]
            else:           
                U_1[state] = environment.RS[state] + discount * max(max_array) # Bellman Update 
                
            if abs(U_1[state] - U[state]) > delta: 
                delta = abs(U_1[state] - U[state])         
                
        if maxiters <= 0 or delta < (max_error*(1-discount)/discount):
            break  
  
    return values_to_policy(np.asarray(U), environment), np.asarray(U) # automatically convert the value matrix U to a policy




In [15]:
env = gym.make('VeryBadLavaFloor-v0')
policy, values = value_iteration(env)
policy_render = np.vectorize(env.actions.get)(policy.reshape(env.rows, env.cols))
print(values)
print(policy_render)

[ 0.50939438  0.64958568  0.79536209  1.          0.39844322  0.
  0.48644002 -1.          0.29628832  0.253867    0.34475423  0.12987275]
[['R' 'R' 'R' 'L']
 ['U' 'L' 'U' 'L']
 ['U' 'R' 'U' 'L']]


### Results:

<img src="images/ex3.png" alt="ex3" style="zoom: 40%;" />

### 3.2) Consider the right diagram and focus on state $(2, 1)$. State whether the action reported in the right diagram (the blue one) represents the optimal action for that state. Motivate your answer with the code...

In [16]:
actions = {0: "L", 1: "R", 2: "U", 3: "D"}

value_function = [0.50939438, 0.64958568, 0.79536209, 1, 0.39844322, 0, 0.48644002, -1, 0.29628832,  0.253867, 0.34475423, 0.12987275]
id_start_state = 9
gamma = 0.9

values_ex = [0, 0, 0, 0]
for action in range(env.action_space.n):
    for next_state in range(env.observation_space.n):
        values_ex[action] += env.T[id_start_state, action, next_state] * (env.RS[id_start_state]+ gamma*value_function[next_state])
    
print(np.asarray(values_ex))
print(f'The correct action to perform should be: {actions[np.argmax(values_ex)]}')

[0.21902365 0.25391911 0.20047807 0.20047807]
The correct action to perform should be: R


### 3.3) Compute the probability of ending in state (1, 3) if we execute the sequence of actions < Up, Up > from state (2, 2). Motivate your answer by reporting the code and the printed solution. The following image shows a diagram to guide the solution process as a hint:


<img src="images/example.png" alt="ex3" style="zoom: 30%;" />




In [20]:
id_start_state = 10
state = id_start_state
actions = {0: "L", 1: "R", 2: "U", 3: "D"}

prob = 0
action = 2
probs_fin = 0

for next_state in range(env.observation_space.n):
    if env.T[state, action, next_state] == 0:
        continue

    probs = env.T[state, action, next_state]
    
 
    for second_next_state in range(env.observation_space.n):
        if env.T[next_state, action, second_next_state] == 0:
            continue

        if env.grid[second_next_state] == "P":
            # Probability of this two-step path: P(s1 | s, Up) * P(pit | s1, Up)
            path_prob = env.T[state, action, next_state] * env.T[next_state, action, second_next_state]

            print(f'{env.state_to_pos(state)} --> {env.state_to_pos(next_state)} --> {env.state_to_pos(second_next_state)}')
            print(f'probs: {env.T[state, action, next_state]} --> {env.T[next_state, action, second_next_state]}')

            # The paths are mutually exclusive, so their probabilities add up
            probs_fin += path_prob

            print('================')
            break
    
print()
print('Probability: ', round(probs_fin,2))

(2, 2) --> (1, 2) --> (1, 3)
probs: 0.8 --> 0.1
(2, 2) --> (2, 3) --> (1, 3)
probs: 0.1 --> 0.8

Probability:  0.16
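
The figure of 0.16 can be cross-checked without the lab's `env.T`. The sketch below rebuilds the transition model from the description in the text (0.8 for the chosen direction, 0.1 for each orthogonal one) on the rendered grid. It assumes that bumping into a wall or the border leaves the agent in place; the names `GRID`, `move`, `step_distribution` and `two_step_probability` are introduced only for this example.

```python
# Grid as rendered above: W is a wall, G and P are terminal cells
GRID = ["SLLG",
        "LWLP",
        "LLLL"]
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
ORTHOGONAL = {"U": ("L", "R"), "D": ("L", "R"), "L": ("U", "D"), "R": ("U", "D")}

def move(pos, direction):
    """Deterministic move; assumption: hitting a wall or the border means staying put."""
    r, c = pos[0] + MOVES[direction][0], pos[1] + MOVES[direction][1]
    if 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] != "W":
        return (r, c)
    return pos

def step_distribution(pos, action):
    """Next-state distribution: 0.8 chosen direction, 0.1 each orthogonal direction."""
    dist = {}
    for direction, p in [(action, 0.8)] + [(d, 0.1) for d in ORTHOGONAL[action]]:
        nxt = move(pos, direction)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

def two_step_probability(start, action, target):
    """Probability of being in `target` after executing `action` twice from `start`."""
    total = 0.0
    for s1, p1 in step_distribution(start, action).items():
        if GRID[s1[0]][s1[1]] in "GP":
            # Terminal after the first step: the episode ends there
            total += p1 if s1 == target else 0.0
            continue
        total += p1 * step_distribution(s1, action).get(target, 0.0)
    return total

p_pit = two_step_probability((2, 2), "U", (1, 3))
print(round(p_pit, 2))
```

The two contributing paths are the same ones printed above: $0.8 \cdot 0.1$ through $(1, 2)$ and $0.1 \cdot 0.8$ through $(2, 3)$.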


### 3.4) Consider the following environment where states $(0, 3)$ and $(1, 3)$ are terminal states with reward $+1$ and $−1$ respectively. Assume the transition model is the same one defined above and that the discount factor is $0.9$ as above. However, now the agent pays $−0.4$ instead of $−0.04$ for every action, state and successor state. Compute the new value function and the optimal policy for this new environment. 


<img src="images/env_3.4.png" alt="ex3" style="zoom: 40%;" />

In [17]:
env = gym.make('NiceLavaFloor-v0')
policy, values = value_iteration(env, discount=0.9)
policy_render = np.vectorize(env.actions.get)(policy.reshape(env.rows, env.cols))
print(values)
print(policy_render)

[-0.71058163 -0.20355824  0.32372581  1.         -1.11176358  0.
 -0.28232707 -1.         -1.43861265 -1.20665118 -0.81864385 -1.18620466]
[['R' 'R' 'R' 'L']
 ['U' 'L' 'U' 'L']
 ['U' 'R' 'U' 'L']]


### 3.5) Modify the value iteration parameters so that the policy allows the agent to reach the goal only when starting from states $(0,1)$, $(0,2)$ and $(1,2)$, as reported in the image below. Motivate your answer.


#### hint: given the negative reward the agent should care about the immediate reward and not the long-term reward.



<img src="images/result_ex3.5.png" alt="ex3" style="zoom: 40%;" />

In [18]:
# use value iteration to check 
# solution: lower the discount factor to 0.1
env = gym.make('NiceLavaFloor-v0')
policy, values = value_iteration(env, discount=0.1)
policy_render = np.vectorize(env.actions.get)(policy.reshape(env.rows, env.cols))
print(values)
print(policy_render)

[-0.44  -0.44  -0.328  1.    -0.44   0.    -0.44  -1.    -0.44  -0.44
 -0.44  -0.44 ]
[['L' 'R' 'R' 'L']
 ['L' 'L' 'U' 'L']
 ['L' 'L' 'L' 'D']]
