Utility Theory
In this assignment, we will learn how to maximize game utility and choose actions based on utility values.

Consider a set of actions from which an RL agent can choose. In a deterministic scenario, each action is associated with a reward or utility U and may move the agent into a next state.

In a probabilistic case, we are interested in the expected utility EU of an action a taken from a given state s, which may lead to any of several successor states s′.
We compute EU(a) as the average of the utilities of the resulting states, weighted by the probability Pr(s′) of reaching each one.

$$EU(a) = \sum_{s'} \Pr(s')\, U(s')$$

The action $a^* = \arg\max_a EU(a)$ maximizes the expected utility.
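
The two formulas above can be sketched in a few lines of Python. This is only an illustration: the helper names `expected_utility` and `best_action`, the action labels, and the numbers are all made up here and are not part of the assignment's starter code.

```python
# Hypothetical helpers for illustration; not part of the assignment's starter code.

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for a single action."""
    return sum(p * u for p, u in outcomes)


def best_action(outcomes_by_action):
    """Return the action whose outcomes have the highest expected utility."""
    return max(outcomes_by_action,
               key=lambda a: expected_utility(outcomes_by_action[a]))


# Made-up numbers, chosen only to show the mechanics
outcomes_by_action = {
    "safe": [(1.0, 1.0)],                # always yields utility 1
    "risky": [(0.5, 4.0), (0.5, -3.0)],  # 0.5 * 4 + 0.5 * (-3) = 0.5
}

print(expected_utility(outcomes_by_action["risky"]))  # 0.5
print(best_action(outcomes_by_action))                # safe
```

Here the "risky" action has the single best outcome, but its expected utility is lower, so the argmax picks "safe".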

We will implement a simple method to compute EU(a) for an arbitrary set of utility values. Consider the following mini-gridworld example:

- A grid with three states S: A, B, C
- The utility of each state is +3, -2, +1, respectively
- Actions A: L and R, for moving left and right, respectively
- The agent starts at state B
- The agent moves in the chosen direction with probability 0.7, in the opposite direction with probability 0.2, and does not move at all with probability 0.1
- There is no wall around the grid. That is, the agent can move out of the grid, with zero utility gain
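
As a worked instance of the expected-utility formula on this grid, assume the agent is at B. Moving left reaches A with probability 0.7, reaches C with probability 0.2, and leaves the agent at B with probability 0.1; moving right swaps the roles of A and C:

$$EU(L) = 0.7\,U(A) + 0.2\,U(C) + 0.1\,U(B) = 0.7(3) + 0.2(1) + 0.1(-2) = 2.1$$

$$EU(R) = 0.7\,U(C) + 0.2\,U(A) + 0.1\,U(B) = 0.7(1) + 0.2(3) + 0.1(-2) = 1.1$$

From B, no outcome leaves the grid. From A or C, one of the three outcomes does, and that term contributes zero to the sum.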

Do not modify any pre-defined variables. Doing so can affect the autograder results.
Ensure that your solution generalizes to different utility values (U).


In [None]:
# required constants

S = [0, 1, 2] # shorthand for A, B, C
A = [-1, 1]   # shorthand for L, R
U = [3, -2, 1] # utility for each state

In [None]:
# Compute the expected utility of taking an action from a given state,
# using EU(a) = sum over s' of Pr(s') * U(s').

def get_expected_utility(action, state):
    """
    action: -1 (left) or 1 (right)
    state: 'A', 'B', or 'C'
    return: expected utility of taking the action from the state
    """
    # Map state names 'A', 'B', 'C' to their indices in U
    state_to_index = {'A': 0, 'B': 1, 'C': 2}
    s_idx = state_to_index[state]

    # (displacement, probability) for each possible outcome
    outcomes = [
        (action, 0.7),   # moves in the chosen direction
        (-action, 0.2),  # moves in the opposite direction
        (0, 0.1),        # does not move
    ]

    EU = 0.0
    for step, prob in outcomes:
        next_s = s_idx + step
        # Moving out of the grid contributes zero utility
        if 0 <= next_s < len(U):
            EU += prob * U[next_s]

    return EU

In [None]:
#Doing a sanity check
assert abs(2.1 - get_expected_utility(-1, 'B')) < 1e-3, "If this fails, try to find the utility of moving left from state B by hand"

In [None]:
# Equipped with the method, find the action that returns the maximum expected utility.
def get_best_action(state):
    """
    state: 'A', 'B', or 'C'
    return: action -1 or 1, left or right
    """
    best_action = None
    best_EU = float('-inf')  # any real expected utility beats this

    for action in A:
        EU = get_expected_utility(action, state)
        if EU > best_EU:
            best_EU = EU
            best_action = action

    return best_action

In [None]:
# Test again
assert get_best_action('B') == -1, "You can either move left or right. If the mini-grid were deterministic, what would a rational agent choose?"
assert get_best_action('C') == 1
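
One way to check that the approach generalizes to other utility values is to pass the utilities in as a parameter rather than reading the global `U`. The sketch below is an assumption-laden variant, not the graded solution: the names `expected_utility` and `eu_table`, the keyword parameters for the probabilities, and the four-state grid are all hypothetical, and states are given as indices instead of letters.

```python
def expected_utility(action, s_idx, utilities,
                     p_ok=0.7, p_opposite=0.2, p_stay=0.1):
    """Expected utility of `action` (-1 or 1) from index `s_idx` on a 1-D grid."""
    total = 0.0
    for step, prob in ((action, p_ok), (-action, p_opposite), (0, p_stay)):
        next_s = s_idx + step
        if 0 <= next_s < len(utilities):  # off-grid outcomes add zero utility
            total += prob * utilities[next_s]
    return total


def eu_table(utilities, actions=(-1, 1)):
    """Map every (state index, action) pair to its expected utility."""
    return {
        (s, a): expected_utility(a, s, utilities)
        for s in range(len(utilities))
        for a in actions
    }


# The assignment's utilities, then a made-up four-state grid
for utilities in ([3, -2, 1], [0, 5, -1, 2]):
    for (s, a), eu in eu_table(utilities).items():
        print(utilities, s, a, round(eu, 2))
```

Because the bounds check uses `len(utilities)`, the same code handles grids of any length without changes.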