
**Softmax Function Derivative**

Attached to pdf
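
As a quick numerical companion to the attached derivation, the sketch below assumes the standard softmax S_i = exp(a_i) / sum_k exp(a_k) and its Jacobian dS_i/da_j = S_i (delta_ij - S_j), and compares that formula with central finite differences. The function names and the test vector are illustrative and are not taken from the PDF.

```python
import numpy as np

def softmax(a):
    # Subtracting the max improves numerical stability and leaves the result unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    # J[i, j] = S_i * (delta_ij - S_j)
    s = softmax(a)
    return np.diag(s) - np.outer(s, s)

def numerical_jacobian(f, a, eps=1e-6):
    # Central finite differences, perturbing one input coordinate at a time.
    a = np.asarray(a, dtype=float)
    J = np.zeros((a.size, a.size))
    for j in range(a.size):
        step = np.zeros_like(a)
        step[j] = eps
        J[:, j] = (f(a + step) - f(a - step)) / (2 * eps)
    return J

a = np.array([1.0, 2.0, 0.5])
analytic = softmax_jacobian(a)
numeric = numerical_jacobian(softmax, a)
print(np.allclose(analytic, numeric, atol=1e-6))
```

Each column of the Jacobian sums to zero, which reflects the fact that the softmax outputs always sum to 1.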

**Loss Function Derivative**

Attached to pdf

Reference: https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/

**MDP**

State Space: The state space is the set of all positions the agent can occupy. On a 25x25 grid this gives 625 states, and each state can be represented as a tuple (x, y) of grid coordinates.

Action Space: We can represent the action space as A = {up, down, left, right}

Transition Probabilities: Given a state and action, the transition to the next state depends on the action taken. If the action leads to a blocked cell or goes out of bounds, the agent remains in the same state. Otherwise, the agent moves according to the chosen action. We can represent the transition probabilities as P(s'|s, a), where s is the current state, a is the action taken, and s' is the next state.
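
The deterministic rule described above can be sketched as follows. This is a minimal illustration, not the code used later in this notebook: `GRID_SIZE`, `MOVES`, `next_state`, `transition_probability`, and the `blocked` set are hypothetical names, and the blocked cell in the example is made up.

```python
GRID_SIZE = 25  # assumed 25x25 grid with 1-based coordinates
MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

def next_state(state, action, blocked=frozenset()):
    """Return the cell reached from `state`, or `state` itself if the move is illegal."""
    x, y = state
    dx, dy = MOVES[action]
    candidate = (x + dx, y + dy)
    in_bounds = 1 <= candidate[0] <= GRID_SIZE and 1 <= candidate[1] <= GRID_SIZE
    if not in_bounds or candidate in blocked:
        return state  # stay put when the move leaves the grid or hits a blocked cell
    return candidate

def transition_probability(s_next, s, a, blocked=frozenset()):
    """P(s' | s, a) for the deterministic rule: 1 for the resulting cell, 0 otherwise."""
    return 1.0 if next_state(s, a, blocked) == s_next else 0.0

print(next_state((1, 1), 'left'))                   # off the grid, so the agent stays put
print(next_state((3, 3), 'up', blocked={(3, 4)}))   # blocked, so the agent stays put
print(transition_probability((4, 3), (3, 3), 'right'))
```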

Reward Function: The agent receives a reward of -1 for each time step until it reaches the goal state. The reward function can be denoted as R(s, a, s'), where s is the current state, a is the action taken, and s' is the next state.

Discount Factor: γ (a value between 0 and 1) weighs the importance of future rewards relative to immediate ones. The implementation below uses γ = 0.9.
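
To make the role of the discount factor concrete, the sketch below computes the discounted return G = sum_t γ^t r_t for an agent that pays the step cost of -1 on every step. The function name and the two path lengths are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

gamma = 0.9
short_path = [-1] * 5    # hypothetical path that takes 5 steps
long_path = [-1] * 20    # hypothetical path that takes 20 steps

print(discounted_return(short_path, gamma))
print(discounted_return(long_path, gamma))
```

The shorter path has the larger (less negative) return. With γ < 1, the return of an arbitrarily long sequence of -1 rewards approaches -1 / (1 - γ), which is -10 for γ = 0.9.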

Reference: I used ChatGPT to help me debug my policy iteration algorithm, and I also used this reference: https://www.youtube.com/watch?v=RlugupBiC6w

**Implementation of Policy Iteration**

In [None]:
import numpy as np

# State space
states = [(x, y) for x in range(1, 26) for y in range(1, 26)]

# Action space
actions = ['up', 'down', 'left', 'right']

# Transition probabilities
# Each action maps to a (dx, dy) offset; 'up' increases y and 'down' decreases it.
MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

def transition_probabilities(state, action):
    """Return {next_state: probability} for taking `action` in `state`.

    The intended move has probability 0.7 and each other direction 0.1.
    Moves off the grid are clamped, so the agent stays in place at an edge.
    """
    x, y = state
    probs = {}
    # Insert the intended move first so it is the first key in the dict.
    for a in [action] + [other for other in actions if other != action]:
        dx, dy = MOVES[a]
        next_state = (min(max(x + dx, 1), 25), min(max(y + dy, 1), 25))
        # Accumulate: at a corner two clamped moves land on the same cell, and a
        # dict literal with repeated keys would silently drop probability mass.
        probs[next_state] = probs.get(next_state, 0.0) + (0.7 if a == action else 0.1)
    return probs

# Reward function (depends only on the current state; action and next_state are unused)
def reward(state, action, next_state):
    x, y = state
    if (x, y) == (1, 1):  # Goal state
        return 100
    elif (x, y) == (24, 24):  # Obstacle state
        return -100
    else:
        return -1  # Step cost

# Discount factor
gamma = 0.9

# Initialize policy
policy = {state: np.random.choice(actions) for state in states}

# Policy iteration algorithm
def policy_iteration():
    policy_stable = False
    while not policy_stable:
        # Policy evaluation
        V = {state: 0 for state in states}
        while True:
            delta = 0
            for state in states:
                action = policy[state]
                value = 0
                for next_state, prob in transition_probabilities(state, action).items():
                    r = reward(state, action, next_state)
                    value += prob * (r + gamma * V[next_state])
                delta = max(delta, abs(V[state] - value))
                V[state] = value
            if delta < 1e-6:
                break

        # Policy improvement
        policy_stable = True
        for state in states:
            old_action = policy[state]
            max_value = float('-inf')
            for action in actions:
                value = 0
                for next_state, prob in transition_probabilities(state, action).items():
                    r = reward(state, action, next_state)
                    value += prob * (r + gamma * V[next_state])
                if value > max_value:
                    max_value = value
                    best_action = action
            if best_action != old_action:
                policy_stable = False
            policy[state] = best_action

        if policy_stable:
            break

    return V, policy

# Run policy iteration
optimal_value_function, optimal_policy = policy_iteration()

# Visualize
print("Optimal Policy:")
for y in range(1, 26):
    for x in range(1, 26):
        state = (x, y)
        action = optimal_policy[state]
        if state == (1, 1):
            print("G", end="\t")  # Goal state
        elif state == (12, 12):
            print("X", end="\t")  # Obstacle marker; note that reward() penalizes (24, 24), not (12, 12)
        elif action == 'up':
            print("↑", end="\t")
        elif action == 'down':
            print("↓", end="\t")
        elif action == 'left':
            print("←", end="\t")
        elif action == 'right':
            print("→", end="\t")
    print()

# Display the optimal value function
print("\nOptimal Value Function:")
for y in range(1, 26):
    for x in range(1, 26):
        state = (x, y)
        value = optimal_value_function[state]
        print(f"{value:.2f}", end="\t")
    print()

Optimal Policy:
G	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	→	→	↓	
↓	↓	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	↓	↓	
↓	↓	↓	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	↓	↓	
↓	↓	↓	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	↓	↓	
↓	↓	↓	↓	↓	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	↓	↓	
↓	↓	↓	↓	↓	↓	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	←	←	←	←	←	←	←	←	←	←	←	←	←	←	←	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	←	←	←	←	←	←	←	←	←	←	←	←	←	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	←	←	←	←	←	←	←	←	←	←	←	←	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	←	←	←	←	←	←	←	←	←	←	↓	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	←	←	←	←	←	←	←	←	↓	↓	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	X	←	←	←	←	←	←	↓	↓	↓	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	←	←	←	←	↓	↓	↓	↓	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	←	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	←	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	←	←	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	←	←	←	↓	↓	↓	↓	↓	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	←	←	←	←	←	↓	↓	↓	↓	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	↓	←	←	←	←	←	←	←	↓	↓	↓	↓	↓	↓	↓	
↓	↓	↓	↓	↓	↓	↓	↓

**Report**

When implementing the MDP, I chose a number of data structures to represent different parts of the problem. I represented the state space as a list of tuples, where each tuple is a distinct state in the grid world. The action space is a simple list of strings. The transition probabilities for a given state and action are returned as a dictionary that maps each possible next state to its probability. The policy is also a dictionary, where the keys are states and the values are the chosen actions for those states. The value function is a dictionary as well: the keys are states, and the values are the estimated values of those states. These data structures are flexible, so it was easy to modify and extend them as the problem requirements grew.

The problem being addressed was to find the optimal policy for the agent, that is, the policy that maximizes the expected cumulative reward in an MDP environment. In my implementation, the transitions are stochastic: there is a 70% chance of moving in the intended direction and a 10% chance of moving in each of the other three directions. The reward function assigns a reward of 100 for reaching the goal state, a penalty of -100 for entering the obstacle state, and a step cost of -1 for all other transitions. I also used a discount factor to ensure that the agent prefers to reach the goal state as quickly as it can.

During policy evaluation, the algorithm computes the value function for the current policy. Formally, this means solving a system of linear equations (one Bellman equation per state); my implementation approximates the solution by sweeping over the states and applying the Bellman update until the largest change in any state's value falls below 1e-6. The value function represents how good it is for the agent to be in a particular state when following the current policy. During policy improvement, the algorithm updates the policy by choosing, for each state, the action that maximizes the expected future reward under the value function computed in the previous step.
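
For reference, exact policy evaluation solves the linear system V = R_π + γ P_π V, which rearranges to (I - γ P_π) V = R_π. The sketch below does this with `np.linalg.solve` on a made-up three-state chain and compares the result with repeated Bellman updates. The transition matrix and rewards are illustrative assumptions, not the grid world from this notebook.

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state chain under a fixed policy: row s holds P(s' | s, pi(s)).
P_pi = np.array([
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])
# Expected immediate reward in each state under the policy.
R_pi = np.array([-1.0, -1.0, 0.0])

# Exact evaluation: solve (I - gamma * P_pi) V = R_pi.
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Iterative evaluation: repeat the Bellman expectation update until it stops changing.
V_iter = np.zeros(3)
while True:
    V_new = R_pi + gamma * P_pi @ V_iter
    converged = np.max(np.abs(V_new - V_iter)) < 1e-10
    V_iter = V_new
    if converged:
        break

print(V_exact)
print(np.allclose(V_exact, V_iter, atol=1e-8))
```

The direct solve is exact up to floating-point error but scales with the cube of the number of states, while the iterative update only needs one pass over the transitions per sweep.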

The algorithm iterates between these two steps until the policy converges (no further improvements) and the optimal policy has been found.

In [None]:
# Random policy
def random_policy(state):
    return np.random.choice(actions)

def evaluate_random_policy(start_states):
    steps_to_goal = []
    total_states = 25 * 25
    for start_state in start_states:
        state = start_state
        steps = 0
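        while state != (1, 1):
            action = random_policy(state)
            # Takes the first entry of the transition dict, i.e. the intended move (probability 0.7),
            # rather than sampling from the full distribution.
            next_state, prob = next(iter(transition_probabilities(state, action).items()))
            state = next_state
            steps += 1
            if steps > total_states:
                print(f"Loop encountered for start state {start_state}. Terminating execution.")
                break
        else:
            # Only runs that reached the goal (loop ended without break) are counted.
            steps_to_goal.append(steps)
    # np.median of an empty list returns nan with a RuntimeWarning, so return nan directly.
    return np.median(steps_to_goal) if steps_to_goal else float('nan')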

# Choose 3 random start states
start_states = [(np.random.randint(1, 26), np.random.randint(1, 26)) for _ in range(3)]

# Print the randomly chosen start states
print("Randomly chosen start states:")
for state in start_states:
    print(state)

# Evaluate random policy
random_policy_performance = evaluate_random_policy(start_states)
print(f"\nMedian number of steps to reach the goal state with a random policy: {random_policy_performance}")

Randomly chosen start states:
(22, 15)
(1, 19)
(5, 9)
Loop encountered for start state (22, 15). Terminating execution.
Loop encountered for start state (1, 19). Terminating execution.
Loop encountered for start state (5, 9). Terminating execution.

Median number of steps to reach the goal state with a random policy: nan




In [None]:
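def evaluate_policy(policy, start_states, max_steps=25 * 25):
    steps_to_goal = []
    for start_state in start_states:
        state = start_state
        steps = 0
        while state != (1, 1):
            action = policy[state]
            # Takes the first entry of the transition dict, i.e. the intended move (probability 0.7),
            # so this rollout is deterministic rather than sampled.
            next_state, prob = next(iter(transition_probabilities(state, action).items()))
            state = next_state
            steps += 1
            if steps > max_steps:
                # Guard against a policy that cycles and never reaches the goal.
                print(f"Loop encountered for start state {start_state}. Terminating execution.")
                break
        else:
            steps_to_goal.append(steps)
    return np.median(steps_to_goal) if steps_to_goal else float('nan')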

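# Evaluate the intermediate policy
# Note: policy_iteration() updates `policy` in place and returns it, so once that
# cell has run, `policy` is the same dictionary as `optimal_policy`.
intermediate_policy_performance = evaluate_policy(policy, start_states)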

# Print the randomly chosen start states
print("Randomly chosen start states:")
for state in start_states:
    print(state)

# Print the result
print(f"\nMedian number of steps to reach the goal state with the intermediate policy: {intermediate_policy_performance}")


Randomly chosen start states:
(12, 1)
(20, 8)
(22, 13)

Median number of steps to reach the goal state with the intermediate policy: 26.0


In [None]:
# Evaluate optimal policy
optimal_policy_performance = evaluate_policy(optimal_policy, start_states)

# Print the result
print(f"\nMedian number of steps to reach the goal state with the optimal policy: {optimal_policy_performance}")



Median number of steps to reach the goal state with the optimal policy: 26.0


**Results**

As the results show, the random policy can fail to reach the goal within the step limit, and the intermediate and optimal policies take the same median number of steps (26.0). The two values match because `policy_iteration()` updates the `policy` dictionary in place and returns that same dictionary as `optimal_policy`, so when the cells are run in order both evaluations use the same policy. I realized it was almost impossible to evaluate the performance of a random policy without choosing a start state near the goal. Initially, I let the rollout run until it found the goal state; however, the median number of steps was always abnormally high. I ended up setting a maximum step count: if a rollout takes more steps than the total number of states (625), it is treated as a loop and terminated.
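
One way to make the step cap and the stochastic transitions explicit is to sample the next state from the transition distribution instead of always taking its first entry. The sketch below is a self-contained illustration under assumptions: `toy_transitions`, `sample_next_state`, and `rollout` are hypothetical helpers on a small 5x5 grid, not this notebook's functions, and the seed and step cap are arbitrary.

```python
import numpy as np

SIZE = 5          # small hypothetical grid so the example runs quickly
GOAL = (1, 1)
ACTIONS = ['up', 'down', 'left', 'right']
MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

def toy_transitions(state, action):
    """70% intended move, 10% each other move; clamped moves accumulate probability."""
    x, y = state
    probs = {}
    for a in ACTIONS:
        dx, dy = MOVES[a]
        nxt = (min(max(x + dx, 1), SIZE), min(max(y + dy, 1), SIZE))
        probs[nxt] = probs.get(nxt, 0.0) + (0.7 if a == action else 0.1)
    return probs

def sample_next_state(rng, state, action):
    """Draw the next state according to its probability."""
    dist = toy_transitions(state, action)
    cells = list(dist.keys())
    idx = rng.choice(len(cells), p=list(dist.values()))
    return cells[idx]

def rollout(rng, choose_action, start, max_steps):
    """Return the number of steps to reach GOAL, or None if the cap is hit first."""
    state, steps = start, 0
    while state != GOAL:
        if steps >= max_steps:
            return None
        state = sample_next_state(rng, state, choose_action(state))
        steps += 1
    return steps

rng = np.random.default_rng(0)
greedy = lambda s: 'left' if s[0] > 1 else 'down'        # heads straight for (1, 1)
uniform = lambda s: ACTIONS[rng.integers(len(ACTIONS))]  # uniformly random action

print(rollout(rng, greedy, (5, 5), max_steps=SIZE * SIZE * 10))
print(rollout(rng, uniform, (5, 5), max_steps=SIZE * SIZE * 10))
```

Returning `None` for capped rollouts keeps them out of the median, and reporting how many rollouts were capped alongside the median would show how often the policy failed to reach the goal.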