# Homework 4: Petting a warg

Wargs do not make good pets. They are vicious creatures that inhabit Middle-earth, the world described in the novels of John Ronald Reuel Tolkien. They tend to show up at the worst possible moment. They eat humans, hobbits, elves and wizards (when they can get them).

![A warg, getting ready for breakfast w:300px](figures/Gundabad_Wargs.jpg)

Your relationship with a warg can be in one of the following states:
```
SleepingWarg
AngryWarg
FuriousWarg
ApoplecticWarg
Safe
Sorry 
```

![Warg states](figures/WargStates.jpg)

Your actions are limited to petting a warg or striking it with your sword. The transitions are described in the following picture. The Safe and Sorry states are terminal: no further actions can be taken in them. Landing in them yields a reward of +10 and -10, respectively. All other actions have a reward of -1.

The discount factor is $\gamma=0.9$.

![](figures/PetAWarg.jpg)


# How to solve this homework
You can solve the following problems either with the help of an LLM or by hand.

* If you are solving by hand, add enough comments to make the code understandable.
* If you are solving using an LLM, add, in the form of comments:
    * the LLM used (at the first use instance)
    * the prompt used to elicit the code
    * modifications that had to be done to the code 

For example:

```
# --- LLM used: ChatGPT 4.5
# --- LLM prompt
# Write a python class to encapsulate the least common multiple algorithm
# --- End of LLM prompt
```

The programming language should be Python.

## P1: MDP implementation 

Write a class to implement an MDP. Do not include value or policy iteration in the class.

In [8]:
class MDP:

    def __init__(self, states, terminal_states, actions, discount_factor):

        self.states = states
        self.start_state = states[0]
        self.actions = actions
        self.discount_factor = discount_factor
        self.terminal_states = terminal_states
        # --- LLM used: Sonnet 4.5
        # --- LLM prompt: 
        # What is a way I can implement a set of transition functions without having to pass them into the class
        # to ensure I can store the probability and reward in the same structure?
        self.transitions = {s: {a: [] for a in actions} for s in states}
        
    def add_transition(self, state, action, info: tuple[float, str, int]):
        probability, next_state, reward = info
        self.transitions[state][action].append((probability, next_state, reward))
    # --- End of LLM prompt

    def is_terminal(self, state):
        return (state in self.terminal_states)

    def get_transitions(self, state, action):
        return self.transitions[state][action]
    
    def get_actions(self, state):
        if self.is_terminal(state):
            return []
        return self.actions
    
    def get_discount_factor(self):
        return self.discount_factor
    
    def get_states(self):
        return self.states
    
    def get_terminal_states(self):
        return self.terminal_states

## P2: Warg as an MDP
Implement the WargPettingGame as an MDP using the implementation from above. 

In [9]:
def WargPettingGame_MDP():
    states = ['SleepingWarg', 'AngryWarg', 'FuriousWarg', 'ApoplecticWarg']
    terminal_states = ['Safe', 'Sorry']
    actions = ['Strike', 'Pet']
    discount_factor = 0.9

    game = MDP(states, terminal_states, actions, discount_factor)
    # SleepingWarg State
    game.add_transition('SleepingWarg', 'Strike', (1.0, 'AngryWarg', -1))
    game.add_transition('SleepingWarg', 'Pet', (0.95, 'AngryWarg', -1))
    game.add_transition('SleepingWarg', 'Pet', (0.05, 'Safe', 10))
    # AngryWarg State
    game.add_transition('AngryWarg', 'Strike', (1.0, 'FuriousWarg', -1))
    game.add_transition('AngryWarg', 'Pet', (1.0, 'Sorry', -10))
    # FuriousWarg State
    game.add_transition('FuriousWarg', 'Strike', (1.0, 'ApoplecticWarg', -1))
    game.add_transition('FuriousWarg', 'Pet', (1.0, 'Sorry', -10))
    # ApoplecticWarg
    game.add_transition('ApoplecticWarg', 'Strike', (0.8, 'Sorry', -10))
    game.add_transition('ApoplecticWarg', 'Strike', (0.2, 'Safe', 10))
    game.add_transition('ApoplecticWarg', 'Pet', (1.0, 'Sorry', -10))

    return game

## P3: Value iteration

Implement the value iteration as a separate function that uses this MDP implementation. 

In [None]:
def value_iteration(mdp: MDP, iterations=10000, convergence=0.0001):
    # Initialize V for every state to 0.0
    V = {s: 0.0 for s in mdp.get_states()} 
    for terminal_state in mdp.get_terminal_states():
        V[terminal_state] = 0.0
    for i in range(iterations):
        V_next = V.copy() # V_k+1
        delta = 0 # Difference between iterations for convergence

        # --- LLM used: Sonnet 4.5
        # --- LLM prompt
        # How can I loop through each of the possible transitions for a given action and compute the q value for 
        # value iteration. Please review these slides from my class to understand the bellman equation we used.

        # For each non-terminal state
        for state in mdp.get_states():

            # Compute Q*(s,a) for each action
            action_values = []
            for action in mdp.get_actions(state):
                q = 0

                # Q*(s,a) = Σ T(s,a,s') * [R(s,a,s') + γ * V_k(s')]
                transitions = mdp.get_transitions(state, action)
                for info in transitions:
                    probability, next_state, reward = info
                    q += probability * (reward + mdp.get_discount_factor() * V[next_state])

                action_values.append(q)

            # V_{k+1}(s) = max_a Q*(s,a) (Bellman update)
            if action_values:
                V_next[state] = max(action_values)
                delta = max(delta, abs(V_next[state] - V[state]))

        V = V_next
        # --- End of LLM prompt
        if delta < convergence:
            break

    return V

## P4: Using value iteration
Find the V* values of the WargPettingGame using the implementation above. Print out the V* values for each state in the form

V(state) = number

In [32]:
mdp = WargPettingGame_MDP()
V_star = value_iteration(mdp)
for state, value in V_star.items():
    print(f"V({state}) = {value}")


V(SleepingWarg) = -6.2298
V(AngryWarg) = -6.760000000000001
V(FuriousWarg) = -6.4
V(ApoplecticWarg) = -6.0
V(Safe) = 0.0
V(Sorry) = 0.0
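
The printed values can be checked by hand by working backward from the terminal states. The derivation below is a sketch that takes the transition probabilities and rewards entered in the P2 cell as given, with $V^*(\text{Safe}) = V^*(\text{Sorry}) = 0$ and $\gamma = 0.9$. In each $\max$, the first entry is Strike and the second is Pet.

$$
\begin{aligned}
V^*(\text{ApoplecticWarg}) &= \max\{\,0.8(-10) + 0.2(10),\; -10\,\} = -6 \\
V^*(\text{FuriousWarg}) &= \max\{\,-1 + 0.9(-6),\; -10\,\} = -6.4 \\
V^*(\text{AngryWarg}) &= \max\{\,-1 + 0.9(-6.4),\; -10\,\} = -6.76 \\
V^*(\text{SleepingWarg}) &= \max\{\,-1 + 0.9(-6.76),\; 0.95\,(-1 + 0.9(-6.76)) + 0.05(10)\,\} \\
&= \max\{\,-7.084,\; -6.2298\,\} = -6.2298
\end{aligned}
$$

These agree with the output of value iteration up to floating-point rounding.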


## P5:  Policy extraction

Find the policy $\pi(s)$ from the V values obtained in the previous step. Remember that you need to do one step of expectimax.
Print out the policy for each state in a readable way, e.g.
    pi(ApoplecticWarg) = Pet



In [31]:
def extract_policy(mdp: MDP, V: dict):
    policy = {}
    
    for state in mdp.get_states():
        best_action = None
        best_value = float('-inf')
        
        # Try each action
        for action in mdp.get_actions(state):
            q_value = 0
            transitions = mdp.get_transitions(state, action)
            
            # Compute expected value for this action
            for prob, next_state, reward in transitions:
                q_value += prob * (reward + mdp.get_discount_factor() * V[next_state])
            
            # Update best action if this is better
            if q_value > best_value:
                best_value = q_value
                best_action = action
        
        policy[state] = best_action
    
    return policy

policy = extract_policy(mdp, V_star)
for state, action in policy.items():
    print(f"pi({state}) = {action}")

pi(SleepingWarg) = Pet
pi(AngryWarg) = Strike
pi(FuriousWarg) = Strike
pi(ApoplecticWarg) = Strike


## P6: Policy iteration
Implement policy iteration with the MDP as defined above as a separate function.
Apply it to the MDP defining the warg petting game.
Print out the resulting policy for each state, in a readable way.

In [None]:
def policy_iteration(mdp, policy, convergence=0.0001):
    


{'SleepingWarg': 'Pet', 'AngryWarg': 'Strike', 'FuriousWarg': 'Strike', 'ApoplecticWarg': 'Strike'}
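
As a rough illustration of the two alternating steps of policy iteration (policy evaluation, then greedy policy improvement), here is a minimal self-contained sketch. It does not use the `MDP` class from P1: the names `T`, `TERMINAL`, `GAMMA`, `q_value` and `policy_iteration_sketch` are assumptions made for this example, the transition numbers are copied from the P2 cell, and the "always Pet" starting policy is an arbitrary choice.

```python
# Transition table copied from the P2 cell:
# T[state][action] -> list of (probability, next_state, reward)
T = {
    "SleepingWarg": {"Strike": [(1.0, "AngryWarg", -1)],
                     "Pet": [(0.95, "AngryWarg", -1), (0.05, "Safe", 10)]},
    "AngryWarg": {"Strike": [(1.0, "FuriousWarg", -1)],
                  "Pet": [(1.0, "Sorry", -10)]},
    "FuriousWarg": {"Strike": [(1.0, "ApoplecticWarg", -1)],
                    "Pet": [(1.0, "Sorry", -10)]},
    "ApoplecticWarg": {"Strike": [(0.8, "Sorry", -10), (0.2, "Safe", 10)],
                       "Pet": [(1.0, "Sorry", -10)]},
}
TERMINAL = ["Safe", "Sorry"]
GAMMA = 0.9


def q_value(state, action, V):
    """One-step expected return of taking `action` in `state` under values V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[state][action])


def policy_iteration_sketch(policy, convergence=1e-4):
    policy = dict(policy)  # work on a copy of the caller's policy
    while True:
        # Policy evaluation: sweep V under the fixed policy until it settles.
        V = {s: 0.0 for s in list(T) + TERMINAL}
        while True:
            delta = 0.0
            for s in T:
                new_v = q_value(s, policy[s], V)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < convergence:
                break
        # Policy improvement: switch only when an action is strictly better.
        stable = True
        for s in T:
            best = max(T[s], key=lambda a: q_value(s, a, V))
            if q_value(s, best, V) > q_value(s, policy[s], V) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


start = {s: "Pet" for s in T}  # arbitrary initial policy (assumption)
pi, V_pi = policy_iteration_sketch(start)
for s, a in pi.items():
    print(f"pi({s}) = {a}")
```

Switching only on strict improvement keeps the loop from flipping back and forth between actions that tie, so it stops as soon as no state changes its action.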


## P7: Trajectory sampling
Implement a function that generates trajectories in the form of (s,a,r,s') tuples from the MDP for a specific policy. The trajectory ends when it reaches a terminal state. 

Generate 100 trajectories for a __random__ policy. 
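
One possible way to sample such trajectories is sketched below. It is self-contained and does not use the `MDP` class: the table `T`, the helper names `random_policy` and `sample_trajectory`, the fixed seed, and the choice of `SleepingWarg` as the start state (the first state listed in P2) are all assumptions made for this example.

```python
import random

# Transition table copied from the P2 cell:
# T[state][action] -> list of (probability, next_state, reward)
T = {
    "SleepingWarg": {"Strike": [(1.0, "AngryWarg", -1)],
                     "Pet": [(0.95, "AngryWarg", -1), (0.05, "Safe", 10)]},
    "AngryWarg": {"Strike": [(1.0, "FuriousWarg", -1)],
                  "Pet": [(1.0, "Sorry", -10)]},
    "FuriousWarg": {"Strike": [(1.0, "ApoplecticWarg", -1)],
                    "Pet": [(1.0, "Sorry", -10)]},
    "ApoplecticWarg": {"Strike": [(0.8, "Sorry", -10), (0.2, "Safe", 10)],
                       "Pet": [(1.0, "Sorry", -10)]},
}
TERMINAL = {"Safe", "Sorry"}
ACTIONS = ["Strike", "Pet"]


def random_policy(state, rng):
    """Pick an action uniformly at random, ignoring the state."""
    return rng.choice(ACTIONS)


def sample_trajectory(policy, rng, start="SleepingWarg"):
    """Roll out one episode and return it as a list of (s, a, r, s') tuples."""
    trajectory = []
    state = start
    while state not in TERMINAL:
        action = policy(state, rng)
        outcomes = T[state][action]
        weights = [p for p, _, _ in outcomes]
        # Draw one outcome according to the transition probabilities.
        _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


rng = random.Random(0)  # fixed seed so the run is repeatable
trajectories = [sample_trajectory(random_policy, rng) for _ in range(100)]
print(len(trajectories), "trajectories; first one:", trajectories[0])
```

Passing the random generator in explicitly keeps the sampling reproducible, which makes it easier to compare Q-learning runs later.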

## P8: Implement Q-learning 

Create an implementation of Q-learning which takes the trajectory database and updates a Q-table.
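
The tabular update $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$ can be applied to stored (s,a,r,s') tuples roughly as follows. This is a sketch under assumptions: the function name `q_learning`, the learning rate `alpha=0.1`, the number of passes `epochs=50`, and the two hand-written demo trajectories are illustrative choices, not values given by the assignment.

```python
ACTIONS = ["Strike", "Pet"]
TERMINAL = {"Safe", "Sorry"}


def q_learning(trajectories, alpha=0.1, gamma=0.9, epochs=50):
    """Replay stored (s, a, r, s') tuples and return a Q-table as a dict."""
    Q = {}  # maps (state, action) -> estimate; missing entries count as 0.0
    for _ in range(epochs):
        for trajectory in trajectories:
            for s, a, r, s2 in trajectory:
                # Terminal states have no actions, so their future value is 0.
                if s2 in TERMINAL:
                    future = 0.0
                else:
                    future = max(Q.get((s2, b), 0.0) for b in ACTIONS)
                old = Q.get((s, a), 0.0)
                Q[(s, a)] = old + alpha * (r + gamma * future - old)
    return Q


# Tiny hand-written trajectory database, for illustration only.
demo = [
    [("SleepingWarg", "Pet", 10, "Safe")],
    [("SleepingWarg", "Strike", -1, "AngryWarg"),
     ("AngryWarg", "Pet", -10, "Sorry")],
]
Q = q_learning(demo)
for (s, a), q in sorted(Q.items()):
    print(f"Q({s}, {a}) = {q:.3f}")
```

Pairs that never appear in the database stay at the default of 0.0, so a sparse database can make unseen actions look better than they are. Sampling enough random trajectories helps cover every state-action pair.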

## P9: Run Q-learning 

Run your implementation of Q-learning on the warg petting game. Print out the Q values in the form 

Q(state, action) = number


## P10: Policy implied by Q-values

Write a function that extracts a policy from Q-values. 
Apply it to the Q-table obtained at P9. Print out the resulting policy in a readable way. 
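
Extracting a policy from a Q-table amounts to taking, for each state, the action with the largest Q-value. The sketch below assumes the table is a dict keyed by `(state, action)`. The helper name `policy_from_q` is hypothetical, and `Q_demo` is not the output of Q-learning: it holds the exact Q* values implied by the V* values printed in P4, used here only as sample input.

```python
def policy_from_q(Q):
    """Return {state: action with the highest Q-value in that state}."""
    best = {}  # state -> (best action so far, its Q-value)
    for (state, action), value in Q.items():
        if state not in best or value > best[state][1]:
            best[state] = (action, value)
    return {state: action for state, (action, _) in best.items()}


# Sample input: Q*(s, a) computed from the V* values in P4 (not learned values).
Q_demo = {
    ("SleepingWarg", "Strike"): -7.084,
    ("SleepingWarg", "Pet"): -6.2298,
    ("AngryWarg", "Strike"): -6.76,
    ("AngryWarg", "Pet"): -10.0,
    ("FuriousWarg", "Strike"): -6.4,
    ("FuriousWarg", "Pet"): -10.0,
    ("ApoplecticWarg", "Strike"): -6.0,
    ("ApoplecticWarg", "Pet"): -10.0,
}

policy_demo = policy_from_q(Q_demo)
for state, action in policy_demo.items():
    print(f"pi({state}) = {action}")
```

A learned Q-table will generally differ from these exact values, and the extracted policy matches the optimal one only if the estimates rank the actions in the same order.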