# Homework 5: Petting a deep Q warg

This homework builds on the same game as homework 4. 

![](figures/PetAWarg.jpg)

# How to solve this homework
The following problems you can solve either with the help of an LLM or by hand. 

* If you are solving by hand, make sure that you add sufficient comments to make sure that the code is understandable. 
* If you are solving using an LLM, add in form of comments
    * the LLM used (at the first use instance)
    * the prompt used to elicit the code
    * modifications that had to be done to the code 

For example:

```
# --- LLM used: ChatGPT 4.5
# --- LLM prompt
# Write a python class to encapsulate the least common multiple algorithm
# --- End of LLM prompt
```

The programming language should be Python.

You can reuse code from your submission for homework 4. 

# P1: Model the game as an environment in gymnasium

gymnasium (https://gymnasium.farama.org/index.html) is a fork of the OpenAI gym library. It is a library that allows you to easily build environments 

Model the PetAWarg game as an environment in gymnasium. You don't have to create visual framework: it is enought to implement the render function to print the current state. 

NOTE: If you are using a LLM, you should be able to ask it to convert your previous implementation into the implementation in gym. 


In [6]:
import gymnasium as gym
import numpy as np

class PetAWargEnv(gym.Env):

    # States
    SLEEPING = 0
    ANGRY = 1
    FURIOUS = 2
    APOPLECTIC = 3
    SAFE = 4
    SORRY = 5

    # Actions
    PET = 0
    STRIKE = 1

    def __init__(self):
        super().__init__()

        self.observation_space = gym.spaces.Discrete(6)
        self.action_space = gym.spaces.Discrete(2)

        self.render_mode = None

        self.state_names = {self.SLEEPING: "Sleeping", self.ANGRY: "Angry", self.FURIOUS: "Furious", self.APOPLECTIC: "Apoplectic", self.SAFE: "Safe", self.SORRY: "Sorry"}
        self.action_names = {self.PET: "Pet", self.STRIKE: "Strike"}

        self.state = None
        self.score = None
        self.steps = 0

    def reset(self, seed = None, options = None):
        super().reset(seed=seed)

        self.state = self.SLEEPING
        self.score = 0
        self.steps = 0

        return self.state, self.score
    
    def step(self, action):
        if self.state is None:
            raise RuntimeError("Environment not initialized. Call reset() first.")
        
        self.steps += 1
        reward = 0
        terminated = False
        truncated = False
        
        # State transitions based on current state and action
        if self.state == self.SLEEPING:
            if action == self.PET:
                # Pet with p=0.05 -> Safe, with p=0.95 -> Angry
                if self.np_random.random() < 0.05:
                    self.state = self.SAFE
                    self.score = 10
                else:
                    self.state = self.ANGRY
            elif action == self.STRIKE:
                # Strike with p=1.0 -> Angry
                self.state = self.ANGRY
                
        elif self.state == self.ANGRY:
            if action == self.PET:
                # Pet with p=1.0 -> Sorry
                self.state = self.SORRY
                self.score = -10
            elif action == self.STRIKE:
                # Strike with p=1.0 -> Furious
                self.state = self.FURIOUS
                
        elif self.state == self.FURIOUS:
            if action == self.PET:
                # Pet with p=1.0 -> Sorry
                self.state = self.SORRY
                self.score = -10
            elif action == self.STRIKE:
                # Strike with p=1.0 -> Apoplectic
                self.state = self.APOPLECTIC
                
        elif self.state == self.APOPLECTIC:
            if action == self.PET:
                # Pet with p=1.0 -> Sorry
                self.state = self.SORRY
                self.score = -10
            elif action == self.STRIKE:
                # Strike with p=0.2 -> Safe, with p=0.8 -> Sorry
                if self.np_random.random() < 0.2:
                    self.state = self.SAFE
                    self.score = 10
                else:
                    self.state = self.SORRY
                    self.score = -10
                    
        elif self.state == self.SAFE:
            # Terminal state - should not be here
            terminated = True
            
        elif self.state == self.SORRY:
            # Terminal state - should not be here
            terminated = True
        
        # Set reward based on score
        reward = self.score
        
        # Episode terminates if we reach Safe or Sorry states
        if self.state == self.SAFE or self.state == self.SORRY:
            terminated = True
        
    
        self.render()
        
        info = {
            "score": self.score,
            "steps": self.steps,
            "state_name": self.state_names[self.state]
        }
        
        return self.state, reward, terminated, truncated, info
    
    def render(self):
        state_name = self.state_names[self.state]
        print(f"\nStep: {self.steps}")
        print(f"Current State: {state_name}")
        print(f"Current Score\n: {self.score:+d}")

    


# P2: Pet, strike, pet, strike, pet

Using the environment class implemented above, create an instance of the environment. Print out its state (by calling render()). 

Then, perform the actions: pet, strike, pet, strike, pet. After each action, print out the state.  

In [None]:
# Create environment
env = PetAWargEnv()

# Reset environment
state, info = env.reset(seed=42)

print("\nStarting PetAWarg Game!")
print("Actions: 0=Pet, 1=Strike")

# Run a sample episode with random actions
terminated = False
truncated = False
total_reward = 0

actions = {0,1,0,1,0}

print(f"Initial State: {env.state_names[env.state]}")
for action in actions:
    action_name = env.action_names[action]
    print(f"\nCurrent State: {env.state_names[env.state]}")
    print(f"Action taken: {action_name}")
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated:
        print("Episode terminated!")
        print(f"Final State: {info['state_name']}")
        print(f"Total Steps: {info['steps']}")
        print(f"Total Reward: {total_reward:+d}")

env.close()


Starting PetAWarg Game!
Actions: 0=Pet, 1=Strike

Current State: Sleeping
Action taken: Pet

Step: 1
Current State: Angry
Current Score
: +0

Current State: Angry
Action taken: Strike

Step: 2
Current State: Furious
Current Score
: +0


# P3: DQN

Install the stable_baselines3 library. Using the DQN implementation from that library, train an MlpPolicy policy for playing the PetAWarg game. 

https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html

# P4: Print out the policy learned by DQN

Print out the policy learned by DQN in the previous step. You can assume that the policy is deterministic. In this case, the policy can be printed out by iterating over all the states and printing out the action generated by the policy. 