# Exercise 3, SnaQe

## Lab Instructions
Your answers for **tasks a - f** should be written **in this notebook**.
Your answers for **tasks g - i** should be written **in your solution
 pdf file**.

You shouldn't need to write or modify any other files.

**You should execute every block of code to not miss any dependency.**

This exercise was developed by Philipp Dahlinger for the KIT Cognitive Systems Lecture, July 2022. The pygame implementation of the snake environment was adapted from this source: [GitHub](https://github.com/jl4r1991/SnakeQlearning)

Exercise 3, a-f:

This Jupyter notebook offers a framework for approximate Q-learning with linear function approximation in the popular Snake game. Read the instructions carefully and complete the unfinished functions. Afterwards, you can run the training procedure and verify your implementation.

Detailed instructions:

0. You may need to install pygame to run this notebook. Either install it globally by opening a terminal and typing "pip3 install -U pygame", or create a virtual environment (venv) and install pygame there. If you use a virtual environment, you may need to create a new kernel for the notebook. Detailed instructions for that can be found here: [StackOverflow](https://stackoverflow.com/questions/33496350/execute-python-script-within-jupyter-notebook-using-a-specific-virtualenv).

1. In script "snake.py", you can find the definition of the "Snake" class.
2. Keywords and rules of the game:
    - **state**: A state is given by the positions of the snake elements on the screen (width: 30, height: 20) and by the food position. At the beginning of the episode, the game is initialized with a snake of length 1 and a random position of the food. Each time the snake head is at the same position as the food, the snake length and the score are increased by 1 and new food spawns at a random location. <br>
    - **action**: In every state, there are 4 possible actions, namely the direction where the snake head should move 1 space (up, right, down, left).  <br>
    - **termination**: When the snake moves into its tail or outside the screen, the game terminates. <br>
    - **features**: Similar to the Tetris features in the lecture slides, we extract handcrafted features of the current state from which the Q-values are computed. The 6 features in total are the following:
        - `pos_x`: If the food is right of the snake, this is 1. If the food is left of the snake, this is -1. If it has the same x-coordinate as the snake, this is 0.
        - `pos_y`: If the food is below the snake, this is 1. If the food is above the snake, this is -1. If it has the same y-coordinate as the snake, this is 0.
        - `surrounding`: Contains 4 entries. Each entry represents the space directly above, right of, below or left of the snake head. If this space is occupied (either by the snake or a wall), this entry is 1, otherwise it is 0.
    
    <br>


3. **How to represent the Q-values**:
    Given a state $s$, we always have 4 Q-values $Q(s, a) \in \mathbb{R}$, one for each action. We represent them as a four-dimensional vector and compute them with the formula
    
    $$ Q(s) = \phi(s) \beta \in \mathbb{R}^4.$$
Here, $\phi(s) \in \mathbb{R}^6$ is the feature representation of the state. The learnable weights $\beta \in \mathbb{R}^{6 \times 4}$ form a matrix in which each column is responsible for the Q-value computation of one of the 4 actions.
    


4. Other function specific instructions can be found above the unfinished
functions.
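
To make the feature definition and the Q-value formula above concrete, here is a minimal sketch. The helper `extract_features`, its arguments, the coordinate convention (y grows downward) and the order of the six entries are assumptions made for illustration; the actual extraction is done by `snake.get_feature_representation()` in "snake.py" and may differ in detail.

```python
import numpy as np

WIDTH, HEIGHT = 30, 20  # screen size from the rules above


def extract_features(head, food, body):
    """Hypothetical feature extractor; the real one lives in snake.py.

    head, food: (x, y) tuples. body: set of (x, y) cells occupied by the snake.
    Assumes y grows downward, so "below" means a larger y.
    """
    hx, hy = head
    fx, fy = food
    pos_x = int(np.sign(fx - hx))  # 1: food right, -1: food left, 0: same x
    pos_y = int(np.sign(fy - hy))  # 1: food below, -1: food above, 0: same y

    def blocked(cell):
        x, y = cell
        outside = not (0 <= x < WIDTH and 0 <= y < HEIGHT)
        return int(outside or cell in body)

    # neighbours in the order above, right, below, left
    neighbours = [(hx, hy - 1), (hx + 1, hy), (hx, hy + 1), (hx - 1, hy)]
    surrounding = [blocked(c) for c in neighbours]
    return np.array([pos_x, pos_y, *surrounding], dtype=float)


phi = extract_features(head=(0, 5), food=(3, 2), body={(0, 5), (0, 6)})
print(phi)  # entries: 1, -1, 0, 0, 1, 1

# Shape check for Q(s) = phi(s) beta with a random stand-in for the weights.
rng = np.random.default_rng(0)
beta = rng.random((6, 4))
q = phi @ beta
print(q.shape)                            # (4,)
print(np.isclose(q[2], phi @ beta[:, 2]))  # Q-value 2 only uses column 2
```

In this example the food is to the upper right of the head, the cell below is occupied by the snake, and the cell to the left lies outside the screen.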

In [None]:
# DO NOT MODIFY THIS BLOCK

# Include some python packages and snake env

import pygame
import random
from ex3.snake import Snake
import numpy as np


In [None]:
# DO NOT MODIFY THIS BLOCK

# initialization

# fixed seed for deterministic behavior:
np.random.seed(100)
random.seed(100)

# dimensions
feature_size = 6
action_size = 4


# learnable weight initialization
# these are the global weights we update during the learning
betas = np.random.rand(feature_size, action_size)

# number of transitions the agent performs Q learning
len_epoch = 20000 
# hyperparameters
temp = 100.0  
lr = 0.2 
gamma = 0.5 

# action translator. The snake environment expects string keywords, but for the RL agent it is simpler to output
# indices between 0 and 3.
actions = {
    0: "left",
    1: "up",
    2: "right",
    3: "down"
}

### a) Value Function Approximation <br>
Please finish the function **"approximate_q_values"** below. This function performs
 **"linear function approximation"** to compute the 4 Q-values $Q(s, a)$ for a state.<br>

Hint:
- You should use the parameters **"betas"**, as well as some useful numpy
functions like **"np.matmul"** or **"np.dot"**.<br> Note that the linear
function approximation case can be implemented by a scalar product.
- For the correct formula see the top instructions, bullet point 3.

In [None]:
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
def approximate_q_values(phi_s):
    """
    phi_s: Feature vector with shape (feature_size,). Extracted features from a state s.
    
    return: Vector containing the 4 Q-values Q(s,a) for every possible action a.
    """
    ########   Your code begins here   ########
    approximated_q_value = ...
    ########    Your code ends here    ########
    return approximated_q_value

### b) Exploration vs. Exploitation <br>
Please finish the functions **"softmax"** and **"select_action"** below. The function **"select_action"** samples an action (and returns its index) following the **"Soft-Max Policy"**
exploration strategy (see Slide 33 in the Reinforcement Learning chapter). Recall that this exploration strategy has a **"temperature"** value (called `temp` in our notebook) that scales the Q-values. A higher temperature makes the inputs of the Soft-Max function more similar, which yields more uniform selection probabilities. A temperature closer to 0 sharpens the probability distribution, so that the action with the highest Q-value is selected almost always. <br>

- Start by implementing the **"softmax"** function:
$$
f: \mathbb{R}^n \rightarrow \mathbb{R}^n
$$
$$
f(x)_i = \frac {\exp(x_i)}{\sum_j \exp(x_j)}.
$$
Note that the input and the output are both vectors with the same shape.

- Then, implement the **"select_action"** function. It has two arguments: the current feature representation `phi_s` and a boolean `sample`. If `sample` is true, return an action sampled from the Soft-Max policy; otherwise return the deterministic action with the highest Q-value (ignoring the temperature).


Hints:
- The softmax function can be implemented differently from the formula above for better numerical stability. In this exercise, however, you may use the simpler implementation that follows the formula.
- You can use the **"np.argmax"** function to get the index of the maximum value in a np.array.
- Use the softmax function to obtain the selection probabilities. Remember to scale the Q-values with the temperature, which is the global variable `temp`.
- Once you have the selection probabilities, you can use the function np.random.multinomial(...) for sampling. Check its documentation here: [np.random.multinomial](https://numpy.org/doc/stable/reference/random/generated/numpy.random.multinomial.html).
- Make sure that the return value is an integer and not a np.array with length 4!
- You can check your implementation of your softmax function with the test in the cell below.
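
The effect of the temperature and the conversion from a multinomial sample to an integer index can be sketched as follows. This is only an illustration: `stable_softmax` is a hypothetical name for the numerically more stable variant mentioned in the hints, the Q-values are made up, and dividing the Q-values by the temperature `t` is the common convention assumed here.

```python
import numpy as np


def stable_softmax(x):
    """Softmax that subtracts max(x) first; same result, but exp() cannot overflow."""
    z = np.exp(x - np.max(x))
    return z / np.sum(z)


qs = np.array([1.0, 2.0, 3.0, 4.0])  # made-up Q-values

# High temperature: close to uniform. Low temperature: close to greedy.
for t in (100.0, 1.0, 0.1):
    probs = stable_softmax(qs / t)
    print(t, np.round(probs, 3))

# np.random.multinomial(1, p) returns a one-hot count vector, not an index,
# so it has to be converted to a plain Python int.
np.random.seed(0)
one_hot = np.random.multinomial(1, stable_softmax(qs / 1.0))
a_idx = int(np.argmax(one_hot))
print(one_hot, a_idx)
```

The subtraction of `np.max(x)` cancels in the fraction, so the output equals the formula above while avoiding overflow for large inputs.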


In [None]:
# TODO: PLEASE FINISH THE FUNCTIONS IN THIS BLOCK

def softmax(x):
    """
    x: 1-dimensional numpy array.
    returns: softmax(x), 1-dimensional numpy array.
    """
    ########   Your code begins here   ########
    result = ...
    ########    Your code ends here    ########
    return result

def select_action(phi_s, sample=True):
    """
    phi_s: Feature vector with shape (feature_size,). Extracted features from a state s.
    sample: Boolean flag. If true, sample an action based on the Soft-Max policy, otherwise return the action with
            the highest Q-value.
    
    return: Integer action index a_idx (in [0,1,2,3]) of the selected action.
    """
    # Q-values for the current state
    qs = approximate_q_values(phi_s)
    if sample:
        ########   Your code begins here   ########
        a_idx = ...
        ########    Your code ends here    ########
    else:
        ########   Your code begins here   ########
        a_idx = ...
        ########    Your code ends here    ########
        
    return a_idx
    
    

In [None]:
### Test for softmax:
# example input
x = np.array([1.0, 2.0, 3.0, 4.0])
output = softmax(x)
print(output)
# result: [0.0320586  0.08714432 0.23688282 0.64391426]

### c) Compute Temporal Difference Error <br>
Please finish the function **"compute_delta"** below. This function
computes the Temporal Difference Error $\delta$ given in the algorithm on slide 50 of the RL chapter. However, we also have to handle terminal states, which are not shown on the slide. In case the state $s$ is terminal, we compute $\delta$ by the formula
$$
\delta = r(s,a) - Q(s,a).
$$
<br>

Hints:
- Use your implemented approximate_q_values() function.
- You can use **"np.max"** function to get the maximum value of a np.array.
- Use the globally initialized variable gamma for the discount factor
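
A small numeric example may help to see the difference between the two cases. The non-terminal formula used here, $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$, is the standard Q-learning TD error and is assumed to match the slide; all Q-values and the reward are made up.

```python
import numpy as np

gamma = 0.5  # same discount factor as in the initialization cell

q_s = np.array([0.2, 0.8, -0.3, 0.4])      # made-up Q(s, .)
q_new_s = np.array([0.1, 0.6, 2.0, -1.0])  # made-up Q(s', .)
a_idx, r = 1, 1.0

# Non-terminal case (assumed standard form): bootstrap from the best next Q-value.
delta_nonterminal = r + gamma * np.max(q_new_s) - q_s[a_idx]

# Terminal case from the text above: there is no future return to bootstrap from.
delta_terminal = r - q_s[a_idx]

print(delta_nonterminal, delta_terminal)  # about 1.2 and 0.2
```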

In [None]:
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
def compute_delta(phi_s, a_idx, r, phi_new_s, is_terminal):
    """
    phi_s: Feature vector of state s with shape (feature_size,).
    a_idx: action index a_idx (in [0,1,2,3]) of the action selected in state s.
    r: reward r(s,a) of the action with index a_idx in state s (float).
    phi_new_s: Feature vector of state new_s with shape (feature_size,). 
               new_s is the state after selecting action a_idx in state s.
    is_terminal: boolean which indicates if s is a terminal state.
    
    return: td_error delta of type float
    """
    ########   Your code begins here   ########
    ...
    ########    Your code ends here    ########
    return delta
    
    

### d) Compute the derivative of the Q-value with respect to beta <br>
Please finish the function **"compute_d_qsa_d_beta"** below. This function will
compute the derivative of the Q-value with respect to the global parameter **"betas"**. Therefore, you first need to understand what the derivative $\frac {\text{d}Q(s,a)}{\text{d} \beta}$ exactly is before implementing it.
<br>

Hints:
- In order to derive the exact formulation for the derivative, it helps to write down the formula for the Q-value 
$$
Q(s,a) = \phi(s)\beta \,[\text{a_idx}]
$$
Make sure that you understand that the single Q-Value $Q(s,a) \in \mathbb{R}$ is the entry with index `a_idx` of the vector $\phi(s) \beta \in \mathbb{R}^4$. Therefore, only the column with index `a_idx` of the matrix $\beta \in \mathbb{R}^{6 \times 4}$ has an influence on the Q-value $Q(s,a)$. 
- The shape of the derivative has to be the same as the shape of $\beta$ ($6 \times 4$). 
- You can use `np.zeros(shape)` to initialize a np array with zeros with a given `shape`.
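
One way to sanity-check an analytic derivative is to compare it with a finite-difference estimate. The sketch below uses a hypothetical helper `numerical_gradient` and random stand-ins for the features and weights; it is a verification aid, not part of the required solution.

```python
import numpy as np


def numerical_gradient(phi, a_idx, beta, eps=1e-6):
    """Central finite-difference estimate of dQ(s,a)/dbeta, entry by entry."""
    grad = np.zeros_like(beta)
    for i in range(beta.shape[0]):
        for j in range(beta.shape[1]):
            step = np.zeros_like(beta)
            step[i, j] = eps
            q_plus = (phi @ (beta + step))[a_idx]
            q_minus = (phi @ (beta - step))[a_idx]
            grad[i, j] = (q_plus - q_minus) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
phi = np.array([1.0, -1.0, 0.0, 0.0, 1.0, 1.0])  # made-up feature vector
beta = rng.random((6, 4))                         # stand-in for `betas`

grad = numerical_gradient(phi, a_idx=2, beta=beta)
print(grad.shape)         # (6, 4), same shape as beta
print(np.round(grad, 3))  # columns that do not influence Q(s, a) are zero
```

Once your own function is implemented, a comparison such as `np.allclose(compute_d_qsa_d_beta(phi, 2), grad, atol=1e-5)` should return `True`.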

In [None]:
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
def compute_d_qsa_d_beta(phi_s, a_idx):
    """
    phi_s: Feature vector of state s with shape (feature_size,).
    a_idx: action index a_idx (in [0,1,2,3]) of the action selected in state s.
    
    return: Derivative of the q_value Q(s, a) wrt. betas. It has shape (feature_size, action_size)
    """
    ########   Your code begins here   ########
    d_qsa_d_beta = ...
    ########    Your code ends here    ########
    return d_qsa_d_beta

### e) Gradient Descent <br>
Please finish the function **"update_betas"** below. This function will
update the parameters **"betas"** used in the Q-value function approximation.

Hints:
- You should use your implementation of the td_error $\delta$ and of the derivative of the Q-value function $\frac {\text{d}Q(s,a)}{\text{d} \beta}$.
- Because **"betas"** is a global variable (a numpy array), if you want to
modify its values in a local function, you should add a declaration **"global
betas"** before your modifications.
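
The role of the `global` declaration can be seen in a toy example. The names `weights`, `step_size` and `apply_update` are hypothetical stand-ins, and the update rule $\beta \leftarrow \beta + \text{lr} \cdot \delta \cdot \frac{\text{d}Q(s,a)}{\text{d}\beta}$ is the usual semi-gradient Q-learning step, assumed here to match the lecture.

```python
import numpy as np

weights = np.zeros((6, 4))  # toy stand-in for the notebook's `betas`
step_size = 0.2             # toy stand-in for `lr`


def apply_update(delta, grad):
    """Assumed semi-gradient step: weights <- weights + step_size * delta * grad."""
    global weights  # without this line, the assignment below would create a local
    weights = weights + step_size * delta * grad


grad = np.zeros((6, 4))
grad[:, 1] = [1.0, -1.0, 0.0, 0.0, 1.0, 1.0]  # made-up derivative for action 1

apply_update(delta=0.5, grad=grad)
print(weights[:, 1])  # only this column changed
```

Reading a global variable inside a function works without any declaration; only rebinding the name, as in `weights = ...`, requires `global`.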

In [None]:
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
def update_betas(phi_s, a_idx, r, phi_new_s, is_terminal):
    """
    phi_s: Feature vector of state s with shape (feature_size,).
    a_idx: action index a_idx (in [0,1,2,3]) of the action selected in state s.
    r: reward r(s,a) of the action with index a_idx in state s (float).
    phi_new_s: Feature vector of state new_s with shape (feature_size,). 
               new_s is the state after selection action a_idx in state s.
    is_terminal: boolean which indicates if s is a terminal state.
    
    return: None, but you have to change the value of the global betas
    """
    # we want to update the global variable betas, hence the "global" declaration
    global betas
    ########   Your code begins here   ########
    betas = ...
    ########   Your code ends here   ########
    

Helper Functions for evaluating and playing the game:

In [None]:
# DO NOT MODIFY THIS BLOCK

# helper function to test the current policy
def test_policy(num_games=5):
    av_score = 0
    g = 0
    snake = Snake(FRAMESPEED=50000)
    while g < num_games:
        s = snake.get_feature_representation()
        a_idx = select_action(s, sample=False)
        a = actions[a_idx]
        is_terminal = snake.step(a)
        if is_terminal:
            av_score += snake.last_score
            g += 1
    pygame.quit()

    av_score /= num_games
    print(f"Average score: {av_score}")

In [None]:
# DO NOT MODIFY THIS BLOCK

# helper function to simulate one game with normal speed
def play_single_game(framespeed=20):
    snake = Snake(FRAMESPEED=framespeed)
    while True:
        s = snake.get_feature_representation()
        a_idx = select_action(s, sample=False)
        is_terminal = snake.step(actions[a_idx], init_new_game_after_terminal=False)
        if is_terminal:
            print(f"Total Score: {snake.last_score}")
            break



### f) Q-Learning <br>
Now, every single module of the Q-learning has been prepared. It is time to
finish the training loop! Insert the missing function calls to stitch together the complete Q-Learning algorithm. 

Hints:
- Steps 3-5 are already implemented, so do not modify these parts.
- Every step can be implemented in one line by calling a specific function with the correct function arguments.
- The snake class saves the current state internally and updates this state with the snake.step() function. 
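
The training loop below also decays the temperature and the learning rate after every update. The following sketch evaluates these schedules in closed form, using the constants from this notebook; the helper names `temp_at` and `lr_at` are made up for illustration.

```python
import math

temp0, temp_decay, temp_min = 100.0, 0.999, 0.1  # values used in this notebook
lr0, lr_decay = 0.2, 0.9999
len_epoch = 20000


def temp_at(step):
    """Temperature after `step` updates, including the lower bound."""
    return max(temp0 * temp_decay ** step, temp_min)


def lr_at(step):
    """Learning rate after `step` updates."""
    return lr0 * lr_decay ** step


# first step at which the temperature reaches its floor
floor_step = math.ceil(math.log(temp_min / temp0) / math.log(temp_decay))

for step in (0, 1000, 5000, floor_step, len_epoch):
    print(step, round(temp_at(step), 3), round(lr_at(step), 4))
```

With these constants the temperature hits its lower bound of 0.1 after roughly a third of the 20000 updates, so the remaining updates are almost greedy, while the learning rate shrinks much more slowly.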

In [None]:
# TODO: PLEASE FINISH THE DESCRIBED STEPS IN THIS BLOCK

snake = Snake(FRAMESPEED=50000)

for i in range(len_epoch):
    
    ########   Your code begins here   ########
    # 1. get the feature representation of the current state by calling snake.get_feature_representation()
    phi_s = snake.get_feature_representation()
    # 2. Select the action using the soft-max policy exploration strategy
    a_idx = select_action(phi_s, sample=True)
    # 3. Get the action string which is needed for the snake environment. (already implemented)
    a = actions[a_idx]
    # 4. Ask for the current reward (already implemented)
    r = snake.get_reward(a)
    # 5. Perform one step. This returns the boolean is_terminal if the snake died during this step.
    # (already implemented)
    is_terminal = snake.step(a)
    # 6. Get the feature representation of the updated state by calling snake.get_feature_representation()
    phi_new_s = snake.get_feature_representation()
    # 7. Update betas by calling the update_betas(...) method
    update_betas(phi_s, a_idx, r, phi_new_s, is_terminal)
    
    ########   Your code ends here   ########
    
    # To see how well the current policy works we test it every 5000 updates
    if i % 5000 == 0:   
        pygame.quit()
        test_policy()
        snake = Snake(FRAMESPEED=50000)
    
    # update hyperparameters
    temp *= 0.999
    temp = max(temp, 0.1)
    lr *= 0.9999

test_policy()
pygame.quit()





In [None]:
# TODO: PLEASE FINISH THE DESCRIBED STEPS IN THIS BLOCK

snake = Snake(FRAMESPEED=50000)

for i in range(len_epoch):
    
    ########   Your code begins here   ########
    # 1. get the feature representation of the current state by calling snake.get_feature_representation()
    ...
    # 2. Select the action using the soft-max policy exploration strategy
    a_idx = ...
    # 3. Get the action string which is needed for the snake environment. (already implemented)
    a = actions[a_idx]
    # 4. Ask for the current reward (already implemented)
    r = snake.get_reward(a)
    # 5. Perform one step. This returns the boolean is_terminal if the snake died during this step.
    # (already implemented)
    is_terminal = snake.step(a)
    # 6. Get the feature representation of the updated state by calling snake.get_feature_representation()
    ...
    # 7. Update betas by calling the update_betas(...) method
    ...
    
    ########   Your code ends here   ########
    
    # To see how well the current policy works we test it every 5000 updates
    if i % 5000 == 0:   
        pygame.quit()
        test_policy()
        snake = Snake(FRAMESPEED=50000)
    
    # update hyperparameters
    temp *= 0.999
    temp = max(temp, 0.1)
    lr *= 0.9999

test_policy()
pygame.quit()





In [None]:
# Test your learned strategy. You can adjust the speed of the animation with the framespeed argument.
play_single_game(framespeed=30)

In [None]:
# if you want to close the pygame window:
pygame.quit()   