# Exercise 3, AlphaTetris, 44P(oints)

## Lab Instructions
Your answers for **tasks a - f** should be written **in this notebook**.
Your answers for **tasks g - j** should be written **in your solution
 pdf file**.

You shouldn't need to write or modify any other files.

**Execute every code block in order so that no dependency is missed.**

This exercise was developed by Ge Li for the KIT Cognitive Systems Lecture,
June 2021.

## Exercise 3, a-f:
This Jupyter notebook provides a framework for **Temporal Difference (TD) learning
 with linear function approximation** in the popular game Tetris. Read the
 instructions carefully and complete the unfinished functions. Afterwards,
 you can run the training procedure in the Tetris game to verify your
 implementation.


Detailed instructions:
0. You may need to install pygame to run this notebook, e.g. open a terminal
and type in "pip3 install -U pygame".<br>


1. In the script "tetris.py", you can find the definition of the "Tetris" class and
some helper functions for the environment.<br>


2. Keywords in the game:
    - **board**: The Tetris game board (width: 10, height: 20) is the board on
      which the game is played. At the beginning of an episode, the board is
      initialized as a blank board. As the game proceeds, each falling piece is
      merged into the board. When one or more lines in the board are completed,
      these lines disappear and the player gains some score.<br>
    - **piece**: The Tetris game piece, which has 7 different shapes (S, Z, J,
      L, I, O, T). Each piece consists of 4 boxes and can take different
      orientations by rotating. During game play, a random piece with a random
      orientation is spawned at the top middle of the board. It falls down one
      height level at a time, and the player can apply an action to it to put
      it in a desired place. Meanwhile, the player can also see the next
      incoming piece. When the current piece comes to rest (on the bottom line
      of the board or on top of previously placed pieces), it is merged into
      the board. Subsequently, the next piece appears at the top middle of the
      board.<br>
    - **action**: After a new piece is generated and before it is merged into
      the board, the player can select an action for the piece to adjust its
      placement. In contrast to the normal keyboard actions (up, down, left,
      right), the action used in this homework directly selects the final
      configuration of the piece, described as a tuple **[rot, col]**. Here,
      rot is the number of rotations to apply, and col indicates the lateral
      movement (left or right) to take. Once the action is given, it is
      decomposed into a series of key presses that execute the operation like
      a human player would.<br>
    - **terminate**: If the space at the top middle of the board is already
      occupied, so that the next piece cannot be spawned, the game
      terminates.<br>
    - **features**: In the Reinforcement Learning slides of Cognitive
      Systems, 22 features (height of each column, height difference between
      two columns, max column height, number of holes, and 1 as constant) were
      introduced. To make this homework simpler, however, a simplified feature
      vector is used, which consists of **the average height of all columns;
      the sum of absolute height differences of all columns; the number of
      holes in the board; and a constant value 1**. In the function
      "get_features" you can see how the original 22 features are generated
      from the board, and in the function "get_simplified_features" you can
      see how the simplified features are generated.<br>


3. We assume that we have a deterministic system dynamics function $s' = f(s,
a)$ for the Tetris game, which predicts the next state from the current state
   and the chosen action. We will use a combination of TD-Learning for learning
   the value function and 1-step lookahead prediction for action selection. For
   TD-Learning, we will extend the algorithm given in the lecture slides
   (page 27) to the approximate case using linear function approximation. To
   arrive at the approximate TD-Learning algorithm, follow exactly the same
   steps as for deriving the approximate Q-Learning algorithm. For action
   selection, we can compute the Q-values using the learned V-function and the
   1-step lookahead prediction, i.e.,
   $Q(s,a) = r(s,a) + \gamma V_{\beta}(f(s,a))$. The Tetris class given below
   provides all functions needed to retrieve all possible actions for the
   current state and to compute the next state given the current state and
   action. Using the computed Q-values, the action should be selected with an
   epsilon-greedy strategy.


4. You can focus on the TD learning and action selection parts; the
remaining code is provided. If you are interested in the entire Tetris game with TD
learning, see the pseudo-code below:
    - Initialize the game (Game UI, Environment etc.)<br>
    - While True (game runs forever)<br>
        - Initialize an episode (score, level, initial board, initial piece,
        timer, etc.)<br>
        - While the episode is not terminated<br>
            - If there is no piece falling down:<br>
                - next piece ->> current piece<br>
                - Get a new next piece<br>          
                - If episode is terminal:<br>
                    - print "Game Over"
                    - break<br>
                - Else:<br>
                    - Get action (rotations, and column) to place the
                    current piece (**use 1-step lookahead**)<br> 
                    - Update the Value Function (**use TD Learning**)
            - Decompose the action and get one movement to take (key = up, down, left,
    right)<br>
            - Perform the movement to current piece and compute its new
            coordinates <br>
            - Update the board (check lines completed, merge piece if possible),
    score, level etc.<br>
            - Plot board, piece, text, score, etc.<br><br>

5. Further function-specific instructions can be found above each unfinished
function.
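
To make the simplified features from item 2 more concrete, here is a minimal sketch of how they could be computed on a toy board. It is not the provided "get_simplified_features"; the function name `toy_simplified_features`, the board encoding (a 2D array with row 0 at the top and nonzero entries for occupied cells) and the use of differences between adjacent columns are all assumptions made for illustration.

```python
import numpy as np


def toy_simplified_features(board):
    """Hypothetical stand-in for the provided feature extractor.

    Assumes `board` has shape (height, width), row 0 is the top row,
    and nonzero entries mark occupied cells.
    """
    board = np.asarray(board) != 0
    height, width = board.shape

    # Column height = distance from the topmost occupied cell to the floor.
    occupied = board.any(axis=0)
    top_row = np.argmax(board, axis=0)  # first occupied row per column
    heights = np.where(occupied, height - top_row, 0)

    # Holes = empty cells below the topmost occupied cell of a column.
    holes = int(sum(np.sum(~board[top_row[c]:, c])
                    for c in range(width) if occupied[c]))

    avg_height = float(heights.mean())
    # Assumption: differences are taken between adjacent columns.
    sum_abs_diff = float(np.abs(np.diff(heights)).sum())
    return np.array([avg_height, sum_abs_diff, float(holes), 1.0])


toy_board = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
])
toy_features = toy_simplified_features(toy_board)
print(toy_features)
```

On this toy board the column heights are 3, 1, 2 and 0, and the first column contains one hole. For the exercise itself, rely on the functions in "tetris.py" rather than on this sketch.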

In [None]:
# DO NOT MODIFY THIS BLOCK

# Include some python packages and tetris env

import random
import numpy as np
from ex3.tetris import Tetris

In [None]:
# DO NOT MODIFY THIS BLOCK

# Initialize some Learning parameters as global parameters

alpha = 0.001  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 0.5  # Exploration rate
betas = np.zeros(4, dtype=float)  # Initial parameters vector

### a) Value Function Approximation <br>
Please finish the function **"approximate_value"** below. This function
 performs **"linear function approximation"** to compute the state value, i.e.
  V(s), given the features of a state. <br>

Hint:
- You should use the parameters **"betas"**, as well as some useful numpy
functions like **"np.matmul"** or **"np.dot"**.<br> Note that the linear
function approximation case can be implemented by a scalar product.
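
As a small illustration of the scalar product mentioned in the hint, the sketch below evaluates a linear value function on made-up numbers. The names `toy_betas` and `toy_features` and their values are hypothetical and unrelated to the notebook's global `betas`.

```python
import numpy as np

# Hypothetical weights and features, chosen only for illustration.
toy_betas = np.array([-0.5, -0.1, -1.0, 2.0])
toy_features = np.array([1.5, 5.0, 1.0, 1.0])  # avg height, diff sum, holes, constant

# Linear function approximation: V(s) is the scalar product of both vectors.
toy_value = float(np.dot(toy_betas, toy_features))
print(toy_value)
```

The constant feature 1 lets the last weight act as a bias term.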

In [None]:
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK

def approximate_value(features):
    """
    Given state features, calculate the approximated state value
    :param features: features vector
    :return: approximated state value V(s)
    """
    # Assert that the parameters and features have the same dimension
    assert features.shape == betas.shape, "Shapes do not match"

    ########   Your code begins here   ########

    ########    Your code ends here    ########

    return approximated_value

### b) Compute State Action Value Function <br>
Please finish the function **"q_s_a"** below. This function computes the
state action value function Q(s,a).<br>
In general, $Q(s,a) = r(s,a) + \gamma  E[V(s')]$. However, in the Tetris game,
the expectation can be dropped because the game is assumed to be deterministic.
The function receives the reward for applying action $a$ and the next board
configuration $s'$ as input. You should compute
$Q(s,a) = r(s,a) + \gamma V(s')$.

Hint:
- You can call function **"Tetris.get_simplified_features"** to get the
simplified features of a Tetris board.<br>
- You can call your implemented function **"approximate_value"** above to
approximate the value of the features.

In [None]:
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK

def q_s_a(reward, next_board):
    """
    Compute the state action value function Q(s,a)
    :param reward: immediate reward
    :param next_board: next Tetris board after the action has been performed
    :return: state action value Q(s,a)
    """
    ########   Your code begins here   ########

    ########    Your code ends here    ########

    return qsa


### c) Exploration vs. Exploitation <br>
Please finish the function **"sample_action"** below. This function samples
an action (returns its index) following the **$\epsilon$-greedy**
exploration strategy.<br>
Recalling the Cognitive Systems lecture, your implementation should behave as follows:

- With probability **"epsilon"**, the action is sampled randomly from
 all possible actions. This is Exploration.
- Otherwise, take the action with the highest action value, i.e. Q(s,a). This is
Exploitation.

Hints:
- You can use **"max"** function to get the max value in a list.
- You can use **"index"** function to get the index of a value in a list.
- You can use **"random.random"** function to generate a random float value
in the range of [0.0, 1.0). Then you can compare this number with the
**"epsilon"** to decide whether to randomly sample one action from the action
list, or just exploit the best action.
- You can use **"random.randint"** function to generate a random integer as the
 index of the sampled action. Note that both endpoints of
 **"random.randint(a, b)"** are inclusive.
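
The hints above can be sketched on a toy list of action values as follows. This is only one possible arrangement; the function name `toy_epsilon_greedy` and its `rng` parameter are hypothetical and not part of the notebook.

```python
import random


def toy_epsilon_greedy(values, eps, rng=random):
    """Return an index into `values` using epsilon-greedy selection."""
    if rng.random() < eps:
        # Explore: randint includes both endpoints, hence len(values) - 1.
        return rng.randint(0, len(values) - 1)
    # Exploit: index of the first maximal value.
    return values.index(max(values))


toy_values = [0.2, 1.5, -0.3]
greedy_index = toy_epsilon_greedy(toy_values, eps=0.0)
random_index = toy_epsilon_greedy(toy_values, eps=1.0, rng=random.Random(0))
print(greedy_index, random_index)
```

With `eps=0.0` the sketch always exploits, and with `eps=1.0` it always explores, which is a convenient way to check both branches.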

In [None]:
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK

def sample_action(action_value_list):
    """
    Epsilon-greedy exploration
    :param action_value_list: list of action values Q(s,a)
    :return: the index of sampled action in the action list
    """

    ########   Your code begins here   ########

    ########    Your code ends here    ########

    return sampled_action_index

### d) Compute Temporal Difference Error <br>
Please finish the function **"compute_td_error"** below. This function
computes the Temporal Difference Error of the value function V(s).
Depending on whether the next board is a terminal board (state), the
td_error must be computed in different ways, as shown in the
Cognitive Systems lecture.<br>

Hints:
- You can use **"Tetris.get_simplified_features"** function to get the
simplified features of a board.<br>
- You can use your implemented function **"approximate_value"** above to
compute the state value.<br>
- You can call **"Tetris.check_terminal_board"** to check if the next board
(state) is terminated or not.<br>
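
The following sketch shows the two cases on toy numbers. It assumes the common convention that a terminal state has value 0, so the bootstrap term is dropped there; please check this against the lecture slides. All names (`toy_td_error`, `w`, `phi`, `phi_next`) are hypothetical, and feature vectors are passed in directly instead of boards.

```python
import numpy as np


def toy_td_error(weights, feats, reward, next_feats, terminal, discount=0.9):
    """TD error for a linear value function on precomputed feature vectors."""
    value = float(np.dot(weights, feats))
    # Assumed convention: no future return after a terminal state.
    next_value = 0.0 if terminal else float(np.dot(weights, next_feats))
    return reward + discount * next_value - value


w = np.array([1.0, 0.5])
phi, phi_next = np.array([2.0, 1.0]), np.array([1.0, 1.0])

delta_running = toy_td_error(w, phi, 1.0, phi_next, terminal=False)
delta_terminal = toy_td_error(w, phi, 1.0, phi_next, terminal=True)
print(delta_running, delta_terminal)
```

Here V(s) = 2.5 and V(s') = 1.5, so the two errors differ exactly by the discounted next value.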

In [None]:
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK

def compute_td_error(board, reward, next_board, next_piece):
    """
    Compute the TD error of the state value
    :param board: Tetris board
    :param reward: immediate reward
    :param next_board: next Tetris board
    :param next_piece: next Tetris piece
    :return: td error
    """
    ########   Your code begins here   ########

    ########    Your code ends here    ########

    return td_error

### e) Gradient Descent <br>
Please finish the function **"gradient_descent"** below. This function will
update the parameters **"betas"** used in the value function approximation.

Hints:
- You should compute the gradient in the case of linear function approximation
 (see your implementation in function **"approximate_value"**) by hand first;
  then you will know what you should implement in the code.
- Because **"betas"** is a global variable (numpy vector), if you want to
modify its values in a local function, you should add a declaration **"global
betas"** before your modifications.
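
To see the effect of such an update in isolation, the toy loop below repeatedly applies a gradient step for a linear value function towards a fixed target. The fixed `target` replaces the usual $r + \gamma V(s')$ term, and `step_size`, `weights` and `feats` are hypothetical names and values, so this is a sketch of the idea rather than the required implementation.

```python
import numpy as np

step_size = 0.1                  # hypothetical learning rate for the toy
weights = np.zeros(2)
feats = np.array([1.0, 2.0])
target = 3.0                     # fixed toy target instead of r + gamma * V(s')

errors = []
for _ in range(50):
    td_err = target - float(np.dot(weights, feats))
    errors.append(abs(td_err))
    # For V(s) = weights . feats, the gradient w.r.t. weights is feats itself.
    weights = weights + step_size * td_err * feats

print(errors[0], errors[-1])
```

In this toy setting the error is halved in every step, because `step_size` times the squared norm of `feats` equals 0.5. In the notebook, the update has to modify the global `betas` instead of a local variable.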

In [None]:
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK
# TODO: PLEASE FINISH THE FUNCTION IN THIS BLOCK

def gradient_descent(board, td_error):
    """
    Perform the gradient descent
    :param board: Tetris board
    :param td_error: td_error
    :return: None
    """
    ########   Your code begins here   ########

    ########    Your code ends here    ########

In [None]:
# DO NOT MODIFY THIS BLOCK

def update_hyperparameters():
    """
    Update learning and exploration rate
    :return: None
    """
    # Update exploration rate
    global epsilon
    if epsilon > 0.01:
        epsilon *= 0.99
    else:
        epsilon = 0.01

    # Update learning rate
    global alpha
    if alpha > 1e-5:
        alpha *= 0.999
    else:
        alpha = 1e-5

### f) TD Learning <br>
Now, every single module of the TD learning has been prepared. It is time to
finish the functions **"td_learning_action_phase"** and
**"td_learning_update_phase"**! The first function returns an action (a tuple
of [rot, col]) to place the current piece, while the second function updates
the weights of the value function approximation using the temporal difference error.

Workflow and Hints:
For **"td_learning_action_phase"**:
1. You need to call **"Tetris.get_possible_action"** function to get all
possible actions (given current board, current piece and next piece), together
with their immediate rewards and simulated next boards. <br>
2. With all the possible actions in hand, compute the action values by
calling function **"q_s_a"**. <br>
3. Sample one action (get its index) from all possible actions, by calling
function **"sample_action"**. <br>
4. Use this index to get this action from the list and return it back to the
environment.

For **"td_learning_update_phase"**:
1. From the environment, get the result after applying this action and compute
the td_error. <br>
2. Perform gradient descent to update parameters. <br>
3. Call function **"update_hyperparameters"** to update the learning rate and
 exploration rate. <br>
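
The action phase can be pictured on made-up data as below. The structure of `candidates` (tuples of action, immediate reward and next-board features) is purely hypothetical and does not reflect the actual return format of "Tetris.get_possible_action"; exploration is left out so that only the 1-step lookahead is shown.

```python
import numpy as np

toy_gamma = 0.9
toy_betas = np.array([-1.0, 0.5])  # hypothetical weights

# Hypothetical candidates: (action, immediate reward, next-board features).
candidates = [
    ((0, 3), 0.0, np.array([2.0, 1.0])),
    ((1, 5), 1.0, np.array([1.0, 1.0])),
    ((2, 0), 0.0, np.array([0.5, 1.0])),
]

# 1-step lookahead: Q(s,a) = r(s,a) + gamma * V(f(s,a)) for every candidate.
q_values = [r + toy_gamma * float(np.dot(toy_betas, f)) for _, r, f in candidates]

# Greedy choice only; the exercise additionally requires epsilon-greedy sampling.
best_index = q_values.index(max(q_values))
best_action = candidates[best_index][0]
print(q_values, best_action)
```

The update phase then uses the observed reward and next board for the chosen action to compute the td_error and adjust the weights.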

In [None]:
# TODO: PLEASE FINISH THE TWO FUNCTIONS IN THIS BLOCK
# TODO: PLEASE FINISH THE TWO FUNCTIONS IN THIS BLOCK
# TODO: PLEASE FINISH THE TWO FUNCTIONS IN THIS BLOCK

def td_learning_action_phase(board, piece, next_piece):
    """
    Perform td learning, to get an action for the piece
    :param board: current Tetris board
    :param piece: current Tetris piece
    :param next_piece: next Tetris piece
    :return: action (rotation and lateral move)
    """
    ########   Your code begins here   ########

    ########    Your code ends here    ########
    return action

def td_learning_update_phase(board, reward, next_board, next_piece):
    """
    Perform td learning, to update the weights used in value function
    approximation
    :param board: board before applying the action
    :param reward: reward after applying the action
    :param next_board: board after applying the action
    :param next_piece: the piece for next_board
    :return: None
    """

    ########   Your code begins here   ########

    ########    Your code ends here    ########

In [None]:
# DO NOT MODIFY THIS BLOCK

def play_tetris():
    """
    Main function to play Tetris, are you a Master in Tetris ???
    :return: None
    """
    Tetris.show_text_screen('Alpha-Tetris')

    # Game loop for episodes
    while True:
        # You may increase the fps (Frame per second) to speed up your training
        # But some pygame UI may behave weird, if fps is too high.
        Tetris.run_episode(td_learning_action_phase,
                           td_learning_update_phase,
                           fps=30)
        Tetris.show_text_screen('Game Over')

In [None]:
# DO NOT MODIFY THIS BLOCK

# Start playing Tetris
play_tetris()

### Task g) to task j) should be answered in the solution pdf.