# Hochschule Bonn-Rhein-Sieg

# Machine Learning

# Reinforcement Learning Assignment

Before you start working on this assignment, please:
* set up the [OpenAI gym library](https://gym.openai.com), which is a Python library that defines various benchmarking environments in which reinforcement learning agents can be trained. Installation instructions for the library can be found [here](https://gym.openai.com/docs/#installation)
* go through the ["Getting Started With Gym" tutorial](https://gym.openai.com/docs/#getting-started-with-gym) so that you can use the library effectively during the assignment

In [None]:
import numpy as np
import gym

## Q-Learning

Your task in this assignment is, in principle, rather simple: You need to implement the Q-learning algorithm for solving the cart-pole problem that we discussed in the reinforcement learning lecture.

As you might remember from the lecture, the observation space of the cart-pole system consists of:
1. the cart's $x$ position on the track
2. the cart's linear velocity $\dot{x}$ along the track
3. the pole's angle $\theta$ with respect to the vertical axis
4. the pole's angular velocity $\dot{\theta}$

The action space of the system is discrete: at each step, the cart can be pushed either to the left or to the right with a constant force.

The cart-pole environment provided in OpenAI gym is episodic, such that an episode ends if (i) the pole falls beyond a certain angle or the cart leaves the track (the agent has failed in this case) or (ii) the pole doesn't fall for 200 consecutive steps (the agent has succeeded in this case).

### Q-Function Learning [80 points]

Your main task is to implement the function `train_policy` below, which takes the cart-pole environment provided in OpenAI gym and learns the values of the Q-function $Q(s,a)$ so that the cart-pole problem can be solved.
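
As a reminder, the standard tabular Q-learning update after observing a transition $(s, a, r, s')$ is

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

where the learning rate $\alpha$ and the discount factor $\gamma$ are symbols assumed here (the lecture notation may differ). A minimal sketch of a single update step could look as follows; the helper name `q_update`, its default parameter values, and the table shape are illustrative assumptions, and states are assumed to already be tuples of integer indices.

```python
import numpy as np


def q_update(Q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error.

    Hypothetical helper: alpha and gamma defaults are illustrative, not tuned.
    """
    # Do not bootstrap from the next state once the episode has terminated.
    target = reward if done else reward + gamma * np.max(Q[next_state])
    td_error = target - Q[state + (action,)]
    Q[state + (action,)] += alpha * td_error
    return td_error


# Toy table with two discretised state variables and two actions (assumed shape).
Q = np.zeros((6, 12, 2))
td = q_update(Q, state=(3, 5), action=1, reward=1.0, next_state=(3, 6), done=False)
```

This is only one building block; how it is embedded in the episode loop of `train_policy` is up to you.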

#### State discretisation

We specifically want to implement a discrete version of Q-learning; that is, the Q-function $Q(s,a)$ should be learned for a discretised version of the state space. This means that you need to choose a discretisation of the state space variables so that you can then learn a tabular version of $Q(s,a)$ for all possible combinations of $s$ and $a$.

You don't need to use all four state variables in your Q-function if some don't seem to contribute to the performance of the agent. This is something you will need to experiment with in your implementation.
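
One possible way to discretise an observation is to clip each variable to a fixed range and assign it to one of a small number of bins. The sketch below illustrates this idea with `np.digitize`; the ranges in `STATE_BOUNDS`, the bin counts in `BINS_PER_VARIABLE`, and the helper name `discretise` are assumptions made for illustration and are not prescribed by the assignment.

```python
import numpy as np

# Assumed clipping ranges for (x, x_dot, theta, theta_dot); illustrative only.
STATE_BOUNDS = np.array([
    [-2.4, 2.4],
    [-3.0, 3.0],
    [-0.21, 0.21],
    [-3.5, 3.5],
])
# Assumed number of bins per state variable; this is a design choice to tune.
BINS_PER_VARIABLE = (3, 3, 6, 6)


def discretise(observation, bounds=STATE_BOUNDS, bins=BINS_PER_VARIABLE):
    """Map a continuous observation to a tuple of bin indices."""
    indices = []
    for value, (low, high), n_bins in zip(observation, bounds, bins):
        # Interior bin edges; values outside [low, high] land in the outermost bins.
        edges = np.linspace(low, high, n_bins + 1)[1:-1]
        indices.append(int(np.digitize(value, edges)))
    return tuple(indices)


# One Q-value per (discrete state, action) pair; two actions as in cart-pole.
Q = np.zeros(BINS_PER_VARIABLE + (2,))
state = discretise([0.0, 0.5, 0.05, -1.0])
q_values_for_state = Q[state]  # array holding one value per action
```

Because the result is a tuple of integers, it can be used directly as an index into a multidimensional NumPy array.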

#### Action selection during training

How you select actions during training is another aspect that you need to choose and experiment with. Remember that Q-learning is an off-policy learning algorithm: the policy used to collect experience during training does not have to be the greedy policy that is being learned, so the agent doesn't always have to choose the best action according to the current Q-values. A commonly used strategy is an $\epsilon$-greedy policy, according to which a greedy action (the one with the highest Q-value) is selected most of the time, but a random action is chosen with probability $\epsilon$, where $\epsilon$ starts large at the beginning and decreases according to some schedule during the training process.
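
The following sketch shows one way such an $\epsilon$-greedy selection with a decaying exploration rate might be written. The exponential schedule, its parameter values, and the helper names `epsilon_for_episode` and `select_action` are assumptions chosen for illustration; other schedules (for example linear decay) are equally valid.

```python
import numpy as np


def epsilon_for_episode(episode, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Exponentially decaying exploration rate, floored at eps_min (assumed schedule)."""
    return max(eps_min, eps_start * decay ** episode)


def select_action(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        # The last axis of Q is assumed to index the actions.
        return int(rng.integers(Q.shape[-1]))
    return int(np.argmax(Q[state]))


rng = np.random.default_rng(0)
Q = np.zeros((6, 12, 2))       # toy table: two state variables, two actions
Q[3, 5, 1] = 1.0               # make action 1 the greedy choice in state (3, 5)
greedy_action = select_action(Q, (3, 5), epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the selection is purely greedy, which is also the behaviour you would typically want when evaluating the learned Q-values.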

In [None]:
def train_policy(env, learning_episodes: int) -> np.ndarray:
    """Trains a policy for the given environment using tabular Q-learning.

    Keyword arguments:
    env -- OpenAI gym environment
    learning_episodes: int -- how many episodes to train the agent for

    Returns:
    Q -- a multidimensional numpy array representing the Q function Q(s,a)

    """
    Q = np.zeros(0)

    ### BEGIN SOLUTION
    ### END SOLUTION

    return Q

Please run the following cell to execute your Q-function implementation. The learned Q-values will be used below for testing the agent. Note that learning may take several minutes. You are allowed to change the number of learning episodes to fit your implementation.

In [None]:
env = gym.make('CartPole-v0')
learning_episodes = 300000

print('Training agent for {0} episodes...'.format(learning_episodes))
Q = train_policy(env, learning_episodes)
print('Training completed')

### Agent Testing [20 points]

Now that you have learned a Q-function for the cart-pole problem, you need to test your implementation. For this, please implement the function `test_policy`, which runs the agent for a given number of test episodes using the learned Q-values and returns the average return over the test episodes.

You can consider the cart-pole problem solved if your average return is $\geq 195$, which means that your agent is able to survive until the end of each episode most of the time.
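
To make the success criterion concrete: the return of an episode is the sum of the rewards collected in it, and the cart-pole environment gives a reward of 1 for every step the episode continues. The sketch below averages returns over a few made-up episodes and checks them against the threshold of 195; the helper name `average_return` and the example episode lengths are hypothetical and only serve as an illustration.

```python
def average_return(episode_rewards):
    """Average the per-episode returns, where each return is a sum of rewards."""
    returns = [sum(rewards) for rewards in episode_rewards]
    return sum(returns) / len(returns)


# Three made-up episodes lasting 200, 200 and 180 steps with a reward of 1 per step.
episode_rewards = [[1.0] * 200, [1.0] * 200, [1.0] * 180]
avg = average_return(episode_rewards)
solved = avg >= 195
```

In this made-up example a single short episode is enough to pull the average below the threshold.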

If you notice that the average return is lower than expected, there might be different issues with the implementation, for instance:
* the number of training episodes might be too low
* the action selection during training may be inappropriate (try experimenting with different schedules for your epsilon-greedy policy)
* your state space discretisation may be inappropriate (also, the more fine-grained the discretisation is, the more training episodes you will need)

You may need to experiment with all of these until you successfully train your agent.
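
To get a feeling for why a finer discretisation needs more training, it can help to count how many Q-values have to be learned. The sketch below does this for a few bin configurations; the helper name `q_table_size` and the bin counts are illustrative assumptions.

```python
import math


def q_table_size(bins_per_variable, n_actions=2):
    """Number of Q-values in a table with the given bins per state variable."""
    return math.prod(bins_per_variable) * n_actions


coarse = q_table_size((6, 12))            # only two state variables used
medium = q_table_size((3, 3, 6, 6))       # all four variables, few bins
fine = q_table_size((10, 10, 10, 10))     # all four variables, ten bins each
```

Each entry has to be visited often enough for its value to converge, so the number of entries gives a rough sense of how the required amount of experience grows.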

Good luck!

In [None]:
def test_policy(env, Q: np.ndarray, test_episodes: int) -> float:
    """Tests a policy extracted from a learned Q-function on the given environment.
    The policy is tested for a given number of episodes.

    Keyword arguments:
    env -- OpenAI gym environment
    Q: np.ndarray -- a multidimensional numpy array representing the Q function Q(s,a)
    test_episodes: int -- number of episodes for testing the learned policy

    Returns:
    avg_return: float -- the average return over the test episodes

    """
    avg_return = 0.

    ### BEGIN SOLUTION
    ### END SOLUTION

    return avg_return

In [None]:
test_episodes = 100

print('Testing policy for {0} episodes...'.format(test_episodes))
avg_return = test_policy(env, Q, test_episodes)
print('Average test return =', avg_return)