---
# Markov Decision Process (MDP) and Reinforcement Learning (RL) Exercises
---

This notebook covers the basics of Markov Decision Processes (MDPs) and Reinforcement Learning (RL) with exercises.

We will be using the GridWorld problem, in which an agent must find its way out of a maze.

<img src="images/gridworld.png" width="500px">

We will also be using the Pacman problem, in which the agent (pacman) must collect a number of pellets whilst avoiding ghosts.

<img src="images/pacman.png" width="500px">

---
## Contents
---

This notebook contains the following sections:

- [1. Setup](#1-setup), which covers creating a Python virtual environment.
- [2. Introduction](#2-introduction), which introduces the source code.
- [3. Creating and Running Debug Configurations](#3-creating-and-running-debug-configurations), which introduces creating and running debug configurations.
- [4. Value Iteration](#4-value-iteration), which covers the value iteration algorithm.
  - [4.1 Value Iteration Agent](#41-value-iteration-agent), which covers creating a value iteration agent and testing it on GridWorld.
  - [4.2 Bridge Grid](#42-bridge-grid), which tests the value iteration agent on the BridgeGrid problem.
  - [4.3 Discount Grid](#43-discount-grid), which tests the value iteration agent on the DiscountGrid problem.
- [5. Q-Learning (Reinforcement Learning)](#5-q-learning-reinforcement-learning), which covers the q-learning algorithm.
  - [5.1 Q-Learning Agent](#51-q-learning-agent), which covers creating a q-learning agent and testing it on the GridWorld and Crawler problems.
  - [5.2 Bridge Grid](#52-bridge-grid), which tests the q-learning agent on the BridgeGrid problem.
- [6. Approximative Q-Learning](#6-approximative-q-learning), which covers the approximative q-learning algorithm.
  - [6.1 Approximative Q-Learning Agent](#61-approximative-q-learning-agent), which covers creating an approximative q-learning agent with a weighted linear function and testing it on the Pacman problem.

---
## 1. Setup
---

- Create a Python Virtual Environment
  - Open the built-in terminal with `Ctrl + J`.
  - Run the following commands in the terminal, one by one:

    ```bash
    conda create -y -p ./.conda python=3.6
    conda activate ./.conda
    python -m pip install --upgrade pip
    pip install ipykernel jupyter pylint
    ```
- In the notebook's top right corner, choose:

  `Select Kernel` $\rightarrow$ `Python Environments` $\rightarrow$ `.conda (Python 3.6.13)`.

---
## 2. Introduction
---

This workshop deals with *Markov Decision Processes (MDP)* and *Reinforcement Learning (RL)*, where the `value iteration` and `Q-learning` algorithms will be implemented. Agents that implement these algorithms will be applied to the `Gridworld` problem and the `Pacman` game. In addition, the `Q-learning` agent will be tested on the `Crawler` (a robot controller).

The following files and folders are included under the `src` folder:

- `layouts` is a folder that contains a number of `.lay` files (layouts/maps/levels for the Pacman game).
- `crawler.py` contains code for the *crawler* (robot controller).
- `environment.py` contains an abstract class for reinforcement learning environments.
- `featureExtractors.py` contains classes for extracting (state,action) features.
- `game.py` contains the logic for the Pacman game.
- `ghostAgents.py` contains agents for controlling the ghosts.
- `graphicsCrawlerDisplay.py` contains graphics code for the *crawler*.
- `graphicsDisplay.py` contains graphics code for Pacman.
- `graphicsGridworldDisplay.py` contains graphics code for Gridworld.
- `graphicsUtils.py` contains graphics utilities.
- `gridworld.py` contains the implementation for Gridworld.
- `keyboardAgents.py` contains the keyboard interface for controlling Pacman manually.
- `layout.py` contains code for reading layout files and storing their contents.
- `learningAgents.py` contains the base classes `ValueEstimationAgent` and `QLearningAgent` for learning agents.
- `mdp.py` contains general MDP methods.
- `pacman.py` is the main file that runs the Pacman game.
- `pacmanAgents.py` contains some simple sample agents for Pacman.
- `qlearningAgents.py` contains Q-learning agents for Gridworld, Crawler and Pacman.
- `textDisplay.py` contains ASCII graphics for Pacman.
- `textGridworldDisplay.py` contains ASCII graphics for Gridworld.
- `util.py` contains utility classes, such as the `Counter` class for Q-learning.
- `valueIterationAgents.py` contains a *value iteration* agent for solving known MDPs.

The only files that need to be modified in this workshop are `valueIterationAgents.py` and `qlearningAgents.py`. The files `featureExtractors.py`, `gridworld.py`, `learningAgents.py`, `mdp.py` and `util.py` can also be good to skim through to get a better understanding of how the application works.

**Note!** Solutions are included in the `solution` folder.

---
## 3. Creating and Running Debug Configurations
---

- Create a debug configuration:
  - Open the `Run and Debug` view with `Ctrl + Shift + D`, and click the link `create a launch.json file`.

    <img src="images/run_and_debug_view.png">

  - Choose `Python Debugger` in the command palette.

    <img src="images/python_debugger.png">

  - Then choose `Python File`.
    
    <img src="images/python_file.png">

  - Modify and save the `launch.json` file with the following contents:
  
    ```json
    {
        "version": "0.2.0",
        "configurations": [
            {
                "name": "Task 1",
                "type": "debugpy",
                "request": "launch",
                "program": "${workspaceFolder}/src/gridworld.py",
                "args": ["-m"],
                "console": "integratedTerminal"
            }
        ]
    }
    ```
  - Make sure `Task 1` is selected in the dropdown list.

    <img src="images/task1.png">

  - Then click the green arrow to run the `GridWorld` problem in *interactive* mode.
  
    The Gridworld application is now running in interactive mode, where you can use the arrow keys to control the agent (a blue dot).
    
    Just like in the Pacman board, every position in the Gridworld board has coordinates (x, y) with the origin (0,0) in the bottom-left corner.
    
    Note that there is a 20% chance that the agent will move in a different direction than the one intended.
    
    Additionally, the two states (1 and -1) at the top right of the layout are special in the sense that when the agent reaches one of these squares, there is only one valid action (in the next time step) which sends the agent to a virtual state outside of the board. This state is the terminal state.
    
    Also note that with the above Python configuration, *discountRate* will have the value 0.9 (which can be changed with the flag `-d`), and *livingReward* will have the value 0 (which can be changed with the flag `-r`).
    
    Close the window to exit the Gridworld application.

    <img src="images/task1_running.png">
  
- Add a second debug configuration:
  - Open the `Run and Debug` view with `Ctrl + Shift + D`.
  - Click the *cogwheel* icon to the right of the dropdown list.
    
    <img src="images/task1.png">

  - Click the `Add Configuration` button.

    <img src="images/add_configuration.png">

  - Choose `Python Debugger` $\rightarrow$ `Python File`.
  - Modify and save the `launch.json` file with the following contents:
  
    ```json
    {
        "version": "0.2.0",
        "configurations": [
            {
                "name": "Task 2",
                "type": "debugpy",
                "request": "launch",
                "program": "${workspaceFolder}/src/gridworld.py",
                "args": ["-g", "MazeGrid"],
                "console": "integratedTerminal"
            },
            {
                "name": "Task 1",
                "type": "debugpy",
                "request": "launch",
                "program": "${workspaceFolder}/src/gridworld.py",
                "args": ["-m"],
                "console": "integratedTerminal"
            }
        ]
    }
    ```
  - Make sure `Task 2` is selected in the dropdown list next to the green arrow.
    
    <img src="images/task2.png">

  - Click the green arrow to run the `GridWorld` problem using the `MazeGrid` *layout* with a *random* agent.
    
    The new configuration ran `gridworld.py` with the following flag: `-g MazeGrid`

    This flag ensures that the layout named `MazeGrid` (the `mazeGrid()` method in the file `gridworld.py`) is loaded.
    
    In addition, a default agent will be used, which moves randomly through the layout until one of the terminal states is reached.
    
    <img src="images/task2_running.png">

- In the tasks below, agents that solve MDPs and learning agents will be created.
  - For each subtask, a new configuration will be created using various flags.
  - The files `gridworld.py` and `pacman.py` can accept several flags, where each flag can be specified either with two dashes followed by the full name of the flag, or with one dash followed by a single-letter abbreviation of that flag. For example, the following two lines are equivalent:

    ```bash
    --grid=MazeGrid
    -g MazeGrid
    ```
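
The usage text printed by `gridworld.py -h` has the shape that Python's standard `optparse` module produces, so the equivalence of the long and short forms can be illustrated with a small stand-alone parser. The sketch below is a toy under that assumption: the parser and its single `--grid` option are re-created here for illustration and are not copied from `gridworld.py`.

```python
from optparse import OptionParser

# Hypothetical stand-alone parser with one option, mirroring the -g/--grid flag.
parser = OptionParser()
parser.add_option("-g", "--grid", action="store", type="string",
                  dest="grid", default="BookGrid", metavar="G",
                  help="Grid to use (default %default)")

# Long form with "=" and short form with a space parse to the same value.
long_opts, _ = parser.parse_args(["--grid=MazeGrid"])
short_opts, _ = parser.parse_args(["-g", "MazeGrid"])

print(long_opts.grid, short_opts.grid)  # MazeGrid MazeGrid
```

In a `launch.json` file, the short form is written as two separate entries in the `args` list (`"-g", "MazeGrid"`), as in the Task 2 configuration.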

Below is a description of all the flags for `gridworld.py`:

In [6]:
!python ../src/gridworld.py -h

Usage: gridworld.py [options]

Options:
  -h, --help            show this help message and exit
  -d DISCOUNT, --discount=DISCOUNT
                        Discount on future (default 0.9)
  -r R, --livingReward=R
                        Reward for living for a time step (default 0.0)
  -n P, --noise=P       How often action results in unintended direction
                        (default 0.2)
  -e E, --epsilon=E     Chance of taking a random action in q-learning
                        (default 0.3)
  -l P, --learningRate=P
                        TD learning rate (default 0.5)
  -i K, --iterations=K  Number of rounds of value iteration (default 10)
  -k K, --episodes=K    Number of epsiodes of the MDP to run (default 1)
  -g G, --grid=G        Grid to use (case sensitive; options are BookGrid,
                        BridgeGrid, CliffGrid, MazeGrid, default BookGrid)
  -w X, --windowSize=X  Request a window width of X pixels *per grid cell*
                        (default 150)
  -a A

Below is a description of all the flags for `pacman.py`:

In [7]:
!python ../src/pacman.py -h

Usage: 
    USAGE:      python pacman.py <options>
    EXAMPLES:   (1) python pacman.py
                    - starts an interactive game
                (2) python pacman.py --layout smallClassic --zoom 2
                OR  python pacman.py -l smallClassic -z 2
                    - starts an interactive game on a smaller board, zoomed in
    

Options:
  -h, --help            show this help message and exit
  -n GAMES, --numGames=GAMES
                        the number of GAMES to play [Default: 1]
  -l LAYOUT_FILE, --layout=LAYOUT_FILE
                        the LAYOUT_FILE from which to load the map layout
                        [Default: mediumClassic]
  -p TYPE, --pacman=TYPE
                        the agent TYPE in the pacmanAgents module to use
                        [Default: KeyboardAgent]
  -t, --textGraphics    Display output as text only
  -q, --quietTextGraphics
                        Generate minimal output and no graphics
  -g TYPE, --ghosts=TYPE
                 

---
## 4. Value Iteration
---

In the file `learningAgents.py`, you will find the class `ValueEstimationAgent`, which inherits from the class `Agent` (in the file `game.py`). `ValueEstimationAgent` is an abstract base class for an agent that can compute state values $V(s)$, q-values $Q(s,a)$, and the policy $\pi(s)$. As you know, $V(s)$ takes a state $s$ (which in Gridworld and Pacman consists of tuples $(x, y)$ that represent positions on a two-dimensional grid) and returns a real number that represents how valuable the state is, i.e. the utility of the agent being in that state. Similarly, $Q(s,a)$ takes a state $s$ and an action $a$ in that state (examples of actions in Gridworld and Pacman are *north*, *south*, *east*, and *west*) and returns a real number that represents how valuable that action is in that state, i.e. the utility of the agent performing that particular action in that particular state. A policy $\pi(s)$ takes a state $s$ and returns the action $a$ with the highest value (the highest utility) in that state. The class contains a constructor `__init__()` and the four methods `getQValue()`, `getValue()`, `getPolicy()`, and `getAction()`.

The constructor accepts the four parameters *alpha*, *epsilon*, *gamma*, and *numTraining*, with default values 1.0, 0.05, 0.8, and 10 respectively. All four parameters are used when an agent learns a Markov Decision Process (MDP), where *alpha* is the *learning rate*, *epsilon* is the *exploration probability*, *gamma* is the *discount rate* (i.e., how *short-sighted* or *long-sighted* the agent is), and *numTraining* is the number of training episodes. These four parameters are simply stored in instance variables with the same names in the constructor (note that the input parameter *gamma* is instead stored in an instance variable named *discount*).
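
As a rough illustration of the parameter handling described above, the sketch below stores the four constructor arguments the way the text describes. The class name `SketchEstimationAgent` is hypothetical, and the real class in `learningAgents.py` may differ in details such as type conversion.

```python
class SketchEstimationAgent:
    """Hypothetical stand-in for the constructor described in the text."""

    def __init__(self, alpha=1.0, epsilon=0.05, gamma=0.8, numTraining=10):
        self.alpha = alpha              # learning rate
        self.epsilon = epsilon          # exploration probability
        self.discount = gamma           # note: gamma is stored as "discount"
        self.numTraining = numTraining  # number of training episodes


agent = SketchEstimationAgent(gamma=0.9)
print(agent.alpha, agent.epsilon, agent.discount, agent.numTraining)
# 1.0 0.05 0.9 10
```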

The method `getQValue(s, a)` accepts a state $s$ and an action $a$, where a subclass should return the corresponding q-value $Q(s,a)$.

The method `getValue(s)` accepts a state $s$, where a subclass should return the corresponding state value $V(s)$, i.e. by choosing the largest $Q(s,a)$ value among all valid actions $a$ in the state $s$.

The method `getPolicy(s)` accepts a state $s$, where a subclass should return the corresponding action $a$, i.e. by choosing the action $a$ among all valid actions in the state $s$ that gives the highest $Q(s,a)$ value. Note that if an $\epsilon$-greedy policy is implemented, the action with the highest $Q(s,a)$ value will not always be returned.

The method `getAction(s)` is similar to `getPolicy(s)` but always returns the action $a$ that gives the highest $Q(s,a)$ value.
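
The relationship between $Q(s,a)$, $V(s)$, and the greedy action can be made concrete with a toy table of q-values. The numbers below are invented for illustration, and the plain dictionary only stands in for whatever data structure an agent uses internally.

```python
# Invented q-values for a single state, keyed by action name.
q_values = {"north": 0.51, "south": 0.12, "east": 0.72, "west": 0.30}

# V(s) is the largest Q(s, a) over the valid actions.
state_value = max(q_values.values())

# The greedy choice is the action that attains that maximum.
greedy_action = max(q_values, key=q_values.get)

print(state_value, greedy_action)  # 0.72 east
```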

In the file `valueIterationAgents.py`, you will find the class `ValueIterationAgent`, which inherits from `ValueEstimationAgent` (described above) and implements the four abstract methods `getQValue(s,a)`, `getValue(s)`, `getPolicy(s)`, and `getAction(s)`. However, note that `getQValue(s,a)` calls `computeQValueFromValues(s,a)`, whereas `getPolicy(s)` and `getAction(s)` both call the method `computeActionFromValues(s)`. These two methods are not currently implemented. Furthermore, part of the code in the constructor `__init__()` is missing.

---
### 4.1 Value Iteration Agent

Implement the value iteration algorithm in the class `ValueIterationAgent` (in the file `valueIterationAgents.py`) by completing the constructor `__init__()` and the two methods `computeQValueFromValues()` and `computeActionFromValues()`.

The constructor `__init__()` accepts the three parameters *mdp*, *discount*, and *iterations*. The parameter *mdp* represents the MDP problem that the agent should solve, *discount* is the *gamma* parameter for the value iteration algorithm (with a default value of 0.9), and *iterations* specifies the number of iterations for the value iteration algorithm (with a default value of 100).

In the constructor, you should implement the value iteration algorithm itself, which iteratively computes the value for each state (for as many iterations as indicated by the *iterations* parameter) and stores all state values $V(s)$ in the instance variable *self.values*, which is a Python dictionary with elements of the form `{state: value}`, one entry for each state.

What you compute in each iteration (for each state $s$) corresponds to the update rule for the value iteration algorithm shown below (which computes the $k$-step estimated values for the optimal state values):

$$
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]
$$

The update rule above is equivalent to:

$$
V_{k+1}(s) \leftarrow \max_{a} Q(s,a)
$$

where:

$$
Q(s,a) = \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]
$$

To be able to implement the update rule for value iteration, you therefore need to implement the calculation of $Q(s,a)$ shown above in `computeQValueFromValues(s, a)`. This method accepts the two parameters *state* and *action*, just like the method `getQValue(s, a)`. Since `getQValue(s, a)` just makes a direct call to `computeQValueFromValues(s, a)`, you can simply call `getQValue(s, a)` from the constructor after you have implemented `computeQValueFromValues(s, a)`.
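
To make the sum in $Q(s,a)$ tangible, here is a minimal sketch on invented numbers. The helper `q_value` and its arguments are hypothetical and deliberately do not use the workshop classes; it only shows the arithmetic of weighting each successor state by its transition probability.

```python
def q_value(transitions, reward, values, gamma):
    """Sum T(s,a,s') * (R(s,a,s') + gamma * V_k(s')) over successor states.

    transitions: list of (next_state, probability) pairs for a fixed (s, a)
    reward:      function next_state -> immediate reward R(s, a, s')
    values:      dict mapping each state to its current estimate V_k
    """
    return sum(prob * (reward(next_state) + gamma * values[next_state])
               for next_state, prob in transitions)


# Toy numbers: the intended move succeeds with probability 0.8.
transitions = [("goal", 0.8), ("pit", 0.1), ("start", 0.1)]
values = {"goal": 1.0, "pit": -1.0, "start": 0.0}
q = q_value(transitions, lambda s: 0.0, values, gamma=0.9)
print(round(q, 3))  # 0.63
```

The three terms are $0.8 \times 0.9 \times 1.0 = 0.72$, $0.1 \times 0.9 \times (-1.0) = -0.09$, and $0.1 \times 0.9 \times 0.0 = 0$, which add up to $0.63$.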

Finally, after the agent has computed all state values using the value iteration algorithm, the agent must be able to perform the corresponding actions in each state that solve the MDP problem optimally. This is done by calling `getAction(s)` from the Gridworld application's `__main__` block at the bottom of the file `gridworld.py`. In addition, a call is made to `getPolicy(s)` from the method `displayValues()` in the file `graphicsGridworldDisplay.py` in order to display the Gridworld agent’s policy (the arrows in each square) graphically.

Both `getAction(s)` and `getPolicy(s)` make direct calls to the method `computeActionFromValues(s)`. The method `computeActionFromValues(s)` accepts the single parameter *state*, and must therefore return which action the agent should take in that particular state. In other words, this method must implement the *policy extraction* algorithm as shown below:

$$
\pi^{*}(s) = \operatorname*{arg\,max}_{a} \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V^{*}(s')]
$$

which is equivalent to:

$$
\pi^{*}(s) = \operatorname*{arg\,max}_{a} Q(s,a)
$$

where $*$ indicates an optimal value (although this notation is not entirely accurate, since we are computing estimated values and not mathematically optimal ones).

From the last equation above, we see that we must also call the method `getQValue(s, a)` from `computeActionFromValues(s)`.
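
Policy extraction then amounts to an arg max over the valid actions. The sketch below uses a hypothetical helper `extract_action` with invented q-values. Returning `None` when a state has no valid actions is a convention chosen for this sketch, not something the text above prescribes.

```python
def extract_action(actions, q_of):
    """Return the action with the largest q-value, or None if there are none.

    actions: list of valid actions in the state
    q_of:    function action -> Q(s, a) for the state in question
    """
    if not actions:
        return None
    return max(actions, key=q_of)


toy_q = {"west": 0.9, "east": 5.9}  # invented q-values for one state
print(extract_action(list(toy_q), toy_q.get))  # east
print(extract_action([], toy_q.get))           # None
```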

Note that you must implement the *batch version* of the value iteration algorithm in the constructor `__init__()`. This simply means that you need two Python dictionary objects: one to keep track of all state values $V(s)$ for iteration $k$ and another to keep track of all state values $V(s)$ for iteration $k+1$. However, you should use the `Counter` class in the file `util.py` instead of a Python dictionary. This means that you should use the old (previous) state values when computing the new (current) state values, and then replace the old `util.Counter` object with the new one at the end of each iteration.

The difference between a *batch update* and an *in-place update* is that the data structures for $V_{k+1}(s)$ and $V_k(s)$ are two different instances in a *batch update*, whereas they are the same instance in an *in-place update*. It can be shown, mathematically, that both versions converge.
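
The difference is easiest to see on a tiny example. The sketch below uses an invented deterministic chain `a -> b -> end` with plain dictionaries (not the `Counter` class), and all names in it are hypothetical. It performs one sweep of each kind from the same starting values.

```python
GAMMA = 0.5
# Deterministic toy chain: a -> b -> end, with reward 1 on the step b -> end.
NEXT = {"a": ("b", 0.0), "b": ("end", 1.0)}  # state -> (next_state, reward)
ORDER = ["b", "a"]                           # sweep order over non-terminal states


def backup(state, values):
    """One Bellman backup for a state with a single deterministic action."""
    next_state, reward = NEXT[state]
    return reward + GAMMA * values[next_state]


def batch_sweep(values):
    new_values = dict(values)                      # separate table for V_{k+1}
    for state in ORDER:
        new_values[state] = backup(state, values)  # reads only V_k
    return new_values


def in_place_sweep(values):
    values = dict(values)                          # copy so the caller's table is untouched
    for state in ORDER:
        values[state] = backup(state, values)      # may read values updated in this sweep
    return values


start = {"a": 0.0, "b": 0.0, "end": 0.0}
print(batch_sweep(start))     # {'a': 0.0, 'b': 1.0, 'end': 0.0}
print(in_place_sweep(start))  # {'a': 0.5, 'b': 1.0, 'end': 0.0}
```

In the batch sweep, state `a` still sees the old value of `b`, so it needs a second sweep to reach $0.5$. The in-place sweep gets there in one pass because `b` was updated first. Both end up at the same values on this toy chain.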

Also remember that if a state is a terminal state, i.e. ($state = TERMINAL\_STATE$), then $V(s) = 0$.

Note that you must also handle the case where a state has no valid actions. In that case, what state value $V(s)$ should you use? That is, what does it mean for future rewards if a state has no possible actions?

<details>
<summary>Answer (click to expand)</summary>

> If a state $s$ has no valid actions $a$, the agent can't collect any more rewards $r$ from that state, thus the return (total discounted future reward) from that state must be 0, i.e. $V(s) = 0$.

</details>

From the *mdp* object that is passed into the constructor `__init__()`, the following information can be retrieved:

- `mdp.getStates()` returns a list of all states.
- `mdp.getPossibleActions(state)` returns a list of valid actions in that *state*.
- `mdp.getTransitionStatesAndProbs(state, action)` returns pairs `(nextState, probability)`, where *probability* is the chance (probability) of ending up in *nextState*.
- `mdp.getReward(state, action, nextState)` returns the *immediate reward* when the action *action* is performed in state *state* and the agent ends up in the state *nextState*.
- `mdp.isTerminal(state)` returns `True` if the state *state* is a terminal state, else `False`.
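
To get a feel for how these five methods fit together, the sketch below defines a hypothetical two-state stand-in, `TinyMDP`, and loops over it. It is not the Gridworld implementation and does not reproduce how Gridworld handles its terminal states; it only mirrors the method names and return shapes listed above.

```python
class TinyMDP:
    """Hypothetical two-state MDP exposing the five methods listed above."""

    def getStates(self):
        return ["start", "exit"]

    def getPossibleActions(self, state):
        return [] if self.isTerminal(state) else ["go", "stay"]

    def getTransitionStatesAndProbs(self, state, action):
        if action == "go":
            return [("exit", 0.8), ("start", 0.2)]
        return [("start", 1.0)]

    def getReward(self, state, action, nextState):
        return 1.0 if nextState == "exit" else 0.0

    def isTerminal(self, state):
        return state == "exit"


mdp = TinyMDP()
for state in mdp.getStates():
    for action in mdp.getPossibleActions(state):
        pairs = mdp.getTransitionStatesAndProbs(state, action)
        total = sum(prob for _, prob in pairs)  # should be 1 for every (state, action)
        print(state, action, pairs, total)
```

A quick check like the probability sum above can be a useful sanity test before running value iteration on any MDP object.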

Test the agent with the Python configuration below (100 iterations of value iteration, followed by 10 episodes, using the *BookGrid* layout). When Gridworld first becomes visible, the state value $V(s)$ is shown in each square along with the policy $\pi(s)$ (see first screenshot below). Then press any key to switch to the next view, which shows the Q-value $Q(s,a)$ for each quadrant in each square (see second screenshot below). Finally, press any key again to allow the agent to follow the policy (see third screenshot below).

```bash
gridworld.py -g BookGrid -a value -i 100 -k 10
```

<details>
<summary>Debug configuration <b>Task 4.1.1</b> for <b>launch.json</b> (click to expand)</summary>

```json
{
    "name": "Task 4.1.1",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/gridworld.py",
    "args": [
        "-g", "BookGrid",
        "-a", "value",
        "-i", "100",
        "-k", "10"
    ],
    "console": "integratedTerminal"
}
```
</details>

<details>
<summary>Screenshots (click to expand)</summary>

State values $V(s)$
<img src="images/task_411_1.png">

Q-values $Q(s,a)$
<img src="images/task_411_2.png">

Agent following policy
<img src="images/task_411_3.png">

</details>

With a correctly implemented value iteration agent, your state values (the value displayed in each square), your Q-values (the values displayed in each quadrant), and your policy (the arrows in each square) should look like the screenshots below when running the following Python configuration (that is, after 5 iterations using the BookGrid layout):

```bash
gridworld.py -g BookGrid -a value -i 5
```
<details>
<summary>Debug configuration <b>Task 4.1.2</b> for <b>launch.json</b> (click to expand)</summary>

```json
{
    "name": "Task 4.1.2",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/gridworld.py",
    "args": [
        "-g", "BookGrid",
        "-a", "value",
        "-i", "5"
    ],
    "console": "integratedTerminal"
}
```
</details>

<details>
<summary>Screenshots (click to expand)</summary>

State values $V(s)$
<img src="images/task_412_1.png">

Q-values $Q(s,a)$
<img src="images/task_412_2.png">

Agent following policy
<img src="images/task_412_3.png">

</details>

---
### 4.2 Bridge Grid

Now that you have a working value iteration agent, we will examine how the two parameters *discount* and *noise* affect the agent’s optimal policy for the *BridgeGrid* problem.

The *BridgeGrid* layout is shown in the image below and contains two positive and ten negative terminal states. The positive terminal state on the far left has a small reward (+1.00), while the positive terminal state on the far right has a large reward (+10.00). Between the two positive terminal states there is a *bridge* that is surrounded on both sides by five negative terminal states each, with very large negative rewards (-100.00). The living cost (living reward) for the *BridgeGrid* problem is 0, i.e. each *immediate reward* is 0 (except for the 12 terminal states). The agent starts in the state immediately to the right of the positive terminal state with reward +1.00, i.e., the square that shows the state value -17.28 in the image below.

<img src="images/task_421_0.png">

With the Python configuration below, the optimal policy will **not** cross the bridge and exit the board via the positive terminal state with reward +10.00.

```bash
gridworld.py -a value -i 100 -g BridgeGrid --discount 0.9 --noise 0.2
```

<details>
<summary>Debug configuration <b>Task 4.2.1</b> for <b>launch.json</b> (click to expand)</summary>

```json
{
    "name": "Task 4.2.1",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/gridworld.py",
    "args": [        
        "-a", "value",
        "-i", "100",
        "-g", "BridgeGrid",
        "--discount", "0.9",
        "--noise", "0.2"
    ],
    "console": "integratedTerminal"
}
```
</details>

The configuration above uses a *discount* (*gamma* value) of 0.9 and a stochastic (random) action profile (*noise*) where the probability of moving in a different direction than intended is 0.2. With a living cost (living reward) of 0, consider how the two parameters *discount* and *noise* affect the update formula in the value iteration algorithm.

The *discount* is the same as the *gamma* parameter $\gamma$ in the update formula, while the *noise* parameter distributes its probability across the two perpendicular directions relative to the intended direction in the transition function $T(s,a,s')$ as shown in the figure below (here with $noise = 0.2$).

<img src="images/stochastic_actions.png">
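
The same distribution can be written out in a few lines of code. The sketch below assumes that the noise is split evenly between the two perpendicular directions and ignores walls and grid borders. The names `PERPENDICULAR` and `direction_probabilities` are hypothetical.

```python
# Perpendicular neighbours of each compass direction.
PERPENDICULAR = {
    "north": ("west", "east"),
    "south": ("east", "west"),
    "east": ("north", "south"),
    "west": ("south", "north"),
}


def direction_probabilities(intended, noise):
    """Split `noise` evenly over the two perpendicular directions."""
    left, right = PERPENDICULAR[intended]
    return {intended: 1.0 - noise, left: noise / 2.0, right: noise / 2.0}


print(direction_probabilities("north", 0.2))
# {'north': 0.8, 'west': 0.1, 'east': 0.1}
```

With `noise = 0.0`, the intended direction gets probability 1 and the transitions become deterministic.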

Remember that the value iteration algorithm's update rule:

$$
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]
$$

is equivalent to:

$$
V_{k+1}(s) \leftarrow \max_{a} Q(s,a)
$$

where:

$$
Q(s,a) = \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]
$$

First, consider how the two parameters will affect the optimal policy (which chooses actions according to the *policy extraction* algorithm shown in the previous task) for the *BridgeGrid* problem. Then, change only **one** of the parameters (*discount* or *noise*) in the configuration above so that the agent’s policy crosses the bridge and exits the board via the terminal state +10.00 on the far right. With a correct policy, your solution should look similar to the screenshots below.

<details>
<summary>Screenshots (click to expand)</summary>

State values $V(s)$
<img src="images/task_421_1.png">

Q-values $Q(s,a)$
<img src="images/task_421_2.png">

Agent following policy
<img src="images/task_421_3.png">

</details>

Fill in your parameter values for *discount* and *noise* in the table below. Also explain why your parameter settings lead to the desired policy for the agent.

<details>
<summary>Sample Solution (click to expand)</summary>

Settings:

| Parameter | Value |
|-----------|-------|
| discount  |  0.9  |
| noise     |  0.0  |

Explanation:

The initial state values are as below:

|         |         |         |         |         |         |         |
|---------|---------|---------|---------|---------|---------|---------|
|         | -100.00 | -100.00 | -100.00 | -100.00 | -100.00 |         |
| +001.00 | +000.00 | +000.00 | +000.00 | +000.00 | +000.00 | +010.00 |
|         | -100.00 | -100.00 | -100.00 | -100.00 | -100.00 |         |
|         |         |         |         |         |         |         |

The update rule is:

$$
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]
$$

where:
- $V_k(s') = 0$ for all terminal states.
- $R(s,a,s') = 0$ for all non-terminal states.

Why the settings work:

With $noise = 0.2$, every move to the right has a $20\%$ chance of slipping into one of the $\pm 90^\circ$ directions, which on the bridge means an immediate fall into one of the $-100$ terminal states. Over several steps, that risk dominates the expected return, so the optimal policy avoids the bridge and heads to the safe $+1$.

Setting $noise = 0.0$ makes transitions deterministic: the intended successor state is reached with probability $T(s,a,s') = 1$ for every action $a \in A(s)$. The expected return (total discounted reward) for going straight right is then simply the discounted $+10$ after $d$ steps, $10 \times \gamma^d$, while the expected return for going left for $1$ step is $1 \times \gamma^1$. The agent therefore goes right whenever $10 \times \gamma^d > 1 \times \gamma^1$, i.e. whenever $\gamma^{d-1} > 0.1$, or equivalently $(d-1) \times \ln(\gamma) > \ln(0.1)$. With $\gamma = 0.9$ we get $d < \frac{\ln 0.1}{\ln 0.9} + 1 \approx 22.85$ (the inequality flips direction because we divide by the negative value $\ln(0.9) \approx -0.105$). The bridge is far shorter than 22 steps, so the optimal policy becomes *go right* to $+10$.
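
The threshold can be checked numerically. The short sketch below recomputes it with the standard `math` module and tests the inequality for a few path lengths. The helper name `right_beats_left` and the sample lengths are chosen for illustration only.

```python
import math

gamma = 0.9

# Largest path length d (in steps) for which 10 * gamma**d > 1 * gamma**1.
threshold = math.log(0.1) / math.log(gamma) + 1
print(round(threshold, 2))  # 22.85


def right_beats_left(d):
    """True if the discounted +10 after d steps exceeds the discounted +1."""
    return 10 * gamma ** d > 1 * gamma ** 1


print(right_beats_left(6), right_beats_left(22), right_beats_left(23))
# True True False
```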

</details>

In [None]:
# Settings:

# -------------------
# Parameter | Value |
# -------------------
# discount  |  0.9  |
# noise     |  0.2  |

# Explanation:
# 

---
### 4.3 Discount Grid

Another important parameter, in addition to *discount* and *noise*, that affects the value iteration agent’s policy is the *living cost* (*livingReward*), i.e., the *immediate reward* for each state except the terminal states. Investigate how these three parameters (*discount*, *noise*, and *livingReward*) influence the agent’s optimal policy for the *DiscountGrid* problem.

The *DiscountGrid* layout is shown in the figure below and contains two positive and five negative terminal states. The positive terminal state in the middle has a small reward (+1.00), while the positive terminal state on the far right has a large reward (+10.00). Along the bottom row, there are five negative terminal states with large negative rewards (-10.00). The agent starts in the yellow state at the bottom left (see the figure below). In addition, there are three gray squares that represent walls.

To reach the terminal states, the agent can either choose to take the long, safe path (<span style="color: #99ff99;">green</span> arrow), which avoids the bottom row, or it can take the short, risky path (<span style="color: #ff6666;">red</span> arrow) along the bottom row. The reason this path is risky is that the agent may end up in a terminal state with a large negative reward.

<img src="images/task_431_0.png">

The Python configuration below shows how you can supply parameter values for *discount*, *noise*, and *livingReward* to the value iteration agent:

```bash
gridworld.py -a value -i 100 -g DiscountGrid --discount 0.0 --noise 0.0 --livingReward 0.0
```

<details>
<summary>Debug configuration <b>Task 4.3.1</b> for <b>launch.json</b> (click to expand)</summary>

```json
{
    "name": "Task 4.3.1",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/gridworld.py",
    "args": [        
        "-a", "value",
        "-i", "100",
        "-g", "DiscountGrid",
        "--discount", "0.0",
        "--noise", "0.0",
        "--livingReward", "0.0"
    ],
    "console": "integratedTerminal"
}
```
</details>

Think about how the three parameters *discount*, *noise*, and *livingReward* affect the update formula in the value iteration algorithm. The *discount* parameter is the same as the *gamma* $\gamma$ parameter in the update formula. The *noise* parameter affects the *transition probability* $T(s,a,s')$, and the *livingReward* parameter is the same as the *immediate reward*, i.e., $R(s,a,s')$.

Once again, the value iteration algorithm's update rule is reiterated below:

$$
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]
$$

which is equivalent to:

$$
V_{k+1}(s) \leftarrow \max_{a} Q(s,a)
$$

where:

$$
Q(s,a) = \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]
$$

After thinking about how the three parameters will affect the optimal policy (which chooses actions according to the *policy extraction* algorithm shown in [4.1 Value Iteration Agent](#41-value-iteration-agent)) for the *DiscountGrid* problem, try to find Python configurations that lead to the following policies:

<ol type="a">
  <li>The agent prefers to go to the terminal state +1.00 via the <span style="color: #ff6666;">red</span> path.</li>
  <li>The agent prefers to go to the terminal state +1.00 via the <span style="color: #99ff99;">green</span> path.</li>
  <li>The agent prefers to go to the terminal state +10.00 via the <span style="color: #ff6666;">red</span> path.</li>
  <li>The agent prefers to go to the terminal state +10.00 via the <span style="color: #99ff99;">green</span> path.</li>
  <li>The agent prefers to avoid all terminal states, i.e. the episode never terminates.</li>
</ol>

To check your policy, you can follow the arrows in the graphical display of the *DiscountGrid* board. For example, the policy for subtask (c) should move four steps east and then one step north in order to reach the terminal state +10.00 via the <span style="color: #ff6666;">red</span> path (see screenshots below).

<details>
<summary>Screenshots (click to expand)</summary>

a) State values $V(s)$
<img src="images/task_431_a_1.png">

a) Q-values $Q(s,a)$
<img src="images/task_431_a_2.png">

b) State values $V(s)$
<img src="images/task_431_b_1.png">

b) Q-values $Q(s,a)$
<img src="images/task_431_b_2.png">

c) State values $V(s)$
<img src="images/task_431_c_1.png">

c) Q-values $Q(s,a)$
<img src="images/task_431_c_2.png">

d) State values $V(s)$
<img src="images/task_431_d_1.png">

d) Q-values $Q(s,a)$
<img src="images/task_431_d_2.png">

e) State values $V(s)$
<img src="images/task_431_e_1.png">

e) Q-values $Q(s,a)$
<img src="images/task_431_e_2.png">

</details>

Fill in your parameter values for *discount*, *noise*, and *livingReward* for subtasks (a)–(e) below. Also explain why your parameter settings lead to the desired policy.

<details>
<summary>Sample Solution (click to expand)</summary>

General rules:

- *discount* ($\gamma$)
  - low $\rightarrow$ *short-sighted* (prefers nearer $+1$ over farther $+10$).
  - high $\rightarrow$ *far-sighted* (will go for $+10$).
- *noise*
  - low $\rightarrow$ actions are reliable, so risky (<span style="color: #ff6666;">red</span>) paths are acceptable.
  - high $\rightarrow$ risky paths are avoided, so stick with low-risk (<span style="color: #99ff99;">green</span>) paths.
- *livingReward*
  - negative $\rightarrow$ hurry to finish (shortest path).
  - zero $\rightarrow$ neutral.
  - positive $\rightarrow$ prefer wandering/long paths (may avoid terminals entirely if big enough).

Subtask a:

The agent prefers to go to the terminal state +1.00 via the <span style="color: #ff6666;">red</span> path.

| Parameter | Value |
|-----------|-------|
| discount  |  0.2  |
| noise     |  0.0  |
| reward    |  0.0  |

Explanation:

> Short-sighted ($\gamma$ low) prefers the closer $+1$, and zero noise makes the risky <span style="color: #ff6666;">red</span> shortcut safe enough to choose.

Subtask b:

The agent prefers to go to the terminal state +1.00 via the <span style="color: #99ff99;">green</span> path.

| Parameter | Value |
|-----------|-------|
| discount  |  0.2  |
| noise     |  0.2  |
| reward    | -1.0  |

Explanation:

> Still short-sighted (so $+1$ over $+10$), but higher noise makes the <span style="color: #ff6666;">red</span> path too risky, so it picks the safer <span style="color: #99ff99;">green</span> detour.
Negative living reward encourages finishing quickly, which favors the nearer $+1$ and also biases against meandering on the bottom row.

Subtask c:

The agent prefers to go to the terminal state +10.00 via the <span style="color: #ff6666;">red</span> path.

| Parameter | Value |
|-----------|-------|
| discount  |  1.0  |
| noise     |  0.0  |
| reward    | -1.0  |

Explanation:

> Far-sighted (values the larger $+10$), zero noise makes the risky shortcut (<span style="color: #ff6666;">red</span>) reliable, and negative living reward pushes for the shortest route.

Subtask d:

The agent prefers to go to the terminal state +10.00 via the <span style="color: #99ff99;">green</span> path.

| Parameter | Value |
|-----------|-------|
| discount  |  0.9  |
| noise     |  0.3  |
| reward    |  0.1  |

Explanation:

> Still prefers the larger $+10$ ($\gamma$ high), but the <span style="color: #ff6666;">red</span> route is too risky with higher noise, and a slightly positive living reward removes urgency so it tolerates the longer, safer <span style="color: #99ff99;">green</span> route.

Subtask e:

The agent prefers to avoid all terminal states, i.e. the episode never terminates.

| Parameter | Value |
|-----------|-------|
| discount  |  0.9  |
| noise     |  0.0  |
| reward    |  2.0  |

Explanation:

> With a sufficiently positive living reward, the expected return from staying alive forever exceeds any terminal payoff (e.g., with $\gamma = 0.9$, any $livingReward > 1$ makes endless wandering better than $+10$). Zero noise avoids accidental slips into terminals.
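
The threshold quoted for subtask (e) can be checked with a geometric series. As a sketch, assume that noise is zero and that the agent collects the same living reward $r$ on every step forever (the symbol $r$ is introduced here only for this calculation):

$$
\sum_{t=0}^{\infty} \gamma^t r = \frac{r}{1 - \gamma}
$$

With $\gamma = 0.9$ this gives $\frac{r}{1 - 0.9} = 10r$, which exceeds the largest terminal payoff of $10$ exactly when $r > 1$. For the sample value $r = 2.0$, the return from never terminating is $20$.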

**Note**

- Values aren’t unique.
- Nearby numbers with the same relationships (low/high $\gamma$, low/high noise, negative/zero/positive livingReward) will produce the same behaviors. 

</details>

In [None]:
# Subtask a:

# -------------------
# Parameter | Value |
# -------------------
# discount  |  0.0  |
# noise     |  0.0  |
# reward    |  0.0  |

# Explanation:
# 

# Subtask b:

# -------------------
# Parameter | Value |
# -------------------
# discount  |  0.0  |
# noise     |  0.0  |
# reward    |  0.0  |

# Explanation:
# 

# Subtask c:

# -------------------
# Parameter | Value |
# -------------------
# discount  |  0.0  |
# noise     |  0.0  |
# reward    |  0.0  |

# Explanation:
# 

# Subtask d:

# -------------------
# Parameter | Value |
# -------------------
# discount  |  0.0  |
# noise     |  0.0  |
# reward    |  0.0  |

# Explanation:
# 

# Subtask e:

# -------------------
# Parameter | Value |
# -------------------
# discount  |  0.0  |
# noise     |  0.0  |
# reward    |  0.0  |

# Explanation:
# 

---
## 5. Q-Learning (Reinforcement Learning)
---

The value iteration agent has a complete MDP model, i.e. it knows the transition function $T(s,a,s´)$ and the reward function $R(s,a,s´)$. Therefore, the agent does not need to learn anything about its environment (everything is already known), and it can use the *value iteration* algorithm to compute an optimal policy before it even begins interacting with the environment.

However, in real-world problems, both $T(s,a,s´)$ and $R(s,a,s´)$ are unknown. This means the agent must *learn* how the environment works and how different actions affect it by *interacting* with it. Such agents are called *learning agents*, and the *Q-learning* agent is one type of learning agent that implements the *Q-learning* algorithm.

In the file `learningAgents.py`, you will find the abstract class `ReinforcementAgent`, which inherits from `ValueEstimationAgent`. In addition to the inherited abstract methods `getQValue(s,a)`, `getValue(s)`, `getPolicy(s)`, and `getAction(s)`, the class defines one additional abstract method `update()`.

The first four inherited abstract methods were explained in detail in [4.1 Value Iteration Agent](#41-value-iteration-agent). The method `update(s,a,s´,r)` accepts the four parameters *state*, *action*, *nextState*, and *reward*. A subclass should use these parameters to update its Q-values. The method `update(s,a,s´,r)` will be called automatically by the method `observeTransition(s,a,s´,r)` in the same class whenever a transition is observed, i.e., when the agent performs action $a$ in state $s$, ends up in state $s´$, and receives reward $r$.

Besides these methods, the class also contains `getLegalActions(s)`, `startEpisode()`, `stopEpisode()`, `isInTraining()`, `isInTesting()`, and the constructor `__init__()`. The only method that should be called directly from a subclass is `getLegalActions(s)`, which takes a state and returns a list of valid actions in that state.

In the constructor, you can see that default values are set for *numTraining* (default = 100), *epsilon* (default = 0.5), *alpha* (default = 0.5), and *gamma* (default = 1). The *numTraining* parameter specifies how many episodes the agent will spend learning. The *epsilon* parameter is the probability that the agent chooses a random action ($\epsilon$-greedy behavior). The *alpha* parameter is the *learning rate*. And finally, the *gamma* parameter is the *discount rate* in the Q-learning algorithm.

At the top of the file `qlearningAgents.py`, the class `QLearningAgent` is defined. This class inherits from `ReinforcementAgent`, and therefore must implement the five abstract methods `getQValue(s,a)`, `getValue(s)`, `getPolicy(s)`, `getAction(s)`, and `update(s,a,s´,r)`. However, note that `getValue(s)` is already implemented and simply calls `computeValueFromQValues(s)`. The method `getPolicy(s)` is also already implemented and simply calls `computeActionFromQValues(s)`. This means that the methods you actually need to implement are `getQValue(s,a)`, `computeValueFromQValues(s)`, `computeActionFromQValues(s)`, `getAction(s)`, and `update(s,a,s´,r)`. You may also use the constructor `__init__()` to create additional instance attributes, such as a suitable data structure to store your Q(s,a) values.

---
### 5.1 Q-Learning Agent

Implement the Q-learning algorithm in the class `QLearningAgent` (in the file `qlearningAgents.py`) by completing the methods `update(s,a,s´,r)`, `computeValueFromQValues(s)`, `getQValue(s,a)`, `computeActionFromQValues(s)`, and the constructor `__init__()`.

The constructor `__init__()` accepts a variable number of input parameters, which are simply passed to the superclass `ReinforcementAgent`. The constructor is a suitable place to initialize a data structure for storing $Q(s,a)$ values. Note that no learning is done in the constructor, because the agent must *learn through trial-and-error*, i.e., by interacting with the environment (compare this with the `ValueIterationAgent`, where all computation was done inside the constructor).

The method `getQValue(s,a)` accepts two parameters: a *state* and an *action* in that *state*. The method should return the Q-value $Q(state, action)$. However, if the state-action pair has never been encountered before, the value 0 should be returned.

The method `computeValueFromQValues(s)` accepts a single parameter, a *state*. The method should return the largest `Q(state, action)` value among all *legal actions* that can be taken in that state. If there are no legal actions (e.g., the state is a terminal state), the method should return 0. It may be useful to use the superclass method `getLegalActions(state)` here.

The method `computeActionFromQValues(s)` also accepts a single parameter, a *state*. The method should return the action that gives the highest $Q(state, action)$ value among all *legal actions* in that state. If there are no legal actions (e.g., for a terminal state), the method should return `None`. Again, using `getLegalActions(state)` is helpful. Note that the best result is obtained if you break ties randomly when multiple actions share the highest Q-value. You can do this using `random.choice()`. Also remember that actions the agent has never seen before have a Q-value of 0, i.e. if all known actions have negative Q-values, an unseen action may actually be optimal.
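
The behavior described for these three methods can be sketched outside the course framework as follows. This is only an illustration under stated assumptions: the module-level `q_table` and the function names `get_q`, `value_of`, and `greedy_action` are hypothetical stand-ins, not the methods of `QLearningAgent`, and the legal actions are passed in explicitly instead of coming from `getLegalActions(state)`.

```python
import random
from collections import defaultdict

# Hypothetical stand-alone Q-table; unseen (state, action) pairs default to 0.0.
q_table = defaultdict(float)


def get_q(state, action):
    """Return Q(state, action), or 0.0 if the pair has never been updated."""
    return q_table[(state, action)]


def value_of(state, legal_actions):
    """Largest Q-value over the legal actions, or 0.0 if there are none."""
    if not legal_actions:
        return 0.0
    return max(get_q(state, a) for a in legal_actions)


def greedy_action(state, legal_actions):
    """A best action with ties broken at random, or None if there are none."""
    if not legal_actions:
        return None
    best = value_of(state, legal_actions)
    best_actions = [a for a in legal_actions if get_q(state, a) == best]
    return random.choice(best_actions)


q_table[("s0", "north")] = -1.0
q_table[("s0", "east")] = -0.5
legal = ["north", "east", "south"]  # "south" has never been tried
print(value_of("s0", legal), greedy_action("s0", legal))
```

Here the untried action `south` is selected, because its default Q-value of 0 beats the negative values of the two known actions.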

Make sure you only access Q-values by calling `getQValue(s, a)` inside both `computeValueFromQValues(s)` and `computeActionFromQValues(s)`. This will help you in [6.1 Approximative Q-Learning Agent](#61-approximative-q-learning-agent).

The method `update(s,a,s´,r)` should contain the actual Q-learning update rule. It accepts the parameters *state*, *action*, *nextState*, and *reward*, and should update the $Q(s,a)$ value. We cannot use the model-based update rule from value iteration, shown below for Q-values, because it requires $T(s,a,s´)$ and $R(s,a,s´)$, which we do not know. In other words, we are dealing with an unknown MDP.

$$
Q_{k+1}(s,a) \leftarrow \sum_{s´} T(s,a,s´) [R(s,a,s´) + \gamma \max_{a´} Q_k(s´,a´)]
$$

However, we can implement the *off-policy* Q-learning algorithm based on sample estimates and an exponential moving average as follows (where $r$ is the reward we receive from the environment when we perform action $a$ in state $s$ and end up in the next state $s´$, i.e., it corresponds to $r = R(s,a,s´)$ if we had a model of the reward function):

$$
sample = r + \gamma \max_{a´} Q(s´,a´)
$$

$$
Q(s,a) \leftarrow (1 - \alpha) Q(s,a) + (\alpha) [sample]
$$

With $r = R(s,a,s´)$, and substituting for $sample$, this can be expressed as:

$$
Q(s,a) \leftarrow (1 - \alpha) Q(s,a) + (\alpha) [r + \gamma \max_{a´} Q(s´,a´)]
$$

which is equivalent to:

$$
Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a´} Q(s´,a´) - Q(s,a) ]
$$
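
To make the moving average concrete, here is a minimal numeric sketch of the update applied twice to the same transition. The function name `q_learning_update`, its argument list, and the toy transition are assumptions for illustration only; in the course code the Q-values live inside your agent and the legal actions of the next state come from `getLegalActions(nextState)`.

```python
from collections import defaultdict


def q_learning_update(q, state, action, next_state, reward, next_actions, alpha, gamma):
    """Apply one Q-learning update to the table q in place and return the new value."""
    # max over a' of Q(s', a'); 0.0 when s' has no legal actions (terminal state)
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    sample = reward + gamma * best_next
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * sample
    return q[(state, action)]


q = defaultdict(float)  # unseen pairs start at 0.0
# Hypothetical transition: taking "exit" in state "s" yields reward 1 and ends the episode.
first = q_learning_update(q, "s", "exit", "done", 1.0, [], alpha=0.5, gamma=0.9)
second = q_learning_update(q, "s", "exit", "done", 1.0, [], alpha=0.5, gamma=0.9)
print(first, second)  # 0.5 0.75
```

With $\alpha = 0.5$, each repetition moves $Q(s,a)$ halfway from its current value toward the sample, which is the exponential moving average expressed by the formulas above.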

Note that the parameter gamma $\gamma$ in the update formula above is called *discount* in the superclass `ReinforcementAgent`, while alpha $\alpha$ is in fact named *alpha*. Once you have implemented the methods described above, you can test the learning algorithm with the following Python configuration:

```bash
gridworld.py -a q -k 5 -m
```

<details>
<summary>Debug configuration <b>Task 5.1.1</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 5.1.1",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/gridworld.py",
    "args": [
        "-a", "q",
        "-k", "5",
        "-m"
    ],
    "console": "integratedTerminal"
}
```
</details>

With the configuration above, the agent will learn over five episodes (`-k 5`). In addition, the learning will take place while you *manually* control the agent using the arrow keys on the keyboard (`-m`). You can observe how the agent learns as it leaves a state (the update does not happen in the state the agent enters, but in the state the agent comes from).

If you want to debug your code by manually controlling the agent in *Gridworld*, it may help to turn off *noise* by adding the flag `--noise 0.0` to the configuration above. If you manually move the agent *north* and then *east* to the terminal state in the upper right corner for four episodes, you should see the result shown in the figure below if your Q-learning agent is correctly implemented.

<details>
<summary>Debug configuration <b>Task 5.1.2</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 5.1.2",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/gridworld.py",
    "args": [
        "-a", "q",
        "-k", "5",
        "-m",
        "--noise", "0.0"
    ],
    "console": "integratedTerminal"
}
```
</details>

<img src="images/task_511_2.png">

When your agent’s learning algorithm is working correctly, you can implement the method `getAction()`. The method accepts a single parameter, *state*, and should return an action. You must implement epsilon-greedy ($\epsilon$-greedy) action selection in this method. That is, with probability $\epsilon$, the method should return a random action chosen from all *legal actions*. Otherwise, with probability $1 - \epsilon$, it should return the best action according to the current Q-values. Note that the random action may still happen to be the best action. To choose random actions uniformly, you can use `random.choice()`. To simulate a Bernoulli random variable, you can use `util.flipCoin(p)`, which returns `True` with probability $p$ and `False` with probability $1 - p$.
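
A stand-alone sketch of $\epsilon$-greedy selection is shown below. It is not the course implementation: the function `epsilon_greedy` and its parameters are hypothetical, the greedy action is passed in as an argument, and `random.random() < epsilon` is used here in place of the course's coin-flip helper.

```python
import random


def epsilon_greedy(legal_actions, greedy_choice, epsilon):
    """Pick a random legal action with probability epsilon, else the greedy one."""
    if not legal_actions:
        return None
    if random.random() < epsilon:  # Bernoulli trial with success probability epsilon
        return random.choice(legal_actions)
    return greedy_choice


random.seed(0)  # fixed seed only to make the demonstration repeatable
actions = ["north", "east", "south", "west"]
picks = [epsilon_greedy(actions, "north", epsilon=0.5) for _ in range(1000)]
share_north = picks.count("north") / len(picks)
print(share_north)
```

With four legal actions and $\epsilon = 0.5$, the greedy action is expected to be returned in roughly $0.5 + 0.5 \times 0.25 = 0.625$ of the calls, since the random branch can also land on it.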

You can test your agent with the following Python configuration (100 episodes):

```bash
gridworld.py -a q -k 100
```

<details>
<summary>Debug configuration <b>Task 5.1.3</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 5.1.3",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/gridworld.py",
    "args": [
        "-a", "q",
        "-k", "100"
    ],
    "console": "integratedTerminal"
}
```
</details>

With a correctly implemented agent, your learned Q-values should resemble the Q-values of the value iteration agent on the same problem, especially along frequently visited paths.

<img src="images/task_513.png">

Additionally, if your agent works as expected, you can test your Q-learning agent on the *Crawler* robot using the following Python configuration:

```bash
crawler.py
```

<details>
<summary>Debug configuration <b>Task 5.1.4</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 5.1.4",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/crawler.py",
    "console": "integratedTerminal"
}
```
</details>

<img src="images/crawler.png">

If this does not work, you have most likely not written general code for your Q-learning agent (and instead written code specific to the *Gridworld* problem). Experiment with the different parameters for the robot to see how they affect the robot’s policy and actions.

---
### 5.2 Bridge Grid

Test your Q-learning agent on the *BridgeGrid* problem with the following Python configuration (training over 50 episodes with $noise = 0$, $epsilon = 1$, and $alpha = 0.9$) and observe whether the agent manages to find the optimal policy (the value-iteration agent did manage this for the same problem):

```bash
gridworld.py -a q -k 50 -n 0 -g BridgeGrid -e 1 -l 0.9
```

<details>
<summary>Debug configuration <b>Task 5.2.1</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 5.2.1",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/gridworld.py",
    "args": [
        "-a", "q",
        "-k", "50",
        "-n", "0",
        "-g", "BridgeGrid",
        "-e", "1",
        "-l", "0.9"
    ],
    "console": "integratedTerminal"
}
```
</details>

<details>
<summary>Screenshots (click to expand)</summary>

Q-values $Q(s,a)$
<img src="images/task_521.png">

</details>

Run the same experiment with $epsilon = 0$ and observe whether the agent manages to find the optimal policy:

```bash
gridworld.py -a q -k 50 -n 0 -g BridgeGrid -e 0 -l 0.9
```

<details>
<summary>Debug configuration <b>Task 5.2.2</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 5.2.2",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/gridworld.py",
    "args": [
        "-a", "q",
        "-k", "50",
        "-n", "0",
        "-g", "BridgeGrid",
        "-e", "0",
        "-l", "0.9"
    ],
    "console": "integratedTerminal"
}
```
</details>

<details>
<summary>Screenshots (click to expand)</summary>

Q-values $Q(s,a)$
<img src="images/task_522.png">

</details>

Try varying the $\epsilon$-greedy value *epsilon* (`-e`) and the learning rate *alpha* (`-l`). Does your agent find the optimal policy with any combination of these two parameters? Fill in your parameter values for *epsilon* and *alpha* in the table below. If you do not think any combination of the two parameters can find the optimal policy in only 50 episodes, write `NOT POSSIBLE` as the value for both parameters. Also justify your answer, i.e., explain why the optimal policy is found with a certain parameter combination, or why it cannot be found with any parameter combination.

<details>
<summary>Sample Solution (click to expand)</summary>

Settings:

| Parameter | Value         |
|-----------|---------------|
| epsilon   | NOT POSSIBLE  |
| alpha     | NOT POSSIBLE  |

Explanation:

> With plain $\epsilon$-greedy Q-learning, 50 episodes is usually far too few for *BridgeGrid* because the only way to *discover* the +10 goal is to execute a very specific, fairly long action sequence across a narrow bridge, while any small deviation slams you into −100 and poisons the Q-values around the bridge. Once those big negatives propagate, the agent learns to avoid the bridge and settles for the safe +1.

</details>

In [None]:
# Settings:

# -------------------
# Parameter | Value |
# -------------------
# epsilon   |  0.9  |
# alpha     |  0.2  |

# Explanation:
# 

---
## 6. Approximative Q-Learning
---

In the file `qlearningAgents.py`, the class `PacmanQAgent` is defined. It inherits all attributes and methods from the class `QLearningAgent`, but overrides the base class method `getAction(s)` and defines its own constructor `__init__()`. The class `PacmanQAgent` is already fully implemented, meaning it reuses the Q-learning algorithm from `QLearningAgent`, but with different default values for *epsilon* (0.05), *gamma* (0.8), *alpha* (0.2), and *numTraining* (0) in the constructor's parameter list. These default values are better suited for the *Pacman* problem.

Test the Q-learning agent for Pacman with the following Python configuration.

```bash
pacman.py -p PacmanQAgent -x 2000 -n 2010 -l smallGrid
```

<details>
<summary>Debug configuration <b>Task 6.1.1</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 6.1.1",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/pacman.py",
    "args": [
        "-p", "PacmanQAgent",
        "-x", "2000",
        "-n", "2010",
        "-l", "smallGrid"
    ],
    "cwd": "${workspaceFolder}/src",
    "console": "integratedTerminal"
}
```
</details>

The configuration above runs several Pacman games in two phases. The first phase is a training phase, where the Pacman agent will learn the value of different states and actions. Since it takes quite a long time to learn good Q-values even for a small layout, all games in the training phase are run without graphics. When all the training games have finished, the Pacman agent switches to the testing phase. In the testing phase, the agent’s `self.epsilon` and `self.alpha` parameters are set to 0.0 to disable Q-learning and exploration. This allows the Pacman agent to fully exploit the learned policy. Games run in the testing phase are shown graphically by default.

<img src="images/task_611.png">

In the configuration above, 2010 games are played (`-n 2010`), where the first 2000 games are the training phase (`-x 2000`), and the last 10 games belong to the testing phase (and will be displayed graphically). During training, statistics will be printed every 100 training games. Note, however, that epsilon has a nonzero value during training, meaning the Pacman agent may still play poorly even after finding a good policy (because the agent will still perform random exploratory actions with probability *epsilon*). It typically takes about 1000–1400 games before the statistics for 100 training games show positive reward values.

If your `QLearningAgent` code is correct and written generally, the above configuration will run without errors, and the Pacman agent should win at least 80% of the ten games in the test phase (see the output in the built-in terminal in VSCode). If `QLearningAgent` learns a good policy in `gridworld.py` and `crawler.py`, but not for Pacman with the configuration above, this may be because `getAction(s)` and/or `computeActionFromQValues(s)` do not correctly handle previously unseen actions. Since unseen actions have a Q-value of 0, and if all previously tried actions have negative Q-values, an unseen action may actually be the optimal action.

If you want to experiment with the learning parameters, you can add the `-a` flag in the configuration above, e.g.:

```bash
-a epsilon=0.1,alpha=0.3,gamma=0.7
```

<details>
<summary>Debug configuration <b>Task 6.1.2</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 6.1.2",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/pacman.py",
    "args": [
        "-p", "PacmanQAgent",
        "-x", "2000",
        "-n", "2010",
        "-l", "smallGrid",
        "-a", "epsilon=0.1,alpha=0.3,gamma=0.7"
    ],
    "cwd": "${workspaceFolder}/src",
    "console": "integratedTerminal"
}
```
</details>

These values can be accessed in the code via `self.epsilon`, `self.alpha`, and `self.gamma`.

<img src="images/task_612.png">

Note that an MDP state consists of an entire board configuration. A transition from one state to another (a *ply*) consists of multiple complex actions. A situation in the code where the Pacman agent has moved to a new position but the ghosts have not yet moved is not an MDP state. All actions (Pacman and ghosts) occur simultaneously (in real time), so only after all have moved is the new MDP state defined.

Even if the Pacman agent performs well on the `smallGrid` layout, it will not play well on the slightly larger `mediumGrid` layout.

```bash
pacman.py -p PacmanQAgent -x 2000 -n 2010 -l mediumGrid
```

<details>
<summary>Debug configuration <b>Task 6.1.3</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 6.1.3",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/pacman.py",
    "args": [
        "-p", "PacmanQAgent",
        "-x", "2000",
        "-n", "2010",
        "-l", "mediumGrid"
    ],
    "cwd": "${workspaceFolder}/src",
    "console": "integratedTerminal"
}
```
</details>

The reason the Pacman agent does not win on larger layouts is that every board configuration (Pacman’s position, all ghost positions, all food dots and power capsules, etc.) is treated as a unique state with its own Q-value. The Pacman agent therefore cannot generalize the concept that *running into a ghost is bad* across different board configurations. We can fix this with *approximate Q-learning*.

<img src="images/task_613.png">

---
### 6.1 Approximative Q-Learning Agent

In the file `qlearningAgents.py`, the class `ApproximateQAgent` is defined. You will implement an *approximate Q-learning* agent that learns a *weighted linear function* of *features* to represent all Q-values (where multiple states can share the same features). `ApproximateQAgent` inherits from the class `PacmanQAgent`, but overrides the base class methods `getQValue(s,a)` and `update(s,a,s´,r)`, and defines the methods `getWeights()`, `final(s)`, and its own constructor `__init__()`.

In the constructor `__init__()`, an instance variable `self.weights` of type `util.Counter` (a specialized Python dictionary) is created. This instance variable will contain the *weights* of the approximate Q-learning algorithm. The method `getWeights()` returns this instance variable.

Approximate Q-learning uses a feature function `f(s,a)` that accepts a state $s$ and an action $a$ as inputs, and returns a vector of features. You do not need to write these feature functions yourself, since they are already implemented in the file `featureExtractors.py`. By default, the feature function `IdentityExtractor` is used, which is set in the constructor. The only thing you need to remember in your code is to call the feature function using the following line, which returns a feature vector represented as a `util.Counter` (a specialized Python dictionary) where the elements are pairs of `{featureName : featureValue}`, one for each feature:

```python
self.featExtractor.getFeatures(state, action)
```

The method `getQValue(s,a)` accepts two input parameters, a state *state* and an action *action*, and should return the approximate Q-value $Q(state, action)$. In other words, you should implement the approximate Q-function below in this method:

$$
Q(s,a) = \sum_{i=1}^{n} w_i \times f_i(s,a)
$$

where each weight $w_i$ corresponds to a feature $f_i(s,a)$, for $i = 1 \dots n$. In your code, you should represent the weight vector as a `util.Counter`, where each element consists of a pair `{featureName : weightValue}`.
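
The sum above is a dot product between the weight vector and the feature vector. The sketch below shows it with plain dictionaries standing in for `util.Counter`; the function name `approximate_q` and the feature names and values are made up for illustration and are not taken from `featureExtractors.py`.

```python
def approximate_q(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a); features without a weight contribute 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Hypothetical feature names and values for one (state, action) pair.
weights = {"bias": 1.0, "closest-food": -2.0}
features = {"bias": 1.0, "closest-food": 0.25, "ghosts-nearby": 1.0}
print(approximate_q(weights, features))  # 0.5
```

Iterating over the features (and not over the weights) means that a feature whose weight has not been learned yet simply contributes nothing to the Q-value.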

It is crucial that the methods inherited from the base class `QLearningAgent` access Q-values only by calling `getQValue(s,a)`, because you are overriding that method here and want your new `getQValue(s,a)` to be used to compute all approximate Q-values.

The method `update(s,a,s´,r)` accepts four parameters, a state *state*, an action *action*, the subsequent state *nextState*, and a reward *reward*, and should implement the weight update in the approximate Q-learning algorithm shown below:

$$
difference = [r + \gamma \max_{a'} Q(s',a')] - Q(s,a)
$$

$$
w_i \leftarrow w_i + \alpha \times difference \times f_i(s,a)
$$

The weight update should therefore be performed for all weights and features, from $1$ to $n$, where the symbols in the formula correspond to the input parameters as follows:
$s = state$, $a = action$, $s´ = nextState$, and $r = reward$. Also, $\alpha = alpha$ and $\gamma = gamma$.
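
As a small worked example of the two formulas, the sketch below applies one weight update with plain dictionaries. The function `update_weights`, its parameters, and the numbers are assumptions chosen for illustration; in your agent, $Q(s,a)$ and $\max_{a'} Q(s',a')$ would come from your own methods and are passed in here as precomputed values.

```python
def update_weights(weights, features, reward, q_sa, max_q_next, alpha, gamma):
    """w_i <- w_i + alpha * difference * f_i(s,a), applied to every feature."""
    difference = (reward + gamma * max_q_next) - q_sa
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return difference


weights = {"bias": 1.0}
features = {"bias": 1.0, "closest-food": 0.5}  # hypothetical f_i(s,a) values
diff = update_weights(weights, features, reward=2.0, q_sa=1.0, max_q_next=0.0,
                      alpha=0.5, gamma=0.8)
print(diff, weights)  # 1.0 {'bias': 1.5, 'closest-food': 0.25}
```

Note that `difference` is computed once, before any weight is changed, so that every weight is updated with the same error term.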

The method `final(s)` is called at the end of each game and can, for example, be used to print your weights if you want to debug them.

You can test your approximate Q-learning agent with the following Python configuration (this configuration uses the `IdentityExtractor` feature function):

```bash
pacman.py -p ApproximateQAgent -x 2000 -n 2010 -l smallGrid
```

<details>
<summary>Debug configuration <b>Task 6.1.4</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 6.1.4",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/pacman.py",
    "args": [
        "-p", "ApproximateQAgent",
        "-x", "2000",
        "-n", "2010",
        "-l", "smallGrid"
    ],
    "cwd": "${workspaceFolder}/src",
    "console": "integratedTerminal"
}
```
</details>

If you have implemented your code correctly, your approximate Q-learning agent should behave the same as your `PacmanQAgent` on the `smallGrid` layout.

<img src="images/task_614.png">

Once your approximate Q-learning agent works correctly, you can test it using the following Python configuration (which uses the better feature function `SimpleExtractor` on the larger layout `mediumGrid`):

```bash
pacman.py -p ApproximateQAgent -a extractor=SimpleExtractor -x 50 -n 60 -l mediumGrid
```

<details>
<summary>Debug configuration <b>Task 6.1.5</b> for <b>launch.json</b> (click to expand)</summary>

```bash
{
    "name": "Task 6.1.5",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/src/pacman.py",
    "args": [
        "-p", "ApproximateQAgent",
        "-a", "extractor=SimpleExtractor",
        "-x", "50",
        "-n", "60",
        "-l", "mediumGrid"
    ],
    "cwd": "${workspaceFolder}/src",
    "console": "integratedTerminal"
}
```
</details>

Your approximate Q-learning agent should win almost every time, even with only 50 training episodes.

<img src="images/task_615.png">

### A few final words about reinforcement learning agents with neural networks.

You are not building a neural network in this workshop, but a *weighted linear function* of *features* can easily be turned into a *weighted non-linear function* of *features*, i.e., a neural network (for example, using TensorFlow or PyTorch).

In the method `getQValue(s,a)`, a state and action are given as input parameters, and a Q-value is returned. This corresponds exactly to the forward pass in a neural network: $(state, action) \rightarrow \text{Q-value}$

<img src="images/nn1.png">

We can also design the network so that it only takes the state $s$ as input and outputs one $Q(s,a)$ value for each action $a$. This is what DeepMind did with their Atari-playing agent, where $s$ was an image, so the first part of the neural network was a CNN: $state \rightarrow [Q(s,a_1), Q(s,a_2), \dots, Q(s,a_n)]$

<img src="images/nn2.png">

In the method `update(s,a,s´,r)`, the current state $s$, the current action $a$, the next state $s´$, and the reward $r$ are given as inputs, and the *weights* are updated. This corresponds directly to the backward pass in a neural network, i.e., backpropagation with gradient descent (but using a batch of several $(s,a,s´,r)$ samples for each backward pass).