Osnabrück University - Machine Learning (Summer Term 2021) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack, Axel Schaffland

# Exercise Sheet 11

## Introduction

This week's sheet should be solved and handed in before **2:00pm of Tuesday, July 05, 2021**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whomever of us you run into first. Please upload your results to your group's Stud.IP folder.

## Assignment 1: Probability Theory [4 Points]

Consider three bags filled with three types of candy. The table below shows how many candies of each type each bag contains.


| contains        | green candy | blue candy | red candy | **total** |
| --------------- | ----------- | ---------- | --------- | --------- |
| **cyan bag**    |          10 |          4 |         2 |    **16** |
| **magenta bag** |           5 |          7 |         2 |    **14** |
| **yellow bag**  |           2 |          2 |         8 |    **12** |
| **total**       |      **17** |     **13** |    **12** |    **42** |


In the following, we denote the bag by $B \in \{c,m,y\}$ and the candy by $C \in \{r,g,b\}$. For example, the probability of drawing a blue candy from the cyan bag is $P(C=b|B=c)=\frac{4}{16}=0.25$.
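To make the notation concrete, the table can be encoded as a count matrix; normalising each row by its bag total gives the conditional distribution $P(C|B)$. This is a minimal NumPy sketch (the variable names are our own choice) that reproduces the example value above:

```python
import numpy as np

# Candy counts from the table: rows are bags (cyan, magenta, yellow),
# columns are candies (green, blue, red).
bags = ["c", "m", "y"]
candies = ["g", "b", "r"]
counts = np.array([[10, 4, 2],
                   [5, 7, 2],
                   [2, 2, 8]])

# Conditional distribution P(C | B): divide each row by its bag total.
p_candy_given_bag = counts / counts.sum(axis=1, keepdims=True)

# P(C=b | B=c) = 4/16
print(p_candy_given_bag[bags.index("c"), candies.index("b")])  # 0.25
```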

### a)

Give a verbal description of the following events and compute their probabilities:
$$
\begin{align*}
P(C=b|B=m) &= ? \\
P(C=g|B=y) &= ? \\
\end{align*}
$$

Drawing a 

### b)

Now assume that you first choose a bag at random, each bag being equally likely, and then randomly draw a candy from that bag. What is the probability that the candy is red?

$$P(C=r) = ?$$
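A hand-computed answer can be checked empirically by simulating exactly this two-stage process many times. The following is a minimal NumPy sketch, assuming inverse-CDF sampling for the candy; the function name `simulate_draws`, the sample size and the seed are our own arbitrary choices, and the printed value is only a Monte Carlo estimate, not the exact probability:

```python
import numpy as np

# Candy counts: rows = bags (cyan, magenta, yellow), columns = candies (green, blue, red).
counts = np.array([[10, 4, 2],
                   [5, 7, 2],
                   [2, 2, 8]])
p_candy_given_bag = counts / counts.sum(axis=1, keepdims=True)

def simulate_draws(p_bag, n, seed=0):
    """Simulate n two-stage draws: pick a bag according to p_bag, then a candy from it.

    Returns two integer arrays (bag indices, candy indices)."""
    rng = np.random.default_rng(seed)
    bag_idx = rng.choice(len(p_bag), size=n, p=p_bag)
    # Inverse-CDF sampling of the candy, using the cumulative P(C | B) row of each bag.
    cdf = np.cumsum(p_candy_given_bag, axis=1)
    cdf[:, -1] = 1.0  # guard against rounding so every u < 1 falls into some bin
    u = rng.random(n)
    # Number of CDF entries <= u is the index of the sampled candy (0, 1 or 2).
    candy_idx = (u[:, None] >= cdf[bag_idx]).sum(axis=1)
    return bag_idx, candy_idx

bag_idx, candy_idx = simulate_draws(np.full(3, 1 / 3), n=100_000)
print("estimated P(C=r):", np.mean(candy_idx == 2))
```

Filtering the simulated draws (for example, looking only at draws where `candy_idx == 2`) gives a similar empirical check for conditional questions.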

YOUR ANSWER HERE

### c)

Someone has drawn a red candy from one of the bags but does not tell you which bag it came from.
Compute the probability that it came from the yellow bag:

$$P(B=y|C=r) = ?$$
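For reference, Bayes' rule together with the law of total probability (summing over all bags $b'$) expresses the requested posterior through the conditional probabilities from the table and the probabilities $P(B=b')$ of choosing each bag:

$$
P(B=y|C=r) = \frac{P(C=r|B=y)\,P(B=y)}{\sum_{b' \in \{c,m,y\}} P(C=r|B=b')\,P(B=b')}
$$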

YOUR ANSWER HERE

### d)

Now assume that the bags are chosen with the following probabilities: $P(B=c)=0.2$, $P(B=m)=0.7$, $P(B=y)=0.1$.
What are the probabilities of drawing a green, a blue, or a red candy?
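With a non-uniform bag distribution, the law of total probability $P(C=k) = \sum_{b} P(C=k|B=b)\,P(B=b)$ becomes a single vector-matrix product. A minimal sketch, where `candy_marginal` is a hypothetical helper name of our own:

```python
import numpy as np

counts = np.array([[10, 4, 2],   # cyan:    green, blue, red
                   [5, 7, 2],    # magenta
                   [2, 2, 8]])   # yellow
p_candy_given_bag = counts / counts.sum(axis=1, keepdims=True)

def candy_marginal(p_bag):
    """Law of total probability: P(C=k) = sum_b P(C=k | B=b) P(B=b)."""
    p_bag = np.asarray(p_bag, dtype=float)
    # (3,) @ (3, 3) -> (3,): weights each bag's candy distribution by its probability.
    return p_bag @ p_candy_given_bag

# P(C=g), P(C=b), P(C=r) for the bag probabilities given in d)
p_candy = candy_marginal([0.2, 0.7, 0.1])
```

Passing `np.full(3, 1 / 3)` instead corresponds to the uniform choice of bags from b).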

YOUR ANSWER HERE

## Assignment 2: Bayes classifier [4 Points]

Consider the following data set. There are four features, runny nose ($N$), coughing ($C$), reddened skin ($R$), and fever ($F$), each of which can take the values true ($+$) or false ($-$).

| Diagnosis ID  | $N$ | $C$ | $R$ | $F$ | Classification     |
|---------------|-----|-----|-----|-----|--------------------|
|     $d_1$     | $+$ | $+$ | $+$ | $-$ | positive (ill)     |
|     $d_2$     | $+$ | $+$ | $-$ | $-$ | positive (ill)     |
|     $d_3$     | $-$ | $-$ | $+$ | $+$ | positive (ill)     |
|     $d_4$     | $+$ | $-$ | $-$ | $-$ | negative (healthy) |
|     $d_5$     | $-$ | $-$ | $-$ | $-$ | negative (healthy) |
|     $d_6$     | $-$ | $+$ | $+$ | $-$ | negative (healthy) |

Solve the following problems either by hand or programmatically. Assume that all features are conditionally independent given the class.
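If you choose the programmatic route, one possible starting point is to encode the table as boolean arrays and estimate the probabilities by counting (maximum likelihood, no smoothing). This is a minimal NumPy sketch; the names `X`, `y` and `p_ill_given` are our own, and `p_ill_given` assumes a person is described by all four features (a feature that is not mentioned has to be set to true or false explicitly):

```python
import numpy as np

# Data set from the table: columns N, C, R, F (True = "+"), one row per diagnosis d_1..d_6.
X = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 1, 0]], dtype=bool)
y = np.array([1, 1, 1, 0, 0, 0], dtype=bool)  # True = ill

# Maximum-likelihood estimates by counting (no smoothing).
p_ill = y.mean()                            # P(ill)
p_feat_given_ill = X[y].mean(axis=0)        # P(feature = + | ill), one value per feature
p_feat_given_healthy = X[~y].mean(axis=0)   # P(feature = + | healthy)

def p_ill_given(x):
    """Posterior P(ill | x) under the conditional-independence assumption.

    x is a boolean vector (N, C, R, F)."""
    x = np.asarray(x, dtype=bool)
    # Likelihood of each observed value: p if the feature is +, 1 - p if it is -.
    lik_ill = np.prod(np.where(x, p_feat_given_ill, 1 - p_feat_given_ill))
    lik_healthy = np.prod(np.where(x, p_feat_given_healthy, 1 - p_feat_given_healthy))
    joint_ill = lik_ill * p_ill
    joint_healthy = lik_healthy * (1 - p_ill)
    return joint_ill / (joint_ill + joint_healthy)
```

Note that with pure counting, a feature value never observed in one class (in this table, fever among the healthy) gives that class a likelihood of exactly zero.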

### a)

Determine all probabilities required to apply a Bayes classifier for predicting whether a new person is ill or not.

YOUR ANSWER HERE

### b)
Person $p_1$ is coughing and has a fever. Person $p_2$ has a runny nose and reddened skin. Person $p_3$ is coughing, has reddened skin, and has a fever. Determine the probability of being ill for each of the persons $p_1, p_2, p_3$.

YOUR ANSWER HERE

## Assignment 3: Reinforcement Learning Theory [4 Points]


### a) Weak teacher

Reinforcement learning is often described as being different from both supervised and unsupervised learning by providing a "weak teacher". Who is this "teacher" and why is she "weak"?

YOUR ANSWER HERE

### b) Markov decision process

Reinforcement learning is usually restricted to first-order Markov decision processes. What does this mean, and what are the practical consequences? How would the formulae change for a second-order Markov decision process?

YOUR ANSWER HERE

### c) The Q-function

The Q-function can be written as $Q(s,a) = r(s,a) + \gamma \max_{a'} Q(s',a')$, where $s'$ is the state reached by performing action $a$ in state $s$.
Explain this function in your own words. What does the Q-value represent? What is the problem with that formula and how is this problem resolved in Q-learning? How would you represent this function when implementing Q-learning?
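To see the recursive definition at work, the following sketch solves it on a hypothetical 3-cell corridor (rewards `[0, 0, 10]`, actions left/right, moving into a wall leaves the agent in place, reward taken from the cell that is entered, $\gamma = 0.5$). It repeatedly applies the right-hand side as an update until the values stop changing. This is a dynamic-programming fixed-point iteration, not Q-learning itself: it needs $r$ and the transition function to be known in advance.

```python
import numpy as np

# Hypothetical toy world: a 1-D corridor of 3 cells with rewards [0, 0, 10].
# Actions: 0 = left, 1 = right; moving into a wall leaves the agent in place.
field = np.array([0.0, 0.0, 10.0])
n_states, n_actions, gamma = len(field), 2, 0.5

def next_state(s, a):
    """Deterministic transition: s' for action a in state s (clipped at the borders)."""
    return min(max(s + (1 if a == 1 else -1), 0), n_states - 1)

# Q stored as a table: one row per state, one column per action.
Q = np.zeros((n_states, n_actions))
# Repeatedly apply Q(s,a) = r(s,a) + gamma * max_a' Q(s',a') with r(s,a) = field[s'].
for _ in range(100):
    Q_new = np.empty_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            s2 = next_state(s, a)
            Q_new[s, a] = field[s2] + gamma * Q[s2].max()
    Q = Q_new

# The values converge to [[5, 10], [5, 20], [10, 20]] (rows: states, columns: left/right).
print(Q)
```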

YOUR ANSWER HERE

### d) Goal state

Games often have a goal state (game won or lost), and so does the maze example from the lecture slides. Discuss the role of this goal state in the Q-learning algorithm. Describe a learning scenario without a goal state.

YOUR ANSWER HERE

## Assignment 4: Reinforcement Learning [8 Points]

In this assignment you will take a closer look at the Q-learning algorithm described in the lecture (ML-10, Slide 18). To this end, we generate a field with random rewards. A learning agent then explores the field and learns an optimal way to navigate through it. The code below again contains some `YOUR CODE HERE` placeholders that you should fill in to implement the Q-learning algorithm.

Below the code you will find some questions, followed by a free code cell for a complete implementation of your own. You may use your own test mazes.
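Before working on the 2-D field, the core idea can be tried on a hypothetical 3-cell corridor (rewards `[0, 0, 10]`, actions left/right, walls keep the agent in place, $\gamma = 0.5$). The agent explores with random actions and, assuming a deterministic world, updates its table estimate with $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$ using only the transitions it actually observes. This is a minimal sketch of that deterministic update, not a drop-in solution for the `QLearning` class below:

```python
import numpy as np

# Hypothetical 3-cell corridor; the reward is the value of the cell that is entered.
field = np.array([0.0, 0.0, 10.0])
n_states, n_actions, gamma = len(field), 2, 0.5

def next_state(s, a):
    """0 = left, 1 = right; bumping into a wall leaves the agent where it is."""
    return min(max(s + (1 if a == 1 else -1), 0), n_states - 1)

rng = np.random.default_rng(0)
Q_hat = np.zeros((n_states, n_actions))
s = int(rng.integers(n_states))          # random start cell
for _ in range(2000):
    a = int(rng.integers(n_actions))     # explore: pick a random action
    s2 = next_state(s, a)                # execute it
    r = field[s2]                        # observe the reward of the new cell
    # Deterministic Q-learning update from the observed transition (s, a, r, s2).
    Q_hat[s, a] = r + gamma * Q_hat[s2].max()
    s = s2                               # continue from the new cell

print(Q_hat.round(2))
```

With enough random exploration, every (state, action) pair is visited repeatedly and the table approaches the values obtained by solving the Q-function's recursive definition directly.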

In [None]:
import numpy as np
import numpy.random as rand

def generate_field(x, y, num_rewards, max_reward):
    """
    Generate a random game field with rewards.
    
    Args:
        x (int):            x dimension of the field
        y (int):            y dimension of the field 
        num_rewards (int):  the number of rewards that should be randomly placed
        max_reward (int):   the maximum reward that can be placed 
        
    Returns:
        ndarray: A field with randomly initialized rewards, the rest of the 
        entries is zero
    """
    
    # Change or comment out to get different random data in each run
    np.random.seed(42)
    
    field = np.zeros((y,x), dtype=np.uint8)
    
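    for _ in range(num_rewards):
        # randint's upper bound is exclusive, so rewards lie in 1..max_reward;
        # a reward landing on an occupied cell overwrites the previous one.
        field[rand.randint(y), rand.randint(x)] = rand.randint(1, max_reward + 1)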
    
    return field

In [None]:
%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects


class QLearning:
    """
    This class contains all the necessary methods to navigate through
    a maze or game with the help of a little bit of Q-Learning.
    """

    def __init__(self, field, actions, gamma):
        """
        Initializes the QLearning Algorithm with the necessary parameters.
        All q values are stored in self.q - an array of shape
        (len(actions), map_y, map_x) that stores a value for each action
        in each cell. The starting position self.pos = [y, x] is randomly initialized.
        
        Args:
            field (ndarray):  the map
            actions (list):   the available actions
            gamma (float):    the gamma in the lecture slides
        
        Returns:
            QLearning: An instance that can be used for Q-Learning on the field
        """
        # Store the field, the available actions and the discount factor.
        self.field = field
        self.actions = actions
        self.gamma = gamma
        
        # Remember the map extent for further navigation.
        self.map_y = self.field.shape[0]
        self.map_x = self.field.shape[1]
        
        # Create q value matrix.
        self.q = np.zeros((len(self.actions), self.map_y, self.map_x))

        # Start on a random position in the field.
        self.pos = [np.random.randint(self.map_y), np.random.randint(self.map_x)]
        self.fig, self.axes = plt.subplots(3, 3, num='QLearning State')
        for ax in self.axes.flat:
            ax.axis('off')

    def get_coordinates(self, position, action):
        """
        Returns the coordinates that follow a certain action, depending
        on the current position of the learner. If the border is reached
        the agent just stops there.
        
        Args:
            position (pair):  the current position
            action (string):  the action that should be performed (one of: 'up', 'down', ...)
            
        Returns:
            pair of int: the updated coordinates
        """
        # return the right new coordinates depending on the position
        # YOUR CODE HERE


    def update(self):
        """
        Implementation of the update step. Closely follows the algorithm described on
        ML-10, Slide 18. Note that the attributes set in the __init__ method are
        available; in particular, self.field stores the real field the agent moves
        through, and self.actions stores the available actions.
        """
        # Select a random action that should be performed next.
        # Be careful to handle the case where you hit the wall!
        # YOUR CODE HERE

        # Receive the reward for the new position from the field.
        # YOUR CODE HERE
        
        # Update the q-value for the performed action.
        # YOUR CODE HERE

        # Update the position of the player to the new field.
        # YOUR CODE HERE


    def plot(self):
        """
        Plots the current state.
        """
        fs = 8
        for i, action in enumerate(self.actions):
            ax = self.axes.flat[2*i + 1]
            ax.cla()
            ax.set(title=action)
            ax.set_xticks(np.arange(self.q[i,:,:].shape[1]))
            ax.set_yticks(np.arange(self.q[i,:,:].shape[0]))
            ax.imshow(self.q[i,:,:], interpolation='None')
            for j in range(self.q.shape[1]):
                for k in range(self.q.shape[2]):
                    text = ax.text(k, j, "{:.1f}".format(self.q[i, j, k]),
                                   ha="center", va="center", color="black", fontsize=fs)
                    plt.setp(text, path_effects=[
                        PathEffects.withStroke(linewidth=1, foreground="w")])

        self.fig.canvas.draw()
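
The behaviour described in the `get_coordinates` docstring (stop at the border) amounts to clamping the new coordinates to the field. A minimal sketch as a standalone, hypothetical helper (`clipped_step` and `OFFSETS` are our own names, not part of the class), assuming positions are `[y, x]` like `self.pos` and that "up" decreases the row index, since `imshow` draws row 0 at the top:

```python
# Hypothetical (dy, dx) offsets per action; "up" is assumed to decrease the row index.
OFFSETS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def clipped_step(position, action, map_y, map_x):
    """Return the [y, x] reached by `action` from `position`, staying put at a border."""
    dy, dx = OFFSETS[action]
    y = min(max(position[0] + dy, 0), map_y - 1)
    x = min(max(position[1] + dx, 0), map_x - 1)
    return [y, x]

print(clipped_step([0, 2], "up", map_y=4, map_x=5))     # [0, 2]  (border: no move)
print(clipped_step([0, 2], "right", map_y=4, map_x=5))  # [0, 3]
```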

In [None]:
%matplotlib notebook

import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects


# Determine the size of the field; change these parameters as you like.
m_x = 5
m_y = 4

steps = 500

actions = ['up','left','right','down']  # These are the available actions for the QLearning.
field = generate_field(m_x, m_y, num_rewards=5, max_reward=10) # The field that is used for learning.

# Plotting the generated field
fs = 18
figure, ax = plt.subplots()
#plt.axis('off')
ax.imshow(field, interpolation='none')
ax.set_xticks(np.arange(field.shape[1]))
ax.set_yticks(np.arange(field.shape[0]))
for j in range(field.shape[0]):
    for k in range(field.shape[1]):
        text = plt.text(k, j, field[j,k], ha="center", va="center", color="black", fontsize=fs)
        plt.setp(text, path_effects=[PathEffects.withStroke(linewidth=3, foreground="w")])

figure.suptitle("Field",fontsize=fs)          
figure.canvas.draw()


# Generate a QLearning instance named `player` with the right parameters.
# YOUR CODE HERE

# Now we perform `steps` learning iterations on the field with
# the generated QLearning instance.
for i in range(steps):
    player.update()
    player.plot()

Explain in your own words how the algorithm works. What is depicted in the resulting plots? How can an action policy be derived from these data?
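As a hint on the last question: one common way to read a policy off a Q-table is to act greedily, choosing in each cell the action with the highest Q-value. A minimal sketch, assuming an array laid out like `QLearning.q`, i.e. `(len(actions), map_y, map_x)`; the random values here are placeholders only (in the notebook one would use `player.q`):

```python
import numpy as np

actions = ['up', 'left', 'right', 'down']
# Placeholder Q-array with the same layout as QLearning.q: (actions, map_y, map_x).
q = np.random.default_rng(0).random((len(actions), 4, 5))

# Greedy policy: in every cell pick the action with the highest Q-value.
best = np.argmax(q, axis=0)          # shape (map_y, map_x), action indices
policy = np.array(actions)[best]     # same shape, action names
print(policy)
```

Ties (for example, all zeros in cells that were never updated) are resolved by `np.argmax` in favour of the first action in the list.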

YOUR ANSWER HERE

You are also free to write your complete own implementation of the QLearning algorithm (instead of completing the code above). Use the following cell for your implementation.

In [None]:
# YOUR CODE HERE