Consider an environment with two states $S = \{S_1, S_2\}$ and two actions $A = \{a_1, a_2\}$, where the (deterministic) transitions $\delta$ and reward $R$ for each state and action are as follows:

$$
\begin{aligned}
\delta(S_1, a_1) &= S_1, \quad & R(S_1, a_1) &= 1 \\
\delta(S_1, a_2) &= S_2, \quad & R(S_1, a_2) &= -2 \\
\delta(S_2, a_1) &= S_1, \quad & R(S_2, a_1) &= 7 \\
\delta(S_2, a_2) &= S_2, \quad & R(S_2, a_2) &= 3
\end{aligned}
$$

(These are the transitions and rewards that the worked answers below rely on.)


## Question 1:
Assuming a discount factor of $\gamma = 0.7$, determine the optimal policy $\pi^*: S \to A$.

## Answer
$$\pi^* (S_1) = a_2$$


$$\pi^* (S_2) = a_1$$
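
One way to check this policy numerically is value iteration followed by a greedy action choice. The sketch below is illustrative only: the `DELTA` and `REWARD` dictionaries are assumptions, inferred from the worked Q-values and the Q-learning trace in this exercise, and the helper names (`backup`, `value_iteration`) are hypothetical.

```python
# Transitions and rewards as implied by the worked answers (an assumption:
# the original transition table is not reproduced here).
STATES = ("S1", "S2")
ACTIONS = ("a1", "a2")
DELTA = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S1", ("S2", "a2"): "S2"}
REWARD = {("S1", "a1"): 1, ("S1", "a2"): -2,
          ("S2", "a1"): 7, ("S2", "a2"): 3}
GAMMA = 0.7


def backup(v, state, action, gamma):
    """One-step lookahead: immediate reward plus discounted value of the successor."""
    return REWARD[state, action] + gamma * v[DELTA[state, action]]


def value_iteration(gamma, sweeps=200):
    """Apply the Bellman optimality backup to every state for a fixed number of sweeps."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(backup(v, s, a, gamma) for a in ACTIONS) for s in STATES}
    return v


v_star = value_iteration(GAMMA)
# The optimal policy is greedy with respect to the converged values.
pi_star = {s: max(ACTIONS, key=lambda a: backup(v_star, s, a, GAMMA)) for s in STATES}
print(pi_star)
```

Under these assumed dynamics the script selects $a_2$ in $S_1$ and $a_1$ in $S_2$, in agreement with the answer above.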


## Question 2
Still assuming $\gamma = 0.7$, determine the value function $V: S \to \mathbb{R}$.

## Answer
$$V(S_1) = 5.69$$

$$V(S_2) = 10.98$$
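
These two values can be checked by hand. As a sketch, assuming the rewards implied by the worked Q-values in this exercise ($-2$ for $a_2$ in $S_1$, which leads to $S_2$, and $+7$ for $a_1$ in $S_2$, which leads back to $S_1$), following $\pi^*$ gives two linear equations in two unknowns:

$$
\begin{aligned}
V(S_1) &= -2 + \gamma V(S_2) \\
V(S_2) &= 7 + \gamma V(S_1) \\
\Rightarrow\quad V(S_1) &= \frac{7\gamma - 2}{1 - \gamma^2} = \frac{2.9}{0.51} \approx 5.69 \\
V(S_2) &= 7 + 0.7 \times 5.69 \approx 10.98
\end{aligned}
$$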

## Question 3
Still assuming $\gamma = 0.7$, determine the values of the Q-function $Q: S \times A \to \mathbb{R}$.

## Answer

$$
\begin{aligned}
Q(S_1, a_1) &= 1 + \gamma V(S_1) = 4.98 \\
Q(S_1, a_2) &= -2 + \gamma V(S_2) = V(S_1) = 5.69 \\
Q(S_2, a_1) &= 7 + \gamma V(S_1) = V(S_2) = 10.98 \\
Q(S_2, a_2) &= 3 + \gamma V(S_2) = 10.69
\end{aligned}
$$
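
The same four numbers follow from $Q(S, a) = R(S, a) + \gamma V(\delta(S, a))$. Here is a minimal sketch, assuming the transitions and rewards implied by the worked answers (the `DELTA` and `REWARD` dictionaries are inferred, not quoted) and taking the rounded values of $V$ from Question 2:

```python
GAMMA = 0.7
# Assumed dynamics, inferred from the worked answers in this exercise.
DELTA = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S1", ("S2", "a2"): "S2"}
REWARD = {("S1", "a1"): 1, ("S1", "a2"): -2,
          ("S2", "a1"): 7, ("S2", "a2"): 3}
# Rounded optimal values from the answer to Question 2.
V = {"S1": 5.69, "S2": 10.98}

# Q(S, a) = R(S, a) + gamma * V(delta(S, a)), rounded to two decimals.
Q = {sa: round(REWARD[sa] + GAMMA * V[DELTA[sa]], 2) for sa in REWARD}

for (state, action), value in Q.items():
    print(f"Q({state}, {action}) = {value}")
```

In each state the larger Q-value belongs to the action chosen by $\pi^*$, and it equals $V$ of that state.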


## Question 4
Still assuming $\gamma = 0.7$, trace through the first few steps of the Q-learning algorithm, assuming a learning rate of 1 and with all Q values initially set to zero. Explain why it is necessary to force exploration through probabilistic choice of actions, in order to ensure convergence to the true Q values.

Here are some hints to get you started:

Since the learning rate is 1 (and the environment deterministic) we can use this Q-Learning update rule:

$$Q(S, a) \leftarrow r(S,a) + \gamma \max_b Q(\delta(S, a), b)$$

Let's assume the agent starts in state $S_1$. Because the initial Q values are all zero, the first action must be chosen randomly. If action $a_1$ is chosen, the agent will get a reward of +1 and the update will be

$$Q(S_1, a_1) \leftarrow 1 + \gamma \times 0 = 1$$


## Answer:
With a deterministic environment and a learning rate of 1, the Q-learning update rule is

$$Q(S,a) \leftarrow r(S,a) + \gamma \max_b Q(\delta(S,a),b)$$

If the agent starts in state $S_1$ and chooses action $a_1$, it will get a reward of +1 and the update will be:

$$Q(S_1, a_1) \leftarrow 1 + \gamma \times 0 = 1$$

If we do **not** force exploration, the agent will always prefer action $a_1$ in state $S_1$ and will never try action $a_2$, because $Q(S_1, a_1) = 1$ already exceeds the initial $Q(S_1, a_2) = 0$. This means that $Q(S_1, a_2)$ will remain zero forever, instead of converging to its true value of 5.69. If we **do** force exploration, the next steps may look like this:

\begin{center}
\begin{tabular}{ c c c }
 \text{current state} & \text{chosen action} & \text{new Q value} \\
 $S_1$ & $a_2$ & $-2 + \gamma \times 0 = -2$ \\
 $S_2$ & $a_2$ & $3 + \gamma \times 0 = 3$
\end{tabular}
\end{center}

At this point the table looks like this:

\begin{center}
\begin{tabular}{ c c c }
 Q & $a_1$ & $a_2$ \\
 $S_1$ & 1 & -2 \\
 $S_2$ & 0 & 3
\end{tabular}
\end{center}

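Again we need to force exploration in order to get the agent to choose $a_1$ from $S_2$ and then to choose $a_2$ again from $S_1$, since a greedy agent would pick $a_2$ in $S_2$ (because $3 > 0$) and $a_1$ in $S_1$ (because $1 > -2$):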

\begin{center}
\begin{tabular}{ c c c }
 \text{current state} & \text{chosen action} & \text{new Q value} \\
 $S_2$ & $a_1$ & $7 + \gamma \times 1 = 7.7$ \\
 $S_1$ & $a_2$ & $-2 + \gamma \times 7.7 = 3.39$
\end{tabular}
\end{center}

The table now looks like this:

\begin{center}
\begin{tabular}{ c c c }
 Q & $a_1$ & $a_2$ \\
 $S_1$ & 1 & 3.39 \\
 $S_2$ & 7.7 & 3
\end{tabular}
\end{center}

Further steps will refine the Q value estimates. Provided every state-action pair keeps being visited, they will converge in the limit to their true values.
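
The trace above can be replayed in a few lines. This is a sketch under stated assumptions: the `DELTA` and `REWARD` dictionaries are inferred from the worked answers, the forced action sequence is the one used in the tables, and the helper names (`q_update`, `fresh_q`) are hypothetical. The second half runs a purely greedy agent after the same first step to show that $Q(S_1, a_2)$ is never updated.

```python
GAMMA = 0.7
STATES = ("S1", "S2")
ACTIONS = ("a1", "a2")
# Assumed dynamics, inferred from the worked answers in this exercise.
DELTA = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S1", ("S2", "a2"): "S2"}
REWARD = {("S1", "a1"): 1, ("S1", "a2"): -2,
          ("S2", "a1"): 7, ("S2", "a2"): 3}


def fresh_q():
    """Q table with every entry initialised to zero."""
    return {(s, a): 0.0 for s in STATES for a in ACTIONS}


def q_update(q, state, action):
    """Learning-rate-1 update for a deterministic environment; returns the next state."""
    nxt = DELTA[state, action]
    q[state, action] = REWARD[state, action] + GAMMA * max(q[nxt, b] for b in ACTIONS)
    return nxt


# 1) Replay the forced (exploratory) action sequence from the tables above.
q = fresh_q()
state = "S1"
for action in ("a1", "a2", "a2", "a1", "a2"):
    state = q_update(q, state, action)
forced = {sa: round(value, 2) for sa, value in q.items()}
print("forced exploration:", forced)

# 2) Same first step, then always act greedily (no exploration).
q_greedy = fresh_q()
greedy_state = q_update(q_greedy, "S1", "a1")  # first random choice assumed to be a1
for _ in range(50):
    greedy_action = max(ACTIONS, key=lambda a: q_greedy[greedy_state, a])
    greedy_state = q_update(q_greedy, greedy_state, greedy_action)
print("greedy only:", {sa: round(value, 2) for sa, value in q_greedy.items()})
```

In the greedy run the agent never leaves $S_1$: $Q(S_1, a_1)$ climbs towards $1/(1-\gamma) \approx 3.33$ while every other entry stays at zero.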

## Question 5
Now let's consider how the value function changes as the discount factor $\gamma$ varies between 0 and 1.
There are four deterministic policies for this environment, one for each choice of action in $S_1$ combined with each choice of action in $S_2$.

Calculate the value function for each of these four policies (keeping $\gamma$ as a variable).

## Answer:
<img src="../out/images/5a_5.png"/>
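
A symbolic answer can be spot-checked numerically by evaluating each fixed policy at a chosen $\gamma$. The sketch below uses iterative policy evaluation; the `DELTA` and `REWARD` dictionaries are assumptions inferred from the worked answers earlier in this exercise, and policies are represented as plain dictionaries rather than with any particular naming scheme.

```python
from itertools import product

STATES = ("S1", "S2")
ACTIONS = ("a1", "a2")
# Assumed dynamics, inferred from the worked answers in this exercise.
DELTA = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S1", ("S2", "a2"): "S2"}
REWARD = {("S1", "a1"): 1, ("S1", "a2"): -2,
          ("S2", "a1"): 7, ("S2", "a2"): 3}


def evaluate(policy, gamma, sweeps=2000):
    """Iterative policy evaluation: V(s) <- R(s, pi(s)) + gamma * V(delta(s, pi(s)))."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: REWARD[s, policy[s]] + gamma * v[DELTA[s, policy[s]]] for s in STATES}
    return v


# All four deterministic policies: one action per state.
POLICIES = [dict(zip(STATES, choice)) for choice in product(ACTIONS, repeat=len(STATES))]

for policy in POLICIES:
    values = evaluate(policy, gamma=0.7)
    print(policy, {s: round(x, 2) for s, x in values.items()})
```

For example, the policy that always picks $a_1$ keeps the agent in $S_1$ with reward 1 per step, so its value there should match the geometric series $1/(1-\gamma)$.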

## Question 6
Determine for which range of values of $\gamma$ each of the policies is optimal.

## Answer:
<img src="../out/images/5a_6.png"/>

<img src="../out/images/5a_6b.png"/>
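
The ranges can be explored numerically by sweeping $\gamma$ and computing the optimal policy at each value. This is a rough sketch rather than a derivation: the dynamics are assumed (inferred from the worked answers above), the function name `optimal_actions` is hypothetical, and the grid is chosen to avoid values where two policies tie.

```python
STATES = ("S1", "S2")
ACTIONS = ("a1", "a2")
# Assumed dynamics, inferred from the worked answers in this exercise.
DELTA = {("S1", "a1"): "S1", ("S1", "a2"): "S2",
         ("S2", "a1"): "S1", ("S2", "a2"): "S2"}
REWARD = {("S1", "a1"): 1, ("S1", "a2"): -2,
          ("S2", "a1"): 7, ("S2", "a2"): 3}


def optimal_actions(gamma, sweeps=2000):
    """Value iteration, then the greedy action for S1 and S2 (in that order)."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(REWARD[s, a] + gamma * v[DELTA[s, a]] for a in ACTIONS)
             for s in STATES}
    return tuple(
        max(ACTIONS, key=lambda a: REWARD[s, a] + gamma * v[DELTA[s, a]])
        for s in STATES
    )


# Sweep gamma over 0.05, 0.15, ..., 0.95 and report the optimal action pair.
for k in range(10):
    gamma = 0.05 + 0.1 * k
    print(f"gamma = {gamma:.2f}: optimal actions (S1, S2) = {optimal_actions(gamma)}")
```

A finer grid around the points where the printed pair changes narrows down the boundaries between ranges.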

## Question 1
Describe the elements (sets and functions) that are needed to give a formal description of a reinforcement learning environment. What is the difference between a deterministic environment and a stochastic environment?

<img src="../out/images/revision 7 q1.png"/>

## Question 2
Name three different models of optimality in reinforcement learning, and give a formula for calculating each one.

<img src="../out/images/revision 7 q2.png"/>
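
As an illustration, here is how three commonly cited models of optimality could be computed on a finite list of rewards. This sketch assumes the models intended are finite-horizon reward, infinite discounted reward, and average reward; the function names are hypothetical, and the last two are truncated to the rewards supplied rather than taken to a true limit.

```python
def finite_horizon_return(rewards, horizon):
    """Sum of the first `horizon` rewards."""
    return sum(rewards[:horizon])


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, truncated to the rewards supplied."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def average_reward(rewards):
    """Mean reward per step over the rewards supplied (finite approximation of the limit)."""
    return sum(rewards) / len(rewards)


rewards = [1, 2, 3, 4]
print(finite_horizon_return(rewards, horizon=2))
print(discounted_return(rewards, gamma=0.5))
print(average_reward(rewards))
```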

## Question 3
What is the definition of:
- the optimal policy
- the value function
- the Q-function?

<img src="../out/images/revision 7 q3.png"/>

## Question 4
Assuming a stochastic environment, discount factor $\gamma$ and learning rate of $\eta$, write the equation for
- Temporal Difference learning TD(0)
- Q-Learning

Remember to define any symbols you use.

<img src="../out/images/revision 7 q4.png"/>
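
In tabular form, both updates move the current estimate a fraction $\eta$ of the way towards a one-step target. The sketch below shows the standard textbook form of each rule; the function and parameter names are hypothetical, with `v` and `q` being plain dictionaries, `r` the observed reward, and `s_next` the observed next state.

```python
def td0_update(v, s, r, s_next, gamma, eta):
    """TD(0): V(s) <- V(s) + eta * (r + gamma * V(s_next) - V(s))."""
    v[s] += eta * (r + gamma * v[s_next] - v[s])


def q_learning_update(q, s, a, r, s_next, actions, gamma, eta):
    """Q-learning: Q(s,a) <- Q(s,a) + eta * (r + gamma * max_b Q(s_next,b) - Q(s,a))."""
    best_next = max(q[s_next, b] for b in actions)
    q[s, a] += eta * (r + gamma * best_next - q[s, a])


# Small usage example with made-up starting estimates.
v = {"S1": 0.0, "S2": 10.0}
td0_update(v, "S1", r=-2, s_next="S2", gamma=0.7, eta=0.5)

actions = ("a1", "a2")
q = {("S1", "a1"): 1.0, ("S1", "a2"): -2.0, ("S2", "a1"): 7.7, ("S2", "a2"): 3.0}
q_learning_update(q, "S1", "a2", r=-2, s_next="S2", actions=actions, gamma=0.7, eta=1.0)
print(v["S1"], round(q["S1", "a2"], 2))
```

With $\eta = 1$ the Q-learning rule discards the old estimate entirely, which is only safe when the environment is deterministic; in a stochastic environment a smaller $\eta$ averages over the sampled outcomes.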