# Reinforcement Learning

## The RL Framework: The Problem

### Episodic Task

- well-defined ending point (terminal state)
- the reward can be sparse (for example, chess): it is only given at the end of the episode, after all actions and states

Interaction ends at some time step $T$.

$S_0, A_0, R_1, S_1, A_1, \dots, R_T, S_T$ (game over) -> evaluate the score reached in the episode

### Continuing Task
- The interaction has no terminal state: the agent "lives forever"

Example: Stock Trading

### Reward Hypothesis

- All goals can be framed as the maximization of expected cumulative reward

Example of the walking robot:

**Actions:** Forces applied to joints

**States:** The positions and velocities of the joints, statistics about ground, and foot sensor data

![explanation reward for humanoid walking agent](images/rl1.png)

### Cumulative Return

The agent does not maximize the reward of an individual step, but the expected return $G_t$, the sum of all future rewards: $G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$

### Discounted Return

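- A reward received sooner is valued higher -> future rewards are discounted.
- With a discount rate $0 \le \gamma \le 1$: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
- Important for continuing tasks, where the undiscounted sum of rewards could grow without bound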
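
As a minimal sketch, the discounted return can be computed from a list of rewards by working backwards with $G_t = R_{t+1} + \gamma G_{t+1}$. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards [R_{t+1}, R_{t+2}, ...] and discount rate gamma."""
    g = 0.0
    # Work backwards so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With `gamma=1.0` the same function gives the undiscounted cumulative return.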

### Markov Decision Process (MDP)

![overview of MDPs](images/rl2.png)

#### Exercise and Mathematical Notation

![mathematical expression of options](images/rl3.png)

#### Definition

A (finite) MDP is defined by:

- a (finite) set of states
- a (finite) set of actions
- a (finite) set of rewards
- the one-step dynamics of the environment: $p(s', r \mid s, a) = \mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$
- the discount rate $\gamma$
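
The five components of this definition can be written down as plain data. The following is a sketch of a hypothetical two-state MDP; the state names, action names, probabilities, and rewards are invented for illustration, and `is_valid_mdp` is a helper introduced here only to check that the dynamics form proper probability distributions.

```python
# Hypothetical two-state MDP, for illustration only.
states = ["s1", "s2"]
actions = ["stay", "move"]
gamma = 0.9  # discount rate

# One-step dynamics p(s', r | s, a), stored as
# dynamics[(s, a)] = list of (probability, next_state, reward).
dynamics = {
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "move"): [(0.8, "s2", 1.0), (0.2, "s1", 0.0)],
    ("s2", "stay"): [(1.0, "s2", 0.0)],
    ("s2", "move"): [(1.0, "s1", -1.0)],
}

# The finite set of rewards is implied by the dynamics.
rewards = sorted({r for outcomes in dynamics.values() for _, _, r in outcomes})


def is_valid_mdp(states, actions, dynamics):
    """Check that every (s, a) has outcomes whose probabilities sum to 1."""
    for s in states:
        for a in actions:
            outcomes = dynamics[(s, a)]
            if abs(sum(p for p, _, _ in outcomes) - 1.0) > 1e-9:
                return False
            if any(s_next not in states for _, s_next, _ in outcomes):
                return False
    return True


print(is_valid_mdp(states, actions, dynamics))  # True
```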

## The RL Framework: The Solution

A policy describes the mapping of states to actions:

- deterministic policy: $\pi: S \to A$
- stochastic policy: $\pi: S \times A \to [0, 1]$, where $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$
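
The two kinds of policy can be sketched as lookup tables. The state and action names below are made up, and `sample_action` is a hypothetical helper that draws an action according to the probabilities of a stochastic policy.

```python
import random

# Deterministic policy: exactly one action per state.
deterministic_policy = {"s1": "right", "s2": "down"}

# Stochastic policy: a probability for every (state, action) pair.
stochastic_policy = {
    "s1": {"right": 0.5, "down": 0.5},
    "s2": {"right": 0.1, "down": 0.9},
}


def sample_action(policy, state, rng=random):
    """Draw an action from the distribution pi(. | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


print(deterministic_policy["s1"])              # always "right"
print(sample_action(stochastic_policy, "s2"))  # "down" about 90% of the time
```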

![value function for a policy](images/rl4.png)

### The Bellman Equation

![Bellman Equation Pt1](images/rl5.png)
![Bellman Equation Pt2](images/rl6.png)
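
For reference, the Bellman expectation equation for the state-value function can be written out as follows. This is the standard textbook form, given here under the assumption that it matches the notation of the figures ($p$ denotes the one-step dynamics, $\pi(a \mid s)$ the policy, $\gamma$ the discount rate):

$$
v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s \right]
         = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_\pi(s') \right]
$$

The value of a state is the expected immediate reward plus the discounted value of the state that follows.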

#### State-Value Function vs. Action-Value Function

![State-Value Function vs. Action-Value Function comparison](images/rl7.png)

## Dynamic Programming

#### Iterative Policy Evaluation

1. Initialize the estimate $V(s)$ of $v_\pi(s)$ to 0 for every state
2. Iterate through the states and update the estimate of each state with the Bellman equation, for example:
   $V(s_1) \leftarrow \frac{1}{2}(-1 + V(s_2)) + \frac{1}{2}(-1 + V(s_3))$

**In Summary:**

![Pseudocode to update the value function with the Bellman equation](images/rl8.png)
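
A minimal Python sketch of iterative policy evaluation follows. The four-state gridworld, its rewards (-1 per move, +5 for reaching the terminal state `s4`), and the equiprobable policy are assumptions chosen so that the update for `s1` matches the example update above; they are not taken from the figure.

```python
def policy_evaluation(states, policy, dynamics, gamma=1.0, theta=1e-8):
    """Sweep the Bellman expectation update until the largest change is below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s not in policy:  # terminal state: value stays 0
                continue
            v_new = sum(
                pi_a * prob * (r + gamma * V[s_next])
                for a, pi_a in policy[s].items()
                for prob, s_next, r in dynamics[(s, a)]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


# Hypothetical 2x2 gridworld; s4 is terminal.
states = ["s1", "s2", "s3", "s4"]
dynamics = {  # (state, action) -> [(probability, next_state, reward)]
    ("s1", "right"): [(1.0, "s2", -1.0)],
    ("s1", "down"): [(1.0, "s3", -1.0)],
    ("s2", "left"): [(1.0, "s1", -1.0)],
    ("s2", "down"): [(1.0, "s4", 5.0)],
    ("s3", "up"): [(1.0, "s1", -1.0)],
    ("s3", "right"): [(1.0, "s4", 5.0)],
}
policy = {  # equiprobable random policy
    "s1": {"right": 0.5, "down": 0.5},
    "s2": {"left": 0.5, "down": 0.5},
    "s3": {"up": 0.5, "right": 0.5},
}

V = policy_evaluation(states, policy, dynamics)
print({s: round(v, 3) for s, v in V.items()})
```

For this assumed gridworld the estimates converge to roughly 2 for `s1` and 3 for `s2` and `s3`, which can be checked by solving the three Bellman equations by hand.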

#### Calculate the Action-Value Function

![example to calculate the action-value function](images/rl9.png)
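
When the one-step dynamics are known, the action values follow from the state values via $q(s, a) = \sum_{s', r} p(s', r \mid s, a) [r + \gamma v(s')]$. The sketch below applies this to a hypothetical gridworld; the dynamics and the state values in `V` are assumed for illustration.

```python
def q_from_v(V, dynamics, gamma=1.0):
    """Compute q(s, a) for every (s, a) from state values and one-step dynamics."""
    return {
        (s, a): sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in outcomes)
        for (s, a), outcomes in dynamics.items()
    }


# Hypothetical dynamics: (state, action) -> [(probability, next_state, reward)]
dynamics = {
    ("s1", "right"): [(1.0, "s2", -1.0)],
    ("s1", "down"): [(1.0, "s3", -1.0)],
    ("s2", "left"): [(1.0, "s1", -1.0)],
    ("s2", "down"): [(1.0, "s4", 5.0)],
}
# Assumed state values of some policy; s4 is terminal.
V = {"s1": 2.0, "s2": 3.0, "s3": 3.0, "s4": 0.0}

Q = q_from_v(V, dynamics)
print(Q[("s2", "left")], Q[("s2", "down")])  # 1.0 5.0
```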

#### Policy Improvement

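Policy improvement starts from the action-value function calculated in the previous step: for each state, the improved policy picks the action with the highest action value.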

![How to get to the policy improvement](images/rl10.png)
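
The improvement step can be sketched as taking the greedy action per state. The function name `improve_policy` and the action values in `Q` are hypothetical; ties are resolved in favor of the action encountered first.

```python
def improve_policy(Q):
    """Return a deterministic policy that is greedy with respect to Q."""
    best = {}  # state -> (best action so far, its value)
    for (s, a), q in Q.items():
        if s not in best or q > best[s][1]:
            best[s] = (a, q)
    return {s: a for s, (a, _) in best.items()}


# Hypothetical action values: (state, action) -> value
Q = {
    ("s1", "right"): 2.0,
    ("s1", "down"): 1.0,
    ("s2", "left"): 1.0,
    ("s2", "down"): 5.0,
}

print(improve_policy(Q))  # {'s1': 'right', 's2': 'down'}
```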

#### Truncated Policy Iteration

- Limit the policy evaluation step to a maximum number of sweeps through the state space, instead of iterating until the change in the value estimates falls below the threshold $\theta$.

## Monte Carlo Methods

**Difference from the Dynamic Programming setting: in Reinforcement Learning, the agent has no knowledge of the environment dynamics and has to learn from interaction.**

- Monte Carlo methods solve Reinforcement Learning problems by averaging sample returns
- Experience is divided into episodes, and each episode terminates
- The agent's goal is to find the optimal policy that maximizes the expected cumulative reward

### The Prediction Problem

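**Off-Policy Method**
- The agent follows one policy (the behavior policy $b$) to interact with the environment, while a different policy (the target policy $\pi$) is evaluated and improved => $b \neq \pi$

**On-Policy Method**
- The same policy $\pi$ is used to interact with the environment and is the one being evaluated and improved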

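**Value Function in MC**
- Look at each occurrence of a state
- Sum the returns (cumulative rewards) that followed the counted occurrences of the state
- Divide by the number of counted occurrences (for first-visit MC, this is the number of episodes in which the state was observed)

The result depends on whether first-visit or every-visit MC is used: first-visit MC counts only the first occurrence of a state in each episode, every-visit MC counts all occurrences.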
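
The difference between the two variants can be seen in a small sketch of MC prediction. The episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the two sample episodes are assumptions made for this example.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) by averaging the sampled returns that followed each visit."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return following every time step, working backwards.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()

        seen = set()
        for state, g in returns:
            if first_visit and state in seen:
                continue  # first-visit MC ignores repeated visits in an episode
            seen.add(state)
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Hypothetical episodes; state "x" is visited twice in the first one.
episodes = [
    [("x", 1.0), ("y", 0.0), ("x", 2.0)],
    [("y", 3.0)],
]

print(mc_prediction(episodes, first_visit=True))   # {'x': 3.0, 'y': 2.5}
print(mc_prediction(episodes, first_visit=False))  # {'x': 2.5, 'y': 2.5}
```

In the first episode, `x` is followed by a return of 3 at its first visit and 2 at its second, so the two variants disagree on `x` but agree on `y`.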

#### Action Value from MCs

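- The Dynamic Programming approach of calculating the action values from the state values cannot be reused, because the one-step dynamics are not known. Instead, the returns are averaged per state-action pair.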

### Control Problem

**Determine the optimal policy $\pi_*$**

![Incremental mean: computationally efficient update](images/rl11.png)
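
The incremental mean avoids storing all sampled returns: after the $N$-th return $G$ for a state-action pair, the estimate is updated as $Q \leftarrow Q + \frac{1}{N}(G - Q)$. The sketch below assumes dictionary-based tables `Q` and `N`; the function name `update_q` and the sample returns are illustrative.

```python
from collections import defaultdict

Q = defaultdict(float)  # running mean of returns per (state, action)
N = defaultdict(int)    # number of returns seen per (state, action)


def update_q(Q, N, state, action, G):
    """Fold one sampled return G into the running mean for (state, action)."""
    key = (state, action)
    N[key] += 1
    Q[key] += (G - Q[key]) / N[key]


for G in [2.0, 4.0, 6.0]:
    update_q(Q, N, "s1", "right", G)

print(Q[("s1", "right")])  # 4.0, the mean of 2, 4 and 6
```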

#### Epsilon Greedy Policy

![Epsilon Greedy Policy](images/rl12.png)
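
A minimal sketch of epsilon-greedy action selection: with probability $\epsilon$ a uniformly random action is chosen (exploration), otherwise the action with the highest estimated value (exploitation). The function name and the sample action values are assumptions for illustration.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action given q_values, a dict mapping action -> estimated value."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore: uniform over all actions
    return max(actions, key=q_values.get)   # exploit: greedy action


q_values = {"left": 0.1, "right": 0.9}
print(epsilon_greedy_action(q_values, epsilon=0.0))  # always "right"
print(epsilon_greedy_action(q_values, epsilon=0.3))  # mostly "right"
```

Because the random branch can also return the greedy action, the greedy action is selected with probability $1 - \epsilon + \epsilon / |A|$ and every other action with probability $\epsilon / |A|$.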