# Reinforcement learning
Reinforcement learning is a framework in which an agent learns to make decisions by interacting with an environment in order to maximize cumulative reward. The agent observes the current state, selects an action, receives a reward, and transitions to a new state. Over time, it learns a policy that maximizes long-term returns through trial and error.

## Terminology
**Agent**: the entity that makes decisions, performs actions in the environment, receives rewards, and learns from the consequences of its actions

**Environment**: everything external to the agent that it can interact with

**State ($s$)**: all the information needed to describe the current situation of the environment. The agent will take action based on the current state, and an agent action (most of the time) will cause the transition to a new state. A state should be Markovian

**Action ($a$)**: the decision taken by the agent after assessing the current state that can affect the environment. An action can be either discrete or continuous

**Reward ($r$)**: feedback given to an agent after it performs an action. The reward can be positive or negative depending on the action taken and the resulting state, and it tells the agent how good or bad its action was. A reward can be immediate or delayed. The goal of the agent is to maximize the cumulative reward. The reward signal is the way of communicating to the agent what we want achieved, not how we want it achieved.

**Discount factor ($\gamma$)**: a number between 0 and 1 that determines how much future rewards are worth relative to immediate ones. Each additional step of delay multiplies a reward's weight by $\gamma$, so the further away a reward is, the less it contributes. A larger $\gamma$ makes the agent patient, favoring long-term reward, whereas a smaller $\gamma$ makes it impatient, favoring short-term reward. With $\gamma < 1$ and bounded rewards, the discount factor also ensures that the return converges in infinite-horizon problems

**Return ($G_t$)**: the total discounted reward accumulated after timestep $t$,
$$G_t= r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{n-1} r_{t+n},$$
where $r_{t+1}$ is the reward received after the action taken at timestep $t$, and $n$ is the number of actions remaining from that point on
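
As a small illustration of the return formula, the sketch below sums a finite list of rewards with discounting. The function name `discounted_return` and the reward values are hypothetical and chosen only for this example.

```python
def discounted_return(rewards, gamma):
    """Compute G_t where rewards[k] holds r_{t+k+1} (illustrative helper)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical rewards r_{t+1}, r_{t+2}, r_{t+3}
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```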

**Policy ($\pi(s)$)**: a function that takes in the current state of the agent, $s$, and returns an action, $a$, to perform. The agent makes decisions based on a policy. The goal of RL is to find an optimal policy $\pi^*$ that maximizes expected return from every state

**Trajectory**: the sequence of states, actions, and rewards the agent experiences

**Episode**: a trajectory that ends in a terminal state

**Exploration**: the agent tries new or less-known actions to discover the environment

**Exploitation**: the agent selects the best-known action to maximize the reward according to its current knowledge

## General RL workflow
For a general RL workflow, the agent observes the current state of the environment and selects an action based on a policy at each time step. The environment then responds by transitioning to a new state and providing a reward that reflects the quality of the action taken. The agent uses this reward, along with the new state, to update its understanding of the environment and improve its policy

This cycle of observing, acting, receiving rewards, and learning continues over many episodes (trial and error), allowing the agent to gradually learn a policy that maximizes long-term cumulative reward. Key components of this process include exploration (trying new actions to gather information) and exploitation (choosing the best-known action to maximize reward). Over time, the agent aims to find a balance between the two and converge toward an optimal behaviour

<img src="https://www.scribbr.com/wp-content/uploads/2023/08/the-general-framework-of-reinforcement-learning.webp" width=500>

# Multi-armed bandit
The multi-armed bandit is the simplest reinforcement learning problem. An agent chooses between $k$ different actions at each timestep and receives a reward based on the action chosen, but the reward distributions are unknown and different for each action. The goal of the agent is to maximize the cumulative reward over a given number of steps

<img src="https://miro.medium.com/v2/resize:fit:894/1*ZS_craAiKCJzFj9dQ9RaYQ.png" width=500>

## Action-Value

The value of an action, $q(a)$, is the expected reward received when action $a$ is taken:

$$q(a) = \mathbb{E}[R \mid A = a]$$

* $q(a)$: the expected reward of taking action $a$
* In multi-armed bandit problems, there is no concept of state or policy, so the value depends only on the action itself.

### Sample-Average Method
In reinforcement learning, the true action values $q(a)$ are unknown and must be estimated through repeated interactions with the environment.  
One simple estimation approach is the sample-average method, where the estimated value $\hat{q}_t(a)$ at time $t$ is:

$$\hat{q}_t(a) = \frac{\sum_{i=1}^{t-1} \mathbb{1}[A_i = a] \cdot r_i}{\sum_{i=1}^{t-1} \mathbb{1}[A_i = a]} = \frac{\text{Cumulative reward received from action $a$ before timestep $t$}}{\text{Total number of times action $a$ is taken before timestep $t$}}$$

* $A_i$: the action taken at time step $i$
* $r_i$: the reward received at time step $i$
* $\mathbb{1}[A_i = a]$: an **indicator function** that equals 1 if action $a$ was taken at time $i$, and 0 otherwise

As the number of samples increases, $\hat{q}_t(a)$ converges to the true expected value $q(a)$, by the law of large numbers.

#### Incremental Updates

When learning to estimate the expected reward of each action, we need to update our estimate each time a new reward is observed. One issue with the sample-average method is that it requires storing all past rewards for each action to compute $\hat{q}_t(a)$. As the number of trials increases, the required memory grows linearly, which becomes inefficient. To solve this, we can rewrite the sample-average update in a recursive (incremental) form, where

$$\hat{q}_{t+1} = \hat{q}_t + \alpha_t \left(r_t - \hat{q}_t\right)$$

* $\hat{q}_{t+1}$: the updated estimate of the action-value after time step $t$
* $\hat{q}_t$: the previous estimate before seeing the latest reward
* $r_t$: the reward received at time step $t$
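* $\alpha_t$: the step size or learning rate, a value between 0 and 1 that determines how much the estimate is updated. Choosing $\alpha_t = 1/n$, where $n$ is the number of times the action has been taken so far (including the current one), recovers the sample average exactly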

This form only requires storing the current estimate and does not require keeping track of all past rewards, making it computationally efficient.
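
To make the incremental form concrete, here is a minimal sketch that applies the update with step size $1/n$ and compares it with the ordinary mean. The function name `incremental_mean` and the reward values are assumptions made for illustration.

```python
def incremental_mean(rewards):
    """Estimate an action value with q <- q + (1/n) * (r - q)."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        # With step size 1/n, the running estimate equals the sample average.
        q += (r - q) / n
    return q

# Hypothetical rewards observed for a single action
rewards = [1.0, 0.0, 2.0, 5.0]
batch_mean = sum(rewards) / len(rewards)
print(incremental_mean(rewards), batch_mean)
```

Only the current estimate and the count are kept in memory, regardless of how many rewards have been seen.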

## Non-Stationary Problem
A problem is said to be stationary if the reward distribution for each action remains the same over time. However, in many real-world RL problems, the environment is non-stationary, meaning the reward distributions change over time. In such cases, an agent must adapt its action-value estimates so that they reflect recent outcomes more than older ones. One way to do this is to use a constant step size $\alpha$ in the incremental update rule. Unrolling the recursion then gives

$$\hat{q}_{t+1} = (1-\alpha)^t\hat{q}_1 + \sum^{t}_{i=1}\alpha (1-\alpha)^{t-i} r_i$$

* $\alpha$: a constant step size between 0 and 1
* $r_i$: the reward received after taking the action at timestep $i$

This formula represents the exponentially decaying weighted average of past rewards, where recent rewards are given more significance, and older rewards gradually “fade out” due to the $(1 - \alpha)^{t-i}$ decay factor.
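
The following sketch checks numerically that the step-by-step update with a constant step size agrees with the exponentially weighted sum. The function names, the initial estimate, and the reward values are hypothetical.

```python
def constant_step_estimate(q1, rewards, alpha):
    """Apply q <- q + alpha * (r - q) once per observed reward."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q

def weighted_sum_estimate(q1, rewards, alpha):
    """Evaluate the unrolled form with weights alpha * (1 - alpha)^(t - i)."""
    t = len(rewards)
    weighted = sum(alpha * (1 - alpha) ** (t - i) * r
                   for i, r in enumerate(rewards, start=1))
    return (1 - alpha) ** t * q1 + weighted

# Hypothetical rewards: the last reward jumps, as in a non-stationary problem
rewards = [1.0, 1.0, 1.0, 10.0]
print(constant_step_estimate(0.0, rewards, 0.5))
print(weighted_sum_estimate(0.0, rewards, 0.5))
```

Because the most recent reward carries weight $\alpha$ while older ones are scaled down by powers of $(1-\alpha)$, the estimate moves quickly toward the new reward level.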

## Exploration and Exploitation Tradeoff
One key challenge in reinforcement learning is deciding when to explore and when to exploit, as an agent cannot do both simultaneously. Exploration helps the agent improve its knowledge about the environment, which can lead to greater rewards in the long term. Exploitation, on the other hand, involves leveraging the agent’s current knowledge to maximize immediate rewards. An optimal policy should strike a balance between exploration and exploitation based on the agent’s current knowledge and state, in order to maximize cumulative reward over time.


### Greedy
The greedy policy is simple as the agent always exploits by choosing the action with the highest estimated reward, without any exploration. While this strategy can work in very simple or well-understood environments, it often performs poorly in more complex or uncertain problems, because the agent never tries new actions and thus fails to discover potentially better options or adapt when conditions change.

### $\epsilon$-Greedy
The $\epsilon$-greedy policy is a variation of the greedy policy that introduces a small amount of exploration.  
At each time step, the agent exploits with probability $1 - \epsilon$, and explores (by choosing a random action) with probability $\epsilon$. This policy helps the agent avoid getting stuck with suboptimal actions by occasionally trying alternatives. This can be written as

$$
A_t =
\begin{cases}
\arg\max_a \hat{q}_t(a) & \text{with probability } 1 - \epsilon \\
\text{a random action} & \text{with probability } \epsilon
\end{cases}
$$

In general, the $\epsilon$-greedy policy performs better than the greedy policy in the long run, because it keeps improving its value estimates through exploration.
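
A minimal sketch of $\epsilon$-greedy action selection is shown below. The function name `epsilon_greedy`, the explicit `rng` argument, and the value estimates are assumptions made for this example. Ties in the greedy step go to the lowest index here, which is only one possible convention.

```python
import random

def epsilon_greedy(q_estimates, epsilon, rng):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_estimates))
    # Exploit: choose the action with the highest estimate.
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])

rng = random.Random(0)              # seeded only to make the demo repeatable
q_estimates = [0.1, 0.9, 0.3]       # hypothetical estimates for 3 actions
choices = [epsilon_greedy(q_estimates, 0.1, rng) for _ in range(1000)]
print(choices.count(1) / len(choices))  # fraction of steps taking the greedy action
```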

### Optimistic Initial Value
Optimistic initial value is another strategy that balances exploration and exploitation by encouraging the agent to explore early. The idea is to initialize the estimated value of all actions to a number higher than the actual maximum reward. This causes the agent to optimistically assume all actions are promising, so it will try each action early on to verify its assumption. In the early timesteps, the agent explores all actions because it believes "every action might be great." Over time, as the agent gathers more data and updates its estimates, the action values converge to their true values, and the agent naturally shifts to exploitation.

However, optimistic initial values have some drawbacks:
1. Only encourages early exploration: Once the agent’s estimates stabilize, it behaves greedily and stops exploring, which can lead to suboptimal policies if early exploration missed better actions
2. Poor performance in non-stationary environments: In environments where the reward distributions change over time, the agent may stop exploring too early and fail to adapt
3. Choosing the initial optimistic value is tricky: It must be high enough to encourage exploration, but not so high that it delays convergence. Often, the true maximum reward is unknown, making this hard to tune
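
The effect of the initial value can be seen in a small sketch with a purely greedy agent on a bandit whose rewards are deterministic. The function name `run_greedy`, the reward values, and the initial values are hypothetical. The sketch also assumes sample-average updates in which the first real reward fully replaces the initial value, so optimism lasts for only one pull per arm.

```python
def run_greedy(true_rewards, initial_value, steps):
    """Run a greedy agent and return how often each arm was pulled."""
    k = len(true_rewards)
    q = [initial_value] * k
    counts = [0] * k
    for _ in range(steps):
        a = max(range(k), key=lambda i: q[i])   # ties go to the lowest index
        counts[a] += 1
        # Sample-average update; rewards are deterministic in this toy setup.
        q[a] += (true_rewards[a] - q[a]) / counts[a]
    return counts

true_rewards = [0.2, 0.8, 0.5]                    # hypothetical arm rewards
print(run_greedy(true_rewards, initial_value=0.0, steps=10))
print(run_greedy(true_rewards, initial_value=5.0, steps=10))
```

With a zero initial value the agent locks onto the first arm it tries, whereas the optimistic start forces it to try every arm once before settling on the best one.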

### Upper-Confidence Bound Action Selection (UCB)

UCB is a method for balancing exploration and exploitation by considering both the current value estimates and the uncertainty in those estimates. At each time step, the agent selects the action using the following rule:

$$
A_t = \arg\max_a \left[ \hat{q}_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]
$$

* $A_t$: the action selected at time step $t$
* $\hat{q}_t(a)$: the current estimated value of action $a$
* $c$: a user-defined parameter that controls the degree of exploration. Larger $c$ encourages exploration; smaller $c$ favors exploitation
* $t$: the current time step
* $N_t(a)$: the number of times action $a$ has been selected so far

In this formula, $\hat{q}_t(a)$ is the exploitation term, which indicates the agent's current estimate of how good action $a$ is. The term $c \sqrt{\frac{\ln t}{N_t(a)}}$ is the exploration bonus, which is large when action $a$ has been selected only a few times (low $N_t(a)$), encouraging the agent to explore it.

The logarithmic (unbounded) growth in $\ln t$ ensures that all actions will eventually be explored, but actions with lower estimated value or that have already been selected many times will be chosen less frequently over time. This way, UCB systematically prioritizes actions with high upper confidence bounds

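Note: UCB is a deterministic algorithm. Given the same history of actions and rewards, it always selects the same action (up to tie-breaking)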
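
The selection rule can be sketched as follows. The function name `ucb_action` and the numbers in the demo are hypothetical. The formula is undefined when $N_t(a) = 0$, so this sketch assumes the common convention of selecting any untried action first.

```python
import math

def ucb_action(q_estimates, counts, t, c):
    """Return the index of the action with the highest upper confidence bound."""
    # Assumption: an action that has never been tried is selected first.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_estimates, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

# Hypothetical situation: arm 0 looks slightly better but has been pulled often
q_estimates = [0.5, 0.4]
counts = [100, 1]
print(ucb_action(q_estimates, counts, t=101, c=2.0))  # exploration bonus favors arm 1
print(ucb_action(q_estimates, counts, t=101, c=0.0))  # no bonus: plain greedy choice
```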

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20200126023259/Screenshot-2020-01-26-at-2.32.38-AM.png" width=500>

# Markov Decision Process (MDP)
In the bandit problem, the agent selects actions in the same static environment, where each action yields a reward independent of past actions or time. However, in many real-world problems, the environment is dynamic, and the agent must select different actions depending on the current situation. Markov Decision Processes (MDPs) provide a classical formalization for sequential decision making tasks, where each action affects not only the immediate reward but also the next state and all future actions and rewards as well.

All MDPs are Markovian, meaning the future state and reward depend only on the current state and action, not on the full history. Therefore, knowing the current state is sufficient for optimal decision making, and remembering earlier states does not improve predictions about the future.

## The Agent-Environment Interface
In MDPs, the agent and environment interact continually, in either discrete or continuous time. In the discrete-time case used here, at each timestep $t$ the agent selects an action $A_t$ based on the current state of the environment $S_t$. As a result of its action, the agent receives a reward $r_{t+1}$ from the environment at the next timestep $t+1$, and the environment transitions into a new state $S_{t+1}$. Then, the agent selects its next action $A_{t+1}$ based on the new state. The sequence of states, actions, and rewards is called a trajectory

<img src="https://www.researchgate.net/publication/340694475/figure/fig2/AS:938161513455616@1600686543282/The-agent-environment-interaction-in-reinforcement-learning.jpg" width=500>

## Dynamics of MDPs
In finite MDPs, where the sets of states, actions, and rewards are all finite, the dynamics of the environment are defined by the transition probability function:

$$p(s', r \mid s, a)$$

This represents the joint probability of transitioning to state $s'$ and receiving reward $r$, given that the agent is in state $s$ and takes action $a$. In other words, based on the current state–action pair $(s, a)$, it defines the likelihood of
1. moving to a specific next state $s'$, and
2. receiving a particular reward $r$.

This formulation captures the Markov property, where the next state and reward depend only on the current state and action, not on any earlier history.
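
One possible way to store such dynamics for a small finite MDP is a dictionary keyed by state–action pairs. The two-state MDP below, its state and action names, and the helper functions are all hypothetical and serve only as a sketch.

```python
# Hypothetical dynamics: DYNAMICS[(s, a)] lists (probability, next_state, reward),
# so each entry encodes p(s', r | s, a) for one state-action pair.
DYNAMICS = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}

def is_valid(dynamics, tol=1e-9):
    """Check that the probabilities for every (s, a) pair sum to 1."""
    return all(abs(sum(p for p, _, _ in outcomes) - 1.0) < tol
               for outcomes in dynamics.values())

def expected_reward(dynamics, s, a):
    """Expected immediate reward for taking action a in state s."""
    return sum(p * r for p, _, r in dynamics[(s, a)])

def next_state_probability(dynamics, s, a, s_next):
    """Marginal probability of reaching s_next, summing over rewards."""
    return sum(p for p, s2, _ in dynamics[(s, a)] if s2 == s_next)

print(is_valid(DYNAMICS))
print(expected_reward(DYNAMICS, "A", "go"))
print(next_state_probability(DYNAMICS, "A", "go", "B"))
```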

## Episodic and Continuing Tasks

In reinforcement learning, tasks are generally classified into two types: episodic and continuing.

A task is episodic when the agent–environment interaction is naturally divided into distinct episodes. Each episode begins in a standard starting state (or sampled from a starting state distribution), proceeds through a sequence of interactions, and eventually reaches a terminal state. After an episode ends, the environment resets, and the next episode begins independently of how the previous one ended.

A task is continuing when there is no natural endpoint, where the agent–environment interaction continues indefinitely without reaching a terminal state. This setting is common in real-world systems that operate continuously, such as online recommendation engines or stock trading agents.

<img src="https://av.tib.eu/thumbnail/63100" width=500>

## Goal of Reinforcement Learning

The goal of reinforcement learning at any time step $t$ is to maximize the expected return, denoted $G_t$, which represents the cumulative reward the agent can expect to receive starting from that time step.

In episodic tasks, each episode has a fixed or variable length and terminates at some final time step $T$.  
Since the episode ends, the return is always finite:

$$G_t = r_{t+1} + r_{t+2} + \dots + r_T$$


In continuing tasks, the agent–environment interaction does not end, so we introduce a discount factor $\gamma \in [0, 1)$ to ensure the return remains finite, where

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

* A larger discount factor ($\gamma \to 1$) makes the agent more far-sighted, valuing long-term rewards.
* A smaller discount factor makes the agent more short-sighted, prioritizing immediate rewards.

Note: $\frac{R_{\text{max}}}{1 - \gamma}$ is an upper bound on return only if each reward is bounded by $R_{\text{max}}$, i.e., $r_t \leq R_{\text{max}}$.


### Recursive Form of Return
The return can also be defined recursively, where

$$G_t = r_{t+1} + \gamma G_{t+1}$$

This recursive relationship is fundamental in deriving value functions and forms the basis of many RL algorithms.
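
The recursion suggests computing the returns of a finished episode in a single backward pass. The sketch below assumes a finite list of rewards where `rewards[t]` holds $r_{t+1}$; the function name and the reward values are hypothetical.

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t for every t using G_t = r_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0                      # return after the final step is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Hypothetical episode with three rewards
print(returns_from_rewards([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```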

# Policies and Value Functions
## Policies
A policy, denoted by $\pi$, defines the agent’s behaviour by specifying a mapping from states to a probability distribution over actions. Formally, given the current state $s$, the policy defines

$$\pi(a \mid s)$$

This expression represents the probability of selecting action $a$ when the agent is in state $s$.

Policies can be:
* Deterministic: where $\pi(s)$ directly maps to a specific action.
* Stochastic: where $\pi(a \mid s)$ gives a probability distribution over actions.

Note: A policy is typically a function of only the current state $s$, due to the Markov property. If the policy depends on more than the current state (e.g., past states), then the problem setting is no longer an MDP unless those inputs are encoded into the current state.

## State Value Function
The state value function, denoted by $v_\pi(s)$, represents the expected return when the agent starts in state $s$ at time step $t$ and follows a policy $\pi$ thereafter. It is defined as

$$v_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid s_t = s \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid s_t = s \right]$$


## Action Value Function
The action value function, denoted by $q_\pi(s, a)$, represents the expected return when the agent starts in state $s$, takes action $a$, and then follows policy $\pi$ thereafter. It is defined as

$$q_\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid s_t = s, a_t = a \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \right]$$

## Bellman Equation
The Bellman equation provides a recursive formulation of the state value function and action value function. Instead of computing returns as infinite sums of future rewards, the Bellman equations allow us to compute value functions based on expected immediate reward plus the discounted value of the next states.

### Bellman Equation for the State Value Function
The Bellman equation for the state value function expresses the value of a state $s$ under a policy $\pi$ in terms of
1. All the actions available in $s$
2. The probability of choosing each action under policy $\pi$,
3. The environment's transition dynamics/probability $p(s', r \mid s, a)$,
4. The values of successor states $v_\pi(s')$.

It is defined as
$$v_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid s_t = s \right] = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

* $s$, $s'$: the current and next state respectively
* $v_\pi(s)$: expected return starting from state $s$ and following policy $\pi$
* $\pi(a \mid s)$: probability of taking action $a$ in state $s$
* $p(s', r \mid s, a)$: probability of transitioning to state $s'$ and receiving reward $r$ given current state $s$ and action $a$
* $\gamma$: discount factor, $0 \leq \gamma < 1$

At a high level, the Bellman equation follows
1. From the current state $s$, consider all possible actions $a$ (weighted by the policy).
2. For each action, consider all possible resulting next states $s'$ and rewards $r$ (weighted by the environment's dynamics).
3. For each possible $(s', r)$ pair, compute the expected return, which is the immediate reward $r$ plus the discounted future value $v_\pi(s')$.
4. Take the weighted sum of all of these to obtain $v_\pi(s)$.

Note: $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$ and $p(s', r \mid s, a)$ is the probability of transitioning to next state $s'$ and receiving reward $r$, given current state $s$ and action $a$. In the equation, we are summing over all possible immediate outcomes, which are the actions the agent might take, the rewards it might receive, and the next states it might reach. This allows the equation to "look ahead" one step into the future and compute the expected value of that step, using $\pi(a \mid s)$ to weigh each action and $p(s', r \mid s, a)$ to weigh each possible outcome of that action. This can be visualized with the Backup Diagram.

<img src="https://goodboychan.github.io/images/backup_diagram_for_v.png" width=300>

The result is a weighted average of the immediate reward plus the discounted value of future states. This recursive structure can be thought of as expanding a search tree, where each branch corresponds to a possible action and outcome.

#### Notational Notes
* We omit the time index $t$ in $v_\pi(s)$ because under the Markov property, the value of a state depends only on the current state, not on the specific time step.
* The summation over $r$ is used when the reward distribution is stochastic; in deterministic cases, this may be dropped.
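
As a sanity check of the Bellman equation for $v_\pi$, the sketch below applies one backup to a hypothetical two-state MDP with deterministic transitions and a uniform random policy. The MDP, the names `DYNAMICS`, `POLICY`, and `bellman_backup`, and the discount factor are assumptions for illustration. The values in `V_PI` were obtained by solving the two Bellman equations of this toy problem by hand.

```python
GAMMA = 0.5
# Hypothetical dynamics: DYNAMICS[(s, a)] lists (probability, next_state, reward).
DYNAMICS = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
# Uniform random policy: POLICY[s][a] is pi(a | s).
POLICY = {"A": {"stay": 0.5, "go": 0.5}, "B": {"stay": 0.5, "go": 0.5}}
V_PI = {"A": 1.25, "B": 1.75}

def bellman_backup(s, v, policy, dynamics, gamma):
    """Right-hand side of the Bellman equation for v_pi at state s."""
    total = 0.0
    for a, pi_a in policy[s].items():                 # weight by the policy
        for prob, s_next, r in dynamics[(s, a)]:      # weight by the dynamics
            total += pi_a * prob * (r + gamma * v[s_next])
    return total

for s in V_PI:
    # A true value function is a fixed point: the backup reproduces V_PI[s].
    print(s, V_PI[s], bellman_backup(s, V_PI, POLICY, DYNAMICS, GAMMA))
```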

### Bellman Equation for the Action-Value Function
Similarly, the action-value function can also be expressed in a recursive form using the Bellman equation.  
It defines the expected return starting from state $s$, taking a specific action $a$, and then following policy $\pi$ thereafter:

$$q_\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid s_t = s, a_t = a \right] = \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \right]$$

* There is no policy term in the outer expectation because the first action $a$ is already fixed (we are evaluating $q_\pi(s, a)$ for a specific action).
* The recursion comes into play after the transition to the next state $s'$, where the agent resumes following policy $\pi$. For each possible next state $s'$ and reward $r$, we
    1. Compute the immediate reward $r$
    2. Add the discounted expected value of the next state, where the value is a weighted sum over all possible next actions $a'$ under policy $\pi$
    
This gives a complete recursive expression for $q_\pi(s, a)$, based on the one-step lookahead and expected future return.
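
The inner sum $\sum_{a'} \pi(a' \mid s') q_\pi(s', a')$ is exactly $v_\pi(s')$, so $q_\pi$ can be computed from $v_\pi$ with a one-step lookahead. The sketch below illustrates this on a hypothetical two-state MDP with a uniform random policy. All names and numbers are assumptions for illustration, and the values in `V_PI` were worked out by hand for this toy problem.

```python
GAMMA = 0.5
# Hypothetical dynamics: DYNAMICS[(s, a)] lists (probability, next_state, reward).
DYNAMICS = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
POLICY = {"A": {"stay": 0.5, "go": 0.5}, "B": {"stay": 0.5, "go": 0.5}}
V_PI = {"A": 1.25, "B": 1.75}

def q_value(s, a, v, dynamics, gamma):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])

def v_from_q(s, v, policy, dynamics, gamma):
    """Average the action values under the policy to recover the state value."""
    return sum(pi_a * q_value(s, a, v, dynamics, gamma)
               for a, pi_a in policy[s].items())

print(q_value("A", "stay", V_PI, DYNAMICS, GAMMA))
print(q_value("A", "go", V_PI, DYNAMICS, GAMMA))
print(v_from_q("A", V_PI, POLICY, DYNAMICS, GAMMA))
```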


In summary, the Bellman equation leverages the recursive structure of MDPs to transform an infinite sum of future rewards into a system of linear equations, one per state. While this makes value computation tractable in principle, solving the system exactly becomes computationally infeasible in large-scale problems, because the number of states often grows exponentially with the number of variables needed to describe a state.

## Optimal Policy
The goal of reinforcement learning is to find an optimal policy that maximizes the expected cumulative reward. A policy $\pi$ is said to be better than or equal to another policy $\pi'$ if, for every state $s$, the expected return of following $\pi$ is at least as high as that of following $\pi'$. Formally, $\pi \geq \pi'$ if and only if

$$ v_\pi(s) \geq v_{\pi'}(s) \quad \text{for all } s \in \mathcal{S}$$

An optimal policy, denoted by $\pi_*$, is a policy that is better than or equal to all other policies. That is, for any other policy $\pi'$

$$v_{\pi_*}(s) \geq v_{\pi'}(s) \quad \text{for all } s \in \mathcal{S}$$

There always exists at least one optimal policy, and in some cases, multiple optimal policies may exist, all yielding the same optimal state-value function, $v_*(s)$, and optimal action-value function $q_*(s, a)$.

### Optimal Value Functions
The optimal state-value function, denoted by $v_*(s)$, represents the maximum expected return that can be achieved from state $s$ among all policies, where

$$v_*(s) = \max_{\pi} v_\pi(s) \quad \text{for all } s \in \mathcal{S}$$

Similarly, the optimal action-value function, denoted by $q_*(s, a)$, represents the maximum expected return achievable from state $s$ by taking action $a$ and then following the best possible policy thereafter:

$$q_*(s, a) = \max_{\pi} q_\pi(s, a) \quad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}$$

### Bellman Optimality Equations

We can express the Bellman equations for the optimal value functions without referencing any specific policy. These are known as the Bellman Optimality Equations.

The optimal state-value function $v_*(s)$ satisfies:

$$v_*(s) = \max_{a} \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$

This equation tells us that under an optimal policy, the value of a state equals the expected return of the best action available in that state. In other words, the best action in any state is the one that maximizes the expected immediate reward plus the discounted optimal value of the next state.

The optimal action-value function $q_*(s, a)$ satisfies:

$$q_*(s, a) = \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]$$

This equation expresses the value of taking action $a$ in state $s$ as the expected immediate reward plus the discounted best possible value achievable from the resulting next state $s'$, obtained by taking the best action $a'$ at that point. Since the current state and action $(s, a)$ are already fixed, there is no choice to be made immediately; the choice comes at the next decision point.

It is essential to express these equations without referencing any policy because the goal of reinforcement learning is to discover the optimal policy. We cannot write the equations in terms of a policy that we don't yet know. The Bellman Optimality Equations define the criteria that an optimal policy must satisfy, and solving them (exactly or approximately) allows us to find the best possible decisions at each state.

### Deriving the Optimal Policy from Optimal Value Functions

Once we have the optimal value functions, it is straightforward to derive the optimal policy from them. The optimal policy $\pi_*$ selects the action in each state that maximizes the expected return, based on the optimal state-value function $v_*(s)$, where

$$\pi_*(s) = \arg\max_a \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$

This means that the optimal policy always chooses the action that maximizes the expected immediate reward plus the discounted optimal value of the next state.

Using the optimal action-value function $q_*(s, a)$, the optimal policy can be written more simply as

$$\pi_*(s) = \arg\max_a q_*(s, a)$$

This means the optimal policy selects the action that has the highest action value for the current state.

In both cases, the agent behaves greedily with respect to the optimal value function, where it chooses the action that promises the greatest long-term reward.
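
When $q_*$ is available as a table, extracting the policy needs no model of the environment, only an argmax per state. The sketch below uses a hypothetical table `Q_STAR` for a two-state problem; the numbers are illustrative and the function name is an assumption.

```python
# Hypothetical optimal action values q_*(s, a) for a toy two-state problem.
Q_STAR = {
    ("A", "stay"): 1.5,
    ("A", "go"):   3.0,
    ("B", "stay"): 4.0,
    ("B", "go"):   1.5,
}

def greedy_from_q(q):
    """Return pi(s) = argmax_a q(s, a) for every state in the table."""
    best = {}
    for (s, a), value in q.items():
        # Keep the first action seen unless a strictly better one appears.
        if s not in best or value > q[(s, best[s])]:
            best[s] = a
    return best

print(greedy_from_q(Q_STAR))  # {'A': 'go', 'B': 'stay'}
```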

However, in most real-world scenarios, obtaining the optimal policy by directly solving the Bellman equations is unrealistic because it requires

1. Substantial knowledge of the environment, including the transition dynamics $p(s', r \mid s, a)$, which we often do not have access to in practice.
2. Extremely high computational and memory resources, especially in environments with large or continuous state and action spaces.

As a result, in most practical applications, we use approximation methods to learn a "good enough" policy or value function, rather than attempting to compute the exact optimal values.

# Policy Evaluation & Improvement 

In reinforcement learning, two fundamental tasks are:
* Policy Evaluation: Determining how good a given policy $\pi$ is by computing its state-value function $v_\pi$.
* Policy Improvement: Improving a policy by iteratively producing better policies until reaching a good enough or optimal policy.

Dynamic Programming (DP) methods can be used to solve both tasks if we have access to the environment's dynamics $p(s', r \mid s, a)$.

## Iterative Policy Evaluation
To evaluate the state values under a policy $\pi$, we use the Bellman equation for the value function, where

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

Since computing $v_\pi$ exactly is often impractical, we use an iterative approximation, where

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]$$

Each iteration (sweep) updates the estimated value of every state once, using the estimates from the previous iteration. With enough iterations, the values converge to the true value function $v_\pi$. The key idea of policy evaluation is to use expected immediate rewards and the estimated values of successor states, one step ahead, to iteratively update state values until they converge.

### Algorithm: Iterative Policy Evaluation (Given a Policy $\pi$)

1. Initialize two arrays, $v_k$ and $v_{k+1}$, to store the old and updated state values for all states.
2. Randomly initialize the state values (or initialize all to zero).
3. For each state $s$, compute a new value using
   $$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]$$
4. Store all new values in $v_{k+1}$.
5. After updating all states, replace $v_k$ with $v_{k+1}$ (set $v_k = v_{k+1}$).
6. Repeat steps 3–5 until the values converge, meaning the change between $v_k$ and $v_{k+1}$ is smaller than a chosen threshold.

A unique value function $v_\pi$ is guaranteed to exist for any given policy $\pi$ as long as either the discount factor satisfies $\gamma < 1$ or the task is guaranteed to terminate from every state under $\pi$. Under these conditions, the iterative updates converge to $v_\pi$.
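
The steps above can be sketched for a small tabular problem as follows. The two-state MDP, the uniform random policy, the threshold `theta`, and the function name `policy_evaluation` are all hypothetical choices made for this example.

```python
GAMMA = 0.5
# Hypothetical dynamics: DYNAMICS[(s, a)] lists (probability, next_state, reward).
DYNAMICS = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
POLICY = {"A": {"stay": 0.5, "go": 0.5}, "B": {"stay": 0.5, "go": 0.5}}

def policy_evaluation(policy, dynamics, gamma, theta=1e-10):
    """Two-array iterative policy evaluation for a tabular policy."""
    states = list(policy)
    v = {s: 0.0 for s in states}              # v_k, initialized to zero
    while True:
        v_new = {}                            # v_{k+1}
        for s in states:
            v_new[s] = sum(
                pi_a * prob * (r + gamma * v[s_next])
                for a, pi_a in policy[s].items()
                for prob, s_next, r in dynamics[(s, a)]
            )
        delta = max(abs(v_new[s] - v[s]) for s in states)
        v = v_new                             # replace v_k with v_{k+1}
        if delta < theta:                     # stop once a sweep barely changes v
            return v

print(policy_evaluation(POLICY, DYNAMICS, GAMMA))
```

For this toy problem the result should be close to $v_\pi(A) = 1.25$ and $v_\pi(B) = 1.75$, which is the hand-computed solution of the two Bellman equations.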

## Policy Improvement
The primary goal of computing the value function of a policy is to help us find a better policy. Given the state-value function $v_\pi$ of any arbitrary policy $\pi$, we can construct a new policy $\pi'$ by acting greedily with respect to $v_\pi$. That is, in each state, $\pi'$ selects the action that yields the highest expected return based on the current value estimates. By doing this, the new policy $\pi'$ is guaranteed to be at least as good as the original policy $\pi$. If the new policy’s value function $v_{\pi'}$ is strictly better than $v_\pi$, we have improved the policy. If $v_{\pi'} = v_\pi$, this means the current policy is already optimal, and no further improvement is possible.

### Policy Improvement Theorem
Formally, given two policies $\pi$ and $\pi'$, the new policy $\pi'$ is guaranteed to be at least as good as $\pi$ if, for every state $s$:

$$q_\pi(s, \pi'(s)) \geq v_\pi(s)$$

This condition means that in every state $s$, the action chosen by $\pi'$, which may be the same as or different from the action chosen by $\pi$, always yields an expected return greater than or equal to the value of following the original policy $\pi$ from that state onward.

As a result,

$$v_{\pi'}(s) \geq v_\pi(s) \quad \text{for all } s$$

### Greedy Policy Improvement
Applying the policy improvement theorem, given an original policy $\pi$ and its state-value function $v_\pi(s)$, we can obtain a better or equally good policy $\pi'$ by selecting actions greedily with respect to $v_\pi(s)$, where

$$\pi'(s) = \arg\max_a q_\pi(s, a)$$

* If $q_\pi(s, \pi'(s)) > v_\pi(s)$ in at least one state $s$, then $\pi'$ is strictly better than $\pi$ in that state (strict improvement).
* If $v_{\pi'} = v_\pi$, for example because $\pi' = \pi$, then the policy is already greedy with respect to its own value function, which indicates that the current policy is optimal and cannot be improved further.
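
A greedy improvement step can be sketched as a one-step lookahead followed by an argmax. The two-state MDP below is hypothetical, and `V_PI` holds the hand-computed values of a uniform random policy on it. The function names are assumptions for illustration.

```python
GAMMA = 0.5
# Hypothetical dynamics: DYNAMICS[(s, a)] lists (probability, next_state, reward).
DYNAMICS = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
V_PI = {"A": 1.25, "B": 1.75}   # values of the uniform random policy

def q_from_v(s, a, v, dynamics, gamma):
    """q_pi(s, a) computed from v_pi with a one-step lookahead."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])

def greedy_policy(v, dynamics, gamma):
    """Return pi'(s) = argmax_a q_pi(s, a) for every state."""
    actions = {}
    for (s, a) in dynamics:
        actions.setdefault(s, []).append(a)   # collect the actions per state
    return {s: max(acts, key=lambda a: q_from_v(s, a, v, dynamics, gamma))
            for s, acts in actions.items()}

print(greedy_policy(V_PI, DYNAMICS, GAMMA))  # {'A': 'go', 'B': 'stay'}
```

In both states the greedy action has a value above $v_\pi(s)$ (1.875 versus 1.25 in A, 2.875 versus 1.75 in B), so the new policy is a strict improvement over the random one.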

## Policy Iteration
Policy Iteration is an algorithmic process for finding the optimal policy by repeatedly alternating between policy evaluation and policy improvement. The key idea is that by evaluating the current policy and then improving it based on the evaluation, we can iteratively converge to an optimal or sufficiently good policy. In this process, each policy evaluation gives us more accurate value estimates, and each policy improvement ensures the new policy is strictly better or equally good. The process is guaranteed to converge to the optimal policy in finite MDPs.

### Policy Iteration Algorithm

1. Initialize:  
   * Start with an arbitrary policy $\pi$
   * Initialize the state-value function $v$

2. Policy Evaluation: 
   * Compute $v_\pi$ for the current policy using iterative updates until convergence.

3. Policy Improvement:
   * For each state $s$, update the policy greedily with respect to $v_\pi$:  
   $$\pi_{\text{new}}(s) = \arg\max_a \sum_{s'} \sum_r p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

4. Check for Convergence:
   * If the policy hasn’t changed, the optimal policy has been found.
   * Otherwise, set $\pi = \pi_{\text{new}}$ and return to Step 2.
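
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the three-state MDP `P`, the action names, and the helper names (`q_value`, `policy_evaluation`, `policy_iteration`) are all hypothetical, and the model is assumed to be given as a table of `(probability, next_state, reward)` outcomes.

```python
# Hypothetical 3-state MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward); state 2 is terminal.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.9, 2, 1.0), (0.1, 0, 0.0)]},
    2: {},
}
GAMMA = 0.9

def q_value(P, v, s, a, gamma):
    """One-step lookahead: sum over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(prob * (r + gamma * v[s2]) for prob, s2, r in P[s][a])

def policy_evaluation(P, policy, gamma, tol=1e-10):
    """Iteratively compute v_pi for a deterministic policy until changes fall below tol."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue  # terminal state keeps value 0
            new_v = q_value(P, v, s, policy[s], gamma)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

def policy_iteration(P, gamma):
    # Step 1: arbitrary initial policy (first listed action in each state).
    policy = {s: next(iter(P[s])) for s in P if P[s]}
    while True:
        v = policy_evaluation(P, policy, gamma)          # Step 2
        stable = True
        for s in policy:                                 # Step 3
            best = max(P[s], key=lambda a: q_value(P, v, s, a, gamma))
            # Switch only on a strict improvement so ties cannot flip forever.
            if q_value(P, v, s, best, gamma) > q_value(P, v, s, policy[s], gamma) + 1e-12:
                policy[s] = best
                stable = False
        if stable:                                       # Step 4
            return policy, v

policy, v = policy_iteration(P, GAMMA)
print(policy, v)
```

On this toy problem the loop stops after a few evaluation and improvement rounds, once the greedy step no longer changes any action.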

<img src="https://plusreinforcement.com/wp-content/uploads/2018/07/screen-shot-2018-07-04-at-7-37-40-pm.png" width=350>

# Generalized Policy Iteration
Generalized Policy Iteration refers to the framework in reinforcement learning where policy evaluation and policy improvement interact repeatedly, and often only approximately, to gradually refine a policy toward optimality. The key idea is that neither evaluation nor improvement needs to run to completion before the other begins. As long as policy evaluation moves the value function closer to accurate estimates under the current policy, and policy improvement makes the policy greedier with respect to those estimates, the process will converge over time. If both the value function and the policy eventually stabilize, meaning no further updates occur, then the policy must be optimal, and the value function must be the optimal value function.


## Value Iteration
Value iteration is a form of generalized policy iteration that streamlines the process of finding the optimal policy. It addresses the inefficiency in traditional policy iteration, where the policy evaluation step requires multiple sweeps over all states until the value function converges. Instead of fully evaluating the current policy before improving it, value iteration performs a single update to each state's value in each iteration and folds greedy policy improvement into that same update. This reduces the computational cost per iteration while still ensuring convergence to the optimal policy.

The core update rule is:
$$v_{k+1}(s) = \max_a \sum_{s'} \sum_r p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]$$

Each update improves the value estimate of state $s$ by considering the best possible action to take, which is already a greedy selection. Once the value function has converged (changes are below a small threshold), the optimal policy can be derived by acting greedily with respect to the final value function:

$$\pi_*(s) = \arg\max_a \sum_{s'} \sum_r p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$

Value iteration has the following advantages:
1. It is often cheaper than policy iteration, especially in large state spaces, because it avoids running policy evaluation to convergence.
2. It combines evaluation and improvement into a single update step.
3. It is guaranteed to converge to the optimal value function and policy under standard conditions (finite MDP, $\gamma < 1$ or episodic tasks).
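
As a rough sketch of the update rule and the final greedy extraction, the following plain Python runs value iteration on a small tabular model. The MDP `P`, the action names, and the helper names (`lookahead`, `value_iteration`) are hypothetical and chosen only for illustration.

```python
# Hypothetical 3-state MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward); state 2 is terminal.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.9, 2, 1.0), (0.1, 0, 0.0)]},
    2: {},
}

def lookahead(P, v, s, a, gamma):
    """Sum over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(prob * (r + gamma * v[s2]) for prob, s2, r in P[s][a])

def value_iteration(P, gamma, tol=1e-10):
    v = {s: 0.0 for s in P}
    while True:
        # One sweep: v_{k+1}(s) = max_a lookahead(s, a), computed from v_k.
        new_v = {
            s: max((lookahead(P, v, s, a, gamma) for a in P[s]), default=0.0)
            for s in P
        }
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    # Extract the greedy policy from the converged values.
    policy = {
        s: max(P[s], key=lambda a: lookahead(P, v, s, a, gamma))
        for s in P if P[s]
    }
    return v, policy

v_star, pi_star = value_iteration(P, gamma=0.9)
print(v_star, pi_star)
```

Note that no explicit policy is stored during the sweeps; the policy only appears at the end, when it is read off from the converged values.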

# DP Efficiency & Limitations
Dynamic programming methods, such as policy iteration and value iteration, are foundational tools in reinforcement learning for solving MDPs. These methods are guaranteed to converge to the optimal policy under the right conditions and are considered efficient in a theoretical sense.

DP methods are polynomial time algorithms, meaning their computational complexity grows at a manageable rate with respect to the number of states and actions. This makes them efficient in contrast to brute force approaches that might try all possible policies.

There are two types of DP methods, synchronous and asynchronous DP.

* Synchronous DP: the value of all states is updated simultaneously and systematically in each iteration based on the values from the previous iteration. This is conceptually simple and easy to parallelize, but can be inefficient if many states are rarely visited or irrelevant in a given policy.

* Asynchronous DP: only a subset of states is updated at a time. This allows for more targeted updates focusing on states that are frequently visited or have recently changed. Asynchronous methods can often reach good solutions more quickly in practice. However, to guarantee convergence to the correct value function, all states must still be updated sufficiently often over time. This means asynchronous DP does not necessarily reduce the total number of updates needed for convergence, but it allows for smarter scheduling and prioritization of updates. In practice, this leads to more efficient use of computational resources, especially when combined with techniques like prioritized sweeping.

## Limitations of DP
However, the theoretical efficiency of DP does not always translate to practicality in real-world problems, for the following reasons:

1. Curse of Dimensionality: DP methods become intractable in high dimensional state spaces, where the number of possible states grows exponentially with the number of variables describing the state.

2. Requirement of the Transition Probability: DP requires complete knowledge of the environment, specifically the transition probability function, $p(s', r \mid s, a)$. In real world problems, this dynamic model is usually unknown, making pure DP methods inapplicable without approximation or learning.

Dynamic Programming is a powerful baseline, but its real world applicability is often limited. That’s why modern reinforcement learning often turns to approximate dynamic programming, Monte Carlo methods, and temporal difference learning, which relax these constraints.

# Sampling Based Learning Methods
In many real-world reinforcement learning scenarios, we do not have access to the full model of the environment, such as the transition probabilities or reward distributions. As a result, it becomes infeasible to apply planning methods like dynamic programming, which require full knowledge of the environment's dynamics.

Sampling-based learning methods (model-free methods) address this by learning directly from sampled experiences, the actual sequences of states, actions, and rewards encountered by the agent while interacting with the environment. Instead of computing expectations over all possible next states as in DP, sampling based methods approximate value functions or policies using observed outcomes through experiences.

Common examples of sampling-based methods include
* Monte Carlo methods: Learn from complete episodes by averaging returns.
* Temporal-Difference (TD) learning: Learn from bootstrapped estimates using partial episodes.

These methods are more scalable and practical in complex environments because they do not require storing or computing the full transition dynamics. However, they also come with trade-offs, such as increased variance in learning and the need for sufficient exploration to obtain a good enough estimate. In summary, sampling-based learning allows an agent to learn optimal behaviour from trial-and-error experience, without needing a complete model of the environment.

## Monte Carlo Method
Monte Carlo methods estimate value functions by averaging returns obtained from multiple sampled episodes. These methods rely solely on sequences of states, actions, and rewards generated through actual interactions with the environment. Unlike DP, Monte Carlo methods do not require knowledge of the environment’s dynamics.

Similar to DP, Monte Carlo methods follow the principle of Generalized Policy Iteration, where they alternate between policy evaluation (estimating how good the current policy is) and policy improvement (using the value estimates to make the policy better), eventually converging to an optimal or good enough policy.

### Monte Carlo for Policy Evaluation
To evaluate a policy using the Monte Carlo approach:
1. Given a policy, run simulations based on it to generate many episodes.
2. Record the returns (cumulative discounted rewards) observed following each state.
3. Estimate the value of a state as the average return observed when visiting that state across many episodes.

Given that episodes terminate, the return $G_t$ at each time step $t$ can be computed backward from the end of the episode, where
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T$$

With enough samples, the average return for each state converges to the true expected return under the given policy.
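
A minimal sketch of this procedure is shown below. It assumes the first-visit variant (only the first occurrence of a state in each episode contributes a return), which is one common choice that the text above does not specify. The function name `first_visit_mc` and the two hand-written episodes are hypothetical.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Estimate v(s) as the average first-visit return.

    episodes: list of episodes, each a list of (state, reward) pairs, where
    reward is r_{t+1}, the reward received after leaving state s_t.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backward so that G_t = r_{t+1} + gamma * G_{t+1}.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Overwritten on each earlier visit, so the earliest visit remains.
            first_visit_return[state] = g
        for state, g_first in first_visit_return.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Two hypothetical recorded episodes.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 1.0), ("A", 0.0), ("B", 2.0)],
]
values = first_visit_mc(episodes, gamma=0.5)
print(values)
```

Computing the return backward reuses $G_{t+1}$ at each step, so one pass over the episode is enough.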

### Monte Carlo vs DP
1. No Need for a Model: Monte Carlo methods do not require access to the environment's transition or reward model. They learn directly from sampled episodes, unlike DP which requires complete knowledge of the dynamics.

2. Independent State Updates: In Monte Carlo, the value of each state is computed independently by averaging returns from that specific state. In contrast, DP relies on bootstrapping, where the value of a state is a weighted sum of the values of successor states. This means
   * Monte Carlo computation scales with the number of episodes (and their length), not the size of the state space (the size of MDP).
   * DP computation scales with the number of states and the transition model complexity.

<img src="https://i0.wp.com/roboticseabass.com/wp-content/uploads/2020/08/rl_intro_banner-1.png?fit=1200,448&ssl=1" width=700>

### Monte Carlo for Action-Value Estimation
In DP, state values are sufficient to determine a policy because the model (transition probabilities and rewards) allows us to look one step ahead and select the action that leads to the best expected outcome. However, in Monte Carlo methods, where no model of the environment is available, state values alone are insufficient for decision making: even if we know all the state values, we do not know the transition probabilities, and thus do not know which state the agent will end up in after taking an action. Therefore, we must explicitly estimate action-value functions, $q_\pi(s, a)$, to determine which actions are better under a given policy.

To estimate action values
1. Generate episodes using the current policy.
2. For each state–action pair, we record the return following that pair.
3. The estimated action value $q_\pi(s, a)$ is the average of these observed returns.

This is similar to estimating state values, except we condition on both the state and the action taken.

### Exploring Starts
However, this method has a critical issue: when the agent follows a deterministic policy, certain state–action pairs may never be visited. As a result, some action values will never be updated, and the agent cannot learn about better actions it never tries. This violates the principle of sufficient exploration, making it impossible to guarantee convergence to the optimal policy.

One common solution in Monte Carlo control is exploring starts, where each episode begins from a randomly chosen state–action pair and then follows the policy for the remainder of the episode. This ensures that all state–action pairs have a non-zero probability of being explored. With enough episodes, this guarantees that all state–action pairs are visited and that their $q_\pi(s, a)$ values can be accurately estimated.

### Monte Carlo Method for Policy Iteration
Monte Carlo methods can be used to perform policy iteration. This is achieved through sampling episodes and estimating the action values from experience. The algorithm proceeds as follows

1. Initialization
* Initialize an arbitrary policy $\pi$.
* Initialize the action-value function $q_\pi(s, a)$ arbitrarily (e.g., to zero for all $(s, a)$ pairs).

2. Generate an Episode
* Use exploring starts to ensure sufficient coverage of all state-action pairs.
* Generate an episode by starting from a random state-action pair, then follow the current policy $\pi$.
* Record the sequence of $(s_t, a_t, r_{t+1})$ for the entire episode until termination.


3. Monte Carlo Policy Evaluation
* For each state-action pair $(s, a)$ that appears in the episode:
     - Compute the return $G_t$ backward with discount.
     - Update the action-value estimate $q_\pi(s, a)$ as the sample average of all returns observed for that pair:
    $$q_\pi(s, a) \leftarrow \text{average of all returns observed after visiting } (s, a)$$

Note: Only the $(s, a)$ pairs actually visited in the episode are updated; others remain unchanged. In theory, an infinite number of episodes ensures accurate value estimation for all pairs. In practice, however, we use limited episodes, meaning early updates may be noisy, but over many iterations, the estimates improve.

4. Policy Improvement
* For each state $s$ visited in the episode, improve the policy by choosing the action with the highest estimated value to make the policy greedy with respect to the current $q_\pi$, where
  $$\pi_{\text{new}}(s) = \arg\max_a q_\pi(s, a)$$

5. Repeat
* Return to Step 2 and continue until the policy converges, or for a fixed number of iterations.

Note: Using one or a few episodes per iteration is analogous to stochastic gradient descent, where each update may be noisy, but the overall trend improves the policy over time. The use of exploring starts ensures that all state-action pairs have a non-zero chance of being updated, which is critical for convergence guarantees.
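
The loop above can be sketched on a toy problem as follows. Everything here is an assumption made for illustration: the two-state deterministic task `STEP`, the action names, and the function name `mc_control_exploring_starts`. The sample average is kept incrementally, and the policy is updated after every episode.

```python
import random
from collections import defaultdict

# Hypothetical deterministic episodic task, for illustration only.
# STEP[(state, action)] = (reward, next_state); next_state None ends the episode.
STEP = {
    (0, "a"): (0.0, 1), (0, "b"): (0.5, None),
    (1, "a"): (1.0, None), (1, "b"): (0.0, None),
}
STATES, ACTIONS, GAMMA = [0, 1], ["a", "b"], 0.9

def mc_control_exploring_starts(num_episodes, seed=0):
    rng = random.Random(seed)
    policy = {s: "b" for s in STATES}   # arbitrary initial deterministic policy
    q = defaultdict(float)              # action-value estimates, initialized to zero
    visits = defaultdict(int)
    for _ in range(num_episodes):
        # Exploring start: random first state-action pair, then follow the policy.
        s, a = rng.choice(STATES), rng.choice(ACTIONS)
        episode = []
        while s is not None:
            reward, s_next = STEP[(s, a)]
            episode.append((s, a, reward))
            s = s_next
            if s is not None:
                a = policy[s]
        # Evaluation and improvement, walking the episode backward.
        g = 0.0
        for s, a, reward in reversed(episode):
            g = reward + GAMMA * g
            visits[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / visits[(s, a)]   # running sample average
            policy[s] = max(ACTIONS, key=lambda act: q[(s, act)])
    return policy, q

policy, q = mc_control_exploring_starts(num_episodes=2000)
print(policy, dict(q))
```

Because returns collected under earlier policies stay in the average, the estimate for `(0, "a")` approaches its final value only gradually, which mirrors the noisy early updates described in the notes above.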

### Monte Carlo for Action-Value Without Exploring Start
Exploring starts require that every possible state-action pair can be chosen as the start of an episode in order to provide enough exploration. This can be problematic in complex, real-world problems, since some of them, like self-driving cars, have an infinite number of state-action pairs.

To solve this, we can use an $\epsilon$-soft policy. An $\epsilon$-soft policy is a stochastic policy that gives each possible action at least $\frac{\epsilon}{\text{Total number of actions}}$ probability of being selected, where
$$\pi(a \mid s) \geq \frac{\epsilon}{|\mathcal{A}(s)|}$$

For example, $\epsilon$-greedy and uniform random policies are both $\epsilon$-soft policies.

The stochasticity of an $\epsilon$-soft policy means that the agent explores automatically and will eventually visit all state-action pairs given enough episodes. This means we can remove the exploring starts condition.

However, using an $\epsilon$-soft policy also means that policy iteration can only learn the optimal soft policy, which generally cannot perform as well as the optimal deterministic policy because there is always a small chance that the agent explores, even when it knows the environment completely. Nonetheless, this optimal soft policy often performs well enough that there is no need to find the optimal deterministic policy.

To perform policy iteration for $\epsilon$-soft policies, we first initialize a random $\epsilon$-soft policy. Then, we run simulations to obtain an episode, and we use the observed return following each state-action pair to update the action-value estimates. Finally, we make the policy $\epsilon$-greedy with respect to the current action-value function. This process is essentially the same as before, but instead of using a deterministic policy, all the policies involved are stochastic, and instead of arriving at the optimal deterministic policy, we arrive at the optimal $\epsilon$-soft policy.

### Monte Carlo for Action-Value Without Exploring Starts
In Monte Carlo control with exploring starts, every state-action pair must have a non-zero probability of being the initial pair in an episode. This guarantees that all $(s, a)$ pairs are eventually visited, which is critical for ensuring convergence to the optimal policy. However, exploring starts are impractical in real world problems, especially those with continuous or infinite state-action spaces, like self-driving cars, where it’s impossible to initialize the agent in every possible situation.

### $\boldsymbol{\varepsilon}$-Soft Policies
To ensure sufficient exploration without exploring starts, we use $\varepsilon$-soft policies, which are stochastic policies that assign non-zero probability to all actions in every state, where

$$\pi(a \mid s) \geq \frac{\varepsilon}{|\mathcal{A}(s)|}$$

This guarantees that every action in every state is selected with probability at least $\frac{\varepsilon}{\text{Number of possible actions}}$. Thus, all state-action pairs are eventually explored given enough episodes. Both the $\varepsilon$-greedy policy and the uniform random policy are examples of $\varepsilon$-soft policies.

Using an $\varepsilon$-soft policy ensures exploration and removes the need for exploring starts. However, there’s a trade-off, where the agent can only converge to the optimal $\varepsilon$-soft policy, not the optimal deterministic policy. Even after learning a near-optimal policy, the agent still occasionally explores due to the non-zero $\varepsilon$.

In practice, the performance of the optimal $\varepsilon$-soft policy is often close to that of the optimal deterministic policy, and it performs well enough for most needs.

### Monte Carlo Control with $\varepsilon$-Soft Policy (Policy Iteration)
The process is similar to standard Monte Carlo policy iteration, with the key difference being the use of stochastic policies:

1. Initialize:
   - Initialize a random $\varepsilon$-soft policy $\pi$.
   - Initialize action-value estimates $q(s, a)$ arbitrarily.

2. Generate Episode:
   - Generate an episode following the current $\varepsilon$-soft policy.
   - Record the sequence of $(s_t, a_t, r_{t+1})$.

3. Policy Evaluation:
   - For each state-action pair in the episode:
     - Compute return $G_t$ and update $q(s, a)$ as the sample average.

4. Policy Improvement:
   - For each state $s$ visited:
     - Make the policy $\varepsilon$-greedy with respect to the current $q(s, a)$:
       $$
       \pi(a \mid s) =
       \begin{cases}
       1 - \varepsilon + \frac{\varepsilon}{|\mathcal{A}(s)|}, & \text{if } a = \arg\max_{a'} q(s, a') \\
       \frac{\varepsilon}{|\mathcal{A}(s)|}, & \text{otherwise}
       \end{cases}
       $$

5. Repeat steps 2–4 until convergence.
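
The improvement step in particular is easy to get subtly wrong, so here is a small sketch of how the $\varepsilon$-greedy probabilities could be built and sampled for a single state. The function names `epsilon_greedy_probs` and `sample_action` and the example values are hypothetical.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a | s) for one state, given q_values: dict action -> q(s, a).

    Every action gets epsilon / |A(s)|; the greedy action gets an extra 1 - epsilon.
    """
    n = len(q_values)
    greedy = max(q_values, key=q_values.get)
    return {
        a: epsilon / n + (1.0 - epsilon if a == greedy else 0.0)
        for a in q_values
    }

def sample_action(probs, rng=random):
    """Draw one action according to the probabilities in probs."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]

probs = epsilon_greedy_probs({"left": 0.2, "right": 1.0, "stay": -0.3}, epsilon=0.3)
action = sample_action(probs, random.Random(0))
print(probs, action)
```

With three actions and $\varepsilon = 0.3$, the greedy action is chosen with probability of about 0.8 and each other action with about 0.1, so the $\varepsilon$-soft lower bound holds for every action.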

Overall, an $\varepsilon$-soft policy solves the problem of insufficient exploration when the number of possible state-action pairs is large, and it yields a good enough, though slightly sub-optimal, stochastic policy.

# On-Policy vs Off-Policy Learning
In reinforcement learning, there are two fundamental paradigms for learning from interaction, on-policy and off-policy learning. The key difference lies in whether the policy being learned is the same as the policy used to generate the data.

## On-Policy Learning
In on-policy learning, the agent learns about and improves the same policy that it uses to make decisions. This means the data used for learning is generated by following the current policy, which typically includes a balance of exploitation and exploration. 

**Pros:**
1. Conceptually simpler.
2. Lower variance in value estimates since learning is based on the actual policy being executed.

**Cons:**
1. Exploration is constrained by the policy itself.
2. May learn slower due to limited exploration.

## Off-Policy Learning
In off-policy learning, the agent learns about a different policy than the one used to generate the data. There are two distinct policies involved:

1. Target policy: the policy the agent is trying to learn or evaluate, denoted by $\pi(a \mid s)$
2. Behaviour policy: the policy used to generate data (select actions), denoted by $b(a \mid s)$

This allows the agent to learn an optimal or deterministic target policy while still exploring using a stochastic behaviour policy.

**Pros:**
1. More general and powerful; can learn from data generated by other agents or past experiences.
2. Enables stronger exploration while still converging to an optimal deterministic policy.

**Cons:**
1. Higher variance in updates, especially when using importance sampling.
2. Requires careful alignment between behaviour and target policies.

For off-policy learning to be valid, the behaviour policy must cover all the actions that the target policy might take. That is:
$$\text{If } \pi(a \mid s) > 0, \text{ then } b(a \mid s) > 0$$

This ensures that the agent gets enough data to estimate the value of the target policy correctly. If the behaviour policy never explores certain actions that the target policy might select, it is impossible to evaluate or improve the target policy accurately.

## Importance Sampling
Importance sampling is a fundamental technique used in off-policy reinforcement learning. It allows us to estimate expectations under one probability distribution (the target policy) using samples drawn from a different distribution (the behaviour policy). This is crucial in off-policy learning, where the agent aims to learn or evaluate a policy that is different from the one used to generate data.

In off-policy learning, we want to compute the expected return or value function under the target policy $\pi$, but the agent collects data using the behaviour policy $b$, which is generally a more exploratory policy. Since the action distributions of the target and behaviour policies are different, we need a way to "correct" for the difference, and importance sampling provides it.

### The Mathematics
Suppose we want to estimate the expected value of some function $f(x)$ under a target distribution $p(x)$, but we only have samples from a different distribution $q(x)$. Then:

$$
\mathbb{E}_{x \sim p}[f(x)] = \sum_x p(x) f(x) = \sum_x \frac{p(x)}{q(x)} q(x) f(x) = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]
$$

Here, $\frac{p(x)}{q(x)}$ is the importance sampling ratio, which reweights the samples to reflect their importance under the target distribution. The importance sampling ratio reflects the ratio between the two distributions, which allows us to estimate expectations under one distribution using samples from another by accounting for the difference in likelihood between the two.
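
A small numeric check can make the identity concrete. The two-outcome distributions `p` and `q` and the function values `f` below are made up for illustration; the code compares the exact expectation under `p` with the reweighted expectation under `q`, and then with a sampled estimate that only draws from `q`.

```python
import random

p = {"x1": 0.8, "x2": 0.2}     # target distribution (hypothetical)
q = {"x1": 0.5, "x2": 0.5}     # sampling distribution (hypothetical)
f = {"x1": 1.0, "x2": 10.0}    # function whose expectation we want

# Exact expectation under p, and the same quantity written as an expectation under q.
exact = sum(p[x] * f[x] for x in p)
reweighted = sum(q[x] * (p[x] / q[x]) * f[x] for x in q)

# Sampled estimate: draw only from q, weight each sample by p(x) / q(x).
rng = random.Random(0)
samples = rng.choices(list(q), weights=list(q.values()), k=100_000)
estimate = sum(p[x] / q[x] * f[x] for x in samples) / len(samples)

print(exact, reweighted, estimate)
```

The first two numbers agree up to floating-point rounding, while the sampled estimate fluctuates around them and gets closer as the number of samples grows.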

### Importance Sampling in Reinforcement Learning
In the context of RL, the goal is to estimate the expected return under a target policy $\pi$, but the data (state-action-reward trajectories) was generated by a different behaviour policy $b$. Even if the actions taken differ, both the target policy $\pi$ and the behaviour policy $b$ interact with the same environment. This means the state transition dynamics are the same for both, since they are governed by the MDP and do not depend on the policy itself. As a result, the only difference between trajectories generated by $b$ and those from $\pi$ lies in how actions are selected at each state, which is the difference in the distributions of the policies. By accounting for this difference with importance sampling, we can accurately estimate the expected return of the target policy $\pi$ using samples generated by the behaviour policy $b$.

* Let $\pi(a_t \mid s_t)$ be the probability of taking action $a_t$ in state $s_t$ under the target policy $\pi$.
* Let $b(a_t \mid s_t)$ be the probability of taking the same action under the behaviour policy $b$.

At time step $t$, the importance sampling ratio is
$$\rho_t = \frac{\pi(a_t \mid s_t)}{b(a_t \mid s_t)}$$

This ratio tells us how to reweight a sample from the behaviour policy $b$ to estimate its expected contribution under the target policy $\pi$. In other words, it represents the ratio of the likelihood of taking the same action $a$ in the same state $s$ between the two policies.

* If $\rho_t$ is small, it means this action was much more likely under $b$ than under $\pi$, so this sample is less relevant to what $\pi$ would do. We downweight it.
* If $\rho_t$ is large, it means this action is more consistent with what $\pi$ would have done. We upweight it.

Essentially, $\rho_t$ tells us how much a sample from $b$ should count when estimating what $\pi$ would do. It adjusts for the differences of the distributions between the two policies, so that we can use data from one policy to estimate the behaviour of another.

To estimate the return under the target policy $\pi$ over a full trajectory of length $T$, we compute the cumulative importance sampling ratio from time $t$ onward, where

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(a_k \mid s_k)}{b(a_k \mid s_k)} = \rho_t \cdot \rho_{t+1} \cdots \rho_{T-1}$$

This cumulative ratio adjusts for the differences between the target policy $\pi$ and the behaviour policy $b$ at each timestep along the trajectory. It effectively reweights the entire trajectory so that it reflects how likely the actions would be under the target policy. Then, given a trajectory generated by $b$, we can estimate the expected return under the target policy $\pi$ using:

$$v_\pi(s_t) = \mathbb{E}_b[\rho_{t:T-1} \cdot G_t \mid s_t] \approx \frac{1}{N} \sum_{i=1}^N \rho^{(i)} G^{(i)}$$

* $G_t$: the observed return from time $t$ onward based on the behaviour policy $b$, where $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$
* $\rho_{t:T-1}$: the cumulative importance sampling ratio
* $\rho^{(i)} G^{(i)}$: one reweighted return from the $i$-th trajectory.
* $N$: the total number of trajectories sampled from the behaviour policy $b$.

So if we have multiple trajectories, we compute the adjusted return for each using $\rho^{(i)} G^{(i)}$, and then average them to estimate the expected return under the target policy.
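
As a sketch of this averaging, the code below computes $\rho \cdot G$ for each trajectory from its first step and averages the results. The policies are assumed to be given as tables of probabilities keyed by `(state, action)`; the function names, states, actions, and the two trajectories are hypothetical.

```python
def weighted_return(trajectory, pi, b, gamma):
    """Return rho_{0:T-1} * G_0 for one trajectory generated by b.

    trajectory: list of (state, action, reward) with reward = r_{t+1}.
    pi, b: dicts mapping (state, action) -> probability.
    """
    rho, g, discount = 1.0, 0.0, 1.0
    for s, a, r in trajectory:
        rho *= pi[(s, a)] / b[(s, a)]   # per-step ratio pi(a|s) / b(a|s)
        g += discount * r               # accumulate the discounted return
        discount *= gamma
    return rho * g

def ordinary_is_estimate(trajectories, pi, b, gamma):
    """Average the reweighted returns over all sampled trajectories."""
    total = sum(weighted_return(t, pi, b, gamma) for t in trajectories)
    return total / len(trajectories)

# Hypothetical policies: pi always picks "L" in s0, b is uniform everywhere.
pi = {("s0", "L"): 1.0, ("s0", "R"): 0.0, ("s1", "L"): 0.5, ("s1", "R"): 0.5}
b = {("s0", "L"): 0.5, ("s0", "R"): 0.5, ("s1", "L"): 0.5, ("s1", "R"): 0.5}
trajectories = [
    [("s0", "L", 1.0), ("s1", "R", 2.0)],
    [("s0", "R", 5.0)],
]
estimate = ordinary_is_estimate(trajectories, pi, b, gamma=0.5)
print(estimate)
```

The second trajectory gets weight zero because the target policy would never take `"R"` in `s0`, so its large reward does not leak into the estimate.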

## Off-Policy Monte Carlo Method
Off-policy Monte Carlo learning enables us to evaluate and improve a target policy $\pi$ using episodes generated from a different behaviour policy $b$. Both policies can be either deterministic or stochastic. The only essential requirement is the coverage condition as discussed before, where

$$\text{If } \pi(a \mid s) > 0, \text{ then } b(a \mid s) > 0$$

### Algorithm Steps
1. Initialize:
   - Initialize a target policy $\pi$ (can be random).
   - Initialize the action-value function $q(s, a)$ arbitrarily.

2. Generate an Episode:
   - Follow the behaviour policy $b$ to generate an episode.
   - Record the sequence of transitions: $(s_0, a_0, r_1), (s_1, a_1, r_2), \dots, (s_T, \_)$

3. Policy Evaluation (via Importance Sampling):
   - For each timestep $t$ in the episode, processed in reverse from $T-1$ down to $0$:
     - Compute the return $G_t$ from time $t$ onward: $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$
     - Compute the cumulative importance sampling ratio: $\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(a_k \mid s_k)}{b(a_k \mid s_k)}$
     - Estimate the reweighted return: $v_\pi(s_t) \approx \rho_{t:T-1} \cdot G_t$
     - Use this to update the estimate of $q(s_t, a_t)$
     - Average over multiple episodes (if applicable): $v_\pi(s) \approx \frac{1}{N} \sum_{i=1}^N \rho^{(i)} G^{(i)}$

4. Policy Improvement:
   - For every state $s$ visited in the episode:
     - Improve the policy using a greedy or $\varepsilon$-greedy update:
       $$\pi(s) = \arg\max_a q(s, a) \quad \text{or} \quad \varepsilon\text{-greedy}(q(s, a))$$

5. Repeat:  
   - Go back to Step 2 and repeat the process until the policy converges or is "good enough."

# Temporal Difference Learning

## Stochastic environment
In a stochastic environment, the outcome of an action is random, so the agent may fail to perform the action as intended. Because the same policy can produce different reward sequences, the algorithm maximizes the expected (average) return over those sequences rather than the return of any single one:

$$\text{Expected return} = \text{Average}(R_1 + \gamma R_2 + \gamma^2 R_3 + \dots + \gamma^{n-1} R_n) = \mathbb{E}[R_1 + \gamma R_2 + \gamma^2 R_3 + \dots + \gamma^{n-1} R_n]$$

$$Q(s,a) = R(s) + \gamma \mathbb{E}[\max_{a'} Q(s', a')]$$

The policy selects the action that maximizes the expected return in state $s$.


## Continuous state space
In a continuous state space, the state of the agent, $s$, is a vector of continuous-valued features that contains all the information needed.

## Deep reinforcement learning

1. Randomly initialize the neural network as an initial guess of $Q(s,a)$
2. Repeat {
    * Generate a training set
        1. From state $s$, perform a random action $a$ that results in state $s'$. Then, construct a tuple $(s, a, R(s), s')$
        2. Create and store a sufficient number of tuples (the replay buffer) for training
        3. Calculate $y = R(s) + \gamma \max_{a'} Q(s',a')$ using the current guess given by the neural network
    * Train the neural network
        1. Create the training set from the tuples, where $x=(s,a)$ and $y = R(s) + \gamma \max_{a'} Q(s',a')$
        2. Use the training set to train a network $Q_{new}$ such that $Q_{new}(s,a)\approx y$
    * Set $Q = Q_{new}$

   }

With each iteration, the neural network can get a better approximation of the $Q$ function.

In practice, the neural network often takes only the state $s$ as input and outputs a list of $Q(s,a)$ values, one for each possible action, so the algorithm can pick the action that maximizes $Q(s,a)$ in the list.
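
The target computation in the loop above can be sketched without any deep learning library. In this illustration, `q_fn` is a hypothetical stand-in for the neural network (here a plain lookup table), and the `done` flag on each tuple is an added assumption so that terminal transitions do not bootstrap from a next state.

```python
def td_targets(batch, q_fn, actions, gamma):
    """Build (x, y) training pairs from stored transitions.

    batch: list of (s, a, reward, s_next, done) tuples from the replay buffer.
    q_fn(s, a): current estimate of Q(s, a), standing in for the network.
    """
    xs, ys = [], []
    for s, a, reward, s_next, done in batch:
        # y = R(s) + gamma * max_a' Q(s', a'), with no bootstrap at episode end.
        best_next = 0.0 if done else max(q_fn(s_next, a2) for a2 in actions)
        xs.append((s, a))
        ys.append(reward + gamma * best_next)
    return xs, ys

# Hypothetical current estimates and a tiny replay buffer.
table = {(1, "go"): 2.0, (1, "stay"): 0.5}
q_fn = lambda s, a: table.get((s, a), 0.0)
batch = [(0, "go", 0.0, 1, False), (1, "go", 1.0, 2, True)]

xs, ys = td_targets(batch, q_fn, actions=["go", "stay"], gamma=0.9)
print(xs, ys)
```

In a real setup, the pairs `(xs, ys)` would then be used as a supervised regression dataset for training $Q_{new}$.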


## Mini batch
When the training set is large, the algorithm uses only a subset (a mini batch) of the training examples to compute the cost at each step, and a different subset the next time. As a result, each individual step is not the best possible one, but the training process speeds up significantly.

## Soft update
When updating $Q = Q_{new}$, the new network may be worse than the old one. To prevent fully overwriting the old network, we update the weights and biases through $w = p \times w + (1-p) \times w_{new}$ and $b = p \times b + (1-p) \times b_{new}$ instead of $w = w_{new}$ and $b = b_{new}$, where $p$ is a value between 0 and 1 (typically close to 1, so that the network changes gradually).

# Deep Reinforcement Learning
<img src="https://miro.medium.com/v2/resize:fit:2000/format:webp/0*P8RnG_xRY8sThfyp.png" width=500>