# Reinforcement learning
Reinforcement learning is a framework in which an agent learns to make decisions by interacting with an environment in order to maximize cumulative reward. The agent observes the current state, selects an action, receives a reward, and transitions to a new state. Over time, it learns a policy that maximizes long-term returns through trial and error.

## Terminology
**Agent**: the entity that makes decisions, performs actions in the environment, receives rewards, and learns from the consequences of its actions

**Environment**: everything external to the agent that it can interact with

**State ($s$)**: all the information needed to describe the current situation of the environment. The agent acts based on the current state, and its action usually causes a transition to a new state. A state should be Markovian, meaning it contains everything needed to predict what happens next without reference to earlier states

**Action ($a$)**: the decision taken by the agent after assessing the current state that can affect the environment. An action can be either discrete or continuous

**Reward ($r$)**: feedback given to the agent after it performs an action. The reward can be positive or negative depending on the action taken and the resulting state, which tells the agent how good or bad its action was. A reward can be immediate or delayed. The goal of the agent is to maximize the cumulative reward

**Discount factor ($\gamma$)**: a number between 0 and 1 that determines how much future rewards are worth relative to immediate ones. A reward received $k$ steps further in the future is scaled by $\gamma^k$, so the more actions it takes to reach a reward, the less that reward counts. A larger $\gamma$ makes the agent patient, so it looks for long-term reward, whereas a smaller $\gamma$ makes it impatient, so it looks for short-term reward. With $\gamma < 1$ and bounded rewards, the discount factor also ensures the return converges in infinite-horizon problems

**Return ($R_t$)**: the total discounted reward accumulated after timestep $t$,
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{n-1} r_{t+n},$$
where $r_{t+1}$ is the reward received for the action taken at timestep $t$, and $n$ is the total number of actions taken from timestep $t$ onward
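The return formula can be checked with a few lines of Python. This is a minimal sketch; the function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the definition.

```python
def discounted_return(rewards, gamma):
    """Return r_{t+1} + gamma * r_{t+2} + ... for a list of future rewards."""
    total = 0.0
    # Work backwards so each earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

rewards = [1.0, 0.0, 2.0]  # r_{t+1}, r_{t+2}, r_{t+3}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With $\gamma = 1$ the function reduces to a plain sum, and with $\gamma = 0$ only the first reward counts.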

**Policy ($\pi(s)$)**: a function that takes in the current state of the agent, $s$, and returns an action, $a$, to perform. The agent makes decisions based on a policy. The goal of RL is to find an optimal policy $\pi^*$ that maximizes the expected return from every state

**Trajectory**: the sequence of states, actions, and rewards the agent experiences

**Episode**: a trajectory that ends in a terminal state

**Exploration**: the agent tries new or less-known actions to discover the environment

**Exploitation**: the agent selects the best-known action to maximize the reward according to its current knowledge

## General RL workflow
For a general RL workflow, the agent observes the current state of the environment and selects an action based on a policy at each time step. The environment then responds by transitioning to a new state and providing a reward that reflects the quality of the action taken. The agent uses this reward, along with the new state, to update its understanding of the environment and improve its policy

This cycle of observing, acting, receiving rewards, and learning continues over many episodes (trial and error), allowing the agent to gradually learn a policy that maximizes long-term cumulative reward. Key components of this process include exploration (trying new actions to gather information) and exploitation (choosing the best-known action to maximize reward). Over time, the agent aims to find a balance between the two and converge toward an optimal behavior
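The observe-act-reward cycle described above can be sketched as a plain loop. The `CorridorEnv` class, its `reset`/`step` methods, and the random policy are hypothetical stand-ins chosen for illustration, and the learning step is left out to keep the sketch short.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: walk along positions 0..4, goal at 4."""

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); position is clipped to [0, 4].
        self.position = max(0, min(4, self.position + action))
        done = self.position == 4
        reward = 1.0 if done else 0.0
        return self.position, reward, done

def run_episode(env, policy, max_steps=100):
    """Run one episode and return the trajectory of (state, action, reward)."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                       # agent selects an action
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

random.seed(0)
random_policy = lambda state: random.choice([-1, 1])
trajectory = run_episode(CorridorEnv(), random_policy)
print(len(trajectory), "steps, total reward", sum(r for _, _, r in trajectory))
```

A learning agent would use each `(state, action, reward)` tuple to update its policy inside the loop instead of only recording it.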

<img src="https://www.scribbr.com/wp-content/uploads/2023/08/the-general-framework-of-reinforcement-learning.webp" width=500>

# Multi-armed bandit
The multi-armed bandit is the simplest reinforcement learning problem. At each timestep, the agent chooses between $k$ different actions and receives a reward based on the action chosen, but the reward distributions are unknown and different for each action. The goal of the agent is to maximize the cumulative reward over a given number of steps

<img src="https://miro.medium.com/v2/resize:fit:894/1*ZS_craAiKCJzFj9dQ9RaYQ.png" width=500>

## Action-Value

The value of an action, $q(a)$, is the expected reward received when action $a$ is taken:

$$q(a) = \mathbb{E}[R \mid A = a]$$

* $q(a)$: the expected reward of taking action $a$
* In multi-armed bandit problems, there is no concept of state or policy, so the value depends only on the action itself.

### Sample-Average Method
In reinforcement learning, the true action values $q(a)$ are unknown and must be estimated through repeated interactions with the environment.  
One simple estimation approach is the sample-average method, where the estimated value $\hat{q}_t(a)$ at time $t$ is:

$$\hat{q}_t(a) = \frac{\sum_{i=1}^{t-1} \mathbb{1}[A_i = a] \cdot r_i}{\sum_{i=1}^{t-1} \mathbb{1}[A_i = a]} = \frac{\text{Cumulative reward received from action $a$ before timestep $t$}}{\text{Number of times action $a$ is taken before timestep $t$}}$$

* $A_i$: the action taken at time step $i$
* $r_i$: the reward received at time step $i$
* $\mathbb{1}[A_i = a]$: an **indicator function** that equals 1 if action $a$ was taken at time $i$, and 0 otherwise

As the number of samples increases, $\hat{q}_t(a)$ converges to the true expected value $q(a)$, by the law of large numbers.
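As a small illustration of the sample-average formula, the sketch below estimates $\hat{q}(a)$ from a stored history of (action, reward) pairs. The function name, the history values, and the default of `0.0` for untried actions are assumptions made for the example.

```python
def sample_average(history, action):
    """Estimate q(action) from a list of (action, reward) pairs."""
    rewards = [r for a, r in history if a == action]
    if not rewards:
        return 0.0  # assumed default when the action has never been tried
    return sum(rewards) / len(rewards)

history = [(0, 1.0), (1, 0.0), (0, 3.0), (1, 2.0), (0, 2.0)]
print(sample_average(history, 0))  # (1 + 3 + 2) / 3 = 2.0
print(sample_average(history, 1))  # (0 + 2) / 2 = 1.0
```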

#### Incremental Updates

When learning to estimate the expected reward of each action, we need to update our estimate each time a new reward is observed. One issue with the sample-average method is that it requires storing all past rewards for each action to compute $\hat{q}_t(a)$. As the number of trials increases, the required memory grows linearly, which becomes inefficient. To solve this, we can rewrite the sample-average update in a recursive (incremental) form, where

$$\hat{q}_{t+1} = \hat{q}_t + \alpha_t \left(r_t - \hat{q}_t\right)$$

* $\hat{q}_{t+1}$: the updated estimate of the action-value after time step $t$
* $\hat{q}_t$: the previous estimate before seeing the latest reward
* $r_t$: the reward received at time step $t$
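* $\alpha_t$: the step size or learning rate, a value between 0 and 1 that determines how much the estimate is updated. Choosing $\alpha_t = 1/n$, where $n$ is the number of times the action has been taken so far, recovers the sample average exactly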

This form only requires storing the current estimate and does not require keeping track of all past rewards, making it computationally efficient.
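A short sketch of the incremental form, assuming a step size of $1/n$ for the $n$th reward of an action. The function name `incremental_update` and the reward values are illustrative.

```python
def incremental_update(estimate, reward, step_size):
    """One update: q_new = q_old + alpha * (reward - q_old)."""
    return estimate + step_size * (reward - estimate)

rewards = [1.0, 3.0, 2.0]
estimate = 0.0
for n, reward in enumerate(rewards, start=1):
    # A step size of 1/n reproduces the sample average of the first n rewards.
    estimate = incremental_update(estimate, reward, step_size=1.0 / n)

print(estimate)                     # running estimate after three rewards
print(sum(rewards) / len(rewards))  # sample average for comparison
```

Only `estimate` and the count `n` are kept between updates, so memory use stays constant no matter how many rewards are observed.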

## Non-Stationary Problem
A problem is said to be stationary if the reward distribution for each action remains the same over time. However, in many real-world RL problems the environment is non-stationary, meaning the reward distributions change over time. In such cases, the agent must adapt its action-value estimates so that they reflect recent outcomes more than older ones. Using a constant step size $\alpha$ in the incremental update rule achieves this. Unrolling the recursion gives

$$\hat{q}_{t+1} = (1-\alpha)^t\hat{q}_1 + \sum^{t}_{i=1}\alpha (1-\alpha)^{t-i} r_i$$

* $\alpha$: a constant step size between 0 and 1
* $r_i$: the reward received after taking the action at timestep $i$

This formula represents the exponentially decaying weighted average of past rewards, where recent rewards are given more significance, and older rewards gradually “fade out” due to the $(1 - \alpha)^{t-i}$ decay factor.
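To see that the step-by-step update with a constant step size and the weighted-sum formula agree, both can be computed side by side. This is a sketch with made-up rewards, and the function names are hypothetical.

```python
def recursive_estimate(q1, rewards, alpha):
    """Apply q <- q + alpha * (r - q) once per reward."""
    q = q1
    for r in rewards:
        q = q + alpha * (r - q)
    return q

def closed_form_estimate(q1, rewards, alpha):
    """(1 - alpha)^t * q1 + sum_i alpha * (1 - alpha)^(t - i) * r_i."""
    t = len(rewards)
    weighted = sum(alpha * (1 - alpha) ** (t - i) * r
                   for i, r in enumerate(rewards, start=1))
    return (1 - alpha) ** t * q1 + weighted

rewards = [1.0, 1.0, 1.0, 5.0]  # the last reward reflects a changed distribution
print(recursive_estimate(0.0, rewards, alpha=0.5))    # 2.9375
print(closed_form_estimate(0.0, rewards, alpha=0.5))  # 2.9375
```

The plain average of these rewards is 2.0, so the constant step size pulls the estimate noticeably closer to the most recent reward.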

## Exploration and Exploitation Tradeoff
One key challenge in reinforcement learning is deciding when to explore and when to exploit, as an agent cannot do both simultaneously. Exploration helps the agent improve its knowledge about the environment, which can lead to greater rewards in the long term. Exploitation, on the other hand, involves leveraging the agent’s current knowledge to maximize immediate rewards. An optimal policy should strike a balance between exploration and exploitation based on the agent’s current knowledge and state, in order to maximize cumulative reward over time.


### Greedy
The greedy policy is simple as the agent always exploits by choosing the action with the highest estimated reward, without any exploration. While this strategy can work in very simple or well-understood environments, it often performs poorly in more complex or uncertain problems, because the agent never tries new actions and thus fails to discover potentially better options or adapt when conditions change.

### $\epsilon$ Greedy
The $\epsilon$-greedy policy is a variation of the greedy policy that introduces a small amount of exploration.  
At each time step, the agent exploits with probability $1 - \epsilon$ and explores (by choosing a random action) with probability $\epsilon$. This policy helps the agent avoid getting stuck with suboptimal actions by occasionally trying alternatives. This can be written as

$$
A_t =
\begin{cases}
\arg\max_a \hat{q}_t(a) & \text{with probability } 1 - \epsilon \\
\text{a random action} & \text{with probability } \epsilon
\end{cases}
$$

In general, the $\epsilon$-greedy policy tends to outperform the greedy policy in the long run, because it keeps gaining knowledge about the environment through exploration.
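The selection rule above can be tried on a simulated bandit. This is a minimal sketch, assuming Gaussian rewards with unit variance. The arm means, the function names, and the seed are illustrative choices, not part of the method.

```python
import random

def epsilon_greedy_action(estimates, epsilon, rng):
    """Pick a random arm with probability epsilon, else the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])

def run_bandit(true_means, epsilon, steps, seed=0):
    """Simulate a bandit with Gaussian rewards (an assumed reward model)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    total_reward = 0.0
    for _ in range(steps):
        action = epsilon_greedy_action(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Sample-average update written incrementally (step size 1/n).
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward

estimates, counts, total = run_bandit([0.2, 0.8, 0.5], epsilon=0.1, steps=2000)
print("pulls per arm:", counts)
print("estimated values:", [round(q, 2) for q in estimates])
```

Setting `epsilon=0` turns this into the purely greedy policy, which makes it easy to compare the two on the same arms.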

### Optimistic Initial Value

### Upper-Confidence Bound Action Selection (UCB)






**Q-function (Action-Value Function)**: a function that estimates the expected return for taking action $a$ in state $s$ and following policy $\pi$ afterward. The $Q$ function is specific to each state–action pair
$$ Q^\pi(s, a) = \mathbb{E}_\pi[R_t \mid s_t = s, a_t = a] $$

* $Q$: the action value, which captures the expected total return (cumulative future rewards), starting from the timestep $t$ with the state $s$, taking a specific action $a$, and then following policy $\pi$ for all future steps.

* $R_t$: the return received from timestep $t$ onward in one trial

In the multi-armed bandit problem, there is no concept of state or policy, so the action-value function simplifies to
$$ Q(a) = \mathbb{E}[R \mid A = a] $$
where the expected return depends only on the action taken at the current step, since each action is independent of the others

The goal of reinforcement learning is to learn a policy that maximizes the expected return over time.




**Value function ($V$)**: a function that estimates the expected return when starting from a state and following a policy $\pi$
$$ V^\pi(s) = \mathbb{E}_\pi[R_t \mid s_t = s] $$
The value function evaluates how good a particular state is under a given policy


## Markov Decision Process (MDP)
A Markov decision process (MDP) describes an RL problem in which the Markov property holds: what happens next depends only on the current state and the action taken, not on the states that came before it


## State-action value function ($Q$)
The state-action value function $Q(s, a)$ is the return obtained if the agent
1. Starts in the current state $s$
2. Takes action $a$ once, where $a$ can be any of the possible actions and not necessarily the best one
3. Behaves optimally after that action is taken

* The best possible return from state $s$ is $\max_a Q(s, a)$
* The best possible action in state $s$ is the action $a$ that attains $\max_a Q(s, a)$, since it maximizes the return
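Given the $Q$ values of one state, the two bullets above amount to a max and an argmax. A minimal sketch, with made-up values and a hypothetical `best_action` helper:

```python
def best_action(q_values):
    """q_values: dict mapping each action to Q(s, a) for a single state s."""
    return max(q_values, key=q_values.get)

q_values_at_s = {"left": 12.5, "right": 10.0}  # illustrative numbers
print(best_action(q_values_at_s))   # left
print(max(q_values_at_s.values()))  # 12.5, the best possible return from s
```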

## Bellman equation
$$Q(s,a) = R(s) + \gamma \max_{a'}Q(s', a')$$

$s$: current state

$a$: current action

$s'$: state after action $a$ is performed

$a'$: action that will be taken in state $s'$

$R(s)$: reward of the current state

The Bellman equation shows that the optimal return from state $s$ after taking action $a$ equals the reward of the current state plus the discount factor times the optimal return from the next state $s'$. Because $Q$ appears on both sides, this is a recursive definition
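One way to see the recursion at work is to apply the Bellman equation repeatedly on a tiny deterministic problem until the values stop changing. The sketch below assumes a hypothetical six-state corridor with terminal states at both ends. The rewards, $\gamma = 0.5$, and the number of sweeps are illustrative choices.

```python
GAMMA = 0.5
REWARDS = [100.0, 0.0, 0.0, 0.0, 0.0, 40.0]  # R(s) for states 0..5
TERMINAL = {0, 5}                            # episodes end at either end
ACTIONS = (-1, +1)                           # move left or move right

def solve_q(num_sweeps=50):
    """Repeatedly apply Q(s,a) = R(s) + gamma * max_a' Q(s',a')."""
    q = {(s, a): 0.0 for s in range(len(REWARDS)) for a in ACTIONS}
    for _ in range(num_sweeps):
        for s in range(len(REWARDS)):
            for a in ACTIONS:
                if s in TERMINAL:
                    q[(s, a)] = REWARDS[s]  # nothing follows a terminal state
                else:
                    s_next = s + a
                    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
                    q[(s, a)] = REWARDS[s] + GAMMA * best_next
    return q

q = solve_q()
for s in range(1, 5):
    print(s, "left:", q[(s, -1)], "right:", q[(s, +1)])
```

From state 4, moving right gives $0 + 0.5 \times 40 = 20$, which beats moving left, while from states 1 to 3 the larger reward on the left wins despite the extra discounting.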

## Stochastic environment
A stochastic environment is random in nature, so an action may not have its intended effect (for example, the agent may fail to perform the action as expected). The same policy can therefore produce different reward sequences, and the goal becomes to maximize the expected return (average return) over those sequences

$$\text{Expected return} = \text{Average}(R_1 + \gamma R_2 + \gamma^2 R_3 + \dots + \gamma^{n-1} R_n) = \mathbb{E}[R_1 + \gamma R_2 + \gamma^2 R_3 + \dots + \gamma^{n-1} R_n]$$

$$Q(s,a) = R(s) + \gamma \mathbb{E}[\max_{a'}Q(s', a')]$$

The policy selects the action that maximizes the expected return in state $s$
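The expectation in the stochastic Bellman equation is a probability-weighted average over the possible next states. A minimal sketch for a single state-action pair, where the transition probabilities, the $Q$ values, and the function name are all made-up for illustration:

```python
def expected_backup(reward, gamma, transitions, q_next):
    """Compute R(s) + gamma * E[max_a' Q(s', a')] for one state-action pair.

    transitions: list of (probability, next_state) pairs summing to 1.
    q_next: dict mapping next_state to the list of Q(s', a') values.
    """
    expectation = sum(p * max(q_next[s_next]) for p, s_next in transitions)
    return reward + gamma * expectation

# Assumed example: the intended move succeeds with probability 0.9 and
# slips to a different state with probability 0.1.
transitions = [(0.9, "intended"), (0.1, "slipped")]
q_next = {"intended": [25.0, 6.25], "slipped": [6.25, 20.0]}
print(expected_backup(0.0, 0.5, transitions, q_next))  # 0.5 * (0.9*25 + 0.1*20)
```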


## Continuous state space
In a continuous state space, the state of the agent, $s$, is a vector of real-valued numbers that contains all the information needed, rather than one of a finite set of discrete states

## Deep reinforcement learning

1. Randomly initialize the neural network as an initial guess of $Q(s,a)$
2. Repeat {
    * Generate a training set
        1. From state $s$, perform a random action $a$ that results in state $s'$. Then, construct a tuple $(s, a, R(s), s')$
        2. Create and store a sufficient number of tuples (the replay buffer) for training
        3. For each tuple, calculate $y = R(s) + \gamma \max_{a'} Q(s',a')$ using the current guess of $Q$ given by the neural network
    * Train the neural network
        1. Create a training set from the tuples, where $x=(s,a)$ and $y = R(s) + \gamma \max_{a'} Q(s',a')$
        2. Use the training set to train a network $Q_{new}$ such that $Q_{new}(s,a)\approx y$
    * Set $Q = Q_{new}$

   }

With each iteration, the neural network gets a better approximation of the $Q$ function
   
The neural network can take in a state-action pair and approximate the return $Q(s,a)$. Alternatively, it can take in only the state and return a list of $Q(s,a)$ values, one for each possible action, so the algorithm can pick the action that maximizes $Q(s,a)$ from the list
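The "generate a training set" step can be sketched without any deep learning library by using a fixed function in place of the network. Everything named here (`q_guess`, `build_training_set`, the two stored tuples, $\gamma = 0.9$) is a hypothetical stand-in, and terminal states are not handled.

```python
from collections import deque

GAMMA = 0.9
ACTIONS = [0, 1]

def q_guess(state, action):
    """Stand-in for the neural network: a fixed, made-up guess of Q(s, a)."""
    return 0.1 * state + 0.5 * action

def build_training_set(replay_buffer, q_function):
    """Turn (s, a, R(s), s') tuples into inputs x = (s, a) and targets y."""
    inputs, targets = [], []
    for state, action, reward, next_state in replay_buffer:
        best_next = max(q_function(next_state, a2) for a2 in ACTIONS)
        inputs.append((state, action))
        targets.append(reward + GAMMA * best_next)
    return inputs, targets

replay_buffer = deque(maxlen=10_000)  # keeps only the most recent tuples
replay_buffer.append((0, 1, 0.0, 1))  # (s, a, R(s), s')
replay_buffer.append((1, 0, 1.0, 2))
inputs, targets = build_training_set(replay_buffer, q_guess)
print(inputs)
print(targets)
```

In a real implementation the `(inputs, targets)` pairs would be passed to a supervised training routine to fit $Q_{new}$, and `q_guess` would be replaced by the network's forward pass.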

## $\epsilon$ greedy policy
* Greedy (exploitation) step: with high probability ($p=1-\epsilon$), pick the action that maximizes $Q(s,a)$ based on the current $Q$ function
* Exploration step: with low probability ($p=\epsilon$), pick a random action. This allows the algorithm to explore the action space, including actions that do not currently appear to maximize the return

Start with a high $\epsilon$ value and decrease it gradually as the approximation for $Q$ becomes better
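One common way to decrease $\epsilon$ gradually is a multiplicative decay with a lower bound. This is a sketch of that idea. The start value, floor, and decay rate below are assumed values, not ones prescribed by the method.

```python
def decayed_epsilon(step, start=1.0, end=0.01, decay=0.995):
    """Multiplicative decay from `start` toward a floor of `end`."""
    return max(end, start * decay ** step)

for step in (0, 100, 500, 2000):
    print(step, round(decayed_epsilon(step), 4))
```

The floor keeps a small amount of exploration even late in training, when the $Q$ estimate is already reasonably good.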


## Mini batch
When the training set is large, the algorithm computes the cost on only a subset of the training examples at each step and uses a different subset the next time. As a result, each step is not the best possible one, but the training process speeds up significantly
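Drawing a mini batch from stored experience can be as simple as sampling without replacement. The sketch below uses Python's standard `random.sample`. The helper name, the buffer contents, and the batch size are illustrative assumptions.

```python
import random

def sample_mini_batch(replay_buffer, batch_size, rng=random):
    """Draw a random subset of stored experience tuples without replacement."""
    batch_size = min(batch_size, len(replay_buffer))
    return rng.sample(list(replay_buffer), batch_size)

# Made-up (s, a, R(s), s') tuples standing in for a replay buffer.
replay_buffer = [(s, s % 2, float(s), s + 1) for s in range(10)]
batch = sample_mini_batch(replay_buffer, batch_size=4, rng=random.Random(0))
print(batch)
```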

## Soft update
When updating $Q = Q_{new}$, the new network may be worse than the old one. To avoid fully overwriting the old network, we update the weights and biases through $w = p \times w + (1-p) \times w_{new}$ and $b = p \times b + (1-p) \times b_{new}$ instead of $w = w_{new}$ and $b = b_{new}$, where $p$ is a value between 0 and 1
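The soft update rule applies the same blend to every parameter. A minimal sketch, treating the parameters as plain lists of floats. The function name and the choice $p = 0.99$ are assumptions for illustration.

```python
def soft_update(weights, new_weights, p=0.99):
    """Blend parameters element by element: w = p * w + (1 - p) * w_new."""
    return [p * w + (1 - p) * w_new for w, w_new in zip(weights, new_weights)]

weights = [1.0, -2.0]
new_weights = [0.0, 0.0]
print(soft_update(weights, new_weights, p=0.99))  # approximately [0.99, -1.98]
```

With $p$ close to 1 the old parameters change only slightly per update, while $p = 0$ reduces to the full overwrite $w = w_{new}$.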