# Reinforcement learning
Reinforcement learning trains an agent to assess the current state of its environment and choose the best action, using rewards as feedback

## Terminology

Agent: the entity that can interact with the environment and perform actions to get rewards

State ($s$): the current situation returned by the environment, which contains the information the algorithm needs in order to choose an action

Action ($a$): the action taken by the agent after assessing the current state

Reward ($R$): immediate feedback given to the agent after it performs an action. The reward can be positive or negative depending on the action taken and its result

Discount factor ($\gamma$): a number between 0 and 1 that weights a future reward by the number of actions needed to reach it. Each additional action multiplies the reward by another factor of $\gamma$, so distant rewards and distant penalties both shrink. As a result, the algorithm prefers to collect rewards early and to delay penalties as long as possible. A smaller $\gamma$ means the algorithm is impatient and will look for short-term reward, whereas a larger $\gamma$ means the algorithm is patient and will look for long-term reward

Policy ($\pi(s)$): a function that takes in the current state of the agent, $s$, and returns the action, $a$, to perform

Return: the discounted sum of rewards (the first reward is not discounted)
$$Return = R_1 + \gamma R_2 + \gamma^2 R_3 + ... + \gamma^{n-1} R_n,$$
where $R_i$ represents the reward received after the $i$th action, with $n$ total actions
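
As a minimal sketch of the return calculation, assuming the first reward is undiscounted and each later reward picks up one more factor of $\gamma$, the function below sums a list of rewards. The name `discounted_return` and the example rewards are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, multiplying the i-th reward (0-based) by gamma ** i."""
    total = 0.0
    for i, reward in enumerate(rewards):
        total += (gamma ** i) * reward
    return total


# Hypothetical episode: three steps with no reward, then a reward of 100.
rewards = [0, 0, 0, 100]
print(discounted_return(rewards, gamma=0.5))  # 0.5 ** 3 * 100 = 12.5
```

With $\gamma = 0.5$ the reward of 100 is worth only 12.5 from the starting state, because it takes three extra actions to reach it.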

<img src="https://www.scribbr.com/wp-content/uploads/2023/08/the-general-framework-of-reinforcement-learning.webp" width=500>

## Markov Decision Process (MDP)
In a Markov decision process, the future depends only on the current state of the agent, not on the previous states, so the agent can decide its next action from the current state alone


## State-action value function ($Q$)
The state-action value function $Q(s, a)$ is the return obtained if the agent
1. Starts in the current state $s$
2. Takes action $a$ once, where $a$ can be any of the possible actions (not necessarily the best one)
3. Behaves optimally after that action is taken

* The best possible return from state $s$ is $\max_a Q(s, a)$
* The best possible action in state $s$ is the action $a$ that attains $\max_a Q(s, a)$, which will maximize the return
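
To illustrate the two bullet points, here is a small sketch that reads the best return and the best action from the $Q$ values of a single state. The action names and numbers are hypothetical.

```python
# Hypothetical Q(s, a) values for one fixed state s.
q_values = {"left": 12.5, "right": 6.25, "stay": 10.0}

# Best possible return from s: the largest Q(s, a).
best_return = max(q_values.values())

# Best action in s: the action whose Q(s, a) is largest.
best_action = max(q_values, key=q_values.get)

print(best_return, best_action)
```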

## Bellman equation
$$Q(s,a) = R(s) + \gamma \max_{a'}Q(s', a')$$

$s$: current state

$a$: current action

$s'$: state after action $a$ is performed

$a'$: action that will be taken in state $s'$

$R(s)$: reward of the current state

The Bellman equation shows that the return from taking action $a$ in state $s$ and behaving optimally afterwards equals the reward from the current state plus the discount factor times the optimal return from the next state. $Q$ is therefore defined recursively in terms of itself
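
One way to see the recursion at work is to apply the Bellman equation repeatedly until the $Q$ values stop changing. The sketch below does this on a made-up environment: six states in a row, terminal states at both ends with rewards 100 and 40, and two actions. The layout, the rewards, and the convention that $Q(s,a) = R(s)$ in a terminal state are all assumptions for the example.

```python
GAMMA = 0.5
REWARDS = [100, 0, 0, 0, 0, 40]        # R(s) for states 0..5 (hypothetical)
TERMINAL = {0, 5}                      # the episode ends in these states
ACTIONS = {"left": -1, "right": +1}    # each action moves one state over


def solve_q(num_sweeps=50):
    """Apply the Bellman equation repeatedly, starting from Q = 0."""
    q = {(s, a): 0.0 for s in range(len(REWARDS)) for a in ACTIONS}
    for _ in range(num_sweeps):
        for s in range(len(REWARDS)):
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    # No next state: the return is just the reward.
                    q[(s, a)] = REWARDS[s]
                else:
                    s_next = s + step
                    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
                    q[(s, a)] = REWARDS[s] + GAMMA * best_next
    return q


q = solve_q()
print(q[(1, "left")], q[(1, "right")])  # 50.0 12.5
```

From state 1, going left reaches the reward of 100 in one step ($0 + 0.5 \times 100 = 50$), while going right first and then behaving optimally is worth only 12.5.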

## Stochastic environment
A stochastic environment is random in nature, so there is some probability that the agent fails to perform an action as expected. The same policy can then produce different reward sequences on different runs, so the algorithm maximizes the expected return (the average return over many runs) instead of the return of a single sequence

$$\text{Expected return} = \text{Average}(R_1 + \gamma R_2 + \gamma^2 R_3 + ... + \gamma^{n-1} R_n) = E[R_1 + \gamma R_2 + \gamma^2 R_3 + ... + \gamma^{n-1} R_n]$$

$$Q(s,a) = R(s) + \gamma E[\max_{a'}Q(s', a')]$$

The policy will select the action that maximizes the expected return in state $s$
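
As a rough sketch of the expectation in the stochastic Bellman equation, the function below averages the best next-state value over the possible outcomes of an action, weighted by their probabilities. The 90%/10% split and the next-state values are assumptions chosen for the example.

```python
def expected_backup(reward, gamma, outcomes):
    """Compute R(s) + gamma * E[max_a' Q(s', a')].

    outcomes: list of (probability, best Q value in that next state).
    """
    expected_best_next = sum(prob * best_q for prob, best_q in outcomes)
    return reward + gamma * expected_best_next


# Hypothetical: the action works as intended 90% of the time (best next
# value 100) and slips to another state 10% of the time (best next value 25).
outcomes = [(0.9, 100.0), (0.1, 25.0)]
q_sa = expected_backup(reward=0.0, gamma=0.5, outcomes=outcomes)
print(q_sa)  # about 46.25, versus 50.0 if the action never failed
```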


## Continuous state space
In a continuous state space, the state of the agent, $s$, is a vector of numbers that contains all the information needed, rather than one of a finite set of discrete states

## Deep reinforcement learning

1. Randomly initialize the neural network as a guess of $Q(s,a)$
2. Repeat {
    * Generate a training set
        1. From state $s$, perform a random action $a$ that results in state $s'$. Then, construct a tuple $(s, a, R(s), s')$
        2. Create and store a sufficient number of tuples (the replay buffer) for training
        3. For each tuple, calculate $y = R(s) + \gamma \max_{a'} Q(s',a')$ using the current guess of $Q$ given by the neural network
    * Train the neural network
        1. Create a training set from the tuples, where $x=(s,a)$ and $y = R(s) + \gamma \max_{a'} Q(s',a')$
        2. Use the training set to train a network $Q_{new}$ such that $Q_{new}(s,a)\approx y$
    * Set $Q = Q_{new}$
    
   }
 
With each iteration, the neural network becomes a better approximation of the $Q$ function
   
The neural network takes in a state (or a state-action pair) and approximates $Q(s,a)$. When it outputs a list of $Q(s,a)$ values, one for each possible action, the algorithm can pick the action that maximizes $Q(s,a)$ from the list
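
The step that turns the replay buffer into a training set can be sketched without any neural network library. In the sketch below, `q_function` is a plain Python stand-in for the current network, and the tuples and values are made up. Handling of terminal states and the training step itself are left out.

```python
def build_training_set(replay_buffer, q_function, actions, gamma):
    """Turn (s, a, R(s), s') tuples into inputs x = (s, a) and targets y."""
    inputs, targets = [], []
    for state, action, reward, next_state in replay_buffer:
        # y = R(s) + gamma * max_a' Q(s', a'), using the current guess of Q.
        best_next = max(q_function(next_state, a2) for a2 in actions)
        inputs.append((state, action))
        targets.append(reward + gamma * best_next)
    return inputs, targets


def initial_guess(state, action):
    # Stand-in for a freshly initialized network (arbitrary values).
    return 1.0 if action == "right" else 0.0


# Hypothetical replay buffer with two stored experiences.
replay_buffer = [(2, "left", 0.0, 1), (4, "right", 3.0, 5)]
xs, ys = build_training_set(
    replay_buffer, initial_guess, actions=["left", "right"], gamma=0.5
)
print(xs)  # [(2, 'left'), (4, 'right')]
print(ys)  # [0.5, 3.5]
```

A real implementation would then fit $Q_{new}$ to these `(xs, ys)` pairs with a regression loss such as mean squared error.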

## $\epsilon$ greedy policy
Greedy (exploitation) step: with a high probability ($p=1-\epsilon$), pick the action that maximizes $Q(s,a)$ based on the current $Q$ function

Exploration step: with a low probability ($p=\epsilon$), pick a random action. This allows the algorithm to explore the action space, even though some of those actions do not maximize the return under the current $Q$ function

Start with a high $\epsilon$ value and decrease it gradually as the approximation for $Q$ becomes better
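
A minimal sketch of the two steps and of the gradual decrease of $\epsilon$, using only the standard library. The function names, the decay rate of 0.995, and the floor of 0.01 are assumptions for illustration, not values from these notes.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict mapping action -> estimated Q(s, a)."""
    if rng.random() < epsilon:
        # Exploration step: any action, chosen uniformly at random.
        return rng.choice(list(q_values))
    # Greedy step: the action with the largest estimated Q(s, a).
    return max(q_values, key=q_values.get)


def decay_epsilon(epsilon, decay=0.995, minimum=0.01):
    """Shrink epsilon a little after each episode, down to a floor."""
    return max(minimum, epsilon * decay)


q_values = {"left": 12.5, "right": 6.25}
print(epsilon_greedy(q_values, epsilon=0.0))  # always greedy: left
```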


## Mini batch
When the training set is large, the algorithm computes the cost on only a subset of the training examples at each step and uses a different subset at the next step. Each step is therefore not the best possible step for the full training set, but it is much cheaper to compute, which speeds up the training process significantly
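
For the replay buffer described earlier, drawing a mini batch can be as simple as sampling without replacement. This is a sketch under the assumption that the buffer is a Python list; the batch size and buffer contents are made up.

```python
import random


def sample_mini_batch(replay_buffer, batch_size, rng=random):
    """Return a random subset of the stored tuples for one training step."""
    # Never ask for more examples than the buffer holds.
    batch_size = min(batch_size, len(replay_buffer))
    return rng.sample(replay_buffer, batch_size)


# Hypothetical buffer of (s, a, R(s), s') tuples.
replay_buffer = [(i, "right", 0.0, i + 1) for i in range(1000)]
batch = sample_mini_batch(replay_buffer, batch_size=64)
print(len(batch))  # 64
```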

## Soft update
When updating $Q = Q_{new}$, the new network may be worse than the old one. To prevent a full overwrite, we update the weights and biases through $w = p \times w + (1-p) \times w_{new}$ and $b = p \times b + (1-p) \times b_{new}$ instead of $w = w_{new}$ and $b = b_{new}$, where $p$ is a value between 0 and 1
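
The update rule can be sketched on plain lists of numbers. The function name `soft_update` and the example values are hypothetical, and a real network would apply the same blend to every weight and bias array.

```python
def soft_update(old_params, new_params, p):
    """Blend parameters: p * old + (1 - p) * new, element by element."""
    return [p * old + (1.0 - p) * new for old, new in zip(old_params, new_params)]


old_weights = [1.0, 2.0]
new_weights = [3.0, 6.0]

# p = 0.5 lands halfway between the old and new values.
print(soft_update(old_weights, new_weights, p=0.5))  # [2.0, 4.0]
```

A value of $p$ close to 1 keeps most of the old parameters, so a single bad $Q_{new}$ changes the network only slightly.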