# Unit 1: Introduction to Deep RL

Unit 1 Key concepts:<br> 
<i>RL Process $\rightarrow$ MDP $\rightarrow$ State/Observation Space $\rightarrow$ Action space $\rightarrow$ Rewards $\rightarrow$ Tasks $\rightarrow$ Exploration/exploitation tradeoff $\rightarrow$ Policy $\rightarrow$ Two main approaches to solving RL problems $\rightarrow$ Policy based (Deterministic vs. Stochastic) $\rightarrow$ Value based $\rightarrow$ Q-learning vs. Deep Q-learning </i>

<b>RL Process</b>

A loop of state, action, reward, and next state.

<div align="center"><img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process.jpg" alt="RL process" style="width: 50%;">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/sars.jpg" alt="RL loops output sequence" style="width: 30%;"></div>
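
The loop above can be sketched in a few lines of Python. This is only an illustration: `CoinFlipEnv`, its `reset`/`step` methods, and `run_episode` are hypothetical names, loosely modelled on a Gym-style interface rather than taken from the course.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is just the step counter

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy):
    """Collect (state, action, reward, next_state) tuples for one episode."""
    trajectory = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                       # agent acts
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


trajectory = run_episode(CoinFlipEnv(), policy=lambda s: 0)
```

Each tuple in `trajectory` is one pass through the state, action, reward, next state loop.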

<b>Markov Decision Process (MDP)</b>

<b>State/Observation space:</b>
- State space
- Observation space

<b>Action space</b>
- Continuous space
- Discrete space

<b>Reward</b>
- Cumulative Reward, $R(\tau)$
- Discounted Expected Cumulative Reward
    - Smaller discount => Larger gamma => Agent cares more about long-term reward (and vice versa)


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_1.jpg" style="width:400px;" title="Cumulative reward">

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_2.jpg" style="width:150px;" title="Cumulative reward">
   

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_4.jpg" style="width:400px;" title="Discounted expected cumulative reward">
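
As a small sketch of how gamma changes the discounted return, assuming a made-up reward sequence of three ones (the function name `discounted_return` is my own):

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(rewards, gamma=1.0)   # 1 + 1 + 1
far_sighted = discounted_return(rewards, gamma=0.9)    # 1 + 0.9 + 0.81
short_sighted = discounted_return(rewards, gamma=0.1)  # 1 + 0.1 + 0.01
```

With the larger gamma, later rewards still contribute most of their value; with the smaller gamma, almost all of the return comes from the first reward.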


<b>Tasks</b>
- Episodic tasks
- Continuous tasks

<b>Exploration-exploitation tradeoff</b>

<img align="right" src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/vbm-1.jpg" style=" width:500px; padding: 10px 20px;" title="Value-based Methods">

<b>Policy $\pi$</b> - Brain of our agent

<b>Two main approaches (Policy) for solving RL problems</b>

- <b>Policy-based methods</b>: Learns a policy function.
    - Two types:
        - Deterministic: $a=\pi (s)$
        - Stochastic: $\pi(a|s)=P[A|s]$
    
- <b>Value-based methods</b>: Learns a value function
    - $v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + ... \,|\, S_{t}=s]$

    - Two types (algos):
        - <b>Q learning (traditional RL)</b>
        - <b>Deep Q learning (Uses NN)</b>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" style="width:600px;" title="Q learning vs. Deep Q learning">
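
The two policy types listed above (deterministic $a=\pi(s)$ and stochastic $\pi(a|s)$) might look like this in code. The integer states, the action names, and the 0.8/0.2 probabilities are invented for illustration.

```python
import random


def deterministic_policy(state):
    """a = pi(s): the same state always maps to the same action."""
    return "right" if state < 3 else "left"


def stochastic_policy(state, rng):
    """pi(a|s): sample an action from a state-dependent distribution."""
    p_right = 0.8 if state < 3 else 0.2
    return "right" if rng.random() < p_right else "left"


rng = random.Random(0)
samples = [stochastic_policy(0, rng) for _ in range(1000)]
frac_right = samples.count("right") / len(samples)  # close to 0.8
```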

---

# Unit 2: Introduction to Q-learning

Unit 2 Key Concepts:<br> 
<i>RL recap $\rightarrow$ Value based methods $\rightarrow$ Greedy policy $\rightarrow$ Link between value and policy $\rightarrow$ Epsilon greedy policy $\rightarrow$ Two main strategies of value based function (state-value function vs. action-value function) $\rightarrow$ Bellman's equation $\rightarrow$ Learning strategies (Monte Carlo vs. Temporal Difference) $\rightarrow$ TD target $\rightarrow$ Q-learning  $\rightarrow$ Q-function  $\rightarrow$ Q-table $\rightarrow$ Q-learning algorithm $\rightarrow$ Off-policy vs. On-policy $\rightarrow$ Q-Learning example </i>

<b>Value based methods</b>

- <b>Greedy Policy</b>: Since the policy is not trained/learned in <b>value-based methods</b>, we define its behavior by hand.

<img align="right" src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" style=" width:400px; padding: 10px 20px;" title="The link between value and policy">

> <i>Link between value and policy: </i>
> $\pi^{*}(s) = arg \, \underset{a}{max} \, Q^{*}(s,a)$, where
> - $\pi^{*}$ is the optimal policy.
> - $Q^{*}$ is the optimal value function.
> - $arg \, \underset{a}{max}$ is the pre-defined greedy policy: it selects the action with the highest Q-value (expected cumulative reward) in the given state.

- <b>Epsilon-Greedy Policy</b>: A policy that handles the exploration/exploitation tradeoff.
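
A minimal sketch of epsilon-greedy action selection over one row of Q-values; the function name and the example values are assumptions, not course code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


q_row = [0.1, 0.7, 0.2]
greedy_action = epsilon_greedy(q_row, epsilon=0.0)  # always exploits
```

Setting `epsilon=0.0` recovers the plain greedy policy, and `epsilon=1.0` gives purely random exploration.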

<b>Two main types (strategies) of value based functions:</b>
- <b>state-value function, $V$</b>
    - $V_{\pi}(s) = E_{\pi}[G_{t}|S_{t}=s]$
- <b>action-value function, $Q$</b>
    - $Q_{\pi}(s,a) = E_{\pi}[G_{t}|S_{t}=s, A_{t}=a]$

<img align="right" src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" style=" width:400px; padding: 10px 20px;" title="Bellman equation">

<span style="color:red"><i>Computationally expensive problem</i>: To calculate the value of each state $V_{\pi}(s)$ or state-action pair $Q_{\pi}(s,a)$, we need to sum all the rewards an agent can get if it starts at that state and follows the policy forever afterwards.</span>

<b>Bellman's equation:</b> Simplifies our value estimation.

$V_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma * V_{\pi}(S_{t+1}) \,|\, S_{t}=s]$

<span style="color:blue">To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, <b>which is a long process</b>, we calculate the value as the <b>sum of immediate reward + the discounted value of the state that follows.</b></span>
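
To make the recursion concrete, here is a sketch on a hypothetical three-state chain with deterministic transitions, so the expectation collapses to a single term. The state names, rewards, and `bellman_values` helper are all invented for this example.

```python
GAMMA = 0.9

# Hypothetical chain under a fixed policy: state -> (reward, next_state).
# A next_state of None marks the end of the episode.
transitions = {"A": (1.0, "B"), "B": (1.0, "C"), "C": (1.0, None)}


def bellman_values(transitions, gamma, sweeps=10):
    """Repeatedly apply V(s) = R + gamma * V(s') until the values settle."""
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        for s, (reward, nxt) in transitions.items():
            next_value = values[nxt] if nxt is not None else 0.0
            values[s] = reward + gamma * next_value
    return values


values = bellman_values(transitions, GAMMA)
```

Here `values["A"]` equals the full discounted sum $1 + 0.9 + 0.81$, but it was obtained from the immediate reward plus the discounted value of `"B"` rather than by summing the whole trajectory.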

<b>Learning strategies</b>: How the agent updates its policy (or value function) from the experience and rewards received during training.

- <b>Monte Carlo</b>: learns at end of each episode (IOW, uses an entire episode of experience before learning).
    - $V^{new}(S_{t}) \leftarrow V(S_{t}) + \alpha[G_{t}-V(S_{t})]$


- <b>Temporal Difference (TD)</b>: learns at each step.<br>
  <span style="color:red">As we are updating $V(S_{t})$ at each step, we do not have an entire episode of experience, and therefore we don't have $G_{t}$ (the actual return)</span>. <span style="color:green"> So, we estimate the return using Bellman's equation. This is called bootstrapping, as the <b>TD target</b> is based on the estimate $V(S_{t+1})$ and not on a complete $G_{t}$.</span><br>
  This is also called one-step TD or TD(0).
  >In my opinion, the TD target is nothing but an estimate of the expected cumulative reward.<br>
    
    - $V^{new}(S_{t}) \leftarrow V(S_{t}) + \alpha[G_{t_{estimate}} - V(S_{t})]$
    - $V^{new}(S_{t}) \leftarrow V(S_{t}) + \alpha[R_{t+1} + \gamma * V(S_{t+1}) - V(S_{t})]$
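
The two update rules can be written side by side. This is a sketch with the value function stored in a plain dict; the state names and numbers are made up.

```python
def mc_update(V, state, G, alpha):
    """Monte Carlo: move V(S_t) toward the observed return G_t."""
    V[state] += alpha * (G - V[state])


def td0_update(V, state, reward, next_state, alpha, gamma):
    """TD(0): move V(S_t) toward the TD target R_{t+1} + gamma * V(S_{t+1})."""
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])


# Monte Carlo needs the full return, known only once the episode has ended.
V_mc = {"s0": 0.0, "s1": 0.0}
mc_update(V_mc, "s0", G=3.0, alpha=0.1)

# TD(0) needs only one transition plus the current estimate of the next state.
V_td = {"s0": 0.0, "s1": 2.0}
td0_update(V_td, "s0", reward=1.0, next_state="s1", alpha=0.1, gamma=1.0)
```

In this example both updates move `V["s0"]` by the same amount, because the TD target $1 + 1 \cdot 2$ happens to equal the Monte Carlo return of 3.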

<div align="center"><img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/monte-carlo-approach.jpg" alt="Monte Carlo" style="width: 45%;">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3p.jpg" alt="Monte Carlo" style="width: 45%;"></div>

<div align="center"><img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="Temporal Difference" style="width: 45%;">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1p.jpg" alt="Temporal Difference" style="width: 45%;"></div>

<b>Q-Learning</b>: An <span style="color:blue"><b><u>off-policy</u> <u>value-based method</u> that uses a <u>TD approach</u> to train its action-value function</b></span>. In other words, an RL algorithm used to train Q-function.

<b>Q-Function</b>: This action-value function takes in the state and action as input,  and provides expected value as output. 

<b>Q-Table</b>: The Q-function is encoded by a Q-table, where each cell corresponds to a state-action pair value. Think of the Q-table as the memory of our Q-function.

> Let’s recap the difference between value and reward:
> - <span style="color:blue">The <i>value of a state, or a state-action pair</i> is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts accordingly to its policy forever afterwards.</span> In other words, it's the prediction of future rewards.
> - <span style="color:blue">The <i>reward</i> is the <b>immediate feedback I get from the environment</b> after performing an action at a state.</span>

<b>Q-Learning algorithm</b>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" style="width:700px;" title="Q-learning">

- Step 1: We initialise the Q-table arbitrarily.
    - $Q(s,a) = 0$ for all $s \in S$ and $a \in A(s)$
    - $Q(\text{terminal-state}, .) = 0$
- Step 2: <b>Acting/Inference</b>: Choose an action using a policy (<b>epsilon greedy</b> strategy for exploration vs. exploitation trade-off as the training progresses).
    - $\epsilon$ vs. $1-\epsilon$
- Step 3: Perform action $A_{t}$, get reward $R_{t+1}$ and next state $S_{t+1}$.
- Step 4: <b>Updating/Training</b>: Update action-value function $Q(S_{t}, A_{t})$ after every step/iteration using <b>greedy</b> policy.
    - For state-value function, $V^{new}(S_{t}) \leftarrow V(S_{t}) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_{t})] $
    - For action-value function, $Q^{new}(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha[R_{t+1} + \gamma \text{max}_{a}Q(S_{t+1}, a) - Q(S_{t}, A_{t})] $
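
The four steps can be put together in a short tabular sketch. The environment is a hypothetical five-cell corridor of my own (not FrozenLake or Taxi), and the hyperparameters are arbitrary choices.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)  # hypothetical corridor; 0 = left, 1 = right
GOAL = N_STATES - 1


def step(state, action):
    """Move along the corridor; reaching the right end pays 1 and terminates."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Step 1: initialise the Q-table
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Step 2: epsilon-greedy acting policy (ties broken at random)
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(Q[state])
                action = rng.choice([a for a in ACTIONS if Q[state][a] == best])
            # Step 3: act, observe reward and next state
            next_state, reward, done = step(state, action)
            # Step 4: update toward the greedy TD target
            target = reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


Q = train()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

The row for the terminal state is never updated, so it stays at zero, matching $Q(\text{terminal-state}, .) = 0$.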

<div align="center"><img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-7.jpg" alt="Q-learning" style="width: 40%;">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-8.jpg" alt="Q-learning" style="width: 50%;"></div>

<b>Off-policy vs. On-policy</b>: 
- Off-policy: A different policy for acting (inferencing) and updating (learning)
- On-policy: Same policy for acting and updating
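
The difference shows up in how the TD target is built. The sketch below contrasts the Q-learning target with the SARSA target, SARSA being the usual on-policy counterpart; the Q-values are made up.

```python
def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy: bootstrap from the greedy action in the next state."""
    return reward + gamma * max(Q[next_state])


def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action the acting policy actually chose."""
    return reward + gamma * Q[next_state][next_action]


Q = {"s1": [0.2, 1.0]}
# Suppose the epsilon-greedy acting policy explored and picked action 0 in s1.
off_policy = q_learning_target(Q, reward=0.0, next_state="s1", gamma=0.5)
on_policy = sarsa_target(Q, reward=0.0, next_state="s1", next_action=0, gamma=0.5)
```

The two targets coincide whenever the acting policy happens to pick the greedy action, and differ on exploratory steps like this one.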

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" style="width:600px;" title="Off-on policy">

The greedy policy will also be the final policy we have when the Q-learning agent completes training.

# Unit 3: Deep Q-learning with Atari Games

Unit 3 Key Concepts:<br> 
<i> From Q-learning to Deep Q-learning $\rightarrow$ Deep Q-Network $\rightarrow$ Preprocessing the input $\rightarrow$ Temporal limitation </i>

<b>State spaces</b>
- LunarLander-v2 - ? different states.
- FrozenLake-v1 - 16 different states (4x4 grid).
- Taxi-v3 - 500 different states (5x5 grid) x (5 passenger locations) x (4 destination locations).
- Atari - 10^9 to 10^11 states.

<img align="right" src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" style=" width:550px; padding: 10px 20px;" title="Deep Q Learning">

<b>From Q-learning to Deep Q-learning</b>

<span style="color:red"><b>The Q-table becomes ineffective in large state space environments</b></span>, though it worked well for smaller discrete state spaces like FrozenLake-v1 and Taxi-v3. So, instead of using a Q-table, we use <span style="color:green">Deep Q-Learning, which uses a Neural Network to approximate, given a state, the Q-value of each possible action in that state</span>. In other words, <span style="color:green">we approximate Q-values using a parameterised Q-function $Q_{\theta}(s,a)$</span>.
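
To show what "parameterised" means without pulling in a deep learning library, the sketch below swaps the neural network for a linear model: $\theta$ is one weight row per action and the state is a small feature vector. This is a deliberate simplification and not a DQN; all names and numbers are assumptions.

```python
def q_values(theta, features):
    """Q_theta(s, .): one linear output per action."""
    return [sum(w * x for w, x in zip(row, features)) for row in theta]


def semi_gradient_update(theta, features, action, reward, next_features,
                         done, alpha=0.1, gamma=0.99):
    """One Q-learning step applied to the parameters instead of a table cell."""
    bootstrap = 0.0 if done else max(q_values(theta, next_features))
    td_error = reward + gamma * bootstrap - q_values(theta, features)[action]
    # For a linear model, the gradient of Q_theta(s, a) w.r.t. its row is `features`.
    theta[action] = [w + alpha * td_error * x
                     for w, x in zip(theta[action], features)]
    return td_error


theta = [[0.0, 0.0], [0.0, 0.0]]  # 2 actions x 2 state features
err = semi_gradient_update(theta, [1.0, 0.0], action=1, reward=1.0,
                           next_features=[0.0, 1.0], done=True)
```

The update has the same shape as the tabular rule, but one step now changes the estimate for every state that shares features with the visited one.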

<img align="right" src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" style=" width:500px; padding: 10px 20px;" title="Deep Q Network">

<b>The Deep Q-Network</b>
- Input: Stack of 4 frames
- Output: Vector of Q values for each possible action at that state

<b>Preprocessing the input</b>
- To reduce state space complexity => Faster training
- <span style="color:blue"><b>Roughly 14x reduction</b>: from $210*160*3=100,800$ values in a raw RGB frame to $84*84*1=7,056$ values in a grayscale frame.</span>

<b>Temporal Limitation</b>
- Handled by stacking multiple frames together, thereby capturing <b>temporal information</b>
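
Frame stacking can be sketched with a fixed-length deque. `FrameStack` here is a hypothetical helper written for this note, and strings stand in for image frames.

```python
from collections import deque


class FrameStack:
    """Keep the most recent `k` frames as one stacked observation."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Repeat the first frame so the stack has a fixed size from step 0.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame drops out automatically
        return list(self.frames)


stack = FrameStack(k=4)
obs = stack.reset("f0")
obs = stack.push("f1")
obs = stack.push("f2")
```

Because consecutive frames sit next to each other in `obs`, the network can infer motion (direction and speed) that a single frame cannot show.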

...



# Bonus Unit 2: Automatic Hyperparameter Tuning with Optuna