# Introduction to Q-Learning

In **policy-based** methods, we **train the policy directly** to select which action to take in a given state (or to output a probability distribution over actions for that state).

The policy takes a state as input and returns the action to take in that state. A deterministic policy returns a single action for each state, whereas a stochastic policy returns a probability distribution over actions. As a result, **we do not define the policy's behavior by hand**; **the training process determines it**.
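
To make the deterministic/stochastic distinction concrete, here is a minimal sketch. The action set, the integer state encoding, and the hard-coded rules are all hypothetical placeholders; a trained policy would learn this mapping rather than use a fixed formula.

```python
import random

ACTIONS = ["left", "right", "up", "down"]  # hypothetical action set


def deterministic_policy(state):
    """Return exactly one action for the given state."""
    # Placeholder rule for illustration only; a trained policy would
    # have learned this mapping instead of using a fixed formula.
    return ACTIONS[state % len(ACTIONS)]


def stochastic_policy(state):
    """Return a probability distribution over actions for the given state."""
    # Placeholder probabilities for illustration only.
    favoured = ACTIONS[state % len(ACTIONS)]
    return {a: (0.7 if a == favoured else 0.1) for a in ACTIONS}


def sample_action(distribution, rng):
    """Draw one action according to the given probabilities."""
    actions = list(distribution)
    weights = [distribution[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
state = 1
print(deterministic_policy(state))                   # always the same action
print(stochastic_policy(state))                      # one probability per action
print(sample_action(stochastic_policy(state), rng))  # may differ between draws
```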


In **value-based** methods, **we learn a value function** that **maps a state to the expected value of being at that state.**
![enter image description here](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_1.jpg)
The value of a state is the **expected discounted return** the agent can get if it **starts at that state and then acts according to our policy.**
Because the policy itself is not trained, we have to specify its behavior ourselves.   
Whichever method you use to solve your problem, you end up with a policy. In value-based methods the policy is not trained; it is a simple predefined function (such as the greedy policy) that uses the values given by the value function to select actions.
In value-based training, finding an optimal value function (denoted **Q\*** or **V\***; the difference is covered below) leads to an optimal policy.
  ![enter image description here](https://live.staticflickr.com/65535/53421013836_47936fb15f_c.jpg)

## The state-value function

The state-value function represents the expected return (the cumulative discounted future reward) an agent obtains if it starts in a given state and then follows a given policy forever. In simpler terms, it quantifies how good it is to be in a particular state.

![enter image description here](https://live.staticflickr.com/65535/53421445805_3f7ba66e64_z.jpg)

Put differently, for each state the state-value function outputs the expected return if the agent starts in that state and then follows its policy at every subsequent step.

![enter image description here](https://cdn-media-1.freecodecamp.org/images/1*2_JRk-4O523bcOcSy1u31g.png)
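
The quantity being averaged here is the discounted return of a trajectory. As a minimal sketch, the function below computes that return for a single, already observed list of rewards; the state value is the expectation of this number over all trajectories the policy can produce from that state. The function name and the example rewards are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma**2 * r_3 + ... for one trajectory."""
    g = 0.0
    # Work backwards so each step is "reward now + discounted rest".
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards collected after leaving a state
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```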

## The action-value function
The action-value function measures the expected return an agent obtains if it starts in a given state, takes a specific action, and then follows a given policy. In other words, it evaluates how good it is to take a specific action in a given state.

![enter image description here](https://live.staticflickr.com/65535/53420125417_461d4c0a6c_z.jpg)

For the action-value function, we calculate **the value of a state-action pair, that is, the value of taking that action in that state.**

![enter image description here](https://live.staticflickr.com/65535/53420149332_20560140a8_n.jpg)
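
One way to see how the two functions relate: the value of a state is the policy-weighted average of the values of the actions available in it. This is a standard identity rather than something stated above, and the numbers below are made up for illustration.

```python
def state_value(action_values, action_probs):
    """Average the action values of one state, weighted by the policy."""
    return sum(action_probs[a] * q for a, q in action_values.items())


# Hypothetical action values Q(s, a) and policy probabilities pi(a | s)
# for a single state s.
q_s = {"left": 1.0, "right": 3.0}
pi_s = {"left": 0.5, "right": 0.5}
print(state_value(q_s, pi_s))  # 0.5 * 1.0 + 0.5 * 3.0 = 2.0
```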

## The Bellman Equation
The Bellman equation **simplifies our state value or state-action value calculation.**
Instead of calculating the expected return for each state or each state-action pair, **we can use the Bellman equation.**

Bellman's equation for the state-value function is expressed as follows:

![enter image description here](https://live.staticflickr.com/65535/53421743280_c8bd2216b8_b.jpg)
In simpler terms, the value of a state is the immediate reward plus the discounted value of the next state, in expectation under the current policy.
The idea of the Bellman equation is that instead of calculating each value as the full expected return, **which is a long process**, we calculate it as **the immediate reward plus the discounted value of the state that follows.**
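
As a rough sketch of how this recursion is used, the code below repeatedly applies "immediate reward + discounted value of the next state" on a tiny deterministic chain until the values settle. The three-state environment, its rewards, and the fixed policy it implies are hypothetical and chosen only to keep the arithmetic easy to check.

```python
# Hypothetical chain under a fixed policy: state 0 -> state 1 -> state 2 (terminal).
NEXT_STATE = {0: 1, 1: 2}
REWARD = {0: 0.0, 1: 1.0}  # reward received when leaving each state
TERMINAL = 2


def evaluate_policy(next_state, reward, terminal, gamma, sweeps=50):
    """Apply the Bellman equation as an update rule until values stabilise."""
    values = {s: 0.0 for s in list(next_state) + [terminal]}
    for _ in range(sweeps):
        for s in next_state:
            # Value = immediate reward + discounted value of the next state.
            values[s] = reward[s] + gamma * values[next_state[s]]
    return values


values = evaluate_policy(NEXT_STATE, REWARD, TERMINAL, gamma=0.9)
print(values)  # V(1) = 1.0 and V(0) = 0.9 * V(1) = 0.9
```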

## Monte Carlo and Temporal Difference Learning
Both Monte Carlo (MC) and Temporal Difference (TD) learning are fundamental techniques in the field of **reinforcement learning**. They help agents learn the best course of action to maximize rewards in their environment, even without full knowledge of the environment's dynamics. However, they approach this learning in different ways:

### Monte Carlo: learning at the end of the episode
Monte Carlo waits until the end of the episode, calculates the return $G_t$, and uses it as **a target for updating $V(S_t)$.**

-   **Think of it as replaying a completed game:** Imagine navigating a maze. The agent waits until it reaches the end of the maze (the whole episode) and only then uses the rewards collected along the way to update its estimate of every state it visited. Each update is based on the actual outcome of the whole journey, and the agent then **starts a new game with this new knowledge**.

![enter image description here](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3.jpg)
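
A minimal sketch of this end-of-episode update, assuming the usual incremental form in which each visited state's value moves a fraction `alpha` of the way towards the observed return. The episode format, the names `alpha` and `gamma`, and the example data are illustrative assumptions.

```python
def monte_carlo_update(values, episode, alpha, gamma):
    """Update V(s) for every visited state once the episode has finished.

    `episode` is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state`.
    """
    g = 0.0
    # Walk backwards so the return G_t is available for each step t.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        values[state] += alpha * (g - values[state])  # move V(S_t) towards G_t
    return values


values = {"A": 0.0, "B": 0.0}
episode = [("A", 0.0), ("B", 1.0)]  # hypothetical finished episode
print(monte_carlo_update(values, episode, alpha=0.5, gamma=1.0))
```

Both states end at 0.5 here: each one observes a return of 1.0 and moves halfway towards it.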

### Temporal Difference Learning: learning at each step

Temporal Difference (TD) learning, by contrast, needs only a single interaction (one step), namely the transition from state $S_t$ to $S_{t+1}$, to form a TD target. It then updates $V(S_t)$ using the immediate reward $R_{t+1}$ and the discounted value of the next state, $\gamma V(S_{t+1})$.

Because we have not experienced a whole episode, the actual return $G_t$ is not available. Instead, we estimate $G_t$ by adding the immediate reward $R_{t+1}$ to the discounted value of the next state, $\gamma V(S_{t+1})$.

-   **Think of it as learning while playing:** While still navigating the maze, TD learns **on-the-fly** after each step. It uses the immediate reward and a prediction of the future value (based on the next state) to update its estimate of the current state. This is like asking yourself, "Based on what I know so far, would the next move be good?"
![enter image description here](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3.jpg)
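
The one-step update can be sketched as follows, assuming the usual TD(0) form with a step size `alpha`. The function name, the dictionary representation of the value function, and the example numbers are illustrative assumptions.

```python
def td0_update(values, state, reward, next_state, alpha, gamma):
    """Update V(state) from a single transition, without waiting for the episode to end."""
    td_target = reward + gamma * values[next_state]  # estimate of G_t
    td_error = td_target - values[state]
    values[state] += alpha * td_error
    return values[state]


values = {"A": 0.0, "B": 2.0}  # hypothetical current estimates
# One step: the agent leaves A, receives reward 1.0, and lands in B.
print(td0_update(values, "A", reward=1.0, next_state="B", alpha=0.1, gamma=0.5))
# TD target = 1.0 + 0.5 * 2.0 = 2.0, so V(A) moves from 0.0 to 0.2
```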

# Q-Learning
**Q-learning is a reinforcement learning algorithm that learns which action is best in each state. It combines three features:**

-   **Off-policy learning:** The policy it uses to act during training (for example, epsilon-greedy) differs from the greedy policy it uses to compute its update targets, so it can learn about the best actions while still exploring.
-   **Value-based method:** It learns a value function that estimates the long-term value of each action, and derives its policy from those estimates rather than training a policy directly.
-   **TD (Temporal Difference) approach:** It updates its action-value estimates after every step, using the immediate reward and the predicted value of the next state, rather than waiting for the end of the episode.

**Q FUNCTION**

![enter image description here](https://live.staticflickr.com/65535/53420551472_013331c6a0_b.jpg)

The term "Q" signifies the "Quality" or value associated with a particular action in a given state.

Our Q-function is internally represented by a Q-table, a structured table where each cell corresponds to the value of a state-action pair. Visualize this Q-table as the memory or cheat sheet of our Q-function.
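
A minimal sketch of such a table, using a plain list of lists with one row per state and one column per action. The sizes and the integer encoding of states and actions are hypothetical.

```python
N_STATES = 6   # hypothetical: one row per maze cell
N_ACTIONS = 4  # hypothetical: left, right, up, down

# Every state-action value starts at zero.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]


def q_value(q_table, state, action):
    """Look up the value of a state-action pair in the table."""
    return q_table[state][action]


q_table[0][1] = 0.5  # pretend training has updated one cell
print(q_value(q_table, state=0, action=1))  # 0.5
print(q_value(q_table, state=1, action=1))  # 0.0, still untouched
```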

Let's go through a maze example from Hugging Face to **visualize this**.

![enter image description here](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-1.jpg)

![enter image description here](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-2.jpg)

![enter image description here](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-3.jpg)
  
Training refines a Q-function, which is an action-value function represented internally by a Q-table holding the value of every state-action pair. Given a state and an action, the Q-function looks up the corresponding value in its Q-table.

When training is complete, we have an optimal Q-function, which means we have an optimal Q-table. An optimal Q-function also gives us an optimal policy, because we know the best action to take in each state.

![enter image description here](https://live.staticflickr.com/65535/53421013836_47936fb15f_c.jpg)
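
Reading a policy off a Q-table can be sketched as picking, for each state, the action with the highest value. The table below holds made-up numbers, and ties are broken in favour of the lowest action index, which is an arbitrary choice for this example.

```python
def greedy_action(q_row):
    """Return the index of the highest-valued action in one row of the table."""
    return max(range(len(q_row)), key=lambda action: q_row[action])


def greedy_policy(q_table):
    """Return the greedy action for every state in the table."""
    return [greedy_action(row) for row in q_table]


# Hypothetical table with 2 states and 3 actions.
q_table = [
    [0.0, 0.2, 0.1],
    [0.5, 0.0, 0.4],
]
print(greedy_policy(q_table))  # [1, 0]
```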


## The Q-Learning algorithm

![enter image description here](https://live.staticflickr.com/65535/53422127408_07570fd7f9_z.jpg)
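
The update step at the core of the algorithm can be sketched as below, assuming the standard Q-learning rule: the TD target uses the best action value of the next state, whichever action the agent actually takes next. The list-of-lists table, the parameter names `alpha` and `gamma`, and the example numbers are illustrative assumptions; for simplicity the sketch does not treat terminal states specially.

```python
def q_learning_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update to q_table[state][action] in place."""
    best_next = max(q_table[next_state])      # greedy estimate for the next state
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


# Hypothetical table with 2 states and 2 actions.
q_table = [[0.0, 0.0], [1.0, 2.0]]
new_value = q_learning_update(
    q_table, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, gamma=0.9
)
print(new_value)  # target = 1.0 + 0.9 * 2.0 = 2.8, so Q moves halfway to about 1.4
```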

### Choose an action using the epsilon-greedy strategy
When choosing an action (denoted as "a") in the current state (denoted as "s"), the decision is typically based on the current Q-value estimates. However, a critical question arises at the outset of the training process when all Q-values are initialized to zero: What action should be taken in this scenario?

This is where the exploration/exploitation trade-off, previously discussed, becomes crucial.

To address the uncertainty in the initial Q-values, an epsilon-greedy strategy is employed:

1.  An exploration rate, denoted "epsilon", is set to 1 at the start of training. It is the probability of taking a random action at a given step. Early on, when little is known about the Q-table values, a high epsilon is needed so that the agent explores widely through random action selection.
    
2.  At each step a random number between 0 and 1 is generated. If it is greater than epsilon, the agent chooses "exploitation" and uses its current knowledge to select the action with the highest Q-value. If it is less than or equal to epsilon, the agent chooses "exploration" and selects an action at random.

![enter image description here](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg)
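
The two steps above can be sketched as a single function. The function name and the use of one row of Q-values per state are illustrative assumptions; the random number generator is passed in so that results can be reproduced.

```python
import random


def epsilon_greedy_action(q_row, epsilon, rng):
    """Choose an action index for one state using the epsilon-greedy rule."""
    if rng.random() > epsilon:
        # Exploitation: pick the action with the highest current Q-value.
        return max(range(len(q_row)), key=lambda action: q_row[action])
    # Exploration: pick any action uniformly at random.
    return rng.randrange(len(q_row))


rng = random.Random(0)
q_row = [0.1, 0.9, 0.3]  # hypothetical Q-values for one state
print(epsilon_greedy_action(q_row, epsilon=1.0, rng=rng))  # purely random choice
print(epsilon_greedy_action(q_row, epsilon=0.0, rng=rng))  # greedy choice
```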
    

The rationale behind this approach is to initiate training with a high epsilon, promoting significant exploration in the early stages of Q-function learning. As the agent gains confidence in estimating Q-values, epsilon is gradually reduced, striking a balance between exploration and exploitation throughout the training process.

![enter image description here](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-5.jpg)
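
The text does not fix a particular schedule for reducing epsilon. One common choice, shown here purely as an assumption, is exponential decay from a maximum towards a minimum value; the parameter names and default values are hypothetical.

```python
import math


def decayed_epsilon(episode, max_epsilon=1.0, min_epsilon=0.05, decay_rate=0.005):
    """Return the exploration rate to use in the given episode."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 100, 1000):
    # Epsilon starts near max_epsilon and shrinks towards min_epsilon.
    print(episode, round(decayed_epsilon(episode), 3))
```

Keeping `min_epsilon` above zero means the agent never stops exploring entirely, which is a design choice rather than a requirement.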

### Off-policy and on-policy