# Q-Learning

https://www.youtube.com/watch?v=wc-FxNENg9U

<image src="https://images.datacamp.com/image/upload/v1666973295/Q_Learning_Header_336c3f8177.png">

Reinforcement learning (RL) is the branch of machine learning in which an agent learns by interacting with an environment to find the optimal strategy for achieving its goals. It is quite different from supervised machine learning, where a model is trained on a prepared, labeled dataset. Reinforcement learning does not require such a dataset up front. Instead, the agent gathers its own experience by acting in the environment and uses the reward system to make better decisions.

We can break down reinforcement learning into five simple steps:

1. The agent is at state zero in an environment.
2. It will take an action based on a specific strategy.
3. It will receive a reward or punishment based on that action.
4. It will learn from previous moves and optimize its strategy.
5. The process will repeat until an optimal strategy is found. 
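
The interaction part of these steps (1 to 3) can be sketched as a plain loop. This is a minimal illustration, not a reference implementation: the `LineWorld` environment, its reward values, and the `run_episode` helper are hypothetical names invented for this example, and the learning part (steps 4 and 5) is deliberately left out.

```python
import random


class LineWorld:
    """Hypothetical toy environment: positions 0..4 on a line, goal at position 4."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0  # step 1: the agent starts at state zero
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the agent cannot leave the line
        self.state = max(0, min(self.size - 1, self.state + action))
        done = self.state == self.size - 1
        reward = 1.0 if done else -0.1  # step 3: reward at the goal, small penalty otherwise
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode with the given policy and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # step 2: act according to a strategy
        state, reward, done = env.step(action)  # step 3: observe reward and next state
        total_reward += reward
        if done:
            break
    return total_reward


random.seed(0)
random_policy = lambda state: random.choice([-1, 1])  # no strategy at all
always_right = lambda state: 1                        # the best strategy in this toy world

print("random policy:", run_episode(LineWorld(), random_policy))
print("always right: ", run_episode(LineWorld(), always_right))
```

An RL algorithm such as Q-learning replaces the fixed `policy` function with one that improves from the rewards it observes.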

Q-learning is a model-free, value-based, off-policy algorithm that will find the best series of actions based on the agent's current state. The “Q” stands for quality. Quality represents how valuable the action is in maximizing future rewards.  

Model-based algorithms use the environment's transition and reward functions, either given or learned as a model, to estimate the optimal policy. In contrast, model-free algorithms learn the consequences of their actions from experience, without access to a transition or reward function.

Value-based methods train a value function to learn which states (or state-action pairs) are more valuable and then act on those values. On the other hand, policy-based methods train the policy directly to learn which action to take in a given state.

An off-policy algorithm evaluates and updates a policy that differs from the policy used to take actions. Conversely, an on-policy algorithm evaluates and improves the same policy that is used to take actions.
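
The off-policy versus on-policy distinction shows up in the value each algorithm learns toward. The sketch below contrasts the Q-learning target with the target used by SARSA, a standard on-policy method that this section does not otherwise cover. The function names, the tiny `q_table`, and the `gamma` (discount factor) parameter are assumptions made for illustration.

```python
def q_learning_target(q_table, reward, next_state, gamma):
    """Off-policy target: assumes the best (greedy) action will be taken in next_state,
    regardless of which action the agent actually takes there."""
    return reward + gamma * max(q_table[next_state])


def sarsa_target(q_table, reward, next_state, next_action, gamma):
    """On-policy target: uses the action the agent's own policy actually takes next."""
    return reward + gamma * q_table[next_state][next_action]


# Hypothetical table: 2 states (rows) x 2 actions (columns).
q_table = [
    [0.0, 0.0],
    [1.0, 5.0],
]

# The agent moves to state 1 with reward 1.0, then explores with action 0.
print(q_learning_target(q_table, reward=1.0, next_state=1, gamma=0.5))            # uses max(1.0, 5.0)
print(sarsa_target(q_table, reward=1.0, next_state=1, next_action=0, gamma=0.5))  # uses Q[1][0] = 1.0
```

The two targets differ whenever the action actually taken next is not the greedy one, which is exactly when the agent is exploring.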

# Q-Learning Terminology 

* State (s): the current position of the agent in the environment.
* Action (a): a step taken by the agent in a particular state.
* Reward: for every action, the agent receives a reward or a penalty.
* Episode: one complete run that ends in a terminal state, where the agent can't take a new action. It ends when the agent has achieved the goal or failed.
* Q(St+1, a): the expected optimal Q-value of taking action a in the next state St+1.
* Q(St, At): the current estimate of the Q-value for taking action At in state St; it is updated toward a target built from Q(St+1, a).
* Q-Table: a table the agent maintains with one Q-value for each state-action pair.
* Temporal Difference (TD): used to update the estimate Q(St, At) from the reward received and the estimated value of the next state, Q(St+1, a), without waiting for the episode to end.


## Q-Table

The agent will use a Q-table to take the best possible action in each state of the environment, based on the expected future reward. In simple words, a **Q-table is a data structure that stores a value for every pair of state and action**, and we use the Q-learning algorithm to update the values in the table.
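
As a rough sketch of how such a table is read, the snippet below stores one row of values per state and picks the action with the highest value. The state names, action names, numbers, and the `best_action` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical Q-table: one entry per state, one value per action (same order as ACTIONS).
ACTIONS = ["up", "down", "left", "right"]
q_table = {
    "start": [0.0, 0.2, -0.1, 0.5],
    "idle": [0.4, 0.0, 0.0, 0.3],
}


def best_action(q_table, state):
    """Return the name of the action with the highest Q-value in the given state."""
    values = q_table[state]
    best_index = max(range(len(values)), key=lambda i: values[i])
    return ACTIONS[best_index]


print(best_action(q_table, "start"))
print(best_action(q_table, "idle"))
```

A dictionary of lists is used here for readability; a 2D array indexed by integer state and action IDs is the more common layout in practice.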



## Q-Function

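The Q-function takes a state (s) and an action (a) as input and is defined through the Bellman equation. The equation simplifies the calculation of state values and state-action values by expressing the value of the current state-action pair in terms of the immediate reward and the value of the next state.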

<image src='https://images.datacamp.com/image/upload/v1666973295/Bellman_Equation_63ce32c644.png'>
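
For reference, a standard textbook form of the Bellman optimality equation for the Q-function is given below. It is not a transcription of the figure, and the symbols are assumptions that follow the notation used elsewhere in this section: $R_{t+1}$ is the reward received after taking action $A_t$ in state $S_t$, and $\gamma \in [0, 1]$ is a discount factor.

$$
Q^{*}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a \right]
$$

In words: the value of taking action $a$ in state $s$ equals the expected immediate reward plus the discounted value of the best action available in the next state.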

## Q-learning algorithm

<image src='https://images.datacamp.com/image/upload/v1666973295/Q_Learning_Process_134331efc1.png'>

### Initialize Q-Table
We will first initialize the Q-table. We will build the table with columns based on the number of actions and rows based on the number of states.

In our example, the character can move up, down, left, and right. We have four possible actions and four states (start, idle, wrong path, and end). You can think of the wrong path as falling into the hole. We will initialize the Q-table with all values at 0.

<image src='https://images.datacamp.com/image/upload/v1666973295/Q_Table_1_0780df3d51.png'>
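
A minimal sketch of this initialization, using plain Python lists so it runs without extra packages. The `init_q_table` helper and the `STATES` and `ACTIONS` lists are illustrative names, not part of any library.

```python
STATES = ["start", "idle", "wrong path", "end"]
ACTIONS = ["up", "down", "left", "right"]


def init_q_table(n_states, n_actions):
    """Build a table with one row per state and one column per action, all zeros."""
    # A list comprehension creates an independent row for each state.
    return [[0.0] * n_actions for _ in range(n_states)]


q_table = init_q_table(len(STATES), len(ACTIONS))

for state, row in zip(STATES, q_table):
    print(f"{state:>10}: {row}")
```

Starting from zeros means the agent has no preference between actions at first; the values only become informative once updates begin.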

These steps are repeated many times, each pass updating the Q-table, until the Q-values converge and the Q-value function is maximized.

<image src='https://images.datacamp.com/image/upload/v1666973295/Q_learning_equation_3cd6652b98.png'>
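
A common way to write the Q-learning update is Q(St, At) ← Q(St, At) + α [R + γ max_a Q(St+1, a) - Q(St, At)], where α is a learning rate and γ a discount factor. The sketch below applies that rule on a hypothetical five-cell corridor. The environment, the `step`, `q_update`, and `train` names, the hyperparameter values, and the uniformly random behaviour policy are all assumptions chosen to keep the example short.

```python
import random

N_STATES, GOAL = 5, 4  # hypothetical corridor: states 0..4, goal at state 4
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic toy dynamics: move one cell, reward 1.0 on reaching the goal."""
    next_state = max(0, state - 1) if action == LEFT else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def q_update(q_table, state, action, reward, next_state, done, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    best_next = 0.0 if done else max(q_table[next_state])  # no future value after the end
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return td_error


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn a Q-table while acting uniformly at random (possible because Q-learning is off-policy)."""
    rng = random.Random(seed)
    q_table = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice([LEFT, RIGHT])
            next_state, reward, done = step(state, action)
            q_update(q_table, state, action, reward, next_state, done, alpha, gamma)
            state = next_state
    return q_table


q_table = train()
greedy_policy = [max((LEFT, RIGHT), key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

Even though the agent never follows the greedy policy while training, the policy read from the learned table should choose `RIGHT` in every non-terminal state. In practice an epsilon-greedy behaviour policy is a more typical choice than purely random actions.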

In [4]:
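# Setup: uncomment the lines below to install the dependencies in a fresh notebook environment.
# `%%capture` must be the first line of the cell; it suppresses the output of the Jupyter cell.
# %%capture
# !pip install pyglet==1.5.1
# !apt install python-opengl
# !apt install ffmpeg
# !apt install xvfb
# !pip3 install pyvirtualdisplay

# Virtual display: lets the environment render frames on a machine without a screen.
# It needs the Xvfb binary, which the `apt install xvfb` line above provides.
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()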



FileNotFoundError: [Errno 2] No such file or directory: 'Xvfb'