# On-policy vs. off-policy reinforcement learning
The main objective in reinforcement learning is to find a policy (i.e., a plan of action) that leads to the best outcome (the highest possible reward). To do so, an algorithm estimates the return (the cumulative future reward) and behaves (i.e., picks its actions) based on that estimate. **The difference between an on-policy and an off-policy algorithm lies in how the way it behaves (picks its actions) relates to the way it 'thinks' it behaves when estimating returns**: an on-policy algorithm estimates returns for the same policy it uses to act, while an off-policy algorithm estimates returns for a different one.

In both on-policy and off-policy algorithms, there exists a policy that guides the agent's behavior, i.e., the algorithm picks an action for the agent to execute. How an action is picked depends on the algorithm's return estimate $Q$ for each state-action pair. That is, in a particular state $S$, the value $Q(S, A)$ of taking a particular action $A$ affects the algorithm's decision on whether to pick $A$. One example is the greedy policy, where the algorithm finds the $A$ with the highest $Q$ value and picks it as the action to execute.
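
As a minimal sketch of action selection, the snippet below implements the greedy rule described above, along with the $\epsilon$-greedy variant that SARSA and Q-learning use later in this document. The function names and the representation of $Q(S, \cdot)$ as a plain list indexed by action are assumptions made for illustration, not part of any particular library.

```python
import random


def greedy(q_values):
    """Return the index of the action with the highest estimated return.

    `q_values` is assumed to hold Q(S, a) for every action a in state S.
    Ties are broken in favour of the lowest action index.
    """
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability `epsilon` pick a uniformly random action (explore);
    otherwise pick the greedy action (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)
```

With `epsilon = 0` the second function reduces to the greedy policy, and with `epsilon = 1` it ignores the estimates entirely.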

After taking the action, the agent arrives in the next state $S'$ and receives a reward $R$, which it uses to update its return estimates (i.e., state-action values) $Q$. This is the stage where on-policy and off-policy algorithms differ: to update $Q(S, A)$, an on-policy algorithm uses its present estimate of $Q$ for the next state $S'$ and the action $A'$ that its behavior policy selects there, while an off-policy algorithm uses its present estimate of $Q$ for the next state $S'$ and an action $A''$ chosen by a different rule from the one that chose $A$.

To put this into context, we use SARSA and Q-learning as examples of an on-policy and an off-policy algorithm, respectively. When choosing an action to execute, both SARSA and Q-learning use the $\epsilon$-greedy method to pick $A$. However, when updating the state-action values $Q$, **SARSA uses the $Q$ value of the action $A'$ picked through the same $\epsilon$-greedy method**, while **Q-learning uses the $Q$ value of the action $A''$ picked through a greedy method (i.e., $A'' = \arg\max_a Q(S', a)$), which differs from how it picked $A$ (i.e., the $\epsilon$-greedy method)**.
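
For reference, the two updates can be written side by side. This uses the usual notation with a step size $\alpha$ and a discount factor $\gamma$; neither symbol is defined in the text above, so treat both as assumptions.

SARSA (on-policy), where $A'$ is the action actually selected in $S'$ by the $\epsilon$-greedy policy:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \, Q(S', A') - Q(S, A) \right]
$$

Q-learning (off-policy), where the maximum corresponds to the greedy action $A'' = \arg\max_a Q(S', a)$:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]
$$

The two rules differ only in the bootstrap term, $Q(S', A')$ versus $\max_a Q(S', a)$.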

Pseudocode for both algorithms:

SARSA:
![SARSA](img/on-policy-sarsa.png)

Q-learning:
![Q-learning](img/on-policy-q-learning.png)
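
To complement the pseudocode, here is a minimal runnable sketch of both algorithms in tabular form. Everything specific in it is an assumption made for illustration: the toy corridor environment (states 0 to 4, a reward of -1 per move, episode ends at state 4), the hyperparameter values, and the function names. It is not a transcription of the figures above.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: states 0..4, episode ends at 4
LEFT, RIGHT = 0, 1


def step(state, action):
    """Toy dynamics: move one cell, pay -1 per move, stop at GOAL."""
    nxt = max(0, state - 1) if action == LEFT else min(GOAL, state + 1)
    return nxt, -1.0, nxt == GOAL


def epsilon_greedy(q_values, epsilon, rng):
    """Explore uniformly with probability epsilon, else pick the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        action = epsilon_greedy(q[state], epsilon, rng)
        while not done:
            nxt, reward, done = step(state, action)
            next_action = epsilon_greedy(q[nxt], epsilon, rng)
            # On-policy: bootstrap from the action that will actually be taken.
            target = reward + (0.0 if done else gamma * q[nxt][next_action])
            q[state][action] += alpha * (target - q[state][action])
            state, action = nxt, next_action
    return q


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q[state], epsilon, rng)
            nxt, reward, done = step(state, action)
            # Off-policy: bootstrap from the greedy action, whatever is taken next.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


if __name__ == "__main__":
    for name, learn in (("SARSA", sarsa), ("Q-learning", q_learning)):
        q = learn()
        policy = ["right" if row[RIGHT] > row[LEFT] else "left" for row in q[:GOAL]]
        print(name, policy)
```

The two functions behave identically when acting and differ only in the `target` line. In this simple corridor one would expect both to end up preferring "right" in every non-terminal state, so the sketch shows the mechanics rather than a behavioral difference between the two methods.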

The following table summarizes the differences between SARSA and Q-learning:

|SARSA|Q-learning|
|---|---|
|On-policy|Off-policy|
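|Chooses an action $\epsilon$-greedily (not necessarily the best one) and sees the result|Chooses an action $\epsilon$-greedily in the same way (not necessarily the best one) and sees the result|
|Updates its value function using the action $A'$ it will actually take next|Updates its value function using the greedy action $A''$, regardless of which action it takes next|
|In the tabular case, converges to the optimal values if every state-action pair keeps being visited and exploration is gradually reduced (e.g., $\epsilon \to 0$)|In the tabular case, converges to the optimal values if every state-action pair keeps being visited and the step sizes decay appropriately; it can diverge when combined with function approximation|
|With a fixed $\epsilon$, learns the values of the $\epsilon$-greedy policy it follows, so its estimates account for the cost of exploration|Learns the values of the greedy policy directly, even while it keeps exploring; which method learns faster depends on the problem|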