# SARSA Algorithm: A Comprehensive Tutorial

## Introduction

SARSA (State-Action-Reward-State-Action) is an on-policy temporal-difference reinforcement learning algorithm that learns action values, and from them a policy, in Markov Decision Processes (MDPs). The name comes from the quintuple $(s, a, r, s', a')$ used in each Q-value update: the current state, the current action, the reward, the next state, and the next action.

## Mathematical Background

### Markov Decision Process (MDP)

An MDP is defined by:
- **S**: A set of states.
- **A**: A set of actions.
- **P**: State transition probabilities, where $P(s' \mid s, a)$ represents the probability of transitioning to state $s'$ from state $s$ after taking action $a$.
- **R**: Reward function, where $R(s, a)$ is the expected reward received after taking action $a$ in state $s$.
- **$\gamma$**: Discount factor, $0 \leq \gamma < 1$, which controls how strongly future rewards are weighted relative to immediate ones.
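
To make the five components concrete, here is a minimal sketch of a tabular MDP held in plain Python containers. The `MDP` class, the two-state `toy` example, and the `is_valid` helper are all hypothetical names and values chosen for illustration, not part of any library.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: tuple
    actions: tuple
    P: dict       # P[(s, a)] -> {s_next: probability of reaching s_next}
    R: dict       # R[(s, a)] -> expected immediate reward
    gamma: float  # discount factor

# Hypothetical two-state example; all numbers are made up for illustration.
toy = MDP(
    states=("A", "B"),
    actions=("stay", "move"),
    P={
        ("A", "stay"): {"A": 1.0},
        ("A", "move"): {"A": 0.2, "B": 0.8},
        ("B", "stay"): {"B": 1.0},
        ("B", "move"): {"A": 0.8, "B": 0.2},
    },
    R={
        ("A", "stay"): 0.0, ("A", "move"): -1.0,
        ("B", "stay"): 1.0, ("B", "move"): -1.0,
    },
    gamma=0.9,
)

def is_valid(mdp, tol=1e-9):
    """Check that each P(. | s, a) sums to 1 and that 0 <= gamma < 1."""
    rows_ok = all(abs(sum(dist.values()) - 1.0) <= tol for dist in mdp.P.values())
    return rows_ok and 0.0 <= mdp.gamma < 1.0
```

Note that SARSA itself never reads `P` or `R` directly. It only needs sampled transitions, which is what makes it a model-free method.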

### Q-Function

The Q-function, or action-value function, $Q^\pi(s, a)$, is the expected discounted cumulative reward (the return) obtained by taking action $a$ in state $s$ and then following policy $\pi$:

$$
Q^\pi(s, a) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \bigg| s_0 = s, a_0 = a, \pi \right]
$$
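
The quantity inside the expectation is the discounted return of a single trajectory. As a small sketch, the hypothetical helper `discounted_return` below computes it for a finite list of rewards. $Q^\pi(s, a)$ is the average of this value over all trajectories that start with $(s, a)$ and then follow $\pi$.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last reward backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards 1, 0, 2 with gamma = 0.5 give 1 + 0.5*0 + 0.25*2 = 1.5
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```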

### SARSA Update Rule

SARSA is an on-policy method: it updates the Q-values using the action $a'$ that the policy being followed actually selects in the next state, rather than the greedy action. The update rule is:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]
$$

where:
- $s$ is the current state.
- $a$ is the current action.
- $r$ is the reward received after taking action $a$.
- $s'$ is the next state.
- $a'$ is the next action chosen by the policy.
- $\alpha$ is the learning rate.
- $\gamma$ is the discount factor.

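The update is a sample-based form of the Bellman expectation equation for $Q^\pi$:

$$
Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')} \left[ R(s, a) + \gamma Q^\pi(s', a') \right]
$$

SARSA replaces the expectation with a single sampled transition $(s, a, r, s', a')$ and moves $Q(s, a)$ a fraction $\alpha$ of the way toward the sampled target $r + \gamma Q(s', a')$.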
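
As a worked instance of the update rule, the sketch below applies one SARSA update to a dictionary-backed Q-table. The function name `sarsa_update` and all numeric values are assumptions chosen for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one SARSA update in place and return the TD error."""
    td_target = r + gamma * Q[(s_next, a_next)]
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

# Hypothetical values: Q(s, a) = 1.0, Q(s', a') = 2.0, r = 0.5
Q = {(0, "right"): 1.0, (1, "left"): 2.0}
delta = sarsa_update(Q, s=0, a="right", r=0.5, s_next=1, a_next="left",
                     alpha=0.1, gamma=0.9)
# target = 0.5 + 0.9 * 2.0 = 2.3, TD error is about 1.3,
# so Q(0, "right") moves from 1.0 to about 1.13
```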

### Algorithm Steps

1. **Initialize Q-values**:
   - Initialize $Q(s, a)$ arbitrarily for all state-action pairs $(s, a)$; zero is a common choice. The Q-values of terminal states must be 0.

2. **Policy Selection**:
   - At the start of each episode, choose an action $a$ for the initial state $s$ using a policy derived from the Q-values, such as epsilon-greedy.

3. **Action Execution and Reward Observation**:
   - Execute the action $a$, observe the reward $r$, and the next state $s'$.

4. **Next Action Selection**:
   - Choose the next action $a'$ in the next state $s'$ using the same policy (epsilon-greedy).

5. **Q-value Update**:
   - Update the Q-value using the SARSA update rule.

6. **Transition to the Next State**:
   - Set $s \leftarrow s'$ and $a \leftarrow a'$.

7. **Repeat**:
   - Repeat steps 3 to 6 until $s$ is a terminal state, then start the next episode from step 2.
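
The steps above can be sketched end to end on a toy problem. Everything environment-specific here is an assumption made for illustration: a hypothetical five-cell corridor with a goal at the right end, a reward of -1 per move, and the hyperparameter values passed to `run_sarsa`. It is a minimal tabular sketch, not a reference implementation.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)   # move left / move right
GOAL = 4             # corridor cells are 0..4; cell 4 is terminal

def step(s, a):
    """Hypothetical deterministic environment: every move costs -1."""
    s_next = min(max(s + a, 0), GOAL)   # walls clamp the position
    return s_next, -1.0, s_next == GOAL

def epsilon_greedy(Q, s, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def run_sarsa(episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                                    # step 1: Q = 0
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)                # step 2
        done = False
        while not done:                                       # step 7
            s_next, r, done = step(s, a)                      # step 3
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)  # step 4
            # Terminal states have value 0, so the target is just r there.
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])         # step 5
            s, a = s_next, a_next                             # step 6
    return Q

Q = run_sarsa()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

In this setup the greedy policy read off the learned table should move right in every non-terminal cell, since that is the shortest route to the goal.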

# Advantages and Drawbacks

## Advantages
### On-Policy Learning
- **Stable Learning**: SARSA evaluates and improves the same policy it follows, so its value estimates reflect the exploration the agent actually performs. This can make learning more stable in environments where the highest-reward path lies close to costly mistakes.

### Convergence
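- **Optimal Q-values**: In finite MDPs, SARSA converges to the optimal Q-values with probability 1 when every state-action pair is visited infinitely often, the policy becomes greedy in the limit (for example, epsilon-greedy with $\epsilon$ decaying to 0), and the learning rate satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$. With a fixed $\epsilon$, it generally settles on the values of an epsilon-greedy policy rather than the optimal Q-values.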
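
Two schedules often used to meet exploration and step-size conditions of this kind are $\epsilon_k = 1/k$ per episode and $\alpha = 1/N(s, a)$ per visit. The sketch below shows both. The names `epsilon_schedule` and `VisitCountStepSize` are hypothetical, and in practice slower decays are often preferred because these shrink quickly.

```python
from collections import defaultdict

def epsilon_schedule(episode):
    """epsilon_k = 1 / k for a 1-indexed episode number k.

    Every action keeps nonzero probability, while the policy
    becomes greedy in the limit.
    """
    return 1.0 / episode

class VisitCountStepSize:
    """alpha = 1 / N(s, a), where N counts updates to the pair (s, a).

    The harmonic series diverges while its squares converge, which
    matches the two learning-rate conditions.
    """
    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, s, a):
        self.counts[(s, a)] += 1
        return 1.0 / self.counts[(s, a)]

step_size = VisitCountStepSize()
first = step_size(0, "right")    # 1.0 on the first visit
second = step_size(0, "right")   # 0.5 on the second visit
```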

### Policy-Aware
- **Policy-Based Updates**: Since SARSA updates are based on the actions actually taken by the policy, it is well suited to settings where the agent's performance during learning matters, not only the quality of the final greedy policy.

## Drawbacks
### Exploration Dependency
- **Performance Dependence**: SARSA’s performance heavily depends on the exploration strategy. An inappropriate exploration policy can lead to suboptimal performance.

### Slower Convergence
- **Convergence Speed**: SARSA may reach the optimal policy more slowly than off-policy methods like Q-Learning, because its target uses the action chosen by the current policy, which is sometimes exploratory rather than greedy. How large the difference is depends on the environment and the exploration schedule.

### Policy Sensitivity
- **Update Sensitivity**: SARSA’s updates depend on the current policy, so what it learns is tied to how that policy behaves. If exploration is never reduced, SARSA learns the values of the exploring policy, which is generally suboptimal relative to the greedy one.

## Key Innovations
### On-Policy Nature
- **Policy-Based Q-Value Updates**: The key innovation of SARSA is its on-policy nature, which means it updates its Q-values based on the actions actually taken by the policy. This makes SARSA suitable for learning policies that account for the exploration strategy used during training.

### Balance Between Exploration and Exploitation
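- **Epsilon-Greedy Policy**: SARSA is typically paired with an epsilon-greedy policy, which takes a random action with probability $\epsilon$ and the greedy action otherwise. Because the same policy both generates behavior and supplies $a'$ in the update, the learned Q-values account for the cost of exploration, and $\epsilon$ directly controls the trade-off between exploring and exploiting known good actions.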

### Suitability for Risk-Averse Strategies
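- **Conservative Strategies**: In environments where a misstep carries a high penalty, SARSA’s value estimates include the chance that an exploratory action triggers that penalty. As a result, it tends to learn safer, more conservative behavior than off-policy methods like Q-Learning, which evaluate the greedy policy regardless of how the agent explores. The cliff-walking gridworld is the standard illustration.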
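
The contrast with Q-Learning comes down to the bootstrap term in the update target. The sketch below compares the two targets on a single transition. The state and action names and the Q-values are hypothetical, chosen to mimic a state next to a cliff where the exploratory action is very costly.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually selected."""
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: bootstrap from the greedy action."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)

# Hypothetical values for a state at the edge of a cliff.
actions = ("forward", "sideways")
Q = {("edge", "forward"): 10.0, ("edge", "sideways"): -100.0}
r, gamma = -1.0, 0.9

# Suppose exploration picked the risky action in the next state.
on_policy = sarsa_target(Q, r, "edge", "sideways", gamma)       # about -91
off_policy = q_learning_target(Q, r, "edge", actions, gamma)    # about 8
```

Averaged over many visits, the occasional very low on-policy target pulls down SARSA's estimate for actions that lead to the edge, which is why it tends to prefer routes farther from the cliff while exploration is active.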



### Pseudocode

```text
Initialize Q(s, a) arbitrarily for all s, a; set Q(terminal, ·) = 0
Repeat for each episode:
    Initialize s
    Choose a from s using policy derived from Q (e.g., epsilon-greedy)
    Repeat for each step of the episode:
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g., epsilon-greedy)
        Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
        s ← s'
        a ← a'
    until s is terminal
```