# **EXPECTED SARSA**

Expected SARSA, like its counterparts SARSA and Q-learning, is a temporal difference (TD) learning method used in model-free reinforcement learning (RL). We start by initializing a Q-table. Then the agent repeatedly chooses an action, receives a reward, and updates the table until the values converge. The key distinction between Expected SARSA and the other two methods lies in its update rule.

**Expected SARSA update**\
SARSA updates Q-values using the action actually taken in the next state, and Q-learning updates them using the maximum Q-value in the next state, regardless of the policy being followed. Expected SARSA instead uses the expected value of the next state, computed over all possible actions and weighted by how likely the current policy is to select each one. Because it averages over every possible next action rather than relying on a single sampled action, its update target has lower variance than SARSA's.
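To make the difference between the three targets concrete, here is a minimal sketch that computes each of them for a single transition. The helper name `td_targets` and the numbers are made up for illustration and are not part of the Frozen Lake code in this notebook.

```python
import numpy as np

def td_targets(q_next, reward, gamma, next_action, policy_probs):
    """Return the (SARSA, Q-learning, Expected SARSA) targets for one transition."""
    q_next = np.asarray(q_next, dtype=float)
    policy_probs = np.asarray(policy_probs, dtype=float)
    # SARSA: Q-value of the action actually taken in the next state
    sarsa = reward + gamma * q_next[next_action]
    # Q-learning: largest Q-value in the next state
    q_learning = reward + gamma * q_next.max()
    # Expected SARSA: policy-weighted average of the next state's Q-values
    expected_sarsa = reward + gamma * np.dot(policy_probs, q_next)
    return float(sarsa), float(q_learning), float(expected_sarsa)

# Illustrative values: four actions, uniform policy, next action = 1
q_next = [0.0, 1.0, 2.0, 5.0]
probs = [0.25, 0.25, 0.25, 0.25]
sarsa, q_learning, expected_sarsa = td_targets(
    q_next, reward=0.0, gamma=0.5, next_action=1, policy_probs=probs
)
print(sarsa, q_learning, expected_sarsa)  # 0.5 2.5 1.0
```

Only the last term of the target changes between the three methods. The rest of the update is the same.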

**Expected value of next state**\
Expected SARSA's update rule uses the expected value of the next state. This is the weighted sum of the Q-values of all actions available in that state, where each Q-value is weighted by the probability of its action being selected under the current policy. In our case, actions are chosen uniformly at random during training, so every action has the same probability of being selected, and the expected value simplifies to the mean of the Q-values over all actions in the next state.
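The sketch below shows how the weighting works when the policy is not uniform. It uses an epsilon-greedy policy purely as an assumed example (this notebook itself only samples actions at random), and the helper names `epsilon_greedy_probs` and `expected_q` are hypothetical.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values."""
    q_values = np.asarray(q_values, dtype=float)
    n = q_values.size
    probs = np.full(n, epsilon / n)              # exploration mass, spread evenly
    probs[np.argmax(q_values)] += 1.0 - epsilon  # remaining mass on the greedy action
    return probs

def expected_q(q_values, probs):
    """Probability-weighted sum of the Q-values."""
    return float(np.dot(probs, np.asarray(q_values, dtype=float)))

q_next = [1.0, 2.0, 3.0, 6.0]
uniform_value = expected_q(q_next, epsilon_greedy_probs(q_next, 1.0))  # fully random
greedy_value = expected_q(q_next, epsilon_greedy_probs(q_next, 0.0))   # fully greedy
mixed_value = expected_q(q_next, epsilon_greedy_probs(q_next, 0.2))
print(uniform_value, greedy_value, mixed_value)
```

With `epsilon=1.0` the policy is uniform and the result equals the plain mean of the Q-values. With `epsilon=0.0` it equals the maximum, which is the value Q-learning would use.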

### **Expected value of next state**
$$
Q(s,a) = (1-\alpha)\,Q(s,a) + \alpha \left[ r + \gamma\, E\{Q(s',A)\} \right]
$$
$$
E\{Q(s',A)\} = \sum_{a \in A} \text{Prob}(a) \cdot Q(s',a)
$$
For random actions with equal probabilities:
$$
E\{Q(s',A)\} = \frac{1}{|A|} \sum_{a \in A} Q(s',a)
$$
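
As a quick worked example with made-up numbers (not taken from the Frozen Lake run), suppose $Q(s,a) = 0.5$, $r = 0$, $\alpha = 0.1$, $\gamma = 0.99$, and the four Q-values in the next state are $0$, $0.2$, $0.4$ and $0.2$. With equal action probabilities, the expected value is the mean:

$$
E\{Q(s',A)\} = \frac{0 + 0.2 + 0.4 + 0.2}{4} = 0.2
$$

Blending the old estimate, weighted by $(1-\alpha)$, with the new target, weighted by $\alpha$, gives:

$$
Q(s,a) = (1 - 0.1)(0.5) + 0.1\,[\,0 + 0.99 \times 0.2\,] = 0.45 + 0.0198 = 0.4698
$$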

### **Implementation in Frozen Lake**

In [1]:
import gymnasium as gym
import numpy as np

In [4]:
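# Deterministic 4x4 Frozen Lake (no slipping)
env = gym.make('FrozenLake-v1', is_slippery=False)
num_states = env.observation_space.n
num_actions = env.action_space.n
Q = np.zeros((num_states, num_actions))  # Q-table, one row per state
gamma = 0.99  # discount factor
alpha = 0.1  # learning rate
num_episodes = 1000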

In [6]:
def update_q_table(state, action, next_state, reward):
    # Actions are sampled uniformly, so the expectation is the plain mean
    expected_q = np.mean(Q[next_state])
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * expected_q)

In [None]:
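for i in range(num_episodes):
    state, info = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # uniform random behavior policy
        next_state, reward, terminated, truncated, info = env.step(action)
        update_q_table(state, action, next_state, reward)
        state = next_state
        # Stop on a terminal state or when the step limit truncates the episode
        done = terminated or truncated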
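
The same loop can be sanity-checked without `gymnasium` on a toy problem. The sketch below assumes a hypothetical 5-state corridor (start on the left, goal on the right) that is not part of this notebook. It uses the same uniform random behavior and mean-based update, and the function names and hyperparameters are illustrative choices.

```python
import numpy as np

# Hypothetical 5-state corridor: start at state 0, goal at state 4.
# Actions: 0 = left, 1 = right. Reaching the goal pays reward 1.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    """Deterministic corridor dynamics; returns (next_state, reward, terminated)."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    terminated = next_state == GOAL
    return next_state, float(terminated), terminated

def train(num_episodes=500, alpha=0.1, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(num_episodes):
        state, terminated = 0, False
        while not terminated:
            action = int(rng.integers(N_ACTIONS))  # uniform random behavior
            next_state, reward, terminated = step(state, action)
            # Uniform policy, so the expected value is the plain mean
            target = reward + gamma * Q[next_state].mean()
            Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target
            state = next_state
    return Q

Q_corridor = train()
greedy = Q_corridor[:GOAL].argmax(axis=1)  # greedy action in each non-goal state
print(greedy)
```

If the update is working, the greedy action in every non-goal state should be `1` (move right), since states closer to the goal end up with higher values.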

In [9]:
policy={state:np.argmax(Q[state]) for state in range(num_states)}
print(policy)

{0: np.int64(1), 1: np.int64(2), 2: np.int64(1), 3: np.int64(0), 4: np.int64(1), 5: np.int64(0), 6: np.int64(1), 7: np.int64(0), 8: np.int64(2), 9: np.int64(2), 10: np.int64(1), 11: np.int64(0), 12: np.int64(0), 13: np.int64(2), 14: np.int64(2), 15: np.int64(0)}
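
The dictionary is easier to read when laid out on the grid. The sketch below assumes the default 4x4 Frozen Lake map and Gymnasium's action numbering (0 = left, 1 = down, 2 = right, 3 = up), and it relies on `is_slippery=False` so that each move is deterministic. The helpers `render_policy` and `follow_policy` are hypothetical names, not Gymnasium functions.

```python
# Greedy policy copied from the output above, as plain ints
policy = {0: 1, 1: 2, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0,
          8: 2, 9: 2, 10: 1, 11: 0, 12: 0, 13: 2, 14: 2, 15: 0}

# Assumed default 4x4 layout: S = start, F = frozen, H = hole, G = goal
LAYOUT = ["SFFF", "FHFH", "FFFH", "HFFG"]
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}  # left, down, right, up

def render_policy(policy, layout=LAYOUT):
    """Draw the greedy action for each tile; holes and the goal keep their letter."""
    rows = []
    for r, row in enumerate(layout):
        cells = []
        for c, tile in enumerate(row):
            state = r * len(row) + c
            cells.append(tile if tile in "HG" else ARROWS[policy[state]])
        rows.append(" ".join(cells))
    return "\n".join(rows)

def follow_policy(policy, layout=LAYOUT, start=0, max_steps=100):
    """Follow the policy with deterministic moves until a hole or the goal."""
    n_rows, n_cols = len(layout), len(layout[0])
    moves = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}
    path, state = [start], start
    for _ in range(max_steps):
        r, c = divmod(state, n_cols)
        if layout[r][c] in "HG":
            break
        dr, dc = moves[policy[state]]
        r = min(max(r + dr, 0), n_rows - 1)  # moving into a wall keeps the agent in place
        c = min(max(c + dc, 0), n_cols - 1)
        state = r * n_cols + c
        path.append(state)
    return path

print(render_policy(policy))
print(follow_policy(policy))  # [0, 4, 8, 9, 10, 14, 15]
```

For the policy printed above, the path from the start state ends at state 15, the goal. The action `0` shown for the hole and goal states is only the `argmax` of an all-zero row, since those states are never updated.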
