## SARSA - State–Action–Reward–State–Action

$$
Q(s_t,a_t) = Q(s_t,a_t) + \alpha (r + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t,a_t))
$$
where<br>
$ \alpha $ - step size (learning rate)<br>
$ \gamma $ - discount factor<br>
$ s_t $ - current state<br>
$ s_{t+1} $ - next state<br>
$ r $ - reward received for taking $a_t$ in $s_t$<br>
$ a_t $ - current action<br>
$ a_{t+1} $ - next action, chosen in $s_{t+1}$

The updated Q-value depends on five quantities: the agent's current state $s_t$, the action $a_t$ it chooses, the reward $r$ it receives for that action, the state $s_{t+1}$ it lands in, and the next action $a_{t+1}$ it will choose in that new state. This quintuple $(s_t, a_t, r, s_{t+1}, a_{t+1})$ gives the algorithm its name.
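To make the update rule concrete, here is a minimal sketch of a single SARSA step on a small table. The function name `sarsa_update`, the 3x2 table, and the numbers are illustrative assumptions and not part of the original notebook; only the formula itself comes from the text above.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.8, gamma=0.95):
    """Apply one SARSA update to Q in place and return the TD error.

    Hypothetical helper for illustration; Q is a (states x actions) array.
    """
    td_target = r + gamma * Q[s_next, a_next]  # r + gamma * Q(s_{t+1}, a_{t+1})
    td_error = td_target - Q[s, a]             # target minus current estimate
    Q[s, a] += alpha * td_error                # move the estimate toward the target
    return td_error

# Toy table (assumed values): 3 states, 2 actions.
Q = np.zeros((3, 2))
Q[1, 0] = 0.5

# The agent was in state 0, took action 1, received reward 1.0,
# landed in state 1 and has already picked action 0 there.
td_error = sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
print(round(Q[0, 1], 4))  # 0.8 * (1.0 + 0.95 * 0.5 - 0.0), about 1.18
```

Note that the update needs `a_next` before it can run, which is why the agent has to pick its next action first and only then update the table.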

### Comparing Q-learning and SARSA
SARSA
$$
Q(s_t,a_t) = Q(s_t,a_t) + \alpha (r + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t,a_t))
$$
Q-learning
$$
Q(s_t,a_t) = Q(s_t,a_t) + \alpha (r + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t,a_t))
$$
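The two rules differ only in the bootstrap term. SARSA uses the value of the action $a_{t+1}$ the agent will actually take next, which makes it an on-policy method, while Q-learning uses the value of the best action in $s_{t+1}$ regardless of what the agent does next, which makes it off-policy. The sketch below contrasts the two targets on a toy table; the function names and the numbers are illustrative assumptions, not taken from the notebook.

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=0.95):
    # Bootstraps from the action that will actually be taken next.
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma=0.95):
    # Bootstraps from the best action in the next state.
    return r + gamma * np.max(Q[s_next])

# Toy table (assumed values): 2 states, 2 actions.
Q = np.array([[0.0, 0.0],
              [0.2, 1.0]])
r, s_next = 0.0, 1

# Suppose exploration picks the non-greedy action 0 in state 1.
a_next = 0
sarsa_t = sarsa_target(Q, r, s_next, a_next)   # 0.95 * 0.2
qlearn_t = q_learning_target(Q, r, s_next)     # 0.95 * 1.0

# When the next action is the greedy one, both targets coincide.
greedy_next = int(np.argmax(Q[s_next]))
same_t = sarsa_target(Q, r, s_next, greedy_next)

print(sarsa_t, qlearn_t, same_t)
```

With a purely greedy policy the two updates are identical; they diverge only on steps where the agent explores.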

### Example - 2000 episodes:

In [12]:
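import gym
import numpy as np

# Note: 'FrozenLake-v0' and the reset()/step() return values used below belong
# to older gym releases; newer releases register the environment as
# 'FrozenLake-v1' and return extra values from reset() and step().
env = gym.make('FrozenLake-v0')

a = .8  # alpha: step size (learning rate)
y = .95  # gamma: discount factor
num_episodes = 2000
# One row per state, one column per action, initialised to zero.
Q = np.zeros([env.observation_space.n, env.action_space.n])

# FrozenLake action indices: 0 = left, 1 = down, 2 = right, 3 = up.
action_labels = {0: 'l', 1: 'd', 2: 'r', 3: 'u'}

for i in range(num_episodes):
    # Trace of the current episode; only the last episode's trace is printed.
    # Each step logs the state the action was taken from, so the start state 0
    # appears twice and the terminal state is not recorded.
    visited_states = [0, ]
    chosen_actions = []

    current_state = env.reset()
    # The first action of an episode is chosen greedily, without noise.
    current_action = np.argmax(Q[current_state, :])
    for j in range(100):
        next_state, reward, done, _ = env.step(current_action)

        # Choose the next action greedily from noisy Q-values; the noise
        # shrinks as 1/(i+1), so exploration fades over the episodes.
        noise = np.random.randn(env.action_space.n) * (1. / (i + 1))
        next_action = np.argmax(Q[next_state, :] + noise)

        # SARSA update: the target uses the action that will be taken next.
        Q[current_state, current_action] += a * (reward + y * Q[next_state, next_action] - Q[current_state, current_action])

        visited_states.append(current_state)
        chosen_actions.append(action_labels[current_action])

        current_state = next_state
        current_action = next_action

        if done:
            break

# Pad the action row so both rows of the printed trace have equal length.
chosen_actions.append('-')
print('Last visited states and actions:')
print(np.array([visited_states, chosen_actions]))
print()
print('Last move:')
env.render()
print()
print('Numbers representing states:')
print(np.arange(0, 16).reshape(4, 4))
print()
print('Q-table:')
print(Q)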

Last visited states and actions:
[['0' '0' '0' '4' '4' '8' '8' '8' '9' '8' '9' '13' '13' '14']
 ['l' 'l' 'l' 'l' 'u' 'u' 'u' 'd' 'u' 'd' 'r' 'r' 'u' '-']]

Last move:
  (Up)
SFFF
FHFH
FFFH
HFFG

Numbers representing states:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

Q-table:
[[  9.84857068e-02   3.81064777e-04   4.25934352e-04   2.71681067e-04]
 [  1.55885969e-03   1.20787031e-04   8.49505224e-05   9.99091620e-02]
 [  1.80585779e-01   2.89716854e-05   4.10769756e-04   3.77641859e-04]
 [  5.89198625e-06   1.21873822e-04   6.71749217e-05   2.16134593e-06]
 [  6.37015446e-02   5.20212186e-04   1.99713651e-04   1.33293168e-03]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  3.47444063e-06   2.49835288e-09   1.87561714e-04   7.91550829e-07]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  2.62351523e-04   6.60937443e-05   2.58062127e-05   7.90674016e-02]
 [  2.11931543e-04   5.88325406e-01   1.11575138e-03   1.