# SARSA

**State–action–reward–state–action**

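SARSA is an on-policy temporal-difference learning algorithm, named after the tuple (state, action, reward, next state, next action) used in each update. The agent interacts with the environment and updates its estimates based on the actions it actually takes. The Q-value of a state-action pair is shifted by the error between the target and the current estimate, scaled by the learning rate alpha.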

<img src="../images/sarsa.png">
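
Written out, the update described above is the standard SARSA rule. Here $\alpha$ is the learning rate (`lr_rate` in the code), $\gamma$ is the discount factor (`gamma`), $r$ is the reward, and $s', a'$ are the next state and the action chosen there:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]
$$

The bracketed term is the error. Because $a'$ is the action the agent actually takes next, rather than the best available one, the method is on-policy.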

In [1]:
# importing libraries & packages
import gym
import numpy as np
import pickle
import time

In [2]:
# setting up the environment
env = gym.make('FrozenLake-v0')

In [3]:
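# defining hyperparameters
epsilon = 0.9          # probability of taking a random (exploratory) action
total_episodes = 50    # number of training episodes (short demo run; 1000 for a longer one)
max_steps = 5          # step cap per episode (short demo run; 100 for a longer one)
lr_rate = 0.81         # learning rate (alpha)
gamma = 0.96           # discount factor
rewards = 0            # running total, accumulated in the training loop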

In [4]:
# defining q-table
Q = np.zeros((env.observation_space.n, env.action_space.n))
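
The Q-table has one row per state and one column per action. As a minimal sketch, assuming the default 4x4 FrozenLake map and row-major state numbering (the helper `state_to_cell` is hypothetical, not part of the notebook), a flat state index can be mapped back to a grid cell like this:

```python
import numpy as np

n_rows, n_cols = 4, 4                  # assumed: default 4x4 FrozenLake map
n_states, n_actions = n_rows * n_cols, 4

Q = np.zeros((n_states, n_actions))    # same shape as the notebook's Q-table

def state_to_cell(state):
    """Convert a flat state index to (row, col), assuming row-major numbering."""
    row, col = np.unravel_index(state, (n_rows, n_cols))
    return int(row), int(col)

print(Q.shape)            # (16, 4)
print(state_to_cell(5))   # (1, 1)
```

So `Q[5, :]` holds the four action values for the cell in row 1, column 1.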

In [5]:
# epsilon-greedy action selection
def choose_action(state):
    if np.random.uniform(0, 1) < epsilon:
        # explore: take a random action
        action = env.action_space.sample()
    else:
        # exploit: take the action with the highest Q-value
        action = np.argmax(Q[state, :])
    return action
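
With `epsilon = 0.9`, nine out of ten actions are random. The sketch below shows the same epsilon-greedy rule without the environment, plus a decaying schedule. The schedule, the function names, and the `start`, `end`, and `decay` parameters are assumptions for illustration, not something the notebook uses.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

def epsilon_greedy(q_row, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.uniform(0, 1) < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def decayed_epsilon(episode, start=0.9, end=0.05, decay=0.99):
    """Hypothetical schedule: shrink epsilon geometrically, never below `end`."""
    return max(end, start * decay ** episode)

q_row = np.array([0.0, 0.2, 0.7, 0.1])
print(epsilon_greedy(q_row, epsilon=0.0))          # greedy choice: index 2
print(decayed_epsilon(0), decayed_epsilon(1000))   # 0.9 0.05
```

Starting with a high epsilon and lowering it over episodes is one common way to move from exploration toward exploitation as the Q-table fills in.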


In [6]:
# defining the learn function (one SARSA update)
def learn(state, state2, reward, action, action2):
    predict = Q[state, action]
    # the target uses action2, the action actually chosen in state2
    target = reward + gamma * Q[state2, action2]
    Q[state, action] = Q[state, action] + lr_rate * (target - predict)
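
To make the arithmetic concrete, here is a small worked example of two updates using the notebook's `lr_rate` and `gamma`. The transitions are hypothetical and assume the move succeeds as intended. The function `sarsa_update` is an illustrative variant that takes the table as an argument and returns the error.

```python
import numpy as np

lr_rate, gamma = 0.81, 0.96      # values from the notebook
Q = np.zeros((16, 4))            # 16 states x 4 actions, as for 4x4 FrozenLake

def sarsa_update(Q, state, action, reward, state2, action2):
    """Apply one SARSA update in place and return the TD error."""
    td_error = reward + gamma * Q[state2, action2] - Q[state, action]
    Q[state, action] += lr_rate * td_error
    return td_error

# Hypothetical transition: from state 14, action 2 (Right) reaches the goal (state 15).
err = sarsa_update(Q, state=14, action=2, reward=1.0, state2=15, action2=0)
print(err, Q[14, 2])        # 1.0 0.81

# One step earlier: state 13, action 2 leads to state 14, where action 2 is taken next.
sarsa_update(Q, state=13, action=2, reward=0.0, state2=14, action2=2)
print(round(Q[13, 2], 6))   # 0.629856
```

The second update shows how value propagates backwards: 0.81 * (0 + 0.96 * 0.81) = 0.629856.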

In [7]:
# training loop: run SARSA for total_episodes episodes
for episode in range(total_episodes):
    t = 0
    state = env.reset()
    action = choose_action(state)

    while t < max_steps:
        env.render()
        state2, reward, done, info = env.step(action)
        action2 = choose_action(state2)
        learn(state, state2, reward, action, action2)
        state = state2
        action = action2
        t += 1
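        # note: this counts steps taken, not reward earned;
        # use `rewards += reward` to accumulate the environment's reward instead
        rewards += 1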
        if done:
            break

        time.sleep(0.1)


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Right)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Right)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Right)
SFFF
FH[41mF[0mH
FFFH
HFFG
  (Right)
SFFF
FHFH
FF[41mF[0mH
HFFG

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Down)
SF[41mF[0mF
FHFH
FFFH
HFFG

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
[41mS[0mF

In [8]:
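# printing the average of `rewards` per episode
# (as accumulated above, `rewards` counts steps, so this is the mean episode length)
print("Score over time: ", rewards / total_episodes)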

Score over time:  4.3


In [9]:
# saving the model
with open("frozenLake-sarsa.pkl", 'wb') as f:
    pickle.dump(Q, f)
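
The saved table can later be reloaded to read off a greedy policy. The sketch below writes a stand-in Q-table to a temporary directory so that it runs on its own. The stand-in values and the temporary path are assumptions. In practice the file written by the cell above would be opened instead.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in Q-table so the sketch is self-contained (not the trained table).
Q_demo = np.zeros((16, 4))
Q_demo[14, 2] = 0.81

path = os.path.join(tempfile.mkdtemp(), "frozenLake-sarsa.pkl")
with open(path, "wb") as f:
    pickle.dump(Q_demo, f)

# Reload the table from disk.
with open(path, "rb") as f:
    Q_loaded = pickle.load(f)

# Greedy policy: best-valued action in each state (ties resolve to index 0).
policy = np.argmax(Q_loaded, axis=1)
print(policy.reshape(4, 4))
```

Reshaping to 4x4 lines the chosen actions up with the grid layout, assuming the default map size.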

Based on this [Medium tutorial](https://medium.com/swlh/introduction-to-reinforcement-learning-coding-sarsa-part-4-2d64d6e37617).