# Reinforcement Learning
### SARSA

**SARSA** is a **model-free**, **temporal difference** (TD) method used in **Reinforcement Learning** (RL) to learn an optimal policy for a **Markov Decision Process** (MDP). TD methods update the state-value function $v(s)$ (or the action-value function $q(s,a)$) by moving the current estimate toward a more informed estimate, called the TD target. SARSA uses the following update rule for the action-value function $q(s,a)$:
<br> $\large q(s,a)\leftarrow q(s,a)+\alpha (r+\gamma q(s',a')-q(s,a))$
<br> where $r$ is the reward received after taking action $a$ in state $s$, $s'$ is the next state, and $a'$ is the action chosen at state $s'$. Also, $\alpha$ is the learning rate and $\gamma$ is the discount factor.
<br> **Hint:** Along with SARSA, we use ϵ-greedy action selection, which we discussed in the previous post.
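
To make the update rule above concrete, here is a minimal sketch of a single SARSA update. The helper name `sarsa_update` and the numbers passed to it are hypothetical and chosen only for illustration; the defaults for `alpha` and `gamma` match those used by the `sarsa` function later in this notebook.

```python
# Hypothetical helper (not part of the original notebook):
# applies one SARSA update to a single action-value estimate.
def sarsa_update(q_sa, reward, q_next, alpha=0.1, gamma=0.9):
    # TD target: immediate reward plus discounted value of the next (state, action) pair
    td_target = reward + gamma * q_next
    # Move the current estimate a fraction alpha toward the TD target
    return q_sa + alpha * (td_target - q_sa)

# Illustrative values: q(s,a) = 0, r = -1, q(s',a') = 2
new_q = sarsa_update(q_sa=0.0, reward=-1.0, q_next=2.0)
print(new_q)
```

With these values the TD target is $-1 + 0.9 \times 2 = 0.8$, so the estimate moves from $0$ to about $0.08$, one tenth of the way toward the target.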
<hr>

The example in this notebook is the same one we introduced earlier. Again, we use a **Grid World** environment:
- **States:** A 3x3 grid (9 states), labeled as (0,0) to (2,2).
- **Actions:** Up, Down, Left, Right.
- **Rewards:**
    - Reaching the goal state (2,2) gives a reward of +10.
    - Reaching a "pit" state (1,1) gives a reward of −10.
    - All other transitions give a reward of −1.
- **Terminal States:** (2,2) (goal) and (1,1) (pit).
- **Transition Probabilities:**
    - Moving in the intended direction succeeds with probability 0.8.
    - With probability 0.2, the agent slips and moves in one of the two directions orthogonal to the intended one, chosen at random.
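
The transition rule above can be sketched as a small sampling function. This is an illustrative sketch, not part of the original notebook: the names `ORTHOGONAL` and `sample_direction` are hypothetical, and it assumes that the 20% case slips to one of the two directions orthogonal to the intended action, which is how the `GridWorld.step` method in this notebook implements it.

```python
import random

# Hypothetical lookup table: the two directions orthogonal to each action
ORTHOGONAL = {
    'up': ['left', 'right'],
    'down': ['left', 'right'],
    'left': ['up', 'down'],
    'right': ['up', 'down'],
}

def sample_direction(action, p_success=0.8, rng=random):
    """Return the intended action with probability p_success,
    otherwise one of the two orthogonal directions."""
    if rng.random() < p_success:
        return action
    return rng.choice(ORTHOGONAL[action])

# Empirical check: the fraction of intended moves should be close to 0.8
rng = random.Random(0)  # seeded generator so the run is repeatable
samples = [sample_direction('up', rng=rng) for _ in range(10_000)]
success_rate = samples.count('up') / len(samples)
print(f"Empirical success rate: {success_rate:.3f}")
```

Note that the agent never observes these probabilities directly; it only sees the sampled next state, which is what makes a model-free method such as SARSA applicable.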

<hr>
https://github.com/ostad-ai/Reinforcement-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Reinforcement-Learning

In [1]:
# Import the required module
import random

In [2]:
# This class simulates the GridWorld environment.
# The RL algorithm does not need to know the transition probabilities
# because it is a model-free method.
class GridWorld:
    def __init__(self):
        self.states = [(i, j) for i in range(3) for j in range(3)]
        self.actions = ['up', 'down', 'left', 'right']
        self.terminal = {(2, 2): 10, (1, 1): -10}  # Terminal states and rewards
        
    def reset(self):
        self.current_state = (0, 0)  # Start state
        return self.current_state
    
    def step(self, action):
        i, j = self.current_state
        
        if self.current_state in self.terminal:
            return self.current_state, 0, True  # Already terminal
        
        # Movement with 80% success, 20% random orthogonal slip
        if random.random() < 0.8:
            next_state = self._move(action, i, j)
        else:
            # Random orthogonal slip
            if action in ['up', 'down']:
                next_state = self._move(random.choice(['left', 'right']), i, j)
            else:
                next_state = self._move(random.choice(['up', 'down']), i, j)
        
        self.current_state = next_state
        reward = self.terminal.get(next_state, -1)  # -1 for non-terminal
        done = next_state in self.terminal
        
        return next_state, reward, done
    
    def _move(self, action, i, j):
        if action == 'up':
            return max(i-1, 0), j
        elif action == 'down':
            return min(i+1, 2), j
        elif action == 'left':
            return i, max(j-1, 0)
        elif action == 'right':
            return i, min(j+1, 2)

In [3]:
# Epsilon-greedy action-selection
def epsilon_greedy(Q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)
    else:
        # Get all Q-values for this state
        q_values = [Q.get((state, a), 0) for a in actions]
        max_q = max(q_values)
        # In case of multiple max, choose randomly among them
        best_actions = [a for a, q in zip(actions, q_values) if q == max_q]
        return random.choice(best_actions)

In [4]:
# SARSA algorithm (on-policy TD control).
# env: environment, alpha: learning rate, gamma: discount factor,
# epsilon: exploration rate, episodes: number of episodes
def sarsa(env, alpha=0.1, gamma=0.9, epsilon=0.1, episodes=1000):
    Q = {(s, a): 0 for s in env.states for a in env.actions}
    
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, env.actions, epsilon)
        done = False
        
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, env.actions, epsilon)
            td_target = r + gamma * Q.get((s_next, a_next), 0)
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s, a = s_next, a_next
    
    return Q

In [5]:
# Example: SARSA with GridWorld
print("--- SARSA Optimal Policy:")
env = GridWorld()
Q = sarsa(env)
policy = {}
for s in env.states:
    if s in env.terminal:
        policy[s] = None
    else:
        q_values = [Q.get((s, a), 0) for a in env.actions]
        policy[s] = env.actions[q_values.index(max(q_values))]

for i in range(3):
    for j in range(3):
        print(f"State ({i},{j}): {policy.get((i,j), 'TERMINAL')}", end=" | ")
    print()

--- SARSA Optimal Policy:
State (0,0): right | State (0,1): right | State (0,2): down | 
State (1,0): down | State (1,1): None | State (1,2): down | 
State (2,0): right | State (2,1): right | State (2,2): None | 
