* State-action-reward-state-action (SARSA) is an on-policy reinforcement learning algorithm that learns a policy for a Markov decision process. In the current state (S), the agent takes an action (A), receives a reward (R), ends up in the next state (S1) and chooses its next action (A1) in S1. The tuple (S, A, R, S1, A1) gives the algorithm its name.

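* It’s called an on-policy algorithm because it updates its value estimates using the action (A1) that the current policy actually selects in the next state, rather than the action with the highest estimated value.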

# What Is SARSA?

> SARSA is an on-policy reinforcement learning algorithm that learns a policy for a Markov decision process. Its name comes from the tuple (S, A, R, S1, A1): in the current state (S), the agent takes an action (A), receives a reward (R), ends up in the next state (S1) and chooses its next action (A1) there.

# How Does the SARSA Algorithm Work? 

* The SARSA algorithm chooses actions based on rewards received from previous actions. To do this, SARSA stores an estimated value, called a Q-value, for each state (S)-action (A) pair. The table of these estimates is known as a Q-table, and each entry is denoted Q(S, A).

* The SARSA process starts by initializing Q(S, A) to arbitrary values. The initial state (S) is then set, and the initial action (A) is selected with an epsilon-greedy policy based on the current Q-values. An epsilon-greedy policy balances exploration and exploitation: with probability epsilon it picks a random action, and otherwise it picks the action with the highest estimated value.

* Exploitation means using the current value estimates to choose actions that have already proved rewarding. Exploration means trying other actions to gain new knowledge, which may lead to sub-optimal actions in the short term but can reveal better actions and rewards in the long term.

* From here, the selected action is taken, and the reward (R) and next state (S1) are observed. The next action (A1) is then selected in S1 with the same epsilon-greedy policy, and Q(S, A) is updated toward R plus the discounted estimate Q(S1, A1). Each Q-value estimates the expected return (the discounted sum of future rewards) from taking a given action in a given state.

* These steps are repeated, with S1 and A1 becoming the new S and A, until the episode ends. An episode is the sequence of states, actions and rewards from the initial state until a final (terminal) state is reached. The state, action and reward experienced at each step are used to update Q(S, A).
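The update described in the steps above is commonly written as the rule below. The symbols α (learning rate, `lr_rate` in the code later in this notebook) and γ (discount factor, `gamma`) are notation introduced here for illustration; they do not appear in the text above.

```latex
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \, Q(S_1, A_1) - Q(S, A) \right]
```

The bracketed term is the gap between the target, R + γ Q(S1, A1), and the current estimate. Because the target uses the Q-value of the action A1 that the policy actually chose, the update is on-policy.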

In [34]:
import gym
import numpy as np
import time
import os
from tqdm import tqdm  
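The cells in this notebook call `env.reset()` and `env.step()` in the classic gym style, where `reset()` returns only the observation and `step()` returns four values. In gym 0.26 and later (and in gymnasium), `reset()` returns `(observation, info)` and `step()` returns five values. A minimal sketch of two wrappers that accept either form is shown here; `reset_env` and `step_env` are hypothetical helper names, not part of gym.

```python
def reset_env(env):
    """Return only the initial observation, whichever API style env uses."""
    result = env.reset()
    # Newer API: reset() returns (observation, info_dict).
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def step_env(env, action):
    """Return (state, reward, done, info) for 4-tuple and 5-tuple step() results."""
    result = env.step(action)
    if len(result) == 5:
        state, reward, terminated, truncated, info = result
        # Treat either ending condition as the end of the episode.
        return state, reward, terminated or truncated, info
    return result
```

If the later cells raise unpacking or indexing errors on a newer install, replacing `env.reset()` with `reset_env(env)` and `env.step(action)` with `step_env(env, action)` is one way to adapt them.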



# Define the custom FrozenLake map

In [35]:
custom_map = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG"
]

# Create the environment
env = gym.make('FrozenLake-v1', desc=custom_map, is_slippery=False)

# SARSA hyperparameters
epsilon = 0.9  # Initial exploration rate, decayed after every episode
total_episodes = 75000  # Increased for robust learning
max_steps = 100
lr_rate = 0.5  # Reduced for stable updates
gamma = 0.99  # Higher value to prioritize long-term reward
epsilon_min = 0.01
epsilon_decay = 0.9995  # Very slow decay for extensive exploration
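To get a feel for the exploration schedule these values imply, the sketch below computes how many episodes it takes for epsilon to decay from its starting value to `epsilon_min`. It assumes the multiplicative per-episode decay used in the training loop; `epsilon_start`, `episodes_to_floor` and `epsilon_at` are names introduced here for illustration.

```python
import math

epsilon_start = 0.9
epsilon_min = 0.01
epsilon_decay = 0.9995

# Smallest n such that epsilon_start * epsilon_decay**n <= epsilon_min
episodes_to_floor = math.ceil(
    math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay)
)
print("Episodes until epsilon reaches its floor:", episodes_to_floor)


def epsilon_at(episode):
    """Exploration rate after `episode` decay steps, clipped at epsilon_min."""
    return max(epsilon_min, epsilon_start * epsilon_decay ** episode)
```

With these settings the floor is reached after roughly 9,000 episodes, about 12% of the 75,000 training episodes, so most of training runs with epsilon at 0.01.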

# Initialize Q-table

In [36]:

Q = np.zeros((env.observation_space.n, env.action_space.n))


# Functions to choose an epsilon-greedy action and update the Q-table

In [37]:

def choose_action(state):
    if np.random.uniform(0, 1) < epsilon:
        return env.action_space.sample()  # Random exploration
    return np.argmax(Q[state, :])  # Greedy action

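# Function to update Q-table (SARSA rule)
def learn(state, state2, reward, action, action2):
    # Current estimate for the state-action pair that was just taken
    predict = Q[state, action]
    # On-policy target: bootstraps from the action actually chosen in state2
    target = reward + gamma * Q[state2, action2]
    # Move the estimate a fraction lr_rate of the way toward the target
    Q[state, action] += lr_rate * (target - predict)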
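As a worked example of this update, the sketch below applies it twice to a fresh table using the notebook's `lr_rate` and `gamma`. The function `sarsa_update` and the table `Q_demo` are stand-ins introduced here so the block runs on its own; the two transitions are hand-picked for illustration and are not taken from a real training run.

```python
import numpy as np

lr_rate, gamma = 0.5, 0.99
Q_demo = np.zeros((16, 4))  # 16 states x 4 actions, all estimates start at 0


def sarsa_update(Q, state, action, reward, state2, action2):
    """Same rule as learn(), but with the table passed in explicitly."""
    target = reward + gamma * Q[state2, action2]
    Q[state, action] += lr_rate * (target - Q[state, action])


RIGHT = 2  # FrozenLake action index for moving right

# Step into the goal: state 14 -> state 15 with reward 1.
# The goal is terminal, so its Q-values are never updated and stay at 0.
sarsa_update(Q_demo, 14, RIGHT, 1.0, 15, RIGHT)
print(Q_demo[14, RIGHT])

# A later step one cell earlier: state 13 -> state 14 with reward 0,
# where the policy again chooses RIGHT in state 14.
sarsa_update(Q_demo, 13, RIGHT, 0.0, 14, RIGHT)
print(Q_demo[13, RIGHT])
```

The first update gives 0 + 0.5 × (1 − 0) = 0.5. The second gives 0 + 0.5 × (0.99 × 0.5 − 0) = 0.2475, which shows how the goal reward propagates backward along the path one step at a time.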


# Training loop with tqdm progress bar

In [38]:

total_rewards = 0
successful_episodes = 0
with tqdm(total=total_episodes, desc="Training Progress") as pbar:
    for episode in range(total_episodes):
        state = env.reset()  # Classic gym API (before 0.26): returns only the observation
        action = choose_action(state)
        episode_reward = 0

        for t in range(max_steps):
            state2, reward, done, _ = env.step(action)  # Classic gym API: 4-tuple result
            action2 = choose_action(state2)
            learn(state, state2, reward, action, action2)

            state = state2
            action = action2
            episode_reward += reward

            if done:
                if reward == 1:
                    successful_episodes += 1
                break

        total_rewards += episode_reward
        epsilon = max(epsilon_min, epsilon * epsilon_decay)  # Decay epsilon

        # Update progress bar
        pbar.update(1)
        if episode % 10000 == 0:
            pbar.set_postfix({"Total Rewards": total_rewards, "Success Rate": f"{successful_episodes / (episode + 1):.2%}"})

print("\nTraining complete! Total rewards:", total_rewards, "Success Rate:", successful_episodes / total_episodes)



Training Progress: 100%|██████████| 75000/75000 [00:18<00:00, 4128.28it/s, Total Rewards=65844.0, Success Rate=94.06%]


Training complete! Total rewards: 70749.0 Success Rate: 0.94332
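One quick way to inspect what was learned is to print the greedy action for every tile. The sketch below is one possible helper; `policy_grid` and `ARROWS` are names introduced here, and the demo table contains a single made-up entry purely so the block runs on its own.

```python
import numpy as np

# FrozenLake action indices: 0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP
ARROWS = {0: "←", 1: "↓", 2: "→", 3: "↑"}


def policy_grid(Q, desc):
    """Return the greedy action for every non-terminal tile as a text grid."""
    n_cols = len(desc[0])
    lines = []
    for r, row in enumerate(desc):
        cells = []
        for c, tile in enumerate(row):
            if tile in "HG":
                cells.append(tile)  # Holes and the goal end the episode
            else:
                cells.append(ARROWS[int(np.argmax(Q[r * n_cols + c]))])
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Placeholder table for illustration only: one made-up non-zero entry.
demo_map = ["SFFF", "FHFH", "FFFH", "HFFG"]
demo_Q = np.zeros((16, 4))
demo_Q[0, 1] = 1.0  # Pretend DOWN looks best in the start state
print(policy_grid(demo_Q, demo_map))
```

With the notebook's trained table, the call would be `print(policy_grid(Q, custom_map))`. States whose row is still all zeros show `←`, because `np.argmax` returns index 0 on ties.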





# ---- Enhanced Visualization Code with Red Indicator ----

In [39]:

def render_grid(state, prev_state=None):
    """Prints the FrozenLake grid with the agent's position and red previous position"""
    os.system('clear' if os.name == 'posix' else 'cls')  # Clears the screen
    grid_size = 4
    grid = [list(row) for row in custom_map]

    # Mark current agent position
    row, col = divmod(state, grid_size)
    grid[row][col] = "A"  # Current position as "A"

    # Mark previous position in red if provided
    if prev_state is not None and prev_state != state:
        prev_row, prev_col = divmod(prev_state, grid_size)
        if 0 <= prev_row < grid_size and 0 <= prev_col < grid_size:  # Ensure within bounds
            grid[prev_row][prev_col] = "\033[31m" + grid[prev_row][prev_col] + "\033[0m"  # Red color

    # Print the grid
    for line in grid:
        print(" ".join(line))
    print()


# Run the agent dynamically with a full path and red indicator

In [40]:

def visualize_agent(Q, env):
    state = env.reset()
    done = False
    step_count = 0
    max_steps = 20  # Limit steps to avoid infinite loops
    prev_state = None  # Initialize previous state as None to start

    print("Starting visualization of agent's learned path...\n")
    while not done and step_count < max_steps:
        render_grid(state, prev_state)  # Render with previous state in red
        
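        # Infer the move that led to the current cell from the change in position.
        # On the first iteration there is no previous state, so the (-1, -1)
        # sentinel makes the label below read "DOWN" before the agent has moved.
        prev_row, prev_col = divmod(prev_state, 4) if prev_state is not None else (-1, -1)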
        row, col = divmod(state, 4)

        if row > prev_row:
            move = "DOWN"
        elif row < prev_row:
            move = "UP"
        elif col > prev_col:
            move = "RIGHT"
        else:
            move = "LEFT"

        print(f"Agent moves: {move}\n")
        
        prev_state = state  # Update previous state before moving
        action = np.argmax(Q[state])  # Use the learned policy
        state, reward, done, _ = env.step(action)
        step_count += 1
        time.sleep(0.5)  # Pause for effect

    render_grid(state)  # Show final state (no previous state needed)
    if reward == 1:
        print("🏆 Goal reached! Frisbee retrieved!")
    elif done:
        print("💀 Fell in a hole!")
    else:
        print("❌ Max steps reached without goal.")

# Run visualization
visualize_agent(Q, env)


Starting visualization of agent's learned path...

A F F F
F H F H
F F F H
H F F G

Agent moves: DOWN

S F F F
A H F H
F F F H
H F F G

Agent moves: DOWN

S F F F
F H F H
A F F H
H F F G

Agent moves: DOWN

S F F F
F H F H
F A F H
H F F G

Agent moves: RIGHT

S F F F
F H F H
F F F H
H A F G

Agent moves: DOWN

S F F F
F H F H
F F F H
H F A G

Agent moves: RIGHT

S F F F
F H F H
F F F H
H F F A

🏆 Goal reached! Frisbee retrieved!
