# Deep Q-Learning on Frozen Lake



## Introduction

Reinforcement Learning (RL) is a paradigm for teaching agents to make decisions in an environment so as to maximize cumulative reward. A common environment for experimenting with RL algorithms is Frozen Lake, provided by Gymnasium (the maintained successor to OpenAI Gym). In this notebook, we apply Q-learning to the Frozen Lake problem. Deep Q-Learning (DQN) approximates Q-values with a neural network, but the implementation here is tabular: it stores one Q-value per state-action pair in a NumPy array.

## What is Frozen Lake?
Frozen Lake is a grid world environment where an agent needs to navigate from a starting point to a goal, avoiding holes that would lead to failure. The environment is described as follows:

- S: Starting point (safe)
- F: Frozen surface (safe)
- H: Hole (unsafe, leads to failure)
- G: Goal (safe, leads to success)

The agent can move in four directions: left, right, up, and down. The objective is to find the optimal policy that maximizes the chances of reaching the goal while avoiding the holes.
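
To make the grid encoding concrete, here is a minimal sketch of how a flat state index maps to a row, a column, and a tile type. It assumes the standard 4x4 Frozen Lake layout rather than the 8x8 map trained on later, and `tile_at` is a hypothetical helper written for illustration, not part of Gymnasium.

```python
# Standard 4x4 Frozen Lake layout (smaller than the 8x8 map used below).
GRID = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]


def tile_at(state, grid=GRID):
    """Return (row, col, tile) for a flat state index, numbered row by row."""
    n_cols = len(grid[0])
    row, col = divmod(state, n_cols)
    return row, col, grid[row][col]


# Flat indices of every hole in the grid.
holes = [s for s in range(16) if tile_at(s)[2] == "H"]

print(tile_at(0))   # (0, 0, 'S')
print(tile_at(15))  # (3, 3, 'G')
print(holes)        # [5, 7, 11, 12]
```

The same `state // n_cols`, `state % n_cols` arithmetic is what the visualization code later uses to place states on an 8x8 heatmap.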

## Training function with different parameters

The train function implements the Q-learning algorithm, in its tabular form, to train an agent to navigate the Frozen Lake environment. Here’s how it works:

### Function Parameters
The function takes the following parameters:

- slippery: A boolean that indicates whether the Frozen Lake is slippery or not.
- learning_rate_a: The learning rate used for updating the Q-values.
- discount_factor_g: The discount factor that determines the importance of future rewards.
- epsilon: The initial exploration rate, which controls the probability of choosing a random action.
- epsilon_decay_rate: The rate at which the exploration rate epsilon decreases over time.
- episodes: The number of episodes the agent will be trained for.
- max_steps: The maximum number of steps allowed per episode.
- move_reward: The reward given for making a regular move.
- fall_reward: The penalty given for falling into a hole.
- goal_reward: The reward given for reaching the goal.

### Initialization
The function starts by setting up the environment and initializing various variables:

- It creates the Frozen Lake environment with the specified slipperiness.
- It initializes a Q-table with zeros. The Q-table has dimensions corresponding to the number of states and actions in the environment.
- It sets up a dictionary to store Q-values for each state over time.
- It copies the initial Q-values to a separate variable to track changes.
- It initializes a random number generator.
- It creates an array to store rewards per episode, and lists to store the state history, action history, and rewards history.
### Training Loop
The main training process occurs in a loop that runs for a specified number of episodes:

- At the beginning of each episode, the environment is reset to the starting state.
- Several variables are initialized to track the progress and results of the current episode, including the current state, whether the episode has ended, the total reward for the episode, and lists to store the states, actions, and rewards encountered during the episode.
### Step Loop
Within each episode, the agent takes actions in a loop that continues until the episode ends or the maximum number of steps is reached:

- The agent decides whether to take a random action or choose the best-known action based on the Q-table. This decision is controlled by the exploration rate epsilon.
- The environment updates based on the chosen action, returning the new state, reward, and whether the episode has ended.
- The state and action are recorded.
- The reward is adjusted based on the outcome (reaching the goal, falling into a hole, or making a regular move), then recorded and added to the episode total.
- The Q-value for the current state-action pair is updated using the Q-learning formula, which incorporates the learning rate, discount factor, and the maximum Q-value of the next state.
- The state is updated to the new state, and the difference between the current and previous Q-values is calculated.
- The maximum Q-value of every state is appended to the dataset to track changes over time.
- The step counter is incremented.
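
The two core operations of the step loop, epsilon-greedy action selection and the Q-learning update, can be sketched in isolation. This is a simplified pure-Python illustration, not the notebook's code: `choose_action` and `q_update` are hypothetical helper names, and the numbers in the worked example are chosen for illustration, using a learning rate of 0.01 and a discount factor of 0.9.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # Ties resolve to the lowest action index, as np.argmax does.
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-learning step: move q_sa toward reward + gamma * max_q_next."""
    td_target = reward + gamma * max_q_next
    return q_sa + alpha * (td_target - q_sa)


# Worked example: a regular move (reward -0.01) into a state whose
# best Q-value is 2.0, starting from a Q-value of 0.
new_q = q_update(q_sa=0.0, reward=-0.01, max_q_next=2.0, alpha=0.01, gamma=0.9)
print(round(new_q, 4))  # 0.0179
```

With a learning rate this small, a single update moves the Q-value only 1% of the way toward its target, which is why many episodes are needed for values to propagate back from the goal.
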
### End of Episode
At the end of each episode:

- The state, action, and reward histories are stored.
- The total reward for the episode is recorded.
- The exploration rate epsilon is decayed to encourage more exploitation of known good actions as training progresses, but it never drops below 0.01. Once the exploration rate reaches this minimum, the learning rate is lowered to 0.0001.
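
To see how quickly the multiplicative decay reaches its floor, the schedule can be simulated on its own. The sketch below assumes an initial epsilon of 1, a decay rate of 0.995, and the floor of 0.01 used in the training code; `episodes_until_floor` is a hypothetical helper, not part of the notebook.

```python
import math


def episodes_until_floor(epsilon, decay_rate, floor=0.01):
    """Count per-episode decays until epsilon first equals the floor."""
    n = 0
    while epsilon > floor:
        epsilon = max(epsilon * decay_rate, floor)
        n += 1
    return n


n = episodes_until_floor(epsilon=1.0, decay_rate=0.995)
# Closed form: smallest n with decay_rate ** n <= floor.
closed_form = math.ceil(math.log(0.01) / math.log(0.995))
print(n, closed_form)  # 919 919
```

Under these settings the floor is reached after 919 episodes, so in a 10,000-episode run most of the training happens at the minimum exploration rate and the reduced learning rate.
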
### Return Values
After all episodes are completed, the function returns the following:

- The final Q-table.
- The dataset of Q-values for each state over time.
- The rewards obtained in each episode.
- The history of states, actions, and rewards encountered during training.

This function trains an agent to learn an optimal policy for navigating the Frozen Lake environment by balancing exploration and exploitation, updating its Q-values based on the rewards obtained, and gradually improving its strategy over multiple episodes.
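
The returned Q-table can be turned into a readable policy by taking the highest-valued action in each state. The following is a minimal sketch with hypothetical helpers (`greedy_policy`, `render_policy`) and a made-up 2x2 Q-table; the action order Left, Down, Right, Up matches the labels used in the visualization code.

```python
ARROWS = ["←", "↓", "→", "↑"]  # action indices 0-3: Left, Down, Right, Up


def greedy_policy(q_table):
    """Return the index of the highest-valued action for every state."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q_table]


def render_policy(policy, n_cols):
    """Format a flat policy as one string of arrows per grid row."""
    return [
        "".join(ARROWS[a] for a in policy[i:i + n_cols])
        for i in range(0, len(policy), n_cols)
    ]


# Toy 2x2 Q-table (made-up values, not results from this notebook).
toy_q = [
    [0.0, 0.1, 0.5, 0.0],  # state 0: Right is best
    [0.0, 0.7, 0.2, 0.0],  # state 1: Down is best
    [0.0, 0.0, 0.9, 0.1],  # state 2: Right is best
    [0.0, 0.0, 0.0, 0.0],  # state 3: all zero, ties resolve to Left
]
policy = greedy_policy(toy_q)
print(policy)                    # [2, 1, 2, 0]
print(render_policy(policy, 2))  # ['→↓', '→←']
```

Because the helpers only index into rows, they should also accept the NumPy Q-table returned by train, for example `render_policy(greedy_policy(q), 8)`. States that were never updated keep all-zero rows and show up as Left, so arrows on holes and the goal carry no meaning.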

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import pickle
from gymnasium.wrappers import TransformReward
import plotly.graph_objects as go
import plotly.subplots as sp

In [24]:
def train(slippery, learning_rate_a, discount_factor_g, epsilon, epsilon_decay_rate, episodes, max_steps, move_reward, fall_reward, goal_reward):
    env = gym.make('FrozenLake-v1', desc=None,
                   map_name="8x8", is_slippery=slippery)
    q = np.zeros([env.observation_space.n, env.action_space.n])  # Q-table 64x4
    q_dataset = {s: [] for s in range(env.observation_space.n)}
    prev_Q_values = np.copy(q)
    rng = np.random.default_rng()
    rewards_per_episode = np.zeros(episodes)
    states_history = []
    rewards_history = []
    actions_history = []

    for i in range(episodes):
        state = env.reset()[0]  # reset environment to starting state
        terminated = False
        truncated = False
        episode_reward = 0
        episode_states = []
        episode_actions = []
        episode_rewards = []
        step_count = 0

        while not terminated and not truncated and step_count < max_steps:
            if rng.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(q[state, :])

            new_state, reward, terminated, truncated, info = env.step(action)

            episode_states.append(state)
            episode_actions.append(action)

            # Adjust reward based on outcome
            if terminated and reward == 1:  # Reached the goal
                reward = goal_reward
            elif terminated and reward == 0:  # Fell in a hole
                reward = fall_reward
            else:  # Regular move
                reward += move_reward

            episode_rewards.append(reward)
            episode_reward += reward

            q[state, action] = q[state, action] + learning_rate_a * (
                reward + discount_factor_g *
                np.max(q[new_state, :]) - q[state, action]
            )

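            state = new_state
            # Change in the new state's Q-values since the previous step
            # (kept for inspection; not used elsewhere in this function).
            q_values_diff = q[state] - prev_Q_values[state]
            prev_Q_values = np.copy(q)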

            for s in range(env.observation_space.n):
                q_dataset[s].append(np.max(q[s]))

            step_count += 1

        states_history.append(episode_states)
        actions_history.append(episode_actions)
        rewards_history.append(episode_rewards)
        rewards_per_episode[i] = episode_reward

        epsilon = max(epsilon * epsilon_decay_rate,
                      0.01)  # Ensure some exploration
        if epsilon == 0.01:
            learning_rate_a = 0.0001

    return q, q_dataset, rewards_per_episode, states_history, actions_history, rewards_history


## Visualization Function

In [25]:
def visualize_q_learning_results(rewards_per_episode, states_history, actions_history, rewards_history):
    # Create a subplot with 2 rows and 2 columns
    fig = sp.make_subplots(rows=2, cols=2, subplot_titles=(
        "Rewards per Episode",
        "State Visitation Heatmap",
        "Action Distribution",
        "Q-value Evolution"
    ))

    # 1. Rewards per Episode
    fig.add_trace(
        go.Scatter(y=rewards_per_episode, mode='lines', name='Reward'),
        row=1, col=1
    )
    fig.update_xaxes(title_text="Episode", row=1, col=1)
    fig.update_yaxes(title_text="Reward", row=1, col=1)

    # 2. State Visitation Heatmap
    state_visits = np.zeros((8, 8))
    for episode in states_history:
        for state in episode:
            state_visits[state // 8, state % 8] += 1

    fig.add_trace(
        go.Heatmap(z=state_visits, colorscale='Viridis', name='State Visits'),
        row=1, col=2
    )
    fig.update_xaxes(title_text="Column", row=1, col=2)
    fig.update_yaxes(title_text="Row", row=1, col=2)

    # 3. Action Distribution
    action_counts = np.zeros(4)
    for episode in actions_history:
        for action in episode:
            action_counts[action] += 1

    fig.add_trace(
        go.Bar(x=['Left', 'Down', 'Right', 'Up'],
               y=action_counts, name='Action Counts'),
        row=2, col=1
    )
    fig.update_xaxes(title_text="Action", row=2, col=1)
    fig.update_yaxes(title_text="Count", row=2, col=1)

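    # 4. Q-value Evolution
    # Note: q_dataset is not a parameter; it is read from the notebook's
    # global scope, so train() must have been run first.
    q_values = np.array(list(q_dataset.values()))  # shape: (states, steps)
    avg_q_values = np.mean(q_values, axis=0)  # mean over states, per step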

    fig.add_trace(
        go.Scatter(y=avg_q_values, mode='lines', name='Avg Q-value'),
        row=2, col=2
    )
    fig.update_xaxes(title_text="Training Step", row=2, col=2)
    fig.update_yaxes(title_text="Average Q-value", row=2, col=2)

    # Update layout and show plot
    fig.update_layout(height=800, width=1000,
                      title_text="Q-Learning Analysis")
    fig.show()

## Experiment 1

### Parameters
1. slippery=False:
    - This makes the environment deterministic. When set to True, it introduces stochasticity, making learning more challenging but more realistic.
    - False is good for initial learning and debugging, but True better represents real-world scenarios.
2. learning_rate_a=0.01:
    - This determines how much new information overrides old information.
    - 0.01 is a relatively low value, promoting stable but slow learning. Note that train lowers the learning rate to 0.0001 once epsilon reaches its minimum of 0.01.
3. discount_factor_g=0.9:
    - This balances immediate and future rewards.
    - 0.9 gives significant weight to future rewards without completely disregarding immediate ones.
4. epsilon=1 and epsilon_decay_rate=0.995:
    - Starts with full exploration (1) and gradually shifts to exploitation.
    - The decay rate of 0.995 is applied once per episode, so epsilon reaches its floor of 0.01 after roughly 920 of the 10,000 episodes.
5. episodes=10000:
    - This is a good number for complex environments like 8x8 Frozen Lake.
    - More episodes allow for more learning opportunities, especially important in stochastic environments.
6. max_steps=100:
    - This prevents episodes from running indefinitely.
    - 100 steps is ample for an 8x8 grid, where a corner-to-corner route takes at least 14 moves.
7. move_reward=-0.01:
    - A small negative reward encourages finding the goal quickly.
8. fall_reward=-1:
    - A significant penalty for falling into a hole.
9. goal_reward=2:
    - This provides a strong positive reinforcement for reaching the goal.
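
To illustrate how the discount factor and the step penalty together favor short routes, the sketch below computes the discounted return of a path that ends at the goal. `discounted_return` is a hypothetical helper. The path lengths are assumptions: 14 moves is the Manhattan distance between opposite corners of an 8x8 grid, and 20 moves stands in for a detour.

```python
def discounted_return(n_moves, gamma, move_reward, goal_reward):
    """Discounted return of a path whose final move reaches the goal.

    The first n_moves - 1 moves each earn move_reward and the last one
    earns goal_reward, mirroring how train() assigns rewards.
    """
    step_costs = sum(gamma ** t * move_reward for t in range(n_moves - 1))
    return step_costs + gamma ** (n_moves - 1) * goal_reward


# Assumed path lengths: 14 (direct route) and 20 (detour).
short = discounted_return(14, gamma=0.9, move_reward=-0.01, goal_reward=2)
detour = discounted_return(20, gamma=0.9, move_reward=-0.01, goal_reward=2)
print(round(short, 4), round(detour, 4))
```

Under these assumptions the direct route is worth more than twice as much as the detour from the start state, and most of that gap comes from discounting the goal reward rather than from the accumulated step penalties.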


In [29]:
slippery=False
learning_rate_a=0.01
discount_factor_g=0.9
epsilon=1
epsilon_decay_rate=0.995
episodes=10000
max_steps=100
move_reward=-0.01
fall_reward=-1
goal_reward=2

In [30]:
q, q_dataset, rewards_per_episode, states_history, actions_history, rewards_history = train(slippery, learning_rate_a, discount_factor_g, epsilon, epsilon_decay_rate, episodes, max_steps, move_reward, fall_reward, goal_reward)
visualize_q_learning_results(
    rewards_per_episode, states_history, actions_history, rewards_history)

### Thoughts
Based on the Q-Learning analysis results:
1. Rewards per Episode: Shows high variability but generally positive rewards, indicating the agent often reaches the goal.
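2. State Visitation Heatmap: The start state is the most visited. It appears at the bottom-left of the plot because the heatmap's row axis increases upward, so grid row 0 is drawn at the bottom. There's a clear path towards the top-right of the plot, where the goal is drawn, suggesting the agent has learned a route to the goal.
3. Action Distribution: Down and Right actions are preferred, which matches the goal's position at the bottom-right of the grid (row 7, column 7).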
4. Q-value Evolution: Rapid initial increase followed by steady growth, indicating consistent learning throughout training.

Overall, the agent appears to have learned an effective policy for navigating the Frozen Lake environment, with a clear preference for actions that lead to the goal state.
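
Because the per-episode rewards are highly variable, a moving average can make the trend easier to read. This is a small sketch of one way to smooth the curve; `moving_average` is a hypothetical helper and the sample rewards are made up, not output from the training run.

```python
def moving_average(values, window):
    """Mean of each full sliding window; returns len(values) - window + 1 items."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Made-up rewards for illustration, not output from the training run.
sample = [-1.0, 2.0, -1.0, 2.0, 2.0, 2.0]
print(moving_average(sample, 3))  # [0.0, 1.0, 1.0, 2.0]
```

Applied to the training output, something like `moving_average(rewards_per_episode, 100)` could be plotted in place of the raw reward trace.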
