# Deep Reinforcement Learning Lab 1: FrozenLake: A Slippery Challenge

FrozenLake is a classic reinforcement learning environment, originally from OpenAI Gym and now
maintained in the Gymnasium library (the package imported below). It simulates crossing a frozen
lake: the agent must walk from the start tile to the goal tile without falling into any holes.

**Key Characteristics:**

- Grid World: By default the environment is a 4x4 grid. Each cell is one of 16 discrete states, numbered 0 to 15 row by row.
- Actions: The agent can take four actions: left (0), down (1), right (2), or up (3).
- State Transitions: The ice is slippery by default (`is_slippery=True`), so the agent moves in the intended direction with probability 1/3 and slips to each of the two perpendicular directions with probability 1/3.
- Rewards: The agent receives a reward of 1 upon reaching the goal state and a reward of 0 otherwise.
- Terminal States: The goal state and the hole states are terminal; reaching either one ends the episode.
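
The slippery dynamics can be sketched in plain Python. The helpers below are a hypothetical re-creation of the default 4x4 transition rule, not the library's own code: `move` and `slippery_outcomes` are illustrative names, and the sketch ignores holes and the goal, so it only shows where a single step can land.

```python
# Action encoding used by FrozenLake: 0=left, 1=down, 2=right, 3=up.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
N_ROWS, N_COLS = 4, 4  # assumed default map size


def move(state, action):
    """Deterministic move on the grid; walls keep the agent in place."""
    row, col = divmod(state, N_COLS)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, N_ROWS - 1)
    elif action == RIGHT:
        col = min(col + 1, N_COLS - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row * N_COLS + col


def slippery_outcomes(state, action):
    """Return (probability, next_state) pairs for one slippery step."""
    # The intended action and its two perpendicular neighbours are equally likely.
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return [(1 / 3, move(state, a)) for a in candidates]


for prob, next_state in slippery_outcomes(state=0, action=RIGHT):
    print(f"p={prob:.2f} -> state {next_state}")
```

Choosing "right" in the top-left corner can therefore move the agent down, move it right, or leave it where it is (the "up" slip hits the wall), which is why a good policy on slippery ice often looks counterintuitive.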

FrozenLake is a popular choice for introducing reinforcement learning concepts due to its simplicity
and the challenge it presents. It's a great environment to experiment with various reinforcement
learning algorithms, including Q-learning.

To avoid conflicts with system-wide Python installations, it is recommended to create a
dedicated virtual environment for this lab, for example with Anaconda, and to install
`numpy` and `gymnasium` into it.



In [1]:
import numpy as np
import gymnasium as gym

Initialize Environment

In [2]:
env = gym.make('FrozenLake-v1')

Define Parameters

In [3]:
epsilon = 0.9  # exploration rate: probability of taking a random action
total_episodes = 100  # number of training episodes
max_steps = 100  # maximum steps per training episode
alpha = 0.85  # learning rate
gamma = 0.95  # discount factor

Initializing Q-table

In [4]:
Q = np.zeros((env.observation_space.n, env.action_space.n))
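
A small sketch of how the table is laid out and indexed may help here. It uses a separate array, `Q_demo` (a hypothetical name, so the lab's `Q` is not touched), and assumes the default 4x4 map with 16 states and 4 actions.

```python
import numpy as np

n_states, n_actions = 16, 4  # assumed: default 4x4 map, four moves
Q_demo = np.zeros((n_states, n_actions))

# Each row holds the action values of one state.
print(Q_demo.shape)

# On an all-zero row, argmax returns the first index, i.e. action 0 (left).
print(np.argmax(Q_demo[0, :]))

# Once one entry is larger than the rest, the greedy action follows it.
Q_demo[0, 2] = 0.5
print(np.argmax(Q_demo[0, :]))
```

The tie-breaking behaviour matters early in training: while a row is still all zeros, the greedy choice is always "left", so any progress depends on the exploratory moves.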

Implementing the `choose_action` function

In [5]:
def choose_action(state):
    if np.random.uniform(0,1) < epsilon:
        return env.action_space.sample() # Explore randomly
    else:
        return np.argmax(Q[state,:]) # Exploit best known action
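
To see what an exploration rate of 0.9 means in practice, the sketch below restates the same epsilon-greedy rule as a standalone function. The name `epsilon_greedy`, the explicit `rng` argument, and the action values in `q_row` are assumptions made for illustration; they are not part of the lab code.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.uniform(0, 1) < epsilon:
        return int(rng.integers(len(q_row)))  # explore
    return int(np.argmax(q_row))  # exploit


rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.7, 0.2, 0.0])  # made-up action values; action 1 is greedy

picks = [epsilon_greedy(q_row, 0.9, rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(f"share of greedy action: {greedy_share:.3f}")
```

With four actions, the greedy action should be chosen roughly 0.1 + 0.9 / 4 = 0.325 of the time, so at this setting the agent behaves almost like a random walker.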

Implementing the `update` function

In [10]:
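def update(state, state2, reward, action):
    """Apply one Q-learning update for the transition (state, action) -> state2."""
    predict = Q[state, action]  # current estimate
    target = reward + gamma * np.max(Q[state2, :])  # one-step bootstrapped target
    Q[state, action] = predict + alpha * (target - predict)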
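
A worked example of a single update, using the `alpha` and `gamma` values from the parameter cell, makes the rule concrete. The transition chosen here (state 14, action "right", landing on the goal at state 15 of the default map) is an assumption for illustration, and `Q_demo` is a fresh table separate from the lab's `Q`.

```python
import numpy as np

alpha, gamma = 0.85, 0.95  # values from the parameter cell
Q_demo = np.zeros((16, 4))  # fresh table, separate from the lab's Q

# Assumed transition: from state 14, action 2 (right) reaches the goal (state 15).
state, action, reward, next_state = 14, 2, 1.0, 15

target = reward + gamma * np.max(Q_demo[next_state, :])  # 1 + 0.95 * 0 = 1
Q_demo[state, action] += alpha * (target - Q_demo[state, action])

print(Q_demo[state, action])  # 0.85
```

The result, 0.85, is consistent with the `8.50000000e-01` entry in row 14 of the Q-table printed at the end of the lab, which suggests that entry was updated from zero exactly once.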

Implementing the training loop

In [11]:
for episode in range(total_episodes):
    state, _ = env.reset()
    for t in range(max_steps):
        action = choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        update(state, next_state, reward, action)
        state = next_state
        if done:
            break
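
The loop above keeps `epsilon` fixed at 0.9 for all episodes. A common variation, which is not part of this lab, is to let the exploration rate decay as training progresses. The sketch below precomputes such a schedule; `eps_end`, `decay_rate`, and `eps_schedule` are hypothetical names and values chosen only for illustration.

```python
import numpy as np

eps_start, eps_end = 0.9, 0.05  # start matches the lab; the floor is an assumed value
decay_rate = 0.05  # hypothetical decay constant
n_episodes = 100

episodes = np.arange(n_episodes)
# Exponential decay from eps_start towards eps_end, one value per episode.
eps_schedule = eps_end + (eps_start - eps_end) * np.exp(-decay_rate * episodes)

print(eps_schedule[0], eps_schedule[-1])
```

To try it, one could set `epsilon = eps_schedule[episode]` at the top of each training episode, since `choose_action` reads the global `epsilon`. Whether this helps with only 100 episodes is an open question to test, not a guaranteed improvement.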

Evaluating Performance

In [12]:
total_reward = 0
for _ in range(100):
    state, _ = env.reset()
    while True:
        action = np.argmax(Q[state, :])
        next_state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        state = next_state
        if terminated or truncated:
            break
print("Average reward over 100 episodes:", total_reward / 100)

Average reward over 100 episodes: 0.03
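
Because each episode pays either 0 or 1, the average reward is a success rate, and an estimate from 100 episodes is noisy. The sketch below computes the standard error of the reported 0.03 using the usual binomial formula; treating the episodes as independent trials is an assumption.

```python
import math

n_episodes = 100
success_rate = 0.03  # reported average reward (each success pays 1)

# Standard error of a binomial proportion: sqrt(p * (1 - p) / n).
std_err = math.sqrt(success_rate * (1 - success_rate) / n_episodes)
print(f"{success_rate:.2f} +/- {std_err:.3f}")  # 0.03 +/- 0.017
```

An uncertainty of about 0.017 on a value of 0.03 means repeated runs of the evaluation cell can plausibly give noticeably different numbers, so more evaluation episodes are needed before comparing settings.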


Q-table:

In [13]:
print(Q)

[[3.48028602e-01 3.61800634e-01 3.49068484e-01 3.62110606e-01]
 [1.93446707e-05 3.79669555e-01 3.81893988e-01 3.51849937e-01]
 [3.93749542e-01 5.47619053e-01 2.95210768e-01 3.95286947e-01]
 [4.38618316e-02 5.59971922e-03 4.52619839e-02 3.94446833e-01]
 [3.68161768e-01 1.22825905e-03 3.71477001e-01 5.70738134e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [6.19438318e-01 6.08457925e-01 2.46303202e-01 6.42167579e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [4.90998115e-03 2.51200753e-01 2.87975109e-01 4.39284490e-01]
 [8.66790651e-02 3.56592812e-01 2.94693882e-01 0.00000000e+00]
 [1.00699899e-02 7.53508266e-01 4.91329774e-01 2.37965310e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 8.53958830e-01 1.15804884e-02]
 [0.00000000e+00 9.89970576e-01 8.50000000e-01 7.08696755e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.000000