# Frozen Lake 4X4 by Ismet

In [2]:
#pip install "gym>=0.26"  # the 5-value env.step() API used below requires gym 0.26 or newer

In [1]:
import numpy as np
import gym
import random

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
# Initialize the Frozen Lake environment for training.
# Rendering is left off here: render_mode="human" draws every step and makes 10,000 episodes very slow.
env = gym.make("FrozenLake-v1", is_slippery=True)

# Initialize Q-table with zeros
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.8  # Learning rate
gamma = 0.95  # Discount factor
epsilon = 1.0  # Exploration rate
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005

# Training parameters
total_episodes = 10000  # Total number of episodes
max_steps = 100  # Max steps per episode

# List of rewards
rewards = []

# Q-Learning algorithm
for episode in range(total_episodes):
    state, _ = env.reset()  # Reset environment to start state (use tuple unpacking)
    total_reward = 0  # Initialize total reward per episode
    done = False  # Boolean to check if episode is finished
    for step in range(max_steps):
        # Exploration-exploitation tradeoff
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore: random action
        else:
            action = np.argmax(q_table[state, :])  # Exploit: choose best action from Q-table

        # Take the action, observe the outcome (next state, reward, done, truncated)
        new_state, reward, done, truncated, _ = env.step(action)
        
        # Q-learning update: Q(s, a) <- Q(s, a) + alpha * (TD target - Q(s, a))
        td_target = reward + gamma * np.max(q_table[new_state, :])
        q_table[state, action] += alpha * (td_target - q_table[state, action])

        state = new_state  # Update state
        total_reward += reward  # Accumulate reward

        if done or truncated:
            break

    # Reduce epsilon (to reduce exploration over time)
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    
    rewards.append(total_reward)

# Calculate and display the average reward over all episodes
print(f"Average reward: {sum(rewards)/total_episodes}")

# Display learned Q-table
print("Learned Q-table:")
print(q_table)

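# Test the agent after training in a separate environment that renders each step
env.close()  # Close the training environment
env = gym.make("FrozenLake-v1", is_slippery=True, render_mode="human")
state, _ = env.reset()  # Reset environment (use tuple unpacking)
done = False
truncated = False
while not (done or truncated):  # Also stop when the time limit truncates the episode
    action = np.argmax(q_table[state, :])
    # With render_mode="human" the environment draws itself on every step
    state, reward, done, truncated, _ = env.step(action)

env.close()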
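
A single rendered episode says little about a policy on slippery ice, since one unlucky slip can end the run. The sketch below is not part of the original notebook; it estimates the greedy policy's success rate over many episodes. `evaluate_greedy` is a hypothetical helper that assumes an environment with the same five-value `step` API used above, and it is best given a non-rendering environment so that it runs quickly.

```python
def evaluate_greedy(env, q_table, episodes=1000, max_steps=100):
    """Return the fraction of episodes in which the greedy policy earns a reward.

    Hypothetical helper: assumes env.reset() returns (state, info) and
    env.step() returns (state, reward, terminated, truncated, info).
    """
    successes = 0
    for _ in range(episodes):
        state, info = env.reset()
        for _ in range(max_steps):
            row = q_table[state]
            # Greedy action: first index with the highest Q-value (like np.argmax)
            action = max(range(len(row)), key=lambda a: row[a])
            state, reward, terminated, truncated, info = env.step(action)
            if terminated or truncated:
                # In Frozen Lake the only non-zero reward is 1 at the goal
                if reward > 0:
                    successes += 1
                break
    return successes / episodes
```

For example, `evaluate_greedy(gym.make("FrozenLake-v1", is_slippery=True), q_table)` would report how often the table learned above reaches the goal.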

## Explanations for the hyperparameters and their potential alternative values

**1. Learning Rate (alpha)**
- Purpose: The learning rate controls how much new information overrides old information. A value closer to 1 prioritizes new information, while a value closer to 0 prioritizes previously learned information.
- Current Value: 0.8
- Alternative Values: 0.1, 0.3, 0.5, 0.9
  * Low Values (0.1 - 0.3): Slower learning, retaining old knowledge longer. The system is more stable, but adapts more slowly to new information.
  * High Values (0.9): Faster learning with more weight given to new information. In a stochastic environment such as slippery Frozen Lake, this can make the Q-values unstable, because each noisy transition overwrites most of the previous estimate.
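
To make the effect of alpha concrete, the following sketch applies the Q-learning update from the training loop to a single transition, repeated three times. `q_update` is a helper name introduced here for illustration, and the transition itself (reward 1, terminal next state) is an assumed example.

```python
def q_update(q_old, reward, max_next_q, alpha, gamma=0.95):
    """One tabular Q-learning update for a single (state, action) entry."""
    td_target = reward + gamma * max_next_q
    return q_old + alpha * (td_target - q_old)

# Assumed transition: the agent steps onto the goal (reward 1, terminal next state)
for alpha in (0.1, 0.5, 0.8):
    q = 0.0
    history = []
    for _ in range(3):  # repeat the same successful transition three times
        q = q_update(q, reward=1.0, max_next_q=0.0, alpha=alpha)
        history.append(round(q, 3))
    print(f"alpha={alpha}: {history}")
```

With alpha = 0.8 the estimate is close to the target of 1 after three updates, while with alpha = 0.1 it is still at about 0.27. The same speed also means that a single unlucky transition pulls the estimate just as far in the other direction.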

**2. Discount Factor (gamma)**
- Purpose: The discount factor determines the importance of future rewards. A gamma close to 0 emphasizes immediate rewards, while a gamma close to 1 emphasizes long-term rewards.
- Current Value: 0.95
- Alternative Values: 0.6, 0.7, 0.9, 0.99
   * Low Values (0.6 - 0.7): Focuses more on short-term rewards. Because Frozen Lake pays its only reward at the goal, little of that reward propagates back to states far from it, which weakens long-term planning.
   * High Values (0.99): Puts more weight on long-term rewards, leading to deeper strategies, but the agent may take longer to converge on a solution.
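
Since Frozen Lake pays a single reward of 1 on reaching the goal, the value that gamma assigns to the start state depends mainly on how many steps away the goal is. The small sketch below compares several gamma values for a goal reached on the sixth step, which is the shortest possible path on the 4x4 map. `discounted_return` is a helper introduced here for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per elapsed step."""
    return sum(reward * gamma ** t for t, reward in enumerate(rewards))

# Five empty steps followed by the goal reward on the sixth step
episode_rewards = [0, 0, 0, 0, 0, 1]
for gamma in (0.6, 0.9, 0.95, 0.99):
    print(f"gamma={gamma}: {discounted_return(episode_rewards, gamma):.3f}")
```

With gamma = 0.6 the start state sees less than 8% of the goal reward, whereas gamma = 0.99 preserves about 95% of it.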

**3. Exploration Rate (epsilon)**
- Purpose: Epsilon controls the balance between exploration (random actions) and exploitation (choosing the best-known action). High epsilon encourages exploration, while low epsilon favors exploitation.
- Current Value: 1.0
- Alternative Values: 0.1, 0.5, 0.9 (initial value)
   * Low Values (0.1 - 0.5): The agent explores less and relies more on the learned policy. If exploration is too low initially, the agent may miss optimal strategies.
   * High Values (0.9 - 1.0): The agent explores more, which may lead to learning a more complete strategy but could slow down convergence.
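
The exploration-exploitation rule from the training loop can be isolated into a small function. This is a sketch using only the standard library; `epsilon_greedy`, its `rng` parameter, and the Q-values are assumptions made for illustration, not part of the notebook.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore
    # exploit: first index with the highest Q-value
    return max(range(len(q_row)), key=lambda a: q_row[a])

rng = random.Random(0)  # seeded so the run is repeatable
q_row = [0.1, 0.7, 0.2, 0.0]  # made-up Q-values for one state
for epsilon in (1.0, 0.5, 0.1):
    picks = [epsilon_greedy(q_row, epsilon, rng) for _ in range(1000)]
    share_best = picks.count(1) / len(picks)
    print(f"epsilon={epsilon}: best action chosen {share_best:.0%} of the time")
```

With four actions, the best action is expected to be chosen about `1 - epsilon + epsilon / 4` of the time, because a random draw also lands on it occasionally.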

**4. Max Epsilon**
- Purpose: Determines the initial maximum value of epsilon, dictating how much exploration is allowed at the start.
- Current Value: 1.0
- Alternative Values: 0.8, 0.9, 0.95
   * High Max Epsilon (0.9 - 1.0): The agent acts almost entirely randomly at the beginning, exploring the environment extensively.
   * Low Max Epsilon (0.8): The agent follows its current Q-table in roughly 20% of its first actions, so it explores slightly less at the start and begins exploiting earlier.


**5. Min Epsilon**
- Purpose: Sets the floor that epsilon decays toward, that is, how much random exploration remains late in training.
- Current Value: 0.01
- Alternative Values: 0.05, 0.1, 0.001
   * Higher Min Epsilon (0.05, 0.1): The agent retains some level of exploration throughout training, which can help it avoid getting stuck in a suboptimal policy.
   * Lower Min Epsilon (0.001): The agent relies almost entirely on the learned policy once epsilon approaches this level, focusing on exploitation rather than exploration.


**6. Decay Rate**
- Purpose: Controls how quickly epsilon decreases during training. It determines how fast the agent shifts from exploration to exploitation.
- Current Value: 0.005
- Alternative Values: 0.001, 0.01, 0.02
   * Low Decay Rate (0.001): Epsilon decreases slowly, allowing the agent to explore the environment for a longer period.
   * High Decay Rate (0.01 - 0.02): Epsilon decreases faster, meaning the agent shifts to exploiting the learned policy earlier in the training.
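
The decay schedule is easy to inspect before committing to a long training run. The sketch below reuses the exponential formula from the training loop; `epsilon_at` is a helper name introduced here, and the episode checkpoints are arbitrary choices.

```python
import math

def epsilon_at(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.005):
    """Exponential epsilon decay, using the same formula as the training loop."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

for decay_rate in (0.001, 0.005, 0.02):
    schedule = [
        round(epsilon_at(ep, decay_rate=decay_rate), 3)
        for ep in (0, 100, 500, 1000, 5000)
    ]
    print(f"decay_rate={decay_rate}: {schedule}")
```

With the current decay rate of 0.005, epsilon is already down to roughly 0.017 by episode 1,000, so most of the 10,000 training episodes run almost fully greedy. A decay rate of 0.001 still leaves epsilon at about 0.37 at the same point.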


**7. Total Episodes**
- Purpose: Defines the number of episodes the agent will train for. More episodes provide more learning opportunities.
- Current Value: 10,000
- Alternative Values: 1,000, 5,000, 20,000
   * Lower Value (1,000): Shorter training, which is faster but might not allow the agent to fully learn the environment.
   * Higher Value (20,000): Longer training, allowing for more thorough learning, but increasing the computational time.
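
Whether the chosen number of episodes was enough is easier to judge from per-block averages of the rewards than from the single overall average printed after training, which also counts the early, mostly random episodes. A minimal sketch follows; `block_means` and the block size are assumptions for illustration, and the reward list is made-up data.

```python
def block_means(values, block_size):
    """Mean of each consecutive, non-overlapping block of `block_size` values.

    A trailing partial block is ignored.
    """
    n_blocks = len(values) // block_size
    return [
        sum(values[i * block_size:(i + 1) * block_size]) / block_size
        for i in range(n_blocks)
    ]

# Made-up rewards: failures early in training, more successes later
example_rewards = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1]
print(block_means(example_rewards, block_size=4))  # prints [0.25, 0.5, 1.0]
```

Applied to the notebook's list, `block_means(rewards, 1000)` would give ten success rates. If the last few are still rising, more episodes may help; if they have flattened out, extra episodes are unlikely to change much.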


**8. Max Steps**
- Purpose: Defines the maximum number of steps the agent can take per episode. More steps allow for more exploration within an episode.
- Current Value: 100
- Alternative Values: 50, 200, 500
   * Lower Value (50): Faster episodes, but the agent may not have enough steps to fully explore the environment in a single episode.
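   * Higher Value (200, 500): Allows the agent to take more steps and potentially explore more strategies within a single episode, but increases training time. Note that `FrozenLake-v1` is registered with its own 100-step time limit, so values above 100 only take effect if that limit is raised as well (for example via the `max_episode_steps` argument of `gym.make`).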


**Which Values to Choose?**
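- gamma is often kept between 0.9 and 1 so that the goal reward still reaches states many steps away. alpha can be high (0.8 or more) in deterministic environments, but with `is_slippery=True` a lower value such as 0.1 is often more stable because it averages over noisy transitions.
- epsilon should start high for exploration and then decrease over time, but the decay rate should be fine-tuned depending on the problem.
- decay_rate should be low enough to allow exploration in the early stages, and min_epsilon should be small but non-zero so that the agent mostly exploits late in training while still exploring occasionally.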

Treat these values as starting points: the best settings depend on your specific application, so tune them experimentally and compare the resulting performance.