<img src="https://www.th-koeln.de/img/logo.svg" style="float:right;" width="200">

# 12th exercise: <font color="#C70039">First Reinforcement Learning Game (*Frozen Lake*) using OpenAI Gym</font>
* Course: AML
* Lecturer: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Author of notebook: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>. This notebook is based on the great post and notebook from [Rodolfo Mendes](https://morioh.com/p/18a96b9091d3).
* Editor of notebook: Lena Pickartz (11330741)
* Date:   15.01.2025

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*i53DAlKJx_91HgcSiFwyJQ.png" style="float: center;" width="600">

---------------------------------
**GENERAL NOTE 1**:
Please make sure you read the entire notebook, since it contains a lot of information on your tasks (e.g. regarding the setting of certain parameters or a specific computational trick). The markdown cells and the code comments also explain how things work together as a whole.

**GENERAL NOTE 2**:
* Please comment source code in English only.
* Please describe your observations in English, too.
* This applies to all exercises throughout this course.

---------------------------------

### <font color="ce33ff">DESCRIPTION</font>:

#### OpenAI Gym
In this exercise you will be using Python and OpenAI Gym to develop your reinforcement learning algorithm. The Gym library is a collection of environments that can be used freely with the reinforcement learning algorithms.

Gym has a ton of environments ranging from simple text based games to Atari games like Breakout and Space Invaders. The library is intuitive to use and simple to install. Just run **pip install gym** and you are good to go! The link to Gym's installation instructions, requirements, and documentation is included in the description.

Further reading about OpenAI Gym is available under https://www.gymlibrary.dev/.
This notebook is based on this great post and notebook from [Rodolfo Mendes](https://morioh.com/p/18a96b9091d3).

#### Frozen Lake
This description of the game is copied directly from Gym's website.

*Winter is coming. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water and die (Game over). At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The surface is described using a grid like the following:*

* SFFF
* FHFH
* FFFH
* HFFG

This grid is your environment! S is your (the agent's) starting point and it's safe. F represents the frozen surface and is also safe. H represents a hole and if your agent steps in a hole in the middle of a frozen lake, the game is over because the agent dies. Finally, G represents the goal, which is the space on the grid where the frisbee is located.

The agent can navigate *left, right, up, down* and the episode ends when the agent reaches the goal or falls in a hole. It receives a reward of **1** if it reaches the goal and **0** otherwise.

Here is the summary:
<img src="https://github.com/len-rtz/AML/blob/main/images/FrozenLake.States.Rewards.png?raw=1" style="float: center;" width="800">
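
To connect the letter grid with the state numbers used later in the Q-table, here is a minimal sketch. It assumes the row-by-row numbering (state 0 in the top-left corner, state 15 in the bottom-right corner) that Gym documents for Frozen Lake. The names `GRID` and `tile_at` are illustrative and not part of Gym.

```python
# The 4x4 map from the description, one string per row.
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(GRID[0])


def tile_at(state):
    """Return the map letter (S, F, H or G) for a state index."""
    row, col = divmod(state, N_COLS)
    return GRID[row][col]


n_states = len(GRID) * N_COLS
holes = [s for s in range(n_states) if tile_at(s) == "H"]
goal = next(s for s in range(n_states) if tile_at(s) == "G")

print("holes:", holes)  # holes: [5, 7, 11, 12]
print("goal:", goal)    # goal: 15
```

Under this numbering, the hole states and the goal state are terminal, so the agent never acts from them and their Q-table rows are never updated.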

---------------------------------

### <font color="FFC300">TASKS</font>:
The tasks that you need to work on within this notebook are always indicated below as bullet points.
If a task is more challenging and consists of several steps, this is indicated as well.
Make sure you have worked through the task list and documented what you did.
This should be done by using markdown.<br>
<font color=red>Make sure you don't forget to specify your name and your matriculation number in the notebook.</font>

**YOUR TASKS in this exercise are as follows**:
1. import the notebook to Google Colab or use your local machine.
2. make sure you specified your name and your matriculation number in the header below my name and date.
    * set the date too and remove mine.
3. read the entire notebook carefully
    * add comments wherever you feel it necessary for better understanding
    * run the notebook for the first time.
4. install gym into your env!
5. You will train an agent to play the *Frozen Lake* game using Q-learning and you will get a playback of how the agent does after being trained.
6. Again the task: Your agent has to navigate the grid by staying on the frozen surface without falling into any holes until it reaches the frisbee. If it reaches the frisbee, it wins with a reward of plus one. If it falls in a hole, it loses and receives no points for the entire episode.
7. Your tasks are highlighted in the notebook (see below)
---------------------------------

### Imports
Import all required libraries, including gym.

In [4]:
import numpy as np
import gym
import random
import time
from   IPython.display import clear_output

In [2]:
print(gym.__version__)

0.25.2


### Creating the Environment
For creating your environment, just call *gym.make()* and pass a string of the name of the environment you want to set up.
All the environments with their corresponding names you can use here are available on Gym's website (see above).
With this *env* object, you are able to query for information about the environment, sample states and actions, retrieve rewards and have your agent navigate the frozen lake. That is all made available to you conveniently with Gym.

In [6]:
env = gym.make("FrozenLake-v1")



### Creating the Q-Table
Now, construct your Q-table, and initialize all the Q-values to zero for each state-action pair.
The number of rows in the table is equivalent to the size of the state space in the environment, and the number of columns is equivalent to the size of the action space (see above). You can get this information using *env.observation_space.n* and *env.action_space.n* as shown below in the code. Then, you can use this information to build the Q-table and initialize it with zeros.
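
One consequence of the all-zero initialisation is worth seeing in isolation: as long as a row contains only zeros, picking the largest entry always returns the first action. The sketch below uses plain Python lists instead of NumPy; `greedy_action` is an illustrative helper that mirrors the tie-breaking of `np.argmax`, and the action order in the comment is the one Gym documents for Frozen Lake.

```python
N_ACTIONS = 4  # Frozen Lake action order: 0 = Left, 1 = Down, 2 = Right, 3 = Up


def greedy_action(q_row):
    """Index of the largest value; ties go to the lowest index, as with np.argmax."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


untrained_row = [0.0] * N_ACTIONS
print(greedy_action(untrained_row))         # 0, i.e. Left
print(greedy_action([0.0, 0.0, 0.3, 0.0]))  # 2, i.e. Right
```

An untrained agent that only exploits would therefore always choose Left, which is one reason why the exploration phase described below matters.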

In [7]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))

In [5]:
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


### Initializing Q-Learning hyperparameters
Now, we're going to create and initialize all the parameters needed to implement the Q-learning algorithm.

First, with *num_episodes*, you define the total number of episodes you want the agent to play during training. Then, with *max_steps_per_episode*, you define a maximum number of steps that your agent is allowed to take within a single episode. So, if by the 100th step, the agent has not reached the frisbee or fallen through a hole, then the episode will terminate with the agent receiving zero points.

Next, you will set your *learning_rate* and your *discount_rate*. The discount rate was represented with the symbol (lambda) in the course slides (keyword: discounted return G_t); Test 2 further below refers to the same quantity as gamma.

Now, the last four parameters are all related to the exploration-exploitation dilemma with respect to the epsilon-greedy policy. You are initializing your *exploration_rate* to **1** and setting the *max_exploration_rate* to **1** and a *min_exploration_rate* to **0.01**. The *max* and *min* are just bounds to how large or small your exploration rate can be. Remember, the exploration rate was represented with the symbol (epsilon) when discussed in the course slides.
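
The epsilon-greedy rule described above can be sketched on its own, without an environment. This is a minimal illustration, not the notebook's implementation: `epsilon_greedy` is a hypothetical helper, and the random generator is seeded only to make the sketch repeatable.

```python
import random


def epsilon_greedy(q_row, exploration_rate, rng):
    """Pick a random action with probability exploration_rate, else the greedy one."""
    if rng.random() > exploration_rate:
        return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit
    return rng.randrange(len(q_row))                            # explore


rng = random.Random(0)
q_row = [0.1, 0.5, 0.2, 0.0]

always_greedy = {epsilon_greedy(q_row, 0.0, rng) for _ in range(100)}
always_random = {epsilon_greedy(q_row, 1.0, rng) for _ in range(1000)}
print(always_greedy)  # only the best action, index 1
print(always_random)  # all four actions show up
```

With an exploration rate of 1 every action is random; with a rate of 0 the agent relies entirely on the Q-table.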

Lastly, you will set the *exploration_decay_rate* to **0.01** to determine the rate at which the *exploration_rate* will decay.
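
To get a feeling for how quickly exploration fades with these values, the sketch below evaluates the exponential decay schedule that the training loop later in this notebook applies after every episode. The helper name `exploration_rate_at` is illustrative.

```python
import math


def exploration_rate_at(episode, min_rate=0.01, max_rate=1.0, decay_rate=0.01):
    """Exploration rate produced by the decay schedule for a given episode index."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)


for episode in (0, 100, 500, 1000):
    print(episode, round(exploration_rate_at(episode), 4))

# First episode index at which fewer than 2% of the actions are random.
first_below = next(e for e in range(100_000) if exploration_rate_at(e) < 0.02)
print(first_below)  # 460
```

With a decay rate of 0.01, the exploration rate is close to its minimum after a few hundred episodes, so the vast majority of a long training run is played almost greedily. This is a useful fact to keep in mind when building the test plan.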

**YOUR <font color="FFC300">TASK</font> in this exercise is as follows** (point 7 from the task list above):

All of the above parameters can change!
Your task is to create a *testplan* and tune all parameters by yourself and observe how they influence and change the performance of the algorithm.
Make notes! They will help you during the exam.

In [6]:
num_episodes = 100000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.01

Create a list to hold all of the rewards you will get from each episode.
By means of this you can observe how your game score changes over time.

In [7]:
rewards_all_episodes = []

In the following code section, the entire Q-learning algorithm is implemented as discussed in detail in the AML course.
When this code is executed, this is exactly where the training will take place.
* The first for-loop contains everything that happens within a single episode.
* The second nested loop contains everything that happens for a single time-step.

Read all the red comments, as they contain lots of important information on the implementation.
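
Before reading the full loop, it can help to look at the update rule on a single transition. The following sketch applies the same weighted-average update that the loop uses, on made-up numbers; `q_update` is an illustrative helper and not a Gym function.

```python
import math


def q_update(q_sa, reward, max_q_next, learning_rate=0.1, discount_rate=0.99):
    """One Q-learning update for a single (state, action) pair."""
    target = reward + discount_rate * max_q_next
    return q_sa * (1 - learning_rate) + learning_rate * target


# Stepping onto the goal from an untrained table: reward 1, nothing learned yet.
print(q_update(q_sa=0.0, reward=1.0, max_q_next=0.0))

# A step that earns no reward but leads to a state already valued at 0.5.
print(q_update(q_sa=0.0, reward=0.0, max_q_next=0.5))

# No reward and an all-zero next state: the value stays exactly zero.
print(q_update(q_sa=0.0, reward=0.0, max_q_next=0.0))

# The weighted-average form equals the "old value plus scaled error" form.
q, r, nxt, alpha, gamma = 0.3, 0.0, 0.8, 0.1, 0.99
assert math.isclose(q_update(q, r, nxt, alpha, gamma),
                    q + alpha * (r + gamma * nxt - q))
```

The third case shows why the Q-table cannot change before the goal has been reached at least once: with a reward of zero and an all-zero table, every update returns zero.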

In [8]:
# Q-learning algorithm

# loop: for a single episode
for episode in range(num_episodes):
    # initialize 'new episode' parameters
    state = env.reset()
    ''' The done variable just keeps track of whether or not your episode is finished.
    Initialize it to False when first starting the episode and you will see later where
    it will get updated to notify you when the episode is over.'''
    done = False

    ''' Keep track of the rewards within the current episode as well.
    Hence, set rewards_current_episode = 0 since you start
    with no rewards at the beginning of each episode.'''
    rewards_current_episode = 0

    # nested loop: for a single time-step
    for step in range(max_steps_per_episode):
        # Exploration-exploitation trade-off
        '''For each time-step within an episode set your exploration_rate_threshold
        to a random number between 0 and 1. This will be used to determine whether
        your agent will explore or exploit the environment in this time-step.'''
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state,:])
        else:
            action = env.action_space.sample()

        # Take new action
        '''After action is chosen, take that action by calling step() on your env object and
        pass your action to it. The function step() returns a tuple containing the new state,
        the reward for the action you took, whether or not the action ended the episode and
        diagnostic information regarding the environment (helpful for debugging).'''
        new_state, reward, done, _ = env.step(action)

        # Update Q-table for Q(s,a)
        '''Compare this implementation with the equation in the course slides.'''
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        '''Set your current state to the new_state that was returned when taking the last action
        and then update the rewards from your current episode by adding the reward you received
        for your previous action.'''
        # Set new state
        state = new_state
        # Add new reward
        rewards_current_episode += reward
        '''Then, check to see if your last action ended the episode
        (game over by agent stepping in a hole or reaching the goal)!
        If the action did end the episode, then jump out of this loop and start the next episode.
        Otherwise, transition to the next time-step.'''
        if done:
            break

    # Exploration rate decay
    '''Once an episode is finished, you need to update your exploration_rate using exponential decay,
    which means that the gap between the exploration rate and its minimum shrinks at a rate
    proportional to the current size of that gap. You can decay the exploration_rate using the
    formula below, which makes use of all the exploration rate parameters that were defined
    in the hyperparameter section.'''
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)

    # Add current episode reward to total rewards list and move on to the next episode
    rewards_all_episodes.append(rewards_current_episode)




### All episodes training completed
After all episodes are finished you now just calculate the average reward per thousand episodes from your list that contains the rewards for all episodes so that you can print it out and see how the rewards changed over time.
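
The bookkeeping behind this average can be sketched with plain Python lists. The rewards below are made up for illustration, and `average_per_block` is a hypothetical helper; the notebook itself uses `np.split` for the same purpose.

```python
def average_per_block(rewards, block_size):
    """Mean reward of each consecutive block of `block_size` episodes."""
    return [sum(rewards[i:i + block_size]) / block_size
            for i in range(0, len(rewards), block_size)]


# Made-up 0/1 episode rewards, only to illustrate the bookkeeping.
toy_rewards = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0]
print(average_per_block(toy_rewards, block_size=4))  # [0.25, 0.75, 0.75]
```

Because every episode ends with a total reward of either 0 or 1, each block average can be read directly as the fraction of episodes won in that block.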

In [9]:
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000

print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

********Average reward per thousand episodes********

1000 :  0.21200000000000016
2000 :  0.2930000000000002
3000 :  0.4880000000000004
4000 :  0.5710000000000004
5000 :  0.6500000000000005
6000 :  0.6530000000000005
7000 :  0.6550000000000005
8000 :  0.6550000000000005
9000 :  0.6550000000000005
10000 :  0.6850000000000005
11000 :  0.6720000000000005
12000 :  0.6990000000000005
13000 :  0.6900000000000005
14000 :  0.6870000000000005
15000 :  0.6430000000000005
16000 :  0.6720000000000005
17000 :  0.6800000000000005
18000 :  0.6910000000000005
19000 :  0.6720000000000005
20000 :  0.6600000000000005
21000 :  0.6830000000000005
22000 :  0.6950000000000005
23000 :  0.6670000000000005
24000 :  0.6840000000000005
25000 :  0.6880000000000005
26000 :  0.6680000000000005
27000 :  0.6930000000000005
28000 :  0.6730000000000005
29000 :  0.6870000000000005
30000 :  0.7090000000000005
31000 :  0.6950000000000005
32000 :  0.6660000000000005
33000 :  0.6820000000000005
34000 :  0.6820000000000005
35

### Interpretation

From this print, you can see that the average reward per thousand episodes did indeed progress over time. When the algorithm first started training, the first thousand episodes only averaged a reward of about **0.21**, but by the time it got to its last thousand episodes, the reward drastically improved to almost **0.7**.

Let's take a second to understand how you can interpret these results. Your agent played **100000** episodes. At each time step within an episode, the agent received a reward of **1** if it reached the frisbee; otherwise, it received a reward of **0**. If the agent did indeed reach the frisbee, then the episode finished at that time-step.

Hence, for each episode, the total reward received by the agent for the entire episode is either **1** or **0**. So, for the first thousand episodes, you can interpret this score as meaning that **21%** of the time the agent received a reward of **1** and won the episode. And by the last thousand episodes from a total of **100000**, the agent was winning almost **70%** of the time.

By analyzing the grid of the game, you can see it is a lot more likely that the agent would fall in a hole or perhaps reach the max time steps than it is to reach the frisbee, so reaching the frisbee **70%** of the time is not too bad, especially since the agent had no explicit instructions to reach the frisbee. It learned that this is the correct thing to do.

* SFFF
* FHFH
* FFFH
* HFFG

At last, print out your updated Q-table to see how that has transitioned from its initial state of all zeros.

In [10]:
# Print updated Q-table
print("\n\n********Q-table********\n")
print(q_table)



********Q-table********

[[0.458933   0.45535738 0.45874832 0.45512155]
 [0.33658501 0.37895031 0.33533565 0.45337265]
 [0.38466927 0.39752483 0.40639702 0.43479591]
 [0.30656678 0.35801615 0.35519112 0.4137194 ]
 [0.47098395 0.36423554 0.38869777 0.4480022 ]
 [0.         0.         0.         0.        ]
 [0.28232273 0.1491292  0.15149752 0.06587194]
 [0.         0.         0.         0.        ]
 [0.44488038 0.31279964 0.40754072 0.50140978]
 [0.21269504 0.53976557 0.46743161 0.4487372 ]
 [0.52930266 0.40237264 0.3169018  0.30334657]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.44085231 0.56702    0.68667564 0.52244827]
 [0.68397718 0.82876438 0.719174   0.72276508]
 [0.         0.         0.         0.        ]]
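
A Q-table is easier to read once it is turned into a policy. The sketch below copies the values printed above (rounded to four decimals), picks the best action per state and draws the result on the map. It assumes the action order Gym documents for Frozen Lake (0 = Left, 1 = Down, 2 = Right, 3 = Up); the arrow symbols and variable names are illustrative.

```python
ARROWS = "<v>^"  # index 0 = Left, 1 = Down, 2 = Right, 3 = Up
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]

# Q-table as printed above, rounded to four decimals.
Q = [
    [0.4589, 0.4554, 0.4587, 0.4551],
    [0.3366, 0.3790, 0.3353, 0.4534],
    [0.3847, 0.3975, 0.4064, 0.4348],
    [0.3066, 0.3580, 0.3552, 0.4137],
    [0.4710, 0.3642, 0.3887, 0.4480],
    [0.0, 0.0, 0.0, 0.0],
    [0.2823, 0.1491, 0.1515, 0.0659],
    [0.0, 0.0, 0.0, 0.0],
    [0.4449, 0.3128, 0.4075, 0.5014],
    [0.2127, 0.5398, 0.4674, 0.4487],
    [0.5293, 0.4024, 0.3169, 0.3033],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.4409, 0.5670, 0.6867, 0.5224],
    [0.6840, 0.8288, 0.7192, 0.7228],
    [0.0, 0.0, 0.0, 0.0],
]

rows = []
for r, line in enumerate(GRID):
    cells = []
    for c, tile in enumerate(line):
        state = r * len(line) + c
        if tile in "HG":
            cells.append(tile)  # terminal tiles: no action is ever taken here
        else:
            best = max(range(4), key=lambda a: Q[state][a])
            cells.append(ARROWS[best])
    rows.append("".join(cells))

print("\n".join(rows))
# <^^^
# <H<H
# ^v<H
# H>vG
```

Some choices, such as moving Left in the start state, look odd at first. A plausible reading is that on slippery ice the agent prefers actions whose possible slips keep it away from holes, even if the intended direction does not point at the goal.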


# Test Plan to tune parameters

## **Test 1: Learning Rate (alpha)**

- Determine the learning rate that best balances learning speed and stability

Procedure:
- Vary alpha within a reasonable range (0.01, 0.05, 0.1, 0.2, 0.5).
- For each alpha value, train the agent for a fixed number of episodes (100000)
- Record average episode reward and plot learning curves
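
Since the same training loop is needed for every test in this plan, one option is to wrap it in a function that resets all state at the start of each run. The following is a sketch under stated assumptions, not the notebook's implementation: `CorridorEnv` is a hypothetical toy environment (a row of cells with the goal at the right end) that only imitates the `reset()`/`step()` interface used in this notebook, so the example runs without Gym.

```python
import math
import random


class CorridorEnv:
    """Hypothetical stand-in for a Gym environment: cells 0..n-1, goal at the right end."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.n_actions = 2  # 0 = left, 1 = right
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done, {}


def train(env, learning_rate, num_episodes=300, max_steps=100, discount_rate=0.99,
          min_rate=0.01, max_rate=1.0, decay_rate=0.01, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration; returns (q_table, rewards)."""
    rng = random.Random(seed)
    q_table = [[0.0] * env.n_actions for _ in range(env.n_states)]
    rewards = []
    exploration_rate = max_rate  # every run starts with full exploration
    for episode in range(num_episodes):
        state, total = env.reset(), 0.0
        for _ in range(max_steps):
            if rng.random() > exploration_rate:
                action = max(range(env.n_actions), key=lambda a: q_table[state][a])
            else:
                action = rng.randrange(env.n_actions)
            new_state, reward, done, _ = env.step(action)
            target = reward + discount_rate * max(q_table[new_state])
            q_table[state][action] += learning_rate * (target - q_table[state][action])
            state, total = new_state, total + reward
            if done:
                break
        exploration_rate = min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)
        rewards.append(total)
    return q_table, rewards


results = {lr: train(CorridorEnv(), lr) for lr in (0.05, 0.1, 0.5)}
for lr, (q_table, rewards) in results.items():
    print(lr, sum(rewards[-100:]) / 100)
```

Keeping the Q-table, the reward log and the exploration rate local to `train` means that no run can inherit values from the previous one, which makes the comparison between parameter values fairer.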

In [11]:
# Hyperparameters (for Test 1: Varying Learning Rate)
num_episodes = 100000
max_steps_per_episode = 100
discount_rate = 0.99

# Learning rates to test
learning_rates = [0.01, 0.05, 0.1, 0.2, 0.5]

# Exploration parameters
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.01

# Results storage
all_results = []

In [12]:
# Run Q-learning for each learning rate
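for learning_rate in learning_rates:
    # Start every run from scratch: fresh Q-table, empty reward log and full exploration.
    # Without this reset, the exploration_rate left over from the previous run
    # (close to min_exploration_rate) would be reused and the agent would barely explore.
    q_table = np.zeros((env.observation_space.n, env.action_space.n))
    rewards_all_episodes = []
    exploration_rate = max_exploration_rate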

    for episode in range(num_episodes):
        state = env.reset()
        done = False
        rewards_current_episode = 0

        for step in range(max_steps_per_episode):
            exploration_rate_threshold = random.uniform(0, 1)
            if exploration_rate_threshold > exploration_rate:
                action = np.argmax(q_table[state, :])
            else:
                action = env.action_space.sample()

            new_state, reward, done, _ = env.step(action)

            q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
                                    learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

            state = new_state
            rewards_current_episode += reward

            if done:
                break

        exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)

        rewards_all_episodes.append(rewards_current_episode)

    all_results.append(rewards_all_episodes)

In [13]:
# Analyze and print results
for i, rewards in enumerate(all_results):
    learning_rate = learning_rates[i]  # Get the corresponding learning rate
    rewards_per_thousand_episodes = np.split(np.array(rewards), num_episodes // 1000)
    count = 1000

    print(f"********Average reward per thousand episodes (Learning Rate: {learning_rate})********\n")
    for r in rewards_per_thousand_episodes:
        print(count, ": ", str(sum(r) / 1000))
        count += 1000

********Average reward per thousand episodes (Learning Rate: 0.01)********

1000 :  0.098
2000 :  0.124
3000 :  0.124
4000 :  0.111
5000 :  0.116
6000 :  0.139
7000 :  0.107
8000 :  0.119
9000 :  0.124
10000 :  0.129
11000 :  0.12
12000 :  0.119
13000 :  0.111
14000 :  0.092
15000 :  0.127
16000 :  0.122
17000 :  0.126
18000 :  0.108
19000 :  0.138
20000 :  0.11
21000 :  0.114
22000 :  0.133
23000 :  0.121
24000 :  0.112
25000 :  0.125
26000 :  0.129
27000 :  0.099
28000 :  0.141
29000 :  0.144
30000 :  0.134
31000 :  0.132
32000 :  0.133
33000 :  0.205
34000 :  0.368
35000 :  0.371
36000 :  0.327
37000 :  0.363
38000 :  0.352
39000 :  0.351
40000 :  0.343
41000 :  0.376
42000 :  0.358
43000 :  0.377
44000 :  0.339
45000 :  0.355
46000 :  0.35
47000 :  0.371
48000 :  0.37
49000 :  0.329
50000 :  0.355
51000 :  0.372
52000 :  0.379
53000 :  0.706
54000 :  0.666
55000 :  0.658
56000 :  0.669
57000 :  0.676
58000 :  0.65
59000 :  0.655
60000 :  0.659
61000 :  0.674
62000 :  0.676
63000 : 

In [14]:
# Print the Q-table of the last run only (learning rate 0.5); the earlier tables were overwritten
print("\n\n********Q-table********\n")
print(q_table)



********Q-table********

[[0.69178301 0.43716964 0.49212008 0.56707515]
 [0.23342029 0.33075685 0.18684933 0.6283616 ]
 [0.29106105 0.17007559 0.34481612 0.57233203]
 [0.16033515 0.33401277 0.31788641 0.52135502]
 [0.68713773 0.20893763 0.26729855 0.10069628]
 [0.         0.         0.         0.        ]
 [0.46549346 0.00166774 0.00194891 0.00130611]
 [0.         0.         0.         0.        ]
 [0.24068855 0.22201006 0.03246411 0.64152206]
 [0.18926711 0.75261436 0.14433856 0.20241495]
 [0.82134249 0.02313188 0.51681709 0.34734269]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.14355502 0.29200733 0.73079829 0.4087116 ]
 [0.61445896 0.90239493 0.75223055 0.5341793 ]
 [0.         0.         0.         0.        ]]


### **Observations**

**Learning Rate 0.01:**
Improves in steps rather than gradually: the average reward stays around 0.09-0.14 up to episode 32,000, rises to roughly 0.33-0.38 from episode 34,000 on, and jumps to about 0.65-0.71 from episode 53,000 on.
Final performance is a 67% success rate.
Once the last plateau is reached, the learning process is stable.

**Learning Rate 0.05:**
Complete failure to learn, with zero reward in all episodes. Since 0.1 and 0.2 learn successfully, a too-high learning rate is unlikely to be the cause. A more plausible explanation is that this run never reached the goal while it was still exploring, so the Q-table stayed at zero.

**Learning Rate 0.1:**
A long phase of stagnation, followed by a sudden jump at 24,000 episodes to a reward of 51%. It then stabilizes at ~0.65-0.70 reward.

**Learning Rate 0.2:**
Fast initial learning (0.39 to 0.64 in the first 2,000 episodes); stabilizes at ~0.65-0.69 reward throughout.

**Learning Rate 0.5:**
Quick initial learning (~0.45 in the first 1,000 episodes). The reward fluctuates between 0.55 and 0.65 and stabilizes at a 54% success rate.

**Observation**: Higher learning rates can cause instability in the learning process.

## **Test 2: Discount Factor (gamma)**

- Determine the impact of future rewards on current decisions
- Vary gamma within a range (e.g., 0.9, 0.95, 0.99, 1.0)
- Train the agent for each gamma value and record performance metrics
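
To see what the discount factor does to the only reward in this game, the sketch below computes the discounted return of a single winning episode for each gamma in the list. It assumes the shortest possible route on the 4x4 map (six moves, with the reward on the last one); `discounted_return` is an illustrative helper.

```python
def discounted_return(rewards, discount_rate):
    """G_0 = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode."""
    return sum(discount_rate ** k * r for k, r in enumerate(rewards))


# Shortest possible route on the 4x4 map: six moves, reward 1 on the last one.
shortest_win = [0, 0, 0, 0, 0, 1]

for gamma in (0.9, 0.95, 0.99, 1.0):
    print(gamma, round(discounted_return(shortest_win, gamma), 3))
```

On slippery ice, real episodes usually take far more than six moves, so the gap between the gamma values grows: the lower the discount factor, the less a distant win is worth at the start of the episode.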

In [8]:
# Test parameters
num_episodes = 100000
max_steps_per_episode = 100
learning_rate = 0.1

# List of discount factors to test
discount_factors = [0.9, 0.95, 0.99, 1.0]

# Exploration parameters
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.01

# Results Storage
results_dict = {}  # Use dictionary instead of list for reliable tracking

In [18]:
for discount_rate in discount_factors:

    # Reset parameters
    exploration_rate = 1.0
    q_table = np.zeros((env.observation_space.n, env.action_space.n))
    rewards_all_episodes = []

    # Training loop
    for episode in range(num_episodes):
        if episode % 10000 == 0:
            print(f"Episode {episode}/{num_episodes}")

        state = env.reset()
        done = False
        rewards_current_episode = 0

        for step in range(max_steps_per_episode):
            # Exploration-exploitation trade-off
            exploration_rate_threshold = random.uniform(0, 1)
            if exploration_rate_threshold > exploration_rate:
                action = np.argmax(q_table[state,:])
            else:
                action = env.action_space.sample()

            new_state, reward, done, _ = env.step(action)

            # Update Q-table
            q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
                learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

            state = new_state
            rewards_current_episode += reward

            if done:
                break

        # Update exploration rate
        exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

        rewards_all_episodes.append(rewards_current_episode)

    # Store results in dictionary with discount rate as key
    results_dict[discount_rate] = rewards_all_episodes

Episode 0/100000
Episode 10000/100000
Episode 20000/100000
Episode 30000/100000
Episode 40000/100000
Episode 50000/100000
Episode 60000/100000
Episode 70000/100000
Episode 80000/100000
Episode 90000/100000
Episode 0/100000
Episode 10000/100000
Episode 20000/100000
Episode 30000/100000
Episode 40000/100000
Episode 50000/100000
Episode 60000/100000
Episode 70000/100000
Episode 80000/100000
Episode 90000/100000
Episode 0/100000
Episode 10000/100000
Episode 20000/100000
Episode 30000/100000
Episode 40000/100000
Episode 50000/100000
Episode 60000/100000
Episode 70000/100000
Episode 80000/100000
Episode 90000/100000
Episode 0/100000
Episode 10000/100000
Episode 20000/100000
Episode 30000/100000
Episode 40000/100000
Episode 50000/100000
Episode 60000/100000
Episode 70000/100000
Episode 80000/100000
Episode 90000/100000


In [12]:
# Analysis of results
for discount_rate in discount_factors:
    rewards = results_dict[discount_rate]
    rewards_per_thousand_episodes = np.split(np.array(rewards), num_episodes // 1000)
    count = 1000

    print(f"\n********Average reward per thousand episodes (Discount Factor: {discount_rate})********\n")
    for r in rewards_per_thousand_episodes:
        print(f"{count} : {sum(r/1000):.3f}")
        count += 1000


********Average reward per thousand episodes (Discount Factor: 0.9)********

1000 : 0.000
2000 : 0.000
3000 : 0.000
4000 : 0.000
5000 : 0.000
6000 : 0.000
7000 : 0.000
8000 : 0.000
9000 : 0.000
10000 : 0.000
11000 : 0.000
12000 : 0.000
13000 : 0.000
14000 : 0.000
15000 : 0.000
16000 : 0.000
17000 : 0.000
18000 : 0.000
19000 : 0.000
20000 : 0.000
21000 : 0.000
22000 : 0.000
23000 : 0.000
24000 : 0.000
25000 : 0.000
26000 : 0.000
27000 : 0.000
28000 : 0.000
29000 : 0.000
30000 : 0.000
31000 : 0.000
32000 : 0.000
33000 : 0.000
34000 : 0.000
35000 : 0.000
36000 : 0.000
37000 : 0.000
38000 : 0.000
39000 : 0.000
40000 : 0.000
41000 : 0.000
42000 : 0.000
43000 : 0.000
44000 : 0.000
45000 : 0.000
46000 : 0.000
47000 : 0.000
48000 : 0.000
49000 : 0.000
50000 : 0.000
51000 : 0.000
52000 : 0.000
53000 : 0.000
54000 : 0.000
55000 : 0.000
56000 : 0.000
57000 : 0.000
58000 : 0.000
59000 : 0.000
60000 : 0.000
61000 : 0.000
62000 : 0.000
63000 : 0.000
64000 : 0.000
65000 : 0.000
66000 : 0.000
67000 :

### **Observations**

**γ = 0.90:** Failed to learn (zero reward throughout). One possible reading is that too little emphasis on future rewards makes the agent short-sighted. Note, however, that with a 0/1 reward the Q-table stays at zero until the goal is reached for the first time, whatever the value of γ, so this run may simply never have reached the goal while it was exploring.


**γ = 0.95:** Best initial learning (0.481 in the first 1,000 episodes), with consistent performance around 0.60-0.65 reward; stable learning with moderate consideration of future rewards.


**γ = 0.99:** Slow initial learning (near zero until episode 12,000), but strong performance once learning begins (around episode 14,000): the highest stable performance (0.65-0.70 range) and the most consistent late-game performance.

**γ = 1.0:** Failed to learn (zero reward throughout). Complete emphasis on future rewards may have made learning unstable, although the same caveat as for γ = 0.90 applies.

**Observation:** The mid-range discount factors (0.95 and 0.99) performed best; the extreme values (0.90 and 1.0) failed to learn effectively.

## **Test 3: Exploration Parameters**
- Test different combinations of exploration decay rates and min/max exploration rates
- Variables to test:
  - exploration_decay_rate: [0.001, 0.005, 0.01, 0.05, 0.1]
  - min_exploration_rate: [0.001, 0.01, 0.05]
  - max_exploration_rate: [0.8, 0.9, 1.0]
- Find optimal balance between exploration and exploitation phases
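
Before starting a grid search like this, it is worth estimating its size. The sketch below counts the combinations of the three lists above with `itertools.product` and multiplies by the number of episodes per run; the variable names are illustrative.

```python
from itertools import product

decay_rates = [0.001, 0.005, 0.01, 0.05, 0.1]
min_rates = [0.001, 0.01, 0.05]
max_rates = [0.8, 0.9, 1.0]

grid = list(product(decay_rates, min_rates, max_rates))
episodes_per_run = 100_000

print(len(grid))                     # 45 combinations
print(len(grid) * episodes_per_run)  # 4500000 training episodes in total
print(grid[0], grid[1])              # the last list varies fastest
```

If the total is too expensive, reducing the number of episodes per run or testing one parameter at a time are two possible ways to shrink the plan.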

In [20]:
from itertools import product

In [21]:
# Fixed hyperparameters (using best from previous tests)
num_episodes = 100000
max_steps_per_episode = 100
learning_rate = 0.1
discount_rate = 0.99  # Best from previous test

# Exploration parameters to test
exploration_decay_rates = [0.001, 0.005, 0.01, 0.05, 0.1]
min_exploration_rates = [0.001, 0.01, 0.05]
max_exploration_rates = [0.8, 0.9, 1.0]

# Generate all combinations
param_combinations = list(product(exploration_decay_rates, min_exploration_rates, max_exploration_rates))
results_dict = {}

In [22]:
for decay_rate, min_rate, max_rate in param_combinations:
    combo_key = f"decay={decay_rate}_min={min_rate}_max={max_rate}"
    print(f"\nStarting training for {combo_key}")

    # Reset parameters
    exploration_rate = max_rate
    q_table = np.zeros((env.observation_space.n, env.action_space.n))
    rewards_all_episodes = []

    # Training loop
    for episode in range(num_episodes):
        if episode % 20000 == 0:
            print(f"Episode {episode}/{num_episodes}")

        state = env.reset()
        done = False
        rewards_current_episode = 0

        for step in range(max_steps_per_episode):
            # Exploration-exploitation trade-off
            exploration_rate_threshold = random.uniform(0, 1)
            if exploration_rate_threshold > exploration_rate:
                action = np.argmax(q_table[state,:])
            else:
                action = env.action_space.sample()

            new_state, reward, done, _ = env.step(action)

            # Update Q-table
            q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
                learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

            state = new_state
            rewards_current_episode += reward

            if done:
                break

        # Update exploration rate using current parameters
        exploration_rate = min_rate + (max_rate - min_rate) * np.exp(-decay_rate * episode)

        rewards_all_episodes.append(rewards_current_episode)

    # Calculate key metrics
    final_1k_avg = np.mean(rewards_all_episodes[-1000:])
    # Iterate over every block of 1000 episodes, including the last one
    best_1k_avg = max([np.mean(rewards_all_episodes[i:i+1000])
                      for i in range(0, len(rewards_all_episodes), 1000)])

    # Store results with metrics
    results_dict[combo_key] = {
        'rewards': rewards_all_episodes,
        'final_q_table': q_table.copy(),
        'final_1k_avg': final_1k_avg,
        'best_1k_avg': best_1k_avg,
        'params': {
            'decay_rate': decay_rate,
            'min_rate': min_rate,
            'max_rate': max_rate
        }
    }

    print(f"Completed {combo_key}")
    print(f"Final 1k episodes average: {final_1k_avg:.3f}")
    print(f"Best 1k episodes average: {best_1k_avg:.3f}")
    print(f"Final exploration rate: {exploration_rate:.4f}")
    print(f"Max Q-value: {np.max(q_table):.4f}")

# Sort and display results
print("\n********Final Results Summary********")
sorted_results = sorted(results_dict.items(),
                       key=lambda x: x[1]['final_1k_avg'],
                       reverse=True)

print("\nTop 10 Parameter Combinations:")
for i, (combo, results) in enumerate(sorted_results[:10]):
    print(f"\n{i+1}. {combo}")
    print(f"Final 1k avg: {results['final_1k_avg']:.3f}")
    print(f"Best 1k avg: {results['best_1k_avg']:.3f}")
    print(f"Decay Rate: {results['params']['decay_rate']}")
    print(f"Min Rate: {results['params']['min_rate']}")
    print(f"Max Rate: {results['params']['max_rate']}")


Starting training for decay=0.001_min=0.001_max=0.8
Episode 0/100000
Episode 20000/100000
Episode 40000/100000
Episode 60000/100000
Episode 80000/100000
Completed decay=0.001_min=0.001_max=0.8
Final 1k episodes average: 0.730
Best 1k episodes average: 0.777
Final exploration rate: 0.0010
Max Q-value: 0.8803

Starting training for decay=0.001_min=0.001_max=0.9
Episode 0/100000
Episode 20000/100000
Episode 40000/100000
Episode 60000/100000
Episode 80000/100000
Completed decay=0.001_min=0.001_max=0.9
Final 1k episodes average: 0.723
Best 1k episodes average: 0.760
Final exploration rate: 0.0010
Max Q-value: 0.7967

Starting training for decay=0.001_min=0.001_max=1.0
Episode 0/100000
Episode 20000/100000
Episode 40000/100000
Episode 60000/100000
Episode 80000/100000
Completed decay=0.001_min=0.001_max=1.0
Final 1k episodes average: 0.730
Best 1k episodes average: 0.763
Final exploration rate: 0.0010
Max Q-value: 0.8678

Starting training for decay=0.001_min=0.01_max=0.8
Episode 0/100000
E

### **Observations**

- Slow decay is crucial for learning - allows thorough environment exploration
- Very low minimum exploration (0.001) helps fine-tune policies
- Starting exploration rate matters less than how it's reduced
- Stability correlates with slower decay rates
- Performance gap between best and worst in top 10 is relatively small
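
The difference between "slow" and "fast" decay can be made concrete with a half-life. Under the exponential schedule used in the training loops above, the gap between the exploration rate and its minimum halves every ln(2) / decay_rate episodes. The sketch below evaluates this for the tested decay rates; `exploration_half_life` is an illustrative helper.

```python
import math


def exploration_half_life(decay_rate):
    """Episodes needed for (exploration_rate - min_rate) to halve."""
    return math.log(2) / decay_rate


for decay_rate in (0.001, 0.005, 0.01, 0.05, 0.1):
    print(decay_rate, round(exploration_half_life(decay_rate), 1))
```

With a decay rate of 0.001 the agent keeps exploring substantially for thousands of episodes, whereas with 0.1 exploration has almost vanished after a few dozen episodes.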