In [1]:
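import sys
from pathlib import Path

# Make the repository root (two levels up) importable on any operating system
sys.path.append(str(Path('..') / '..'))
from academia.tools import visualizations
from academia.curriculum import LearningStats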

pygame 2.5.2 (SDL 2.28.3, Python 3.10.11)
Hello from the pygame community. https://www.pygame.org/contribute.html


# Aim of experiment presented below
This experiment investigates how the amount of training an agent receives on an easier task affects its evaluation on a more difficult task. The agent was first trained on the easier task for x episodes, where x varied between runs. It then moved on, as the next stage of the curriculum, to the more difficult task, where it was always trained for a constant 1500 episodes. After that, the agent was evaluated 100 times and the results were averaged. This setup shows how the final evaluation depends on x (the number of episodes spent in the easier task) when the training budget on the more difficult task is held fixed.

The tested training times for the agent in the easier task, measured in episodes, were:

- 500
- 750
- 1000
- 1250
- 1500

*The experiment for each unique value of the number of episodes spent in the easier task (500, 750, 1000, 1250, 1500) was repeated 10 times to average the results.*
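
The two-level averaging described above (several evaluation runs per trained agent, then several repetitions per setting) can be sketched in plain Python. This is only an illustration: the function name `average_over_repetitions` and the scores in the usage example are hypothetical placeholders, not values from this experiment and not part of the `academia` API.

```python
from statistics import mean


def average_over_repetitions(evaluations_per_repetition):
    """Average scores within each repetition, then across repetitions.

    `evaluations_per_repetition` holds one list per repetition; each inner
    list contains the evaluation scores obtained by that trained agent.
    """
    per_repetition_means = [mean(scores) for scores in evaluations_per_repetition]
    return mean(per_repetition_means)


# Hypothetical placeholder scores: 2 repetitions with 4 evaluations each
# (the experiment itself used 10 repetitions with 100 evaluations each).
placeholder_scores = [
    [1.0, 0.5, 0.5, 1.0],  # repetition 1
    [0.5, 0.5, 0.0, 1.0],  # repetition 2
]
overall = average_over_repetitions(placeholder_scores)
print(overall)
```

Because every repetition contributes the same number of evaluations, averaging the per-repetition means gives the same result as averaging all scores at once; keeping the two levels separate makes it easy to also report the spread between repetitions.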

## Loading data for checking evaluation impact

In [7]:
# stats_eval[k][i] holds the stats of the more difficult task (task 2)
# for the k-th tested episode count and the i-th repetition
stats_eval = []
for episodes in [500, 750, 1000, 1250, 1500]:
    stats_episode_num = []
    for i in range(10):
        final_task_run = LearningStats.load(f'outputs/eval/episodes_{episodes}/curriculum_iter_{i+1}/2.stats.json')
        stats_episode_num.append(final_task_run)
    stats_eval.append(stats_episode_num)

## Impact of learning duration in the easier task on the evaluation in the more difficult task

In [9]:
visualizations.plot_evaluation_impact(task_runs_y=stats_eval, n_episodes_x=[500, 750, 1000, 1250, 1500], show=True)

## Results Analysis

The chart shows that, up to a certain point, longer training on the easier task improves the agent's evaluation on the more difficult task. A plausible explanation is that the agent arrives at the more difficult task with a more complete policy and a better understanding of the environment. The peak in our experiment occurs at `1000` episodes spent in the easier task, where the final evaluation of the agent is approximately 0.81.

However, excessively long training on the easier task does not necessarily bring further benefits and can even harm the final evaluation on the more difficult task, as the chart above illustrates. The evaluation improves as the training duration increases up to 1000 episodes, but beyond that point it declines, dropping to approximately 0.64 when the agent spends 1500 episodes on the easier task.

What could be the reason for such a noticeable decline?
During prolonged exposure to the easier environment, the agent may overfit to it. When it then moves to an environment with more challenging features, such as more lava patches, it fails to generalize from the states it has seen and keeps choosing actions as if it were still in the easier environment. Finding the right balance is therefore crucial when choosing the number of episodes for the easier tasks in a curriculum: the goal is to maximize the agent's results while keeping the training time reasonable.


# Aim of experiment presented below

The next experiment investigates how the total training time depends on the time spent in the easier task when training on the more difficult task ends only once the agent reaches a specified minimum evaluation. The following numbers of episodes spent in the easier task were tested:
- 500
- 750
- 1000
- 1250
- 1500

In the more difficult task, the minimum evaluation was set to 0.8, so the agent had to keep training on that task until its average evaluation reached 0.8 or higher. The average was computed every 100 training episodes from 25 evaluation runs in which the agent followed the policy learned so far.
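
The stopping condition described above can be sketched as a simple loop. This is a minimal illustration, not the implementation used by `academia`: `train_one_episode` and `evaluate_once` are hypothetical callables, the toy agent in the usage part is made up, and `max_episodes` is a safety cap added here only so that the sketch always terminates.

```python
from statistics import mean


def train_until_min_evaluation(train_one_episode, evaluate_once,
                               min_evaluation=0.8, evaluation_interval=100,
                               evaluation_count=25, max_episodes=100_000):
    """Train until the mean evaluation score reaches `min_evaluation`.

    `train_one_episode` performs one training episode and `evaluate_once`
    returns one evaluation score (both hypothetical, supplied by the caller).
    Returns the number of training episodes performed.
    """
    episodes_done = 0
    while episodes_done < max_episodes:
        train_one_episode()
        episodes_done += 1
        # Evaluate only every `evaluation_interval` training episodes
        if episodes_done % evaluation_interval == 0:
            scores = [evaluate_once() for _ in range(evaluation_count)]
            if mean(scores) >= min_evaluation:
                break
    return episodes_done


# Toy stand-in for an agent: its evaluation score grows with training.
toy_state = {"episodes": 0}


def toy_train():
    toy_state["episodes"] += 1


def toy_evaluate():
    return min(1.0, toy_state["episodes"] / 900)


episodes_needed = train_until_min_evaluation(toy_train, toy_evaluate)
print(episodes_needed)
```

Because the check happens only at multiples of the evaluation interval, the reported training time is always rounded up to the next evaluation point, which is worth keeping in mind when comparing totals that differ by less than 100 episodes.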

*The experiment for each unique value of the number of episodes spent in the easier task (500, 750, 1000, 1250, 1500) was repeated 10 times to average the results.*


## Loading data for checking time impact

In [10]:
# task_runs_x[k][i]: stats of the easier task (task 1) for the k-th tested
# episode count and the i-th repetition; task_runs_y: same for task 2
task_runs_x = []
task_runs_y = []
for episodes in [500, 750, 1000, 1250, 1500]:
    first_task_stats = []
    second_task_stats = []
    for i in range(10):
        first_task_stats.append(LearningStats.load(f'outputs/time/episodes_{episodes}/curriculum_iter_{i+1}/1.stats.json'))
        second_task_stats.append(LearningStats.load(f'outputs/time/episodes_{episodes}/curriculum_iter_{i+1}/2.stats.json'))
    task_runs_x.append(first_task_stats)
    task_runs_y.append(second_task_stats)

## Impact of learning duration in the easier task on the total time spent in both tasks (easier and more difficult)

#### time domain for x: `episodes` vs time domain for y: `steps`

In [13]:
visualizations.plot_time_impact(task_runs_x=task_runs_x, task_runs_y=task_runs_y, time_domain_x="episodes", time_domain_y="steps", show=True)

The chart above shows the relationship between the time spent in the easier task (X-axis), measured in episodes, and the total time spent in both tasks (Y-axis), measured in the total number of steps taken by the agent. The total training time reaches its maximum at 1000 episodes spent in the easier task.

Beyond that value, however, the total time falls to a global minimum at 1250 episodes and then increases slightly again.

From this chart we can conclude that, when time is measured in the total number of steps taken by the agent, it is advisable to train the agent for a reasonably long time on the easier difficulty levels. Too small a value prolongs training because the agent cannot learn a preliminary baseline policy. When it then moves to more challenging levels, it performs even worse than on the easier ones and spends more time exploring the new environment and learning how to solve it well.

On the other hand, excessively long training on the easier task exposes the agent to overfitting, where its policy becomes too tailored to that environment. This also lengthens learning on subsequent tasks, because the agent first needs to unlearn behaviors specific to the earlier environment before it can adapt to new challenges.


#### time domain for x: `episodes` vs time domain for y: `episodes`

In [17]:
visualizations.plot_time_impact(task_runs_x=task_runs_x, task_runs_y=task_runs_y, time_domain_x="episodes", time_domain_y="as_x", show=True)

In the next visualization, the Y-axis again represents the total training time of the agent, but this time measured in the cumulative number of training episodes. Changing the unit of time changes the picture, so the two charts have to be interpreted differently. As in the previous visualization, there is a minimum at 1250 episodes, but here it is a local minimum rather than a global one.

The chart shows an upward trend up to 1000 episodes on the X-axis, followed by a significant drop to a total training time of 1900 episodes at 1250. The global minimum, however, occurs at the smallest tested value, 500 episodes in the easier task, with a total training time of 1870 episodes. Beyond 1250, the trend is again very similar to the previous visualization.

What could explain the difference at 1250 episodes between the two charts? There could be various reasons, but the most likely one lies in what each unit measures. Time measured in episodes tells us how many attempts at navigating the environment the agent has made. At the beginning of each episode we reset the environment and place the agent in it; the agent then acts, choosing some actions at random and others according to its policy (epsilon-greedy strategy), until the episode ends. An episode can end in success, when the agent completes the environment, or in failure, when the agent exceeds a step limit or takes an action that terminates the episode immediately. Counting time in episodes therefore says nothing about how long each episode lasted: the agent may have ended it after a few steps with an illegal action, or spent a long time in the environment before succeeding or failing.
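
The structure of a single episode described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used in the experiment: `reset`, `step`, and `q_values_for` are hypothetical callables, and the toy corridor environment in the usage part is made up.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_episode(reset, step, q_values_for, epsilon, max_steps, rng):
    """Run one episode and return (steps_taken, outcome).

    Hypothetical interface: `reset()` returns the initial state,
    `step(state, action)` returns (next_state, outcome) where outcome is
    None, "success" or "failure", and `q_values_for(state)` returns the
    agent's action-value estimates for that state.
    """
    state = reset()
    for steps_taken in range(1, max_steps + 1):
        action = epsilon_greedy(q_values_for(state), epsilon, rng)
        state, outcome = step(state, action)
        if outcome is not None:  # goal reached or terminating action taken
            return steps_taken, outcome
    return max_steps, "failure"  # step limit exceeded


# Toy corridor: action 1 moves right, action 0 stays; the goal is position 3.
def toy_reset():
    return 0


def toy_step(state, action):
    next_state = state + 1 if action == 1 else state
    return next_state, ("success" if next_state == 3 else None)


rng = random.Random(0)
finished = run_episode(toy_reset, toy_step, lambda s: [0.0, 1.0], 0.0, 10, rng)
timed_out = run_episode(toy_reset, toy_step, lambda s: [0.0, 1.0], 0.0, 2, rng)
print(finished, timed_out)
```

Both calls count as exactly one episode, yet they differ in the number of steps taken, which is the distinction the two time domains capture.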

Time measured in steps, on the other hand, tells us how many decisions the agent made, that is, how many actions it took by the end of the learning process. The higher value at 1250 episodes on the chart above does not necessarily mean that training at this point was not the shortest when expressed in seconds. It only indicates that the agent did not complete the fewest episodes during training. Even though it completed more episodes in total than, for example, the agent trained for 500 episodes on the easier task, it executed the fewest total steps, or elementary operations, of all the values tested in this experiment.
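
The difference between the two time domains can be made concrete with a small sketch. The step counts below are hypothetical placeholders chosen for illustration, not data from this experiment, and the helper names are not part of the `academia` API.

```python
def total_episodes(step_counts_per_episode):
    """Training time in the 'episodes' domain: how many episodes were run."""
    return len(step_counts_per_episode)


def total_steps(step_counts_per_episode):
    """Training time in the 'steps' domain: how many actions were taken."""
    return sum(step_counts_per_episode)


# Hypothetical placeholder runs: each number is the length of one episode.
run_a = [50, 50, 50, 50]          # few, long episodes
run_b = [10, 10, 10, 10, 10, 10]  # more, but much shorter episodes

print(total_episodes(run_a), total_steps(run_a))
print(total_episodes(run_b), total_steps(run_b))
```

Here `run_b` is longer when time is counted in episodes but shorter when it is counted in steps, which is the same kind of reversal that the two charts show at 1250 episodes.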
