## Reinforcement Learning Project 1: The Mountain Car Challenge 

*[Coding along with [Reinforcement Learning for Algorithmic Trading with Python](https://www.udemy.com/course/reinforcement-learning-for-algorithmic-trading-with-python/), a Udemy online course by Alexander Hagmann]*

**From https://github.com/openai:** [Gym](https://github.com/openai/gym) is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. Since its release, Gym's API has become the field standard for doing this.

**[Gymnasium](https://gymnasium.farama.org/content/basic_usage/)** is a project that provides an API for all single-agent reinforcement learning environments, and includes implementations of common environments: cartpole, pendulum, mountain-car, mujoco, atari, and more.

The API contains four key functions: `make`, `reset`, `step` and `render`. At the core of Gymnasium is `Env`, a high-level Python class representing a Markov decision process (MDP) from reinforcement learning theory (this is not a perfect reconstruction, and is missing several components of MDPs). Within Gymnasium, environments (MDPs) are implemented as `Env` classes, along with `Wrappers`, which provide helpful utilities and can change the results passed to the user ([Gymnasium Documentation: Basic Usage](https://gymnasium.farama.org/content/basic_usage/)).

**The Gymnasium GitHub repository can be found at [https://github.com/Farama-Foundation/Gymnasium](https://github.com/Farama-Foundation/Gymnasium).**

### One random Episode - with human rendering

> <span style="color:blue">__Prompt:__ I have installed Anaconda and all required libraries to work with OpenAI's gymnasium Reinforcement Learning environments. I have also installed Microsoft Visual Studio to visualize environments such as Mountain Car. Please provide a Python code example that creates and runs a random episode in the 'MountainCar-v0' environment using the gymnasium library, with human rendering enabled and printing the total reward and number of steps at the end. Please note that gym is now gymnasium.</span>

Great! Here’s a Python code example that creates and runs a random episode in the **'MountainCar-v0'** environment using the **gymnasium** library, with **human rendering** enabled and printing the **total reward** and **number of steps** at the end:

```python
import gymnasium as gym

# Create the environment
env = gym.make("MountainCar-v0", render_mode="human")

# Initialize the environment
observation, info = env.reset()

total_reward = 0
num_steps = 0
done = False

# Run a random episode
while not done:
    # Take a random action
    action = env.action_space.sample()
    
    # Perform the action in the environment
    observation, reward, done, truncated, info = env.step(action)
    
    # Accumulate the reward and count the steps
    total_reward += reward
    num_steps += 1

# Print the results
print(f"Total Reward: {total_reward}")
print(f"Total Steps: {num_steps}")

# Close the environment
env.close()
```

### Explanation of the Code:
1. **`gym.make("MountainCar-v0", render_mode="human")`**: 
   - Initializes the **MountainCar-v0** environment with rendering enabled so you can visually observe the car’s movement.

2. **`env.reset()`**:
   - Resets the environment to its initial state and returns the initial observation and environment info.

3. **`env.action_space.sample()`**:
   - Randomly samples an action from the action space (which, for **MountainCar**, can be one of three discrete actions: push left, no push, or push right).

4. **`env.step(action)`**:
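   - Takes the sampled action and returns the next observation, the reward, a termination flag (named `done` here), a `truncated` flag, and additional info. The loop continues until `done` is `True`, which happens only when the car reaches the flag; hitting the step limit is reported through `truncated`, which this loop does not check.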

5. **`total_reward` and `num_steps`**:
   - These accumulate the reward over the episode and count the number of steps, respectively.

6. **`env.close()`**:
   - Closes the environment to stop the rendering window after the episode finishes.

### Libraries and Setup:
- **gymnasium**: The maintained successor to the Gym library for RL environments, developed by the Farama Foundation.
- **render_mode='human'**: Opens a window so a human viewer can watch the environment as it runs.
  
This code will run a random agent in the **MountainCar** environment, render the environment visually, and display the total reward and number of steps after the episode ends.

Let me know if you encounter any issues or need further customizations!

In [1]:
"""
import gymnasium as gym

# Create the environment
# initializing the MountainCar-v0 environment with human rendering to visualize the agent's actions
# the environment is rendered in real-time using render_mode="human"
env = gym.make("MountainCar-v0", render_mode="human")

# Initialize the environment
observation, info = env.reset()

total_reward = 0
num_steps = 0
done = False

# Run a random episode
while not done:
    # Take a random action, like driving right or left or doing nothing
    action = env.action_space.sample() # selects a random action at each step
    
    # Perform the action in the environment
    # env.step(action) executes the action and returns the new state, the reward, and whether the episode has finished
    # the returned observation is the state for the next step
    observation, reward, done, truncated, info = env.step(action)
    
    # Accumulate the reward and count the steps
    total_reward += reward
    num_steps += 1

# Print the results
print(f"Total Reward: {total_reward}")
print(f"Total Steps: {num_steps}")

# Close the environment
env.close()
"""


> <span style="color:blue">Could you please give me a more verbose explanation of the steps Episode Initialization and Episode Execution that describes the state of the variables at each step?</span>

Certainly! I'll break down the steps of **Episode Initialization** and **Episode Execution** in more detail, focusing on the state of the variables at each stage of the process.

### **1. Episode Initialization**

The first part of the code involves initializing the environment and setting up the necessary variables.

```python
# Create the MountainCar environment
env = gym.make("MountainCar-v0", render_mode="human")

# Initialize the environment
observation, info = env.reset()

total_reward = 0
num_steps = 0
done = False
```

#### Step-by-Step Explanation:

- **`env = gym.make("MountainCar-v0", render_mode="human")`**:
   - This line **creates and initializes** the 'MountainCar-v0' environment.
   - **`env`**: This variable now holds the Mountain Car environment object, which has properties and methods to interact with the simulation.
   - **`render_mode="human"`**: This parameter specifies that the environment should render to a window so you can visually see the simulation as it runs.

- **`observation, info = env.reset()`**:
   - **Resetting the environment**: The `reset()` method sets the environment to its **initial state**, ready to start a new episode.
   - **`observation`**: This variable holds the initial state (or observation) of the environment, which is a **numpy array** representing the position and velocity of the car. For MountainCar, it is a one-dimensional array with two elements, such as `[-0.5, 0.0]`, where the first value is the initial position (chosen at random between -0.6 and -0.4) and `0.0` is the initial velocity.
   - **`info`**: This variable holds **additional information** about the environment's state or configuration (which can vary based on the environment). It's often empty or not necessary for simple tasks.

- **`total_reward = 0`**:
   - **`total_reward`** is a variable used to track the **total accumulated reward** for the current episode.
   - In the MountainCar environment, every step gives a reward of -1, so maximizing the total reward means reaching the flag (i.e., the top of the hill) in as few steps as possible.
   - Initial value is set to **0** because no steps have been taken yet.

- **`num_steps = 0`**:
   - **`num_steps`** tracks the **number of steps** taken in the episode. It’s initialized to **0** at the start.
   
- **`done = False`**:
   - **`done`** is a boolean flag that indicates if the **episode is finished**.
   - It starts as **False** because the car has not yet reached the goal.

---

### **2. Episode Execution**

This part of the code runs the episode, taking actions and interacting with the environment until the car reaches the goal. As written, the loop ignores `truncated`, so the step limit alone does not end it.

```python
while not done:
    # Take a random action
    action = env.action_space.sample()
    
    # Perform the action in the environment
    observation, reward, done, truncated, info = env.step(action)
    
    # Accumulate the reward and count the steps
    total_reward += reward
    num_steps += 1
```

#### Step-by-Step Explanation:

- **`while not done:`**:
   - This is a **loop** that runs as long as `done` is **False**, meaning the episode hasn't yet finished.
   - After every step, the loop condition checks whether `done` has become `True` (i.e., the car has reached the goal). The step limit is reported separately through `truncated`, which this condition does not test.

- **`action = env.action_space.sample()`**:
   - This line randomly selects an action from the **action space** of the Mountain Car environment.
   - **`action`** is an integer that can be **0** (push left), **1** (no push), or **2** (push right). This corresponds to the agent’s attempt to push the car left, do nothing, or push it right.

- **`observation, reward, done, truncated, info = env.step(action)`**:
   - **`env.step(action)`**: This method **performs the action** in the environment, advancing the simulation by one time step. It returns multiple pieces of information:
     - **`observation`**: The new state of the environment after the action. In the MountainCar environment, this is an array representing the new position and velocity of the car.
     - **`reward`**: The **reward** received from taking that action. In MountainCar, it is -1 for every step, which encourages the agent to finish the episode quickly.
     - **`done`**: A boolean flag indicating whether the episode has **terminated** (Gymnasium calls this value `terminated`). It becomes **True** when the car reaches the flag (goal).
     - **`truncated`**: A boolean flag indicating whether the episode was **truncated** (i.e., cut off from outside the task itself, typically because a time or step limit was reached).
     - **`info`**: Additional information, often empty or not relevant for basic tasks.
   
- **`total_reward += reward`**:
   - This line **updates the total reward** by adding the reward received from the current step. Since each step in MountainCar gives a negative reward, this value will become increasingly negative over time.
   
- **`num_steps += 1`**:
   - This increments the **step counter**, tracking how many steps have been taken in the episode.

### Final State (End of Episode):
Once the `done` flag becomes `True`, meaning the car has reached the goal, the loop exits.

```python
# Print the results
print(f"Total Reward: {total_reward}")
print(f"Total Steps: {num_steps}")

# Close the environment
env.close()
```

- **Total Reward** and **Total Steps** are printed:
   - **`total_reward`**: This will show the **cumulative reward** over the entire episode (always negative for MountainCar, since every step costs -1; the fewer steps, the closer to zero).
   - **`num_steps`**: This shows the **number of actions** the agent took to complete the episode.

- **`env.close()`**:
   - This closes the environment and stops rendering. It’s important to call this to properly clean up resources, especially if you're using human rendering.

---

### Summary of Variable States Throughout the Episode:
- **`env`**: Holds the environment object, allowing interaction with the MountainCar-v0 environment.
- **`observation`**: Initially set by `reset()` to a state like `[-0.5, 0.0]` and updated at each step to reflect the new position and velocity of the car.
- **`total_reward`**: Starts at 0 and decreases as the episode progresses (MountainCar provides negative rewards for each step).
- **`num_steps`**: Starts at 0 and increments with each step in the environment.
- **`done`**: Initially `False`, becomes `True` when the car reaches the goal, signaling the end of the episode. Reaching the step limit sets `truncated` instead.
- **`action`**: A random action (either 0, 1, or 2) sampled from the action space at each step.

> <span style="color:blue">__Prompt:__ It doesn't stop automatically (unless it reaches the goal). Please adjust and add a maximum number of steps per episode.</span>

To ensure that the episode terminates after a certain number of steps (even if the goal is not reached), we can add a condition to break the loop after a **maximum number of steps**. Here's the modified code:

```python
import gymnasium as gym

# Create the MountainCar environment
env = gym.make("MountainCar-v0", render_mode="human")

max_steps_per_episode = 200  # Define the maximum number of steps per episode

# Initialize the environment
observation, info = env.reset()

total_reward = 0
num_steps = 0
done = False

# Run a random episode with a maximum number of steps
while not done and num_steps < max_steps_per_episode:
    # Take a random action
    action = env.action_space.sample()
    
    # Perform the action in the environment
    observation, reward, done, truncated, info = env.step(action)
    
    # Accumulate the reward and count the steps
    total_reward += reward
    num_steps += 1

# Print the results
print(f"Total Reward: {total_reward}")
print(f"Total Steps: {num_steps}")

# Close the environment
env.close()
```

### Changes Made:
1. **`max_steps_per_episode = 200`**:
   - A new variable `max_steps_per_episode` defines the **maximum number of steps** an episode can run before it is forcefully stopped.
   - You can adjust this number based on your requirements.

2. **`while not done and num_steps < max_steps_per_episode:`**:
   - The loop now also checks if `num_steps` has reached `max_steps_per_episode`. If the maximum number of steps is reached, the episode will terminate, even if the goal hasn’t been reached.

Now, the episode will automatically stop either when:
1. The agent reaches the goal (i.e., `done == True`).
2. The agent reaches the maximum number of steps (200 steps in this case).

Let me know if you need further adjustments!

In [2]:
import gymnasium as gym

# Create the MountainCar environment
env = gym.make("MountainCar-v0", render_mode="human")

max_steps_per_episode = 200  # Define the maximum number of steps per episode

# Initialize the environment
observation, info = env.reset()

total_reward = 0
num_steps = 0
done = False

# Run a random episode with a maximum number of steps
while not done and num_steps < max_steps_per_episode:
    # Take a random action
    action = env.action_space.sample()
    
    # Perform the action in the environment
    observation, reward, done, truncated, info = env.step(action)
    
    # Accumulate the reward and count the steps
    total_reward += reward
    num_steps += 1

# Print the results
print(f"Total Reward: {total_reward}")
print(f"Total Steps: {num_steps}")

# Close the environment
env.close()

Total Reward: -200.0
Total Steps: 200


> <span style="color:blue">__Prompt:__ Please use the max_episode_steps parameter in gym.make() and truncated.</span>

Sure! To specify the maximum number of steps directly in the environment initialization, we can use the `max_episode_steps` parameter in the `gym.make()` function. Additionally, we can handle the episode termination using the **`truncated`** flag, which indicates whether the episode ended because it reached the maximum number of steps.

Here’s the adjusted code:

```python
import gymnasium as gym

# Create the MountainCar environment with max episode steps
env = gym.make("MountainCar-v0", render_mode="human", max_episode_steps=200)

# Initialize the environment
observation, info = env.reset()

total_reward = 0
num_steps = 0
done = False
truncated = False

# Run a random episode
while not done and not truncated:
    # Take a random action
    action = env.action_space.sample()
    
    # Perform the action in the environment
    observation, reward, done, truncated, info = env.step(action)
    
    # Accumulate the reward and count the steps
    total_reward += reward
    num_steps += 1

# Print the results
print(f"Total Reward: {total_reward}")
print(f"Total Steps: {num_steps}")

# Check if the episode was truncated
if truncated:
    print("Episode was truncated due to reaching the maximum number of steps.")

# Close the environment
env.close()
```

### Changes Explained:

1. **`max_episode_steps=200`**:
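   - This parameter is passed to `gym.make()` to specify that each episode should end after a maximum of **200 steps**. MountainCar-v0 is already registered with this 200-step limit, so the argument makes the default explicit.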
   - You can adjust this number if you want more or fewer steps per episode.

2. **`observation, reward, done, truncated, info = env.step(action)`**:
   - The `step()` function returns the `truncated` flag, which is set to `True` if the episode ends because it has reached the maximum number of steps.
   - The loop will now terminate if either `done` is `True` (the goal is reached) or `truncated` is `True` (the maximum step count is reached).

3. **`if truncated:`**:
   - After the episode ends, this check will print a message if the episode was truncated because it reached the maximum number of steps.

### Explanation of Variables:
- **`done`**: Indicates if the episode has finished because the goal was reached (Gymnasium's documentation calls this flag `terminated`).
- **`truncated`**: Indicates if the episode was **cut off** because it reached the **maximum number of steps**.

This method ensures that the episode automatically stops when either the goal is reached or the maximum number of steps is hit, and it leverages the `max_episode_steps` parameter of the environment itself.

In [3]:
import gymnasium as gym

# Create the MountainCar environment with max episode steps
env = gym.make("MountainCar-v0", render_mode="human", max_episode_steps=200)

# Initialize the environment
observation, info = env.reset()

total_reward = 0
num_steps = 0 # num_steps = env._elapsed_steps
done = False
truncated = False

# Run a random episode
while not done and not truncated:
    # Take a random action
    action = env.action_space.sample()
    
    # Perform the action in the environment
    observation, reward, done, truncated, info = env.step(action)
    
    # Accumulate the reward and count the steps
    total_reward += reward
    num_steps += 1 # could be removed because env._elapsed_steps returns the same information

# Print the results
print(f"Total Reward: {total_reward}")
print(f"Total Steps: {num_steps}") # num_steps = env._elapsed_steps

# Check if the episode was truncated
if truncated:
    print("Episode was truncated due to reaching the maximum number of steps.")

# Close the environment
env.close()

Total Reward: -200.0
Total Steps: 200
Episode was truncated due to reaching the maximum number of steps.


#### Breaking down the code line by line (at least more or less)

In [1]:
# importing gymnasium
import gymnasium as gym

In [2]:
# https://gymnasium.farama.org/environments/classic_control/mountain_car/
# creating the environment for the MountainCar challenge
# selecting the 'human' render mode
# setting the maximum number of steps per episode to 200
env = gym.make("MountainCar-v0", render_mode="human", max_episode_steps=200)

In [3]:
env

<TimeLimit<OrderEnforcing<PassiveEnvChecker<MountainCarEnv<MountainCar-v0>>>>>

In [4]:
# setting parameters with initial values
total_reward = 0
num_steps = 0
done = False
truncated = False

In [5]:
# resetting the state of the environment = resetting the starting point of the car
# Initialize the environment
observation, info = env.reset() # observation = the state
# starting position has a random aspect
observation

array([-0.46586874,  0.        ], dtype=float32)

__Observation Space:__
The observation is an ndarray with shape (2,) where the elements correspond to the following:

| Num | Observation | Min | Max | Unit |
|-----|-------------|-----|-----|------|
| 0 | position of the car along the x-axis | -1.2 | 0.6 | position (m) |
| 1 | velocity of the car | -0.07 | 0.07 | velocity (v) |

In [6]:
# creating a random action with the sample method: 0, 1 or 2
action = env.action_space.sample()
action

np.int64(1)

__Action Space:__ There are 3 discrete deterministic actions:

0: Accelerate to the left

1: Don’t accelerate

2: Accelerate to the right

In [7]:
# passing the action to the step method to evaluate the next state
# depending on the current state
# truncated checks if we've reached the max episode steps
observation, reward, done, truncated, info = env.step(action)
observation, reward, done, truncated, info

(array([-4.6629956e-01, -4.3081396e-04], dtype=float32),
 -1.0,
 False,
 False,
 {})
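
The transition printed above can be reproduced by hand. The following sketch re-implements one step in plain Python. The constants (`FORCE`, `GRAVITY`, the position and speed bounds, and the goal position) are assumptions based on the published MountainCar-v0 description, and `mountain_car_step` is a hypothetical helper name, not part of Gymnasium.

```python
import math

# Assumed constants, taken from the MountainCar-v0 description
FORCE = 0.001
GRAVITY = 0.0025
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.5


def mountain_car_step(position, velocity, action):
    """Advance the car by one time step for action 0 (left), 1 (none) or 2 (right)."""
    # The engine pushes left or right; gravity pulls along the slope
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    # Hitting the left wall stops the car
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0
    terminated = position >= GOAL_POSITION and velocity >= 0
    reward = -1.0  # every step costs one point
    return position, velocity, reward, terminated


# Same start state and action as in the cells above
new_position, new_velocity, reward, terminated = mountain_car_step(-0.46586874, 0.0, 1)
print(new_position, new_velocity, reward, terminated)
```

With action 1 (no push) only gravity acts, which is why the car starts to roll slowly to the left.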

In [8]:
# update total rewards and the number of steps
total_reward += reward
num_steps += 1
total_reward, num_steps

(-1.0, 1)

In [9]:
# update the state
state = observation
state

array([-4.6629956e-01, -4.3081396e-04], dtype=float32)

In [10]:
# close the environment
env.close()

In [11]:
# access the internal step count
num_steps = env._elapsed_steps
num_steps

1

### Multiple random Episodes - with human rendering

> <span style="color:blue">__Prompt:__ Now, I want to run multiple random episodes with human rendering. Please adjust the code accordingly with as few changes as possible.</span>

To run multiple random episodes with human rendering, you can simply add a loop around the episode logic. Here’s the adjusted code to run **multiple episodes**:

```python
import gymnasium as gym

# Create the environment
env = gym.make("MountainCar-v0", render_mode="human")

num_episodes = 5  # Set the number of episodes you want to run

for episode in range(num_episodes):
    # Reset the environment at the beginning of each episode
    observation, info = env.reset()
    
    total_reward = 0
    num_steps = 0
    done = False
    truncated = False

    print(f"\nStarting Episode {episode + 1}")

    # Run the episode until the goal is reached or the step limit truncates it
    while not done and not truncated:
        # Take a random action
        action = env.action_space.sample()
        
        # Perform the action in the environment
        observation, reward, done, truncated, info = env.step(action)
        
        # Accumulate the reward and count the steps
        total_reward += reward
        num_steps += 1

    # Print the results for the current episode
    print(f"Episode {episode + 1} finished")
    print(f"Total Reward: {total_reward}")
    print(f"Total Steps: {num_steps}")

# Close the environment after all episodes
env.close()
```

### Changes:
1. **Added a loop** around the episode logic to run multiple episodes (`for episode in range(num_episodes):`).
2. Set `num_episodes` to control how many episodes you want to run (in this case, 5).
3. A print statement was added before each episode to indicate which episode is starting.

This code will now run 5 random episodes with human rendering enabled, showing the total reward and number of steps for each episode.

Let me know if you need further adjustments!

### Performance Evaluation / Success Measurement

> <span style="color:blue">__Prompt:__ Now, let's focus on Performance Measurement and Evaluation. I want to measure the performance / success of the random agent and print the performance at the end of the random run. Please make an appropriate proposal with various metrics.</span>
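
One possible set of metrics, offered as a sketch and not as the course's own answer: success rate, mean and best total reward, and mean episode length. The episode records below are hypothetical values for illustration only, and `summarize` is a helper name introduced here.

```python
from statistics import mean

# Hypothetical episode records, for illustration only (not measured results)
episodes = [
    {"total_reward": -200.0, "num_steps": 200, "reached_goal": False},
    {"total_reward": -200.0, "num_steps": 200, "reached_goal": False},
    {"total_reward": -163.0, "num_steps": 163, "reached_goal": True},
    {"total_reward": -200.0, "num_steps": 200, "reached_goal": False},
]


def summarize(episodes):
    """Aggregate per-episode records into a few summary metrics."""
    successes = [e for e in episodes if e["reached_goal"]]
    return {
        "episodes": len(episodes),
        "success_rate": len(successes) / len(episodes),
        "mean_reward": mean(e["total_reward"] for e in episodes),
        "best_reward": max(e["total_reward"] for e in episodes),
        "mean_steps": mean(e["num_steps"] for e in episodes),
        # Only meaningful if at least one episode reached the goal
        "mean_steps_when_successful": (
            mean(e["num_steps"] for e in successes) if successes else None
        ),
    }


summary = summarize(episodes)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In a real run, each record would be appended at the end of an episode, with `reached_goal` taken from the termination flag returned by `env.step()`.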

### RGB Rendering

> <span style="color:blue">__Prompt:__ Human rendering is really slow. Please remove human rendering to run many episodes in no time. Print every 100th episode only.</span>

### Training an RL Agent with Q-Tables

> <span style="color:blue">__Prompt:__ I have the following code to run multiple random episodes of OpenAI's MountainCar challenge: ---insert code---</span>
>
> <span style="color:blue">Now, I want to create and train a Reinforcement Learning Agent that can reach the target with as few steps as possible and with a high success rate. Please build on the existing code and train the Agent. </span>
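
A Q-table needs discrete states, while MountainCar observations are continuous. The sketch below shows the two building blocks under stated assumptions: binning the observation and applying one Q-learning update. The bin count, learning rate and discount factor are hypothetical values, and `discretize` and `q_update` are helper names introduced here, not the course's implementation.

```python
# Hypothetical settings, chosen for illustration only
N_BINS = 20
POSITION_RANGE = (-1.2, 0.6)
VELOCITY_RANGE = (-0.07, 0.07)
N_ACTIONS = 3
ALPHA = 0.1   # learning rate
GAMMA = 0.99  # discount factor


def discretize(observation):
    """Map a continuous (position, velocity) pair to a pair of bin indices."""
    indices = []
    for value, (low, high) in zip(observation, (POSITION_RANGE, VELOCITY_RANGE)):
        fraction = (value - low) / (high - low)
        index = int(fraction * N_BINS)
        indices.append(min(max(index, 0), N_BINS - 1))  # keep the index in range
    return tuple(indices)


# One Q-value per (position bin, velocity bin, action), all starting at zero
q_table = [[[0.0] * N_ACTIONS for _ in range(N_BINS)] for _ in range(N_BINS)]


def q_update(state, action, reward, next_state, terminated):
    """Apply one Q-learning update for the observed transition."""
    p, v = state
    next_p, next_v = next_state
    # No future value once the goal has been reached
    best_next = 0.0 if terminated else max(q_table[next_p][next_v])
    target = reward + GAMMA * best_next
    q_table[p][v][action] += ALPHA * (target - q_table[p][v][action])


# Example transition using the observations from the single step shown earlier
state = discretize((-0.46586874, 0.0))
next_state = discretize((-0.46629956, -0.00043081396))
q_update(state, 1, -1.0, next_state, terminated=False)
print(state, next_state, q_table[state[0]][state[1]][1])
```

A full agent would wrap these calls in the episode loop and choose actions with an exploration strategy such as epsilon-greedy.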
