# ⛰️ Cliff Walking
---

In this challenge, you will learn the fundamental aspects of using Gymnasium and Stable Baselines3 to train reinforcement learning models. This challenge is designed to give you a comprehensive understanding of setting up environments, training models, visualizing their performance, and finally using a trained model to interact with an environment.

---
It's a good practice to place all your import statements at the top of your notebook to make it easier to rerun your notebook.

Throughout the challenges, you will need to add import statements. Add them all to the cell below.

In [1]:
import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.vec_env import DummyVecEnv

---

### 🎯 Challenge Objectives

#### 🏋️‍♀️ Gain Familiarity with Gymnasium:
We will start by exploring the [CliffWalking environment](https://gymnasium.farama.org/environments/toy_text/cliff_walking/) from Gymnasium. Understand the fundamental methods used to manage and visualize environments. Learn to load and reset environments, take actions, and visualize results.

#### 🤖 Learn to Use Stable Baselines3:
Use the Stable Baselines3 library to set up and train a reinforcement learning model. You will learn to configure a model, adjust training parameters, and start the training process using a popular algorithm like Deep Q-Network (DQN).

#### 📈 Visualize Training Performance:
Apply visualization tools such as logging and TensorBoard to monitor and plot your model's training performance. Analyze metrics such as rewards per episode and learning curves to evaluate the effectiveness of your training strategies.

#### 💾 Load and Deploy a Trained Model:
After training, learn to save a trained model and load it later. Use this model to run a simulation where the agent interacts with the CliffWalking environment and apply the learned policies to navigate the environment effectively.

#### 🔍 Evaluate and Reflect:
Evaluate the trained model's performance in real-time interactions within the environment. Reflect on the learning process and the agent's behavior, understanding how different configurations and training durations can affect the results.

---
### Section 1: Exploring the CliffWalking Environment

In this first task, you will set up the [CliffWalking environment](https://gymnasium.farama.org/environments/toy_text/cliff_walking/) from Gymnasium.

#### 📝 Steps to Follow:

#### 0. ⚙️ Import the Package:
Import gymnasium in the cell at the top of your notebook. Make sure we can use it as `gym` in our code.

#### 1. 🗂️ Load the Environment:
- Use the `.make()` method to load the `CliffWalking` environment.  
- Set `render_mode` appropriately to visualize the environment.

Note: The `gymnasium` documentation tells you to use `CliffWalking-v1`, but this is not yet available in the `gymnasium` version 1.0.0 we are using for compatibility reasons. Instead, use `CliffWalking-v0`.

In [2]:
env = gym.make('CliffWalking-v0', render_mode='human')

#### 2. 🔄 Initialize the Environment:
Call the `.reset()` method to bring the environment to its initial state; this is required before interacting with it. The `.reset()` method returns a tuple containing the initial state and an info dictionary. Store this in a variable and print it.

> You may see a `UserWarning: pkg_resources is deprecated ...` warning. You can ignore this.

In [3]:
initial_state = env.reset()
initial_state

  from pkg_resources import resource_stream, resource_exists


(36, {'prob': 1})

The output `(36, {'prob': 1})` provides two pieces of information about the environment's state:

**🔢 State Index (36)**: This number represents the agent's specific position within the environment grid at the start. For example, in the 'CliffWalking-v0' environment, the index `36` corresponds to the initial state where the agent starts the episode.

**🎲 Probability Information ({'prob': 1})**: This dictionary shows additional details about the state, particularly regarding transition probability. The value `1` for the `'prob'` key indicates that the transition to this state occurred with probability 1, meaning it is certain. This makes sense because the agent must start in this state.
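To make the state index concrete, here is a small sketch that converts between the flat index and a grid position. It assumes the 4x12 grid layout described in the Gymnasium documentation, where the index is `row * 12 + col`; the helper names are made up for this illustration.

```python
# Grid dimensions of CliffWalking, as described in the Gymnasium docs.
N_ROWS, N_COLS = 4, 12


def index_to_position(state_index):
    """Convert a flat state index into a (row, col) grid position."""
    return divmod(state_index, N_COLS)


def position_to_index(row, col):
    """Convert a (row, col) grid position into a flat state index."""
    return row * N_COLS + col


print(index_to_position(36))     # (3, 0): bottom-left corner, the start
print(position_to_index(3, 11))  # 47: bottom-right corner, the goal
```

Row `0` is the top of the grid, so state `36` sits in the bottom-left corner.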

#### 3. 👀 Visualize the Environment:
Use the `.render()` method to display the environment and understand its layout before taking any action.

Depending on your setup, you may not need this step as the new state may already be rendered after the previous step.

In [4]:
env.render()

#### 4. 🧹 Close the Environment:
Properly close the environment by calling `.close()` to free up resources when you're done.

<details>
  <summary markdown='span'>⚠️ <strong>Note about closing on macOS</strong></summary>
  
  `.close()` will not close the environment window. This is fine, **you can leave it open**. The window will be reused in the next steps.

  At the end of the challenge and **only at the end**, you will need to force quit it yourself to close the environment window:
  1. Find the window and click the red button to close it. (This will fail.)
  1. Click the Apple symbol in the top left corner of your screen.
  1. Select `Force Quit`.
  1. Find the `python (Not Responding)` process in the process list. Select it.
  1. Click the `Force Quit` button.

  Since this will terminate your kernel, **only do this at the end of the challenge**.

  This happens because we're running inside Jupyter Notebook. If you move your code to a `.py` file, you won't see this behavior.

</details>

In [5]:
env.close()

---
### Section 2: Interacting with the Environment

Now let's load the environment again, take a random step, and show the result. This will introduce the fundamental methods for interacting with environments and help you understand how agents move and receive feedback.

#### 1. 🗂️ Load the Environment
- Load the environment.
- Reset it to the initial state.
- Print the initial state.

In [6]:
env = gym.make('CliffWalking-v0', render_mode='human')
initial_state = env.reset()
print("Initial state:", initial_state)

Initial state: (36, {'prob': 1})


#### 2. 🎲 Sample an Action:
- Use `.action_space.sample()` to select a random action from the environment's action space. This simulates exploration.
- Store this in a variable.
- Examine the type and value of the action.
- Run the cell several times. What values do you see? What do they represent?

In [7]:
action = env.action_space.sample()
print("Type of action:", type(action))
print("Value of action:", action)

Type of action: <class 'numpy.int64'>
Value of action: 1


#### 3. 🦶 Execute the Action:
- Apply the action using `.step()`.  
- In Gymnasium, this returns five values:  
  - New state  
  - Reward  
  - A `terminated` flag (whether the agent reached a terminal state)  
  - A `truncated` flag (whether the episode was cut off, for example by a time limit)  
  - Additional information (if any)
- The episode is done when either `terminated` or `truncated` is `True`.
- Use "tuple unpacking" to store each of these returned values in a variable. Print them. Do you understand the new state, reward, and done return values?
- Run this **reset > action > step** sequence several times and check the different results.

<details>
  <summary markdown='span'>
  💡 Tuple unpacking?
  </summary>

  If a function returns a tuple, you can immediately store the different elements of the tuple in different variables.

  Example:

  Suppose you have this function:

  ```python
  def area_and_perimeter(a, b):
      """Returns the area and perimeter of a rectangle
      with length `a` and width `b`."""
      return a * b, 2 * a + 2 * b
  ```

  Instead of:

  ```python
  result = area_and_perimeter(4, 2)
  area = result[0]
  perimeter = result[1]
  ```

  You can immediately do:

  ```python
  area, perimeter = area_and_perimeter(4, 2)
  ```

  If you're only going to use `area` in the rest of your code, it's common practice to use `_` for the other return values. This signals to other programmers that you're discarding the remaining values.

  Example:
  
  ```python
  area, _ = area_and_perimeter(4, 2)
  # Code that only needs area comes after this
  ```

</details>

In [8]:
new_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated  # the episode ends when either flag is True
print("New state:", new_state)
print("Reward:", reward)
print("Done:", done)

New state: 36
Reward: -100
Done: False


**📍 New State**: This number represents the agent's state after performing the specified action. In the `CliffWalking` environment, the state corresponds to a specific position in the environment grid. Here the agent is back at `36`: the sampled action (`1`, move right) took it onto the cliff, so the environment returned it to the start.

**💸 Reward**: The reward value shows the immediate feedback given by the environment as a result of the agent's action. `CliffWalking` gives `-1` for every step and `-100` for stepping onto the cliff, so the `-100` here is the cliff penalty.

**🚦 Done**: The boolean value indicates whether the episode has ended. In this case, `False` means the episode is still ongoing. In `CliffWalking`, only reaching the goal ends the episode; stepping onto the cliff sends the agent back to the start without ending it.
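The outputs seen so far can be reproduced with a small pure-Python model of the grid. This is only a sketch of the deterministic rules as described in the Gymnasium documentation (4x12 grid, `-1` per step, `-100` and a return to the start on the cliff); `toy_step` is a hypothetical helper, not part of Gymnasium.

```python
N_ROWS, N_COLS = 4, 12
START, GOAL = 36, 47
CLIFF = set(range(37, 47))  # bottom row, between the start and the goal

# Action indices as listed in the Gymnasium documentation:
# 0 = up, 1 = right, 2 = down, 3 = left.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def toy_step(state, action):
    """Return (new_state, reward, terminated) for one deterministic move."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    # A move that would leave the grid keeps the agent inside it.
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    new_state = row * N_COLS + col
    if new_state in CLIFF:
        # Cliff: large penalty, back to the start, episode continues.
        return START, -100, False
    return new_state, -1, new_state == GOAL


print(toy_step(36, 1))  # (36, -100, False): right from the start hits the cliff
print(toy_step(36, 0))  # (24, -1, False): up from the start
print(toy_step(35, 2))  # (47, -1, True): down onto the goal
```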

#### 4. 👁️ Visualize the Result:
- Call `.render()` again to visualize the environment after the action and see how the state has changed.
- If you don't see the render window, it's probably hidden behind your other windows or displayed on another desktop.

In [9]:
env.render()

#### 5. 🧹 Close the Environment:
Properly close the environment by calling `.close()` to free up resources when you're done.

In [10]:
env.close()

---
### Section 3: Directed Interaction with the Environment

This time, instead of selecting a random action, you will take an intentional step: we will move **UP**. This reinforces the idea of purposeful decision-making in reinforcement learning.

#### 1. 🚀 Initialize and Print the Initial State:
- Load the environment and call `.reset()` to get the initial state.  
- Use `print()` to display the initial state before taking any action.
- You should get this output `(36, {'prob': 1})`

In [11]:
# Load the environment again, and initialize it

env = gym.make('CliffWalking-v0', render_mode='human')

# Reset the environment to the initial state

initial_state = env.reset()

# Print the initial state

print("Initial state:", initial_state)


Initial state: (36, {'prob': 1})


#### 2. ⬆️ Determine and Execute the 'UP' Action

- Check `.action_space` and the documentation to find the index for the **'UP'** action.  
- Use `.step(action_index)` to execute this action.  
- Print the result to see:
  - New state  
  - Reward received  
  - Whether the episode has ended (`done` flag)

In [12]:
# Take a step to move 'UP'

action = 0  # Action 0 corresponds to 'UP'

# Take the action (Gymnasium's step() returns five values)

new_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated

# Print the new state, reward, and done status

print("New state:", new_state)
print("Reward:", reward)
print("Done:", done)


New state: 24
Reward: -1
Done: False


#### 3. 🖼️ Visualize and Close the Environment

- Use `.render()` to visualize the environment after taking the action and see how the state has changed.
- Then call `.close()` to properly close the environment and free up system resources.

In [13]:
# Render the environment after taking a step

env.render()

# Close the environment

env.close()


---
### Section 4: Training an RL Algorithm

First, you will prepare the `CliffWalking-v0` environment for training with Stable Baselines3 models. This involves proper setup and wrapping to ensure compatibility with RL algorithms.

#### 🧱 1. Load the Environment:
- Use `gym.make()` to create the `CliffWalking-v0` environment.  
- Set `render_mode='human'` → Provides visual feedback during interaction.

In [14]:
env = gym.make("CliffWalking-v0", render_mode="human")

#### 🧵 2. Wrap the Environment

Wrapping your environment with [`DummyVecEnv`](https://stable-baselines3.readthedocs.io/en/v1.0/guide/vec_envs.html#dummyvecenv) is a small but critical step: it ensures your RL setup works seamlessly with Stable Baselines3.

Go ahead and wrap the environment using [`DummyVecEnv`](https://stable-baselines3.readthedocs.io/en/v1.0/guide/vec_envs.html#dummyvecenv) from Stable Baselines3.  👇

In [15]:
# Wrap the environment to make it compatible with SB3
env = DummyVecEnv([lambda: env])  # Lambda function to ensure proper environment handling

#### 🧵 More About `DummyVecEnv`

`DummyVecEnv` is a wrapper from Stable Baselines3 that vectorizes Gym environments to make them compatible with RL models.

#### ⚙️ What Does It Do?

- **📦 Standardizes the API**  
  Ensures the environment matches the expected format for Stable Baselines3 algorithms.

- **📊 Enables Batch Processing**  
  Even with a single environment, interactions are processed as a batch, which is necessary for scaling later with tools like `SubprocVecEnv`.

- **🔗 Ensures Compatibility**  
  Wraps `reset()` and `step()` to work correctly within training loops.
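The batching idea can be illustrated with a toy wrapper in plain Python. This is a simplified sketch, not SB3's implementation: `CountdownEnv` and `ToyVecEnv` are hypothetical classes, and the only behaviors mirrored here are that every value comes back as a batch (one entry per environment) and that a finished environment is reset right away.

```python
class CountdownEnv:
    """Hypothetical toy environment: the episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.remaining = length

    def reset(self):
        self.remaining = self.length
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        done = self.remaining == 0
        return self.remaining, -1.0, done, {}


class ToyVecEnv:
    """Toy stand-in for a vectorized wrapper (not SB3's implementation)."""

    def __init__(self, env_fns):
        # Each factory is called once to build its own environment.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d, info = env.step(action)
            if d:
                # A finished environment is reset immediately.
                o = env.reset()
            obs.append(o)
            rewards.append(r)
            dones.append(d)
            infos.append(info)
        return obs, rewards, dones, infos


vec_env = ToyVecEnv([lambda: CountdownEnv(2)])
print(vec_env.reset())    # [2]
print(vec_env.step([0]))  # ([1], [-1.0], [False], [{}])
print(vec_env.step([0]))  # ([2], [-1.0], [True], [{}])
```

Note that actions go in as a list and every return value is a list, even with a single environment. This is why the outputs later in the challenge look like `[36] [-1.] [False]`.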

#### 3. Initialize a DQN Model from SB3

Now that your environment is ready, you need to initialize the model. In this case, you will configure a Deep Q-Network (DQN) using Stable Baselines3. DQN uses a neural network to predict Q-values: the expected cumulative (discounted) reward for each action in a given state.

#### 📝 Steps to Follow:
- Import `DQN` from `stable_baselines3`. Add this to the cell at the top of the notebook.
- Initialize an instance of the DQN model with the following parameters:
    - Use `'MlpPolicy'`, a basic neural network that maps observations to actions.  
    - Set `env` to your environment and `verbose=1` for detailed training logs.
    - Add the `tensorboard_log` parameter to track training metrics. We will use this later to track training with TensorBoard.
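To give a feel for what a Q-value is, here is a minimal sketch of the two calculations at the heart of Q-learning: picking the action with the highest Q-value, and building the training target for one transition. The Q-values and the discount factor `gamma=0.99` are illustrative assumptions, and the helper names are made up; a real DQN produces the Q-values with its neural network.

```python
def greedy_action(q_values):
    """Return the index of the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, gamma=0.99, terminated=False):
    """Target for Q(s, a): reward plus the discounted best next Q-value."""
    if terminated:
        # No future reward after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


# Illustrative Q-values for the four actions in some next state.
q_next = [-12.0, -10.0, -15.0, -11.0]

print("Greedy action:", greedy_action(q_next))
print("Target for a -1 step:", td_target(-1, q_next))
print("Target at the goal:", td_target(-1, q_next, terminated=True))
```

Training nudges the network's prediction for the chosen action toward this target, which is how reward information propagates backward from the goal.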

In [16]:
model = DQN("MlpPolicy", env, verbose=1, tensorboard_log="./dqn_cliff_walking_tensorboard/")

Using cpu device


#### 4. 🏋️‍♂️ Train and Save the DQN Model

Time to train your DQN model and save it for future use 🥳

#### 📝 Steps to Follow:

#### ⏱️ Set Training Step Count
- Define how long you will train the model (in terms of number of interactions with the environment).  
- For now, use: `total_timesteps = 1000`

#### 🧠 Train the Model
- Call `.learn()` on your DQN model to start training.  
- The model will improve its policy over time based on feedback.

#### 💾 Save the Trained Model
- Use `.save()` to save the model to disk.  
- Print the file path to verify it was saved correctly.

🎥 You can follow the steps in the rendered environment during training.

In [17]:
# Set training timesteps

total_timesteps = 1000

# Train the model (set `progress_bar=True` to show progress)

model.learn(total_timesteps=total_timesteps, progress_bar=True)

# Define model save path

model_path = "./dqn_cliff_walking"

# Save the model

model.save(model_path)
print("Model trained and saved to:", model_path)


Logging to ./dqn_cliff_walking_tensorboard/DQN_2


Output()

Model trained and saved to: ./dqn_cliff_walking


Congratulations, you've trained your first RL model!

---

### Section 5: Efficient Training Without Visualization

In this section, you will go through the entire process of preparing the environment, loading the model, and training, but without visual rendering. Disabling visualization during the training phase can significantly speed up the training process as it reduces computational overhead.

#### 📝 Steps to Follow:

#### 🧱 1. Load and Prepare the Environment
- Use `render_mode=None` to disable visual output.

#### ⚙️ 2. Configure and Initialize the DQN Model
- Use `'MlpPolicy'` and `DummyVecEnv` as before.  
- Keep `verbose=1` if you want log output.
- We will add a TensorBoard logging location `dqn_cliff_walking_tensorboard/` so we can track training with TensorBoard.

#### ⏱️ 3. Set Training Parameters and Train
- Increase the step count for better learning (e.g., `total_timesteps = 100_000`).  
- Call `.learn()` to start training.

#### 💾 4. Save the Trained Model
- Use `.save()` to store your model, for example as `dqn_cliff_walking_model.zip`.

By following these steps, you will efficiently train and save a DQN model without the additional computational overhead of visualizing the environment. This approach is particularly useful when training complex models or using limited computational resources.

In [30]:
# Load the environment without rendering to speed up training

env = gym.make("CliffWalking-v0", render_mode=None)

# Wrap the environment to be compatible with SB3

env = DummyVecEnv([lambda: env])

# Create the model: choose a policy, attach the environment, add a logging location

model = DQN("MlpPolicy", env, verbose=1, tensorboard_log="dqn_cliff_walking_tensorboard/")

# Set training timesteps

timesteps = 100_000

# Train the model

model.learn(total_timesteps=timesteps)

# Define model save path

model_path = "dqn_cliff_walking_model.zip"

# Save the model

model.save(model_path)


Using cpu device
Logging to dqn_cliff_walking_tensorboard/DQN_3


KeyboardInterrupt: 

#### 📊 Understanding Training Log Metrics

When training a reinforcement learning model, various metrics help monitor learning progress and performance.

#### 🎲 Rollout Metrics

- **exploration_rate** → Probability of taking a random action.  
  - High = more exploration  
  - Low = more exploitation  

#### ⏱️ Time-Related Metrics

- **episodes** → Number of completed episodes.  
- **fps** → Frames per second (how fast training is running).  
- **time_elapsed** → Total time elapsed since training started (in seconds).  
- **total_timesteps** → Total number of steps taken in the environment.

#### 🧠 Training Metrics

- **learning_rate** → Step size of the updates made to model weights.  
  - Low = slower but more stable learning  
- **loss** → Model's prediction error.  
  - A downward trend usually means learning is working  
- **n_updates** → Number of times the model has updated its weights.

#### 🔍 How to Interpret

- **⬇️ exploration_rate** → Agent is transitioning from exploration to learned behavior.  
- **⚡ High fps** → Training is running efficiently.  
- **📉 Decreasing loss** → Q-value predictions are improving, although the loss can fluctuate because the targets change during training.
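The falling `exploration_rate` follows a simple schedule. The sketch below shows a linear decay; the default values (start at `1.0`, end at `0.05`, decay over the first 10% of training) are assumptions chosen to mirror SB3's documented DQN defaults, and `linear_epsilon` is a hypothetical helper, not an SB3 function.

```python
def linear_epsilon(step, total_timesteps, fraction=0.1, start=1.0, end=0.05):
    """Linearly anneal epsilon from `start` to `end` over the first
    `fraction` of training, then hold it at `end`."""
    progress = min(step / (fraction * total_timesteps), 1.0)
    return start + progress * (end - start)


# Exploration rate at a few points of a 100,000-step training run.
for step in (0, 5_000, 10_000, 100_000):
    print(step, round(linear_epsilon(step, 100_000), 3))
```

Under these assumptions, the agent acts almost entirely at random at the start and mostly follows its learned Q-values after the first 10,000 of 100,000 steps. This is one reason a 1,000-step run rarely produces a useful policy.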

#### 🖥️ Start TensorBoard

TensorBoard allows you to track how training is going. The interactive dashboard shows learning curves, reward trends, and more. 

To use TensorBoard:

1. Open a terminal window.
1. Make sure you're in the challenge folder!
1. Run this command (replace the path if you set a different one during training):
   ```bash
   tensorboard --logdir=./dqn_cliff_walking_tensorboard/
   ```
1. Follow the link shown in the terminal (probably `localhost:6006`).

You can start TensorBoard while your model is still training: its purpose is to track how training is going. At the start of training, you may get a `No dashboards are active for the current data set.` warning. **Be patient: you won't see anything until the first episode completes.**

It's also possible to open TensorBoard inside your notebook:

```python
%load_ext tensorboard
%tensorboard --logdir=./dqn_cliff_walking_tensorboard/
```

### 📦 Section 6: Use a Trained Model

Instead of training again from scratch, you can load the model you just trained to quickly observe how your agent behaves.  
Alternatively, you can load a more powerful pre-trained model if one is available.

#### 📥 1. Load the Pre-trained Model

- Use `.load("path_to_your_pre_trained_model")` to load the model from disk.  
- Replace the path with the actual location of the model you saved.

In [35]:
model = DQN.load(model_path)

#### 🧱 2. Prepare the Environment

Load and reset the environment to start a new episode before running the trained model.

In [40]:
env = gym.make("CliffWalking-v0", render_mode="human")
env = DummyVecEnv([lambda: env])
obs = env.reset()

#### 3. 🧠 Predict the Best Next Action

Use the model's `.predict()` method to select the next action based on the current observation.

In [41]:
action, _states = model.predict(obs, deterministic=True)
print("Action to take:", action)

Action to take: [2]


#### 4. 🚶‍♂️ Take the Step

- Use `.step(action)` to apply the predicted action.  
- Because the environment is wrapped in `DummyVecEnv`, this returns four batched values (one entry per environment):
  - New state  
  - Reward  
  - A `done` flag (if the episode has ended)  
  - Additional information (if any)

If your agent makes strange movements or gets stuck, don't panic: it probably hasn't been trained long enough. Go to the end of the challenge before trying to train longer 😉

In [42]:
obs, reward, done, info = env.step(action)
print(obs, reward, done, info)

[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]


#### 5. 🖼️ Visualize the Environment

Use `.render()` to visualize the current state of the environment after the action has been executed.

In [43]:
env.render()

#### 6. 🧹 Close the Environment

Use `.close()` to properly close the environment and free up system resources.

In [44]:
env.close()

---

### Section 7: Implement the Full Interaction Loop

In this final exercise, you will create a complete loop where your pre-trained model interacts step-by-step with the environment until the episode ends.

- Use a `while` loop to repeat actions.
- Call `.predict()` to select actions and `.step()` to apply them.
- Make sure you feed the new state to the next predict iteration.
- Use the `done` flag from `.step()` to know when to exit the loop.
- Don't forget to call `.render()` after each step to visualize the agent in action.
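The loop described above can be sketched in plain Python. This version adds a step cap so that a stuck agent cannot loop forever; the cap, the `run_episode` helper, and the one-dimensional corridor environment are all hypothetical and only serve to illustrate the control flow.

```python
def run_episode(policy, step_fn, start_state, max_steps=200):
    """Run one episode, stopping at a terminal state or after `max_steps`.

    Returns (steps_taken, total_reward, reached_terminal_state).
    """
    state, total_reward = start_state, 0
    for steps_taken in range(1, max_steps + 1):
        action = policy(state)                        # choose an action
        state, reward, done = step_fn(state, action)  # feed the new state forward
        total_reward += reward
        if done:
            return steps_taken, total_reward, True
    return max_steps, total_reward, False


def corridor_step(state, action):
    """Hypothetical corridor with states 0..5, where state 5 is the goal."""
    new_state = min(max(state + action, 0), 5)
    return new_state, -1, new_state == 5


# A policy that always moves right reaches the goal.
print(run_episode(lambda s: +1, corridor_step, 0))                # (5, -5, True)
# A policy that always moves left is stuck; the cap ends the loop.
print(run_episode(lambda s: -1, corridor_step, 0, max_steps=20))  # (20, -20, False)
```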

Again, if your agent selects strange movements or gets stuck in an infinite loop, don't panic: it probably hasn't been trained long enough. If it gets stuck, stop the cell execution.

👉 Go to the end of the challenge before trying to train longer.

In [45]:
# Load the pre-trained model

model = DQN.load(model_path)

# Initialize the environment

env = gym.make("CliffWalking-v0", render_mode="human")

# Wrap the environment in DummyVecEnv to make it compatible with SB3

env = DummyVecEnv([lambda: env])

# Start a new episode

obs = env.reset()
done = False

# Run the interaction loop

while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    print(obs, reward, done, info)
    env.render()

# Close the environment

env.close()


[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'prob': 1.0, 'TimeLimit.truncated': False}]
[36] [-1.] [False] [{'pro

KeyboardInterrupt: 

### 🤔 Not Satisfied with Your Agent's Behavior?

Don't worry, you have options:

- 🏋️‍♂️ **Train longer** → Try increasing `total_timesteps` to improve performance.
- 📦 **Or load our pre-trained model** →  
  We trained one for **500,000 episodes**. Run the cell below to download it.

In [46]:
!curl https://d37p7d5kaxknzw.cloudfront.net/projects/best_dqn_cliffwalking.zip > best_dqn_cliffwalking.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  139k  100  139k    0     0   188k      0 --:--:-- --:--:-- --:--:--  188k


### ❄️ Final Note: Understanding `is_slippery=True`

You may have wondered what the `is_slippery` parameter does while reading the documentation.

Setting `is_slippery=True` in `CliffWalking` adds randomness to the agent's actions: there is a chance that the agent will slip and move in an unintended direction. Training with `is_slippery=True` creates agents that can handle uncertainty, a critical skill for real-world applications.

#### 🎯 Why add stochasticity?

- **🌍 Realism** → Simulates the uncertainty found in real-world environments.  
- **🛡️ Robustness** → Helps agents learn more reliable, adaptable strategies.  
- **🔥 Challenge** → Makes the task more difficult and more interesting to solve.

#### 🤔 So why didn't we set it to `True` here?

The basic DQN setup used in this challenge struggles with the high stochasticity that `is_slippery=True` adds to the `CliffWalking` environment. Think about it: in this simple environment, not following the intended direction immediately means going in a completely different direction: 90 degrees, or even 180 degrees! Next to a cliff that costs `-100`, a single slip can be very expensive, which makes it very difficult for a simple algorithm to learn.
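To see how much noise slipping introduces, here is a toy simulation. The slip rule (with probability `slip_prob`, the action is replaced by one of the two perpendicular actions) and the value `1/3` are assumptions made for this illustration; they are not taken from Gymnasium's implementation, and `slippery_action` is a hypothetical helper.

```python
import random


# Action indices: 0 = up, 1 = right, 2 = down, 3 = left.
def slippery_action(intended, slip_prob, rng):
    """With probability `slip_prob`, replace the intended action by one of
    the two perpendicular actions (assumed slip rule, for illustration)."""
    if rng.random() < slip_prob:
        return rng.choice([(intended - 1) % 4, (intended + 1) % 4])
    return intended


rng = random.Random(0)  # fixed seed so the run is repeatable
executed = [slippery_action(1, 1 / 3, rng) for _ in range(10_000)]
share_intended = executed.count(1) / len(executed)
print(f"Intended action executed in {share_intended:.0%} of steps")
```

Under this assumed rule, roughly one step in three goes sideways. When the agent walks right along the edge of the cliff, one of those sideways moves is straight down into it.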

### Congratulations on completing the project 🏁