# **Assignment 5: Perception**

### **Due Date**: 04/29/2025 at 11:59 PM

### **Late Due Date**: 05/02/2025 at 11:59 PM

# **Introduction**

Welcome to Assignment 5 of CS 4756/5756. In this assignment, you will train policies under various state estimators. In Assignment 4, we tried to estimate the transition function through a world model. This assignment is very similar, but this time we will try estimating the state instead.

This assignment consists of the following components:

- **[PROVIDED] Setup**: Dependency installing and initializations.
- **[PROVIDED] Helper Functions**: Provided functions for visualization and evaluation.
- **Part 1**: Train an expert PPO policy with StableBaselines3.
- **Part 2**: Train a state estimator on RGB image data.
- **Part 3**: Train a learner PPO policy using the state estimator.
- **[GRAD] Part 4**: Combine the state estimator and policy into an end-to-end policy.

You will use the **FetchReach-v4** environment for this assignment. Refer to the Gymnasium-Robotics website for more details about this [environment](https://robotics.farama.org/envs/fetch/reach/).

Please read through the following paragraphs carefully.

**Getting Started**: You should complete this assignment on [Google Colab](https://colab.research.google.com).

**Evaluation**: Your code will be tested for correctness and, for certain assignments, speed. For this particular assignment, performance results will not be harshly graded (although we provide approximate expected reward numbers, you are not expected to replicate them exactly). Please remember that all assignments should be completed individually.

**Academic Integrity**: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

**Getting Help**: The [Resources](https://www.cs.cornell.edu/courses/cs4756/2025sp/#resources) section on the course website is your friend! If you ever feel stuck in these projects, please feel free to avail yourself of office hours and Edstem! If you are unable to make any of the office hours listed, please let the TAs know and we will be happy to assist. If you need a refresher for PyTorch, please see this [60 minute blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)! For NumPy, please see the quickstart [here](https://numpy.org/doc/stable/user/quickstart.html) and full API [here](https://numpy.org/doc/stable/reference/).

# **[PROVIDED] Setup**

Please run the cells below to install the necessary packages.

In [None]:
import sys
USING_COLAB = 'google.colab' in sys.modules

if USING_COLAB:
    !apt-get -qq update
    !apt-get -qq install -y libosmesa6-dev libgl1-mesa-glx libglfw3 libgl1-mesa-dev libglew-dev patchelf
    !apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
else:
    !pip install torch torchvision torchaudio
    !pip install numpy
    !pip install tqdm
    !pip install opencv-python

!pip install matplotlib
!pip install -U mediapy
!pip install -U renderlab
!pip install -U "imageio<3.0"
!pip install stable_baselines3

!git clone https://github.com/Farama-Foundation/Gymnasium-Robotics.git
!pip install -e Gymnasium-Robotics
sys.path.append('/content/Gymnasium-Robotics')

In [None]:
import os
# Mujoco GLEW Setup
try:
    if _mujoco_run_once:  pass
except NameError:
    _mujoco_run_once = False

if not _mujoco_run_once:
    try:
        os.environ['LD_PRELOAD']=os.environ['LD_PRELOAD'] + ':/usr/lib/x86_64-linux-gnu/libGLEW.so'
    except KeyError:
        os.environ['LD_PRELOAD']='/usr/lib/x86_64-linux-gnu/libGLEW.so'

    # Presetup so we don't see output on first env initialization
    _mujoco_run_once = True
    if USING_COLAB:
        NVIDIA_ICD_CONFIG_PATH = '/usr/share/glvnd/egl_vendor.d/10_nvidia.json'
        if not os.path.exists(NVIDIA_ICD_CONFIG_PATH):
            with open(NVIDIA_ICD_CONFIG_PATH, 'w') as f:
                f.write("""{
                    "file_format_version" : "1.0.0",
                    "ICD" : {
                        "library_path" : "libEGL_nvidia.so.0"
                    }
                }""")

    # Set environment variable to support EGL (off-screen) rendering
    %env MUJOCO_GL=egl

Please run the cells below to import necessary packages and set the initial seeding.

In [None]:
from gymnasium_robotics.envs.fetch.reach import MujocoFetchReachEnv, MujocoPyFetchReachEnv
from torch.utils.data import DataLoader
import gymnasium.wrappers as wrappers
import matplotlib.pyplot as plt
import torch.distributions as D
from tqdm import tqdm, trange
import torch.optim as optim
import gymnasium_robotics
import gymnasium as gym
import torch.nn as nn
import numpy as np
import random
import torch

In [None]:
seed = 695

# Setting the seed to ensure reproducibility
def reseed(seed, env=None):
    torch.manual_seed(seed)
    random.seed(seed)
    np.random.seed(seed)

    if env is not None:
        env.unwrapped._np_random = gym.utils.seeding.np_random(seed)[0]

reseed(seed)

Please run the cells below to simplify the `FetchReach-v4` environment for easier training.

In [None]:
# In this block we define wrappers necessary to simplify the environment MDP
def wrap_reach_fixed_goal(env, fixed_goal_noise : float = 0.0):
    g = np.array([1.486, 0.73, 0.681], dtype=np.float32)
    def sample_goal():
        noise = np.random.uniform(-fixed_goal_noise, fixed_goal_noise, size=3)
        return g + noise
    env.unwrapped._sample_goal = sample_goal
    return env

class FetchRewardWrapper(gym.Wrapper):
    def reset(self, *args, **kwargs):
        obs, info = self.env.reset(*args, **kwargs)
        self.prev_dist = np.linalg.norm(obs['achieved_goal'] - obs['desired_goal'])
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action) # Terminated is never set to true
        current_dist = np.linalg.norm(obs['achieved_goal'] - obs['desired_goal'])
        reward = (self.prev_dist - current_dist) * 10
        self.prev_dist = current_dist
        return obs, reward, info['is_success'], truncated, info

In [None]:
# Let's initialize the environment first
reseed(seed)

def make_fetch_env(width : int = 120, height : int = 120, fixed_goal_noise : float = 0.03):
    env = MujocoFetchReachEnv(width=width, height=height, render_mode="rgb_array")
    env = wrappers.TimeLimit(env, 50)
    env = wrap_reach_fixed_goal(env, fixed_goal_noise)
    env = FetchRewardWrapper(env)
    env = wrappers.FilterObservation(env, ["desired_goal", "observation"])
    env = wrappers.FlattenObservation(env)
    return env

# **[PROVIDED] Helper Functions**

### **Policy Evaluation Functions**

The `evaluate_policy` function takes a policy actor, an environment whose output observations can be applied to the actor, and evaluates the policy by doing the following:

- Roll out the actor for `num_episodes` trajectories (100 by default) and record the total reward.
- Return the average trajectory reward over these episodes.

**Note:** Since the actor defined in this assignment exclusively uses a StableBaselines3 PPO policy, the environment provided must be an instance of `VecEnv`, which is introduced in Part 1.

The `success_rate` function is similar to the `evaluate_policy` function, except that it takes a regular gymnasium environment instead of a vectorized environment. It also reports the fraction of successful episodes instead of the total reward.

In [None]:
def evaluate_policy(actor, environment, num_episodes=100, progress=True):
    """
        Returns the mean trajectory reward of rolling out `actor` on `environment`.

        Parameters
        - actor: PPOActor instance, defined in Part 1.
        - environment: stable_baselines3.common.vec_env.VecEnv instance.
        - num_episodes: Total number of trajectories to collect and average over.
    """

    total_rew = 0
    iterate = (trange(num_episodes) if progress else range(num_episodes))

    for _ in iterate:
        obs = environment.reset()
        done = False

        while not done:
            action = actor.select_action(obs)
            next_obs, reward, done, info = environment.step(action)
            total_rew += reward
            obs = next_obs

    return (total_rew / num_episodes).item()


def success_rate(actor, environment, num_episodes=100, progress=True):
    """
        Returns the percentage of successful trajectories of `actor` on `environment`.

        Parameters
        - actor: PPOActor instance, defined in Part 1.
        - environment: Gymnasium environment.
        - num_episodes: Total number of trajectories to collect and average over.
    """

    total_success = 0
    iterate = (trange(num_episodes) if progress else range(num_episodes))

    for _ in iterate:
        obs, info = environment.reset()
        done = False

        while not done:
            action = actor.select_action(obs)
            next_obs, reward, done, truncated, info = environment.step(action)
            obs = next_obs

            if done: total_success += 1
            if truncated: break

    return (total_success / num_episodes)

### **Notes About Fetch Reach Environment**

The environment uses a Fetch Robot, which is a 7-DoF Mobile Manipulator.

The task is a _goal-reaching task_: The observation space contains `observation` which includes the state of the robot in the environment, and `desired_goal` which specifies the xyz coordinate that the robot's gripper aims to reach.

See https://robotics.farama.org/envs/fetch/reach/ for more details.

If the goal is reached, `info['is_success']` will be set to 1, and this is an indication that we should terminate the rollout.

The reward is normally -1 per timestep spent in the environment without completing the task, with 50 steps being the limit (so -50 is the worst episode return).

> Note: For this assignment, we've modified the environment so that it only has a partially random fixed goal to reach, has better reward shaping, and renders smaller frames. This is to make training easier and quicker later on.
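As a rough illustration of the reward shaping, `FetchRewardWrapper` above returns 10 times the decrease in gripper-to-goal distance at each step. The per-step rewards therefore telescope: an episode's return equals 10 times (initial distance minus final distance). The distances below are made-up values for illustration, not measurements from the environment, and `shaped_rewards` is a helper name introduced here.

```python
import numpy as np

def shaped_rewards(distances, scale=10.0):
    """Per-step rewards in the style of FetchRewardWrapper:
    scale * (previous distance - current distance)."""
    d = np.asarray(distances, dtype=np.float64)
    return scale * (d[:-1] - d[1:])

# Hypothetical gripper-to-goal distances at reset and after each of four steps.
distances = [0.16, 0.12, 0.07, 0.05, 0.04]
rewards = shaped_rewards(distances)
episode_return = rewards.sum()

# The sum telescopes to scale * (first distance - last distance).
print(rewards, episode_return)
```

Under this shaping, the return is bounded by 10 times the starting distance, which is why episode returns in this assignment are small positive numbers rather than values near -50.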

**Run the cell below to create the environment:**

In [None]:
real_env = make_fetch_env()

# **Part 1: Train Expert Using StableBaselines3**

### **1.1: [PROVIDED] Introduction To Stable Baselines 3**

StableBaselines3 is a popular off-the-shelf set of reliable implementations of reinforcement learning algorithms in PyTorch. In this assignment, we will be using its PPO (Proximal Policy Optimization) implementation as our policy.

Each algorithm implementation is a subclass of the `stable_baselines3.common.base_class.BaseAlgorithm` class, which provides us with the following functions:

- `learn(total_timesteps, callback=None, log_interval=100, tb_log_name='run', reset_num_timesteps=True, progress_bar=False)`
  - This is the training loop of any of the RL algorithm implementations. Training is done by calling this function with an appropriate amount of `total_timesteps`.
- `predict(observation)`
  - Returns a tuple `(predicted_action, next_hidden_state)` based on input `observation`. If we are not using an RNN, the next hidden state can be neglected.
- `save(path)`
  - Saves the current policy parameters into a `.zip` file at the given `path`. Note that `path` is given without the `.zip` suffix.
- `load(path, env=None)`
  - Loads a saved `.zip` checkpoint into this RL implementation model.

### **1.2: [PROVIDED] Hyperparameters**

The implementation has a set of hyperparameters that can be tuned towards better performance. For the sake of simplicity, we will provide the hyperparameters for the StableBaselines3 PPO implementation. The main ones we specify include the following:

- `n_steps`: the number of steps to run with the environment for each update to the policy network.
- `net_arch`: The network architecture of the policy network and the critic network:
  - `pi`: a list that specifies the hidden dimensions of the policy network. The input and output dimension are determined by the environment associated with this policy.
  - `vf`: a list that specifies the hidden dimensions of the critic network.
  - `activation_fn`: Nonlinearity to be applied between each of the MLP layers.

For a more comprehensive list and description of these hyperparameters, see the official [documentation page](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#parameters).


In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.vec_env.base_vec_env import VecEnv

hyperparameters = {
    "n_steps": 512,
    "policy_kwargs": {
        "net_arch": {
            "pi": [128],
            "vf": [128],
            "activation_fn": "tanh",
        }
    },
}

### **1.3: Vectorized Environment**

For any StableBaselines3 algorithm implementation, the gymnasium environment used needs to be converted into a [vectorized environment](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#) of `VecEnv` type.

A vectorized environment stacks multiple independent environments into one, stepping all `n` of them at once. If we set the `n_envs` parameter to 3, then 3 environments will be stepped each time the VecEnv is stepped.

**For the rest of this assignment, all vectorized environments with `n_envs=n` will be described as n-vectorized.**

With a vectorized environment that steps multiple environments at the same time, the model learning process can be made more efficient by parallelizing trajectory collection across these independent environments. The `n_envs` parameter can be tailored to the specific machine.

**Important Differences:**
- Vectorized environments require the input action to have shape `(n_envs, act_dim)`. The output observation from `step` and `reset` will likewise have shape `(n_envs, obs_dim)`.
- The VecEnv `reset()` function returns only the observation, while the gymnasium.Env `reset()` function returns a tuple `(observation, info_dict)`.
- The `vec_env.step(action)` function returns a 4-tuple of `(obs, reward, done, info)`, while `gym_env.step(action)` returns a 5-tuple of `(obs, reward, terminated, truncated, info)`. The `done` value from the VecEnv is equivalent to the gymnasium environment's `terminated or truncated`.
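The shape conventions above can be sketched with plain NumPy arrays. This is only an illustration of the array layout, not SB3 code; the sizes (`act_dim = 4`, `obs_dim = 13`) are assumptions chosen to match the flattened FetchReach setup in this notebook, and the reward values are made up.

```python
import numpy as np

n_envs, act_dim, obs_dim = 3, 4, 13  # assumed sizes for illustration

# A batch of actions: one row per sub-environment.
actions = np.zeros((n_envs, act_dim), dtype=np.float32)

# Observations and rewards come back batched the same way.
obs = np.zeros((n_envs, obs_dim), dtype=np.float32)
rewards = np.array([0.2, -0.1, 0.05], dtype=np.float32)

# Gymnasium reports two flags per step; a VecEnv reports their logical OR.
terminated = np.array([False, True, False])
truncated = np.array([False, False, True])
dones = np.logical_or(terminated, truncated)

# With a 1-vectorized env the reward is a length-1 array, so a running total
# stays an array and needs `.item()` to become a Python float.
total = np.zeros(1, dtype=np.float32) + np.array([1.5], dtype=np.float32)
mean_reward = (total / 1).item()

print(actions.shape, obs.shape, dones, mean_reward)
```

The last two lines mirror why `evaluate_policy` above calls `.item()` on its result when given a 1-vectorized environment.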


A VecEnv instance can be created using the `make_vec_env` function, which takes the id of the desired gymnasium environment, as well as the number of environments needed. This function has the following key parameters:
- `env_id`: the id of the gymnasium environment, or instantiated gym environment, or a callable that returns an env.
- `n_envs`: The number of environments to have in parallel.
- `seed`: The initial seed for the random number generator.
- `env_kwargs`: An optional parameter to pass into the environment constructor.

More detailed function documentation can be found in this [page](https://stable-baselines3.readthedocs.io/en/master/common/env_util.html#stable_baselines3.common.env_util.make_vec_env).

**Instructions** For this part, please create two vectorized versions of `FetchReach-v4`, with 3 and 1 environments stacked. Because of our wrappers, you need to pass a callable; we have one called `make_fetch_env` defined above. When initializing the vectorized environments, leave the parameters of `make_fetch_env` at their default values (you can directly pass in the function as the environment constructor for `make_vec_env`).

In [None]:
from stable_baselines3.common.env_util import make_vec_env

# TODO: Instantiate
real_vec_env_3 = None
real_vec_env_1 = None
# END TODO

### **1.4: Actor Definition**

**Instruction**: You will need to implement the following PPOActor class, which serves as a wrapper to provide PPO model predictions.
- `__init__`: Takes a path to the checkpoint and the corresponding environment, and loads an instance of this PPO checkpoint. However, if a PPO model is given, it is stored directly as the internal model instead. This is for use in the callback, and since we provide that implementation for you, you only need to implement the model-loading portion of the constructor.
- `select_action`: Takes an observation and produces the corresponding action prediction from the checkpoint PPO model using **deterministic** mode. While implementing, take note of the output of the `predict` function.

**Note:** To limit the noise introduced in the environment and make it easier to achieve the expected ranges, please use the parameter `deterministic=True` in your `select_action` function.

In [None]:
class PPOActor():
    def __init__(self, ckpt: str=None, environment: VecEnv=None, model : PPO =None):
        '''
          Requires environment to be a 1-vectorized environment

          The `ckpt` is a .zip file path that leads to the checkpoint you want
          to use for this particular actor.

          If the `model` variable is provided, then this constructor will store
          that as the internal representing model instead of loading one from the
          checkpoint path
        '''
        assert ckpt is not None or model is not None

        if model is not None:
            self.model = model
            return

        # TODO: Load checkpoint into a `PPO` class
        self.model = None
        # END TODO

    def select_action(self, obs):
        '''Gives the action prediction of this particular actor'''

        # TODO: compute action of the PPO policy using deterministic mode

        # END TODO

### **1.5: [PROVIDED] Callbacks**

Since training can take a significant amount of time, StableBaselines3 provides a means of monitoring training progress through a `BaseCallback` instance, which can optionally be passed as a parameter to the `learn` function. The callback is customizable by defining a subclass of `BaseCallback`.

For this part, we provide you with a customized callback that evaluates the model under training every 256 steps on an evaluating environment, which will be the 1-vectorized environment you have instantiated in the previous portion. Based on this evaluation result, this callback will save a checkpoint of the model if it is, so far, the best performing model. At the end of training, a plot of all evaluation results with respect to number of steps will be generated.
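One detail worth keeping in mind when reading the plot: the callback tests `num_timesteps % save_freq == 0`, and with an n-vectorized training environment SB3 advances `num_timesteps` by `n` on every vectorized step. The sketch below works out, under that assumption, which timesteps would trigger an evaluation for a 20,000-step budget. `eval_timesteps` is a hypothetical helper, not part of the provided code.

```python
def eval_timesteps(total_timesteps, n_envs, save_freq=256):
    """Timesteps at which `num_timesteps % save_freq == 0`, assuming the
    counter grows by `n_envs` on every vectorized step."""
    counter_values = range(n_envs, total_timesteps + 1, n_envs)
    return [t for t in counter_values if t % save_freq == 0]

single = eval_timesteps(20000, n_envs=1)
triple = eval_timesteps(20000, n_envs=3)

print(single[:3], len(single))
print(triple[:3], len(triple))
```

With three environments the counter only takes multiples of 3, so it lands on a multiple of 256 every 768 timesteps. Evaluations are therefore sparser than with a single training environment.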

You are free to modify this callback class to help you visualize training in whatever way is most convenient for you, but this is **NOT REQUIRED**.

In [None]:
class PPOCallback(BaseCallback):
    def __init__(self, verbose=0, save_path='default', eval_env=None, graph=True):
        super(PPOCallback, self).__init__(verbose)
        self.rewards = []
        self.graph = graph

        self.save_freq = 256
        self.min_reward = -np.inf
        self.actor = None
        self.eval_env = eval_env

        self.save_path = save_path
        self.eval_steps = []
        self.eval_rewards = []

    def _init_callback(self) -> None:
        pass

    def _on_training_start(self) -> None:
        """
        This method is called before the first rollout starts.
        """
        self.actor = PPOActor(model=self.model)

    def _on_rollout_start(self) -> None:
        """
        A rollout is the collection of environment interaction
        using the current policy.
        This event is triggered before collecting new samples.
        """
        pass

    def _on_rollout_end(self) -> None:
        """
        This event is triggered before updating the policy.
        """
        episode_info = self.model.ep_info_buffer
        rewards = [ep_info['r'] for ep_info in episode_info]
        mean_rewards = np.mean(rewards)
        self.rewards.append(mean_rewards)

    def _on_step(self) -> bool:
        """
        This method will be called by the model after each call to `env.step()`.

        For child callback (of an `EventCallback`), this will be called
        when the event is triggered.

        :return: If the callback returns False, training is aborted early.
        """
        if self.eval_env is None:
            return True

        if self.num_timesteps % self.save_freq == 0 and self.num_timesteps != 0:
            mean_reward = evaluate_policy(self.actor, environment=self.eval_env, num_episodes=20)
            print(f'evaluating {self.num_timesteps=}, {mean_reward=}=======')

            self.eval_steps.append(self.num_timesteps)
            self.eval_rewards.append(mean_reward)
            if mean_reward > self.min_reward:
                self.min_reward = mean_reward
                self.model.save(self.save_path)
                print(f'model saved on eval reward: {self.min_reward}')

        return True

    def _on_training_end(self) -> None:
        """
        This event is triggered before exiting the `learn()` method.
        """
        print(f'model saved on eval reward: {self.min_reward}')

        if self.graph:
            plt.plot(self.eval_steps, self.eval_rewards, c='red')
            plt.xlabel('Timesteps')
            plt.ylabel('Mean Evaluation Reward')
            plt.title('Evaluation Reward over Timesteps')

            plt.show()
            plt.close()

### **1.6: PPOActor Initialization And Training**

The `stable_baselines3.ppo.PPO` class inherits from the `BaseAlgorithm` class described at the beginning of this section, and is specifically implemented for the PPO algorithm. To initialize an instance, the following parameters are especially important:
- `policy: str`: The policy type we use to train the policy; common ones include MlpPolicy and CnnPolicy. In our case, we will be using the MlpPolicy.
- `env: VecEnv`: The environment that the policy rolls out on for training. It must be vectorized, or it will be vectorized by the PPO implementation.
- `n_steps`: The number of steps to run in each environment per policy update.
- `device`: The device to put the model on (for this assignment, if you're not able to reach the performance bounds, try setting this parameter to cpu).
- Other hyperparameters specified in the `hyperparameters` dictionary we provided can be applied directly using the `**` operator.
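As a reminder of how `**` unpacking behaves, the sketch below uses a hypothetical stand-in function, `describe_ppo_call`, in place of the real `PPO` constructor. Its defaults and the environment name are placeholders for illustration. Each top-level key of the dictionary becomes a keyword argument.

```python
def describe_ppo_call(policy, env, n_steps=2048, policy_kwargs=None, device="auto"):
    """Hypothetical stand-in that just reports the arguments it received."""
    return {
        "policy": policy,
        "env": env,
        "n_steps": n_steps,
        "policy_kwargs": policy_kwargs,
        "device": device,
    }

demo_hyperparameters = {
    "n_steps": 512,
    "policy_kwargs": {"net_arch": {"pi": [128], "vf": [128]}},
}

# Equivalent to passing n_steps=512 and policy_kwargs={...} explicitly.
call = describe_ppo_call("MlpPolicy", "my_vec_env", device="cpu", **demo_hyperparameters)
print(call["n_steps"], call["device"])
```

Keys in the dictionary override the function's defaults, while arguments such as `device` can still be passed alongside the unpacked dictionary as long as they do not repeat one of its keys.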

**Instructions**
- Initialize a PPO MLP policy as expert, using the 3-env VecEnv initialized in the previous part and pass in the given hyperparameters.
- Train the expert with an instance of the `PPOCallback` defined before. There is no need to save the resulting model into a checkpoint, since that is done for you in the callback class.
  - (HINT): Look at the beginning of Part 1 for useful functions for training.


**Estimated Training Time**:
- 2 - 4 minutes on Google Colab CPU

In [None]:
reseed(seed)
ckpt_path = 'expert'
total_steps = 20000
expert_callback = PPOCallback(save_path=ckpt_path, eval_env=real_vec_env_1)

# TODO: Instantiate and train
expert = None
# END TODO

### **1.7: Evaluate Expert**

**Instructions** Initialize an expert PPOActor instance from the checkpoint and evaluate the expert policy for 100 trajectories using the `evaluate_policy` and `success_rate` functions on the real environment.

**Expected Reward**: Around 1.5 - 1.7 on `real_vec_env_1`

**Expected Success**: Around 0.95 on `real_env`

In [None]:
expert = PPOActor(ckpt_path, real_vec_env_1)

# TODO: Evaluate

# END TODO

# **Part 2: Collect Data And Train State Estimator**

### **2.1: [PROVIDED] Overview**
Unlike in simulation, we might not have access to the exact state in real-world scenarios and instead have to estimate it. We emulate that situation in this assignment.

Suppose we do not have access to the underlying logic of `FetchReach-v4`, but we do have an expert policy that solves this particular problem. We then take the following state estimation approach to learn an RL policy that can be applied to the real scenario.

In real life, we might not have such a trained expert; a human operating the robot remotely could be one source of expert data.

1. Roll out a series of expert trajectories in the true environment (analogous to collecting a set of human demonstrations on the robot).
    - Note that we need to record both perception data and the true state.
2. Define and train a state estimator with the perception data as input.
3. Define a new environment that applies the trained state estimator.
4. Learn an RL policy under the learned environment.
5. Evaluate this policy using the real environment.

You will need to implement the following functions and classes:
- `data_collect`: a helper function that rolls out a policy on an environment and returns a tuple of lists containing the collected renders and observations.
- `StateEstimator`: a `torch.nn` module defining the architecture for perception.
- `train_state_estimator` and `eval_state_estimator`: the training and evaluation loops of the state estimator.

Follow the instructions below to implement each of these components.

### **2.2: Collect Data**

**Instructions**

The `data_collect` function should roll out a policy actor on the environment for a total of `num_steps` steps, with a maximum trajectory length of `traj_max_length`, and then return two lists: `renders` and `observations`, where `renders` is the camera view at every step and `observations` is the state at every step.

**Hints:**
- The `.render()` function for MuJoCo environments will be useful.
- Also remember to apply the given transform to all obtained renders.
- Make sure to end a trajectory if `done` or `truncated` is true.
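For intuition about the provided transform: `T.ToPILImage()` followed by `T.ToTensor()` turns an `H x W x 3` `uint8` frame into a `3 x H x W` float tensor scaled to `[0, 1]`. The NumPy sketch below mimics that conversion on a synthetic frame purely for illustration; `to_chw_float` is a name introduced here, and your implementation should still use the provided `transform`.

```python
import numpy as np

def to_chw_float(frame):
    """NumPy analogue of ToPILImage + ToTensor for a uint8 RGB frame."""
    assert frame.dtype == np.uint8 and frame.ndim == 3 and frame.shape[2] == 3
    # Move channels first, then rescale 0..255 integers to 0..1 floats.
    return frame.transpose(2, 0, 1).astype(np.float32) / 255.0

# Synthetic 120 x 120 RGB frame standing in for the output of env.render().
frame = np.full((120, 120, 3), 255, dtype=np.uint8)
frame[:, :, 2] = 0  # zero out the blue channel

chw = to_chw_float(frame)
print(chw.shape, chw.dtype, chw.min(), chw.max())
```

The channel-first layout is what the state estimator's convolutional layers expect as input.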

In [None]:
import torchvision.transforms as T

def data_collect(num_steps: int, traj_max_length: int, data_env: gym.Env, actor: PPOActor = None):
    renders, observations = [], []
    transform = T.Compose([T.ToPILImage(), T.ToTensor()])

    # TODO: Step and collect data

    # END TODO

    return renders, observations

**Instructions**
Run the data collection function on the real environment with the expert policy trained in Part 1.

**Note**: The `data_collect` function requires the environment provided to be a regular gymnasium environment instead of a vectorized environment. Please make sure not to confuse it with `real_vec_env_1` defined in Part 1.3.

**Note**: Here is a list of currently created environments:
- `real_env`
- `real_vec_env_1`
- `real_vec_env_3`

Refer to function documentation for selecting which one to use when doing function calls.

**Estimated Collection Time**:
- 2 - 4 minutes on Google Colab CPU

In [None]:
total_steps = 10000
traj_max_length = 100
reseed(seed, env=real_env)

# TODO: Collect data
renders, observations = None, None
# END TODO

### **2.3: [PROVIDED] Visualize And Create Dataset**

**Note** The visualization below shows multiple coordinates of the observation at the same time, so it may look a bit odd. You should see 3 distinct curves that oscillate up and down.

In [None]:
def visualize_collected_data(renders, observations):
    '''
        Takes the first 300 data points and plots coordinates 3 to 5 of each observation.
    '''
    print(f'Dataset Size: {len(observations)}')
    print(f'Observation Size: {observations[0].shape}')
    print(f'Render Size: {renders[0].shape}')
    plt.close()
    plt.plot(np.arange(300), [obs[3:6] for obs in observations[:300]], c='blue')
    plt.show()

visualize_collected_data(renders, observations)

In [None]:
from torch.utils.data import Dataset

class RenderDataset(Dataset):
    def __init__(self, renders, observations):
        self.renders = renders
        self.observations = [obs.astype(np.float32) for obs in observations]

    def __len__(self):
        return len(self.observations)

    def __getitem__(self, idx):
        return {
            'render': self.renders[idx],
            'observation': self.observations[idx]
        }

split = len(observations) // 5
val_data = RenderDataset(renders[:split], observations[:split])
train_data = RenderDataset(renders[split:], observations[split:])

train_dataloader = DataLoader(train_data, batch_size=128)
val_dataloader = DataLoader(val_data, batch_size=128)

### **2.4: Define State Estimator**

The `StateEstimator` class should define a neural network that takes a render and outputs the corresponding state in the state space. The network should have the following architecture:

- Layer 1: a conv block made from a `nn.Sequential` module.
    - There should be three convolutional layers that go from `3 -> 8 -> 16 -> 32` channels.
    - Each convolutional layer should have a `kernel_size=3`, `stride=2`, and `padding=1`.
    - There should be a `nn.ReLU` after every convolutional layer.
    - The final layer in the block should be a `nn.AdaptiveAvgPool2d` that reduces both the height and width of the image to 1.
- Layer 2: a fully-connected layer with `32` input nodes and `16` output nodes, followed by another ReLU.
- Output layer: a fully-connected layer with `16` input nodes and `13` output nodes.

The `forward` function should take only one input: `renders` of size `b x 3 x 120 x 120`.
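To sanity-check the architecture, the spatial size after each convolution follows the standard formula `floor((n + 2 * padding - kernel_size) / stride) + 1` (assuming the default dilation of 1). The helper below, `conv_out`, is a name introduced here for illustration; it traces a 120 x 120 input through the three layers.

```python
def conv_out(n, kernel_size=3, stride=2, padding=1):
    """Output length of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel_size) // stride + 1

size = 120
sizes = [size]
for _ in range(3):  # three conv layers with identical kernel/stride/padding
    size = conv_out(size)
    sizes.append(size)

print(sizes)
```

The spatial size halves at each layer, ending at 15 x 15, so the conv stack yields a `b x 32 x 15 x 15` tensor. Adaptive average pooling to 1 x 1 gives `b x 32 x 1 x 1`, which needs to be flattened to `b x 32` before the first fully-connected layer.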

In [None]:
class StateEstimator(nn.Module):
    def __init__(self):
        super(StateEstimator, self).__init__()

        # TODO: Define architecture
        self.conv = None
        self.fc1 = None
        self.fc2 = None
        # END TODO

    def forward(self, renders):
        # TODO: Calculate the estimated state
        return None
        # END TODO

### **2.5: Training And Validation Function For State Estimator**

The `train_state_estimator` function should train the provided model for one epoch, using the optimizer and criterion provided on the given train_dataloader. This function should iterate through each batch of the `train_dataloader` once, update the state estimator based on the loss calculated by criterion, then step the optimizer.
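To make the return value concrete: the function returns the mean of the per-batch losses, so `cnt` counts batches, not samples. The sketch below counts batches under the assumption that data collection produced exactly 10,000 samples, so that the split in 2.3 leaves 8,000 for training; the per-batch loss values are made up.

```python
import math

num_samples = 10000            # assumed dataset size
val_size = num_samples // 5    # first fifth is held out for validation
train_size = num_samples - val_size
batch_size = 128

# DataLoader keeps the final partial batch by default, hence the ceiling.
train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)
last_train_batch = train_size - (train_batches - 1) * batch_size

# Mean of per-batch losses, mirroring `total_loss / cnt`.
batch_losses = [0.9, 0.7, 0.5]  # hypothetical values
mean_loss = sum(batch_losses) / len(batch_losses)

print(train_batches, val_batches, last_train_batch, mean_loss)
```

Because the final batch is smaller than the others, this per-batch mean can differ slightly from a per-sample average; the difference is usually negligible for monitoring purposes.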

In [None]:
def train_state_estimator(model, optimizer, criterion, train_dataloader):
    '''
        This function should train the torch model `model` using the
        optimizer `optimizer` and `criterion` as the loss function, on one pass
        of the `train_dataloader`.

        This should train the model for one epoch, as in one pass through
        the training data.

        Returns: the mean criterion loss across each batch of the dataset.
    '''
    total_loss, cnt = 0, 0
    model.train()

    # TODO: Update the model for one epoch

    # END TODO

    return total_loss / cnt

The `eval_state_estimator` function is similar to `train_state_estimator`: it iterates through the batches of the `eval_dataloader` and computes the loss using the given criterion. Note that no update to the model should be made and gradients should not be calculated during the forward pass.

In [None]:
def eval_state_estimator(model, criterion, eval_dataloader):
    '''
        This function should evaluate the torch model `model` using
        `criterion` as loss function, on one pass of the `eval_dataloader`

        This should evaluate the model on the validation dataset.

        Take note that during evaluation, the model should not be updated
        in any way and gradients should not be calculated.

        Returns: the mean criterion loss across each batch of the dataset.

    '''
    total_loss, cnt = 0, 0
    model.eval()

    # TODO: Evaluate the model across the whole dataset

    # END TODO

    return total_loss / cnt

### **2.6: Train The State Estimator**

Train an instance of `StateEstimator` for `20` epochs with the dataloaders built in the previous section, using the Adam optimizer and MSE loss, with an `lr` of `0.0001`. Provide a plot of training and evaluation losses with respect to training epochs, and also print out the final evaluation loss.

**Estimated Training Time**:
- 2 - 4 minutes on Google Colab CPU

In [None]:
num_epochs = 20
reseed(seed)
lr = 0.0001

state_estimator = StateEstimator()
optimizer = torch.optim.Adam(state_estimator.parameters(), lr=lr)
criterion = nn.MSELoss()

# TODO: Train and evaluate state estimator
train_losses, eval_losses = [], []
# END TODO

### **2.7: [PROVIDED] Build Gym Environment With State Estimator**

The following `StateEstimatorEnv` class is largely defined for you to train your next PPO policy. In this environment, the reward is calculated the same way as in `real_env`.

To initialize a `StateEstimatorEnv` environment, a trained state estimator (an instance of `StateEstimator` in this case) should be passed in as the `estimator` argument. It is used in the `step()` and `reset()` functions.

This environment is registered with an id of **StateEstimatorFetch** and can be created either with `gym.make` or by instantiating the class directly.

**Run the following cell to define and register this environment**

In [None]:
class StateEstimatorEnv(gym.Env):
    def __init__(self, estimator: StateEstimator, render_mode: str='rgb_array'):
        super(StateEstimatorEnv, self).__init__()
        self.metadata = { 'render_modes': ['human', 'rgb_array'], 'render_fps': 30 }
        self.render_mode = 'rgb_array'
        self.estimator = estimator
        self.corr_env = real_env

        self.transform = T.Compose([T.ToPILImage(), T.ToTensor()])
        self.observation_space = self.corr_env.observation_space
        self.action_space = self.corr_env.action_space
        self.obs_min = torch.tensor(self.corr_env.observation_space.low)
        self.obs_max = torch.tensor(self.corr_env.observation_space.high)

    def seed(self, seed=None):
        pass

    def reset(self, seed=None, options=None):
        obs, info = self.corr_env.reset()
        rgb_array = self.corr_env.render()

        transformed = self.transform(rgb_array).unsqueeze(0)
        estimate = self.estimator(transformed)
        clipped = torch.clamp(estimate, self.obs_min, self.obs_max)
        return clipped.detach().numpy(), {}

    def step(self, action):
        obs, reward, terminated, truncated, info = self.corr_env.step(action)
        rgb_array = self.corr_env.render()

        transformed = self.transform(rgb_array).unsqueeze(0)
        estimate = self.estimator(transformed)
        clipped = torch.clamp(estimate, self.obs_min, self.obs_max)
        return clipped.detach().numpy(), reward, bool(terminated), truncated, {}

gym.register(id='StateEstimatorFetch', entry_point=StateEstimatorEnv)
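
Both `reset()` and `step()` clip the estimator's output to the bounds of the observation space before returning it. The sketch below illustrates that clipping step in isolation on made-up bounds (`toy_low` and `toy_high` are hypothetical and do not correspond to the FetchReach limits). `torch.clamp` accepts tensor bounds and applies them element-wise, broadcasting over the batch dimension.

```python
import torch

# Hypothetical per-dimension bounds for a 3-dimensional observation.
toy_low = torch.tensor([-1.0, 0.0, -2.0])
toy_high = torch.tensor([1.0, 0.5, 2.0])

# A fake batch holding one estimate, shaped (1, 3) like an estimator output
# for a single render after `unsqueeze(0)`.
estimate = torch.tensor([[1.7, -0.3, 0.4]])

# Each element is clipped to its own [low, high] interval.
clipped = torch.clamp(estimate, toy_low, toy_high)

# Detach from any graph and convert to NumPy, as the environment does.
obs = clipped.detach().numpy()
```

In this toy case the first two entries fall outside their intervals and are moved to the nearest bound, while the third is returned unchanged.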

# **Part 3: Train Policy Using State Estimator**

Now that we have learned a model that estimates the state of the real environment, it's time to train a policy on this model. In this part, you will train and evaluate a PPO policy on the learned state estimator, similar to Part 1, using functions defined in the Helper Functions section, Part 1, and Part 2.

**Follow instructions to complete each component**

### **3.1: Train New PPO On State Estimator**

**Instruction 1** Initialize a 3-vectorized and a 1-vectorized `StateEstimatorFetch` environment.

**Instruction 2** For this part, we will train a separate PPO policy using the state estimator environment. This policy should be trained with the state estimator learned in the previous part for 10000 steps, under a 3-vectorized environment, using the same hyperparameters provided in Part 1.

**Note**: Here is the list of environments that exist after running the following cell:
- `real_env`
- `real_vec_env_1`
- `real_vec_env_3`
- `state_vec_env_1`
- `state_vec_env_3`

Refer to each function's documentation to decide which environment to pass in.

**Estimated Training Time**:
- 6 - 10 minutes on Google Colab CPU

In [None]:
learner_ckpt_path = 'learner'
total_steps = 10000
reseed(seed)

# TODO 1: Create vectorized state estimator environments (HINT: use env_kwargs)
env_kwargs = None
state_vec_env_1 = None
state_vec_env_3 = None
# END TODO

learner_callback = PPOCallback(save_path=learner_ckpt_path, eval_env=state_vec_env_1)

# TODO 2: Initiate training
learner = None
# END TODO

### **3.2: Evaluate Learned Policy**

Evaluate the learner policy on both the learned state estimator environment and the real environment. Also print out the success rate on the real environment. Note: to save time, use only 20 trajectories for `state_vec_env_1` and 100 trajectories for the rest.

**Expected Rewards**:
- 1.5 - 1.7 on `state_vec_env_1`
- 1.5 - 1.6 on `real_vec_env_1`

**Expected Success**: Around 0.95 on `real_env`

In [None]:
learner_actor = PPOActor(ckpt=f'{learner_ckpt_path}.zip', environment=state_vec_env_1)

# TODO: Evaluate on both environments

# END TODO

# **[GRAD] Part 4: End-To-End Image-Based Policy**

### **4.1: [PROVIDED] Overview**
In the previous section, we trained the `StateEstimator` and the learner policy separately: first we built a good `StateEstimator` to go from renders to states, and only then did we train a policy to go from states to actions. A common alternative in practice is to train a single end-to-end policy that goes directly from renders to actions.

Follow the instructions below to implement each of the components of this end-to-end policy.

### **4.2: [PROVIDED] Build Custom Gym Environment**

The following `EndToEndEnv` class is largely defined for you to train your next PPO policy. In this environment, the reward is calculated the same way as in `real_env`.

This environment is registered with an id of **FetchReachEndToEnd** and can be created either with `gym.make` or by instantiating the class directly.

**Run the following cell to define and register this environment**

In [None]:
class EndToEndEnv(gym.Env):
    def __init__(self, render_mode: str='rgb_array'):
        super(EndToEndEnv, self).__init__()
        self.metadata = { 'render_modes': ['human', 'rgb_array'], 'render_fps': 30 }
        self.render_mode = 'rgb_array'
        self.corr_env = real_env

        self.transform = T.Compose([T.ToPILImage(), T.ToTensor()])
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=(3, 120, 120), dtype=np.float32)
        self.action_space = self.corr_env.action_space
        self.obs_min = torch.from_numpy(self.observation_space.low).float()
        self.obs_max = torch.from_numpy(self.observation_space.high).float()

    def seed(self, seed=None):
        self.corr_env.seed(seed)

    def reset(self, seed=None, options=None):
        obs, info = self.corr_env.reset()
        rgb_array = self.corr_env.render()
        return self.transform(rgb_array), {}

    def step(self, action):
        obs, reward, terminated, truncated, info = self.corr_env.step(action)
        rgb_array = self.corr_env.render()
        return self.transform(rgb_array), reward, terminated, truncated, info

    def render(self):
        return self.corr_env.render()

gym.register(id='FetchReachEndToEnd', entry_point=EndToEndEnv)

### **4.3: Define Feature Extractor**

To make our end-to-end policy work with StableBaselines3's training, we need to define a custom feature extractor (a subclass of `BaseFeaturesExtractor`) to pass in. This feature extractor will be identical to the `StateEstimator` we defined above, except that the final fully connected linear layer will go from `16` to `features_dim`.
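
When adapting the convolutional stack, it helps to know how many values remain after the convolutions, so that the first linear layer gets the right input size. The sketch below applies the standard output-size formula for a convolution with dilation 1, `floor((n + 2p - k) / s) + 1`, to a 120 x 120 render. The kernel sizes, strides, and channel count are hypothetical placeholders, not the layers of `StateEstimator`.

```python
def conv_out_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after one convolution with dilation 1."""
    return (n + 2 * padding - kernel) // stride + 1

# Hypothetical three-layer stack: (kernel, stride) per layer, no padding.
layers = [(8, 4), (4, 2), (3, 1)]
out_channels = 32  # hypothetical channel count of the last conv layer

size = 120  # renders are 120 x 120, matching the (3, 120, 120) observation space
for kernel, stride in layers:
    size = conv_out_size(size, kernel, stride)

flattened = out_channels * size * size  # input width of the first linear layer
```

With these placeholder layers the spatial size shrinks from 120 to 29, then 13, then 11, so the flattened feature map would hold 32 * 11 * 11 = 3872 values. Substituting the actual layer parameters gives the corresponding number for your own architecture.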

In [None]:
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class CustomCNN(BaseFeaturesExtractor):
    """
    A custom CNN feature extractor for image-based observations.
    """
    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 128):
        # Initialize the base features extractor with the expected output feature dimension
        super(CustomCNN, self).__init__(observation_space, features_dim)

        # TODO: Define architecture
        self.conv = None
        self.fc1 = None
        self.fc2 = None
        # END TODO

    def forward(self, renders):
        # TODO: Compute the feature vector from the renders
        return None
        # END TODO

### **4.4: Train End-To-End Policy**

**Instruction 1** Initialize a 3-vectorized and a 1-vectorized `FetchReachEndToEnd` environment. Note that you don't need `env_kwargs` anymore.

**Instruction 2** Train a separate PPO policy using `EndToEndEnv`. Similar to before, this model should be trained for 10000 steps, under a 3-vectorized environment, using the same hyperparameters provided in Part 1. Note that because our feature extractor is a CNN, we should make our policy a `CnnPolicy` instead of an `MlpPolicy`.

**Note**: Here is the list of environments that exist after running the following cell:
- `real_env`
- `real_vec_env_1`
- `real_vec_env_3`
- `state_vec_env_1`
- `state_vec_env_3`
- `end_vec_env_1`
- `end_vec_env_3`

Refer to each function's documentation to decide which environment to pass in.

**Estimated Training Time**:
- 6 - 10 minutes on Google Colab CPU

In [None]:
endlearner_ckpt_path = 'endlearner'
total_steps = 10000
reseed(seed)

# TODO 1: Create vectorized end-to-end environments
end_vec_env_1 = None
end_vec_env_3 = None
# END TODO

end_callback = PPOCallback(save_path=endlearner_ckpt_path, eval_env=end_vec_env_1)
policy_kwargs = dict(features_extractor_class=CustomCNN, features_extractor_kwargs=dict(features_dim=128))

# TODO 2: Initiate training
endlearner = None
# END TODO

### **4.5: Evaluate PPO On The Custom Environment**

Note: to save time, use only 20 trajectories for `end_vec_env_1` and 100 trajectories for the rest.

**Expected Rewards:** About 1.6 to 1.8 on `end_vec_env_1`

**Expected Success:** Around 0.95 on `end_env`

In [None]:
endlearner_actor = PPOActor(ckpt=f'{endlearner_ckpt_path}.zip', environment=end_vec_env_1)
end_env = gym.make('FetchReachEndToEnd')

# TODO: Evaluate on custom environment

# END TODO