<a href="https://colab.research.google.com/github/kushal2000/Annotator-Analysis-HateXplain/blob/main/cs4756_robot_learning/tree/main/assignments/HW2/CS_4756_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Introduction**

Welcome to the second coding assignment of CS4756/5756. In this short notebook, you will implement Model Predictive Control (MPC) for the Pendulum environment, using the Cross-Entropy Method (CEM) for trajectory optimization.
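As a warm-up, here is a minimal sketch of the Cross-Entropy Method on a toy quadratic objective. It is not the Pendulum solution. The names `cem_minimize`, `cost_fn`, and `target`, and all hyperparameter values, are hypothetical and chosen only for illustration. The loop samples candidates from a Gaussian, keeps the lowest-cost fraction (the elites), and refits the Gaussian to them.

```python
import numpy as np

def cem_minimize(cost_fn, dim, num_samples=200, proportion_elite=0.2,
                 sigma_init=2.0, num_iter=20, seed=0):
    """Toy CEM: minimize cost_fn over R^dim with a diagonal Gaussian."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)                  # initial guess (assumption: start at the origin)
    sigma = np.full(dim, sigma_init)      # per-dimension standard deviation
    num_elite = max(1, int(num_samples * proportion_elite))

    for _ in range(num_iter):
        # 1. Sample candidates from the current distribution.
        samples = rng.normal(mean, sigma, size=(num_samples, dim))
        # 2. Score every candidate.
        costs = np.array([cost_fn(s) for s in samples])
        # 3. Keep the lowest-cost candidates.
        elite = samples[np.argsort(costs)[:num_elite]]
        # 4. Refit the Gaussian to the elites.
        mean = elite.mean(axis=0)
        sigma = elite.std(axis=0)
    return mean

# Hypothetical objective: squared distance to a fixed target point.
target = np.array([1.5, -0.5, 0.25])
best = cem_minimize(lambda x: np.sum((x - target) ** 2), dim=3)
print(best)
```

In the assignment, each sample would be a whole action sequence over the planning horizon rather than a single point, and the cost would come from rolling that sequence out in the environment.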

**Evaluation:**
Your code will be tested for correctness, and for certain assignments, speed. Please remember that all assignments should be completed individually.

**Academic Integrity:** We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else’s code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don’t try. We trust you all to submit your own work only; please don’t let us down. If you do, we will pursue the strongest consequences available to us.

**Getting Help:** The [#resources](https://www.cs.cornell.edu/courses/cs4756/2023sp/#resources) section on the course website is your friend! If you ever feel stuck on these projects, please make use of office hours and Edstem. If you are unable to make any of the office hours listed, please let the TAs know and we will be happy to assist. Since this is the first iteration of this course, please do not hesitate to reach out to the TAs if you find any errors in the assignments.

## Installations and Imports



In [None]:
!apt-get install x11-utils > /dev/null 2>&1 
!pip install pyglet > /dev/null 2>&1 
!apt-get install -y xvfb python-opengl > /dev/null 2>&1
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!pip install gym[classic_control] > /dev/null 2>&1

In [None]:
import gym
import numpy as np
import matplotlib.pyplot as plt
from IPython import display as ipythondisplay
import random

## Visualizing the environment


In [None]:
env = gym.make('Pendulum-v1', g=9.81)
env.reset()
prev_screen = env.render(mode='rgb_array')
plt.imshow(prev_screen)

for i in range(50):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    screen = env.render(mode='rgb_array')

    plt.imshow(screen)
    ipythondisplay.clear_output(wait=True)
    ipythondisplay.display(plt.gcf())

    if done:
        break

ipythondisplay.clear_output(wait=True)
env.close()

In [None]:
class CEM_Agent():
    def __init__(self, 
                 action_dim = None, 
                 planning_horizon = None, 
                 num_rollouts=None, 
                 proportion_elite = None, 
                 sigma_init = None, 
                 noise = None, 
                 num_iter = None,
                 learned_transition_model = None,
                 action_high = None,
                 action_low = None):
        self.action_dim = action_dim
        self.planning_horizon = planning_horizon
        self.num_rollouts = num_rollouts
        self.proportion_elite = proportion_elite
        self.sigma_init = sigma_init
        self.noise = noise
        self.num_iter = num_iter
        self.learned_transition_model = learned_transition_model
        self.rollout_env = gym.make('Pendulum-v1', g=9.81)
        self.action_high = action_high
        self.action_low = action_low

        self.action_means = None
        self.action_sigmas = None
    
    def get_full_state(self, obs):
        x, y, thetadot = obs
        theta = np.arctan2(y, x)
        return np.array([theta, thetadot])

    def custom_step(self, action, obs):
        # Function returns next state, reward for transition and whether or not the episode has terminated
        if self.learned_transition_model is not None:
            """TODO for Extra Credit: Use a learned transition model to step the environment during rollouts
            Outputs:
            ----------
            obs: next state 
            reward: reward for the next state
            done: whether the episode has terminated
            """
            pass
        else:
            self.rollout_env.reset()
            self.rollout_env.state = self.rollout_env.unwrapped.state = self.get_full_state(obs)
            obs, reward, done, info = self.rollout_env.step(action)
            return obs, reward, done 

    def reset_distribution(self):
        """TODO: Resets the action distribution (sets self.action_means and self.action_sigmas)"""
        pass

    def update_distribution(self, current_state, iter_num=None):
        """TODO: Update the action distribution by sampling rollouts from action distribution and selecting best rollouts

        Inputs
        ----------
        current_state : The current state of the environment
        iter_num (optional): The number of times the distribution has been updated (can be useful for adding noise to variance updates)
        """
        pass

    def rollout_plans(self, current_state):
        """TODO: Samples action plans and rolls them out on the Pendulum environment

        Inputs
        ----------
        current_state : The current state of the environment

        Outputs
        ----------
        rollouts: Dimension (num_rollouts, planning_horizon, action_dim) - plans sampled from action distributions
        rollout_costs: A list of costs for each rollout
        """
        pass     

    def rollout_cost(self, rollout_plan, current_state):
        """TODO: Rolls out an action plan on the Pendulum environment from the current state

        Inputs
        ----------
        rollout_plan: Dimension (planning_horizon, action_dim) - a single plan sampled from the action distribution
        current_state : The current state of the environment

        Outputs
        ----------
        rollout_cost: Cost of the rollout
        """
        pass
    
    def mpc_policy(self, current_state):
        """TODO: Executes the CEM updates and returns the action to be executed on the current timestep

        Inputs
        ----------
        current_state: The current state of the environment

        Outputs
        ----------
        action: The first sampled action in the planning horizon of the action distribution
        """
        pass
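The `get_full_state` helper in the class above relies on the Pendulum observation being `[cos(theta), sin(theta), thetadot]`, so the angle can be recovered with `arctan2`. The sketch below checks that round trip in plain NumPy. The function names `obs_to_state` and `state_to_obs` are hypothetical stand-ins and are not part of the assignment code.

```python
import numpy as np

def obs_to_state(obs):
    """Recover [theta, thetadot] from a Pendulum-style observation."""
    x, y, thetadot = obs
    theta = np.arctan2(y, x)  # angle in (-pi, pi]
    return np.array([theta, thetadot])

def state_to_obs(state):
    """Build the [cos(theta), sin(theta), thetadot] observation from a state."""
    theta, thetadot = state
    return np.array([np.cos(theta), np.sin(theta), thetadot])

# Round trip: state -> observation -> state
state = np.array([2.5, -1.0])
recovered = obs_to_state(state_to_obs(state))
print(recovered)
```

Because `arctan2` returns angles in (-pi, pi], the round trip is exact only for angles already in that range; other angles come back wrapped to an equivalent value.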

In [None]:
### Initialize the CEM agent with appropriate hyperparameters
cem_agent = CEM_Agent(action_dim = None, 
                 planning_horizon = None, 
                 num_rollouts=None, 
                 proportion_elite = None, 
                 sigma_init = None, 
                 noise = None, 
                 num_iter = None,
                 learned_transition_model = None,
                 action_high = None,
                 action_low = None)

In [None]:
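import time

def execute_policy(cem_agent, render = False):
    # Runs the MPC policy for up to 50 steps.
    # Returns the total reward and the elapsed wall-clock time in seconds.
    env = gym.make('Pendulum-v1', g=9.81)
    current_state = env.reset()
    total_reward = 0
    start_time = time.time()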
    if render:
        prev_screen = env.render(mode='rgb_array')
        plt.imshow(prev_screen)

    for i in range(50):
        action = cem_agent.mpc_policy(current_state)
        current_state, reward, done, info = env.step(action)
        total_reward += reward

        if render:
            screen = env.render(mode='rgb_array')
            plt.imshow(screen)
            ipythondisplay.clear_output(wait=True)
            ipythondisplay.display(plt.gcf())

        if done:
            break

    elapsed_time = time.time() - start_time
    env.close()  # release the environment's rendering resources
    return total_reward, elapsed_time