In [None]:
%matplotlib inline

# Q-Learning on the CartPole Environment

This week we introduce a new environment: the CartPole-v1 task from [OpenAI Gym](https://gym.openai.com/).

![cartpole](https://github.com/pytorch/tutorials/blob/main/_static/img/cartpole.gif?raw=true)

In this environment, the task is to balance a pole attached to a cart by moving the cart to either side. The reward is incremented by one for each step (for up to 500 steps in CartPole-v1) in which the pole does not exceed a set angle and the cart does not touch the sides of the track.

The environment provides four values that represent its state:
the position and velocity of the cart, and the angle and angular velocity of the pole (see [the documentation](https://gymnasium.farama.org/environments/classic_control/cart_pole/#observation-space)).

In this notebook, we will implement Q-learning for this environment. As you will notice, the state space of CartPole uses continuous values, so we need to discretize it in order to apply the tabular version of Q-learning.

This exercise is based on a post by Jose Nieves Flores Maynez ([link](https://medium.com/@flomay/using-q-learning-to-solve-the-cartpole-balancing-problem-c0a7f47d3f9d)) and code from Isaac Patole ([link](https://github.com/init-22/CartPole-v0-using-Q-learning-SARSA-and-DNN/blob/master/Qlearning_for_cartpole.py)).


### Packages


First, let's import the packages we need. You may have to install `gym`, `seaborn`, and other packages first, using either pip or conda.

**Note: For compatibility reasons, if you are using your own laptop, make sure that you use gym version 0.25.2.**

In [None]:
import gym

import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from typing import NamedTuple
import seaborn as sns
# additional imports for saving and loading a trained policy
from gym.wrappers.record_video import RecordVideo
from gym.wrappers import TransformObservation

print(f'gym.__version__ = {gym.__version__}')

## Implementing the CartPole agent

The CartPole environment gives us the state of the agent in terms of the position of the cart, its velocity, the angle of the pole, and the angular velocity of the pole. All of these are continuous variables.

To be able to solve this problem, we need to discretize these states. The solution is to group several values of each of the variables into the same “bucket” and treat them as similar states. 
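The bucketing idea can be sketched as a standalone function. This is only one possible binning scheme, not necessarily the one intended for the exercise: the function name `discretize` and the bound values below are assumptions chosen for illustration (the position and angle bounds roughly match the CartPole observation limits, while the velocity bounds are narrowed by hand because the environment reports them as unbounded).

```python
import math

def discretize(obs, lower_bounds, upper_bounds, buckets):
    """Map each continuous value in obs to an integer bucket index."""
    discretized = []
    for value, low, high, n in zip(obs, lower_bounds, upper_bounds, buckets):
        # Clip to the bounds so out-of-range values fall into the edge buckets
        clipped = min(max(value, low), high)
        # Rescale to [0, 1], then to a bucket index in [0, n - 1]
        ratio = (clipped - low) / (high - low)
        index = min(int(ratio * n), n - 1)
        discretized.append(index)
    return tuple(discretized)

# Illustrative bounds (assumed): position, velocity, angle (rad), angular velocity
lower = [-4.8, -0.5, -0.418, -math.radians(50)]
upper = [4.8, 0.5, 0.418, math.radians(50)]
buckets = (3, 3, 6, 6)

print(discretize([0.0, 0.0, 0.0, 0.0], lower, upper, buckets))   # (1, 1, 3, 3)
print(discretize([10.0, -3.0, 0.4, -2.0], lower, upper, buckets))  # (2, 0, 5, 0)
```

Clipping before rescaling matters here: without it, a large velocity would produce an index outside the table.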

Tabular Q-learning encodes the **action-value** function $Q(s,a)$ as a table that stores the expected
return for each (state, action) pair: `Q_table[state, action]`.
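As a rough sketch of how such a table can be stored and indexed with NumPy, the snippet below builds a table with one axis per state variable plus one for the action, applies a single Q-learning update, and picks an action with an $\epsilon$-greedy rule. The helper names `q_update` and `epsilon_greedy` are hypothetical and are not part of the agent class; treat this as one way to write these steps, not as the reference solution.

```python
import numpy as np

buckets = (3, 3, 6, 6)   # buckets per state variable (values used in this notebook)
n_actions = 2            # CartPole has two actions: push left (0) or push right (1)

# One entry per (discretized state, action) pair
Q_table = np.zeros(buckets + (n_actions,))

def q_update(Q, state, action, reward, new_state, learning_rate, gamma):
    """Apply one tabular Q-learning update in place."""
    td_target = reward + gamma * np.max(Q[new_state])
    Q[state + (action,)] += learning_rate * (td_target - Q[state + (action,)])

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[-1]))
    return int(np.argmax(Q[state]))

rng = np.random.default_rng(0)
state, new_state = (1, 1, 3, 3), (1, 1, 3, 2)
q_update(Q_table, state, action=1, reward=1.0, new_state=new_state,
         learning_rate=0.5, gamma=1.0)

print(Q_table.shape)      # (3, 3, 6, 6, 2)
print(Q_table[state])     # [0.  0.5]
print(epsilon_greedy(Q_table, state, epsilon=0.0, rng=rng))  # 1
```

Indexing with the state tuple returns the vector of action values for that state, and appending the action to the tuple selects a single entry.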

The code below supports a variable learning rate and $\epsilon$, each controlled by a decay parameter. Typically, we encourage more exploration and a higher learning rate at the beginning of the interaction; as the agent learns better policies, these values decrease.
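To get a feel for the schedule, the following sketch evaluates the same logarithmic decay expression that the agent uses for both parameters, with the default values `min = 0.1` and `decay = 25`. The function name `decayed_value` is introduced here only for illustration.

```python
import math

def decayed_value(t, minimum, decay):
    """Logarithmic decay clipped to [minimum, 1.0]; t is the episode index."""
    return max(minimum, min(1.0, 1.0 - math.log10((t + 1) / decay)))

for episode in (0, 24, 99, 199, 499):
    print(episode, round(decayed_value(episode, minimum=0.1, decay=25), 3))
# 0 1.0
# 24 1.0
# 99 0.398
# 199 0.1
# 499 0.1
```

With these settings the value stays at 1.0 for the first 25 episodes and then falls until it reaches the minimum shortly before episode 200.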


We will first implement a class to set the learning parameters:

In [None]:
class Params(NamedTuple):
    seed: int  # Seed for the random number generator, so that results are reproducible
    buckets_pos: int  # Number of buckets for discretizing cart position
    buckets_vel: int  # Number of buckets for discretizing cart velocity
    buckets_ang: int  # Number of buckets for discretizing pole angle
    buckets_angV: int  # Number of buckets for discretizing pole angular velocity
    num_episodes: int  # Number of training episodes
    min_lr: float  # Minimum learning rate
    min_epsilon: float  # Minimum epsilon
    gamma: float  # Discount factor
    decay_epsilon: float  # Decay for exploration parameter epsilon
    decay_lr: float  # Decay for learning rate

params = Params(
    seed = 123,
    buckets_pos = 3,
    buckets_vel = 3,
    buckets_ang = 6,
    buckets_angV = 6,
    num_episodes=500,
    min_lr=0.1,
    min_epsilon=0.1,
    gamma=1.0,
    decay_epsilon=25,
    decay_lr=25)


# Set the seed for the random number generator
rng = np.random.default_rng(params.seed)

Now, we define the `CartPoleQAgent` class, which contains the Q-learning implementation:

In [None]:
class CartPoleQAgent():
    def __init__(self, env, buckets=(3, 3, 6, 6),
                 num_episodes=500, min_lr=0.1,
                 min_epsilon=0.1, gamma=1.0, decay_epsilon=25, decay_lr=25):
        self.buckets = buckets
        self.num_episodes = num_episodes
        self.min_lr = min_lr
        self.min_epsilon = min_epsilon
        self.gamma = gamma
        self.decay_epsilon = decay_epsilon
        self.decay_lr = decay_lr

        
        # Initialize the action-value function
        self.Q_table = np.zeros(self.buckets + (env.action_space.n,))

        # Define bound values for the state variables [position, velocity, angle, angular velocity]
        self.upper_bounds = [env.observation_space.high[0], 0.5, env.observation_space.high[2], math.radians(50) / 1.]
        self.lower_bounds = [env.observation_space.low[0], -0.5, env.observation_space.low[2], -math.radians(50) / 1.]

        # Create array to store the number of steps per episode
        self.steps = np.zeros(self.num_episodes)


    def discretize_state(self, obs):
        """
        Takes an observation of the environment and discretizes it.
        By doing this, very similar observations can be treated
        as the same and it reduces the state space so that the
        Q-table can be smaller and more easily filled.

        Input:
        obs (tuple): Tuple containing 4 floats describing the current
                     state of the environment.

        Output:
        discretized (tuple): Tuple containing 4 non-negative integers smaller
                             than n where n is the number in the same position
                             in the buckets list.
        """
        discretized = list()
        """
        TODO: Complete the code to implement the discretization process. It should take the 4 values in the agent state (obs)
                and discretize them into non-negative integers corresponding to the respective buckets.
        """
        return tuple(discretized)

    
    def choose_action(self, env, state):
        """
        Implementation of e-greedy algorithm. Returns an action (0 or 1).

        Input:
        state (tuple): Tuple containing 4 non-negative integers within
                       the range of the buckets.

        Output:
        (int) Returns either 0 or 1
        """
        """
        TODO: Implement the e-greedy algorithm. i.e.:
            with probability epsilon:
                select an action randomly
            else
                select the action with the highest q-value
        """


    def update_q(self, state, action, reward, new_state):
        """
        Updates Q-table (self.Q_table)
        """
        """
        TODO: Implement the update of the Q-function
            Q(s,a):= Q(s,a) + learning_rate * delta
                delta =  [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        """

    

    def get_epsilon(self, t):
        """Gets value for epsilon. It declines as we advance in episodes."""
        # Ensures that there is always at least a min_epsilon chance of randomly exploring
        return max(self.min_epsilon, min(1., 1. - math.log10((t + 1) / self.decay_epsilon)))

    
    def get_learning_rate(self, t):
        """Gets value for learning rate. It declines as we advance in episodes."""
        # Learning rate also declines as we add more episodes
        return max(self.min_lr, min(1., 1. - math.log10((t + 1) / self.decay_lr)))

    def train(self, env):
        """
        Trains agent making it go through the environment and choose actions
        through an e-greedy policy and updating values for its Q-table. The
        agent is trained by default for 500 episodes with a declining
        learning rate and epsilon values that with the default values,
        reach the minimum after 198 episodes.
        """
        # Looping for each episode
        for e in range(self.num_episodes):
            # Initializes the state
            current_state = self.discretize_state(list(env.reset()))

            # Get new values of learning rate and epsilon (according to the decay function)
            self.learning_rate = self.get_learning_rate(e)
            self.epsilon = self.get_epsilon(e)
            done = False
            
            # Looping for each step
            while not done:
                self.steps[e] += 1
                # Choose A from S
                action = self.choose_action(env, current_state)
                # Take action
                obs, reward, terminated, truncated, _ = env.step(action)
                done = truncated or terminated
                new_state = self.discretize_state(obs)
                # print ("current_state" + str(current_state) + "new_state" + str(new_state) +"\r", flush = False)
               
                # Update Q(S,A) and new state
                self.update_q(current_state, action, reward, new_state)
                current_state = new_state

                # We leave the loop when done is True, i.e. when the episode
                # has terminated or has been truncated.

        print('Finished training!')


    def run(self, env, strname, num_episodes=10):
        """Runs several episodes and saves videos of the CartPole environment in the directory 'video_eval'."""
        self.epsilon = self.min_epsilon
        t = 0
        
        envVid = RecordVideo(env, './video_eval/', name_prefix=strname, episode_trigger = lambda episode_number: True, new_step_api= True)
        envVid.reset()
        # Looping for several episodes
        for e in range(num_episodes):
            current_state = self.discretize_state(envVid.reset())
            done = False
            while not done:
                envVid.render()
                t = t + 1
                action = self.choose_action(envVid, current_state)
                obs, reward, terminated, truncated, _ = envVid.step(action)
                done = truncated or terminated
                new_state = self.discretize_state(obs)
                current_state = new_state
                
        envVid.env.close()


    
    def plot_learning(self):
        """
        Plots the number of steps at each episode and prints the
        amount of times that an episode was successfully completed.
        """
        sns.lineplot(data=self.steps)
        plt.xlabel("Episode")
        plt.ylabel("Steps")
        plt.show()
        # Count the episodes that lasted at least 200 steps
        t = 0
        for i in range(self.num_episodes):
            if self.steps[i] >= 200:
                t += 1
        print(t, "episode(s) were successfully completed.")


## Create the agent and save videos of the untrained agent

In [None]:
# Create the CartPole environment. We are creating the environment with 'render_mode="rgb_array"' to save videos 
# of the policy behavior before training
env = gym.make('CartPole-v1', render_mode="rgb_array", new_step_api=True)
env.reset(seed=params.seed)

# Create an Agent
agent = CartPoleQAgent(env,
                       (params.buckets_pos, params.buckets_vel, params.buckets_ang, params.buckets_angV),
                       params.num_episodes,
                       params.min_lr,
                       params.min_epsilon,
                       params.gamma,
                       params.decay_epsilon,
                       params.decay_lr)


# Save videos of the behavior of the untrained agent
agent.run(env,"before")
env.close()

## Train the agent and plot the number of steps per episode

In [None]:
# The previous environment is closed; we create a new one without rendering to speed up the training process
env = gym.make('CartPole-v1', new_step_api=True)
env.reset(seed=params.seed)

agent.train(env)
agent.plot_learning()
env.close()

## Save videos with the behavior of the trained agent

In [None]:
# As above, we create the CartPole Environment with rendering mode activated
env = gym.make('CartPole-v1', render_mode="rgb_array", new_step_api=True)
env.reset(seed=params.seed)

agent.run(env,"after")

In [None]:
env.close()

## Tasks

1. **Where is the Q-Learning algorithm?**:
    Find out where the training is happening.
    Implement the functions to discretize the state space, to update the Q-value, and to choose the actions. For the last two functions, can you use the same expressions you used for the Gridworld environment?
    
2. **Extend the plotting**:
    Implement a function to plot both the learning rate and the $\epsilon$ parameter over the course of the training.

3. **Discretization**:
    Change the number of buckets used for discretization. What is the effect on learning?


4. **Initial conditions**:
    Adapt the code to run the training process 10 times with different seeds of the random number generator. Do the learning process and the outcome change across runs?


5. **Update the code**:
    Update the code to use a fixed learning rate of 0.1 and retrain. How does the learning change compared with the initial implementation?
    Now, do the same with the $\epsilon$ parameter.
