## CSCI 470 Activities and Case Studies

1. For all activities, you are allowed to collaborate with a partner. 
1. For case studies, you should work individually and are **not** allowed to collaborate.

By filling out this notebook and submitting it, you acknowledge that you are aware of the above policies and are agreeing to comply with them.

Some considerations with regard to how these notebooks will be graded:

1. You can add more notebook cells or edit existing notebook cells other than "# YOUR CODE HERE" to test out or debug your code. We actually highly recommend you do so to gain a better understanding of what is happening. However, during grading, **these changes are ignored**. 
2. You must ensure that all your code for the particular task is available in the cells that say "# YOUR CODE HERE"
3. Every cell that says "# YOUR CODE HERE" is followed by a "raise NotImplementedError". You need to remove that line. During grading, if an error occurs then you will not receive points for your work in that section.
4. If your code passes the "assert" statements, then no output will result. If your code fails the "assert" statements, you will get an "AssertionError". Getting an assertion error means you will not receive points for that particular task.
5. If you edit the "assert" statements to make your code pass, they will still fail when they are graded since the "assert" statements will revert to the original. Make sure you don't edit the assert statements.
6. We may sometimes have "hidden" tests for grading. This means that passing the visible "assert" statements is not sufficient. The "assert" statements are there as a guide but you need to make sure you understand what you're required to do and ensure that you are doing it correctly. Passing the visible tests is necessary but not sufficient to get the grade for that cell.
7. When you are asked to define a function, make sure you **don't** use any variables outside of the parameters passed to the function. You can think of the parameters being passed to the function as a hint. Make sure you're using all of those variables.
8. Finally, **make sure you run "Kernel > Restart and Run All"** and pass all the asserts before submitting. If you don't restart the kernel, there may be some code that you ran and deleted that is still being used and that was why your asserts were passing.
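
As an illustration of point 7, here is a minimal sketch of the difference between a function that leans on a notebook-level variable and one that only uses its parameters. The names `scale`, `scale_with_global`, and `scale_with_param` are made up for this example and are not part of the assignment.

```python
# Hypothetical example: all names here are illustrative only.
scale = 2  # a notebook-level variable defined in some earlier cell


def scale_with_global(values):
    # Fragile: silently depends on whatever `scale` holds in the notebook.
    return [v * scale for v in values]


def scale_with_param(values, scale):
    # Robust: everything the function needs arrives through its parameters.
    return [v * scale for v in values]


print(scale_with_param([1, 2, 3], scale=10))  # [10, 20, 30]
```

The first version can pass in your session and still fail during grading if the notebook-level variable is missing or holds a different value.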

# Reinforcement Learning

In [1]:
from time import sleep
from IPython.display import clear_output
import random

import gym
import numpy as np
np.random.seed(0)

We will be using [OpenAI's gym](https://gym.openai.com/docs/) for rendering environments and we will specifically use the [Taxi-v3](https://gym.openai.com/envs/Taxi-v3/) environment for this exercise.

In [2]:
# Load the Taxi-v3 environment
env = gym.make("Taxi-v3").env

# Standardize expected results
env.seed(0)
env.reset()

print(f"Current State: {env.s}")
env.render()

Current State: 26
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



The cell above renders an example view of the environment. For the Taxi-v3 environment:

1. The filled block is the taxi; it is yellow when empty and green when it carries a passenger
1. Pipe symbols `|` represent barriers preventing the taxi from moving in that direction
1. R, G, Y, B are all the possible pickup or dropoff locations for a passenger
1. Blue font represents the current passenger's pickup location
1. Purple font represents the current passenger's dropoff location

The reward scheme for this environment is as follows: "your job is to pick up the passenger at one location and drop them off in another. You receive +20 points for a successful dropoff, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions." A nicer visualization of the environment, along with a description of the state space, is found in the State Space section of this [blog post](https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/).
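
The reward scheme can be turned into a quick arithmetic check. The sketch below uses a hypothetical helper, `episode_return`, which is not part of gym. It assumes the final step is the successful dropoff (+20), each illegal pick-up or drop-off earns -10 in place of the usual -1, and every other step earns -1.

```python
def episode_return(n_steps, n_illegal=0):
    """Total reward for an episode that ends with a successful dropoff.

    Assumption: the last step earns +20, each illegal pick-up/drop-off
    earns -10 instead of -1, and all remaining steps earn -1.
    """
    if n_steps < 1 or not 0 <= n_illegal < n_steps:
        raise ValueError("need at least one step and fewer illegal actions than steps")
    ordinary_steps = n_steps - 1 - n_illegal
    return 20 - 10 * n_illegal - ordinary_steps


print(episode_return(14))               # 7
print(episode_return(17))               # 4
print(episode_return(14, n_illegal=1))  # -2
```

Under these assumptions, a shorter route always scores higher, which is what pushes the agent toward efficient paths.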

In [3]:
print(f"The action space is discrete with {env.action_space.n} possibilities.")
print(f"The observation (state) space is discrete with {env.observation_space.n} possibilities.")

The action space is discrete with 6 possibilities.
The observation (state) space is discrete with 500 possibilities.
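
The 500 states come from 5 taxi rows × 5 taxi columns × 5 passenger locations (R, G, Y, B, or in the taxi) × 4 destinations. The sketch below shows one way such a tuple can be packed into a single integer. The helpers `encode` and `decode` are written here for illustration, and the index order (R=0, G=1, Y=2, B=3, in taxi=4) is an assumption about how the environment numbers its locations.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (row, col, passenger, destination) into one integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode(state):
    """Inverse of encode: recover (row, col, passenger, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


# Taxi in row 0, column 1; passenger waiting at G (1); destination Y (2).
print(encode(0, 1, 1, 2))  # 26
print(decode(26))          # (0, 1, 1, 2)
print(encode(4, 4, 4, 3))  # 499, the largest state index
```

Under this assumed ordering, state 26 corresponds to the rendering shown earlier in the notebook.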


The following actions are possible in the environment:

1. Move south
1. Move north
1. Move east
1. Move west
1. Pick up passenger
1. Drop off passenger
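
Actions are passed to the environment as integers. The lookup table below is a small sketch that assumes zero-based indices in the same order as the list above; `ACTION_NAMES` is a name introduced here for illustration and is not provided by gym.

```python
# Assumed mapping: zero-based action index -> description, in list order.
ACTION_NAMES = [
    "Move south",
    "Move north",
    "Move east",
    "Move west",
    "Pick up passenger",
    "Drop off passenger",
]

for index, name in enumerate(ACTION_NAMES):
    print(f"{index}: {name}")
```

With this mapping, action 0 moves the taxi south and action 5 attempts a dropoff.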

In [4]:
def initialize_q_table(env):
    """Initialize a Q table for an environment with all 0s
    
    Args:
        env (gym.envs): The environment
    
    Returns:
        np.array: The Q table of shape (observation space size, action space size)
    """
    return np.zeros((env.observation_space.n, env.action_space.n))

In [5]:
assert initialize_q_table(env).shape == (500, 6)
xenv = gym.make("FrozenLake-v0").env
assert initialize_q_table(xenv).shape == (16, 4)

In [6]:
def select_action(q_row, method, epsilon=0.5):
    """Select the appropriate action given a Q table row for the state and a chosen method
    
    Args:
        q_row (np.array): The row from the Q table to utilize
        method (str): The method to use, either "random" or "epsilon"
        epsilon (float, optional): Defaults to 0.5. The epsilon value to use for epsilon-greedy action selection
    
    Raises:
        NameError: If method specified is not supported
    
    Returns:
        int: The index of the action to apply
    """
    if method not in ["random", "epsilon"]:
        raise NameError("Undefined method.")
    p = random.random()
    if method == "random" or p < epsilon:
        return random.randint(0, len(q_row) - 1)
    else:
        return np.argmax(q_row)

In [7]:
assert select_action(np.array([1,2,3,4]), "epsilon", epsilon=0) == 3
assert select_action(np.array([1,2,3,4]), "epsilon", epsilon=1) in range(4)
assert select_action(np.array([1,2,3,4]), "random") in range(4)
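
To get a feel for what epsilon does, the sketch below estimates how often an epsilon-greedy rule picks the best action. It is a standalone re-implementation using only the standard library (`epsilon_greedy` is a hypothetical helper, not the graded `select_action`). Because the random branch can also land on the best action, the expected frequency is `(1 - epsilon) + epsilon / n_actions`.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise pick the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


rng = random.Random(0)  # local generator so the sketch is reproducible
q_row = [1, 2, 3, 4]
epsilon = 0.5
n_samples = 20_000

best_action = 3
hits = sum(epsilon_greedy(q_row, epsilon, rng) == best_action for _ in range(n_samples))
observed = hits / n_samples
expected = (1 - epsilon) + epsilon / len(q_row)  # 0.625 for these values

print(f"observed {observed:.3f}, expected {expected:.3f}")
```

A smaller epsilon moves this frequency toward 1, trading exploration for exploitation.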

The `env.step(action)` method takes a parameter that is the action the agent decides to apply and returns 4 values:
1. The new state
1. The received reward
1. Whether you have completed the task
1. Miscellaneous information

In [8]:
action = 0
vals = env.step(action)
print(f"An example returned from a step with action {action}")
print(vals)
print(f"This returns the new state {vals[0]}, the reward received ({vals[1]}) based on performing the action {action}, whether or not the task has been completed, {vals[2]}, and some additional miscellaneous info.")

An example returned from a step with action 0
(126, -1, False, {'prob': 1.0})
This returns the new state 126, the reward received (-1) based on performing the action 0, whether or not the task has been completed, False, and some additional miscellaneous info.
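
Indexing into `vals` works, but unpacking the tuple into named variables is usually easier to read. The sketch below reuses the tuple from the output above as a literal, so it runs without gym. It assumes the four-value return used in this notebook; more recent Gym and Gymnasium releases are understood to return five values, splitting the completion flag into `terminated` and `truncated`.

```python
# Tuple copied from the example output above (four-value step API assumed).
vals = (126, -1, False, {"prob": 1.0})

next_state, reward, done, info = vals

print(f"next_state={next_state}, reward={reward}, done={done}, info={info}")
```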


In [9]:
def calculate_new_q_val(q_table, state, action, reward, next_state, alpha, gamma):
    """Calculate the updated Q table value for a particular state and action given the necessary parameters
    
    Args:
        q_table (np.array): The Q table
        state (int): The current state of the simulation's index in the Q table
        action (int): The current action's index in the Q table
        reward (float): The returned reward value from the environment
        next_state (int): The next state of the simulation's index in the Q table (Based on the environment)
        alpha (float): The learning rate
        gamma (float): The discount rate
    
    Returns:
        float: The updated action-value expectation for the state and action
    """
    max_action = np.max(q_table[next_state, :])
    return (1-alpha)*q_table[state,action] + alpha*(reward + gamma*max_action)

In [10]:
test_q = np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
assert -0.05 < calculate_new_q_val(test_q, 0, 1, 10, 1, 0.1, 0.2) - 2.88 < 0.05
assert -0.05 < calculate_new_q_val(test_q, 0, 1, 1, 1, 0.1, 0.1) - 1.94 < 0.05
assert -0.05 < calculate_new_q_val(test_q, 0, 1, -11, 2, 0.1, 0.1) - 0.74 < 0.05
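
The function above follows the usual tabular Q-learning update, written here with $s$ for `state`, $a$ for `action`, $r$ for `reward`, and $s'$ for `next_state`:

$$
Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha\left(r + \gamma \max_{a'} Q(s', a')\right)
$$

As a worked check of the first assert, take `test_q` with $s = 0$, $a = 1$, $r = 10$, $s' = 1$, $\alpha = 0.1$, and $\gamma = 0.2$, so that $Q(0, 1) = 2$ and $\max_{a'} Q(1, a') = 4$:

$$
(1 - 0.1)\cdot 2 + 0.1\,(10 + 0.2 \cdot 4) = 1.8 + 1.08 = 2.88
$$

The other two asserts work out the same way: $1.8 + 0.1\,(1 + 0.1 \cdot 4) = 1.94$ and $1.8 + 0.1\,(-11 + 0.1 \cdot 4) = 0.74$.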

In [11]:
epsilon1_params = {
    "method": "epsilon",
    "epsilon": 0.1,
    "alpha": 0.1,
    "gamma": 0.5
}

In [12]:
epsilon2_params = {
    "method": "epsilon",
    "epsilon": 0.3,
    "alpha": 0.1,
    "gamma": 0.5
}

In [13]:
def train_sim(env, params, n=100):
    """Run a training simulation in an environment and return its Q table
    
    Args:
        env (gym.envs): The environment to train in
        params (dict): The parameters needed to train the simulation: method (for action selection), epsilon, alpha, gamma
        n (int, optional): Defaults to 100. The number of simulations to run for training
    
    Returns:
        np.array: The trained Q table from the simulation
    """
    my_q = initialize_q_table(env)
    
    for i in range(n):
        current_state = env.reset()
        done = False
        
        while not done:
            # Get the next action based on current state
            # Step through the environment with the selected action
            # Update the qtable
            
            action = select_action(my_q[current_state,:], params["method"], params["epsilon"])
            step_vals = env.step(action)
            next_state = step_vals[0]
            reward = step_vals[1]
            done = step_vals[2]
            my_q[current_state,action] = calculate_new_q_val(my_q, current_state, action, reward, next_state, params["alpha"], params["gamma"])

            # Prep for next iteration
            current_state = next_state 

        if (i+1) % 100 == 0:
            print(f"Simulation #{i+1:,} complete.")
        
    return my_q
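
The same training loop can be tried on a problem small enough to check by hand. The sketch below is not the Taxi environment: it uses a made-up five-cell corridor (`toy_step`, `train_toy`, and the reward values are all assumptions for illustration) and plain Python lists, so it runs without gym or NumPy. The agent starts in cell 0, pays 1 point per move, and earns 20 points for reaching cell 4.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
LEFT, RIGHT = 0, 1


def toy_step(state, action):
    """Hypothetical corridor dynamics: -1 per move, +20 for reaching the goal."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    reward = 20 if done else -1
    return next_state, reward, done


def greedy(q_row):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def train_toy(n_episodes=500, epsilon=0.1, alpha=0.1, gamma=0.5, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(n_episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = greedy(q[state])
            next_state, reward, done = toy_step(state, action)
            # Q-learning update: blend the old value with the new target
            target = reward + gamma * max(q[next_state])
            q[state][action] = (1 - alpha) * q[state][action] + alpha * target
            state = next_state
    return q


q = train_toy()
policy = [greedy(row) for row in q[:-1]]  # skip the terminal cell
print(policy)  # [1, 1, 1, 1]: move right in every non-goal cell
```

With `gamma = 0.5`, the learned values for moving right should approach 20, 9, 3.5, and 0.75 as you step back from the goal, which shows how the discount rate shrinks rewards that are further away.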

In [14]:
%%time
n = 10000
epsilon1_q = train_sim(env, epsilon1_params, n)
epsilon2_q = train_sim(env, epsilon2_params, n)

Simulation #100 complete.
Simulation #200 complete.
Simulation #300 complete.
Simulation #400 complete.
Simulation #500 complete.
Simulation #600 complete.
Simulation #700 complete.
Simulation #800 complete.
Simulation #900 complete.
Simulation #1,000 complete.
Simulation #1,100 complete.
Simulation #1,200 complete.
Simulation #1,300 complete.
Simulation #1,400 complete.
Simulation #1,500 complete.
Simulation #1,600 complete.
Simulation #1,700 complete.
Simulation #1,800 complete.
Simulation #1,900 complete.
Simulation #2,000 complete.
Simulation #2,100 complete.
Simulation #2,200 complete.
Simulation #2,300 complete.
Simulation #2,400 complete.
Simulation #2,500 complete.
Simulation #2,600 complete.
Simulation #2,700 complete.
Simulation #2,800 complete.
Simulation #2,900 complete.
Simulation #3,000 complete.
Simulation #3,100 complete.
Simulation #3,200 complete.
Simulation #3,300 complete.
Simulation #3,400 complete.
Simulation #3,500 complete.
Simulation #3,600 complete.
Simulation

In [15]:
def test_sim(env, q_table, n=100, render=True):
    """Test an environment using a pre-trained Q table
    
    Args:
        env (gym.envs): The environment to test
        q_table (np.array): The pretrained Q table
        n (int, optional): Defaults to 100. The number of test iterations to run
        render (bool, optional): Defaults to True. Whether to display a rendering of the environment
    
    Returns:
        np.array: Array of length n with each value being the cumulative reward achieved in the simulation
    """
    rewards = []
    
    for i in range(n):
        current_state = env.reset()

        tot_reward = 0
        done = False
        step = 0

        while not done:
            
            # Determine the best action
            # Step through the environment
            
            action = select_action(q_table[current_state,:], "epsilon", epsilon=0)
            step_vals = env.step(action)
            next_state = step_vals[0]
            reward = step_vals[1]
            done = step_vals[2]

            current_state = next_state

            tot_reward += reward
            step +=1
            if render:
                clear_output(wait=True)
                print(f"Simulation: {i + 1}")
                env.render()
                print(f"Step: {step}")
                print(f"Current State: {current_state}")
                print(f"Action: {action}")
                print(f"Reward: {reward}")
                print(f"Total rewards: {tot_reward}")
                sleep(.2)
            if step == 50:
                print("Agent got stuck. Quitting...")
                sleep(.5)
                break
        
        rewards.append(tot_reward)
    
    return np.array(rewards)

In [16]:
# Rendering is on by default; pass render=False to skip the animation
epsilon1_rewards = test_sim(env, epsilon1_q, 10)

Simulation: 10
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Step: 14
Current State: 475
Action: 5
Reward: 20
Total rewards: 7


In [17]:
epsilon2_rewards = test_sim(env, epsilon2_q, 10)

Simulation: 10
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Step: 17
Current State: 85
Action: 5
Reward: 20
Total rewards: 4


In [18]:
print(f"The first epsilon greedy training method was able to get a median reward of {np.median(epsilon1_rewards)}.")
print(f"The second epsilon greedy training method was able to get a median reward of {np.median(epsilon2_rewards)}.")

The first epsilon greedy training method was able to get a median reward of 7.5.
The second epsilon greedy training method was able to get a median reward of 7.5.


In [19]:
# Your models may not always pass the asserts below, but they should pass at least some of the time
# To avoid any issues with grading, we've commented them out.
# To make the most out of this activity, please uncomment them and get them to at least occasionally pass
assert np.median(epsilon1_rewards) > 5
assert np.median(epsilon2_rewards) > 5

## Feedback

In [21]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    return "All good!"  

In [22]:
print(feedback())

All good!
