# Module 11: Reinforcement Learning

### I had to start by installing a library called gym
### Code for the first example is taken/updated from here: https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

### There is a lot of potential complexity involved in reinforcement learning because we are trying to train a machine to "intelligently" act given a situation.
### For the examples here, we are working with some commonly used scenarios for learning how to set up a reinforcement learning system.

### For the first example, the environment we'll be working in is a simulated parking lot where our agent, a taxi, picks up and drops off passengers.  Each of the positions R, G, Y, B can be the pickup or dropoff location for a passenger.  The blue letter is the pickup point and the magenta letter is the dropoff point.  The taxi is yellow when empty and green when it has a passenger.

In [None]:
import gym

env = gym.make("Taxi-v3").env
env.render()

### We will be resetting the environment each time we want to run an episode.  An episode is one opportunity for our agent to complete the requested task.

In [None]:
env.reset() # reset environment to a new, random state
env.render()

print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

### 5 rows x 5 columns x 5 passenger locations (the 4 lettered positions plus 1 for in the taxi) x 4 destinations = 500 possible states
### Action space includes S, N, E, W, Pickup, Dropoff - These are the 6 things the agent can decide to do at each timestep.

### We can look at a specific state if we want

In [None]:
state = env.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state
env.render()
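
### The state number printed above is just the four components packed into one integer.  Below is a minimal sketch of that packing, assuming the 5 x 5 x 5 x 4 layout described earlier.  `encode_state` and `decode_state` are hypothetical helper names written for illustration; they are not part of gym.

```python
# Sizes of each component of a Taxi state (assumed from the 5 x 5 x 5 x 4 = 500 breakdown).
N_ROWS, N_COLS, N_PASSENGER_LOCS, N_DESTINATIONS = 5, 5, 5, 4


def encode_state(taxi_row, taxi_col, passenger_idx, destination_idx):
    """Pack the four components into a single integer in [0, 500)."""
    state = taxi_row
    state = state * N_COLS + taxi_col
    state = state * N_PASSENGER_LOCS + passenger_idx
    state = state * N_DESTINATIONS + destination_idx
    return state


def decode_state(state):
    """Invert encode_state: recover (row, col, passenger, destination)."""
    state, destination_idx = divmod(state, N_DESTINATIONS)
    state, passenger_idx = divmod(state, N_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(state, N_COLS)
    return taxi_row, taxi_col, passenger_idx, destination_idx


print(encode_state(3, 1, 2, 0))  # 328, matching the state used in this notebook
print(decode_state(328))         # (3, 1, 2, 0)
```

### Because every combination maps to a distinct integer, the q-table later in this notebook can use the state number directly as a row index.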

### Reward scores are already established for each of the available actions on each state.  Here's the scoring for the specific state above

In [None]:
env.P[328]

### Each direction here has a -1 reward.  Trying to do a pickup or dropoff here would be -10 reward because it's an invalid action.  Note that the state doesn't change if the agent attempts to go west because they hit a wall.  Each time the agent attempts to do this, they get a -1 reward
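
### Each entry of `env.P[state]` maps an action to a list of `(probability, next_state, reward, done)` tuples.  The dictionary below is a hand-written illustration in that format, built from the description above (moves cost -1, an invalid pickup or dropoff costs -10, west is blocked).  It is not output captured from gym, and the next-state numbers for south, north and east are assumptions derived from the state encoding.  `ACTION_NAMES` and `blocked_moves` are hypothetical names.

```python
# Action indices in the order Taxi uses them (assumed: 0-3 are moves, 4-5 are passenger actions).
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Hand-written stand-in for env.P[328]: {action: [(probability, next_state, reward, done)]}.
transitions_328 = {
    0: [(1.0, 428, -1, False)],   # south: one row down
    1: [(1.0, 228, -1, False)],   # north: one row up
    2: [(1.0, 348, -1, False)],   # east: one column right
    3: [(1.0, 328, -1, False)],   # west: wall, so the state is unchanged
    4: [(1.0, 328, -10, False)],  # pickup: no passenger at this position
    5: [(1.0, 328, -10, False)],  # dropoff: the taxi is not carrying a passenger
}


def blocked_moves(state, transitions):
    """Return the names of movement actions that leave the state unchanged."""
    blocked = []
    for action in range(4):  # only the four movement actions
        probability, next_state, reward, done = transitions[action][0]
        if next_state == state:
            blocked.append(ACTION_NAMES[action])
    return blocked


print(blocked_moves(328, transitions_328))  # ['west']
```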

### Here's a state where you can see the positive reward for completing a dropoff.

In [None]:
state = env.encode(0, 0, 4, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state
env.render()
env.P[state]

### We can have our agent randomly attempt to do a pickup and dropoff.  He doesn't do very well, usually.

In [None]:
env.s = 328  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

### A visual of what is happening is pretty nice here

In [None]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

### In order to make our agent learn what to do to maximize reward, we are now setting up a q-table which will score/store all of the actions taken by the agent.

### The q-table is effectively the agent's memory of what it has learned.  Each time the agent takes an action, the q-table is updated, so the agent gradually "understands" which actions are correct/optimal in each state

In [None]:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])
print(q_table.shape) # a row for each state and a column for each action
print(q_table[:10]) # first 10 rows

### With our q-table created, we'll now let our agent randomly explore the environment/actions.  Each action will update the q-table.  This takes a lot of iterations to get a good idea of the value of each action

### Before that, we need to understand a few variables that we can set prior to training our agent.
* Alpha - Learning rate - An alpha of zero means q-values are never updated.  An alpha near one means learning occurs more quickly but largely overwrites prior knowledge
* Gamma - Discount factor - This tells the agent how much to prioritize immediate rewards over future rewards.  A gamma near zero means the agent will only look at what is immediately the most rewarding.  Near 1, the agent will focus much more on long-term rewards
* Epsilon - Exploration (near one) vs. Exploitation (near zero) - This is the probability that the agent takes a random action (explores) instead of exploiting the knowledge gained from previous experience.  Some reinforcement models will decay this over episodes as the model gains more knowledge
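
### To see what each setting does, here is a minimal sketch of one epsilon-greedy choice and one q-value update in plain Python.  `choose_action` and `q_update` are hypothetical helper names, and the numbers are made up for illustration; the update follows the standard tabular Q-learning rule.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                  # explore: any action
    return max(range(len(q_row)), key=q_row.__getitem__)  # exploit: best known action


def q_update(old_value, reward, next_max, alpha, gamma):
    """Blend the old estimate with the new target (reward + discounted future value)."""
    target = reward + gamma * next_max
    return (1 - alpha) * old_value + alpha * target


q_row = [0.0, -0.5, 2.0, -1.0]  # made-up q-values for one state

print(choose_action(q_row, epsilon=0.0))  # 2: with epsilon = 0 the agent always exploits

# alpha = 0 keeps the old value; alpha = 1 replaces it with the target.
print(q_update(old_value=0.0, reward=-1, next_max=4.0, alpha=0.0, gamma=0.5))  # 0.0
print(q_update(old_value=0.0, reward=-1, next_max=4.0, alpha=1.0, gamma=0.5))  # 1.0

# gamma = 0 ignores the future entirely, so the target is just the reward.
print(q_update(old_value=0.0, reward=-1, next_max=4.0, alpha=1.0, gamma=0.0))  # -1.0
```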

In [None]:
%%time
"""Training the agent"""

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

### Here's the q-table now for the states we looked at earlier

In [None]:
print(q_table[328])
print(q_table[16])
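
### A row of the q-table is easier to read with action names attached.  The sketch below uses a made-up row (not values from the trained table) and assumes Taxi's action order of south, north, east, west, pickup, dropoff.  `describe_row` is a hypothetical helper.

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def describe_row(q_row):
    """Pair each q-value with its action name, best action first."""
    return sorted(zip(ACTION_NAMES, q_row), key=lambda pair: pair[1], reverse=True)


example_row = [-2.4, -2.3, -2.4, -2.4, -11.0, -11.0]  # made-up values for illustration

ranked = describe_row(example_row)
best_action, best_value = ranked[0]
print(best_action)  # north
for name, value in ranked:
    print(f"{name:8s} {value:6.1f}")
```

### The action at the top of the list is the one `np.argmax` would pick for that state during evaluation.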

### With our agent well trained, let's have it run through some episodes to see how well it does

In [None]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 100
frames = []
for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1
        # Put each rendered frame into dict for animation
        frames.append({
            'frame': env.render(mode='ansi'),
            'state': state,
            'action': action,
            'reward': reward
            }
        )
    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

### Let's watch our agent go

In [None]:
print_frames(frames)

### Frozen Lake

### This example is in a similar thread as the previous.  Our agent is attempting to cross a slippery frozen lake safely to get his frisbee.  Some parts of the lake have holes in the ice that the agent will fall into and lose...a lot!
### The only way to be rewarded in this case is to cross the lake without dying.  This makes training more challenging.

In [None]:
import numpy as np
import gym
import random
from IPython.display import clear_output  # used in the training loop below

In [None]:
env = gym.make("FrozenLake8x8-v0")
env.render()
# (S: starting point, safe)
# (F: frozen surface, safe)
# (H: hole, fall to your doom)
# (G: goal, where the frisbee is located)

In [None]:
env.P[62]
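
### On the slippery lake, an action can have several possible outcomes, so `env.P[state][action]` can hold more than one `(probability, next_state, reward, done)` tuple.  The sketch below averages over such a list to get the expected immediate reward.  The transition list is a hand-written illustration in that format, assuming three equally likely outcomes when stepping toward the goal; it is not output captured from gym, and `expected_reward` is a hypothetical helper.

```python
def expected_reward(outcomes):
    """Probability-weighted average reward over (probability, next_state, reward, done) tuples."""
    return sum(probability * reward for probability, _next_state, reward, _done in outcomes)


# Illustrative outcomes for one action taken next to the goal (state 63):
# the agent may slip sideways instead of moving where it intended.
outcomes_toward_goal = [
    (1 / 3, 62, 0.0, False),  # slipped and stayed put
    (1 / 3, 63, 1.0, True),   # reached the goal
    (1 / 3, 54, 0.0, True),   # slipped into a hole
]

print(round(expected_reward(outcomes_toward_goal), 3))  # 0.333
```

### In this illustration even the right move next to the goal pays off only a third of the time, which hints at why the agent needs many episodes to learn a useful table.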

### The action space here is only 4 options: N, S, E, W

In [None]:
action_size = env.action_space.n
state_size = env.observation_space.n
print(action_size, state_size)

In [None]:
qtable = np.zeros((state_size, action_size))
print(qtable)

In [None]:
total_episodes = 20000       # Total episodes
learning_rate = 0.75          # Learning rate
max_steps = 450               # Max steps per episode
gamma = 0.90                 # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.10            # Minimum exploration probability 
decay_rate = 0.00005          # Exponential decay rate for exploration prob
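
### The exploration parameters feed an exponential decay that the training loop applies after every episode.  Here is a minimal sketch of that schedule on its own, using the same values as the cell above; `epsilon_at` is a hypothetical helper name.

```python
import math


def epsilon_at(episode, min_epsilon=0.10, max_epsilon=1.0, decay_rate=0.00005):
    """Exploration rate after a given number of episodes of exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 5000, 10000, 20000):
    print(episode, round(epsilon_at(episode), 3))
```

### With these values epsilon is still about 0.43 after 20,000 episodes, so the agent keeps exploring a large share of the time throughout training.  A larger `decay_rate` would shift it toward exploitation sooner.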

### We need to train our agent with the environment

In [None]:
# List of rewards
rewards = []
iteration = 0

# 2 For life or until learning is stopped
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    print(qtable)
    
    for step in range(max_steps):
        # 3. Choose an action a in the current world state (s)
        ## First we randomize a number
        exp_exp_tradeoff = random.uniform(0, 1)
        
        ## If this number is greater than epsilon --> exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])
            #print(exp_exp_tradeoff, "action", action)

        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()
            #print("action random")
            print(action, end=" ")
            
        
        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        
        total_rewards += reward
        
        # Our new state is state
        state = new_state
        
        # If done (if we're dead) : finish episode
        if done:
            break
        
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards.append(total_rewards)
    clear_output(wait=True)
    print("training iteration - ", iteration)
    iteration += 1
    

print("Average reward per episode during training: " + str(sum(rewards) / total_episodes))
print(qtable)
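
### The update inside the loop above is written as an increment to the old value.  It is the same rule as the weighted-average form used for the taxi agent, with $\alpha$ standing for `learning_rate`, $r$ for the reward, and $s'$ for the new state.  Expanding one form gives the other:

```latex
\begin{aligned}
Q(s,a) &\leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \\
       &= (1 - \alpha)\, Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right]
\end{aligned}
```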

In [None]:
env.reset()
win_count = 0
loss_count = 0
frames = []
for episode in range(500):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        frames.append({
            'frame': env.render(mode='ansi'),
            'state': state,
            'action': action,
            'reward': reward,
            'win count': win_count,
            'loss count': loss_count
            }
        )
        if done:
            # Here, we decide to only print the last state (to see if our agent is on the goal or fell into a hole)
            env.render()
            if new_state == 63:
                print("We reached our Goal 🏆")
                win_count += 1
            else:
                # Episode ended without reaching the goal: a hole, or a step limit if the environment enforces one
                print("We fell into a hole ☠️")
                loss_count += 1
            
            # We print the number of steps it took.
            print("Number of steps", step)
            
            break
        state = new_state
print("win count -",win_count,"loss count",loss_count)
env.close()

In [None]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        print(f"Win Count: {frame['win count']}")
        print(f"Loss Count: {frame['loss count']}")
        sleep(.1)
        
print_frames(frames)