# Maze Robot Code
Testing Q-learning code with Gym's `Taxi-v3` environment.

## Imports
|Package|Usage|Link|
|-|-|-|
|gym|Toolkit for testing AI agents in simulated environments, without real hardware|https://gym.openai.com|
|numpy|The fundamental package for scientific computing with Python|https://numpy.org|
|random|Generates random numbers| |
|clear_output|Clears the notebook cell output| |
|sleep|Pauses execution for the given number of seconds| |

In [1]:
import gym
import numpy as np
import random
from IPython.display import clear_output
from time import sleep

## Initializing Variables
|Name|Usage|
|---|---|
|environment|Environment for testing the AI, provided by Gym|
|Hyperparam|Holds the parameters used by the Q-learning formula (alpha, gamma, epsilon)|
|Res|Holds the responses from the environment|
|epoch_array|Array holding all epochs|
|penalty_array|Array holding all penalties|
|q_table|See the "Q-Table" subheading|

### Q-Table
`np.zeros` creates the table, initialized to zero. Its shape is `[number of observations (states), number of possible actions (drop off, pick up, left, right, up, down)]`, so each row is a state and each column is an action.
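
To make the table dimensions concrete, here is a minimal pure-Python sketch of how the state count could arise. It assumes the usual Taxi layout (a 5x5 grid, 5 passenger locations including "in the taxi", and 4 destinations) and an assumed action order. `encode_state` and `toy_q_table` are hypothetical names used for illustration only; they are not part of the notebook or of Gym.

```python
# Assumed Taxi layout: 5x5 grid, 5 passenger locations, 4 destinations.
N_ROWS, N_COLS = 5, 5
N_PASSENGER_LOCATIONS = 5  # four stops plus "inside the taxi" (assumption)
N_DESTINATIONS = 4
ACTIONS = ["south", "north", "east", "west", "pickup", "dropoff"]  # assumed order

# Number of rows the Q-table needs: one per distinct state.
n_states = N_ROWS * N_COLS * N_PASSENGER_LOCATIONS * N_DESTINATIONS


def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Hypothetical helper: pack the four state components into one row index."""
    index = taxi_row
    index = index * N_COLS + taxi_col
    index = index * N_PASSENGER_LOCATIONS + passenger_location
    index = index * N_DESTINATIONS + destination
    return index


# Plain-Python stand-in for np.zeros([n_states, n_actions]).
toy_q_table = [[0.0] * len(ACTIONS) for _ in range(n_states)]

state = encode_state(taxi_row=3, taxi_col=1, passenger_location=2, destination=0)
print(f"{n_states} states x {len(ACTIONS)} actions, example state index: {state}")
```

Under these assumptions, `toy_q_table[state][action]` plays the same role as `q_table[state, action]` in the notebook: the current value estimate for taking `action` in `state`.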

In [2]:
environment = gym.make("Taxi-v3").env

class Hyperparam:
    def __init__(self, alpha, gamma, epsilon):
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

hyperparams = Hyperparam(0.1, 0.6, 0.1)

class Res:
    def __init__(self, action, state, reward, done, info):
        self.action = action
        self.state = state
        self.reward = reward
        self.done = done
        self.info = info

epoch_array = []
penalty_array = []

q_table = np.zeros([environment.observation_space.n, environment.action_space.n])

## Training Algorithm
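
The training cell picks each action with an epsilon-greedy rule: with probability `epsilon` it explores a random action, otherwise it exploits the best known action for the current state. The following is a minimal standalone sketch of that rule on a single Q-table row. `choose_action` and the sample values in `q_row` are made up for illustration and do not appear in the notebook.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Hypothetical helper: explore with probability epsilon, otherwise exploit."""
    if rng.uniform(0, 1) < epsilon:
        # Explore: pick any action index uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: index of the largest Q-value (a pure-Python argmax).
    return max(range(len(q_row)), key=lambda action: q_row[action])


# Made-up Q-values for one state, one entry per action.
q_row = [0.0, -1.2, 3.5, 0.4, -10.0, -10.0]

greedy_action = choose_action(q_row, epsilon=0.0)   # never explores
random_action = choose_action(q_row, epsilon=1.0)   # always explores
print(greedy_action, random_action)
```

With `epsilon=0.1`, as in `hyperparams`, roughly one step in ten would be exploratory.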

In [3]:
%%time

for i in range(1, 100001):
    epochs, penalties, reward = 0, 0, 0
    result = Res(0, 0, 0, False, "")

    result.state = environment.reset()

    while not result.done:
        if random.uniform(0, 1) < hyperparams.epsilon:
            result.action = environment.action_space.sample()
        else:
            result.action = np.argmax(q_table[result.state])

        next_state, result.reward, result.done, result.info = environment.step(result.action)

        old_value = q_table[result.state, result.action]
        next_max = np.max(q_table[next_state])

        # Q-Learning Formula
        new_value = (1 - hyperparams.alpha) * old_value + hyperparams.alpha * (result.reward + hyperparams.gamma * next_max)
        q_table[result.state, result.action] = new_value

        # Count penalties using the reward returned by this step
        if result.reward == -10:
            penalties += 1
        
        result.state = next_state
        epochs += 1
    
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Trained.")
sleep(5)

Episode: 100000
Trained.
CPU times: user 25.9 s, sys: 3.08 s, total: 29 s
Wall time: 31.6 s
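
The update rule in the training loop blends the old estimate with the newly observed reward plus the discounted best value of the next state. As a small worked example, the sketch below applies the same formula to hand-picked numbers using the notebook's `alpha = 0.1` and `gamma = 0.6`. `q_update` is a hypothetical helper, and the reward values (-1 for a step, 20 for a drop-off) are assumed for illustration.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """Hypothetical helper: one Q-learning update, same formula as the training loop."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)


# First visit to a state: everything is still zero and the step costs -1 (assumed).
first_step = q_update(old_value=0.0, reward=-1, next_max=0.0)

# Assumed successful drop-off reward of 20 on a fresh table entry.
dropoff = q_update(old_value=0.0, reward=20, next_max=0.0)

# Later update: the next state already has a positive estimate.
later_step = q_update(old_value=first_step, reward=-1, next_max=dropoff)

print(first_step, dropoff, later_step)
```

Because `alpha` is small, each update moves the stored value only a tenth of the way toward the new target, which is why many episodes are needed before the table settles.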


## Performing

In [4]:
class Performer:
    def __init__(self, state, next_state, reward, info, done, period):
        self.state = state
        self.next_state = next_state
        self.reward = reward
        self.info = info
        self.done = done
        self.period = period

performer = Performer(0, 0, 0, "", False, 0)

performer.state = environment.reset()

clear_output(wait=True)
environment.render()

while not performer.done:
    performer.period += 1

    performer.action = np.argmax(q_table[performer.state])
    performer.next_state, performer.reward, performer.done, performer.info = environment.step(performer.action)

    performer.state = performer.next_state

    clear_output(wait=True)
    print(f"{performer.next_state} {performer.reward} {performer.done} {performer.info}")
    print(f"Period: {performer.period}")
    environment.render()
    sleep(.5)

print("Done.")

0 20 True {'prob': 1.0}
Period: 13
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Done.
