# T81-558: Applications of Deep Neural Networks
**Module 12: Reinforcement Learning**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module Video Material

Main video lecture:

* Part 12.1: Introduction to the OpenAI Gym [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_01_ai_gym.ipynb)
* **Part 12.2: Introduction to Q-Learning** [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_02_qlearningreinforcement.ipynb)
* Part 12.3: Keras Q-Learning in the OpenAI Gym [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_03_keras_reinforce.ipynb)
* Part 12.4: Atari Games with Keras Neural Networks [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_04_atari.ipynb)
* Part 12.5: How Alpha Zero used Reinforcement Learning to Master Chess [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_05_alpha_zero.ipynb)


# Part 12.2: Introduction to Q-Learning

### Single Action Cart

Mountain car actions:

* 0 - Apply left force
* 1 - Apply no force
* 2 - Apply right force

State values:

* state[0] - Position 
* state[1] - Velocity

The following code shows a cart that always applies full force to the right in an attempt to climb the hill.  The cart is not strong enough to do this from a standing start.  It needs to build momentum by first rolling up the hill behind it.

In [1]:
import gym

env = gym.make("MountainCar-v0")
env.reset()
done = False

i = 0
while not done:
    i += 1
    state, reward, done, _ = env.step(2)
    env.render()
    print(f"Step {i}: State={state}, Reward={reward}")
    
env.close()

Step 1: State=[-4.29400171e-01  3.05074005e-04], Reward=-1.0
Step 2: State=[-0.42879222  0.00060795], Reward=-1.0
Step 3: State=[-0.42788577  0.00090645], Reward=-1.0
Step 4: State=[-0.42668735  0.00119843], Reward=-1.0
Step 5: State=[-0.42520556  0.00148179], Reward=-1.0
Step 6: State=[-0.42345105  0.00175451], Reward=-1.0
Step 7: State=[-0.42143641  0.00201465], Reward=-1.0
Step 8: State=[-0.41917604  0.00226037], Reward=-1.0
Step 9: State=[-0.41668609  0.00248995], Reward=-1.0
Step 10: State=[-0.41398431  0.00270178], Reward=-1.0
Step 11: State=[-0.41108991  0.00289441], Reward=-1.0
Step 12: State=[-0.40802338  0.00306652], Reward=-1.0
Step 13: State=[-0.40480642  0.00321697], Reward=-1.0
Step 14: State=[-0.40146165  0.00334477], Reward=-1.0
Step 15: State=[-0.39801255  0.0034491 ], Reward=-1.0
Step 16: State=[-0.39448322  0.00352933], Reward=-1.0
Step 17: State=[-0.39089823  0.00358499], Reward=-1.0
Step 18: State=[-0.38728241  0.00361582], Reward=-1.0
Step 19: State=[-0.3836607   

Step 155: State=[-0.42760263  0.0009829 ], Reward=-1.0
Step 156: State=[-0.42632979  0.00127284], Reward=-1.0
Step 157: State=[-0.42477616  0.00155363], Reward=-1.0
Step 158: State=[-0.42295289  0.00182327], Reward=-1.0
Step 159: State=[-0.42087304  0.00207985], Reward=-1.0
Step 160: State=[-0.4185515   0.00232154], Reward=-1.0
Step 161: State=[-0.41600484  0.00254666], Reward=-1.0
Step 162: State=[-0.41325119  0.00275365], Reward=-1.0
Step 163: State=[-0.41031012  0.00294107], Reward=-1.0
Step 164: State=[-0.40720245  0.00310767], Reward=-1.0
Step 165: State=[-0.40395012  0.00325233], Reward=-1.0
Step 166: State=[-0.400576    0.00337411], Reward=-1.0
Step 167: State=[-0.39710376  0.00347225], Reward=-1.0
Step 168: State=[-0.39355762  0.00354614], Reward=-1.0
Step 169: State=[-0.38996223  0.00359538], Reward=-1.0
Step 170: State=[-0.38634249  0.00361974], Reward=-1.0
Step 171: State=[-0.38272332  0.00361917], Reward=-1.0
Step 172: State=[-0.37912955  0.00359377], Reward=-1.0
Step 173: 

### Programmed Car

This is a car that I hand-programmed.  It uses a simple rule, but it solves the problem.  The programmed car always applies force in one direction or the other; it never coasts.  Whichever direction the car is currently rolling, it applies force in that direction.  The car therefore begins to climb a hill, is overpowered, and rolls backward.  As soon as it begins to roll backward, force is applied in this new direction, so each swing carries the car higher.

In [1]:
import gym

env = gym.make("MountainCar-v0")
state = env.reset()
done = False

i = 0
while not done:
    i += 1
    
    if state[1]>0:
        action = 2
    else:
        action = 0
    
    state, reward, done, _ = env.step(action)
    env.render()
    print(f"Step {i}: State={state}, Reward={reward}")
    
env.close()

Step 1: State=[-5.11319340e-01 -9.27702414e-05], Reward=-1.0
Step 2: State=[-5.11504185e-01 -1.84845182e-04], Reward=-1.0
Step 3: State=[-5.11779720e-01 -2.75534709e-04], Reward=-1.0
Step 4: State=[-5.12143879e-01 -3.64159056e-04], Reward=-1.0
Step 5: State=[-5.12593933e-01 -4.50053875e-04], Reward=-1.0
Step 6: State=[-0.51312651 -0.00053258], Reward=-1.0
Step 7: State=[-0.51373761 -0.0006111 ], Reward=-1.0
Step 8: State=[-0.51442266 -0.00068505], Reward=-1.0
Step 9: State=[-0.51517653 -0.00075386], Reward=-1.0
Step 10: State=[-0.51599355 -0.00081702], Reward=-1.0
Step 11: State=[-0.51686761 -0.00087406], Reward=-1.0
Step 12: State=[-0.51779215 -0.00092454], Reward=-1.0
Step 13: State=[-0.51876024 -0.00096809], Reward=-1.0
Step 14: State=[-0.51976461 -0.00100437], Reward=-1.0
Step 15: State=[-0.52079774 -0.00103313], Reward=-1.0
Step 16: State=[-0.52185188 -0.00105414], Reward=-1.0
Step 17: State=[-0.52291912 -0.00106724], Reward=-1.0
Step 18: State=[-0.52399145 -0.00107234], Reward=-1

Step 154: State=[0.17032272 0.01169586], Reward=-1.0
Step 155: State=[0.1808379  0.01051518], Reward=-1.0
Step 156: State=[0.19021205 0.00937415], Reward=-1.0
Step 157: State=[0.1984823  0.00827025], Reward=-1.0
Step 158: State=[0.20568281 0.00720051], Reward=-1.0
Step 159: State=[0.21184435 0.00616153], Reward=-1.0
Step 160: State=[0.21699399 0.00514965], Reward=-1.0
Step 161: State=[0.22115492 0.00416092], Reward=-1.0
Step 162: State=[0.22434618 0.00319127], Reward=-1.0
Step 163: State=[0.22658262 0.00223644], Reward=-1.0
Step 164: State=[0.22787473 0.00129211], Reward=-1.0
Step 165: State=[0.22822862 0.00035389], Reward=-1.0
Step 166: State=[ 0.22764596 -0.00058266], Reward=-1.0
Step 167: State=[ 0.225124   -0.00252196], Reward=-1.0
Step 168: State=[ 0.22065085 -0.00447315], Reward=-1.0
Step 169: State=[ 0.21420572 -0.00644513], Reward=-1.0
Step 170: State=[ 0.20575926 -0.00844646], Reward=-1.0
Step 171: State=[ 0.19527416 -0.0104851 ], Reward=-1.0
Step 172: State=[ 0.18270592 -0.01
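
The difference between the two hand-written policies above can be explored without the simulator. The sketch below is a pure-Python approximation of a mountain-car update rule. The constants (`FORCE = 0.001`, `GRAVITY = 0.0025`, goal at 0.5), the fixed start at position -0.5, the 1000-step cap, and all helper names are assumptions made for illustration; nothing here is read from this notebook's environment or calls into `gym`.

```python
import math

# Assumed constants for a mountain-car-style update (not read from gym).
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED, GOAL = -1.2, 0.6, 0.07, 0.5

def step(position, velocity, action):
    """Advance one time step; action is 0 (left), 1 (none) or 2 (right)."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity

def always_right(position, velocity):
    # Single-action policy: full force to the right on every step.
    return 2

def follow_momentum(position, velocity):
    # Same rule as the programmed car: push in the direction of travel.
    return 2 if velocity > 0 else 0

def simulate(policy, max_steps=1000, start=-0.5):
    """Return (reached_goal, steps_used, furthest_position)."""
    position, velocity, furthest = start, 0.0, start
    for t in range(1, max_steps + 1):
        action = policy(position, velocity)
        position, velocity = step(position, velocity, action)
        furthest = max(furthest, position)
        if position >= GOAL:
            return True, t, furthest
    return False, max_steps, furthest

for name, policy in [("always right", always_right),
                     ("follow momentum", follow_momentum)]:
    reached, steps, furthest = simulate(policy)
    print(f"{name}: reached goal={reached}, steps={steps}, furthest={furthest:.3f}")
```

Under these assumed constants, the always-right car stays trapped in the valley, while the momentum rule keeps adding energy on every swing and eventually reaches the goal. The number of steps it needs depends on the start position and on the step cap chosen here.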

### Reinforcement Learning

![Reinforcement Learning](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/reinforcement.png "Reinforcement Learning")


### Q-Learning Car

We will now use Q-Learning to produce a car that learns to drive itself.  Look out Tesla! 

Q-Learning works by building a lookup table, called the Q-table, that determines which of several actions to take in each state. The table is refined over many training episodes.

$ Q^{new}(s_{t},a_{t}) \leftarrow (1-\alpha) \cdot \underbrace{Q(s_{t},a_{t})}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot  \overbrace{\bigg( \underbrace{r_{t}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a}Q(s_{t+1}, a)}_{\text{estimate of optimal future value}} \bigg) }^{\text{learned value}} $
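
As a quick check on the update rule, the sketch below applies it once to made-up numbers. The function name `q_update` and the sample Q-values are hypothetical; the learning rate of 0.1 and the discount of 0.95 match the constants this notebook sets for training.

```python
def q_update(old_q, reward, max_future_q, alpha=0.1, gamma=0.95):
    """Blend the old estimate with the newly observed (learned) value."""
    learned_value = reward + gamma * max_future_q
    return (1 - alpha) * old_q + alpha * learned_value

# Hypothetical numbers: current estimate -1.0, step reward -1.0,
# best Q-value available from the next state -0.5.
new_q = q_update(old_q=-1.0, reward=-1.0, max_future_q=-0.5)
print(round(new_q, 4))  # -1.0475
```

With a small learning rate, the estimate moves only a tenth of the way from the old value (-1.0) toward the learned value (-1.475), which is why many episodes are needed before the table settles.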

In [1]:
import gym
import numpy as np

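def calc_discrete_state(state):
    # Map the continuous (position, velocity) state to integer grid indices.
    # Relies on the globals env and buckets defined in a later cell.
    discrete_state = (state - env.observation_space.low)/buckets
    # np.int was removed from recent NumPy releases; the built-in int works everywhere
    return tuple(discrete_state.astype(int))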

def run_game(q_table, render, should_update):
    done = False
    discrete_state = calc_discrete_state(env.reset())
    success = False
    
    while not done:
        # Exploit or explore
        if np.random.random() > epsilon:
            # Exploit - use q-table to take current best action (and probably refine)
            action = np.argmax(q_table[discrete_state])
        else:
            # Explore - take a random action
            action = np.random.randint(0, env.action_space.n)
            
        # Run simulation step
        new_state, reward, done, _ = env.step(action)
        
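        # Convert the new continuous state to its discrete grid cell
        new_state_disc = calc_discrete_state(new_state)

        # Check whether the car reached the goal position
        if new_state[0] >= env.goal_position:
            success = True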
          
        # Update q-table
        if should_update:
            max_future_q = np.max(q_table[new_state_disc])
            current_q = q_table[discrete_state + (action,)]
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            q_table[discrete_state + (action,)] = new_q

        discrete_state = new_state_disc
        
        if render:
            env.render()
            
    return success
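
`calc_discrete_state` turns the continuous (position, velocity) pair into grid indices for the Q-table. The pure-Python sketch below illustrates the same bucketing on a 10x10 grid. The bounds (-1.2 to 0.6 for position, -0.07 to 0.07 for velocity) are assumed rather than read from `env.observation_space`, and the clamp to the last bucket is an addition of this sketch, not part of the notebook's function.

```python
# Assumed observation bounds, ordered as (position, velocity).
LOW = (-1.2, -0.07)
HIGH = (0.6, 0.07)
GRID = (10, 10)

# Width of one bucket along each dimension.
BUCKETS = tuple((hi - lo) / n for lo, hi, n in zip(LOW, HIGH, GRID))

def discretize(state):
    """Map a continuous state to a tuple of bucket indices."""
    indices = []
    for value, lo, width, n in zip(state, LOW, BUCKETS, GRID):
        index = int((value - lo) / width)
        # Clamp so a value exactly on the upper bound stays inside the grid.
        indices.append(min(max(index, 0), n - 1))
    return tuple(indices)

print(discretize((-0.5, 0.01)))  # (3, 5)
print(discretize((0.6, 0.07)))   # (9, 9)
```

Without the clamp, a value sitting exactly on the upper bound would map to index 10, one past the end of a 10-bucket axis.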


In [2]:
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 10000
SHOW_EVERY = 1000

DISCRETE_GRID_SIZE = [10, 10]
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES//2

In [3]:
env = gym.make("MountainCar-v0")

epsilon = 1  
epsilon_change = epsilon/(END_EPSILON_DECAYING - START_EPSILON_DECAYING)
buckets = (env.observation_space.high - env.observation_space.low)/DISCRETE_GRID_SIZE
q_table = np.random.uniform(low=-3, high=0, size=(DISCRETE_GRID_SIZE + [env.action_space.n]))
success = False


In [4]:
episode = 0
success_count = 0

while episode<EPISODES:
    episode+=1
    done = False

    if episode % SHOW_EVERY == 0:
        print(f"Current episode: {episode}, success: {success_count} ({float(success_count)/SHOW_EVERY})")
        success = run_game(q_table, True, False)
        success_count = 0
    else:
        success = run_game(q_table, False, True)
        
    if success:
        success_count += 1

    # Move epsilon towards its ending value, if it still needs to move
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_change

print(success)

Current episode: 1000, success: 0 (0.0)
Current episode: 2000, success: 0 (0.0)
Current episode: 3000, success: 1 (0.001)
Current episode: 4000, success: 40 (0.04)
Current episode: 5000, success: 351 (0.351)
Current episode: 6000, success: 454 (0.454)
Current episode: 7000, success: 705 (0.705)
Current episode: 8000, success: 935 (0.935)
Current episode: 9000, success: 880 (0.88)
Current episode: 10000, success: 760 (0.76)
True
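
The linear epsilon schedule used in the training loop can be traced on its own. The sketch below repeats the same arithmetic with the notebook's constants (10000 episodes, decay from episode 1 through `EPISODES//2`); the function name `epsilon_schedule` and its parameters are hypothetical.

```python
def epsilon_schedule(episodes=10000, start=1, end=None, epsilon=1.0):
    """Return the value of epsilon after each episode under a linear decay."""
    end = episodes // 2 if end is None else end
    change = epsilon / (end - start)
    history = []
    for episode in range(1, episodes + 1):
        # Same condition as the training loop: decay only inside the window.
        if end >= episode >= start:
            epsilon -= change
        history.append(epsilon)
    return history

history = epsilon_schedule()
for episode in (1, 2500, 5000, 10000):
    print(episode, round(history[episode - 1], 4))
```

Because the decay window covers 5000 episodes while the step size is 1/4999, epsilon finishes a hair below zero. In `run_game` that simply means the random draw always exceeds epsilon, so the agent exploits the Q-table on every step for the second half of training.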


In [7]:
run_game(q_table, True, False)

True

In [8]:
import pandas as pd

df = pd.DataFrame(q_table.argmax(axis=2))

In [9]:
# Rows follow the first grid axis (position); columns follow the second (velocity)
df.columns = [f'v-{x}' for x in range(DISCRETE_GRID_SIZE[1])]
df.index = [f'p-{x}' for x in range(DISCRETE_GRID_SIZE[0])]
df

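| | v-0 | v-1 | v-2 | v-3 | v-4 | v-5 | v-6 | v-7 | v-8 | v-9 |
|---|---|---|---|---|---|---|---|---|---|---|
| p-0 | 1 | 2 | 1 | 0 | 2 | 2 | 2 | 0 | 0 | 0 |
| p-1 | 0 | 2 | 2 | 1 | 2 | 2 | 1 | 2 | 1 | 1 |
| p-2 | 0 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 1 | 2 |
| p-3 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 0 |
| p-4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 |
| p-5 | 2 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 2 | 2 |
| p-6 | 1 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 1 | 2 |
| p-7 | 2 | 1 | 0 | 0 | 2 | 1 | 1 | 2 | 0 | 0 |
| p-8 | 1 | 1 | 2 | 0 | 0 | 2 | 0 | 2 | 2 | 2 |
| p-9 | 1 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 1 |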


In [10]:
DISCRETE_GRID_SIZE[0]

10

In [11]:
np.argmax(q_table[(2,0)])

0