<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#2019-05-17_week12_fundamental-learning_openAIgym-deepQ-exploration" data-toc-modified-id="2019-05-17_week12_fundamental-learning_openAIgym-deepQ-exploration-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>2019-05-17_week12_fundamental-learning_openAIgym-deepQ-exploration</a></span><ul class="toc-item"><li><span><a href="#model-definition" data-toc-modified-id="model-definition-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>model definition</a></span></li><li><span><a href="#lr-exploration" data-toc-modified-id="lr-exploration-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>lr exploration</a></span><ul class="toc-item"><li><span><a href="#memory-wrt-learning-rate-hyperparameter-matrix" data-toc-modified-id="memory-wrt-learning-rate-hyperparameter-matrix-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>memory wrt learning rate hyperparameter matrix</a></span></li></ul></li><li><span><a href="#Eps-greedy-exploration" data-toc-modified-id="Eps-greedy-exploration-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Eps greedy exploration</a></span><ul class="toc-item"><li><span><a href="#Form-of-the-decaying-epsilon-greedy-algorithm" data-toc-modified-id="Form-of-the-decaying-epsilon-greedy-algorithm-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Form of the decaying epsilon greedy algorithm</a></span></li></ul></li><li><span><a href="#Review-of-first-attempt-at-Reinforcement-learning" data-toc-modified-id="Review-of-first-attempt-at-Reinforcement-learning-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Review of first attempt at Reinforcement learning</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Done-well" data-toc-modified-id="Done-well-1.4.0.1"><span class="toc-item-num">1.4.0.1&nbsp;&nbsp;</span>Done well</a></span></li><li><span><a href="#Points-of-difficulty" data-toc-modified-id="Points-of-difficulty-1.4.0.2"><span 
class="toc-item-num">1.4.0.2&nbsp;&nbsp;</span>Points of difficulty</a></span></li><li><span><a href="#Thinking-points-for-next-project" data-toc-modified-id="Thinking-points-for-next-project-1.4.0.3"><span class="toc-item-num">1.4.0.3&nbsp;&nbsp;</span>Thinking points for next project</a></span></li></ul></li></ul></li><li><span><a href="#Rendering-for-presentation" data-toc-modified-id="Rendering-for-presentation-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Rendering for presentation</a></span></li></ul></li></ul></div>

# 2019-05-17_week12_fundamental-learning_openAIgym-deepQ-exploration

In [None]:
from collections import namedtuple
import gym
import attr
import random
import math
from itertools import count

import torch
from torch import optim
import torch.nn as nn
import torch.nn.functional as F

import numpy as np
import matplotlib.pyplot as plt

import pdb
from time import time

## model definition

In [None]:
Experience = namedtuple("Experience", ("state", "action", "reward", "next_state"))


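@attr.s
class ReplayMemory(object):
    capacity = attr.ib()
    # A fresh list per instance; a bare class-level `memory = []` would be
    # shared by every ReplayMemory, leaking experiences between solvers.
    memory = attr.ib(factory=list)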

    def push(self, transition):
        """adds experiences to the memory buffer"""
        self.memory.append(transition)
        if len(self.memory) > self.capacity:
            del self.memory[0]

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


class CPNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 200)
        # self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 2)

    def forward(self, x):
        xb = x.view(-1, 4)
        xb = F.relu(self.fc1(xb))
        # xb = F.relu(self.fc2(xb))
        xb = self.fc3(xb)
        return xb.view(-1, xb.size(1))
    



class CPSolver(object):
    """ 
    Manages the whole RL pipeline: environment, replay memory, Q-network,
    and training loop. This was written for my own understanding, but it
    runs and hopefully provides a bit of insight.

    Args:
        episodes (int): Number of episodes (games played to termination) to run
        memory (int): Capacity of the replay memory buffer, in experiences
        gamma (float): Discount factor applied to the estimated value of the
            next state when building the Q-learning target. Between 0 and 1
        lr (float): Learning rate of the gradient descent used to train the network
        batch_size (int): Number of experiences sampled per training step
        eps_start (float): Starting probability of taking a random action
        eps_end (float): Final probability of taking a random action
        eps_decay (int): Exponential decay constant, in environment steps, for
            the probability of a random action. See self.eps_threshold and
            self.eps_greedy for the decaying epsilon-greedy strategy
        optimizer (torch optimizer class): Defaults to optim.SGD (stochastic
            gradient descent); others could be used
        loss_fn (torch loss class): Defaults to nn.MSELoss (mean squared error)
        render (bool): Whether to display the game while it runs
        render_step (int): Render the game every render_step episodes
        output (bool): Whether to save video files. Requires ffmpeg
    
    """
    def __init__(
        self,
        episodes=300,
        memory=10000,
        gamma=0.8,
        lr=0.01,
        batch_size=32,
        eps_start=0.9,
        eps_end=0.01,
        eps_decay=100,
        optimizer=optim.SGD,
        loss_fn=nn.MSELoss,
        render = True,
        render_step = 100,
        output = False
    ):

        self.eps_start = eps_start
        self.eps_end = eps_end
        self.eps_decay = eps_decay
        self.episodes = episodes
        self.gamma = gamma
        self.lr = lr
        self.batch_size = batch_size
        self.memory_size = memory
        
        self.memory = ReplayMemory(memory)
        self.env = gym.make("CartPole-v1")
        self.model = CPNet()
        self.optimizer = optimizer(self.model.parameters(), lr=lr)
        self.loss_fn=loss_fn()
        
        self.render = render
        self.render_step = render_step
        if output:
            # enables video file output
            self.env = gym.wrappers.Monitor(self.env, f'./RL_vids/{str(time())}/',video_callable=self.render_check)
        
        
        
    def render_check(self, step):
        if step==0:
            return True
        else:
            return (step+1)%self.render_step==0
    
    def eps_threshold(self, steps_done):
        return self.eps_end + (self.eps_start - self.eps_end) * math.exp(
            -1.0 * steps_done / self.eps_decay
        )

    def select_action(self, state, steps_done):
        """Selects the greedy action, i.e. the one with the highest predicted Q-value"""
        with torch.no_grad():
            # the model predicts one Q-value per action
            prediction = self.model(state)
            # pick the index of the largest Q-value, shaped (1, 1)
            action = prediction.max(1)[1].view(1, 1)
        return action
            
        
    def eps_greedy(self, state, steps_done):
        if random.random() > self.eps_threshold(steps_done):
            return self.select_action(state, steps_done)
        else:
            return torch.tensor([[random.choice([0, 1])]])
        

    def optimize_model(self):
        
        transitions = self.memory.sample(self.batch_size)
        # print(transitions)

        batch_state, batch_action, batch_reward, batch_next_state = zip(*transitions)

        batch_state = torch.cat(batch_state)
        batch_action = torch.cat(batch_action)
        batch_reward = torch.cat(batch_reward)
        batch_next_state = torch.cat(batch_next_state)
        # print(f'batch_reward:{batch_reward}')

        self.optimizer.zero_grad()
        # The network returns one Q-value per action.
        # Select the Q-values corresponding to the actions actually taken.
        current_q_values = self.model(batch_state).gather(1, batch_action).view(-1)
        # Largest Q-value available from each next state. Terminal transitions
        # are not masked, so the target bootstraps from next_state even when
        # the episode ended there.
        max_next_q_values = self.model(batch_next_state).detach().max(1)[0]
        expected_q_values = batch_reward + (self.gamma * max_next_q_values)
        # print(batch_state.shape, batch_action.shape, batch_reward.shape, batch_next_state.shape)
        # print(current_q_values.shape, expected_q_values.shape)
        # print(current_q_values, expected_q_values)
        
        # loss is measured from error between current and newly expected Q values
        loss = self.loss_fn(current_q_values, expected_q_values)
        # backpropagation of loss to NN

        loss.backward()
        # print(f'ep:{episode:03d}-step:{i:03d}-loss:{loss:0.4f}')
        # print(current_q_values.mean().item(), expected_q_values.mean().item())
        # print(optimizer.param_groups[0]['params'][0].grad.mean().item(),
        #       optimizer.param_groups[0]['params'][0].grad.var().item())
        self.optimizer.step()

    def learn(self):
        steps_done = 0
        ep_count = []
        step_count = []
        
        for episode in range(self.episodes):
            state = self.env.reset()
            for i in count():
                if self.render and (episode+1) % self.render_step ==0:
                    if i==0:
                        print(f'lr:{self.lr}_mem:{self.memory_size}_Episode:{episode+1}')
                    self.env.render()
                action = self.eps_greedy(torch.FloatTensor([state]), steps_done)
                steps_done += 1
                next_state, reward, done, info = self.env.step(action[0, 0].item())
                if done:
                    reward = -1
                self.memory.push(
                    (
                        torch.FloatTensor(state),
                        action,
                        torch.FloatTensor([reward]),
                        torch.FloatTensor(next_state),
                    )
                )
                # Only train if there are enough memories to produce a full batch
                if len(self.memory) >= self.batch_size:
                    self.optimize_model()
                    
                state = next_state
                if done:
                    ep_count.append(episode)
                    step_count.append(i)
                    break
        return ep_count, step_count

    def close(self):
        self.env.close()


In [None]:
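fig, ax = plt.subplots()
cartpole_solver = CPSolver()
ep_count, steps = cartpole_solver.learn()
cartpole_solver.close()
# Label with the solver's own defaults; lr, mem, and eps are not defined yet
lr, mem, eps = cartpole_solver.lr, cartpole_solver.memory_size, cartpole_solver.episodes
ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr}_mem-{mem}_{eps}eps")
plt.show()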

## lr exploration

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.001, 0.01, 0.1]
mem_space = [1000]
eps = 100

for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,
            eps_start=0.9,
            eps_end=0.05,
            eps_decay=200,
            output=True,
            render=True,
            render_step=eps,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

In [None]:
model = cartpole_solver.model  # network trained by the most recent solver
params = list(model.parameters())
for i in params:
    print(i.shape)
params[3]

In [None]:
nmodel=CPNet()

In [None]:
print(nmodel.fc3.bias)
nmodel.fc3.bias=params[3]
print(nmodel.fc3.bias)

In [None]:
torch.save(model.state_dict(), 'test_model.torch')

new_model = CPNet()
new_model.load_state_dict(torch.load('test_model.torch'))

Use the save and load functionality to reuse a model later for inference.

A learning rate of 0.01 seems to be the sweet spot between learning speed and stability.

The instability could be due to sensitivity to what arrives in each batch. If the memory size is low, then over time the early experiences will be replaced and will stop appearing in batches. The network's training will then no longer emphasize them, and it could forget what it learned from them.

Try with a larger memory.
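
The forgetting argument above can be illustrated without any training. This is a minimal sketch, assuming a first-in-first-out buffer like `ReplayMemory`; the helper `surviving_early_fraction` and the step counts are made up for illustration and integers stand in for experiences.

```python
from collections import deque


def surviving_early_fraction(capacity, total_steps, early_steps):
    """Fraction of the first `early_steps` experiences still held in a FIFO buffer."""
    buffer = deque(maxlen=capacity)  # oldest entries are evicted automatically
    for step in range(total_steps):
        buffer.append(step)  # the step index stands in for an experience
    survivors = sum(1 for step in buffer if step < early_steps)
    return survivors / early_steps


for capacity in (1000, 10000):
    frac = surviving_early_fraction(capacity, total_steps=5000, early_steps=500)
    print(f"capacity={capacity}: {frac:.0%} of the first 500 experiences remain")
```

Once the number of environment steps exceeds the capacity, none of the earliest experiences can be sampled any more, which is consistent with the idea that a small memory makes forgetting more likely.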

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.001, 0.01, 0.1]
mem_space = [10000]
eps = 100
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,
            eps_start=0.9,
            eps_end=0.05,
            eps_decay=200,
            output=False,
            render=True,
            render_step=eps,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

A larger memory shows a less drastic drop in performance, although variability is still high over a short run of episodes.

Next, try more episodes.
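
When the per-episode curve is this noisy, a trailing average can make the trend easier to compare between runs. This is a rough sketch and not something the notebook does; `moving_average` and the example episode lengths are assumptions for illustration.

```python
def moving_average(values, window=10):
    """Trailing mean over the most recent `window` values (fewer at the start)."""
    averaged = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # drop the value that has just left the window
            running_sum -= values[i - window]
        averaged.append(running_sum / min(i + 1, window))
    return averaged


episode_lengths = [10, 30, 12, 50, 8, 60]  # made-up step counts per episode
print(moving_average(episode_lengths, window=3))
```

The smoothed list has the same length as the input, so it could be passed to `ax.plot` in place of `steps`.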

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.001, 0.01, 0.1]
mem_space = [10000]
eps = 300
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,
            eps_start=0.9,
            eps_end=0.05,
            eps_decay=200,
            output=False,
            render=True,
            render_step=100,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

### memory wrt learning rate hyperparameter matrix

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.01]
mem_space = [1000,10000, 100000]
eps = 300
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,
            eps_start=0.9,
            eps_end=0.05,
            eps_decay=200,
            output=False,
            render=True,
            render_step=eps,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.1]
mem_space = [1000,10000, 100000]
eps = 500
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,
            eps_start=0.9,
            eps_end=0.05,
            eps_decay=200,
            output=False,
            render=True,
            render_step=eps,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.01]
mem_space = [1000,10000, 100000]
eps = 500
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,
            eps_start=0.9,
            eps_end=0.05,
            eps_decay=200,
            output=False,
            render=True,
            render_step=eps,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.001]
mem_space = [1000,10000, 100000]
eps = 500
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,
            eps_start=0.9,
            eps_end=0.05,
            eps_decay=200,
            output=False,
            render=True,
            render_step=eps,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.01]
mem_space = [1000,10000, 100000]
eps = 250
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,
            eps_start=0.9,
            eps_end=0.05,
            eps_decay=200,
            output=False,
            render=True,
            render_step=eps,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

## Eps greedy exploration

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.01]
mem_space = [10000]
eps = 500
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,

            eps_start=0.9,
            eps_end=0.05,
            eps_decay=200,
            output=False,
            render=True,
            render_step=100,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.01]
mem_space = [10000]
eps = 500
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,

            eps_start=0.9,
            eps_end=0.05,
            eps_decay=100,
            output=False,
            render=True,
            render_step=100,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.01]
mem_space = [10000]
eps = 500
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,

            eps_start=0.5,
            eps_end=0.05,
            eps_decay=100,
            output=False,
            render=True,
            render_step=100,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

In [None]:
fig, ax = plt.subplots()
lr_space = [ 0.01]
mem_space = [10000]
eps = 500
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,

            eps_start=0.1,
            eps_end=0.01,
            eps_decay=100,
            output=False,
            render=True,
            render_step=100,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()

### Form of the decaying epsilon greedy algorithm

In [None]:
def decaying_eps_greedy(steps_done, start, end, decay):
    return end + (start - end) * np.exp(-1.0 * steps_done / decay)
start = 0.9
end = 0.05
decay_const = 100
x = np.arange(0,501)
y = decaying_eps_greedy(x, start, end, decay_const)

fig, ax = plt.subplots()
ax.plot(x,y,c='k',label=r"decaying $\epsilon$-greedy")
ax.axvline(decay_const,c='y',ls='--',label= f'1xdecay_const={decay_const}')
ax.axvline(decay_const*3,c='r',ls='--',label= f'3xdecay_const={decay_const*3}')
ax.axhline(end,c='g',ls='--',label=f'endpoint={end}')


ax.legend()
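
To put numbers on the curve, the same formula can be evaluated at a few points. This is a small sketch using only the standard library, with the same `start`, `end`, and decay constant as the plot. Note that in `CPSolver.learn`, `steps_done` is incremented once per environment step rather than once per episode, so the x-axis here is environment steps.

```python
import math


def eps_threshold(steps_done, start=0.9, end=0.05, decay=100):
    """Probability of taking a random action after `steps_done` environment steps."""
    return end + (start - end) * math.exp(-steps_done / decay)


for steps in (0, 100, 300, 500):
    print(f"steps_done={steps}: epsilon={eps_threshold(steps):.3f}")
```

After one decay constant the gap between epsilon and `end` has shrunk to about 37% of its starting size, and after three decay constants to about 5%, so with a decay constant of 100 to 200 most of the random exploration happens in the first few hundred steps.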


## Review of first attempt at Reinforcement learning

#### Done well
- training is successful. Performance increases when using a simple fully connected network and deep Q-learning
- training output is plotted
- implemented options for actually viewing the performance during training at a variety of points
- code is extensible to different networks provided they satisfy the input/output requirements

#### Points of difficulty
- need better understanding of how the max-Q step of the optimization works. perhaps a short writeup?
- model could be generalized as an exercise:
    - to take alternative inputs, e.g. the image of the system rather than the system state.
    - to be applied to any of the other OpenAI Gym environments with arbitrary action space
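
As a starting point for that writeup, the max-Q step can be worked through by hand on a tiny batch. This is a pure-Python sketch of the target built in `optimize_model`, `reward + gamma * max(Q(next_state))`; the helpers `q_targets` and `mse` and all the Q-values are made up for illustration.

```python
def q_targets(rewards, next_q_values, gamma=0.8):
    """One-step Q-learning targets: reward + gamma * max over next-state Q-values."""
    return [r + gamma * max(q_next) for r, q_next in zip(rewards, next_q_values)]


def mse(predictions, targets):
    """Mean squared error between predicted Q-values and their targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


rewards = [1.0, 1.0, -1.0]                       # -1 marks the step that ended an episode
next_q = [[2.0, 3.0], [0.5, 0.25], [1.0, 4.0]]  # Q(next_state, a) for both actions
current_q = [3.0, 1.5, 2.0]                      # Q(state, a) for the action actually taken

targets = q_targets(rewards, next_q)
print(targets)
print(mse(current_q, targets))
```

The loss pulls each `current_q` entry toward its target, so value estimates propagate backwards one step at a time. The third transition shows a quirk of the notebook's approach: the target still adds `gamma * max(Q(next_state))` even though the episode ended there. A common variant sets that term to zero for terminal transitions.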

#### Thinking points for next project
- work through the fiddly tensor arithmetic that goes into turning the raw input data, of whatever type, into batches for the model.
- build the model architecture myself instead of following tutorials. This is a good exercise, and I can always check in with Max for a sanity check.
- build in some way of saving a model and then using it for repeat inference to compare models.



## Rendering for presentation

In [None]:
fig, ax = plt.subplots()
lr_space = [0.01]
mem_space = [10000]
eps = 500
for lr in lr_space:
    for mem in mem_space:
        cartpole_solver = CPSolver(
            episodes=eps,
            memory=mem,
            gamma=0.8,
            lr=lr,
            batch_size=32,
            eps_start=0.9,
            eps_end=0.05,
            eps_decay=200,
            output=True,
            render=False,
            render_step=100,
            optimizer=optim.SGD,
            loss_fn = nn.MSELoss
        )
        ep_count, steps = cartpole_solver.learn()
        cartpole_solver.close()
        ax.plot(steps, label=f"lr={lr} mem={mem}")
plt.legend()
plt.title(f"SGD_MSEloss_lr-{lr_space}_mem-{mem_space}_{eps}eps")
plt.show()