## Deep Q Learning 
### http://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
A Deep Q-Learning agent trained on the CartPole-v0 problem, where a model learns to balance a pole on a cart that can move left or right.

The model examines the state of the system (the direction the cart is moving, the angle of the pole with respect to the cart, etc.) and determines which direction to move the cart in order to balance the pole.

The typical inputs to the model would be the environment's state variables, but here we work from images of the system only (render the scene, examine the scene, update what the cart should do).
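To make the state-to-action interface concrete, here is a minimal sketch of a hand-written rule that maps a CartPole state to an action. It is not the learned Q-network from this notebook, only an illustration of what the network has to output. The function name `heuristic_action` is hypothetical, and the sketch assumes the usual CartPole observation layout (cart position, cart velocity, pole angle, pole angular velocity) and action encoding (0 = push left, 1 = push right).

```python
# Assumed observation layout: [cart position, cart velocity,
#                              pole angle, pole angular velocity]
# Assumed action encoding: 0 = push left, 1 = push right

def heuristic_action(state):
    """Hypothetical hand-coded policy: push the cart toward the side
    the pole is leaning, so the cart moves back under the pole."""
    pole_angle = state[2]
    return 1 if pole_angle > 0 else 0

# pole leaning slightly to the right -> push right
example_state = [0.0, 0.0, 0.05, 0.0]
action = heuristic_action(example_state)
```

The DQN agent replaces this fixed rule with a network that estimates the value of each action and picks the larger one.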

In [2]:
# imports
import gym
import math
import random

from collections import namedtuple
from itertools import count
from copy import deepcopy
from PIL import Image

import matplotlib
import matplotlib.pyplot as plt

import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable

import torchvision.transforms as T

In [3]:
# set things up

# OpenAI gym
env = gym.make('CartPole-v0').unwrapped

# matplotlib for jupyter
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display
plt.ion()

# alias some pytorch stuff if we're using a gpu
use_cuda = torch.cuda.is_available()
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor
ByteTensor = torch.cuda.ByteTensor if use_cuda else torch.ByteTensor
Tensor = FloatTensor

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


In [9]:
# use an experience replay memory to store the transitions
# that the agent observes.
#
# we reuse this data by sampling it randomly at later iterations, which
# decorrelates the batch of transitions and makes training more stable.
# in other words, if we sampled sequentially in time, the model
# might learn sequential relationships, which severely hurts the
# model's ability to generalize
#
# we'll hold this information in two structures:
#     Transition:
#         holds information about a single transition between states
#     ReplayMemory:
#         a cyclic buffer of fixed size that holds the most
#         recent transitions

Transition = namedtuple('Transition', ('state','action','next_state','reward'))

class ReplayMemory(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0
    
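    def push(self, *args):
        """Saves a transition, overwriting the oldest one once the buffer is full"""
        if len(self.memory) < self.capacity:
            # still filling up: grow the list by one slot
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        # advance the write index, wrapping around to 0 at capacity
        self.position = (self.position+1) % self.capacity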
        
    def sample(self, batch_size):
        """Randomly samples a batch from memory"""
        return random.sample(self.memory, batch_size)
    
    def __len__(self):
        return len(self.memory)
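The cyclic-buffer behaviour can be checked on a tiny example. The sketch below is not the class above: it is a hypothetical variant, `DequeReplayMemory`, that gets the same "drop the oldest transition" behaviour from `collections.deque(maxlen=...)` instead of a manual write index. The integer states and the constant reward are made-up placeholder data.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))

class DequeReplayMemory:
    """Hypothetical alternative to ReplayMemory, backed by a bounded deque."""

    def __init__(self, capacity):
        # a deque with maxlen discards its oldest item when a new one is appended
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # copy to a list so random.sample works on a plain sequence
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

# push 5 placeholder transitions into a buffer that holds only 3
memory = DequeReplayMemory(capacity=3)
for step in range(5):
    memory.push(step, 0, step + 1, 1.0)

# only the three most recent transitions remain
stored_states = [t.state for t in memory.memory]

# draw a random batch and regroup it field by field:
# a list of Transitions becomes one Transition of tuples
batch = memory.sample(2)
columns = Transition(*zip(*batch))
```

The list-based version above avoids copying on every `sample` call, which matters for large buffers. The deque version trades that for shorter code.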