<a href="https://colab.research.google.com/github/mgite03/bu-ai4all-2019/blob/main/rl/Copy_of_Deep_RL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep RL

So far we've seen how to train an agent by essentially having it memorize the value of each state and action in a table. But what if the number of states is not finite, that is, what if the state is continuous? For example, instead of occupying one square in a grid, the agent could be at arbitrary coordinates, so the state could be any point inside a continuous 2-dimensional box. We can't create a table with an infinite number of entries, so we need a different way to represent the value of each state and action.
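
To get a feel for why a table stops being practical, the sketch below counts the entries a Q-table would need if every state dimension were chopped into a fixed number of bins. The function name `table_entries` is made up for this illustration, and the defaults of 4 state dimensions and 2 actions are an assumption chosen to match the CartPole environment used later in this notebook.

```python
def table_entries(bins_per_dim, state_dims=4, n_actions=2):
    """Number of Q-table entries after discretizing each state dimension."""
    return (bins_per_dim ** state_dims) * n_actions

# the table grows with the fourth power of the number of bins
for bins in (10, 100, 1000):
    print(bins, table_entries(bins))
```

Finer bins approximate the continuous state better, but the table quickly becomes too large to fill with experience, which is one motivation for replacing it with a function that generalizes between nearby states.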

We can do this by training a neural network to represent $Q$.

## New environment: CartPole

For this part of the project you'll be working with a different environment, the CartPole environment. Read about what this environment is [here](https://gym.openai.com/envs/CartPole-v1/).

In [None]:
import gym

In [None]:
env = gym.make('CartPole-v1')

In [None]:
obs = env.reset()

On Google Colab, we can't render the CartPole environment. But that's ok, because when you finish training your model, you can download the model and render the environment states on your local machine.

In [None]:
env.render() # notice that this won't work

NoSuchDisplayException: ignored

You can, however, still print out the states, actions and rewards. Refer to the first Intro to RL notebook to review how to gain information about an OpenAI gym environment.

In [None]:
# Your code here. Figure out what the state space and action space are, and play around with the environment.

obs = env.reset()
print(env.observation_space)

print(env.action_space)

# alternate between action 0 and action 1 for 12 steps, printing each new state
for step in range(12):
  obs, reward, done, info = env.step(step % 2)
  print(obs)
print(done)
print(reward)


Box(4,)
Discrete(2)
[-0.00747024 -0.22507357  0.01645787  0.29915644]
[-0.01197171 -0.03019004  0.022441    0.01170912]
[-0.01257551 -0.22562651  0.02267518  0.31138712]
[-0.01708804 -0.03083483  0.02890293  0.02594064]
[-0.01770473 -0.22635909  0.02942174  0.32760082]
[-0.02223192 -0.03166809  0.03597376  0.04433946]
[-0.02286528 -0.22728693  0.03686055  0.34815187]
[-0.02741102 -0.03270812  0.04382358  0.06731645]
[-0.02806518 -0.22843008  0.04516991  0.37349753]
[-0.03263378 -0.03397791  0.05263986  0.09539223]
[-0.03331334 -0.22981327  0.05454771  0.40420734]
[-0.0379096  -0.03550563  0.06263185  0.12920847]
False
1.0


(Just run this next cell to import everything you need.)

In [None]:
# Imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gym
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import random
import tqdm
from time import time
import pickle

# Building a model

Build a fully connected neural network using PyTorch that takes the current state as input and outputs the expected return for each action (pushing the cart left or right).

What are the input and output dimensions of the neural network?

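> input: 1x4 (one state vector of 4 values), or Nx4 for a batch of N states

> output: 1x2 (one expected return per action), or Nx2 for a batch of N states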

In [None]:
class Q(nn.Module):
  def __init__(self, input_size, hidden_size, output_size):
    super(Q, self).__init__()
    # weights 
    self.layer1 = nn.Linear(input_size, hidden_size)
    self.layer2 = nn.Linear(hidden_size, hidden_size)
    self.layer3 = nn.Linear(hidden_size, hidden_size)
    self.layer4 = nn.Linear(hidden_size, hidden_size)
    self.layer5 = nn.Linear(hidden_size, hidden_size)
    self.layer6 = nn.Linear(hidden_size, output_size)
  def forward(self, inputs):
    # ReLU between layers; without a nonlinearity the stacked linear
    # layers would collapse into a single linear map
    output = F.relu(self.layer1(inputs))
    output = F.relu(self.layer2(output))
    output = F.relu(self.layer3(output))
    output = F.relu(self.layer4(output))
    output = F.relu(self.layer5(output))
    # no activation on the last layer: expected returns can be any real number
    output = self.layer6(output)
    return output
  
thingy = Q(4,100,2)



# Gain Experience

Write a function that uses the current `q` (but doesn't change it yet) to guide the agent in the environment. This function should record a certain number of transitions (`num_transitions`) and return them in a tensor.

Epsilon is the exploration probability.
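
The exploration rule described here is usually called epsilon-greedy. The sketch below shows it in plain Python, without PyTorch or gym, so the logic is easy to test on its own. The function name `epsilon_greedy` and the Q-values `[0.2, 0.7]` are made up for illustration; in the notebook the values would come from the model.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # explore: any action, uniformly at random
        return rng.randrange(len(q_values))
    # exploit: index of the largest estimated return (ties go to the first)
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
actions = [epsilon_greedy([0.2, 0.7], epsilon=0.1, rng=rng) for _ in range(1000)]
print(sum(actions) / len(actions))  # fraction of times action 1 was chosen
```

With a small epsilon the agent mostly follows its current estimates, and with epsilon close to 1 it behaves almost randomly, which is useful early in training when the estimates are still poor.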

In [None]:
def get_new_experience(thingy, env, num_transitions, epsilon):
  # returns experience tensor, one row per transition:
  # (state(4), action(1), new state(4), reward(1))

  # initialize transitions with nan
  transitions = torch.full((num_transitions, 10), np.nan)
  s = env.reset()
  
  done = False

  for i in tqdm.tqdm(range(num_transitions)):
  
    # start a new episode when the previous one has ended
    if done:
      s = env.reset()
      done = False
    
    if random.random() < epsilon:
      # explore: pick a random action
      a = random.randint(0,1)
    else:
      # exploit: evaluate the model on the current state (no gradients needed)
      with torch.no_grad():
        return_list = thingy(torch.from_numpy(s).float())
      if return_list[0] > return_list[1]:
        a = 0 
      else:
        a = 1
    
    old_s = torch.from_numpy(s)
    s, reward, done, info = env.step(a)
    
    transitions[i,:4] = old_s
    transitions[i,4] = a
    transitions[i,5:9] = torch.from_numpy(s)
    transitions[i,9] = reward
    
  return transitions




(Just run this next cell, it's for later.)

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

cuda:0


# Define the training function

Write some code that trains the model. You can write it in a function so that it is easy to call multiple times for different hyperparameters.

The training should go like this:
1. Gain experience
2. Use experience to refine $Q$
3. Repeat

When training, the loss should follow the Bellman equation. You should use the loss between $Q(s,a)$ and $r+\gamma \max_{a'} Q(s',a')$ to train $Q$ (the model instance created from the `Q` class, which you pass to the training function).
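
As a worked example of the Bellman target for a single transition, the sketch below computes it in plain Python, taking the maximum over the next state's Q-values in the same way the training code does with `torch.max`. The function names and all numbers are made up for illustration; in the notebook the Q-values come from the network and the computation runs on whole tensors at once.

```python
def td_target(reward, gamma, next_q_values):
    """Bellman target r + gamma * max_a' Q(s', a') for one transition."""
    return reward + gamma * max(next_q_values)

def squared_td_error(q_sa, reward, gamma, next_q_values):
    """Squared difference between the prediction Q(s, a) and the target."""
    return (q_sa - td_target(reward, gamma, next_q_values)) ** 2

# one hypothetical transition: reward 1 and made-up Q estimates
target = td_target(reward=1.0, gamma=0.9, next_q_values=[2.0, 3.0])
error = squared_td_error(q_sa=3.0, reward=1.0, gamma=0.9, next_q_values=[2.0, 3.0])
print(target, error)
```

Here the target is $1 + 0.9 \cdot 3 = 3.7$, so a prediction of $3.0$ gives a squared error of about $0.49$. Averaging this quantity over all transitions in the buffer is what a mean squared error criterion computes.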


Now, when training you don't want to completely forget old experience. You will want to keep some of the experience from previous "gain experience" steps in your experience buffer.
`num_transitions` is the amount of new experience you generate in each training iteration, and `buffer_size` is the total amount of experience you remember at any time. Each time you do step 1, you replace *some* of the old experience in the buffer with the new experience returned by `get_new_experience()`. The buffer is essentially a giant tensor with one row per transition.

Use slicing to extract the correct parts of the transitions from the buffer to train the model.
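
The buffer update and the column slicing can be sketched with ordinary Python lists, assuming the 10-value row layout used by `get_new_experience()` (4 state values, 1 action, 4 new-state values, 1 reward). The helper names `split_transition` and `refresh_buffer` are hypothetical; with tensors the same idea is expressed with `torch.cat` and slices such as `experience[:, :4]`.

```python
def split_transition(row):
    """Split one 10-value row into (state, action, new_state, reward)."""
    return row[:4], row[4], row[5:9], row[9]

def refresh_buffer(buffer, new_rows):
    """Drop the oldest len(new_rows) rows and append the new ones."""
    return buffer[len(new_rows):] + new_rows

# toy buffer: every value in row k equals k, so rows are easy to recognize
buffer = [[float(k)] * 10 for k in range(5)]
new_rows = [[float(k)] * 10 for k in range(5, 7)]
buffer = refresh_buffer(buffer, new_rows)

print([row[0] for row in buffer])
state, action, new_state, reward = split_transition(buffer[0])
print(len(state), action, len(new_state), reward)
```

The buffer keeps a constant size: the two oldest rows are discarded and the two newest are appended at the end.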

In [None]:
def train(thingy, gamma, criterion, optimizer, num_epochs, num_transitions, buffer_size):
    # start an array to keep track of loss
    keep_track_of_loss = []
    running_loss = 0
    
    # set up an array to keep track of how well q performs at each epoch
    episode_lengths = np.zeros(num_epochs)
    
    # initialize exploration probability
    epsilon = 0.9
    
    # define the buffer as "experience"
    experience = get_new_experience(thingy, env, buffer_size, epsilon)#.to(device) # .to(device) moves the experience to the gpu (if you have one)

    for epch in tqdm.tqdm(range(num_epochs)):
        # replace some of the old experience with new experience
        experience = torch.cat((experience[num_transitions:], 
                                get_new_experience(thingy, env, num_transitions, epsilon)), 
                               dim=0)

        # slice the buffer into its columns
        state = experience[:,:4]
        actions = experience[:,4:5].long()  # 0 or 1, shape (buffer_size, 1)
        snew = experience[:,5:9]
        reward = experience[:,9:10]         # shape (buffer_size, 1)

        # Q(s,a): keep only the prediction for the action that was taken
        q_predicted = thingy(state).gather(1, actions)
        
        # max over a' of Q(s',a'); no gradient flows through the target
        with torch.no_grad():
            q_next_optimal = torch.max(thingy(snew), dim=1).values.unsqueeze(dim=1)
        
        # reward + gamma * max Q(s',a')
        output_d = reward + gamma * q_next_optimal
        
        # zero the gradients in the weights
        optimizer.zero_grad()
        # calculate the loss
        loss = criterion(q_predicted, output_d)
        # calculate the gradients
        loss.backward()
        # update the weights
        optimizer.step()

        # in every epoch, check episode length -- take an average of 3
        step_counter = 0
        for _ in range(3):
            s = torch.Tensor(env.reset())  # add .to(device) if the model is on the gpu
            done = False
            while not done:
                with torch.no_grad():
                    greedy_a = torch.argmax(thingy(s)).item()
                s_new, r, done, info = env.step(greedy_a)
                s = torch.Tensor(s_new)
                step_counter += 1
        episode_lengths[epch] = step_counter/3.

        # every 10th epoch, record the average loss over the last 10 epochs
        # (with its default mean reduction, MSELoss already averages over the buffer)
        running_loss += loss.item()
        if epch%10==9:
            keep_track_of_loss.append(running_loss/10.)
            running_loss=0

        # update epsilon
        epsilon *= 0.999

    # return the episode lengths so we can see how the model progressed, and also the losses so we can plot them
    return episode_lengths, keep_track_of_loss

# Train

Create our environment:

In [None]:
env = gym.make('CartPole-v1')

Here's some starter code. However, you should run lots of tests, and it will probably be helpful for you to run this in a loop and record all the different results for examination.
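
One way to organize such a loop is to list every hyperparameter combination up front and store each result next to the configuration that produced it. The sketch below only builds the combinations. The grid values are placeholders rather than recommendations, `make_configs` is a hypothetical helper, and the call to `train(...)` is left as a comment because a full run is slow.

```python
import itertools

# hypothetical search grid; the values are placeholders, not recommendations
grid = {
    "gamma": [0.9, 0.99],
    "lr": [0.001, 0.01, 0.1],
}

def make_configs(grid):
    """Return one dict per combination of the listed hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in itertools.product(*(grid[n] for n in names))]

configs = make_configs(grid)
results = {}
for i, config in enumerate(configs):
    # a real run would build a fresh model and optimizer from `config`,
    # call train(...), and store what it returns, for example:
    # results[i] = (config, episode_lengths, all_losses)
    results[i] = (config, None, None)

print(len(configs), configs[0])
```

Keeping the configuration and its results together makes it easier to compare runs afterwards, and building a fresh model for each configuration keeps the runs independent of each other.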

In [None]:
# instantiate the model q
thingy2 = Q(4, 100, 2)#.to(device) # .to(device) sends the model to the GPU. The GPU makes training faster

# define our criterion and optimizer. You can use the Pytorch notebook as reference, 
# and also look at the pytorch documentation if you want to try some different loss functions
# and optimizers
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(thingy2.parameters(), lr=0.5)

# define hyperparameters. Try a few different combinations and record which ones do the best.
gamma = 0.9
num_epochs = 10000 
num_transitions = 10000
buffer_size = 20000

# call the training function on the same model that the optimizer was built for
episode_lengths, all_losses = train(thingy2, gamma, criterion, optimizer, num_epochs, num_transitions, buffer_size)


# Examine the results

Define some useful functions to look at our episode lengths and loss curve:

In [None]:
def plot_episode_lengths(episode_lengths):
  plt.scatter(range(len(episode_lengths)), episode_lengths, marker=".")
  plt.show()
  
def plot_loss_curve(all_losses):
  plt.plot([10*e for e in range(len(all_losses))], all_losses)
  plt.show()

Plot the performance of the models that you've trained. See if you have trained a good one.
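
Episode lengths tend to be noisy from one epoch to the next, so it can help to smooth them before plotting. The sketch below is a plain-Python moving average. The function name `moving_average` and the sample numbers are made up for illustration and are not results from this notebook.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# made-up episode lengths, only to show the call
smoothed = moving_average([10, 12, 30, 20, 50, 45], window=3)
print(smoothed)
```

The smoothed list is shorter than the input by `window - 1` entries and could be passed to `plot_episode_lengths()` in place of the raw values.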

In [None]:
# Your code here