# Reinforcement Learning
### For the second part of this project we will implement a simple Q-Learning algorithm on an RL environment called Cart Pole. The idea of Q-Learning is to estimate the expected future reward, or Q-value, of taking a certain action in a given state. Then at each step we take the action with the highest expected future reward.
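
To make the Q-value idea concrete, here is a minimal sketch of tabular Q-learning on a toy problem. It is not the Cart Pole agent built below: the five-state chain, the function name `q_learning_chain`, and the values of `alpha`, `gamma`, and `epsilon` are all assumptions chosen for illustration.

```python
import random

def q_learning_chain(num_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy chain: start at state 0, reward 1 for reaching the last state."""
    rng = random.Random(seed)
    # q[s][a]: estimated future reward of taking action a (0 = left, 1 = right) in state s
    q = [[0.0, 0.0] for _ in range(num_states)]
    for _ in range(episodes):
        s = 0
        while s != num_states - 1:
            # Explore with probability epsilon, otherwise act greedily (ties go right)
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            done = s_next == num_states - 1
            r = 1.0 if done else 0.0
            # Move Q(s, a) toward the reward plus the discounted best future value
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q

q_table = q_learning_chain()
# Greedy action in every non-terminal state (0 = left, 1 = right)
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
print(greedy_policy)
```

After training, the greedy policy should choose "right" in every non-terminal state. A Deep Q Network replaces the table `q` with a neural network, so the same idea can work when the state is continuous, as it is in Cart Pole.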

### In reinforcement learning, we refer to algorithms that attempt to solve environments as "agents", so in this part of the project we will build a Deep Q Network (DQN) agent that solves the Cart Pole environment.

In [None]:
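# The code below uses the pre-0.26 gym API (reset returns only the observation,
# step returns four values), so pin gym to an older release.
!pip3 install "gym<0.26" tqdm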

# Part 1: Set Up the Environment

In [None]:
import gym
env = gym.make('CartPole-v0')

# Part 2: Create the DQN Agent

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from collections import deque
import random
from keras.optimizers import Adam

import numpy as np


class DQNAgent:
    
    def __init__(self, env, replay_size=1000, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995, gamma=0.99):
        self.state_size = env.observation_space.shape[0]
        self.num_actions = env.action_space.n
        self.model = self.build_model()
        self.replay_buffer = deque(maxlen=replay_size)
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.gamma = gamma

        
    def build_model(self):
        model = Sequential()
        # TODO: add 2 dense layers each with 32 neurons, the input dim to the first
        # layer should be the state size, also add relu activations, for both these layers
        # Then add another Dense layer with num_actions neurons.
        # Then use model.compile to compile the model with mse loss and an Adam optimizer
        # with learning rate 0.001.
        
        return model
        
    def action(self, state):
        # Whenever a random number between 0 and 1 is less than epsilon we want to return
        # a random action. This means that with probability epsilon we return a random action.
        if np.random.random() < self.epsilon:
            #TODO: return random action here
        # Now we want to use our model to get the q values
        # HINT: we want to do prediction
        q_values = ???
        return np.argmax(q_values[0])
    
    def add_to_replay_buffer(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))

    def train_batch_from_replay(self, batch_size):
        # if we don't have enough samples in our replay buffer just return
        if len(self.replay_buffer) < batch_size:
            return False
        # Randomly sample batch_size transitions from the replay buffer
        # using random.sample
        minibatch = random.sample(self.replay_buffer, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                next_Qs = self.model.predict(next_state)[0]
                # TODO: we want to add to our target GAMMA * max Q(next_state)
                target += ???
            # our target should only take into account the current action
            # so we set all the Q values except the current action, to the 
            # current output of our model so that they get ignored in the loss function.
            target_Qs = self.model.predict(state)
            target_Qs[0][action] = target
            self.model.fit(state, target_Qs, epochs=1, verbose=0)
        
        # Now we want to slowly decay how many random actions we take
        # to do this we can multiply epsilon by our epsilon decay parameter
        # each iteration
        if self.epsilon > self.epsilon_min:
            #TODO: YOUR CODE HERE
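
The target computation and the epsilon-greedy rule described in the comments above can be sketched in plain Python, without Keras. This is only an illustration under stated assumptions: `td_target` and `epsilon_greedy` are hypothetical helper names, and the transitions store made-up next-state Q-values in place of calls to `model.predict`.

```python
import random
from collections import deque

GAMMA = 0.99  # discount factor, matching the agent's default gamma

def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Return the Q-learning target for one transition."""
    if done:
        # No future reward after a terminal step
        return reward
    return reward + gamma * max(next_q_values)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical transitions: (state, action, reward, next_q_values, done)
buffer = deque(maxlen=1000)
buffer.append(("s0", 1, 1.0, [0.5, 2.0], False))
buffer.append(("s1", 0, 1.0, [0.0, 0.0], True))

# Sample a minibatch and compute one target per transition
minibatch = random.sample(list(buffer), 2)
targets = sorted(td_target(r, nq, d) for _, _, r, nq, d in minibatch)
print(targets)
print(epsilon_greedy([0.1, 0.7], epsilon=0.0))
```

In the agent, only the entry of the network output that corresponds to the action taken is replaced by this target. The other entries keep the network's own predictions, so they contribute zero error to the MSE loss.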

# Part 3: Train the Model

In [None]:
agent = DQNAgent(env)
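
The agent above is created with the default exploration settings (`epsilon=1.0`, `epsilon_min=0.01`, `epsilon_decay=0.995`). As a rough guide to how quickly exploration fades, the sketch below counts how many multiplicative decay steps it takes for epsilon to reach its floor. It assumes the decay is applied once per call to `train_batch_from_replay`; `steps_until_floor` is a hypothetical helper, not part of the assignment code.

```python
import math

def steps_until_floor(epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Count multiplicative decay steps until epsilon is no longer above epsilon_min."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        steps += 1
    return steps

steps = steps_until_floor()
# Closed form: smallest n with epsilon_decay ** n <= epsilon_min
closed_form = math.ceil(math.log(0.01) / math.log(0.995))
print(steps, closed_form)
```

Both counts come out to 919 decay steps. If training runs once per environment step, the agent will be acting almost entirely greedily well before the 800 training episodes finish.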

In [None]:
from tqdm import tqdm

done = False
batch_size = 32
num_episodes = 800

for episode in tqdm(range(num_episodes)):
    state = env.reset()
    state = np.reshape(state, [1, agent.state_size])
    
    for t in range(200):
        action = agent.action(state)
        next_state, reward, done, _ = env.step(action)
        # Penalize the step that ends the episode so that dropping the pole is discouraged
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, agent.state_size])
        # TODO: add this sample to the replay buffer
        
        state = next_state
        
        # TODO: train on a batch from the replay buffer

        # Start a new episode once this one has ended
        if done:
            break

# Part 4: Test the Model

In [None]:
#TODO: set the agent's epsilon so that we don't take any random actions.
for _ in range(10):
    state = env.reset()
    state = np.reshape(state, [1, agent.state_size])
    total_reward = 0
    for t in range(200):
        action = agent.action(state)
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
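        state = np.reshape(next_state, [1, agent.state_size])
        # TODO: if you want to see the rendered version of your agent running
        # uncomment this line
        #env.render()
        # Stop once the episode ends; stepping a finished environment would keep
        # adding reward and make every run report the maximum score.
        if done:
            break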
    print(total_reward)

# Part 5: Writeup

#### For the writeup portion, write a paragraph explaining your understanding of how Deep Q-Learning works.