# Manual DQN - Cartpole

This notebook implements the **Deep Q Network (DQN)** for the [CartPole](https://gym.openai.com/envs/CartPole-v1/) game manually.
See `../ReinforcementLearning_Guide.md` for theory and intuition.

DQN build on the concepts introduced in Q Learning; see `../02_QLearning` for examples.

According to the OpenAI environment page of CartPole: "A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force to the left (0) or right (1) of the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright (i.e., does not fall). The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center."

Note that I needed to change the description above, since it is outdated on the web page.

Look at the Github page of the environment, on the docstring of the environment.

Note the following:
- The center position is 0 and the range of possibles postions is `[-2.4,2.4]`, continuous
- Pole angle can vary in `[-12,12] deg`
- Velocity (linear for cart, angular for pole) can be any
- An episode is done if
    1. The pole tresspasses the limits above
    2. 200 steps/actions taken
    3. A minimum return is achieved over 100 steps/actions

We need to discretize the domains using bins??

Overview of sections:
1. Environment Setup
2. Neural Networks: Q Network & Target Network
3. Replay Buffer

## 1. Environment Setup

In [1]:
import numpy as np
import random
from collections import deque
import gym

  for external in metadata.entry_points().get(self.group, []):


In [100]:
from tensorflow.keras.models import Sequential,clone_model
from tensorflow.keras.layers import Dense,Activation,Flatten
from tensorflow.keras.optimizers import Adam

In [101]:
env_name = 'CartPole-v1'
env = gym.make(env_name)

In [102]:
env.reset()

for step in range(1000):
    env.render(mode='human')
    random_action = env.action_space.sample()
    env.step(random_action)
env.close()

  "You are calling 'step()' even though this "


## 2. Neural Networks: Q Network & Target Network

In [103]:
# Get the number of observations (alternatively: Github)
num_observations = env.observation_space.shape[0]

In [104]:
num_observations

4

In [105]:
# Get the number of actions (alternatively: Github)
num_actions = env.action_space.n

In [106]:
num_actions

2

In [107]:
# In general, our ANN is defined with these input/output sizes
# input_shape = num_observations = 4 (state) --> (leyers) --> output neurons = num_actions

In [132]:
# Q Network
model = Sequential()
#model.add(Flatten(input_shape=[1,4]))
# A Dense layer of 16 (4x) units/neurons which receives as input num_observations values (4)
model.add(Dense(16,input_shape=(1,num_observations)))
model.add(Activation('relu')) # we can add it in the Dense layer
# Expand: 2x
model.add(Dense(32,activation='relu'))
# Final output layer
model.add(Dense(num_actions, activation='linear')) # no change, ie. f(x) = x ??

In [133]:
model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_15 (Dense)             (None, 1, 16)             80        
_________________________________________________________________
activation_5 (Activation)    (None, 1, 16)             0         
_________________________________________________________________
dense_16 (Dense)             (None, 1, 32)             544       
_________________________________________________________________
dense_17 (Dense)             (None, 1, 2)              66        
Total params: 690
Trainable params: 690
Non-trainable params: 0
_________________________________________________________________


In [134]:
# Target Network
target_model = clone_model(model)

## 3. Hyperparameters and Functions (Epsilon-Greedy)

In [135]:
EPOCHS = 1000
EPSILON = 1.0
# Another way of reducing epsilon (exploration)
# is to multiply to it a value close but under 1 every step
espsilon_reduce = 0.995
LEARNING_RATE = 0.001 # Watch out: not the ALPHA from Q Learning, but the LR of the NN!
GAMMA = 0.95

In [136]:
def epsilon_greedy_action_selection(model, epsilon, observation):
    if np.random.random() > epsilon: # Exploit
        # Predict
        prediction = model.predict(observation)
        # Select action with highest Q value
        action = np.argmax(prediction)
    else: # Explore
        action = np.random.randint(0, env.action_space.n)
    return action

## 3. Replay Buffer

### Deques

Let's analyze deques first.

In [137]:
# We create a deque of size 5
deque_1 = deque(maxlen=5)

In [138]:
# The deque is empty
deque_1

deque([])

In [139]:
# We add/append 5 elements (maximum) to the deque: [0, 1, 2, 3, 4]
for i in range(5):
    deque_1.append(i)

In [140]:
deque_1

deque([0, 1, 2, 3, 4])

In [141]:
# We add/append another 6th element
# First/Oldest is removed from head, Last/Newest is added to tail
deque_1.append(5)

In [142]:
deque_1

deque([1, 2, 3, 4, 5])

### Tuples Management

In the following, a one-liner for re-combining elements from tuples is shown, used later in the code.

In [143]:
test_tuple = [(1,2,3),(4,5,6),(7,8,9)]

In [144]:
# * iterates through the list elements
# zip(): the elements of passed collections are merged
# into tuples that contain elements from the different lists
# list(): zip is a generator, we need to convert it to a list
zipped_list = list(zip(*test_tuple))

In [145]:
# unpack
a, b, c = zipped_list
print(a, b, c)

(1, 4, 7) (2, 5, 8) (3, 6, 9)


### Replay Buffer

**IMPORTANT NOTE:** As it is written in the `replay()` function, I think there is a misunderstanding either here in the code or on the notes/guide. See the notes.

In [146]:
# How often do we update the target model?
# Number of epochs
# Take into account that each replay trains for one epoch only
update_target_model = 10

In [147]:
# Replay buffer itself
replay_buffer = deque(maxlen=20000)

In [148]:
def replay(replay_buffer, batch_size, model, target_model):
    # If the buffer is at least the size of the training subset or batch, 
    # no training done
    if len(replay_buffer) < batch_size: 
        return
    # If buffer already filled until batch size, training done!
    # Get experience samples: exp = (state, action, reward, new_state, done)
    samples = random.sample(replay_buffer, batch_size)  
    # Recombine and put each class-element of experience tuples together
    zipped_samples = list(zip(*samples))
    # unpack
    states, actions, rewards, new_states, dones = zipped_samples
    # WARNING: Either in my notes or here, something is wrong?
    # The target model is fed with S_t or S_t+1?
    # The Q network is fed with S_t+1 or S_t?
    # The output targets of target_model is not being used...
    # ... its elements are copied but then replaced!?
    targets = target_model.predict(np.array(states))
    q_values = model.predict(np.array(new_states))
    target_batch = []
    for i in range(batch_size):  
        q_value = max(q_values[i][0])
        #q_value = max(q_values[i])
        target = targets[i].copy()  
        if dones[i]:
            target[0][actions[i]] = rewards[i]
            #target[actions[i]] = rewards[i]
        else:
            target[0][actions[i]] = rewards[i] + q_value * GAMMA
            #target[actions[i]] = rewards[i] + q_value * GAMMA
        target_batch.append(target)
    # Train for 1 epoch only
    model.fit(np.array(states), np.array(target_batch), epochs=1, verbose=0)  

In [149]:
# Model update
def update_model_handler(epoch, update_target_model, model, target_model):
    if epoch > 0 and epoch % update_target_model == 0:
        # Get weights from Q network and copy them to the target network
        target_model.set_weights(model.get_weights())

## 4. Training

The training part is very similar to the one in Q learning.

**IMPORTANT NOTE**: 

I have not achieved to train with the following one on my Apple M1.
The error is related to the input_shape, which is not the expected one.

    ValueError: Error when checking input: expected dense_input to have 3 dimensions, but got array with shape (1, 4)

It seems that using TF 2.6.7 could solve it, but I cannot use that version on a M1 chip -- or at least, it is not as straighforward as it should be.
For more information, see:

[https://www.udemy.com/course/practical-ai-with-python-and-reinforcement-learning/learn/lecture/27376754#questions/16362492](https://www.udemy.com/course/practical-ai-with-python-and-reinforcement-learning/learn/lecture/27376754#questions/16362492)

See also:

[https://stackoverflow.com/questions/67000544/valueerror-error-when-checking-input-expected-dense-input-to-have-2-dimensions](https://stackoverflow.com/questions/67000544/valueerror-error-when-checking-input-expected-dense-input-to-have-2-dimensions)

In [150]:
# Compile model, since it was not done yet...
# Compile = configure model for training
# Note that the 
# Watch out: not the ALPHA from Q Learning, but the LR of the NN!
model.compile(loss='mse', optimizer=Adam(lr=LEARNING_RATE))

In [None]:
# Track the points: epoch with the max number of points recorded
best_so_far = 0
for epoch in range(EPOCHS):
    observation = env.reset()
    # Our observations are [a,b,c,d] size (4,)
    # Keras expects shape (1,4)
    observation = observation.reshape([1, 4])
    done = False  
    points = 0
    while not done:
        # Compute action
        action = epsilon_greedy_action_selection(model, EPSILON, observation)
        # Execute action
        next_observation, reward, done, info = env.step(action)  
        # Next state/observation
        next_observation = next_observation.reshape([1, 4])
        # Replay buffer: add experience!
        replay_buffer.append((observation, action, reward, next_observation, done))
        # Update loop variables
        observation = next_observation
        points+=1
        # Train the model!
        # batch_size = 32
        replay(replay_buffer, 32, model, target_model)
    # After one epoch, reduce EPSILON: exploration -> exploitation
    EPSILON *= espsilon_reduce
    # Refresh Q network weights every N=update_target_model epochs
    update_model_handler(epoch, update_target_model, model, target_model)
    if points > best_so_far:
        best_so_far = points
    if epoch % 25 == 0:
        print(f"{epoch}: Points reached: {points} - epsilon: {EPSILON} - Best: {best_so_far}")