## Deep Q-Learning

### First, what is reinforcement learning?
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. [Wikipedia](https://en.wikipedia.org/wiki/Reinforcement_learning)

![reinforcement_learning](https://images.ecosia.org/lisjdPOy_rATnKdUqSj2IFDjD_I=/0x390/smart/https%3A%2F%2Fi.stack.imgur.com%2FeoeSq.png)

### Deep reinforcement learning = reinforcement learning + deep learning
In 2013, DeepMind published the paper [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602), which demonstrated how a computer could learn to play Atari 2600 video games by observing only the screen pixels and receiving a reward when the game score increased. The algorithm behind this result is known as the **Deep Q-Network (DQN)**.

Deep learning comes in because the classic **Q function** *Q(s,a)*, which returns the expected cumulative reward when the agent performs **action** *a* in the **current state** *s*, is not stored explicitly. Instead, Deep Q-learning uses a **neural network** to approximate this value.
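
For comparison, the classic tabular version can be sketched in a few lines. This is a minimal illustration, not code from the paper or from the tutorial: the function name `q_learning_update`, the step size `alpha`, and the toy table with states `s0` and `s1` are assumptions made up for the example.

```python
from collections import defaultdict

def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # Best value reachable from the next state; zero if the episode ended.
    best_next = 0.0 if done else max(q_table[next_state].values(), default=0.0)
    td_target = reward + gamma * best_next
    # Move the old estimate a fraction alpha towards the target.
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]

# Hypothetical toy table: states and actions are plain string labels.
q_table = defaultdict(lambda: defaultdict(float))
q_table["s1"]["left"] = 1.0
q_table["s1"]["right"] = 2.0

new_value = q_learning_update(q_table, "s0", "right", reward=1.0,
                              next_state="s1", done=False)
print(new_value)
```

A table like this needs one entry per state-action pair, which becomes impractical when the state is a screen image or a vector of continuous values. That is the gap the neural network is meant to fill.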

### The example in this notebook
![cartpole](https://images.ecosia.org/xOJ8LOC_xNbxq2fyI-bXyD1Fnq8=/0x390/smart/https%3A%2F%2Frubenfiszel.github.io%2Fposts%2Frl4j%2Fcartpole.gif)

This is one of the best-known examples of reinforcement learning. The goal is to balance a pole on top of a moving cart.
Other great examples can be found on the [**OpenAI Gym** website](https://gym.openai.com/envs/#classic_control).

### Credits for the code to [keon.io](https://keon.io/deep-q-learning/)

I followed his tutorial to create my first agent trained with reinforcement learning. I slightly modified the code and added some comments. More examples coming soon...

In [1]:
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

Using TensorFlow backend.


In [2]:
# Number of games
EPISODES = 1000

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95   # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995 # while the agent is improving we decrease the exploration rate
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        '''Build the neural network that approximates the Q-value of each action.'''
        model = Sequential()
        # Two hidden layers of 24 units, then one linear output per action.
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        # Note: recent Keras versions name this argument learning_rate instead of lr.
        model.compile(loss='mse',
                      optimizer=Adam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        '''Store one transition in the replay memory.

        Training the agent with a neural network has a drawback: the network tends to forget
        previous experiences. Storing them lets the replay function reuse them later.'''
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        '''Choose an action with an epsilon-greedy policy.

        Initially the agent doesn't know what to do, so it acts randomly (whenever the drawn
        value is <= the exploration rate). As epsilon decays, it increasingly predicts the
        Q-values for the current state and returns the action that maximizes them.'''
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])
    
    def replay(self, batch_size):
        '''Train on a random mini-batch of past experiences sampled from the agent's memory.'''
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            # Terminal transitions have no future reward to add.
            target = reward
            if not done:
                # Reward plus the discounted best predicted value of the next state.
                target = (reward + self.gamma * np.amax(self.model.predict(next_state)[0]))
            # Only the entry for the action actually taken is replaced by the target.
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        # Decay the exploration rate once per replay call.
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)
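
The core of `replay` is how the training target is built for one stored transition. The sketch below isolates that step, using plain Python lists in place of the network's predictions. The helper name `build_target_vector` and the sample numbers are hypothetical and serve only as an illustration.

```python
def build_target_vector(q_state, action, reward, q_next, done, gamma=0.95):
    """Return a copy of q_state with the entry for `action` replaced by the TD target."""
    target = reward
    if not done:
        # Bootstrap from the best predicted value in the next state.
        target = reward + gamma * max(q_next)
    target_f = list(q_state)  # leave the other actions' predictions untouched
    target_f[action] = target
    return target_f

# Made-up predictions for a two-action problem such as CartPole.
q_state = [0.5, 0.2]
q_next = [1.0, 3.0]

mid_episode = build_target_vector(q_state, action=1, reward=1.0,
                                  q_next=q_next, done=False)
terminal = build_target_vector(q_state, action=0, reward=-10,
                               q_next=q_next, done=True)
print(mid_episode)
print(terminal)  # [-10, 0.2]
```

Because only the chosen action's entry differs from the prediction, the squared error on the other outputs is zero, so the fit step should mainly adjust the estimate for the action that was actually taken.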

In [None]:
env = gym.make('CartPole-v1')
env._max_episodes = 1000
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)
#agent.load("./save/cartpole-dqn.h5")
done = False
batch_size = 32

for e in range(EPISODES):
    # Start a new game and reshape the observation into a batch of one for Keras.
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    for time in range(500):
        env.render()
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        # Penalize the step that ends the episode.
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            # time is the index of the step on which the episode ended.
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, EPISODES, time, agent.epsilon))
            break
        # Train once enough experiences have been collected.
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
    #if e % 10 == 0:
    #    agent.save("./save/cartpole-dqn.h5")

episode: 0/1000, score: 11, e: 1.0
episode: 1/1000, score: 18, e: 1.0
episode: 2/1000, score: 16, e: 0.93
episode: 3/1000, score: 18, e: 0.85
episode: 4/1000, score: 10, e: 0.81
episode: 5/1000, score: 9, e: 0.77
episode: 6/1000, score: 14, e: 0.72
episode: 7/1000, score: 67, e: 0.51
episode: 8/1000, score: 160, e: 0.23
episode: 9/1000, score: 98, e: 0.14
episode: 10/1000, score: 85, e: 0.092
episode: 11/1000, score: 44, e: 0.074
episode: 12/1000, score: 130, e: 0.038
episode: 13/1000, score: 114, e: 0.022
episode: 14/1000, score: 176, e: 0.01
episode: 15/1000, score: 163, e: 0.01
episode: 16/1000, score: 207, e: 0.01
episode: 17/1000, score: 343, e: 0.01
episode: 18/1000, score: 157, e: 0.01
episode: 19/1000, score: 188, e: 0.01
episode: 20/1000, score: 499, e: 0.01
episode: 21/1000, score: 173, e: 0.01
episode: 22/1000, score: 111, e: 0.01
episode: 23/1000, score: 53, e: 0.01
episode: 24/1000, score: 9, e: 0.01
episode: 25/1000, score: 26, e: 0.01
episode: 26/1000, score: 9, e: 0.01
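
In the log above, the exploration rate reaches its floor of 0.01 by around episode 14. A rough check of why: `replay` multiplies `epsilon` by `epsilon_decay` once per call, so the number of calls needed can be computed directly. The helper name `decay_steps_to_floor` is an assumption for this sketch; the constants are the ones set in `DQNAgent.__init__`.

```python
import math

def decay_steps_to_floor(epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Count multiplicative decay steps until epsilon is no longer above epsilon_min."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        steps += 1
    return steps

steps = decay_steps_to_floor()
# Closed-form estimate: smallest n with epsilon_decay**n <= epsilon_min.
estimate = math.ceil(math.log(0.01) / math.log(0.995))
print(steps, estimate)
```

Both approaches give a figure a little over 900 replay calls. That is consistent with the log, where the scores of episodes 0 to 14 add up to roughly that many time steps.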
