## Initial steps

In [None]:
!pip install gymnasium
!pip install gym
!pip install pygame

In [1]:
#Download these libraries if you don't have them
import gymnasium as gym
import numpy as np
import pygame

I recommend **checking out the previous notebook** (Q-learning CartPole) to get a better understanding of the environment, the Q-function and how reinforcement learning models learn.

This time we use OpenAI [Gym](https://github.com/openai/gym), and we also need to downgrade to an older version of it. As before, it provides easy-to-use environments for reinforcement learning.<br>

The game for this model:

- `Atari Breakout`: Rows of bricks sit at the top of the screen, a paddle sits at the bottom, and a ball is always in motion. You move the paddle left and right to bounce the ball back up; when the ball hits a brick, it breaks the brick and bounces off. The goal is to break all the bricks before the ball falls off the screen.

A gif:
<div>
    <img src="https://miro.medium.com/v2/resize:fit:1760/1*XyIpmXXAjbXerDzmGQL1yA.gif" width="400">
</div>

The model we use will be a Deep Q-Network model. It is based on the 2013 paper by DeepMind (Mnih et al.) [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602).<br>
The code is taken from the Keras website [here](https://keras.io/examples/rl/deep_q_network_breakout/).


To get the environment (with states, etc.), we just need to call the `gym.make()` function with the name of the environment.

If we use `render_mode='human'`, we can see the environment in a pop-up window, but this slows learning, so we will only use it to demonstrate the final model.

We can close the environment with the `env.close()` function.
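
As a minimal sketch of these three calls, assuming the `gymnasium` package imported above and the `CartPole-v1` environment from the previous notebook (the function name `run_random_episode` is hypothetical), one episode with random actions could look like this. Note that the older Gym version used later in this notebook returns fewer values from `reset()` and `step()`.

```python
def run_random_episode(env_id="CartPole-v1", render_mode=None, seed=42):
    """Play one episode with random actions and return the total reward."""
    # Imported here so that defining the function does not require the package
    import gymnasium as gym

    # render_mode="human" opens a pop-up window; None runs without display
    env = gym.make(env_id, render_mode=render_mode)
    try:
        observation, info = env.reset(seed=seed)
        total_reward = 0.0
        done = False
        while not done:
            action = env.action_space.sample()  # random action
            observation, reward, terminated, truncated, info = env.step(action)
            total_reward += reward
            done = terminated or truncated
    finally:
        env.close()  # always release the window and other resources
    return total_reward
```

Calling `run_random_episode(render_mode="human")` would show the episode in a window, while the default runs it silently.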

## Deep Q-Learning

Example + code taken from the Keras website: [Deep Q-Learning for Atari Breakout](https://keras.io/examples/rl/deep_q_network_breakout/)

**Atari Breakout**

Environment: 
- Image input: 210x160x3, which the helper functions turn into 4 consecutive 84x84 grayscale frames
- 4 actions: 0: do nothing, 1: fire, 2: move right, 3: move left
- Reward: 1 for each brick broken



You'll need to install `baselines` (e.g., with `pip install git+https://github.com/openai/baselines.git` in the current environment) to use the Atari environment helper functions

In [None]:
!pip install git+https://github.com/openai/baselines.git

In [135]:
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

To use GPUs for TensorFlow on Windows, you may need to downgrade to TF 2.10, or use WSL2

In [17]:
import gym, ale_py, pyglet

print("Gym:", gym.__version__)
print("Ale-py:", ale_py.__version__)
print("Pyglet:", pyglet.__version__)
print("Tensorflow:", tf.__version__)

Gym: 0.21.0
Ale-py: 0.7.4
Pyglet: 2.0.12
Tensorflow: 2.15.0


I recommend using a separate Python environment for this. For Windows, I use Pyenv.

Downgrading ale-py to 0.7.4 is needed to run the code; otherwise *env.reset()* will not necessarily work.

For training the model, you can downgrade to just Gym 0.25.2, but for the display in the end, you'll need to use 0.21.0 or below

Configurations:

In [134]:
seed = 42 #Reproducibility
gamma = 0.99 
epsilon = 1.0 #epsilon greedy parameter
epsilon_min = 0.1
epsilon_max = 1.0
epsilon_interval = (
    epsilon_max - epsilon_min
)  # Total amount by which epsilon decays: actions start 100% random and end 10% random
batch_size = 32  # Size of batch taken from replay buffer
max_steps_per_episode = 10000

# Use the Baseline Atari environment because of Deepmind helper functions
env = make_atari("BreakoutNoFrameskip-v4")
# Warp the frames, convert to grayscale, stack four frames and scale the pixel values
env = wrap_deepmind(env, frame_stack=True, scale=True)
env.seed(seed)
env.reset();


Epsilon decreases linearly from 1 to 0.1 over 1 million steps, then stays at 0.1. For the first 50000 steps, the agent only explores (see the code below).
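
The schedule described above can be written in closed form. This is only a sketch: the function names `epsilon_at` and `explores` are hypothetical, and the training loop in this notebook instead subtracts a small amount from `epsilon` at every step, which gives the same values up to rounding.

```python
def epsilon_at(frame, eps_max=1.0, eps_min=0.1, decay_frames=1_000_000):
    """Linear decay from eps_max to eps_min over decay_frames, then constant."""
    fraction = min(frame / decay_frames, 1.0)
    return eps_max - fraction * (eps_max - eps_min)


def explores(frame, random_draw, warmup_frames=50_000):
    """True if the agent takes a random action at this frame.

    random_draw is a uniform random number in [0, 1).
    """
    # Pure exploration during warm-up, epsilon-greedy afterwards
    return frame < warmup_frames or random_draw < epsilon_at(frame)


for frame in (0, 500_000, 1_000_000, 2_000_000):
    print(frame, round(epsilon_at(frame), 3))
```

Halfway through the decay the chance of a random action is 0.55, and it stays at 0.1 after 1 million frames.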

The DeepMind wrap helps to preprocess the images, and also stacks 4 frames together to give the model a sense of motion (for the ball)

In [136]:
num_actions = 4

def create_q_model():
    # Network defined by the Deepmind paper
    inputs = layers.Input(shape=(84, 84, 4,)) #The DeepMind wrap creates 84x84 size frames

    #Convolutions on the frames on the screen
    layer1 = layers.Conv2D(32, 8, strides=4, activation="relu")(inputs)
    layer2 = layers.Conv2D(64, 4, strides=2, activation="relu")(layer1)
    layer3 = layers.Conv2D(64, 3, strides=1, activation="relu")(layer2)

    layer4 = layers.Flatten()(layer3)

    layer5 = layers.Dense(512, activation="relu")(layer4)
    action = layers.Dense(num_actions, activation="linear")(layer5)

    return keras.Model(inputs=inputs, outputs=action)

#Two models with the same architecture: a Q-function learning model and a target model.
#Only the first model is trained and changes the Q-values; the target model just receives a copy of its weights every so often.
model = create_q_model()

#The second "target" model is for predicting future rewards.
#The weights of a target model are updated every 10000 steps, thus when the loss between the Q-values is calculated 
# the target Q-value will be stable.
model_target = create_q_model()




Each frame is preprocessed into an 84x84 image, and 4 consecutive frames are stacked into a single state (so the input shape is 84x84x4). This was the idea in the DeepMind Atari paper, and it helps the model understand the movement of the ball from frame to frame.
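
To illustrate the frame stacking idea, here is a small sketch with a hypothetical `FrameStack` helper. It is not the `baselines` wrapper itself, and it uses plain strings as stand-ins for 84x84 frames; filling the stack with copies of the first frame at reset is an assumption of this sketch.

```python
from collections import deque


class FrameStack:
    """Keeps the k most recent frames (illustrative helper, not the baselines class)."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # the oldest frame drops out automatically

    def reset(self, first_frame):
        # At the start of an episode, fill the stack with copies of the first frame
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Each new frame pushes out the oldest one
        self.frames.append(new_frame)
        return list(self.frames)


stack = FrameStack(k=4)
print(stack.reset("f0"))
print(stack.step("f1"))
```

With real frames, the 4 stacked images would be combined along the last axis to form the 84x84x4 state.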

We use 3 convolutional layers, then a flatten layer, then 2 dense layers for decision making, last one being the output layer with 4 neurons (one for each action).
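
As a quick check of the layer sizes, the sketch below computes the output width and height of each convolution. It assumes no padding, which is the default for Keras `Conv2D`; the helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(size, kernel, stride):
    """Output width/height of a convolution without padding."""
    return (size - kernel) // stride + 1


# (kernel size, stride) of the three convolutional layers in create_q_model
conv_layers = [(8, 4), (4, 2), (3, 1)]

size = 84  # input frames are 84x84
sizes = []
for kernel, stride in conv_layers:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

# The last convolution has 64 filters, so the flatten layer sees size*size*64 values
flattened = size * size * 64

print(sizes)      # [20, 9, 7]
print(flattened)  # 3136
```

So the first dense layer receives 3136 inputs and maps them to 512 units.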

In [2]:
#Deepmind used RMSProp, but the Adam optimizer came out 1 year later and improves training time
optimizer = keras.optimizers.Adam(learning_rate=0.00025, clipnorm=1.0)

# Experience replay buffers
action_history = []
state_history = []
state_next_history = []
rewards_history = []
done_history = []
episode_reward_history = []
running_reward = 0
episode_count = 0
frame_count = 0

# Number of frames to take random action and observe output
epsilon_random_frames = 50000

# Number of frames for exploration
epsilon_greedy_frames = 1000000.0

# Maximum replay length
# Note: The Deepmind paper suggests 1000000 however this causes memory issues
max_memory_length = 100000

# Train the model after 4 actions
update_after_actions = 4

# How often to update the target network
update_target_network = 10000

# Using huber loss for stability
loss_function = keras.losses.Huber()

while True:  #Should run until solved (see below: 40 score), but we can stop it manually.
    state = np.array(env.reset())
    episode_reward = 0

    for timestep in range(1, max_steps_per_episode):
        # env.render(); Adding this line would show the attempts
        frame_count += 1
        if frame_count % 10000 == 0:
            model.save_weights('model_weights.h5') #Just in case something happens

        # Epsilon-greedy exploration, as before
        if frame_count < epsilon_random_frames or epsilon > np.random.rand(1)[0]:
            #Exploration: take random action
            action = np.random.choice(num_actions)
        else:
            #Exploitation: take the best action
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            action_probs = model(state_tensor, training=False)

            action = tf.argmax(action_probs[0]).numpy()

        #Epsilon value decayed, unless it is at minimum. These lines of code can be put somewhere else too.
        if epsilon > epsilon_min: #I changed this part in the code, as it was a bit weirdly written.
            epsilon -= epsilon_interval / epsilon_greedy_frames
        
        #We make the action, take the outputs
        state_next, reward, done, _ = env.step(action)
        state_next = np.array(state_next)

        episode_reward += reward

        # Save actions and states in replay buffer
        action_history.append(action)
        state_history.append(state)
        state_next_history.append(state_next)
        done_history.append(done)
        rewards_history.append(reward)
        state = state_next

        # Update every fourth frame and once batch size is over 32
        if frame_count % update_after_actions == 0 and len(done_history) > batch_size:

            # Get indices of samples for replay buffers
            indices = np.random.choice(range(len(done_history)), size=batch_size)

            # Using list comprehension to sample from replay buffer
            state_sample = np.array([state_history[i] for i in indices])
            state_next_sample = np.array([state_next_history[i] for i in indices])
            rewards_sample = [rewards_history[i] for i in indices]
            action_sample = [action_history[i] for i in indices]
            done_sample = tf.convert_to_tensor(
                [float(done_history[i]) for i in indices]
            )

            # Build the updated Q-values for the sampled future states
            # Use the target model for stability
            future_rewards = model_target.predict(state_next_sample)
            # Q value = reward + discount factor * expected future reward
            updated_q_values = rewards_sample + gamma * tf.reduce_max(
                future_rewards, axis=1
            )

            # If final frame set the last value to -1 (done)
            updated_q_values = updated_q_values * (1 - done_sample) - done_sample

            #One-hot encoded mask, we only calculate loss on the updated Q-values
            masks = tf.one_hot(action_sample, num_actions)

            with tf.GradientTape() as tape:
                # Train the model on the states and updated Q-values
                q_values = model(state_sample)

                # Apply the masks to the Q-values to get the Q-value for action taken
                q_action = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)
                # Calculate loss between new Q-value and old Q-value
                loss = loss_function(updated_q_values, q_action)

            # Backpropagation
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

        if frame_count % update_target_network == 0:
            # Update the target network with new weights
            model_target.set_weights(model.get_weights())
            # Log details
            template = "running reward: {:.2f} at episode {}, frame count {}"
            print(template.format(running_reward, episode_count, frame_count))

        # Limit the state and reward history
        if len(rewards_history) > max_memory_length:
            del rewards_history[:1]
            del state_history[:1]
            del state_next_history[:1]
            del action_history[:1]
            del done_history[:1]

        if done:
            break

    # Update running reward to check condition for solving
    episode_reward_history.append(episode_reward)
    if len(episode_reward_history) > 100:
        del episode_reward_history[:1]
    running_reward = np.mean(episode_reward_history)

    episode_count += 1

    if running_reward > 40:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break


...



KeyboardInterrupt: 

I stopped the run after ~24 hours (no GPU). The model is already fairly good, but it would need more training to perfect the game.

I used `model.save_weights('model_weights_last.h5')` to save the last model's weights, but we could have also just used the already saved `model_weights.h5` file.

Let's load the model and try it:

In [137]:
model.load_weights('model_weights_last.h5')

# Now you can use the render method to visualize the game
env.render('human')

  logger.warn(


True

A bit of testing:

In [138]:
num_episodes = 10  # Number of games to play
max_frames_test = 2500

for i in range(num_episodes):
    state = np.array(env.reset())
    done = False
    frames = 0
    while (not done) and (frames < max_frames_test):
        env.render("human") #Could also just not render
        state_tensor = tf.convert_to_tensor(state)
        state_tensor = tf.expand_dims(state_tensor, 0)
        action_probs = model(state_tensor, training=False)
        action = tf.argmax(action_probs[0]).numpy()  # We only exploit in testing
        state, reward, done, _ = env.step(action)
        frames += 1

env.close()