# Developing a Deep Q-Learning Agent for Atari Games

**Jashwanth Kakara**  
**22B1033**

The main packages required are **NumPy, Keras, TensorFlow, and gym**.

### Imports

Import necessary libraries (`gym` for the environment, `numpy` for numerical operations, `deque` for replay memory, `Sequential`, `Dense`, and `Adam` from `keras` for neural network modeling and optimization, `random` for random actions, and `matplotlib.pyplot` for plotting).

### Constants

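Define constants such as `EPISODES` (number of training episodes), `MAX_STEPS` (maximum steps per episode), `BATCH_SIZE` (size of the minibatch sampled for replay), and `UPDATE_FREQ` (intended interval, in steps, between target-network updates). Note that the agent defined below uses a single network and never reads `UPDATE_FREQ`.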

In [1]:
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.models import load_model
from keras.optimizers import Adam
import random
import matplotlib.pyplot as plt

EPISODES = 500
MAX_STEPS = 500
BATCH_SIZE = 32
UPDATE_FREQ = 5  # Intended target-network update interval (unused: no target network is defined below)

### DQNAgent Class

Represents the Deep Q-Network (DQN) agent.

**`__init__` method** initializes the agent with state and action sizes, sets up memory for experience replay (`deque`), defines hyperparameters (gamma, epsilon, etc.), and builds the neural network model (`self.model`) using `_build_model()` method.

**`_build_model` method** creates a simple neural network with 2 hidden layers (24 units each) and an output layer (`action_size` units) using `Sequential` from Keras. It uses ReLU activation for the hidden layers and linear activation for the output layer. The model is compiled with Mean Squared Error loss and the Adam optimizer.
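
As a rough check on the size of this network, the sketch below counts its trainable parameters for the CartPole case (4 state inputs and 2 actions, as described later in this notebook). The helper `dense_params` is a hypothetical name introduced here for illustration. It relies only on the fact that a fully connected layer has one weight per input-output pair plus one bias per output unit.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected (Dense) layer."""
    return n_inputs * n_units + n_units

state_size, action_size = 4, 2            # CartPole-v1 values
layer_sizes = [state_size, 24, 24, action_size]

# Pair each layer's input width with its output width
per_layer = [dense_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)  # [120, 600, 50] 770
```

With fewer than a thousand parameters, this model is far smaller than the convolutional networks usually used for image-based environments, which is why it trains quickly on CartPole.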


In [3]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0   # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
        return model
        
    def memorize(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        # Sample a minibatch of past transitions uniformly at random
        minibatch = random.sample(self.memory, batch_size)
        losses = []
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                # Bootstrap from the best predicted next-state value
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            history = self.model.fit(state, target_f, epochs=1, verbose=0)
            losses.append(history.history['loss'][0])
        # Decay exploration once per replay call, after the whole minibatch
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
        return float(np.mean(losses))

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)

    def evaluate(self, env, num_episodes):
        total_rewards = []
        for episode in range(num_episodes):
            state = env.reset()[0]
            state = np.reshape(state, [1, self.state_size])
            done = False
            total_reward = 0
            while not done:
                action = self.act(state)
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated  # stop on failure or on the time limit
                total_reward += reward
                next_state = np.reshape(next_state, [1, self.state_size])
                state = next_state
            total_rewards.append(total_reward)
            print(f"Evaluation Episode {episode + 1}/{num_episodes}, Total Reward: {total_reward}")
        return total_rewards


### Memory Management

**`memorize` method** stores the experience tuple `(state, action, reward, next_state, done)` in the replay memory (`self.memory`).

### Action Selection (act method)

**`act` method** selects an action for the current state (`state`) using an epsilon-greedy rule:
- With probability epsilon, it chooses a random action (exploration).
- Otherwise, it selects the action with the highest predicted value (exploitation) using the current neural network model (`self.model.predict`).
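
How often the agent explores depends on how quickly epsilon shrinks. The sketch below is a standalone illustration, not part of the agent: it applies the same multiplicative decay rule as `replay` (one decay per replay call) to the hyperparameters set in `__init__` and counts how many calls it takes for epsilon to fall to `epsilon_min`. The variable `decay_steps` is introduced here for illustration only.

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

decay_steps = 0
while epsilon > epsilon_min:
    # Same rule the agent applies at the end of a replay call
    epsilon *= epsilon_decay
    decay_steps += 1

print(decay_steps)  # 919
```

Because replay runs on almost every environment step once the memory holds more than `BATCH_SIZE` transitions, roughly 900 steps of training are enough to make the agent act almost entirely greedily.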


In [9]:
    def memorize(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

### Q-Learning

Q-Learning is an off-policy algorithm where the goal is to learn the optimal action-value function \( Q^*(s, a) \) which satisfies the Bellman equation:

$$ Q^*(s, a) = \mathbb{E} \left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right] $$

Here,
- \( r \) is the reward,
- \( \gamma \) is the discount factor, and
- \( s' \) is the next state.
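
A one-step sample of the right-hand side of this equation is the target the network is trained towards. The following is a minimal sketch of that computation for a single transition, using the agent's discount rate of 0.95; the function name `td_target` and the example Q-values are hypothetical and chosen only for illustration.

```python
gamma = 0.95  # discount rate used by the agent

def td_target(reward, next_q_values, done):
    """One-step Q-learning target for a single transition."""
    if done:
        return reward  # no future value after a terminal state
    return reward + gamma * max(next_q_values)

# Non-terminal step: 1.0 + 0.95 * 2.0
t_mid = td_target(1.0, [0.5, 2.0], done=False)
# Terminal step: the target is just the reward
t_end = td_target(-10.0, [0.5, 2.0], done=True)
print(t_mid, t_end)
```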

### Replay Buffer

The Replay Buffer stores past experiences \( (s, a, r, s', \text{done}) \) to break the temporal correlations and enable efficient reuse of experiences. A mini-batch is sampled from this buffer to update the network.
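
The behaviour of a bounded `deque` used as a replay buffer can be seen in a small standalone sketch. The capacity of 3 and the placeholder transitions are assumptions made so that eviction is visible; the agent itself uses `maxlen=2000`.

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # tiny capacity so eviction is visible
for step in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.append((step, 0, 1.0, step + 1, False))

# Only the three most recent transitions remain; the oldest two were evicted
stored_states = [t[0] for t in buffer]

random.seed(0)  # seeded only to make the sketch repeatable
minibatch = random.sample(list(buffer), 2)  # uniform sampling without replacement
print(stored_states, len(minibatch))  # [2, 3, 4] 2
```

Sampling uniformly from the buffer, rather than training on consecutive steps, is what breaks the temporal correlation mentioned above.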



### CartPole-v1 Environment

- **State Space**: The state consists of four values: cart position, cart velocity, pole angle, and pole angular velocity.
- **Action Space**: Two discrete actions: push the cart left or right.
- **Termination Condition (done)**: The episode ends when:
  - The pole tilts more than about 12° from vertical.
  - The cart moves more than 2.4 units from the center of the track.
  - The maximum number of steps per episode (500 for CartPole-v1) is reached. Recent `gym` versions report this case through a separate `truncated` flag rather than `terminated`.
- **Reward**: The environment gives a reward of +1 for every step the pole remains upright. The training loop in this notebook additionally replaces the reward with -10 on the step where the episode ends (when `done` is True).
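
The termination rule and the reward adjustment can be sketched without the environment itself. The thresholds below (2.4 units for the cart and 12 degrees for the pole) are assumptions taken from the Gym documentation for CartPole, and `is_terminal` and `shaped_reward` are hypothetical helper names used only for illustration.

```python
import math

X_LIMIT = 2.4                      # cart position bound (assumed from the Gym docs)
THETA_LIMIT = 12 * math.pi / 180   # pole angle bound in radians (assumed)

def is_terminal(cart_pos, pole_angle):
    """True when the cart leaves the track or the pole tilts too far."""
    return abs(cart_pos) > X_LIMIT or abs(pole_angle) > THETA_LIMIT

def shaped_reward(env_reward, terminated):
    """Replace the environment reward with -10 on failure, as the training loop does."""
    return -10 if terminated else env_reward

print(is_terminal(0.0, 0.0), is_terminal(2.5, 0.0), is_terminal(0.0, 0.25))
print(shaped_reward(1.0, False), shaped_reward(1.0, True))
```

The -10 penalty gives the agent a clear signal on the failing step, since the environment's own reward is +1 on every step regardless of outcome.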


In [10]:
    def replay(self, batch_size):
        # Sample a minibatch of past transitions uniformly at random
        minibatch = random.sample(self.memory, batch_size)
        losses = []
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                # Bootstrap from the best predicted next-state value
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            history = self.model.fit(state, target_f, epochs=1, verbose=0)
            losses.append(history.history['loss'][0])
        # Decay exploration once per replay call, after the whole minibatch
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
        return float(np.mean(losses))

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)

    def evaluate(self, env, num_episodes):
        total_rewards = []
        for episode in range(num_episodes):
            state = env.reset()[0]
            state = np.reshape(state, [1, self.state_size])
            done = False
            total_reward = 0
            while not done:
                action = self.act(state)
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated  # stop on failure or on the time limit
                total_reward += reward
                next_state = np.reshape(next_state, [1, self.state_size])
                state = next_state
            total_rewards.append(total_reward)
            print(f"Evaluation Episode {episode + 1}/{num_episodes}, Total Reward: {total_reward}")
        return total_rewards

### Main Execution (__main__ block)

- Creates the **CartPole-v1** environment (`env`), initializes the agent (`agent`), and sets up a list (`episode_rewards`) to store average rewards for plotting.
- Trains the agent over `EPISODES` episodes:
  - Resets the environment (`env.reset()`) and initializes the state (`state`).
  - Runs the agent's policy (`agent.act`) for up to `MAX_STEPS` steps per episode, collecting experiences (`agent.memorize`).
  - Performs replay (`agent.replay`) to update the neural network based on experiences.
  - Prints training progress (episode number, time step, reward, epsilon, and periodically the loss) and, every 50 episodes, evaluates the agent (`agent.evaluate`) over 50 evaluation episodes, storing the average reward in `episode_rewards`.
- Closes the environment (`env.close()`) after training completes.
- Plots the average reward per episode (`episode_rewards`) using `matplotlib`.
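
`UPDATE_FREQ` is described as the target-network update frequency, but the loop described above does not use it. If a target network were added, a periodic hard update could look roughly like the sketch below. `TinyNet` is a hypothetical stand-in for a Keras model that exposes only `get_weights` and `set_weights`, so the sketch runs without TensorFlow; the fake "gradient step" is likewise an assumption made for illustration.

```python
import copy

UPDATE_FREQ = 5  # same value as the constant defined earlier

class TinyNet:
    """Hypothetical stand-in for a Keras model: just holds a list of weights."""
    def __init__(self, weights):
        self._weights = list(weights)

    def get_weights(self):
        return copy.deepcopy(self._weights)

    def set_weights(self, weights):
        self._weights = copy.deepcopy(weights)

online = TinyNet([0.0, 0.0])
target = TinyNet(online.get_weights())  # target starts as a copy of the online net

sync_steps = []
for step in range(1, 13):
    # Pretend a gradient step changed the online network
    online.set_weights([w + 0.1 for w in online.get_weights()])
    if step % UPDATE_FREQ == 0:
        # Hard update: copy the online weights into the target network
        target.set_weights(online.get_weights())
        sync_steps.append(step)

print(sync_steps)  # [5, 10]
```

In such a design the bootstrapped value in `replay` would be predicted by the target network rather than by `self.model`, so the regression targets change only at each sync.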

In [None]:
if __name__ == "__main__":
    env = gym.make('CartPole-v1')
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    agent = DQNAgent(state_size, action_size)

    done = False
    episode_rewards = []  # Average evaluation reward, recorded every 50 episodes

    for e in range(EPISODES):
        state = env.reset()[0]
        state = np.reshape(state, [1, state_size])
        for time in range(MAX_STEPS):
            # env.render()
            action = agent.act(state)
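            # gym's step returns (obs, reward, terminated, truncated, info); done holds the terminated flag
            next_state, reward, done, truncated, info = env.step(action)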
            reward = reward if not done else -10
            next_state = np.reshape(next_state, [1, state_size])
            agent.memorize(state, action, reward, next_state, done)
            state = next_state
            if done:
                print(f"Episode {e}/{EPISODES}, Time: {time}, Reward: {reward}, Epsilon: {agent.epsilon:.2f}")
                break
            if len(agent.memory) > BATCH_SIZE:
                loss = agent.replay(BATCH_SIZE)
                # Log the training loss every 50 timesteps
                if time % 50 == 0:
                    print(f"Episode {e}/{EPISODES}, Time: {time}, Loss: {loss:.4f}")
        
        # Evaluate the agent every 50 episodes
        if (e+1) % 50 == 0:
            # agent.save("cartpole-dqn.h5")
            # agent1.load("cartpole-dqn.h5")
            avg_reward = np.mean(agent.evaluate(env, num_episodes=50))
            print(f"Average reward for 50 evaluations at episode {e} is {avg_reward}")
            episode_rewards.append(avg_reward)

    env.close()

    # Plot rewards versus episodes
    print(episode_rewards)
    # One point per evaluation, placed at the training episode where it was taken
    plt.plot(np.arange(1, len(episode_rewards) + 1) * 50, episode_rewards, color='blue')
    plt.title('Reward vs Training')
    plt.xlabel('Episode')
    plt.ylabel('Average Reward per 50 Episodes')
    plt.grid(True)
    plt.savefig('reward_plot.png')
    plt.show()



Episode 0/500, Time: 18, Reward: -10, Epsilon: 1.00
Episode 1/500, Time: 21, Reward: -10, Epsilon: 0.96
Episode 2/500, Time: 0, Loss: 0.6305
Episode 2/500, Time: 19, Reward: -10, Epsilon: 0.87
Episode 3/500, Time: 0, Loss: 0.4869
Episode 3/500, Time: 42, Reward: -10, Epsilon: 0.71
Episode 4/500, Time: 0, Loss: 0.4657
Episode 4/500, Time: 14, Reward: -10, Epsilon: 0.66
Episode 5/500, Time: 0, Loss: 0.5476
Episode 5/500, Time: 9, Reward: -10, Epsilon: 0.63
Episode 6/500, Time: 0, Loss: 0.5307
Episode 6/500, Time: 10, Reward: -10, Epsilon: 0.60
Episode 7/500, Time: 0, Loss: 0.5647
Episode 7/500, Time: 24, Reward: -10, Epsilon: 0.53
Episode 8/500, Time: 0, Loss: 0.5609
Episode 8/500, Time: 11, Reward: -10, Epsilon: 0.50
Episode 9/500, Time: 0, Loss: 0.5590
Episode 9/500, Time: 9, Reward: -10, Epsilon: 0.48
Episode 10/500, Time: 0, Loss: 0.3862
Episode 10/500, Time: 8, Reward: -10, Epsilon: 0.46
Episode 11/500, Time: 0, Loss: 0.5503
Episode 11/500, Time: 8, Reward: -10, Epsilon: 0.44
Episod