<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Reinforcement Learning for Finance

**Chapter 02 &mdash; Deep Q-Learning**

&copy; Dr. Yves J. Hilpisch

<a href="https://tpq.io" target="_blank">https://tpq.io</a> | <a href="https://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## CartPole

## Introduction to Deep Q-Learning

This notebook demonstrates **Deep Q-Learning (DQN)**, a breakthrough algorithm that combines Q-learning with deep neural networks. We'll use the classic **CartPole** environment to show how an agent can learn complex control tasks.

**Key Concepts We'll Explore:**
- **Q-Learning**: Learning action-value functions that estimate future rewards
- **Deep Neural Networks**: Using neural networks to approximate Q-functions for complex state spaces
- **Experience Replay**: Storing and reusing past experiences to improve learning stability
- **Epsilon-Greedy Exploration**: Balancing exploration of new actions with exploitation of known good actions
- **Target Networks**: A separate, slowly updated copy of the Q-network that stabilizes training (covered conceptually; the agent built here computes its targets with a single network)

**Why CartPole?**
CartPole is a classic control problem where an agent must balance a pole on a cart by moving the cart left or right. It's an excellent testbed for RL algorithms because:
- The state space is continuous (position, velocity, angle, angular velocity)
- The action space is discrete (left or right)
- Success requires learning long-term consequences of actions
- It's simple enough to understand but complex enough to require sophisticated learning

### The Game Environment 

In [1]:
import gymnasium as gym

**Gymnasium** (formerly OpenAI Gym) is the standard toolkit for developing and comparing reinforcement learning algorithms. It provides a wide variety of environments with standardized interfaces.

In [2]:
env = gym.make('CartPole-v1')

### Creating the CartPole Environment

`CartPole-v1` is a classic control task where:
- **Goal**: Keep a pole balanced upright on a movable cart
- **Actions**: Move cart left (0) or right (1)  
- **Episode ends**: When the pole tilts more than 12° from vertical or the cart moves more than 2.4 units from the center
- **Success**: Keeping the pole upright for 500 steps, at which point the episode is truncated

In [3]:
env.action_space

Discrete(2)

In [4]:
env.action_space.n

2

### Understanding the Action Space

The **action space** defines what actions the agent can take:
- **Discrete(2)**: Two possible actions (0 and 1)
- **Action 0**: Push cart to the left
- **Action 1**: Push cart to the right

The `action_space.n` gives us the number of possible actions, and `action_space.sample()` randomly selects an action.

In [5]:
[env.action_space.sample() for _ in range(10)]

[0, 1, 0, 0, 1, 0, 0, 0, 0, 1]

In [6]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)

In [7]:
env.observation_space.shape

(4,)

### Understanding the Observation Space

The **observation space** defines what information the agent receives about the environment state:
- **Box(4,)**: A 4-dimensional continuous vector
- **State components**:
  1. **Cart Position**: Horizontal position of the cart (observable range ±4.8; the episode terminates beyond ±2.4)
  2. **Cart Velocity**: Speed of cart movement
  3. **Pole Angle**: Angle of the pole from vertical (observable range ±0.418 radians ≈ ±24°; the episode terminates beyond ±0.2095 radians ≈ ±12°)
  4. **Pole Angular Velocity**: Rate of change of pole angle

This 4D state space is what makes CartPole challenging: the agent must learn to coordinate four continuous variables at once.
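The pole-angle bound printed by `env.observation_space` is wider than the angle at which an episode ends. A minimal sketch of the unit conversion, using only the standard library; the constant names are illustrative and not part of Gymnasium, and the 12° threshold is taken from the environment's documentation:

```python
import math

# Upper bound of the pole-angle component as printed by env.observation_space
OBS_ANGLE_BOUND_RAD = 0.41887903
# Termination threshold in degrees (assumption: value from the CartPole docs)
TERMINATION_ANGLE_DEG = 12

obs_bound_deg = math.degrees(OBS_ANGLE_BOUND_RAD)
termination_rad = math.radians(TERMINATION_ANGLE_DEG)

print(f'observation bound: {obs_bound_deg:.1f} degrees')
print(f'termination threshold: {termination_rad:.4f} rad')
```

The observation bound works out to twice the termination threshold, so every state the agent sees during an episode lies well inside the `Box` limits.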

In [8]:
env.reset(seed=100)
# cart position, cart velocity, pole angle, pole angular velocity

(array([ 0.03349816,  0.0096554 , -0.02111368, -0.04570484], dtype=float32),
 {})

In [9]:
env.step(0)

(array([ 0.03369127, -0.18515752, -0.02202777,  0.24024247], dtype=float32),
 1.0,
 False,
 False,
 {})

### Environment Interaction

Let's see how agent-environment interaction works:
- `env.reset()` initializes a new episode and returns the initial state together with an `info` dictionary
- `env.step(action)` executes an action and returns:
  - **next_state**: New observation after the action
  - **reward**: Immediate reward (1.0 for each step the pole stays up)
  - **done**: Whether the episode terminated (pole fell or cart went too far); Gymnasium calls this flag `terminated`
  - **truncated**: Whether the episode was cut short by the 500-step time limit
  - **info**: Additional information (usually empty for CartPole)
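The reset/step cycle described above can be sketched as a generic episode loop. To keep the example runnable without any installation, it uses `ToyEnv`, a hypothetical stand-in that only mimics the return signatures of `reset()` and `step()`; it is not a Gymnasium class and has no physics:

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics Gymnasium's reset/step signatures."""

    def __init__(self, fail_prob=0.1, time_limit=20, seed=0):
        self.rng = random.Random(seed)
        self.fail_prob = fail_prob
        self.time_limit = time_limit
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0], {}  # (observation, info)

    def step(self, action):
        self.t += 1
        terminated = self.rng.random() < self.fail_prob  # "pole fell"
        truncated = (not terminated) and self.t >= self.time_limit
        return [0.0, 0.0, 0.0, 0.0], 1.0, terminated, truncated, {}


def run_episode(env, policy):
    """Play one episode; return total reward and how the episode ended."""
    state, info = env.reset()
    total = 0.0
    while True:
        state, reward, terminated, truncated, info = env.step(policy(state))
        total += reward
        if terminated or truncated:
            return total, 'terminated' if terminated else 'truncated'


total, reason = run_episode(ToyEnv(), policy=lambda state: 0)
print(total, reason)
```

Because the method signatures match, passing `gym.make('CartPole-v1')` instead of `ToyEnv()` should work the same way.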

In [10]:
env.step(1)

(array([ 0.02998812,  0.01027205, -0.01722292, -0.05930644], dtype=float32),
 1.0,
 False,
 False,
 {})

In [11]:
class RandomAgent:
    def __init__(self):
        self.env = gym.make('CartPole-v1')
    def play(self, episodes=1):
        self.trewards = list()  # one entry (episode length) per episode
        for e in range(episodes):
            self.env.reset()
            # CartPole-v1 truncates at 500 steps, so every episode is recorded
            for step in range(1, 501):
                a = self.env.action_space.sample()  # random action
                state, reward, done, trunc, info = self.env.step(a)
                if done or trunc:
                    self.trewards.append(step)
                    break

### Baseline: Random Agent

Before implementing sophisticated learning algorithms, let's establish a baseline with a **random agent** that chooses actions randomly. This helps us understand:
1. How difficult the task is
2. What performance we need to beat
3. The natural variance in episode lengths

The `RandomAgent` class:
- Takes random actions using `env.action_space.sample()`
- Tracks episode lengths (how long the pole stays up)
- Provides a performance baseline for comparison

In [12]:
ra = RandomAgent()

In [13]:
ra.play(15)

In [14]:
ra.trewards

[19, 17, 10, 13, 13, 12, 35, 21, 17, 26, 16, 49, 20, 19, 26]

In [15]:
round(sum(ra.trewards) / len(ra.trewards), 2)

20.87

**Random Agent Performance**: In the run above, the random agent's episodes last between 10 and 49 steps, with a mean of about 21. This poor performance shows that CartPole requires intelligent action selection: random movements quickly lead to the pole falling over.
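To put numbers on the baseline and its variance, the episode lengths printed above can be summarized with the standard library. A small sketch; the list is copied from the run shown in this notebook, so a fresh run will give different values:

```python
import statistics

# Episode lengths from the random-agent run shown above
lengths = [19, 17, 10, 13, 13, 12, 35, 21, 17, 26, 16, 49, 20, 19, 26]

mean = statistics.mean(lengths)
median = statistics.median(lengths)
stdev = statistics.stdev(lengths)  # sample standard deviation

print(f'mean={mean:.2f} | median={median} | stdev={stdev:.2f}')
print(f'min={min(lengths)} | max={max(lengths)}')
```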

**Our Goal**: Develop a learning agent that can consistently achieve the maximum episode length of 500 steps.

In [16]:
import os
import random
import warnings
import numpy as np
import tensorflow as tf
from tensorflow import keras
from collections import deque
from keras.layers import Dense
from keras.models import Sequential

## Deep Q-Learning (DQN) Implementation

Now we'll implement a **Deep Q-Network (DQN)** agent that can learn to solve CartPole. DQN was introduced by DeepMind in 2013 and became widely known through its 2015 Nature paper; it represents a major breakthrough in reinforcement learning.

### Key Components:
1. **Neural Network**: Approximates the Q-function Q(state, action)
2. **Experience Replay**: Stores experiences and replays them for learning
3. **Epsilon-Greedy**: Balances exploration vs exploitation
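4. **Target Network**: Stabilizes learning in the full algorithm (the simplified agent below has no separate target network and computes targets with the same network it trains)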

### Required Libraries:
- **TensorFlow/Keras**: For building and training neural networks
- **NumPy**: For numerical computations
- **`collections.deque`**: For an efficient, fixed-size experience replay memory

In [17]:
warnings.simplefilter('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['PYTHONHASHSEED'] = '0'

In [18]:
tf.compat.v1.disable_eager_execution()  # run Keras in graph mode

In [19]:
opt = keras.optimizers.legacy.Adam(learning_rate=0.0001)

In [20]:
random.seed(100)
tf.random.set_seed(100)

### Environment Setup and Reproducibility

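These setup steps aim for:
- **More reproducible results**: Fixed seeds for Python's `random` module and TensorFlow make learning curves more consistent (NumPy and the environment's own random number generators are not seeded here, so runs can still differ)
- **Clean output**: Suppressed TensorFlow warnings and logs
- **Graph mode**: Eager execution is disabled so the code runs in TensorFlow's graph mode, which typically makes the many single-sample `predict()` and `fit()` calls cheaper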

In [21]:
class DQLAgent:
    def __init__(self):
        self.epsilon = 1.0
        self.epsilon_decay = 0.9975
        self.epsilon_min = 0.1
        self.memory = deque(maxlen=2000)
        self.batch_size = 32
        self.gamma = 0.9
        self.trewards = list()
        self.max_treward = 0
        self._create_model()
        self.env = gym.make('CartPole-v1')
    def _create_model(self):
        self.model = Sequential()
        self.model.add(Dense(24, activation='relu', input_dim=4))
        self.model.add(Dense(24, activation='relu'))
        self.model.add(Dense(2, activation='linear'))
        self.model.compile(loss='mse', optimizer=opt)

### DQN Agent Architecture

The `DQLAgent` class implements the core Deep Q-Learning algorithm:

**Key Hyperparameters:**
- **epsilon**: Exploration rate (starts at 1.0 = 100% random)
- **epsilon_decay**: Multiplicative factor applied to epsilon after each call to `replay()` (0.9975)
- **epsilon_min**: Minimum exploration rate (0.1 = 10% random actions)
- **memory**: Experience replay buffer (stores last 2000 experiences)
- **batch_size**: Number of experiences to sample for training (32)
- **gamma**: Discount factor for future rewards (0.9)
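The replay memory relies on `deque(maxlen=...)`, which silently drops the oldest entry once the buffer is full. A minimal sketch with a tiny capacity and placeholder experiences (the string states are made up for illustration):

```python
import random
from collections import deque

memory = deque(maxlen=5)  # tiny capacity for illustration; the agent uses 2000

for step in range(8):
    # (state, action, next_state, reward, done) placeholders
    memory.append((f's{step}', 0, f's{step + 1}', 1.0, False))

oldest = memory[0][0]               # entries s0..s2 have been evicted
batch = random.sample(memory, 3)    # uniform sampling without replacement
print(len(memory), oldest, len(batch))
```

With a capacity of 2000, the agent therefore always learns from its most recent 2000 transitions, which gradually replaces early, mostly random experience with data from the improving policy.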

**Neural Network Architecture:**
- **Input layer**: 4 neurons (for the 4-dimensional state)
- **Hidden layers**: 2 layers with 24 neurons each (ReLU activation)
- **Output layer**: 2 neurons (Q-values for each action)
- **Loss function**: Mean Squared Error (MSE)
- **Optimizer**: Adam with learning rate 0.0001
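The size of this network can be checked by hand: each dense layer has one weight per input-output pair plus one bias per output. A short sketch of that arithmetic (the helper name `dense_params` is illustrative, not a Keras function):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


layer_sizes = [4, 24, 24, 2]  # input, two hidden layers, output
per_layer = [dense_params(a, b)
             for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

The total should match the trainable parameter count reported by `agent.model.summary()`.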

In [22]:
class DQLAgent(DQLAgent):
    def act(self, state):
        if random.random() < self.epsilon:
            return self.env.action_space.sample()
        return np.argmax(self.model.predict(state)[0])
    def replay(self):
        batch = random.sample(self.memory, self.batch_size)
        for state, action, next_state, reward, done in batch:
            if not done:
                reward += self.gamma * np.amax(
                    self.model.predict(next_state)[0])
            target = self.model.predict(state)
            target[0, action] = reward
            self.model.fit(state, target, epochs=2, verbose=False)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

### Action Selection and Experience Replay

**The `act()` method implements epsilon-greedy action selection:**
- With probability `epsilon`: Choose a random action (exploration)
- With probability `1-epsilon`: Choose the action with highest Q-value (exploitation)
- Over time, epsilon decreases, shifting from exploration to exploitation
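The selection rule above can be tried out in isolation, without a neural network. A minimal sketch in plain Python; the function `epsilon_greedy` and the Q-values are made up for illustration and are not part of the agent class:

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(100)
q_values = [0.2, 0.7]  # made-up Q-values for actions 0 and 1
picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng)
         for _ in range(10_000)]
share_greedy = picks.count(1) / len(picks)
print(f'greedy action chosen in {share_greedy:.1%} of draws')
```

With epsilon at 0.1 and two actions, the greedy action is expected in roughly 95% of draws, because about half of the random picks land on it as well.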

**The `replay()` method implements experience replay:**
1. **Sample**: Randomly select a batch of past experiences from memory
2. **Compute targets**: For each experience, calculate the target Q-value:
   - If episode ended: target = immediate reward
   - If episode continues: target = reward + gamma * max(Q(next_state))
3. **Train**: Update the neural network to minimize prediction error
4. **Decay epsilon**: Gradually reduce exploration rate

This approach breaks the correlation between consecutive experiences and stabilizes learning.
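The target computation in step 2 is easier to follow with concrete numbers. A sketch in plain Python, using made-up Q-values and the agent's `gamma` of 0.9; the helper `td_target` is illustrative and not a method of `DQLAgent`:

```python
def td_target(reward, next_q_values, done, gamma=0.9):
    """Target for the action taken: r if done, else r + gamma * max Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Made-up Q-values for illustration
current_q = [1.5, 2.0]    # network output for the sampled state
action = 0                # action stored in the experience
target = list(current_q)  # the other action's value is left unchanged
target[action] = td_target(1.0, next_q_values=[2.0, 3.0], done=False)
print(target)
```

Only the entry for the action actually taken is overwritten, so the MSE loss contributes no error for the other action and the update affects just the Q-value the experience says something about.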

In [23]:
class DQLAgent(DQLAgent):
    def learn(self, episodes):
        for e in range(1, episodes + 1):
            state, _ = self.env.reset()
            state = np.reshape(state, [1, 4])
            for f in range(1, 5001):
                action = self.act(state)
                next_state, reward, done, trunc, _ = \
                    self.env.step(action)
                next_state = np.reshape(next_state, [1, 4])
                self.memory.append(
                    [state, action, next_state, reward, done])
                state = next_state
                if done or trunc:
                    self.trewards.append(f)
                    self.max_treward = max(self.max_treward, f)
                    templ = f'episode={e:4d} | treward={f:4d}'
                    templ += f' | max={self.max_treward:4d}'
                    print(templ, end='\r')
                    break
            if len(self.memory) > self.batch_size:
                self.replay()
        print()

### The Learning Process

**The `learn()` method orchestrates the training:**

**For each episode:**
1. **Reset environment** and get initial state
2. **Reshape state** to match neural network input format [1, 4]
3. **Play episode** until it terminates or is truncated (`CartPole-v1` truncates at 500 steps, well below the loop's cap of roughly 5000):
   - Choose action using epsilon-greedy policy
   - Execute action and observe results
   - Store experience (state, action, next_state, reward, done) in memory
   - Move to next state
4. **Track performance** by recording episode length
5. **Learn from experience** by calling `replay()` once more than `batch_size` experiences are stored

**Key Features:**
- **Early termination**: An episode ends when the pole falls, the cart moves too far, or the time limit is reached
- **Progress tracking**: Displays the current episode, its length, and the maximum length achieved so far
- **Experience replay**: Only starts learning after collecting enough experiences
- **Continuous learning**: The neural network is updated after every episode on one random batch of 32 stored experiences

In [24]:
class DQLAgent(DQLAgent):
    def test(self, episodes):
        for e in range(1, episodes + 1):
            state, _ = self.env.reset()
            state = np.reshape(state, [1, 4])
            for f in range(1, 5001):
                action = np.argmax(self.model.predict(state)[0])
                state, reward, done, trunc, _ = self.env.step(action)
                state = np.reshape(state, [1, 4])
                if done or trunc:
                    print(f, end=' ')
                    break

### Testing the Trained Agent

**The `test()` method evaluates the trained agent:**
- **Pure exploitation**: Always chooses the action with highest Q-value (no exploration)
- **No learning**: The neural network weights are frozen during testing
- **Performance measurement**: Prints each episode's length to assess the quality of the learned policy

This gives us a clean measure of how well the agent has learned the task without the noise of exploration.

In [25]:
agent = DQLAgent()

In [26]:
%time agent.learn(1500)

episode=1500 | treward= 254 | max= 500
CPU times: user 2min 11s, sys: 23.2 s, total: 2min 34s
Wall time: 2min 8s


### Training the DQN Agent

Now we train the agent for 1500 episodes. Watch the progress:

**What to expect during training:**
- **Early episodes**: Short episode lengths (similar to random agent) due to high exploration
- **Learning phase**: Gradual improvement as the agent discovers effective strategies
- **Convergence**: Ideally, consistent 500-step episodes (the maximum); in the run above the best training episode reached 500 steps, while the final one lasted 254

**Training dynamics:**
- **Exploration decreases**: Epsilon decays from 1.0 to 0.1 over time
- **Experience accumulates**: Memory buffer fills with diverse experiences
- **Q-function improves**: Neural network learns better action-value estimates

In [27]:
agent.epsilon

0.09997053357470892

### Final Exploration Rate

After training, let's check the final epsilon value. It has decayed to just below the floor of 0.1, so any further training would still pick a random action about 10% of the time, which helps the agent avoid getting stuck in local optima. Note that `test()` ignores epsilon and always acts greedily.
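The printed value can be reproduced by replaying the decay rule on its own. A short sketch that applies the same condition and factor as `replay()`, starting from the agent's initial epsilon of 1.0:

```python
epsilon, epsilon_decay, epsilon_min = 1.0, 0.9975, 0.1

updates = 0
while epsilon > epsilon_min:  # same condition as in replay()
    epsilon *= epsilon_decay
    updates += 1

print(updates, epsilon)
```

Since `replay()` runs at most once per episode, this suggests exploration reaches its floor after roughly 920 of the 1500 training episodes and stays constant for the remainder.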

In [28]:
agent.test(15)

185 211 206 101 198 234 115 287 241 116 98 201 120 174 95 

### Testing Trained Agent Performance

Now let's test the trained agent's performance with pure exploitation (no exploration). Each number represents how many steps the agent kept the pole balanced in that episode.

**Expected results:**
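- **Well-trained agent**: Would consistently reach 500 steps (the maximum possible)
- **This run**: Episodes last between 95 and 287 steps, about 172 on average, so the agent has learned a useful policy but has not mastered the task
- **Comparison to random agent**: Remember the random agent averaged about 21 steps
- **Success metric**: Episodes reaching 500 steps indicate the agent has mastered the task

If the agent consistently achieves 500 steps, it has learned a policy that solves the task. The run above falls short of that, which suggests that longer training or different hyperparameters would be needed.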
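A quick summary of the test output makes the comparison with the random baseline concrete. A sketch using the numbers printed in this notebook; a fresh training run will produce different values:

```python
import statistics

# Episode lengths printed by agent.test(15) above
test_lengths = [185, 211, 206, 101, 198, 234, 115, 287, 241, 116,
                98, 201, 120, 174, 95]
random_mean = 20.87  # random-agent average from earlier in the notebook

mean = statistics.mean(test_lengths)
solved = sum(length >= 500 for length in test_lengths)

print(f'mean={mean:.2f} | min={min(test_lengths)} | max={max(test_lengths)}')
print(f'episodes reaching 500 steps: {solved} of {len(test_lengths)}')
print(f'improvement over random: {mean / random_mean:.1f}x')
```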

## Summary: From Simple Learning to Deep Q-Learning

This notebook demonstrated a major leap in reinforcement learning sophistication:

### Key Achievements:
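1. **Environment progress**: Learned a far better than random policy for a control problem with continuous states and discrete actions
2. **Performance improvement**: From about 21 steps on average (random) to about 172 steps on average in greedy testing, with a best training episode of 500 steps (the maximum)
3. **Stable learning**: Used experience replay and epsilon-greedy exploration for robust training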

### Deep Q-Learning Innovations:
- **Function approximation**: Neural networks can handle continuous state spaces
- **Experience replay**: Breaking temporal correlations improves learning stability  
- **Exploration-exploitation balance**: Epsilon-greedy provides systematic exploration
- **Scalability**: The same approach has been applied to far more complex environments, such as Atari games

### Real-World Applications:
- **Finance**: Portfolio optimization, algorithmic trading, risk management
- **Robotics**: Robot control, manipulation, navigation
- **Games**: Game playing (AlphaGo, StarCraft, Dota)
- **Autonomous systems**: Self-driving cars, drones, smart grids

### Next Steps:
This foundation enables exploration of:
- **Double DQN**: Addressing overestimation bias
- **Dueling DQN**: Separating state values from action advantages  
- **Policy Gradient methods**: Direct policy optimization
- **Actor-Critic algorithms**: Combining value and policy learning
- **Financial applications**: Applying these techniques to trading and portfolio management

The transition from simple coin-flipping (Notebook 1) to complex control tasks (this notebook) showcases the power and potential of reinforcement learning!