<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Reinforcement Learning for Finance

**Chapter 03 &mdash; Financial Q-Learning**

&copy; Dr. Yves J. Hilpisch

<a href="https://tpq.io" target="_blank">https://tpq.io</a> | <a href="https://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## Finance Environment

## Introduction to Financial Q-Learning

This notebook represents a major milestone: **applying Deep Q-Learning to real financial data**. We transition from the controlled CartPole environment to the complex, noisy world of financial markets.

### Key Learning Objectives:
- **Custom Environment Design**: Creating a financial prediction environment from scratch
- **Real Data Integration**: Working with actual EUR/USD exchange rate data
- **Financial Feature Engineering**: Using price levels vs. returns as state representations
- **Performance Metrics**: Adapting RL success criteria for financial prediction accuracy
- **Risk Management**: Implementing early stopping based on performance thresholds

### The Challenge:
Financial markets are fundamentally different from game environments:
- **Noisy signals**: Market data contains significant random variation
- **Non-stationary**: Market patterns change over time
- **Sparse rewards**: Good predictions may not be immediately rewarded
- **Real consequences**: Mistakes have financial impact

### Our Approach:
We'll create a custom `Finance` environment that:
1. **Loads real market data** (EUR/USD exchange rates)
2. **Defines prediction tasks** (will price go up or down?)
3. **Provides meaningful rewards** (accuracy-based scoring)
4. **Implements risk controls** (stops trading if accuracy drops too low)

In [1]:
import os
import random

In [2]:
random.seed(100)
os.environ['PYTHONHASHSEED'] = '0'

### Environment Setup and Reproducibility

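Seeding Python's random number generator makes the sampled actions repeatable across runs, which matters when working with financial data, where small changes can lead to very different outcomes. Note that `PYTHONHASHSEED` is read when the interpreter starts, so setting it inside a running session does not change hash randomization for that session.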
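As a minimal sketch of what seeding provides, the snippet below draws ten binary actions twice from generators seeded with the same value. The helper `draw_actions` is a hypothetical name introduced for illustration, and a separate `random.Random` instance is used so that the notebook's global random state stays untouched.

```python
import random


def draw_actions(seed, n=10):
    """Draw n binary actions from a freshly seeded generator (illustrative helper)."""
    rng = random.Random(seed)  # local generator; the global random state is not modified
    return [rng.randint(0, 1) for _ in range(n)]


first = draw_actions(100)
second = draw_actions(100)

print(first == second)  # identical seeds give identical sequences
```

Both lists are identical because the generator state depends only on the seed.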

In [3]:
class ActionSpace:
    def sample(self):
        return random.randint(0, 1)

In [4]:
action_space = ActionSpace()

### Custom Action Space for Financial Predictions

Unlike Gymnasium's built-in environments, we need to create our own action space for financial prediction:

**ActionSpace Class:**
- **Action 0**: Predict price will go DOWN (bearish)
- **Action 1**: Predict price will go UP (bullish)
- **`sample()`**: Randomly selects between the two actions

This binary prediction task simplifies the complex world of trading into a fundamental question: "Will the asset price increase or decrease in the next period?"

In [5]:
[action_space.sample() for _ in range(10)]

[0, 1, 1, 0, 1, 1, 1, 0, 0, 0]

In [6]:
import numpy as np
import pandas as pd

In [7]:
class Finance:
    url = 'https://certificate.tpq.io/rl4finance.csv'
    def __init__(self, symbol, feature,
                 min_accuracy=0.485, n_features=4):
        self.symbol = symbol
        self.feature = feature
        self.n_features = n_features
        self.action_space = ActionSpace()
        self.min_accuracy = min_accuracy
        self._get_data()
        self._prepare_data()
    def _get_data(self):
        self.raw = pd.read_csv(self.url,
                index_col=0, parse_dates=True)

### The Finance Environment Class

This is the heart of our financial RL system. The `Finance` class creates a custom Gymnasium-like environment for financial prediction:

**Key Parameters:**
- **symbol**: The financial instrument to trade (e.g., 'EUR=' for EUR/USD)
- **feature**: What to use as state ('EUR=' for prices, 'r' for returns)
- **min_accuracy**: Minimum prediction accuracy before stopping (risk control)
- **n_features**: Number of historical periods to use as state (default: 4)

**Data Pipeline:**
1. **Download**: Fetches real market data from online source
2. **Process**: Calculates returns and direction labels
3. **Normalize**: Standardizes features for neural network training
4. **Structure**: Creates sequences for time-series prediction
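The four pipeline steps can be sketched offline as follows. This is an illustration under stated assumptions, not the class itself: the download is replaced by a synthetic random walk, and the random generator, date range, and volatility value are hypothetical choices.

```python
import numpy as np
import pandas as pd

# 1. "Download": hypothetical stand-in for the EUR/USD column (synthetic random walk)
rng = np.random.default_rng(0)
index = pd.date_range('2020-01-01', periods=250, freq='B')
prices = pd.Series(1.10 * np.exp(np.cumsum(rng.normal(0, 0.005, 250))),
                   index=index, name='EUR=')

# 2. Process: log returns and binary direction labels
data = pd.DataFrame(prices)
data['r'] = np.log(data['EUR='] / data['EUR='].shift(1))
data['d'] = np.where(data['r'] > 0, 1, 0)
data.dropna(inplace=True)  # the first row has no return

# 3. Normalize: z-scores computed over the full sample
data_ = (data - data.mean()) / data.std()

# 4. Structure: the first state is a window of n_features values,
#    the label to predict is the direction at the bar right after it
n_features = 4
state = data_['r'].iloc[0:n_features].values
label = data['d'].iloc[n_features]

print(state.shape, label)
```

The labels are read from the unnormalized frame, while the state comes from the normalized one, mirroring the split between `data` and `data_` in the class.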

In [8]:
class Finance(Finance):
    def _prepare_data(self):
        self.data = pd.DataFrame(self.raw[self.symbol]).dropna()
        self.data['r'] = np.log(self.data / self.data.shift(1))
        self.data['d'] = np.where(self.data['r'] > 0, 1, 0)
        self.data.dropna(inplace=True)
        self.data_ = (self.data - self.data.mean()) / self.data.std()
    def reset(self):
        self.bar = self.n_features
        self.treward = 0
        state = self.data_[self.feature].iloc[
            self.bar - self.n_features:self.bar].values
        return state, {}

### Data Preparation and Episode Initialization

**`_prepare_data()` Method:**
- **Return calculation**: `np.log(price[t] / price[t-1])` computes logarithmic returns
- **Direction labeling**: `d = 1` if return > 0 (price up), `d = 0` if return ≤ 0 (price down or unchanged)
- **Normalization**: Standardizes features to have mean=0, std=1 (crucial for neural networks); the mean and standard deviation are computed over the full data set

**`reset()` Method:**
- **Episode start**: Initializes a new prediction episode at the beginning of the data set
- **State construction**: Returns the first `n_features` normalized values as the initial state
- **Progress tracking**: Resets the total reward (`treward`) and the bar index (`bar`)

**Why logarithmic returns?**
- **Statistical properties**: Log returns are more normally distributed
- **Additive**: Easy to aggregate over time periods
- **Scale invariant**: Works across different price levels
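The additivity property can be checked on a small example. The four prices below are hypothetical values chosen for illustration; the sketch compares summed log returns and summed simple returns against the total return over the whole path.

```python
import numpy as np

prices = np.array([1.10, 1.12, 1.11, 1.15])  # hypothetical price path

log_returns = np.log(prices[1:] / prices[:-1])
simple_returns = prices[1:] / prices[:-1] - 1

total_log_return = np.log(prices[-1] / prices[0])
total_simple_return = prices[-1] / prices[0] - 1

# direction labels as in _prepare_data(): 1 for a positive return, 0 otherwise
directions = np.where(log_returns > 0, 1, 0)

print(np.isclose(log_returns.sum(), total_log_return))
print(np.isclose(simple_returns.sum(), total_simple_return))
print(directions)
```

Summing the log returns reproduces the total log return exactly, whereas the simple returns do not add up to the total simple return.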

In [9]:
class Finance(Finance):
    def step(self, action):
        # the prediction is correct if it matches the direction at the current bar
        correct = action == self.data['d'].iloc[self.bar]
        reward = 1 if correct else 0
        self.treward += reward
        self.bar += 1
        # share of correct predictions made so far in this episode
        self.accuracy = self.treward / (self.bar - self.n_features)
        if self.bar >= len(self.data):
            done = True  # end of the data set reached
        elif reward == 1:
            done = False  # a correct prediction never ends the episode
        elif (self.accuracy < self.min_accuracy) and (self.bar > 15):
            done = True  # accuracy too low after the initial bars
        else:
            done = False
        next_state = self.data_[self.feature].iloc[
            self.bar - self.n_features:self.bar].values
        return next_state, reward, done, False, {}

### The Step Method: Core Trading Logic

**`step(action)` Method Implementation:**

**1. Prediction Checking:**
- Compares agent's action with actual market direction (`data['d'].iloc[self.bar]`)
- Binary reward: 1 for correct prediction, 0 for incorrect

**2. Performance Tracking:**
- **`treward`**: Total correct predictions in current episode
- **`accuracy`**: Running share of correct predictions (`treward` divided by the number of predictions made so far)

**3. Episode Termination Logic:**
- **Natural end**: Reached end of data
- **Success continuation**: A correct prediction never ends the episode
- **Risk control**: After an incorrect prediction, the episode ends if accuracy is below `min_accuracy` and `self.bar > 15`, which with the default `n_features=4` means at least 12 predictions have been made

**4. State Progression:**
- **Time advancement**: Move to next market period
- **State update**: Return next `n_features` values as new state

**Financial Interpretation:**
This simulates a trading system that stops when performance degrades, protecting capital from further losses - a crucial risk management principle in quantitative trading.
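To make the termination logic concrete, the sketch below restates it as a standalone function and runs a hypothetical worst case: an agent whose every prediction is wrong. The names `episode_done` and `steps_until_stop` are illustrative and not part of the `Finance` class, and the default `n_bars` is an assumed data length.

```python
def episode_done(bar, n_bars, reward, accuracy, min_accuracy=0.485):
    """Termination rule of step(), rewritten as a standalone function (sketch)."""
    if bar >= n_bars:
        return True   # end of the data set
    if reward == 1:
        return False  # correct predictions never stop the episode
    return accuracy < min_accuracy and bar > 15


def steps_until_stop(n_features=4, n_bars=2607):
    """Count the steps an always-wrong agent survives (hypothetical worst case)."""
    bar, treward, steps = n_features, 0, 0
    done = False
    while not done:
        reward = 0  # every prediction is wrong
        treward += reward
        bar += 1
        steps += 1
        accuracy = treward / (bar - n_features)
        done = episode_done(bar, n_bars, reward, accuracy)
    return steps


print(steps_until_stop())
```

With `n_features=4` the episode cannot be stopped by the accuracy rule before step 12, because `bar` starts at 4 and must exceed 15 first.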

In [10]:
fin = Finance(symbol='EUR=', feature='EUR=')

In [11]:
list(fin.raw.columns)

['AAPL.O',
 'MSFT.O',
 'INTC.O',
 'AMZN.O',
 'GS.N',
 '.SPX',
 '.VIX',
 'SPY',
 'EUR=',
 'XAU=',
 'GDX',
 'GLD']

### Exploring the Financial Dataset

Let's examine which financial instruments are available in our dataset. It contains individual stocks (e.g., `AAPL.O`, `MSFT.O`), indices (`.SPX`, `.VIX`), ETFs (`SPY`, `GDX`, `GLD`), the EUR/USD exchange rate (`EUR=`), and the gold price (`XAU=`), providing a rich environment for testing trading strategies.

In [12]:
fin.reset()
# four lagged, normalized price points

(array([2.74844931, 2.64643904, 2.69560062, 2.68085214]), {})

### State Representation: Price Levels

**Using 'EUR=' as feature means our state consists of:**
- Four consecutive normalized EUR/USD price levels
- Each value represents the standardized price at different time points
- The agent must learn patterns in price movements to predict future direction

**Example state interpretation:**
- Values closer to 0: Prices near historical average
- Positive values: Prices above historical average  
- Negative values: Prices below historical average
- Sequence patterns: Trends, reversals, momentum

In [13]:
fin.action_space.sample()

1

In [14]:
fin.step(fin.action_space.sample())

(array([2.64643904, 2.69560062, 2.68085214, 2.63046153]), 0, False, False, {})

**Testing Environment Interaction:**
- **Action**: Random prediction (0 or 1)
- **Returns**: (next_state, reward, done, truncated, info)
- **Reward**: 1 if prediction correct, 0 if incorrect
- **State transition**: The window shifts forward by one bar, so three of the four normalized values carry over from the previous state

In [15]:
fin = Finance('EUR=', 'r')

In [16]:
fin.reset()
# four lagged, normalized log returns

(array([-1.19130476, -1.21344494,  0.61099805, -0.16094865]), {})

### Alternative State Representation: Returns

**Using 'r' (returns) as feature provides different information:**
- Four consecutive normalized logarithmic returns
- Each value is a standardized log return, which approximates the percentage price change between periods
- Often more stationary than price levels (better for ML)
- Focuses on momentum and volatility rather than absolute price levels

**Returns vs. Prices:**
- **Returns**: "How much did it move?" - captures dynamics
- **Prices**: "Where is it now?" - captures levels and trends
- Different features may lead to different trading strategies
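A small sketch can illustrate why returns are often the more stationary feature. It assumes a hypothetical price path that grows by 1% per period and compares how the mean of each feature shifts between the first and second half of the sample.

```python
import math

# Hypothetical price path growing by 1% per period
prices = [100 * 1.01 ** t for t in range(21)]
log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


half = len(log_returns) // 2
price_shift = mean(prices[half:]) - mean(prices[:half])
return_shift = mean(log_returns[half:]) - mean(log_returns[:half])

print(f'shift in mean price:      {price_shift:.4f}')
print(f'shift in mean log return: {return_shift:.4f}')
```

The mean price level drifts upward with the trend, while the mean log return stays the same in both halves. Real market data is noisier, but the same reasoning motivates using returns as the state.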

In [17]:
class RandomAgent:
    def __init__(self):
        self.env = Finance('EUR=', 'r')
    def play(self, episodes=1):
        self.trewards = list()
        for e in range(episodes):
            self.env.reset()
            # an episode that survives all 99 steps is not recorded in trewards
            for step in range(1, 100):
                a = self.env.action_space.sample()
                state, reward, done, trunc, info = self.env.step(a)
                if done:
                    self.trewards.append(step)
                    break

### Financial Random Agent Baseline

The `RandomAgent` for financial prediction serves the same purpose as in CartPole - establishing a baseline performance level. However, the interpretation is different:

**Financial Context:**
- **Expected performance**: ~50% accuracy (random guessing on binary prediction)
- **Episode length**: Number of predictions made before the accuracy rule stops the episode
- **Market efficiency**: If a trained agent barely beats this baseline, the market may be very efficient (hard to predict)

**Why this matters:**
- **Benchmark**: Any learning algorithm must beat random performance
- **Market reality check**: Many professional traders struggle to beat random selection
- **Risk assessment**: Shows how quickly poor strategies get stopped out

In [18]:
ra = RandomAgent()

In [19]:
ra.play(15)

In [20]:
ra.trewards

[17, 13, 17, 12, 12, 12, 13, 23, 31, 13, 12, 15]

In [21]:
round(sum(ra.trewards) / len(ra.trewards), 2)

15.83

**Random Agent Performance Analysis:**

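The recorded episode lengths show how quickly random predictions are stopped by the accuracy rule. Only 12 values appear for 15 episodes because `play()` records an episode only when `done` is returned within its 99-step loop; the other three episodes survived all 99 steps and were not recorded. The short episodes indicate:
- **Risk control working**: Poor performance triggers early termination
- **No edge**: Random guessing hovers around 50% accuracy, so it soon falls below the default 48.5% threshold
- **Baseline established**: Any serious trading algorithm must significantly outperform this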

**Financial implication**: This simulates what happens to traders who make decisions without any systematic approach - they quickly lose capital and get forced out of the market.
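The baseline can also be explored without the data set. The sketch below assumes a coin-flip market (each direction equally likely and independent), which is a simplification rather than a property of the EUR/USD data, and it records capped episodes explicitly instead of dropping them. The function name `random_episode` is hypothetical.

```python
import random


def random_episode(rng, n_features=4, min_accuracy=0.485, max_steps=99):
    """Length of one random-agent episode in a coin-flip market (assumption)."""
    bar, treward = n_features, 0
    for step in range(1, max_steps + 1):
        # the random prediction matches the random direction half of the time
        reward = 1 if rng.randint(0, 1) == rng.randint(0, 1) else 0
        treward += reward
        bar += 1
        accuracy = treward / (bar - n_features)
        if reward == 0 and accuracy < min_accuracy and bar > 15:
            return step, True   # stopped by the accuracy rule
    return max_steps, False     # survived the step cap


rng = random.Random(100)
results = [random_episode(rng) for _ in range(1000)]
stopped = [length for length, hit in results if hit]

print(f'stopped early: {len(stopped) / len(results):.1%}')
print(f'shortest stopped episode: {min(stopped)}')
```

Returning a flag alongside the length keeps the capped episodes visible, so the average is not computed over stopped episodes only.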

In [22]:
len(fin.data)

2607

In [23]:
import os
import random
import warnings
import numpy as np
import tensorflow as tf
from tensorflow import keras
from collections import deque
from keras.layers import Dense
from keras.models import Sequential

## Deep Q-Learning for Financial Prediction

Now we adapt our CartPole DQN agent for financial markets. The core algorithm remains the same, but we need several modifications for the financial domain:

**Key Adaptations for Finance:**
1. **Custom environment**: Using our `Finance` class instead of Gymnasium
2. **Different reward structure**: A reward of 1 only for a correct prediction (0 otherwise), instead of a reward for every surviving step
3. **Risk-adjusted gamma**: Lower discount factor (0.5) for shorter-term focus
4. **Variable input size**: Flexible `n_features` for different lookback periods
5. **Financial stopping criteria**: Performance-based episode termination

**Challenges in Financial RL:**
- **Sparse rewards**: Correct predictions only known after time passes
- **Noisy signals**: Market data contains significant randomness  
- **Regime changes**: Market behavior shifts over time
- **Overfitting risk**: High-dimensional data with limited samples

In [24]:
warnings.simplefilter('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [25]:
# public API equivalent of the private tensorflow.python import
tf.compat.v1.disable_eager_execution()

In [26]:
opt = keras.optimizers.legacy.Adam(learning_rate=0.0001)

In [27]:
class DQLAgent:
    def __init__(self, symbol, feature, min_accuracy, n_features=4):
        self.epsilon = 1.0
        self.epsilon_decay = 0.9975
        self.epsilon_min = 0.1
        self.memory = deque(maxlen=2000)
        self.batch_size = 32
        self.gamma = 0.5
        self.trewards = list()
        self.max_treward = 0
        self.n_features = n_features
        self._create_model()
        self.env = Finance(symbol, feature,
                    min_accuracy, n_features)
    def _create_model(self):
        self.model = Sequential()
        self.model.add(Dense(24, activation='relu',
                             input_dim=self.n_features))
        self.model.add(Dense(24, activation='relu'))
        self.model.add(Dense(2, activation='linear'))
        self.model.compile(loss='mse', optimizer=opt)
    def act(self, state):
        if random.random() < self.epsilon:
            return self.env.action_space.sample()
        return np.argmax(self.model.predict(state)[0])
    def replay(self):
        batch = random.sample(self.memory, self.batch_size)
        for state, action, next_state, reward, done in batch:
            if not done:
                reward += self.gamma * np.amax(
                    self.model.predict(next_state)[0])
            target = self.model.predict(state)
            target[0, action] = reward
            self.model.fit(state, target, epochs=1, verbose=False)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
    def learn(self, episodes):
        for e in range(1, episodes + 1):
            state, _ = self.env.reset()
            state = np.reshape(state, [1, self.n_features])
            for f in range(1, 5000):
                action = self.act(state)
                next_state, reward, done, trunc, _ = \
                    self.env.step(action)
                next_state = np.reshape(next_state,
                                        [1, self.n_features])
                self.memory.append(
                    [state, action, next_state, reward, done])
                state = next_state 
                if done:
                    self.trewards.append(f)
                    self.max_treward = max(self.max_treward, f)
                    templ = f'episode={e:4d} | treward={f:4d}'
                    templ += f' | max={self.max_treward:4d}'
                    print(templ, end='\r')
                    break
            if len(self.memory) > self.batch_size:
                self.replay()
        print()
    def test(self, episodes):
        ma = self.env.min_accuracy
        self.env.min_accuracy = 0.5
        for e in range(1, episodes + 1):
            state, _ = self.env.reset()
            state = np.reshape(state, [1, self.n_features])
            for f in range(1, 5001):
                action = np.argmax(self.model.predict(state)[0])
                state, reward, done, trunc, _ = self.env.step(action)
                state = np.reshape(state, [1, self.n_features])
                if done:
                    tmpl = f'total reward={f} | '
                    tmpl += f'accuracy={self.env.accuracy:.3f}'
                    print(tmpl)
                    break
        self.env.min_accuracy = ma

### Financial DQN Agent Implementation

**Key Modifications for Financial Markets:**

**1. Hyperparameter Adjustments:**
- **`gamma = 0.5`**: Lower discount factor for shorter-term predictions (vs. 0.9 in CartPole)
- **Flexible input**: `input_dim=n_features` adapts to different lookback periods
- **Custom environment**: Integration with our `Finance` class

**2. Network Architecture:**
- **Input layer**: `n_features` neurons (default 4 for 4-period lookback)
- **Hidden layers**: 24 neurons each (may need adjustment for financial complexity)
- **Output layer**: 2 neurons, one Q-value per action (index 0 = DOWN, index 1 = UP)

**3. Financial-Specific Methods:**
- **`learn()`**: Adapted for financial episode structure and termination
- **`test()`**: Modified to report accuracy metrics alongside episode performance
- **Risk management**: Integration with `min_accuracy` stopping criteria

**4. Performance Tracking:**
- **Episode rewards**: Number of prediction steps completed before the episode ends (stored in `trewards`)
- **Accuracy metrics**: Real-time calculation of prediction success rate
- **Maximum tracking**: Longest episode achieved so far (`max_treward`)
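The update performed inside `replay()` can be sketched without Keras. The function `q_target` below is a hypothetical helper, and the Q-values passed to it are made-up numbers; it only illustrates how the training target for one stored transition is built with `gamma = 0.5`.

```python
import numpy as np


def q_target(q_state, q_next, action, reward, done, gamma=0.5):
    """Build the training target for one transition (illustrative sketch)."""
    target = np.array(q_state, dtype=float)  # copy of the current Q-value estimates
    if not done:
        # bootstrap with the discounted best Q-value of the next state
        reward = reward + gamma * np.max(q_next)
    target[action] = reward  # only the taken action's Q-value is updated
    return target


# made-up Q-values for the current and the next state
ongoing = q_target([0.2, 0.4], [0.6, 1.0], action=1, reward=1, done=False)
terminal = q_target([0.2, 0.4], [0.6, 1.0], action=1, reward=1, done=True)

print(ongoing)
print(terminal)
```

For the ongoing transition the target for action 1 is 1 + 0.5 · 1.0 = 1.5, while for the terminal one it is just the reward of 1. The Q-value of the action not taken is left unchanged in both cases.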

In [28]:
random.seed(250)
tf.random.set_seed(250)

In [29]:
agent = DQLAgent('EUR=', 'r', 0.495, 4)

### Configuring the Financial Agent

**Agent Configuration:**
- **Symbol**: 'EUR=' (EUR/USD exchange rate)
- **Feature**: 'r' (using returns instead of price levels)
- **Min accuracy**: 0.495 (slightly below 50% - allowing for some noise)
- **Features**: 4 (using 4 lagged returns as state)

**Why these settings?**
- **Returns over prices**: More stationary, easier for neural networks to learn
- **Lenient accuracy threshold**: Accounts for market noise and gives agent time to learn
- **4-period lookback**: Captures short-term momentum without too much complexity

In [30]:
%time agent.learn(250)

episode= 250 | treward=  12 | max=2603
CPU times: user 21.1 s, sys: 3.05 s, total: 24.1 s
Wall time: 21 s


### Training the Financial Prediction Agent

**Training for 250 episodes on EUR/USD data:**

**What to expect:**
- **Early episodes**: Short runs due to random predictions and risk controls
- **Learning phase**: Gradual improvement as patterns emerge in financial data
- **Convergence**: Hopefully longer episodes with >49.5% accuracy

**Key differences from CartPole:**
- **Variable episode lengths**: Depend on prediction accuracy, not just time
- **Market dependency**: Performance varies with market conditions and volatility
- **Financial noise**: Success less predictable due to inherent market randomness

**Performance indicators:**
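- **Episode length** (`treward`): Longer episodes mean accuracy stayed above the threshold for longer
- **Maximum** (`max`): Longest episode so far; 2603 steps means the episode ran through the whole data set (2607 rows minus the 4 bars of the initial state)
- **Consistency**: Stable performance across multiple episodes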
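How much exploration remains after training follows from the decay rule in `replay()`, which is applied at most once per episode. The sketch below replays that rule with the agent's values (`epsilon = 1.0`, `epsilon_decay = 0.9975`, `epsilon_min = 0.1`); the function names are hypothetical.

```python
def epsilon_after(updates, epsilon=1.0, decay=0.9975, epsilon_min=0.1):
    """Apply the decay rule from replay() a given number of times."""
    for _ in range(updates):
        if epsilon > epsilon_min:
            epsilon *= decay
    return epsilon


def updates_to_floor(epsilon=1.0, decay=0.9975, epsilon_min=0.1):
    """Count the decay steps needed until epsilon stops decreasing."""
    n = 0
    while epsilon > epsilon_min:
        epsilon *= decay
        n += 1
    return n


print(round(epsilon_after(250), 3))
print(updates_to_floor())
```

After 250 updates epsilon is still about 0.53, so roughly half of the actions at the end of a 250-episode run are random. Reaching the floor of 0.1 takes roughly 920 updates.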

In [31]:
agent.test(5)

total reward=2603 | accuracy=0.525
total reward=2603 | accuracy=0.525
total reward=2603 | accuracy=0.525
total reward=2603 | accuracy=0.525
total reward=2603 | accuracy=0.525


### Testing Financial Agent Performance

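**Testing with pure exploitation (no exploration):**

`test()` always picks the action with the highest predicted Q-value and temporarily raises `min_accuracy` to 0.5. It runs on the same data used for training, so the results are in-sample. Because both the policy and the environment are deterministic here, all five test episodes produce identical output.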

**Metrics to watch:**
- **Total reward**: Number of predictions made before the episode ends (2603 means the episode ran through the full data set)
- **Accuracy**: Share of correct predictions at the end of the episode
- **Consistency**: Performance stability across test episodes

**Success criteria:**
- **Accuracy > 50%**: Better than random guessing
- **Longer episodes**: Accuracy stays above the threshold for more predictions
- **Stable performance**: Consistent results across multiple tests
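One rough way to judge whether an accuracy above 50% is more than luck is a binomial z-score. The sketch below assumes independent predictions with a 50% chance of being correct under pure guessing, which is a simplification, and uses the accuracy and episode length reported above. The helper name `accuracy_z_score` is hypothetical.

```python
import math


def accuracy_z_score(accuracy, n_predictions, p0=0.5):
    """Distance of an observed accuracy from p0 in standard errors (sketch)."""
    standard_error = math.sqrt(p0 * (1 - p0) / n_predictions)
    return (accuracy - p0) / standard_error


z = accuracy_z_score(0.525, 2603)
print(f'z = {z:.2f}')
```

A value around 2.5 standard errors suggests the in-sample accuracy is unlikely to come from guessing alone under these assumptions. It says nothing about out-of-sample performance.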

**Real-world implications:**
If the agent achieves >55% accuracy consistently, this could translate to profitable trading strategies in real markets (after accounting for transaction costs and slippage).


## Summary: From Games to Markets

This notebook marked a crucial transition from controlled environments to real-world financial applications:

### Key Achievements:
1. **Custom Environment Creation**: Built a financial prediction environment from scratch
2. **Real Data Integration**: Successfully applied RL to actual EUR/USD market data  
3. **Risk Management**: Implemented performance-based stopping criteria
4. **Feature Engineering**: Explored both price levels and returns as state representations

### Financial RL Innovations:
- **Domain Adaptation**: Modified DQN architecture for financial prediction tasks
- **Performance Metrics**: Accuracy-based rewards instead of game scores
- **Risk Controls**: Early termination to prevent catastrophic losses
- **Data Preprocessing**: Normalization and feature engineering for financial time series

### Technical Insights:
- **State Representation Matters**: Returns vs. prices can lead to different strategies
- **Risk-Reward Balance**: Lower gamma (0.5) focuses on shorter-term predictions
- **Market Efficiency**: Even small improvements over random (50%) can be valuable
- **Noise Management**: Financial data requires careful handling of random variation

### Real-World Applications:
- **Algorithmic Trading**: Systematic prediction and execution of trades
- **Risk Management**: Early warning systems for portfolio protection
- **Market Making**: Predicting short-term price movements for spread capture
- **Portfolio Optimization**: Asset allocation based on return predictions

### Lessons Learned:
- **Markets are Hard**: Financial prediction is more challenging than game environments
- **Data Quality Matters**: Preprocessing and feature selection are crucial
- **Risk Control Essential**: Stop-loss mechanisms prevent catastrophic failures
- **Small Edges Count**: Even 52-53% accuracy can be very valuable in finance

### Next Steps:
This foundation enables exploration of:
- **Multi-asset portfolios**: Trading multiple instruments simultaneously
- **Transaction costs**: Including realistic trading costs and slippage
- **Position sizing**: Optimizing bet sizes based on confidence levels
- **Alternative features**: Technical indicators, sentiment data, news analysis
- **Advanced architectures**: LSTMs, Transformers for time-series modeling

**The Journey So Far:**
- **Notebook 1**: Simple learning with coins and dice
- **Notebook 2**: Deep Q-Learning with CartPole  
- **Notebook 3**: Real-world application to financial markets

We've progressed from basic concepts to practical applications that could actually generate trading signals in live markets!