<a href="https://colab.research.google.com/github/leomercanti/Course_Advanced_Investing_with_AI/blob/main/Module_3_Reinforcement_Learning_in_Financial_Markets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Module 3: Reinforcement Learning in Financial Markets**


If you haven't checked out the previous modules yet, you can find them at the links below:
- [Module 1](https://colab.research.google.com/drive/15iRO6g-AyE2vGtdodh4xZ5RLmAcPcNV_)
- [Module 2](https://colab.research.google.com/drive/1Xr_athUWZH3iKZ6ubzvPnEAN9lPnee-1)

<br>

**Learning Goals:**
- Understand the foundations of Reinforcement Learning and how it differs from other machine learning paradigms.
- Implement a basic Q-Learning algorithm to model stock trading decisions.
- Explore advanced RL techniques like Deep Q-Networks (DQN) and policy-based methods (e.g., PPO) for more complex environments.
- Learn to backtest RL-driven trading strategies to evaluate performance.


### 3.1 Core Readings and Resources

- **Textbook:** "Reinforcement Learning: An Introduction" by Sutton and Barto

  - Chapter 4 (Dynamic Programming) and Chapter 6 (Temporal-Difference Learning, which introduces Q-Learning): Provide the fundamental understanding of RL, rewards, and policies.

- **Research Papers:**

  - "Deep Reinforcement Learning in Financial Trading Systems": A practical approach to applying DQN to trading.
  - “Reinforcement Learning for Portfolio Management”: Shows how RL can be applied to optimizing portfolios over time.

- **Optional:**
  - “Trust Region Policy Optimization” (Schulman et al.): Useful background on the trust-region idea behind policy-based optimization algorithms like PPO.

### 3.2 Key Topics Overview

**Q-Learning for Trading**

- **Why Q-Learning?** Q-Learning is a fundamental RL algorithm in which the agent learns, through interaction with the environment, to estimate the expected cumulative reward of each action in each state. It stores these estimates in a Q-Table and uses them to make trading decisions (e.g., buy, sell, hold). Its simplicity makes it a good first RL algorithm to apply to trading.

- **Main Concepts:**

  - States: Represent the financial environment (e.g., current stock prices, technical indicators).
  - Actions: Possible decisions (e.g., buy, sell, hold).
  - Rewards: Feedback received based on the outcome of an action (e.g., profit/loss from trade).
  - Policy: The agent’s strategy, derived from the Q-Table, which determines which action to take in each state.

- **Use Case:** Building a Q-Learning-based agent that learns to trade a stock based on historical price data.

- **Hands-On Example:** Implementing Q-Learning for Trading

In [None]:
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

In [None]:
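# Download historical data (Apple stock for this example)
data = yf.download("AAPL", start="2022-01-01", end="2024-09-01")
# Flatten to a 1-D array: depending on the yfinance version, data['Close']
# may be a one-column DataFrame rather than a Series
prices = data['Close'].to_numpy().ravel()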

In [None]:
# Define Q-Learning parameters
n_actions = 3  # 0 = Buy, 1 = Hold, 2 = Sell
n_states = len(prices)  # One state per time step (the state is the index into prices)
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 0.1  # Exploration rate

In [None]:
# Initialize Q-Table with zeros
Q_table = np.zeros((n_states, n_actions))
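
The table above has one row per time step, so each state corresponds to a single historical date and the learned values cannot be reused on new data. One possible alternative is to discretize daily returns into a few bins, so that the same state recurs many times. The sketch below assumes this approach; the helper name `returns_to_states` and the bin edges are illustrative choices, not part of the original notebook.

```python
import numpy as np

def returns_to_states(prices, bin_edges=(-0.01, 0.0, 0.01)):
    """Map each daily return to a small integer state (hypothetical helper).

    With the default edges there are four states:
    0: return < -1%, 1: -1% <= return < 0, 2: 0 <= return < 1%, 3: return >= 1%.
    """
    prices = np.asarray(prices, dtype=float).ravel()
    returns = np.diff(prices) / prices[:-1]   # simple daily returns
    return np.digitize(returns, bin_edges)    # bin index for each return

toy_prices = [100.0, 102.0, 101.5, 101.6, 99.0]
states = returns_to_states(toy_prices)
n_return_states = 4                           # len(bin_edges) + 1
Q_small = np.zeros((n_return_states, 3))      # 3 actions: Buy, Hold, Sell
print(states, Q_small.shape)
```

With this encoding the Q-Table has four rows instead of one per date, at the cost of discarding everything about the market except the size of the last move.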

In [None]:
# Reward function: one-step profit/loss per share
def get_reward(action, current_price, next_price):
    if action == 0:  # Buy: profit if the price rises
        return next_price - current_price
    elif action == 2:  # Sell: profit if the price falls
        return current_price - next_price
    return 0  # Hold

In [None]:
# Q-Learning algorithm
def q_learning(episodes=1000):
    for episode in range(episodes):
        state_idx = np.random.randint(0, n_states - 1)
        done = False
        while not done:
            # Choose action: Exploration vs. Exploitation
            if np.random.rand() < epsilon:
                action = np.random.randint(0, n_actions)
            else:
                action = np.argmax(Q_table[state_idx])

            # Take action and observe next state and reward
            next_state_idx = (state_idx + 1) % n_states
            reward = get_reward(action, prices[state_idx], prices[next_state_idx])

            # Update Q-Table
            best_next_action = np.argmax(Q_table[next_state_idx])
            Q_table[state_idx, action] = Q_table[state_idx, action] + alpha * (
                reward + gamma * Q_table[next_state_idx, best_next_action] - Q_table[state_idx, action])

            state_idx = next_state_idx
            if state_idx == n_states - 1:
                done = True
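
To make the update inside the loop concrete, here is a single Q-Learning step worked on a toy table. The function name `q_update` and the toy numbers are assumptions for illustration; the formula is the same one used in `q_learning` above.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-Learning update in place and return the TD error."""
    td_target = r + gamma * np.max(Q[s_next])   # reward plus discounted best next value
    td_error = td_target - Q[s, a]              # how far the current estimate is off
    Q[s, a] += alpha * td_error                 # move a fraction alpha toward the target
    return td_error

# Toy table: 2 states x 3 actions (Buy, Hold, Sell)
Q_toy = np.zeros((2, 3))
Q_toy[1] = [1.0, 0.0, -1.0]                     # pretend state 1 already has estimates

# Buying in state 0 earned a reward of 2.0 and led to state 1
err = q_update(Q_toy, s=0, a=0, r=2.0, s_next=1)
print(err, Q_toy[0, 0])
```

Here the target is 2.0 + 0.9 × 1.0 = 2.9, and with a learning rate of 0.1 the entry for (state 0, Buy) moves from 0 to about 0.29.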

In [None]:
# Train the Q-Learning agent
q_learning()

In [None]:
# Visualize Q-values
plt.plot(Q_table)
plt.title('Q-Values for Buy, Hold, and Sell actions over time')
plt.xlabel('Time step (state index)')
plt.ylabel('Q-value')
plt.legend(['Buy', 'Hold', 'Sell'])
plt.show()

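- **Expected Outcome:** The Q-Learning agent learns which action (buy/sell/hold) has the highest estimated value at each time step of the historical price series. Because the Q-Table is indexed by time step, this is an in-sample result and does not by itself show that the policy would be profitable on unseen data. The plot shows the learned Q-values for each action across states.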
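
One of the learning goals is backtesting, so here is a minimal sketch of how a learned Q-Table could be turned into a per-share profit-and-loss curve and compared with buy-and-hold. The function name `backtest_greedy` is hypothetical, and the sketch assumes no transaction costs, one share per trade, and the notebook's action convention (0 = Buy, 1 = Hold, 2 = Sell).

```python
import numpy as np

def backtest_greedy(Q, prices):
    """Follow the greedy action at each step and accumulate per-share P&L.

    Transaction costs and position sizing are ignored (simplifying assumption).
    """
    prices = np.asarray(prices, dtype=float).ravel()
    actions = np.argmax(Q[:-1], axis=1)               # greedy action per time step
    price_moves = np.diff(prices)                     # next price minus current price
    position = np.where(actions == 0, 1.0,
                        np.where(actions == 2, -1.0, 0.0))
    strategy_pnl = np.cumsum(position * price_moves)  # cumulative P&L of the policy
    buy_hold_pnl = np.cumsum(price_moves)             # benchmark: always long one share
    return strategy_pnl, buy_hold_pnl

# Toy check with a hand-written Q-table (not a trained one)
toy_prices = [10.0, 11.0, 10.5, 12.0]
toy_Q = np.array([[1.0, 0.0, 0.0],    # step 0: Buy
                  [0.0, 0.0, 1.0],    # step 1: Sell
                  [0.0, 1.0, 0.0],    # step 2: Hold
                  [0.0, 0.0, 0.0]])   # last step: no trade possible
strategy, benchmark = backtest_greedy(toy_Q, toy_prices)
print(strategy[-1], benchmark[-1])
```

Calling `backtest_greedy(Q_table, prices)` on the trained table gives an in-sample curve only; a meaningful evaluation needs a state representation that can be applied to a held-out period.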

**Deep Q-Networks (DQN) for Trading**

- **Why DQN?** In environments with large or continuous state spaces (e.g., complex market conditions), Q-tables become impractical. Deep Q-Networks (DQN) use a neural network to approximate the Q-function, enabling the agent to handle larger environments and more complex decision-making.

- **Main Concepts:**

  - Neural Network Q-Function: Instead of storing Q-values in a table, a neural network estimates the Q-values for each action based on the state.
  - Experience Replay: Stores past experiences in a buffer and samples from this buffer during training, improving stability.
  - Target Network: A separate copy of the network whose weights are synchronized with the main network only periodically, which keeps the training targets stable and reduces oscillations in Q-value updates.

- **Use Case:** Using DQN to make trading decisions in a more dynamic and complex financial environment.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import random

In [None]:
# Define the DQN network
class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

In [None]:
# Initialize DQN parameters
state_size = 1  # Using stock prices as states
action_size = 3  # Buy, Hold, Sell
model = DQN(state_size, action_size)
target_model = DQN(state_size, action_size)  # Target network
target_model.load_state_dict(model.state_dict())  # Start with identical weights
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

In [None]:
# Replay buffer for experience replay
replay_buffer = []
batch_size = 32
gamma = 0.95
epsilon = 0.1
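
A plain list grows without bound as training runs. A common alternative is a fixed-capacity buffer that discards the oldest experiences. The class below is a minimal sketch of that idea; the name `ReplayBuffer` and the capacity values are assumptions, not part of the original notebook.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience store (illustrative helper)."""

    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the time ordering of experiences
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1)
batch = buffer.sample(2)
print(len(buffer), sorted(exp[0] for exp in batch))
```

After five pushes into a buffer of capacity three, only the three most recent experiences remain available for sampling.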

In [None]:
# Reward function for the DQN agent (same logic as in the Q-Learning example)
def get_reward(action, current_price, next_price):
    # Simple reward: profit/loss based on the action taken
    if action == 0:  # Buy
        return next_price - current_price
    elif action == 1:  # Hold
        return 0  # No profit or loss
    elif action == 2:  # Sell
        return current_price - next_price
    return 0

In [None]:
# NOTE: This block can take a while to run, because every time step
# triggers a gradient update. Lower `episodes` for a quick test.

# Train the DQN
def train_dqn(episodes=50, target_update_every=500):
    series = np.asarray(prices, dtype=np.float32).ravel()  # 1-D price series
    step_count = 0
    for episode in range(episodes):
        # One episode is one pass through the price series
        for t in range(len(series) - 1):
            state = torch.tensor([[float(series[t])]], dtype=torch.float32)
            next_state = torch.tensor([[float(series[t + 1])]], dtype=torch.float32)

            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice([0, 1, 2])  # Explore
            else:
                with torch.no_grad():
                    action = torch.argmax(model(state)).item()  # Exploit

            # Observe the reward for the chosen action
            reward = float(get_reward(action, series[t], series[t + 1]))

            # Store experience in replay buffer
            replay_buffer.append((state, action, reward, next_state))

            # Periodically copy the online weights into the target network
            if step_count % target_update_every == 0:
                target_model.load_state_dict(model.state_dict())
            step_count += 1

            # Experience replay: one gradient step on a random mini-batch
            if len(replay_buffer) > batch_size:
                batch = random.sample(replay_buffer, batch_size)
                states = torch.cat([s for s, _, _, _ in batch])
                actions = torch.tensor([[a] for _, a, _, _ in batch])
                rewards = torch.tensor([r for _, _, r, _ in batch], dtype=torch.float32)
                next_states = torch.cat([ns for _, _, _, ns in batch])

                # Q-values of the actions that were actually taken
                current_q = model(states).gather(1, actions).squeeze(1)

                # Target Q-values from the target network
                with torch.no_grad():
                    target_q = rewards + gamma * target_model(next_states).max(dim=1).values

                # Calculate loss and update model
                loss = loss_fn(current_q, target_q)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

train_dqn()

**Policy-Based Methods: Proximal Policy Optimization (PPO)**

- **Why PPO?** Policy-based methods differ from Q-learning by directly optimizing the policy (i.e., the action-selection strategy) instead of estimating Q-values. PPO is a popular policy-gradient method designed for stable and efficient learning, and it works with continuous action spaces such as portfolio weights.

- **Main Concepts:**

  - **Policy Optimization:** Instead of learning Q-values, PPO optimizes the probability distribution over actions.
  - **Clipping:** PPO clips the probability ratio between the new and old policy, so a single update cannot move the policy too far, which improves stability.
  - **Advantages:** Suitable for tasks with continuous action spaces or where Q-learning methods struggle to converge.

- **Use Case:** Applying PPO to optimize portfolio allocation over multiple assets.
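
The clipping idea can be written in a few lines. The sketch below computes the clipped surrogate objective from log-probabilities and advantage estimates using NumPy; the function name `ppo_clipped_objective` and the toy inputs are assumptions, and `clip_eps=0.2` is a commonly used value rather than a requirement. A full PPO agent would also need a policy network, advantage estimation, and an optimizer, none of which are shown here.

```python
import numpy as np

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    ratio = np.exp(new_log_probs - old_log_probs)    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))   # take the more pessimistic value per sample

old_lp = np.log(np.array([0.5, 0.5]))
new_lp = np.log(np.array([0.75, 0.25]))   # probability ratios: 1.5 and 0.5
adv = np.array([1.0, -1.0])
objective = ppo_clipped_objective(new_lp, old_lp, adv)
print(objective)
```

In the first sample the ratio of 1.5 is capped at 1.2, so the policy gains nothing from pushing a good action's probability beyond the clip range. In the second, the ratio of 0.5 is raised to 0.8, which removes the reward for cutting a bad action's probability any further.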

### 3.3 Advanced Concepts: Model-Free vs. Model-Based RL

**Model-Free RL:**
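- **Advantages:** Simpler to implement and widely used in finance, because the agent learns directly from experience without a model of the environment. Q-Learning, DQN, and PPO are all model-free.
- **Disadvantages:** Sample-inefficient: it typically needs a large amount of data to learn a good policy.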

**Model-Based RL:**
- **Advantages:** More data-efficient, because the agent can use a model of the environment to simulate future states and plan ahead.
- **Disadvantages:** More complex to build, and an inaccurate model misleads the agent, which is a real risk in unpredictable environments like financial markets.
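
As a rough illustration of what "a model of the environment" can mean, the sketch below estimates a transition matrix between two market states (down day, up day) from an observed sequence and then uses it to simulate a future path. The function names, the two-state encoding, and the toy sequence are all assumptions chosen for brevity; real markets are unlikely to be captured by such a simple model.

```python
import numpy as np

def fit_transition_matrix(states, n_states):
    """Estimate P[s, s'] from an observed state sequence by counting transitions."""
    counts = np.zeros((n_states, n_states))
    for s, s_next in zip(states[:-1], states[1:]):
        counts[s, s_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows that were never visited fall back to a uniform distribution
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / n_states)

def simulate(P, start_state, n_steps, rng):
    """Roll the learned model forward to generate a synthetic state path."""
    path = [start_state]
    for _ in range(n_steps):
        path.append(int(rng.choice(len(P), p=P[path[-1]])))
    return path

# Toy sequence: 0 = down day, 1 = up day
observed = [0, 1, 1, 0, 1, 1, 1, 0]
P = fit_transition_matrix(observed, n_states=2)
rng = np.random.default_rng(0)
path = simulate(P, start_state=1, n_steps=5, rng=rng)
print(P)
print(path)
```

A model-based agent could train on many such simulated paths instead of only the one historical path, which is where the data efficiency comes from.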

### 3.4 End of Module Assignments and Practice (Optional)

- **Assignment 1:** Implement a Q-Learning algorithm for stock trading. Track your agent's learning progress by visualizing Q-values for different actions over time.

- **Assignment 2:** Build a DQN agent for trading and optimize its hyperparameters (e.g., learning rate, epsilon decay). Test the model's performance on a different stock dataset and compare results with the Q-Learning approach.

- **Bonus:** Experiment with PPO for portfolio optimization. Define your portfolio and apply PPO to determine optimal asset allocations over time.
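
For Assignment 2, one simple way to implement epsilon decay is an exponential schedule with a floor, so the agent explores heavily at first and mostly exploits later. The function name `epsilon_schedule` and the default values below are illustrative assumptions to tune, not recommended settings.

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_rate=0.995):
    """Exponentially decay the exploration rate toward a floor (illustrative values)."""
    return max(eps_end, eps_start * decay_rate ** step)

for step in (0, 100, 1000):
    print(step, round(epsilon_schedule(step), 4))
```

Inside a training loop, the fixed `epsilon` would be replaced by `epsilon_schedule(step_count)`.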

By the end of **Module 3**, you should be comfortable implementing **reinforcement learning algorithms**, such as **Q-Learning** and **Deep Q-Networks (DQN)**, to build agents that learn trading decisions from data. You should also understand the basics of policy-based methods like **Proximal Policy Optimization (PPO)** for portfolio management.

These techniques allow you to model complex financial environments and continuously improve strategies through trial and error.

With these skills, you’re now equipped to tackle fully automated trading systems and explore cutting-edge applications of AI in finance in the next phase of the program.