<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Reinforcement Learning for Finance

**Chapter 01 &mdash; Learning through Interaction**

&copy; Dr. Yves J. Hilpisch

<a href="https://tpq.io" target="_blank">https://tpq.io</a> | <a href="https://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## Learning

This notebook demonstrates fundamental concepts of **learning through interaction** - a core principle of reinforcement learning. We'll explore how an agent can learn optimal strategies through trial and error when interacting with an environment.

The examples progress from simple random choices to more sophisticated learning strategies that adapt based on observed outcomes. This mimics how reinforcement learning agents improve their decision-making over time.

**Key Learning Objectives:**
- Understand the difference between random action selection and adaptive learning
- See how agents can discover patterns in biased environments
- Learn about exploration vs exploitation trade-offs
- Observe how simple frequency-based learning can improve performance

### Tossing a Biased Coin

In [1]:
import numpy as np
from numpy.random import default_rng
rng = default_rng(seed=100)

### Setting Up the Environment

The cell above sets up the random number generator with a fixed seed (`seed=100`) for reproducibility. Re-running the notebook from the top then gives the same results each time, which makes the learning patterns easier to compare.
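As a small illustration of what the fixed seed buys us, the sketch below creates two independent generators with the same seed and checks that they produce identical draws. The names `rng_a`, `rng_b`, `draws_a`, `draws_b`, and `same` are illustrative and not part of the notebook.

```python
from numpy.random import default_rng

# Two separate generators created with the same seed
rng_a = default_rng(seed=100)
rng_b = default_rng(seed=100)

# Draw ten coin tosses from each generator
draws_a = rng_a.choice([1, 0], size=10)
draws_b = rng_b.choice([1, 0], size=10)

# Identical seeds give identical sequences
same = bool((draws_a == draws_b).all())
print(same)  # True
```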

In [2]:
ssp = [1, 0]

In [3]:
asp = [1, 0]

### Experiment 1: Fair Coin with Random Actions

Let's start with the simplest case - a fair coin toss where:
- `ssp = [1, 0]` represents the environment's state space (the coin can show heads=1 or tails=0)
- `asp = [1, 0]` represents the agent's action space (the agent can predict heads=1 or tails=0)

In this first experiment, both the coin and the agent's choices are completely random and fair.

In [4]:
def epoch():
    tr = 0
    for _ in range(100):
        a = rng.choice(asp)
        s = rng.choice(ssp)
        if a == s:
            tr += 1
    return tr

The `epoch()` function simulates one episode of interaction:
- We run 100 trials per epoch
- In each trial, the agent chooses an action `a` randomly from `asp`
- The environment reveals its state `s` randomly from `ssp`
- If the agent's prediction matches the environment's outcome (`a == s`), we count it as a success
- The function returns the total number of correct predictions
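The same experiment can also be sketched in vectorized form, drawing all actions and states at once instead of looping. This is an alternative formulation rather than the notebook's own code. Because the random numbers are consumed in a different order, it will not reproduce the exact values shown in the outputs, only the same statistical behavior. The names `n_epochs`, `n_trials`, `actions`, `states`, and `rl_vec` are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(seed=100)

ssp = [1, 0]  # state space: fair coin
asp = [1, 0]  # action space: predict heads or tails

n_epochs, n_trials = 250, 100

# One row per epoch, one column per trial
actions = rng.choice(asp, size=(n_epochs, n_trials))
states = rng.choice(ssp, size=(n_epochs, n_trials))

# Count the correct predictions in every epoch
rl_vec = (actions == states).sum(axis=1)
print(rl_vec[:10], rl_vec.mean())
```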

In [5]:
rl = np.array([epoch() for _ in range(250)])
rl[:10]

array([56, 47, 48, 55, 55, 51, 54, 43, 55, 40])

The run above covers 250 epochs, with the first ten results shown. The mean over all epochs summarizes how well the agent performs with random choices:

In [6]:
rl.mean()

49.968

**Expected Result:** With both the coin and agent choices being random and fair, we expect about 50% success rate (around 50 correct predictions out of 100 trials per epoch).

In [7]:
ssp = [1, 1, 1, 1, 0]

In [8]:
asp = [1, 0]

### Experiment 2: Biased Coin with Random Actions

Now we introduce bias into the environment:
- `ssp = [1, 1, 1, 1, 0]` means the coin shows heads (1) with 80% probability and tails (0) with 20% probability
- `asp = [1, 0]` means the agent still chooses randomly with 50/50 probability

This demonstrates what happens when there's a pattern in the environment but the agent doesn't adapt to it.

In [9]:
def epoch():
    tr = 0
    for _ in range(100):
        a = rng.choice(asp)
        s = rng.choice(ssp)
        if a == s:
            tr += 1
    return tr

In [10]:
rl = np.array([epoch() for _ in range(250)])
rl[:10]

array([53, 56, 40, 55, 53, 49, 43, 45, 50, 51])

In [11]:
rl.mean()

49.924

**Result Analysis:** Even though the environment is biased toward heads (80% probability), the agent still performs randomly at ~50% success rate because it doesn't learn from the pattern. The optimal strategy would be to always predict heads (1), which would yield 80% success rate.
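The two success rates quoted above can be checked exactly, assuming the action and the state are drawn independently and uniformly from their lists, as in `epoch()`. The helper `match_probability` below is a hypothetical function written for this check, not part of the notebook. The always-heads policy is represented by the single-element action list `[1]`.

```python
from collections import Counter
from fractions import Fraction


def match_probability(asp, ssp):
    """Exact probability that a uniform draw from asp equals a uniform draw from ssp."""
    a_counts = Counter(asp)
    s_counts = Counter(ssp)
    return sum(
        Fraction(a_counts[v], len(asp)) * Fraction(s_counts[v], len(ssp))
        for v in a_counts
    )


ssp = [1, 1, 1, 1, 0]  # biased coin: 80% heads

p_random = match_probability([1, 0], ssp)  # agent guesses 50/50
p_heads = match_probability([1], ssp)      # agent always predicts heads

print(p_random, p_heads)  # 1/2 4/5
```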

In [12]:
ssp = [1, 1, 1, 1, 0]

In [13]:
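def epoch(n):
    tr = 0
    asp = [0, 1]  # start each epoch with one copy of every action
    for _ in range(n):
        a = rng.choice(asp)  # sample an action in proportion to past observations
        s = rng.choice(ssp)
        if a == s:
            tr += 1
        asp.append(s)  # learn: add the observed outcome to the action list
    return tr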

### Experiment 3: Biased Coin with Simple Learning

Now we introduce a simple learning mechanism. The agent still starts with random actions but:
- **Learning mechanism**: After each trial, the agent adds the observed outcome to its action space
- **Action selection**: The agent still chooses randomly, but now the action space contains more copies of frequently observed outcomes

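This is loosely related to **experience replay** in that stored past observations influence future decisions. More precisely, the agent ends up predicting each outcome about as often as it has observed it, a behavior known as probability matching.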
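To see the mechanism at work, the sketch below isolates the learning step: it only appends observed outcomes to the action list and then inspects how the list is composed after 100 trials. The variable `share_heads` is an illustrative name. With the 80/20 coin, the share of heads in the list should end up close to 0.8.

```python
import numpy as np

rng = np.random.default_rng(seed=100)

ssp = [1, 1, 1, 1, 0]  # biased coin: 80% heads
asp = [0, 1]           # initial action list, one copy of each action

# Learning step only: append every observed outcome
for _ in range(100):
    asp.append(int(rng.choice(ssp)))

# Fraction of entries equal to 1 (heads) after 100 observations
share_heads = sum(asp) / len(asp)
print(len(asp), round(share_heads, 3))
```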

In [19]:
rl = np.array([epoch(100) for _ in range(250)])
rl[:10]

array([80, 62, 73, 67, 74, 63, 70, 62, 63, 65])

In [20]:
rl.mean()

66.66

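**Result Analysis:** The performance improves markedly, from about 50 to about 67 correct predictions per epoch. By adding observed outcomes to the action space, the agent gravitates toward predicting the more frequent outcome (heads). It still stays below the 80% achievable by always predicting heads, because it continues to predict tails part of the time. This simple learning mechanism shows how agents can adapt to environmental patterns without the bias being programmed in explicitly.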
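The observed mean of 66.66 can be checked with a short calculation. The notation $q_t$ is introduced here and is not from the notebook. It denotes the expected share of heads in `asp` before trial $t$ (counting from $t = 0$), when the list holds the two initial entries plus $t$ observed outcomes. Since action and outcome are drawn independently:

$$
q_t = \frac{1 + 0.8\,t}{2 + t}, \qquad
P(a_t = s_t) = 0.8\,q_t + 0.2\,(1 - q_t) = 0.68 - \frac{0.36}{t + 2}
$$

$$
\sum_{t=0}^{99} P(a_t = s_t) = 68 - 0.36 \sum_{t=0}^{99} \frac{1}{t + 2} \approx 66.5
$$

The long-run success rate is therefore $0.8^2 + 0.2^2 = 0.68$, and the expected count over the first 100 trials is slightly lower because the agent starts out with a 50/50 action list.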

In [21]:
from collections import Counter

In [22]:
ssp = [1, 1, 1, 1, 0]

In [23]:
def epoch(n):
    tr = 0
    asp = [0, 1]
    for _ in range(n):
        c = Counter(asp)  # count how often each value occurs in asp
        a = c.most_common()[0][0]  # greedy: pick the most frequent value
        s = rng.choice(ssp)
        if a == s:
            tr += 1
        asp.append(s)
    return tr

### Experiment 4: Biased Coin with Greedy Strategy

Now we implement a more sophisticated learning strategy:
- **Frequency tracking**: We use `Counter` to track the frequency of observed outcomes
- **Greedy action selection**: The agent always chooses the action that has been most frequently observed (`most_common()[0][0]`)

This represents a **greedy exploitation** strategy where the agent always chooses the currently best-known action.
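The expression `most_common()[0][0]` deserves a closer look. `Counter.most_common()` returns `(value, count)` pairs sorted by count in descending order, and values with equal counts keep the order in which they were first encountered. The sketch below shows the consequence for the initial list `[0, 1]`: the very first greedy pick is 0, and the pick switches to 1 as soon as heads has been observed more often. The names `first_pick` and `later_pick` are illustrative.

```python
from collections import Counter

asp = [0, 1]

# Tie between 0 and 1: the first-encountered value (0) comes first
first_pick = Counter(asp).most_common()[0][0]

# After observing heads once, 1 is the most frequent value
asp.append(1)
later_pick = Counter(asp).most_common()[0][0]

print(first_pick, later_pick)  # 0 1
```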

In [24]:
rl = np.array([epoch(100) for _ in range(250)])
rl[:10]

array([78, 79, 77, 80, 73, 79, 79, 79, 82, 84])

In [25]:
rl.mean()

79.264

**Result Analysis:** This greedy strategy performs even better, averaging about 79 correct predictions per epoch, close to the 80% ceiling set by the coin's bias. The agent quickly identifies the most frequent outcome and exploits it consistently. However, this approach has a limitation: it does not explore alternative actions once it has found a good one, which could be problematic if the environment changes.
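One common way to keep some exploration is an epsilon-greedy rule, which is not used in this notebook and is sketched here only as an assumption-laden illustration. With probability `epsilon` the agent picks a random action, otherwise it picks the most frequently observed outcome. The function `epoch_eps_greedy`, the parameter `epsilon`, and the array `rl_eps` are hypothetical names.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(seed=100)
ssp = [1, 1, 1, 1, 0]  # biased coin: 80% heads


def epoch_eps_greedy(n, epsilon=0.1):
    tr = 0
    counts = Counter({0: 1, 1: 1})  # one initial pseudo-observation per action
    for _ in range(n):
        if rng.random() < epsilon:
            a = int(rng.choice([0, 1]))  # explore: random action
        else:
            a = counts.most_common(1)[0][0]  # exploit: most frequent outcome
        s = int(rng.choice(ssp))
        if a == s:
            tr += 1
        counts[s] += 1  # learn from the observed outcome
    return tr


rl_eps = np.array([epoch_eps_greedy(100) for _ in range(250)])
print(rl_eps.mean())
```

In a stationary environment like this one, the exploration costs a few correct predictions per epoch compared with the purely greedy agent. The benefit would only show up if the coin's bias changed over time.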

### Rolling a Biased Die

In [26]:
ssp = [1, 2, 3, 4, 4, 4, 4, 4, 5, 6]

In [27]:
asp = [1, 2, 3, 4, 5, 6]

### Experiment 5: Biased Die with Random Actions

Now let's extend our concepts to a more complex scenario - a biased six-sided die:
- `ssp = [1, 2, 3, 4, 4, 4, 4, 4, 5, 6]` means the die is heavily biased toward rolling 4 (50% probability)
- `asp = [1, 2, 3, 4, 5, 6]` means the agent chooses randomly from all six options

This demonstrates the same concepts but with more possible outcomes, making the learning challenge more interesting.

In [28]:
def epoch():
    tr = 0
    for _ in range(600):
        a = rng.choice(asp)
        s = rng.choice(ssp)
        if a == s:
            tr += 1
    return tr

In [29]:
rl = np.array([epoch() for _ in range(250)])
rl[:10]

array([128,  89, 110,  93,  97,  88, 103,  93, 101, 102])

In [30]:
rl.mean()

100.236

**Random Strategy Result:** With six equally likely guesses, the expected success rate is 1/6 (about 16.7%), or roughly 100 correct predictions out of 600 trials, which matches the observed mean of 100.236. The die's bias toward 4 does not change this: a guess drawn uniformly from six values matches the outcome with probability 1/6 whatever the outcome distribution is.
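As a worked check, assume the guess $a$ and the roll $s$ are drawn independently, as in the `epoch()` cell above. The success probability of a uniformly random guess is then

$$
P(a = s) = \sum_{k=1}^{6} P(a = k)\,P(s = k) = \frac{1}{6} \sum_{k=1}^{6} P(s = k) = \frac{1}{6}
$$

which gives an expected $600 \times \tfrac{1}{6} = 100$ correct predictions per epoch, independent of how the die is biased.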

In [31]:
def epoch():
    tr = 0
    asp = [1, 2, 3, 4, 5, 6]
    for _ in range(600):
        a = rng.choice(asp)
        s = rng.choice(ssp)
        if a == s:
            tr += 1
        asp.append(s)
    return tr

### Experiment 6: Biased Die with Simple Learning

Applying the same simple learning strategy to the die scenario:

In [32]:
rl = np.array([epoch() for _ in range(250)])
rl[:10]

array([186, 193, 165, 163, 191, 192, 165, 160, 166, 190])

In [33]:
rl.mean()

177.044
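The mean of about 177 can be compared with a theoretical benchmark. The sketch below assumes that, before trial `t`, the expected share of value `k` in the action list is its initial count plus `t` times its probability, divided by the list length. The helper `expected_hits` and the variables `die_hits` and `coin_hits` are hypothetical names introduced for this check.

```python
from collections import Counter


def expected_hits(ssp, asp0, n_trials):
    """Expected number of correct predictions for the append-and-sample agent."""
    p = {k: c / len(ssp) for k, c in Counter(ssp).items()}  # outcome probabilities
    c0 = Counter(asp0)  # initial counts in the action list
    m = len(asp0)
    total = 0.0
    for t in range(n_trials):
        # P(hit at trial t) = sum_k P(s = k) * expected share of k in the action list
        total += sum(p_k * (c0[k] + p_k * t) / (m + t) for k, p_k in p.items())
    return total


die_hits = expected_hits([1, 2, 3, 4, 4, 4, 4, 4, 5, 6], [1, 2, 3, 4, 5, 6], 600)
coin_hits = expected_hits([1, 1, 1, 1, 0], [0, 1], 100)
print(round(die_hits, 2), round(coin_hits, 2))
```

For the die this gives roughly 176 expected hits out of 600, in line with the simulated mean. The long-run rate is $0.5^2 + 5 \times 0.1^2 = 0.30$, well above the 1/6 of random guessing but below the 50% that always predicting 4 would achieve.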

In [34]:
def epoch():
    tr = 0
    asp = [1, 2, 3, 4, 5, 6]
    for _ in range(600):
        c = Counter(asp)
        a = c.most_common()[0][0]
        s = rng.choice(ssp)
        if a == s:
            tr += 1
        asp.append(s)
    return tr

### Experiment 7: Biased Die with Greedy Strategy

Finally, let's apply the greedy strategy (always choosing the most common observed outcome) to the die scenario:

In [35]:
rl = np.array([epoch() for _ in range(250)])
rl[:10]

array([274, 303, 314, 309, 298, 298, 301, 305, 296, 293])

In [36]:
rl.mean()

298.9

## Summary of Results

This notebook demonstrated fundamental concepts of reinforcement learning through simple experiments:

### Key Insights:

1. **Random vs. Learning**: Random action selection performs at baseline levels regardless of environmental patterns
2. **Adaptation**: Simple learning mechanisms can significantly improve performance by adapting to environmental biases
3. **Exploitation vs. Exploration**: Greedy strategies excel at exploiting known patterns but may miss better alternatives
4. **Scalability**: The same learning principles apply whether dealing with simple (2-outcome) or complex (6-outcome) environments

### Reinforcement Learning Connections:

- **Agent**: The prediction mechanism that chooses actions
- **Environment**: The coin/die that provides states and rewards
- **Action Space**: The possible predictions (heads/tails or 1-6)
- **State Space**: The possible outcomes from the environment
- **Reward**: Success when prediction matches outcome
- **Policy**: The strategy for choosing actions (random, learned, greedy)
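The mapping above can be made concrete with a minimal agent-environment loop. This is a sketch under stated assumptions, not the notebook's implementation: the classes `CoinEnvironment` and `GreedyAgent` and their methods `step`, `act`, and `learn` are hypothetical names, and the policy shown is the greedy, frequency-based one from the coin experiment.

```python
from collections import Counter

import numpy as np


class CoinEnvironment:
    """Environment: draws a state and rewards a matching prediction."""

    def __init__(self, ssp, seed=100):
        self.ssp = ssp
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        state = int(self.rng.choice(self.ssp))
        reward = 1 if action == state else 0
        return state, reward


class GreedyAgent:
    """Agent: predicts the most frequently observed outcome."""

    def __init__(self, asp):
        self.counts = Counter(asp)

    def act(self):
        # Policy: greedy choice based on observed frequencies
        return self.counts.most_common(1)[0][0]

    def learn(self, state):
        self.counts[state] += 1


env = CoinEnvironment([1, 1, 1, 1, 0])
agent = GreedyAgent([0, 1])

total_reward = 0
for _ in range(100):
    action = agent.act()
    state, reward = env.step(action)
    agent.learn(state)
    total_reward += reward

print(total_reward)
```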

### Next Steps:

These concepts form the foundation for more sophisticated RL algorithms like Q-learning, policy gradients, and actor-critic methods that we'll explore in subsequent notebooks.

In [37]:
cm = 10 ** 40
print(f'{cm:,}')

10,000,000,000,000,000,000,000,000,000,000,000,000,000

