# Lecture 6: Master Google's Most Powerful Reinforcement Learning Algorithm in One Lesson

## 1. What Is On-Policy and What Is Off-Policy?

This is one of the most fundamental concepts in Reinforcement Learning. The distinction is all about **what experience the agent learns from**.

It comes down to this:
* **Target Policy ($\pi$):** The policy the agent is trying to learn and improve. This is the "final" optimal policy it wants to find.
* **Behavior Policy ($\mu$):** The policy the agent *actually uses* to explore the environment and generate experience (i.e., select actions).

---

## 1. On-Policy Learning

### Core Idea

In On-Policy learning, the **Target Policy** and the **Behavior Policy** are the **same**.

The agent learns by acting with its *current* policy and then updates that *same* policy based on the outcomes.

### Analogy

**"Learning on the job."**

Imagine you are learning to be a chef. You try to cook a dish (your current policy), you see what happens (you burn it), and you update your cooking strategy (update your policy). You are learning directly from your *own* current actions and mistakes.

### How It Works

1.  Use the current policy $\pi$ to select an action $a$ in state $s$.
2.  Observe the reward $r$ and next state $s'$.
3.  Use this single piece of experience $(s, a, r, s')$ to update the policy $\pi$.
4.  **Crucially, you must then *discard* this experience.** Why? Because the experience was generated by the "old" policy. Once you update $\pi$ to $\pi'$, any experience from $\pi$ is now "stale" and no longer relevant.
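The steps above can be sketched with tabular SARSA, the first on-policy algorithm listed in this section. This is a minimal illustration, not a reference implementation: the state names, the action names, and the values of `alpha`, `gamma`, and `epsilon` are all assumptions made for the example.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: the target uses a_next, the action that the
    current policy actually chose in s_next."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Hypothetical two-state example.
actions = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "right")] = 1.0

rng = random.Random(0)
a_next = epsilon_greedy(Q, "s1", actions, epsilon=0.0, rng=rng)  # same policy picks a_next
sarsa_update(Q, "s0", "right", r=0.0, s_next="s1", a_next=a_next)
print(Q[("s0", "right")])
```

The transition is consumed once and not stored anywhere, which is the "discard" step in practice.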

### Key Characteristics

* **Pros:**
    * Simpler to implement and understand.
    * Often more stable in practice; policy-gradient methods, for example, typically converge to a (local) optimum given suitably small step sizes.
* **Cons:**
    * **Extremely sample-inefficient.** You throw away data after every single update. This is a massive problem in real-world applications (like robotics) where collecting experience is slow and expensive.
    * It struggles with exploration. The policy has to be "stochastic" (e.g., $\epsilon$-greedy) to try new things, but because it's *also* the target, it's torn between exploring and exploiting.

**Example Algorithms:**
* **SARSA** (State-Action-Reward-State-Action)
* **Policy Gradient** (e.g., REINFORCE, A2C/A3C)

---

## 2. Off-Policy Learning

### Core Idea

In Off-Policy learning, the **Target Policy** and the **Behavior Policy** are **different**.

The agent learns about the optimal policy ($\pi$) while following a *different* behavior policy ($\mu$) to explore the environment.

### Analogy

**"Learning from a textbook" or "Watching a demo."**

You want to learn how to be a *master* chef (the optimal target policy). You could learn this by reading a book written by a master chef, or by watching videos of *other* people (good and bad) trying to cook (the behavior policy). You are learning about the *best* way to act, regardless of the actions you (or others) are actually taking.

### How It Works

1.  You have two policies:
    * **Target Policy $\pi$:** This is the policy you want to optimize (e.g., a purely greedy policy).
    * **Behavior Policy $\mu$:** This is an exploratory policy you use to collect data (e.g., an $\epsilon$-greedy policy that takes random actions 10% of the time).
2.  Use the behavior policy $\mu$ to act in the world and collect experience $(s, a, r, s')$.
3.  Store this experience in a large **Replay Buffer**.
4.  To update the target policy $\pi$, you randomly sample a *batch* of old experiences from the Replay Buffer.
5.  This update corrects for the fact that the data was collected by a different policy (e.g., using Importance Sampling, or the Q-learning $\max$ operator).
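A minimal sketch of steps 2 to 5, assuming a tabular Q-function and a tiny hand-made transition. The `ReplayBuffer` class, its method names, and the hyperparameters are hypothetical choices for illustration; deep-RL libraries differ in the details.

```python
import random
from collections import defaultdict, deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng):
        """Uniformly sample up to batch_size stored transitions."""
        return rng.sample(list(self.data), min(batch_size, len(self.data)))

def q_learning_batch_update(Q, batch, actions, alpha=0.1, gamma=0.9):
    """Q-learning update on a batch; the max makes the target policy greedy."""
    for s, a, r, s_next, done in batch:
        best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

actions = ["left", "right"]
Q = defaultdict(float)
buffer = ReplayBuffer(capacity=1000)
buffer.add("s0", "right", 1.0, "s1", True)   # collected once by the behavior policy

rng = random.Random(0)
for _ in range(3):                            # the same transition is reused 3 times
    q_learning_batch_update(Q, buffer.sample(32, rng), actions)
print(Q[("s0", "right")])
```

The single stored transition drives three separate updates, which is where the sample-efficiency gain comes from.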

### Key Characteristics

* **Pros:**
    * **Much more sample-efficient.** The Replay Buffer allows the agent to reuse a single piece of experience for many updates. This is its biggest advantage.
    * Better exploration. The behavior policy can be dedicated to exploring (e.g., be very random) while the target policy can focus on being optimal (being greedy).
* **Cons:**
    * More complex to implement.
    * Can be less stable and have higher variance (though modern algorithms have solutions for this).

**Example Algorithms:**
* **Q-Learning** (and DQN)
* **DDPG**, **SAC** (Deep Deterministic Policy Gradient, Soft Actor-Critic)

---

### Why Q-Learning is Off-Policy (A Common Interview Question)

Q-learning is the classic example of an off-policy algorithm. Look at its update rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$$

* **Behavior Policy:** The agent is in state $s$ and uses an $\epsilon$-greedy policy to choose the *actual* action $a$ it will take.
* **Target Policy:** When it updates the Q-value, it looks at the *next* state $s'$ and asks, "What is the value of the *best possible action* from here?" This is the **$\max_{a'}$** part.
* The policy it is learning about (the target, "max") is a greedy, optimal policy. But the policy it is *using* to get the data (the behavior, "$\epsilon$-greedy") is different. This is the definition of off-policy.
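To make the contrast concrete, the sketch below computes the Q-learning target and the SARSA target for the same transition. The Q-values, the reward, and the explored action are made-up numbers for illustration only.

```python
gamma = 0.9
r = 0.0

# Hypothetical Q-values for the next state s'.
q_next = {"left": 2.0, "right": 5.0}

# Suppose the epsilon-greedy behavior policy happened to explore and took "left" in s'.
a_next_taken = "left"

# Q-learning (off-policy): evaluate the greedy target policy via the max.
q_learning_target = r + gamma * max(q_next.values())

# SARSA (on-policy): evaluate the action the behavior policy actually took.
sarsa_target = r + gamma * q_next[a_next_taken]

print(q_learning_target, sarsa_target)
```

The Q-learning target ignores which action was actually taken next, so the data can come from any behavior policy.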

---

### Comparison Table

| Feature | On-Policy | Off-Policy |
| :--- | :--- | :--- |
| **Policy Relationship** | Target Policy = Behavior Policy | Target Policy $\neq$ Behavior Policy |
| **Who Generates Data?** | The policy being learned ($\pi$). | A separate behavior policy ($\mu$). |
| **Use of Old Data?** | **No.** Data is discarded after one use. | **Yes.** Uses a Replay Buffer. |
| **Sample Efficiency** | Very Low (Inefficient) | Very High (Efficient) |
| **Key Example** | **SARSA**, REINFORCE | **Q-Learning**, DQN |

## 2. What Kinds of Problems Can Be Framed as Reinforcement Learning?
A problem can be defined as a Reinforcement Learning (RL) problem if it can be framed as a **goal-oriented sequential decision-making process under uncertainty**.

This means the problem must have a few key components and characteristics:

---

### The Essential Components

A problem is an RL problem if you can clearly identify:

1.  **An Agent:** The learner or decision-maker. This is the "brain" you are trying to train.
    * *Example: The AI player in a chess game.*

2.  **An Environment:** The external world that the agent interacts with. It's everything outside the agent.
    * *Example: The chess board and the rules of chess.*

3.  **A State ($s$):** A snapshot of the environment at a specific moment. It's the information the agent receives to make a decision.
    * *Example: The exact position of all pieces on the chess board.*

4.  **A set of Actions ($a$):** The possible moves or choices the agent can make in a given state.
    * *Example: All legal moves a player can make from the current board state.*

5.  **A Reward Signal ($r$):** A scalar feedback value (positive, negative, or zero) that the environment provides after the agent takes an action. This signal is the *only* evaluation of the agent's performance.
    * *Example: A +1 for winning the game, -1 for losing, and 0 for every other move.*

---

### The Core Characteristics

Beyond just having these components, the problem *must* have these properties:

1.  **Sequential Decision-Making:** The agent must make a *sequence* of decisions over time. The problem is not a single "one-and-done" choice.

2.  **A Clear Goal (Maximize Cumulative Reward):** The agent's sole objective is to choose actions that maximize the **total accumulated reward** over the long run (the "return"), not just the next immediate reward.

3.  **Trial-and-Error Learning:** The agent is not given a "handbook" of correct answers. It must *discover* the best actions by trying them out (exploration) and then favoring the ones that lead to good outcomes (exploitation).

4.  **Delayed Rewards:** This is a hallmark of RL. The reward for a good (or bad) action may not come immediately.
    * *Example:* In chess, sacrificing a pawn (a small negative reward, or 0) might be the key to checkmate 20 moves later (a large positive reward). The agent must learn to connect its actions to these delayed consequences.

5.  **The Agent's Actions Influence Its Future:** The actions the agent takes must affect the future states it will encounter. If the agent's actions have no bearing on what happens next, it's not a true RL control problem.

---

### In Summary: RL vs. Other Machine Learning

* **Not Supervised Learning:** You are *not* given a dataset of (State, Correct Action) pairs. You are only given (State, Action, Reward) tuples, and the reward is just an *evaluation* (e.g., "that was bad"), not an *instruction* (e.g., "you should have done this instead").
* **Not Unsupervised Learning:** You are *not* just looking for hidden patterns in data. You have a clear, active goal: to maximize a reward signal.

**Simple Test:**
If you can describe your problem using this loop, it's an RL problem:
1.  The **Agent** sees the **State**.
2.  The **Agent** takes an **Action**.
3.  The **Environment** gives a **Reward** and a new **State**.
4.  Repeat... with the goal of getting the most total **Reward** possible.

This entire framework is formally known as a **Markov Decision Process (MDP)**.
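The loop in the simple test can be written down directly. The sketch below uses a hypothetical toy environment (`CorridorEnv`, a five-cell corridor with a reward at the far end); the class, its `reset`/`step` methods, and the reward values are assumptions chosen for illustration.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to cell `goal`."""
    def __init__(self, goal=4):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to [0, goal]
        self.pos = max(0, min(self.goal, self.pos + action))
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

def run_episode(env, policy, max_steps=100):
    state = env.reset()                          # 1. the agent sees the state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # 2. the agent takes an action
        state, reward, done = env.step(action)   # 3. reward and new state
        total_reward += reward                   # 4. accumulate and repeat
        if done:
            break
    return total_reward

rng = random.Random(0)
random_policy = lambda state: rng.choice([-1, 1])
always_right = lambda state: 1

print(run_episode(CorridorEnv(), always_right))
print(run_episode(CorridorEnv(), random_policy))
```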

## 3. What Is the Difference Between Q-Learning and Policy Gradient?
The fundamental difference is **what they learn** and **how they choose an action**.

* **Q-Learning** is **Value-Based**. It learns the *value* of taking an action in a state.
* **Policy Gradient** is **Policy-Based**. It learns the *policy* (the probability of taking an action) directly.

---

## 1. Q-Learning (Value-Based)

### The Core Idea: Learn a Value Map 🗺️

Q-Learning's goal is to learn a function called the **Q-function** (or "Quality" function), $Q(s, a)$.

This function tells the agent the **expected future reward (the "value")** of taking action $a$ in state $s$ and then following the optimal policy forever after.

$$Q(s, a) \approx \text{Expected total future reward from state } s \text{ if we take action } a$$

### How It Chooses an Action (Implicit Policy)

The policy in Q-Learning is **implicit**. It is *derived* from the Q-values.

Once the agent has learned an accurate Q-function, the optimal policy is simply to pick the action with the highest Q-value in any given state.

**Policy:** $\pi(s) = \arg\max_a Q(s, a)$ (This is a "greedy" policy).
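As a small illustration, the greedy policy is just an `argmax` over a row of the Q-table. The table below is hypothetical, with states as rows and actions as columns.

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
Q = np.array([
    [0.1, 0.5, 0.2],   # state 0
    [0.7, 0.3, 0.0],   # state 1
])

def greedy_policy(Q, state):
    """Return the index of the action with the highest Q-value in `state`."""
    return int(np.argmax(Q[state]))

policy = [greedy_policy(Q, s) for s in range(Q.shape[0])]
print(policy)
```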

### Analogy: The Restaurant Critic

Think of Q-Learning as a **restaurant critic** who is building a giant guidebook.

* The critic's job is to assign a *star rating* ($Q$-value) to every possible *dish* (action) at every *restaurant* (state).
* To decide what to eat, you just look at the guidebook for your current restaurant and **pick the dish with the highest rating**.
* The critic *doesn't* tell you what to pick; they just give you the values, and your policy is to pick the best one.

---

## 2. Policy Gradient (Policy-Based)

### The Core Idea: Learn a Set of Reflexes 🎯

Policy Gradient (PG) doesn't learn values. It *directly* learns the policy function, $\pi(a|s)$.

This function is a **probability distribution**. It tells the agent, "Given your current state $s$, here is the *probability* you should take each possible action $a$."

**Policy:** $\pi(a|s) = P(\text{Action } a | \text{ State } s)$

### How It Chooses an Action (Explicit Policy)

The policy is **explicit**. The agent's "brain" *is* the policy.

* To choose an action, the agent just inputs its state $s$ into its policy network and **samples an action** from the resulting probability distribution.
* The algorithm then uses "gradient ascent" to *increase* the probability of actions that led to high rewards and *decrease* the probability of actions that led to low rewards.
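A rough sketch of both bullets, assuming the simplest possible setting: one state, three actions, and a softmax policy over learnable logits `theta`, updated with the REINFORCE rule. The reward vector, learning rate, and number of steps are made up for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, action, ret, lr=0.1):
    """One gradient-ascent step on ret * log pi(action)."""
    probs = softmax(theta)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0       # gradient of log-softmax: one_hot(action) - probs
    return theta + lr * ret * grad_log_pi

rng = np.random.default_rng(0)
theta = np.zeros(3)                          # logits for 3 actions in a single state
true_rewards = np.array([0.0, 1.0, 0.0])     # hypothetical: only action 1 pays off

for _ in range(500):
    probs = softmax(theta)
    action = rng.choice(3, p=probs)          # sample an action from the policy
    theta = reinforce_step(theta, action, true_rewards[action])

print(softmax(theta))
```

Each time action 1 is sampled and rewarded, its logit rises and the others fall, so the policy gradually concentrates on it.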

### Analogy: The Master Chef

Think of Policy Gradient as training a **master chef**.

* You don't give the chef a guidebook of ratings. Instead, you train their *instincts* and *reflexes* (the policy).
* When you give the chef ingredients (state), they *directly* decide what to do (action) based on their training.
* If a dish comes out great (high reward), you "reinforce" the instincts that led to it. If it's bad, you "discourage" those instincts.

---

## Key Differences Summary

| Feature | Q-Learning (e.g., DQN) | Policy Gradient (e.g., REINFORCE) |
| :--- | :--- | :--- |
| **What it Learns** | **Value Function** $Q(s, a)$ (How good is this action?) | **Policy Function** $\pi(a|s)$ (What is the chance of this action?) |
| **Policy** | **Implicit** (Derived from Q-values, e.g., greedy). | **Explicit** (The network *is* the policy). |
| **Action Space** | Best for **Discrete** (e.g., Up, Down, Left, Right) because of the $\max_a$ operation. | Best for **Continuous** (e.g., turn wheel 15.7°) because it can output a probability distribution. |
| **Sample Efficiency** | **Off-Policy** (can use a Replay Buffer). Very sample efficient. | **On-Policy** (typically). Discards experience after one update. Very sample *in*efficient. |
| **Stability** | Can be unstable (e.g., with function approximation), but Replay Buffers help. | More stable, with smoother updates (but can get stuck in local optima). |
| **Final Policy** | Deterministic (always picks the best action). | Stochastic (picks actions based on probability). |

## 4. What Is the Difference Between Reward and Value?
This is a critical distinction in reinforcement learning. The simplest way to think about it is:

* **Reward** is **immediate** feedback.
* **Value** is **long-term**, *predicted* feedback.

---

### 1. Reward (R)

A **Reward** is a single, scalar number that the **environment** gives to the **agent** at the *end* of a time step (after the agent takes an action $a$ in state $s$).

* **Timescale:** **Short-term / Immediate.**
* **Source:** It is *given* by the environment. It is a fundamental part of the problem's definition.
* **Purpose:** It defines the "goal" for the agent. It tells the agent what is "good" or "bad" *right now*.
* **Analogy:**
    * Getting a piece of candy (positive reward).
    * Getting an electric shock (negative reward).
    * The points you get for eating a pellet in Pac-Man.
* **Example (Chess):**
    * Action: Make a move.
    * **Reward:** You get `+1` *only* on the final move that wins the game. You get `-1` *only* on the final move that loses. You get `0` for every single other move in the entire game.

### 2. Value (V or Q)

A **Value** (or "Value Function") is a **prediction** of the **total future reward** the **agent** *expects* to receive, starting from a given state.

* **Timescale:** **Long-term / Predictive.**
* **Source:** It is *learned* and *estimated* by the agent. It is the *solution* to the problem, not part of the problem itself.
* **Purpose:** It tells the agent what is "good" or "bad" *in the long run*. This is what the agent actually uses to make decisions.
* **Analogy:**
    * Your current bank account balance + all your *expected* future income (a long-term value).
    * Your overall "health" (a long-term state).
* **Example (Chess):**
    * The **Reward** for 99% of moves is `0`. This is not useful for deciding which move is good.
    * The **Value** of a board state (e.g., $V(s)$) is the *expected total future reward* from that state. With the `+1`/`-1`/`0` rewards above and no discounting, this equals $P(\text{win}) - P(\text{loss})$, so it rises and falls with the **probability of winning** from that state.
    * A move that traps the opponent's queen (Reward = `0`) leads to a *new state* with a much higher **Value** (higher win probability).

---

### The Relationship: How They Work Together

The agent's entire job is to learn the **Value Function** so it can make good decisions.

It learns the value function by using the **Rewards** it receives.

The **Value** of a state is the *(discounted) sum of all future rewards* the agent expects to get. A high-value state is one that will lead to many high rewards in the future.

**Value is the prediction; Reward is the data used to make that prediction.**

The agent uses the immediate, short-term **Rewards** to update its estimate of the long-term **Value**. This is the core of Reinforcement Learning, defined by the Bellman equation:

> The **Value** of a state today = (The **Reward** I get immediately) + (The discounted **Value** of the state I end up in tomorrow)
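The informal statement above can be written out as follows. This is a sketch in the notation used earlier ($r$ for the immediate reward, $s'$ for the next state, $\gamma$ for the discount factor); the time subscripts and the superscript $\pi$ marking the policy being followed are standard conventions assumed here rather than taken from the text.

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right] = \mathbb{E}_{\pi}\left[ r + \gamma V^{\pi}(s') \mid s \right]$$

The first line defines the return as a discounted sum of rewards; the second says the value is the expected return, which splits into the immediate reward plus the discounted value of the next state.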

---

### Summary Table

| Feature | Reward (R) | Value (V or Q) |
| :--- | :--- | :--- |
| **Timescale** | **Immediate** (Short-term) | **Predictive** (Long-term) |
| **Source** | Provided by the **Environment** | Learned & estimated by the **Agent** |
| **What it answers** | "Was that action good *right now*?" | "Is this state (or state-action) *ultimately* good?" |
| **Role in RL** | The *objective signal* to be maximized. | The *prediction* used to make optimal decisions. |

## 5. Why Can Recommendation Be Viewed as a Reinforcement Learning Problem?
Viewing recommendation as a reinforcement learning (RL) problem is a powerful, modern approach. It's a natural fit because a user's interaction with a platform is not a single event, but a **sequential decision-making process**.

The core idea is to shift the objective from **"predicting the next click"** (a static, supervised learning problem) to **"learning a policy that maximizes long-term user engagement"** (a dynamic, RL problem).

Here is how the problem is framed in RL terms, followed by *why* this is so much more powerful.

---

### 1. Mapping Recommendation to the RL Framework

First, we must be able to define the problem using the core components of RL (Agent, Environment, State, Action, Reward).

* **Agent:** The recommendation system (the algorithm or "policy") that is making the decisions.
* **Environment:** The user (and the platform, e.g., the app or website) that the agent interacts with.
* **State ($s$):** A representation of the user and their context *at this moment*. This includes:
    * User's historical data (watch history, purchase history, past ratings).
    * Current context (time of day, device, user's current mood or "session").
    * The last few items the user has seen or interacted with.
* **Action ($a$):** The decision the agent makes. This is the **item (or list of items) to recommend** to the user *right now*.
    * *Example: Showing a specific video, product, or news article.*
* **Reward ($r$):** The feedback from the user, which measures the "goodness" of the action. This can be:
    * **Positive Reward:** Click, long watch time, "like", purchase, add to cart.
    * **Zero or Negative Reward:** Skip, scroll past, "dislike", exit the app.

---

### 2. Why This is More Powerful Than Traditional Methods

Simply mapping the components isn't enough. The *reason* RL is a better fit is because it captures dynamics that traditional methods (like collaborative filtering or content-based filtering) ignore.

#### 1. It Optimizes for Long-Term Rewards (Engagement)

This is the **most important reason**.

* **Traditional (Supervised) Problem:** "Predict the Click-Through-Rate (CTR)." This optimizes for an **immediate, short-term reward** (the click). This can lead to "clickbait" recommendations. The system learns to show items that are "clickable" but not necessarily "satisfying."
* **RL Problem:** "Maximize the *cumulative* reward over the user's entire session (or lifetime)." This is the **long-term reward**.
* **Example:**
    * A "clickbait" video might get a +1 (for the click) but then an immediate "exit," leading to a total session reward of `+1`.
    * A documentary recommendation might be *skipped* (reward `0`), but the *next* recommendation (another documentary) is clicked and watched for an hour (reward `+60`).
    * The RL agent can learn to sacrifice the immediate reward (risk the skip) to build a user state that leads to a much higher total reward (long-term engagement).
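The arithmetic in this example can be checked with a few lines. The helper `discounted_return` and the choice of discount factors are assumptions for illustration; the reward sequences are the ones from the example above.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum rewards from the end backwards: G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

clickbait_session = [1.0]           # click, then the user exits
documentary_session = [0.0, 60.0]   # skip, then a 60-minute watch

for gamma in (1.0, 0.9):
    print(gamma,
          discounted_return(clickbait_session, gamma),
          discounted_return(documentary_session, gamma))
```

A model that scores only the first step prefers the clickbait item; one that scores the whole session prefers the documentary path, even with discounting.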

#### 2. It's a Sequential Problem (Actions Affect States)

In a recommendation setting, the agent's actions *change* the environment (the user's state).

1.  **State 1:** User has watched comedy.
2.  **Action 1:** Agent recommends a specific *action* movie.
3.  User clicks and watches it.
4.  **State 2:** User has now watched comedy *and* an action movie.

The user's "state" is now different, directly because of the agent's action. The user's interests may have temporarily (or permanently) shifted. Traditional models treat every recommendation as a separate, independent prediction. RL is designed to handle this exact loop, where `Action` -> `New State` -> `New Action`.

#### 3. It Naturally Handles the Exploration vs. Exploitation Trade-off

This is a classic RL problem that is a perfect fit for recommendations.

* **Exploitation:** Show the user what you *know* they like. If they always watch "Marvel," show them another "Marvel" movie. This is safe and gives a predictable, positive reward.
* **Exploration:** Show the user something *new* and *different* (e.g., an indie film). This is risky—the user might skip it (zero reward). But it's the *only* way to **discover new user interests** and gather new data. If the user loves it, you've "unlocked" a new, high-reward area for future recommendations.

If a system *only* exploits, it creates a "filter bubble" and the user gets bored. If it *only* explores, it gives too many bad recommendations. RL algorithms (like Q-learning and Policy Gradients) come with built-in mechanisms for managing this trade-off, such as $\epsilon$-greedy action selection or sampling from a stochastic policy.
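One simple way to balance the two is $\epsilon$-greedy selection over per-item reward estimates. The sketch below treats recommendation as a two-item bandit, which ignores user state entirely; the item names, click probabilities, and `epsilon` value are hypothetical.

```python
import random

def epsilon_greedy_recommend(estimates, epsilon, rng):
    """Explore a random item with probability epsilon, else exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.choice(list(estimates))
    return max(estimates, key=estimates.get)

def update_estimate(estimates, counts, item, reward):
    """Incremental running mean of the rewards observed for an item."""
    counts[item] += 1
    estimates[item] += (reward - estimates[item]) / counts[item]

# Hypothetical click probabilities, hidden from the agent.
true_click_rate = {"marvel": 0.6, "indie": 0.8}
estimates = {item: 0.0 for item in true_click_rate}
counts = {item: 0 for item in true_click_rate}

rng = random.Random(0)
for _ in range(2000):
    item = epsilon_greedy_recommend(estimates, epsilon=0.1, rng=rng)
    reward = 1.0 if rng.random() < true_click_rate[item] else 0.0
    update_estimate(estimates, counts, item, reward)

print(counts, estimates)
```

Without the exploration branch the agent would never try the indie item often enough to learn how well it performs.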

### Summary

In short, framing recommendation as an RL problem shifts the goal from **"predicting a click"** (Supervised Learning) to **"learning a policy that creates a satisfying and engaging long-term user experience"** (Reinforcement Learning).

# Lecture 7: Handling Ever-Changing Machine Learning System Design and A/B Tests, Using Two-Sigma Time-Series Forecasting as an Example

## 1. How to Preprocess Time-Series Data
Here is a comprehensive breakdown of how to preprocess time-series data, structured for a machine learning system design discussion.

---

## Preprocessing for Time-Series Data

Preprocessing time-series data is fundamentally different from preprocessing i.i.d. (independent and identically distributed) data, like images or user profiles. The central challenge is that **order matters**. Every preprocessing step must respect the temporal sequence to prevent data leakage and to correctly model temporal dependencies.

Here is a 7-step process for robustly preprocessing time-series data.

### 1. Handling the Time Index (The Foundation)

This is the most critical first step. Your model needs to understand time.

* **Parse Dates:** Convert your timestamp column (e.g., a string) into a `datetime` object.
* **Set as Index:** Set this `datetime` column as the DataFrame index. This makes all subsequent temporal operations (like resampling) much easier.
* **Check Uniformity:** Verify that the time intervals are regular (e.g., every 1 minute, every 1 hour).
* **Resampling (if irregular):** If data is irregular (e.g., stock trades, user events), you must resample it to a fixed frequency.
    * **Downsampling:** (e.g., 1-minute data -> 1-hour data). You must aggregate. Common methods:
        * `mean()`: Average value over the hour.
        * `sum()`: Total value over the hour (e.g., sales).
        * `last()`: The last recorded value.
        * **OHLC:** For finance (like Two-Sigma), you'd take the `Open`, `High`, `Low`, and `Close` values within that interval.
    * **Upsampling:** (e.g., 1-hour data -> 1-minute data). This creates gaps, which leads to the next step.
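A short pandas sketch of these steps, assuming a hypothetical table with a string `timestamp` column and a `price` column. The data values and the 60-minute target frequency are made up for illustration.

```python
import pandas as pd

# Hypothetical irregular tick data with string timestamps.
raw = pd.DataFrame({
    "timestamp": ["2024-01-01 09:00:10", "2024-01-01 09:20:45",
                  "2024-01-01 09:59:30", "2024-01-01 10:15:00"],
    "price": [100.0, 102.0, 101.0, 105.0],
})

raw["timestamp"] = pd.to_datetime(raw["timestamp"])   # parse dates
df = raw.set_index("timestamp").sort_index()          # set as (sorted) index

# Downsample the irregular ticks to a fixed 60-minute grid.
hourly_mean = df["price"].resample("60min").mean()
hourly_ohlc = df["price"].resample("60min").ohlc()    # open / high / low / close

print(hourly_mean)
print(hourly_ohlc)
```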

### 2. Handling Missing Values (Imputation)

You cannot simply drop a row, as this would break the time sequence. You also cannot fill with the *global* mean, as that leaks future information and ignores the current trend.

* **Forward Fill (`ffill`):** Fills a `NaN` with the *last known value*. This is the most common and "safest" method. It assumes the state hasn't changed. This is a good default.
* **Backward Fill (`bfill`):** Fills a `NaN` with the *next known value*. This can cause lookahead bias and is generally less common, but useful in some contexts.
* **Linear Interpolation:** Draws a straight line between the two known points surrounding the gap. This is a good choice if the data is generally smooth and non-volatile. Because it uses the *next* known value, it carries the same lookahead risk as backward fill.
* **Rolling Mean Imputation:** Uses the mean of the last $k$ data points (e.g., the last 7 days) to fill the missing value. This is more adaptive than a global mean.
* **Seasonal Imputation:** For highly seasonal data (e.g., retail sales), you might fill a missing Monday's value with the average of the previous 4 Mondays.
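The first few methods can be compared side by side on a toy series. The values below are hypothetical, and the rolling variant is one possible reading of "mean of the last $k$ data points": it averages up to three previously observed values and never looks ahead.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0], index=idx)

filled_ffill = s.ffill()                          # carry the last known value forward
filled_linear = s.interpolate(method="linear")    # straight line between known neighbors

# Mean of the observed values among the previous 3 time steps (shifted, so no lookahead),
# applied only where the original series is missing.
rolling_mean = s.shift(1).rolling(window=3, min_periods=1).mean()
filled_rolling = s.fillna(rolling_mean)

print(pd.DataFrame({"ffill": filled_ffill,
                    "linear": filled_linear,
                    "rolling": filled_rolling}))
```

Forward fill and the shifted rolling mean only use past values, while linear interpolation also needs the next known point, so it is not safe for features that must be available in real time.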

### 3. Feature Engineering (Creating Predictors)

This is where you create the signals for your model. The raw value $y_t$ is rarely enough.

* **Lag Features:** This is the most important feature. The value at a previous time step $t-k$ is used to predict the value at time $t$. You can create multiple lags (e.g., $t-1$, $t-2$, $t-12$ for monthly data).
* **Rolling Window Features:** These capture recent trends and volatility.
    * **Rolling Mean:** The average of the last $k$ periods (e.g., 7-day moving average). This smooths out noise.
    * **Rolling Std. Deviation:** The standard deviation of the last $k$ periods. This is a key measure of **volatility**.
    * Other aggregates like `min`, `max`, `sum` over the window.
* **Date/Time Features:** Extract information from the `datetime` index itself. This is critical for capturing cycles.
    * Hour of day
    * Day of week
    * Month of year
    * Quarter
    * `is_weekend` (binary flag)
    * `is_holiday` (binary flag)
* **Interaction Features:** Combine features, e.g., `day_of_week * hour_of_day`, to capture complex patterns (like "lunch rush on a Friday").
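A compact sketch of lag, rolling-window, and date features in pandas. The column names and the toy daily series are assumptions; the key detail is the `shift(1)` before `rolling`, so the window at time $t$ only contains values up to $t-1$.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=10, freq="D")   # 2024-01-01 is a Monday
df = pd.DataFrame({"y": [float(v) for v in range(1, 11)]}, index=idx)

# Lag features
df["lag_1"] = df["y"].shift(1)
df["lag_7"] = df["y"].shift(7)

# Rolling features over the previous 3 days (shifted to exclude the current value)
df["roll_mean_3"] = df["y"].shift(1).rolling(window=3).mean()
df["roll_std_3"] = df["y"].shift(1).rolling(window=3).std()

# Date/time features taken from the index
df["day_of_week"] = df.index.dayofweek                     # Monday=0 ... Sunday=6
df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)

features = df.dropna()   # rows without a full history are dropped
print(features)
```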

### 4. Making the Data Stationary

**Stationarity** means the statistical properties of the series (like mean and variance) do not change over time.
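* **Why?** Classical models (like ARMA) *assume* it, and ARIMA reaches it by differencing the series internally. Even when it is not strictly required, many models (including NNs) perform better on a stationary series because its statistics are the same at training and prediction time.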
* **How to achieve it:**
    * **Differencing (Detrending):** The most common method. Instead of predicting $y_t$, you predict $y'_t = y_t - y_{t-1}$. This removes a linear trend.
    * **Seasonal Differencing:** $y'_t = y_t - y_{t-12}$ (for monthly data with a yearly cycle).
    * **Log Transform:** Use $\log(y_t)$. This is very effective if the variance *grows* with the mean (e.g., exponential growth). It stabilizes the variance.
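The effect of these transforms is easy to see on a toy series. The sketch below assumes a hypothetical series that grows by 10% per step: plain differencing leaves growing increments, while differencing the log gives a roughly constant value.

```python
import numpy as np
import pandas as pd

y = pd.Series([100.0, 110.0, 121.0, 133.1, 146.41])   # hypothetical 10% growth per step

first_diff = y.diff().dropna()      # y_t - y_{t-1}: still grows with the level
log_y = np.log(y)                   # stabilizes the variance
log_diff = log_y.diff().dropna()    # log(y_t) - log(y_{t-1}): roughly constant

# For monthly data with a yearly cycle, seasonal differencing would be y.diff(12).
print(first_diff.tolist())
print(log_diff.tolist())
```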

### 5. Scaling and Normalization

Many models (like Neural Networks and SVMs) require features to be on a similar scale.

* **StandardScaler:** (Z-score normalization). Subtracts the mean and divides by the standard deviation.
* **MinMaxScaler:** Scales the data to be between a fixed range (e.g., [0, 1]).

**CRITICAL PITFALL:** You **must not** fit your scaler on the entire dataset. This would cause **data leakage**, as you'd be using the mean and std. dev. from the future (test set) to scale the past (train set).

**Correct Procedure:**
1.  Split your data into train and test sets (see step 7, "Splitting the Data").
2.  Fit the scaler **only** on the **training data** (`scaler.fit(train_data)`).
3.  Use that *same* fitted scaler to `transform` both the training and test data.
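A minimal sketch of the correct procedure with scikit-learn's `StandardScaler`. The toy array and the 8/2 chronological split are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.arange(10, dtype=float).reshape(-1, 1)   # toy series, already in time order
train, test = values[:8], values[8:]                 # chronological split, no shuffling

scaler = StandardScaler()
scaler.fit(train)                        # statistics come from the training window only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)     # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

The scaled test values fall outside the range seen in training here, which is expected for a trending series and is exactly the information a leaked scaler would have hidden.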

### 6. Handling Outliers and Noise

* **Smoothing:** Applying a rolling mean (as a feature or a preprocessing step) can help reduce the impact of random noise.
* **Clipping/Capping:** You can cap values at a certain percentile (e.g., 1st and 99th) to remove extreme, unphysical values.
* **System Design Discussion:** In finance (like Two-Sigma), a market crash is **not an outlier**; it's a critical event to be modeled. In sensor data, a spike to 1,000,000 is clearly an error. You must use domain knowledge to distinguish between *errors* and *rare events*.
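Percentile clipping can be sketched as below, assuming hypothetical sensor readings with one obvious glitch. The bounds are computed from the training data only, and whether clipping is appropriate at all depends on the domain judgment described above.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=20.0, scale=2.0, size=1000)   # hypothetical sensor readings
train[10] = 1_000_000.0                              # an obvious sensor glitch

low, high = np.percentile(train, [1, 99])            # bounds from training data only
clipped = np.clip(train, low, high)                  # cap values outside [low, high]

print(low, high, clipped.max())
```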

### 7. Splitting the Data (Train/Validation/Test)

This is the final and most important preprocessing step.

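**CRITICAL PITFALL:** You **cannot** use `sklearn.model_selection.train_test_split` with its default settings. By default (`shuffle=True`) it shuffles the rows randomly, which destroys the temporal order and lets the model train on data from the future of its test points, so the evaluation becomes unrealistically optimistic.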

**Correct Procedure (Time-Based Split):**
Your validation *must* simulate the future.

* **Simple Split:** Train on all data from 2010-2018. Validate on 2019. Test on 2020.
* **Time-Series Cross-Validation (Backtesting):** This is the gold standard, especially in finance.
    * **Expanding Window:**
        * Fold 1: Train on [Year 1], Test on [Year 2]
        * Fold 2: Train on [Year 1, 2], Test on [Year 3]
        * Fold 3: Train on [Year 1, 2, 3], Test on [Year 4]
    * **Sliding Window:**
        * Fold 1: Train on [Year 1, 2], Test on [Year 3]
        * Fold 2: Train on [Year 2, 3], Test on [Year 4]
        * Fold 3: Train on [Year 3, 4], Test on [Year 5]
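Both schemes can be generated with a few lines of plain Python. The function names and parameters below are hypothetical; indices 0 to 4 stand in for Year 1 to Year 5, and the printed folds reproduce the layouts listed above.

```python
def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices); the training set grows each fold."""
    first_test_start = n_samples - n_folds * test_size
    for k in range(n_folds):
        test_start = first_test_start + k * test_size
        yield (list(range(0, test_start)),
               list(range(test_start, test_start + test_size)))

def sliding_window_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices); the training window has fixed length."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        yield (list(range(start, train_end)),
               list(range(train_end, train_end + test_size)))
        start += test_size

for train_idx, test_idx in expanding_window_splits(n_samples=4, n_folds=3, test_size=1):
    print("expanding", train_idx, test_idx)

for train_idx, test_idx in sliding_window_splits(n_samples=5, train_size=2, test_size=1):
    print("sliding", train_idx, test_idx)
```

In every fold the test indices come strictly after the training indices. scikit-learn's `TimeSeriesSplit` offers a similar expanding split, with `max_train_size` to cap the window.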