# Chapter 09: Advanced Transfer Learning Options

Chapters 07 and 08 introduced the two main transfer learning methods in vangja: `tune_method="parametric"` and `tune_method="prior_from_idata"`. Both methods take the posterior from a source model and use it — either summarized or in full — as the prior for a target model.

In practice, however, the default behavior may not always be optimal. Seasonal peaks may not align perfectly between source and target series. The transferred priors may allow the short series model to overfit by developing seasonal effects with unrealistically large amplitude. Or the default summary statistics (mean and standard deviation) may not be the best description of a skewed posterior.

Vangja exposes several advanced parameters that give practitioners fine-grained control over the transfer learning process. This chapter documents these options:

1. **Regularization via `loss_factor_for_tune`** — preventing seasonal amplitude blow-up and trend drift
2. **Phase alignment via `shift_for_tune`** — learning a time shift to align seasonal peaks
3. **Custom summary statistics via `override_...` parameters** — using the mode or other statistics instead of the mean

> **Note**: These features are more **experimental** than the core `parametric` and `prior_from_idata` methods. They were introduced during the research behind the paper *"Long Horizons from Short Histories: A Bayesian Transfer Learning Framework for Forecasting Time Series"* (Krajevski & Tojtovska Ribarski, 2026) and have shown promising results on stock market data. However, they add hyperparameters that require careful tuning. Use them with caution and always validate on held-out data.

---

## 1. Regularization with `loss_factor_for_tune`

### The Problem: Overfitting Short Series

When transferring seasonality from a long time series to a short one, the short series model can still overfit. Even with informed priors, the inference procedure (MAP optimization or MCMC sampling) may push the Fourier coefficients to values that produce **excessively large seasonal effects**, fitting noise in the short training window rather than the true seasonal pattern.

Similarly, when transferring the trend slope, the short series model may drift the slope away from the value learned on the long series, especially if the short training window happens to coincide with an atypical period.

### The Solution: PyMC Potentials as Soft Constraints

Vangja addresses this with the `loss_factor_for_tune` parameter, available on both `FourierSeasonality` and `LinearTrend`. Under the hood, this adds a [PyMC Potential](https://www.pymc.io/projects/docs/en/v5.7.2/api/generated/pymc.Potential.html) — an arbitrary term added to the log-posterior — that penalizes deviations from what was learned on the source model.
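Schematically, and assuming the usual decomposition of the log-posterior, a potential enters as one extra additive term (the symbol $\psi$ is introduced here purely for illustration and is not vangja notation):

$$\log p(\boldsymbol{\theta} \mid \mathbf{y}) = \log p(\mathbf{y} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) + \psi(\boldsymbol{\theta}) + \text{const}$$

A negative $\psi(\boldsymbol{\theta})$ lowers the log-posterior at parameter values that stray from the source, so both MAP optimization and MCMC sampling are steered away from them.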

### How It Works for `FourierSeasonality`

For seasonal components, the regularization prevents the model from learning seasonal effects with **greater amplitude** than those observed in the source model. The mechanism is:

1. Create a set of time points $T_j$ spanning one full period $p_j$ (e.g., 365 days for yearly seasonality)
2. Compute the seasonal effects using the **source model's** point estimates of the coefficients: $\mathbf{fs}_{\text{source}} = \mathbf{F} \cdot \boldsymbol{\beta}_{\text{source}}^{MAP}$
3. Compute the seasonal effects using the **target model's** current coefficients: $\mathbf{fs}_{\text{target}} = \mathbf{F} \cdot \boldsymbol{\beta}_{\text{target}}$
4. Add a penalty that activates only when the target's seasonal amplitude **exceeds** the source's:

$$\text{penalty} = \phi_{\boldsymbol{\theta}} \cdot \lambda \cdot \min\left(0, \|\mathbf{fs}_{\text{source}}\|_2^2 - \|\mathbf{fs}_{\text{target}}\|_2^2\right)$$

where:
- $\phi_{\boldsymbol{\theta}}$ is the `loss_factor_for_tune` hyperparameter
- $\lambda$ is an automatic scaling factor: $\lambda = \frac{2 \cdot p}{n}$ if the period $p$ is longer than twice the number of training data points $n$, and $0$ otherwise. This means the regularization only activates for seasonal components whose period is too long to be reliably estimated from the short series alone.

**Key insight**: the penalty is **one-sided**. Seasonal effects smaller than the source's incur no penalty, while larger effects are penalized. This makes physical sense: if the source model learned a yearly amplitude of $\pm 20°F$ from temperature data, we don't want the bike sales model to develop a yearly seasonality with a larger amplitude than the source data supports.
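The one-sided behavior can be checked with a small standalone sketch of the formula above. This is not vangja's implementation: the helper names (`fourier_row`, `seasonal_effects`, `seasonal_penalty`), the interleaved sine/cosine ordering, and the 365-point grid are all assumptions made for illustration.

```python
import math


def fourier_row(t, period, series_order):
    """Fourier features [sin, cos] for harmonics 1..series_order at time t.

    The interleaved ordering is an assumption of this sketch.
    """
    row = []
    for k in range(1, series_order + 1):
        angle = 2.0 * math.pi * k * t / period
        row.extend([math.sin(angle), math.cos(angle)])
    return row


def seasonal_effects(beta, period, series_order, n_points=365):
    """Seasonal effect F @ beta on a grid spanning one full period."""
    grid = [period * i / n_points for i in range(n_points)]
    return [
        sum(f * b for f, b in zip(fourier_row(t, period, series_order), beta))
        for t in grid
    ]


def seasonal_penalty(beta_source, beta_target, period, n_train, phi, series_order):
    """Zero unless the target's squared norm exceeds the source's."""
    # Scaling factor as described in the text: active only when period > 2 * n_train
    lam = 2.0 * period / n_train if period > 2 * n_train else 0.0
    norm_source = sum(v * v for v in seasonal_effects(beta_source, period, series_order))
    norm_target = sum(v * v for v in seasonal_effects(beta_target, period, series_order))
    return phi * lam * min(0.0, norm_source - norm_target)


beta_source = [1.0, 0.5]                    # one harmonic: sine and cosine coefficients
smaller = [0.5 * b for b in beta_source]    # half the source amplitude
larger = [2.0 * b for b in beta_source]     # twice the source amplitude

kwargs = dict(period=365.25, n_train=90, phi=1.0, series_order=1)
penalty_smaller = seasonal_penalty(beta_source, smaller, **kwargs)   # no penalty
penalty_larger = seasonal_penalty(beta_source, larger, **kwargs)     # negative
# With 400 training points the period is no longer "too long", so lambda is 0
penalty_long_history = seasonal_penalty(
    beta_source, larger, period=365.25, n_train=400, phi=1.0, series_order=1
)
```

Only the `larger` case with a short history produces a non-zero (negative) contribution to the log-posterior; the other two cases leave it unchanged.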

### How It Works for `LinearTrend`

For the trend slope, the regularization is simpler — a squared deviation penalty:

$$\text{penalty} = -\phi_{\mathbf{w}} \cdot (w_{\text{target}} - w_{\text{source}}^{MAP})^2$$

This keeps the target model's slope close to what was learned on the source model. The larger $\phi_{\mathbf{w}}$, the more tightly the slope is constrained.

Interestingly, **negative values** of $\phi_{\mathbf{w}}$ can also be useful: they *encourage* the slope to deviate from the source, which may be appropriate when the source and target series have opposing trends.
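For $\phi_{\mathbf{w}} > 0$ the penalty equals, up to an additive constant, the log-density of a Normal distribution centered on the source slope with standard deviation $1 / \sqrt{2\phi_{\mathbf{w}}}$, so it behaves like an additional Gaussian prior on the slope. The sketch below evaluates the formula directly; the function names are hypothetical and the numbers are made up for illustration.

```python
import math


def slope_penalty(w_target, w_source, phi):
    """Quadratic term added to the log-posterior for the trend slope."""
    return -phi * (w_target - w_source) ** 2


def equivalent_prior_sd(phi):
    """Std. dev. of the Normal whose log-density matches the penalty (phi > 0)."""
    return 1.0 / math.sqrt(2.0 * phi)


w_source = 0.30
at_source = slope_penalty(0.30, w_source, phi=1.0)    # no deviation, no penalty
drifted = slope_penalty(0.80, w_source, phi=1.0)      # deviation of 0.5 is penalized
rewarded = slope_penalty(0.80, w_source, phi=-1.0)    # negative phi flips the sign
sd_for_phi_one = equivalent_prior_sd(1.0)             # about 0.707
```

With a negative factor the reward grows without bound as the slope moves away from the source, so the prior and the likelihood have to outweigh it for the fit to remain well defined.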

### Usage

```python
from vangja import LinearTrend, FourierSeasonality

model = (
    LinearTrend(
        tune_method="parametric",
        loss_factor_for_tune=1,      # Regularize the slope transfer
    )
    + FourierSeasonality(
        period=365.25,
        series_order=6,
        tune_method="parametric",
        loss_factor_for_tune=1,      # Regularize yearly seasonality amplitude
    )
)

model.fit(short_data, method="mapx", idata=source_trace, t_scale_params=source_t_scale_params)
```

### Guidance from the Paper

In the experiments from the paper, the best combined model (hierarchical + transfer learning) achieved its best results **without** regularization ($\phi = 0$). The regularization potentials had a noticeable effect when using transfer learning *without* hierarchical modeling, where `loss_factor_for_tune=1` improved results.

The takeaway: regularization is most useful when there is no hierarchical structure providing its own regularization via shrinkage. When combining transfer learning with partial pooling, the shrinkage mechanism already prevents overfitting, making the potentials less necessary.

---

## 2. Phase Alignment with `shift_for_tune`

### The Problem: Misaligned Seasonal Peaks

Transfer learning assumes that the seasonal pattern from the source series has the same **phase** (timing of peaks and troughs) as the target series. But this is not always the case:

- Temperature peaks in mid-July, but ice cream sales might peak in early August (lagged demand)
- A stock index's yearly seasonality might be shifted by a few weeks compared to an individual stock
- Monthly billing cycles may cause seasonal peaks to shift between different business units

When the seasonal peaks are misaligned, directly transferring Fourier coefficients produces a seasonal pattern that is correct in **shape** but wrong in **timing**.

### The Solution: Learning a Shift Parameter

The `shift_for_tune` parameter on `FourierSeasonality` tells vangja to introduce a **learnable time shift** (in days) when computing the Fourier basis functions. Instead of:

$$x(t) = \sin\left(\frac{2\pi n t}{P}\right), \cos\left(\frac{2\pi n t}{P}\right)$$

the model computes:

$$x(t) = \sin\left(\frac{2\pi n (t + \Delta t)}{P}\right), \cos\left(\frac{2\pi n (t + \Delta t)}{P}\right)$$

where $n$ is the harmonic index, $P$ is the period, and $\Delta t$ is a new parameter that the model learns during fitting. This allows the transferred seasonal shape to slide along the time axis to find the best alignment with the target data.
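The effect of $\Delta t$ on the timing of a peak can be seen in a minimal standalone sketch. It evaluates the shifted basis from the formula above for a single cosine term; the helper names and the interleaved coefficient layout are assumptions of the sketch, not vangja internals.

```python
import math


def seasonal_value(t, beta, period, shift=0.0):
    """Evaluate a Fourier seasonal curve at time t with an optional time shift."""
    total = 0.0
    for k in range(len(beta) // 2):
        angle = 2.0 * math.pi * (k + 1) * (t + shift) / period
        total += beta[2 * k] * math.sin(angle) + beta[2 * k + 1] * math.cos(angle)
    return total


def peak_day(beta, period, shift=0.0):
    """Day on an integer grid over one period where the curve is largest."""
    return max(range(int(period)), key=lambda t: seasonal_value(t, beta, period, shift))


period = 365.0
beta = [0.0, 1.0]   # pure cosine: peaks at t = 0 when unshifted

peak_unshifted = peak_day(beta, period)             # day 0
peak_shifted = peak_day(beta, period, shift=-14)    # day 14
```

Under the $(t + \Delta t)$ convention, a negative shift moves the peak later in time and a positive shift moves it earlier, while the shape of the curve is unchanged.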

### Usage

```python
from vangja import FlatTrend, FourierSeasonality

model = (
    FlatTrend()
    + FourierSeasonality(
        period=365.25,
        series_order=6,
        tune_method="parametric",
        shift_for_tune=True,      # Learn a phase shift
    )
)

model.fit(short_data, method="mapx", idata=source_trace, t_scale_params=source_t_scale_params)
```

After fitting, the learned shift is stored in the model's trace under the key `fs_{idx} - shift` and is automatically applied during prediction.

### When to Use

Use `shift_for_tune=True` when:
- You suspect the source and target series have similar seasonal **shape** but different **timing**
- The seasonal peaks in the target data consistently appear shifted relative to the source
- You have enough data in the short series to estimate a single shift parameter reliably

Avoid it when:
- The seasonal patterns are truly identical in phase (adding an unnecessary parameter)
- The short series is extremely short (fewer data points than the number of other parameters)
- You are using `prior_from_idata`, which already captures the full covariance structure

---

## 3. Custom Summary Statistics with `override_...` Parameters

### The Problem: The Mean Is Not Always the Best Summary

When using `tune_method="parametric"`, vangja extracts the **mean** and **standard deviation** of each parameter's posterior from the source model. These become the location and scale of the new Normal prior:

$$\beta_i^{\text{new}} \sim \text{Normal}\left(\mathbb{E}[\beta_i | C],\; \text{Std}[\beta_i | C]\right)$$

But the posterior is almost certainly **not Gaussian**. It may be skewed, heavy-tailed, or multimodal. In such cases, the posterior mean might not even be a point of high probability density. A better choice could be the **mode** of each parameter's marginal posterior, the single most probable value for that parameter:

$$\beta_i^{\text{new}} \sim \text{Normal}\left(\underset{\beta_i}{\arg\max}\; P(\beta_i | C),\; \text{Std}[\beta_i | C]\right)$$

The paper showed that centering priors around the mode rather than the mean can improve results, particularly for the trend slope, because the mode is the point of highest posterior density.
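To make the difference concrete, the sketch below computes the mean, standard deviation, and an approximate mode for a deliberately skewed sample. The sample is a deterministic stand-in (quantiles of a log-normal distribution), not output from a fitted vangja model, and the hand-rolled kernel density estimate with a normal-reference bandwidth is just one simple way to locate the mode.

```python
import math
from statistics import NormalDist, fmean, stdev

# Deterministic stand-in for posterior draws: quantiles of a right-skewed
# log-normal distribution (illustrative only).
n = 500
samples = [math.exp(0.5 * NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]


def kde_mode(values, n_grid=200):
    """Approximate the mode as the peak of a Gaussian kernel density estimate."""
    # Normal-reference rule-of-thumb bandwidth
    bandwidth = 1.06 * stdev(values) * len(values) ** (-0.2)
    lo, hi = min(values), max(values)
    grid = [lo + (hi - lo) * j / (n_grid - 1) for j in range(n_grid)]

    def density(x):
        # Unnormalized density; the normalizing constant does not move the peak
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return max(grid, key=density)


post_mean = fmean(samples)
post_std = stdev(samples)
post_mode = kde_mode(samples)

default_prior = (post_mean, post_std)        # location/scale used by default
mode_centred_prior = (post_mode, post_std)   # same scale, location moved to the mode
```

For this right-skewed sample the mode sits below the mean, so the two priors share a scale but are centered at different values.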

### The Override Parameters

Both `FourierSeasonality` and `LinearTrend` expose `override_...` parameters that let you inject custom values for the prior location and scale:

**`FourierSeasonality`:**
- `override_beta_mean_for_tune`: Replace the posterior mean of the Fourier coefficients with custom values (e.g., the posterior mode)
- `override_beta_sd_for_tune`: Replace the posterior standard deviation with custom values

**`LinearTrend`:**
- `override_slope_mean_for_tune`: Replace the posterior mean of the slope
- `override_slope_sd_for_tune`: Replace the posterior standard deviation of the slope
- `override_delta_loc_for_tune`: Replace the posterior mean (location) of the changepoint deltas
- `override_delta_scale_for_tune`: Replace the posterior scale of the changepoint deltas

### Example: Using the Posterior Mode

```python
import numpy as np
from scipy import stats

# Fit the source model first
source_model.fit(long_data, method="nuts")

# Extract the posterior samples for the slope
slope_samples = source_model.trace["posterior"]["lt_0 - slope"].values.flatten()

# Compute the mode using kernel density estimation
kde = stats.gaussian_kde(slope_samples)
x_grid = np.linspace(slope_samples.min(), slope_samples.max(), 1000)
slope_mode = x_grid[np.argmax(kde(x_grid))]

# Compute the standard deviation (still use the full posterior for spread)
slope_std = slope_samples.std()

# Create the target model with overridden statistics
from vangja import LinearTrend, FourierSeasonality

target_model = (
    LinearTrend(
        tune_method="parametric",
        override_slope_mean_for_tune=slope_mode,    # Use mode instead of mean
        override_slope_sd_for_tune=slope_std,
    )
    + FourierSeasonality(period=365.25, series_order=6, tune_method="parametric")
)

target_model.fit(
    short_data,
    method="mapx",
    idata=source_model.trace,
    t_scale_params=source_model.t_scale_params,
)
```

### When the Mode Differs from the Mean

Consider a posterior that is left-skewed (long tail toward lower values). The mean is pulled toward the tail, while the mode sits at the peak of the distribution. Centering the new prior around the mode gives the target model a stronger starting point — it begins at the most probable parameter value rather than a tail-influenced average.

This distinction is especially relevant for:
- **Trend slope**: Where the posterior may be skewed due to changepoint interactions
- **Changepoint deltas**: Where the Laplace prior produces heavy-tailed posteriors with modes near zero

### Using the Override for Different Fourier Coefficients

You can also selectively override individual coefficients. For example, if you want to use the mode for the first few harmonics (which carry the most energy) but the mean for higher-order terms:

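```python
import numpy as np
from scipy import stats
from vangja import FourierSeasonality

# Posterior draws of the 12 yearly coefficients (2 * series_order), one row per draw
beta_samples = source_model.trace["posterior"]["fs_0 - beta(p=365.25,n=6)"].values.reshape(-1, 12)


def kde_mode(samples, n_grid=1000):
    """Approximate the mode of 1-D samples as the peak of a Gaussian KDE."""
    x_grid = np.linspace(samples.min(), samples.max(), n_grid)
    return x_grid[np.argmax(stats.gaussian_kde(samples)(x_grid))]


# Compute the mode and the mean for each coefficient
beta_modes = np.array([kde_mode(beta_samples[:, i]) for i in range(beta_samples.shape[1])])
beta_means = beta_samples.mean(axis=0)

# Mode for the first n_mode entries, mean for the rest. The value 4 is illustrative;
# check how vangja orders sine and cosine terms before mapping entries to harmonics.
n_mode = 4
beta_loc = np.where(np.arange(beta_samples.shape[1]) < n_mode, beta_modes, beta_means)

model = FourierSeasonality(
    period=365.25,
    series_order=6,
    tune_method="parametric",
    override_beta_mean_for_tune=beta_loc,
)
```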

---

## Summary and Caveats

The three advanced features discussed in this chapter provide fine-grained control over the transfer learning process:

| Feature | Parameter | Purpose | Adds Hyperparameters? |
|---------|-----------|---------|----------------------|
| **Regularization** | `loss_factor_for_tune` | Prevent seasonal amplitude blow-up and trend drift | Yes ($\phi$) |
| **Phase alignment** | `shift_for_tune` | Align seasonal peaks between source and target | No (adds a learned parameter $\Delta t$) |
| **Custom statistics** | `override_..._for_tune` | Use mode or other statistics instead of mean | No (replaces defaults) |

### Experimental Status

These features should be considered **experimental**. While they were validated in the paper's experiments on stock market data (443 stocks, 730 time windows), they introduce additional complexity and hyperparameters:

- **`loss_factor_for_tune`** adds a hyperparameter ($\phi$) that must be tuned. The paper found that only `0` and `1` were reliably useful values, and that the feature was less necessary when hierarchical modeling with partial pooling was also used.
- **`shift_for_tune`** adds a learned parameter, which increases model complexity. On very short series, this additional degree of freedom may not be identifiable.
- **`override_...` parameters** require the user to manually compute alternative statistics (like the mode via KDE), adding workflow complexity.

### Practical Recommendations

1. **Start simple**: Use `tune_method="parametric"` or `tune_method="prior_from_idata"` with default settings first. These are well-tested and work well in most scenarios.
2. **Add regularization** if you observe the target model producing unrealistically large seasonal effects or trend slopes that diverge from the source. Try `loss_factor_for_tune=1` as a starting point.
3. **Try the mode** if the source model's posterior is visibly skewed (check with `az.plot_posterior()`). This is a low-risk change that often helps.
4. **Use `shift_for_tune`** only if domain knowledge suggests a phase mismatch. Validate by checking whether the learned shift is physically reasonable (e.g., a 2-week shift makes sense, a 6-month shift suggests a deeper modeling issue).
5. **Prefer hierarchical modeling** over manual regularization when possible. Partial pooling with `pool_type="partial"` provides a principled form of regularization that adapts to the data.
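When judging whether a learned shift is plausible (recommendation 4), it helps to remember that the Fourier basis is periodic: shifts that differ by a whole period produce the same seasonal curve. A small helper such as the hypothetical `wrap_shift` below maps a shift into the range $[-P/2, P/2)$ before it is inspected. It assumes the shift is expressed in the same units as the period.

```python
def wrap_shift(shift, period):
    """Map a time shift to the equivalent value in [-period / 2, period / 2)."""
    half = period / 2.0
    return (shift + half) % period - half


small_shift = wrap_shift(380.0, 365.25)     # 14.75: about two weeks
large_shift = wrap_shift(-200.0, 365.25)    # 165.25: close to half a year
```

A wrapped value of a couple of weeks is easy to justify, whereas one close to half a period suggests the source and target patterns are nearly opposite in phase.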

### Further Reading

- Krajevski & Tojtovska Ribarski (2026): *Long Horizons from Short Histories* — The full paper with experimental results on stock market data