# Time Series: Introduction, Generalization, and Domain Shift

**Course:** Deep Learning for Time Series  
**Instructor:** Eva Dyer  

**Goals:**  
- Understand what makes time series special
- Define generalization in time series
- Introduce the concepts of *domain* and *domain shift*
- Formalize different types of distribution shift (covariate, label, concept shift)
- Build time series examples of each shift type
- Develop a first-pass taxonomy of adaptation methods

In [None]:
# Run this to initialize
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

np.random.seed(0)

## 1. What is a time series?

- A sequence of observations indexed by time: $(x_1, x_2, \ldots, x_T)$, where $x_t \in \mathbb{R}^d$.
  - If $x_t$ are regularly sampled, we call it a *regular time series*. This is the most common case. Examples:
    - Physiological signals (EEG, ECG, EMG)
    - Wearables (IMU, accelerometers, step counts)
    - Financial series (prices, volumes)
    - Environmental data (weather, air quality, traffic)
  - Otherwise, we call it an *irregular time series*. Examples:
    - Event sequences (anomalies, natural disasters)
    - Neural activity (spike trains)
    - Longitudinal studies (health records)

- Tasks:
  - **Forecasting:** predict future values
  - **Classification:** assign labels to a sequence (e.g., arrhythmia vs normal)
  - **Segmentation:** detect events or state transitions in time
  - **Anomaly detection:** identify unusual behaviours

<center>
<img src="https://raw.githubusercontent.com/nerdslab/cis7000-dl4ts-sp26/refs/heads/notebooks/figures/ts-tasks.png" width="800">
</center>

- Another useful characterization of tasks is whether they map the time series to a single label (sequence-level prediction) or another sequence (dense / frame-level prediction):

<center>
<img src="https://raw.githubusercontent.com/nerdslab/cis7000-dl4ts-sp26/refs/heads/notebooks/figures/seq-frame-level.png" width="600">
</center>

### 1.1 Autoregressive time series

A basic model for regular time series is the autoregressive $AR(p)$ model:
$$
x_t = \sum_{i=1}^p \beta_i x_{t-i} + \varepsilon_t
$$
This model assumes that the time series was generated as a process where every time point depends linearly on the sliding window of $p$ observations before it (plus some noise).
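
To make the $AR(p)$ definition concrete, the sketch below simulates an AR(2) series and recovers its coefficients by ordinary least squares on the lagged values. The helper names `simulate_ar` and `fit_ar_least_squares`, the coefficients `[0.6, -0.3]`, and the series length are illustrative assumptions, not part of the course code.

```python
import numpy as np

def simulate_ar(beta, T=2000, sigma=1.0, seed=0):
    """Simulate x[t] = sum_i beta[i-1] * x[t-i] + eps[t] with Gaussian noise."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    p = len(beta)
    x = np.zeros(T)
    eps = rng.normal(0.0, sigma, T)
    for t in range(p, T):
        # x[t-p:t][::-1] is the window (x[t-1], x[t-2], ..., x[t-p])
        x[t] = beta @ x[t - p:t][::-1] + eps[t]
    return x

def fit_ar_least_squares(x, p):
    """Estimate AR(p) coefficients by regressing x[t] on its p previous values."""
    T = len(x)
    # Column i-1 holds x[t-i] for t = p, ..., T-1
    lags = np.column_stack([x[p - i:T - i] for i in range(1, p + 1)])
    target = x[p:]
    beta_hat, *_ = np.linalg.lstsq(lags, target, rcond=None)
    return beta_hat

true_beta = [0.6, -0.3]  # assumed coefficients (a stationary AR(2))
x_ar2 = simulate_ar(true_beta)
beta_hat = fit_ar_least_squares(x_ar2, p=2)
print("true:", true_beta, "estimated:", np.round(beta_hat, 2))
```

With a long enough series, the estimated coefficients should land close to the true ones; shorter series give noisier estimates.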

In [None]:
# Helper function: Generate AR(1) time series
def generate_ar1(T=200, phi=0.5, sigma=1.0):
    """Generate an AR(1) time series: x[t] = phi * x[t-1] + noise"""
    x = np.zeros(T)
    epsilon = np.random.normal(0, sigma, T)
    for t in range(1, T):
        x[t] = phi * x[t-1] + epsilon[t]
    return x

x = generate_ar1()
plt.figure(figsize=(8, 3))
plt.plot(x)
plt.xlabel("Time")
plt.ylabel("Value")
plt.title("Example AR(1) Time Series")
plt.show()

## 2. Generalization in time series

In standard i.i.d. ML, we talk about train/test splits over *samples*.

For time series, we have at least three axes:

1. **Time:** train on early times, test on later times  
2. **Entities:** train on some patients/devices, test on new ones  
3. **Environments:** train in one context (hospital A), test in another (hospital B)

**Key Questions:**
- What does it mean to "generalize" if the data distribution drifts over time?
- How do we design splits that reflect realistic deployment?
- What happens when training and test data come from different distributions?
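
The split axes above can be sketched with scikit-learn's built-in splitters. The example assumes a hypothetical layout of three entities with 100 time steps each; `X_demo` and `groups` are placeholder names, and the data itself is random noise used only to show which indices end up in train and test.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, TimeSeriesSplit

# Hypothetical layout: 3 entities (e.g., patients) with 100 time steps each,
# stacked row-wise; `groups` records which entity each row belongs to.
n_entities, T_each = 3, 100
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(n_entities * T_each, 1))
groups = np.repeat(np.arange(n_entities), T_each)

# Axis 1 (time): within one entity, every training index precedes every test index.
X_one = X_demo[groups == 0]
time_splits = list(TimeSeriesSplit(n_splits=3).split(X_one))
for train_idx, test_idx in time_splits:
    print(f"time split: train 0..{train_idx.max()}, "
          f"test {test_idx.min()}..{test_idx.max()}")

# Axis 2 (entities): hold out all rows of one entity at a time.
entity_splits = list(LeaveOneGroupOut().split(X_demo, groups=groups))
for train_idx, test_idx in entity_splits:
    print(f"held-out entity: {np.unique(groups[test_idx])}, "
          f"training entities: {np.unique(groups[train_idx])}")
```

The same grouping idea covers the third axis: if `groups` stores an environment ID (such as a hospital) instead of an entity ID, `LeaveOneGroupOut` holds out one environment at a time.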

### 2.1 Example: Two patients, each with AR(1) data with different means

In [None]:
# Simple toy: two patients, each with AR(1) data with different means
def patient_series(T=200, mean=0.0):
    raw_signal = generate_ar1(T)
    # Label an event wherever the underlying (unshifted) signal exceeds 1.0,
    # i.e. one standard deviation of the driving noise (sigma=1.0).
    labels = (raw_signal > 1.0).astype(int)
    observed_data = raw_signal + mean
    return observed_data, labels

T = 200
p1, y1 = patient_series(T, mean=0.0)
p2, y2 = patient_series(T, mean=5.0)

plt.figure(figsize=(8,3))
plt.plot(p1, label="Patient 1")
plt.plot(p2, label="Patient 2")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Toy data: two patients with different means")
plt.tight_layout()
plt.show()

### 2.2 Example: naive vs realistic evaluation

- **Naive split:** randomly shuffle all time points from both patients
  - Training and test sets have mixed patient identities and time periods
  - Overly optimistic performance

- **More realistic:**
  - Train on *Patient 1* + early part of *Patient 2*
  - Test on later part of *Patient 2* or a completely new patient

This demonstrates the challenge of **domain shift** — when training and test data come from different distributions.

In [None]:

# -----------------------------
# Naive Random Split
# -----------------------------
X_all = np.concatenate([p1[:, None], p2[:, None]])
y_all = np.concatenate([y1, y2])

# Shuffle everything together
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(
    X_all, y_all, test_size=0.5, random_state=42, shuffle=True
)

# Fit logistic regression classifier on the raw inputs
clf_naive = LogisticRegression(random_state=42).fit(X_train_n, y_train_n)
print(f"Naive Split Accuracy: {clf_naive.score(X_test_n, y_test_n):.2%}")

# -----------------------------
# Realistic Patient-level Split
# -----------------------------
# Train ONLY on Patient 1
X_train_patient = p1[:, None]
y_train_patient = y1

# Test ONLY on Patient 2
X_test_patient = p2[:, None]
y_test_patient = y2

clf_patient = LogisticRegression(random_state=42).fit(X_train_patient, y_train_patient)
print(f"\nPatient-wise split accuracy: {clf_patient.score(X_test_patient, y_test_patient):.2%}")

**Note:** Design your train/test splits around how the model will be deployed. If you plan to use the model on new users, you must use a subject-wise split to expose distribution shifts. An easy fix here is to standardize each patient's data separately, which removes the per-patient offset.

In [None]:
# Scale Patient 1 (Train)
scaler_p1 = StandardScaler()
X_train_scaled = scaler_p1.fit_transform(p1.reshape(-1, 1))

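# Scale Patient 2 (Test) with its own statistics.
# This is deliberate: it uses only Patient 2's unlabeled inputs, never its labels.
scaler_p2 = StandardScaler()
X_test_scaled = scaler_p2.fit_transform(p2.reshape(-1, 1))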

clf_fixed = LogisticRegression(random_state=42)
clf_fixed.fit(X_train_scaled, y_train_patient)

acc_fixed = clf_fixed.score(X_test_scaled, y_test_patient)
print(f"Patient-wise split accuracy (standardized per patient): {acc_fixed:.2%}")

## 3. Domains and Domain Shift

### 3.1 Basic Definitions

- **Domain:** a distribution over input–label pairs $(X, Y)$, e.g.
  - Hospital A vs Hospital B
  - Device A vs Device B
  - Patient 1 vs Patient 2
  - Pre- vs post-COVID
  - Different time periods

- **Domain shift:** training and test data come from different domains  
  (different marginals or conditionals)
  
  Formally, if $p_{\text{source}}(x, y) \neq p_{\text{target}}(x, y)$, we have domain shift.

- **Domain adaptation:** using labeled data from a *source* domain to perform
  well on a (possibly unlabeled) *target* domain.

- **Time series domain adaptation:** adapting across people, devices, and time periods.

The example above showed a domain shift between Patient 1 and Patient 2. But domain shift can manifest in different ways, which we'll explore next.
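
A shift in the input marginal $p(x)$ can be checked without any labels. The sketch below applies SciPy's two-sample Kolmogorov-Smirnov test to synthetic source and target inputs. It treats the samples as independent draws, which is an assumption that autocorrelated time series violate, so the p-values should be read as a rough diagnostic only. All variable names and the Gaussian data are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
source_x = rng.normal(0.0, 1.0, 500)        # inputs from the source domain
target_same = rng.normal(0.0, 1.0, 500)     # target with the same input distribution
target_shifted = rng.normal(5.0, 1.0, 500)  # target whose mean moved by +5

res_same = ks_2samp(source_x, target_same)
res_shifted = ks_2samp(source_x, target_shifted)

print(f"same distribution:    KS statistic = {res_same.statistic:.3f}")
print(f"shifted distribution: KS statistic = {res_shifted.statistic:.3f}")
```

A test like this only compares $p(x)$. A change in $p(y|x)$ leaves the inputs untouched, so it cannot be detected without at least some target labels.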

## 4. Types of Distribution Shift

To understand how these shifts arise, recall that the joint distribution $p(x, y)$ can be decomposed in two ways using the chain rule of probability:

1. $p(x, y) = p(y|x)p(x)$
2. $p(x, y) = p(x|y)p(y)$

We get different types of distribution shifts by varying one part of these equations (letting it change between training and testing).

Let $p_{\text{train}}(x, y)$ and $p_{\text{test}}(x, y)$ be the joint distributions.

There are three main types of distribution shift:

1. **Covariate shift:** $p_{\text{train}}(x) \neq p_{\text{test}}(x)$, but $p(y|x)$ stays the same.
- Example: same labeling protocol, but sensor calibration changes

- Example: training on photos of dogs and cats, but testing on drawings of dogs and cats

<img src="https://raw.githubusercontent.com/nerdslab/cis7000-dl4ts-sp26/refs/heads/notebooks/figures/covariate-dogscats.png" width="400">

2. **Label shift:** $p_{\text{train}}(y) \neq p_{\text{test}}(y)$, but $p(x|y)$ stays the same.
- Example: class prevalence changes (e.g., more positive cases when testing), but the signal characteristics for each class stay the same

- Example: classifying breeds of dogs by image, training on images from a dog show, but deployed on a dataset of pets

<img src="https://raw.githubusercontent.com/nerdslab/cis7000-dl4ts-sp26/refs/heads/notebooks/figures/label-dogs.png" width="400">

3. **Concept shift (Concept Gap):** $p_{\text{train}}(y|x) \neq p_{\text{test}}(y|x)$.
- Example: different hospitals use different diagnostic thresholds

- Example: training a binary classifier to detect planets in 2005 (one year before Pluto was declared a dwarf planet!), but deploying in 2007

<img src="https://raw.githubusercontent.com/nerdslab/cis7000-dl4ts-sp26/refs/heads/notebooks/figures/concept-pluto.png" width="400">
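
The two factorizations are linked by Bayes' rule, so changing one factor generally changes others. The sketch below uses a hypothetical discrete example (two input values, two classes, made-up probabilities) to show that under label shift, with $p(x|y)$ held fixed, the posterior $p(y|x)$ still moves when $p(y)$ changes.

```python
import numpy as np

# Hypothetical discrete example: x in {low, high}, y in {0, 1}.
# Rows index y, columns index x; each row is p(x | y) and sums to 1.
p_x_given_y = np.array([[0.9, 0.1],   # p(x | y=0)
                        [0.2, 0.8]])  # p(x | y=1)

def posterior(p_y):
    """Return p(y | x) as an array indexed [y, x] for the class prior p_y."""
    joint = p_x_given_y * np.asarray(p_y, dtype=float)[:, None]  # p(x|y) p(y)
    p_x = joint.sum(axis=0)                                      # marginal p(x)
    return joint / p_x

post_train = posterior([0.5, 0.5])  # balanced training prior
post_test = posterior([0.9, 0.1])   # low-prevalence test prior

print(f"p(y=1 | x=high) under train prior: {post_train[1, 1]:.3f}")
print(f"p(y=1 | x=high) under test prior:  {post_test[1, 1]:.3f}")
```

This is why a classifier that models $p(y|x)$ directly can be miscalibrated under label shift even though each class looks the same in both domains.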

## 5. Examples of Distribution Shifts

### 5.1 Concept Shift

**Concept shift** (also called **concept drift** or **concept gap**) occurs when the mapping from inputs to labels ($Y|X$) changes between domains.

To isolate this effect, we will keep the input distribution ($X$) **identical** for both Train and Test. The only thing that changes is the **rule** used to decide the label ($Y|X$).

* **Hospital A (Train):** Any value $> 3$ is treated as a Case (Class 1).
* **Hospital B (Test):** Only values $> 7$ are treated as a Case (Class 1).

Patients with sensor readings between **3 and 7** are labeled **Positive (1)** in the training data, but **Negative (0)** in the test data. This creates a "Concept Gap" where the model's learned rules directly contradict the new environment's reality.

In [None]:
n_samples = 200

# 1. Generate Input Features (X)
# Both domains have the exact SAME input distribution (0 to 10)
# This ensures the failure isn't due to "Covariate Shift"
X_train = np.random.uniform(0, 10, n_samples).reshape(-1, 1)
X_test  = np.random.uniform(0, 10, n_samples).reshape(-1, 1)

# 2. Define the Concept Shift (Different Labeling Rules)
# Train Domain: Threshold > 3
y_train = (X_train > 3).astype(int).ravel()

# Test Domain: Threshold > 7
y_test = (X_test > 7).astype(int).ravel()


# 3. Visualization
plt.figure(figsize=(10, 6))

# Training Data
plt.subplot(2, 1, 1)
plt.scatter(X_train, y_train, c=y_train, cmap='coolwarm', edgecolors='k', alpha=0.6)
plt.axvline(3, color='green', linestyle='--', linewidth=2, label='Learned Boundary (x > 3)')
plt.title("Training Domain (Hospital A): Threshold > 3")
plt.yticks([0, 1], ['Normal (0)', 'Case (1)'])
plt.legend(loc='center right')

# Test Data
plt.subplot(2, 1, 2)
plt.scatter(X_test, y_test, c=y_test, cmap='coolwarm', edgecolors='k', alpha=0.6)

plt.axvline(3, color='green', linestyle='--', linewidth=2, label='Model Boundary (From Train)')
plt.axvline(7, color='black', linestyle='-', linewidth=2, label='True Test Boundary (x > 7)')

plt.axvspan(3, 7, color='red', alpha=0.2, label='Concept Gap (Model Error)')

plt.title("Test Domain (Hospital B): Threshold > 7")
plt.xlabel("Sensor Value")
plt.yticks([0, 1], ['Normal (0)', 'Case (1)'])
plt.legend(loc='center right')

plt.tight_layout()
plt.show()
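
To put a number on the concept gap, the sketch below actually fits a classifier on Hospital A's rule and scores it under Hospital B's rule. It regenerates its own data with a local random generator so that it runs independently; the names `X_a`, `X_b`, and `clf_concept` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200

# Same input distribution in both hospitals, different labeling thresholds
X_a = rng.uniform(0, 10, n).reshape(-1, 1)
X_b = rng.uniform(0, 10, n).reshape(-1, 1)
y_a = (X_a > 3).astype(int).ravel()  # Hospital A: Case if x > 3
y_b = (X_b > 7).astype(int).ravel()  # Hospital B: Case if x > 7

clf_concept = LogisticRegression().fit(X_a, y_a)
acc_a = clf_concept.score(X_a, y_a)
acc_b = clf_concept.score(X_b, y_b)

# The decision boundary is where the logit w * x + b crosses zero
boundary = -clf_concept.intercept_[0] / clf_concept.coef_[0, 0]

print(f"Learned boundary: x = {boundary:.2f}")
print(f"Accuracy under Hospital A's rule: {acc_a:.2%}")
print(f"Accuracy under Hospital B's rule: {acc_b:.2%}")
```

Because inputs are uniform on 0 to 10, roughly 40% of test points fall between the two thresholds, and those are the points the model gets wrong. No reweighting or rescaling of the inputs can recover this without information about Hospital B's labels.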


### Exercise: Simulating Distribution Shifts

Generate synthetic data to visualize Label Shift and Covariate Shift.

How can we adapt a model to each of these distribution shifts?

### 5.2 Label Shift

**Label shift** occurs when the distribution of the class labels ($Y$) changes, while the relationship between the class and its features ($P(X|Y)$) remains constant.

To isolate this effect, we will keep the feature generation logic ($P(X|Y)$) **identical** for both Train and Test. The only thing that changes is the **proportion** of labels ($Y$).

* **Hospital A (Train):** A balanced scenario where **50%** of patients are Cases (e.g., during a pandemic peak).
* **Hospital B (Test):** A low-prevalence scenario where only **10%** of patients are Cases (e.g., normal operations).

The underlying concept hasn't changed: a "Case" looks exactly the same in both hospitals. However, a model trained on Hospital A learns a strong **prior bias** that the disease is common. When applied to Hospital B, this model will likely over-predict the positive class (producing excessive False Positives) simply because it expects the disease to be far more frequent than it actually is.

In [None]:
n_samples = 500

# 1. Define the Fixed Mechanism P(X|Y)
def generate_patient_symptoms(labels):
    readings = []
    for y in labels:
        if y == 0:
            # Normal: Low sensor readings (mean=3)
            # TODO implement: sample from Gaussian
        else:
            # Case: High sensor readings (mean=8)
            # This "symptom" definition is identical for both Hospital A and B
            # TODO implement: sample from Gaussian
    return np.array(readings).reshape(-1, 1)

# 2. Generate Data with LABEL SHIFT
# Hospital A (Train): "Balanced scenario where 50% of patients are Cases"
y_train = # TODO implement: 50% Normal, 50% Case
X_train = # TODO implement

# Hospital B (Test): "Low-prevalence scenario where only 10% of patients are Cases"
y_test = # TODO implement: 90% Normal, 10% Case
X_test = # TODO implement

# 3. Visualization
plt.figure(figsize=(12, 6))

# Hospital A Plot
plt.subplot(1, 2, 1)
plt.hist(X_train[y_train==0], bins=20, alpha=0.6, color='blue', label='Normal (0)')
plt.hist(X_train[y_train==1], bins=20, alpha=0.6, color='red', label='Case (1)')
plt.title(f"Hospital A (Train)\nBalanced: 50% Cases")
plt.xlabel("Sensor Value")
plt.ylabel("Patient Count")
plt.legend()
plt.grid(True, alpha=0.3)

# Hospital B Plot
plt.subplot(1, 2, 2)
plt.hist(X_test[y_test==0], bins=20, alpha=0.6, color='blue', label='Normal (0)')
plt.hist(X_test[y_test==1], bins=20, alpha=0.6, color='red', label='Case (1)')
plt.title(f"Hospital B (Test)\nLow Prevalence: 10% Cases")
plt.xlabel("Sensor Value")
plt.ylabel("Patient Count")
plt.legend()
plt.grid(True, alpha=0.3)

plt.suptitle("Label Shift: Symptoms P(X|Y) stay fixed, P(Y) changes", y=1.02, fontsize=14)
plt.tight_layout()
plt.show()

**Q: How can we fix this kind of shift?**

**ANS**: TODO
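
One standard correction for label shift is to rescale the classifier's predicted probabilities by the ratio of test to training class priors and renormalize. The sketch below is offered as one possible approach, not as the intended answer to the exercise. It assumes the test prior (90% / 10%) is known, and it uses class means of 3 and 8 with a standard deviation of 2.5, which is an assumption chosen so that the classes overlap; `sample_hospital` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_hospital(n, p_case, sd=2.5):
    """Draw labels with prevalence p_case, then x ~ N(8, sd) for Cases, N(3, sd) otherwise."""
    y = (rng.random(n) < p_case).astype(int)
    x = rng.normal(np.where(y == 1, 8.0, 3.0), sd)
    return x.reshape(-1, 1), y

X_a, y_a = sample_hospital(5000, p_case=0.5)  # Hospital A (train): balanced
X_b, y_b = sample_hospital(5000, p_case=0.1)  # Hospital B (test): 10% Cases

clf_ls = LogisticRegression().fit(X_a, y_a)
acc_uncorrected = clf_ls.score(X_b, y_b)

# Prior correction: p_test(y|x) is proportional to p_train(y|x) * p_test(y) / p_train(y)
prior_train = np.bincount(y_a, minlength=2) / len(y_a)
prior_test = np.array([0.9, 0.1])   # assumed known for this sketch
proba = clf_ls.predict_proba(X_b)   # columns follow clf_ls.classes_ = [0, 1]
adjusted = proba * (prior_test / prior_train)
adjusted /= adjusted.sum(axis=1, keepdims=True)
acc_corrected = (adjusted.argmax(axis=1) == y_b).mean()

print(f"Accuracy without correction: {acc_uncorrected:.2%}")
print(f"Accuracy with prior correction: {acc_corrected:.2%}")
```

In practice the test prior is usually unknown and has to be estimated from unlabeled target data, which is a separate problem this sketch does not address.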

### 5.3 Covariate Shift

**Covariate shift** occurs when the input distribution $p(x)$ changes but the label mapping $p(y|x)$ remains the same.

To isolate this, we will change the **Input Distribution** ($X$) but keep the **Labeling Rule** ($Y|X$) identical.

* **Sensor A (Train):** Readings are centered near **0** for "Normal" (Class 0) and near **1** for "Failure" (Class 1).
* **Sensor B (Test):** The same signals are recorded with a calibration offset of **+1.5**, so readings are centered near **1.5** and **2.5**.

The rule linking the underlying signal to the label is the same for both sensors, and so is the class balance. However, a model trained on Sensor A learns a boundary near 0.5 on the raw readings. When the noise is small, most Sensor B readings from both classes lie above that boundary, so the model predicts "Failure" for nearly everything.

In [None]:
# Parameters
T_cond = 200

# 1. Define Labels (Y)
# The label distribution is fixed (random 0s and 1s)
y_cond = # TODO implement

# 2. Define Covariate Shift (Sensor Drift)
# Train: Values are ~0 and ~1
# P(x|y) -> x = y + noise
x_train_cond = # TODO implement

# Test: Values are shifted by +1.5 -> (~1.5 and ~2.5)
# P(x|y) changes -> x = y + 1.5 + noise
x_test_cond = # TODO implement

# 3. Visualization
plt.figure(figsize=(10, 5))

# Plot Train Data
plt.scatter(range(T_cond), x_train_cond, c=y_cond, cmap='coolwarm', marker='o', alpha=0.6, label="Train Data")

# Plot Test Data
plt.scatter(range(T_cond, 2*T_cond), x_test_cond, c=y_cond, cmap='coolwarm', marker='x', alpha=0.6, label="Test Data")

# The Decision Boundary (Learned from Train)
# Model learns that anything > 0.5 is Class 1
plt.axhline(y=0.5, color='green', linestyle='--', linewidth=2, label=r'$\mathbf{Learned}$ Boundary (x > 0.5)')

# Highlight the Error Zone
plt.axhspan(0.5, 3.0, color='red', alpha=0.1, label='Model predicts "Class 1" here')

plt.title("Covariate Shift")
plt.xlabel("Sample Index")
plt.ylabel("Signal Value")
plt.legend(loc='lower right')

plt.show()

**Q: How can we fix this kind of shift?**

**ANS**: TODO
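
For a pure sensor offset like the one described in this section, one simple remedy is to center each domain by its own input mean, which needs no target labels. The sketch below is one possible approach under stated assumptions, not the intended answer to the exercise: the noise standard deviation of 0.3 is an assumption, both domains are assumed to share the same class balance, and all variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Same labeling mechanism in both domains; the test sensor adds a +1.5 offset
y_src = rng.integers(0, 2, n)
y_tgt = rng.integers(0, 2, n)
x_src = (y_src + rng.normal(0, 0.3, n)).reshape(-1, 1)
x_tgt = (y_tgt + 1.5 + rng.normal(0, 0.3, n)).reshape(-1, 1)

# Baseline: train on raw source readings, test on raw target readings
clf_raw = LogisticRegression().fit(x_src, y_src)
acc_raw = clf_raw.score(x_tgt, y_tgt)

# Fix: subtract each domain's own mean (uses unlabeled inputs only)
x_src_c = x_src - x_src.mean()
x_tgt_c = x_tgt - x_tgt.mean()
clf_centered = LogisticRegression().fit(x_src_c, y_src)
acc_centered = clf_centered.score(x_tgt_c, y_tgt)

print(f"Accuracy on raw target readings: {acc_raw:.2%}")
print(f"Accuracy after per-domain centering: {acc_centered:.2%}")
```

This mirrors the per-patient standardization used in Section 2. It relies on the class balance being similar across domains: if the label proportions also differ, the domain means differ for a second reason and centering alone would misalign the classes.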

## 6. Adaptation Settings and Methods

When faced with domain shift, we have several adaptation strategies. These can be categorized along multiple axes:

### Axis 1: Target Labels?
- **Labeled target:** supervised / semi-supervised domain adaptation
  - We have some labeled examples from the target domain
- **Unlabeled target:** *unsupervised domain adaptation* (UDA)
  - We only have unlabeled examples from the target domain
  - Most common and challenging setting

### Axis 2: Retraining Allowed?
- **Full retraining / joint training:** train on source + target data together
- **Parameter-efficient tuning:** adapters, LoRA, fine-tuning only some layers
- **Test-time adaptation (TTA):** update model using test stream only
  - No access to source data during deployment
  - Must adapt on-the-fly
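
As a very small illustration of adapting from the test stream alone, the sketch below updates input normalization statistics online (Welford's algorithm) as unlabeled test samples arrive, without touching any source data. It adapts only the normalization, not the model weights, so it is a minimal stand-in for the idea rather than a full test-time adaptation method. The class name `RunningStandardizer` and the synthetic stream are assumptions.

```python
import numpy as np

class RunningStandardizer:
    """Online mean/variance tracker (Welford's algorithm) for a scalar stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x):
        # Fall back to unit scale until the variance estimate is usable
        std = np.sqrt(self.m2 / self.n) if self.n > 1 and self.m2 > 0 else 1.0
        return (x - self.mean) / std

rng = np.random.default_rng(0)
stream = rng.normal(1.5, 2.0, 1000)  # hypothetical shifted test stream

norm = RunningStandardizer()
normalized = []
for x_t in stream:
    norm.update(x_t)                        # adapt using the new unlabeled sample
    normalized.append(norm.transform(x_t))  # normalize with current statistics
normalized = np.array(normalized)

print(f"running mean: {norm.mean:.3f}, batch mean: {stream.mean():.3f}")
print(f"mean of last 200 normalized samples: {normalized[-200:].mean():.3f}")
```

The normalized samples could then be passed to a frozen source-trained model. Early samples are normalized with poor estimates, which is one reason streaming methods often use a warm-up period.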

### Axis 3: Multi-domain Structure (for time series)
- **Single source vs multi-source:** adapt from one or many source domains (patients/devices)
- **Single target vs continual / streaming targets:**
  - Adapt to one target domain, or
  - Continuously adapt as new domains arrive over time


## 7. Summary and Key Takeaways

1. **Time series generalization** is challenging due to multiple axes: time, entities, and environments.

2. **Domain shift** occurs when training and test distributions differ:
   - **Covariate shift:** $p(x)$ changes, $p(y|x)$ stays the same
   - **Label shift:** $p(y)$ changes, $p(x|y)$ stays the same  
   - **Concept shift (concept gap):** $p(y|x)$ changes — **typically the most challenging!**

3. **Realistic evaluation** requires splits that respect domain boundaries (e.g., leave-one-patient-out) rather than random shuffling.

4. **Domain adaptation** strategies depend on:
   - Whether target labels are available
   - Whether retraining is allowed
   - Single vs. multi-domain structure

5. **Concept gap** is particularly problematic because the input-label relationship itself changes. Simply correcting for input distribution differences is not enough.