

# Chronos-2 Soft Group Masking Extension - Progress Summary

## The Idea

**Hypothesis**: Hard group masking in Chronos-2 can block useful information when related time series are placed in different groups. Relaxing this constraint via similarity-weighted soft masking could allow beneficial cross-group learning while maintaining group structure.

**Key insight**: Instead of binary masking (attend=1, ignore=0), use continuous weights (0 to 1) based on how similar series are, allowing the model to learn from related series even across group boundaries.

---

## Implementation

Modified Chronos-2 to add soft masking capability (inference-only, no training required):

**Core changes**:
```python
import torch

# 1. Compute similarity between all time series in the batch
similarity_matrix = compute_input_similarity(context, similarity_type="correlation")

# 2. Convert similarity to a soft attention mask
# Within-group (hard_mask = 1): weight = 1.0 (full attention)
# Cross-group (hard_mask = 0): weight = computed similarity (0 to 1)
soft_mask = hard_mask + (1 - hard_mask) * similarity_matrix

# 3. Apply temperature scaling and convert to an additive attention bias
attention_bias = torch.log(soft_mask) * temperature
```

**Modified files**: `model.py` (soft mask construction), `pipeline.py` (similarity computation and parameter passing)
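
The three steps above can be traced end to end on a tiny example. The following is a minimal sketch in plain Python, not the code in `model.py`: the function name `soft_attention_bias`, the `eps` clamp, and the toy similarity values are assumptions made for illustration.

```python
import math

def soft_attention_bias(group_ids, similarity, temperature=5.0, eps=1e-9):
    """Build an additive attention bias from group ids and a similarity matrix.

    group_ids:  one group label per series
    similarity: square list of lists, assumed to already lie in [0, 1]
    """
    n = len(group_ids)
    bias = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Hard mask: 1 for pairs in the same group, 0 otherwise.
            hard = 1.0 if group_ids[i] == group_ids[j] else 0.0
            # Within-group pairs keep weight 1; cross-group pairs get the similarity.
            soft = hard + (1.0 - hard) * similarity[i][j]
            # Assumption: clamp to [eps, 1] so log() stays finite.
            soft = min(max(soft, eps), 1.0)
            bias[i][j] = math.log(soft) * temperature
    return bias

# Toy input: series 0 and 1 share a group, series 2 is in its own group.
group_ids = [0, 0, 1]
similarity = [
    [1.0, 0.9, 0.5],
    [0.9, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
bias = soft_attention_bias(group_ids, similarity, temperature=5.0)
```

Within-group entries come out as zero bias (no penalty), while cross-group entries are negative and grow more negative as similarity falls.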

---

## Similarity Matrix: What We Use and Alternatives

The similarity matrix determines how much cross-group attention is allowed. We implemented three approaches:

### 1. **Correlation (Pearson)** - Currently Used
- Measures linear relationships between series
- Best for: Series with similar patterns but different scales
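- Formula: Normalized covariance, cov(x, y) / (σ_x · σ_y); raw values lie in [-1, 1], so they have to be mapped into [0, 1] before the log is taken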

### 2. **Cosine Similarity** - Available, Not Yet Tested
- Measures directional similarity (ignores magnitude)
- Best for: Series with similar shapes but different amplitudes
- Formula: Angle between vectors

### 3. **Distance (Gaussian Kernel)** - Available, Not Yet Tested
- Measures numerical proximity
- Best for: Series that are close in value space
- Formula: exp(-euclidean_distance / scale)
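
To make the three measures concrete, here is a minimal sketch for a single pair of series. It is not the code in `pipeline.py`: the function names, the `scale` default, and the `to_unit_interval` clipping step (one possible way to keep values in [0, 1]) are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def cosine(x, y):
    """Cosine of the angle between the two series viewed as vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0

def gaussian_kernel(x, y, scale=1.0):
    """exp(-euclidean_distance / scale), as in the formula above."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-dist / scale)

def to_unit_interval(value):
    """Assumption: clip negative similarities to 0 so the result lies in [0, 1]."""
    return min(max(value, 0.0), 1.0)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]  # same shape as a, ten times the scale
c = [4.0, 3.0, 2.0, 1.0]      # reversed trend

corr_ab = pearson(a, b)
corr_ac = pearson(a, c)
cos_ac = cosine(a, c)
dist_ab = gaussian_kernel(a, b)
```

On this toy data the measures disagree in instructive ways: correlation and cosine both treat `a` and `b` as a near-perfect match despite the scale gap, the Gaussian kernel rates them as far apart, and correlation separates `a` from `c` much more sharply than cosine does because cosine does not remove the mean.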

---

## Temperature Parameter

**Purpose**: Controls how permissive cross-group attention is

- **Low temperature (e.g., 1.0)**: Permissive - the cross-group weight stays close to the raw similarity, so even moderately similar series can share information
- **High temperature (e.g., 20.0)**: Stricter - only very similar series attend across groups (closer to hard masking)

**Mathematical effect**: `attention_bias = log(similarity) × temperature`
- Acts as a scaling factor for cross-group information flow
- Because `log(similarity) ≤ 0` for similarity in (0, 1], a larger temperature makes the bias more negative
- Higher temperature therefore amplifies the effect of similarity differences
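
A quick numerical check of the formula: an additive bias `b` multiplies a pair's unnormalized softmax weight by `exp(b)`, so `log(similarity) × temperature` scales a cross-group pair by `similarity ** temperature`. The sketch below only evaluates that expression; the helper name `cross_group_weight` is hypothetical.

```python
import math

def cross_group_weight(similarity, temperature):
    """Multiplier applied to a cross-group pair before softmax normalization."""
    bias = math.log(similarity) * temperature
    return math.exp(bias)  # algebraically equal to similarity ** temperature

# Compare a highly similar pair (0.9) with a moderately similar one (0.5).
for temperature in (1.0, 5.0, 20.0):
    weights = [cross_group_weight(s, temperature) for s in (0.9, 0.5)]
    print(temperature, [f"{w:.6f}" for w in weights])
```

Under the formula as written, the multiplier shrinks as temperature grows, and it shrinks much faster for the less similar pair.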

---

## Testing Conclusions

We tested the soft masking approach on multiple datasets with varying characteristics:

**Key findings**:
1. **Unrelated series (M4 Hourly)**: No improvement - as expected, cross-group learning doesn't help when series are from different domains

2. **Related series (ETT datasets)**: Small improvements (~6% on full data)
   - Statistically significant with large samples
   - **BUT**: Predictions are 98% identical to baseline (high correlation)
   - Only ~61% of samples improved (barely better than random)
   - Suggests **smoothing effect** rather than meaningful cross-learning

3. **Small vs large samples**: Improvement dropped from 28% (100 samples) to 6% (117k samples)
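   - Indicates the small-sample estimate was inflated by sampling noise (no training is involved, so this is not overfitting in the usual sense)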
   - Real effect is much smaller than initially observed

**Current interpretation**: Soft masking provides marginal benefit in current configuration - potentially just reducing prediction variance rather than enabling meaningful cross-group learning.
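
The two headline diagnostics above, aggregate improvement and the share of samples that improved, can be computed from paired per-sample errors. This is a minimal sketch: the error metric, the function names, and the toy numbers are assumptions, and the values are not results from these experiments.

```python
def relative_improvement(baseline_errors, soft_errors):
    """Fractional reduction in mean error relative to the baseline."""
    base = sum(baseline_errors) / len(baseline_errors)
    soft = sum(soft_errors) / len(soft_errors)
    return (base - soft) / base

def win_rate(baseline_errors, soft_errors):
    """Share of samples where soft masking has strictly lower error."""
    wins = sum(1 for b, s in zip(baseline_errors, soft_errors) if s < b)
    return wins / len(baseline_errors)

# Toy paired errors for illustration only.
baseline = [1.0, 2.0, 3.0, 4.0]
soft = [0.9, 2.1, 2.7, 3.8]

improvement = relative_improvement(baseline, soft)
wins = win_rate(baseline, soft)
```

Reporting both numbers together helps separate a broad, consistent gain from a small average shift driven by a minority of samples.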

---

## Still to Explore

To determine if soft masking can provide meaningful improvements, we need to test:

1. **Different similarity measures**: 
   - Cosine similarity (shape-based)
   - Distance-based (proximity-based)
   - Compare which captures useful relationships best

2. **Temperature sweep**: 
   - Test 1.0, 10.0, 20.0 (only tested 5.0 so far)
   - Find optimal trade-off between group separation and cross-learning

3. **Different datasets**:
   - Weekly vs hourly granularity
   - Datasets with stronger inter-series relationships

4. **Deeper analysis**:
   - Per-feature results (which features benefit?)
   - Temporal patterns (does it help more for short/long-term forecasts?)
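
Items 1 and 2 could be run as a single grid over similarity type and temperature. The sketch below assumes a hypothetical `evaluate(similarity_type, temperature)` callable that returns an error score. The strings `"cosine"` and `"distance"` are assumed option names (only `"correlation"` appears in the code above), and `dummy_evaluate` is a placeholder that returns arbitrary numbers, not measurements.

```python
from itertools import product

SIMILARITY_TYPES = ("correlation", "cosine", "distance")  # last two names assumed
TEMPERATURES = (1.0, 5.0, 10.0, 20.0)

def run_sweep(evaluate):
    """Score every (similarity_type, temperature) pair, lowest error first."""
    results = []
    for similarity_type, temperature in product(SIMILARITY_TYPES, TEMPERATURES):
        error = evaluate(similarity_type, temperature)
        results.append((error, similarity_type, temperature))
    return sorted(results)

def dummy_evaluate(similarity_type, temperature):
    # Placeholder scoring so the sketch runs; replace with a real evaluation
    # that forecasts with the given settings and returns an error metric.
    return abs(temperature - 10.0) + 0.01 * len(similarity_type)

results = run_sweep(dummy_evaluate)
best_error, best_type, best_temperature = results[0]
```

Keeping the baseline (hard masking) in the same loop as an extra configuration would make each comparison paired on identical samples.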

**Goal**: Determine if there's a configuration where soft masking provides substantial practical benefit, or conclude it's not effective for inference-only scenarios.
