
How it works

Iman edited this page Jun 21, 2026 · 1 revision

tsauditor is organized into three modules, each responsible for a different class of data quality problem. All three run automatically when you call scan(), unless you disable specific ones via the run_* parameters.


Module 1 — Profiler

File: tsauditor/profiler/

The profiler checks the structural integrity of your time-series DataFrame — whether the data is well-formed enough to reason about before any statistical analysis begins.

What it checks

Frequency and gaps (frequency.py)

Examines the timestamp index for regularity. Detects:

  • Large individual gaps exceeding the expected cadence (5 calendar days for finance, 3× median gap for sensor/general)
  • Clusters of consecutive gaps, a sign of a data feed outage or systematic collection failure rather than random missing data
  • Duplicate timestamps, which silently break every rolling, lag, and resampling operation

Missing values (missing.py)

Goes beyond counting NaNs. Distinguishes:

  • Randomly scattered missing values (often acceptable)
  • Clustered consecutive NaN runs (PRF002), which indicate a structural failure
  • Columns with a high overall missing rate above 30% (PRF006), which may be unusable

Stationarity (stationarity.py)

Runs the Augmented Dickey-Fuller test on each numeric column. Flags non-stationary columns (PRF003) as informational: non-stationarity is expected for price series, but worth knowing before modeling, since many ML methods assume a stable distribution over time.
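The gap, duplicate, and NaN-run checks above can be approximated in a few lines of pandas. This is a minimal sketch, not tsauditor's actual code: the helper names `find_large_gaps` and `longest_nan_run` are hypothetical, and only the "3× median gap" rule for sensor/general data is taken from the text.

```python
import pandas as pd


def find_large_gaps(index: pd.DatetimeIndex, factor: float = 3.0) -> pd.Series:
    """Gaps larger than `factor` x the median gap, keyed by the timestamp that ends the gap."""
    deltas = index.to_series().diff().dropna()
    threshold = factor * deltas.median()
    return deltas[deltas > threshold]


def longest_nan_run(series: pd.Series) -> int:
    """Length of the longest run of consecutive NaNs."""
    is_na = series.isna()
    # Every non-NaN value opens a new group; NaNs inherit the id of the value before them.
    run_id = (~is_na).cumsum()
    return int(is_na.groupby(run_id).sum().max())


# Hourly readings with a 5-hour hole and one repeated timestamp (illustrative data).
hours = [0, 1, 2, 3, 8, 9, 10, 11, 11]
index = pd.to_timedelta(hours, unit="h") + pd.Timestamp("2024-01-01")

gaps = find_large_gaps(index)
has_duplicates = bool(index.duplicated().any())
nan_run = longest_nan_run(pd.Series([1.0, None, None, None, 2.0, None, 3.0]))
```

Here `gaps` holds the single 5-hour gap, `has_duplicates` is true because of the repeated 11:00 stamp, and `nan_run` is the length of the longest NaN cluster. A real profiler would also need the finance-specific calendar-day rule, which this sketch leaves out.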

Module 2 — Anomaly Detector

File: tsauditor/anomaly/

The anomaly module finds individual values that are suspicious — either globally extreme or locally inconsistent with their neighbors.

What it checks

Point anomalies (point.py)

Uses both z-score and IQR methods. A value is flagged if either method detects it as an outlier: IQR is more robust to skewed distributions, z-score is more sensitive to extreme tails. The domain preset adjusts the z-score threshold: wider for finance (fat tails are real) and tighter for sensor data.
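The "flag if either method fires" rule can be sketched as follows. The function name `point_outliers` and the default thresholds (3.0 for the z-score, 1.5 for the IQR fence) are assumptions chosen for illustration; the source does not state the values tsauditor's domain presets use.

```python
import pandas as pd


def point_outliers(series: pd.Series, z_threshold: float = 3.0, iqr_factor: float = 1.5) -> pd.Series:
    """Boolean mask: True where the z-score rule OR the IQR rule flags the value."""
    values = series.astype(float)

    # Global z-score rule.
    z = (values - values.mean()) / values.std(ddof=0)
    z_flag = z.abs() > z_threshold

    # IQR (Tukey fence) rule.
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    iqr_flag = (values < q1 - iqr_factor * iqr) | (values > q3 + iqr_factor * iqr)

    return z_flag | iqr_flag


# Twenty ordinary readings plus one extreme value (illustrative data).
readings = pd.Series([10.0, 10.1, 9.9, 10.2, 9.8] * 4 + [100.0])
flags = point_outliers(readings)
```

Taking the union means a value only has to look extreme under one of the two definitions, which trades a few extra false positives for fewer misses on skewed or heavy-tailed columns.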

Contextual anomalies (contextual.py)

Stuck values (ANO001): The same value repeating consecutively beyond what's physically plausible. A stock price identical for 6+ trading days is almost certainly a data feed error. A sensor reading the same temperature for 3+ consecutive hours is almost certainly a dead sensor.
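A run-length check of this kind is straightforward in pandas. The sketch below is an assumption about how such a check might look, not tsauditor's implementation; `stuck_mask` is a hypothetical name, and `min_run` is left to the caller (6 for daily prices and 3 for hourly sensor data, following the examples above).

```python
import pandas as pd


def stuck_mask(series: pd.Series, min_run: int) -> pd.Series:
    """True for every value that belongs to a run of >= min_run identical consecutive values."""
    # A new run starts whenever the value differs from the previous one.
    run_id = (series != series.shift()).cumsum()
    run_length = series.groupby(run_id).transform("size")
    return run_length >= min_run


# A price that freezes at 101.5 for six consecutive observations (illustrative data).
prices = pd.Series([100.0, 101.0, 101.5, 101.5, 101.5, 101.5, 101.5, 101.5, 102.0])
stuck = stuck_mask(prices, min_run=6)
```

Because NaN never compares equal to NaN, each missing value forms its own run of length one here, so gaps are not mistaken for stuck values.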

Contextual spikes (ANO003): A value that is extreme relative to its immediate neighbors, even if it falls within the global distribution. Uses a rolling z-score with a centered window — a 5-period window around each point, with min_periods=3 so edge values are still evaluated.
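A centered rolling z-score with these window settings might be written as below. This is a sketch under stated assumptions: `contextual_spikes` is a hypothetical name and the threshold of 1.7 is illustrative only. One thing worth knowing when picking a threshold: if the point being tested is included in its own window, its sample z-score can never exceed (n−1)/√n, which is about 1.79 for n = 5, so any threshold has to sit below that ceiling.

```python
import pandas as pd


def contextual_spikes(
    series: pd.Series, window: int = 5, min_periods: int = 3, threshold: float = 1.7
) -> pd.Series:
    """True where a value deviates strongly from its centered rolling neighborhood."""
    roll = series.rolling(window, center=True, min_periods=min_periods)
    # roll.std() is the sample standard deviation (ddof=1) of each window.
    z = (series - roll.mean()) / roll.std()
    return z.abs() > threshold


# A quiet series with one local jump at position 5 (illustrative data).
signal = pd.Series([10.0, 10.1, 9.9, 10.0, 10.2, 15.0, 10.1, 9.9, 10.0, 10.1])
spikes = contextual_spikes(signal)
```

With `min_periods=3`, the first and last two points are still scored using the three or four neighbors that exist. Computing the mean and standard deviation from the neighbors only, excluding the tested point, would lift the ceiling described above at the cost of slightly more code.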


Module 3 — Leakage Detector

File: tsauditor/leakage/

This is tsauditor's core contribution. Leakage detection requires a target column to be specified. It checks whether features contain information that would not have been available at prediction time — the class of mistake that inflated the OGDC model from 70% to 99% accuracy.

What it checks

Target equivalence (equivalence.py — LEK001)

Detects features that are near-identical to the target at lag 0.

For binary targets, uses AUC separation (max(AUC, 1−AUC)) with a threshold of 0.80. Pearson correlation against a binary 0/1 target is the point-biserial correlation, which cannot reach 1 for a continuous feature: for a roughly normally distributed feature whose sign defines the label, it comes out at √(2/π) ≈ 0.798. Such a feature is textbook leakage, yet it scores only about 0.80 on Pearson, appearing to be "a strong predictor" rather than "a copy of the target." AUC scores it at 1.0, correctly signaling the problem, which is why AUC is used here.
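The gap between Pearson and AUC is easy to reproduce with simulated data. The sketch below computes AUC from ranks (the Mann-Whitney formulation) using NumPy only; `auc_separation` is a hypothetical helper, and it assumes the feature has no tied values.

```python
import numpy as np


def auc_separation(feature, target) -> float:
    """max(AUC, 1 - AUC) from the rank-sum statistic; assumes no ties in `feature`."""
    feature = np.asarray(feature, dtype=float)
    positive = np.asarray(target).astype(bool)

    # Rank 1 = smallest feature value.
    ranks = np.empty(len(feature))
    ranks[np.argsort(feature)] = np.arange(1, len(feature) + 1)

    n_pos = int(positive.sum())
    n_neg = len(feature) - n_pos
    u_stat = ranks[positive].sum() - n_pos * (n_pos + 1) / 2
    auc = u_stat / (n_pos * n_neg)
    return max(auc, 1 - auc)


# A normally distributed feature whose sign *is* the label (simulated data).
rng = np.random.default_rng(0)
returns = rng.normal(size=10_000)
label = (returns > 0).astype(int)

pearson = np.corrcoef(returns, label)[0, 1]   # lands near sqrt(2/pi)
separation = auc_separation(returns, label)   # perfect ranking of the two classes
```

On this data Pearson sits close to 0.8 while the AUC separation is exactly 1.0, because every positive example ranks above every negative one.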

For continuous targets, uses Spearman correlation (rank-based, robust to non-linearity) with a threshold of 0.95 — legitimate strong predictors can reach 0.5–0.7 naturally; 0.95+ is near-mathematical equivalence.
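To illustrate why a rank-based measure suits this check, the sketch below compares a monotone but non-linear copy of the target with an ordinary noisy predictor. Spearman is computed as Pearson on ranks; the helper name `spearman`, the simulated data, and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def spearman(a: pd.Series, b: pd.Series) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return float(a.rank().corr(b.rank()))


rng = np.random.default_rng(1)
target = pd.Series(rng.normal(size=500))

# exp() is strictly increasing, so this feature carries exactly the target's ordering.
disguised_copy = np.exp(target)
# A legitimately useful but noisy predictor.
noisy_predictor = 0.6 * target + 0.8 * pd.Series(rng.normal(size=500))

THRESHOLD = 0.95  # threshold quoted in the text above
copy_score = spearman(disguised_copy, target)
copy_pearson = float(disguised_copy.corr(target))
predictor_score = spearman(noisy_predictor, target)
```

The disguised copy clears the 0.95 Spearman threshold even though its plain Pearson correlation falls well short of it, while the noisy predictor stays comfortably below.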

Cross-correlation leakage (correlation.py — LEK002)

Computes Spearman cross-correlation between each feature and the target across lags from −max_lag to +max_lag. A legitimate feature should correlate most strongly at lag 0 or negative lags (past values predicting future target). If the peak correlation falls at a positive lag — meaning the feature aligns most strongly with the future target — it suggests the feature was constructed using information not yet available at prediction time.
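A lag scan along these lines can be sketched as follows. The convention used here is an assumption: lag k compares feature[t] with target[t+k], so a peak at positive k means the feature lines up with a later target value. The names `lagged_spearman` and `peak_lag` are hypothetical.

```python
import numpy as np
import pandas as pd


def lagged_spearman(feature: pd.Series, target: pd.Series, max_lag: int) -> pd.Series:
    """Spearman correlation of feature[t] with target[t + lag] for each lag in [-max_lag, max_lag]."""
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        # shift(-lag) moves target[t + lag] into position t.
        pair = pd.concat(
            [feature, target.shift(-lag)], axis=1, keys=["feature", "target"]
        ).dropna()
        scores[lag] = pair["feature"].rank().corr(pair["target"].rank())
    return pd.Series(scores)


def peak_lag(feature: pd.Series, target: pd.Series, max_lag: int = 5) -> int:
    """Lag with the largest absolute correlation."""
    return int(lagged_spearman(feature, target, max_lag).abs().idxmax())


rng = np.random.default_rng(3)
target = pd.Series(rng.normal(size=300))

leaky = target.shift(-2)     # built from the target two steps ahead
trailing = target.shift(1)   # built from yesterday's target only

leaky_peak = peak_lag(leaky, target)
trailing_peak = peak_lag(trailing, target)
```

The leaky feature peaks at lag +2 and the trailing one at lag −1. On real data the peak is rarely this clean, so a production check would likely also require the peak to exceed the lag-0 correlation by some margin.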

Temporal lookahead (temporal.py — LEK003)

Catches a subtler pattern: a feature that correlates with the future target beyond what the target's own autocorrelation explains. This is the signature of a forward-looking or centered rolling window — a 5-day rolling mean that includes tomorrow's value when predicting today, for example. The persistence baseline (the target's own lag-1 autocorrelation) is subtracted to avoid false-flagging legitimate trailing features that are strong simply because the target itself is autocorrelated.
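The persistence-baseline idea can be illustrated on a simulated autocorrelated series. This sketch uses Pearson correlation and a one-step horizon, both of which are assumptions since the text does not specify them; `lookahead_excess` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def lookahead_excess(feature: pd.Series, target: pd.Series) -> float:
    """|corr(feature[t], target[t+1])| minus the target's own lag-1 autocorrelation."""
    future = target.shift(-1)
    feature_vs_future = abs(feature.corr(future))
    persistence = abs(target.corr(future))  # baseline: what yesterday's target already explains
    return float(feature_vs_future - persistence)


# Simulated AR(1) target with coefficient 0.5 (illustrative data).
rng = np.random.default_rng(2)
noise = rng.normal(size=2000)
values = np.zeros(2000)
for t in range(1, 2000):
    values[t] = 0.5 * values[t - 1] + noise[t]
target = pd.Series(values)

trailing_mean = target.rolling(5).mean()               # uses t-4 .. t only
centered_mean = target.rolling(5, center=True).mean()  # uses t-2 .. t+2, i.e. peeks ahead

trailing_excess = lookahead_excess(trailing_mean, target)
centered_excess = lookahead_excess(centered_mean, target)
```

The trailing mean ends up with a negative excess: it is correlated with tomorrow's target, but no more than the target's own persistence would predict. The centered mean contains tomorrow's value, so its excess is clearly positive.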


The report object

All three modules return Issue objects, which scan() routes into GuardReport.critical, GuardReport.warnings, or GuardReport.info based on severity. See API Reference for the full structure.

Every Issue carries:

  • A code (e.g. LEK001) for programmatic filtering
  • A description of what was found
  • An evidence dict with the supporting statistics
  • A suggestion property giving a concrete recommended action
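As a rough mental model of the structure described above, here is a stand-in written with a plain dataclass. It is not tsauditor's actual class: `SketchIssue`, `route`, the severity labels, and the suggestion texts are all hypothetical, and the real definitions are in the API Reference.

```python
from dataclasses import dataclass, field

# Hypothetical suggestion texts, keyed by issue code.
SUGGESTIONS = {
    "LEK001": "Drop the feature or rebuild it from past data only.",
    "PRF003": "Consider differencing or detrending before modeling.",
}


@dataclass
class SketchIssue:
    code: str
    severity: str  # assumed labels: "critical", "warning", "info"
    description: str
    evidence: dict = field(default_factory=dict)

    @property
    def suggestion(self) -> str:
        return SUGGESTIONS.get(self.code, "Inspect the flagged rows manually.")


def route(issues):
    """Group issues by severity, mirroring the critical / warnings / info split."""
    buckets = {"critical": [], "warning": [], "info": []}
    for issue in issues:
        buckets[issue.severity].append(issue)
    return buckets


issues = [
    SketchIssue("LEK001", "critical", "Feature matches target at lag 0", {"auc": 1.0}),
    SketchIssue("PRF003", "info", "Column is non-stationary", {"p_value": 0.62}),
]
buckets = route(issues)
leakage_codes = [i.code for i in issues if i.code.startswith("LEK")]
```

Filtering on the `code` prefix, as in the last line, is the kind of programmatic filtering the codes are meant to support.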
