Skip to content

SQLite math/stat UDFs (ln/log/sqrt/pow/exp + stddev/var/corr/covar) with O(1) Welford memory and cross-dialect parity

Choose a tag to compare

@ZmeiGorynych ZmeiGorynych released this 04 May 08:06
· 564 commits to main since this release
feafc77

SLayer 0.4.1

Bug-fix-and-capability release that brings SQLite to math/stat parity
with Postgres, DuckDB, MySQL, and ClickHouse, and exposes seven new
statistical aggregations as built-in colon-syntax aggregates. Motivated
by DEV-1317: a
BIRD-Interact KB-translation effort that hit ~30–40 formulas which were
unencodable on bare SQLite for want of basic math/stat operators.

Backwards-compatible. No schema migrations.

What's new

Scalar math UDFs on SQLite

Use these inside Column.sql, ModelMeasure.formula, or
Aggregation.formula. NULL inputs return NULL (matching cross-dialect
SQL semantics). Math-domain errors propagate as
sqlite3.OperationalError — matching Postgres's strict semantics, not
SQLite ≥3.35's silent-NULL log() built-in.

Name Args Notes
ln(x) 1 Natural logarithm
log10(x) 1 Base-10 logarithm
log(B, X) 2 Returns log_B(X). Base first, value second — matches Postgres LOG(b, x), sqlglot's emission, and the SQLite ≥3.35 built-in. The UDF registers unconditionally and overrides any built-in so the strict-error policy is uniform across SQLite versions.
exp(x) 1 e^x
sqrt(x) 1 Square root
pow(x, n) / power(x, n) 2 Both spellings registered. Uses math.pow internally — rejects negative-base + non-integer exponent (Python's ** would silently return a complex), and bounds huge exponents to IEEE-754 (Python's ** would build unbounded big-ints).

Statistical aggregations (colon syntax)

Available as built-in aggregations on every numeric column:

{"formula": "latency:stddev_samp"}
{"formula": "latency:var_pop"}
{"formula": "price:corr(other=quantity)"}
{"formula": "price:covar_samp(other=quantity)"}
Aggregation Postgres-matching edge cases
stddev_samp / var_samp / covar_samp NULL when N ≤ 1
stddev_pop / var_pop / covar_pop NULL at N = 0; 0 at N = 1
corr NULL when fewer than 2 non-null pairs OR either side has zero variance

The two-column aggregations (corr, covar_samp, covar_pop) take
the second column as a named other= parameter — same shape as
weighted_avg(weight=…). NULL pairs (either x or y NULL) are dropped
entirely from the calculation.

Memory profile: implemented with Welford online accumulators
O(1) memory regardless of group size, and numerically more stable than
the naive two-pass formula. Safe for SQLite analytics workloads with
millions of rows per group.

Cross-dialect support

Dialect Stat aggregates Math scalars
SQLite All seven via Python UDFs (slayer/sql/sqlite_udfs.py); var_samp/var_pop aliased to variance/variance_pop to handle sqlglot's transpilation rewrite. All seven UDFs above.
Postgres / DuckDB All native: STDDEV_* / VAR_* / CORR / COVAR_*. All native.
ClickHouse All native (sqlglot may emit camelCase, e.g. varSamp). All native.
MySQL STDDEV_SAMP / STDDEV_POP / VAR_SAMP / VAR_POP are native. corr / covar_samp / covar_pop raise NotImplementedError at SQL generation time (no native function, no Python-UDF mechanism). Use MariaDB or compute client-side. All native.

Internal improvements

These don't change behaviour but are worth knowing about:

  • Unified aggregate builders. The three dialect-aware builders
    (_build_percentile, _build_formula_agg, _build_stat_agg in
    slayer/sql/generator.py) now share two helpers — _wrap_filter and
    SQLGenerator._resolve_agg_param — and emit fully model-qualified
    column references via _resolve_sql. Cognitive complexity of
    _build_stat_agg dropped from 22 → ~6, addressing a Sonar
    python:S3776 finding.

  • percentile(p=…) validation. p is now validated as a numeric
    literal in [0, 1] after parameter resolution. Closes a security gap
    where a malicious model-level default like p=pg_sleep(10) could
    bypass the SQL-injection regex check (Codex review finding).

Verify your work

# Full unit suite
poetry run pytest -m "not integration"

# SQLite + DuckDB integration (no Docker needed)
poetry run pytest tests/integration/test_integration.py -m integration
poetry run pytest tests/integration/test_integration_duckdb.py -m integration

# Lint
poetry run ruff check slayer/ tests/ examples/

If you run the Docker-backed examples, the new aggregations are smoke-
tested against live ClickHouse and MySQL via
examples/{clickhouse,mysql}/verify.py (with check_stddev_var() and,
on ClickHouse, check_corr_covar() — MySQL skips the latter since
corr/covar_* aren't supported there).

Documentation

  • CLAUDE.md — Key Conventions + Aggregation caveats sections updated.
  • docs/concepts/{formulas,queries,models,terminology}.md — new aggregations + scalar functions documented end-to-end, with the log(B, X) arg-order pin called out explicitly.
  • docs/database-support.md — per-dialect support matrix + SQLite UDF caveats.
  • docs/examples/07_aggregations/aggregations.md — extended quick-reference table.
  • slayer/help/topics/03_aggregations.md + .claude/skills/slayer-{query,models}.md — agent-facing summaries updated.

Known limitations & follow-ups

Filed as sibling Linear issues for future releases:

  • DEV-1320 — SQLite string/regex UDFs (regexp, regexp_replace, split_part, …). Not blocking any known KB pattern but rounds out parity.
  • DEV-1321 — ClickHouse stat aggregates may return NaN (not SQL NULL) for single-row / empty / zero-variance groups, contradicting the documented Postgres-style semantics. Fix needs IF(isNaN(x), NULL, x) wrapping plus ClickHouse integration coverage.
  • DEV-1322 — Type-check kwarg-as-column args for weighted_avg(weight=…) / corr(other=…) / covar_*(other=…). Today amount:corr(other=string_column) slips past enrichment and fails at the database with a dialect-specific runtime error rather than a clear "not numeric" message.

Acknowledgements

CodeRabbit, SonarCloud, and Codex (via the codex-review skill) all
contributed review feedback that shaped the final shape of this
release.