Clean local analytics table #3579

Merged
isTravis merged 3 commits into tr/analytics-local2 from tr/local-analytics2-clean on Apr 15, 2026

Conversation

@isTravis (Member)

This PR builds off of #3561 to clean up some of the ingestion. I'm putting it in a separate PR because it touches many files and I wanted to keep the changes segmented.

| Change | Columns affected | Space recovered |
| --- | --- | --- |
| Drop joinable denormalized columns | 11 columns | ~2.4 GB |
| Merge `timestamp` → `createdAt` | 1 column dropped | ~144 MB |
| Drop `title` (browser tab title, not needed) | 1 column | ~1,249 MB |
| Drop `country`/`countryCode` (derive from timezone in matviews) | 2 columns | ~294 MB |
| Drop `isProd` (filter at import, remove env gate) | 1 column | ~18 MB |
| **Total** | 16 columns dropped (43 → 27) | ~4.1 GB |

Kept (per team decision):

  • type — kept even though derivable from event
  • UTM columns (utmSource, utmMedium, utmCampaign, utmTerm, utmContent) — kept due to download-event edge case complexity

Change 1: Drop Joinable Denormalized Columns

These are all denormalized copies of data in other tables. Verified match rates against source tables:

| Column | Source | Match rate | Effective MB |
| --- | --- | --- | --- |
| `pubTitle` | `Pubs.title` via `pubId` | 98.8% | 678 |
| `pubSlug` | `Pubs.slug` via `pubId` | 99.8% | 148 |
| `collectionTitle` | `Collections.title` via `collectionId` | 99.5% | 231 |
| `collectionSlug` | `Collections.slug` via `collectionId` | 99.6% | 152 |
| `collectionKind` | `Collections.kind` via `collectionId` | 100.0% | 40 |
| `pageTitle` | `Pages.title` via `pageId` | 98.5% | 112 |
| `pageSlug` | `Pages.slug` via `pageId` | joinable | 12 |
| `communityName` | `Communities.title` via `communityId` | joinable | 449 |
| `communitySubdomain` | `Communities.subdomain` via `communityId` | 99.7% | 144 |
| `collectionIds` | `CollectionPubs` via `pubId` | verified match | 898 |
| `primaryCollectionId` | `CollectionPubs`/`Pubs` via `pubId` | queryable | 113 |

The 1–2% mismatches are from renamed entities — using JOINs means the dashboard always shows the current name, which is better behavior.

Matviews that currently read denormalized titles (daily_page) will be updated to JOIN source tables. Raw-fallback queries in impactApi.ts will do the same.


Change 2: Single createdAt Column

Currently there are two timestamp columns:

| Column | Contains | Range in data |
| --- | --- | --- |
| `timestamp` | Real event time (from client or Redshift) | 2010 → 2144 (!!) |
| `createdAt` | When the row was INSERT-ed into this PG table | 2024-02-12 → present |

Only 6 rows have timestamp == createdAt. They serve completely different purposes.

Plan:

  1. Drop the Sequelize auto-generated createdAt column (it just records import/insertion time)
  2. Rename `timestamp` → `createdAt` so the real event time lives in the standard column name
  3. Update all matview SQL, queries, indexes, and the Sequelize model to reference createdAt
  4. Configure the model with createdAt: 'createdAt' so Sequelize auto-sets it on new rows
```sql
ALTER TABLE "AnalyticsEvents" DROP COLUMN "createdAt";
ALTER TABLE "AnalyticsEvents" RENAME COLUMN "timestamp" TO "createdAt";
```
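Step 4 of the plan can be sketched as a plain options object (the actual model definition lives elsewhere in the codebase; `updatedAt: false` is an assumption, since the kept-columns list below has no `updatedAt`):

```typescript
// Sketch of the Sequelize timestamp options implied by step 4.
// `createdAt: 'createdAt'` makes Sequelize auto-set the column on new rows;
// `updatedAt: false` (an assumption) disables the column the table doesn't have.
const analyticsEventTimestampOptions = {
    timestamps: true,
    createdAt: 'createdAt',
    updatedAt: false,
};
```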

Data cleaning note: The timestamp column has some garbage data (year 2144, very old 2010 entries). These will be caught by the new ingestion timestamp validation (see Change 5) going forward. For historical data, a cleaning pass should clamp them to a reasonable range.
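The cleaning pass could use a small clamp helper like this (a sketch; the PR doesn't specify the bounds, so `min`/`max` here are illustrative):

```typescript
// Illustrative helper for the historical cleaning pass: pull out-of-range
// timestamps (e.g. year 2144, or stray 2010 entries) back into [min, max].
// The bounds are assumptions, not specified in this PR.
const clampTimestamp = (value: Date, min: Date, max: Date): Date => {
    const t = value.getTime();
    if (Number.isNaN(t) || t < min.getTime()) return min; // malformed or too old
    if (t > max.getTime()) return max; // in the future
    return value;
};
```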


Change 3: Derive Country from Timezone Server-Side

PostgreSQL has no built-in timezone-to-country mapping. PG's pg_timezone_names view lists timezone names with UTC offsets and DST status, but has no country column. There is no PG extension for this either.

Approach: Instead of creating a PG lookup table, we keep this simple:

  1. The daily_country matview is renamed to daily_timezone and groups by timezone instead of country
  2. The API handler (impactApi.ts) maps timezone → country in Node.js using the existing countries-and-timezones npm package, then sums the counts
  3. No new PG tables, no seeding, no maintenance

This lets us:

  • Drop country and countryCode columns from the raw table (~294 MB)
  • Keep timezone (it's the source of truth, unique per event, ~234 MB)
  • Stop deriving country in api.ts at ingestion time (simplifies the handler)
  • Use a single source of truth — the npm package — for timezone→country mapping everywhere

Updated matview (daily_timezone):

```sql
CREATE MATERIALIZED VIEW analytics_daily_timezone AS
SELECT
    "communityId",
    date_trunc('day', "createdAt")::date AS date,
    COALESCE(timezone, '') AS timezone,
    COUNT(*) AS count
FROM "AnalyticsEvents"
WHERE event IN ('page','pub','collection','other')
GROUP BY "communityId", date_trunc('day', "createdAt")::date, timezone;
```

API handler (in impactApi.ts):

```typescript
import { getCountryForTimezone } from 'countries-and-timezones';

// After fetching timezone rows from the matview:
const countryMap = new Map<string, { country: string; countryCode: string; count: number }>();
for (const row of timezoneRows) {
    const country = getCountryForTimezone(row.timezone);
    const key = country ? country.id : 'Unknown';
    const existing = countryMap.get(key);
    if (existing) {
        existing.count += row.count;
    } else {
        countryMap.set(key, {
            country: country?.name ?? 'Unknown',
            countryCode: country?.id ?? '',
            count: row.count,
        });
    }
}
const countries = [...countryMap.values()].sort((a, b) => b.count - a.count);
```

Why this works well:

  • The matview collapses 18.85M rows down to ~community × date × timezone (a few hundred per community). The JS rollup is trivial.
  • The npm package is already a dependency. No new infrastructure.
  • Package updates are picked up automatically on next deploy — no table re-seeding needed.

Caveat: 16.5% of rows have generic timezones like UTC or Etc/GMT-7 that can't resolve to a country (shown as "Unknown"). This matches the current behavior — the same rows already have country IS NULL today because getCountryForTimezone('UTC') returns null.
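The "Unknown" bucket behavior can be shown self-contained with a stubbed lookup standing in for `countries-and-timezones` (the stub and its entries are hypothetical; the real package covers all IANA zones):

```typescript
// Stub standing in for getCountryForTimezone. Generic zones like 'UTC' and
// 'Etc/GMT-7' resolve to no country, exactly like the real package.
const stubLookup: Record<string, { id: string; name: string } | null> = {
    'America/New_York': { id: 'US', name: 'United States of America' },
    'UTC': null,
    'Etc/GMT-7': null,
};

// Unresolvable or unrecognized timezones fall into the 'Unknown' bucket.
const countryKeyFor = (timezone: string): string =>
    stubLookup[timezone]?.id ?? 'Unknown';
```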


Change 4: Drop title Column

The `title` column stores the browser's `<title>` tag (e.g. "My Article · PubPub"). It is:

  • Different from pageTitle/pubTitle/collectionTitle (only 0.3% overlap)
  • The sole source of page identity for event='other' (user profiles, about pages) — but these aren't surfaced on the dashboard
  • 1,249 MB of storage

Team decision: Drop it. Not needed.


Change 5: Drop isProd, Remove Env Gate, Filter at Import

The architecture has changed: each deployment now writes to its own isolated local database. There is no longer a shared database where prod and dev data mix. Therefore:

  1. Remove the env.PUBPUB_PRODUCTION / env.NODE_ENV gate from api.ts — every deployment writes freely to its own local DB
  2. During data import from Redshift, only import rows where isProd = true — this ensures the historical data is clean
  3. Drop the isProd column from the table — going forward, all data in a database is valid by definition
  4. Remove isProd from the client payload and Zod schema
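Step 2's import-time filter could look like this (a sketch; the row shape is hypothetical, mirroring a Redshift export that still carries `isProd`):

```typescript
// Hypothetical shape of a row exported from Redshift (still has isProd).
interface RedshiftRow {
    isProd: boolean;
    timestamp: string;
    // ...remaining exported columns
}

// Keep only production rows and strip the isProd flag, since the
// destination table no longer has the column.
const filterProdRows = (rows: RedshiftRow[]): Omit<RedshiftRow, 'isProd'>[] =>
    rows.filter((row) => row.isProd).map(({ isProd, ...rest }) => rest);
```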

Change 6: Timestamp Validation at Ingestion

Add a server-side check in the ingestion handler (api.ts) to reject events with unreasonable timestamps. This prevents:

  • Garbage data from buggy clients (e.g. year 2144)
  • Clock-skew attacks
  • Malformed timestamps
```typescript
// In the handler, before enqueue():
const eventTime = new Date(payload.timestamp).getTime();
const now = Date.now();

// Reject if the timestamp is malformed (a bad string parses to NaN, which
// would slip past both range comparisons, so it needs an explicit check),
// in the future, or unreasonably old (> 30 days)
if (Number.isNaN(eventTime) || eventTime > now || eventTime < now - 30 * 24 * 60 * 60 * 1000) {
    return { status: 204, body: undefined }; // silently drop
}
```

Why 30 days? Analytics beacons can be delayed (offline devices, queued requests), but anything older than 30 days is likely bad data. Anything in the future is definitely wrong.

Behavior: Silently returns 204 (same as success) — no error surfaced to the client, the event is just not stored.
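The check above can be factored into a pure helper for unit testing (a sketch; the name and signature are illustrative):

```typescript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Returns true only for a parseable timestamp that is not in the future and
// no more than 30 days old. Malformed strings parse to NaN, which fails both
// range comparisons, so they are rejected explicitly.
const isAcceptableEventTimestamp = (raw: string, now: number = Date.now()): boolean => {
    const t = new Date(raw).getTime();
    if (Number.isNaN(t)) return false;
    return t <= now && t >= now - THIRTY_DAYS_MS;
};
```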


Matview Updates Summary

daily_page — JOIN for titles

```sql
-- Before: reads denormalized columns
COALESCE("pageTitle", "pubTitle", "collectionTitle", path, '')

-- After: JOINs source tables for current titles
LEFT JOIN "Pages" pg ON pg.id = ae."pageId"
LEFT JOIN "Pubs" p ON p.id = ae."pubId"
LEFT JOIN "Collections" c ON c.id = ae."collectionId"
...
COALESCE(pg.title, p.title, c.title, ae.path, '')
```

daily_country → renamed to daily_timezone

```sql
-- Before: groups by stored country/countryCode columns
COALESCE(country, 'Unknown') AS country,
COALESCE("countryCode", '') AS country_code

-- After: groups by timezone (country mapping done in Node.js API handler)
timezone
```

The API handler uses the countries-and-timezones npm package to map timezone → country and sums the counts in JS.

daily_campaign — no change (UTM columns kept)

Other matviews

daily_summary, daily_pub, daily_collection, daily_referrer, daily_device — all reference `timestamp`, which becomes `createdAt`. No other changes are needed beyond the column rename.


Columns: Before and After

Dropped (16 columns)

| Column | Reason |
| --- | --- |
| `createdAt` (old) | Sequelize auto-column, replaced by renamed `timestamp` |
| `title` | Browser tab title, not needed |
| `country` | Derived from timezone server-side via npm package |
| `countryCode` | Derived from timezone server-side via npm package |
| `isProd` | All data is prod (isolated DBs); filter at import only |
| `pubTitle` | JOIN `Pubs` |
| `pubSlug` | JOIN `Pubs` |
| `collectionTitle` | JOIN `Collections` |
| `collectionSlug` | JOIN `Collections` |
| `collectionKind` | JOIN `Collections` |
| `collectionIds` | JOIN `CollectionPubs` |
| `primaryCollectionId` | JOIN `CollectionPubs`/`Pubs` |
| `pageTitle` | JOIN `Pages` |
| `pageSlug` | JOIN `Pages` |
| `communityName` | JOIN `Communities` |
| `communitySubdomain` | JOIN `Communities` |

Renamed (1 column)

| Old name | New name | Notes |
| --- | --- | --- |
| `timestamp` | `createdAt` | Real event time, preserved for imported data |

Kept (27 columns)

| Column | Type | Purpose |
| --- | --- | --- |
| `id` | UUID | Primary key |
| `type` | TEXT | Event category (page/track) — kept per team decision |
| `event` | TEXT | Event type (page, pub, collection, other, download) |
| `createdAt` | TIMESTAMPTZ | Event timestamp (renamed from `timestamp`) |
| `referrer` | TEXT | Inbound referrer URL |
| `isUnique` | BOOLEAN | Unique visit flag |
| `search` | TEXT | URL query string |
| `utmSource` | TEXT | UTM source param |
| `utmMedium` | TEXT | UTM medium param |
| `utmCampaign` | TEXT | UTM campaign param (used by daily_campaign) |
| `utmTerm` | TEXT | UTM term param |
| `utmContent` | TEXT | UTM content param |
| `timezone` | TEXT | Browser timezone (source for country derivation) |
| `locale` | TEXT | Browser locale |
| `userAgent` | TEXT | Full UA string |
| `os` | TEXT | Operating system |
| `communityId` | UUID | FK → Communities |
| `pubId` | UUID | FK → Pubs |
| `collectionId` | UUID | FK → Collections |
| `pageId` | UUID | FK → Pages |
| `url` | TEXT | Full URL with query string |
| `hash` | TEXT | URL fragment |
| `height` | INTEGER | Viewport height |
| `width` | INTEGER | Viewport width |
| `path` | TEXT | URL path segment |
| `release` | TEXT | Draft or release number |
| `format` | TEXT | Download format (pdf, docx, etc.) |

isTravis merged commit 14e474a into tr/analytics-local2 on Apr 15, 2026
isTravis deleted the tr/local-analytics2-clean branch on April 15, 2026 at 17:35