Clean local analytics table #3579
Merged
isTravis merged 3 commits into tr/analytics-local2 from Apr 15, 2026
This PR builds off of #3561 to clean some of the ingestion. Putting it in a separate PR as it touches many files and I wanted to segment.
- `timestamp` → `createdAt`
- `title` (browser tab title, not needed)
- `country` / `countryCode` (derive from timezone server-side)
- `isProd` (filter at import, remove env gate)

Kept (per team decision):

- `type`: kept even though derivable from `event`
- UTM columns (`utmSource`, `utmMedium`, `utmCampaign`, `utmTerm`, `utmContent`): kept due to download-event edge case complexity

Change 1: Drop Joinable Denormalized Columns
These are all denormalized copies of data in other tables. Verified match rates against source tables:
| Column | Source | Join key |
| --- | --- | --- |
| `pubTitle` | `Pubs.title` | `pubId` |
| `pubSlug` | `Pubs.slug` | `pubId` |
| `collectionTitle` | `Collections.title` | `collectionId` |
| `collectionSlug` | `Collections.slug` | `collectionId` |
| `collectionKind` | `Collections.kind` | `collectionId` |
| `pageTitle` | `Pages.title` | `pageId` |
| `pageSlug` | `Pages.slug` | `pageId` |
| `communityName` | `Communities.title` | `communityId` |
| `communitySubdomain` | `Communities.subdomain` | `communityId` |
| `collectionIds` | `CollectionPubs` | `pubId` |
| `primaryCollectionId` | `CollectionPubs` / `Pubs` | `pubId` |

The 1–2% mismatches are from renamed entities. Using JOINs means the dashboard always shows the current name, which is better behavior.
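To illustrate the read-time behavior, here is a minimal in-memory sketch of resolving a pub title through `pubId` instead of a stored copy. It stands in for the SQL JOINs and is not code from this PR: the `PubRecord` and `EventRecord` shapes and the `attachPubTitles` helper are hypothetical, and only the `pubId` and `pubTitle` names come from the description.

```typescript
// Hypothetical, simplified shapes; the real tables have more columns.
interface PubRecord {
  id: string;
  title: string;
}

interface EventRecord {
  pubId: string | null;
}

// Left-join events to the current Pubs rows by pubId.
// Events without a pubId, or whose pub no longer exists, get a null title.
function attachPubTitles(
  events: EventRecord[],
  pubs: PubRecord[],
): Array<EventRecord & { pubTitle: string | null }> {
  const titleById = new Map<string, string>();
  for (const pub of pubs) {
    titleById.set(pub.id, pub.title);
  }
  return events.map((event) => ({
    ...event,
    pubTitle: event.pubId !== null ? (titleById.get(event.pubId) ?? null) : null,
  }));
}
```

Because the title is looked up when the data is read, a renamed pub shows its current title for every historical event, which is the behavior the mismatch note describes.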
Matviews that currently read denormalized titles (`daily_page`) will be updated to JOIN source tables. Raw-fallback queries in `impactApi.ts` will do the same.

Change 2: Single `createdAt` Column

Currently there are two timestamp columns:
- `timestamp`: the real event time
- `createdAt`: the import/insertion time

Only 6 rows have `timestamp == createdAt`. They serve completely different purposes.

Plan:

1. Drop the old `createdAt` column (it just records import/insertion time)
2. Rename `timestamp` → `createdAt` so the real event time lives in the standard column name
3. Configure `createdAt` with `createdAt: 'createdAt'` so Sequelize auto-sets it on new rows

Change 3: Derive Country from Timezone Server-Side
PostgreSQL has no built-in timezone-to-country mapping. PG's `pg_timezone_names` view lists timezone names with UTC offsets and DST status, but has no country column. There is no PG extension for this either.

Approach: Instead of creating a PG lookup table, we keep this simple:
- The `daily_country` matview is renamed to `daily_timezone` and groups by `timezone` instead of `country`
- The API handler (`impactApi.ts`) maps timezone → country in Node.js using the existing `countries-and-timezones` npm package, then sums the counts

This lets us:

- Drop the `country` and `countryCode` columns from the raw table (~294 MB)
- Keep `timezone` (it's the source of truth, unique per event, ~234 MB)
- Remove the country lookup from `api.ts` at ingestion time (simplifies the handler)

Updated matview (`daily_timezone`):

API handler (in `impactApi.ts`):

Why this works well:
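To make the handler-side step concrete, here is a minimal sketch of rolling timezone-level counts up to countries. It is not the code from `impactApi.ts`: the row shape, the `sumByCountry` name, and the injected `lookup` parameter are assumptions made for illustration. In the real handler the lookup would presumably be `getCountryForTimezone` from `countries-and-timezones`, which returns null for generic zones such as `UTC`.

```typescript
// One row as read from the daily_timezone matview (column names are assumed).
interface TimezoneRow {
  timezone: string | null;
  count: number;
}

// Minimal country shape; the real lookup may return more fields.
interface Country {
  id: string; // country code, e.g. 'US'
  name: string;
}

// Injected so the aggregation does not depend on a specific package.
type CountryLookup = (timezone: string) => Country | null;

interface CountryCount {
  country: string;
  countryCode: string | null;
  count: number;
}

// Sum timezone-level counts per country.
// Timezones the lookup cannot resolve are grouped under "Unknown".
function sumByCountry(rows: TimezoneRow[], lookup: CountryLookup): CountryCount[] {
  const totals = new Map<string, CountryCount>();
  for (const row of rows) {
    const match = row.timezone ? lookup(row.timezone) : null;
    const key = match ? match.id : 'Unknown';
    const entry = totals.get(key) ?? {
      country: match ? match.name : 'Unknown',
      countryCode: match ? match.id : null,
      count: 0,
    };
    entry.count += row.count;
    totals.set(key, entry);
  }
  // Largest totals first.
  return Array.from(totals.values()).sort((a, b) => b.count - a.count);
}
```

Passing the lookup in as a parameter keeps the aggregation testable with a stub and leaves the choice of mapping library to the caller.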
Caveat: 16.5% of rows have generic timezones like `UTC` or `Etc/GMT-7` that can't resolve to a country (shown as "Unknown"). This matches the current behavior: the same rows already have `country IS NULL` today because `getCountryForTimezone('UTC')` returns null.

Change 4: Drop `title` Column

The `title` column stores the browser's `<title>` tag (e.g. "My Article · PubPub"). It is:

- Overlap with `pageTitle` / `pubTitle` / `collectionTitle`: only 0.3%
- Possibly relevant only for `event='other'` rows (user profiles, about pages), but these aren't surfaced on the dashboard

Team decision: Drop it. Not needed.
Change 5: Drop `isProd`, Remove Env Gate, Filter at Import

The architecture has changed: each deployment now writes to its own isolated local database. There is no longer a shared database where prod and dev data mix. Therefore:

- Remove the `env.PUBPUB_PRODUCTION` / `env.NODE_ENV` gate from `api.ts`: every deployment writes freely to its own local DB
- Import only rows where `isProd = true`: this ensures the historical data is clean
- Drop the `isProd` column from the table: going forward, all data in a database is valid by definition
- Remove `isProd` from the client payload and Zod schema

Change 6: Timestamp Validation at Ingestion
Add a server-side check in the ingestion handler (`api.ts`) to reject events with unreasonable timestamps. This prevents:

Why 30 days? Analytics beacons can be delayed (offline devices, queued requests), but anything older than 30 days is likely bad data. Anything in the future is definitely wrong.
Behavior: Silently returns 204 (same as success) — no error surfaced to the client, the event is just not stored.
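The check and the silent-204 behavior can be sketched as below. This is a minimal illustration under stated assumptions, not the handler in `api.ts`: the names `isReasonableTimestamp` and `ingestEvent`, the epoch-millisecond `timestamp` field, and the `store` callback are all hypothetical.

```typescript
const MAX_EVENT_AGE_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// True when the timestamp is not in the future and not older than 30 days.
function isReasonableTimestamp(timestampMs: number, nowMs: number): boolean {
  if (!Number.isFinite(timestampMs)) return false; // NaN or Infinity
  if (timestampMs > nowMs) return false; // future events are rejected
  return nowMs - timestampMs <= MAX_EVENT_AGE_MS;
}

// Assumed payload shape: event time as epoch milliseconds.
interface IncomingEvent {
  timestamp: number;
}

// Stores the event only when its timestamp passes the check.
// The status is 204 either way, so the client cannot tell the difference.
function ingestEvent(
  event: IncomingEvent,
  store: (event: IncomingEvent) => void,
  nowMs: number = Date.now(),
): number {
  if (isReasonableTimestamp(event.timestamp, nowMs)) {
    store(event);
  }
  return 204;
}
```

This sketch rejects any future timestamp outright, matching the statement that future events are definitely wrong. A real handler might choose to tolerate a small amount of client clock skew, which would be a separate decision.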
Matview Updates Summary
`daily_page`: JOIN for titles

`daily_country` → renamed to `daily_timezone`

The API handler uses the `countries-and-timezones` npm package to map timezone → country and sums the counts in JS.

`daily_campaign`: no change (UTM columns kept)

Other matviews

`daily_summary`, `daily_pub`, `daily_collection`, `daily_referrer`, `daily_device`: all reference `timestamp`, which becomes `createdAt`. No other changes needed beyond the column rename.

Columns: Before and After
Dropped (16 columns)
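| Column | Replacement / source |
| --- | --- |
| `createdAt` (old) | replaced by `timestamp` |
| `title` | not needed |
| `country` | `timezone`, server-side via npm package |
| `countryCode` | `timezone`, server-side via npm package |
| `isProd` | filtered at import |
| `pubTitle` | `Pubs` |
| `pubSlug` | `Pubs` |
| `collectionTitle` | `Collections` |
| `collectionSlug` | `Collections` |
| `collectionKind` | `Collections` |
| `collectionIds` | `CollectionPubs` |
| `primaryCollectionId` | `CollectionPubs` / `Pubs` |
| `pageTitle` | `Pages` |
| `pageSlug` | `Pages` |
| `communityName` | `Communities` |
| `communitySubdomain` | `Communities` |

Renamed (1 column)

`timestamp` → `createdAt`

Kept (27 columns)

`id`, `type`, `event`, `createdAt` (formerly `timestamp`), `referrer`, `isUnique`, `search`, `utmSource`, `utmMedium`, `utmCampaign` (used by `daily_campaign`), `utmTerm`, `utmContent`, `timezone`, `locale`, `userAgent`, `os`, `communityId`, `pubId`, `collectionId`, `pageId`, `url`, `hash`, `height`, `width`, `path`, `release`, `format`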
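Read together, the column lists describe a per-row transform for the historical import. The sketch below is one hedged way to express it. `cleanLegacyRow`, the `Row` type, and the assumption that rows arrive as plain objects are illustrative and not taken from the PR's code; only the column names come from the lists in this description.

```typescript
// Columns removed by this PR. The old createdAt is not listed here because
// it is overwritten by the rename below.
const DROPPED_COLUMNS: string[] = [
  'title', 'country', 'countryCode', 'isProd',
  'pubTitle', 'pubSlug',
  'collectionTitle', 'collectionSlug', 'collectionKind',
  'collectionIds', 'primaryCollectionId',
  'pageTitle', 'pageSlug',
  'communityName', 'communitySubdomain',
];

type Row = Record<string, unknown>;

// Convert one legacy row to the cleaned shape, or return null to skip it.
function cleanLegacyRow(legacy: Row): Row | null {
  // Filter at import: only production rows are carried over.
  if (legacy['isProd'] !== true) return null;

  const cleaned: Row = { ...legacy };
  // The real event time moves into the standard column name,
  // replacing the old import-time createdAt value.
  cleaned['createdAt'] = legacy['timestamp'];
  delete cleaned['timestamp'];
  for (const column of DROPPED_COLUMNS) {
    delete cleaned[column];
  }
  return cleaned;
}
```

Columns not named in the dropped list pass through unchanged, so the kept columns need no explicit handling in this sketch.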