feat: TableauUpload (new Odin job for publishing data to Tableau)#78

Merged
ealexa05 merged 24 commits into main from tableau_test
Jan 30, 2026

@ealexa05 ealexa05 commented Dec 4, 2025

Summary

This PR introduces a new Odin job class (TableauUpload), which coordinates incremental batch uploading of Cubic ODS data to Tableau.

During a run, Odin iterates through the TABLES_TO_SYNC list and adds a TableauUpload job to the schedule for each table. Each job then:

  1. Downloads the high watermark checkpoint from s3://{bucket}/odin/state/tableau_checkpoints.json if available
  2. Scans S3 for all parquet files across partitions (e.g., odin_year=2023/, odin_year=2024/)
  3. Filters found data by metadata to only download files which contain new records
  4. Applies table-specific rules for type casting and column dropping using Polars LazyFrames
  5. Constructs hyper files and publishes to Tableau in batches
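
As a rough illustration of step 3, the metadata filter can be sketched as below. This is a minimal sketch, not the code in this PR: `ParquetFileStats` and `filter_files_by_watermark` are hypothetical names, and the per-file max of index_column is assumed to have already been read from the parquet footer. The sample values are taken from the test logs further down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParquetFileStats:
    """Hypothetical summary of one parquet file's footer statistics."""

    path: str
    index_max: int  # max value of index_column recorded in the file metadata


def filter_files_by_watermark(
    files: list[ParquetFileStats], watermark: Optional[int]
) -> list[ParquetFileStats]:
    """Keep only the files that may hold rows newer than the watermark."""
    if watermark is None:
        # No checkpoint yet: every file is needed for the initial load.
        return list(files)
    # A file whose max is <= watermark cannot contain new rows, so skip it.
    return [f for f in files if f.index_max > watermark]


files = [
    ParquetFileStats("odin_year=2024/year_001.parquet", 827909),
    ParquetFileStats("odin_year=2025/year_001.parquet", 1075187),
]
selected = filter_files_by_watermark(files, watermark=1070000)
print([f.path for f in selected])  # ['odin_year=2025/year_001.parquet']
```

Only the files that survive this filter need to be downloaded; the row-level filter (index_column > watermark) is still applied afterwards, since an included file can also contain older rows.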

Each job runs on a recurring schedule, based on its exit status:

  • reruns 4 hours after a successful sync that uploaded data
  • reruns 12 hours after a run that found no new data
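
The rerun rule above amounts to something like the following. A minimal sketch only: the constant and function names are hypothetical, and the real job may decide on something other than a row count.

```python
# Hypothetical constant names; the 4 h / 12 h values come from the description above.
SYNCED_DELAY_MINS = 4 * 60  # rerun 4 hours after a sync that uploaded data
NO_DATA_DELAY_MINS = 12 * 60  # rerun 12 hours after a run that found nothing new


def next_run_delay_mins(rows_synced: int) -> int:
    """Pick the delay before the next run from the outcome of the current one."""
    return SYNCED_DELAY_MINS if rows_synced > 0 else NO_DATA_DELAY_MINS


print(next_run_delay_mins(5166))  # 240
print(next_run_delay_mins(0))  # 720
```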

Adding and configuring tables

To start syncing a new table, add the table name to TABLES_TO_SYNC and, if applicable, to TABLE_CONFIG.

TABLE_CONFIG includes:

  • casts: Column type overrides (e.g., force token_id to Int64)
  • drops: Columns to exclude (e.g., sensitive data like restricted_purse_id, or just unneeded columns to cut size)
  • index_column: The monotonically increasing column used for watermarking
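
For illustration, a configuration along these lines might look as follows. This is a sketch under assumptions: the exact value types (for example, whether casts holds dtype names or Polars dtypes) are not shown in this description, the pairing of token_id and restricted_purse_id with EDW.ABP_TAP is only an example, and `validate_table_config` is a hypothetical helper, not part of the PR.

```python
TABLES_TO_SYNC = ["EDW.ABP_TAP", "EDW.JOURNAL_ENTRY"]

# Example shape only; which columns belong to which table is assumed here.
TABLE_CONFIG = {
    "EDW.ABP_TAP": {
        "casts": {"token_id": "Int64"},  # column type overrides
        "drops": ["restricted_purse_id"],  # columns excluded from the upload
    },
    "EDW.JOURNAL_ENTRY": {
        "index_column": "journal_entry_key",  # column used for watermarking
    },
}

ALLOWED_KEYS = {"casts", "drops", "index_column"}


def validate_table_config(config: dict, tables: list[str]) -> list[str]:
    """Return a list of human-readable problems (empty if the config looks sane)."""
    problems = []
    for table, options in config.items():
        if table not in tables:
            problems.append(f"{table}: configured but not in TABLES_TO_SYNC")
        unknown = sorted(set(options) - ALLOWED_KEYS)
        if unknown:
            problems.append(f"{table}: unknown keys {unknown}")
    return problems


print(validate_table_config(TABLE_CONFIG, TABLES_TO_SYNC))  # []
```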

If you have included an index_column for your table, you can also set a minimum value in s3://{bucket}/odin/state/tableau_checkpoints.json before the first sync, so that the job ignores all data at or below that value. tableau_checkpoints.json is a file mapping table names to high watermarks, e.g. {"EDW.ABP_TAP": 2404993, "EDW.TRAINS": 94128}.
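
To make the checkpoint format concrete, reading and updating a watermark could look roughly like this. A minimal sketch that works on the JSON text directly, so the S3 download and upload are left out; `read_watermark` and `write_watermark` are hypothetical names.

```python
import json
from typing import Optional


def read_watermark(checkpoint_json: str, table: str) -> Optional[int]:
    """Return the stored watermark for a table, or None if there is none yet."""
    checkpoints = json.loads(checkpoint_json) if checkpoint_json else {}
    return checkpoints.get(table)


def write_watermark(checkpoint_json: str, table: str, new_value: int) -> str:
    """Return updated checkpoint JSON with the table's watermark set to new_value."""
    checkpoints = json.loads(checkpoint_json) if checkpoint_json else {}
    checkpoints[table] = new_value
    return json.dumps(checkpoints)


state = '{"EDW.ABP_TAP": 2404993, "EDW.TRAINS": 94128}'
print(read_watermark(state, "EDW.ABP_TAP"))  # 2404993
print(read_watermark(state, "EDW.JOURNAL_ENTRY"))  # None
state = write_watermark(state, "EDW.JOURNAL_ENTRY", 1075187)
print(read_watermark(state, "EDW.JOURNAL_ENTRY"))  # 1075187
```

A missing entry (or a missing file) is treated as "no watermark", which corresponds to the initial full load seen in the second test run below.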

Dependencies

This job introduces 2 additional dependencies:

  • tableauhyperapi (for building hyper extract files)
  • tableauserverclient (for publishing to the Tableau server)

Unit tests

These tests assert that the job correctly handles:

  • type conversion (including both explicitly supported and unknown types)
  • hyper file creation and schema conversion
  • nested project id resolution
  • TABLE_CONFIG structure

Change history

Change log for 4364f75

Rather than overwriting by default, the pipeline now checks the state of the table on S3 and appends only new data to the existing Tableau data source. This works as follows:

  • When each batched hyper file is successfully sent to Tableau, the max value of index_column is recorded to s3://{bucket}/odin/state/tableau_checkpoints.json under the table name
  • If new data is added to a synced S3 file, only the new data (per index_column) is uploaded

Also, the pipeline now points to tables rather than individual files. For each table, it finds all contained partitions and files and scans parquet metadata to determine whether each file contains relevant data based on index_column.

Additionally:

  • Synced files are split into separate hyper files if they are larger than BATCH_SIZE (500_000)
  • Casting and dropping of columns is now handled in a lazyframe for memory efficiency
  • index_column can now be defined in TABLE_CONFIG. Index must be numeric and monotonically increasing
  • --overwrite-table can be used to force a whole table to resync and overwrite
  • You can also manually provide a value in s3://{bucket}/odin/state/tableau_checkpoints.json to start syncing after that point
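
The batching described above can be sketched as follows. This is an illustration, not the PR's implementation: `plan_batches` is a hypothetical helper, and the publish-mode rule (first batch overwrites when there is no watermark or an overwrite is forced, everything else appends) is inferred from the test logs below rather than confirmed by the code.

```python
from typing import Iterator, Optional

BATCH_SIZE = 500_000


def plan_batches(
    total_rows: int,
    watermark: Optional[int],
    batch_size: int = BATCH_SIZE,
    overwrite: bool = False,
) -> Iterator[tuple[int, int, int, str]]:
    """Yield (batch_num, offset, batch_end, mode) for each hyper file to publish."""
    for batch_num, offset in enumerate(range(0, total_rows, batch_size), start=1):
        batch_end = min(offset + batch_size, total_rows)
        # Assumed rule: only the first batch of a full load replaces the data source.
        full_load = watermark is None or overwrite
        mode = "Overwrite" if (batch_num == 1 and full_load) else "Append"
        yield batch_num, offset, batch_end, mode


print(list(plan_batches(5166, watermark=1070000)))
# [(1, 0, 5166, 'Append')]
print(list(plan_batches(1_200_000, watermark=None)))
# [(1, 0, 500000, 'Overwrite'), (2, 500000, 1000000, 'Append'), (3, 1000000, 1200000, 'Append')]
```

Recording the watermark after each successfully published batch, rather than once at the end, means a failure partway through only requires re-sending the batches that were not yet acknowledged.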

Change log for e075e27

Implements TableauUpload, which extends OdinJob, allowing Odin to run Tableau uploads on a schedule. The structure is based on ods_fact.py. Key features:

  • Job scheduler sets up each table on its own retry schedule (with shorter reruns for tables that synced data during the last run)
  • Tableau upload jobs run only if configured in run.py
  • Errors are handled appropriately

Change log for a93d078

Overhaul of logging functions to use ProcessLog consistently (and in line with other Odin jobs).

  • You can now filter by uuid in Splunk to get all logs relating to a given iteration of a function
  • Logging conforms to pattern of log.add_metadata / log.add_metadata / log.failed for respective circumstances

Testing plan:

  1. Modify time constants to something low so I can observe rerun behavior
  2. Start job for table=EDW.JOURNAL_ENTRY, allow to complete initial transfer of data
  3. Allow to rerun. Likely no new data will have arrived, as dev tables are updated less often. Confirm that no new data is synced
  4. Before next run, update watermark in s3://mbta-ctd-dataplatform-dev-springboard/odin/state/tableau_checkpoints.json to a lower value
  5. Allow to rerun again. See that it syncs over data newer than updated watermark as expected
  6. Delete data source from Tableau and watermark from json state file
Logs for testing run 1: Testing job rerun and incremental updating
➜  odin git:(odin-tableau-job) ✗ poetry run python src/odin/run.py                                                                              
2026-01-08T16:57:04-0500     INFO uuid=e0a197b6-9f10-402a-837b-ca7f4a84b735, parent=odin, process=validate_env_vars, process_id=3920, status=started, disk_free_mb=271915, sys_mem_free_pct=17, proc_mem_used_mb=114
2026-01-08T16:57:04-0500     INFO uuid=e0a197b6-9f10-402a-837b-ca7f4a84b735, parent=odin, process=validate_env_vars, process_id=3920, status=complete, disk_free_mb=271915, sys_mem_free_pct=17, proc_mem_used_mb=114, duration=0.00, DATA_ERROR=mbta-ctd-dataplatform-dev-error, DATA_SPRINGBOARD=mbta-ctd-dataplatform-dev-springboard, DATA_INCOMING=mbta-ctd-dataplatform-dev-incoming, AFC_API_CLIENT_ID=**********, AFC_API_CLIENT_SECRET=**********, DATA_ARCHIVE=mbta-ctd-dataplatform-dev-archive
2026-01-08T16:57:04-0500     INFO uuid=7e858d91-8a9b-48d3-98b3-5e80c67e9d44, parent=odin, process=odin_event_loop, process_id=3920, status=started, disk_free_mb=271915, sys_mem_free_pct=17, proc_mem_used_mb=114
2026-01-08T16:57:05-0500     INFO uuid=f667685a-1e23-4e32-b6ab-de05da138c34, parent=odin, process=TableauUpload, process_id=3926, status=started, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=114, table=EDW.JOURNAL_ENTRY
2026-01-08T16:57:05-0500     INFO Fetching watermark for EDW.JOURNAL_ENTRY from s3://mbta-ctd-dataplatform-dev-springboard/odin/state/tableau_checkpoints.json
2026-01-08T16:57:05-0500     INFO Current watermark: 1070000
2026-01-08T16:57:05-0500     INFO uuid=217fa607-3ad0-4748-b41b-6cd42ef42beb, parent=odin, process=discover_partitions, process_id=3926, status=started, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=133, table=EDW.JOURNAL_ENTRY
2026-01-08T16:57:05-0500     INFO Discovering partitions under s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/
2026-01-08T16:57:05-0500     INFO uuid=0af1bac2-5598-40b8-a425-5fa961bf7141, parent=odin, process=list_partitions, process_id=3926, status=started, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=133, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/, max_objects=10000
2026-01-08T16:57:05-0500     INFO uuid=0af1bac2-5598-40b8-a425-5fa961bf7141, parent=odin, process=list_partitions, process_id=3926, status=complete, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=140, duration=0.19, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/, max_objects=10000, partitions_found=2
2026-01-08T16:57:05-0500     INFO Found 2 partition(s): ['odin_year=2024', 'odin_year=2025']
2026-01-08T16:57:05-0500     INFO uuid=b21bf878-6422-4af0-b404-476a18a8ea38, parent=odin, process=list_objects, process_id=3926, status=started, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=140, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/, max_objects=1000000, in_filter=.parquet
2026-01-08T16:57:05-0500     INFO uuid=b21bf878-6422-4af0-b404-476a18a8ea38, parent=odin, process=list_objects, process_id=3926, status=complete, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=140, duration=0.03, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/, max_objects=1000000, in_filter=.parquet, objects_found=1
2026-01-08T16:57:05-0500     INFO uuid=947d24c3-356c-4529-8da1-3812a6de0019, parent=odin, process=list_objects, process_id=3926, status=started, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=140, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/, max_objects=1000000, in_filter=.parquet
2026-01-08T16:57:05-0500     INFO uuid=947d24c3-356c-4529-8da1-3812a6de0019, parent=odin, process=list_objects, process_id=3926, status=complete, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=140, duration=0.04, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/, max_objects=1000000, in_filter=.parquet, objects_found=1
2026-01-08T16:57:05-0500     INFO Total parquet files discovered: 2
2026-01-08T16:57:05-0500     INFO uuid=15a998ac-f50e-405c-a4a5-3fb757c46ae1, parent=odin, process=filter_files_by_metadata, process_id=3926, status=started, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=140, table=EDW.JOURNAL_ENTRY, watermark=1070000
2026-01-08T16:57:05-0500     INFO Filtering files using metadata where journal_entry_key max > 1070000
2026-01-08T16:57:05-0500     INFO Skip s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/year_001.parquet: max=827909 <= watermark=1070000
2026-01-08T16:57:06-0500     INFO Include s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/year_001.parquet: max=1075187 > watermark=1070000
2026-01-08T16:57:06-0500     INFO Files after metadata filtering: 1 of 2
2026-01-08T16:57:06-0500     INFO uuid=71decd24-ca31-494c-ab65-28d7933ecca2, parent=odin, process=load_filtered_data, process_id=3926, status=started, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=147, table=EDW.JOURNAL_ENTRY, file_count=1
2026-01-08T16:57:06-0500     INFO Loading data from 1 S3 files
2026-01-08T16:57:06-0500     INFO Applying filter: journal_entry_key > 1070000
2026-01-08T16:57:06-0500     INFO uuid=a3e1a750-cb21-4144-abb8-73ce8e698e52, parent=odin, process=process_batches, process_id=3926, status=started, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=147, table=EDW.JOURNAL_ENTRY, total_rows=5166
2026-01-08T16:57:06-0500     INFO Processing 5166 rows for EDW.JOURNAL_ENTRY in batches of 500000
2026-01-08T16:57:06-0500     INFO uuid=d646751e-73f2-4ce6-bc13-489395b4334c, parent=odin, process=process_batch, process_id=3926, status=started, disk_free_mb=271915, sys_mem_free_pct=15, proc_mem_used_mb=147, table=EDW.JOURNAL_ENTRY, batch_num=1, offset=0, batch_end=5166
2026-01-08T16:57:06-0500     INFO Preparing batch 1 (rows 0 to 5166)
2026-01-08T16:57:07-0500     INFO Hyper extract contains 5166 rows
2026-01-08T16:57:07-0500     INFO Publishing batch 1 to Tableau (Mode: Append)
2026-01-08T16:57:11-0500     INFO Updated watermark for EDW.JOURNAL_ENTRY to 1075187
2026-01-08T16:57:11-0500     INFO uuid=f667685a-1e23-4e32-b6ab-de05da138c34, parent=odin, process=TableauUpload, process_id=3926, status=complete, disk_free_mb=271926, sys_mem_free_pct=16, proc_mem_used_mb=188, duration=5.96, table=EDW.JOURNAL_ENTRY, run_delay_mins=2.00, overwrite=False
2026-01-08T16:59:12-0500     INFO uuid=78b42042-5586-4c15-8aea-6e9fdf9ab530, parent=odin, process=TableauUpload, process_id=4222, status=started, disk_free_mb=271919, sys_mem_free_pct=15, proc_mem_used_mb=114, table=EDW.JOURNAL_ENTRY
2026-01-08T16:59:12-0500     INFO Fetching watermark for EDW.JOURNAL_ENTRY from s3://mbta-ctd-dataplatform-dev-springboard/odin/state/tableau_checkpoints.json
2026-01-08T16:59:12-0500     INFO Current watermark: 1075187
2026-01-08T16:59:12-0500     INFO uuid=a37aef1a-2270-470b-afe3-9b0f5cb8f2c8, parent=odin, process=discover_partitions, process_id=4222, status=started, disk_free_mb=271919, sys_mem_free_pct=14, proc_mem_used_mb=122, table=EDW.JOURNAL_ENTRY
2026-01-08T16:59:12-0500     INFO Discovering partitions under s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/
2026-01-08T16:59:12-0500     INFO uuid=55387621-4c28-49fe-aec7-79493578e589, parent=odin, process=list_partitions, process_id=4222, status=started, disk_free_mb=271919, sys_mem_free_pct=14, proc_mem_used_mb=122, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/, max_objects=10000
2026-01-08T16:59:12-0500     INFO uuid=55387621-4c28-49fe-aec7-79493578e589, parent=odin, process=list_partitions, process_id=4222, status=complete, disk_free_mb=271919, sys_mem_free_pct=14, proc_mem_used_mb=133, duration=0.20, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/, max_objects=10000, partitions_found=2
2026-01-08T16:59:12-0500     INFO Found 2 partition(s): ['odin_year=2024', 'odin_year=2025']
2026-01-08T16:59:12-0500     INFO uuid=e12473a1-91d2-471e-b900-318472327659, parent=odin, process=list_objects, process_id=4222, status=started, disk_free_mb=271919, sys_mem_free_pct=14, proc_mem_used_mb=133, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/, max_objects=1000000, in_filter=.parquet
2026-01-08T16:59:12-0500     INFO uuid=e12473a1-91d2-471e-b900-318472327659, parent=odin, process=list_objects, process_id=4222, status=complete, disk_free_mb=271919, sys_mem_free_pct=14, proc_mem_used_mb=133, duration=0.04, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/, max_objects=1000000, in_filter=.parquet, objects_found=1
2026-01-08T16:59:12-0500     INFO uuid=5831f38a-70d8-4c66-a480-68ae496f2299, parent=odin, process=list_objects, process_id=4222, status=started, disk_free_mb=271919, sys_mem_free_pct=14, proc_mem_used_mb=133, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/, max_objects=1000000, in_filter=.parquet
2026-01-08T16:59:12-0500     INFO uuid=5831f38a-70d8-4c66-a480-68ae496f2299, parent=odin, process=list_objects, process_id=4222, status=complete, disk_free_mb=271919, sys_mem_free_pct=14, proc_mem_used_mb=130, duration=0.04, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/, max_objects=1000000, in_filter=.parquet, objects_found=1
2026-01-08T16:59:12-0500     INFO Total parquet files discovered: 2
2026-01-08T16:59:12-0500     INFO uuid=9d216c1c-23cf-4cbe-bb55-73d74874bb03, parent=odin, process=filter_files_by_metadata, process_id=4222, status=started, disk_free_mb=271919, sys_mem_free_pct=14, proc_mem_used_mb=130, table=EDW.JOURNAL_ENTRY, watermark=1075187
2026-01-08T16:59:12-0500     INFO Filtering files using metadata where journal_entry_key max > 1075187
2026-01-08T16:59:13-0500     INFO Skip s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/year_001.parquet: max=827909 <= watermark=1075187
2026-01-08T16:59:13-0500     INFO Skip s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/year_001.parquet: max=1075187 <= watermark=1075187
2026-01-08T16:59:13-0500     INFO Files after metadata filtering: 0 of 2
2026-01-08T16:59:13-0500     INFO No files contain data above watermark for EDW.JOURNAL_ENTRY
2026-01-08T16:59:13-0500     INFO uuid=78b42042-5586-4c15-8aea-6e9fdf9ab530, parent=odin, process=TableauUpload, process_id=4222, status=complete, disk_free_mb=271918, sys_mem_free_pct=14, proc_mem_used_mb=137, duration=1.08, table=EDW.JOURNAL_ENTRY, run_delay_mins=1.00, overwrite=False
2026-01-08T17:00:14-0500     INFO uuid=8f2f75f4-0c6f-40d3-8306-2c6bdd15f3e3, parent=odin, process=TableauUpload, process_id=4396, status=started, disk_free_mb=271919, sys_mem_free_pct=16, proc_mem_used_mb=114, table=EDW.JOURNAL_ENTRY
2026-01-08T17:00:14-0500     INFO Fetching watermark for EDW.JOURNAL_ENTRY from s3://mbta-ctd-dataplatform-dev-springboard/odin/state/tableau_checkpoints.json
2026-01-08T17:00:14-0500     INFO Current watermark: 1070000
2026-01-08T17:00:14-0500     INFO uuid=179fa00c-de4c-42ca-a3ab-bef0b614a104, parent=odin, process=discover_partitions, process_id=4396, status=started, disk_free_mb=271919, sys_mem_free_pct=16, proc_mem_used_mb=130, table=EDW.JOURNAL_ENTRY
2026-01-08T17:00:14-0500     INFO Discovering partitions under s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/
2026-01-08T17:00:14-0500     INFO uuid=f7698cb2-7446-4eab-88d1-46d88ae095c9, parent=odin, process=list_partitions, process_id=4396, status=started, disk_free_mb=271919, sys_mem_free_pct=16, proc_mem_used_mb=130, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/, max_objects=10000
2026-01-08T17:00:14-0500     INFO uuid=f7698cb2-7446-4eab-88d1-46d88ae095c9, parent=odin, process=list_partitions, process_id=4396, status=complete, disk_free_mb=271919, sys_mem_free_pct=16, proc_mem_used_mb=142, duration=0.19, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/, max_objects=10000, partitions_found=2
2026-01-08T17:00:14-0500     INFO Found 2 partition(s): ['odin_year=2024', 'odin_year=2025']
2026-01-08T17:00:14-0500     INFO uuid=787d6211-3de4-4327-96a0-6baa3b40c05d, parent=odin, process=list_objects, process_id=4396, status=started, disk_free_mb=271919, sys_mem_free_pct=16, proc_mem_used_mb=142, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/, max_objects=1000000, in_filter=.parquet
2026-01-08T17:00:14-0500     INFO uuid=787d6211-3de4-4327-96a0-6baa3b40c05d, parent=odin, process=list_objects, process_id=4396, status=complete, disk_free_mb=271919, sys_mem_free_pct=16, proc_mem_used_mb=142, duration=0.05, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/, max_objects=1000000, in_filter=.parquet, objects_found=1
2026-01-08T17:00:14-0500     INFO uuid=2f09b93c-f18f-41c1-bdba-9f52a7df2ad9, parent=odin, process=list_objects, process_id=4396, status=started, disk_free_mb=271919, sys_mem_free_pct=16, proc_mem_used_mb=142, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/, max_objects=1000000, in_filter=.parquet
2026-01-08T17:00:14-0500     INFO uuid=2f09b93c-f18f-41c1-bdba-9f52a7df2ad9, parent=odin, process=list_objects, process_id=4396, status=complete, disk_free_mb=271919, sys_mem_free_pct=16, proc_mem_used_mb=139, duration=0.04, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/, max_objects=1000000, in_filter=.parquet, objects_found=1
2026-01-08T17:00:14-0500     INFO Total parquet files discovered: 2
2026-01-08T17:00:14-0500     INFO uuid=8513611f-2514-4b90-aa7e-c39970679e70, parent=odin, process=filter_files_by_metadata, process_id=4396, status=started, disk_free_mb=271919, sys_mem_free_pct=16, proc_mem_used_mb=139, table=EDW.JOURNAL_ENTRY, watermark=1070000
2026-01-08T17:00:14-0500     INFO Filtering files using metadata where journal_entry_key max > 1070000
2026-01-08T17:00:15-0500     INFO Skip s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/year_001.parquet: max=827909 <= watermark=1070000
2026-01-08T17:00:15-0500     INFO Include s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/year_001.parquet: max=1075187 > watermark=1070000
2026-01-08T17:00:15-0500     INFO Files after metadata filtering: 1 of 2
2026-01-08T17:00:15-0500     INFO uuid=ae0a7d41-1417-4f55-a486-a0dfd9db53f8, parent=odin, process=load_filtered_data, process_id=4396, status=started, disk_free_mb=271919, sys_mem_free_pct=15, proc_mem_used_mb=143, table=EDW.JOURNAL_ENTRY, file_count=1
2026-01-08T17:00:15-0500     INFO Loading data from 1 S3 files
2026-01-08T17:00:15-0500     INFO Applying filter: journal_entry_key > 1070000
2026-01-08T17:00:16-0500     INFO uuid=3697999d-b944-4b30-ba3f-662fdf82b56e, parent=odin, process=process_batches, process_id=4396, status=started, disk_free_mb=271919, sys_mem_free_pct=15, proc_mem_used_mb=157, table=EDW.JOURNAL_ENTRY, total_rows=5166
2026-01-08T17:00:16-0500     INFO Processing 5166 rows for EDW.JOURNAL_ENTRY in batches of 500000
2026-01-08T17:00:16-0500     INFO uuid=17c30fc2-6799-425e-8b9f-6752b34f8928, parent=odin, process=process_batch, process_id=4396, status=started, disk_free_mb=271919, sys_mem_free_pct=15, proc_mem_used_mb=157, table=EDW.JOURNAL_ENTRY, batch_num=1, offset=0, batch_end=5166
2026-01-08T17:00:16-0500     INFO Preparing batch 1 (rows 0 to 5166)
2026-01-08T17:00:16-0500     INFO Hyper extract contains 5166 rows
2026-01-08T17:00:16-0500     INFO Publishing batch 1 to Tableau (Mode: Append)
2026-01-08T17:00:21-0500     INFO Updated watermark for EDW.JOURNAL_ENTRY to 1075187
2026-01-08T17:00:21-0500     INFO uuid=8f2f75f4-0c6f-40d3-8306-2c6bdd15f3e3, parent=odin, process=TableauUpload, process_id=4396, status=complete, disk_free_mb=271921, sys_mem_free_pct=15, proc_mem_used_mb=235, duration=6.70, table=EDW.JOURNAL_ENTRY, run_delay_mins=2.00, overwrite=False
...
Logs for testing run 2: Deleting Tableau table and watermark and rerunning
➜  odin git:(odin-tableau-job) ✗ poetry run python src/odin/ingestion/tableau/tableau_upload_test.py
2026-01-09T11:51:19-0500     INFO uuid=7bb338f5-f8fe-4d3b-ae83-b554f804e951, parent=odin, process=TableauUpload, process_id=31579, status=started, disk_free_mb=268748, sys_mem_free_pct=17, proc_mem_used_mb=99, table=EDW.JOURNAL_ENTRY
2026-01-09T11:51:19-0500     INFO Fetching watermark for EDW.JOURNAL_ENTRY from s3://mbta-ctd-dataplatform-dev-springboard/odin/state/tableau_checkpoints.json
2026-01-09T11:51:19-0500     INFO Current watermark: None
2026-01-09T11:51:19-0500     INFO uuid=538dd11f-fc95-4916-8336-ca85d51e1ad1, parent=odin, process=discover_partitions, process_id=31579, status=started, disk_free_mb=268748, sys_mem_free_pct=17, proc_mem_used_mb=119, table=EDW.JOURNAL_ENTRY
2026-01-09T11:51:19-0500     INFO Discovering partitions under s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/
2026-01-09T11:51:19-0500     INFO uuid=97576ad5-8f4c-4061-9231-994f7e23c8e6, parent=odin, process=list_partitions, process_id=31579, status=started, disk_free_mb=268748, sys_mem_free_pct=17, proc_mem_used_mb=119, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/, max_objects=10000
2026-01-09T11:51:19-0500     INFO uuid=97576ad5-8f4c-4061-9231-994f7e23c8e6, parent=odin, process=list_partitions, process_id=31579, status=complete, disk_free_mb=268748, sys_mem_free_pct=17, proc_mem_used_mb=131, duration=0.21, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/, max_objects=10000, partitions_found=2
2026-01-09T11:51:19-0500     INFO Found 2 partition(s): ['odin_year=2024', 'odin_year=2025']
2026-01-09T11:51:19-0500     INFO uuid=43423430-8946-43c9-b6de-57c7823f0fe4, parent=odin, process=list_objects, process_id=31579, status=started, disk_free_mb=268748, sys_mem_free_pct=17, proc_mem_used_mb=131, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/, max_objects=1000000, in_filter=.parquet
2026-01-09T11:51:19-0500     INFO uuid=43423430-8946-43c9-b6de-57c7823f0fe4, parent=odin, process=list_objects, process_id=31579, status=complete, disk_free_mb=268748, sys_mem_free_pct=17, proc_mem_used_mb=131, duration=0.04, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2024/, max_objects=1000000, in_filter=.parquet, objects_found=1
2026-01-09T11:51:19-0500     INFO uuid=d073200b-a934-4b1a-84c6-0ace3c84e0b3, parent=odin, process=list_objects, process_id=31579, status=started, disk_free_mb=268748, sys_mem_free_pct=17, proc_mem_used_mb=131, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/, max_objects=1000000, in_filter=.parquet
2026-01-09T11:51:19-0500     INFO uuid=d073200b-a934-4b1a-84c6-0ace3c84e0b3, parent=odin, process=list_objects, process_id=31579, status=complete, disk_free_mb=268748, sys_mem_free_pct=17, proc_mem_used_mb=131, duration=0.04, partition=s3://mbta-ctd-dataplatform-dev-springboard/odin/data/cubic/ods/EDW.JOURNAL_ENTRY/odin_year=2025/, max_objects=1000000, in_filter=.parquet, objects_found=1
2026-01-09T11:51:19-0500     INFO Total parquet files discovered: 2
2026-01-09T11:51:19-0500     INFO uuid=9c83d1d3-f923-44c5-b581-e91c0dc072a3, parent=odin, process=load_filtered_data, process_id=31579, status=started, disk_free_mb=268748, sys_mem_free_pct=17, proc_mem_used_mb=131, table=EDW.JOURNAL_ENTRY, file_count=2
2026-01-09T11:51:19-0500     INFO Loading data from 2 S3 files
2026-01-09T11:51:20-0500     INFO uuid=ef3c6178-319b-4373-8a4f-0e1666529786, parent=odin, process=process_batches, process_id=31579, status=started, disk_free_mb=268747, sys_mem_free_pct=16, proc_mem_used_mb=150, table=EDW.JOURNAL_ENTRY, total_rows=272634
2026-01-09T11:51:20-0500     INFO Processing 272634 rows for EDW.JOURNAL_ENTRY in batches of 500000
2026-01-09T11:51:20-0500     INFO uuid=6eedbdf2-899d-4f9c-a81b-328357c3891a, parent=odin, process=process_batch, process_id=31579, status=started, disk_free_mb=268747, sys_mem_free_pct=16, proc_mem_used_mb=150, table=EDW.JOURNAL_ENTRY, batch_num=1, offset=0, batch_end=272634
2026-01-09T11:51:20-0500     INFO Preparing batch 1 (rows 0 to 272634)
2026-01-09T11:51:22-0500     INFO Hyper extract contains 272634 rows
2026-01-09T11:51:22-0500     INFO Publishing batch 1 to Tableau (Mode: Overwrite)
2026-01-09T11:51:27-0500     INFO Updated watermark for EDW.JOURNAL_ENTRY to 1075187
2026-01-09T11:51:27-0500     INFO uuid=7bb338f5-f8fe-4d3b-ae83-b554f804e951, parent=odin, process=TableauUpload, process_id=31579, status=complete, disk_free_mb=268745, sys_mem_free_pct=15, proc_mem_used_mb=119, duration=8.76, table=EDW.JOURNAL_ENTRY, run_delay_mins=240.00, overwrite=False

Results

  • Initial data transfer completes
  • Job scheduler assigns appropriate resync times to each table individually
  • No data is downloaded from S3 or transmitted to Tableau when no data newer than watermark is available
  • Updating watermark between runs causes script to sync over data newer than updated watermark as expected

Remaining features:

  • Update names for clarity
  • Remove --table arg (also redundant)
  • Remove run as main
  • Write full, human-readable docs at the top that make it easy to understand exactly what this does and how/why (e.g., it solves X problem by having Y config)
  • Add full, human readable docs from above to Notion (https://www.notion.so/mbta-downtown-crossing/Data-Platform-9f78ea9ad675432c87ab08d6d38280c2)
  • Establish naming convention more broadly understandable in data engineering world
  • Overhaul logging functions to use ProcessLog consistently (and in line with other Odin jobs) and unify uuid across all logs relating to a given iteration of a function

Out of scope

  • Add optional whitelist to TABLE_CONFIG to more easily limit synced columns to only those needed
  • (maybe) Add update frequency to TABLE_CONFIG configurable per table
  • Remove --overwrite-table argument (not usable in prod, and redundant with watermark)

Notes

Do not upload to dev or prod yet. Need to coordinate with infra to get new required Tableau variables:

ealexa05 and others added 3 commits January 7, 2026 10:23
* Adding incrementality and batching to pipeline

* Adding handling for partions (#88)
* Refactoring as Odin job

* Adding Tableau job to Odin

* Adding Tableau variables to required startup vars
@ealexa05 ealexa05 changed the title from "Tableau upload test" to "Job for publishing data to Tableau" on Jan 9, 2026
@ealexa05 ealexa05 changed the title from "Job for publishing data to Tableau" to "New Odin job: publishing data to Tableau" on Jan 9, 2026
@ealexa05 ealexa05 changed the title from "New Odin job: publishing data to Tableau" to "TableauUpload (new Odin job for publishing data to Tableau)" on Jan 12, 2026
@ealexa05 ealexa05 requested a review from runkelcorey January 14, 2026 17:06

@runkelcorey runkelcorey left a comment

this looks more or less fine but it leaves me wondering: is there a reason you didn't include unit tests for these functions? the functions that don't do any uploading or downloading could be tested pretty easily and I think mocking the s3 and TSC clients would be pretty straightforward. speaking from LAMP's experience, I would love to have a test suite for our Tableau modules

Co-authored-by: Corey Runkel <39202587+runkelcorey@users.noreply.github.com>

ealexa05 commented Jan 15, 2026

> this looks more or less fine but it leaves me wondering: is there a reason you didn't include unit tests for these functions? the functions that don't do any uploading or downloading could be tested pretty easily and I think mocking the s3 and TSC clients would be pretty straightforward. speaking from LAMP's experience, I would love to have a test suite for our Tableau modules

Just a short cut to get this out sooner, but it's a good point. Working on some basic coverage

@ealexa05 ealexa05 changed the title from "TableauUpload (new Odin job for publishing data to Tableau)" to "feat: TableauUpload (new Odin job for publishing data to Tableau)" on Jan 16, 2026
@ealexa05 ealexa05 requested a review from runkelcorey January 21, 2026 19:20
@ealexa05

@runkelcorey I added unit tests for the major things I think could get disrupted, and updated the logging. Could you give the new changes a look when you have a chance?

@runkelcorey runkelcorey left a comment

these changes look great!

@ealexa05 ealexa05 merged commit e7cfb35 into main Jan 30, 2026
5 checks passed
@ealexa05 ealexa05 deleted the tableau_test branch January 30, 2026 20:04