feat: TableauUpload (new Odin job for publishing data to Tableau) #78
Conversation
* Adding incrementality and batching to pipeline
* Adding handling for partitions (#88)
* Refactoring as Odin job
* Adding Tableau job to Odin
* Adding Tableau variables to required startup vars
runkelcorey left a comment:
This looks more or less fine, but it leaves me wondering: is there a reason you didn't include unit tests for these functions? The functions that don't do any uploading or downloading could be tested pretty easily, and I think mocking the S3 and TSC clients would be pretty straightforward. Speaking from LAMP's experience, I would love to have a test suite for our Tableau modules.
Co-authored-by: Corey Runkel <39202587+runkelcorey@users.noreply.github.com>
Just a shortcut to get this out sooner, but it's a good point. Working on some basic coverage.
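The mocking approach suggested in review could look roughly like this. This is a minimal sketch with `unittest.mock`; `publish_datasource` and its signature are hypothetical stand-ins, not the job's real API:

```python
# Sketch of testing publish logic without a live Tableau server by mocking a
# TSC-like client. The helper below is hypothetical, not the job's real code.
from unittest.mock import MagicMock

def publish_datasource(server, hyper_path: str, project_id: str) -> str:
    """Hypothetical helper: publish a hyper file and return the datasource id."""
    datasource = server.datasources.publish(hyper_path, project_id)
    return datasource.id

# Stand in for the TSC server client with a MagicMock.
server = MagicMock()
server.datasources.publish.return_value.id = "ds-123"

result = publish_datasource(server, "extract.hyper", "proj-1")
assert result == "ds-123"
server.datasources.publish.assert_called_once_with("extract.hyper", "proj-1")
```

The same pattern applies to the S3 client: functions that only transform or route data can be exercised with mocked clients and plain assertions.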
…nversion, nested project id resolution, table config structure
@runkelcorey I added unit tests for the major things I think could get disrupted, and updated the logging. Could you give the new changes a look when you have a chance?
…removing defaults in code
Summary
This PR introduces a new Odin job class (`TableauUpload`), which coordinates incremental batch uploading of Cubic ODS data to Tableau. During a run, Odin iterates through the `TABLES_TO_SYNC` list and sets up a `TableauUpload` job for each in the schedule. Each job then loads its watermark from `s3://{bucket}/odin/state/tableau_checkpoints.json`, if available. Each job runs on a recurring schedule based on its exit status.
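The incremental flow described above can be sketched roughly as follows. Function and variable names here are illustrative assumptions, not the job's real API:

```python
# Rough sketch of incremental batch selection under an assumed watermark model.
from typing import Iterator, Optional, Sequence

BATCH_SIZE = 500_000  # matches the batch size mentioned in this PR

def new_rows(rows: Sequence[dict], index_column: str, watermark: Optional[int]) -> list:
    """Keep only rows above the recorded watermark (all rows if none is set)."""
    if watermark is None:
        return list(rows)
    return [r for r in rows if r[index_column] > watermark]

def batches(rows: Sequence[dict], size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield rows in upload-sized chunks."""
    for start in range(0, len(rows), size):
        yield list(rows[start : start + size])

# Usage: with a watermark of 2, only rows 3 and 4 are selected for upload.
rows = [{"tap_id": i} for i in range(1, 5)]
pending = new_rows(rows, "tap_id", watermark=2)
assert [r["tap_id"] for r in pending] == [3, 4]
new_watermark = max(r["tap_id"] for r in pending)  # would be recorded to the checkpoint file
```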
Adding and configuring tables
To start syncing a new table, add the table name to
TABLES_TO_SYNCand, if applicable, toTABLE_CONFIG.TABLE_CONFIGincludes:casts: Column type overrides (e.g., force token_id to Int64)drops: Columns to exclude (e.g., sensitive data like restricted_purse_id, or just unneeded columns to cut size)index_column: The monotonically increasing column used for watermarkingIf you have included an
index_columnfor your table, you can also specify a minimum value ins3://{bucket}/odin/state/tableau_checkpoints.jsonprior to the first sync so that this job will ignore all data before that point.tableau_checkpoints.jsonis a file containing table names and high watermarks, e.g.{"EDW.ABP_TAP": 2404993, "EDW.TRAINS": 94128}Dependencies
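A sketch of the two shapes described above. The exact key names in `TABLE_CONFIG` are assumptions, and a local file stands in for the S3 checkpoint object:

```python
import json
from pathlib import Path

# Hypothetical shape of a TABLE_CONFIG entry; real key names may differ.
TABLE_CONFIG = {
    "EDW.ABP_TAP": {
        "casts": {"token_id": "Int64"},    # column type overrides
        "drops": ["restricted_purse_id"],  # sensitive/unneeded columns to exclude
        "index_column": "tap_id",          # monotonically increasing watermark column
    },
}

def seed_checkpoint(path: Path, table: str, watermark: int) -> dict:
    """Seed a minimum watermark for `table` so the first sync skips older rows.

    The real file lives at s3://{bucket}/odin/state/tableau_checkpoints.json;
    a local path stands in for S3 here.
    """
    checkpoints = json.loads(path.read_text()) if path.exists() else {}
    checkpoints[table] = watermark
    path.write_text(json.dumps(checkpoints))
    return checkpoints

# Usage: only rows with index_column > 2404993 will be synced on the first run.
ckpt = seed_checkpoint(Path("tableau_checkpoints.json"), "EDW.ABP_TAP", 2404993)
```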
This job introduces 2 additional dependencies:

* `tableauhyperapi` (for building Hyper extract files)
* `tableauserverclient` (for publishing to the Tableau server)

Unit tests
These tests assert that the job correctly handles:

* the `TABLE_CONFIG` structure

Change history
Change log for 4364f75
Rather than overwriting by default, the pipeline now checks the state of the table on S3 and appends only new data to the existing Tableau data source. This works as follows:

* the latest synced value of `index_column` is recorded to `s3://{bucket}/odin/state/tableau_checkpoints.json` under the table name
* only data above that value (per `index_column`) is uploaded

Also, the pipeline now points to tables rather than individual files. For each table, it finds all contained partitions and files and scans parquet metadata to determine whether each file contains relevant data based on `index_column`.

Additionally:

* uploads are batched by `BATCH_SIZE` (500_000)
* `index_column` can now be defined in `TABLE_CONFIG`. Index must be numeric and monotonically increasing
* `--overwrite-table` can be used to force a whole table to resync
* you can overwrite `s3://{bucket}/odin/state/tableau_checkpoints.json` to start syncing after that point

Change log for e075e27
Implements `TableauUpload`, which extends `OdinJob`, allowing Odin to run Tableau uploads on a schedule. Structure is based on `ods_fact.py`. Key features:

Change log for a93d078
Overhaul of logging functions to use `ProcessLog` consistently (and in line with other Odin jobs).

* search by `uuid` in Splunk to get all logs relating to a given iteration of a function
* `log.add_metadata`/`log.failed` for the respective circumstances

Testing plan:
* run with `table=EDW.JOURNAL_ENTRY`, allow to complete initial transfer of data
* set `s3://mbta-ctd-dataplatform-dev-springboard/odin/state/tableau_checkpoints.json` to a lower value

Logs for testing run 1: Testing job rerun and incremental updating
Logs for testing run 2: Deleting Tableau table and watermark and rerunning
Results
Remaining features:
* `--table` arg (also redundant)
* use `ProcessLog` consistently (and in line with other Odin jobs) and unify `uuid` across all logs relating to a given iteration of a function

Out of scope
* `TABLE_CONFIG` to more easily limit synced columns to only those needed
* making `TABLE_CONFIG` configurable per table
* `--overwrite-table` argument (not usable in prod, and redundant with watermark)

Notes
Do not upload to dev or prod yet. Need to coordinate with infra to get new required Tableau variables: