Implement functionality to support idempotent ingestion #1

Closed
MattTriano opened this issue Nov 10, 2022 · 0 comments · Fixed by #17


Many of the tables I'm interested in are updated monthly, weekly, or even daily, and they're often snapshots exported from database tables that support operational systems (e.g., each record corresponds to some unit of work and is updated as that work progresses). By retrieving a data table and comparing it against prior pulls of that table, it's possible to identify which records are new as well as which existing records were changed. To avoid missing an update, this retrieval and comparison has to happen for every distinct export of the table from its source system. Most public data systems don't indicate when the next update will happen (although the cadence can often be reliably deduced by checking at periodic intervals), so in practice it's necessary to check more often than the table actually changes.
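
As a rough illustration of the comparison step, here is a minimal sketch that classifies the records in a fresh pull as new or updated relative to the prior pull. It assumes each record is a dict with a stable primary key column (called `id` here, which is a placeholder), and it fingerprints rows by hashing their contents; the function names are hypothetical and not part of this project.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Hash a record's full contents so a change in any column is detected."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshots(previous: list, current: list, key: str = "id") -> tuple:
    """Return (new_rows, updated_rows) found in `current` relative to `previous`.

    `key` is an assumed primary key column that identifies the same record
    across pulls.
    """
    prev_hashes = {row[key]: row_fingerprint(row) for row in previous}
    new_rows, updated_rows = [], []
    for row in current:
        old_hash = prev_hashes.get(row[key])
        if old_hash is None:
            new_rows.append(row)  # key never seen before
        elif old_hash != row_fingerprint(row):
            updated_rows.append(row)  # key seen, contents differ
    return new_rows, updated_rows


# Made-up example data: record 2 changed, record 3 is new.
previous_pull = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "open"},
]
current_pull = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "closed"},
    {"id": 3, "status": "open"},
]
new_rows, updated_rows = diff_snapshots(previous_pull, current_pull)
```

For tables over 1GB this comparison would more likely be done in the database than in memory, but the classification logic would be the same.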

Data tables can be pretty large (often in excess of 1GB), so it's both rude and expensive to download and ingest a table more frequently than needed. Fortunately, the main data tables this project uses are served via Socrata's data platform, which provides an API for checking table metadata, including the time the data was last updated. Checking that timestamp first makes it cheap to skip an unnecessary data pull. And when it is necessary to pull data, it would be ideal to ingest only new or updated records.
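
The metadata check could look something like the sketch below. The URL pattern (`/api/views/<table_id>.json`) and the `rowsUpdatedAt` field (epoch seconds) are my assumptions about what the Socrata metadata response looks like and should be verified against a real table; the function names and the idea of storing a `last_ingested_at` timestamp are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional
from urllib.request import urlopen


def fetch_table_metadata(domain: str, table_id: str) -> dict:
    """Fetch a table's metadata document (small, unlike the table itself).

    Assumption: metadata is served at /api/views/<table_id>.json.
    """
    url = f"https://{domain}/api/views/{table_id}.json"
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def source_updated_at(metadata: dict) -> datetime:
    """Read the source's last-update time from the metadata.

    Assumption: `rowsUpdatedAt` holds epoch seconds.
    """
    return datetime.fromtimestamp(metadata["rowsUpdatedAt"], tz=timezone.utc)


def needs_pull(metadata: dict, last_ingested_at: Optional[datetime]) -> bool:
    """True if the source changed since the last successful ingestion."""
    if last_ingested_at is None:
        return True  # never ingested, so a full pull is required
    return source_updated_at(metadata) > last_ingested_at
```

Because `needs_pull` returns `False` whenever the source hasn't changed, running the check on any schedule (hourly, daily) is safe: repeated runs against an unchanged table do no work, which is the idempotency this issue is after.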
