Consider using BQ instead of GCS for schema-less storage #423
Comments
cc @relud @whd @mreid-moz
For small backfills that might be fine, but it's going to be an issue for a larger backfill.
Aside from the export issue, this seems like an interesting hack worth testing. It could be tested in parallel to the existing landfill implementation, but it would be relatively expensive, so we should probably prototype it with a limited set of data such as the structured ingestion stream. Notably, I don't think we've tested GCS landfill -> decoder yet, so confirming that works as expected is probably a prerequisite to trying out BQ -> decoder. In the current GCP architecture, the primary purpose of schema-less storage is landfill that can be reprocessed by the decoder, but there are perhaps other useful applications for it as well.

Regarding exports, we've never needed to reprocess more than a day (really, an hour) of raw data on the current pipeline, and on the pipeline before that I think it was on the order of a week of data, which would certainly hit export limits. If the export limit is per-project, we could do something hacky like having multiple sinks share a subscription and write to different projects, I suppose. At any rate, we should certainly consider larger backfills as a potential issue with this approach.

Also worth noting: there are related in-flight considerations regarding the efficacy of Dataflow generally in #371 and #380, and considerations here may affect work around those.
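For context on the export concern, a backfill from BQ would presumably run as BigQuery extract jobs back to GCS, and those jobs are what the per-project export quotas apply to. A minimal sketch in Python with the google-cloud-bigquery client; the project, table, and bucket names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-ingestion-project")  # hypothetical project

# Export the hypothetical raw-message table to GCS so the decoder could
# reprocess it. Extract jobs count against per-project daily export quotas,
# which is the limit a multi-day backfill could run into.
job_config = bigquery.ExtractJobConfig(destination_format="AVRO")
extract_job = client.extract_table(
    "my-ingestion-project.landfill.raw_messages",       # hypothetical table
    "gs://my-backfill-bucket/raw_messages/*.avro",      # hypothetical bucket
    job_config=job_config,
)
extract_job.result()  # block until the export finishes
```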
This has been implemented.
The article Infinite Backup for Cloud Pub/Sub discusses the advantages of choosing BigQuery rather than GCS as a sink for long-term storage of streaming data, given that storage in BQ costs the same as GCS.
For us, we could use this technique for decoder errors or for landfill. The table in BigQuery would have a payload column of BYTES type, plus various optional columns for attributes that might exist on the message.
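A minimal sketch of what such a table might look like, using the Python google-cloud-bigquery client; the project, dataset, table, attribute columns, and partitioning choice here are illustrative assumptions, not the pipeline's actual names:

```python
import base64
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical raw-message table: one BYTES column for the payload plus
# nullable columns for attributes that may be present on the message.
schema = [
    bigquery.SchemaField("submission_timestamp", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("uri", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("document_type", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("payload", "BYTES", mode="REQUIRED"),
]
table = bigquery.Table("my-project.landfill.raw_messages", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="submission_timestamp")
table = client.create_table(table)

# Streaming insert of a single message; BYTES values are base64-encoded
# when using the JSON insert API.
row = {
    "submission_timestamp": "2019-04-01T00:00:00Z",
    "uri": "/submit/structured/example/1",   # hypothetical attribute values
    "document_type": "example",
    "payload": base64.b64encode(b"<raw message body>").decode("ascii"),
}
errors = client.insert_rows_json(table, [row])
assert not errors, errors
```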
Some benefits:
Downsides: