Consider using BQ instead of GCS for schema-less storage #423

Closed
jklukas opened this issue Feb 8, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

jklukas (Contributor) commented Feb 8, 2019

The article Infinite Backup for Cloud Pub/Sub discusses the advantages of choosing BigQuery rather than GCS as a sink for long-term storage of streaming data, given that long-term storage in BQ costs the same as GCS.

For us, we could use this technique for decoder errors or for landfill. The table in BigQuery would have a payload column of BYTES type, plus various optional columns for attributes that might exist on the message.
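As a rough sketch of what that could look like, using the Python BigQuery client (the project/dataset/table name, the particular attribute columns, and the day-partitioning choice are all placeholder assumptions, not a settled design):

```python
from google.cloud import bigquery

# Hypothetical landfill table: raw payload bytes plus optional attribute columns.
schema = [
    bigquery.SchemaField("submission_timestamp", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("document_namespace", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("document_type", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("uri", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("payload", "BYTES", mode="REQUIRED"),
]

client = bigquery.Client()
table = bigquery.Table("my-project.landfill.raw_messages", schema=schema)
# Partition by day so a reprocessing job only scans the relevant time window.
table.time_partitioning = bigquery.TimePartitioning(field="submission_timestamp")
client.create_table(table)
```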

Some benefits:

  • BQ is per-record storage; it abstracts away the underlying files, batching concerns, etc.
  • Leverage BQ queries for analyzing the error messages (see the example query after the downsides list below)

Downsides:

  • We don't yet know a lot about what it looks like to get data out of BigQuery; for example, we've run into export limits with BQ before
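As a sketch of the kind of analysis query mentioned in the benefits above (again using the placeholder table and column names from the earlier sketch):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Example analysis: count error messages per namespace/doctype for a given day.
query = """
SELECT document_namespace, document_type, COUNT(*) AS n
FROM `my-project.landfill.raw_messages`
WHERE DATE(submission_timestamp) = '2019-02-08'
GROUP BY document_namespace, document_type
ORDER BY n DESC
"""
for row in client.query(query).result():
    print(row.document_namespace, row.document_type, row.n)
```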
jklukas added the enhancement label on Feb 8, 2019
jklukas (Contributor, Author) commented Feb 8, 2019

cc @relud @whd @mreid-moz

relud (Contributor) commented Feb 8, 2019

bigquery limits

Exports per day — 50,000 exports per project and up to 10 TB per day (the 10 TB data limit is cumulative across all exports)

for small backfills that might be fine, but it's going to be an issue for a larger backfill

whd (Member) commented Feb 9, 2019

Aside from the export issue, this seems like an interesting hack worth testing. It could be tested in parallel to the existing landfill implementation, but it would be relatively expensive, i.e., we should probably prototype this with a limited set of data such as the structured ingestion stream.

Notably, I don't think we've tested GCS landfill -> decoder yet, so confirming that works as expected is probably a prerequisite to trying out BQ -> decoder. In the current GCP architecture, the primary purpose of schema-less storage is landfill that can be reprocessed by the decoder, but there are perhaps other useful applications for it as well.

Regarding exports, we've never needed to reprocess more than a day (really, an hour) of raw data on the current pipeline, and on the pipeline before that I think it was on the order of a week of data, which would certainly hit export limits. If the export limit is per-project, we could do something hacky like having multiple sinks share a subscription and write to different projects, I suppose. At any rate, we should certainly consider larger backfills as a potential issue with this approach.

Also worth noting: there are related in-flight considerations regarding the efficacy of Dataflow generally in #371 and #380, and the considerations here may affect work around those.

jklukas (Contributor, Author) commented Aug 13, 2019

This has been implemented.
