Consider using BQ instead of GCS for schema-less storage #423

Closed
jklukas opened this issue Feb 8, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

jklukas (Contributor) commented Feb 8, 2019

The article Infinite Backup for Cloud Pub/Sub discusses the advantages of choosing BigQuery rather than GCS as a sink for long-term storage of streaming data, given that long-term storage in BQ costs the same as GCS.

For us, we could use this technique for decoder errors or for landfill. The table in BigQuery would have a payload column of BYTES type, plus various optional columns for attributes that might exist on the message.
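As a rough sketch of what that could look like, using the Python BigQuery client (the project/dataset/table name, the particular attribute columns, and the day-partitioning choice are all placeholder assumptions, not a settled design):

```python
from google.cloud import bigquery

# Hypothetical landfill table: raw payload bytes plus optional attribute columns.
schema = [
    bigquery.SchemaField("submission_timestamp", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("document_namespace", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("document_type", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("uri", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("payload", "BYTES", mode="REQUIRED"),
]

client = bigquery.Client()
table = bigquery.Table("my-project.landfill.raw_messages", schema=schema)
# Partition by day so a reprocessing job only scans the relevant time window.
table.time_partitioning = bigquery.TimePartitioning(field="submission_timestamp")
client.create_table(table)
```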

Some benefits:

  • BQ is per-record storage; it abstracts away the underlying files, batching concerns, etc.
  • Leverage BQ queries for analyzing the error messages (see the example query after the downsides list below)

Downsides:

  • We don't yet know a lot about what it looks like to get data out of BigQuery; for example, we've run into export limits with BQ before
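As a sketch of the kind of analysis query mentioned in the benefits above (again using the placeholder table and column names from the earlier sketch):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Example analysis: count error messages per namespace/doctype for a given day.
query = """
SELECT document_namespace, document_type, COUNT(*) AS n
FROM `my-project.landfill.raw_messages`
WHERE DATE(submission_timestamp) = '2019-02-08'
GROUP BY document_namespace, document_type
ORDER BY n DESC
"""
for row in client.query(query).result():
    print(row.document_namespace, row.document_type, row.n)
```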
jklukas added the enhancement label on Feb 8, 2019
jklukas (Contributor, Author) commented Feb 8, 2019

cc @relud @whd @mreid-moz

relud (Contributor) commented Feb 8, 2019

bigquery limits

Exports per day — 50,000 exports per project and up to 10 TB per day (the 10 TB data limit is cumulative across all exports)

for small backfills that might be fine, but it's going to be an issue for a larger backfill

whd (Member) commented Feb 9, 2019

Aside from the export issue, this seems like an interesting hack worth testing. It could be tested in parallel to the existing landfill implementation, but it would be relatively expensive, i.e., we should probably prototype this with a limited set of data such as the structured ingestion stream.

Notably, I don't think we've tested GCS landfill -> decoder yet, so confirming that works as expected is probably a prerequisite to trying out BQ -> decoder. In the current GCP architecture, the primary purpose of schema-less storage is landfill that can be reprocessed by the decoder, but there are perhaps other useful applications for it as well.

Regarding exports, we've never needed to reprocess more than a day (really, an hour) of raw data on the current pipeline, and on the pipeline before that I think it was on the order of a week of data, which would certainly hit export limits. If the export limit is per-project, we could do something hacky like having multiple sinks share a subscription and write to different projects, I suppose. At any rate, we should certainly consider larger backfills as a potential issue with this approach.

Also worth noting: there are related in-flight considerations regarding the efficacy of Dataflow generally in #371 and #380, and the considerations here may affect work around those.

jklukas (Contributor, Author) commented Aug 13, 2019

This has been implemented.
