Beam sink to BigQuery failure: Too many sources provided #199
Comments
While the workaround works for some cases, it has two issues: […]

We should investigate this in more detail. I wonder if we can get around this by partitioning the input PCollection again and then using […]
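For concreteness, a minimal sketch of that partitioning idea in the Beam Python SDK. The partition count and the toy elements here are assumptions for illustration, not values from this project:

```python
import apache_beam as beam

NUM_PARTITIONS = 10  # assumed constant; tune so each partition stays small

with beam.Pipeline() as p:
    records = p | 'CreateToyInput' >> beam.Create(
        [{'id': i} for i in range(100)])

    # beam.Partition routes each element to one of NUM_PARTITIONS outputs
    # based on the index returned by the function.
    partitions = records | 'SplitInput' >> beam.Partition(
        lambda elem, n: elem['id'] % n, NUM_PARTITIONS)

    # Each partition is its own PCollection and could get its own sink.
    for i, part in enumerate(partitions):
        part | 'Print_%d' % i >> beam.Map(print)
```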
Out of curiosity, does this also happen if we set --max_num_workers to something that is not too high, such that the total number of workers is bounded by, say, a few hundred (by "worker" here I mean number of cores, not just machines)? I think the source of confusion is that we don't know where these "sources" come from, correct? I assumed each worker would be one source, but it seems that is not the case. If we can figure out what "fix" the Beam team is planning for this "custom sink" in the Python SDK, maybe we can prioritize doing it ourselves and send a Pull Request.
It seems it's independent of the number of workers. Today Allie had a run with only 50 workers and it failed for the same reason. Their fix is to write a custom sink for BQ; again, from another comment on the bug report: […]
Hmm, okay. I think Asha mentioned that this is "fixed" in the Java SDK; @arostamianfar, can you provide more context here? @samanvp, given your last comment, do you think we should either revert PR #197 or reduce its constant (for the number of keys)? My concern is that we are adding a GroupByKey step, turned on by default for "large inputs", which is expensive without really fixing anything (Allie is saying that her case fails even with […])
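For readers following along, the sharding hack under discussion looks roughly like the sketch below. This is a reconstruction under assumptions (the key count, names, and structure are illustrative, not the actual PR #197 code):

```python
import random

import apache_beam as beam

NUM_KEYS = 100  # assumed stand-in for the PR's constant


def limit_write_shards(pcoll):
    """Reshards a PCollection into at most NUM_KEYS bundles so the sink
    produces a bounded number of output files ("sources")."""
    return (
        pcoll
        | 'AssignRandomKey' >> beam.Map(
            lambda elem: (random.randint(0, NUM_KEYS - 1), elem))
        | 'GroupByRandomKey' >> beam.GroupByKey()  # the expensive step
        | 'DropKey' >> beam.FlatMap(lambda kv: kv[1]))
```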
I looked into this a bit more and I think I have a better idea of what's going on. The BQ sink writes data to GCS files as JSON, not directly to BQ (the JSON files are stored under a directory like […]).

Partitioning the data just before write, doing LimitWrite on each partition, and then always […]

RE Java fix: I recall seeing it on some thread. Need to dig it up.
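A sketch of what "partitioning the data just before write" could look like; the table name, partition count, and dispositions are placeholders, and beam.io.WriteToBigQuery stands in for whatever sink the project actually uses:

```python
import random

import apache_beam as beam

NUM_WRITE_PARTITIONS = 20  # assumed; keeps staged files per load job bounded


def partitioned_write(records):
    parts = records | 'PartitionBeforeWrite' >> beam.Partition(
        lambda _elem, n: random.randint(0, n - 1), NUM_WRITE_PARTITIONS)
    for i, part in enumerate(parts):
        # One independent write per partition, so no single load job has
        # to reference all of the staged JSON files at once.
        part | 'WriteToBQ_%d' % i >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.variants',  # placeholder table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
```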
…form, and remove num_bigquery_write_shards flag. Previously, issue googlegenomics#199 forced us to use a hack to shard the variants before they are written to BigQuery, which negatively affected the speed of the tool. With the implementation of the new sink, the flag is no longer needed.
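For context, with the new sink the write collapses to a single transform along these lines (a hedged sketch; the table and dispositions are placeholders rather than Variant Transforms' real configuration):

```python
import apache_beam as beam


def write_variants(variants):
    # Single write step; no manual sharding, no num_bigquery_write_shards.
    return variants | 'WriteVariantsToBQ' >> beam.io.WriteToBigQuery(
        'my-project:my_dataset.variants',  # placeholder table
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
```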
We observed during multiple runs of the VT pipeline that whenever we set --num_workers to a high number (256 or higher), it fails with the following error message: […]
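For reference, the worker settings in question would be passed roughly like this (standard Dataflow pipeline options; the project, bucket, and region are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# 256 is the worker count at which we start seeing the failure.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
    region='us-central1',                # placeholder
    num_workers=256,
    max_num_workers=256)
```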
After communicating with the Beam team, we were pointed to this temporary solution and were told: […]