Main ping copy_deduplicate query raises memory exception #307
Comments
On the bright side, the other copy_deduplicate job (which handles populating all the stable tables besides main) finished in just under 10 minutes, so that looks to be working well.
I'm currently testing a process where we first produce a list of document_ids with the number of occurrences, then we select all the records where …

First attempt failed, and now I'm going to try breaking it into pieces.
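The count-then-select idea described above can be sketched in plain Python. This is only an illustration of the logic: the real job operates on BigQuery tables, and `records`, the `document_id`/`payload` field names, and the function name here are all hypothetical.

```python
from collections import Counter

def dedupe_by_document_id(records):
    """Split records into unique document_ids (copied straight through)
    and duplicated ids (first occurrence kept), mirroring the
    count-then-select approach. A sketch, not the production job."""
    counts = Counter(r["document_id"] for r in records)
    seen = set()
    deduped = []
    for r in records:
        doc_id = r["document_id"]
        if counts[doc_id] == 1:
            deduped.append(r)      # unique id: no dedup work needed
        elif doc_id not in seen:
            seen.add(doc_id)       # duplicated id: keep first copy only
            deduped.append(r)
    return deduped

rows = [
    {"document_id": "a", "payload": 1},
    {"document_id": "b", "payload": 2},
    {"document_id": "a", "payload": 3},
]
print(len(dedupe_by_document_id(rows)))  # → 2
```

Splitting unique ids from duplicated ids up front means the expensive dedup step only touches the (typically small) duplicated subset.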
I was able to successfully create a deduped version of …
The above single query succeeded in 42 minutes. I'm going to PR this change to bigquery-etl.

The new docker image has been built and published, so I kicked off the Airflow job to run again. We should see it succeed in ~40 minutes.

Succeeded in 37 minutes!
tl;dr -
Looks like a single day of main ping is too much for the copy_deduplicate query as currently expressed. The Airflow job failed last night. From the logs:
I will look this morning into whether it's possible to recast this query to be more efficient. It may be necessary to break this into two steps with a temp table in between.
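One way to sketch the two-step idea: materialize one row per document_id into a temp table, then copy from the temp table into the stable table. The `ROW_NUMBER()` window is a common BigQuery dedupe pattern, not necessarily the exact query used here, and all table names below are placeholders rather than the actual bigquery-etl identifiers.

```python
def build_dedupe_queries(source, temp_table, dest_table):
    """Build a two-step dedupe as two SQL strings: step 1 writes one
    row per document_id to a temp table; step 2 copies the temp table
    into the destination. Table names are placeholder arguments."""
    step1 = f"""
    CREATE OR REPLACE TABLE `{temp_table}` AS
    SELECT * EXCEPT (_rn)
    FROM (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY document_id) AS _rn
      FROM `{source}`
    )
    WHERE _rn = 1
    """
    step2 = f"INSERT INTO `{dest_table}` SELECT * FROM `{temp_table}`"
    return step1, step2

# Example with hypothetical table names:
step1, step2 = build_dedupe_queries(
    "project.dataset.main_live",
    "project.dataset.main_dedupe_tmp",
    "project.dataset.main_stable",
)
print(step2)
```

Writing the intermediate result to a temp table lets each step run within its own memory budget, instead of forcing BigQuery to hold the full dedupe in a single query's shuffle.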
cc @relud @whd