Main ping copy_deduplicate query raises memory exception #307

Closed · jklukas opened this issue Aug 23, 2019 · 6 comments · Fixed by #308

jklukas commented Aug 23, 2019

tl;dr -

Resources exceeded during query execution: The query could not be executed in the allotted memory

Looks like a single day of main ping is too much for the copy_deduplicate query as currently expressed. The Airflow job failed last night. From the logs:

[2019-08-23 04:57:44,743] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,743] {pod_launcher.py:104} INFO -     raise exceptions.from_http_response(response)
[2019-08-23 04:57:44,747] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,747] {pod_launcher.py:104} INFO - google.api_core.exceptions.BadRequest: 400 GET https://www.googleapis.com/bigquery/v2/projects/moz-fx-data-derived-datasets/queries/c31b5f34-69c7-45d7-8b89-ec0d89153f70?maxResults=0&location=US: Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 101% of limit.
[2019-08-23 04:57:44,748] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,748] {pod_launcher.py:104} INFO - Top memory consumer(s):
[2019-08-23 04:57:44,749] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,749] {pod_launcher.py:104} INFO -   query parsing and optimization: 5%
[2019-08-23 04:57:44,749] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,749] {pod_launcher.py:104} INFO -   other/unattributed: 95%

I will look this morning into whether it's possible to recast this query to be more efficient. It may be necessary to break this into two steps with a temp table in between.
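
For context, the current query presumably follows the usual single-pass dedup shape, running a ROW_NUMBER() window over every row in the day's partition at once. A minimal sketch of that shape, assuming the standard pattern rather than the exact production SQL:

-- Hypothetical sketch of the single-pass dedup; table and date are illustrative.
SELECT
  * EXCEPT (_n)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4`
  WHERE
    DATE(submission_timestamp) = '2019-08-22' )
WHERE
  _n = 1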

cc @relud @whd

jklukas commented Aug 23, 2019

On the bright side, the other copy_deduplicate job (which handles populating all the stable tables besides main) finished in just under 10 minutes, so that looks to be working well.

jklukas commented Aug 23, 2019

I'm currently testing a process where we first produce a list of document_ids with their number of occurrences, select all the records where occurrences = 1 directly into the stable table, and then run the window query on just the rows where occurrences > 1. The vast majority of rows in the live table have no duplicates, so this may make the window function tractable. It may also be possible to express the entire operation as a single query without the need for a temp table.

First attempt failed, and now I'm going to try breaking into pieces.
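
A sketch of those pieces, using hypothetical temp-table names (the actual intermediate tables aren't shown in this thread):

-- Step 1: materialize the (small) set of document_ids that occur more than once.
CREATE TABLE tmp.duped_docids_sketch AS
SELECT
  document_id
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v4`
WHERE
  DATE(submission_timestamp) = '2019-08-22'
GROUP BY
  document_id
HAVING
  COUNT(*) > 1;

-- Step 2: copy through rows whose document_id never repeats (anti-join).
CREATE TABLE tmp.main_deduped_sketch AS
SELECT
  base.*
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v4` AS base
LEFT JOIN
  tmp.duped_docids_sketch AS duped
USING
  (document_id)
WHERE
  DATE(base.submission_timestamp) = '2019-08-22'
  AND duped.document_id IS NULL;

-- Step 3: run the window only over the duplicated rows, keeping the earliest of each.
INSERT INTO tmp.main_deduped_sketch
SELECT
  * EXCEPT (_n)
FROM (
  SELECT
    base.*,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4` AS base
  JOIN
    tmp.duped_docids_sketch
  USING
    (document_id)
  WHERE
    DATE(base.submission_timestamp) = '2019-08-22' )
WHERE
  _n = 1;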

jklukas commented Aug 23, 2019

I was able to successfully create a deduped version of main_v4 via a query broken into three parts; the process took just over 30 minutes of runtime. I've now combined those parts into a single query and am waiting to see if it completes:

CREATE TABLE
  tmp.klukas_main_deduped2
PARTITION BY
  DATE(submission_timestamp)
CLUSTER BY
  sample_id
AS
WITH
  base AS (
  SELECT
    *
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4`
  WHERE
    DATE(submission_timestamp) = '2019-08-22' ),
  -- document_ids that occur more than once on this day
  duped_docids AS (
  SELECT
    document_id,
    COUNT(document_id) AS occurrences
  FROM
    base
  GROUP BY
    document_id
  HAVING
    occurrences > 1),
  -- rows whose document_id has no duplicates, passed through as-is
  nonduped AS (
  SELECT
    base.*
  FROM
    base
  LEFT JOIN
    duped_docids
  USING
    (document_id)
  WHERE
    duped_docids.document_id IS NULL),
  -- number each duplicated row, earliest submission first
  numbered_duplicates AS (
  SELECT
    base.*,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    base
  JOIN
    duped_docids
  USING
    (document_id) ),
  -- keep only the first-numbered row per duplicated document_id
  deduped AS (
  SELECT
    * EXCEPT (_n)
  FROM
    numbered_duplicates
  WHERE
    _n = 1 )
SELECT
  *
FROM
  nonduped
UNION ALL
SELECT
  *
FROM
  deduped
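
(For the record: the expensive ROW_NUMBER() window above runs only over rows whose document_id actually repeats, while the nonduped branch is a plain anti-join. Since the vast majority of rows have no duplicates, the window presumably never has to hold anything close to a full day of main pings in memory at once.)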

jklukas self-assigned this Aug 23, 2019

jklukas commented Aug 23, 2019

The above single query succeeded in 42 minutes. I'm going to PR this change to bigquery-etl.

jklukas commented Aug 23, 2019

The new Docker image has been built and published, so I kicked off the Airflow job to run again. We should see that succeed in ~40 minutes.

jklukas commented Aug 23, 2019

Succeeded in 37 minutes!
