Main ping copy_deduplicate query raises memory exception #307

Closed · jklukas opened this issue Aug 23, 2019 · 6 comments · Fixed by #308

jklukas commented Aug 23, 2019

tl;dr -

Resources exceeded during query execution: The query could not be executed in the allotted memory

Looks like a single day of main ping is too much for the copy_deduplicate query as currently expressed. The Airflow job failed last night. From the logs:

[2019-08-23 04:57:44,743] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,743] {pod_launcher.py:104} INFO -     raise exceptions.from_http_response(response)
[2019-08-23 04:57:44,747] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,747] {pod_launcher.py:104} INFO - google.api_core.exceptions.BadRequest: 400 GET https://www.googleapis.com/bigquery/v2/projects/moz-fx-data-derived-datasets/queries/c31b5f34-69c7-45d7-8b89-ec0d89153f70?maxResults=0&location=US: Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 101% of limit.
[2019-08-23 04:57:44,748] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,748] {pod_launcher.py:104} INFO - Top memory consumer(s):
[2019-08-23 04:57:44,749] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,749] {pod_launcher.py:104} INFO -   query parsing and optimization: 5%
[2019-08-23 04:57:44,749] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,749] {pod_launcher.py:104} INFO -   other/unattributed: 95%

I will look this morning into whether it's possible to recast this query to be more efficient. It may be necessary to break this into two steps with a temp table in between.
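
For context, the current query presumably follows the usual single-pass dedup shape, running a ROW_NUMBER() window over every row in the day's partition at once. A minimal sketch of that shape, assuming the standard pattern rather than the exact production SQL:

-- Hypothetical sketch of the single-pass dedup; table and date are illustrative.
SELECT
  * EXCEPT (_n)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4`
  WHERE
    DATE(submission_timestamp) = '2019-08-22' )
WHERE
  _n = 1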

cc @relud @whd

jklukas commented Aug 23, 2019

On the bright side, the other copy_deduplicate job (which handles populating all the stable tables besides main) finished in just under 10 minutes, so that looks to be working well.

jklukas commented Aug 23, 2019

I'm currently testing a process where we first produce a list of document_ids with their number of occurrences, select all the records where occurrences = 1 directly into the stable table, and then run the window query on just the rows where occurrences > 1. The vast majority of rows in the live table have no duplicates, so this may make the window function tractable. It may also be possible to express the entire operation as a single query without the need for a temp table.

First attempt failed, and now I'm going to try breaking into pieces.
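
A sketch of those pieces, using hypothetical temp-table names (the actual intermediate tables aren't shown in this thread):

-- Step 1: materialize the (small) set of document_ids that occur more than once.
CREATE TABLE tmp.duped_docids_sketch AS
SELECT
  document_id
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v4`
WHERE
  DATE(submission_timestamp) = '2019-08-22'
GROUP BY
  document_id
HAVING
  COUNT(*) > 1;

-- Step 2: copy through rows whose document_id never repeats (anti-join).
CREATE TABLE tmp.main_deduped_sketch AS
SELECT
  base.*
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v4` AS base
LEFT JOIN
  tmp.duped_docids_sketch AS duped
USING
  (document_id)
WHERE
  DATE(base.submission_timestamp) = '2019-08-22'
  AND duped.document_id IS NULL;

-- Step 3: run the window only over the duplicated rows, keeping the earliest of each.
INSERT INTO tmp.main_deduped_sketch
SELECT
  * EXCEPT (_n)
FROM (
  SELECT
    base.*,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4` AS base
  JOIN
    tmp.duped_docids_sketch
  USING
    (document_id)
  WHERE
    DATE(base.submission_timestamp) = '2019-08-22' )
WHERE
  _n = 1;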

jklukas commented Aug 23, 2019

I was able to successfully create a deduped version of main_v4 via a query broken into three parts; the process took just over 30 minutes of runtime. I've now combined those parts into a single query and am waiting to see if it completes:

CREATE TABLE
  tmp.klukas_main_deduped2
PARTITION BY
  DATE(submission_timestamp)
CLUSTER BY
  sample_id
AS
WITH
  base AS (
  SELECT
    *
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4`
  WHERE
    DATE(submission_timestamp) = '2019-08-22' ),
  -- document_ids that occur more than once on this day
  duped_docids AS (
  SELECT
    document_id,
    COUNT(document_id) AS occurrences
  FROM
    base
  GROUP BY
    document_id
  HAVING
    occurrences > 1),
  -- rows whose document_id has no duplicates, passed through as-is
  nonduped AS (
  SELECT
    base.*
  FROM
    base
  LEFT JOIN
    duped_docids
  USING
    (document_id)
  WHERE
    duped_docids.document_id IS NULL),
  -- number each duplicated row, earliest submission first
  numbered_duplicates AS (
  SELECT
    base.*,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    base
  JOIN
    duped_docids
  USING
    (document_id) ),
  -- keep only the first-numbered row per duplicated document_id
  deduped AS (
  SELECT
    * EXCEPT (_n)
  FROM
    numbered_duplicates
  WHERE
    _n = 1 )
SELECT
  *
FROM
  nonduped
UNION ALL
SELECT
  *
FROM
  deduped
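
(For the record: the expensive ROW_NUMBER() window above runs only over rows whose document_id actually repeats, while the nonduped branch is a plain anti-join. Since the vast majority of rows have no duplicates, the window presumably never has to hold anything close to a full day of main pings in memory at once.)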

jklukas self-assigned this Aug 23, 2019

jklukas commented Aug 23, 2019

The above single query succeeded in 42 minutes. I'm going to PR this change to bigquery-etl.

jklukas commented Aug 23, 2019

The new Docker image has been built and published, so I kicked off the Airflow job to run again. We should see that succeed in ~40 minutes.

jklukas commented Aug 23, 2019

Succeeded in 37 minutes!
