Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Event funnel queries are slow #282

Open
wlach opened this issue Oct 21, 2021 · 5 comments
Open

Event funnel queries are slow #282

wlach opened this issue Oct 21, 2021 · 5 comments
Labels
Events Issues or enhancements to the events explores

Comments

@wlach
Copy link
Contributor

wlach commented Oct 21, 2021

The following query (generated by the event funnel explore) is slower than expected:

WITH
  step_1 AS (
  SELECT
    mozfun.event_analysis.aggregate_match_strings( ARRAY_AGG( CONCAT(mozfun.event_analysis.escape_metachars(property_value.value), mozfun.event_analysis.event_index_to_match_string(et.INDEX)))) AS match_string
  FROM
    `mozdata.fenix.event_types` AS et
  CROSS JOIN
    UNNEST(event_properties) AS properties
  CROSS JOIN
    UNNEST(properties.value) AS property_value
  WHERE
    (category = 'events')
    AND (event = 'app_opened')
    ),
  step_2 AS (
  SELECT
    mozfun.event_analysis.aggregate_match_strings( ARRAY_AGG( CONCAT(mozfun.event_analysis.escape_metachars(property_value.value),mozfun.event_analysis.event_index_to_match_string(et.INDEX)))) AS match_string
  FROM
    `mozdata.fenix.event_types` AS et
  CROSS JOIN
    UNNEST(event_properties) AS properties
  CROSS JOIN
    UNNEST(properties.value) AS property_value
  WHERE
    (category = 'context_menu')
    AND (event = 'item_tapped')
    AND (properties.key = 'named')
    AND (property_value.key = 'share_link') )
SELECT
  COUNT(CASE
      WHEN REGEXP_CONTAINS(funnel_analysis.events, mozfun.event_analysis.create_funnel_regex([step_1.match_string], TRUE)) THEN 1
    ELSE
    NULL
  END
    ) AS funnel_analysis_count_completed_step_1,
  COUNT(CASE
      WHEN (REGEXP_CONTAINS(funnel_analysis.events, mozfun.event_analysis.create_funnel_regex([step_1.match_string], TRUE))) AND (REGEXP_CONTAINS(funnel_analysis.events, mozfun.event_analysis.create_funnel_regex([step_1.match_string, step_2.match_string], TRUE))) THEN 1
    ELSE
    NULL
  END
    ) AS funnel_analysis_count_completed_step_2
FROM
  `mozdata.fenix.events_daily` AS funnel_analysis
CROSS JOIN
  step_1
CROSS JOIN
  step_2
WHERE
  funnel_analysis.submission_date = DATE('2021-10-18')
LIMIT
  500

According to BigQuery, it's only processing 2.4gb worth of data but when I run it in the console (in the mozdata project), it takes a full 21.4 seconds to run. Looking at the query profile, this segment sticks out:

image

It appears this corresponds to the REGEXP_CONTAINS part in the latter part of the query. It seems most likely that it's the repeated calls to REGEXP_CONTAINS that are slow.

@wlach
Copy link
Contributor Author

wlach commented Oct 21, 2021

The performance is even worse when no event matches the criteria specified in step 1 and step 2. No idea why.

@wlach
Copy link
Contributor Author

wlach commented Oct 22, 2021

I had a try at speeding this up with a JavaScript UDF, I figured I might be able to reduce the number of time spent evaluating different regexes that way. Slot time seems to go down from ~28 minutes to 7 minutes which is a nice improvement, though the numbers seem to be a bit off. We'd need to rewrite our regex generation if we took this approach, I'm doing some hacky stuff below but it's not generating quite the right results:

CREATE TEMP FUNCTION
  udf_js_regex_match (regex STRING, input STRING)
  RETURNS int64
  LANGUAGE js AS """

    if (input == null || !input.length) {
      return 0;
    }
    re = new RegExp(regex.replace(/\\Q/g, '').replace(/\\E/g, '').replace('(?:.*?)', '.*').replace(/\\?:/g, '').slice(1,-2) + "?)");
    let match = input.match(re)
    return match ? match.filter(m=>m !== undefined).length : 0;
""";

WITH
  step_1 AS (
  SELECT
    mozfun.event_analysis.aggregate_match_strings( ARRAY_AGG( CONCAT(property_value.value, mozfun.event_analysis.event_index_to_match_string(et.INDEX)))) AS match_string
  FROM
    `mozdata.fenix.event_types` AS et
  CROSS JOIN
    UNNEST(event_properties) AS properties
  CROSS JOIN
    UNNEST(properties.value) AS property_value
  WHERE
    (category = 'events')
    AND (event = 'app_opened')
    AND 1=1 -- no filter on 'step_1.property_value'
    ),
  step_2 AS (
  SELECT
    mozfun.event_analysis.aggregate_match_strings( ARRAY_AGG( CONCAT(property_value.value,mozfun.event_analysis.event_index_to_match_string(et.INDEX)))) AS match_string
  FROM
    `mozdata.fenix.event_types` AS et
  CROSS JOIN
    UNNEST(event_properties) AS properties
  CROSS JOIN
    UNNEST(properties.value) AS property_value
  WHERE
    (category = 'context_menu')
    AND (event = 'item_tapped')
    AND (properties.key = 'named')
    AND (property_value.key = 'share_link')),
    results as (
SELECT
  udf_js_regex_match(mozfun.event_analysis.create_funnel_regex([step_1.match_string, step_2.match_string], TRUE), funnel_analysis.events) AS funnel_analysis_count_completed_step_2
FROM
  `mozdata.fenix.events_daily` AS funnel_analysis
CROSS JOIN
  step_1
CROSS JOIN
  step_2
WHERE
  funnel_analysis.submission_date = DATE('2021-10-18')
)
select countif(funnel_analysis_count_completed_step_2 > 1) as step1, countif(funnel_analysis_count_completed_step_2 > 2) as step2 from results

@wlach
Copy link
Contributor Author

wlach commented Oct 25, 2021

Update: we have opened a support request with GCP to diagnose and hopefully provide some recommendations

@wlach
Copy link
Contributor Author

wlach commented Oct 28, 2021

Update: we have opened a support request with GCP to diagnose and hopefully provide some recommendations

Google looked into it, but basically reproduced the same set of findings that are already in this issue: repeated use of regular expressions in BigQuery is (relatively) slow.

@wlach
Copy link
Contributor Author

wlach commented Nov 3, 2021

Feedback from Google suggests that the query as I'm testing it is only using a small number of slots (87) which may be partly responsible for the slowness. I think we're using on-demand pricing with GCP + Looker, which might limit the number of slots allocated to this query. Something to bear in mind if this comes up as a big issue for people.

@wlach wlach closed this as completed Nov 3, 2021
@wlach wlach reopened this Nov 3, 2021
@wlach wlach added the Events Issues or enhancements to the events explores label Dec 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Events Issues or enhancements to the events explores
Projects
None yet
Development

No branches or pull requests

1 participant