Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gather Data Science feedback on union and tuple types #88

Open
jklukas opened this issue Aug 27, 2019 · 6 comments
Open

Gather Data Science feedback on union and tuple types #88

jklukas opened this issue Aug 27, 2019 · 6 comments

Comments

@jklukas
Copy link

jklukas commented Aug 27, 2019

Wanted: Data science review of query interface

We have a few remaining ambiguous types in JSON schemas that are preventing some important ping fields from appearing as fields in BigQuery. Currently, these ambiguous values end up as part of the additional_properties JSON blob in BigQuery ping tables, so they are available but awkward and potentially expensive to query.

This issue lays out the proposed interface for presenting these as fields. We want to gather feedback now from users of the data, because deploying these changes will be to some extent irreversible; once we coerce these fields to a certain BQ type, we cannot change our minds and choose a different type for an existing field.

For both of these schema issues, we're going to use the event ping to demonstrate.

tl;dr Please play with the query given below in #88 (comment) and leave feedback in this issue about any potential gotchas or improvements you'd like to see with the proposed transformations.

@jklukas
Copy link
Author

jklukas commented Aug 27, 2019

Event arrays (and other tuple types)

Discussed in depth in #38, we model events in main ping, event ping, focus event ping, etc. as a "tuple" of 6 elements. Each of those elements has a specific type, but the types are different for the different positions.

We propose to have the jsonschema-transpiler support these "tuple" types as STRUCTs with fields named according to BigQuery's anonymous field syntax.

Values in the event ping at payload.events.parent look like:

[
  [
    14721983,
    "uptake.remotecontent.result",
    "uptake",
    "remotesettings",
    "success",
    {
      "source": "settings-changes-monitoring",
      "trigger": "timer",
      "age": "240805"
    }
  ],
  ...
]

The pipeline will turn this into a repeated record field payload.events.parent in the BigQuery table like:

f0_ f1_ f2_ f3_ f4_ f5_.key f5_.value
14721983 uptake.remotecontent.result uptake remotesettings success source settings-changes-monitoring

See table moz-fx-data-shared-prod.analysis.klukas_event_raw for one day of data with events parsed into the above form.

Those field names aren't very useful, but they're easy to produce consistently in the pipeline and they match BQ conventions. With the prod data layout in BQ, though, we expect users to be accessing ping data through views, and we can provide more helpful names for these fields at the view layer.

So, in practice, a user would query the view telemetry.event and get nicer names. I've produced an example of what the view will look like in moz-fx-data-shared-prod.analysis.klukas_event:

SELECT
  document_id,
  parent_events.*
FROM
  `moz-fx-data-shared-prod.analysis.klukas_event`
CROSS JOIN
  UNNEST(payload.events.parent) AS parent_events
LIMIT
  2

@jklukas
Copy link
Author

jklukas commented Aug 27, 2019

Union types

It's great to have a concrete type everywhere, but we unfortunately have plenty of ambiguous cases where our JSON schema has to allow multiple types, such as ["integer", "boolean"] because different clients send the field as different types.

We have prior art for this in how BigQuery handles importing Avro files that contain union types. In such cases, the field becomes a struct with one field for each type in the union named like <avro_type_name>_value. For each row, only one of the fields in the struct will be populated with a non-null value.

Like in the tuple case, views will give us some flexibility to present these structs in different ways for different contexts.

A concrete example in event pings and main pings is environment.addons.theme.foreignInstall which can be either a boolean or an int. This will be transformed in BQ to a STRUCT<boolean_value BOOLEAN, long_value INT64>.

@jklukas
Copy link
Author

jklukas commented Aug 27, 2019

Putting it all together

Here is a query on top of view moz-fx-data-shared-prod.analysis.klukas_event that shows the proposed user interface for these two types of data:

SELECT
  -- A field from the top level of the ping
  document_id,

  -- A union-type field, where we only care about cases where the client had a boolean value
  environment.addons.theme.foreign_install.boolean_value AS theme_foreign_install,

  -- Unnested contents of events from process "parent"
  parent_events.*

FROM
  `moz-fx-data-shared-prod.analysis.klukas_event`
CROSS JOIN
  UNNEST(payload.events.parent) AS parent_events
WHERE
  environment.addons.theme.foreign_install.boolean_value = TRUE
LIMIT
  10
Row document_id theme_foreign_install event_timestamp event_timestamp_parsed event_category event_method event_object event_string_value event_map_values.key event_map_values.value  
1 f23484ce-8f18-4005-b0e2-7ade05cbf38e true 281505 2019-08-26 12:45:41.505 UTC addonsManager update extension 1 updated_from app  
                  step started  
                  addon_id {b9db16a4-6edc-47ec-a1f4-b86292ed211d}  
2 2ef7818d-1efc-44c1-9bcf-0dad2490fdab true 11861113 2019-08-26 15:29:41.113 UTC pwmgr open_management mainmenu null  

@tdsmith
Copy link

tdsmith commented Aug 27, 2019

I'll always take rectangular data over nested data if you're offering it to me; Sunah says we can build an additional view on top to recreate something like the Parquet events table. I think that would lead me to be less worried about the underlying representation.

I'm not sure how y'all came to the decision about how to represent histograms in main pings, but, similarly, representing the event extras as JSON objects rather than a key-value-struct-array would be a little goofy-looking but might be easier to work with.

The union type proposal makes sense to me; as a user, if I end up typing COALESCE(CAST(foo.integer_value AS STRING), foo.string_value) a few times, that's not the end of the world. I wonder a little about whether we could get away with coercing ambiguous types to their string representation and throwing away the original type information but that seems worse in most respects.

@jklukas
Copy link
Author

jklukas commented Aug 28, 2019

I wonder a little about whether we could get away with coercing ambiguous types to their string representation and throwing away the original type information but that seems worse in most respects.

It does feel like storing these values as a JSON string could be more convenient in some ways, but would introduce more ambiguities. We can preserve the original JSON types by including quotes in the JSON string if the original value was a string, but then the user would need to strip out the quotes. If we don't include the quotes, then we lose type information (which doesn't seem very important, but I could imagine cases where it turns out to be).

I'm imagining that in most cases union types are a historical artifact where ancient versions produce one of the types and will eventually fall off completely, so analyses will often be able to assume a single type and not worry about COALESCEing multiple potential types.

There might be room for a few UDFs to help with common cases of parsing these as well, but I think we'll understand that better once folks are making queries on top of these union types in the wild.

@sunahsuh
Copy link

sunahsuh commented Sep 6, 2019

We can preserve the original JSON types by including quotes in the JSON string if the original value was a string, but then the user would need to strip out the quotes.

Desktop Firefox only allows strings anyway for both the key and value, which we added to the event specification early on due to lack of union type support in Parquet. (See the blurb for "extra" here: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/collection/events.html#serialization-format)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants