Decide how to handle union types (like [string, int]) in BQ #87
I would prefer using a JSON (string) type to encode variant types. This functionality is part of the design and is currently being used for histograms in the main ping. The lack of type information is fine, because all incoming pings are JSON anyway. Casting is also a behavior that handles nested objects.

If appropriate, I think we should use mozilla-schema-generator to fill in sections where variant map types occur, like scalars or user preferences. This skirts around the problem, but it's generally the right choice for our use case because of the data dictionary.

The use of a struct to encode the variant types is an interesting solution, but I don't think there are many usability gains over casting into a STRING other than the strong typing. The user still has to know what the type is ahead of time in order to use the struct properly (trying all combinations until the result is not null, perhaps), and it has the possibility of column bloat.
I was worried that JSON_EXTRACT would not be able to handle a string not wrapped in
And it returns a string. So I guess we're at the point of figuring out how we reach a decision about committing to this representation, such that we're comfortable deploying it and not being able to back it out.
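A minimal sketch of that kind of check; the literal values here are illustrative, and the expected results noted in comments are worth re-verifying:

```sql
-- JSON_EXTRACT accepts a bare JSON scalar, not just objects and arrays.
SELECT
  JSON_EXTRACT('4', '$')          AS bare_int,       -- returns '4'
  JSON_EXTRACT('"4"', '$')        AS quoted_string,  -- returns '"4"'
  JSON_EXTRACT_SCALAR('"4"', '$') AS scalar_value    -- returns '4' (quotes stripped)
```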
I think the decision should be left to the developers of the schema, though the option to explicitly cast via the `format: json` keyword could be supported, as in the following schema:

```json
{
"type": "object",
"properties": {
"event": {
"oneOf": [
{
"type": "string",
"format": "json"
},
{
"type": "array",
"items": {
"type": "array",
"items": {
"type": "string"
}
}
}
]
}
}
}
```

Unfortunately, this relaxes the schema and allows for potentially invalid data. The following examples would all be allowed:

```json
{"events": [["1", "a"], ["2", "b"]]}
{"events": "[[\"1\", \"a\"], [\"2\", \"b\"]]"}
{"events": "asdf"}
```
The one case where this would help (and that we currently do not support) is maps of variable types. For example:

```json
{
"type": "object",
"additionalProperties": {
"type": ["null", "boolean", "integer", "string"]
}
}
```

The resulting column will be empty. This is a similarly structured schema that is well defined:
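For contrast, a map whose values all share a single type is representable; a minimal sketch, with string values chosen only for illustration:

```json
{
  "type": "object",
  "additionalProperties": {
    "type": "string"
  }
}
```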
It looks like BigQuery does have a method for handling union types coming from Avro:
This sounds similar to one of the suggestions above (STRUCT&lt;int INT64, string STRING&gt;), but I'm not sure if it's the same thing. But we have hewed towards BQ's decisions on how to convert Avro types before (turning maps into repeated key/value structs), and I hesitate to deviate in this case.
That's great; the whole "collapse" algorithm was built specifically to get around the lack of support for variant types. The choices in this project have been mostly conservative (aside from object casting and case normalization) with respect to the BigQuery constraints. The Avro-to-BigQuery pathway is the de facto specification, since Avro is more expressive than BigQuery (maps and unions come to mind) and the import pathway is officially supported by BigQuery. I think it would be useful to implement #63 as part of handling unions.
Tested BQ's avro behavior. Created an Avro schema with a union type:
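A union-typed field in Avro looks roughly like this; the record and field names below are placeholders rather than the ones actually used in the test:

```json
{
  "type": "record",
  "name": "union_test",
  "fields": [
    {"name": "payload", "type": ["int", "string"]}
  ]
}
```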
I created a binary avro file containing a few records matching this schema:
And loaded that file to a new table in BQ:
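Loading a local Avro file into a scratch table generally looks like the following; the dataset and table names are placeholders:

```sh
bq load --source_format=AVRO tmp.union_test ./union_test.avro
```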
This is what I saw in the BQ schema browser:
So it appends
Here is a query which pulls out the
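A sketch of such a query, assuming the union field surfaced as a struct with one sub-field per member type; all names here are placeholders:

```sql
SELECT
  payload.int    AS int_value,    -- non-null when the record carried an int
  payload.string AS string_value  -- non-null when the record carried a string
FROM
  tmp.union_test
```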
And here's a query that filters using this
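Again a sketch with placeholder names; COALESCE folds the two variants into a single value for the comparison:

```sql
SELECT
  *
FROM
  tmp.union_test
WHERE
  COALESCE(CAST(payload.int AS STRING), payload.string) = '4'
```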
Seeking feedback on this from data users in #88
Currently, we let fields go to additional_properties if they are a union like [string, int]. We would like a way to include such fields in the table structure.
Options for how to express in BQ
- One option is to default to string in this case, so `4` and `"4"` in the JSON both become the string `4`. In this case, we lose information about what the original type was, but that doesn't seem terribly important.
- A variant on coercing to strings is that we could have it be a "JSON-formatted string field" such that `4` in JSON becomes the string `4` and `"4"` in JSON becomes the string `"4"` with the quotes retained. That would allow us to maintain original type information from the JSON, and maybe we'd be able to use JSON_EXTRACT functions on the field. The extra effort here doesn't seem worth it for the original type information.
- Another option is to turn this into a `STRUCT<int INT64, string STRING>` where only one of the values will be non-null (a sketch of this option follows the list).
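In DDL terms, the struct option would amount to something like the following, with hypothetical table and field names:

```sql
-- Only one member of the struct is populated for any given row.
CREATE TABLE my_dataset.my_table (
  field_name STRUCT<int INT64, string STRING>
)
```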
Options for naming the fields
I'm not sure if we've previously discussed the idea of potentially modifying the field name as a way to give ourselves flexibility to change how we want to encode in the future without having to change the type of a field.
For example, if we decided to use a struct, we could change field `field_name` to `field_name_struct` in the output BQ schema, so that if we decide that some other representation works better, we could add it with a different name rather than having to change the type of `field_name`, necessitating recreating tables.

cc @acmiyaguchi