fix(flink): fix compilation of memtable with nested data #8751
    def test_create_memtable(con, data, schema, expected):
        t = ibis.memtable(data, schema=ibis.schema(schema))
        # cannot use con.execute(t) directly because of some behavioral discrepancy between
        # `TableEnvironment.execute_sql()` and `TableEnvironment.sql_query()`
I raised an issue on Flink JIRA
Gnarly -- nice job tracking this down!
Can you also add a test that will start to XPASS when upstream supports using the ARRAY constructor with named structs?
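XPASS is pytest terminology for a test that was expected to fail but passed. As a rough illustration of the idea only, the sketch below uses the standard library's `unittest.expectedFailure` instead of a pytest marker so that it runs without extra dependencies. The function `array_constructor_accepts_named_structs` is a hypothetical stand-in for actually running the query against Flink; it is not part of ibis or Flink.

```python
import io
import unittest


def array_constructor_accepts_named_structs() -> bool:
    """Hypothetical stand-in for running the query against Flink.

    Returns False to model the current upstream behaviour.
    """
    return False


class TestArrayOfNamedStructs(unittest.TestCase):
    @unittest.expectedFailure
    def test_array_constructor_with_named_structs(self):
        # Fails today; reported as an unexpected success once the
        # stand-in starts returning True.
        self.assertTrue(array_constructor_accepts_named_structs())


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestArrayOfNamedStructs)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

While the stand-in returns False the run records one expected failure; if it returned True, the same test would be recorded as an unexpected success, which is the signal that the workaround can be removed.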
@gforsyth Thanks for the review! I addressed all of the comments.
Hey, I was staring at this a bit more and I think I have a cleaner way to handle the unaliasing pass -- lmk what you think.
ibis/backends/sql/dialects.py
    first_arg = seq_get(expression.expressions, 0)
    if isinstance(first_arg, sge.Struct):
        # it's an array of structs
        named_structs = False
        for arg in expression.expressions:
            for e in arg.expressions:
                if isinstance(e, sge.Alias):
                    named_structs = True
        # get rid of aliasing because we want to compile this as CAST instead
        args = deepcopy(expression.expressions)
        if named_structs:
            for arg in args:
                arg.set("expressions", [e.this for e in arg.expressions])

        format_values = ", ".join([self.sql(arg) for arg in args])
    first_arg = seq_get(expression.expressions, 0).copy()
    if isinstance(first_arg, sge.Struct):
        for arg in expression.expressions:
            arg.set("expressions", [e.unalias() for e in arg.expressions])
        format_values = ", ".join(
            [self.sql(arg) for arg in expression.expressions]
        )
there's an unalias method on sqlglot objects that I think lets us simplify this a fair bit (but I might not be covering some corner cases you've run across)
Thanks for the pointer - I like this solution and it definitely makes the code much cleaner! I'm tempted to operate on a deepcopy of expression.expressions to avoid unexpected consequences from passing this around...
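To make the copy-first idea concrete, here is a minimal sketch. `Literal`, `Alias`, `Struct` and `unaliased_copy` are hypothetical stand-ins written for this illustration; they are not sqlglot's classes and only mimic the `unalias()` behaviour described above. The sketch shows that stripping aliases on a deep copy leaves the original tree untouched.

```python
from __future__ import annotations

from copy import deepcopy
from dataclasses import dataclass, field


# Hypothetical stand-ins for expression nodes; these are NOT sqlglot classes.
@dataclass
class Literal:
    value: int

    def unalias(self) -> "Literal":
        # A non-alias node unaliases to itself.
        return self


@dataclass
class Alias:
    this: Literal
    alias: str

    def unalias(self) -> Literal:
        # Drop the name and keep only the wrapped value.
        return self.this


@dataclass
class Struct:
    expressions: list = field(default_factory=list)


def unaliased_copy(structs: list[Struct]) -> list[Struct]:
    """Return deep copies of `structs` with every field alias stripped."""
    copies = deepcopy(structs)
    for struct in copies:
        struct.expressions = [e.unalias() for e in struct.expressions]
    return copies


original = [Struct([Alias(Literal(1), "a")]), Struct([Alias(Literal(2), "a")])]
stripped = unaliased_copy(original)
```

After this runs, `original` still holds `Alias` nodes while `stripped` holds bare literals, so any later code that still needs the field names can keep reading them from the original.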
LGTM, thanks for bearing with my multi-stage review, @chloeh13q !
Description of changes
This PR aims to fix the compilation of memtables with nested data.
What was broken
In particular, Flink does not support the ``STRUCT(1 AS `a`)`` aliasing syntax for defining named STRUCTs. To define one, we must use a workaround based on `CAST`.
However, Flink also does not allow you to construct ARRAYs of named STRUCTs directly with the `ARRAY[]` constructor. This is a bug that I identified and have filed with the Flink community (JIRA ticket ref: https://issues.apache.org/jira/browse/FLINK-34898).
For the time being, we need another `CAST` workaround that casts the entire nested array.
How to fix
To summarize, the SQL forms involved are:
- `CAST(ARRAY[] AS ARRAY<ROW<>, ROW<>>)`
- `CAST(ROW() AS ROW<datatype of each field>)`
- `ROW()`
I thought of two approaches to this:
1. in the `visit_NonNullLiteral()` method
2. in the `Generator`
I found both kinds of implementation used in different scenarios and decided to go with option (2).
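The forms summarized above can be sketched as plain string builders. This is an illustration only and not the PR's implementation: the helper names are hypothetical, field values are assumed to be already rendered as SQL, and the exact SQL that ibis emits may differ, for example in identifier quoting.

```python
from __future__ import annotations


def row_literal(values: list[str]) -> str:
    """Render an unnamed ROW constructor from already-rendered field values."""
    return f"ROW({', '.join(values)})"


def row_type(fields: dict[str, str]) -> str:
    """Render a ROW type such as ROW<a INT> from name -> type pairs."""
    return "ROW<" + ", ".join(f"{name} {typ}" for name, typ in fields.items()) + ">"


def named_struct(values: list[str], fields: dict[str, str]) -> str:
    """Name the fields of a ROW by casting it to a ROW type."""
    return f"CAST({row_literal(values)} AS {row_type(fields)})"


def array_of_named_structs(rows: list[list[str]], fields: dict[str, str]) -> str:
    """Cast the whole array once instead of casting each element."""
    elements = ", ".join(row_literal(r) for r in rows)
    return f"CAST(ARRAY[{elements}] AS ARRAY<{row_type(fields)}>)"


fields = {"a": "INT"}
print(row_literal(["1"]))
# ROW(1)
print(named_struct(["1"], fields))
# CAST(ROW(1) AS ROW<a INT>)
print(array_of_named_structs([["1"], ["2"]], fields))
# CAST(ARRAY[ROW(1), ROW(2)] AS ARRAY<ROW<a INT>>)
```

Note that the array helper builds its elements from unnamed `ROW()` constructors and applies a single outer cast, which mirrors the reason for stripping aliases from the structs before generating the array.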
Issues closed
#8516