Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: STRUCT and ARRAY support #318

Merged
merged 37 commits into from Sep 9, 2021
Merged

feat: STRUCT and ARRAY support #318

merged 37 commits into from Sep 9, 2021

Conversation

@jimfulton
Copy link
Contributor

@jimfulton jimfulton commented Aug 30, 2021

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #293
Fixes #314
Fixed #233
Fixes #37
馃

@jimfulton
Copy link
Contributor Author

@jimfulton jimfulton commented Aug 31, 2021

FTR, WRT superset, once I finally got it working :), it behaves the same with and without these changes.

BTW, we have logic that tries to unpack sub-structs, I think so that there would eventually be scalars for superset to work with.

If you have an array of structs, we still create columns for the fields of the struct in the array. This causes superset to error, because it has no way to get at structs in an array. We should probably not unpack structs in arrays.

@snippet-bot
Copy link

@snippet-bot snippet-bot bot commented Sep 1, 2021

Here is the summary of changes.

You are about to add 10 region tags.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
To update this comment, add snippet-bot:force-run label or use the checkbox below:

  • Refresh this comment

@jimfulton jimfulton marked this pull request as ready for review Sep 2, 2021
@jimfulton jimfulton requested review from as code owners Sep 2, 2021
@jimfulton jimfulton requested a review from tmatsuo Sep 2, 2021
@jimfulton
Copy link
Contributor Author

@jimfulton jimfulton commented Sep 2, 2021

Some notes for reviewers:

  • Heart of change is _struct.py, which isn't large, but also isn't obvious. :( I cribbed from the built-in JSON and ARRAY types. When reviewing, it's probably helpful to look at those. The "Comparator" framework is confusing, in large part because the name doesn't make sense.

    Having said that, the core logic is in _setop_getitem, which is a hook used by the base class of STRUCTs comparator.

    The __getattr__ method just delegates to (the inherited) __getitem__.

  • This PR also has 2 other small changes:

    • Machinery for mapping BQ types to SQLAlchemy types has been factored into a separate _types module, both to avoid cluttering base more and to partially avoid circular imports. (There's still a circular import issue that isn't fixable without a bigger refactoring that I deemed unwarranted.)
    • Implementation of ARRAY indexing, which wasn't implemented. A number of my tests (see test__struct.py in both unit and system tests) used an example that has nested structs and arrays.

@tswast tswast self-requested a review Sep 7, 2021
Copy link
Collaborator

@tswast tswast left a comment

I haven't quite digested everything in this PR yet, but I figured I'd share the feedback I have so far.

docs/struct.rst Show resolved Hide resolved
setup.py Show resolved Hide resolved
sqlalchemy_bigquery/_struct.py Outdated Show resolved Hide resolved
sqlalchemy_bigquery/_struct.py Show resolved Hide resolved
sqlalchemy_bigquery/_struct.py Outdated Show resolved Hide resolved
sqlalchemy_bigquery/_struct.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@tswast tswast left a comment

I like it! Just a few things I think we should clarify before merging.

sqlalchemy_bigquery/_struct.py Outdated Show resolved Hide resolved
global type_compiler

try:
process = type_compiler.process
except AttributeError:
type_compiler = base.dialect.type_compiler(base.dialect())
process = type_compiler.process
Copy link
Collaborator

@tswast tswast Sep 8, 2021

Could we put this in a _get_type_compiler / _get_process function? I don't see anywhere else we initialize type_compiler, but I'd be more comfortable having this logic closer to the # We have to delay getting the type compiler, because of circular imports. :( comment.

Copy link
Contributor Author

@jimfulton jimfulton Sep 8, 2021

I refactored so this is combined and isolated in one place using a new, better named _get_subtype_col_spec function.

type_compiler = base.dialect.type_compiler(base.dialect())
process = type_compiler.process

fields = ", ".join(f"{name} {process(type_)}" for name, type_ in self.__fields)
Copy link
Collaborator

@tswast tswast Sep 8, 2021

I assume process is able to handle nested arrays/structs?

Copy link
Contributor Author

@jimfulton jimfulton Sep 8, 2021

yes

sqlalchemy_bigquery/_struct.py Outdated Show resolved Hide resolved
f"STRUCT fields can only be accessed with strings field names,"
f" not {name}."
)
subtype = self.expr.type._STRUCT__byname.get(name.lower())
Copy link
Collaborator

@tswast tswast Sep 8, 2021

Where does _STRUCT__byname come from? I'm assuming somewhere from SQLAlchemy, but I'm not getting any results when searching for byname.

Copy link
Collaborator

@tswast tswast Sep 8, 2021

Oh, I think I figured it out: https://docs.python.org/3/tutorial/classes.html#private-variables

Any identifier of the form __spam (at least two leading underscores, at most one trailing underscore) is textually replaced with _classname__spam, where classname is the current class name with leading underscore(s) stripped.

Can we comment about this? I assume we have to do it because we know self.expr.type is a STRUCT, but it's not self.

Copy link
Contributor Author

@jimfulton jimfulton Sep 8, 2021

I refactored this to make name mangling more explicit and consistent, so I don't think comments are needed anymore. See if you agree. :)

I mainly use "private" variables, which aren't :), to avoid namespace conflicts when subclassing across responsibility boundaries. Arguably, explicit naming is better.

return operator, index, subtype

def __getattr__(self, name):
if name.lower() in self.expr.type._STRUCT__byname:
Copy link
Collaborator

@tswast tswast Sep 8, 2021

I'm a bit confused why self.__byname doesn't work in this case.

Edit: I see now that it's part of the Comparator class. Still probably worth a similar comment to the one I recommend in _setup_getitem

Copy link
Contributor Author

@jimfulton jimfulton Sep 8, 2021

See my response on name mangling

sqlalchemy_bigquery/_struct.py Show resolved Hide resolved
for f in field.fields
]
results += _get_transitive_schema_fields(sub_fields, cur_fields)
cur_fields.pop()
Copy link
Collaborator

@tswast tswast Sep 8, 2021

Since we pop these off, does that mean we don't get the top-level struct field, just the leaf fields? I suspect this might hide some ARRAY columns if a parent node has mode REPEATED, but is not included.

Edit: I see the top field is added to results on line 83. Might be worth a comment as to why we pop here.

Copy link
Contributor Author

@jimfulton jimfulton Sep 8, 2021

I got rid of cur_fields. It wasn't needed (anymore).

for field in fields:
results += [field]
if field.field_type == "RECORD":
cur_fields.append(field)
Copy link
Collaborator

@tswast tswast Sep 8, 2021

I don't quite understand what cur_fields is doing. Is there a better name we can pick for this? Maybe it's referring to ancestors?

Copy link
Contributor Author

@jimfulton jimfulton Sep 8, 2021

Haha, it's not doing anything.

This is based on

def _get_columns_helper(self, columns, cur_columns):
"""
Recurse into record type and return all the nested field names.
As contributed by @sumedhsakdeo on issue #17
"""
results = []
for col in columns:
results += [
SchemaField(
name=".".join(col.name for col in cur_columns + [col]),
field_type=col.field_type,
mode=col.mode,
description=col.description,
fields=col.fields,
)
]
if col.field_type == "RECORD":
cur_columns.append(col)
results += self._get_columns_helper(col.fields, cur_columns)
cur_columns.pop()
return results
, which I inherited.

I've refactored it quite a bit and failed to notice that this wasn't needed any more. Fixed.

(_col().NAME, "`t`.`person`.NAME"),
(_col().children, "`t`.`person`.children"),
(
_col().children[0].label("anon_1"), # SQLAlchemy doesn't add the label
Copy link
Collaborator

@tswast tswast Sep 8, 2021

Should we file an issue for this to investigate later? If so, let's add TODO and link to the issue.

Copy link
Contributor Author

@jimfulton jimfulton Sep 8, 2021

Done

tswast
tswast approved these changes Sep 9, 2021
global _get_subtype_col_spec

type_compiler = base.dialect.type_compiler(base.dialect())
_get_subtype_col_spec = type_compiler.process
Copy link
Collaborator

@tswast tswast Sep 9, 2021

Fancy! I didn't realize a function could replace itself. I like it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants