Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InputColumn class is ignoring index #880

Closed
2 tasks done
mamonu opened this issue Nov 8, 2022 · 2 comments
Closed
2 tasks done

InputColumn class is ignoring index #880

mamonu opened this issue Nov 8, 2022 · 2 comments
Assignees
Labels
comparison levels documentation Improvements or additions to documentation good first issue Good for newcomers

Comments

@mamonu
Copy link
Contributor

mamonu commented Nov 8, 2022

What happens?

I cannot access an array based column properly on Athena ( lat_lng_uncommon.lat , lat_lng_uncommon.long).
@ThomasHepworth has pinned the issue to InputColumn class

To Reproduce

import sqlglot
from splink.input_column import InputColumn
InputColumn("lat_long.test").input_name_as_tree
sqlglot.parse_one("lat_long.test")

this should give similar trees but it doesnt

image

OS:

Linux

Splink version:

3.4.1

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@ThomasHepworth
Copy link
Contributor

Ok, so, to summarise the issue and my opinions.

The basic problem is that the generated sql tree from lat_long.test is analogous to that which would be generated if you were referencing tables in a simple selection or join scenario.

i.e. sqlglot evaluates:

sql = "lat_long.test"
sqlglot.parse_one(sql, read="duckdb")

in the same manner it would:

select lat_long.test
from lat_long_df as lat_long

Taking the lat_long portion to be a table reference, rather than the name of a column followed by the struct index.

Screenshot 2022-11-08 at 17 37 44

This also means that simple tweaks such as changing the allowed signatures to include Column Identifier Identifier, do not fix the issue. Sqlglot will still incorrectly identify test as the column in the above, and as such will try to add the _l and _r suffix to that, instead of lat_long.

While we could add additional arguments and make tweaks to the various methods inside the InputColumn class to ensure it works, I'd recommend that we leave it alone.

My view is that this is all currently working as intended and any changes we make have the potential to break something else that we hadn't considered. Adding additional arguments to ensure this specific use case works seems excessive here and adds another consideration and potential area of confusion for users.

This issue is also easily overcome by the user putting in a little bit of extra legwork, which I'll add in the footnote below.


My suggestion for anyone who encounters this issue going forward would be to either:

  1. Generate an array, instead of a struct. Index the array as you normally would lat_long[1].
  2. Expand the struct to be multiple columns. In this instance, separate latitudinal and longitudinal columns.

@RossKen RossKen added documentation Improvements or additions to documentation comparison levels labels Jan 26, 2023
@RossKen
Copy link
Contributor

RossKen commented Jan 26, 2023

Add correct syntax to docstrings

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comparison levels documentation Improvements or additions to documentation good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

3 participants