Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track unique ids for each scope #54

Merged
merged 2 commits into from
Jun 13, 2024
Merged

Track unique ids for each scope #54

merged 2 commits into from
Jun 13, 2024

Conversation

crisptrutski
Copy link
Collaborator

@crisptrutski crisptrutski commented Jun 11, 2024

Refs metabase/metabase#42586

In order to track the provenance of query outputs we will need to follow the flow of data from the source table columns, through the various scopes, to the final top-level expression.

In essence we replace each string label in the in the :context vector with a [label, integer-id] pair, so that we can tell when two scopes are the same, and in future use these references to build a DAG of the data dependencies.

The test here is very basic, it just checks that our set representation does not de-duplicate the contents of the identical sub-selects. This satisfies the TDD gods1, but isn't that useful on its own.

It'll be superseded when we get to more complex queries, but we're quite a way off from supporting the examples I have in mind.

Footnotes

  1. Actually I lied, and didn't write the test first, and then when I checked I was telling the truth I discovered that they already wouldn't be de-duplicated as Macaw tracks the second CTE's scope as nested within the first one. Well, I still think it's a useful minimal test for establishing which IDs should match and which shouldn't.

@@ -485,6 +503,21 @@
(components (query-fixture :snowflakelet))
(components (query-fixture :snowflake))))

(defn sorted-vec
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, doesn't return a vec anymore. All the usage sites compare it to vectors, which works fine, and is why I dropped the explicit conversion. In the next PR I simply rename it to sorted, so think this is OK to slip through.

SELECT
b.x,
c.x
FROM b, c;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add a trailing line-break in a later PR.

Comment on lines +515 to +516
(is (=? [{:component {:column "x"}, :scope ["SELECT" (=?/same :subselect-1)]}
{:component {:column "x"}, :scope ["SELECT" (=?/same :subselect-2)]}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that Hawk will care if these two values are actually the same as each other, or as :top-level, but we certainly do! Maybe we need something new like (=?/unique :blah) which enforces uniqueness across all the captures?

(collect/query->components statement (merge {:preserve-identifiers? true} opts)))

(defn replace-names
"Given a SQL query, apply the given table, column, and schema renames.
"Given an SQL query, apply the given table, column, and schema renames.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And now we know that I say "sequel" and you say "ess queue ell" :)

Copy link
Contributor

@tsmacdonald tsmacdonald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot to chew on, but it looks good to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants