Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved(?) referenced_relations function #106

Closed
michalc opened this issue Jun 16, 2022 · 17 comments
Closed

Improved(?) referenced_relations function #106

michalc opened this issue Jun 16, 2022 · 17 comments

Comments

@michalc
Copy link

michalc commented Jun 16, 2022

The referenced_relations function I think has a few issues:

  • Doesn't deal with dots in schema or table names well - concatenating the schema and table name with a dot between them won't allow subsequent processing since it's not uniquely defined which part is the schema and which the table e.g. if used in a query it will have to be "my.schema.with.dots"."my_table.with.dots"
  • Incorrectly assumes anything matching a CTE name is to be removed, when actually, it could be a table in a schema in the search path.

Below is a different function that mostly addresses these. It doesn't use the Visitor base class, but instead is a recursive function.

import pglast

def referenced_relations(query, default_schema="public"):
    def _relations(node, ctenames=()):
        tables = set()

        if node.get("withClause", None) is not None:
            if node["withClause"]["recursive"]:
                for cte in node["withClause"]["ctes"]:
                    ctenames += (cte["ctename"],)
                    tables = tables.union(_relations(cte, ctenames))
            else:
                for cte in node["withClause"]["ctes"]:
                    tables = tables.union(_relations(cte, ctenames))
                    ctenames += (cte["ctename"],)

        if node.get("@", None) == "RangeVar" and (
            node["schemaname"] is not None or node["relname"] not in ctenames
        ):
            tables.add((node["schemaname"] or default_schema, node["relname"]))

        for node_type, node_value in node.items():
            if node_type == "withClause":
                continue
            for nested_node in node_value if isinstance(node_value, tuple) else (node_value,):
                if isinstance(nested_node, dict):
                    tables = tables.union(_relations(nested_node, ctenames))

        return tables

    return sorted(list(_relations(pglast.parse_sql(query)[0]())))

which you can use by, for example:

query = '''
    WITH my_ref AS (
        -- This queries a relation in a schema in the search path, and
        -- so should be included in the output
        SELECT * FROM my_ref
        WHERE a=1
    )
    SELECT * FROM my_ref
'''
print(referenced_relations(query))

It's not perfect, but at least for my use cases, it's improved, so wanted to share. I think it handles recursive CTEs appropriately, as well as chains of CTEs where I think later ones can refer to the previous ones. I think it also properly handles CTEs in subqueries, i.e. a CTE in a subquery won't prevent a relation with the same name but without a schema in another part of the query from being returned.

It can't be determined exactly what schema schema-less relations are in - there could be multiple schemas in the search path, but if this is important to detect in a use, the client code could pass default_schema=None and then look at the returned list, and take some action if there is a None schema.

Not sure if this should be integrated into pglast somewhere, but happy to raise a PR if so.

@lelit
Copy link
Owner

lelit commented Jun 16, 2022

Thank you for the report!

I cannot switch focus on this right now, but I'll surely do.

In any case, it would be great if you could provide some sample statements that trigger the issue you mentioned, or even proper failing tests 😉

@michalc
Copy link
Author

michalc commented Jun 16, 2022

In any case, it would be great if you could provide some sample statements that trigger the issue you mentioned

Ah so incorrectly identifying relations as CTEs would be this one:

query = '''
    WITH my_ref AS (
        SELECT * FROM my_ref
        WHERE a=1
    )
    SELECT * FROM my_ref
'''
print(referenced_relations(query))

This should include my_ref since inside the CTE definition,my_ref has to be a real relation. However, the referenced_relations function from visitors returns the empty set(). (Not saying it's good practice to ever write such a query, but it's possible)

And the dots issue:

query = '''
    SELECT * FROM "my.schema"."my.table"
'''
print(referenced_relations(query))

This returns {'my.schema.my.table'}, and so it's not possible to use this without making some assumption about which is the schema and which is the table name.

@lelit
Copy link
Owner

lelit commented Jun 16, 2022

Thanks, this helps a lot!

At a first glance, both seems simple enough to be fixed in the visitor-based implementation.

@michalc
Copy link
Author

michalc commented Jun 16, 2022

Ah I couldn't quite figure out how to do it in the visitor-based implementation... I think it was the order that the nodes are visited in that caused me trouble.

lelit added a commit that referenced this issue Jun 17, 2022
This fixes a defect in the referenced_relations() function reported in
issue #106.
lelit added a commit that referenced this issue Jun 17, 2022
This fixes the other defect reported in issue #106, when a CTE query
references a relation with the same name as the CTE itself.
@lelit
Copy link
Owner

lelit commented Jun 17, 2022

These commits should fix both the problems.

@michalc
Copy link
Author

michalc commented Jun 17, 2022

Ah great!

If it's ok to give feedback? The quoting one looks good to me, but the unusual case one - it looks like if the current node is inside a CTE, then it unconditionally returns the relation? I don't think that would always be the case, since inside a CTE there can be references to another CTE, e.g.

WITH my_cte_1 AS (
    SELECT 1
), WITH my_cte_2 AS (
    SELECT * FROM my_cte_1
)
SELECT * FROM my_cte_2

or in the case of a single recursive CTE that references itself (taken from https://www.postgresql.org/docs/current/queries-with.html)

WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;

@lelit
Copy link
Owner

lelit commented Jun 17, 2022

Oh, these are even trickier! 😄
Thanks for the cases, I will try to address them as well.

@michalc
Copy link
Author

michalc commented Jun 17, 2022

Ah if you're after tricky cases 😉 , before I started looking into this I didn't realise that CTEs could be in subqueries as well - so CTEs can even refer to other CTEs much further "up" the tree:

WITH my_cte_1 AS (SELECT 1)
SELECT * FROM (
    WITH my_cte_2 AS (SELECT * FROM my_cte_1)
    SELECT * FROM my_cte_2
) a

lelit added a commit that referenced this issue Jun 18, 2022
Traverse the statement tree twice, first to collect the CTE names, then
to recognize concrete relation names.

This should fix additional defects reported in issue #106.
@lelit
Copy link
Owner

lelit commented Jun 18, 2022

The commit above should solve the trickier cases, but if you find more of them, be my guest 😃

lelit added a commit that referenced this issue Jun 18, 2022
This is yet another variant of glitch reported in issue #106.
@michalc
Copy link
Author

michalc commented Jun 18, 2022

Ah great! Although got some more tricky cases:

SELECT (WITH my_ref AS (SELECT 1) SELECT 1)
FROM my_ref

the my_ref on the second line is a concrete relation, but referenced_relations returns the empty set().

And:

WITH
  my_cte_1 AS (SELECT 1),
  my_cte_2 AS (
    WITH my_cte_1 AS (SELECT * FROM my_cte_1)
    SELECT * FROM my_cte_1
)
SELECT * FROM my_cte_2

where this case referenced_relations returns {'my_cte_1'} when I think it should return the empty set().

@lelit
Copy link
Owner

lelit commented Jun 18, 2022

Oh, a never ending story then!

@michalc
Copy link
Author

michalc commented Jun 18, 2022

There do just seem to be so many cases...

On a related note, the recursive-ness of the original I posted has been ringing around my head. Thanks to this StackOverflow answer I converted this to a non-recursive version:

import pglast
from collections import deque

def referenced_relations(query, default_schema="public"):
    tables = set()

    node_ctenames = deque()
    node_ctenames.append((pglast.parse_sql(query)[0](), ()))

    while node_ctenames:
        node, ctenames = node_ctenames.popleft()

        if node.get("withClause", None) is not None:
            if node["withClause"]["recursive"]:
                for cte in node["withClause"]["ctes"]:
                    ctenames += (cte["ctename"],)
                    node_ctenames.append((cte, ctenames))
            else:
                for cte in node["withClause"]["ctes"]:
                    node_ctenames.append((cte, ctenames))
                    ctenames += (cte["ctename"],)

        if node.get("@", None) == "RangeVar" and (
            node["schemaname"] is not None or node["relname"] not in ctenames
        ):
            tables.add((node["schemaname"] or default_schema, node["relname"]))

        for node_type, node_value in node.items():
            if node_type == "withClause":
                continue
            for nested_node in node_value if isinstance(node_value, tuple) else (node_value,):
                if isinstance(nested_node, dict):
                    node_ctenames.append((nested_node, ctenames))

    return sorted(list(tables))

lelit added a commit that referenced this issue Jun 19, 2022
Previous approach was wrong, because it unconditionally collected all
CTE names before analizying the statement itself, but they must instead
processed in order.

This fixes further cases reported in issue #106.
@lelit
Copy link
Owner

lelit commented Jun 19, 2022

The new version covers all these cases: I also found a RECURSIVE case, taken from PG regress tests, where your implementation fails 😉

Thanks a lot for your patience!

@michalc
Copy link
Author

michalc commented Jun 19, 2022

The new version covers all these cases: I also found a RECURSIVE case, taken from PG regress tests, where your implementation fails 😉

Ah very good to know! Looks like I didn't quite understand how multiple recursive CTEs work together. Thank you!

@michalc
Copy link
Author

michalc commented Jun 19, 2022

Just for completeness/posterity, I think this version is better with recursive queries:

import pglast
from collections import deque

def referenced_relations(query, default_schema="public"):
    tables = set()

    node_ctenames = deque()
    node_ctenames.append((pglast.parse_sql(query)[0](), ()))

    while node_ctenames:
        node, ctenames = node_ctenames.popleft()

        if node.get("withClause", None) is not None:
            if node["withClause"]["recursive"]:
                ctenames += tuple((cte['ctename'] for cte in node["withClause"]["ctes"]))
                for cte in node["withClause"]["ctes"]:
                    node_ctenames.append((cte, ctenames))
            else:
                for cte in node["withClause"]["ctes"]:
                    node_ctenames.append((cte, ctenames))
                    ctenames += (cte["ctename"],)

        if node.get("@", None) == "RangeVar" and (
            node["schemaname"] is not None or node["relname"] not in ctenames
        ):
            tables.add((node["schemaname"] or default_schema, node["relname"]))

        for node_type, node_value in node.items():
            if node_type == "withClause":
                continue
            for nested_node in node_value if isinstance(node_value, tuple) else (node_value,):
                if isinstance(nested_node, dict):
                    node_ctenames.append((nested_node, ctenames))

    return sorted(list(tables))

(I didn't realise that in a WITH RECURSIVE clause, all the CTEs can refer to each other)

@lelit
Copy link
Owner

lelit commented Jun 19, 2022

I released v3.12 with this, thank you again!

@lelit lelit closed this as completed Jun 19, 2022
@michalc
Copy link
Author

michalc commented Jun 19, 2022

No problem at all, in fact thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants