Improved(?) referenced_relations function #106

michalc · 2022-06-16T09:23:31Z

I think the referenced_relations function has a few issues:

- It doesn't deal well with dots in schema or table names. Concatenating the schema and table name with a dot between them prevents subsequent processing, since it is no longer uniquely defined which part is the schema and which is the table. For example, to be used in a query the name would have to be "my.schema.with.dots"."my_table.with.dots".
- It incorrectly assumes that anything matching a CTE name is to be removed, when it could actually be a table in a schema in the search path.

Below is a different function that mostly addresses these. It doesn't use the Visitor base class, but instead is a recursive function.

import pglast

def referenced_relations(query, default_schema="public"):
    def _relations(node, ctenames=()):
        # ctenames holds the CTE names that are in scope for this node
        tables = set()

        if node.get("withClause", None) is not None:
            if node["withClause"]["recursive"]:
                # WITH RECURSIVE: a CTE can refer to itself, so its name
                # is put in scope before its own body is processed
                for cte in node["withClause"]["ctes"]:
                    ctenames += (cte["ctename"],)
                    tables = tables.union(_relations(cte, ctenames))
            else:
                # Plain WITH: a CTE only sees the CTEs defined before it
                for cte in node["withClause"]["ctes"]:
                    tables = tables.union(_relations(cte, ctenames))
                    ctenames += (cte["ctename"],)

        # A RangeVar is a concrete relation if it is schema-qualified or
        # if its name does not match a CTE in scope
        if node.get("@", None) == "RangeVar" and (
            node["schemaname"] is not None or node["relname"] not in ctenames
        ):
            tables.add((node["schemaname"] or default_schema, node["relname"]))

        # Recurse into child nodes; withClause was already handled above
        for node_type, node_value in node.items():
            if node_type == "withClause":
                continue
            for nested_node in node_value if isinstance(node_value, tuple) else (node_value,):
                if isinstance(nested_node, dict):
                    tables = tables.union(_relations(nested_node, ctenames))

        return tables

    # Calling the parsed statement serializes its AST into nested dicts
    return sorted(list(_relations(pglast.parse_sql(query)[0]())))

which you can use like this, for example:

query = '''
    WITH my_ref AS (
        -- This queries a relation in a schema in the search path, and
        -- so should be included in the output
        SELECT * FROM my_ref
        WHERE a=1
    )
    SELECT * FROM my_ref
'''
print(referenced_relations(query))

It's not perfect, but at least for my use cases, it's improved, so wanted to share. I think it handles recursive CTEs appropriately, as well as chains of CTEs where I think later ones can refer to the previous ones. I think it also properly handles CTEs in subqueries, i.e. a CTE in a subquery won't prevent a relation with the same name but without a schema in another part of the query from being returned.

It can't be determined exactly which schema an unqualified relation is in, since there could be multiple schemas in the search path. If this is important to detect, the client code could pass default_schema=None, look at the returned list, and take some action if any entry has a None schema.
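As a rough sketch of that client-side check: assuming the function returns (schema, relname) tuples, with None as the schema of an unqualified relation when default_schema=None is passed, a caller could separate the two kinds. The helper name split_unqualified and the sample data are hypothetical. One caveat, as far as I can tell: in Python 3 a plain sorted() over a mix of (None, name) and ("schema", name) tuples raises TypeError, so a sort key is needed in that case.

```python
def split_unqualified(relations):
    """Split (schema, relname) tuples into qualified and unqualified ones."""
    qualified = [(s, r) for s, r in relations if s is not None]
    unqualified = [r for s, r in relations if s is None]
    return sorted(qualified), sorted(unqualified)

# Hypothetical result of referenced_relations(query, default_schema=None)
relations = {("sales", "orders"), (None, "my_ref"), (None, "customers")}

qualified, unqualified = split_unqualified(relations)
if unqualified:
    print("schema depends on the search path for:", unqualified)

# None and str do not compare, so map None to "" in the sort key
# (an illustrative choice, not something the thread prescribes).
ordered = sorted(relations, key=lambda t: (t[0] or "", t[1]))
```

The caller can then decide whether unqualified names are acceptable, or resolve them against its own knowledge of the search path.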

Not sure if this should be integrated into pglast somewhere, but happy to raise a PR if so.


lelit · 2022-06-16T09:36:39Z

Thank you for the report!

I cannot switch focus to this right now, but I surely will.

In any case, it would be great if you could provide some sample statements that trigger the issue you mentioned, or even proper failing tests 😉

michalc · 2022-06-16T10:19:15Z

> In any case, it would be great if you could provide some sample statements that trigger the issue you mentioned

Ah so incorrectly identifying relations as CTEs would be this one:

query = '''
    WITH my_ref AS (
        SELECT * FROM my_ref
        WHERE a=1
    )
    SELECT * FROM my_ref
'''
print(referenced_relations(query))

This should include my_ref since inside the CTE definition, my_ref has to be a real relation. However, the referenced_relations function from visitors returns the empty set(). (I'm not saying it's good practice to ever write such a query, but it's possible.)

And the dots issue:

query = '''
    SELECT * FROM "my.schema"."my.table"
'''
print(referenced_relations(query))

This returns {'my.schema.my.table'}, and so it's not possible to use this without making some assumption about which is the schema and which is the table name.
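To make the ambiguity concrete, here is a small sketch showing two different relations that collapse to the same dot-joined string, and how double-quoting each part keeps them apart. The helper names quote_ident and qualified_name are hypothetical and not part of pglast. Unlike a "quote only when needed" approach, this sketch always quotes, which is simpler but noisier.

```python
def quote_ident(name):
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def qualified_name(schema, relname):
    """Render a (schema, relname) pair as an unambiguous SQL name."""
    return quote_ident(schema) + "." + quote_ident(relname)

# Two different relations with the same dot-joined representation
a = ("my.schema", "my.table")
b = ("my", "schema.my.table")

print(".".join(a) == ".".join(b))  # True: the dot-joined form is ambiguous
print(qualified_name(*a))  # "my.schema"."my.table"
print(qualified_name(*b))  # "my"."schema.my.table"
```

Returning the (schema, relname) tuple itself, as the function proposed earlier in this thread does, avoids the problem in a different way by never joining the parts.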

lelit · 2022-06-16T10:34:23Z

Thanks, this helps a lot!

At first glance, both seem simple enough to be fixed in the visitor-based implementation.

michalc · 2022-06-16T11:01:56Z

Ah I couldn't quite figure out how to do it in the visitor-based implementation... I think it was the order that the nodes are visited in that caused me trouble.

This fixes a defect in the referenced_relations() function reported in issue #106.

This fixes the other defect reported in issue #106, when a CTE query references a relation with the same name as the CTE itself.

lelit · 2022-06-17T12:42:44Z

These commits should fix both the problems.

michalc · 2022-06-17T13:46:21Z

Ah great!

If it's ok to give feedback? The quoting one looks good to me, but the unusual case one - it looks like if the current node is inside a CTE, then it unconditionally returns the relation? I don't think that would always be the case, since inside a CTE there can be references to another CTE, e.g.

WITH my_cte_1 AS (
    SELECT 1
), my_cte_2 AS (
    SELECT * FROM my_cte_1
)
SELECT * FROM my_cte_2

or in the case of a single recursive CTE that references itself (taken from https://www.postgresql.org/docs/current/queries-with.html)

WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;

lelit · 2022-06-17T14:17:18Z

Oh, these are even trickier! 😄
Thanks for the cases, I will try to address them as well.

michalc · 2022-06-17T14:25:17Z

Ah if you're after tricky cases 😉 , before I started looking into this I didn't realise that CTEs could be in subqueries as well - so CTEs can even refer to other CTEs much further "up" the tree:

WITH my_cte_1 AS (SELECT 1)
SELECT * FROM (
    WITH my_cte_2 AS (SELECT * FROM my_cte_1)
    SELECT * FROM my_cte_2
) a

Traverse the statement tree twice, first to collect the CTE names, then to recognize concrete relation names. This should fix additional defects reported in issue #106.

lelit · 2022-06-18T10:58:21Z

The commit above should solve the trickier cases, but if you find more of them, be my guest 😃

This is yet another variant of glitch reported in issue #106.

michalc · 2022-06-18T12:58:24Z

Ah great! Although I've got some more tricky cases:

SELECT (WITH my_ref AS (SELECT 1) SELECT 1)
FROM my_ref

the my_ref on the second line is a concrete relation, but referenced_relations returns the empty set().

And:

WITH
  my_cte_1 AS (SELECT 1),
  my_cte_2 AS (
    WITH my_cte_1 AS (SELECT * FROM my_cte_1)
    SELECT * FROM my_cte_1
)
SELECT * FROM my_cte_2

where in this case referenced_relations returns {'my_cte_1'}, when I think it should return the empty set().

lelit · 2022-06-18T16:13:53Z

Oh, a never ending story then!

michalc · 2022-06-18T16:42:15Z

There do just seem to be so many cases...

On a related note, the recursive-ness of the original I posted has been ringing around my head. Thanks to this StackOverflow answer I converted this to a non-recursive version:

import pglast
from collections import deque

def referenced_relations(query, default_schema="public"):
    tables = set()

    node_ctenames = deque()
    node_ctenames.append((pglast.parse_sql(query)[0](), ()))

    while node_ctenames:
        node, ctenames = node_ctenames.popleft()

        if node.get("withClause", None) is not None:
            if node["withClause"]["recursive"]:
                for cte in node["withClause"]["ctes"]:
                    ctenames += (cte["ctename"],)
                    node_ctenames.append((cte, ctenames))
            else:
                for cte in node["withClause"]["ctes"]:
                    node_ctenames.append((cte, ctenames))
                    ctenames += (cte["ctename"],)

        if node.get("@", None) == "RangeVar" and (
            node["schemaname"] is not None or node["relname"] not in ctenames
        ):
            tables.add((node["schemaname"] or default_schema, node["relname"]))

        for node_type, node_value in node.items():
            if node_type == "withClause":
                continue
            for nested_node in node_value if isinstance(node_value, tuple) else (node_value,):
                if isinstance(nested_node, dict):
                    node_ctenames.append((nested_node, ctenames))

    return sorted(list(tables))
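The pattern used in the non-recursive version above can be shown in isolation: each queue entry carries a node together with the context that applies to it, here a tuple of names in scope. This is only a sketch on a toy tree whose "define", "use" and "children" keys are made up for illustration and are not pglast's node shape. Because tuples are immutable, extending the scope creates a new tuple and does not affect entries already queued.

```python
from collections import deque

def collect_unbound(tree):
    """Return names used in the tree that no ancestor defines."""
    unbound = set()
    queue = deque([(tree, ())])  # (node, names defined by ancestors)

    while queue:
        node, scope = queue.popleft()

        # A definition is visible to this node's descendants only.
        if "define" in node:
            scope += (node["define"],)  # new tuple; queued entries unaffected

        if "use" in node and node["use"] not in scope:
            unbound.add(node["use"])

        for child in node.get("children", ()):
            queue.append((child, scope))

    return unbound

tree = {
    "children": (
        {"define": "x", "children": ({"use": "x"}, {"use": "y"})},
        {"use": "x"},  # sibling branch: "x" is not in scope here
    )
}
print(sorted(collect_unbound(tree)))  # ['x', 'y']
```

The visiting order is breadth-first rather than depth-first, which does not matter here because the result is a set and each entry carries its own scope.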

Previous approach was wrong, because it unconditionally collected all CTE names before analyzing the statement itself, but they must instead be processed in order. This fixes further cases reported in issue #106.

lelit · 2022-06-19T09:26:38Z

The new version covers all these cases: I also found a RECURSIVE case, taken from PG regress tests, where your implementation fails 😉

Thanks a lot for your patience!

michalc · 2022-06-19T09:31:36Z

> The new version covers all these cases: I also found a RECURSIVE case, taken from PG regress tests, where your implementation fails 😉

Ah very good to know! Looks like I didn't quite understand how multiple recursive CTEs work together. Thank you!

michalc · 2022-06-19T11:02:02Z

Just for completeness/posterity, I think this version is better with recursive queries:

import pglast
from collections import deque

def referenced_relations(query, default_schema="public"):
    tables = set()

    node_ctenames = deque()
    node_ctenames.append((pglast.parse_sql(query)[0](), ()))

    while node_ctenames:
        node, ctenames = node_ctenames.popleft()

        if node.get("withClause", None) is not None:
            if node["withClause"]["recursive"]:
                ctenames += tuple((cte['ctename'] for cte in node["withClause"]["ctes"]))
                for cte in node["withClause"]["ctes"]:
                    node_ctenames.append((cte, ctenames))
            else:
                for cte in node["withClause"]["ctes"]:
                    node_ctenames.append((cte, ctenames))
                    ctenames += (cte["ctename"],)

        if node.get("@", None) == "RangeVar" and (
            node["schemaname"] is not None or node["relname"] not in ctenames
        ):
            tables.add((node["schemaname"] or default_schema, node["relname"]))

        for node_type, node_value in node.items():
            if node_type == "withClause":
                continue
            for nested_node in node_value if isinstance(node_value, tuple) else (node_value,):
                if isinstance(nested_node, dict):
                    node_ctenames.append((nested_node, ctenames))

    return sorted(list(tables))

(I didn't realise that in a WITH RECURSIVE clause, all the CTEs can refer to each other)
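The difference between the two branches can be sketched without any parsing. Assuming the scoping rule stated in this thread (in a plain WITH a CTE sees only the CTEs defined before it, while in WITH RECURSIVE all CTEs of the clause see each other), the hypothetical helper visible_names below computes which CTE names are in scope inside each CTE body.

```python
def visible_names(cte_names, recursive, outer=()):
    """Map each CTE name to the tuple of CTE names visible inside its body."""
    visible = {}
    if recursive:
        # WITH RECURSIVE: every CTE of the clause is in scope in every body.
        scope = tuple(outer) + tuple(cte_names)
        for name in cte_names:
            visible[name] = scope
    else:
        # Plain WITH: a body sees only the CTEs defined before it.
        scope = tuple(outer)
        for name in cte_names:
            visible[name] = scope
            scope += (name,)
    return visible

print(visible_names(["a", "b", "c"], recursive=False))
# {'a': (), 'b': ('a',), 'c': ('a', 'b')}
print(visible_names(["a", "b", "c"], recursive=True))
# {'a': ('a', 'b', 'c'), 'b': ('a', 'b', 'c'), 'c': ('a', 'b', 'c')}
```

The outer parameter stands for CTE names inherited from enclosing queries, which is what lets a CTE in a subquery refer to one defined further up the tree.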

lelit · 2022-06-19T13:45:26Z

I released v3.12 with this, thank you again!

michalc · 2022-06-19T14:14:29Z

No problem at all, in fact thank you!

lelit added a commit that referenced this issue Jun 17, 2022

Properly double quote relation names when needed

360f9d3

This fixes a defect in the referenced_relations() function reported in issue #106.

lelit added a commit that referenced this issue Jun 17, 2022

Handle unusual case in referenced_relations()

8c7d102

This fixes the other defect reported in issue #106, when a CTE query references a relation with the same name as the CTE itself.

lelit added a commit that referenced this issue Jun 18, 2022

Properly double quote CTE names

2e0be60

This is yet another variant of glitch reported in issue #106.

lelit closed this as completed Jun 19, 2022
