Improved(?) referenced_relations function #106
Thank you for the report! I can't switch focus to this right now, but I surely will. In any case, it would be great if you could provide some sample statements that trigger the issue you mentioned, or even proper failing tests 😉
Ah, so incorrectly identifying relations as CTEs would be this one:
query = '''
WITH my_ref AS (
    SELECT * FROM my_ref
    WHERE a=1
)
SELECT * FROM my_ref
'''
print(referenced_relations(query))
This should include
And the dots issue:
query = '''
SELECT * FROM "my.schema"."my.table"
'''
print(referenced_relations(query))
This returns
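To illustrate why dotted names are fragile, here is a minimal sketch. It assumes the ambiguity comes from joining schema and relation name with a bare `.`; that is an assumption about the cause, not a description of pglast's actual code. The helpers `dotted`, `quoted` and `quote_ident` are hypothetical names used only for this example.

```python
def quote_ident(name):
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def dotted(schema, relname):
    # Naive join: loses the boundary between schema and relation name.
    return f"{schema}.{relname}"


def quoted(schema, relname):
    # Quoted join: the boundary stays recoverable.
    return f"{quote_ident(schema)}.{quote_ident(relname)}"


# Two different relations collapse into the same dotted string...
same = dotted("my.schema", "my.table") == dotted("my", "schema.my.table")
# ...but stay distinct when each part is quoted (or kept as a tuple).
distinct = quoted("my.schema", "my.table") != quoted("my", "schema.my.table")
print(same, distinct)
```

Returning `(schema, relname)` tuples, as the functions posted later in this thread do, sidesteps the problem without any quoting.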
Thanks, this helps a lot! At first glance, both seem simple enough to be fixed in the visitor-based implementation.
Ah I couldn't quite figure out how to do it in the visitor-based implementation... I think it was the order that the nodes are visited in that caused me trouble.
This fixes a defect in the referenced_relations() function reported in issue #106.
This fixes the other defect reported in issue #106, when a CTE query references a relation with the same name as the CTE itself.
These commits should fix both the problems.
Ah great! If it's ok to give feedback? The quoting one looks good to me, but the unusual case one - it looks like if the current node is inside a CTE, then it unconditionally returns the relation? I don't think that would always be the case, since inside a CTE there can be references to another CTE, e.g.
WITH my_cte_1 AS (
    SELECT 1
), my_cte_2 AS (
    SELECT * FROM my_cte_1
)
SELECT * FROM my_cte_2
or in the case of a single recursive CTE that references itself (taken from https://www.postgresql.org/docs/current/queries-with.html)
WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
Oh, these are even trickier! 😄
Ah if you're after tricky cases 😉 , before I started looking into this I didn't realise that CTEs could be in subqueries as well - so CTEs can even refer to other CTEs much further "up" the tree:
WITH my_cte_1 AS (SELECT 1)
SELECT * FROM (
    WITH my_cte_2 AS (SELECT * FROM my_cte_1)
    SELECT * FROM my_cte_2
) a
Traverse the statement tree twice, first to collect the CTE names, then to recognize concrete relation names. This should fix additional defects reported in issue #106.
The commit above should solve the trickier cases, but if you find more of them, be my guest 😃
This is yet another variant of the glitch reported in issue #106.
Ah great! Although got some more tricky cases:
SELECT (WITH my_ref AS (SELECT 1) SELECT 1)
FROM my_ref
the
And:
WITH
    my_cte_1 AS (SELECT 1),
    my_cte_2 AS (
        WITH my_cte_1 AS (SELECT * FROM my_cte_1)
        SELECT * FROM my_cte_1
    )
SELECT * FROM my_cte_2
where this case
Oh, a never ending story then! |
There do just seem to be so many cases... On a related note, the recursive-ness of the original I posted has been ringing around my head. Thanks to this StackOverflow answer I converted this to a non-recursive version:
import pglast
from collections import deque

def referenced_relations(query, default_schema="public"):
    tables = set()
    node_ctenames = deque()
    node_ctenames.append((pglast.parse_sql(query)[0](), ()))
    while node_ctenames:
        node, ctenames = node_ctenames.popleft()
        if node.get("withClause", None) is not None:
            if node["withClause"]["recursive"]:
                for cte in node["withClause"]["ctes"]:
                    ctenames += (cte["ctename"],)
                    node_ctenames.append((cte, ctenames))
            else:
                for cte in node["withClause"]["ctes"]:
                    node_ctenames.append((cte, ctenames))
                    ctenames += (cte["ctename"],)
        if node.get("@", None) == "RangeVar" and (
            node["schemaname"] is not None or node["relname"] not in ctenames
        ):
            tables.add((node["schemaname"] or default_schema, node["relname"]))
        for node_type, node_value in node.items():
            if node_type == "withClause":
                continue
            for nested_node in node_value if isinstance(node_value, tuple) else (node_value,):
                if isinstance(nested_node, dict):
                    node_ctenames.append((nested_node, ctenames))
    return sorted(list(tables))
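The queue-based traversal pattern used above can be shown in isolation. The following is a minimal sketch on a hand-written nested dict; the tree is only a stand-in for a serialized parse tree and is not actual pglast output, and `collect_relnames` is a hypothetical helper name. It relies on the same two conventions as the function above: the `"@"` key holds the node type, and lists of children are tuples.

```python
from collections import deque


def collect_relnames(tree):
    """Breadth-first walk over nested dicts, without recursion."""
    found = []
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if node.get("@") == "RangeVar":
            found.append(node["relname"])
        for value in node.values():
            # A field is either a single value or a tuple of values.
            children = value if isinstance(value, tuple) else (value,)
            for child in children:
                if isinstance(child, dict):
                    queue.append(child)
    return found


# Hand-built illustration: one relation at the top level, one in a subquery.
tree = {
    "@": "SelectStmt",
    "fromClause": (
        {"@": "RangeVar", "relname": "a"},
        {
            "@": "RangeSubselect",
            "subquery": {
                "@": "SelectStmt",
                "fromClause": ({"@": "RangeVar", "relname": "b"},),
            },
        },
    ),
}
print(collect_relnames(tree))
```

Because each queue entry is independent, extra per-node state (such as the tuple of CTE names in scope) can be carried by queueing `(node, state)` pairs instead of bare nodes.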
The previous approach was wrong, because it unconditionally collected all CTE names before analyzing the statement itself, but they must instead be processed in order. This fixes further cases reported in issue #106.
The new version covers all these cases: I also found a RECURSIVE case, taken from PG regress tests, where your implementation fails 😉 Thanks a lot for your patience!
Ah very good to know! Looks like I didn't quite understand how multiple recursive CTEs work together. Thank you!
Just for completeness/posterity, I think this version is better with recursive queries:
import pglast
from collections import deque

def referenced_relations(query, default_schema="public"):
    tables = set()
    node_ctenames = deque()
    node_ctenames.append((pglast.parse_sql(query)[0](), ()))
    while node_ctenames:
        node, ctenames = node_ctenames.popleft()
        if node.get("withClause", None) is not None:
            if node["withClause"]["recursive"]:
                ctenames += tuple((cte['ctename'] for cte in node["withClause"]["ctes"]))
                for cte in node["withClause"]["ctes"]:
                    node_ctenames.append((cte, ctenames))
            else:
                for cte in node["withClause"]["ctes"]:
                    node_ctenames.append((cte, ctenames))
                    ctenames += (cte["ctename"],)
        if node.get("@", None) == "RangeVar" and (
            node["schemaname"] is not None or node["relname"] not in ctenames
        ):
            tables.add((node["schemaname"] or default_schema, node["relname"]))
        for node_type, node_value in node.items():
            if node_type == "withClause":
                continue
            for nested_node in node_value if isinstance(node_value, tuple) else (node_value,):
                if isinstance(nested_node, dict):
                    node_ctenames.append((nested_node, ctenames))
    return sorted(list(tables))
(I didn't realise that in a WITH RECURSIVE clause, all the CTEs can refer to each other)
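The difference between the two branches comes down to which CTE names are visible inside each CTE body. A minimal sketch of just that rule, with no parsing involved, might look like this; `visible_names_per_cte` is a hypothetical helper, not part of pglast.

```python
def visible_names_per_cte(cte_names, recursive, outer=()):
    """Map each CTE name to the tuple of CTE names visible inside its body.

    `outer` holds the names already in scope from enclosing WITH clauses.
    """
    result = {}
    if recursive:
        # WITH RECURSIVE: every sibling (and the CTE itself) is visible.
        scope = outer + tuple(cte_names)
        for name in cte_names:
            result[name] = scope
    else:
        # Plain WITH: only the siblings defined earlier are visible.
        scope = outer
        for name in cte_names:
            result[name] = scope
            scope += (name,)
    return result


print(visible_names_per_cte(["my_cte_1", "my_cte_2"], recursive=False))
print(visible_names_per_cte(["my_cte_1", "my_cte_2"], recursive=True))
```

In the plain case a CTE cannot see its own name, which is why `WITH my_ref AS (SELECT * FROM my_ref ...)` refers to a real relation inside the CTE body.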
I released v3.12 with this, thank you again!
No problem at all, in fact thank you!
The referenced_relations function I think has a few issues:
"my.schema.with.dots"."my_table.with.dots"
Below is a different function that mostly addresses these. It doesn't use the Visitor base class, but instead is a recursive function.
which you can use by, for example:
It's not perfect, but at least for my use cases, it's improved, so wanted to share. I think it handles recursive CTEs appropriately, as well as chains of CTEs where I think later ones can refer to the previous ones. I think it also properly handles CTEs in subqueries, i.e. a CTE in a subquery won't prevent a relation with the same name but without a schema in another part of the query from being returned.
It can't be determined exactly what schema schema-less relations are in - there could be multiple schemas in the search path, but if this is important to detect in a use, the client code could pass default_schema=None and then look at the returned list, and take some action if there is a None schema.

Not sure if this should be integrated into pglast somewhere, but happy to raise a PR if so.
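As a rough sketch of the client-side check described above, assuming the function returns `(schema, relname)` tuples as the versions posted in this thread do: the sample list is made up for illustration, and `unqualified` is a hypothetical helper name.

```python
def unqualified(relations):
    """Return the relation names whose schema could not be determined."""
    return [relname for schema, relname in relations if schema is None]


# Made-up sample of what a call with default_schema=None might return.
relations = [(None, "orders"), ("audit", "events")]

missing = unqualified(relations)
if missing:
    # The action to take is up to the caller: warn, reject, or resolve
    # the names against a known search path.
    print("relations without an explicit schema:", missing)
```

One caveat: the versions posted in this thread end with `sorted(list(tables))`, and Python 3 raises `TypeError` when sorting tuples that mix `None` and string schemas, so a query with both qualified and unqualified relations would likely need a sort key or an unsorted return when `default_schema=None`.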