Failed type dependency reconstruction #100

Closed
dvarrazzo opened this Issue Sep 4, 2014 · 32 comments

@dvarrazzo (Contributor) commented Sep 4, 2014

Steps to reproduce, attempt 1 (later I'll see if I can reproduce it without the ltree library installed, for the test suite's sake):

-- inside an empty database called 'pyrseas':
create schema xact;
create type ltree;
CREATE FUNCTION ltree_in(cstring) RETURNS ltree
    LANGUAGE c STRICT
    AS '$libdir/ltree', 'ltree_in';
ALTER FUNCTION ltree_in(cstring) OWNER TO postgres;
CREATE FUNCTION ltree_out(ltree) RETURNS cstring
    LANGUAGE c STRICT
    AS '$libdir/ltree', 'ltree_out';
ALTER FUNCTION ltree_out(ltree) OWNER TO postgres;
CREATE TYPE ltree (
    INPUT = ltree_in,
    OUTPUT = ltree_out,
    INTERNALLENGTH = variable,
    ALIGNMENT = int4,
    STORAGE = extended,
    CATEGORY = 'U');
CREATE TYPE xact.split_edit AS (    id integer,
    account_code ltree,
    amount numeric,
    amount_xn numeric,
    description text);

Try to dump it and recreate it in an empty database:

dbtoyaml -U postgres pyrseas | yamltodb -U postgres postgres

This results in the generation of:

ALTER SCHEMA xact OWNER TO postgres;

CREATE TYPE xact.split_edit AS (    id integer,
    account_code ltree,
    amount numeric,
    amount_xn numeric,
    description text);

ALTER TYPE xact.split_edit OWNER TO postgres;

CREATE TYPE ltree;
... 

This will fail to restore xact.split_edit, as the type ltree has not been created yet.

@dvarrazzo (Contributor) commented Sep 4, 2014

This is a more synthetic and self-contained test:

$ psql -U postgres
postgres=# create database pyrseas;
postgres=# \c pyrseas
psql (9.3.5, server 9.1.14)
You are now connected to database "pyrseas" as user "postgres".
pyrseas=# create type ltree as (s text);
CREATE TYPE
pyrseas=# create schema xact;
CREATE SCHEMA
pyrseas=# CREATE TYPE xact.split_edit AS ( account_code ltree);
CREATE TYPE
pyrseas=# \q

$ dbtoyaml -U postgres pyrseas | yamltodb -U postgres postgres
CREATE SCHEMA xact;

ALTER SCHEMA xact OWNER TO postgres;

CREATE TYPE xact.split_edit AS (    account_code ltree);

ALTER TYPE xact.split_edit OWNER TO postgres;

CREATE TYPE ltree AS (    s text);

ALTER TYPE ltree OWNER TO postgres;

Again, the dependent type is created before the type it depends on.


@jmafc jmafc added the yamltodb label Sep 4, 2014

@jmafc (Member) commented Sep 4, 2014

I'm afraid that dbtype.py (in particular TypeDict.link_refs and TypeDict.diff_map), and possibly also function.py, need code revisions to account for dependencies of functions on non-standard types.


@dvarrazzo (Contributor) commented Sep 5, 2014

It gets harder than that... but then it gets better. Let me explain.

Trying to force-feed the generated SQL to see if I could find other stumbling blocks, I found this one:

CREATE DOMAIN xact._ltree_no_recursion AS ltree
    CONSTRAINT _ltree_no_recursion_check CHECK (public.nlevel(VALUE) < 50);

Needless to say, the function has not been created yet. I thought: woah, he's never gonna be able to parse this... But then pg_dump has the same sort of problems: how does it solve them? There must be information in the db schema. And there is indeed: pg_depend.

Found the constraint in the catalog:

=# select oid, conname, consrc from pg_constraint 
where conname = '_ltree_no_recursion_check';
-[ RECORD 1 ]------------------------
oid     | 16445
conname | _ltree_no_recursion_check
consrc  | (public.nlevel(VALUE) < 50)

knowing the oid and its relation I can look for its dependencies:

=# select *, refclassid::regclass as table from pg_depend
where (classid, objid) = ('pg_constraint'::regclass, 16445);
 classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype |  table  
---------+-------+----------+------------+----------+-------------+---------+---------
    2606 | 16445 |        0 |       1255 |    16443 |           0 | n       | pg_proc
    2606 | 16445 |        0 |       1247 |    16444 |           0 | a       | pg_type

Sure enough, the pg_type entry is the domain on which the constraint is defined; the 'n' dependency is the function our constraint depends on.

=# create function nspname(oid) returns name as
$$select nspname from pg_namespace where oid = $1$$
language sql stable strict;

=# select nspname(typnamespace), typname from pg_type where oid = 16444;
 nspname |       typname       
---------+---------------------
 xact    | _ltree_no_recursion

=# select nspname(pronamespace), proname from pg_proc where oid = 16443;
 nspname | proname 
---------+---------
 public  | nlevel

So there is enough information in the database catalogs to rebuild the dependency graph. I don't know about Pyrseas internals: I assume there is already some form of dependency tracking, so that indexes are created after the tables they are defined on, etc. A quick grep shows that you know about pg_depend, but only use it to avoid dumping objects belonging to an extension.
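
A single catalog-wide query in the same spirit would enumerate all the user-object dependencies at once (a sketch, using only the documented pg_depend columns; 16384 is the first OID assigned to user objects):

select classid::regclass as catalog, objid,
       refclassid::regclass as ref_catalog, refobjid, deptype
from pg_depend
where deptype in ('n', 'a')   -- 'normal' and 'auto' dependencies
  and objid >= 16384;         -- skip pinned/system objects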

I think you have two approaches: either you add a depends_on field to the generated yaml, with a list of objects your object depends on, or you keep the yaml sorted in topological order. yamltodb should then respectively either rebuild the topological order from the dependencies found in the yaml, or trust the yaml to contain the objects in the right order. I think the first approach is preferable, for two reasons. One is that you effectively record more details about the database, which are otherwise hard to parse out. The second is that I see the yaml has the objects defined inside their schema. This makes it impossible to represent objects depending on other objects across schemas:

  • type s1.t1 depending on function s2.f1
  • type s2.t2 depending on function s1.f2

You cannot represent that with all the objects nested inside the schemas, whereas, back to my original example, you can have:

schema xact:
  domain _ltree_no_recursion:
    check_constraints:
      _ltree_no_recursion_check:
        expression: (public.nlevel(VALUE) < 50)
        depends_on:
            class: function
            object: public.nlevel(ltree)

independently of whether the xact schema is defined before or after the public schema.


@dvarrazzo (Contributor) commented Sep 5, 2014

Views are other objects that may depend on each other, and apparently this is not taken care of.


@dvarrazzo (Contributor) commented Sep 5, 2014

Another case that would be solved by dependency tracking is foreign keys: they need a matching unique constraint, so the unique index must be created first. I've checked, and the dependency relation is explicit in pg_depend.
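
A minimal illustration (table names made up):

create table b (id integer);
alter table b add constraint b_id_key unique (id);
-- fails if b_id_key (and its backing unique index) doesn't exist yet:
create table a (b_id integer references b (id));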


@dvarrazzo (Contributor) commented Sep 5, 2014

Last update on this issue (after hand-fixing all this stuff, my sql eventually got imported): some functions were defined before the types used as their arguments. Speaking with Joe, it seems that the type dependency is tracked correctly for the return type, so it's probably just a matter of extending it to the arguments.

To recap, proper dependency tracking should be done:

  1. between domains and funcs (because a domain can have a func as a constraint, and a func can have a domain as a type)
  2. across types (original bug submission)
  3. across views (because they can be defined on other views)

Cases 1 and 2 can be solved by putting all the types, domains and funcs together in the same graph; case 3 by creating a graph for the views. The foreign key problem can be solved much more easily, by defining the index before the fkeys.
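
For reference, a minimal sketch of the kind of topological sort this implies (plain Kahn's algorithm; deps maps each object to the set of objects that must be created before it):

from collections import defaultdict, deque

def topo_sort(objects, deps):
    # number of prerequisites not yet emitted, per object
    n_unmet = dict((o, len(deps.get(o, ()))) for o in objects)
    # reverse edges: prerequisite -> objects waiting on it
    dependents = defaultdict(list)
    for obj in objects:
        for d in deps.get(obj, ()):
            dependents[d].append(obj)
    queue = deque(o for o in objects if n_unmet[o] == 0)
    order = []
    while queue:
        obj = queue.popleft()
        order.append(obj)
        for dep in dependents[obj]:
            n_unmet[dep] -= 1
            if n_unmet[dep] == 0:
                queue.append(dep)
    if len(order) != len(objects):
        raise ValueError("dependency cycle detected")
    return order

# topo_sort(['ltree', 'xact.split_edit'], {'xact.split_edit': set(['ltree'])})
# returns ['ltree', 'xact.split_edit']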


@dvarrazzo (Contributor) commented Sep 6, 2014

Playing a bit with a generic dependency tracker: see my deptrack branch. Forked off from v0.7.1 because of #103. It currently reads the dependencies from the database but doesn't store them in the yaml.


dvarrazzo added a commit to dvarrazzo/Pyrseas that referenced this issue Sep 7, 2014

Sketching the framework for a topological sort
Dropped the a-priori (and broken) knowledge of the "right order" of the
classes to restore. The `diff_map()` method is split in an internal
version returning a map object -> changes so that we will be (hopefully)
able to topo-sort the statements according to the objects dependencies.
The refactoring is not implemented for all the objects: for the moment
I'm testing with a domain -> constraint -> function dependency as per
issue #100. It's also nice that the public schema ends up last in the
generated statements: the sort will have to put it first, even if it
doesn't make any diff.

The DbObject type is now hashable so that we can use it as a dict key.

@dvarrazzo (Contributor) commented Sep 9, 2014

Hi Joe, since you are taking a look here: as of de2e4dc, the following is still needed for it to be ready to really work:

  • several objects generate excessive yaml: some dependencies can be inferred from the context, and that requires the objects' get_implied_deps() to be implemented/fixed. I have a metadata dir generated with 0.7.1, so I can diff against it to see what new attributes are created and whether they are really needed. The test suite will also warn me, I guess.
  • there are still a few objects that need diff_map() and _diff_map() to be split. I've fixed the ones I'm using in my testing db, but if you run it against a database containing an object I haven't fixed yet, it will fail.
  • the loop between C types and their in/out functions must be dealt with. Currently I break the loop, because otherwise the toposort function gives an error (that's on purpose), but then the sql file can be invalid. One solution would be to introduce a shell type and make the functions depend on it. Another is to special-case these functions so that they are not dumped to sql like the other functions, but implicitly with the type. I have to check what Pyrseas currently does (I know it generates the shell type, the functions and then the real type; I just have to check where it does it :)

@dvarrazzo (Contributor) commented Sep 9, 2014

Joe, I keep running into a problem: looking up an object given its extkey. E.g. if I have cast(A[] as B[]), I may have B[] referred to as cast.target, and the function atob(A[]) as cast.function. Given the db object, how do I find these objects? I've hacked up that _get_by_extkey() function, but it feels wrong (e.g. see this usage). Also, the extkey for constraints is wrong: the pair (schema, name) is not unique, as two tables in the same schema may have constraints with the same name (maybe not unique constraints, as those define an index, but checks or fkeys, yes). This is probably an issue to fix in itself; it definitely gets in the way of the _get_by_extkey() hack.


@jmafc (Member) commented Sep 9, 2014

You have noticed the keylist class attributes, I hope? Cast.keylist is ['source', 'target'], so cast(A[] as B[]) would be found under db.casts['A[]', 'B[]']. Now, I must admit I've rarely if ever exercised the code with PG array types, so I don't know how that may fare.

The Constraint.keylist is ['schema', 'table', 'name'], not just (schema, constraint_name). HTH.


@dvarrazzo (Contributor) commented Sep 9, 2014

Yes, I've seen it. No, it wouldn't have helped, since for a cast (just an example) I need to find a function with the proper signature.

After writing other code I've improved that idiom:

func = db._get_by_extkey('function %s' % self.function)

However I've replaced it with:

fschema, fname = split_schema_obj(self.function)
# TODO: broken with pathological function with a '(' in its name
fname, fargs = fname.split('(', 1)
func = db.functions[fschema, fname, fargs[:-1]]

which could be a slight gain in terms of robustness, but definitely smells like something is missing (I'm probably using the db object in a way it wasn't really designed for, so I'm not blaming you for this :)

About my remark on the constraint key: I see the key is indeed unique; however, get_extkey() returns the constraint name, which is not. Now, the extkey probably wasn't meant to be globally unique either, but with the dependency tracker I should be able to take anything an object lists in depends_on and look it up in the db to get its DbObject. As you may have seen from the first commits in the branch, I've gone back and forth a bit trying to decide what to put in depends_on, and I've currently settled on the extkey: it sort of works, but it would definitely give problems on constraints with clashing names.

So, is there any string representation of an object that we

  • can rely on to be unique
  • can use to look up the object in the db (for every kind of object)

so that we can drop my broken _get_by_extkey() attempt?


@jmafc (Member) commented Sep 10, 2014

Where is this get_extkey() that you say returns the constraint name? Do you mean your _get_by_extkey(), or the DbObject extern_key()? Constraints don't really have an extern_key (although you can probably invoke it from Python). The extern_key was meant to be globally unique (IIRC, within the schema for a schema-level object), but only for "top level" objects: tables, views, functions, etc., not for indexes, constraints and other objects that can only exist "attached" to the former. For the latter, you could probably write a local extern_key() (or perhaps some other name) that would construct it from the keylist attributes.


@jmafc (Member) commented Sep 10, 2014

Re the last point: perhaps a unique_ident function that would return <object_type>.<schema_name>.etc., where the <schema_name>.etc. part is constructed directly from the keylist?
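
Something like this, perhaps (a sketch; 'objtype' stands for whatever per-class type label is available, keylist as discussed above):

def unique_ident(obj):
    # e.g. 'constraint.xact.t1._ltree_no_recursion_check'
    keys = [str(getattr(obj, attr)) for attr in obj.keylist]
    return '.'.join([obj.objtype.lower()] + keys)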


@dvarrazzo (Contributor) commented Sep 10, 2014

My first attempt was indeed to store something like (class,) + key(). It looked "ugly" in yaml, as it is a list and gets split across several lines, so I had something like:

depends_on:
  - - class1
    - key11
    - key12
    - key13
  - - class2
    - key21
    - key22
    - key23

while it could have been:

depends_on:
  - [class1, key11, key12, key13]
  - [class2, key21, key22, key23]

is there a way to tell PyYAML to "stay pretty"? :)
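
(For the record, PyYAML can be told to emit flow style per node by registering a custom representer; a minimal sketch:)

import yaml

class FlowList(list):
    """A list that should be dumped inline, [a, b, c] style."""

yaml.add_representer(
    FlowList,
    lambda dumper, data: dumper.represent_sequence(
        'tag:yaml.org,2002:seq', data, flow_style=True))

doc = {'depends_on': [FlowList(['class1', 'key11', 'key12', 'key13']),
                      FlowList(['class2', 'key21', 'key22', 'key23'])]}
print(yaml.dump(doc, default_flow_style=False))
# depends_on:
# - [class1, key11, key12, key13]
# - [class2, key21, key22, key23]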

But I just found another case where the extern_key() is just not right: if a type is defined on functions that don't belong to the same schema as the type itself, Pyrseas gives an error in TypesDict.link_refs(). I commented out the error just to move ahead (I suspect link_refs will just disappear at the end of deptrack), and noticed that Function.extern_key() never prints the namespace: it implies that the context in which it is used is the "correct" one (this is probably what causes the above error in link_refs()).

So yeah, I may go back to using the keylist in the yaml, because "correct" is probably a better property than "pretty" :)


@dvarrazzo (Contributor) commented Sep 10, 2014

Awww, problem! I can't find any dependency across views :( They declare a dependency on their output types, which is already good, but how do we track the views' inter-dependencies?


@dvarrazzo (Contributor) commented Sep 10, 2014

Found it, by dumping pg_catalog :) Views are rewrite rules. This query returns the dependencies (in readable format, and including the system catalogs):

select distinct ev_class::regclass as depending,
    relid[1]::oid::regclass as dependant
from (
    select ev_class, regexp_matches(ev_action, ':relid\s+(\d+)', 'g') as relid
    from pg_rewrite
    where rulename = '_RETURN') x
where ev_class::int <> relid[1]::int;

will plug it in tomorrow :)


@jmafc (Member) commented Sep 10, 2014

Since I haven't exercised any of the code in deptrack (I've only read through the submission comments and browsed the changes), I wasn't exactly aware of what you meant by "storing the dependencies in the YAML". Based on the example above, it seems you're proposing a generic depends_on attribute for each DbObject in the YAML output; stated differently, exposing the pg_depend catalog interspersed between the YAML objects. That's not exactly what I envisioned and, for lack of a better description, it doesn't look very "user-friendly".

I do believe that we need a better dependency resolution mechanism, i.e., the topological sort, but I really would prefer to keep the YAML as a sort of structured SQL, if you will, so that dependencies between objects take the form of named attributes in other objects, e.g., like dependent_funcs for views. If that is not achievable, perhaps we could go with a generic dependencies (or depends_on, if you prefer) attribute that lists the objects by category, or at least in a friendlier way than a sequence of internal keys.


@dvarrazzo (Contributor) commented Sep 10, 2014

Yes, the dependency tracking needs to store some tracking information: the dependencies are known by the db, but when stored to yaml they are lost. What I'm doing now is minimizing the information stored, because some of the dependency info can be reconstructed from the yaml anyway: for example, every schema object depends on the schema it is contained in, every function depends on the types of its arguments, and so on.

What I'm doing is implementing a method called get_implied_deps(db) that for every object type returns these objects (as a set of DbObject): for example, here we say that a foreign key depends on the table it references. Because it is a Constraint subclass, its parent class also says that it depends on the table it is defined on. With this method implemented, the dependencies to effectively store are really few: the set of objects found via pg_depend minus the objects known to get_implied_deps() (here it is). A few examples are in the first comments of this bug report: a constraint on the functions eventually called in its body, views on the other views/tables/functions used in their bodies, and a few others. The information is currently stored in a yaml attribute depends_on.

At the time of emitting sql, we call get_implied_deps() again, but we add back the info read from pg_depend (see the get_deps() function), and we topo-sort the graph. So even if there is nothing stored in depends_on, we have the implicit information, such as the mutual dependency across types (another thing pyrseas doesn't currently get right). This, I think, is analogous to what you do in link_refs(), but more complete.
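
In outline, the scheme looks like this (a sketch; the class bodies and the lookup helper are illustrative, not the actual deptrack code):

class DbObject(object):
    def get_implied_deps(self, db):
        # dependencies recoverable from context alone, with no need to
        # store anything in the yaml (e.g. a schema object's own schema)
        return set()

class Constraint(DbObject):
    def get_implied_deps(self, db):
        deps = super(Constraint, self).get_implied_deps(db)
        deps.add(db.tables[self.schema, self.table])  # table it is defined on
        return deps

def get_deps(obj, db):
    # implied dependencies plus the explicit ones stored in the yaml
    # ('find_by_extkey' is a hypothetical lookup helper)
    deps = set(obj.get_implied_deps(db))
    for extkey in getattr(obj, 'depends_on', ()):
        deps.add(db.find_by_extkey(extkey))
    return deps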

As for the way to store the info in the yaml, I'm completely open: for the problem in this thread, my current solution emits:

domain _ltree_no_recursion:
  check_constraints:
    _ltree_no_recursion_check:
      depends_on:
      - function nlevel(ltree)
      expression: (public.nlevel(VALUE) < 50)
  owner: piro
  type: ltree

which is about as readable as it gets, because it uses get_extkey() to store the info (see the get_deps() implementation in the link above). However, it is suboptimal, because for some objects the external key doesn't contain the whole resolution info (e.g. the constraints). As you say, the keylist is ugly, as it feels totally outside the model, and I agree with you.

With a small change (i.e. making sure the above key is really unique, which may take a method, call it fullextkey() for now, implemented as return self.get_extkey() in the base DbObject class and overridden in only a few subclasses), I think the above representation is optimal. depends_on is only present in a few cases, and where it is, I think it is right that it lives in the object. If you want to store it somewhere else in the yaml let me know, but I think that would be wrong.


@jmafc (Member) commented Sep 10, 2014

For things such as foreign key constraints, the dependencies are already taken care of in ClassDict.link_refs. In that case, I think there are two related problems with depending (pun intended) on pg_depend. One is that using pg_depend requires oids (at least in the way you implemented the queries), but oids are non-permanent; the second is that input objects, e.g. a new foreign key referencing a new table, have no oids (and no presence in pg_depend). So you're still going to need link_refs or similar methods to plug new dependencies into the topo sort.

For other dependencies, we already have several queries against pg_depend (12 by my count). The question then becomes whether we should centralize those in a single place or continue to spread them across classes. My OO side of the brain tends to go with the latter.


@dvarrazzo (Contributor) commented Sep 10, 2014

The oids are not saved and their persistence is not relied upon: they are used to find the objects at dbtoyaml time, but then an object key of some sort is stored in the yaml (if anything at all). At yamltodb time the oids are not used; the depends_on info is used instead. One step inside the scripts: the oids and pg_depend are used by Database.from_catalog(), the depends_on info by Database.from_map(). _link_refs() should disappear from both cases, because its knowledge is being moved into the various DbObject.get_implied_deps() implementations.

In current pyrseas, pg_depend is mostly used to filter away the objects belonging to an extension (SELECT objid FROM pg_depend WHERE deptype = 'e'). I only see a different usage in table.py and operclass.py. What I would do instead is actually fetch these objects too, but avoid including them in the diff and dump. This way you would be able to track that, e.g., the table bignums has a column of type mpz that belongs to the extension pgmp, so that "patching" a database to install that table would also install the extension. But this is another story.


@jmafc jmafc added the dependencies label Sep 15, 2014

@jmafc (Member) commented Sep 15, 2014

I've marked this issue and #72 with the tag dependencies, as they can probably be solved by your work on the deptrack branch. There may be others, so feel free to tag them as well. Note: #97 may also be affected, but I may be able to fix it in a simpler way.


@dvarrazzo (Contributor) commented Sep 17, 2014

Hello, an update on the state. While debugging to understand why the statements were not generated in the same order as the objects, I started finding all the ad-hoc solutions used to keep dependencies in order. Many of these are caused by the dependencies passed around by link_refs; then there is the inhstack, etc. So I'm trying a different approach: dropping all the dependencies that fight the global graph. Dropping link_refs in a single gulp is too big a change, as a lot of objects depend on the received attribute and just assume it is there, so everything explodes in AttributeError. A few links are actually weak, e.g. a constraint receives a _table that is only used to get its qualname(), which could be recreated perfectly well from the attributes. But link_refs for the table does much more: it puts the columns in place.

In short, the branch is becoming a major refactoring, as my dependency tracking fights with yours, and yours is scattered all over the place. The lack of constructors for the objects makes it very difficult to understand what the attributes are used for: some are always there yet guarded by hasattr; for some there is no hasattr check, but the attribute is actually added halfway through the run, making the structure and the behaviour of the program non-orthogonal. Honestly, it needs a lot of cleanup, and I'm not sure yet whether I'll get to the end of it or give up halfway.


@jmafc (Member) commented Sep 17, 2014

Thanks for the update and for all the effort, and apologies for all the ad hoc solutions. When Pyrseas got started, I was mostly envisioning tackling the "standard" database objects, i.e., tables and columns, views and perhaps functions, which is what Andromeda did and, IIRC, apgdiff as well. Then Peter Eisentraut wrote a comment essentially complaining that the VC tools I was reviewing in my blog only covered a few of the PG object types, which instigated my desire to do a complete (or nearly so) job of coverage. I probably should've dug into Post-Facto's topo sort at that time and saved both of us a lot of work in the long run. But, as they say, that's water under the bridge.

When you brought up the lack of class-specific constructors, I was thinking of starting to add those to the master branch, a class at a time. If you think it could help, I could do that and meet you halfway (although it may be more like a quarter or less on this side). Or if there is anything else I could help with, e.g., merging some (new) parts of deptrack that are most likely to survive the refactoring, please let me know.


@dvarrazzo (Contributor) commented Sep 17, 2014

Ha ha, I re-read Peter's article a few days ago, when I started hacking on Pyrseas again.

I'm getting there: I promise you will have a 1.0 release with a complete topo-sort ;) Can't promise when it will happen tho...

Listen, the best help you can give me is to fix #103 and check that master passes all the tests again: there is that assumption about pg_pltemplate that I think is wrong. If it works, I can rebase deptrack on master and then we can try not to diverge too much from there. As it is now, merging will already be a bit painful, but if you also start touching every single attribute of every db object, we won't get out of the experience alive :)

Tonight I was wondering "what am I doing here" :) Well, actually I like the potential of the program, and of the library too: your database introspection work is fairly complete and could be useful even outside of dbtoyaml and back. Once I'm finished with this deptrack thing, which is a sort of "strategic move", we can clean up the Python objects and the codebase generally (or we can do it in parallel, but only after I've merged or rebased on master). One use case I was thinking of is extension development: Pyrseas could be used to calculate the difference between the objects included in an extension and generate the upgrade script...


@dvarrazzo (Contributor) commented Sep 18, 2014

Joe, I haven't quite understood how renames are detected. I've seen that several diff_map() methods have a pattern like if hasattr(obj, 'oldname'): ... .rename(), but I haven't quite understood who sets that attribute. Any hint? Thank you!


@dvarrazzo (Contributor) commented Sep 18, 2014

Ah, ok: I found it in the issues.rst file: it is set manually. Never mind :)


@jmafc (Member) commented Sep 18, 2014

Yes, there is no good way to detect a name change, except perhaps for a table column being renamed, since it stays in the same position as the original (provided no intervening columns are dropped). And as stated in issues.rst, even if you manually add the oldname attribute in the YAML, you can't really store it in a VCS, because it would probably cause an error on the next run.
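
For illustration, the manual convention looks roughly like this (hypothetical object names), with the attribute removed again once the change has been applied:

schema public:
  table customers:
    oldname: clients    # rename clients -> customers instead of drop+create
    columns:
    - id:
        type: integer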


@dvarrazzo (Contributor) commented Sep 24, 2014

Hi, I could find an evening to work on the topic and made a big step forward; see the last commits. I've killed a lot of duplication in the code: all the logic for object comparison has been removed from the object dicts, because now we don't need to perform things "in stages". I've also started dropping objects in reverse topological order, although that will need more testing, as I'm mostly testing generating a database from scratch.

I've seen your work on master: I've rebased the deptrack branch on master, but it's currently giving me a problem. I will fix it in the next hacking session.

I'm close to being able to load my testing database using only the deptrack info. However, bug #102 is getting in the way. Would it be possible to fix that?

Apart from that, the only stopper to reloading my complex database is the fact that the indexes behind a pkey or unique constraint are not loaded from the db. They should be loaded but not dumped, because we need to transfer their dependencies onto the constraint. My case in point is a unique constraint on ltree failing to restore, because the operator class needed to create it is still missing. There is explicit dependency info, but it runs from the index to the opclass. Tricky, isn't it? :)


@jmafc (Member) commented Sep 24, 2014

Just took a quick look at #102. Not sure how easy or hard it may be to fix (I didn't even realize constraints could be inherited). In any case, I have some family issues over the next week or so, so I don't know how much time I'll have available, but I'll try to test it and take a deeper look.


@dvarrazzo (Contributor) commented Sep 24, 2014

No problem, take care!


dvarrazzo added a commit to dvarrazzo/Pyrseas that referenced this issue Oct 5, 2014

Sketching the framework for a topological sort

@jmafc (Member) commented Oct 12, 2017

@dvarrazzo Since this issue is what led to your creating the deptrack branch, and that branch has been essentially completed and merged into master, I think we can safely close it. Otherwise, the most we could do is verify that the initially reported test case has been corrected. I'll wait a couple of days before closing. In any case, if there are still unresolved dependency problems, it would be more appropriate to open new, specific issues.


@jmafc jmafc added this to the Release 0.8 milestone Nov 22, 2017

@jmafc (Member) commented Nov 22, 2017

@dvarrazzo Closing this issue now. I have tested the synthetic test case that you provided in the second comment and, against current master, the ltree type gets created before xact.split_edit.


@jmafc jmafc closed this Nov 22, 2017
