Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

database schema serialization format #32

Closed
andreypopp opened this Issue May 17, 2012 · 12 comments

Comments

Projects
None yet
2 participants
Contributor

andreypopp commented May 17, 2012

Currently database schema serialization format consist of single YAML file. The problem with that is when database schema becomes a bit too large — several views and functions along with twenty or so tables — maintaining and review of changes becomes a bit difficult.

What I propose is different serialization format:

  • Serialization of views, functions, tables into separate files (schema.name.yaml or even using subdirs schema/name.yaml)
  • Make dbtoyaml and yamltodb operate on collections of such files.

This is just a general view, details should be refined. Also, this serialization format could be optional.

Owner

jmafc commented May 17, 2012

Agreed. Even before I created dbtoyaml, I preferred the "single object" model for storing, e.g., table definitions, in a VCS. You can do that, manually, by using "dbtoyaml -t" repeatedly, but

  • it is cumbersome when you have many tables, etc., (even if you script it, you could inadvertantly omit some), and
  • the individual tablename.yaml files can't be simply concatenated for processing by yamltodb without pre-processing to remove redundant "schema xxx" lines.

In addition, I believe programmers and DBAs would prefer to store the textual definition of views and function source as separate objects (non-redundantly) in a nicely indented format rather the output from pyyaml. And yes, the serialization enhacement should be optional.

Thinking out loud, perhaps this could be prototyped with a script that serialized the "all in one" format to separate files, and a converse tool to merge them back for input to yamltodb.

Contributor

andreypopp commented May 17, 2012

In addition, I believe programmers and DBAs would prefer to store the textual definition of views and function source as separate objects (non-redundantly) in a nicely indented format rather the output from pyyaml. And yes, the serialization enhacement should be optional.

Exactly! Same holds for tables — storing them in plain DDL would be nice too (this is subject for another ticket, but anyway... that would require additional effort for building SQL/DDL parser and maintaining its compatibility with original PostgreSQL's parser — maybe it would be nice to try to include Pyrseas as PostgresSQL's contrib module, what do you think?).

Owner

jmafc commented May 17, 2012

I'm not keen on storing the SQL DDL for tables and particularly on parsing SQL, which is an inferior language IMHO (I prefer D, aka Tutorial D). Apgdiff already provides the capability of storing pg_dump output but it's hard to provide full coverage of PG features that way. As to contrib, Pyrseas is already in PGXN. I had started an attempt to embed Pyrseas (dbobject) code as a PL/Pythonu module(s) but didn't get too far.

Contributor

andreypopp commented May 23, 2012

I've done some experimental work on this, see commit 40698292cc5bc6793a545322b7268bf83ec2b43e:

  • View.definition and Function.source are now wrapped in pyrseas.yaml.MultiLineStr to be serialized using | style. That would make serialized definitions more readable.
  • dbtoyaml and yamltodb commands can now serialize/deserialize schema to/from repository — directory with schema.yaml and separate YAML file for each view with filename in form of schemaname.viewname.yaml.

I found it a bit difficult to store views as plain SQL files because COMMENT statement should be stored separately of CREATE VIEW... statement — that would require minimal parsing (actually just split two statements by ;), but if we do the same with functions — that would not be so easy. I think we could stick to some conventions like — "COMMENT statement should be the last one in file" or something like this, though I'm not a big fan of those non-intuitive implementation-dependent things.

Regarding functions — I also would like to store them in separate files, but besides all points above there is another difficulty — how to store functions with the same name but different arguments' types — a. store them in single file or b. store them separately but how we should encode filenames then?

Given all these questions I still found that storing database schema in multiple files feels more natural.

Owner

jmafc commented May 23, 2012

Very interesting. It'll take me a while to digest it and see how it all works out. With regard to functions, if separate files are used, the names would have to follow the keylist attribute of class Proc, i.e., schema (name), (function) name, and arguments. For example, given public functions f(int, int) and f(int, text), the filenames could be public.f.integer_integer.yaml and public.f.integer_text.yaml. There are issues of course: f() stored as f..yaml? how to deal with data types with embedded spaces?

Contributor

andreypopp commented May 24, 2012

Just after thinking a bit more about functions — I think functions with the same name are closely related to each other so it might be a good idea to store them in single file.

Owner

jmafc commented May 24, 2012

I cloned your repo to try things out. I'm not sure I understand what you did in pyrseas/yaml.py: importing yaml in a module named 'yaml' looks like looking for trouble. In any case, when I try to run dbtoyaml, I see:

$ export PYTHONPATH=$PWD
$ pyrseas/dbtoyaml.py --help
Traceback (most recent call last):
  File "pyrseas/dbtoyaml.py", line 12, in <module>
    from pyrseas import yaml
  File "/home/jma/src/andreypopp/pyrseas/yaml.py", line 5, in <module>
    import yaml
  File "/home/jma/src/andreypopp/pyrseas/yaml.py", line 14, in <module>
    yaml.add_representer(MultiLineStr, MultiLineStr_presenter)
AttributeError: 'module' object has no attribute 'add_representer'

If I print(dir(yaml)) before the line in error, I see:

['MultiLineStr', 'MultiLineStr_presenter', '__all__', '__builtins__', '__doc__', '__file__', '__name__', '__package__', 'absolute_import', 'yaml']

My test environment is Python 2.7.3rc2 on Linux (Debian), but also saw the same behavior with Python 3.2.3rc2.

Owner

jmafc commented May 24, 2012

This is an interesting thread about splitting db objects in pg_dump output, in particular, various observations on directory structures for the output and making unique names for functions:

http://archives.postgresql.org/pgsql-hackers/2010-12/threads.php#02285
http://archives.postgresql.org/pgsql-hackers/2011-01/thrd16.php#00047

Also, in trying to keep dbtoyaml options similar/parallel to pg_dump options, I think the --repository option could be specified as -Fd/--format=directory since the default format will be a single file (somewhat equivalent to -Fp/--format=plain). I realize that the 'directory' format of pg_dump doesn't output the same structure as we're suggesting here, so perhaps --format=tree or --format=split may be more appropriate.

Owner

jmafc commented Nov 14, 2012

Andrey, I finally got around to looking a bit more at this (because Josep Martínez Vilà is also interested in this feature). Since I couldn't figure out how it was apparently working for you and not for me, I went with a simpler solution, pyrseas/yamlutil.py, as follows:

"""Pyrseas YAML utilities"""

from yaml import add_representer


class MultiLineStr(str):
    """ Marker for multiline strings"""


def MultiLineStr_presenter(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
add_representer(MultiLineStr, MultiLineStr_presenter)

With this and the changes to the __init__ methods of the Function and View classes, I was able to produce the function source and view definitions as multi-line strings, e.g.,

schema public:
  description: standard public schema
  function f1(integer, integer):
    language: sql
    returns: integer
    source: |-
      SELECT GREATEST($1, $2)
  ...
  view v1:
    definition: |2-
       SELECT now()::date AS today;
  view v2:
    definition: |2-
       SELECT t1.c1, t1.c3 * 2
         FROM t1;

However, this did not work properly with some actual views or functions (from production/staging databases) that had large text definitions. This will need some research and further testing.

I'll look at the repository changes next.

Owner

jmafc commented Nov 30, 2012

Change f13980d, based on Andrey's experimental work above has been committed. Now view definitions, function source text and object descriptions (COMMENT text) are formatted in YAML block style.

Owner

jmafc commented Dec 18, 2012

Change 94574eb is initial submission to address this new feature.

Owner

jmafc commented Jan 4, 2013

Change 6a6565d essentially completes the work for this enhacement (although some refinements are still needed).

@jmafc jmafc closed this Jan 4, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment