Primary keys (possibly all unique constraints) need to be created before views #72

Open
denaje opened this Issue Oct 30, 2013 · 6 comments

3 participants

@denaje

Pyrseas: 0.7.0b1
PostgreSQL: 9.3.1
Python: 3.3

It seems as though the order of creation of items is:

  • Tables
  • Views
  • Constraints

This causes a problem if the creation of a view depends on constraints on the tables. For instance, if a view does something like (simplified example):

CREATE VIEW my_schema.my_view AS
    SELECT u.id, u.name
    FROM users u
    GROUP BY u.id;

Then Postgres will complain

ERROR:  column "u.name" must appear in the GROUP BY clause or be used in an aggregate function

This is not a problem if the PK constraint on users.id is created, because Postgres is smart enough to figure out that grouping on a unique constraint will ensure uniqueness in the result set. However, the PK constraint is not created at the time the view is created, causing the error.

@jmafc
Perseas member

I was very surprised when I tried to create a similar test case and saw how pg_dump handled it. Instead of outputting a CREATE VIEW, pg_dump does a CREATE TABLE v1 (i.e., with the columns in the view), then the PRIMARY KEY constraint, and ends with:

CREATE RULE "_RETURN" AS ON SELECT TO v1 
DO INSTEAD SELECT t1.c1, t1.c2 FROM t1 GROUP BY t1.c1;

Also, this Postgres "smartness" was apparently only introduced in PG 9.1. I tested against 8.4 and 9.0 and got the ERROR above. I'm ambivalent about the new behavior since it appears to depend on some hidden "magic" (psql still shows the view as any other).

The general order of processing is currently (most) pg_class objects (regular tables, sequences, views and materialized views, pg_constraints objects (check, PK, FK, unique), indexes, rules and foreign tables. Within pg_class we already have special ordering to deal with sequences (which can be "owned" by tables), views and inherited tables. Handling of pg_constraints is also special because FKs are processed in a second pass.

I'm not sure how easy it will be to introduce a different order to address this issue.

@rhunwicks
@jmafc
Perseas member

I tried a more realistic example. I mean, if users.id is the primary key, there is no reason for the GROUP BY. Something more reasonable would be SELECT u.id, name, COUNT(*) FROM users u JOIN m USING (id) GROUP BY id where m is a table that has multiple items per user, e.g., messages in a mail system, i.e., you're trying to get a count of the items by user id but would like to see their names as well. Pg_dump 9.1 and later handles it the same way, i.e., with a _RETURN rule (I also noticed it knows not to COPY data out of the view table). I believe this is simply an SQL limitation due to its general lack of orthogonality (IIRC, QUEL had no problem with aggregates mixed with joins).

In any case, for now I think we'll document this as a known issue.

@denaje

I should have included a more realistic example in my comment. Our views are more like the one @jmafc included. For now, I think we may be able to simply include GROUP BY name and any other necessary columns as well, even though it is a bit uglier (don't know if this has any performance implications, but I shouldn't think so).

The relevant piece from the documentation:

When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or if the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.

I very well understand that systems are more complex than they appear on the surface, so please forgive my ignorance, but is there a good reason why PKs need to wait until after views are created, instead of being created immediately after table creation?

@jmafc
Perseas member

In general, yamltodb followed the lead of pg_dump in terms of ordering of object creation, since they had to get the dependencies right. Personally, I probably would've chosen to add PRIMARY KEY clauses on CREATE TABLE statements, but the ALTER TABLE ADD CONSTRAINT approach is more flexible. I believe pg_dump also does them separately because they support COPYing the data back in, before adding PKs.

Constraints are already processed right after pg_class objects (sequences, tables and views) and since there are two passes, primary keys do get added after tables, but not before views. To fix the issue we'd probably have to (a) split views and materialized views out of pyrseas/dbobject/table.py (something I've considered, for other reasons) so we could call ViewDict.diff_map after ConstraintDict.diff_map, (b) pass an extra argument to ClassDict.diff_map so that it can be called first for sequences/tables and a second time, after constraints, for views, or (c) in Table.diff_map, use the Table.primary_key link to generate the ALTER TABLE ADD CONSTRAINT for primary keys. Without further study, option (a) appears preferable.

@jmafc
Perseas member

Documented limitation in commit ce91eb9.

@jmafc jmafc added the dependencies label Sep 15, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment