
Using --schema flag for schema-level dump granularity and making COPY less error prone #14

Open · wants to merge 3 commits into master

Conversation

andilabs (Author) opened this pull request:

Using --schema flag for schema-level dump granularity

  • improved per-schema filtering and made COPY less error prone
  • changed DELIMITER, QUOTE, and ESCAPE for better handling of complicated data-type fields (JSON in particular)
  • explicitly listed the table columns in COPY FROM to handle the situation where the column order differs between source and destination databases (see the sketch below)
  • added the --password flag to explicitly ask for a password when pg_dump is needed (i.e. each run without the --data-only flag)
  • commented out standard_conforming_strings = off so it falls back to the default value, which is on and is needed for import with the \t delimiter
  • introduced the requirement that --sample_schema, if specified, must include _pg_sample in its name, to avoid the risk of accidental deletion when used with --force
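
A minimal sketch of the COPY header these changes emit, using a hypothetical table public.users and a hard-coded column list in place of the information_schema lookup shown later in the review:

    # Illustrative only, not the PR's exact code. The COPY line now names the
    # columns explicitly and uses tab-delimited CSV with quoting effectively
    # disabled (QUOTE set to the backspace character), so embedded JSON with
    # quotes and backslashes survives the round trip.
    my $table   = 'public.users';    # hypothetical table
    my $columns = 'id,payload';      # normally fetched from information_schema
    print "COPY $table ($columns) FROM stdin WITH CSV DELIMITER E'\\t' QUOTE '\b' ESCAPE '\\';\n";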

andilabs and others added 3 commits January 24, 2020 13:17

Using --schema flag for schema-level dump granularity, like in pg_dump, and introducing --sample_schema to take over the old functionality
changed DELIMITER, QUOTE, and ESCAPE for better handling of complicated data-type fields (JSON in particular)
explicitly listed the table columns in COPY FROM to handle the situation where the column order differs between source and destination databases
added the --password flag to explicitly ask for a password when pg_dump is needed (i.e. each run without the --data-only flag)
commented out standard_conforming_strings = off so it falls back to the default value, which is on and is needed for import with the \t delimiter
introduced the requirement that --sample_schema, if specified, must include _pg_sample in its name, to avoid the risk of accidental deletion when used with --force
andilabs (Author) commented Feb 1, 2020

@mla I am not much into Perl. Any kind of feedback will be welcome ✊

mla (Owner) commented Feb 5, 2020

These look like great changes, @andilabs. Why was the --password option added to pg_dump?

mla (Owner) commented Feb 5, 2020

I saw your comment on the change. Just wondering if it's actually necessary, since the pg_dump docs say:

--password
           Force pg_dump to prompt for a password before connecting to a
           database.

           This option is never essential, since pg_dump will automatically
           prompt for a password if the server demands password
           authentication. However, pg_dump will waste a connection attempt
           finding out that the server wants a password. In some cases it is
           worth typing -W to avoid the extra connection attempt.

andilabs (Author) commented Feb 5, 2020 via email

andilabs (Author) commented:
Hi @mla how are you doing in coronavirus times? I hope you are doing well!
What about merging this PR?

Cheers!
Andi

mla (Owner) commented Apr 17, 2020

Hey @andilabs. Doing okay, thanks. Hope you are too. Yes, sorry for the delay on this. Will target this weekend.

mla added a commit that referenced this pull request Apr 18, 2020
mla (Owner) commented Apr 18, 2020

@andilabs can you explain what issue you hit with the COPY being error prone? Or even better, an example test case?

From the project dir, you should be able to run: prove -v
to run the existing tests.

On the standard_conforming_strings setting, I see you commented it out, with a comment saying we'll use the Pg default. But isn't it more robust to know what the setting is? I don't particularly care whether on or off is better, but it seems like we should try to minimize any configuration differences between environments.
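
For instance, a sketch of the pinning being suggested here (not code from the PR): the script could emit the setting explicitly instead of inheriting whatever the server default happens to be:

    # Sketch: pin the setting in the generated dump rather than relying on defaults.
    print "SET standard_conforming_strings = on;\n";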

I've been working in the dev branch, working through your changes, if you want to see the current state. The splitting of schema and table name shouldn't be necessary. The values in @table are Table instances, so we should be able to inspect the schema() and table() accessors directly.
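
A sketch of that suggestion, assuming @table holds Table instances with the schema() and table() accessors described above:

    # Use the accessors instead of splitting the quoted "schema"."table" string.
    for my $t (@table) {
        my $schema_name = $t->schema;   # unquoted schema name
        my $table_name  = $t->table;    # unquoted table name
        # ...query information_schema with these values directly...
    }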

If you could just explain the string_agg stuff and the CSV DELIMITER specifications and such that were added to the COPY, I can better understand what problem was being triggered.

Thanks!

mla (Owner) commented Apr 19, 2020

Also, I would rather we not add the --password option to the pg commands. The pg_dump docs state, for example:

-W
       --password
           Force pg_dump to prompt for a password before connecting to a
           database.

           This option is never essential, since pg_dump will automatically
           prompt for a password if the server demands password
           authentication. However, pg_dump will waste a connection attempt
           finding out that the server wants a password. In some cases it is
           worth typing -W to avoid the extra connection attempt.

So what is happening for you? Is it not prompting you for a password?

We already support the connection params as part of the main script, so probably those should just be propagated to the pg_dump call if supplied.

mla (Owner) commented Apr 19, 2020

I guess there's no great way to pass the password securely. There is a PGPASSWORD env var, but its use is not recommended.

There's a PGPASSFILE var that we could use, but it seems like a bit of a pain. We could create a correctly formatted pgpass file and point to that. https://www.postgresql.org/docs/9.6/libpq-pgpass.html

Or we could pass the --password option to pg_dump if it was supplied to pg_sample. But I thought that would be triggered automatically.
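
A minimal sketch of the PGPASSFILE approach (not code from the PR; the $host, $port, $dbname, $user, and $password variables are illustrative):

    # Write a throwaway pgpass file and point pg_dump at it via PGPASSFILE.
    use File::Temp qw(tempfile);

    my ($fh, $pgpass) = tempfile(UNLINK => 1);
    chmod 0600, $pgpass or die "chmod: $!";   # libpq ignores group/world-readable files
    print $fh "$host:$port:$dbname:$user:$password\n";
    close $fh;

    local $ENV{PGPASSFILE} = $pgpass;
    system('pg_dump', $dbname) == 0 or die "pg_dump failed: $?";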

# Split the quoted "schema"."table" identifier and strip the surrounding quotes.
my ($schema_name, $table_name) = split('\.', $table, 2);
my $cleaned_table_name = substr $table_name, 1, -1;
my $cleaned_schema_name = substr $schema_name, 1, -1;
# Collect the column names in ordinal order so COPY can list them explicitly.
my ($q) = "SELECT string_agg(quote_ident(column_name), ',' ORDER BY ordinal_position) FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '$cleaned_table_name' AND TABLE_SCHEMA = '$cleaned_schema_name'";
andilabs (Author) commented:

@mla this was needed to explicitly persist the order of columns, which may differ in prod vs dev db.

my $cleaned_schema_name = substr $schema_name, 1, -1;
my ($q) = "SELECT string_agg(quote_ident(column_name), ',' ORDER BY ordinal_position) FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '$cleaned_table_name' AND TABLE_SCHEMA = '$cleaned_schema_name'";
my ($column_names_to_keep_order) = $dbh->selectrow_array($q);
# Tab-delimited CSV; QUOTE '\b' (backspace) effectively disables quoting, so
# embedded JSON quotes and backslashes pass through unmangled.
print "COPY $table ($column_names_to_keep_order) FROM stdin WITH CSV DELIMITER E'\\t' QUOTE '\b' ESCAPE '\\';\n";
andilabs (Author) commented:

@mla this was for proper escaping of the JSONB fields mentioned above

andilabs (Author) commented:
  • ok, let's get rid of --password in favor of the good practice of using pgpass
  • the case of the more error-prone COPY was connected to dealing with the escaping of nested JSONB fields. It's a bit of a thankless task for me to find examples now, because even if I find one, I will have to clean/fake the data, since it is confidential production-system data.
  • I added some comments in the MR too.
  • regarding "The values in @table are Table instances, so we should be able to inspect the schema() and table() accessors directly": excuse me for this; I might be misusing Perl because of my lack of experience with it.

justin808 commented:
@mla Will this one get merged?

mla (Owner) commented Oct 23, 2020

> @mla Will this one get merged?

What features from this are you looking for? Are you hitting a problem?
