New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for 'quote' and 'format' parameters to copy_from() #461

Closed
wants to merge 9 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@seanharr11

seanharr11 commented Jul 29, 2016

Support added for 'quote' and 'format' parameters to the COPY FROM command.

  1. Added 'format' parameter to "copy_from()". Expects a string: either "TXT" or "CSV" (default is "TXT"). This allows the 'quote' parameter to be set (see next bullet).
  2. Added 'quote' parameter to "copy_from()". Expects a single character, defaults to '"'. This allows columns of data to be enclosed in this character, removing the need to escape various characters within the string.
    • For instance, the row 143,Hello, World,10002 would get interpreted as 4 columns, separated by commas, when in reality you would want only 3 comma separated values, with "Hello, World" in quotes (123,"Hello, World",10002). This 'quote' parameter allows copy_from() to load .csv files in the proper format defined here in RFC4180
    • If quote is set, and format is set to "TXT", the parameter 'quote' will be ignored when compiling the COPY FROM command. (because the native COPY FROM does not support 'quote' in TXT format)
  3. Test coverage added for 3 cases:
    • Loading a standard csv (double-quote enclosed strings)
    • Loading a csv with single-quote enclosed strings
    • Passing quote="'" when format="TXT" (expects no error)
# Basic case to load a csv into a table
with open("foobar.csv", "r") as fp:
    cur = self.conn.cursor()
    cur.copy_from(fp, table_name, null="NULL", sep=",", format="CSV")

@seanharr11 seanharr11 changed the title from Copy from quote character support to Support for 'quote' and 'format' parameters to copy_from() Jul 29, 2016

@dvarrazzo

This comment has been minimized.

Member

dvarrazzo commented Aug 1, 2016

Hello Sean,

thank you for your patch. However postgres COPY command supports about 12 different options and more would be likely added in the future: I don't think mapping 1-1 all the command options with matching function parameters is really useful. The same features, and everything to come, can be obtained using copy_expert(): having them as parameters in copy_from() (and why not in copy_to() too?) is only syntactic sugar. The fact that sep and null are exposed as parameters is probably some relic from when the world was a simpler place (postgres 7.2?).

So I'm -1 about this patch. @fogzot what do you think?

@fogzot

This comment has been minimized.

Member

fogzot commented Aug 1, 2016

I agree. We should probably have unified everything under copy expert some tine ago. Just adding parameters isn't going to work unless we switch to a new function that takes a configuration object. So, no, I rather not merge right now.

@seanharr11

This comment has been minimized.

seanharr11 commented Aug 1, 2016

My issue with copy_expert() is that it is basically a workaround to get past the permissions issues that get raised when running the PostgreSQL COPY FROM command with the cursor.execute("COPY table FROM 'table.csv' ...") syntax. (These get raised when not running on same instance that hosts the DB server).

While this function is very clever in providing a robust solution to the above (by sending the contents of a file through a STDIN buffer), it is still raw in the sense that the user is forced to free-form enter the COPY FROM command. I believe a more complete solution would avoid formatting a string to create a psql command, and create a higher level of abstraction: A function that accepts a file-like object, and a finite set of parameters (many with defaults).

Could we write a wrapper around the C-implementation of copy_from, that takes in all of the parameters for the COPY FROM command, generates the "COPY...FROM..." command string, and passes this string along to the C function which handles the STDIN piping and execution of the command?

I could take this on if the idea is approved, what are your thoughts?

@dvarrazzo

This comment has been minimized.

Member

dvarrazzo commented Aug 1, 2016

I think the abstraction level is right for psycopg, which is a driver. For instance the method cursor.execute() takes a raw sql, yes, allowing binding parameters, but the syntax of the command is postgres-specific and doesn't go through any representation of the command as a Python object. As a result psycopg doesn't need to know any of the features exposed by postgres as SQL level: it only needs to know the database at protocol level. Wrapping postgres features (data structures, command structures) and exposing them as python object is the purpose of higher-level libraries, such as SQLAlchemy, which builds on top of psycopg.

copy_expert() is not "a workaround to get past permission issues": it is a function to access all the COPY FROM/TO features without needing to chase all the parameters introduced in any PG version, as much as cursor.execute() is a function to access all the SQL features provided by any PG version without knowing each of them singularly. copy_from() and copy_to() are useless IMO, only kept for backward compatibility: as far as I'm concerned people could live perfectly only using copy_expert() (maybe renamed as copy() to make it sound less scary...).

Closing this ticket as this specific patch will not be merged.

@dvarrazzo dvarrazzo closed this Aug 1, 2016

@seanharr11

This comment has been minimized.

seanharr11 commented Aug 1, 2016

Daniel - I didn't mean to come off as brash when saying it is a "workaround to get past permission issues", I just wanted to point out that the biggest value-add of this function is to issue the COPY FROM command without having to worry about piping STDIN.

After reading your response above, I agree that the proposed function belongs at an ORM level of abstraction, and I would agree that copy_from() and copy_to() are misleading and should be marked as deprecated (if they are not already). The existence of these functions are misleading, and may lead individuals to believe that there is no way to set the 'quote' parameters, when in fact copy_expert() is the correct solution.

As to SQLAlchemy supporting an abstract bulk_load() function, this would be a great feature. I recently released etlalchemy, a tool allowing database migrations (data and schema) with a few lines of Python. This tool implements the functions dump_data() and send_data(), which respectively dump a database table to a file, and load it (in bulk) into a target database of a different SQL vendor. Perhaps I will look into contributing this piece of my project to SQLAlchemy as a more abstract bulk_load() function. I use psycopg2's copy_expert() within this function, and am thankful for the work you've all put in (most other drivers do not support this)!

Thanks for the insight here, and please feel free to pass along etlalchemy to others, as I will pass along the knowledge learned here on Stack Overflow and elsewhere. (I'd love some followers on github, I think this project has legs...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment