Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/API: converting invalid column names in to_sql #6796

Closed
jorisvandenbossche opened this issue Apr 4, 2014 · 11 comments · Fixed by #6902
Closed

BUG/API: converting invalid column names in to_sql #6796

jorisvandenbossche opened this issue Apr 4, 2014 · 11 comments · Fixed by #6902
Labels
IO SQL to_sql, read_sql, read_sql_query
Milestone

Comments

@jorisvandenbossche
Copy link
Member

At the moment, with the new sql code, there is a bug in the conversion of invalid column names (trough the function _safe_col_name). Eg:

In [1]: import sqlalchemy
In [2]: engine = sqlalchemy.create_engine('sqlite:///:memory:')
In [4]: df = pd.DataFrame({'safe_col':[1,2], 'invalid col':[3,4]})
In [7]: pd.io.sql.to_sql(df, 'test', engine)
In [9]: pd.io.sql.read_table('test', engine)
Out[9]:
   pandas_index invalid_col  safe_col
0             0        None         1
1             1        None         2

I know the reason, and it is easily fixed (the table statement uses the adapted column, but the inserting still the old one, so the data are not inserted). But I was wondering if this is actually necessary?

Because, databases can handle column names with spaces (at least sqlite, mysql and postgresql that I know a little bit), the names are then just quoted and sqlalchemy ensures this is done properly for us.
Often, it is advised to not use spaces (or capital letters) in column names, as it is just tedious to always have to quote them and it is more robust to not have to do that. But it does work, so I would think it is up to the user to provide the column names he/she wants.

So proposal:

  • we don't change the column names and leave them as the user provides them
  • we could provide a kwarg to do this if the users wants to, but I don't think this is priority at the moment (and I think we should then also convert to lower case, as there is as much reason to do that as to changes spaces to underscores)

@mangecoeur @hayd

PS: the fact there are Nones in the example above and not NaNs is still another issue I think.

@jreback
Copy link
Contributor

jreback commented Apr 4, 2014

@jorisvandenbossche I would go with an option, may translate_names=None (or accept a string to describe what is happening), to translate spaces/case. Default it not to do anything. If you have time for 0.14.0 or bump to later version ok too.

@danielballan
Copy link
Contributor

I can pick this one up if @jorisvandenbossche has not already begun work on it.

@jorisvandenbossche
Copy link
Member Author

@danielballan certainly do! (very welcome as I have still other work that I would like to do for the sql code)

I have not yet begun to work on it, the problems just lies in the fact that the insert statement uses the frame's column names, and not the converted ones (https://github.com/pydata/pandas/blob/62b4d6405c21cfd74b4e9e00c298648d6b2e2e82/pandas/io/sql.py#L442) as in the table statement (https://github.com/pydata/pandas/blob/62b4d6405c21cfd74b4e9e00c298648d6b2e2e82/pandas/io/sql.py#L502))

I would certainly default to do nothing, and maybe provide a keyword argument to do something else (remove space, to lower case). But the first is the priority I think.

But maybe it would be better that first #6735 gets merged. If you can take a look at it, please do.

@mangecoeur
Copy link
Contributor

@jorisvandenbossche @danielballan I think the whole safe columns thing is unnecessary if we trust SQLAlchemy to escape names correctly (which I think we can). In that case we don't even need an option for translate names (we're accumulating too many options as it is, as they say options is where good design goes to die).
I would nuke safe columns, make sure Legacy mode does it's escaping properly, and if users want some custom renaming they're free to use column renaming just like everyone else.

@danielballan
Copy link
Contributor

Good. I was coming to the same conclusion.

On Tuesday, April 15, 2014, mangecoeur notifications@github.com wrote:

@jorisvandenbossche https://github.com/jorisvandenbossche @danielballanhttps://github.com/danielballanI think the whole safe columns thing is unnecessary if we trust SQLAlchemy
to escape names correctly (which I think we can). In that case we don't
even need an option for translate names (we're accumulating too many
options as it is, as they say options is where good design goes to die).
I would nuke safe columns, make sure Legacy mode does it's escdaping
properly, and if users want some custom renaming they're free to use column
renaming just like everyone else.


Reply to this email directly or view it on GitHubhttps://github.com//issues/6796#issuecomment-40545324
.

@jorisvandenbossche
Copy link
Member Author

@mangecoeur Fully agree, that was also in line with what I proposed. I would let it fully up to the user to provide the column names it wants.
And indeed, for legacy mode, try to keep the same behaviour with write_frame as before, and the new behaviour for to_sql as I did for the index=True/False

@jorisvandenbossche
Copy link
Member Author

See here: #6883 (comment), I was completely mistaken about the to_sql being new. So if we want to be sure that the legacy mode keeps it behaviour, this should also be the case for to_sql, which makes it difficult as we have one docstring for both new and legacy to_sql, and they would have different defaults then.

@danielballan
Copy link
Contributor

If we commit to one backward-incompatible change in to_sql, with regard to the index, we have an opportunity to make a second one "for free" in the same move, right? If we all agree that safe columns is unnecessary, should we remove it now and document it along with the change in #6883?

@jorisvandenbossche
Copy link
Member Author

It's not completely "for free" as it are two different changes (with possible different side effects for existing code), but I agree if we want to change it, we do it better now and together.

Or, we could also trigger a warning for legacy mode now that it will change it next version (but maybe that is not needed?)

@jorisvandenbossche
Copy link
Member Author

An idea: could we do a quick check (when in legacy mode) if there are columns names with spaces? (the names which would have been converted previously) And then raise a warning, to warn the user that the behaviour changed and that the column names will not be converted anymore (and maybe hint how to do it themselves). Or would this just be annoying?

I think this could help prevent surprises, but of course, once you've seen it and decide to keep using column names with spaces this can become annoying.

@jreback
Copy link
Contributor

jreback commented Apr 17, 2014

a warning sounds good

don't need to try too hard for backwards compat - it's an API change - users need to adjust

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants