Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR?: read_sql no longer supports duplicate column names #53117

Closed
jorisvandenbossche opened this issue May 6, 2023 · 4 comments · Fixed by #53118
Closed

REGR?: read_sql no longer supports duplicate column names #53117

jorisvandenbossche opened this issue May 6, 2023 · 4 comments · Fixed by #53118
Labels
IO SQL to_sql, read_sql, read_sql_query Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Probably caused by #50048, and probably related to #52437 (another change in behaviour that might have been caused by the same PR). In essence the change is from using DataFrame.from_records to processing the records into a list of column arrays and then using DataFrame(dict(zip(columns, arrays))). That has slightly different behaviour.

Dummy reproducer for the read_sql change:

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})

from sqlalchemy import create_engine
eng = create_engine("sqlite://")
df.to_sql("test_table", eng, index=False)

pd.read_sql("SELECT a, b, a +1 as a FROM test_table;", eng)

With pandas 1.5 this returns

   a    b  a
0  1  0.1  2
1  2  0.2  3
2  3  0.3  4

with pandas 2.0 this returns

   a    b
0  2  0.1
1  3  0.2
2  4  0.3

I don't know how much we want to support duplicate column names in read_sql, but it is a change in behaviour, and the new behaviour of just silently ignoring it / dropping some data also isn't ideal IMO.

cc @phofl

@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version IO SQL to_sql, read_sql, read_sql_query labels May 6, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.0.2 milestone May 6, 2023
@jorisvandenbossche
Copy link
Member Author

To illustrate with just the construction of the DataFrame:

In [1]: data = [(1, 0.1, 2), (2, 0.2, 3), (3, 0.3, 4)]
   ...: columns = ['a', 'b', 'a']

In [2]: pd.DataFrame.from_records(data, columns)
Out[2]: 
   0    1  2
a  1  0.1  2
b  2  0.2  3
a  3  0.3  4

In [10]: arrays = [np.array([1, 2, 3]), np.array([0.1, 0.2, 0.3]), np.array([2, 3, 4])]

In [11]: pd.DataFrame(dict(zip(columns, arrays)))
Out[11]: 
   a    b
0  2  0.1
1  3  0.2
2  4  0.3

I assume we could relatively easily fix this by using a different constructor for the arrays, like:

In [15]: pd.DataFrame._from_arrays(arrays, columns, None)
Out[15]: 
   a    b  a
0  1  0.1  2
1  2  0.2  3
2  3  0.3  4

Further, the functionality to specify the dtype backend is probably something that could be moved into from_records? (and then the sql code can keep using from_records)

@MarcoGorelli
Copy link
Member

would it possible/acceptable to just raise on duplicate column names? I'd make the case disallowing that wherever possible (if people have a dataframe with duplicates rows labels and take a tranpose, then OK, they'll get duplicates, but in IO functions I'd have thought it acceptable to prohibit)

@phofl
Copy link
Member

phofl commented May 6, 2023

Can discuss this, but should fix for 2.0.2

@z--m-n
Copy link

z--m-n commented May 11, 2023

Thanks for solving this.

I don't know how much we want to support duplicate column names in read_sql, but it is a change in behaviour,

would it possible/acceptable to just raise on duplicate column names?

Relevant discussion. But read_sql would then deviate in behaviour from DataFrame.

Meaning, it would be best if read_sql outputs the same number of columns as when replacing

...
pd.read_sql("SELECT a, b, a +1 as a FROM test_table;", eng)

in the dummy reproducer, with

...
exe = eng.execute("SELECT a, b, a +1 as a FROM test_table;")
pd.DataFrame(exe)

which in pandas 2.0.1 returns

   a    b  a
0  1  0.1  2
1  2  0.2  3
2  3  0.3  4

with the duplicate columns intact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants