New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd.merge() doesn't merge int and str column dtypes but no warning or error #9780

Closed
inodb opened this Issue Apr 1, 2015 · 15 comments

Comments

Projects
None yet
6 participants
@inodb

inodb commented Apr 1, 2015

When merging an int dtype with a str dtype the join does not work:

>>> import pandas as pd
>>> df1 = pd.DataFrame({"A":[0]})
>>> df2 = pd.DataFrame({"A":["0"]})
>>> pd.merge(df1, df2, on=["A"])
Empty DataFrame
Columns: [A]
Index: []

I think it would be better to get a warning that the join is performed on incompatible column dtypes.

This is my pandas version:

>>> pd.show_versions()                                                                                                                                    

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.2
scipy: 0.15.1                                                                                                                                                          
statsmodels: 0.6.1
IPython: 2.2.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: 2.5.6
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

Thanks for all your work on pandas!

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 2, 2015

You are doing an inner merge, which doesn't match. Not sure if we could reliably detect this, as it involves a computation to figure out that you have strings that looks like numbers.

In [8]: >>> pd.merge(df1, df2, on=["A"],how='outer')
Out[8]: 
   A
0  0
1  0

In [9]: >>> pd.merge(df1, df2, on=["A"],how='outer').dtypes
Out[9]: 
A    float64
dtype: object

you can also do

df.convert_objects(convert_numeric=True) to force the objects to become numbers

@jreback jreback added this to the Next Major Release milestone Apr 2, 2015

@inodb

This comment has been minimized.

inodb commented Apr 3, 2015

Thanks for your reply and the quick fix. I actually don't think that one should check whether the string represents numbers or not. I think it is more about whether the dtypes of the columns with the same name match. If you do an inner merge on DataFrames with no matching columns you get a MergeError. I think it would make sense to also throw one if there are matching columns but their dtypes can't be silently cast to match. Especially if one specifically sets the option on=["A"].

@sudk1896

This comment has been minimized.

Contributor

sudk1896 commented Sep 14, 2016

@inodb @jreback: I would like to take a stab at this. From what I can understand, I need to check the datatypes of the columns that are to be merged, and if they're unequal then throw an error. Could you tell me what type of error ? Thanks.

@jreback

This comment has been minimized.

Contributor

jreback commented Sep 14, 2016

this should raise a ValueError if the dtypes for the on columns don't match

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Sep 14, 2016

Suppose the case of an int and float column. I don't think it should raise if you want to merge on those columns? (can eg already typically occur when having NaNs in one of both)

@jreback

This comment has been minimized.

Contributor

jreback commented Sep 14, 2016

right we care about 'obvious' mismatches here that by-definition cannot match. so object vs int/float (or datetimelike vs int/float/object). We will need to lib.infer_dtype to make this more specific (e.g. string/unicode cannot match int)

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Sep 14, 2016

Personally, I would leave this as a responsibility of the user.
Although I understand that the example above can be confusing at first, trying to guess when the user passed wrong dtypes by accident instead of on purpose seems like adding complexity that is not really worth it (but it's a trade-off of course)

Something else, currently, as @jreback shows above (#9780 (comment)), the strings of df2 get coerced to numeric values on merging:

In [57]: pd.merge(df1, df2, on=["A"],how='outer').dtypes
Out[57]: 
A    int64
dtype: object

I suppose this happens on purpose? IMO it should return object array and keep the original values (like pd.concat does)

@jreback

This comment has been minimized.

Contributor

jreback commented Sep 14, 2016

@jorisvandenbossche yes, we should certainly not coerce on mixed types in merging, unless they are losslessly convertible (e.g. int & float), so maybe make a separate issue.

But I think we should raise on str/numeric, and datetimelike/(str or numeric), it is simply not possible (and if the user really wants that, then its just a concat).

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Sep 14, 2016

and if the user really wants that, then its just a concat

That's a good point :-)

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Sep 14, 2016

yes, we should certainly not coerce on mixed types in merging, unless they are losslessly convertible (e.g. int & float), so maybe make a separate issue.

On a second thought, if we disallow merging on str/numeric (the case of the initial example), I don't think are cases left that we would allow but where no coercing should happen? (for which I wanted to open an new issue)

@jreback

This comment has been minimized.

Contributor

jreback commented Sep 14, 2016

@jorisvandenbossche can you show example / elaborate on your last?

On a second thought, if we disallow merging on str/numeric (the case of the initial example), I don't think are cases left that we would allow but where no coercing should happen? (for which I wanted to open an new issue)

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Sep 14, 2016

Well, the original example above has a dataframe with integers in the key column, and another dataframe with strings in the key column. They are now coerced to integers, something I think should not happen:

In [62]: df1 = pd.DataFrame({"A":[0], "B":[1]})

In [63]: df2 = pd.DataFrame({"A":["0"], "B":[2]})

In [64]: pd.merge(df1, df2, on=["A"],how='outer')
Out[64]: 
   A  B_x  B_y
0  0  1.0  NaN
1  0  NaN  2.0

In [65]: pd.merge(df1, df2, on=["A"],how='outer').A.dtype
Out[65]: dtype('int64')

In [66]: pd.merge(df1, df2, on=["A"],how='outer').A.values
Out[66]: array([0, 0])

And I wanted to open an new issue for this. But, this is also an example where we would want to raise an error about "incompatible columns to merge on".
So would there still be cases where we would not raise such an error, but still happens unwanted coercing?

@jreback

This comment has been minimized.

Contributor

jreback commented Sep 14, 2016

ah I c. I think this should raise a nice errors message (maybe saying that you might want to use pd.concat), which will simply work.

concat

In [16]: pd.concat([df1,df2])
Out[16]:
   A
0  0
0  0

In [17]: pd.concat([df1,df2]).dtypes
Out[17]:
A    object
dtype: object

In [18]: pd.concat([df1,df2]).values
Out[18]:
array([[0],
       ['0']], dtype=object)

merge

In [19]: pd.merge(df1, df2, on ='A')
Out[19]:
Empty DataFrame
Columns: [A]
Index: []

In [20]: pd.merge(df1, df2, on ='A', how='outer')
Out[20]:
   A
0  0
1  0

In [21]: pd.merge(df1, df2, on ='A', how='outer').dtypes
Out[21]:
A    int64
dtype: object

so [20,21] are wrong (this should be object).

But I would actually simply raise ValueError on the merge. Its a mistake on the users part.

@jreback jreback modified the milestones: Next Major Release, 0.22.0 Dec 7, 2017

@kylebarron

This comment has been minimized.

Contributor

kylebarron commented Mar 19, 2018

Was #18764 included in the Pandas 0.22.0 release? It looks like @jreback added that to the 0.22.0 milestone on Dec 7, 2017, and it was merged to master 19 days before 0.22.0 was released, but the initial example still fails for me on 0.22.0:

>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": [0]})
>>> df2 = pd.DataFrame({"A": ["0"]})
>>> pd.merge(df1, df2, on=["A"])
Empty DataFrame
Columns: [A]
Index: []
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.3
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None
@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Mar 19, 2018

#18674 will be in the 0.23 release.

The 0.22 release just had one change. You can always view the release notes for a version at http://pandas.pydata.org/pandas-docs/stable/whatsnew.html.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment