Int64 numbers from Pandas DataFrame.to_markdown() incorrectly displayed #49465

Open
jbencina opened this issue Nov 2, 2022 · 6 comments
Labels: Upstream issue (Issue related to pandas dependency)

Comments

@jbencina
Contributor

jbencina commented Nov 2, 2022

Summary

When a Pandas DataFrame contains a 64-bit integer and the .to_markdown() method is called on the DataFrame, the printed integer is incorrect because the value loses precision when it is formatted as a float.

The tabulate package passes this behavior along, but it originates in Python itself: formatting an integer with a float specifier such as '.0f' converts it to a float first. I bring this up here because the Pandas .head() method does print the correct number. Should Pandas handle this case so that users see a consistent view of DataFrame data regardless of the method used?

If this fix is outside the scope of Pandas, perhaps the Pandas documentation should be updated with a warning.

Reproduction

Test 64-bit int with Pandas head()

import pandas as pd
df = pd.DataFrame({'colA': [503498111827123021]})
df.head()
                 colA
0  503498111827123021

Test 64-bit int with Pandas to_markdown()

import pandas as pd
df = pd.DataFrame({'colA': [503498111827123021]})
print(df.to_markdown(floatfmt='.0f'))
|    |               colA |
|---:|-------------------:|
|  0 | 503498111827123008 |

Test with Python format()

>>> format(503498111827123021, '.0f')
'503498111827123008'
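
To see why the last digits change, it may help to compare the float and integer format codes side by side. The following is a minimal sketch using only the standard library; it assumes the usual CPython float, an IEEE 754 double with a 53-bit significand, and the variable names are illustrative only.

```python
import sys

n = 503498111827123021

# A Python float is normally an IEEE 754 double with a 53-bit significand.
mantissa_bits = sys.float_info.mant_dig

# The 'f' format code converts the int to a float first, which rounds it.
via_float = format(n, ".0f")

# The integer format code 'd' keeps every digit.
via_int = format(n, "d")

# Gap between adjacent representable floats at this magnitude.
spacing = 2 ** (n.bit_length() - mantissa_bits)

print(via_float)          # 503498111827123008
print(via_int)            # 503498111827123021
print(n - int(float(n)))  # 13
print(spacing)            # 64
```

At this magnitude adjacent floats are 64 apart, so the integer is rounded to the nearest multiple of 64, which here is 13 below the true value.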

Pandas Version

Python 3.9.6 (default, Aug  5 2022, 15:21:02)
[Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 91111fd99898d9dcaa6bf6bedb662db4108da6e6
python           : 3.9.6.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Thu Sep 29 20:12:57 PDT 2022; root:xnu-8020.240.7~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.1
numpy            : 1.23.4
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 58.0.4
pip              : 21.2.4
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : 0.9.0
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None
@MarcoGorelli
Member

Thanks @jbencina for the report

Looks like the issue isn't in pandas?

In [7]: tabulate.tabulate(df, floatfmt='.0f')
Out[7]: '-  ------------------\n0  503498111827123008\n-  ------------------'

Might be something to report to https://github.com/astanin/python-tabulate

@MarcoGorelli added the "Upstream issue" label (Issue related to pandas dependency) on Nov 2, 2022
@jbencina
Contributor Author

jbencina commented Nov 2, 2022

@MarcoGorelli Thanks. I opened a ticket with the tabulate team: astanin/python-tabulate#213. The root cause seems to be that tabulate treats the int64 data type as a float when the data comes from a DataFrame, so it applies the float format to an integer value. Passing a large Python int directly to tabulate doesn't produce this issue:

from tabulate import tabulate

table = [[503498111827123021]]

print(tabulate(table))
------------------
503498111827123021
------------------

print(tabulate(table, floatfmt='.0f'))
------------------
503498111827123021
------------------
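
The difference between the two code paths comes down to which format code each value type receives. Below is a minimal sketch of type-aware cell formatting using only the standard library. `format_cell` is a hypothetical helper written for illustration; it is not tabulate's actual implementation, and the fix that tabulate shipped may work differently.

```python
import numbers


def format_cell(value, floatfmt=".0f"):
    """Format one table cell, choosing the format code by value type.

    Hypothetical helper for illustration only.
    """
    # Check integers first: Integral is a subtype of Real, and NumPy
    # integer scalars are registered as numbers.Integral as well.
    if isinstance(value, numbers.Integral):
        return format(value, "d")
    if isinstance(value, numbers.Real):
        return format(value, floatfmt)
    return str(value)


big = 503498111827123021

print(format_cell(big))         # 503498111827123021
print(format_cell(float(big)))  # 503498111827123008
```

Checking for integers before floats is the design choice that matters here: once an integer is classified as a float, the float format is applied and the low digits are rounded away.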

@jbencina
Contributor Author

jbencina commented Nov 5, 2022

Confirmed: this is fixed in the upcoming release of tabulate.

@jbencina closed this as completed on Nov 5, 2022
@MarcoGorelli
Member

cool, thanks!

@MarcoGorelli
Member

the minimum version should probably be bumped then - do you want to submit a pull request to do that?

(reopening the issue until the minimum version is bumped)

@MarcoGorelli reopened this on Nov 6, 2022
@jbencina
Contributor Author

jbencina commented Nov 7, 2022

Good point. I'll find out when the next version is expected and circle back here with a PR once it's available.
