
COMPAT: infer larger than uint64 to object dtype #18584

Closed

qiaobz opened this issue Dec 1, 2017 · 13 comments
Labels: Dtype Conversions (Unexpected or buggy dtype conversions), IO SQL (to_sql, read_sql, read_sql_query)
Milestone: 0.22.0

Comments

qiaobz commented Dec 1, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd

querySql = "select col1 from Table1"  # col1 is NUMBER(20) in the Oracle database, e.g. 20124400100030425447
df = pd.read_sql(querySql, db)  # db is the Oracle connection

Problem description

raise "OverflowError: long too big to convert", it may be the uint64 problem?

In [1]: import cx_Oracle

In [2]: db = cx_Oracle.connect(user='TEST', password='0000', dsn='localhost:1521/orcl')

In [3]: import pandas as pd

In [4]: df = pd.read_sql('select col1 from TABLE1',db)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-4-318bf64d0a9b> in <module>()
----> 1 df = pd.read_sql('select col1 from TABLE1',db)

C:\Program Files\Anaconda\lib\site-packages\pandas\io\sql.pyc in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns, chunksize)
    398             sql, index_col=index_col, params=params,
    399             coerce_float=coerce_float, parse_dates=parse_dates,
--> 400             chunksize=chunksize)
    401
    402     try:

C:\Program Files\Anaconda\lib\site-packages\pandas\io\sql.pyc in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
   1456             frame = _wrap_result(data, columns, index_col=index_col,
   1457                                  coerce_float=coerce_float,
-> 1458                                  parse_dates=parse_dates)
   1459             return frame
   1460

C:\Program Files\Anaconda\lib\site-packages\pandas\io\sql.pyc in _wrap_result(data, columns, index_col, coerce_float, parse_dates)
    155
    156     frame = DataFrame.from_records(data, columns=columns,
--> 157                                    coerce_float=coerce_float)
    158
    159     _parse_date_columns(frame, parse_dates)

C:\Program Files\Anaconda\lib\site-packages\pandas\core\frame.pyc in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1140         else:
   1141             arrays, arr_columns = _to_arrays(data, columns,
-> 1142                                              coerce_float=coerce_float)
   1143
   1144             arr_columns = _ensure_index(arr_columns)

C:\Program Files\Anaconda\lib\site-packages\pandas\core\frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
   6249     if isinstance(data[0], (list, tuple)):
   6250         return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 6251                                dtype=dtype)
   6252     elif isinstance(data[0], collections.Mapping):
   6253         return _list_of_dict_to_arrays(data, columns,

C:\Program Files\Anaconda\lib\site-packages\pandas\core\frame.pyc in _list_to_arrays(data, columns, coerce_float, dtype)
   6328         content = list(lib.to_object_array(data).T)
   6329     return _convert_object_array(content, columns, dtype=dtype,
-> 6330                                  coerce_float=coerce_float)
   6331
   6332

C:\Program Files\Anaconda\lib\site-packages\pandas\core\frame.pyc in _convert_object_array(content, columns, coerce_float, dtype)
   6394         return arr
   6395
-> 6396     arrays = [convert(arr) for arr in content]
   6397
   6398     return arrays, columns

C:\Program Files\Anaconda\lib\site-packages\pandas\core\frame.pyc in convert(arr)
   6390     def convert(arr):
   6391         if dtype != object and dtype != np.object:
-> 6392             arr = lib.maybe_convert_objects(arr, try_float=coerce_float)
   6393             arr = maybe_cast_to_datetime(arr, dtype)
   6394         return arr

pandas/_libs/src\inference.pyx in pandas._libs.lib.maybe_convert_objects()

OverflowError: long too big to convert

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

gfyoung added the "IO SQL (to_sql, read_sql, read_sql_query)" label Dec 1, 2017
gfyoung (Member) commented Dec 1, 2017

@qiaobz : Thanks for reporting this! Unfortunately, this is a little hard to reproduce. Could you provide a reproducible example by creating a small table in memory (using Python's sqlite3 library) that causes this error to surface?

qiaobz (Author) commented Dec 2, 2017

Sorry for troubling you.

  • When using sqlite3, there is no problem:

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute('''CREATE TABLE "TABLE1" ("COL1" NUMBER(20,0))''')
cur.execute('''INSERT INTO "TABLE1"("COL1") VALUES ("20124400100030425447")''')
pd.read_sql("select COL1 FROM TABLE1", conn)

and the output is (no error):

           COL1
0  2.012440e+19
  • Reproduce the problem:
    • Oracle database: create the table and insert the data.
    My config:

    user: Test
    password: 0000
    dsn: localhost:1521/orcl

    CREATE TABLE "TEST"."TABLE1" ("COL1" NUMBER(20,0))
    INSERT INTO "TABLE1"("COL1") VALUES ('20124400100030425447')

    • Use cx_Oracle and pandas:

    import pandas as pd
    import cx_Oracle
    db = cx_Oracle.connect(user='TEST', password='0000', dsn='localhost:1521/orcl')
    querysql = "SELECT COL1 FROM TABLE1"
    df = pd.read_sql(querysql, db)

    which then raises "OverflowError: long too big to convert".

gfyoung (Member) commented Dec 2, 2017

@qiaobz : Interesting... can you provide a stack trace for when you try to read from the Oracle DB? If so, can you also add it to your original issue? That will be useful for anyone in the future.

qiaobz (Author) commented Dec 2, 2017

My fault; I've added it to the original issue above.


gfyoung (Member) commented Dec 2, 2017

Hmm, that's very strange. Here's what I suggest you do:

  1. Create a new conda environment with pandas installed. Make sure you can reproduce the error.
  2. Try installing the most up-to-date version of pandas (we're at 0.21, but you are using 0.20.1) and see if the error still persists.
  3. Locate the line where you initialize the returned DataFrame. That corresponds to this block in your stack trace (though the file path will differ because you will be in a different conda environment):
C:\Program Files\Anaconda\lib\site-packages\pandas\io\sql.pyc in _wrap_result(data, columns, index_col, coerce_float, parse_dates)
    155
    156     frame = DataFrame.from_records(data, columns=columns,
--> 157                                    coerce_float=coerce_float)
    158
    159     _parse_date_columns(frame, parse_dates)

Add a line above it to print the parameters passed to DataFrame.from_records, then rerun the code. I would be curious to see what the Oracle DB is giving you compared to the in-memory version. For example, a minimal sketch of that instrumentation, editing the _wrap_result block quoted above:
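
# in pandas/io/sql.py, inside _wrap_result, just above the from_records call:
print(repr(data))       # the raw rows the DB driver handed back
print(repr(columns))    # the column names
frame = DataFrame.from_records(data, columns=columns,
                               coerce_float=coerce_float)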

qiaobz (Author) commented Dec 2, 2017

Thanks, I'll try a new conda environment, but the pandas in this issue is already 0.21.0; you can see it in the INSTALLED VERSIONS (pd.show_versions() output) above.
BTW, I found that 20124400100030425447 is beyond the range of uint64, while a value like 10124420010000167598 causes no error.
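
A quick check (a minimal sketch) confirms which of the two values exceeds the uint64 bound:

import numpy as np

print(np.iinfo(np.uint64).max)                          # 18446744073709551615
print(20124400100030425447 > np.iinfo(np.uint64).max)   # True  -> overflows
print(10124420010000167598 > np.iinfo(np.uint64).max)   # False -> fits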

gfyoung (Member) commented Dec 2, 2017

BTW, I found that 20124400100030425447 is beyond the range of uint64

Right, which is why I was asking you what value the Oracle DB is returning compared to the in-memory version. Were you able to figure that out by following my instructions?

qiaobz (Author) commented Dec 2, 2017

The parameters passed to DataFrame.from_records:

  • Oracle DB
    [(20124400100030425447L,)]
  • in-memory version
    [(2.0124400100030427e+19,)]

Hmm, so I over-complicated the problem: sqlite3 hands pandas a lossy float, while cx_Oracle returns the exact Python long, which does not fit in uint64. We can reproduce the error easily:

import pandas as pd
df = pd.DataFrame([(20124400100030425447L,)])  # Python 2 long literal

which then raises the error.

jreback (Contributor) commented Dec 2, 2017

If you really need integers this large, then you have to explicitly set them to object dtype to avoid the conversions:

In [4]: df = pd.DataFrame([(20124400100030425447,)], dtype=object)

In [5]: df
Out[5]: 
                      0
0  20124400100030425447

In [6]: df.dtypes
Out[6]: 
0    object
dtype: object

In [7]: np.iinfo(np.int64).max
Out[7]: 9223372036854775807

jreback closed this as completed Dec 2, 2017
jreback added the "Dtype Conversions (Unexpected or buggy dtype conversions)" label Dec 2, 2017
jreback added this to the "won't fix" milestone Dec 2, 2017
jorisvandenbossche (Member) commented:

Shouldn't we infer this? Numpy does:

In [9]: np.array([20124400100030425447])
Out[9]: array([20124400100030425447], dtype=object)
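
For contrast, a minimal sketch of both behaviors side by side (assuming pandas 0.22+, where the fix discussed below landed; on 0.21 the pandas line raises the OverflowError above):

import numpy as np
import pandas as pd

vals = [20124400100030425447]      # > uint64 max

print(np.array(vals).dtype)        # object: numpy falls back gracefully
print(pd.Series(vals).dtype)       # object on pandas 0.22+; OverflowError before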

gfyoung (Member) commented Dec 3, 2017

Shouldn't we infer this?

I don't see why not. Is there any historical reason why we didn't do it?

qiaobz (Author) commented Dec 3, 2017

Two small questions:

  • there is no dtype parameter for read_sql()
  • pd.DataFrame(cur.fetchall(), dtype=object) (where cur is db.cursor()) sets all columns to object dtype, not just the oversized one (a per-column workaround is sketched below)
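
A possible workaround (a minimal sketch, not an official API): fetch through the cursor yourself, build the frame with object dtype so nothing overflows, then downcast only the columns that actually fit int64. The db connection and TABLE1 names are the ones from this thread:

import pandas as pd

cur = db.cursor()                        # db is the cx_Oracle connection above
cur.execute("SELECT COL1 FROM TABLE1")
rows = cur.fetchall()
columns = [d[0] for d in cur.description]

# object dtype keeps oversized integers as exact Python longs
df = pd.DataFrame(rows, columns=columns, dtype=object)

for col in df.columns:
    try:
        df[col] = df[col].astype("int64")    # succeeds only if every value fits
    except (OverflowError, ValueError, TypeError):
        pass                                 # too large or non-numeric: keep object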

jreback (Contributor) commented Dec 3, 2017

Yeah, this is a bit of an unhandled case in maybe_convert_objects.

@gfyoung would you have a look?

jreback reopened this Dec 3, 2017
jreback modified the milestones: "won't fix" → "Next Major Release" Dec 3, 2017
jreback added the "Compat (pandas objects compatibility with NumPy or Python functions)" and "Difficulty Intermediate" labels Dec 3, 2017
jreback changed the title from 'read_sql() Raise "OverflowError: long too big to convert"' to "COMPAT: infer larger than uint64 to object dtype" Dec 3, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 4, 2017

For integers larger than what uint64 can handle, we gracefully default to the object dtype instead of overflowing.

Closes pandas-dev#18584.
gfyoung modified the milestones: Next Major Release → 0.22.0 Dec 4, 2017
jorisvandenbossche pushed a commit that referenced this issue Dec 5, 2017

For integers larger than what uint64 can handle, we gracefully default to the object dtype instead of overflowing. For integers smaller than what int64 can handle, we gracefully default to the object dtype instead of overflowing.

Closes gh-18584.
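
A minimal sketch of the post-fix inference described above (assuming pandas 0.22+):

import pandas as pd

print(pd.Series([20124400100030425447]).dtype)   # object (above uint64 max)
print(pd.Series([-20124400100030425447]).dtype)  # object (below int64 min)
print(pd.Series([1, 2, 3]).dtype)                # int64, unchanged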
jorisvandenbossche removed the "Compat (pandas objects compatibility with NumPy or Python functions)" label Dec 5, 2017