cx_Oracle executemany silent failure when inserting big dataset #153

Closed
datatravelgit opened this Issue Mar 2, 2018 · 5 comments

datatravelgit commented Mar 2, 2018

Hi,

I am running the following versions:

import cx_Oracle
cx_Oracle.__version__
Out[32]: '6.1'
print(conn.version)
11.2.0.3.0
import sys
print(sys.version)
3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]

I have a dataframe with 225 columns and more than 3M records:

type(index_data)
Out[35]: pandas.core.frame.DataFrame

index_data.columns
Out[25]: 
Index(['12m_forward_eps', '12m_trailing_eps', '1_year_ADPS_growth_rate',
       ...
       'mk_usd', 'closing_weight'],
      dtype='object', length=225)

len(index_data)
Out[27]: 3355526

When I try to use the cx_Oracle executemany command, I get no errors, but no data is inserted into the database either!

cursor = conn.cursor()
bind_names = ', '.join(':' + str(i) for i in range(1, len(index_data.columns) + 1))
sql = 'INSERT INTO index (' + ', '.join(index_data.columns) + ') VALUES (' + bind_names + ')'
cursor.prepare(sql)
cursor.executemany(None, index_data.values.tolist())
conn.commit()

It seems that the library has some memory leak and the exception is caught without ever being raised.
You can see that when I limit the number of records to insert, I get a proper error:

cursor.executemany(None, index_data[:10].values.tolist())
Traceback (most recent call last):
  File "C:\UBS\Dev\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-66a03cab4f0e>", line 1, in <module>
    cursor.executemany(None, index_data[:10].values.tolist())
cx_Oracle.DatabaseError: DPI-1055: value is not a number (NaN) and cannot be used in Oracle numbers

Is this dangerous behavior that should prompt us to use another library?

Thanks
DT

PS: In case you are wondering, I tested my connection and steps with a small dataset (2 columns, 2 rows), and I was able to run and commit cx_Oracle executemany successfully.

anthony-tuininga commented Mar 2, 2018

Does this happen with "bare" cx_Oracle? In other words, without pandas? Can you provide a script that demonstrates the problem? Since the amount of data is significant the only meaningful way would be by generating the data. You can certainly work around the situation by batching the rows yourself, but it would be nice to figure out what is causing the problem -- and internally batch the rows in cx_Oracle if needed.
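
For readers who want to try this without pandas, a generated-data sketch along these lines should exercise the same code path; this is only an illustration, not the reporter's script, and the connect string, the nan_test table and its NUMBER columns are placeholders (scale the row count up to probe the silent-failure case):

import cx_Oracle

conn = cx_Oracle.connect("user/password@host/service")   # placeholder credentials
cursor = conn.cursor()

# Generate a large batch in which some values are NaN.
rows = [(i, float("nan") if i % 3 == 0 else i * 1.5) for i in range(1, 1_000_001)]

cursor.prepare("insert into nan_test (id, val) values (:1, :2)")
cursor.executemany(None, rows)   # with NUMBER columns this is expected to raise DPI-1055
conn.commit()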

Is this dangerous behavior that should prompt us to use another library?

Which behaviour are you referring to? The use of NaN? You can use NaN with cx_Oracle but you do have to tell it you want to use BINARY_DOUBLE or BINARY_FLOAT, not NUMBER (the default).
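
For example, a minimal sketch of that (reusing the conn from above and assuming a hypothetical table t with a single BINARY_DOUBLE column named val):

import cx_Oracle

cursor = conn.cursor()
# Bind the value as NATIVE_FLOAT (Oracle BINARY_DOUBLE) so NaN is accepted;
# with the default NUMBER binding, NaN raises DPI-1055.
cursor.setinputsizes(cx_Oracle.NATIVE_FLOAT)
cursor.executemany("insert into t (val) values (:1)", [(1.5,), (float("nan"),)])
conn.commit()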

datatravelgit commented Mar 6, 2018

Which behaviour are you referring to?

When I try to insert 3.5 million records x 225 columns with np.nan in some columns, the cx_Oracle library completes the executemany + commit statements without errors. It gives the impression that it worked, but it does not insert any data into the database. Very dangerous behavior to have in production!

Nevertheless, I did some further investigation:

  1. Before trying to save the pandas dataframe (df), I replaced all NaN values with None:
    df = df.where(pd.notnull(df), None)

  2. Given that the performance of the executemany seems to be exponentially affected by the number of records to insert (I can provide the stats if required), and that there was an error saying "cx_Oracle.DatabaseError: DPI-1015: array size of 3500000 is too large" (by the way, it would be nice to document the limit somewhere), I decided to split the df into 100k chunks (see the sketch below), a decision that sped up the process... but:

  3. I prepare the statement once:
    cursor.prepare(sql)

  4. I call setinputsizes before each executemany:
    cursor.setinputsizes(*cx_types)

  5. But when I reach executemany call number X, the following error appears:
    "TypeError: expecting string or bytes object"

Reading the documentation for Cursor.execute, it seems that I have a column where a None value is later followed by a number:
"For maximum efficiency when reusing an statement, it is best to use the setinputsizes() method to specify the parameter types and sizes ahead of time; in particular, None is assumed to be a string of length 1 so any values that are later bound as numbers or dates will raise a TypeError exception."

But I do have a cursor.setinputsizes call that says the expected column type is cx_Oracle.NATIVE_FLOAT.
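
Putting the steps above together, a minimal sketch of the chunked approach (conn, sql and df are as defined earlier; it assumes all 225 columns are numeric and bound as NATIVE_FLOAT, and the 100k chunk size is illustrative):

import cx_Oracle
import pandas as pd

df = df.where(pd.notnull(df), None)                     # step 1: replace NaN with None

cursor = conn.cursor()
cursor.prepare(sql)                                     # step 3: prepare the statement once

chunk_size = 100_000                                    # step 2: illustrative chunk size
cx_types = [cx_Oracle.NATIVE_FLOAT] * len(df.columns)   # assumes every column is numeric

for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    # step 4: declare the bind types before each executemany so a None in the
    # first row of a chunk is not re-inferred as a string
    cursor.setinputsizes(*cx_types)
    cursor.executemany(None, chunk.values.tolist())

conn.commit()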

Questions:

  • I cannot see in the documentation that cursor.setinputsizes is taken into account by executemany. Is this a missing entry in the documentation (http://cx-oracle.readthedocs.io/en/latest/cursor.html?highlight=executemany), or is setinputsizes in fact not applied, meaning that when I slice the df I end up with different data types depending on the "None"s that executemany finds in the first row of each chunk?

Thanks
DT

cjbj added the question label Mar 7, 2018

cjbj commented Mar 7, 2018

Just a note for future readers interested in loading large volumes of data: Oracle's SQL*Loader and Data Pump were added to Instant Client 12.2.

anthony-tuininga commented Mar 7, 2018

When I try to insert 3.5 million records x 225 columns with np.nan in some columns, the cx_Oracle library completes the executemany + commit statements without errors. It gives the impression that it worked, but it does not insert any data into the database. Very dangerous behavior to have in production!

I would agree that this is invalid behaviour. Can you provide a script that demonstrates the problem?

Given that the performance of the executemany seems to be exponentially affected by the number of records to insert (I can provide the stats if required)

Please do provide those stats! It may make sense to split the rows to insert into multiple chunks internally.

and that there was an error saying "cx_Oracle.DatabaseError: DPI-1015: array size of 3500000 is too large" (btw, it would be nice to document the limit somewhere)

The limit isn't the number of rows but the total data size (2 GB), that is, the size of each row multiplied by the number of rows. I'll look into places to document that limitation. Most people don't run into it, though!
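
As a back-of-the-envelope illustration of that arithmetic (the 8-bytes-per-bind figure is an assumption for NATIVE_FLOAT binds; other bind types use larger buffers):

# Rough sketch of the 2 GB buffer-limit arithmetic; bytes_per_row is an assumed figure.
bytes_per_row = 225 * 8                       # 225 NATIVE_FLOAT binds at 8 bytes each (assumed)
max_rows = (2 * 1024 ** 3) // bytes_per_row   # ~1.19 million rows per executemany call
print(max_rows)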

I cannot see in the documentation that cursor.setinputsizes is taken into account by executemany. Is this a missing entry in the documentation (http://cx-oracle.readthedocs.io/en/latest/cursor.html?highlight=executemany), or is setinputsizes in fact not applied, meaning that when I slice the df I end up with different data types depending on the "None"s that executemany finds in the first row of each chunk?

cursor.executemany() does indeed take into account the call to cursor.setinputsizes(). I can adjust that documentation as well.

anthony-tuininga added a commit that referenced this issue Mar 21, 2018

anthony-tuininga commented Mar 29, 2018

Closing due to lack of activity.
