
to_sql function takes forever to insert in oracle database #14315

Closed
rajattjainn opened this issue Sep 28, 2016 · 11 comments
Labels
IO SQL (to_sql, read_sql, read_sql_query), Performance (memory or execution speed)

Comments

@rajattjainn

I am using pandas to do some analysis on an Excel file, and once that analysis is complete, I want to insert the resulting dataframe into a database. The dataframe has around 300,000 rows and 27 columns.
I am using the pd.to_sql method to insert the dataframe into the database. With a MySQL database, the insert takes around 60-90 seconds. However, when I insert the same dataframe with the same function into an Oracle database, the process takes around 2-3 hours to complete.

Relevant code can be found below:

data_frame.to_sql(name='RSA_DATA',  con=get_engine(), if_exists='append',
                          index=False, chunksize=config.CHUNK_SIZE)

I tried different chunksize values (from 50 to 3000), but the difference in time was only of the order of 10 minutes.
Any solution to the above problem?

@jorisvandenbossche
Member

What database driver are you using?
Given that it differs so much between the two databases, the cause most likely lies in the driver, the connection speed, or the database settings (all of which influence insertion speed).

@jorisvandenbossche added the IO SQL label Sep 28, 2016
@rajattjainn
Author

I used the cx_Oracle driver to connect my code to the Oracle database.
However, I don't know whether cx_Oracle is the cause of the problem. Using a different but hacky approach, I have been able to insert the data in around 120 seconds: I split the dataframe into multiple dataframes with numpy.array_split() and used a SQLAlchemy bulk insert to load them into the database (a sketch of this is below). But I think this is more of a hack than a proper solution.

Both databases are on the same machine (I used a Lubuntu virtual machine for this comparison), so connection speed shouldn't be an issue, right?
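For reference, a minimal sketch of the split-and-bulk-insert workaround described above, assuming get_engine() returns a SQLAlchemy engine (as in the earlier snippets), that the RSA_DATA table already exists, and that the chunk count of 100 is an arbitrary placeholder:

import numpy as np
import sqlalchemy as sa

engine = get_engine()  # assumed to return a SQLAlchemy engine
metadata = sa.MetaData()
rsa_data = sa.Table('RSA_DATA', metadata, autoload_with=engine)  # reflect the existing table

# Split the large frame into smaller frames and insert each one
# with a single executemany-style call.
with engine.begin() as conn:
    for chunk in np.array_split(data_frame, 100):
        conn.execute(rsa_data.insert(), chunk.to_dict(orient='records'))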

@jorisvandenbossche
Member

@addresseerajat Can you have a look at the discussion in #8953?
In particular, the monkey patch suggested in #8953 (comment) is possibly worth trying out.

@rajattjainn
Author

@jorisvandenbossche: I looked at the solution and tried using a similar approach. The relevant code is as follows:

df_dict = data_frame.to_dict(orient='records')
connection = get_engine()
connection.execute(rsa_data.insert().values(df_dict))

rsa_data is the name of the table into which I am inserting data.

The above line gives me an error:

The 'oracle' dialect with current database version settings does not support in-place multirow inserts.

My database version is Oracle 11g.

However when I execute the following command, I am able to insert into the database. The only problem: it takes a lot of time to insert.

df_dict = data_frame.to_dict(orient='records')
connection = get_engine()
connection.execute(rsa_data.insert(), df_dict)
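For comparison, bypassing SQLAlchemy and calling cx_Oracle's executemany() directly can help isolate whether the driver or the generated SQL is the bottleneck. A minimal sketch, with placeholder connection details and a three-column table standing in for the real 27 columns:

import cx_Oracle

conn = cx_Oracle.connect('user/password@localhost/XE')  # placeholder credentials/DSN
cursor = conn.cursor()

# Plain tuples in column order; COL1..COL3 are placeholder column names.
rows = list(data_frame.itertuples(index=False, name=None))
cursor.executemany(
    "INSERT INTO RSA_DATA (COL1, COL2, COL3) VALUES (:1, :2, :3)",
    rows)
conn.commit()
cursor.close()
conn.close()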

@daefresh

Were there any other findings here? I've discovered that pushing data into Oracle using cx_Oracle is painfully slow: 10 rows can take 15 seconds to insert. The server we're using is decent (32 GB of RAM and 8 cores).

@wuhuanyan

wuhuanyan commented Jul 30, 2018

I recently ran into the same problem. In the end, I found a way to solve it:
pandas.dataframe.to_sql with oracle database

@BenjaminHabert

BenjaminHabert commented Feb 4, 2019

As mentioned by @wuhaochen, I have also run into this problem. For me the issue was that Oracle was creating columns of the CLOB data type for all the string columns of the pandas dataframe. I sped up the code by explicitly setting the schema dtype parameter of to_sql() and using VARCHAR dtypes for the string columns.

I think this should be the default behavior of to_sql, as creating CLOB columns is counter-intuitive.

@rm17tink

Could you provide an example of the VARCHAR conversion? Numbers always insert quickly. Thanks.

@BenjaminHabert

Sorry, the correct parameter of to_sql() is dtype. This Stack Overflow answer should be helpful for this problem.
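A minimal sketch of that dtype workaround, mapping every object (string) column to a bounded VARCHAR so Oracle does not fall back to CLOB (the length 255 is an arbitrary placeholder; get_engine() and config.CHUNK_SIZE are as in the original snippet):

from sqlalchemy.types import VARCHAR

# Build a dtype mapping for the string columns only.
dtype = {col: VARCHAR(255) for col in data_frame.columns
         if data_frame[col].dtype == 'object'}

data_frame.to_sql(name='RSA_DATA', con=get_engine(), if_exists='append',
                  index=False, chunksize=config.CHUNK_SIZE, dtype=dtype)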

@mroeschke added the Performance label May 16, 2020
@iron0012

iron0012 commented Sep 8, 2022

to_sql() is still practically broken when working with Oracle without using the workaround recommended above.

@mroeschke
Member

It's not clear that there's a pandas-specific fix for this issue, so going to close.
