to_sql function takes forever to insert in oracle database #14315

Open
addresseerajat opened this Issue Sep 28, 2016 · 6 comments

@addresseerajat

addresseerajat commented Sep 28, 2016

I am using pandas to do some analysis on an Excel file, and once that analysis is complete, I want to insert the resulting dataframe into a database. The dataframe has around 300,000 rows and 27 columns.
I am using the pd.to_sql method to insert the dataframe into the database. With a MySQL database, the insertion takes around 60-90 seconds. However, when I insert the same dataframe using the same function into an Oracle database, the process takes around 2-3 hours to complete.

Relevant code can be found below:

data_frame.to_sql(name='RSA_DATA', con=get_engine(), if_exists='append',
                  index=False, chunksize=config.CHUNK_SIZE)

I tried different chunksize values (from 50 to 3000), but the difference in total time was only of the order of 10 minutes.
Any solution to the above problem?

@jorisvandenbossche

Member

jorisvandenbossche commented Sep 28, 2016

What database driver are you using?
Given that the timing differs so much between the two databases, the cause most likely lies in the driver, the speed of the connection, or the settings of the database, all of which influence insertion speed.

@addresseerajat

addresseerajat commented Sep 29, 2016

I used the cx_Oracle driver to connect the Oracle database to my code.
However, I don't know whether cx_Oracle is the cause of this problem. Using a different but hacky approach, I have been able to insert the data in around 120 seconds: I broke the dataframe into multiple dataframes with numpy.array_split() and used a SQLAlchemy bulk insert to load each piece into the database, roughly as sketched below. But I think this is more of a hack than a proper solution.
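A rough sketch of that workaround (sketch only: RsaData is a placeholder for an ORM-mapped class backing the RSA_DATA table, and the number of splits is arbitrary; the thread does not show the exact code):

import numpy as np
from sqlalchemy.orm import sessionmaker

session = sessionmaker(bind=get_engine())()
for chunk in np.array_split(data_frame, 100):
    # Bulk-insert one sub-dataframe at a time as a list of plain dicts.
    session.bulk_insert_mappings(RsaData, chunk.to_dict(orient='records'))
session.commit()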

Both databases are on the same machine (I used a Lubuntu virtual machine for this comparison), so connection speed shouldn't be an issue, right?

@jorisvandenbossche

Member

jorisvandenbossche commented Sep 29, 2016

@addresseerajat Can you have a look at the discussion in #8953?
The monkey patch suggested in #8953 (comment) in particular is probably worth trying out.
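The gist of that patch is to replace pandas' per-row executemany insert with a single multi-row INSERT ... VALUES statement. A sketch of the idea only (not the exact patch; pandas.io.sql.SQLTable internals may differ between versions):

from pandas.io.sql import SQLTable

def _execute_insert(self, conn, keys, data_iter):
    # Build one multi-row INSERT instead of one parameter set per row.
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(self.table.insert().values(data))

SQLTable._execute_insert = _execute_insert

As the next comment shows, the Oracle dialect rejects this multi-row form on Oracle 11g.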

@addresseerajat

addresseerajat commented Oct 4, 2016

@jorisvandenbossche: I looked at the solution and tried using a similar approach. The relevant code is as follows:

df_dict = data_frame.to_dict(orient='records')
connection = get_engine()
connection.execute(rsa_data.insert().values(df_dict))

rsa_data is the name of the table into which I am inserting data.

The above line gives me an error:

The 'oracle' dialect with current database version settings does not support in-place multirow inserts.

My database version is Oracle 11g.

However, when I execute the following command, I am able to insert into the database. The only problem is that it takes a lot of time to insert.

df_dict = data_frame.to_dict(orient='records')
connection = get_engine()
connection.execute(rsa_data.insert(), df_dict)
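For reference, the difference between the two forms: rsa_data.insert().values(df_dict) builds a single multi-row INSERT ... VALUES statement, which the Oracle dialect reports as unsupported for this server version, while connection.execute(rsa_data.insert(), df_dict) passes the records as an executemany parameter list, which the dialect does accept.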

@daefresh

daefresh commented Nov 30, 2016

Were there any other findings here? I've discovered that pushing data into Oracle using cx_Oracle is painfully slow: 10 rows can take 15 seconds to insert. The server we're using is decent (32 GB of RAM and 8 cores).

@wuhuanyan

wuhuanyan commented Jul 30, 2018

I recently ran into the same problem. In the end, I found a way to solve it:
pandas.dataframe.to_sql with oracle database
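A commonly cited fix for this symptom (hedged: possibly, but not necessarily, what the linked post describes) is to map the string columns to explicit VARCHAR types so that to_sql does not create CLOB columns, which are much slower to load through cx_Oracle:

import sqlalchemy.types as sa_types

# Give every object (string) column an explicit VARCHAR length so to_sql
# does not fall back to CLOB for it.
dtype = {col: sa_types.VARCHAR(int(data_frame[col].astype(str).str.len().max()))
         for col in data_frame.select_dtypes(include=['object']).columns}

data_frame.to_sql(name='RSA_DATA', con=get_engine(), if_exists='append',
                  index=False, chunksize=config.CHUNK_SIZE, dtype=dtype)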
