How to speed up inserts from pandas dataframe? #76
Hi. Inserting a large amount of data is not well optimized right now; see the string performance issue: #32. Can you provide the schema, a sample of the data to insert, and a small piece of code?
This package transforms Python data types passed as a sequence of row-like tuples into a binary representation in columnar form.
This can be slow because of insufficient CPU: CPU time is required for the necessary data transformation.
Tweaking the server is useless if you're limited by CPU on the driver's side.
Very large data frames can be sent to the server by using …
Hi. At the moment I cannot provide the schema and a piece of code, but when this problem happens again, I will. As I understand it, pandas DataFrames are in columnar format, just like ClickHouse: every column is a Series. When I use `df.values.tolist()` I am transforming the pandas columnar format into row format, which I believe is unnecessary. Can you somehow use the pandas format directly in order to avoid the column-to-row transformation? For example, if you need to convert column by column in pandas, those operations could be vectorized and executed in C instead of Python (`df.column = df.column.astype(...)`), which is quite fast. Furthermore, if I want to insert data from a (remote) SELECT, I cannot do it, because clickhouse_driver expects insert data as a separate parameter. For example, this doesn't work: `insert into database.table (fields...) select fields... from remote('ip_address', database, table) where condition...`
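(For reference, recent versions of clickhouse-driver do allow keeping data columnar on insert: `execute` accepts a `columnar=True` flag that takes one sequence per column instead of one tuple per row. A minimal sketch — the table and column names are placeholders:)

```python
import pandas as pd
from clickhouse_driver import Client

client = Client('localhost')
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# One list per column instead of one tuple per row, so the
# DataFrame's columnar layout is preserved end to end.
data = [df[col].tolist() for col in df.columns]
client.execute('INSERT INTO database.table (col1, col2) VALUES',
               data, columnar=True)
```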
The main bottleneck now is data (de)serialization. It should be implemented at the C level. ClickHouse uses a protobuf-like protocol, so looking for a fast protobuf C extension seems like the right direction.
Hi, @ghuname. Can you check the 0.1.2 version? It should be more efficient for inserts.
Optional numpy array/pandas DataFrame writing has been merged into master: 90a49c2. See the tests for usage examples.
@xzkostyan Please give me a clue.

`client.execute(…)`

Is that all?
There are wrappers over pure `execute` (such as `insert_dataframe`). If you want to use pure `execute`, please consider looking into the wrappers' internals. Looking forward to your feedback. If you're installing the package from GitHub, you should manually install the numpy/pandas dependencies.
OK, to be precise, I should do the following: …

Is that all?
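(Spelled out, the steps the thread converges on look roughly like this — a sketch assuming clickhouse-driver 0.2.0+ installed with its numpy extras, e.g. `pip install 'clickhouse-driver[numpy]'`; the table and column names are placeholders:)

```python
import pandas as pd
from clickhouse_driver import Client

# use_numpy=True enables the numpy/pandas code path in the driver.
client = Client('localhost', settings={'use_numpy': True})

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4.0, 5.0, 6.0]})

# Column names in the DataFrame must match the target table's columns.
client.insert_dataframe('INSERT INTO database.table VALUES', df)
```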
At the moment I can see a problem with nullable columns, because they are not supported.
Yes, nullable columns are not supported right now. It's a temporary limitation.
The other steps are OK.
@ghuname should we close this issue, since the 0.2.0 version has numpy support?
I wanted to test it, but during the week I cannot find free time, and during the weekend I don't have VPN access to the database. Anyway, you can close this issue.
@xzkostyan, I've run into an issue using `insert_dataframe`, and this is the only post I've found that is somewhat helpful. I've followed the above suggestions but am getting errors.

When I use … I get:

```
AttributeError: 'numpy.ndarray' object has no attribute 'values'
```

I've tried using … which gives:

```
TypeError: Unsupported column type: <class 'numpy.ndarray'>. list or tuple is expected.
```

When I tried to convert the df to a list and pass it in, it gives the error below:

```
AttributeError: 'list' object has no attribute 'transpose'
```

When I set … I get:

```
AttributeError: 'list' object has no attribute 'values'
```

Any idea what I'm doing wrong here?

python = 3.8.2
@etadelta222 it seems that you're inserting a list or a numpy array. Try:

```python
client = Client('localhost', settings={'use_numpy': True})
client.insert_dataframe('INSERT INTO [table] VALUES', DataFrame(...))
```
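(If the data starts out as a numpy array or a plain list — which is consistent with the attribute errors above — one workaround is to wrap it in a DataFrame first. A minimal sketch; the column and table names are assumptions:)

```python
import numpy as np
import pandas as pd

rows = np.array([[1, 10.0], [2, 20.0]])

# insert_dataframe expects a pandas DataFrame, not a raw ndarray or
# list, so wrap the rows and name the columns to match the table.
df = pd.DataFrame(rows, columns=['id', 'value'])
client.insert_dataframe('INSERT INTO database.table VALUES', df)
```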
Thanks for the response @xzkostyan. Please see my code below.

When I use …

So I changed the definition to …

which gives me this error: …
It seems that …
Yeah, I can't seem to figure out why it's doing that.

`df.dtypes` returns this: …

ClickHouse table definition: …
Nullable columns are not supported with `use_numpy` mode.
Gotcha. I converted the table back to having non-nullable columns and am back to the same error: …
There are two points: …
Thank you for your help @xzkostyan! I ended up using the `astype()` method and that did the trick!
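(For anyone landing here later, the dtype coercion in question might look something like this — a sketch with assumed column names and types; the right mapping depends on your table schema:)

```python
import pandas as pd

# Coerce object-dtype columns (a common result of CSV parsing or
# NaN-filled integers) into concrete dtypes the numpy writer accepts.
df = df.astype({'id': 'int64', 'price': 'float64'})
df['name'] = df['name'].astype(str)

client.insert_dataframe('INSERT INTO database.table VALUES', df)
```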
I think we can close this issue. |
I have a pandas DataFrame on my laptop with a few million records. I am inserting them into a ClickHouse table with:

```python
client.execute('insert into database.table (col1, col2…, coln) values', df.values.tolist())
```
After executing this command I looked at the laptop's network activity.
Network activity peaks at 12 Mbps, with lows around 6 Mbps.
This goes on for quite a long time, and then at one moment the laptop's outgoing traffic jumps to 100 Mbps for a short period of time and the insert is over.
Can someone explain how inserts work in the ClickHouse driver?
Why is the data not going to the ClickHouse server at top network speed?
I tried to play with settings like `max_insert_block_size` or `insert_block_size`, but with no success.
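(For context, `insert_block_size` is a driver-side setting — the number of rows packed into each block sent over the native protocol — and is passed through the client's settings dict; the value below is arbitrary:)

```python
from clickhouse_driver import Client

# insert_block_size controls how many rows the driver packs into
# each block before sending it to the server.
client = Client('localhost', settings={'insert_block_size': 100000})

client.execute('insert into database.table (col1) values',
               [(i,) for i in range(1_000_000)])
```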
Are there any ClickHouse server parameters that could improve the speed of inserts?
What would be the fastest way to insert a pandas DataFrame into a ClickHouse table?