
How to speed up inserts from pandas dataframe? #76

Closed

ghuname opened this issue Feb 21, 2019 · 22 comments

@ghuname

ghuname commented Feb 21, 2019

I have a pandas dataframe on my laptop with a few million records. I am inserting them into a ClickHouse table with:
client.execute('insert into database.table (col1, col2…, coln) values', df.values.tolist())

After executing this command I looked at the laptop's network activity.

[image: laptop network activity during the insert]

As you can see, network activity peaks at about 12 Mbps, with lows around 6 Mbps.
Such activity goes on for quite a long time, and then at one moment the laptop's network send jumps to 100 Mbps for a short period and the insert is over.

Can someone explain how insert in clickhouse-driver works?
Why is the data not going to the ClickHouse server at top network speed?

I tried to play with settings like max_insert_block_size and insert_block_size, but with no success.
Are there any ClickHouse server parameters that could improve the speed of inserts?

What would be the fastest way to insert a pandas dataframe into a ClickHouse table?

@xzkostyan
Member

Hi. Inserting large amounts of data is not well optimized right now. See the String performance issue: #32.

Can you provide the schema, part of the data for insert, and a little piece of code?

Can someone explain how insert in clickhouse-driver works?

This package transforms Python data passed as a row-like sequence of tuples into a binary representation in columnar form.
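
A rough sketch of that transformation (illustrative only, not the driver's actual code): the row-like input gets transposed into per-column sequences before being serialized into ClickHouse's columnar wire format.

rows = [(1, 'a'), (2, 'b'), (3, 'c')]  # row-like input, one tuple per row
columns = list(zip(*rows))             # [(1, 2, 3), ('a', 'b', 'c')] -- one tuple per column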

Why is the data not going to the ClickHouse server at top network speed?

This can happen because of insufficient CPU. CPU is required for the necessary data transformation.

Are there any ClickHouse server parameters that could improve the speed of inserts?

Tweaking the server is useless if you're limited by CPU on the driver's side.

What would be the fastest way to insert a pandas dataframe into a ClickHouse table?

Very large data frames can be sent to the server by using clickhouse-client.
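
For example, one possible pattern (a sketch; the file, table, and column names are placeholders, and df is the frame from above): dump the frame to CSV and pipe it through the clickhouse-client CLI.

import subprocess

df.to_csv('data.csv', index=False, header=False)
with open('data.csv', 'rb') as f:
    subprocess.run(
        ['clickhouse-client', '--query',
         'INSERT INTO database.table (col1, col2) FORMAT CSV'],
        stdin=f, check=True,
    )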

@ghuname
Author

ghuname commented Mar 26, 2019

Hi. At the moment I cannot provide the schema and piece of code, but when this problem happens again, I will.

As I understand it, pandas dataframes, like ClickHouse, use a columnar format. In pandas, every column is a Series. When I use df.values.tolist() I transform the pandas columnar format into row format, which I believe is unnecessary.

Can you somehow use the pandas format in order to avoid the column-to-row transformation?

For example, if you need to convert column by column in pandas, these operations can be vectorized and executed in C instead of Python (df.column = df.column.astype(...)), which is quite fast.
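
A toy illustration of that point (the column name and dtype are made up): casting a whole Series runs as one vectorized C-level operation instead of a per-row Python loop.

import pandas as pd

df = pd.DataFrame({'col1': ['1', '2', '3']})
df['col1'] = df['col1'].astype('int64')  # vectorized cast of the entire column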

Furthermore, if I want to insert data from a (remote) select, I cannot do it, because clickhouse_driver expects the insert data as a separate parameter. For example, this doesn't work:

insert into database.table (fields...) select fields... from remote('ip_address', database, table) where condition...

@xzkostyan
Member

The main bottleneck now is data (de)serialization. It should be implemented at the C level.

ClickHouse uses a protobuf-like protocol. It seems that looking for a fast protobuf C extension is the right direction.

@xzkostyan
Member

Hi, @ghuname.

Can you check the 0.1.2 version? It should be more efficient for inserts.

@xzkostyan
Member

Optional numpy array/pandas dataframe writing has been merged into master: 90a49c2. See the tests for usage examples.

@ghuname
Author

ghuname commented Dec 4, 2020

@xzkostyan Please give me a clue.
If you are talking about the tests in https://github.com/mymarilyn/clickhouse-driver/blob/master/tests/test_insert.py, I should create the table/dataframe and then execute:

client.execute(
    'INSERT INTO test (a, b) VALUES', df, columnar=True
)  # where df is a pandas DataFrame

Is that all?

@xzkostyan
Member

xzkostyan commented Dec 4, 2020

There are wrappers over pure execute in the client: query_dataframe and insert_dataframe. Please see the other examples. You should also initialize the client with settings={'use_numpy': True}.

If you want to use pure execute, please consider looking into the wrappers' internals.
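
A rough sketch of what that looks like (my reading of the wrapper, not verified against the source; table and column names are placeholders): pass the frame as a list of per-column arrays with columnar=True.

columns = [df[name].values for name in df.columns]  # dataframe -> list of column arrays
client.execute('INSERT INTO test (a, b) VALUES', columns, columnar=True)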

Looking forward to your feedback.

If you're installing the package from GitHub, you should manually install the pandas and numpy packages.

@ghuname
Author

ghuname commented Dec 4, 2020

OK, to be precise, I should do the following:

  1. update clickhouse_driver to version 0.1.6
  2. pip install clickhouse-driver[numpy]
  3. create test_table structure
  4. client = Client('localhost', settings={'use_numpy': True})
  5. df = client.query_dataframe("select *...")
  6. client.insert_dataframe('INSERT INTO test_table VALUES', df)

Is that all?

@ghuname
Author

ghuname commented Dec 4, 2020

At the moment I can see a problem with nullable columns, because they are not supported.
If you used pandas DataFrames instead of NumPy, you would be able to support null values as well (pandas will do that for you: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).
For example, for integers: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#integer-dtypes-and-missing-data
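
A small illustration of the integer case from those docs (not driver code): pandas' nullable 'Int64' dtype keeps missing values without upcasting the column to float.

import pandas as pd

s = pd.Series([1, 2, None], dtype='Int64')
print(s.tolist())  # [1, 2, <NA>] -- the missing value is preserved, dtype stays integer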

@xzkostyan
Member

Yes, nullable columns are not supported right now. It's a temporary limitation.

  1. pip install git+https://github.com/mymarilyn/clickhouse-driver@master#egg=clickhouse-driver
  2. pip install numpy pandas

The other steps are OK.
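
Putting the confirmed steps together, a minimal end-to-end sketch (the table name and query are placeholders):

from clickhouse_driver import Client

client = Client('localhost', settings={'use_numpy': True})  # required for the dataframe wrappers

df = client.query_dataframe('SELECT * FROM test_table')       # query into a dataframe
client.insert_dataframe('INSERT INTO test_table VALUES', df)  # insert it back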

@xzkostyan
Member

@ghuname should we close this issue, since the 0.2.0 version has numpy support?

@ghuname
Author

ghuname commented Dec 17, 2020

I wanted to test it, but during the week I can't find free time, and at the weekend I don't have VPN access to the database. Anyway, you can close this issue.

@etadelta222

etadelta222 commented Aug 4, 2021

@xzkostyan, I've run into an issue using insert_dataframe, and this is the only post I've found which is somewhat helpful. I've followed the above suggestions but am getting errors. When I use settings={"use_numpy": True} I end up getting the error below when trying to call client.insert_dataframe('INSERT INTO [table] VALUES', df):

AttributeError: 'numpy.ndarray' object has no attribute 'values'

I've tried using settings={"use_pandas": True} and end up with the error below:

TypeError: Unsupported column type: <class 'numpy.ndarray'>. list or tuple is expected.

When I tried to convert the df to a list and pass it in, it gives the error below:

AttributeError: 'list' object has no attribute 'transpose'

When I set transpose=False it gives the error below:

AttributeError: 'list' object has no attribute 'values'

Any idea what I'm doing wrong here?

python = 3.8.2
clickhouse-driver = 0.2.1

@xzkostyan
Member

@etadelta222 it seems that you're inserting a list or a numpy array. insert_dataframe expects a pandas DataFrame as the second argument. The correct way:

client = Client('localhost', settings={'use_numpy': True})
client.insert_dataframe('INSERT INTO [table] VALUES', DataFrame(...))

@etadelta222

etadelta222 commented Aug 5, 2021

Thanks for the response, @xzkostyan. Please see my code below. When I use next_hour = max_date + timedelta(hours=1) I get an error:

numpy.core._exceptions.UFuncTypeError: ufunc 'add' cannot use operands with types dtype('<M8[s]') and dtype('O')

So I changed the definition to next_hour = max_date + np.timedelta64(1,'h')

from sqlalchemy import create_engine
import pandas as pd
import numpy as np  # needed for np.timedelta64 below
from clickhouse_driver import Client
from clickhouse_driver.errors import Error
import configparser
from datetime import datetime, timedelta

config = configparser.ConfigParser()
config.read('config.ini')

client = Client(host=config.get('clickhouse', 'host'),
                port=config.get('clickhouse', 'port'),
                user=config.get('clickhouse', 'user'),
                password=config.get('clickhouse', 'password'),
                database=config.get('clickhouse', 'database'),
                settings={"use_numpy": True})
result = client.execute('SELECT max(event_time) event_time from table')
max_date = result[0][0]
print(max_date)

# next_hour = max_date+timedelta(hours=1)
next_hour = max_date + np.timedelta64(1,'h')

print(next_hour)

engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{database}'.
                       format(user=config.get('redshift', 'user'),
                              password=config.get('redshift', 'password'),
                              host=config.get('redshift', 'host'),
                              port=config.get('redshift', 'port'),
                              database=config.get('redshift', 'database')))
df = pd.read_sql(
    'SELECT list_of_columns FROM table where event_time >= \'{0}\' and event_time < \'{1}\';'.format(max_date, next_hour),
    engine)

print(df.dtypes)


try:
    client.insert_dataframe('INSERT INTO event VALUES', df)
except Error as e:
    if e.code == 60:
        print("connected")
    else:
        print(str(e))

This gives me the error:

 File "clickhouse_driver/bufferedwriter.pyx", line 54, in clickhouse_driver.bufferedwriter.BufferedWriter.write_strings
AttributeError: 'NoneType' object has no attribute 'encode'

@xzkostyan
Member

xzkostyan commented Aug 5, 2021

It seems that a None is being inserted into a String/FixedString column.

@etadelta222

Yeah, I can't seem to figure out why it's doing that.
I've checked the dataframe and it's populated, and I'm also making sure to pass default values for NULL in all columns. I'm allowing nulls in my ClickHouse table definition: Nullable(String). However, now I'm getting another error:

Code: 50. Unknown type Nullable(String)

df.dtypes returns this:

column_1            object
column_2            object
column_3            object
column_4            object
column_5    datetime64[ns]
column_6            object
column_7            object
column_8            object
dtype: object

Clickhouse table definition:

CREATE TABLE IF NOT EXISTS db.table
(
	column_1 String
	,column_2 String
	,column_3 Nullable(String)
	,column_4 Nullable(String)
	,column_5 DateTime
	,column_6 Nullable(String)
	,column_7 Nullable(String)
	,column_8 Nullable(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(column_5)
ORDER BY (column_5);

@xzkostyan
Member

Nullable columns are not supported with the use_numpy=True option.

@etadelta222

Gotcha. I converted the table back to having non-nullable columns and am back to the same error:

AttributeError: 'NoneType' object has no attribute 'encode'

@xzkostyan
Member

There are two points:

  • The driver doesn't support nullable columns. That means an INSERT INTO a table with a Nullable(X) column cannot be done. You already fixed this; there are no nullable columns in the latest schema.
  • String/FixedString columns expect str instances on the Python side. You need to inspect the frame's string columns for Nones (see the sketch below).
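
A small sketch of that inspection (the fill value is one choice among several; df is the frame being inserted):

str_cols = df.select_dtypes(include='object').columns
print(df[str_cols].isna().sum())        # how many Nones/NaNs per string column
df[str_cols] = df[str_cols].fillna('')  # replace missing strings with a default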

@etadelta222

Thank you for your help @xzkostyan! I ended up using the astype() method and that did the trick!
df = df.astype(str)

@xzkostyan
Member

I think we can close this issue.
