
How to speed up inserts from pandas dataframe? #76

Closed

ghuname opened this issue Feb 21, 2019 · 22 comments

@ghuname

ghuname commented Feb 21, 2019

I have a pandas dataframe on my laptop with a few million records. I am inserting them into a ClickHouse table with:
client.execute('insert into database.table (col1, col2…, coln) values', df.values.tolist())

After executing this command I looked at the laptop's network activity.

[image: laptop network activity during the insert]

As you can see, network activity peaks at about 12 Mbps, with lows around 6 Mbps.
Such activity goes on for quite a long time, and then at one moment the laptop's network send jumps to 100 Mbps for a short period and the insert is over.

Can someone explain how insert in clickhouse-driver works?
Why is the data not going to the ClickHouse server at top network speed?

I tried to play with settings like max_insert_block_size and insert_block_size, but with no success.
Are there any ClickHouse server parameters that could improve the speed of inserts?

What would be the fastest way to insert a pandas dataframe into a ClickHouse table?

@xzkostyan
Member

Hi. Inserting large amounts of data is not well optimized right now. See the String performance issue: #32.

Can you provide the schema, part of the data for insert, and a little piece of code?

Can someone explain how insert in clickhouse-driver works?

This package transforms Python data passed as a row-like sequence of tuples into a binary representation in columnar form.
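
A rough sketch of that transformation (illustrative only, not the driver's actual code): the row-like input gets transposed into per-column sequences before being serialized into ClickHouse's columnar wire format.

rows = [(1, 'a'), (2, 'b'), (3, 'c')]  # row-like input, one tuple per row
columns = list(zip(*rows))             # [(1, 2, 3), ('a', 'b', 'c')] -- one tuple per column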

Why is the data not going to the ClickHouse server at top network speed?

This can happen because of insufficient CPU. CPU is required for the necessary data transformation.

Are there any ClickHouse server parameters that could improve the speed of inserts?

Tweaking the server is useless if you're limited by CPU on the driver's side.

What would be the fastest way to insert a pandas dataframe into a ClickHouse table?

Very large data frames can be sent to the server by using clickhouse-client.
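
For example, one possible pattern (a sketch; the file, table, and column names are placeholders, and df is the frame from above): dump the frame to CSV and pipe it through the clickhouse-client CLI.

import subprocess

df.to_csv('data.csv', index=False, header=False)
with open('data.csv', 'rb') as f:
    subprocess.run(
        ['clickhouse-client', '--query',
         'INSERT INTO database.table (col1, col2) FORMAT CSV'],
        stdin=f, check=True,
    )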

@ghuname
Author

ghuname commented Mar 26, 2019

Hi. At the moment I cannot provide the schema and piece of code, but when this problem happens again, I will.

As I understand it, pandas dataframes, like ClickHouse, use a columnar format. In pandas, every column is a Series. When I use df.values.tolist() I transform the pandas columnar format into row format, which I believe is unnecessary.

Can you somehow use the pandas format in order to avoid the column-to-row transformation?

For example, if you need to convert column by column in pandas, these operations can be vectorized and executed in C instead of Python (df.column = df.column.astype(...)), which is quite fast.
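
A toy illustration of that point (the column name and dtype are made up): casting a whole Series runs as one vectorized C-level operation instead of a per-row Python loop.

import pandas as pd

df = pd.DataFrame({'col1': ['1', '2', '3']})
df['col1'] = df['col1'].astype('int64')  # vectorized cast of the entire column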

Furthermore, if I want to insert data from a (remote) select, I cannot do it, because clickhouse_driver expects the insert data as a separate parameter. For example, this doesn't work:

insert into database.table (fields...) select fields... from remote('ip_address', database, table) where condition...

@xzkostyan
Member

The main bottleneck now is data (de)serialization. It should be implemented at the C level.

ClickHouse uses a protobuf-like protocol. It seems that looking for a fast protobuf C extension is the right direction.

@xzkostyan
Member

Hi, @ghuname.

Can you check the 0.1.2 version? It should be more efficient for inserts.

@xzkostyan
Member

Optional numpy array/pandas dataframe writing has been merged into master: 90a49c2. See the tests for usage examples.

@ghuname
Author

ghuname commented Dec 4, 2020

@xzkostyan Please give me a clue.
If you are talking about the tests in https://github.com/mymarilyn/clickhouse-driver/blob/master/tests/test_insert.py, I should create the table/dataframe and then execute:

client.execute(
    'INSERT INTO test (a, b) VALUES', df, columnar=True
)  # where df is a pandas DataFrame

Is that all?

@xzkostyan
Member

xzkostyan commented Dec 4, 2020

There are wrappers over pure execute in the client: query_dataframe and insert_dataframe. Please see the other examples. You should also initialize the client with settings={'use_numpy': True}.

If you want to use pure execute, please consider looking into the wrappers' internals.
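
A rough sketch of what that looks like (my reading of the wrapper, not verified against the source; table and column names are placeholders): pass the frame as a list of per-column arrays with columnar=True.

columns = [df[name].values for name in df.columns]  # dataframe -> list of column arrays
client.execute('INSERT INTO test (a, b) VALUES', columns, columnar=True)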

Looking forward to your feedback.

If you're installing the package from GitHub, you should manually install the pandas and numpy packages.

@ghuname
Author

ghuname commented Dec 4, 2020

OK, to be precise, I should do the following:

  1. update clickhouse_driver to version 0.1.6
  2. pip install clickhouse-driver[numpy]
  3. create test_table structure
  4. client = Client('localhost', settings={'use_numpy': True})
  5. df = client.query_dataframe("select *...")
  6. client.insert_dataframe('INSERT INTO test_table VALUES', df)

Is that all?

@ghuname
Author

ghuname commented Dec 4, 2020

At the moment I can see a problem with nullable columns, because they are not supported.
If you used pandas DataFrames instead of NumPy, you would be able to support null values as well (pandas will do that for you: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).
For example, for integers: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#integer-dtypes-and-missing-data
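
A small illustration of the integer case from those docs (not driver code): pandas' nullable 'Int64' dtype keeps missing values without upcasting the column to float.

import pandas as pd

s = pd.Series([1, 2, None], dtype='Int64')
print(s.tolist())  # [1, 2, <NA>] -- the missing value is preserved, dtype stays integer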

@xzkostyan
Member

Yes, nullable columns are not supported right now. It's a temporary limitation.

  1. pip install git+https://github.com/mymarilyn/clickhouse-driver@master#egg=clickhouse-driver
  2. pip install numpy pandas

The other steps are OK.
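
Putting the confirmed steps together, a minimal end-to-end sketch (the table name and query are placeholders):

from clickhouse_driver import Client

client = Client('localhost', settings={'use_numpy': True})  # required for the dataframe wrappers

df = client.query_dataframe('SELECT * FROM test_table')       # query into a dataframe
client.insert_dataframe('INSERT INTO test_table VALUES', df)  # insert it back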

@xzkostyan
Member

@ghuname should we close this issue, since the 0.2.0 version has numpy support?

@ghuname
Author

ghuname commented Dec 17, 2020

I wanted to test it, but during the week I can't find free time, and at the weekend I don't have VPN access to the database. Anyway, you can close this issue.

@etadelta222

etadelta222 commented Aug 4, 2021

@xzkostyan, I've run into an issue using insert_dataframe, and this is the only post I've found which is somewhat helpful. I've followed the above suggestions but am getting errors. When I use settings={"use_numpy": True} I end up getting the error below when trying to call client.insert_dataframe('INSERT INTO [table] VALUES', df):

AttributeError: 'numpy.ndarray' object has no attribute 'values'

I've tried using settings={"use_pandas": True} and end up with the error below:

TypeError: Unsupported column type: <class 'numpy.ndarray'>. list or tuple is expected.

When I tried to convert the df to a list and pass it in, it gives the error below:

AttributeError: 'list' object has no attribute 'transpose'

When I set transpose=False it gives the error below:

AttributeError: 'list' object has no attribute 'values'

Any idea what I'm doing wrong here?

python = 3.8.2
clickhouse-driver = 0.2.1

@xzkostyan
Member

@etadelta222 it seems that you're inserting a list or a numpy array. insert_dataframe expects a pandas DataFrame as the second argument. The correct way:

client = Client('localhost', settings={'use_numpy': True})
client.insert_dataframe('INSERT INTO [table] VALUES', DataFrame(...))

@etadelta222

etadelta222 commented Aug 5, 2021

Thanks for the response, @xzkostyan. Please see my code below. When I use next_hour = max_date + timedelta(hours=1) I get an error:

numpy.core._exceptions.UFuncTypeError: ufunc 'add' cannot use operands with types dtype('<M8[s]') and dtype('O')

So I changed the definition to next_hour = max_date + np.timedelta64(1,'h')

from sqlalchemy import create_engine
import pandas as pd
import numpy as np  # needed for np.timedelta64 below
from clickhouse_driver import Client
from clickhouse_driver.errors import Error
import configparser
from datetime import datetime, timedelta

config = configparser.ConfigParser()
config.read('config.ini')

client = Client(host=config.get('clickhouse', 'host'),
                port=config.get('clickhouse', 'port'),
                user=config.get('clickhouse', 'user'),
                password=config.get('clickhouse', 'password'),
                database=config.get('clickhouse', 'database'),
                settings={"use_numpy": True})
result = client.execute('SELECT max(event_time) event_time from table')
max_date = result[0][0]
print(max_date)

# next_hour = max_date+timedelta(hours=1)
next_hour = max_date + np.timedelta64(1,'h')

print(next_hour)

engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{database}'.
                       format(user=config.get('redshift', 'user'),
                              password=config.get('redshift', 'password'),
                              host=config.get('redshift', 'host'),
                              port=config.get('redshift', 'port'),
                              database=config.get('redshift', 'database')))
df = pd.read_sql(
    'SELECT list_of_columns FROM table where event_time >= \'{0}\' and event_time < \'{1}\';'.format(max_date, next_hour),
    engine)

print(df.dtypes)


try:
    client.insert_dataframe('INSERT INTO event VALUES', df)
except Error as e:
    if e.code == 60:
        print("connected")
    else:
        print(str(e))

This gives me the error:

 File "clickhouse_driver/bufferedwriter.pyx", line 54, in clickhouse_driver.bufferedwriter.BufferedWriter.write_strings
AttributeError: 'NoneType' object has no attribute 'encode'

@xzkostyan
Member

xzkostyan commented Aug 5, 2021

It seems that a None is being inserted into a String/FixedString column.

@etadelta222

Yeah, I can't seem to figure out why it's doing that.
I've checked the dataframe and it's populated, and I'm also making sure to pass default values for NULL in all columns. I'm allowing nulls in my ClickHouse table definition: Nullable(String). However, now I'm getting another error:

Code: 50. Unknown type Nullable(String)

df.dtypes returns this:

column_1            object
column_2            object
column_3            object
column_4            object
column_5    datetime64[ns]
column_6            object
column_7            object
column_8            object
dtype: object

Clickhouse table definition:

CREATE TABLE IF NOT EXISTS db.table
(
	column_1 String
	,column_2 String
	,column_3 Nullable(String)
	,column_4 Nullable(String)
	,column_5 DateTime
	,column_6 Nullable(String)
	,column_7 Nullable(String)
	,column_8 Nullable(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(column_5)
ORDER BY (column_5);

@xzkostyan
Member

Nullable columns are not supported with the use_numpy=True option.

@etadelta222

Gotcha. I converted the table back to having non-nullable columns and am back to the same error:

AttributeError: 'NoneType' object has no attribute 'encode'

@xzkostyan
Member

There are two points:

  • The driver doesn't support nullable columns. That means an INSERT INTO a table with a Nullable(X) column cannot be done. You already fixed this; there are no nullable columns in the latest schema.
  • String/FixedString columns expect str instances on the Python side. You need to inspect the frame's string columns for Nones (see the sketch below).
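
A small sketch of that inspection (the fill value is one choice among several; df is the frame being inserted):

str_cols = df.select_dtypes(include='object').columns
print(df[str_cols].isna().sum())        # how many Nones/NaNs per string column
df[str_cols] = df[str_cols].fillna('')  # replace missing strings with a default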

@etadelta222

Thank you for your help @xzkostyan! I ended up using the astype() method and that did the trick!
df = df.astype(str)

@xzkostyan
Member

I think we can close this issue.
