Query performance is very slow #371

Closed
idubinets opened this issue Nov 29, 2021 · 3 comments
Labels: question, wontfix

idubinets commented Nov 29, 2021

Steps to reproduce:

1. Write 50k records to the bucket:

```python
import time

from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import WriteOptions

with InfluxDBClient(url="http://influxdb:8086", token="XXXX", org="XXXXX") as client:
    write_api = client.write_api(write_options=WriteOptions(batch_size=500,
                                                            flush_interval=1_000,
                                                            jitter_interval=2_000,
                                                            retry_interval=5_000,
                                                            max_retries=5,
                                                            max_retry_delay=30_000,
                                                            exponential_base=2))

    count_records = 50_000
    for i in range(count_records):
        # write the same record multiple times, but with different timestamps to avoid a merge
        write_api.write("xxxxxxx", record=f'demo,tag_1=1,tag_2=1,tag_3=1,tag_4=1,tag_5=1,tag_6=1,tag_7=1 field1=100,field2=50,fileld3=50,fieldt=0 {time.time_ns()}')
    write_api.flush()
    write_api.close()
```
2. Try to query them:

```python
query = f'from(bucket:"xxxxxxx") |> range(start: {test_start}) ' \
        f'|> filter(fn: (r) => r["_measurement"] == "demo" and r["tag_1"] == "1")'

query_api = client.query_api()
start_time = time.time()
data_frame = query_api.query_data_frame(org='readytech', query=query)

# print(data_frame.to_string())
print(f'data frame returned in: {(time.time() - start_time)} secs')
```

Any ideas? I tried query_stream and query; only query_raw is fast enough.
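
For anyone reproducing the comparison of the query methods, a minimal timing harness might look like the following sketch. The URL, token, org, and bucket are placeholders; substitute your own. query_raw() returns the raw annotated-CSV response without parsing it into FluxTable or DataFrame objects, which is largely why it appears fast.

```python
import time

from influxdb_client import InfluxDBClient

# Hypothetical connection details -- replace with your own.
with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    query_api = client.query_api()
    query = 'from(bucket:"my-bucket") |> range(start: -1h) ' \
            '|> filter(fn: (r) => r["_measurement"] == "demo")'

    for name, method in [("query", query_api.query),
                         ("query_stream", query_api.query_stream),
                         ("query_raw", query_api.query_raw)]:
        start = time.time()
        result = method(query)
        if name == "query_stream":
            # query_stream() returns a lazy generator; consume it so the timings are comparable
            result = list(result)
        print(f'{name}: {time.time() - start:.2f} secs')
```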

bednar (Contributor) commented Nov 30, 2021

Hi @idubinets,

thanks for using our client.

The query in your example returns 200,000 rows because it is missing a pivot function: without the pivot, each of the 50,000 points comes back as four rows, one per field. That causes an unnecessary amount of parsing; try something like:

```python
import time

from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import WriteOptions

url = "http://localhost:8086"
token = "my-token"
org = "my-org"
bucket = "my-bucket"
with InfluxDBClient(url=url, token=token, org=org) as client:
    count_records = 50_000
    with client.write_api(write_options=WriteOptions(batch_size=500,
                                                     flush_interval=1_000,
                                                     jitter_interval=2_000,
                                                     retry_interval=5_000,
                                                     max_retries=5,
                                                     max_retry_delay=30_000,
                                                     exponential_base=2)) as write_api:
        for i in range(count_records):
            # write the same record multiple times, but with different timestamps to avoid a merge
            write_api.write(bucket,
                            record=f'demo,tag_1=1,tag_2=1,tag_3=1,tag_4=1,tag_5=1,tag_6=1,tag_7=1 field1=100,field2=50,fileld3=50,fieldt=0 {time.time_ns()}')

    query = f'from(bucket:"{bucket}") |> range(start: -1h) ' \
            f'|> filter(fn: (r) => r["_measurement"] == "demo" and r["tag_1"] == "1") ' \
            f'|> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value") ' \
            f'|> drop(columns: ["_start", "_stop"])'
    start_time = time.time()
    data_frame = client.query_api().query_data_frame(org=org, query=query)

    print(f'data frame returned in: {(time.time() - start_time)} secs')
    print(data_frame)
```

Regards
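
To make the pivot's effect concrete: annotated CSV carries one row per (timestamp, field) pair, so the 50,000 points above arrive as 200,000 rows without the pivot, but as 50,000 rows with one column per field once it is applied. A small sketch along the same lines (placeholder credentials again; query_data_frame() can return either a single DataFrame or a list of them, so both shapes are handled):

```python
import time

from influxdb_client import InfluxDBClient

base = 'from(bucket:"my-bucket") |> range(start: -1h) ' \
       '|> filter(fn: (r) => r["_measurement"] == "demo")'
pivot = ' |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")'

# Hypothetical connection details -- replace with your own.
with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    query_api = client.query_api()
    for label, flux in [("without pivot", base), ("with pivot", base + pivot)]:
        start = time.time()
        result = query_api.query_data_frame(flux)
        frames = result if isinstance(result, list) else [result]
        rows = sum(len(frame) for frame in frames)
        print(f'{label}: {rows} rows in {time.time() - start:.2f} secs')
```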

bednar (Contributor) commented Dec 14, 2021

This issue has been closed because it has not had recent activity. Please reopen if this issue is still important to you and you have additional information.

@bednar bednar closed this as completed Dec 14, 2021
@bednar bednar added this to the 1.25.0 milestone Dec 14, 2021
idubinets (Author) commented

Thank you!
