The contains function may not be optimized #20931

LieLieLiekey · 2021-03-12T18:52:40Z

Environment info:

InfluxDB version: 2.0.3

System info: running in Docker
Debian, x86_64, 8-core Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 16GB RAM

Data description:

BucketName: 15day_profile_bucket
MeasurementName: function_info
Tags: Function, Pid, Tid, ProcessName, UUID, State
Fields: Internal, cumulative

There may be about 24,000 records and 200 series per minute.

Problem:

A query that filters with contains() is very slow; it seems that the group key is not used for filtering.

The following Flux query took 0.63s:

from(bucket: "15day_profile_bucket")
    |> range(start: 2021-03-03T05:54:39.611Z, stop: 2021-03-03T06:54:39.611Z)
    |> filter(fn: (r) => r["_measurement"] == "function_info")
    |> limit(n: 1)
    |> filter(fn: (r) => contains(value: r["UUID"], set: ["7f0a1436-37ad-4b7a-9ab1-7acce9ee3060"]))
    |> yield()

But this Flux query took 37.88s:

from(bucket: "15day_profile_bucket")
    |> range(start: 2021-03-03T05:54:39.611Z, stop: 2021-03-03T06:54:39.611Z)
    |> filter(fn: (r) => r["_measurement"] == "function_info")
    |> filter(fn: (r) => contains(value: r["UUID"], set: ["7f0a1436-37ad-4b7a-9ab1-7acce9ee3060"]))
    |> limit(n: 1)
    |> yield()

Expected behavior:

The two queries should take about the same time, but their run times differ by roughly 60x (0.63s versus 37.88s).

Because UUID is a tag, filtering on it should be cheap, so the first query (limit first, then filter) and the second query (filter first, then limit) should not differ much in run time.

So I guess the contains function does not use the group key for filtering, but scans all the data.
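
One way to check this guess would be to enable the Flux profiler and compare the per-operator timings of the two queries. The sketch below wraps the slow query from this report; it assumes the profiler package is available in the Flux version bundled with this InfluxDB build, which has not been verified here.

```flux
import "profiler"

// Ask Flux to append profiling tables (overall query statistics and
// per-operator timings) to the normal query results.
option profiler.enabledProfilers = ["query", "operator"]

from(bucket: "15day_profile_bucket")
    |> range(start: 2021-03-03T05:54:39.611Z, stop: 2021-03-03T06:54:39.611Z)
    |> filter(fn: (r) => r["_measurement"] == "function_info")
    |> filter(fn: (r) => contains(value: r["UUID"], set: ["7f0a1436-37ad-4b7a-9ab1-7acce9ee3060"]))
    |> limit(n: 1)
    |> yield()
```

If the contains() filter shows up as its own operator with a large share of the total time, that would be consistent with the filter running in the Flux engine over all rows rather than at the storage layer.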

Use Case:

Our team uses InfluxDB v2, and this query is now the bottleneck of our project.

I tried replacing the contains function with multiple or comparisons, but when the number of values is large (70+), the or version is slower.
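
For reference, the or-based workaround described above might look like the sketch below. Only the first UUID comes from the queries in this report; the other two are placeholders invented for illustration, and a real query would carry one comparison per wanted value (70+ in the case described).

```flux
from(bucket: "15day_profile_bucket")
    |> range(start: 2021-03-03T05:54:39.611Z, stop: 2021-03-03T06:54:39.611Z)
    |> filter(fn: (r) => r["_measurement"] == "function_info")
    // One equality test per wanted UUID, joined with `or`.
    // The second and third values are hypothetical placeholders.
    |> filter(fn: (r) =>
        r["UUID"] == "7f0a1436-37ad-4b7a-9ab1-7acce9ee3060" or
        r["UUID"] == "00000000-0000-0000-0000-000000000001" or
        r["UUID"] == "00000000-0000-0000-0000-000000000002"
    )
    |> limit(n: 1)
    |> yield()
```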

MarcoPignati · 2021-03-15T21:39:10Z

Same here. Comparing the time taken by two otherwise identical simple scripts (one with a plain filter, the other with contains), the one with contains took, if I remember correctly, more than 30x longer.

MarcoPignati · 2021-09-02T07:21:47Z

Rather than using contains(), I am now using the approach suggested here: https://community.grafana.com/t/grafana-influxdb-flux-query-for-displaying-multi-select-variable-inputs/35536
The filtering works perfectly and performance is not impacted. In my case, the variable $device from the example is obtained via another query.
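
The approach in the linked thread appears to replace contains() with a regular-expression match on the tag. Below is a minimal sketch of that idea applied to the query from this issue, assuming an anchored alternation of the wanted values. The second UUID is a made-up placeholder, and in Grafana the alternation would presumably be generated from a multi-value variable such as $device rather than written by hand.

```flux
from(bucket: "15day_profile_bucket")
    |> range(start: 2021-03-03T05:54:39.611Z, stop: 2021-03-03T06:54:39.611Z)
    |> filter(fn: (r) => r["_measurement"] == "function_info")
    // Match the tag against an alternation of the wanted values.
    // The second UUID is a hypothetical placeholder.
    |> filter(fn: (r) => r["UUID"] =~ /^(7f0a1436-37ad-4b7a-9ab1-7acce9ee3060|00000000-0000-0000-0000-000000000001)$/)
    |> limit(n: 1)
    |> yield()
```

Anchoring the pattern with ^ and $ keeps the match exact, so a tag value that merely contains one of the UUIDs as a substring is not selected.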

danxmoran added the area/2.x, area/flux, and area/performance labels on Mar 15, 2021

MarcoPignati mentioned this issue Mar 19, 2021

Improve contains influxdata/flux#1914

Closed

MarcoPignati mentioned this issue Oct 15, 2021

The contains function may not be optimized influxdata/flux#3546

Closed
