
System performance degrades with large numbers of runs #2071

Closed
strangemonad-faire opened this issue Sep 9, 2019 · 10 comments

strangemonad-faire commented Sep 9, 2019

What happened:

On systems with lots of runs, listing experiments becomes unbearably slow.

What did you expect to happen:

Performance should remain roughly constant regardless of the number of runs.

What steps did you take:

I enabled slow query logging and identified the culprit query. Change #1836 attempted to address the issue, but it doesn't fix it.

  1. The query generated by gorm is inherently inefficient: no index can make it faster as currently written. In particular, the left join over select * from resource_references has to materialize a temp table, and that temp table has no index for the next part of the join.

Here's an explain statement from our cluster that contains ~70k resource_references:

mysql> explain extended SELECT subq.*, CONCAT("[",GROUP_CONCAT(r.Payload SEPARATOR ","),"]") AS refs FROM (SELECT rd.*, CONCAT("[",GROUP_CONCAT(m.Payload SEPARATOR ","),"]") AS metrics FROM (SELECT UUID, DisplayName, Name, StorageState, Namespace, Description, CreatedAtInSec, ScheduledAtInSec, FinishedAtInSec, Conditions, PipelineId, PipelineSpecManifest, WorkflowSpecManifest, Parameters, pipelineRuntimeManifest, WorkflowRuntimeManifest FROM run_details) AS rd LEFT JOIN run_metrics AS m ON rd.UUID=m.RunUUID GROUP BY rd.UUID) AS subq LEFT JOIN (select * from resource_references where ResourceType='Run') AS r ON subq.UUID=r.ResourceUUID WHERE UUID = '5e9c12b0-d27d-11e9-86a0-42010a8000e5' GROUP BY subq.UUID LIMIT 1;
+----+-------------+---------------------+------+---------------+-------------+---------+-------+-------+----------+----------------------------------------------------+
| id | select_type | table               | type | possible_keys | key         | key_len | ref   | rows  | filtered | Extra                                              |
+----+-------------+---------------------+------+---------------+-------------+---------+-------+-------+----------+----------------------------------------------------+
|  1 | PRIMARY     | <derived2>          | ref  | <auto_key0>   | <auto_key0> | 257     | const |    10 |   100.00 | Using index condition                              |
|  1 | PRIMARY     | <derived4>          | ref  | <auto_key0>   | <auto_key0> | 257     | const |    10 |   100.00 | Using where                                        |
|  4 | DERIVED     | resource_references | ALL  | NULL          | NULL        | NULL    | NULL  | 63229 |   100.00 | Using where                                        |
|  2 | DERIVED     | <derived3>          | ALL  | NULL          | NULL        | NULL    | NULL  | 19546 |   100.00 | Using temporary; Using filesort                    |
|  2 | DERIVED     | m                   | ALL  | PRIMARY       | NULL        | NULL    | NULL  |     1 |   100.00 | Using where; Using join buffer (Block Nested Loop) |
|  3 | DERIVED     | run_details         | ALL  | NULL          | NULL        | NULL    | NULL  | 19546 |   100.00 | NULL                                               |
+----+-------------+---------------------+------+---------------+-------------+---------+-------+-------+----------+----------------------------------------------------+
6 rows in set, 1 warning (6.90 sec)
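
For illustration, here's roughly how the same lookup could be written by hand so it doesn't materialize the whole resource_references derived table: the ResourceType filter moves into the join condition and the UUID predicate is pushed down. This is only a sketch (it selects rd.* instead of the explicit column list, and whether MySQL can then use an index on resource_references.ResourceUUID depends on the actual schema); it's not what gorm currently generates.

    SELECT subq.*, CONCAT("[", GROUP_CONCAT(r.Payload SEPARATOR ","), "]") AS refs
    FROM (
        -- Filter to the single run before aggregating its metrics.
        SELECT rd.*, CONCAT("[", GROUP_CONCAT(m.Payload SEPARATOR ","), "]") AS metrics
        FROM run_details AS rd
        LEFT JOIN run_metrics AS m ON rd.UUID = m.RunUUID
        WHERE rd.UUID = '5e9c12b0-d27d-11e9-86a0-42010a8000e5'
        GROUP BY rd.UUID
    ) AS subq
    -- Join the base table directly; the ResourceType filter becomes a join
    -- condition instead of a materialized derived table.
    LEFT JOIN resource_references AS r
        ON r.ResourceUUID = subq.UUID AND r.ResourceType = 'Run'
    GROUP BY subq.UUID
    LIMIT 1;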

Anything else you would like to add:

I'm not familiar enough with gorm to know how to force it to generate a more efficient query.

@xiaogaozi

We hit the same issue: the ListRuns API becomes very slow when the resource_references table is large (about 6,000 rows). In most cases the API can't send a response back at all.

xiaogaozi commented Sep 26, 2019

More details after running EXPLAIN on the query: as @strangemonad-faire reported, the query against the resource_references table cannot use the referencefilter index. Our Kubeflow Pipelines version is 0.1.30.

Note: this is a test database, so please ignore the rows column.

mysql> EXPLAIN SELECT subq.*, CONCAT("[",GROUP_CONCAT(r.Payload SEPARATOR ","),"]") AS refs FROM (SELECT rd.*, CONCAT("[",GROUP_CONCAT(m.Payload SEPARATOR ","),"]") AS metrics FROM (SELECT UUID, DisplayName, Name, StorageState, Namespace, Description, CreatedAtInSec, ScheduledAtInSec, FinishedAtInSec, Conditions, PipelineId, PipelineName, PipelineSpecManifest, WorkflowSpecManifest, Parameters, PipelineRuntimeManifest, WorkflowRuntimeManifest FROM run_details WHERE UUID in (SELECT ResourceUUID FROM resource_references as rf WHERE (rf.ResourceType = 'Run' AND rf.ReferenceUUID = 'd314b06d-c7ce-4ead-b075-9809ddd5dba5' AND rf.ReferenceType = 'Experiment')) AND StorageState <> 'STORAGESTATE_ARCHIVED' ORDER BY CreatedAtInSec DESC, UUID DESC LIMIT 11) AS rd LEFT JOIN run_metrics AS m ON rd.UUID=m.RunUUID GROUP BY rd.UUID) AS subq LEFT JOIN (select * from resource_references where ResourceType='Run') AS r ON subq.UUID=r.ResourceUUID GROUP BY subq.UUID ORDER BY CreatedAtInSec DESC, UUID DESC;
+----+-------------+---------------------+--------+-------------------------+-----------------+---------+----------------------------+------+-----------------------------------------------------------+
| id | select_type | table               | type   | possible_keys           | key             | key_len | ref                        | rows | Extra                                                     |
+----+-------------+---------------------+--------+-------------------------+-----------------+---------+----------------------------+------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2>          | ALL    | NULL                    | NULL            | NULL    | NULL                       |    6 | Using temporary; Using filesort                           |
|  1 | PRIMARY     | <derived5>          | ALL    | NULL                    | NULL            | NULL    | NULL                       |    5 | Using where; Using join buffer (Block Nested Loop)        |
|  5 | DERIVED     | resource_references | ALL    | referencefilter         | NULL            | NULL    | NULL                       |    6 | Using where                                               |
|  2 | DERIVED     | <derived3>          | ALL    | NULL                    | NULL            | NULL    | NULL                       |    6 | Using temporary; Using filesort                           |
|  2 | DERIVED     | m                   | ALL    | PRIMARY                 | NULL            | NULL    | NULL                       |    1 | Using where; Using join buffer (Block Nested Loop)        |
|  3 | DERIVED     | rf                  | ref    | PRIMARY,referencefilter | referencefilter | 771     | const,const,const          |    6 | Using where; Using index; Using temporary; Using filesort |
|  3 | DERIVED     | run_details         | eq_ref | PRIMARY                 | PRIMARY         | 257     | mlpipeline.rf.ResourceUUID |    1 | Using where                                               |
+----+-------------+---------------------+--------+-------------------------+-----------------+---------+----------------------------+------+-----------------------------------------------------------+

But even if the referencefilter index could be used (e.g. via FORCE INDEX (referencefilter)), the query above may still be very slow. The reason is that select * from resource_references where ResourceType='Run' can match many rows when the resource_references table is very large. A less serious problem is the SELECT ResourceUUID FROM resource_references as rf WHERE (rf.ResourceType = 'Run' AND rf.ReferenceUUID = 'xxx' AND rf.ReferenceType = 'Experiment') statement, which matches every row in the specified experiment.
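
For reference, an index along the lines of the sketch below would let a join probe resource_references by ResourceUUID instead of scanning every 'Run' row. This DDL is only an assumption: if ResourceUUID is already the leading column of the primary key or of an existing index it is redundant, and it only helps once the materialized select * from resource_references where ResourceType='Run' derived table is replaced by a join against the base table, since an index on the base table is not visible inside a materialized derived table.

    -- Hypothetical index so a join such as
    --   LEFT JOIN resource_references AS r
    --     ON r.ResourceUUID = subq.UUID AND r.ResourceType = 'Run'
    -- can do a per-run index lookup instead of a full scan of resource_references.
    CREATE INDEX idx_resource_uuid_type
        ON resource_references (ResourceUUID, ResourceType);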

@xiaogaozi

@IronPan @gaoning777 Do you guys have time to address this issue?

@strangemonad-faire

FWIW, for the time being, I'm working around the issue by manually making sure old workflows are deleted from k8s via kubectl delete workflow, and then manually deleting the SQL records for old runs:

    -- Step 1: find finished runs older than the cutoff; their UUIDs feed
    -- {run_uuid_strings} in the deletes below.
    select UUID, DisplayName, Name, StorageState, Namespace, Description, CreatedAtInSec, ScheduledAtInSec,
        Conditions, PipelineId, FinishedAtInSec
    from run_details
    where FinishedAtInSec is not null and FinishedAtInSec <> 0 and FinishedAtInSec < {max_finished_at_in_sec}
    order by FinishedAtInSec asc limit {limit};

    -- Step 2: delete the selected runs and their dependent rows in one transaction.
    begin;
    delete from resource_references
    where ResourceType = 'Run' and ResourceUUID in ({run_uuid_strings});

    delete from run_metrics
    where RunUUID in ({run_uuid_strings});

    delete from run_details
    where UUID in ({run_uuid_strings});
    commit;

@xiaogaozi

@strangemonad-faire We have also considered deleting the MySQL records manually, but sadly that isn't a long-term solution, just a workaround.

@xiaogaozi

[Screenshot: AWS RDS "Select Latency" metric (Screen Shot 2019-10-12 at 10 58 38)]

Just FYI, we made some small changes to try to optimize ListRuns API performance. The image above shows the "Select Latency" metric from AWS RDS; you can see it becomes much lower and smoother than before (the red arrow indicates when we deployed the new version of Kubeflow).

The basic idea is to remove the unnecessary query and columns. Specifically, we removed:

  1. the select * from resource_references where ResourceType='Run' query
  2. the WorkflowSpecManifest and WorkflowRuntimeManifest columns when selecting from the run_details table (these two columns can be very large); a rough sketch of the trimmed query follows below
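
A rough sketch of what the trimmed list query might look like after these changes (illustrative only; the exact column list and query shape in our patch may differ):

    -- Illustrative only: the large manifest columns are no longer selected and
    -- the extra resource_references derived table is dropped from the list query.
    SELECT UUID, DisplayName, Name, StorageState, Namespace, Description,
           CreatedAtInSec, ScheduledAtInSec, FinishedAtInSec, Conditions,
           PipelineId, PipelineName, Parameters
    FROM run_details
    WHERE StorageState <> 'STORAGESTATE_ARCHIVED'
    ORDER BY CreatedAtInSec DESC, UUID DESC
    LIMIT 11;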

You can see the details here (ignore the .travis.yml change; it's just for building the Docker image).

@strangemonad-faire If you have time, could you try this optimization and see if it helps you?

@xiaogaozi

Could any Kubeflow maintainer address this issue? @Ark-kun @gaoning777 @IronPan @jingzhang36 @Bobgy

@rmgogogo rmgogogo self-assigned this Oct 23, 2019

IronPan commented Oct 23, 2019

@xiaogaozi @strangemonad-faire Thank you for diving into this and finding the issue. I suspect the part that joins the metrics and resource references is the root cause. That information is still needed, I believe. We'll take a look and get back soon.

@rmgogogo

It looks like PR #2559 fixed this issue. Feel free to reopen if not. PRs are also welcome.

@xiaogaozi

Thank you for the elegant fix! We'll test later to verify the change.

I still have one question: can we remove the WorkflowSpecManifest and WorkflowRuntimeManifest fields from the API response? Based on our test results (see the images below), the API response time is reduced significantly. For example:

  • Size: from a maximum of 62.2 KB down to 2.5 KB, a 96% reduction.
  • Time: from a maximum of 986 ms down to 267 ms, a 73% reduction.
  • Waterfall: because "Waiting (TTFB)" (the green bar) and "Content Download" (the blue bar) shrink significantly, "Stalled" (the gray bar) shrinks as well (HTTP/1.x is limited to a maximum of 6 concurrent connections).

[Screenshot: network waterfall before the change (Screen Shot 2019-10-08 at 18 11 53)]

[Screenshot: network waterfall after the change (Screen Shot 2019-10-09 at 15 57 44)]
