
System performance degrades with large numbers of runs #2071

Closed
strangemonad-faire opened this issue Sep 9, 2019 · 10 comments

strangemonad-faire commented Sep 9, 2019

What happened:

On systems with lots of runs, listing experiments becomes unbearably slow.

What did you expect to happen:

Performance should remain roughly constant regardless of the number of runs.

What steps did you take:

I enabled slow query logging and identified the culprit query. Change #1836 attempted to address the issue, but it doesn't fix it.

  1. The query generated by gorm is inherently inefficient: no index can make it faster as currently written. In particular, the left join over select * from resource_references has to materialize a temp table, and that temp table has no index for the next part of the join.

Here's an explain statement from our cluster that contains ~70k resource_references:

mysql> explain extended SELECT subq.*, CONCAT("[",GROUP_CONCAT(r.Payload SEPARATOR ","),"]") AS refs FROM (SELECT rd.*, CONCAT("[",GROUP_CONCAT(m.Payload SEPARATOR ","),"]") AS metrics FROM (SELECT UUID, DisplayName, Name, StorageState, Namespace, Description, CreatedAtInSec, ScheduledAtInSec, FinishedAtInSec, Conditions, PipelineId, PipelineSpecManifest, WorkflowSpecManifest, Parameters, pipelineRuntimeManifest, WorkflowRuntimeManifest FROM run_details) AS rd LEFT JOIN run_metrics AS m ON rd.UUID=m.RunUUID GROUP BY rd.UUID) AS subq LEFT JOIN (select * from resource_references where ResourceType='Run') AS r ON subq.UUID=r.ResourceUUID WHERE UUID = '5e9c12b0-d27d-11e9-86a0-42010a8000e5' GROUP BY subq.UUID LIMIT 1;
+----+-------------+---------------------+------+---------------+-------------+---------+-------+-------+----------+----------------------------------------------------+
| id | select_type | table               | type | possible_keys | key         | key_len | ref   | rows  | filtered | Extra                                              |
+----+-------------+---------------------+------+---------------+-------------+---------+-------+-------+----------+----------------------------------------------------+
|  1 | PRIMARY     | <derived2>          | ref  | <auto_key0>   | <auto_key0> | 257     | const |    10 |   100.00 | Using index condition                              |
|  1 | PRIMARY     | <derived4>          | ref  | <auto_key0>   | <auto_key0> | 257     | const |    10 |   100.00 | Using where                                        |
|  4 | DERIVED     | resource_references | ALL  | NULL          | NULL        | NULL    | NULL  | 63229 |   100.00 | Using where                                        |
|  2 | DERIVED     | <derived3>          | ALL  | NULL          | NULL        | NULL    | NULL  | 19546 |   100.00 | Using temporary; Using filesort                    |
|  2 | DERIVED     | m                   | ALL  | PRIMARY       | NULL        | NULL    | NULL  |     1 |   100.00 | Using where; Using join buffer (Block Nested Loop) |
|  3 | DERIVED     | run_details         | ALL  | NULL          | NULL        | NULL    | NULL  | 19546 |   100.00 | NULL                                               |
+----+-------------+---------------------+------+---------------+-------------+---------+-------+-------+----------+----------------------------------------------------+
6 rows in set, 1 warning (6.90 sec)
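
For illustration, here's roughly how the same lookup could be written by hand so it doesn't materialize the whole resource_references derived table: the ResourceType filter moves into the join condition and the UUID predicate is pushed down. This is only a sketch (it selects rd.* instead of the explicit column list, and whether MySQL can then use an index on resource_references.ResourceUUID depends on the actual schema); it's not what gorm currently generates.

    SELECT subq.*, CONCAT("[", GROUP_CONCAT(r.Payload SEPARATOR ","), "]") AS refs
    FROM (
        -- Filter to the single run before aggregating its metrics.
        SELECT rd.*, CONCAT("[", GROUP_CONCAT(m.Payload SEPARATOR ","), "]") AS metrics
        FROM run_details AS rd
        LEFT JOIN run_metrics AS m ON rd.UUID = m.RunUUID
        WHERE rd.UUID = '5e9c12b0-d27d-11e9-86a0-42010a8000e5'
        GROUP BY rd.UUID
    ) AS subq
    -- Join the base table directly; the ResourceType filter becomes a join
    -- condition instead of a materialized derived table.
    LEFT JOIN resource_references AS r
        ON r.ResourceUUID = subq.UUID AND r.ResourceType = 'Run'
    GROUP BY subq.UUID
    LIMIT 1;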

Anything else you would like to add:

I'm not familiar enough with gorm to know how to force it to generate a more efficient query.

@xiaogaozi

We hit the same issue: the ListRuns API becomes very slow when the resource_references table is large (about 6,000 rows). In most cases the API can't send a response back at all.

xiaogaozi commented Sep 26, 2019

More details after running EXPLAIN on the query: as @strangemonad-faire reported, the query against the resource_references table cannot use the referencefilter index. Our Kubeflow Pipelines version is 0.1.30.

Note: this is a test database, so please ignore the rows column.

mysql> EXPLAIN SELECT subq.*, CONCAT("[",GROUP_CONCAT(r.Payload SEPARATOR ","),"]") AS refs FROM (SELECT rd.*, CONCAT("[",GROUP_CONCAT(m.Payload SEPARATOR ","),"]") AS metrics FROM (SELECT UUID, DisplayName, Name, StorageState, Namespace, Description, CreatedAtInSec, ScheduledAtInSec, FinishedAtInSec, Conditions, PipelineId, PipelineName, PipelineSpecManifest, WorkflowSpecManifest, Parameters, PipelineRuntimeManifest, WorkflowRuntimeManifest FROM run_details WHERE UUID in (SELECT ResourceUUID FROM resource_references as rf WHERE (rf.ResourceType = 'Run' AND rf.ReferenceUUID = 'd314b06d-c7ce-4ead-b075-9809ddd5dba5' AND rf.ReferenceType = 'Experiment')) AND StorageState <> 'STORAGESTATE_ARCHIVED' ORDER BY CreatedAtInSec DESC, UUID DESC LIMIT 11) AS rd LEFT JOIN run_metrics AS m ON rd.UUID=m.RunUUID GROUP BY rd.UUID) AS subq LEFT JOIN (select * from resource_references where ResourceType='Run') AS r ON subq.UUID=r.ResourceUUID GROUP BY subq.UUID ORDER BY CreatedAtInSec DESC, UUID DESC;
+----+-------------+---------------------+--------+-------------------------+-----------------+---------+----------------------------+------+-----------------------------------------------------------+
| id | select_type | table               | type   | possible_keys           | key             | key_len | ref                        | rows | Extra                                                     |
+----+-------------+---------------------+--------+-------------------------+-----------------+---------+----------------------------+------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2>          | ALL    | NULL                    | NULL            | NULL    | NULL                       |    6 | Using temporary; Using filesort                           |
|  1 | PRIMARY     | <derived5>          | ALL    | NULL                    | NULL            | NULL    | NULL                       |    5 | Using where; Using join buffer (Block Nested Loop)        |
|  5 | DERIVED     | resource_references | ALL    | referencefilter         | NULL            | NULL    | NULL                       |    6 | Using where                                               |
|  2 | DERIVED     | <derived3>          | ALL    | NULL                    | NULL            | NULL    | NULL                       |    6 | Using temporary; Using filesort                           |
|  2 | DERIVED     | m                   | ALL    | PRIMARY                 | NULL            | NULL    | NULL                       |    1 | Using where; Using join buffer (Block Nested Loop)        |
|  3 | DERIVED     | rf                  | ref    | PRIMARY,referencefilter | referencefilter | 771     | const,const,const          |    6 | Using where; Using index; Using temporary; Using filesort |
|  3 | DERIVED     | run_details         | eq_ref | PRIMARY                 | PRIMARY         | 257     | mlpipeline.rf.ResourceUUID |    1 | Using where                                               |
+----+-------------+---------------------+--------+-------------------------+-----------------+---------+----------------------------+------+-----------------------------------------------------------+

But even if the referencefilter index could be used (e.g. via FORCE INDEX (referencefilter)), the query above may still be very slow. The reason is that select * from resource_references where ResourceType='Run' can match many rows when the resource_references table is very large. A less serious problem is the SELECT ResourceUUID FROM resource_references as rf WHERE (rf.ResourceType = 'Run' AND rf.ReferenceUUID = 'xxx' AND rf.ReferenceType = 'Experiment') statement, which matches every row in the specified experiment.
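
For reference, an index along the lines of the sketch below would let a join probe resource_references by ResourceUUID instead of scanning every 'Run' row. This DDL is only an assumption: if ResourceUUID is already the leading column of the primary key or of an existing index it is redundant, and it only helps once the materialized select * from resource_references where ResourceType='Run' derived table is replaced by a join against the base table, since an index on the base table is not visible inside a materialized derived table.

    -- Hypothetical index so a join such as
    --   LEFT JOIN resource_references AS r
    --     ON r.ResourceUUID = subq.UUID AND r.ResourceType = 'Run'
    -- can do a per-run index lookup instead of a full scan of resource_references.
    CREATE INDEX idx_resource_uuid_type
        ON resource_references (ResourceUUID, ResourceType);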

@xiaogaozi

@IronPan @gaoning777 Do you guys have time to address this issue?

@strangemonad-faire

FWIW, for the time being, I'm working around the issue by manually making sure old workflows are deleted from k8s via kubectl delete workflow, and then manually deleting the SQL records for old runs:

    -- Step 1: find finished runs older than the cutoff; their UUIDs feed
    -- {run_uuid_strings} in the deletes below.
    select UUID, DisplayName, Name, StorageState, Namespace, Description, CreatedAtInSec, ScheduledAtInSec,
        Conditions, PipelineId, FinishedAtInSec
    from run_details
    where FinishedAtInSec is not null and FinishedAtInSec <> 0 and FinishedAtInSec < {max_finished_at_in_sec}
    order by FinishedAtInSec asc limit {limit};

    -- Step 2: delete the selected runs and their dependent rows in one transaction.
    begin;
    delete from resource_references
    where ResourceType = 'Run' and ResourceUUID in ({run_uuid_strings});

    delete from run_metrics
    where RunUUID in ({run_uuid_strings});

    delete from run_details
    where UUID in ({run_uuid_strings});
    commit;

@xiaogaozi

@strangemonad-faire We have also considered deleting the MySQL records manually, but sadly that isn't a long-term solution, just a workaround.

@xiaogaozi

[Screenshot: AWS RDS "Select Latency" metric (Screen Shot 2019-10-12 at 10 58 38)]

Just FYI, we made some small changes to try to optimize ListRuns API performance. The image above shows the "Select Latency" metric from AWS RDS; you can see it becomes much lower and smoother than before (the red arrow indicates when we deployed the new version of Kubeflow).

The basic idea is to remove the unnecessary query and columns. Specifically, we removed:

  1. the select * from resource_references where ResourceType='Run' query
  2. the WorkflowSpecManifest and WorkflowRuntimeManifest columns when selecting from the run_details table (these two columns can be very large); a rough sketch of the trimmed query follows below
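
A rough sketch of what the trimmed list query might look like after these changes (illustrative only; the exact column list and query shape in our patch may differ):

    -- Illustrative only: the large manifest columns are no longer selected and
    -- the extra resource_references derived table is dropped from the list query.
    SELECT UUID, DisplayName, Name, StorageState, Namespace, Description,
           CreatedAtInSec, ScheduledAtInSec, FinishedAtInSec, Conditions,
           PipelineId, PipelineName, Parameters
    FROM run_details
    WHERE StorageState <> 'STORAGESTATE_ARCHIVED'
    ORDER BY CreatedAtInSec DESC, UUID DESC
    LIMIT 11;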

You can see the details here (ignore the .travis.yml change; it's just for building the Docker image).

@strangemonad-faire If you have time, could you try this optimization and see if it helps you?

@xiaogaozi

Could any Kubeflow maintainer address this issue? @Ark-kun @gaoning777 @IronPan @jingzhang36 @Bobgy

@rmgogogo rmgogogo self-assigned this Oct 23, 2019

IronPan commented Oct 23, 2019

@xiaogaozi @strangemonad-faire Thank you for diving into this and finding the issue. I suspect the part that joins the metrics and resource references is the root cause. That information is still needed, I believe. We'll take a look and get back soon.

@rmgogogo

It looks like PR #2559 fixed this issue. Feel free to reopen if not. PRs are also welcome.

@xiaogaozi

Thank you for the elegant fix! We'll test later to verify the change.

I still have one question: can we remove the WorkflowSpecManifest and WorkflowRuntimeManifest fields from the API response? Based on our test results (see the images below), the API response time is reduced significantly. For example:

  • Size: from a maximum of 62.2 KB down to 2.5 KB, a 96% reduction.
  • Time: from a maximum of 986 ms down to 267 ms, a 73% reduction.
  • Waterfall: because "Waiting (TTFB)" (the green bar) and "Content Download" (the blue bar) shrink significantly, "Stalled" (the gray bar) shrinks as well (HTTP/1.x is limited to a maximum of 6 concurrent connections).

[Screenshot: network waterfall before the change (Screen Shot 2019-10-08 at 18 11 53)]

[Screenshot: network waterfall after the change (Screen Shot 2019-10-09 at 15 57 44)]
