[testing] kfp-ci cluster related tests flaky #6815

Closed
Bobgy opened this issue Oct 27, 2021 · 12 comments

@Bobgy
Contributor

Bobgy commented Oct 27, 2021

Example error log:

HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Tue, 19 Oct 2021 18:22:34 GMT', 'Vary': 'Origin', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-Powered-By': 'Express', 'X-Xss-Protection': '0', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'S=cloud_datalab_tunnel=HMzjedbUtOfvCUxatmKXGynHsJuhMxw5of6HG3PVShE; Path=/; Max-Age=3600'})
HTTP response body: {"error":"Failed to create a new run.: InternalServerError: Failed to store run v2-sample-test-zcmzl to table: Error 1205: Lock wait timeout exceeded; try restarting transaction","code":13,"message":"Failed to create a new run.: InternalServerError: Failed to store run v2-sample-test-zcmzl to table: Error 1205: Lock wait timeout exceeded; try restarting transaction","details":[{"@type":"type.googleapis.com/api.Error","error_message":"Internal Server Error","error_details":"Failed to create a new run.: InternalServerError: Failed to store run v2-sample-test-zcmzl to table: Error 1205: Lock wait timeout exceeded; try restarting transaction"}]}

@Bobgy
Contributor Author

Bobgy commented Oct 27, 2021

/assign @capri-xiyue @Bobgy

@Bobgy
Contributor Author

Bobgy commented Oct 27, 2021

/cc @chensun

@Bobgy
Contributor Author

Bobgy commented Oct 27, 2021

Are there other causes for flakiness?

@chensun
Member

chensun commented Oct 29, 2021

Saw another cause of flakiness in https://oss-prow.knative.dev/view/gs/oss-prow/pr-logs/pull/kubeflow_pipelines/6796/kubeflow-pipelines-samples-v2/1453537694247292928

This step is in Failed state with this message: OOMKilled (exit code 137)

It happened on the xgboost sample.

@capri-xiyue
Contributor

For the lock wait timeout issue, it looks like the cause is that the backend API runs a lot of transactions against MySQL. Maybe a deadlock happened, or MySQL is not fine-tuned. FYI: https://stackoverflow.com/questions/5836623/getting-lock-wait-timeout-exceeded-try-restarting-transaction-even-though-im
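
The error text itself suggests restarting the transaction, so one client-side mitigation is to retry on error 1205. Below is a minimal sketch of that idea, not what the KFP backend does: `DbError`, `run_with_retry`, and the default attempt count and delay are all hypothetical names and values chosen for illustration. Retrying only masks the symptom; it does not explain why the lock is held for so long.

```python
import random
import time

MYSQL_LOCK_WAIT_TIMEOUT = 1205  # error code seen in the log above


class DbError(Exception):
    """Hypothetical stand-in for a driver exception carrying a MySQL error code."""

    def __init__(self, code, message):
        super().__init__(f"Error {code}: {message}")
        self.code = code


def run_with_retry(txn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call txn(); on a lock wait timeout, back off and run it again."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DbError as err:
            # Only the lock wait timeout is retried; other errors propagate,
            # and so does the timeout once the attempts are used up.
            if err.code != MYSQL_LOCK_WAIT_TIMEOUT or attempt == attempts:
                raise
            # Exponential backoff with jitter so concurrent retries spread out.
            sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))
```

Here `txn` is assumed to be a callable that runs the whole transaction from the start, so a retry never resumes a half-finished one.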

@Bobgy
Contributor Author

Bobgy commented Nov 1, 2021

The comments below focus on the following error, because it shows up most often.

[{"@type":"type.googleapis.com/api.Error","error_message":"Internal Server Error","error_details":"Failed to create a new run.: InternalServerError: Failed to store run v2-sample-test-zcmzl to table: Error 1205: Lock wait timeout exceeded; try restarting transaction"}]

Connect to in-cluster mysql DB via:

kubectl run -it -n kubeflow --rm --image=mysql:8.0.12 --restart=Never mysql-client -- mysql -h mysql

Here's current DB size:

mysql> select count(*) from run_details;
+----------+
| count(*) |
+----------+
|   150004 |
+----------+
1 row in set (5.98 sec)
mysql> select count(*) from experiments;
+----------+
| count(*) |
+----------+
|    65422 |
+----------+
1 row in set (0.07 sec)

Some queries on runs seem to run for a very long time.

I'm using SHOW FULL PROCESSLIST to figure out which ones take a long time.
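
For anyone repeating this, here is a rough sketch of how the processlist rows could be filtered once they are fetched as dictionaries. The keys `Command`, `Time`, `State`, and `Info` match the column names that SHOW FULL PROCESSLIST prints; the `slow_queries` helper, the 60-second default threshold, and the sample rows are made up for illustration and are not real output from the kfp-ci cluster.

```python
def slow_queries(processlist, min_seconds=60):
    """Return non-idle processlist rows running at least min_seconds, slowest first."""
    busy = [
        row for row in processlist
        if row["Command"] != "Sleep" and row["Time"] >= min_seconds
    ]
    return sorted(busy, key=lambda row: row["Time"], reverse=True)


# Made-up sample rows in the shape of SHOW FULL PROCESSLIST output.
sample = [
    {"Id": 1, "Command": "Sleep", "Time": 300, "State": "", "Info": None},
    {"Id": 2, "Command": "Query", "Time": 75, "State": "preparing", "Info": "UPDATE ..."},
    {"Id": 3, "Command": "Query", "Time": 2, "State": "executing", "Info": "SELECT ..."},
]

for row in slow_queries(sample):
    print(row["Id"], row["Time"], row["State"], row["Info"])
```

Idle connections report `Sleep` as their command and can show a large `Time`, which is why they are excluded before sorting.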

@Bobgy
Contributor Author

Bobgy commented Nov 1, 2021

This query stays in preparing state for more than 1 minute:

UPDATE run_details SET StorageState = ? WHERE UUID in (SELECT ResourceUUID FROM resource_references as rf WHERE (rf.ResourceType = ? AND rf.ReferenceUUID = ? AND rf.ReferenceType = ?))

Based on https://dba.stackexchange.com/a/121846, it seems our query should use a JOIN instead of a SELECT subquery for better performance.

I verified that the same query with literal values is also slow:

mysql> UPDATE run_details set storageState = "abc" WHERE UUID in (SELECT ResourceUUID FROM resource_references as rf WHERE (rf.ResourceType = "def" AND rf.ReferenceUUID = "ggg" AND rf.ReferenceType = "ccc"));
Query OK, 0 rows affected (6.27 sec)
Rows matched: 0  Changed: 0  Warnings: 0

@Bobgy
Contributor Author

Bobgy commented Nov 1, 2021

I tried to rewrite this query using JOIN, and it's now much faster:

mysql> explain UPDATE run_details as runs JOIN resource_references as rf ON runs.UUID = rf.ResourceUUID
    -> SET StorageState = "def"
    -> WHERE rf.ResourceType = "c" AND rf.ReferenceUUID = "b" AND rf.ReferenceType = "c";
+----+-------------+-------+------------+--------+-------------------------+-----------------+---------+----------------------------+------+----------+-------------+
| id | select_type | table | partitions | type   | possible_keys           | key             | key_len | ref                        | rows | filtered | Extra       |
+----+-------------+-------+------------+--------+-------------------------+-----------------+---------+----------------------------+------+----------+-------------+
|  1 | SIMPLE      | rf    | NULL       | ref    | PRIMARY,referencefilter | referencefilter | 771     | const,const,const          |    1 |   100.00 | Using index |
|  1 | UPDATE      | runs  | NULL       | eq_ref | PRIMARY                 | PRIMARY         | 257     | mlpipeline.rf.ResourceUUID |    1 |   100.00 | NULL        |
+----+-------------+-------+------------+--------+-------------------------+-----------------+---------+----------------------------+------+----------+-------------+
2 rows in set (0.00 sec)

mysql>
mysql> UPDATE run_details as runs JOIN resource_references as rf ON runs.UUID = rf.ResourceUUID
    -> SET StorageState = "def"
    -> WHERE rf.ResourceType = "c" AND rf.ReferenceUUID = "b" AND rf.ReferenceType = "c";
Query OK, 0 rows affected (0.00 sec)
Rows matched: 0  Changed: 0  Warnings: 0

Rewritten query:

UPDATE run_details as runs JOIN resource_references as rf ON runs.UUID = rf.ResourceUUID
SET StorageState = "def"
WHERE rf.ResourceType = "c" AND rf.ReferenceUUID = "b" AND rf.ReferenceType = "c";
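
Before swapping the query, it may be worth checking that the JOIN form targets the same rows as the IN-subquery form. The sketch below does that with an in-memory SQLite database standing in for MySQL and compares the matched UUIDs with SELECT statements. The simplified two-table schema, the assumed primary key on resource_references, and the sample values ('Run', 'Experiment', 'e1') are assumptions for illustration, not the real KFP schema or data. It says nothing about performance, only about which rows are selected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: only the columns the two queries touch.
conn.executescript("""
    CREATE TABLE run_details (UUID TEXT PRIMARY KEY, StorageState TEXT);
    CREATE TABLE resource_references (
        ResourceUUID TEXT, ResourceType TEXT,
        ReferenceUUID TEXT, ReferenceType TEXT,
        PRIMARY KEY (ResourceUUID, ResourceType, ReferenceType)
    );
    INSERT INTO run_details VALUES ('r1', 'x'), ('r2', 'x'), ('r3', 'x');
    INSERT INTO resource_references VALUES
        ('r1', 'Run', 'e1', 'Experiment'),
        ('r2', 'Run', 'e2', 'Experiment'),
        ('r3', 'Run', 'e1', 'Experiment');
""")
params = ("Run", "e1", "Experiment")

# Rows the original IN-subquery form would update.
subquery_rows = conn.execute(
    "SELECT UUID FROM run_details WHERE UUID IN ("
    " SELECT ResourceUUID FROM resource_references AS rf"
    " WHERE rf.ResourceType = ? AND rf.ReferenceUUID = ? AND rf.ReferenceType = ?)",
    params,
).fetchall()

# Rows the rewritten JOIN form would update.
join_rows = conn.execute(
    "SELECT runs.UUID FROM run_details AS runs"
    " JOIN resource_references AS rf ON runs.UUID = rf.ResourceUUID"
    " WHERE rf.ResourceType = ? AND rf.ReferenceUUID = ? AND rf.ReferenceType = ?",
    params,
).fetchall()

subquery_uuids = sorted(row[0] for row in subquery_rows)
join_uuids = sorted(row[0] for row in join_rows)
print(subquery_uuids == join_uuids, join_uuids)
```

If a run could match more than one reference row, the JOIN would list it more than once. For an UPDATE that does not change which rows end up modified, but the row counts of the two SELECTs above would differ.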

@Bobgy
Contributor Author

Bobgy commented Nov 1, 2021

Found the offending query in the source code. jingzhang actually left a TODO to optimize it : )

// TODO(jingzhang36): use inner join to replace nested query for better performance.

@Bobgy
Contributor Author

Bobgy commented Nov 1, 2021

Temporarily skipped the archive experiment step in tests to verify whether that resolves the flakiness.

@Bobgy
Contributor Author

Bobgy commented Nov 1, 2021

@Bobgy
Contributor Author

Bobgy commented Nov 2, 2021

Closing because I think the flakiness is resolved. Please reopen if not.

@Bobgy Bobgy closed this as completed Nov 2, 2021
KFP Runtime Triage automation moved this from P0 to Closed Nov 2, 2021