Skip to content
This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

make DBController tolerant to wrong framework request #4889

Closed
hzy46 opened this issue Sep 9, 2020 · 11 comments · Fixed by #5284
Closed

make DBController tolerant to wrong framework request #4889

hzy46 opened this issue Sep 9, 2020 · 11 comments · Fixed by #5284
Labels

Comments

@hzy46
Copy link
Contributor

hzy46 commented Sep 9, 2020

  • Currently, database controller assumes all framework requests it receives are valid. Thus it will try to sync every framework request to api server.
  • However, it is hard to validate the request in the rest-server. There is chance that rest-server generates wrong framework requests. Then database controller will keep trying to sync the wrong request forever.
  • We could make DBcontroller tolerant to wrong framework request.
@hzy46 hzy46 changed the title DBcontroller request error tolerance make DBController tolerant to wrong framework request Sep 9, 2020
@scarlett2018 scarlett2018 mentioned this issue Sep 9, 2020
5 tasks
@yqwang-ms
Copy link
Member

yqwang-ms commented Sep 9, 2020

In previous design #4651,
We assume apiserver submission error as our platform issues, such as we generated invalid json object, so we always retry and expose the errors as job events to user.
image

Is that good enough? As you cannot totally distingush between transient and permenant apiserver submission error.

BTW, any real example for the apiserver submission error?
Curious how can we generate invalid framework object

@hzy46
Copy link
Contributor Author

hzy46 commented Sep 9, 2020

In previous design #4651,
We assume apiserver submission error as our platform issues, such as we generated invalid json object, so we always retry and expose the errors as job events to user.
image

Is that good enough? As you cannot totally distingush between transient and permenant apiserver submission error.

BTW, any real example for the apiserver submission error?
Curious how can we generate invalid framework object

If such error happens, I think the user may not able to stop the job. e.g. rest-server generates a wrong request object -> the request object is stored in db -> db controller tries to sync it to api server, but failed -> user stop the job, which will patch the spec.executionType field of the framework request object in db -> db controller tries to sync it to api server, but failed -> user will see his job is always in Stopping status.

I haven't found a real example.

@yqwang-ms
Copy link
Member

yqwang-ms commented Sep 9, 2020

@hzy46 Valid point for the stop request cannot be satisfied, but we have to send stop to apiserver, because you do not know if the job really stopped (or not exist) in k8s. Considering this issue is most caused by our platform issues, under platform issues, everything may not work properly, so hang in stopping is acceptable, and we need to fix the platform issue ASAP instead of directly delete the job or mark the job as stopped.

@hzy46
Copy link
Contributor Author

hzy46 commented Sep 10, 2020

@yqwang-ms

If ApiServer non-404 error, can we mock a Failed Framework event to watcher? I think it could be a clean solution for this problem.

@yqwang-ms
Copy link
Member

No, non-404 error is too general, pls check above comments

@hzy46
Copy link
Contributor Author

hzy46 commented Sep 10, 2020

It should be:

  • If ApiServer reports non-404 error, and it is caused by a invalid protocol, mock a Failed Framework event to watcher.
  • If ApiServer reports other non-404 error, add the event to db.

@yqwang-ms
Copy link
Member

That's ok now. Be sure that we are sure that the error MUST BE "caused by a invalid protocol", and then have 100% confidence to mock Fail job

If ApiServer reports non-404 error, and it is caused by a invalid protocol, mock a Failed Framework event to watcher.

@mydmdm
Copy link
Contributor

mydmdm commented Sep 13, 2020

Is there a specific error code from FC to indicate failed validation?

@yqwang-ms
Copy link
Member

It is the error code from ApiServer instead of FC. We can just use Status code 422 in the first stage.

BTW, would better give one example of this issue, so we know the exact Status code of it

@hzy46
Copy link
Contributor Author

hzy46 commented Nov 18, 2020

It is the error code from ApiServer instead of FC. We can just use Status code 422 in the first stage.

BTW, would better give one example of this issue, so we know the exact Status code of it

An example found in #5093 . And I think status 413 should also be recorded.

@hzy46 hzy46 mentioned this issue Dec 3, 2020
52 tasks
@hzy46
Copy link
Contributor Author

hzy46 commented Dec 3, 2020

Work items for this issue:

  • If non-404 error in poller, store it as a framework event.
  • If any unrecoverable error happens, mock a Failed Framework event to watcher.
    • Add 413 / 422 in the unrecoverable errors.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants