Added one internal magic to enable retry of session creation. #716
Conversation
**Description**

Fix two issues:
1. `self.session_started`[1] is set to true too early in the exceptional case.
2. The session is added[2] before it has actually started.

Also enhanced the magic kernel by adding one internal magic to allow retrying session creation. Currently, a session creation error is treated as a fatal error, which makes it impossible[3] to retry after fixing the endpoint info.

[1] https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/kernels/kernelmagics.py#L357
[2] https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/livyclientlib/sparkcontroller.py#L90
[3] https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/kernels/kernelmagics.py#L350

Please see the motivation for our use cases.

**Motivation**

To enable users to choose a Spark cluster and connect to it at runtime (meaning the sparkmagic configuration is not generated in advance), we need to configure the endpoint after the kernel starts and allow retrying session creation after fixing an incorrect endpoint. We discussed this with Alejandro (@aggFTW), and he suggested leveraging the following `_do_not_call_*` methods to configure the endpoint dynamically and create a session after the configuration is specified:
1. `_do_not_call_delete_session`
2. `_do_not_call_change_endpoint`
3. `_do_not_call_start_session`

To conditionally allow retrying session creation, the change introduces a flag that permits a retry after a previous fatal session-creation error. By default, the behavior is unchanged: no retry is allowed after a fatal error.

**Testing Done**
1. Added unit tests.
2. Also ran the code in a notebook to verify.

**Backwards Compatibility Criteria (if any)**

The enhancement is activated conditionally, and no compatibility issues are expected.
Hi Devin,
Not exactly; our case overlaps with the original issue reported in #709, but it is not exactly the same.
```python
session.start()
self.session_manager.add_session(name, session)
```
@devstein This is a fair callout. If there is a timeout issue[1], we might leak a session. So we can clean up the session in edge cases caused by, e.g., a timeout.
Any suggestions?
```python
try:
    session.start()
except:
    # is_posted() is determined by self.status != constants.NOT_STARTED_SESSION_STATUS
    if session.is_posted():
        session.delete()
    raise
else:
    self.session_manager.add_session(name, session)
```
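The cleanup-on-failure pattern proposed above can be sketched as a self-contained example. Note that `FakeSession`, `create_session`, and the status strings here are hypothetical stand-ins for illustration, not sparkmagic's actual classes:

```python
# Minimal stand-ins to illustrate the try/except/else cleanup pattern:
# delete the server-side session if start() fails after the session was
# posted, and register the session only when it started successfully.
NOT_STARTED_SESSION_STATUS = "not_started"

class FakeSession:
    def __init__(self, fail_after_post):
        self.status = NOT_STARTED_SESSION_STATUS
        self.deleted = False
        self._fail_after_post = fail_after_post

    def start(self):
        if self._fail_after_post:
            # Simulate a client-side timeout *after* the session was posted
            # to Livy: the status has already left the initial state.
            self.status = "starting"
            raise TimeoutError("client timed out waiting for session")
        self.status = "idle"

    def is_posted(self):
        # The session exists on the Livy side once status leaves NOT_STARTED.
        return self.status != NOT_STARTED_SESSION_STATUS

    def delete(self):
        self.deleted = True

def create_session(manager, name, session):
    try:
        session.start()
    except Exception:
        # Clean up the server-side session to avoid a leak, then re-raise
        # so the caller still sees the original failure.
        if session.is_posted():
            session.delete()
        raise
    else:
        # Only register sessions that actually started.
        manager[name] = session
```

A failed start thus leaves neither a registered session nor a dangling one on the server, while the original exception still propagates to the caller.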
Hi Devin,
I made a code fix for the session leak issue. Please let me know if you have any suggestions or comments.
edwardps@b6c89eb
Will update the PR after confirming the solution.
Thanks,
Ed
Hey Ed, thanks for the quick turnaround. The solution looks good to me. Can you add a comment to explain the logic (avoid leaking sessions)?
Also, cc @aggFTW as he was the last to edit this logic and refactored it. @aggFTW Do you see any issues with changing this logic?
Sure thing. I will add more details in the comment when I update this PR.
Added info about the session-leak edge case handling to the PR description and the commit message.
**Description**

When `session.start()` throws an exception, `session.delete()` is invoked if the session status is not NOT_STARTED_SESSION_STATUS, to clean up the resource. This makes it safe to add the session to the session manager only when the session has started properly. The change addresses an edge case that could leak a session: the client side sees a timeout, but the session has actually already been created on the Livy side.

**Testing Done**

Added unit test cases. Also tested in a notebook.
Description
Fix two issues:
1. `self.session_started`[1] is set to true too early in the exceptional case.
2. The session is added[2] before it has actually started.

Also enhanced the magic kernel by adding one internal magic to allow retrying session creation. Currently, a session creation error is treated as a fatal error, which makes it impossible[3] to retry after fixing the endpoint info.
[1] https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/kernels/kernelmagics.py#L357
[2] https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/livyclientlib/sparkcontroller.py#L90
[3] https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/kernels/kernelmagics.py#L350
The session-leak consideration:
After the change to add only a successfully started session to the session manager, a session can still leak when `start()` hits a timeout on the client side while the session has actually been created on the Livy side (thanks to the comment from @devstein). For example, in the session `start()` logic[4], a timeout exception can occur after a session has been posted to the Livy service. To address this, we clean up the session when the initial `start()` throws an exception and the session status has left the initial state.
[4] - https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/livyclientlib/livysession.py#L151
Please see the motivation for our use cases.
Motivation
To enable users to choose a Spark cluster and connect to it at runtime (meaning the
sparkmagic configuration is not generated in advance), we need to configure the endpoint
after the kernel starts and allow retrying session creation after fixing an incorrect endpoint.
We discussed this with Alejandro (@aggFTW), and he suggested leveraging the following `_do_not_call_*`
methods to configure the endpoint dynamically and create a session after the configuration is specified:
1. `_do_not_call_delete_session`
2. `_do_not_call_change_endpoint`
3. `_do_not_call_start_session`
To conditionally allow retrying session creation, the change introduces a flag
that permits a retry after a previous fatal session-creation error. By default,
the behavior is unchanged: no retry is allowed after a fatal error.
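The retry-gating behavior described above can be sketched as follows. This is an illustrative pattern only; `SessionStarter`, `start_session`, and `allow_retry_on_fatal` are hypothetical names, not sparkmagic's actual API:

```python
# Sketch of gating session-creation retries behind an opt-in flag:
# by default a prior fatal error blocks further attempts (the existing
# behavior); an explicit flag allows retrying after the endpoint is fixed.
class SessionStarter:
    def __init__(self):
        self.fatal_error = False  # set when session creation failed

    def start_session(self, start_fn, allow_retry_on_fatal=False):
        # Preserve default behavior: a previous fatal error is terminal
        # unless the caller explicitly opts in to retrying.
        if self.fatal_error and not allow_retry_on_fatal:
            raise RuntimeError("previous session error was fatal; retry disabled")
        try:
            start_fn()
            self.fatal_error = False  # success clears the fatal state
        except Exception:
            self.fatal_error = True
            raise
```

Because the flag defaults to `False`, existing callers see no behavior change; only the new internal magic would pass `allow_retry_on_fatal=True`.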
Testing Done
1. Added unit tests.
2. Also ran the code in a notebook to verify.
Backwards Compatibility Criteria (if any)
The enhancement is activated conditionally, and no compatibility issues are expected.
Checklist
- CHANGELOG.md file
- README.md file (N/A as no user-visible feature was added; the internal magic is added for kernel integration purposes)