
Fix linux tests to report correctly, get rid of stream destroyed messages on raw kernel shutdown #12643

Merged
merged 7 commits into from Jun 30, 2020

Conversation

IanMatthewHuff
Member

@IanMatthewHuff IanMatthewHuff commented Jun 29, 2020

For #12539

  • Pull request represents a single change (i.e. not fixing disparate/unrelated things in a single PR).
  • Title summarizes what is changing.
  • Has a news entry file (remember to thank yourself!).
  • Appropriate comments and documentation strings in the code.
  • Has sufficient logging.
  • Has telemetry for enhancements.
  • Unit tests & system/integration tests are added/updated.
  • Test plan is updated as appropriate.
  • package-lock.json has been regenerated by running npm install (if dependencies have changed).
  • The wiki is updated with any design decisions/details.

@IanMatthewHuff IanMatthewHuff added the no-changelog No news entry required label Jun 29, 2020
@IanMatthewHuff
Member Author

On the first commit of this PR I got a successful test run, with Linux / Mac / Windows results all in:
https://dev.azure.com/ms/vscode-python/_build/results?buildId=91149&view=results

@IanMatthewHuff IanMatthewHuff changed the title mac fixed? Fix linux tests to report correctly, get rid of stream destroyed messages on raw kernel shutdown Jun 29, 2020
@IanMatthewHuff
Member Author

On Mac and Linux, jupyter server and raw kernel daemon processes were not getting shut down on dispose. The server would be closed, but proc.kill was not killing the process. The processes would pile up and hold onto more and more ports on the machines, which led to problems: on Linux it slowed things down so much that the machine stopped reporting, and on Mac the tests usually finished, but the logs were packed with "Port XXXX blocked" messages. This reproed on my local Mac machine.

```
nc.","type":"Conda","envPath":"/usr/local/miniconda","envName":"base","displayName":"Python 3.7.7 64-bit ('base': conda)","__store":true}, Arg 3: true
2020-06-29T09:16:36.1498470Z Info 2020-06-29 09:16:36: [I 09:16:36.083 NotebookApp] The port 8888 is already in use, trying another port.
2020-06-29T09:16:36.1499250Z [I 09:16:36.083 NotebookApp] The port 8889 is already in use, trying another port.
2020-06-29T09:16:36.1500140Z [I 09:16:36.084 NotebookApp] The port 8890 is already in use, trying another port.
2020-06-29T09:16:36.1500810Z [I 09:16:36.085 NotebookApp] The port 8891 is already in use, trying another port.
2020-06-29T09:16:36.1501440Z [I 09:16:36.086 NotebookApp] The port 8892 is already in use, trying another port.
2020-06-29T09:16:36.1502070Z [I 09:16:36.086 NotebookApp] The port 8843 is already in use, trying another port.
2020-06-29T09:16:36.1502690Z [I 09:16:36.087 NotebookApp] The port 8822 is already in use, trying another port.
2020-06-29T09:16:36.1503330Z [I 09:16:36.088 NotebookApp] The port 8878 is already in use, trying another port.
2020-06-29T09:16:36.1503970Z [I 09:16:36.088 NotebookApp] The port 8850 is already in use, trying another port.
2020-06-29T09:16:36.1504600Z [I 09:16:36.089 NotebookApp] The port 8918 is already in use, trying another port.
2020-06-29T09:16:36.1505230Z [I 09:16:36.090 NotebookApp] The port 8842 is already in use, trying another port.
2020-06-29T09:16:36.1505850Z [I 09:16:36.091 NotebookApp] The port 8859 is already in use, trying another port.
2020-06-29T09:16:36.1506490Z [I 09:16:36.091 NotebookApp] The port 8829 is already in use, trying another port.
2020-06-29T09:16:36.1507120Z [I 09:16:36.092 NotebookApp] The port 8933 is already in use, trying another port.
2020-06-29T09:16:36.1507750Z [I 09:16:36.093 NotebookApp] The port 8964 is already in use, trying another port.
2020-06-29T09:16:36.1508390Z [I 09:16:36.093 NotebookApp] The port 8909 is already in use, trying another port.
2020-06-29T09:16:36.1509020Z [I 09:16:36.094 NotebookApp] The port 8810 is already in use, trying another port.
2020-06-29T09:16:36.1509800Z [I 09:16:36.095 NotebookApp] The port 8832 is already in use, trying another port.
2020-06-29T09:16:36.1510500Z [I 09:16:36.095 NotebookApp] The port 8817 is already in use, trying another port.
2020-06-29T09:16:36.1511150Z [I 09:16:36.096 NotebookApp] The port 8962 is already in use, trying another port.
2020-06-29T09:16:36.1511780Z [I 09:16:36.097 NotebookApp] The port 8833 is already in use, trying another port.
2020-06-29T09:16:36.1512410Z [I 09:16:36.097 NotebookApp] The port 8976 is already in use, trying another port.
2020-06-29T09:16:36.1513030Z [I 09:16:36.098 NotebookApp] The port 8963 is already in use, trying another port.
2020-06-29T09:16:36.1513660Z [I 09:16:36.099 NotebookApp] The port 8794 is already in use, trying another port.
2020-06-29T09:16:36.1514650Z [I 09:16:36.100 NotebookApp] The port 8788 is already in use, trying another port.
2020-06-29T09:16:36.1515310Z [I 09:16:36.100 NotebookApp] The port 8891 is already in use, trying another port.
2020-06-29T09:16:36.1515950Z [I 09:16:36.101 NotebookApp] The port 8833 is already in use, trying another port.
2020-06-29T09:16:36.1516570Z [I 09:16:36.102 NotebookApp] The port 8800 is already in use, trying another port.
2020-06-29T09:16:36.1517210Z [I 09:16:36.102 NotebookApp] The port 8844 is already in use, trying another port.
2020-06-29T09:16:36.1518180Z [I 09:16:36.103 NotebookApp] The port 8855 is already in use, trying another port.
2020-06-29T09:16:36.1518840Z [I 09:16:36.104 NotebookApp] The port 8964 is already in use, trying another port.
2020-06-29T09:16:36.1519460Z [I 09:16:36.105 NotebookApp] The port 8965 is already in use, trying another port.
2020-06-29T09:16:36.1520100Z [I 09:16:36.105 NotebookApp] The port 8787 is already in use, trying another port.
2020-06-29T09:16:36.1521470Z [I 09:16:36.111 NotebookApp] Serving notebooks from local directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/6bb312c4-e12a-4e18-bd9f-aaab7e9d7bea
2020-06-29T09:16:36.1522330Z [I 09:16:36.111 NotebookApp] The Jupyter Notebook is running at:
2020-06-29T09:16:36.1523330Z [I 09:16:36.111 NotebookApp] http://localhost:8884/?token=785855c39fa5554e995d497dbc3bb3c0b5b6da5c8e17c006
2020-06-29T09:16:36.1524240Z [I 09:16:36.112 NotebookApp] or http://127.0.0.1:8884/?token=785855c39fa5554e995d497dbc3bb3c0b5b6da5c8e17c006
2020-06-29T09:16:36.1525520Z [I 09:16:36.112 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
```

@IanMatthewHuff
Member Author

Also, on Windows, Mac, and Linux there were lots of messages like this in the error log, both when running the live extension (they show when shutting down the interactive window) and when running the functional tests. This update seems to resolve that as well:
```
rejected promise not handled within 1 second: Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed
extensionHostProcess.js:976
stack trace: Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed
    at doWrite (_stream_writable.js:414:19)
    at writeOrBuffer (_stream_writable.js:402:5)
    at Socket.write (_stream_writable.js:302:11)
    at /Users/ianhuff/GitHub/vscode-python-ianhu3/node_modules/vscode-jsonrpc/lib/node/ril.js:134:29
    at new Promise (<anonymous>)
    at WritableStreamWrapper.write (/Users/ianhuff/GitHub/vscode-python-ianhu3/node_modules/vscode-jsonrpc/lib/node/ril.js:124:16)
    at StreamMessageWriter.doWrite (/Users/ianhuff/GitHub/vscode-python-ianhu3/node_modules/vscode-jsonrpc/lib/common/messageWriter.js:99:33)
    at /Users/ianhuff/GitHub/vscode-python-ianhu3/node_modules/vscode-jsonrpc/lib/common/messageWriter.js:90:29
    at runNextTicks (internal/process/task_queues.js:58:5)
    at processImmediate (internal/timers.js:412:9)
```
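For context, the error above is easy to reproduce in isolation: writing to any Node stream after it has been destroyed fails with `ERR_STREAM_DESTROYED`, which is what happens when a notification is written to the stdin of a daemon process that has already been killed. A minimal sketch (using a `PassThrough` stream as a stand-in for the daemon's pipe):

```typescript
import { PassThrough } from "stream";

// Returns the error code produced by writing after destroy(); stand-in for
// sending a JSON-RPC notification to an already-killed daemon process.
function writeAfterDestroy(): Promise<string> {
    return new Promise((resolve) => {
        const stream = new PassThrough();
        stream.on("error", () => {
            /* swallow the 'error' event so the process doesn't crash */
        });
        stream.destroy(); // simulate the daemon already being killed
        stream.write("exit-notification", (err) => {
            resolve(err ? ((err as NodeJS.ErrnoException).code ?? err.message) : "ok");
        });
    });
}
```

Checking the disposed flag before writing (as this PR does) avoids ever reaching this state.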

```typescript
this.proc.kill();
} catch {
noop();
if (!this.disposed) {
```
Member Author


Issue one was that the dispose pattern wasn't checking whether the object had already been disposed. This daemon for a jupyter connection was being held both by the connection and by the main async disposable service, and dispose was being called from both. We would end up trying to send notifications to an already-killed process, or trying to signal a second kill.
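The guard described above can be sketched as follows (a minimal, hypothetical illustration, not the PR's actual class): the first caller to dispose wins, and later callers become no-ops, so a daemon held by two owners is only ever killed once.

```typescript
// Hypothetical stand-in for the daemon class; `killCount` replaces the real
// this.proc.kill() so the idempotency is observable.
class DaemonLike {
    private disposed = false;
    public killCount = 0;

    public dispose(): void {
        if (this.disposed) {
            return; // already torn down by the other owner; don't kill twice
        }
        this.disposed = true;
        this.killCount += 1; // stand-in for this.proc.kill()
    }
}
```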


Please can we add this comment into the code? I think it's very useful.
Personally I like to see this in the code as a reminder of how crucial such a simple block of code is (yes, GitHub has these comments, however I think in code would be better - IMHO).

```typescript
if (this.platformService.isWindows) {
    this.proc.kill();
} else {
    this.proc.kill('SIGKILL');
```
Member Author


The change here to directly send SIGKILL on non-Windows actually cleaned up the processes. Full confession: I tried to do the research here, but I'm still a small bit iffy on this. It feels like this makes the close more forceful than it needs to be versus the default (SIGTERM, I think?). However, Jupyter itself directly uses SIGKILL, so I believe this works here.

Note that we also used to send an 'exit' notification here. I don't believe this is needed. We were sending it as a notification and then directly closing the process. That message then often seemed to cause the "STREAM_DESTROYED" error, as it was being sent to a process that had already been killed. I believe that if we had actually needed it before, we would have been awaiting on it.

Windows didn't have the issue with the process sticking around in my testing.
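The OS split discussed above can be sketched as a small helper (a hypothetical illustration, not the PR's actual code): Windows has no POSIX signals, so the default `kill()` is used there, while on Mac/Linux SIGKILL guarantees the daemon exits and releases its ports.

```typescript
import { ChildProcess } from "child_process";

// Kill a daemon process the way the PR describes: default signal on Windows,
// SIGKILL elsewhere (default SIGTERM was leaving orphaned processes behind).
function killDaemon(proc: ChildProcess, isWindows: boolean): void {
    if (isWindows) {
        proc.kill(); // default signal; sufficient on Windows in testing
    } else {
        proc.kill("SIGKILL"); // force-kill so held ports are released
    }
}
```

Note `ChildProcess.kill()` only sends the signal; if the exit needs to be observed, the caller still has to wait for the 'exit' event.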


Again, I'd put this comment in the code, so we know exactly why we're using two different ways of killing a process based on OS.

```typescript
@@ -43,10 +45,6 @@ export class PythonKernelDaemon extends BasePythonDaemon implements IPythonKerne
const request = new RequestType0<void, void, void>('kill_kernel');
await this.sendRequestWithoutArgs(request);
}
```
Member Author


The dispose override was removed from this class. It was also causing issues with destroyed streams and with killing the process. The code for kill_kernel is here:


And it looks to do exactly what dispose is doing now: close on Windows or SIGKILL on Linux. The override was performing the dispose work before the super dispose was even called, which was also leading to messages being sent to a process that had already been SIGKILLed.



Bravo. This must have been a big pain to find.




@DonJayamanne DonJayamanne left a comment


.

@IanMatthewHuff
Member Author

Note: Will test with a full flake run before submitting.

@sonarcloud

sonarcloud bot commented Jun 30, 2020

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A), with 0 Security Hotspots to review
  • Code Smells: 0 (rating A)
  • Coverage: no coverage information
  • Duplication: 1.1%

@IanMatthewHuff IanMatthewHuff marked this pull request as ready for review June 30, 2020 00:19

@rchiodo rchiodo left a comment


:shipit:

@IanMatthewHuff
Member Author

The non-required Windows Py3x Single Workspace failure looks to be a timeout and not a legit test failure (it's happening on many PRs), so merging.

@IanMatthewHuff IanMatthewHuff merged commit 2e47093 into microsoft:master Jun 30, 2020
@IanMatthewHuff IanMatthewHuff deleted the dev/ianhu/linuxNighly branch June 30, 2020 18:08