Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[has a simply repro] TCP ephemeral ports exhausted after lots of early-closed non-blocking connections #3951

Closed
YihaoPeng opened this issue Apr 1, 2019 · 10 comments

Comments

@YihaoPeng
Copy link

YihaoPeng commented Apr 1, 2019

This issue is a repetition of #2913, just to remind Microsoft that #2913 already has a very simple code to reproduce.

In the three months after the reproduction code proposed by philip-searle on 18 Jan, no Microsoft employees paid attention to the old issue. So I can only make a new issue to get attention.

Description

After creating and closing (before established) a large number of non-blocking connections in WSL, all TCP ephemeral ports will be exhausted, then no new TCP connections from WSL or Win32 can be established. Closing related processes in WSL does not release these ports. All new TCP connections or listening will failed and must to restart the LxssManager service to recover.

Reproducible Demo

The demo is from philip-searle's comment of #2913 on 18 Jan.

You can reliably reproduce this issue using the attached program (~80 lines of C): wsl-issue-2913-repro.c.txt

Output from strace looks normal to me and is attached as wsl-issue-2913-repro.strace.zip

The program performs these steps in a loop:

  1. Allocate a TCP socket for IPv4 (AF_INET) family.
  2. Make the socket non-blocking with O_NONBLOCK.
  3. Attempt to connect() the socket to 127.0.0.1:1234
  4. Use getsockname() to obtain the port number and output it to the console.
  5. Close the socket

Environment

In Ubuntu in WSL, build and run the demo with these commands:

apt update
apt install gcc wget
wget -O wsl-issue-2913-repro.c https://github.com/Microsoft/WSL/files/2769821/wsl-issue-2913-repro.c.txt
gcc -o wsl-issue-2913-repro wsl-issue-2913-repro.c
./wsl-issue-2913-repro

Expected Behavior

On a Linux VM I can run the loop several hundred thousand times and see the ports being used cycle through the entire ephemeral range multiple times.

In addition, even if the program has a bug that does not properly release the occupied port, these ports should be automatically released after the program exits.

Observed Behavior

On philip-searle's Windows laptop it loops about 16,000 times and then EINVAL is returned from connect(). At this point the symptoms described in previous comments appear: Win32 programs such as web browsers fail to connect and the output of "netstat -anoq" in a command prompt shows many connections stuck in the "BOUND" state. The only way to get network connections working again is to restart the LxssManager service.

On YihaoPeng's PC with Windows 1809 build 17763.379, the ports will be exhausted after 2899 rounds:

...
Socket 2893 is using port 4930
Socket 2894 is using port 4943
Socket 2895 is using port 4944
Socket 2896 is using port 4945
Socket 2897 is using port 4952
Socket 2898 is using port 4953
Invalid argument returned from connect() - BOUND socket exhaustion likely after 2899 rounds
Exiting (failed)

No ports released after the program exits. If you let the program run repeatedly (so it will immediately take up the ephemeral port released by other programs), you will find that no TCP connections in your Windows can be established. For example, your EDGE browser will not be able to load any page.

Use the following commands to run the program repeatedly:

while true; do ./wsl-issue-2913-repro; done
@YihaoPeng YihaoPeng changed the title [has a simply repro] TCP user port exhaust and network broken after lots of non-blocking connections [has a simply repro] TCP user ports exhausted after lots of early-closing non-blocking connections Apr 1, 2019
@YihaoPeng YihaoPeng changed the title [has a simply repro] TCP user ports exhausted after lots of early-closing non-blocking connections [has a simply repro] TCP ephemeral ports exhausted after lots of early-closing non-blocking connections Apr 1, 2019
@therealkenc
Copy link
Collaborator

Phillip's code below so it doesn't get lost.

#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

const char *const HOSTNAME = "127.0.0.1";
const char *const PORT = "1234";

void die(const char *const format, ...)
{
	va_list argList;
	va_start(argList, format);
	vprintf(format, argList);
	va_end(argList);
	puts("Exiting (failed)");
	exit(EXIT_FAILURE);
}

static void make_socket_non_blocking(int socket_fd)
{
	int flags = fcntl(socket_fd, F_GETFL, 0);
	if (flags == -1) {
		die("Failed to get socket flags: %d", socket_fd);
	}

	flags |= O_NONBLOCK;
	if (fcntl(socket_fd, F_SETFL, flags) == -1) {
		die("Failed to set socket flags: %d", socket_fd);
	}
}

int main(int argc, char **argv)
{
	struct addrinfo hints = { 0 };
	hints.ai_family = AF_INET;
	hints.ai_socktype = SOCK_STREAM;
	hints.ai_flags = AI_ADDRCONFIG | AI_NUMERICSERV;

	struct addrinfo *res = 0;
	int err = getaddrinfo(HOSTNAME, PORT, &hints, &res);
	if (err != 0) {
		die("failed to resolve socket address (err=%d)\n", err);
	}

	int socket_fd = -1;

	unsigned int sockets_allocated = 0;
	for (;;) {
		socket_fd =
		    socket(res->ai_family, res->ai_socktype, res->ai_protocol);
		if (socket_fd == -1) {
			die("Failed to allocate socket after %d rounds: %s\n",
			    sockets_allocated, strerror(errno));
		}
		make_socket_non_blocking(socket_fd);

		if (connect(socket_fd, res->ai_addr, res->ai_addrlen) == -1) {
			if (errno == 22) {
				die("Invalid argument returned from connect() - BOUND socket exhaustion likely after %d rounds\n", sockets_allocated);
			}
			/* We expect not to be able to connect since there is nothing listening on our target port. */
			/* printf("Failed to connect socket, this is expected (%d): %s\n", errno, strerror(errno)); */
		}

		struct sockaddr_in addr = { 0 };
		socklen_t addrlen = sizeof addr;
		if (getsockname(socket_fd, &addr, &addrlen)) {
			printf
			    ("Failed to get name of socket after %d rounds: %s\n",
			     sockets_allocated, strerror(errno));
		} else {
			/* sockaddr_in is big-endian (except address family) */
			addr.sin_port = ntohs(addr.sin_port);
			printf("Socket %d is using port %d\n",
			       sockets_allocated, addr.sin_port);
		}

		close(socket_fd);
		sockets_allocated++;
	}

	freeaddrinfo(res);
}

@YihaoPeng YihaoPeng changed the title [has a simply repro] TCP ephemeral ports exhausted after lots of early-closing non-blocking connections [has a simply repro] TCP ephemeral ports exhausted after lots of early-closed non-blocking connections Apr 2, 2019
@mpa7
Copy link

mpa7 commented Apr 2, 2019

Running the Sysinternals RamMap utility after the above program has terminated provides some insight - the wsl-issue-2913-repro process still exists as a zombie - no surprise considering it owns all those sockets visible in netstat.exe -anoq:

image

There's actually no need to go to the extreme of port exhaustion to repro the underlying issue. In fact we don't need the above code at all, we can use anything that will call connect() and can be fed a target where there is nothing listening.

Zombie wget and curl (sticking with 127.0.0.1:1234):

wget http://127.0.0.1:1234
curl http://127.0.0.1:1234

Can also try for the same result (it's not a localhost thing):

curl http://www.microsoft.com:1234

If there is something listening on 127.0.0.1:1234 (I used netcat) there is no zombie process:

sudo apt install netcat-openbsd
nc -l 127.0.0.1 1234
curl http://127.0.0.1:1234 (need to now ^D in netcat)

So it seems that WSL does not correctly handle failed connects. It looks as though a reference to the socket is maintained internally which results in the owning process becoming a zombie and port exhaustion will eventually (or quickly) occur as per #2913.

As mentioned in #2913 terminating all WSL processes running under the distro allows the sockets to be released and the zombies go away once WSL runs its cleanup.

@therealkenc
Copy link
Collaborator

There's actually no need to go to the extreme of port exhaustion to repro the underlying issue. In fact we don't need the above code at all, we can use anything that will call connect() and can be fed a target where there is nothing listening.

Great follow up.

I was able to reproduce this here on 18865 with RamMap and your wget repro.

This is going to 'splain some other ill-defined reports, if WSL has been leaking like a sieve all this time nd no one noticed. Usually you don't try to connect to something that isn't there, and those NT processes don't show up in a bog-standard Resource Monitor look see. But if, say, you did an apt update and dozens (or hundreds) of fetches failed for whatever reason, that would make a mess.

@roy-bentley
Copy link

roy-bentley commented Apr 22, 2019

I am also experiencing this issue -- it's very frustrating to have to bounce LXSSManager a couple times a day.

It appears that some process (pstree shows it being spawned by init, which isn't actually the case but I'm not clever enough to determine who's firing it) is spawning timeout 1 wget -q -O- --tries=1 http://169.254.169.254 on a really tight loop. Somehow, the port bound to this process isn't released after it runs and can exhaust 55535 ports in less than a day.

There's probably better ways to get some of this data...:

Ephemeral port usage

C:\>netsh int ipv4 show dynamicport tcp

Protocol tcp Dynamic Port Range
---------------------------------
Start Port      : 10000
Number of Ports : 55535

C:\temp>netstat -anq > ports.txt

roy@rbentley:/mnt/c/temp$ grep 'BOUND' ports.txt | wc -l
10590

roy@rbentley:/mnt/c/temp$ uptime
 13:40:56 up  3:03,  0 users,  load average: 0.52, 0.58, 0.59

Unknown process running on loop

roy@rbentley:/mnt/c/temp$ date
Mon Apr 22 13:47:14 EDT 2019
roy@rbentley:/mnt/c/temp$ ps ux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
roy          6  0.0  0.0  19960  6384 tty1     S    10:37   0:00 -bash
roy        221  0.0  0.0  13768  1428 tty1     S    10:37   0:00 tmux -u -2 -f /usr/share/byobu/profiles/tmuxrc new-session -n - /usr/bin/byobu-shell
roy        254  2.7  0.0  17564  5524 ?        Rs   10:38   5:06 tmux -u -2 -f /usr/share/byobu/profiles/tmuxrc new-session -n - /usr/bin/byobu-shell
roy        258  0.0  0.0  20464  6808 pts/1    Ss   10:38   0:01 /bin/bash
roy      27335  0.0  0.0  14184   880 ?        S    12:58   0:00 timeout 1 wget -q -O- --tries=1 http://169.254.169.254
roy      27337  0.0  0.0      0     0 ?        Z    12:58   0:00 [wget] <defunct>
roy      31511  0.0  0.0  19992  6488 pts/0    Ss   13:03   0:00 /bin/bash
roy      31699  0.0  0.0  19992  6600 pts/2    Ss   13:03   0:00 /bin/bash
roy       3545  2.0  0.0  14184   956 ?        S    13:47   0:00 timeout 1 wget -q -O- --tries=1 http://169.254.169.254
roy       3547  3.0  0.0  20028  2104 ?        S    13:47   0:00 wget -q -O- --tries=1 http://169.254.169.254
roy       3549  0.0  0.0  17380  1920 pts/1    R    13:47   0:00 ps ux
roy@rbentley:/mnt/c/temp$ pstree -hp
init(1)─┬─init(5)───bash(6)───tmux: client(221)
        ├─postgres(32201)─┬─postgres(32218)
        │                 ├─postgres(32219)
        │                 ├─postgres(32220)
        │                 ├─postgres(32221)
        │                 ├─postgres(32222)
        │                 └─postgres(32223)
        ├─timeout(27335)───wget(27337)
        ├─timeout(4738)───wget(4740)
        ├─tmux: server(254)─┬─bash(258)───pstree(4746)
        │                   ├─bash(31511)
        │                   ├─bash(31699)
        │                   ├─sh(4742)───byobu-status(4744)───sed(4747)
        │                   └─sh(4743)───byobu-status(4745)───sed(4748)
        └─{init}(4)
roy@rbentley:/mnt/c/temp$ date
Mon Apr 22 13:47:28 EDT 2019
roy@rbentley:/mnt/c/temp$ ps ux | grep wget | grep -v grep 
roy      27335  0.0  0.0  14184   880 ?        S    12:58   0:00 timeout 1 wget -q -O- --tries=1 http://169.254.169.254
roy      27337  0.0  0.0      0     0 ?        Z    12:58   0:00 [wget] <defunct>
roy       3788  1.0  0.0  14184   956 ?        S    13:47   0:00 timeout 1 wget -q -O- --tries=1 http://169.254.169.254
roy       3790  1.0  0.0  20028  2100 ?        S    13:47   0:00 wget -q -O- --tries=1 http://169.254.169.254
roy@rbentley:/mnt/c/temp$ date
Mon Apr 22 13:47:46 EDT 2019
roy@rbentley:/mnt/c/temp$ ps ux | grep wget | grep -v grep 
roy      27335  0.0  0.0  14184   880 ?        S    12:58   0:00 timeout 1 wget -q -O- --tries=1 http://169.254.169.254
roy      27337  0.0  0.0      0     0 ?        Z    12:58   0:00 [wget] <defunct>
roy       4031  1.0  0.0  14184   960 ?        S    13:47   0:00 timeout 1 wget -q -O- --tries=1 http://169.254.169.254
roy       4033  1.0  0.0  20028  2100 ?        S    13:47   0:00 wget -q -O- --tries=1 http://169.254.169.254
roy@rbentley:/mnt/c/temp$ date
Mon Apr 22 13:48:23 EDT 2019
roy@rbentley:/mnt/c/temp$ ps ux | grep wget | grep -v grep 
roy      27335  0.0  0.0  14184   880 ?        S    12:58   0:00 timeout 1 wget -q -O- --tries=1 http://169.254.169.254
roy      27337  0.0  0.0      0     0 ?        Z    12:58   0:00 [wget] <defunct>
roy       4570  0.0  0.0  14184   956 ?        S    13:48   0:00 timeout 1 wget -q -O- --tries=1 http://169.254.169.254
roy       4573  0.0  0.0  20028  2100 ?        S    13:48   0:00 wget -q -O- --tries=1 http://169.254.169.254
roy@rbentley:/mnt/c/temp$ date
Mon Apr 22 13:52:05 EDT 2019
roy@rbentley:/mnt/c/temp$ ps ux | grep wget | grep -v grep 
roy      27335  0.0  0.0  14184   876 ?        S    12:58   0:00 timeout 1 wget -q -O- --tries=1 http://169.254.169.254
roy      27337  0.0  0.0      0     0 ?        Z    12:58   0:00 [wget] <defunct>
roy       7866  1.0  0.0  14184   956 ?        S    13:52   0:00 timeout 1 wget -q -O- --tries=1 http://169.254.169.254
roy       7868  0.0  0.0  20028  2104 ?        S    13:52   0:00 wget -q -O- --tries=1 http://169.254.169.254

Task Manager view

Looking at the bash terminal doesn't really reflect exactly how quickly it's churning, Task Manager shows a better view.
Task manager

@Brian-Perkins
Copy link

Fixed in Windows Insider Build 18890

@CherryDT
Copy link

CherryDT commented Jul 9, 2019

I'm on 18362 and I'm experiencing something very similar to this issue, in particular all the WSL node processes spawned by VSCode's WSL Remote feature end up using all these ports after a while.

However, I can't see this port usage in netstat, nor in TCPView. Only Get-NetTCPConnection | Group-Object -Property State, OwningProcess | Select -Property Count, Name, @{Name="ProcessName";Expression={(Get-Process -PID ($_.Name.Split(',')[-1].Trim(' '))).Name}}, Group | Sort Count -Descending (thanks to this comment) made the issue apparent to me.

Also, I can't see zombie processes, and the ports are released when I simply close VSCode (which then terminates all the node instances).

The output of the above command looks like this:

Count Name               ProcessName        Group
----- ----               -----------        -----
 2260 Bound, 10684                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??58430??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??58429??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??58425??0.0.0.0??0"), MSFT_NetTCPConnection...  1186 Bound, 15120                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??59912??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??59911??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??59905??0.0.0.0??0"), MSFT_NetTCPConnection...  1148 Bound, 11772                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??65516??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??65515??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??65513??0.0.0.0??0"), MSFT_NetTCPConnection...
  828 Bound, 19268                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??61056??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??61055??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??61048??0.0.0.0??0"), MSFT_NetTCPConnection...
  636 Bound, 12232                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??65532??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??65531??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??65530??0.0.0.0??0"), MSFT_NetTCPConnection...
  548 Bound, 15604                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??55026??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??55025??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??55020??0.0.0.0??0"), MSFT_NetTCPConnection...
  482 Bound, 3704                           {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??51290??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??51289??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??51283??0.0.0.0??0"), MSFT_NetTCPConnection...
  468 Bound, 18820                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??60737??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??60736??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??60733??0.0.0.0??0"), MSFT_NetTCPConnection...
  366 Bound, 10400                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??62588??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??62587??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??62581??0.0.0.0??0"), MSFT_NetTCPConnection...
  342 Bound, 8744                           {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??61409??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??61408??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??61403??0.0.0.0??0"), MSFT_NetTCPConnection...
  284 Bound, 2896                           {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??60790??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??60789??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??60785??0.0.0.0??0"), MSFT_NetTCPConnection...
  260 Bound, 9728                           {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??61918??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??61917??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??61913??0.0.0.0??0"), MSFT_NetTCPConnection...
  243 Bound, 23976                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??51931??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??51916??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??51915??0.0.0.0??0"), MSFT_NetTCPConnection...
  236 Bound, 19896                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??64935??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??64934??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??64933??0.0.0.0??0"), MSFT_NetTCPConnection...
  234 Bound, 11800                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??55669??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??55666??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??55648??0.0.0.0??0"), MSFT_NetTCPConnection...
  224 Bound, 18020                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??65534??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??65533??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??65524??0.0.0.0??0"), MSFT_NetTCPConnection...
  213 Bound, 22988                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??53345??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??53343??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??53342??0.0.0.0??0"), MSFT_NetTCPConnection...
  213 Bound, 10676                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??64043??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??64038??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??64037??0.0.0.0??0"), MSFT_NetTCPConnection...
  213 Bound, 20760                          {MSFT_NetTCPConnection (InstanceID = "0.0.0.0??63287??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??63282??0.0.0.0??0"), MSFT_NetTCPConnection (InstanceID = "0.0.0.0??63281??0.0.0.0??0"), MSFT_NetTCPConnection...
...
[and the list goes on for a while]

Does anyone know, is this the same issue? Or a different, maybe related one?

@therealkenc
Copy link
Collaborator

Does anyone know, is this the same issue? Or a different, maybe related one?

It objectively isn't the "same issue", that's for sure. Whether it is related or not is hard to tell. Giving you the straight dope: the only way to tell is going to be spinning up Insiders and see if it goes away. I checked my 18932 and bound ports aren't accumulating; but that isn't conclusive evidence of anything really. Thank-you for the report though. Maybe we'll get some me2s.

@NyaMisty
Copy link

NyaMisty commented Nov 22, 2019

Sadly, even if it's fixed in 18890, but Microsoft refused to bump the newer build to customer. We only got a 18362 -> 18363 update in 1909... Unbelievable.

@Po-wei
Copy link

Po-wei commented Jan 8, 2020

@NyaMisty you should go to insider build

@NyaMisty
Copy link

NyaMisty commented Jan 8, 2020

@NyaMisty you should go to insider build

Easy to say, I would rather not afford the instability brought by insider build, only for such a simple fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants