runtime: dead Gs cause span fragmentation #9869

Open
randall77 opened this Issue Feb 13, 2015 · 14 comments

8 participants
@randall77
Contributor

randall77 commented Feb 13, 2015

Once we allocate a G, it is allocated forever. We have no mechanism to free them.
We should free dead Gs if they sit in the global free queue for long enough. Or maybe free all of them at each GC?

I noticed this while debugging #8832. The stacks for dead Gs are freed at GC time. This leads to a fragmented heap because spans for G storage and stack storage alternate in the heap. When only the stacks are freed, the resulting free spans won't coalesce because the spans for storing the Gs aren't freed.

@randall77

Contributor

randall77 commented Feb 13, 2015

@ianlancetaylor changed the title from "Free dead Gs" to "runtime: Free dead Gs" Feb 13, 2015

@bradfitz bradfitz added this to the Go1.5Maybe milestone Feb 24, 2015

@randall77 randall77 modified the milestones: Go1.6, Go1.5Maybe Jun 25, 2015

@rsc

Contributor

rsc commented Oct 16, 2015

Rephrased as "dead Gs cause span fragmentation". It's not clear we can easily free dead G's - it's hard to be sure all the pointers to them have been found. I expect there is code in the runtime that depends on G's never being freed (reused to hold other types of data). I don't want to debug that.

How big a problem is the fragmentation?

@rsc changed the title from "runtime: Free dead Gs" to "runtime: dead Gs cause span fragmentation" Oct 16, 2015

@randall77

Contributor

randall77 commented Oct 16, 2015

The fragmentation can be pretty bad. The case in #8832 turns the heap into a repeated pattern of one page allocated (for Gs) followed by eight pages free (once used for stacks). So although the heap is ~90% free, we'd have to grow the heap for any large allocation.

That said, I don't think it is terribly common to use lots of Gs for a while and then never use them again.

I agree this may be hard. But it is something to think about if we can find a way.
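The pattern described above (one page kept for Gs, then eight pages freed from stacks) can be illustrated with a toy model. This is only a sketch under stated assumptions: the `[]bool` page map and the helper names `fragmentedHeap` and `heapStats` are hypothetical and are not how the runtime represents spans. It shows why a heap that is almost entirely free still cannot satisfy an allocation larger than eight contiguous pages.

```go
package main

import "fmt"

// fragmentedHeap builds a simulated page map (hypothetical model, not the
// runtime's data structure): each group is 1 page in use (G descriptors)
// followed by 8 free pages (stacks that were released at GC time).
func fragmentedHeap(groups int) []bool {
	pages := make([]bool, 0, groups*9)
	for i := 0; i < groups; i++ {
		pages = append(pages, true) // page holding Gs, never freed
		for j := 0; j < 8; j++ {
			pages = append(pages, false) // freed stack page
		}
	}
	return pages
}

// heapStats returns the fraction of free pages and the longest run of
// contiguous free pages, which bounds the largest allocation that fits
// without growing the heap.
func heapStats(pages []bool) (freeFrac float64, longestFreeRun int) {
	if len(pages) == 0 {
		return 0, 0
	}
	free, run := 0, 0
	for _, used := range pages {
		if used {
			run = 0
			continue
		}
		free++
		run++
		if run > longestFreeRun {
			longestFreeRun = run
		}
	}
	return float64(free) / float64(len(pages)), longestFreeRun
}

func main() {
	pages := fragmentedHeap(1000)
	frac, run := heapStats(pages)
	fmt.Printf("pages=%d free=%.1f%% longest free run=%d pages\n",
		len(pages), frac*100, run)
}
```

In this model 8 of every 9 pages are free, yet the longest free run never exceeds 8 pages no matter how many groups there are, so any request for 9 or more contiguous pages would have to grow the heap.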

@prashantv

Contributor

prashantv commented Aug 2, 2017

> That said, I don't think it is terribly common to use lots of Gs for a while and then never use them again.

Most network servers tend to create a new goroutine per incoming request (e.g., the HTTP server). If there's a sudden huge spike in requests (often not intentional, but due to a misbehaving external service), a huge number of goroutines can be created, and even after the requests have been served, the process still consumes a huge amount of memory. I'm assuming this is due to the fragmentation described in this issue.

As an example, we saw one process in production with a system usage of > 100GB, while the Go runtime thought it was using < 4GB.

Pprof output: [image]

ps output: [image]

@mandarjog

mandarjog commented Oct 23, 2017

I am seeing a similar issue I think. runtime.malg is the most prolific. It is possible that this is in response to a sudden influx of requests.

[image: profile002]

@randall77

Contributor

randall77 commented Oct 23, 2017

@mandarjog : Why do you think this is related to fragmentation?
malg allocates goroutine stacks. Maybe you just have a lot of goroutines and thus lots of stack.

This issue is about what happens after all those goroutines complete. The syndrome I would expect to see is that the heap is mostly unused, including having little memory used by stacks, but large allocations still grow the heap.
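One way to check for the syndrome described above is to look at the runtime's own counters after the goroutines have finished. The following is a minimal sketch, assuming that `runtime.MemStats` fields are an adequate proxy; the `heapSnapshot` type and its `idleFraction` method are hypothetical helpers introduced here for illustration. A heap that is mostly idle with little stack in use, but whose `HeapSys` still grows on large allocations, would be consistent with this issue.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapSnapshot holds the runtime.MemStats fields relevant to this issue.
// (Hypothetical helper type, not part of the runtime API.)
type heapSnapshot struct {
	HeapSys      uint64 // heap memory obtained from the OS
	HeapInuse    uint64 // bytes in in-use spans
	HeapIdle     uint64 // bytes in idle (unused) spans
	HeapReleased uint64 // idle bytes already returned to the OS
	StackInuse   uint64 // bytes in stack spans
}

// takeSnapshot reads the current memory statistics from the runtime.
func takeSnapshot() heapSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return heapSnapshot{
		HeapSys:      m.HeapSys,
		HeapInuse:    m.HeapInuse,
		HeapIdle:     m.HeapIdle,
		HeapReleased: m.HeapReleased,
		StackInuse:   m.StackInuse,
	}
}

// idleFraction reports how much of the heap obtained from the OS is idle.
func (s heapSnapshot) idleFraction() float64 {
	if s.HeapSys == 0 {
		return 0
	}
	return float64(s.HeapIdle) / float64(s.HeapSys)
}

func main() {
	s := takeSnapshot()
	fmt.Printf("HeapSys=%d HeapInuse=%d HeapIdle=%d HeapReleased=%d StackInuse=%d\n",
		s.HeapSys, s.HeapInuse, s.HeapIdle, s.HeapReleased, s.StackInuse)
	fmt.Printf("idle fraction of heap: %.2f\n", s.idleFraction())
}
```

Taking one snapshot right after the goroutine spike has drained and another after a large allocation makes it possible to compare the idle fraction against any growth in `HeapSys`.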

@mandarjog

mandarjog commented Oct 23, 2017

After the goroutines are done, I expect the usage to come down.
Should we expect the number of allocated Gs to stay at the high-water mark of goroutines?

https://user-images.githubusercontent.com/18554027/31787111-a33ae35e-b4d8-11e7-9cf1-2410427c1473.png

We see here that externally measured memory comes back down, but does not go all the way down.
I am not certain that fragmentation is the issue, but it is possible.

@randall77

Contributor

randall77 commented Oct 23, 2017

Once a G is allocated, it is never freed, so yes the number of Gs at any point is the high water mark of the execution so far.
But, that's only for the G descriptor itself. That does not include a G's stack. Stacks of finished Gs are freed during garbage collection, so that space should not remain at the high water mark.
Stacks are 90%+ of the space used for a goroutine, so I don't think that can account for your graph - you're retaining at least 30% of the high water mark.

This issue is unlikely to affect the amount of memory used as reported by the OS. The fragmentation described in this issue is only for virtual addresses - all the holes described here can be (and are) given back to the OS by the scavenger. (Unless you're running on an OS where the page size is >8K).
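The behaviour described above can be probed with a small experiment. This is a sketch, not a definitive test: the helper names `spike` and `stackInuse` are hypothetical, the goroutine count is arbitrary, and the exact numbers depend on the Go version and platform. It starts many goroutines, records stack usage at the peak, lets them all exit, then forces a collection and asks the runtime to return memory to the OS.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"sync"
)

// stackInuse returns the bytes currently in stack spans.
func stackInuse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.StackInuse
}

// spike starts n goroutines that block until released, samples the
// goroutine count and stack usage at the peak, then lets them all finish.
func spike(n int) (peakGoroutines int, peakStack uint64) {
	var started, done sync.WaitGroup
	release := make(chan struct{})
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-release // stay alive until the peak has been measured
		}()
	}
	started.Wait()
	peakGoroutines = runtime.NumGoroutine()
	peakStack = stackInuse()
	close(release)
	done.Wait()
	return peakGoroutines, peakStack
}

func main() {
	before := stackInuse()
	peakG, peakStack := spike(10000)

	runtime.GC()         // per the thread, stacks of dead Gs are freed at GC time
	debug.FreeOSMemory() // also asks the runtime to return freed memory to the OS

	fmt.Printf("StackInuse before=%d peak=%d after=%d\n",
		before, peakStack, stackInuse())
	fmt.Printf("goroutines at peak=%d now=%d\n", peakG, runtime.NumGoroutine())
}
```

If the explanation in the comment holds, `StackInuse` after the collection should fall well below its peak, while the G descriptors themselves remain allocated and are not visible in this number.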

If you'd like to continue investigating, please open a new issue.

@robarchibald

robarchibald commented Oct 25, 2017

I'm seeing this issue in an application I wrote as well. It is an HTTP server which essentially kicks off a specialized background web crawler. A "scan" request comes to the http server which throws the crawler into a goroutine and responds to the http request with "scan started successfully". Unfortunately, once it allocates memory, it never releases it no matter how long I wait. I wrote a simple program to show what is happening below. I posted on golang-nuts and was led here. Below is my post in its entirety. For me, this is a showstopper issue. If Go won't release memory from a very simple goroutine and it takes up gigs of memory (on my production app), I'll have to rewrite using something else.

Golang-nuts post:
I've got a nasty memory leak. It appears that Go won't reclaim memory when goroutines are called in a highly parallel fashion.

To illustrate the issue I've written a simple example below. When I run this on my Windows system, I get 800MB to 1 GB of memory consumption.

What's fascinating is that if I change the time.Sleep to 100 milliseconds instead of 1, the process never goes above 6 MB of memory. This is what I would expect since there are no objects here that should be retained in memory. And, it doesn't matter how long I wait, the garbage collector never cleans up the mess either.

I've tried to profile it using pprof and it didn't help me. It helped me find and fix other issues, but not this. It's entirely possible I did it wrong though since I am new to using that tool.

Help! Thanks in advance!

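```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"math/rand"
	"runtime"
	"time"
)

func main() {
	// Start 1000 goroutines, one per millisecond.
	for i := 0; i < 1000; i++ {
		time.Sleep(time.Millisecond * 1)
		go fakeGetAndSaveData()
	}
	runtime.GC()
	// Stay alive so memory usage can be observed from outside the process.
	time.Sleep(10 * time.Minute)
}

// fakeGetAndSaveData builds 40000 lines in memory and writes them to a
// randomly named file in the current directory.
func fakeGetAndSaveData() {
	var buf bytes.Buffer
	for i := 0; i < 40000; i++ {
		buf.WriteString(fmt.Sprintf("the number is %d\n", i))
	}

	ioutil.WriteFile(fmt.Sprintf("%d.txt", rand.Int()), buf.Bytes(), 0644)
}
```
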
@randall77

Contributor

randall77 commented Oct 25, 2017

@robarchibald I'm pretty sure what you are seeing is not related to this issue. Please open a new one. The reason I'm pretty sure is that the space used is all for bytes.Buffer, it isn't for the goroutine descriptors or their stacks. Even if we retained the descriptors and their stacks, it is only ~8MB (stacks on Windows are 8K to start).

How are you measuring the memory used by the process? The Go runtime should give all the unused memory back to the OS after 5 minutes. Although I admit I'm unsure how this is handled on Windows.

The difference between a 1ms sleep and a 100ms sleep is probably that in the latter, each goroutine finishes before the next one is started, so there is only one goroutine at a time. In the former, there are lots of goroutines running simultaneously, each one using almost 1MB for its buffer.
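The sizes mentioned above can be checked with a little arithmetic. The sketch below rebuilds the same buffer contents as `fakeGetAndSaveData` and compares them with the stack estimate. The helper name `bufferBytes` is hypothetical, the 8K initial stack is the figure quoted in this thread, and the last line assumes the worst case in which all 1000 goroutines hold their buffers at the same time.

```go
package main

import (
	"bytes"
	"fmt"
)

// bufferBytes rebuilds the buffer contents from fakeGetAndSaveData and
// returns the buffer's length and its allocated capacity in bytes.
func bufferBytes(lines int) (length, capacity int) {
	var buf bytes.Buffer
	for i := 0; i < lines; i++ {
		fmt.Fprintf(&buf, "the number is %d\n", i)
	}
	return buf.Len(), buf.Cap()
}

func main() {
	const goroutines = 1000
	const initialStack = 8 << 10 // 8K starting stack, as quoted in the thread

	length, capacity := bufferBytes(40000)
	fmt.Printf("one buffer: len=%d bytes, cap=%d bytes\n", length, capacity)
	fmt.Printf("initial stacks for %d goroutines: %d bytes\n",
		goroutines, goroutines*initialStack)
	// Worst case (assumption): every goroutine is alive at the same time.
	fmt.Printf("buffer contents if all %d goroutines overlap: %d bytes\n",
		goroutines, goroutines*length)
}
```

Each buffer holds 788,890 bytes of text, so 1000 overlapping goroutines account for roughly 789MB of buffer contents before counting the buffers' spare capacity. That is in the range of the 800MB to 1GB reported, and about a hundred times the ~8MB of initial stacks.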

@ghost

ghost commented Oct 25, 2017

Thanks @randall77. I'm measuring memory using the OS. The results are the same whether in Ubuntu 16.04 (production) or Windows 10 (my dev box). And, I specifically set the time.Sleep to 10 minutes because I know that garbage collection is supposed to clean up after 5 minutes. Plus, I explicitly call runtime.GC() immediately in case I could get it to collect the trash earlier. No dice either way. Go never releases the memory even after waiting a week. Sorry, I wasn't patient enough to wait longer than that. :)

I can certainly open a new issue, but this isn't an issue with bytes.Buffer. This is a problem with goroutine cleanup. Sure, this particular issue was opened because the descriptor and stack weren't cleaned up, but I think I'm showing here that the issue is MUCH bigger than that. Nothing is getting cleaned up if the system is busy enough or if the parallelism is high enough... at least that's what I'm assuming here since I can change the delay to be longer and the problem goes away.

It isn't a bytes.Buffer issue. It isn't an ioutil.WriteFile issue. I only used bytes.Buffer for this example and that's not what I'm using in my real application. And, I can do a fmt.Println instead of ioutil.WriteFile and it still does the same thing too. As @prashantv mentioned, he's seen 100 GB of memory usage due to what sounds like a similar issue to mine.

@ianlancetaylor

Contributor

ianlancetaylor commented Oct 25, 2017

@6degreeshealth Show us exactly how you are measuring memory. Ideally on Ubuntu. Don't use words to tell us what you are doing, show us the exact commands that you are using, and the exact output that you see. Thanks.

@ghost

ghost commented Oct 25, 2017

On Linux, I'm using the `top -b | grep memoryLeak` command, and on Windows I use Task Manager. I just ran this again on Ubuntu 16.04 and it looks like it isn't exhibiting the problem after all. When I ran it yesterday it crashed my VM, but it wasn't due to running out of memory as I'd assumed. But I do see the problem clearly on Windows. Sorry, I'll put together a different example for Linux.

Here's what I see on Windows. This is after it's gone idle. Nothing new is being spawned. It's just sitting there. It stays like this for 10 minutes until the program closes.

[image]

@randall77

Contributor

randall77 commented Oct 25, 2017

I've opened #22439 to discuss @6degreeshealth 's issues.
Let's leave this bug for G fragmentation issues.
