Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: CPU profiling and fork/exec go into infinite loop #5517

Closed
remyoudompheng opened this issue May 20, 2013 · 13 comments
Closed

runtime: CPU profiling and fork/exec go into infinite loop #5517

remyoudompheng opened this issue May 20, 2013 · 13 comments
Milestone

Comments

@remyoudompheng
Copy link
Contributor

@remyoudompheng remyoudompheng commented May 20, 2013

What steps will reproduce the problem?
1. Compile and run the following program

package main

import (
    "crypto/md5"
    "log"
    "net/http"
    _ "net/http/pprof"
    "os/exec"
    "time"
)

func main() {
    go work()
    err := http.ListenAndServe(":9000", nil)
    if err != nil {
        log.Fatal(err)
    }
}

var garbage = make([][]byte, 32)

func work() {
    for i := range garbage {
        garbage[i] = make([]byte, 20<<20)
    }
    tick := time.NewTicker(time.Second / 30)
    for _ = range tick.C {
        for i, s := range garbage {
            for j := range s {
                s[j] = byte(i+j)
            }
        }
        cmd := exec.Command("ls")
        blah, err := cmd.CombinedOutput()
        if err != nil {
            log.Fatal(err)
        }
        h := md5.New()
        h.Write(blah)
        log.Printf("%x", h.Sum(nil))
    }
}

2. Run go tool pprof http://localhost:9000/debug/pprof/profile

What is the expected output? What do you see instead?

Expected: the programs gets profiled.

Instead: curl times out, the program remains stuck in an infinite loop.

Read http://localhost:9000/debug/pprof/symbol
Gathering CPU profile from http://localhost:9000/debug/pprof/profile?seconds=30 for 30
seconds to
  /home/remy/pprof/crash.1369041399.localhost-port9000
Be patient...
Failed to get profile: curl --max-time 90 -s -o
/home/remy/pprof/.tmp.crash.1369041399.localhost-port9000
'http://localhost:9000/debug/pprof/profile?seconds=30': 

strace output:

[pid 24975] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24975] fork( <unfinished ...>
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24975] <... fork resumed> )        = ? ERESTARTNOINTR (To be restarted)
<0.048847>
[pid 24975] fork( <unfinished ...>
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24975] <... fork resumed> )        = ? ERESTARTNOINTR (To be restarted)
<0.047955>
[pid 24975] fork( <unfinished ...>
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24972] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 24975] <... fork resumed> )        = ? ERESTARTNOINTR (To be restarted)
<0.048301>
[pid 24975] fork( <unfinished ...>

Which version of Go am I using?

The problem happens with Go 1.0.3 and Go 1.1. When I run strace on the Go 1.1 binary,
the program goes out of the infinite loop after a few seconds. On Go 1.0.3, the program
remains stuck.
@remyoudompheng

This comment has been minimized.

Copy link
Contributor Author

@remyoudompheng remyoudompheng commented May 20, 2013

Comment 1:

The garbage in the example code is meant to make the fork longer (20-40ms here), you may
need to increase the garbage size in order to reproduce.
@remyoudompheng

This comment has been minimized.

Copy link
Contributor Author

@remyoudompheng remyoudompheng commented May 20, 2013

Comment 2:

I forgot system information: this is on Linux amd64
At home I use Archlinux:
$ uname -a
Linux mastermind 3.8.0-1-mastermind #1 SMP PREEMPT Thu Feb 21 10:34:25 CET 2013 x86_64
GNU/Linux
I also encounter the issue on Debian old stable (squeeze) which runs Linux 2.6.32.
@dvyukov

This comment has been minimized.

Copy link
Member

@dvyukov dvyukov commented Jun 16, 2013

Comment 3:

Simpler reproducer:
package main
import (
    "fmt"
    "os/exec"
    "time"
)
//var ballast = make([][]byte, 2e8)
func main() {
    go func() {
        prev := time.Now()
        for now := range time.NewTicker(time.Millisecond).C {
            fmt.Println(now.Sub(prev))
            prev = now
        }   
    }()
    for {
        exec.Command("ls").CombinedOutput()
    }
}
With the ballast there are that 15ms delays:
999.337us
996.273us
1.001548ms
1.003009ms
1.000131ms
1.022096ms
15.632691ms
321.769us
995.469us
999.496us
1.000067ms
999.972us
1.000004ms
1.000169ms
999.43us
16.426056ms
583.624us
990.718us
1.012777ms
987.411us
999.783us
999.792us
1.000195ms
999.824us
15.317368ms
687.186us
995.589us
999.954us
@dvyukov

This comment has been minimized.

Copy link
Member

@dvyukov dvyukov commented Jun 16, 2013

Comment 4:

The problem is that fork on linux uses RawSyscall, so it completely blocks the process.
One solution may be to carefully use vfork, it should be much faster.
Another solution may be to carefully use entersyscall/exitsyscall
around fork, but the child process must not execute exitsyscall
because it can deadlock.
@alberts

This comment has been minimized.

Copy link
Contributor

@alberts alberts commented Jul 8, 2013

Comment 5:

Seems related to issue #5838. Will fork with whole heap marked MADV_DONTFORK also be
faster?
@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Jul 30, 2013

Comment 6:

Labels changed: added priority-later, go1.2, removed priority-triage.

@dvyukov

This comment has been minimized.

Copy link
Member

@dvyukov dvyukov commented Jul 31, 2013

Comment 7:

Owner changed to @dvyukov.

@dvyukov

This comment has been minimized.

Copy link
Member

@dvyukov dvyukov commented Jul 31, 2013

Comment 8:

comments #3 and #4 were intended for issue #5717
@dvyukov

This comment has been minimized.

Copy link
Member

@dvyukov dvyukov commented Jul 31, 2013

Comment 9:

funny, if you preempt clone() with signals frequently enough, it... never finishes:
https://code.google.com/p/gperftools/issues/detail?id=278
@dvyukov

This comment has been minimized.

Copy link
Member

@dvyukov dvyukov commented Aug 8, 2013

Comment 10:

Status changed to Started.

@dvyukov

This comment has been minimized.

Copy link
Member

@dvyukov dvyukov commented Aug 13, 2013

Comment 11:

This issue was closed by revision e33e476.

Status changed to Fixed.

@mdempsky

This comment has been minimized.

Copy link
Member

@mdempsky mdempsky commented Feb 27, 2015

Would simply using pthread_sigmask() to temporarily block SIGPROF while calling clone() have worked too? Or would generating SIGPROF and/or interrupting another thread still have caused clone() failures?

Could there be any profiling inaccuracies if CPU profiling timers continue to countdown during clone(), but instead the SIGPROF signals are sent to a different thread? Would it be worse than the status quo where we disable SIGPROF entirely and then re-enable it?

I ask because technically I think right now calling syscall.ForkExec and runtime.SetCPUProfileRate concurrently could cause problems, as they both try to enable/disable the process-wide CPU profiler. This can be fixed with a locking protocol, of course, but it seems like a slightly simpler solution would be to make SetCPUProfileRate exclusively responsible for modifying the process-wide CPU profiler state, and have ForkExec just modify the thread-local SIGPROF mask state.

@dvyukov

This comment has been minimized.

Copy link
Member

@dvyukov dvyukov commented Feb 27, 2015

Would simply using pthread_sigmask() to temporarily block SIGPROF while calling clone() have worked too?

I don't remember. I should have been added a test for that issue.

@rsc rsc added this to the Go1.2 milestone Apr 14, 2015
@rsc rsc removed the go1.2 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.