Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDK 3.9.4: Goroutine leak in reportToRun #11805

Closed
Segflow opened this issue Feb 9, 2023 · 13 comments
Closed

SDK 3.9.4: Goroutine leak in reportToRun #11805

Segflow opened this issue Feb 9, 2023 · 13 comments
Labels
bug Categorizes issue or PR as related to a bug. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.

Comments

@Segflow
Copy link

Segflow commented Feb 9, 2023

In helm.sh/helm/v3 v3.9.4 the method reportToRun is implemented as follow:

func (i *Install) reportToRun(c chan<- resultMessage, rel *release.Release, err error) {
	i.Lock.Lock()
	if err != nil {
		rel, err = i.failRelease(rel, err)
	}
	c <- resultMessage{r: rel, e: err}
	i.Lock.Unlock()
}

I have confirmed that after performing an install action (using RunWithContext) leads to goroutine leak.

image

The above screenshot shows that goroutines are stuck in reportToRun. Here a sample stack trace of one of those routines:

goroutine 15463 [chan send, 32 minutes]:
helm.sh/helm/v3/pkg/action.(*Install).reportToRun(0xc0026086e0, 0xc0026804c0?, 0x3?, {0x22dd9c0?, 0xc006a7bf20?})
	/go/pkg/mod/helm.sh/helm/v3@v3.9.4/pkg/action/install.go:435 +0xd6
helm.sh/helm/v3/pkg/action.(*Install).performInstall(0xc0026086e0, 0xc005269f60?, 0xc005216310, {0x0?, 0x0?, 0x0?}, {0xc0026804c0, 0x3, 0x4})
	/go/pkg/mod/helm.sh/helm/v3@v3.9.4/pkg/action/install.go:389 +0x2ea
created by helm.sh/helm/v3/pkg/action.(*Install).RunWithContext
	/go/pkg/mod/helm.sh/helm/v3@v3.9.4/pkg/action/install.go:349 +0x128a

The block seems to be in the chan send call.

On another note, I believe if we fall into the error path, the i.Lock will stay locked forever.

@Segflow
Copy link
Author

Segflow commented Feb 9, 2023

According to this code:


// Run executes the installation with Context
func (i *Install) RunWithContext(ctx context.Context, chrt *chart.Chart, vals map[string]interface{}) (*release.Release, error) {
        ...
        rChan := make(chan resultMessage)
	doneChan := make(chan struct{})
	defer close(doneChan)
	go i.performInstall(rChan, rel, toBeAdopted, resources)
	go i.handleContext(ctx, rChan, doneChan, rel)
	result := <-rChan
	//start preformInstall go routine
	return result.r, result.e
}

We read from rChan only once. But both performInstall and handleContext do call reportToRun (more then once) which tries to send to rChan.

So basically rChan is getting consumed only once why we send to it more then once.

@joejulian joejulian added bug Categorizes issue or PR as related to a bug. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 10, 2023
@joejulian
Copy link
Contributor

Have you checked to see if that's still in main?

@Segflow
Copy link
Author

Segflow commented Feb 10, 2023

Yes, it's the same code in main

@zhanghaohao
Copy link

I have seen the same problem in my program.
My program build helm client and call (*Upgrade)RunWithContext to perform upgrade/install.
After the program runs for a period, it ends up with OOM.
so i used pprof to parse the memory allocation and goroutine, i noticed memory leak and goroutine leak.
memory allocation flame graph as follow:
截屏2023-02-11 10 44 48
goroutine flame graph as follow :
截屏2023-02-11 10 50 13

When we run helm client command to call (*Upgrade)RunWithContext, it is fine, because the main func exits after upgrade/install finished. but if we develop program by call (*Upgrade)RunWithContext, the main func will not exit(until we want it exit), and the goroutine leak happens, will caused memory leak which is a bigger problem.

@Segflow
Copy link
Author

Segflow commented Feb 16, 2023

@joejulian is there any plan to work on this issue soon?

@zhanghaohao
Copy link

zhanghaohao commented Feb 20, 2023

this problem had been solved according to the code in tag: v3.9.4

	go u.releasingUpgrade(rChan, upgradedRelease, current, target, originalRelease)  
	go u.handleContext(ctx, doneChan, ctxChan, upgradedRelease)  
	select {  
	case result := <-rChan:
		return result.r, result.e
	case result := <-ctxChan:
		return result.r, result.e
	}

related issue #10439
could close the issue now

@Segflow
Copy link
Author

Segflow commented Feb 20, 2023

No that in the upgrade release path not install release path

@zhanghaohao
Copy link

zhanghaohao commented Feb 22, 2023

@Segflow sorry, i didnot notice that you are mentioning install release path rather than upgrade release path.
after reading the install code carefully, i realized that it should not lead to goroutine leak in normal cases.

// Run executes the installation with Context
func (i *Install) RunWithContext(ctx context.Context, chrt *chart.Chart, vals map[string]interface{}) (*release.Release, error) {
        ...
        rChan := make(chan resultMessage)
	doneChan := make(chan struct{})
	defer close(doneChan)
	go i.performInstall(rChan, rel, toBeAdopted, resources)
	go i.handleContext(ctx, rChan, doneChan, rel)
	result := <-rChan
	//start preformInstall go routine
	return result.r, result.e
}

according to the above code, in normal cases, performInstall should finish with result sent to rChan, and result := <-rChan is unblocked, then close doneChan before return.
the close(doneChan) will lead handleContext goroutine return. this is what we expected.

func (i *Install) handleContext(ctx context.Context, c chan<- resultMessage, done chan struct{}, rel *release.Release) {
	select {
	case <-ctx.Done():
		err := ctx.Err()
		i.reportToRun(c, rel, err)
	case <-done:
		return
	}
}

@Segflow how do you think about this?

@Segflow
Copy link
Author

Segflow commented Mar 1, 2023

The issue is performInstall also calls reportToRun, this leads to have both

	go i.performInstall(rChan, rel, toBeAdopted, resources)
	go i.handleContext(ctx, rChan, doneChan, rel)

Send to the channel. We have this issue in our production

@zhanghaohao
Copy link

zhanghaohao commented Mar 3, 2023

i guess the case <-ctx.Done() in handleContext is triggered in your case, this could lead to goroutine leak due to performInstall and handleContext both are sending to rChan.
but goroutine leak donot happen in other cases.

Please correct me if i am wrong.

@Segflow
Copy link
Author

Segflow commented Mar 3, 2023

indeed

@Segflow
Copy link
Author

Segflow commented Apr 10, 2023

Any update on this please?

@mattfarina
Copy link
Collaborator

There's a PR to address this at #11978

mattfarina added a commit that referenced this issue Apr 11, 2023
During the install process there was a place where an install
process could be stuck trying to write to a channel. This would
happen when a context had completed prior to performInstall
finishing. In a short running Helm Client this was not a problem.
But, for long running applications that use Helm as an SDK there
are problems where a memory leak ends up happening due to
goroutines never being able to complete.

This fix provides a means for performInstall to write to its
channel using the method already used to fix the upgrade
issue of the same kind.

Fixes #11805

Signed-off-by: Matt Farina <matt.farina@suse.com>
(cherry picked from commit 7c9d636)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Categorizes issue or PR as related to a bug. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
Projects
None yet
Development

No branches or pull requests

4 participants