Potential hang for eCapture under really rare circumstances #500

Closed
ruitianzhong opened this issue Mar 3, 2024 · 2 comments
Labels: good first issue, improve, question

Comments

ruitianzhong (Contributor) commented on Mar 3, 2024

Describe the bug
Due to a design flaw in the iworker model, eCapture might hang under rare conditions.

This bug was discovered through code review; to demonstrate it reliably, I designed a PoC (Proof of Concept).

To Reproduce

This is my PoC commit: ruitianzhong@a2e4111.

It essentially does two things:

  • First, it adds a unit test that writes one event (forcing the EventProcessor.Serve() goroutine to create a worker) and then sleeps for two seconds (waiting for that worker to be retired). Finally, the unit test writes a large number of identical events.
  • Second, it injects a delay into the following code, just before ew.Close() (in pkg/event_processor/iworker.go):
if ew.tickerCount > MaxTickerCount {
	//ew.processor.GetLogger().Printf("eventWorker TickerCount > %d, event closed.", MaxTickerCount)
	time.Sleep(time.Second * 15) // inject some delay here (just to show the possible interleaving)
	ew.Close()
	return
}

This is reasonable, because we should assume that a goroutine can be scheduled out at any point; the injected sleep only makes one unlucky interleaving deterministic. A toy illustration of this window is shown below.
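
To make the window concrete, here is a small self-contained toy (illustrative stand-in names, not eCapture code) showing how a delay between the decision to retire and the actual return lets events pile up in a channel that will never be drained again:

package main

import (
	"fmt"
	"time"
)

func main() {
	incoming := make(chan string, 2) // tiny buffer, standing in for ew.incoming (size 16)

	go func() { // the "worker"
		for {
			select {
			case ev := <-incoming:
				fmt.Println("worker handled", ev)
			case <-time.After(200 * time.Millisecond): // stand-in for the ticker expiry
				// Decision made: this worker is done. Between this point and the
				// actual cleanup the goroutine may be scheduled out...
				time.Sleep(1 * time.Second) // injected delay, as in the PoC
				return // ...and it returns without ever draining incoming again
			}
		}
	}()

	time.Sleep(300 * time.Millisecond) // let the worker pass its decision point
	incoming <- "event-1"              // goes into the buffer, will never be handled
	incoming <- "event-2"              // buffer is now full
	fmt.Println("the next send would block forever")
	// incoming <- "event-3"           // uncommenting this deadlocks the main goroutine
}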

On the PoC commit, run:

go test -v ./pkg/event_processor/ -run TestHang

The output:

begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
end Write
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
end Write
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
end Write
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang

Clearly, eCapture hangs.

Expected behavior
eCapture might be slowed down, but it should not hang.

ruitianzhong (Contributor, Author) commented:

Bug Analysis

In the following code, suppose the worker goroutine is currently just before ew.Close():

case _ = <-ew.ticker.C:
	// output the packet
	if ew.tickerCount > MaxTickerCount {
		//ew.processor.GetLogger().Printf("eventWorker TickerCount > %d, event closed.", MaxTickerCount)
		ew.Close()
		return
	}

At this very moment, multiple events from the same process may be generated. They are first passed to the EventProcessor.Serve routine through the ep.incoming channel (buffer size 1024) and then forwarded to the worker routine through the ew.incoming channel (buffer size 16). Because these events share the same UUID, they are all dispatched to the worker that is sitting just before ew.Close(). That worker is no longer receiving from its channel, so each event is placed into the channel's buffer and the send does not block yet. However, those events are lost, because the worker will never drain its buffer. Worse, if enough events are generated, the ew.incoming buffer (size 16) fills up and the EventProcessor.Serve routine blocks on the send, with no way to recover.

Again, this happens because the worker is about to be removed and will never receive from that channel again.

With the EventProcessor.Serve routine blocked, the main routine that generates events will ultimately hang as well, because the ep.incoming channel (buffer size 1024) will sooner or later fill up.

from top to bottom:

event generated by the main routine
        |
        |  ep.incoming channel (1024)
        |
EventProcessor.Serve routine
        |
        |  ew.incoming channel (16)
        |
worker routine
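
To see the cascade outside eCapture, here is a minimal, self-contained model of the pipeline above (the buffer sizes match the description; the names are stand-ins, not the real types). The worker has already retired, so the dispatcher wedges on the 16-slot channel, and once the 1024-slot channel fills up the producer wedges too:

package main

import (
	"fmt"
	"time"
)

func main() {
	incoming := make(chan int, 1024) // stand-in for ep.incoming
	workerCh := make(chan int, 16)   // stand-in for ew.incoming

	// The worker has already decided to retire: nothing ever reads workerCh.
	// (In eCapture, this is the goroutine sitting just before ew.Close().)

	// Dispatcher: stand-in for the EventProcessor.Serve routine.
	go func() {
		for ev := range incoming {
			workerCh <- ev // blocks forever once the 16 slots are full
		}
	}()

	// Producer: stand-in for the main routine that generates events.
	for i := 0; ; i++ {
		select {
		case incoming <- i:
			if i%200 == 0 {
				fmt.Println("produced", i)
			}
		case <-time.After(2 * time.Second):
			// Once the dispatcher is stuck, roughly 1024+16 events are absorbed
			// by the two buffers and then the producer hangs here as well.
			fmt.Println("producer blocked after", i, "events: the pipeline is wedged")
			return
		}
	}
}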

Root Cause
The hang comes down to the fact that the worker returned by getWorkerByUUID() may already be retired (i.e., it will never read from its channel again), yet events are still sent to it. This should be fixed.
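
One possible direction (illustrative only, and not necessarily the fix that was eventually applied) is to make the hand-off to a worker non-blocking: if the send cannot proceed, treat the worker as retired and replace it. A self-contained sketch with stand-in types:

package main

import "fmt"

type eventWorker struct {
	incoming chan string
}

// dispatch is a hypothetical helper, not eCapture's API: it never blocks on a
// worker's channel, so a retired worker cannot wedge the caller.
func dispatch(workers map[string]*eventWorker, uuid, ev string) {
	if ew, ok := workers[uuid]; ok {
		select {
		case ew.incoming <- ev:
			return // normal path: the worker is alive and has buffer space
		default:
			// The worker is retired or hopelessly backed up; fall through and replace it.
		}
	}
	fresh := &eventWorker{incoming: make(chan string, 16)}
	workers[uuid] = fresh
	fresh.incoming <- ev // always succeeds: the new channel is empty
	// A real fix would also start the worker goroutine and protect the map
	// with a mutex; both are omitted to keep the sketch short.
}

func main() {
	workers := map[string]*eventWorker{
		// an unbuffered channel that nothing reads: behaves like a retired worker
		"uuid-1": {incoming: make(chan string)},
	}
	dispatch(workers, "uuid-1", "payload")
	fmt.Println("dispatch returned instead of hanging; buffered events:", len(workers["uuid-1"].incoming))
}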

cfc4n added the good first issue, improve, and question labels on Mar 3, 2024
ruitianzhong added a commit to ruitianzhong/ecapture that referenced this issue Mar 3, 2024
Signed-off-by: ruitianzhong <ruitian-zhong@outlook.com>
cfc4n (Member) commented on Mar 3, 2024

Please wait a moment, let me understand the issue.

cfc4n closed this as completed in 144c89a on Mar 6, 2024