Skip to content

Response unpaused but never calls outgoing block hook #175

@dirkmc

Description

@dirkmc

It seems that under heavy load, a Graphsync response to a request for data will sometimes get stuck in a paused state.

When I run the new lotus concurrent deals test TestDealCyclesConcurrent in the itest kit with concurrency of 32, the test consistently hangs.
It appears that all the retrieval deals complete except for one or two, and the logs for these incomplete deals indicate that the problem is in Graphsync's pausing mechanism.
The branches I used to produce this state are:

  • lotus exp/rebase-itests-refactor-kit
    Includes the new TestDealCyclesConcurrent test
  • go-fil-markets dirkmc/log-validate-pull
    Adds a log line that connects the deal ID to the channel ID so we can match logs at the go-fil-markets level to the go-data-transfer level
  • go-data-transfer dirkmc/log-outgoing-block
    Adds some logging in the Graphsync transport so we can see if the sent / outgoing block hooks are called

I run the concurrent deals test in lotus:

go test ./itests -run TestDealCyclesConcurrent -v

I grep the output for each deal ID until I find the deal that hangs. Then I look for the log line that connects the deal ID to the data transfer channel ID:

2021-06-21T15:03:29.259+0200	INFO	markets-rtvl-reval	requestvalidation/requestvalidation.go:78	validate pull for deal 1624280585293275020 for data transfer channel 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052

I use that to grep the logs for both the deal ID and the channel ID so as to get all relevant logs:

grep "1624280585293275020\|1624280585292810052" /tmp/out.txt
 
2021-06-21T15:03:29.128+0200	INFO	markets	loggers/loggers.go:25	retrieval client event	{"name": "ClientEventOpen", "deal ID": "1624280585293275020", "state": "DealStatusNew", "message": ""}
2021-06-21T15:03:29.259+0200	INFO	dt-impl	impl/events.go:19	channel 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052: opened
                                                                                                                                                             ?V)????u,} to channel ID 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052
2021-06-21T15:03:29.259+0200	INFO	markets	loggers/loggers.go:25	retrieval client event	{"name": "ClientEventDealProposed", "deal ID": "1624280585293275020", "state": "DealStatusWaitForAcceptance", "message": ""}
2021-06-21T15:03:29.259+0200	INFO	dt_graphsync	graphsync/graphsync.go:476	12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052, pull request queued, req=&{BCid:bafk2bzacedr7tnpc62gvo2wcaxswfab4yduzmkv44kodcxbnpxrhaqhdbidk6 Type:0 Paus:false Part:false Pull:true Stor:0xc01035b758 Vouch:0xc01035b788 VTyp:RetrievalDealProposal/1 XferID:1624280585292810052 RestartChannel:--0}
2021-06-21T15:03:29.259+0200	DEBUG	dt_graphsync	graphsync/graphsync.go:520	12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052: received request for data (pull)
2021-06-21T15:03:29.259+0200	INFO	dt-impl	impl/events.go:300	channel 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052: received new channel request from 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h
2021-06-21T15:03:29.259+0200	INFO	markets-rtvl-reval	requestvalidation/requestvalidation.go:78	validate pull for deal 1624280585293275020 for data transfer channel 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052
2021-06-21T15:03:29.390+0200	INFO	markets	loggers/loggers.go:30	retrieval provider event	{"name": "ProviderEventOpen", "deal ID": "1624280585293275020", "receiver": "12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h", "state": "DealStatusNew", "message": ""}
2021-06-21T15:03:29.668+0200	INFO	markets	loggers/loggers.go:30	retrieval provider event	{"name": "ProviderEventDealAccepted", "deal ID": "1624280585293275020", "receiver": "12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h", "state": "DealStatusUnsealing", "message": ""}
2021-06-21T15:03:29.736+0200	DEBUG	dt_graphsync	graphsync/graphsync.go:570	12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052: pausing graphsync response
2021-06-21T15:03:29.736+0200	DEBUG	dt_graphsync	graphsync/graphsync.go:969	12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052: received request for data
                                                                                                                                                             ?V)????u,} to channel ID 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052
2021-06-21T15:03:29.741+0200	INFO	markets	loggers/loggers.go:30	retrieval provider event	{"name": "ProviderEventUnsealComplete", "deal ID": "1624280585293275020", "receiver": "12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h", "state": "DealStatusUnsealed", "message": ""}
2021-06-21T15:03:29.805+0200	INFO	dt-impl	impl/impl.go:391	resume channel 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052
2021-06-21T15:03:29.805+0200	DEBUG	dt_graphsync	graphsync/graphsync.go:1055	12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052: unpausing response
2021-06-21T15:03:29.805+0200	WARN	dt-impl	impl/impl.go:400	Error attempting to resume 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052 at transport level: request is not paused
2021-06-21T15:03:29.998+0200	INFO	dt-impl	impl/events.go:188	channel 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052: received new response, accepting channel
2021-06-21T15:03:29.999+0200	INFO	markets	loggers/loggers.go:25	retrieval client event	{"name": "ClientEventDealAccepted", "deal ID": "1624280585293275020", "state": "DealStatusAccepted", "message": ""}
2021-06-21T15:03:31.318+0200	INFO	markets	loggers/loggers.go:25	retrieval client event	{"name": "ClientEventPaymentChannelAddingFunds", "deal ID": "1624280585293275020", "state": "DealStatusPaymentChannelAddingInitialFunds", "message": ""}
2021-06-21T15:03:32.511+0200	INFO	markets	loggers/loggers.go:25	retrieval client event	{"name": "ClientEventPaymentChannelReady", "deal ID": "1624280585293275020", "state": "DealStatusPaymentChannelAllocatingLane", "message": ""}
2021-06-21T15:03:32.725+0200	INFO	markets	loggers/loggers.go:25	retrieval client event	{"name": "ClientEventLaneAllocated", "deal ID": "1624280585293275020", "state": "DealStatusOngoing", "message": ""}

Normally after ProviderEventUnsealComplete there would be a ProviderEventBlockSent event.

This log line indicates that hookActions.PauseResponse() is called when a new request for data arrives:

2021-06-21T15:03:29.736+0200	DEBUG	dt_graphsync	graphsync/graphsync.go:570	12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052: pausing graphsync response

These log lines indicate that Graphsync.UnpauseResponse is called but that Graphsync does not believe the response is paused:

2021-06-21T15:03:29.805+0200	DEBUG	dt_graphsync	graphsync/graphsync.go:1055	12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052: unpausing response
2021-06-21T15:03:29.805+0200	WARN	dt-impl	impl/impl.go:400	Error attempting to resume 12D3KooWCGuvzoKipy5Y6DRGeecSH57R2hzS7xtrhFELef7CAY9h-12D3KooWLqudbUu45uQcJCh1eMk7GeH1ZesUGyUqqcpGLapSXxGs-1624280585292810052 at transport level: request is not paused

Therefore the Graphsync transfer remains in the paused state, the outgoing block hook is never called (see the log lines that would be output if it were called) and ProviderEventBlockSent is never fired.

Metadata

Metadata

Assignees

No one assigned

    Labels

    need/triageNeeds initial labeling and prioritization

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions