This socket has been ended by the other party #4

shnhrrsn · 2018-01-09T09:43:51Z

Getting this during the heartbeat for long(ish) running workers:

1/9/2018 4:42:21 AM{ Error: This socket has been ended by the other party
1/9/2018 4:42:21 AM    at Socket.writeAfterFIN [as write] (net.js:351:12)
1/9/2018 4:42:21 AM    at Promise (/apps/api/node_modules/faktory-client/lib/client.js:164:19)
1/9/2018 4:42:21 AM    at new Promise (<anonymous>)
1/9/2018 4:42:21 AM    at Client.send (/apps/api/node_modules/faktory-client/lib/client.js:160:12)
1/9/2018 4:42:21 AM    at Client.beat (/apps/api/node_modules/faktory-client/lib/client.js:223:17)
1/9/2018 4:42:21 AM    at withConnection.c (/apps/api/node_modules/faktory-worker/lib/manager.js:112:34)
1/9/2018 4:42:21 AM    at Manager.withConnection (/apps/api/node_modules/faktory-worker/lib/manager.js:46:14)
1/9/2018 4:42:21 AM    at <anonymous> code: 'EPIPE' }

The text was updated successfully, but these errors were encountered:

jbielick · 2018-01-11T20:52:36Z

Do you know how long the process was running when this happened?
How about the number of jobs processed—or were they mostly idle?

shnhrrsn · 2018-01-15T13:56:55Z

~500k jobs processed across about 5 different workers, timed out after 12 hours or so I believe (I didn't discover until after the fact). Based on the logs, it looked like it may have been idle for a little while before it timed out.

lypanov · 2018-01-17T13:38:24Z

Seeing the same issue here also.

jbielick · 2018-01-17T19:24:08Z

This is helpful, thanks for the info.

Do you have logging or debug enabled for the workers? You don't have any heartbeat log statements do you? That would also be helpful if they were available.

I'll take a look into this.

jbielick · 2018-01-21T16:00:39Z

Did the node process die when this happened or was it just hanging around?

jbielick · 2018-02-06T00:57:34Z

I added some fixes to gracefully reconnect and continue working on connection interruptions. Could you try v0.7.0?
It's now published to npm.

31184d6

Please reopen if this persists.

emhagman · 2020-02-14T18:45:49Z

I am getting this error in 3.3.2.

Error: This socket has been ended by the other party
    at Socket.writeAfterFIN [as write] (net.js:447:14)
    at /app/node_modules/faktory-worker/lib/connection.js:165:19
    at executorPatched (/app/node_modules/cls-bluebird/lib/shimOnCancel.js:37:21)
    at Promise._execute (/app/node_modules/bluebird/js/release/debuggability.js:313:9)
    at Promise._resolveFromExecutor (/app/node_modules/bluebird/js/release/promise.js:488:18)
    at Promise._resolveFromExecutor (/app/node_modules/cls-bluebird/lib/shimOnCancel.js:41:40)
    at new Promise (/app/node_modules/bluebird/js/release/promise.js:79:10)
    at Connection.send (/app/node_modules/faktory-worker/lib/connection.js:164:12)
    at /app/node_modules/faktory-worker/lib/client.js:163:39
    at /app/node_modules/generic-pool/lib/Pool.js:469:14
    at clsBind (/app/node_modules/cls-hooked/context.js:172:17)
    at /app/node_modules/dd-trace/packages/dd-trace/src/plugins/util/promise.js:28:23
    at Scope._activate (/app/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:40:14)
    at Scope.activate (/app/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:11:17)
    at /app/node_modules/dd-trace/packages/dd-trace/src/plugins/util/promise.js:27:27
    at tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/app/node_modules/bluebird/js/release/promise.js:517:31)
    at Promise._settlePromise (/app/node_modules/bluebird/js/release/promise.js:574:18)
    at Promise._fulfillPromises (/app/node_modules/bluebird/js/release/promise.js:674:14)
    at Promise._settlePromises (/app/node_modules/bluebird/js/release/promise.js:700:18)
    at _drainQueueStep (/app/node_modules/bluebird/js/release/async.js:138:12)
    at _drainQueue (/app/node_modules/bluebird/js/release/async.js:131:9)
    at Async._drainQueues (/app/node_modules/bluebird/js/release/async.js:147:5)
    at Immediate.Async.drainQueues (/app/node_modules/bluebird/js/release/async.js:17:14)
    at processImmediate (internal/timers.js:439:21)
    at process.topLevelDomainCallback (domain.js:130:23) {
  code: 'EPIPE'
}

It is happening during a job that takes a decent amount of time to finish. At the same time, Faktory is complaining about no heartbeat. I assume this means my task is blocking the event loop and preventing the heartbeat from working. Regardless, this shouldn't throw an error and crash the process.

Here are my Faktory logs around that time:

faktory_1  | E 2020-02-14T18:35:00.215Z error=read tcp 172.19.0.2:7419->10.1.0.88:55022: use of closed network connection Unexpected socket error
faktory_1  | W 2020-02-14T18:35:00.215Z Reaped 1 lingering connections, this is a sign your workers are having problems
faktory_1  | W 2020-02-14T18:35:00.215Z All worker processes should send a heartbeat every 15 seconds

Line in question: https://github.com/jbielick/faktory_worker_node/blob/master/lib/connection.js#L165

Trying to write to a socket that has been closed.

jbielick · 2020-02-15T01:21:09Z

@emhagman so the server is closing the connection in the midst of this library trying to use that socket?

emhagman · 2020-02-18T17:11:41Z

Correct, Faktory is closing the connection and then the lib is trying to call this.socket.write on a closed connection, which crashes the server.

It is definitely being removed from Faktory because the task that is running is blocking the event loop and the worker cannot send the heartbeat to Faktory in time (every 15s or it is removed).

I understand the ideal scenario is to not block the event loop and I have fixed that issue but if there is non-performant code, it would be ideal if the server doesn't crash if the connection is reaped by Faktory as there could be other jobs running on the worker.

jbielick · 2020-02-19T00:34:23Z

How often does this occur?

We're using a connection pool in this library.

Before a client sends any command, a connection is checked out from the pool

faktory_worker_node/lib/client.js

Lines 162 to 164 in 69f6f7f

    
           send(command) { 
        
             return this.pool.use(conn => conn.send(command)); 
        
           }

.

Before a connection can be checked out from the pool, it is validated that it is connected.

faktory_worker_node/lib/client.js

Line 58 in 69f6f7f

testOnBorrow: true,

and

faktory_worker_node/lib/connection-factory.js

Lines 60 to 62 in 69f6f7f

    
           validate(connection) { 
        
             return connection.connected; 
        
           }

Before digging further (and I'll probably need some DEBUG=faktory* logs from you) I'm trying to reason about how this error would occur.

Is it possible that this is occurring so fast (or the event loop is blocked) that the connection is being used before the "onClose" callback is able to be called for it? That's the only guess I have. The event loop being blocked is sort of a fatal issue IMO. I'm not sure there's a lot of reasonable expectations to be made under those conditions. Writing to the socket would be a synchronous execution, while waiting for a response or handling the "socket closed" event would be async.

jbielick self-assigned this Jan 11, 2018

jbielick added the faktory-client label Jan 18, 2018

jbielick closed this as completed Feb 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This socket has been ended by the other party #4

This socket has been ended by the other party #4

shnhrrsn commented Jan 9, 2018

jbielick commented Jan 11, 2018

shnhrrsn commented Jan 15, 2018 •

edited

lypanov commented Jan 17, 2018

jbielick commented Jan 17, 2018

jbielick commented Jan 21, 2018

jbielick commented Feb 6, 2018

emhagman commented Feb 14, 2020 •

edited

jbielick commented Feb 15, 2020

emhagman commented Feb 18, 2020 •

edited

jbielick commented Feb 19, 2020

This socket has been ended by the other party #4

This socket has been ended by the other party #4

Comments

shnhrrsn commented Jan 9, 2018

jbielick commented Jan 11, 2018

shnhrrsn commented Jan 15, 2018 • edited

lypanov commented Jan 17, 2018

jbielick commented Jan 17, 2018

jbielick commented Jan 21, 2018

jbielick commented Feb 6, 2018

emhagman commented Feb 14, 2020 • edited

jbielick commented Feb 15, 2020

emhagman commented Feb 18, 2020 • edited

jbielick commented Feb 19, 2020

shnhrrsn commented Jan 15, 2018 •

edited

emhagman commented Feb 14, 2020 •

edited

emhagman commented Feb 18, 2020 •

edited