Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

This socket has been ended by the other party #4

Closed
shnhrrsn opened this issue Jan 9, 2018 · 10 comments
Closed

This socket has been ended by the other party #4

shnhrrsn opened this issue Jan 9, 2018 · 10 comments
Assignees

Comments

@shnhrrsn
Copy link

shnhrrsn commented Jan 9, 2018

Getting this during the heartbeat for long(ish) running workers:

1/9/2018 4:42:21 AM{ Error: This socket has been ended by the other party
1/9/2018 4:42:21 AM    at Socket.writeAfterFIN [as write] (net.js:351:12)
1/9/2018 4:42:21 AM    at Promise (/apps/api/node_modules/faktory-client/lib/client.js:164:19)
1/9/2018 4:42:21 AM    at new Promise (<anonymous>)
1/9/2018 4:42:21 AM    at Client.send (/apps/api/node_modules/faktory-client/lib/client.js:160:12)
1/9/2018 4:42:21 AM    at Client.beat (/apps/api/node_modules/faktory-client/lib/client.js:223:17)
1/9/2018 4:42:21 AM    at withConnection.c (/apps/api/node_modules/faktory-worker/lib/manager.js:112:34)
1/9/2018 4:42:21 AM    at Manager.withConnection (/apps/api/node_modules/faktory-worker/lib/manager.js:46:14)
1/9/2018 4:42:21 AM    at <anonymous> code: 'EPIPE' }
@jbielick jbielick self-assigned this Jan 11, 2018
@jbielick
Copy link
Owner

Do you know how long the process was running when this happened?
How about the number of jobs processed—or were they mostly idle?

@shnhrrsn
Copy link
Author

shnhrrsn commented Jan 15, 2018

~500k jobs processed across about 5 different workers, timed out after 12 hours or so I believe (I didn't discover until after the fact). Based on the logs, it looked like it may have been idle for a little while before it timed out.

@lypanov
Copy link

lypanov commented Jan 17, 2018

Seeing the same issue here also.

@jbielick
Copy link
Owner

This is helpful, thanks for the info.

Do you have logging or debug enabled for the workers? You don't have any heartbeat log statements do you? That would also be helpful if they were available.

I'll take a look into this.

@jbielick
Copy link
Owner

Did the node process die when this happened or was it just hanging around?

@jbielick
Copy link
Owner

jbielick commented Feb 6, 2018

I added some fixes to gracefully reconnect and continue working on connection interruptions. Could you try v0.7.0?
It's now published to npm.

31184d6

Please reopen if this persists.

@jbielick jbielick closed this as completed Feb 6, 2018
@emhagman
Copy link
Contributor

emhagman commented Feb 14, 2020

I am getting this error in 3.3.2.

Error: This socket has been ended by the other party
    at Socket.writeAfterFIN [as write] (net.js:447:14)
    at /app/node_modules/faktory-worker/lib/connection.js:165:19
    at executorPatched (/app/node_modules/cls-bluebird/lib/shimOnCancel.js:37:21)
    at Promise._execute (/app/node_modules/bluebird/js/release/debuggability.js:313:9)
    at Promise._resolveFromExecutor (/app/node_modules/bluebird/js/release/promise.js:488:18)
    at Promise._resolveFromExecutor (/app/node_modules/cls-bluebird/lib/shimOnCancel.js:41:40)
    at new Promise (/app/node_modules/bluebird/js/release/promise.js:79:10)
    at Connection.send (/app/node_modules/faktory-worker/lib/connection.js:164:12)
    at /app/node_modules/faktory-worker/lib/client.js:163:39
    at /app/node_modules/generic-pool/lib/Pool.js:469:14
    at clsBind (/app/node_modules/cls-hooked/context.js:172:17)
    at /app/node_modules/dd-trace/packages/dd-trace/src/plugins/util/promise.js:28:23
    at Scope._activate (/app/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:40:14)
    at Scope.activate (/app/node_modules/dd-trace/packages/dd-trace/src/scope/base.js:11:17)
    at /app/node_modules/dd-trace/packages/dd-trace/src/plugins/util/promise.js:27:27
    at tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/app/node_modules/bluebird/js/release/promise.js:517:31)
    at Promise._settlePromise (/app/node_modules/bluebird/js/release/promise.js:574:18)
    at Promise._fulfillPromises (/app/node_modules/bluebird/js/release/promise.js:674:14)
    at Promise._settlePromises (/app/node_modules/bluebird/js/release/promise.js:700:18)
    at _drainQueueStep (/app/node_modules/bluebird/js/release/async.js:138:12)
    at _drainQueue (/app/node_modules/bluebird/js/release/async.js:131:9)
    at Async._drainQueues (/app/node_modules/bluebird/js/release/async.js:147:5)
    at Immediate.Async.drainQueues (/app/node_modules/bluebird/js/release/async.js:17:14)
    at processImmediate (internal/timers.js:439:21)
    at process.topLevelDomainCallback (domain.js:130:23) {
  code: 'EPIPE'
}

It is happening during a job that takes a decent amount of time to finish. At the same time, Faktory is complaining about no heartbeat. I assume this means my task is blocking the event loop and preventing the heartbeat from working. Regardless, this shouldn't throw an error and crash the process.

Here are my Faktory logs around that time:

faktory_1  | E 2020-02-14T18:35:00.215Z error=read tcp 172.19.0.2:7419->10.1.0.88:55022: use of closed network connection Unexpected socket error
faktory_1  | W 2020-02-14T18:35:00.215Z Reaped 1 lingering connections, this is a sign your workers are having problems
faktory_1  | W 2020-02-14T18:35:00.215Z All worker processes should send a heartbeat every 15 seconds

Line in question: https://github.com/jbielick/faktory_worker_node/blob/master/lib/connection.js#L165

Trying to write to a socket that has been closed.

@jbielick
Copy link
Owner

@emhagman so the server is closing the connection in the midst of this library trying to use that socket?

@emhagman
Copy link
Contributor

emhagman commented Feb 18, 2020

Correct, Faktory is closing the connection and then the lib is trying to call this.socket.write on a closed connection, which crashes the server.

It is definitely being removed from Faktory because the task that is running is blocking the event loop and the worker cannot send the heartbeat to Faktory in time (every 15s or it is removed).

I understand the ideal scenario is to not block the event loop and I have fixed that issue but if there is non-performant code, it would be ideal if the server doesn't crash if the connection is reaped by Faktory as there could be other jobs running on the worker.

@jbielick
Copy link
Owner

How often does this occur?

We're using a connection pool in this library.

Before a client sends any command, a connection is checked out from the pool

send(command) {
return this.pool.use(conn => conn.send(command));
}
.

Before a connection can be checked out from the pool, it is validated that it is connected.

testOnBorrow: true,

and

validate(connection) {
return connection.connected;
}

Before digging further (and I'll probably need some DEBUG=faktory* logs from you) I'm trying to reason about how this error would occur.

Is it possible that this is occurring so fast (or the event loop is blocked) that the connection is being used before the "onClose" callback is able to be called for it? That's the only guess I have. The event loop being blocked is sort of a fatal issue IMO. I'm not sure there's a lot of reasonable expectations to be made under those conditions. Writing to the socket would be a synchronous execution, while waiting for a response or handling the "socket closed" event would be async.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants