Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slamming Pub/Sub with publishJSON requests yields google-gax timeout errors #885

Closed
willbattel opened this issue Feb 12, 2020 · 13 comments
Closed
Assignees
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@willbattel
Copy link

I've noticed numerous other related issues, but followed the recommendation here #770 (comment) to post a new issue in isolation for posterity, just in case.

Environment details

  • OS: Google App Engine Standard
  • Node.js version: ^10
  • npm version: N/A
  • @google-cloud/pubsub version: 1.5.0

Steps to reproduce

We have a utility service, run only occasionally on demand of our data science team, where we are streaming documents from Firestore, and publishing each document received to a Pub/Sub topic in a different GCP project. The first few thousand messages are published without issue, but at some arbitrary point, about 100 seconds into slamming Pub/Sub with these requests, we repeatedly encounter the following error, after which the App Engine instance is unrecoverable.

Error: Retry total timeout exceeded before any response was received 
    at repeat (/srv/node_modules/@google-cloud/pubsub/node_modules/google-gax/build/src/normalCalls/retries.js:65:31) 
    at Timeout.setTimeout [as _onTimeout] (/srv/node_modules/@google-cloud/pubsub/node_modules/google-gax/build/src/normalCalls/retries.js:100:25) 
    at ontimeout (timers.js:436:11) 
    at tryOnTimeout (timers.js:300:5) 
    at listOnTimeout (timers.js:263:5) 
    at Timer.processTimers (timers.js:223:10) code: 4 } } reason: { Error: Retry total timeout exceeded before any response was received 
  at repeat (/srv/node_modules/@google-cloud/pubsub/node_modules/google-gax/build/src/normalCalls/retries.js:65:31) 
  at Timeout.setTimeout [as _onTimeout] (/srv/node_modules/@google-cloud/pubsub/node_modules/google-gax/build/src/normalCalls/retries.js:100:25) 
  at ontimeout (timers.js:436:11) 
  at tryOnTimeout (timers.js:300:5) 
  at listOnTimeout (timers.js:263:5) 
  at Timer.processTimers (timers.js:223:10) code: 4

I have simplified+obfuscated our code below.


'use strict'

const express = require('express')

const { Firestore } = require('@google-cloud/firestore')
const { PubSub } = require('@google-cloud/pubsub')

const firestore = new Firestore()
const pubsub = new PubSub()

const app = express()

app.get("/", (req, res) => {
    const promises = []
    firestore.collection("someCollection").stream().on('data', (doc) => {
        const promise = new Promise(async (resolve, reject) => {

            // ...
            // data validation/sanitation
            // ...

            pubsub.topic('projects/some-project/topics/some-topic').publishJSON({
                fieldA: data.fieldA,
                fieldB: data.fieldB,
                fieldC: data.fieldC
            }, {
                foo: doc.id,
                bar: 'baz'
            }, (error, messageId) => {
                if (error || !messageId) { reject(error) }
                else resolve(messageId)
            })

        })

        promises.push(promise)
    }).on('end', () => {
        console.log("Received all documents.")

        Promise.all(promises)
        .catch((error) => {
            console.error(`Failed to publish one or more messages. ${JSON.stringify(error, null, 4)}`)
        })
        .finally(() => {
            console.log("Finished dumping collection.")
            res.status(200).send('OK').end()
        })
    })
})

// Start the server
const PORT = process.env.PORT || 8080
app.listen(PORT, () => {
    console.log(`App listening on port ${PORT}`)
})

module.exports = app

Additionally, here is our package.json dependency list including dependencies used but not shown in above code.

"@google-cloud/firestore": "3.5.0",
"@google-cloud/pubsub": "1.5.0",
"@google-cloud/storage": "4.3.1",
"express": "4.17.1",
"lodash": "4.17.15"

Please let me know if I can provide further information useful for finding the problem+solution. Thanks!

@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Feb 12, 2020
@feywind
Copy link
Collaborator

feywind commented Feb 12, 2020

@willbattel Thanks for posting this! I'm going to link this to the meta-issue I've been building in regards to things that be grpc-js issues. I don't guess you've tried with the C++ bindings?

@feywind feywind added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Feb 12, 2020
@willbattel
Copy link
Author

willbattel commented Feb 12, 2020

@feywind just tried it now. Following the grpc technique shown here #770 (comment), we're curiously getting the same error in Stackdriver but the process survives noticeably longer (4 minutes) before tripping over itself and dying.

Looking at the Pub/Sub console, it seems significantly more messages did get published than before changing to grpc. Looks like at least twice as many messages were successfully sent than before. The console shows a max Publish message request count of 85.68/second before it died, which is the highest we've ever seen it get to.

EDIT: Running it again yielded a different error- not sure what happened the first time but here is the stack trace when using grpc

Error: 4 DEADLINE_EXCEEDED: Deadline exceeded 
    at Object.callErrorFromStatus (/srv/node_modules/@grpc/grpc-js/build/src/call.js:30:26) 
    at Http2CallStream.call.on (/srv/node_modules/@grpc/grpc-js/build/src/client.js:96:33) 
    at Http2CallStream.emit (events.js:203:15) 
    at process.nextTick (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:100:22) 
    at process._tickCallback (internal/process/next_tick.js:61:11) 
  code: 4, 
  details: 'Deadline exceeded', 
  metadata: Metadata { internalRepr: Map {}, options: {} }

@willbattel
Copy link
Author

willbattel commented Feb 12, 2020

@feywind upon further inspection, it looks like almost all, if not all, documents did get sent over to the receiving subscriber. I'm still not sure why we see the errors in Stackdriver (see EDIT in previous comment), but if those errors hadn't appeared it would have seemed to work without issue when using grpc. We got about 3,000 of those errors, and about 35,000 documents were streamed from Firestore.

@feywind
Copy link
Collaborator

feywind commented Feb 28, 2020

Going through the issue list again here. Is this something that's still going on for you? If so, I can see if someone on the grpc team has any ideas.

@willbattel
Copy link
Author

I believe so, but let me take a closer look again this weekend and get back to you with more specifics.

@ordinz
Copy link

ordinz commented Feb 28, 2020

I am receiving this message now simply doing one call with .publishJSON

Error: Retry total timeout exceeded before any response was received
    at repeat (/srv/node_modules/google-gax/build/src/normalCalls/retries.js:65:31)
    at Timeout.setTimeout [as _onTimeout] (/srv/node_modules/google-gax/build/src/normalCalls/retries.js:100:25)
    at ontimeout (timers.js:498:11)
    at tryOnTimeout (timers.js:323:5)
    at Timer.listOnTimeout (timers.js:290:5) code: 4 } my_topic {"my": "data"}" 

@feywind
Copy link
Collaborator

feywind commented May 5, 2020

Can anyone who's had this problem go ahead and try it with the latest pubsub library version? I notice that gRPC C++ was mentioned above, grpc-js has seen a lot of stability improvements.

I'm going to see if someone who handles google-gax can look.

@willbattel
Copy link
Author

@feywind I'll revisit this as soon as I can- likely the week after next.

@meredithslota meredithslota added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jun 22, 2020
@sameerRsystems
Copy link

I am also facing same problem. Did you guys found fix?

@ordinz
Copy link

ordinz commented Jul 13, 2020 via email

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Aug 10, 2020
@feywind
Copy link
Collaborator

feywind commented Aug 27, 2020

As of 2.5.0, we're explicitly pulling in a new gax (and grpc-js) here. Did anyone see any improvement from that? There was also a microgenerator bug having to do with batching support.

@willbattel
Copy link
Author

We're not currently using the program that produced the errors, so I cannot comment on any improvements unfortunately. If we do restore that service I'll take a look.

In any case, thanks for looking into this!

@feywind
Copy link
Collaborator

feywind commented Sep 3, 2020

It kind of sounds like everyone who was having problems has moved off to other solutions. That's a bummer, but also I'm glad you're not blocked! If anyone still ends up needing help or has more comments on it, please feel free to re-open.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

6 participants