
bigquery: AppendRows eventually deadlocks / high rate of context deadline exceeded #9660

Closed
benjaminRomano opened this issue Mar 28, 2024 · 5 comments

@benjaminRomano

Client
BigQuery v1.60.0

Environment
GKE

Go Environment
go v1.22

Code

// Stream creation
featureSetStream, err := writeClient.NewManagedStream(ctx,
   managedwriter.WithDestinationTable(conf.FeatureSetTableID),
   managedwriter.WithType(managedwriter.CommittedStream),
   managedwriter.WithSchemaDescriptor(featureProtoDescriptor),
   managedwriter.EnableWriteRetries(true),
)

// We poll from a Pub/Sub queue and do some processing that generates `FeatureSet` protos.

// Convert the protos to serialized rows (~100-200 quite small rows produced)
var items [][]byte
for _, featureSet := range results.FeatureSets {
    item := &inferno.FeatureBQ{
        TraceId:      traceID,
        Properties:   featureSet.Properties,
        Features:     featureSet.Features,
        FeatureQuery: featureSet.FeatureQuery,
        TraceIdStr:   ptr.String(id_generator.ToString(traceID)),
    }

    serialized, err := proto.Marshal(item)
    if err != nil {
        return err
    }

    items = append(items, serialized)
}

// Append the rows to BigQuery.
//
// After a couple of hours this stream degrades and we consistently hit
// context deadline timeouts (the context deadline is ~3 minutes).
_, err := featureSetStream.AppendRows(ctx, items)

Expected behavior

We'd expect the AppendRows call to complete quickly and not eventually get stuck.

Actual behavior

AppendRows degrades and gets stuck after a couple of hours, and we need to restart the process to recover.

Screenshots
Below is a screenshot of the latency of the AppendRows call.

[Screenshot: latency of the AppendRows call over time]

Other Context

In the interim, we are restarting the process when we get alerted on this issue.

What's interesting is that this process writes to another BigQuery stream in parallel and never hits this degradation. There should be a 1:1 relationship between the AppendRows calls made to each stream: for each message processed from Pub/Sub, we write to both streams. The number of serialized rows appended per AppendRows call on that other stream is substantially larger (~1k to 5k rows), though each row is slightly smaller.

I suspect there is a remaining deadlock somewhere within the managedwriter. I have Google Cloud Profiler enabled for this service, but I didn't see anything noticeable from the CPU / threads profile view.

Another thing I tried was switching from a committed stream to the default stream, but that had no impact.

Let me know if there are any debug logs I can provide @shollyman

@benjaminRomano benjaminRomano added the triage me I really want to be triaged. label Mar 28, 2024
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the BigQuery API. label Mar 28, 2024
@benjaminRomano
Author

Another thing I noticed is that we have a high rate of EOF errors on the connection, which is what led to the detection of this prior issue: #9540

[Screenshot: rate of EOF errors on the connection]

@codyoss codyoss added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Apr 9, 2024
@shollyman
Contributor

Starting to look at this.

EOF on send should trigger us to mark the connection as unhealthy, so the next send that uses the connection should cause a reconnect. Possibly we're doing something unusual with a received response? Are there any other noteworthy events leading up to the spike in latency?

One way to get more information here might be to add a gRPC streaming interceptor that logs more detail about the stream interactions. I've got some old code that does this, but it likely needs to be refreshed. I'll work on that next.
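
For anyone following along, here's a rough sketch of what that kind of interceptor could look like, wired into the managedwriter client via a gRPC dial option. The names (logStreamInterceptor, newClientWithLogging) and the logging policy are just illustrative, not the refreshed debugging code mentioned above:

package bqdebug

import (
    "context"
    "log"

    "cloud.google.com/go/bigquery/storage/managedwriter"
    "google.golang.org/api/option"
    "google.golang.org/grpc"
)

// loggingStream wraps a grpc.ClientStream and logs send/receive errors.
type loggingStream struct {
    grpc.ClientStream
    method string
}

func (s *loggingStream) SendMsg(m interface{}) error {
    err := s.ClientStream.SendMsg(m)
    if err != nil {
        log.Printf("grpc send on %s failed: %v", s.method, err)
    }
    return err
}

func (s *loggingStream) RecvMsg(m interface{}) error {
    err := s.ClientStream.RecvMsg(m)
    if err != nil {
        log.Printf("grpc recv on %s failed: %v", s.method, err)
    }
    return err
}

// logStreamInterceptor logs stream creation and wraps the resulting stream
// so SendMsg/RecvMsg errors on the AppendRows bidi stream are visible.
func logStreamInterceptor(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn, method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {
    log.Printf("opening grpc stream: %s", method)
    cs, err := streamer(ctx, desc, cc, method, opts...)
    if err != nil {
        log.Printf("opening %s failed: %v", method, err)
        return nil, err
    }
    return &loggingStream{ClientStream: cs, method: method}, nil
}

// newClientWithLogging creates a managedwriter client with the interceptor installed.
func newClientWithLogging(ctx context.Context, projectID string) (*managedwriter.Client, error) {
    return managedwriter.NewClient(ctx, projectID,
        option.WithGRPCDialOption(grpc.WithChainStreamInterceptor(logStreamInterceptor)))
}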

@benjaminRomano
Author

benjaminRomano commented Apr 10, 2024

Ah, I forgot to update this ticket. I root-caused this issue last week with GCP Support.

The issue was that we were sending a malformed proto descriptor and receiving the following error from upstream:

Invalid proto schema: BqMessage.proto: FeatureBQ.properties: ".snap.inferno.Property" is not defined. BqMessage.proto: FeatureBQ.features: ".snap.inferno.Feature" is not defined. Entity: projects/sc-inferno/datasets/inferno/tables/feature-sets/streams/Cic3ZDhiOWI4NC0wMDAwLTI2NmEtYjc0OC1mNDAzMDQ1ZDJiYjA6czQ

In our client code, we had the following:

// Example proto
m := &myprotopackage.MyCompiledMessage{}
descriptorProto := protodesc.ToDescriptorProto(m.ProtoReflect().Descriptor())

instead of this:

m := &myprotopackage.MyCompiledMessage{}
// Normalize the descriptor to collect nested messages, fix up oneOfs, etc.
descriptorProto, err := adapt.NormalizeDescriptor(m.ProtoReflect().Descriptor())

if err != nil {
	// TODO: Handle error.
}

The NormalizeDescriptor call was accidentally removed during a refactor.

Ultimately, we were sending batches of bad data to the backend, which got rejected and led to the EOF errors. We did not see the underlying bad-proto error because we were not calling result.GetResult(ctx), which would have returned the error message to us.
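
For reference, roughly what the missing check looks like against the snippet above (illustrative; the error-handling policy is up to the caller):

res, err := featureSetStream.AppendRows(ctx, items)
if err != nil {
    return err // send-path failure
}
// Block on the per-append result so server-side rejections
// (e.g. "Invalid proto schema") surface in the client instead of
// only showing up later as connection EOFs.
if _, err := res.GetResult(ctx); err != nil {
    return err
}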

Couple thoughts:

  • Can we detect malformed protos at registration time (WithSchemaDescriptor) instead of when we send the proto descriptor up to the server?
  • The BigQuery server logs should capture upload requests with malformed proto data. Looking at BigQuery logs in Stackdriver, we can see the server just logging OK. Ideally, the invalid proto schema message would be logged by the server.
  • At high error rates, there may be some unexpected deadlocking behavior. Resolving this is not a big deal, since it is just a symptom of a different problem.
  • Adding logs for BQ interactions makes sense; that could have helped us narrow this down. Since we weren't calling GetResult, we were dependent on GCP Support to send us the underlying BQ server error message.

@shollyman
Contributor

Thanks for the additional details.

Detecting malformed DescriptorProto messages at instantiation time would require us to mimic the same descriptor validation and building logic in both the clients and the service, which is problematic. That said, I've opened a potential FR (internally b/333755890) against the service to expose some kind of validation RPC that would allow us to check compatibility before sending data.

The managedwriter package currently produces OpenCensus metrics (I've got some feature work to enable the same on the OpenTelemetry side), so that's another avenue for monitoring ingestion health. The current metrics/views can be seen in the package docs: https://pkg.go.dev/cloud.google.com/go/bigquery/storage/managedwriter#pkg-variables
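
In case it helps others, a minimal sketch of registering a few of those views with OpenCensus (view names taken from the package variables in the docs linked above; pair this with whichever exporter you already run, e.g. Stackdriver or Prometheus):

import (
    "cloud.google.com/go/bigquery/storage/managedwriter"
    "go.opencensus.io/stats/view"
)

// registerManagedWriterViews registers a subset of the managedwriter views
// so append request and error counts are aggregated and can be exported.
func registerManagedWriterViews() error {
    return view.Register(
        managedwriter.AppendRequestsView,
        managedwriter.AppendRequestErrorsView,
        managedwriter.AppendResponseErrorsView,
        managedwriter.AppendClientOpenView,
    )
}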

@shollyman
Contributor

Given there are no immediately actionable elements for this issue, I'm going to go ahead and close it out at this time. However, if you have more feedback, please let me know here or via another issue.
