
bigquery: AppendRows eventually deadlocks / high rate of context deadline exceeded #9660

Closed
benjaminRomano opened this issue Mar 28, 2024 · 5 comments

@benjaminRomano

Client
BigQuery v1.60.0

Environment
GKE

Go Environment
go v1.22

Code

// Stream creation
featureSetStream, err := writeClient.NewManagedStream(ctx,
   managedwriter.WithDestinationTable(conf.FeatureSetTableID),
   managedwriter.WithType(managedwriter.CommittedStream),
   managedwriter.WithSchemaDescriptor(featureProtoDescriptor),
   managedwriter.EnableWriteRetries(true),
)

// We poll from a Pub/Sub queue and do some processing that generates `FeatureSet` protos.

// Convert the protos to serialized rows (~100-200 quite small rows produced)
var items [][]byte
for _, featureSet := range results.FeatureSets {
    item := &inferno.FeatureBQ{
        TraceId:      traceID,
        Properties:   featureSet.Properties,
        Features:     featureSet.Features,
        FeatureQuery: featureSet.FeatureQuery,
        TraceIdStr:   ptr.String(id_generator.ToString(traceID)),
    }

    serialized, err := proto.Marshal(item)
    if err != nil {
        return err
    }

    items = append(items, serialized)
}

// Append the rows to BigQuery.
//
// After a couple of hours this stream degrades and we consistently hit
// context deadline timeouts (the context deadline is ~3 minutes).
_, err := featureSetStream.AppendRows(ctx, items)

Expected behavior

We'd expect the AppendRows call to complete quickly and not eventually get stuck.

Actual behavior

AppendRows degrades and gets stuck after a couple of hours, and we need to restart the process to recover.

Screenshots
Below is a screenshot of the latency of the AppendRows call.

[Screenshot: latency of the AppendRows call over time]

Other Context

In the interim, we are restarting the process when we get alerted on this issue.

What's interesting is that this process writes to another BigQuery stream in parallel and never hits this degradation. There should be a 1:1 relationship between the AppendRows calls made to each stream: for each message processed from Pub/Sub, we write to both streams. The number of serialized rows appended per AppendRows call on that other stream is substantially larger (~1k to 5k rows), though each row is slightly smaller.

I suspect there is a remaining deadlock somewhere within the managedwriter. I have Google Cloud Profiler enabled for this service, but I didn't see anything noticeable from the CPU / threads profile view.

Another thing I tried was switching from a committed stream to the default stream, but that had no impact.

Let me know if there are any debug logs I can provide @shollyman

@benjaminRomano benjaminRomano added the triage me I really want to be triaged. label Mar 28, 2024
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the BigQuery API. label Mar 28, 2024
@benjaminRomano
Author

Another thing I noticed is that we have a high rate of EOF errors on the connection, which is what led to the detection of this prior issue: #9540

[Screenshot: rate of EOF errors on the connection]

@codyoss codyoss added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Apr 9, 2024
@shollyman
Contributor

Starting to look at this.

EOF on send should trigger us to mark the connection as unhealthy, so the next send that uses the connection should cause a reconnect. Possibly we're doing something unusual with a received response? Are there any other noteworthy events leading up to the spike in latency?

One way to get more information here might be to add a gRPC streaming interceptor that logs more detail about the stream interactions. I've got some old code that does this, but it likely needs to be refreshed. I'll work on that next.
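
For anyone following along, here's a rough sketch of what that kind of interceptor could look like, wired into the managedwriter client via a gRPC dial option. The names (logStreamInterceptor, newClientWithLogging) and the logging policy are just illustrative, not the refreshed debugging code mentioned above:

package bqdebug

import (
    "context"
    "log"

    "cloud.google.com/go/bigquery/storage/managedwriter"
    "google.golang.org/api/option"
    "google.golang.org/grpc"
)

// loggingStream wraps a grpc.ClientStream and logs send/receive errors.
type loggingStream struct {
    grpc.ClientStream
    method string
}

func (s *loggingStream) SendMsg(m interface{}) error {
    err := s.ClientStream.SendMsg(m)
    if err != nil {
        log.Printf("grpc send on %s failed: %v", s.method, err)
    }
    return err
}

func (s *loggingStream) RecvMsg(m interface{}) error {
    err := s.ClientStream.RecvMsg(m)
    if err != nil {
        log.Printf("grpc recv on %s failed: %v", s.method, err)
    }
    return err
}

// logStreamInterceptor logs stream creation and wraps the resulting stream
// so SendMsg/RecvMsg errors on the AppendRows bidi stream are visible.
func logStreamInterceptor(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn, method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {
    log.Printf("opening grpc stream: %s", method)
    cs, err := streamer(ctx, desc, cc, method, opts...)
    if err != nil {
        log.Printf("opening %s failed: %v", method, err)
        return nil, err
    }
    return &loggingStream{ClientStream: cs, method: method}, nil
}

// newClientWithLogging creates a managedwriter client with the interceptor installed.
func newClientWithLogging(ctx context.Context, projectID string) (*managedwriter.Client, error) {
    return managedwriter.NewClient(ctx, projectID,
        option.WithGRPCDialOption(grpc.WithChainStreamInterceptor(logStreamInterceptor)))
}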

@benjaminRomano
Author

benjaminRomano commented Apr 10, 2024

Ah, I forgot to update this ticket. I root-caused this issue last week with GCP Support.

The issue was that we were sending a malformed proto descriptor and receiving the following error from upstream:

Invalid proto schema: BqMessage.proto: FeatureBQ.properties: ".snap.inferno.Property" is not defined. BqMessage.proto: FeatureBQ.features: ".snap.inferno.Feature" is not defined. Entity: projects/sc-inferno/datasets/inferno/tables/feature-sets/streams/Cic3ZDhiOWI4NC0wMDAwLTI2NmEtYjc0OC1mNDAzMDQ1ZDJiYjA6czQ

In our client code, we had the following:

// Example proto
m := &myprotopackage.MyCompiledMessage{}
descriptorProto := protodesc.ToDescriptorProto(m.ProtoReflect().Descriptor())

instead of this:

m := &myprotopackage.MyCompiledMessage{}
// Normalize the descriptor to collect nested messages, fix up oneOfs, etc.
descriptorProto, err := adapt.NormalizeDescriptor(m.ProtoReflect().Descriptor())

if err != nil {
	// TODO: Handle error.
}

The NormalizeDescriptor call was accidentally removed during a refactor.

Ultimately, we were sending batches of bad data to the backend, which got rejected and led to the EOF errors. We did not see the underlying bad-proto error because we were not calling result.GetResult(ctx), which would have returned the error message to us.
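
For reference, roughly what the missing check looks like against the snippet above (illustrative; the error-handling policy is up to the caller):

res, err := featureSetStream.AppendRows(ctx, items)
if err != nil {
    return err // send-path failure
}
// Block on the per-append result so server-side rejections
// (e.g. "Invalid proto schema") surface in the client instead of
// only showing up later as connection EOFs.
if _, err := res.GetResult(ctx); err != nil {
    return err
}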

Couple thoughts:

  • Can we detect malformed protos at registration time (WithSchemaDescriptor) instead of when we send the proto descriptor up to the server?
  • The BigQuery server logs should capture upload requests with malformed proto data. Looking at BigQuery logs in Stackdriver, we can see the server just logging OK. Ideally, the invalid proto schema message would be logged by the server.
  • At high error rates, there may be some unexpected deadlocking behavior. Resolving this is not a big deal, since it is just a symptom of a different problem.
  • Adding logs for BQ interactions makes sense; that could have helped us narrow this down. Since we weren't calling GetResult, we were dependent on GCP Support to send us the underlying BQ server error message.

@shollyman
Contributor

Thanks for the additional details.

Detecting malformed DescriptorProto messages at instantiation time would require us to mimic the same descriptor validation and building logic in both the clients and the service, which is problematic. That said, I've opened a potential FR (internally b/333755890) against the service to expose some kind of validation RPC that would allow us to check compatibility before sending data.

The managedwriter package currently produces OpenCensus metrics (I've got some feature work to enable the same on the OpenTelemetry side), so that's another avenue for monitoring ingestion health. The current metrics/views can be seen in the package docs: https://pkg.go.dev/cloud.google.com/go/bigquery/storage/managedwriter#pkg-variables
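
In case it helps others, a minimal sketch of registering a few of those views with OpenCensus (view names taken from the package variables in the docs linked above; pair this with whichever exporter you already run, e.g. Stackdriver or Prometheus):

import (
    "cloud.google.com/go/bigquery/storage/managedwriter"
    "go.opencensus.io/stats/view"
)

// registerManagedWriterViews registers a subset of the managedwriter views
// so append request and error counts are aggregated and can be exported.
func registerManagedWriterViews() error {
    return view.Register(
        managedwriter.AppendRequestsView,
        managedwriter.AppendRequestErrorsView,
        managedwriter.AppendResponseErrorsView,
        managedwriter.AppendClientOpenView,
    )
}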

@shollyman
Contributor

Given there are no immediately actionable elements for this issue, I'm going to go ahead and close it out at this time. However, if you have more feedback, please let me know here or via another issue.
