Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: Why "ECONNRESET on read" is not transient error #9478

Closed
0rps opened this issue Feb 27, 2024 · 3 comments · Fixed by #10154
Closed

storage: Why "ECONNRESET on read" is not transient error #9478

0rps opened this issue Feb 27, 2024 · 3 comments · Fixed by #10154
Assignees
Labels
api: storage Issues related to the Cloud Storage API. type: question Request for information or clarification. Not an issue.

Comments

@0rps
Copy link

0rps commented Feb 27, 2024

Recently I added more monitoring in the software I am developing and noticed in storage module the following sequence of errors during listing bucket:

Logs:

2024-02-05 07:54:58 UTC [ERROR] a transient error has happened: Get "https://storage.googleapis.com/storage/v1/...&versions=false": read tcp IP_A:49888->IP_B:443: read: connection reset by peer
2024-02-05 07:54:58 UTC [ERROR] a transient error has happened: Get "https://storage.googleapis.com/storage/v1/...&versions=false": read tcp IP_A:49888->IP_B:443: read: connection reset by peer

// here is non transient error and it is unpacked:
2024-02-05 07:54:58 UTC [INFO] [component:storage] type: "<*net.OpError Value>", value: "read tcp IP_A:49888->IP_B:443: read: connection reset by peer"
2024-02-05 07:54:58 UTC [INFO] [component:storage] type: "<*os.SyscallError Value>", value: "read: connection reset by peer"
2024-02-05 07:54:58 UTC [INFO] [component:storage] type: "<syscall.Errno Value>", value: "connection reset by peer"
2024-02-05 07:54:58 UTC [INFO] [component:storage] error is not unpackable

// the error is propagated further 
2024-02-05 07:54:58 UTC [ERROR] ... error while listing the source bucket: error while listing 'gs://...': read tcp IP_A:49888->IP_B:443: read: connection reset by peer

Google client creation

client, err := storage.NewClient(ctx, options...)  
// ...
client.SetRetry(storage.WithPolicy(storage.RetryAlways))  
client.SetRetry(storage.WithErrorFunc(customErrorFunc)
c = &CloudClient{client: client}
_, err := c.ListPrefix(...)
if err != nil {
  return fmt. Errorf("error while listing %q: %w", ..., err)
}

Code for custom retry function:

func unpackError(err error) {  
    if err == nil {  
       return  
    }  
    rValue := reflect.ValueOf(err)  
    logger.Infof("type: %q, value: %q", rValue.String(), err.Error())  
  
    if e, ok := err.(interface{ Unwrap() error }); ok {  
       unpackError(e.Unwrap())  
    } else {  
       logger.Infoln("error is not unpackable")  
    }  
}  

func customErrorFunc(err error) bool {  
       isRetry := storage.ShouldRetry(err)(err)  
       if isRetry {  
          logger.Error(TransientErrorf("a transient error has happened: %v", err))  
       } else {  
          unpackError(err)  
       }  
       return isRetry    
}

Code for listing:

func (c CloudClient) ListPrefix(ctx context.Context, location *Location, timeout time.Duration) ([]File, error) {  
    ctx, cancel := context.WithTimeout(ctx, timeout)  
    defer cancel()  
  
    query := &storage.Query{  
       Prefix:      strings.TrimRight(location.Path, "/") + "/",  
       Versions:    false,  
       StartOffset: "",  
       EndOffset:   "",  
       Projection:  0,  
    }  
  
    err := query.SetAttrSelection([]string{"Bucket", "Name", "Size", "MD5", "Updated"})  
    if err != nil {  
       logger.Fatalf(err.Error())  
    }  
  
    iter := c.client.Bucket(location.Bucket).Objects(ctx, query)  
    var result []File  
  
    for {  
       next, err := iter.Next()  
       if err == iterator.Done {  
          break  
       }  
  
       if err != nil {  
          return nil, fmt.Errorf("error while listing '%s': %w", location, err)  
       }  
		files = append(files, extractFile(next))
    }  
  
    return files, nil  
}

I have read the documentation which states that only url.Error is considered a transient error in the case of "ECONNRESET". I suppose that the error is raised during reading (this is my assumption). I have also seen a similar (but not identical) discussion here: Azure/go-autorest#450. There is also a link to a Go standard library test, but it doesn't clarify things https://go.dev/src/net/net_test.go.

What is the goal of my issue?
From my perspective (and from a high-level view of this functionality), any error related to "ECONNRESET" should always be considered retryable. Could you explain whether this makes sense or if perhaps I am mistaken?

@0rps 0rps added the triage me I really want to be triaged. label Feb 27, 2024
@product-auto-label product-auto-label bot added the api: storage Issues related to the Cloud Storage API. label Feb 27, 2024
@tritone tritone added type: question Request for information or clarification. Not an issue. and removed triage me I really want to be triaged. labels Feb 28, 2024
@tritone
Copy link
Contributor

tritone commented Feb 28, 2024

Thanks for reporting. Yes, we do intend to retry any ECONNRESET error, and we do that with *url.Error: https://github.com/googleapis/google-cloud-go/blob/storage/v1.38.0/storage/invoke.go#L119

We could potentially add a special case for *net.OpError as well? I wasn't aware that ECONNRESET could also come in through this type.

Also the idea of WithErrorFunc is that you can supply your own predicate so feel free to customize that for your application as well.

@0rps
Copy link
Author

0rps commented Feb 29, 2024

Thanks for the answer. I have already added custom handling of ECONNRESET in the project. I was just wondering if there are any pitfalls to handle only url.Error type errors in such cases and received the answer :)

BTW, looks like url.Error wraps net.OpError with ECONNRESET only during connection establishment (I guess so).

What should I do with this issue, close by myself/you will close it/keep open for future improvement? Also, is it OK to submit the PR for this case (if I make it)?

@tritone
Copy link
Contributor

tritone commented Feb 29, 2024

We can add net.OpError with ECONNRESET as a special case as well. I'd accept that change in the repo if you'd like to make it.

tritone added a commit to tritone/google-cloud-go that referenced this issue May 11, 2024
We are seeing these errors surfaced via net.OpError as well as
url.Error. Update the ShouldRetry function accordingly.

Also, use net.ErrClosed sentinel over string matching.

Fixes googleapis#9478
tritone added a commit that referenced this issue May 13, 2024
We are seeing these errors surfaced via net.OpError as well as
url.Error. Update the ShouldRetry function accordingly.

Also, use net.ErrClosed sentinel over string matching.

Fixes #9478
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the Cloud Storage API. type: question Request for information or clarification. Not an issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants