Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Google Cloud Spanner Receiver gracefully handles unreadable databases #26732

Merged

Conversation

spidercensus
Copy link
Contributor

@spidercensus spidercensus commented Sep 18, 2023

Description:
Fixing a bug. Google Cloud Spanner Receiver currently generates an exception and exits if it attempts to read data from a database that doesn't exist. However it's normal for a single receiver to poll multiple databases, so this is not graceful failure. This PR makes a change to gracefully generate an error in case of an unreadable missing database and then continue reading other databases..

Issue: 14624

Changelog:

  • DatabaseReader.Read returns a vector of metadata.MetricsDatapoint and a multierr.Errors object instead of the result vector and a single Error.
  • DatabaseReader.Read appends any errors encountered to the multierr.Errors object instead of returning (nil, Error).
  • DatabaseReader.Read adds a NewPartialScrapeError to multierr.Errors object if errors were encountered while reading and some results were still populated into the result vector
  • ProjectReader.Read returns a vector of metadata.MetricsDatapoint and a Errors object instead of the result vector and nil.
  • ProjectReader.Read combines all multierr.Errors objects into a single Error before returning them as err.
  • ProjectReader.Read returns a NewPartialScrapeError instead of err if errors were encountered while reading and some results were still populated into the result vector.
  • googleCloudSpannerReceiver.Scrape returns a vector of metadata.MetricsDatapoint and a Error object instead of just the result vector.
  • All methods mentioned above continue processing metrics after an error is encountered.

@spidercensus spidercensus requested review from a team and bogdandrutu September 18, 2023 17:18
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Sep 18, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@spidercensus spidercensus changed the title Google Cloud Spanner Receiver gracefully generates an error in case o… Google Cloud Spanner Receiver gracefully handles unreadable databases Sep 18, 2023
@spidercensus spidercensus force-pushed the spidercensus/fix-googlecloudspanner branch from c2234a0 to 2397c5e Compare September 18, 2023 17:24
Copy link
Member

@songy23 songy23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a changelog

@afarbos
Copy link

afarbos commented Sep 18, 2023

I close my PR in favor of this one.
I believe we would need 2 thing that I covered in mine (#26860):

  1. We need to enable this at the 3 level: systems tables metrics, database and project
  2. We need to show this error as partial for the scraper to still retrieve the error. See https://github.com/open-telemetry/opentelemetry-collector/blob/1f5ff4b6b7ccd1f5806c5ce3d50aa1f6196349f2/receiver/scraperhelper/scrapercontroller.go#L188-L214

@spidercensus spidercensus marked this pull request as draft September 18, 2023 19:12
@spidercensus
Copy link
Contributor Author

I close my PR in favor of this one. I believe we would need 2 thing that I covered in mine (#26860):

  1. We need to enable this at the 3 level: systems tables metrics, database and project
  2. We need to show this error as partial for the scraper to still retrieve the error. See https://github.com/open-telemetry/opentelemetry-collector/blob/1f5ff4b6b7ccd1f5806c5ce3d50aa1f6196349f2/receiver/scraperhelper/scrapercontroller.go#L188-L214

I merged in the changes from this PR. Thanks @afarbos

@spidercensus spidercensus marked this pull request as ready for review September 18, 2023 19:55
Copy link
Member

@songy23 songy23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go.mod/go.sum deps changes detected, please run "make gotidy" and commit the changes in this PR.

@spidercensus
Copy link
Contributor Author

go.mod/go.sum deps changes detected, please run "make gotidy" and commit the changes in this PR.
Thanks, I'll review the tests. This is my first time contributing to this repo.

@spidercensus spidercensus marked this pull request as draft September 20, 2023 17:35
@spidercensus spidercensus marked this pull request as ready for review September 20, 2023 22:28
@spidercensus
Copy link
Contributor Author

spidercensus commented Sep 27, 2023

% make -C receiver/googlecloudspannerreceiver do-unit-tests-with-cover
...

go test -race -timeout 300s -parallel 4 --tags="" -coverprofile=coverage.txt -covermode=atomic ./...
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver   1.998s  coverage: 96.3% of statements
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver/internal/datasource       1.543s   coverage: 100.0% of statements
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver/internal/filter   2.580s  coverage: 95.2% of statements
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver/internal/filterfactory    2.162s   coverage: 97.3% of statements
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver/internal/metadata 2.009s  coverage: 97.6% of statements
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver/internal/metadataconfig   2.769s   coverage: [no statements]
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver/internal/metadataparser   2.392s   coverage: 100.0% of statements
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver/internal/statsreader      3.001s   coverage: 64.1% of statements
go tool cover -html=coverage.txt -o coverage.html
% make -C receiver/googlecloudspannerreceiver lint                    
running /Users/.../opentelemetry-collector-contrib/.tools/misspell -error
/Users/.../opentelemetry-collector-contrib/.tools/golangci-lint run --allow-parallel-runners --verbose --build-tags integration --timeout=30m --path-prefix googlecloudspannerreceiver
INFO [config_reader] Config search paths: [./ /Users/.../opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver /Users/.../opentelemetry-collector-contrib/receiver /Users/.../opentelemetry-collector-contrib /Users/... /Users /] 
INFO [config_reader] Used config file ../../.golangci.yml 
INFO [lintersdb] Active 24 linters: [decorder depguard errcheck errorlint exhaustive exportloopref gci gocritic gofmt goimports gosec gosimple govet ineffassign misspell predeclared reassign revive staticcheck tenv unconvert unparam unused wastedassign] 
INFO [loader] Using build tags: [integration]     
INFO [loader] Go packages loading at mode 575 (compiled_files|exports_file|imports|name|types_sizes|deps|files) took 331.151167ms 
INFO [runner/filename_unadjuster] Pre-built 0 adjustments in 3.035833ms 
INFO [linters_context/goanalysis] analyzers took 0s with no stages 
INFO [runner] Issues before processing: 83, after processing: 0 
INFO [runner] Processors filtering stat (out/in): skip_dirs: 83/83, exclude: 83/83, exclude-rules: 0/83, skip_files: 83/83, cgo: 83/83, filename_unadjuster: 83/83, path_prettifier: 83/83, autogenerated_exclude: 83/83, identifier_marker: 83/83 
INFO [runner] processing took 4.107706ms with stages: exclude-rules: 1.704667ms, identifier_marker: 906.458µs, path_prettifier: 671.625µs, autogenerated_exclude: 626.542µs, skip_dirs: 190.5µs, cgo: 3.667µs, filename_unadjuster: 1.458µs, nolint: 584ns, max_same_issues: 458ns, fixer: 292ns, skip_files: 209ns, exclude: 208ns, source_code: 208ns, uniq_by_line: 166ns, severity-rules: 166ns, path_shortener: 84ns, max_per_file_from_linter: 83ns, path_prefixer: 83ns, sort_results: 83ns, diff: 83ns, max_from_linter: 82ns 
INFO [runner] linters took 426.765ms with stages: goanalysis_metalinter: 422.37575ms 
INFO File cache stats: 0 entries of total size 0B 
INFO Memory: 9 samples, avg is 46.8MB, max is 89.9MB 
INFO Execution took 772.344083ms     

spidercensus

This comment was marked as outdated.

@spidercensus spidercensus force-pushed the spidercensus/fix-googlecloudspanner branch from e3f97c0 to 64ae47c Compare October 4, 2023 16:22
@spidercensus spidercensus reopened this Oct 4, 2023
@spidercensus spidercensus force-pushed the spidercensus/fix-googlecloudspanner branch from 24c3f04 to d25ec56 Compare October 6, 2023 18:21
@spidercensus spidercensus reopened this Oct 6, 2023
@spidercensus spidercensus marked this pull request as ready for review October 27, 2023 20:24
@spidercensus
Copy link
Contributor Author

@harishbohara11 Could you please validate that this works on your end as well?

I used the following config to check this:


exporters:
  logging: {
    verbosity: normal
  }
  prometheus:
    endpoint: '0.0.0.0:9090'
extensions:
  health_check: {}
  memory_ballast:
    size_in_percentage: 10
  pprof:
receivers:
  prometheus/collector:
    config:
      scrape_configs:
        - job_name: "collector-metrics"
          static_configs:
            - targets: ["localhost:8888"]
          scrape_interval: 1m
          scrape_timeout: 10s
          metric_relabel_configs:
  otlp:
    protocols:
      grpc:
        endpoint: null
      http:
        endpoint: null
  googlecloudspanner:
    collection_interval: 60s
    top_metrics_query_max_rows: 100
    backfill_enabled: false
    cardinality_total_limit: 100000
    hide_topn_lockstats_rowrangestartkey: false
    projects:
      - project_id: <redacted>
        instances:
          - instance_id: <redacted>
            databases:
              - <redacted>
              - no_database_on_existing_instance
          - instance_id: non_existent_instance
            databases:
              - no_database_on_non_existent_instance
service:
  telemetry:
    metrics:
      address: :8888
    logs:
      level: info

  extensions:
    - health_check
    - memory_ballast
    - pprof
  pipelines:
    # This pipeline gathers metrics about this collector itself and forwards them to datadog.
    metrics/collector-dd:
      receivers:
        - prometheus/collector
      processors:
        - memory_limiter
        - batch
      exporters:
        - prometheus
    # This pipeline gathers metrics from Spanner and forwards them to datadog.
    metrics:
      receivers:
        - googlecloudspanner
        - otlp
      processors:
        - memory_limiter
        - batch
      exporters:
        - logging
        - prometheus
processors:
  batch:
    send_batch_size: 4000
    timeout: 200ms
    send_batch_max_size: 5000

  memory_limiter:
    limit_percentage: 80
    spike_limit_percentage: 10
    check_interval: 1s

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 15, 2023
@harishbohara11
Copy link

@bogdandrutu Can you please review the PR? As a codeowner of GoogleCloudSpannerReceiver, @KiranmayiB has provided approval.

@mx-psi
Copy link
Member

mx-psi commented Nov 16, 2023

@spidercensus Thanks for your PR and apologies that it is taking so long to merge this. Given we now have a codeowner approval I can do a final review and merge this, but before this I need you to do the following steps so that CI passes:

  1. Merge main into this branch and make sure that the versions for go.opentelemetry.io modules in receiver/googlecloudspannerreceiver/go.mod are v0.90.1 or v1.0.0 for all modules
  2. Run make gotidy and make genotelcontribcol
  3. Remove the receiver/googlecloudspannerreceiver/coverage.html and receiver/googlecloudspannerreceiver/coverage.txt files

Could you do these steps and ping me when finished? Thanks!

@github-actions github-actions bot removed the Stale label Nov 17, 2023
Copy link
Contributor

github-actions bot commented Dec 2, 2023

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 2, 2023
@spidercensus
Copy link
Contributor Author

This build process really needs to be made platform-independent (Dockerized). There were far too many times that I had to re-run go.mod because my version of go was slightly different than what was expected by the build tests.

@mx-psi Please review.

@github-actions github-actions bot removed the Stale label Dec 5, 2023
@mx-psi mx-psi merged commit b0947a2 into open-telemetry:main Dec 5, 2023
83 checks passed
@github-actions github-actions bot added this to the next release milestone Dec 5, 2023
@mx-psi
Copy link
Member

mx-psi commented Dec 5, 2023

@spidercensus Thanks for following the steps, there was one more merge conflict but it was easy to fix so I fixed it myself :) I am sorry it took so long to merge this, but it finally got through!

There were far too many times that I had to re-run go.mod because my version of go was slightly different than what was expected by the build tests.

If you have time, would you be able to open an issue about this? If not, any further details are helpful. In general, we aim to have compatibility with all supported Go versions (currently Go 1.20 and Go 1.21), but dependency management in Go can sometimes be a bit unwieldy, and e.g. right now the make gotidy target output can differ between Go 1.20 and Go 1.21. We should at the very least document this, but would like to first confirm that this is the issue you faced

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants