[receiver/vcenter]: Optimize vCenter Receiver to use concurrency #31837
schmikei added the `enhancement` (New feature or request) and `needs triage` (New item requiring triage) labels on Mar 19, 2024
Pinging code owners for receiver/vcenter: @djaglowski @schmikei. See Adding Labels via Comments if you do not have permissions to add labels yourself.
@StefanKurek and I plan to take this effort on; feel free to assign us to it.
Maybe this spike can help: #30624
djaglowski pushed a commit that referenced this issue on Apr 17, 2024
**Description:** Changes the method for collecting VMs used by the `vcenterreceiver` to the more time-efficient `CreateContainerView` method. This is the first step to addressing the issue linked below.
**Link to tracking Issue:** #31837
**Testing:** These changes were tested on an environment with 200+ virtual machines. The original collection time was ~80 seconds. Collection times with these changes are ~40 seconds.
**Documentation:** N/A
Co-authored-by: Stefan Kurek <stefan.kurek@observiq.com>
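The win here comes from replacing one network round trip per VM with a single bulk retrieval over a container view. A minimal stdlib-only sketch of the difference, with hypothetical `fetchOne`/`fetchAll` functions standing in for the real govmomi calls:

```go
package main

import "fmt"

// Hypothetical inventory; in the real receiver this lives in vCenter.
var inventory = map[string]string{"vm-1": "running", "vm-2": "stopped", "vm-3": "running"}

// fetchOne models the old pattern: one network round trip per VM.
func fetchOne(id string) string { return inventory[id] }

// fetchAll models the CreateContainerView approach: a single round trip
// returning every VM in the container, so latency cost no longer scales
// with the number of VMs.
func fetchAll() map[string]string {
	out := make(map[string]string, len(inventory))
	for k, v := range inventory {
		out[k] = v
	}
	return out
}

func main() {
	// Old: len(inventory) round trips.
	for _, id := range []string{"vm-1", "vm-2", "vm-3"} {
		_ = fetchOne(id)
	}
	// New: one round trip.
	all := fetchAll()
	fmt.Println(len(all))
}
```

With thousands of VMs, collapsing N round trips into one is what turns the ~80s scrape into ~40s, even before any concurrency is added.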
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue on May 8, 2024 (open-telemetry#32201, the same change as above)
djaglowski pushed a commit that referenced this issue on May 13, 2024
**Description:** Earlier improvements to how network calls are made for Virtual Machines decreased collection times from ~90s to ~27s in an environment with 1 Cluster, 2 Hosts, and 280 VMs. Making similar changes for all resource types decreased collection time further, from ~27s to under ~3s in the same environment. A general list of the changes:
- Makes all network calls (per datacenter) first and stores the returned data.
- Processes this data afterwards to convert it to OTEL resources/metrics (refactored into a new file).
- Moves all metric recording to metrics.go for consistency.
- Moves all resource builder creation to resources.go for consistency.
- Updates/fixes tests.

**Link to tracking Issue:** #31837 (although this issue prescribes a solution, goroutines, which ended up not being necessary)
**Testing:** Unit tests and integration tests passing, as well as manual testing in local environments.
**Documentation:** N/A
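The restructuring above separates I/O from processing: all network calls happen up front and the stored results are then converted to metrics with no I/O in the loop. A rough sketch of that two-phase shape, with illustrative type and function names that are not the actual vcenterreceiver API:

```go
package main

import "fmt"

// Hypothetical stand-ins for the receiver's real data types.
type vmData struct {
	Name   string
	CPUMHz int
}

type scrapeData struct {
	Datacenter string
	VMs        []vmData
}

// Phase 1: make all network calls for a datacenter up front and store
// the results. In the real receiver this is where the govmomi calls go.
func fetchDatacenter(name string) scrapeData {
	return scrapeData{
		Datacenter: name,
		VMs:        []vmData{{"vm-1", 2400}, {"vm-2", 1200}},
	}
}

// Phase 2: convert the stored data into metric points; no I/O involved,
// so this part is cheap and easy to test in isolation.
func buildMetrics(d scrapeData) []string {
	out := make([]string, 0, len(d.VMs))
	for _, vm := range d.VMs {
		out = append(out, fmt.Sprintf("%s/%s cpu=%d", d.Datacenter, vm.Name, vm.CPUMHz))
	}
	return out
}

func main() {
	data := fetchDatacenter("dc-1") // phase 1: network
	for _, m := range buildMetrics(data) { // phase 2: pure processing
		fmt.Println(m)
	}
}
```

Keeping the phases separate also explains why goroutines turned out to be unnecessary: once the per-datacenter data is fetched in bulk, the processing phase is no longer request-bound.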
jlg-io pushed a commit to jlg-io/opentelemetry-collector-contrib that referenced this issue on May 14, 2024 (open-telemetry#32991, the same change as above)
@schmikei I think this can be closed now with the recent performance enhancements. I'll let you decide.
Component(s)
receiver/vcenter
Is your feature request related to a problem? Please describe.
I'm led to believe that larger environments have a harder time using the receiver due to the large number of requests that need to be made.
In larger environments, the volume of requests shows that the current synchronous code has room for improvement. Once we scaled up, it was pretty evident that we were bottlenecking on requests, particularly for VMs.
Describe the solution you'd like
From my recollection, the MetricsBuilder doesn't maintain state that would prevent easy parallelization, so we could parallelize/batch the requests to get better performance.
The tool mitmproxy really helped us identify the issue: the massive number of individually made requests was stacking up to be quite a time sink.
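The parallelization proposed here is a standard fan-out with bounded concurrency. A minimal sketch under stated assumptions (`fetchVM` is a hypothetical stand-in for one per-VM request, and the worker limit guards against overwhelming the vCenter API):

```go
package main

import (
	"fmt"
	"sync"
)

// fetchVM stands in for one per-VM request; purely illustrative.
func fetchVM(id int) string { return fmt.Sprintf("vm-%d", id) }

// fetchConcurrently fans the requests out across goroutines, with a
// semaphore channel bounding how many are in flight at once.
func fetchConcurrently(ids []int, workers int) []string {
	results := make([]string, len(ids))
	sem := make(chan struct{}, workers) // limit in-flight requests
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		go func(i, id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			results[i] = fetchVM(id) // each index written by exactly one goroutine
		}(i, id)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(len(fetchConcurrently([]int{1, 2, 3, 4, 5}, 2)))
}
```

Writing each result to its own slice index keeps the goroutines from needing a mutex, and the bounded semaphore keeps request batches at a size the server can tolerate.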
Describe alternatives you've considered
No response
Additional context
Based on early experiments, a collection for an environment of 2000 VMs sometimes took minutes to complete; we're hoping to cut that down to a collection interval of seconds.