[RFC] Grafana Agent flow mode plugins #350
Comments
Plugins have been on my mind a lot, especially whether they're even viable. I would like to personally write this next proposal, but I would like to see others take the other 4 (though I'll still want to be involved to some extent).
A few thoughts and concerns:

**Development Experience**

One of the key selling points of the flow philosophy is that a component is a single Go package that self-registers and is a relatively standalone, testable piece of code. I wouldn't want plugins to introduce a different experience. If a developer has to choose whether they are making a plugin or a compiled component early in the process, that would be undesirable. Ideally, you should be able to import the exact same Go package without modifications and run it as a plugin the same as if it were compiled in, for maximum portability and flexibility. That may be unrealistic, but I think it should be a goal. I would hate to fracture a very young open-source community into vastly different runtime modes (which is why Lua would be a very hard sell for me).

**Living without plugins**

The status quo is that all of the current components are compiled into the agent binary. The self-registration mechanism makes that really nice, because enabling a component is as simple as importing its package. Since plugins will almost certainly have a performance cost, I'd argue they need to have a significant ease-of-use benefit over the current paradigm to be worth it. I remain skeptical that any of the currently available solutions for Go will do that, but I'd love to be proven wrong.

We could alternately dedicate time to normalizing and facilitating the creation of custom agent binaries with arbitrary combinations of component packages. I made a proof of concept for personal use; it has some rough edges, but was not intolerable. With some docs and maybe some tooling, we could make it pretty easy for somebody to create an agent with (or without) whatever components they want.
This is a goal I share, but it's not discussed in this design doc since I don't go over the API at all. It should be possible, but it may require new restrictions on a component's API. In particular, if plugins are built using WASM or system binaries, exporting interfaces introduces a new challenge: the plugin engine would need to be able to provide some value for an interface across plugins, but Go doesn't allow interfaces to be built at runtime. A workaround is to introduce some kind of code generation to build interface implementations, but I don't know if that's something we'd want to do, since it complicates the build process. It is, however, possible to build functions at runtime. If we were to restrict our existing APIs such that you could only export structs of functions, then plugins would be able to work as native components do today, and both native components and plugins would be built exactly the same.
Developing a component in a plugin will not be easier than the current paradigm, but it won't be harder either. However, plugins solve important problems that we've been facing:
This is exactly what the RFC is arguing that we shouldn't do:
We see this pattern with the OpenTelemetry collector, and I'm overall not a fan of the distribution-type model for the reason above. The scenario of "this distribution doesn't have a component I want, so I have to fork it or beg the maintainers to add it in" can be seen as user-hostile. Do you have counterarguments for why adopting a distribution model is better than a plugin model?
I'm coming at this from a slightly different angle, having recently switched to Grafana Agent after using either vector.dev or one of the various distros of the OTEL collector for the past couple of years.

The first thing I think is important to note is that having a single binary with all the "plugins" installed into it (whether they are the plugins being proposed, existing components, or a combination of both) is actually really handy for most users, as it means they don't have to worry about whether they've compiled the correct code and can just deploy a single binary/container. This is not, however, advocacy for keeping the status quo, and the point about OTEL distros is very valid!

I have always appreciated the DataDog approach to plugins, which boils down to "You want to install it easily? You contribute upstream. You want something specific to you? Drop the code into a local directory."

The idea that a user could develop a plugin locally, run it on their own platform, and then contribute it to "core" if they wanted to is a nice pattern, and it even allows folks to release plugins under their own GitHub/NPM/whatever repo and have Grafana Agent "pick it up" from a directory on the filesystem if they want to use another license. It does, however, mean that the agent needs to be able to load from disk at launch, and potentially "reload" everything from disk whilst running, depending on how advanced we want to make it.

Not sure if that makes sense, so ask any questions and I'll do my best to clarify! 😆
I'll give my 2 cents here, as this kinda hit a soft spot for me recently. One could argue that the approach OpenTelemetry took with the official and contrib distros of their collector allowed vendors to provide support for their own specific platforms, keeping the collector as agnostic as can be. But in reality, user freedom was in some ways greatly reduced.

**Pros**

Allowing plugins and extensions to be easily added opened up the community, both individual OSS developers and vendors, to extend the functionality. It preserved the ability to keep "core" support on the official collector and not be "cornered" into giving a guarantee of support for components not developed and approved by the core community.

**Cons**

Allowing vendors to develop plugins which provide custom support for their platforms lets them implement logic and requirements which diverge from the OTel spec. This causes issues, as the vendor now requires users to implement custom logic in their systems, which essentially creates a sort of vendor lock-in. My example is simple: a vendor I am using requires OTel signals to be exported with 4 specific headers, which indicate the classification of the signals sent and are used to index them. These headers cannot be given dynamic values based on the signal being sent, so I'm left having to use their provided distro of the OTel collector contrib, which knows how to deduce headers from the signal passing through the exporter. This makes it harder to adopt different agents, as not all vendors are happy to develop support for all shipping agents.

Keeping things as close to the standard as possible is always a good idea from the user's perspective, as it preserves the freedom to choose the solution which best fits their specific needs. Between this RFC and making the River language as complete as possible, I would choose the latter any time.

With that said,
I completely agree with @proffalken about always keeping things as a single binary. If I need to compile the agent in my CI, it makes life much harder, as I have to keep track of changes in the agent's build process instead of simply downloading a released version and running it. In that sense, creating a contrib distro of the agent, and providing an easy interface for devs to add components while keeping up with upstream, is the best way to go for these types of things, IMHO.
I'm excited to see how this plays out; plugins sound like an exciting approach to a more modular Agent in the future! 👀

One thing that sticks out is the "Ability to migrate existing components," and whether this should be a hard requirement. I feel the added value of plugins might be enough even if we cannot migrate all current components; for example, if the performance overhead was a bit too much for, say, a particularly high-throughput component, it could stay compiled in.

Are you worried that having two classes of components, one of which comes with a performance-related warning, would make the latter feel like second-class citizens?
Addressing some of the comments here: This proposal was very high level, so it probably didn't do a great job at helping envision what plugins could potentially be. I could imagine adding something like this to a Flow config:
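For illustration, here's a hedged sketch of what such a config block might look like. The `plugin` block name, its attributes, and the component it provides are entirely made up for the sake of the example; nothing about this syntax is proposed as final:

```river
// Hypothetical: declare a plugin to load at runtime.
plugin "telegraf" {
  // Where a plugin comes from is undecided (local path, registry, ...).
  source = "/etc/agent/plugins/telegraf-plugin"
}

// Components provided by the plugin would then be usable like any
// native component in the same configuration file.
telegraf.cpu "default" { }
```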
This hypothetical design has a few interesting attributes:
This is just a sketch, and I'm not sure what the final proposal would look like, but I do not want plugins to require people to recompile the agent. cc @proffalken
IMO, this is a good thing. Locking components in to only doing OpenTelemetry will cause progress to be bottlenecked by when OpenTelemetry adopts a change. By necessity of attempting to be a global standard for all telemetry data, OpenTelemetry needs to be careful, and so will be slower to adopt new additions.

I don't want Flow to be limited to only OpenTelemetry components, and we don't even do that today; we have multiple sets of plugins from different ecosystems (`prometheus.*`, `loki.*`, `pyroscope.*`, `otelcol.*`, `discovery.*`). I also don't want to limit plugins to only dealing with telemetry data. If someone wants to write a plugin with a component that provisions infrastructure, they should be free to do so.
Unfortunately, the -contrib approach really doesn't fix the problems the maintainers are facing today, as I mentioned earlier, specifically the one around dependency hell. If someone wants nearly all the components, they will struggle to keep their distribution up to date. Plugins will be a challenge to implement, but I think they will give users much more flexibility around which components are used, and prevent community fragmentation, as there will only be one official binary of Flow, with many different plugins providing different components.
Yes, and I also don't want to play favorites :) It would feel weird to me personally if we said "prometheus, otelcol, loki, and pyroscope all get to stay in core for performance, but everything else must be a plugin," especially since two of those are Grafana Labs products. We can make Flow an open platform, but that means playing on the same field as everyone else. I would prefer us to measure the overall impact of plugins and try our best to make the overhead as small as possible so we can become that open platform. cc @tpaschalis
I think we should firstly decide on what the user experience should be. For example:
The other proposals should be based on that user-experience goal. However, I am not sure whether this should fall within this proposal or a sub-proposal.
At this point, we're not sure what the technical limitations of plugins are. That will drive what we're going to be able to deliver, which may change what we end up exposing to end users. While I'd normally agree to start from the user experience, I think this is a problem where some (but not all) technical information needs to be figured out first.
Adding my 2 cents since I'm interested. I would love to see something where we could easily import or use Telegraf plugins, since there are so many... either linked as a Go plugin, or referenced in code? Not sure...
The most recent copy of this proposal can be found on Google docs. The below is the original version of this proposal for posterity.
Background
Currently, new capabilities to Grafana Agent Flow can only be added by contributing a new component to the official Git repository (https://github.com/grafana/agent).
Having a centralized repository of components causes issues for an open source project to thrive:
Other projects, such as the OpenTelemetry Collector, solve this problem by having different distributions of the collector. While distributions solve the issues above, they also fragment the community, as different distributions may have different subsets of components, making migration between distributions difficult.
I propose that we support a plugin system for Grafana Agent flow mode, allowing sets of components to be provided by external plugins which can be loaded at runtime into the Grafana Agent process.
This proposal serves as a high-level proposal of plugins to achieve maintainer and community consensus on the long-term goals.
Goals
Non-goals
Proposal
Flow mode should introduce the concept of a "plugin," where a plugin is some loadable code that provides one or more components that can be defined in a Flow configuration.
The mechanisms through which plugins are created, retrieved, and defined are not in scope for this proposal:
Requirements
These are high-level requirements that plugins must achieve:
If it is not possible for us to create a plugin system which meets these two requirements, we should consider abandoning the plugin model.
Performance
The biggest concern with plugins is the performance of component communication. Today, communication between two components (such as `prometheus.scrape` sending metrics to `prometheus.remote_write`) is largely achieved using shared memory, as it's internally represented by a native function call.

However, plugin communication is unlikely to be able to use shared memory; the only mechanism where shared memory is available is Go's standard `plugin` package, but that doesn't support Windows, making it OS-dependent. All other potential communication mechanisms will involve some kind of message marshaling and unmarshaling between running plugins and the plugin host (Grafana Agent).
Before full development on plugins begins, a proof of concept is needed that demonstrates the overhead of message marshaling and unmarshaling to prove the viability of plugins.
Sub-proposals
For plugins to be fully realized, we need at least these five proposals to build on top of this one:
These proposals will likely be written by different people over a long period of time. The first proposal, plugin component communication, is a prerequisite for all the others, as it will prove or disprove whether plugins can be performant.
Delivery plan
Assuming plugins are viable, they will be delivered in four phases: