Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add allowlist setting for ingest processors #14479

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

andrross
Copy link
Member

@andrross andrross commented Jun 20, 2024

Add a new static setting that lets an operator choose specific ingest processors to enable by name. The behavior is as follows:

  • If the allowlist setting is not defined, all installed processors are enabled. This is the status quo.
  • If the allowlist setting is defined as the empty set, then all processors are disabled.
  • If the allowlist setting contains the names of valid processors, only those processors are enabled.
  • If the allowlist setting contains a name of a processor that does not exist, then the server will fail to start with an IllegalStateException listing which processors were defined in the allowlist but are not installed.
  • If the allowlist setting is changed between server restarts then any ingest pipeline using a now-disabled processor will fail. This is the same experience if a pipeline used a processor defined by a plugin but then that plugin were to be uninstalled across restarts.

Related to #14439

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

❕ Gradle check result for 1722742: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link
Contributor

✅ Gradle check result for a4d3675: SUCCESS

Add a new static setting that lets an operator choose specific ingest
processors to enable by name. The behavior is as follows:

- If the allowlist setting is not defined, all installed processors are
  enabled. This is the status quo.
- If the allowlist setting is defined as the empty set, then all processors
  are disabled.
- If the allowlist setting contains the names of valid processors, only those
  processors are enabled.
- If the allowlist setting contains a name of a processor that does not exist,
  then the server will fail to start with an IllegalStateException
  listing which processors were defined in the allowlist but are not
  installed.
- If the allowlist setting is changed between server restarts then any
  ingest pipeline using a now-disabled processor will fail. This is the
  same experience if a pipeline used a processor defined by a plugin but
  then that plugin were to be uninstalled across restarts.

Related to opensearch-project#14439

Signed-off-by: Andrew Ross <andrross@amazon.com>
@@ -109,6 +112,13 @@ public class IngestService implements ClusterStateApplier, ReportingService<Inge

private static final Logger logger = LogManager.getLogger(IngestService.class);

public static final Setting<List<String>> PROCESSORS_ALLOWLIST_SETTING = Setting.listSetting(
"ingest.processors.allowlist",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be ingest.pipeline.processors.allowlist? Note we'll have the corresponding search variant as well. "pipeline" felt redundant for "ingest", because in the context of the code base "ingest" is kind of short hand for "ingest pipeline". That's not true in the case of "search"/"search pipepline".

Copy link
Collaborator

@reta reta Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andrross , three issue here (as I see it):

  • the processors are coming from IngestCommonPlugin so we should reflect that by prefixing the setting with the plugins as per our conventions (and moving it to plugin/module as well):

    plugins.ingest.pipeline.processors.
    
  • the plugin should filter out the processors (IngestCommonPlugin::getProcessors) with respect to the settings configured, the IngestService should not be involved I believe since the processors could be contributed by other plugins (the service would never see these processors)

  • nit, may be just using *.allowed suffix is sufficient? No strong opinions vs allowlist but list sounds like a bit redundant

Copy link
Member Author

@andrross andrross Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reta I don't think we should scope this to just the ingest-common plugin. This is intended to be an allowlist that opts in to a specific set of processors, regardless of which plugin they come from. My understanding is that IngestService is the single place to register all processors from all plugins, hence the place that the "allowlist" filter should go.

Copy link
Collaborator

@reta reta Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks very inconsistent to me - why we do that in core? What if plugin is uninstalled? The setting looks detached from the context to me. And again, processors could come from other plugins, installing the plugin AND asking for the settings update (which will happen 100% if specified in first place) does not look like friendly experience.

Copy link
Member Author

@andrross andrross Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the specific use case: I want to enable some, but not all, of the processors defined in the neural-search plugin and none of the processors in ingest-common.

Agree there are sharp edges here, but it is intended only for very customized installations and most users/operators will never need to interact with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fe. because we have no idea of how plugins implement things

That's true, but individual processors are already top-level, user-facing concepts that end users can reference in isolation, so processor implementations must provide a reasonable experience if used by themselves.

I still think a core-level allow list for ingest processors is reasonable, mostly because using named processors is part of the core API. I think it is better than the alternative of implementing separate settings in each of ingest-common, ingest-user-agent, ingest-geo-ip, and neural-search in order to meet the requirement I listed above.

Could we end up building an allowlist setting for every extension point that registers named extensions? Maybe. I would say we can take it on a case-by-case basis. It probably won't make sense for everything.

Copy link
Collaborator

@reta reta Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, but individual processors are already top-level, user-facing concepts that end users can reference in isolation, so processor implementations must provide a reasonable experience if used by themselves.

That is the thing, none of the processors is subject of undesigned exclusions / inclusions list, by supporting that where is actually is implemented would make perfect sense.

Could we end up building an allowlist setting for every extension point that registers named extensions? Maybe. I would say we can take it on a case-by-case basis. It probably won't make sense for everything.

Case by case won't apply here (that would be a mess), we are clearly having a very generic problem: there are external pieces coming to core that we try to impose the control on.

As a tangible example here - the issue with codecs opensearch-project/custom-codecs#148 , this is 100% same use case where some things (codecs) are coming from plugin to core that we want to restrict. I thought we settled on the solution there that it has to be custom-codecs responsibility to configure codecs which is in disagreement with this change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see opensearch-project/custom-codecs#148 as slightly different because there is no concept of "experimental" codecs today, which is what that issue is trying to implement. We don't currently have a use case for defining experimental codecs directly in the core (to my knowledge), so I'm hesitant about adding that concept only to support a use case needed by custom-codecs. If the request was for a setting to select/restrict supported codecs, selecting from the union of plugin-provided and core-provided codecs, then I think I would be fine with the same pattern here.

Case by case won't apply here (that would be a mess)

The requirement is to enable none of the processors in ingest-common, ingest-user-agent, and ingest-geo-ip and a subset of processors in neural-search. Practically speaking, isn't defining a distinct setting in each of those places more of a mess than a single core-defined setting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the custom distribution requires a feature to turn off few processors and they are packaged as modules today in the distribution.

My thoughts, I would like an answer to "does exposing a knob to choose/allow list/deny list processors help the community". Thinking about it I'm not convinced enough reading through the feature request.[1]

by supporting that where is actually is implemented would make perfect sense.

I agree with @reta on this, the implementation lies far away from where its getting excluded. I would prefer to have the provider make the decision what to provide(processors in this case).

If we take defining plugin by plugin approach, I see we have to re-implement the setting as suggested by @andrross .

I have another way, what if we define the setting ingest.processors.allowlist and let the plugins filter out the processors they'd like to provide ?
The upside is, :server module is not bloated with custom configuration and code.
The downside is, the plugins can choose to ignore the configuration, a.k.a no real enforcement of setting by OpenSearch. Is this a reasonable downside we can live with?

[1] #14439

Copy link
Member Author

@andrross andrross Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside is, the plugins can choose to ignore the configuration, a.k.a no real enforcement of setting by OpenSearch

@saratvemulapalli Eh, I think that's a pretty big downside. This would be a pretty obscure setting, and there is an extremely high likelihood that it would be unintentionally missed by plugin developers. Additionally, the crux of @reta's point is that the provider should determine whether or not things are excludable at all, and I don't think silently ignoring an apparent exclusion is a good experience (which is what would happen if a provider intentionally ignored the setting).

I'd be more inclined to implement the setting in neural-search and then just rm -rf the unneeded modules.

Comment on lines -146 to -147
mockBulkRequest = mock(BulkRequest.class);
lenient().when(mockBulkRequest.batchSize()).thenReturn(1);
Copy link
Member Author

@andrross andrross Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies in advance to reviewers, there are a couple small refactorings here that weren't strictly necessary:

  • This mock is needed at all. I replaced all usages with new BulkRequest().
  • I replaced the custom mapOf functions with the JDK-provided ones

Copy link
Contributor

❌ Gradle check result for f15a5c0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants