[RFC] Centralized Scheduled Job Identity Management #2905

Open
cwperks opened this issue Jun 26, 2023 · 10 comments
Labels
enhancement New feature or request triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable.

Comments

@cwperks
Member

cwperks commented Jun 26, 2023

Scheduled Job Security

Abstract

The JobScheduler Plugin extends the core capabilities of OpenSearch by providing a mechanism for running work on a schedule. Many default plugins of OpenSearch take advantage of job scheduling to perform different tasks. Examples include:

JobScheduler is a simple plugin that lets other plugins register jobs with the scheduler. Job Scheduler periodically sweeps all registered jobs to determine whether a job is due to be run. If a job is due, Job Scheduler invokes the job in the respective plugin. Since OpenSearch is a distributed system, Job Scheduler also provides a locking service to prevent a job from being run on multiple nodes when that is not necessary.
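
As a rough illustration of the sweep-and-lock behaviour described above, here is a minimal single-process sketch. The names `Job`, `LockService`, and `sweep` are hypothetical and are not Job Scheduler's actual classes; the in-memory dict only stands in for the distributed lock.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    job_id: str
    interval_seconds: int
    runner: Callable[[str], None]  # callback owned by the registering plugin
    next_run: float = 0.0


class LockService:
    """Toy stand-in for the distributed lock: at most one holder per job id."""

    def __init__(self) -> None:
        self._held: Dict[str, str] = {}

    def acquire(self, job_id: str, node: str) -> bool:
        if job_id in self._held:
            return False
        self._held[job_id] = node
        return True

    def release(self, job_id: str, node: str) -> None:
        if self._held.get(job_id) == node:
            del self._held[job_id]


def sweep(jobs: List[Job], locks: LockService, node: str, now: float) -> List[str]:
    """Run every due job whose lock this node can take; return the ids that ran."""
    ran = []
    for job in jobs:
        if now < job.next_run:
            continue  # not due yet
        if not locks.acquire(job.job_id, node):
            continue  # another node already holds the lock for this job
        try:
            job.runner(job.job_id)
            job.next_run = now + job.interval_seconds
            ran.append(job.job_id)
        finally:
            locks.release(job.job_id, node)
    return ran
```

A job is skipped either because it is not yet due or because another node holds its lock, which is the property the locking service exists to provide.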

Current State of Job Scheduler Security

The JobScheduler Plugin is an ExtensiblePlugin meaning that other plugins can extend an extension point defined by JobScheduler to define jobs and register them with the Job Scheduler plugin. Each plugin that registers a job will register an index associated with the jobs and for every new instance of a job, the plugin will write a document to its jobs index with metadata it needs to run the job including the job’s schedule.

In OpenSearch, requests are stateless and the Security Plugin will wrap each REST Handler to authenticate the request. When a request is authenticated the Security Plugin serializes the authenticated user into the ThreadContext. In today’s security model, plugins will:

  1. Define an API that creates a scheduled job when handling the REST Request
  2. Parse the User from the ThreadContext using libraries in common-utils
  3. Store a copy of the user at the time of scheduled job creation (Example from Anomaly Detection).
     • The information they store includes the mapped roles of the user which are used later on for running the job.
  4. When the scheduled job is created successfully, the plugin returns a success message in response to the API request. At this point the request to create a job has been handled and the thread handling it has finished running, so the ThreadContext is no longer available.

Later on, Job Scheduler will determine that the newly scheduled job is due to be run. At that point in time the flow is typically:

  1. When it comes time to run the job, there is an entirely new unauthenticated ThreadContext.
  2. The plugin’s JobRunner’s runJob method is called, and the first thing job runners typically do is read the info they stored about the job from their job index
  3. This includes reading the copy of the user at the time of scheduled job creation
  4. Plugins utilize a feature of the security plugin called Roles Injection to inject the roles they stored alongside the job definition into the ThreadContext to instruct the Security Plugin to use those roles when evaluating permissions
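
The two flows above can be sketched roughly as follows. This is a simplified model, not plugin code: `role_mappings`, `jobs_index`, `create_job`, and `run_job` are hypothetical names, and plain dicts stand in for the security index, the plugin's jobs index, and the ThreadContext.

```python
from typing import Dict, List

# Hypothetical stand-ins: the security plugin's live role mappings
# and a plugin-owned jobs index.
role_mappings: Dict[str, List[str]] = {"anomaly_user": ["all_access"]}
jobs_index: Dict[str, dict] = {}


def create_job(job_id: str, authenticated_user: str) -> None:
    """Request time: the plugin copies the user's mapped roles into its own job document."""
    jobs_index[job_id] = {
        "name": job_id,
        "user": {
            "name": authenticated_user,
            "roles": list(role_mappings[authenticated_user]),
        },
    }


def run_job(job_id: str) -> Dict[str, List[str]]:
    """Run time: a fresh, unauthenticated context into which the plugin injects stored roles."""
    thread_context: Dict[str, List[str]] = {}
    stored_user = jobs_index[job_id]["user"]
    thread_context["injected_roles"] = stored_user["roles"]  # roles injection
    return thread_context
```

If the user's mapping changes after `create_job` is called, `run_job` still injects the copy stored with the job, because nothing connects the plugin's document back to the live mapping.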

Problems with current state

  1. Plugins dictate to security what permissions to evaluate requests with
  2. There are dangling references to users outside of the security index(es)
  3. Changes to user’s authz are not propagated to the copies stored in the disparate plugin jobs indices
  4. Without referential integrity for users, if a user is deleted then there may be orphaned references stored in the disparate plugin jobs indices
  5. Most plugin jobs indices are system indices meaning a user needs to connect with admin certificate to change the identity associated with a job
  6. There is no UX to present or manage identities associated with jobs - it is opaque

Proposed Change - Centralized Scheduled Job Identity Management

To solve for the problems listed above, the security team is proposing a new feature to manage identities associated with scheduled jobs called Centralized Scheduled Job Identity Management. The goal of centralized scheduled job identity management is to invert the control flow of plugins dictating what permissions to run a job with and instead put Security/Identity into the middle to enforce that scheduled jobs run with security and reduce the burden of security on plugin developers. Roles Injection is one of the features of the security plugin that has to be deprecated and removed in order to move towards the goal of running Custom Plugins in OpenSearch securely.

Centralization is a key component of this feature. Centralization in this sense means storing the identities associated with scheduled jobs in a single secure index owned by the Security Plugin. By storing the identities in a single index, the Security plugin can better enforce referential integrity, meaning that changes to users can be propagated to this index to ensure that jobs work with correct authz or can be disabled/revoked if a user is removed.

To put the Identity system into the middle of the control flow, a new interface will be added to the IdentityPlugin to define a ScheduledJobIdentityManager which will provide key methods that must be implemented to provide security for scheduled jobs

  1. On the Creation of a Job the identity system will implement associateUserWithScheduledJob to store the currently authenticated user in the central scheduled job identity index
  2. When it comes time to Run a Job, the identity system will provideSecurityContext for the job which means populating the ThreadContext (in case of plugins) or issuing an access token (in case of extensions)
  3. To support Deletion of a Job, the identity system will provide a deleteScheduledJobIdentity to remove the scheduled job identity entry from the centralized scheduled job identity index
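
To make the shape of the three methods above concrete, here is a minimal sketch. It is an illustration only: the method names are transliterated from the list above, while the signatures, the `(job_index, job_id)` key, the returned context shape, and `InMemoryIdentityManager` are assumptions that the RFC does not specify.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple

JobKey = Tuple[str, str]  # (job_index, job_id), an assumed way to reference a job


class ScheduledJobIdentityManager(ABC):
    """Hypothetical rendering of the proposed extension point."""

    @abstractmethod
    def associate_user_with_scheduled_job(self, job_index: str, job_id: str, user: dict) -> None:
        """Job creation: record the currently authenticated user for this job."""

    @abstractmethod
    def provide_security_context(self, job_index: str, job_id: str) -> Optional[dict]:
        """Job run: return the context the job should run under, or None if unknown."""

    @abstractmethod
    def delete_scheduled_job_identity(self, job_index: str, job_id: str) -> None:
        """Job deletion: drop the identity entry for this job."""


class InMemoryIdentityManager(ScheduledJobIdentityManager):
    """Toy implementation; a dict stands in for the central identity index."""

    def __init__(self) -> None:
        self._identities: Dict[JobKey, dict] = {}

    def associate_user_with_scheduled_job(self, job_index, job_id, user):
        self._identities[(job_index, job_id)] = dict(user)

    def provide_security_context(self, job_index, job_id):
        user = self._identities.get((job_index, job_id))
        if user is None:
            return None  # no identity on record, so no security context is issued
        return {"user": user["name"], "roles": list(user.get("roles", []))}

    def delete_scheduled_job_identity(self, job_index, job_id):
        self._identities.pop((job_index, job_id), None)
```

In this sketch the plugin never supplies roles at run time; it only asks the manager for a context keyed by the job reference.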

The next bullet points are optional (not supported in the current system), but would make this system robust for OpenSearch extensibility and provide a richer Job Scheduling experience by introducing Scheduled Job Identity Management

  1. If a job is run out of process (as with extensions) this system will support refreshSecurityContext to issue new access tokens to the job runner for longer running jobs
  2. To better support changing the identity associated with a scheduled job, this system could allow associateUserWithScheduledJob to accept a user other than the currently authenticated one. If a user with high enough privileges can associate jobs with other users, then this method may also allow changing the identity associated with a job.

Diagrams

Current Job Definition of Anomaly Detector with copy of user at time of scheduled job creation

.opendistro-anomaly-detector-jobs
{
    "name": "ad-job-1",
    "schedule":
    {
        "interval":
        {
            "start_time": 16844430461234,
            "period": 10,
            "unit": "Minutes"
        }
    },
    "window_delay":
    {
        "period":
        {
            "interval": 1,
            "unit": "Minutes"
        }
    },
    "enabled": true,
    "enabled_time": 1684443041234,
    "last_update_time": 1684443041234,
    "lock_duration_seconds": 600,
    "user":
    {
        "name": "anomaly_user",
        "backend_roles":
        [
            "backend_role"
        ],
        "roles":
        [
            "all_access"
        ],
        "custom_attribute_names":
        [],
        "user_requested_tenant": "global"
    }
}

With Centralized Scheduled Job Identity Management this would be broken up and stored as:

.opendistro-anomaly-detector-jobs
{
    "name": "ad-job-1",
    "schedule":
    {
        "interval":
        {
            "start_time": 16844430461234,
            "period": 10,
            "unit": "Minutes"
        }
    },
    "window_delay":
    {
        "period":
        {
            "interval": 1,
            "unit": "Minutes"
        }
    },
    "enabled": true,
    "enabled_time": 1684443041234,
    "last_update_time": 1684443041234,
    "lock_duration_seconds": 600
}

and the security scheduled job operator index:

.opendistro-security-scheduled-job-identity
{
  "job_id": "<job_id>",
  "job_index": ".opendistro-anomaly-detector-jobs", // This and the field above are a direct reference to the document in .opendistro-anomaly-detector-jobs index 
  "user":
    {
        "name": "anomaly_user",
        "backend_roles":
        [
            "backend_role"
        ],
        "roles":
        [
            "all_access"
        ],
        "custom_attribute_names":
        [],
        "user_requested_tenant": "global"
    }
}
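
The split shown in the documents above amounts to moving the `user` object out of the plugin's job document and into a separate identity document that points back at the job. A minimal sketch, assuming a hypothetical helper named `split_job_document`:

```python
import copy
from typing import Tuple


def split_job_document(job_index: str, job_id: str, job_doc: dict) -> Tuple[dict, dict]:
    """Return (plugin job document without the user, central identity document)."""
    plugin_doc = copy.deepcopy(job_doc)  # leave the caller's document untouched
    user = plugin_doc.pop("user", None)
    identity_doc = {
        "job_id": job_id,        # together with job_index, references the job document
        "job_index": job_index,
        "user": user,
    }
    return plugin_doc, identity_doc
```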

Job Creation

[Figure: Centralized Scheduled Job Identity Management - Job Creation]

Job Execution

graph LR

subgraph JobScheduler
  A[Job Execution] --> B[Anomaly Detector Job]
end


subgraph OpenSearch.IdentityPlugin
  B --> C[ScheduledJobIdentityManager.provideSecurityContext]
end

subgraph AnomalyDetectorPlugin
  C --> D[Anomaly Detection Job Runner]
end

subgraph OpenSearch
  D --> E[Anomaly Detection Results]
end

Implementation Considerations

Since plugins often stash the ThreadContext when writing a job to their jobs index, the security plugin will not be able to read the currently authenticated user from the thread context. To solve for this, the security team will create a new secure enclave of the ThreadContext for Security information, and that context will be unstashable and immutable. With plugins, security will not rely on the plugin to supply the identity to associate with a scheduled job; it will always default to the currently authenticated user.
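
One way to picture the "unstashable and immutable" enclave is a context whose ordinary headers can be stashed while the security portion is exposed only as a read-only view. This is a conceptual sketch, not OpenSearch's `ThreadContext` API; the class and attribute names here are made up for illustration.

```python
from types import MappingProxyType
from typing import Dict, Mapping


class ThreadContext:
    """Toy context: ordinary headers can be stashed, the security enclave cannot."""

    def __init__(self, security: Mapping[str, str]) -> None:
        # Read-only view over a private copy, so callers cannot mutate it.
        self._security = MappingProxyType(dict(security))
        self.headers: Dict[str, str] = {}

    @property
    def security(self) -> Mapping[str, str]:
        return self._security

    def stash(self) -> Dict[str, str]:
        """Clear the ordinary headers and return them; the enclave is left untouched."""
        stashed, self.headers = self.headers, {}
        return stashed
```

After a plugin calls `stash()`, the authenticated user is still readable from `security`, which is what the identity system would need when associating the job.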

Problem with authz lookup for external users

With internal users it's possible to evaluate their roles at any given moment when authorizing a request (*except for host mapping where the calling IP/Address or hostname is needed). For external users (SAML, OIDC and LDAP) there is currently no mechanism in the security plugin for looking up the latest user information at any given time. The security plugin extracts backend roles from external identity providers on successful login with the external identity provider, but in a scheduled job there is no login process with the external IdP.

In the current model of scheduled job security, roles are frozen at the time of scheduled job creation.

Appendix

mapped roles - mapped roles are not the same as direct user role mappings. The security plugin has an expressive roles mapping system which supports mapping users directly or indirectly to roles in 4 different ways:

  1. Direct user mapping - i.e. userA is mapped to roleX
  2. Through a backend role mapping - i.e. backendRole1 is mapped to roleY. If userA has backendRole1 then they are indirectly mapped to roleY
  3. Host Mapping - Requests from a certain IP or hostname (wildcard supported) are mapped to roleZ
  4. And_Backend_Roles mapping - Like backend roles, but the user must have all of the backend roles in the list to be mapped to that role
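
The four mapping paths above can be sketched as a single resolution function. This is a simplified model under stated assumptions: `RoleMapping` and `resolve_mapped_roles` are hypothetical names, and host patterns are matched with shell-style wildcards, which may differ from the security plugin's actual matching rules.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Dict, List, Optional, Set


@dataclass
class RoleMapping:
    users: List[str] = field(default_factory=list)
    backend_roles: List[str] = field(default_factory=list)
    hosts: List[str] = field(default_factory=list)
    and_backend_roles: List[str] = field(default_factory=list)


def resolve_mapped_roles(
    mappings: Dict[str, RoleMapping],
    user: str,
    backend_roles: Set[str],
    host: Optional[str] = None,
) -> Set[str]:
    """Return every role the user is mapped to, directly or indirectly."""
    resolved: Set[str] = set()
    for role, m in mappings.items():
        direct = user in m.users
        via_backend = bool(backend_roles & set(m.backend_roles))
        via_host = host is not None and any(fnmatchcase(host, p) for p in m.hosts)
        # and_backend_roles: the user must hold every backend role in the list
        via_all = bool(m.and_backend_roles) and set(m.and_backend_roles) <= backend_roles
        if direct or via_backend or via_host or via_all:
            resolved.add(role)
    return resolved
```

When `host` is `None`, as it would be for a scheduled job with no inbound request, host mappings simply cannot contribute any roles.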

Scheduled Job Identity / Scheduled Job Operator are terms that can be used interchangeably

@cwperks cwperks added enhancement New feature or request untriaged Require the attention of the repository maintainers and may need to be prioritized labels Jun 26, 2023
@DarshitChanpura
Member

[Triage] Thanks for filing the RFC. Community members are welcome to chime in with their thoughts.

@DarshitChanpura DarshitChanpura added triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable. and removed untriaged Require the attention of the repository maintainers and may need to be prioritized labels Jun 26, 2023
@peternied
Member

peternied commented Jul 5, 2023

Proposed Change - Centralized Scheduled Job Identity Management

Centralization implies there is something for job scheduler to be focused around. Could this be addressed directly by requiring the identity of an operator when a job is scheduled?

I would advocate for switching to a model that uses REST APIs for scheduling jobs so verification can be performed on parameters - allowing requirements to be strictly enforced and prevent propagation of bad practices in future code.


Sounds like there is a requirement for a 'long term' user impersonation system, using the Security Plugin to store refresh token (?) / generate fresh access tokens. I am in 100% agreement that we need to build this, could we get a separate design for this system? Note: I don't think we need all that detail in this design, but I'd love to see this further explored.


Could you make a sequence diagram showing the before vs after state of the system with the capture/injection of user context?


Finally, building off of the REST API concept, I think Job Scheduler's implementation is pretty hard to follow. Instead of just centralizing job identity, if we had all the core job information in a controlled location this would enable better management for OpenSearch operators. I think there are major improvements around traceability and management of scheduled jobs if we add legit management APIs.

@cwperks
Member Author

cwperks commented Jul 13, 2023

Thank you for the reply @peternied. The problem with adding a REST API for job registration is that Job Scheduler is written in a way to provide the highest flexibility to plugins to store data needed for the running of a job. That must include the schedule, for job scheduler to parse, but otherwise it's left up to plugins to store any metadata specific to that job that they need to run. As an example, contrast anomaly-detection to reporting.

  1. Anomaly Detection Job - mappings/anomaly-detector-jobs.json
  2. Reporting Job - Example of a reporting job

By giving the plugin/extension the same flexibility to write a job definition that suits its needs, it would provide a path to more secure job scheduling with minimal changes required by plugins/extensions. A one-size-fits-all REST API would not be able to provide as strong a schema as index mappings do today, so IMO it's better to retain the flexibility of job definitions in disparate plugin jobs indices, but to centralize the identities (perhaps the schedules could be centralized as well)

Could you make a sequence diagram showing the before vs after state of the system with the capture/injection of user context?

I will formalize this further in a mermaid diagram, but this is a simplified sequence for the current plugin scheduled job security

  1. Plugin reads user from TC
  2. Plugin stashes TC
  3. Plugin writes job to job index with copy of user
  4. JS runs a job when it's due to be run
  5. In the plugin job runner, the copy of the user is read from what was stored with the job
  6. Roles injection

In centralized job security it would be as follows:

The workflows are analogous between plugin and extension, but I'll break it down for each:

Plugin

  1. Plugin writes job to job index
  2. Identity associates job with currently authenticated user
  3. JS runs a job when it's due to be run
  • Identity provides the security context in this step. This would not rely on plugin to do roles injection

Extension

  1. Extension uses service account token to write to job index
  2. Extension supplies the on-behalf-of token for the authenticated user in the job document. (Similar to how JS expects schedule: to exist)
  3. Identity verifies the supplied token and associates the job with the user of the token that was supplied
  4. JS runs a job when it's due to be run
  • In this step Identity issues an access (on-behalf-of) token for the user that the job is associated with

@cwperks
Member Author

cwperks commented Jul 13, 2023

@peternied I suppose it is possible for Job Scheduler to write to any plugin's jobs index and still be given the index mapping guarantees from the comment above. Instead of validation being done on the REST Request, Job Scheduler would try indexing what it is given and give back a response based on the outcome of the indexing action. I imagine this was not done for plugins, because plugins don't need to use the REST-layer when performing actions but since extensions are out of process it could or should be reconsidered.

To zoom in on what job registration could look like using a REST API:

POST /_plugins/_job_scheduler/job/register 
--headers
Content-Type: application/json
Authorization: Bearer <obo_token>
--data
// Some of these are required, some are optional and there would also need to be support for custom metadata
{
    "name": "ad-job-1",
    "schedule":
    {
        "interval":
        {
            "start_time": 16844430461234,
            "period": 10,
            "unit": "Minutes"
        }
    },
    "window_delay":
    {
        "period":
        {
            "interval": 1,
            "unit": "Minutes"
        }
    },
    "enabled": true,
    "enabled_time": 1684443041234,
    "last_update_time": 1684443041234,
    "lock_duration_seconds": 600
}

I am still gravitating towards the design in the OP of this RFC for ease of transition, but the REST API brings up an interesting point. At any point with the scheduling of jobs, should we have a grantable permission to a user that is simply: "Can this user schedule jobs?"

In the current system it's implicitly permissioned. i.e. UserA can create detectors, therefore UserA implicitly gives permission for a job to be scheduled on their behalf. Should it instead be 2 questions?

  1. Can UserA create a detector?
  2. If the answer above is yes, then does UserA also give permission to schedule jobs on their behalf?

There currently is nothing around the second question in the current system and it could be argued that users should be empowered to control and know what is happening on their behalf in the system.
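
The difference between the implicit model and the two-question model can be sketched as a single check. The permission strings below are hypothetical and do not correspond to real OpenSearch action names; they only mark where a separate, grantable "schedule on my behalf" permission would sit.

```python
from typing import Set

# Hypothetical permission names, for illustration only.
CREATE_DETECTOR = "detector:create"
SCHEDULE_ON_BEHALF = "job:schedule_on_behalf"


def can_create_scheduled_detector(permissions: Set[str], require_explicit_consent: bool) -> bool:
    """Question 1 is always asked; question 2 only in the explicit model."""
    if CREATE_DETECTOR not in permissions:
        return False
    if require_explicit_consent:
        return SCHEDULE_ON_BEHALF in permissions
    return True  # today's behaviour: creating the detector implies scheduling
```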

@krishna-ggk
Contributor

This is a great and thorough proposal, and thank you, @cwperks, for putting it together. I have a few thoughts, and I would love to understand your perspectives.

While the job-scheduler appears to be a good focal point for a solution, there are other plugins like cross-cluster-replication that rely on the persistent task framework - another async task framework that persists the security context in the cluster state. Given that we are moving towards having appropriate security constructs for async jobs, this would be a great opportunity to design a common solution for all async use cases.

@peternied: Sounds like there is a requirement for a 'long term' user impersonation system, using the Security Plugin to store refresh token (?) / generate fresh access tokens. I am in 100% agreement that we need to build this, could we get a separate design for this system? Note: I don't think we need all that detail in this design, but I'd love to see this further explored.

+1. The coupling between the core (Identity) and knowledge of job-scheduler entities seems odd and may be more appropriate when the job-scheduler is moved to the core and can be generalized for all async use cases. In the meantime, as a first step, we could consider the notion of a verifiable signed token for such async use cases, which can optionally be time-bound (e.g., async search use cases may not require the identity to be valid for a lifetime).

Roles Injection is one of the features of the security plugin that has to be deprecated and removed in order to move towards the goal of running Custom Plugins in OpenSearch securely.

I would like to understand more about this consideration. At a high level, I agree that the current implementation of role injection may not be the best. However, role-based identities are common in cloud services such as AWS. For example, IAM roles can be associated with a background activity to be used by AWS services. This can be super helpful when an Admin user (who may be part of both the feature and admin roles) initiates an action. If we go by all the permissions of the identity, it would mean that the stored identity has more privileges than necessary.

@krishna-ggk
Contributor

krishna-ggk commented Jul 17, 2023

If the answer above is yes, then does UserA also give permission to schedule jobs on their behalf?

While this adds more control, the job is more of an internal detail and the user will have to understand that a feature uses this internal detail. This can be remedied to an extent with concepts like action-group, but playing devil's advocate, this seems like overkill. Essentially, what use cases can we accomplish by not permitting scheduling the job but exposing the feature?

@cwperks
Member Author

cwperks commented Jul 17, 2023

@krishna-ggk Thank you for the review and feedback! I will try to generalize this outside of only the context of job scheduling to any async operations like async-search. As @peternied mentioned, the focus should be on a 'long-term' user impersonation system like the refresh token/access token model of OAuth and that's something I was considering in an earlier iteration.

One of the goals I had in this design was creating a migration path from plugin -> extension to make the experience seamless to an end-user. i.e. end user is mostly unaware of anomaly detection running as a plugin or as an extension.

Through the process of analyzing and designing security for extensions we have encountered quite a few areas in the plugin model that we'd like to change for extensions or any third-party code and create mechanisms to keep the core secure without implicitly trusting an extension as is done in the plugin ecosystem today.

I would like to understand more about this consideration. At a high level, I agree that the current implementation of role injection may not be the best.

It would be great to have a feature where a user can define the permissions a job can run with, especially if the user setting up a job has a high level of privilege like an admin. Roles injection is not used for that in the plugin ecosystem though. Roles injection is typically used outside of an authenticated user context to simulate the original user's permissions and plugins have the freedom to inject any roles into the threadcontext. That works in an ecosystem with implicitly trusted plugins, but can be exploited to run any action with the all_access role.

Utilizing tokens, I would be in favor of assigning permissions to tokens and using ephemeral (in-memory) roles when evaluating the permissions of a token.

Something like:

POST /_plugins/_security/api/token
{
  "permissions": {
    "index_permissions": 
    [
      {
        "index_patterns": ["logs_*"],
        "allowed_actions": ["read"]
      }
    ],
    "cluster_permissions": ["cluster_composite_ops"]
  },
  "expires_after": "3600" // 1hr
}
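
To show how permissions attached to a token like the one above might be evaluated, here is a minimal sketch. The `Token` and `IndexPermission` types and `token_allows` are hypothetical, actions such as `read` are compared as literal strings rather than expanded as action groups, and index patterns use shell-style wildcards.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List


@dataclass(frozen=True)
class IndexPermission:
    index_patterns: List[str]
    allowed_actions: List[str]


@dataclass(frozen=True)
class Token:
    index_permissions: List[IndexPermission]
    cluster_permissions: List[str]
    issued_at: float      # seconds since epoch
    expires_after: float  # lifetime in seconds

    def is_expired(self, now: float) -> bool:
        return now >= self.issued_at + self.expires_after


def token_allows(token: Token, index: str, action: str, now: float) -> bool:
    """Evaluate an index action against the token's own (ephemeral) permissions."""
    if token.is_expired(now):
        return False
    return any(
        action in perm.allowed_actions
        and any(fnmatchcase(index, pattern) for pattern in perm.index_patterns)
        for perm in token.index_permissions
    )
```

No stored role is consulted here; the permissions travel with the token and stop applying once it expires.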

@shikharj05
Contributor

Great proposal @cwperks! I have a couple of queries/thoughts:

Changes to user’s authz are not propagated to the copies stored in the disparate plugin jobs indices

If a user's authZ change is a change to the permissions in a role, this should be reflected today, right? Or do you mean that a change to a user's mapping to a role (and hence a job's mapping to that role) doesn't get reflected once stored?

Identity associates job with currently authenticated user
JS runs a job when it's due to be run
Identity provides the security context in this step. This would not rely on plugin to do roles injection

This wouldn't solve the problem we have for plugins today about authz changes(user to role mapping) not propagating, right?

If the answer above is yes, then does UserA also give permission to schedule jobs on their behalf?

We should evaluate whether userA has permission to schedule jobs. Checking whether userA has permission to grant on-behalf-of access might be a good addition too. For example, if an admin wants to restrict which users can generate on-behalf-of tokens directly or indirectly while accessing features of the system, they should be able to do so. This improves security for on-behalf-of tokens as well.

What are your thoughts about treating jobs as entities and hence to support mapping of jobs to roles? i.e. instead of relying on calling user permissions and storing them in an index can we save a job configuration object which includes a role to be used by the job. Only authorized-users would be able to schedule jobs, we can also perform additional check such that user scheduling a job with a role mapping should also have access to the role being mapped to the job - this ensures users cannot use jobs to escalate privileges. I haven't thought thoroughly on this but I know this needs thinking on solving the plugin -> extension migration path as well.

@cwperks
Member Author

cwperks commented Jul 18, 2023

Thank you @shikharj05! See responses below.

If user's authZ change is w.r.t change in permissions in a role, this should be reflected today, right?

The current implementation of Job Scheduler Security stores a copy of the user with mapped roles and later injects those back into the ThreadContext at job execution time. If the permissions on one of those roles change, then that would get reflected at runtime. On the other hand, if a user is mapped to additional roles or removed from roles, then those changes do not get propagated. I would describe this as freezing the roles at the time that the job is created, but that is not transparent to the user.

This wouldn't solve the problem we have for plugins today about authz changes(user to role mapping) not propagating, right?

One of the goals of centralization is to solve this problem. In the current model of JS Security, plugins store copies of the user in many different indices so there is no referential integrity. If there's a change in a User, the security plugin has no awareness of all of the places the user is stored in OpenSearch to update those references.

What are your thoughts about treating jobs as entities and hence to support mapping of jobs to roles?

IMO I would prefer to think of a "token" entity where a token has permissions associated with it and a token can be associated with a job at job registration. (See the end of this comment: #2905 (comment)). Some users can create tokens and the token's permissions would be a subset of the user's permissions.
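
The subset rule mentioned above (a token's permissions must not exceed those of the user creating it) can be sketched as a simple set check. `validate_token_request` is a hypothetical helper, and permissions are modelled as flat strings for brevity.

```python
from typing import Set


def validate_token_request(user_permissions: Set[str], requested: Set[str]) -> Set[str]:
    """Return the permissions to put on the token, or fail if they exceed the user's own."""
    excess = requested - user_permissions
    if excess:
        raise PermissionError(
            f"requested permissions exceed the user's own: {sorted(excess)}"
        )
    return set(requested)
```

A check of this kind is what would prevent a token attached to a job from being used to escalate privileges.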

One solution I previously considered for scheduled job/async ops security was to allow users to create auth tokens and supply those auth tokens at job registration. You can imagine another tab in the security-dashboards-plugin called "tokens" that would be similar to how the "roles" section works. Tokens can be managed in this section where they would be able to be revoked or created. Creation of a token would look similar to creating a role where a token can have a name associated with it and permissions. One of the problems I was thinking about with this approach was that tokens should not have indefinite lifetimes which was not compatible with the current JS Security model where once a job is scheduled it will work forever. In the proposal in this OP, it would keep issuing access tokens every time a job run is triggered but it would also provide a mechanism for a user to disable an individual job and stop it from issuing new access tokens.

@cwperks
Member Author

cwperks commented Jul 18, 2023

+1. The coupling between the core (Identity) and knowledge of job-scheduler entities seems odd and may be more appropriate when the job-scheduler is moved to the core and can be generalized for all async use cases. In the meantime, as a first step, we could consider the notion of a verifiable signed token for such async use cases, which can optionally be time-bound (e.g., async search use cases may not require the identity to be valid for a lifetime).

I would argue that Security and any plugin that utilizes roles injection are tightly coupled today. This proposal reduces the coupling by introducing a new interface in core called IdentityPlugin.getScheduledJobIdentityManager (or something to that effect). That extension point defines an interface that an IdentityPlugin would need to implement to provide security for scheduled jobs which would minimally be 3 methods (any input around naming is encouraged!):

  1. associateJobWithUser
  2. provideSecurityContext
  3. removeScheduledJobIdentity

JobScheduler would need the ability to get a handle on the currently installed IdentityService (or a Noop if one is not installed) and call these methods in the appropriate places. provideSecurityContext would be a key extension point here and would give plugins an offramp from roles injection for scheduled job security if this solution is also extended to plugins.

...which can optionally be time-bound

Does this mean that indefinite tokens could or should be allowed? IMO I like giving cluster admins the option of issuing indefinite access tokens without recommending them as the most secure solution. Of course they would always have the option to time-bound an access token and revoke any tokens suspected of being compromised.


5 participants