SDK Quota Support #3102

Merged
kaiwalyajoshi merged 52 commits into master from sdk-quota on Sep 7, 2019
Conversation

@kaiwalyajoshi (Collaborator) commented Jun 12, 2019

This feature requires support for enforceRole on Marathon groups, found in Marathon v1.9.73 and Mesos v1.9.0, available starting with DC/OS 2.0.
By default Marathon does not set enforceRole=true on group creation, so existing semantics are maintained.

Deploy new service in a group with quota enabled

Hello-World is used in the example below, but this is applicable to any SDK-based service.
To create a service named /dev/hello-world in the group dev, with quota consumed from the dev role:

  1. Create a group definition with enforceRole set:
cat >create-group.json<<EOF
{
    "id":"/dev",
    "enforceRole":true
}
EOF
  2. Create the Marathon group:
dcos marathon group add create-group.json
  3. Populate the service options:
cat >hello-world-dev-options.json<<EOF
{
    "service":{
        "name":"/dev/hello-world"
    }
}
EOF
  4. Install the service:
dcos package install hello-world --options=hello-world-dev-options.json --yes
  5. Ensure the SDK scheduler and pods have been launched under the dev role via the Mesos UI (a CLI check is sketched below).
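
The same check can be done from the CLI. A sketch, assuming dcos service --json exposes the framework's role/roles fields (the same fields this PR's test helpers read) and that jq is installed:
# Show which role(s) the new framework is subscribed with.
dcos service --json | jq '.[] | select(.name == "/dev/hello-world") | {role, roles}'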

Migrate an existing deployed service to use quota support

To upgrade an existing service to a new version of the SDK with quota support, use the following procedure.
We again use Hello-World, pre-installed in the group foo, in the example below, but this is applicable to any SDK-based service.

  1. Create a file with the current service name and the following additional options:
cat >hello-world-foo-options.json<<EOF
{
    "service":{
        "name":"/foo/hello-world",
        "role": "foo",
        "enable_role_migration": true
    }
}
EOF
  • role: Specifies the quota enforced role we're migrating towards, which is foo in this example.
  • enable_role_migration: Notifies the scheduler that its pods will be migrated between legacy and quota enforced roles. The scheduler subscribes with both roles when this flag is set.
  2. Update the scheduler to use the quota enforced role:
dcos hello-world --name="/foo/hello-world" update start --options=hello-world-foo-options.json
  3. At this point the scheduler will be upgraded and will consume quota from the foo role. The deployed pods will be unaffected and will keep their previous roles.
  4. Issue pod replace commands to migrate all the pods in the service to the quota enforced role:
dcos hello-world --name="/foo/hello-world" pod replace hello-0

The hello-0 pod will be migrated to consume quota from the foo role; a loop over every pod is sketched below.
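
A minimal sketch of migrating every pod, assuming the SDK CLI's pod list subcommand returns a JSON array of pod names and that jq is installed:
# Replace each pod so it re-launches under the quota enforced role.
# In practice, wait for each replacement to complete (e.g. via pod status) before starting the next.
for pod in $(dcos hello-world --name="/foo/hello-world" pod list | jq -r '.[]'); do
    dcos hello-world --name="/foo/hello-world" pod replace "$pod"
done
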
  5. Create a file with the current service name and the following options to signal the end of the migration:

cat >hello-world-foo-disable-migration.json<<EOF
{
    "service":{
        "name":"/foo/hello-world",
        "role": "foo",
        "enable_role_migration": false
    }
}
EOF
  6. Update the scheduler to stop subscribing to the legacy role:
dcos hello-world --name="/foo/hello-world" update start --options=hello-world-foo-disable-migration.json

At this point, the scheduler and all the previously running pods have been migrated to the quota enforced role.
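
To confirm, inspect the roles the framework is subscribed with. A sketch, assuming dcos service --json exposes the framework's role/roles fields and that jq is installed:
# After the final update, only the quota enforced role should remain.
dcos service --json | jq '.[] | select(.name == "/foo/hello-world") | {role, roles}'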

Strict Mode Clusters

For strict mode clusters, additional role permissions are required and must be set up before deploying the service.

  1. New service in a group with enforceRole=true.
    Example:
  • A new service named /dev/hello-world needs permissions for the dev role:
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:dev create
  2. Migrating an existing service to a quota enforced role.
    Example:
  • An existing service named /foo/hello-world needs permissions for both the foo and foo__hello-world-role roles:
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:foo create
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:foo__hello-world-role create

Pod Pre-Reserved Roles

For pods which specify pre-reserved roles (e.g. slave_public), the scheduler will issue a hierarchical role depending on the value of role.

Example:

  • Pod Pre-Reserved Role: slave_public and role=slave_public. This permission is required:
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:slave_public/dev__hello-world-role create
  • Pod Pre-Reserved Role: slave_public and role=dev. This permission is required:
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:slave_public/dev create

When performing a migration from legacy to enforced group roles via enable_role_migration, both of the permissions above are required.
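
For example, migrating a service /dev/hello-world whose pods use the slave_public pre-reserved role needs both of the grants shown above:
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:slave_public/dev__hello-world-role create
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:slave_public/dev create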

Downgrading to an older non-quota-aware version of the scheduler

This section details the procedure for downgrading from a quota enforced role to a shipped release that is not quota aware.
The process is the same as migrating an existing service to the quota enforced role, run in reverse.
The key difference is that role should be slave_public to indicate migration towards the legacy roles.

cat >hello-world-foo-downgrade.json<<EOF
{
    "service":{
        "name":"/foo/hello-world",
        "role": "slave_public",
        "enable_role_migration": true
    }
}
EOF

The remaining scheduler update and pod replace operations must then be issued to move the scheduler and pods into the legacy roles (sketched below).
Once all the pods have been migrated, the service can be downgraded to an earlier release which isn't quota aware.
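
Concretely, the remaining steps mirror the migration procedure above, now using the downgrade options file (hello-0 is illustrative; repeat the replace for every pod in the service):
dcos hello-world --name="/foo/hello-world" update start --options=hello-world-foo-downgrade.json
dcos hello-world --name="/foo/hello-world" pod replace hello-0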

JIRA: DCOS-54278

* Retrieve role from env.

* Remove irritating 100 line checkstyle limitation.

* Name change getSchedulerNamespace -> getServiceNamespace

* Add support for configurable roles via an environment-variable.
* Add non-namespaced role to list of roles to subscribe with.

* Add new Builders that copy existing instances.

* Add missing getters/setters from Builder.

* Print associated role with incoming offers.

* Add support for maintaining current role on service-role update.

* Arguments to assert were swapped, expected in place of actual.

* Compare contents of lists as ordering can change and isn't important here.

* Fix failing tests for scheduler upgrade.

* Fix runtime errors, undo test changes.

* WIP: Role changes do not trigger requirement for new TaskConfig.

* WIP: Disable validation of TaskVolumes for now.

* Prevent scheduler from applying a role change on incomplete previous deployment. Include conditional validation for role changes.

* Revert fixRoleChange introduced earlier.
- Update scheduler to new role.
- Update pods to new role via pod replace, restart scheduler in between each replaced pod to ensure mixed-mode roles are applicable.
- Add additional pods post scheduler update, ensure that new pods are launched under the new role.
Configure the role the framework subscribes with via these two environment variables:

- MARATHON_ENFORCE_GROUP_ROLE - Determines if we use the Mesos-supplied role for quota or revert to legacy behaviour.
- MESOS_ALLOCATION_ROLE - Specifies the role the scheduler subscribes to Mesos with, and the role the new footprint will be created under.
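
Since Marathon injects these based on the group settings, one way to see what a deployed scheduler received is to inspect its Marathon app definition. A sketch, assuming the variables appear in the app's env and that jq is installed:
# Print the two quota-related environment variables from the scheduler's app definition.
dcos marathon app show /foo/hello-world | jq '.env | {MARATHON_ENFORCE_GROUP_ROLE, MESOS_ALLOCATION_ROLE}'
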
@kaiwalyajoshi changed the title from "[DRAFT] [WIP] SDK Quota Support" to "SDK Quota Support" Jul 15, 2019
@kaiwalyajoshi removed the wip label Jul 15, 2019
kaiwalyajoshi and others added 5 commits July 24, 2019 19:48
…uration as Marathon now injects this based on the group settings.
…efaultResourceSet.java

Co-Authored-By: Tarun Gupta Akirala <takirala@users.noreply.github.com>
…efaultResourceSet.java

Co-Authored-By: Tarun Gupta Akirala <takirala@users.noreply.github.com>
…/DefaultConfigValidators.java

Co-Authored-By: Tarun Gupta Akirala <takirala@users.noreply.github.com>
…ulerConfig.java

Co-Authored-By: Tarun Gupta Akirala <takirala@users.noreply.github.com>
@kaiwalyajoshi kaiwalyajoshi marked this pull request as ready for review September 4, 2019 16:50
@kaiwalyajoshi kaiwalyajoshi mentioned this pull request Sep 5, 2019
@takirala (Contributor) previously approved these changes Sep 5, 2019 and left a comment:
Had a couple of queries; good to merge after those are answered. 🚢

Had some ideas on code cleanup in test_quota_deployment and the usage of null in SchedulerConfig. I created suggestions.diff.txt, a diff file with some suggestions. Nothing blocking though.


# Add an extra pod to each.
marathon_config["env"]["HELLO_COUNT"] = "3"
marathon_config["env"]["WORLD_COUNT"] = "4"
Contributor:

I haven't had a closer look, but would this affect the minimum number of nodes on the cluster (due to placement constraints)? And if yes, is this within that limit?

Collaborator (author):
Both hello and world pods have the placement constraint "[[\"hostname\", \"UNIQUE\"]]"; in our SI runs we spin up a cluster with 5 agents, so this fits into the normal test configuration.

current_task_roles = service_roles["task-roles"]

# We must have some role!
assert len(current_task_roles) > 0
Contributor:

We can delete this line.

@@ -98,6 +98,7 @@ PyJWT==1.7.1
pylint==2.3.1
PyNaCl==1.3.0
pytest==4.1.0
pytest-dependency==0.4.0
Contributor:
I think we have to add this to test_requirements.txt as well?

Collaborator (author):
Don't think so; the pytest harness gets all of its deps from frozen_requirements.txt.


if "roles" in service_state:
# MUTI_ROLE
current_service_roles["framework-roles"] = service_state["roles"]
Contributor:
So if this is multi-role, what would be the result of service_state["role"] - is it a non-existent key, or does it default to an empty list?

Collaborator (author):
service_state["role"] = "*"

Co-Authored-By: Tarun Gupta Akirala <takirala@users.noreply.github.com>
@kaiwalyajoshi kaiwalyajoshi merged commit 124bdb3 into master Sep 7, 2019
@kaiwalyajoshi kaiwalyajoshi deleted the sdk-quota branch December 16, 2019 08:41