SDK Quota Support #3102

Merged
kaiwalyajoshi merged 52 commits into master from sdk-quota on Sep 7, 2019
Conversation

@kaiwalyajoshi (Collaborator) commented Jun 12, 2019

This feature requires support for enforceRole on Marathon groups, found in Marathon v1.9.73 and Mesos v1.9.0, available starting with DC/OS 2.0.
By default Marathon does not set enforceRole=true on group creation, so existing semantics are maintained.

Deploy new service in a group with quota enabled

Hello-World is used in the example below, but this is applicable to any SDK-based service.
To create a service named /dev/hello-world in the group dev, with quota consumed from the dev role:

  1. Create a group definition with enforceRole set:
cat >create-group.json<<EOF
{
    "id":"/dev",
    "enforceRole":true
}
EOF
  2. Create the Marathon group:
dcos marathon group add create-group.json
  3. Populate the service options:
cat >hello-world-dev-options.json<<EOF
{
    "service":{
        "name":"/dev/hello-world"
    }
}
EOF
  4. Install the service:
dcos package install hello-world --options=hello-world-dev-options.json --yes
  5. Ensure the SDK scheduler and pods have been launched under the dev role via the Mesos UI (a CLI check is sketched below).
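
The same check can be done from the CLI. A sketch, assuming dcos service --json exposes the framework's role/roles fields (the same fields this PR's test helpers read) and that jq is installed:
# Show which role(s) the new framework is subscribed with.
dcos service --json | jq '.[] | select(.name == "/dev/hello-world") | {role, roles}'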

Migrate an existing deployed service to use quota support

To upgrade an existing service to a new version of the SDK with quota support, use the following procedure.
We again use Hello-World, pre-installed in the group foo, in the example below, but this is applicable to any SDK-based service.

  1. Create a file with the current service name and the following additional options:
cat >hello-world-foo-options.json<<EOF
{
    "service":{
        "name":"/foo/hello-world",
        "role": "foo",
        "enable_role_migration": true
    }
}
EOF
  • role: Specifies the quota enforced role we're migrating towards, which is foo in this example.
  • enable_role_migration: Notifies the scheduler that its pods will be migrated between legacy and quota enforced roles. The scheduler subscribes with both roles when this flag is set.
  2. Update the scheduler to use the quota enforced role:
dcos hello-world --name="/foo/hello-world" update start --options=hello-world-foo-options.json
  3. At this point the scheduler will be upgraded and will consume quota from the foo role. The deployed pods will be unaffected and will keep their previous roles.
  4. Issue pod replace commands to migrate all the pods in the service to the quota enforced role:
dcos hello-world --name="/foo/hello-world" pod replace hello-0

The hello-0 pod will be migrated to consume quota from the foo role; a loop over every pod is sketched below.
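
A minimal sketch of migrating every pod, assuming the SDK CLI's pod list subcommand returns a JSON array of pod names and that jq is installed:
# Replace each pod so it re-launches under the quota enforced role.
# In practice, wait for each replacement to complete (e.g. via pod status) before starting the next.
for pod in $(dcos hello-world --name="/foo/hello-world" pod list | jq -r '.[]'); do
    dcos hello-world --name="/foo/hello-world" pod replace "$pod"
done
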
  5. Create a file with the current service name and the following options to signal the end of the migration:

cat >hello-world-foo-disable-migration.json<<EOF
{
    "service":{
        "name":"/foo/hello-world",
        "role": "foo",
        "enable_role_migration": false
    }
}
EOF
  6. Update the scheduler to stop subscribing to the legacy role:
dcos hello-world --name="/foo/hello-world" update start --options=hello-world-foo-disable-migration.json

At this point, the scheduler and all the previously running pods have been migrated to the quota enforced role.
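
To confirm, inspect the roles the framework is subscribed with. A sketch, assuming dcos service --json exposes the framework's role/roles fields and that jq is installed:
# After the final update, only the quota enforced role should remain.
dcos service --json | jq '.[] | select(.name == "/foo/hello-world") | {role, roles}'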

Strict Mode Clusters

For strict mode clusters, additional role permissions are required and must be set up before deploying the service.

  1. New service in a group with enforceRole=true.
    Example:
  • A new service named /dev/hello-world needs permissions for the dev role:
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:dev create
  2. Migrating an existing service to a quota enforced role.
    Example:
  • An existing service named /foo/hello-world needs permissions for both the foo and foo__hello-world-role roles:
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:foo create
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:foo__hello-world-role create

Pod Pre-Reserved Roles

For pods which specify pre-reserved roles (e.g. slave_public), the scheduler will issue a hierarchical role depending on the value of role.

Example:

  • Pod Pre-Reserved Role: slave_public and role=slave_public. This permission is required:
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:slave_public/dev__hello-world-role create
  • Pod Pre-Reserved Role: slave_public and role=dev. This permission is required:
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:slave_public/dev create

When performing a migration from legacy to enforced group roles via enable_role_migration, both of the permissions above are required.
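
For example, migrating a service /dev/hello-world whose pods use the slave_public pre-reserved role needs both of the grants shown above:
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:slave_public/dev__hello-world-role create
dcos security org users grant <service-account> dcos:mesos:master:reservation:role:slave_public/dev create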

Downgrading to an older non-quota-aware version of the scheduler

This section details the procedure for downgrading from a quota enforced role to a shipped release that is not quota aware.
The process is the same as migrating an existing service to the quota enforced role, run in reverse.
The key difference is that role should be slave_public to indicate migration towards the legacy roles.

cat >hello-world-foo-downgrade.json<<EOF
{
    "service":{
        "name":"/foo/hello-world",
        "role": "slave_public",
        "enable_role_migration": true
    }
}
EOF

The remaining scheduler update and pod replace operations must then be issued to move the scheduler and pods into the legacy roles (sketched below).
Once all the pods have been migrated, the service can be downgraded to an earlier release which isn't quota aware.
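
Concretely, the remaining steps mirror the migration procedure above, now using the downgrade options file (hello-0 is illustrative; repeat the replace for every pod in the service):
dcos hello-world --name="/foo/hello-world" update start --options=hello-world-foo-downgrade.json
dcos hello-world --name="/foo/hello-world" pod replace hello-0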

JIRA: DCOS-54278

* Retrieve role from env.

* Remove irritating 100 line checkstyle limitation.

* Name change getSchedulerNamespace -> getServiceNamespace

* Add support for configurable roles via an environment-variable.
* Add non-namespaced role to list of roles to subscribe with.

* Add new Builders that copy existing instances.

* Add missing getters/setters from Builder.

* Print associated role with incoming offers.

* Add support for maintaining current role on service-role update.

* Arguments to assert were swapped, expected in place of actual.

* Compare contents of lists as ordering can change and isn't important here.

* Fix failing tests for scheduler upgrade.

* Fix runtime errors, undo test changes.

* WIP: Role changes do not trigger requirement for new TaskConfig.

* WIP: Disable validation of TaskVolumes for now.

* Prevent scheduler from applying a role change on incomplete previous deployment. Include conditional validation for role changes.

* Revert fixRoleChange introduced earlier.
- Update scheduler to new role.
- Update pods to new role via pod replace, restart scheduler in between each replaced pod to ensure mixed-mode roles are applicable.
- Add additional pods post scheduler update, ensure that new pods are launched under the new role.
Configure the role the framework subscribes with via these two environment variables:

- MARATHON_ENFORCE_GROUP_ROLE - Determines if we use the Mesos-supplied role for quota or revert to legacy behaviour.
- MESOS_ALLOCATION_ROLE - Specifies the role the scheduler subscribes to Mesos with, and the role the new footprint will be created under.
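
Since Marathon injects these based on the group settings, one way to see what a deployed scheduler received is to inspect its Marathon app definition. A sketch, assuming the variables appear in the app's env and that jq is installed:
# Print the two quota-related environment variables from the scheduler's app definition.
dcos marathon app show /foo/hello-world | jq '.env | {MARATHON_ENFORCE_GROUP_ROLE, MESOS_ALLOCATION_ROLE}'
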
@kaiwalyajoshi changed the title from "[DRAFT] [WIP] SDK Quota Support" to "SDK Quota Support" Jul 15, 2019
@kaiwalyajoshi removed the wip label Jul 15, 2019
kaiwalyajoshi and others added 5 commits July 24, 2019 19:48
…uration as Marathon now injects this based on the group settings.
…efaultResourceSet.java

Co-Authored-By: Tarun Gupta Akirala <takirala@users.noreply.github.com>
…efaultResourceSet.java

Co-Authored-By: Tarun Gupta Akirala <takirala@users.noreply.github.com>
…/DefaultConfigValidators.java

Co-Authored-By: Tarun Gupta Akirala <takirala@users.noreply.github.com>
…ulerConfig.java

Co-Authored-By: Tarun Gupta Akirala <takirala@users.noreply.github.com>
@kaiwalyajoshi kaiwalyajoshi marked this pull request as ready for review September 4, 2019 16:50
@kaiwalyajoshi kaiwalyajoshi mentioned this pull request Sep 5, 2019
@takirala (Contributor) previously approved these changes Sep 5, 2019 and left a comment:
Had a couple of queries; good to merge after those are answered. 🚢

Had some ideas on code cleanup in test_quota_deployment and the usage of null in SchedulerConfig. I created suggestions.diff.txt, a diff file with some suggestions. Nothing blocking though.


# Add an extra pod to each.
marathon_config["env"]["HELLO_COUNT"] = "3"
marathon_config["env"]["WORLD_COUNT"] = "4"
Contributor:

I haven't had a closer look, but would this affect the minimum number of nodes on the cluster (due to placement constraints)? And if yes, is this within that limit?

Collaborator (author):
Both hello and world pods have the placement constraint "[[\"hostname\", \"UNIQUE\"]]"; in our SI runs we spin up a cluster with 5 agents, so this fits into the normal test configuration.

current_task_roles = service_roles["task-roles"]

# We must have some role!
assert len(current_task_roles) > 0
Contributor:

We can delete this line.

@@ -98,6 +98,7 @@ PyJWT==1.7.1
pylint==2.3.1
PyNaCl==1.3.0
pytest==4.1.0
pytest-dependency==0.4.0
Contributor:
I think we have to add this to test_requirements.txt as well?

Collaborator (author):
Don't think so; the pytest harness gets all of its deps from frozen_requirements.txt.


if "roles" in service_state:
# MUTI_ROLE
current_service_roles["framework-roles"] = service_state["roles"]
Contributor:
So if this is multi-role, what would be the result of service_state["role"] - is it a non-existent key, or does it default to an empty list?

Collaborator (author):
service_state["role"] = "*"

Co-Authored-By: Tarun Gupta Akirala <takirala@users.noreply.github.com>
@kaiwalyajoshi kaiwalyajoshi merged commit 124bdb3 into master Sep 7, 2019
@kaiwalyajoshi kaiwalyajoshi deleted the sdk-quota branch December 16, 2019 08:41