
Add support for try and undo allocate #508

Merged
merged 8 commits into from
Aug 14, 2023

Conversation

atantawi
Collaborator

Two calls are added: one to try to allocate a consumer, and one to undo the effect of the allocation trial if it is not accepted. If the trial is accepted, no further action is needed. Otherwise, the undo has to be called right after the try allocation, without any intervening calls that change the trees or allocate/deallocate consumers. These operations are intended for Normal mode only; a minimal caller-side usage sketch follows the list below.

  • TryAllocate(treeName, consumerID)
  • UndoAllocate(treeName, consumerID)
  • TryAllocateForest(forestName, consumerID)
  • UndoAllocateForest(forestName, consumerID)
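
To make the intended call sequence concrete, here is a minimal caller-side sketch. It is not code from the PR: the tryThenMaybeUndo name, the *quota.Manager type, and the keep-if-allocated policy are assumptions for illustration, while the TryAllocate/UndoAllocate signatures match the test shown later in this thread. The forest variants follow the same pattern with a forest name.

// Hypothetical illustration of the try/undo protocol; the caller must ensure
// that nothing else changes the trees or allocates/deallocates consumers
// between the two calls.
func tryThenMaybeUndo(manager *quota.Manager, treeName string, consumerID string) (bool, error) {
	response, err := manager.TryAllocate(treeName, consumerID)
	if err != nil {
		return false, err
	}
	if response.IsAllocated() {
		// Trial accepted: the allocation stands, no further action needed.
		return true, nil
	}
	// One possible caller policy: the trial was not accepted, so undo it
	// immediately, before any other quota-manager calls.
	if err := manager.UndoAllocate(treeName, consumerID); err != nil {
		return false, err
	}
	return false, nil
}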

Contributor

@z103cb left a comment


Tests that cover concurrent usage of TryAllocate, etc. are needed.

Contributor

@z103cb left a comment


@atantawi,

When I added the test case below to pkg/quotaplugins/quota-forest/quota-manager/quota/quotamanager_test.go, I received this error:

--- FAIL: TestQuotaManagerParallelTryUndoAllocation (0.00s)
    --- FAIL: TestQuotaManagerParallelTryUndoAllocation/Gold_consumer_2 (0.01s)
        /Users/laurentiu.bradin/work/repos/multi-cluster-app-dispatcher/pkg/quotaplugins/quota-forest/quota-manager/quota/quotamanager_test.go:512: 
            	Error Trace:	quotamanager_test.go:512
            	Error:      	Received unexpected error:
            	            	failed undo allocate tree name test-tree
            	Test:       	TestQuotaManagerParallelTryUndoAllocation/Gold_consumer_2
            	Messages:   	No error expected when calling UndoAllocate consumer
FAIL
FAIL	github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/quotaplugins/quota-forest/quota-manager/quota	0.233s
FAIL

This points toward an issue with using the TryAllocate / UndoAllocate combo from multiple threads. I did not spend a lot of time trying to understand the true cause of the issue, but I suspect that the single snapshot you are using in the tree controller might not be a good thing.

While the current MCAD design only has a single thread (ScheduleNext()) and the implementation you are proposing might be sufficient for that case, the implementation in this PR severely limits any future refactoring that might want to use multiple goroutines to increase the parallelism and throughput of MCAD. I strongly suggest that you improve the overall design of the qm library to account for this use case.

My objection to merging this PR as-is is based on the limitations exposed by this test case.

cc: @astefanutti

func TestQuotaManagerParallelTryUndoAllocation(t *testing.T) {
	forestName := "unit-test-1"
	qmManagerUnderTest := quota.NewManager()

	err := qmManagerUnderTest.AddForest(forestName)
	assert.NoError(t, err, "No error expected when adding a forest")
	testTreeName, err := qmManagerUnderTest.AddTreeFromString(
		`{
			"kind": "QuotaTree",
			"metadata": {
			  "name": "test-tree"
			},
			"spec": {
			  "resourceNames": [
				"cpu",
				"memory"
			  ],
			  "nodes": {		
				"root": {
				  "parent": "nil",
				  "hard": "true",
				  "quota": {
					"cpu": "10",
					"memory": "256"
				  }
				},		
				"gold": {
				  "parent": "root",
				  "hard": "true",
				  "quota": {
					"cpu": "10",
					"memory": "256"
				  }
				}
			  }
			}
		}`)
	if err != nil {
		assert.Fail(t, "No error expected when adding a tree to forest")
	}
	err = qmManagerUnderTest.AddTreeToForest(forestName, testTreeName)
	assert.NoError(t, err, "No error expected when adding a tree from forest")
	modeSet := qmManagerUnderTest.SetMode(quota.Normal)
	assert.True(t, modeSet, "Setting the mode should not fail.")

	// Define the test table
	var tests = []struct {
		name     string
		consumer utils.JConsumer
	}{
		// the table itself
		{"Gold consumer 1",
			utils.JConsumer{
				Kind: "Consumer",
				MetaData: utils.JMetaData{
					Name: "gold-consumer-data-1",
				},
				Spec: utils.JConsumerSpec{
					ID: "gold-consumer-1",
					Trees: []utils.JConsumerTreeSpec{
						{
							TreeName: testTreeName,
							GroupID:  "gold",
							Request: map[string]int{
								"cpu":    4,
								"memory": 4,
								"gpu":    0,
							},
							Priority:      0,
							CType:         0,
							UnPreemptable: false,
						},
					},
				},
			},
		},
		// the table itself
		{"Gold consumer 2",
			utils.JConsumer{
				Kind: "Consumer",
				MetaData: utils.JMetaData{
					Name: "gold-consumer-data-2",
				},
				Spec: utils.JConsumerSpec{
					ID: "gold-consumer-2",
					Trees: []utils.JConsumerTreeSpec{
						{
							TreeName: testTreeName,
							GroupID:  "gold",
							Request: map[string]int{
								"cpu":    4,
								"memory": 4,
								"gpu":    0,
							},
							Priority:      0,
							CType:         0,
							UnPreemptable: false,
						},
					},
				},
			},
		},
	}
	// Execute tests in parallel
	for _, tc := range tests {
		tc := tc // capture range variable
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			// Get list of quota management tree IDs
			qmTreeIDs := qmManagerUnderTest.GetTreeNames()

			consumerInfo, err := quota.NewConsumerInfo(tc.consumer)
			assert.NoError(t, err, "No error expected when building consumer")
			assert.Contains(t, qmTreeIDs, tc.consumer.Spec.Trees[0].TreeName)

			added, err := qmManagerUnderTest.AddConsumer(consumerInfo)
			assert.NoError(t, err, "No error expected when adding consumer")
			assert.True(t, added, "Consumer is expected to be added")

			response, err := qmManagerUnderTest.TryAllocate(tc.consumer.Spec.Trees[0].TreeName, consumerInfo.GetID())
			if assert.NoError(t, err, "No error expected when calling TryAllocate consumer") {
				assert.Equal(t, 0, len(strings.TrimSpace(response.GetMessage())), "An empty response is expected")
				assert.True(t, response.IsAllocated(), "The allocation should succeed")
			}

			//simulate a long running consumer that has quota allocated
			time.Sleep(10 * time.Millisecond)

			err = qmManagerUnderTest.UndoAllocate(tc.consumer.Spec.Trees[0].TreeName, consumerInfo.GetID())
			assert.NoError(t, err, "No error expected when calling UndoAllocate consumer")

		})
	}
}

@atantawi
Collaborator Author

Added test described in #508 (review) into quotamanagerundo_test.go
The current design puts the burden on the caller of the quota manager (MCAD in this case) to make sure that TryAllocate() and UndoAllocate() run as an atomic pair. A lock is needed for that purpose, as documented in the README file.
An alternate design is to have that lock internal to the quota manager. This was discussed and an issue will be created. In the meantime, I believe we can move forward with the current design, given its limitations, until an alternative is developed.
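
For illustration, here is a sketch of how a caller such as MCAD could hold that lock; the package name, quotaLock, and tryDispatchWithQuota are hypothetical, and the *quota.Manager type is assumed from quota.NewManager(). Every other call that mutates trees or consumers would need to take the same lock for the pair to be truly atomic.

package scheduler // hypothetical caller-side package, e.g. the MCAD dispatcher

import (
	"sync"

	"github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/quotaplugins/quota-forest/quota-manager/quota"
)

// quotaLock serializes TryAllocate/UndoAllocate pairs against all other
// quota-manager calls, as the current design requires of the caller.
var quotaLock sync.Mutex

func tryDispatchWithQuota(manager *quota.Manager, treeName, consumerID string) error {
	quotaLock.Lock()
	defer quotaLock.Unlock()

	response, err := manager.TryAllocate(treeName, consumerID)
	if err != nil {
		return err
	}
	if response.IsAllocated() {
		return nil // trial accepted: keep the allocation
	}
	// Trial not accepted: undo before releasing the lock so no other
	// goroutine can observe or modify the intermediate state.
	return manager.UndoAllocate(treeName, consumerID)
}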

@z103cb
Contributor

z103cb commented Aug 11, 2023

Added test described in #508 (review) into quotamanagerundo_test.go
The current design puts the burden on the caller of the quota manager (MCAD in this case) to make sure that TryAllocate() and UndoAllocate() run as an atomic pair. A lock is needed for that purpose, as documented in the README file.
An alternate design is to have that lock internal to the quota manager. This was discussed and an issue will be created. In the meantime, I believe we can move forward with the current design, given its limitations, until an alternative is developed.

This makes sense. Enhancing the README to explicitly define the usage in concurrent situations and fixing the failing unit test would allow this PR to be merged.

@asm582
Member

asm582 commented Aug 11, 2023

/approve

@z103cb
Contributor

z103cb commented Aug 14, 2023

/lgtm

@openshift-ci bot added the lgtm label Aug 14, 2023
Contributor

@z103cb left a comment


/approve

@openshift-ci

openshift-ci bot commented Aug 14, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asm582, z103cb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 55c21c1 into project-codeflare:main Aug 14, 2023
2 checks passed
@atantawi deleted the undo-support branch August 14, 2023 13:31