MGMT-4066 Add enabled operator CPU and memory requirements to host validation #1167

jordigilh
Contributor

@jordigilh jordigilh commented Feb 24, 2021

This PR updates the host CPU and memory validations to include the resource requests of the enabled operators, so that the validation checks whether the available CPU and usable memory are sufficient to deploy the selected operators (LSO, CNV, or OCS) on that node.

@machacekondra @masayag @pkliczewski @jakub-dzon @gamli75 can you take a look in the meantime?

/Jordi

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2021
@app-sre-bot

Can one of the admins verify this patch?

@openshift-ci-robot openshift-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. api-review Categorizes an issue or PR as actively needing an API review. labels Feb 24, 2021
@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch from 15f8a5b to fc8fa9c Compare February 24, 2021 16:58
Contributor

@jakub-dzon jakub-dzon left a comment

I think that the spirit of the https://issues.redhat.com/browse/MGMT-4066 story was to take operator requirements into account while checking general OCP memory and cpu requirements in the host validation (like:

or . using GetCPURequirementFor* and GetMemoryRequirementFor* Operator methods.

@gamli75 - is the above description correct?
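
For illustration, a minimal sketch of the idea described above: the per-host requirement is the base OCP requirement plus whatever each enabled operator requests. The `Operator` interface here is a hypothetical, trimmed-down stand-in (the real `GetCPURequirementFor*` and `GetMemoryRequirementFor*` methods take a cluster argument), and all numbers are made up.

```go
package main

import "fmt"

// Operator is a hypothetical, simplified view of the operator API; the real
// methods in the project have a different signature.
type Operator interface {
	GetCPURequirementForWorker() (int64, error)
	GetMemoryRequirementForWorker() (int64, error)
}

// staticOperator returns fixed, made-up requirements for illustration only.
type staticOperator struct {
	cpuCores int64
	memoryMB int64
}

func (s staticOperator) GetCPURequirementForWorker() (int64, error)    { return s.cpuCores, nil }
func (s staticOperator) GetMemoryRequirementForWorker() (int64, error) { return s.memoryMB, nil }

// totalWorkerRequirements adds each enabled operator's request to the base
// requirement for a worker host and returns the totals (cores, MB).
func totalWorkerRequirements(baseCPU, baseMemoryMB int64, enabled []Operator) (int64, int64, error) {
	cpu, mem := baseCPU, baseMemoryMB
	for _, op := range enabled {
		c, err := op.GetCPURequirementForWorker()
		if err != nil {
			return 0, 0, err
		}
		m, err := op.GetMemoryRequirementForWorker()
		if err != nil {
			return 0, 0, err
		}
		cpu += c
		mem += m
	}
	return cpu, mem, nil
}

func main() {
	enabled := []Operator{
		staticOperator{cpuCores: 1, memoryMB: 1024},
		staticOperator{cpuCores: 4, memoryMB: 2048},
	}
	cpu, mem, err := totalWorkerRequirements(2, 8192, enabled)
	if err != nil {
		panic(err)
	}
	fmt.Printf("required per worker: %d cores, %d MB\n", cpu, mem)
}
```

A host validation would then compare these totals against the host's own CPU count and memory instead of the base values alone.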

// sum(Node usable memory) > sum (minimum memory required for each enabled operator)
func (v *clusterValidator) sufficientMemoryAvailableForOperators(c *clusterPreprocessContext) validationStatus {

opMgr := operators.NewManager(v.log)
Contributor

Please use the manager that's created in main.go- the OCS validator is stateful and that may in the future have impact on this code.
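
A rough sketch of what this suggestion could look like: the manager is built once and handed to every validator through its constructor, so any state it holds is shared. `Manager`, `validator`, and `newValidator` are hypothetical names for this example, not the project's actual types.

```go
package main

import "fmt"

// Manager is a hypothetical stand-in for the operators manager. The counter
// represents any state a stateful validator might keep.
type Manager struct {
	calls int
}

// ValidateHost mutates the manager's state, which is why a fresh instance per
// call could behave differently from a shared one.
func (m *Manager) ValidateHost() {
	m.calls++
}

// validator receives the manager instead of constructing its own.
type validator struct {
	operators *Manager
}

func newValidator(m *Manager) *validator {
	return &validator{operators: m}
}

func (v *validator) validate() {
	v.operators.ValidateHost()
}

func main() {
	shared := &Manager{} // created once, e.g. at program start-up
	hostValidator := newValidator(shared)
	clusterValidator := newValidator(shared)

	hostValidator.validate()
	clusterValidator.validate()

	fmt.Println("calls seen by the shared manager:", shared.calls)
}
```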

Contributor Author

Thanks. Will do.

return boolValue(true)
}

// gather the total usable memory available in the cluster
Contributor

What do you think about extracting the total cluster memory collection into a method?

Contributor Author

I'll take a look at the code to see if this logic is being used elsewhere. Thanks for the suggestion 😃

Contributor Author

Done: moved to its own struct method
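
As an illustration of the extraction discussed here, a minimal sketch of a method that sums usable memory across hosts. It assumes each host carries its inventory as a JSON string; the struct layout and the JSON field names (`memory`, `usable_bytes`) are assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inventory mirrors only the field this sketch needs; the JSON field names
// are assumptions.
type inventory struct {
	Memory struct {
		UsableBytes int64 `json:"usable_bytes"`
	} `json:"memory"`
}

// host and cluster are hypothetical, simplified types.
type host struct {
	Inventory string
}

type cluster struct {
	Hosts []host
}

// totalUsableMemory parses each host's inventory and returns the sum of the
// usable memory in bytes, or an error if any inventory cannot be parsed.
func (c *cluster) totalUsableMemory() (int64, error) {
	var total int64
	for i, h := range c.Hosts {
		var inv inventory
		if err := json.Unmarshal([]byte(h.Inventory), &inv); err != nil {
			return 0, fmt.Errorf("host %d: parsing inventory: %w", i, err)
		}
		total += inv.Memory.UsableBytes
	}
	return total, nil
}

func main() {
	c := &cluster{Hosts: []host{
		{Inventory: `{"memory":{"usable_bytes":17179869184}}`}, // 16 GiB
		{Inventory: `{"memory":{"usable_bytes":8589934592}}`},  // 8 GiB
	}}
	total, err := c.totalUsableMemory()
	if err != nil {
		panic(err)
	}
	fmt.Println("total usable bytes:", total)
}
```

The same shape would work for the CPU total, with the parsing done in one place rather than in each validation.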


// gather the total CPU available in the cluster
var tca int64
for _, h := range c.cluster.Hosts {
Contributor

same here

Contributor Author

moved as well.

)

func (v validationID) category() (string, error) {
switch v {
case IsMachineCidrDefined, isMachineCidrEqualsToCalculatedCidr, isApiVipDefined, isApiVipValid, isIngressVipDefined, isIngressVipValid,
isClusterCidrDefined, isServiceCidrDefined, noCidrOverlapping, networkPrefixValid, IsDNSDomainDefined, IsNtpServerConfigured:
return "network", nil
- case AllHostsAreReadyToInstall, SufficientMastersCount:
+ case AllHostsAreReadyToInstall, SufficientMastersCount, SufficientMemoryAvailable, SufficientCPUAvailable:
Member

I just wonder whether this should be part of "operators" instead of "hosts-data", because it is actually operator-related. But I would leave this decision to someone from the assisted team.

Contributor Author

Agreed. I'd also like to hear @gamli75 's feedback on this.

@jordigilh
Contributor Author

I think that the spirit of the https://issues.redhat.com/browse/MGMT-4066 story was to take operator requirements into account while checking general OCP memory and cpu requirements in the host validation (like:

or

. using GetCPURequirementFor* and GetMemoryRequirementFor* Operator methods.
@gamli75 - is the above description correct?

Agreed. I will refactor the code to align with this idea.

@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch from fc8fa9c to 0f365e3 Compare February 27, 2021 23:27
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2021
@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch 5 times, most recently from b44b1c1 to d6dae62 Compare February 28, 2021 18:00
@gamli75
Contributor

gamli75 commented Mar 1, 2021

I think that the spirit of the https://issues.redhat.com/browse/MGMT-4066 story was to take operator requirements into account while checking general OCP memory and cpu requirements in the host validation (like:

or

. using GetCPURequirementFor* and GetMemoryRequirementFor* Operator methods.
@gamli75 - is the above description correct?

Yes, this is correct.

@jordigilh
Contributor Author

jordigilh commented Mar 1, 2021

@avishayt What do you think about adding host disk requirement validation? OCS has such requirements and we can add them to the CPU and memory validations.
Update: after further deliberation, it makes no sense at the host level to validate disk availability for OCS. I will not be implementing such validation.

@jordigilh
Contributor Author

I think that the spirit of the https://issues.redhat.com/browse/MGMT-4066 story was to take operator requirements into account while checking general OCP memory and cpu requirements in the host validation (like:

or

. using GetCPURequirementFor* and GetMemoryRequirementFor* Operator methods.
@gamli75 - is the above description correct?

Yes, this is correct.

Understood, I will refactor the code to do the validation per host instead.

@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch from d6dae62 to 1a82b33 Compare March 2, 2021 15:28
@jordigilh
Contributor Author

@priyanka19-98 can you please take a look at this PR?

One thing that I found puzzling is that the minimum memory validation for hosts uses the physical memory (here and here), whereas OCS, and now CNV as well, use the usable memory instead. I think they should all use the usable memory, but I leave it up to the maintainers to decide.

/cc @gamli75

@jordigilh
Contributor Author

@pkliczewski @jakub-dzon @machacekondra @masayag can you take another look?

@jordigilh jordigilh changed the title [WIP] MGMT-4066 Add enabled operator CPU and memory requirements to cluster validation MGMT-4066 Add enabled operator CPU and memory requirements to cluster validation Mar 2, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 2, 2021
@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch 2 times, most recently from 233d5ce to 47a8e90 Compare March 2, 2021 16:31
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 2, 2021
}

// getPerHostCPURequirement returns the minimum amount of CPU core count the cluster must have to install the OCS operator
func (o *ocsOperator) getPerHostCPURequirement(cluster *common.Cluster) (int64, error) {
Contributor

The per-host requirements are different and have yet to be implemented, as we only perform aggregate resource validations.
For example, in compact mode (3 masters), it is not compulsory for each host to have at least 12 CPUs; we rely on an aggregate of 36 CPUs, which can be distributed as 7, 15, and 14.
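
To make the difference concrete, a small sketch using the 7, 15, 14 distribution from this comment: the aggregate check against 36 CPUs passes, while a per-host check against 12 CPUs fails on the first host. The function names are hypothetical and not taken from the OCS validator.

```go
package main

import "fmt"

// aggregateSufficient reports whether the hosts together provide at least
// the required number of CPUs.
func aggregateSufficient(hostCPUs []int64, required int64) bool {
	var total int64
	for _, cpus := range hostCPUs {
		total += cpus
	}
	return total >= required
}

// everyHostSufficient reports whether each individual host provides at least
// the per-host number of CPUs.
func everyHostSufficient(hostCPUs []int64, perHost int64) bool {
	for _, cpus := range hostCPUs {
		if cpus < perHost {
			return false
		}
	}
	return true
}

func main() {
	hosts := []int64{7, 15, 14} // three masters in compact mode

	fmt.Println("aggregate >= 36:", aggregateSufficient(hosts, 36))
	fmt.Println("each host >= 12:", everyHostSufficient(hosts, 12))
}
```

Dividing an aggregate figure evenly across hosts therefore yields a stricter rule than the aggregate validation it was derived from.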

Contributor Author

What do you suggest at this point in time? We are implementing per host validation to ensure that all operators can be deployed based on the available hardware.

var cpu, count int64
switch depType {
case compact:
cpu = o.ocsValidatorConfig.OCSRequiredCompactModeCPUCount
Contributor

OCSRequiredCompactModeCPUCount is an aggregate resource count, not a per-host one.

Contributor Author

Since we're looking at per host validation on this PR, what do you suggest we should do for OCS at this point?

internal/operators/ocs/ocs_operator.go (outdated)
return string(b)
}

func GenerateInventoryWithResources(cpu, memory int64, hostname string) string {
Contributor

Where do we use this function?

Contributor Author

It's being used to generate a host manifest for testing purposes in the /internal/host/transition_test.go test file

default:
v.log.Errorf("Unexpected role %s", c.host.Role)
return ValidationError
if c.host.Role == models.HostRoleMaster || c.host.Role == models.HostRoleWorker || c.host.Role == models.HostRoleAutoAssign {
Contributor

Why not check whether c.host.Role != models.HostRoleBootstrap?

Contributor Author

I was not successful at creating a test case with this role type, the logic didn't seem to support a state transition with bootstrap. It might be that there is a specific state in which it is supported, but after looking at the OCS validation code, I did not see any mention of the bootstrap as a type of role to check, so I dropped it altogether.
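
One practical difference between the two conditions discussed in this exchange, shown as a hedged sketch: an allow-list of roles rejects any value it does not name, whereas a "not bootstrap" check accepts unknown or empty roles. The role type and constants below are local stand-ins defined for this example, not the project's `models` package.

```go
package main

import "fmt"

// hostRole and its constants are local stand-ins for illustration.
type hostRole string

const (
	roleMaster     hostRole = "master"
	roleWorker     hostRole = "worker"
	roleAutoAssign hostRole = "auto-assign"
	roleBootstrap  hostRole = "bootstrap"
)

// allowListed accepts only the roles it names explicitly.
func allowListed(r hostRole) bool {
	switch r {
	case roleMaster, roleWorker, roleAutoAssign:
		return true
	}
	return false
}

// notBootstrap accepts every role except bootstrap, including unknown values.
func notBootstrap(r hostRole) bool {
	return r != roleBootstrap
}

func main() {
	for _, r := range []hostRole{roleMaster, roleWorker, roleAutoAssign, roleBootstrap, ""} {
		fmt.Printf("role=%q allowListed=%t notBootstrap=%t\n", r, allowListed(r), notBootstrap(r))
	}
}
```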

internal/host/validator.go (outdated)
@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch 2 times, most recently from 2c844a5 to bd052cd Compare March 4, 2021 16:52
@jordigilh
Contributor Author

jordigilh commented Mar 4, 2021

/test assisted-service-aws

@openshift-ci-robot

@jordigilh: The /retest command does not accept any targets.
The following commands are available to trigger jobs:

  • /test assisted-service-aws
  • /test e2e-metal-assisted
  • /test e2e-metal-assisted-ipv6
  • /test e2e-metal-assisted-onprem
  • /test e2e-metal-assisted-single-node
  • /test images
  • /test lint

Use /test all to run the following jobs:

  • pull-ci-openshift-assisted-service-master-assisted-service-aws
  • pull-ci-openshift-assisted-service-master-e2e-metal-assisted
  • pull-ci-openshift-assisted-service-master-images
  • pull-ci-openshift-assisted-service-master-lint

In response to this:

/retest assisted-service-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jordigilh
Contributor Author

/test assisted-service-aws

@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch from bd052cd to 6233a9f Compare March 5, 2021 14:46
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 8, 2021
@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch from 6233a9f to 3ac034e Compare March 8, 2021 16:04
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 8, 2021
@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch from 3ac034e to fd696be Compare March 8, 2021 17:35
// GetMemoryRequirementForMaster provides master memory requirements for the operator in MB
- GetMemoryRequirementForMaster(ctx context.Context, cluster *common.Cluster) (int64, error)
+ GetMemoryRequirementForMaster(cluster *common.Cluster) (int64, error)
Contributor

For consistency, please either restore the context parameter in the four methods above or remove it from the GetDisksRequirement* methods.

Contributor

It seems that these methods will be used by an API endpoint (https://issues.redhat.com/browse/MGMT-4477), so context.Context parameter will be needed to allow for request-id-bound logging; please revert the removal of it.

Contributor Author

Understood.

@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch from fd696be to 165c230 Compare March 9, 2021 15:45
@jordigilh
Contributor Author

/test e2e-metal-assisted

@jordigilh
Contributor Author

/test assisted-service-aws

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2021
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gamli75, jordigilh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2021
@gamli75
Contributor

gamli75 commented Mar 10, 2021

@jordigilh please talk with @lalon4 - we need to add e2e test cases for this validation when operators are enabled.

@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch from 165c230 to 7a2bd57 Compare March 10, 2021 16:37
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2021
@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch from 7a2bd57 to cb38226 Compare March 10, 2021 16:38
@jordigilh jordigilh force-pushed the MGMT-4066_add_operator_CPU_Memory_reqs_to_cluster_validation branch from cb38226 to 6078837 Compare March 10, 2021 16:44
@gamli75
Contributor

gamli75 commented Mar 11, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 11, 2021
@openshift-merge-robot openshift-merge-robot merged commit 709ddcf into openshift:master Mar 11, 2021
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
9 participants