
Conversation

@honza (Member) commented Feb 1, 2021

zouy414 and others added 30 commits November 23, 2020 11:10
Signed-off-by: zouyu <zouy.fnst@cn.fujitsu.com>
Since 1846513, startProvisioning() no
longer requests the state change that actually... starts provisioning,
so rename it to reflect that.
The first time we provision a Node, we must move it from the manageable
to available state. This involves automated cleaning, and it may fail.
If it does, we should attempt to recover.
Set console=ttyS0 so that the IPA console can
be logged for debugging.
Signed-off-by: seivanov <seivanov@mirantis.com>
Handle cleaning failures during provisioning
Add kubebuilder validation for ClockSpeed
Filter out status updates from the reconcile loop
It is up to the consumer, not the host, to decide when a host is ready
to have provisioning instructions added. The checks in the Available()
method will be inlined in cluster-api-provider-metal3.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
Use the API field directly instead of the convenience method.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
Move public methods of the host type to private functions in the
controller package.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
Bug 1907612: Update kubernetes deps to 1.20
Fix a typo in docs/dev-setup.md document
The new provisioner is mostly a no-op and prov.IsReady() must return true
Fix outdated namespace in docs/dev-setup document.
Allow BMC details to be omitted for Hosts in Unmanaged state
Currently we assume only an IPv4 *or* an IPv6 address exists in the
inspection data, but it's valid for both to exist in a dual-stack
scenario.

To avoid changing the BMH API, in this case we create an additional
nic entry in the list, so that each entry carries one of the IPs but
is otherwise identical.

Fixes: metal3-io#458
Handle dual-stack configuration in inspection data
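
A minimal sketch of the duplication approach this commit describes, assuming an illustrative NIC struct rather than the exact upstream BMH type:

```go
package provisioner

// NIC is an illustrative stand-in for the BMH hardware-details NIC entry,
// which carries a single IP address per entry.
type NIC struct {
	Name string
	MAC  string
	IP   string
}

// splitDualStackNIC emits one NIC entry per discovered IP address, so an
// interface that reports both an IPv4 and an IPv6 address becomes two
// list entries identical apart from the IP field. This keeps the BMH API
// (one IP per NIC entry) unchanged.
func splitDualStackNIC(name, mac string, ips []string) []NIC {
	nics := make([]NIC, 0, len(ips))
	for _, ip := range ips {
		nics = append(nics, NIC{Name: name, MAC: mac, IP: ip})
	}
	return nics
}
```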
remove convenience methods from BareMetalHost
When a call to the provisioner succeeds rather than returning an error
message, that's a good sign, and a reason not to have an error message
set in the Host object. But it doesn't guarantee success: if the
previous failure came at the end of some long-running operation (rather
than outright rejection at the beginning), it could yet fail in exactly
the same way.

Clearing the ErrorType as soon as we start an operation allows us to use
that field to determine whether to force the provisioner to retry in the
presence of an error. (It will only be set if the last thing we did was
record an error, therefore if we see it then we are at the beginning of
a new retry.)

One known issue with this is that if a request to change state in Ironic
results in a 409 Conflict response, this just sets the Dirty flag and is
indistinguishable from success when viewed outside the Ironic
provisioner. If such a conflict occurs, we will effectively skip a retry
attempt, increment the ErrorCount again, and sleep for even longer
before the next attempt.

To ensure that we actually do exponential backoff between retries, leave
the ErrorCount unchanged until the whole operation actually succeeds
(Dirty is no longer true). This is now decoupled from the ClearError()
method that clears the ErrorType and ErrorMessage.
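
A sketch of the backoff this enables, under the assumption that the requeue delay is derived from ErrorCount; the helper name and the cap are illustrative, not the exact upstream code:

```go
package controllers

import (
	"math"
	"time"
)

// maxBackoffCount caps the exponent so the delay stays bounded.
const maxBackoffCount = 10

// backoffFromErrorCount turns the number of consecutive failed attempts
// into a requeue delay of roughly 2^n seconds. Because ErrorCount is only
// incremented while an operation keeps failing, and only cleared once the
// whole operation succeeds (Dirty is no longer true), successive retries
// are spaced exponentially further apart.
func backoffFromErrorCount(errorCount int) time.Duration {
	if errorCount > maxBackoffCount {
		errorCount = maxBackoffCount
	}
	return time.Duration(math.Pow(2, float64(errorCount))) * time.Second
}
```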
As much as possible, do the clearing of ErrorCount in the host state
machine. The exception is the steady states where we only do power
management: the actionResult types currently lack enough detail for the
state machine to distinguish when the count should be cleared.

In power management states (Ready/Available, Provisioned, Externally
Provisioned), count the number of errors of any type since the power
state was last successfully changed.

Otherwise, the ErrorCount is cleared when an operation successfully
completes. Successful re-registration or adoption is never sufficient to
clear the error count, except in the registration state.
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739
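
A sketch of the decision this force flag enables; the function and parameter names are assumptions for illustration, not the exact upstream signature:

```go
package provisioner

// shouldRetryValidation decides whether to re-run management-access
// validation for a Node sitting in the 'enroll' state with an error
// recorded. Previously only changed credentials triggered a retry; the
// force flag marks the start of a fresh retry cycle, in which any error
// already saved on the Host is stale and validation should run again.
func shouldRetryValidation(credentialsChanged, force, errorAlreadySaved bool) bool {
	if credentialsChanged || force {
		return true
	}
	// Otherwise, only validate if we have not yet seen and saved this error.
	return !errorAlreadySaved
}
```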
If inspection fails at some point before it is actually started in
ironic-inspector, we were just repeatedly retrying it instead of setting
an error. Instead, set an error and retry only after a backoff.
Error is the state that the Node goes into if deleting (i.e.
deprovisioning) it fails. The only valid action from this state is to
try deleting again, so do that rather than attempt to go straight to
manageable.
If deprovisioning fails, we were just repeatedly retrying it instead of
setting an error. Instead, set an error and retry only after a backoff.

If we are deprovisioning because of a deletion, give up after 3 retries
(since this may indicate that the host simply doesn't exist any more)
and allow the Host to be deleted.
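
A sketch of that retry policy; the helper and constant names are illustrative, with the limit of 3 taken from the commit message:

```go
package controllers

// maxDeprovisionRetriesOnDelete: after this many failed deprovisioning
// attempts during deletion we stop blocking the delete, since the host
// may simply no longer exist.
const maxDeprovisionRetriesOnDelete = 3

// allowDeletionDespiteFailure reports whether a Host whose deprovisioning
// keeps failing should be released for deletion anyway.
func allowDeletionDespiteFailure(deleting bool, errorCount int) bool {
	return deleting && errorCount > maxDeprovisionRetriesOnDelete
}
```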
honza and others added 11 commits January 26, 2021 09:15
When registering a new host, we check if its bootMACAddress conflicts
with any existing Ironic nodes.  This procedure queries Ironic ports,
and if a port is found, we use the port's NodeUUID to find the
conflicting node.  However, we are using `ports.List()` which only
returns minimal resources by default; e.g. any returned ports will have
their NodeUUID property set to an empty string.

We now explicitly ask Ironic to include the `node_uuid` field.

We also adjust the mock test mechanism to account for this.
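
A sketch of the lookup using the gophercloud baremetal ports package; the helper name and error handling are simplified relative to the real code:

```go
package provisioner

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/baremetal/v1/ports"
)

// nodeUUIDForMAC finds the Ironic node that owns a port with the given
// MAC address. ports.List returns minimal records by default (NodeUUID
// would be the empty string), so node_uuid must be requested explicitly
// via the Fields option.
func nodeUUIDForMAC(client *gophercloud.ServiceClient, mac string) (string, error) {
	allPages, err := ports.List(client, ports.ListOpts{
		Address: mac,
		Fields:  []string{"uuid", "node_uuid"},
	}).AllPages()
	if err != nil {
		return "", err
	}
	allPorts, err := ports.ExtractPorts(allPages)
	if err != nil || len(allPorts) == 0 {
		return "", err
	}
	return allPorts[0].NodeUUID, nil
}
```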
Add missing newline in console log
Ironic: Don't adopt after clean failure during deprovisioning
Ask Ironic to include NodeUUID when querying ports
If we can't find a node by name, and have to resort to using MAC address
lookup, we can't accept a node that has a name.  In that scenario, we
have detected a MAC address collision, and human intervention is
required.  We signal this by returning an `operationFailed` provisioning
result with the offending MAC address in the error message.
Fail registration when boot MAC conflicts
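
A sketch of the collision check; the names are illustrative, and the real code surfaces this as an `operationFailed` provisioning result rather than a bare error:

```go
package provisioner

import "fmt"

// checkBootMACCollision is called after a lookup by name has failed and a
// lookup by MAC address has succeeded. If the node found via MAC already
// has a name, it cannot be ours: that is a MAC address collision requiring
// human intervention, so registration must fail with the offending MAC in
// the error message.
func checkBootMACCollision(bootMAC, foundNodeName string) error {
	if foundNodeName != "" {
		return fmt.Errorf("MAC address %s conflicts with existing node %s", bootMAC, foundNodeName)
	}
	return nil
}
```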
Limit the number of hosts simultaneously provisioned
Add "NET_RAW" to ironic-endpoint-keepalived
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: honza

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Feb 1, 2021
@stbenjam (Member) commented Feb 1, 2021

/test e2e-metal-ipi-virtualmedia

@stbenjam (Member) commented Feb 1, 2021

/test e2e-metal-ipi

@stbenjam (Member) commented Feb 1, 2021

/test e2e-metal-assisted

Not sure if that one's here, but we probably want to check it.

@openshift-ci-robot

@stbenjam: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-metal-ipi-virtualmedia
  • /test images
  • /test lint
  • /test unit

Use /test all to run the following jobs:

  • pull-ci-openshift-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-baremetal-operator-master-images
  • pull-ci-openshift-baremetal-operator-master-lint
  • pull-ci-openshift-baremetal-operator-master-unit

In response to this:

/test e2e-metal-assisted

Not sure if that one's here, but we probably want to check it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam (Member) commented Feb 1, 2021

/test e2e-metal-ipi-ovn-dualstack

@honza (Member, Author) commented Feb 9, 2021

openshift/cluster-baremetal-operator#100 is now merged so let's try a fresh build

/retest

@stbenjam (Member) commented Feb 9, 2021

/test e2e-meta-ipi

@openshift-ci-robot

@stbenjam: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-metal-ipi-virtualmedia
  • /test images
  • /test lint
  • /test unit

Use /test all to run the following jobs:

  • pull-ci-openshift-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-baremetal-operator-master-images
  • pull-ci-openshift-baremetal-operator-master-lint
  • pull-ci-openshift-baremetal-operator-master-unit

In response to this:

/test e2e-meta-ipi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam (Member) commented Feb 9, 2021

/test e2e-metal-ipi

@andfasano

/retest

1 similar comment
@stbenjam (Member)

/retest

@honza (Member, Author) commented Feb 11, 2021

#125

@openshift-ci bot commented Feb 11, 2021

@honza: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-virtualmedia | 1cbb1cd | link | /test e2e-metal-ipi-virtualmedia |
| ci/prow/e2e-metal-ipi | 1cbb1cd | link | /test e2e-metal-ipi |
| ci/prow/e2e-metal-ipi-ovn-dualstack | 1cbb1cd | link | /test e2e-metal-ipi-ovn-dualstack |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 1cbb1cd | link | /test e2e-metal-ipi-ovn-ipv6 |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@honza (Member, Author) commented Feb 17, 2021

Superseded by #125

@honza closed this Feb 17, 2021