Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore script to set address information to link machines and nodes #730

Merged
merged 5 commits into from Aug 9, 2019

Conversation

dhellmann
Copy link
Member

No description provided.

This reverts commit a991cca.

Some of the removed code is still needed.  In particular, we still
need to set the IPs on master Machines, as we have no other mechanism
to do that.  Without it, the master Machines and Nodes will not get
linked, which the UI depends on.

Related to issue openshift-metal3#260
We should no longer set IPs on the worker Machine objects, as
cluster-api-provider-baremetal should be doing that automatically.

Since we're not setting IPs, there's no requirement for the script to
wait for the worker to come up, either.

This is still a useful verification, so move it to run_ci.sh, instead.

Related to issue openshift-metal3#260
@dhellmann
Copy link
Member Author

This replaces #714

@dhellmann
Copy link
Member Author

@e-minguez it doesn't look like you're in the openshift-metal3 org so I can't assign you as a reviewer on this.

@dhellmann dhellmann requested a review from russellb August 7, 2019 21:28
@dhellmann dhellmann force-pushed the master-machine-ips branch 2 times, most recently from f6a2fa8 to e147cdc Compare August 7, 2019 23:20
@dhellmann dhellmann changed the title Restore script to set address information to link machines and nodes WIP: Restore script to set address information to link machines and nodes Aug 8, 2019
@dhellmann
Copy link
Member Author

This version of the script works, but we have to wait for the master nodes to be reconciled enough that their status blocks are more fully populated before patching the hardware settings.

@e-minguez
Copy link
Contributor

My nodes:

$ oc get nodes
NAME                                                 STATUS   ROLES    AGE   VERSION
openshift-master-0.dev5.kni.lab.eng.bos.redhat.com   Ready    master   18h   v1.14.0+739670a83
openshift-master-1.dev5.kni.lab.eng.bos.redhat.com   Ready    master   18h   v1.14.0+739670a83
openshift-master-2.dev5.kni.lab.eng.bos.redhat.com   Ready    master   18h   v1.14.0+739670a83

My machines:

$ oc get machines -A
NAMESPACE               NAME            INSTANCE   STATE   TYPE   REGION   ZONE   AGE
openshift-machine-api   dev5-master-0                                             18h
openshift-machine-api   dev5-master-1                                             18h
openshift-machine-api   dev5-master-2                                             18h

The 12_csr_hack.sh script (actually the add-machine-ips.sh) fails in my environment:

+ HOST_PROXY_API_PATH=http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts
+ wait_for_json oc_proxy http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts 10 -H 'Accept: application/json' -H 'Content-Type: application/json'
+ oc --config ocp/auth/kubeconfig proxy
+ local name
+ local url
+ local curl_opts
+ local timeout
+ local start_time
+ local curr_time
+ local time_diff
+ name=oc_proxy
+ url=http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts
+ timeout=10
+ shift 3
+ curl_opts='-H Accept: application/json -H Content-Type: application/json'
+ echo -n 'Waiting for oc_proxy to respond'
Waiting for oc_proxy to respond++ date +%s
+ start_time=1565253217
+ curl -g -X GET http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts '-H Accept: application/json -H Content-Type: application/json'
+ jq .
+ echo -n .
.++ date +%s
+ curr_time=1565253217
+ time_diff=0
+ [[ 0 -gt 10 ]]
+ sleep 5
Starting to serve on 127.0.0.1:8001
+ curl -g -X GET http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts '-H Accept: application/json -H Content-Type: application/json'
+ jq .
+ echo ' Success!'
 Success!
+ return 0
++ oc --config ocp/auth/kubeconfig get node -n openshift-machine-api openshift-master-0.dev5.kni.lab.eng.bos.redhat.com -o json
++ jq -c .status.addresses
+ addresses='[{"address":"10.19.135.105","type":"InternalIP"},{"address":"openshift-master-0.dev5.kni.lab.eng.bos.redhat.com","type":"Hostname"}]'
++ oc --config ocp/auth/kubeconfig get machine -n openshift-machine-api -o json dev5-openshift-master-0.dev5.kni.lab.eng.bos.redhat.com
Error from server (NotFound): machines.machine.openshift.io "dev5-openshift-master-0.dev5.kni.lab.eng.bos.redhat.com" not found
+ machine_data=

If I execute the link-machine-and-node script manually:

$ for i in 0 1 2; do ./link-machine-and-node.sh dev5-master-${i} openshift-master-${i}.dev5.kni.lab.eng.bos.redhat.com; done
...
      checksum: ""
      url: ""
    state: externally provisioned
+ kill 72477
./link-machine-and-node.sh: line 99: kill: (72477) - No such process

And the baremetal hosts now have an IP (only 1 and it should have 2):

oc get baremetalhosts -n openshift-machine-api   openshift-master-0 -o yaml | grep ip:
    - ip: 10.19.135.105

And the console shows some stats as well.

Also, the power status says powered off, but I've modified the script to include the powerstate and after running it again, it now says on:

...
host_patch='
{
  "status": {
    "poweredOn": true,           <---- this
    "hardware": {
...


host_patch='
{
"status": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say to add "poweredOn": true, to also mark the hosts as on (otherwise they say powered off)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ironic will update that field when they are adopted correctly.

The actuator will copy the address and hostname information from the
host object onto the machine, which allows the machine to be linked to
the node. If we edit the machine directly, the actuator overwrites it
with empty data, so edit the host instead.

We now need to wait for the hosts to be registered in order to link
them to machines and nodes, so do this step last.
Worker hosts should have IPs gathered as part of the inspection
process so we shouldn't need to use this hack.
@dhellmann dhellmann changed the title WIP: Restore script to set address information to link machines and nodes Restore script to set address information to link machines and nodes Aug 9, 2019
@dhellmann
Copy link
Member Author

@e-minguez it looks like the nodes in your environment are using fully qualified names, which the scripts aren't expecting. I don't think we can reliably predict every possible configuration there, so we might just have to say that users in those environments need to run the linking script manually instead of relying on the build script. I'm open to ideas about how to improve the scripts, of course.

I'm reluctant to modify the power status flag for 2 reasons. First, the operator is just going to reset it the next time it polls ironic, so the setting won't stick. More importantly, it covers up an issue with the way the hosts were adopted and managed by ironic and I think we should fix that bug instead.

The node only reports 1 IP address, and that's the one we care about here, so the script only adds that address to the host. That is going to make it look like the host has 1 NIC, but that issue will be fixed when we can start collecting the inspection data properly in a future version.

@dhellmann
Copy link
Member Author

@russellb I think this is ready for review now.

@dhellmann dhellmann added the CI check this PR with CI label Aug 9, 2019
@metal3ci
Copy link

metal3ci commented Aug 9, 2019

Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/1022/

@russellb
Copy link
Member

russellb commented Aug 9, 2019

lgtm - I wonder if link_machine_and_node.sh is still an accurate name for that script, but that could be a follow up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI check this PR with CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants