Make DRAC configuration steps more robust #193

Closed
robertodauria opened this issue Dec 1, 2020 · 2 comments · Fixed by #195

robertodauria commented Dec 1, 2020

We currently apply a basic DRAC configuration in stage1 and a full one during stage2. To apply these configurations we use ipmitool, which waits for confirmation from the DRAC after sending each command. For unknown reasons, on R640s some of the commands can take a long time to be confirmed and ipmitool times out.

Both the stage1 and stage2 scripts should be modified to tolerate these transient failures by retrying each command a few times (e.g. 10) before giving up.
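
For example, each call in the stage scripts could be wrapped in a small retry helper along these lines (a minimal sketch; the function name, retry count, and sleep interval are illustrative and not taken from the actual scripts):

    # Hypothetical retry helper: keep retrying an ipmitool command up to
    # 10 times before giving up (names and intervals are illustrative).
    ipmi_retry() {
      local attempt
      for attempt in $(seq 1 10); do
        if ipmitool "$@"; then
          return 0
        fi
        echo "ipmitool $* failed (attempt ${attempt}/10), retrying..." >&2
        sleep 5
      done
      echo "ipmitool $* failed after 10 attempts" >&2
      return 1
    }

    # Usage:
    ipmi_retry lan set 1 ipaddr 200.189.196.132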


nkinkade commented Dec 1, 2020

Just to clarify what happens in each stage:

stage1

  • DRAC network configurations are applied.
  • A temporary, predefined user name ("stageone") and password are configured.

stage2

  • The final user name ("admin") and a random password are applied. The random password is stored in a new GCD entity for the machine. (See the sketch below.)
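
For reference, the corresponding ipmitool calls look roughly like this (a sketch only: the channel number, user ID, and placeholder values are assumptions, not the actual script contents):

    # stage1: network settings and temporary credentials (channel/user IDs assumed)
    ipmitool lan set 1 ipaddr <drac-ip>
    ipmitool lan set 1 netmask <netmask>
    ipmitool lan set 1 defgw ipaddr <gateway-ip>
    ipmitool user set name 2 stageone
    ipmitool user set password 2 <temporary-password>
    ipmitool user enable 2

    # stage2: final credentials; the random password is stored in GCD
    ipmitool user set name 2 admin
    ipmitool user set password 2 <random-password>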


nkinkade commented Dec 9, 2020

tl;dr: at this point this appears to be a firmware issue, likely resolved in a newer firmware version.

I have a bit more data on this based on my experience testing and bringing up BOG04. All machines at BOG04 were stuck in stage1 because of a known bug in the epoxy_client. However, since we can log in as root to stage1 boots, I was able to log in and experiment with ipmitool manually. What I found is that, more often than not, calls to ipmitool to modify network settings would yield something like the following:

root@mlab1-bog04:~# ipmitool lan set 1 ipaddr 200.189.196.132
Setting LAN IP Address to 200.189.196.132
LAN Parameter Data does not match!  Write may have failed.

The command would hang for about 30 seconds, then print that last message and exit. In some cases I found that the value had actually been modified, despite the warning; in other cases it had not. This problem existed in stage1, as well as when the machine was booted to stage3 as part of the cluster.

Searches yielded little, but a number of results indicated that a firmware issue was likely the root cause, so on mlab2-bog04 I upgraded the iDRAC with Lifecycle Controller firmware to 4.20.20.20, the latest version. After the upgrade, calls to ipmitool to modify network settings returned nearly instantly, and worked.

The workaround suggested in the first comment of this issue is likely the easiest fix for now, and can't hurt in any case. I found that, despite the warning/error, the setting eventually took after a couple of tries.
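
Since the warning is sometimes spurious, the retry logic could also read the setting back to confirm that the write actually took, something like this (a sketch; the pattern matched against the "ipmitool lan print 1" output is an assumption):

    # Sketch: retry "lan set" until "lan print" shows the expected value.
    set_lan_ipaddr() {
      local ip="$1" attempt
      for attempt in $(seq 1 10); do
        # The command may warn even when the write succeeded, so ignore its exit code.
        ipmitool lan set 1 ipaddr "${ip}" || true
        if ipmitool lan print 1 | grep -q "IP Address.*${ip}"; then
          return 0
        fi
        echo "ipaddr not applied yet (attempt ${attempt}/10), retrying..." >&2
      done
      return 1
    }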

Upgrading the firmware would also be nice, and there are several possible options:

  • Give instructions to site installers on how to upgrade the firmware.
  • Upgrade the firmware manually on each node ourselves as part of bringing a site up.
  • Figure out some way to automate the upgrade of the firmware.

The last option sounds best, but at this moment I have no idea how it could be accomplished, though I am sure it can be done.
