PXE Attempting to boot off of IP assigned to eth0 regardless of configuration directives #86

Closed
three18ti opened this Issue Jun 23, 2012 · 20 comments

@three18ti

Hello,

I've run into a problem when attempting to boot a machine: it looks for the image on the subnet assigned to my eth0 card, which is an isolated management network.

My Puppet/Razor server has eth0 at 10.10.10.1, which is a private storage network. eth1 is 192.168.0.2 and is the public network. My hypervisor connects VMs over the public interface br0.

When I boot the VM it does load the initial menu, but when I select the razor menu it attempts to boot http://10.10.10.1:8026/razor/api/boot?mac=XX:XX:XX:XX:XX:XX and the connection times out.

You can clearly see where 192.168.0.2 is declared in /opt/razor/conf/razor_server.conf:

# /opt/razor/conf/razor_server.conf:

# This file is the main configuration for ProjectRazor
#
# -- this was system generated --
#
#
--- !ruby/object:ProjectRazor::Config::Server
image_svc_host: 192.168.0.2
persist_mode: :mongo
persist_host: 127.0.0.1
persist_port: 27017
persist_timeout: 10
admin_port: 8025
api_port: 8026
image_svc_port: 8027
mk_tce_mirror_port: 2157
mk_checkin_interval: 60
mk_checkin_skew: 5
mk_uri: http://192.168.0.2:8026
mk_register_path: /razor/api/node/register
mk_checkin_path: /razor/api/node/checkin
mk_fact_excl_pattern: (^facter.*$)|(^id$)|(^kernel.*$)|(^memoryfree$)|(^operating.*$)|(^osfamily$)|(^path$)|(^ps$)|(^ruby.*$)|(^selinux$)|(^ssh.*$)|(^$
mk_log_level: Logger::ERROR
mk_tce_mirror_uri: http://localhost:2157/tinycorelinux
mk_tce_install_list_uri: http://localhost:2157/tinycorelinux/tce-install-list
mk_kmod_install_list_uri: http://localhost:2157/tinycorelinux/kmod-install-list
image_svc_path: /opt/razor/image
register_timeout: 120
force_mk_uuid: ''
default_ipmi_power_state: 'off'
default_ipmi_username: ipmi_user
default_ipmi_password: ipmi_password
daemon_min_cycle_time: 30
rz_mk_boot_debug_level: ''

Screenshot of the error: http://i.imgur.com/7Qpjj.png (this is after the initial menu).

Thanks for any ideas.

Edit, here's the output of: razor -w boot default '{"mac":"00:00:00:00:00:00"}'

# razor -w boot default '{"mac":"00:00:00:00:00:00"}'
#!ipxe
kernel http://192.168.0.2:8027/razor/image/mk/kernel maxcpus=1 || goto error
initrd http://192.168.0.2:8027/razor/image/mk/initrd || goto error
boot || goto error

:error
echo ERROR, will reboot in 60
sleep 60
reboot
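
One way to compare what the CLI generates with what a booting client is actually served is to pull the host addresses out of each iPXE script. The following is only a minimal sketch, not part of Razor: `ipxe_hosts` is a hypothetical helper name, and it assumes the script text arrives on standard input.

```shell
# ipxe_hosts: read an iPXE script on stdin and print each distinct
# host that appears in an http:// or https:// URL.
# (Hypothetical helper for illustration; not a Razor command.)
ipxe_hosts() {
  grep -Eo 'https?://[^/: ]+' |   # keep only the scheme://host part
    sed -E 's#https?://##' |      # drop the scheme
    sort -u                       # one line per distinct host
}
```

Piping the `razor -w boot default ...` output above through `ipxe_hosts` should print only 192.168.0.2. If the script the client fetches over HTTP lists a different host, that would suggest the mismatch is on the server side rather than in the CLI.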
@tjmcs

The configuration file shown above is only used after the first checkin is complete. That configuration is passed back to the Microkernel in response to its first successful checkin. Since it will probably differ from the Microkernel's initial configuration, the Microkernel Controller (a Ruby daemon running on the Microkernel itself) is restarted and uses the new configuration for all checkins after the initial one.

So the obvious question to ask at this point is the following. If the Microkernel uses this configuration only after the first checkin is successful, how does it know where to checkin initially? The answer is that the Microkernel uses the IP address passed to it as the "next-server" in the initial DHCP response as the IP address that it should check in with first. Changing the Razor server configuration file won't help with the initial checkin, but correctly configuring your DHCP server's "next-server" IP address will (so that it points to the IP address on which the Razor server is listening).
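
To check which address clients will be handed, it can help to list every next-server statement in the DHCP configuration. This is a minimal sketch, assuming ISC dhcpd syntax with one statement per line; `list_next_servers` is a hypothetical helper, not a dhcpd or Razor command, and the comment stripping is naive (it would also cut a `#` inside a quoted string).

```shell
# list_next_servers: print the address of every "next-server" statement
# found in the given dhcpd configuration files, in file order.
# (Hypothetical helper for illustration.)
list_next_servers() {
  sed 's/#.*//' "$@" |   # strip comments so commented-out entries are ignored
    awk '$1 == "next-server" { sub(/;$/, "", $2); print $2 }'
}
```

It would be called with the main file and any included files, for example `list_next_servers /path/to/dhcpd.conf`. Every address printed should be one the Razor server is listening on.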

Let me know if this resolves your issue. If so, we can close this issue.

@three18ti

Hello tjmcs,

Thanks for the reply and the clarification. Glad to know my wheel spinning was actually because I was looking in the wrong place and not something wrong with razor.

That said, here is my dhcpd.conf, I have already set the next-server.

# BEGIN DHCP Header
# ----------
# dhcpd.conf
# ----------
authoritative;
default-lease-time 3600;
max-lease-time 86400;
log-facility daemon;

# ----------
# Options
# ----------
option domain-name "lan";
option domain-name-servers 192.168.0.2;
option ntp-servers us.pool.ntp.org;
option fqdn.no-client-update on;  # set the "O" and "S" flag bits
option fqdn.rcode2 255;
option pxegrub code 150 = text;
# END DHCP Header
# BEGIN ddns (Dynamic DNS Updates)
ddns-update-style none;
# END ddns (Dynamic DNS Updates)
# BEGIN PXE Section
next-server 192.168.0.2;
filename "pxelinux.0";
# END PXE Section
# BEGIN DHCP Extra configurations
include "/etc/dhcp/dhcpd.pools";
include "/etc/dhcp/dhcpd.hosts";
# END DHCP Extra configurations
@three18ti

I have tried the following configs in my dhcpd.pools

# DHCP Pools
#################################
# ops.dc1.example.net
#################################
subnet 192.168.0.0 netmask 255.255.0.0 {
  pool
  {
    range 192.168.1.1 192.168.1.100;
  }

  option subnet-mask 255.255.0.0;
  option routers 192.168.0.1;

  filename "pxelinux.0"; # default loading
  next-server 192.168.0.2;
}

# DHCP Pools
#################################
# ops.dc1.example.net
#################################
subnet 192.168.0.0 netmask 255.255.0.0 {
  pool
  {
    range 192.168.1.1 192.168.1.100;
  }

  option subnet-mask 255.255.0.0;
  option routers 192.168.0.1;
}
@tjmcs

Are you seeing DHCP requests from the Microkernel in the syslog on the DHCP server?

@three18ti

I am indeed:

Jun 22 23:11:39 shepard dhcpd: DHCPDISCOVER from 52:54:00:76:5e:0f via eth1
Jun 22 23:11:39 shepard dhcpd: DHCPOFFER on 192.168.1.2 to 52:54:00:76:5e:0f via eth1
Jun 22 23:11:41 shepard dhcpd: DHCPREQUEST for 192.168.1.2 (192.168.0.2) from 52:54:00:76:5e:0f via eth1
Jun 22 23:11:41 shepard dhcpd: DHCPACK on 192.168.1.2 to 52:54:00:76:5e:0f via eth1

I do get the Razor Boot menu, and am able to select "Automatic Razor Node Boot", that's when everything goes sideways.

@tjmcs
@three18ti

Nope, Intel cards all around,

#  Shepard, Puppet / Razor server [eth1] (also NFS server on eth1)
04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)

#  Kitt, KVM Hypervisor (Note the 3Com card is not in use at the moment)
05:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
06:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
07:03.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)

# A different VM based on same libvirt xml
00:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 20)

I did indeed experience the issues described in issue #70, wherein I received a "Protocol Not Supported" error. Adding the '=' to the /var/lib/tftpboot/pxelinux.cfg/default file got me to this point.

So, I turn on the VM, it goes through the initial pxe config, finds pxelinux.0 and gives me the boot menu, when I tell it to boot into the razor kernel I get this error.

I've created a video of the boot process [00:35] http://www.youtube.com/watch?v=Ur8OlMgMEWk&feature=youtu.be

@tjmcs

I think the issue is that the Razor server is not using the configuration you showed (above). It's trying to find the image service at 10.10.10.1, not at 192.168.0.2.

Did you change the configuration that is shown (above)? If so, did you restart the Razor server afterwards (by running a command like 'razor_daemon.rb restart')?

@three18ti

I think the issue is that the Razor server is not using the configuration you showed (above).

I would agree with you, but I thought this config wasn't read until after the initial mk boot. (Did I misunderstand you above?)

Did you change the configuration that is shown (above)?

Sorry for the confusion, can you please indicate -which- config you are asking about?

I -did- manually kill all of the related razor_daemon.rb services (btw, would you accept a pull request updating the README to include a line about adding /opt/razor/bin to $PATH?), including the node applications. For some reason razor_daemon.rb stop doesn't actually stop the razor daemon or any related services (is that intended behavior?), so I manually killed all the processes and started the daemon again with razor_daemon.rb start.

I am trying to install razor using puppet, so since I was working on the puppet master (and hadn't managed my puppet master with puppet yet, it seemed too meta) I ran the "install script":

puppet apply /etc/puppet/modules/razor/tests/init.pp --verbose

Which is where I think the 10.10.10.1 is coming from (I could be -way- off here, so please keep in mind this may be a red herring). The example at http://puppetlabs.com/blog/puppet-razor-module/ uses 10.0.10.1, though, so the IPs are only -real- similar.

@tjmcs
@tjmcs
@tjmcs
@three18ti

Here's the output of razor config

root@shepard:/var/log/test# razor config
ProjectRazor Config:
image_svc_host: 192.168.0.2 
persist_mode: mongo 
persist_host: 127.0.0.1 
persist_port: 27017 
persist_timeout: 10 
admin_port: 8025 
api_port: 8026 
image_svc_port: 8027 
mk_tce_mirror_port: 2157 
mk_checkin_interval: 60 
mk_checkin_skew: 5 
mk_uri: http://192.168.0.2:8026 
mk_register_path: /razor/api/node/register 
mk_checkin_path: /razor/api/node/checkin 
mk_fact_excl_pattern: (^facter.*$)|(^id$)|(^kernel.*$)|(^memoryfree$)|(^operating.*$)|(^osfamily$)|(^path$)|(^ps$)|(^ruby.*$)|(^selinux$)|(^ssh.*$)|(^swap.*$)|(^timezone$)|(^uniqueid$)|(^uptime.*$)|(.*json_str$) 
mk_log_level: Logger::ERROR 
mk_tce_mirror_uri: http://localhost:2157/tinycorelinux 
mk_tce_install_list_uri: http://localhost:2157/tinycorelinux/tce-install-list 
mk_kmod_install_list_uri: http://localhost:2157/tinycorelinux/kmod-install-list 
image_svc_path: /opt/razor/image 
register_timeout: 120 
force_mk_uuid:  
default_ipmi_power_state: off 
default_ipmi_username: ipmi_user 
default_ipmi_password: ipmi_password 
daemon_min_cycle_time: 30 
node_expire_timeout: 300 
rz_mk_boot_debug_level: 

The initial IP address used by Razor when it creates the configuration file (in ${RAZOR_HOME}/conf/razor_server.conf) is just the first IP address it can find on your Razor server.

eth0 is 10.10.10.1
eth1 is 192.168.0.2

"The first IP" is likely 10.10.10.1. I could theoretically swap the IP configurations on eth0 and eth1, but I don't really think this is a good solution, as it forces a network configuration on the end user.
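
To see which address comes first on a multi-homed server, the interface addresses can be listed in the order the kernel reports them. This is only a sketch: the thread does not say how Razor enumerates addresses, so treating the first non-loopback entry as "the first IP" is an assumption, and `ipv4_by_interface` is a hypothetical helper name.

```shell
# ipv4_by_interface: read `ip -o -4 addr show` output on stdin and print
# "interface address" pairs, skipping loopback, in the order received.
# (Hypothetical helper; the selection rule Razor uses is not confirmed here.)
ipv4_by_interface() {
  awk '$2 != "lo" { split($4, a, "/"); print $2, a[1] }'
}
```

On a Linux host with iproute2 it would be fed with `ip -o -4 addr show | ipv4_by_interface`.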

You'll, no doubt, have to change the value it picks (and restart Razor) to get it to work.

I thought I had...; can you eli5 and explain how to set the value it picks?

It is odd that the razor_daemon.rb script isn't finding the appropriate services and shutting them down (during a "restart" or "stop")...are you running the script from the same directory that you were in when you ran the "razor_daemon.rb start" command?

Should I run this from a specific directory? I added /opt/razor/bin to my $PATH variable, so I have been executing razor_daemon.rb ad nauseam from any directory.

I added:

if [ -d "/opt/razor/bin/" ] ; then
  PATH="$PATH:/opt/razor/bin/"
fi

to my .profile

@three18ti

Hey tjmcs,

Thanks for working with me tonight. Where in the world are you?

As it's 03:00 here (MDT, GMT -6) (actually 02:57, but by the time I finish typing...), I'll likely be throwing in the towel for the night in the next 30-45 minutes (just... one... more thing...). Can we talk on IRC tomorrow? Might be quicker. Whatever, I'm on #puppet regularly and my email is my GitHub username at gmail. However I can make your life easier (haha, it makes me neurotic to see "however" without a comma, even though it is used correctly without a comma). (If you prefer to work over e-mail/issue updates I'm happy to continue here too.) Please let me know what is easiest for you.

(03:02 BTW, see above)

@tjmcs
@tjmcs
@three18ti

I have no idea why you are seeing what you are seeing; the output of the 'razor config' command shows that Razor thinks the image_svc_host is running at 192.168.0.2 but then Razor is creating an iPXE boot script using an IP address of 10.10.10.1? (at least that's what was going on in the video that you sent me)

Yep, exactly that, which is why I think the initial install of:

puppet apply /etc/puppet/modules/razor/tests/init.pp --verbose

has some impact here.

Perhaps there is still an instance of the image service running (because you weren't able to use the razor_daemon.rb script to restart things cleanly?) that is using the old IP address for its configuration? There are two node instances (one for the image service and the other for the razor CLI and RESTful interface)...perhaps you only restarted the second one when you restarted things manually?

I killed everything that showed up under a ps aux | grep razor.

On the off chance it's something stupid like a stuck open file handle, I am rebooting the machine. I'll update you once it completes the cycle.

I tend to run the razor_daemon.rb script from the Razor home directory. It creates a "pid" file that it uses to track where the daemon process is that is being controlled using that script. If you start it in one directory and try to stop it (or restart it) in another directory I'm not sure what would happen, to tell you the truth. I'm afraid it might not be able to find the pid file, so it wouldn't know which process to stop (or restart)...

For clarity, you mean /opt/razor?

If you start it in one directory and try to stop it (or restart it) in another directory I'm not sure what would happen

Actually, I did see a razor_daemon.pid file in my /root/ directory, which I thought was odd. This explains the behavior.

On that note, would you say it's -bad- that I added /opt/razor/bin to my $PATH variable? Given this statement, I should think so. Is there an init script / upstart job / systemd unit? I mean, as I understand it, razor is a system service and should be treated as such.

I'm afraid it might not be able to find the pid file, so it wouldn't know which process to stop (or restart)...

That would explain the behavior. Is there a way to define a static pid file (where do OSs store their PID files normally, that's gotta be a thing) that way it doesn't matter where you run the init script, it will restart all of the processes normally?
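
Until the pid file has a fixed location, one workaround is a small wrapper that always changes into the Razor home directory before calling the script, so that start, stop, and restart all look for the pid file in the same place. This is a sketch under assumptions: `razor_ctl` is a hypothetical name, /opt/razor is the home directory mentioned in this thread, and the script is assumed to live at bin/razor_daemon.rb beneath it.

```shell
# razor_ctl: run razor_daemon.rb from the Razor home directory, whatever
# the caller's current directory is. RAZOR_HOME overrides the default.
# (Hypothetical wrapper; the script location under bin/ is an assumption.)
razor_ctl() {
  # The subshell keeps the cd from changing the caller's working directory.
  ( cd "${RAZOR_HOME:-/opt/razor}" && ./bin/razor_daemon.rb "$@" )
}
```

With this in .profile, `razor_ctl start` and `razor_ctl stop` can be run from any directory while the daemon script itself always runs from the same one.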

2, reboot the Razor server machine and restart razor using the razor_daemon.rb script...again, you might want to double check the configuration file before you restart the Razor server using that script

I rebooted shepard (the razor server) and then cd'd to /opt/razor and ran razor_daemon.rb start

When I rebooted the VM I received the same boot sequence as identified in my video above.

If those don't work, this may have to wait until I can get some help from Nan Liu (the creator of the Puppet Module you are using) or Nick Weaver (my partner in crime on this project and the one most familiar with the server-side of Razor...I'm more familiar with the Microkernel that Razor uses than the innards of the server itself), and that might be sometime early next week...I'm off to get some sleep, but let me know if you can get things reset properly

Yea, I think there may be something with the initial install of the Razor module that is locking up this config (again the IP used is 10.0.10.X [note the class B octet]) but I would think there's a way to fix it. I noted in IRC on #puppet that /var/lib/tftpboot/menu.c32 appears to be a compiled binary (forgive me I know little about TFTP and even less about PXE; the irony being TFTP is UDP...) so maybe it's not getting updated properly? grasping at straws here.

and that might be sometime early next week.

Look, I'm really excited as to what this means for the (excuse the phrase) "DevOps" community (may as well have said the "Cloud" community...), and I'm -extremely- motivated to get a working technology stack so I can start making money (that's the general hope anyway... at least support myself and family without a "job"... I -hate- working for a closed-minded boss), but I get the fact that you are likely volunteering your free time for... well... free (unless you're a puppetlabs employee, in which case you're likely volunteering your off-hours time to support the open source branch of your company), so please, as you have time; I can't pay you until I turn a profit, and burned out developers don't produce good code.

and that might be sometime early next week
I'm going to try to take some time off this weekend

Please do. Please tell your boss that some very important anonymous person on the internet told you it is imperative that you take some time off. Seriously. I can wait. I mean really... I have big dreams, but they aren't making me any money yet, so I couldn't possibly make less, right? (are you on reddit? please PM me your reddit username. seriously, I don't care if you're on weekend... some life or death shit)

I'm on the west coast, so it's a whole hour earlier here...

Really curious as to where now.

I don't tend to use IRC much, but do work off of most of the other chat networks (Google, AIM, YIM, Skype).

This is funny to me. IDK why you would ignore the core internet chat but succumb to all the other chats... ;) I'm on google if that works for you.

I'm going to try to take some time off this weekend, so the earliest I could chat would probably be Monday...

again, please take the time off, and please don't do it at my request. From what I see of your public history you deserve a couple days off. This is the Internet, I'm not going anywhere until I solve my problem ;).

so the earliest I could chat would probably be Monday...

Let me know your preferred method for communication. If this is it, I'm happy to oblige; I just thought there were perhaps more efficient methods to communicate to help us dig in.

Thanks again for your help so far. I know we haven't fixed anything, but we have ruled out some possibilities, which is a big step too. I'm sure we'll figure it out.

BTW, on a hunch, I added a second interface on the storage network, but no IP is handed out over the storage network, so it gets stuck in an infinite loop; this is not the way to solve the problem.

Tomorrow, I'll try switching the interfaces and the physical connection.

so the earliest I could chat would probably be Monday...

find me on freenode. I lurk in #puppet always.

@tjmcs
@three18ti

tjmcs, thank you very much. :) You can't see it, but I have the biggest shit eating grin right now.

(iow: IT WORKS IT WORKS IT WORKS !!!111!!!)

Oddly, razor sees 2 nics (remember I added a NIC to the VM during troubleshooting but quickly removed it) even though there is only one. Do you think that's a razor problem or a libvirt problem? (I'm happy to go bug those guys for a while; aren't you supposed to be taking the weekend off ;) )

Unless you really want to chase down the issue I'm just going to destroy the VM and create a new one; call it good.

Thanks again for all your help.

@tjmcs

I think I'm as glad as you are that there is finally a solution to this issue :)

I think I'll leave the libvirt issues for you to sort out on your own. Having fought with libvirt many times over the past year or so I know how difficult it can be to sort out issues (especially network configuration issues) with that virtualization framework...

Just so you know, Nick (Weaver) and I are not Puppet Labs employees, but we are the creators of Razor (so we really are willing to go the extra mile to make sure that your experience with our "baby" is as positive as possible :)

Anyway, once again I'm glad that fix worked. I'm going to go ahead and mark this issue as closed, but feel free to reopen it if something else comes up...

@tjmcs tjmcs closed this Jun 23, 2012