
Several release machines offline #2217

Closed
BethGriggs opened this issue Mar 11, 2020 · 24 comments

Comments

@BethGriggs
Member

I think release builds are backing up due to the following machines being offline in ci-release:

  • release-ibm-aix71-ppc64_be-1
  • release-ibm-rhel7-s390x-1
  • release-rackspace-win2012r2_vs2019-x64-1
  • release-rackspace-win2012r2_vs2019-x64-2
@sam-github
Contributor

sam-github commented Mar 11, 2020

$ cat /home/iojs/jenkins_console.log
...
WARNING: connect timed out
Mar 10, 2020 12:00:41 AM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: https://ci-release.nodejs.org/ provided port:41913 is not reachable

I restarted jenkins, but the failure remains:

SEVERE: https://ci-release.nodejs.org/ provided port:41913 is not reachable

and the message checks out; the port isn't reachable:

$ telnet ci-release.nodejs.org 80
Trying...
Connected to ci-release.nodejs.org.
Escape character is '^]'.
^CConnection closed.
$ telnet ci-release.nodejs.org 41913
Trying...
^C
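The telnet probes above can be scripted for quick re-checks. A minimal sketch, assuming bash and coreutils timeout are available (the host and port are the ones from the logs; the usage lines are illustrative, not output I've captured):

```shell
# Sketch: TCP reachability check, equivalent to the telnet probes above.
# check_port HOST PORT -> exit 0 if a TCP connection succeeds within 5s,
# non-zero on refusal or timeout (a filtered port hangs until timeout fires).
check_port() {
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Hypothetical usage against the host from the logs:
# check_port ci-release.nodejs.org 80      # HTTP port, was connecting
# check_port ci-release.nodejs.org 41913   # JNLP port, was hanging
```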

@sam-github
Contributor

sam-github commented Mar 11, 2020

Around Mar 9, it looks like
https://ci-release.nodejs.org/computer/release-ibm-rhel7-s390x-1/
started having problems, and now it's just not able to find the server:

Mar 03 01:03:01 release-ibm-rhel7-s390x-1 java[35305]: INFO: Connected
Mar 09 17:35:18 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:35:18 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 17:35:18 release-ibm-rhel7-s390x-1 java[35305]: INFO: Terminated
Mar 09 17:35:29 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:35:29 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
Mar 09 17:35:29 release-ibm-rhel7-s390x-1 java[35305]: INFO: Master isn't ready to talk to us on https://ci-release.nodejs.org/tcpSlaveAgentListener/. Will try again: response code=502
Mar 09 17:35:40 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:35:40 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
Mar 09 17:35:40 release-ibm-rhel7-s390x-1 java[35305]: INFO: Master isn't ready to talk to us on https://ci-release.nodejs.org/tcpSlaveAgentListener/. Will try again: response code=503
Mar 09 17:35:50 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:35:50 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
Mar 09 17:35:50 release-ibm-rhel7-s390x-1 java[35305]: INFO: Master isn't ready to talk to us on https://ci-release.nodejs.org/tcpSlaveAgentListener/. Will try again: response code=503
Mar 09 17:36:00 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:00 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
Mar 09 17:36:00 release-ibm-rhel7-s390x-1 java[35305]: INFO: Master isn't ready to talk to us on https://ci-release.nodejs.org/tcpSlaveAgentListener/. Will try again: response code=503
Mar 09 17:36:11 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:11 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
Mar 09 17:36:11 release-ibm-rhel7-s390x-1 java[35305]: INFO: Master isn't ready to talk to us on https://ci-release.nodejs.org/tcpSlaveAgentListener/. Will try again: response code=503
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:21 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: INFO: Performing onReconnect operation.
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:21 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: INFO: onReconnect operation failed.
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:21 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: INFO: Locating server among [https://ci-release.nodejs.org/]
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:21 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:21 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: INFO: Agent discovery successful
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Agent address: ci-release.nodejs.org
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Agent port:    41913
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Identity:      7f:50:64:6f:c9:2f:5d:9d:0d:d8:b8:a2:28:d8:93:03
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:21 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: INFO: Handshaking
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:21 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: INFO: Connecting to ci-release.nodejs.org:41913
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:21 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: INFO: Trying protocol: JNLP4-connect
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:21 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: INFO: Remote identity confirmed: 7f:50:64:6f:c9:2f:5d:9d:0d:d8:b8:a2:28:d8:93:03
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 5:36:21 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 17:36:21 release-ibm-rhel7-s390x-1 java[35305]: INFO: Connected
Mar 09 18:19:30 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 6:19:30 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 18:19:30 release-ibm-rhel7-s390x-1 java[35305]: INFO: Terminated
Mar 09 18:19:41 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 6:19:41 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
Mar 09 18:19:41 release-ibm-rhel7-s390x-1 java[35305]: INFO: Master isn't ready to talk to us on https://ci-release.nodejs.org/tcpSlaveAgentListener/. Will try again: response code=503
Mar 09 18:19:52 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 6:19:52 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
Mar 09 18:19:52 release-ibm-rhel7-s390x-1 java[35305]: INFO: Master isn't ready to talk to us on https://ci-release.nodejs.org/tcpSlaveAgentListener/. Will try again: response code=503
Mar 09 18:20:02 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 6:20:02 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 18:20:02 release-ibm-rhel7-s390x-1 java[35305]: INFO: Performing onReconnect operation.
Mar 09 18:20:02 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 6:20:02 PM hudson.remoting.jnlp.Main$CuiListener status
Mar 09 18:20:02 release-ibm-rhel7-s390x-1 java[35305]: INFO: onReconnect operation failed.
Mar 09 18:20:02 release-ibm-rhel7-s390x-1 java[35305]: Mar 09, 2020 6:20:02 PM hudson.remoting.jnlp.Main$CuiListener status
... eventually got connected:
Mar 09 18:20:03 release-ibm-rhel7-s390x-1 java[35305]: INFO: Connected
... then lost connection:
Mar 10 00:03:38 release-ibm-rhel7-s390x-1 java[35305]: INFO: Terminated
... bounced a couple times, and has been in a systemd restart loop ever since:
Mar 10 00:10:00 release-ibm-rhel7-s390x-1 systemd[1]: jenkins.service: main process exited, code=exited, status=255/n/a
Mar 10 00:10:00 release-ibm-rhel7-s390x-1 systemd[1]: Unit jenkins.service entered failed state.
Mar 10 00:10:00 release-ibm-rhel7-s390x-1 systemd[1]: jenkins.service failed.
Mar 10 00:10:30 release-ibm-rhel7-s390x-1 systemd[1]: jenkins.service holdoff time over, scheduling restart
...
Mar 11 12:39:18 release-ibm-rhel7-s390x-1 java[58080]: SEVERE: https://ci-release.nodejs.org/ provided port:41913 is not reachable
Mar 11 12:39:18 release-ibm-rhel7-s390x-1 java[58080]: java.io.IOException: https://ci-release.nodejs.org/ provided port:41913 is not reachable
Mar 11 12:39:18 release-ibm-rhel7-s390x-1 java[58080]: at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:303)
Mar 11 12:39:18 release-ibm-rhel7-s390x-1 java[58080]: at hudson.remoting.Engine.innerRun(Engine.java:527)
Mar 11 12:39:18 release-ibm-rhel7-s390x-1 java[58080]: at hudson.remoting.Engine.run(Engine.java:488)
Mar 11 12:39:18 release-ibm-rhel7-s390x-1 systemd[1]: jenkins.service: main process exited, code=exited, status=255/n/a
Mar 11 12:39:18 release-ibm-rhel7-s390x-1 systemd[1]: Unit jenkins.service entered failed state.
Mar 11 12:39:18 release-ibm-rhel7-s390x-1 systemd[1]: jenkins.service failed.
[linux1@release-ibm-rhel7-s390x-1 ~]$ date
Wed Mar 11 12:39:31 EDT 2020
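A restart loop like the one above can be quantified straight from the journal. A minimal sketch (the message format is the one in the systemd lines above; in practice you'd pipe something like journalctl -u jenkins.service --since yesterday into it):

```shell
# Sketch: count how many times a unit entered the failed state in a
# journal excerpt read from stdin (message text as in the logs above).
count_failures() {
  grep -c 'entered failed state'
}

# Hypothetical usage:
# journalctl -u jenkins.service --since yesterday | count_failures
```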

@sam-github
Contributor

It feels like a firewall issue; not sure if it's on the Node.js infra side or not. AFAICT the two machines I've looked at so far, though both "IBM", are provided by different orgs with different networks, so that suggests it's a Node.js-side infra problem.

@sam-github
Contributor

sam-github commented Mar 11, 2020

I don't have access to secrets:build/release, so can't reach the two windows hosts to see if they are also suffering from the same core problem.

ping: @rvagg @mhdawson

@mhdawson
Member

I'll look at the firewall config

@mhdawson
Member

Looks like at least the 2 IBM machines were removed from the firewall. I do see that when I added them I had "test" in the name instead of "release", although the IPs were correct. @rvagg, did you do some cleanup?

Adding those 2 back and will see if that resolves it. If so, I'll check the Windows ones.

@mhdawson
Member

release-ibm-rhel7-s390x-1 is back online. The AIX one might need a kick to get it back.

@mhdawson
Member

I don't see any Windows machines in the Ansible templates under release, which is a bit strange and makes checking the IPs harder.

@mhdawson
Member

Does look like the Windows ones were removed as well. Can't double-check the IPs, but I will add back the entries that match the names in the CI.

@sam-github
Contributor

Did you check nodejs-private/secrets:build/release/inventory.yml? I can't decrypt, but I think they might be in there.

@sam-github
Contributor

AIX is up.

@mhdawson
Member

Windows machines seem to be back up as well.

@sam-github
Contributor

OK, let's not close this until we've tracked down why it went away, or do you know? Is it possible the encrypted inventory.yml is not correct, and the firewall config got refreshed based on that?

I wonder, because neither @AshCripps nor I have access to that file, so it's possible it didn't get the release machine IPs added to it, and if it's used to build firewalls, perhaps that's why they lost the manual config.

Or is there some other reason you can see?

@mhdawson
Member

@sam-github they are in nodejs-private/secrets:build/release/inventory.yml. My confusion was having some of the info in the other file.

@mhdawson
Member

mhdawson commented Mar 11, 2020

To clarify, inventory.yml in the secrets:build/release directory includes the Windows machines that were removed; it does not include the IBM ones. Right now each of the 2 files only has partial info.

The key question before closing is what triggered the update to the firewall config which resulted in the machines being lost. We should follow up with @rvagg, as I think he'd be the only other one who would have updated it since my last update on Feb 25.

@sam-github
Contributor

sam-github commented Mar 11, 2020

So, any idea what process caused the reset of the firewall rules? This seems like it was a side effect of the recent Jenkins server upgrade, which suggests that perhaps it was done by Ansible, and that Ansible doesn't know about these two new(ish) release machines?

EDIT: crossed comments :-). OK, let's wait for @rvagg to comment.

@mhdawson
Member

Right, as far as I know there are no automated processes for updating the firewall config, just manual updates.

@rvagg
Member

rvagg commented Mar 11, 2020

I don't think I've touched it for quite a while, certainly not since Feb 25. But I did restart the server (and ci.nodejs.org too) for security updates 2 days ago, which was obviously the trigger for a reset to an old state. So I suppose rules.v4 wasn't updated properly?

@mhdawson
Member

These are the instructions we have for how to update: https://github.com/nodejs/build/blob/master/ansible/MANUAL_STEPS.md#adding-firewall-entries-for-jenkins-workers

Which is what I've always followed (and I'm guessing similarly for whoever added the Windows machines). If more needs to be done than that, can you PR what it is, or add it to this issue so that we can get it captured?

@rvagg
Member

rvagg commented Mar 13, 2020

I suspect there might be something borked on this machine wrt iptables.

/etc/cron.hourly/iptables-save does a service iptables-persistent save, but iptables-persistent doesn't seem to be an actual service.

Whatever happens, the iptables rules need to be saved to /etc/iptables/rules.v4, that's what the iptables-persistent service is supposed to do.

When I edit the firewall, I just do it directly:

  1. Craft the iptables rule in /etc/iptables/rules.v4
  2. Copy and paste it onto the commandline and prefix it with iptables to run it

Where I need to delete a rule, I do this:

  1. iptables -L --line-numbers > /tmp/iptables.numbers
  2. Find the rule to delete and run iptables -D jnlp 88 on that number.
  3. If there are multiple rules to delete, start at the highest number and go lower, because the numbers change as you remove them from the list.

This is quite manual and the steps in MANUAL_STEPS.md should be safer, but they don't seem to be properly saving on this machine:

# Generated by iptables-save v1.6.0 on Thu Apr 11 21:10:51 2019

The major differences with current rules are:

< -A jnlp -s 129.33.196.199/32 -m comment --comment release-ibm-aix71-ppc64_be-1 -j ACCEPT
< -A jnlp -s 148.100.86.101/32 -m comment --comment release-rhel7-s390x-1-1 -j ACCEPT
< -A jnlp -s 104.130.6.184/32 -m comment --comment release-rackspace-win2012r2_vs2019-x64-1 -j ACCEPT
< -A jnlp -s 104.130.158.22/32 -m comment --comment release-rackspace-win2012r2_vs2019-x64-2 -j ACCEPT

which I'm guessing corresponds to the machines that were offline?

For now, I've just run iptables-save > /etc/iptables/rules.v4 to persist the current set. They'll be restored when the machine restarts. But for this issue to be closed, we need to figure out why persistence isn't working. Has something been deleted from this machine? Is something misconfigured?

Maybe if @sam-github gets access to this machine this can be one of his initial tasks?

@rvagg
Member

rvagg commented Mar 13, 2020

Just added back release-nearform-macos10.15-x64-1; that was missing too.

@mhdawson
Member

@rvagg those were the ones I just added because they were missing.

The answer here would suggest that it's not configured correctly: https://unix.stackexchange.com/questions/125833/why-isnt-the-iptables-persistent-service-saving-my-changes.

We could just update that cron job to run:
iptables-save > /etc/iptables/rules.v4
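For reference, the replacement cron job could be as small as this sketch (a config fragment assuming the Debian iptables-persistent file layout discussed above; the rules.v6 line is an extra precaution, not something verified in this thread):

```shell
#!/bin/sh
# /etc/cron.hourly/iptables-save (sketch): persist the live rules directly,
# bypassing the non-functional "service iptables-persistent save" call.
iptables-save > /etc/iptables/rules.v4
ip6tables-save > /etc/iptables/rules.v6
```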

@github-actions

github-actions bot commented Jan 8, 2021

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions github-actions bot added the stale label Jan 8, 2021
@mhdawson
Member

mhdawson commented Jan 8, 2021

This is long since stale, closing.

@mhdawson mhdawson closed this as completed Jan 8, 2021