Various errors when launching a cluster #283
@pferrel - I took the liberty of creating a new issue for you. Please post the details of your issue here and not on #238, per my request:
Please post your config, as well as the output you are seeing. Once you do, I will delete the off-topic comments on #238. |
Switched to Amazon Linux, still get the errors. At least once the master came up and I see 1 of the 2 slave workers in the GUI.
|
Latest project-poc config file, with only one slave; the above was the same config but with 2 slaves:
|
Are you using the latest release of Flintrock? Please post the version of Flintrock you are using. Please also try your launch with vanilla Amazon Linux 2. The AMI IDs are at the bottom of that page. If you are still having problems with the latest versions of Flintrock + Amazon Linux 2, then on the next launch failure don't destroy the cluster and instead SSH in to the master and take a look at the Spark logs to see why the master is failing to come up. That should give you some clues as to what's going on. |
Sorry, changed the comment above, yes I was using the Amazon Linux AMI. The most recent try brought up one slave successfully, trying 2 now. |
yes, latest 0.10.0 |
Adding slaves seems to have worked. I created a cluster with 1 slave, and the GUI at the time (yes, refreshed) showed none working. I added 2, and now I have 3 slaves running. If I can believe the GUI, adding the 2 slaves also fixed the previously non-connected slave. Again, from a newb perspective (I have run Spark often for at least 4 years, but am new to Flintrock), there seem to be timeout issues here. |
Please help me help you by following my debugging instructions. The latest version of Flintrock is 0.11.0, not 0.10.0. Please try launching a fresh cluster with the latest versions of Flintrock and Amazon Linux 2. Specifically, I recommend If you still have Spark master timeout issues with this configuration, please post the contents of the Spark master log as that will give us some clues as to why the Spark master is not coming up. |
Looking back at the logs you posted, it looks like you installed Flintrock via Homebrew, which is a community-supported distribution (i.e. I don't maintain it) and which is unfortunately out of date. If you install Flintrock via |
Ah, yes. I was surprised when I saw the version, I thought the latest was 0.11. Will try tomorrow. |
OK, switched to 0.11.0 and I get the same SSH timeouts. Here is my config:
CLI + error
This happens almost every time I try to launch. Yes, I am using the Apache mirrors for download but this happens before the download since you need to have ssh to download. In my experience by hand it takes much longer for ssh to become active on an instance than flintrock is allowing. Is there some way to change the ssh timeout or insert a delay before it is tried? |
This issue is different from the one you were reporting earlier. Before, the Spark master was having trouble coming up. Now, SSH is failing to connect with a specific banner error. The latter suggests an issue with the AMI or user you have configured, as communicated in Flintrock’s error message. This is almost certainly not related to the wait timeout. Please try the AMI I suggested earlier before trying your own setup. It will help narrow down what’s going on. In fact, please try the default config Flintrock provides by deleting any existing config files you have and then calling The default config works and is tested before every release. Start there and carefully work your way from there to the setup you actually want. We’re spending a lot of time going back and forth because we don’t have a common baseline to start testing from. Starting with that common baseline will cut down further back and forth. |
I have used the standard Amazon Linux AMI and the ec2-user. The key is obviously working. I did the flintrock configure before customizing as shown above. I suppose you can see above that there are timeouts before what I assume is a retry eventually works. Maybe something in my AWS config causes a slower initialization? I have to install in the correct subnet and VPC, so I can try the default config but will have to move back to this if I'm going to use Flintrock. I'm conversant enough in Python (barely) to mod with delays or retries if you can point me in the right direction. Thanks in any case. |
SSH timeouts are normal when an instance is coming up. Every successful cluster launch will show that in its debug output. What's not normal is the SSH banner error.
This error suggests there is something with your SSH key, SSH user, or with the AMI. Adding more retries or longer delays between tries won't help. Flintrock retries connecting only when there is an SSH timeout. Retrying when another error (like this banner error) is thrown doesn't make sense. If you want to try anyway, you can play around with the code here, but I don't think you'll get anywhere by doing that. Note, by the way, that this error is different from what you were originally reporting. Originally, you were reporting this problem:
This happens much further along in the launch process, meaning Flintrock was able to connect to the instances fine and do most of its work. Somewhere between when you saw this Spark master error and when you saw the SSH banner error, you must have changed something about your config or setup. I doubt upgrading Flintrock from 0.10.0 to 0.11.0 is what changed one error to the other. This conversation must be frustrating for you. It certainly is for me. It is difficult for me to help when you aren't using the vanilla Amazon Linux 2 AMI and default Flintrock configuration I am suggesting. Starting from a known good configuration and working from there to the desired configuration is a basic debugging tactic. Unless we apply this tactic, I don't think I can help you any further. |
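The retry behavior described in this comment (retry on an SSH timeout, fail immediately on any other error such as a banner error) can be illustrated with a minimal sketch. This is not Flintrock's actual code: `connect_with_retry`, `BannerError`, and the `attempts`, `delay`, and `sleep` parameters are hypothetical names chosen for illustration, and `connect` stands in for whatever function opens the SSH connection.

```python
import time


class BannerError(Exception):
    """Hypothetical stand-in for a non-timeout SSH failure (e.g. a banner error)."""


def connect_with_retry(connect, attempts=5, delay=5.0, sleep=time.sleep):
    """Call connect() until it succeeds, retrying only on TimeoutError.

    Any other exception propagates on the first occurrence, because
    waiting longer cannot fix a bad key, user, or AMI.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except TimeoutError:
            # The instance may still be booting; wait and try again,
            # unless this was the last allowed attempt.
            if attempt == attempts:
                raise
            sleep(delay)
```

With this shape, raising `attempts` or `delay` only changes how long the code waits for an instance that is still coming up. A connection that fails with `BannerError` stops after one call regardless of those settings.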
I AM using flintrock 0.11.0 and the Vanilla Amazon Linux AMI. Please read the report above. This is the AMI they recommend when you launch any instance. I took the AMI ID from the launch instance dialog by selecting the Vanilla Amazon Linux option and copying the AMI ID. In any case this config WORKS but intermittently, so there is nothing wrong with the AMI, user, or key. The fault must be in the timing since, as I say, sometimes it works. Also I'm not sure why you changed the name of the issue since the error reported in the CLI is ssh timeout. This is how people will search for the solution or suggestions. |
I changed the issue title because, as I pointed out in my previous message, you initially reported Spark master timeouts, which is a different issue from the SSH timeouts we are now talking about. I am also still confused how the error changed from the Spark master timeouts to the SSH timeouts. |
The default AMI for Flintrock is not available for my zone. Further the link in the README.md to available AMIs is no longer active. |
What link? |
This seems to be resolved by using the specific AMI: |
OK, that's good. If you now have a working baseline, it should be easier to identify what specifically breaks things by introducing changes to that working baseline one by one. Knowing that Flintrock works with the configuration I've recommended at least rules out issues with your VPC, subnet, or security groups, which is good. If keeping everything the same and simply changing the AMI is what breaks things, then we can hone in on that to understand why. But it's critical to know that the AMI is the only thing that has changed from the working baseline to break things. If you'd like to keep debugging this, please provide an update on what exactly broke things from the working baseline and I'd be happy to reopen this issue. |
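The one-change-at-a-time approach described above can be sketched as a small helper that, given a working baseline and the desired configuration, produces one variant per differing setting. This is only an illustration: `single_change_variants` and the config keys shown are hypothetical and do not correspond to Flintrock's actual config schema.

```python
def single_change_variants(baseline, target):
    """Yield (key, config) pairs that differ from baseline in exactly one key.

    Only keys present in target are considered. Launching with each variant
    in turn shows which single change breaks the working baseline.
    """
    for key in sorted(target):
        if baseline.get(key) != target[key]:
            variant = dict(baseline)  # copy so the baseline is never mutated
            variant[key] = target[key]
            yield key, variant


# Hypothetical settings, for illustration only.
baseline = {"ami": "ami-baseline", "user": "ec2-user", "num_slaves": 1}
target = {"ami": "ami-custom", "user": "ubuntu", "num_slaves": 1}

for changed_key, config in single_change_variants(baseline, target):
    print(changed_key, config)
```

If the launch works with every single-change variant but fails with the full target, the problem lies in a combination of settings, not in any one of them.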
Hello, I am trying to launch a cluster of m5.xlarge instances and I am running into the same issue. I have changed the default AMI to the AMI you gave (ami-0b8d0d6ac70e5750c) and that still doesn't work. Below are snapshots of the error messages that appear on the console. I would be very grateful for your help. Thank you in advance.
I have constant failures to launch a cluster. With 2 slaves, I get an ssh timeout on one or more machines. Unfortunately the machines are actually created and so end up costing me--even if I tell flintrock to destroy them.
Does this intermittent timeout have something to do with the speed of Apache mirrors? This issue is the only one mentioned with the --debug switch. BTW I use the Ubuntu AMI and "ubuntu" user.
Originally posted by @pferrel in #238 (comment)