Using the cloud.keypair option breaks "knife cluster show", other discovery #95

Closed
pcn opened this Issue · 6 comments

3 participants

@pcn

I found that using cloud.keypair lets me specify the name of the keyfile the way I want (e.g. an ssh keypair named "foo" in ec2 can now be used if I have ~/.ssh/foo.pem, or better yet ~/.ssh/id_foo, which is what I use).
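
For context, the relevant part of my cluster definition looks roughly like this (a sketch from memory rather than the exact file; only the keypair line matters here):

    ClusterChef.cluster 'CassandraPN' do
      cloud :ec2 do
        flavor  't1.micro'
        keypair 'foo'        # ec2 keypair deliberately named differently from the cluster
      end

      facet :datanode do
        instances 1
      end
    end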

However, as soon as I use the keypair option, a node can successfully launch, bootstrap, and wrap up, but subsequent knife cluster commands can't seem to find it.

I haven't found what defines a "cluster" in cluster_chef. I'd assumed it was node attributes on the nodes in an account, but it seems the keypair plays a part as well?

I've been trying to figure out what's going on, but so far I haven't figured out where the key is:

Launching machines
  CassandraPN-datanode-0:        creating cloud server

  +------------------------+------------+---------+----------+--------------+------------+-----------+------------+-----------------+---------+---------+
  | Name                   | InstanceID | State   | Flavor   | Image        | AZ         | Public IP | Private IP | Created At      | Volumes | Elastic |
  +------------------------+------------+---------+----------+--------------+------------+-----------+------------+-----------------+---------+---------+
  | CassandraPN-datanode-0 | i-3c7e245e | pending | t1.micro | ami-4b4ba522 | us-east-1d |           |            | 20120112-210955 |         |         |
  +------------------------+------------+---------+----------+--------------+------------+-----------+------------+-----------------+---------+---------+
Waiting for servers:
    0/1  |                                                  |  0:53    CassandraPN-datanode-0:  Syncing to cloud        
  CassandraPN-datanode-0:         labeling servers and volumes
  CassandraPN-datanode-0:         tagging CassandraPN-datanode-0 with {"name"=>"CassandraPN-datanode-0"}
    1/1  |**************************************************|  0:54                                                     

  +------------------------+------------+---------+----------+--------------+------------+--------------+--------------+-----------------+---------+----+
  | Name                   | InstanceID | State   | Flavor   | Image        | AZ         | Public IP    | Private IP   | Created At      | Volumes | Ela|
  +------------------------+------------+---------+----------+--------------+------------+--------------+--------------+-----------------+---------+----+
  | CassandraPN-datanode-0 | i-3c7e245e | running | t1.micro | ami-4b4ba522 | us-east-1d | 50.17.103.38 | 10.196.14.36 | 20120112-210955 |         |    |
  +------------------------+------------+---------+----------+--------------+------------+--------------+--------------+-----------------+---------+----+

So it was created, and it is happy. Then (with some extra info printed out as part of my tracking this down):

847 pn@PN-mac 16:10 ~/dvcs/Knewton-chef-site-cookbooks $ knife cluster show CassandraPN 
Inventorying servers in CassandraPN cluster, all facets, all servers
"Cluster's class is:"
[:facets, :undefined_servers, :cluster, :cluster_name, :cluster_role, :facet, :has_facet?, :find_facet, :servers, :slice, :to_s, :resolve!, :create_clus]
"<ClusterChef::Cluster {\"name\"=>:CassandraPN, \"environment\"=>:_default @facets=>[\"datanode\"]}>"
"target.class is:"
ClusterChef::ServerSlice
"<ClusterChef::ServerSlice {} [\"CassandraPN-datanode-0\"]>"
  +------------------------+-------+-------------+------------+-----------+------------+------------+--------+----+----------+
  | Name                   | Chef? | State       | InstanceID | Public IP | Private IP | Created At | Flavor | AZ | Env      |
  +------------------------+-------+-------------+------------+-----------+------------+------------+--------+----+----------+
  | CassandraPN-datanode-0 | yes   | not running |            |           |            |            |        |    | _default |
  +------------------------+-------+-------------+------------+-----------+------------+------------+--------+----+----------+

It's gone, and there doesn't seem to be any indication as to why. If I launch with --bootstrap, the node ends up with all the appropriate ohai etc. info, but the show command gives the same empty result.

Trying to run other commands indicates that discovery is broken here, too:

848 pn@PN-mac 16:10 ~/dvcs/Knewton-chef-site-cookbooks $ knife cluster ssh CassandraPN  'ls'
Inventorying servers in CassandraPN cluster, all facets, all servers
"Cluster's class is:"
[:facets, :undefined_servers, :cluster, :cluster_name, :cluster_role, :facet, :has_facet?, :find_facet, :servers, :slice, :to_s, :resolve!, :create_clus]
"<ClusterChef::Cluster {\"name\"=>:CassandraPN, \"environment\"=>:_default @facets=>[\"datanode\"]}>"
FATAL: No nodes returned from search!

Except for the two pull requests I've submitted, I don't think I've made any changes other than print statements relating to the keypair. Any help is appreciated; this is keeping me from being able to hand cluster_chef to developers.

Commenting out the keypair clause in my cloud statement causes everything to work.

Thanks,

-Peter

@mrflip
Owner

right so -- the ec2 machine's keypair is the thing that defines its membership in a cluster.
This dates from before ec2 allowed tagging -- keypairs and security groups did double-duty as metadata for cluster and roles.

Is it important that you be able to vary the ec2_keypair from the cluster name, or just the filename for that ec2_keypair's private key?

Right now all of the following are intertwingled:

  • cluster chef: cluster name is default for cloud.keypair
  • cluster chef: amazon ec2 keypair attribute of a machine, set at launch
  • amazon: adds corresponding public key to ~ubuntu/.ssh/authorized_keys
  • ssh'ing (knife or otherwise): the filename of the private key for that keypair
  • cluster chef fog: filename where the keypair is written, if cluster_chef creates it.
  • cluster_chef discovery: first filter in instance discovery is a match on the ec2 keypair instance attribute

It's surprising behavior that setting the keypair breaks things, but my first reaction is to remove the functionality to set it. Do you require that the cluster 'foo' be able to contain machines that amazon knows as the keypair 'fnord'?

It is, however, dumb that it's so hard to specify a keypair location, and I've hit some other problems with that too. I'm going to open a separate bug just for that -- let's keep this one centered on 'setting a keypair different from the cluster name makes discovery fail'.

@pcn
    This dates from before ec2 allowed tagging -- keypairs and security groups did double-duty as metadata for cluster and roles.

OK, that's interesting; I wasn't aware of that history. I still have a hard time finding where/how this happens in discovery, but knowing this, I may have better luck understanding that bit when I look through the code.

    Is it important that you be able to vary the ec2_keypair from the cluster name, or just the filename for that ec2_keypair's private key?

It's very important to be able to disassociate the ec2_keypair from the cluster name.

    Do you require that the cluster 'foo' be able to contain machines that amazon knows as the keypair 'fnord'?

Yes, absolutely. Our use case is this:

  • We have an ops rotation of about a dozen devs.
  • Each of those devs should be able to manipulate all of the clusters.
  • We intend to run a variety of clusters - cassandra, app A, app B, app C, etc.
  • Each cluster may represent a code release. So instead of just a ProductCluster, we will probably deploy the ProductCluster_001056 cluster in the morning and ProductCluster_001057 for the next release, and have them live side by side.

So, having separate keypairs for each release creates some corner cases. Say person1 deploys a cluster; when there are issues and person2 needs to collaborate (e.g. on some kind of deployment failure), person1 has to share this particular private key, which was dynamically created, meaning we may end up with a lot of keys that people have to pass around. If person3 wants to participate, the sharing continues, and so on. This gets confusing fast. There are ways to make this better when everything is working (install the correct users, enable their access via directory services, passwd, etc.), but when things break it will cause confusion exactly when we need more order.

Being able to use a single default keypair whose name != cluster_name means we can share one key among everyone on ops within a single aws account. It also seems to me it would make identification of the cluster easier to understand, since the keypair is an obscure config option (maybe that's only my opinion, because amazon already has a handful of keys, environment variables, and other things named "key" that need to be set, and this is just one more). It would be much clearer if identification of the cluster were based on groups and tags, or just tags if possible.

A compromise would be to keep the public and private key on the launching host, upload the same public key under another name, and have knife understand the mapping from the original key name to the uploaded name. That also gets confusing really fast, but it is workable (though not preferable).

@mrflip
Owner

Good use case, and with tags existing now there's no reason to fudge the keypair.
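
Concretely, the "labeling servers and volumes" step in your launch log could write a 'cluster' tag alongside the 'name' tag it already sets -- something along these lines, using fog's AWS tag API (a sketch, not the actual labeling code; fog_server stands for the instance's fog object, assumed in scope):

    # sketch: tag the instance with its cluster at label time
    ClusterChef.fog_connection.tags.create(
      :resource_id => fog_server.id,
      :key         => 'cluster',
      :value       => cluster_name.to_s )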

The discovery portion proceeds as follows. (Why is it this complicated? Mostly due to our transition from the thing that preceded cluster_chef. I think this can be simplified, but carefully)

  1. The ClusterChef module discovers all servers (this cluster and others) using fog and memoizes the result:

      def self.fog_servers
        return @fog_servers if @fog_servers
        Chef::Log.debug("Using fog to catalog all servers")
        @fog_servers = ClusterChef.fog_connection.servers.all
      end
  2. The Cluster object discovers all its potential instances by filtering on the keypair:

      def fog_servers
        @fog_servers ||= ClusterChef.fog_servers.select{|fs| fs.key_name == cluster_name.to_s && (fs.state != "terminated") }
      end
    
  3. The cluster object's discover! method calls discover_fog_servers. This walks the list of candidate instances and pairs each with its corresponding Server object, either using the ec2 tags on the instance itself (awesome, but useful only on ec2) or using data registered in the chef node (universal, but only works once the node has run to completion and registered with chef). See the sketch after this list.

  4. The external interface is via ServerSlice, which just says "hey all my servers, what is your fog instance"? As long as instances are reliably paired with servers, it doesn't care.
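
In pseudocode, the pairing in step 3 looks roughly like this (method and helper names are illustrative, not the actual implementation):

      # sketch of step 3, not the real code: pair each candidate instance with
      # its Server object, preferring ec2 tags, falling back to chef
      def discover_fog_servers
        fog_servers.each do |fs|
          svr_name   = fs.tags['name']            # tag written when the instance was labeled
          svr_name ||= chef_node_name_for(fs.id)  # hypothetical chef-side lookup
          next unless svr_name && (svr = find_server(svr_name))  # find_server is illustrative too
          svr.fog_server = fs                     # the Server now knows its fog instance
        end
      end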

So... I think you're safe to actually delete the Cluster.fog_servers method entirely -- make sure you move the fs.state != "terminated" clause into discover_fog_servers.
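
Something like this (untested), so only the liveness filter survives, applied inside discover_fog_servers:

      # sketch: drop the keypair match, keep the terminated filter
      candidates = ClusterChef.fog_servers.reject{|fs| fs.state == "terminated" }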

I'd like @temujin9 to sign off on any pull, but overall I think this change should be made.

@pcn

OK, that was a lot easier than I thought -- thanks for shedding light on this. It seems the minimal change to get this working is:

diff --git a/lib/cluster_chef/discovery.rb b/lib/cluster_chef/discovery.rb
index c6ce396..bb63235 100644
--- a/lib/cluster_chef/discovery.rb
+++ b/lib/cluster_chef/discovery.rb
@@ -44,7 +44,7 @@ module ClusterChef
   protected

     def fog_servers
-      @fog_servers ||= ClusterChef.fog_servers.select{|fs| fs.key_name == cluster_name.to_s && (fs.state != "terminated") }
+      @fog_servers ||= ClusterChef.fog_servers.select{|fs| fs.tags["cluster"] == cluster_name.to_s && (fs.state != "terminated") }
     end

     # Walk the list of chef nodes and

Adding fallback code that checks the chef node attributes when @fog_servers comes back empty could be implemented as a sanity check, or maybe just as a fallback; I'll look at that in the next few days. I'm going to experiment and see whether it causes any oddities, like hiding servers that chef knows about but that aren't reflected in tags, vice versa, or other unknowns.
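
Roughly what I have in mind for the sanity check (an untested sketch; the node attribute and search query are guesses):

    # sketch only: warn when the tag filter finds nothing but chef still has
    # nodes registered for this cluster
    def warn_if_chef_disagrees
      return unless fog_servers.empty?
      nodes = Chef::Search::Query.new.search(:node, "cluster_name:#{cluster_name}").first
      unless nodes.empty?
        Chef::Log.warn("No tagged instances for #{cluster_name}, but chef has #{nodes.length} node(s) registered")
      end
    end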

@pcn

So far the simple change two comments up is still working for me, and is still necessary in ironfan. I haven't needed any fallback cases, but as I've mentioned, the default behavior has never worked for me: the keys get created, uploaded, and placed on the filesystem, but net::ssh never looks for them in the right place.

Looking forward to the real fix, or to the above patch making it into the mainstream.

@temujin9 closed this in 3a02256
@temujin9

Finally got this into 3.x; sorry for the delay. (FYI: pull requests are far more visible.)

Could you check that things still work for you in 4.x? I re-engineered that part, but don't have this use-case anywhere internally, so it's harder to test for.
