Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
branch: master
Fetching contributors…

Cannot retrieve contributors at this time

202 lines (135 sloc) 6.749 kb

Tips and Notes

If you’re west, first run from the shell


  export EC2_URL=https://us-west-1.ec2.amazonaws.com
<pre><code>

Instance attributes: disable_api_termination and delete_on_termination

To set delete_on_termination to ‘true’ after the fact, run the following


  ec2-modify-instance-attribute -v i-0704be6c --block-device-mapping /dev/sda1=vol-XX8d2c80::true

(You’ll have to modify the instance and volume to suit)

If you set disable_api_termination to true, in order to terminate the node run


  ec2-modify-instance-attribute -v i-0704be6c --disable-api-termination false

Standalone Chef server

installing from apt

You need to have a apt world acct

https://launchpad.net/~jtimberman/archive/opschef/packages
sudo add-apt-repository ppa:user/ppa-name

If you have your own chef server already

  • make a chef-server sercurity group
  • have one of the security groups on your chef server be open to machines in the chef-client group.

Common Problems

If you’re having 401 Authentication errors

After spinning down/starting up your cluster, if you use role-based names (zaius-slave-0 etc) you need to do


  knife client bulk-delete 'zaius-.*'

(Use a more limited regex if some of those are still alive!) Chef won’t authenticate a client if one already exists for that name.

Still doesn’t work?

  • You must set the node_name in /etc/chef/client.rb — the one in /etc/chef/client-config.json doesn’t impact connecting with the chef server.
  • Check the /etc/chef/validation.pem file. The client script should leave alone whatever you place there.
  • Verify that no such client already exists.
  • Now blow away the client.rb file and re-run chef server. It should authenticate as the node name you set.
  • If not, tail the chef-server log on the chef server — is it a problem on the server end? is there connection activity when you expect?

‘start runsvdir’ returned 1, expected 0

For an error like


/usr/bin/chef-client:19
/usr/lib/ruby/gems/1.8/gems/chef-0.8.16/bin/../lib/chef/mixin/command.rb:181:in `handle_command_failures': start runsvdir returned 1, expected 0 (Chef::Exceptions::Exec)
	from /usr/lib/ruby/gems/1.8/gems/chef-0.8.16/bin/../lib/chef/mixin/command.rb:124:in `run_command'
	from /usr/lib/ruby/gems/1.8/gems/chef-0.8.16/bin/../lib/chef/provider/execute.rb:49:in `action_run'
	from /usr/lib/ruby/gems/1.8/gems/chef-0.8.16/bin/../lib/chef/runner.rb:60:in `send'

chef has become confused about the state of the runit coordinating daemon. No idea why this sometimes happens, but


sudo restart runsvdir

and then re-running the chef client seems to usually work

h3.

If you see


/usr/lib/ruby/gems/1.8/gems/chef-0.8.16/bin/../lib/chef/mixin/command.rb:181:in `handle_command_failures’: apt-get update returned 100, expected 0 (Chef::Exceptions::Exec)

try running apt-get update by hand. It will probably suggest trying sudo dpkg --configure -a which may help.

Debugging

Debugging chef client (and client on boot-up)

If you installed from bootstrap, runit handles the logs:


  tail -f -n1000 /etc/sv/chef-client/log/main/current 

The opscode chef AMIs and the apt both log to /var/log/chef/client.log:


  tail -f -n1000 /var/log/chef/client.log

In the below we’ll tail both files — feel free to leave off the log file that’s uninteresting.

Kickstart the chef-client

  • If you need to kickstart the chef-client, log into the machine as ubuntu user and

  sudo service chef-client stop # so that it doesn't try running while you're experimenting
  cd /etc/chef
  tail -f /etc/sv/chef-client/log/main/current /var/log/chef/client.log &
  sudo chef-client
  # ...
  sudo service chef-client start # once you're done

If the node is confused about its identity (gives you `error!': 401 "Unauthorized" (Net::HTTPServerException)): remove /etc/chef/client-config.json and /etc/chef/client.pem and re-run sudo chef-client

Debugging chef everything


  tail -n200 -f /etc/sv/chef-*/log/main/current
  for foo in rabbitmq-server thttpd couchdb chef-{solr,solr-indexer,client,server,server-webui} cassandra ; do sudo service $foo restart ; done

Debug chef server bootstrap

Using on the chef server can help debug authentication problems


  tail -f -n1000 /tmp/user_data-progress.log /var/log/dpkg.log /etc/sv/chef-client/log/main/current 

Immediately after If the webui doesn’t log you in, try doing sudo service chef-server-webui restart — it occasionally will fail to create the admin user for some reason.

Debugging hadoop


  tail -f /var/log/hadoop/hadoop-hadoop-namenode-chef.infochimps.com.log &
  sudo service hadoop-0.20-datanode status
  # ... and so on ...
  sudo service hadoop-0.20-datanode restart

Cassandra

  • Logs are in /etc/

To check that cassandra works as it should:


  grep ListenAddress /etc/cassandra/storage-conf.xml
  irb
  # jump into irb, and plug your ip address into the line below.
  require 'rubygems' ; require 'cassandra' ; include Cassandra::Constants 
  twitter = Cassandra.new('Twitter', '10.162.67.95:9160')
  user = {'screen_name' => 'buttonscat'} ;
  twitter.insert(:Users, '5', user)
  twitter.get(:Users, '5')  

EC2 Instance settings

Instance attributes: disable_api_termination and delete_on_termination

To set delete_on_termination to ‘true’ after the fact, run the following


  ec2-modify-instance-attribute -v i-0704be6c --block-device-mapping /dev/sda1=vol-e98d2c80::true

(You’ll have to modify the instance and volume to suit)

If you set disable_api_termination to true, in order to terminate the node run


  ec2-modify-instance-attribute -v i-0704be6c --disable-api-termination false

curl http://169.254.169.254/latest/user-data

Doing something to every node


. ~/.hadoop-ec2/aws_private_setup.sh ; hadoop-ec2 list gibbon > /tmp/gibbon-hosts.log ; cat /tmp/gibbon-hosts.log | cut -f4 > /tmp/gibbon-slaves.log 

for foo in `cat /tmp/gibbon-slaves.log` ; do echo $foo ; sshkp $foo 'for svc in chef-client hadoop-0.20-{tasktracker,datanode} ; do sudo service $svc stop ; done' & done

On-the-fly backup of your namenode metadata

bkupdir=/ebs2/hadoop-nn-backup/`date +"%Y%m%d"`

for srcdir in /ebs*/hadoop/hdfs/ /home/hadoop/gibbon/hdfs/ ; do
destdir=$bkupdir/$srcdir ; echo $destdir ;
sudo mkdir -p $destdir ;

done

Jump to Line
Something went wrong with that request. Please try again.