chess example problem #32

Closed
ianmiell opened this issue Jan 24, 2015 · 2 comments

@ianmiell

Tried setting up the chess example on a 2-node coreos cluster - any idea what's wrong here?

The nodes were called coreos-1 and coreos-2. I can reproduce this at will with a script, if you're interested.

core@coreos-1 ~ $  export SHUTIT_BACKUP_PS1_oO07eN35=$PS1 && PS1='SHUTIT_TMP#oYCuhA5v>' && unset PROMPT_COMMAND && stty cols 320
SHUTIT_TMP#oYCuhA5v>wget -qO- https://github.com/pachyderm-io/pfs/raw/master/deploy/static/1Node.tar.gz | tar -zxf -
SHUTIT_TMP#oYCuhA5v>fleetctl start 1Node/*
Triggered global unit router.service start
Unit master-0-1.service launched on 4a854405.../10.132.128.22
Unit announce-master-0-1.service launched on 4a854405.../10.132.128.22
SHUTIT_TMP#oYCuhA5v>fleetctl list-machines
MACHINE     IP      METADATA
4a854405... 10.132.128.22   -
7de2866d... 10.132.129.103  -
SHUTIT_TMP#oYCuhA5v>fleetctl list-fleetctl list-units^C
SHUTIT_TMP#oYCuhA5v>fleetctl list-units
UNIT                MACHINE             ACTIVE  SUB
announce-master-0-1.service 4a854405.../10.132.128.22   active  running
master-0-1.service      4a854405.../10.132.128.22   active  running
router.service          4a854405.../10.132.128.22   active  running
router.service          7de2866d.../10.132.129.103  active  running
SHUTIT_TMP#oYCuhA5v>git clone https://github.com/pachyderm/chess.git
Cloning into 'chess'...
remote: Counting objects: 22808, done.
remote: Compressing objects: 100% (22781/22781), done.
remote: Total 22808 (delta 27), reused 22798 (delta 17)
Receiving objects: 100% (22808/22808), 14.10 MiB | 6.07 MiB/s, done.
Resolving deltas: 100% (27/27), done.
Checking connectivity... done.
Checking out files: 100% (22757/22757), done.
SHUTIT_TMP#oYCuhA5v>cd chess
SHUTIT_TMP#oYCuhA5v>cd data/
SHUTIT_TMP#oYCuhA5v>./send_sample chess
SHUTIT_TMP#oYCuhA5v>curl pfs/file/chess
curl: (6) Couldn't resolve host 'pfs'
SHUTIT_TMP#oYCuhA5v>curl -XGET localhost/job/chess
Get http://coreos-1:53442/job/chess: dial tcp: lookup coreos-1: no such host
SHUTIT_TMP#oYCuhA5v>ping !$
ping coreos-1
ping: unknown host coreos-1
SHUTIT_TMP#oYCuhA5v>hostname
coreos-1
SHUTIT_TMP#oYCuhA5v>curl -XGET 10.132.128.22/job/chess
Get http://coreos-1:53442/job/chess: dial tcp: lookup coreos-1: no such host
SHUTIT_TMP#oYCuhA5v>ping coreos-1
ping: unknown host coreos-1
SHUTIT_TMP#oYCuhA5v>cat /etc/resolv.conf 
# This file is managed by systemd-resolved(8). Do not edit.
#
# Third party programs must not access this file directly, but
# only through the symlink at /etc/resolv.conf. To manage
# resolv.conf(5) in a different way, replace the symlink by a
# static file or a different symlink.

nameserver 8.8.8.8
nameserver 2001:4860:4860::8844
nameserver 2001:4860:4860::8888
SHUTIT_TMP#oYCuhA5v>cat /etc/hosts
cat: /etc/hosts: No such file or directory
SHUTIT_TMP#oYCuhA5v>docker ps -a
CONTAINER ID        IMAGE                  COMMAND                CREATED             STATUS                      PORTS                   NAMES
471c7d4a8637        pachyderm/pfs:latest   "/go/bin/master 0-1"   12 minutes ago      Up 12 minutes               0.0.0.0:53442->80/tcp   master-0-1             
ddf8422b9a4e        pachyderm/pfs:latest   "mkfs.btrfs /var/lib   12 minutes ago      Exited (0) 12 minutes ago                           desperate_archimedes   
2a231f90f6ae        pachyderm/pfs:latest   "/go/bin/router 1"     12 minutes ago      Up 12 minutes               0.0.0.0:80->80/tcp      router                 
1db59b30d70d        pachyderm/pfs:latest   "truncate /var/lib/p   12 minutes ago      Exited (0) 12 minutes ago                           suspicious_meitner     
SHUTIT_TMP#oYCuhA5v>logout
@jdoliner
Member

Hi, really sorry you ran into this. I think I have an idea what's going on here.

tl;dr: I think it's a problem with DNS.

Pfs uses etcd to announce its services; this is handled by the announce-master-*-*.service services. Those services do etcdctl set /pfs/master/0-1 {HOST_NAME}:port. The router service then uses this value to figure out how to contact the master. However, this only works if the router can do a DNS lookup on that hostname. How did you set up the CoreOS instances? I remember running into this problem when I tried to get pfs set up on Vagrant.
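
To make the failure mode concrete, here is a minimal sketch of the announce-then-lookup flow described above. It is not pfs's actual code: a plain dict stands in for etcd, and the names split_announced and resolve_master are hypothetical. The resolver is passed in as a parameter so the behaviour can be exercised without a working DNS.

```python
import socket


def split_announced(value):
    """Split an announced 'host:port' string into (host, port)."""
    host, _, port = value.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {value!r}")
    return host, int(port)


def resolve_master(store, key, resolve=socket.gethostbyname):
    """Read the announced address from the store and resolve its hostname.

    `store` is a dict standing in for etcd (an assumption of this sketch).
    Raises LookupError when the hostname cannot be resolved.
    """
    host, port = split_announced(store[key])
    try:
        ip = resolve(host)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        raise LookupError(
            f"cannot resolve {host!r} announced under {key}: {exc}"
        ) from exc
    return ip, port


# What the announce service would have written in this issue.
store = {"/pfs/master/0-1": "coreos-1:53442"}
```

Calling resolve_master(store, "/pfs/master/0-1") on a machine that cannot resolve coreos-1 raises LookupError, which corresponds to the "no such host" error in the transcript.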

There are a few easy ways I can think of to fix this for you short term:

  • echo "coreos-1 10.132.128.22" >/etc/hosts is a quick hack that should make things start working. This would also be a good way to confirm the DNS theory.
  • GCE and EC2 both have DNS by default; shoot me an email at jd@pachyderm.io and I'll set up a hosted instance for you to play around with.

Action items for pfs:

  • If DNS is indeed the problem, this error message should explicitly mention that as a likely cause.
  • Docs should do a better job of describing this in the "getting started" section.
  • We should look into making pfs not depend on DNS. I think this will be doable via flannel or something similar.
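
The first action item could look something like the sketch below. It is purely illustrative: add_dns_hint is a hypothetical helper, not part of pfs, and it assumes the error surfaces as the Go-style text seen in the transcript ("dial tcp: lookup ...: no such host").

```python
def add_dns_hint(error_text):
    """Append a DNS hint when a dial error reports a failed hostname lookup."""
    marker = "no such host"  # substring taken from the error in this issue
    if marker in error_text:
        return (
            error_text
            + " (hint: the announced hostname is not resolvable;"
            + " check DNS or /etc/hosts)"
        )
    return error_text


original = "Get http://coreos-1:53442/job/chess: dial tcp: lookup coreos-1: no such host"
print(add_dns_hint(original))
```

Errors that do not contain the marker are passed through unchanged, so the hint only appears where a lookup failure is the likely cause.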

@ianmiell
Author

Hi Joe,

Thanks for the great response! The project does have "Alpha" plastered
firmly all over it so don't stress about newb issues :)

Unfortunately, setting up /etc/hosts like so didn't help (nor did the other way
round, as you suggested, which I assume was a mistake).

SHUTIT_TMP#GusY7I2m>cat /etc/hosts
10.132.128.22 coreos-1
10.132.129.103 coreos-2
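
For reference, a hosts file entry lists the IP address first, followed by one or more names, which is the order used above. The sketch below is an illustrative parser only (parse_hosts is a hypothetical helper, not how the system resolver is implemented); it shows how each order would be read.

```python
def parse_hosts(text):
    """Map each name in hosts-file text to the address at the start of its line."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping


sample = "10.132.128.22 coreos-1\n10.132.129.103 coreos-2\n"
hosts = parse_hosts(sample)
```

With the fields reversed ("coreos-1 10.132.128.22"), the first field is still treated as the address, so no entry for the name coreos-1 is created.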

I can only assume coreos has its own way of doing name resolution?

Was my send_sample call correct, btw? If so, I can update the chess example
documentation. Though it returned an error code 6, possibly because of the
lookup issues.

I'm not too familiar with coreos, so am without all my usual debugging
tools :(

Don't worry about setting something up for me. I can try the others myself;
it's really a DO automation that I was after.

Thanks again.

Ian


@bufdev bufdev closed this as completed Jul 1, 2015
chainlink added a commit that referenced this issue Jun 30, 2021
Add kube version and icon