
installation #2

Closed · pgte opened this issue Feb 6, 2015 · 25 comments

pgte commented Feb 6, 2015

I'm going to do an installation and go by the book instead of trying to guess, so that onboarding new developers becomes easier.

pgte commented Feb 6, 2015

Ran into this problem when doing ./scripts/install-vagrant.sh:

Installing Paz on Vagrant
Please install etcdctl. Aborting.
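
For anyone else hitting this: etcdctl ships with the etcd release, so installing etcd locally should be enough. A rough sketch (the Homebrew formula and the release version here are illustrative, not from the Paz docs):

$ brew install etcd    # macOS; the etcd formula includes etcdctl

$ # Linux: grab a release tarball instead (version illustrative)
$ curl -L https://github.com/coreos/etcd/releases/download/v0.4.6/etcd-v0.4.6-linux-amd64.tar.gz | tar xz
$ sudo cp etcd-v0.4.6-linux-amd64/etcdctl /usr/local/bin/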

lukebond commented Feb 6, 2015

Perfect, will update the README.

pgte commented Feb 6, 2015

More progress; it now reports 2 failed units. Here is the tail of the output:

Starting paz runlevel 1 units
+ fleetctl -strict-host-key-checking=false start unitfiles/1/paz-orchestrator-announce.service unitfiles/1/paz-orchestrator.service unitfiles/1/paz-scheduler-announce.service unitfiles/1/paz-scheduler.service unitfiles/1/paz-service-directory-announce.service unitfiles/1/paz-service-directory.service
####################################################################
WARNING: fleetctl (0.8.3) is older than the latest registered
version of fleet found in the cluster (0.9.0). You are strongly
recommended to upgrade fleetctl to prevent incompatibility issues.
####################################################################
Unit paz-service-directory.service launched
Unit paz-orchestrator.service launched
Unit paz-scheduler.service launched on bef73231.../172.17.8.101
Unit paz-scheduler-announce.service launched on bef73231.../172.17.8.101
Unit paz-orchestrator-announce.service launched on 09938dfe.../172.17.8.102
Unit paz-service-directory-announce.service launched on f37795e5.../172.17.8.103
+ echo Successfully started all runlevel 1 paz units on the cluster with Fleet
Successfully started all runlevel 1 paz units on the cluster with Fleet
Waiting for runlevel 1 services to be activated...
Activating: 2 | Active: 2 | Failed: 2...
Failed unit detected

Any hints on how to debug this?

lukebond commented Feb 6, 2015

Some debugging tips:

Which units are failing?

$ fleetctl --endpoint=http://172.17.8.101:4001 list-units
UNIT                                    MACHINE                     ACTIVE      SUB
paz-orchestrator-announce.service       4e4038bb.../172.17.8.103    inactive    dead
paz-orchestrator.service                4e4038bb.../172.17.8.103    failed      failed
paz-scheduler-announce.service          7a70d1e8.../172.17.8.101    inactive    dead
paz-scheduler.service                   7a70d1e8.../172.17.8.101    failed      failed
paz-service-directory-announce.service  43049642.../172.17.8.102    inactive    dead
paz-service-directory.service           43049642.../172.17.8.102    failed      failed

Viewing the logs of a failed service:

$ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal paz-orchestrator

(add -f to follow logs)

SSH into the machine:

$ cd coreos-vagrant
$ vagrant ssh core-01    # or core-02, core-03

View system logs (after SSHing):

$ journalctl

I'm getting the same issue you are at the moment, so I'll be spending time debugging it this weekend. Using the alpha channel of CoreOS means things sometimes change between releases.
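
To save some typing, the tips above can be rolled into a small helper (a sketch; it assumes the Vagrant endpoint used throughout this thread, and the column positions match the list-units output above):

#!/usr/bin/env bash
# Dump the journal of every failed unit (ACTIVE is column 3 of list-units).
ENDPOINT=http://172.17.8.101:4001
fleetctl -strict-host-key-checking=false --endpoint=$ENDPOINT list-units |
  awk 'NR > 1 && $3 == "failed" {print $1}' |
  while read -r unit; do
    echo "=== journal for $unit ==="
    fleetctl -strict-host-key-checking=false --endpoint=$ENDPOINT journal "$unit"
  done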

lukebond commented Feb 8, 2015

Another tip: When viewing the journal for a service, if you see an HTTP 403 from Docker then check your quay.io credential environment variables as described in the README.
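
For reference, the variable the installer complains about later in this thread is $DOCKER_AUTH; a hedged sketch of setting it by hand (the README is authoritative for the exact variable names):

$ # the value is the base64-encoded "user:password" string from ~/.dockercfg
$ export DOCKER_AUTH="XXX"
$ ./scripts/install-vagrant.sh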

lukebond commented Feb 8, 2015

@pgte try again now, it's working for me after making a few fixes.

pgte commented Feb 11, 2015

A bit more progress, but still failing for me.
From the log it looks like I may need access to some quay.io repos:

$ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal paz-orchestrator
####################################################################
WARNING: fleetctl (0.8.3) is older than the latest registered
version of fleet found in the cluster (0.9.0). You are strongly
recommended to upgrade fleetctl to prevent incompatibility issues.
####################################################################
-- Logs begin at Wed 2015-02-11 10:36:02 UTC, end at Wed 2015-02-11 10:37:56 UTC. --
Feb 11 10:36:38 core-02 systemd[1]: Starting paz-orchestrator: Main API for all paz services and monitor of services in etcd....
Feb 11 10:36:38 core-02 docker[993]: WARNING: Invalid auth configuration file
Feb 11 10:36:41 core-02 docker[993]: Pulling repository quay.io/yldio/paz-orchestrator
Feb 11 10:36:43 core-02 systemd[1]: paz-orchestrator.service: control process exited, code=exited status=1
Feb 11 10:36:43 core-02 systemd[1]: Failed to start paz-orchestrator: Main API for all paz services and monitor of services in etcd..
Feb 11 10:36:43 core-02 systemd[1]: Unit paz-orchestrator.service entered failed state.
Feb 11 10:36:43 core-02 systemd[1]: paz-orchestrator.service failed.
Feb 11 10:36:43 core-02 docker[993]: time="2015-02-11T10:36:43Z" level="fatal" msg="HTTP code: 403"

lukebond commented:

A 403 suggests missing or incorrect quay.io credentials. In the installation section of the README there is a recent addition noting that the installer can now read credentials from your ~/.dockercfg file. Do docker login https://quay.io, enter your quay.io credentials, and then try the installation again. It should take your creds from ~/.dockercfg and put them on each VM.

pgte commented Feb 11, 2015

Downloaded .dockercfg from quay.io and installed it in ~/.dockercfg.

→ cat /Users/pedroteixeira/.dockercfg
{
 "quay.io": {
  "auth": "XXX",
  "email": "i@pgte.me"
 }
}

Looks OK. But now, when I run the installation script, I get:

→ scripts/install-vagrant.sh
Installing Paz on Vagrant
Attempt to autoload Docker config from /Users/pedroteixeira/.dockercfg FAILED
You must set the $DOCKER_AUTH environment variable

lukebond commented:

The registry key "quay.io" needs to be "https://quay.io" at the moment. I'll open an issue for this, as it's too brittle; it should work with or without the protocol.
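
In other words, the file above becomes (same contents; only the key changes):

{
 "https://quay.io": {
  "auth": "XXX",
  "email": "i@pgte.me"
 }
}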

lukebond commented:

Created issue #7 for this.

pgte commented Feb 11, 2015

That fixed the reading of the file.
Also, I was getting a 403 because I didn't belong to the org (GitHub org membership doesn't carry over to quay.io).
Perhaps document this somewhere?

lukebond added a commit that referenced this issue Feb 11, 2015
pgte commented Feb 11, 2015

Hmmm... now I get a 500. Here is the log for the orchestrator:

→ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal paz-orchestrator
####################################################################
WARNING: fleetctl (0.8.3) is older than the latest registered
version of fleet found in the cluster (0.9.0). You are strongly
recommended to upgrade fleetctl to prevent incompatibility issues.
####################################################################
-- Logs begin at Wed 2015-02-11 13:00:00 UTC, end at Wed 2015-02-11 13:01:43 UTC. --
Feb 11 13:00:41 core-02 docker[1054]: time="2015-02-11T13:00:41Z" level="fatal" msg="HTTP code: 500"
Feb 11 13:00:41 core-02 systemd[1]: Unit paz-orchestrator.service entered failed state.
Feb 11 13:00:41 core-02 systemd[1]: paz-orchestrator.service failed.
Feb 11 13:00:41 core-02 systemd[1]: Starting paz-orchestrator: Main API for all paz services and monitor of services in etcd....
Feb 11 13:00:45 core-02 docker[1105]: Pulling repository quay.io/yldio/paz-orchestrator
Feb 11 13:00:46 core-02 systemd[1]: paz-orchestrator.service: control process exited, code=exited status=1
Feb 11 13:00:46 core-02 systemd[1]: Failed to start paz-orchestrator: Main API for all paz services and monitor of services in etcd..
Feb 11 13:00:46 core-02 systemd[1]: Unit paz-orchestrator.service entered failed state.
Feb 11 13:00:46 core-02 systemd[1]: paz-orchestrator.service failed.
Feb 11 13:00:46 core-02 docker[1105]: time="2015-02-11T13:00:46Z" level="fatal" msg="HTTP code: 500"

lukebond commented:

Hmm, not very enlightening. Could you post some logs from the host around that time using journalctl, please?

lukebond commented:

Any luck with this, @pgte? Can you confirm whether you were running the integration test script or install-vagrant?

Confirmed working on ArchLinux \o/

No9 commented Mar 3, 2015

Had a dive into this over the weekend.
Ran into an issue where I was getting timeouts when logging into the quay.io server when running:

$ sudo docker login https://quay.io

FATA[0036] Error response from daemon: v1 ping attempt failed with error: Get https://quay.io/v1/ping: dial tcp: i/o timeout

The number in FATA[0036] may vary.
Confirmed by quay.io as a problem on their side with Route53.

The workaround was to put an entry into /etc/hosts after finding out where quay.io resolves to.
N.B. ping is blocked, so I used wget:

$ wget quay.io
--2015-03-02 23:57:25--  http://quay.io/
Resolving quay.io (quay.io)... 184.73.156.14, 50.17.243.21, 54.243.34.28, ...
Connecting to quay.io (quay.io)|184.73.156.14|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://quay.io/ [following]

So I put the entry

184.73.156.14 quay.io

into my /etc/hosts file and login was fine.
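
The same workaround as a one-liner, for anyone who wants to script it (a sketch; assumes a Linux box with getent, and the pinned address can go stale):

$ echo "$(getent hosts quay.io | awk '{print $1; exit}') quay.io" | sudo tee -a /etc/hosts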

tomgco commented Mar 3, 2015

Hey @No9, thanks for having a look and reporting this. It shouldn't be necessary to log into quay.io any more; however, we are looking to deploy to https://registry.hub.docker.com as well (issue #23).

I tried to replicate your login issue, but it was successful for me. If anyone else has problems with this, we can add a notice to the README.

No9 commented Mar 3, 2015

Thanks @tomgco.
FYI, I think this is the line that prints the message when you're not logged in to quay.io: https://github.com/yldio/paz/blob/master/scripts/helpers.sh#L9

lukebond commented Mar 3, 2015

Looks like it's time to just remove all that Docker auth stuff from the installation process. It's probably silently working for those of us who still have the credentials in our ~/.dockercfg and failing for those who don't. It's no longer needed since the Docker repos are now public and won't become private again.

Created #27

twilson63 commented:

Thanks for Paz, looking forward to playing with it. I tried to install via Vagrant.

How long should Paz take to install via the Vagrant install script?

Starting paz runlevel 1 units
Unit paz-scheduler.service launched on 257b40cd.../172.17.8.102
Unit paz-orchestrator-announce.service launched on 23965b52.../172.17.8.103
Unit paz-service-directory.service launched on f441edc7.../172.17.8.101
Unit paz-service-directory-announce.service launched on f441edc7.../172.17.8.101
Unit paz-scheduler-announce.service launched on 257b40cd.../172.17.8.102
Unit paz-orchestrator.service launched on 23965b52.../172.17.8.103
Successfully started all runlevel 1 paz units on the cluster with Fleet
Waiting for runlevel 1 services to be activated...
Activating: 6 | Active: 0 | Failed: 0...

Any ideas, what I might be doing wrong?

lukebond commented Mar 4, 2015

@twilson63 thanks for taking it for a spin!

There is no error in what you're seeing here, but the next step will take a while. It has started the units on the cluster but "starting" involves pulling the Docker images before running them. The base images (usually Ubuntu) are quite big and will take a while. If the units are evenly distributed across the cluster by Fleet then each host in your cluster will be pulling the same base images. Not ideal and it takes a while.

The Activating/Active/Failed line comes from running grep and awk on the output of fleetctl list-units in your cluster. Once they all say "Active" it will be finished.

If anything goes wrong at this point please use fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 list-units to see what has failed, and use fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal <SERVICENAME> to see the logs for a given service.
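
A plausible reconstruction of that status line, if you want to run it by hand (a sketch; the install script itself is authoritative):

$ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 list-units |
    awk 'NR > 1 && $3 == "activating" {a++}
         NR > 1 && $3 == "active"     {b++}
         NR > 1 && $3 == "failed"     {f++}
         END {printf "Activating: %d | Active: %d | Failed: %d...\n", a+0, b+0, f+0}'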

twilson63 commented:

Great! I think everything is running, but I can't seem to access any of the IP addresses. I have very limited experience with Vagrant; once everything is up, should I be able to access the web service by opening a browser at http://172.17.8.101/?

Thanks for the help!

lukebond commented Mar 4, 2015

If you've done the /etc/hosts step you should be able to hit the Web UI at http://paz-web.paz

The services are all exposed on random ports by Docker, so there's nothing on port 80 but HAProxy, and that is configured to check for the service you want (the prefix in front of .paz) and forward the request on to the right service. (If you're interested, it also does a similar thing internally, forwarding purely by service name, e.g. "paz-scheduler".) So since "paz-web.paz" doesn't route anywhere on the internet, you need to do the /etc/hosts hack. I appreciate that none of this is obvious at the moment, given the current state of the docs.
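
For example, the /etc/hosts entry would look something like this (assuming the Vagrant cluster IPs from this thread; HAProxy routes by hostname, so any of the three hosts should work):

172.17.8.101 paz-web.paz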

twilson63 commented:

Cool,

I think I fubared something; I will try again.


lukebond commented:

A lot has changed since this issue was opened, and it now spans a few different issues from different people. Going to close it; please open new issues with any updates. Thanks all for the contributions!
