
installation #2

Closed · pgte opened this issue Feb 6, 2015 · 25 comments

pgte commented Feb 6, 2015

I'm going to do an installation and go by the book instead of trying to guess, so that onboarding new developers becomes easier.

pgte commented Feb 6, 2015

Ran into this problem when doing ./scripts/install-vagrant.sh:

Installing Paz on Vagrant
Please install etcdctl. Aborting.
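
For anyone else hitting this: etcdctl ships with the etcd release, so installing etcd locally should be enough. A rough sketch (the Homebrew formula and the release version here are illustrative, not from the Paz docs):

$ brew install etcd    # macOS; the etcd formula includes etcdctl

$ # Linux: grab a release tarball instead (version illustrative)
$ curl -L https://github.com/coreos/etcd/releases/download/v0.4.6/etcd-v0.4.6-linux-amd64.tar.gz | tar xz
$ sudo cp etcd-v0.4.6-linux-amd64/etcdctl /usr/local/bin/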

lukebond commented Feb 6, 2015

Perfect, will update the README.

pgte commented Feb 6, 2015

More progress; it now reports 2 failed units. Here is the tail of the output:

Starting paz runlevel 1 units
+ fleetctl -strict-host-key-checking=false start unitfiles/1/paz-orchestrator-announce.service unitfiles/1/paz-orchestrator.service unitfiles/1/paz-scheduler-announce.service unitfiles/1/paz-scheduler.service unitfiles/1/paz-service-directory-announce.service unitfiles/1/paz-service-directory.service
####################################################################
WARNING: fleetctl (0.8.3) is older than the latest registered
version of fleet found in the cluster (0.9.0). You are strongly
recommended to upgrade fleetctl to prevent incompatibility issues.
####################################################################
Unit paz-service-directory.service launched
Unit paz-orchestrator.service launched
Unit paz-scheduler.service launched on bef73231.../172.17.8.101
Unit paz-scheduler-announce.service launched on bef73231.../172.17.8.101
Unit paz-orchestrator-announce.service launched on 09938dfe.../172.17.8.102
Unit paz-service-directory-announce.service launched on f37795e5.../172.17.8.103
+ echo Successfully started all runlevel 1 paz units on the cluster with Fleet
Successfully started all runlevel 1 paz units on the cluster with Fleet
Waiting for runlevel 1 services to be activated...
Activating: 2 | Active: 2 | Failed: 2...
Failed unit detected

Any hints on how to debug this?

lukebond commented Feb 6, 2015

Some debugging tips:

Which units are failing?

$ fleetctl --endpoint=http://172.17.8.101:4001 list-units
UNIT                                    MACHINE                     ACTIVE      SUB
paz-orchestrator-announce.service       4e4038bb.../172.17.8.103    inactive    dead
paz-orchestrator.service                4e4038bb.../172.17.8.103    failed      failed
paz-scheduler-announce.service          7a70d1e8.../172.17.8.101    inactive    dead
paz-scheduler.service                   7a70d1e8.../172.17.8.101    failed      failed
paz-service-directory-announce.service  43049642.../172.17.8.102    inactive    dead
paz-service-directory.service           43049642.../172.17.8.102    failed      failed

Viewing the logs of a failed service:

$ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal paz-orchestrator

(add -f to follow logs)

SSH into the machine:

$ cd coreos-vagrant
$ vagrant ssh core-01    # or core-02, core-03

View system logs (after SSHing):

$ journalctl

I'm getting the same issue you are at the moment, so I'll be spending time debugging it this weekend. Using the alpha channel of CoreOS means things sometimes change between releases.
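
To save some typing, the tips above can be rolled into a small helper (a sketch; it assumes the Vagrant endpoint used throughout this thread, and the column positions match the list-units output above):

#!/usr/bin/env bash
# Dump the journal of every failed unit (ACTIVE is column 3 of list-units).
ENDPOINT=http://172.17.8.101:4001
fleetctl -strict-host-key-checking=false --endpoint=$ENDPOINT list-units |
  awk 'NR > 1 && $3 == "failed" {print $1}' |
  while read -r unit; do
    echo "=== journal for $unit ==="
    fleetctl -strict-host-key-checking=false --endpoint=$ENDPOINT journal "$unit"
  done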

lukebond commented Feb 8, 2015

Another tip: When viewing the journal for a service, if you see an HTTP 403 from Docker then check your quay.io credential environment variables as described in the README.
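
For reference, the variable the installer complains about later in this thread is $DOCKER_AUTH; a hedged sketch of setting it by hand (the README is authoritative for the exact variable names):

$ # the value is the base64-encoded "user:password" string from ~/.dockercfg
$ export DOCKER_AUTH="XXX"
$ ./scripts/install-vagrant.sh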

lukebond commented Feb 8, 2015

@pgte try again now, it's working for me after making a few fixes.

pgte commented Feb 11, 2015

A bit more progress, but still failing for me.
From the log it looks like I may need access to some quay.io repos:

$ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal paz-orchestrator
####################################################################
WARNING: fleetctl (0.8.3) is older than the latest registered
version of fleet found in the cluster (0.9.0). You are strongly
recommended to upgrade fleetctl to prevent incompatibility issues.
####################################################################
-- Logs begin at Wed 2015-02-11 10:36:02 UTC, end at Wed 2015-02-11 10:37:56 UTC. --
Feb 11 10:36:38 core-02 systemd[1]: Starting paz-orchestrator: Main API for all paz services and monitor of services in etcd....
Feb 11 10:36:38 core-02 docker[993]: WARNING: Invalid auth configuration file
Feb 11 10:36:41 core-02 docker[993]: Pulling repository quay.io/yldio/paz-orchestrator
Feb 11 10:36:43 core-02 systemd[1]: paz-orchestrator.service: control process exited, code=exited status=1
Feb 11 10:36:43 core-02 systemd[1]: Failed to start paz-orchestrator: Main API for all paz services and monitor of services in etcd..
Feb 11 10:36:43 core-02 systemd[1]: Unit paz-orchestrator.service entered failed state.
Feb 11 10:36:43 core-02 systemd[1]: paz-orchestrator.service failed.
Feb 11 10:36:43 core-02 docker[993]: time="2015-02-11T10:36:43Z" level="fatal" msg="HTTP code: 403"

lukebond commented:

A 403 suggests missing or incorrect quay.io credentials. In the installation section of the README there is a recent addition noting that the installer can now read credentials from your ~/.dockercfg file. Do docker login https://quay.io, enter your quay.io credentials, and then try the installation again. It should take your creds from ~/.dockercfg and put them on each VM.

pgte commented Feb 11, 2015

Downloaded .dockercfg from quay.io and installed it in ~/.dockercfg.

→ cat /Users/pedroteixeira/.dockercfg
{
 "quay.io": {
  "auth": "XXX",
  "email": "i@pgte.me"
 }
}

Looks OK. But now, when I run the installation script, I get:

→ scripts/install-vagrant.sh
Installing Paz on Vagrant
Attempt to autoload Docker config from /Users/pedroteixeira/.dockercfg FAILED
You must set the $DOCKER_AUTH environment variable

lukebond commented:

The registry key "quay.io" needs to be "https://quay.io" at the moment. I'll open an issue for this, as it's too brittle; it should work with or without the protocol.
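
In other words, the file above becomes (same contents; only the key changes):

{
 "https://quay.io": {
  "auth": "XXX",
  "email": "i@pgte.me"
 }
}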

lukebond commented:

Created issue #7 for this.

pgte commented Feb 11, 2015

That fixed the reading of the file.
Also, I was getting a 403 because I didn't belong to the org (GitHub org membership doesn't carry over to quay.io).
Perhaps document this somewhere?

lukebond added a commit that referenced this issue Feb 11, 2015
pgte commented Feb 11, 2015

Hmmm... now I get a 500. Here is the log for the orchestrator:

→ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal paz-orchestrator
####################################################################
WARNING: fleetctl (0.8.3) is older than the latest registered
version of fleet found in the cluster (0.9.0). You are strongly
recommended to upgrade fleetctl to prevent incompatibility issues.
####################################################################
-- Logs begin at Wed 2015-02-11 13:00:00 UTC, end at Wed 2015-02-11 13:01:43 UTC. --
Feb 11 13:00:41 core-02 docker[1054]: time="2015-02-11T13:00:41Z" level="fatal" msg="HTTP code: 500"
Feb 11 13:00:41 core-02 systemd[1]: Unit paz-orchestrator.service entered failed state.
Feb 11 13:00:41 core-02 systemd[1]: paz-orchestrator.service failed.
Feb 11 13:00:41 core-02 systemd[1]: Starting paz-orchestrator: Main API for all paz services and monitor of services in etcd....
Feb 11 13:00:45 core-02 docker[1105]: Pulling repository quay.io/yldio/paz-orchestrator
Feb 11 13:00:46 core-02 systemd[1]: paz-orchestrator.service: control process exited, code=exited status=1
Feb 11 13:00:46 core-02 systemd[1]: Failed to start paz-orchestrator: Main API for all paz services and monitor of services in etcd..
Feb 11 13:00:46 core-02 systemd[1]: Unit paz-orchestrator.service entered failed state.
Feb 11 13:00:46 core-02 systemd[1]: paz-orchestrator.service failed.
Feb 11 13:00:46 core-02 docker[1105]: time="2015-02-11T13:00:46Z" level="fatal" msg="HTTP code: 500"

lukebond commented:

Hmm, not very enlightening. Could you post some logs from the host around that time using journalctl, please?

lukebond commented:

Any luck with this, @pgte? Can you confirm whether you were running the integration test script or install-vagrant?

Confirmed working on ArchLinux \o/

No9 commented Mar 3, 2015

Had a dive into this over the weekend.
Ran into an issue where I was getting timeouts when logging into the quay.io server when running:

$ sudo docker login https://quay.io

FATA[0036] Error response from daemon: v1 ping attempt failed with error: Get https://quay.io/v1/ping: dial tcp: i/o timeout

The number in FATA[0036] may vary.
Confirmed by quay.io as a problem on their side with Route53.

The workaround was to put an entry into /etc/hosts after finding out where quay.io resolves to.
N.B. ping is blocked, so I used wget:

$ wget quay.io
--2015-03-02 23:57:25--  http://quay.io/
Resolving quay.io (quay.io)... 184.73.156.14, 50.17.243.21, 54.243.34.28, ...
Connecting to quay.io (quay.io)|184.73.156.14|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://quay.io/ [following]

So I put the entry

184.73.156.14 quay.io

into my /etc/hosts file and login was fine.
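
The same workaround as a one-liner, for anyone who wants to script it (a sketch; assumes a Linux box with getent, and the pinned address can go stale):

$ echo "$(getent hosts quay.io | awk '{print $1; exit}') quay.io" | sudo tee -a /etc/hosts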

tomgco commented Mar 3, 2015

Hey @No9, thanks for having a look and reporting this. It shouldn't be necessary to log into quay.io any more; however, we are looking to deploy to https://registry.hub.docker.com as well (issue #23).

I tried to replicate your login issue, but it was successful for me. If anyone else has problems with this, we can add a notice to the README.

No9 commented Mar 3, 2015

Thanks @tomgco.
FYI, I think this is the line that prints the message when you're not logged in to quay.io: https://github.com/yldio/paz/blob/master/scripts/helpers.sh#L9

lukebond commented Mar 3, 2015

Looks like it's time to just remove all that Docker auth stuff from the installation process. It's probably silently working for those of us who still have the credentials in our ~/.dockercfg and failing for those who don't. It's no longer needed since the Docker repos are now public and won't become private again.

Created #27

twilson63 commented:

Thanks for Paz, looking forward to playing with it. I tried to install via Vagrant.

How long should Paz take to install via the Vagrant install script?

Starting paz runlevel 1 units
Unit paz-scheduler.service launched on 257b40cd.../172.17.8.102
Unit paz-orchestrator-announce.service launched on 23965b52.../172.17.8.103
Unit paz-service-directory.service launched on f441edc7.../172.17.8.101
Unit paz-service-directory-announce.service launched on f441edc7.../172.17.8.101
Unit paz-scheduler-announce.service launched on 257b40cd.../172.17.8.102
Unit paz-orchestrator.service launched on 23965b52.../172.17.8.103
Successfully started all runlevel 1 paz units on the cluster with Fleet
Waiting for runlevel 1 services to be activated...
Activating: 6 | Active: 0 | Failed: 0...

Any ideas, what I might be doing wrong?

lukebond commented Mar 4, 2015

@twilson63 thanks for taking it for a spin!

There is no error in what you're seeing here, but the next step will take a while. It has started the units on the cluster but "starting" involves pulling the Docker images before running them. The base images (usually Ubuntu) are quite big and will take a while. If the units are evenly distributed across the cluster by Fleet then each host in your cluster will be pulling the same base images. Not ideal and it takes a while.

The Activating/Active/Failed line comes from running grep and awk on the output of fleetctl list-units in your cluster. Once they all say "Active" it will be finished.

If anything goes wrong at this point please use fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 list-units to see what has failed, and use fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal <SERVICENAME> to see the logs for a given service.
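
A plausible reconstruction of that status line, if you want to run it by hand (a sketch; the install script itself is authoritative):

$ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 list-units |
    awk 'NR > 1 && $3 == "activating" {a++}
         NR > 1 && $3 == "active"     {b++}
         NR > 1 && $3 == "failed"     {f++}
         END {printf "Activating: %d | Active: %d | Failed: %d...\n", a+0, b+0, f+0}'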

twilson63 commented:

Great! I think everything is running, but I can't seem to access any of the IP addresses. I have very limited experience with Vagrant; once everything is up, should I be able to access the web service by opening a browser at http://172.17.8.101/?

Thanks for the help!

lukebond commented Mar 4, 2015

If you've done the /etc/hosts step you should be able to hit the Web UI at http://paz-web.paz

The services are all exposed on random ports by Docker, so there's nothing on port 80 but HAProxy, and that is configured to check for the service you want (the prefix in front of .paz) and forward the request on to the right service. (If you're interested, it also does a similar thing internally, forwarding purely by service name, e.g. "paz-scheduler".) So since "paz-web.paz" doesn't route anywhere on the internet, you need to do the /etc/hosts hack. I appreciate that none of this is obvious at the moment, given the current state of the docs.
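
For example, the /etc/hosts entry would look something like this (assuming the Vagrant cluster IPs from this thread; HAProxy routes by hostname, so any of the three hosts should work):

172.17.8.101 paz-web.paz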

twilson63 commented:

Cool,

I think I fubared something; I will try again.


lukebond commented:

A lot has changed since this issue was opened, and it now spans a few different issues from different people. Going to close it; please open new issues with any updates. Thanks all for the contributions!
