
start-kube-docker not working in Vagrant image #161

Closed
brinman2002 opened this issue Nov 27, 2015 · 23 comments

Comments

@brinman2002

Trying to run Pachyderm in Vagrant using the Vagrantfile/init.sh from QUICKSTART.md in the GitHub documentation. The gcr.io/google_containers/hyperkube:v1.1.2 container does not start.

Steps to reproduce:

vagrant destroy # or download per README.md
vagrant up
vagrant ssh

go get github.com/pachyderm/pachyderm/...
cd ~/go/src/github.com/pachyderm/pachyderm
etc/kube/start-kube-docker.sh
~/pachyderm_vagrant$ vagrant version
Installed Version: 1.7.4
Latest Version: 1.7.4

You're running an up-to-date version of Vagrant!

Console log: kubeNotStarting.txt

@brinman2002
Author

A couple of things I noted: I don't see anything in the script starting the rethinkdb container, so I've started it manually (docker run -d rethinkdb:2.0.4). The other is that the kubelet container just doesn't start hyperkube. If I commit the container and run it with bash, then execute the same command as in the script, it seems to start but can't connect to rethinkdb on 8080. Still learning my way around Docker, though; I'll keep playing with it.

edit: sorry for the comment spam, but I did realize that 8080 is one of the kube ports and I was misreading the quickstart "to check if it worked" bit.
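Roughly what I ran, as a sketch. The rethink/debug names here are just ones I picked for illustration, not names from the Pachyderm scripts:

```shell
# Start RethinkDB by hand, since start-kube-docker.sh does not appear to.
start_rethink() {
    docker run -d --name rethink rethinkdb:2.0.4
}

# Snapshot a stuck container and reopen it with a shell, so the
# hyperkube command from the script can be retried interactively.
debug_container() {
    # $1: name or ID of the stuck container (e.g. the kubelet one)
    docker commit "$1" "$1-debug" &&
        docker run -it --rm "$1-debug" /bin/bash
}
```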

@brinman2002
Author

Here is the output from running hyperkube manually.

hyperkube_fromcmdline.txt

One problem with the container is that the volumes don't appear to be linking to the host properly.

root@65f570570abc:/# ls /var/run/   
kubernetes  lock  utmp
vagrant@vagrant-ubuntu-vivid-64:~/go/src/github.com/pachyderm/pachyderm$ ls /var/run
acpid.socket  cloud-init    dhclient.eth0.pid  initctl    motd.dynamic  plymouth    rpcbind       rsyslogd.pid     sshd       thermald    utmp
atd.pid       crond.pid     docker             initramfs  mount         pppconfig   rpcbind.lock  screen           sshd.pid   tmpfiles.d  uuidd
blkid         crond.reboot  docker.pid         lock       network       puppet      rpcbind.sock  sendsigs.omit.d  sysconfig  udev
chef          dbus          docker.sock        log        pcscd         resolvconf  rpc_pipefs    shm              systemd    user
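That container-vs-host /var/run mismatch is what a missing bind-mount looks like: without one, the container gets its own private /var/run. A sketch of the kind of mounts the kubelet run line presumably needs (the actual flags in start-kube-docker.sh may differ; this only illustrates the mechanism):

```shell
# Build a docker run command with host bind-mounts. Without
# -v /var/run:/var/run the container sees an empty private /var/run
# and cannot reach the host's docker.sock.
kubelet_cmd() {
    printf '%s\n' 'docker run -d \
  -v /sys:/sys:ro \
  -v /var/run:/var/run:rw \
  -v /var/lib/docker/:/var/lib/docker:rw \
  gcr.io/google_containers/hyperkube:v1.1.2'
}
```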

@brinman2002
Author

Unfortunately I didn't keep good track of everything I tried, but Docker does seem to be the problem, or at least a symptom of another problem. I've gotten Kubernetes to start and pachctl to at least connect to it by dropping back to earlier versions of everything, but then the enhancements Pachyderm seems to need aren't there. I've gotten the "kubelet" container to start by committing the container created by the script, running /bin/bash, and then running hyperkube manually, but that doesn't start up the pod.

Not directly related, but the experimental Docker build also can't stop containers; it complains about permissions. Based on the Kubernetes docs, I added cgroup_enable=memory swapaccount=1 to the kernel parameters for the VM, but I wonder if some other kernel setup is needed?
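For the record, a sketch of how those parameters can be made persistent, assuming a stock Ubuntu GRUB layout (needs sudo, update-grub, and a reboot to take effect):

```shell
GRUB_PARAMS="cgroup_enable=memory swapaccount=1"

append_grub_params() {
    # $1: path to the GRUB defaults file (normally /etc/default/grub).
    # In practice you would edit the existing GRUB_CMDLINE_LINUX_DEFAULT
    # line rather than append a new one; this just shows the value.
    printf 'GRUB_CMDLINE_LINUX_DEFAULT="%s"\n' "$GRUB_PARAMS" >> "$1"
    # then: sudo update-grub && sudo reboot
}
```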

@jdoliner
Member

So I'm a little confused on the status of this issue. You have Kubernetes up and running but it's the wrong version? What happens when you try to start the correct version?

Also could you try this without Vagrant? All you need is Docker and Golang which I think makes Vagrant sort of unneeded. I think we should consider just removing the Vagrant file.

@brinman2002
Author

So I'm a little confused on the status of this issue. You have Kubernetes up and running but it's the wrong version? What happens when you try to start the correct version?

Sorry for the confusion. At one point Kubernetes would start when I referenced an earlier version. I probably wasn't using the Pachyderm master.json and Pachyderm didn't work correctly. I imagine that this would be expected. The correct version as referenced by the start kube script does not start correctly.

Also could you try this without Vagrant? All you need is Docker and Golang which I think makes Vagrant sort of unneeded. I think we should consider just removing the Vagrant file.

Vagrant makes it possible to blow away bad experiments and start over. Is Docker or any of the other technologies used known to not play well in a VM? I do understand if you don't want to maintain an official Vagrantfile.

Thanks

@brinman2002
Author

Ok, I do have to correct myself. I am able to get a Kubernetes cluster working by following the instructions on their site, substituting version 1.1.2. But that leaves me with this error-

vagrant@ubuntu-14:~/go/src/github.com/pachyderm/pachyderm$ $HOME/go/bin/pachctl create-cluster
ReplicationController "pfsd-rc" is invalid: spec.template.spec.containers[0].securityContext.privileged: forbidden '<*>(0xc2092af1f8)true'
vagrant@ubuntu-14:~/go/src/github.com/pachyderm/pachyderm$ $HOME/go/bin/kubectl get svc
NAME         LABELS                                    SELECTOR      IP(S)        PORT(S)
etcd         app=etcd,suite=pachyderm                  app=etcd      10.0.0.173   2379/TCP
                                                                                  2380/TCP
kubernetes   component=apiserver,provider=kubernetes   <none>        10.0.0.1     443/TCP
rethink      app=rethink,suite=pachyderm               app=rethink   10.0.0.194   8080/TCP
                                                                                  28015/TCP
                                                                                  29015/TCP

In this setup the custom master.json doesn't get copied to the Kubernetes image, so presumably the configuration in it is a part of the problem?
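For what it's worth, the "securityContext.privileged: forbidden" error above is usually a cluster-configuration issue rather than a master.json one: on Kubernetes 1.1, both the apiserver and the kubelet must be started with privileged containers enabled. A sketch; the surrounding hyperkube arguments are placeholders, not the real start commands:

```shell
# Both daemons need this flag for privileged pods to be admitted.
ALLOW_PRIV="--allow-privileged=true"

# e.g. appended to the hyperkube invocations:
#   hyperkube apiserver $ALLOW_PRIV ...
#   hyperkube kubelet   $ALLOW_PRIV ...
```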

Also, I forgot to mention that I switched to a Phusion-based image (phusion/ubuntu-14.04-amd64) instead of the default Ubuntu, as they are supposed to be more Docker-friendly. I think Pachyderm is a great idea and I'm hoping I can get this working.

@jaybennett89

@brinman2002, Docker is similar to Vagrant in its function in that it also automates the deployment of isolated environments (containers rather than full virtual machines). So why run a virtualization Inception? Try Docker on its own!

@brinman2002
Author

Much to my surprise, Docker doesn't work correctly in Vagrant (with VirtualBox as the backend, anyway), but it works (better) on bare metal. I can only assume it relies on lower-level virtualization features that VirtualBox doesn't support.

Things still aren't completely working, but I'm past all of the issues documented here.

@jdoliner
Member

jdoliner commented Dec 4, 2015

@brinman2002 what sorts of issues are you hitting with bare metal? make launch works for me and I think all I did to get the environment working is install Docker and Go, but there's probably some little things that I've forgotten at this point.

@brinman2002
Author

I just tried to update, and now make install doesn't work. The launch goal triggers install, so it fails as well. When I made the previous comment, I was using the etc/kube script, as suggested on issue #160.

~/go/src/github.com/pachyderm/pachyderm$ make install
GO15VENDOREXPERIMENT=1 go install ./src/cmd/pachctl
# go.pedge.io/pkg/sync
../../../go.pedge.io/pkg/sync/lazy_loader.go:21: undefined: atomic.Value
../../../go.pedge.io/pkg/sync/lazy_loader.go:22: undefined: atomic.Value
# golang.org/x/crypto/ssh
../../../golang.org/x/crypto/ssh/keys.go:492: undefined: crypto.Signer
# github.com/pachyderm/pachyderm/src/pkg/shard
src/pkg/shard/shard.pb.log.go:10: undefined: protolog.Register
src/pkg/shard/shard.pb.log.go:10: undefined: protolog.MessageType_MESSAGE_TYPE_EVENT
src/pkg/shard/shard.pb.log.go:10: undefined: protolog.Message
src/pkg/shard/shard.pb.log.go:11: undefined: protolog.Register
src/pkg/shard/shard.pb.log.go:11: undefined: protolog.MessageType_MESSAGE_TYPE_EVENT
src/pkg/shard/shard.pb.log.go:11: undefined: protolog.Message
src/pkg/shard/shard.pb.log.go:12: undefined: protolog.Register
src/pkg/shard/shard.pb.log.go:12: undefined: protolog.MessageType_MESSAGE_TYPE_EVENT
src/pkg/shard/shard.pb.log.go:12: undefined: protolog.Message
src/pkg/shard/shard.pb.log.go:13: undefined: protolog.Register
src/pkg/shard/shard.pb.log.go:13: too many errors
Makefile:55: recipe for target 'install' failed
make: *** [install] Error 2

@brinman2002
Author

The previous error was from not having my machine set up correctly. I think I have it set up right now, but go get is still having issues:

~$ go get github.com/pachyderm/pachyderm/...
package github.com/pachyderm/pachyderm/vendor/github.com/emicklei/go-restful/examples/google_app_engine
    imports google.golang.com/appengine: unrecognized import path "google.golang.com/appengine"
package github.com/pachyderm/pachyderm/vendor/github.com/emicklei/go-restful/examples/google_app_engine
    imports google.golang.com/appengine/memcache: unrecognized import path "google.golang.com/appengine/memcache"
package github.com/pachyderm/pachyderm/vendor/github.com/emicklei/go-restful/examples/google_app_engine/datastore
    imports google.golang.com/appengine/datastore: unrecognized import path "google.golang.com/appengine/datastore"
package github.com/pachyderm/pachyderm/vendor/github.com/emicklei/go-restful/examples/google_app_engine/datastore
    imports google.golang.com/appengine/user: unrecognized import path "google.golang.com/appengine/user"
package github.com/pachyderm/pachyderm/vendor/go.pedge.io/protolog/cmd/protoc-gen-protolog
    imports go.pedge.io/proto/plugin: cannot find package "go.pedge.io/proto/plugin" in any of:
    /opt/go/src/go.pedge.io/proto/plugin (from $GOROOT)
    /home/brandon/go/src/go.pedge.io/proto/plugin (from $GOPATH)

@jdoliner
Member

jdoliner commented Dec 6, 2015

Hmm, I don't fully understand go get, but go get github.com/pachyderm/pachyderm seems to work for me.
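A likely explanation, assuming the Go 1.5 vendor experiment is in play: the /... wildcard makes go get descend into vendored example packages (like the go-restful App Engine examples in the errors above) that aren't meant to build. Fetching just the root package sidesteps that:

```shell
# Fetch only the root package, not every vendored subpackage.
GET_CMD="go get github.com/pachyderm/pachyderm"

# The failing form, for contrast, was:
#   go get github.com/pachyderm/pachyderm/...
```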

@brinman2002
Author

Yeah I'm not sure what changed but make install is working now.

Do you recommend make launch or etc/kube/start-kube-docker.sh to run Pachyderm? make launch still uses Docker Compose, which you said wasn't "a viable way to deploy".

@brinman2002
Author

Running etc/kube/start-kube-docker.sh seems to be working now. Thanks!

@jdoliner
Member

jdoliner commented Dec 6, 2015

If you're on master, make launch should launch a Kubernetes cluster and then get Pachyderm running on that cluster. You will still see some docker-compose output because our containers are still built through Docker Compose.

I'm trying to get docker-compose ripped out soon, but our unit tests still use it.

@brinman2002
Author

Not completely working after all. I can create repos with pachctl but it doesn't create commits.

~/go/src/github.com/pachyderm/pachyderm$ sudo  $(which pachctl) start-commit foo
rpc error: code = 2 desc = "btrfs subvolume create /pfs/btrfs/repo/foo/120d16b28a28451c918fe4762cd8ac24: exit status 1\n\tERROR: can't access to '/pfs/btrfs/repo/foo'\n"
~/go/src/github.com/pachyderm/pachyderm$ ls /pfs/foo
ls: cannot access /pfs/foo: Input/output error
~/go/src/github.com/pachyderm/pachyderm$ ls /pfs
data  output

Also, if you do the echo in the quickstart to demonstrate that you can't write to the data directory, it puts things into a bad state. The echo command never exits, and the pachctl mount command continuously throws errors like these-

2015/12/05 21:36:32 transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:40089->127.0.0.1:650: read: connection reset by peer.
2015/12/05 21:36:32 transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:40091->127.0.0.1:650: read: connection reset by peer.
2015/12/05 21:36:32 transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:40093->127.0.0.1:650: read: connection reset by peer.
2015/12/05 21:36:32 transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:40095->127.0.0.1:650: read: connection reset by peer.
2015/12/05 21:36:32 transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:40097->127.0.0.1:650: read: connection reset by peer.
2015/12/05 21:36:32 transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:40099->127.0.0.1:650: read: connection reset by peer.
2015/12/05 21:36:32 transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:40101->127.0.0.1:650: read: connection reset by peer.

I've had to reboot to make things work again.
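A possible way to recover without a full reboot (my assumption, not something verified on this setup): lazily unmount the wedged FUSE mount so blocked readers get errors instead of hanging forever.

```shell
unwedge_pfs() {
    # $1: the mountpoint, e.g. /pfs
    # -u unmounts; -z detaches lazily even while processes are blocked.
    # Fall back to a lazy umount if fusermount isn't available.
    fusermount -uz "$1" 2>/dev/null || sudo umount -l "$1"
}
```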

@brinman2002
Author

On the quickstart, is using "foo" just a documentation error?

brandon@tamami:~$ pachctl create-repo data
brandon@tamami:~$ pachctl create-repo output
brandon@tamami:~$ ls /pfs
data  output
brandon@tamami:~$ pachctl start-commit data
7ea12a9dac744eec817c634e4d486bd0
brandon@tamami:~$ echo "Hello world" > /pfs/data/7ea12a9dac744eec817c634e4d486bd0/hello.txt
brandon@tamami:~$ pachctl finish-commit data 7ea12a9dac744eec817c634e4d486bd0
brandon@tamami:~$ cat /pfs/data/7ea12a9dac744eec817c634e4d486bd0/hello.txt 
Hello world

@brinman2002
Author

According to the quickstart, this shouldn't work either, since the commit hasn't been finished yet.

brandon@tamami:~$ pachctl start-commit data
559e851fc6f64086be60740d0a890540
brandon@tamami:~$ echo "Hello world" > /pfs/data/559e851fc6f64086be60740d0a890540/hello2.txt
brandon@tamami:~$ cat /pfs/data/559e851fc6f64086be60740d0a890540/hello2.txt 
Hello world

I know you mentioned the doc is a little stale (and as a developer, I know how that happens :D ), but I thought I'd point this out because it seems like it could be an issue in the code as well.

@brinman2002
Author

Another hang

brandon@tamami:~$ pachctl list-commit data
ID                                 PARENT              STATUS              STARTED             FINISHED            TOTAL_SIZE          DIFF_SIZE           
559e851fc6f64086be60740d0a890540   <none>              writeable           45 years ago                            28 B                12 B                
7ea12a9dac744eec817c634e4d486bd0   <none>              read-only           45 years ago        15 minutes ago      26 B                12 B                
brandon@tamami:~$ pachctl finish-commit 559e851fc6f64086be60740d0a890540
Expected 2 args, got 1
brandon@tamami:~$ pachctl finish-commit 559e851fc6f64086be60740d0a890540 data
rpc error: code = 2 desc = "commit 559e851fc6f64086be60740d0a890540/data not found"
brandon@tamami:~$ pachctl finish-commit data 559e851fc6f64086be60740d0a890540 
# hung on this command

@jdoliner
Member

jdoliner commented Dec 6, 2015

Thanks so much for reporting, just updated the quickstart to not reference foo anymore.

Regarding issue with the hang. I've created #162 to track that.

The issue with files being returned from unfinished commits is tracked in #159.

I think these are both fairly simple issues so I'll try to get them fixed soon.

@brinman2002
Author

Great, thanks. Is there an IRC/Slack chat/mailing list for more informal questions?

@teodor-pripoae
Contributor

Hi,

I'm running a default vagrant cluster with 3 k8s nodes to simulate a production cluster. pachyderm/pfsd is crashing every time. All other containers seem to work fine.

Cluster started with: KUBERNETES_MINION_MEMORY=2048 NUM_MINIONS=3 KUBERNETES_PROVIDER=vagrant ./kube-up.sh

[vagrant@kubernetes-minion-2 ~]$ sudo docker logs -f 4ff34f7e27da
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536

WARNING! - Btrfs v3.12 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /pfs-img/btrfs.img
    nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB
Btrfs v3.12
mount: could not find any free loop device

@jdoliner
Member

Hi @teodor-pripoae, sorry you ran into this.
What's going on is that pfs needs a loop device to mount the btrfs volume on.
Sometimes there's a bug where, when the container finishes, the loop device gets leaked and there's no way to unmount it.

What do you get when you do losetup -a?
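For anyone hitting the same thing, listing and detaching a leaked device looks like this (device names vary; detaching needs root):

```shell
# Show all loop devices currently attached to backing files.
list_loops() {
    losetup -a
}

# Detach one leaked device so pfsd can grab a free one on restart.
detach_loop() {
    # $1: the device, e.g. /dev/loop0
    losetup -d "$1"
}
```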
