Add Dockerfile to simplify installation #93

Open
wants to merge 2 commits into
base: master
from

5 participants
@slang800

slang800 commented Sep 5, 2016

I'm deploying a couple of instances of grab-site to a CoreOS cluster, so I made a Dockerfile... Hopefully this is a bit easier to use than pip/virtualenv. This uses the larger python:3.4-slim image (rather than python:3.4-alpine) because Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.

This PR still needs docs, so it's a work-in-progress right now.

After starting the container you can use the regular grab-site command via docker exec <container-name> grab-site <args and site url>
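
The docker exec invocation above can be wrapped in a small shell function so it reads like the regular command. This is only a sketch: the container name `warcfactory` is borrowed from the README example later in this thread, and the `DOCKER` and `GS_CONTAINER` variables are hypothetical conveniences that are not part of this PR.

```bash
# Hypothetical wrapper: forwards its arguments to grab-site inside a running container.
# GS_CONTAINER and DOCKER are assumed names introduced for this sketch only.
gs() {
    "${DOCKER:-docker}" exec "${GS_CONTAINER:-warcfactory}" grab-site "$@"
}
```

With that defined, `gs --no-offsite-links http://xkcd.com/` would run the crawl inside the container.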

@ivan

Member

ivan commented Sep 5, 2016

I haven't used Docker, so bear with me...

  1. Why COPY to /app/ if you still subsequently do a pip3 install .? If you pip3 install ., then grab-site, gs-server, etc should be installed somewhere, right?

  2. Can you make the script in .travis.yml test that this Dockerfile works? (Probably after all the existing stuff.)

Thanks for working on this!

@slang800

slang800 commented Sep 6, 2016

No prob - you can think of a Docker container as a lightweight VM (like VirtualBox, but with better tooling and less overhead). The Dockerfile automates building/configuring the container, and the COPY directive copies the code from your working directory into the container's file-system. Once the code is in the container (at /app), we run pip3 install to get all the deps and set everything up.

This creates a fully isolated, reproducible installation of grab-site in a 200-300MB image. This image can be run on any host OS, including CoreOS where Python isn't even installed. Using Alpine as a base we could get this image down to 20-50MB, but that requires some modifications to py-lmdb.

As for testing, we can have https://hub.docker.com automatically rebuild the image whenever new code is pushed (see: https://docs.docker.com/docker-hub/builds/) and run Docker-based tests in Travis if you want: https://docs.travis-ci.com/user/docker/

@ivan

Member

ivan commented Sep 6, 2016

pip3 install . should install grab-site in addition to the dependencies, though. pip3 install puts things in /usr/local/bin while pip3 install --user puts things in ~/.local/bin, unless there's some extra configuration doing something else. Would it make sense to use the installed grab-site scripts in one of those paths rather than duplicate some pip functionality with the COPY lines?

@ivan

Member

ivan commented Sep 6, 2016

Is there an issue filed somewhere for py-lmdb's failure to compile on Alpine Linux's gcc?

@slang800

slang800 commented Sep 6, 2016

pip3 install . is being run within the context of the Docker container (not the host OS) so you need to COPY the files into the container for pip to work.
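
To make the ordering concrete, here is a minimal sketch that writes out a Dockerfile with COPY ahead of pip3 install. It is an illustration rather than the exact Dockerfile from this PR: the base image, the /app path, and build-essential come from this thread, while the `grab-site-sketch` tag in the final comment is a made-up name.

```bash
# Write a minimal Dockerfile showing COPY followed by pip3 install.
# Illustrative sketch only; not the exact Dockerfile from the PR.
workdir=$(mktemp -d)
cat > "$workdir/Dockerfile" <<'EOF'
FROM python:3.4-slim
# Copy the checkout from the build context on the host into the image.
COPY . /app
WORKDIR /app
# pip3 runs inside the image, so it only sees what COPY placed in /app.
RUN apt-get update \
 && apt-get install -y build-essential \
 && pip3 install ./
EOF
echo "Dockerfile written to $workdir/Dockerfile"
# To build it (requires Docker), from the repository root:
#   docker build -t grab-site-sketch -f "$workdir/Dockerfile" .
```

Without the COPY line, the `pip3 install ./` step would find an empty directory, because the build never looks at the host's file-system outside of what is copied in.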

@ivan

Member

ivan commented Sep 6, 2016

Oh, that explains it :-)

@slang800

slang800 commented Sep 6, 2016

There isn't an issue filed on https://github.com/dw/py-lmdb/issues yet.

@igorbrigadir

igorbrigadir commented Sep 6, 2016

> Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.

I haven't tried running grab-site, but it seems like installing py-lmdb works on python:3.4-alpine with this:

FROM python:3.4-alpine
RUN apk add --update build-base libffi-dev
RUN pip install lmdb

Dockerfile Outdated
RUN pip3 install ./
RUN apt-get purge -y build-essential
RUN apt-get autoremove -y
RUN apt-get clean

@ivan

ivan Sep 6, 2016

Member

Someone tells me that each RUN creates a new layer, so the purge/autoremove/clean would not reduce the size of the final Docker image. What do you think about combining the RUNs on lines 6-12 into one RUN command?

https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#/minimize-the-number-of-layers

@mithrandi

mithrandi Sep 6, 2016

"Someone" is me, in case additional clarification of this comment is needed :)

@mithrandi

mithrandi Sep 6, 2016

Another possibility is to use FROM python:3.4 instead of FROM python:3.4-slim; the non-slim variant is based off of buildpack-deps which has a lot of compilers / tools / libraries installed. The resulting total image size would be bigger, but the advantage is that the buildpack-deps portion would be shared with every other image based off of that, so in the usual case where you have several images, the total space usage would be lower.

@slang800

slang800 Sep 6, 2016

Good point! I combined the commands & got the image size down to 235.4MB. I'm kinda surprised that images don't get flattened, but moby/moby#332 offers a lengthy discussion on it.

As for basing it on python:3.4, that would reduce the build time & total size of images on the system, but only if a significant number of the other images on the system are based on it too, which I don't think we can assume. It's probably better to just optimize for the smallest resulting image size.
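
For readers following along, the combined form discussed above might look roughly like this. It is a sketch assembled from the RUN lines quoted in the review, not the exact commit; the `apt-get update` and `apt-get install` steps are assumed, since only the purge/autoremove/clean lines appear in the diff.

```bash
# Write a Dockerfile where install, build, and cleanup share a single RUN.
workdir=$(mktemp -d)
cat > "$workdir/Dockerfile" <<'EOF'
FROM python:3.4-slim
COPY . /app
WORKDIR /app
# One RUN means one layer: the compiler is installed, used, and purged
# before the layer is committed, so it does not add to the image size.
RUN apt-get update \
 && apt-get install -y build-essential \
 && pip3 install ./ \
 && apt-get purge -y build-essential \
 && apt-get autoremove -y \
 && apt-get clean
EOF
# Count the RUN instructions; a single one is expected.
run_count=$(grep -c '^RUN ' "$workdir/Dockerfile")
echo "RUN instructions: $run_count"
```

With separate RUN lines, the purge would only hide the files in a later layer while the earlier layer still carries them.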

@slang800 slang800 force-pushed the slang800:patch-1 branch from 3debff9 to 73d2726 Sep 6, 2016

@slang800 slang800 force-pushed the slang800:patch-1 branch 2 times, most recently from fb56712 to ce4d178 Sep 13, 2016

@@ -34,6 +34,7 @@ Note: grab-site currently **does not work with Python 3.5**; please use Python 3
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

@ivan

ivan Sep 13, 2016

Member

This comment is a lie, sorry. I've been updating this TOC manually and probably don't want the Tips for specific websites expanded.

@slang800

slang800 commented Sep 13, 2016

You're right about it working on Alpine - I was just missing libffi-dev. Now it's down to 112.4 MB (37 MB when compressed). Also, I added instructions to the README, so I'm going to remove the "[wip]" from this.
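
A rough sketch of what the Alpine variant could look like, reusing the apk line suggested earlier in this thread. The actual Dockerfile in the PR may differ; the /app path follows the earlier discussion and everything else is an assumption.

```bash
# Write an Alpine-based Dockerfile sketch (assumed layout, not the PR's file).
workdir=$(mktemp -d)
cat > "$workdir/Dockerfile" <<'EOF'
FROM python:3.4-alpine
COPY . /app
WORKDIR /app
# build-base supplies gcc and make; libffi-dev was the missing piece.
RUN apk add --update build-base libffi-dev \
 && pip3 install ./
EOF
echo "Dockerfile written to $workdir/Dockerfile"
```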

@slang800 slang800 changed the title [wip] add Dockerfile to simplify installation Add Dockerfile to simplify installation Sep 13, 2016

@slang800 slang800 force-pushed the slang800:patch-1 branch 2 times, most recently from ee5a78e to 3393d82 Sep 13, 2016

README.md Outdated
Start the grab-site server. You can set the port, volume, and name to whatever you want:

```bash
docker run --detach -p 29000:29000 -v /home/ludios/download/grab-site-data:/data --name warcfactory slang800/grab-site
```

@ivan

ivan Sep 13, 2016

Member

How about just ~/grabs instead of /home/ludios/download/grab-site-data?

@slang800

slang800 Sep 13, 2016

docker requires an absolute path for mounts... I suppose I could do $(pwd)/grabs, if that's obvious to most users.

@ivan

ivan Sep 13, 2016

Member

~ will be made absolute by the shell, no?

$ echo ~/
/home/at/

README.md Outdated
```bash
docker exec warcfactory grab-site --no-offsite-links --1 http://xkcd.com/
```

The downloaded data, temp files, ignores list, and other configuration will be in a sub-directory of the mounted volume. In this case, `/home/ludios/download/grab-site-data/xkcd.com-2016-09-05-caf0a39c`.

@ivan

ivan Sep 13, 2016

Member

Same
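
On the mount-path question raised in these review comments, a quick way to check that both spellings end up absolute before docker sees them. This is a plain shell sketch; the `grabs` directory name is the one suggested above and nothing here touches Docker.

```bash
# The shell expands an unquoted leading ~ before any command receives the argument.
host_dir=~/grabs
alt_dir="$(pwd)/grabs"
echo "tilde form: $host_dir"
echo "pwd form:   $alt_dir"
# Both should start with /, which is what the -v flag needs.
case "$host_dir" in
    /*) echo "tilde form is absolute" ;;
    *)  echo "tilde form is NOT absolute" ;;
esac
```

Note that quoting the tilde (for example `'~/grabs'`) suppresses the expansion, so the path has to stay unquoted or use `$HOME` instead.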

README.md Outdated
@@ -55,6 +56,27 @@ Note: grab-site currently **does not work with Python 3.5**; please use Python 3

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

Install with Docker
---
Get the pre-built docker container:

@ivan

ivan Sep 13, 2016

Member

I would really like the instructions to be so complete that people who don't have Docker installed yet (me included) know where to get it.

@slang800

slang800 Sep 13, 2016

Eh, that depends on your package manager... for me it's pacman -S docker (on Arch), but it's apt-get install if you're on a debian-based system, yum on RHEL/CentOS/Fedora and I imagine there's something similar for OSX. Or it's pre-installed if you're on CoreOS.

@ivan

ivan Sep 13, 2016

Member

True, maybe just point to https://docs.docker.com/engine/installation/ if you consider that accurate.

@slang800

slang800 Sep 13, 2016

yeah, that's a great explanation - added a link.

@ivan

Member

ivan commented Sep 13, 2016

Thanks for the fixes.

I am currently somewhat busy and under-dockered, can a grab-site user please give the Docker instructions a try and see if they work? (And let me know if you had to perform any other steps to make this a useful setup?)

@igorbrigadir

igorbrigadir commented Sep 13, 2016

I had docker already, but worth linking to https://docs.docker.com/engine/installation/ instructions

I tried it with:
Ubuntu: 12.04.5 LTS, x86_64, 3.8.0-44-generic
Docker: Docker version 1.7.1, build 786b29d

Ran (sudo for docker commands because i skipped this step https://docs.docker.com/engine/installation/linux/ubuntulinux/#/create-a-docker-group):

sudo docker pull slang800/grab-site
sudo docker run --detach -p 29000:29000 -v ~/grab-site-data:/data --name warcfactory slang800/grab-site
Web UI worked on http://localhost:29000/
sudo docker exec warcfactory grab-site --no-offsite-links http://xkcd.com/

Crawl finished successfully!

@slang800 slang800 force-pushed the slang800:patch-1 branch from 3393d82 to 23f9099 Sep 13, 2016

@slang800 slang800 force-pushed the slang800:patch-1 branch from 23f9099 to 8b57b72 Sep 13, 2016

@ivan

Member

ivan commented Sep 19, 2016

I tried this out, but couldn't find a way to attach a terminal to a docker exec -d process (or a docker exec process that has been ctrl-c'ed - note the ctrl-c is not passed to the child). The reason that you sometimes need a terminal attached to a grab-site process is to 1) see which URL is currently being grabbed (this information is not reported to the dashboard, only finished responses) and 2) look at segfaults and websocket connection problems that don't get reported to the dashboard either.

Would adding tmux to the container and using tmux work? (Note, tmux 2.1 is broken; 1.8 is a known-good version.) I just hope that docker exec tmux attach works. If this does work, the documentation should also be updated.
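
If tmux were added to the image, the workflow might look something like the sketch below. This is untested against the PR's image and rests on several assumptions: tmux being installed in the container, the `warcfactory` container name from the README example, and the helper names `start_crawl` and `attach_crawl`, which are invented here.

```bash
# Hypothetical helpers; they assume tmux is installed in the image,
# which the Dockerfile in this PR does not do.
DOCKER="${DOCKER:-docker}"
GS_CONTAINER="${GS_CONTAINER:-warcfactory}"

# Start a crawl inside a detached tmux session named after the first argument.
start_crawl() {
    session=$1
    shift
    "$DOCKER" exec "$GS_CONTAINER" tmux new-session -d -s "$session" "grab-site $*"
}

# Attach a terminal to that session; -i and -t give docker exec an interactive TTY.
attach_crawl() {
    "$DOCKER" exec -it "$GS_CONTAINER" tmux attach -t "$1"
}
```

Usage would be along the lines of `start_crawl xkcd --no-offsite-links http://xkcd.com/` followed by `attach_crawl xkcd`. The crawl command is passed to tmux as a single string, so URLs containing shell metacharacters would need extra quoting.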

@ivan

Member

ivan commented Sep 19, 2016

Also, running gs-server as PID 1 seems undesirable because if it were killed, it would kill all the grab-site processes as well. grab-site processes are designed to stay running even if gs-server crashes or is taken down for an upgrade. Maybe gs-server (and each grab-site) should run in its own container instead.

@slang800

slang800 commented Oct 20, 2016

> Maybe gs-server (and each grab-site) should run in its own container instead.

Splitting up the server and client would make sense, especially since you could then run them on different machines, but I should probably do that as a separate PR, since I'll need to look into how they communicate.

> Also, running gs-server as PID 1 seems undesirable because if it were killed, it would kill all the grab-site processes as well.

Would using dumb-init as PID 1 allow the orphaned grab-site processes to keep running in the case where gs-server dies? If so, that would be a decent temporary fix.

> I tried this out, but couldn't find a way to attach a terminal to a docker exec -d process (or a docker exec process that has been ctrl-c'ed - note the ctrl-c is not passed to the child).

You could run docker exec without detaching, but this whole setup could be simplified by splitting up the processes into their own containers... Then you'd be able to use docker logs and pass signals in a sane manner.
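
As a rough sketch of what that would enable, assuming each crawl ran as the main process of its own container. This PR does not set that up, and how such a container would reach gs-server is still an open question; the helper names and the `grab-xkcd` container name in the usage note are hypothetical.

```bash
# Hypothetical helpers for a one-container-per-crawl layout (not provided by this PR).
DOCKER="${DOCKER:-docker}"

# Follow the output of a crawl that runs as the main process of its container.
follow_crawl() {
    "$DOCKER" logs --follow "$1"
}

# Deliver a signal (SIGINT by default) to that container's main process.
signal_crawl() {
    "$DOCKER" kill --signal "${2:-SIGINT}" "$1"
}
```

For example, `follow_crawl grab-xkcd` would stream the crawl's output, and `signal_crawl grab-xkcd` would send it the equivalent of a ctrl-c.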

@semente

semente commented Nov 7, 2018

hey people! what is the status of this PR? I could give a hand.

@ivan

Member

ivan commented Nov 7, 2018

For now, I would like someone else to be the Dockerized grab-site upstream. I don't use Docker and I don't have the resources to 1) figure out if a PR is taking the right approach with Dockerization (which base? which init? one container per grab-site? how to integrate tmux, if needed?) 2) double my manual testing matrix.

So, please, have at it and promote your fork/Dockerfile here. If you (or someone else) stays interested in maintaining and testing it, I might take a PR in the future.

@ivan ivan force-pushed the ludios:master branch from 229a81c to 0dbfdec Dec 18, 2018
