Discussion: Docker vs Pip vs Virtual Machine #26

madisonb · 2015-11-16T16:38:23Z

I would like to open up a discussion for Scrapy Cluster as to how it can be easier to work with.

As of right now, SC 1.1 (almost ready) allows you to do local development on a single Virtual Machine (at time of writing on the dev branch). This single VM is just to do local testing and should not be used in production.

This leaves you with a production deployment where a user must manually stand up Zookeeper, Kafka, and Redis at their desired scale and deploy SC to the various machines they want it to run on. This is done either manually by copying files, or (potentially) pip packages for the 3 main components. Ansible can help here to a degree but is quirky for your OS setup and is not always as modular.

Docker would allow you both the flexibility of deploying to arbitrary Docker Servers, and the ease of use of standing up your cluster. If we bundled the 3 main components as Docker Containers then with a bit of tweaking I think an easily scalable solution is possible. Especially using something like Mesos, Tumtum, Compose, or just plain Docker makes it really easy to run everything.

The downside to this is that it may be difficult for users to add custom spiders, pipelines, and middleware to their Scrapy based project, especially if heavy customization is going on and the spiders are deployed to a lot of different servers. ...Not to mention if the user adds custom plugins to the Kafka Monitor or Redis Monitor, would the user then would need to bundle their own docker container?

So the question is, what route seems like the most flexible, while allowing both local development and production scale deployment? What is the future of deploying distributed apps and how can we make SC extendable, flexible, deployable, and dev friendly?

madisonb · 2016-01-04T15:33:14Z

I think we should go down the Docker route, which brings us to a couple different choices:

A single docker image with Kafka, Redis, Zookeeper, and Scrapy Cluster. The container starts off with a certain commit/release of this repo in a specific folder. Container parameters would include the three folders that contain the Kafka Monitor, Redis Monitor, or the Scrapy project in order to allow someone to override or customize their setup. Other flags could also include number of spider processes to run within the container. This image is easy to work with because every single thing is self contained, but is probably going to be really big and bulky.
Rely on the user to set up Docker Kafka, Redis, and Zookeeper, and provide a single standalone Scrapy Cluster docker image containing the three core components. User still has the same container parameters to adjust # of spiders or folders, but would need to set up the ports correctly in order to work with the other containers.
Set up three different containers (each for Kafka Monitor, Redis Monitor, and Scrapers), which is basically splitting up item 2 above. This is useful for scaling and redundancy as the spiders are meant to scale really easily, so you just increase the number of spider containers.
Use either the concepts of Scrapy Cluster Docker images (either 1 single image or 3 different ones) and use docker compose to orchestrate them together. It is not compatible with Kitematic (see issue 137)) but would help new developers out by quickly allowing them to stand up a whole cluster of containers to test the functionality.

# 4 will allow us to potentially scale via Mesos or any other docker scaling solution, and is friendly enough to new users to allow them to configure their own code, test out a full scale deployment, and not have to fret about packaging or machine level configurations.

Work here will focus on creating Dockerfiles for the crawler, kafka monitor, and redis monitor. Will use supervisord as a base with environment variables to alter/update any kind of configuration we need within either the python settings.py files or the supervisor configs. references #48 #26

madisonb · 2016-08-12T18:56:11Z

Closing this since we are committing to Docker.

* Beginning work on docker branch. Work here will focus on creating Dockerfiles for the crawler, kafka monitor, and redis monitor. Will use supervisord as a base with environment variables to alter/update any kind of configuration we need within either the python settings.py files or the supervisor configs. references #48 #26 * Kafka Monitor Docker Adds a dockerfile to setup and run the kafka monitor. I do note that there are settings and configuration within that tie it into the docker-compose file, but I think they can be fairly easily overridden and maybe I switch them to just using the raw `settings.py` file instead of `localsettings.py`. This uses stand alone containers for Kafka, Zookeeper, and Redis, and each of the smaller scrapy cluster components will be their own separate containers as well. Everything will run under supervisord so you can scale the number of processes both within the container and outside. There is much more work to be done, for example reading in environment variables to override things in the settings file, configuring docker hub, etc. * Added Redis Monitor to docker setup. This is an initial cut at making the redis monitor docker compatible. There are still many environmen t variables to specify but this gets us prett far along In testing this I also found a dormant issue with ensuring the ZK file path existed, which is easy enough to fix and will be merged in when this branch is decently complete * Added Dockerfile for crawler This commit adds the link spider to be compatible with Docker. This completes the initial cut of dockerization of the three core components, and todo is the following: - Define environment variable overrides for commonly altered configurations - Documentation under the Advanced Topics as well as Quickstart to supply yet another alternative to working and provisioning Scrapy Cluster. - Update Changelog - Merge branch back into dev. - Dockerhub builds, or at least a stable set of images on the hub for people to pull down - ??? Very excited to get this going! * Added commonly altered variables to the settings.py files * Removed supervisord, Docker can restart the containers and scale * prep for dev vs prod docker compose * Added docker-compose based off of dockerhub images * Add initial docker docs * Updated testing for docker This commit hopefully enables us to run the integration tests for our Docker images within Travis, enabling us to continuously test both the ansible provisioning and docker images. Added a new script to run tests within the container. Updated documentation for quickstart and advanced docker docs. Modded the Dockerfiles to include the new test script. Fixed .gitignore to ignore more files * Travis build matrix testing Trying to make life easier by separating things into shell scripts instead of a massive script section in travis * more travis changes * more travis configs * fix variable reference Still need to figure out why kafka doesnt start immediately * More tweaks to get kafka and kafka-monitor to stand up first try * rename test script folder to just "travis" * trying to get kafka to spin up 1st try in travis * forgot to fix .test ports * kafka monitor typo * Add docker hub pulls badge * move travis test compose file into travis folder * travis name change with file move * docker down correct file path in travis * clean up docs, remove old supervisord.conf

madisonb added the question label Nov 16, 2015

madisonb mentioned this issue Dec 8, 2015

Scrapy Cluster Pip Packaging #14

Closed

madisonb mentioned this issue Feb 17, 2016

Dockerization #48

Closed

madisonb closed this as completed Aug 12, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion: Docker vs Pip vs Virtual Machine #26

Discussion: Docker vs Pip vs Virtual Machine #26

madisonb commented Nov 16, 2015

madisonb commented Jan 4, 2016

madisonb commented Aug 12, 2016

Discussion: Docker vs Pip vs Virtual Machine #26

Discussion: Docker vs Pip vs Virtual Machine #26

Comments

madisonb commented Nov 16, 2015

madisonb commented Jan 4, 2016

madisonb commented Aug 12, 2016