Discussion: Docker vs Pip vs Virtual Machine #26
I think we should go down the Docker route, which brings us to a couple of different choices:
#4 will allow us to potentially scale via Mesos or any other Docker scaling solution, and is friendly enough to new users to allow them to configure their own code, test out a full-scale deployment, and not have to fret about packaging or machine-level configurations.
Closing this since we are committing to Docker.
madisonb pushed a commit that referenced this issue on Aug 26, 2016:
* Beginning work on docker branch. Work here will focus on creating Dockerfiles for the crawler, Kafka monitor, and Redis monitor. Will use supervisord as a base, with environment variables to alter/update any kind of configuration we need within either the Python settings.py files or the supervisor configs. References #48 #26
* Kafka Monitor Docker. Adds a Dockerfile to set up and run the Kafka monitor. I do note that there are settings and configuration within that tie it into the docker-compose file, but I think they can be fairly easily overridden, and maybe I switch them to just using the raw `settings.py` file instead of `localsettings.py`. This uses standalone containers for Kafka, Zookeeper, and Redis, and each of the smaller Scrapy Cluster components will be their own separate containers as well. Everything will run under supervisord so you can scale the number of processes both within the container and outside it. There is much more work to be done, for example reading in environment variables to override things in the settings file, configuring Docker Hub, etc.
* Added Redis Monitor to docker setup. This is an initial cut at making the Redis monitor Docker compatible. There are still many environment variables to specify, but this gets us pretty far along. In testing this I also found a dormant issue with ensuring the ZK file path existed, which is easy enough to fix and will be merged in when this branch is decently complete.
* Added Dockerfile for crawler. This commit adds the link spider to be compatible with Docker. This completes the initial cut of dockerization of the three core components. Still to do:
  - Define environment variable overrides for commonly altered configurations
  - Documentation under the Advanced Topics as well as the Quickstart, to supply yet another alternative for working with and provisioning Scrapy Cluster
  - Update Changelog
  - Merge branch back into dev
  - Docker Hub builds, or at least a stable set of images on the hub for people to pull down
  - ???
  Very excited to get this going!
* Added commonly altered variables to the settings.py files
* Removed supervisord; Docker can restart the containers and scale
* Prep for dev vs prod docker compose
* Added docker-compose based off of Docker Hub images
* Add initial docker docs
* Updated testing for docker. This commit hopefully enables us to run the integration tests for our Docker images within Travis, enabling us to continuously test both the Ansible provisioning and the Docker images. Added a new script to run tests within the container. Updated documentation for the quickstart and advanced docker docs. Modded the Dockerfiles to include the new test script. Fixed .gitignore to ignore more files.
* Travis build matrix testing. Trying to make life easier by separating things into shell scripts instead of a massive script section in Travis.
* More travis changes
* More travis configs
* Fix variable reference. Still need to figure out why Kafka doesn't start immediately
* More tweaks to get kafka and kafka-monitor to stand up first try
* Rename test script folder to just "travis"
* Trying to get Kafka to spin up first try in Travis
* Forgot to fix .test ports
* Kafka monitor typo
* Add Docker Hub pulls badge
* Move travis test compose file into travis folder
* Travis name change with file move
* docker down correct file path in travis
* Clean up docs, remove old supervisord.conf
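The "environment variables to override things in the settings file" idea from the commit above could look roughly like this in a `settings.py`. This is a hypothetical sketch; the variable names and defaults below are illustrative assumptions (chosen to match typical compose service names), not the project's actual settings:

```python
import os

# Hypothetical settings.py fragment: each commonly altered setting reads
# from the container environment and falls back to a default, so the same
# image works under docker-compose, Mesos, or a hand-run container.
KAFKA_HOSTS = os.getenv('KAFKA_HOSTS', 'kafka:9092')
ZOOKEEPER_HOSTS = os.getenv('ZOOKEEPER_HOSTS', 'zookeeper:2181')
REDIS_HOST = os.getenv('REDIS_HOST', 'redis')
REDIS_PORT = int(os.getenv('REDIS_PORT', '6379'))
```

With this pattern, `docker run -e REDIS_HOST=10.0.0.5 ...` reconfigures a container without rebuilding the image or editing `localsettings.py`.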
I would like to open up a discussion for Scrapy Cluster as to how it can be easier to work with.
As of right now, SC 1.1 (almost ready) allows you to do local development on a single Virtual Machine (at the time of writing, on the dev branch). This single VM is just for local testing and should not be used in production.
This leaves you with a production deployment where a user must manually stand up Zookeeper, Kafka, and Redis at their desired scale and deploy SC to the various machines they want it to run on. This is done either manually by copying files, or (potentially) via pip packages for the 3 main components. Ansible can help here to a degree, but it is quirky depending on your OS setup and is not always as modular.
Docker would give you both the flexibility of deploying to arbitrary Docker servers and the ease of standing up your cluster. If we bundled the 3 main components as Docker containers, then with a bit of tweaking I think an easily scalable solution is possible. Using something like Mesos, Tutum, Compose, or just plain Docker makes it really easy to run everything.
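The bundled-containers idea could be sketched as a compose file along these lines. This is a hypothetical illustration only; the image names, build paths, and service layout are assumptions, not anything the project ships:

```yaml
# Hypothetical docker-compose sketch: standalone Kafka/Zookeeper/Redis
# containers plus one container per Scrapy Cluster component.
version: '2'
services:
  zookeeper:
    image: zookeeper        # assumed image name
  kafka:
    image: wurstmeister/kafka   # assumed image name
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    depends_on: [zookeeper]
  redis:
    image: redis
  kafka_monitor:
    build: ./kafka-monitor      # assumed build context
    depends_on: [kafka, redis]
  redis_monitor:
    build: ./redis-monitor      # assumed build context
    depends_on: [kafka, redis]
  crawler:
    build: ./crawler            # assumed build context
    depends_on: [kafka, redis]
```

A layout like this would let `docker-compose up -d` stand up the whole cluster, and `docker-compose scale crawler=5` (or a Mesos scheduler) handle horizontal scaling of individual components.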
The downside to this is that it may be difficult for users to add custom spiders, pipelines, and middleware to their Scrapy-based project, especially if heavy customization is going on and the spiders are deployed to a lot of different servers. Not to mention, if the user adds custom plugins to the Kafka Monitor or Redis Monitor, would the user then need to bundle their own Docker container?
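One plausible answer to the customization question is a thin user image layered on top of a published base image. Everything here is hypothetical; the base image name and container paths are assumptions for illustration:

```dockerfile
# Hypothetical user Dockerfile: extend an assumed published crawler base
# image with custom spiders, middleware, and local settings, rather than
# rebuilding the whole component from scratch.
FROM scrapycluster/crawler:latest        # assumed base image name
COPY my_spiders/ /usr/src/app/crawling/spiders/
COPY my_middleware.py /usr/src/app/crawling/
COPY localsettings.py /usr/src/app/crawling/
```

Under this approach, "bundling their own container" shrinks to a few COPY lines and a `docker build`, which may make the Docker route friendlier to heavily customized deployments than it first appears.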
So the question is: what route seems the most flexible while allowing both local development and production-scale deployment? What is the future of deploying distributed apps, and how can we make SC extendable, flexible, deployable, and dev friendly?