Take the hassle out of web scraping

morph.io: A scraping platform

  • A Heroku for Scrapers
  • All code and collaboration through GitHub
  • Write your scrapers in Ruby, Python, PHP, Perl or JavaScript (NodeJS, PhantomJS)
  • Simple API to grab data
  • Schedule scrapers or run manually
  • Process isolation via Docker
  • Trivial to move scraper code and data from ScraperWiki Classic
  • Email alerts for broken scrapers

Dependencies

Ruby 2.3.1, Docker, MySQL, SQLite 3, Redis, mitmproxy. (See below for more details about installing Docker)

Development is supported on Linux and Mac OS X.
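
A quick way to see which of these dependencies are missing is a small shell check. This is only a sketch: the executable names passed in (`mysql`, `sqlite3`, `redis-server`, `mitmdump`) are assumptions about how each dependency is usually installed, not names taken from this project, so adjust them to your setup.

```shell
# Print every name given as an argument that is not found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed executable names for the dependencies listed above.
missing_tools ruby docker mysql sqlite3 redis-server mitmdump
```

If nothing is printed, every name was found. This only checks that a tool exists, not that it is the right version.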

Repositories

User-facing:

Docker images:

Installing Docker

On Linux

Just follow the instructions on the Docker site.

Your user account should be able to manipulate Docker (just add your user to the docker group).
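
As a rough sketch, the check below reports whether the current user is already in the `docker` group. The `has_group` helper is a name made up for this example. On most Linux distributions the group is changed with `usermod`, but check your distribution's documentation before relying on that.

```shell
# has_group "<space-separated group list>" <group>
# Succeeds if <group> appears as a whole word in the list.
has_group() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

if has_group "$(id -nG)" docker; then
  echo "Current user is in the docker group."
else
  echo 'Not in the docker group. On most Linux systems: sudo usermod -aG docker "$USER", then log out and back in.'
fi
```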

On Mac OS X

Install Docker for Mac.

Starting up Elasticsearch

Morph needs Elasticsearch to run. To make development easier, we use Docker to run Elasticsearch:

docker-compose up
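
Elasticsearch can take a little while to start accepting requests. If you script your setup, a polling helper like the one below may be useful. It is a sketch under assumptions: `wait_for_url` is a made-up helper, `curl` must be installed, and the address `http://127.0.0.1:9200` assumes Elasticsearch's default HTTP port is published on your machine by `docker-compose.yml`, which you should verify.

```shell
# wait_for_url URL [TRIES]: poll URL about once a second until it answers
# with a successful HTTP status, giving up after TRIES attempts (default 30).
wait_for_url() {
  url="$1"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    if curl --silent --fail --output /dev/null "$url"; then
      return 0
    fi
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] && sleep 1
  done
  return 1
}

# Example (assumed address):
#   docker-compose up -d
#   wait_for_url http://127.0.0.1:9200 60 && echo "Elasticsearch is up"
```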

To Install Morph

bundle install
cp config/database.yml.example config/database.yml
cp env-example .env

Edit config/database.yml with your database settings

Create an application on GitHub so that morph.io can talk to GitHub. Fill in the following values

Note the use of 127.0.0.1 rather than localhost. Use this or it won't work.

In the .env file, fill in the Client ID and Client Secret details provided by GitHub for the application you've just created.
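
To spot values you still need to fill in, a helper along these lines can list the keys that have no value yet. It is a sketch that assumes the `.env` file uses plain `KEY=value` lines and that unfilled entries are left empty; `empty_env_keys` is a name invented for this example.

```shell
# Print the keys in a dotenv-style file that have no value yet.
empty_env_keys() {
  sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)=[[:space:]]*$/\1/p' "$1"
}

# Only inspect .env if it exists in the current directory.
if [ -f .env ]; then
  empty_env_keys .env
fi
```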

Now set up the databases:

bundle exec dotenv rake db:setup

Now you can start the server

bundle exec dotenv foreman start

and point your browser at http://127.0.0.1:3000

To get started, log in with GitHub. There is a simple admin interface accessible at http://127.0.0.1:3000/admin. To access this, run the following to give your account admin rights:

bundle exec rake app:promote_to_admin

Running tests

If you're running Guard (see "Guard Livereload" below), the tests will also run automatically when you change a file.

By default, RSpec will skip tests that have been tagged as being slow. To change this behaviour, add the following to your .env:

RUN_SLOW_TESTS=1

By default, RSpec will run certain tests against a running Docker server. These tests are quite slow, but have not been tagged as slow. To stop RSpec from running these tests, add the following to your .env:

DONT_RUN_DOCKER_TESTS=1
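
If you prefer to set these flags from the command line rather than in an editor, something like the following would do it. This is a minimal sketch: `set_env_flag` is a hypothetical helper, and it assumes the file ends with a newline so the appended line does not run into the previous one.

```shell
# set_env_flag FILE KEY VALUE: append KEY=VALUE unless KEY is already present.
set_env_flag() {
  if grep -q "^$2=" "$1" 2>/dev/null; then
    echo "$2 is already set in $1, leaving it alone"
  else
    echo "$2=$3" >> "$1"
  fi
}

# Example:
#   set_env_flag .env RUN_SLOW_TESTS 1
#   set_env_flag .env DONT_RUN_DOCKER_TESTS 1
```

The helper never overwrites an existing entry, so running it twice is harmless.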

Guard Livereload

We use Guard and Livereload so that whenever you edit a view in development the web page is reloaded automatically. It's a massive time saver when you're doing design or lots of work in the views. To make it work, run:

bundle exec guard

Guard will also run tests when needed. Some of them are integration tests that run against a running Docker server, and these are very slow. To disable them, start Guard like this:

DONT_RUN_DOCKER_TESTS=1 bundle exec guard

Mail in development

By default, mail sent in development goes to Mailcatcher. To install it:

gem install mailcatcher

Deploying to production

This section will not be relevant to most people. It will however be relevant if you're deploying to a production server.

Ansible Vault

We're using Ansible Vault to encrypt certain files, like the private key for the SSL certificate.

To make this work you will need to put the password in a file at ~/.infrastructure_ansible_vault_pass.txt. This is the same password as used in the openaustralia/infrastructure GitHub repository.
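
Before running anything that needs the vault, it can help to confirm the password file is in place. The check below is a sketch; `check_vault_pass` is a name made up for this example, and the default path is the one given above.

```shell
# check_vault_pass [FILE]: succeed only if the password file exists and is non-empty.
check_vault_pass() {
  file="${1:-$HOME/.infrastructure_ansible_vault_pass.txt}"
  if [ -s "$file" ]; then
    echo "ok: $file"
  else
    echo "missing or empty: $file" >&2
    return 1
  fi
}

# Example:
#   check_vault_pass || echo "create the password file first"
```

Since the file holds a secret, restricting it to your own user (for example with `chmod 600`) is a sensible precaution.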

Production devops development

Install Vagrant, VirtualBox and Ansible.

Install the hostsupdater plugin: vagrant plugin install vagrant-hostsupdater

Run vagrant up local. This will build and provision a box that looks and acts like production at dev.morph.io.

Once the box is created and provisioned, deploy the application to your Vagrant box:

cap local deploy

Now visit https://dev.morph.io/

Production provisioning and deployment

To deploy morph.io to production, normally you'll just want to deploy using Capistrano:

cap production deploy

When you've changed the Ansible playbooks to modify the infrastructure you'll want to run:

ansible-playbook --user=root --inventory-file=provisioning/hosts provisioning/playbook.yml

SSL certificates

We're using Let's Encrypt for SSL certificates. It's not 100% automated. On a completely fresh install (with a new domain), run the following as root:

certbot --nginx certonly -m contact@oaf.org.au --agree-tos

It should show something like this:

Which names would you like to activate HTTPS for?
-------------------------------------------------------------------------------
1: morph.io
2: api.morph.io
3: faye.morph.io
4: help.morph.io

Leave your answer blank, which will install the certificate for all of them.

Installing certificates for local Vagrant build

sudo certbot certonly --manual -d dev.morph.io --preferred-challenges dns -d api.dev.morph.io -d faye.dev.morph.io -d help.dev.morph.io

How to contribute

If you find what looks like a bug:

  • Check the GitHub issue tracker to see if anyone else has reported the issue.
  • If you don't see anything, create an issue with information on how to reproduce it.

If you want to contribute an enhancement or a fix:

  • Fork the project on GitHub.
  • Make your changes with tests.
  • Commit the changes without making changes to any files that aren't related to your enhancement or fix.
  • Send a pull request.

We maintain a list of issues that are easy fixes. Fixing one of these is a great way to get started while you get familiar with the codebase.

Copyright & License

Copyright OpenAustralia Foundation Limited. Licensed under the Affero GPL. See LICENSE file for more details.