
Update the data_pipeline README file #259

MichaelaEBI opened this issue Nov 2, 2018 · 5 comments



commented Nov 2, 2018

There are currently three separate wiki pages in the data_release and data_pipeline repos that describe how to run the pipeline:


A first version of a modified README is now available in the readme_update branch of data_pipeline. It contains all the information from 1. (the most detailed and comprehensive description of how to run the pipeline). There are a few things left to do:

  • Add an image of the pipeline workflow.
  • Move the Installation Instructions in the README file to the relevant sections within the file.
    • In particular, it erroneously states that the data pipeline can be installed with pip install mrtarget.
  • Check whether the environment variables are still correct and relevant.
  • Check which information from wiki pages 2. and 3. should be added.
  • There are different ways to run the pipeline (python -m mrtarget.CommandLine, Docker, Makefile) - add some guidance on when to use each one.
  • Add info on how to run the pipeline with only a subset of the data.


commented Nov 3, 2018

Can we get away with ASCII art for the diagram? It might be easier to maintain than embedding a separate image file.
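If an ASCII diagram is enough, it could live directly in the README and be edited alongside the text. A purely illustrative sketch (these stage names are placeholders, not the pipeline's actual steps):

```
[base data] --> [validate] --> [process evidence] --> [Elasticsearch indices]
```

A real diagram would of course use the pipeline's actual stage names.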



commented Nov 3, 2018

For the environment variables, it might be worth waiting for / coordinating with #146. That's the issue for harmonizing configuration, so it has an impact on how the README describes things. It'll try to be backwards compatible, but...
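To illustrate the kind of pattern #146 could settle on, here is a hedged sketch of reading settings from environment variables with documented defaults. This is not the pipeline's actual configuration code, and the default value is only an example:

```python
import os

# Illustrative defaults only; the real setting names and values are
# being harmonized in issue #146.
DEFAULTS = {
    "ELASTICSEARCH_NODES": "http://localhost:9200",
}

def get_setting(name):
    """Return the environment override for a setting, else its documented default."""
    return os.environ.get(name, DEFAULTS[name])
```

With a pattern like this, the README would only need a single table of setting names and defaults.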



commented Nov 8, 2018

I have been trying to run the pipeline locally following the README in readme_update, and the most confusing part has been working out what the prerequisites are. I ended up installing everything mentioned there, but some of it may not be needed.

This is what I have done:

  1. The first instruction in the README (0. Setting up) is "Ensure the data_pipeline/db.ini file points to the correct Elasticsearch server", so:

    • I have installed Elasticsearch version 5.6.13 following the instructions on their website, which mention the need to install X-Pack.

    • I have added a db.ini file with the following config as suggested by @MichaelaEBI:

      ELASTICSEARCH_NODES = ["http://localhost:9200"]
    • I have installed Kibana version 5.6.13 accordingly, including X-Pack.

  2. I thought that some extra Python packages would be needed to run the pipeline, so I followed the instructions under Contributing to install the dependencies in a virtual environment:
    pip install -r requirements.txt

Then I started loading the base data as explained in 1. Loading the Base data. It would be good to know what is actually needed to run the pipeline.
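As a side note for anyone sanity-checking their db.ini by hand, here is a small sketch (not the pipeline's own parsing code) that reads the ELASTICSEARCH_NODES value shown above. It assumes the file has no [section] header, so a dummy one is prepended to satisfy configparser:

```python
import ast
import configparser

DB_INI = 'ELASTICSEARCH_NODES = ["http://localhost:9200"]\n'

def read_es_nodes(ini_text):
    """Parse ELASTICSEARCH_NODES from db.ini-style text.

    Assumption: db.ini as shown has no [section] header, so a dummy
    [default] section is prepended so configparser will accept it.
    """
    parser = configparser.ConfigParser()
    parser.read_string("[default]\n" + ini_text)
    # The value is a list literal, so parse it safely with literal_eval.
    return ast.literal_eval(parser["default"]["ELASTICSEARCH_NODES"])

print(read_es_nodes(DB_INI))  # -> ['http://localhost:9200']
```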



commented Nov 13, 2018

Separate meeting to follow up and close out the docs update @MichaelaEBI



commented Jan 24, 2019

Given the changes to the pipeline for 19.02 (config etc.), I've updated the README as best I can for now (opentargets/data_pipeline#445) and I'll close this ticket. We can open new issues for specific changes as they become apparent.
