Roadmap: Reproducibility - Remove external data use (API, etc) #325

Closed
afaulconbridge opened this issue Nov 22, 2018 · 5 comments

@afaulconbridge

commented Nov 22, 2018

To improve reproducibility, the pipeline must stop using external data sources (e.g. APIs, databases, git repositories) at run time. Instead, a pre-generation step will "freeze" each external source into a static file, which can then be hosted at a specific location under our control for use by the pipeline.

The pipeline will access these frozen assets by downloading them locally, and then processing them as files. This download and storage will be handled by the Makefile to ensure reproducibility, avoid unnecessary duplication of actions, and improve performance since network bandwidth is often a limitation.
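
The download-once-then-process-locally behaviour described above can be sketched in Python. This is only an illustration of the logic, not the Makefile rule itself. The function names, the `.part` temporary suffix, and the SHA-256 check are assumptions added here; the issue does not say how frozen files are verified.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_frozen(url: str, dest: Path, expected_sha256: str) -> Path:
    """Download url to dest unless a verified local copy already exists.

    Hypothetical helper: the checksum pinning is an assumption, not
    something the issue specifies.
    """
    if dest.exists() and sha256_of(dest) == expected_sha256:
        return dest  # cached copy is intact, so no network access is needed
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, tmp.open("wb") as out:
        shutil.copyfileobj(response, out)
    actual = sha256_of(tmp)
    if actual != expected_sha256:
        tmp.unlink()
        raise ValueError(f"checksum mismatch for {url}: got {actual}")
    tmp.replace(dest)  # only a fully verified file ever appears at dest
    return dest
```

Writing to a temporary name and renaming afterwards means an interrupted download never leaves a partial file at the final path, which is the same property a Makefile target needs in order to be trusted on the next run.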

Subtasks:

  • #278 ensembl
  • #326 gene plugin hgnc
  • #327 reactome
  • #328 hpa
  • #329 gene plugin and Search step chembl
  • #379 gene plugin mousephenotypes
  • #389 efo phenotypes
  • #391 eco score modifiers
@afaulconbridge

Author

commented Nov 22, 2018

A list of URLs called by the pipeline can be retrieved with the --log-http option and processed with scripts/filter_http_logs.py to remove calls to Elasticsearch. Note that this only records HTTP traffic made via the requests library, so it may not capture everything, particularly access via transitive dependencies.
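
For illustration, the kind of filtering described above might look like the sketch below. It is not the contents of scripts/filter_http_logs.py. It assumes the log has already been reduced to one URL per line, and that Elasticsearch can be recognised by a known set of `host:port` values; both the input format and the host names are hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: these host:port pairs identify Elasticsearch in this deployment.
ELASTICSEARCH_HOSTS = {"localhost:9200", "elasticsearch:9200"}


def external_urls(lines, internal_hosts=ELASTICSEARCH_HOSTS):
    """Return the sorted, de-duplicated URLs whose host is not internal."""
    seen = set()
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        if urlsplit(url).netloc not in internal_hosts:
            seen.add(url)
    return sorted(seen)
```

Passing an open file object as `lines` works, since the function only iterates over it line by line.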

@afaulconbridge

Author

commented Nov 22, 2018

I've added the external APIs I know about via the HTTP logging. There may be others lurking in the code. Once these are fixed, we should check by running the pipeline without internet access (e.g. in a Docker container with external networking disabled).
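
As a complement to the container-level check, a similar guard can be sketched inside the Python process, for example around a test run. This is a different technique from disabling networking in Docker, and the names `no_external_network` and `LOCAL_HOSTS` are hypothetical. It only intercepts `socket.socket.connect`, so DNS lookups, `connect_ex`, and subprocesses are not covered.

```python
import socket
from contextlib import contextmanager

# Assumption: only loopback addresses count as "internal".
LOCAL_HOSTS = {"127.0.0.1", "::1", "localhost"}


@contextmanager
def no_external_network():
    """Make any socket connection to a non-local host raise RuntimeError."""
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        # AF_INET/AF_INET6 addresses are tuples starting with the host;
        # Unix socket paths are strings and are passed through untouched.
        if isinstance(address, tuple) and address[0] not in LOCAL_HOSTS:
            raise RuntimeError(f"external network access attempted: {address!r}")
        return original_connect(self, address)

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        socket.socket.connect = original_connect  # always restore the original
```

Wrapping a pipeline step in `with no_external_network():` turns a silent external call into an immediate error that names the address being contacted.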

@afaulconbridge

Author

commented Nov 22, 2018

Another issue is json_schema. Currently this is read from a version-specific GitHub URL, which should not change and should be under our control. We may, however, decide that it needs to be more rigorously controlled, e.g. by being fixed inside the Docker image.

@afaulconbridge

Author

commented Dec 19, 2018

This will need some coordination with #146 (harmonize user configuration).

@afaulconbridge afaulconbridge pinned this issue Dec 21, 2018

@ElaineMcA ElaineMcA self-assigned this Feb 5, 2019

@ElaineMcA ElaineMcA referenced this issue Mar 28, 2019

Open

Roadmap: Refactor platform pipeline #445

7 of 11 tasks complete
@afaulconbridge

Author

commented Apr 25, 2019

All of these seem to be done now, so closing this issue. If more appear in future, new issue(s) should be created for them.

@afaulconbridge afaulconbridge unpinned this issue May 1, 2019

@ElaineMcA ElaineMcA changed the title Pipeline refactor: Remove external data use (API, etc) Roadmap: Reproducibility - Remove external data use (API, etc) May 10, 2019
