
Adding nida studies data from stars #30

Merged

cschreep merged 28 commits into develop from feature/nidaW1 on Aug 18, 2021
Conversation

@warrenstephens

No description provided.

@warrenstephens warrenstephens requested a review from cschreep July 29, 2021 20:16

@cschreep cschreep left a comment


One major component is multi-tenancy. We need to take a config variable (e.g. env var ROGER_DATA_SOURCE={bdc,nida,sparc}) and execute conditional logic in the pipeline to ensure that the correct data, parser, etc. are used.
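A minimal sketch of the conditional dispatch this describes: a single env var selects which tenant's data and parser the pipeline uses. The parser names and the default are illustrative assumptions, not Roger's actual classes.

```python
import os

# Hypothetical mapping of data source to parser; names are examples only.
PARSERS = {
    "bdc": "DbGaPParser",
    "nida": "NIDAParser",
    "sparc": "SPARCParser",
}

def select_parser():
    """Pick a parser based on ROGER_DATA_SOURCE (assumed default: bdc)."""
    source = os.environ.get("ROGER_DATA_SOURCE", "bdc")
    if source not in PARSERS:
        raise ValueError(f"unknown ROGER_DATA_SOURCE: {source!r}")
    return PARSERS[source]
```

The same lookup could gate which DAG tasks run, keeping all tenants in one codebase.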

@YaphetKG

YaphetKG commented Aug 2, 2021

To expand on Carl's idea a little bit, I think we might want to formulate this as metadata.

As a case study, examine the pattern used in get_kgx_files in the tranql-translate pipeline. There we have two main variables that drive the tranql-translate graph:

  • meta_data.yaml: metadata of a versioned list of files
  • ROGER_KGX_DATASET__VERSION (env var) / kgx.dataset_version (in dags/roger/config.yaml)

and by changing ROGER_KGX_DATASET__VERSION we can build a completely new graph without a code change.

Similarly, if we devise a metadata yaml or some sort of versioned input file list for datasets such as nida to be used by the annotate pipeline (having the get_db_gap_files or get_topmed_files tasks be controlled by it), then, following the same pattern, we could probably avoid having to build multiple pipelines for getting datasets.

I think this pattern has two advantages:

  1. a more generic pipeline, able to work on any dbGaP/topmed-formatted input without code changes
  2. new files can just be added to a list and driven by an env variable, as Carl mentioned in the comment above
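The pattern above can be sketched in a few lines. The metadata keys and file names here are examples, not the real meta_data.yaml contents.

```python
import os

# Illustrative versioned-file-list metadata; an env var picks the bucket.
META_DATA = {
    "kgx": {
        "v1.0": ["kgx_file_a.json"],
        "v2.0": ["kgx_file_a.json", "kgx_file_b.json"],
    }
}

def get_kgx_files():
    # Changing the env var alone selects a different versioned bucket,
    # so a new graph can be built without any code change.
    version = os.environ.get("ROGER_KGX_DATASET__VERSION", "v2.0")
    return META_DATA["kgx"][version]
```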

@YaphetKG

YaphetKG commented Aug 3, 2021

Based on our discussion from earlier, I think we are good on this PR; further modifications on the metadata-based approach may be handled on a separate branch.

@cschreep

cschreep commented Aug 4, 2021

We can't merge this PR since it isn't backwards-compatible with BDC, which we will be deploying to a new namespace in the very near future.

@YaphetKG YaphetKG marked this pull request as draft August 4, 2021 14:56
@YaphetKG YaphetKG marked this pull request as ready for review August 5, 2021 13:40
@YaphetKG

YaphetKG commented Aug 5, 2021

Here is a summary of the changes:

metadata.yaml has been moved to dag/metadata.yaml, and it now has two sections, kgx and dug_inputs. Under each of these is a versions list: a list of versioned dataset "buckets", each with a name. The dug_inputs entries also have a format (dbGaP or topmed) and a version number.

On the config side we have the following changes:

```yaml
kgx_base_data_uri: https://stars.renci.org/var/kgx_data/
annotation_base_data_uri: https://stars.renci.org/var/dug/

kgx:
  biolink_model_version: 1.5.0
  dataset_version: v2.0
  data_sets:
    - baseline-graph

dug_inputs:
  dataset_version: v1.0
  data_sets:
    - topmed
    - bdc-dbGaP
```

When doing annotation, the dug_inputs config is matched against metadata.yaml: datasets whose name appears in dug_inputs.data_sets and whose version matches dug_inputs.dataset_version are selected. Using the format spec in the metadata, the pipeline then parses the list of files specified there. Files are fetched from:

`<annotation_base_data_uri>/<dug_inputs.dataset_version>/`

```yaml
dug_inputs:
  versions:
    - name: topmed
      version: v1.0
      files:
        - topmed_variables_v2.0.csv
        - topmed_tags_v2.0.json
      format: topmed
    - name: bdc-dbGaP
      version: v1.0
      files:
        - bdc_dbgap_data_dicts.tar.gz
      format: dbGaP
```

The above applies similarly to kgx files.
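The config-to-metadata lookup described above can be sketched as follows. The dict shapes mirror the YAML samples in this comment; the helper name resolve_input_files is an illustrative assumption, not the actual pipeline function.

```python
# Config and metadata, taken from the YAML examples in this PR comment.
CONFIG = {
    "annotation_base_data_uri": "https://stars.renci.org/var/dug/",
    "dug_inputs": {
        "dataset_version": "v1.0",
        "data_sets": ["topmed", "bdc-dbGaP"],
    },
}

METADATA = {
    "dug_inputs": {
        "versions": [
            {"name": "topmed", "version": "v1.0", "format": "topmed",
             "files": ["topmed_variables_v2.0.csv", "topmed_tags_v2.0.json"]},
            {"name": "bdc-dbGaP", "version": "v1.0", "format": "dbGaP",
             "files": ["bdc_dbgap_data_dicts.tar.gz"]},
        ]
    }
}

def resolve_input_files(config, metadata):
    """Select metadata entries matching the configured names and version,
    and build (format, url) pairs from <base_uri>/<dataset_version>/<file>."""
    cfg = config["dug_inputs"]
    base = config["annotation_base_data_uri"]
    selected = []
    for entry in metadata["dug_inputs"]["versions"]:
        if (entry["name"] in cfg["data_sets"]
                and entry["version"] == cfg["dataset_version"]):
            for f in entry["files"]:
                selected.append(
                    (entry["format"], f"{base}{cfg['dataset_version']}/{f}"))
    return selected
```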

For environment variables, we can set ROGER_DUG__INPUTS_DATA__SETS="bdc-dbGaP,nida" to selectively specify data set names from metadata.yaml.

@cschreep cschreep self-requested a review August 18, 2021 20:32
@cschreep cschreep merged commit d3e10a2 into develop Aug 18, 2021
@cschreep cschreep deleted the feature/nidaW1 branch August 18, 2021 20:37