
Adding nida studies data from stars #30

Merged

cschreep merged 28 commits into develop from feature/nidaW1 on Aug 18, 2021
Conversation

@warrenstephens

No description provided.

@warrenstephens warrenstephens requested a review from cschreep July 29, 2021 20:16

@cschreep cschreep left a comment


One major component is multi-tenancy. We need to take a config variable (e.g. env var ROGER_DATA_SOURCE={bdc,nida,sparc}) and execute conditional logic in the pipeline to ensure that the correct data, parser, etc. are used.
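A minimal sketch of the conditional dispatch this describes: a single env var selects which tenant's data and parser the pipeline uses. The parser names and the default are illustrative assumptions, not Roger's actual classes.

```python
import os

# Hypothetical mapping of data source to parser; names are examples only.
PARSERS = {
    "bdc": "DbGaPParser",
    "nida": "NIDAParser",
    "sparc": "SPARCParser",
}

def select_parser():
    """Pick a parser based on ROGER_DATA_SOURCE (assumed default: bdc)."""
    source = os.environ.get("ROGER_DATA_SOURCE", "bdc")
    if source not in PARSERS:
        raise ValueError(f"unknown ROGER_DATA_SOURCE: {source!r}")
    return PARSERS[source]
```

The same lookup could gate which DAG tasks run, keeping all tenants in one codebase.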

@YaphetKG

YaphetKG commented Aug 2, 2021

To expand on Carl's idea a little bit, I think we might want to formulate this as metadata.

As a case study, examine the pattern used in get_kgx_files in the tranql-translate pipeline. There we have two main variables that drive the tranql-translate graph:

  • meta_data.yaml: metadata of a versioned list of files
  • ROGER_KGX_DATASET__VERSION (env var) / kgx.dataset_version (in dags/roger/config.yaml)

and by changing ROGER_KGX_DATASET__VERSION we can build a completely new graph without a code change.

Similarly, if we devise a metadata yaml or some sort of versioned input file list for datasets such as nida to be used by the annotate pipeline (having the get_db_gap_files or get_topmed_files tasks be controlled by it), then, following the same pattern, we could probably avoid having to build multiple pipelines for getting datasets.

I think this pattern has two advantages:

  1. a more generic pipeline, able to work on any dbGaP/topmed-formatted input without code changes
  2. new files can just be added to a list and driven by an env variable, as Carl mentioned in the comment above
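The pattern above can be sketched in a few lines. The metadata keys and file names here are examples, not the real meta_data.yaml contents.

```python
import os

# Illustrative versioned-file-list metadata; an env var picks the bucket.
META_DATA = {
    "kgx": {
        "v1.0": ["kgx_file_a.json"],
        "v2.0": ["kgx_file_a.json", "kgx_file_b.json"],
    }
}

def get_kgx_files():
    # Changing the env var alone selects a different versioned bucket,
    # so a new graph can be built without any code change.
    version = os.environ.get("ROGER_KGX_DATASET__VERSION", "v2.0")
    return META_DATA["kgx"][version]
```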

@YaphetKG

YaphetKG commented Aug 3, 2021

Based on our discussion from earlier, I think we are good on this PR; further modifications on the metadata-based approach may be handled on a separate branch.

@cschreep

cschreep commented Aug 4, 2021

We can't merge this PR since it isn't backwards-compatible with BDC, which we will be deploying to a new namespace in the very near future.

@YaphetKG YaphetKG marked this pull request as draft August 4, 2021 14:56
@YaphetKG YaphetKG marked this pull request as ready for review August 5, 2021 13:40
@YaphetKG

YaphetKG commented Aug 5, 2021

Here is a summary of the changes:

metadata.yaml has been moved to dag/metadata.yaml, and it now has two sections, kgx and dug_inputs. Under each of these is a versions list: a list of versioned dataset "buckets", each with a name. The dug_inputs entries also have a format (dbGaP or topmed) and a version number.

On the config side we have the following changes:

```yaml
kgx_base_data_uri: https://stars.renci.org/var/kgx_data/
annotation_base_data_uri: https://stars.renci.org/var/dug/

kgx:
  biolink_model_version: 1.5.0
  dataset_version: v2.0
  data_sets:
    - baseline-graph

dug_inputs:
  dataset_version: v1.0
  data_sets:
    - topmed
    - bdc-dbGaP
```

When doing annotation, the dug_inputs config is matched against metadata.yaml: datasets whose name appears in dug_inputs.data_sets and whose version matches dug_inputs.dataset_version are selected. Using the format spec in the metadata, the pipeline then parses the list of files specified there. Files are fetched from:

`<annotation_base_data_uri>/<dug_inputs.dataset_version>/`

```yaml
dug_inputs:
  versions:
    - name: topmed
      version: v1.0
      files:
        - topmed_variables_v2.0.csv
        - topmed_tags_v2.0.json
      format: topmed
    - name: bdc-dbGaP
      version: v1.0
      files:
        - bdc_dbgap_data_dicts.tar.gz
      format: dbGaP
```

The above applies similarly to kgx files.
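The config-to-metadata lookup described above can be sketched as follows. The dict shapes mirror the YAML samples in this comment; the helper name resolve_input_files is an illustrative assumption, not the actual pipeline function.

```python
# Config and metadata, taken from the YAML examples in this PR comment.
CONFIG = {
    "annotation_base_data_uri": "https://stars.renci.org/var/dug/",
    "dug_inputs": {
        "dataset_version": "v1.0",
        "data_sets": ["topmed", "bdc-dbGaP"],
    },
}

METADATA = {
    "dug_inputs": {
        "versions": [
            {"name": "topmed", "version": "v1.0", "format": "topmed",
             "files": ["topmed_variables_v2.0.csv", "topmed_tags_v2.0.json"]},
            {"name": "bdc-dbGaP", "version": "v1.0", "format": "dbGaP",
             "files": ["bdc_dbgap_data_dicts.tar.gz"]},
        ]
    }
}

def resolve_input_files(config, metadata):
    """Select metadata entries matching the configured names and version,
    and build (format, url) pairs from <base_uri>/<dataset_version>/<file>."""
    cfg = config["dug_inputs"]
    base = config["annotation_base_data_uri"]
    selected = []
    for entry in metadata["dug_inputs"]["versions"]:
        if (entry["name"] in cfg["data_sets"]
                and entry["version"] == cfg["dataset_version"]):
            for f in entry["files"]:
                selected.append(
                    (entry["format"], f"{base}{cfg['dataset_version']}/{f}"))
    return selected
```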

For environment variables, we can set ROGER_DUG__INPUTS_DATA__SETS="bdc-dbGaP,nida" to selectively specify data set names from metadata.yaml.

@cschreep cschreep self-requested a review August 18, 2021 20:32
@cschreep cschreep merged commit d3e10a2 into develop Aug 18, 2021
@cschreep cschreep deleted the feature/nidaW1 branch August 18, 2021 20:37