pull mutation data from GDC #80

Closed
justaddcoffee opened this issue Feb 29, 2024 · 20 comments · Fixed by #81

Comments

@justaddcoffee
Member

We'd like to pull mutation data from GDC directly if possible

@sujaypatil96 pointed us to some code that might help here - see cell 16:
https://github.com/cancerDHC/example-data/tree/main/cptac2-subject-09CO022

This code might also be useful for extracting things from what the code above retrieves from GDC:
https://github.com/cancerDHC/example-data/tree/main

cc @sujaypatil96 @msierk @ielis

@sujaypatil96
Collaborator

Looks like GDC has an API that we can use to pull the data directly: https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/#getting-started

We don't need to use the CDA library to bring this data in. We can write a Python client/script that pulls the data by calling the above API.

@sujaypatil96
Collaborator

I can work on modifying the code here: https://github.com/monarch-initiative/oncoexporter/blob/develop/src/oncoexporter/cda/cda_mutation_factory.py to bring in mutation data directly from GDC.

@justaddcoffee
Member Author

great @sujaypatil96!

@msierk
Collaborator

msierk commented Feb 29, 2024

My recollection from what Brian & Matt said at the hackathon was that the CRDC-H model did not have mutation information included in it yet. If you look at Appendix A of the GDC API docs, I do not see any of the fields that are available in the CDA mutation endpoint.

My view is that we do not need to be restricted to using CDA if it doesn't do what we want, but that we should not recreate existing capabilities unnecessarily. The CDA has some process for producing the mutation table, and it makes sense to me to at least try to understand how they did that before trying to build our own from scratch.

However, if Sujay can figure out an easy way to get the mutation data directly from GDC, I certainly don't have any objections to doing things that way.

@sujaypatil96
Collaborator

sujaypatil96 commented Mar 1, 2024

Let's consider a subject/case (a8b1f6e7-2bcf-460d-b1c6-1792a9801119) browsable on GDC: https://portal.gdc.cancer.gov/cases/a8b1f6e7-2bcf-460d-b1c6-1792a9801119

My understanding is that the mutation information we want to pull for cases from GDC is what's available under the "MOST FREQUENT SOMATIC MUTATIONS" section/table on the above webpage. To obtain this data we would need to query the "SSM (Simple Somatic Mutation)" endpoint, i.e. the GDC mutation data can be found at the /ssms endpoint.
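
A minimal sketch of what such a query could look like, assuming the public base URL `https://api.gdc.cancer.gov` and that `/ssms` accepts a filter on `cases.case_id`. The filter field name and the selected fields are assumptions to check against the GDC field listing, and `build_ssm_query` / `fetch_ssms` are hypothetical helper names, not part of any library:

```python
import json
import urllib.parse
import urllib.request

GDC_API = "https://api.gdc.cancer.gov"  # assumed public base URL

# Assumed subset of /ssms fields; adjust after checking the GDC docs.
SSM_FIELDS = ("ssm_id", "genomic_dna_change", "chromosome", "start_position",
              "reference_allele", "tumor_allele", "mutation_subtype")


def build_ssm_query(case_id, size=100, offset=0):
    """Build the query parameters for GET /ssms restricted to one case."""
    filters = {
        "op": "in",
        # "cases.case_id" is an assumed filter field name.
        "content": {"field": "cases.case_id", "value": [case_id]},
    }
    return {
        "filters": json.dumps(filters),
        "fields": ",".join(SSM_FIELDS),
        "format": "JSON",
        "size": str(size),
        "from": str(offset),
    }


def fetch_ssms(case_id, timeout=30):
    """Fetch one page of mutations for a case (performs a network call)."""
    query = urllib.parse.urlencode(build_ssm_query(case_id))
    with urllib.request.urlopen(f"{GDC_API}/ssms?{query}", timeout=timeout) as response:
        payload = json.load(response)
    # GDC wraps results as {"data": {"hits": [...], "pagination": {...}}}.
    return payload["data"]["hits"]
```

Calling `fetch_ssms("a8b1f6e7-2bcf-460d-b1c6-1792a9801119")` would return the first page of hits for the case above; a real client would also need to follow the pagination block for cases with more mutations than `size`.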

@justaddcoffee
Member Author

thanks all

I'd suggest that Sujay does a first pass at collecting MOST FREQUENT SOMATIC MUTATIONS using the GDC API, and then we can have a closer look. Does that sound reasonable?

@msierk
Collaborator

msierk commented Mar 1, 2024

Ahh, my problem was I did not look at the "Data Analysis" page, which describes the mutation endpoints: https://docs.gdc.cancer.gov/API/Users_Guide/Data_Analysis/

@sujaypatil96
Collaborator

I've experimented with pulling mutation data from GDC directly here: https://gist.github.com/sujaypatil96/5659f766abeed7adf52fb6ce771e5552

@sujaypatil96
Collaborator

sujaypatil96 commented Mar 5, 2024

I was looking at the list of mutation fields to pull in (from CDA) in the oncoexporter code and saw this: https://github.com/monarch-initiative/oncoexporter/blob/develop/src/oncoexporter/cda/cda_mutation_factory.py#L18-L50

Based on that list (and the full list of fields that we can pull mutation information for here), I wrote a quick Python script to demo/illustrate how we can use the GDC API (specifically the /ssms endpoint) to pull in mutation information for a specific case (case_id).

@sujaypatil96
Collaborator

There's more information available at the /ssm_occurrences endpoint, which we can also retrieve. See https://docs.gdc.cancer.gov/API/Users_Guide/Data_Analysis/ for examples.
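
As a rough illustration of using that endpoint, the sketch below builds a per-case query and tallies the returned occurrences by chromosome. The filter field `case.case_id`, the `ssm.*` field names, and the nested shape of each hit are assumptions based on the Data Analysis page and should be verified; both function names are hypothetical:

```python
import json
from collections import Counter

# Endpoint URL assumed from the public GDC API base.
SSM_OCCURRENCES_URL = "https://api.gdc.cancer.gov/ssm_occurrences"


def occurrence_query_params(case_id, size=1000):
    """Build query parameters selecting the SSM occurrences of one case."""
    filters = {
        "op": "in",
        # "case.case_id" is an assumed filter field name.
        "content": {"field": "case.case_id", "value": [case_id]},
    }
    return {
        "filters": json.dumps(filters),
        "fields": "ssm.ssm_id,ssm.chromosome",  # assumed field names
        "format": "JSON",
        "size": str(size),
    }


def count_by_chromosome(hits):
    """Count occurrences per chromosome from a list of hit dicts."""
    return Counter(hit["ssm"]["chromosome"] for hit in hits)
```

The parameters can be URL-encoded and appended to `SSM_OCCURRENCES_URL` in the same way as for `/ssms`; `count_by_chromosome` then works on the `data.hits` list of the response.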

@justaddcoffee
Member Author

Thanks @sujaypatil96! We'll take a look hopefully today

@justaddcoffee
Member Author

cc: @pnrobinson

@justaddcoffee
Member Author

justaddcoffee commented Mar 5, 2024

@ielis could you have a go at incorporating Sujay's code into oncoexporter?

I think we just need to put Sujay's code in cda_mutation_factory and also write a bit of code to translate the mutation JSON into phenopacket items - glad to hack on this with you
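
The translation step could look roughly like the sketch below. It uses plain dicts shaped like the Phenopacket Schema `VariationDescriptor` JSON rather than the phenopackets library, and it assumes each `/ssms` hit carries the fields shown in the sample record. The function name and the sample values are hypothetical and were not pulled from GDC:

```python
def ssm_to_variation_descriptor(ssm):
    """Map one /ssms hit (a dict) onto a VariationDescriptor-shaped dict."""
    descriptor = {
        "id": ssm["ssm_id"],
        "moleculeContext": "genomic",
        "vcfRecord": {
            "genomeAssembly": ssm["ncbi_build"],
            "chrom": ssm["chromosome"],
            "pos": ssm["start_position"],
            "ref": ssm["reference_allele"],
            "alt": ssm["tumor_allele"],
        },
    }
    # Attach the genomic HGVS expression only when GDC provides one.
    hgvs = ssm.get("genomic_dna_change")
    if hgvs:
        descriptor["expressions"] = [{"syntax": "hgvs.g", "value": hgvs}]
    return descriptor


# Hypothetical record for illustration only.
example_hit = {
    "ssm_id": "example-ssm-id",
    "ncbi_build": "GRCh38",
    "chromosome": "chr7",
    "start_position": 140753336,
    "reference_allele": "A",
    "tumor_allele": "T",
    "genomic_dna_change": "chr7:g.140753336A>T",
}
example_descriptor = ssm_to_variation_descriptor(example_hit)
```

Indels may need allele normalization before they form valid VCF records, and fields such as gene context or read depths would have to come from additional GDC fields.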

@ielis
Member

ielis commented Mar 6, 2024

@justaddcoffee @sujaypatil96
I added a draft PR with a class that builds heavily on @sujaypatil96's gist. The class can fetch variants for a subject ID.

The class is, however, not hooked up to the rest of the framework yet. Unfortunately, I cannot work on that this week; I'm taking 3 days off starting Wednesday.

Do you guys think you can look into this? The idea would be to use it instead of the CdaMutationFactory in CdaTableImporter and to try to fill in the missing VariationDescriptor fields, if possible (e.g. tumor/normal depths, gene, ...).

@justaddcoffee
Member Author

great @ielis ! thanks

@sujaypatil96 do you have any time this week to hook up Daniel's code into CdaTableImporter in place of the CdaMutationFactory, and also try to get gene info from the GDC API?

@sujaypatil96
Collaborator

@ielis thanks for working on #81, it looks really good!

@justaddcoffee I'm mostly working on some high-priority NMDC tasks for the rest of the week, but if I get done with them early I can take a look at hooking it up with the rest of the framework.

@justaddcoffee
Member Author

@sujaypatil96 okay, no worries at all - NMDC I think should take precedence

@sujaypatil96
Collaborator

@justaddcoffee happy to take a look at hooking up the code from @ielis in #81 with the rest of the framework tomorrow if no one else is working on it.

@justaddcoffee
Member Author

Okay great @sujaypatil96

I don't think anyone else is currently working on this

@sujaypatil96
Collaborator

Sounds good! I'll work on it sometime today/tomorrow.

@sujaypatil96 sujaypatil96 linked a pull request Mar 29, 2024 that will close this issue
@ielis ielis closed this as completed in #81 Apr 3, 2024