## FHIR for Research Workshop - Exercise 2

## Learning Objectives and Key Concepts

In this exercise, you will: 

- Apply knowledge from previous exercises
- Understand dbGaP extensions to ResearchStudy
- Query for dbGaP asthma studies by title
- Query for dbGaP studies by codes

## Identify dbGaP Asthma datasets
For this exercise we will search for studies/datasets in dbGaP related to asthma. 

## Motivation/Purpose
To identify asthma studies whose data could be aggregated. We list each study's consent codes to check whether data from different studies may be used for a particular research purpose.

### Icons in this Guide
📘 A link to a useful external reference related to the section the icon appears in

🖐 A hands-on section where you will code something or interact with the server
 
 
We acknowledge the use of code snippets from [NIH FHIR training](https://github.com/NIH-ODSS/fhir-exercises/tree/main/Python) Exercise 0.

## Step 1:  Set up a FHIR Client

Obtain our NCBI API key and set up a client that makes requests and handles details such as pagination.

In [1]:
import json
import os
from dbgapfhir import dbgapfhir
import pandas as pd

FHIR_SERVER = 'https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1'
API_KEY_PATH = '~/.keys/ncbi_api_key.txt'

# Read the API key, stripping the trailing newline that key files usually end with
with open(os.path.expanduser(API_KEY_PATH)) as f:
    api_key = f.read().strip()

mf = dbgapfhir(FHIR_SERVER, api_key=api_key)
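
The `dbgapfhir` helper hides the paging of search results. For orientation, the sketch below shows the general FHIR pattern such a client presumably follows: a search returns a `Bundle`, and further pages are reached through the `link` entry whose `relation` is `next`. The names `fetch_all` and `get_json` are hypothetical and not part of `dbgapfhir`, and the fake pages stand in for real HTTP responses.

```python
def fetch_all(first_url, get_json):
    """Collect resources from every page of a FHIR search result.

    get_json is any callable that takes a URL and returns the parsed
    Bundle (a dict); in real use it would wrap an HTTP GET.
    """
    resources = []
    url = first_url
    while url:
        bundle = get_json(url)
        # Each Bundle entry wraps one resource
        for entry in bundle.get("entry", []):
            resources.append(entry["resource"])
        # Follow the 'next' link, if the server supplied one
        url = next(
            (link["url"] for link in bundle.get("link", [])
             if link.get("relation") == "next"),
            None,
        )
    return resources


# Two fake pages standing in for server responses (made-up data)
FAKE_PAGES = {
    "page1": {
        "resourceType": "Bundle",
        "entry": [{"resource": {"id": "a"}}, {"resource": {"id": "b"}}],
        "link": [{"relation": "next", "url": "page2"}],
    },
    "page2": {
        "resourceType": "Bundle",
        "entry": [{"resource": {"id": "c"}}],
        "link": [{"relation": "self", "url": "page2"}],
    },
}

all_resources = fetch_all("page1", FAKE_PAGES.__getitem__)
print([r["id"] for r in all_resources])
```

Passing the fetch function in as an argument keeps the paging logic testable without a network connection or an API key.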

## Step 2:  Explore ResearchStudy resource with dbGaP extensions

To see what a ResearchStudy resource looks like in dbGaP on FHIR, we'll query for a single study and print the whole resource. 'Resource' is FHIR's generic term for the real-world entities it can represent; for this query the resource is a ResearchStudy.

The output shows that some of the data we want to extract from the resource are held in extensions to the FHIR model.

In [3]:
documents = mf.run_query("ResearchStudy?_id=phs001156")
print("# of studies:{}".format(len(documents)))

for s in documents:

    print ("Study id: {}".format(s['id']))
    print ("Study title: {}".format(s['title']))
    print ("Full resource")
    print(json.dumps(s, indent=3))
    print('_'*40)

Total  Resources: 1
Total  Bytes: 9383
Time elapsed 0.1386 seconds
# of studies:1
Study id: phs001156
Study title: The EVE Asthma Genetics Consortium: Building Upon GWAS
Full resource
{
   "resourceType": "ResearchStudy",
   "id": "phs001156",
   "meta": {
      "versionId": "1",
      "lastUpdated": "2022-02-14T01:59:54.353-05:00",
      "source": "#LLcpbzxw95eFPBGW",
      "security": [
         {
            "system": "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/CodeSystem/DbGaPConcept-SecurityStudyConsent",
            "code": "public",
            "display": "public"
         }
      ]
   },
   "extension": [
      {
         "url": "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-StudyOverviewUrl",
         "valueUrl": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001156.v2.p1"
      },
      {
         "url": "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-ReleaseDate",
         "valueDate": "20

Note that the extensions include:

- A link to the study description, e.g. https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001156.v2.p1
- The study release date
- Consent groups: groups of subjects within the study who consented to specific uses of their data
- Summary counts of various kinds: subjects, samples, variables, etc.
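
To get a quick overview of which extensions a resource carries without reading the full JSON, a small recursive walk can help. The sketch below is illustrative only: `list_extensions` is a hypothetical helper, and the toy resource uses invented `example.org` URLs rather than the real dbGaP ones.

```python
def list_extensions(resource, depth=0):
    """Return (depth, url, value_keys) for every extension, including nested ones."""
    rows = []
    for ext in resource.get("extension", []):
        # An extension carries either a value[x] element or nested extensions
        value_keys = [k for k in ext if k.startswith("value")]
        rows.append((depth, ext["url"], value_keys))
        rows.extend(list_extensions(ext, depth + 1))
    return rows


# Hand-made resource for illustration; the URLs and values are invented
toy_study = {
    "resourceType": "ResearchStudy",
    "id": "example",
    "extension": [
        {"url": "http://example.org/overview-url",
         "valueUrl": "http://example.org/study"},
        {
            "url": "http://example.org/content",
            "extension": [
                {"url": "http://example.org/num-subjects",
                 "valueCount": {"value": 10}},
            ],
        },
    ],
}

for depth, url, value_keys in list_extensions(toy_study):
    print("  " * depth + url, value_keys)
```

Run against a real study (for example `list_extensions(documents[0])`), this should print one line per extension, indented by nesting level.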


### Query for a different study

In [5]:
documents = mf.run_query("ResearchStudy?_id=phs001222")
print("# of studies:{}".format(len(documents)))

for s in documents:

    print ("Study id: {}".format(s['id']))
    print ("Study title: {}".format(s['title']))
    print ("Full resource")
    print(json.dumps(s, indent=3))
    print('_'*40)

Total  Resources: 1
Total  Bytes: 31231
Time elapsed 0.1762 seconds
# of studies:1
Study id: phs001222
Study title: CCDG - Whole Genome Sequencing in Type 1 Diabetes (T1DGC)
Full resource
{
   "resourceType": "ResearchStudy",
   "id": "phs001222",
   "meta": {
      "versionId": "1",
      "lastUpdated": "2022-02-14T01:58:59.250-05:00",
      "source": "#YQ5VqwQ7veThKjIm",
      "security": [
         {
            "system": "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/CodeSystem/DbGaPConcept-SecurityStudyConsent",
            "code": "public",
            "display": "public"
         }
      ]
   },
   "extension": [
      {
         "url": "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-StudyOverviewUrl",
         "valueUrl": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001222.v1.p1"
      },
      {
         "url": "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-ReleaseDate",
         "valueDate":

## Step 3:  Query for asthma studies

With the information above we can now query for studies with the word asthma in the title. We rely on standard FHIR search syntax: the `:contains` modifier matches the word anywhere in the title, and according to the FHIR standard the string search is case-insensitive.
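
The effect of the modifier is easier to see in a local approximation of the FHIR string-matching rules. This is a simplified sketch, not the server's implementation: `string_param_matches` is a hypothetical name, and real servers may also normalise accents and punctuation.

```python
def string_param_matches(field_value, search_value, modifier=None):
    """Approximate FHIR matching rules for a string search parameter."""
    if modifier == "exact":
        # :exact is case-sensitive and must match the whole string
        return field_value == search_value
    field, search = field_value.lower(), search_value.lower()
    if modifier == "contains":
        # :contains matches anywhere in the field
        return search in field
    # No modifier: the field must start with the search value
    return field.startswith(search)


title = "Genome Wide Association Study of Asthma"
print(string_param_matches(title, "asthma"))              # no modifier
print(string_param_matches(title, "asthma", "contains"))  # :contains
print(string_param_matches(title, "genome wide"))         # no modifier
```

Without `:contains`, a search for `title=asthma` would only match titles that begin with the word, which is why the query below uses the modifier.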

In [6]:
documents = mf.run_query("ResearchStudy?title:contains=asthma")

print("# of studies:{}".format(len(documents)))

for s in documents:

    print ("Study id: {}".format(s['id']))
    print ("Study title: {}".format(s['title']))

Total  Resources: 28
Total  Bytes: 380133
Time elapsed 3.1350 seconds
# of studies:28
Study id: phs000166
Study title: SNP Health Association Resource (SHARe) Asthma Resource Project (SHARP)
Study id: phs000233
Study title: Genome Wide Association Study of Asthma
Study id: phs000355
Study title: Genome Wide Association for Asthma and Lung Function
Study id: phs000422
Study title: NHLBI GO-ESP: Lung Cohorts Exome Sequencing Project (Asthma)
Study id: phs000886
Study title: An Omics View of Asthma through Monozygotic Twins
Study id: phs001009
Study title: Determinants of Asthma Following RSV Bronchiolitis in Early Life
Study id: phs001216
Study title: A Genome-Wide Association Study for Post-bronchodilator Lung Function in Children with Asthma
Study id: phs001156
Study title: The EVE Asthma Genetics Consortium: Building Upon GWAS
Study id: phs001123
Study title: Consortium on Asthma among African-ancestry Populations in the Americas
Study id: phs001812
Study title: Genetic and Epigenetic

## Step 4:  Extracting information about the studies
We'll define a function for convenience. Because so many of the study details are in extensions, and there are extensions within extensions, the following function helps us deal with an extension at any level.

In [7]:
# For a given resource, find the extension identified by a given url.
# The assumption is that there is only one such extension within a given resource;
# for the dbGaP ResearchStudy resource that is true.
# Returns None if the resource is None or has no matching extension.
def getExtension(resource, uri):
    if resource is None:
        return None
    exts = [d for d in resource.get('extension', []) if d.get('url') == uri]
    return exts[0] if exts else None

We can now use the function above to find the extensions for
* the number of subjects in the study
* the consent groups within the study

The following extracts these details and puts them into a dataframe.

In [8]:
def studies_to_df(documents, verbose=False):
    studies = []
    for s in documents:

        if verbose:
            print (s['id'])
            print (s['title'])
        # use our function to find the "study content" extension
        content = getExtension(s, "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-Content")
        # use our function again to find the "number of subjects" extension nested within the content extension
        # (skipped when the study has no content extension)
        subject_ext = None
        if content is not None:
            subject_ext = getExtension(content, "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-Content-NumSubjects")
        # Handle the fact that not all studies may have this extension
        if subject_ext is not None and 'value' in subject_ext.get('valueCount', {}):
            subject_count = subject_ext['valueCount']['value']
        else:
            subject_count = 0

        # Now find the extension containing the study consents
        consent_ext = getExtension(s, "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-StudyConsents")
        # extract the display name for each consent group and print them
        if consent_ext is not None:
            consents = [d['valueCoding']['display'] for d in consent_ext['extension'] ]
            if verbose:
                print(consents)
        else:
            consents = []
            
        # focus
        if 'focus' in s:
            focus = s['focus'][0]['text']
            if 'coding' in s['focus'][0]:
                focus_code = s['focus'][0]['coding'][0]['code']
            else:
                focus_code = ''
        else:
            focus = ''
            focus_code = ''
        # Add the relevant details to our list of studies
        study = {"id":s['id'], "title":s["title"], "num_subjects":subject_count,
                 "focus":focus,"focus_mesh":focus_code,"consents":consents}
        studies.append(study)
        if verbose:
            print('_'*40)
    # Build the DataFrame once, after the loop, so an empty result still returns a DataFrame
    df = pd.DataFrame(studies)
    return df
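
To see the nested lookup in isolation, the sketch below walks from a study to its content extension and on to the subject count, using hand-made studies. The two extension URLs are the ones used above; `find_extension` and `subject_count` are hypothetical standalone helpers, and the count of 42 is invented.

```python
CONTENT_URL = "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-Content"
NUM_SUBJECTS_URL = CONTENT_URL + "-NumSubjects"


def find_extension(resource, url):
    """Return the first extension with the given url, or None."""
    if not resource:
        return None
    return next(
        (e for e in resource.get("extension", []) if e.get("url") == url),
        None,
    )


def subject_count(study):
    """Walk study -> content extension -> NumSubjects extension -> valueCount.value."""
    content = find_extension(study, CONTENT_URL)
    num_subjects = find_extension(content, NUM_SUBJECTS_URL)
    if num_subjects is None:
        return 0
    return num_subjects.get("valueCount", {}).get("value", 0)


# Hand-made studies; the count of 42 is invented
with_count = {
    "id": "toy-1",
    "extension": [
        {"url": CONTENT_URL, "extension": [
            {"url": NUM_SUBJECTS_URL, "valueCount": {"value": 42}},
        ]},
    ],
}
without_content = {"id": "toy-2", "extension": []}

print(subject_count(with_count), subject_count(without_content))
```

Returning 0 for a missing count mirrors the function above; a `None` would arguably distinguish "unknown" from "zero subjects" more clearly, at the cost of a nullable column.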

In [9]:
df = studies_to_df(documents)

## Step 5:  Show the study data in a DataFrame

We can now display the DataFrame of studies.

We list the consent groups so we can see whether any of the studies could potentially be used in the Computable Cohort Representation asthma exercise.

In [10]:
pd.set_option('display.max_colwidth', 0)
df.sort_values(by=['id'], inplace=True)
df


|  | id | title | num_subjects | focus | focus_mesh | consents |
|---|---|---|---|---|---|---|
| 0 | phs000166 | SNP Health Association Resource (SHARe) Asthma Resource Project (SHARP) | 4046 |  |  | [NRUP, ARR] |
| 1 | phs000233 | Genome Wide Association Study of Asthma | 0 | Asthma | D001249 | [Analysis] |
| 2 | phs000355 | Genome Wide Association for Asthma and Lung Function | 0 | Asthma | D001249 | [Analysis] |
| 3 | phs000422 | NHLBI GO-ESP: Lung Cohorts Exome Sequencing Project (Asthma) | 191 | Asthma | D001249 | [GRU] |
| 4 | phs000886 | An Omics View of Asthma through Monozygotic Twins | 74 | Asthma | D001249 | [GRU] |
| 15 | phs000920 | NHLBI TOPMed - NHGRI CCDG: Genes-Environments and Admixture in Latino Asthmatics (GALA II) | 4860 | Lung Diseases | D008171 | [DS-LD-IRB-COL] |
| 11 | phs000921 | NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) | 2106 | Lung Diseases | D008171 | [DS-LD-IRB-COL] |
| 17 | phs000988 | NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica | 4230 | Asthma | D001249 | [NRUP, DS-ASTHMA-IRB-MDS-RD] |
| 5 | phs001009 | Determinants of Asthma Following RSV Bronchiolitis in Early Life | 178 | Asthma | D001249 | [NRUP, DS-AAR-IRB] |
| 8 | phs001123 | Consortium on Asthma among African-ancestry Populations in the Americas | 14548 | Asthma | D001249 | [NRUP, GRU-IRB, HMB, DS-LD, HMB-IRB-NPU, DS-FDO-IRB-NPU, HMB-IRB, DS-FDO-IRB] |
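
One way to spot studies that might be combined is to invert the listing: for each consent code, which studies offer it? The sketch below does this in plain Python for a few rows copied from the listing above; `studies_by_consent` is a hypothetical helper. Sharing a code is only a first filter, since the full data use limitations of each study still need to be checked.

```python
from collections import defaultdict

# A few (id, consents) pairs copied from the listing above
studies = [
    ("phs000422", ["GRU"]),
    ("phs000886", ["GRU"]),
    ("phs000920", ["DS-LD-IRB-COL"]),
    ("phs000921", ["DS-LD-IRB-COL"]),
    ("phs001009", ["NRUP", "DS-AAR-IRB"]),
]


def studies_by_consent(studies):
    """Map each consent code to the ids of the studies that offer it."""
    index = defaultdict(list)
    for study_id, consents in studies:
        for code in consents:
            index[code].append(study_id)
    return dict(index)


by_consent = studies_by_consent(studies)
for code, ids in sorted(by_consent.items()):
    print(code, ids)
```

With the DataFrame from this exercise, the same input could be produced with `zip(df['id'], df['consents'])`.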


## Step 6:  Query on focus of study, by code

The MeSH code for asthma is D001249. Here we search for studies whose focus is coded with it.

Note that this returns studies that the title search did not retrieve, i.e. studies without the word asthma in the title.

In [11]:
documents = mf.run_query("ResearchStudy?focus=D001249")

Total  Resources: 25
Total  Bytes: 332675
Time elapsed 2.9763 seconds


In [12]:
studies_to_df(documents) 

|  | id | title | num_subjects | focus | focus_mesh | consents |
|---|---|---|---|---|---|---|
| 0 | phs001542 | NHLBI TOPMed: Genetics of Asthma in Latino Americans (GALA) | 1024 | Asthma | D001249 | [DS-LD-IRB-COL] |
| 1 | phs001732 | NHLBI TOPMed: TReating Children to Prevent EXacerbations of Asthma (TREXA) | 89 | Asthma | D001249 | [DS-ASTHMA-IRB-COL] |
| 2 | phs001730 | NHLBI TOPMed: Pediatric Asthma Controller Trial (PACT) | 41 | Asthma | D001249 | [DS-ASTHMA-IRB-COL] |
| 3 | phs001729 | NHLBI TOPMed: Characterizing the Response to a Leukotriene Receptor Antagonist and an Inhaled Corticosteroid (CLIC) | 19 | Asthma | D001249 | [DS-ASTHMA-IRB-COL] |
| 4 | phs001728 | NHLBI TOPMed: Best ADd-on Therapy Giving Effective Response (BADGER) | 50 | Asthma | D001249 | [DS-ASTHMA-IRB-COL] |
| 5 | phs001727 | NHLBI TOPMed: Pathways to Immunologically Mediated Asthma (PIMA) | 73 | Asthma | D001249 | [DS-ASTHMA-IRB-COL] |
| 6 | phs001661 | NHLBI TOPMed: Genetic Causes of Complex Pediatric Disorders - Asthma (GCPD-A) | 5464 | Asthma | D001249 | [DS-ASTHMA-GSO] |
| 7 | phs001467 | NHLBI TOPMed: Study of Asthma Phenotypes and Pharmacogenomic Interactions by Race-Ethnicity (SAPPHIRE) | 4857 | Asthma | D001249 | [NRUP, HMB-COL] |
| 8 | phs001605 | NHLBI TOPMed: Chicago Initiative to Raise Asthma Health Equity (CHIRAH) | 292 | Asthma | D001249 | [DS-ASTHMA-IRB-COL] |
| 9 | phs001604 | NHLBI TOPMed: Children's Health Study (CHS) Effects of Air Pollution on the Development of Obesity in Children (Meta-AIR) | 56 | Asthma | D001249 | [GRU] |
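
To check how the code-based search differs from the title search, the two result lists can be compared by study id. This is a minimal sketch with stand-in data: `compare_results` is a hypothetical helper and the ids `s1` to `s3` are made up; in the notebook the inputs would be the lists returned by the two `run_query` calls.

```python
def compare_results(title_docs, focus_docs):
    """Split study ids into: title search only, focus search only, both."""
    title_ids = {d["id"] for d in title_docs}
    focus_ids = {d["id"] for d in focus_docs}
    return {
        "title_only": sorted(title_ids - focus_ids),
        "focus_only": sorted(focus_ids - title_ids),
        "both": sorted(title_ids & focus_ids),
    }


# Stand-in result lists with made-up ids
title_docs = [{"id": "s1"}, {"id": "s2"}]
focus_docs = [{"id": "s2"}, {"id": "s3"}]

print(compare_results(title_docs, focus_docs))
```

Studies under `focus_only` are the ones a title search alone would miss, and `title_only` shows where a study mentions asthma in its title but is coded with a different focus.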


## Step 7:  Query for studies sponsored by a particular NIH Institute

We can query on the sponsor parameter to find studies sponsored by a given NIH institute, here the National Institute on Aging (NIA).

In [13]:
ic_studies = mf.run_query("ResearchStudy?sponsor=NIA")

Total  Resources: 39
Total  Bytes: 804259
Time elapsed 3.0077 seconds


In [14]:
studies_to_df(ic_studies) 

|  | id | title | num_subjects | focus | focus_mesh | consents |
|---|---|---|---|---|---|---|
| 0 | phs002610 | Maintenance of Genome Sequence Integrity in Long- and Short-lived Rodent Species | 2 | Longevity | D008136 | [GRU] |
| 1 | phs002411 | Genome Instability in Mammary Cells of Pathogenic BRCA1/2 Mutation Carriers | 16 |  |  | [DS-BRCA-PUB-NPU] |
| 2 | phs002361 | Genomics and Epigenomics of the Elderly Response to Pneumococcal Vaccines | 2 |  |  | [GRU] |
| 3 | phs002202 | A New High-Throughput Sequencing-Based Technology to Study Heterochromatin Structure | 2 | Healthy Volunteers | D064368 | [GRU] |
| 4 | phs000397 | NIA Long Life Family Study (LLFS) | 4997 | Longevity | D008136 | [NRUP, GRU-IRB, GRU-IRB-NPU, DS-LARHC-IRB-NPU] |
| 5 | phs001916 | Pilot Sequencing Study of Peripheral Blood Mononuclear Cells in Human Aging | 20 | Healthy Aging |  | [GRU-IRB] |
| 6 | phs001963 | DEMENTIA-SEQ: WGS in Lewy Body Dementia and Frontotemporal Dementia | 6907 | Lewy Body Disease | D020961 | [GRU] |
| 7 | phs001956 | Genome Instability in Liver Stem and Mature Cells | 15 |  |  | [HMB-PUB-NPU] |
| 8 | phs001808 | Landscape of Somatic Mutations in B Lymphocytes Across Human Lifespan | 14 | Aging | D000375 | [DS-AGR1-NPU] |
| 9 | phs001779 | CIDR-NIA Whole Exome Analysis of Ehlers-Danlos Syndrome | 153 | Ehlers-Danlos syndrome type 3 | C0268337 | [NRUP, DS-HCT-IRB-COL-GSO-RD] |
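
Standard FHIR search lets parameters be combined: separate parameters are ANDed, and comma-separated values within one parameter are ORed. The sketch below only builds such query strings; `build_search` is a hypothetical helper, and whether the dbGaP server supports every combination is an assumption that has not been verified here.

```python
from urllib.parse import urlencode


def build_search(resource_type, **params):
    """Build a relative FHIR search URL; separate parameters are ANDed by the server."""
    # Keep ':' (modifiers) and ',' (OR lists) readable instead of percent-encoding them
    return resource_type + "?" + urlencode(params, safe=":,")


# AND: studies sponsored by NIA whose focus is Longevity (D008136)
and_query = build_search("ResearchStudy", sponsor="NIA", focus="D008136")
# OR within one parameter: focus is Asthma or Lung Diseases
or_query = build_search("ResearchStudy", focus="D001249,D008171")

print(and_query)
print(or_query)
```

The resulting strings have the same shape as the ones passed to `mf.run_query` in this exercise, so they could be tried there directly.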


## Step 8:  Query for studies with a particular disease focus

We can query on the focus parameter to find studies focused on a particular disease. The disease codes used are MeSH terms; D000544, used below, is the code for Alzheimer Disease.


In [15]:
az_studies = mf.run_query("ResearchStudy?focus=D000544")
studies_to_df(az_studies) 

Total  Resources: 9
Total  Bytes: 348787
Time elapsed 0.2566 seconds


|  | id | title | num_subjects | focus | focus_mesh | consents |
|---|---|---|---|---|---|---|
| 0 | phs000572 | Alzheimer's Disease Sequencing Project (ADSP) | 15630 | Alzheimer Disease | D000544 | [HMB-IRB, HMB-IRB-NPU, DS-ALZ-IRB, DS-ALZ-IRB-NPU, DS-ND-IRB, DS-ND-IRB-NPU] |
| 1 | phs000745 | RNAseq analysis of posterior cingulate astrocytes in Alzheimer's disease | 20 | Alzheimer Disease | D000544 | [GRU] |
| 2 | phs000727 | miRNA Profiles in Serum and CSF of Parkinson's and Alzheimer's Patients | 211 | Alzheimer Disease | D000544 | [GRU] |
| 3 | phs000496 | Columbia University Study of Caribbean Hispanics and Late Onset Alzheimer's disease | 3139 | Alzheimer Disease | D000544 | [NRUP, GRU-IRB] |
| 4 | phs000168 | NIA - Late Onset Alzheimer's Disease and National Cell Repository for Alzheimer's Disease Family Study: Genome-Wide Association Study for Susceptibility Loci | 5220 | Alzheimer Disease | D000544 | [NRUP, GRU, NPU, DS-ALZ, DS-ALZ-NPU] |
| 5 | phs000378 | Indianapolis-Ibadan, Nigeria Comparative Epidemiological Study of Dementia | 1251 | Alzheimer Disease | D000544 | [NRUP, GRU-IRB] |
| 6 | phs000372 | ADGC Genome Wide Association Study | 6065 | Alzheimer Disease | D000544 | [ADR] |
| 7 | phs000219 | GenADA/LONG/Imaging (Genetic Alzheimer's Disease Associations) | 1718 | Alzheimer Disease | D000544 | [GBA] |
| 8 | phs000160 | Genetics Consortium for Late Onset of Alzheimer's Disease (LOAD CIDR Project) | 2398 | Alzheimer Disease | D000544 | [NRU, GRU, NPU, ALZ, ALZ_NPU] |
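
Since the focus search works on codes rather than names, a small lookup can make the queries easier to read. The sketch below covers only codes that appear in the `focus_mesh` column of the listings in this exercise; `MESH_CODES` and `focus_query` are hypothetical names for illustration.

```python
# MeSH codes that appear in the focus_mesh column of the listings in this exercise
MESH_CODES = {
    "Asthma": "D001249",
    "Lung Diseases": "D008171",
    "Alzheimer Disease": "D000544",
    "Longevity": "D008136",
}


def focus_query(disease_name):
    """Return the ResearchStudy search string for a disease name we know a code for."""
    try:
        code = MESH_CODES[disease_name]
    except KeyError:
        raise ValueError("No MeSH code recorded for {!r}".format(disease_name))
    return "ResearchStudy?focus={}".format(code)


print(focus_query("Alzheimer Disease"))
```

For diseases outside this small table, the code would need to be looked up in MeSH itself rather than guessed.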
