<h1>GA4GH Disease</h1>
<p>This notebook demonstrates how to use the oncoexporter Python package to create GA4GH Disease messages from Cancer Data Aggregator (CDA) data.
We first extract data about the disease diagnoses in a CDA cohort and then use the package to create the GA4GH Disease messages.</p>
<p>The data is extracted from the <tt>diagnosis</tt> and <tt>researchsubject</tt> tables of CDA.</p>

In [1]:
from oncoexporter.cda import CdaTableImporter, CdaDiseaseFactory
from collections import defaultdict

In [2]:
from cdapython import ( Q, set_default_project_dataset, set_host_url, set_table_version )

set_default_project_dataset("gdc-bq-sample.dev")
set_host_url("http://35.192.60.10:8080/")
set_table_version("all_merged_subjects_v3_2_final")

# Set up the oncoexporter CdaTableImporter and retrieve the disease dataframe from CDA
The importer hides some of the complexities of the CDA code.

In [3]:
cohort_name = "cervix cancer cohort"
# Select subjects whose treatment anatomic site is the cervix
query = 'treatment_anatomic_site = "Cervix"'
Tsite = Q(query)
tableImporter = CdaTableImporter(cohort_name=cohort_name, query_obj=Tsite)
# Merge the diagnosis and researchsubject tables into one dataframe
merged_df = tableImporter.get_merged_diagnosis_research_subject_df()


In [4]:
merged_df.head()

Unnamed: 0,diagnosis_id,diagnosis_identifier,primary_diagnosis,age_at_diagnosis,morphology,stage,grade,method_of_diagnosis,subject_id_di,researchsubject_id,researchsubject_identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id_rs
0,CGCI-HTMCP-CC.HTMCP-03-06-02442.HTMCP-03-06-02...,"[{'system': 'GDC', 'field_name': 'case.diagnos...","Squamous cell carcinoma, nonkeratinizing, NOS",16606.0,8072/3,,G3,,CGCI.HTMCP-03-06-02442,CGCI-HTMCP-CC.HTMCP-03-06-02442,"[{'system': 'GDC', 'field_name': 'case.case_id...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,CGCI.HTMCP-03-06-02442
1,CGCI-HTMCP-CC.HTMCP-03-06-02107.HTMCP-03-06-02...,"[{'system': 'GDC', 'field_name': 'case.diagnos...","Squamous cell carcinoma, nonkeratinizing, NOS",,8072/3,,G3,Biopsy,CGCI.HTMCP-03-06-02107,CGCI-HTMCP-CC.HTMCP-03-06-02107,"[{'system': 'GDC', 'field_name': 'case.case_id...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,CGCI.HTMCP-03-06-02107
2,CGCI-HTMCP-CC.HTMCP-03-06-02156.HTMCP-03-06-02...,"[{'system': 'GDC', 'field_name': 'case.diagnos...","Squamous cell carcinoma, keratinizing, NOS",24831.0,8071/3,,G3,Biopsy,CGCI.HTMCP-03-06-02156,CGCI-HTMCP-CC.HTMCP-03-06-02156,"[{'system': 'GDC', 'field_name': 'case.case_id...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,CGCI.HTMCP-03-06-02156
3,CGCI-HTMCP-CC.HTMCP-03-06-02400.HTMCP-03-06-02...,"[{'system': 'GDC', 'field_name': 'case.diagnos...","Squamous cell carcinoma, nonkeratinizing, NOS",21833.0,8072/3,,G3,Biopsy,CGCI.HTMCP-03-06-02400,CGCI-HTMCP-CC.HTMCP-03-06-02400,"[{'system': 'GDC', 'field_name': 'case.case_id...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,CGCI.HTMCP-03-06-02400
4,CGCI-HTMCP-CC.HTMCP-03-06-02101.HTMCP-03-06-02...,"[{'system': 'GDC', 'field_name': 'case.diagnos...","Squamous cell carcinoma, nonkeratinizing, NOS",,8072/3,,G3,Biopsy,CGCI.HTMCP-03-06-02101,CGCI-HTMCP-CC.HTMCP-03-06-02101,"[{'system': 'GDC', 'field_name': 'case.case_id...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,CGCI.HTMCP-03-06-02101
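
Before converting the rows, it can help to sanity-check them. The sketch below is only an illustration in plain Python, not part of oncoexporter: it copies two columns of the five rows shown above into dictionaries, tallies the diagnoses with `defaultdict`, and converts `age_at_diagnosis` to whole years. It assumes that the ages are given in days and represents the empty cells as `None`. The names `rows`, `DAYS_PER_YEAR`, and `age_in_years` are made up for this example.

```python
from collections import defaultdict

# First five rows of merged_df, copied from the table above
# (only the two columns used here; empty ages are written as None).
rows = [
    {"primary_diagnosis": "Squamous cell carcinoma, nonkeratinizing, NOS", "age_at_diagnosis": 16606.0},
    {"primary_diagnosis": "Squamous cell carcinoma, nonkeratinizing, NOS", "age_at_diagnosis": None},
    {"primary_diagnosis": "Squamous cell carcinoma, keratinizing, NOS", "age_at_diagnosis": 24831.0},
    {"primary_diagnosis": "Squamous cell carcinoma, nonkeratinizing, NOS", "age_at_diagnosis": 21833.0},
    {"primary_diagnosis": "Squamous cell carcinoma, nonkeratinizing, NOS", "age_at_diagnosis": None},
]

DAYS_PER_YEAR = 365.25  # assumption: age_at_diagnosis is given in days


def age_in_years(days):
    """Return the age in whole years, or None if the age is missing."""
    if days is None:
        return None
    return int(days // DAYS_PER_YEAR)


# Count how often each primary diagnosis occurs.
diagnosis_counts = defaultdict(int)
for row in rows:
    diagnosis_counts[row["primary_diagnosis"]] += 1

ages = [age_in_years(row["age_at_diagnosis"]) for row in rows]

for diagnosis, count in diagnosis_counts.items():
    print(f"{count} x {diagnosis}")
print(ages)
```

With a real dataframe, the same checks would more naturally be written with pandas methods; the dictionaries here only keep the example self-contained.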


In [6]:
disease_factory = CdaDiseaseFactory()
ga4gh_disease_messages = []
for _, row in merged_df.iterrows():
    ga4gh_disease_messages.append(disease_factory.to_ga4gh(row=row))
print(f"We extracted {len(ga4gh_disease_messages)} GA4GH Phenopacket Disease messages")

In [7]:
from google.protobuf.json_format import MessageToJson
# MessageToJson returns an already formatted string, so print it directly
json_string = MessageToJson(ga4gh_disease_messages[0])
print(json_string)

{
  "term": {
    "id": "NCIT:C3262",
    "label": "Neoplasm"
  }
}
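
The JSON string can be parsed back into ordinary Python objects when individual fields are needed. The following is a minimal sketch using only the standard `json` module; the string is copied from the output of the first Disease message above, and the variable names `disease`, `term_id`, and `term_label` are chosen for this example.

```python
import json

# JSON string as produced for the first Disease message above.
json_string = '{\n  "term": {\n    "id": "NCIT:C3262",\n    "label": "Neoplasm"\n  }\n}'

# Parse the string into nested dictionaries and read the ontology term.
disease = json.loads(json_string)
term_id = disease["term"]["id"]
term_label = disease["term"]["label"]
print(f"{term_id} ({term_label})")
```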
