<h1>GA4GH Individual</h1>
<p>This notebook demonstrates how to use the oncoexporter Python package to create GA4GH Individual messages from Cancer Data Aggregator (CDA) data.
We first extract data about the subjects in a CDA cohort and then use the package to create the Individual messages.</p>
<p>The data is extracted from the <tt>subject</tt> table of CDA.</p>

# Import classes from oncoexporter and CDA

In [1]:
from oncoexporter.cda import CdaTableImporter, CdaIndividualFactory

In [2]:
from cdapython import Q, set_default_project_dataset, set_host_url, set_table_version

set_default_project_dataset("gdc-bq-sample.dev")
set_host_url("http://35.192.60.10:8080/")
set_table_version("all_merged_subjects_v3_2_final")

# Set up the oncoexporter CdaTableImporter and retrieve the subject DataFrame from CDA
The importer hides some of the complexities of the CDA code.

In [3]:
cohort_name = "cervix cancer cohort"
# Select subjects whose treatment anatomic site is the cervix
query = 'treatment_anatomic_site = "Cervix"'
Tsite = Q(query)
# The importer runs the query and returns the matching subjects as a DataFrame
tableImporter = CdaTableImporter(cohort_name=cohort_name, query_obj=Tsite)
subject_df = tableImporter.get_subject_df()



In [4]:
subject_df.head()

Unnamed: 0,subject_id,subject_identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,CGCI.HTMCP-03-06-02074,"[{'system': 'GDC', 'field_name': 'case.submitt...",Homo sapiens,female,black or african american,not reported,-23305.0,[CGCI-HTMCP-CC],Alive,,
1,CGCI.HTMCP-03-06-02147,"[{'system': 'GDC', 'field_name': 'case.submitt...",Homo sapiens,female,black or african american,Unknown,,[CGCI-HTMCP-CC],Alive,,
2,CGCI.HTMCP-03-06-02206,"[{'system': 'GDC', 'field_name': 'case.submitt...",Homo sapiens,female,black or african american,Unknown,,[CGCI-HTMCP-CC],Alive,,
3,CGCI.HTMCP-03-06-02003,"[{'system': 'GDC', 'field_name': 'case.submitt...",Homo sapiens,female,black or african american,not reported,,[CGCI-HTMCP-CC],Dead,510.0,Unknown
4,CGCI.HTMCP-03-06-02082,"[{'system': 'GDC', 'field_name': 'case.submitt...",Homo sapiens,female,black or african american,not reported,-18106.0,[CGCI-HTMCP-CC],Alive,,


## Convert the CDA subject data to GA4GH Individual messages

The CdaIndividualFactory class contains the code that transforms each CDA subject record (one row of the subject DataFrame) into a GA4GH Individual message.

In [5]:
individual_factory = CdaIndividualFactory()
ga4gh_individuals = []
for _, row in subject_df.iterrows():
    ga4gh_individuals.append(individual_factory.to_ga4gh_individual(row=row))
print(f"We extracted {len(ga4gh_individuals)} GA4GH Phenopacket Individual messages")
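To make the kind of mapping performed here more concrete, the following is a simplified sketch in plain Python. It is not the implementation used by CdaIndividualFactory: the function name `sketch_individual`, the lookup tables, and the choice to express age as a plain day count are all assumptions made for illustration, and the package formats the age duration differently.

```python
import math

# Hypothetical lookup tables for this sketch; the real package may handle more values.
SEX_MAP = {"female": "FEMALE", "male": "MALE"}
VITAL_STATUS_MAP = {"Alive": "ALIVE", "Dead": "DECEASED"}
TAXONOMY_MAP = {"Homo sapiens": {"id": "NCBITaxon:9606", "label": "Homo sapiens"}}


def is_missing(value):
    """Return True for None or NaN, the usual markers of missing data in a row."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def sketch_individual(row):
    """Map a dict-like CDA subject row to a dict shaped like a GA4GH Individual."""
    individual = {"id": row["subject_id"]}

    days_to_birth = row.get("days_to_birth")
    if not is_missing(days_to_birth):
        # days_to_birth is negative in the CDA table, so use its absolute value.
        # Assumption: the age is written as a day-count ISO 8601 duration.
        duration = f"P{abs(int(days_to_birth))}D"
        individual["timeAtLastEncounter"] = {"age": {"iso8601duration": duration}}

    # Unrecognized or missing values fall back to the GA4GH "unknown" enum members.
    individual["vitalStatus"] = {
        "status": VITAL_STATUS_MAP.get(row.get("vital_status"), "UNKNOWN_STATUS")
    }
    individual["sex"] = SEX_MAP.get(row.get("sex"), "UNKNOWN_SEX")

    taxonomy = TAXONOMY_MAP.get(row.get("species"))
    if taxonomy is not None:
        individual["taxonomy"] = taxonomy
    return individual


# Values taken from the first row of subject_df shown above.
example_row = {
    "subject_id": "CGCI.HTMCP-03-06-02074",
    "species": "Homo sapiens",
    "sex": "female",
    "days_to_birth": -23305.0,
    "vital_status": "Alive",
}
example_individual = sketch_individual(example_row)
print(example_individual)
```

Because the function only uses `row[...]` and `row.get(...)`, it accepts either a plain dictionary or a pandas row from `subject_df.iterrows()`.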

In [6]:
from google.protobuf.json_format import MessageToJson
from pprint import pprint
json_string = MessageToJson(ga4gh_individuals[0])
pprint(json_string)

('{\n'
 '  "id": "CGCI.HTMCP-03-06-02074",\n'
 '  "timeAtLastEncounter": {\n'
 '    "age": {\n'
 '      "iso8601duration": "P63Y24M1W"\n'
 '    }\n'
 '  },\n'
 '  "vitalStatus": {\n'
 '    "status": "ALIVE"\n'
 '  },\n'
 '  "sex": "FEMALE",\n'
 '  "taxonomy": {\n'
 '    "id": "NCBITaxon:9606",\n'
 '    "label": "Homo sapiens"\n'
 '  }\n'
 '}')
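The output above shows quotes and `\n` escapes because `pprint` was given a string and printed its representation. If the goal is to read or query the message, one option is to parse the JSON into a dictionary with the standard `json` module. The sketch below assumes the JSON text is the one printed above and copies it into a literal so that it runs on its own; in the notebook, `json_string` from the previous cell could be used directly.

```python
import json

# JSON text copied from the output above (in the notebook, reuse json_string instead).
json_string = """{
  "id": "CGCI.HTMCP-03-06-02074",
  "timeAtLastEncounter": {
    "age": {
      "iso8601duration": "P63Y24M1W"
    }
  },
  "vitalStatus": {
    "status": "ALIVE"
  },
  "sex": "FEMALE",
  "taxonomy": {
    "id": "NCBITaxon:9606",
    "label": "Homo sapiens"
  }
}"""

# Parse the JSON text into nested dictionaries.
individual = json.loads(json_string)

# Individual fields can now be accessed by key.
print(individual["id"])
print(individual["vitalStatus"]["status"])
print(individual["taxonomy"]["label"])

# Re-serialize with indentation for a readable display without escape characters.
print(json.dumps(individual, indent=2))
```

Calling `print(json_string)` directly also displays the message without the escapes, and `google.protobuf.json_format.MessageToDict` returns a comparable dictionary straight from the protobuf message.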
