# 🔬Transforming Real-World Genomic Data from FHIR IG to cBioPortal for Precision Oncology


# 🔬Transforming Genomic FHIR Data to cBioPortal-Compatible MAF Format
## For Precision Oncology and Evidence Generation

**Description:**
This pipeline demonstrates how clinical genomics data structured according to the HL7 FHIR Clinical Genomics Implementation Guide (IG) can be transformed into formats (MAF, MDAR, meta files) compatible with cBioPortal. Uploading the data to cBioPortal supports visualization and exploratory cohort analysis for precision oncology.

---

## 📌 Section 1: Vision & Goals
- Bring real-world HL7 FHIR clinical genomics data, structured according to the Genomics IG, into research.
- Support clinical decision-making with accessible molecular data.
- Enable visualization and cohort exploration via cBioPortal.

---
## 🧩 Section 2: Data Extraction
### 🔌 Export CSV using internal tool based on FhirExtinguisher
FhirExtinguisher is a lightweight web server application that facilitates data extraction from a FHIR server. It acts as an intermediary: it receives query requests via HTTP and forwards them to a configured FHIR server. This gives a controlled and potentially preprocessed interface for querying clinical and biomedical data.

Our internal application sends automated queries, together with the FHIRPath expressions that select the clinical data, to FhirExtinguisher (for example `GET /Patient/1234/$everything`). FhirExtinguisher forwards each request to the target FHIR server and extracts the requested information. Our tool then uses the extracted patient data to generate five CSV files.

More details: https://github.com/JohannesOehm/FhirExtinguisher
### 🔌 Other options
We also looked at other options for extracting data from a FHIR server:
- `Direct connection to FHIR server via FHIR client libraries`: FHIR-PYrate, fhirclient.
  More information about FHIR-PYrate: https://github.com/UMEssen/FHIR-PYrate
  More information about fhirclient: https://pypi.org/project/fhirclient/
- `FHIR JSON via REST API`: to extract properties from the nested JSON, we used the "code" and "system" elements.
```python
import json

# Load a FHIR Bundle that was previously saved from the REST API
with open('data/raw/genomic_bundle.json') as f:
    bundle = json.load(f)
```
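
As a rough sketch of the JSON route, the snippet below walks a Bundle and collects component values keyed by their `system` and `code`. The inline `sample_bundle`, the helper names `component_value` and `extract_coded_values`, and the assumption that the variant details sit in `Observation.component` are illustrative only; real resources may carry the values elsewhere.

```python
# Hypothetical, minimal Bundle used only for illustration.
sample_bundle = {
    "resourceType": "Bundle",
    "entry": [{
        "resource": {
            "resourceType": "Observation",
            "id": "variant-1",
            "component": [
                {   # 48018-6: gene studied
                    "code": {"coding": [{"system": "http://loinc.org", "code": "48018-6"}]},
                    "valueCodeableConcept": {
                        "coding": [{"system": "http://www.genenames.org", "display": "ERBB2"}]
                    },
                },
                {   # 69547-8: genomic reference allele
                    "code": {"coding": [{"system": "http://loinc.org", "code": "69547-8"}]},
                    "valueString": "G",
                },
                {   # 81258-6: sample variant allelic frequency
                    "code": {"coding": [{"system": "http://loinc.org", "code": "81258-6"}]},
                    "valueQuantity": {"value": 0.0531},
                },
            ],
        }
    }],
}


def component_value(component):
    """Return the first supported value[x] of a component as a plain scalar."""
    if "valueString" in component:
        return component["valueString"]
    if "valueQuantity" in component:
        return component["valueQuantity"].get("value")
    if "valueCodeableConcept" in component:
        concept = component["valueCodeableConcept"]
        codings = concept.get("coding", [])
        if codings:
            return codings[0].get("display") or codings[0].get("code")
        return concept.get("text")
    return None


def extract_coded_values(bundle):
    """Collect one dict per Observation, keyed by (system, code) of each component."""
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Observation":
            continue
        row = {"observation_id": resource.get("id")}
        for component in resource.get("component", []):
            for coding in component.get("code", {}).get("coding", []):
                row[(coding.get("system"), coding.get("code"))] = component_value(component)
        rows.append(row)
    return rows


rows = extract_coded_values(sample_bundle)
```

Keying on the `(system, code)` pair rather than on the code alone avoids collisions when two code systems happen to share a code value.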
​
## 🧬 Section 3: Mapping FHIR to MAF
1. Review the MAF format specification (columns such as Hugo_Symbol, Variant_Classification, Tumor_Sample_Barcode, etc.).
2. Map the LOINC codes used in the FHIR resources (81258-6, 69547-8, etc.) to MAF columns.
3. Build a mapping table from FHIR JSON genomic mutations to MAF (cBioPortal). The example below was prepared in Notion.

<center>
    <img src="images/fhir_to_maf_mapping.png" width=1200/>
</center>
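
One way to make such a mapping table executable is a plain dictionary from LOINC code to MAF column. The sketch below is only an illustration: the three entries, the `t_AF` column name, the helper `to_maf_rows`, and the sample barcode are assumptions, and the mapping table in the figure remains the reference.

```python
# Illustrative LOINC-to-MAF mapping; not the complete mapping table.
LOINC_TO_MAF = {
    "48018-6": "Hugo_Symbol",       # gene studied
    "69547-8": "Reference_Allele",  # genomic reference allele
    "81258-6": "t_AF",              # hypothetical column name for the allelic frequency
}


def to_maf_rows(records, sample_barcode):
    """Rename LOINC-keyed values to MAF columns and attach the sample barcode.

    Values missing from a record are filled with the '.' placeholder.
    """
    maf_rows = []
    for record in records:
        row = {maf_col: record.get(loinc, ".") for loinc, maf_col in LOINC_TO_MAF.items()}
        row["Tumor_Sample_Barcode"] = sample_barcode
        maf_rows.append(row)
    return maf_rows


# Hypothetical input: one variant with values keyed by LOINC code.
records = [{"48018-6": "ERBB2", "69547-8": "G"}]
maf_rows = to_maf_rows(records, "SAMPLE-001")
```

Keeping the mapping in one dictionary means a new LOINC code only needs a new entry, not a change to the transformation logic.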

## 🧮 Section 4: Data Transformation Pipeline from FHIR to cBioPortal
**Disclaimer:**
The code below outlines how to transform the CSV files (originally FHIR data) into formats (MAF, MDAR, meta files) compatible with importing the study into cBioPortal. It is meant as inspiration and as a starting point for your own implementation.

---

In [990]:
import numpy as np
import os
import pandas as pd
from datetime import datetime
from dateutil import parser
from dateutil.relativedelta import relativedelta


In [991]:
longitudinal = pd.read_csv('data_source/longitudinal_test.csv')
meta = pd.read_csv('data_source/meta_test.csv')
molecular = pd.read_csv('data_source/molecular_test.csv')
reasons = pd.read_csv('data_source/reasons_test.csv', delimiter=",")
therapy = pd.read_csv('data_source/therapy_test.csv', delimiter=",")

tables = [longitudinal, meta, molecular, reasons, therapy]

### 🔌 Preprocessing of meta file

In [992]:
# Function to standardize various date formats to 'YYYY-MM-DD'
def standardize_date_format(date):
    if pd.isnull(date):
        return pd.NaT
    try:
        # Convert string and parse safely
        date_parsed = parser.parse(str(date), default=pd.Timestamp('2020-01-01'))
        return date_parsed.date()
    except Exception:
        return pd.NaT

meta['DiseaseStatusDate'] = meta['DiseaseStatusDate'].apply(standardize_date_format)
meta['DiseaseStatusDate'] = pd.to_datetime(meta['DiseaseStatusDate'], errors='coerce')
meta.head()

Unnamed: 0,PatientId,Birthdate,Gender,Indication,ICD10Code,DiagnosisName,Subtyp,DiseaseStatusDate,UICC,T,N,M,DiseaseStatus,ECOG,HER2,EBV,HPV,PDL1,PR,ER,MSI,HasGermlineMutation,Organization,HRD,TMB,TumorTissue,MTB_date
0,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,1966,male,Patienten mit seltenen Tumorerkrankungen,C78.7,Secondary malignant neoplasm of liver and intr...,,2022-04-01,"okkult""",T2a,N1,M0,Erstdiagnose,ECOG 1: (Einschränkung bei körperlicher Anstre...,,,,,,,,,klinikum,,,,
1,523f7133-1710-466e-8e2c-7d1ea78304c3,1939,female,Verdacht auf Hereditäre Tumorerkrankung,C56.2,Malignant neoplasm of ovary,Epithelialer Tumor,2021-03-26,III,T3,N0,M0,Progrediente Erkrankung,,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,No,klinikum,,low,0.8,
2,2b4aeb7c-c9e8-4c7f-9379-1f96f1b5d5cb,1980,female,Patient ohne weitere leitliniengerechte Therap...,C20,Malignant neoplasm of rectum,Karzinom,2021-12-01,IV,TX,Nx,M1,Stable Disease,ECOG 1: (Einschränkung bei körperlicher Anstre...,,,,,,,,Yes,klinikum,,,,
3,3aa1c7fb-90b5-43e2-bc3f-68ed471af187,1955,female,Junges Erkrankungsalter bezogen auf Tumorentität,C15.9,"Malignant neoplasm: Oesophagus, unspecified",Adenokarzinom,2022-02-01,IVA,T3,N0,M1,Progrediente Erkrankung,ECOG 0 (Normale,,,,,,,,No,klinikum,,,,
4,ddf29229-8b1c-48df-bdd9-d97c6896d365,1941,male,Patient ohne weitere leitliniengerechte Therap...,C16.0,Malignant neoplasm: Cardia,Karzinom,2013-10-31,,T3,N1,M0,Progrediente Erkrankung,ECOG 0 (Normale,,,,,,,,Yes,klinikum,,,,


In [993]:
meta['MTB_date'] = pd.to_datetime(meta['MTB_date'], errors='coerce')
meta['MTB_date'] = meta['MTB_date'].apply(lambda x: x.strftime('%Y-%m-%d') if pd.notnull(x) else x)

In [994]:
meta['MTB_date'] = pd.to_datetime(meta['MTB_date'], errors='coerce')
meta['DiseaseStatusDate'] = pd.to_datetime(meta['DiseaseStatusDate'], errors='coerce')
meta['MTB_days_after_diagnosis'] = (meta['MTB_date'] - meta['DiseaseStatusDate']).dt.days

In [995]:
meta['MTB_days_after_diagnosis'] = meta['MTB_days_after_diagnosis'].fillna(0)

In [996]:
meta['MTB_days_after_diagnosis'] = meta['MTB_days_after_diagnosis'].astype(int)

In [997]:
meta.dropna(subset=['DiseaseStatusDate'], inplace=True)

In [998]:
molecular.head()

Unnamed: 0,id,PatientId,Gene,Location,Refgene,Type,DNAChangeType,c_HGVS,p_HGVS,TranscriptID,AllelicFrequency,AllelicReadDepth,Effect,DriverMutation,RefAllele,OldAllele,VariantExactStart,VariantExactEnd,lastUpdated
0,43ab6b8a-bd05-4a77-ac44-7500574cfcb0,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,ABCC10,chr19,GRCh37,Somatic,SNV;missense_variant,weUDw0sg,RgYCG1K,NM_000400.4,0.0531,undefined,Ambiguous,Benign,G,T,45867689.0,45867689.0,2024-11-06T12:25:31.781+00:00
1,fb50560d-a799-43e3-8d81-288b1a801cc1,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,ABCB5,chr2,GRCh37,Somatic,SNV;missense_variant,wVCjbtcT9,uiBlad3,NM_003743.5,0.0543,undefined,Ambiguous,Benign,T,G,24951278.0,24951278.0,2024-11-06T12:25:31.781+00:00
2,25f8d9fe-ea35-42a9-b4e9-6e01eade27d1,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,ABHD16A,chr5,GRCh37,Somatic,SNV;upstream_gene_variant,2qRpfxf8E,vZa7h64,NM_198253.3,0.1538,undefined,Activating,Pathogenic,G,A,1295228.0,1295228.0,2024-11-06T12:25:31.781+00:00
3,ad39a8bc-2ae9-492e-b549-f7b0f6de1ee5,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,AASDHPPT,chr6,GRCh37,Somatic,SNV;missense_variant,Lacx4w2ut,6ZQU8yHw,NM_001374820.1,0.0677,undefined,Ambiguous,Benign,A,C,157527785.0,157527785.0,2024-11-06T12:25:31.781+00:00
4,c429ccd3-61a7-4212-807a-2b76e35d8bfa,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,ABCF3,chr13,GRCh37,Somatic,SNV;missense_variant,vK0s7Hn4,U7ttqmY,NM_006437.4,0.0866,undefined,Ambiguous,Benign,A,C,25074489.0,25074489.0,2024-11-06T12:25:31.781+00:00


### DATA_PATIENT file structure

In [999]:
data_patient = pd.DataFrame(columns=['PATIENT_ID',
                                     'AGE',
                                     'GENDER',
                                     'INDICATION',
                                     'DIAGNOSIS_DATE',
                                     'ICD_10',
                                     'OS_MONTHS',
                                     'OS_STATUS',
                                     'DSS_STATUS',
                                     'UICC'
                           ])

In [1000]:
def calculate_age(birth_date):
    if pd.notnull(birth_date):
        # Birthdate may be a bare year (e.g. 1966). Parse it explicitly as a year;
        # pd.to_datetime(1966) would read the integer as nanoseconds since 1970-01-01.
        if isinstance(birth_date, (int, float, np.integer, np.floating)):
            birth_date = pd.to_datetime(str(int(birth_date)), format='%Y')
        else:
            birth_date = pd.to_datetime(birth_date)
        today = datetime.today()
        return today.year - birth_date.year - ((today.month, today.day) < (birth_date.month, birth_date.day))
    return None

def calculate_os_months(onset_date, reference_date):
    if pd.notnull(onset_date) and pd.notnull(reference_date):
        onset_date = pd.to_datetime(onset_date)
        reference_date = pd.to_datetime(reference_date)
        delta = reference_date - onset_date
        # Calculate months based on average 30 days per month
        return round(delta.days / 30, 0)
    return None

def process_patient_data(patient_data):
    # Fill the module-level data_patient frame from the preprocessed meta table
    data_patient['AGE'] = patient_data['Birthdate'].apply(calculate_age)
    data_patient['PATIENT_ID'] = patient_data['PatientId']
    data_patient['GENDER'] = patient_data['Gender']
    data_patient['INDICATION'] = patient_data['Indication']
    data_patient['DIAGNOSIS_DATE'] = patient_data['DiseaseStatusDate']
    data_patient['OS_MONTHS'] = patient_data['DiseaseStatusDate'].apply(lambda x: calculate_os_months(x, datetime.today()))
    data_patient['OS_STATUS'] = '0:LIVING'
    data_patient['ICD_10'] = patient_data['ICD10Code']
    data_patient['DSS_STATUS'] = patient_data['DiseaseStatus']
    data_patient['UICC'] = patient_data['UICC']
    return data_patient
data_patient = process_patient_data(meta)
data_patient['DFS_MONTHS'] = data_patient['OS_MONTHS']
data_patient['DFS_STATUS'] = '1:Recurred/Progressed'
data_patient.head()

Unnamed: 0,PATIENT_ID,AGE,GENDER,INDICATION,DIAGNOSIS_DATE,ICD_10,OS_MONTHS,OS_STATUS,DSS_STATUS,UICC,DFS_MONTHS,DFS_STATUS
0,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,59,male,Patienten mit seltenen Tumorerkrankungen,2022-04-01,C78.7,38.0,0:LIVING,Erstdiagnose,"okkult""",38.0,1:Recurred/Progressed
1,523f7133-1710-466e-8e2c-7d1ea78304c3,86,female,Verdacht auf Hereditäre Tumorerkrankung,2021-03-26,C56.2,51.0,0:LIVING,Progrediente Erkrankung,III,51.0,1:Recurred/Progressed
2,2b4aeb7c-c9e8-4c7f-9379-1f96f1b5d5cb,45,female,Patient ohne weitere leitliniengerechte Therap...,2021-12-01,C20,42.0,0:LIVING,Stable Disease,IV,42.0,1:Recurred/Progressed
3,3aa1c7fb-90b5-43e2-bc3f-68ed471af187,70,female,Junges Erkrankungsalter bezogen auf Tumorentität,2022-02-01,C15.9,40.0,0:LIVING,Progrediente Erkrankung,IVA,40.0,1:Recurred/Progressed
4,ddf29229-8b1c-48df-bdd9-d97c6896d365,84,male,Patient ohne weitere leitliniengerechte Therap...,2013-10-31,C16.0,141.0,0:LIVING,Progrediente Erkrankung,,141.0,1:Recurred/Progressed


In [1001]:
data_patient['OS_MONTHS']=data_patient['OS_MONTHS'].astype(int)
data_patient['DFS_MONTHS']=data_patient['DFS_MONTHS'].astype(int)
data_patient = data_patient.replace('"', '', regex=True)
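
As far as we understand the cBioPortal clinical file format, a clinical data file is tab-separated and starts with four `#`-prefixed metadata rows (display names, descriptions, datatypes, priorities) followed by the attribute-name row. The sketch below writes that layout; the attribute metadata in `PATIENT_ATTRIBUTES`, the helper `write_clinical_file`, and the example row are hypothetical.

```python
import io

# Hypothetical attribute metadata: (column, display name, description, datatype, priority)
PATIENT_ATTRIBUTES = [
    ("PATIENT_ID", "Patient ID", "Patient identifier", "STRING", "1"),
    ("AGE", "Age", "Age in years", "NUMBER", "1"),
    ("OS_STATUS", "Overall Survival Status", "Overall patient survival status", "STRING", "1"),
    ("OS_MONTHS", "Overall Survival (Months)", "Months since diagnosis", "NUMBER", "1"),
]


def write_clinical_file(rows, attributes, handle):
    """Write four '#' metadata rows, the column-name row, then tab-separated data rows."""
    columns = [attribute[0] for attribute in attributes]
    # Positions 1..4 hold display name, description, datatype, and priority.
    for position in (1, 2, 3, 4):
        handle.write("#" + "\t".join(attribute[position] for attribute in attributes) + "\n")
    handle.write("\t".join(columns) + "\n")
    for row in rows:
        handle.write("\t".join(str(row.get(column, "")) for column in columns) + "\n")


buffer = io.StringIO()
example_rows = [{"PATIENT_ID": "P-001", "AGE": 59, "OS_STATUS": "0:LIVING", "OS_MONTHS": 38}]
write_clinical_file(example_rows, PATIENT_ATTRIBUTES, buffer)
clinical_text = buffer.getvalue()
```

With a DataFrame such as `data_patient`, the rows could be passed as `data_patient.to_dict('records')`, and `handle` could be a file opened in the study folder.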

### DATA_SAMPLE file structure

In [1002]:
data_sample =pd.DataFrame(columns=['SAMPLE_ID',
                                   'PATIENT_ID',
                                   'CANCER_TYPE_DETAILED',
                                   'TISSUE_SOURCE_SITE',
                                   'TUMOR_TISSUE',
                                   'TUMOR_MOLECULAR_SUBTYPE',
                                   'TMB','HRD','T','N','M','ECOG','HER2',
                                   'EBV','HPV','PDL1','PR','ER','MSI',
                                   'GERMLINE_MUTATION'
                                   
                           ])

In [1003]:
import hashlib
def generate_sample_id(patient_id):
    # Create a hash object and hash the Patient_id after converting it to a string
    short_hash = hashlib.md5(str(patient_id).encode()).hexdigest()[:6]  # Taking first 6 characters for a short hash
    return f"MOLIT-{short_hash}"

def process_data_sample(sample_data):
    data_sample['SAMPLE_ID'] = meta['PatientId'].apply(generate_sample_id)
    data_sample['PATIENT_ID'] = meta['PatientId']
    data_sample['CANCER_TYPE_DETAILED'] = meta['DiagnosisName']
    data_sample['TUMOR_MOLECULAR_SUBTYPE'] = meta['Subtyp']
    data_sample['TISSUE_SOURCE_SITE'] =meta['Organization']
    data_sample['TUMOR_TISSUE'] = meta['TumorTissue']
    data_sample['TMB'] = meta['TMB']
    data_sample['HRD'] =meta['HRD']
    data_sample['T']=meta['T']
    data_sample['N']=meta['N']
    data_sample['M']=meta['M']
    data_sample['ECOG'] = meta['ECOG']
    data_sample['HER2'] =meta['HER2']
    data_sample['EBV']=meta['EBV']
    data_sample['HPV']=meta['HPV']
    data_sample['PDL1']=meta['PDL1']
    data_sample['PR']=meta['PR']
    data_sample['ER']=meta['ER']
    data_sample['MSI']=meta['MSI']
    data_sample['GERMLINE_MUTATION'] = meta['HasGermlineMutation']
    return data_sample
data_sample = process_data_sample(meta)
data_sample.head()

Unnamed: 0,SAMPLE_ID,PATIENT_ID,CANCER_TYPE_DETAILED,TISSUE_SOURCE_SITE,TUMOR_TISSUE,TUMOR_MOLECULAR_SUBTYPE,TMB,HRD,T,N,M,ECOG,HER2,EBV,HPV,PDL1,PR,ER,MSI,GERMLINE_MUTATION
0,MOLIT-5679cb,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,Secondary malignant neoplasm of liver and intr...,klinikum,,,,,T2a,N1,M0,ECOG 1: (Einschränkung bei körperlicher Anstre...,,,,,,,,
1,MOLIT-e19430,523f7133-1710-466e-8e2c-7d1ea78304c3,Malignant neoplasm of ovary,klinikum,0.8,Epithelialer Tumor,low,,T3,N0,M0,,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,No
2,MOLIT-8698a2,2b4aeb7c-c9e8-4c7f-9379-1f96f1b5d5cb,Malignant neoplasm of rectum,klinikum,,Karzinom,,,TX,Nx,M1,ECOG 1: (Einschränkung bei körperlicher Anstre...,,,,,,,,Yes
3,MOLIT-41ec0e,3aa1c7fb-90b5-43e2-bc3f-68ed471af187,"Malignant neoplasm: Oesophagus, unspecified",klinikum,,Adenokarzinom,,,T3,N0,M1,ECOG 0 (Normale,,,,,,,,No
4,MOLIT-301348,ddf29229-8b1c-48df-bdd9-d97c6896d365,Malignant neoplasm: Cardia,klinikum,,Karzinom,,,T3,N1,M0,ECOG 0 (Normale,,,,,,,,Yes


### DATA_TIMELINE_STATUS file structure

In [1004]:
data_status= pd.DataFrame(columns=['PATIENT_ID',
                                   'EVENT_TYPE',
                                   'EVENT_DETAILED',
                                   'TREATMENT_TYPE',
                                   'START_DATE',
                                   'STOP_DATE',
                                   'THERAPY_ONGOING',
                                   'STATUS'
                                   
                           ])

### 1. Make statuses uniform (for meta and longitudinal):
- `Progrediente Erkrankung` → Progressive Disease
- `Erstdiagnose` → Initial Diagnosis
- `Teilremission` → Partial Remission
- `Vollremission` → Complete Remission
- `Lokoregionaeres Rezidiv` → Locally Recurrent Malignant Neoplasm
- `Neue Fernmetastasierung` → Site of New Distant Metastasis Tumor Event

In [1005]:
replace_map = {
    'Progrediente Erkrankung': 'Progressive Disease',
    'Erstdiagnose': 'Initial Diagnosis',
    'Teilremission': 'Partial Remission',
    'Vollremission': 'Complete Remission',
    'Lokoregionaeres Rezidiv': 'Locally Recurrent Malignant Neoplasm',
    'Neue Fernmetastasierung': 'Site of New Distant Metastasis Tumor Event'
}

longitudinal['Notes'] = longitudinal['Notes'].replace(replace_map)

In [1006]:
meta['DiseaseStatus'] = meta['DiseaseStatus'].replace(replace_map)

### 2. Set 'TherapyEnd' equal to 'TherapyStart' for rows where 'TherapyTypeProcedure' is 'Tumor Resection'

In [1007]:
longitudinal.loc[longitudinal['TherapyTypeProcedure'] == 'Tumor Resection', 'TherapyEnd'] = longitudinal['TherapyStart']

### 3. Impute missing values in 'TherapyTypeProcedure' with the value "STATUS"

In [1008]:
longitudinal['TherapyTypeProcedure'] = longitudinal['TherapyTypeProcedure'].fillna("STATUS")

### 4. Convert date formats

### 🔌 Generating 'meta' and 'TXT' files

In [1009]:
current_dir = os.getcwd()

folder_path = os.path.join(current_dir, 'MOLIT_data_project')
# Build the nested path from folder_path so the separator stays platform-independent
folder_path_ = os.path.join(folder_path, 'case_lists')
# Create the folder if it doesn't already exist
os.makedirs(folder_path_, exist_ok=True)
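
cBioPortal meta files are plain text files with one `key: value` pair per line. A minimal sketch of writing a `meta_study.txt` could look like the following; the helper `write_meta_file`, the temporary directory standing in for the study folder, and every field value are placeholders, so check the field list against the cBioPortal file format documentation before use.

```python
import os
import tempfile


def write_meta_file(path, fields):
    """Write one 'key: value' line per entry, in insertion order."""
    with open(path, "w", encoding="utf-8") as handle:
        for key, value in fields.items():
            handle.write(f"{key}: {value}\n")


# All values below are placeholders for illustration.
meta_study = {
    "type_of_cancer": "mixed",
    "cancer_study_identifier": "molit_example_study",
    "name": "Example MTB study",
    "description": "Clinical genomics data exported from FHIR",
    "add_global_case_list": "true",
}

study_dir = tempfile.mkdtemp()  # stand-in for the MOLIT_data_project folder
meta_path = os.path.join(study_dir, "meta_study.txt")
write_meta_file(meta_path, meta_study)
```

The same helper can be reused for the other meta files (for example the clinical or mutation meta files), since they share the `key: value` layout and differ only in their fields.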

In [1010]:
# Function to standardize TherapyStart and TherapyEnd formats to 'YYYY-MM-DD'
def standardize_therapy_date(date):
    if pd.isnull(date):
        return pd.NaT
    date_str = str(date)
    try:
        # Parse date string with dateutil, fill missing parts with '2020-01-01'
        parsed_date = parser.parse(date_str, default=pd.Timestamp('2020-01-01'))
        return parsed_date.date()
    except Exception:
        return pd.NaT

# Apply standardization to TherapyStart and TherapyEnd columns
longitudinal['TherapyStart'] = longitudinal['TherapyStart'].apply(standardize_therapy_date)
longitudinal['TherapyEnd'] = longitudinal['TherapyEnd'].apply(standardize_therapy_date)

# Map PatientId to DiseaseStatusDate for quick lookup
disease_status_dates = meta.set_index('PatientId')['DiseaseStatusDate'].to_dict()

# Function to calculate the difference in months between two dates
def months_diff(date1, date2):
    if pd.isnull(date1) or pd.isnull(date2):
        return None
    rdelta = relativedelta(date1, date2)
    return rdelta.years * 12 + rdelta.months

# Calculate months from DiseaseStatusDate for TherapyStart and TherapyEnd
def calculate_months_from_disease_start(row):
    patient_id = row['PatientId']
    disease_start = disease_status_dates.get(patient_id)
    if pd.isnull(disease_start):
        return None, None
    
    start_months = months_diff(row['TherapyStart'], disease_start)
    end_months = months_diff(row['TherapyEnd'], disease_start)

    return start_months, end_months

# Apply the calculation across the dataframe and create new columns
longitudinal[['START_MONTH', 'END_MONTH']] = longitudinal.apply(
    lambda row: pd.Series(calculate_months_from_disease_start(row)), axis=1
)

# Convert months to days approximately
longitudinal['START_DAYS'] = longitudinal['START_MONTH'] * 30.44
longitudinal['END_DAYS'] = longitudinal['END_MONTH'] * 30.44
longitudinal.head()

Unnamed: 0,id,PatientId,TherapyTypeProcedure,TherapyStart,TherapyEnd,Notes,START_MONTH,END_MONTH,START_DAYS,END_DAYS
0,dd1c0cad-8c86-4364-90c6-4205cdf68ed0,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,Systemic therapy,2022-05-01,2022-06-01,Bevacizumab,1.0,2.0,30.44,60.88
1,a781a84d-06dd-41a1-8936-6d879e28b124,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,Systemic therapy,2022-05-01,2022-06-01,Atezolizumab,1.0,2.0,30.44,60.88
2,097781a2-9446-4acc-be7a-3139a82ebcec,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,Tumor Resection,2022-06-01,2022-06-01,cTACE,2.0,2.0,60.88,60.88
3,d3c0fd02-cf88-47cd-97c2-2988cb392b0e,523f7133-1710-466e-8e2c-7d1ea78304c3,STATUS,2022-09-01,2018-04-12,Initial Diagnosis,17.0,-35.0,517.48,-1065.4
4,4c6cad06-698a-4a43-9620-c73707a7e8dd,523f7133-1710-466e-8e2c-7d1ea78304c3,Tumor Resection,2022-09-01,2022-09-01,Probenentnahme durch Punktion einer großen Leb...,17.0,17.0,517.48,517.48


In [1011]:
longitudinal['START_DAYS'] = longitudinal['START_DAYS'].round()
longitudinal['END_DAYS'] = longitudinal['END_DAYS'].round()

In [1012]:
longitudinal['START_DAYS'] = longitudinal['START_DAYS'].fillna(0)
longitudinal['END_DAYS'] = longitudinal['END_DAYS'].fillna(0)

In [1013]:
longitudinal['START_DAYS'] = longitudinal['START_DAYS'].astype(int)
longitudinal['END_DAYS'] = longitudinal['END_DAYS'].astype(int)
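
The month-based conversion above is an approximation. When both dates are known to the day, the offset can also be computed directly as a day difference. The helper `days_since` below is a hypothetical alternative, not part of the pipeline, and it assumes plain `datetime.date` values.

```python
from datetime import date


def days_since(event_date, anchor_date):
    """Return the signed number of days from anchor_date to event_date.

    Returns None when either date is missing, so missing values stay
    distinguishable from a genuine offset of 0 days.
    """
    if event_date is None or anchor_date is None:
        return None
    return (event_date - anchor_date).days


diagnosis = date(2022, 4, 1)
start_days = days_since(date(2022, 5, 1), diagnosis)
stop_days = days_since(date(2022, 6, 1), diagnosis)
missing_days = days_since(None, diagnosis)
```

A negative result means the event is dated before the anchor date, which makes inconsistent source records easy to spot.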

In [1014]:
def therapy_status(end_days):
    return 'NO' if end_days != 0 else 'YES'


conditions = [
    longitudinal['TherapyTypeProcedure'] == "Systemic therapy",
    longitudinal['TherapyTypeProcedure'] == "Tumor Resection",
    longitudinal['TherapyTypeProcedure'] == "STATUS",
    longitudinal['TherapyTypeProcedure'] =='Radiation Therapy'
]


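choices = ['Treatment', 'Surgery', 'Status', 'Treatment']
# Finer-grained labels; defined for reference but not applied below,
# so TREATMENT_TYPE currently mirrors EVENT_TYPE.
choices_2 = ["Medical Therapy", "Tumor Resection", ".", "Radiation Therapy"]
# Create the EVENT_TYPE and TREATMENT_TYPE columns based on the conditions and choices.
# The default is the string 'nan' because recent NumPy versions can reject
# mixing string choices with a float np.nan default.
longitudinal['EVENT_TYPE'] = np.select(conditions, choices, default='nan')
longitudinal['TREATMENT_TYPE'] = np.select(conditions, choices, default='nan')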

longitudinal['STATUS'] = np.where(
    longitudinal['TherapyTypeProcedure'] == "STATUS",
    longitudinal['Notes'],   
    "."                     
)

def process_timeline_status_data(longitudinal):
    data_status['PATIENT_ID'] = longitudinal['PatientId']
    data_status['EVENT_TYPE']= longitudinal['EVENT_TYPE']
    data_status['EVENT_DETAILED'] = longitudinal['Notes']
    data_status['TREATMENT_TYPE'] = longitudinal['TREATMENT_TYPE']
    data_status['STATUS']=longitudinal['STATUS']
    data_status['START_DATE'] = longitudinal['START_DAYS']                       
    data_status['STOP_DATE']= longitudinal['END_DAYS']                        
    data_status['THERAPY_ONGOING'] = longitudinal['END_DAYS'].apply(therapy_status)                                                                                  
   
    return data_status
data_status = process_timeline_status_data(longitudinal)
data_status.head()

Unnamed: 0,PATIENT_ID,EVENT_TYPE,EVENT_DETAILED,TREATMENT_TYPE,START_DATE,STOP_DATE,THERAPY_ONGOING,STATUS
0,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,Treatment,Bevacizumab,Treatment,30,61,NO,.
1,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,Treatment,Atezolizumab,Treatment,30,61,NO,.
2,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,Surgery,cTACE,Surgery,61,61,NO,.
3,523f7133-1710-466e-8e2c-7d1ea78304c3,Status,Initial Diagnosis,Status,517,-1065,NO,Initial Diagnosis
4,523f7133-1710-466e-8e2c-7d1ea78304c3,Surgery,Probenentnahme durch Punktion einer großen Leb...,Surgery,517,517,NO,.


In [1015]:
new_order = ['PATIENT_ID', 'START_DATE', 'STOP_DATE','EVENT_TYPE','TREATMENT_TYPE','EVENT_DETAILED','STATUS','THERAPY_ONGOING']
data_status = data_status[new_order]

In [1016]:
negative_end_date_rows = longitudinal[longitudinal['END_DAYS'] < 0]

In [1017]:
negative_end_date_rows['PatientId'].unique()

array(['523f7133-1710-466e-8e2c-7d1ea78304c3 '], dtype=object)

### DATA_MUTATIONS file structure

In [1018]:
data_mutations = molecular.copy()

In [1019]:
data_mutations=data_mutations.rename(columns={"PatientId": "PATIENT_ID", 
                               "Gene": "Hugo_Symbol",
                              "Location": "Chromosome",
                              "Type": "Mutation_status",
                              "TranscriptID": "Transcript_ID",
                              "VariantExactStart":"Start_Position",
                              "VariantExactEnd":"End_Position",
                              "RefAllele":"Reference_Allele"})

In [1020]:
# Split the 'DNAChangeType' column into 'VARIANT_CLASS' and 'Consequence' columns.
# n=1 splits on the first ';' only, so the result always has exactly two columns.
data_mutations[['VARIANT_CLASS', 'Consequence']] = data_mutations['DNAChangeType'].str.split(';', n=1, expand=True)

In [1021]:
data_mutations['Start_Position'] = data_mutations['Start_Position'].fillna(0)
data_mutations['End_Position'] = data_mutations['End_Position'].fillna(0)

In [1022]:
data_mutations['Start_Position'] = data_mutations['Start_Position'].astype(int)

In [1023]:
data_mutations = data_mutations.fillna(".")
data_mutations = data_mutations.replace("undefined", ".")
data_mutations = data_mutations.drop(['DNAChangeType', 'id'], axis=1)
data_mutations['HGVSp_Short'] = data_mutations['p_HGVS'].str.extract(r'(p\.[A-Za-z]+\d+[A-Za-z]+)')

In [1024]:
data_mutations['Start_Position'] = data_mutations['Start_Position'].fillna(0)
data_mutations['End_Position'] = data_mutations['End_Position'].fillna(0)

In [1025]:
data_mutations['Tumor_Sample_Barcode'] = data_mutations['PATIENT_ID'].map(data_sample.set_index('PATIENT_ID')['SAMPLE_ID'])

In [1026]:
rows_with_nan = data_mutations[data_mutations['Tumor_Sample_Barcode'].isna()]

In [1027]:
data_patient['PATIENT_ID'].nunique()

7

In [1028]:
data_mutations.dropna(subset=['Tumor_Sample_Barcode'], inplace=True)

In [1029]:
data_mutations.head()

Unnamed: 0,PATIENT_ID,Hugo_Symbol,Chromosome,Refgene,Mutation_status,c_HGVS,p_HGVS,Transcript_ID,AllelicFrequency,AllelicReadDepth,Effect,DriverMutation,Reference_Allele,OldAllele,Start_Position,End_Position,lastUpdated,VARIANT_CLASS,Consequence,HGVSp_Short,Tumor_Sample_Barcode
0,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,ABCC10,chr19,GRCh37,Somatic,weUDw0sg,RgYCG1K,NM_000400.4,0.0531,.,Ambiguous,Benign,G,T,45867689,45867689.0,2024-11-06T12:25:31.781+00:00,SNV,missense_variant,,MOLIT-5679cb
1,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,ABCB5,chr2,GRCh37,Somatic,wVCjbtcT9,uiBlad3,NM_003743.5,0.0543,.,Ambiguous,Benign,T,G,24951278,24951278.0,2024-11-06T12:25:31.781+00:00,SNV,missense_variant,,MOLIT-5679cb
2,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,ABHD16A,chr5,GRCh37,Somatic,2qRpfxf8E,vZa7h64,NM_198253.3,0.1538,.,Activating,Pathogenic,G,A,1295228,1295228.0,2024-11-06T12:25:31.781+00:00,SNV,upstream_gene_variant,,MOLIT-5679cb
3,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,AASDHPPT,chr6,GRCh37,Somatic,Lacx4w2ut,6ZQU8yHw,NM_001374820.1,0.0677,.,Ambiguous,Benign,A,C,157527785,157527785.0,2024-11-06T12:25:31.781+00:00,SNV,missense_variant,,MOLIT-5679cb
4,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,ABCF3,chr13,GRCh37,Somatic,vK0s7Hn4,U7ttqmY,NM_006437.4,0.0866,.,Ambiguous,Benign,A,C,25074489,25074489.0,2024-11-06T12:25:31.781+00:00,SNV,missense_variant,,MOLIT-5679cb


In [1030]:
# MAF columns that have no counterpart in the source data get the '.' placeholder
placeholder_columns = [
    'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2', 'Matched_Norm_Sample_Barcode',
    'Match_Norm_Seq_Allele1', 'Match_Norm_Seq_Allele2',
    'Tumor_Validation_Allele1', 'Tumor_Validation_Allele2',
    'Match_Norm_Validation_Allele1', 'Match_Norm_Validation_Allele2',
    'Verification_Status', 'Validation_Status', 'Sequencing_Phase',
    'Sequence_Source', 'Validation_Method', 'Score', 'BAM_File',
    'Sequencer', 'Tumor_Sample_UUID', 'Matched_Norm_Sample_UUID',
]
for column in placeholder_columns:
    data_mutations[column] = '.'
data_mutations['SWISSPROT'] = 'HUMAN'
data_mutations['Variant_Classification'] = '.'
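
`Variant_Classification` is left as a placeholder here. If the `Consequence` column holds Sequence Ontology terms, it could be derived with a lookup table. The mapping below is a partial, illustrative sketch that follows the convention commonly used by VCF-to-MAF converters; the names `CONSEQUENCE_TO_CLASSIFICATION` and `classify` are hypothetical, and the table would need to be extended for the terms present in your data.

```python
# Partial, illustrative mapping from Sequence Ontology consequence terms
# to MAF Variant_Classification values.
CONSEQUENCE_TO_CLASSIFICATION = {
    "missense_variant": "Missense_Mutation",
    "stop_gained": "Nonsense_Mutation",
    "synonymous_variant": "Silent",
    "splice_acceptor_variant": "Splice_Site",
    "splice_donor_variant": "Splice_Site",
    "inframe_insertion": "In_Frame_Ins",
    "inframe_deletion": "In_Frame_Del",
    "upstream_gene_variant": "5'Flank",
    "downstream_gene_variant": "3'Flank",
}


def classify(consequence, fallback="."):
    """Map a consequence term to a Variant_Classification value.

    Unknown terms and non-string inputs (e.g. NaN) return the fallback placeholder.
    """
    if not isinstance(consequence, str):
        return fallback
    return CONSEQUENCE_TO_CLASSIFICATION.get(consequence.strip(), fallback)


examples = [classify(term) for term in ("missense_variant", "upstream_gene_variant", "unknown_term", None)]
```

Applied to the table, this could be used as `data_mutations['Variant_Classification'] = data_mutations['Consequence'].apply(classify)`.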

In [1031]:
data_mutations = data_mutations.drop(columns=['PATIENT_ID'])

### Data recommendations (data_clinical_treatment_recommendation.txt) file structure

In [1032]:
data_recommendations = therapy.copy()

In [1033]:
data_recommendations.head()

Unnamed: 0,PatientId,Id,orderNumber,MedicationCode,MedicationName,DiseaseStatusCodeAtRecommendation,DiseaseStatusAtRecommendation,RecommendedUseCode,RecommendedUse,evidenceCode,evidenceLevel,reasonId
0,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,bacd886f-f223-46b9-8ba7-32497ff65f5a,1,,,C35571,Progressive Disease,C28237,Clinical Practice Guidelines,,,db0d0
1,523f7133-1710-466e-8e2c-7d1ea78304c3,bacd886f-f223-46b9-8ba7-32497ff65f5a,2,L01XK01,Olaparib,C18058,Partial Remission,C125600,Off-Label Treatment,,,efe1c
2,2b4aeb7c-c9e8-4c7f-9379-1f96f1b5d5cb,68fc147e-73b6-4f5e-b882-3b88f5be9bee,1,L01XC41,Trastuzumab deruxtecan,C35571,Progressive Disease,C125600,Off-Label Treatment,m2B,In einer anderen Tumorentität wurde der prädik...,202d1
3,2b4aeb7c-c9e8-4c7f-9379-1f96f1b5d5cb,68fc147e-73b6-4f5e-b882-3b88f5be9bee,1,L01XC14,Trastuzumab emtansin,C35571,Progressive Disease,C71104,Clinical Trial,,,1eb33
4,2b4aeb7c-c9e8-4c7f-9379-1f96f1b5d5cb,68fc147e-73b6-4f5e-b882-3b88f5be9bee,2,L01EM03,Alpelisib,C35571,Progressive Disease,C125600,Off-Label Treatment,m2B,In einer anderen Tumorentität wurde der prädik...,faa1f


In [1034]:
data_recommendations=data_recommendations.rename(columns={"PatientId": "PATIENT_ID", 
                               "MedicationName": "RECOMMENDED_MEDICATION",
                                "MedicationCode":"MEDICATION_CODE",
                                "orderNumber": "RECOMMENDATION_ORDER_NUMBER",
                                "DiseaseStatusCodeAtRecommendation": "DISEASE_STATUS_CODE",
                                "DiseaseStatusAtRecommendation": "DISEASE_STATUS_AT_RECOMMENDATION",
                                "RecommendedUse": "EVIDENCE_LEVEL",
                                "evidenceLevel": 'EVIDENCE_LEVEL_DETAILED',
                                "evidenceCode": "EVIDENCE_CODE",
                                "RecommendedUseCode": "CATEGORY",
                                 "reasonId": "REASON_ID"})

In [1035]:
data_recommendations = data_recommendations.fillna(".")
data_recommendations = data_recommendations.replace("undefined", ".")
data_recommendations = data_recommendations.drop(index=0)
data_recommendations = data_recommendations.drop(['Id'], axis=1)

In [1036]:
data_recommendations.head()

Unnamed: 0,PATIENT_ID,RECOMMENDATION_ORDER_NUMBER,MEDICATION_CODE,RECOMMENDED_MEDICATION,DISEASE_STATUS_CODE,DISEASE_STATUS_AT_RECOMMENDATION,CATEGORY,EVIDENCE_LEVEL,EVIDENCE_CODE,EVIDENCE_LEVEL_DETAILED,REASON_ID
1,523f7133-1710-466e-8e2c-7d1ea78304c3,2,L01XK01,Olaparib,C18058,Partial Remission,C125600,Off-Label Treatment,.,.,efe1c
2,2b4aeb7c-c9e8-4c7f-9379-1f96f1b5d5cb,1,L01XC41,Trastuzumab deruxtecan,C35571,Progressive Disease,C125600,Off-Label Treatment,m2B,In einer anderen Tumorentität wurde der prädik...,202d1
3,2b4aeb7c-c9e8-4c7f-9379-1f96f1b5d5cb,1,L01XC14,Trastuzumab emtansin,C35571,Progressive Disease,C71104,Clinical Trial,.,.,1eb33
4,2b4aeb7c-c9e8-4c7f-9379-1f96f1b5d5cb,2,L01EM03,Alpelisib,C35571,Progressive Disease,C125600,Off-Label Treatment,m2B,In einer anderen Tumorentität wurde der prädik...,faa1f
5,523f7133-1710-466e-8e2c-7d1ea78304c3,1,L01XC06,Cetuximab,C35571,Progressive Disease,C28237,Clinical Practice Guidelines,.,.,0efe1


### Data recommendation reasons file structure

In [1037]:
data_reasons = reasons.copy()

In [1038]:
data_reasons['comments'].unique()

array([nan])

In [1039]:
data_reasons=data_reasons.rename(columns={"MedicationName": "REASONS_MEDICATION",
                                               "GeneticReason": "REASONS_GENETICS",
                                               "therapyId" :"REASON_ID",
                                                "other": "REASONS_OTHER",
                                                 "comments": "REASONS_COMMENTS"})

In [1040]:
data_reasons = data_reasons.fillna(".")
data_reasons = data_reasons.replace("undefined", ".")
data_reasons = data_reasons.drop(['MedicationCode','reasonId'], axis=1)

In [1041]:
data_reasons.head()

Unnamed: 0,REASON_ID,REASONS_MEDICATION,REASONS_OTHER,REASONS_GENETICS,REASONS_COMMENTS
0,202d1,Trastuzumab,.,ERBB2 (SNV,.
1,1eb33,Trastuzumab,.,ERBB2 (SNV,.
2,faa1f,Alpelisib,.,PIK3CA (SNV,.
3,e1bea,Trastuzumab,.,ERBB2 (SNV,.
4,0efe1,Pembrolizumab,.,ARID1A (SNV,.


In [1042]:
data_recommendations = pd.merge(data_recommendations, data_reasons, on='REASON_ID', how='inner')

### Creating data status timeline (using recommendations and reasoning)

In [1043]:
new_rows = pd.DataFrame({
    'PATIENT_ID': data_recommendations['PATIENT_ID'],
    'EVENT_TYPE': 'Recommendation',  
    'TREATMENT_TYPE': 'Treatment',  
    'EVENT_DETAILED': data_recommendations['RECOMMENDED_MEDICATION'],
    'STATUS': '.',  # Constant value
    'THERAPY_ONGOING': 'NO',  # Constant value
    'EVIDENCE_LEVEL': data_recommendations['EVIDENCE_LEVEL'],
    'REASONS_GENETICS': data_recommendations['REASONS_GENETICS'],
    'REASONS_MEDICATION': data_recommendations['REASONS_MEDICATION'],
    'RECOMMENDATION_ORDER_NUMBER':data_recommendations['RECOMMENDATION_ORDER_NUMBER'],
    'START_DATE': '.',  
    'STOP_DATE': '.'    
})

# Merge new_rows with meta to get 'MTB_days_after_diagnosis' for each patient
new_rows = new_rows.merge(meta[['PatientId', 'MTB_days_after_diagnosis']], 
                          left_on='PATIENT_ID', 
                          right_on='PatientId', 
                          how='left')

# Use 'MTB_days_after_diagnosis' to populate 'START_DATE' and 'STOP_DATE'
new_rows['START_DATE'] = new_rows['MTB_days_after_diagnosis']
new_rows['STOP_DATE'] = new_rows['MTB_days_after_diagnosis']

# Drop the temporary 'PatientId' and 'MTB_days_after_diagnosis' columns after merging
new_rows.drop(columns=['PatientId', 'MTB_days_after_diagnosis'], inplace=True)

# Ensure the columns from new_rows that don't already exist in data_status are added
for col in new_rows.columns:
    if col not in data_status.columns:
        data_status[col] = '.' 

# Add the new rows to data_status, keeping existing records and adding the new ones
data_status = pd.concat([data_status, new_rows], ignore_index=True)

In [1044]:
data_status.head()

Unnamed: 0,PATIENT_ID,START_DATE,STOP_DATE,EVENT_TYPE,TREATMENT_TYPE,EVENT_DETAILED,STATUS,THERAPY_ONGOING,EVIDENCE_LEVEL,REASONS_GENETICS,REASONS_MEDICATION,RECOMMENDATION_ORDER_NUMBER
0,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,30,61,Treatment,Treatment,Bevacizumab,.,NO,.,.,.,.
1,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,30,61,Treatment,Treatment,Atezolizumab,.,NO,.,.,.,.
2,8c52a4b0-fae9-4cd3-a4d6-97d850a4076a,61,61,Surgery,Surgery,cTACE,.,NO,.,.,.,.
3,523f7133-1710-466e-8e2c-7d1ea78304c3,517,-1065,Status,Status,Initial Diagnosis,Initial Diagnosis,NO,.,.,.,.
4,523f7133-1710-466e-8e2c-7d1ea78304c3,517,517,Surgery,Surgery,Probenentnahme durch Punktion einer großen Leb...,.,NO,.,.,.,.


In [1045]:
#data_status.to_csv('out.csv')

In [1046]:
#data_status=data_status.drop(columns=['Unnamed: 0'])

In [1047]:
data_status['STOP_DATE'] = data_status['STOP_DATE'].fillna(0)

In [1048]:
data_status['STOP_DATE'] = data_status['STOP_DATE'].astype(int)
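
The preview above contains a row whose `STOP_DATE` (-1065) is earlier than its `START_DATE` (517), so a quick sanity check on the timeline intervals may be useful before export. The following is a minimal sketch, not part of the original pipeline: `find_invalid_intervals` is a hypothetical helper, and the demo frame only mimics the `PATIENT_ID`, `START_DATE` and `STOP_DATE` columns of `data_status`.

```python
import pandas as pd


def find_invalid_intervals(timeline):
    """Return rows whose START_DATE is not numeric or whose STOP_DATE precedes it."""
    # Placeholders such as "." become NaN instead of raising an error
    start = pd.to_numeric(timeline["START_DATE"], errors="coerce")
    stop = pd.to_numeric(timeline["STOP_DATE"], errors="coerce")
    mask = start.isna() | (stop < start)
    return timeline[mask]


# Small demo frame (assumed values, shaped like data_status)
demo_timeline = pd.DataFrame({
    "PATIENT_ID": ["p1", "p2", "p3"],
    "START_DATE": [30, 517, "."],
    "STOP_DATE": [61, -1065, 0],
})
invalid_rows = find_invalid_intervals(demo_timeline)
print(invalid_rows)
```

Rows returned by the helper are candidates for manual review rather than automatic deletion, since a negative offset can be legitimate for events that happened before diagnosis.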

In [1049]:
print("Data_status after appending new rows:\n", data_status.tail())

Data_status after appending new rows:
                                PATIENT_ID  START_DATE  STOP_DATE  \
90   3aa1c7fb-90b5-43e2-bc3f-68ed471af187           0          0   
91  ddf29229-8b1c-48df-bdd9-d97c6896d365            0          0   
92   fdfc3b53-617a-4ac1-b4d6-91d5b249bb90           0          0   
93   fdfc3b53-617a-4ac1-b4d6-91d5b249bb90           0          0   
94   fdfc3b53-617a-4ac1-b4d6-91d5b249bb90           0          0   

        EVENT_TYPE TREATMENT_TYPE          EVENT_DETAILED STATUS  \
90  Recommendation      Treatment  Trastuzumab deruxtecan      .   
91  Recommendation      Treatment                Olaparib      .   
92  Recommendation      Treatment       Gem-nabPaclitaxel      .   
93  Recommendation      Treatment                  LV5FU2      .   
94  Recommendation      Treatment              Gemcitabin      .   

   THERAPY_ONGOING       EVIDENCE_LEVEL REASONS_GENETICS REASONS_MEDICATION  \
90              NO  Off-Label Treatment       ERBB2 (SNV       T

## Generating 'meta' and 'TXT' files

In [1050]:
current_dir = os.getcwd()

# Study folder and its 'case_lists' subfolder
folder_path = os.path.join(current_dir, 'output_study')
folder_path_ = os.path.join(folder_path, 'case_lists')
# Create both folders if they don't already exist
os.makedirs(folder_path_, exist_ok=True)

### 'meta_study.txt'

In [1051]:
os.makedirs(folder_path, exist_ok=True)

file_path = os.path.join(folder_path, 'meta_study.txt')
meta_study = {
    'cancer_study_identifier:': ['data_export_study'],
    'type_of_cancer:': ['mixed'],
    'name:': ['Data export study'],
    'description:': ['Analysis export csv files'],
    'groups:': ['PUBLIC'],
    'add_global_case_list:': ['true']
}

# Write the content to the specified file path
with open(file_path, 'w') as f:
    for key, value in meta_study.items():
        f.write(f"{key}\t{value[0]}\n")

print(f"File created at: {file_path}")

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/meta_study.txt


### 'meta_clinical_patient.txt'

In [1052]:
file_path = os.path.join(folder_path, 'meta_clinical_patient.txt')
meta_clinical_patient = {
    'cancer_study_identifier:': ['data_export_study'],
    'genetic_alteration_type:': ['CLINICAL'],
    'datatype:': ['PATIENT_ATTRIBUTES'],
    'data_filename:': ['data_clinical_patient.txt'],
}

with open(file_path, 'w') as f:
    for key, value in meta_clinical_patient.items():
        f.write(f"{key}\t{value[0]}\n")
print(f"File created at: {file_path}")

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/meta_clinical_patient.txt


### 'meta_clinical_sample.txt'

In [1053]:
file_path = os.path.join(folder_path, 'meta_clinical_sample.txt')
meta_clinical_sample = {
    'cancer_study_identifier:': ['data_export_study'],
    'genetic_alteration_type:': ['CLINICAL'],
    'datatype:': ['SAMPLE_ATTRIBUTES'],
    'data_filename:': ['data_clinical_sample.txt'],
}

with open(file_path, 'w') as f:
    for key, value in meta_clinical_sample.items():
        f.write(f"{key}\t{value[0]}\n")
print(f"File created at: {file_path}")

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/meta_clinical_sample.txt


### 'meta_mutations.txt'

In [1054]:
file_path = os.path.join(folder_path, 'meta_mutations.txt')
meta_mutations = {
    'cancer_study_identifier:': ['data_export_study'],
    'genetic_alteration_type:': ['MUTATION_EXTENDED'],
    'datatype:': ['MAF'],
    'stable_id:': ['mutations'],
    'show_profile_in_analysis_tab:': ['true'],
    'profile_description:': ['Mutation data from export csv files'],
    'profile_name:': ['Mutations'],
    'data_filename:': ['data_mutations.txt'],
   
}

with open(file_path, 'w') as f:
    for key, value in meta_mutations.items():
        f.write(f"{key}\t{value[0]}\n")
print(f"File created at: {file_path}")

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/meta_mutations.txt


### 'meta_timeline_status.txt'

In [1055]:
file_path = os.path.join(folder_path, 'meta_timeline_status.txt')
meta_timeline_status = {
    'cancer_study_identifier:': ['data_export_study'],
    'genetic_alteration_type:': ['CLINICAL'],
    'datatype:': ['TIMELINE'],
    'data_filename:': ['data_timeline_status.txt']
   
}

with open(file_path, 'w') as f:
    for key, value in meta_timeline_status.items():
        f.write(f"{key}\t{value[0]}\n")
print(f"File created at: {file_path}")  

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/meta_timeline_status.txt
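
The meta file cells above all repeat the same write loop. As a minimal sketch of how that loop could be factored out, the hypothetical helper `write_meta_file` below takes plain keys (without the trailing colon) and writes one `key: value` pair per line, which is the layout shown in the cBioPortal file format documentation. The helper name and the temporary demo folder are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile


def write_meta_file(folder, filename, fields):
    """Write a meta file with one 'key: value' pair per line and return its path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "w", encoding="utf-8") as f:
        for key, value in fields.items():
            f.write(f"{key}: {value}\n")
    return path


# Demo: write a patient meta file into a temporary folder
demo_folder = tempfile.mkdtemp()
meta_path = write_meta_file(demo_folder, "meta_clinical_patient.txt", {
    "cancer_study_identifier": "data_export_study",
    "genetic_alteration_type": "CLINICAL",
    "datatype": "PATIENT_ATTRIBUTES",
    "data_filename": "data_clinical_patient.txt",
})
print(f"File created at: {meta_path}")
```

Storing scalar values instead of one-element lists removes the need for `value[0]`, and keeping the colon out of the dictionary keys makes the fields easier to reuse elsewhere.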


### Case lists

In [1056]:
samples_number = data_sample['SAMPLE_ID'].nunique()

In [1057]:
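# Use unique IDs so the list matches samples_number, which is computed with nunique()
sample_ids_str = '\t'.join(data_sample['SAMPLE_ID'].drop_duplicates().astype(str))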

### 'cases_all.txt'

In [1058]:
file_path = os.path.join(folder_path_, 'cases_all.txt')
cases_all = {
    'cancer_study_identifier:': ['data_export_study'],
    'stable_id:': ['data_export_study_all'],
    'case_list_name:': ['All samples'],
    'case_list_description:': [f"All samples ({samples_number} samples)"],
    'case_list_category:': ['all_cases_in_study'],
    'case_list_ids:': [sample_ids_str]
   
}

with open(file_path, 'w') as f:
    for key, value in cases_all.items():
        f.write(f"{key}\t{value[0]}\n")
print(f"File created at: {file_path}")  

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/case_lists/cases_all.txt


### 'cases_sequenced.txt'

In [1059]:
file_path = os.path.join(folder_path_, 'cases_sequenced.txt')
cases_sequenced= {
    'cancer_study_identifier:': ['data_export_study'],
    'stable_id:': ['data_export_study_sequenced'],
    'case_list_name:': ['Samples with mutation data'],
    'case_list_description:': [f"Samples with mutation data ({samples_number} samples)"],
    'case_list_category:': ['all_cases_with_mutation_data'],
    'case_list_ids:': [sample_ids_str]
   
}

with open(file_path, 'w') as f:
    for key, value in cases_sequenced.items():
        f.write(f"{key}\t{value[0]}\n")
print(f"File created at: {file_path}") 

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/case_lists/cases_sequenced.txt


In [1060]:
# Dictionary for cBioPortal headers for patient and sample data 
header_lines_patient = {
    "PATIENT_ID": "#Unique Patient Identifier",            
    "AGE": 'Age of the patient',                         
    "GENDER": "Gender of the patient",
    'INDICATION': 'Patient disease history',
    "DIAGNOSIS_DATE": "Date of initial diagnosis in YYYY-MM-DD format",
    'ICD_10': 'ICD 10 Classification',
    'OS_MONTHS': 'Overall survival months',
    'OS_STATUS': 'Overall survival status',
    "DSS_STATUS": "Disease status",    
    "UICC": "Stage to define the anatomical extent of disease",           
}

header_lines_sample = { 
    "PATIENT_ID": "#Unique Patient Identifier", 
    "SAMPLE_ID": 'Unique Sample Identifier',
    'CANCER_TYPE': "Cancer type",
    'TISSUE_SOURCE_SITE': "Tissue source site",
    'TUMOR_TISSUE': "Tumor tissue size",
    'TUMOR_MOLECULAR_SUBTYPE':"Tumor molecular subtype",
    'TMB': "TMB",
    'HRD':"HRD",
    'T':"T",
    'N':"N",
    'M':"M",
    'ECOG':"ECOG",
    'HER2': "HER2",
    'EBV':"EBV",
    'HPV':"HPV",
    'PDL1': "PDL1",
    'PR':"PR",
    'ER': "ER",
    'MSI':"MSI",
    'GERMLINE_MUTATION': "Exist germline mutation or not" 
}

header_lines_recommendations={
    "PATIENT_ID": "#Patient id",
    "RECOMMENDATION_ORDER_NUMBER": "Order number of treatment recommendations",
    "MEDICATION_CODE":"Medications code",
    "RECOMMENDED_MEDICATION": "Recommended medication",
    "DISEASE_STATUS_CODE": "Disease status code",
    "DISEASE_STATUS": "Disease status",
    "CATEGORY": "Evidence level group",
    "EVIDENCE_LEVEL": "Evidence level",
    "EVIDENCE_CODE": "Evidence level code",
    "EVIDENCE_LEVEL_DETAILED": "Detailed evidence level",
    "REASON_ID": "Reason id",
    "REASONS_MEDICATION": "Medication reason rull",
    "REASONS_OTHER": "Other column from reason", 
    "REASONS_GENETICS": "Genetics reasons for recommended medication",
    "REASONS_COMMENTS": "Physicians comments about reasons" 
}


In [1061]:
data_patient = data_patient.replace('"', '', regex=True)

### 'data_clinical_sample.txt'

In [1062]:
os.makedirs(folder_path, exist_ok=True)
file_path = os.path.join(folder_path, 'data_clinical_sample.txt')

header_lines_sample = { 
    "#PATIENT_ID": "#Unique Patient Identifier", 
    "SAMPLE_ID": 'Unique Sample Identifier',
    'CANCER_TYPE': "Cancer type",
    'TISSUE_SOURCE_SITE': "Tissue source site",
    'TUMOR_TISSUE': "Tumor tissue size",
    'TUMOR_MOLECULAR_SUBTYPE':"Tumor molecular subtype",
    'TMB': "TMB",
    'HRD':"HRD",
    'T':"T",
    'N':"N",
    'M':"M",
    'ECOG':"ECOG",
    'HER2': "HER2",
    'EBV':"EBV",
    'HPV':"HPV",
    'PDL1': "PDL1",
    'PR':"PR",
    'ER': "ER",
    'MSI':"MSI",
    'GERMLINE_MUTATION': "Exist germline mutation or not" 
}

column_names = list(header_lines_sample.keys())
first_header = '\t'.join(column_names)
second_header = '\t'.join(header_lines_sample.values())
third_header = '#STRING' + '\t' + '\t'.join(['STRING'] * (len(column_names) - 1))
fourth_header = '#1' + '\t' + '\t'.join(['1'] * (len(column_names) - 1))

# Combine all header rows
header_lines = [first_header, second_header, third_header, fourth_header]

with open(file_path, 'w') as f:
    # Write the header lines
    for line in header_lines:
        f.write(line + "\n")
    data_sample.to_csv(f, sep='\t', index=False)
print(f"File created at: {file_path}")     

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/data_clinical_sample.txt


### 'data_clinical_patient.txt'

In [1063]:
os.makedirs(folder_path, exist_ok=True)
file_path = os.path.join(folder_path, 'data_clinical_patient.txt')

header_lines_patient = {
    "#PATIENT_ID": "#Unique Patient Identifier",            
    "AGE": 'Age of the patient',                         
    "GENDER": "Gender of the patient",
    'INDICATION': 'Patient disease history',
    "DIAGNOSIS_DATE": "Date of initial diagnosis in YYYY-MM-DD format",
    "ICD_10": "ICD 10 Classification",
    "OS_MONTHS": "Overall survival months",
    "OS_STATUS": "Overall survival status",
    "DSS_STATUS": "Disease status",    
    "UICC": "Stage to define the anatomical extent of disease",
    "DFS_MONTHS": "DFS_MONTHS",
    "DFS_STATUS":"DFS_STATUS"
}
number_columns = {'AGE', 'OS_MONTHS','DFS_MONTHS'}
column_names = list(header_lines_patient.keys())
first_header = '\t'.join(column_names)
second_header = '\t'.join(header_lines_patient.values())
third_header = '#STRING' + '\t' + '\t'.join(
    ['NUMBER' if col in number_columns else 'STRING' for col in column_names[1:]]
)
fourth_header = '#1' + '\t' + '\t'.join(['1'] * (len(column_names) - 1))

# Combine all header rows
header_lines = [first_header, second_header, third_header, fourth_header]

with open(file_path, 'w') as f:
    # Write the header lines
    for line in header_lines:
        f.write(line + "\n")
    data_patient.to_csv(f, sep='\t', index=False)
print(f"File created at: {file_path}")  

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/data_clinical_patient.txt
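
The sample and patient cells build the same four commented header rows by hand. The sketch below shows one way this could be shared. `build_clinical_header` is a hypothetical helper that takes plain column names (no `#` in the dictionary keys or descriptions) and prefixes every header row with `#` itself; the column subset in the demo is an assumption chosen for brevity.

```python
def build_clinical_header(descriptions, number_columns=frozenset(), priority="1"):
    """Build the four '#'-prefixed header rows of a cBioPortal clinical data file.

    descriptions maps column name -> description; columns listed in
    number_columns get the NUMBER datatype, all others STRING.
    """
    names = list(descriptions)
    datatypes = ["NUMBER" if name in number_columns else "STRING" for name in names]
    rows = [
        names,                                   # display names
        [descriptions[name] for name in names],  # descriptions
        datatypes,                               # datatypes
        [priority] * len(names),                 # priorities
    ]
    return ["#" + "\t".join(row) for row in rows]


# Demo with a small subset of the patient columns
demo_descriptions = {
    "PATIENT_ID": "Unique Patient Identifier",
    "AGE": "Age of the patient",
    "OS_MONTHS": "Overall survival months",
}
demo_header = build_clinical_header(demo_descriptions, number_columns={"AGE", "OS_MONTHS"})
for line in demo_header:
    print(line)
```

The returned lines can be written before `to_csv`, exactly as `header_lines` is used above. In this sketch the display names are simply the column names, which matches what the notebook does.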


### 'data_mutations.txt'

In [1064]:
os.makedirs(folder_path, exist_ok=True)
file_path = os.path.join(folder_path, 'data_mutations.txt')

# Write to a tab-separated .txt file
with open(file_path, 'w') as f:
    
    # Save the DataFrame as tab-separated values
    data_mutations.to_csv(f, sep='\t', index=False)
print(f"File created at: {file_path}")  

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/data_mutations.txt
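
Before validation it can help to confirm that every sample referenced in the mutation table also appears in the clinical sample table. The sketch below assumes that `data_mutations` uses the standard MAF column `Tumor_Sample_Barcode` and that `data_sample` uses `SAMPLE_ID`; the helper name `unknown_barcodes` and the demo frames are hypothetical.

```python
import pandas as pd


def unknown_barcodes(mutations, samples,
                     barcode_col="Tumor_Sample_Barcode", sample_col="SAMPLE_ID"):
    """Return sorted mutation sample barcodes that are absent from the sample table."""
    known = set(samples[sample_col].astype(str))
    referenced = set(mutations[barcode_col].astype(str))
    return sorted(referenced - known)


# Demo frames (assumed values)
demo_samples = pd.DataFrame({"SAMPLE_ID": ["s1", "s2"]})
demo_mutations = pd.DataFrame({
    "Hugo_Symbol": ["ERBB2", "PIK3CA", "ARID1A"],
    "Tumor_Sample_Barcode": ["s1", "s2", "s9"],
})
orphans = unknown_barcodes(demo_mutations, demo_samples)
print(orphans)
```

An empty list means every mutation row can be linked to a known sample.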


### 'data_timeline_status.txt'

In [1065]:
os.makedirs(folder_path, exist_ok=True)
file_path = os.path.join(folder_path, 'data_timeline_status.txt')

# Write to a tab-separated .txt file
with open(file_path, 'w') as f:
    
    # Save the DataFrame as tab-separated values
    data_status.to_csv(f, sep='\t', index=False)
print(f"File created at: {file_path}") 

File created at: /Users/valeriya.vishnevskaya/cg-on-fhir/output_study/data_timeline_status.txt


## 📈 Section 6: Validation & Upload to cBioPortal
- `Use cBioPortal's validation scripts to validate data before import` :-   ./validateData.py --help
https://docs.cbioportal.org/using-the-dataset-validator/
- `Use cBioPortal's import scripts to import the study into cBioPortal` :-  ./cbioportalImporter.py -s <path to study directory>
https://docs.cbioportal.org/data-loading-maintaining-studies/

<center>
    <img src="images/cbioportal.png" width=1200/>
</center>

## 🧪 Section 7: Data Quality & Challenges
1. Incomplete genomic information must be handled explicitly.
2. Longitudinal timelines are hard to represent: the cBioPortal timeline requires all dates to be converted to day offsets.
3. Date parsing edge cases: Inconsistent or partial date formats may silently fail (e.g., NaT)
4. Missing values: Some fields (Mutation_status, Start/End_Date) are filled with placeholders (.) and may trigger warnings in validation
5. Column misalignment: Ensure all required headers and values match cBioPortal schemas exactly
6. ID consistency: PATIENT_ID must match across all files (sample, mutations, events)
7. Data integrity: Check for duplicated or missing rows after concat() operations    
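
Two of the points above, date parsing and ID consistency, can be checked with a few lines of pandas. This is a minimal sketch under stated assumptions: dates are expected as `YYYY-MM-DD` strings, and the helper names `days_since_diagnosis` and `missing_patient_ids` are hypothetical and do not appear in the pipeline.

```python
import pandas as pd


def days_since_diagnosis(event_dates, diagnosis_dates, fmt="%Y-%m-%d"):
    """Convert event dates to whole-day offsets from the diagnosis date.

    Unparseable or missing dates become <NA> instead of raising,
    so they can be counted and reviewed explicitly.
    """
    events = pd.to_datetime(event_dates, format=fmt, errors="coerce")
    diagnosis = pd.to_datetime(diagnosis_dates, format=fmt, errors="coerce")
    return (events - diagnosis).dt.days.astype("Int64")


def missing_patient_ids(patients, other, col="PATIENT_ID"):
    """Return sorted IDs used in `other` that are absent from the patient table."""
    return sorted(set(other[col].astype(str)) - set(patients[col].astype(str)))


# Demo data (assumed values)
demo_events = pd.Series(["2021-01-31", "not a date", None])
demo_diagnosis = pd.Series(["2021-01-01", "2021-01-01", "2021-01-01"])
offsets = days_since_diagnosis(demo_events, demo_diagnosis)
print(offsets.tolist())

demo_patients = pd.DataFrame({"PATIENT_ID": ["p1", "p2"]})
demo_timeline_ids = pd.DataFrame({"PATIENT_ID": ["p1", "p3"]})
print(missing_patient_ids(demo_patients, demo_timeline_ids))
```

Counting the `<NA>` values in the offsets makes silent parsing failures visible before the files are handed to the validator.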

## 📚 References
- 'HL7 Clinical Genomics IG' https://build.fhir.org/ig/HL7/genomics-reporting/index.html
- 'MAF File Format' https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/
- 'cBioPortal User Guide' https://docs.cbioportal.org/user-guide/
- 'LOINC Codes: 81258-6, 48005-3, 51958-7', etc.