# Example (synthetic) data sets
To support the effort to create a common data shape for glioblastoma, two synthetic datasets were provided. Synthetic datasets contain fabricated data that is representative of actual data on the topic.

In this chapter we will inspect the provided datasets and prepare them to be converted into linked data. As described earlier, linked data is data that follows a linked-data format consisting of triples, where each part of a triple is either a value or an identifier (IRI) pointing to a concept definition.
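
To make the idea of a triple concrete, here is a minimal sketch in plain Python. It is an illustration only: the `Triple` type, the `is_iri` helper and every IRI under `https://example.org/` are hypothetical and do not come from the datasets or from any real vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: a subject, a predicate and an object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace, used for illustration only.
EX = "https://example.org/"

triples = [
    # The object is an identifier (IRI) pointing to a concept definition.
    Triple(EX + "patient/1", EX + "vocab/hasDiagnosis", EX + "concept/Glioblastoma"),
    # The object is a plain value.
    Triple(EX + "patient/1", EX + "vocab/age", "62"),
]


def is_iri(term: str) -> bool:
    """Rough check for this sketch: treat http(s) strings as identifiers."""
    return term.startswith(("http://", "https://"))


for triple in triples:
    kind = "identifier" if is_iri(triple.obj) else "value"
    print(triple.predicate, "->", triple.obj, f"({kind})")
```

In practice an RDF library would handle IRIs and literals with dedicated types; the tuple form above only shows the shape of the data.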

The process of converting data to linked data is rather straightforward and can be summed up as follows:
1. Inspect the data
2. Clean the data headers
3. Align the data with common controlled vocabularies or ontologies
4. Design linked-data shapes using the IRIs from the controlled vocabularies and ontologies
5. Transform the data into RDF

In this chapter we will focus on the first two steps.

## Dataset 1 on Glioblastoma
The first dataset is an Excel file (Sample clinical data_July 2022.xlsx) ## TODO add a link.
It contains synthetic clinical data on patients with glioblastoma. 

In [1]:
import pandas as pd
gdb1 = pd.read_excel("data/Sample clinical data_July 2022.xlsx")
gdb1

Unnamed: 0,ID Lab site,ID clinical site,DOB,Age,Gender,Pathology,Localization,IDH1/2 status,MGMT status,ECOG,...,doses bevacizumab,Best response,pseudo-progression (1=yes 0=no),œdema,PP or œdema,progression (1=yes 0=no),date of progression,months PFS,deceased (1=yes 0=no),months OS
0,Ge 829,DAG,1953-05-16,62,F,GBM,parietal R,wt,UnMeth,1,...,12,PD,0,1,1,1,2013-11-11,3,1,19
1,Ge 835,EAG,1964-06-18,50,M,GBM,temporal D,wt,UnMeth,0,...,0,PD,1,1,1,1,2014-01-21,6,1,10
2,Ge 849,FAG,1957-12-15,57,F,GBM,frontal R,wt,UnMeth,0,...,9,PD,1,0,1,1,2014-09-02,11,1,18
3,Ge 852,GAG,1960-03-22,55,M,GBM,fronto insular L,wt,UnMeth,0,...,16,SD,0,1,1,1,2014-10-01,11,1,21
4,Ge 882,IAG,1948-05-06,67,M,GBM,fronto insular R,wt,UnMeth,0,...,4,PD,0,1,1,1,2014-12-22,18,1,24
5,Ge 893*,KAG,1958-07-12,56,M,GBM,temporal R,wt,Meth,0,...,0,SD,0,0,0,1,2015-02-04,26,0,44
6,Ge 901,LAG,1991-06-09,23,F,AIII,fronto temporal insular R,mut,UnMeth,0,...,29,SD,1,0,1,1,2016-07-18,17,1,37
7,Ge 904,MAG,1949-10-13,65,M,GBM,parieto occipital L,wt,Meth,0,...,6,PR,1,1,1,1,2016-12-12,29,1,35
8,Ge 939,PAG,1972-04-10,43,F,AIII,temporal R,wt,UnMeth,1,...,42,PD,1,1,1,1,2015-06-12,6,1,32
9,Ge 941*,OAG,1954-01-07,61,M,GBM,frontal L,mut,Meth,0,...,0,CR,1,0,1,0,NaT,38,0,38


In [64]:
gdb1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 26 columns):
 #   Column                                    Non-Null Count  Dtype         
---  ------                                    --------------  -----         
 0   ID Lab site                               19 non-null     object        
 1   ID clinical site                          19 non-null     object        
 2   DOB                                       19 non-null     datetime64[ns]
 3   Age                                       19 non-null     int64         
 4   Gender                                    19 non-null     object        
 5   Pathology                                 19 non-null     object        
 6   Localization                              19 non-null     object        
 7   IDH1/2 status                             19 non-null     object        
 8   MGMT status                               19 non-null     object        
 9   ECOG                              

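This dataset contains 26 fields of various datatypes. In one of the next steps these field names will be aligned with various controlled vocabularies and ontologies. For this to be successful, the field names need to be as expressive as possible. The cell below strips surrounding spaces, spells out abbreviations, removes answer options such as "(1=yes 0=no)", and records which change was applied to each field.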
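
Before changing anything, it can help to flag which headers are likely to need attention. The following is a minimal sketch and not part of the original notebook: the `header_issues` helper and its three heuristics are assumptions, and the sample headers are stand-ins based on the columns shown above (the trailing space in `"Age "` is assumed).

```python
import re


def header_issues(name):
    """Return a list of reasons why a column header may need cleaning."""
    issues = []
    if name != name.strip():
        issues.append("surrounding whitespace")
    if re.search(r"\([^)]*\)", name):
        issues.append("parenthetical note")
    # Two or more capital letters standing alone often indicate an abbreviation.
    if re.search(r"\b[A-Z]{2,}\b", name):
        issues.append("possible abbreviation")
    return issues


# Stand-in headers; with the real data one would loop over gdb1.columns.
headers = ["Age ", "months PFS", "progression (1=yes 0=no)", "Gender"]
for header in headers:
    print(repr(header), header_issues(header))
```

The heuristics only point at candidates. Whether an abbreviation should be spelled out, and how, remains a manual decision.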

In [74]:
def show_difference(row):
    # Highlight the prepared field name when it differs from the original.
    highlight = 'background-color: yellow;'
    default = ''
    if row['original_field'] != row['prepared_field']:
        return [default, highlight]
    return [default, default]

def record_change(df, label):
    # Label rows whose field name changed and that have no label yet,
    # so each row keeps the first change that was applied to it.
    # Assigning through .loc persists the label; mutating the row inside
    # df.apply is not guaranteed to write back to the DataFrame.
    changed = (df['original_field'] != df['prepared_field']) & (df['change'] == '')
    df.loc[changed, 'change'] = label

# One row per column: original header, stripped header, and a change note.
df = pd.DataFrame({
    'original_field': list(gdb1.columns),
    'prepared_field': [column.strip() for column in gdb1.columns],
    'change': '',
})
record_change(df, 'remove trailing spaces')

# Spell out abbreviations (literal, case-sensitive matches).
abbreviations = {
    'adj': 'adjuvant',
    'months OS': 'months overall survival',
    'PP': 'pseudo-progression',
    'PFS': 'progression-free survival',
}
for short, full in abbreviations.items():
    df['prepared_field'] = df['prepared_field'].str.replace(short, full, regex=False)
record_change(df, 'resolved abbreviations')

# Drop the answer options from the header and the space they leave behind.
df['prepared_field'] = df['prepared_field'].str.replace('(1=yes 0=no)', '', regex=False).str.strip()
record_change(df, 'removed choices')

# Chain the styling calls so both apply to the displayed table.
(df.style
   .set_properties(**{'text-align': 'left'})
   .apply(show_difference, subset=['original_field', 'prepared_field'], axis=1))

| | original_field | prepared_field | change |
|---|---|---|---|
| 0 | ID Lab site | ID Lab site | |
| 1 | ID clinical site | ID clinical site | |
| 2 | DOB | DOB | |
| 3 | Age | Age | remove trailing spaces |
| 4 | Gender | Gender | |
| 5 | Pathology | Pathology | |
| 6 | Localization | Localization | remove trailing spaces |
| 7 | IDH1/2 status | IDH1/2 status | |
| 8 | MGMT status | MGMT status | |
| 9 | ECOG | ECOG | |
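
The table above only lists the proposed names; the dataset itself still carries the original headers. One way to apply the prepared names is `DataFrame.rename`. This is a sketch under assumptions: `sample` and `mapping` are small stand-ins for `gdb1` and for the mapping table, and the trailing space in `"Age "` is assumed.

```python
import pandas as pd

# Stand-in for gdb1 with two headers that need cleaning.
sample = pd.DataFrame({"Age ": [62, 50], "months PFS": [3, 6]})

# Stand-in for the mapping table built in the previous cell.
mapping = pd.DataFrame({
    "original_field": ["Age ", "months PFS"],
    "prepared_field": ["Age", "months progression-free survival"],
})

# Turn the two columns into an {old: new} dictionary and rename.
rename_map = dict(zip(mapping["original_field"], mapping["prepared_field"]))
cleaned = sample.rename(columns=rename_map)

print(list(cleaned.columns))
```

`rename` returns a new DataFrame, so the original headers remain available in `sample` for traceability.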


## Dataset 2 on Glioblastoma
The second dataset is an Excel file (dt_Bayraktar_ClinicalMetadata_mock_2022.07.21.xlsx) ## TODO add a link. It contains synthetic clinical data on patients with glioblastoma.

In [3]:
gdb2 = pd.read_excel("data/dt_Bayraktar_ClinicalMetadata_mock_2022.07.21.xlsx")
gdb2

Unnamed: 0,Sample ID,MPBTP (Sequencing Data),# of blocks,Sex,Age (optional),Tumour Spatial Information,Pathology - pre-surgical,Histopathological Assignment,Histopathology - Microscopic Description,Mutations,Methylation Status (>10% = METHYLATED),Other Variants,Covid test,Other comments,Spatial Location
0,Patient A,,5,M,54,Right Temporal Lobe,GBM,"Glioblastoma, IDH-wiltype (CNS WHO Grade 4)",Cellular glial neoplasm with elongated nuclei ...,IDH-WT (IHC and Sequencing)\nATRX - Retained,MGMT Promotor Methylation - 2% - ABSENT,EGFR Amplification;\nCDKN2A Deletion;\nPTEN De...,Negative,A1 has a duplicate block,Wasn’t able to get stealth location informatio...
1,Patient B,,4,F,82,Right Temporal Lobe,GBM,"Glioblastoma, IDH-wiltype (CNS WHO Grade 4)",Pleomorphic astrocytes with numerous gemistocy...,IDH-WT (IHC and Sequencing)\nATRX - Retained,MGMT Promotor Methylation - 5% - ABSENT,CDKN2A Deletion; \nPTEN Deletion,Negative,-,"Anterior, Lateral, Posterior and Medial sample..."
2,Patient C,,4,F,41,Right Frontal Lobe,GBM,"Oligodendroglioma, IDH-mut and 1p/19q co-delet...",Diffusely infiltrative glial neoplasm with mod...,IDH1-mut R132H (IHC and Sequencing)\nATRX - Re...,No tested,1p/19q Co-deletion; \nTP53 P152L,Negative,Sampled by Richard,"Superior, Anterior, Deep and Inferior samples ..."


In [4]:
gdb2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 15 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Sample ID                                 3 non-null      object 
 1   MPBTP (Sequencing Data)                   0 non-null      float64
 2   # of blocks                               3 non-null      int64  
 3   Sex                                       3 non-null      object 
 4   Age (optional)                            3 non-null      int64  
 5   Tumour Spatial Information                3 non-null      object 
 6   Pathology - pre-surgical                  3 non-null      object 
 7   Histopathological Assignment              3 non-null      object 
 8   Histopathology - Microscopic Description  3 non-null      object 
 9   Mutations                                 3 non-null      object 
 10  Methylation Status (>10% = METHYLATED)    
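
The `info()` output reports 0 non-null values for "MPBTP (Sequencing Data)". Columns without any values are worth listing during inspection, since they cannot be checked against example data later on. A minimal sketch, assuming a small stand-in frame `sample` in place of `gdb2`:

```python
import pandas as pd

# Stand-in for gdb2: three rows and one column without any values.
sample = pd.DataFrame({
    "Sample ID": ["Patient A", "Patient B", "Patient C"],
    "MPBTP (Sequencing Data)": [None, None, None],
    "# of blocks": [5, 4, 4],
})

# A column is empty when every one of its cells is missing.
empty_columns = [column for column in sample.columns if sample[column].isna().all()]

print(empty_columns)
```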

In [6]:
def show_difference(row):
    # Highlight the prepared field name when it differs from the original.
    highlight = 'background-color: yellow;'
    default = ''
    if row['original_field'] != row['prepared_field']:
        return [default, highlight]
    return [default, default]

def record_change(df, label):
    # Label rows whose field name changed and that have no label yet,
    # so each row keeps the first change that was applied to it.
    # Assigning through .loc persists the label; mutating the row inside
    # df.apply is not guaranteed to write back to the DataFrame.
    changed = (df['original_field'] != df['prepared_field']) & (df['change'] == '')
    df.loc[changed, 'change'] = label

# One row per column: original header, stripped header, and a change note.
df = pd.DataFrame({
    'original_field': list(gdb2.columns),
    'prepared_field': [column.strip() for column in gdb2.columns],
    'change': '',
})
record_change(df, 'remove trailing spaces')

# Spell out abbreviations (literal, case-sensitive matches).
abbreviations = {
    'adj': 'adjuvant',
    'months OS': 'months overall survival',
    'PP': 'pseudo-progression',
    'PFS': 'progression-free survival',
}
for short, full in abbreviations.items():
    df['prepared_field'] = df['prepared_field'].str.replace(short, full, regex=False)
record_change(df, 'resolved abbreviations')

# Drop the "(optional)" marker from the header and the space it leaves behind.
df['prepared_field'] = df['prepared_field'].str.replace('(optional)', '', regex=False).str.strip()
record_change(df, 'removed conditionals')

# Chain the styling calls so both apply to the displayed table.
(df.style
   .set_properties(**{'text-align': 'left'})
   .apply(show_difference, subset=['original_field', 'prepared_field'], axis=1))

| | original_field | prepared_field | change |
|---|---|---|---|
| 0 | Sample ID | Sample ID | |
| 1 | MPBTP (Sequencing Data) | MPBTP (Sequencing Data) | |
| 2 | # of blocks | # of blocks | |
| 3 | Sex | Sex | |
| 4 | Age (optional) | Age | removed conditionals |
| 5 | Tumour Spatial Information | Tumour Spatial Information | remove trailing spaces |
| 6 | Pathology - pre-surgical | Pathology - pre-surgical | |
| 7 | Histopathological Assignment | Histopathological Assignment | |
| 8 | Histopathology - Microscopic Description | Histopathology - Microscopic Description | |
| 9 | Mutations | Mutations | |


In the next chapter the clarified field names will be used to identify terms in applicable controlled vocabularies and ontologies to drive the linked data.