# Automated metadata matching
Life sciences (LS) and clinical research institutes (academic research institutes, pharma companies, hospitals, clinics etc.) across the world are producing large volumes of patient data. These range from clinical information such as diagnostic/prognostic data, to omic data such as genetic/proteomic/epigenetic screens, to imaging data such as MRI scans. One of the main objectives in LS research (both academic and industrial) is to gain actionable insights from these datasets that go beyond the diagnosis/prognosis of a (group of) patient(s), provide a deeper understanding of diseases, and shed light on new therapeutic options. It is becoming apparent that gaining such insights requires data from a large number of patients. This would be achievable if we could merge datasets from various institutes, but that is turning out to be a hugely challenging task, simply because different institutes use different standards, units, nomenclature etc. to store data. <br><br>

For instance, a patient's age is a common clinical parameter recorded by almost all organisations. One institute may name the variable that records it 'patient age', another may name it 'age', and others 'age at diagnosis', 'days since birth', 'years since birth' etc. The values can also be in days, months, years etc. Therefore, to combine data from many institutes (and sometimes within the same institute), it is essential to recognise that all of the above variables record the same thing, i.e. the patient's age. We also need to make sure that the units used to measure age (days, months, years) are homogenised at the time of integration. <br><br>
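
As a small illustration of the unit homogenisation described above, the sketch below converts ages recorded in days, months or years to years. The variable names, values and conversion factors are assumptions made for this example and are not part of the benchmark code.

```python
# Approximate conversion factors to years; the unit labels are illustrative.
UNITS_PER_YEAR = {"days": 365.25, "months": 12.0, "years": 1.0}


def age_in_years(value, unit):
    """Convert an age recorded in days, months or years to years."""
    return value / UNITS_PER_YEAR[unit]


# Three hypothetical institutes recording the same concept differently.
records = [
    ("patient age", 54, "years"),
    ("age at diagnosis", 648, "months"),
    ("days since birth", 19723.5, "days"),
]
harmonised = [round(age_in_years(value, unit), 1) for _, value, unit in records]
print(harmonised)
```

All three records end up on the same scale, which is the precondition for merging them into one column.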

To assist in the above process, the National Cancer Institute (NCI) created the concept of the CDE (common data element). See https://cdebrowser.nci.nih.gov/cdebrowserClient/cdeBrowser.html#/search for more details. CDEs provide a standard format for representing life science data: for each clinical parameter they give a standard variable name, the permissible values, units etc. A data dump of about 69000 CDEs is provided in XML format in the 'cde_database\full_database' folder, if you want to explore further. Some research organizations follow this standard, but the vast majority don't. Additionally, a huge amount of the data produced so far has not been standardized using CDEs. <br><br>

To be able to integrate data from various institutes, we need to be able to match the variable names in the clinical datasets to the corresponding CDE elements. Currently, there is a drive for developing ML/AI algorithms to achieve this.<br><br>


The code below is an initial attempt in this direction. In summary, it tries to match the variable names (generally the column headers in a clinical data file) and values (the column values) of the clinical parameters in a dataset to the long names and permissible values of the CDE elements. The objective is to find the CDE elements that most closely match each clinical parameter name (i.e. the column header). To do this, the following steps are performed: <br><br>

1. Convert selected aspects (e.g. long_name, permissible_values etc.) of all CDE elements into numerical vectors using a word embedding model which was itself trained on these data.

2. Convert the clinical parameter names (headers) and values into numerical vectors using the same word embedding model as above.

3. Match the vectors from the clinical data to the CDE vectors. This can be done in a few different ways: <br>
   (a) One way is to use unsupervised learning: fit a Nearest Neighbour model to the CDE vectors, and look up the nearest neighbours of each clinical parameter using this model.
   (b) Another way is to use supervised learning: create feature vectors for all possible pairs of clinical parameters and CDE elements, treat the true pairs as the positive class (target = 1) and the remaining pairs as the negative class (target = 0). Train classifiers on this data and use them to evaluate new clinical parameters.
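
Option (a) can be sketched with a brute-force cosine-similarity search in NumPy. This is a simplified stand-in for a fitted nearest-neighbour model, not the benchmark implementation; the toy vectors, the public ids and the choice of cosine similarity are assumptions made for illustration.

```python
import numpy as np


def nearest_cdes(query_vecs, cde_vecs, cde_ids, k=3):
    """Rank CDE ids for each query vector by cosine similarity (brute force)."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = cde_vecs / np.linalg.norm(cde_vecs, axis=1, keepdims=True)
    sims = q @ c.T                               # shape: (n_queries, n_cdes)
    order = np.argsort(-sims, axis=1)[:, :k]     # most similar first
    return [[cde_ids[j] for j in row] for row in order]


# Toy 2-d vectors; real vectors would come from the word embedding model.
cde_ids = ["1001", "1002", "1003"]               # hypothetical public ids
cde_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
headers = ["age", "gender"]
header_vecs = np.array([[0.9, 0.1], [0.1, 0.8]])

ranked = nearest_cdes(header_vecs, cde_vecs, cde_ids, k=2)
predictions = dict(zip(headers, ranked))
print(predictions)
```

Each header maps to a ranked list of candidate CDE ids, most similar first.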
   
 
See more details below.
   
   
   
   





## Install custom benchmark solutions libraries

In [1]:
!pip install cde_modelling_tools/.

Processing c:\users\tapes\onedrive\documents\hackathon_documents_v1\cde_modelling_tools
Building wheels for collected packages: cde-data-modeller
  Building wheel for cde-data-modeller (setup.py): started
  Building wheel for cde-data-modeller (setup.py): finished with status 'done'
  Created wheel for cde-data-modeller: filename=cde_data_modeller-0.0.1-py3-none-any.whl size=15963 sha256=0e62489060126c4b323dbe10e20e84c0e7ddd266148121079995fa95552d43fc
  Stored in directory: C:\Users\tapes\AppData\Local\Temp\pip-ephem-wheel-cache-in7obrn5\wheels\d1\d6\de\83b2166f12e4abb68ba736ffc7f9cfb70c0093eda89b5f882b
Successfully built cde-data-modeller
Installing collected packages: cde-data-modeller
  Attempting uninstall: cde-data-modeller
    Found existing installation: cde-data-modeller 0.0.1
    Uninstalling cde-data-modeller-0.0.1:
      Successfully uninstalled cde-data-modeller-0.0.1
Successfully installed cde-data-modeller-0.0.1


## Import necessary libraries

In [2]:
import pandas as pd

import json
import numpy as np
import random
import mlflow
from cde_modelling.modelling import CDE_data_modeller as cdm
from cde_modelling.parsing import TCGA_data_parser as tdp
from cde_modelling.utils import Accuracy_calculations as ac
import pickle 
from sklearn.model_selection import train_test_split

scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()


## File paths

In [4]:
clinical_data_files_dir = 'tcga_training_data/'

clinical_data_test_dir = 'tcga_test_data/'

cde_database_file = 'cde_database/combined_small_dataset.json'

parameter_file = 'params_unsupervised.json'

model_dir = 'models/'

test_gold_standard = 'gold_standard/test_gs.json'

## Load model parameters

In [6]:
# read model parameters
params = {}

with open(parameter_file,'r') as file:
    params = json.load(file)
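
The parameter file itself is not shown in this notebook. The sketch below shows a structure that would be consistent with how `params` is accessed in the later cells (`params["fasttext"][...]`, `params["model"]["name"]`, `params["model"]["model_params"]`). All values, the model name and the `n_neighbors` key are hypothetical.

```python
import json

# Hypothetical parameter file content: the key names follow how `params`
# is accessed later in this notebook, the values are purely illustrative.
example_params = {
    "fasttext": {"window": 5, "min_count": 1, "epochs": 10, "vector_size": 100},
    "model": {
        "name": "NearestNeighbors",           # assumed model identifier
        "model_params": {"n_neighbors": 20},  # assumed keyword arguments
    },
}

# Round-trip through JSON, as json.load would do when reading the file.
params_demo = json.loads(json.dumps(example_params))
print(params_demo["fasttext"]["vector_size"], params_demo["model"]["name"])
```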


## Create a fasttext model for the CDE database and index the individual CDE elements in the database 

FastText is a word embedding algorithm developed by Facebook. Given a corpus, it trains a model that tries to predict whether a pair of words appears in the same context. The model first converts each word to a numeric vector, and these vectors are used as features for the above prediction. We are interested in the feature generation part, i.e. the part which converts words to numeric vectors. For more information on the FastText model see https://radimrehurek.com/gensim/models/fasttext.html.

### FastText model training: 

To train a FastText model we first extracted the long_name and permissible_values of each CDE element. These were then parsed and cleaned (lower-cased, restricted to alphanumeric characters, and split into a bag of words). The preprocessed long names and permissible values of all CDE elements formed the training corpus for the FastText model. The parameters for the model are in the above json file. The trained model is then used to index the CDE elements (i.e. create numeric vectors representing each CDE). We created two sets of vectors for the CDE elements, one for the long_names and the other for the permissible values. We also extracted the data_type information for each CDE element. Below is an example. Let's assume that the following is an (oversimplified) CDE element.
CDE_element: 
{
'public_id': 1234
'long_name': 'received radiotherapy'
..............
'permissible_values': ['yes','no']
}.
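
The cleaning step described above (lower-casing, keeping alphanumeric characters only, splitting into words) can be sketched as follows. This is a minimal approximation; the function name and the exact regular expression are assumptions and may differ from what the package does.

```python
import re


def clean_and_tokenise(text):
    """Lower-case the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


print(clean_and_tokenise("Received Radiotherapy?"))
print(clean_and_tokenise("bcr_patient_barcode"))
```

Splitting on every non-alphanumeric character also breaks underscore-separated column headers into words, in line with the parsed headers shown later in this notebook.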

To index the above CDE, we performed the following:

1. Vectorized the long_name entry (i.e. 'received radiotherapy') using the FastText model. To do that, we vectorized each word (i.e. 'received' and 'radiotherapy') of the long name entry separately. The vectors were then normalized by their L2 norms and averaged. Say for example, the long_name vector is [0.1, 0.345]. 

2. Vectorized the 'permissible_values' entry (i.e. 'yes', 'no') using the FastText model. To do that, we vectorized each word (i.e. 'yes' and 'no') in the permissible_values entry separately. The vectors were then normalized by their L2 norms and averaged. Say for example, the permissible_values vector is [0.981, 0.233]. 

3. We identified whether the permissible values are strings or numbers. Note that for the benchmark solution we kept this simple. For the hackathon, participants can consider more granular data types, for example string, binary, float, int, long etc.


The combination of the above is used to numerically represent (index) each CDE. The class CDE_data_modeller, in the package cde_modelling_tools, does the above. Participants should explore using other entries in the CDE data fields to improve their chances of finding a match.
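
Steps 1 and 2 above (normalise each word vector by its L2 norm, then average) can be sketched as below. The word vectors are toy values standing in for FastText output, and the function name and the handling of empty phrases are assumptions rather than the package's actual behaviour.

```python
import numpy as np


def sentence_vector(words, word_vectors):
    """Average the L2-normalised vectors of the words in a phrase."""
    dim = len(next(iter(word_vectors.values())))
    unit_vecs = []
    for word in words:
        vec = np.asarray(word_vectors[word], dtype=float)
        norm = np.linalg.norm(vec)
        if norm > 0:                      # skip all-zero vectors
            unit_vecs.append(vec / norm)
    if not unit_vecs:                     # empty phrase: return zeros, avoid 0/0
        return np.zeros(dim)
    return np.mean(unit_vecs, axis=0)


# Toy 2-d word vectors; a trained embedding model would supply real ones.
toy_vectors = {"received": [3.0, 4.0], "radiotherapy": [0.0, 2.0]}
long_name_vec = sentence_vector(["received", "radiotherapy"], toy_vectors)
print(long_name_vec)
```

Normalising before averaging keeps a single word with a large vector norm from dominating the phrase vector.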

The CDE_data_modeller class not only creates the word embedding model and indexes (vectorizes) the CDE data elements, it can also save and load pretrained models and indexes.

In [5]:
cde_data_modellers = cdm.CDE_data_modeller(cde_database_file, params)
cde_data_modellers.create_model_and_cde_indexes()
cde_data_modellers.save_model_and_indexes(model_dir+'fasttext/')

Loading CDE database... please wait
Took 0.003886  minutes to load CDE database..
Starting model training ... 


 10%|███████▍                                                                     | 483/4991 [00:00<00:00, 4646.70it/s]

Model training took 4.793509 minutes
Start converting descriptors to vectors


100%|████████████████████████████████████████████████████████████████████████████| 4991/4991 [00:01<00:00, 4184.82it/s]


Took 0.020008 minutes to vectorize the dataset


  vector = vector/len(sentence)
  3%|██▎                                                                          | 147/4991 [00:00<00:03, 1413.38it/s]

Start converting descriptors to vectors


100%|████████████████████████████████████████████████████████████████████████████| 4991/4991 [00:02<00:00, 2242.02it/s]


Took 0.037369 minutes to vectorize the dataset


## Load a pretrained FastText model and saved indexes for CDE elements

In [7]:
cde_data_modellers = cdm.CDE_data_modeller(cde_database_file, params)
cde_data_modellers.load_model_and_cde_indexes(model_dir+'fasttext/')

Loading CDE database... please wait
Took 0.003666  minutes to load CDE database..


## Load and parse training data

The training data are a set of clinical data files which record clinical information about patients, e.g. gender, age, disease_type, disease_sub_type, treatment received etc. They are in table format, where the rows represent patients and the columns represent clinical parameters. For the training data, the CDE data element corresponding to each clinical parameter is provided. This information can be used to train machine learning algorithms to predict CDE elements for new clinical parameters.

In [8]:
tdpr = tdp.TCGA_data_processor(clinical_data_files_dir,True )
tcga_data = tdpr.get_parsed_data()


  1%|█                                                                                 | 3/220 [00:00<00:10, 20.83it/s]

 Processing clinical metadata.. please wait..


100%|████████████████████████████████████████████████████████████████████████████████| 220/220 [00:08<00:00, 25.98it/s]


The parser returns four types of information about the clinical parameters.

1. The name of each parameter (e.g. age, gender, etc.)
2. The list of values for each parameter (except id columns, continuous variables etc.)
3. The data type of the values. For instance, the data type of 'age' is 'number' and the data type of 'gender' is 'string'.
4. A dictionary mapping each clinical parameter to its corresponding CDE element (the gold standard).

See the parsed data below.

In [8]:
tcga_data.keys()

dict_keys(['headers', 'values', 'value_type', 'gold_standard'])

In [9]:
tcga_data['headers'] # tcga_data['values'] / tcga_data['value_type'] / tcga_data['gold_standard']

{'bcr_patient_barcode': ['bcr', 'patient', 'barcode'],
 'bcr_drug_barcode': ['bcr', 'drug', 'barcode'],
 'bcr_drug_uuid': ['bcr', 'drug', 'uuid'],
 'form_completion_date': ['form', 'completion', 'date'],
 'pharmaceutical_therapy_drug_name': ['pharmaceutical',
  'therapy',
  'drug',
  'name'],
 'clinical_trial_drug_classification': ['clinical',
  'trial',
  'drug',
  'classification'],
 'pharmaceutical_therapy_type': ['pharmaceutical', 'therapy', 'type'],
 'pharmaceutical_tx_started_days_to': ['pharmaceutical',
  'tx',
  'started',
  'days',
  'to'],
 'pharmaceutical_tx_ongoing_indicator': ['pharmaceutical',
  'tx',
  'ongoing',
  'indicator'],
 'pharmaceutical_tx_ended_days_to': ['pharmaceutical',
  'tx',
  'ended',
  'days',
  'to'],
 'treatment_best_response': ['treatment', 'best', 'response'],
 'days_to_stem_cell_transplantation': ['days',
  'to',
  'stem',
  'cell',
  'transplantation'],
 'number_cycles': ['number', 'cycles'],
 'pharm_regimen': ['pharm', 'regimen'],
 'pharm_regimen

## Create base table for unsupervised learning

To create the base table, I vectorized the long_names and permissible values of each CDE element.





In [9]:
abt = cde_data_modellers.create_base_for_unsupervised_learning()

## Train an un-supervised machine learning model using the abt created above
I created a wrapper class called Model (in the create_models module) where a number of supervised and unsupervised learning algorithms from the sklearn library are implemented. The type and parameters of the model can be passed to this class using the params dictionary. 

!!! Warning: Currently the two data type columns in the abt (data_type_string, data_type_number) are complementary and hence redundant. Only one should be used for modelling. This needs to be corrected in future versions.
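
The redundancy mentioned in the warning can be seen on a toy example. The column names follow the warning above, while the data and the choice of which indicator to keep are assumptions made for illustration.

```python
# Toy data types for three clinical parameters.
value_types = ["string", "number", "string"]

# Full one-hot encoding: two columns that always sum to 1.
data_type_string = [1 if t == "string" else 0 for t in value_types]
data_type_number = [1 if t == "number" else 0 for t in value_types]
is_redundant = all(s + n == 1 for s, n in zip(data_type_string, data_type_number))
print(is_redundant)

# Keeping a single indicator carries the same information.
features = data_type_number
print(features)
```

With only two categories, either column fully determines the other, so dropping one loses nothing.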

In [10]:


# import the Model class
from cde_modelling.modelling.create_models import Model

# create model
model = Model(params)

#fit the model
model.fit(abt)



## Predict CDE elements for the clinical parameters

Step 1: Create an abt from the TCGA dataset. This is done by vectorizing the headers (clinical parameters) and values, dummy-encoding the data types, and then merging these vectors side by side to create a design matrix (dataset). For this, I have created the method 'create_base_for_unsupervised_learning' in the CDE_data_modeller class. Note that the class has another method called 'create_abt', which creates the base table for supervised learning. Each row in the former case represents a header (clinical parameter), whereas in the latter case it represents a tuple (header, cde_element). Therefore, the base table in the unsupervised case has as many rows as there are headers (clinical parameters) in the dataset, while in the supervised case the number of rows = the number of CDE elements times the number of headers (when no undersampling is performed). <br><br>

Step 2: Make predictions using the trained model

Note that I have created a model.predict_and_convert_to_json function which returns the predictions in the following format: <br>
{
clinical parameter 1: [most likely prediction, 2nd most likely prediction, .... , 20th most likely prediction] <br>
clinical parameter 2: [most likely prediction, 2nd most likely prediction, .... , 20th most likely prediction] <br>
.....
clinical parameter n: [most likely prediction, 2nd most likely prediction, .... , 20th most likely prediction] <br>
}
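
The row counts described in Step 1 can be checked on toy data. The sketch below uses random numbers in place of real embeddings, and it assumes that supervised pair features are formed by concatenating the header features with the CDE features. That construction is not confirmed by the notebook and may differ from what 'create_abt' does.

```python
import numpy as np

n_headers, n_cdes, dim = 3, 4, 2                   # toy sizes
rng = np.random.default_rng(0)
header_vecs = rng.normal(size=(n_headers, dim))    # stand-in header embeddings
value_vecs = rng.normal(size=(n_headers, dim))     # stand-in value embeddings
is_number = np.array([[1], [0], [0]])              # single data-type indicator

# Unsupervised base table: one row per header, blocks merged side by side.
unsupervised_abt = np.hstack([header_vecs, value_vecs, is_number])

# Supervised base table: one row per (header, cde_element) pair.
cde_feats = rng.normal(size=(n_cdes, 2 * dim + 1))
pairs = [(h, c) for h in range(n_headers) for c in range(n_cdes)]
supervised_abt = np.array(
    [np.concatenate([unsupervised_abt[h], cde_feats[c]]) for h, c in pairs]
)
print(unsupervised_abt.shape, supervised_abt.shape)
```

With 3 headers and 4 CDE elements, the unsupervised table has 3 rows and the supervised table has 12, which is why undersampling of negative pairs becomes relevant for a large CDE database.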



In [11]:
# convert the tcga data to features
tcga_abt = cde_data_modellers.create_base_for_unsupervised_learning(tcga_data)

tcga_predictions = model.predict_and_convert_to_json(tcga_abt)


100%|██████████████████████████████████████████████████████████████████████████████| 713/713 [00:00<00:00, 8073.11it/s]
  0%|                                                                                          | 0/702 [00:00<?, ?it/s]

Start converting descriptors to vectors
Took 0.001714 minutes to vectorize the dataset
Start converting descriptors to vectors


100%|███████████████████████████████████████████████████████████████████████████████| 702/702 [00:02<00:00, 304.56it/s]


Took 0.038544 minutes to vectorize the dataset


## Calculate accuracy of prediction

In [12]:

tcga_accuracy = ac.calculate_accuracy(tcga_data['gold_standard'],tcga_predictions)

In [13]:
tcga_accuracy

0.41907433380084186
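
The notebook does not show how `ac.calculate_accuracy` scores a list of predictions. One plausible reading, assumed here and not confirmed by the source, is a top-k hit rate: the fraction of clinical parameters whose gold-standard CDE id appears among the k ranked predictions. The function name and the toy ids below are hypothetical.

```python
def top_k_hit_rate(gold_standard, predictions, k=20):
    """Fraction of parameters whose gold CDE id is among the top-k predictions."""
    scored = [name for name in gold_standard if name in predictions]
    if not scored:
        return 0.0
    hits = sum(str(gold_standard[name]) in predictions[name][:k] for name in scored)
    return hits / len(scored)


# Hypothetical gold standard and ranked predictions.
gold = {"age": "1001", "gender": "1002", "race": "1003"}
preds = {
    "age": ["1001", "1003"],
    "gender": ["1003", "1001"],
    "race": ["1002", "1003"],
}
print(top_k_hit_rate(gold, preds, k=2))
print(top_k_hit_rate(gold, preds, k=1))
```

A stricter metric (smaller k, or weighting hits by rank) would give lower scores on the same predictions, so the metric definition matters when comparing runs.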

## Test the model on the test dataset

To do that, we shall first create the base table for the test dataset by parsing and indexing the test set in the same way as was done for the training set.

In [14]:
tdp1 = tdp.TCGA_data_processor(clinical_data_test_dir,False )
test_data = tdp1.get_parsed_data()
test_abt =cde_data_modellers.create_base_for_unsupervised_learning(test_data)

 50%|██████████████████████████████████████████                                          | 3/6 [00:00<00:00, 22.05it/s]

 Processing clinical metadata.. please wait..


100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 18.75it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 225/225 [00:00<00:00, 7029.19it/s]
 44%|███████████████████████████████████                                            | 100/225 [00:00<00:00, 961.30it/s]

Start converting descriptors to vectors
Took 0.000533 minutes to vectorize the dataset
Start converting descriptors to vectors


100%|███████████████████████████████████████████████████████████████████████████████| 225/225 [00:00<00:00, 720.97it/s]

Took 0.005379 minutes to vectorize the dataset





## Make predictions for the test dataset using the trained model

In [15]:
results = model.predict_and_convert_to_json(test_abt)


In [17]:
results

{'bcr_subject_code': ['6642901',
  '3738935',
  '3902309',
  '6790551',
  '6422470',
  '2866635',
  '2998131',
  '6411566',
  '3124770',
  '2431571',
  '3734524',
  '2708071',
  '2241117',
  '5042480',
  '2848548',
  '4168555',
  '2735254',
  '2534855',
  '2724803',
  '2752812'],
 'subject_gender': ['5596079',
  '3124770',
  '2608196',
  '2804953',
  '2724803',
  '5042480',
  '2608256',
  '2968039',
  '2521657',
  '3025101',
  '2475573',
  '2968264',
  '2528470',
  '2969380',
  '5042478',
  '2939287',
  '2672252',
  '2475554',
  '6790551',
  '2732976'],
 'subject_race': ['5596079',
  '3978832',
  '3199145',
  '5042480',
  '3528623',
  '2969380',
  '6790551',
  '4718161',
  '2724803',
  '2716767',
  '3882972',
  '5042478',
  '2968264',
  '2825041',
  '3629992',
  '3101859',
  '2946951',
  '2435424',
  '2780896',
  '3865396'],
 'pharm_tx_mitotane_indicator': ['3738935',
  '2756539',
  '2752812',
  '3734524',
  '2534855',
  '5432711',
  '3902309',
  '2625737',
  '2342457',
  '2438173',
  

## Calculate accuracy of prediction for the test dataset

Note that participants won't have access to the gold standard for the test data, and therefore won't be able to perform the following step. However, participants can divide the training data into train, validation and test sets and perform the following on their own held-out test set.
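
A minimal sketch of such a split, using only the standard library, is shown below. The function name, the split fractions and the parameter names are assumptions made for this example. In practice the names would come from the keys of the training gold standard.

```python
import random


def split_parameters(names, test_frac=0.2, val_frac=0.2, seed=42):
    """Shuffle parameter names and split them into train/validation/test lists."""
    names = sorted(names)                  # fixed order before the seeded shuffle
    random.Random(seed).shuffle(names)
    n_test = int(len(names) * test_frac)
    n_val = int(len(names) * val_frac)
    test_part = names[:n_test]
    val_part = names[n_test:n_test + n_val]
    train_part = names[n_test + n_val:]
    return train_part, val_part, test_part


toy_names = [f"param_{i}" for i in range(10)]   # stand-in for gold-standard keys
train_names, val_names, test_names = split_parameters(toy_names)
print(len(train_names), len(val_names), len(test_names))
```

Fixing the seed keeps the split reproducible across runs, so accuracy figures from different models remain comparable.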

In [16]:
test_gs = {}
with open(test_gold_standard, 'rb') as file:
    test_gs = json.load(file)
test_accuracy = ac.calculate_accuracy(test_gs,results)

In [17]:
test_accuracy

0.0951111111111111

## Log model, model parameters etc. using mlflow

This will ensure reproducibility of results and will keep track of all models and results during the model development and calibration.

In [18]:
with mlflow.start_run():
    # print out current run_uuid
    run_uuid = mlflow.active_run().info.run_uuid
    print("MLflow Run ID: %s" % run_uuid)
    
    # log parameters
    mlflow.log_param("window_size", params["fasttext"]["window"])
    mlflow.log_param("min_count", params["fasttext"]["min_count"])
    mlflow.log_param("epochs", params["fasttext"]["epochs"])
    mlflow.log_param("vector_size", params["fasttext"]["vector_size"])
      
    mlflow.log_param("model_type", params['model']["name"])
    
    for k in params['model']['model_params'].keys():
        mlflow.log_param("model_params_"+k, params['model']["model_params"][k])
    
    # log metrics
    mlflow.log_metric('train_accuracy',tcga_accuracy)    
    mlflow.log_metric("test_accuracy",test_accuracy)
    
    #mlflow.sklearn.logmodel()
    with open('models/'+run_uuid+'.pkl','wb') as file:
        pickle.dump(model, file)
    
    mlflow.end_run()

MLflow Run ID: 7927668ee0c241268d84b08ed3f5722b


### Use mlflow to track models 

<img src="img/mlflow_ui.png">