# Auto Vocab Mapping
___

## POC 1 - Vector Space Search

For the first POC I'll focus on source and target descriptions only. So I just need previously matched sources and targets.

In [1]:
import pandas as pd

Read the CHUC example files and see what's in them

In [2]:
chuc_s_df = pd.read_csv("../lib/data/raw/source_codes_description/chuc/analises_cod_acto.csv")
chuc_s2c_df = pd.read_csv("../lib/data/raw/source_to_concept/chuc/source_to_standard_analises_cod_acto.csv")
concept = pd.read_csv("../lib/data/raw/vocabularies/CONCEPT.csv", low_memory=False)

In [3]:
concept.head()

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,45756805,Pediatric Cardiology,Provider,ABMS,Physician Specialty,S,OMOP4821938,19700101,20991231,
1,45756804,Pediatric Anesthesiology,Provider,ABMS,Physician Specialty,S,OMOP4821939,19700101,20991231,
2,45756803,Pathology-Anatomic / Pathology-Clinical,Provider,ABMS,Physician Specialty,S,OMOP4821940,19700101,20991231,
3,45756802,Pathology - Pediatric,Provider,ABMS,Physician Specialty,S,OMOP4821941,19700101,20991231,
4,45756801,Pathology - Molecular Genetic,Provider,ABMS,Physician Specialty,S,OMOP4821942,19700101,20991231,


In [4]:
concept['concept_id'].dtype

dtype('int64')

In [5]:
set_dtype = concept['concept_id'].dtype

Make dict to map quickly

In [6]:
target_dict = dict(zip(concept['concept_id'], concept['concept_name']))

From here I need concept_id and concept_name to map

In [7]:
chuc_s_df.head()

Unnamed: 0,source_code,source_description,translated_source_description
0,A21900,"FERRO, S","Ferro, s"
1,X34281,"ANÁLISE POR SEQUENCIAÇÃO EM LARGA ESCALA (~0,5MB",LARGE SCALE SEQUENCE ANALYSIS (~0.5MB
2,A22375,CYFRA 21-1,DIGIT 21-1
3,A21646,"DELTA-4-ANDROSTENEDIONA, S","DELTA-4-ANDROSTENEDIONA, S"
4,A25520,ANTICORPOS ANTI-NUCLEARES E CITOPLASMATICOS (A...,ANTI-NUCLEAR AND CYTOPLASMATIC ANTIBODIES (ANT...


These are translations. We're not going into this for now; a separate exploration will be carried out for this topic alone. We could fine-tune a model on our own medical data, which has its own specificities. We'll need:
- Medical terms translation
- Acronym disambiguation

In [8]:
chuc_s2c_df.head()

Unnamed: 0,source_code,source_concept_id,source_vocabulary_id,source_code_description,target_concept_id,target_vocabulary_id,valid_start_date,valid_end_date,invalid_reason
0,A22793,0,analises_cod_acto,"SODIO, S/U",3022810,LOINC,1970-01-01,2099-12-31,
1,A24347,0,analises_cod_acto,"TEMPO DE PROTROMBINA, S",4245261,SNOMED,1970-01-01,2099-12-31,
2,A21789,0,analises_cod_acto,"EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S",3013290,LOINC,1970-01-01,2099-12-31,
3,A21789,0,analises_cod_acto,"EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S",3027315,LOINC,1970-01-01,2099-12-31,
4,A21789,0,analises_cod_acto,"EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S",4013965,SNOMED,1970-01-01,2099-12-31,


From here I need the source code description and the target concept id. This is what we'll need in large quantities if we want to train a translator or a classifier.

In [9]:
chuc_df = chuc_s2c_df[["source_code_description", "target_concept_id"]]
chuc_df.head()

Unnamed: 0,source_code_description,target_concept_id
0,"SODIO, S/U",3022810
1,"TEMPO DE PROTROMBINA, S",4245261
2,"EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S",3013290
3,"EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S",3027315
4,"EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S",4013965


### Map target concepts and check missing values

In [10]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
chuc_df = chuc_df.assign(
    concept_name=chuc_df['target_concept_id'].astype(set_dtype).map(target_dict))


In [11]:
chuc_df[chuc_df.isna().any(axis=1)]

Unnamed: 0,source_code_description,target_concept_id,concept_name


In [12]:
chuc_s2c = chuc_df.dropna()

In [13]:
chuc_s2c.head()

Unnamed: 0,source_code_description,target_concept_id,concept_name
0,"SODIO, S/U",3022810,Sodium [Moles/volume] in Body fluid
1,"TEMPO DE PROTROMBINA, S",4245261,Prothrombin time
2,"EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S",3013290,Carbon dioxide [Partial pressure] in Blood
3,"EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S",3027315,Oxygen [Partial pressure] in Blood
4,"EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S",4013965,"Oxygen saturation measurement, arterial"


In [14]:
sources = chuc_s2c["source_code_description"].tolist()
sources[:10]

['SODIO, S/U',
 'TEMPO DE PROTROMBINA, S',
 'EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S',
 'EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S',
 'EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S',
 'EQUILIBRIO ACIDO BASICO (PH, PC02,SAT O2,CO2,), S',
 'PESQUISA DE RNA DO VÍRUS SARS-COV-2 POR PCR EM TEMPO REAL',
 'SODIO, S/U',
 'POTASSIO, S/U',
 'GLUCOSE, DOSEAMENTO, S/U/L']

In [15]:
targets = chuc_s2c["concept_name"].tolist()
targets[:10]

['Sodium [Moles/volume] in Body fluid',
 'Prothrombin time',
 'Carbon dioxide [Partial pressure] in Blood',
 'Oxygen [Partial pressure] in Blood',
 'Oxygen saturation measurement, arterial',
 'Hydrogen ion concentration',
 'PCR test for SARS',
 'Sodium measurement, serum',
 'Potassium level',
 'Glucose measurement, plasma']

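Some language models (for example the E5 family used below) are trained with `query: ` and `passage: ` prefixes and expect them at inference time. Both sides get the `query: ` prefix here, treating source-to-target matching as a symmetric similarity task.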

In [16]:
sources = [("query: " + i) for i in sources]
targets = [("query: " + i) for i in targets]
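
The prefixing step can be wrapped in a small helper so that re-running the cell does not stack the prefix twice. This is only a sketch: `add_prefix` is a hypothetical name introduced here, not a function from any library used in this notebook.

```python
def add_prefix(texts, prefix="query: "):
    """Prepend `prefix` to each text unless the text already starts with it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


# The second example is already prefixed and is left untouched.
examples = ["SODIO, S/U", "query: POTASSIO, S/U"]
prefixed = add_prefix(examples)
print(prefixed)
```

Passing `prefix=""` leaves the texts unchanged, which is convenient when comparing models with and without a prefix.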

In [17]:
assert len(sources) == len(targets)

### Encode texts into fixed-size, mean-pooled vectors

Encode using torch. 

In [18]:
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel
import numpy as np


class TextEncoder:
    def __init__(self, model):
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.model = AutoModel.from_pretrained(model)

    def encode(self, texts):
        # Tokenize the input texts
        batch_dict = self.tokenizer(texts,
                                    max_length=512,
                                    padding=True,
                                    truncation=True,
                                    return_tensors='pt')
        outputs = self.model(**batch_dict)
        embeddings = TextEncoder.__average_pool(
            outputs.last_hidden_state, batch_dict['attention_mask'])

        # Normalize embeddings
        embeddings = F.normalize(embeddings, p=2, dim=1)
        return np.array(embeddings.detach(), dtype=np.float32)

    @staticmethod
    def __average_pool(last_hidden_states: Tensor,
                       attention_mask: Tensor) -> Tensor:
        last_hidden = last_hidden_states.masked_fill(
            ~attention_mask[..., None].bool(), 0.0)
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
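
To make the pooling step concrete, here is a dependency-free sketch of masked mean pooling on a single toy sequence. It mirrors the idea of `__average_pool` (padding positions are ignored before averaging) but uses plain lists instead of tensors; `masked_mean_pool` and the toy values are assumptions made for illustration.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# One sequence with three token slots; the last slot is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would dominate the average, which is why the attention mask is applied before pooling.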

In [19]:
model_name = 'intfloat/multilingual-e5-small'
embeddings = TextEncoder(model_name).encode(sources)

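The Hugging Face `tokenizers` library disables its parallelism when the process is forked after a tokenizer has already been used, to avoid deadlocks that would be hard to debug. Setting `TOKENIZERS_PARALLELISM` explicitly avoids the warning.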

In [20]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"

In [21]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/multilingual-e5-small')
sources_emb = model.encode(sources, normalize_embeddings=True)
targets_emb = model.encode(targets, normalize_embeddings=True)

The sentence_transformers implementation is faster than my manual approach, so I'll stick to that. If it turns out to be incompatible with a newer model, I'll fall back to mine.

In [22]:
embeddings.shape

(428, 384)

In [23]:
embeddings[:10]

array([[ 0.04107222, -0.02139925, -0.0353516 , ...,  0.10612641,
         0.06803897,  0.00305798],
       [ 0.03053902,  0.00532008, -0.00736805, ...,  0.07866557,
         0.07970123,  0.03390224],
       [ 0.03296087, -0.02778542, -0.01535674, ...,  0.06343906,
         0.08069205,  0.02741423],
       ...,
       [ 0.04107222, -0.02139925, -0.0353516 , ...,  0.10612641,
         0.06803897,  0.00305798],
       [ 0.06119867, -0.0139116 , -0.02081711, ...,  0.07180168,
         0.05883745,  0.0364945 ],
       [ 0.04395493,  0.00539457, -0.05504538, ...,  0.07089277,
         0.07933093,  0.05188119]], dtype=float32)

Everything seems fine with the resulting vector space.

# PCA
Exploring projections in the vector space

In [24]:
from sklearn.decomposition import PCA

def compute_pca(vectors):
    pca = PCA()
    pca.fit(vectors)
    pcs = pca.transform(vectors)
    return pcs

In [25]:
import sys
sys.path.insert(0, '..') # add parent folder path

from utils.plotting import plot_pca, parallel

In [26]:
stacked = np.vstack([sources_emb, targets_emb])
print(stacked.shape)

(856, 384)


In [27]:
stacked_pcs = compute_pca(stacked)

Prepare labels, clusters and colors for PCA

In [28]:
# labels
names = sources + targets
sources_ids = ["source" for _ in sources]
targets_ids = ["target" for _ in targets]
group_names = sources_ids + targets_ids
# colors
color_by_group = sources_ids + targets_ids
individual_names = targets + targets

In [29]:
plot_pca(pcs=stacked_pcs, colors=color_by_group, names=group_names, title='PCA colored by group (source, target)')

The clusters relate to the languages of the source and target descriptions.

Matched pairs (source, target) are given the same color below, so points sharing a color should lie close together.

In [30]:
plot_pca(pcs=stacked_pcs[:20], colors=individual_names[:20], names=individual_names[:20], title="PCA of 20 matched examples (same colors should be closer)")

In [31]:
source_dict = dict(zip(range(len(sources)), sources))
target_dict = dict(zip(range(len(targets)), targets))

In [32]:
rand_number = np.random.choice(len(sources), 1, replace=True)[0]
source_example = sources_emb[rand_number]

In [33]:
print(f' source: {source_dict[rand_number]};\n target: {target_dict[rand_number]}')

 source: query: HEMOGRAMA COM FORMULA LEUCOCITARIA (ERITROGRAMA, CONTAGEM DE LEUCOCITOS, CONTAGEM DE PLAQU;
 target: query: Complete blood count with white cell differential, automated


# Test distance: Compute L2-normalized inner product

In [34]:
import faiss


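def norml2_innerproduct(feature_space, query):
    # Exact (flat) index that ranks by inner product
    index = faiss.index_factory(
        feature_space.shape[1], "Flat", faiss.METRIC_INNER_PRODUCT)
    # L2-normalize in place so that inner product equals cosine similarity
    faiss.normalize_L2(feature_space)
    index.add(feature_space)
    # Rank every target for this query, most similar first
    similarities, indices = index.search(
        np.array([query]), k=feature_space.shape[0])

    return similarities, indices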
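
As a sanity check on the metric, the sketch below shows in plain Python (no faiss) that for unit-length vectors the inner product is the cosine similarity, and that the squared L2 distance equals `2 - 2 * cosine`. Ranking by largest inner product is therefore the same as ranking by smallest L2 distance. The helper names and toy vectors are assumptions for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])  # [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # [0.8, 0.6]

cosine = dot(a, b)  # inner product of unit vectors
sq_l2 = sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean distance

print(round(cosine, 4))
print(round(sq_l2, 4), round(2 - 2 * cosine, 4))
```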

In [35]:
distance, index = norml2_innerproduct(targets_emb, source_example)

In [36]:
print(f' source: {source_dict[rand_number]};\n target: {target_dict[index[0][0]]}')

 source: query: HEMOGRAMA COM FORMULA LEUCOCITARIA (ERITROGRAMA, CONTAGEM DE LEUCOCITOS, CONTAGEM DE PLAQU;
 target: query: Hemoglobin C/Hemoglobin.total in Blood by HPLC


In [39]:
top1 = 0
top5 = 0
top10 = 0
total = len(sources)
for i in range(total):
    source_example = sources_emb[i]
    distance, index = norml2_innerproduct(targets_emb, source_example)
    
    if i == index[0][0]:
        top1+=1
        top5+=1
        top10+=1
    elif i in index[0][:5]:
        top5+=1
        top10+=1
    elif i in index[0][:10]:
        top10+=1
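
The counting logic above can be written as a small reusable function. This is a minimal sketch, assuming (as the loop does) that the correct target for query `i` sits at row `i`; `recall_at_k` and the toy rankings are hypothetical and not taken from the notebook's data.

```python
def recall_at_k(ranked_indices, gold_indices, ks=(1, 5, 10)):
    """Fraction of queries whose gold index appears in the top-k results."""
    hits = {k: 0 for k in ks}
    for ranking, gold in zip(ranked_indices, gold_indices):
        for k in ks:
            if gold in ranking[:k]:
                hits[k] += 1
    total = len(gold_indices)
    return {k: hits[k] / total for k in ks}


# Toy rankings for three queries (illustrative values only).
rankings = [
    [0, 7, 3, 9, 4, 1],  # gold 0 is ranked first
    [5, 2, 1, 8, 6, 0],  # gold 1 is ranked third
    [4, 3, 9, 8, 7, 6],  # gold 2 is not retrieved
]
scores = recall_at_k(rankings, gold_indices=[0, 1, 2])
print(scores)
```

One caveat of matching by row index: if the same target text appears in several rows, a retrieved duplicate at a different row is counted as a miss, so the reported recall may be a conservative estimate.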

    

In [40]:
print(f"""
      Top 1 match: {top1/total:.2%};
      Top 5 match: {top5/total:.2%};
      Top 10 match: {top10/total:.2%};
      Total number of tests: {len(sources)}
    """)


      Top 1 match: 44.16%;
      Top 5 match: 70.56%;
      Top 10 match: 78.27%;
      Total number of tests: 428
    


Distance metric seems to be correctly implemented. Now we need more examples to test on.

# Expand the number of examples


In [41]:
import sys
sys.path.insert(0, '..') # add parent folder path
from data_preprocessors import RawDataProcessor

In [42]:
hospital_folders = ["../lib/data/raw/source_to_concept/chuc/", "../lib/data/raw/source_to_concept/hds/"]
concept_vocab = "../lib/data/raw/vocabularies/CONCEPT.csv"

rdp = RawDataProcessor(vocab_file=concept_vocab, hospital_folders=hospital_folders)
sources, targets = rdp.join_source_target()

In [43]:
assert len(sources) == len(targets)

In [44]:
len(sources)

2222

In [45]:
source_dict = dict(zip(range(len(sources)), sources))
target_dict = dict(zip(range(len(targets)), targets))

In [46]:
import pickle 
with open('../lib/artifacts/dicts/sources.pickle', 'wb') as handle:
    pickle.dump(source_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('../lib/artifacts/dicts/targets.pickle', 'wb') as handle:
    pickle.dump(target_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)


# Select models

In [47]:
list_of_models = [ 
                  "sentence-transformers/distiluse-base-multilingual-cased-v2", # 2019 maps sentences & paragraphs to a 512 dimensional dense vector space and can be used for tasks like clustering or semantic search
                  "sentence-transformers/paraphrase-multilingual-mpnet-base-v2", # 2019 maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search
                  "mixedbread-ai/mxbai-embed-large-v1", # 2024 It achieves SOTA performance on BERT-large scale (feature extraction)
                  'intfloat/multilingual-e5-small', # 2024 This model has 12 layers and the embedding size is 384 (sentence similarity)
                  "intfloat/multilingual-e5-base", # 2024 This model has 12 layers and the embedding size is 768 (sentence similarity)
                  "intfloat/multilingual-e5-large", # 2024 This model has 24 layers and the embedding size is 1024 (sentence similarity)
                  "sentence-transformers/all-MiniLM-L6-v2", # maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search
                  "sentence-transformers/all-MiniLM-L12-v2", # maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search
                  # "Henrychur/MMedLM2", # too large for now
                  "medicalai/ClinicalBERT" # 2023 The ClinicalBERT model was trained on a large multicenter dataset with a large corpus of 1.2B words of diverse diseases
]

# Top1, Top5, and Top10 recall for 2222 examples

For testing we need to apply different query prefixes, since some models are trained with specific prefixes for queries and passages.

Here, we'll test with a "query: " prefix and without it. The performance differences can be quite unpredictable.

The test function below tracks whether the model needs remote code to run (some models do) and the recall@k metrics for each model with each prefix.

By default, the test uses inner-product search over L2-normalized embeddings, which is equivalent to cosine similarity.





In [52]:
from tqdm import tqdm
from time import time
from sentence_transformers import SentenceTransformer


def test_models(models: list, sources: list, targets: list):

    # Store results
    results = []

    for plm in tqdm(models, desc="Testing models: "):

        # Load model
        needs_remote_code = 0
        try:
            model = SentenceTransformer(plm, trust_remote_code=False)
        except ValueError:
            model = SentenceTransformer(plm, trust_remote_code=True)
            needs_remote_code = 1
        
        for query_prefix in ['', 'query: ']:

            mod_sources = [(query_prefix + i) for i in sources]
            mod_targets = [(query_prefix + i) for i in targets]

            # Track results
            top1 = 0
            top5 = 0
            top10 = 0
            total = len(sources)

            # Encode
            sources_emb = model.encode(mod_sources, normalize_embeddings=True)
            targets_emb = model.encode(mod_targets, normalize_embeddings=True)

            # Track search time (encoding happens above and is not included)
            start = time()
            for i in tqdm(range(total), leave=False):

                # Compute distances
                source_example = sources_emb[i]
                distance, index = norml2_innerproduct(targets_emb, source_example)

                # Check matches
                if i == index[0][0]:
                    top1 += 1
                    top5 += 1
                    top10 += 1
                elif i in index[0][:5]:
                    top5 += 1
                    top10 += 1
                elif i in index[0][:10]:
                    top10 += 1

            # Compute time
            end = time()
            elapsed_seconds = end - start

            results.append(
                {   
                    "plm": plm + '__query_prefix__' + query_prefix,
                    "remote_code": needs_remote_code,
                    "Top-1 match": top1/total,
                    "Top-5 match": top5/total,
                    "Top-10 match": top10/total,
                    "Total number of tests": len(sources),
                    "Elapsed seconds": elapsed_seconds,
                    "Predictions per second X 1000": len(sources)/elapsed_seconds/1000
                }
            )

    return results

In [53]:
results = test_models(list_of_models, sources, targets)

Testing models:  89%|████████▉ | 8/9 [02:59<00:19, 19.19s/it]No sentence-transformers model found with name medicalai/ClinicalBERT. Creating a new one with MEAN pooling.
Testing models: 100%|██████████| 9/9 [03:11<00:00, 21.31s/it]


Append USAGI's reported results

In [54]:
usagis = {"plm": 'USAGI', "Top-1 match": 0.42, "Top-5 match": 0.58, "Top-10 match": 0.62} # From toki paper
results.append(usagis)

In [55]:
import pandas as pd
results_df = pd.DataFrame.from_dict(results)
results_df

Unnamed: 0,plm,remote_code,Top-1 match,Top-5 match,Top-10 match,Total number of tests,Elapsed seconds,Predictions per second X 1000
0,sentence-transformers/distiluse-base-multiling...,0.0,0.233573,0.446445,0.536004,2222.0,1.465041,1.516681
1,sentence-transformers/distiluse-base-multiling...,0.0,0.244824,0.439244,0.518902,2222.0,1.42276,1.561753
2,sentence-transformers/paraphrase-multilingual-...,0.0,0.365437,0.617462,0.680018,2222.0,2.086202,1.065093
3,sentence-transformers/paraphrase-multilingual-...,0.0,0.364986,0.607111,0.676868,2222.0,2.038171,1.090193
4,mixedbread-ai/mxbai-embed-large-v1__query_pref...,0.0,0.480198,0.761026,0.823582,2222.0,2.803631,0.792544
5,mixedbread-ai/mxbai-embed-large-v1__query_pref...,0.0,0.469847,0.737624,0.809181,2222.0,2.751335,0.807608
6,intfloat/multilingual-e5-small__query_prefix__,0.0,0.484248,0.714671,0.775878,2222.0,1.142786,1.944371
7,intfloat/multilingual-e5-small__query_prefix__...,0.0,0.479748,0.718722,0.784878,2222.0,1.154759,1.924211
8,intfloat/multilingual-e5-base__query_prefix__,0.0,0.471647,0.710171,0.777678,2222.0,2.056411,1.080523
9,intfloat/multilingual-e5-base__query_prefix__q...,0.0,0.469397,0.693069,0.763726,2222.0,2.052016,1.082838
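
The `Predictions per second X 1000` column is computed in `test_models` as `len(sources) / elapsed_seconds / 1000`, so it reads as thousands of predictions per second, and the elapsed time covers only the search loop, not the encoding. A short worked example using the values reported for row 6 (`throughput_in_thousands` is a hypothetical helper name):

```python
def throughput_in_thousands(n_predictions, elapsed_seconds):
    """Predictions per second, expressed in thousands."""
    return n_predictions / elapsed_seconds / 1000


# Values reported for intfloat/multilingual-e5-small without prefix (row 6).
value = throughput_in_thousands(2222, 1.142786)
print(round(value, 2))  # 1.94, i.e. roughly 1,940 searches per second
```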


Keep only the pretrained models that match or beat USAGI's reported recall on every metric

In [57]:
results_df = results_df.loc[
    (results_df['Top-1 match'] >= usagis['Top-1 match']) &
    (results_df['Top-5 match'] >= usagis['Top-5 match']) &
    (results_df['Top-10 match'] >= usagis['Top-10 match']), :]

results_df.sort_values(by=['Top-1 match', 'Top-5 match', 'Top-10 match'], ascending=False)

Unnamed: 0,plm,remote_code,Top-1 match,Top-5 match,Top-10 match,Total number of tests,Elapsed seconds,Predictions per second X 1000
11,intfloat/multilingual-e5-large__query_prefix__...,0.0,0.509001,0.733123,0.792979,2222.0,2.788705,0.796786
10,intfloat/multilingual-e5-large__query_prefix__,0.0,0.50045,0.743024,0.805581,2222.0,2.793587,0.795393
6,intfloat/multilingual-e5-small__query_prefix__,0.0,0.484248,0.714671,0.775878,2222.0,1.142786,1.944371
4,mixedbread-ai/mxbai-embed-large-v1__query_pref...,0.0,0.480198,0.761026,0.823582,2222.0,2.803631,0.792544
7,intfloat/multilingual-e5-small__query_prefix__...,0.0,0.479748,0.718722,0.784878,2222.0,1.154759,1.924211
8,intfloat/multilingual-e5-base__query_prefix__,0.0,0.471647,0.710171,0.777678,2222.0,2.056411,1.080523
5,mixedbread-ai/mxbai-embed-large-v1__query_pref...,0.0,0.469847,0.737624,0.809181,2222.0,2.751335,0.807608
9,intfloat/multilingual-e5-base__query_prefix__q...,0.0,0.469397,0.693069,0.763726,2222.0,2.052016,1.082838
14,sentence-transformers/all-MiniLM-L12-v2__query...,0.0,0.432493,0.69982,0.765077,2222.0,1.153568,1.926198
15,sentence-transformers/all-MiniLM-L12-v2__query...,0.0,0.424842,0.676868,0.746175,2222.0,1.161286,1.913396
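
The filter keeps a model only if it matches or beats the baseline on all three metrics at once. The same rule, sketched without pandas on three rows copied from the results table (the helper `beats_baseline` is a hypothetical name):

```python
baseline = {"Top-1 match": 0.42, "Top-5 match": 0.58, "Top-10 match": 0.62}

# Three rows copied from the results table (no-prefix variants).
rows = [
    {"plm": "sentence-transformers/distiluse-base-multilingual-cased-v2",
     "Top-1 match": 0.233573, "Top-5 match": 0.446445, "Top-10 match": 0.536004},
    {"plm": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
     "Top-1 match": 0.365437, "Top-5 match": 0.617462, "Top-10 match": 0.680018},
    {"plm": "intfloat/multilingual-e5-small",
     "Top-1 match": 0.484248, "Top-5 match": 0.714671, "Top-10 match": 0.775878},
]


def beats_baseline(row, baseline):
    """True when the row is at least as good as the baseline on every metric."""
    return all(row[metric] >= baseline[metric] for metric in baseline)


kept = [row["plm"] for row in rows if beats_baseline(row, baseline)]
print(kept)
```

Note that `paraphrase-multilingual-mpnet-base-v2` is dropped even though it beats the baseline on Top-5 and Top-10, because its Top-1 recall is lower.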


In [58]:
results_df = results_df.drop(columns=['Total number of tests'])

# Plot results

In [60]:

parallel(results_df, label='plm')         

# Conclusions

With this approach, the best models beat USAGI's reported performance (values from the literature, TOKI paper) on Top-1, Top-5 and Top-10 recall.

Another curious finding is that bigger is not necessarily better. One plausible explanation is that, as the number of dimensions grows, more information is captured but the points in space tend to become almost equally distant from each other, so the gain in information does not translate proportionally into discriminative power. As an example, `multilingual-e5-small` maps each text to a 384-dimensional vector while `multilingual-e5-large` maps it to a 1024-dimensional one. The large model improves Top-1 recall only slightly (about 0.51 versus 0.48) at the cost of RAM and inference speed. For now the small one seems best suited for a first POC, but these results could be marginally increased with a bigger one in the future.
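
The distance-concentration effect mentioned above can be illustrated on synthetic data. The sketch below draws uniform random points (not the embeddings from this notebook) and measures the relative contrast between the farthest and nearest neighbor of a random query; it is only an illustration of the general phenomenon under these assumptions, not evidence about the models themselves. `relative_contrast` is a hypothetical helper.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


low = relative_contrast(dim=2)
high = relative_contrast(dim=512)
print(f"dim=2: {low:.2f}, dim=512: {high:.2f}")
```

In low dimensions the nearest neighbor is much closer than the farthest one, while in high dimensions the two distances are of similar size, so the contrast shrinks.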