# Two-step linker

The two-step linker is a linker that attempts to learn and use type information as additional context.
It essentially uses the same context manager as the regular linker, but only for types.
Since there are generally far fewer types than concepts, and they are all quite different from one another, classifying between them should be easier.
This should help avoid linking to a concept of the wrong type when concepts of different types share the same name (e.g. _infusion (procedure)_ vs _infusion (method of drug administration)_).

Normally, changing the linker would mean retraining the model, because the new linker starts without any training of its own.
However, if the train counts are correct, we should be able to infer the per-type training from the existing per-concept training.
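
One way to picture this inference is a train-count weighted average of the per-concept context vectors. The sketch below uses made-up toy vectors and a hypothetical helper name (`weighted_average`); it only illustrates the arithmetic and is not MedCAT's own implementation.

```python
# Toy per-concept data: (context vector, training count).
# These values are made up for illustration only.
concepts = {
    "cui_a": ([1.0, 0.0], 3),
    "cui_b": ([0.0, 1.0], 1),
}

def weighted_average(vectors_and_counts):
    """Count-weighted mean of equally sized vectors; returns (vector, total count)."""
    total = sum(count for _, count in vectors_and_counts)
    if total == 0:
        raise ValueError("no training examples to average over")
    dim = len(vectors_and_counts[0][0])
    mean = [
        sum(count * vec[i] for vec, count in vectors_and_counts) / total
        for i in range(dim)
    ]
    return mean, total

type_vector, type_count = weighted_average(list(concepts.values()))
print(type_vector, type_count)  # [0.75, 0.25] 4
```

A concept that was trained more often pulls the type vector further towards its own context vector, which is why the train counts need to be correct.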

In [1]:
# let's reuse the supervised trained model, but we'll change the linker
from medcat.cat import CAT
cat = CAT.load_model_pack("../basic/models/sup_trained_model.zip")
print("Existing concept (and types),    their names,                    and corresponding training:")
for cui, ci in cat.cdb.cui2info.items():
    print(f"{cui:16s} ({ci['type_ids']}) \t {cat.cdb.get_name(cui):24s} \t {ci['count_train']}")

Existing concept (and types),    their names,                    and corresponding training:
73211009         (set()) 	 Diabetes Mellitus Diagnosed 	 2
396230008        (set()) 	 Wagner Unverricht Syndrome 	 1
44132006         (set()) 	 Abscess                  	 2
128477000        (set()) 	 Abscesses                	 2


We notice that the current concepts don't have type IDs.
So we have to add them before we can use 2-step linking.
Let's just divide them into two types:
- Ones that start with A
- Ones that don't start with A

This is an arbitrary distinction, but it should work for our example.
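
The split can also be sketched on plain dictionaries, independent of MedCAT. The names `names`, `type_to_cuis` and `cui_to_types` are hypothetical stand-ins for illustration, not MedCAT structures; the concept IDs and names are the ones printed above.

```python
# Concept names as printed above; plain dicts stand in for the CDB (assumption).
names = {
    "73211009": "Diabetes Mellitus Diagnosed",
    "396230008": "Wagner Unverricht Syndrome",
    "44132006": "Abscess",
    "128477000": "Abscesses",
}

# Step 1: map each type ID to the CUIs that belong to it.
type_to_cuis = {
    "SA": [cui for cui, name in names.items() if name.startswith("A")],
    "NA": [cui for cui, name in names.items() if not name.startswith("A")],
}

# Step 2: invert the mapping so each CUI knows its type IDs.
cui_to_types = {cui: set() for cui in names}
for type_id, cuis in type_to_cuis.items():
    for cui in cuis:
        cui_to_types[cui].add(type_id)

print(cui_to_types["44132006"], cui_to_types["73211009"])  # {'SA'} {'NA'}
```

Both directions of the mapping are needed: the type needs to know its concepts, and each concept needs to know its types.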

In [2]:
from medcat.cdb.concepts import TypeInfo
cat.cdb.type_id2info.update({
    "SA": TypeInfo(
        type_id="SA",
        name="Starts with A",
        cuis=[cui for cui in cat.cdb.cui2info if cat.cdb.get_name(cui).startswith("A")],
    ),
    "NA": TypeInfo(
        type_id="NA",
        name="Does not start with A",
        cuis=[cui for cui in cat.cdb.cui2info if not cat.cdb.get_name(cui).startswith("A")],
    ),
})
# NOTE: we also need to update the concept info, so that the new types are included
for cui, ci in cat.cdb.cui2info.items():
    for ti in cat.cdb.type_id2info.values():
        if cui in ti.cuis:
            ci["type_ids"].add(ti.type_id)

print("Existing concept (and types),    their names,                    and corresponding training:")
for cui, ci in cat.cdb.cui2info.items():
    print(f"{cui:16s} ({ci['type_ids']}) \t {cat.cdb.get_name(cui):24s} \t {ci['count_train']}")

Existing concept (and types),    their names,                    and corresponding training:
73211009         ({'NA'}) 	 Diabetes Mellitus Diagnosed 	 2
396230008        ({'NA'}) 	 Wagner Unverricht Syndrome 	 1
44132006         ({'SA'}) 	 Abscess                  	 2
128477000        ({'SA'}) 	 Abscesses                	 2


Now we've got the necessary information to use a 2-step linker.
We just need to get the context vectors from the existing training.

In [3]:
import numpy as np
from medcat.components.linking.two_step_context_based_linker import TYPE_ID_PREFIX
# change linking components
cat.config.components.linking.comp_name = 'medcat2_two_step_linker'
# recreate the pipe with the 2-step linker; this does not happen automatically
cat._recreate_pipe()
# update TypeID counts and context vectors
def get_context_vectors(tui: str) -> tuple[dict[str, np.ndarray | None], int]:
    # collect (context vectors, train count) for every CUI of this type
    cvs = [((ci := cat.cdb.cui2info[cui])['context_vectors'], ci['count_train'])
           for cui in cat.cdb.type_id2info[tui].cuis]
    # all context vector types (e.g. 'xlong') seen among this type's concepts
    vec_types = set(
        vt
        for ccvs, _ in cvs if ccvs
        for vt in ccvs)
    out_vec: dict[str, np.ndarray | None] = {}
    max_train = 0
    for vtype in vec_types:
        vecs: list[np.ndarray] = []
        counts: list[int] = []
        for ccvs, count in cvs:
            # skip untrained concepts (no vectors, or a zero train count)
            if ccvs and vtype in ccvs and count > 0:
                vecs.append(ccvs[vtype])
                counts.append(count)
        if vecs:
            # train-count weighted average of the per-concept vectors
            out_vec[vtype] = np.sum([c * v for c, v in zip(counts, vecs)], axis=0) / sum(counts)
            max_train = max(max_train, sum(counts))
    return out_vec, max_train

for tui, type_ci in cat.cdb.cui2info.items():
    if not tui.startswith(TYPE_ID_PREFIX):
        # ignore actual/regular CUIs
        continue
    type_ci['context_vectors'], type_ci['count_train'] = get_context_vectors(tui.removeprefix(TYPE_ID_PREFIX))
print("Existing concept (and types),    their names,                    and corresponding training:")
for key, ci in cat.cdb.cui2info.items():
    print(f"{key:16s} ({ci['type_ids']}) \t {cat.cdb.get_name(key):24s} \t {ci['count_train']}")

Existing concept (and types),    their names,                    and corresponding training:
73211009         ({'NA'}) 	 Diabetes Mellitus Diagnosed 	 2
396230008        ({'NA'}) 	 Wagner Unverricht Syndrome 	 1
44132006         ({'SA'}) 	 Abscess                  	 2
128477000        ({'SA'}) 	 Abscesses                	 2
TYPE_ID:SA       (set()) 	 Starts with A            	 4
TYPE_ID:NA       (set()) 	 Does not start with A    	 3


NB: Ideally, this is where you would start if you had done the entire process of model creation and training (e.g. basics/1 through 3) with a 2-step linker approach.

Now let's explore the model's output.
To tell the difference (since it's all happening under the hood), we need to enable logging.

In [4]:
abscess_text_morph = """Histopathology reveals a well-encapsulated abscess with central necrosis and neutrophilic infiltration."""
abscess_text_disorder = """An abscess is a disorder, which is a clinical condition characterized by the formation of a painful and inflamed mass containing purulent material"""

# enable logging
import logging
from medcat.components.linking.two_step_context_based_linker import logger as tsl_l
from medcat.components.linking.context_based_linker import logger as cbl_l
sh = logging.StreamHandler()
for l in [tsl_l, cbl_l]:
    if not l.handlers:
        l.addHandler(sh)
    l.setLevel(logging.DEBUG)

for text_num, text in enumerate([abscess_text_morph, abscess_text_disorder]):
    ents = cat.get_entities(text)['entities']
    print(text_num, ":", ents)

Narrowing down candidates for: 'abscess' from ['44132006', '128477000']
Adding per CUI to 128477000 (tokens 5..5) weights {'44132006': 0.7396468208767506, '128477000': 0.7396468208767506}
Linker started with entity: abscess
Mixing type similarity of 0.9929 and CUI similarity of 0.7396 with 0.7682 weight for CUI similarity
[Per CUI weights] CUI: 44132006, Name: abscess, Old sim: 0.993, New sim: 0.798
Mixing type similarity of 0.1344 and CUI similarity of 0.7396 with 0.7682 weight for CUI similarity
[Per CUI weights] CUI: 128477000, Name: abscess, Old sim: 0.134, New sim: 0.599
Considering CUI 44132006 with sim 0.798340
Narrowing down candidates for: 'abscess' from ['44132006', '128477000']
Adding per CUI to 128477000 (tokens 1..1) weights {'44132006': 0.4386427664494404, '128477000': 0.4386427664494404}
Linker started with entity: abscess
Mixing type similarity of 0.2137 and CUI similarity of 0.4386 with 0.4239 weight for CUI similarity
[Per CUI weights] CUI: 44132006, Name: abscess, Ol

0 : {0: {'pretty_name': 'Abscess', 'cui': '44132006', 'type_ids': ['SA'], 'source_value': 'abscess', 'detected_name': 'abscess', 'acc': 0.7983395419982506, 'context_similarity': 0.7983395419982506, 'start': 43, 'end': 50, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
1 : {0: {'pretty_name': 'Abscesses', 'cui': '128477000', 'type_ids': ['SA'], 'source_value': 'abscess', 'detected_name': 'abscess', 'acc': 0.45236695339580785, 'context_similarity': 0.45236695339580785, 'start': 3, 'end': 10, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}


As the log above shows, the linker performed some extra steps to mix the type-level context similarity into the CUI-level context similarity.
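
The numbers in the first two "Mixing ..." log lines are consistent with a simple linear interpolation between the two similarities. The sketch below reproduces that arithmetic; the formula is inferred from the logged values rather than taken from the library's source, and `mix_similarities` is a hypothetical name.

```python
def mix_similarities(type_sim: float, cui_sim: float, cui_weight: float) -> float:
    """Linear interpolation between CUI-level and type-level similarity."""
    return cui_weight * cui_sim + (1.0 - cui_weight) * type_sim

# Values copied from the first two "Mixing ..." log lines above.
sim_44132006 = mix_similarities(type_sim=0.9929, cui_sim=0.7396, cui_weight=0.7682)
sim_128477000 = mix_similarities(type_sim=0.1344, cui_sim=0.7396, cui_weight=0.7682)

print(round(sim_44132006, 3))   # 0.798, matching "New sim: 0.798" in the log
print(round(sim_128477000, 3))  # 0.599, matching "New sim: 0.599" in the log
```

Both candidates start from the same CUI similarity here, so the type similarity is what separates them in the first text.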