# Unsupervised training for the core NER+L

We've now got a model.
But it's really not a great one since it can't differentiate between concepts that share a name.
We should be able to rectify that somewhat by doing some training.

First we need to load the model pack we created.

In [1]:
import os
from medcat2.cat import CAT


# NOTE: can refer to the .zip or the folder - both will work just fine
model_path = os.path.join("models", "base_model.zip")

cat = CAT.load_model_pack(model_path)

Now we want to provide some data to teach the model the difference between `73211009` (_Diabetes mellitus_) and `396230008` (_Dermatomyositis_).
We should be able to do so by providing texts for self-supervised training that use an unambiguous name for each concept.
So let's try that.

In [2]:
unsup_train_texts = [
    # text regarding diabetes mellitus (73211009)
    "Diabetes mellitus is a metabolic disorder characterized by "
    "chronic hyperglycemia due to impaired insulin secretion, "
    "insulin resistance, or both. It can lead to complications "
    "such as neuropathy, nephropathy, and retinopathy if not well "
    "managed. Treatment typically involves lifestyle modifications, "
    "blood glucose monitoring, and pharmacologic interventions like "
    "insulin or oral hypoglycemics.",
    # text regarding dermatomyositis (396230008)
    "A renowned painter, once known for his intricate brushwork, "
    "found his art hindered by progressive muscle weakness and "
    "a distinctive rash on his hands. Doctors diagnosed him with "
    "dermatomyositis, an inflammatory condition affecting muscles "
    "and skin. Though his strength waned, he adapted his technique, "
    "creating expressive works that reflected his resilience in the "
    "face of illness."
]
cat.trainer.train_unsupervised(unsup_train_texts)
# list every concept and every name that received at least one training example
print("Trained concepts:",
      [(ci['cui'], cat.cdb.get_name(ci['cui']), ci['count_train'])
       for ci in cat.cdb.cui2info.values() if ci['count_train']])
print("Trained names:",
      [(ni["name"], ni["count_train"])
       for ni in cat.cdb.name2info.values() if ni["count_train"]])

Trained concepts: [('73211009', 'Diabetes Mellitus Diagnosed', 2), ('396230008', 'Wagner Unverricht Syndrome', 1)]
Trained names: [('diabetes', 1), ('diabetes~mellitus', 1), ('dermatomyositis', 1)]


Note that normally one would load a larger dataset, e.g. from a CSV file, and train on that data rather than specifying the texts in code.
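
As a rough illustration of loading training texts from a CSV file, the sketch below streams one text per row using Python's standard `csv` module. The helper name `iter_texts`, the column name `text`, and the path `data/notes.csv` are all hypothetical and not part of this tutorial's data. It also assumes that `train_unsupervised` accepts any iterable of strings, as it does with the list above; if it does not, wrap the generator in `list(...)`.

```python
import csv
from typing import Iterator


def iter_texts(csv_path: str, text_column: str = "text") -> Iterator[str]:
    """Yield the non-empty values of one column of a CSV file, row by row.

    NOTE: `text_column` is an assumed column name; adjust it to your data.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            text = (row.get(text_column) or "").strip()
            if text:  # skip rows with missing or blank text
                yield text


# Hypothetical usage (the path is an assumption, not a file from this tutorial):
# cat.trainer.train_unsupervised(iter_texts("data/notes.csv"))
```

Streaming rows this way avoids holding the whole dataset in memory, which starts to matter once the training corpus is larger than a handful of documents.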

In [3]:
example_text1 = """DM is a chronic disease caused by impaired insuline excretion."""  # definitely diabetes
example_text2 = """Patient diagnosed with DM now has chronic kidney disease."""  # probably diabetes
example_text3 = """Patient diagnosed with DM now has difficulty with their fine motor skills"""  # probably dermatomyositis
for text_num, cur_text in enumerate([example_text1, example_text2, example_text3]):
    print(text_num, ":", cat.get_entities(cur_text)['entities'])

0 : {0: {'pretty_name': 'Diabetes Mellitus Diagnosed', 'cui': '73211009', 'type_ids': [], 'source_value': 'DM', 'detected_name': 'dm', 'acc': np.float64(0.8805509317765855), 'context_similarity': np.float64(0.8805509317765855), 'start': 0, 'end': 2, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
1 : {0: {'pretty_name': 'Diabetes Mellitus Diagnosed', 'cui': '73211009', 'type_ids': [], 'source_value': 'DM', 'detected_name': 'dm', 'acc': np.float64(0.8631518546876605), 'context_similarity': np.float64(0.8631518546876605), 'start': 23, 'end': 25, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
2 : {0: {'pretty_name': 'Wagner Unverricht Syndrome', 'cui': '396230008', 'type_ids': [], 'source_value': 'DM', 'detected_name': 'dm', 'acc': np.float64(0.6594844055302075), 'context_similarity': np.float64(0.6594844055302075), 'start': 23, 'end': 25, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
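
The raw entity dictionaries above are fairly verbose. A small helper along these lines can reduce each one to the fields that matter for comparing the three texts. The name `summarise_entities` is hypothetical; it relies only on the keys visible in the output above (`source_value`, `cui`, `pretty_name`, `acc`).

```python
def summarise_entities(entities: dict) -> list:
    """Reduce each detected entity to (source_value, cui, pretty_name, acc).

    `entities` is expected to look like `cat.get_entities(text)['entities']`,
    i.e. a dict mapping an entity id to a dict with the keys used below.
    """
    return [
        (ent["source_value"], ent["cui"], ent["pretty_name"],
         round(float(ent["acc"]), 3))
        for ent in entities.values()
    ]


# Hypothetical usage with the model loaded earlier:
# print(summarise_entities(cat.get_entities(example_text1)['entities']))
```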

Note that this only works if the context is somewhat similar to what was in the training data.
And because the context is represented using our very limited `Vocab` in this example, and because each concept has seen only one training text, we won't be able to get the correct output for every text.
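
To give some intuition for why similar context matters, here is a toy sketch of similarity-based disambiguation: each candidate concept has a vector, the context around the ambiguous name is turned into a vector, and the candidate with the highest cosine similarity wins. This is an illustration only and not MedCAT's actual implementation. The three-dimensional vectors and the function names `cosine` and `pick_concept` are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_concept(context_vec, concept_vecs):
    """Return (cui, similarity) for the concept vector closest to the context."""
    return max(
        ((cui, cosine(context_vec, vec)) for cui, vec in concept_vecs.items()),
        key=lambda pair: pair[1],
    )


# Made-up vectors standing in for learned per-concept context representations
toy_concept_vecs = {
    "73211009": [0.9, 0.1, 0.0],   # pretend this axis mix means "metabolic" context
    "396230008": [0.1, 0.9, 0.2],  # pretend this axis mix means "muscle/skin" context
}
best_cui, best_sim = pick_concept([0.8, 0.2, 0.1], toy_concept_vecs)
print(best_cui, round(best_sim, 3))
```

With only one training text per concept, the learned representation reflects just that one text, so a new context that shares few words with it can easily land closer to the wrong concept.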

In [4]:
fail_text = """Patient presented with classic signs of DM: they were thirstier than normal, felt tried and weak"""  # probably diabetes
print(cat.get_entities(fail_text)['entities'])

{0: {'pretty_name': 'Wagner Unverricht Syndrome', 'cui': '396230008', 'type_ids': [], 'source_value': 'DM', 'detected_name': 'dm', 'acc': np.float64(0.7090981523297809), 'context_similarity': np.float64(0.7090981523297809), 'start': 40, 'end': 42, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}


Now that we've got a model that has received some (very limited!) training, we can save it again.

In [5]:
save_path = "models"
mpp = cat.save_model_pack(save_path, pack_name="unsup_trained_model")
print("Saved at", mpp)

Saved at models/unsup_trained_model
