# MedCAT model

The core components that we need to build a MedCAT model are:
- `CDB` - the concept database
- `Config` - all the configurations/options/settings needed for the CDB, as well as the NER and EL process
- `Vocab` - the vocabulary for context vectors

## Building a vocabulary

One of the parts of a model is its vocabulary.
The vocabulary is used both for spell checking and for the word embeddings that allow the model to differentiate between contexts.

The `Vocab` is a simple class that (more or less) just keeps track of the words, their occurrence counts, and the corresponding embedding vectors.
The occurrence count of a word is used to determine how likely that word is when correcting typos.
The word embedding is used to represent the context of concepts within text.
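
To make the role of the occurrence counts concrete, here is a minimal sketch that does not use MedCAT at all. The `toy_counts` dictionary and the `pick_correction` helper are hypothetical; they only illustrate the idea that, among candidate corrections for a typo, the more frequent word is the more likely one. MedCAT's own spell checking may well work differently.

```python
# Hypothetical word counts, standing in for the counts a vocabulary keeps.
toy_counts = {"severe": 10000, "sever": 120, "seven": 4000}


def pick_correction(candidates, counts):
    """Return the known candidate with the highest count, or None."""
    known = [word for word in candidates if word in counts]
    if not known:
        return None
    return max(known, key=lambda word: counts[word])


# Suppose "severre" is a typo and these are plausible corrections for it.
best = pick_correction(["severe", "sever", "severed"], toy_counts)
print(best)
```

Candidates that are not in the vocabulary ("severed" here) are ignored, which is one reason a vocabulary with realistic counts matters.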


In [1]:
import os
import numpy as np
from medcat2.vocab import Vocab

vocab = Vocab()

Now we have an empty `Vocab`.
But we want to add words to it.

In [2]:
vocab.add_word("severe", 10000, np.array((1.0, 0, 0, 1, 0, 0, 0)))
vocab.add_word("minor", 10000, np.array((1.0, 0, 0, 1, 0, 0, 0)))
vocab.add_word("acute", 6500, np.array((0, 1.0, 0, 1, 0, 0, 0)))
vocab.add_word("chronic", 6500, np.array((0, -1.0, 0, 0, 1, 0, 0)))

vocab_data_path = os.path.join('in_data','dummy_vocab_data.txt')
# we add the rest based on an input file
vocab.add_words(vocab_data_path)

Now we have a vocabulary with some words that have some embedding vectors.
We can look at them a little bit as well.

In [3]:
print("Words in vocab:", vocab.vocab.keys())
print("Word info for 'severe'", vocab.vocab['severe'])
print("Word vector for 'severe'", vocab.vec('severe'))

Words in vocab: dict_keys(['severe', 'minor', 'acute', 'chronic', 'heavy', 'light', 'considered', 'with', 'of', 'to', 'were', 'was', 'is', 'are', 'has', 'presence', 'indication', 'time'])
Word info for 'severe' {'vector': array([1., 0., 0., 1., 0., 0., 0.]), 'count': 10000, 'index': 0}
Word vector for 'severe' [1. 0. 0. 1. 0. 0. 0.]
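
The vectors are what make it possible to compare contexts numerically. As an illustration only, the sketch below computes the cosine similarity between some of the toy vectors added above using plain Python. Cosine similarity is an assumption chosen for the example, and `cosine_similarity` is a hypothetical helper rather than part of the `Vocab` API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The same toy vectors that were added to the vocabulary above.
severe = (1.0, 0, 0, 1, 0, 0, 0)
acute = (0, 1.0, 0, 1, 0, 0, 0)
chronic = (0, -1.0, 0, 0, 1, 0, 0)

sim_severe_acute = cosine_similarity(severe, acute)
sim_acute_chronic = cosine_similarity(acute, chronic)
print(sim_severe_acute, sim_acute_chronic)
```

With these dummy vectors, "severe" and "acute" come out as moderately similar (about 0.5), while "acute" and "chronic" point in partly opposite directions (about -0.5).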


## Let's start building a concept database (CDB)

The concept database (CDB) defines the concepts (the terms) as well as the names each concept can be referred to as.
For instance, the SNOMED term `73211009` (_Diabetes mellitus_) can be referred to by many names (e.g. "diabetes", "diabetes mellitus diagnosis", "diabete mellitus", "diabetes mellitus", "dm", "diabetes mellitus diagnosed", "diabetes mellitus disorder").

While many of these names always refer to this term, some do not.
For instance, the abbreviation `dm` could refer to many other terms (e.g. `77956009` (Steinert myotonic dystrophy syndrome), `396230008` (Dermatomyositis), `387114001` (Dextromethorphan), or `30782001` (Diastolic murmur), and so on).
The concept database is designed to capture all of this.

That is, it records which names can be used for each term, as well as which terms each name can correspond to.
In the context of MedCAT, we tend to call the term identifier (`73211009` in this case) the Concept Unique Identifier (_CUI_) and the variants simply the _names_.
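
The two directions of this mapping can be sketched with plain dictionaries. This is not how the `CDB` stores its data internally; `cui_to_names` and `name_to_cuis` are hypothetical names, and the data is a hand-written subset of the examples above.

```python
from collections import defaultdict

# A hand-written subset of the names discussed above (not a full ontology).
cui_to_names = {
    "73211009": ["diabetes", "diabetes mellitus", "dm"],
    "396230008": ["dermatomyositis", "dm"],
    "387114001": ["dextromethorphan", "dm"],
}

# Invert the mapping so that each name points at every CUI it may refer to.
name_to_cuis = defaultdict(set)
for cui, names in cui_to_names.items():
    for name in names:
        name_to_cuis[name].add(cui)

# A name is ambiguous when more than one CUI can be behind it.
ambiguous = {name for name, cuis in name_to_cuis.items() if len(cuis) > 1}
print(sorted(ambiguous))  # ['dm']
```

Names that map to a single CUI can be linked directly, whereas `dm` needs extra information (its context) before a CUI can be chosen.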


In [4]:
import pandas as pd
from medcat2.model_creation.cdb_maker import CDBMaker
from medcat2.cdb import CDB
from medcat2.config import Config

# first we need a config
# we can use the default for now
cnf = Config()

# now we can create a CDB
cdb = CDB(cnf)

# though this CDB is empty
print("CDB, CUI2Info:", cdb.cui2info, "name2Info", cdb.name2info)

CDB, CUI2Info: {} name2Info {}


Now we have a CDB, but we need a `CDBMaker` to help us add concepts to the CDB.
The `CDBMaker` helps automate some of the preprocessing for the names.
As we will see later, this makes the names appear somewhat unorthodox in the CDB.


In [5]:
# NOTE: we could also pass only the config, in which case the maker creates a new CDB automatically
maker = CDBMaker(cnf, cdb)

Next, we want to add some actual content to the CDB.
We will use the diabetes concept (CUI `73211009`) discussed earlier, along with the names listed before.
Note, though, that these are not all the names the term can be found under.

In [6]:
cui = '73211009'  # NOTE: we use strings for CUIs because not all ontologies have integer concept identifiers
name_list = ["diabetes", "diabetes mellitus diagnosis", "diabete mellitus",
             "diabetes mellitus", "dm", "diabetes mellitus diagnosed", "diabetes mellitus disorder"]

Before we can feed this to the `CDBMaker`, we need to put the concept and its names into a `pandas.DataFrame`.
This is because the method is generally used to read data from one or more CSV files that contain many concepts.

In [7]:

cui_df = pd.DataFrame({"cui": cui, "name": name_list})
# now we check that all our names are in the dataframe
print("The dataframe:\n", cui_df)

The dataframe:
         cui                         name
0  73211009                     diabetes
1  73211009  diabetes mellitus diagnosis
2  73211009             diabete mellitus
3  73211009            diabetes mellitus
4  73211009                           dm
5  73211009  diabetes mellitus diagnosed
6  73211009   diabetes mellitus disorder


Now that we have the input data in the correct format, we can add it to the CDB using the `CDBMaker`.

NOTE:
The full supported CSV format can be described as:
```csv
cui,name,ontologies,name_status,type_ids,description
1,Kidney Failure,SNOMED,P,T047,kidneys stop working
```
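
To show what the columns look like once parsed, the sketch below reads the example row with Python's standard `csv` module. It only illustrates the column layout and is not how `CDBMaker` reads its input; in this notebook a `pandas.DataFrame` is passed instead.

```python
import csv
import io

# The example row from the note above, held in memory instead of in a file.
csv_text = (
    "cui,name,ontologies,name_status,type_ids,description\n"
    "1,Kidney Failure,SNOMED,P,T047,kidneys stop working\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # csv.DictReader yields every value as a string, so the CUI stays a string.
    print(row["cui"], "|", row["name"], "|", row["type_ids"])
```

Only the `cui` and `name` columns are used in the rest of this notebook.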

In [8]:
maker.prepare_csvs([cui_df])
# now we can verify that the added concept and its names are in there:
# NOTE: we only print out the key for now
print("Done CDB CUI2Info:", cdb.cui2info.keys(), "name2Info", cdb.name2info.keys())

Done CDB CUI2Info: dict_keys(['73211009']) name2Info dict_keys(['diabetes', 'diabetes~mellitus~diagnosis', 'diabete~mellitus', 'diabetes~mellitus', 'dm', 'diabetes~mellitus~diagnosed', 'diabetes~mellitus~disorder'])


As we can see, the names have `~` in them instead of spaces (` `).
That is the preprocessing that `CDBMaker` automated for us.
The reason for this is not important at this time.
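
The visible effect on these particular names can be approximated in a few lines. The helper `to_cdb_key` below is hypothetical, and the lowercasing and whitespace splitting are assumptions; the real preprocessing in `CDBMaker` may do more than this.

```python
def to_cdb_key(name: str, separator: str = "~") -> str:
    """Lowercase a name and join its whitespace-separated tokens."""
    return separator.join(name.lower().split())


keys = [to_cdb_key(n) for n in ["diabetes mellitus diagnosis", "Diabete  Mellitus", "dm"]]
print(keys)  # ['diabetes~mellitus~diagnosis', 'diabete~mellitus', 'dm']
```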

However, in order for the CDB to be useful we may want to add another concept to it.

In [9]:
cui2 = '396230008'  # Dermatomyositis
# NOTE: we're not including all the possible names here either
name_list2 = ['dermatopolymyositis', 'dermatomyositis disorder', 'wagner unverricht syndrome',
              'dermatomyositides', 'dm', 'dermatomyositis', 'dermatomyositis diagnosis', 'dermatomyositide']
cui_df2 = pd.DataFrame({"cui": cui2, "name": name_list2})
maker.prepare_csvs([cui_df2])
# and check the contents again
print("Final CDB CUI2Info:", cdb.cui2info.keys(), "name2Info", cdb.name2info.keys())

Final CDB CUI2Info: dict_keys(['73211009', '396230008']) name2Info dict_keys(['diabetes', 'diabetes~mellitus~diagnosis', 'diabete~mellitus', 'diabetes~mellitus', 'dm', 'diabetes~mellitus~diagnosed', 'diabetes~mellitus~disorder', 'dermatopolymyositis', 'dermatomyositis~disorder', 'wagner~unverricht~syndrome', 'dermatomyositides', 'dermatomyositis', 'dermatomyositis~diagnosis', 'dermatomyositide'])


## Now we can create a model pack

Now that we have the CDB and the Vocab, we can finally create a model pack.

In [10]:
from medcat2.cat import CAT

cat = CAT(cdb, vocab, cnf)

Now that we have a model pack, we can try and use it as well.

In [11]:
text_for_diabetes = """Patient was diagnosed with diabetes last year."""
print("Found entities:", cat.get_entities(text_for_diabetes)['entities'])
text_for_dermatomyositis = """Patient with dermatomyositis had no comorbidities"""
print("Found entities:", cat.get_entities(text_for_dermatomyositis)['entities'])

Training was enabled during inference. It was automatically disabled.


Found entities: {0: {'pretty_name': 'Diabetes Mellitus Diagnosed', 'cui': '73211009', 'type_ids': [], 'source_value': 'diabetes', 'detected_name': 'diabetes', 'acc': 1, 'context_similarity': 1, 'start': 27, 'end': 35, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
Found entities: {0: {'pretty_name': 'Wagner Unverricht Syndrome', 'cui': '396230008', 'type_ids': [], 'source_value': 'dermatomyositis', 'detected_name': 'dermatomyositis', 'acc': 1, 'context_similarity': 1, 'start': 13, 'end': 28, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}
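
The returned structure is a plain dictionary keyed by entity id, so it can be processed with ordinary Python. The sketch below uses a trimmed, hand-copied version of the first output above and assumes that `start` and `end` are character offsets into the input text with an exclusive end, which is consistent with the values shown.

```python
text = "Patient was diagnosed with diabetes last year."

# A trimmed copy of the first output above, keeping only the fields used here.
entities = {
    0: {
        "pretty_name": "Diabetes Mellitus Diagnosed",
        "cui": "73211009",
        "source_value": "diabetes",
        "start": 27,
        "end": 35,
    }
}

for ent in entities.values():
    span = text[ent["start"]:ent["end"]]
    # The offsets index into the original text, so the slice matches source_value.
    print(ent["cui"], ent["pretty_name"], repr(span), span == ent["source_value"])
```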


These worked without training because the concepts were not ambiguous.
Each name we detected was directly linked to only one concept.
However, things change if we try the same with some text that is ambiguous.

In [12]:
# NOTE: we need to set the min name length to 2 to have any chance here
cat.config.components.ner.min_name_len = 2

ambig_text = """Patient with DM was diagnosed with chronic kidney disease."""
print("Found entities:", cat.get_entities(ambig_text)['entities'])

Found entities: {}


The next parts of the tutorial will show how to use the model pack and make it disambiguate names as well.
For now, we will just save the model.

In [13]:
save_path = "models"
mpp = cat.save_model_pack(save_path, pack_name="base_model")
print("Saved at", mpp)

Saved at models/base_model
