# This is the parser for the Doctor Lingo Corpus base
- We use UMLS as a base. It is quite complex to understand, so I made this Jupyter notebook as a learning tool for how we develop our corpus.
- There is an additional pure Python file if you just want to run the parser without the walkthrough.
- DISCLAIMER: Working with the data with no understanding of underlying architecture is not the best idea.
- I highly advise working through this if you are new to the corpus.

## If you get lost on this parsing journey, refer to the reference below, which is what I used.
https://www.ncbi.nlm.nih.gov/books/NBK9676/

## Conceptually for each atom aka word / word phrase we need all of the information from this chart
https://www.nlm.nih.gov/research/umls/implementation_resources/query_diagrams/er1.html
Keep in mind this chart diagrams the approach with SQL query language for obtaining the data for any one given atom.
We are obtaining the same data for all atoms and storing them in distinct formats:
- A file for all concepts, atoms, and pointers to their respective properties from MRCONSO
- A file for all definitions in UMLS (around 400,000 definitions) from MRDEF
- A folder directory of files for each term's hierarchical relationships according to their sources (i.e. SNOMED "is a") from MRHIER
- A folder directory of files for each term's relationships according to relationship type from MRREL
- There are two files we skip over:
- The MRSAT file, as most of its attributes are source related and irrelevant for most NLP tasks
- The MRSTY file, which holds the semantic types and is very important
- Neither file really needs trimming, so we keep both in the corpus as-is and interact with them programmatically
- We include the MRSAT file in the UMLS base in case work needs to be done with one of the attributes listed at the link below
- https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/attribute_names.html
- Semantic types are quite important to understand: https://www.nlm.nih.gov/research/umls/new_users/online_learning/OVR_003.html


### Let's start by importing the tools we will use to parse this corpus base.

In [1]:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf as dcf
import cudf as cdf
import dask.dataframe as ddf
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0")
client = Client(cluster)
client.run(cdf.set_allocator, "managed")
client


distributed.preloading - INFO - Import preload module: dask_cuda.initialize


Connection method: Cluster object, Cluster type: LocalCUDACluster
Dashboard: http://127.0.0.1:8787/status

Status: running, Using processes: True
Workers: 1, Total threads: 1, Total memory: 188.59 GiB

Comm: tcp://127.0.0.1:44449, Workers: 1
Total threads: 1, Total memory: 188.59 GiB
Started: Just now

Comm: tcp://127.0.0.1:32797, Total threads: 1
Dashboard: http://127.0.0.1:37503/status, Memory: 188.59 GiB
Nanny: tcp://127.0.0.1:46257
Local directory: /home/karl/PycharmProjects/DLAI/build_corpus/dask-worker-space/worker-5dgje5yd
GPU: NVIDIA GeForce RTX 3090, GPU memory: 23.69 GiB


# 1. Find all atoms of each UMLS concept and create a lookup dataframe/table
- In UMLS atoms represent synonyms that map to a concept and definition.
- The only time an atom will map to the same concept but different definition is if it is a language difference.
- For example a single concept may have 3 definitions for 3 different languages, English, French, Spanish.
- The corresponding atoms will map to that concept, because a concept should be conserved across languages, but...
- Because definitions must be translated across languages, we may have only certain atoms of the same language map to its respective definition.
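To make the concept / atom / language relationship concrete, here is a minimal sketch in plain Python. All identifiers and strings below (`C_DEMO1`, `A_DEMO1`, and so on) are made up for illustration and are not taken from UMLS; the point is only that several atoms in several languages can point at one concept.

```python
from collections import defaultdict

# Toy atoms: (concept_id, atom_id, language, string). All values are illustrative.
atoms = [
    ("C_DEMO1", "A_DEMO1", "ENG", "heart"),
    ("C_DEMO1", "A_DEMO2", "ENG", "cardiac structure"),
    ("C_DEMO1", "A_DEMO3", "SPA", "corazón"),
    ("C_DEMO1", "A_DEMO4", "FRE", "coeur"),
]

# Group the atom strings by (concept, language)
atoms_by_concept_language = defaultdict(list)
for cui, aui, lat, string in atoms:
    atoms_by_concept_language[(cui, lat)].append(string)

# The single concept is shared by atoms in three languages
languages = sorted(lat for (cui, lat) in atoms_by_concept_language if cui == "C_DEMO1")
print(languages)
print(atoms_by_concept_language[("C_DEMO1", "ENG")])
```

A definition written in one language would then be attached only to the atoms of that language, while the concept identifier stays the same for all of them.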

## We want to create a lookup dataframe for any concept in UMLS with pointers for all properties stored in other files
- We do this by using the MRCONSO file.
- MRCONSO is named for CONcepts and SOurces.
- It has a unique row for each atom's identifier, AUI, and string STR that maps to the respective concept, CUI (1st index)

# Briefly let's review the structure of MRCONSO

|Col. |	Description |
| --- | ----------- |
| CUI |	Unique identifier for concept |
| LAT |	Language of term |
| TS |	Term status |
| LUI |	Unique identifier for term. |
| STT |	String type. PF: preferred form of term, VCW: case and word-order variant of preferred term, VC: case variant of preferred term, VO: Variant of the preferred form, VW: word order variant of preferred term |
| SUI |	Unique identifier for string |
| ISPREF | Atom status - preferred (Y) or not (N) for this string within this concept |
| AUI |	Unique identifier for atom - variable length field, 8 or 9 characters|
| SAUI | Source asserted atom identifier [optional] |
| SCUI | Source asserted concept identifier [optional] |
| SDUI | Source asserted descriptor identifier [optional] |
| SAB |	Abbreviated source name (SAB). Maximum field length is 20 alphanumeric characters. Two source abbreviations are assigned: Root Source Abbreviation (RSAB) — short form, no version information, for example, AI/RHEUM, 1993, has an RSAB of "AIR" Versioned Source Abbreviation (VSAB) — includes version information, for example, AI/RHEUM, 1993, has an VSAB of "AIR93" Official source names, RSABs, and VSABs are included on the UMLS Source Vocabulary Documentation page.|
| TTY |	Abbreviation for term type in source vocabulary, for example PN (Metathesaurus Preferred Name) or CD (Clinical Drug). Possible values are listed on the Abbreviations Used in Data Elements page. |
| CODE | Most useful source asserted identifier (if the source vocabulary has more than one identifier), or a Metathesaurus-generated source entry identifier (if the source vocabulary has none) |
| STR |	String |
| SRL |	Source restriction level |
| SUPPRESS | Suppressible flag. Values = O, E, Y, or N. O: All obsolete content, whether they are obsolesced by the source or by NLM. These will include all atoms having obsolete TTYs, and other atoms becoming obsolete that have not acquired an obsolete TTY (e.g. RxNorm SCDs no longer associated with current drugs, LNC atoms derived from obsolete LNC concepts). E: Non-obsolete content marked suppressible by an editor. These do not have a suppressible SAB/TTY combination. Y: Non-obsolete content deemed suppressible during inversion. These can be determined by a specific SAB/TTY combination explicitly listed in MRRANK. N: None of the above. Default suppressibility as determined by NLM (i.e., no changes at the Suppressibility tab in MetamorphoSys) should be used by most users, but may not be suitable in some specialized applications. See the MetamorphoSys Help page for information on how to change the SAB/TTY suppressibility to suit your requirements. NLM strongly recommends that users not alter editor-assigned suppressibility, and MetamorphoSys cannot be used for this purpose. |
| CVF |	Content View Flag. Bit field used to flag rows included in Content View. This field is a varchar field to maximize the number of bits available for use. |


### We are interested in keeping columns CUI, AUI, STR, LAT, LUI, STT, TTY, CODE - so we are able to drop 10 out of 18 columns
- We will not return the dataframe with columns in order as above.
- We will instead use the order:
- **CUI, AUI, STR, LAT, LUI, STT, TTY, CODE**

### Why do we keep these columns?
- We keep CUI and AUI because it is important to be able to map the identifiers of words / phrases to various properties.
- We keep STR for obvious reasons: it is the actual word or phrase (in the language given by LAT) for a given AUI / atom that maps to a concept.
- We keep LAT to be able to get terms from any language we choose.
- We keep LUI, the lexical (term) identifier: strings that are lexical variants of one another (for example singular and plural forms) share a LUI, which lets us group variants of the same term.
- We keep STT to know whether a string is the preferred form of its term or some variant of it, i.e. a case or word-order variant of the preferred form.
- We keep TTY for the term type assigned by the source vocabulary, i.e. CD is a clinical drug.
- We keep CODE as it is the source's original identifier.
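As a small illustration of the column selection described above, the sketch below parses a single pipe-delimited row in plain Python and keeps only the eight columns of interest. The sample row is made up (the `C9999999`, `DEMO`, and `X1` values are hypothetical) and `parse_mrconso_line` is a helper name introduced here, not part of the corpus code.

```python
MRCONSO_COLUMNS = ['CUI', 'LAT', 'TS', 'LUI', 'STT', 'SUI', 'ISPREF', 'AUI', 'SAUI',
                   'SCUI', 'SDUI', 'SAB', 'TTY', 'CODE', 'STR', 'SRL', 'SUPPRESS', 'CVF']
KEPT_COLUMNS = ['CUI', 'AUI', 'STR', 'LAT', 'LUI', 'STT', 'TTY', 'CODE']


def parse_mrconso_line(line):
    """Split one MRCONSO-style row and keep only the columns we care about."""
    # Rows end with a trailing '|', which yields one extra empty field,
    # so we keep only as many fields as there are column names.
    fields = line.rstrip('\n').split('|')[:len(MRCONSO_COLUMNS)]
    record = dict(zip(MRCONSO_COLUMNS, fields))
    return {name: record[name] for name in KEPT_COLUMNS}


# Hypothetical row in the MRCONSO layout (not real UMLS data)
sample_row = "C9999999|ENG|P|L9999999|PF|S9999999|Y|A99999999||||DEMO|PT|X1|example term|0|N||\n"
atom = parse_mrconso_line(sample_row)
print(atom)
```

The GPU dataframe code in the next cell does the same selection for every row at once; this version is only meant for reading along.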

In [10]:

# Read the UMLS MRCONSO file into GPU memory with dask_cudf
mrconso_columns = ['CUI','LAT','TS','LUI','STT','SUI','ISPREF','AUI','SAUI','SCUI','SDUI','SAB','TTY','CODE','STR','SRL','SUPPRESS','CVF']
mrconso = dcf.read_csv(
    "/home/karl/PycharmProjects/DLAI/datasets/UMLS/MRCONSO.RRF",
    sep="|",
    names=mrconso_columns,
    #chunksize="1 GB"
)

# we switch the dataframe from dask_cudf to cudf for manipulation of strings (only supported in cudf)
mrconso = mrconso.compute()

# we select the functional columns that we need for our corpus and make a new cudf dataframe
mrconso = cdf.DataFrame({
    'concept_identifier': mrconso['CUI'],
    'atom_identifier': mrconso['AUI'],
    'atom_word_phrase': mrconso['STR'],
    'language': mrconso['LAT'],
    'word_unit_identifier': mrconso['LUI'],
    'preferred_or_variant': mrconso['STT'],
    'general_category': mrconso['TTY'],
    'source_identifier': mrconso['CODE'],

})

# mrconso is now a plain cudf dataframe, which writes a single file; single_file is a dask-only option
mrconso.to_csv('/home/karl/PycharmProjects/DLAI/corpus/UMLS_base/UMLS_all_concepts_and_synonyms_with_property_pointers.csv', sep='|')


# 2. Find all source definitions associated with a UMLS concept.
- This part is a bit easier. MRDEF, appropriately named, contains definitions for all concepts.
- Conceptually it should make sense that a concept can have its own unique definition shared by synonyms aka atoms.
- This script just builds the corpus aka the lookup files for all properties of each concept, i.e. synonyms, relationships, attributes, etc.
- There are scripts that help access the data from each look up file separate from this.
- That is, to interact with this corpus you should see ... files and usage in the README
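For orientation only, here is a minimal sketch of how a pipe-delimited definitions file with the columns `concept_identifier`, `atom_identifier`, and `DEF` could be turned into a lookup from concept to definitions. It is not one of the access scripts mentioned above; `load_definitions` is a hypothetical helper and the sample rows are made up. An in-memory string stands in for the file so the example runs on its own.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the same layout as the definitions output file
sample = io.StringIO(
    "concept_identifier|atom_identifier|DEF\n"
    "C_DEMO1|A_DEMO1|An illustrative English definition.\n"
    "C_DEMO1|A_DEMO3|Una definición ilustrativa en español.\n"
    "C_DEMO2|A_DEMO9|Another illustrative definition.\n"
)


def load_definitions(handle):
    """Map each concept identifier to a list of (atom identifier, definition) pairs."""
    definitions = defaultdict(list)
    # DictReader selects fields by header name, so extra columns are simply ignored
    for row in csv.DictReader(handle, delimiter='|'):
        definitions[row['concept_identifier']].append((row['atom_identifier'], row['DEF']))
    return definitions


definitions = load_definitions(sample)
print(len(definitions['C_DEMO1']))
```

With a real file, the `io.StringIO` object would be replaced by an opened file handle.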

## On to creating useable output for each definition for each concept in UMLS

In [12]:
# Read in the MRDEF file to a dask-cuDF dataframe
mrdef_columns = ['CUI','AUI','ATUI','SATUI','SAB','DEF','SUPPRESS','CVF']
mrdef = dcf.read_csv(
    "/home/karl/PycharmProjects/DLAI/datasets/UMLS/MRDEF.RRF",
    sep="|",
    names=mrdef_columns,
    dtype=str
)

mrdef = mrdef.compute()

# We don't need most of the info stored in MRDEF, so we keep only the CUI, AUI, and DEF columns
mrdef = cdf.DataFrame({
    'concept_identifier': mrdef['CUI'],
    'atom_identifier': mrdef['AUI'],
    'DEF': mrdef['DEF']
})

# mrdef is now a plain cudf dataframe, which writes a single file; single_file is a dask-only option
mrdef.to_csv('/home/karl/PycharmProjects/DLAI/corpus/UMLS_base/UMLS_all_definitions.csv', sep='|')

# 3. Find all concept hierarchies for atoms per source
## This is where we start to get a little fancy, but appropriate for an NLP corpus
## We will now begin parsing out all of the relationships, attributes, etc. aka all properties for each concept
## We will keep all possible syntactic and semantic properties from the UMLS

- This is a bit challenging as these properties are stored across various files.
- The first file we work with is MRHIER, short for hierarchy.
- This stores the hierarchical properties for each atom (also has CUI for mapping).
- We want to pull this data out and store each hierarchy as a file for the source.
- For example, SNOMEDCT_US is a source, so is MeSH (MSH), we will create separate look up files for each source.
- We do this for ease of use when wanting to navigate hierarchical structures, which are properties for terms.
- One such property being "isa" from SNOMEDCT_US, this property is very similar to hypernymy, an important linguistic property.

In [4]:
# We should know all sources in our current UMLS subset.
# This current DL subset includes all available data for all terms per the open license
# MRHIER for us has ~86 sources as of 2021AA UMLS

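source_list = []
# Use a context manager so the file handle is closed once we finish reading
with open('/home/karl/PycharmProjects/DLAI/datasets/UMLS/MRHIER.RRF', 'r') as mrhier_file:
    for line in mrhier_file:
        # SAB (the source abbreviation) is the fifth pipe-delimited field
        source = line.split('|')[4]
        if source not in source_list:
            source_list.append(source)
print(source_list)
print(len(source_list))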

# Now we can iterate over these sources and create unique files for them by selecting row in the dataframe from said source


['MSH', 'SNMI', 'MSHSWE', 'MSHCZE', 'MSHPOR', 'MSHFIN', 'MSHJPN', 'LNC', 'MSHPOL', 'MSHFRE', 'MSHGER', 'MSHITA', 'MSHRUS', 'MSHSPA', 'SNM', 'SNOMEDCT_US', 'SCTSPA', 'RCD', 'CSP', 'PSY', 'MSHSCR', 'MSHNOR', 'AOD', 'CPM', 'NCI', 'MEDCIN', 'ATC', 'USPMG', 'CST', 'UWDA', 'OMIM', 'FMA', 'MSHLAV', 'HL7V2.5', 'ICD10', 'ICD10AM', 'MDR', 'MDRITA', 'MDRJPN', 'MDRCZE', 'KCD5', 'ICD10CM', 'MDRHUN', 'MDRGER', 'MDRRUS', 'MDRKOR', 'MDRBPO', 'MDRPOR', 'MDRSPA', 'MDRDUT', 'MDRFRE', 'NOC', 'HPO', 'ICPC', 'ICPC2EENG', 'WHO', 'CCS', 'MEDLINEPLUS', 'ICNP', 'ICD9CM', 'NEU', 'NCBI', 'PCDS', 'MED-RT', 'HL7V3.0', 'AIR', 'GO', 'CCC', 'PDQ', 'UMD', 'NIC', 'ALT', 'ICF', 'ICF-CY', 'CCSR_ICD10PCS', 'OMS', 'CPT', 'CDT', 'HCPCS', 'NANDA-I', 'PNDS', 'ICD10PCS', 'AOT', 'SOP', 'PPAC', 'TKMT']
86


# Keep in mind we need to retain pretty much all of the data in the MRHIER file
Its columns are labeled as follows:

| Column	| Description |
| ---------- | ----------- |
| CUI	| Unique identifier of concept |
| AUI	| Unique identifier of atom (synonyms) - variable length field, 8 or 9 characters |
| CXN	| Context number (e.g., 1, 2, 3). It distinguishes the separate hierarchical contexts (positions) an atom can have within a source; it is not the depth of the atom in the hierarchy |
| PAUI | Unique identifier of atom's immediate parent within this context |
| SAB	| Abbreviated source name (SAB) of the source of atom (and therefore of hierarchical context). Maximum field length is 20 alphanumeric characters. Two source abbreviations are assigned: Root Source Abbreviation (RSAB) — short form, no version information, for example, AI/RHEUM, 1993, has an RSAB of "AIR" Versioned Source Abbreviation (VSAB) — includes version information, for example, AI/RHEUM, 1993, has an VSAB of "AIR93" Official source names, RSABs, and VSABs are included on the UMLS Source Vocabulary Documentation page. |
| RELA | Relationship of atom to its immediate parent (i.e. "is a") |
| PTR	| Path to the top or root of the hierarchical context from this atom, represented as a list of AUIs, separated by periods (.) The first one in the list is top of the hierarchy; the last one in the list is the immediate parent of the atom, which also appears as the value of PAUI. |
| HCD	| Source asserted hierarchical number or code for this atom in this context; this field is only populated when it is different from the code (unique identifier or code for the string in that source). |
| CVF	| Content View Flag. Bit field used to flag rows included in Content View. This field is a varchar field to maximize the number of bits available for use. |

### We are most interested in the atoms and the "PTR" or path to root as this is how you can build out the hierarchy property for any given atom.
Therefore we parse it out as described above, with a file for every source, keeping all columns to work with in the future as needed.
This way, if you are ever interested in a particular hierarchical relationship, you can reference the link below and pull the data needed.
https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html
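Since PTR is a period-separated list of AUIs running from the top of the hierarchy down to the immediate parent, recovering an atom's ancestors is a simple split. The sketch below shows this with a made-up PTR value; the AUIs (`A_ROOT`, `A_PARENT`, ...) and the helper name `path_to_root` are illustrative assumptions.

```python
def path_to_root(ptr):
    """Return the ancestor AUIs in a PTR value, ordered from the root to the immediate parent."""
    # An empty PTR means the atom has no ancestors recorded in this context
    return ptr.split('.') if ptr else []


# Hypothetical PTR value: root first, immediate parent last
ptr = "A_ROOT.A_MID1.A_MID2.A_PARENT"
ancestors = path_to_root(ptr)

root = ancestors[0]      # top of the hierarchy
parent = ancestors[-1]   # should equal the PAUI column of the same row
depth = len(ancestors)   # number of ancestors above the atom
print(root, parent, depth)
```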


In [6]:
# Since we do some transformation to the dataframe and write it immediately without creating a dataframe,
# We do not get the chance to rename the columns as we did in previous steps.
# mrhier_columns = ['CUI','REL_AUI','CXN','PAUI','SAB','HIER_RELA','PTR','HCD','CVF']
# So instead of using the above columns as named in UMLS we do the renaming at the start.
mrhier_columns = ['concept_identifier','REL_AUI','CXN','PAUI','SAB','HIER_RELA','PTR','HCD','CVF']

mrhier = dcf.read_csv(
    '/home/karl/PycharmProjects/DLAI/datasets/UMLS/MRHIER.RRF',
    sep = '|',
    names=mrhier_columns,
    dtype='str',
)

for source in source_list:
    # Compare for exact equality: str.match() is a regex match anchored at the start,
    # so 'MSH' would also select 'MSHSWE', 'MSHCZE', etc.
    match_df = mrhier[mrhier['SAB'] == source]
    match_df = match_df.reset_index()
    file_name = '/home/karl/PycharmProjects/DLAI/corpus/UMLS_base/all_hierarchies/{src}_hierarchy.csv'.format(src = source)
    match_df.to_csv(file_name, sep='|', single_file=True)
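A caveat worth keeping in mind when filtering rows by source abbreviation: many abbreviations are prefixes of others (MSH and MSHSWE, SNM and SNMI, ICD10 and ICD10CM). A regular-expression match anchored at the start of the string, which is how Python's `re.match` behaves, treats the shorter abbreviation as matching the longer ones too. The sketch below shows the difference using plain Python on a few abbreviations from the source list; it is an illustration of the matching behaviour, not of the GPU code itself.

```python
import re

# A few source abbreviations taken from the list printed above
sources = ['MSH', 'MSHSWE', 'SNM', 'SNMI', 'ICD10', 'ICD10CM']

# Prefix-anchored regex match: 'MSH' also matches 'MSHSWE'
prefix_hits = [s for s in sources if re.match('MSH', s)]

# Exact comparison: only 'MSH' itself is selected
exact_hits = [s for s in sources if s == 'MSH']

print(prefix_hits)
print(exact_hits)
```

Exact comparison is therefore the safer way to make sure each per-source file contains rows from that source only.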


# 4. Find all "natural" direction relationships for a given concept
- Again, like MRHIER, almost all columns are important to retain all relationship properties from UMLS.
- From the below columns we can get rid of SUPPRESS, CVF at least.
- mrrel_columns = ['CUI1','AUI1','STYPE1','REL','CUI','AUI','STYPE2','RELA','RUI','SRUI','SAB','SL','RG','DIR','SUPPRESS','CVF']


In [2]:
# First let's do some cleaning and file generation to work with in the future as described above

# We should know all relationship attributes (RELA values) in our current UMLS subset
rel_list = []
# Use a context manager so the file handle is closed once we finish reading
with open('/home/karl/PycharmProjects/DLAI/datasets/UMLS/MRREL.RRF', 'r') as mrrel_file:
    for line in mrrel_file:
        # RELA is the eighth pipe-delimited field
        rel = line.split('|')[7]
        if rel not in rel_list:
            rel_list.append(rel)
print(rel_list)
print(len(rel_list))

['', 'translation_of', 'sort_version_of', 'entry_version_of', 'permuted_term_of', 'mapped_to', 'associated_with', 'has_permuted_term', 'has_translation', 'has_sort_version', 'has_entry_version', 'has_transliterated_form', 'has_component', 'inverse_isa', 'has_measured_component', 'same_as', 'measures', 'parent_of', 'form_of', 'transliterated_form_of', 'disposition_of', 'exhibited_by', 'see_from', 'see', 'entry_combination_of', 'mapped_from', 'has_causative_agent', 'used_for', 'use', 'isa', 'replaces', 'replaced_by', 'has_direct_substance', 'subset_includes_concept', 'has_ingredient', 'has_tradename', 'mapping_qualifier_of', 'contains', 'active_ingredient_of', 'has_active_ingredient', 'has_active_moiety', 'has_member', 'has_basis_of_strength_substance', 'has_precise_active_ingredient', 'tradename_of', 'may_be_prevented_by', 'may_be_treated_by', 'has_contraindicated_drug', 'has_part', 'physiologic_effect_of', 'mechanism_of_action_of', 'therapeutic_class_of', 'lab_number_of', 'chemotherapy

### Note: In MRREL, the REL/RELA always expresses the nature of the relationship from CUI2 to the "current concept", CUI1.
**Because of this direction, which is noted in the UMLS docs, we rename the CUI2 column to CUI.**
A way to conceptualize this is an example relationship as below.

- C0000039|A0016515|AUI|SY|C0000039|A13096036|AUI|translation_of|R73331672||MSHPOR|MSHPOR|||N||
- mrrel_columns = ['CUI1','AUI1','STYPE1','REL','CUI2','AUI2','STYPE2','RELA','RUI','SRUI','SAB','SL','RG','DIR','SUPPRESS','CVF']


- The 0th column is the CUI1, the 4th column is the CUI2.
- CUI2 is a translation of CUI1. We consider CUI1 the parent node in this relationship.
- Thus we must rename CUI2 as CUI, as all pointers from the other files refer to this column in this file.
- mrrel_columns = ['CUI1','AUI1','STYPE1','REL','CUI','AUI','STYPE2','RELA','RUI','SRUI','SAB','SL','RG','DIR','SUPPRESS','CVF']


- We store the file in our corpus in this fashion as this represents the syntactic natural direction of the relationship in English.
- We additionally store the AUIs in this manner.
- The parse of this file will take about 26 minutes
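To see the direction convention in action, the sketch below parses a row shaped like the example above and prints it in its natural reading order: the second atom, then the relationship, then the first atom. `describe_relationship` is a helper name introduced here for illustration, and the STYPE fields in the sample row are written as `AUI`, which is an assumption about the original data.

```python
MRREL_COLUMNS = ['CUI1', 'AUI1', 'STYPE1', 'REL', 'CUI', 'AUI', 'STYPE2', 'RELA',
                 'RUI', 'SRUI', 'SAB', 'SL', 'RG', 'DIR', 'SUPPRESS', 'CVF']


def describe_relationship(line):
    """Render one MRREL-style row as 'AUI <relationship> AUI1'."""
    # Rows end with a trailing '|', so keep only as many fields as there are columns
    fields = line.rstrip('\n').split('|')[:len(MRREL_COLUMNS)]
    record = dict(zip(MRREL_COLUMNS, fields))
    # RELA can be empty; fall back to the broader REL label in that case
    label = record['RELA'] or record['REL']
    return "{} {} {}".format(record['AUI'], label, record['AUI1'])


row = "C0000039|A0016515|AUI|SY|C0000039|A13096036|AUI|translation_of|R73331672||MSHPOR|MSHPOR|||N||"
sentence = describe_relationship(row)
print(sentence)  # A13096036 translation_of A0016515
```

Reading left to right, the renamed AUI column is the subject of the relationship and AUI1 is its object.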

In [5]:
mrrel_columns = ['CUI1','AUI1','STYPE1','REL','CUI','AUI','STYPE2','RELA','RUI','SRUI','SAB','SL','RG','DIR','SUPPRESS','CVF']
mrrel= dcf.read_csv(
    '/home/karl/PycharmProjects/DLAI/datasets/UMLS/MRREL.RRF',
    sep = '|',
    names=mrrel_columns,
    dtype='str',
)

# Treat missing RELA values as empty strings so every value can be compared exactly
rela = mrrel['RELA'].fillna('')

for rel in rel_list:
    # Exact comparison: str.match() is a regex prefix match, so 'use' would also
    # select 'used_for', and an empty pattern would match every row
    match_df = mrrel[rela == rel]
    match_df = match_df.reset_index()
    # Rows without a RELA value get a placeholder name for their output file
    rel_name = rel if rel != '' else 'rel_with_no_name'
    file_name = '/home/karl/PycharmProjects/DLAI/corpus/UMLS_base/all_relationships/{relationship}_hierarchy.csv'.format(relationship = rel_name)
    match_df.to_csv(file_name, sep='|', single_file=True)

# That's it, you've built the UMLS corpus base
Now we need to build a programmatic data retrieval interface for the corpus base.
The first data we are interested in building retrieval for is the "is a" relationships stored in the relationship directory of the corpus.
We will want to store these as a stand-in for hypernymy in the DL portion of the corpus, for our hypernymy lookup table.
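As a rough sketch of what such a hypernymy lookup could look like, the example below builds a child-to-parents table from "isa" rows and walks it upward. It assumes the direction convention described earlier (the renamed CUI column "isa" CUI1, so CUI1 is the hypernym). The concept identifiers are made up, and `all_hypernyms` is a hypothetical helper, not part of the existing corpus code.

```python
from collections import defaultdict

# Illustrative (CUI1, CUI, RELA) triples: CUI "isa" CUI1
isa_rows = [
    ("C_ORGAN", "C_HEART", "isa"),
    ("C_BODY_PART", "C_ORGAN", "isa"),
    ("C_ORGAN", "C_LUNG", "isa"),
]

# Direct hypernyms: concept -> set of parent concepts
hypernyms = defaultdict(set)
for cui1, cui, rela in isa_rows:
    if rela == "isa":
        hypernyms[cui].add(cui1)


def all_hypernyms(concept):
    """Collect direct and indirect hypernyms by walking the table upward."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in hypernyms.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(all_hypernyms("C_HEART")))
```

Tracking the `seen` set keeps the walk from looping forever if the source data happens to contain a cycle.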

