# Installs and Imports

## Installs

In [20]:
%pip install -i https://test.pypi.org/simple/ mdb-tools --upgrade

Looking in indexes: https://test.pypi.org/simple/
Collecting mdb-tools
  Downloading https://test-files.pythonhosted.org/packages/3e/4e/023a4e390770ea17eacdfbdf476268bbfd2765b78adda039fa9fb2494b1d/mdb_tools-0.7.0-py3-none-any.whl (15 kB)
Installing collected packages: mdb-tools
  Attempting uninstall: mdb-tools
    Found existing installation: mdb-tools 0.6.14
    Uninstalling mdb-tools-0.6.14:
      Successfully uninstalled mdb-tools-0.6.14
Successfully installed mdb-tools-0.7.0
Note: you may need to restart the kernel to use updated packages.


## Imports

In [2]:
from neo4j import GraphDatabase, basic_auth
from mdb_tools import NelsonMDB
from mdb_tools.mdb_tools import get_entity_type
from bento_meta.entity import Entity
from bento_meta.objects import Term, Concept, Predicate, Node, Edge, Property
from bento_meta.mdb import MDB, read_txn, read_txn_value, read_txn_data
from bento_meta.mdb.writeable import WriteableMDB, write_txn
import pandas as pd

# Database connection

First we need the URL, username, and password for the database connection.

These tools were developed to be used with the Bento [Metamodel database (MDB)](https://github.com/CBIIT/bento-meta), but the examples provided should work with [an empty Neo4J graph database](https://sandbox.neo4j.com/?ref=get-started-dropdown-cta&persona=data-scientist) too.

The metamodel database (MDB) records:
- node/relationship/property structure of models;
- the official local vocabulary - terms that are employed in the backend data system;
- synonyms for local vocabulary mapped from external standards; and
- the value sets for properties with enumerated value domains, and data types for other properties.

In [3]:
# MDB sandbox
URL = "bolt://localhost:7687" # <URL for database>
USER = "neo4j" # <Username for database>
PASSWORD = "noble-use-dairy" # <Password for database>

Use these credentials to instantiate a "NelsonMDB" object (a subclass of bento-meta's WriteableMDB).

Many of the tools in the mdb_tools package use this object to interact with the Metamodel database in a number of different ways without requiring in-depth knowledge of the Neo4J Python Driver.

In [4]:
mdbn = NelsonMDB(uri=URL, user=USER, password=PASSWORD)

# Bento-meta Entities

[Bento-meta entities](https://cbiit.github.io/bento-meta/the_object_model.html#component-objects) are Python objects that represent Neo4J nodes in the database. We use these as input for many of the mdb tools to specify the nodes we wish to work with.

Let's look at some of the types of entities found in the MDB, and how we can set their [attributes](https://cbiit.github.io/bento-meta/the_object_model.html#objects-and-their-attributes) with Python.

<img src="https://github.com/CBIIT/bento-meta/blob/master/metamodel.svg?raw=true" width="750px"/>

### Node

First, let's look at the node entity. This represents a Neo4j node with the label "node." In the MDB, this represents a model node or endpoint, such as "Subject" in the Cancer Data Aggregator (CDA) model or "file" in the Genomic Data Commons (GDC) model.

For a node, the combination of its model and handle is unique in the MDB. (Note: ALL entities must also have a "nanoid" attribute, which we will touch on later in the [MDB Tools section](#make_nano).)

In [6]:
# first we can instantiate a Node and add its attributes via object properties
test_node = Node()
test_node.handle = "file"
test_node.model = "GDC"

In [48]:
# we can then access those attributes later in the same way
print(f"Node handle: {test_node.handle}; Node model: {test_node.model}")

Node handle: file; Node model: GDC


### Edge

Edge entities represent relationships between model nodes. In the MDB these are nodes with the label "relationship," such as the relationship node with handle "data_from" in the GDC model. This has a source node with the handle "file" and a destination node with the handle "slide."

For a given edge/relationship between a source node (linked via the "has_src" Neo4J relationship) and destination node (via "has_dst"), the combination of its model & handle along with its source and destination node handles is unique within a model.

In [58]:
# we can also set attributes of an entity as a dictionary
test_edge = Edge({
    "handle": "data_from",
    "model": "GDC"
})

# define source and destination nodes separately
test_src = Node({"handle": "file", "model": "GDC"})
test_dst = Node({"handle": "slide", "model": "GDC"})

# can then set edge attributes src & dst, which accept entities.
test_edge.src = test_src
test_edge.dst = test_dst

print(f"Edge handle: {test_edge.handle}, model: {test_edge.model}, src: {test_edge.src.handle}, dst: {test_edge.dst.handle}")


Edge handle: data_from, model: GDC, src: file, dst: slide


### Property

Property entities represent a property of a node or edge/relationship, and are linked to that entity as the destination of a "has_property" Neo4J relationship. An example is the property with the handle "file_id" of the node with the handle "file."

For a given property, the combination of its model and handle along with the handle of the source entity on the other side of the "has_property" relationship is unique.

In [7]:
test_prop = Property({
    "handle": "file_id",
    "model": "GDC"
})

print(f"Property handle: {test_prop.handle}, model: {test_prop.model}, node: {test_node.handle}")

Property handle: file_id, model: GDC, node: file


### Concept

Concept entities represent intellectual concepts, and are represented only by their nanoid. 

They are used in the MDB to connect Term nodes that are identical in meaning (synonymous). This can link semantically identical terms with values such as "Wilm's Tumor" and "Nephroblastoma, NOS" and their representative codes from external vocabularies such as "8960/3" all to the same Concept (with nanoid eN7Kvk) via the "represents" Neo4J relationship.

They are also used to connect other entities, such as Properties, via the "has_concept" Neo4J relationship. In this case, they represent inter-model mappings such as between:
- the CDA property with the handle "identifier" of the node with the handle "File"
- the GDC property with the handle "file_id" of the node with the handle "file"
- the PDC property with the handle "diagnosis_id" of the node with the handle "diagnosis"

All three of these can be linked by the same Concept with a nanoid of "BrFFDM" via the "has_concept" Neo4J relationship, as they represent the same identifier data within each model.

One important distinction is that a Term "represents" a Concept, while a Property (or other node) "has_concept" a Concept. This highlights the fact that Terms are words for things, not the things themselves.

In [10]:
test_concept = Concept({"nanoid": "eN7Kvk"})

print(f"Concept nanoid: {test_concept.nanoid}")

Concept nanoid: eN7Kvk
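
The difference between "represents" and "has_concept" can be pictured with a small in-memory model. This is only a sketch using plain tuples, not how bento-meta or Neo4J stores the graph; the `synonyms_of` helper and the "MODEL:node.property" labels are hypothetical names chosen for illustration.

```python
# (source, relationship, destination concept) triples taken from the examples above
triples = [
    ("Wilm's Tumor", "represents", "eN7Kvk"),
    ("Nephroblastoma, NOS", "represents", "eN7Kvk"),
    ("8960/3", "represents", "eN7Kvk"),
    ("CDA:File.identifier", "has_concept", "BrFFDM"),
    ("GDC:file.file_id", "has_concept", "BrFFDM"),
    ("PDC:diagnosis.diagnosis_id", "has_concept", "BrFFDM"),
]


def synonyms_of(name, relationship):
    """Return the other sources that share a concept with `name` via `relationship`."""
    concepts = {dst for src, rel, dst in triples if src == name and rel == relationship}
    return sorted(
        src
        for src, rel, dst in triples
        if rel == relationship and dst in concepts and src != name
    )


print(synonyms_of("Wilm's Tumor", "represents"))
print(synonyms_of("GDC:file.file_id", "has_concept"))
```

Terms reach a Concept through "represents", while Properties reach it through "has_concept", so the two kinds of link never mix even when they point at Concepts of the same shape.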


### Term

Term entities represent instances of an encoding of a concept. The value attribute is a string representation of a term, and often they represent different unique values within a model's database for a given Property (which often represents a field/column name).

For example, the synonymous terms with values of "Wilm's Tumor" and "Nephroblastoma, NOS" with origin_name "GDC" are valid values of the GDC Property with the handle "relationship_primary_diagnosis." Another term in the "value set" for this property is "Bone Cancer."

For a given term, the combination of its value and origin_name is unique (origin_id and origin_version may also be required for unique identification of terms in the future).

In [37]:
test_term_1 = Term({
    "value": "Wilm's Tumor",
    "origin_name": "GDC"
})
test_term_2 = Term({
    "value": "Nephroblastoma, NOS",
    "origin_name": "GDC"
})
test_term_3 = Term({
    "value": "8960/3",
    "origin_name": "GDC"
})

test_concept.terms = [test_term_1, test_term_2, test_term_3]

print(f"Term values: '{test_term_1.value}', '{test_term_2.value}', '{test_term_3.value}', model: {test_term_1.origin_name}")
print(f"Concept Terms: {test_concept.terms}")

Term values: 'Wilm's Tumor', 'Nephroblastoma, NOS', '8960/3', model: GDC
Concept Terms: {"Wilm's Tumor": <bento_meta.objects.Term object at 0x000001FF118554F0>, 'Nephroblastoma, NOS': <bento_meta.objects.Term object at 0x000001FF118554C0>, '8960/3': <bento_meta.objects.Term object at 0x000001FF118556D0>}


### Predicate

Predicate entities represent semantic relationships between two Neo4J concept nodes. Predicates are also uniquely identified by their nanoid attribute, but also have handles associated with them and are linked to concept nodes via the "has_subject" and "has_object" Neo4J relationships.

For example, the predicate with the handle "exactMatch" links the concepts with nanoids "n3udfp" and "4jkzA3", each of which represents terms with values of "Lung" in the GDC model.

Predicate handles are derived from the [SKOS vocabulary](https://www.w3.org/TR/2005/WD-swbp-skos-core-spec-20051102/). Some examples that will be used in the MDB are "exactMatch," "closeMatch," "broader," "narrower," and "related".

In [41]:
test_pred = Predicate({
    "handle": "exactMatch",
    "subject": Concept({"nanoid": "4jkzA3"}),
    "object": Concept({"nanoid": "n3udfp"})
})

print(f"Predicate handle: {test_pred.handle}, subject nanoid: '{test_pred.subject.nanoid}', object nanoid: '{test_pred.object.nanoid}'") # type: ignore

Predicate handle: exactMatch, subject nanoid: '4jkzA3', object nanoid: 'n3udfp'
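
The uniqueness rules described for each entity type can be summarized in one place. The sketch below uses plain dictionaries rather than bento-meta objects, and `identity_key` is a hypothetical helper, not part of mdb_tools or bento-meta.

```python
def identity_key(entity_type, attrs, extra_handles=()):
    """Build the tuple of values that identifies an entity, per the rules above."""
    if entity_type == "node":
        return (attrs["model"], attrs["handle"])
    if entity_type == "property":
        # extra_handles: (handle of the entity that "has_property" this property,)
        return (attrs["model"], attrs["handle"], extra_handles[0])
    if entity_type == "edge":
        # extra_handles: (source node handle, destination node handle)
        return (attrs["model"], attrs["handle"], extra_handles[0], extra_handles[1])
    if entity_type == "term":
        return (attrs["value"], attrs["origin_name"])
    if entity_type in ("concept", "predicate"):
        return (attrs["nanoid"],)
    raise ValueError(f"unknown entity type: {entity_type}")


print(identity_key("node", {"handle": "file", "model": "GDC"}))
print(identity_key("edge", {"handle": "data_from", "model": "GDC"}, ("file", "slide")))
print(identity_key("property", {"handle": "file_id", "model": "GDC"}, ("file",)))
```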


# MDB Tools

## Setup for Blank Sandbox

Run this code to set up the database if you are trying this notebook on a new Neo4J sandbox, or some of the examples may not work properly.

If you are trying this on an actual MDB instance, the examples should work without this setup.

In [None]:
# basic tools setup


# link_synonyms setup
test_term_1 = Term({"value": "Epithelioma, benign", "origin_name": "GDC", "nanoid": "zRGp4a"})
test_term_2 = Term({"value": "Epithelial tumor, benign", "origin_name": "GDC", "nanoid": "GmGVWZ"})
test_concept_1 = Concept({"nanoid": "gNbSxf"})
mdbn.create_entity(test_term_1)
mdbn.create_entity(test_term_2)
mdbn.create_relationship(test_term_1, test_concept_1, "represents")
mdbn.create_relationship(test_term_2, test_concept_1, "represents")

# merge_two_concepts setup
test_concept_1 = Concept({"nanoid": "oXDdoU"})
test_concept_2 = Concept({"nanoid": "vHb6hz"})
test_concept_3 = Concept({"nanoid": "7v2zvu"})
mdbn.create_entity(test_concept_1)
mdbn.create_entity(test_concept_2)
mdbn.create_entity(test_concept_3)
test_term_1 = Term({"value": "Wilms Tumor", "origin_name": "NDC", "nanoid": "ajQta4"})
test_term_2 = Term({"value": "Nephroblastoma, NOS", "origin_name": "NDC", "nanoid": "1Njrd8"})
test_term_3 = Term({"value": "8960/3", "origin_name": "MDC", "nanoid": "e5zCQJ"})
test_term_4 = Term({"value": "Wilms Tumor", "origin_name": "ODC", "nanoid": "nVEvUF"})
mdbn.create_entity(test_term_1)
mdbn.create_entity(test_term_2)
mdbn.create_entity(test_term_3)
mdbn.create_entity(test_term_4)
mdbn.create_relationship(test_term_1, test_concept_1, "represents")
mdbn.create_relationship(test_term_2, test_concept_1, "represents")
mdbn.create_relationship(test_term_3, test_concept_2, "represents")
mdbn.create_relationship(test_term_4, test_concept_3, "represents")
mdbn.link_concepts_to_predicate(test_concept_2, test_concept_3)

# find synonyms setup
test_term = Term({"value": "Melanoma", "origin_name": "GDC", "nanoid": "1FQA8k"})

## Basic Tools

These are some of the basic building blocks of the MDB toolkit and what they do. All of these tools are methods of the NelsonMDB object that [we defined earlier](#database-connection).

If a tool uses "entity" in the name, it generally will accept the Node, Property, Edge, & Term entities. Some of these may also work with other entities (such as Concepts and Predicates) and as functionality is added these examples will be updated accordingly.

### get_entity_count

Returns the number of entities in the MDB that match the given entity (with its set attributes), as a list whose first item is the count.
- If count = 0, entity with given properties not found
- If count = 1, entity with given properties is unique
- If count > 1, more properties needed to uniquely identify entity
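
The three cases can be reproduced with an in-memory stand-in for the database. This is a sketch only: the `records` list, `count_matches`, and `describe` are hypothetical and do not reflect how get_entity_count queries Neo4J.

```python
# Hypothetical stand-in for Term nodes stored in the database
records = [
    {"value": "Cancer", "origin_name": "GDC"},
    {"value": "Cancer", "origin_name": "BentoTailorX"},
]


def count_matches(records, attrs):
    """Count records whose attributes include every key/value pair in attrs."""
    return sum(
        all(record.get(key) == value for key, value in attrs.items())
        for record in records
    )


def describe(count):
    """Translate a match count into the interpretation listed above."""
    if count == 0:
        return "not found"
    if count == 1:
        return "unique"
    return "ambiguous, add more attributes"


for attrs in (
    {"value": "Cancerr"},
    {"value": "Cancer"},
    {"value": "Cancer", "origin_name": "GDC"},
):
    n = count_matches(records, attrs)
    print(n, describe(n))
```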

In [73]:
test_term = Term({"value": "Cancerr"})

entity_count = mdbn.get_entity_count(test_term)

print(entity_count[0])

0


Oops, we misspelled the term value, let's try again with the correct value.

In [74]:
test_term.value = "Cancer"

entity_count = mdbn.get_entity_count(test_term)

print(entity_count[0])

2


Better, but Terms also need an origin_name attribute for unique identification.

In [75]:
test_term.origin_name = "GDC"

entity_count = mdbn.get_entity_count(test_term)

print(entity_count[0])

1


Nice, so now we know a term with the value "Cancer" and the origin_name "GDC" exists in the database.

### make_nano

As previously mentioned, each entity in the database needs a unique nanoid to act as an identifier.

Nanoids are 6-character strings made up of letters from the English alphabet and the digits 0-9, and can be generated with the following function:

In [62]:
print(mdbn.make_nano())

2FdsZc
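
For readers without the package installed, an identifier of the same shape can be produced with the standard library. This is a sketch of the format described above, not the implementation behind make_nano; the name `make_nano_sketch` and the choice of the `secrets` module are assumptions.

```python
import secrets
import string

# Letters of the English alphabet (both cases) plus the digits 0-9
ALPHABET = string.ascii_letters + string.digits


def make_nano_sketch(size=6):
    """Return a random string of `size` characters drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(size))


print(make_nano_sketch())
```

With 62 possible characters in each of 6 positions there are 62**6 (about 57 billion) possible values, so collisions are unlikely but not impossible; checking the database before using a new nanoid is still worthwhile.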


### get_entity_nano

This function returns a list of entity nanoids matching the given entity's attributes.

In [78]:
print(mdbn.get_entity_nano(test_term))

['Mg4qVS']


As previously mentioned, some entity types need additional information to be uniquely identified.

The extra_handle_1 & extra_handle_2 parameters are used to hold this information for now and can be set in a few different ways.

In [91]:
# for edge nodes, use extra_handle_1 to designate the src
# handle for which that edge "has_src" and extra_handle_2
# for the dst handle for which that edge "has_dst"
test_edge = Edge({
    "handle": "describes",
    "model": "GDC"
})

test_node_src = Node({"handle": "treatment", "model": "GDC"})
test_node_dst = Node({"handle": "diagnosis", "model": "GDC"})

test_edge.src = test_node_src
test_edge.dst = test_node_dst

print(mdbn.get_entity_nano(
    entity=test_edge,
    # using handle of the src attribute of test_edge
    extra_handle_1=test_edge.src.handle, # type: ignore
    # using handle of the dst node itself
    extra_handle_2=test_node_dst.handle)) # type: ignore

['m51zMt']


In [93]:
# for property nodes, use extra_handle_1 to designate 
# the node handle that "has_property" for that property
test_prop = Property({
    "handle": "treatment_type",
    "model": "GDC"
})

print(mdbn.get_entity_nano(
    entity=test_prop,
    extra_handle_1="treatment")) # using string value of node handle

['KEMRe0']


### get_or_make_nano

This function combines the two nanoid functions above. It returns a nanoid if the given entity exists uniquely or generates a new one if the entity doesn't exist.

This function also uses the extra_handle_1 & extra_handle_2 parameters to uniquely identify property and edge entities.

In [100]:
real_term = Term({
    "value": "Cancer",
    "origin_name": "GDC"
})

fake_term = Term({
    "value": "foo",
    "origin_name": "bar"
})

mdbn.get_or_make_nano(real_term), mdbn.get_or_make_nano(fake_term)

('Mg4qVS', '46pxeV')

If we run the same function again on these term entities, real_term should return the same nanoid and fake_term should get a newly generated one (since it doesn't exist in the MDB).

In [101]:
mdbn.get_or_make_nano(real_term), mdbn.get_or_make_nano(fake_term)

('Mg4qVS', 'W7Wtqv')
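
The behavior seen in these two runs follows from a simple rule, sketched below with the standard library. The function name and the handling of more than one match (raising an error) are assumptions; the source only describes the "exists uniquely" and "doesn't exist" cases.

```python
import secrets
import string


def _new_nano(size=6):
    """Generate a random alphanumeric identifier."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(size))


def get_or_make_nano_sketch(matching_nanos):
    """Pick a nanoid given the nanoids of entities that match in the database."""
    if len(matching_nanos) == 1:
        return matching_nanos[0]  # entity exists uniquely: reuse its nanoid
    if not matching_nanos:
        return _new_nano()  # entity not found: generate a fresh nanoid
    # Assumption: an ambiguous match is treated as an error in this sketch
    raise ValueError("entity is not unique; add more identifying attributes")


print(get_or_make_nano_sketch(["Mg4qVS"]))  # existing entity keeps its nanoid
print(get_or_make_nano_sketch([]))  # missing entity gets a new one on each call
```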

### create_entity

This function actually creates the given entity in the database. 

As previously mentioned, entities MUST have a nanoid before being added to the database. The recommended way to do this is with the get_or_make_nano function described above.

If the entity already exists in the database, this function does nothing.

In [121]:
test_node = Node({
    "handle": "ResearchSubject",
    "model": "NDC"
})
test_node.nanoid = mdbn.get_or_make_nano(test_node)

# entity count should initially be 0
mdbn.get_entity_count(test_node)[0]

0

In [122]:
# create the entity
mdbn.create_entity(test_node)

Creating new node node with properties: {handle: 'ResearchSubject', model: 'NDC', nanoid: 'wqSKUm'}


[]

In [123]:
# entity count should now be 1
mdbn.get_entity_count(test_node)[0]

1

### detach_delete_entity

This function removes the given entity from the database. Be careful with this if using in a production environment.

In [124]:
# delete the entity
mdbn.detach_delete_entity(test_node)

Removing node node with with properties: {handle: 'ResearchSubject', model: 'NDC', nanoid: 'wqSKUm'}


[]

In [125]:
# entity count should now be 0 again
mdbn.get_entity_count(test_node)[0]

0

### create_relationship

This function creates a relationship of the given type ("relationship" parameter of function) directed from a src_entity to a dst_entity.

As with the create_entity function, if the relationship already exists between the given entities, this function does nothing.

In [130]:
test_src_node = Node({
    "handle": "clinical",
    "model": "GDC"
})

test_dst_prop = Property({
    "handle": "vital_status",
    "model": "GDC"
})

mdbn.create_relationship(
    src_entity=test_src_node,
    dst_entity=test_dst_prop,
    relationship="has_property")

Ensuring has_property relationship exists between src node with properties: {handle: 'clinical', model: 'GDC'} to dst property with properties: {handle: 'vital_status', model: 'GDC'}


[]

### get_concepts

This function returns a list of concepts associated with a given entity.

The list shouldn't contain more than one item (if it does, this is probably an opportunity for [linking concepts](#linking-concepts)).

In [140]:
test_term = Term({
    "value": "Lung",
    "origin_name": "NCIt"
})

mdbn.get_concepts(test_term)

['4jtzA3']

### get_entity_attrs

This function returns a collection of attributes associated with the given entity. The output_str parameter (default True) indicates whether to return the attributes as a string formatted for use in Neo4J Cypher queries or as a dictionary for use as bento_meta entity attributes.

It is also useful for checking which attributes have been set for an entity.

In [141]:
# default string format
mdbn.get_entity_attrs(test_term)

"{value: 'Lung', origin_name: 'NCIt'}"

In [144]:
# add a property to see the difference
test_term.nanoid = mdbn.get_or_make_nano(test_term)

# dict format
mdbn.get_entity_attrs(test_term, output_str=False)

{'value': 'Lung', 'origin_name': 'NCIt', 'nanoid': 'cCdN9n'}
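
The string form resembles a Cypher property map. A minimal sketch of how such a string could be built from the dictionary form is shown below; `attrs_to_cypher` is a hypothetical helper, and the escaping rule is an assumption rather than what get_entity_attrs does internally. It only handles strings and numbers.

```python
def attrs_to_cypher(attrs):
    """Format a dict of attributes as a Cypher-style property map string."""
    parts = []
    for key, value in attrs.items():
        if isinstance(value, str):
            # Escape backslashes and single quotes so the literal stays valid
            escaped = value.replace("\\", "\\\\").replace("'", "\\'")
            parts.append(f"{key}: '{escaped}'")
        else:
            parts.append(f"{key}: {value}")
    return "{" + ", ".join(parts) + "}"


print(attrs_to_cypher({"value": "Lung", "origin_name": "NCIt"}))
# prints {value: 'Lung', origin_name: 'NCIt'}
```

When writing queries by hand, passing values as query parameters to the Neo4J driver avoids the need for this kind of string building.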

### get_term_nanos & get_predicate_nanos

These functions return lists of nanoids for all terms & predicates (respectively) representing the given concept.

In [152]:
# generate test concept using a nanoid for a known term
test_term = Term({
    "value": "Lung",
    "origin_name": "GDC"
})

test_concept = Concept()
test_concept.nanoid = mdbn.get_concepts(test_term)[0]

# get nanoids of all terms associated with test_concept
mdbn.get_term_nanos(test_concept)

['cCdN9n', 'dyBbzh']

In [151]:
# get nanoids of all predicates associated with test_concept
mdbn.get_predicate_nanos(test_concept)

['C7iTgM']

## Linking Synonyms via Concept


### link_synonyms

This function takes two entities that are synonymous (as determined by the user/SME) and ensures that they are linked in the MDB through a shared Concept, using a 'represents' Neo4J relationship for Terms or a 'has_concept' relationship for other entities.

If one or both entities don't exist in the MDB and add_missing_ent is True, they will be added.
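
The cases this function has to handle can be sketched with a dictionary that maps each entity to its Concept nanoid. This is an illustration of the decision logic only: `link_synonyms_sketch` is hypothetical, and raising an error when the two entities already have different Concepts is an assumption made for this sketch.

```python
import secrets
import string


def link_synonyms_sketch(concept_of, ent_1, ent_2):
    """Link two entity keys through one concept id stored in `concept_of`."""
    c1, c2 = concept_of.get(ent_1), concept_of.get(ent_2)
    if c1 is not None and c2 is not None:
        if c1 == c2:
            return c1  # already connected via the same concept: nothing to do
        # Assumption: differing concepts are left to the concept-linking tools
        raise ValueError(f"entities have different concepts: {c1}, {c2}")
    concept = c1 or c2  # reuse the concept of whichever entity has one
    if concept is None:
        alphabet = string.ascii_letters + string.digits
        concept = "".join(secrets.choice(alphabet) for _ in range(6))
    concept_of[ent_1] = concept
    concept_of[ent_2] = concept
    return concept


# One term already has a concept, the other does not
concept_of = {("Minimally Invasive Lung Adenocarcinoma", "NCIt"): "0c271a"}
shared = link_synonyms_sketch(
    concept_of,
    ("Minimally Invasive Lung Adenocarcinoma", "NCIt"),
    ("Alveolar adenocarcinoma", "NDC"),
)
print(shared)
```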

In [None]:
# both terms exist & connected via concept
test_term_1 = Term()
test_term_1.value = "Epithelioma, benign"
test_term_1.origin_name = "GDC"
test_term_1.nanoid = mdbn.get_or_make_nano(test_term_1)

test_term_2 = Term()
test_term_2.value = "Epithelial tumor, benign"
test_term_2.origin_name = "GDC"
test_term_2.nanoid = mdbn.get_or_make_nano(test_term_2)

mdbn.link_synonyms(test_term_1, test_term_2, add_missing_ent=True)

Both terms are already connected via Concept pSXy2q


In [165]:
# both terms exist but neither have concept representing them
test_term_1 = Term()
test_term_1.value = "Cancer"
test_term_1.origin_name = "BentoTailorX"
test_term_1.nanoid = mdbn.get_or_make_nano(test_term_1)

test_term_2 = Term()
test_term_2.value = "Cancer"
test_term_2.origin_name = "GDC"
test_term_2.nanoid = mdbn.get_or_make_nano(test_term_2)

mdbn.link_synonyms(test_term_1, test_term_2, add_missing_ent=True)

Creating new concept node with properties: {nanoid: 'uPhyMk'}
Ensuring represents relationship exists between src term with properties: {value: 'Cancer', nanoid: 'anmyEP', origin_name: 'BentoTailorX'} to dst concept with properties: {nanoid: 'uPhyMk'}
Ensuring represents relationship exists between src term with properties: {value: 'Cancer', nanoid: 'tKVteT', origin_name: 'GDC'} to dst concept with properties: {nanoid: 'uPhyMk'}


In [166]:
# one term exists & already has concept
test_term_1 = Term()
test_term_1.value = "Minimally Invasive Lung Adenocarcinoma"
test_term_1.origin_name = "NCIt"
test_term_1.nanoid = mdbn.get_or_make_nano(test_term_1)

test_term_2 = Term()
test_term_2.value = "Alveolar adenocarcinoma"
test_term_2.origin_name = "NDC"
test_term_2.nanoid = mdbn.get_or_make_nano(test_term_2)

mdbn.link_synonyms(test_term_1, test_term_2, add_missing_ent=True)

Creating new term node with properties: {value: 'Alveolar adenocarcinoma', nanoid: 'kGDUya', origin_name: 'NDC'}
Ensuring represents relationship exists between src term with properties: {value: 'Minimally Invasive Lung Adenocarcinoma', nanoid: '9G8qkD', origin_name: 'NCIt'} to dst concept with properties: {nanoid: '0c271a'}
Ensuring represents relationship exists between src term with properties: {value: 'Alveolar adenocarcinoma', nanoid: 'kGDUya', origin_name: 'NDC'} to dst concept with properties: {nanoid: '0c271a'}


In [None]:
# one term exists & doesn't have concept yet
test_term_1 = Term()
test_term_1.value = "Carcinoma, anaplastic, NOS"
test_term_1.origin_name = "BentoTailorX"
test_term_1.nanoid = mdbn.get_or_make_nano(test_term_1)

test_term_2 = Term()
test_term_2.value = "Undifferentiated Carcinoma"
test_term_2.origin_name = "NDC"
test_term_2.nanoid = mdbn.get_or_make_nano(test_term_2)

mdbn.link_synonyms(test_term_1, test_term_2, add_missing_ent=True)

Creating new term node with properties: {value: 'Undifferentiated Carcinoma', origin_name: 'NDC'}
Creating new concept node with properties: {nanoid: 'SZs2Th'}
Creating represents relationship between term with properties: {value: 'Carcinoma, anaplastic, NOS', origin_name: 'BentoTailorX'} and Concept with {nanoid: 'SZs2Th'}
Creating represents relationship between term with properties: {value: 'Undifferentiated Carcinoma', origin_name: 'NDC'} and Concept with {nanoid: 'SZs2Th'}


In [None]:
# neither term exists
test_term_1 = Term()
test_term_1.value = "Epithelioma, malignant"
test_term_1.origin_name = "NDC"
test_term_1.nanoid = mdbn.get_or_make_nano(test_term_1)

test_term_2 = Term()
test_term_2.value = "Carcinoma"
test_term_2.origin_name = "NDC"
test_term_2.nanoid = mdbn.get_or_make_nano(test_term_2)

mdbn.link_synonyms(test_term_1, test_term_2, add_missing_ent=True)

Creating new term node with properties: {value: 'Epithelioma, malignant', nanoid: 'xSAWgU', origin_name: 'NDC'}
Creating new term node with properties: {value: 'Carcinoma', nanoid: '7WZ7gd', origin_name: 'NDC'}
Creating new concept node with properties: {nanoid: '0QKDxb'}
Creating represents relationship from src term with properties: {value: 'Epithelioma, malignant', nanoid: 'xSAWgU', origin_name: 'NDC'} to dst term with properties: {nanoid: '0QKDxb'}
Creating represents relationship from src term with properties: {value: 'Carcinoma', nanoid: '7WZ7gd', origin_name: 'NDC'} to dst term with properties: {nanoid: '0QKDxb'}


Let's also link some properties; links like these represent inter-model mappings, such as those between:

- the CDA property with the handle "identifier" of the node with the handle "File"
- the GDC property with the handle "file_id" of the node with the handle "file"
- the PDC property with the handle "diagnosis_id" of the node with the handle "diagnosis"

In [170]:
test_prop_1 = Property()
test_prop_1.handle = "identifier"
test_prop_1.model = "CDA"
test_prop_1.nanoid = mdbn.get_or_make_nano(test_prop_1, "File")

test_prop_2 = Property()
test_prop_2.handle = "diagnosis_id"
test_prop_2.model = "PDC"
test_prop_2.nanoid = mdbn.get_or_make_nano(test_prop_2, "diagnosis")

mdbn.link_synonyms(test_prop_1, test_prop_2, add_missing_ent=True)

Ensuring has_concept relationship exists between src property with properties: {handle: 'identifier', model: 'CDA', nanoid: 'MjBkRF'} to dst concept with properties: {nanoid: '0j7zCi'}
Ensuring has_concept relationship exists between src property with properties: {handle: 'diagnosis_id', model: 'PDC', nanoid: 'Yu5TGE'} to dst concept with properties: {nanoid: '0j7zCi'}


## Linking Concepts


When two existing Concept nodes are deemed synonymous, there are two primary ways to link them. The first is via a Predicate node with an "exactMatch" handle. This method maintains the existing Concept & Term structure while adding to it, allowing queries already in use to continue to work.

The second way is simply merging the two so they are represented by the same Concept node. With this approach, the Terms linked to each Concept would then be linked to the merged Concept instead. They could be merged under one of the existing Concepts, or a new Concept could be created and the old two removed. This method would invalidate existing queries that use the relevant Concepts & Terms.

### link_concepts_to_predicate

This function takes two Concept entities and ensures they are linked via the given predicate handle.

In [7]:
c_concept = Concept({"nanoid": "Egtdgp"})
lc_concept = Concept({"nanoid": "mGYuf2"})

mdbn.link_concepts_to_predicate(c_concept, lc_concept, "broader")
mdbn.link_concepts_to_predicate(lc_concept, c_concept, "narrower")

Creating new predicate node with properties: {handle: 'broader', nanoid: 'XffiEi'}
Ensuring has_subject relationship exists between src predicate with properties: {handle: 'broader', nanoid: 'XffiEi'} to dst concept with properties: {nanoid: 'Egtdgp'}
Ensuring has_object relationship exists between src predicate with properties: {handle: 'broader', nanoid: 'XffiEi'} to dst concept with properties: {nanoid: 'mGYuf2'}
Creating new predicate node with properties: {handle: 'narrower', nanoid: 'QSKHFH'}
Ensuring has_subject relationship exists between src predicate with properties: {handle: 'narrower', nanoid: 'QSKHFH'} to dst concept with properties: {nanoid: 'mGYuf2'}
Ensuring has_object relationship exists between src predicate with properties: {handle: 'narrower', nanoid: 'QSKHFH'} to dst concept with properties: {nanoid: 'Egtdgp'}


In [174]:
# these Concepts both represent Terms with value: 'Lung' in the MDB.
test_concept_1 = Concept({"nanoid": "4jtzA3"})
test_concept_2 = Concept({"nanoid": "n3udfp"})

mdbn.link_concepts_to_predicate(
    test_concept_1,
    test_concept_2,
    predicate_handle="exactMatch")

Creating new predicate node with properties: {handle: 'exactMatch', nanoid: 'hPigoR'}
Ensuring has_subject relationship exists between src predicate with properties: {handle: 'exactMatch', nanoid: 'hPigoR'} to dst concept with properties: {nanoid: '4jtzA3'}
Ensuring has_object relationship exists between src predicate with properties: {handle: 'exactMatch', nanoid: 'hPigoR'} to dst concept with properties: {nanoid: 'n3udfp'}
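
The log output above shows the shape of the result: one Predicate node plus a "has_subject" and a "has_object" relationship. The sketch below builds that structure as plain Python data; `predicate_link` is a hypothetical helper and the returned layout is chosen only for illustration.

```python
def predicate_link(handle, predicate_nanoid, subject_nanoid, object_nanoid):
    """Describe a Predicate node and the two relationships that attach it."""
    return {
        "predicate": {"handle": handle, "nanoid": predicate_nanoid},
        "relationships": [
            (predicate_nanoid, "has_subject", subject_nanoid),
            (predicate_nanoid, "has_object", object_nanoid),
        ],
    }


# Values taken from the "exactMatch" log output above
link = predicate_link("exactMatch", "hPigoR", "4jtzA3", "n3udfp")
for relationship in link["relationships"]:
    print(relationship)
```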


### merge_two_concepts

In [179]:
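# merge Concepts into one: the Term and Predicate linked to test_concept_2
# are re-linked to test_concept_1, and test_concept_2 is removed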
test_concept_1 = Concept({"nanoid": "oXDdoU"})
test_concept_2 = Concept({"nanoid": "vHb6hz"})

mdbn.merge_two_concepts(test_concept_1, test_concept_2)

Removing concept node with with properties: {nanoid: 'vHb6hz'}
Ensuring represents relationship exists between src term with properties: {nanoid: 'e5zCQJ'} to dst concept with properties: {nanoid: 'oXDdoU'}
Ensuring has_subject relationship exists between src predicate with properties: {nanoid: 'GFpCnh'} to dst concept with properties: {nanoid: 'oXDdoU'}
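
The effect of the merge can be sketched on a list of (source, relationship, destination) triples. This is an in-memory illustration under the assumption that merging means re-pointing relationships from the removed Concept to the kept one, as the log output suggests; `merge_concepts_sketch` is a hypothetical name.

```python
def merge_concepts_sketch(relationships, keep, remove):
    """Re-point every relationship that ends at `remove` so it ends at `keep`."""
    return [
        (src, rel, keep if dst == remove else dst)
        for src, rel, dst in relationships
    ]


# Nanoids taken from the sandbox setup and the log output above
relationships = [
    ("ajQta4", "represents", "oXDdoU"),
    ("1Njrd8", "represents", "oXDdoU"),
    ("e5zCQJ", "represents", "vHb6hz"),
    ("GFpCnh", "has_subject", "vHb6hz"),
]

merged = merge_concepts_sketch(relationships, keep="oXDdoU", remove="vHb6hz")
for triple in merged:
    print(triple)
```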


## Finding potentially synonymous Terms 

### get_term_synonyms

This function returns a list of dictionaries representing Term nodes that are potentially synonymous with the given Term, each with a similarity score.

In [183]:
test_term = Term({
    "value": "Melanoma",
    "origin_name": "GDC"
})

terms_to_csv = mdbn.get_term_synonyms(test_term)
print(terms_to_csv)



[{'value': 'Melanoma', 'origin_name': 'GDC', 'nanoid': '1FQA8k', 'similarity': 1.0, 'valid_synonym': 0}, {'value': 'Melanoma', 'origin_name': 'BentoTailorX', 'nanoid': '04jXAr', 'similarity': 1.0, 'valid_synonym': 0}, {'value': 'Melanoma', 'origin_name': 'ICDC', 'nanoid': 'FJhZkb', 'similarity': 1.0, 'valid_synonym': 0}, {'value': 'Melanoma', 'origin_name': 'NCIt', 'nanoid': '7VjG2K', 'similarity': 1.0, 'valid_synonym': 0}, {'value': 'Glioma', 'origin_name': 'NCIt', 'nanoid': 'S1BAsU', 'similarity': 0.9999999420005758, 'valid_synonym': 0}, {'value': 'Mesothelioma', 'origin_name': 'GDC', 'nanoid': '9UYCxx', 'similarity': 0.9999999420005758, 'valid_synonym': 0}, {'value': 'Mesothelioma', 'origin_name': 'BentoTailorX', 'nanoid': 'TNmFoz', 'similarity': 0.9999999420005758, 'valid_synonym': 0}, {'value': 'Glioma', 'origin_name': 'ICDC', 'nanoid': '3mnea2', 'similarity': 0.9999999420005758, 'valid_synonym': 0}, {'value': 'Mesothelioma', 'origin_name': 'NCIt', 'nanoid': 'pEFc19', 'similarity'

### potential_synonyms_to_csv

This function takes the output of get_term_synonyms() and writes them as a CSV to the specified filepath.

Terms are sorted in descending order based on their similarity score. The valid_synonym field (initially 0) can be set to 1 if a term is a valid synonym of the first Term. This indicates to the next function whether the terms should be linked via a concept.

In [184]:
file_name = "potential_synonyms_melanoma.csv"
file_path = "C:/Users/nelso/OneDrive - Georgetown University/School Stuff/Capstone/Test/" + file_name

mdbn.potential_synonyms_to_csv(terms_to_csv, file_path)
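
The resulting file layout can be approximated with the standard `csv` module. This sketch writes to an in-memory buffer rather than a file, and assumes the CSV columns match the dictionary keys shown in the get_term_synonyms output; `synonyms_to_csv_text` is a hypothetical helper, not the mdb_tools implementation.

```python
import csv
import io

FIELDS = ["value", "origin_name", "nanoid", "similarity", "valid_synonym"]


def synonyms_to_csv_text(rows):
    """Return CSV text with rows sorted by similarity, highest first."""
    ordered = sorted(rows, key=lambda row: row["similarity"], reverse=True)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(ordered)
    return buffer.getvalue()


# Two rows copied from the get_term_synonyms output above, deliberately out of order
rows = [
    {"value": "Glioma", "origin_name": "NCIt", "nanoid": "S1BAsU",
     "similarity": 0.9999999420005758, "valid_synonym": 0},
    {"value": "Melanoma", "origin_name": "GDC", "nanoid": "1FQA8k",
     "similarity": 1.0, "valid_synonym": 0},
]

csv_text = synonyms_to_csv_text(rows)
print(csv_text)
```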

### link_term_synonyms_csv

This function takes the CSV from potential_synonyms_to_csv() and links terms marked as a valid_synonym of the given term via a concept.

In [185]:
test_term = Term({
    "value": "Melanoma",
    "origin_name": "GDC"
})

csv_path = "C:/Users/nelso/OneDrive - Georgetown University/School Stuff/Capstone/Test/potential_synonyms_melanoma.csv"

mdbn.link_term_synonyms_csv(test_term, csv_path)
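
The selection step can be sketched with the standard `csv` module: only rows whose valid_synonym field was changed to 1 are kept for linking. The CSV text below is a hypothetical reviewed file (the row marked 1 is an illustrative choice), and `valid_synonym_rows` is not part of mdb_tools.

```python
import csv
import io

# Hypothetical reviewed CSV: one row has been marked as a valid synonym
reviewed_csv = (
    "value,origin_name,nanoid,similarity,valid_synonym\n"
    "Melanoma,GDC,1FQA8k,1.0,0\n"
    "Melanoma,NCIt,7VjG2K,1.0,1\n"
    "Glioma,NCIt,S1BAsU,0.9999999420005758,0\n"
)


def valid_synonym_rows(text):
    """Return the rows whose valid_synonym column is set to 1."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row["valid_synonym"].strip() == "1"]


for row in valid_synonym_rows(reviewed_csv):
    print(row["value"], row["origin_name"], row["nanoid"])
```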

# Scripts to add/update CDA mappings in the MDB

These scripts were developed to reformat the [CDA mappings](https://cda.readthedocs.io/en/latest/Schema/overview_mapping/) and add them to the MDB as concepts between Property Neo4J nodes, adding nodes as needed for entities not already in the MDB.

These examples show how the command line scripts can be used to update the MDB with these mappings. I've demonstrated them in the notebook here, but they are built for use in the terminal and will be much easier to read and use there. 

### Initial CDA Mappings in Excel

Let's see what a sheet in the initial CDA mappings looks like. There are six sheets in total:
- subject mapping
- researchsubject mapping
- diagnosis mapping
- treatment mapping
- specimen mapping
- file mapping

In [10]:
# this file connects to tables in CDA mapping documentation spreadsheet
# each sheet within the file contains the mappings for a CDA endpoint
# sheet names: subject mapping, researchsubject mapping, diagnosis mapping,
    # treatment mapping, specimen mapping, file mapping
RAW_CDA_MAP_EXCEL = "C:/Users/nelso/Documents/GitHub/HIDS-Capstone/data/CDA_mappings_20220721.xlsx"
pd.read_excel(RAW_CDA_MAP_EXCEL, sheet_name="subject mapping")

Unnamed: 0,field,GDC,PDC,IDC
0,id,files.cases.submitter_id,files.cases.case_submitter_id,files.PatientID
1,identifier.system,GDC,PDC,IDC
2,identifier.value,files.cases.submitter_id,files.cases.case_submitter_id,files.PatientID
3,species,Homo sapiens,files.cases.taxon,files.tcia_species
4,sex,files.cases.demographic.gender,files.cases.demographics.gender,NOT CURRENTLY MAPPED
5,race,files.cases.demographic.race,files.cases.demographics.race,NOT CURRENTLY MAPPED
6,ethnicity,files.cases.demographic.ethnicity,files.cases.demographics.ethnicity,NOT CURRENTLY MAPPED
7,days_to_birth,files.cases.demographic.days_to_birth,files.cases.demographics.days_to_birth,NOT CURRENTLY MAPPED
8,subject_associated_project,files.cases.project.project_id,files.cases.project_submitter_id,files.collection_id
9,vital_status,files.cases.demographic.vital_status,cases.demographics.vital_status,NOT CURRENTLY MAPPED


### Cleaning CDA Mappings to CSV

Let's apply the command line script to clean these into a format where each line contains the information needed to uniquely identify two entities that can be considered synonymous (representing these cross-model mappings).

In the terminal, typing just "clean-cda" is enough, but within the notebook we need to specify the script file.

In [16]:
# run function to clean CDA mapping Excel file
!python "C:\Users\nelso\Documents\GitHub\HIDS-Capstone\src\mdb_tools\scripts\clean_cda_map_excel.py" --input_filepath="C:\Users\nelso\Documents\GitHub\HIDS-Capstone\data\CDA_mappings_20220721.xlsx"

Cleaned CDA file now at C:\Users\nelso\Documents\GitHub\HIDS-Capstone\data\CDA_mappings_20220721_clean.csv
Other CDA file now at C:\Users\nelso\Documents\GitHub\HIDS-Capstone\data\CDA_mappings_20220721_other.csv


In [17]:
# check out df of cleaned CDA mappings (two synonymous entities per line)
CLEAN_CDA_MAP_CSV = "C:/Users/nelso/Documents/GitHub/HIDS-Capstone/data/CDA_mappings_20220721_clean.csv"
pd.read_csv(CLEAN_CDA_MAP_CSV)

Unnamed: 0,ent_1_model,ent_1_handle,ent_1_extra_handles,ent_2_model,ent_2_handle,ent_2_extra_handles
0,CDA,id,['Subject'],GDC,submitter_id,['case']
1,CDA,identifier,['Subject'],GDC,submitter_id,['case']
2,CDA,sex,['Subject'],GDC,gender,['demographic']
3,CDA,race,['Subject'],GDC,race,['demographic']
4,CDA,ethnicity,['Subject'],GDC,ethnicity,['demographic']
...,...,...,...,...,...,...
113,CDA,label,['File'],IDC,gcs_url,['file']
114,CDA,associated_project,['File'],IDC,collection_id,['file']
115,CDA,drs_uri,['File'],IDC,crdc_instance_uuid,['file']
116,CDA,imaging_modality,['File'],IDC,Modality,['file']
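
Comparing the raw and cleaned tables suggests how a dotted path such as "files.cases.submitter_id" becomes a property handle plus a parent node handle. The logic of clean_cda_map_excel.py is not shown in this notebook, so the rule below is only inferred from the two tables and may differ from the script; `split_mapping_path` and its naive singular form (dropping a trailing "s") are assumptions.

```python
def split_mapping_path(path):
    """Split a dotted mapping path into (property handle, [parent node handle]).

    Returns None for values that are not dotted paths, such as constants
    ("GDC", "Homo sapiens") or "NOT CURRENTLY MAPPED".
    """
    parts = path.split(".")
    if len(parts) < 2 or " " in path:
        return None
    handle = parts[-1]
    parent = parts[-2]
    # Naive singular form: "cases" -> "case", "files" -> "file"
    if parent.endswith("s"):
        parent = parent[:-1]
    return handle, [parent]


print(split_mapping_path("files.cases.submitter_id"))
print(split_mapping_path("files.cases.demographic.gender"))
print(split_mapping_path("NOT CURRENTLY MAPPED"))
```

The first two calls reproduce the GDC entries in rows 0 and 2 of the cleaned table; values that yield None here would be candidates for the separate "other" file.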


### Linking Mappings as Synonymous Property Entities

Now we can use the clean CSV to link each of these entities via a concept to their synonym(s).

Again, the terminal command "link-ents" alone can be used for this, but here in this notebook we have to provide full file paths to the script and all of the parameters at once.

In [19]:
!python "C:\Users\nelso\Documents\GitHub\HIDS-Capstone\src\mdb_tools\scripts\link_synonym_ents_csv.py" --csv_filepath="C:\Users\nelso\Documents\GitHub\HIDS-Capstone\data\CDA_mappings_20220721_clean.csv" --mdb_uri="bolt://localhost:7687" --mdb_user="neo4j" --mdb_pass="noble-use-dairy" --entity_type="property" --add_missing_ent="True"

Property entity with properties: {handle: 'id', model: 'CDA', nanoid: 'RqrWbb'} already exists in the database.
Property entity with properties: {handle: 'submitter_id', model: 'GDC', nanoid: '2uPJme'} already exists in the database.
Node entity with properties: {handle: 'Subject', model: 'CDA', nanoid: '3FYakr'} already exists in the database.
Node entity with properties: {handle: 'case', model: 'GDC', nanoid: 'ehMeUy'} already exists in the database.
Ensuring has_property relationship exists between src node with properties: {handle: 'Subject', model: 'CDA', nanoid: '3FYakr'} to dst property with properties: {handle: 'id', model: 'CDA', nanoid: 'RqrWbb'}
Ensuring has_property relationship exists between src node with properties: {handle: 'case', model: 'GDC', nanoid: 'ehMeUy'} to dst property with properties: {handle: 'submitter_id', model: 'GDC', nanoid: '2uPJme'}
Both entities are already connected via Concept RmXiVN
Property entity with properties: {handle: 'identifier', model: 'C