In [None]:
# Import John Snow License keys
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
import os
os.environ.update(license_keys)

Saving spark_nlp_for_healthcare_spark_ocr_4435.json to spark_nlp_for_healthcare_spark_ocr_4435.json


In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display
# Installing neo4j driver and xml parser
! pip install neo4j xmltodict



# Agenda
In this project, we will start by downloading biomedical articles from PubMed. PubMed provides an API to retrieve data as well as an FTP site, where daily updates are available.
Next, we will run the data through an NLP pipeline to extract relationships between biomedical entities. There are many open-source named entity recognition models out there, but unfortunately, I haven't come across any biomedical relation extraction models that don't require manual training. Since the goal of this post is not to teach you how to train a biomedical relation extraction model but rather how to apply it to solve real-world problems, we will be using the John Snow Labs Healthcare models. [John Snow Labs](https://www.johnsnowlabs.com/) offers free models for recognizing entities and extracting relations from news-like text. However, the biomedical models are not open-source. Luckily for us, the company offers a free 30-day trial period for healthcare models. To follow along with the examples in this post, you will need to start the free trial and obtain the license keys.
In the last part of this post, we will store the extracted relations in Neo4j, a native graph database designed to store and analyze highly interconnected data. 
# Steps
* Download and parse daily update of articles from the PubMed FTP site
* Store articles in Neo4j
* Use John Snow Labs models to extract relations from text
* Store and analyze relations in Neo4j

# Download daily update from the PubMed FTP site
As mentioned, the PubMed daily updates are available on their FTP site. The data is available in XML format, and the files have an incremental ID. I first tried to calculate the incremental file ID for a specific date programmatically. However, it's not straightforward, and I didn't want to spend time figuring it out, so you will have to copy the desired file location into the code manually.

In [None]:
import urllib.request  # the submodule must be imported explicitly for urlopen
import gzip
import io
import xmltodict

# Get latest pubmed daily update location at
# https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/

url = "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/pubmed22n1211.xml.gz"
oec = xmltodict.parse(gzip.GzipFile(fileobj=io.BytesIO(urllib.request.urlopen(url).read())))
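If you already know the file number, a small helper can at least build the URL for you. This is a minimal sketch: the function name `pubmed_update_url` is hypothetical, and the naming pattern (a two-digit year suffix followed by a four-digit file number) is an assumption inferred from the single URL used above, so double-check it against the FTP listing.

```python
def pubmed_update_url(file_number, year_suffix="22"):
    """Build a daily update URL from a file number (hypothetical helper).

    The pattern pubmed<YY>n<NNNN>.xml.gz is inferred from the URL used in
    this notebook and is an assumption, not a documented contract.
    """
    base = "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/"
    return f"{base}pubmed{year_suffix}n{file_number:04d}.xml.gz"


# Reproduces the location that is hard-coded in the cell above
print(pubmed_update_url(1211))
```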

My gut instinct was that it would be easier to convert the XML to a dictionary and process that. However, if I had to do it again, I would probably use XML search functions as I had to include several exceptions to extract required data from the dictionary format correctly.
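For comparison, here is a rough sketch of what the XML search approach could look like with the standard library `xml.etree.ElementTree`. The sample XML is a tiny made-up document that only mimics the element names used in this notebook, and `parse_articles` is a hypothetical helper, not the code used for the import below.

```python
import xml.etree.ElementTree as ET

# Made-up sample that mimics the PubMed element names used in this post
sample_xml = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">12345</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">67890</PMID>
      <Article>
        <ArticleTitle>No abstract here</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_articles(xml_text):
    """Return pmid, title and abstract sections for articles with an abstract."""
    root = ET.fromstring(xml_text.strip())
    articles = []
    for article in root.iter("PubmedArticle"):
        # findall always returns a list, so one or many sections look the same
        sections = [
            {"label": el.get("Label", "SINGLE"), "text": "".join(el.itertext())}
            for el in article.findall("./MedlineCitation/Article/Abstract/AbstractText")
        ]
        if not sections:
            continue  # skip articles without an abstract
        articles.append({
            "pmid": article.findtext("./MedlineCitation/PMID"),
            "title": article.findtext("./MedlineCitation/Article/ArticleTitle"),
            "text": sections,
        })
    return articles


articles = parse_articles(sample_xml)
```

Because `findall` returns a list regardless of how many elements match, most of the single-versus-list exceptions needed for the dictionary format disappear.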

In [None]:
from datetime import date

# Export pubmed article params
params = list()

for row in oec['PubmedArticleSet']['PubmedArticle']:

    # Skip articles without abstract or other text
    if not row['MedlineCitation']['Article'].get('Abstract'):
        continue

    # Article id
    pmid = row['MedlineCitation']['PMID']['#text']

    abstract_raw = row['MedlineCitation']['Article']['Abstract']['AbstractText']

    if isinstance(abstract_raw, str):
        text = [{'label': 'SINGLE', 'text': abstract_raw}]
    elif isinstance(abstract_raw, list):
        text = [{'label': el.get('@Label', 'SINGLE'), 'text': el['#text']}
                for el in abstract_raw if not isinstance(el, str) and el.get('#text')]
    else:
        text = [{'label': abstract_raw.get(
            '@Label', 'SINGLE'), 'text': abstract_raw.get('#text')}]

    # Completed date
    if row['MedlineCitation'].get('DateCompleted'):
        completed_year = int(row['MedlineCitation']['DateCompleted']['Year'])
        completed_month = int(row['MedlineCitation']['DateCompleted']['Month'])
        completed_day = int(row['MedlineCitation']['DateCompleted']['Day'])
        completed_date = date(completed_year, completed_month, completed_day)
    else:
        completed_date = None

    # Revised date
    revised_year = int(row['MedlineCitation']['DateRevised']['Year'])
    revised_month = int(row['MedlineCitation']['DateRevised']['Month'])
    revised_day = int(row['MedlineCitation']['DateRevised']['Day'])
    revised_date = date(revised_year, revised_month, revised_day)

    # title
    title_raw = row['MedlineCitation']['Article']['ArticleTitle']
    if isinstance(title_raw, str):
        title = title_raw
    else:
        title = title_raw['#text'] if title_raw else None
    # Country
    country = row['MedlineCitation']['MedlineJournalInfo']['Country']

    # Mesh headings
    mesh_raw = row['MedlineCitation'].get('MeshHeadingList')
    if mesh_raw:
        if isinstance(mesh_raw['MeshHeading'], list):
            mesh = [{'mesh_id': el['DescriptorName']['@UI'], 'text': el['DescriptorName']['#text'], 'major_topic': el['DescriptorName']
                     ['@MajorTopicYN']} for el in mesh_raw['MeshHeading']]
        else:
            mesh = [{'mesh_id': el['DescriptorName']['@UI'], 'text': el['DescriptorName']['#text'], 'major_topic': el['DescriptorName']
                     ['@MajorTopicYN']} for el in [mesh_raw['MeshHeading']]]
    else:
        mesh = []

    # Authors
    authors_raw = row['MedlineCitation']['Article'].get('AuthorList')
    if not authors_raw:
        authors = []
    elif isinstance(authors_raw['Author'], list):
        authors = [
            f"{el['ForeName']} {el['LastName']}" for el in authors_raw['Author'] if el.get('ForeName')]
    else:
        # Fall back to an empty list (not None) so the type matches the other branches
        authors = [f"{authors_raw['Author']['ForeName']} {authors_raw['Author']['LastName']}"] if authors_raw['Author'].get('ForeName') else []

    params.append({'pmid': pmid, 'text': text, 'completed_date': completed_date,
                  'revised_date': revised_date, 'title': title, 'country': country, 'mesh': mesh, 'author': authors})
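Most of the branching above exists because the dictionary holds a single dict when an element occurs once and a list when it occurs several times. One way to reduce that, sketched here with a hypothetical `as_list` helper and made-up sample values, is to normalize everything to a list before iterating.

```python
def as_list(value):
    """Wrap a single item in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Made-up example shaped like a MeshHeadingList with a single heading
mesh_raw = {"MeshHeading": {"DescriptorName": {"@UI": "X1", "#text": "Example"}}}
mesh_ids = [el["DescriptorName"]["@UI"] for el in as_list(mesh_raw["MeshHeading"])]
```

With such a helper, the two Mesh branches and the two author branches could each collapse into a single list comprehension.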


# Store articles in Neo4j
Before moving onto the NLP extraction pipeline, we will store the articles in Neo4j.
In the center of the graph are the articles. We store their PubMed ids, title, country, and dates as properties. Of course, we could refactor the country as a separate node if we wanted to, but here I modeled it as a node property. Each article contains one or more sections of text. Several types of sections are available, like the abstract, methods, or conclusions. I've stored the section type as a property of the relationship between the article and the section. We also know who authored a particular research paper. PubMed articles in particular also contain the entities mentioned or researched in the paper, which we will store as Mesh nodes since the entities are mapped to the Mesh ontology.
P.S. For most articles, only the abstract is available. You could probably download full-text for most articles through the PubMed API. However, we won't do that here.
Before importing the data, we have to set up our Neo4j environment. If you are using the Colab notebook, I suggest you open a [Blank Project in Neo4j Sandbox](https://sandbox.neo4j.com/?usecase=blank-sandbox). Neo4j Sandbox is a free time-limited cloud instance of Neo4j.

In [None]:
# Define Neo4j connections
import pandas as pd
from neo4j import GraphDatabase
host = 'bolt://54.89.97.91:7687'
user = 'neo4j'
password = 'witnesses-bells-drunk'
driver = GraphDatabase.driver(host,auth=(user, password))

def run_query(query, params=None):
    # Run a Cypher query and return the result as a Pandas DataFrame
    with driver.session() as session:
        result = session.run(query, params or {})
        return pd.DataFrame([r.values() for r in result], columns=result.keys())

A good practice when dealing with Neo4j is to define unique constraints and indexes to optimize the performance of both import and read queries.

In [None]:
# Define constraints
run_query("CREATE CONSTRAINT IF NOT EXISTS ON (a:Article) ASSERT a.pmid IS UNIQUE;")
run_query("CREATE CONSTRAINT IF NOT EXISTS ON (a:Author) ASSERT a.name IS UNIQUE;")
run_query("CREATE CONSTRAINT IF NOT EXISTS ON (m:Mesh) ASSERT m.id IS UNIQUE;")

Now that we are all set, we can go ahead and import articles into Neo4j.

In [None]:
import_pubmed_query = """
UNWIND $data AS row
// Store article
MERGE (a:Article {pmid: row.pmid})
SET a.completed_date = date(row.completed_date),
    a.revised_date = date(row.revised_date),
    a.title = row.title,
    a.country = row.country
// Store sections of articles
FOREACH (map IN row.text | 
    CREATE (a)-[r:HAS_SECTION]->(text:Section)
    SET text.text = map.text,
        r.type = map.label)
// Store Mesh headings        
FOREACH (heading IN row.mesh | 
    MERGE (m:Mesh {id: heading.mesh_id})
    ON CREATE SET m.text = heading.text
    MERGE (a)-[r:MENTIONS_MESH]->(m)
    SET r.isMajor = heading.major_topic)
// Store authors    
FOREACH (author IN row.author | 
    MERGE (au:Author {name: author})
    MERGE (a)<-[:AUTHORED]-(au))
"""

# Import pubmed articles into Neo4j
step = 1000
for x in range(0, len(params), step):
    chunk = params[x:x+step]
    try:
        run_query(import_pubmed_query, {'data': chunk})
    except Exception as e:
        print(e)

The import is split into batches of 1000 articles to avoid dealing with a single huge transaction and potential memory issues. The import Cypher statement is a bit longer, but nothing too complex. We can quickly inspect the data before moving on to the NLP pipeline.
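The batching itself is independent of Neo4j. As a minimal sketch, the same slicing can be wrapped in a small generator (the name `chunked` is my own):

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Five items in batches of two: the last batch is simply shorter
batches = list(chunked(list(range(5)), 2))
```

Each batch would then be passed to `run_query` as its own transaction, exactly as in the loop above.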

In [None]:
run_query("""
MATCH (a:Article)
RETURN count(*) AS count
""")

Unnamed: 0,count
0,26829


We can compare the revised versus the completed date to understand better why there are so many articles.

In [None]:
run_query("""
MATCH (a:Article)
RETURN a.pmid AS article_id, a.completed_date AS completed_date, a.revised_date AS revised_date
ORDER BY completed_date ASC
LIMIT 5
""")

Unnamed: 0,article_id,completed_date,revised_date
0,10954585,2000-10-30,2022-02-28
1,11802252,2002-03-15,2022-02-28
2,18254086,2008-04-14,2022-02-28
3,18646090,2008-10-15,2022-02-28
4,19093323,2009-02-03,2022-02-28


I have no idea why articles older than 20 years are being revised, but we get that information from the XML files. Next, we can inspect which mesh entities are most frequently researched as major topics in the articles completed in 2020 or later.

In [None]:
run_query("""
MATCH (a:Article)-[rel:MENTIONS_MESH]->(mesh_entity)
WHERE a.completed_date.year >= 2020 AND rel.isMajor = "Y"
RETURN mesh_entity.text as entity, count(*) AS count
ORDER BY count DESC
LIMIT 5
""")

Unnamed: 0,entity,count
0,COVID-19,33
1,HIV Infections,10
2,Alcoholism,8
3,Influenza Vaccines,7
4,"Diabetes Mellitus, Type 2",7


Interestingly, COVID-19 comes out on top even though we imported only a single daily update. Before relation extraction NLP models gained popularity, you could use co-occurrence networks to identify potential links between entities. For example, we can inspect which entities most frequently co-occur with COVID-19.

In [None]:
run_query("""
MATCH (e1:Mesh)<-[:MENTIONS_MESH]-(a:Article)-[:MENTIONS_MESH]->(e2)
WHERE e1.text = 'COVID-19'
RETURN e1.text AS entity1, e2.text AS entity2, count(*) AS count
ORDER BY count DESC
LIMIT 5
""")

Unnamed: 0,entity1,entity2,count
0,COVID-19,Humans,58
1,COVID-19,SARS-CoV-2,50
2,COVID-19,Female,24
3,COVID-19,Male,22
4,COVID-19,Pandemics,20


Co-occurrence results for COVID-19 make sense, even though they don't explain much other than it's related to humans and pandemics and has a strong connection to SARS-CoV-2.
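To make the idea concrete outside of Cypher, here is a minimal sketch of co-occurrence counting in plain Python. The article ids and entity lists are made up for illustration, and `cooccurrence_counts` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Made-up toy data: article id -> Mesh headings attached to it
articles_mesh = {
    "a1": ["COVID-19", "Humans", "SARS-CoV-2"],
    "a2": ["COVID-19", "Humans"],
    "a3": ["Humans", "Pandemics"],
}


def cooccurrence_counts(entity_lists):
    """Count in how many lists each unordered pair of entities appears together."""
    counts = Counter()
    for entities in entity_lists:
        # sorted(set(...)) gives every pair a single canonical ordering
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(articles_mesh.values())
```

The counts only say that two entities appear in the same article, not how they are related, which is the limitation the next section addresses.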
# Relation Extraction NLP pipeline
Simple co-occurrence analysis can be a powerful technique to analyze relations between entities, but it ignores a lot of information that is available in the text. For that reason, researchers have been investing a lot of effort in building and training relation extraction models.

In [None]:
import json
import os
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp
import pyspark.sql.functions as F

import warnings
warnings.filterwarnings('ignore')

# Named spark_params so it does not overwrite the article params list defined earlier
spark_params = {"spark.driver.memory":"6G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=spark_params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

Spark NLP Version : 3.4.1
Spark NLP_JSL Version : 3.4.1


In [None]:
spark

Relationship extraction models are mostly very domain-specific and trained to detect only specific types of links. For this example, I have decided to include two John Snow Labs models in the NLP pipeline. One model will detect adverse drug effects between drugs and conditions, while the other model is used to extract relations between drugs and proteins.

In [None]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

# PoS and Dependency parser

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

# NER for ReDL

redl_words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("redl_embeddings")

redl_drugprot_ner_tagger = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "redl_embeddings")\
    .setOutputCol("redl_ner_tags")

redl_ner_converter = NerConverter()\
    .setInputCols(["sentences", "tokens", "redl_ner_tags"])\
    .setOutputCol("redl_ner_chunks")

# NER for ADE

ade_words_embedder = BertEmbeddings() \
    .pretrained("biobert_pubmed_base_cased", "en") \
    .setInputCols(["sentences", "tokens"]) \
    .setOutputCol("ade_embeddings")

ade_ner_tagger = MedicalNerModel() \
    .pretrained("ner_ade_biobert", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "ade_embeddings"]) \
    .setOutputCol("ade_ner_tags")

ade_ner_converter = NerConverter() \
    .setInputCols(["sentences", "tokens", "ade_ner_tags"]) \
    .setOutputCol("ade_ner_chunks")

# ReDL relation extraction

# Set a filter on pairs of named entities which will be treated as relation candidates
drugprot_re_ner_chunk_filter = RENerChunksFilter()\
    .setInputCols(["redl_ner_chunks", "dependencies"])\
    .setOutputCol("redl_re_ner_chunks")\
    .setMaxSyntacticDistance(4)
    
drugprot_re_Model = RelationExtractionDLModel()\
    .pretrained('redl_drugprot_biobert', "en", "clinical/models")\
    .setPredictionThreshold(0.9)\
    .setInputCols(["redl_re_ner_chunks", "sentences"])\
    .setOutputCol("redl_relations")

# ADE relation extraction

ade_re_model = RelationExtractionModel()\
        .pretrained("re_ade_biobert", "en", 'clinical/models')\
        .setInputCols(["ade_embeddings", "pos_tags", "ade_ner_chunks", "dependencies"])\
        .setOutputCol("ade_relations")\
        .setMaxSyntacticDistance(3)\
        .setPredictionThreshold(0.9)\
        .setRelationPairs(["drug-ade"]) # Possible relation pairs. Default: All Relations.

# Define whole pipeline
pipeline = Pipeline(
    stages=[documenter, sentencer, tokenizer,
            pos_tagger,
            dependency_parser,
            redl_words_embedder,
            redl_drugprot_ner_tagger,
            redl_ner_converter,
            ade_words_embedder,
            ade_ner_tagger,
            ade_ner_converter,
            drugprot_re_ner_chunk_filter,
            drugprot_re_Model,
            ade_re_model])

pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_drugprot_clinical download started this may take some time.
Approximate size to download 14 MB
[OK!]
biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
ner_ade_biobert download started this may take some time.
Approximate size to download 15.3 MB
[OK!]
redl_drugprot_biobert download started this may take some time.
Approximate size to download 386.6 MB
[OK!]
re_ade_biobert download started this may take some time.
Approximate size to download 17.1 MB
[OK!]


Some of the steps are relevant for both the ADE (Adverse Drug Effect) and REDL (Drugs and Proteins) relations. However, since the models detect relationships between different types of entities, we have to use two NER models to detect both types of entities. Then we can simply feed those entities into relation extraction models. For example, the ADE model will produce only two types of relationships (0,1), where 1 indicates an adverse drug effect. On the other hand, the REDL model is trained to detect nine different types of relations between drugs and proteins (ACTIVATOR, INHIBITOR, AGONIST…).

In [None]:
def extract_rel_params(df):
  """
  Extract relationship parameters from the output dataframe for ADE and ReDL relations
  """
  rel_params = list()
  for i, row in df.iterrows():
      node_id = row['nodeId']
      if row['redl_relations']:
          for result in row['redl_relations']:
              rel_type = result['result'].replace('-', '_')
              confidence = result['metadata']['confidence']
              entity_1_type = result['metadata']['entity1']
              entity_1_label = result['metadata']['chunk1']
              entity_2_type = result['metadata']['entity2']
              entity_2_label = result['metadata']['chunk2']

              rel_params.append({'node_id': node_id, 'rel_type': rel_type, 'confidence': confidence,
                                'entity_1_type': entity_1_type, 'entity_1_label': entity_1_label, 'entity_2_type': entity_2_type, 'entity_2_label': entity_2_label})
      if row['ade_relations']:
          for result in row['ade_relations']:
              # Skip when ADE is not found
              if result['result'] == '0':
                  continue
              rel_type = 'ADE'
              confidence = result['metadata']['confidence']
              entity_1_type = result['metadata']['entity1']
              entity_1_label = result['metadata']['chunk1']
              entity_2_type = result['metadata']['entity2']
              entity_2_label = result['metadata']['chunk2']

              rel_params.append({'node_id': node_id, 'rel_type': rel_type, 'confidence': confidence,
                                'entity_1_type': entity_1_type, 'entity_1_label': entity_1_label, 'entity_2_type': entity_2_type, 'entity_2_label': entity_2_label})

  return rel_params
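The two branches above build the same dictionary, so they could share one helper. Below is a minimal sketch with a hypothetical `relation_to_params` function; the mock result is made up and only assumes the `result` and `metadata` fields that the function above already reads.

```python
def relation_to_params(node_id, result, rel_type=None):
    """Flatten one relation annotation into the parameters used for the import."""
    meta = result["metadata"]
    return {
        "node_id": node_id,
        # Use the given type (e.g. 'ADE') or derive it from the model output
        "rel_type": rel_type or result["result"].replace("-", "_"),
        "confidence": meta["confidence"],
        "entity_1_type": meta["entity1"],
        "entity_1_label": meta["chunk1"],
        "entity_2_type": meta["entity2"],
        "entity_2_label": meta["chunk2"],
    }


# Made-up annotation with the same fields the function above accesses
mock_result = {
    "result": "SOME-RELATION",
    "metadata": {"confidence": "0.95", "entity1": "TYPE_A", "chunk1": "first",
                 "entity2": "TYPE_B", "chunk2": "second"},
}
row_params = relation_to_params(42, mock_result)
ade_params = relation_to_params(42, mock_result, rel_type="ADE")
```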


Lastly, we need to define the graph model to represent extracted entities. Mostly, it depends on whether you want the extracted relationships to point to their original text or not.

In [None]:
# Define neo4j import query
import_rels_query = """
UNWIND $data AS row
MATCH (a:Section)
WHERE id(a) = toInteger(row.node_id)
WITH row, a 
CALL apoc.merge.node(
  ['Entity', row.entity_1_type],
  {name: row.entity_1_label},
  {},
  {}
) YIELD node AS startNode
CALL apoc.merge.node(
  ['Entity', row.entity_2_type],
  {name: row.entity_2_label},
  {},
  {}
) YIELD node AS endNode

MERGE (startNode)-[:RELATIONSHIP]->(rel:Relationship {type: row.rel_type})-[:RELATIONSHIP]->(endNode)

MERGE (a)-[:MENTIONS]->(startNode)
MERGE (a)-[:MENTIONS]->(endNode)
MERGE (a)-[rm:MENTIONS]->(rel)
SET rm.confidence = row.confidence

"""

The only thing left is to execute the code and import extracted biomedical relations into Neo4j.

In [None]:
from datetime import datetime

# Define NLP input
nlp_input = run_query("""
MATCH (t:Section)
RETURN id(t) AS nodeId, t.text as text
LIMIT 1000
""")

# Run through NLP pipeline and store results
step = 100  #batch size
for i in range(0, len(nlp_input), step):
  print(f"Start processing row {i} at {datetime.now()}")
  # Create a chunk from the original Pandas Dataframe
  chunk_df = nlp_input[i: i + step]
  # Convert Pandas into Spark Dataframe
  sparkDF=spark.createDataFrame(chunk_df)
  # Run through NLP pipeline
  result = pipeline.fit(sparkDF).transform(sparkDF)
  df = result.toPandas()
  # Extract REL params
  rel_params = extract_rel_params(df)
  # Store to Neo4j
  run_query(import_rels_query, {'data': rel_params})



Start processing row 0 at 2022-03-10 15:13:46.156199
Start processing row 100 at 2022-03-10 15:23:28.880193
Start processing row 200 at 2022-03-10 15:32:36.873671
Start processing row 300 at 2022-03-10 15:42:27.586173
Start processing row 400 at 2022-03-10 15:49:43.298054
Start processing row 500 at 2022-03-10 15:55:56.895929
Start processing row 600 at 2022-03-10 16:02:53.612173
Start processing row 700 at 2022-03-10 16:09:38.976272
Start processing row 800 at 2022-03-10 16:13:57.521061
Start processing row 900 at 2022-03-10 16:18:28.480545


This code processes only 1000 sections, but you can increase the limit if you want. Since we didn't specify any unique id of the Section nodes, I've fetched the text and section internal node ids from Neo4j, which will make the import of relations faster as matching nodes by long text is not the most optimized way. Usually, you can get around this problem by calculating and storing a hash of text like sha1. In Google Colab, it takes about an hour to process 1000 sections.
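The hashing alternative mentioned above could look roughly like this. It is a minimal sketch using the standard library `hashlib`; the helper name `text_hash` is my own, and storing the value as a property on the Section nodes is an assumption, not something this notebook does.

```python
import hashlib


def text_hash(text):
    """Return the SHA-1 hex digest of a text, usable as a compact lookup key."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# The digest has a fixed length no matter how long the section text is
section_key = text_hash("Chronic Lymphocytic Leukemia (CLL) is a B cell...")
```

A unique constraint on such a property would let you match sections by a short key instead of by internal node id or full text.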

Now we can examine the results. First, we will look at the relationships with the most mentions.

In [None]:
run_query("""
MATCH (start:Entity)-[:RELATIONSHIP]->(r)-[:RELATIONSHIP]->(end:Entity)
WITH start, end, r,
  size((r)<-[:MENTIONS]-()) AS totalMentions
ORDER BY totalMentions DESC
LIMIT 5
RETURN start.name AS startNode, r.type AS rel_type, end.name AS endNode, totalMentions
""")

Unnamed: 0,startNode,rel_type,endNode,totalMentions
0,cytokines,INDIRECT_UPREGULATOR,chemokines,4
1,cytokines,INDIRECT_UPREGULATOR,tumor necrosis factor-alpha,3
2,nitric oxide,PRODUCT_OF,NO,3
3,IL-1b,INDIRECT_UPREGULATOR,IL-6,3
4,matrix metalloproteinases,ACTIVATOR,MMPs,2


Since I am not a medical doctor, I won't comment on the results as I have no idea how accurate they are. If we were to ask a medical doctor whether a specific relation is valid, we could present them with the original text and let them decide.

In [None]:
run_query("""
MATCH (start:Entity)-[:RELATIONSHIP]->(r)-[:RELATIONSHIP]->(end:Entity)
WHERE start.name = 'cytokines' AND end.name = 'chemokines'
MATCH (r)<-[:MENTIONS]-(section)<-[:HAS_SECTION]-(article)
RETURN section.text AS text, article.pmid AS pmid
LIMIT 5
""")

Unnamed: 0,text,pmid
0,Chronic Lymphocytic Leukemia (CLL) is a B cell...,22202043
1,The developing brain is susceptible to hypoxic...,21622239
2,The two major neuropathologic hallmarks of AD ...,21196374
3,Macrophages are versatile cells involved in he...,19273336


What might also be interesting is to search for indirect relationships between specific entities.

In [None]:
run_query("""
MATCH (start:Entity), (end:Entity)
WHERE start.name = "cytokines" AND end.name = "CD40L"
MATCH p=allShortestPaths((start)-[:RELATIONSHIP*..5]->(end))
RETURN [n in nodes(p) | coalesce(n.name, n.type)] AS result LIMIT 25
""")

Unnamed: 0,result
0,"[cytokines, INDIRECT_UPREGULATOR, IL-1b, INDIR..."


# Next steps
There are a couple of options we have to enhance our NLP pipeline. The first that comes to mind is using entity linking or resolver models. Basically, the entity resolver maps an entity to a target knowledge base like UMLS or Ensembl. By accurately linking entities to a target knowledge base, we achieve two things:
* Entity disambiguation
* Ability to enrich our knowledge graph with external sources 

For example, I've found two node entities in our graph that might refer to the same real-world entity. While John Snow Labs offers multiple Entity Resolution models, it takes a bit of domain knowledge to map entities to a specified target knowledge base efficiently. I've seen some real-world biomedical knowledge graphs that use multiple target knowledge bases like UMLS, OMIM, Entrez to cover all types of entities.
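As a toy illustration of what disambiguation buys us, the sketch below maps surface forms to a shared identifier using a hand-written lookup table. The identifiers are placeholders, not real knowledge base ids, and the pairs are only my guess at names that might refer to the same thing. A real entity resolver model would replace this table entirely.

```python
# Hand-written toy lookup; the ids are placeholders, not real knowledge base ids
canonical_ids = {
    "nitric oxide": "ENTITY:1",
    "no": "ENTITY:1",
    "matrix metalloproteinases": "ENTITY:2",
    "mmps": "ENTITY:2",
}


def resolve(name):
    """Return the placeholder id for a surface form, or None if it is unknown."""
    return canonical_ids.get(name.strip().lower())


# Two different surface forms end up with the same identifier
same_entity = resolve("NO") == resolve("nitric oxide")
```

Once two nodes resolve to the same identifier, they can be merged into a single node in the graph.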
The second feature of using entity resolvers is that we can enrich our knowledge graph by using external biomedical sources. For example, one application would be to use a knowledge base to import existing knowledge and then find new relations between entities through NLP extraction.
Lastly, we could also use various graph machine learning libraries like the Neo4j GDS, PyKEEN, or even PyTorch Geometric to predict new relationships.