# PubMed Knowledge Graph

This notebook walks through the process of generating a knowledge graph of PubMed articles.

It will:
* Download a selection of articles from PubMed
* Define a knowledge graph schema
* Extract entities from the articles according to the defined schema
* Populate a Neo4j instance with articles and extracted entities
* Connect extracted entities with existing patient journey data

This notebook requires a local repo of articles. You may download a sample of 20 PubMed articles by running the following command.

```bash
python3 ./scripts/fetch_pubmed_articles.py
```

In [1]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

In [2]:
from typing import Any, Optional, List
import asyncio

In [3]:
# allows for async operations in notebooks
import nest_asyncio
nest_asyncio.apply()

## Lexical Graph Construction

We will use Unstructured.IO to partition and chunk our articles. 

This process breaks the articles into sensible chunks that may be used as context in our application. 

These chunks will also have relationships to the extracted entities.

IMAGE OF LEXICAL DATA MODEL

In [4]:
import pandas as pd
from pydantic import BaseModel, Field
from unstructured.partition.auto import partition
from unstructured.documents.elements import CompositeElement

from uuid import uuid4

In [5]:
class Document(BaseModel):
    id: str = Field(..., description="The id of the document")
    name: str = Field(..., description="The name of the document")
    source: str = Field(..., description="The source of the document")

class Chunk(BaseModel):
    id: str = Field(..., description="The id of the chunk")
    text: str = Field(..., description="The text of the chunk")

class ChunkWithEmbedding(Chunk):
    embedding: list[float] = Field(..., description="The embedding of the chunk text field")

class ChunkPartOfDocument(BaseModel):
    chunk_id: str = Field(..., description="The id of the chunk")
    document_id: str = Field(..., description="The id of the document")

class ChunkHasEntity(BaseModel):
    chunk_id: str = Field(..., description="The id of the chunk")
    entity_id: str = Field(..., description="The id of the entity")

In [6]:
def get_node_and_relationship_from_chunk_element(element: CompositeElement, parent_document_id: str) -> dict[str, list[dict[str, Any]]]:
    """Parse the Chunk node and its PART_OF Document relationship for a given chunk element"""
    chunk = Chunk(id=element.id, text=element.text)
    chunk_part_of_document = ChunkPartOfDocument(chunk_id=chunk.id, document_id=parent_document_id)
    return {
        "nodes": [chunk.model_dump()],
        "relationships": [chunk_part_of_document.model_dump()],
    }

def get_nodes_and_relationships_from_chunk_elements(elements: list[CompositeElement], parent_document: Document) -> dict[str, dict[str, pd.DataFrame]]:
    """Parse Chunk nodes and their PART_OF Document relationships for a set of chunk elements and their parent document"""
    
    data = {
        "nodes": {
            "document": pd.DataFrame([parent_document.model_dump()]),
            "chunk": list(),
        },
        "relationships": {
            "chunk_part_of_document": list(),
        }
    }

    for element in elements:
        new_data = get_node_and_relationship_from_chunk_element(element, parent_document.id)
        data["nodes"]["chunk"].extend(new_data["nodes"])
        data["relationships"]["chunk_part_of_document"].extend(new_data["relationships"])

    # convert to pandas dataframe for ingestion
    data["nodes"]["chunk"] = pd.DataFrame(data["nodes"]["chunk"])
    data["relationships"]["chunk_part_of_document"] = pd.DataFrame(data["relationships"]["chunk_part_of_document"])

    return data

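def process_article(file_name: str) -> dict[str, dict[str, pd.DataFrame]]:
    """Partition and chunk an article file, then return its Document and Chunk records for ingestion"""
    parent_document = Document(id=str(uuid4()), name=file_name, source="pubmed")
    # partition the file into elements and group them into chunks by section title
    elements = partition(file_name, chunking_strategy="by_title")
    return get_nodes_and_relationships_from_chunk_elements(elements, parent_document)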

## Lexical Graph Embedding

Here we will embed the text fields of our lexical graph for vector similarity search. 

In [7]:
#TODO
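The embedding step is still marked as TODO in this notebook. To make the goal concrete, here is a minimal sketch of how vector similarity search ranks chunks once each chunk has a vector. `toy_embed` is a hypothetical bag-of-words stand-in, not a real embedding model, and `rank_chunks` is an illustrative helper that does not exist elsewhere in this notebook. A real pipeline would call an embedding model and let the database perform the search.

```python
import math


def build_vocab(texts: list[str]) -> dict[str, int]:
    """Map every distinct lower-cased token to a vector position."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def toy_embed(text: str, vocab: dict[str, int]) -> list[float]:
    """Hypothetical stand-in for an embedding model: raw token counts."""
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(query: str, chunks: dict[str, str]) -> list[tuple[str, float]]:
    """Score every chunk against the query and sort best-first."""
    vocab = build_vocab(list(chunks.values()) + [query])
    query_vec = toy_embed(query, vocab)
    scored = [
        (chunk_id, cosine_similarity(query_vec, toy_embed(text, vocab)))
        for chunk_id, text in chunks.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


example_chunks = {
    "a": "zebrafish glucose immersion diabetes",
    "b": "author affiliation university address",
}
ranked = rank_chunks("glucose diabetes zebrafish", example_chunks)
print(ranked)
```

The chunk that shares vocabulary with the query ranks first, while the author-affiliation chunk scores zero.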

## Schema Definition

We now need to define our knowledge graph schema. This information will be passed to the entity extraction LLM to control which entities and relationships are pulled out of the text.

Without this constraint, an unbounded extraction process would let the schema grow too large to manage.

We are using Pydantic to define the schema here since it can be used to validate any returned results as well. This ensures that all data we are ingesting into Neo4j adheres to this structure.

In [8]:
class Medication(BaseModel):
    """a substance used for medical treatment, especially a medicine or drug. This is a base medication, not a medication implemented in a study."""
    
    name: str = Field(..., description="Name of the medication. Should also be uniquely identifiable.")
    medication_class: str = Field(..., description="Drug class (e.g., GLP-1 RA, SGLT2i)")
    mechanism: Optional[str] = Field(None, description="Mechanism of action")
    generic_name: Optional[str] = Field(None, description="Generic name if different from name")
    brand_names: Optional[List[str]] = Field(None, description="Commercial brand names")
    approval_status: Optional[str] = Field(None, description="FDA approval status")
    
    class Config:
        json_schema_extra = {
            "examples": [
                {
                    "name": "Semaglutide", 
                    "medication_class": "GLP-1 receptor agonist",
                    "mechanism": "GLP-1 receptor activation",
                    "generic_name": "semaglutide",
                    "brand_names": ["Ozempic", "Wegovy", "Rybelsus"],
                    "approval_status": "FDA approved"
                }
            ]
        }


class StudyMedication(BaseModel):
    """Study-specific medication usage - how a medication was used in a particular study"""
    
    study_medication_id: str = Field(..., description="Unique identifier for this study medication instance")
    dosage: Optional[str] = Field(None, description="Dosage used in this study")
    route: Optional[str] = Field(None, description="Route of administration")
    frequency: Optional[str] = Field(None, description="Dosing frequency")
    treatment_duration: Optional[str] = Field(None, description="Duration of treatment")
    treatment_arm: Optional[str] = Field(None, description="Treatment arm description")
    comparator: Optional[str] = Field(None, description="What this was compared against")
    adherence_rate: Optional[float] = Field(None, description="Treatment adherence rate")
    formulation: Optional[str] = Field(None, description="Specific formulation used")
    
    class Config:
        json_schema_extra = {
            "examples": [
                {
                    "study_medication_id": "STUDY_MED001",
                    "dosage": "1.0 mg",
                    "route": "subcutaneous",
                    "frequency": "weekly",
                    "treatment_duration": "12 weeks",
                    "treatment_arm": "Active treatment group",
                    "comparator": "placebo",
                    "adherence_rate": 85.5,
                    "formulation": "pre-filled pen"
                }
            ]
        }


class ClinicalOutcome(BaseModel):
    """Measured clinical outcomes and biomarkers"""
    
    clinical_outcome_id: str = Field(..., description="Unique identifier for the outcome")
    name: str = Field(..., description="Name of the clinical outcome")
    category: str = Field(..., description="Category of outcome")
    measurement_unit: Optional[str] = Field(None, description="Unit of measurement")
    normal_range: Optional[str] = Field(None, description="Normal or target range when applicable")
    baseline_value: Optional[float] = Field(None, description="Baseline measurement value")
    post_treatment_value: Optional[float] = Field(None, description="Post-treatment measurement value")
    change_from_baseline: Optional[float] = Field(None, description="Change from baseline")
    p_value: Optional[float] = Field(None, description="Statistical significance if reported")
    confidence_interval: Optional[str] = Field(None, description="95% confidence interval")
    effect_size: Optional[float] = Field(None, description="Standardized effect size")
    
    class Config:
        json_schema_extra = {
            "examples": [
                {
                    "clinical_outcome_id": "OUT001",
                    "name": "HbA1c",
                    "category": "Glycemic control",
                    "measurement_unit": "%",
                    "normal_range": "<7.0%",
                    "baseline_value": 8.5,
                    "post_treatment_value": 7.2,
                    "change_from_baseline": -1.3,
                    "p_value": 0.001,
                    "confidence_interval": "[-1.8, -0.8]",
                    "effect_size": -0.8
                }
            ]
        }


class MedicalCondition(BaseModel):
    """Medical conditions and comorbidities studied"""
    
    name: str = Field(..., description="Name of the medical condition")
    category: str = Field(..., description="Category of condition")
    severity: Optional[str] = Field(None, description="Severity or stage when specified")
    icd10_code: Optional[str] = Field(None, description="ICD-10 code when available")
    duration: Optional[str] = Field(None, description="Duration of condition if specified")
    prevalence: Optional[float] = Field(None, description="Prevalence in study population")
    
    class Config:
        json_schema_extra = {
            "examples": [
                {
                    "name": "Type 2 diabetes mellitus",
                    "category": "Primary condition", 
                    "severity": "moderate",
                    "icd10_code": "E11",
                    "duration": "5-10 years",
                    "prevalence": 100.0
                }
            ]
        }


class StudyPopulation(BaseModel):
    """Patient populations and demographics in research studies"""
    
    study_population_id: str = Field(..., description="Unique identifier for the population")
    description: str = Field(..., description="Description of the population")
    age_range: Optional[str] = Field(None, description="Age range")
    mean_age: Optional[float] = Field(None, description="Mean age in years")
    male_percentage: Optional[float] = Field(None, description="Percentage of male gender participants")
    female_percentage: Optional[float] = Field(None, description="Percentage of female gender participants")
    other_gender_percentage: Optional[float] = Field(None, description="Percentage of participants that identify as another gender")
    sample_size: Optional[int] = Field(None, description="Number of participants")
    study_type: str = Field(..., description="Type of study")
    location: Optional[str] = Field(None, description="Geographic location of study")
    inclusion_criteria: Optional[List[str]] = Field(None, description="Key inclusion criteria")
    exclusion_criteria: Optional[List[str]] = Field(None, description="Key exclusion criteria")
    study_duration: Optional[str] = Field(None, description="Duration of study")
    
    class Config:
        json_schema_extra = {
            "examples": [
                {
                    "study_population_id": "POP001",
                    "description": "Adults with T2DM and schizophrenia",
                    "age_range": "18-65 years",
                    "mean_age": 43.8,
                    "female_percentage": 47.0,
                    "male_percentage": 53.0,
                    "sample_size": 354,
                    "study_type": "Observational study",
                    "location": "Denmark",
                    "inclusion_criteria": ["Type 2 diabetes diagnosis", "Schizophrenia diagnosis", "Age ≥18"],
                    "study_duration": "12 months"
                }
            ]
        }


# Relationship classes
class StudyMedicationUsesMedication(BaseModel):
    """Links study medication to base medication"""
    study_medication_id: str
    medication_name: str


class StudyMedicationProducesClinicalOutcome(BaseModel):
    """Links study medication usage to clinical outcomes"""
    study_medication_id: str
    clinical_outcome_name: str


class StudyPopulationHasMedicalCondition(BaseModel):
    """Relationship between study population and medical conditions"""
    study_population_id: str
    medical_condition_name: str


class StudyPopulationReceivesStudyMedication(BaseModel):
    """Relationship between study population and study medication"""
    study_population_id: str
    study_medication_id: str


class StudyPopulationHasOutcome(BaseModel):
    """Direct relationship between population and outcomes (for population-level measurements)"""
    study_population_id: str
    clinical_outcome_name: str

Our knowledge graph data model looks like this:

IMAGE OF DATA MODEL
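The relationship classes above reference nodes by key (`study_medication_id`, `medication_name`, and so on), so an extracted relationship may point at a node that was never extracted. The sketch below shows one way to filter out such dangling records before loading. `drop_dangling_relationships` is a hypothetical helper, and whether the load queries in `pyneoinstance_config.yaml` already tolerate a missing endpoint is not shown here, so treat this as an optional safeguard.

```python
def drop_dangling_relationships(
    relationships: list[dict[str, str]],
    study_medication_ids: set[str],
    medication_names: set[str],
) -> list[dict[str, str]]:
    """Keep only StudyMedicationUsesMedication records whose endpoints were both extracted."""
    return [
        rel
        for rel in relationships
        if rel["study_medication_id"] in study_medication_ids
        and rel["medication_name"] in medication_names
    ]


# Illustrative records only, not output from this notebook
rels = [
    {"study_medication_id": "STUDY_MED001", "medication_name": "Semaglutide"},
    {"study_medication_id": "STUDY_MED001", "medication_name": "Metformin"},  # no such node extracted
]
kept = drop_dangling_relationships(rels, {"STUDY_MED001"}, {"Semaglutide"})
print(kept)
```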

## Entity Extraction via LLM

We will be using [OpenAI](https://platform.openai.com/docs/overview) and the [Instructor](https://python.useinstructor.com/) library to perform our entity extraction.

In [9]:
from openai import AsyncOpenAI
import instructor

Instructor handles requesting structured outputs from the LLM. 

If the LLM returns output that does not conform to the response models, Instructor retries the request and passes the validation errors back to the model so it can correct its answer.
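To illustrate the idea of retrying with validation feedback, here is a conceptual sketch in plain Python. It is not Instructor's implementation. `validate_medication`, `extract_with_retries`, and `fake_llm` are hypothetical names, and the validator checks only two fields rather than a full Pydantic model.

```python
import json
from typing import Callable, Optional


def validate_medication(payload: object) -> dict:
    """Minimal stand-in for a response model: require two non-empty string fields."""
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    for field in ("name", "medication_class"):
        if not isinstance(payload.get(field), str) or not payload[field]:
            raise ValueError(f"field '{field}' is required and must be a non-empty string")
    return payload


def extract_with_retries(llm: Callable[[list[str]], str], prompt: str, max_retries: int = 2) -> dict:
    """Call the model, validate the answer, and feed validation errors back on failure."""
    messages = [prompt]
    last_error: Optional[Exception] = None
    for _ in range(max_retries + 1):
        raw = llm(messages)
        try:
            return validate_medication(json.loads(raw))
        except ValueError as exc:  # also covers json.JSONDecodeError
            last_error = exc
            # Append the error so the next attempt can correct itself
            messages.append(f"Previous answer was invalid: {exc}. Please fix it.")
    raise RuntimeError(f"no valid response after {max_retries + 1} attempts") from last_error


def fake_llm(messages: list[str]) -> str:
    """Hypothetical model: answers incompletely first, correctly once it has seen an error."""
    if len(messages) == 1:
        return '{"name": "Semaglutide"}'
    return '{"name": "Semaglutide", "medication_class": "GLP-1 receptor agonist"}'


result = extract_with_retries(fake_llm, "Extract the medication.")
print(result)
```

The first answer fails validation because `medication_class` is missing. The second attempt sees the error message and succeeds.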

In [10]:
client = instructor.from_openai(AsyncOpenAI())

In [11]:
system_prompt = """
You are a healthcare research expert that is responsible for extracting detailed entities from PubMed articles. 
You will be provided a graph data model schema and must extract entities and relationships to populate a knowledge graph.
"""

async def extract_entities_from_text_chunk(text_chunk: str) -> list:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text_chunk}
        ],
        response_model=list[Medication | StudyMedication | StudyMedicationUsesMedication],
    )

    return response

In [12]:
async def extract_entities_from_chunk_nodes(chunk_nodes: pd.DataFrame) -> list[tuple[str, list[Any]]]:
    """
    Process a Pandas DataFrame of Chunk nodes and return the entities found in each chunk.
    
    Returns
    -------
    list[tuple[str, list[Any]]]
        A list of tuples, where the first element is the chunk id and the second element is a list of entities found in the chunk.
    """

    # Create tasks for all nodes
    # order is maintained
    tasks = [extract_entities_from_text_chunk(row["text"]) for _, row in chunk_nodes.iterrows()]

    # Execute all tasks concurrently
    extraction_results = await asyncio.gather(*tasks)

    # Return chunk_id paired with its entities
    return [(chunk_id, entities) for chunk_id, entities in zip(chunk_nodes["id"], extraction_results)]
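`asyncio.gather` returns results in the same order as its inputs, which is why the `zip` above pairs each chunk id with the right entities. With many chunks, launching every request at once may hit API rate limits. The sketch below caps the number of in-flight calls with `asyncio.Semaphore` while keeping the order. `gather_with_limit` and `fake_extract` are hypothetical names, and `fake_extract` stands in for `extract_entities_from_text_chunk`.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_with_limit(
    texts: list[str],
    worker: Callable[[str], Awaitable[list]],
    max_concurrency: int = 5,
) -> list[list]:
    """Run worker over texts with at most max_concurrency calls in flight; order is preserved."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> list:
        # Wait for a free slot before starting the call
        async with semaphore:
            return await worker(text)

    return await asyncio.gather(*(bounded(text) for text in texts))


async def fake_extract(text: str) -> list:
    """Hypothetical stand-in for an LLM extraction call."""
    await asyncio.sleep(0.01)
    return [text.upper()]


results = asyncio.run(gather_with_limit(["a", "b", "c"], fake_extract, max_concurrency=2))
print(results)
```

Inside this notebook you could `await gather_with_limit(...)` directly instead of calling `asyncio.run`.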

### Test Extraction

In [13]:
with open("pubmed_abstracts.txt", "r") as f:
    text = f.read()[1500:2500]

print(text)

al Science, Faculty of Medicine, Universitas Airlangga, 
Surabaya, Indonesia. fahrul.nurkolis.mail@gmail.com.
(11)Medical Research Center of Indonesia, Surabaya, East Java, Indonesia. 
fahrul.nurkolis.mail@gmail.com.

BACKGROUND: The global rise in obesity and type 2 diabetes highlights the need 
for safe and effective therapeutic interventions. Enhalus acoroides is a 
tropical seagrass rich in carotenoids and other bioactives. Its potential for 
metabolic regulation has been suggested in vitro, but in vivo efficacy and 
molecular mechanisms remain unexplored. This study aimed to evaluate the 
anti-obesity and anti-diabetic effects of Enhalus acoroides extract (SEAE) in a 
zebrafish model of diet- and glucose-induced metabolic dysfunction.
METHODS: Adult zebrafish were subjected to overfeeding and glucose immersion, 
after overfeeding and 14 days of glucose immersion to induce diabetes, adult 
zebrafish were randomized into three groups: untreated diabetic, SEAE-treated 
(5 mg/L), and 

In [14]:
ents = await extract_entities_from_text_chunk(text)

In [15]:
ents

[Medication(name='Enhalus acoroides extract', medication_class='Natural Product', mechanism='Metabolic regulation', generic_name='SEAE', brand_names=None, approval_status=None)]

## Data Ingestion

We have now defined 
* Lexical and domain data models
* Partitioning and chunking logic for articles
* Entity extraction logic for chunks

It is now time to define our ingestion logic. We will run ingestion in three stages:

1. Load lexical graph
2. Embed lexical graph Chunk nodes
3. Extract domain / entity graph from lexical graph

Decoupling these stages allows us to easily make changes as we iterate on our ingestion process.

We will be using PyNeoInstance to ingest our data. 

This library allows for easy ingest orchestration and features parallel ingest options.

In [16]:
import os

from pyneoinstance import Neo4jInstance, load_yaml_file

Our database credentials and all of our queries are stored in the `pyneoinstance_config.yaml` file. 

This makes it easy to manage our queries and keeps the notebook code clean. 

In [17]:
config = load_yaml_file("pyneoinstance_config.yaml")

db_info = config['db_info']

constraints = config['initializing_queries']['constraints']
indexes = config['initializing_queries']['indexes']

node_load_queries = config['loading_queries']['nodes']
relationship_load_queries = config['loading_queries']['relationships']

processing_queries = config['processing_queries']

This graph object will handle database connections and read / write transactions for us.

In [18]:
graph = Neo4jInstance(db_info.get('uri', os.getenv("NEO4J_URI", "neo4j://localhost:7687")), # use config value -> use env value -> use default value
                      db_info.get('user', os.getenv("NEO4J_USER", "neo4j")), 
                      db_info.get('password', os.getenv("NEO4J_PASSWORD", "password")))
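One subtlety of the fallback chain above: `dict.get(key, default)` only uses the default when the key is absent. If the YAML file contains the key with an empty value, `None` is passed through and the environment variable is never consulted. The sketch below shows a variant that skips empty values at every step. `resolve_setting` is a hypothetical helper, not part of PyNeoInstance.

```python
import os
from collections.abc import Mapping
from typing import Optional


def resolve_setting(config: Mapping[str, Optional[str]], key: str, env_var: str, default: str) -> str:
    """Return the first non-empty value among the config, the environment, and the default."""
    return config.get(key) or os.getenv(env_var) or default


# Hypothetical config where `uri` is present but empty
db_info_example = {"uri": None, "user": "neo4j"}

print(resolve_setting(db_info_example, "uri", "NEO4J_URI", "neo4j://localhost:7687"))
print(resolve_setting(db_info_example, "user", "NEO4J_USER", "neo4j"))
```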

This is a helper function for ingesting data using the PyNeoInstance library.

In [19]:
def get_partition(data: pd.DataFrame, batch_size: int = 500) -> int:
    """
    Determine the number of partitions for the data based on the desired batch size.
    Always returns at least 1.
    """

    # floor division, with a minimum of one partition
    partition = max(1, len(data) // batch_size)
    print(f"partition: {partition}")
    return partition
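To see what this arithmetic does in practice, the sketch below compares the floor-based count used by `get_partition` with a ceiling-based alternative. It assumes `partitions` is the number of roughly equal slices the data is split into, which is an assumption about PyNeoInstance and is not verified here. `partitions_floor` and `partitions_ceil` are hypothetical names.

```python
import math


def partitions_floor(n_rows: int, batch_size: int = 500) -> int:
    """Same arithmetic as get_partition: floor division, minimum of 1."""
    return max(1, n_rows // batch_size)


def partitions_ceil(n_rows: int, batch_size: int = 500) -> int:
    """Alternative: round up so that no slice exceeds batch_size rows."""
    return max(1, math.ceil(n_rows / batch_size))


for n_rows in (171, 999, 1200):
    floor_p = partitions_floor(n_rows)
    ceil_p = partitions_ceil(n_rows)
    # Largest slice size if the rows are split as evenly as possible
    print(n_rows, floor_p, math.ceil(n_rows / floor_p), ceil_p, math.ceil(n_rows / ceil_p))
```

With floor division, 999 rows land in a single slice of 999 rows, well above the requested batch size of 500. Rounding up keeps every slice at or below the batch size.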

### Constraints

Here we write all the constraints and indexes we need for both the lexical and domain graphs.

In [20]:
def create_constraints_and_indexes() -> None:
    """
    Create constraints and indexes for the graph.
    """
    try:
        graph.execute_write_queries(database=db_info['database'], queries=list(constraints.values()))
    except Exception as e:
        print(e)

    try:
        graph.execute_write_queries(database=db_info['database'], queries=list(indexes.values()))
    except Exception as e:
        print(e)

In [21]:
create_constraints_and_indexes()

'NoneType' object has no attribute 'values'
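The message printed above is what Python raises when `.values()` is called on `None`. That would happen if the `constraints` or `indexes` key exists in the YAML file but has no entries. This is an inference from the error text and is not confirmed by the notebook. The sketch below shows one way to check the config up front and treat an empty section as "no queries". `queries_from_section` and `find_empty_sections` are hypothetical helpers, and the index query is illustrative only.

```python
from collections.abc import Mapping
from typing import Any, Optional


def queries_from_section(section: Optional[Mapping[str, str]]) -> list[str]:
    """Return the queries in a config section, treating a missing or empty section as no queries."""
    return list((section or {}).values())


def find_empty_sections(config: Mapping[str, Any], paths: list[tuple[str, ...]]) -> list[str]:
    """Report dotted paths whose value is missing, None, or empty."""
    empty = []
    for path in paths:
        node: Any = config
        for key in path:
            node = node.get(key) if isinstance(node, Mapping) else None
        if not node:
            empty.append(".".join(path))
    return empty


# Hypothetical config mirroring the key names used in this notebook
example_config = {
    "initializing_queries": {
        "constraints": None,
        "indexes": {"chunk_id": "CREATE INDEX chunk_id_idx IF NOT EXISTS FOR (c:Chunk) ON (c.id)"},
    }
}

missing = find_empty_sections(
    example_config,
    [
        ("initializing_queries", "constraints"),
        ("initializing_queries", "indexes"),
        ("loading_queries", "nodes"),
    ],
)
print(missing)
print(queries_from_section(example_config["initializing_queries"]["constraints"]))
```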


### Ingest Lexical Graph

#### Processing | Preparation

Process the articles

In [24]:
lexical_ingest_records = process_article("pubmed_abstracts.txt")

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


Check the first few records 

In [25]:
lexical_ingest_records["nodes"]["chunk"][:3]

Unnamed: 0,id,text
0,c7575a02b7776f6183bcb0c3aa2b3a58,1. Diabetol Metab Syndr. 2025 Jun 21;17(1):235...
1,aaf28f4d0b350bd4c312418709797fd5,"Author information: (1)Faculty of Medicine, Un..."
2,def7de5c464d6254bdaa2d73b86034eb,"Indonesia, Jakarta, 12930, Indonesia. (6)Divis..."


#### Ingestion

In [26]:
# load_lexical_graph(driver, lexical_ingest_records)

Load the Document and Chunk nodes into the graph

In [27]:
def load_lexical_nodes(document_records: pd.DataFrame, chunk_records: pd.DataFrame) -> None:
    """
    Load lexical nodes into the graph. These include Document and Chunk nodes.
    """
    lexical_nodes_ingest_iterator = list(zip([document_records, chunk_records], ['document', 'chunk']))

    for data, query in lexical_nodes_ingest_iterator:
        res = graph.execute_write_query_with_data(database=db_info['database'], 
                                                    data=data, 
                                                    query=node_load_queries[query], 
                                                    partitions=get_partition(data, batch_size=500),
                                                    parallel=True,
                                                    workers=2)
        print(res)

In [28]:
load_lexical_nodes(lexical_ingest_records["nodes"]["document"], lexical_ingest_records["nodes"]["chunk"])

partition: 1
{}
partition: 1
{}


Load the relationships into the graph

In [29]:
def load_lexical_relationships(chunk_part_of_document_records: pd.DataFrame) -> None:
    """
    Load lexical relationships into the graph. This includes the Chunk - PART_OF -> Document relationship.
    """
    lexical_relationships_ingest_iterator = list(zip([chunk_part_of_document_records], ['chunk_part_of_document']))

    for data, query in lexical_relationships_ingest_iterator:
        res = graph.execute_write_query_with_data(database=db_info['database'], 
                                                    data=data, 
                                                    query=relationship_load_queries[query], 
                                                    partitions=get_partition(data, batch_size=500))
        print(res)

In [30]:
load_lexical_relationships(lexical_ingest_records["relationships"]["chunk_part_of_document"])

partition: 1
{}


### Embed Lexical Graph

Here we will read Chunk nodes from the graph that don't have embedding properties yet. 

We will then embed the Chunk text property and add the embedding as a property.

In [31]:
vector_index = ...

def create_vector_index() -> None:
    ...

In [32]:
def create_embeddings(driver) -> None:
    ...

def embed_lexical_graph(driver) -> None:
    ...
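The cells above are left as stubs. As one possible starting point for `create_vector_index`, the sketch below builds a `CREATE VECTOR INDEX` statement in the Cypher syntax available from Neo4j 5.13 onward. Everything specific here is an assumption: the index name, the `Chunk` label and `embedding` property as the target, and the 1536 dimensions, which must match whichever embedding model is eventually chosen. `build_vector_index_query` is a hypothetical helper.

```python
def build_vector_index_query(
    index_name: str = "chunk_embedding",
    label: str = "Chunk",
    property_name: str = "embedding",
    dimensions: int = 1536,
    similarity: str = "cosine",
) -> str:
    """Build a Cypher statement that creates a vector index if it does not already exist."""
    if similarity not in {"cosine", "euclidean"}:
        raise ValueError(f"unsupported similarity function: {similarity}")
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON n.{property_name}\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )


print(build_vector_index_query())
```

The resulting string could be stored alongside the other queries in the YAML config and run with the same write-query call used for constraints.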

### Extract Entities from Lexical Graph

We will now perform entity extraction on the Chunk nodes to augment and connect to our domain graph containing patient journey information.

In [33]:
def get_chunk_nodes_to_process_by_article_name(article_name: str) -> pd.DataFrame:
    """
    Retrieve Chunk node id and text from the database that have a relationship to the Document with the article name provided.
    These chunks may then be used as input to the entity extraction process.
    """

    return graph.execute_read_query(database=db_info['database'], 
                            parameters={"article_name": article_name}, 
                            query=processing_queries['get_chunk_nodes_to_process_by_article_name'], 
                        )


In [34]:
chunks_to_process = get_chunk_nodes_to_process_by_article_name("pubmed_abstracts.txt")

In [35]:
print(f"Found {len(chunks_to_process)} chunks to process")
print(f"First chunk:\n\n{chunks_to_process.loc[0,'text']}")

Found 171 chunks to process
First chunk:

1. Diabetol Metab Syndr. 2025 Jun 21;17(1):235. doi: 10.1186/s13098-025-01823-4.

Seagrass Enhalus acoroides extract mitigates obesity and diabetes via GLP-1, PPARγ, SREBP-1c modulation and gut microbiome restoration in diabetic zebrafish.

Kadharusman MM(1), Syahputra RA(2), Kurniawan R(3), Hadinata E(4), Tjandrawinata RR(5), Taslim NA(6), Romano R(7), Santini A(8), Nurkolis F(9)(10)(11).


In [36]:
entity_ingest_records = await extract_entities_from_chunk_nodes(chunks_to_process[:8])

In [37]:
entity_ingest_records

[('c7575a02b7776f6183bcb0c3aa2b3a58',
  [Medication(name='Seagrass Enhalus acoroides extract', medication_class='Herbal Supplement', mechanism='GLP-1, PPARγ, SREBP-1c modulation and gut microbiome restoration', generic_name=None, brand_names=None, approval_status=None)]),
 ('aaf28f4d0b350bd4c312418709797fd5', []),
 ('def7de5c464d6254bdaa2d73b86034eb', []),
 ('e0313766b618ceaaf67ad271da664657', []),
 ('13e0b5fa31fac4fc45f154b8a820a39c', []),
 ('a367e70b651483e37ff05b067c18204f',
  [StudyMedication(study_medication_id='STUDY_MED002', dosage='5 mg/L', route='Immersion', frequency=None, treatment_duration='20 days', treatment_arm='SEAE-treated group', comparator=None, adherence_rate=None, formulation=None),
   StudyMedication(study_medication_id='STUDY_MED003', dosage='3.3 mg/L', route='Immersion', frequency=None, treatment_duration='20 days', treatment_arm='Metformin-treated group', comparator=None, adherence_rate=None, formulation=None),
   StudyMedicationUsesMedication(study_medication_

### Ingest Entities Into Knowledge Graph

The functions below load the extracted entities and relationships into the graph, then link each extracted entity to the text chunk node it came from.

In [38]:
ENTITY_LABELS = {
    "Medication", 
    "StudyMedication",
    "MedicalCondition",
    "StudyPopulation",
    "ClinicalOutcome",
}

ENTITY_RELS = {
    "StudyMedicationUsesMedication",
    "StudyMedicationProducesClinicalOutcome",
    "StudyPopulationHasMedicalCondition",
    "StudyPopulationReceivesStudyMedication",
    "StudyPopulationHasOutcome",
}

def prepare_entities_for_ingestion(entities: list[tuple[str, list[Any]]]) -> dict[str, dict[str, pd.DataFrame]]:
    """
    Prepare entities for ingestion into the graph.
    This function takes the results of the `extract_entities_from_chunk_nodes` function and returns a dictionary of entity label to pandas DataFrame of entities.

    Returns
    -------
    dict[str, dict[str, pd.DataFrame]]
        A dictionary of entity label to pandas dataframe of entities.

        {
            "nodes": {
                "Medication": pd.DataFrame(...),
                "StudyMedication": pd.DataFrame(...),
                ...
            },
            "relationships": {
                "StudyMedicationUsesMedication": pd.DataFrame(...),
                "StudyMedicationProducesClinicalOutcome": pd.DataFrame(...),
                ...
            }
        }
    """

    records_node_dict = {lbl: list() for lbl in ENTITY_LABELS}
    records_rel_dict = {lbl: list() for lbl in ENTITY_RELS}

    for chunk_id, chunk_entities in entities:
        for entity in chunk_entities:
            to_add = entity.model_dump()
            to_add.update({"chunk_id": chunk_id})
            # nodes
            if entity.__class__.__name__ in ENTITY_LABELS:
                records_node_dict[entity.__class__.__name__].append(to_add)
            # rels
            elif entity.__class__.__name__ in ENTITY_RELS:
                records_rel_dict[entity.__class__.__name__].append(to_add)
            else:
                print(f"Unknown entity type: {entity.__class__.__name__}")

    for key, value in records_node_dict.items():
        records_node_dict[key] = pd.DataFrame(value)

    for key, value in records_rel_dict.items():
        records_rel_dict[key] = pd.DataFrame(value)

    return {"nodes": records_node_dict, "relationships": records_rel_dict}
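`prepare_entities_for_ingestion` attaches `chunk_id` to every record so the same rows can drive both node loading and chunk linking. One consequence is that an entity mentioned in several chunks appears several times in the node records. Whether that matters depends on the load queries, which are not shown here. If it does, the sketch below shows one way to separate unique node records from chunk link records. `split_nodes_and_links` is a hypothetical helper that operates on plain dictionaries, and the records are illustrative only.

```python
from typing import Any


def split_nodes_and_links(
    records: list[dict[str, Any]], key: str
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate unique node records (first occurrence per key) from chunk-to-entity link records."""
    nodes: dict[Any, dict[str, Any]] = {}
    links: list[dict[str, Any]] = []
    for record in records:
        links.append({"chunk_id": record["chunk_id"], key: record[key]})
        # Keep the first record seen for each key and drop chunk_id from the node properties
        nodes.setdefault(record[key], {k: v for k, v in record.items() if k != "chunk_id"})
    return list(nodes.values()), links


# Illustrative records, not output from this notebook
medication_records = [
    {"name": "Metformin", "medication_class": "Biguanide", "chunk_id": "chunk-1"},
    {"name": "Metformin", "medication_class": "Biguanide", "chunk_id": "chunk-2"},
]
unique_nodes, chunk_links = split_nodes_and_links(medication_records, key="name")
print(unique_nodes)
print(chunk_links)
```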

In [44]:
def load_entity_nodes(medication_records: pd.DataFrame, 
                      medical_condition_records: pd.DataFrame, 
                      study_medication_records: pd.DataFrame, 
                      study_population_records: pd.DataFrame, 
                      clinical_outcome_records: pd.DataFrame) -> None:
    """
    Load entity nodes into the graph.
    """
    
    entity_nodes_ingest_iterator = list(zip([medication_records, 
                                             medical_condition_records, 
                                             study_medication_records, 
                                             study_population_records, 
                                             clinical_outcome_records], 
                                             ['medication', 
                                              'medical_condition', 
                                              'study_medication', 
                                              'study_population', 
                                              'clinical_outcome']))

    for data, query in entity_nodes_ingest_iterator:
        print(f"Loading {len(data)} {query} nodes")
        if len(data) > 0:
            res = graph.execute_write_query_with_data(database=db_info['database'], 
                                                    data=data, 
                                                    query=node_load_queries[query], 
                                                    partitions=get_partition(data, batch_size=500),
                                                    parallel=False)
            print(res)
        else:
            print(f"No {query} nodes to load")

In [45]:
def load_entity_relationships(study_medication_uses_medication: pd.DataFrame,
                              study_medication_produces_clinical_outcome: pd.DataFrame,
                              study_population_has_medical_condition: pd.DataFrame,
                              study_population_receives_study_medication: pd.DataFrame,
                              study_population_has_outcome: pd.DataFrame,
                              ) -> None:
    """
    Load entity relationships into the graph.
    """
    entity_relationships_ingest_iterator = list(zip([study_medication_uses_medication, 
                                                      study_medication_produces_clinical_outcome, 
                                                      study_population_has_medical_condition, 
                                                      study_population_receives_study_medication, 
                                                      study_population_has_outcome], 
                                                      ['study_medication_uses_medication', 
                                                       'study_medication_produces_clinical_outcome', 
                                                       'study_population_has_medical_condition', 
                                                       'study_population_receives_study_medication', 
                                                       'study_population_has_outcome']))
    
    for data, query in entity_relationships_ingest_iterator:
        print(f"Loading {len(data)} {query} relationships")
        if len(data) > 0:
            res = graph.execute_write_query_with_data(database=db_info['database'], 
                                                    data=data, 
                                                    query=relationship_load_queries[query], 
                                                    partitions=get_partition(data, batch_size=500),
                                                    parallel=False)
            print(res)
        else:
            print(f"No {query} relationships to load")

In [51]:
def link_entities_to_chunks(medication_link_records: pd.DataFrame, 
                      medical_condition_link_records: pd.DataFrame, 
                      study_medication_link_records: pd.DataFrame, 
                      study_population_link_records: pd.DataFrame, 
                      clinical_outcome_link_records: pd.DataFrame) -> None:
    """
    Link entities to chunks.
    """
    entity_link_iterator = list(zip([medication_link_records, 
                                     medical_condition_link_records, 
                                     study_medication_link_records, 
                                     study_population_link_records, 
                                     clinical_outcome_link_records], 
                                     ["chunk_has_entity_medication",
                                      "chunk_has_entity_medical_condition",
                                      "chunk_has_entity_study_medication",
                                      "chunk_has_entity_study_population",
                                      "chunk_has_entity_clinical_outcome"]))
    
    for data, query in entity_link_iterator:
        print(f"Linking {len(data)} {query} entities to chunks")
        if len(data) > 0:
            res = graph.execute_write_query_with_data(database=db_info['database'], 
                                                    data=data, 
                                                    query=relationship_load_queries[query], 
                                                    partitions=get_partition(data, batch_size=500),
                                                    parallel=False)
            print(res)
        else:
            print(f"No {query} relationships to load")

In [41]:
ingest_records = prepare_entities_for_ingestion(entity_ingest_records)

In [57]:
ingest_records["nodes"]["Medication"]['generic_name'].isnull().sum()

np.int64(2)

In [46]:
load_entity_nodes(ingest_records["nodes"]["Medication"], 
                  ingest_records["nodes"]["MedicalCondition"], 
                  ingest_records["nodes"]["StudyMedication"], 
                  ingest_records["nodes"]["StudyPopulation"], 
                  ingest_records["nodes"]["ClinicalOutcome"])

Loading 2 medication nodes
partition: 1
{'properties_set': 10}
Loading 0 medical_condition nodes
No medical_condition nodes to load
Loading 2 study_medication nodes
partition: 1
{'labels_added': 2, 'nodes_created': 2, 'properties_set': 18}
Loading 0 study_population nodes
No study_population nodes to load
Loading 0 clinical_outcome nodes
No clinical_outcome nodes to load


In [48]:
load_entity_relationships(ingest_records["relationships"]["StudyMedicationUsesMedication"], 
                          ingest_records["relationships"]["StudyMedicationProducesClinicalOutcome"], 
                          ingest_records["relationships"]["StudyPopulationHasMedicalCondition"], 
                          ingest_records["relationships"]["StudyPopulationReceivesStudyMedication"], 
                          ingest_records["relationships"]["StudyPopulationHasOutcome"])

Loading 1 study_medication_uses_medication relationships
partition: 1
{'relationships_created': 1}
Loading 0 study_medication_produces_clinical_outcome relationships
No study_medication_produces_clinical_outcome relationships to load
Loading 0 study_population_has_medical_condition relationships
No study_population_has_medical_condition relationships to load
Loading 0 study_population_receives_study_medication relationships
No study_population_receives_study_medication relationships to load
Loading 0 study_population_has_outcome relationships
No study_population_has_outcome relationships to load


In [52]:
link_entities_to_chunks(ingest_records["nodes"]["Medication"], 
                        ingest_records["nodes"]["MedicalCondition"], 
                        ingest_records["nodes"]["StudyMedication"], 
                        ingest_records["nodes"]["StudyPopulation"], 
                        ingest_records["nodes"]["ClinicalOutcome"])

Linking 2 chunk_has_entity_medication entities to chunks
partition: 1
{}
Linking 0 chunk_has_entity_medical_condition entities to chunks
No chunk_has_entity_medical_condition relationships to load
Linking 2 chunk_has_entity_study_medication entities to chunks
partition: 1
{'relationships_created': 2}
Linking 0 chunk_has_entity_study_population entities to chunks
No chunk_has_entity_study_population relationships to load
Linking 0 chunk_has_entity_clinical_outcome entities to chunks
No chunk_has_entity_clinical_outcome relationships to load
