# PubMed Knowledge Graph

This notebook is part of a series that walks through the process of generating a knowledge graph of PubMed articles.

This notebook loads structured patient journey data into a Neo4j instance.

In [1]:
import os

import pandas as pd
from pyneoinstance import Neo4jInstance, load_yaml_file

We will load patient data that follows this data model:

![patient-data-model](assets/images/patient-data-model.png)

## PyNeoInstance

Our database credentials and all of our queries are stored in the `pyneoinstance_config.yaml` file. 

This makes it easy to manage our queries and keeps the notebook code clean. 
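
As a rough illustration of the layout the next cell relies on, the sketch below checks a parsed config for the sections the notebook reads. The key names are taken from the cell that follows; the `missing_config_keys` helper and the `example_config` values are hypothetical and not part of PyNeoInstance.

```python
# Sections (and child keys) that the notebook reads from the parsed YAML.
EXPECTED_SECTIONS = {
    "db_info": (),
    "initializing_queries": ("constraints", "indexes"),
    "loading_queries": ("patient_journey_claims", "patient_journey_protocol"),
    "processing_queries": (),
}


def missing_config_keys(config: dict) -> list[str]:
    """Return dotted paths of expected keys that are absent from the config."""
    missing = []
    for section, children in EXPECTED_SECTIONS.items():
        if section not in config:
            missing.append(section)
            continue
        for child in children:
            # An empty YAML section parses to None, so treat it as an empty dict.
            if child not in (config[section] or {}):
                missing.append(f"{section}.{child}")
    return missing


# Hypothetical stand-in for load_yaml_file("pyneoinstance_config.yaml").
example_config = {
    "db_info": {"uri": "neo4j://localhost:7687", "database": "neo4j"},
    "initializing_queries": {"constraints": {}, "indexes": {}},
    "loading_queries": {"patient_journey_claims": {}},
    "processing_queries": {},
}

print(missing_config_keys(example_config))
# ['loading_queries.patient_journey_protocol']
```

Running a check like this right after loading the file surfaces a missing section immediately, rather than as a `KeyError` several cells later.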

In [2]:
config = load_yaml_file("pyneoinstance_config.yaml")

db_info = config['db_info']

constraints = config['initializing_queries']['constraints']
indexes = config['initializing_queries']['indexes']

claim_queries = config['loading_queries']['patient_journey_claims']
protocol_query = config['loading_queries']['patient_journey_protocol']
post_processing_queries = config['processing_queries']

This graph object will handle database connections and read / write transactions for us.

In [3]:
graph = Neo4jInstance(db_info.get('uri', os.getenv("NEO4J_URI", "neo4j://localhost:7687")), # use config value -> use env value -> use default value
                      db_info.get('user', os.getenv("NEO4J_USER", "neo4j")), 
                      db_info.get('password', os.getenv("NEO4J_PASSWORD", "password")))
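
One detail of the fallback chain above: `dict.get` only falls back when the key is missing, so a key that is present but empty in the YAML (parsed as `None`) would be passed to the driver as `None`. The helper below is a minimal sketch of a stricter config -> env -> default lookup. `resolve_setting` and the `EXAMPLE_*` variable names are hypothetical.

```python
import os


def resolve_setting(config: dict, key: str, env_var: str, default: str) -> str:
    """Return the config value, else the environment value, else the default."""
    value = config.get(key)
    if value:  # skips missing keys, None, and empty strings
        return value
    return os.getenv(env_var) or default


# Hypothetical environment, set up only for this demonstration.
os.environ["EXAMPLE_NEO4J_USER"] = "env_user"
os.environ.pop("EXAMPLE_UNSET_VAR", None)

print(resolve_setting({"user": "config_user"}, "user", "EXAMPLE_NEO4J_USER", "neo4j"))  # config_user
print(resolve_setting({"user": None}, "user", "EXAMPLE_NEO4J_USER", "neo4j"))           # env_user
print(resolve_setting({}, "user", "EXAMPLE_UNSET_VAR", "neo4j"))                        # neo4j
```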

This is a helper function for ingesting data using the PyNeoInstance library.

In [4]:
def get_partition(data: pd.DataFrame, batch_size: int = 500) -> int:
    """
    Determine the number of partitions to use for the desired batch size.

    Parameters
    ----------
    data : pd.DataFrame
        The Pandas DataFrame to partition.
    batch_size : int
        The desired batch size.

    Returns
    -------
    int
        The number of partitions (at least 1).
    """
    
    # Floor-divide the row count by the batch size, but always use at least one partition.
    partition = max(1, len(data) // batch_size)
    print(f"partition: {partition}")
    return partition
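
To make the arithmetic concrete, the sketch below reproduces the same calculation on plain row counts. `partition_count` is a hypothetical stand-in for `get_partition` that takes the number of rows directly, so it runs without a DataFrame.

```python
def partition_count(n_rows: int, batch_size: int = 500) -> int:
    """Floor of n_rows / batch_size, but never fewer than one partition."""
    return max(1, n_rows // batch_size)


for n_rows in (50, 221, 999, 1000, 1250):
    print(n_rows, partition_count(n_rows))
# 50 1
# 221 1
# 999 1
# 1000 2
# 1250 2
```

Because the division rounds down, a batch can end up larger than `batch_size`. For example, 999 rows still go in as a single partition. Rounding up instead would keep every batch at or below the requested size, assuming rows are split roughly evenly across partitions.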

This mapping adds ICD-10 codes based on the existing ICD-9 codes in our claims data. Because this is a mock dataset, we are not concerned with detailed ICD-10 codes here.

This cell was originally used to process the `claims_with_all_codes.csv` file and is not currently in use.

In [5]:
icd9_to_icd10 = {
    '244.9': 'E03.9',      # Hypothyroidism, unspecified
    '250.0': 'E11.9',      # Diabetes mellitus type 2 without complications
    '272.4': 'E78.5',      # Hyperlipidemia, unspecified
    '300.0': 'F41.9',      # Anxiety disorder, unspecified
    '311.0': 'F32.9',      # Major depressive disorder, single episode, unspecified
    '401.9': 'I10',        # Essential (primary) hypertension
    '414.0': 'I25.10',     # Atherosclerotic heart disease without angina pectoris
    '493.9': 'J45.909',    # Asthma, unspecified, uncomplicated
    '496.0': 'J44.9',      # COPD, unspecified
    '530.81': 'K21.9',     # Gastroesophageal reflux disease without esophagitis
    '585.9': 'N18.9',      # Chronic kidney disease, unspecified
    '715.9': 'M19.90'      # Osteoarthritis, unspecified site
}

icd10_to_condition = {
    'E03.9': 'Hypothyroidism, unspecified',
    'E11.9': 'Diabetes mellitus type 2 without complications',
    'E78.5': 'Hyperlipidemia, unspecified',
    'F41.9': 'Anxiety disorder, unspecified',
    'F32.9': 'Major depressive disorder, single episode, unspecified',
    'I10': 'Essential (primary) hypertension',
    'I25.10': 'Atherosclerotic heart disease without angina pectoris',
    'J45.909': 'Asthma, unspecified, uncomplicated',
    'J44.9': 'COPD, unspecified',
    'K21.9': 'Gastroesophageal reflux disease without esophagitis',
    'N18.9': 'Chronic kidney disease, unspecified',
    'M19.90': 'Osteoarthritis, unspecified site'
}
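
As a small sketch of how the two dictionaries chain together, the lookup below goes from an ICD-9 code to a condition description. It uses a two-entry subset of the mappings above; the `describe_icd9` helper is hypothetical. It also shows why the ICD-9 column needs to be read as strings: the keys are strings, so a float such as `250.0` does not match.

```python
from typing import Optional

# Two-entry subset of the mappings defined above.
icd9_to_icd10 = {"250.0": "E11.9", "401.9": "I10"}
icd10_to_condition = {
    "E11.9": "Diabetes mellitus type 2 without complications",
    "I10": "Essential (primary) hypertension",
}


def describe_icd9(code: str) -> Optional[str]:
    """Map an ICD-9 code to its condition description via ICD-10, or None if unmapped."""
    icd10 = icd9_to_icd10.get(code)
    return icd10_to_condition.get(icd10) if icd10 else None


print(describe_icd9("401.9"))      # Essential (primary) hypertension
print(describe_icd9("999.9"))      # None
print(icd9_to_icd10.get(250.0))    # None: a float key does not match the string '250.0'
```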

## Constraints + Indexes

In [10]:
def create_constraints_and_indexes() -> None:
    """
    Create constraints and indexes for the lexical and domain graphs.
    """
    try:
        if constraints and len(constraints) > 0:
            graph.execute_write_queries(database=db_info['database'], queries=list(constraints.values()))
    except Exception as e:
        print(e)

    try:
        if indexes and len(indexes) > 0:
            graph.execute_write_queries(database=db_info['database'], queries=list(indexes.values()))
    except Exception as e:
        print(e)

In [11]:
create_constraints_and_indexes()
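
Since `create_constraints_and_indexes` passes `list(constraints.values())` to the driver, the `constraints` section is presumably a mapping of names to Cypher strings. The sketch below shows what such entries might look like and flags any that lack `IF NOT EXISTS`, which would otherwise be likely to raise when the cell is rerun. The labels, property names, and the `non_idempotent` helper are illustrative assumptions, not the contents of the actual config file.

```python
# Hypothetical entries; labels and property names are illustrative only.
constraints = {
    "patient_id": "CREATE CONSTRAINT patient_id IF NOT EXISTS "
                  "FOR (p:Patient) REQUIRE p.id IS UNIQUE",
    "event_id": "CREATE CONSTRAINT event_id IF NOT EXISTS "
                "FOR (e:Event) REQUIRE e.id IS UNIQUE",
}


def non_idempotent(queries: dict[str, str]) -> list[str]:
    """Return the names of schema queries that do not use IF NOT EXISTS."""
    return [name for name, query in queries.items()
            if "IF NOT EXISTS" not in query.upper()]


print(non_idempotent(constraints))  # []
```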

## Load Data

Our patient journey data will be loaded from a Pandas DataFrame.

We will load two types of patient journey data:
* Claims
* Protocol

After ingesting the data, we will connect the events sequentially for each patient with a Cypher query.
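
The sequencing step itself runs in Cypher, but its logic can be sketched in plain Python: group events by patient, sort them by date, link each event to the one before it, and record the latest event per patient. The field names (`id`, `patient_id`, `date`) and the sample events are hypothetical; the actual query lives in the config file.

```python
from collections import defaultdict


def sequence_events(events: list[dict]) -> tuple[list[tuple[str, str]], dict[str, str]]:
    """Return (newer_id, older_id) PREVIOUS links and the most recent event id per patient."""
    by_patient = defaultdict(list)
    for event in events:
        by_patient[event["patient_id"]].append(event)

    previous_links = []
    most_recent = {}
    for patient_id, patient_events in by_patient.items():
        # ISO-formatted date strings sort chronologically.
        ordered = sorted(patient_events, key=lambda e: e["date"])
        for older, newer in zip(ordered, ordered[1:]):
            previous_links.append((newer["id"], older["id"]))
        most_recent[patient_id] = ordered[-1]["id"]
    return previous_links, most_recent


# Hypothetical sample events.
events = [
    {"id": "e1", "patient_id": "p1", "date": "2024-01-05"},
    {"id": "e2", "patient_id": "p1", "date": "2024-03-01"},
    {"id": "e3", "patient_id": "p1", "date": "2024-02-10"},
    {"id": "e4", "patient_id": "p2", "date": "2024-01-20"},
]

links, latest = sequence_events(events)
print(links)   # [('e3', 'e1'), ('e2', 'e3')]
print(latest)  # {'p1': 'e2', 'p2': 'e4'}
```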

In [12]:
def load_patient_journey_protocol_data(graph: Neo4jInstance, data: pd.DataFrame) -> None:
    """
    Load patient journey protocol data into the graph.

    Parameters
    ----------
    graph: Neo4jInstance
        The graph to load the data into.
    data: pd.DataFrame
        The patient journey protocol data to load.
    """
    print(f"Loading {len(data)} patient journey protocol rows")
    
    res = graph.execute_write_query_with_data(database=db_info['database'], 
                                            data=data, 
                                            query=protocol_query,
                                            partitions=get_partition(data, batch_size=500),
                                            parallel=False)
    print(res)


def load_patient_journey_claims_data(graph: Neo4jInstance, dataframe_tuples: list[tuple[str, pd.DataFrame]]) -> None:
    """
    Load patient journey claims data into the graph.

    Parameters
    ----------
    graph: Neo4jInstance
        The graph to load the data into.
    dataframe_tuples: list[tuple[str, pd.DataFrame]]
        A list of tuples, where the first element is the name of the query in the config and the second element is the associated dataframe.
    """

    # load each DataFrame using its matching query from the config
    for query_name, dataframe in dataframe_tuples:
        print(f"Loading {len(dataframe)} {query_name} rows")
        res = graph.execute_write_query_with_data(database=db_info['database'], 
                                                data=dataframe, 
                                                query=claim_queries[query_name], 
                                                partitions=get_partition(dataframe, batch_size=500),
                                                parallel=False)
        print(res)

def sequence_patient_events(graph: Neo4jInstance) -> None:
    """
    Post-processing function that connects all patient events sequentially.
    Each patient is connected to their most recent event.

    (:Event)*<-[:PREVIOUS]-(:Event)<-[:MOST_RECENT_EVENT]-(:Patient)
    """
    res = graph.execute_write_query(database=db_info['database'], 
                                    query=post_processing_queries['sequence_patient_events'])
    print(res)

In [13]:
protocol_data = pd.read_csv("data/protocol/extended_patient_journey.csv")

In [14]:
claims_data_path = "data/claims"
claims_data_tuples = [
    ("patients", pd.read_csv(f"{claims_data_path}/patients.csv")),
    ("providers", pd.read_csv(f"{claims_data_path}/providers.csv")),
    ("claims", pd.read_csv(f"{claims_data_path}/claims_with_all_codes.csv", converters={'icd9': str})),
    ("events", pd.read_csv(f"{claims_data_path}/patient_journey_with_providers.csv")),
    # ("conditions", pd.read_csv(f"{claims_data_path}/conditions.csv")),
    ("care_gaps", pd.read_csv(f"{claims_data_path}/care_gaps.csv")),
    ("risk_scores", pd.read_csv(f"{claims_data_path}/risk_scores.csv"))
]

Now we load our patient journey data.

In [15]:
load_patient_journey_protocol_data(graph, protocol_data)

Loading 221 patient journey protocol rows
partition: 1
{'labels_added': 550, 'relationships_created': 1597, 'nodes_created': 550, 'properties_set': 1316}


In [16]:
load_patient_journey_claims_data(graph, claims_data_tuples)

Loading 50 patients rows
partition: 1
{'labels_added': 50, 'nodes_created': 50, 'properties_set': 50}
Loading 20 providers rows
partition: 1
{'labels_added': 20, 'nodes_created': 20, 'properties_set': 60}
Loading 310 claims rows
partition: 1
{'labels_added': 644, 'relationships_created': 1547, 'nodes_created': 644, 'properties_set': 3756}
Loading 333 events rows
partition: 1
{'labels_added': 328, 'relationships_created': 329, 'nodes_created': 328, 'properties_set': 989}
Loading 43 care_gaps rows
partition: 1
{'labels_added': 85, 'relationships_created': 86, 'nodes_created': 85, 'properties_set': 170}
Loading 50 risk_scores rows
partition: 1
{'labels_added': 50, 'relationships_created': 50, 'nodes_created': 50, 'properties_set': 100}


In [17]:
sequence_patient_events(graph)

{'relationships_created': 900, 'relationships_deleted': 900}
