# PubMed Knowledge Graph

This notebook is part of a series that walks through the process of generating a knowledge graph of PubMed articles.

This notebook will:
* Load structured patient journey data into a Neo4j instance

In [1]:
import os

import pandas as pd
from pyneoinstance import Neo4jInstance, load_yaml_file

## PyNeoInstance

Our database credentials and all of our queries are stored in the `pyneoinstance_config.yaml` file. 

This makes it easy to manage our queries and keeps the notebook code clean. 

In [2]:
config = load_yaml_file("pyneoinstance_config.yaml")

db_info = config['db_info']

constraints = config['initializing_queries']['constraints']
indexes = config['initializing_queries']['indexes']

load_queries = config['patient_journey']
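
For orientation, the dictionary returned by `load_yaml_file` needs at least the keys read in the cell above. The sketch below mirrors that shape as a plain Python dict. Only the key names come from this notebook; the constraint, index, and Cypher strings (including the `$rows` parameter name) are hypothetical placeholders and are not the contents of the real `pyneoinstance_config.yaml`.

```python
# Illustrative shape of the loaded config. Key names match the lookups in this
# notebook; every value below is a hypothetical placeholder.
example_config = {
    "db_info": {
        "uri": "neo4j://localhost:7687",
        "user": "neo4j",
        "password": "password",
        "database": "neo4j",
    },
    "initializing_queries": {
        "constraints": {
            # hypothetical constraint name and Cypher
            "patient_id": "CREATE CONSTRAINT patient_id IF NOT EXISTS "
                          "FOR (p:Patient) REQUIRE p.id IS UNIQUE",
        },
        "indexes": {
            # hypothetical index name and Cypher
            "claim_date": "CREATE INDEX claim_date IF NOT EXISTS "
                          "FOR (c:Claim) ON (c.date)",
        },
    },
    "patient_journey": {
        # one entry per query name used at load time; the Cypher is hypothetical
        "patients": "UNWIND $rows AS row MERGE (p:Patient {id: row.patient_id})",
    },
}

# The same lookups the notebook performs on the real config.
example_constraints = example_config["initializing_queries"]["constraints"]
example_indexes = example_config["initializing_queries"]["indexes"]
example_load_queries = example_config["patient_journey"]

print(list(example_constraints.values()))
print(sorted(example_load_queries))
```

Keeping constraints and indexes as name-to-query mappings is what lets the notebook pass `list(constraints.values())` straight to the write call.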

This graph object will handle database connections and read / write transactions for us.

In [3]:
graph = Neo4jInstance(db_info.get('uri', os.getenv("NEO4J_URI", "neo4j://localhost:7687")), # use config value -> use env value -> use default value
                      db_info.get('user', os.getenv("NEO4J_USER", "neo4j")), 
                      db_info.get('password', os.getenv("NEO4J_PASSWORD", "password")))
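
The fallback order used above (config value, then environment variable, then default) can be checked in isolation. This is a minimal sketch: `resolve_setting` and the `EXAMPLE_NEO4J_URI` variable are hypothetical names introduced only for this demonstration.

```python
import os
from typing import Optional


def resolve_setting(db_info: dict, key: str, env_var: str, default: str) -> Optional[str]:
    """Hypothetical helper mirroring db_info.get(key, os.getenv(env_var, default))."""
    return db_info.get(key, os.getenv(env_var, default))


DEFAULT_URI = "neo4j://localhost:7687"
ENV_VAR = "EXAMPLE_NEO4J_URI"  # made-up variable name, used only in this demo

os.environ[ENV_VAR] = "neo4j://from-env:7687"

# 1. Key present in the config: the config value wins.
from_config = resolve_setting({"uri": "neo4j://from-config:7687"}, "uri", ENV_VAR, DEFAULT_URI)

# 2. Key missing from the config: the environment variable is used.
from_env = resolve_setting({}, "uri", ENV_VAR, DEFAULT_URI)

# 3. Key missing and environment variable unset: the default is used.
del os.environ[ENV_VAR]
from_default = resolve_setting({}, "uri", ENV_VAR, DEFAULT_URI)

# Caveat: a key that is present but set to None does NOT fall through,
# because dict.get only uses its fallback when the key is absent.
present_but_none = resolve_setting({"uri": None}, "uri", ENV_VAR, DEFAULT_URI)

print(from_config, from_env, from_default, present_but_none)
```

If the YAML file lists a key with an empty value, the last case applies and `None` is passed to the connection, so it may be worth removing unused keys from the config rather than leaving them blank.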

This helper function works out how many partitions a DataFrame should be split into for a desired batch size when ingesting data with the PyNeoInstance library.

In [4]:
def get_partition(data: pd.DataFrame, batch_size: int = 500) -> int:
    """
    Determine the number of partitions to split the data into for the desired batch size.

    Parameters
    ----------
    data : pd.DataFrame
        The Pandas DataFrame to partition.
    batch_size : int
        The desired number of rows per batch.

    Returns
    -------
    int
        The number of partitions (always at least 1).
    """

    # Floor division: a batch can exceed batch_size when the row count is not an exact multiple.
    partition = max(1, len(data) // batch_size)
    print(f"partition: {partition}")
    return partition
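
To see how the partition count behaves for different data sizes, the sketch below applies the same arithmetic as `get_partition` to plain row counts. `partition_count` is a hypothetical stand-in that takes a number instead of a DataFrame so it runs without pandas.

```python
def partition_count(n_rows: int, batch_size: int = 500) -> int:
    """Same arithmetic as get_partition, but on a row count rather than a DataFrame."""
    return max(1, n_rows // batch_size)


for n_rows in (221, 500, 999, 1200, 10_000):
    parts = partition_count(n_rows)
    print(f"{n_rows} rows -> {parts} partition(s), about {n_rows / parts:.0f} rows each")
```

Because the division rounds down, 999 rows still produce a single partition, so an individual batch can hold almost twice the requested `batch_size`.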

## Constraints + Indexes

In [5]:
def create_constraints_and_indexes() -> None:
    """
    Create constraints and indexes for the lexical and domain graphs.
    """
    try:
        if constraints and len(constraints) > 0:
            graph.execute_write_queries(database=db_info['database'], queries=list(constraints.values()))
    except Exception as e:
        print(e)

    try:
        if indexes and len(indexes) > 0:
            graph.execute_write_queries(database=db_info['database'], queries=list(indexes.values()))
    except Exception as e:
        print(e)

In [6]:
create_constraints_and_indexes()

## Load Data

Our patient journey data is read from CSV files into Pandas DataFrames. Each DataFrame is paired with the name of the query in the config that loads it.

In [7]:
def load_patient_journey_data(graph: Neo4jInstance, dataframe_tuples: list[tuple[str, pd.DataFrame]]) -> None:
    """
    Load patient journey data into the graph.

    Parameters
    ----------
    graph: Neo4jInstance
        The graph to load the data into.
    dataframe_tuples: list[tuple[str, pd.DataFrame]]
        A list of tuples, where the first element is the name of the query in the config and the second element is the associated dataframe.
    """

    for query_name, dataframe in dataframe_tuples:
        print(f"Loading {len(dataframe)} {query_name} rows")
        res = graph.execute_write_query_with_data(
            database=db_info['database'],
            data=dataframe,
            query=load_queries[query_name],
            partitions=get_partition(dataframe, batch_size=500),
            parallel=False,
        )
        # Print the write counters for every query, not only the last one.
        print(res)

In [11]:
data_path = "data/patient_journey"
data_tuples = [
    ("patients", pd.read_csv(f"{data_path}/patients.csv")),
    ("providers", pd.read_csv(f"{data_path}/providers.csv")),
    ("claims", pd.read_csv(f"{data_path}/claims_with_all_codes.csv")),
    ("events", pd.read_csv(f"{data_path}/patient_journey_with_providers.csv")),
    ("conditions", pd.read_csv(f"{data_path}/conditions.csv")),
    ("care_gaps", pd.read_csv(f"{data_path}/care_gaps.csv")),
    ("risk_scores", pd.read_csv(f"{data_path}/risk_scores.csv"))
]
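
Each name in `data_tuples` is used as a key into `load_queries`, so a typo would only surface as a `KeyError` partway through the load. The sketch below shows one way to check the names up front. `find_missing_queries` is a hypothetical helper, and the example mapping is a made-up subset standing in for the real config.

```python
def find_missing_queries(query_names: list[str], available_queries: dict) -> list[str]:
    """Return the query names that have no matching entry in the query mapping."""
    return [name for name in query_names if name not in available_queries]


# Stand-ins for the notebook's objects: the names used in data_tuples, and a
# hypothetical subset of what load_queries might contain.
example_names = ["patients", "providers", "claims", "events",
                 "conditions", "care_gaps", "risk_scores"]
example_queries = {"patients": "...", "providers": "...", "claims": "..."}

missing = find_missing_queries(example_names, example_queries)
if missing:
    print(f"No load query configured for: {missing}")
```

In the notebook itself, the equivalent call would be `find_missing_queries([name for name, _ in data_tuples], load_queries)`.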

And now we load our patient journey data.

In [10]:
load_patient_journey_data(graph, data_tuples)

Loading 221 patient journey rows
partition: 1
{'labels_added': 139, 'relationships_created': 863, 'nodes_created': 139, 'properties_set': 938}
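
The dictionary above reports write counters for a single load. Assuming every call to `execute_write_query_with_data` returns a dict of this shape (the output suggests so, but this is not checked against the library documentation), the per-query results could be summed into one total. `total_counters` is a hypothetical helper, and the second example dict is made up.

```python
from collections import Counter


def total_counters(results: list[dict]) -> dict:
    """Sum per-query write counters into a single dict of totals."""
    totals = Counter()
    for counters in results:
        totals.update(counters)  # Counter.update adds counts rather than replacing them
    return dict(totals)


example_results = [
    # copied from the output shown above
    {'labels_added': 139, 'relationships_created': 863, 'nodes_created': 139, 'properties_set': 938},
    # made-up second result, for illustration only
    {'nodes_created': 10, 'properties_set': 25},
]

combined = total_counters(example_results)
print(combined)
```

Collecting each `res` into a list inside the loading loop and passing that list to a helper like this would give one summary for the whole run.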
