# BEL GraphRAG Example for Paper Use Cases



In [2]:
# %pip install ndex2 langchain

In [3]:
%pip install texttoknowledgegraph==0.3.8 -q

Note: you may need to restart the kernel to use updated packages.


In [4]:
import os
import json

from dotenv import load_dotenv
from ndex2 import Ndex2
from ndex2.cx2 import CX2Network, RawCX2NetworkFactory

# Load credentials from a local .env file into the environment
load_dotenv()



# Read the OpenAI API key and the NDEx account and password from environment variables
OPENAI_API_KEY   = os.getenv("OPENAI_API_KEY")
NDEX_ACCOUNT     = os.getenv("NDEX_ACCOUNT")
NDEX_PASSWORD    = os.getenv("NDEX_PASSWORD")
assert all([OPENAI_API_KEY, NDEX_ACCOUNT, NDEX_PASSWORD]), "Missing creds"

# Connect to NDEx using the provided credentials
ndex_client = Ndex2(username=NDEX_ACCOUNT, password=NDEX_PASSWORD)



## Run GraphRAG query on first paper

In [5]:
# Download the network of the first paper: pmid10436023

BASE_KG_UUID = "03290389-567f-11f0-a218-005056ae3c32"   
base_cx2 = ndex_client.get_network_as_cx2_stream(BASE_KG_UUID).json()
with open("base_kg.cx2", "w") as f: json.dump(base_cx2, f)

In [6]:
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

def query_llm(prompt: str) -> str:
    """
    Query OpenAI's GPT model.
    """
    try:
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=1000
        )
        return completion.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error querying OpenAI: {str(e)}")
        return ""


Gene list: SIRT1, NAMPT, PARP1, TP53, CDK2, BRCA1
- SIRT1 and NAMPT: metabolism genes
- PARP1, TP53, BRCA1: DNA damage response genes
- CDK2: cell cycle gene


This use case was chosen because we want to understand how different biological processes are connected to each other.

Try two options: letting the LLM use its own knowledge alongside the knowledge graph, and explicitly telling it to use only the knowledge graph.

- IMPORTANT NOTE: ONLY USE THE INFORMATION PROVIDED IN THE KNOWLEDGE GRAPH TO ANSWER THE QUESTION. DO NOT MAKE UP ANY INFORMATION OR USE YOUR OWN KNOWLEDGE.

- IMPORTANT NOTE: YOU CAN MAKE USE OF YOUR KNOWLEDGE OF BIOLOGY AND THE PROVIDED KNOWLEDGE GRAPH TO ANSWER THE QUESTION.
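
The two options differ only in the IMPORTANT NOTE line, so the note can be swapped programmatically instead of editing the template by hand. The following is a minimal sketch, not part of the original notebook: the names `KG_ONLY_NOTE`, `KG_PLUS_LLM_NOTE`, and `build_prompt_template` are hypothetical, and the template text mirrors the prompt used later in this notebook.

```python
# Hypothetical constants: the two grounding notes listed above.
KG_ONLY_NOTE = (
    "IMPORTANT NOTE: ONLY USE THE INFORMATION PROVIDED IN THE KNOWLEDGE GRAPH "
    "TO ANSWER THE QUESTION. DO NOT MAKE UP ANY INFORMATION OR USE YOUR OWN KNOWLEDGE."
)
KG_PLUS_LLM_NOTE = (
    "IMPORTANT NOTE: YOU CAN MAKE USE OF YOUR KNOWLEDGE OF BIOLOGY AND THE "
    "PROVIDED KNOWLEDGE GRAPH TO ANSWER THE QUESTION."
)


def build_prompt_template(question: str, kg_only: bool) -> str:
    """Return a prompt template that still contains the {geneset} and
    {knowledge_graph} placeholders, ready for str.format().

    Assumes `question` contains no curly braces, since the result is
    later passed through str.format().
    """
    note = KG_ONLY_NOTE if kg_only else KG_PLUS_LLM_NOTE
    return (
        "You are playing the role of an expert cancer biologist.\n\n"
        f"QUESTION: {question}\n\n"
        f"{note}\n\n"
        "TASK:\n"
        "1. Review the following gene/protein set and the provided knowledge graph.\n"
        "2. Answer the question under QUESTION based on the provided knowledge graph.\n"
        "3. Provide your final answer as a paragraph summary\n\n"
        # Plain (non-f) strings, so the placeholders stay literal here.
        "Genes: {geneset}\n\n"
        "{knowledge_graph}\n"
    )


kg_only_template = build_prompt_template(
    "How does metabolism affect dna damage response?", kg_only=True
)
kg_plus_llm_template = build_prompt_template(
    "How does metabolism affect dna damage response?", kg_only=False
)
```

Either template can then be passed wherever a `prompt_template` string is expected, which keeps the two experimental conditions identical apart from the note.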


In [11]:
# Base prompt for LLM query
PROMPT_TEMPLATE = """
You are playing the role of an expert cancer biologist.

QUESTION: How does metabolism affect dna damage response?

IMPORTANT NOTE: YOU CAN MAKE USE OF YOUR KNOWLEDGE OF BIOLOGY AND THE PROVIDED KNOWLEDGE GRAPH TO ANSWER THE QUESTION.

TASK:
1. Review the following gene/protein set and the provided knowledge graph.
2. Answer the question under QUESTION based on the provided knowledge graph.
3. Provide your final answer as a paragraph summary

Genes: {geneset}

{knowledge_graph}
"""


In [12]:
# Function that runs a GraphRAG query and returns the LLM response

from ndex2 import create_nice_cx_from_raw_cx
from typing import Callable, Union, List

def graph_rag_query(
    geneset: Union[List[str], str],
    ndex_id: str,
    prompt_template: str,
    ndex_client,
    llm_query_fn: Callable[[str], str],
    search_depth: int = 1
) -> str:
    """
    Perform a GraphRAG query: pull a neighborhood of `search_depth` hops from NDEx,
    extract BEL context, fill the prompt, and query the LLM.

    Inputs:
      geneset         – either a list of HGNC symbols or a whitespace-delimited string
      ndex_id         – NDEx network UUID
      prompt_template – string containing placeholders {geneset} and {knowledge_graph}
      ndex_client     – an instantiated ndex2.client.Ndex2 object
      llm_query_fn    – function that takes a single string prompt and returns the LLM’s response
      search_depth    – how many hops out to pull (default=1)

    Returns:
      The raw response string from the LLM.
    """
    # Normalize gene list to Python list
    if isinstance(geneset, str):
        gene_list = geneset.split()
    else:
        gene_list = geneset

    # Build the semicolon-delimited search string for NDEx
    search_string = ";".join(gene_list)

    # 1) Fetch the neighborhood as raw CX JSON (the format create_nice_cx_from_raw_cx expects)
    raw_cx = ndex_client.get_neighborhood(
        ndex_id,
        search_string=search_string,
        search_depth=search_depth
    )

    # 2) Wrap in the “nice” CX helper
    nice_net = create_nice_cx_from_raw_cx(raw_cx)

    # 3) Extract BEL expressions from every edge, skipping edges that have none
    #    (a None value would make "\n".join raise a TypeError)
    bel_lines = []
    for edge_id, edge_obj in nice_net.get_edges():
        bel_stmt = nice_net.get_edge_attribute_value(edge_obj, "bel_expression")
        if bel_stmt:
            bel_lines.append(str(bel_stmt))
    knowledge_graph = "\n".join(bel_lines)

    # 4) Fill in the prompt template
    formatted_prompt = prompt_template.format(
        geneset=" ".join(gene_list),
        knowledge_graph=knowledge_graph
    )

    # 5) Call the LLM and return its response
    return llm_query_fn(formatted_prompt)


# ─── Example Usage ─────────────────────────────────────────────────────────────

# response = graph_rag_query(
#     geneset=["SIRT1", "PARP1", "TP53"],
#     ndex_id=BASE_KG_UUID,
#     prompt_template=PROMPT_TEMPLATE,
#     ndex_client=ndex_client,
#     llm_query_fn=query_llm
# )
# print(response)
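
Step 3 of `graph_rag_query` joins one BEL expression per edge into the prompt context. If a neighborhood contains repeated statements or edges with no `bel_expression`, the context can be cleaned before it is sent to the LLM. The sketch below is an assumption, not part of the notebook: `build_bel_context` is a hypothetical helper that works on a plain list of strings so it can run without an NDEx connection, and the BEL statements in the example are illustrative only, not taken from the network.

```python
from typing import Iterable, Optional


def build_bel_context(statements: Iterable[Optional[str]]) -> str:
    """Join BEL statements into one newline-delimited block.

    Drops missing or empty statements and removes exact duplicates
    while preserving first-seen order.
    """
    seen = set()
    unique = []
    for stmt in statements:
        if not stmt:
            continue  # edge had no bel_expression attribute
        stmt = stmt.strip()
        if stmt and stmt not in seen:
            seen.add(stmt)
            unique.append(stmt)
    return "\n".join(unique)


# Illustrative input only; real statements come from the NDEx edges.
example_statements = [
    "p(HGNC:PARP1) directlyDecreases act(p(HGNC:TP53))",
    None,
    "p(HGNC:PARP1) directlyDecreases act(p(HGNC:TP53))",
    "p(HGNC:TP53) increases r(HGNC:CDKN1A)",
]
print(build_bel_context(example_statements))
```

Deduplicating keeps the prompt shorter, which matters because the whole neighborhood is pasted into a single user message.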

In [10]:
graphrag_res = graph_rag_query(
    geneset=["SIRT1", "NAMPT", "PARP1", "TP53", "BRCA1", "CDK2"],
    ndex_id=BASE_KG_UUID,
    prompt_template=PROMPT_TEMPLATE,
    ndex_client=ndex_client,
    llm_query_fn=query_llm
)

print(graphrag_res)

### LLM response using only the knowledge graph


The metabolism-related gene/protein set provided includes SIRT1, NAMPT, PARP1, TP53, BRCA1, and CDK2. According to the knowledge graph, TP53 plays a central role in the DNA damage response by influencing various cellular processes. TP53 increases the transcription of CDKN1A, PCNA, and GADD45A, which are important for cell cycle regulation and DNA repair. The activity of TP53 is modulated by phosphorylation at different serine residues, which influences its ability to form complexes and bind DNA. BRCA1 enhances the activity of TP53, promoting its role in DNA repair. PARP1, another protein in the set, directly decreases the activity of TP53, indicating a potential regulatory interaction between metabolism and the DNA damage response through the modulation of TP53 activity. CDK2, while not directly linked to TP53 in the knowledge graph, influences the phosphorylation of other proteins, which may indirectly affect DNA repair processes. Overall, the interplay between these proteins suggests that metabolic factors can impact the DNA damage response through modulation of TP53 activity, which is a key regulator of cell cycle arrest and apoptosis in response to DNA damage.

### LLM response using knowledge graph and LLM knowledge

The interplay between metabolism and DNA damage response (DDR) involves a complex network of genes and proteins, where metabolic processes can influence DDR pathways. From the knowledge graph provided, we can infer several interactions involving the genes and proteins relevant to metabolism and DDR. 

SIRT1 and NAMPT, although not directly mentioned in the knowledge graph, are known to be involved in metabolic processes and can influence DDR through their roles in NAD+ metabolism and deacetylation activities. PARP1, a critical protein in DNA repair, directly decreases the activity of TP53, a central player in DDR. TP53, known as a tumor suppressor protein, is pivotal in responding to DNA damage by inducing cell cycle arrest, apoptosis, and DNA repair pathways. The modulation of TP53 activity by PARP1 suggests a link between metabolic sensing and DDR, as PARP1 activity is NAD+-dependent and thus, connected to cellular metabolic status.

BRCA1, another key gene in DDR, is increased by TP53 activity, indicating a pathway where TP53 can stimulate DNA repair mechanisms. CDK2 is involved in cell cycle regulation and influences phosphorylation states of proteins like RB1 and E2F, which are integral to cell cycle progression and can be tied to DDR regulation. The phosphorylation and acetylation of TP53 at various sites further modulate its activity, with interactions involving proteins such as MDM2, EP300, and p14_3_3, highlighting the complex regulation of TP53 in response to DNA damage and potential metabolic cues.

Overall, metabolism affects DNA damage response by influencing key players like TP53 and PARP1, where metabolic state can modulate DDR pathways via NAD+-dependent enzymes and signaling cascades that regulate cell cycle arrest and repair mechanisms.
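
The response above notes that SIRT1 and NAMPT are not directly mentioned in the knowledge graph. A quick way to check this kind of gap before querying the LLM is to test which query genes appear in the retrieved BEL text. This is a minimal sketch under stated assumptions: `genes_missing_from_context` is a hypothetical helper, it matches gene symbols as whole tokens only, and the BEL text in the example is illustrative rather than the actual network content.

```python
import re
from typing import List


def genes_missing_from_context(genes: List[str], bel_text: str) -> List[str]:
    """Return the genes that never occur as a whole token in bel_text.

    Word boundaries stop 'CDK2' from matching inside a longer symbol
    such as 'CDK2AP1'.
    """
    missing = []
    for gene in genes:
        pattern = r"\b" + re.escape(gene) + r"\b"
        if not re.search(pattern, bel_text):
            missing.append(gene)
    return missing


# Illustrative context only; in the notebook this would be the
# newline-joined bel_expression values from the neighborhood.
example_context = "\n".join([
    "p(HGNC:PARP1) directlyDecreases act(p(HGNC:TP53))",
    "p(HGNC:BRCA1) increases act(p(HGNC:TP53))",
    "p(HGNC:CDK2) increases p(HGNC:RB1, pmod(Ph))",
])
query_genes = ["SIRT1", "NAMPT", "PARP1", "TP53", "BRCA1", "CDK2"]
print(genes_missing_from_context(query_genes, example_context))
```

Genes reported as missing are ones the LLM can only discuss from its own knowledge, which is useful context when comparing the two prompt variants.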

## Demonstrate GraphRAG on Merged Network of Two Papers

In [9]:
# Download network of the second paper: pmid24360018
NEW_KG_UUID = "57becd7b-5680-11f0-a218-005056ae3c32"   
new_cx2 = ndex_client.get_network_as_cx2_stream(NEW_KG_UUID).json()
with open("new_kg.cx2", "w") as f: json.dump(new_cx2, f)

In [12]:
# Merge base and new networks
from ndex2.cx2 import RawCX2NetworkFactory
from textToKnowledgeGraph.convert_to_cx2 import add_style_to_network

# Creating an instance of RawCX2NetworkFactory
cx2_factory = RawCX2NetworkFactory()
base_net = cx2_factory.get_cx2network(base_cx2)
new_net  = cx2_factory.get_cx2network(new_cx2)

def merge_cx2(cx2_graphs):
    """Merge CX2 networks, treating nodes that share a name as the same node."""
    merged_graph = CX2Network()
    node_map = {}  # Maps node name to node ID in the merged graph

    # First, merge all nodes
    for graph in cx2_graphs:
        for node_id, node in graph.get_nodes().items():
            # The node name alone defines uniqueness; the first occurrence's attributes are kept
            node_data = node["v"]
            node_name = node_data.get('name', '')

            if node_name not in node_map:
                # Create the node in the merged graph and remember its new ID
                new_node_id = merged_graph.add_node(attributes=node_data)
                node_map[node_name] = new_node_id
    
    # Then, merge all edges
    for graph in cx2_graphs:
        for edge_id, edge_data in graph.get_edges().items():  
            # Get source and target directly from edge data
            source_id = edge_data.get('s')  
            target_id = edge_data.get('t') 
            
            # Get source and target nodes
            source_node = graph.get_node(source_id)
            source_name = source_node["v"]["name"]
            target_node = graph.get_node(target_id)
            target_name = target_node["v"]["name"]
            
            merged_source = node_map[source_name]
            merged_target = node_map[target_name]
            
            # Create edge using add_edge() as documented
            merged_graph.add_edge(source=merged_source, 
                                target=merged_target, 
                                attributes=edge_data["v"])
    
    return merged_graph


cx2_graphs = [base_net, new_net]
merged_network = merge_cx2(cx2_graphs)

# Apply style to the merged network
add_style_to_network(
    cx2_network=merged_network,
    style_path="/Users/favourjames/Downloads/llm-text-to-knowledge-graph/textToKnowledgeGraph/cx_style.json"   
)

merged_network.set_name("Merged Network of Base and New KGs")

# Upload the merged network to NDEx
merged_uuid = ndex_client.save_new_cx2_network(merged_network.to_cx2())

print("Merged network UUID:", merged_uuid)

INFO: [2025-07-01 15:02:14] textToKnowledgeGraph.convert_to_cx2 - Setting visual style properties


Merged network UUID: https://www.ndexbio.org/v3/networks/ff2deee1-5683-11f0-a218-005056ae3c32
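
The merge above keys nodes on their `name` attribute and then re-points every edge at the merged node IDs. The same idea can be sketched with plain dictionaries, which makes the behavior easy to inspect without NDEx. Everything here is an assumption for illustration: `merge_by_name` is a hypothetical function, the toy graph layout (`"nodes"` as an ID-to-attributes dict, `"edges"` as `(source, target, attributes)` tuples) is not the CX2 format, and the example relations are toy data.

```python
def merge_by_name(graphs):
    """Merge toy graphs, treating nodes with the same name as one node."""
    merged_nodes = {}   # merged node ID -> attributes
    name_to_id = {}     # node name -> merged node ID
    merged_edges = []   # (merged source ID, merged target ID, attributes)

    # First pass: add each distinct node name once (first occurrence wins).
    for graph in graphs:
        for attrs in graph["nodes"].values():
            name = attrs["name"]
            if name not in name_to_id:
                new_id = len(merged_nodes)
                name_to_id[name] = new_id
                merged_nodes[new_id] = dict(attrs)

    # Second pass: translate each edge's endpoints via the node names.
    for graph in graphs:
        for source, target, attrs in graph["edges"]:
            merged_source = name_to_id[graph["nodes"][source]["name"]]
            merged_target = name_to_id[graph["nodes"][target]["name"]]
            merged_edges.append((merged_source, merged_target, dict(attrs)))

    return merged_nodes, merged_edges


# Toy data: both graphs contain a TP53 node under different local IDs.
base = {
    "nodes": {0: {"name": "TP53"}, 1: {"name": "PARP1"}},
    "edges": [(1, 0, {"rel": "directlyDecreases"})],
}
new = {
    "nodes": {0: {"name": "SIRT1"}, 1: {"name": "TP53"}},
    "edges": [(0, 1, {"rel": "directlyDecreases"})],
}
nodes, edges = merge_by_name([base, new])
print(len(nodes), len(edges))  # 3 nodes (TP53 is shared), 2 edges
```

As in `merge_cx2`, edges are copied as-is, so a statement present in both input graphs would appear twice in the merged result.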


In [13]:
# GraphRAG query using the merged network as context
Merged_UUID = "ff2deee1-5683-11f0-a218-005056ae3c32"

merged_graphrag_res = graph_rag_query(
    geneset=["SIRT1", "NAMPT", "PARP1", "TP53", "BRCA1", "CDK2"],
    ndex_id=Merged_UUID,
    prompt_template=PROMPT_TEMPLATE,
    ndex_client=ndex_client,
    llm_query_fn=query_llm
)

print(merged_graphrag_res)

### LLM response using only the knowledge graph for merged network

The metabolism of NAD(+) plays a crucial role in the regulation of the DNA damage response, with SIRT1 and PARP1 being central players. SIRT1 and PARP1 both require NAD(+) for their activity and regulate each other's functions, which influences the DNA damage response. SIRT1, through its deacetylase activity, increases the DNA damage response and double-strand break repair via nonhomologous end joining. It also directly decreases the activity of PARP1, which itself is involved in enhancing DNA repair and ADP-ribosylation processes. PARP1 reduces NAD(+) levels by converting it into ADP-ribose, while SIRT1 activity is negatively correlated with NAD(+) levels and PARP1 activity. TP53, another key gene in the DNA damage response, is modulated by both SIRT1 and PARP1, with SIRT1 directly decreasing TP53 activity, while PARP1 increases TP53 modification through ADP-ribosylation. Metabolically, NAMPT is a rate-limiting enzyme for NAD(+) production, linking energy status with DNA repair processes. Thus, NAD(+) metabolism and the interplay between SIRT1 and PARP1 are integral to managing DNA damage response, with ramifications on cellular processes like apoptosis, cell cycle arrest, and DNA repair mechanisms.

### LLM response using knowledge graph and LLM knowledge for merged network

The interplay between metabolism and the DNA damage response is significantly influenced by the regulation of key proteins, including SIRT1, PARP1, and TP53, which are interconnected through their dependence on NAD+ levels. SIRT1, a protein deacetylase, and PARP1, a poly(ADP-ribose) polymerase, both utilize NAD+ as a substrate, which positions NAD+ as a crucial intermediary in modulating their activities. SIRT1 is known to regulate the DNA damage response by enhancing double-strand break repair via nonhomologous end joining and by modulating the activities of various transcription factors and DNA repair proteins. It also directly decreases the activity of TP53, a pivotal tumor suppressor and regulator of DNA damage response, influencing processes like cell cycle arrest and apoptosis. In contrast, PARP1, which facilitates DNA repair and increases poly(ADP-ribosyl)ation, can directly decrease NAD+ levels, thereby influencing SIRT1 activity due to the competition for NAD+. This dynamic relationship is further complicated by the regulatory feedback loops involving NAMPT, the rate-limiting enzyme in NAD+ biosynthesis, which is influenced by metabolic signals and oncogenic factors such as MYC. Additionally, TP53 can indirectly affect NAD+ metabolism by regulating microRNAs that decrease SIRT1 activity, illustrating a complex network where metabolic states can alter the cellular response to DNA damage. This intricate balance ensures that metabolic cues are integrated into the cellular DNA damage response, influencing cell fate decisions in response to genotoxic stress.

In [None]:
response = query_llm("How does metabolism affect dna damage response?")
print(response)

Metabolism and the DNA damage response (DDR) are intricately connected, and their interaction plays a crucial role in maintaining genomic stability and overall cellular health. Here are several ways in which metabolism affects the DNA damage response:

1. **Energy Supply**: The DDR is an energy-intensive process, requiring ATP for various repair mechanisms. Metabolic processes, particularly those in the mitochondria, provide the necessary energy to fuel these repair activities. A well-functioning metabolism ensures an adequate supply of ATP, facilitating efficient DNA repair.

2. **Nucleotide Synthesis**: DNA repair often involves the synthesis of new nucleotides to replace damaged ones. Metabolism, especially pathways like the pentose phosphate pathway, provides the precursors for nucleotide biosynthesis. Disruptions in metabolism can lead to shortages in these building blocks, impairing the cell's ability to repair DNA.

3. **Redox Balance**: Metabolism is a major source of reactive oxygen species (ROS), which can cause oxidative DNA damage. Cells must maintain a balance between ROS production and antioxidant defenses. An imbalance, often due to metabolic dysregulation, can increase oxidative stress and overwhelm the DDR, leading to accumulated damage.

4. **Post-Translational Modifications**: Metabolic intermediates can influence post-translational modifications of DDR proteins. For example, acetyl-CoA from metabolism is crucial for acetylation reactions, which can modify the activity of proteins involved in DNA repair. Similarly, S-adenosylmethionine is a methyl donor for methylation reactions, influencing DDR protein function.

5. **Sensing and Signaling**: Metabolic sensors, such as AMP-activated protein kinase (AMPK) and NAD+-dependent enzymes like sirtuins, can modulate the DDR. These sensors respond to changes in cellular energy status and can activate pathways that influence DNA repair processes.

6. **Cell Cycle Regulation**: Metabolic states can influence cell cycle progression, which is tightly linked to DDR. For example, during nutrient scarcity, cells may delay cell cycle progression to provide more time for DNA repair before division. Metabolic cues thus help coordinate the DDR with cell cycle checkpoints.

7. **Epigenetic Regulation**: Metabolites like acetyl-CoA and NAD+ are involved in the modification of histones and other proteins that affect chromatin structure. These modifications can influence the accessibility of damaged DNA to repair machinery, affecting the efficiency of the DDR.

In summary, metabolism provides the energy and materials needed for DNA repair, regulates the activity of DDR proteins through post-translational modifications, and influences cellular responses to DNA damage through signaling pathways. Disruptions in metabolic processes can therefore have significant impacts on the efficacy of the DNA damage response and overall genomic stability.