# GraphRAG Implementation with LlamaIndex - V2

[GraphRAG (Graphs + Retrieval Augmented Generation)](https://www.microsoft.com/en-us/research/project/graphrag/) combines the strengths of Retrieval Augmented Generation (RAG) and Query-Focused Summarization (QFS) to handle complex queries over large text datasets. RAG excels at fetching precise information but struggles with broader queries that require thematic understanding. QFS addresses that challenge but does not scale well. GraphRAG integrates the two approaches to offer responsive and thorough querying across extensive, diverse text corpora.

This notebook provides guidance on constructing the GraphRAG pipeline using the LlamaIndex PropertyGraph abstractions, with Neo4j as the graph store.

This notebook updates the GraphRAG pipeline to v2. If you haven’t looked at v1 yet, you can find it [here](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/cookbooks/GraphRAG_v1.ipynb). The updates to the existing implementation are:

1. Integration with the Neo4j graph database.
2. Embedding-based retrieval.



## Installation

`graspologic` provides `hierarchical_leiden`, which is used to build communities.

In [52]:
# !pip install llama-index llama-index-graph-stores-neo4j graspologic numpy==1.24.4 scipy==1.12.0 future

## Load Data

We will use a sample of arXiv paper metadata, read from a local JSON Lines file (`datasets/arxiv_cs_metadata.json`). The news article dataset retrieved from Diffbot, which Tomaz has made available on GitHub, is kept in the cell below as a commented-out alternative.

For ease of experimentation, we load only the first 5 records and use the `title` and `abstract` of each paper.

In [53]:
import pandas as pd
from llama_index.core import Document

json_path = "datasets/arxiv_cs_metadata.json"
nrows = 5
papers = pd.read_json(json_path, lines=True, nrows=nrows)
database = "arxivcs-demo"

# papers = pd.read_csv(
#     "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
# )[:50]

papers.head()

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,801.0341,Michael Chertkov,Michael Chertkov (Los Alamos),Exactness of Belief Propagation for Some Graph...,"12 pages, 1 figure, submitted to JSTAT",J. Stat. Mech. (2008) P10016,10.1088/1742-5468/2008/10/P10016,LANL LA-UR-07-8441,cond-mat.stat-mech cond-mat.other cs.AI cs.IT ...,http://arxiv.org/licenses/nonexclusive-distrib...,It is well known that an arbitrary graphical m...,"[{'version': 'v1', 'created': 'Wed, 2 Jan 2008...",2009-11-13,"[[Chertkov, Michael, , Los Alamos]]"
1,803.4355,Marko A. Rodriguez,Marko A. Rodriguez,Grammar-Based Random Walkers in Semantic Networks,First draft of manuscript originally written i...,"Rodriguez, M.A., ""Grammar-Based Random Walkers...",10.1016/j.knosys.2008.03.030,LA-UR-06-7791,cs.AI cs.DS,http://creativecommons.org/licenses/publicdomain/,Semantic networks qualify the meaning of an ed...,"[{'version': 'v1', 'created': 'Mon, 31 Mar 200...",2008-09-11,"[[Rodriguez, Marko A., ]]"
2,810.2434,Edward Rosten,"Edward Rosten, Reid Porter, Tom Drummond",Faster and better: a machine learning approach...,"35 pages, 11 figures","IEEE Trans. PAMI, 32 (2010), 105--119",10.1109/TPAMI.2008.275,07-3912,cs.CV cs.LG,http://arxiv.org/licenses/nonexclusive-distrib...,The repeatability and efficiency of a corner d...,"[{'version': 'v1', 'created': 'Tue, 14 Oct 200...",2010-07-09,"[[Rosten, Edward, ], [Porter, Reid, ], [Drummo..."
3,812.4446,Peter Turney,Peter D. Turney (National Research Council of ...,The Latent Relation Mapping Engine: Algorithm ...,related work available at http://purl.org/pete...,"Journal of Artificial Intelligence Research, (...",10.1613/jair.2693,NRC-50738,cs.CL cs.AI cs.LG,http://arxiv.org/licenses/nonexclusive-distrib...,Many AI researchers and cognitive scientists h...,"[{'version': 'v1', 'created': 'Tue, 23 Dec 200...",2020-08-20,"[[Turney, Peter D., , National Research Counci..."
4,901.3574,Christoph Benzmueller,Christoph Benzmueller,Automating Access Control Logics in Simple Typ...,ii + 20 pages,"SEKI Report SR-2008-01 (ISSN 1437-4447), Saarl...",10.1007/978-3-642-01244-0_34,SEKI Report SR-2008-01,cs.LO cs.AI,http://arxiv.org/licenses/nonexclusive-distrib...,Garg and Abadi recently proved that prominent ...,"[{'version': 'v1', 'created': 'Fri, 23 Jan 200...",2015-05-13,"[[Benzmueller, Christoph, ]]"


Prepare documents as required by LlamaIndex

In [54]:
documents = [
    Document(text=f"{row['title']}: {row['abstract']}")
    for _, row in papers.iterrows()
]

## Setup API Key and LLM

In [55]:
import os


# os.environ["OPENAI_API_KEY"] = "sk-.."

# from llama_index.llms.openai import OpenAI
from llama_index.llms.ollama import Ollama
llm = Ollama(model="qwen2.5",  request_timeout=20000)

## GraphRAGExtractor

The GraphRAGExtractor class is designed to extract triples (subject-relation-object) from text and enrich them by adding descriptions for entities and relationships to their properties using an LLM.

This functionality is similar to that of the `SimpleLLMPathExtractor`, but adds handling for entity and relationship descriptions. For guidance on implementation, you may look at similar existing [extractors](https://docs.llamaindex.ai/en/latest/examples/property_graph/Dynamic_KG_Extraction/?h=comparing).

Here's a breakdown of its functionality:

**Key Components:**

1. `llm:` The language model used for extraction.
2. `extract_prompt:` A prompt template used to guide the LLM in extracting information.
3. `parse_fn:` A function to parse the LLM's output into structured data.
4. `max_paths_per_chunk:` Limits the number of triples extracted per text chunk.
5. `num_workers:` For parallel processing of multiple text nodes.


**Main Methods:**

1. `__call__:` The entry point for processing a list of text nodes.
2. `acall:` An asynchronous version of `__call__` for improved performance.
3. `_aextract:` The core method that processes each individual node.


**Extraction Process:**

For each input node (chunk of text):

1. It sends the text to the LLM along with the extraction prompt.
2. The LLM's response is parsed to extract entities, relationships, and descriptions for both.
3. Entities are converted into `EntityNode` objects. The entity description is stored in the entity's properties.
4. Relationships are converted into `Relation` objects. The relationship description is stored in the relation's properties.
5. These are added to the node's metadata under KG_NODES_KEY and KG_RELATIONS_KEY.

**NOTE:** In the current implementation, we are using only relationship descriptions. In the next implementation, we will utilize entity descriptions during the retrieval stage.
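
Steps 3 to 5 above can be sketched without LlamaIndex. This is a minimal illustration, not the extractor's actual code: `Entity` and `Rel` are hypothetical stand-ins for `EntityNode` and `Relation`, the `"nodes"` and `"relations"` keys stand in for `KG_NODES_KEY` and `KG_RELATIONS_KEY`, and the sample entities and descriptions are illustrative. The sketch copies the chunk metadata for each item so that every description is stored independently.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:  # hypothetical stand-in for EntityNode
    name: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Rel:  # hypothetical stand-in for Relation
    label: str
    source_id: str
    target_id: str
    properties: dict = field(default_factory=dict)


def attach_graph_data(metadata, entities, relationships):
    """Return a copy of `metadata` extended with entity and relation records."""
    base = dict(metadata)
    nodes = [
        Entity(name, etype, {**base, "entity_description": desc})
        for name, etype, desc in entities
    ]
    relations = [
        Rel(rel, subj, obj, {**base, "relationship_description": desc})
        for subj, obj, rel, desc in relationships
    ]
    return {**base, "nodes": nodes, "relations": relations}


chunk_metadata = {"source": "paper-1"}  # illustrative metadata
entities = [
    ("LEO-II", "Tool/Software", "A higher-order automated theorem prover."),
    ("Simple Type Theory", "Logic", "The logic that LEO-II reasons in."),
]
relationships = [
    (
        "LEO-II",
        "Simple Type Theory",
        "reasons in",
        "LEO-II automates proofs in simple type theory.",
    ),
]

enriched = attach_graph_data(chunk_metadata, entities, relationships)
for node in enriched["nodes"]:
    print(node.name, "->", node.properties["entity_description"])
```

Note the tuple order for relationships, `(source, target, relation, description)`, which follows the order requested in the extraction prompt.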

In [56]:
import asyncio
import nest_asyncio

nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field


class GraphRAGExtractor(TransformComponent):
    """Extract triples from text.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = llm,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        try:
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            print(f"llm_response: {llm_response}")
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        entity_metadata = node.metadata.copy()
        for entity, entity_type, description in entities:
            entity_metadata["entity_description"] = description
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=entity_metadata
            )
            existing_nodes.append(entity_node)

        relation_metadata = node.metadata.copy()
        for triple in entities_relationship:
            subj, obj, rel, description = triple
            relation_metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj,
                target_id=obj,
                properties=relation_metadata,
            )

            existing_relations.append(rel_node)

        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )

## GraphRAGStore

The `GraphRAGStore` class extends the `Neo4jPropertyGraphStore` class to implement the GraphRAG pipeline. Here's a breakdown of its key components and functions:


The class uses a community detection algorithm to group related nodes in the graph, then generates a summary for each community using an LLM.


**Key Methods:**

`build_communities():`

1. Converts the internal graph representation to a NetworkX graph.

2. Applies the hierarchical Leiden algorithm for community detection.

3. Collects detailed information about each community.

4. Generates summaries for each community.

`generate_community_summary(text):`

1. Uses LLM to generate a summary of the relationships in a community.
2. The summary includes entity names and a synthesis of relationship descriptions.

`_create_nx_graph():`

1. Converts the internal graph representation to a NetworkX graph for community detection.

`_collect_community_info(nx_graph, clusters):`

1. Collects detailed information about each node based on its community.
2. Creates a string representation of each relationship within a community.

`_summarize_communities(community_info):`

1. Generates and stores summaries for each community using LLM.

`get_community_summaries():`

1. Returns the community summaries by building them if not already done.
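
To make `_collect_community_info` concrete, here is a small sketch on a toy graph. It is an illustration under assumptions: the graph is a plain edge dictionary rather than a NetworkX graph, and the cluster assignments are hand-written `(node, cluster)` pairs standing in for the output of `hierarchical_leiden`. The `Assignment` tuple, the helper names, and all sample data are hypothetical.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for one item of the clustering output.
Assignment = namedtuple("Assignment", ["node", "cluster"])

# Toy undirected graph: edges[(a, b)] holds the relation label and description.
edges = {
    ("BP", "Graphical Model"): ("solves", "BP performs inference on the model."),
    ("BP", "Bethe Free Energy"): ("minimizes", "BP converges to a Bethe minimum."),
}


def neighbors(node):
    """Yield (neighbor, relation, description) for every edge touching `node`."""
    for (a, b), (rel, desc) in edges.items():
        if node == a:
            yield b, rel, desc
        elif node == b:
            yield a, rel, desc


def collect_community_info(assignments):
    entity_info = defaultdict(set)      # entity -> cluster ids
    community_info = defaultdict(list)  # cluster id -> relationship strings
    for item in assignments:
        entity_info[item.node].add(item.cluster)
        for neighbor, rel, desc in neighbors(item.node):
            community_info[item.cluster].append(
                f"{item.node} -> {neighbor} -> {rel} -> {desc}"
            )
    return {k: sorted(v) for k, v in entity_info.items()}, dict(community_info)


assignments = [
    Assignment("BP", 0),
    Assignment("Graphical Model", 0),
    Assignment("Bethe Free Energy", 1),
    Assignment("BP", 1),  # an entity may belong to more than one cluster
]
entity_info, community_info = collect_community_info(assignments)
print(entity_info["BP"])
print(len(community_info[0]), len(community_info[1]))
```

As in the method described above, the loop walks every neighbor of every clustered node, so an edge is recorded once per endpoint in the cluster, and edges to neighbors outside the cluster are included too.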

In [57]:
import re
import networkx as nx
from graspologic.partition import hierarchical_leiden
from collections import defaultdict

from llama_index.core.llms import ChatMessage
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore


class GraphRAGStore(Neo4jPropertyGraphStore):
    community_summary = {}
    entity_info = None
    max_cluster_size = 5
    llm = llm

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1 -> entity2 -> relation -> relationship_description. "
                    "Your task is to create a summary of these relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = self.llm.chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        self.entity_info, community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        triplets = self.get_triplets()
        for entity1, relation, entity2 in triplets:
            # Fall back to a placeholder when the relation has no description.
            relationship_desc = relation.properties.get(
                "relationship_description", "relationship_description_dummy"
            )
            nx_graph.add_node(entity1.name)
            nx_graph.add_node(entity2.name)
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relationship_desc,
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """
        Collect information for each node based on their community,
        allowing entities to belong to multiple clusters.
        """
        entity_info = defaultdict(set)
        community_info = defaultdict(list)

        for item in clusters:
            node = item.node
            cluster_id = item.cluster

            # Update entity_info
            entity_info[node].add(cluster_id)

            for neighbor in nx_graph.neighbors(node):
                edge_data = nx_graph.get_edge_data(node, neighbor)
                if edge_data:
                    detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                    community_info[cluster_id].append(detail)

        # Convert sets to lists for easier serialization if needed
        entity_info = {k: list(v) for k, v in entity_info.items()}

        return dict(entity_info), dict(community_info)

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary
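
One Python detail is worth keeping in mind with the class above: `community_summary = {}` is declared at class level, and `_summarize_communities` mutates it in place, so the dictionary is shared by every instance unless an instance rebinds the attribute. The sketch below shows the effect on a hypothetical `Store` class (not the real graph store, and assuming a plain Python class rather than a Pydantic model) together with one way to give an instance its own cache.

```python
class Store:  # hypothetical minimal class, not the real graph store
    community_summary = {}  # class-level dict, shared across instances

    def add_summary(self, community_id, summary):
        # Mutates the shared class-level dict in place.
        self.community_summary[community_id] = summary

    def reset_summaries(self):
        # Rebinding on the instance gives this object its own dict.
        self.community_summary = {}


first = Store()
second = Store()
first.add_summary(0, "summary of community 0")
print(second.community_summary)  # the second instance sees the same entry

second.reset_summaries()
second.add_summary(1, "summary of community 1")
print(first.community_summary)  # no longer affected by the second instance
```

In this notebook that would likely matter only when a second `GraphRAGStore` is created in the same kernel, for example for a different database: `get_community_summaries()` could then return the earlier summaries instead of rebuilding them.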

## GraphRAGQueryEngine

The GraphRAGQueryEngine class is a custom query engine designed to process queries using the GraphRAG approach. It leverages the community summaries generated by the GraphRAGStore to answer user queries. Here's a breakdown of its functionality:

**Main Components:**

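1. `graph_store:` An instance of `GraphRAGStore`, which contains the community summaries.
2. `index:` The `PropertyGraphIndex` used to retrieve entities related to the query.
3. `llm:` A language model (LLM) used for generating and aggregating answers.
4. `similarity_top_k:` The number of nodes to retrieve from the index for each query.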


**Key Methods:**

`custom_query(query_str: str)`

1. This is the main entry point for processing a query. It retrieves the entities related to the query, looks up the communities they belong to, generates an answer from each of those community summaries, and then aggregates these answers into a final response.

`generate_answer_from_summary(community_summary, query):`

1. Generates an answer for the query based on a single community summary.
2. Uses the LLM to interpret the community summary in the context of the query.

`aggregate_answers(community_answers):`

1. Combines individual answers from different communities into a coherent final response.
2. Uses the LLM to synthesize multiple perspectives into a single, concise answer.


**Query Processing Flow:**

1. Retrieve the entities related to the query from the index.
2. Look up the communities those entities belong to.
3. For each of those communities, generate a specific answer to the query from its summary.
4. Aggregate all community-specific answers into a final, coherent response.
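
The flow above can be sketched in plain Python. This is a simplified illustration rather than the query engine itself: `fake_llm` is a hypothetical stand-in that records its prompts instead of calling a model, the prompt strings are shortened, and the entities, summaries, and query are made-up sample data.

```python
def retrieve_entity_communities(entity_info, entities):
    """Collect the ids of every community that any of the entities belongs to."""
    community_ids = set()
    for entity in entities:
        community_ids.update(entity_info.get(entity, []))
    return community_ids


def answer_query(query, entities, entity_info, summaries, ask_llm):
    """Answer from the relevant community summaries, then aggregate."""
    community_ids = retrieve_entity_communities(entity_info, entities)
    partial_answers = [
        ask_llm(f"Summary: {summary}\nQuery: {query}")
        for community_id, summary in summaries.items()
        if community_id in community_ids
    ]
    return ask_llm("Combine: " + " | ".join(partial_answers))


calls = []


def fake_llm(prompt):
    """Hypothetical stand-in for an LLM call; records the prompt it received."""
    calls.append(prompt)
    return f"answer#{len(calls)}"


entity_info = {"BP": [0, 1], "LEO-II": [2]}
summaries = {
    0: "BP on graphical models",
    1: "BP and the Bethe free energy",
    2: "LEO-II and access control logics",
}

final = answer_query(
    "What is BP used for?",
    entities=["BP", "Unknown Entity"],  # entities found for the query
    entity_info=entity_info,
    summaries=summaries,
    ask_llm=fake_llm,
)
print(final)
print(len(calls))
```

Only the summaries of communities 0 and 1 reach the stub here, because `BP` belongs to those two communities and the unknown entity maps to none.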


**Example usage:**

```
query_engine = GraphRAGQueryEngine(graph_store=graph_store, llm=llm, index=index)

response = query_engine.query("query")
```

In [58]:
import re

from llama_index.core import PropertyGraphIndex
from llama_index.core.llms import LLM, ChatMessage
from llama_index.core.query_engine import CustomQueryEngine


class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    index: PropertyGraphIndex
    llm: LLM = llm
    similarity_top_k: int = 20

    def custom_query(self, query_str: str) -> str:
        """Answer a query from the summaries of the communities relevant to it."""
        # Fetch (and, if needed, build) the summaries first so that
        # graph_store.entity_info is populated before it is used below.
        community_summaries = self.graph_store.get_community_summaries()

        entities = self.get_entities(query_str, self.similarity_top_k)

        community_ids = self.retrieve_entity_communities(
            self.graph_store.entity_info, entities
        )
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for community_id, community_summary in community_summaries.items()
            if community_id in community_ids
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def get_entities(self, query_str, similarity_top_k):
        nodes_retrieved = self.index.as_retriever(
            similarity_top_k=similarity_top_k
        ).retrieve(query_str)

        entities = set()
        pattern = (
            r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"
        )

        for node in nodes_retrieved:
            matches = re.findall(
                pattern, node.text, re.MULTILINE | re.IGNORECASE
            )

            for match in matches:
                subject = match[0]
                obj = match[2]
                entities.add(subject)
                entities.add(obj)

        return list(entities)

    def retrieve_entity_communities(self, entity_info, entities):
        """
        Retrieve cluster information for given entities, allowing for multiple clusters per entity.

        Args:
        entity_info (dict): Dictionary mapping entities to their cluster IDs (list).
        entities (list): List of entity names to retrieve information for.

        Returns:
        List of community or cluster IDs to which an entity belongs.
        """
        community_ids = []

        for entity in entities:
            if entity in entity_info:
                community_ids.extend(entity_info[entity])

        return list(set(community_ids))

    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response
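
To see what the pattern in `get_entities` accepts, the sketch below runs it on a hypothetical piece of retrieved text. It assumes the node text contains one `subject -> relation -> object` triplet per line, which is what the pattern expects; the sample text itself is made up.

```python
import re

# Same pattern as in get_entities above.
pattern = r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"

# Hypothetical retrieved text, one triplet per line.
text = (
    "Semantic Networks -> contains -> Vertices\n"
    "LEO-II -> implements -> Theorem Prover"
)

entities = set()
for subject, _relation, obj in re.findall(
    pattern, text, re.MULTILINE | re.IGNORECASE
):
    entities.update((subject, obj))

print(sorted(entities))  # ['Semantic Networks', 'Vertices']
```

The second line does not match: `\w` excludes the hyphen in `LEO-II`, so names containing punctuation are skipped when a query is mapped to communities. Because `\s` also matches newlines, multi-line text without arrows directly above a triplet can be absorbed into the subject, which is worth checking against the text your retriever actually returns.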

## Build End-to-End GraphRAG Pipeline

Now that we have defined all the necessary components, let’s construct the GraphRAG pipeline:

1. Create nodes/chunks from the text.
2. Build a PropertyGraphIndex using `GraphRAGExtractor` and `GraphRAGStore`.
3. Construct communities and generate a summary for each community using the graph built above.
4. Create a `GraphRAGQueryEngine` and begin querying.

### Create nodes/chunks from the text

In [59]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)

In [60]:
len(nodes)

5

### Build PropertyGraphIndex using `GraphRAGExtractor` and `GraphRAGStore`

In [61]:
KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"$$$$<entity_name>$$$$<entity_type>$$$$<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other

Format each relationship as ("relationship"$$$$<source_entity>$$$$<target_entity>$$$$<relation>$$$$<relationship_description>)

3. When finished, output.

-Real Data-
######################
text: {text}
######################
output:"""

In [62]:
entity_pattern = r'\("entity"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\)'
relationship_pattern = r'\("relationship"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\)'
from llama_index.core.indices.property_graph import DynamicLLMPathExtractor

# def parse_fn(response_str: str) -> Any:
#     entities = re.findall(entity_pattern, response_str)
#     relationships = re.findall(relationship_pattern, response_str)
#     print(f"response_str: {response_str}")
#     print(f"entities: {entities}")
#     print(f"relationships: {relationships}")
#     if entities == []:
#         entities = [("DummyE", "DummyE", "DummyE",)]
#     if relationships == []:
#         relationships = [("DummyR", "DummyR", "DummyR", "DummyR")]
#     return entities, relationships

def parse_fn(response_str: str) -> Any:
    # Updated patterns to match actual output format
    entity_pattern = r'\("entity"\$\$\$\$(.*?)\$\$\$\$(.*?)\$\$\$\$(.*?)\)'
    relationship_pattern = r'\("relationship"\$\$\$\$(.*?)\$\$\$\$(.*?)\$\$\$\$(.*?)\$\$\$\$(.*?)\)'
    
    # Find all matches
    entities = re.findall(entity_pattern, response_str, re.DOTALL)
    relationships = re.findall(relationship_pattern, response_str, re.DOTALL)
    
    # Clean up any whitespace
    entities = [(e1.strip(), e2.strip(), e3.strip()) for e1, e2, e3 in entities]
    relationships = [(r1.strip(), r2.strip(), r3.strip(), r4.strip()) 
                    for r1, r2, r3, r4 in relationships]
    
    # Fall back to placeholder values when nothing was parsed
    if not entities:
        entities = [("DummyE", "DummyE", "DummyE")]
    if not relationships:
        relationships = [("DummyR", "DummyR", "DummyR", "DummyR")]
    
    print(f"Found entities: {entities}")
    print(f"Found relationships: {relationships}")
    
    return entities, relationships

kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=20,
    num_workers=4,
    parse_fn=parse_fn,

)
# max_triplets_per_chunk=20,
#         num_workers=4
# kg_extractor = DynamicLLMPathExtractor(
#             llm=llm,
#             max_triplets_per_chunk=20,
#             num_workers=4,
#             allowed_entity_types=None,
#             allowed_relation_types=None,
#             allowed_relation_props=["relationship_description"],
#             allowed_entity_props=[],
#             parse_fn=parse_fn,
#             extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
# )
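
A quick way to check the parsing patterns is to run them on a few hand-written responses. The sketch below reuses the two regular expressions from `parse_fn`, wrapped in a hypothetical `parse_graph_response` helper that omits the placeholder fallback and the prints so the raw matches are visible. The sample strings are illustrative, modeled on the response formats that appear in the extraction logs.

```python
import re

ENTITY_PATTERN = r'\("entity"\$\$\$\$(.*?)\$\$\$\$(.*?)\$\$\$\$(.*?)\)'
RELATIONSHIP_PATTERN = (
    r'\("relationship"\$\$\$\$(.*?)\$\$\$\$(.*?)\$\$\$\$(.*?)\$\$\$\$(.*?)\)'
)


def parse_graph_response(response_str):
    """Parse entity and relationship tuples; no placeholder fallback here."""
    entities = [
        tuple(part.strip() for part in match)
        for match in re.findall(ENTITY_PATTERN, response_str, re.DOTALL)
    ]
    relationships = [
        tuple(part.strip() for part in match)
        for match in re.findall(RELATIONSHIP_PATTERN, response_str, re.DOTALL)
    ]
    return entities, relationships


well_formed = (
    '("entity"$$$$LEO-II$$$$Tool/Software$$$$A higher-order automated theorem prover.)\n'
    '("entity"$$$$Belief Propagation$$$$Algorithm$$$$An iterative algorithm (BP) for inference.)\n'
    '("relationship"$$$$LEO-II$$$$Simple Type Theory$$$$reasons in$$$$LEO-II automates proofs.)'
)
markdown_style = (
    "1. **entity**$$$$Vertices$$$$Node$$$$Individual points in the network.**"
)

entities, relationships = parse_graph_response(well_formed)
print(entities)
print(relationships)
print(parse_graph_response(markdown_style))
```

Two things stand out. The lazy final group stops at the first `)`, so a description that contains parentheses is cut short (here it ends at `(BP`). A response that wraps the fields in Markdown instead of the `("entity"$$$$...)` form yields no matches at all, in which case `parse_fn` substitutes its `DummyE`/`DummyR` placeholders.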

## Docker and Neo4j Setup

To launch Neo4j locally, first ensure you have Docker installed. Then, you can launch the database with the following command.

```
docker run \
    -p 7474:7474 -p 7687:7687 \
    -v $PWD/data:/data -v $PWD/plugins:/plugins \
    --name neo4j-apoc \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
    neo4j:latest
```
From here, you can open the database at http://localhost:7474/. On this page, you will be asked to sign in. Use the default username and password, `neo4j` and `neo4j`.

Once you log in for the first time, you will be asked to change the password.
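
Before creating the graph store, it can help to confirm that the container is accepting connections. The sketch below is an optional check using only the standard library; `port_is_open` is a hypothetical helper, and a reachable port only shows that something is listening on it, not that Neo4j has finished starting or that the credentials are valid.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 7687 is the Bolt port published by the docker command above.
print("Bolt port reachable:", port_is_open("localhost", 7687))
```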

In [None]:
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

# Note: used to be `Neo4jPGStore`
graph_store = GraphRAGStore(
    username="neo4j", password="password", url="bolt://localhost:7687", database=database
)

In [64]:
from llama_index.core import PropertyGraphIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# embed_model = HuggingFaceEmbedding("sentence-transformers/all-MiniLM-L6-v2")
embed_model = HuggingFaceEmbedding("avsolatorio/GIST-all-MiniLM-L6-v2")
# GIST-all-MiniLM-L6-v2 
# PropertyGraphIndex.from_documents(
#             documents,
#             property_graph_store=graph_store,
#             llm=self.llm,
#             embed_model=self.embed_model,
#             embed_kg_nodes=True,
#             kg_extractors=[self.kg_extractor],
#             show_progress=True
# )

index = PropertyGraphIndex(
    nodes=nodes,
    kg_extractors=[kg_extractor],
    property_graph_store=graph_store,
    llm=llm,
    show_progress=True,
    embed_model=embed_model,
)

Extracting paths from text:  20%|██        | 1/5 [00:49<03:16, 49.21s/it]

llm_response: ### Entities Identification

1. ("entity"$$$$Latent Relation Mapping Engine$$$$Algorithm$$$$An algorithm designed to map analogical relations between lists of words using a large corpus of raw text.")
2. ("entity"$$$$Structure Mapping Theory (SMT)$$$$Theory$$$$A theory on computational modeling of analogy-making, which involves mapping structures between two domains.")
3. ("entity"$$$$Structure Mapping Engine (SME)$$$$Engine$$$$An implementation of Structure Mapping Theory used to create analogical mappings but requires complex hand-coded representations.")
4. ("entity"$$$$Latent Relational Analysis (LRA)$$$$Analysis Technique$$$$A method for discovering semantic relations among words without the need for hand-coded representations, combined with SME in LRME.")
5. ("entity"$$$$Analogical Mapping Problems$$$$Problem Set$$$$A set of problems used to evaluate the performance of different analogy-making algorithms, consisting of ten scientific analogies and ten common metapho

Extracting paths from text:  40%|████      | 2/5 [00:50<01:03, 21.31s/it]

llm_response: ```plaintext
# Entities

("entity"$$$$Feature Detector$$$$Detector$$$$A software component designed to identify specific features within an image or video stream. This entity is discussed in the context of corner detection and its performance metrics include repeatability and efficiency.)

("entity"$$$$Harris Detector$$$$Detector$$$$An older feature detector, used as a baseline for comparison with other detectors such as SIFT. It is noted that this detector does not operate at frame rate when compared to others.)

("entity"$$$$SIFT (Scale-Invariant Feature Transform)$$$$Detector$$$$A well-known method for detecting and describing local features in images or videos. It operates slower than the machine learning-based feature detector described in the text, as evidenced by its higher processing time requirement.)

("entity"$$$$Machine Learning Approach$$$$Method$$$$A new technique discussed in the paper that utilizes machine learning to derive a feature detector capable of o

Extracting paths from text:  60%|██████    | 3/5 [02:01<01:27, 43.59s/it]

llm_response: Let's break down the text to identify entities, their types, descriptions, relationships, and relevant triplets.

### Step 1: Identifying Entities

#### Entity 1:
- **entity_name**: Graphical Model
- **entity_type**: Concept/Model
- **entity_description**: A model used in statistical inference defined on a graph. It can be either tree-like (without loops) or loopy (with loops). In the zero-temperature limit, it is considered for finding the global minimum using Belief Propagation and Linear Programming.

#### Entity 2:
- **entity_name**: Iterative Belief Propagation (BP)
- **entity_type**: Algorithm
- **entity_description**: An iterative algorithm used in solving graphical models defined on tree-like graphs. It converges to a unique minimum of the Bethe free energy functional, but for loopy graphs, it may converge to one of multiple minima or not converge at all.

#### Entity 3:
- **entity_name**: Bethe Free Energy Functional
- **entity_type**: Concept/Function
- **entity

Extracting paths from text:  80%|████████  | 4/5 [02:16<00:32, 32.52s/it]

llm_response: ### Step 1: Identify All Entities and Their Types

1. **entity**$$$$Grammar-Based Random Walkers$$$$Process$$$$A model for navigating semantic networks using a grammar-based approach.**
2. **entity**$$$$Semantic Networks$$$$Concept$$$$Collections of vertices (nodes) connected by edges, representing relationships between concepts or entities.**
3. **entity**$$$$Vertices$$$$Node$$$$Individual points in the network that represent concepts or entities.**
4. **entity**$$$$Edges$$$$Relationship$$$$Lines connecting vertices to represent relationships between them.**
5. **entity**$$$$Central Vertex$$$$Node$$$$Nodes considered important within a semantic network due to their centrality measures.**
6. **entity**$$$$Context-Based Rankings$$$$Method$$$$Ranks nodes based on user-defined contexts or criteria, often used in metrics for semantic networks.**
7. **entity**$$$$Semantic Network Metrics$$$$Framework$$$$Tools and methods for quantifying the importance of vertices in a semantic

Extracting paths from text: 100%|██████████| 5/5 [02:24<00:00, 28.98s/it]


llm_response: ### Step 1: Identify all entities

#### Entity 1:
- **entity_name**: Automating Access Control Logics
- **entity_type**: Process/Task
- **entity_description**: The process of translating and embedding access control logics into simple type theory using a theorem prover.

("entity"$$$$Automating Access Control Logics$$$$Process/Task$$$$The process of translating and embedding access control logics into simple type theory using a theorem prover.)

#### Entity 2:
- **entity_name**: LEO-II
- **entity_type**: Tool/Software
- **entity_description**: A higher-order automated theorem prover used for reasoning in and about normal multimodal and monomodal logics.

("entity"$$$$LEO-II$$$$Tool/Software$$$$A higher-order automated theorem prover used for reasoning in and about normal multimodal and monomodal logics.)

#### Entity 3:
- **entity_name**: Simple Type Theory
- **entity_type**: Formal System
- **entity_description**: A formal system that includes simple type theory, also kn

Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.94it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  3.93it/s]
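The responses above do not all follow the same format: some emit `("entity"$$$$name$$$$type$$$$description)` tuples, while others fall back to markdown bullets or bold `**entity**` markers. A strict parser would pick up only the tuple form. The following is a minimal sketch of such a parser, assuming one record per line in the tuple shape; `parse_entities` and the abbreviated sample text are illustrative and are not the notebook's own parsing function.

```python
import re

# Assumed record shape, based on the conforming responses:
# ("entity"$$$$<name>$$$$<type>$$$$<description>)
# The description is matched greedily up to the closing parenthesis at the
# end of the line, so parentheses inside a name or description are kept.
ENTITY_PATTERN = re.compile(
    r'^\("entity"\$\$\$\$(.+?)\$\$\$\$(.+?)\$\$\$\$(.+)\)$',
    re.MULTILINE,
)


def parse_entities(llm_response):
    """Return (name, type, description) tuples for each well-formed entity line."""
    return [
        tuple(part.strip() for part in match)
        for match in ENTITY_PATTERN.findall(llm_response)
    ]


# Abbreviated, illustrative sample mixing conforming and non-conforming lines.
sample_response = '''
("entity"$$$$LEO-II$$$$Tool/Software$$$$A higher-order automated theorem prover.)
- **entity_name**: Graphical Model
("entity"$$$$SIFT (Scale-Invariant Feature Transform)$$$$Detector$$$$A method for detecting local features (in images or videos).)
'''

entities = parse_entities(sample_response)
for name, entity_type, _description in entities:
    print(f"{name} [{entity_type}]")
```

Lines that drift from the tuple format, such as the markdown bullet in the sample, are skipped silently, which is one way entities can go missing from the graph when the model ignores the requested output format.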


In [65]:
index.property_graph_store.get_triplets()

# for source, relation, target in index.property_graph_store.get_triplets():
#     print(source.name, relation.label, target.name)

[[EntityNode(label='STATISTICAL_MODEL', embedding=None, properties={'id': 'Graphical Model', 'type': 'reducible LP with TUM constraints', 'triplet_source_id': 'b2f2dda2-a5e4-4cc9-ae08-a977488d1db4'}, name='Graphical Model'),
  Relation(label='DEFINED_ON', source_id='Graphical Model', target_id='arbitrary graphical model', properties={'type': 'graph', 'triplet_source_id': 'b2f2dda2-a5e4-4cc9-ae08-a977488d1db4'}),
  EntityNode(label='STATISTICAL_MODEL', embedding=None, properties={'id': 'arbitrary graphical model', 'type': 'arbitrary', 'triplet_source_id': 'b2f2dda2-a5e4-4cc9-ae08-a977488d1db4'}, name='arbitrary graphical model')],
 [EntityNode(label='STATISTICAL_MODEL', embedding=None, properties={'id': 'Graphical Model', 'type': 'reducible LP with TUM constraints', 'triplet_source_id': 'b2f2dda2-a5e4-4cc9-ae08-a977488d1db4'}, name='Graphical Model'),
  Relation(label='SOLVED_EXACTLY_AND_EFFICIENTLY', source_id='Graphical Model', target_id='tree (graph without loops)', properties={'meth

In [66]:
index.property_graph_store.get_triplets()[10][0].properties

{'id': 'Graphical Model',
 'type': 'reducible LP with TUM constraints',
 'triplet_source_id': 'b2f2dda2-a5e4-4cc9-ae08-a977488d1db4'}

In [67]:
index.property_graph_store.get_triplets()[10][1].properties

{'triplet_source_id': '5081b3d2-dff5-457d-a7c5-326b659c0def',
 'relationship_description': 'The iterative BP algorithm is used to solve graphical models defined on tree-like graphs.'}
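Indexing into individual triplets is tedious when scanning a whole graph. A minimal sketch of a formatter follows, assuming each triplet is a `[source, relation, target]` list whose items expose `name`, `label`, and `properties` as in the output above. The `SimpleNamespace` objects are stand-ins for `EntityNode` and `Relation`, `format_triplet` is a hypothetical helper, and the second sample triplet uses made-up values.

```python
from types import SimpleNamespace


def format_triplet(triplet):
    """Render one [source, relation, target] triplet as a readable line."""
    source, relation, target = triplet
    line = f"{source.name} -[{relation.label}]-> {target.name}"
    # Not every relation carries a description, so fall back to the bare edge.
    description = relation.properties.get("relationship_description")
    return f"{line}: {description}" if description else line


# Stand-in shaped like the first triplet printed above.
without_description = [
    SimpleNamespace(name="Graphical Model"),
    SimpleNamespace(label="DEFINED_ON", properties={"type": "graph"}),
    SimpleNamespace(name="arbitrary graphical model"),
]

# Made-up triplet showing the case where a description is present.
with_description = [
    SimpleNamespace(name="A"),
    SimpleNamespace(
        label="USES",
        properties={"relationship_description": "A uses B."},
    ),
    SimpleNamespace(name="B"),
]

for triplet in (without_description, with_description):
    print(format_triplet(triplet))
```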

### Build communities

This step groups the graph's entities into communities and generates a summary for each community.

In [68]:
index.property_graph_store.build_communities()
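Before a community can be summarized, the facts belonging to it have to be gathered. The following is a minimal sketch of that grouping step, assuming clustering (for example `hierarchical_leiden`) has already assigned each entity a community id. `collect_community_info`, the community ids, and the sample relationships are hypothetical; this is not the graph store's actual implementation.

```python
from collections import defaultdict


def collect_community_info(node_to_community, relationships):
    """Group relationship details by community.

    Only relationships whose two endpoints fall in the same community are
    kept, so each community's text describes its internal structure.
    """
    info = defaultdict(list)
    for source, target, label, description in relationships:
        community = node_to_community.get(source)
        if community is not None and community == node_to_community.get(target):
            info[community].append(
                f"{source} -> {target} -> {label} -> {description}"
            )
    return dict(info)


# Hypothetical community assignment (ids are arbitrary).
node_to_community = {
    "Graphical Model": 0,
    "arbitrary graphical model": 0,
    "LEO-II": 1,
    "Modal Logic S4": 1,
}

# Illustrative (source, target, label, description) tuples.
relationships = [
    ("Graphical Model", "arbitrary graphical model", "DEFINED_ON",
     "A graphical model is defined on a graph."),
    ("LEO-II", "Modal Logic S4", "AUTOMATES",
     "LEO-II can automate reasoning within modal logic S4."),
    ("Graphical Model", "LEO-II", "RELATED_TO",
     "Cross-community edge, dropped by the grouping."),
]

community_info = collect_community_info(node_to_community, relationships)
for community_id, details in community_info.items():
    print(community_id, details)
```

Each list in `community_info` could then be joined into a single prompt and passed to the LLM to produce that community's summary.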

In [72]:
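import networkx as nx
from pyvis.network import Network

# Render the community graph as an interactive HTML page.
net = Network(
    directed=True,
    select_menu=True,
    filter_menu=True,
)
net.show_buttons()  # add pyvis's configuration controls to the page
# _create_nx_graph() is a private helper of the graph store.
net.from_nx(graph_store._create_nx_graph())
net.write_html("community_graph.html")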

### Create QueryEngine

In [69]:
query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store,
    llm=llm,
    index=index,
    similarity_top_k=10,
)
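The answers in the next section are built from community summaries rather than raw text. One plausible shape for that flow is a map step that answers the query from each summary and a reduce step that merges the partial answers. The sketch below is an assumption about the general pattern, not the `GraphRAGQueryEngine` implementation; `answer_from_communities` and the stub functions are hypothetical, and the stubs replace what would normally be LLM calls.

```python
def answer_from_communities(query, community_summaries, answer_fn, aggregate_fn):
    """Map: answer the query per summary. Reduce: merge the partial answers."""
    partial_answers = []
    for summary in community_summaries.values():
        answer = answer_fn(summary, query)
        if answer:  # skip communities that have nothing relevant to say
            partial_answers.append(answer)
    return aggregate_fn(partial_answers, query)


def stub_answer(summary, query):
    # Stand-in for an LLM call: keep the summary if it shares a word with the query.
    query_words = set(query.lower().split())
    return summary if query_words & set(summary.lower().split()) else ""


def stub_aggregate(answers, query):
    # Stand-in for an LLM call that would merge partial answers into one response.
    return " ".join(answers) if answers else "No relevant community found."


# Illustrative summaries keyed by hypothetical community ids.
summaries = {
    0: "Belief propagation solves graphical models on trees.",
    1: "LEO-II automates reasoning in modal logic.",
}

on_topic = answer_from_communities(
    "How are graphical models solved?", summaries, stub_answer, stub_aggregate
)
off_topic = answer_from_communities(
    "energy sector news", summaries, stub_answer, stub_aggregate
)
print(on_topic)
print(off_topic)
```

Passing the answering and aggregation steps in as functions keeps the control flow testable without a model behind it.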

### Querying

In [70]:
response = query_engine.query(
    "What are the main topics discussed in the papers?"
)
display(Markdown(f"{response.response}"))

The main topics discussed in the papers include:

1. **Graphical Model Entities and Their Definitions:**
   - Arbitrary graphical models.
   - Trees (graphs without loops).
   - The Bethe free energy functional.

2. **Solvability and Algorithms:**
   - Solvability of graphical models at different temperatures, particularly focusing on the Zero-Temperature limit.
   - Performance of specific algorithms like the Zero-Temperature Version of BP Algorithm.
   - The g-BP algorithm as a generalized version of BP for finding global minima.

3. **Convergence and Minimization:**
   - Convergence behavior of Iterative Belief Propagation (BP) in tree-like and loopy structures.
   - Minimizing the Bethe free energy functional using iterative BP algorithms on tree-like graphs.

4. **Interrelationships Between Algorithms and Functionalities:**
   - Relationship between Belief Propagation and its Zero-Temperature Version for solving specific graphical models.
   - The role of g-BP algorithm in finding global minima, especially under zero-temperature limits.

5. **Efficiency and Solvability:**
   - Efficiency of Iterative BP for tree-like structures.
   - Challenges with convergence in loopy graphs.

6. **Convergence Issues and Solutions:**
   - Convergence issues faced by iterative algorithms like BP in loopy structures.
   - Effectiveness of Zero-Temperature versions of algorithms and g-BP algorithm in finding global minima under specific conditions.

7. **Functional Minimization:**
   - The role of the Bethe free energy functional in minimizing for tree-like structures, often aligning with ML solutions at zero temperature limits.

8. **Tree-like Graphs and Their Solution Methods:**
   - Use of Iterative Belief Propagation (BP) for solving tree-like graphs exactly.
   - Minimization by iterative BP algorithms in tree-like structures, with potential multiple minima.

9. **Zero-Temperature Limit:**
   - Convergence and failure modes of the zero-temperature limit of Belief Propagation and certain graphical models.
   - The process finds Maximum Likelihood (ML) solutions but can fail to converge completely or find global minima in some cases.

10. **Linear Programming with Transversality Uniqueness Maximization (TUM):**
    - Application of Linear Programming for solving specific graphical models, particularly those with loops in the zero-temperature limit.
    - The method provides efficient solutions for certain ML problems.

11. **Generalized Belief Propagation (g-BP) Algorithm:**
    - A generalization of the BP algorithm applicable to some specific problems or scenarios.

12. **Model Specifics and Interconnectedness:**
    - Two models discussed in [05KW] and [08BSS] share key features, suggesting their interconnectedness in theoretical or practical aspects.
    
13. **Interrelated Concepts:**
    - The relationship between the zero-temperature version of BP algorithms and their application to solving specific graphical models.

Additionally, there is another set of papers that focus on:

14. **Bethe Free Energy Functional (BFEF):** 
   - Multiple minima explored, particularly at zero temperature where one minimum can align with the Maximum-Likelihood (ML) solution.
   
15. **Maximum-Likelihood Solution:** 
   - Connection between BFEF and ML solutions highlighted, noting that while BFEF can have multiple local optima, it converges to the ML solution in the absence of thermal fluctuations or at zero temperature.

16. **Belief Propagation (BP):**
   - Role in calculating BFEF on tree structures.
   
17. **Generalized Belief Propagation (g-BP) Algorithm:**
   - Generalization aimed at finding global minima and related to ML solutions under certain conditions.

In [73]:
response = query_engine.query(
    "Which papers have the most in common?"
)
display(Markdown(f"{response.response}"))

Based on the provided summaries, here are the key findings:

1. **Papers with Shared Features**:
   - **05KW** (published around 2005) and **08BSS** (published around 2008) share specific key features related to graphical models and algorithms, focusing on iterative algorithms like BP at zero temperature limits and their relation to finding ML solutions.

2. **Entities with the Most in Common**:
   - **LEO-II and Modal Logic S4**: These are directly involved as LEO-II can automate reasoning within modal logic S4.
   - **Garg and Abadi's Contribution**: Their work involves translating prominent access control logics into modal logic S4, enhancing automated reasoning using LEO-II.

3. **Papers Focusing on Random Walks and Graph Theory**:
   - Works that explore random walks through semantic networks defined by user-prescribed grammars.
   - Utilization of RDF to structure and analyze data in these networks.
   - Application of centrality metrics like eigenvector centrality and PageRank.
   - Influence of context-based inputs on traversal paths.

4. **Latent Relation Mapping Engine (LRME)**:
   - Evaluation through 20 analogical mapping tasks, focusing on its performance with scientific and metaphorical words.

5. **Papers Related to Bethe Free Energy Functional (BFEF), Maximum-Likelihood (ML) Solutions, Belief Propagation (BP), and Generalized BP (g-BP)**:
   - Papers discussing the properties and applications of BFEF.
   - Studies on ML solutions in graphical models.
   - Research on BP algorithms and their extensions like g-BP.

6. **Papers by Garg and Abadi**:
   - Their work involves translating access control logics into Modal Logic S4, using Simple Type Theory (Higher-Order Logic) to enhance automated reasoning with LEO-II.

7. **Central Vertices in Semantic Networks**:
   - Papers focusing on identifying and analyzing central vertices.
   - Application of PageRank and eigenvector centrality metrics.
   - Utilization of RDF as a framework for these applications.

Overall, the papers or works that share the most common ground tend to integrate elements like random walks, RDF, centrality metrics, and automated reasoning frameworks.

In [71]:
response = query_engine.query("What is the main news in the energy sector?")
display(Markdown(f"{response.response}"))

It seems there might be a misunderstanding. The provided summaries focus on topics such as graphical models, theorem provers, machine learning techniques, and feature detectors, which are not directly related to the current news in the energy sector.

For main news in the energy sector, I recommend checking recent reports from reputable sources like the International Energy Agency (IEA), Reuters, Bloomberg, or other energy-specific publications. These sources typically cover topics such as:

1. **Renewable Energy Advancements**: Updates on solar, wind, and other renewable technologies.
2. **Fossil Fuel Prices and Policies**: Changes in oil, gas, and coal prices, as well as policy updates affecting these industries.
3. **Energy Transition Initiatives**: Government policies, corporate strategies, and technological innovations aimed at reducing carbon emissions.
4. **Infrastructure Developments**: New projects related to energy production, transmission, and storage infrastructure.

If you need specific recent news articles or data points, feel free to ask for guidance on where to find them!