# LlamaIndex GraphRAG

- 1. Install Required Libraries
- 2. Load Data
    - 2.1. Load CSV file with three columns: title, date, and text
    - 2.2. Concatenate title and text to get documents
    - 2.3. Split Text Blocks
    - 2.4. Verification
- 3. Extract Entities and Relationships
    - 3.1. Define GraphRAGExtractor Class
    - 3.2. Use Local Ollama Model and Set as Global LLM
    - 3.3. Use Local Ollama Embedding Model and Set as Global embed_model
    - 3.4. Define extract_prompt
    - 3.5. Define parse_fn
    - 3.6. Instantiate GraphRAGExtractor as kg_extractor Object
- 4. Store Graph Information in Neo4j
    - 4.1. Define GraphRAGStore Class
    - 4.2. Instantiate GraphRAGStore as graph_store Object Using Local Neo4j Graph Database
- 5. GraphRAG Index
    - 5.1. Create Index
    - 5.2. Verification
- 6. Build Communities and Generate Community Summaries
- 7. GraphRAG Query
    - 7.1. Define GraphRAGQueryEngine Class
    - 7.2. Instantiate GraphRAGQueryEngine as query_engine Object
    - 7.3. Retrieve Information

## 1. Install Required Libraries

In [1]:
%%capture
!pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama llama-index-graph-stores-neo4j graspologic numpy scipy==1.12.0 future

## 2. Load Data

### 2.1. Load CSV file with three columns: title, date, and text

In [2]:
import pandas as pd
from llama_index.core import Document

news = pd.read_csv(
    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)[:10]

news.head()

|   | title | date | text |
|---|-------|------|------|
| 0 | Chevron: Best Of Breed | 2031-04-06T01:36:32.000000000+00:00 | JHVEPhoto Like many companies in the O&G secto... |
| 1 | FirstEnergy (NYSE:FE) Posts Earnings Results | 2030-04-29T06:55:28.000000000+00:00 | FirstEnergy (NYSE:FE – Get Rating) posted its ... |
| 2 | Dáil almost suspended after Sinn Féin TD put p... | 2023-06-15T14:32:11.000000000+00:00 | The Dáil was almost suspended on Thursday afte... |
| 3 | Epic’s latest tool can animate hyperrealistic ... | 2023-06-15T14:00:00.000000000+00:00 | Today, Epic is releasing a new tool designed t... |
| 4 | EU to Ban Huawei, ZTE from Internal Commission... | 2023-06-15T13:50:00.000000000+00:00 | The European Commission is planning to ban equ... |


### 2.2. Concatenate title and text to get documents

In [3]:
# Use double quotes for the f-string so the single-quoted keys parse on Python < 3.12.
documents = [Document(text=f"{row['title']}:{row['text']}") for _, row in news.iterrows()]

### 2.3. Split Text Blocks

In [4]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)

### 2.4. Verification

In [5]:
len(nodes)

10

In [6]:
print(nodes[0].text)

Chevron: Best Of Breed:JHVEPhoto Like many companies in the O&G sector, the stock of Chevron (NYSE:CVX) has declined about 10% over the past 90-days despite the fact that Q2 consensus earnings estimates have risen sharply (~25%) during that same time frame. Over the years, Chevron has kept a very strong balance sheet. That allowed the...


In [7]:
print(nodes[1].text)

FirstEnergy (NYSE:FE) Posts Earnings Results:FirstEnergy (NYSE:FE – Get Rating) posted its earnings results on Tuesday. The utilities provider reported $0.53 earnings per share for the quarter, topping the consensus estimate of $0.52 by $0.01, RTT News reports. FirstEnergy had a net margin of 10.85% and a return on equity of 17.17%. During the same period...
If the content contained herein violates any of your rights, including those of copyright, you are requested to immediately notify us using via the following email address operanews-external(at)opera.com
Top News


## 3. Extract Entities and Relationships

### 3.1. Define GraphRAGExtractor Class

In [8]:
import asyncio
import nest_asyncio

nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field


class GraphRAGExtractor(TransformComponent):
    """Extract triples from text.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        try:
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            print(f'extract text --->:\n{text}')
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        for entity, entity_type, description in entities:
            # Copy per entity so each node gets its own properties dict.
            entity_metadata = node.metadata.copy()
            entity_metadata["entity_description"] = description
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=entity_metadata
            )
            existing_nodes.append(entity_node)

        for triple in entities_relationship:
            subj, obj, rel, description = triple
            # Copy per relation so each edge gets its own properties dict.
            relation_metadata = node.metadata.copy()
            relation_metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj,
                target_id=obj,
                properties=relation_metadata,
            )
            existing_relations.append(rel_node)

        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )
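
The `num_workers` argument caps how many chunks are sent to the LLM at once. The sketch below illustrates that idea with plain `asyncio`, assuming `run_jobs` behaves like a semaphore-bounded `asyncio.gather`. `bounded_gather` and `fake_extract` are hypothetical names used only for this illustration; they are not part of LlamaIndex.

```python
import asyncio

active = 0  # jobs currently running
peak = 0    # highest number of jobs seen running at once

async def fake_extract(i: int) -> str:
    """Stand-in for one LLM extraction call on a text chunk."""
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)
    active -= 1
    return f"chunk-{i}"

async def bounded_gather(jobs, workers: int):
    """Await all coroutines while letting at most `workers` run concurrently."""
    semaphore = asyncio.Semaphore(workers)

    async def worker(job):
        async with semaphore:
            return await job

    # gather returns results in the same order as the submitted jobs.
    return await asyncio.gather(*(worker(job) for job in jobs))

results = asyncio.run(bounded_gather([fake_extract(i) for i in range(5)], workers=2))
print(results)
print(peak)
```

Run as a standalone script, this never has more than two fake extractions in flight. Inside Jupyter, `asyncio.run` needs `nest_asyncio.apply()` first, which is why the cell above calls it.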

### 3.2. Use Local Ollama Model and Set as Global LLM

In [9]:
import os
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings

os.environ["no_proxy"] = "127.0.0.1,localhost"

llm = Ollama(model="llama3", request_timeout=660.0)

Settings.llm = llm

In [10]:
response = llm.complete("What is the capital of France?")
print(response)

The capital of France is Paris.


### 3.3. Use Local Ollama Embedding Model and Set as Global embed_model

In [11]:
from llama_index.embeddings.ollama import OllamaEmbedding

ollama_embedding = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
    request_timeout=660.0
)

# changing the global default
Settings.embed_model = ollama_embedding

### 3.4. Define extract_prompt

In [12]:
KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other

3. Output Formatting:
- Return the result in valid JSON format with two keys: 'entities' (list of entity objects) and 'relationships' (list of relationship objects).
- Exclude any text outside the JSON structure (e.g., no explanations or comments).
- If no entities or relationships are identified, return empty lists: { "entities": [], "relationships": [] }.

-An Output Example-
{
  "entities": [
    {
      "entity_name": "Albert Einstein",
      "entity_type": "Person",
      "entity_description": "Albert Einstein was a theoretical physicist who developed the theory of relativity and made significant contributions to physics."
    },
    {
      "entity_name": "Theory of Relativity",
      "entity_type": "Scientific Theory",
      "entity_description": "A scientific theory developed by Albert Einstein, describing the laws of physics in relation to observers in different frames of reference."
    },
    {
      "entity_name": "Nobel Prize in Physics",
      "entity_type": "Award",
      "entity_description": "A prestigious international award in the field of physics, awarded annually by the Royal Swedish Academy of Sciences."
    }
  ],
  "relationships": [
    {
      "source_entity": "Albert Einstein",
      "target_entity": "Theory of Relativity",
      "relation": "developed",
      "relationship_description": "Albert Einstein is the developer of the theory of relativity."
    },
    {
      "source_entity": "Albert Einstein",
      "target_entity": "Nobel Prize in Physics",
      "relation": "won",
      "relationship_description": "Albert Einstein won the Nobel Prize in Physics in 1921."
    }
  ]
}

-Real Data-
######################
text: {text}
######################
output:"""

### 3.5. Define parse_fn

In [13]:
import json
import re

def parse_fn(response_str: str) -> Any:
    print(f'parse_fn ---> response_str:\n{response_str}')
    json_pattern = r'\{.*\}'
    match = re.search(json_pattern, response_str, re.DOTALL) 
    entities = []
    relationships = []
    if not match:
        return entities, relationships
    json_str = match.group(0)
    try:
        data = json.loads(json_str)
        entities = [(entity['entity_name'], entity['entity_type'], entity['entity_description']) for entity in data.get('entities', [])]
        relationships = [(relation['source_entity'], relation['target_entity'], relation['relation'], relation['relationship_description']) for relation in data.get('relationships', [])]
        print(f'parse_fn ---> entities:\n{entities}')
        print(f'parse_fn ---> relationships:\n{relationships}')
        return entities, relationships
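    except (json.JSONDecodeError, KeyError, TypeError) as e:
        # Malformed JSON or a missing/ill-typed field: skip this chunk instead of aborting the run.
        print("Error parsing LLM output:", e)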
        return entities, relationships
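
The greedy `\{.*\}` search with `re.DOTALL` is what lets `parse_fn` cope with replies that wrap the JSON in prose such as "Here is the output in JSON format:". The sketch below applies the same regex to a made-up reply. The helper name `extract_json_block` and the reply text are assumptions for illustration only.

```python
import json
import re

def extract_json_block(response_str: str):
    """Return the parsed JSON found between the first '{' and the last '}', or None."""
    match = re.search(r"\{.*\}", response_str, re.DOTALL)
    if not match:
        return None
    return json.loads(match.group(0))

# Hypothetical LLM reply: prose before and after the JSON payload.
reply = '''Here is the output in JSON format:

{
  "entities": [
    {"entity_name": "Chevron", "entity_type": "Company", "entity_description": "An energy company."}
  ],
  "relationships": []
}

Let me know if you need anything else.'''

data = extract_json_block(reply)
entities = [
    (e["entity_name"], e["entity_type"], e["entity_description"])
    for e in data["entities"]
]
print(entities)
print(extract_json_block("no json here"))
```

Because the match runs from the first `{` to the last `}`, stray braces in the surrounding prose would end up inside the candidate string and make `json.loads` fail, so the prompt's instruction to emit nothing outside the JSON still matters.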

### 3.6. Instantiate GraphRAGExtractor as kg_extractor Object

In [14]:
kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=2,
    parse_fn=parse_fn,
)

## 4. Store Graph Information in Neo4j

### 4.1. Define GraphRAGStore Class

In [15]:
import re
import networkx as nx
from graspologic.partition import hierarchical_leiden
from llama_index.core.llms import ChatMessage
from collections import defaultdict
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore


class GraphRAGStore(Neo4jPropertyGraphStore):
    community_summary = {}
    max_cluster_size = 5
    entity_info = None

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
                    "relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = llm.chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        self.entity_info, community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        triplets = self.get_triplets()
        for entity1, relation, entity2 in triplets:
            nx_graph.add_node(entity1.name)
            nx_graph.add_node(entity2.name)
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """
        Collect information for each node based on their community,
        allowing entities to belong to multiple clusters.
        """
        entity_info = defaultdict(set)
        community_info = defaultdict(list)
        
        for item in clusters:
            node = item.node
            cluster_id = item.cluster

            # Update entity_info
            entity_info[node].add(cluster_id)

            for neighbor in nx_graph.neighbors(node):
                edge_data = nx_graph.get_edge_data(node, neighbor)
                if edge_data:
                    detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                    community_info[cluster_id].append(detail)
        
        # Convert sets to lists for easier serialization if needed
        entity_info = {k: list(v) for k, v in entity_info.items()}

        return dict(entity_info), dict(community_info)

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary
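
To make the bookkeeping in `_collect_community_info` concrete, here is a minimal sketch that replays its loop on made-up data using only the standard library. `ClusterItem` is a hypothetical stand-in for the items returned by `hierarchical_leiden` (only the `node` and `cluster` attributes read above are modeled), and the `edges` dict with its `neighbors` helper replaces the NetworkX graph.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for one clustering result row.
ClusterItem = namedtuple("ClusterItem", ["node", "cluster"])

# Undirected edges: (entity1, entity2) -> (relation, description). Made-up data.
edges = {
    ("XPeng", "Tesla"): ("rival", "XPeng competes with Tesla."),
    ("XPeng", "Beijing"): ("launched in", "XPeng launched assisted driving in Beijing."),
}

def neighbors(node):
    """Yield (neighbor, relation, description) for every edge touching node."""
    for (a, b), (rel, desc) in edges.items():
        if node == a:
            yield b, rel, desc
        elif node == b:
            yield a, rel, desc

clusters = [
    ClusterItem("XPeng", 0),
    ClusterItem("Tesla", 0),
    ClusterItem("Beijing", 1),
    ClusterItem("XPeng", 1),  # an entity may appear in more than one cluster
]

entity_info = defaultdict(set)
community_info = defaultdict(list)
for item in clusters:
    entity_info[item.node].add(item.cluster)
    for neighbor, rel, desc in neighbors(item.node):
        community_info[item.cluster].append(
            f"{item.node} -> {neighbor} -> {rel} -> {desc}"
        )

entity_info = {k: sorted(v) for k, v in entity_info.items()}
print(entity_info)
print(len(community_info[0]), len(community_info[1]))
```

In this toy run each community collects three detail lines: an edge whose endpoints share a community is recorded once from each endpoint, and an edge is recorded even when the neighbor sits in a different community. The summaries sent to the LLM can therefore contain repeated or cross-community relationships.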

### 4.2. Instantiate GraphRAGStore as graph_store Object Using Local Neo4j Graph Database

In [16]:
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

# Note: used to be `Neo4jPGStore`
graph_store = GraphRAGStore(
    username="neo4j", password="neo4j", url="bolt://localhost:7687"
)

## 5. GraphRAG Index

### 5.1. Create Index

In [17]:
from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex(
    nodes=nodes,
    property_graph_store=graph_store,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

Extracting paths from text:  10%|█▌              | 1/10 [01:10<10:37, 70.80s/it]

extract text --->:
EU to Ban Huawei, ZTE from Internal Commission Networks:The European Commission is planning to ban equipment from Chinese vendors Huawei Technologies Co. and ZTE Corp. from its own internal telecommunications networks, people familiar with the matter said.
The ban comes ahead of an anticipated update to the European Union’s guidance on 5G mobile networks within the bloc that’s expected to more forcefully encourage members to phase out equipment from the companies, which it considers high risk, the people said, asking not to be identified because the plan isn’t yet public.
As the relationship between the US and its allies and China has deteriorated, countries have blocked Chinese technology from their core telecommunications networks because of spying concerns. The move is similar to the commission’s decision to block its staff from using TikTok Inc. over security concerns related to the social-media app’s data-collection practices.
Read More: TikTok Banned From EU Co

Extracting paths from text:  20%|███▏            | 2/10 [01:14<04:12, 31.58s/it]

extract text --->:
Vivo X90s Officially Teased, Tipped to Run on New MediaTek Dimensity 9200+ SoC:Vivo X90s has been teased by Jia Jingdong, Vivo's Vice President and General Manager of Product Strategy via Weibo on Thursday. The new smartphone has a design similar to its Vivo X90 series siblings — Vivo X90 and Vivo X90 Pro. The Vivo X90s is shown to have a Zeiss-tuned triple rear camera unit. The upcoming model is expected to retain the key specification of the Vivo X90. Meanwhile, Chinese tipsters have leaked the specifications of the upcoming handset. The Vivo X90s is said to run on MediaTek Dimensity 9200+ SoC and could be offered in four different colour options.
Jia Jingdong posted an image of the Vivo X90s on the Chinese microblogging platform providing us a glimpse of the design from the rear. The render shows the handset in a white finish with rounded corners. The image suggests Zeiss branded triple rear cameras on the rear panel along with an LED flash. It seems to have a gla

Extracting paths from text:  30%|████▊           | 3/10 [01:36<03:09, 27.10s/it]

extract text --->:
Epic’s latest tool can animate hyperrealistic MetaHumans with an iPhone:Today, Epic is releasing a new tool designed to capture an actor’s facial performance using a device as simple as an iPhone and apply it to a hyperrealistic “MetaHuman” in the Unreal Engine in “minutes.” The feature, dubbed MetaHuman Animator, was detailed at the Game Developers Conference in March but is now available for developers to try out for themselves. Epic has also released a new video today produced by one of its internal teams to show what the tool is capable of.
While Epic’s short film shows off some impressively subtle facial animation, the big benefit the company is emphasizing is the speed with which MetaHuman Animator produces results. “The animation is produced locally using GPU hardware, with the final animation available in minutes,” the company’s press release reads. That has the potential to not just save a studio money by making performance capture more efficient but also, E

Extracting paths from text:  40%|██████▍         | 4/10 [01:39<01:45, 17.61s/it]

extract text --->:
KeyBank’s American Fork Branch Celebrates One Year Anniversary:CLEVELAND, UT / ACCESSWIRE / June 15, 2023 / KeyBank recently commemorated the one-year anniversary of its American Fork branch - the company's first new branch in the Western half of the U.S. in more than a decade. The celebration included a networking event with the branch's business clients and the American Fork Chamber of Commerce, as well as a $10,000 donation to the Five.12 Foundation.
"We've had a great first year in American Fork," said Drew Yergensen, KeyBank Utah market president and commercial banking leader. "We have really enjoyed meeting and working more closely with our new neighbors, clients and community partners, and we look forward to strengthening those relationships even further in the coming years."
The American Fork branch highlights KeyBank's state-of-the-art financial wellness center model, which is staffed with financial wellness consultants rather than a traditional teller line.

Extracting paths from text:  50%|████████        | 5/10 [01:46<01:08, 13.66s/it]

extract text --->:
Chevron: Best Of Breed:JHVEPhoto Like many companies in the O&G sector, the stock of Chevron (NYSE:CVX) has declined about 10% over the past 90-days despite the fact that Q2 consensus earnings estimates have risen sharply (~25%) during that same time frame. Over the years, Chevron has kept a very strong balance sheet. That allowed the...
parse_fn ---> response_str:
Here is the output in JSON format:

{
  "entities": [
    {
      "entity_name": "Chevron",
      "entity_type": "Company",
      "entity_description": "A multinational energy corporation with a diverse range of businesses, including oil and natural gas exploration, production, refining, marketing, and transportation."
    },
    {
      "entity_name": "NYSE:CVX",
      "entity_type": "Stock Ticker Symbol",
      "entity_description": "The stock ticker symbol for Chevron's publicly traded shares on the New York Stock Exchange."
    }
  ],
  "relationships": [
    {
      "source_entity": "Chevron",
      "

Extracting paths from text:  60%|█████████▌      | 6/10 [02:16<01:16, 19.10s/it]

extract text --->:
FirstEnergy (NYSE:FE) Posts Earnings Results:FirstEnergy (NYSE:FE – Get Rating) posted its earnings results on Tuesday. The utilities provider reported $0.53 earnings per share for the quarter, topping the consensus estimate of $0.52 by $0.01, RTT News reports. FirstEnergy had a net margin of 10.85% and a return on equity of 17.17%. During the same period...
If the content contained herein violates any of your rights, including those of copyright, you are requested to immediately notify us using via the following email address operanews-external(at)opera.com
Top News
parse_fn ---> response_str:
Here is the output in JSON format:

{
  "entities": [
    {
      "entity_name": "FirstEnergy",
      "entity_type": "Company",
      "entity_description": "FirstEnergy (NYSE:FE) is a utilities provider that reported its earnings results on Tuesday."
    },
    {
      "entity_name": "NYSE:FE",
      "entity_type": "Stock Symbol",
      "entity_description": "NYSE:FE is the st

Extracting paths from text:  70%|███████████▏    | 7/10 [02:47<01:09, 23.22s/it]

extract text --->:
XPeng Stock Rises. The Tesla Rival Rolled Out Self-Driving Tech.:Chinese electric-vehicle maker
XPeng
said Thursday its assisted-driving technology has been launched in Beijing and three other cities. The
Tesla
rival’s stock was rising in premarket trading.
parse_fn ---> response_str:
Here is the output for the given text:

{
  "entities": [
    {
      "entity_name": "XPeng",
      "entity_type": "Company",
      "entity_description": "Chinese electric-vehicle maker XPeng said Thursday its assisted-driving technology has been launched in Beijing and three other cities."
    },
    {
      "entity_name": "Tesla",
      "entity_type": "Company",
      "entity_description": "The Tesla Rival Rolled Out Self-Driving Tech.: The Tesla rival’s stock was rising in premarket trading."
    }
  ],
  "relationships": [
    {
      "source_entity": "XPeng",
      "target_entity": "Tesla",
      "relation": "rival",
      "relationship_description": "XPeng is a rival to Tesla, an 

Extracting paths from text:  90%|██████████████▍ | 9/10 [02:50<00:11, 11.45s/it]

extract text --->:
Ryanair sacks chief pilot over sexual misconduct claims:Reuters
Ryanair has sacked its chief pilot after an investigation into his alleged sexual harassment of female colleagues.
The airline told staff that he had been fired for "a pattern of repeated inappropriate and unacceptable behaviour towards a number of female pilots".
The chief pilot, named in reports as Aidan Murray was appointed in 2020 and had been with the airline for 28 years.
Ryanair declined to comment "on queries relating to individual employees".
According to The Independent, Mr Murray allegedly harassed nine junior colleagues, including sending text messages to some with comments on their bodies.
Mr Murray, 58, is also accused of altering flight rosters to fly with certain female pilots.
In a note to staff, Ryanair's chief people officer, Darrell Hughes, said Mr Murray's employment had been "terminated with immediate effect".
An investigation found his behaviour "was in breach of our anti-harassmen

Extracting paths from text: 100%|███████████████| 10/10 [03:15<00:00, 19.52s/it]


extract text --->:
Arsenal have Rice bid rejected, new Premier League fixtures released, Man United out of Harry Kane race, Chelsea reject Mason Mount bid:Jude Bellingham has also insisted his move to Real Madrid had nothing to do with money, as he explained how his transfer was wrapped up so quickly.
“Money is not a thing for me,” he said.
“I dont think about money at all when I make these kinds of decisions. I never have and I never will. I play the game purely out of love.
“I spoke with people from Real Madrid when I was given permission by Borussia Dortmund and I love the feeling I got from the club. I couldn’t hide it. I told them straight away what I felt and after that happened on Monday it all happened quickly.”
He also opened up on how he was given the No.5 shirt, which was worn by defender Jesus Vallejo last season, and what it means to him to have Zidane’s old number on his back.
“For a start I’d like to thank Jesus Vallejo for letting me wear the No.5,” he explained.
“I con

Generating embeddings: 100%|██████████████████████| 1/1 [00:00<00:00,  1.30it/s]
Generating embeddings: 100%|██████████████████████| 1/1 [00:00<00:00, 26.13it/s]


### 5.2. Verification

In [18]:
len(index.property_graph_store.get_triplets())

202

In [19]:
index.property_graph_store.get_triplets()[10]

[EntityNode(label='Company', embedding=None, properties={'id': 'FirstEnergy', 'entity_description': 'FirstEnergy (NYSE:FE) is a utilities provider', 'triplet_source_id': '144af8a1-4078-4991-a234-4fc930bcd029'}, name='FirstEnergy'),
 Relation(label='was reported by', source_id='FirstEnergy', target_id='RTT News', properties={'relationship_description': "RTT News reported on FirstEnergy's earnings results.", 'triplet_source_id': '28fc501c-11b7-4289-a19d-3865057f67b3'}),
 EntityNode(label='News Source', embedding=None, properties={'id': 'RTT News', 'entity_description': 'A news source reporting on earnings results and other financial information', 'triplet_source_id': '144af8a1-4078-4991-a234-4fc930bcd029'}, name='RTT News')]

## 6. Build Communities and Generate Community Summaries

In [20]:
index.property_graph_store.build_communities()

## 7. GraphRAG Query

### 7.1. Define GraphRAGQueryEngine Class

In [21]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM, ChatMessage
from llama_index.core import PropertyGraphIndex
import re

class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    llm: LLM
    index: PropertyGraphIndex
    similarity_top_k: int = 20

    def custom_query(self, query_str: str) -> str:
        """Process all community summaries to generate answers to a specific query."""
        
        entities = self.get_entities(query_str, self.similarity_top_k)

        community_ids = self.retrieve_entity_communities(
            self.graph_store.entity_info, entities
        )
        
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for community_id, community_summary in community_summaries.items()
            if community_id in community_ids  # Filter using cluster IDs
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def get_entities(self, query_str, similarity_top_k):
        nodes_retrieved = self.index.as_retriever(
            similarity_top_k=similarity_top_k
        ).retrieve(query_str)

        entities = set()
        pattern = (
            r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"
        )

        for node in nodes_retrieved:
            matches = re.findall(
                pattern, node.text, re.MULTILINE | re.IGNORECASE
            )

            for match in matches:
                subject = match[0]
                obj = match[2]
                entities.add(subject)
                entities.add(obj)

        return list(entities)

    def retrieve_entity_communities(self, entity_info, entities):
        """
        Retrieve cluster information for given entities, allowing for multiple clusters per entity.

        Args:
        entity_info (dict): Dictionary mapping entities to their cluster IDs (list).
        entities (list): List of entity names to retrieve information for.

        Returns:
        List of community or cluster IDs to which an entity belongs.
        """
        community_ids = []

        for entity in entities:
            if entity in entity_info:
                community_ids.extend(entity_info[entity])

        return list(set(community_ids))
    
    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response
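
The first two steps of `custom_query` can be tried in isolation. The sketch below assumes the retrieved node text contains lines of the form `subject -> relation -> object`, which is what the regex in `get_entities` expects, and then maps the matched entities to community IDs as `retrieve_entity_communities` does. The sample text and the `entity_info` mapping are made up.

```python
import re

PATTERN = r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"

# Hypothetical retrieved text; real retriever output may carry extra lines.
node_text = """FirstEnergy -> was reported by -> RTT News
XPeng -> rival -> Tesla
NYSE:FE -> ticker of -> FirstEnergy"""

entities = set()
for subject, _relation, obj in re.findall(
    PATTERN, node_text, re.MULTILINE | re.IGNORECASE
):
    entities.update((subject, obj))

# Made-up mapping in the shape produced by build_communities().
entity_info = {"FirstEnergy": [0], "Tesla": [1, 2], "Huawei": [3]}
community_ids = sorted(
    {cid for entity in entities for cid in entity_info.get(entity, [])}
)

print(sorted(entities))
print(community_ids)
```

The third line is skipped because `\w` does not match the colon in `NYSE:FE`, so entity names containing punctuation never reach the community lookup. If such names matter for a dataset, the pattern would need to be widened.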

### 7.2. Instantiate GraphRAGQueryEngine as query_engine Object

In [22]:
query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store, 
    llm=llm,
    index=index,
    similarity_top_k=10
)

### 7.3. Retrieve Information

In [23]:
response = query_engine.query(
    "What are the main news discussed in the document?"
)
display(Markdown(f"{response.response}"))

Here are the combined intermediate answers:

The main news discussed in the documents are:

* Earnings results of FirstEnergy
* European Commission's plan to ban equipment from ZTE Corp. and Huawei Technologies Co. from its internal telecommunications networks, and classification of Huawei as a high-risk vendor by the European Union
* Jude Bellingham's transfer to Real Madrid and his decision to wear the iconic No.5 shirt, as well as his admiration for Zidane and previous affiliation with Borussia Dortmund
* Ryanair's leadership shake-up with Aidan Murray's departure due to allegations of sexual harassment and appointment of Darrell Hughes as chief people officer
* European Commission's plan to ban equipment from two Chinese companies (ZTE Corp. and Huawei Technologies Co.) and the US blocking similar Chinese technology from its core telecommunications networks due to concerns about spying
* Deterioration of ties between the United States (US) and China, with the US blocking Chinese technology from its core telecommunications networks and the European Commission following suit
* Ryanair appointing Darrell Hughes as chief people officer and sacking Aidan Murray due to allegations of sexual harassment

These are the main news points highlighted in the documents.