# GraphRAG Implementation with LlamaIndex

[GraphRAG (Graphs + Retrieval Augmented Generation)](https://www.microsoft.com/en-us/research/project/graphrag/) combines the strengths of Retrieval Augmented Generation (RAG) and Query-Focused Summarization (QFS) to handle complex queries over large text datasets. RAG excels at fetching precise information but struggles with broader queries that require thematic understanding; QFS addresses those queries but does not scale well. GraphRAG integrates the two approaches to offer responsive and thorough querying across extensive, diverse text corpora.

This notebook walks through building a GraphRAG pipeline with the LlamaIndex PropertyGraph abstractions, using Neo4j as the graph store.

## Setup LLM and embedding models

- LLM used for indexing and querying
- Embedding model for embeddings calculation

In [1]:
import yaml

# Load config.yaml
with open("../config.yaml", "r") as f:
    config = yaml.safe_load(f)

In [2]:
config

{'OLLAMA_EMBEDDING_MODEL': 'bge-m3:latest', 'OLLAMA_LLM_MODEL': 'gemma3n:e4b'}

In [3]:
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model=config["OLLAMA_LLM_MODEL"], 
    request_timeout=7200.0
)

In [4]:
from llama_index.embeddings.ollama import OllamaEmbedding

ollama_embedding = OllamaEmbedding(
    model_name=config["OLLAMA_EMBEDDING_MODEL"],
    base_url="http://localhost:11434",
)

# Indexing

## Source Documents → Text Chunks

Load the raw Simpsons CSV files (characters, episodes, locations and script lines) into pandas DataFrames.

In [5]:
import pandas as pd

characters_df = pd.read_csv("../data/simpsons/simpsons_characters.csv")
episodes_df = pd.read_csv("../data/simpsons/simpsons_episodes.csv")
locations_df = pd.read_csv("../data/simpsons/simpsons_locations.csv")
script_lines_df = pd.read_csv("../data/simpsons/simpsons_script_lines.csv")

In [6]:
episodes_df.sort_values(by=["season", "id"], inplace=True)
script_lines_df.sort_values(by=["episode_id", "number"], inplace=True)

Join the script lines with the episode, location and character tables.

In [7]:
# Use the episode_id from script_lines_df to get the episode title season and the number_in_season from episodes_df
script_lines_df = script_lines_df.merge(
	episodes_df[["id", "title", "season", "number_in_season", "number_in_series"]],
	left_on="episode_id",
	right_on="id",
	suffixes=("", "_episode"),
)
# use the location_id from script_lines_df to get the location name from locations_df
script_lines_df = script_lines_df.merge(
	locations_df[["id", "normalized_name"]],
	left_on="location_id",
	right_on="id",
	suffixes=("", "_location"),
)
# rename the column to "location_name"
script_lines_df.rename(columns={"normalized_name": "location_name"}, inplace=True)
# use the character_id from script_lines_df to get the character name from characters_df
# character_id can be NaN, so use an explicit left join to keep those rows
# (pandas merges default to an inner join)
characters_df['id'] = characters_df['id'].astype(str)
script_lines_df = script_lines_df.merge(
	characters_df[["id", "normalized_name"]],
	left_on="character_id",
	right_on="id",
	how="left",
	suffixes=("", "_character"),
)
# rename the column to "character_name"
script_lines_df.rename(columns={"normalized_name": "character_name"}, inplace=True)
# Normalize speaking_line to a real boolean. The column may hold booleans or the
# strings "true"/"false"; a plain astype(bool) would turn the string "false" into True.
script_lines_df["speaking_line"] = (
	script_lines_df["speaking_line"].astype(str).str.lower() == "true"
)

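`get_episode_text` takes an `episode_id` and returns that episode's speaking lines as a single string, one line per utterance, in the form `[location] (character): text`. The cell after it writes each episode's text to its own `.txt` file.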

In [8]:
def get_episode_text(episode_id):
    episode_lines = script_lines_df[script_lines_df["episode_id"] == episode_id]
    # Drop rows with a missing value in any column (e.g. no matching character or location)
    episode_lines = episode_lines.dropna()
    speaking_lines = episode_lines[episode_lines["speaking_line"]]
    locations = speaking_lines["location_name"].tolist()
    characters = speaking_lines["character_name"].tolist()
    text_lines = speaking_lines["normalized_text"].tolist()
    # Format each speaking line as "[location] (character): text"
    text_lines = [f"[{loc}] ({char}): {text}" for loc, char, text in zip(locations, characters, text_lines)]
    # Join all the text lines into a single string, separated by newlines
    return "\n".join(text_lines)
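
To make the output format concrete, here is a minimal, dependency-free sketch of the same formatting step applied to a few hand-written rows. The sample rows and the `format_script_lines` helper are illustrative assumptions; they are not taken from the dataset or from the notebook's code.

```python
# Hypothetical rows standing in for the filtered speaking lines of one episode.
sample_rows = [
    {"location_name": "simpson home", "character_name": "marge simpson", "normalized_text": "careful homer"},
    {"location_name": "simpson home", "character_name": "homer simpson", "normalized_text": "theres no time to be careful"},
    {"location_name": "springfield elementary school", "character_name": "lisa simpson", "normalized_text": "hi dad"},
]


def format_script_lines(rows):
    """Render each row as '[location] (character): text', one row per line."""
    return "\n".join(
        f"[{row['location_name']}] ({row['character_name']}): {row['normalized_text']}"
        for row in rows
    )


episode_text = format_script_lines(sample_rows)
print(episode_text)
```

Keeping the location and speaker inline on every line means each chunk produced later still carries that context, even when a chunk starts in the middle of a scene.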

In [9]:
import os

output_dir = "../output/scripts"
os.makedirs(output_dir, exist_ok=True)
# Episode ids to generate scripts for
episode_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for episode_id in episode_ids:
    episode_text = get_episode_text(episode_id)
    # Episode metadata is identical on every row of an episode, so read it from the first row
    first_row = script_lines_df[script_lines_df["episode_id"] == episode_id].iloc[0]
    title = first_row["title"]
    season = first_row["season"]
    number_in_season = first_row["number_in_season"]
    number_in_series = first_row["number_in_series"]
    # Prepend a header with the title, season and episode numbers
    episode_text = (
        f"Title: {title}\n"
        f"Season: {season}, Episode: {number_in_season}, Episode in series: {number_in_series}\n\n"
        f"{episode_text}"
    )
    # Save into a file
    with open(f"{output_dir}/season_{season}_episode_{episode_id}_text.txt", "w") as f:
        f.write(episode_text)

Prepare documents as required by LlamaIndex

In [10]:
from llama_index.core import Document

documents = []

# all_docs_paths = os.listdir(f"../output/scripts")
all_docs_paths = ["season_1_episode_1_text.txt"]
for doc_path in all_docs_paths:
	print(doc_path)
	with open(f"../output/scripts/{doc_path}", "r") as f:
		text = f.read()
		documents.append(Document(text=text))

season_1_episode_1_text.txt


In [11]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
# Nodes represent chunks of source documents in LlamaIndex
nodes = splitter.get_nodes_from_documents(documents)

In [12]:
print(len(nodes))

8
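
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window sketch over a list of tokens. It is an assumption-laden illustration only: `SentenceSplitter` counts tokens with a tokenizer and, as its name suggests, tries to respect sentence boundaries, whereas the hypothetical `chunk_tokens` helper below cuts at fixed positions.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of at most chunk_size items,
    where consecutive windows share chunk_overlap items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a statement that straddles a chunk boundary is still seen whole by at least one extraction call.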


## Text Chunks → Knowledge Graph

The GraphRAGExtractor class is designed to extract triples (subject-relation-object) from text and enrich them by adding descriptions for entities and relationships to their properties using an LLM.

This functionality is similar to that of the `SimpleLLMPathExtractor`, with additional handling for entity and relationship descriptions. For guidance on implementation, you may look at similar existing [extractors](https://docs.llamaindex.ai/en/latest/examples/property_graph/Dynamic_KG_Extraction/?h=comparing).

Other path (i.e. triplet) extractors implemented in LlamaIndex include:
- `SimpleLLMPathExtractor`
- `SchemaLLMPathExtractor`
- `DynamicLLMPathExtractor`, which lets you start from initial nodes, relationships and their corresponding properties.

Here's a breakdown of its functionality:

**Key Components:**

1. `llm:` The language model used for extraction.
2. `extract_prompt:` A prompt template used to guide the LLM in extracting information.
3. `parse_fn:` A function to parse the LLM's output into structured data.
4. `max_paths_per_chunk:` Limits the number of triples extracted per text chunk.
5. `num_workers:` For parallel processing of multiple text nodes.


**Main Methods:**

1. `__call__:` The entry point for processing a list of text nodes.
2. `acall:` An asynchronous version of `__call__` for improved performance.
3. `_aextract:` The core method that processes each individual node.


**Extraction Process:**

For each input node (chunk of text):
1. It sends the text to the LLM along with the extraction prompt.
2. The LLM's response is parsed to extract entities, relationships, and descriptions for both.
3. Entities are converted into `EntityNode` objects. The entity description is stored in the node's `properties`.
4. Relationships are converted into `Relation` objects. The relationship description is stored in the relation's `properties`.
5. These are added to the node's metadata under `KG_NODES_KEY` and `KG_RELATIONS_KEY`.

**NOTE:** In the current implementation, we are using only relationship descriptions. In the next implementation, we will utilize entity descriptions during the retrieval stage.
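
The extraction steps above can be sketched without an LLM or LlamaIndex installed. In this sketch, `ToyEntity` and `ToyRelation` are simplified stand-ins for `EntityNode` and `Relation`, and the `"kg_nodes"` / `"kg_relations"` keys are placeholders for `KG_NODES_KEY` / `KG_RELATIONS_KEY`; the `attach_graph_data` helper is hypothetical. The sample entity and relation reuse the Homer Simpson / Clerk triple that appears in the notebook output further down.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEntity:  # simplified stand-in for EntityNode
    name: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class ToyRelation:  # simplified stand-in for Relation
    label: str
    source_id: str
    target_id: str
    properties: dict = field(default_factory=dict)


def attach_graph_data(chunk_metadata, entities, relationships):
    """Turn parsed tuples into objects and attach them to the chunk metadata."""
    nodes = [
        ToyEntity(name, etype, {**chunk_metadata, "entity_description": desc})
        for name, etype, desc in entities
    ]
    relations = [
        ToyRelation(rel, subj, obj, {**chunk_metadata, "relationship_description": desc})
        for subj, obj, rel, desc in relationships
    ]
    # Placeholder keys; the real extractor uses KG_NODES_KEY and KG_RELATIONS_KEY
    return {**chunk_metadata, "kg_nodes": nodes, "kg_relations": relations}


entities = [
    ("Homer Simpson", "Person", "Bart's father."),
    ("Clerk", "Person", "An employee at the personnel office."),
]
relationships = [
    ("Homer Simpson", "Clerk", "receives payment from",
     "Homer Simpson receives his paycheck from the Clerk at the Personnel Office."),
]
enriched = attach_graph_data({"source": "season_1_episode_1_text.txt"}, entities, relationships)
print(enriched["kg_relations"][0])
```

Note the tuple order for relationships, `(source, target, relation, description)`, which matches how `_aextract` unpacks each triple.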

In [13]:
# Notebook utility to patch event loop behavior (an event loop is already running in IPython)
import nest_asyncio
nest_asyncio.apply()

import asyncio

from typing import Any, List, Callable, Optional, Union

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    Relation,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from prompts import KG_TRIPLET_EXTRACT_TMPL


class GraphRAGExtractor(TransformComponent):
    """Extract triples from text chunks.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node (chunk)."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        # Extract entities and relationships from the text using the LLM
        # and parse them into a list of JSON objects
        # entities and relationships
        try:
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        # Pop any graph data already attached to this chunk so it can be extended
        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])

        # Create EntityNode objects from the parsed entities
        for entity_name, entity_type, entity_description in entities:
            # Build a fresh properties dict per entity so descriptions are never shared
            entity_metadata = node.metadata.copy()
            entity_metadata["entity_description"] = entity_description
            entity_node = EntityNode(
                name=entity_name, label=entity_type, properties=entity_metadata
            )
            existing_nodes.append(entity_node)

        # Create Relation objects from the parsed relationships
        for triple in entities_relationship:
            subj, obj, rel, description = triple
            relation_metadata = node.metadata.copy()
            relation_metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj,
                target_id=obj,
                properties=relation_metadata,
            )
            existing_relations.append(rel_node)

        # Store the results back on the chunk under the corresponding keys
        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )

Function that parses the LLM response into structured data

In [14]:
import json
import re

def parse_fn(response_str: str) -> Any:
    json_pattern = r"\{.*\}"
    match = re.search(json_pattern, response_str, re.DOTALL)
    entities = []
    relationships = []
    if not match:
        return entities, relationships
    json_str = match.group(0)
    try:
        data = json.loads(json_str)
        entities = [
            (
                entity["entity_name"],
                entity["entity_type"],
                entity["entity_description"],
            )
            for entity in data.get("entities", [])
        ]
        relationships = [
            (
                relation["source_entity"],
                relation["target_entity"],
                relation["relation"],
                relation["relationship_description"],
            )
            for relation in data.get("relationships", [])
        ]
        return entities, relationships
    except json.JSONDecodeError as e:
        print("Error parsing JSON:", e)
        return entities, relationships
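
As a quick illustration of the parsing approach, the sketch below applies the same greedy `\{.*\}` search and the same JSON keys to a hand-written response. The `sample_response` text is a made-up example of what a chatty model might return, not real model output.

```python
import json
import re

# Hypothetical LLM response: valid JSON wrapped in conversational text.
sample_response = """Sure, here is the extracted graph:
{
  "entities": [
    {"entity_name": "Homer Simpson", "entity_type": "Person", "entity_description": "Bart's father."},
    {"entity_name": "Clerk", "entity_type": "Person", "entity_description": "Personnel office employee."}
  ],
  "relationships": [
    {"source_entity": "Homer Simpson", "target_entity": "Clerk",
     "relation": "receives payment from",
     "relationship_description": "Homer receives his paycheck from the Clerk."}
  ]
}
Let me know if you need anything else."""

# With re.DOTALL, the greedy pattern spans from the first "{" to the last "}"
match = re.search(r"\{.*\}", sample_response, re.DOTALL)
data = json.loads(match.group(0))

entities = [
    (e["entity_name"], e["entity_type"], e["entity_description"])
    for e in data.get("entities", [])
]
relationships = [
    (r["source_entity"], r["target_entity"], r["relation"], r["relationship_description"])
    for r in data.get("relationships", [])
]
print(entities)
print(relationships)
```

Because the pattern is greedy, it tolerates text before and after the JSON object, but a response containing two separate JSON objects would be captured as one span and fail to decode; `parse_fn` handles that case by returning empty lists.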

## Knowledge Graph → Graph Communities and Community Summaries

The `GraphRAGStore` class extends the `Neo4jPropertyGraphStore` class to implement the community-building stage of the GraphRAG pipeline. Here's a breakdown of its key components and functions:

The class uses community detection algorithms to group related nodes in the graph and then it generates summaries for each community using an LLM.


**Key Methods:**

`build_communities():`

1. Converts the internal graph representation to a NetworkX graph.

2. Applies the hierarchical Leiden algorithm for community detection.

3. Collects detailed information about each community.

4. Generates summaries for each community.

`generate_community_summary(text):`

1. Uses LLM to generate a summary of the relationships in a community.
2. The summary includes entity names and a synthesis of relationship descriptions.

`_create_nx_graph():`

1. Converts the internal graph representation to a NetworkX graph for community detection.

`_collect_community_info(nx_graph, clusters):`

1. Collects detailed information about each node based on its community.
2. Creates a string representation of each relationship within a community.

`_summarize_communities(community_info):`

1. Generates and stores summaries for each community using LLM.

`get_community_summaries():`

1. Returns the community summaries by building them if not already done.
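
The bookkeeping done by `_collect_community_info` can be sketched on a toy graph using only the standard library. Everything here is an illustrative assumption: the cluster assignments are written by hand instead of coming from `hierarchical_leiden`, the `"father of"` edge is invented, and `ClusterItem` only mimics the `.node` and `.cluster` attributes that the notebook code reads.

```python
from collections import defaultdict, namedtuple

# Mimics the two attributes the notebook reads from each clustering result item.
ClusterItem = namedtuple("ClusterItem", ["node", "cluster"])

# Toy undirected graph: each edge is keyed by the (unordered) pair of its endpoints.
edges = {
    frozenset(("Homer Simpson", "Clerk")): {
        "relationship": "receives payment from",
        "description": "Homer receives his paycheck from the Clerk.",
    },
    frozenset(("Homer Simpson", "Bart Simpson")): {
        "relationship": "father of",
        "description": "Homer is Bart's father.",
    },
}

# Hand-assigned clusters (hypothetical, not the output of a Leiden run).
clusters = [
    ClusterItem("Homer Simpson", 0),
    ClusterItem("Clerk", 0),
    ClusterItem("Bart Simpson", 1),
]


def neighbors(node):
    """Return the other endpoint of every edge touching node."""
    return [next(iter(pair - {node})) for pair in edges if node in pair]


def collect_community_info(clusters):
    entity_info = defaultdict(set)      # entity name -> cluster ids
    community_info = defaultdict(list)  # cluster id -> relationship strings
    for item in clusters:
        entity_info[item.node].add(item.cluster)
        for neighbor in neighbors(item.node):
            edge = edges[frozenset((item.node, neighbor))]
            community_info[item.cluster].append(
                f"{item.node} -> {neighbor} -> {edge['relationship']} -> {edge['description']}"
            )
    return {k: sorted(v) for k, v in entity_info.items()}, dict(community_info)


entity_info, community_info = collect_community_info(clusters)
print(entity_info)
print(community_info)
```

Two properties of this approach are visible in the output: an edge is recorded once per endpoint, so the same relationship can appear in both directions, and a node's edges are added to its cluster even when the neighbor belongs to a different cluster.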

In [15]:
import re
import networkx as nx
from graspologic.partition import hierarchical_leiden
from collections import defaultdict

from llama_index.core.llms import ChatMessage
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
from prompts import COMMUNITY_SUMMARY_TMPL



class GraphRAGStore(Neo4jPropertyGraphStore):
    community_summary = {}
    entity_info = None
    max_cluster_size = 5
    llm: LLM

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        # Use Leiden algorithm to create communities
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        self.entity_info, community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        triplets = self.get_triplets()
        for entity1, relation, entity2 in triplets:
            nx_graph.add_node(entity1.name)
            nx_graph.add_node(entity2.name)
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """
        Collect information for each node based on their community,
        allowing entities to belong to multiple clusters.
        """
        entity_info = defaultdict(set)
        community_info = defaultdict(list)

        for item in clusters:
            node = item.node
            cluster_id = item.cluster

            # Update entity_info
            entity_info[node].add(cluster_id)

            for neighbor in nx_graph.neighbors(node):
                edge_data = nx_graph.get_edge_data(node, neighbor)
                if edge_data:
                    detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                    community_info[cluster_id].append(detail)

        # Convert sets to lists for easier serialization if needed
        entity_info = {k: list(v) for k, v in entity_info.items()}

        return dict(entity_info), dict(community_info)

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=COMMUNITY_SUMMARY_TMPL,
            ),
            ChatMessage(role="user", content=text),
        ]
        # Use the LLM attached to this store (set via `graph_store.llm = llm`)
        response = self.llm.chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary


# Querying

## Community Summaries → Community Answers → Global Answer

The GraphRAGQueryEngine class is a custom query engine designed to process queries using the GraphRAG approach. It leverages the community summaries generated by the GraphRAGStore to answer user queries. Here's a breakdown of its functionality:

**Main Components:**

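- `graph_store:` An instance of `GraphRAGStore`, which contains the community summaries.
- `index:` The `PropertyGraphIndex` used to retrieve the entities relevant to a query.
- `llm:` A Language Model (LLM) used for generating and aggregating answers.
- `similarity_top_k:` The number of results to retrieve from the index for each query.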


**Key Methods:**

`custom_query(query_str: str)`

1. This is the main entry point for processing a query. It retrieves the entities relevant to the query, looks up the communities those entities belong to, generates an answer from each of those community summaries, and then aggregates these answers into a final response.

`generate_answer_from_summary(community_summary, query):`

1. Generates an answer for the query based on a single community summary.
2. Uses the LLM to interpret the community summary in the context of the query.

`aggregate_answers(community_answers):`

1. Combines individual answers from different communities into a coherent final response.
2. Uses the LLM to synthesize multiple perspectives into a single, concise answer.


**Query Processing Flow:**

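1. Retrieve the entities relevant to the query from the index and map them to their community IDs.
2. Retrieve community summaries from the graph store.
3. For each summary of a relevant community, generate a specific answer to the query.
4. Aggregate all community-specific answers into a final, coherent response.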
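
This flow is essentially a map step (one answer per community summary) followed by a reduce step (one aggregated answer). The sketch below shows that shape with stub functions in place of LLM calls; `run_graphrag_query`, `stub_answer` and `stub_aggregate` are hypothetical names, and the `relevant_ids` filter mirrors the `if ... in community_ids` check in `custom_query`.

```python
def run_graphrag_query(query, community_summaries, relevant_ids, answer_fn, aggregate_fn):
    """Map: answer the query per relevant community. Reduce: merge the answers."""
    community_answers = [
        answer_fn(summary, query)
        for community_id, summary in community_summaries.items()
        if community_id in relevant_ids
    ]
    return aggregate_fn(community_answers)


# Stub "LLM" callables so the sketch runs without a model (hypothetical helpers).
def stub_answer(summary, query):
    return f"[{summary}]"


def stub_aggregate(answers):
    return " + ".join(answers)


summaries = {0: "family summary", 1: "workplace summary", 2: "school summary"}
final_answer = run_graphrag_query(
    "Who does Bart interact with?", summaries, {0, 2}, stub_answer, stub_aggregate
)
print(final_answer)
```

Each map call is independent of the others, so the number of LLM calls per query grows with the number of relevant communities; filtering by community ID is what keeps that number bounded.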


**Example usage:**

```
query_engine = GraphRAGQueryEngine(graph_store=graph_store, llm=llm, index=index)

response = query_engine.query("query")
```

In [16]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core import PropertyGraphIndex
import re


class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    index: PropertyGraphIndex
    llm: LLM
    similarity_top_k: int = 20

    def custom_query(self, query_str: str) -> str:
        """Process all community summaries to generate answers to a specific query."""

        entities = self.get_entities(query_str, self.similarity_top_k)

        community_ids = self.retrieve_entity_communities(
            self.graph_store.entity_info, entities
        )
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for community_id, community_summary in community_summaries.items()
            if community_id in community_ids
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def get_entities(self, query_str, similarity_top_k):
        nodes_retrieved = self.index.as_retriever(
            similarity_top_k=similarity_top_k
        ).retrieve(query_str)

        entities = set()
        pattern = (
            r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"
        )

        for node in nodes_retrieved:
            matches = re.findall(
                pattern, node.text, re.MULTILINE | re.IGNORECASE
            )

            for match in matches:
                subject = match[0]
                obj = match[2]
                entities.add(subject)
                entities.add(obj)

        return list(entities)

    def retrieve_entity_communities(self, entity_info, entities):
        """
        Retrieve cluster information for given entities, allowing for multiple clusters per entity.

        Args:
        entity_info (dict): Dictionary mapping entities to their cluster IDs (list).
        entities (list): List of entity names to retrieve information for.

        Returns:
        List of community or cluster IDs to which an entity belongs.
        """
        community_ids = []

        for entity in entities:
            if entity in entity_info:
                community_ids.extend(entity_info[entity])

        return list(set(community_ids))

    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response
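
To see what `get_entities` extracts, the sketch below runs the same regular expression on a hand-written string. The `retrieved_text` value is an assumption about what retrieved node text can look like (one `subject -> relation -> object` triple per line), and the `lives at` relation is invented for the example.

```python
import re

# Same pattern as in get_entities: "subject -> relation -> object" on one line.
pattern = r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"

# Hypothetical retrieved text with one triple per line.
retrieved_text = (
    "Homer Simpson -> receives payment from -> Clerk\n"
    "Bart Simpson -> lives at -> Simpson home"
)

entities = set()
for subject, _relation, obj in re.findall(pattern, retrieved_text, re.MULTILINE | re.IGNORECASE):
    entities.add(subject)
    entities.add(obj)

print(sorted(entities))
```

The entity names recovered this way are then looked up in `entity_info`, so a name only resolves to a community when it matches the graph's entity name exactly.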

## Build End-to-End GraphRAG Pipeline

Now that we have defined all the necessary components, let’s construct the GraphRAG pipeline:

1. Create nodes/chunks from the text.
2. Build a PropertyGraphIndex using `GraphRAGExtractor` and `GraphRAGStore`.
3. Construct communities and generate a summary for each community using the graph built above.
4. Create a `GraphRAGQueryEngine` and begin querying.

### Build PropertyGraphIndex using `GraphRAGExtractor` and `GraphRAGStore`

In [17]:
kg_extractor = GraphRAGExtractor(
    # LLM model to use for extracting triplets
    llm=llm,
    # Prompt passed to the LLM to extract triplets
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    # Maximum number of triplets to extract per chunk
    max_paths_per_chunk=2,
    # Function to parse the output of the LLM
    parse_fn=parse_fn,
)

In [18]:
# Note: used to be `Neo4jPGStore`
graph_store = GraphRAGStore(
    username="neo4j", 
    password="neo4j123",
    # Copy from Neo4j desktop
    url="neo4j://127.0.0.1:7687",
    # database name
    database="simpsons",
)
# LLM model to use for generating community summaries
graph_store.llm = llm

In [19]:

index = PropertyGraphIndex(
    # Documents to index
    nodes=nodes,
    # Embedding model to use for indexing
    embed_model=ollama_embedding,
    # LLM model to use for querying
    llm=llm,
    # Knowledge graph extractor
    kg_extractors=[kg_extractor],
    # Graph store and community and community summary building logic
    property_graph_store=graph_store,
    show_progress=True,
)

Extracting paths from text: 100%|██████████| 8/8 [05:54<00:00, 44.29s/it]
Generating embeddings: 100%|██████████| 1/1 [00:03<00:00,  3.59s/it]
Generating embeddings: 100%|██████████| 12/12 [00:04<00:00,  2.79it/s]


In [20]:
index.property_graph_store.get_triplets()[10]

[EntityNode(label='Person', embedding=None, properties={'id': 'Homer Simpson', 'entity_description': "Bart's father, a well-meaning but often incompetent and gluttonous man. He works at the Springfield Nuclear Power Plant and is a frequent source of comedic situations.", 'triplet_source_id': '12408fcb-c66a-4f21-9442-7014650c045a'}, name='Homer Simpson'),
 Relation(label='receives payment from', source_id='Homer Simpson', target_id='Clerk', properties={'triplet_source_id': '12408fcb-c66a-4f21-9442-7014650c045a', 'relationship_description': 'Homer Simpson receives his paycheck from the Clerk at the Personnel Office.'}),
 EntityNode(label='Person', embedding=None, properties={'id': 'Clerk', 'entity_description': 'An employee at the personnel office responsible for distributing paychecks and handling administrative tasks.', 'triplet_source_id': '12408fcb-c66a-4f21-9442-7014650c045a'}, name='Clerk')]

In [21]:
index.property_graph_store.get_triplets()[10][0].properties

{'id': 'Homer Simpson',
 'entity_description': "Bart's father, a well-meaning but often incompetent and gluttonous man. He works at the Springfield Nuclear Power Plant and is a frequent source of comedic situations.",
 'triplet_source_id': '12408fcb-c66a-4f21-9442-7014650c045a'}

In [22]:
index.property_graph_store.get_triplets()[10][1].properties

{'triplet_source_id': '12408fcb-c66a-4f21-9442-7014650c045a',
 'relationship_description': 'Homer Simpson receives his paycheck from the Clerk at the Personnel Office.'}

### Build communities

This creates the communities and generates a summary for each one.

In [23]:
index.property_graph_store.build_communities()

### Create QueryEngine

In [28]:
query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store,
    # llm to answer the query given community summaries
    llm=llm,
    index=index,
    similarity_top_k=10,
)

In [30]:
from IPython.display import display, Markdown

In [32]:
nest_asyncio.apply()

response = query_engine.query(
    "What's the character that has the greatest amount of co-occurrences with Bart Simpson?"
)
display(Markdown(f"{response.response}"))

Based on the provided summaries, **Homer Simpson** has the greatest number of co-occurrences with Bart Simpson. This is supported by multiple points: they share a residence at the Simpson home, Homer expresses a desire to adopt Bart, and they have direct interactions, including Bart calling Homer. Furthermore, they share a belief in Santa Claus, with Homer even portraying him, creating a significant connection. While Bart also has a close relationship with Lisa Simpson (being her brother), the summaries highlight the more frequent and direct interactions between Homer and Bart.

In [34]:
nest_asyncio.apply()

response = query_engine.query(
    "What are the top 5 themes discussed in the episodes?"
)
display(Markdown(f"{response.response}"))

Homer Simpson episodes frequently explore themes of **Family & Relationships**, particularly the dynamics between Homer, Marge, and their children, Bart and Lisa. **Work & Employment** is a consistent element, often highlighting the mundane or absurd aspects of Homer's jobs at the Springfield Nuclear Power Plant and Santa's Workshop. **Christmas & the Holiday Season** is a major recurring theme, encompassing traditions, celebrations, and Homer's aspirations to embody the spirit of Santa Claus.  **Hope & Belief**, especially regarding fantastical concepts like "Whiirlwind" and Santa, and **Bart's Mischief & Playfulness** are also prominent, often intertwined with his relationships and desires for self-expression.  Furthermore, the **cultural impact of Santa Claus** and the **community of Elf County** are recurring elements, often explored through Homer's role-playing and Bart's childhood wonder.

# Querying (once indexed)

In [20]:
graph_store = GraphRAGStore(
    username="neo4j", 
    password="neo4j123",
    # Copy from Neo4j desktop
    url="neo4j://127.0.0.1:7687",
    # database name
    database="simpsons",
)
# LLM model to use for generating community summaries
graph_store.llm = llm
# load from existing graph/vector store
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    embed_model=ollama_embedding,
    llm=llm,
    # embed_kg_nodes=True,
)
# index = PropertyGraphIndex(
#     # Documents to index
#     nodes=nodes,
#     # Embedding model to use for indexing
#     embed_model=ollama_embedding,
#     # LLM model to use for querying
#     llm=llm,
#     # Knowledge graph extractor
#     kg_extractors=[kg_extractor],
#     # Graph store and community and community summary building logic
#     property_graph_store=graph_store,
#     show_progress=True,
# )
query_engine = GraphRAGQueryEngine(
    graph_store=graph_store,
    llm=llm,
    index=index,
    similarity_top_k=10,
)

In [21]:
from IPython.display import display, Markdown

In [22]:
nest_asyncio.apply()

response = query_engine.query(
    "What's the character that has the greatest amount of co-occurrences with Bart Simpson?"
)
display(Markdown(f"{response.response}"))

TypeError: argument of type 'NoneType' is not iterable

This error is raised because `entity_info` on the freshly created `GraphRAGStore` is still `None`: communities have not been built in this session, so `retrieve_entity_communities` has nothing to look the entities up in. Running `graph_store.build_communities()` before querying populates `entity_info` and the community summaries.

In [None]:
nest_asyncio.apply()

response = query_engine.query(
    "What are the top 5 themes discussed in the episodes?"
)
display(Markdown(f"{response.response}"))

Based on the provided summaries, recurring themes in *The Simpsons* episodes involving specific characters include:

**Homer Simpson:** Family dynamics, workplace mishaps and employment, addiction and coping mechanisms (Duff Beer), material desires and financial struggles, and conflict with authority (particularly Montgomery Burns).

**Marge Simpson:** Family bonds, education and mentorship (both receiving and imparting), activities and leisure, Springfield life, and bowling.

**Lisa Simpson:** Family relationships, education and mentorship, personal interests and passions, social interactions and gift-giving, and adventure and exploration.

**Bart Simpson:** School and education (and its consequences), rebellion and mischief, family dynamics, pop culture and fandom, and location-specific adventures within Springfield.

**Dewey Largo:** Music education and the relationship with Lisa Simpson.

It's important to note that these themes are inferred from the provided summaries and a more comprehensive analysis of individual episodes would likely reveal additional recurring elements.