# **Knowledge Representation in RAG methods**

Contributors:
* Szymon Pająk
* Tomasz Ogiołda

## Temporary notes

### Plan

1. Introduction
2. Background
  - What is RAG? Why is it used?
  - What kinds of knowledge representation can RAG use?
    - Vectorized embeddings
    - Knowledge graph
    - Combination of both
    - Comparison https://neo4j.com/blog/genai/graphrag-manifesto/

  - Explain the dataflow for both knowledge representations (the whole process, from raw data, to querying the knowledge database)
3. Demo

Tools to be used:

- langchain?
- neo4j

4. Resources

- https://neo4j.com/blog/genai/graphrag-manifesto/
- https://neo4j.com/blog/developer/langchain4j-graphrag-vector-stores-retrievers/
- https://neo4j.com/blog/genai/what-is-retrieval-augmented-generation-rag/
- https://neo4j.com/blog/developer/knowledge-graph-rag-application/
- https://neo4j.com/blog/news/graphrag-ecosystem-tools/

---

## **Introduction**

#### Agenda Overview
1.  **What is RAG & Why Knowledge Representation Matters**
2.  **Common Knowledge Representation Options for RAG:**
    *   Vector Embeddings
    *   Knowledge Graphs
    *   The Hybrid: GraphRAG
3.  **Demo:** Building a Knowledge Graph from Spotify Data (using our notebook!)
4.  **The GraphRAG Ecosystem & Future**
5.  **Conclusion & Q&A**

#### Hook
*   "Imagine asking an AI: 'Find me a cheerful song from the 90s by an artist like Queen, but not actually Queen, and something I haven't heard a million times.' How can we build AI that *truly* understands and navigates complex requests over *your* specific data?"
*   "Large Language Models (LLMs) are powerful, but they don't know everything, especially about recent events or your private information. How can we bridge this gap without constant, costly retraining?"

#### What is RAG?
*   **RAG = Retrieval Augmented Generation**
*   It's a technique to enhance LLM responses by first retrieving relevant information from an external knowledge base and providing it to the LLM as context.
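
The retrieve-then-generate flow can be sketched in a few lines. This is a minimal illustration, not the demo's implementation: the word-overlap retriever, the `echo_llm` placeholder, and the toy `KNOWLEDGE_BASE` are all assumptions made up for this example, and a real system would call an embedding model and an LLM instead.

```python
import re
from typing import Callable

# Toy knowledge base; the sentences are illustrative, not taken from the demo dataset.
KNOWLEDGE_BASE = [
    "Queen released the album A Night at the Opera in 1975.",
    "Neo4j is a property graph database.",
    "Cosine similarity compares the direction of two vectors.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Stand-in retriever: rank documents by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Augment the user question with the retrieved context."""
    context_block = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


def rag_answer(query: str, documents: list[str], llm: Callable[[str], str], top_k: int = 2) -> str:
    """Retrieve, augment, then generate."""
    return llm(build_prompt(query, retrieve(query, documents, top_k)))


def echo_llm(prompt: str) -> str:
    """Placeholder for a real model call: returns the prompt it was given."""
    return prompt


question = "Which album did Queen release in 1975?"
print(rag_answer(question, KNOWLEDGE_BASE, echo_llm, top_k=1))
```

Swapping `retrieve` for a vector or graph retriever changes the quality of the context but not the overall shape of the pipeline.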


*   **Why RAG?**
    *   **Reduces Hallucinations:** LLMs are less likely to make things up if they have relevant facts.
    *   **Access to Current Data:** Overcomes knowledge cut-offs.
    *   **Domain-Specific Knowledge:** Allows LLMs to answer questions about private or specialized data.
    *   **Cost-Effective:** Cheaper than fine-tuning an LLM for every new piece of information.
    *   **Verifiability:** Users can often see the source of the information.

#### The "Knowledge" in RAG
*   The effectiveness of RAG heavily depends on **how we store, organize, and retrieve this external knowledge.**
*   This is where **Knowledge Representation** comes in. It's about choosing the right structure for your data so the "Retrieval" part of RAG is smart and efficient.

## Knowledge Representation Options in RAG

#### Option 1: Vectorized Embeddings
*   **Concept:**
    *   Text (documents, sentences, words) is converted into dense numerical vectors (embeddings).
    *   These vectors capture semantic meaning – similar concepts have vectors that are close together in "vector space."
*   **How it works in RAG:**
    1.  Your documents are chunked and each chunk is embedded. These embeddings are stored in a Vector Database.
    2.  The user's query is also embedded.
    3.  A similarity search (e.g., cosine similarity) is performed to find the document chunks most similar to the query.
    4.  These chunks are passed to the LLM as context.
*   **Pros:**
    *   Excellent for semantic similarity searches ("find me documents about X").
    *   Relatively mature technology and many tools available.
    *   Can be straightforward to implement for basic RAG.
*   **Cons:**
    *   **Lost relationships:** Chunk-level embeddings do not encode explicit links between entities, so nuanced relationships and context that spans several chunks can be lost.
    *   **Black Box:** Similarity scores don't always explain *why* something is relevant.
    *   Retrieves chunks, which might not be the most efficient or complete context.
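
Steps 2 to 4 of the vector flow can be sketched with plain Python. The three-dimensional vectors below are hand-made stand-ins for real embeddings (an assumption for illustration only); a vector database performs the same ranking with approximate nearest-neighbor indexes.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, chunk) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical chunk store: (chunk text, toy embedding).
INDEX = [
    ("upbeat pop song about summer", [0.9, 0.1, 0.0]),
    ("slow ballad about heartbreak", [0.1, 0.9, 0.0]),
    ("guide to graph databases", [0.0, 0.1, 0.9]),
]

query_vector = [0.8, 0.2, 0.0]  # pretend embedding of "cheerful summer music"
for score, chunk in top_k_chunks(query_vector, INDEX, k=2):
    print(f"{score:.3f}  {chunk}")
```

The score says how close a chunk is, but not why, which is the explainability gap noted in the cons above.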

#### Option 2: Knowledge Graphs (KGs)
*   **Concept:**
    *   Information is represented as a network of:
        *   **Nodes (Entities):** Things, concepts, people (e.g., "Song," "Artist," "Genre").
        *   **Edges (Relationships):** How entities are connected (e.g., an `Artist` node `PERFORMED` a `Song` node).
        *   **Properties:** Attributes of nodes and relationships (e.g., a `Song` node has a `title` property).
    *   **(Reference: https://neo4j.com/blog/genai/graphrag-manifesto/, https://neo4j.com/blog/developer/knowledge-graph-rag-application/)**
*   **How it works in RAG:**
    1.  Your data is modeled and ingested into a Graph Database (like Neo4j).
    2.  The user's query can be parsed to identify entities and relationships.
    3.  The system can traverse the graph, following relationships to find highly relevant and contextual subgraphs.
    4.  This rich, structured context is passed to the LLM.
*   **Pros:**
    *   **Explicit Relationships:** Captures how information is interconnected, leading to more precise retrieval.
    *   **Context-Rich:** Retrieves subgraphs that provide a fuller picture, not just isolated chunks.
    *   **Explainable:** The path through the graph can explain *why* information is relevant.
    *   **Powerful for Complex Queries:** Can answer questions that require hopping across multiple entities and relationships.
*   **Cons:**
    *   Can be more complex to design and build the initial graph schema.
    *   Requires a different way of thinking about data (connections first).
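
To make nodes, edges, properties, and multi-hop traversal concrete, here is a small in-memory sketch. `TinyGraph` and its methods are hypothetical helpers written for this example; a graph database such as Neo4j provides the same ideas with persistence, indexing, and a query language.

```python
class TinyGraph:
    """A minimal property graph: labeled nodes with properties, typed directed edges."""

    def __init__(self):
        self.nodes = {}   # node_id -> {"label": ..., **properties}
        self.edges = {}   # source node_id -> list of (relationship type, target node_id)

    def add_node(self, node_id: str, label: str, **properties) -> None:
        self.nodes[node_id] = {"label": label, **properties}

    def add_edge(self, source: str, rel_type: str, target: str) -> None:
        self.edges.setdefault(source, []).append((rel_type, target))

    def neighbors(self, source: str, rel_type: str) -> list[str]:
        """Follow outgoing edges of one relationship type."""
        return [target for rel, target in self.edges.get(source, []) if rel == rel_type]


def genres_for_artist(graph: TinyGraph, artist_id: str) -> list[tuple]:
    """Two-hop traversal Artist -PERFORMED-> Song -HAS_GENRE-> Genre, returning full paths."""
    paths = []
    for song in graph.neighbors(artist_id, "PERFORMED"):
        for genre in graph.neighbors(song, "HAS_GENRE"):
            paths.append((artist_id, "PERFORMED", song, "HAS_GENRE", genre))
    return paths


graph = TinyGraph()
graph.add_node("artist:queen", "Artist", name="Queen")
graph.add_node("song:bohemian_rhapsody", "Song", title="Bohemian Rhapsody")
graph.add_node("genre:rock", "Genre", name="rock")
graph.add_edge("artist:queen", "PERFORMED", "song:bohemian_rhapsody")
graph.add_edge("song:bohemian_rhapsody", "HAS_GENRE", "genre:rock")

for path in genres_for_artist(graph, "artist:queen"):
    print(" -> ".join(path))
```

Each returned path doubles as an explanation of why the genre is relevant to the artist, which is the explainability benefit listed above.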

#### Option 3: The Hybrid Approach - GraphRAG
*   **Concept: Combining the strengths of KGs and Vector Embeddings.**
    *   **(Reference: https://neo4j.com/blog/genai/graphrag-manifesto/, https://neo4j.com/blog/developer/langchain4j-graphrag-vector-stores-retrievers/)**
*   **How it can work:**
    *   Store rich, structured data in a Knowledge Graph.
    *   Generate vector embeddings for text properties within the graph (e.g., song lyrics, album descriptions) or even for nodes/subgraphs.
    *   **Use cases:**
        1.  Use KG traversal for structured queries and then vector search *within* the retrieved nodes for semantic details.
        2.  Use vector search to find initial entry points into the graph, then expand with graph traversal to get more context.
        3.  Embed and search for graph patterns or subgraphs.
*   **Benefits:**
    *   **Best of Both Worlds:** Precise, contextual retrieval from KGs + powerful semantic search from vectors.
    *   Handles a wider variety of queries.
    *   Leads to more nuanced and accurate information retrieval for the LLM.
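
Use case 2 (vector search for entry points, then graph expansion) can be sketched as follows. The song titles, artists, two-dimensional vectors, and helper names are invented for this illustration; in the demo stack, the vector index and the traversal would both live in the graph database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical data: song_id -> (title, toy lyrics embedding).
SONGS = {
    "s1": ("Sunny Road", [0.9, 0.1]),
    "s2": ("Rainy Night", [0.1, 0.9]),
}

# Graph facts as (start, relationship type, end) triples.
TRIPLES = [
    ("Artist A", "PERFORMED", "s1"),
    ("s1", "HAS_GENRE", "pop"),
    ("s1", "EVOKES", "joy"),
    ("Artist B", "PERFORMED", "s2"),
    ("s2", "HAS_GENRE", "blues"),
]


def vector_entry_points(query_vec: list[float], k: int = 1) -> list[str]:
    """Step 1: semantic search picks the songs closest to the query."""
    ranked = sorted(SONGS, key=lambda sid: cosine(query_vec, SONGS[sid][1]), reverse=True)
    return ranked[:k]


def expand(song_id: str) -> list[tuple[str, str, str]]:
    """Step 2: collect every relationship that touches the entry node."""
    return [triple for triple in TRIPLES if song_id in (triple[0], triple[2])]


def hybrid_retrieve(query_vec: list[float], k: int = 1) -> list[dict]:
    """Combine both steps into context that could be handed to an LLM."""
    context = []
    for song_id in vector_entry_points(query_vec, k):
        facts = [f"({start})-[:{rel}]->({end})" for start, rel, end in expand(song_id)]
        context.append({"song": SONGS[song_id][0], "facts": facts})
    return context


print(hybrid_retrieve([1.0, 0.0]))
```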

#### Comparison
| Feature            | Vector Embeddings          | Knowledge Graphs       | GraphRAG (Hybrid)        |
|--------------------|----------------------------|------------------------|--------------------------|
| **Primary Use**    | Semantic Similarity        | Explicit Relationships | Both                     |
| **Context**        | Chunk-based                | Rich, Interconnected   | Very Rich, Multi-faceted |
| **Explainability** | Low (similarity score)     | High (path traversal)  | High                     |
| **Query Type**     | Keyword, Semantic          | Complex, Relational    | Broadest Range           |
| **Data Structure** | Unstructured / Semi-structured | Highly Structured  | Structured + Embeddings  |

*   **(More insights in: https://neo4j.com/blog/genai/graphrag-manifesto/)**

## Demo: Building a Knowledge Graph for Music Recommendation RAG

#### Goal
*   To demonstrate how we can take tabular data (like our Spotify dataset) and transform it into a connected Knowledge Graph in Neo4j.
*   This graph will then serve as the "Knowledge Base" that a RAG system could use to answer music-related queries.

#### 1. Setup & Initialization
*   Importing necessary libraries (Neo4j, Google Generative AI, Pandas).
*   Configuring API keys and Neo4j connection details.
*   Initializing the embedding model (`text-embedding-004`) and the generative LLM (`gemini-1.5-flash-latest`).
    *   *Embedding Model:* Will be used (conceptually in a full RAG) to turn text like lyrics or queries into vectors.
    *   *Generative LLM:* Will be used (conceptually) to generate the final natural language response based on retrieved context.

##### Code

In [8]:
!pip install neo4j neo4j_graphrag[vertexai]



In [5]:
from google.colab import userdata

NEO4J_URI = userdata.get('NEO4J_URI')
NEO4J_PASS = userdata.get('NEO4J_PASS')
NEO4J_DB_USER = userdata.get('NEO4J_DB_USER')
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

In [13]:
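from neo4j import GraphDatabase

def get_db():
    # Do not create the driver in a `with` block here: the context manager would
    # close it on return and hand the caller an unusable driver.
    # The caller is responsible for calling driver.close() when done.
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_DB_USER, NEO4J_PASS))
    driver.verify_connectivity()
    return driver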

In [None]:
import kagglehub
import pandas as pd

path = kagglehub.dataset_download("devdope/900k-spotify")
songs_csv_path = path + '/spotify_dataset.csv'
full_df = pd.read_csv(songs_csv_path)

In [None]:
# Sample 20k songs reproducibly and keep only the columns used in the graph.
df = full_df.sample(20000, random_state=9)
df = df[['Artist(s)','song', 'text', 'emotion', 'Length', 'Album', 'Genre', 'Energy', 'Popularity', 'Danceability', 'Positiveness']].copy()

# Rescale the audio features by 100 (this assumes the raw columns are on a 0-100 scale).
feature_cols = ['Energy', 'Popularity', 'Danceability', 'Positiveness']
df[feature_cols] = df[feature_cols].astype(int) / 100

#### 2. Data Loading & Preparation
*   We're using a Spotify dataset with ~900k songs (we'll sample 20k for the demo).
*   Loading into a Pandas DataFrame.
*   It includes information like:
    *   `Artist(s)`, `song` title, `text` (lyrics), `emotion`, `Length`, `Album`, `Genre`, `Energy`, `Popularity`, `Danceability`, `Positiveness`.

In [1]:
df.head(3)


### Building vector indexes

In [11]:
!pip install vertexai
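
A vector index has to exist before a vector retriever can query it. The sketch below only builds the Cypher statement, so it runs without a database; the statement follows the Neo4j 5 `CREATE VECTOR INDEX` syntax as we understand it. The index name `song_embeddings`, the `Song` label, and the `embedding` property are assumptions chosen for this example, and 384 is the output size of the `all-MiniLM-L6-v2` model.

```python
def vector_index_cypher(index_name: str, label: str, prop: str,
                        dimensions: int, similarity: str = "cosine") -> str:
    """Build a CREATE VECTOR INDEX statement (Neo4j 5 syntax) for one node property."""
    if similarity not in {"cosine", "euclidean"}:
        raise ValueError(f"unsupported similarity function: {similarity}")
    return (
        f"CREATE VECTOR INDEX `{index_name}` IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {dimensions}, "
        f"`vector.similarity_function`: '{similarity}'"
        "}}"
    )


statement = vector_index_cypher("song_embeddings", "Song", "embedding", dimensions=384)
print(statement)
```

In the notebook, the statement could be executed with `session.run(statement)` on an open session. The `dimensions` value must match the embedding model that produces the stored vectors.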



### GraphRAG

In [19]:
from google.colab import auth
import vertexai

auth.authenticate_user()

vertexai.init(
    project="krr-rag"
)

In [28]:
from neo4j_graphrag.retrievers import VectorRetriever, Text2CypherRetriever
from neo4j_graphrag.llm import LLMInterface
from neo4j_graphrag.llm.types import LLMResponse
from neo4j_graphrag.generation import GraphRAG
from neo4j_graphrag.embeddings import SentenceTransformerEmbeddings
from neo4j_graphrag.indexes import create_vector_index
import google.generativeai as genai


class GeminiLLM(LLMInterface):
    """Thin adapter that exposes Gemini through the neo4j_graphrag LLM interface."""

    def __init__(self, model_name: str, generation_config: dict = None):
        super().__init__(model_name=model_name)
        genai.configure(api_key=GOOGLE_API_KEY)

        self.model = genai.GenerativeModel(
            model_name=model_name, generation_config=generation_config
        )

    def invoke(self, input: str, *args, **kwargs) -> LLMResponse:
        # GraphRAG reads `.content` from the result, so wrap the text in an LLMResponse.
        # Extra arguments (e.g. message history) are accepted but ignored here.
        response = self.model.generate_content(input)
        return LLMResponse(content=response.text)

    async def ainvoke(self, input: str, *args, **kwargs) -> LLMResponse:
        response = await self.model.generate_content_async(input)
        return LLMResponse(content=response.text)


# Placeholder index name; the label and property below assume that Song nodes
# carry their lyrics vector in an `embedding` property.
INDEX_NAME = "song_embeddings"

driver = get_db()

embedder = SentenceTransformerEmbeddings(model="all-MiniLM-L6-v2")
llm = GeminiLLM(model_name="gemini-1.5-flash-001")

# The retriever raises "No index with name ... found" unless the index exists.
create_vector_index(
    driver,
    INDEX_NAME,
    label="Song",
    embedding_property="embedding",
    dimensions=384,  # output size of all-MiniLM-L6-v2
    similarity_fn="cosine",
)

vector_retriever = VectorRetriever(driver, INDEX_NAME, embedder)
vector_rag = GraphRAG(retriever=vector_retriever, llm=llm)

# Schema of the music graph built in the ingestion step.
neo4j_schema = """
Node labels and properties:
Song {id, title, lyrics, time_length, energy, popularity, danceability, positiveness}
Artist {name}
Album {name}
Genre {name}
Emotion {name}
Relationships:
(:Artist)-[:PERFORMED]->(:Song)
(:Song)-[:APPEARS_ON]->(:Album)
(:Song)-[:HAS_GENRE]->(:Genre)
(:Song)-[:EVOKES]->(:Emotion)
"""
examples = [
    "USER INPUT: 'Which artists performed the song Bohemian Rhapsody?' QUERY: MATCH (ar:Artist)-[:PERFORMED]->(s:Song) WHERE s.title = 'Bohemian Rhapsody' RETURN ar.name"
]
graph_retriever = Text2CypherRetriever(
    driver=driver,
    llm=llm,
    neo4j_schema=neo4j_schema,
    examples=examples,
)

graph_rag = GraphRAG(retriever=graph_retriever, llm=llm)


# Query the vector index (returns nothing useful until songs have been ingested)
query_text = "Recommend a cheerful song about summer."
response = vector_rag.search(query_text=query_text, retriever_config={"top_k": 5})
print(response.answer)

In [31]:

embedder.embed_query("How do I do similarity search in Neo4j?")
llm.invoke("Hello my friend, tell me something about you").content

"Hello there! It's nice to meet you. \n\nI am a large language model, trained by Google. \n\nHere's a little more about me:\n\n* **I'm trained on a massive dataset of text and code.** This allows me to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. \n* **I'm still under development.** I'm constantly learning and improving, and I'm always excited to see what new things I can do.\n* **I don't have personal opinions or beliefs.**  I'm designed to be objective and unbiased, and I'll always try to provide you with the most accurate and helpful information.\n* **I'm not a person.** I'm a computer program, and I don't have feelings or emotions.\n\nI'm here to assist you with any questions or tasks you might have. What can I do for you today? 😊 \n"

In [None]:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed


def to_float(value):
    """Convert a numeric cell to float; NaN/None becomes None (a null in Neo4j)."""
    return float(value) if pd.notna(value) else None


def write_song(session, song_id, song_title, lyrics, time_length, energy, popularity, danceability, positiveness):
    # Embed the lyrics so the Song node can later be found through a vector index.
    # (`embedding` as the property name is an assumption; keep it in sync with the index.)
    embedding = embedder.embed_query(lyrics)

    # MERGE keeps ingestion idempotent; a plain SET runs on both create and match.
    session.run("""
        MERGE (s:Song {id: $song_id})
        SET s.title = $song_title,
            s.lyrics = $lyrics,
            s.time_length = $time_length,
            s.energy = $energy,
            s.popularity = $popularity,
            s.danceability = $danceability,
            s.positiveness = $positiveness,
            s.embedding = $embedding
    """, song_id=song_id, song_title=song_title, lyrics=lyrics,
        time_length=time_length, energy=to_float(energy),
        popularity=to_float(popularity),
        danceability=to_float(danceability),
        positiveness=to_float(positiveness),
        embedding=embedding)

def process_song_row(row_data, song_id, driver):
    try:
        with driver.session() as session:
            song_title = row_data['song']
            lyrics = str(row_data['text'])
            emotion = row_data['emotion']
            time_length = row_data['Length']
            album_name = row_data['Album']
            energy = row_data['Energy']
            popularity = row_data['Popularity']
            danceability = row_data['Danceability']
            positiveness = row_data['Positiveness']

            write_song(session, song_id, song_title, lyrics, time_length, energy, popularity, danceability, positiveness)

            # Artists
            artist_names = []
            if pd.notna(row_data['Artist(s)']):
                artist_names = [name.strip() for name in str(row_data['Artist(s)']).split(',')]
            for artist_name in artist_names:
                if artist_name:
                    session.run("""
                        MERGE (ar:Artist {name: $artist_name})
                        WITH ar
                        MATCH (s:Song {id: $song_id})
                        MERGE (ar)-[:PERFORMED]->(s)
                    """, artist_name=artist_name, song_id=song_id)

            # Album
            if pd.notna(album_name) and album_name.strip():
                session.run("""
                    MERGE (al:Album {name: $album_name})
                    WITH al
                    MATCH (s:Song {id: $song_id})
                    MERGE (s)-[:APPEARS_ON]->(al)
                """, album_name=album_name.strip(), song_id=song_id)

            # Genre
            genre_names = []
            if pd.notna(row_data['Genre']):
                genre_names = [name.strip() for name in str(row_data['Genre']).split(',')]
            for genre_name in genre_names:
                if genre_name:
                    session.run("""
                        MERGE (g:Genre {name: $genre_name})
                        WITH g
                        MATCH (s:Song {id: $song_id})
                        MERGE (s)-[:HAS_GENRE]->(g)
                    """, genre_name=genre_name, song_id=song_id)

            # Emotion
            if pd.notna(emotion) and emotion.strip():
                session.run("""
                    MERGE (e:Emotion {name: $emotion})
                    WITH e
                    MATCH (s:Song {id: $song_id})
                    MERGE (s)-[:EVOKES]->(e)
                """, emotion=emotion.strip(), song_id=song_id)
        # The driver is shared across tasks, so it is not closed here.
        return f"Successfully processed song ID: {song_id}"
    except Exception as e:
        return f"Error processing song ID {song_id}: {e}"


def ingest_music_data_multithreaded(df, db_uri, db_user, db_password, max_workers=8):
    print(f"Starting multithreaded data ingestion for {len(df)} songs with {max_workers} workers...")

    # One driver is shared by all workers (drivers are thread-safe, sessions are not).
    # Each task opens its own session; the driver closes after all tasks finish.
    with GraphDatabase.driver(db_uri, auth=(db_user, db_password)) as driver, \
         ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for idx, row in df.iterrows():
            song_id = str(idx) # Ensure ID is a string for consistency
            futures.append(executor.submit(process_song_row, row, song_id, driver))

        processed_count = 0
        for future in as_completed(futures):
            result = future.result()
            processed_count += 1
            if "Error" in result:
                print(f"Error: {result}")
            # else:
            #     print(result) # Uncomment if you want to see success messages

            if processed_count % 1000 == 0:
                print(f"Processed {processed_count}/{len(df)} songs.")

    print(f"Finished multithreaded data ingestion. Total processed: {processed_count}/{len(df)} songs.")

def create_constraints():
    db = get_db()
    try:
        with db.session() as session:
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (s:Song) REQUIRE s.id IS UNIQUE")
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (ar:Artist) REQUIRE ar.name IS UNIQUE")
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (al:Album) REQUIRE al.name IS UNIQUE") # Album names might not be unique across artists
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (g:Genre) REQUIRE g.name IS UNIQUE")
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (e:Emotion) REQUIRE e.name IS UNIQUE")

            # Uniqueness constraints already come with a backing index, so only
            # Song.title needs an explicit one.
            session.run("CREATE INDEX IF NOT EXISTS FOR (s:Song) ON (s.title)")
    finally:
        db.close()

    print("Neo4j constraints and basic indexes for music graph ensured.")

create_constraints()
ingest_music_data_multithreaded(df.iloc[0:20], NEO4J_URI, NEO4J_DB_USER, NEO4J_PASS)


#### Why a Knowledge Graph for this data?
*   Music is inherently connected!
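    *   Artists perform Songs (`PERFORMED`).
    *   Songs appear on Albums (`APPEARS_ON`).
    *   Songs have Genres (`HAS_GENRE`).
    *   Songs can evoke Emotions (`EVOKES`).
    *   Artists collaborate with other Artists. This is not stored as its own relationship; it is implied when two artists `PERFORMED` the same Song.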
*   A KG allows us to model these relationships explicitly, enabling powerful contextual queries that are hard with just tables or vectors alone.
    *   "Find rock songs by artists who also play blues and have collaborated with..."
    *   "Show me upbeat songs from albums that were popular in a specific year."
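
The implicit collaboration relationship can be derived from shared songs. The sketch below does this over raw `Artist(s)` strings with made-up artist names. In the graph itself, a pattern such as `MATCH (a:Artist)-[:PERFORMED]->(:Song)<-[:PERFORMED]-(b:Artist)` would express the same idea.

```python
from collections import Counter
from itertools import combinations


def collaboration_counts(artist_fields: list[str]) -> Counter:
    """Count how many songs each unordered pair of artists performed together."""
    counts = Counter()
    for field in artist_fields:
        # One field holds the comma-separated performers of a single song.
        names = sorted({name.strip() for name in field.split(",") if name.strip()})
        for pair in combinations(names, 2):
            counts[pair] += 1
    return counts


# Hypothetical `Artist(s)` values, one per song.
artist_column = [
    "Artist A, Artist B",
    "Artist B,Artist A",
    "Artist C",
    "Artist A, Artist C",
]

for (first, second), shared_songs in collaboration_counts(artist_column).most_common():
    print(f"{first} and {second}: {shared_songs} shared song(s)")
```

Sorting the names before pairing makes `("Artist A", "Artist B")` and `("Artist B", "Artist A")` count as the same collaboration.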

#### 3. Neo4j Ingestion: Building the Knowledge Graph!
*   **This is the core of transforming tabular data into a connected graph.**
*   We use a multithreaded approach for efficiency (`ingest_music_data_multithreaded`).
*   **Key function: `process_song_row`**
    *   For each song (row in the DataFrame):
        *   **`MERGE (s:Song {id: $song_id})`**: Creates a `Song` node if it doesn't exist, or matches it if it does. Sets properties like `title`, `lyrics`, `energy`, etc.
        *   **Artists:**
            *   Parses artist names.
            *   `MERGE (ar:Artist {name: $artist_name})`
            *   `MERGE (ar)-[:PERFORMED]->(s)`: **Creates the crucial `PERFORMED` relationship!**
        *   **Album:**
            *   `MERGE (al:Album {name: $album_name})`
            *   `MERGE (s)-[:APPEARS_ON]->(al)`: **Creates the `APPEARS_ON` relationship!**
        *   **Genre:**
            *   `MERGE (g:Genre {name: $genre_name})`
            *   `MERGE (s)-[:HAS_GENRE]->(g)`: **Creates the `HAS_GENRE` relationship!**
        *   **Emotion:**
            *   `MERGE (e:Emotion {name: $emotion})`
            *   `MERGE (s)-[:EVOKES]->(e)`: **Creates the `EVOKES` relationship!**
*   **Constraints & Indexes:**
    *   `CREATE CONSTRAINT IF NOT EXISTS FOR (s:Song) REQUIRE s.id IS UNIQUE` (and for Artist name, Album name, etc.) - Ensures data integrity and performance.
    *   `CREATE INDEX IF NOT EXISTS FOR (s:Song) ON (s.title)` - Speeds up lookups.
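
The row-to-graph mapping can be tried without a database. In this sketch, Python sets stand in for the graph, so adding the same node or relationship twice is a no-op, which loosely mirrors how `MERGE` avoids duplicates. The sample rows and the helper names `split_names` and `merge_row` are hypothetical.

```python
def split_names(value) -> list[str]:
    """Split a comma-separated cell into clean names; empty or missing cells give []."""
    if value is None:
        return []
    return [part.strip() for part in str(value).split(",") if part.strip()]


def merge_row(nodes: set, relationships: set, song_id: str, row: dict) -> None:
    """Add one row's nodes and relationships to the in-memory graph."""
    song = ("Song", song_id)
    nodes.add(song)
    for artist in split_names(row.get("Artist(s)")):
        nodes.add(("Artist", artist))
        relationships.add((("Artist", artist), "PERFORMED", song))
    for genre in split_names(row.get("Genre")):
        nodes.add(("Genre", genre))
        relationships.add((song, "HAS_GENRE", ("Genre", genre)))
    # Album and emotion are single values, so they are not split on commas.
    for label, column, rel_type in [("Album", "Album", "APPEARS_ON"), ("Emotion", "emotion", "EVOKES")]:
        name = str(row.get(column) or "").strip()
        if name:
            nodes.add((label, name))
            relationships.add((song, rel_type, (label, name)))


# Hypothetical rows keyed by song id, shaped like the DataFrame columns.
rows = {
    "0": {"Artist(s)": "Artist A, Artist B", "Album": "First Album", "Genre": "pop,rock", "emotion": "joy"},
    "1": {"Artist(s)": "Artist A", "Album": "First Album", "Genre": "rock", "emotion": "sadness"},
}

nodes, relationships = set(), set()
for song_id, row in rows.items():
    merge_row(nodes, relationships, song_id, row)
merge_row(nodes, relationships, "0", rows["0"])  # re-ingesting a row changes nothing

print(len(nodes), "nodes,", len(relationships), "relationships")
```

Both songs end up connected to the same `Artist A`, `First Album`, and `rock` nodes, which is what makes the graph connected rather than a set of isolated rows.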

At query time, GraphRAG triggers a graph query, which can optionally include a vector similarity component. You can store graphs and vectors separately in two distinct databases, or keep both in a graph database like Neo4j, which also supports vector indexes.

### **Ideas for the next part of the demo (TODO)**

## The Broader Ecosystem & Future

#### A Rapidly Evolving Field
*   Knowledge Representation in RAG, especially GraphRAG, is a hot area of research and development.
*   New tools, techniques, and best practices are emerging constantly.
*   **(Reference: https://neo4j.com/blog/news/graphrag-ecosystem-tools/)**

#### Key Players & Tools in the GraphRAG Ecosystem
*   *(Consider a slide with logos here)*
*   **Graph Databases:**
    *   Neo4j (leading property graph database, excellent for connected data)
*   **LLM Orchestration Frameworks:**
    *   LangChain (provides modules for building RAG pipelines, including graph components)
    *   LlamaIndex (data framework for LLM applications, supports graph structures)
*   **Embedding Models & Providers:**
    *   OpenAI, Cohere, Google (Vertex AI / Generative AI Studio - like our `text-embedding-004`), Hugging Face Sentence Transformers.
*   **LLMs for Generation:**
    *   OpenAI (GPT series), Google (Gemini family), Anthropic (Claude), Mistral, Llama models.
*   **Vector Databases (for hybrid approaches):**
    *   Pinecone, Weaviate, Milvus, Chroma, FAISS (and Neo4j itself has vector indexing capabilities).

#### Community & Open Source
*   A lot of innovation is driven by the open-source community.
*   Many libraries and integrations are available on platforms like GitHub.
*   Active discussions, blogs, and research papers are pushing the boundaries.

#### Future Trends
*   **Automated KG Construction:** LLMs helping to extract entities and relationships from unstructured text to build KGs.
*   **More Sophisticated Retrieval Strategies:** Combining graph algorithms, semantic search, and reasoning for even better context.
*   **Multi-Modal RAG:** Incorporating knowledge from images, audio, and video into KGs.
*   **Evaluation Frameworks:** Better ways to measure the quality of retrieval and generation in RAG systems.

## Conclusion

#### Recap: Key Takeaways
1.  **RAG is Essential:** It makes LLMs more factual, current, and domain-aware by connecting them to external knowledge.
2.  **Knowledge Representation is CRUCIAL:**
    *   **Vector Embeddings:** Great for semantic similarity over large text corpora.
    *   **Knowledge Graphs:** Excel at representing explicit relationships and providing rich, structured context. Ideal for complex queries.
3.  **GraphRAG (Hybrid) is Powerful:** Combining KGs with vector search offers the most robust and nuanced approach to knowledge retrieval.
4.  **Practical Application:** As shown in our demo, we can transform raw data into a connected Knowledge Graph (e.g., in Neo4j) to serve as the backbone for an advanced RAG system.

#### Future Outlook
*   The synergy between LLMs and structured knowledge (like KGs) will continue to drive innovation.
*   Expect more intelligent, context-aware, and explainable AI systems powered by these techniques.

#### Sources

1.  GraphRAG Manifesto: [https://neo4j.com/blog/genai/graphrag-manifesto/](https://neo4j.com/blog/genai/graphrag-manifesto/)
2.  Langchain4j & GraphRAG: [https://neo4j.com/blog/developer/langchain4j-graphrag-vector-stores-retrievers/](https://neo4j.com/blog/developer/langchain4j-graphrag-vector-stores-retrievers/)
3.  What is RAG?: [https://neo4j.com/blog/genai/what-is-retrieval-augmented-generation-rag/](https://neo4j.com/blog/genai/what-is-retrieval-augmented-generation-rag/)
4.  KG RAG Application: [https://neo4j.com/blog/developer/knowledge-graph-rag-application/](https://neo4j.com/blog/developer/knowledge-graph-rag-application/)
5.  GraphRAG Ecosystem: [https://neo4j.com/blog/news/graphrag-ecosystem-tools/](https://neo4j.com/blog/news/graphrag-ecosystem-tools/)