[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/Demo/Question_Answering_with_Graphs(Neo4j).ipynb)

**Overview**:
In this notebook, we demonstrate how to create a Retrieval-Augmented Generation (RAG) system using Neo4j as a graph database, Indox for language model interaction, and Hugging Face's embedding models for vector search. The notebook walks through the entire pipeline: gathering data, creating graph documents, storing them in Neo4j, and finally answering questions with a system that retrieves information from the Neo4j graph.

**Key Concepts**:


*   **Graph Database (Neo4j)**: We will store knowledge in the form of a graph, with entities as nodes and relationships between entities as edges.
*   **Indox API**: Used for leveraging language models (LLMs) to extract entities and relationships and perform question-answering.
*   **Vector Search**: Vector-based semantic search that retrieves the most relevant information based on embeddings.
*   **Keyword and Hybrid Search**: Alternative search mechanisms that use simple keyword matching or a combination of keyword and vector search.
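
To make the vector search idea concrete, here is a minimal sketch of ranking by cosine similarity. It is pure Python with toy 3-dimensional vectors; the function names and the example values are assumptions made for illustration and are not part of Indox or Neo4j.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs, top_k=2):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy "embeddings" (hypothetical values, far smaller than real 768-dimensional ones).
docs = {"chunk_a": [1.0, 0.0, 0.0], "chunk_b": [0.0, 1.0, 0.0], "chunk_c": [0.9, 0.1, 0.0]}
print(rank_by_similarity([1.0, 0.0, 0.0], docs))
```

A real system would delegate this ranking to a vector index rather than scanning every vector, but the scoring idea is the same.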

Installing Dependencies:

We begin by installing the required libraries: indox, neo4j, wikipedia, semantic_text_splitter, sentence_transformers, scikit-learn, and duckduckgo_search. These libraries let us call the Indox API, connect to Neo4j, fetch and split Wikipedia text, compute embeddings, and run web searches.

In [62]:
# Install the necessary libraries.
# Keep comments on their own lines: some shells pass inline "# ..." text to pip,
# which then fails with "ERROR: Invalid requirement: '#'".

# For Indox API interaction
!pip install indox
# For connecting to the Neo4j graph database
!pip install neo4j
# For fetching Wikipedia data
!pip install wikipedia
# For splitting text into chunks
!pip install semantic_text_splitter
# For embedding-based operations
!pip install sentence_transformers
# For cosine similarity (the PyPI package is scikit-learn; it is imported as sklearn)
!pip install scikit-learn
# For searching with DuckDuckGo
!pip install duckduckgo_search

Setting Up Connections:



*   We set up the API key for Indox and connection credentials for Neo4j. These will be used to authenticate and connect to the respective services.

In [1]:
# Replace the API key and credentials with your actual values.

INDOX_API_KEY = 'INDOX_API_KEY'  # Replace with your actual API key
NEO4J_URI = "NEO4J_URI"  # Replace with your Neo4j URI
NEO4J_USERNAME = "NEO4J_USERNAME"  # Replace with your Neo4j username
NEO4J_PASSWORD = "NEO4J_PASSWORD"  # Replace with your Neo4j password
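
As an alternative to pasting secrets into the notebook, the same four settings could be read from environment variables. The sketch below assumes the variables carry the same names as the constants above; `load_settings` is a hypothetical helper, not an Indox function.

```python
import os

def load_settings(env=os.environ):
    """Collect connection settings from a mapping, failing loudly on gaps."""
    required = ["INDOX_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {name: env[name] for name in required}

# Demonstration with an explicit dict; in practice call load_settings() with no argument.
settings = load_settings({
    "INDOX_API_KEY": "example-key",
    "NEO4J_URI": "bolt://localhost:7687",
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "example-password",
})
print(settings["NEO4J_URI"])
```

Failing early on a missing value tends to give a clearer error than a rejected connection later in the notebook.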


Fetching Data:



*   We load Wikipedia content for "Elizabeth I" using the Indox Wikipedia reader. The data will be split into smaller chunks to facilitate entity extraction and graph creation.

In [2]:
from indox.data_connectors import WikipediaReader
from indox.splitter import SemanticTextSplitter
from indox.llms import IndoxApi

# Initialize the Wikipedia reader to pull data from Wikipedia.
reader = WikipediaReader()

# Load Wikipedia content for "Elizabeth I".
documents = reader.load_content(pages=["Elizabeth I"])
# Keep only the first 500 items of the result so the demo stays small
# (the first 500 characters if load_content returns a single string).
documents = documents[:500]

# Create a basic metadata dictionary to keep track of the document source.
metadata = {
    "title": "Elizabeth I",
    "source": "https://en.wikipedia.org/wiki/Elizabeth_I"
}


Text Splitting:

*   The fetched Wikipedia content is split into smaller chunks using a text-splitting technique. This allows us to process the document more effectively by breaking it into manageable pieces.

In [3]:
splitter = SemanticTextSplitter(chunk_size=50)
document_chunks = splitter.split_text(documents)
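
To illustrate what chunking does, here is a simplified sketch that greedily packs whole sentences into chunks under a character budget. It is not the algorithm used by SemanticTextSplitter; the function name, the sentence regex, and the character-based budget are all assumptions made for illustration.

```python
import re

def split_into_chunks(text, max_chars=50):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # budget exceeded: close the current chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "Elizabeth I was Queen of England. She reigned from 1558. She died in 1603."
for chunk in split_into_chunks(sample):
    print(repr(chunk))
```

Smaller chunks give the entity extractor less text per call, at the cost of more calls and less context per chunk.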


Embedding Model Initialization:



*   We initialize the HuggingFace embedding model to generate vector embeddings for each document. These embeddings will later be used for vector-based search.

In [4]:
# Initialize the embedding model using the HuggingFaceEmbedding class.

from indox.embeddings import HuggingFaceEmbedding

embedding_model = HuggingFaceEmbedding(api_key=INDOX_API_KEY, model="multi-qa-mpnet-base-cos-v1")


INFO: Initialized HuggingFaceEmbedding with model: multi-qa-mpnet-base-cos-v1


Graph Creation using LLM:



*   The LLMGraphTransformer is initialized, and we use it to transform document chunks into graph structures by extracting entities and relationships with a language model.
* In this example, we initialize the Indox API client (llm_transformer) with max_tokens set to 10000.
* The max_tokens parameter caps the number of tokens handled in a single API call, which bounds how long the response can be.



**Why is this important?**

- LLMs limit the number of tokens they can handle in a single request. If max_tokens is set too low, the response may be cut off, leaving incomplete or invalid JSON.
- Setting max_tokens to a high value such as 10000 leaves room for the entire response (including nodes and relationships) to be generated without truncation.
- However, setting max_tokens too high may lead to slower responses or higher API costs, so choose a balanced value.
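
The truncation problem described above can be detected before the response is used. The sketch below assumes the model replies with a JSON object containing "nodes" and "relationships" keys; that response shape and the helper name `parse_graph_json` are assumptions for illustration, not the documented Indox format.

```python
import json

def parse_graph_json(raw):
    """Return the parsed dict, or None if the text is not complete, valid JSON."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

complete = '{"nodes": [{"id": "Elizabeth", "type": "Person"}], "relationships": []}'
truncated = complete[:40]  # simulates a response cut off by a low token limit

print(parse_graph_json(complete) is not None)
print(parse_graph_json(truncated) is None)
```

A caller could retry with a larger max_tokens, or with a smaller chunk, whenever the helper returns None.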


In [5]:
from indox.graph.llmgraphtransformer import LLMGraphTransformer

In [9]:
# Initialize the Indox API client
llm_transformer = IndoxApi(api_key=INDOX_API_KEY, max_tokens=10000)

# Initialize the LLMGraphTransformer with the Indox API client
transformer = LLMGraphTransformer(llm_transformer=llm_transformer, embeddings_model=embedding_model)

# Convert the document chunks into graph documents that contain nodes (entities) and relationships
graph_documents = transformer.convert_to_graph_documents(document_chunks, metadata=metadata)



INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [10]:
# Optionally, print out the graph documents to see how the entities and relationships are structured.
# This will output the graph document in dictionary form.

for graph_doc in graph_documents:
    print(graph_doc.to_dict())

{'nodes': [{'id': 'Chunk_0', 'type': 'Chunk', 'embedding': [0.016984721645712852, -0.04755743220448494, -0.015256768092513084, -0.004274148494005203, 0.049360454082489014, 0.006072521675378084, -0.059166181832551956, 0.0292876735329628, 0.025210566818714142, 0.04163852706551552, 0.034391649067401886, -0.0035680329892784357, -0.005019412375986576, -0.03713162988424301, -0.024321746081113815, -0.003699025372043252, -0.0003411942161619663, -0.016812073066830635, -0.02886049635708332, 0.013376509770751, -0.007076185196638107, -0.0009061662130989134, 0.02164304442703724, 0.011871328577399254, 0.0009609804837964475, 0.042316216975450516, -0.04530034214258194, -0.03424889221787453, -0.04024829715490341, 0.062283601611852646, 0.034163180738687515, -0.06792812049388885, 0.003124901792034507, -0.06122037023305893, 0.009800734929740429, 0.011953706853091717, 0.012485782615840435, -0.011334456503391266, -0.05158877745270729, -0.025743555277585983, -0.02961563691496849, 0.028690015897154808, -0.005
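
Because each node carries a long embedding list, the printed dictionaries are hard to read. One option is to summarize the embeddings before printing. The sketch below assumes the dictionary shape visible in the output above (a "nodes" list whose items may hold an "embedding" list); `summarize_graph_dict` is a hypothetical helper, and the example values are shortened placeholders.

```python
import copy

def summarize_graph_dict(graph_dict):
    """Return a copy in which each node's embedding list is replaced by its length."""
    summary = copy.deepcopy(graph_dict)
    for node in summary.get("nodes", []):
        if isinstance(node.get("embedding"), list):
            node["embedding"] = f"<{len(node['embedding'])} floats>"
    return summary

example = {"nodes": [{"id": "Chunk_0", "type": "Chunk", "embedding": [0.0169, -0.0475, -0.0152]}]}
print(summarize_graph_dict(example))
```

In the loop above, this would be used as `print(summarize_graph_dict(graph_doc.to_dict()))`; the original dictionary is left unchanged.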

Storing Graph in Neo4j:



*   The graph documents are stored in a Neo4j database. This step creates nodes (entities) and relationships in the graph, which can be queried later.

In [11]:
from indox.graph.graphs.neo4j_graph import Neo4jGraph

# Initialize the Neo4j connection and add the graph documents to the database.
neo4j_graph = Neo4jGraph(uri=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD)
neo4j_graph.add_graph_documents(graph_documents, base_entity_label=True, include_source=True)
# Keep the connection open: the next section queries the graph through this same object.
# Call neo4j_graph.close() once you have finished querying.


Querying Neo4j for Relationships:



*   We demonstrate how to query the Neo4j database to find relationships for a given entity. In this case, we search for parent-child relationships for Elizabeth I.

In [12]:
entity_id = "Elizabeth"  # Set the entity ID (adjust this if the actual ID is different in the graph)
relationship_type = "PARENT"  # Set the relationship type to search for (e.g., parent-child relationship)

relationships = neo4j_graph.search_relationships_by_entity(entity_id, relationship_type)

# Print out the relationships we found in the database.
if relationships:
    for rel in relationships:
        a_node = rel['a'].get('id', 'Unknown')
        b_node = rel['b'].get('id', 'Unknown')
        rel_type = rel['rel_type']
        print(f"Entity {a_node} {rel_type} Entity {b_node}")
else:
    print(f"No relationships found for entity: {entity_id} with relationship: {relationship_type}")


Entity Elizabeth PARENT Entity Henry VIII
Entity Elizabeth PARENT Entity Anne Boleyn
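
The lookup above can be tried without a running database by filtering an in-memory edge list. The sketch below mimics the result shape consumed by the printing loop (`rel['a']`, `rel['b']`, `rel['rel_type']`); `find_relationships` and the sample edges, including the SUCCEEDED edge, are hypothetical and only serve the illustration.

```python
def find_relationships(edges, entity_id, rel_type):
    """Filter (source, type, target) triples, mimicking the result shape used above."""
    return [
        {"a": {"id": source}, "b": {"id": target}, "rel_type": kind}
        for source, kind, target in edges
        if source == entity_id and kind == rel_type
    ]

edges = [
    ("Elizabeth", "PARENT", "Henry VIII"),
    ("Elizabeth", "PARENT", "Anne Boleyn"),
    ("Elizabeth", "SUCCEEDED", "Mary I"),
]
for rel in find_relationships(edges, "Elizabeth", "PARENT"):
    print(f"Entity {rel['a']['id']} {rel['rel_type']} Entity {rel['b']['id']}")
```

An unknown entity or relationship type simply yields an empty list, which corresponds to the "No relationships found" branch in the cell above.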


Setting up the QA System:



*   We initialize the question-answering (QA) system using the IndoxRetrievalAugmentation class from Indox. The QA system retrieves information from the Neo4j database and uses an LLM to answer questions.


In [13]:
from indox import IndoxRetrievalAugmentation

Performing Vector-Based Search:



*   We instantiate the QA system to use vector-based search, where questions are answered based on the semantic similarity between the question and the graph data.


In [14]:
openai_qa_indox = IndoxApi(api_key=INDOX_API_KEY)

In [15]:
# Instantiate the Neo4jVector with default vector-based search to perform QA based on the Neo4j graph data.
# This setup will use vector search to retrieve context for answering the questions.

from indox.vector_stores.neo4j_vector import Neo4jVector

neo4j_vector = Neo4jVector(uri=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD,
                           embedding_function=embedding_model, search_type='vector')

# Use the IndoxRetrievalAugmentation.QuestionAnswer class for question-answering.
qa_system_neo4j = IndoxRetrievalAugmentation.QuestionAnswer(llm=openai_qa_indox, vector_database=neo4j_vector)

# Ask a question about Elizabeth and retrieve the answer using vector-based search.
answer_from_vector_neo4j = qa_system_neo4j.invoke("Who was Elizabeth I?")
answer_from_vector_neo4j


INFO: Retrieving context and scores from the vector database
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
INFO: Generating answer without document relevancy filter
INFO: Query answered successfully


"Elizabeth I was the Queen of England and Ireland from 17 November 1558 until her death on 24 March 1603. She was the last monarch of the House of Tudor and the only surviving child of King Henry VIII and his second wife, Anne Boleyn. After her parents' marriage was annulled and her mother was executed when she was two years old, Elizabeth was declared illegitimate. However, she was later restored to the line of succession at the age of 10 through the Third Succession Act of 1543."

Performing Keyword-Based Search:



*   We also demonstrate how to use keyword-based search, where questions are answered by matching the keywords in the question with the data in the graph.
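
As a rough illustration of keyword matching, the sketch below ranks text chunks by how many query words they share. It is a deliberately naive scoring scheme and not the full-text search Neo4j performs; the function names and sample chunks are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_rank(query, docs):
    """Rank doc ids by how many query words each document contains."""
    query_terms = tokenize(query)
    scored = [(doc_id, len(query_terms & tokenize(text))) for doc_id, text in docs.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "chunk_1": "Elizabeth I was Queen of England and Ireland.",
    "chunk_2": "Henry VIII was the father of Elizabeth.",
    "chunk_3": "The Spanish Armada was defeated in 1588.",
}
print(keyword_rank("Who was Queen Elizabeth?", docs))
```

Note that the common word "was" matches every chunk here, which shows why practical keyword search usually removes stop words or weights rare terms more heavily.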

In [16]:
# Instantiate the Neo4jVector with keyword-based search.
neo4j_vector_keyword = Neo4jVector(uri=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD,
                                   embedding_function=embedding_model, search_type='keyword')

# Use the IndoxRetrievalAugmentation.QuestionAnswer class for keyword search.
qa_system_keyword = IndoxRetrievalAugmentation.QuestionAnswer(llm=openai_qa_indox, vector_database=neo4j_vector_keyword)

# Ask a question using keyword search.
answer_from_keyword_search = qa_system_keyword.invoke("Who was Queen Elizabeth?")
answer_from_keyword_search


INFO: Retrieving context and scores from the vector database
INFO: Generating answer without document relevancy filter
INFO: Query answered successfully


'Queen Elizabeth refers to two prominent figures in British history: \n\n1. **Queen Elizabeth I (1533-1603)**: The daughter of King Henry VIII and Anne Boleyn, she reigned from 1558 until her death in 1603. Elizabeth I is known for her strong leadership during the Elizabethan Era, a period marked by the flourishing of English drama, the defeat of the Spanish Armada in 1588, and the establishment of Protestantism in England. She is often referred to as the "Virgin Queen" due to her decision not to marry.\n\n2. **Queen Elizabeth II (1926-2022)**: The daughter of King George VI and Queen Elizabeth (later known as the Queen Mother), she ascended'

Hybrid Search (Combining Vector and Keyword):



*   Finally, we demonstrate hybrid search, which combines the strengths of both vector and keyword-based search to retrieve the best possible context.
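
One common way to combine two result lists is reciprocal rank fusion (RRF), sketched below. This is offered only as an illustration of the general idea; it is not necessarily how Neo4jVector merges results in hybrid mode, and the chunk ids and rankings are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; ids ranked high in many lists score best."""
    scores = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_ranking = ["chunk_1", "chunk_3", "chunk_2"]   # hypothetical vector-search order
keyword_ranking = ["chunk_2", "chunk_1"]             # hypothetical keyword-search order
print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
```

RRF uses only rank positions, so it avoids having to put cosine similarities and keyword scores on a common scale.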

In [17]:
# Instantiate the Neo4jVector with hybrid search enabled.
neo4j_vector_hybrid = Neo4jVector(uri=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD,
                                  embedding_function=embedding_model, search_type='hybrid')

# Use the IndoxRetrievalAugmentation.QuestionAnswer class for hybrid search.
qa_system_hybrid = IndoxRetrievalAugmentation.QuestionAnswer(llm=openai_qa_indox, vector_database=neo4j_vector_hybrid)

# Ask a question using hybrid search.
answer_from_hybrid_search = qa_system_hybrid.invoke("Who was Queen Elizabeth?")
answer_from_hybrid_search


INFO: Retrieving context and scores from the vector database
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
INFO: Generating answer without document relevancy filter
INFO: Query answered successfully


"Queen Elizabeth I was the Queen of England and Ireland from 17 November 1558 until her death on 24 March 1603. She was the last monarch of the House of Tudor and the only surviving child of King Henry VIII and his second wife, Anne Boleyn. Elizabeth's early life was marked by her parents' tumultuous marriage, her mother's execution, and her initial declaration as illegitimate. However, she was later restored to the line of succession by her father through the Third Succession Act in 1543."

Setting up the AgenticRAG System:



*   The AgenticRAG system is initialized using the IndoxRetrievalAugmentation.AgenticRag class. We pass the LLM (openai_qa_indox) for answering queries and neo4j_vector as the vector database, which retrieves relevant documents from Neo4j. AgenticRAG processes a query by retrieving context from Neo4j and generating an answer with the LLM. Judging by the log output in the cells below, it differs from the standard QA system in that it also grades retrieved documents for relevance, falls back to a web search when none is relevant, and checks the generated answer for hallucination, regenerating it if needed.

In this setup, you can try different search methods (vector, keyword, or hybrid) within the AgenticRAG to optimize how documents are retrieved.
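
The control flow suggested by the AgenticRAG log messages (relevance grading, web-search fallback, hallucination check) can be sketched as follows. This is an illustration inferred from those messages, not the actual AgenticRag implementation; every function name and parameter here is hypothetical, and the components are injected as plain callables so the flow can be run with stubs.

```python
def agentic_answer(question, retrieve, is_relevant, web_search, generate,
                   is_grounded, max_retries=1):
    """Retrieve, filter, fall back to the web if needed, then answer and verify."""
    context = [doc for doc in retrieve(question) if is_relevant(question, doc)]
    if not context:
        context = web_search(question)          # fallback when nothing relevant is stored
    answer = generate(question, context)
    for _ in range(max_retries):
        if is_grounded(answer, context):
            break
        answer = generate(question, context)    # regenerate after a failed check
    return answer

# Stub components standing in for the vector store, the LLM grader, and the LLM itself.
answer = agentic_answer(
    "Who was Elizabeth I?",
    retrieve=lambda q: ["Elizabeth I was Queen of England.", "Unrelated text."],
    is_relevant=lambda q, doc: "Elizabeth" in doc,
    web_search=lambda q: ["(web result)"],
    generate=lambda q, ctx: f"Answer based on {len(ctx)} document(s)",
    is_grounded=lambda ans, ctx: True,
)
print(answer)
```

With a keyword retriever that returns nothing relevant, the same function would take the web-search branch instead.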


Performing Vector-Based Search (as default):

In [19]:
# Instantiate the Neo4jVector class (vector search is the default search type)
neo4j_vector = Neo4jVector(uri=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD,
                           embedding_function=embedding_model)

# Instantiate the AgenticRag with the neo4j_vector as the vector database
rag_system = IndoxRetrievalAugmentation.AgenticRag(llm=openai_qa_indox, vector_database=neo4j_vector)

# Run a query
answer_vs = rag_system.run("Who was Queen Elizabeth?")
answer_vs


INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
INFO: Relevant doc
INFO: Relevant doc
INFO: Relevant doc
INFO: Hallucination detected, Regenerate the answer...


"Queen Elizabeth I was the Queen of England and Ireland from 17 November 1558 until her death on 24 March 1603. She was the last monarch of the House of Tudor and the only surviving child of King Henry VIII and his second wife, Anne Boleyn. After her parents' marriage was annulled and her mother was executed when she was two years old, Elizabeth was declared illegitimate. However, she was later restored to the line of succession at the age of 10 through the Third Succession Act of 1543."

Performing Hybrid Search:

In [20]:
# Instantiate the Neo4jVector class with hybrid search
neo4j_vector = Neo4jVector(uri=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD,
                           embedding_function=embedding_model, search_type='hybrid')

# Instantiate the AgenticRag with the neo4j_vector as the vector database
rag_system = IndoxRetrievalAugmentation.AgenticRag(llm=openai_qa_indox, vector_database=neo4j_vector)

# Run a query
answer_hs = rag_system.run("Who was Queen Elizabeth?")
answer_hs

INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
INFO: Relevant doc
INFO: Relevant doc
INFO: Relevant doc
INFO: Hallucination detected, Regenerate the answer...


"Queen Elizabeth I was the Queen of England and Ireland from 17 November 1558 until her death on 24 March 1603. She was the last monarch of the House of Tudor and the only surviving child of King Henry VIII and his second wife, Anne Boleyn. After her parents' marriage was annulled and her mother was executed when she was two years old, Elizabeth was declared illegitimate. However, she was later restored to the line of succession by her father through the Third Succession Act in 1543."

Performing Keyword-Based Search:

In [22]:
# Instantiate the Neo4jVector class with keyword search
neo4j_vector = Neo4jVector(uri=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD,
                           embedding_function=embedding_model, search_type='keyword')

# Instantiate the AgenticRag with the neo4j_vector as the vector database
rag_system = IndoxRetrievalAugmentation.AgenticRag(llm=openai_qa_indox, vector_database=neo4j_vector)

# Run a query
answer_ks = rag_system.run("Who was Queen Elizabeth?")
answer_ks

INFO: No Relevant document found, Start web search
INFO: No Relevant Context Found, Start Searching On Web...


2024-09-16 10:53:36,846 - primp - INFO - response: https://duckduckgo.com/ 200 17821


INFO: Answer Base On Web Search
INFO: Check For Hallucination In Generated Answer Base On Web Search
INFO: Hallucination detected, Regenerate the answer...


'Queen Elizabeth II, born Elizabeth Alexandra Mary on April 21, 1926, was the queen of the United Kingdom and other Commonwealth realms from February 6, 1952, until her death on September 8, 2022. She was the longest-reigning monarch in British history, surpassing Queen Victoria in 2015. Elizabeth II was the daughter of King George VI and Queen Elizabeth and married Prince Philip in 1947. During her reign, she was queen regnant of 32 sovereign states and remained the monarch of 15 realms at the time of her death. She was known for her dedication to her role and her significant impact on the monarchy and the Commonwealth.'

Conclusion:

In this notebook, we demonstrated the complete pipeline for building a RAG system using Neo4j and Indox. We covered the process of:



1.   Loading Wikipedia content,
2.   Creating graph structures using language models,
3.   Storing graph data in Neo4j,
4.   Implementing vector-based, keyword-based, and hybrid search for question answering,
5.   Implementing vector-based, keyword-based, and hybrid search for AgenticRAG.
