[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/cookbook/indoxArcg/Question_Answering_with_Graphs(MemGraph).ipynb)

**Overview**:
In this notebook, we demonstrate how to build a Retrieval-Augmented Generation (RAG) system using Memgraph as the graph database, Indox for language model interaction, and a Hugging Face embedding model for vector search. The notebook walks through the entire pipeline: gathering data, creating graph documents, storing them in Memgraph, and finally running a question-answering system that retrieves information from Memgraph.

**Key Concepts**:


*   **Graph Database (MemGraph)**: We will store knowledge in the form of a graph, with entities as nodes and relationships between entities as edges.
*   **Indox API**: Used for leveraging language models (LLMs) to extract entities and relationships and perform question-answering.
*   **Vector Search**: Vector-based semantic search that retrieves the most relevant information based on embeddings.
*   **Keyword and Hybrid Search**: Alternative search mechanisms that use simple keyword matching or a combination of keyword and vector search.
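
To make the "entities as nodes, relationships as edges" idea concrete, here is a minimal in-memory sketch. The `Node`, `Relationship`, and `TinyGraph` names are hypothetical and exist only for illustration; they are not part of Indox or Memgraph.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    id: str    # entity name, e.g. "Elizabeth"
    type: str  # entity category, e.g. "Person"


@dataclass(frozen=True)
class Relationship:
    source: str  # id of the start node
    target: str  # id of the end node
    type: str    # edge label, e.g. "PARENT"


@dataclass
class TinyGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_edge(self, source: Node, target: Node, rel_type: str) -> None:
        # Register both endpoints, then record the directed edge between them.
        self.nodes[source.id] = source
        self.nodes[target.id] = target
        self.edges.append(Relationship(source.id, target.id, rel_type))

    def neighbors(self, node_id: str, rel_type: str) -> list:
        # Follow outgoing edges of the given type from node_id.
        return [e.target for e in self.edges
                if e.source == node_id and e.type == rel_type]


graph = TinyGraph()
elizabeth = Node("Elizabeth", "Person")
graph.add_edge(elizabeth, Node("Henry VIII", "Person"), "PARENT")
graph.add_edge(elizabeth, Node("Anne Boleyn", "Person"), "PARENT")
print(graph.neighbors("Elizabeth", "PARENT"))
```

A graph database stores the same kind of structure persistently and lets you query it with a graph query language instead of list comprehensions.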

Installing Dependencies:

We begin by installing the required libraries: indoxArcg, neo4j, wikipedia, semantic_text_splitter, and sentence_transformers. These libraries help us call the LLM API, talk to the graph database, fetch Wikipedia data, split text, and compute embeddings.

In [None]:
!pip install indoxArcg  
!pip install neo4j  
!pip install wikipedia  
!pip install semantic_text_splitter  
!pip install sentence_transformers  

Setting Up APIs:



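*   We load the API keys for Indox and Hugging Face from a local `api.env` file. The `dotenv` import comes from the `python-dotenv` package, which may need to be installed separately.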

In [39]:
from dotenv import load_dotenv
import os 

load_dotenv('api.env')
INDOX_API_KEY = os.getenv('INDOX_API_KEY')
HUGGINGFACE_API_KEY = os.getenv('HUGGINGFACE_API_KEY')
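
`os.getenv` silently returns `None` when a variable is missing, which tends to surface later as a confusing authentication error. A small guard like the following can fail early instead. The `require_env` helper and the `DEMO_API_KEY` variable are hypothetical names used only for this sketch.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check api.env")
    return value


# Demo with a placeholder variable so the sketch runs anywhere.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))
```

In the notebook you would call `require_env("INDOX_API_KEY")` after `load_dotenv('api.env')`.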


Fetching Data:



*   We load Wikipedia content for "Elizabeth I" using the Indox Wikipedia reader. The data will be split into smaller chunks to facilitate entity extraction and graph creation.

In [None]:
from indoxArcg.data_connectors import WikipediaReader
from indoxArcg.splitter import SemanticTextSplitter
from indoxArcg.llms import IndoxApi

# Initialize the Wikipedia reader to pull data from Wikipedia.
reader = WikipediaReader()

# Load the Wikipedia content for "Elizabeth I".
documents = reader.load_content(pages=["Elizabeth I"])
# Keep only the first 500 characters so the demo stays small.
documents = documents[:500]

# Create a basic metadata dictionary to keep track of the document source.
metadata = {
    "title": "Elizabeth I",
    "source": "https://en.wikipedia.org/wiki/Elizabeth_I"
}


Text Splitting:

*   The fetched Wikipedia content is split into smaller chunks with `SemanticTextSplitter` (here with `chunk_size=50`), so that entity extraction runs on manageable pieces rather than on the whole document.

In [41]:
splitter = SemanticTextSplitter(chunk_size=50)
document_chunks = splitter.split_text(documents)
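
To show what chunking does in principle, here is a simplified splitter that packs whole sentences into chunks under a word budget. This is not the algorithm used by `SemanticTextSplitter`; the function name and the word-based limit are assumptions made for illustration.

```python
import re


def split_into_chunks(text: str, max_words: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Elizabeth I was Queen of England and Ireland. "
    "She was the last monarch of the House of Tudor. "
    "Her parents were Henry VIII and Anne Boleyn."
)
for chunk in split_into_chunks(sample, max_words=12):
    print(chunk)
```

Splitting on sentence boundaries keeps each chunk readable on its own, which generally helps an LLM extract entities and relationships from it.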


Embedding Model Initialization:



*   We initialize the HuggingFace embedding model to generate vector embeddings for each document. These embeddings will later be used for vector-based search.

In [None]:
# Initialize the embedding model using the HuggingFaceEmbedding class.

from indoxArcg.embeddings import HuggingFaceEmbedding

embedding_model = HuggingFaceEmbedding(api_key=INDOX_API_KEY, model="multi-qa-mpnet-base-cos-v1")


2024-09-21 12:44:45,154 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: cuda
2024-09-21 12:44:45,155 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: multi-qa-mpnet-base-cos-v1


INFO: Initialized HuggingFaceEmbedding with model: multi-qa-mpnet-base-cos-v1
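
Vector search compares embeddings by similarity, most commonly cosine similarity. The sketch below uses short toy vectors in place of real embeddings and shows that, once vectors are normalized to unit length, a plain dot product gives the same value as cosine similarity. The helper names are illustrative and not part of Indox.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Toy 3-D vectors standing in for real embeddings.
query = [1.0, 2.0, 2.0]
doc = [2.0, 1.0, 2.0]

unit_query, unit_doc = normalize(query), normalize(doc)
dot_of_units = sum(x * y for x, y in zip(unit_query, unit_doc))

print(round(cosine_similarity(query, doc), 4))
print(round(dot_of_units, 4))
```

Both printed values agree, which is why many vector stores keep normalized embeddings and rank by dot product.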


Graph Creation using LLM:



*   The LLMGraphTransformer is initialized, and we use it to transform document chunks into graph structures by extracting entities and relationships using a language model.

In [None]:
from indoxArcg.graph.llmgraphtransformer import LLMGraphTransformer
llm_transformer = IndoxApi(api_key=INDOX_API_KEY)

transformer = LLMGraphTransformer(llm_transformer=llm_transformer, embeddings_model=embedding_model)

# Convert the document chunks into graph documents that contain nodes (entities) and relationships.
graph_documents = transformer.convert_to_graph_documents(document_chunks, metadata=metadata)


INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)


Batches: 100%|██████████| 1/1 [00:00<00:00, 98.36it/s]


Batches: 100%|██████████| 1/1 [00:00<00:00, 43.99it/s]


Batches: 100%|██████████| 1/1 [00:00<00:00, 77.39it/s]


In [44]:
for graph_doc in graph_documents:
    print(graph_doc.to_dict())

{'nodes': [{'id': 'Chunk_0', 'type': 'Chunk', 'embedding': [0.016984712332487106, -0.047557469457387924, -0.015256771817803383, -0.004274189006537199, 0.04936050623655319, 0.00607256218791008, -0.05916622281074524, 0.02928762510418892, 0.025210509076714516, 0.04163850098848343, 0.03439163416624069, -0.003568015294149518, -0.005019450560212135, -0.03713161125779152, -0.02432173490524292, -0.0036989161744713783, -0.0003411499783396721, -0.01681206375360489, -0.0288605447858572, 0.013376510702073574, -0.0070761218667030334, -0.0009061498567461967, 0.021643079817295074, 0.011871310882270336, 0.0009609689586795866, 0.04231629520654678, -0.045300304889678955, -0.034248944371938705, -0.04024829342961311, 0.06228352338075638, 0.03416317701339722, -0.06792814284563065, 0.003124883398413658, -0.061220433562994, 0.009800738655030727, 0.011953706853091717, 0.012485765852034092, -0.011334456503391266, -0.05158885195851326, -0.025743518024683, -0.029615631327033043, 0.028689999133348465, -0.00592769
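
Printing the full dictionary floods the output with embedding values. A small helper like the one below can replace every `embedding` list with a short summary before printing. It assumes only the dictionary shape visible in the output above (a `nodes` list whose items carry `id`, `type`, and `embedding`); the helper name is hypothetical.

```python
def summarize_embeddings(obj):
    """Return a copy of obj with every 'embedding' list replaced by a summary string."""
    if isinstance(obj, dict):
        return {
            key: f"<{len(value)} floats>"
            if key == "embedding" and isinstance(value, list)
            else summarize_embeddings(value)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [summarize_embeddings(item) for item in obj]
    return obj


# Sample shaped like the printed graph document, with a shortened embedding.
sample = {
    "nodes": [
        {"id": "Chunk_0", "type": "Chunk", "embedding": [0.0169, -0.0475, -0.0152]}
    ]
}
print(summarize_embeddings(sample))
```

In the notebook this could be applied as `print(summarize_embeddings(graph_doc.to_dict()))`.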

Storing Graph in MemGraph:



*   The graph documents are stored in a MemGraph database. This step creates nodes (entities) and relationships in the graph, which can be queried later.

In [None]:
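# Assumes MemgraphDB is exposed by the indoxArcg package installed above.
from indoxArcg.graph.graphs import MemgraphDB

memgraph = MemgraphDB("bolt://localhost:7687")
memgraph.add_graph_documents(graph_documents)
# Keep the connection open: the next cell queries `memgraph`.
# Call memgraph.close() once you are done with the relationship queries.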
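
The notebook does not show what `add_graph_documents` sends to the database. As a rough illustration of how nodes and relationships can be written idempotently with Cypher, the sketch below builds a `MERGE` statement and its parameters. The `Entity` label and the function name are assumptions; the real implementation may differ.

```python
import re

# Relationship types cannot be passed as Cypher parameters, so they are
# interpolated into the query text and must be validated first.
_REL_TYPE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_merge_statement(source_id: str, rel_type: str, target_id: str):
    """Return (query, params) that upsert two nodes and one edge between them."""
    if not _REL_TYPE.match(rel_type):
        raise ValueError(f"Unsafe relationship type: {rel_type!r}")
    query = (
        "MERGE (a:Entity {id: $source_id}) "
        "MERGE (b:Entity {id: $target_id}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
    return query, {"source_id": source_id, "target_id": target_id}


query, params = build_merge_statement("Elizabeth", "PARENT", "Henry VIII")
print(query)
print(params)
```

`MERGE` creates a node or edge only if it does not already exist, so running the same statement twice does not duplicate data.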


Querying MemGraph for Relationships:



*   We demonstrate how to query the Memgraph database to find relationships for a given entity. In this case, we search for parent-child relationships for Elizabeth I.

In [46]:
entity_id = "Elizabeth" 
relationship_type = "PARENT" 

relationships = memgraph.search_relationships_by_entity(entity_id, relationship_type)

if relationships:
    for rel in relationships:
        a_node = rel['a'].get('id', 'Unknown')
        b_node = rel['b'].get('id', 'Unknown')
        rel_type = rel['rel_type']
        print(f"Entity {a_node} {rel_type} Entity {b_node}")
else:
    print(f"No relationships found for entity: {entity_id} with relationship: {relationship_type}")


Entity Elizabeth PARENT Entity Henry VIII
Entity Elizabeth PARENT Entity Anne Boleyn


Setting up the QA System:



*   We initialize a Hugging Face-hosted Mistral model as one possible LLM for the question-answering (QA) system. The QA system retrieves context from Memgraph and passes it to an LLM to answer questions. Note that the search examples that follow use `IndoxApi` as the LLM.


In [None]:

from indoxArcg.llms import HuggingFaceAPIModel

mistral_qa = HuggingFaceAPIModel(api_key=HUGGINGFACE_API_KEY, model="mistralai/Mistral-7B-Instruct-v0.2")


INFO: Initializing HuggingFaceAPIModel with model: mistralai/Mistral-7B-Instruct-v0.2
INFO: HuggingFaceAPIModel initialized successfully
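
The "retrieve, then answer" step usually comes down to placing the retrieved chunks into the prompt sent to the LLM. The sketch below shows one plausible prompt layout. The template wording and the `build_prompt` name are assumptions, not the prompt Indox actually uses.

```python
def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a grounded prompt from a question and retrieved context chunks."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "Who were the parents of Elizabeth I?",
    [
        "Elizabeth I was the last monarch of the House of Tudor.",
        "She was the only surviving child of Henry VIII and Anne Boleyn.",
    ],
)
print(prompt)
```

Numbering the chunks makes it easier to trace which retrieved passage supported a given answer.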


Performing Vector-Based Search:



*   We instantiate the QA system to use vector-based search, where questions are answered based on the semantic similarity between the question and the graph data.


In [49]:
openai_qa_indox = IndoxApi(api_key=INDOX_API_KEY)


In [None]:
from indoxArcg.vector_stores import MemgraphVector
from indoxArcg.pipelines.rag import RAG


memgraphvector = MemgraphVector(
    uri="bolt://localhost:7687",
    username="",
    password="",
    embedding_function=embedding_model,
    search_type="vector",
)

# Optional sanity check: run a raw similarity search before wiring up the pipeline.
memgraphvector._similarity_search_with_score(query="Who was Elizabeth I?")

qa_system = RAG(llm=openai_qa_indox, vector_store=memgraphvector)

answer_from_vector_search = qa_system.infer(question="Who was Elizabeth I?")
print(answer_from_vector_search)


INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)


Batches: 100%|██████████| 1/1 [00:00<00:00, 177.92it/s]

INFO: Retrieving context and scores from the vector database

Batches: 100%|██████████| 1/1 [00:00<00:00, 77.21it/s]

INFO: Generating answer without document relevancy filter
INFO: Query answered successfully
Elizabeth I was the Queen of England and Ireland from 17 November 1558 until her death on 24 March 1603. She was the last monarch of the House of Tudor and the only surviving child of King Henry VIII and his second wife, Anne Boleyn. After her parents' marriage was annulled and her mother was executed when she was two years old, Elizabeth was declared illegitimate. However, she was later restored to the line of succession by her father through the Third Succession Act in 1543.
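
Conceptually, the vector retrieval step ranks stored chunks by their similarity to the query embedding and keeps the best few. The sketch below does this with toy two-dimensional unit vectors. The vectors, texts, and `top_k` helper are made up for illustration, and the notebook does not show how `MemgraphVector` implements the ranking.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, indexed_chunks, k=2):
    """Return the k (score, text) pairs with the highest dot product."""
    scored = ((dot(query_vec, vec), text) for text, vec in indexed_chunks)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy 2-D unit vectors standing in for real 768-dimensional embeddings.
indexed_chunks = [
    ("Elizabeth I was Queen of England and Ireland.", [1.0, 0.0]),
    ("Memgraph is a graph database.", [0.0, 1.0]),
    ("Her father was Henry VIII.", [0.8, 0.6]),
]
query_vec = [1.0, 0.0]

for score, text in top_k(query_vec, indexed_chunks):
    print(f"{score:.2f}  {text}")
```

The top-ranked chunks are what the RAG pipeline would pass to the LLM as context.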


Performing Keyword-Based Search:



*   We also demonstrate how to use keyword-based search, where questions are answered by matching the keywords in the question with the data in the graph.

In [51]:
memgraph_vector_keyword = MemgraphVector(
    uri="bolt://localhost:7687",
    username="",
    password="",
    embedding_function=embedding_model,
    search_type="keyword",
)

qa_system_keyword = RAG(llm=openai_qa_indox, vector_store=memgraph_vector_keyword)

answer_from_keyword_search = qa_system_keyword.infer("Who was Queen Elizabeth?")
print(answer_from_keyword_search)

INFO: Retrieving context and scores from the vector database
INFO: Generating answer without document relevancy filter
INFO: Query answered successfully
Queen Elizabeth typically refers to either Queen Elizabeth I or Queen Elizabeth II, two significant figures in British history.

1. **Queen Elizabeth I (1533-1603)**: She was the daughter of King Henry VIII and Anne Boleyn. Elizabeth I reigned from 1558 until her death in 1603 and is known for the Elizabethan Era, a period marked by English cultural flourishing, including the works of William Shakespeare and advancements in exploration. Her reign is often noted for the defeat of the Spanish Armada in 1588 and the establishment of Protestantism in England.

2. **Queen Elizabeth II (1926-2022)**: She was the daughter of King George VI and Queen Mary. Elizabeth II became queen in 1952 and was the longest-reigning monarch in British history, serving for over 70 years until her death in 202
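
Keyword search can be approximated by counting how many question terms appear in a document. The sketch below scores documents by the fraction of question tokens they contain. It is a deliberately simple stand-in, and the way `MemgraphVector` performs keyword matching is not shown in this notebook.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_score(question: str, document: str) -> float:
    """Fraction of question tokens that also occur in the document."""
    question_tokens = tokenize(question)
    if not question_tokens:
        return 0.0
    return len(question_tokens & tokenize(document)) / len(question_tokens)


question = "Who was Queen Elizabeth?"
docs = [
    "Elizabeth I was Queen of England and Ireland.",
    "Memgraph stores nodes and relationships.",
]
for doc in docs:
    print(keyword_score(question, doc), doc)
```

Because this kind of matching ignores meaning, a question that shares no words with the stored chunks retrieves little useful context, and the LLM may then answer from its general knowledge instead.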

Hybrid Search (Combining Vector and Keyword):



*   Finally, we demonstrate hybrid search, which combines the strengths of both vector and keyword-based search to retrieve the best possible context.

In [52]:
memgraph_vector_hybrid = MemgraphVector(
    uri="bolt://localhost:7687",
    username="",
    password="",
    embedding_function=embedding_model,
    search_type="hybrid",
)

qa_system_hybrid = RAG(llm=openai_qa_indox, vector_store=memgraph_vector_hybrid)

answer_from_hybrid_search = qa_system_hybrid.infer("Who was Queen Elizabeth?")
print(answer_from_hybrid_search)


INFO: Retrieving context and scores from the vector database
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)


Batches: 100%|██████████| 1/1 [00:00<00:00, 133.27it/s]

INFO: Generating answer without document relevancy filter
INFO: Query answered successfully
Queen Elizabeth I was the Queen of England and Ireland from 17 November 1558 until her death on 24 March 1603. She was the last monarch of the House of Tudor and the only surviving child of King Henry VIII and his second wife, Anne Boleyn. Elizabeth was declared illegitimate after her parents' marriage was annulled and her mother was executed, but she was later restored to the line of succession by her father through the Third Succession Act 1543 when she was 10 years old.
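
One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF), sketched below. This is an illustration of the general idea only: the notebook does not reveal how `MemgraphVector` merges results in hybrid mode, and the chunk ids and the constant `k=60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


vector_ranking = ["chunk_a", "chunk_b", "chunk_c"]   # best semantic match first
keyword_ranking = ["chunk_b", "chunk_c", "chunk_a"]  # best keyword match first

print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
```

Here `chunk_b` comes out on top because it ranks well in both lists, even though it is not first in the vector ranking.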
