# Tutorial on Graph RAG (Local Search) with Couchbase Vector Store
This notebook walks through setting up a search engine that combines Couchbase, which stores the embeddings, with OpenAI's models, which generate the embeddings, knowledge graph, and communities from textual data.

## Setting up Couchbase

Before running this notebook, set up the following in Couchbase:

1. Create a bucket named "graphrag-demo" (or as specified in COUCHBASE_BUCKET_NAME)
2. Within the bucket, create a scope named "shared" (or as specified in COUCHBASE_SCOPE_NAME)
3. Within the scope, create a collection named "entity_description_embeddings" (or as specified in COUCHBASE_COLLECTION_NAME)

4. In the Couchbase Full Text Search (FTS) index section, create a new index by importing the `graphrag_demo_index.json` file. This file contains the necessary configuration for the vector search index.

The bucket, scope, and collection names should match the environment variables defined in your .env file or the default values in the code.


## Local and Global Search in Graph RAG Systems

Local and global search are two approaches used in Graph RAG (Retrieval-Augmented Generation) systems:

### Local Search
The local search method generates answers by combining relevant data from the AI-extracted knowledge graph with text chunks from the raw documents. It is suitable for questions that require an understanding of specific entities mentioned in the documents.

### Global Search
The global search method generates answers by searching over all AI-generated community reports in a map-reduce fashion. It is resource-intensive, but often gives good responses to questions that require an understanding of the dataset as a whole.
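To make the map-reduce idea concrete, here is a minimal sketch. It is not GraphRAG's implementation: `map_step`, `reduce_step`, and `global_search` are hypothetical names, and a simple word-overlap score stands in for the LLM call that would rate each community report.

```python
from dataclasses import dataclass


@dataclass
class PartialAnswer:
    """One map-step result: the candidate text and its helpfulness score."""
    text: str
    score: int


def map_step(report: str, question: str) -> PartialAnswer:
    # Hypothetical stand-in for an LLM call: score a report by counting
    # how many of its words also appear in the question.
    question_words = {w.lower().strip("?.,") for w in question.split()}
    score = sum(
        1 for w in report.lower().split() if w.strip("?.,") in question_words
    )
    return PartialAnswer(text=report, score=score)


def reduce_step(partials: list, top_n: int = 2) -> str:
    # Drop partial answers with a zero score, keep the best top_n, and join
    # them. A real system would ask the LLM to synthesize a final answer.
    useful = sorted(
        (p for p in partials if p.score > 0),
        key=lambda p: p.score,
        reverse=True,
    )
    return " ".join(p.text for p in useful[:top_n])


def global_search(reports: list, question: str, top_n: int = 2) -> str:
    # Map: score every report independently. Reduce: combine the best ones.
    return reduce_step([map_step(r, question) for r in reports], top_n)
```

Because every report is scored in the map step, the cost grows with the number of community reports, which is why this approach is the more resource-intensive of the two.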

## Couchbase as a Vector Store for Local Search

Couchbase can be used as a vector store to support local search in Graph RAG systems. Its capabilities include:

- **Vector Storage**: Couchbase can store vector embeddings alongside document data.
- **Vector Search**: It supports similarity search on vector fields using similarity metrics such as cosine similarity.
- **Indexing**: Couchbase offers indexing options to optimize vector searches.
- **Scalability**: As a distributed database, it can handle large amounts of vector data.

To implement local search, you would store node embeddings in Couchbase and use its vector search capabilities to find similar nodes efficiently within a local context.
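The similarity lookup described above can be illustrated in plain Python. This is a brute-force sketch for intuition only; a vector index would normally avoid comparing the query against every stored vector. The function names and the toy three-dimensional embeddings are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 0.0 if either has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=2):
    # Score every stored vector against the query and keep the k best.
    scored = [
        (entity_id, cosine_similarity(query, vector))
        for entity_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy node embeddings keyed by a made-up entity id.
toy_embeddings = {
    "entity-a": [1.0, 0.0, 0.0],
    "entity-b": [0.9, 0.1, 0.0],
    "entity-c": [0.0, 1.0, 0.0],
}
```

Calling `top_k_similar([1.0, 0.0, 0.0], toy_embeddings)` ranks `entity-a` first and `entity-b` second, since their vectors point in nearly the same direction as the query.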


# Importing Necessary Libraries

In this section, we import all the essential Python libraries required to perform various tasks, 
such as loading data, interacting with Couchbase, and using OpenAI models for generating text and embeddings.


In [1]:
import os

import pandas as pd
import tiktoken
from couchbase.auth import PasswordAuthenticator
from couchbase.options import ClusterOptions
from dotenv import load_dotenv

from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import store_entity_semantic_embeddings
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.couchbasedb import CouchbaseVectorStore

# Configuring Environment Variables
Here, we configure various environment variables that define paths, API keys, and connection strings. These values are essential for connecting to Couchbase and OpenAI, loading data, and defining other constants.

- INPUT_DIR: This specifies the directory path where the input data files are located. These files typically contain the raw data that will be processed and analyzed in the notebook.
- COUCHBASE_CONNECTION_STRING: This is the connection string used to establish a connection with the Couchbase database. It usually includes the protocol and host information (e.g., "couchbase://localhost").
- OPENAI_API_KEY: This is your personal API key for accessing OpenAI's services. It's required for authentication when making requests to OpenAI's API, allowing you to use their language models and other AI services.
- LLM_MODEL: This variable specifies which Large Language Model (LLM) from OpenAI to use for text generation tasks. For example, it could be set to "gpt-4" to use GPT-4, or "gpt-3.5-turbo" to use GPT-3.5 Turbo.
- EMBEDDING_MODEL: This defines the specific model used for generating text embeddings. Text embeddings are vector representations of text that capture semantic meaning. For OpenAI, a common choice is "text-embedding-ada-002".

These environment variables are crucial for the notebook's functionality, as they provide necessary configuration details for data access, database connections, and AI model interactions.

In [2]:
load_dotenv()

INPUT_DIR = os.getenv("INPUT_DIR")
COUCHBASE_CONNECTION_STRING = os.getenv(
    "COUCHBASE_CONNECTION_STRING", "couchbase://localhost"
)
COUCHBASE_USERNAME = os.getenv("COUCHBASE_USERNAME", "Administrator")
COUCHBASE_PASSWORD = os.getenv("COUCHBASE_PASSWORD", "password")
COUCHBASE_BUCKET_NAME = os.getenv("COUCHBASE_BUCKET_NAME", "graphrag-demo")
COUCHBASE_SCOPE_NAME = os.getenv("COUCHBASE_SCOPE_NAME", "shared")
COUCHBASE_COLLECTION_NAME = os.getenv(
    "COUCHBASE_COLLECTION_NAME", "entity_description_embeddings"
)
COUCHBASE_VECTOR_INDEX_NAME = os.getenv("COUCHBASE_VECTOR_INDEX_NAME", "graphrag_index")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4o")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-ada-002")
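In the cell above, INPUT_DIR and OPENAI_API_KEY have no default, so they are None when unset, and the problem only surfaces later when a file is read or an API call is made. A small check like the following sketch can flag this early; `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names introduced for illustration.

```python
import os


def missing_settings(required, environ=None):
    # Return the names of required settings that are unset or empty.
    # `environ` defaults to the process environment; passing a dict
    # makes the check easy to try out without touching real variables.
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Settings that have no fallback value in the configuration cell.
REQUIRED_SETTINGS = ["INPUT_DIR", "OPENAI_API_KEY"]
```

After `load_dotenv()`, one option is to call `missing_settings(REQUIRED_SETTINGS)` and raise an error that lists the returned names if the list is not empty.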

## Load text units and graph data tables as context for local search
In this part, we load data from Parquet files into a dictionary. Each table is read in its own cell, and a table whose Parquet file is missing is set to None.

In [3]:
data = {}

# Constants
COMMUNITY_LEVEL = 2

# Table names
TABLE_NAMES = {
    "COMMUNITY_REPORT_TABLE": "create_final_community_reports",
    "ENTITY_TABLE": "create_final_nodes",
    "ENTITY_EMBEDDING_TABLE": "create_final_entities",
    "RELATIONSHIP_TABLE": "create_final_relationships",
    "COVARIATE_TABLE": "create_final_covariates",
    "TEXT_UNIT_TABLE": "create_final_text_units",
}
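The loading cells that follow set a table to None when its file is missing, which can hide a wrong INPUT_DIR until much later. As a rough sketch, the helper below reports up front which of the expected Parquet files are absent. `find_missing_tables` is a hypothetical helper, and it assumes the `<table name>.parquet` naming used in this notebook.

```python
from pathlib import Path


def find_missing_tables(input_dir, table_names):
    # Return the keys of table_names whose "<stem>.parquet" file
    # does not exist under input_dir, sorted for stable output.
    base = Path(input_dir)
    return sorted(
        key
        for key, stem in table_names.items()
        if not (base / f"{stem}.parquet").is_file()
    )
```

For example, `find_missing_tables(INPUT_DIR, TABLE_NAMES)` returns an empty list when every table is present.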

## Read Entities:

In [4]:
try:
    data["entities"] = pd.read_parquet(
        f"{INPUT_DIR}/{TABLE_NAMES['ENTITY_TABLE']}.parquet"
    )
    entity_embeddings = pd.read_parquet(
        f"{INPUT_DIR}/{TABLE_NAMES['ENTITY_EMBEDDING_TABLE']}.parquet"
    )
    data["entities"] = read_indexer_entities(
        data["entities"], entity_embeddings, COMMUNITY_LEVEL
    )
except FileNotFoundError:
    data["entities"] = None

## Read Relationships:

In [5]:
try:
    data["relationships"] = pd.read_parquet(
        f"{INPUT_DIR}/{TABLE_NAMES['RELATIONSHIP_TABLE']}.parquet"
    )
    data["relationships"] = read_indexer_relationships(data["relationships"])
except FileNotFoundError:
    data["relationships"] = None

## Read Covariates:


In [6]:
try:
    data["covariates"] = pd.read_parquet(
        f"{INPUT_DIR}/{TABLE_NAMES['COVARIATE_TABLE']}.parquet"
    )
    data["covariates"] = read_indexer_covariates(data["covariates"])
except FileNotFoundError:
    data["covariates"] = None

## Read Reports:

In [7]:
try:
    data["reports"] = pd.read_parquet(
        f"{INPUT_DIR}/{TABLE_NAMES['COMMUNITY_REPORT_TABLE']}.parquet"
    )
    entity_data = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['ENTITY_TABLE']}.parquet")
    data["reports"] = read_indexer_reports(
        data["reports"], entity_data, COMMUNITY_LEVEL
    )
except FileNotFoundError:
    data["reports"] = None

## Read Text units:


In [8]:
try:
    data["text_units"] = pd.read_parquet(
        f"{INPUT_DIR}/{TABLE_NAMES['TEXT_UNIT_TABLE']}.parquet"
    )
    data["text_units"] = read_indexer_text_units(data["text_units"])
except FileNotFoundError:
    data["text_units"] = None

print("Data loading completed")

Data loading completed


# Setting Up the Couchbase Vector Store
Couchbase is used here to store the semantic embeddings generated from entities. In this step, we create a CouchbaseVectorStore instance and connect it to the Couchbase database using the provided credentials.

The CouchbaseVectorStore allows you to store, retrieve, and manage vector embeddings in Couchbase.
The connect() method initializes the connection to Couchbase using the provided connection string, username, and password.

In [9]:
couchbase_vector_store = CouchbaseVectorStore(
    collection_name=COUCHBASE_COLLECTION_NAME,
    bucket_name=COUCHBASE_BUCKET_NAME,
    scope_name=COUCHBASE_SCOPE_NAME,
    index_name=COUCHBASE_VECTOR_INDEX_NAME,
)

auth = PasswordAuthenticator(str(COUCHBASE_USERNAME), str(COUCHBASE_PASSWORD))
cluster_options = ClusterOptions(auth)

couchbase_vector_store.connect(
    connection_string=COUCHBASE_CONNECTION_STRING,
    cluster_options=cluster_options,
)

# Setting Up Language Models
In this section, we configure the language models using OpenAI's API. We initialize:

- ChatOpenAI: This is the language model used to generate responses to natural language queries.
- OpenAIEmbedding: This is the model used to generate vector embeddings for text data.
- tiktoken: This tokenizer splits text into tokens, which is how the context size is measured before it is sent to the language model.

In [10]:
llm = ChatOpenAI(
    api_key=OPENAI_API_KEY,
    model=LLM_MODEL,
    api_type=OpenaiApiType.OpenAI,
    max_retries=20,
)

token_encoder = tiktoken.get_encoding("cl100k_base")

text_embedder = OpenAIEmbedding(
    api_key=OPENAI_API_KEY,
    api_base=None,
    api_type=OpenaiApiType.OpenAI,
    model=EMBEDDING_MODEL,
    deployment_name=EMBEDDING_MODEL,
    max_retries=20,
)

# Storing Embeddings in Couchbase
The entity embeddings loaded from the Parquet files are now stored in Couchbase using the store_entity_semantic_embeddings function.

The cell below first checks that data['entities'] is a list and raises a TypeError otherwise.
It then saves the embeddings through the Couchbase vector store; each entity needs an 'id' attribute to be stored.


In [11]:

try:
    if not isinstance(data["entities"], list):
        error_message = "data['entities'] must be a list"
        raise TypeError(error_message)

    store_entity_semantic_embeddings(
        entities=data["entities"], vectorstore=couchbase_vector_store
    )
except AttributeError as err:
    error_message = "Error storing entity semantic embeddings. Ensure all entities have an 'id' attribute"
    raise AttributeError(error_message) from err
except TypeError as err:
    error_message = "Error storing entity semantic embeddings. Ensure data['entities'] is a list"
    raise TypeError(error_message) from err
except Exception as err:
    error_message = "Error storing entity semantic embeddings"
    raise Exception(error_message) from err
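The requirement that each entity carries an 'id' can also be checked before anything is written. The sketch below is illustrative only: `ToyEntity` is a hypothetical stand-in for the entity objects, and the attribute name `description_embedding` is an assumption about where the vector lives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyEntity:
    """Hypothetical stand-in for an entity object (not the GraphRAG class)."""
    id: str
    description_embedding: Optional[list] = None  # assumed attribute name


def split_storable(entities):
    # Separate entities that have both a non-empty id and an embedding
    # from those that would fail or be useless when stored.
    storable, skipped = [], []
    for entity in entities:
        has_id = bool(getattr(entity, "id", None))
        has_vector = getattr(entity, "description_embedding", None) is not None
        if has_id and has_vector:
            storable.append(entity)
        else:
            skipped.append(entity)
    return storable, skipped
```

Logging the skipped entities before calling the store function makes it easier to see why fewer documents than expected end up in the collection.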


# Building the Search Engine (in the Context of GraphRAG)

Here, we explain the components of the search engine in detail and how they contribute to its functionality within GraphRAG.

#### **1. Context Builder (LocalSearchMixedContext)**

The `LocalSearchMixedContext` class is the cornerstone of our search engine in Graphrag. It acts as a **contextual environment** for the search process by combining various types of data—such as **community reports, text units, entities, relationships, and covariates**—into a coherent structure that can be used by the search engine. In this context:

- **Community Reports**: These are structured documents or insights generated at a community level, such as summaries or analytics reports, which are crucial when trying to query community-specific data.
- **Text Units**: Chunks of text taken from the raw source documents. These units supply the original wording needed to understand specific parts of the context when answering questions.
- **Entities**: These represent the core subjects (people, organizations, products, etc.) around which your queries are structured. Each entity has certain attributes and semantic embeddings stored in Couchbase, and these are used to enrich the search results.
- **Relationships**: The connections between entities, which can represent anything from business partnerships to familial ties or data dependencies. Understanding these relationships helps in contextualizing the search results more effectively.
- **Covariates**: Additional information attached to entities and relationships, such as extracted claims. These could include factors like location, time, or other details that affect the relevance of the search.

All these elements work together to build the **context** that the search engine will use to find and rank results.

- **entity_text_embeddings**: The entity descriptions are stored as vector embeddings (using the Couchbase vector store) to help in finding semantically similar entities.
- **text_embedder**: This is the **OpenAI embedding model** used to embed both the entities and user queries in a similar vector space, allowing for meaningful similarity comparisons.
- **token_encoder**: Tokenization splits the input text into tokens (smaller chunks), making it easier to process by the language models.

#### **2. Local Search Parameters**

Once the context is established, we define the parameters for the **search engine**. These parameters guide how the search engine processes the context to answer a query.

- **text_unit_prop**: This sets the proportion of text units to be considered when building the context. In this case, 50% of the context comes from text units.
- **community_prop**: Similar to `text_unit_prop`, this defines how much weight to give community reports. Here, 10% of the context is derived from community reports.
- **conversation_history_max_turns**: This specifies how many conversation history turns are retained when building the context. It helps in multi-turn queries, where the context from previous queries may still be relevant.
- **top_k_mapped_entities**: Defines how many of the most relevant entities should be considered in each query. In this case, we are considering the top 10 entities.
- **top_k_relationships**: Similarly, we consider the top 10 relationships that are most relevant to the query.
- **include_entity_rank**: Whether to include each entity's rank in the entity data added to the context.
- **include_relationship_weight**: Whether to include each relationship's weight in the relationship data added to the context. This is useful because certain relationships may have higher importance based on the data being queried.
- **embedding_vectorstore_key**: Defines the **key** for accessing entity embeddings from Couchbase. Here, we use `EntityVectorStoreKey.ID` as the identifier for retrieving the correct embeddings.
- **max_tokens**: The maximum number of tokens to consider in the context.
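To see how the proportions above translate into token counts, here is a small worked sketch. It assumes that both proportions apply to `max_tokens` and that the remainder goes to the entity, relationship, and covariate data. `split_token_budget` is a hypothetical helper, not a library function.

```python
def split_token_budget(max_tokens, text_unit_prop, community_prop):
    # The two proportions cannot claim more than the whole context window.
    if text_unit_prop + community_prop > 1:
        raise ValueError("text_unit_prop + community_prop must not exceed 1")
    # Whatever is left over is assumed to go to the entity, relationship,
    # and covariate data.
    local_prop = 1 - community_prop - text_unit_prop
    return {
        "text_units": int(max_tokens * text_unit_prop),
        "community_reports": int(max_tokens * community_prop),
        "entities_relationships_covariates": int(max_tokens * local_prop),
    }


# Values used in this notebook: 12,000 tokens, 50% text units, 10% reports.
budget = split_token_budget(
    max_tokens=12_000, text_unit_prop=0.5, community_prop=0.1
)
```

With the notebook's settings this gives 6,000 tokens for text units, 1,200 for community reports, and 4,800 for the remaining data.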


#### **3. Language Model Parameters**

For answering the query, we use the **language model** (LLM) to generate the response. The parameters for the LLM are configured here:
- **max_tokens**: Limits the number of tokens (words or sub-words) in the generated answer.
- **temperature**: Controls the randomness of the output. Setting it to `0.0` makes the model’s answers more deterministic.


#### **4. Integrating Everything: Creating the Search Engine**

Finally, all components are integrated into the `LocalSearch` class, which serves as the main search engine. This class is responsible for:
- **Accepting queries** in natural language.
- **Using the context builder** to form a detailed context based on the available structured data (entities, relationships, text, reports).
- **Passing the query and context** to the language model (LLM), which generates the final answer.

The search engine is now ready to process queries, using the underlying GraphRAG system to provide context-aware and semantically rich answers.


### **Summary**

This search engine leverages **structured data** (entities, relationships, reports, etc.) generated from the input files and integrates **semantic embeddings** stored in Couchbase. The search engine processes the query using OpenAI's language model, which uses the structured data context of the graph RAG to generate meaningful answers.

In [12]:
context_builder = LocalSearchMixedContext(
    community_reports=data["reports"],
    text_units=data["text_units"],
    entities=data["entities"],
    relationships=data["relationships"],
    covariates=data["covariates"],
    entity_text_embeddings=couchbase_vector_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,
    "max_tokens": 12_000,
}

llm_params = {
    "max_tokens": 2_000,
    "temperature": 0.0,
}

search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",
)

# Running a Query
Finally, we run a query on the search engine. In this case, the query is "Give me a summary about the story". The search engine uses the entity embeddings stored in Couchbase to find the entities most relevant to the query, then builds the context from the loaded graph data and text units.

In [13]:
question = "Give me a summary about the story"

try:
    result = await search_engine.asearch(question)
    print(f"Question: '{question}'")
    print(f"Answer: {result.response}")
except Exception as e:
    print(f"An error occurred while processing the query: {e}")

Question: 'Give me a summary about the story'
Answer: # Summary of the Story

## Introduction to the Paranormal Military Squad

The narrative centers around the Paranormal Military Squad, a secretive governmental faction tasked with investigating and engaging with extraterrestrial phenomena. This elite group operates primarily from the Dulce military base, where they are deeply involved in Operation: Dulce. The mission's primary objective is to mediate Earth's contact with alien intelligence, ensuring humanity's safety and preparing for potential first contact scenarios [Data: Paranormal Military Squad and Operation: Dulce (18)].

## Key Figures and Their Roles

Key figures within the squad include Alex Mercer, Dr. Jordan Hayes, Taylor Cruz, and Sam Rivera. Alex Mercer provides leadership and strategic insights, guiding the team through high-stakes operations. Dr. Jordan Hayes focuses on deciphering alien codes and understanding their intent, contributing significantly to the team's mi
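The query cell uses a top-level `await`, which works in a notebook but not in a plain Python script. The sketch below shows one way to wrap such a call with `asyncio.run`. `StubSearchEngine` and `StubResult` are hypothetical stand-ins so the example runs without any external service; they only mimic the `asearch()` and `response` shape used above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical result object exposing a `response` attribute."""
    response: str


class StubSearchEngine:
    """Hypothetical stand-in with the same asearch() coroutine shape."""

    async def asearch(self, question):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return StubResult(response=f"stub answer to: {question}")


async def ask(engine, question):
    # Await the engine's coroutine and return only the answer text.
    result = await engine.asearch(question)
    return result.response


def ask_sync(engine, question):
    # Entry point for scripts: starts an event loop and runs the query.
    return asyncio.run(ask(engine, question))
```

In a script, `ask_sync(search_engine, question)` would take the place of the `await` expression. Inside a notebook, keep using `await`, because `asyncio.run` cannot be called while an event loop is already running.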
