# Tutorial on GraphRAG with Couchbase
This notebook walks through setting up a GraphRAG local search pipeline that combines Couchbase for storing entity embeddings, OpenAI models for generating embeddings and answers, and GraphRAG's local search engine for querying structured data. This is useful when you need to search structured data with natural language queries, drawing on both a language model and a database.

# Importing Necessary Libraries
In this section, we import all the essential Python libraries required to perform various tasks, such as loading data, interacting with Couchbase, and using OpenAI models for generating text and embeddings.

The libraries used include:

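- asyncio: For running asynchronous tasks. It is not imported explicitly below because Jupyter allows `await` at the top level of a cell.
- logging: For managing logs that help in debugging and monitoring the workflow.
- os and dotenv: For reading configuration from environment variables and a `.env` file.
- pandas: For data manipulation and reading the Parquet data files.
- tiktoken: For tokenizing text, which is needed to count tokens and keep the context within the model's limits.
- graphrag.query and graphrag.vector_stores: Modules from the GraphRAG library that handle entity lookup, local search, and vector storage.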

In [1]:
import logging
import os
from typing import Any, Callable, Dict, List, Union

import pandas as pd
import tiktoken
from dotenv import load_dotenv

from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import store_entity_semantic_embeddings
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.couchbasedb import CouchbaseVectorStore

# Configuring Environment Variables
Here, we configure various environment variables that define paths, API keys, and connection strings. These values are essential for connecting to Couchbase and OpenAI, loading data, and defining other constants.

- INPUT_DIR: The directory that contains the input data files, here the Parquet tables read in the next section. It has no default and must be set.
- COUCHBASE_CONNECTION_STRING: The connection string for the Couchbase database. It includes the protocol and host (e.g., "couchbase://localhost").
- COUCHBASE_USERNAME, COUCHBASE_PASSWORD, COUCHBASE_BUCKET_NAME, COUCHBASE_SCOPE_NAME, COUCHBASE_VECTOR_INDEX_NAME: The Couchbase credentials and the names of the bucket, scope, and vector index that hold the embeddings. Each has a default suitable for a local demo.
- OPENAI_API_KEY: Your API key for OpenAI's services. It authenticates every request to the OpenAI API and has no default.
- LLM_MODEL: The OpenAI Large Language Model (LLM) used for text generation, for example "gpt-4" or "gpt-3.5-turbo". It defaults to "gpt-4o".
- EMBEDDING_MODEL: The model used to generate text embeddings, which are vector representations of text that capture semantic meaning. It defaults to "text-embedding-ada-002".

These environment variables are crucial for the notebook's functionality, as they provide necessary configuration details for data access, database connections, and AI model interactions.
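
Because `INPUT_DIR` and `OPENAI_API_KEY` have no defaults, a missing value only surfaces later as a file or authentication error. A minimal sketch of an early check follows; the helper name `missing_env_vars` and the `REQUIRED_VARS` tuple are illustrative assumptions, not part of GraphRAG or python-dotenv.

```python
import os
from typing import List, Mapping, Sequence

# Variables that have no default in the configuration cell.
# This tuple is an assumption made for illustration.
REQUIRED_VARS = ("INPUT_DIR", "OPENAI_API_KEY")


def missing_env_vars(
    names: Sequence[str] = REQUIRED_VARS,
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing required environment variables: {', '.join(missing)}")
```

Running a check like this right after `load_dotenv()` reports a configuration problem before any data is loaded.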

In [2]:
load_dotenv()

INPUT_DIR = os.getenv("INPUT_DIR")
COUCHBASE_CONNECTION_STRING = os.getenv("COUCHBASE_CONNECTION_STRING", "couchbase://localhost")
COUCHBASE_USERNAME = os.getenv("COUCHBASE_USERNAME", "Administrator")
COUCHBASE_PASSWORD = os.getenv("COUCHBASE_PASSWORD", "password")
COUCHBASE_BUCKET_NAME = os.getenv("COUCHBASE_BUCKET_NAME", "graphrag-demo")
COUCHBASE_SCOPE_NAME = os.getenv("COUCHBASE_SCOPE_NAME", "shared")
COUCHBASE_VECTOR_INDEX_NAME = os.getenv("COUCHBASE_VECTOR_INDEX_NAME", "graphrag_index")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4o")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-ada-002")

# Loading Data from Parquet Files
In this part, we load data from Parquet files into a dictionary. Each file corresponds to a particular table in the dataset, and we define functions that will handle the loading and processing of each table.

read_indexer_entities, read_indexer_relationships, and the other read_indexer_* functions are imported from graphrag.query.indexer_adapters. Each one converts a loaded table, such as entities or relationships, into the objects used by the query engine.

We use pandas to load the data from the files, and if a file is not found, we log a warning and continue.
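
The load-or-warn pattern described above can be sketched as one small helper. This is an illustration under assumptions, not the notebook's own code: the name `load_optional_table` and its `reader` parameter are hypothetical, and the reader defaults to `pandas.read_parquet`.

```python
import logging
from pathlib import Path
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def load_optional_table(
    input_dir: str,
    table_name: str,
    reader: Optional[Callable[[Path], Any]] = None,
) -> Optional[Any]:
    """Read `<input_dir>/<table_name>.parquet`, or return None if it is missing."""
    if reader is None:
        # Imported lazily so the helper can be exercised without pandas installed.
        import pandas as pd

        reader = pd.read_parquet
    path = Path(input_dir) / f"{table_name}.parquet"
    try:
        return reader(path)
    except FileNotFoundError:
        logger.warning("%s not found. Returning None.", path)
        return None
```

A caller would still pass the returned DataFrame to the matching `read_indexer_*` function, and skip that step when the result is `None`.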

## Entities table:
This table stores information about various entities in the system. Each entity has a unique ID, a short ID, a title, a type (e.g., PERSON), a description, and embeddings of the description. It may also include name embeddings, graph embeddings, community IDs, text unit IDs, document IDs, a rank, and additional attributes.

## Relationships table:
This table represents relationships between entities. Each relationship has a unique ID, a short ID, a source entity, a target entity, a weight, a description, and potentially description embeddings. It also includes text unit IDs, document IDs, and may have additional attributes like rank.

## Covariate table:
This table stores additional variables or attributes that may be associated with entities, relationships, or other elements in the system. Covariates are typically used to provide context or additional information that can be useful for analysis or modeling.

## Reports table:
This table contains community reports. Each report has an ID, a short ID, a title, a community ID, a summary, full content, a rank, and potentially embeddings for the summary and full content. It may also include additional attributes.

## Text units table:
This table stores text units, which are segments of text from the source documents. Each text unit has an ID, a short ID, the actual text content, and potentially text embeddings. It also includes entity IDs, relationship IDs, covariate IDs, the number of tokens, document IDs, and may have additional attributes.

In [3]:
# Set up logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
logger.info("Loading data from parquet files")
data = {}

# Constants
COMMUNITY_LEVEL = 2

# Table names
TABLE_NAMES = {
    "COMMUNITY_REPORT_TABLE": "create_final_community_reports",
    "ENTITY_TABLE": "create_final_nodes",
    "ENTITY_EMBEDDING_TABLE": "create_final_entities",
    "RELATIONSHIP_TABLE": "create_final_relationships",
    "COVARIATE_TABLE": "create_final_covariates",
    "TEXT_UNIT_TABLE": "create_final_text_units",
}

try:
    data["entities"] = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['ENTITY_TABLE']}.parquet")
    entity_embeddings = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['ENTITY_EMBEDDING_TABLE']}.parquet")
    data["entities"] = read_indexer_entities(data["entities"], entity_embeddings, COMMUNITY_LEVEL)
    print("Entities table sample:")
    print(data["entities"][0])
except FileNotFoundError:
    logger.warning("ENTITY_TABLE file not found. Setting entities to None.")
    data["entities"] = None

try:
    data["relationships"] = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['RELATIONSHIP_TABLE']}.parquet")
    data["relationships"] = read_indexer_relationships(data["relationships"])
    print("Relationships table sample:")
    print(data["relationships"][0])
except FileNotFoundError:
    logger.warning("RELATIONSHIP_TABLE file not found. Setting relationships to None.")
    data["relationships"] = None

try:
    data["covariates"] = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['COVARIATE_TABLE']}.parquet")
    data["covariates"] = read_indexer_covariates(data["covariates"])
    print("Covariates table sample:")
    print(data["covariates"][0])
except FileNotFoundError:
    logger.warning("COVARIATE_TABLE file not found. Setting covariates to None.")
    data["covariates"] = None

try:
    data["reports"] = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['COMMUNITY_REPORT_TABLE']}.parquet")
    entity_data = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['ENTITY_TABLE']}.parquet")
    data["reports"] = read_indexer_reports(data["reports"], entity_data, COMMUNITY_LEVEL)
    print("Reports table sample:")
    print(data["reports"][0])
except FileNotFoundError:
    logger.warning("COMMUNITY_REPORT_TABLE file not found. Setting reports to None.")
    data["reports"] = None

try:
    data["text_units"] = pd.read_parquet(f"{INPUT_DIR}/{TABLE_NAMES['TEXT_UNIT_TABLE']}.parquet")
    data["text_units"] = read_indexer_text_units(data["text_units"])
    print("Text units table sample:")
    print(data["text_units"][0])
except FileNotFoundError:
    logger.warning("TEXT_UNIT_TABLE file not found. Setting text_units to None.")
    data["text_units"] = None

print("Data loading completed")

2024-09-11 16:28:21,629 - __main__ - INFO - Loading data from parquet files


Entities table sample:
Entity(id='3b040bcc19f14e04880ae52881a89c1c', short_id='27', title='AGENTS', type='PERSON', description='Agents Alex Mercer, Jordan Hayes, Taylor Cruz, and Sam Rivera are the team members exploring Dulce base', description_embedding=[0.007186084054410458, -0.023490138351917267, -0.01996060274541378, -0.04203357920050621, -0.02858390286564827, 0.04885200038552284, -0.00892411358654499, -0.009519054554402828, 0.004632517229765654, -0.036017321050167084, 0.018556809052824974, 0.03604406118392944, -0.013496468774974346, -0.006587800569832325, 0.016150306910276413, -0.012640823610126972, 0.003860431257635355, -0.038343608379364014, -0.014372168108820915, -0.015174335800111294, 0.001866710721515119, -0.017099536955356598, 0.013549946248531342, -0.02016114443540573, 0.0021457981783896685, -0.021351026371121407, 0.014626188203692436, -0.024813715368509293, 0.02861064113676548, 0.005768921226263046, 0.005023573990911245, 0.0026187426410615444, -0.005875877104699612, -0.01

# Setting Up the Couchbase Vector Store
Couchbase is used here to store the semantic embeddings generated from entities. In this step, we define a method to connect to the Couchbase database using the provided credentials.

The CouchbaseVectorStore allows you to store, retrieve, and manage vector embeddings in Couchbase.
The connect() method initializes the connection to Couchbase using the provided connection string, username, and password.
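
To make clear what a vector store is used for, here is a conceptual sketch of similarity search over stored embeddings. It is a brute-force illustration in plain Python, not how Couchbase or `CouchbaseVectorStore` implements search; the function names and the toy vectors are assumptions, and the actual similarity metric depends on how the vector index is configured.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float], stored: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k stored ids most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k_similar([1.0, 0.1], stored, k=2))
```

A vector index serves the same purpose at scale: it returns the nearest stored embeddings without comparing the query against every document.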

In [4]:
logger.info("Setting up CouchbaseVectorStore")

try:
    couchbase_vector_store = CouchbaseVectorStore(
        collection_name="entity_description_embeddings",
        bucket_name=COUCHBASE_BUCKET_NAME,
        scope_name=COUCHBASE_SCOPE_NAME,
        index_name=COUCHBASE_VECTOR_INDEX_NAME,
    )
    couchbase_vector_store.connect(
        connection_string=COUCHBASE_CONNECTION_STRING,
        username=COUCHBASE_USERNAME,
        password=COUCHBASE_PASSWORD,
    )
    logger.info("CouchbaseVectorStore setup completed")
except Exception as e:
    logger.error(f"Error setting up CouchbaseVectorStore: {str(e)}")
    raise

2024-09-11 16:28:21,824 - __main__ - INFO - Setting up CouchbaseVectorStore
2024-09-11 16:28:21,827 - graphrag.vector_stores.couchbasedb - INFO - Connecting to Couchbase at couchbase://localhost
2024-09-11 16:28:21,873 - graphrag.vector_stores.couchbasedb - INFO - Successfully connected to Couchbase
2024-09-11 16:28:21,874 - __main__ - INFO - CouchbaseVectorStore setup completed


# Setting Up Language Models
In this section, we configure the language models using OpenAI’s API. We initialize:

- ChatOpenAI: The language model used to generate responses to natural language queries.
- OpenAIEmbedding: The model used to generate vector embeddings for text data.
- tiktoken: The tokenizer used to split text into tokens, the units in which the language model measures its input. The cell below loads the `cl100k_base` encoding.

In [5]:
logger.info("Setting up LLM and embedding models")

try:
    llm = ChatOpenAI(
        api_key=OPENAI_API_KEY,
        model=LLM_MODEL,
        api_type=OpenaiApiType.OpenAI,
        max_retries=20,
    )

    token_encoder = tiktoken.get_encoding("cl100k_base")

    text_embedder = OpenAIEmbedding(
        api_key=OPENAI_API_KEY,
        api_base=None,
        api_type=OpenaiApiType.OpenAI,
        model=EMBEDDING_MODEL,
        deployment_name=EMBEDDING_MODEL,
        max_retries=20,
    )

    logger.info("LLM and embedding models setup completed")
except Exception as e:
    logger.error(f"Error setting up models: {str(e)}")
    raise

2024-09-11 16:28:21,911 - __main__ - INFO - Setting up LLM and embedding models
2024-09-11 16:28:22,536 - __main__ - INFO - LLM and embedding models setup completed


# Storing Embeddings in Couchbase
After generating embeddings for the entities, we store them in Couchbase. We use the store_entity_semantic_embeddings function to store the embeddings.

The cell below checks whether `data["entities"]` is a dictionary or a list and converts it to a list, then passes it to `store_entity_semantic_embeddings`.
It uses the Couchbase vector store to save the embeddings, ensuring that entities have the proper 'id' attribute for storage.
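
The normalization and the `id` requirement described above can be sketched as two small helpers. Both names (`as_entity_list`, `ids_missing`) are hypothetical and exist only for illustration; they are not GraphRAG functions.

```python
from typing import Any, Dict, List, Union


def as_entity_list(entities: Union[Dict[str, Any], List[Any], None]) -> List[Any]:
    """Normalize a dict of entities, a list of entities, or None to a list."""
    if entities is None:
        return []
    if isinstance(entities, dict):
        return list(entities.values())
    return list(entities)


def ids_missing(entities: List[Any]) -> List[int]:
    """Return the positions of entities that lack a non-empty `id` attribute."""
    return [i for i, entity in enumerate(entities) if not getattr(entity, "id", None)]
```

Checking `ids_missing(as_entity_list(data["entities"]))` before storing would report problem entities up front, rather than through an `AttributeError` during the write.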


In [6]:
logger.info("Storing entity embeddings")

try:
    entities_list = list(data["entities"].values()) if isinstance(data["entities"], dict) else data["entities"]

    store_entity_semantic_embeddings(
        entities=entities_list, vectorstore=couchbase_vector_store
    )
    logger.info("Entity semantic embeddings stored successfully")
except AttributeError as e:
    logger.error(f"Error storing entity semantic embeddings: {str(e)}")
    logger.error("Ensure all entities have an 'id' attribute")
    raise
except Exception as e:
    logger.error(f"Error storing entity semantic embeddings: {str(e)}")
    raise

2024-09-11 16:28:22,550 - __main__ - INFO - Storing entity embeddings
2024-09-11 16:28:22,553 - graphrag.vector_stores.couchbasedb - INFO - Loading 96 documents into vector storage
2024-09-11 16:28:22,992 - graphrag.vector_stores.couchbasedb - INFO - Successfully loaded 96 out of 96 documents
2024-09-11 16:28:22,997 - __main__ - INFO - Entity semantic embeddings stored successfully


# Building the Search Engine

In this section, we create a search engine from several **GraphRAG** components. GraphRAG organizes structured data (entities, relationships, and other contextual information) to support semantic search. The search engine lets you query this data using **natural language** and get relevant, context-aware responses.

Here, we explain each component of the search engine and how it contributes.

## Context Builder (LocalSearchMixedContext)

The `LocalSearchMixedContext` class is the core of the search engine. It builds the context for each query by combining several types of data (**community reports, text units, entities, relationships, and covariates**) into a single structure that the search engine can use:

- **Community Reports**: Structured summaries generated at the community level, which matter most when a query concerns a community as a whole.
- **Text Units**: Segments of the source text, such as paragraphs or sentences. They supply the original wording behind the structured data when answering questions.
- **Entities**: The core subjects (people, organizations, products, etc.) around which queries are structured. Each entity has attributes and a semantic embedding stored in Couchbase, and these are used to enrich the search results.
- **Relationships**: The connections between entities, which can represent anything from business partnerships to familial ties or data dependencies. They help place the search results in context.
- **Covariates**: Additional variables or metadata about entities and relationships, such as location or time, that can affect the relevance of a result.

All these elements together form the **context** that the search engine uses to answer a query. The context builder also takes three supporting arguments:

- **entity_text_embeddings**: The Couchbase vector store that holds the entity description embeddings, used to find entities semantically similar to the query.
- **text_embedder**: The **OpenAI embedding model** used to embed user queries into the same vector space as the stored entity embeddings, so that similarity comparisons are meaningful.
- **token_encoder**: The tokenizer that splits text into tokens, used to measure how much text fits in the context.

## Local Search Parameters

Once the context builder is set up, we define the parameters that control how it assembles the context for a query.

- **text_unit_prop**: The proportion of the context allotted to text units. In this case, 50% of the context comes from text units.
- **community_prop**: The proportion of the context allotted to community reports. Here, 10% of the context is derived from community reports.
- **conversation_history_max_turns**: How many conversation history turns are retained when building the context. This helps in multi-turn queries, where earlier questions may still be relevant.
- **top_k_mapped_entities**: How many of the entities most similar to the query are considered. In this case, the top 10 entities.
- **top_k_relationships**: Similarly, the top 10 relationships most relevant to the query are considered.
- **include_entity_rank**: Whether to include each entity's rank in the context.
- **include_relationship_weight**: Whether to include each relationship's weight in the context. The weight indicates how important a relationship is in the data being queried.
- **embedding_vectorstore_key**: The **key** used to look up entity embeddings in Couchbase. Here, we use `EntityVectorStoreKey.ID`, so embeddings are retrieved by entity ID.
- **max_tokens**: The maximum number of tokens to consider in the context.
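
The arithmetic behind the two proportions can be worked through in a few lines. This sketch assumes that the share of `max_tokens` not claimed by text units or community reports is left for entities, relationships, and covariates; the function name and the dictionary keys are illustrative, not GraphRAG's API.

```python
from typing import Dict


def split_token_budget(
    max_tokens: int, text_unit_prop: float, community_prop: float
) -> Dict[str, int]:
    """Split a context budget into text-unit, community-report, and remaining tokens."""
    if text_unit_prop < 0 or community_prop < 0 or text_unit_prop + community_prop > 1:
        raise ValueError("proportions must be non-negative and sum to at most 1")
    text_unit_tokens = int(max_tokens * text_unit_prop)
    community_tokens = int(max_tokens * community_prop)
    # Assumption: whatever is left goes to entities, relationships, and covariates.
    remaining_tokens = max_tokens - text_unit_tokens - community_tokens
    return {
        "text_units": text_unit_tokens,
        "community_reports": community_tokens,
        "entities_relationships_covariates": remaining_tokens,
    }


budget = split_token_budget(max_tokens=12_000, text_unit_prop=0.5, community_prop=0.1)
print(budget)
```

With the values used in this notebook, the split is 6,000 tokens for text units, 1,200 for community reports, and 4,800 for everything else.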


## Language Model Parameters

To answer the query, we use the **language model** (LLM) to generate the response. Its parameters are configured here:
- **max_tokens**: Limits the number of tokens (words or sub-words) in the generated answer.
- **temperature**: Controls the randomness of the output. Setting it to `0.0` makes the model’s answers more deterministic.
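
The effect of temperature can be illustrated with a softmax over made-up token scores. This is a textbook sketch of temperature scaling, not a description of how OpenAI's API handles the parameter internally; the function name and the example scores are assumptions.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn raw scores into sampling probabilities at the given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Limit case: all probability mass on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 2.0]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.5, 0.0):
    probabilities = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

As the temperature drops, probability concentrates on the highest-scoring token, which is why low values give more repeatable answers.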


## Integrating Everything: Creating the Search Engine

Finally, all components are integrated into the `LocalSearch` class, which serves as the main search engine. This class is responsible for:
- **Accepting queries** in natural language.
- **Using the context builder** to form a detailed context based on the available structured data (entities, relationships, text, reports).
- **Passing the query and context** to the language model (LLM), which generates the final answer.

The search engine is now ready to process queries, using the underlying GraphRAG components to provide context-aware and semantically rich answers.


## Summary

In this section, we have built a search engine on top of **GraphRAG**. It draws on **structured data** (entities, relationships, reports, etc.) and on **semantic embeddings** stored in Couchbase. For each query, it builds a context from that data and passes it to OpenAI's language model, which generates the answer.

Key steps include:
1. Setting up the **context builder** to combine different types of data.
2. Defining search parameters for handling text units, entities, relationships, and embedding similarities.
3. Integrating the **language model** to generate answers based on the context.

This search engine is useful for querying structured data with natural language, particularly when the data has both structured and unstructured components that need to be processed together.

In [7]:
logger.info("Creating search engine")

context_builder = LocalSearchMixedContext(
    community_reports=data["reports"],
    text_units=data["text_units"],
    entities=data["entities"],
    relationships=data["relationships"],
    covariates=data["covariates"],
    entity_text_embeddings=couchbase_vector_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,
    "max_tokens": 12_000,
}

llm_params = {
    "max_tokens": 2_000,
    "temperature": 0.0,
}

search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",
)

logger.info("Search engine created")

2024-09-11 16:28:23,046 - __main__ - INFO - Creating search engine
2024-09-11 16:28:23,052 - __main__ - INFO - Search engine created


# Running a Query
Finally, we run a query on the search engine. In this case, the query is "Give me a summary about the story". The search engine looks up the most relevant entities through the Couchbase vector store, builds a context from the loaded tables, and asks the language model to answer from that context.

asearch: This is an asynchronous search function that takes a query and returns a response generated by the language model.
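
The cell below uses `await` at the top level, which works in Jupyter but not in a plain Python script. A minimal sketch of the script equivalent follows, using `asyncio.run`. `StubSearchEngine` and `StubResult` are hypothetical stand-ins so the example runs without Couchbase or OpenAI; in practice the `search_engine` object built above would take the stub's place.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Stand-in for the search result; only the `response` field is modeled."""

    response: str


class StubSearchEngine:
    """Hypothetical stand-in exposing an `asearch` coroutine like the one used here."""

    async def asearch(self, query: str) -> StubResult:
        await asyncio.sleep(0)  # yield to the event loop, as a network call would
        return StubResult(response=f"(stub answer to: {query})")


async def main(engine, question: str) -> str:
    """Await the asynchronous search and return the answer text."""
    result = await engine.asearch(question)
    return result.response


# In a script there is no running event loop, so asyncio.run starts one.
answer = asyncio.run(main(StubSearchEngine(), "Give me a summary about the story"))
print(answer)
```

Inside a notebook, calling `asyncio.run` from a cell fails because an event loop is already running, so the direct `await` form shown in the cell is the right choice there.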

In [8]:
question = "Give me a summary about the story"
logger.info(f"Running query: '{question}'")

try:
    result = await search_engine.asearch(question)
    print(f"Question: '{question}'")
    print(f"Answer: {result.response}")
    logger.info("Query completed successfully")
except Exception as e:
    logger.error(f"An error occurred while processing the query: {str(e)}")
    print(f"An error occurred while processing the query: {str(e)}")

2024-09-11 16:28:23,093 - __main__ - INFO - Running query: 'Give me a summary about the story'
2024-09-11 16:28:23,096 - graphrag.vector_stores.couchbasedb - INFO - Performing similarity search by text with k=20
2024-09-11 16:28:23,914 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-11 16:28:24,107 - graphrag.vector_stores.couchbasedb - INFO - Performing similarity search by vector with k=20
2024-09-11 16:28:24,115 - graphrag.vector_stores.couchbasedb - INFO - Found 20 results in similarity search by vector
2024-09-11 16:28:24,212 - graphrag.query.structured_search.local_search.search - INFO - GENERATE ANSWER: 1726052303.0959587. QUERY: Give me a summary about the story
2024-09-11 16:28:25,232 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-11 16:28:34,545 - __main__ - INFO - Query completed successfully


Question: 'Give me a summary about the story'
Answer: ## Summary of the Story

### Introduction to the Mission

The narrative revolves around a mission undertaken by the Paranormal Military Squad at the Dulce military base. The mission, known as Operation: Dulce, is a high-stakes endeavor involving the investigation and potential first contact with extraterrestrial intelligence. The team, led by key figures such as Alex, Taylor Cruz, and Dr. Jordan Hayes, navigates the complexities and dangers of the base, which is filled with advanced technology and hidden secrets [Data: Entities (21, 47, 50, 54, 157, 193)].

### Key Characters and Their Roles

**Alex** is a central figure in the mission, displaying leadership and a mix of respect, mentorship, and skepticism. He works closely with other team members, including Dr. Jordan Hayes and Sam Rivera, to decode and respond to the alien signal [Data: Entities (47); Relationships (117, 88, 56, 155, 196)].

**Dr. Jordan Hayes** is a scientist stu

With these steps, the entire process of loading data, setting up models, storing embeddings, and running a search engine query is written out in sequence.