# GraphRAG with Neo4j, LangChain, and Gemini on Vertex AI Reasoning Engine

This is a demonstration of a GenAI API with advanced RAG patterns that combine vector and graph search.

It is deployed on Vertex AI Reasoning Engine (Preview) as scalable infrastructure and can then be integrated with GenAI applications on Cloud Run via REST APIs.

## Dataset

The dataset is a graph of companies, their associated industries, the people connected to them, and articles that report on those companies.

![Graph Model](https://i.imgur.com/lWJZSEe.png)

The articles are split into chunks, and the chunks are also stored in the graph.

Embeddings are computed for each text chunk with `textembedding-gecko` (768 dimensions) and stored on the chunk node.
A Neo4j vector index `news_google` and a fulltext index `news_fulltext` (for hybrid search) were created.
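To make the role of the vector index concrete, here is a minimal pure-Python sketch of what a similarity lookup does conceptually: rank the stored chunk embeddings by similarity to a query embedding and keep the best k. It assumes cosine similarity, and the 3-dimensional vectors and chunk ids are invented for illustration. The real index works on the gecko embeddings and uses approximate nearest-neighbour search rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=5):
    """Return the k (chunk_id, score) pairs most similar to query_vec."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings; real chunk embeddings have many more dimensions.
chunks = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
hits = top_k([1.0, 0.0, 0.0], chunks, k=2)
print(hits)
```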

The database is publicly available with a read-only user:

https://demo.neo4jlabs.com:7473/browser/

* URI: neo4j+s://demo.neo4jlabs.com
* User: companies
* Password: companies
* Database: companies

We use the `Neo4jVector` LangChain integration, which supports advanced RAG patterns.
We combine hybrid search with a parent-child retriever and GraphRAG (extracting relevant context from the graph).
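As a rough illustration of hybrid search, the sketch below merges the hits of a vector lookup and a fulltext lookup into one ranking. It rests on an assumption: each index's scores are divided by that index's best score, and a chunk found by both keeps the higher value. The actual scoring happens inside Neo4j through the `Neo4jVector` integration and may differ. All chunk ids and scores are made up.

```python
def normalize(hits):
    """Scale scores so the best hit of one index gets score 1.0."""
    if not hits:
        return {}
    best = max(hits.values())
    return {node: score / best for node, score in hits.items()}

def hybrid_merge(vector_hits, keyword_hits, k=5):
    """Union both result sets, keep the higher normalized score per chunk."""
    merged = dict(normalize(vector_hits))
    for node, score in normalize(keyword_hits).items():
        merged[node] = max(score, merged.get(node, 0.0))
    ranked = sorted(merged.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical raw scores: cosine similarities and fulltext relevance scores.
vector_hits = {"chunk-1": 0.9, "chunk-2": 0.45}
keyword_hits = {"chunk-1": 8.0, "chunk-3": 2.0}
result = hybrid_merge(vector_hits, keyword_hits, k=3)
print(result)
```

Normalizing first matters because the two indexes score on different scales, so raw values cannot be compared directly.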

In our configuration we provide both the vector and the fulltext index, as well as a retrieval query that fetches the following additional information for each chunk:

* Parent `Article` of the `Chunk` (aggregate all chunks for a single article)
* `Organization`(s) mentioned
* `IndustryCategory`(ies) for the organization
* `Person`(s) connected to the organization and their roles (e.g. investor, chairman, ceo)
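The parent-child step of the retrieval query (`collect(distinct c.text)` and `avg(score)` per article) can be pictured in plain Python as follows. This is only a sketch of the aggregation logic; the hit dictionaries, article ids, and scores are invented, and the real work is done in Cypher inside the database.

```python
from collections import defaultdict

def group_by_article(chunk_hits):
    """Group chunk hits by parent article: distinct texts, average score."""
    texts = defaultdict(list)
    scores = defaultdict(list)
    for hit in chunk_hits:
        article = hit["article"]
        if hit["text"] not in texts[article]:  # mimic collect(distinct ...)
            texts[article].append(hit["text"])
        scores[article].append(hit["score"])   # mimic avg(score)
    return [
        {
            "article": article,
            "texts": texts[article],
            "score": sum(scores[article]) / len(scores[article]),
        }
        for article in texts
    ]

# Hypothetical chunk hits as they might come back from the index lookup.
chunk_hits = [
    {"article": "A1", "text": "first chunk", "score": 0.75},
    {"article": "A2", "text": "other article", "score": 0.5},
    {"article": "A1", "text": "second chunk", "score": 0.25},
]
articles = group_by_article(chunk_hits)
print(articles)
```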

We will retrieve the top-k = 5 results from the vector index.

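As LLM we use Gemini on Vertex AI. The code below sets `gemini-1.5-pro-preview-0409`; `gemini-pro` (*Gemini Pro 1.0*) remains as a commented-out alternative.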

We use a temperature of 0.1, top-k=40, top-p=0.8
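For readers unfamiliar with these sampling parameters, the following sketch shows how temperature, top-k, and top-p are commonly defined: temperature rescales the logits, top-k keeps the k most probable tokens, and top-p keeps the smallest prefix whose cumulative probability reaches p. It is a generic illustration with made-up logits, not a description of how Gemini implements sampling internally.

```python
import math

def filter_distribution(logits, temperature=0.1, top_k=40, top_p=0.8):
    """Return the renormalized token probabilities left after filtering."""
    # Temperature: dividing logits by a value below 1 sharpens the distribution.
    scaled = {token: logit / temperature for token, logit in logits.items()}
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled.values())
    weights = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((token, weight / total) for token, weight in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for _, prob in kept)
    return {token: prob / norm for token, prob in kept}

logits = {"Paris": 2.0, "London": 1.0, "Rome": 0.5}  # hypothetical scores
sharp = filter_distribution(logits)                   # temperature 0.1
flat = filter_distribution(logits, temperature=1.0)
print(sharp)
print(flat)
```

In this toy example the low temperature leaves a single candidate token, while at temperature 1.0 two tokens survive the top-p cut. This is why a low temperature gives near-deterministic answers.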

Our `LangchainCode` class has an `__init__` method, which may only hold serializable information (strings and numbers).
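The split between a serializable constructor and a later set-up step can be sketched without any cloud dependencies. In the hypothetical `SketchApp` below, a `threading.Lock` stands in for non-serializable resources such as model clients or database connections, and JSON stands in for whatever serialization the SDK performs on the instance. Both are assumptions for illustration only.

```python
import json
import threading

class SketchApp:
    """Hypothetical stand-in for a deployable class."""

    def __init__(self, model_name="example-model", temperature=0.1):
        # Only plain configuration values: safe to serialize and ship.
        self.model_name = model_name
        self.temperature = temperature

    def set_up(self):
        # Created after deserialization; a lock cannot be serialized.
        self.client_lock = threading.Lock()

    def query(self, question):
        with self.client_lock:
            return f"{self.model_name} received: {question}"

app = SketchApp()
state = json.dumps(vars(app))              # works: strings and numbers only
restored = SketchApp(**json.loads(state))  # "remote side" rebuilds the object
restored.set_up()
answer = restored.query("hello")

try:
    json.dumps(vars(restored))             # now holds a lock
    serializable_after_set_up = True
except TypeError:
    serializable_after_set_up = False
print(answer, serializable_after_set_up)
```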

In `set_up()`, Gemini as the LLM, the Vertex AI embeddings, and the `Neo4jVector` retriever are combined into a LangChain chain.

The `query` method then exposes this chain as a tool to a LangChain agent, which calls it with `chain.invoke()`.

The class is deployed as ReasoningEngine with the Google Vertex AI Python SDK.
For the deployment, you provide an instance of the class, which captures the relevant environment variables and configuration, together with the dependencies, in our case `google-cloud-aiplatform`, `langchain`, `langchain_community`, `langchain_google_vertexai`, and `neo4j`.

After a successful deployment, we can use the resulting object via its `query` method, passing in our user question.

In [None]:
PROJECT_ID = "iamtests-315719"
REGION = "us-central1"
STAGING_BUCKET = "gs://neo4j-vertex-ai-extension2"


In [None]:
from google.colab import auth
auth.authenticate_user(project_id=PROJECT_ID)

!gcloud config set project {PROJECT_ID}

In [None]:
!pip install --quiet neo4j==5.19.0
!pip install --quiet langchain_google_vertexai==1.0.4
!pip install --quiet --force-reinstall langchain==0.2.0 langchain_community==0.2.0


In [None]:
!pip install --quiet google-cloud-aiplatform==1.51.0
!pip install --quiet  google-cloud-resource-manager==1.12.3

In [None]:
import vertexai
from vertexai.preview import reasoning_engines

vertexai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=STAGING_BUCKET,
)

In [None]:
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
from langchain_community.vectorstores import Neo4jVector
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough


If you installed any packages above, restart the runtime to pick them up:

* Click the "Runtime" button at the top of Colab.
* Select "Restart session".

After the restart, you do not need to re-run the cells that install the packages.

In [45]:
import os

URI = os.getenv('NEO4J_URI', 'neo4j+s://demo.neo4jlabs.com')
USER = os.getenv('NEO4J_USERNAME','companies')
PASSWORD = os.getenv('NEO4J_PASSWORD','companies')
DATABASE = os.getenv('NEO4J_DATABASE','companies')

class LangchainCode:
    def __init__(self):
        self.model_name = "gemini-1.5-pro-preview-0409" #"gemini-pro"
        self.max_output_tokens = 1024
        self.temperature = 0.1
        self.top_p = 0.8
        self.top_k = 40
        self.project_id = PROJECT_ID
        self.location = REGION
        self.uri = URI
        self.username = USER
        self.password = PASSWORD
        self.database = DATABASE
        self.prompt_input_variables = ["query"]
        self.prompt_template="""
            You are a venture capital assistant that provides useful answers about companies, their boards, financing etc.
            only using the information from a company database already provided in the context.
            Prefer higher rated information in your context and add source links in your answers.
            Context: {context}"""

    def configure_qa_rag_chain(self, llm, embeddings):
        qa_prompt = ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(self.prompt_template),
            HumanMessagePromptTemplate.from_template("Question: {question}"
                                                      "\nWhat else can you tell me about it?"),
        ])

        # Vector + Knowledge Graph response
        kg = Neo4jVector.from_existing_index(
            embedding=embeddings,
            url=self.uri, username=self.username,
            password=self.password, database=self.database,
            search_type="hybrid",
            keyword_index_name="news_fulltext",
            index_name="news_google",
            retrieval_query="""
              WITH node as c,score
              MATCH (c)<-[:HAS_CHUNK]-(article:Article)

              WITH article, collect(distinct c.text) as texts, avg(score) as score
              RETURN article {.title, .sentiment, .siteName, .summary,
                    organizations: [ (article)-[:MENTIONS]->(org:Organization) |
                          org { .name, .revenue, .nbrEmployees, .isPublic, .motto, .summary,
                          orgCategories: [ (org)-[:HAS_CATEGORY]->(i) | i.name],
                          people: [ (org)-[rel]->(p:Person) | p { .name, .summary, role: replace(type(rel),"HAS_","") }]}],
                    texts: texts} as text,
              score, {source: article.siteName} as metadata
            """,
        )
        retriever = kg.as_retriever(search_kwargs={"k": 5})

        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)

        chain = (
            {"context": retriever | format_docs , "question": RunnablePassthrough()}
            | qa_prompt
            | llm
            | StrOutputParser()
        )
        return chain

    def set_up(self):
        from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
        from langchain_google_vertexai import VertexAIEmbeddings, ChatVertexAI
        from langchain_community.vectorstores import Neo4jVector
        from langchain_core.output_parsers import StrOutputParser
        from langchain_core.runnables import RunnableParallel, RunnablePassthrough

        self.llm = ChatVertexAI(
            model_name=self.model_name,
            max_output_tokens=self.max_output_tokens,
            max_input_tokens=32000,
            temperature=self.temperature,
            top_p=self.top_p,
            top_k=self.top_k,
            project = self.project_id,
            location = self.location,
            # convert_system_message_to_human=True,
            response_validation=False,
            verbose=True
        )
        self.embeddings = VertexAIEmbeddings("textembedding-gecko@001")

        self.qa_chain = self.configure_qa_rag_chain(self.llm, self.embeddings)

    def query(self, query):
        from langchain.agents import Tool, initialize_agent
        from langchain.chains.conversation.memory import ConversationBufferWindowMemory

        # Conversational memory; k=0 keeps no previous turns, so every call is stateless
        conversational_memory = ConversationBufferWindowMemory(
            memory_key='chat_history',
            k=0,
            return_messages=True
        )

        tools = [
            Tool(
                name='Knowledge Base',
                func=self.qa_chain.invoke,
                description=(
                    'use this tool when answering specific news queries to get '
                    'more information about the topic'
                )
            )
        ]

        agent = initialize_agent(
            agent='chat-conversational-react-description',
            tools=tools,
            llm=self.llm,
            verbose=True,
            max_iterations=3,
            early_stopping_method='generate',
            memory=conversational_memory
        )
        return agent(query)

In [46]:
from langchain.globals import set_debug
set_debug(False)

# testing locally
lc = LangchainCode()
lc.set_up()

In [47]:
response = lc.query('What is 2x5?')
print(response)

> Entering new AgentExecutor chain...
```json
{
 "action": "Final Answer",
 "action_input": "2 x 5 = 10"
}
```

> Finished chain.
{'input': 'What is 2x5?', 'chat_history': [], 'output': '2 x 5 = 10'}


In [48]:
response = lc.query('What are the news about IBM and its acquisitions and who are the people involved?')
print(response)

> Entering new AgentExecutor chain...
```json
{
    "action": "Knowledge Base",
    "action_input": "Recent IBM acquisitions and key people involved"
}
```
Observation: IBM acquired the following companies: SPSS [source](domain-b.com), Ounce Labs [source](domain-b.com), Ascential Software [source](CHINAdaily.com.cn), Cognos [source](CHINAdaily.com.cn), Netezza [source](CHINAdaily.com.cn), OpenPages [source](CHINAdaily.com.cn), Softek Storage Solutions [source](Information Technology Planning, Implementation and IT Solutions for Business - News & Reviews - BaselineMag.com), DWL [source](MarketWatch), and Lombardi Software [source](MC Press Online).

Key people involved in these acquisitions include:

* **IBM:**
    * Arvind Krishna (CEO) [source](CHINAdaily.com.cn)
    * Bill Kelleher (Board Member) [source](CHINAdaily.com.cn)
    * Michael L. Eskew (Board Member) [source](CHINAdaily.com.cn)
    * Mark Ritter (Chairman) [source](CHINAdaily.com.cn)

In [None]:
remote_app = reasoning_engines.ReasoningEngine.create(
    LangchainCode(),
    requirements=[
        "google-cloud-aiplatform==1.51.0",
        "langchain_google_vertexai==1.0.4",
        "langchain==0.2.0",
        "langchain_community==0.2.0",
        "neo4j==5.19.0"
    ],
    display_name="Neo4j Vertex AI RE Companies",
    description="Neo4j Vertex AI RE Companies",
    sys_version="3.10",
    extra_packages=[]
)

In [None]:
response = remote_app.query(query="Who is on the board of Siemens?")
print(response)