# GraphRAG with Neo4j, LangChain, and Gemini on Vertex AI Reasoning Engine

This is a demonstration of a GenAI API with advanced RAG patterns that combine vector and graph search.

It is deployed on Vertex AI Reasoning Engine (Preview) as scalable infrastructure and can then be integrated with GenAI applications on Cloud Run via REST APIs.

## Dataset

The dataset is a graph of companies, their associated industries and people, and articles that report on those companies.

![Graph Model](https://i.imgur.com/lWJZSEe.png)

The articles are chunked, and the chunks are also stored in the graph.

Embeddings are computed for each text chunk with `textembedding-gecko` (768 dimensions) and stored on each chunk node.
A Neo4j vector index `news_google` and a fulltext index `news_fulltext` (for hybrid search) were created.

The database is publicly available with a readonly user:

https://demo.neo4jlabs.com:7473/browser/

* URI: neo4j+s://demo.neo4jlabs.com
* User: companies
* Password: companies
* Database: companies

We use the `Neo4jVector` LangChain integration, which supports advanced RAG patterns.
We combine hybrid search with a parent-child retriever and GraphRAG (extracting relevant context from the graph).
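
As a rough illustration of what hybrid search does with the two indexes: each index returns its own scored hits, the scores are brought onto a common scale, and the merged list is cut to the top k. The sketch below is a simplified pure-Python picture of that idea, not the `Neo4jVector` implementation; the chunk ids, scores, and the helper names `normalize` and `hybrid_merge` are made up for illustration.

```python
def normalize(hits):
    """Scale scores so the best hit of one index gets 1.0."""
    if not hits:
        return {}
    top = max(hits.values())
    if top <= 0:
        return dict(hits)
    return {chunk_id: score / top for chunk_id, score in hits.items()}


def hybrid_merge(vector_hits, keyword_hits, k=5):
    """Merge two {chunk_id: score} result sets and keep the k best.

    A chunk found by both indexes keeps its higher normalized score.
    """
    merged = dict(normalize(vector_hits))
    for chunk_id, score in normalize(keyword_hits).items():
        merged[chunk_id] = max(score, merged.get(chunk_id, 0.0))
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up example hits: similarity scores from the vector index,
# unbounded relevance scores from the fulltext index.
vector_hits = {"chunk-1": 0.92, "chunk-2": 0.80, "chunk-3": 0.46}
keyword_hits = {"chunk-3": 7.5, "chunk-4": 3.0}

top_hits = hybrid_merge(vector_hits, keyword_hits, k=3)
for chunk_id, score in top_hits:
    print(chunk_id, round(score, 3))
```

Note how `chunk-3` ranks low in the vector results but is pulled up by a strong keyword match, which is the main motivation for combining the two indexes.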

In our configuration we provide both the vector and the fulltext index, as well as a retrieval query that fetches the following additional information for each chunk:

* Parent `Article` of the `Chunk` (aggregate all chunks for a single article)
* `Organization`(s) mentioned
* `IndustryCategory`(ies) for the organization
* `Person`(s) connected to the organization and their roles (e.g. investor, chairman, ceo)
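
The parent-child step in the list above can be pictured as a group-by: chunk hits are grouped by their parent article, the distinct chunk texts are collected, and the chunk scores are averaged. Here is a minimal pure-Python sketch of that grouping with made-up hits; `group_by_article` is a hypothetical helper, and in the notebook the real work happens in the Cypher retrieval query.

```python
from collections import defaultdict


def group_by_article(chunk_hits):
    """Group (article_title, chunk_text, score) hits by article.

    Returns one record per article, in first-seen order, with its distinct
    chunk texts and the average score over all of its chunk hits.
    """
    texts = defaultdict(list)
    scores = defaultdict(list)
    for title, text, score in chunk_hits:
        if text not in texts[title]:  # keep distinct texts, in hit order
            texts[title].append(text)
        scores[title].append(score)
    return [
        {
            "article": title,
            "texts": texts[title],
            "score": sum(scores[title]) / len(scores[title]),
        }
        for title in texts
    ]


# Made-up chunk hits for illustration only.
hits = [
    ("Article A", "first chunk", 0.9),
    ("Article B", "other chunk", 0.8),
    ("Article A", "second chunk", 0.5),
]
articles = group_by_article(hits)
for record in articles:
    print(record)
```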

We will retrieve the top-k = 5 results from the vector index.

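As the LLM we use Vertex AI Gemini (the code below sets `gemini-1.5-pro-preview-0409`, with `gemini-pro`, i.e. *Gemini Pro 1.0*, as the commented-out alternative).

We use a temperature of 0.1, top-k = 40, and top-p = 0.8.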
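
To make these three settings concrete, here is a simplified sketch of how temperature, top-k, and top-p typically shape the choice of the next token: temperature sharpens or flattens the distribution, top-k keeps only the k most likely tokens, and top-p trims that list to the smallest prefix whose probability mass reaches p. The token names and logits are invented, `candidate_tokens` is a hypothetical helper, and the exact order of operations inside Gemini is an assumption here.

```python
import math


def candidate_tokens(logits, temperature=0.1, top_k=40, top_p=0.8):
    """Return the tokens still eligible for sampling, with renormalized probabilities.

    Assumes temperature > 0.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # top-k: keep at most the k most likely tokens.
    probs = probs[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    mass = sum(p for _, p in kept)
    return {tok: p / mass for tok, p in kept}


# Invented logits for four candidate next tokens.
logits = {"acquired": 2.0, "bought": 1.5, "sold": 0.5, "painted": -1.0}
low_temp = candidate_tokens(logits, temperature=0.1)
high_temp = candidate_tokens(logits, temperature=1.0)
print("temperature 0.1:", list(low_temp))
print("temperature 1.0:", list(high_temp))
```

With a temperature as low as 0.1 the most likely token dominates, which is why this configuration gives near-deterministic answers, a reasonable choice for question answering over retrieved facts.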

The `__init__` method of our `LangchainCode` class may only hold serializable information (strings and numbers).

In `set_up()`, Gemini as the LLM, the Vertex AI embeddings, and the `Neo4jVector` retriever are combined into a LangChain chain, which is then used in `query()` via `chain.invoke()`.
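
The split between initialization, `set_up()`, and `query()` can be sketched without any cloud dependency. In the hypothetical `EchoApp` below, `__init__` stores only plain configuration values, `set_up()` builds the runtime object (here a stand-in function instead of a LangChain chain), and `query()` uses it. The `json.dumps` call is only a stand-in check that the initial state is plain data; it is not what the deployment itself does.

```python
import json


class EchoApp:
    """Hypothetical stand-in that follows the same three-method shape."""

    def __init__(self, model_name="demo-model", temperature=0.1):
        # Only strings and numbers here, so the instance can be serialized
        # and shipped before any client or connection exists.
        self.model_name = model_name
        self.temperature = temperature

    def set_up(self):
        # Runtime objects (LLM clients, database connections, chains) are
        # created here, after the instance has arrived where it will run.
        def chain(question):
            return f"[{self.model_name}] answer to: {question}"

        self.qa_chain = chain

    def query(self, query):
        return self.qa_chain(query)


app = EchoApp()
config_json = json.dumps(vars(app))  # works: the state is plain data
print(config_json)

app.set_up()
answer = app.query("Who is on the board of Siemens?")
print(answer)
```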

The class is deployed as a `ReasoningEngine` with the Google Vertex AI Python SDK.
For the deployment you provide an instance of the class, which captures the relevant environment variables and configuration, and the dependencies, in our case `google-cloud-aiplatform`, `langchain`, `langchain_community`, `langchain_google_vertexai`, and `neo4j`.

After a successful deployment we can use the resulting object via the `query` method, passing in our user question.

In [1]:
PROJECT_ID = "vertex-ai-neo4j-extension"
REGION = "us-central1"
STAGING_BUCKET = "gs://neo4j-vertex-ai-extension"


In [2]:
from google.colab import auth
auth.authenticate_user(project_id=PROJECT_ID)

!gcloud config set project vertex-ai-neo4j-extension

Updated property [core/project].


In [6]:
!pip install --quiet neo4j==5.19.0
!pip install --quiet langchain_google_vertexai==1.0.4
!pip install --quiet --force-reinstall langchain==0.2.0 langchain_community==0.2.0



In [5]:
!pip install --quiet google-cloud-aiplatform==1.51.0 # google-cloud-aiplatform[reasoningengine,langchain]

# !gsutil cp gs://vertex_sdk_private_releases/llm_extension/google_cloud_aiplatform-1.45.dev20240328+llm.extension-py2.py3-none-any.whl .

# !pip install --quiet bigframes==0.26.0

# !pip install --force-reinstall --quiet google_cloud_aiplatform-1.45.dev20240328+llm.extension-py2.py3-none-any.whl[extension]
# This is for printing the Vertex AI service account.
# !pip install --upgrade --quiet google-cloud-resource-manager
!pip install --quiet  google-cloud-resource-manager==1.12.3


In [None]:
# import bigframes.dataframe

In [None]:
# bugfix
# import bigframes
# import bigframes.pandas as bpd

# bigframes.dataframe.DataFrame=bigframes.pandas.DataFrame

In [3]:
import vertexai
# from google.cloud.aiplatform.preview import llm_extension
from vertexai.preview import reasoning_engines

vertexai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=STAGING_BUCKET,
)

In [3]:
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
from langchain_community.vectorstores import Neo4jVector
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough


If you installed any packages above, restart the runtime to pick them up:

* Click the "Runtime" button at the top of Colab.
* Select "Restart session".

After the restart you do not need to re-run the install cells, but cells that define variables or imports must be run again.
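
After a restart it can help to confirm that the pinned versions are the ones actually installed. Here is a small sketch using only the standard library; the pins are copied from the install cells above, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Pins taken from the pip install cells in this notebook.
PINNED = {
    "neo4j": "5.19.0",
    "langchain": "0.2.0",
    "langchain_community": "0.2.0",
    "langchain_google_vertexai": "1.0.4",
    "google-cloud-aiplatform": "1.51.0",
}


def installed_versions(pins):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (found, found == wanted)
    return report


for name, (found, ok) in installed_versions(PINNED).items():
    print(f"{name}: installed={found} pinned={PINNED[name]} match={ok}")
```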

In [4]:
import os

URI = os.getenv('NEO4J_URI', 'neo4j+s://demo.neo4jlabs.com')
USER = os.getenv('NEO4J_USERNAME','companies')
PASSWORD = os.getenv('NEO4J_PASSWORD','companies')
DATABASE = os.getenv('NEO4J_DATABASE','companies')

class LangchainCode:
    def __init__(self):
        self.model_name = "gemini-1.5-pro-preview-0409" #"gemini-pro"
        self.max_output_tokens = 1024
        self.temperature = 0.1
        self.top_p = 0.8
        self.top_k = 40
        self.project_id = PROJECT_ID
        self.location = REGION
        self.uri = URI
        self.username = USER
        self.password = PASSWORD
        self.database = DATABASE
        self.prompt_input_variables = ["query"]
        self.prompt_template="""
            You are a venture capital assistant that provides useful answers about companies, their boards, financing etc.
            only using the information from a company database already provided in the context.
            Prefer higher rated information in your context and add source links in your answers.
            Context: {context}"""

    def configure_qa_rag_chain(self, llm, embeddings):
        qa_prompt = ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(self.prompt_template),
            HumanMessagePromptTemplate.from_template("Question: {question}"
                                                      "\nWhat else can you tell me about it?"),
        ])

        # Vector + Knowledge Graph response
        kg = Neo4jVector.from_existing_index(
            embedding=embeddings,
            url=self.uri, username=self.username, password=self.password,database=self.database,
            search_type="hybrid",
            keyword_index_name="news_fulltext",
            index_name="news_google",
            retrieval_query="""
              WITH node as c,score
              MATCH (c)<-[:HAS_CHUNK]-(article:Article)

              WITH article, collect(distinct c.text) as texts, avg(score) as score
              RETURN article {.title, .sentiment, .siteName, .summary,
                    organizations: [ (article)-[:MENTIONS]->(org:Organization) |
                          org { .name, .revenue, .nbrEmployees, .isPublic, .motto, .summary,
                          orgCategories: [ (org)-[:HAS_CATEGORY]->(i) | i.name],
                          people: [ (org)-[rel]->(p:Person) | p { .name, .summary, role: replace(type(rel),"HAS_","") }]}],
                    texts: texts} as text,
              score, {source: article.siteName} as metadata
            """,
        )
        retriever = kg.as_retriever(search_kwargs={"k": 5})

        # Join the page content of all retrieved documents into one context string.
        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)

        chain = (
            {"context": retriever | format_docs , "question": RunnablePassthrough()}
            # RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
            | qa_prompt
            | llm
            | StrOutputParser()
        )
        return chain

    def set_up(self):
        from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
        from langchain_google_vertexai import VertexAIEmbeddings, ChatVertexAI
        from langchain_community.vectorstores import Neo4jVector
        from langchain_core.output_parsers import StrOutputParser
        from langchain_core.runnables import RunnableParallel, RunnablePassthrough

        llm = ChatVertexAI(
            model_name=self.model_name,
            max_output_tokens=self.max_output_tokens,
            max_input_tokens=32000,
            temperature=self.temperature,
            top_p=self.top_p,
            top_k=self.top_k,
            project = self.project_id,
            location = self.location,
            # convert_system_message_to_human=True,
            response_validation=False,
            verbose=True
        )
        embeddings = VertexAIEmbeddings("textembedding-gecko@001")

        self.qa_chain = self.configure_qa_rag_chain(llm, embeddings)

    def query(self, query):
        return self.qa_chain.invoke(query)

In [12]:
import os

URI = os.getenv('NEO4J_URI', 'neo4j+s://demo.neo4jlabs.com')
USER = os.getenv('NEO4J_USERNAME','companies')
PASSWORD = os.getenv('NEO4J_PASSWORD','companies')
DATABASE = os.getenv('NEO4J_DATABASE','companies')

class LangchainCode:
    def __init__(self):
        self.model_name = "gemini-1.5-pro-preview-0409" #"gemini-pro"
        self.max_output_tokens = 1024
        self.temperature = 0.1
        self.top_p = 0.8
        self.top_k = 40
        self.project_id = PROJECT_ID
        self.location = REGION
        self.uri = URI
        self.username = USER
        self.password = PASSWORD
        self.database = DATABASE
        self.prompt_input_variables = ["query"]
        self.prompt_template="""
            You are a venture capital assistant that provides useful answers about companies, their boards, financing etc.
            only using the information from a company database already provided in the context.
            Prefer higher rated information in your context and add source links in your answers.
            Context: {context}"""

    def configure_qa_rag_chain(self, llm, embeddings):
        qa_prompt = ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(self.prompt_template),
            HumanMessagePromptTemplate.from_template("Question: {question}"
                                                      "\nWhat else can you tell me about it?"),
        ])

        # Vector + Knowledge Graph response
        kg = Neo4jVector.from_existing_index(
            embedding=embeddings,
            url=self.uri, username=self.username, password=self.password,database=self.database,
            search_type="hybrid",
            keyword_index_name="news_fulltext",
            index_name="news_google",
            retrieval_query="""
                  WITH node as c,score
                  MATCH (c)<-[:HAS_CHUNK]-(article:Article)

                  WITH article, collect(distinct c.text) as texts, avg(score) as score
                  RETURN apoc.text.regreplace(
                  apoc.text.format("Article title: %s sentiment: %f siteName: %s\nsummary: %s\n",
                                  [article.title, article.sentiment, article.siteName, article.summary]) +
                  apoc.text.join([ (article)-[:MENTIONS]->(org:Organization) |
                      apoc.text.format("Organization name: %s revenue: %s employees: %s public: %s"+
                                        "\nmotto: %s\nsummary: %s\nIndustries: %s\nPeople: \n%s\n",
                      [org.name, org.revenue, org.nbrEmployees, org.isPublic, org.motto, org.summary,

                      apoc.text.join([ (org)-[:HAS_CATEGORY]->(i) | i.name], ", "),
                      apoc.text.join([ (org)-[rel]->(p:Person) |

                          apoc.text.format("%s: %s %s",
                          [replace(type(rel),"HAS_",""), p.name, p.summary])], "\n")])], "\n") +
                  apoc.text.join(texts,"\n"),"\\w+: (null|\n)","") as text, score,
                  {source: article.siteName} as metadata
            """,
        )
        retriever = kg.as_retriever(search_kwargs={"k": 5})

        def format_docs(docs):
          return "\n\n".join(doc.page_content for doc in docs)

        chain = (
            {"context": retriever | format_docs , "question": RunnablePassthrough()}
#            RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
            | qa_prompt
            | llm
            | StrOutputParser()
        )
        return chain

    def set_up(self):
        from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
        from langchain_google_vertexai import VertexAIEmbeddings, ChatVertexAI
        from langchain_community.vectorstores import Neo4jVector
        from langchain_core.output_parsers import StrOutputParser
        from langchain_core.runnables import RunnableParallel, RunnablePassthrough

        llm = ChatVertexAI(
            model_name=self.model_name,
            max_output_tokens=self.max_output_tokens,
            max_input_tokens=32000,
            temperature=self.temperature,
            top_p=self.top_p,
            top_k=self.top_k,
            project = self.project_id,
            location = self.location,
            # convert_system_message_to_human=True,
            response_validation=False,
            verbose=True
        )
        embeddings = VertexAIEmbeddings("textembedding-gecko@001")

        self.qa_chain = self.configure_qa_rag_chain(llm, embeddings)

    def query(self, query):
        return self.qa_chain.invoke(query)
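
The `apoc.text.regreplace` call at the end of the retrieval query above strips fields whose value is missing, so the LLM does not see entries such as `revenue: null`. The same pattern can be tried in Python. This sketch assumes that Python's `re` module behaves like the regex engine used by APOC for this simple pattern; the sample text and the helper name `drop_empty_fields` are invented.

```python
import re

# Same pattern as in the retrieval query: a word, a colon and a space,
# followed by either the literal text "null" or a line break.
EMPTY_FIELD = re.compile(r"\w+: (null|\n)")


def drop_empty_fields(text):
    """Remove 'key: null' and 'key: <newline>' fragments from formatted context."""
    return EMPTY_FIELD.sub("", text)


sample = (
    "Organization name: ExampleCorp revenue: null employees: 120\n"
    "motto: \n"
    "summary: Makes examples"
)
cleaned = drop_empty_fields(sample)
print(cleaned)
```

The pattern removes only the key and its empty value, so the whitespace that surrounded the field stays behind (here a double space after `ExampleCorp`).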

In [13]:
from langchain.globals import set_debug
set_debug(False)


In [6]:
# testing locally
lc = LangchainCode()
lc.set_up()

# lc.query('Who is on the board of AppsPro?')
# response = lc.query('Who invested in 8base?')
# response = lc.query('What are news about Siemens?')
response = lc.query('What are the news about IBM and its acquisitions and who are the people involved?')
print(response)

IBM acquired several companies, including Ascential Software, Cognos, SPSS, Netezza, OpenPages, and CrossIdeas. These acquisitions were aimed at strengthening IBM's core businesses, such as data analytics and security. 

IBM acquired Cognos in 2007 for about 5 billion U.S. dollars. Cognos was an Ottawa Ontario-based company producing business intelligence (BI) and performance management (PM) software. [The ambitious Big Blue --- Interview with Leonora Hoicka, associate general counsel of IBM](CHINAdaily.com.cn)

IBM acquired CrossIdeas on July 31, 2014. CrossIdeas is a company that specializes in identity governance and analytics software. [IBM Announces the Acquisition of CrossIdeas](Security Intelligence)

Some of the key people involved in these acquisitions include:

* **Virginia Rometty:** IBM president and chief executive officer at the time of the acquisitions of Ascential Software, Cognos, Netezza, OpenPages, and Algorithmics. [The ambitious Big Blue --- Interview with Leonora 

In [7]:
remote_app = reasoning_engines.ReasoningEngine.create(
    LangchainCode(),
    requirements=[
        "google-cloud-aiplatform==1.51.0",
        "langchain_google_vertexai==1.0.4",
        "langchain==0.2.0",
        "langchain_community==0.2.0",
        "neo4j==5.19.0"
    ],
    display_name="Neo4j Vertex AI RE Companies",
    description="Neo4j Vertex AI RE Companies",
    sys_version="3.10",
    extra_packages=[]
)

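NameError: name 'reasoning_engines' is not defined

This output comes from a run in which the import cell (`from vertexai.preview import reasoning_engines`, followed by `vertexai.init(...)`) had not been executed in the current session, for example after a runtime restart. Re-run that cell before creating the Reasoning Engine.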

In [16]:
response = remote_app.query(query="Who is on the board of Siemens?")
print(response)

The following people are on the board of Siemens:
* Jim Hagemann Snabe, Manager and chairman [source](https://www.google.com/search?q=Siemens)
* Dominika Bettman, CEO at Siemens [source](https://www.google.com/search?q=Siemens)
* Alejandro Preinfalk, CEO at Siemens [source](https://www.google.com/search?q=Siemens)
* Miguel Angel Lopez Borrego, CEO at Siemens [source](https://www.google.com/search?q=Siemens)
* Hanna Hennig, CIO at Siemens [source](https://www.google.com/search?q=Siemens)
* Barbara Humpton, CEO at Siemens Bank [source](https://www.google.com/search?q=Siemens)
* Matthias Rebellius, CEO at Siemens Corporate Technology [source](https://www.google.com/search?q=Siemens)
* Cedrik Neike, CEO at Siemens [source](https://www.google.com/search?q=Siemens)
* Ralf P. Thomas, CFO at Siemens [source](https://www.google.com/search?q=Siemens)
* Michael Sigmund, Chairman at Siemens [source](https://www.google.com/search?q=Siemens)
* Horst J. Kayser, Chairman at Siemens [source](https://ww