# Chatbot
In this notebook, we'll build a chatbot. The chatbot interface will use Gradio. Underlying the chatbot is the Neo4j knowledge graph we built in previous labs. The chatbot uses generative AI and LangChain. We take a natural language question, convert it to Neo4j Cypher using generative AI, and run that against the database. We then take the response from the database and use generative AI again to convert it back to natural language before presenting it to the user.

## Base Example Without Grounding
Before grounding with Neo4j, let's set up a baseline that just uses an LLM to answer questions.

In [6]:
from langchain.llms import VertexAI

base_chain = VertexAI(model_name='text-bison', max_output_tokens=2048, temperature=0)

We can now ask a simple finance question.

In [7]:
base_response = base_chain("""What are the top 10 investments for Blackrock?""")
print(f"Final answer: {base_response}")

Final answer:  1. Apple Inc. (AAPL)
2. Microsoft Corporation (MSFT)
3. Amazon.com Inc. (AMZN)
4. Alphabet Inc. (GOOGL)
5. Tesla Inc. (TSLA)
6. Berkshire Hathaway Inc. (BRK.A)
7. Nvidia Corporation (NVDA)
8. Taiwan Semiconductor Manufacturing Company Limited (TSM)
9. Visa Inc. (V)
10. Mastercard Inc. (MA)


While this answer looks reasonable, we have no real way to know how the LLM came up with it or where it was sourced from.

Here is a more complicated example where we expect the LLM to understand some more specific terminology.

In [8]:
base_response = base_chain("""Which managers own FAANG stocks?""")
print(f"Final answer: {base_response}")

Final answer:  - Tim Cook, CEO of Apple, owns AAPL stock.
- Sundar Pichai, CEO of Alphabet, owns GOOG stock.
- Satya Nadella, CEO of Microsoft, owns MSFT stock.
- Jeff Bezos, CEO of Amazon, owns AMZN stock.
- Mark Zuckerberg, CEO of Meta Platforms, owns FB stock.


In this case, it looks like the LLM understands the ubiquitous acronym FAANG but, unsurprisingly, the results indicate it doesn't understand "manager" within the context of our data model. In your use case, you may have lots of specific terminology or ontology like this that a chatbot would need to understand.

## Grounding LLMs with Neo4j
Now let's create a chatbot that is grounded with Neo4j. Below is the pattern we will follow with LangChain:

![](images/langchain.png)

## Cypher Generation
We have to use a prompt template that: 
1. Clearly states what schema to use 
2. Provides principles the chatbot should follow in generating responses
3. Demonstrates few-shot examples to help the chatbot be more accurate in its query generation.

In [9]:
CYPHER_GENERATION_TEMPLATE = """You are an expert Neo4j Cypher translator who understands the question in english and convert to Cypher strictly based on the Neo4j Schema provided and following the instructions below:
1. Generate Cypher query compatible ONLY for Neo4j Version 5
2. Do not use EXISTS, SIZE keywords in the cypher. Use alias when using the WITH keyword
3. Please do not use same variable names for different nodes and relationships in the query.
4. Use only Nodes and relationships mentioned in the schema
5. Always enclose the Cypher output inside 3 backticks
6. Always do a case-insensitive and fuzzy search for any properties related search. Eg: to search for a Company name use `toLower(c.name) contains 'neo4j'`
7. Candidate node is synonymous to Manager
8. Always use aliases to refer the node in the query
9. 'Answer' is NOT a Cypher keyword. Answer should never be used in a query.
10. Please generate only one Cypher query per question. 
11. Cypher is NOT SQL. So, do not mix and match the syntaxes.
12. Every Cypher query always starts with a MATCH keyword.

Schema:
{schema}
Samples:
Question: Which fund manager owns most shares? What is the total portfolio value?
Answer: MATCH (m:Manager) -[o:OWNS]-> (c:Company) RETURN m.managerName as manager, sum(distinct o.shares) as ownedShares, sum(o.value) as portfolioValue ORDER BY ownedShares DESC LIMIT 10

Question: Which fund manager owns most companies? How many shares?
Answer: MATCH (m:Manager) -[o:OWNS]-> (c:Company) RETURN m.managerName as manager, count(distinct c) as ownedCompanies, sum(distinct o.shares) as ownedShares ORDER BY ownedCompanies DESC LIMIT 10

Question: What are the top 10 investments for Vanguard?
Answer: MATCH (m:Manager) -[o:OWNS]-> (c:Company) WHERE toLower(m.managerName) contains "vanguard" RETURN c.companyName as Investment, sum(DISTINCT o.shares) as totalShares, sum(DISTINCT o.value) as investmentValue order by investmentValue desc limit 10

Question: What other fund managers are investing in same companies as Vanguard?
Answer: MATCH (m1:Manager) -[:OWNS]-> (c1:Company) <-[o:OWNS]- (m2:Manager) WHERE toLower(m1.managerName) contains "vanguard" AND elementId(m1) <> elementId(m2) RETURN m2.managerName as manager, sum(DISTINCT o.shares) as investedShares, sum(DISTINCT o.value) as investmentValue ORDER BY investmentValue LIMIT 10

Question: What are the top investors for Apple?
Answer: MATCH (m1:Manager) -[o:OWNS]-> (c1:Company) WHERE toLower(c1.companyName) contains "apple" RETURN distinct m1.managerName as manager, sum(o.value) as totalInvested ORDER BY totalInvested DESC LIMIT 10

Question: What are the other top investments for fund managers investing in Apple?
Answer: MATCH (c1:Company) <-[:OWNS]- (m1:Manager) -[o:OWNS]-> (c2:Company) WHERE toLower(c1.companyName) contains "apple" AND elementId(c1) <> elementId(c2) RETURN DISTINCT c2.companyName as company, sum(o.value) as totalInvested, sum(o.shares) as totalShares ORDER BY totalInvested DESC LIMIT 10

Question: What are the top investors in the last 3 months?
Answer: MATCH (m:Manager) -[o:OWNS]-> (c:Company) WHERE date() > o.reportCalendarOrQuarter > date() - duration({{months:3}}) RETURN distinct m.managerName as manager, sum(o.value) as totalInvested, sum(o.shares) as totalShares ORDER BY totalInvested DESC LIMIT 10

Question: What are top investments in last 6 months for Vanguard?
Answer: MATCH (m:Manager) -[o:OWNS]-> (c:Company) WHERE toLower(m.managerName) contains "vanguard" AND date() > o.reportCalendarOrQuarter > date() - duration({{months:6}}) RETURN distinct c.companyName as company, sum(o.value) as totalInvested, sum(o.shares) as totalShares ORDER BY totalInvested DESC LIMIT 10

Question: Who are Apple's top investors in last 3 months?
Answer: MATCH (m:Manager) -[o:OWNS]-> (c:Company) WHERE toLower(c.companyName) contains "apple" AND date() > o.reportCalendarOrQuarter > date() - duration({{months:3}}) RETURN distinct m.managerName as investor, sum(o.value) as totalInvested, sum(o.shares) as totalShares ORDER BY totalInvested DESC LIMIT 10

Question: Which fund manager under 200 million has similar investment strategy as Vanguard?
Answer: MATCH (m1:Manager) -[o1:OWNS]-> (:Company) <-[o2:OWNS]- (m2:Manager) WHERE toLower(m1.managerName) CONTAINS "vanguard" AND elementId(m1) <> elementId(m2) WITH distinct m2 AS m2, sum(distinct o2.value) AS totalVal WHERE totalVal < 200000000 RETURN m2.managerName AS manager, totalVal*0.000001 AS totalVal ORDER BY totalVal DESC LIMIT 10

Question: Who are common investors in Apple and Amazon?
Answer: MATCH (c1:Company) <-[:OWNS]- (m:Manager) -[:OWNS]-> (c2:Company) WHERE toLower(c1.companyName) contains "apple" AND toLower(c2.companyName) CONTAINS "amazon" RETURN DISTINCT m.managerName LIMIT 50

Question: What are Vanguard's top investments by shares for 2023?
Answer: MATCH (m:Manager) -[o:OWNS]-> (c:Company) WHERE toLower(m.managerName) CONTAINS "vanguard" AND date({{year:2023}}) = date.truncate('year',o.reportCalendarOrQuarter) RETURN c.companyName AS investment, sum(o.shares) AS totalShares ORDER BY totalShares DESC LIMIT 10

Question: What are Vanguard's top investments by value for 2023?
Answer: MATCH (m:Manager) -[o:OWNS]-> (c:Company) WHERE toLower(m.managerName) CONTAINS "vanguard" AND date({{year:2023}}) = date.truncate('year',o.reportCalendarOrQuarter) RETURN c.companyName AS investment, sum(o.value) AS totalValue ORDER BY totalValue DESC LIMIT 10

Question: {question}
Answer: 
"""

Now let’s create a LangChain prompt template.  

This template defines the parameter inputs for the prompt sent to the Cypher generation model. In our example, the inputs are `schema` and `question`. The question comes from the end user. The LangChain `GraphCypherQAChain` automatically inserts the schema using a built-in method of `Neo4jGraph`.

In [12]:
from langchain.prompts.prompt import PromptTemplate

CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=['schema','question'], validate_template=True, template=CYPHER_GENERATION_TEMPLATE
)
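
As a rough illustration of what happens when this template is filled in, the sketch below applies plain `str.format` to a shortened stand-in template. It assumes LangChain's default template format follows the same brace rules, which would explain why literal Cypher braces such as `{{months:3}}` are doubled in the template above. The names `MINI_TEMPLATE` and `render_prompt` and the sample schema string are made up for this example.

```python
# A shortened, hypothetical stand-in for CYPHER_GENERATION_TEMPLATE.
MINI_TEMPLATE = """Schema:
{schema}
Samples:
Question: What are the top investors in the last 3 months?
Answer: MATCH (m:Manager) -[o:OWNS]-> (c:Company) WHERE o.reportCalendarOrQuarter > date() - duration({{months:3}}) RETURN m.managerName

Question: {question}
Answer:
"""


def render_prompt(template: str, schema: str, question: str) -> str:
    """Substitute the two named inputs; doubled braces become single literal braces."""
    return template.format(schema=schema, question=question)


# Hypothetical schema text; the real one is produced from the connected graph.
example_schema = "(:Manager)-[:OWNS]->(:Company)"
rendered = render_prompt(MINI_TEMPLATE, example_schema, "Who owns Apple?")
print(rendered)
```

Forgetting to double a literal brace in the template would make the formatter treat it as a missing input variable, so it is worth rendering the prompt once by hand before wiring it into a chain.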

Provide your Neo4j credentials. We need the DB connection URL and password. The username is neo4j by default.

In [14]:
# username is neo4j by default
NEO4J_USERNAME = 'neo4j'

# You will need to change these to match your credentials
NEO4J_URI = 'neo4j+s://6688b25b.databases.neo4j.io'
NEO4J_PASSWORD = '_kogrNk53u8oTk5be55kmit1kHGdhZj98yJlG-VYSRw'
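
Hard-coding a password in a notebook makes it easy to leak. One common alternative, sketched below, is to read the values from environment variables. The variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD`, the helper `load_neo4j_credentials`, and the demo values are assumptions for this example, not something the lab requires.

```python
import os
from typing import Dict, Mapping, Optional


def load_neo4j_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read connection settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [key for key in ("NEO4J_URI", "NEO4J_PASSWORD") if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "url": env["NEO4J_URI"],
        # The username falls back to the default mentioned above.
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env["NEO4J_PASSWORD"],
    }


# Demonstration with an in-memory mapping so the sketch runs without a real database.
demo = load_neo4j_credentials(
    {"NEO4J_URI": "neo4j+s://example.databases.neo4j.io", "NEO4J_PASSWORD": "change-me"}
)
print(demo["url"], demo["username"])
```

The returned keys match the keyword arguments used when constructing `Neo4jGraph` in the next cell, so the dictionary could be unpacked directly into that call.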

We need to connect to the graph via LangChain.

In [17]:
from langchain.graphs import Neo4jGraph

graph = Neo4jGraph(
    url=NEO4J_URI, 
    username=NEO4J_USERNAME, 
    password=NEO4J_PASSWORD
)

We define our `chain` object (specifically a `GraphCypherQAChain`) using two Vertex AI LLMs:
* [code-bison](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/code-generation) to translate user questions to Cypher queries
* [text-bison](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text) to convert Cypher query results back to natural language for human-friendly responses.

`GraphCypherQAChain` also takes a `Neo4jGraph` so it can handle the chatbot process end-to-end: taking the user question, translating it to Cypher, executing the query, getting the results, translating them back to natural language, and returning the answer to the user.


In [19]:
from langchain.chains import GraphCypherQAChain

chain = GraphCypherQAChain.from_llm(
    graph=graph,
    cypher_llm=VertexAI(model_name='code-bison@001', max_output_tokens=2048, temperature=0.0),
    qa_llm=VertexAI(model_name='text-bison', max_output_tokens=2048, temperature=0.0),
    cypher_prompt=CYPHER_GENERATION_PROMPT,
    verbose=True,
    return_intermediate_steps=True
)
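
Conceptually, the end-to-end flow can be pictured as the small pipeline below. This is a simplified sketch with stand-in callables, not LangChain's actual implementation; `extract_cypher`, `answer_question`, and the three `fake_*` functions are hypothetical names used only for illustration.

```python
import re
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def extract_cypher(text: str) -> str:
    """Pull the query out of a triple-backtick block, or return the text unchanged."""
    match = re.search(r"```(?:cypher)?\s*(.*?)```", text, re.DOTALL)
    return (match.group(1) if match else text).strip()


def answer_question(
    question: str,
    schema: str,
    generate_cypher: Callable[[str, str], str],
    run_query: Callable[[str], Rows],
    summarise: Callable[[str, Rows], str],
) -> Dict[str, Any]:
    """Question -> Cypher -> database rows -> natural-language answer."""
    cypher = extract_cypher(generate_cypher(schema, question))
    rows = run_query(cypher)
    return {"query": cypher, "context": rows, "result": summarise(question, rows)}


# Stand-ins for the two LLMs and the database, so the sketch runs offline.
def fake_generate(schema: str, question: str) -> str:
    return "```\nMATCH (m:Manager) RETURN m.managerName AS manager LIMIT 1\n```"


def fake_run(cypher: str) -> Rows:
    return [{"manager": "Example Manager LLC"}]


def fake_summarise(question: str, rows: Rows) -> str:
    return f"{rows[0]['manager']} matches your question."


out = answer_question("Which manager?", "(:Manager)", fake_generate, fake_run, fake_summarise)
print(out["query"])
print(out["result"])
```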

Below we have a few examples of how we can get answers from the chatbot.

## Why Ground Your LLM?
Recall our base example where we asked for the top 10 Blackrock investments. We got an answer that looked reasonable, but we couldn't validate it or track its sources. We also asked which managers own FAANG stocks, and for that we unsurprisingly received the wrong answers for our use case.

Let's try again grounding with Neo4j. 

In [20]:
r2 = chain("""What are the top 10 investments for Blackrock?""")
print(f"Final answer: {r2['result']}")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m:Manager) -[o:OWNS]-> (c:Company) WHERE toLower(m.managerName) CONTAINS "blackrock" RETURN c.companyName AS Investment, sum(DISTINCT o.shares) AS totalShares, sum(DISTINCT o.value) AS investmentValue order by investmentValue desc limit 10
Full Context:
[{'Investment': 'APPLE COMPUTER', 'totalShares': 3084362729, 'investmentValue': 304538995305000.0}, {'Investment': 'MICROSOFT CORPORATION', 'totalShares': 1588389433, 'investmentValue': 282697628073000.0}, {'Investment': 'AMAZON.COM INC', 'totalShares': 1784427354, 'investmentValue': 112783300990000.0}, {'Investment': 'NVIDIA', 'totalShares': 539750644, 'investmentValue': 77323879849000.0}, {'Investment': 'UNITEDHEALTH GROUP INC', 'totalShares': 220675563, 'investmentValue': 74901917261000.0}, {'Investment': 'JOHNSON &amp; JOHNSON', 'totalShares': 600132459, 'investmentValue': 66382342451000.0}, {'Investment': 'EXXON MOBIL CORPORA

Notice that this answer is different from our base example, and this time we have the Cypher logic used to obtain the answer from our database. This means that we can trace back how we came up with this answer and make any adjustments to our database or prompt if we need to.
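
Because the chain was created with `return_intermediate_steps=True`, the generated query and the raw database rows should also be available on the result object, not just in the verbose log. The sketch below assumes the result is a dictionary whose `intermediate_steps` entry is a list of small dictionaries with `query` and `context` keys; check this against your installed LangChain version. The `sample_result` value is hand-written stand-in data, and `trace` is a hypothetical helper.

```python
from typing import Any, Dict


def trace(result: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the generated Cypher and the returned rows from a chain result."""
    merged: Dict[str, Any] = {}
    for step in result.get("intermediate_steps", []):
        merged.update(step)
    return {"cypher": merged.get("query"), "rows": merged.get("context", [])}


# Stand-in for something like `r2`; the shape is an assumption, the values are invented.
sample_result = {
    "result": "Example answer text",
    "intermediate_steps": [
        {"query": "MATCH (m:Manager) RETURN m.managerName AS manager LIMIT 1"},
        {"context": [{"manager": "Example Manager LLC"}]},
    ],
}

info = trace(sample_result)
print(info["cypher"])
print(len(info["rows"]), "row(s) returned")
```

Keeping the query next to each answer makes it possible to re-run it directly against the database when a response looks suspicious.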

Now let's try the FAANG question.

In [21]:
r3 = chain("""Which managers own FAANG stocks?""")
print(f"Final answer: {r3['result']}")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m:Manager) -[o:OWNS]-> (c:Company) WHERE toLower(c.companyName) IN ["facebook", "apple", "amazon", "netflix", "google"] RETURN m.managerName as manager
Full Context:
[{'manager': 'Beacon Wealthcare LLC'}, {'manager': 'Pinnacle Holdings, LLC'}, {'manager': 'Pinnacle Holdings, LLC'}]

> Finished chain.
Final answer:  Beacon Wealthcare LLC, Pinnacle Holdings, LLC, and Pinnacle Holdings, LLC are the managers that own FAANG stocks. 


Here again, we notice the traceability with Cypher, and because we engineered our prompt to include our schema, it understood what “manager” means in the context of our use case.

## Why Ground Your LLM with Neo4j?
There are 3 primary reasons to ground your LLM with Neo4j specifically:
1. __Grounding for more complex question handling__: Multi-hop knowledge retrieval across connected data. Connections between data points are calculated before query time. 
2. __Enterprise reliability and security__: Fine-grained security so the chatbot only accesses information the user has permission to access. Autonomous clustering for horizontal scaling. Fully managed service in the cloud through Aura. 
3. __Performance__: Fast queries with high concurrency for many users.

We can explore point 1 with more complex questions below.

The question below requires multiple hops across the graph (each hop would be a join in the relational world). Having a knowledge graph with relationships calculated before query time allows us to answer the question quickly.

In [22]:
r4 = chain("""What are other top investments for fund managers investing in AstraZeneca?""")
print(f"Final answer: {r4['result']}")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c1:Company) <-[:OWNS]- (m1:Manager) -[o:OWNS]-> (c2:Company) WHERE toLower(c1.companyName) CONTAINS "astrazeneca" AND elementId(c1) <> elementId(c2) RETURN DISTINCT c2.companyName as company, sum(o.value) as totalInvested, sum(o.shares) as totalShares ORDER BY totalInvested DESC LIMIT 10
Full Context:
[{'company': 'SPDR S&P 500 ETF TR', 'totalInvested': 327922309895000.0, 'totalShares': 1229482000}, {'company': 'APPLE COMPUTER', 'totalInvested': 241033252748153.0, 'totalShares': 2916941635}, {'company': 'MICROSOFT CORPORATION', 'totalInvested': 217147970297168.0, 'totalShares': 1462665865}, {'company': 'ISHARES TR', 'totalInvested': 104442565293000.0, 'totalShares': 1463346482}, {'company': 'AMAZON.COM INC', 'totalInvested': 100398523210233.0, 'totalShares': 1979365848}, {'company': 'TESLA INC CALL', 'totalInvested': 93992366480000.0, 'totalShares': 864643000}, {'company': 'INVES

We can also combine multi-hop traversal with property conditions.

In [23]:
r5 = chain("""Which fund managers under 200 million have the most similar investment strategies to Blackrock? Return the top 10.""")
print(f"Final answer: {r5['result']}")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m1:Manager) -[o1:OWNS]-> (:Company) <-[o2:OWNS]- (m2:Manager) WHERE toLower(m1.managerName) CONTAINS "blackrock" AND elementId(m1) <> elementId(m2) WITH distinct m2 AS m2, sum(distinct o2.value) AS totalVal WHERE totalVal < 200000000 RETURN m2.managerName AS manager, totalVal*0.000001 AS totalVal ORDER BY totalVal DESC LIMIT 10
Full Context:
[{'manager': 'LAKE STREET ADVISORS GROUP, LLC', 'totalVal': 197.487}, {'manager': 'INCA Investments LLC', 'totalVal': 197.18599999999998}, {'manager': 'M Holdings Securities, Inc.', 'totalVal': 194.481}, {'manager': 'Avalon Global Asset Management LLC', 'totalVal': 194.107}, {'manager': 'King Wealth', 'totalVal': 192.66299999999998}, {'manager': 'TRUSTEES OF THE UNIVERSITY OF PENNSYLVANIA', 'totalVal': 190.523}, {'manager': 'Crestview Partners IV GP, L.P.', 'totalVal': 190.332}, {'manager': 'Virtu Financial LLC', 'totalVal': 188.9489999999999

and more...

In [24]:
r6 = chain("""Please get me 10 common investors between Tesla and Microsoft""")
print(f"Final answer: {r6['result']}")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c1:Company) <-[:OWNS]- (m:Manager) -[:OWNS]-> (c2:Company) WHERE toLower(c1.companyName) CONTAINS "tesla" AND toLower(c2.companyName) CONTAINS "microsoft" RETURN DISTINCT m.managerName LIMIT 10
Full Context:
[{'m.managerName': 'PRIVATE ASSET MANAGEMENT INC'}, {'m.managerName': 'Bristlecone Advisors, LLC'}, {'m.managerName': 'BI Asset Management Fondsmaeglerselskab A/S'}, {'m.managerName': 'Cornell Pochily Investment Advisors, Inc.'}, {'m.managerName': 'Smith Anglin Financial, LLC'}, {'m.managerName': 'Costello Asset Management, INC'}, {'m.managerName': 'Decatur Capital Management, Inc.'}, {'m.managerName': 'FIRST FOUNDATION ADVISORS'}, {'m.managerName': 'BANK OF NOVA SCOTIA'}, {'m.managerName': 'Sturgeon Ventures LLP'}]

> Finished chain.
Final answer:  Here are 10 common investors between Tesla and Microsoft:
1. Baillie Gifford & Co.
2. BlackRock Inc.
3. Capital Rese

## Grounded Chatbot
Now we will use Gradio to deploy a chat interface with our chain behind it.

The code below deploys a Gradio application. You can access the app via a local URL. A publicly shareable URL is also provided (valid for 3 days).

In [25]:
import gradio as gr
import typing_extensions
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key = "chat_history", return_messages = True)
agent_chain = chain

def chat_response(input_text, history):
    # Gradio also passes the chat history; this handler only uses the latest message.
    try:
        return agent_chain.run(input_text)
    except Exception:
        # a bit of protection against exposed error messages
        # we could log these situations in the backend to revisit later in development
        return "I'm sorry, there was an error retrieving the information you requested."

interface = gr.ChatInterface(fn = chat_response,
                             title = "Investment Chatbot",
                             description = "powered by Neo4j",
                             theme = "soft",
                             chatbot = gr.Chatbot(height=500),
                             undo_btn = None,
                             clear_btn = "\U0001F5D1 Clear chat",
                             examples = ["What are the top 10 investments for Blackrock?",
                                         "Which manager owns FAANG stocks?",
                                         "What are other top investments for fund managers investing in Exxon?",
                                         "What are Vanguard's top investments by value for 2023?",
                                         "Who are the common investors between Tesla and Microsoft?",
                                         "Who are Tesla's top investors in last 3 months?"])

interface.launch(share=True)

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://d2fa82aa7e4d27e647.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)






> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m:Manager) -[o:OWNS]-> (c:Company) WHERE toLower(m.managerName) CONTAINS "blackrock" RETURN c.companyName AS Investment, sum(DISTINCT o.shares) AS totalShares, sum(DISTINCT o.value) AS investmentValue order by investmentValue desc limit 10
Full Context:
[{'Investment': 'APPLE COMPUTER', 'totalShares': 3084362729, 'investmentValue': 304538995305000.0}, {'Investment': 'MICROSOFT CORPORATION', 'totalShares': 1588389433, 'investmentValue': 282697628073000.0}, {'Investment': 'AMAZON.COM INC', 'totalShares': 1784427354, 'investmentValue': 112783300990000.0}, {'Investment': 'NVIDIA', 'totalShares': 539750644, 'investmentValue': 77323879849000.0}, {'Investment': 'UNITEDHEALTH GROUP INC', 'totalShares': 220675563, 'investmentValue': 74901917261000.0}, {'Investment': 'JOHNSON &amp; JOHNSON', 'totalShares': 600132459, 'investmentValue': 66382342451000.0}, {'Investment': 'EXXON MOBIL CORPORA

## Fine Tuning for Cypher Generation
We encourage you to use Vertex AI Codey family models with a schema, few-shot examples, and precise prompt engineering for Cypher generation. However, if that still doesn't provide an appropriate level of quality, or you need your LLM to improve accuracy on a more specific task area, you can try fine-tuning.

Fine-tuning is the process of taking a foundational model and making precise changes to improve its performance for a specific, narrower task. It works by taking in training data containing many examples of your specific task and using it to update or add additional parameters in a new version of the model.

Training generally takes more than an hour. The tuned adapter model stays within your tenant, and your training data will not be used to train the base model, which is frozen. Tuning runs on GCP's TPU infrastructure, which is optimized for ML workloads.

The training data should be structured as a supervised training set, where each row contains the input text and the desired Cypher query. Vertex AI expects you to adhere to the format below in a `jsonl` file.

```
{"input_text": "MY_INPUT_PROMPT", "output_text": "CYPHER_QUERY"}
```
You can find more about fine-tuning in the [Vertex AI documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-models).
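
As a sketch of preparing such a file, the snippet below serialises question/Cypher pairs into JSON Lines text. The two pairs are taken from the few-shot samples earlier in this notebook; the helper name `to_jsonl` is hypothetical, and writing the text to a file or uploading it for tuning is left out. In practice the input text might be the full prompt rather than the bare question.

```python
import json
from typing import Iterable, Tuple


def to_jsonl(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render (question, cypher) pairs as one JSON object per line."""
    return "\n".join(
        json.dumps({"input_text": question, "output_text": cypher})
        for question, cypher in pairs
    )


training_pairs = [
    (
        "What are the top investors for Apple?",
        "MATCH (m1:Manager) -[o:OWNS]-> (c1:Company) "
        'WHERE toLower(c1.companyName) contains "apple" '
        "RETURN distinct m1.managerName as manager, sum(o.value) as totalInvested "
        "ORDER BY totalInvested DESC LIMIT 10",
    ),
    (
        "Who are common investors in Apple and Amazon?",
        "MATCH (c1:Company) <-[:OWNS]- (m:Manager) -[:OWNS]-> (c2:Company) "
        'WHERE toLower(c1.companyName) contains "apple" '
        'AND toLower(c2.companyName) CONTAINS "amazon" '
        "RETURN DISTINCT m.managerName LIMIT 50",
    ),
]

jsonl_text = to_jsonl(training_pairs)
print(jsonl_text)
```

Using `json.dumps` rather than string concatenation takes care of escaping the double quotes that appear inside the Cypher queries.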


## Conclusion
In this notebook, we went through the steps of connecting a LangChain agent to a Neo4j database and using it to generate Cypher queries in response to user requests via LLMs on Vertex AI, thus grounding the LLM with a knowledge graph.

While we used the `code-bison` model here, this approach can be generalized to other Vertex AI LLMs.  This process can also be augmented with additional steps around the generation chain to customize the experience for specific use cases.  

The critical takeaway is the importance of a Neo4j knowledge graph as a grounding database to: 
* Anchor your chatbot to reality as it generates responses and 
* Enable your LLM to provide answers enriched with relevant enterprise data.