# Minimum Viable Graph (MVG)

In this notebook, you'll create the Minimum Viable Graph, consisting of `Chunk` nodes arranged into linked lists.


1. Extract text from the source files, split into chunks, create `Chunk` nodes
2. Enhance each `Chunk` node with a text embedding
3. Expand the `Chunk` nodes with `NEXT` relationships to form linked lists

```cypher
(:Chunk 
  chunkId: string
  source: string
  text: string
  header1: string
  header2: string
  header3: string
  header4: string
  path: string
  documentUri: string
  embedding: float[]
)
```

```cypher
(:Chunk)-[:NEXT]->(:Chunk)
```

## Setup

Import some Python packages, set up global constants, and create a connection to the Neo4j database.

In [360]:
%run 'shared.ipynb'

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
Connecting to Neo4j at bolt://neo4j-1:7687 as neo4j
Using data from /home/jovyan/data/single
Embedding with ollama using mxbai-embed-large
Chatting with ollama using llama3


## Prepare a GraphDatabase interface

You will use the Neo4j `GraphDatabase` interface to send queries to the Neo4j database.

In [345]:
# Expect `gdb` to be defined in the shared notebook
# gdb = GraphDatabase.driver(uri=NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

result = gdb.execute_query("RETURN 'Hello, World!' AS message")

result.records[0].get('message')

'Hello, World!'

# Input data preprocessing

The IIHF data you will be working with has been preprocessed from the original source. 

Please see the [Form10k Preprocessing](https://github.com/neo4j-product-examples/data-prep-sec-edgar/) repository for more details.

# Step-by-step inspection of a single document

### Start with one file

Get the file name, then load the Markdown file.

In [174]:
loader = DirectoryLoader('/data-transfer/iihf', glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

print (documents[0].metadata["source"])
print (len(documents))

/data-transfer/iihf/rulebook.md
1


### Text splitter from Langchain

You can use a text splitter function from Langchain.

The `RecursiveCharacterTextSplitter` will use newlines
and then whitespace characters to break down a text until
the chunks are small enough. This strategy is generally
good at keeping paragraphs together.

Set a chunk size of 2000 characters,
with 200 characters of overlap between each chunk,
using the built-in `len` function to calculate the 
text length.
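
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding window over characters. It is a deliberate simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators); the function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper, not part of Langchain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
print(demo_chunks)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last two characters of one chunk reappear at the start of the next.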


In [108]:
# Splitting text into chunks using the RecursiveCharacterTextSplitter 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)

### Text splitter demonstration

You can see what the text splitter will do by splitting up
the text of the first document.

In [18]:
item1_text_chunks = text_splitter.split_text(documents[0].page_content)
item1_text_chunks[0]


'# IIHF Official Rulebook 2023/24\n\n## Welcome\n\nNo matter where ice hockey is played, the object of the game is the same – to put the puck into the opponent’s goal. Beyond that, ice hockey across the globe is subject to certain variations. This makes the rules of the game extremely important. These rules must be followed all times, in all countries, in all age categories, for the game to be enjoyed by everyone.\n\nHockey’s speed is one of the qualities that makes it so exciting. But this skill and excitement must be balanced with fair play and respect.\n\nIt is, therefore, important to make a clear separation between the purpose of all the elements of the game and to use these respectfully. These distinctions can be taught at an early age or whenever one begins to show interest in the game. And this is why hockey development begins with parents and coaches, those people most influential in guiding a person, old or young, into playing the game properly and within the rules.\n\nThe II

### Markdown Header Text Splitter from Langchain



In [346]:
print (documents[0].metadata["source"])
print (len(documents))

headers_to_split_on = [
    ("#", "header1"),
    ("##", "header2"),
    ("###", "header3"),
    ("####", "header4"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=True)
md_header_splits = markdown_splitter.split_text(documents[0].page_content)

/data-transfer/iihf/rulebook.md
1


In [37]:
md_header_splits[0]

Document(page_content='No matter where ice hockey is played, the object of the game is the same – to put the puck into the opponent’s goal. Beyond that, ice hockey across the globe is subject to certain variations. This makes the rules of the game extremely important. These rules must be followed all times, in all countries, in all age categories, for the game to be enjoyed by everyone.  \nHockey’s speed is one of the qualities that makes it so exciting. But this skill and excitement must be balanced with fair play and respect.  \nIt is, therefore, important to make a clear separation between the purpose of all the elements of the game and to use these respectfully. These distinctions can be taught at an early age or whenever one begins to show interest in the game. And this is why hockey development begins with parents and coaches, those people most influential in guiding a person, old or young, into playing the game properly and within the rules.  \nThe IIHF Championship program enco

### Recursive Character Text Splitter from Langchain

In [347]:
# Char-level splits
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 600
chunk_overlap = 0
text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function = len,
    is_separator_regex = False,    
)

# Split
chunks = text_splitter.split_documents(md_header_splits)
chunks[0]

Document(page_content='No matter where ice hockey is played, the object of the game is the same – to put the puck into the opponent’s goal. Beyond that, ice hockey across the globe is subject to certain variations. This makes the rules of the game extremely important. These rules must be followed all times, in all countries, in all age categories, for the game to be enjoyed by everyone.  \nHockey’s speed is one of the qualities that makes it so exciting. But this skill and excitement must be balanced with fair play and respect.', metadata={'header1': 'IIHF Official Rulebook 2023/24', 'header2': 'Welcome'})

In [103]:
chunks[0]

Document(page_content='No matter where ice hockey is played, the object of the game is the same – to put the puck into the opponent’s goal. Beyond that, ice hockey across the globe is subject to certain variations. This makes the rules of the game extremely important. These rules must be followed all times, in all countries, in all age categories, for the game to be enjoyed by everyone.', metadata={'header1': 'IIHF Official Rulebook 2023/24', 'header2': 'Welcome'})

In [348]:
len(chunks)

271

## Create a graph from the chunks

You now have chunks prepared for creating a knowledge graph.

The graph will have 1 node per chunk, containing the chunk text and metadata as properties.

### Merge chunk query

You will use a Cypher query to merge the chunks into the graph.

This query accepts a query parameter called `chunkParam` which is expected
to have the data record containing the chunk and metadata.

The `MERGE` query will first match an existing node with the same `chunkId` property.

If no such node exists, it will create a new node and the `ON CREATE` clause will set the properties using values from the `chunkParam` query parameter.
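
The match-or-create behavior described above can be imitated with an in-memory dictionary. This is only an analogy under simplifying assumptions (a `dict` keyed by `chunkId` standing in for the graph, and a hypothetical `merge_chunk` function); it does not talk to Neo4j.

```python
def merge_chunk(graph: dict, chunk_param: dict) -> dict:
    """Imitate MERGE ... ON CREATE SET on a dict keyed by chunkId."""
    chunk_id = chunk_param["chunkId"]
    if chunk_id not in graph:
        # "ON CREATE" branch: properties are only set when the node is new
        graph[chunk_id] = dict(chunk_param)
    # Whether matched or created, the stored node is returned
    return graph[chunk_id]


toy_graph = {}
merge_chunk(toy_graph, {"chunkId": "abc", "text": "first version"})
merge_chunk(toy_graph, {"chunkId": "abc", "text": "second version"})
print(len(toy_graph), toy_graph["abc"]["text"])
```

The second call finds the existing entry and leaves its properties untouched, which is why running the load twice does not duplicate or overwrite nodes.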

In [349]:
merge_chunk_node_query = """
MERGE(c:Chunk {chunkId: $chunkParam.chunkId})
    ON CREATE SET 
        c.source = $chunkParam.source, 
        c.chunkSeqId = $chunkParam.chunkSeqId, 
        c.path = $chunkParam.path,
        c.text = $chunkParam.text,
        c.documentUri = $chunkParam.documentUri,
        c += $chunkParam.metadata
RETURN c
"""

In [350]:
# Helper function to create nodes for all chunks.
# This will use the `merge_chunk_node_query` to create a `:Chunk` node for each chunk.
def create_chunk_id(metadata, idx) -> str:
    id = metadata["header1"]
    if 'header2' in metadata:
        id = id + '|' + metadata["header2"]
    if 'header3' in metadata:
        id = id + '|' + metadata["header3"]
    if 'header4' in metadata:
        id = id + '|' + metadata["header4"]
    id = id + '|' + str(idx)    
    return hashlib.sha1(id.encode()).hexdigest()

def create_path(metadata) -> str: 
    path = metadata["header1"]
#    if 'header2' in metadata:
#        path = path + '/' + metadata["header2"]
#    if 'header3' in metadata:
#        path = path + '/' + metadata["header3"]
#    if 'header4' in metadata:
#        path = path + '/' + metadata["header4"]
    return path.replace(' ', '_').replace('.', '_').lower()

def create_nodes_for_all_chunks(documentUri, chunks):
    node_count = 0
    for i, chunk in enumerate(chunks):
        chunk_id = create_chunk_id(chunk.metadata, i)
        path = create_path(chunk.metadata)
        gdb.execute_query(merge_chunk_node_query, 
                chunkParam = { "chunkId": chunk_id, "source":"", "chunkSeqId":i, "text": chunk.page_content, "metadata": chunk.metadata, "path": path, "documentUri": documentUri }
        )
        node_count += 1
    print(f"Created {node_count} nodes")
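
Because the chunk ID is a hash of the header trail plus the chunk index, the same input always yields the same ID. The sketch below illustrates that property with a simplified, hypothetical `toy_chunk_id` function; it mirrors the idea of `create_chunk_id` but is not a drop-in replacement for it.

```python
import hashlib


def toy_chunk_id(headers: list[str], idx: int) -> str:
    """Hash the header trail and the chunk index into a stable hex ID."""
    key = "|".join(headers + [str(idx)])
    return hashlib.sha1(key.encode()).hexdigest()


headers = ["IIHF Official Rulebook 2023/24", "Welcome"]
id_a = toy_chunk_id(headers, 0)
id_b = toy_chunk_id(headers, 0)  # same input, same ID
id_c = toy_chunk_id(headers, 1)  # different index, different ID
print(id_a == id_b, id_a == id_c)
```

Stable IDs are what allow the `MERGE` query to recognize chunks that were already loaded.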

### Prepare unique constraint

Before calling the helper function to create a knowledge graph,
we will take one extra step to make sure we don't duplicate data.

The uniqueness constraint is also an index. Its job is to ensure that
a particular property is unique for all nodes that share a common label.



In [351]:
# Create a uniqueness constraint on the chunkId property of Chunk nodes 
gdb.execute_query("""
CREATE CONSTRAINT unique_chunk IF NOT EXISTS 
    FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
""")

created_indexes = gdb.execute_query('SHOW CONSTRAINTS').records
print(created_indexes)

[<Record id=10 name='unique_chunk' type='UNIQUENESS' entityType='NODE' labelsOrTypes=['Chunk'] properties=['chunkId'] ownedIndex='unique_chunk' propertyType=None>]


### Create index

To speed up lookups on the `path` property, create an index.


In [352]:
# Create a range index on the path property of Chunk nodes
gdb.execute_query("""
CREATE INDEX FOR (c:Chunk) ON (c.path)
""")

created_indexes = gdb.execute_query('SHOW INDEXES').records
print(created_indexes)

[<Record id=11 name='index_2bc8b8e7' state='ONLINE' populationPercent=100.0 type='RANGE' entityType='NODE' labelsOrTypes=['Chunk'] properties=['path'] indexProvider='range-1.0' owningConstraint=None lastRead=None readCount=None>, <Record id=6 name='sections_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Section'] properties=['summaryEmbedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=neo4j.time.DateTime(2024, 6, 2, 17, 46, 42, 478000000, tzinfo=<UTC>) readCount=1>, <Record id=7 name='unique_chunk' state='ONLINE' populationPercent=100.0 type='RANGE' entityType='NODE' labelsOrTypes=['Chunk'] properties=['chunkId'] indexProvider='range-1.0' owningConstraint='unique_chunk' lastRead=None readCount=None>]


## Load all chunks

Create nodes for all chunks of the loaded document.

In [353]:
%%time

create_nodes_for_all_chunks(documents[0].metadata["source"], chunks)

# Check the number of nodes in the graph
gdb.execute_query("MATCH (c:Chunk) RETURN count(c) as chunkCount").records[0].get('chunkCount')

Created 271 nodes
CPU times: user 182 ms, sys: 32.1 ms, total: 214 ms
Wall time: 1.49 s


271

In [324]:
# Check the number of distinct `header4` values across all Chunk nodes
gdb.execute_query("MATCH (c:Chunk) RETURN count(distinct(c.header4)) as uniqueHeader4Count").records[0]

<Record uniqueHeader4Count=130>

# Enhance - vector embeddings for the text of each chunk  

## Setup

You will use the `embeddings_api` defined in `shared.ipynb` to get the vector embeddings 
for the text of each chunk. This api will use an LLM to calculate an embedding for text.

In [None]:
# A simple example of how to use the embeddings API
text_embedding = embeddings_api.embed_query("embed this text using an LLM")

print(text_embedding)

# all embeddings will have the same size, which is the dimensions of the vector
vector_dimensions = len(text_embedding) 

print(f"Text embeddings will have {vector_dimensions} dimensions")

### Prepare a vector index

Now that you have a graph populated with `Chunk` nodes, 
you can add vector embeddings.

First, prepare a vector index to store the embeddings.

The index will be called `chunks_vector` and will store
embeddings for nodes labeled as `Chunk` in a property
called `embedding`.

The embeddings index will match the dimensions of the 
embeddings returned by the `embeddings_api` and will use 
the cosine similarity function.
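
For reference, cosine similarity compares the direction of two vectors and ignores their length. The sketch below computes the raw value in plain Python; the function name and sample vectors are illustrative, and the score reported by a vector index may be scaled differently from this raw value.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1, orthogonal vectors score 0
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```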

In [354]:
# Create a vector index called "chunks_vector" on the `embedding` property of nodes labeled `Chunk`.
# neo4j_create_vector_index(kg, VECTOR_INDEX_NAME, 'Chunk', 'embedding')
gdb.execute_query("""
         CREATE VECTOR INDEX `chunks_vector` IF NOT EXISTS
          FOR (c:Chunk) ON (c.embedding) 
          OPTIONS { indexConfig: {
            `vector.dimensions`: $vectorDimensionsParam,
            `vector.similarity_function`: 'cosine'    
         }}
""",
  vectorDimensionsParam = vector_dimensions
)

# Check the vector indexes in the graph
gdb.execute_query('SHOW VECTOR INDEXES').records

[<Record id=12 name='chunks_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Chunk'] properties=['embedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=None readCount=None>,
 <Record id=6 name='sections_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Section'] properties=['summaryEmbedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=neo4j.time.DateTime(2024, 6, 2, 17, 46, 42, 478000000, tzinfo=<UTC>) readCount=1>]

In [63]:
# Using Langchain to create the vector index (alternative to the previous cell)
from langchain.vectorstores import Neo4jVector
Neo4jVector.from_existing_graph(
    embedding=embeddings_api,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE,
    index_name='chunks_vector',
    node_label="Chunk",
    text_node_properties=['text'],
    embedding_node_property='embedding',
)

# Check the vector indexes in the graph
gdb.execute_query('SHOW VECTOR INDEXES').records

[<Record id=12 name='chunks_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Chunk'] properties=['embedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=None readCount=0>]

### Create text embeddings

Creating the text embeddings will be a two step process. 

First, collect all chunk text and chunk ids from the graph.
Yes, these are the same chunk ids that were used to create the graph,
and you could save time by doing this all at once. We're doing
this incrementally to show the process, not to optimize for speed.

Next, use the `embeddings_api` to get the embeddings for the text
and write those values back into the graph. 

This will take some time to run as we're doing it one chunk at a time,
calling out to the `embeddings_api` for each then writing all those
results back into the graph.
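
The loop can be made safe to re-run by only embedding chunks that do not have an embedding yet. The sketch below shows that pattern on plain dictionaries; `embed_missing` and `toy_embed` are hypothetical stand-ins (a real embedding comes from a model, not from string lengths).

```python
def embed_missing(nodes: list[dict], embed) -> int:
    """Embed only nodes whose 'embedding' is still None; return the number updated."""
    updated = 0
    for node in nodes:
        if node.get("embedding") is None:
            node["embedding"] = embed(node["text"])
            updated += 1
    return updated


def toy_embed(text: str) -> list[float]:
    # Stand-in for an embedding model: two made-up numeric features
    return [float(len(text)), float(text.count(" "))]


toy_nodes = [
    {"text": "the rink", "embedding": None},
    {"text": "the puck", "embedding": [1.0, 2.0]},
]
first_pass = embed_missing(toy_nodes, toy_embed)   # embeds only the first node
second_pass = embed_missing(toy_nodes, toy_embed)  # nothing left to do
print(first_pass, second_pass)
```

If the run is interrupted, a second pass picks up only the chunks that are still missing an embedding.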

In [355]:
%%time

def text_for_embedding(text, header2, header3, header4) -> str:
    text_for_embed = text
    if header4 is not None:
        text_for_embed = header4 + '>>' + text_for_embed
    return text_for_embed

# Create vector embeddings for all the Chunk text, one chunk at a time.
# Only chunks without an embedding are selected, so the cell
# can be re-run without losing all progress
print("Finding all chunks that need embedding...")
all_chunks_for_embed = gdb.execute_query("""
  MATCH (chunk:Chunk) WHERE chunk.embedding IS NULL
  RETURN chunk.text AS text, chunk.header2 as header2, chunk.header3 as header3, chunk.header4 as header4, chunk.chunkId AS chunkId
  """).records

print("Generating vector embeddings, then writing into each chunk...")
for chunk in all_chunks_for_embed:
  text = text_for_embedding(chunk['text'], chunk['header2'], chunk['header3'], chunk['header4'])
  #print (text)
  embedding = embeddings_api.embed_query(text)
  gdb.execute_query("""
    MATCH (chunk:Chunk {chunkId: $chunkIdParam})
    CALL db.create.setNodeVectorProperty(chunk, "embedding", $embeddingParam)    
    """, 
    chunkIdParam=chunk['chunkId'], embeddingParam=embedding
  )

Finding all chunks that need embedding...
Generating vector embeddings, then writing into each chunk...
CPU times: user 1.82 s, sys: 209 ms, total: 2.03 s
Wall time: 22.6 s


# Expand - connect the chunks into linked lists

You can now create `NEXT` relationships between
consecutive chunks that share the same `path`,
ordered by `chunkSeqId`, effectively creating a
linked list from the first chunk to the last.
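
As a rough picture of what the linking step produces, the sketch below groups toy chunk records by `path`, sorts each group by `chunkSeqId`, and pairs each chunk with its successor. The `next_pairs` function and the sample records are assumptions for illustration; in the notebook the linking is done inside the database.

```python
from collections import defaultdict


def next_pairs(chunks: list[dict]) -> list[tuple[str, str]]:
    """Return (from_id, to_id) pairs linking consecutive chunks within each path."""
    by_path = defaultdict(list)
    for chunk in chunks:
        by_path[chunk["path"]].append(chunk)

    pairs = []
    for path_chunks in by_path.values():
        ordered = sorted(path_chunks, key=lambda c: c["chunkSeqId"])
        # Pair every chunk with the one that follows it
        pairs.extend((a["chunkId"], b["chunkId"]) for a, b in zip(ordered, ordered[1:]))
    return pairs


toy_chunks = [
    {"chunkId": "c2", "path": "rulebook", "chunkSeqId": 2},
    {"chunkId": "c0", "path": "rulebook", "chunkSeqId": 0},
    {"chunkId": "c1", "path": "rulebook", "chunkSeqId": 1},
    {"chunkId": "x0", "path": "other", "chunkSeqId": 0},
]
links = next_pairs(toy_chunks)
print(links)
```

A path with a single chunk produces no links, and chunks from different paths are never connected to each other.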



In [332]:
%%time

# Collect all the distinct chunk paths
distinct_path_result = gdb.execute_query("""
MATCH (c:Chunk) RETURN DISTINCT c.path as path
""").records

distinct_path_list = list(map(lambda x: x['path'], distinct_path_result))

# Connect all chunks that share a path into a linked list
cypher = """
  MATCH (from_same_path:Chunk) // match all chunks
  WHERE from_same_path.path = $path // where the chunks are from the same path
  WITH from_same_path // with those collections of chunks
    ORDER BY from_same_path.chunkSeqId ASC // order the chunks by their sequence ID
  WITH collect(from_same_path) as same_path_chunk_list // collect the chunks into a list
    CALL apoc.nodes.link(same_path_chunk_list, "NEXT", {avoidDuplicates: true}) // then create a linked list in the graph
  RETURN size(same_path_chunk_list)
"""

for path in distinct_path_list:
    gdb.execute_query(cypher, 
             path=path
    )


CPU times: user 2.67 ms, sys: 4.96 ms, total: 7.63 ms
Wall time: 37.6 ms


# Example questions - vector similarity search with Neo4j

### Try Neo4j vector search helper

The `shared.ipynb` notebook has a helper function to perform a vector similarity search
using the Neo4j Knowledge Graph.

It will perform vector similarity search using the `chunks_vector` vector index.

Try it out by asking a question about the rulebook.

In [363]:
search_results = neo4j_vector_search(
    'what is the size of the rink?', VECTOR_INDEX_NAME
)
search_results[0]


Using vector index: chunks_vector


<Record score=0.817984938621521 text='The official size of the Rink shall be 60 m long and 26 m to 30 m wide. The corners shall be rounded in the arc of a circle with a radius of 7.0 m to 8.50 m. Any deviations from these dimensions for any IIHF competition require IIHF approval.'>

### Question Answering chat with Langchain 

Notice that we only performed vector search. So what we're getting
back is the raw chunk text.

If we want to create a chatbot that provides actual answers to
a question, we can build a RAG system using Langchain.

The basic RAG flow goes through these steps:

1. accept a question from the user
2. perform a database query to find relevant text that may provide an answer
3. package the original question plus the relevant text into a prompt
4. pass the entire prompt to an LLM to produce an answer
5. finally, return the LLM's answer to the user
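
The steps above can be sketched as a single function that takes the retriever and the LLM as plain callables. Everything here is a stand-in: `answer_question`, `fake_retrieve`, and `fake_llm` are hypothetical names, and the stub LLM simply echoes the last line of the prompt so the example runs without any model.

```python
from typing import Callable


def answer_question(
    question: str,
    retrieve: Callable[[str], list[str]],
    llm: Callable[[str], str],
) -> str:
    """Minimal RAG flow: retrieve context, build a prompt, ask the LLM."""
    context = retrieve(question)                      # step 2: find relevant text
    prompt = (                                        # step 3: question + context
        "Question: " + question + "\nContext:\n" + "\n\n".join(context)
    )
    return llm(prompt)                                # steps 4 and 5: answer


def fake_retrieve(question: str) -> list[str]:
    # Stand-in for a vector search against the graph
    return ["The official size of the Rink shall be 60 m long and 26 m to 30 m wide."]


def fake_llm(prompt: str) -> str:
    # Stand-in for a chat model: echo the last line of the prompt
    return "Based on the context: " + prompt.splitlines()[-1]


rag_answer = answer_question("what is the size of the rink?", fake_retrieve, fake_llm)
print(rag_answer)
```

Passing the retriever and the LLM in as arguments keeps the flow itself independent of any particular database or model.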

Langchain is a great framework for creating a complete RAG workflow.

It has excellent integration with Neo4j. 


In [167]:
# try the chat api directly
result = chat_api.invoke("what is the size of the icehockey rink")

result.content

'In professional and amateur ice hockey, the standard size of an NHL-sized ice hockey rink is:\n\n* Length: 200 feet (61 meters)\n* Width: 85 feet (26 meters)\n\nThis is the same size used in the National Hockey League (NHL) and many other international competitions. The dimensions are specified by the International Ice Hockey Federation (IIHF).\n\nIt\'s worth noting that there are smaller rinks, often referred to as "junior" or "youth" rinks, which are typically 180 feet (55 meters) long and 80 feet (24 meters) wide. These smaller rinks are used for younger age groups or recreational play.\n\nIn some countries, like Europe, the rink size may be slightly different, but the NHL-sized dimensions are widely adopted as the standard.'

### Neo4j Vector Store

The easiest way to start using Neo4j with Langchain
is with the `Neo4jVector` interface.

This makes Neo4j look like a vector store using
the vector index you created earlier.

Under the hood, it will use the Cypher language
for performing vector similarity searches.

The configuration specifies a few important things:
- use the defined `embeddings_api` for embeddings
- how to connect to the Neo4j database
- the name of the vector index to use
- the label of the nodes to search
- the property name of the text on those nodes
- and, the property name of the embeddings on those nodes

That vector store then gets converted into a retriever
and finally added to a Question Answering chain.

In [357]:
# Create a langchain vector store from the existing Neo4j knowledge graph.
neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=embeddings_api,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)

# RAG prompt
prompt = hub.pull("rlm/rag-prompt-llama")

# Create a retriever from the vector store
retriever = neo4j_vector_store.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
#chain = RetrievalQAWithSourcesChain.from_chain_type(
#    chat_api, chain_type="stuff", retriever=retriever
#)

chain = RetrievalQA.from_chain_type(
    chat_api, chain_type="stuff", retriever=retriever, verbose=True, chain_type_kwargs={"prompt": prompt, "verbose": True}
)

prettyVectorSearch = prettifyChain(chain)

### Ask some questions

Finally, you can use the Langchain chain, which combines the retriever
and the chat model into a question and answer interface.

With `verbose` enabled, the output shows the formatted prompt, including the retrieved context that is passed to the LLM.

In [358]:
prettyVectorSearch("who needs visor or facial protection")



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Human: [INST]<<SYS>> You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<</SYS>> 
Question: who needs visor or facial protection 
Context: 
text: The compilation and the explanations of the signals of the Game Officials are located in the Appendix I.


text: There are three (3) permissible types of facial protection which can be attached to the front of a Players’ helmet: a visor protection, a cage protection, or a full-face protection visor.


text: Dangerous Equipment includes wearing a visor in a way that may cause injury to an opponent, wearing non-approved equipment, using dangerous or illegal skates or stick, failing to

In [187]:
prettyVectorSearch("what is the size of the rink?")



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Human: [INST]<<SYS>> You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<</SYS>> 
Question: what is the size of the rink? 
Context: 
text: The official size of the Rink shall be 60 m long and 26 m to 30 m wide. The corners shall be rounded in the arc of a circle with a radius of 7.0 m to 8.50 m. Any deviations from these dimensions for any IIHF competition require IIHF approval.


text: For more information for Supplementary Discipline in pre-championship games and exhibition games, refer to the IIHF Disciplinary Code.


text: Within the Face-off Spot, draw two parallel lines 8 cm from the top and bottom of the spot. The area

### Vector search with graph pattern

You can now create a question answering chain.

The default Neo4jVector uses a basic Cypher query
to perform vector similarity search.

That query can be extended to do whatever you
want in Cypher.

This Cypher query extension will receive two variables: `node` and `score`,
and it should return three fields: `text`, `score`, and `metadata`.

  - The `text` should be plain text to be passed to the LLM.
  - The `score` column should be the similarity score of the text.
  - The `metadata` can be any additional information you want to pass, like the source of the text.


In this example, we'll use the previous/next chunks to expand the context of the text passed to the LLM.

Create two QA chains, one with and one without the chunk window.
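
The idea of a chunk window can be illustrated without a database: given the chunks in reading order and the position of the matching chunk, return the match together with its immediate neighbors. The `window_text` function below is a hypothetical sketch of that idea, not the Cypher used in the next cell.

```python
def window_text(ordered_texts: list[str], hit_index: int, radius: int = 1) -> str:
    """Join the matching chunk with up to `radius` neighbors on each side."""
    start = max(0, hit_index - radius)
    end = min(len(ordered_texts), hit_index + radius + 1)
    return " \n ".join(ordered_texts[start:end])


toy_texts = ["chunk A", "chunk B", "chunk C", "chunk D"]
middle = window_text(toy_texts, 2)  # previous, match, next
first = window_text(toy_texts, 0)   # no previous chunk exists at the start
print(repr(middle))
print(repr(first))
```

At the ends of the list the window is simply shorter, which corresponds to a chunk that has no previous or no next neighbor.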


In [368]:
retrieval_query_window = """
OPTIONAL MATCH window=
    (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)
WITH node, score, window as longestWindow 
  ORDER BY node, length(window) DESC LIMIT 1
WITH nodes(longestWindow) as chunkList, node, score
  UNWIND chunkList as chunkRows
WITH collect(chunkRows.text) as textList, node, score
RETURN apoc.text.join(textList, " \n ") as text,
    score,
    node {.source} AS metadata
"""

vector_store_window = Neo4jVector.from_existing_index(
    embedding=embeddings_api,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database="neo4j",
    index_name=VECTOR_INDEX_NAME,
    text_node_property=VECTOR_SOURCE_PROPERTY,
    retrieval_query=retrieval_query_window
)

# Create a retriever from the vector store
retriever_window = vector_store_window.as_retriever(search_kwargs={'k': 2})

# Create a chatbot Question & Answer chain from the retriever
chain_window = prettifyChain(RetrievalQA.from_chain_type(
    chat_api, 
    chain_type="stuff", 
    retriever=retriever_window,
    chain_type_kwargs={"verbose": True}
))

In [369]:
docs = retriever_window.invoke("What can you tell me about the rink")
docs

[Document(page_content='For more information for Supplementary Discipline in pre-championship games and exhibition games, refer to the IIHF Disciplinary Code.', metadata={'source': ''})]

In [370]:
chain_window("What is the size of a rink and what is the height of the boards")



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
The official size of the Rink shall be 60 m long and 26 m to 30 m wide. The corners shall be rounded in the arc of a circle with a radius of 7.0 m to 8.50 m. Any deviations from these dimensions for any IIHF competition require IIHF approval.
Human: What is the size of a rink and what is the height of the boards

> Finished chain.

> Finished chain.
According to the provided context, the official size of the Rink shall be:  *
Length: 60 m * Width: 26 m to 30 m  As for the height of the boards, it's not
mentioned in the given context. If you're looking for information on the
standard height of hockey boards, I can try to find that out for you!


In [366]:
retrieval_query_window = """
OPTIONAL MATCH window=
    (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)
WITH node, score, window as longestWindow 
  ORDER BY node, length(window) DESC LIMIT 1
WITH nodes(longestWindow) as chunkList, node, score
  UNWIND chunkList as chunkRows
WITH collect(chunkRows.text) as textList, node, score
RETURN apoc.text.join(textList, " \n ") as text,
    score,
    node {.source} AS metadata
"""

def neo4j_vector_search_2(question, retrieval_query):
  """Search for similar nodes using the Neo4j vector index"""
  vector_search_query = """
    CALL db.index.vector.queryNodes($index_name, $top_k, $question_embedding) 
        YIELD node, score
  """ + retrieval_query
  similar = []

  print ("Using vector index: " + str(VECTOR_INDEX_NAME))
    
  question_embedding = embeddings_api.embed_query(question)
  return gdb.execute_query(vector_search_query,
                      question=question, 
                      question_embedding=question_embedding, 
                      index_name=VECTOR_INDEX_NAME, 
                      top_k=10
                    ).records

search_results = neo4j_vector_search_2(
    'what is the size of the rink?', retrieval_query_window
)
search_results[0]
    

Using vector index: chunks_vector


<Record text='The official size of the Rink shall be 60 m long and 26 m to 30 m wide. The corners shall be rounded in the arc of a circle with a radius of 7.0 m to 8.50 m. Any deviations from these dimensions for any IIHF competition require IIHF approval.' score=0.817984938621521 metadata={'source': ''}>