# Minimum Viable Graph (MVG)

In this notebook, you'll create the Minimum Viable Graph consisting of `Chunk` nodes arranged into linked lists.


1. Extract text from Form10k files, split into chunks, create `Chunk` nodes
2. Enhance each `Chunk` node with a text embedding
3. Expand the `Chunk` nodes with `NEXT` relationships to form linked lists

```cypher
(:Chunk 
  chunkId: string
  cik: int
  cusip6: string
  item: string
  source: string
  text: string
  textEmbedding: float[]
)
```

```cypher
(:Chunk)-[:NEXT]->(:Chunk)
```

## Setup

Import some Python packages, set up global constants, and create a connection to the Neo4j database.

In [None]:
%run 'shared.ipynb'

## Prepare a GraphDatabase interface

You will use the Neo4j `GraphDatabase` interface to send queries to the Neo4j database.

In [None]:
# Expect `gdb` to be defined in the shared notebook
# gdb = GraphDatabase.driver(uri=NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

result = gdb.execute_query("RETURN 'Hello, World!' AS message")

result.records[0].get('message')

# Form 10k preprocessing

The Form10k data you will be working with has been preprocessed from the original source. 

Please see the [Form10k Preprocessing](https://github.com/neo4j-product-examples/data-prep-sec-edgar/) repository for more details.

# Step by step inspection of a single form 10k document

### Start with one file

Get the file name and then load the JSON.

In [None]:
first_file_name = [f"{DATA_DIR}/form10k/" + x for x in os.listdir(f"{DATA_DIR}/form10k/")][0]

first_file_as_object = json.load(open(first_file_name))

first_file_name

### Look at the available keys

You can loop through the keys to check what the JSON contains.

The first few keys, with names like "item1", hold the text extracted from the form 10k.
The names match the sections within the form.

Then there are some fields that were pulled from the mapping file,
including cik, cusip6, cusip and names.

The source field contains a URL to the download page on the SEC website.

In [None]:
for k,v in first_file_as_object.items():
    print(k, type(v))

### Look at one of the items

Take a look at one of the items to see what the text is like.
This full text will be split into chunks and stored in the graph.

In [None]:
item1_text = first_file_as_object['item1']

item1_text[0:1500]

### Text splitter from Langchain

You can use a text splitter function from Langchain.

The `RecursiveCharacterTextSplitter` will use newlines
and then whitespace characters to break down a text until
the chunks are small enough. This strategy is generally
good at keeping paragraphs together.

Set a chunk size of 2000 characters,
with 200 characters of overlap between each chunk,
using the built-in `len` function to calculate the 
text length.


In [None]:
# Splitting text into chunks using the RecursiveCharacterTextSplitter 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)
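
To make the splitting strategy concrete, here is a simplified sketch of recursive splitting in plain Python. It is not Langchain's implementation: the `recursive_split` function, its separator list, and the sample text are assumptions made for illustration, and `chunk_overlap` is left out entirely. It only shows the idea of trying coarse separators first and falling back to finer ones for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts of chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, finer))
    # Greedily merge neighbouring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Hypothetical sample text with three short paragraphs.
sample_text = (
    "Alpha beta gamma.\n\n"
    "Delta epsilon zeta eta theta iota kappa lambda mu.\n\n"
    "Nu xi."
)
sample_chunks = recursive_split(sample_text, chunk_size=30)
for chunk in sample_chunks:
    print(len(chunk), repr(chunk))
```

Short paragraphs stay whole, and only the long middle paragraph is broken at word boundaries, which is why this family of strategies tends to keep paragraphs together.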

### Text splitter demonstration

You can see what the text splitter will do by splitting up
the `item1_text`.

In [None]:
item1_text_chunks = text_splitter.split_text(item1_text)
item1_text_chunks[0]


### Helper function for loading form 10k data

You can create a helper function to load the form 10k data from the json files.

This function will load the file as a json object, 
then for each text section in the json, it will split the text into chunks.
For each chunk, it will create a new object with the chunk text and the metadata from the form 10k.


In [None]:
def split_form10k_data_from_file(file):
    chunks_with_metadata = [] # use this to accumulate chunk records
    with open(file) as f: # open the json file, closing it when done
        file_as_object = json.load(f)
    for item in ['item1','item1a','item7','item7a']: # pull these keys from the json
        print(f'Processing {item} from {file}') 
        item_text = file_as_object[item] # grab the text of the item
        item_text_chunks = text_splitter.split_text(item_text) # split the text into chunks
        chunk_seq_id = 0
        for chunk in item_text_chunks[:20]: # only take the first 20 chunks
            form_id = file[file.rindex('/') + 1:file.rindex('.')] # extract form id from file name
            # finally, construct a record with metadata and the chunk text
            chunks_with_metadata.append({
                'text': chunk, 
                # metadata from looping...
                'item': item,
                'chunkSeqId': chunk_seq_id,
                # constructed metadata...
                'formId': f'{form_id}', # pulled from the filename
                'chunkId': f'{form_id}-{item}-chunk{chunk_seq_id:04d}',
                # metadata from file...
                'names': file_as_object['names'],
                'cik': file_as_object['cik'],
                'cusip6': file_as_object['cusip6'],
                'source': file_as_object['source'],
            })
            chunk_seq_id += 1
        print(f'\tSplit into {chunk_seq_id} chunks')
    return chunks_with_metadata
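
The `chunkId` built by the helper combines the form ID, the item name, and a zero-padded sequence number. The sketch below isolates that step so you can see the resulting format. The `build_chunk_id` name and the file path are hypothetical, and it uses `os.path` instead of `rindex`, which should give the same form ID for paths like the ones used here.

```python
import os

def build_chunk_id(file_path, item, chunk_seq_id):
    # Strip the directory and the extension to get the form ID.
    form_id = os.path.splitext(os.path.basename(file_path))[0]
    # Pad the sequence number to four digits so IDs sort in chunk order.
    return f"{form_id}-{item}-chunk{chunk_seq_id:04d}"

# Hypothetical file name, for illustration only.
example_chunk_id = build_chunk_id("./data/form10k/0000000000-23-000001.json", "item1a", 7)
print(example_chunk_id)
```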

## Create a graph from the chunks

You now have chunks prepared for creating a knowledge graph.

The graph will have 1 node per chunk, containing the chunk text and metadata as properties.

### Merge chunk query

You will use a Cypher query to merge the chunks into the graph.

This query accepts a query parameter called `chunkParam` which is expected
to have the data record containing the chunk and metadata.

The `MERGE` clause will first try to match an existing node with the same `chunkId` property.

If no such node exists, it will create a new node and the `ON CREATE` clause will set the properties using values from the `chunkParam` query parameter.
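
As a rough analogy, the match-or-create behavior can be sketched with a Python dictionary keyed by `chunkId`. This is an in-memory illustration only, not how Neo4j works internally. The `merge_chunk` function and the sample records are assumptions.

```python
def merge_chunk(store, chunk_param):
    """Return the stored record for a chunkId, creating it only if it is missing."""
    chunk_id = chunk_param["chunkId"]
    created = chunk_id not in store
    if created:
        # Counterpart of ON CREATE SET: properties are copied only for a new record.
        store[chunk_id] = dict(chunk_param)
    return store[chunk_id], created

store = {}
first, created_first = merge_chunk(store, {"chunkId": "form-item1-chunk0000", "text": "original"})
again, created_again = merge_chunk(store, {"chunkId": "form-item1-chunk0000", "text": "changed"})
print(created_first, created_again, again["text"])
```

The second call finds the existing record and leaves its properties alone, which is why re-running the load does not duplicate or overwrite chunks.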

In [None]:
merge_chunk_node_query = """
MERGE(mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
    ON CREATE SET 
        mergedChunk.names = $chunkParam.names,
        mergedChunk.formId = $chunkParam.formId, 
        mergedChunk.cik = $chunkParam.cik, 
        mergedChunk.cusip6 = $chunkParam.cusip6, 
        mergedChunk.source = $chunkParam.source, 
        mergedChunk.item = $chunkParam.item, 
        mergedChunk.chunkSeqId = $chunkParam.chunkSeqId, 
        mergedChunk.text = $chunkParam.text
RETURN mergedChunk
"""

In [None]:
# Helper function to create nodes for all chunks.
# This will use the `merge_chunk_node_query` to create a `:Chunk` node for each chunk.
def create_nodes_for_all_chunks(chunks_with_metadata_list):
    node_count = 0
    for chunk in chunks_with_metadata_list:
        gdb.execute_query(merge_chunk_node_query, 
                chunkParam = chunk
        )
        node_count += 1
    print(f"Created {node_count} nodes")

### Prepare unique constraint

Before calling the helper function to create a knowledge graph,
we will take one extra step to make sure we don't duplicate data.

The uniqueness constraint is also an index. Its job is to ensure that
a particular property is unique for all nodes that share a common label.



In [None]:
# Create a uniqueness constraint on the chunkId property of Chunk nodes 
gdb.execute_query("""
CREATE CONSTRAINT unique_chunk IF NOT EXISTS 
    FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
""")

created_indexes = gdb.execute_query('SHOW CONSTRAINTS').records
print(created_indexes)

## Load all form 10k files

Perform the node creation for all files in an import directory. 

In [None]:
%%time

import glob

IMPORT_DATA_DIRECTORY = f"{DATA_DIR}/form10k/"

all_file_names = glob.glob(IMPORT_DATA_DIRECTORY + "*.json")
counter = 0

for file_name in all_file_names:
    counter += 1
    print(f'=== Processing {counter} of {len(all_file_names)} ===')
    # get and split text data
    print(f'Reading and splitting Form10k file {file_name}...')
    chunk_list = split_form10k_data_from_file(file_name)
    #load nodes
    print('Creating Chunk Nodes...')
    create_nodes_for_all_chunks(chunk_list)
    print(f'Done Processing {file_name}')

# Check the number of nodes in the graph
gdb.execute_query("MATCH (n:Chunk) RETURN count(n) as chunkCount").records[0].get('chunkCount')

In [None]:
# Check the number of unique form IDs in the graph
gdb.execute_query("MATCH (c:Chunk) RETURN count(distinct(c.formId)) as uniqueFormCount").records[0]

In [None]:
# Check the number of unique company CUSIPs (company IDs) in the graph
# Expect this to match the `uniqueFormCount` from the previous cell
gdb.execute_query("MATCH (c:Chunk) RETURN count(distinct(c.cusip6)) as uniqueCompanyCount").records[0]

In [None]:
# See a list of the company names in the graph
gdb.execute_query("MATCH (c:Chunk) RETURN DISTINCT c.names").records

# Enhance - vector embeddings for the text of each chunk  

## Setup

You will use the `embeddings_api` defined in `shared.ipynb` to get the vector embeddings 
for the text of each chunk. This api will use an LLM to calculate an embedding for text.

In [None]:
# A simple example of how to use the embeddings API
text_embedding = embeddings_api.embed_query("embed this text using an LLM")

print(text_embedding)

# all embeddings will have the same size, which is the dimensions of the vector
vector_dimensions = len(text_embedding) 

print(f"Text embeddings will have {vector_dimensions} dimensions")

### Prepare a vector index

Now that you have a graph populated with `Chunk` nodes, 
you can add vector embeddings.

First, prepare a vector index to store the embeddings.

The index will be called `form_10k_chunks` and will store
embeddings for nodes labeled as `Chunk` in a property
called `textEmbedding`.

The embeddings index will match the dimensions of the 
embeddings returned by the `embeddings_api` and will use 
the cosine similarity function.
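
For reference, cosine similarity compares the direction of two vectors and ignores their length. The plain Python sketch below shows the calculation. The function name and sample vectors are illustrative, and the index computes this internally, so you would not normally call code like this yourself.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```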

In [None]:
# Create a vector index called "form_10k_chunks" on the `textEmbedding` property of nodes labeled `Chunk`.
# neo4j_create_vector_index(kg, VECTOR_INDEX_NAME, 'Chunk', 'textEmbedding')
gdb.execute_query("""
         CREATE VECTOR INDEX `form_10k_chunks` IF NOT EXISTS
          FOR (c:Chunk) ON (c.textEmbedding) 
          OPTIONS { indexConfig: {
            `vector.dimensions`: $vectorDimensionsParam,
            `vector.similarity_function`: 'cosine'    
         }}
""",
  vectorDimensionsParam = vector_dimensions
)

# Check the vector indexes in the graph
gdb.execute_query('SHOW VECTOR INDEXES').records

### Create text embeddings

Creating the text embeddings will be a two step process. 

First, collect all chunk text and chunk ids from the graph.
Yes, these are the same chunk ids that were used to create the graph
and you could save time by doing this all at once. We're doing
this incrementally to show the process, not optimized for speed.

Next, use the `embeddings_api` to get the embeddings for the text
and write those values back into the graph. 

This will take some time to run as we're doing it one chunk at a time,
calling out to the `embeddings_api` for each then writing all those
results back into the graph.
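
If the one-at-a-time approach becomes too slow, one option is to group chunks into batches and request embeddings for each batch in a single call. The sketch below shows only the batching logic. `batched`, `embed_in_batches`, and `fake_embed_many` are hypothetical names, and the stand-in embedder returns made-up vectors. Langchain embedding classes typically provide an `embed_documents` method that accepts a list of texts, so something like `embeddings_api.embed_documents` could plausibly be passed as `embed_many`, but check what your embeddings class supports.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(records, embed_many, batch_size=50):
    """Pair each chunkId with its embedding, calling embed_many once per batch."""
    rows = []
    for batch in batched(records, batch_size):
        vectors = embed_many([record["text"] for record in batch])
        rows.extend(
            {"chunkId": record["chunkId"], "textEmbedding": vector}
            for record, vector in zip(batch, vectors)
        )
    return rows

# Stand-in embedder for illustration only: one fake 2-dimensional vector per text.
def fake_embed_many(texts):
    return [[float(len(text)), 0.0] for text in texts]

records = [{"chunkId": f"c{i}", "text": "x" * i} for i in range(5)]
rows = embed_in_batches(records, fake_embed_many, batch_size=2)
print(len(rows), rows[3])
```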

In [None]:
# Create vector embeddings for all the Chunk text, one chunk at a time.
# Only chunks without a textEmbedding are selected, so the cell
# can be re-run without losing all progress
print("Finding all chunks that need textEmbedding...")
all_chunk_text_id = gdb.execute_query("""
  MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
  RETURN chunk.text AS text, chunk.chunkId AS chunkId
  """).records

print("Generating vector embeddings, then writing into each chunk...")
for chunk_text_id in all_chunk_text_id:
  text_embedding = embeddings_api.embed_query(chunk_text_id['text'])
  gdb.execute_query("""
    MATCH (chunk:Chunk {chunkId: $chunkIdParam})
    CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", $textEmbeddingParam)    
    """, 
    chunkIdParam=chunk_text_id['chunkId'], textEmbeddingParam=text_embedding
  )


# Expand - connect the chunks into linked lists

You can now create `NEXT` relationships between consecutive
chunks that come from the same form and the same section,
effectively creating a linked list from the
first chunk of each section to the last.
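
The linking step amounts to sorting the chunks of one section by `chunkSeqId` and connecting each chunk to the one that follows it. Here is a minimal in-memory sketch of that pairing. The `next_pairs` function and the sample records are assumptions for illustration, and the actual relationships are created in the database by the Cypher query in the next cell.

```python
def next_pairs(chunks):
    """Return (from_id, to_id) pairs linking chunks in chunkSeqId order."""
    ordered = sorted(chunks, key=lambda chunk: chunk["chunkSeqId"])
    return [
        (current["chunkId"], following["chunkId"])
        for current, following in zip(ordered, ordered[1:])
    ]

# Hypothetical chunks from a single form and section, deliberately out of order.
section_chunks = [
    {"chunkId": "form-item1-chunk0002", "chunkSeqId": 2},
    {"chunkId": "form-item1-chunk0000", "chunkSeqId": 0},
    {"chunkId": "form-item1-chunk0001", "chunkSeqId": 1},
]
pairs = next_pairs(section_chunks)
print(pairs)
```

A section with n chunks yields n - 1 `NEXT` relationships, and a section with a single chunk yields none.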



In [None]:
# Collect all the distinct form IDs
distinct_form_id_result = gdb.execute_query("""
MATCH (c:Chunk) RETURN DISTINCT c.formId as formId
""").records

distinct_form_id_list = list(map(lambda x: x['formId'], distinct_form_id_result))

# Connect *all* section chunks into a linked list
cypher = """
  MATCH (from_same_form_and_section:Chunk) // match all chunks
  WHERE from_same_form_and_section.formId = $formIdParam // where the chunks are from the same form
    AND from_same_form_and_section.item = $itemParam // and from the same section
  WITH from_same_form_and_section // with those collections of chunks
    ORDER BY from_same_form_and_section.chunkSeqId ASC // order the chunks by their sequence ID
  WITH collect(from_same_form_and_section) as section_chunk_list // collect the chunks into a list
    CALL apoc.nodes.link(section_chunk_list, "NEXT", {avoidDuplicates: true}) // then create a linked list in the graph
  RETURN size(section_chunk_list)
"""

for form_id in distinct_form_id_list:
  for form10kItemName in ['item1', 'item1a', 'item7', 'item7a']:
    gdb.execute_query(cypher, 
             formIdParam=form_id, itemParam=form10kItemName
    )


# Example questions - vector similarity search with Neo4j

### Try Neo4j vector search helper

The `shared.ipynb` notebook has a helper function to perform a vector similarity search
using the Neo4j Knowledge Graph.

It will perform vector similarity search using the `form_10k_chunks` vector index.

Try it out by searching for information about one of the companies in the graph.

In [None]:
search_results = neo4j_vector_search(
    'In a single sentence, tell me about Netapp.'
)
search_results[0].get('text')
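
The helper itself lives in `shared.ipynb` and is not shown in this notebook. As a rough idea of what such a helper might look like, the sketch below embeds the question and queries the vector index with the `db.index.vector.queryNodes` procedure. The function name, parameter names, default `top_k`, and the stand-in classes are all assumptions, and the real helper may differ.

```python
from types import SimpleNamespace

VECTOR_SEARCH_QUERY = """
CALL db.index.vector.queryNodes($indexName, $topK, $questionEmbedding)
YIELD node, score
RETURN node.text AS text, score
"""

def vector_search_sketch(driver, embedder, question, index_name="form_10k_chunks", top_k=5):
    """Embed the question, then ask the vector index for the nearest chunks."""
    question_embedding = embedder.embed_query(question)
    result = driver.execute_query(
        VECTOR_SEARCH_QUERY,
        indexName=index_name, topK=top_k, questionEmbedding=question_embedding,
    )
    return result.records

# Stand-ins so the sketch runs without a database or an embeddings service.
class FakeEmbedder:
    def embed_query(self, text):
        return [0.1, 0.2, 0.3]

class FakeDriver:
    def execute_query(self, query, **params):
        self.last_params = params
        return SimpleNamespace(records=[{"text": "stub chunk text", "score": 0.9}])

fake_driver = FakeDriver()
hits = vector_search_sketch(fake_driver, FakeEmbedder(), "Tell me about Netapp.", top_k=1)
print(hits[0].get("text"))
```

In the notebook you would pass `gdb` and `embeddings_api` in place of the stand-ins.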

### Question Answering chat with Langchain 

Notice that we only performed vector search. So what we're getting
back is the raw chunk text.

If we want to create a chatbot that provides actual answers to
a question, we can build a RAG system using Langchain.

The basic RAG flow goes through these steps:

1. accept a question from the user
2. perform a database query to find relevant text that may provide an answer
3. package the original question plus the relevant text into a prompt
4. pass the entire prompt to an LLM to produce an answer
5. finally, return the LLM's answer to the user
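
The steps above can be sketched in a few lines of plain Python. The retriever and the model here are stand-in functions with made-up outputs, and the prompt wording is an assumption. A real system would query the database and call an LLM instead.

```python
def answer_question(question, retrieve, generate, top_k=3):
    # Step 2: look up text that may contain the answer.
    passages = retrieve(question, top_k)
    # Step 3: combine the question and the retrieved text into a single prompt.
    context = "\n\n".join(passages)
    prompt = (
        f"Answer the question using only this context:\n{context}\n\n"
        f"Question: {question}"
    )
    # Steps 4 and 5: ask the model and hand its answer back to the caller.
    return generate(prompt)

# Stand-ins for illustration only.
def fake_retrieve(question, top_k):
    corpus = ["Chunk about storage products.", "Chunk about risk factors."]
    return corpus[:top_k]

def fake_generate(prompt):
    return f"(model answer based on {prompt.count('Chunk')} retrieved chunks)"

# Step 1: accept a question from the user.
rag_answer = answer_question("What does the company sell?", fake_retrieve, fake_generate)
print(rag_answer)
```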

Langchain is a great framework for creating a complete RAG workflow.

It has excellent integration with Neo4j. 


In [None]:
# try the chat api directly
result = chat_api.invoke("What is the capital of France?")

result.content

### Neo4j Vector Store

The easiest way to start using Neo4j with Langchain
is with the `Neo4jVector` interface.

This makes Neo4j look like a vector store using
the vector index you created earlier.

Under the hood, it will use the Cypher language
for performing vector similarity searches.

The configuration specifies a few important things:
- use the defined `embeddings_api` for embeddings
- how to connect to the Neo4j database
- the name of the vector index to use
- the label of the nodes to search
- the property name of the text on those nodes
- and, the property name of the embeddings on those nodes

That vector store then gets converted into a retriever
and finally added to a Question Answering chain.

In [None]:
# Create a langchain vector store from the existing Neo4j knowledge graph.
neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=embeddings_api,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)

# Create a retriever from the vector store
retriever = neo4j_vector_store.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
chain = RetrievalQAWithSourcesChain.from_chain_type(
    chat_api, chain_type="stuff", retriever=retriever
)

prettyVectorSearch = prettifyChain(chain)

### Ask some questions

Finally, you can use the Langchain chain, which combines the retriever
and the chat model into a nice question and answer interface.

You can see both the answer and the source that the answer came from.

In [None]:
prettyVectorSearch("What is Netapp's primary business?")

In [None]:
prettyVectorSearch("Where is Netapp headquartered?")

In [None]:
prettyVectorSearch("Briefly tell me about Netapp in a single sentence.")