# Minimum Viable Graph (MVG)



Data import steps:

1. iteratively load all json files that we had cleaned up from the source Form 10k
2. from each json, extract metadata and the text sections (there are multiple sections, named "item1", "item1a", "item7" and "item7a".)
3. for each section, split the text up into chunks and calculate a vector embedding for the text in that chunk
4. for each chunk create a node in the knowledge graph with the metadata from the form, the text and the text embedding

---

So, you have access to a json file with a Form 10k.


The resulting nodes in the knowledge graph will have the following schema:
```cypher
(:Chunk 
  chunkId: string
  chunkSeqId: int
  cik: string
  cusip6: string
  f10kItem: string
  formId: string
  names: [string]
  source: string
  text: string
  textEmbedding: [float]
)
```
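
For reference while reading the Python code, the schema above can be mirrored as a typed record. This is only an illustrative sketch: the `ChunkRecord` name is an assumption and is not defined anywhere in the notebook, and `cik` is typed as a string because that is how it appears in the sample json inspected later.

```python
from typing import List, TypedDict


class ChunkRecord(TypedDict, total=False):
    """Hypothetical typed view of the properties stored on a :Chunk node."""
    chunkId: str
    chunkSeqId: int
    cik: str
    cusip6: str
    f10kItem: str
    source: str
    text: str
    textEmbedding: List[float]  # added in a later step, hence total=False


# Values taken from the sample output shown later in the notebook.
example_record: ChunkRecord = {
    "chunkId": "0001650372-23-000040-item1-chunk0000",
    "chunkSeqId": 0,
    "cik": "1650372",
    "cusip6": "G06242",
    "f10kItem": "item1",
}
print(sorted(example_record))
```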

## Imports

### Import python packages

To start we'll load some useful python packages,
including some great stuff from langchain.


In [2]:
from dotenv import load_dotenv
import os

# Common data processing
import json
import textwrap

# Langchain
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI

## Set up Neo4j and Langchain

### Global variables

You will set up some global variables from the environment,
along with some constants to use later.

In [7]:
# Load from environment
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE') or 'neo4j'
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

DATA_DIR = f"../{os.getenv('DATA_DIR') or 'data/single'}"

# Global constants
VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'


In [8]:
print(f"Connecting to Neo4j at {NEO4J_URI} as {NEO4J_USERNAME}")
print(f"Using data from {DATA_DIR}")

Connecting to Neo4j at neo4j://localhost:7687 as neo4j
Using data from ../data/sample
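
Since every later cell depends on these settings, it can help to check that they are present before connecting. The following is a minimal sketch, assuming the same variable names as above; the `find_missing_settings` helper is hypothetical and not part of the notebook.

```python
from typing import List, Mapping

# Settings read with os.getenv() above that have no fallback value.
REQUIRED_SETTINGS = ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY"]


def find_missing_settings(env: Mapping[str, str],
                          required: List[str] = REQUIRED_SETTINGS) -> List[str]:
    """Return the names of required settings that are absent or empty."""
    return [name for name in required if not env.get(name)]


# Demonstration with a plain dict standing in for os.environ.
example_env = {"NEO4J_URI": "neo4j://localhost:7687", "NEO4J_USERNAME": "neo4j"}
missing = find_missing_settings(example_env)
print(missing)
```

In the notebook you could pass `os.environ` instead of `example_env` and stop early if the returned list is not empty.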


### Prepare a knowledge graph interface

You can use the Langchain `Neo4jGraph` interface to send queries
to the Knowledge Graph.

In [5]:
# Create a knowledge graph using Langchain's Neo4j integration.
# This will be used for direct querying of the knowledge graph. 
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)

# Form 10k pre-processing

The Form 10k data you will be working with has been pre-processed from the original source. 

Form 10k filings are complicated files.

The pre-processing extracted useful text sections,
then combined that with a mapping file 
that has extra metadata about the filing company.

The preprocessing steps included:
- downloading the original Form 10k files from the SEC website
- extracting sections of text from the forms
- storing the extracted text along with some metadata in a json file
  - "source": each json retains a URL for the original download source.
    For example:
    ```
    "https://www.sec.gov/Archives/edgar/data/106040/000010604023000024/0000106040-23-000024-index.htm"
    ```
- the metadata from the mapping file includes:
  - "cik": a "central index key" for the company, defined by the SEC
  - "cusip6": a cusip code identifies a financial instrument, like a stock,
    issued by a particular company. Cusip codes are defined by the
    Committee on Uniform Securities Identification Procedures.
    The first 6 characters of the "cusip" code are used to identify the company.
  - "cusip": the full 9-character cusip code extends the base cusip with specific
    information about the financial instrument itself. 
    A single company will have a single 6-character cusip code, but may have many
    9-character cusip codes for different financial instruments.
    For example: 
    ```
    [
        "958102905",
        "958102955",
        "958102105",
        "958102AM7"
    ]
    ```
  - "names": an array of known names for the company. 
    For example:
    ```
      [
        "WESTERN DIGITAL CORP",
        "WESTERN DIGITAL CORP."
    ]
    ```
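
The relationship between the 6-character and 9-character cusip codes described above can be sketched in a few lines. This is an illustrative helper only (the `split_cusip` name is an assumption); it relies on the standard CUSIP layout of a 6-character issuer code, a 2-character issue code and a final check character, and it does not validate the check character.

```python
from typing import Dict


def split_cusip(cusip: str) -> Dict[str, str]:
    """Split a 9-character cusip into issuer, issue and check character."""
    if len(cusip) != 9:
        raise ValueError(f"expected a 9-character cusip, got {cusip!r}")
    return {
        "cusip6": cusip[:6],       # identifies the company
        "issue": cusip[6:8],       # identifies the specific instrument
        "check": cusip[8],         # check character (not verified here)
    }


# The example codes listed above for a single company.
cusips = ["958102905", "958102955", "958102105", "958102AM7"]
issuers = {split_cusip(code)["cusip6"] for code in cusips}
print(issuers)
```

All four codes share one `cusip6`, which is why the notebook can use `cusip6` as a company identifier.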
  


# Step by step inspection of a single form 10k document

### Start with one file

Get the file name, then load the json.

In [9]:
first_file_name = [f"{DATA_DIR}/form10k/" + x for x in os.listdir(f"{DATA_DIR}/form10k/")][0]

first_file_as_object = json.load(open(first_file_name))

first_file_name

'../data/sample/form10k/0001650372-23-000040.json'

### Script - look at the available keys

You can loop through the keys to check what the json contains.

The first few keys, with names like "item1" are the text extracted from the form 10k.
The names match the sections within the form.

Then there are some fields that were pulled from the mapping file,
including cik, cusip6, cusip and names.

The source field contains a URL to the download page on the SEC website.

In [11]:
for k,v in first_file_as_object.items():
    print(k, type(v))

item1 <class 'str'>
item1a <class 'str'>
item7 <class 'str'>
item7a <class 'str'>
cik <class 'str'>
cusip6 <class 'str'>
cusip <class 'list'>
names <class 'list'>
source <class 'str'>
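
The two groups of keys can be separated explicitly, which is essentially what the loading helper does later on. The sketch below uses a small stand-in dictionary rather than the real file: the section texts and the `cusip` entry are placeholders, while the other values are copied from the output shown in this notebook.

```python
# Keys that hold text extracted from the form.
TEXT_SECTION_KEYS = ["item1", "item1a", "item7", "item7a"]

# Stand-in for first_file_as_object; section texts are placeholders.
sample_form = {
    "item1": "placeholder text for item1",
    "item1a": "placeholder text for item1a",
    "item7": "placeholder text for item7",
    "item7a": "placeholder text for item7a",
    "cik": "1650372",
    "cusip6": "G06242",
    "cusip": ["G06242XXX"],  # hypothetical placeholder, not a real cusip
    "names": ["ATLASSIAN CORP PLC", "ATLASSIAN CORPORATION PLC"],
    "source": "https://www.sec.gov/Archives/edgar/data/1650372/000165037223000040/0001650372-23-000040-index.htm",
}

# Text sections to be chunked, and metadata to be copied onto every chunk.
text_sections = {key: sample_form[key] for key in TEXT_SECTION_KEYS}
metadata = {key: value for key, value in sample_form.items()
            if key not in TEXT_SECTION_KEYS}

print(list(text_sections))
print(sorted(metadata))
```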


### Script - look at one of the items

Take a look at one of the items to see what the text is like.

These sections can be quite long. So you can use a text splitter 
to break the text into smaller chunks.

In [12]:
item1_text = first_file_as_object['item1']

item1_text[0:1500]

'>ITEM\xa01. BUSINESS\nCompany Overview\n\xa0\nOur mission is to unleash the potential of every team.\nOur products help teams organize, discuss and complete shared work — delivering superior outcomes for their organizations.\nOur primary products include Jira Software and Jira Work Management for planning and project management, Confluence for content creation and sharing, Trello for capturing and adding structure to fluid, fast-forming work for teams, Jira Service Management for team service, management and support applications, Jira Align for enterprise agile planning, and Bitbucket for code sharing and management. Together, our products form an integrated system for organizing, discussing and completing shared work, becoming deeply entrenched in how teams collaborate and how organizations run. The Atlassian platform is the common technology foundation for our products that drives connection between teams, information, and workflows. It allows work to flow seamlessly across tools, a

### Script - text splitter from Langchain

You can use a text splitter function from Langchain.

The `RecursiveCharacterTextSplitter` will use newlines
and then whitespace characters to break down a text until
the chunks are small enough. This strategy is generally
good at keeping paragraphs together.

Set a chunk size of 2000 characters,
with 200 characters of overlap between each chunk,
using the built-in `len` function to calculate the 
text length.


In [13]:
# Splitting text into chunks using the RecursiveCharacterTextSplitter 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)
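
To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified character-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as newlines; the `sliding_window_chunks` name and the small sizes are assumptions chosen to keep the output readable.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo_text = "".join(str(i % 10) for i in range(50))  # 50 characters
demo_chunks = sliding_window_chunks(demo_text, chunk_size=20, chunk_overlap=5)
for chunk in demo_chunks:
    print(len(chunk), chunk)
```

The last 5 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk when it is shorter than the overlap.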

### Script - text splitter demonstration

You can see what the text splitter will do by splitting up
the text from "item1" that we saved earlier.

In [14]:
item1_text_chunks = text_splitter.split_text(item1_text)

### Script - what's in the chunk

As expected, the first chunk looks like the beginning of the
text you'd seen earlier.

It's always worth doing some sanity checking along the way
when doing data processing.

In [15]:
item1_text_chunks[0]

# edit - wrap this line in len() function to get next cell
# len(item1_text_chunks[0])

'>ITEM\xa01. BUSINESS\nCompany Overview\n\xa0\nOur mission is to unleash the potential of every team.\nOur products help teams organize, discuss and complete shared work — delivering superior outcomes for their organizations.\nOur primary products include Jira Software and Jira Work Management for planning and project management, Confluence for content creation and sharing, Trello for capturing and adding structure to fluid, fast-forming work for teams, Jira Service Management for team service, management and support applications, Jira Align for enterprise agile planning, and Bitbucket for code sharing and management. Together, our products form an integrated system for organizing, discussing and completing shared work, becoming deeply entrenched in how teams collaborate and how organizations run. The Atlassian platform is the common technology foundation for our products that drives connection between teams, information, and workflows. It allows work to flow seamlessly across tools, a

### Script - helper function for loading form 10k data

You can now create a helper function to load the form 10k data from the json files.

This function will load the file as a json object.
Then, for each text section in the json, it will split the text into chunks.
For each chunk, it will create a new record with the chunk text and the metadata from the form 10k.

To keep the example small, the function keeps only the first 20 chunks of each section.


In [16]:
def split_form10k_data_from_file(file):
    chunks_with_metadata = [] # use this to accumulate chunk records
    with open(file) as f: # open the json file and close it after reading
        file_as_object = json.load(f)
    for item in ['item1','item1a','item7','item7a']: # pull these keys from the json
        print(f'Processing {item} from {file}') 
        item_text = file_as_object[item] # grab the text of the item
        item_text_chunks = text_splitter.split_text(item_text) # split the text into chunks
        chunk_seq_id = 0
        for chunk in item_text_chunks[:20]: # only take the first 20 chunks
            form_id = file[file.rindex('/') + 1:file.rindex('.')] # extract form id from file name
            # finally, construct a record with metadata and the chunk text
            chunks_with_metadata.append({
                'text': chunk, 
                # metadata from looping...
                'f10kItem': item,
                'chunkSeqId': chunk_seq_id,
                # constructed metadata...
                'formId': f'{form_id}', # pulled from the filename
                'chunkId': f'{form_id}-{item}-chunk{chunk_seq_id:04d}',
                # metadata from file...
                'names': file_as_object['names'],
                'cik': file_as_object['cik'],
                'cusip6': file_as_object['cusip6'],
                'source': file_as_object['source'],
            })
            chunk_seq_id += 1
        print(f'\tSplit into {chunk_seq_id} chunks')
    return chunks_with_metadata
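
The two constructed fields, `formId` and `chunkId`, are easy to check in isolation. The sketch below reproduces the same naming scheme with `os.path` instead of string indexing; the helper names `form_id_from_path` and `make_chunk_id` are assumptions introduced for illustration.

```python
import os


def form_id_from_path(path: str) -> str:
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]


def make_chunk_id(form_id: str, item: str, chunk_seq_id: int) -> str:
    """Build a chunk id such as '<formId>-item1-chunk0000'."""
    return f"{form_id}-{item}-chunk{chunk_seq_id:04d}"


example_path = "../data/sample/form10k/0001650372-23-000040.json"
example_form_id = form_id_from_path(example_path)
example_chunk_id = make_chunk_id(example_form_id, "item1", 0)
print(example_form_id)
print(example_chunk_id)
```

Because the id combines the form, the section and a zero-padded sequence number, it is unique per chunk and sorts in reading order within a section.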

In [17]:
first_file_name

'../data/sample/form10k/0001650372-23-000040.json'

In [18]:
first_file_chunks = split_form10k_data_from_file(first_file_name)

Processing item1 from ../data/sample/form10k/0001650372-23-000040.json
	Split into 20 chunks
Processing item1a from ../data/sample/form10k/0001650372-23-000040.json
	Split into 20 chunks
Processing item7 from ../data/sample/form10k/0001650372-23-000040.json
	Split into 20 chunks
Processing item7a from ../data/sample/form10k/0001650372-23-000040.json
	Split into 4 chunks


### Script - take a look at the first chunk

You can take a look at the first chunk record to see what it looks like.

As you'd expect, it has the metadata from the form 10k, the text from the chunk, and the calculated properties like
`formId` and `chunkId`.

Notice that this chunk comes from Atlassian's filing. The sample data also includes a company named "Netapp". Later you can ask questions about that company.

In [19]:
first_file_chunks[0]

{'text': '>ITEM\xa01. BUSINESS\nCompany Overview\n\xa0\nOur mission is to unleash the potential of every team.\nOur products help teams organize, discuss and complete shared work — delivering superior outcomes for their organizations.\nOur primary products include Jira Software and Jira Work Management for planning and project management, Confluence for content creation and sharing, Trello for capturing and adding structure to fluid, fast-forming work for teams, Jira Service Management for team service, management and support applications, Jira Align for enterprise agile planning, and Bitbucket for code sharing and management. Together, our products form an integrated system for organizing, discussing and completing shared work, becoming deeply entrenched in how teams collaborate and how organizations run. The Atlassian platform is the common technology foundation for our products that drives connection between teams, information, and workflows. It allows work to flow seamlessly across

## Create a graph from the chunks

### Script - create a graph from the chunks

You now have chunks prepared for creating a knowledge graph.

The graph will have 1 node per chunk, containing the chunk text and metadata as properties.

### Script - merge chunk query

You will use a Cypher query to merge the chunks into the graph.

This query accepts a query parameter called `chunkParam` which is expected
to have the data record containing the chunk and metadata.

The `MERGE` query will first match an existing node with the same `chunkId` property.

If no such node exists, it will create a new node and the `ON CREATE` clause will set the properties using values from the `chunkParam` query parameter.

In [20]:
merge_chunk_node_query = """
MERGE(mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
    ON CREATE SET 
        mergedChunk.names = $chunkParam.names,
        mergedChunk.formId = $chunkParam.formId, 
        mergedChunk.cik = $chunkParam.cik, 
        mergedChunk.cusip6 = $chunkParam.cusip6, 
        mergedChunk.source = $chunkParam.source, 
        mergedChunk.f10kItem = $chunkParam.f10kItem, 
        mergedChunk.chunkSeqId = $chunkParam.chunkSeqId, 
        mergedChunk.text = $chunkParam.text
RETURN mergedChunk
"""
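
As a rough in-memory analogy for the `MERGE ... ON CREATE SET` behaviour described above, consider a dictionary keyed by `chunkId`. This only illustrates the semantics and says nothing about how Neo4j implements them; the `merge_chunk` function is hypothetical.

```python
from typing import Dict


def merge_chunk(store: Dict[str, dict], chunk_param: dict) -> dict:
    """Return the record with this chunkId, creating it only if it is missing."""
    chunk_id = chunk_param["chunkId"]
    if chunk_id not in store:
        # Equivalent of ON CREATE SET: properties are copied on creation only.
        store[chunk_id] = dict(chunk_param)
    return store[chunk_id]


store: Dict[str, dict] = {}
first = merge_chunk(store, {"chunkId": "form-item1-chunk0000", "text": "original text"})
second = merge_chunk(store, {"chunkId": "form-item1-chunk0000", "text": "different text"})
print(len(store), second["text"])
```

Running the same merge twice leaves a single record with its original properties, which is why the import can be re-run without creating duplicates.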

### Script - create 1 chunk

Try creating a node using that query by passing in the first
chunk as a parameter.

The returned value is the newly created node.

In [21]:
kg.query(merge_chunk_node_query, 
         params={'chunkParam':first_file_chunks[0]})

[{'mergedChunk': {'formId': '0001650372-23-000040',
   'f10kItem': 'item1',
   'names': ['ATLASSIAN CORP PLC', 'ATLASSIAN CORPORATION PLC'],
   'cik': '1650372',
   'cusip6': 'G06242',
   'source': 'https://www.sec.gov/Archives/edgar/data/1650372/000165037223000040/0001650372-23-000040-index.htm',
   'text': '>ITEM\xa01. BUSINESS\nCompany Overview\n\xa0\nOur mission is to unleash the potential of every team.\nOur products help teams organize, discuss and complete shared work — delivering superior outcomes for their organizations.\nOur primary products include Jira Software and Jira Work Management for planning and project management, Confluence for content creation and sharing, Trello for capturing and adding structure to fluid, fast-forming work for teams, Jira Service Management for team service, management and support applications, Jira Align for enterprise agile planning, and Bitbucket for code sharing and management. Together, our products form an integrated system for organizing,

### Script - prepare unique constraint

Before calling the helper function to create a knowledge graph,
we will take one extra step to make sure we don't duplicate data.

In the previous lesson, you created a vector index.

A uniqueness constraint is also backed by an index. Its job is to ensure that
a particular property is unique for all nodes that share a common label.



In [22]:
# Create a uniqueness constraint on the chunkId property of Chunk nodes 
kg.query("""
CREATE CONSTRAINT unique_chunk IF NOT EXISTS 
    FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
""")

created_indexes = kg.query('SHOW INDEXES')
print(created_indexes)

[{'id': 7, 'name': 'unique_chunk', 'state': 'ONLINE', 'populationPercent': 100.0, 'type': 'RANGE', 'entityType': 'NODE', 'labelsOrTypes': ['Chunk'], 'properties': ['chunkId'], 'indexProvider': 'range-1.0', 'owningConstraint': 'unique_chunk', 'lastRead': None, 'readCount': None}]


## Load all Form 10k files

Perform the node creation for all files in an import directory. 

In [25]:
# Helper function to create nodes for all chunks
def create_nodes_for_all_chunks(chunks_with_metadata_list):
    node_count = 0
    for chunk in chunks_with_metadata_list:
        print(f"Creating `:Chunk` node for chunk ID {chunk['chunkId']}")
        kg.query(merge_chunk_node_query, 
                params={
                    'chunkParam': chunk
                })
        node_count += 1
    print(f"Created {node_count} nodes")

In [26]:
%%time

IMPORT_DATA_DIRECTORY = f"{DATA_DIR}/form10k/"

all_file_names = [IMPORT_DATA_DIRECTORY + x for x in os.listdir(IMPORT_DATA_DIRECTORY)]
counter = 0

for file_name in all_file_names:
    counter += 1
    print(f'=== Processing {counter} of {len(all_file_names)} ===')
    # get and split text data
    print('Reading and splitting Form10k file...')
    chunk_list = split_form10k_data_from_file(file_name)
    #load nodes
    print('Creating Chunk Nodes...')
    create_nodes_for_all_chunks(chunk_list)
    print(f'Done Processing {file_name}')

# Check the number of nodes in the graph
kg.query("MATCH (n:Chunk) RETURN count(n) as chunkCount")

=== Processing 1 of 10 ===
Reading and splitting Form10k file...
Processing item1 from ../data/sample/form10k/0001650372-23-000040.json
	Split into 20 chunks
Processing item1a from ../data/sample/form10k/0001650372-23-000040.json
	Split into 20 chunks
Processing item7 from ../data/sample/form10k/0001650372-23-000040.json
	Split into 20 chunks
Processing item7a from ../data/sample/form10k/0001650372-23-000040.json
	Split into 4 chunks
Creating Chunk Nodes...
Creating `:Chunk` node for chunk ID 0001650372-23-000040-item1-chunk0000
Creating `:Chunk` node for chunk ID 0001650372-23-000040-item1-chunk0001
Creating `:Chunk` node for chunk ID 0001650372-23-000040-item1-chunk0002
Creating `:Chunk` node for chunk ID 0001650372-23-000040-item1-chunk0003
Creating `:Chunk` node for chunk ID 0001650372-23-000040-item1-chunk0004
Creating `:Chunk` node for chunk ID 0001650372-23-000040-item1-chunk0005
Creating `:Chunk` node for chunk ID 0001650372-23-000040-item1-chunk0006
Creating `:Chunk` node for 

[{'chunkCount': 544}]

In [27]:
kg.query("MATCH (c:Chunk) RETURN count(distinct(c.formId)) as uniqueFormCount")

[{'uniqueFormCount': 10}]

In [28]:
kg.query("MATCH (c:Chunk) RETURN count(distinct(c.cusip6)) as uniqueCompanyCount")

[{'uniqueCompanyCount': 10}]

In [29]:
kg.query("MATCH (c:Chunk) RETURN DISTINCT c.names")

[{'c.names': ['ATLASSIAN CORP PLC', 'ATLASSIAN CORPORATION PLC']},
 {'c.names': ['FedEx Corp', 'FEDEX CORP']},
 {'c.names': ['NEWS CORP   CLASS B', 'News Corp.', 'NEWS CORP NEW']},
 {'c.names': ['GSI TECHNOLOGY INC']},
 {'c.names': ['APPLE INC']},
 {'c.names': ['Netapp Inc', 'NETAPP INC']},
 {'c.names': ['Palo Alto Networks Inc.',
   'PALO ALTO NETWORKS INC',
   'PALO ALTO NETWORKS INC PUT',
   'None']},
 {'c.names': ['WESTERN DIGITAL CORP', 'WESTERN DIGITAL CORP.']},
 {'c.names': ['NIKE Inc.', 'NIKE INC']},
 {'c.names': ['SEAGATE TECHNOLOGY']}]

# Enhance - vector embeddings for the text of each chunk  

### Script - prepare a vector index

Now that you have a graph populated with `Chunk` nodes, 
you can add vector embeddings.

First, prepare a vector index to store the embeddings.

The index will be called `form_10k_chunks` and will store
embeddings for nodes labeled as `Chunk` in a property
called `textEmbedding`.

The embeddings will match the recommended configuration
for the OpenAI default embeddings model,
with a dimension of 1,536
and using the cosine similarity function.

In [30]:
# Create a vector index called "form_10k_chunks" on the `textEmbedding` property of nodes labeled `Chunk`. 
# neo4j_create_vector_index(kg, VECTOR_INDEX_NAME, 'Chunk', 'textEmbedding')
kg.query("""
         CREATE VECTOR INDEX `form_10k_chunks` IF NOT EXISTS
          FOR (c:Chunk) ON (c.textEmbedding) 
          OPTIONS { indexConfig: {
            `vector.dimensions`: 1536,
            `vector.similarity_function`: 'cosine'    
         }}
""")

[]
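
The index is configured to compare vectors with cosine similarity. As a reminder of what that measures, here is a plain-Python sketch on tiny vectors. It computes the raw cosine of the angle between two vectors; the score that Neo4j reports may be scaled differently, so treat this as an illustration of the idea and not of the database's exact output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, while unrelated (orthogonal) vectors score close to 0.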

### Script - check the indexes

You can check that the index was created successfully
using "SHOW INDEXES".

There's the vector index we just created, 
along with the uniqueness constraint from before.

In [31]:
kg.query('SHOW INDEXES')

[{'id': 10,
  'name': 'form_10k_chunks',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['textEmbedding'],
  'indexProvider': 'vector-1.0',
  'owningConstraint': None,
  'lastRead': None,
  'readCount': None},
 {'id': 7,
  'name': 'unique_chunk',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'RANGE',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['chunkId'],
  'indexProvider': 'range-1.0',
  'owningConstraint': 'unique_chunk',
  'lastRead': neo4j.time.DateTime(2024, 2, 25, 20, 54, 43, 971000000, tzinfo=<UTC>),
  'readCount': 1694}]

### Script - creating text embeddings

You can now use a single query to match all chunks that don't yet have an embedding,
then, for each chunk, call OpenAI through `genai.vector.encode` to get an embedding,
and finally store the embedding on the node with `db.create.setNodeVectorProperty`.

This may take a minute to run, depending on network traffic.


In [33]:
# Create vector embeddings for all the Chunk text, in batches.
# Use this for larger number of chunks so that the query
# can be re-run without losing all progress
kg.query("""
  MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
  CALL {
    WITH chunk
    WITH chunk, genai.vector.encode(chunk.text, "OpenAI", {token: $openAiApiKey}) AS vector
    CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)    
  } IN TRANSACTIONS OF 10 ROWS
  """, 
  params={"openAiApiKey":OPENAI_API_KEY} 
)

[]
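
The combination of `WHERE chunk.textEmbedding IS NULL` and batches of 10 rows is what makes the query safe to re-run. The following in-memory sketch mimics that pattern under stated assumptions: `fake_embed` is a stand-in that returns a dummy vector instead of calling OpenAI, and the chunk records are plain dictionaries.

```python
from typing import Callable, Dict, List


def embed_missing_in_batches(chunks: List[Dict],
                             embed_fn: Callable[[str], List[float]],
                             batch_size: int = 10) -> int:
    """Embed only chunks without an embedding; return the number of batches run."""
    pending = [c for c in chunks if c.get("textEmbedding") is None]
    batches = 0
    for start in range(0, len(pending), batch_size):
        for chunk in pending[start:start + batch_size]:
            chunk["textEmbedding"] = embed_fn(chunk["text"])
        batches += 1  # in Neo4j, each batch is committed as its own transaction
    return batches


def fake_embed(text: str) -> List[float]:
    """Stand-in embedding: NOT a real model, just a deterministic dummy vector."""
    return [float(len(text))]


chunks = [{"text": f"chunk {i}", "textEmbedding": None} for i in range(25)]
for chunk in chunks[:3]:
    chunk["textEmbedding"] = [0.0]  # pretend an earlier run already handled these

first_run = embed_missing_in_batches(chunks, fake_embed)
second_run = embed_missing_in_batches(chunks, fake_embed)
print(first_run, second_run)
```

The second call finds nothing left to do, which mirrors how re-running the Cypher query skips chunks that already have a `textEmbedding`.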

## Example queries - vector similarity search with Neo4j

### Script - a helper function for vector search using Neo4j

You can now create a helper function to perform vector search
using Neo4j. The function will accept a text question,
then submit that to Neo4j as a parameter in a Cypher query.

The query also accepts parameters for the OpenAI key,
the vector index name, and the number of results to return.


In [34]:
def neo4j_vector_search(question):
  """Search for similar nodes using the Neo4j vector index"""
  vector_search_query = """
    WITH genai.vector.encode($question, "OpenAI", 
        {token: $openAiApiKey}) AS question_embedding
    CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding) 
        YIELD node, score
    RETURN score, node.text AS text
  """
  similar = kg.query(vector_search_query, 
                     params={
                      'question': question, 
                      'openAiApiKey':OPENAI_API_KEY,
                      'index_name':VECTOR_INDEX_NAME, 
                      'top_k': 10})
  return similar

### Script - try Neo4j vector search helper

One of the forms you've turned into a knowledge graph is
from a company called "Netapp".

You can try the vector search helper function to ask about Netapp.

In [37]:
search_results = neo4j_vector_search(
    'In a single sentence, tell me about Netapp.'
)
search_results[0]

{'score': 0.9356340169906616,
 'text': '>Item 1.  \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.\n\n\nOur opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the complexities created by rapid data and cloud growth, multi-cloud management, and the adoption of next-generation technologies, such as AI, Kubernetes, and modern databases. Our modern approach to hybrid, multicloud infrastructure and data managemen
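
Each search result is a dictionary with a `score` and a `text` field, so a short formatter makes the full result list easier to scan. This is a sketch only: `format_search_results` is a hypothetical helper, the first sample row is abbreviated from the output above, and the second row is invented purely for illustration.

```python
import textwrap
from typing import Dict, List


def format_search_results(results: List[Dict], max_chars: int = 60) -> List[str]:
    """Return one line per result: rank, rounded score and a short preview."""
    lines = []
    for rank, row in enumerate(results, start=1):
        # shorten() collapses whitespace (including newlines) and truncates.
        preview = textwrap.shorten(row["text"], width=max_chars, placeholder="...")
        lines.append(f"{rank}. score={row['score']:.3f} {preview}")
    return lines


sample_results = [
    {"score": 0.9356340169906616,
     "text": ">Item 1.  \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the "
             "Company) is a global cloud-led, data-centric software company."},
    {"score": 0.91, "text": "Hypothetical second result, not real output."},
]
formatted = format_search_results(sample_results)
for line in formatted:
    print(line)
```

In the notebook you could pass `search_results` directly in place of `sample_results`.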

# RAG With Langchain

### OpenAI integration for Langchain (probably a slide)

Notice that we only performed vector search. So what we're getting
back is the raw chunk text.

If we want to create a chatbot that provides actual answers to
a question, we can build a RAG system using Langchain.

Let's take a look at how you'll do that. (note to editor: show a slide!)

The basic RAG flow goes through these steps: (possible diagram)

1. accept a question from the user
2. perform a database query to find relevant text that may provide an answer
3. package the original question plus the relevant text into a prompt
4. pass the entire prompt to an LLM to produce an answer
5. finally, return the LLM's answer to the user
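
The five steps above can be sketched without any framework by passing the retriever and the LLM in as plain functions. Everything here is a stand-in: `fake_retrieve` and `fake_generate` are hypothetical stubs that replace the database query and the LLM call, and the prompt wording is an assumption, not the prompt Langchain uses.

```python
from typing import Callable, List


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Step 3: package the question and the retrieved text into one prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer_question(question: str,
                    retrieve: Callable[[str], List[str]],
                    generate: Callable[[str], str]) -> str:
    """Steps 1-5: take a question, retrieve context, prompt the LLM, return the answer."""
    context_chunks = retrieve(question)              # step 2
    prompt = build_prompt(question, context_chunks)  # step 3
    return generate(prompt)                          # steps 4 and 5


def fake_retrieve(question: str) -> List[str]:
    """Stub retriever standing in for the vector search."""
    return ["NetApp is headquartered in San Jose, California."]


def fake_generate(prompt: str) -> str:
    """Stub LLM: answers only if the prompt contains the needed fact."""
    return "San Jose, California" if "San Jose" in prompt else "I don't know"


answer = answer_question("Where is Netapp headquartered?", fake_retrieve, fake_generate)
print(answer)
```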

Langchain is a great framework for creating a complete RAG workflow.

It has excellent integration with Neo4j. 

OK, let's get back to the notebook to try this out.

### Neo4j Vector Store

The easiest way to start using Neo4j with Langchain
is with the `Neo4jVector` interface.

This makes Neo4j look like a vector store.

Under the hood, it will use the Cypher language
for performing vector similarity searches.

The configuration specifies a few important things:
- use OpenAI for embeddings
- how to connect to the Neo4j database
- the name of the vector index to use
- the label of the nodes to search
- the property name of the text on those nodes
- and, the property name of the embeddings on those nodes

That vector store then gets converted into a retriever
and finally added to a question-and-answer chain.

In [38]:
# Create a langchain vector store from the existing Neo4j knowledge graph.
neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)

# Create a retriever from the vector store
retriever = neo4j_vector_store.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0), chain_type="stuff", retriever=retriever
)


### Script - pretty chain helper function

The `prettychain` helper function calls the
chain, then formats the answer to make it more readable
by pulling out just the `answer` field and
printing it with `textwrap` to limit each line to 80 characters.

In [39]:

# helper function to pretty print the chain's response
def prettychain(question: str) -> None:
    """Pretty print the chain's response to a question"""
    response = chain({"question": question},
        return_only_outputs=True,)
    print(textwrap.fill(response['answer'], 80))
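
The formatting step can be tried on its own, without calling the chain. The sketch below applies `textwrap.fill` to an answer string copied from the output further down in this notebook; the variable names are assumptions.

```python
import textwrap

# Answer text copied from the FedEx example output in this notebook.
long_answer = (
    "FedEx Corporation provides customers and businesses worldwide with a broad "
    "portfolio of transportation, e-commerce, and business services, offering "
    "integrated business solutions through operating companies competing "
    "collectively, operating collaboratively, and innovating digitally as one FedEx."
)

# Re-flow the single long string so that no line exceeds 80 characters.
wrapped = textwrap.fill(long_answer, 80)
print(wrapped)
```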

### Ask some questions

Finally, you can use the Langchain chain, which combines the retriever
and the LLM into a nice question and answer interface.

The chain's response contains both an answer and the sources it came from.
The `prettychain` helper prints just the answer.

In [41]:
prettychain("What is Netapp's primary business?")

NetApp's primary business is enterprise storage and data management, cloud
storage, and cloud operations.


In [42]:
prettychain("Where is Netapp headquartered?")

Netapp is headquartered in San Jose, California.


In [45]:
prettychain("Briefly tell me about Apple.")

There is no information about Apple in the provided content.


In [44]:
prettychain("In a single sentence, tell me about Fedex")

FedEx Corporation provides customers and businesses worldwide with a broad
portfolio of transportation, e-commerce, and business services, offering
integrated business solutions through operating companies competing
collectively, operating collaboratively, and innovating digitally as one FedEx.
