# Lesson 4: Constructing a Knowledge Graph from Text Documents

### Import packages and set up Neo4j

In [1]:
%pip install langchain-openai

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install langchain

Note: you may need to restart the kernel to use updated packages.


In [3]:
from dotenv import load_dotenv
import os

# Common data processing
import json
import textwrap

# LangChain
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI

In [4]:
# Warnings control
import warnings

warnings.filterwarnings("ignore")

Load from environment

In [5]:
load_dotenv(".env", override=True)

NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE") or "neo4j"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

Global constants

In [6]:
VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'

### Take a look at a Form 10-K json file

- Publicly traded companies are required to fill a form 10-K each year with the Securities and Exchange Commision (SEC)
- You can search these filings using the SEC's [EDGAR database](https://www.sec.gov/edgar/search/)
- For the next few lessons, you'll work with a single 10-K form for a company called [NetApp](https://www.netapp.com/)

In [7]:
first_file_name = "./data/form10k/0000950170-23-027948.json"

In [8]:
first_file_as_object = json.load(open(first_file_name))

In [9]:
type(first_file_as_object)

dict

Print the keys and value types

In [10]:
for k, v in first_file_as_object.items():
    print(k, type(v))

item1 <class 'str'>
item1a <class 'str'>
item7 <class 'str'>
item7a <class 'str'>
cik <class 'str'>
cusip6 <class 'str'>
cusip <class 'list'>
names <class 'list'>
source <class 'str'>


In [11]:
item1_text = first_file_as_object["item1"]

In [12]:
len(item1_text)

397359

Take a peek into this text

In [13]:
item1_text[:1500]

'>Item 1.  \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.\n\n\nOur opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the complexities created by rapid data and cloud growth, multi-cloud management, and the adoption of next-generation technologies, such as AI, Kubernetes, and modern databases. Our modern approach to hybrid, multicloud infrastructure and data management, which we term ‘evolved cloud’, provi

### Split Form 10-K sections into chunks

- Set up text splitter using LangChain
- For now, split only the text from the "item 1" section 

In [14]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False
)

In [15]:
item1_text_chunks = text_splitter.split_text(item1_text)

In [16]:
type(item1_text_chunks)

list

In [17]:
len(item1_text_chunks)

254

In [18]:
item1_text_chunks[0]

'>Item 1.  \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.\n\n\nOur opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the complexities created by rapid data and cloud growth, multi-cloud management, and the adoption of next-generation technologies, such as AI, Kubernetes, and modern databases. Our modern approach to hybrid, multicloud infrastructure and data management, which we term ‘evolved cloud’, provi

- Set up helper function to chunk all sections of the Form 10-K
- You'll limit the number of chunks in each section to 20 to speed things up

In [19]:
def split_form10k_data_from_file(file):
    chunks_with_metadata = []  # use this to accumulate chunk records
    # Load the json file
    file_as_object = json.load(open(file))
    # Extract form id from file name
    form_id = file[file.rindex("/")+1: file.rindex(".")]

    for item in ["item1", "item1a", "item7", "item7a"]:  # pull these keys from the json
        print(f"Processing {item} from {file}")
        # Grab the text of the item
        item_text = file_as_object[item]
        # Split the text into chunks
        item_text_chunks = text_splitter.split_text(item_text)
        # Iterate over the first 20 chunks
        chunk_seq_id = 0
        for chunk in item1_text_chunks[:20]:
            # Construct a record with metadata and chunk text
            chunks_with_metadata.append({
                "text": chunk,
                # metadata from looping
                "f10kItem": item,
                "chunkSeqId": chunk_seq_id,
                # constructed metadata
                "formId": f"{form_id}",  # pulled from the filename
                "chunkId": f"{form_id}-{item}-chunk{chunk_seq_id:04d}",
                # metadata from file
                "names": file_as_object["names"],
                "cik": file_as_object["cik"],
                "cusip6": file_as_object["cusip6"],
                "source": file_as_object["source"]
            })
            chunk_seq_id += 1
        
        print(f"Split into {chunk_seq_id} chunks")
    
    return chunks_with_metadata

In [20]:
first_file_chunks = split_form10k_data_from_file(file=first_file_name)

Processing item1 from ./data/form10k/0000950170-23-027948.json
Split into 20 chunks
Processing item1a from ./data/form10k/0000950170-23-027948.json
Split into 20 chunks
Processing item7 from ./data/form10k/0000950170-23-027948.json
Split into 20 chunks
Processing item7a from ./data/form10k/0000950170-23-027948.json
Split into 20 chunks


In [21]:
first_file_chunks[0]

{'text': '>Item 1.  \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.\n\n\nOur opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the complexities created by rapid data and cloud growth, multi-cloud management, and the adoption of next-generation technologies, such as AI, Kubernetes, and modern databases. Our modern approach to hybrid, multicloud infrastructure and data management, which we term ‘evolved clou

In [22]:
first_file_chunks[1]

{'text': "•\nFlexibility and consistency: NetApp makes moving data and applications between environments seamless through a common storage foundation across on-premises and multicloud environments.\n\n\n•\nCyber resilience: NetApp unifies monitoring, data protection, security, governance, and compliance for total cyber resilience - with consistency and automation across environments. \n\n\n•\nContinuous operations: NetApp uses AI-driven automation for continuous optimization to service applications and store stateless and stateful applications at the lowest possible costs.\n\n\n•\nSustainability: NetApp has industry-leading tools to audit consumption, locate waste, and set guardrails to stop overprovisioning.\n\n\nProduct, Solutions and Services Portfolio\n \n\n\nNetApp's portfolio of cloud services and storage infrastructure is powered by intelligent data management software. Our operations are organized into two segments: Hybrid Cloud and Public Cloud.\n\n\n \n\n\nHybrid Cloud\n\n\nH

### Create graph nodes using text chunks 

In [23]:
merge_chunk_node_query = """
    MERGE(mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
    ON CREATE SET
        mergedChunk.names = $chunkParam.names,
        mergedChunk.formId = $chunkParam.formId,
        mergedChunk.cik = $chunkParam.cik,
        mergedChunk.cusip6 = $chunkParam.cusip6,
        mergedChunk.source = $chunkParam.source,
        mergedChunk.f10kItem = $chunkParam.f10kItem,
        mergedChunk.chunkSeqId = $chunkParam.chunkSeqId,
        mergedChunk.text = $chunkParam.text
    RETURN mergedChunk
"""

- Set up connection to graph instance using LangChain

In [24]:
kg = Neo4jGraph(url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE)

- Create a single chunk node for now

In [25]:
kg.query(query=merge_chunk_node_query, params={"chunkParam": first_file_chunks[0]})

[{'mergedChunk': {'formId': '0000950170-23-027948',
   'f10kItem': 'item1',
   'names': ['Netapp Inc', 'NETAPP INC'],
   'cik': '1002047',
   'textEmbedding': [-0.017124786972999573,
    -0.01839623413980007,
    0.016224179416894913,
    -0.02597193233668804,
    -0.007098906673491001,
    0.013178007677197456,
    -0.0042050424963235855,
    -0.012926367111504078,
    -0.003082594135776162,
    -0.03697788715362549,
    0.015800364315509796,
    0.013853463344275951,
    0.013773998245596886,
    -0.013423025608062744,
    0.002726655686274171,
    0.004479860421270132,
    0.015032199211418629,
    -0.02926974557340145,
    -0.01500571146607399,
    -0.0215351153165102,
    -0.031335845589637756,
    0.010575516149401665,
    0.00021335625206120312,
    -0.006423451006412506,
    -0.015508991666138172,
    -0.010827156715095043,
    0.004089155700057745,
    -0.015442770905792713,
    0.019469015300273895,
    -0.004453371744602919,
    0.008635236881673336,
    -0.02001203037798404

- Create a uniqueness constraint to avoid duplicate chunks

In [26]:
query = """
    CREATE CONSTRAINT unique_chunk IF NOT EXISTS
    FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
"""

kg.query(query=query)

[]

In [27]:
kg.query(query="SHOW INDEXES")

[{'id': 4,
  'name': 'form_10k_chunks',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['textEmbedding'],
  'indexProvider': 'vector-2.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 5, 16, 19, 20, 36, 800000000, tzinfo=<UTC>),
  'readCount': 1},
 {'id': 0,
  'name': 'index_343aff4e',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'NODE',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 5, 16, 19, 20, 27, 601000000, tzinfo=<UTC>),
  'readCount': 10},
 {'id': 1,
  'name': 'index_f7700477',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'RELATIONSHIP',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': None,
  'rea

- Loop through and create nodes for all chunks
- Should create 23 nodes because you set a limit of 20 chunks in the text splitting function above

In [28]:
node_count = 0

for chunk in first_file_chunks:
    print(f"Creating `:Chunk` node for chunk ID {chunk['chunkId']}")
    kg.query(query=merge_chunk_node_query,
             params={"chunkParam": chunk})
    node_count += 1

print(f"Created {node_count} nodes")

Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0000
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0001
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0002
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0003
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0004
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0005
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0006
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0007
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0008
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0009
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0010
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0011
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0012
Creating `:Chunk` node for chunk ID 0000950170-23-0

In [29]:
query = """
    MATCH (n)
    RETURN COUNT(n) as nodeCount
"""

kg.query(query=query)

[{'nodeCount': 80}]

### Create a vector index

In [30]:
query = """
    CREATE VECTOR INDEX `form_10k_chunks` IF NOT EXISTS
    FOR (c:Chunk) ON (c.textEmbedding)
    OPTIONS { indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
    }}
"""

kg.query(query=query)

[]

In [31]:
kg.query(query="SHOW INDEXES")

[{'id': 4,
  'name': 'form_10k_chunks',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['textEmbedding'],
  'indexProvider': 'vector-2.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 5, 16, 19, 20, 36, 800000000, tzinfo=<UTC>),
  'readCount': 1},
 {'id': 0,
  'name': 'index_343aff4e',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'NODE',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 5, 18, 10, 36, 45, 731000000, tzinfo=<UTC>),
  'readCount': 13},
 {'id': 1,
  'name': 'index_f7700477',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'RELATIONSHIP',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': None,
  'rea

### Calculate embedding vectors for chunks and populate index
- This query calculates the embedding vector and stores it as a property called `textEmbedding` on each `Chunk` node.

In [32]:
query = """
    MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
    WITH chunk, genai.vector.encode(
        chunk.text,
        "OpenAI",
        {
            token: $openAiApiKey
        }) AS vector
    CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)
"""

kg.query(query=query, params={"openAiApiKey": OPENAI_API_KEY})

[]

In [33]:
kg.refresh_schema()

print(kg.schema)

Node properties:
Chunk {chunkId: STRING, names: LIST, formId: STRING, cik: STRING, cusip6: STRING, source: STRING, f10kItem: STRING, chunkSeqId: INTEGER, text: STRING, textEmbedding: LIST}
Relationship properties:

The relationships:



### Use similarity search to find relevant chunks

- Setup a help function to perform similarity search using the vector index

In [34]:
def neo4j_vector_search(question):
    """Search for similar nodes using the Neo4j vector index"""
    vector_search_query = """
        WITH genai.vector.encode(
            $question,
            "OpenAI",
            {
                token: $openAiAPIKey
            }) AS question_embedding
        CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding) yield node, score
        RETURN score, node.text AS text
    """

    similar = kg.query(query=vector_search_query,
                        params={
                            "question": question,
                            "openAiAPIKey": OPENAI_API_KEY,
                            "index_name": VECTOR_INDEX_NAME,
                            "top_k": 10})
    
    return similar

- Ask a question!

In [35]:
search_results = neo4j_vector_search("In a single sentence, tell me about Netapp.")

In [36]:
search_results[0]

{'score': 0.93589186668396,
 'text': '>Item 1.  \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.\n\n\nOur opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the complexities created by rapid data and cloud growth, multi-cloud management, and the adoption of next-generation technologies, such as AI, Kubernetes, and modern databases. Our modern approach to hybrid, multicloud infrastructure and data management,

### Set up a LangChain RAG workflow to chat with the form

In [37]:
neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY
)

In [38]:
retriever = neo4j_vector_store.as_retriever()

- Set up a RetrievalQAWithSourcesChain to carry out question answering
- You can check out the LangChain documentation for this chain [here](https://api.python.langchain.com/en/latest/chains/langchain.chains.qa_with_sources.retrieval.RetrievalQAWithSourcesChain.html)

In [39]:
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever
)

In [40]:
def prettychain(question: str) -> str:
    """Pretty print the chain's response to a question"""
    response = chain({"question": question}, return_only_outputs=True)
    print(textwrap.fill(response["answer"], 60))

- Ask a question!

In [42]:
question = "What is Netapp's primary business?"

In [43]:
prettychain(question=question)

NetApp's primary business is enterprise storage and data
management, cloud storage, and cloud operations markets.


In [44]:
prettychain("Where is Netapp headquartered?")

NetApp is headquartered in San Jose, California.


In [45]:
prettychain("Tell me about Netapp")

NetApp provides intelligent data management software,
including NetApp ONTAP, NetApp Snapshot, NetApp SnapCenter
Backup Management, NetApp SnapMirror Data Replication, and
NetApp SnapLock Data Compliance. These solutions offer
various data management and protection features, automatic
ransomware protection, data transport capabilities, and
storage efficiency. NetApp software helps customers achieve
business continuity goals and ensures data integrity and
safety in data centers and the cloud.


Let's pass instruction to LLM to limit answer to a single sentence.

In [46]:
prettychain("""
    Tell me about Netapp.
    Limit your answer to a single sentence.
""")

NetApp provides intelligent data management software
solutions for data centers and the cloud.


Let's ask about another company

In [47]:
prettychain("""
    Tell me about Apple.
    Limit your answer to a single sentence.
""")

NetApp, Inc. is a global cloud-led, data-centric software
company headquartered in San Jose, California.


This is a clear case of hallucination. Let's instruct LLM to handle this.

In [48]:
prettychain("""
    Tell me about Apple.
    Limit your answer to a single sentence.
    If you are unsure about the answer, say you don't know.
""")

I don't know.


### Ask your own question!
- Add your own question to the call to prettychain below to find out more about NetApp
- Here is NetApp's website if you want some inspiration: https://www.netapp.com/

In [49]:
prettychain("""
    Tell me the benefits of becoming Netapp partner.
""")

Becoming a NetApp partner allows for maximizing the business
value of IT and cloud investments, meeting evolving customer
needs, and ensuring growth and success.
