# Lesson 4: Constructing a Knowledge Graph from Text Documents


### Import packages and set up Neo4j

In [111]:
from dotenv import load_dotenv
import os

# Common data processing
import json
import textwrap

# Langchain
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI


# Warning control
import warnings
warnings.filterwarnings("ignore")

In [164]:
# Load from environment
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE') or 'neo4j'
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
# Note the code below is unique to this course environment, and not a
# standard part of Neo4j's integration with OpenAI. Remove if running
# in your own environment.
#OPENAI_ENDPOINT = os.getenv('OPENAI_BASE_URL') + '/embeddings'

# Global constants
VECTOR_INDEX_NAME = 'abstract_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'

In [138]:
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)

### Query the abstracts of the documents


In [239]:
papers = kg.query("""
    MATCH (p:Paper)
    WHERE p.abstract IS NOT NULL AND p.abstract <> ''
    RETURN p.paperId as paperId, p.abstract as abstract, p.title as title
""")

In [240]:
len(papers)

6867

In [241]:
papers[0]

{'paperId': '1e5cd344430d8de5bdb866ea5e8612467e69b689',
 'abstract': '(1800). III. On the origin and progress of the manufacture of pig-iron with pit-coal; and comparison of the value and effects of pit-coal, wood, and peat-char. The Philosophical Magazine: Vol. 7, No. 25, pp. 35-46.',
 'title': 'III. On the origin and progress of the manufacture of pig-iron with pit-coal; and comparison of the value and effects of pit-coal, wood, and peat-char'}

In [242]:
item1_text = papers[0]['title'] + ' ' + papers[0]['abstract']

In [243]:
item1_text[0:1500]

'III. On the origin and progress of the manufacture of pig-iron with pit-coal; and comparison of the value and effects of pit-coal, wood, and peat-char (1800). III. On the origin and progress of the manufacture of pig-iron with pit-coal; and comparison of the value and effects of pit-coal, wood, and peat-char. The Philosophical Magazine: Vol. 7, No. 25, pp. 35-46.'

### Split paper abstracts into chunks
- Set up text splitter using LangChain
- For now, split only the text of the first paper (title plus abstract)

In [143]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)

In [144]:
item1_text_chunks = text_splitter.split_text(item1_text)

In [145]:
type(item1_text_chunks)

list

In [146]:
len(item1_text_chunks)

1

In [147]:
item1_text_chunks[0]

'III. On the origin and progress of the manufacture of pig-iron with pit-coal; and comparison of the value and effects of pit-coal, wood, and peat-char (1800). III. On the origin and progress of the manufacture of pig-iron with pit-coal; and comparison of the value and effects of pit-coal, wood, and peat-char. The Philosophical Magazine: Vol. 7, No. 25, pp. 35-46.'
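The effect of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window. This is a simplified sketch, not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries to break on separators such as paragraphs and sentences); the function name `sliding_window_chunks` and the sample strings are assumptions made for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical inputs: one shorter than chunk_size, one longer.
short_chunks = sliding_window_chunks("x" * 366, chunk_size=2000, chunk_overlap=200)
long_chunks = sliding_window_chunks("y" * 4500, chunk_size=2000, chunk_overlap=200)

print(len(short_chunks))              # text shorter than chunk_size stays in one chunk
print([len(c) for c in long_chunks])  # longer text is cut into overlapping windows
```

Text shorter than `chunk_size` comes back as a single chunk, which matches the single chunk returned for the first paper above.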

- Set up a helper function to chunk the abstract of a paper
- You'll limit the number of chunks per abstract to 20 to speed things up

In [237]:
def split_paper_abstract(paper: dict):
    chunks_with_metadata = [] # use this to accumulate chunk records
    for item in ['abstract']: # pull these keys from the paper record
        item_text = paper[item] # grab the text of the item
        item_text_chunks = text_splitter.split_text(item_text) # split the text into chunks
        chunk_seq_id = 0
        for chunk in item_text_chunks[:20]: # only take the first 20 chunks
            paper_id = paper['paperId']
            # finally, construct a record with metadata and the chunk text
            chunks_with_metadata.append({
                'text': chunk,
                # metadata from looping...
                'key': item,
                'chunkSeqId': chunk_seq_id,
                # constructed metadata...
                'paper_id': f'{paper_id}', # taken from the paper's paperId
                'chunkId': f'{paper_id}-{item}-chunk{chunk_seq_id:04d}',
                # metadata from the paper record...
                'title': paper['title'],
                'source': f'{paper_id}'
            })
            chunk_seq_id += 1
        print(f'\tSplit into {chunk_seq_id} chunks')
    return chunks_with_metadata

In [207]:
first_file_chunks = split_paper_abstract(papers[0])

	Split into 1 chunks


In [208]:
first_file_chunks[0]

{'text': '(1800). III. On the origin and progress of the manufacture of pig-iron with pit-coal; and comparison of the value and effects of pit-coal, wood, and peat-char. The Philosophical Magazine: Vol. 7, No. 25, pp. 35-46.',
 'key': 'abstract',
 'chunkSeqId': 0,
 'paper_id': '1e5cd344430d8de5bdb866ea5e8612467e69b689',
 'chunkId': '1e5cd344430d8de5bdb866ea5e8612467e69b689-abstract-chunk0000',
 'title': 'III. On the origin and progress of the manufacture of pig-iron with pit-coal; and comparison of the value and effects of pit-coal, wood, and peat-char',
 'source': '1e5cd344430d8de5bdb866ea5e8612467e69b689'}
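The `chunkId` in the record above follows the pattern `<paperId>-<key>-chunk<NNNN>`. The sketch below builds and parses IDs of that shape; the helper names `make_chunk_id` and `parse_chunk_id` are hypothetical, and the parser assumes the key itself contains no hyphen.

```python
def make_chunk_id(paper_id: str, key: str, seq: int) -> str:
    """Build an ID in the same shape as the helper above: zero-padded to 4 digits."""
    return f"{paper_id}-{key}-chunk{seq:04d}"


def parse_chunk_id(chunk_id: str) -> tuple[str, str, int]:
    """Split an ID back into (paper_id, key, sequence number)."""
    prefix, seq_part = chunk_id.rsplit("-chunk", 1)
    paper_id, key = prefix.rsplit("-", 1)  # assumes the key has no hyphen
    return paper_id, key, int(seq_part)


paper_id = "1e5cd344430d8de5bdb866ea5e8612467e69b689"
ids = [make_chunk_id(paper_id, "abstract", n) for n in (10, 2, 0)]

print(sorted(ids)[0])          # zero padding keeps string order equal to numeric order
print(parse_chunk_id(ids[0]))
```

Zero padding is the useful design choice here: sorting the IDs as strings yields the chunks in reading order, at least up to 9999 chunks per abstract.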

### Create graph nodes using text chunks

In [215]:
merge_chunk_node_query = """
MERGE(mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
    ON CREATE SET
        mergedChunk.title = $chunkParam.title,
        mergedChunk.chunkSeqId = $chunkParam.chunkSeqId,
        mergedChunk.text = $chunkParam.text,
        mergedChunk.source = $chunkParam.source
RETURN mergedChunk
"""
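`MERGE ... ON CREATE SET` finds the node with the given `chunkId` or creates it, and only sets the properties when the node is newly created. The following in-memory analogy, using a plain dictionary as a stand-in for the graph, is a sketch of that behaviour only; `merge_chunk` and `store` are hypothetical names and no database is involved.

```python
CHUNK_PROPERTIES = ("chunkId", "title", "chunkSeqId", "text", "source")


def merge_chunk(store: dict, chunk: dict) -> dict:
    """Mimic MERGE ... ON CREATE SET: create once, never overwrite on a match."""
    chunk_id = chunk["chunkId"]
    if chunk_id not in store:  # the "ON CREATE" branch
        store[chunk_id] = {name: chunk[name] for name in CHUNK_PROPERTIES}
    return store[chunk_id]     # RETURN mergedChunk


store = {}
record = {"chunkId": "p1-abstract-chunk0000", "title": "T", "chunkSeqId": 0,
          "text": "original text", "source": "p1"}

merge_chunk(store, record)
merged_again = merge_chunk(store, {**record, "text": "edited text"})

print(len(store), merged_again["text"])
```

Running the merge twice leaves one entry with the original text, which is why re-running the loading cells does not duplicate chunks but also does not refresh changed text.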

- Reuse the `kg` connection to the graph instance that was set up above using LangChain

- Create a single chunk node for now

In [244]:
kg.query(merge_chunk_node_query,
         params={'chunkParam':first_file_chunks[0]})


- Create a uniqueness constraint to avoid duplicate chunks

In [245]:
kg.query("""
CREATE CONSTRAINT unique_chunk IF NOT EXISTS
    FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
""")


[]
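Creating a uniqueness constraint fails if the existing data already violates it, so it can help to check the chunk records for repeated IDs before loading. This is a minimal sketch under that assumption; `duplicate_chunk_ids` and the sample records are hypothetical.

```python
from collections import Counter


def duplicate_chunk_ids(records: list[dict]) -> list[str]:
    """Return every chunkId that appears more than once, in sorted order."""
    counts = Counter(record["chunkId"] for record in records)
    return sorted(chunk_id for chunk_id, n in counts.items() if n > 1)


# Hypothetical records; only the 'chunkId' key matters for this check.
sample_records = [
    {"chunkId": "a-abstract-chunk0000"},
    {"chunkId": "b-abstract-chunk0000"},
    {"chunkId": "a-abstract-chunk0000"},
]
duplicates = duplicate_chunk_ids(sample_records)
print(duplicates)
```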

In [218]:
kg.query("SHOW INDEXES")

[{'id': 11,
  'name': 'abstract_chunks',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['textEmbedding'],
  'indexProvider': 'vector-2.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 3, 31, 12, 33, 52, 774000000, tzinfo=<UTC>),
  'readCount': 2},
 {'id': 1,
  'name': 'index_343aff4e',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'NODE',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 3, 31, 12, 37, 16, 996000000, tzinfo=<UTC>),
  'readCount': 39406},
 {'id': 2,
  'name': 'index_f7700477',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'RELATIONSHIP',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': None,
  

- Loop through and create nodes for all chunks
- Expect roughly one node per paper: each abstract shown in the output below fits in a single chunk, and the helper above caps every abstract at 20 chunks

In [247]:
node_count = 0
for paper in papers:
    print(f'Processing paper {paper["paperId"]}')
    chunks = split_paper_abstract(paper)
    for chunk in chunks:
        print(f'\tProcessing paper chunk {chunk["chunkId"]}')
        kg.query(merge_chunk_node_query,
                params={
                    'chunkParam': chunk
                })
        node_count += 1
print(f"Created {node_count} nodes")

Processing paper 1e5cd344430d8de5bdb866ea5e8612467e69b689
	Split into 1 chunks
	Processing paper chunk 1e5cd344430d8de5bdb866ea5e8612467e69b689-abstract-chunk0000
Processing paper c85acc8c6ac5eb7466e26517ae33aeac6fec6689
	Split into 1 chunks
	Processing paper chunk c85acc8c6ac5eb7466e26517ae33aeac6fec6689-abstract-chunk0000
Processing paper bd5ae6e3dd85b54442c70701f6b19a35891423f0
	Split into 1 chunks
	Processing paper chunk bd5ae6e3dd85b54442c70701f6b19a35891423f0-abstract-chunk0000
Processing paper eaeeb3c3d6726464ead8561082a7d72d8e7a0d64
	Split into 1 chunks
	Processing paper chunk eaeeb3c3d6726464ead8561082a7d72d8e7a0d64-abstract-chunk0000
Processing paper 1432bfc59b9de4fd02554be99f181e2db4e10f37
	Split into 1 chunks
	Processing paper chunk 1432bfc59b9de4fd02554be99f181e2db4e10f37-abstract-chunk0000
Processing paper 49ac1b055ed8da31193644c868b3173ed3e2a24f
	Split into 1 chunks
	Processing paper chunk 49ac1b055ed8da31193644c868b3173ed3e2a24f-abstract-chunk0000
Processing paper 8658b
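The loop above issues one query per chunk, which is slow for thousands of papers. One possible alternative is to send chunks in batches and let Cypher's `UNWIND` iterate over them. The sketch below is an assumption, not part of the original notebook: `BATCH_MERGE_QUERY`, the `$chunks` parameter name, the `batches` helper, and the batch size are all illustrative.

```python
from itertools import islice

# Same MERGE as above, applied to every record of a list parameter.
BATCH_MERGE_QUERY = """
UNWIND $chunks AS chunkParam
MERGE (mergedChunk:Chunk {chunkId: chunkParam.chunkId})
    ON CREATE SET
        mergedChunk.title = chunkParam.title,
        mergedChunk.chunkSeqId = chunkParam.chunkSeqId,
        mergedChunk.text = chunkParam.text,
        mergedChunk.source = chunkParam.source
RETURN count(mergedChunk) AS merged
"""


def batches(records, size: int):
    """Yield consecutive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in data; in the notebook this would be the chunk records of all papers.
demo_batches = list(batches(range(5), 2))
print(demo_batches)

# Possible usage against the graph (not executed here):
# for batch in batches(all_chunks, 500):
#     kg.query(BATCH_MERGE_QUERY, params={'chunks': batch})
```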

In [248]:
kg.query("""
         MATCH (n)
         RETURN count(n) as nodeCount
         """)

[{'nodeCount': 59083}]

### Create a vector index

In [221]:
# IF NOT EXISTS keeps this cell safe to re-run: without it, Neo4j raises
# Neo.ClientError.Schema.EquivalentSchemaRuleAlreadyExists when the
# 'abstract_chunks' index is already present.
kg.query("""
         CREATE VECTOR INDEX `abstract_chunks` IF NOT EXISTS
          FOR (c:Chunk) ON (c.textEmbedding)
          OPTIONS { indexConfig: {
            `vector.dimensions`: 1536,
            `vector.similarity_function`: 'cosine'
         }}
""")

In [222]:
kg.query("SHOW INDEXES")

[{'id': 11,
  'name': 'abstract_chunks',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['textEmbedding'],
  'indexProvider': 'vector-2.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 3, 31, 12, 33, 52, 774000000, tzinfo=<UTC>),
  'readCount': 2},
 {'id': 1,
  'name': 'index_343aff4e',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'NODE',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 3, 31, 12, 37, 16, 996000000, tzinfo=<UTC>),
  'readCount': 39406},
 {'id': 2,
  'name': 'index_f7700477',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'RELATIONSHIP',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': None,
  

### Calculate embedding vectors for chunks and populate index
- This query calculates the embedding vector and stores it as a property called `textEmbedding` on each `Chunk` node.

In [249]:
kg.query("""
    MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
    WITH chunk, genai.vector.encode(
      chunk.text,
      "OpenAI",
      {
        token: $openAiApiKey
      }) AS vector
    CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)
    """,
    params={"openAiApiKey":OPENAI_API_KEY} )

KeyboardInterrupt: 
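The query above embeds every chunk in a single call, which can run for a long time on thousands of chunks. A possible workaround is to embed a limited number of chunks per call and repeat until none are left. This is a sketch under assumptions: `embed_in_batches`, the `$batchSize` parameter, the `max_batches` guard, and the `run_query` callable (a thin wrapper around `kg.query`) are hypothetical and were not part of the original notebook.

```python
BATCHED_EMBEDDING_QUERY = """
    MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
    WITH chunk LIMIT $batchSize
    WITH chunk, genai.vector.encode(
      chunk.text,
      "OpenAI",
      {
        token: $openAiApiKey
      }) AS vector
    CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)
    RETURN count(chunk) AS updated
"""


def embed_in_batches(run_query, api_key: str, batch_size: int = 100,
                     max_batches: int = 1000) -> int:
    """Call the batched query until no chunk is left; return how many were updated."""
    total = 0
    for _ in range(max_batches):  # guard against looping forever
        rows = run_query(BATCHED_EMBEDDING_QUERY,
                         {"openAiApiKey": api_key, "batchSize": batch_size})
        updated = rows[0]["updated"] if rows else 0
        if updated == 0:
            break
        total += updated
    return total


# Stand-in for the database so the sketch runs without Neo4j: 250 chunks pending.
pending = [250]

def fake_run_query(query, params):
    done = min(params["batchSize"], pending[0])
    pending[0] -= done
    return [{"updated": done}]

total = embed_in_batches(fake_run_query, api_key="not-a-real-key", batch_size=100)
print(total)

# Possible usage in the notebook (not executed here):
# embed_in_batches(lambda q, p: kg.query(q, params=p), OPENAI_API_KEY)
```

Because each call only touches chunks whose `textEmbedding` is still missing, an interrupted run can be resumed without redoing finished chunks.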

In [224]:
kg.refresh_schema()
print(kg.schema)

Node properties are the following:
Paper {openAccessPdfStatus: STRING, fieldsOfStudy: LIST, publicationDate: STRING, citationCount: INTEGER, influentialCitationCount: INTEGER, isOpenAccess: BOOLEAN, openAccessPdfUrl: STRING, url: STRING, venue: STRING, year: INTEGER, referenceCount: INTEGER, externalIdDoi: STRING, corpusId: INTEGER, title: STRING, externalIdMag: STRING, externalIdCorpus: INTEGER, paperId: STRING, publicationTypes: LIST, abstract: STRING},Author {name: STRING, authorId: STRING},Chunk {textEmbedding: LIST, source: STRING, title: STRING, chunkId: STRING, chunkSeqId: INTEGER, text: STRING}
Relationship properties are the following:

The relationships are the following:



### Use similarity search to find relevant chunks

- Set up a helper function to perform similarity search using the vector index

In [225]:
def neo4j_vector_search(question):
  """Search for similar nodes using the Neo4j vector index"""
  vector_search_query = """
    WITH genai.vector.encode(
      $question,
      "OpenAI",
      {
        token: $openAiApiKey
      }) AS question_embedding
    CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding) yield node, score
    RETURN score, node.text AS text
  """
  similar = kg.query(vector_search_query,
                     params={
                      'question': question,
                      'openAiApiKey':OPENAI_API_KEY,
                      'index_name':VECTOR_INDEX_NAME,
                      'top_k': 10})
  return similar
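Conceptually, the vector query embeds the question and returns the `top_k` chunks whose embeddings are most similar to it. The brute-force sketch below shows that idea with cosine similarity on toy 2-dimensional vectors. It is an illustration only: the index itself uses an approximate nearest-neighbour structure rather than scanning every node, its reported score may be scaled differently from the raw cosine here, and `cosine`, `top_k_chunks`, and the sample data are hypothetical.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def top_k_chunks(question_vec: list[float], chunks: list[dict], k: int) -> list[dict]:
    """Score every chunk against the question and keep the k best."""
    scored = ({"score": cosine(question_vec, c["embedding"]), "text": c["text"]}
              for c in chunks)
    return heapq.nlargest(k, scored, key=lambda row: row["score"])


# Toy embeddings; real ones in this notebook have 1536 dimensions.
toy_chunks = [
    {"text": "A", "embedding": [1.0, 0.0]},
    {"text": "B", "embedding": [0.0, 1.0]},
    {"text": "C", "embedding": [1.0, 1.0]},
]
results = top_k_chunks([1.0, 0.0], toy_chunks, k=2)
print([(row["text"], round(row["score"], 3)) for row in results])
```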

- Ask a question!

In [255]:
search_results = neo4j_vector_search(
    'sensitive magnetometers are Magnetocardiography .'
)

In [256]:
search_results[0]

{'score': 0.9355030059814453,
 'text': 'Presently, among the most demanding applications for highly sensitive magnetometers are Magnetocardiography (MCG) and Magnetoencephalography (MEG), where sensitivities of around 1pT.Hz-1/2 and 1fT.Hz-1/2 are required. Cryogenic Superconducting Quantum Interference Devices (SQUIDs) are currently used as the magnetometers. However, there has been some recent work on replacing these devices with magnetometers based on atomic spectroscopy and operating at room temperature. There are demonstrations of MCG and MEG signals measured using atomic spectroscopy These atomic magnetometers are based on chip-scale microfabricated components. In this paper we discuss the prospects of using photonic crystal optical fibres or hollow core fibres (HCFs) loaded with Rb vapour in atomic magnetometer systems. We also consider new components for magnetometers based on mode-locked semiconductor lasers for measuring magnetic field via coherent population trapping (CPT) i

### Set up a LangChain RAG workflow to chat with the paper abstracts

In [257]:
neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)


In [258]:
retriever = neo4j_vector_store.as_retriever()

- Set up a RetrievalQAWithSourcesChain to carry out question answering
- You can check out the LangChain documentation for this chain [here](https://api.python.langchain.com/en/latest/chains/langchain.chains.qa_with_sources.retrieval.RetrievalQAWithSourcesChain.html)

In [259]:
chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever
)
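The `"stuff"` chain type places all retrieved documents into a single prompt together with the question. The sketch below shows the general idea only; the wording of the prompt, the function `build_stuffed_prompt`, and the sample documents are assumptions and do not reproduce LangChain's actual prompt template.

```python
def build_stuffed_prompt(question: str, docs: list[dict]) -> str:
    """Concatenate every retrieved document and its source into one prompt."""
    extracts = "\n\n".join(
        f"Content: {doc['text']}\nSource: {doc['source']}" for doc in docs
    )
    return (
        "Answer the question using only the extracts below and list the "
        "sources you relied on.\n\n"
        f"{extracts}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks; 'source' mirrors the property stored on Chunk nodes.
retrieved = [
    {"text": "SQUIDs are currently used as the magnetometers.", "source": "paper-1"},
    {"text": "Atomic magnetometers operate at room temperature.", "source": "paper-2"},
]
prompt = build_stuffed_prompt("Which magnetometers are used for MCG?", retrieved)
print(prompt)
```

Because everything is placed in one prompt, this approach only works while the retrieved chunks fit in the model's context window, which is one reason to keep chunks small.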

In [260]:
def prettychain(question: str) -> None:
    """Pretty print the chain's response to a question"""
    response = chain({"question": question},
        return_only_outputs=True,)
    print(textwrap.fill(response['answer'], 60))

- Ask a question!

In [261]:
question = "Explain me about magnetometers are Magnetocardiography "

In [262]:
prettychain(question)

Magnetometers are used in Magnetocardiography (MCG) and
Magnetoencephalography (MEG) applications, requiring high
sensitivities. Cryogenic Superconducting Quantum
Interference Devices (SQUIDs) are currently used as
magnetometers, but there is ongoing research on replacing
them with atomic spectroscopy-based magnetometers operating
at room temperature. These atomic magnetometers are based on
chip-scale microfabricated components. There are also
demonstrations of MCG and MEG signals measured using atomic
spectroscopy. Additionally, there are prospects of using
photonic crystal optical fibers or hollow core fibers (HCFs)
loaded with Rb vapor in atomic magnetometer systems, as well
as new components based on mode-locked semiconductor lasers
for measuring magnetic fields via coherent population
trapping (CPT) in Rb loaded HCFs.


In [None]:
prettychain("Where is Netapp headquartered?")

In [None]:
prettychain("""
    Tell me about Netapp.
    Limit your answer to a single sentence.
""")

In [None]:
prettychain("""
    Tell me about Apple.
    Limit your answer to a single sentence.
""")

In [None]:
prettychain("""
    Tell me about Apple.
    Limit your answer to a single sentence.
    If you are unsure about the answer, say you don't know.
""")

### Ask your own question!
- Add your own question to the call to prettychain below to find out more about the papers in the graph

In [None]:
prettychain("""
    ADD YOUR OWN QUESTION HERE
""")