# Lesson 4: Constructing a Knowledge Graph from Text Documents


In [1]:
!git clone https://github.com/luiigirusso/CyberSA-RAG.git

Cloning into 'CyberSA-RAG'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 13 (delta 2), reused 7 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (13/13), 64.99 KiB | 3.61 MiB/s, done.
Resolving deltas: 100% (2/2), done.


In [None]:
!pip install python-dotenv
!pip install langchain-community
!pip install langchain-openai
!pip install neo4j



### Import packages and set up Neo4j

In [None]:
from dotenv import load_dotenv
import os

# Common data processing
import json
import textwrap

# Langchain
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI


# Warning control
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load from environment
load_dotenv('/content/.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE') or 'neo4j'
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
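# Fall back to the public OpenAI base URL so a missing variable does not raise a TypeError
OPENAI_ENDPOINT = (os.getenv('OPENAI_BASE_URL') or 'https://api.openai.com/v1') + '/embeddings'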

# Global constants
VECTOR_INDEX_NAME = 'malwaredb_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'
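
A missing entry in the `.env` file only surfaces later as a confusing connection or authentication error, so it can help to fail early. The following is a minimal sketch of such a check; `require_env` is a hypothetical helper (not part of the notebook), and the URI and base URL in the example are placeholder values.

```python
import os

def require_env(names, environ=None):
    """Return the requested variables as a dict, raising if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}

# Example with an explicit mapping (placeholder values) instead of the real environment
settings = require_env(
    ["NEO4J_URI", "OPENAI_BASE_URL"],
    environ={
        "NEO4J_URI": "bolt://localhost:7687",
        "OPENAI_BASE_URL": "https://api.openai.com/v1",
    },
)
# Strip a trailing slash before appending the embeddings path
embeddings_endpoint = settings["OPENAI_BASE_URL"].rstrip("/") + "/embeddings"
print(embeddings_endpoint)  # https://api.openai.com/v1/embeddings
```

In the notebook you would call `require_env([...])` without the `environ` argument, right after `load_dotenv`.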

### Text file

In [None]:
file_name="/content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt"
# Read the entire content as a string
with open(file_name, 'r', encoding='utf-8') as file:
    text = file.read()

### Split the report text into chunks
- Set up text splitter using LangChain
- For now, split the full report text loaded above

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)
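
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of a fixed-width sliding window. This is an illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks will not match this output. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the role the 200-character overlap plays for the 2000-character chunks above: text near a boundary appears in both neighbors.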

In [None]:
item1_text_chunks = text_splitter.split_text(text)

In [None]:
type(item1_text_chunks)

list

In [None]:
len(item1_text_chunks)

83

In [None]:
item1_text_chunks[0]

'Operation\nArachnophobia\nCaught in the Spider’s Web\n\n\n\n\nRich Barger | Cyber Squared Inc.\nMike Oppenheim | FireEye Labs\nChris Phillips | FireEye Labs\n\x0cContents\nTeam Introduction....................................................................................................................................................... 1\n\nKey Findings.................................................................................................................................................................. 1\n\nSummary....................................................................................................................................................................... 1\n\nBackstory......................................................................................................................................................................2\n\nVPSNOC/Digital Linx/Tranchulas...................................................................................

- Set up a helper function that chunks the report file and attaches metadata to each chunk
- You'll limit the number of chunks to 20 to speed things up

In [None]:
def split_form10k_data_from_file(file_name):
    chunks_with_metadata = [] # use this to accumulate chunk records
    with open(file_name, 'r', encoding='utf-8') as file:
      text = file.read()
      text_chunks = text_splitter.split_text(text) # split the text into chunks
      chunk_seq_id = 0
      for chunk in text_chunks[:20]: # only take the first 20 chunks
          form_id=file_name
          # finally, construct a record with metadata and the chunk text
          chunks_with_metadata.append({
              'text': chunk,
              # metadata from looping...
              'chunkSeqId': chunk_seq_id,
              # constructed metadata...
              'formId': f'{form_id}', # pulled from the filename
              'chunkId': f'{form_id}-chunk{chunk_seq_id:04d}',
              'source': file_name,
          })
          chunk_seq_id += 1
      print(f'\tSplit into {chunk_seq_id} chunks')
    return chunks_with_metadata

In [None]:
first_file_chunks = split_form10k_data_from_file(file_name)

	Split into 20 chunks
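
The helper above uses the full file path as `formId`, so every `chunkId` carries the `/content/...` prefix. A possible variant, sketched below, derives a shorter ID from the base name of the file. `make_chunk_id` is a hypothetical function and not what the notebook does; the zero-padded `:04d` suffix is the same as in the helper.

```python
import os

def make_chunk_id(file_name, chunk_seq_id):
    """Build an ID like '<base file name>-chunk0007' from a path and a sequence number."""
    form_id = os.path.basename(file_name)  # drop the directory part of the path
    return f"{form_id}-chunk{chunk_seq_id:04d}"

demo_path = "/content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt"
print(make_chunk_id(demo_path, 7))
# ThreatConnect_Operation_Arachnophobia_Report.pdf.txt-chunk0007

# Zero padding keeps string order and numeric order aligned (up to 9999 chunks)
demo_ids = [make_chunk_id(demo_path, i) for i in (10, 2, 1)]
print(sorted(demo_ids) == [make_chunk_id(demo_path, i) for i in (1, 2, 10)])  # True
```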


In [None]:
first_file_chunks[0]

{'text': 'Operation\nArachnophobia\nCaught in the Spider’s Web\n\n\n\n\nRich Barger | Cyber Squared Inc.\nMike Oppenheim | FireEye Labs\nChris Phillips | FireEye Labs\n\x0cContents\nTeam Introduction....................................................................................................................................................... 1\n\nKey Findings.................................................................................................................................................................. 1\n\nSummary....................................................................................................................................................................... 1\n\nBackstory......................................................................................................................................................................2\n\nVPSNOC/Digital Linx/Tranchulas..........................................................................

### Create graph nodes using text chunks

In [None]:
merge_chunk_node_query = """
MERGE(mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
    ON CREATE SET
        mergedChunk.formId = $chunkParam.formId,
        mergedChunk.chunkSeqId = $chunkParam.chunkSeqId,
        mergedChunk.source = $chunkParam.source,
        mergedChunk.text = $chunkParam.text
RETURN mergedChunk
"""
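
The `MERGE ... ON CREATE SET` pattern makes the query safe to re-run: a node is created only if no `Chunk` with that `chunkId` exists, and the properties are set only at creation time. The sketch below mimics that behavior with a plain dictionary. It is an analogy for the semantics, not how Neo4j works internally, and `merge_chunk` is a hypothetical name.

```python
def merge_chunk(store, chunk):
    """Mimic MERGE ... ON CREATE SET against a dict keyed by chunkId."""
    key = chunk["chunkId"]
    if key not in store:
        # ON CREATE SET: properties are copied only when the node is first created
        store[key] = dict(chunk)
    return store[key]

store = {}
first = merge_chunk(store, {"chunkId": "doc-chunk0000", "text": "original"})
second = merge_chunk(store, {"chunkId": "doc-chunk0000", "text": "changed"})
print(len(store), second["text"])  # 1 original
```

Running the merge twice leaves one entry with its original text, which is why re-running the node-creation cells does not duplicate or overwrite chunks.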

- Set up connection to graph instance using LangChain

In [None]:
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)

- Create a single chunk node for now

In [None]:
kg.query(merge_chunk_node_query,
         params={'chunkParam':first_file_chunks[0]})

[{'mergedChunk': {'formId': '/content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt',
   'source': '/content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt',
   'text': 'Operation\nArachnophobia\nCaught in the Spider’s Web\n\n\n\n\nRich Barger | Cyber Squared Inc.\nMike Oppenheim | FireEye Labs\nChris Phillips | FireEye Labs\n\x0cContents\nTeam Introduction....................................................................................................................................................... 1\n\nKey Findings.................................................................................................................................................................. 1\n\nSummary....................................................................................................................................................................... 1\n\nBackstory......................................................................................................

- Create a uniqueness constraint to avoid duplicate chunks

In [None]:
kg.query("""
CREATE CONSTRAINT unique_chunk IF NOT EXISTS
    FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
""")


[]

In [None]:
kg.query("SHOW INDEXES")

[{'id': 0,
  'name': 'index_343aff4e',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'NODE',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': None,
  'readCount': 0},
 {'id': 1,
  'name': 'index_f7700477',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'RELATIONSHIP',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': None,
  'readCount': 0},
 {'id': 2,
  'name': 'unique_chunk',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'RANGE',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['chunkId'],
  'indexProvider': 'range-1.0',
  'owningConstraint': 'unique_chunk',
  'lastRead': None,
  'readCount': None}]

- Loop through and create nodes for all chunks
- Should create 20 nodes because you set a limit of 20 chunks in the text splitting function above

In [None]:
node_count = 0
for chunk in first_file_chunks:
    print(f"Creating `:Chunk` node for chunk ID {chunk['chunkId']}")
    kg.query(merge_chunk_node_query,
            params={
                'chunkParam': chunk
            })
    node_count += 1
print(f"Created {node_count} nodes")

Creating `:Chunk` node for chunk ID /content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt-chunk0000
Creating `:Chunk` node for chunk ID /content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt-chunk0001
Creating `:Chunk` node for chunk ID /content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt-chunk0002
Creating `:Chunk` node for chunk ID /content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt-chunk0003
Creating `:Chunk` node for chunk ID /content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt-chunk0004
Creating `:Chunk` node for chunk ID /content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt-chunk0005
Creating `:Chunk` node for chunk ID /content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt-chunk0006
Creating `:Chunk` node for chunk ID /content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt-chunk0007
Creating `:Chunk` node for chunk ID /content/ThreatConnect_Operation_Arachnophobia_Report.pdf.txt-chunk0008
Creating `:Chunk` node for c
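
The loop above sends one query per chunk, which is fine for 20 chunks. For larger inputs, one option is to send several chunks per query using Cypher's `UNWIND`. The sketch below is an assumption-laden variant, not part of the notebook: `batched` and `merge_chunks_batch_query` are hypothetical names, and the demo records stand in for `first_file_chunks`.

```python
def batched(records, batch_size):
    """Yield successive slices of `records` with at most `batch_size` items each."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Hypothetical batched form of merge_chunk_node_query: one row per chunk in $chunks
merge_chunks_batch_query = """
UNWIND $chunks AS chunkParam
MERGE (mergedChunk:Chunk {chunkId: chunkParam.chunkId})
    ON CREATE SET
        mergedChunk.formId = chunkParam.formId,
        mergedChunk.chunkSeqId = chunkParam.chunkSeqId,
        mergedChunk.source = chunkParam.source,
        mergedChunk.text = chunkParam.text
RETURN count(mergedChunk) AS merged
"""

# Stand-in records; in the notebook you would pass first_file_chunks instead
demo_records = [{"chunkId": f"doc-chunk{i:04d}"} for i in range(20)]
batch_sizes = [len(batch) for batch in batched(demo_records, 8)]
print(batch_sizes)  # [8, 8, 4]

# With a live connection, each batch would be sent as:
# kg.query(merge_chunks_batch_query, params={"chunks": batch})
```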

In [None]:
kg.query("""
         MATCH (n)
         RETURN count(n) as nodeCount
         """)

[{'nodeCount': 20}]

### Create a vector index

In [None]:
kg.query("""
         CREATE VECTOR INDEX `malwaredb_chunks` IF NOT EXISTS
          FOR (c:Chunk) ON (c.textEmbedding)
          OPTIONS { indexConfig: {
            `vector.dimensions`: 1536,
            `vector.similarity_function`: 'cosine'
         }}
""")

[]
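
The index is configured with the `cosine` similarity function over 1536-dimensional vectors. As a rough illustration of what that measures, the sketch below computes cosine similarity for small vectors in plain Python. The `normalized_score` function assumes that Neo4j reports cosine scores rescaled to the range 0 to 1 as `(1 + cosine) / 2`; treat that as an assumption and check the vector index documentation for your Neo4j version.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalized_score(a, b):
    """Rescale cosine similarity from [-1, 1] to [0, 1] (assumed Neo4j convention)."""
    return (1.0 + cosine_similarity(a, b)) / 2.0

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = normalized_score([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing the same way score highest regardless of their length, which is why cosine is a common choice for comparing text embeddings.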

In [None]:
kg.query("SHOW INDEXES")

[{'id': 0,
  'name': 'index_343aff4e',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'NODE',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': None,
  'readCount': 0},
 {'id': 1,
  'name': 'index_f7700477',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'RELATIONSHIP',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': None,
  'readCount': 0},
 {'id': 4,
  'name': 'malwaredb_chunks',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['textEmbedding'],
  'indexProvider': 'vector-2.0',
  'owningConstraint': None,
  'lastRead': None,
  'readCount': None},
 {'id': 2,
  'name': 'unique_chunk',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'RANGE',
  'enti

### Calculate embedding vectors for chunks and populate index
- This query calculates the embedding vector and stores it as a property called `textEmbedding` on each `Chunk` node.

In [None]:
kg.query("""
    MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
    WITH chunk, genai.vector.encode(
      chunk.text,
      "OpenAI",
      {
        token: $openAiApiKey,
        endpoint: $openAiEndpoint
      }) AS vector
    CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)
    """,
    params={"openAiApiKey":OPENAI_API_KEY, "openAiEndpoint": OPENAI_ENDPOINT} )

[]

In [None]:
kg.refresh_schema()
print(kg.schema)

Node properties:
Chunk {chunkId: STRING, formId: STRING, chunkSeqId: INTEGER, source: STRING, text: STRING, textEmbedding: LIST}
Relationship properties:

The relationships:



### Use similarity search to find relevant chunks

- Set up a helper function to perform similarity search using the vector index

In [None]:
def neo4j_vector_search(question):
  """Search for similar nodes using the Neo4j vector index"""
  vector_search_query = """
    WITH genai.vector.encode(
      $question,
      "OpenAI",
      {
        token: $openAiApiKey,
        endpoint: $openAiEndpoint
      }) AS question_embedding
    CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding) yield node, score
    RETURN score, node.text AS text
  """
  similar = kg.query(vector_search_query,
                     params={
                      'question': question,
                      'openAiApiKey':OPENAI_API_KEY,
                      'openAiEndpoint': OPENAI_ENDPOINT,
                      'index_name':VECTOR_INDEX_NAME,
                      'top_k': 10})
  return similar

- Ask a question!

In [None]:
search_results = neo4j_vector_search(
    'What evidence links BITTERBUG activity to Pakistan-based entities?'
)

In [None]:
search_results[0]

{'score': 0.919342041015625,
 'text': 'Digital Appendix 4: Maltego Visualization.............................................................................................................35\n\n\n\n\ni    •     OPERATION ARACHNOPHOBIA\n\x0cTeam Introduction\nCyber Squared Inc.’s ThreatConnect Intelligence Research Team (TCIRT) tracks a number of threat groups around the world.\nBeginning in the summer of 2013, TCIRT identified a suspected Pakistani-origin threat group. This group was revealed by TCIRT\npublicly in August 2013. In the months following the disclosure, we identified new activity. Cyber Squared partnered with experts\nat FireEye Labs to examine these new observations in an attempt to discover new research and insight into the group and its\nOperation “Arachnophobia”. The following report is a product of collaborative research and threat intelligence sharing between\nCyber Squared Inc.’s TCIRT and FireEye Labs.\n\n\nKey Findings\n•\t While we are not conclusively attributi
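
The raw result rows are hard to read because each `text` value is a long string full of newlines. A small formatter can help when you inspect the top matches. The sketch below assumes rows shaped like the output of `neo4j_vector_search` (a `score` and a `text` key); `format_results` is a hypothetical helper, and the demo rows are made-up stand-ins rather than real search output.

```python
import textwrap

def format_results(results, width=60, preview_chars=200):
    """Render ranked search rows as a score line plus a wrapped text preview."""
    lines = []
    for rank, row in enumerate(results, start=1):
        # Collapse newlines and repeated spaces, then keep only a short preview
        preview = " ".join(row["text"].split())[:preview_chars]
        lines.append(f"{rank}. score={row['score']:.3f}")
        lines.append(textwrap.fill(preview, width,
                                   initial_indent="   ",
                                   subsequent_indent="   "))
    return "\n".join(lines)

# Made-up rows for illustration; pass search_results in the notebook
demo_results = [
    {"score": 0.9193, "text": "Team Introduction\nfirst   placeholder chunk"},
    {"score": 0.9021, "text": "Key Findings\nsecond placeholder chunk"},
]
formatted = format_results(demo_results)
print(formatted)
```

In the notebook, `print(format_results(search_results[:3]))` would show the three best matches.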

### Set up a LangChain RAG workflow to chat with the report

In [None]:
neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)


In [None]:
retriever = neo4j_vector_store.as_retriever()

- Set up a RetrievalQAWithSourcesChain to carry out question answering
- You can check out the LangChain documentation for this chain [here](https://api.python.langchain.com/en/latest/chains/langchain.chains.qa_with_sources.retrieval.RetrievalQAWithSourcesChain.html)

In [None]:
chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever
)
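
With `chain_type="stuff"`, the retrieved documents are placed together into a single prompt that is sent to the model in one call. The sketch below shows the idea in plain Python. The prompt wording and the `build_stuff_prompt` name are illustrative assumptions and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Combine all retrieved documents and the question into one prompt string."""
    context = "\n\n".join(
        f"Content: {doc['text']}\nSource: {doc['source']}" for doc in documents
    )
    return (
        "Answer the question using only the context below and cite the sources.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder documents standing in for what the retriever returns
demo_docs = [
    {"text": "first retrieved chunk", "source": "report.txt"},
    {"text": "second retrieved chunk", "source": "report.txt"},
]
demo_prompt = build_stuff_prompt("What does the report cover?", demo_docs)
print(demo_prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit within the model's context window, which is one reason the chunk size is capped when splitting.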

In [None]:
def prettychain(question: str) -> None:
    """Pretty print the chain's response to a question"""
    response = chain({"question": question},
        return_only_outputs=True,)
    print(textwrap.fill(response['answer'], 60))

- Ask a question!

In [None]:
question = "What evidence links BITTERBUG activity to Pakistan-based entities?"

In [None]:
prettychain(question)

The evidence linking BITTERBUG activity to Pakistan-based
entities includes the use of a Pakistani-based hosting
provider for command and control, the hosting of malware on
IP addresses operated by a Pakistan-based hosting provider,
and the presence of build paths containing references to a
Pakistani security firm and an employee of that firm.
Additionally, employees at the Pakistan-based hosting
provider and the security firm have connections within each
other's social networks.


In [None]:
prettychain("How can organizations better protect themselves against similar threats?")

Organizations can better protect themselves against similar
threats by employing personnel with offensive cyber
expertise and closely monitoring any suspicious activities.
It is also important to maintain open communication with
hosting providers and promptly address any inconsistencies
in claims or responses. Regularly conducting technical
reviews of malware associated with potential threats can
also help in identifying and mitigating risks.


In [None]:
prettychain("""
    What evidence links BITTERBUG activity to Pakistan-based entities.
    Limit your answer to a single sentence.
""")

The evidence linking BITTERBUG activity to Pakistan-based
entities includes the use of a Pakistani-based hosting
provider for command and control, and the presence of build
paths containing references to a Pakistani security firm and
employee.


In [None]:
prettychain("""
    Tell me about Apple.
    Limit your answer to a single sentence.
""")

I don't know.


In [None]:
prettychain("""
    Tell me about Apple.
    Limit your answer to a single sentence.
    If you are unsure about the answer, say you don't know.
""")

I don't know.


### Ask your own question!
- Add your own question to the call to prettychain below to find out more about the Operation Arachnophobia report

In [None]:
prettychain("""
    ADD YOUR OWN QUESTION HERE
""")