# Expand With Summary

The `Chunk` nodes are individual pieces of text that were split
out of a larger document.

Connecting the nodes with relationships reconstructs the original context.
It also makes the data easier to navigate and understand.

That helps with debugging and testing while you're building an application.

It also lets you provide a better user experience. Your users can
interact directly with the data and even provide feedback that
improves subsequent answers.

You will create a connected context by making the following
changes to the knowledge graph:

1. Extract: create a `(:Document)` node for each original source document.
2. Enhance: add a summarized text property to each `(:Section)` node.
3. Expand: connect each `(:Chunk)` to the `(:Section)` node that it is part of.

The graph will look like this...

```cypher
(:Section 
  sectionId: string // a unique identifier for the section
  documentUri: string
  summary: string // text summary generated with the LLM 
  summaryEmbedding: float[] // vector embedding of summary
)
```

```cypher
// Document contains sections
(:Document)-[:CONTAINS]->(:Section)

// Section contains Chunks
(:Section)-[:CONTAINS]->(:Chunk)
```
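As a rough mental model of that shape, the sketch below mirrors the `CONTAINS` relationships with plain Python objects. It is not how the notebook stores data (that lives in Neo4j). The class names follow the node labels, while `section_text`, `find_section_for`, and the sample values are hypothetical, made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    text: str


@dataclass
class Section:
    sectionId: str
    summary: str = ""
    # Stands in for (:Section)-[:CONTAINS]->(:Chunk)
    chunks: List[Chunk] = field(default_factory=list)


@dataclass
class Document:
    documentUri: str
    # Stands in for (:Document)-[:CONTAINS]->(:Section)
    sections: List[Section] = field(default_factory=list)


def section_text(section: Section) -> str:
    """Rebuild a section's text by joining its chunks in order."""
    return " \n ".join(chunk.text for chunk in section.chunks)


def find_section_for(document: Document, chunk: Chunk) -> Optional[Section]:
    """Walk Document -> Section -> Chunk to find the section holding a chunk."""
    for section in document.sections:
        if any(candidate is chunk for candidate in section.chunks):
            return section
    return None


# Hypothetical sample data, for illustration only
first = Chunk(text="first chunk")
second = Chunk(text="second chunk")
section = Section(sectionId="example-section", chunks=[first, second])
document = Document(documentUri="example://document", sections=[section])

print(section_text(section))
print(find_section_for(document, second).sectionId)
```

The point of the connected shape is the second function: starting from any chunk, you can get back to its section (and its summary) instead of treating the chunk as an isolated piece of text.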

## Setup

In [11]:
%run 'shared.ipynb'

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
Connecting to Neo4j at bolt://neo4j-1:7687 as neo4j
Using data from /home/jovyan/data/single
Embedding with ollama using mxbai-embed-large
Chatting with ollama using llama3


## Enhance - create summary property, with embedding

During the file processing above, the text from all interesting
items was added to the `fullText` property of the `all_forms` dictionaries.

We'll use an LLM to summarize the text and create an embedding.

Both the text summary and the embedding will be added to the `Section` nodes.

In [2]:
# Create an embedding to find out the dimensions of the vector
text_embedding = embeddings_api.embed_query("embed this text using an LLM")
vector_dimensions = len(text_embedding) 

print(f"Text embeddings will have {vector_dimensions} dimensions")
# Create a vector index called "sections_vector" on the `summaryEmbedding` property of nodes labeled `Section`.
gdb.execute_query("""
         CREATE VECTOR INDEX `sections_vector` IF NOT EXISTS
          FOR (s:Section) ON (s.summaryEmbedding) 
          OPTIONS { indexConfig: {
            `vector.dimensions`: $vectorDimensionsParam,
            `vector.similarity_function`: 'cosine'    
         }}
""",
  vectorDimensionsParam=vector_dimensions
)

# Check the vector indexes in the graph
gdb.execute_query('SHOW VECTOR INDEXES').records

Text embeddings will have 1024 dimensions


[<Record id=11 name='chunks_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Chunk'] properties=['embedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=neo4j.time.DateTime(2024, 6, 5, 9, 30, 46, 871000000, tzinfo=<UTC>) readCount=19>,
 <Record id=2 name='sections_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Section'] properties=['summaryEmbedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=None readCount=0>]
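The `sections_vector` index above is configured with the `cosine` similarity function. As a reminder of what that measures, here is the textbook cosine formula in plain Python. This is only an illustration: the function name `cosine_similarity` is an assumption of this sketch, and Neo4j computes its own score internally, which is not necessarily this raw value.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, orthogonal, and opposite direction
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Because cosine similarity ignores vector length, two summaries whose embeddings point the same way count as similar regardless of magnitude.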

In [8]:
%%time

query = """
MATCH (s:Section)-[:CONTAINS]->(c:Chunk)
WHERE s.summary is null
WITH s, collect(c.text) as textList
RETURN s.sectionId AS sectionId
        , s.level AS level
        , apoc.text.join(textList, " \n ") AS text
"""

sections_with_text = gdb.execute_query(query).records
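# The query only returns sections without a summary, so re-running this
# cell after an interruption resumes where it left off.
for section in sections_with_text: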
    section_info = {}
    text = section["text"]
    result = chat_api.invoke(
      f"""Write a single, very brief sentence summary
       based on the following information...\n {text}. Do not respond with "Here is a brief sentence summary:" 
      """)
    summary = result.content
    print(f"Summarized {len(summary)} characters. Here's a preview...")
    print(f"\t{summary[:120]}")
    section_info['summary'] = summary
    section_info['sectionId'] = section["sectionId"]
    
    summary_embedding = embeddings_api.embed_query(summary)
    section_info['summaryEmbedding'] = summary_embedding
    print(f"\tUpdating section with ID {section_info['sectionId']} with summary and embedding...")
    gdb.execute_query("""
      MATCH (s:Section {sectionId: $sectionInfoParam.sectionId})
        SET s.summary = $sectionInfoParam.summary 
      WITH s
        CALL db.create.setNodeVectorProperty(s, "summaryEmbedding", $sectionInfoParam.summaryEmbedding)    
    """, 
    sectionInfoParam=section_info
    )


Summarized 181 characters. Here's a preview...
	The International Ice Hockey Federation (IIHF) establishes a single set of rules to ensure fair play and respect across 
	Updating section with ID ca4f9dcf204e2037bfe5884867bead98bd9cbaf8 with summary and embedding...
Summarized 138 characters. Here's a preview...
	The IIHF governs games played on an ice rink that adheres to specific dimensions and rules, with limited exceptions for 
	Updating section with ID 70669d65799f0b85f53fa4d87f745b1ec129af58 with summary and embedding...
Summarized 155 characters. Here's a preview...
	The official size of an ice rink used in IIHF competitions is 60m long and 26-30m wide, with rounded corners formed by a
	Updating section with ID 3256ff621b0e47207f42b2db33a4589ec1ddd458 with summary and embedding...
Summarized 166 characters. Here's a preview...
	The Rink's boards, measuring 1.07m high, are constructed to be smooth and obstruction-free, with approved protective gla
	Updating section with ID b9ea9b8

KeyboardInterrupt: 
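The prompt above asks the model not to open with "Here is a brief sentence summary:". Models do not always comply, so one option is to also strip such a preamble before storing the summary. The helper below is a minimal sketch of that idea, not part of the notebook: `clean_summary` and its regular expression are assumptions, and the pattern would also strip a legitimate summary that happens to start with "Here is ...:".

```python
import re

# Matches a leading "Here is ...:" or "Here's ...:" on the first line only
_PREAMBLE = re.compile(r"^\s*here(?:'s| is)\b[^:\n]*:\s*", re.IGNORECASE)


def clean_summary(raw: str) -> str:
    """Remove a leading 'Here is ...:' preamble, if present, and trim whitespace."""
    return _PREAMBLE.sub("", raw, count=1).strip()


print(clean_summary("Here is a brief sentence summary:\n\nThe rink has rounded corners."))
print(clean_summary("The rink has rounded corners."))
```

If you adopt something like this, it would be applied to `result.content` before the summary is embedded, so the embedding reflects only the summary text.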

### To view the summaries for one `Section` node

```cypher
MATCH (c:Chunk)<-[:CONTAINS]-(s:Section)
WHERE c.header4 = '8.1. INJURED PLAYER'
RETURN c, s
```
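Once summaries and embeddings are stored, the `sections_vector` index can also be used to find sections by meaning rather than by header. The function below is a sketch under some assumptions: it relies on the `db.index.vector.queryNodes` procedure being available in your Neo4j version, it takes the driver and embedding client as arguments (shaped like the `gdb` and `embeddings_api` objects used in this notebook), and the name `find_similar_sections` and the default `k=3` are made up for illustration.

```python
def find_similar_sections(gdb, embeddings_api, question, k=3):
    """Return (sectionId, summary, score) for the k summaries closest to the question."""
    # Embed the question with the same model used for the summaries
    question_embedding = embeddings_api.embed_query(question)
    result = gdb.execute_query(
        """
        CALL db.index.vector.queryNodes('sections_vector', $k, $questionEmbedding)
        YIELD node, score
        RETURN node.sectionId AS sectionId, node.summary AS summary, score
        """,
        k=k,
        questionEmbedding=question_embedding,
    )
    return [(r["sectionId"], r["summary"], r["score"]) for r in result.records]
```

In this notebook's environment, a call might look like `find_similar_sections(gdb, embeddings_api, "How big is the rink?")`. It only returns sections that already have a `summaryEmbedding`.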

### Remove `summary` and `summaryEmbedding` from all `Section` nodes

In [15]:
cypher_remove_summaries = """
MATCH (s:Section) REMOVE s.summary, s.summaryEmbedding
"""

gdb.execute_query(cypher_remove_summaries)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff5541e710>, keys=[])