# Markdown Tree for ingestion

In [1]:
import sys
sys.path.append("../")
from src.ingestion.markdown_tree import *

Technical details: the major libraries used are:
- mistletoe - for markdown parsing and rendering; this creates tokens
- nutree - for representing and reasoning over the tree; this creates nodes that wrap around tokens

## Generic Demo

In [2]:
markdown_text = """
This is the first paragraph with no heading.

# Heading 1

First paragraph under Heading 1. Paragraph H1.p1, sentence 2. Paragraph H1.p1, sentence 3. Paragraph H1.p1, sentence 4. Last sentence of 'H1.p1'.

Second paragraph under Heading 1. Paragraph H1.p2, sentence 2. Paragraph H1.p2, sentence 3. Paragraph H1.p2, sentence 4. Paragraph H1.p2, sentence 5. Paragraph H1.p2, sentence 6. Last sentence of 'H1.p2'.

## Heading 2

Paragraph 1 under Heading 2 with [a link](http://to.nowhere.com). Paragraph H2.p1, sentence 2. Paragraph H2.p1, sentence 3. Paragraph H2.p1, sentence 4. Last sentence of 'H2.p1'.

Paragraph 2 under Heading 2. Paragraph H2.p2, sentence 2. Paragraph H2.p2, sentence 3. Paragraph H2.p2, sentence 4. Paragraph H2.p2, sentence 5. Paragraph H2.p2, sentence 6. Last sentence of 'H2.p2'.

List intro:
* Item H2.3.L1.1
* Item H2.3.L1.2
* Item H2.3.L1.3

### Heading 3

Only paragraph under **Heading 3**. Paragraph H3.p1, sentence 2. Paragraph H3.p1, sentence 3. Paragraph H3.p1, sentence 4. Last sentence.

Table intro:
| H3.2.T1: header 1 | H3.2.T1: header 2 | H3.2.T1: header 3 | H3.2.T1: header 4 |
|---|---|---|---|
| H3.2.T1: row 1, col 1 | H3.2.T1: row 1, col 2 | H3.2.T1: row 1, col 3 | H3.2.T1: row 1, col 4 |
| H3.2.T1: row 2, col 1 | H3.2.T1: row 2, col 2 | H3.2.T1: row 2, col 3 | H3.2.T1: row 2, col 4 |
| H3.2.T1: row 3, col 1 | H3.2.T1: row 3, col 2 | H3.2.T1: row 3, col 3 | H3.2.T1: row 3, col 4 |

# Second H1 without a paragraph

### Skip to H3

A paragraph under "Second H1>Heading 3". Paragraph H1>H3.p1, sentence 2. Paragraph H1>H3.p1, sentence 3. Following list has *no* intro.

* Item H1>H3.L1.1
* Item H1>H3.L1.2
  * Item H1>H3.L1.subL.1
    * Item H1>H3.L1.subL.2
      * Item H1>H3.L1.subL.3
* Item H1>H3.L1.1
* Item H1>H3.L1.2

Paragraph before list with long list items.
* H1>H3.L2.1 -- Paragraph L2.item1, sentence 2. Paragraph L2.item1, sentence 3. Paragraph L2.item1, sentence 4.
* H1>H3.L2.2 -- Paragraph L2.item2, sentence 2. Paragraph L2.item2, sentence 3. Paragraph L2.item2, sentence 4. Paragraph L2.item2, sentence 5.
* H1>H3.L2.3 -- Paragraph L2.item3, sentence 2. Paragraph L2.item3, sentence 3. Paragraph L2.item3, sentence 4. Paragraph L2.item3, sentence 5. Paragraph L2.item3, sentence 6.
* H1>H3.L2.4 -- Paragraph L2.item1, sentence 2. Paragraph L2.item1, sentence 3. Paragraph L2.item1, sentence 4.
* H1>H3.L2.5 -- Paragraph L2.item2, sentence 2. Paragraph L2.item2, sentence 3. Paragraph L2.item2, sentence 4. Paragraph L2.item2, sentence 5.
* H1>H3.L2.6 -- Paragraph L2.item3, sentence 2. Paragraph L2.item3, sentence 3. Paragraph L2.item3, sentence 4. Paragraph L2.item3, sentence 5. Paragraph L2.item3, sentence 6.

Final paragraph.
"""

'\nThis is the first paragraph with no heading.\n\n# Heading 1\n\nFirst paragraph under Heading 1. Paragraph H1.p1, sentence 2. Paragraph H1.p1, sentence 3. Paragraph H1.p1, sentence 4. Last sentence of \'H1.p1\'.\n\nSecond paragraph under Heading 1. Paragraph H1.p2, sentence 2. Paragraph H1.p2, sentence 3. Paragraph H1.p2, sentence 4. Paragraph H1.p2, sentence 5. Paragraph H1.p2, sentence 6. Last sentence of \'H1.p2\'.\n\n## Heading 2\n\nParagraph 1 under Heading 2 with [a link](http://to.nowhere.com). Paragraph H2.p1, sentence 2. Paragraph H2.p1, sentence 3. Paragraph H2.p1, sentence 4. Last sentence of \'H2.p1\'.\n\nParagraph 2 under Heading 2. Paragraph H2.p2, sentence 2. Paragraph H2.p2, sentence 3. Paragraph H2.p2, sentence 4. Paragraph H2.p2, sentence 5. Paragraph H2.p2, sentence 6. Last sentence of \'H2.p2\'.\n\nList intro:\n* Item H2.3.L1.1\n* Item H2.3.L1.2\n* Item H2.3.L1.3\n\n### Heading 3\n\nOnly paragraph under **Heading 3**. Paragraph H3.p1, sentence 2. Paragraph H3.p1, 

In [3]:
markdown = normalize_markdown(markdown_text)
print(markdown)


This is the first paragraph with no heading.

# Heading 1

First paragraph under Heading 1. Paragraph H1.p1, sentence 2. Paragraph H1.p1, sentence 3. Paragraph H1.p1, sentence 4. Last sentence of 'H1.p1'.

Second paragraph under Heading 1. Paragraph H1.p2, sentence 2. Paragraph H1.p2, sentence 3. Paragraph H1.p2, sentence 4. Paragraph H1.p2, sentence 5. Paragraph H1.p2, sentence 6. Last sentence of 'H1.p2'.

## Heading 2

Paragraph 1 under Heading 2 with [a link](http://to.nowhere.com). Paragraph H2.p1, sentence 2. Paragraph H2.p1, sentence 3. Paragraph H2.p1, sentence 4. Last sentence of 'H2.p1'.

Paragraph 2 under Heading 2. Paragraph H2.p2, sentence 2. Paragraph H2.p2, sentence 3. Paragraph H2.p2, sentence 4. Paragraph H2.p2, sentence 5. Paragraph H2.p2, sentence 6. Last sentence of 'H2.p2'.

List intro:
* Item H2.3.L1.1
* Item H2.3.L1.2
* Item H2.3.L1.3

### Heading 3

Only paragraph under **Heading 3**. Paragraph H3.p1, sentence 2. Paragraph H3.p1, sentence 3. Paragraph H3.p1, se

In [4]:
tree = create_markdown_tree(markdown_text)
tree.print()

Tree<'Markdown tree'>
╰── Document D_1: '<mistletoe.block_token.Document with 37 children line_number=1 at 0x105b72ff0>'
    ├── BlankLine BL_1
    ├── Paragraph P_2 of length 45 across 1 children
    │   ╰── RawText s.0: "<mistletoe.span_token.RawText content='This is the first paragraph wi'...+14 at 0x105b853a0>"
    ├── BlankLine BL_3
    ├── Heading H1_4: '# Heading 1\n'
    │   ╰── RawText s.1: "<mistletoe.span_token.RawText content='Heading 1' at 0x105b85460>"
    ├── BlankLine BL_5
    ├── Paragraph P_6 of length 146 across 1 children
    │   ╰── RawText s.2: "<mistletoe.span_token.RawText content='First paragraph under Heading '...+115 at 0x105bfd490>"
    ├── BlankLine BL_7
    ├── Paragraph P_8 of length 205 across 1 children
    │   ╰── RawText s.3: "<mistletoe.span_token.RawText content='Second paragraph under Heading'...+174 at 0x105b85640>"
    ├── BlankLine BL_9
    ├── Heading H2_10: '## Heading 2\n'
    │   ╰── RawText s.4: "<mistletoe.span_token.RawText content='Headi

### Tree examination

In [5]:
with AstRenderer() as ast_renderer:
    doc = mistletoe.Document(markdown_text)
    ast_json = ast_renderer.render(doc)
print(ast_json)

{
  "type": "Document",
  "footnotes": {},
  "line_number": 1,
  "children": [
    {
      "type": "Paragraph",
      "line_number": 2,
      "children": [
        {
          "type": "RawText",
          "content": "This is the first paragraph with no heading."
        }
      ]
    },
    {
      "type": "Heading",
      "line_number": 4,
      "level": 1,
      "children": [
        {
          "type": "RawText",
          "content": "Heading 1"
        }
      ]
    },
    {
      "type": "Paragraph",
      "line_number": 6,
      "children": [
        {
          "type": "RawText",
          "content": "First paragraph under Heading 1. Paragraph H1.p1, sentence 2. Paragraph H1.p1, sentence 3. Paragraph H1.p1, sentence 4. Last sentence of 'H1.p1'."
        }
      ]
    },
    {
      "type": "Paragraph",
      "line_number": 8,
      "children": [
        {
          "type": "RawText",
          "content": "Second paragraph under Heading 1. Paragraph H1.p2, sentence 2. Paragraph H

In [6]:
from mistletoe.base_renderer import BaseRenderer
with BaseRenderer() as ast_renderer:
    doc = mistletoe.Document(markdown_text)
    text = ast_renderer.render(doc)
print(text)

This is the first paragraph with no heading.Heading 1First paragraph under Heading 1. Paragraph H1.p1, sentence 2. Paragraph H1.p1, sentence 3. Paragraph H1.p1, sentence 4. Last sentence of 'H1.p1'.Second paragraph under Heading 1. Paragraph H1.p2, sentence 2. Paragraph H1.p2, sentence 3. Paragraph H1.p2, sentence 4. Paragraph H1.p2, sentence 5. Paragraph H1.p2, sentence 6. Last sentence of 'H1.p2'.Heading 2Paragraph 1 under Heading 2 with a link. Paragraph H2.p1, sentence 2. Paragraph H2.p1, sentence 3. Paragraph H2.p1, sentence 4. Last sentence of 'H2.p1'.Paragraph 2 under Heading 2. Paragraph H2.p2, sentence 2. Paragraph H2.p2, sentence 3. Paragraph H2.p2, sentence 4. Paragraph H2.p2, sentence 5. Paragraph H2.p2, sentence 6. Last sentence of 'H2.p2'.List intro:Item H2.3.L1.1Item H2.3.L1.2Item H2.3.L1.3Heading 3Only paragraph under Heading 3. Paragraph H3.p1, sentence 2. Paragraph H3.p1, sentence 3. Paragraph H3.p1, sentence 4. Last sentence.Table intro:H3.2.T1: row 1, col 1H3.2.T1: 

In [7]:
def normalize_markdown(markdown: str) -> str:
    with MarkdownRenderer(normalize_whitespace=True) as renderer:
        # "the parsing phase is currently tightly connected with initiation and closing of a renderer.
        # Therefore, you should never call Document(...) outside of a with ... as renderer block"
        doc = mistletoe.Document(markdown)
        return renderer.render(doc)

md = normalize_markdown(markdown_text)
print(md)


This is the first paragraph with no heading.

# Heading 1

First paragraph under Heading 1. Paragraph H1.p1, sentence 2. Paragraph H1.p1, sentence 3. Paragraph H1.p1, sentence 4. Last sentence of 'H1.p1'.

Second paragraph under Heading 1. Paragraph H1.p2, sentence 2. Paragraph H1.p2, sentence 3. Paragraph H1.p2, sentence 4. Paragraph H1.p2, sentence 5. Paragraph H1.p2, sentence 6. Last sentence of 'H1.p2'.

## Heading 2

Paragraph 1 under Heading 2 with [a link](http://to.nowhere.com). Paragraph H2.p1, sentence 2. Paragraph H2.p1, sentence 3. Paragraph H2.p1, sentence 4. Last sentence of 'H2.p1'.

Paragraph 2 under Heading 2. Paragraph H2.p2, sentence 2. Paragraph H2.p2, sentence 3. Paragraph H2.p2, sentence 4. Paragraph H2.p2, sentence 5. Paragraph H2.p2, sentence 6. Last sentence of 'H2.p2'.

List intro:
* Item H2.3.L1.1
* Item H2.3.L1.2
* Item H2.3.L1.3

### Heading 3

Only paragraph under **Heading 3**. Paragraph H3.p1, sentence 2. Paragraph H3.p1, sentence 3. Paragraph H3.p1, se

In [8]:
print(tree['L_38'].render())

* Item H1>H3.L1.1
* Item H1>H3.L1.2
  * Item H1>H3.L1.subL.1
    * Item H1>H3.L1.subL.2
      * Item H1>H3.L1.subL.3
* Item H1>H3.L1.1
* Item H1>H3.L1.2



In [9]:
print(tree['LI_39'].render())

* Item H1>H3.L1.2
  * Item H1>H3.L1.subL.1
    * Item H1>H3.L1.subL.2
      * Item H1>H3.L1.subL.3



In [10]:
print(tree['LI_41'].render())

* Item H1>H3.L1.subL.2
  * Item H1>H3.L1.subL.3



In [11]:
from pprint import pprint
pprint(describe_tree(tree))

{'children': defaultdict(<class 'set'>,
                         {'Document': {'BlankLine',
                                       'Heading',
                                       'List',
                                       'Paragraph',
                                       'Table'},
                          'Emphasis': {'RawText'},
                          'Heading': {'RawText'},
                          'Link': {'RawText'},
                          'List': {'ListItem'},
                          'ListItem': {'Paragraph', 'List'},
                          'Paragraph': {'Emphasis',
                                        'Link',
                                        'RawText',
                                        'Strong'},
                          'Strong': {'RawText'},
                          'Table': {'TableRow'},
                          'TableCell': {'RawText'},
                          'TableRow': {'TableCell'}}),
 'counts': defaultdict(<class 'int'>,
        

In [12]:
doc_node = tree.children[0]
len(doc_node.children)

37

In [13]:
heading_nodes = tree.find_all(match=lambda n: n.data_type == "Heading")
pprint(heading_nodes)
len(heading_nodes)

[Node<"Heading H1_4: '# Heading 1\\n'", data_id=H1_4>,
 Node<"Heading H2_10: '## Heading 2\\n'", data_id=H2_10>,
 Node<"Heading H3_21: '### Heading 3\\n'", data_id=H3_21>,
 Node<"Heading H1_32: '# Second H1 without a paragraph\\n'", data_id=H1_32>,
 Node<"Heading H3_34: '### Skip to H3\\n'", data_id=H3_34>]


5

In [14]:
table_nodes = tree.find_all(match=lambda n: n.data_type == "Table")
table_nodes

[Node<"Table T_26: '| H3.2.T1: header 1     | H3.2.T1: header 2     | H3.2.T1: header 3     | H3.2.T1: header 4     |'", data_id=T_26>]

In [15]:
table_node = table_nodes[0]
pprint(vars(table_node))

{'data_id': 'T_26',
 'data_type': 'Table',
 'token': <mistletoe.block_token.Table with 3 children line_number=26 column_align=[None, None, None, None] at 0x105bfe000>,
 'tree': Tree<'Markdown tree'>}


* Tokens are the result of parsing the original markdown text.
    - They have `_parent` and `_children` attributes.
* A (tree) node holds one token; the nodes are arranged in a tree according to the tokens' `_parent` and `_children` attributes.
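
To make the token/node relationship concrete, here is a minimal, library-free sketch. `FakeToken`, `TreeNode` and `wrap` are hypothetical stand-ins, not mistletoe or nutree classes; the only idea borrowed from the notebook is that each token exposes its children and each node wraps exactly one token.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeToken:
    """Hypothetical stand-in for a parsed markdown token."""
    type: str
    children: list["FakeToken"] = field(default_factory=list)


@dataclass
class TreeNode:
    """Hypothetical stand-in for a tree node that wraps one token."""
    token: FakeToken
    parent: Optional["TreeNode"] = None
    children: list["TreeNode"] = field(default_factory=list)


def wrap(token: FakeToken, parent: Optional[TreeNode] = None) -> TreeNode:
    # Create a node for this token, then recurse so that the node
    # hierarchy mirrors the token hierarchy.
    node = TreeNode(token=token, parent=parent)
    node.children = [wrap(child, node) for child in token.children]
    return node


# A table token with three rows, each holding one cell.
table = FakeToken(
    "Table",
    [FakeToken("TableRow", [FakeToken("TableCell")]) for _ in range(3)],
)
table_node = wrap(table)
print([child.token.type for child in table_node.children])
```

The real tree additionally assigns ids such as `T_26` and `TR_28` to nodes; that bookkeeping is left out here.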

In [16]:
pprint(vars(table_node.token))

{'_children': [<mistletoe.block_token.TableRow with 4 children line_number=28 row_align=[None, None, None, None] at 0x105bfe090>,
               <mistletoe.block_token.TableRow with 4 children line_number=29 row_align=[None, None, None, None] at 0x105bfe330>,
               <mistletoe.block_token.TableRow with 4 children line_number=30 row_align=[None, None, None, None] at 0x105bfe390>],
 '_parent': <mistletoe.block_token.Document with 37 children line_number=1 at 0x105b72ff0>,
 'column_align': [None, None, None, None],
 'data_id': 'T_26',
 'header': <mistletoe.block_token.TableRow with 4 children line_number=26 row_align=[None, None, None, None] at 0x105bfe060>,
 'line_number': 26,
 'type': 'Table'}


A Table token has a `header` TableRow token and `children` TableRow tokens:

In [17]:
table_node.token.header

<mistletoe.block_token.TableRow with 4 children line_number=26 row_align=[None, None, None, None] at 0x105bfe060>

In [18]:
table_node.token.children

[<mistletoe.block_token.TableRow with 4 children line_number=28 row_align=[None, None, None, None] at 0x105bfe090>,
 <mistletoe.block_token.TableRow with 4 children line_number=29 row_align=[None, None, None, None] at 0x105bfe330>,
 <mistletoe.block_token.TableRow with 4 children line_number=30 row_align=[None, None, None, None] at 0x105bfe390>]

The token's children are reflected in the node's children:

In [19]:
table_node.children

[Node<"TableRow TR_28: '| H3.2.T1: row 1, col 1 | H3.2.T1: row 1, col 2 | H3.2.T1: row 1, col 3 | H3.2.T1: row 1, col 4 |'", data_id=TR_28>,
 Node<"TableRow TR_29: '| H3.2.T1: row 2, col 1 | H3.2.T1: row 2, col 2 | H3.2.T1: row 2, col 3 | H3.2.T1: row 2, col 4 |'", data_id=TR_29>,
 Node<"TableRow TR_30: '| H3.2.T1: row 3, col 1 | H3.2.T1: row 3, col 2 | H3.2.T1: row 3, col 3 | H3.2.T1: row 3, col 4 |'", data_id=TR_30>]

Nodes make working with and rendering tokens easier:

In [20]:
table_node.last_child().render()

'| H3.2.T1: row 3, col 1 | H3.2.T1: row 3, col 2 | H3.2.T1: row 3, col 3 | H3.2.T1: row 3, col 4 |'

The tree makes reasoning about the markdown structure easier (e.g., during chunking):

In [21]:
tree.find_all(match=lambda n: n.data_type == "List")

[Node<"List L_17: '<mistletoe.block_token.List with 3 children line_number=17 loose=False start=None at 0x105bfda60>'", data_id=L_17>,
 Node<"List L_38: '<mistletoe.block_token.List with 4 children line_number=38 loose=False start=None at 0x105bff020>'", data_id=L_38>,
 Node<"List L_40: '<mistletoe.block_token.List with 1 child line_number=40 loose=False start=None at 0x105bff380>'", data_id=L_40>,
 Node<"List L_41: '<mistletoe.block_token.List with 1 child line_number=41 loose=False start=None at 0x105bff620>'", data_id=L_41>,
 Node<"List L_42: '<mistletoe.block_token.List with 1 child line_number=42 loose=False start=None at 0x105bff770>'", data_id=L_42>,
 Node<"List L_47: '<mistletoe.block_token.List with 6 children line_number=47 loose=False start=None at 0x105bff9b0>'", data_id=L_47>]

### Tree preparation (before chunking)

For now, let's treat "nodes" and "tokens" as synonymous.
* Each one has a `type`
    - Each type is either a `BlockToken` or a `SpanToken`

In [22]:
mistletoe.block_token.BlockToken.__subclasses__()

[mistletoe.block_token.Document,
 mistletoe.block_token.Heading,
 mistletoe.block_token.SetextHeading,
 mistletoe.block_token.Quote,
 mistletoe.block_token.Paragraph,
 mistletoe.block_token.BlockCode,
 mistletoe.block_token.CodeFence,
 mistletoe.block_token.List,
 mistletoe.block_token.ListItem,
 mistletoe.block_token.Table,
 mistletoe.block_token.TableRow,
 mistletoe.block_token.TableCell,
 mistletoe.block_token.Footnote,
 mistletoe.block_token.ThematicBreak,
 mistletoe.block_token.HtmlBlock,
 mistletoe.markdown_renderer.BlankLine]

A `SpanToken` *always* appears within a `BlockToken`.
- Block tokens are block-level elements (paragraphs, headings, lists, tables), usually separated by a blank line in markdown text.
- Span tokens are inline (not visually separated) in markdown text.

In [23]:
mistletoe.span_token.SpanToken.__subclasses__()

[mistletoe.span_token.CoreTokens,
 mistletoe.span_token.Strong,
 mistletoe.span_token.Emphasis,
 mistletoe.span_token.InlineCode,
 mistletoe.span_token.Strikethrough,
 mistletoe.span_token.Image,
 mistletoe.span_token.Link,
 mistletoe.span_token.AutoLink,
 mistletoe.span_token.EscapeSequence,
 mistletoe.span_token.LineBreak,
 mistletoe.span_token.RawText,
 mistletoe.span_token.HtmlSpan,
 mistletoe.span_token.XWikiBlockMacroStart,
 mistletoe.span_token.XWikiBlockMacroEnd,
 mistletoe.markdown_renderer.LinkReferenceDefinition]

In [24]:
tree.print()

Tree<'Markdown tree'>
╰── Document D_1: '<mistletoe.block_token.Document with 37 children line_number=1 at 0x105b72ff0>'
    ├── BlankLine BL_1
    ├── Paragraph P_2 of length 45 across 1 children
    │   ╰── RawText s.0: "<mistletoe.span_token.RawText content='This is the first paragraph wi'...+14 at 0x105b853a0>"
    ├── BlankLine BL_3
    ├── Heading H1_4: '# Heading 1\n'
    │   ╰── RawText s.1: "<mistletoe.span_token.RawText content='Heading 1' at 0x105b85460>"
    ├── BlankLine BL_5
    ├── Paragraph P_6 of length 146 across 1 children
    │   ╰── RawText s.2: "<mistletoe.span_token.RawText content='First paragraph under Heading '...+115 at 0x105bfd490>"
    ├── BlankLine BL_7
    ├── Paragraph P_8 of length 205 across 1 children
    │   ╰── RawText s.3: "<mistletoe.span_token.RawText content='Second paragraph under Heading'...+174 at 0x105b85640>"
    ├── BlankLine BL_9
    ├── Heading H2_10: '## Heading 2\n'
    │   ╰── RawText s.4: "<mistletoe.span_token.RawText content='Headi

#### Hide Span Tokens

In [25]:
hide_span_tokens(tree)

35

In [26]:
tree.print()

Tree<'Markdown tree'>
╰── Document D_1: '<mistletoe.block_token.Document with 37 children line_number=1 at 0x105b72ff0>'
    ├── BlankLine BL_1
    ├── Paragraph P_2 of length 45 across 1 children: 'This is the first paragraph with no heading.'
    ├── BlankLine BL_3
    ├── Heading H1_4: '# Heading 1'
    ├── BlankLine BL_5
    ├── Paragraph P_6 of length 146 across 1 children: 'First paragraph under Heading 1....(hidden)'
    ├── BlankLine BL_7
    ├── Paragraph P_8 of length 205 across 1 children: 'Second paragraph under Heading 1....(hidden)'
    ├── BlankLine BL_9
    ├── Heading H2_10: '## Heading 2'
    ├── BlankLine BL_11
    ├── Paragraph P_12 of length 179 across 3 children: 'Paragraph 1 under Heading 2 with [a...(hidden)'
    ├── BlankLine BL_13
    ├── Paragraph P_14 of length 200 across 1 children: 'Paragraph 2 under Heading 2. Paragraph...(hidden)'
    ├── BlankLine BL_15
    ├── Paragraph P_16 of length 12 across 1 children: 'List intro:'
    ├── List L_17: '<mistletoe.b

Tech detail: The parents of hidden nodes have `"freeze_token_children"` set to `True` in `node.data`. This has implications later, when copying subtrees.

Extensibility feature: custom attributes can be added to `node.data`, which is useful for tree traversals and manipulations.
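
As a rough illustration of the hiding step, the sketch below uses plain dicts instead of real nodes. `make`, `hide_spans` and the `BLOCK_TYPES` subset are assumptions made for this example, and the notebook does not say what exactly the return value of `hide_span_tokens` counts; here it is the number of span children removed directly from a block node.

```python
# Assumed subset of block-level types; everything else is treated as a span.
BLOCK_TYPES = {"Document", "Paragraph", "Heading", "List", "ListItem"}


def make(node_type: str, *children: dict) -> dict:
    """Build a toy node: a type, a list of children and a free-form data dict."""
    return {"type": node_type, "children": list(children), "data": {}}


def hide_spans(node: dict) -> int:
    """Remove span children from the tree and flag their parent; return the count."""
    spans = [c for c in node["children"] if c["type"] not in BLOCK_TYPES]
    blocks = [c for c in node["children"] if c["type"] in BLOCK_TYPES]
    if spans:
        # Custom attribute stored in the node's data, as described above.
        node["data"]["freeze_token_children"] = True
    node["children"] = blocks
    return len(spans) + sum(hide_spans(block) for block in blocks)


doc = make(
    "Document",
    make("Heading", make("RawText")),
    make("Paragraph", make("RawText"), make("Link", make("RawText")), make("RawText")),
)
hidden = hide_spans(doc)
print(hidden, doc["children"][1]["data"])
```

Nested spans (the `RawText` inside the `Link`) disappear together with their parent span and are not counted separately in this sketch.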

In [27]:
tree['P_47'].data["freeze_token_children"]

True

Hiding the span tokens leaves fewer nodes to deal with.

In [28]:
tree_descr = describe_tree(tree)
tree_descr

{'counts': defaultdict(int,
             {'Document': 1,
              'BlankLine': 17,
              'Paragraph': 27,
              'Heading': 5,
              'List': 6,
              'ListItem': 16,
              'Table': 1,
              'TableRow': 3}),
 'children': defaultdict(set,
             {'Document': {'BlankLine',
               'Heading',
               'List',
               'Paragraph',
               'Table'},
              'List': {'ListItem'},
              'ListItem': {'List', 'Paragraph'},
              'Table': {'TableRow'}}),
 'parents': defaultdict(set,
             {'BlankLine': {'Document'},
              'Paragraph': {'Document', 'ListItem'},
              'Heading': {'Document'},
              'List': {'Document', 'ListItem'},
              'ListItem': {'List'},
              'Table': {'Document'},
              'TableRow': {'Table'}}),
 'tokens': defaultdict(set,
             {'Document': {'_children',
               'data_id',
               'footnotes',
 

#### Create HeadingSection nodes (a custom "block node")

Why? HeadingSections allow us to work with a Heading and its body together.
- The out-of-the-box tree structure (which reflects the token structure) has Heading nodes as siblings of body-text nodes.
- We want to reason about them as a single unit, i.e., a `HeadingSection`.

A HeadingSection node has children:
- Heading
- Paragraph, List, Table
- (sub)HeadingSection
    - (sub)Heading
    - Paragraph, List, Table
    - (sub-sub)HeadingSection
        - ...
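
The grouping idea can be sketched without the tree library. In this simplified version the siblings are `(type, text)` tuples and `group_into_sections` is a hypothetical helper; it only builds flat sections, with nesting left to a separate step.

```python
def group_into_sections(siblings: list[tuple[str, str]]) -> list:
    """Group each Heading with the siblings that follow it, up to the next Heading."""
    result = []
    current = None  # the section currently being filled, if any
    for kind, text in siblings:
        if kind == "Heading":
            current = {"type": "HeadingSection", "children": [(kind, text)]}
            result.append(current)
        elif current is None:
            # Content before the first heading stays where it is.
            result.append((kind, text))
        else:
            current["children"].append((kind, text))
    return result


flat = [
    ("Paragraph", "This is the first paragraph with no heading."),
    ("Heading", "# Heading 1"),
    ("Paragraph", "First paragraph under Heading 1."),
    ("Heading", "## Heading 2"),
    ("List", "* Item"),
]
grouped = group_into_sections(flat)
print(len(grouped), [len(s["children"]) for s in grouped if isinstance(s, dict)])
```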


In [29]:
create_heading_sections(tree)

5

In [30]:
tree.print()

Tree<'Markdown tree'>
╰── Document D_1: '<mistletoe.block_token.Document with 37 children line_number=1 at 0x105b72ff0>'
    ├── BlankLine BL_1
    ├── Paragraph P_2 of length 45 across 1 children: 'This is the first paragraph with no heading.'
    ├── BlankLine BL_3
    ├── HeadingSection _S1_4 with 6 children
    │   ├── Heading H1_4: '# Heading 1'
    │   ├── BlankLine BL_5
    │   ├── Paragraph P_6 of length 146 across 1 children: 'First paragraph under Heading 1....(hidden)'
    │   ├── BlankLine BL_7
    │   ├── Paragraph P_8 of length 205 across 1 children: 'Second paragraph under Heading 1....(hidden)'
    │   ╰── BlankLine BL_9
    ├── HeadingSection _S2_10 with 9 children
    │   ├── Heading H2_10: '## Heading 2'
    │   ├── BlankLine BL_11
    │   ├── Paragraph P_12 of length 179 across 3 children: 'Paragraph 1 under Heading 2 with [a...(hidden)'
    │   ├── BlankLine BL_13
    │   ├── Paragraph P_14 of length 200 across 1 children: 'Paragraph 2 under Heading 2. Paragraph...

#### Nest HeadingSections

Nesting makes the tree mirror the heading hierarchy, so we can work with a HeadingSection and all its descendant HeadingSections as a single unit (i.e., a subtree).
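
A stack is one common way to do this kind of nesting. The sketch below is an assumption about how it could be done, not the actual `nest_heading_sections` code: each section is a dict with a `level`, and a section is moved under the nearest preceding section with a smaller level.

```python
def nest_sections(sections: list[dict]) -> tuple[list[dict], int]:
    """Return the top-level sections and the number of sections that were moved."""
    roots = []
    stack = []  # chain of currently "open" ancestor sections
    moved = 0
    for section in sections:
        # Close every open section that cannot be an ancestor of this one.
        while stack and stack[-1]["level"] >= section["level"]:
            stack.pop()
        if stack:
            stack[-1]["subsections"].append(section)
            moved += 1
        else:
            roots.append(section)
        stack.append(section)
    return roots, moved


# Heading levels of the demo document: H1, H2, H3, H1, H3.
demo_sections = [{"level": lv, "subsections": []} for lv in (1, 2, 3, 1, 3)]
roots, moved = nest_sections(demo_sections)
print(len(roots), moved)
```

With the demo's heading levels this moves three sections, which is consistent with the `3` that `nest_heading_sections(tree)` returns in this notebook, although what that return value counts is an assumption here. Note that a skipped level (H1 directly followed by H3) still nests under the nearest shallower heading.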

In [31]:
nest_heading_sections(tree)

3

In [32]:
tree.print()

Tree<'Markdown tree'>
╰── Document D_1: '<mistletoe.block_token.Document with 37 children line_number=1 at 0x105b72ff0>'
    ├── BlankLine BL_1
    ├── Paragraph P_2 of length 45 across 1 children: 'This is the first paragraph with no heading.'
    ├── BlankLine BL_3
    ├── HeadingSection _S1_4 with 7 children
    │   ├── Heading H1_4: '# Heading 1'
    │   ├── BlankLine BL_5
    │   ├── Paragraph P_6 of length 146 across 1 children: 'First paragraph under Heading 1....(hidden)'
    │   ├── BlankLine BL_7
    │   ├── Paragraph P_8 of length 205 across 1 children: 'Second paragraph under Heading 1....(hidden)'
    │   ├── BlankLine BL_9
    │   ╰── HeadingSection _S2_10 with 10 children
    │       ├── Heading H2_10: '## Heading 2'
    │       ├── BlankLine BL_11
    │       ├── Paragraph P_12 of length 179 across 3 children: 'Paragraph 1 under Heading 2 with [a...(hidden)'
    │       ├── BlankLine BL_13
    │       ├── Paragraph P_14 of length 200 across 1 children: 'Paragraph 2 unde

#### Add intro sentences to Lists and Tables

We want to keep the introductory (preceding) sentence together with each List and Table. That sentence is in a previous sibling of the List/Table, so let's save it on the List/Table node now, before splitting and chunking.
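
A simplified sketch of the idea, using dicts for sibling nodes: `last_sentence` and `attach_intros` are hypothetical helpers, the sentence splitting is deliberately naive, and nested lists (whose intro comes from the enclosing list item) are not handled.

```python
import re


def last_sentence(paragraph: str) -> str:
    # Naive split: a sentence ends at '.', '!', '?' or ':' followed by whitespace.
    sentences = re.split(r"(?<=[.!?:])\s+", paragraph.strip())
    return sentences[-1]


def attach_intros(siblings: list[dict]) -> int:
    """Store the last sentence of the preceding paragraph on each List/Table."""
    count = 0
    previous_paragraph = None
    for node in siblings:
        if node["type"] == "Paragraph":
            previous_paragraph = node
        elif node["type"] in ("List", "Table") and previous_paragraph is not None:
            node["intro"] = last_sentence(previous_paragraph["text"])
            count += 1
        # Only blank lines may sit between a paragraph and the block it introduces.
        if node["type"] not in ("Paragraph", "BlankLine"):
            previous_paragraph = None
    return count


demo_siblings = [
    {"type": "Paragraph", "text": "A paragraph under a heading. Following list has *no* intro."},
    {"type": "BlankLine"},
    {"type": "List", "text": "* Item 1\n* Item 2"},
    {"type": "Paragraph", "text": "Table intro:"},
    {"type": "Table", "text": "| a | b |"},
    {"type": "List", "text": "* List right after a table"},
]
intro_count = attach_intros(demo_siblings)
print(intro_count, demo_siblings[2]["intro"])
```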

In [33]:
tree.print()

Tree<'Markdown tree'>
╰── Document D_1: '<mistletoe.block_token.Document with 37 children line_number=1 at 0x105b72ff0>'
    ├── BlankLine BL_1
    ├── Paragraph P_2 of length 45 across 1 children: 'This is the first paragraph with no heading.'
    ├── BlankLine BL_3
    ├── HeadingSection _S1_4 with 7 children
    │   ├── Heading H1_4: '# Heading 1'
    │   ├── BlankLine BL_5
    │   ├── Paragraph P_6 of length 146 across 1 children: 'First paragraph under Heading 1....(hidden)'
    │   ├── BlankLine BL_7
    │   ├── Paragraph P_8 of length 205 across 1 children: 'Second paragraph under Heading 1....(hidden)'
    │   ├── BlankLine BL_9
    │   ╰── HeadingSection _S2_10 with 10 children
    │       ├── Heading H2_10: '## Heading 2'
    │       ├── BlankLine BL_11
    │       ├── Paragraph P_12 of length 179 across 3 children: 'Paragraph 1 under Heading 2 with [a...(hidden)'
    │       ├── BlankLine BL_13
    │       ├── Paragraph P_14 of length 200 across 1 children: 'Paragraph 2 unde

In [34]:
add_list_and_table_intros(tree)

7

In [35]:
list_nodes = tree.find_all(match=lambda n: n.data_type == "List")
list_nodes

[Node<"List L_17: '<mistletoe.block_token.List with 3 children line_number=17 loose=False start=None at 0x105bfda60>'", data_id=L_17>,
 Node<"List L_38: '<mistletoe.block_token.List with 4 children line_number=38 loose=False start=None at 0x105bff020>'", data_id=L_38>,
 Node<"List L_40: '<mistletoe.block_token.List with 1 child line_number=40 loose=False start=None at 0x105bff380>'", data_id=L_40>,
 Node<"List L_41: '<mistletoe.block_token.List with 1 child line_number=41 loose=False start=None at 0x105bff620>'", data_id=L_41>,
 Node<"List L_42: '<mistletoe.block_token.List with 1 child line_number=42 loose=False start=None at 0x105bff770>'", data_id=L_42>,
 Node<"List L_47: '<mistletoe.block_token.List with 6 children line_number=47 loose=False start=None at 0x105bff9b0>'", data_id=L_47>]

In [36]:
list_nodes[0].data["intro"]

'List intro:\n'

In [37]:
[n.data["intro"] for n in list_nodes]

['List intro:\n',
 'Following list has *no* intro.\n',
 'Item H1>H3.L1.2\n',
 'Item H1>H3.L1.subL.1\n',
 'Item H1>H3.L1.subL.2\n',
 'Paragraph before list with long list items.\n']

In [None]:
table_node = tree.find_first(match=lambda n: n.data_type == "Table")
table_node

Node<"Table T_26: '| H3.2.T1: header 1     | H3.2.T1: header 2     | H3.2.T1: header 3     | H3.2.T1: header 4     |'", data_id=T_26>

In [None]:
table_node.data["intro"]

'Table intro:\n'

#### Rendering the tree or subtrees

(Reminder in preparation for chunking)

In [40]:
md = render_tree_as_md(tree)
print(md)

This is the first paragraph with no heading.

# Heading 1

First paragraph under Heading 1. Paragraph H1.p1, sentence 2. Paragraph H1.p1, sentence 3. Paragraph H1.p1, sentence 4. Last sentence of 'H1.p1'.

Second paragraph under Heading 1. Paragraph H1.p2, sentence 2. Paragraph H1.p2, sentence 3. Paragraph H1.p2, sentence 4. Paragraph H1.p2, sentence 5. Paragraph H1.p2, sentence 6. Last sentence of 'H1.p2'.

## Heading 2

Paragraph 1 under Heading 2 with [a link](http://to.nowhere.com). Paragraph H2.p1, sentence 2. Paragraph H2.p1, sentence 3. Paragraph H2.p1, sentence 4. Last sentence of 'H2.p1'.

Paragraph 2 under Heading 2. Paragraph H2.p2, sentence 2. Paragraph H2.p2, sentence 3. Paragraph H2.p2, sentence 4. Paragraph H2.p2, sentence 5. Paragraph H2.p2, sentence 6. Last sentence of 'H2.p2'.

List intro:
* Item H2.3.L1.1
* Item H2.3.L1.2
* Item H2.3.L1.3

### Heading 3

Only paragraph under **Heading 3**. Paragraph H3.p1, sentence 2. Paragraph H3.p1, sentence 3. Paragraph H3.p1, sen

In [41]:
print(render_subtree_as_md(tree.system_root.first_child()))

This is the first paragraph with no heading.

# Heading 1

First paragraph under Heading 1. Paragraph H1.p1, sentence 2. Paragraph H1.p1, sentence 3. Paragraph H1.p1, sentence 4. Last sentence of 'H1.p1'.

Second paragraph under Heading 1. Paragraph H1.p2, sentence 2. Paragraph H1.p2, sentence 3. Paragraph H1.p2, sentence 4. Paragraph H1.p2, sentence 5. Paragraph H1.p2, sentence 6. Last sentence of 'H1.p2'.

## Heading 2

Paragraph 1 under Heading 2 with [a link](http://to.nowhere.com). Paragraph H2.p1, sentence 2. Paragraph H2.p1, sentence 3. Paragraph H2.p1, sentence 4. Last sentence of 'H2.p1'.

Paragraph 2 under Heading 2. Paragraph H2.p2, sentence 2. Paragraph H2.p2, sentence 3. Paragraph H2.p2, sentence 4. Paragraph H2.p2, sentence 5. Paragraph H2.p2, sentence 6. Last sentence of 'H2.p2'.

List intro:
* Item H2.3.L1.1
* Item H2.3.L1.2
* Item H2.3.L1.3

### Heading 3

Only paragraph under **Heading 3**. Paragraph H3.p1, sentence 2. Paragraph H3.p1, sentence 3. Paragraph H3.p1, sen

##### Comparing the input and output markdown

In [42]:
# TBD

##### Checking token structure

Expect no mismatches since we haven't made any significant changes to the tree.
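
What such a consistency check might look like, in a library-free form: the sketch compares, for every node, the tokens held by its child nodes with the children recorded on its own token. `wrap_token` and `find_mismatches` are hypothetical names, and the real `tokens_vs_tree_mismatches` may check more than this.

```python
def wrap_token(token: dict) -> dict:
    """Build a toy node tree that mirrors a toy token tree."""
    return {"token": token, "children": [wrap_token(t) for t in token["children"]]}


def find_mismatches(node: dict, path: str = "root") -> list[str]:
    """Return the paths of nodes whose children disagree with their token's children."""
    problems = []
    node_side = [id(child["token"]) for child in node["children"]]
    token_side = [id(t) for t in node["token"]["children"]]
    if node_side != token_side:
        problems.append(path)
    for index, child in enumerate(node["children"]):
        problems.extend(find_mismatches(child, f"{path}/{index}"))
    return problems


doc_token = {
    "type": "Document",
    "children": [
        {"type": "Heading", "children": []},
        {"type": "Paragraph", "children": []},
    ],
}
demo_tree = wrap_token(doc_token)
in_sync = find_mismatches(demo_tree)

# Change the tree only: the Document token still lists two children.
demo_tree["children"].pop()
out_of_sync = find_mismatches(demo_tree)
print(in_sync, out_of_sync)
```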

In [43]:
mismatches = tokens_vs_tree_mismatches(tree)
pprint(mismatches, sort_dicts=False, width=120)
assert len(mismatches) == 0

defaultdict(<class 'list'>, {})


### Chunking

In [44]:
from src.ingestion.markdown_chunking import *

Chunking needs a few utilities.

- Rendering a list of arbitrary nodes to markdown:

In [45]:
print(nodes_as_markdown([tree['L_38'], tree['BL_45'], tree['T_26']]))

* Item H1>H3.L1.1
* Item H1>H3.L1.2
  * Item H1>H3.L1.subL.1
    * Item H1>H3.L1.subL.2
      * Item H1>H3.L1.subL.3
* Item H1>H3.L1.1
* Item H1>H3.L1.2

| H3.2.T1: header 1     | H3.2.T1: header 2     | H3.2.T1: header 3     | H3.2.T1: header 4     |
| --------------------- | --------------------- | --------------------- | --------------------- |
| H3.2.T1: row 1, col 1 | H3.2.T1: row 1, col 2 | H3.2.T1: row 1, col 3 | H3.2.T1: row 1, col 4 |
| H3.2.T1: row 2, col 1 | H3.2.T1: row 2, col 2 | H3.2.T1: row 2, col 3 | H3.2.T1: row 2, col 4 |
| H3.2.T1: row 3, col 1 | H3.2.T1: row 3, col 2 | H3.2.T1: row 3, col 3 | H3.2.T1: row 3, col 4 |



- Copying a subtree into its own tree.
    - Think of a chunk as a subtree, except that we can "summarize" subtrees to make them fit into a chunk.
    - *Summarize* = create a chunk with the full content of the subtree, and add a short "summary" text to the node so that the text can be used when chunking upper-level subtrees.
    - Tech detail: `copy_subtree()` makes a deep copy, so the new subtree does not reference the original `node.data` and `node.data.tokens` (the exception being the frozen tokens associated with `freeze_token_children`). This allows the new subtree to be modified without affecting the original tree.
        - Used for splitting Lists/Tables into chunks with partial Lists/Tables
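
The deep-copy-with-an-exception behaviour can be illustrated with the standard `copy` module. This is a sketch under assumptions: nodes are dicts with `data`, `token` and `children` keys, and `copy_node` is a hypothetical helper rather than the actual `copy_subtree()`.

```python
import copy


def copy_node(node: dict) -> dict:
    """Copy a node and its descendants; frozen token children stay shared."""
    if node["data"].get("freeze_token_children"):
        # Shallow copy: a new token dict whose 'children' list is the original one.
        new_token = copy.copy(node["token"])
    else:
        new_token = copy.deepcopy(node["token"])
    return {
        "data": copy.deepcopy(node["data"]),
        "token": new_token,
        "children": [copy_node(child) for child in node["children"]],
    }


paragraph = {
    "data": {"freeze_token_children": True},
    "token": {"type": "Paragraph", "children": [{"type": "RawText", "content": "item text"}]},
    "children": [],
}
list_item = {"data": {}, "token": {"type": "ListItem", "children": []}, "children": [paragraph]}

clone = copy_node(list_item)
clone["data"]["summary"] = "shortened"  # only the copy is modified
print("summary" in list_item["data"])
```

Sharing the frozen children avoids duplicating span tokens that are never edited, while everything that chunking may modify gets its own copy.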

In [46]:
node = tree['L_47']
subtree = copy_subtree(node)
subtree.print()

Tree<'L_47 subtree'>
╰── List L_47: '<mistletoe.block_token.List with 6 children line_number=47 loose=False start=None at 0x105b85970>'
    ├── ListItem LI_47: "'* H1>H3.L2.1 -- Paragraph L2.item1, sentence 2. Paragraph L2.item1, sentence 3. Paragraph L2.item1, '"
    │   ╰── Paragraph P_47 of length 110 across 1 children: 'H1>H3.L2.1 -- Paragraph L2.item1,...(hidden)'
    ├── ListItem LI_48: "'* H1>H3.L2.2 -- Paragraph L2.item2, sentence 2. Paragraph L2.item2, sentence 3. Paragraph L2.item2, '"
    │   ╰── Paragraph P_48 of length 142 across 1 children: 'H1>H3.L2.2 -- Paragraph L2.item2,...(hidden)'
    ├── ListItem LI_49: "'* H1>H3.L2.3 -- Paragraph L2.item3, sentence 2. Paragraph L2.item3, sentence 3. Paragraph L2.item3, '"
    │   ╰── Paragraph P_49 of length 174 across 1 children: 'H1>H3.L2.3 -- Paragraph L2.item3,...(hidden)'
    ├── ListItem LI_50: "'* H1>H3.L2.4 -- Paragraph L2.item1, sentence 2. Paragraph L2.item1, sentence 3. Paragraph L2.item1, '"
    │   ╰── Paragraph P_50 

- Getting the headings (and potentially other context) is easy by leveraging the hierarchical structure:

In [47]:
get_parent_headings_md(tree['LI_52'])

['# Second H1 without a paragraph', '### Skip to H3']
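
One way such a lookup could work is to walk the parent pointers upward and collect the heading of every enclosing section. `DemoNode` and `parent_headings` below are hypothetical; the real function operates on tree nodes and HeadingSections.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoNode:
    """Toy node: 'heading' is set only for nodes that act as a HeadingSection."""
    heading: Optional[str]
    parent: Optional["DemoNode"] = None


def parent_headings(node: DemoNode) -> list[str]:
    """Collect the headings of all enclosing sections, outermost first."""
    headings = []
    current = node.parent
    while current is not None:
        if current.heading is not None:
            headings.append(current.heading)
        current = current.parent
    headings.reverse()
    return headings


document = DemoNode(None)
h1_section = DemoNode("# Second H1 without a paragraph", document)
h3_section = DemoNode("### Skip to H3", h1_section)
list_node = DemoNode(None, h3_section)
list_item_node = DemoNode(None, list_node)
print(parent_headings(list_item_node))
```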

#### Chunk tree

In [None]:
all_chunks=chunk_tree(tree)
pprint(list(all_chunks.values()), sort_dicts=False, width=140)
print("Total characters", sum(len(c.markdown) for c in all_chunks.values()))
len(all_chunks)

Splitting into chunks: Document D_1 with children: P_2, _S1_4, _S1_32
Splitting into chunks: HeadingSection _S1_4 with children: H1_4, P_6, P_8, _S2_10
Splitting into chunks: HeadingSection _S2_10 with children: H2_10, P_12, P_14, P_16, L_17, _S3_21
Splitting into chunks: HeadingSection _S3_21 with children: H3_21, P_23, P_25, T_26
==> Chunked 0:T_26: 1 nodes, len 490: '| H3.2.T1: header 1 | H3.2.T1: header 2 | H3.2.T1: header 3 | H3.2.T1: header 4 |\n| --------------------- |...'
==> Chunked 1:H3_21: 7 nodes, len 268: '### Heading 3\n'
==> Chunked 2:H2_10: 10 nodes, len 493: '## Heading 2\n'
==> Chunked 3:H1_4: 7 nodes, len 399: '# Heading 1\n'
Splitting into chunks: HeadingSection _S1_32 with children: H1_32, _S3_34
Splitting into chunks: HeadingSection _S3_34 with children: H3_34, P_36, L_38, P_46, L_47, P_54
Splitting into chunks: List L_47
==> Chunked 4:L_47[0]:LI_47: 1 nodes, len 303: '(Paragraph before list with long list items.)\n'
==> Chunked 5:L_47[1]:LI_49: 1 nodes, len 335:

10

In [49]:
for k,c in all_chunks.items():
    print(k, len(c.markdown))

0:T_26 490
1:H3_21 268
2:H2_10 493
3:H1_4 399
4:L_47[0]:LI_47 303
5:L_47[1]:LI_49 335
6:L_47[2]:LI_51 367
7:H3_34 480
8:H1_32 68
9:BL_1 130
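
The log above shows the long list `L_47` being split into three chunks, each starting with its intro in parentheses. The sketch below shows one greedy way to get that kind of split; `split_list` and the `max_chars` budget are assumptions for illustration, and the actual `chunk_tree` logic is not shown in this notebook.

```python
def split_list(items: list[str], intro: str, max_chars: int) -> list[str]:
    """Greedily pack list items into chunks, each prefixed with the intro."""
    prefix = f"({intro})\n"
    chunks = []
    current = []  # items collected for the chunk being built
    for item in items:
        candidate = prefix + "\n".join(current + [item])
        if current and len(candidate) > max_chars:
            # The item does not fit: close the current chunk and start a new one.
            chunks.append(prefix + "\n".join(current))
            current = [item]
        else:
            current.append(item)
    if current:
        chunks.append(prefix + "\n".join(current))
    return chunks


demo_items = [f"* item {i} " + "x" * 20 for i in range(1, 7)]
demo_chunks = split_list(
    demo_items, "Paragraph before list with long list items.", max_chars=110
)
for chunk in demo_chunks:
    print(len(chunk), chunk.count("* item"))
```

An item that is longer than the budget on its own still becomes a chunk of its own in this sketch; it is never split further.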
