In [1]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [2]:
text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of at most 1000 characters, overlapping by up to 200 characters.
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

In [4]:
# Load the JSONL file: one JSON object per line, with "city" and "guide" keys
import json
guides = {}
with open("city_guides.jsonl", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        guides[data['city']] = data['guide']

In [11]:
guide = guides[next(iter(guides))]  # first guide in file order (Agra here)
guide = '=Agra=\n\n' + guide  # prepend a top-level Wikivoyage heading
guide

'=Agra=\n\n{{pagebanner|Agra banner Taj Mahal.jpg|unesco=yes}}\n\'\'\'Agra\'\'\' (Hindi: आगरा \'\'Āgrā\'\') is the city of the Taj Mahal, in the north [[India]]n state of [[Uttar Pradesh]], some 200&nbsp;km from [[Delhi]].\n\nAgra has three [[UNESCO World Heritage List|UNESCO World Heritage]] sites, the \'\'\'Taj Mahal\'\'\' and the \'\'\'Agra Fort\'\'\' in the city and \'\'\'[[Fatehpur Sikri]]\'\'\' 40 km away. There are also many other buildings and tombs from Agra\'s days of glory as the capital of the [[Mughal Empire]].\n\nBesides these three sites, the city has little else to recommend it. Pollution, especially smog and litter, is rampant and visitors are pestered by swarms of touts and hawkers at every monument, besides the inner Taj Mahal which, once you are in, is free of scams and touts. The sites are some of the wonders of the world and no trip to India is complete without at least one visit to the Taj. For the vast majority of visitors, a single day in Agra is more than enou

In [9]:
texts = text_splitter.create_documents([guide])
texts

[Document(metadata={}, page_content="{{pagebanner|Agra banner Taj Mahal.jpg|unesco=yes}}\n'''Agra''' (Hindi: आगरा ''Āgrā'') is the city of the Taj Mahal, in the north [[India]]n state of [[Uttar Pradesh]], some 200&nbsp;km from [[Delhi]].\n\nAgra has three [[UNESCO World Heritage List|UNESCO World Heritage]] sites, the '''Taj Mahal''' and the '''Agra Fort''' in the city and '''[[Fatehpur Sikri]]''' 40 km away. There are also many other buildings and tombs from Agra's days of glory as the capital of the [[Mughal Empire]].\n\nBesides these three sites, the city has little else to recommend it. Pollution, especially smog and litter, is rampant and visitors are pestered by swarms of touts and hawkers at every monument, besides the inner Taj Mahal which, once you are in, is free of scams and touts. The sites are some of the wonders of the world and no trip to India is complete without at least one visit to the Taj. For the vast majority of visitors, a single day in Agra is more than enough.

In [15]:
len(texts)

96
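
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding-window splitter. It is a simplified stand-in, not the recursive separator-based algorithm that `RecursiveCharacterTextSplitter` uses, and `fixed_width_chunks` is a hypothetical helper introduced only for illustration.

```python
def fixed_width_chunks(text, chunk_size, chunk_overlap):
    """Hypothetical helper: plain sliding window, not LangChain's recursive algorithm."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = fixed_width_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last 4 characters of the previous one. The real splitter prefers to break on separators such as paragraph breaks, so its chunks are usually shorter than `chunk_size` and the overlap is not always exact.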

In [37]:
from langchain_text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
    ("####", "Header 4"),
    ("#####", "Header 5"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)


In [38]:
md_header_splits = markdown_splitter.split_text(guide)
md_header_splits

[Document(metadata={}, page_content='=Agra=  \n{{pagebanner|Agra banner Taj Mahal.jpg|unesco=yes}}\n\'\'\'Agra\'\'\' (Hindi: आगरा \'\'Āgrā\'\') is the city of the Taj Mahal, in the north [[India]]n state of [[Uttar Pradesh]], some 200&nbsp;km from [[Delhi]].  \nAgra has three [[UNESCO World Heritage List|UNESCO World Heritage]] sites, the \'\'\'Taj Mahal\'\'\' and the \'\'\'Agra Fort\'\'\' in the city and \'\'\'[[Fatehpur Sikri]]\'\'\' 40 km away. There are also many other buildings and tombs from Agra\'s days of glory as the capital of the [[Mughal Empire]].  \nBesides these three sites, the city has little else to recommend it. Pollution, especially smog and litter, is rampant and visitors are pestered by swarms of touts and hawkers at every monument, besides the inner Taj Mahal which, once you are in, is free of scams and touts. The sites are some of the wonders of the world and no trip to India is complete without at least one visit to the Taj. For the vast majority of visitors, a 

In [16]:
len(md_header_splits)

1
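
A likely reason for the single `Document`: the guide uses Wikivoyage-style `=Title=` headings, while the splitter was configured with `#` prefixes. The sketch below checks this on a made-up miniature sample (not the real guide) by counting both heading styles with regular expressions.

```python
import re

# Made-up miniature sample in Wikivoyage heading style
sample = "=Agra=\n\nIntro text.\n\n==Understand==\n\nMore text.\n\n===History===\n"

# Lines that look like markdown headings ("# Title" up to "##### Title")
markdown_headings = re.findall(r"^#{1,5} .+$", sample, flags=re.MULTILINE)

# Lines that look like wiki headings: the same run of "=" on both sides
wiki_headings = re.findall(
    r"^(={1,5})[ \t]*([^=\n]+?)[ \t]*\1[ \t]*$", sample, flags=re.MULTILINE
)

print(markdown_headings)  # []
print(wiki_headings)      # [('=', 'Agra'), ('==', 'Understand'), ('===', 'History')]
```

With no `#` headings to find, a header-based markdown splitter has nothing to split on, so the headings need converting first.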

In [27]:
list(range(4, -1, -1))

[4, 3, 2, 1, 0]

In [32]:
import re

def convert_to_markdown_headings(text):
  """Converts Wikivoyage headings (=Title= to =====Title=====) to markdown headings."""
  for i in range(5, 0, -1):  # Iterate from 5 to 1
    # Match whole lines only, so "=" inside templates (e.g. unesco=yes) is left alone
    pattern = rf"^{'=' * i}[ \t]*([^=\n]+?)[ \t]*{'=' * i}[ \t]*$"
    # Use \1 to refer to the captured heading title
    replacement = "#" * i + r" \1"
    text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
  return text
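
A heading pattern that is not anchored to whole lines can also match `=` characters inside templates. The sketch below compares an unanchored and a line-anchored level-1 pattern on a made-up sample; the sample text and variable names are illustrative assumptions.

```python
import re

# Made-up sample: one level-1 heading and one listing template containing "="
sample = "=Agra=\n{{see | name=Taj Mahal | alt=Crown of Palaces}}\n"

# Unanchored: any text between two "=" on the same line is treated as a heading
unanchored = re.sub(r"(=)(.*?)(=)", r"# \2", sample)

# Anchored: the "=" markers must open and close the whole line
anchored = re.sub(
    r"^=[ \t]*([^=\n]+?)[ \t]*=[ \t]*$", r"# \1", sample, flags=re.MULTILINE
)

print(unanchored)
# # Agra
# {{see | name# Taj Mahal | altCrown of Palaces}}
print(anchored)
# # Agra
# {{see | name=Taj Mahal | alt=Crown of Palaces}}
```

Both versions convert the heading, but only the anchored one leaves the template parameters intact.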

In [33]:
markdown_text = convert_to_markdown_headings(guide)
print(markdown_text)

# Agra

{{pagebanner|Agra banner Taj Mahal.jpg|unesco=yes}}
'''Agra''' (Hindi: आगरा ''Āgrā'') is the city of the Taj Mahal, in the north [[India]]n state of [[Uttar Pradesh]], some 200&nbsp;km from [[Delhi]].

Agra has three [[UNESCO World Heritage List|UNESCO World Heritage]] sites, the '''Taj Mahal''' and the '''Agra Fort''' in the city and '''[[Fatehpur Sikri]]''' 40 km away. There are also many other buildings and tombs from Agra's days of glory as the capital of the [[Mughal Empire]].

Besides these three sites, the city has little else to recommend it. Pollution, especially smog and litter, is rampant and visitors are pestered by swarms of touts and hawkers at every monument, besides the inner Taj Mahal which, once you are in, is free of scams and touts. The sites are some of the wonders of the world and no trip to India is complete without at least one visit to the Taj. For the vast majority of visitors, a single day in Agra is more than enough.

## Understand

While the heyday 

In [39]:
md_header_splits = markdown_splitter.split_text(markdown_text)
md_header_splits

[Document(metadata={'Header 1': 'Agra'}, page_content="{{pagebanner|Agra banner Taj Mahal.jpg|unesco=yes}}\n'''Agra''' (Hindi: आगरा ''Āgrā'') is the city of the Taj Mahal, in the north [[India]]n state of [[Uttar Pradesh]], some 200&nbsp;km from [[Delhi]].  \nAgra has three [[UNESCO World Heritage List|UNESCO World Heritage]] sites, the '''Taj Mahal''' and the '''Agra Fort''' in the city and '''[[Fatehpur Sikri]]''' 40 km away. There are also many other buildings and tombs from Agra's days of glory as the capital of the [[Mughal Empire]].  \nBesides these three sites, the city has little else to recommend it. Pollution, especially smog and litter, is rampant and visitors are pestered by swarms of touts and hawkers at every monument, besides the inner Taj Mahal which, once you are in, is free of scams and touts. The sites are some of the wonders of the world and no trip to India is complete without at least one visit to the Taj. For the vast majority of visitors, a single day in Agra is

In [40]:
len(md_header_splits)

43

In [44]:
md_header_splits[7].metadata

{'Header 1': 'Agra',
 'Header 2': 'Get in',
 'Header 3': 'By train',
 'Header 4': 'Lines'}

In [42]:
def combine_metadata(metadata):
  """Combines metadata values into a comma-separated string."""
  values = list(metadata.values())
  return ", ".join(values)

In [45]:
combine_metadata(md_header_splits[7].metadata)

'Agra, Get in, By train, Lines'
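
One possible use of the combined string (an assumption, not something this notebook shows) is to prefix each section with its header path so the chunk keeps its context when stored or embedded on its own. `with_header_prefix` is a hypothetical helper, and the section text is a placeholder.

```python
def combine_metadata(metadata):
    """Combines metadata values into a comma-separated string."""
    return ", ".join(metadata.values())

def with_header_prefix(metadata, page_content):
    """Hypothetical helper: put the header path on its own line above the section text."""
    header_path = combine_metadata(metadata)
    return f"{header_path}\n\n{page_content}" if header_path else page_content

metadata = {
    "Header 1": "Agra",
    "Header 2": "Get in",
    "Header 3": "By train",
    "Header 4": "Lines",
}
prefixed = with_header_prefix(metadata, "Section text goes here.")
print(prefixed)
# Agra, Get in, By train, Lines
#
# Section text goes here.
```

Sections with empty metadata are returned unchanged, so text that precedes the first heading gets no stray blank prefix.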

In [46]:
md_header_splits[7].page_content

"* '''Delhi to Agra''' — Close to 20 trains connect [[Delhi]] and Agra each day with journey times varying from 2-5 hr. The best options include the ''Rani Kamalapati Vande Bharat Express'' (fastest), ''Rani Kamalapati Shatabdi Express'' (departs New Delhi at 6:15AM arriving Agra Cantt at 8:12AM; departs Agra Cantt at 8:30PM arriving New Delhi at 10:30PM, daily except Friday; meal and water included in air-con carriage) and the ''Taj Express'' (departs Hazrat Nizamuddin at 7:15AM arriving Agra Cantt at 10:07AM; departs Agra Cantt at 6:55PM arriving Hazrat Nizamuddin at 10PM, daily).\n* '''Agra to Jaipur''' - The journey to Jaipur (station code: JP) takes around 4 hr by train no. 2988 which leaves Agra Fort at 6:25PM and reaches Jaipur at around 10:20PM.\nAlso train number 2965 from Agra Cantonment to Jaipur at 5:40PM. The train arrives at 10:15PM. ₹300 air-con carriage.  \n* The '''Luxury train''' — ''[[Palace on Wheels]]'' stops at Agra on its 8-day round trip of tourist destinations 