# PDF parsing

We use Marker-PDF, which is built on Surya-OCR, to parse the PDF into Markdown.

In [None]:
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr
!pip install marker-pdf

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.5).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [None]:
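from marker.convert import convert_single_pdf
from marker.models import load_all_models

fpath = "regles 40K ENGLISH.pdf"
# Load the detection, layout, reading-order, recognition and table models once
model_lst = load_all_models()
# Returns the Markdown text, a dict of extracted images, and the conversion metadata
full_text, images, out_meta = convert_single_pdf(fpath, model_lst)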

Loaded detection model vikp/surya_det3 on device cuda with dtype torch.float16
Loaded detection model vikp/surya_layout3 on device cuda with dtype torch.float16
Loaded reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded recognition model vikp/surya_rec2 on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Loaded recognition model vikp/surya_tablerec on device cuda with dtype torch.float16


Detecting bboxes: 100%|██████████| 15/15 [00:14<00:00,  1.01it/s]
Recognizing Text: 100%|██████████| 3/3 [00:05<00:00,  1.85s/it]
Detecting bboxes: 100%|██████████| 10/10 [00:29<00:00,  2.91s/it]
Finding reading order: 100%|██████████| 10/10 [00:10<00:00,  1.04s/it]
Recognizing tables: 100%|██████████| 1/1 [00:00<00:00,  2.51it/s]


In [None]:
images

{'1_image_0.png': <PIL.Image.Image image mode=RGB size=625x901>,
 '2_image_0.png': <PIL.Image.Image image mode=RGB size=192x216>,
 '3_image_0.png': <PIL.Image.Image image mode=RGB size=76x79>,
 '3_image_1.png': <PIL.Image.Image image mode=RGB size=102x98>,
 '3_image_2.png': <PIL.Image.Image image mode=RGB size=147x51>,
 '3_image_3.png': <PIL.Image.Image image mode=RGB size=107x89>,
 '4_image_0.png': <PIL.Image.Image image mode=RGB size=362x742>,
 '5_image_0.png': <PIL.Image.Image image mode=RGB size=544x395>,
 '5_image_1.png': <PIL.Image.Image image mode=RGB size=78x197>,
 '6_image_0.png': <PIL.Image.Image image mode=RGB size=242x493>,
 '7_image_0.png': <PIL.Image.Image image mode=RGB size=355x396>,
 '7_image_1.png': <PIL.Image.Image image mode=RGB size=355x407>,
 '7_image_2.png': <PIL.Image.Image image mode=RGB size=98x233>,
 '8_image_0.png': <PIL.Image.Image image mode=RGB size=85x17>,
 '8_image_1.png': <PIL.Image.Image image mode=RGB size=49x50>,
 '9_image_0.png': <PIL.Image.Image i

In [None]:
# Save full_text to a .md file, the images to an 'images' folder, and the metadata dict to a JSON file

import json
import os

# Create the 'images' directory if it doesn't exist
os.makedirs('images', exist_ok=True)

# Save the full_text to a .md file
with open('full_text.md', 'w', encoding='utf-8') as f:
    f.write(full_text)

# Save the images (the dict keys are the file names)
for name, image in images.items():
    image.save(os.path.join('images', name))

# Save the metadata to a JSON file
with open('metadata.json', 'w', encoding='utf-8') as json_file:
    json.dump(out_meta, json_file, indent=4)

# Chunking

We are going to do a small cleanup and then split the Markdown text into chunks.

In [None]:
!pip install -qU langchain-text-splitters

## Cleaning

The Markdown contains image links, which we remove first.

In [None]:
import re

clean_text = re.sub(r'!\[.*?\]\([^)]*\)', '', full_text)
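
As a quick sanity check of the pattern above, here is a minimal sketch on a made-up sample string (the sample text and the `IMAGE_LINK` name are illustrative, not taken from the PDF). It shows that Markdown image links of the form `![alt](target)` are removed while regular links, which lack the leading `!`, are left alone.

```python
import re

# Same pattern as in the cell above: "![alt](target)"
IMAGE_LINK = re.compile(r'!\[.*?\]\([^)]*\)')

# Hypothetical sample; the real input is full_text
sample = "Intro text ![](1_image_0.png) and a [regular link](https://example.com)."
cleaned = IMAGE_LINK.sub('', sample)

print(cleaned)
```

Note that removing a link leaves the surrounding whitespace in place, so a few double spaces or blank lines may remain in `clean_text`.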

In [None]:
print(clean_text)


# Core Rules

'We are beset on all sides by vile predatory aliens and sedition gnaws at us from within; in this dark hour the best we can do is look to our wargear and pray to our gods.'
- Skolak a'Trellar IV, Imperial Commander



200 STORE
# Introduction

++ THERE IS NO TIME FOR PEACE. NO RESPITE. NO FORGIVENESS. THERE IS ONLY WAR. ++
Welcome to the Warhammer 40,000 Core Rules! The following pages contain everything you need to know in order to wage glorious battle across the war-torn galaxy of the 41st Millennium. Warhammer 40,000 is a tabletop war game in which players command armies of Citadel miniatures and attempt to defeat their opponent through a mixture of skill, tactics and luck. Storytelling is at the core of Warhammer 40,000, with the rules designed to bring to life the epic conflicts between the forces of Mankind, aliens and daemons in the grim darkness of the far future. The purpose of the game is for all players to have an enjoyable shared experience, putting their tac

In [None]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(clean_text)


In [None]:
md_header_splits[12]

Document(metadata={'Header 1': 'Core Concepts', 'Header 2': 'Battlefield'}, page_content='Battles of Warhammer 40,000 are fought on rectangular battlefields. This can be any surface upon which the models can stand - a dining table, for example, or the floor. Your mission will guide you as to the size of battlefield required.  \n#### Terrain Features  \nThe scenery on a battlefield can be represented by models from  \nthe Warhammer 40,000 range. These models are called terrain features to differentiate them from the models that make up an army. Terrain features are set up on the battlefield before the battle begins. You can find out more about terrain features on pages 44-48. Unless the mission you are playing instructs you otherwise, you should feel free to create an exciting battlefield using any terrain features from your collection.')

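Here, the chunks are cut according to header levels. As is always the case with PDF parsing, there can be discrepancies between the PDF layout and its transcription into Markdown. Note also that only header levels 1 to 3 are used for splitting, so deeper headers such as `#### Terrain Features` stay inside the chunk text.

Next, we further split the chunks that are too long.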
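
To make the header-based behaviour concrete, here is a simplified, standard-library sketch of the idea. It is not the implementation of `MarkdownHeaderTextSplitter`; the `split_by_headers` function and the sample text are hypothetical and only mimic what the output above suggests: headers up to a given level become metadata, and deeper headers remain in the chunk text.

```python
import re

def split_by_headers(markdown, max_level=3):
    """Group lines under the most recent headers of level <= max_level."""
    header_re = re.compile(r'^(#{1,6})\s+(.*)$')
    current = {}   # header level -> title currently in effect
    buffer = []    # lines collected for the chunk being built
    chunks = []

    def flush():
        # Emit the buffered lines as one chunk with the headers in effect
        text = "\n".join(buffer).strip()
        if text:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(current.items())}
            chunks.append({"metadata": metadata, "page_content": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = header_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header replaces headers of the same or deeper level
            for lvl in [l for l in current if l >= level]:
                del current[lvl]
            current[level] = match.group(2).strip()
        else:
            # Body text and headers deeper than max_level stay in the chunk
            buffer.append(line)
    flush()
    return chunks

sample = (
    "# Core Concepts\n"
    "## Battlefield\n"
    "Battles are fought on battlefields.\n"
    "#### Terrain Features\n"
    "Scenery text.\n"
    "## Units\n"
    "Unit text."
)
chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["metadata"], repr(chunk["page_content"]))
```

In this sketch the level-4 header ends up inside the `Battlefield` chunk, which is the same pattern visible in `md_header_splits[12]`.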

In [None]:
# Char-level splits
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 1000
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(md_header_splits)

In [None]:
splits[141]

Document(metadata={'Header 1': 'Datasheets'}, page_content='#### 7 Wargear Options  \nSome datasheets have a bullet-pointed list of wargear options. When you include such a unit in your army, you can use these options to change the weapons and other wargear of models in the unit. The order you use these options in does not matter, but each can only be used once.  \n#### Leadership Tests  \nIf a rule requires you to take a Leadership test for a unit, roll 2D6: if the total is greater than or equal to the best Leadership characteristic in that unit, that test is passed. Otherwise, it is failed.  \n#### Random Characteristics')
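
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to cut on separators such as paragraph breaks, newlines and spaces, so its chunks are usually shorter than `chunk_size`. The `sliding_chunks` function and the toy string are hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

toy_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)
```

With a size of 4 and an overlap of 1, each window starts on the last character of the previous one. The notebook's values (1000 and 30) play the same roles at a larger scale.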

In [None]:
# Serialize each Document (metadata and page_content) to a plain dict
with open('chunks.json', 'w', encoding='utf-8') as json_file:
    json.dump([doc.model_dump() for doc in splits], json_file, indent=4)