## Parse PDF
Step 1 is to parse the PDF. A PDF is essentially just a collection of letters with page coordinates, so we have to infer the structure, formatting, and even word groupings.

## Environment Setup
Set constants and load secrets from a .env file so we don't store them in the notebook.

This step uses the following libraries:
|Library|License|
|-|-|
| [Docling](https://github.com/docling-project/docling) | MIT |
| [EasyOCR](https://github.com/JaidedAI/EasyOCR) | Apache 2.0 |
| [python-dotenv](https://github.com/theskumar/python-dotenv) | BSD-3-Clause |
| [huggingface_hub](https://github.com/huggingface/huggingface_hub) | Apache 2.0 |
| [transformers](https://github.com/huggingface/transformers) | Apache 2.0 |



In [1]:
import json, os
from pathlib import Path

from dotenv import load_dotenv

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

from huggingface_hub import login
from transformers import AutoTokenizer

In [6]:
DOCUMENT    = "FM5_0"
PDF_PATH    = Path("../pdfs/FM5_0.pdf")
BASE_MODEL  = Path("QuantFactory/Llama-3.2-1B-GGUF")
GGUF_FILE   = "Llama-3.2-1B.Q8_0.gguf"
CACHE_DIR   = "hf_cache"

load_dotenv()
HF_API_KEY = os.environ["HF_API_KEY"]
login(HF_API_KEY)

MODEL_DIR    = DOCUMENT / BASE_MODEL / "lora"
DATA_DIR     = DOCUMENT / BASE_MODEL / "data"
CHUNKED_DATA = DATA_DIR / "chunked"  / "chunked.jsonl"
QA_DATA      = DATA_DIR / "qa"       / "qa_pairs.jsonl"

os.makedirs(CHUNKED_DATA.parent, exist_ok=True)
os.makedirs(QA_DATA.parent,      exist_ok=True)
os.makedirs(CACHE_DIR,           exist_ok=True)
os.makedirs(MODEL_DIR,           exist_ok=True)
os.makedirs(DATA_DIR,            exist_ok=True)
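
The `os.environ["HF_API_KEY"]` lookup above raises a bare `KeyError` when the variable is missing. A minimal sketch of a friendlier check is below; `require_env` and `DEMO_KEY` are hypothetical names used only for illustration, and it assumes `load_dotenv()` (or the shell) has already populated the environment.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a readable message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it before running"
        )
    return value


# Demo with a throwaway variable so the sketch runs without a real secret.
os.environ["DEMO_KEY"] = "abc"
print(require_env("DEMO_KEY"))
```

In the notebook this would be used as `HF_API_KEY = require_env("HF_API_KEY")` right after `load_dotenv()`.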

## Tunables
Chunk size is the number of tokens in each section of the document that we are going to feed into the LLM.

The size of the chunks was ultimately dictated by the memory available for training. Bigger might be better in a production system, but that depends on how the LLM responds to long contexts. Llama 3.2 advertises a context window of 128k tokens, but performance often drops off for contexts larger than about 10% of that window. This is an area where more experimentation is needed.

The stride is set equal to the chunk size (no overlap) so that I could get through the entire document. Ideally we would have some overlap so the LLM can learn how adjacent chunks fit with each other. I would start at about 25% overlap and tune from there if I had the resources.
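
To make the relationship between overlap and stride concrete, here is a small sketch. The helper names `stride_for_overlap` and `window_starts` are hypothetical (they are not part of this notebook), and the second helper mirrors the stop-at-end logic of the chunking loop used later.

```python
def stride_for_overlap(chunk_size: int, overlap: float) -> int:
    """Return the step between chunk starts for a fractional overlap in [0, 1)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    return max(1, round(chunk_size * (1.0 - overlap)))


def window_starts(n_tokens: int, chunk_size: int, stride: int) -> list[int]:
    """Start offsets of each chunk, stopping once a chunk reaches the end."""
    starts = []
    for start in range(0, n_tokens, stride):
        starts.append(start)
        if start + chunk_size >= n_tokens:
            break
    return starts


print(stride_for_overlap(512, 0.0))   # 512: no overlap, as used in this notebook
print(stride_for_overlap(512, 0.25))  # 384: each chunk shares 128 tokens with the previous one
print(window_starts(1000, 512, 384))  # [0, 384, 768]
```

With 25% overlap the same document produces roughly a third more chunks, which is the memory and time cost that kept the overlap at zero here.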

In [7]:
chunks       = []
chunk_size   = 512
chunk_stride = 512

Set up the Docling pipeline. This is mostly taken from the [examples in their docs](https://docling-project.github.io/docling/examples/custom_convert/).

This uses EasyOCR on each page to determine word groupings, headers, footers, etc., and was generally much better than the alternative. For production, it might be worth the effort to get [RapidOCR](https://github.com/RapidAI/RapidOCR) working, depending on the volume of data needing ingestion.

In [8]:
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = False
pipeline_options.table_structure_options.do_cell_matching = False
pipeline_options.ocr_options.lang = ["en"]
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)

doc_converter = DocumentConverter(
    format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } )

Once configured, just feed it the PDF and export the converted document as plain text.

In [9]:
converted_pdf = doc_converter.convert(PDF_PATH)
pdf_text = converted_pdf.document.export_to_text()

Parameter `strict_text` has been deprecated and will be ignored.


Since we're chunking based on token size, we need to tokenize the text first. This loads the tokenizer.

In [10]:
tok = AutoTokenizer.from_pretrained(BASE_MODEL, cache_dir=CACHE_DIR, gguf_file=GGUF_FILE, use_fast=True)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


Since we're loading a modified model, I'm double-checking that the [Llama 3.2 special tokens](https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/) are present. They aren't listed in the tokenizer's special tokens map, but they are in the vocabulary, which is fine. I'll also check the padding token and set it, since some of the tokens seem to be misconfigured.

In [11]:
print(f"Special Tokens: {tok.special_tokens_map}")

Special Tokens: {'bos_token': '<|begin_of_text|>', 'eos_token': '<|begin_of_text|>'}


In [12]:
v = tok.get_vocab()

In [13]:
bos_tok        = "<|begin_of_text|>"
eot_id_tok     = "<|eot_id|>"
start_hd_tok   = "<|start_header_id|>"
eot_tok        = "<|end_of_text|>"
special_tokens = [bos_tok, eot_id_tok, start_hd_tok, eot_tok]

for t in special_tokens:
    print(f"{t} in vocabulary - {t in v}")

<|begin_of_text|> in vocabulary - True
<|eot_id|> in vocabulary - True
<|start_header_id|> in vocabulary - True
<|end_of_text|> in vocabulary - True


With the tokenizer set up, I can encode the PDF text and start building the chunks based on token count.

In [14]:
pdf_encoded = tok(pdf_text)
pdf_as_tokens = pdf_encoded.input_ids

In [15]:
for i, start_tok in enumerate(range(0, len(pdf_as_tokens), chunk_stride)):
    slice_ids = pdf_as_tokens[start_tok : start_tok + chunk_size]
    chunk_text = tok.decode(slice_ids, clean_up_tokenization_spaces=False, skip_special_tokens=True)
    chunks.append({
        "chunk_id": f"{i:06d}",
        "text": chunk_text
    })

    # Break if we've reached the end of the document
    if start_tok + chunk_size >= len(pdf_as_tokens):
        break

In [16]:
print(f"There are {len(chunks)} chunks")

There are 462 chunks


In [17]:
print(f"The first chunk looks like:\n {chunks[0]}")

The first chunk looks like:
 {'chunk_id': '000000', 'text': "## FM 5-0 PLANNING AND ORDERS PRODUCTION\n\nNOVEMBER 2024\n\nDISTRIBUTION RESTRICTION:\n\nApproved for public release; distribution is unlimited.\n\nThis publication supersedes FM 5-0, dated 16 May 2022. HEADQUARTERS, DEPARTMENT OF THE ARMY\n\nThis publication is available at the Army Publishing Directorate site (https://armypubs.army.mil) and the Central Army Registry Site (https://atiam.train.army.mil/catalog/dashboard).\n\n## PLANNING AND ORDERS PRODUCTION\n\n## Contents\n\nDISTRIBUTION RESTRICTION: Approved for public release; distribution is unlimited.\n\nINTEGRATING PROCESSES SUPPORT TO PLANNING .......................................................  345\n\nGlossary ............................................................................................................................................  363\n\nReferences .................................................................................................

Looks good, so we save the chunks and the tokenizer for the next steps.

In [18]:
with open(CHUNKED_DATA, "w", encoding="utf-8") as f:
    for c in chunks:
        f.write(json.dumps(c, ensure_ascii=False) + "\n")
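
For the next steps, the JSONL file can be read back one record per line. The sketch below is illustrative only: `read_jsonl` is a hypothetical helper, and the demo writes to a temporary directory with made-up records rather than touching `CHUNKED_DATA`.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


demo_chunks = [
    {"chunk_id": "000000", "text": "first"},
    {"chunk_id": "000001", "text": "second\nline"},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "chunked.jsonl"
    # Same write pattern as the cell above: one JSON document per line.
    with open(demo_path, "w", encoding="utf-8") as f:
        for c in demo_chunks:
            f.write(json.dumps(c, ensure_ascii=False) + "\n")
    loaded = read_jsonl(demo_path)

print(loaded == demo_chunks)  # True
```

Because `json.dumps` escapes newlines inside the text, each chunk stays on a single line even when the chunk itself contains line breaks.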

In [19]:
tok.save_pretrained(MODEL_DIR)

('FM5_0/QuantFactory/Llama-3.2-1B-GGUF/lora/tokenizer_config.json',
 'FM5_0/QuantFactory/Llama-3.2-1B-GGUF/lora/special_tokens_map.json',
 'FM5_0/QuantFactory/Llama-3.2-1B-GGUF/lora/tokenizer.model',
 'FM5_0/QuantFactory/Llama-3.2-1B-GGUF/lora/added_tokens.json',
 'FM5_0/QuantFactory/Llama-3.2-1B-GGUF/lora/tokenizer.json')