# Extracting Complex Data from Serbian Regulatory PDF

This PDF contains parts of Serbian policy documents, crucial for a research project analyzing industry policies across countries. The challenge lies in extracting a large table that spans pages 90 to 97 and a math formula on page 98, all in Serbian. Neither element has clear boundaries between pages, which complicates extraction.


In [None]:
# Install natural-pdf
!pip install natural-pdf

In [None]:
# Download the PDF file
import urllib.request
import os

pdf_url = "https://pub-4e99d31d19cb404d8d4f5f7efa51ef6e.r2.dev/pdfs/serbia-zakon-o-naknadama-za-koriscenje-javnih/serbia-zakon-o-naknadama-za-koriscenje-javnih.pdf"
pdf_name = "serbia-zakon-o-naknadama-za-koriscenje-javnih.pdf"

if not os.path.exists(pdf_name):
    print(f"Downloading {pdf_name}...")
    urllib.request.urlretrieve(pdf_url, pdf_name)
    print(f"Downloaded {pdf_name}")
else:
    print(f"{pdf_name} already exists")
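
Before opening the file, it can be worth confirming that the download actually produced a PDF and not, say, an HTML error page. The helper below is a minimal sketch; `looks_like_pdf` is a hypothetical name introduced here, not part of Natural PDF, and it only checks the first five bytes of the file.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if `path` is a file that starts with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        # Every PDF begins with the ASCII signature "%PDF-"
        return f.read(5) == b"%PDF-"
```

In the notebook this would be called as `looks_like_pdf(pdf_name)` right after the download cell.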

In [None]:
from natural_pdf import PDF
from natural_pdf.analyzers.guides import Guides

pdf = PDF("serbia-zakon-o-naknadama-za-koriscenje-javnih.pdf")
pdf.pages[:8].show(cols=4)

The submitter mentioned specific pages, but it's more fun to say "between the page with this and the page with that" and let the text itself locate them.

In [None]:
first_page = pdf.find(text="Prilog 7.").page
last_page = pdf.find(text='VISINA NAKNADE ZA ZAGAĐENJE VODA').page
pages = pdf.pages[first_page.index:last_page.index+1]
pages.show(cols=4)
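
The "between the page with this and the page with that" idea can be sketched without any PDF library at all. The snippet below is a toy model under stated assumptions: `pages_between` is a hypothetical helper, and `toy_pages` is made-up page text standing in for the real document. It shows why the slice needs `last + 1` to include the final page.

```python
def pages_between(page_texts, start_marker, end_marker):
    """Return (first, last) indices of the pages containing each marker."""
    first = next(
        (i for i, text in enumerate(page_texts) if start_marker in text), None
    )
    if first is None:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Only look for the end marker on or after the first page
    last = next(
        (i for i in range(first, len(page_texts)) if end_marker in page_texts[i]),
        None,
    )
    if last is None:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return first, last


# Made-up page contents, standing in for the real PDF pages
toy_pages = [
    "uvod",
    "Prilog 7. Tabela 4",
    "nastavak tabele",
    "VISINA NAKNADE ZA ZAGAĐENJE VODA",
    "kraj",
]
first, last = pages_between(
    toy_pages, "Prilog 7.", "VISINA NAKNADE ZA ZAGAĐENJE VODA"
)
selected = toy_pages[first:last + 1]  # +1 because slices exclude the end index
```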

We want everything between "Tabela 4" and "Tabela 5" (Table 4 and Table 5), stopping just before the "Tabela 5" heading.

In [None]:
region = (
    pages
    .find(text="Tabela 4")
    .below(
        until="text:contains(Tabela 5)",
        include_endpoint=False,
        multipage=True
    )
)
region.show(cols=4)

We want everything broken up by category, which is labeled as "RAZRED" in the document. We'll split the region into sections at each of those labels, leaving the label itself out of the section.

In [None]:
sections = region.get_sections('text:contains(RAZRED)', include_boundaries='none')
sections.show(cols=4)
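
To make the splitting logic concrete, here is a rough sketch of the same idea on plain text lines. It is a toy model, not how Natural PDF implements `get_sections`: `split_sections` is a hypothetical helper, the sample lines are invented, and dropping anything before the first marker is an assumption of this sketch.

```python
def split_sections(lines, marker):
    """Group lines into sections that start after each line containing `marker`.

    The marker lines themselves are left out, and anything before the first
    marker is ignored (an assumption of this toy version).
    """
    sections = []
    current = None
    for line in lines:
        if marker in line:
            # A new header starts a new, initially empty section
            current = []
            sections.append(current)
        elif current is not None:
            current.append(line)
    return sections


# Invented lines standing in for the text of the region
toy_lines = ["Tabela 4", "RAZRED 1", "red a", "red b", "RAZRED 2", "red c"]
toy_sections = split_sections(toy_lines, "RAZRED")
```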

Some of them have headers and some of them don't, which can make extraction tough. Here's one that spans two pages and has headers.

In [None]:
sections[7].show(cols=2)

Since it has headers, we can just extract the table and call `.to_df()`.

In [None]:
sections[7].extract_table().to_df()

This next one does *not* have headers.

In [None]:
sections[5].show(cols=2)

We'll pass `header=False` and set the column names manually, which is probably the easiest route.

In [None]:
df = sections[5].extract_table().to_df(header=False)
df.columns = ['Naziv proizvoda', 'Opis proizvoda', 'Jed. mere', 'Naknada u dinarima po jedinici mere']
df
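
Once the columns are named, the fee column will probably still need cleaning before any analysis. The sketch below assumes, without having checked the actual table, that the values arrive as strings in the usual Serbian number format, with `.` as the thousands separator and `,` as the decimal mark. `parse_serbian_number` is a hypothetical helper, not something Natural PDF or pandas provides.

```python
from decimal import Decimal, InvalidOperation


def parse_serbian_number(text):
    """Convert a string like '1.234,50' to Decimal('1234.50').

    Returns None for missing cells or anything that is not a plain number.
    """
    if text is None:
        return None
    # Drop thousands separators, then switch the decimal comma to a point
    cleaned = text.strip().replace(".", "").replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

On the DataFrame above, this could be applied with `df['Naknada u dinarima po jedinici mere'].map(parse_serbian_number)`.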

How do we find the math formula? Find the page that has it, then just ask for the image.

In [None]:
page = pdf.find(text="Obračun naknade za neposredno zagađenje voda").page
page.find("image").show()

If we were fancier we'd probably use [surya](https://github.com/datalab-to/surya) to convert the image into text, but Natural PDF can't extract images like that just yet (I don't think?).