<a href="https://colab.research.google.com/github/Extralit/papers-ocr-benchmarks/blob/main/marker_demo_blocks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Marker Demo: Research Paper Block Extraction & Exploration

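A self-contained, Google Colab-ready notebook. Despite the title, it also runs MinerU, Docling, and PyMuPDF4LLM on the same PDF for comparison.

- Upload a PDF and run Marker to extract its JSON block structure
- Flatten and filter the block tree (a Python port of a TypeScript implementation)
- Explore block types, metadata, tables, and figures
- No LLM-based summarization or captioning is included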


In [38]:
# 1. Install Marker plus the other extractors (Docling, MinerU, PyMuPDF4LLM) and helper libraries
# Extras are quoted so the shell does not try to expand the square brackets
!uv pip install --quiet "marker-pdf[full]" docling
!uv pip install -q "mineru[all]"
!uv pip install -q "PyMuPDF>=1.23.0" "pandas>=1.5.0" pymupdf4llm llama_index
!uv pip install -q "matplotlib>=3.5.0" "seaborn>=0.11.0" "textdistance>=4.6.0"

In [3]:
# Notebook display helpers

from IPython.display import HTML, display, JSON
from pprint import pprint

In [4]:
# 2. Upload a PDF
from google.colab import files
uploaded = files.upload()
pdf_path = next(iter(uploaded))

Saving Allossogbe_et_al_2017_Mal_J.pdf to Allossogbe_et_al_2017_Mal_J.pdf


In [5]:
# Build the absolute path from the uploaded file name instead of hard-coding it
file_path = f"/content/{pdf_path}"

## MinerU

In [29]:
%%time
!mineru -p /content/Allossogbe_et_al_2017_Mal_J.pdf -o /content/mineru_output/

2025-07-08 20:03:55.332 | INFO     | mineru.backend.pipeline.pipeline_analyze:doc_analyze:124 - Batch 1/1: 11 pages/11 pages
2025-07-08 20:03:55.334 | INFO     | mineru.backend.pipeline.pipeline_analyze:batch_image_analyze:187 - gpu_memory: 15 GB, batch_ratio: 8
2025-07-08 20:03:55.334 | INFO     | mineru.backend.pipeline.model_init:__init__:137 - DocAnalysis init, this may take some times......
2025-07-08 20:04:09.188 | INFO     | mineru.backend.pipeline.model_init:__init__:182 - DocAnalysis init done!
2025-07-08 20:04:09.189 | INFO     | mineru.backend.pipeline.pipeline_analyze:custom_model_init:64 - model init cost: 13.854581832885742
Layout Predict: 100% 11/11 [00:02<00:00,  4.03it/s]
MFD Predict: 100% 11/11 [00:04<00:00,  2.2
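
MinerU writes its results under the directory passed to `-o`. The exact file names depend on the MinerU version, so the sketch below makes no assumption about them: it only walks the directory and reports what is there. `list_output_files` is a hypothetical helper, not part of MinerU.

```python
import os

def list_output_files(root: str) -> list[tuple[str, int]]:
    """Return (relative_path, size_in_bytes) for every file under root, sorted by path."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append((os.path.relpath(full, root), os.path.getsize(full)))
    return sorted(found)

if __name__ == "__main__":
    # '/content/mineru_output/' is the -o directory used in the cell above
    for rel_path, size in list_output_files("/content/mineru_output/"):
        print(f"{size:>10,d}  {rel_path}")
```

If the directory does not exist yet, the function simply returns an empty list.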

## Docling

In [48]:
%%time
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert(file_path)

CPU times: user 1min 14s, sys: 3.18 s, total: 1min 17s
Wall time: 1min 13s


In [None]:
print(result.document.export_to_markdown())

## PyMuPDF4LLM

See the PyMuPDF4LLM API reference: https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/api.html#pymupdf4llm-api

In [49]:
%%time
import pymupdf4llm
md_text = pymupdf4llm.to_markdown(file_path, table_strategy=None)

CPU times: user 5.22 s, sys: 190 ms, total: 5.41 s
Wall time: 6.42 s
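
The `%%time` cells report timings one tool at a time. To compare tools side by side, one option is to wrap each extractor in a zero-argument callable and record wall-clock time with `time.perf_counter`. This is a minimal sketch: `time_extractors` and the stand-in workloads are hypothetical, and a single run per tool is only a rough measurement.

```python
import time
from typing import Callable, Dict

def time_extractors(extractors: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each zero-argument callable once and return wall-clock seconds per name."""
    timings = {}
    for name, run in extractors.items():
        start = time.perf_counter()
        run()
        timings[name] = time.perf_counter() - start
    return timings

if __name__ == "__main__":
    # Stand-in workloads so the sketch runs anywhere; swap in real extractor calls
    demo = {
        "fast_tool": lambda: sum(range(1_000)),
        "slow_tool": lambda: time.sleep(0.05),
    }
    for name, seconds in sorted(time_extractors(demo).items(), key=lambda kv: kv[1]):
        print(f"{name:<12} {seconds:.3f} s")
```

In this notebook the dictionary could hold entries such as `{"pymupdf4llm": lambda: pymupdf4llm.to_markdown(file_path)}`.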


In [36]:
print(md_text)

Allossogbe et al. Malar J (2017) 16:77
DOI 10.1186/s12936-017-1727-x

### **RESEARCH**


## Malaria Journal

### **Open Access**


# WHO cone bio‑assays of classical and new‑generation long‑lasting insecticidal nets call for innovative insecticides targeting the knock‑down resistance mechanism in Benin

Marius Allossogbe [1,2*], Virgile Gnanguenon [1,2], Boulais Yovogan [1,2], Bruno Akinro [1], Rodrigue Anagonou [1,2],
Fiacre Agossa [1,2], André Houtoukpe [3], Germain Gil Padonou [1,2] and Martin Akogbeto [1,2]


**Abstract**

**Background:** To increase the effectiveness of insecticide-treated nets (ITN) in areas of high resistance, new longlasting insecticidal nets (LLINs) called new-generation nets have been developed. These nets are treated with the
piperonyl butoxide (PBO) synergist which inhibit the action of detoxification enzymes. The effectiveness of the
new-generation nets has been proven in some studies, but their specific effect on mosquitoes carrying detoxification enzymes
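
With Markdown from more than one extractor in hand, a quick way to see how much two outputs agree is a similarity ratio. The install cell pulls in `textdistance`, but the sketch below sticks to the standard library's `difflib.SequenceMatcher` so that it runs without extra dependencies. `normalize` and `similarity` are hypothetical helpers, and the normalization rules are an assumption about what should count as "the same text".

```python
import difflib
import re

def normalize(text: str) -> str:
    """Lower-case, drop Markdown emphasis/heading markers, and collapse whitespace."""
    text = re.sub(r"[*#_`]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized strings."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()

if __name__ == "__main__":
    sample_a = "## Malaria Journal\n\n**Abstract**"
    sample_b = "Malaria Journal\nAbstract"
    print(f"similarity: {similarity(sample_a, sample_b):.2f}")
```

`SequenceMatcher` is quadratic in the worst case, so on full papers it is more practical to compare page by page or on a leading slice of each output.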

In [39]:
%%time
llama_reader = pymupdf4llm.LlamaMarkdownReader()
llama_docs = llama_reader.load_data(file_path)

Successfully imported LlamaIndex
CPU times: user 16 s, sys: 631 ms, total: 16.6 s
Wall time: 17.5 s
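
The reader returns one document per page. Before dumping everything, a compact overview can be easier to scan. The sketch below assumes each item exposes `.text` and a `.metadata` dict with a `page` key; `PageDoc` and `summarize_pages` are hypothetical stand-ins so that the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class PageDoc:
    """Stand-in for one per-page document: only .text and .metadata are used."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)

def summarize_pages(docs: List[Any], preview_chars: int = 60) -> List[Dict[str, Any]]:
    """Return one row per document: page number, character count, one-line preview."""
    rows = []
    for doc in docs:
        preview = " ".join(doc.text.split())[:preview_chars]
        rows.append({
            "page": doc.metadata.get("page"),
            "chars": len(doc.text),
            "preview": preview,
        })
    return rows

if __name__ == "__main__":
    demo = [
        PageDoc("Allossogbe et al. Malar J (2017) 16:77\n**Abstract**", {"page": 1}),
        PageDoc("**Background**\nMalaria is a major public health problem", {"page": 2}),
    ]
    for row in summarize_pages(demo):
        print(row)
```

With the objects from the cell above, the call would be `summarize_pages(llama_docs)`.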


In [47]:
# Show each page's metadata followed by its Markdown text
for p in llama_docs:
    display(p.metadata)
    print(p.text)

{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 1,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77
DOI 10.1186/s12936-017-1727-x

### **RESEARCH**


## Malaria Journal

### **Open Access**


# WHO cone bio‑assays of classical and new‑generation long‑lasting insecticidal nets call for innovative insecticides targeting the knock‑down resistance mechanism in Benin

Marius Allossogbe [1,2*], Virgile Gnanguenon [1,2], Boulais Yovogan [1,2], Bruno Akinro [1], Rodrigue Anagonou [1,2],
Fiacre Agossa [1,2], André Houtoukpe [3], Germain Gil Padonou [1,2] and Martin Akogbeto [1,2]


**Abstract**

**Background:** To increase the effectiveness of insecticide-treated nets (ITN) in areas of high resistance, new longlasting insecticidal nets (LLINs) called new-generation nets have been developed. These nets are treated with the
piperonyl butoxide (PBO) synergist which inhibit the action of detoxification enzymes. The effectiveness of the
new-generation nets has been proven in some studies, but their specific effect on mosquitoes carrying detoxification enzymes

{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 2,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77 Page 2 of 11



**Background**
Malaria is a major public health problem worldwide,
and particularly so in Benin. It remains a permanent
threat from its high morbidity (214 million) and mortality (438,000). Africa is the most endemic region
affected (395,000 deaths per year) [1]. It affects onefifth of the world population. However, this proportion
has decreased significantly by 37% between 2000 and
2015 due to the effect of malaria prevention and treatment methods, including long-lasting insecticidal nets
(LLINs), indoor residual spraying of residual insecticides
(IRS), chemo-prevention for pregnant women and children, and therapeutic treatment with artemisinin-based
combinations.

Among these prevention methods, LLINs have
emerged in recent years as a privileged tool to prevent
malaria. Te insecticides selected by the World Health
Organization (WHO) for LLIN treatment are pyrethroids, which have little toxicity to humans, are effective at low dos

{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 3,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77 Page 3 of 11



Te larvae of these mosquito populations were collected
in different ecological areas (vegetable, urban, rice and
cotton areas). Te study was also conducted on resistant
laboratory strains (kdr-Kisumu and ace-1R-Kisumu).


**Study sites**

**Malanville**

Malanville district is bordered on the north by the Republic of Niger, on the south by Kandi and Segbana districts,
on the west by Karimama district and on the east by the
Republic of Nigeria. It has an area of 3016 km [2] and had a
population of 144,843 inhabitants in 2013 (Fig. 1).


**Tanguieta**
It is bordered on the north by the Republic of Burkina
Faso, on the south by Boukoumbe district, on the east by
Kerou, Kouande and Tounkountouna districts and on the
west by Materi and Cobly districts. It covers an area of
5456 km [2] and had a population of 77,987 inhabitants in
2013 (Fig. 1).


**Abomey-Calavi**
Abomey-Calavi is bounded on the north by Ze district,
on the south by the

{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 4,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77 Page 4 of 11



**Cone test**

Te cone test is used to assess the effectiveness of an
insecticide and its persistence on the net. It was conducted following the WHO protocol. Tis test aims to
compare the behaviour of mosquitoes while in contact
with treated mosquito nets without PBO or with PBO.
Cone tests were performed on five types of nets (Olyset
Plus, Olyset Net, LifeNet, PermaNet 2.0 and PermaNet
3.0). Tese tests were carried out using fragments of
LLINs (30 cm × 30 cm) cut from five (05) positions on
each net. Two standard cones were fixed with a plastic
sheet on each of the five (05) screen fragments. For PermaNet 3.0 LLIN, an additional two cones were added on
the PBO-containing roof. Five unfed An. gambiae females
aged 2–5 days (Kisumu or wild type) were introduced
into each cone placed on the LLIN for 3 min. After exposure, the mosquitoes were removed from the cones using
a mouth aspirator and then transferred into paper cups
and provid

{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 5,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77 Page 5 of 11


**Fig. 1** Map of Benin showing the study locations





{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 6,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77 Page 6 of 11


**Table 1 Biochemical and molecular characteristics of the Anopheles gambiae s.l. populations tested**



**Strains of An. gambiae** **Average oxidase activ-** **Average α esterase**
**s.l.** **ity (min/mg protein)** **activity (min/mg**
**protein)**



**Average β esterase**
**activity (min/mg**
**protein)**



**Average glutathione-**
**S-transferase activity**
**(min/mg protein)**



**kdr frequency**



Kisumu 0.1015 [a] 0.07409 [a] 0.07655 [a] 0.3846 [a] 0 [a]


Agblangandan 0.07966 [a] 0.07883 [a] 0.06117 [a] 0.7319 [b] 0.03 [a]

Abomey-Calavi 0.08454 [a] 0.07149 [a] 0.05929 [a] 0.4295 [a] 0.93 [b]

Akron 0.1604 [b] 0.08589 [a] 0.07897 [a] 2.221 [b] 0.74 [b]


Houeyiho 0.17.39 [b] 0.07694 [a] 0.08774 [a] 0.4042 [a] 0.9 [b]

Vossa 0.07566 [a] 0.06897 [a] 0.06389 [a] 0.7078 [a] 0.84 [b]


Ladji 0.1737 [b] 0.07146 [a] 0.0774 [a] 1.194 [b] 0.92 [b]

Bame 0.1106 [a] 0.0588 [a] 0.06223 [a] 0.2901 [a] 0.78 [b]


Malanville 0.06549 [a

{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 7,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77 Page 7 of 11


**Table 2 Distribution of the knock-down rate observed in localities where there was only one resistance mechanism (kdr)**


**Strains** **LLINs** **N mosquito tested** **KD after 60 min** **95% CI** **Mortality after 24 h (%)**


Malanville LifeNet 55 72.27 [59.03–83.86] 27.27


Olyset Net 53 30.19 [18.34–44.34] 05.56


Olyset Plus 51 54.9 [40.34–68.87] 21.56


PermaNet 2.0 59 28.81 [17.76–42.08] 47.46


PermaNet 3.0 84 95.24 [88.25–98.69] 61.90


Abomey-Calavi LifeNet 53 9.43 [3.13–20.66] 7.54


Olyset Net 54 11.11 [4.18–22.63] 5.56


Olyset Plus 55 29.09 [17.62–49.90] 20


PermaNet 2.0 52 70.49 [57.43–81.84] 26.92


PermaNet 3.0 72 81.94 [71.1–90.02] 86.11


Zagnanado (Bamè) LifeNet 58 68.97 [55.45–80.46] 10.34


Olyset Net 54 23.08 [12.53–36.84] 00


Olyset Plus 55 33.96 [21.51–46.27] 09.43


PermaNet 2.0 53 52.83 [38.63–66.7] 03.77


PermaNet 3.0 75 63.93 [57.61–79.47] 62.67


Vossa LifeNet 54 62.96 [48.74–75.71] 20.37


Olyset

{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 8,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77 Page 8 of 11


**Table 3 Distribution of the knock-down rate observed in localities where there were several resistance mechanisms**

**(kdr** **+ metabolic resistance)**


**Strains** **LLINs** **N mosquito tested** **KD after 60 min** **95% CI** **Mortality (%)**


Agblangandan LifeNet 53 50.94 [36.83–64.96] 15.09


Olyset Net 54 20.75 [10.84–34.11] 07.4


Olyset Plus 55 50.91 [37.07–64.65] 34.72


PermaNet 2.0 47 36.17 [22.67–51.58] 17.02


PermaNet 3.0 66 60.61 [47.80–72.42] 65.15


Ladji LifeNet 57 85.96 [74.2–93.74] 47.36


Olyset Net 57 50.88 [37.28–64.37] 40.35


Olyset Plus 56 42.86 [29.71–56.78] 41.07


PermaNet 2.0 50 66 [51.23–78.79] 14


PermaNet 3.0 69 88.41 [78.42–94.86] 44.93


Akron LifeNet 52 30.77 [18.71–45.1] 15.38


Olyset Net 54 31.48 [19.52–45.55] 5.56


Olyset Plus 55 74.55 [60.99–85.33] 25.45


PermaNet 2.0 61 70.49 [57.43–81.84] 54.09


PermaNet 3.0 82 81.71 [71.63–89.38] 89.02


Parakou LifeNet 51 43.14 [29.34–57.75] 09.

{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 9,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77 Page 9 of 11



time (KDT 50 and 95%) compared to other LLINs. In a
recent study conducted in Benin [36], Olyset Plus, treated
with permethrin + PBO, demonstrated a higher efficacy
than Olyset Net against wild multi-resistant An. gambiae
s.l. in experimental huts, as observed in WHO cone tests
used in the present study. In south-western Ethiopia [35]
and in Uganda [34], a reduced efficacy of mono-treated
LLINs was also observed against wild resistant An. gambiae s.l. in comparison with Permanet 3.0 treated with
deltamethrin + PBO. Te results are similar to those
observed in this study. However, these studies did not
include Olyset Plus, the second type of new-generation
LLINs treated with permethrin + PBO.
Te reduced efficacy of LLINs treated with permethrin
would be related to the strong resistance of the local vectors to permethrin due to the resistance selection pressures generated by the use of the same class of insecticide
for malaria vector 

{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 10,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77 Page 10 of 11



**Author details**
1 Centre de Recherche Entomologique de Cotonou (CREC), Cotonou,
Benin. [2] Université d’Abomey-Calavi, Abomey‑Calavi, Benin. [3] Medical Care
and Development International, Washington, USA.


**Acknowledgements**
We thank CREC personnel for their technical assistance and collaboration.


**Competing interests**
The authors declare that they have no competing interests.


**Availability of data and materials**
Data collected during this study are included in the published article and its
additional files.


**Funding**
This work is supported by Faculty of Letters, Arts and Human Sciences of the
University of Abomey-Calavi.


Received: 5 December 2016  Accepted: 7 February 2017


**References**

1. WHO. World malaria report 2015. Geneva: World Health Organiza[tion; 2015. http://www.who.int/malaria/publications/world-malaria-](http://www.who.int/malaria/publications/world-malaria-report-2015/report/en/)
[report-201

{'format': 'PDF 1.7',
 'title': 'WHO cone bio-assays of classical and new-generation long-lasting insecticidal nets call for innovative insecticides targeting the knock-down resistance mechanism in Benin',
 'author': 'Marius Allossogbe',
 'subject': 'Malaria Journal, doi:10.1186/s12936-017-1727-x',
 'keywords': 'LLINs,Bio-efficacy,Piperonyl butoxide,Resistant mosquitoes',
 'creator': 'ocrmypdf 16.1.2 / Tesseract OCRhOCR 5.3.4',
 'producer': 'pikepdf 8.13.0',
 'creationDate': "D:20170214204129+05'30'",
 'modDate': "D:20240423063541+00'00'",
 'trapped': '',
 'encryption': None,
 'page': 11,
 'total_pages': 11,
 'file_path': '/content/Allossogbe_et_al_2017_Mal_J.pdf'}

Allossogbe et al. Malar J (2017) 16:77 Page 11 of 11



35. Yewhalaw D, Asale A, Tushune K, Getachew Y, Duchateau L, Speybroeck
N. Bio-efficacy of selected long-lasting insecticidal nets against pyrethroid resistant Anopheles arabiensis from South-Western Ethiopia. Parasit
Vectors. 2012;5:159.
36. Pennetier C, Bouraima A, Chandre F, Piameu M, Etang J, Rossignol M, et al.
Efficacy of Olyset [®] Plus, a new long-lasting insecticidal net incorporating
permethrin and piperonil-butoxide against multi-resistant malaria vectors. PLoS ONE. 2013;8:e75134.
37. Ranson H, Abdallah H, Badolo A, Guelbeogo WM, Kerah-Hinzoumbé C,
Yangalbé-Kalnoné E, et al. Insecticide resistance in Anopheles gambiae:
data from the first year of a multi-country study highlight the extent of
the problem. Malar J. 2009;8:299.
38. Chouaibou MS, Chabi J, Bingham GV, Knox TB, N’Dri L, Kesse NB, et al.
Increase in susceptibility to insecticides with aging of wild Anopheles
gambiae mosquitoes from Côte d’Ivoire. BMC Infect Di


## Marker

In [7]:
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.schema import BlockTypes

artifact_dict = create_model_dict()
artifact_dict

Downloading layout model to /root/.cache/datalab/models/layout/2025_02_18: 100%|██████████| 5/5 [00:00<00:00,  5.35it/s]
Downloading text_recognition model to /root/.cache/datalab/models/text_recognition/2025_05_16: 100%|██████████| 10/10 [00:31<00:00,  3.13s/it]
Downloading table_recognition model to /root/.cache/datalab/models/table_recognition/2025_02_18: 100%|██████████| 5/5 [00:00<00:00,  8.89it/s]
Downloading text_detection model to /root/.cache/datalab/models/text_detection/2025_05_07: 100%|██████████| 6/6 [00:00<00:00, 22.67it/s]
Downloading ocr_error_detection model to /root/.cache/datalab/models/ocr_error_detection/2025_02_18: 100%|██████████| 8/8 [00:05<00:00,  1.44it/s]


{'layout_model': <surya.layout.LayoutPredictor at 0x7f93a32e4590>,
 'recognition_model': <surya.recognition.RecognitionPredictor at 0x7f93a428e110>,
 'table_rec_model': <surya.table_rec.TableRecPredictor at 0x7f93a2f84410>,
 'detection_model': <surya.detection.DetectionPredictor at 0x7f93a2eb9710>,
 'ocr_error_model': <surya.ocr_error.OCRErrorPredictor at 0x7f93a32d7e90>}

In [None]:
# Reuse the models loaded above instead of loading them a second time
converter = PdfConverter(
    artifact_dict=artifact_dict,
)

document = converter.build_document(file_path)
forms = document.contained_blocks((BlockTypes.Form,))

Recognizing layout: 100%|██████████| 1/1 [00:01<00:00,  1.79s/it]
Running OCR Error Detection: 100%|██████████| 1/1 [00:00<00:00, 26.08it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Recognizing tables: 100%|██████████| 1/1 [00:01<00:00,  1.26s/it]


In [None]:
blocks = document.contained_blocks((BlockTypes.Text,))
blocks[0]

Text(polygon=PolygonBox(polygon=[[56.45703125, 229.614013671875], [506.076171875, 229.614013671875], [506.076171875, 257.422607421875], [56.45703125, 257.422607421875]], bbox=[56.45703125, 229.614013671875, 506.076171875, 257.422607421875]), block_description='A paragraph or line of text.', block_type=<BlockTypes.Text: '23'>, block_id=6, page_id=0, text_extraction_method='pdftext', structure=[/page/0/Line/40, /page/0/Line/58, /page/0/Line/64, /page/0/Line/67, /page/0/Line/77], ignore_for_output=False, replace_output_newlines=False, source='layout', top_k={<BlockTypes.Text: '23'>: 0.9404296875, <BlockTypes.TextInlineMath: '16'>: 0.05963134765625, <BlockTypes.Code: '10'>: 1.1980533599853516e-05, <BlockTypes.Footnote: '12'>: 4.947185516357422e-06, <BlockTypes.Form: '13'>: 4.827976226806641e-06}, metadata=None, lowres_image=None, highres_image=None, removed=False, has_continuation=False, blockquote=False, blockquote_level=0, html=None)

In [None]:
blocks[0].html

In [None]:
pprint(blocks)

[Text(polygon=PolygonBox(polygon=[[56.45703125, 229.614013671875], [506.076171875, 229.614013671875], [506.076171875, 257.422607421875], [56.45703125, 257.422607421875]], bbox=[56.45703125, 229.614013671875, 506.076171875, 257.422607421875]), block_description='A paragraph or line of text.', block_type=<BlockTypes.Text: '23'>, block_id=6, page_id=0, text_extraction_method='pdftext', structure=[/page/0/Line/40, /page/0/Line/58, /page/0/Line/64, /page/0/Line/67, /page/0/Line/77], ignore_for_output=False, replace_output_newlines=False, source='layout', top_k={<BlockTypes.Text: '23'>: 0.9404296875, <BlockTypes.TextInlineMath: '16'>: 0.05963134765625, <BlockTypes.Code: '10'>: 1.1980533599853516e-05, <BlockTypes.Footnote: '12'>: 4.947185516357422e-06, <BlockTypes.Form: '13'>: 4.827976226806641e-06}, metadata=None, lowres_image=None, highres_image=None, removed=False, has_continuation=False, blockquote=False, blockquote_level=0, html=None),
 Text(polygon=PolygonBox(polygon=[[62.27734375, 303.
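
The raw repr is hard to scan. Since each block printed above carries a `block_type` and a `page_id`, a per-page tally gives a quicker overview. This is a sketch on stand-in objects: `count_by_type_and_page` is a hypothetical helper and it relies only on those two attributes.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable

def count_by_type_and_page(blocks: Iterable[Any]) -> Counter:
    """Count blocks per (page_id, block type name)."""
    counts = Counter()
    for block in blocks:
        block_type = block.block_type
        # Enum members expose .name; fall back to str() for plain values
        type_name = getattr(block_type, "name", str(block_type))
        counts[(block.page_id, type_name)] += 1
    return counts

if __name__ == "__main__":
    # Stand-in blocks; in the notebook, pass the `blocks` list from the cell above
    demo = [
        SimpleNamespace(block_type="Text", page_id=0),
        SimpleNamespace(block_type="Text", page_id=0),
        SimpleNamespace(block_type="Table", page_id=5),
    ]
    for (page_id, type_name), n in sorted(count_by_type_and_page(demo).items()):
        print(f"page {page_id}: {n} x {type_name}")
```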

In [None]:
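# 3. Run Marker to extract JSON structure
import glob
import os

output_dir = 'marker_output'
os.makedirs(output_dir, exist_ok=True)

!marker_single "{pdf_path}" --output_format json --output_dir "{output_dir}"

# Locate the JSON that Marker wrote. The layout under output_dir can vary between
# Marker versions, so search for it instead of hard-coding a file name
# (a companion '*_meta.json' file, if present, is skipped).
json_candidates = [p for p in glob.glob(os.path.join(output_dir, '**', '*.json'), recursive=True)
                   if not p.endswith('_meta.json')]
json_out = json_candidates[0] if json_candidates else None
print('Marker JSON:', json_out)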

In [None]:
# 4. Load the Marker JSON
import json
with open(json_out, 'r') as f:
    marker_json = json.load(f)
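
Before flattening, it can help to check which block types the loaded tree actually contains. The sketch below assumes only what the flattening code in this notebook also assumes: nested dicts with a `block_type` key and an optional `children` list. `count_block_types` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict

def count_block_types(node: Dict[str, Any]) -> Counter:
    """Count 'block_type' values in a nested dict tree linked by 'children'."""
    counts = Counter()
    stack = [node]
    while stack:
        current = stack.pop()
        if current.get("block_type"):
            counts[current["block_type"]] += 1
        # 'children' may be missing or None on leaf blocks
        stack.extend(current.get("children") or [])
    return counts

if __name__ == "__main__":
    # Toy tree in the assumed shape; in the notebook, pass marker_json instead
    toy = {
        "block_type": "Document",
        "children": [
            {"block_type": "Page", "children": [
                {"block_type": "Text", "children": None},
                {"block_type": "Table"},
            ]},
        ],
    }
    for block_type, n in count_block_types(toy).most_common():
        print(f"{n:>4}  {block_type}")
```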

In [None]:
# 5. Data models and flattening utilities (Python port of the TypeScript implementation)
from typing import List, Dict, Any

class SimplifiedBlock:
    def __init__(self, type: str, content: str, page: int, bbox: list):
        self.type = type
        self.content = content
        self.page = page
        self.bbox = bbox

    def as_dict(self):
        return {
            'type': self.type,
            'content': self.content,
            'page': self.page,
            'bbox': self.bbox,
        }

import html
import re

def decode_html_entities(text: str) -> str:
    """Turn entities such as &amp; back into plain characters."""
    return html.unescape(text)

def flatten_marker_json(blocks: List[Dict[str, Any]], page_number: int = 0) -> List[SimplifiedBlock]:
    """Flatten a nested block tree into a list; page_number is the 0-based fallback page index."""
    flat_blocks = []
    for block in blocks:
        # Block ids are expected to look like '/page/<index>/<BlockType>/<n>',
        # so element 2 of the split is the 0-based page index
        page_index = int(block['id'].split('/')[2]) if 'id' in block else page_number

        # Skip Page blocks but process their children
        if block.get('block_type') == 'Page':
            flat_blocks.extend(flatten_marker_json(block.get('children') or [], page_index))
            continue

        # Process current block
        content = ''
        if block.get('images') and isinstance(block['images'], dict):
            # Image-bearing blocks: keep the first image payload
            content = next(iter(block['images'].values()))
        elif block.get('block_type') == 'Table':
            # Keep table HTML intact so it can be rendered or parsed later
            content = block.get('html', '').strip()
        elif block.get('html'):
            # Strip tags from other blocks to get plain text
            content = re.sub(r'<[^>]*>', ' ', block['html']).strip()
        content = decode_html_entities(content)

        flat_blocks.append(SimplifiedBlock(
            type=block.get('block_type', ''),
            content=content,
            page=page_index + 1,  # report 1-based page numbers
            bbox=block.get('bbox', [0, 0, 0, 0])
        ))

        # Recurse with the 0-based index so children without an id are not shifted by one page
        if block.get('children'):
            flat_blocks.extend(flatten_marker_json(block['children'], page_index))
    return flat_blocks

def filter_and_flatten_marker_json(blocks: List[Dict[str, Any]], page_number: int = 0) -> List[SimplifiedBlock]:
    unfiltered = flatten_marker_json(blocks, page_number)
    remove_types = {
        'TableCell', 'TableGroup', 'FigureGroup', 'ListGroup', 'Reference',
        'PageFooter', 'PageHeader', 'Footnote'
    }
    return [b for b in unfiltered if b.type not in remove_types and b.content]

In [None]:
# 6. Flatten and filter the Marker output
flat_blocks = filter_and_flatten_marker_json(marker_json.get('children', []))

In [None]:
# 7. Explore block types and content
import pandas as pd

df = pd.DataFrame([b.as_dict() for b in flat_blocks])
print('Block types found:', df['type'].unique())
df.head(20)  # Show first 20 blocks

In [None]:
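# 8. Simple metadata extraction (title, authors, abstract)
# Note: this matches on block type names only, so a field stays empty when the
# extractor emits no block type containing 'title', 'author', or 'abstract'.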
def extract_metadata(blocks: List[SimplifiedBlock]):
    title = next((b.content for b in blocks if b.type.lower() in {'title', 'main_title'}), '')
    authors = next((b.content for b in blocks if 'author' in b.type.lower()), '')
    abstract = next((b.content for b in blocks if 'abstract' in b.type.lower()), '')
    return {'title': title, 'authors': authors, 'abstract': abstract}

metadata = extract_metadata(flat_blocks)
print('Extracted Metadata:', metadata)
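
If the type-based lookup comes back empty, a positional heuristic is one possible fallback. The sketch below assumes blocks are dicts with `type` and `content` keys (the shape produced by `as_dict()`), that a `SectionHeader` type is present, that the first such header is the title, and that the abstract either follows the word "Abstract" in the same block or sits in the next block. All of these are assumptions, and on journal layouts the first header may well be a banner rather than the title. `guess_metadata` is a hypothetical helper.

```python
from typing import Dict, List

def guess_metadata(blocks: List[Dict[str, str]]) -> Dict[str, str]:
    """Guess title and abstract from block order and text; '' when nothing matches."""
    title = next(
        (b["content"] for b in blocks
         if b["type"] == "SectionHeader" and b["content"].strip()),
        "",
    )
    abstract = ""
    for i, b in enumerate(blocks):
        text = b["content"].strip()
        if text.lower().startswith("abstract"):
            # The abstract either follows the word in this block or is the next block
            remainder = text[len("abstract"):].lstrip(" :.\n")
            if remainder:
                abstract = remainder
            elif i + 1 < len(blocks):
                abstract = blocks[i + 1]["content"].strip()
            break
    return {"title": title, "abstract": abstract}

if __name__ == "__main__":
    demo = [
        {"type": "SectionHeader", "content": "WHO cone bio-assays of long-lasting insecticidal nets"},
        {"type": "SectionHeader", "content": "Abstract"},
        {"type": "Text", "content": "Background: To increase the effectiveness of nets..."},
    ]
    print(guess_metadata(demo))
```

In the notebook this could be called as `guess_metadata([b.as_dict() for b in flat_blocks])`.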

In [None]:
# 9. Find and display all tables and figures (with extensibility for custom processing)
import base64
from IPython.display import HTML, Image, display

tables = [b for b in flat_blocks if b.type == 'Table']
figures = [b for b in flat_blocks if b.type in {'Figure', 'Picture'}]

print(f'Found {len(tables)} tables and {len(figures)} figures.')

# Example: Show first table's HTML (for further processing)
if tables:
    print('First table HTML:')
    display(HTML(tables[0].content))

# Example: Show first figure as image (if base64-encoded)
def show_base64_image(b64str):
    try:
        display(Image(data=base64.b64decode(b64str)))
    except Exception as e:
        print('Could not display image:', e)

if figures:
    print('First figure (if image):')
    show_base64_image(figures[0].content)
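
For further processing, table HTML is often easier to work with as rows of cell text. The sketch below uses only the standard library's `html.parser`; `TableRows` and `table_html_to_rows` are hypothetical helpers. It ignores `rowspan`/`colspan` and nested tables, so merged cells will not line up.

```python
from html.parser import HTMLParser
from typing import List, Optional

class TableRows(HTMLParser):
    """Collect the text of <td>/<th> cells, one list per <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:
                self.rows.append([])  # tolerate cells outside an explicit <tr>
            # Join fragments and collapse runs of whitespace
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

def table_html_to_rows(table_html: str) -> List[List[str]]:
    """Parse table HTML into a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    return [row for row in parser.rows if row]

if __name__ == "__main__":
    sample = ("<table><tr><th>Strains</th><th>LLINs</th><th>N mosquito tested</th></tr>"
              "<tr><td>Malanville</td><td>LifeNet</td><td>55</td></tr></table>")
    for row in table_html_to_rows(sample):
        print(row)
```

In the notebook, `table_html_to_rows(tables[0].content)` would give rows that can be passed on to `pd.DataFrame`.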

In [None]:
# 10. (Optional) Extensible: Add your own logic to process tables/figures, e.g., send table HTML to a model, extract captions, etc.
# (No LLM-based summarization or captioning included)

In [None]:
# 11. Save flattened blocks for further analysis
df.to_json('flattened_blocks.json', orient='records', indent=2)
from google.colab import files
files.download('flattened_blocks.json')