# Docling VLM Test
* The docling v2.25.0 release introduces the VLM pipeline (`VlmPipeline`)
    * https://github.com/DS4SD/docling/releases/tag/v2.25.0
* Example: [minimal_vlm_pipeline.py](https://github.com/DS4SD/docling/blob/37dd8c1cc7fc05095fe889eec78300647c946a42/docs/examples/minimal_vlm_pipeline.py#L10)

## VlmPipeline
* https://github.com/DS4SD/docling/blob/37dd8c1cc7fc05095fe889eec78300647c946a42/docling/pipeline/vlm_pipeline.py#L47


## VlmPipelineOptions
* https://github.com/DS4SD/docling/blob/37dd8c1cc7fc05095fe889eec78300647c946a42/docling/datamodel/pipeline_options.py#L334

```
class VlmPipelineOptions(PaginatedPipelineOptions):
    artifacts_path: Optional[Union[Path, str]] = None

    generate_page_images: bool = True
    force_backend_text: bool = (
        False  # (To be used with vlms, or other generative models)
    )
    # If True, text from backend will be used instead of generated text
    vlm_options: Union[HuggingFaceVlmOptions] = smoldocling_vlm_conversion_options
```

## VlmOptions
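* `response_format` selects the output format: the default SmolDocling options use `DOCTAGS`, while the Granite Vision options use `MARKDOWN`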

```
class HuggingFaceVlmOptions(BaseVlmOptions):
    kind: Literal["hf_model_options"] = "hf_model_options"

    repo_id: str
    load_in_8bit: bool = True
    llm_int8_threshold: float = 6.0
    quantized: bool = False

    response_format: ResponseFormat

    @property
    def repo_cache_folder(self) -> str:
        return self.repo_id.replace("/", "--")


smoldocling_vlm_conversion_options = HuggingFaceVlmOptions(
    repo_id="ds4sd/SmolDocling-256M-preview",
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
)

granite_vision_vlm_conversion_options = HuggingFaceVlmOptions(
    repo_id="ibm-granite/granite-vision-3.1-2b-preview",
    # prompt="OCR the full page to markdown.",
    prompt="OCR this image.",
    response_format=ResponseFormat.MARKDOWN,
)

```
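The `repo_cache_folder` property above only replaces `/` with `--` in the repo id. The following is a minimal stand-alone sketch of that mapping. The helper `local_model_dir` and the idea that the folder sits directly under `artifacts_path` are assumptions made for illustration, not confirmed docling loading logic.

```python
from pathlib import Path


def repo_cache_folder(repo_id: str) -> str:
    # Same string transformation as HuggingFaceVlmOptions.repo_cache_folder above.
    return repo_id.replace("/", "--")


def local_model_dir(artifacts_path: str, repo_id: str) -> Path:
    # Hypothetical helper (assumption): where pre-downloaded weights
    # might be placed under artifacts_path.
    return Path(artifacts_path) / repo_cache_folder(repo_id)


print(repo_cache_folder("ds4sd/SmolDocling-256M-preview"))
print(repo_cache_folder("granite-vision-3.1-2b-preview"))
print(local_model_dir("/models", "ibm-granite/granite-vision-3.1-2b-preview"))
```

A repo id without a slash comes back unchanged, which is consistent with passing a bare folder name as `repo_id` when the weights are already on disk.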

DocTags example:
* HTML-like format (similar to Qwen2-VL HTML output)

```
<document>
<subtitle-level-1><location><page_1><loc_18><loc_85><loc_83><loc_89></location>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</subtitle-level-1>
<paragraph><location><page_1><loc_15><loc_77><loc_32><loc_83></location>Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com</paragraph>...
```
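To make the structure of the sample above concrete, here is a rough sketch that pulls the tag name, page number, `loc_` values, and text out of each element with regular expressions. It only assumes the `<tag><location>...</location>text</tag>` shape visible in the sample. `parse_doctags` is a hypothetical helper, not a docling API, and the meaning of the four `loc_` numbers is not stated here, so they are kept as plain integers.

```python
import re

# One element of the form <tag><location>...</location>text</tag>.
ELEMENT_RE = re.compile(
    r"<(?P<tag>[\w-]+)><location>(?P<loc>.*?)</location>(?P<text>.*?)</(?P=tag)>",
    re.DOTALL,
)
PAGE_RE = re.compile(r"<page_(\d+)>")
LOC_RE = re.compile(r"<loc_(\d+)>")


def parse_doctags(doctags: str) -> list[dict]:
    """Hypothetical helper: extract tag, page, loc values and text per element."""
    elements = []
    for match in ELEMENT_RE.finditer(doctags):
        loc = match.group("loc")
        page = PAGE_RE.search(loc)
        elements.append(
            {
                "tag": match.group("tag"),
                "page": int(page.group(1)) if page else None,
                "loc": [int(v) for v in LOC_RE.findall(loc)],
                "text": match.group("text").strip(),
            }
        )
    return elements


sample = (
    "<document>\n"
    "<subtitle-level-1><location><page_1><loc_18><loc_85><loc_83><loc_89></location>"
    "DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</subtitle-level-1>\n"
    "<paragraph><location><page_1><loc_15><loc_77><loc_32><loc_83></location>"
    "Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com</paragraph>..."
)

for element in parse_doctags(sample):
    print(element["tag"], element["page"], element["loc"], element["text"][:40])
```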

In [1]:
import os

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    VlmPipelineOptions,
    HuggingFaceVlmOptions,
    ResponseFormat
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

from config import settings

In [2]:
sources = [
    # "tests/data/2305.03393v1-pg9-img.png",
    "tests/data/1706.03762v7_pg1_3.pdf",
]

## Use experimental VlmPipeline
pipeline_options = VlmPipelineOptions()
# If force_backend_text = True, text from backend will be used instead of generated text
pipeline_options.force_backend_text = False
pipeline_options.generate_picture_images = True

pipeline_options.accelerator_options.device = AcceleratorDevice.MPS

model_dir = os.path.join(
    settings.docling_model_weight_dir, "granite-vision-3.1-2b-preview"
)

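pipeline_options.artifacts_path = settings.docling_model_weight_dir
vlm_conversion_options = HuggingFaceVlmOptions(
    # Bare folder name (matches model_dir above), presumably resolved against
    # artifacts_path; the Hugging Face repo id is "ibm-granite/granite-vision-3.1-2b-preview"
    repo_id="granite-vision-3.1-2b-preview",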
    # repo_id=model_dir,
    prompt="OCR the full page to markdown.",
    # prompt="OCR this image.",
    response_format=ResponseFormat.MARKDOWN,
    # response_format=ResponseFormat.DOCTAGS,
    load_in_8bit=False,
    quantized=False
)

## Pick a VLM model. SmolDocling-256M is the default; here it is overridden with Granite Vision
pipeline_options.vlm_options = vlm_conversion_options

In [3]:
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
        InputFormat.IMAGE: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

In [13]:
# source = "../samples/1706.03762v7_p3.png"
# source = "../samples/2305.03393v1-pg9-img.png"  # used in the docling example
source = "../samples/1706.03762v7_pg1_3.pdf"  # 3 pages of "Attention Is All You Need"

res = converter.convert(source)


In [14]:
print(len(res.pages))
for page in res.pages:
    print("")
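    # The label is kept from the docling example; with ResponseFormat.MARKDOWN
    # the response printed below is Markdown, not DocTags.
    print("Predicted page in DOCTAGS:")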
    print(page.predictions.vlm_response.text)

3

Predicted page in DOCTAGS:
|    | 0              | 1              | 2              | 3              |
|---:|:---------------|:---------------|:---------------|:---------------|
|  0 | Ashish Vaswani* | Noam Shazeer*   | Niki Parmar*   | Jakob Uszkoreit* |
|  1 | Google Brain   | Google Brain   | Google Research | Google Research |
|  2 | awvansing@google.com | tooming@google.com | nikip@google.com | nikip@google.com |
|  3 | Ilion Jones*   | Aidan N. Gomez* | Lukasz Kaiser* | Lukasz Kaiser* |
|  4 | Google Research | University of Toronto | Google Brain   | Google Brain   |
|  5 | 11ion@google.com | aidan@cs.toronto.edu | lukaszkaiser@google.com | lukaszkaiser@google.com |
|  6 | Illia Polosukhin* | Illia Polosukhin* | Illia Polosukhin* | Illia Polosukhin* |<|end_of_text|>

Predicted page in DOCTAGS:

|    | 0                                                                                                                                                                                
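The raw response for the first page ends with the literal token `<|end_of_text|>`. If the raw text is used directly instead of the exported document, that marker may need to be removed first. A minimal sketch, assuming the marker only appears once at the very end of the string; `strip_end_token` is a hypothetical helper, not part of docling.

```python
END_TOKEN = "<|end_of_text|>"


def strip_end_token(text: str, token: str = END_TOKEN) -> str:
    """Remove one trailing end-of-sequence marker and surrounding whitespace."""
    text = text.rstrip()
    if text.endswith(token):
        text = text[: -len(token)]
    return text.rstrip()


raw = "|  6 | Illia Polosukhin* | Illia Polosukhin* |<|end_of_text|>\n"
print(strip_end_token(raw))
```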

In [15]:
print(res.document.export_to_markdown())

|    | 0                    | 1                     | 2                       | 3                       |
|----|----------------------|-----------------------|-------------------------|-------------------------|
|  0 | Ashish Vaswani*      | Noam Shazeer*         | Niki Parmar*            | Jakob Uszkoreit*        |
|  1 | Google Brain         | Google Brain          | Google Research         | Google Research         |
|  2 | awvansing@google.com | tooming@google.com    | nikip@google.com        | nikip@google.com        |
|  3 | Ilion Jones*         | Aidan N. Gomez*       | Lukasz Kaiser*          | Lukasz Kaiser*          |
|  4 | Google Research      | University of Toronto | Google Brain            | Google Brain            |
|  5 | 11ion@google.com     | aidan@cs.toronto.edu  | lukaszkaiser@google.com | lukaszkaiser@google.com |
|  6 | Illia Polosukhin*    | Illia Polosukhin*     | Illia Polosukhin*       | Illia Polosukhin*       |

|    | 0                                     
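In the exported table above, some rows repeat the same value across columns (row 2 has `nikip@google.com` twice, row 6 has the same name four times), which suggests a recognition error. A rough sanity check for this kind of output could parse the Markdown table and flag such rows. Both function names are hypothetical, and the parser only handles simple pipe tables like the one shown.

```python
def parse_markdown_table(md: str) -> list[list[str]]:
    """Return table rows as lists of stripped cells, skipping the separator row."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # The header separator row consists only of dashes and colons.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_with_repeats(rows: list[list[str]]) -> list[list[str]]:
    """Return rows whose data cells (all but the index column) contain duplicates."""
    flagged = []
    for cells in rows:
        values = [cell for cell in cells[1:] if cell]
        if len(values) != len(set(values)):
            flagged.append(cells)
    return flagged


sample = """
|    | 0                    | 1                  | 2                 | 3                 |
|----|----------------------|--------------------|-------------------|-------------------|
|  0 | Ashish Vaswani*      | Noam Shazeer*      | Niki Parmar*      | Jakob Uszkoreit*  |
|  2 | awvansing@google.com | tooming@google.com | nikip@google.com  | nikip@google.com  |
|  6 | Illia Polosukhin*    | Illia Polosukhin*  | Illia Polosukhin* | Illia Polosukhin* |
"""

for row in rows_with_repeats(parse_markdown_table(sample)):
    print("row", row[0], "has repeated cells:", row[1:])
```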

In [16]:
res.document

DoclingDocument(schema_name='DoclingDocument', version='1.1.0', name='1706.03762v7_pg1_3', origin=DocumentOrigin(mimetype='text/markdown', binary_hash=6974430905591189644, filename='1706.03762v7_pg1_3.pdf', uri=None), furniture=GroupItem(self_ref='#/furniture', parent=None, children=[], content_layer=<ContentLayer.FURNITURE: 'furniture'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>), body=GroupItem(self_ref='#/body', parent=None, children=[RefItem(cref='#/tables/0'), RefItem(cref='#/tables/1'), RefItem(cref='#/tables/2')], content_layer=<ContentLayer.BODY: 'body'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>), groups=[], texts=[], pictures=[], tables=[TableItem(self_ref='#/tables/0', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, label=<DocItemLabel.TABLE: 'table'>, prov=[], captions=[], references=[], footnotes=[], image=None, data=TableData(table_cells=[TableCell(bbox=None, row_span=1, col_span=1, start_row_off

In [17]:
for item in res.document.body.children:
    print(item)
    # print(type(item))
    
# item = res.document.pictures[0]

cref='#/tables/0'
cref='#/tables/1'
cref='#/tables/2'
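Each printed child is a reference such as `cref='#/tables/0'`, which reads like a JSON-Pointer-style path into the document. The sketch below shows how such a path could be followed through nested dicts and lists. `resolve_cref` is a hypothetical helper and `toy_doc` is a made-up stand-in with invented values, not a real `DoclingDocument`.

```python
def resolve_cref(doc: dict, cref: str):
    """Follow a '#/a/0/b'-style reference through nested dicts and lists."""
    if not cref.startswith("#/"):
        raise ValueError(f"unsupported reference: {cref!r}")
    node = doc
    for part in cref[2:].split("/"):
        # List levels are addressed by integer index, dict levels by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Toy data for illustration only (assumption, not taken from the run above).
toy_doc = {
    "tables": [
        {"label": "table", "num_rows": 8},
        {"label": "table", "num_rows": 3},
    ]
}

print(resolve_cref(toy_doc, "#/tables/1"))
```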


In [18]:
# image = item.get_image(res.document)
# print(image)

In [19]:
from pathlib import Path
from docling_core.types.doc import DocItemLabel, ImageRefMode
from docling_core.types.doc.document import DEFAULT_EXPORT_LABELS

res.document.save_as_html(
    filename=Path("./3_docling_vlm_test_result.html"),
    image_mode=ImageRefMode.REFERENCED,
    labels=[*DEFAULT_EXPORT_LABELS, DocItemLabel.FOOTNOTE],
)

In [20]:
DEFAULT_EXPORT_LABELS

{<DocItemLabel.CHECKBOX_SELECTED: 'checkbox_selected'>,
 <DocItemLabel.CHECKBOX_UNSELECTED: 'checkbox_unselected'>,
 <DocItemLabel.CODE: 'code'>,
 <DocItemLabel.DOCUMENT_INDEX: 'document_index'>,
 <DocItemLabel.FORMULA: 'formula'>,
 <DocItemLabel.LIST_ITEM: 'list_item'>,
 <DocItemLabel.PAGE_FOOTER: 'page_footer'>,
 <DocItemLabel.PAGE_HEADER: 'page_header'>,
 <DocItemLabel.PARAGRAPH: 'paragraph'>,
 <DocItemLabel.PICTURE: 'picture'>,
 <DocItemLabel.REFERENCE: 'reference'>,
 <DocItemLabel.SECTION_HEADER: 'section_header'>,
 <DocItemLabel.TABLE: 'table'>,
 <DocItemLabel.TEXT: 'text'>,
 <DocItemLabel.TITLE: 'title'>}