# Form Tools Demo

This notebook walks you through using the `form_tools` library to extract metadata from a PDF AcroForm and then to extract field thumbnails from a scanned image of a completed form.

## Extracting metadata from a PDF form

To extract `FormMetadata` from a PDF AcroForm, first define the input and output paths used throughout this notebook, then import the `PdfFormMetaExtractor` class as follows:

In [None]:
output_paths = {
    "template_images": "notebooks/template_images",
    "metadata": "notebooks/metadata/",
    "pass_directory": "notebooks/pass_directory",
    "fail_directory": "notebooks/fail_directory",
}

template_path = "tests/tests_end2end/data/dummy_form.pdf"
scanned_path = "tests/tests_end2end/data/scanned_dummy_form.jpg"
config_path = "tests/tests_end2end/data/config.yaml"

In [None]:
from form_tools.form_meta.extractors.pdf_form_extractor import PdfFormMetaExtractor

# Instantiate extractor
pfme = PdfFormMetaExtractor()

Once the extractor has been created you can use it to create:
* a form image directory for your form
* a `FormMetadata` object for further processing

To do this, simply specify the location of your form and the name of the directory for storing the images for each page of your form.

In [None]:
# Create FormMetadata object and populate
# image directory template_images
form_metadata = pfme.extract_meta(
    form_template_path=template_path,
    form_image_dir=output_paths["template_images"],
    form_image_dir_overwrite=True,
)

Let's view the generated form image.

In [None]:
import os
import cv2

from PIL import Image
from IPython.display import display
from form_tools.utils.image_reader import ImageReader

_, imgs = ImageReader.read(os.path.join(output_paths["template_images"], "page_1.ppm"))

form_image = cv2.cvtColor(imgs[0], cv2.COLOR_BGR2RGB)

display(Image.fromarray(form_image))


## Manipulating `FormMetadata`

Let's find out if the metadata contains all the fields we can see.

In [None]:
print("\n".join([f.name for f in form_metadata.form_fields]))

All the fields are included, but unfortunately some of the names aren't very clear. Let's change that.

In [None]:
new_name_map = {
    "textbox1": "name",
    "textbox2": "occupation",
    "textbox22": "favouritelibrary",
}

for k, v in new_name_map.items():
    new_field = form_metadata.form_field(k)
    new_field.name = v

    # Note: we need to use update_column and remove_column
    # as form_field is only a read method. `FormMetadata` is
    # a child class of the mojap `Metadata` class and uses its
    # methods to set / update properties
    form_metadata.update_column(new_field.to_dict())
    form_metadata.remove_column(k)

print("\n".join([f.name for f in form_metadata.form_fields]))

Each form field has a `bounding_box` component which stores the dimensions of the field's rectangular bounding box. Let's access the bounding box for `languageotherdetails`.

In [None]:
print(form_metadata.form_field("languageotherdetails").bounding_box)

We can use this bounding box along with the `BoundingBoxOperator` class to crop our form image to the field in question.

In [None]:
from form_tools.form_operators.bounding_box_operator import BoundingBoxOperator

bbop = BoundingBoxOperator()

field_image = bbop.crop_image_to_bb(
    image=imgs[0],
    bounding_box=form_metadata.form_field("languageotherdetails").bounding_box
)

# Swap the channel order from BGR to RGB before handing the array to PIL
field_pil_image = cv2.cvtColor(field_image, cv2.COLOR_BGR2RGB)
display(Image.fromarray(field_pil_image))
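
Conceptually, cropping to a bounding box is just slicing rows and columns out of the page image. The sketch below illustrates the idea without `form_tools`. The `crop_to_box` helper and the `(left, top, width, height)` box layout are assumptions made for illustration and may not match how `BoundingBoxOperator` or the library's bounding box object represent coordinates.

```python
def crop_to_box(image, left, top, width, height):
    """Return the sub-image covered by the box (hypothetical helper).

    `image` is a list of pixel rows; the box is given in pixels with
    the origin at the top-left corner (an assumed convention).
    """
    rows = image[top:top + height]
    return [row[left:left + width] for row in rows]


# A tiny 4x5 "image" whose pixel values encode their own position.
image = [[10 * y + x for x in range(5)] for y in range(4)]

field = crop_to_box(image, left=1, top=2, width=3, height=2)
print(field)
```

With a NumPy array, such as those OpenCV works with, the same crop is `image[top:top + height, left:left + width]`.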


`FormMetadata` also contains information on the pages contained in the PDF form. You can access this information using the `form_pages` property.

In [None]:
print(form_metadata.form_pages)

Each `FormMetadata` object needs a regex identifier for each page included, as well as an overall form identifier regex. Let's set these.

In [None]:
# Set form page identifiers
identifier_map = {
    1: "Dummy"
}

new_form_pages = []
for page_number, identifier in identifier_map.items():
    form_page = form_metadata.form_page(page_number)
    form_page.identifier = identifier
    new_form_pages.append(form_page)

form_metadata.form_pages = new_form_pages
form_metadata.form_identifier = "Dummy"

print(form_metadata.form_identifier)
print(form_metadata.form_pages)
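
Because the identifiers are regular expressions, they can be checked against OCR text with Python's `re` module. The following is a minimal sketch of that idea, not the library's matching code; `page_matches` and the sample strings are hypothetical.

```python
import re


def page_matches(identifier, page_text):
    """Return True if the identifier regex occurs anywhere in the text."""
    return re.search(identifier, page_text) is not None


# Made-up OCR output for a single page.
ocr_text = "Dummy Form\nName: ........\nOccupation: ........"

print(page_matches("Dummy", ocr_text))
print(page_matches(r"Tax\s+Return", ocr_text))
```

A more specific pattern reduces the chance that an unrelated form also matches.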


We can now write our metadata out to a JSON file.

In [None]:
from pathlib import Path

metadata_path = Path(output_paths["metadata"])
# exist_ok=True makes a separate existence check unnecessary
metadata_path.mkdir(parents=True, exist_ok=True)

form_metadata.to_json(
    os.path.join(output_paths["metadata"], "dummy_form_meta.json"),
    indent=4,
)

If we ever want to read the JSON file back in as a `FormMetadata` object we can use the `from_json` classmethod.

In [None]:
from form_tools.form_meta import FormMetadata

meta = FormMetadata.from_json(
    os.path.join(
        output_paths["metadata"],
        "dummy_form_meta.json",
    ),
)
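
The round trip can be illustrated with the standard `json` module. This sketch uses a plain dictionary with made-up keys in place of a real `FormMetadata` object, so it only shows the general write-then-read pattern, not the library's serialisation format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for form metadata; the keys are illustrative only.
example_meta = {
    "form_identifier": "Dummy",
    "fields": [{"name": "name"}, {"name": "occupation"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "example_meta.json"

    # Write with indentation so the file is easy to read and diff.
    json_path.write_text(json.dumps(example_meta, indent=4))

    # Read it back and confirm nothing was lost.
    restored = json.loads(json_path.read_text())

print(restored == example_meta)
```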

## Extracting fields from a scanned form

Let's use our form metadata to extract the thumbnails from a scanned document. Let's first see what we're dealing with.

In [None]:
_, imgs = ImageReader.read(scanned_path)
scanned_image = cv2.cvtColor(imgs[0], cv2.COLOR_BGR2RGB)
display(Image.fromarray(scanned_image))

We'll need a config file for our form processor. We'll use `config.yaml`, which is based on an example from the docs. It sets:
* the algorithms used to align the scanned image to the template
* the OCR engine used to match pages in the scanned form to the template form
* preprocessing functions to apply to the scanned image for cleaning

You can find more information in the docs.

In [None]:
with open(config_path, "r") as f:
    config_text = f.read()

print(config_text)

Rather than running the full pipeline in one go, let's run it step by step. We'll first read the scanned image, apply preprocessing and then auto-rotate the image (in case it's landscape).

In [None]:
from form_tools.form_operators import FormOperator

form_operator = FormOperator.create_from_config(config_path)

_, imgs = ImageReader.read(scanned_path)

preprocessed_imgs = form_operator.preprocess_form_images(imgs)

rotated_imgs = form_operator.auto_rotate_form_images(preprocessed_imgs)

rotated_image = cv2.cvtColor(rotated_imgs[0], cv2.COLOR_RGB2BGR)
display(Image.fromarray(rotated_image))
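
The auto-rotation step can be pictured as a simple orientation check: if a page is wider than it is tall, turn it by 90 degrees. The sketch below shows that check on a list-of-rows image. The `rotate_if_landscape` helper and the choice of a clockwise turn are assumptions; the library may decide orientation differently.

```python
def rotate_if_landscape(image):
    """Rotate a list-of-rows image 90 degrees clockwise if it is landscape."""
    height = len(image)
    width = len(image[0]) if image else 0
    if width <= height:
        return image  # already portrait (or square)
    # Reversing the rows and transposing gives a clockwise rotation.
    return [list(row) for row in zip(*image[::-1])]


# A 2-row by 3-column image, i.e. wider than it is tall.
landscape = [
    [1, 2, 3],
    [4, 5, 6],
]
portrait = rotate_if_landscape(landscape)
print(portrait)
```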


Notice that our rotated image is now in grayscale. We'll now run OCR on the image to check it's a match to our metadata.

In [None]:
form_images_text = form_operator.form_images_to_text(rotated_imgs)
print(form_images_text)

It looks like it should match. Let's test it.

In [None]:
matching_meta_store = form_operator.match_form_images_text_to_form_meta(
    output_paths["metadata"], form_images_text
)
print(matching_meta_store)

Our metadata store isn't empty, which means a match has been found, and, as expected, it's the metadata we created. We'll now use that metadata to validate and match the pages in our scanned document against our original form. We only have one page in each, so there's little to match, but this step would be crucial for a multi-page form.

In [None]:
meta_id, meta = list(matching_meta_store.items())[0]

matched_images = form_operator.validate_and_match_pages(
    form_images=rotated_imgs,
    form_meta=meta,
    form_images_as_strings=form_images_text,
)

We now need to align our scanned form to the template so that the fields in the scanned form sit in the same positions as in the original form. Behind the scenes, keypoint detection and matching are used to estimate a homography matrix, which is simply a transformation that maps our image onto the template. We can view this process in action by setting `debug=True` (click on any window that opens and press the down key to step through the code).

In [None]:
aligned_images = form_operator.align_images_to_template(
    matched_images, form_meta=meta, debug=True
)
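
A homography is a 3x3 matrix that maps a point in one image to the corresponding point in another. The sketch below applies such a matrix to a single point by hand. The matrix values and the `apply_homography` helper are illustrative assumptions, not output from `form_tools`.

```python
def apply_homography(matrix, point):
    """Map an (x, y) point through a 3x3 homography matrix."""
    x, y = point
    # Multiply the matrix by the homogeneous point (x, y, 1).
    xh = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    yh = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    w = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
    # Divide by w to return to ordinary image coordinates.
    return (xh / w, yh / w)


# Example matrix: scale by 2, then shift 10 px right and 5 px down.
H = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 5.0],
    [0.0, 0.0, 1.0],
]

print(apply_homography(H, (3.0, 4.0)))
```

In practice the matrix is estimated from matched keypoints, and a function such as OpenCV's `cv2.warpPerspective` applies it to every pixel of the image.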

The overlay looks pretty good. Let's extract the fields from the aligned images. We'll set `debug=True` again to see the thumbnail images as they're generated.

In [None]:
extracted_fields = form_operator.extract_fields(
    aligned_images,
    form_meta=meta,
    debug=True,
)

Rather than running the above sequentially, we can simply use the `run_full_pipeline` method to do everything in one go and store the images in a directory (local for this demo, but you could also specify an S3 location).

In [None]:
from form_tools.form_operators import FormOperator

form_operator = FormOperator.create_from_config(config_path)

_ = form_operator.run_full_pipeline(
    form_path=scanned_path,
    pass_dir=output_paths["pass_directory"],
    fail_dir=output_paths["fail_directory"],
    form_meta_directory=output_paths["metadata"],
)

Let's read one of the fields back in as an image.

In [None]:
pass_directory = Path(output_paths["pass_directory"])

for p in pass_directory.glob("**/*.jpg"):
    if "field_name=favouritelibrary" in p.as_posix():
        _, imgs = ImageReader.read(
            p.as_posix()
        )

        display(Image.fromarray(imgs[0]))
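
The `field_name=favouritelibrary` segment suggests that the output paths carry `key=value` parts. Assuming that layout, those parts can be parsed into a dictionary rather than matched as substrings. The example path and the `partition_values` helper below are hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Collect key=value directory segments of a path into a dict."""
    values = {}
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


# Hypothetical output path; the real directory layout may differ.
example = "pass_directory/form=dummy/field_name=favouritelibrary/thumb.jpg"

print(partition_values(example))
```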

If your image contains computer-generated text, you can then pass it to an OCR engine (e.g. Tesseract) to convert the image to text. Handwritten text is more complicated.

Let's clean things up.

In [None]:
from shutil import rmtree

for path in output_paths.values():
    if os.path.exists(path):
        rmtree(path)