<a href="https://colab.research.google.com/github/matthewleechen/woodcroft_patents/blob/main/notebooks/layout_detect_patents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook was designed for training in Google Colab Pro; running it on the Colab free plan is **not** recommended. The training loop was originally run on Colab Pro with one Nvidia A100 (40GB) GPU. Training is both compute- and time-intensive: it consumed approximately 15-20 compute credits per hour, and the Faster-RCNN and Mask-RCNN (ResNet-50 backbone) models took approximately 12 hours to train to 100,000 iterations. These figures will vary significantly with the size of the input images that must be loaded into GPU memory: I compressed mine to 20% of their original size. You can also run the notebook locally, on a virtual machine or on a server, but check the dependencies carefully.

This notebook uses the [Detectron2](https://github.com/facebookresearch/detectron2) library for object detection and instance segmentation from Facebook AI Research, the [Google Cloud Vision API](https://cloud.google.com/vision/docs) and the [LayoutParser](https://layout-parser.github.io) toolkit. It also borrows heavily from the [layout model training directory](https://github.com/Layout-Parser/layout-model-training) from LayoutParser.

**Prepare labelled data and directories**

In [None]:
# Clone forked layout-model-training repo from Layout-Parser
! git clone https://github.com/matthewleechen/layout-model-training
! cd /content/layout-model-training/ && pip install -r requirements.txt

# Change working directory
%cd /content/layout-model-training/

# Clone forked detectron2 repo from FAIR
! git clone https://github.com/matthewleechen/detectron2

# Clone forked cocosplit repo from akarazniewicz
! git clone https://github.com/matthewleechen/cocosplit
! pip install -r cocosplit/requirements.txt

# Install remaining dependencies
! pip install -e git+https://github.com/matthewleechen/layout-parser.git#egg=layoutparser
! pip install torchvision && pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"
! pip install google-cloud-vision

Restart runtime before proceeding.

In [1]:
# Change working directory
%cd /content/layout-model-training/

/content/layout-model-training


In [2]:
import os
import zipfile
import layoutparser as lp
import cv2
from concurrent.futures import ThreadPoolExecutor

Upload the zipped COCO annotations file (`patent_data.zip`) to the current directory `/content/layout-model-training/`.

In [3]:
zip_file = "patent_data.zip"

# Create data folder
output_folder = "data"
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Extract the contents of the annotations file to the data folder
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    for member in zip_ref.namelist():
        if not member.startswith('._'):
            zip_ref.extract(member, output_folder)
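
Before splitting, it can help to confirm that the extracted annotations look as expected. The following is a minimal sketch, assuming the archive contains a standard COCO file at `data/result.json` (the path passed to the split command in this notebook). `summarize_coco` is a hypothetical helper, not part of LayoutParser or Detectron2.

```python
import json
from collections import Counter


def summarize_coco(coco):
    """Count images, annotations and annotations per category name in a COCO dict."""
    # Map category ids to readable names (hypothetical helper, standard COCO keys assumed)
    id_to_name = {c["id"]: c["name"] for c in coco.get("categories", [])}
    per_category = Counter(
        id_to_name.get(a["category_id"], "unknown")
        for a in coco.get("annotations", [])
    )
    return {
        "images": len(coco.get("images", [])),
        "annotations": len(coco.get("annotations", [])),
        "per_category": dict(per_category),
    }


# Example usage (assumes data/result.json exists after extraction):
# with open("data/result.json") as f:
#     print(summarize_coco(json.load(f)))
```

The category names reported here should match the `label_map` used at inference time.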


In [4]:
# Create outputs folder
! mkdir outputs

The outputs folder will contain evaluation data, checkpoint information and model weights following training.

**Split the data into training and test sets**

The code below allocates 80% of the data to the training set and 20% to the test set. You can change this via the `-s` parameter, currently set to 0.8.

In [5]:
# Split the data into training and test sets with cocosplit
! python cocosplit/cocosplit.py --having-annotations --multi-class -s 0.8 data/result.json data/train.json data/test.json --seed 42

Saved 9016 entries in data/train.json and 2254 in data/test.json
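
The counts above correspond to a training fraction of 9016 / (9016 + 2254) = 0.8. As a rough sanity check, the sketch below recomputes that fraction and looks for images that appear in both splits. It assumes both output files are standard COCO dictionaries with an `images` list whose entries carry an `id`. `check_split` is a hypothetical helper and is not part of cocosplit.

```python
import json


def check_split(train, test):
    """Return the training fraction and any image ids shared by both splits."""
    train_ids = {img["id"] for img in train["images"]}
    test_ids = {img["id"] for img in test["images"]}
    total = len(train_ids) + len(test_ids)
    fraction = len(train_ids) / total if total else 0.0
    return fraction, train_ids & test_ids


# Example usage (assumes the split above has been run):
# with open("data/train.json") as f_train, open("data/test.json") as f_test:
#     fraction, overlap = check_split(json.load(f_train), json.load(f_test))
#     print(f"train fraction: {fraction:.3f}, shared images: {len(overlap)}")
```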


**Training Detectron2 vision models**

***Continue training from last checkpoint***

Upload the `last_checkpoint` file and the model weights file (`model_{number of iterations}.pth`) to the outputs folder.

***Start training from default pre-trained model weights***

Ensure the outputs folder is empty.

***Evaluation only***

Pass the `--eval-only MODEL.WEIGHTS outputs/last_checkpoint` argument to the `train_annotations.sh` file.
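
To avoid starting the wrong kind of run, the state of the outputs folder can be checked first. The sketch below is an illustration only: `training_mode` is a hypothetical helper, and it assumes that the `last_checkpoint` file contains the file name of the most recent weights file, which is how Detectron2's checkpointer normally writes it.

```python
import os


def training_mode(outputs_dir="outputs"):
    """Guess which of the scenarios above the outputs folder is set up for."""
    entries = os.listdir(outputs_dir) if os.path.isdir(outputs_dir) else []
    if not entries:
        return "fresh"  # empty folder: training starts from the pre-trained weights
    marker = os.path.join(outputs_dir, "last_checkpoint")
    if os.path.isfile(marker):
        with open(marker) as f:
            weights_name = f.read().strip()  # assumed to hold the weights file name
        if os.path.isfile(os.path.join(outputs_dir, weights_name)):
            return "resume"
        return "missing-weights"  # marker uploaded without the matching .pth file
    return "unclear"  # folder has files but no last_checkpoint marker


# Example usage:
# print(training_mode("outputs"))
```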


Note: The default model in `train_annotations.sh` is Faster-RCNN with a ResNet-50 backbone and a feature pyramid network (config file: `layout-model-training/configs/fast_rcnn_R_50_FPN_3x.yaml`). Mask-RCNN with the same backbone and feature pyramid network is also available (config file: `mask_rcnn_R_50_FPN_3x.yaml`). Mask-RCNN is an instance segmentation model built on Faster-RCNN, so disabling segmentation masks leaves you with a Faster-RCNN model. To train Mask-RCNN you need a COCO dataset with segmentation masks; otherwise an attribute error is raised. You can try other models from the Detectron2 library (`layout-model-training/detectron2/configs`) by changing `config-file` in `train_annotations.sh`.

Hyperparameters can be adjusted in the configuration files (some can also be adjusted in `train_annotations.sh`). If training diverges, you will likely need to reduce the base learning rate (`BASE_LR` in the config file). I used the hyperparameters from the config files in the cloned repository, which correspond to the default Detectron2 hyperparameters, with one exception: the base learning rate for Mask-RCNN was halved to 0.01 because training diverged at 0.02. I set the maximum number of iterations to 100,000 (from the default 60,000) and trained all models to this iteration.
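
Detectron2 training scripts generally accept configuration overrides as trailing `KEY VALUE` pairs, in the same style as the `MODEL.WEIGHTS outputs/last_checkpoint` argument mentioned above. The sketch below only formats such pairs. `format_overrides` is a hypothetical helper, and whether `train_annotations.sh` forwards extra arguments unchanged is an assumption to check against the script itself.

```python
def format_overrides(overrides):
    """Format a dict of config keys as the 'KEY VALUE' pairs used on the command line."""
    parts = []
    for key, value in overrides.items():
        parts.extend([key, str(value)])
    return " ".join(parts)


# The values below are the ones described in the text for Mask-RCNN
print(format_overrides({"SOLVER.BASE_LR": 0.01, "SOLVER.MAX_ITER": 100000}))
# prints: SOLVER.BASE_LR 0.01 SOLVER.MAX_ITER 100000
```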

I train Fast-RCNN using a subset of the annotations (only those from 1853). I train both Faster-RCNN and Mask-RCNN using the full set of annotations.

Note that setting the seed on Detectron2 models does not guarantee deterministic behavior - see https://detectron2.readthedocs.io/en/latest/modules/config.html for further information.

In [None]:
# Training loop
! bash scripts/train_annotations.sh

**Visualize bounding box predictions (Optional)**

In [None]:
# Requires `model` to be loaded first (run the model cell in the Inference section)
image = cv2.imread("/path/to/image") # Set path to image you want to visualize
layout = model.detect(image)

In [None]:
blocks = lp.Layout([b for b in layout if b.type=='text' or b.type=='date_box' or b.type=='full_box'])

In [None]:
text_blocks = lp.Layout([b for b in layout if b.type=='text'])
date_blocks = lp.Layout([b for b in layout if b.type=='date_box'])
header_blocks = lp.Layout([b for b in layout if b.type=='header'])
full_blocks = lp.Layout([b for b in layout if b.type=='full_box']) # label is 'full_box' in label_map

In [None]:
# Visualization (box_width is the relative width of bounding box boundaries)
lp.draw_box(image, blocks, box_width=10) # Can replace blocks with text_blocks, date_blocks, header_blocks or full_blocks

**Inference**

In [None]:
model = lp.Detectron2LayoutModel(
    config_path = "/path/to/config/file", # config file will be in outputs folder
    model_path = "/path/to/model/weights/file", # model weights file will be in outputs folder
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5], # set confidence threshold here (replace 0.5 if desired)
    label_map={0: "date_box", 1: "full_box", 2: "header", 3: "text"} # set label_map according to COCO dataset
)

Upload your Google Cloud Vision (GCV) credentials file to the current directory. You will need a Google account to use the GCV API. Instructions for setting up your credentials are available [here](https://developers.google.com/workspace/guides/create-credentials).

In [None]:
# Initialize GCV API
ocr_agent = lp.GCVAgent.with_credential("/path/to/credentials",
                                        languages = ['en'])

The original patent documents span several books. Before running inference, it is recommended to structure the unlabelled documents in separate directories by book. For example:

```
patent_data_woodcroft/
├── chron_1617-1852-vol1/
│   ├── chron_1617-1852-vol1_0019.jp2
│   ├── chron_1617-1852-vol1_0020.jp2
│   ├── chron_1617-1852-vol1_0021.jp2
│   ...
│   └── chron_1617-1852-vol1_0802.jp2
├── chron_1617-1852-vol2/
├── chron_1852-oct-dec/
├── chron_1853/
├── chron_1854/
├── chron_1855/
...
└── chron_1871/
```

Each subdirectory of "patent_data_woodcroft" should contain all the image files you want to run inference on. Then, run the following three code cells for each directory (book) to obtain a merged text file for every book.

In [None]:
# Set folder path
folder_path = "/path/to/directory"

The inference loop below performs layout detection and OCR (using GCV) on the images in the folder path, and writes the output to a .txt file that inherits the same name as the original image file, separating each bounding box of text with a separator (---).

The loop only runs over all image files within a folder (e.g. within `chron_1854`), and not over all folders. The reason is to allow for cost monitoring given that GCV charges on a per page basis. If cost monitoring is not a concern, you can modify the code block below to loop over all the folders.
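
If you do want to work across books, one option is to list the images per book first, so that the page count (and therefore the GCV cost) is known before any request is sent. The sketch below assumes the directory layout shown above and `.jpg` images, matching the filter used in the inference cell. `list_book_images` is a hypothetical helper.

```python
import os


def list_book_images(root, extensions=(".jpg",)):
    """Map each book subdirectory of `root` to its sorted list of image paths."""
    books = {}
    for name in sorted(os.listdir(root)):
        book_dir = os.path.join(root, name)
        if not os.path.isdir(book_dir):
            continue  # ignore stray files at the top level
        books[name] = sorted(
            os.path.join(book_dir, f)
            for f in os.listdir(book_dir)
            if f.lower().endswith(extensions)
        )
    return books


# Example usage: report page counts per book before running inference
# for book, paths in list_book_images("patent_data_woodcroft").items():
#     print(book, len(paths), "pages")
```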

Running inference on all of the unlabelled original patent documents takes approximately 10-12 hours in total on a Colab CPU runtime. Batch processing on Colab is limited by the small number of CPU cores; inference will complete much faster if run locally on a multi-core machine or on a GPU.

In [None]:
# Loop over documents and run inference
def process_image(filepath):
    # Construct the input and output file paths
    output_filepath = os.path.splitext(filepath)[0] + '.txt'

    # Perform layout detection and OCR on the image
    image = cv2.imread(filepath)
    layout = model.detect(image)
    blocks = lp.Layout([b for b in layout if b.type=='text' or b.type=='date_box' or b.type=='full_box'])

    with open(output_filepath, 'w') as f:
        sorted_blocks = sorted(blocks, key=lambda b: b.coordinates[1]) # order by y-axis

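        for block in sorted_blocks:
            # Pad each block by 5 pixels on every side before cropping
            segment_image = (block
                                .pad(left=5, right=5, top=5, bottom=5)
                                .crop_image(image))

            # OCR the cropped segment (a separate name avoids shadowing the page layout)
            ocr_layout = ocr_agent.detect(segment_image)

            full_text = ''
            for line in ocr_layout: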
                text = line.text
                if text.endswith('.'):
                    full_text += text + '\n'
                else:
                    # remove spaces before commas
                    text = text.replace(' ,', ',')
                    full_text += text + ' '

            # remove space before full stops
            full_text = full_text.replace(' .', '.')

            # Write the output to the file
            f.write(full_text.strip() + "\n")
            f.write('---\n')

if __name__ == '__main__':

    # Get a list of all the JPEG files in the folder
    filenames = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.lower().endswith('.jpg')]

    # Batch process the images using multiple threads
    with ThreadPoolExecutor(max_workers=2) as executor: # set max_workers = #cpu cores (2 on Colab)
        # Consume the iterator so that exceptions raised in worker threads are re-raised here
        list(executor.map(process_image, filenames))
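
The text clean-up inside the loop can be tried on its own, without calling GCV. The sketch below reproduces the same rules as a standalone function: a fragment ending in a full stop closes the line, other fragments are joined with a space after removing spaces before commas, and spaces before full stops are removed at the end. `join_ocr_lines` is a hypothetical name, and the sample fragments are made up for illustration.

```python
def join_ocr_lines(lines):
    """Join OCR text fragments using the same rules as the inference loop."""
    full_text = ""
    for text in lines:
        if text.endswith("."):
            # A fragment ending in a full stop closes the current line
            full_text += text + "\n"
        else:
            # Remove spaces before commas, then continue on the same line
            text = text.replace(" ,", ",")
            full_text += text + " "
    # Remove spaces before full stops and trim surrounding whitespace
    return full_text.replace(" .", ".").strip()


print(join_ocr_lines(["Improvements in steam", "engines , boilers", "and furnaces ."]))
# prints: Improvements in steam engines, boilers and furnaces.
```

As in the loop, the comma fix is applied only to fragments that do not end in a full stop.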

This code merges the saved text files across all pages in order of page number. The final output will be a merged text file containing all the patents in the relevant book (e.g. a single `merged.txt` file for every subdirectory of `patent_data_woodcroft`). A separator (---) will be between any two bounding boxes.

In [None]:
# Create merged text file from all individual pages.

output_file = "merged.txt"

# Get a list of all the per-page .txt files in the directory, sorted by name
# (skip a merged file left over from a previous run)
files = [f for f in os.listdir(folder_path) if f.endswith(".txt") and f != output_file]
files.sort()

with open(os.path.join(folder_path, output_file), "w") as outfile:
    for filename in files:
        with open(os.path.join(folder_path, filename), "r") as infile:
            content = infile.read().strip()
            if content:  # Check if content is not empty
                if outfile.tell() != 0:  # Check if output file is not empty
                    outfile.write("---\n")  # Add separator between files
                # End with a newline so the next separator starts on its own line
                outfile.write(content + "\n")

# Remove double separators
with open(os.path.join(folder_path, output_file), "r+") as f:
    lines = f.readlines()
    f.seek(0)
    for i, line in enumerate(lines):
        if line.strip() != "---" or i == 0 or lines[i-1].strip() != "---":
            f.write(line)
    f.truncate()
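
For downstream processing, the merged file can be read back into a list of text blocks by splitting on the separator lines. The following is a minimal sketch, assuming the `---` separator always appears on a line of its own. `split_blocks` is a hypothetical helper.

```python
def split_blocks(merged_text, separator="---"):
    """Split merged OCR output into a list of non-empty text blocks."""
    blocks, current = [], []
    for line in merged_text.splitlines():
        if line.strip() == separator:
            block = "\n".join(current).strip()
            if block:
                blocks.append(block)
            current = []
        else:
            current.append(line)
    # Keep any trailing text that is not followed by a separator
    tail = "\n".join(current).strip()
    if tail:
        blocks.append(tail)
    return blocks


# Example usage (assumes merged.txt was written by the cell above):
# with open(os.path.join(folder_path, "merged.txt")) as f:
#     blocks = split_blocks(f.read())
```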