# Document Entity Extraction with OpenVINO

This demo shows entity extraction from plain text and from documents (JPG images and PDF files) with OpenVINO. We use a [small BERT-large-like model](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/intel/bert-small-uncased-whole-word-masking-squad-int8-0002) that was distilled from a larger BERT-large model and quantized to INT8 on the SQuAD v1.1 training set. The model comes from [Open Model Zoo](https://github.com/openvinotoolkit/open_model_zoo/). At the bottom of this notebook, you will see live inference results for your own inputs and templates.

## Imports

In [None]:
import cv2
import time
import json
import operator

import pdfplumber
import pytesseract

import numpy as np
import tokens_bert as tokens
from openvino.runtime import Core

### Preparations: Test tesseract OCR with sample image

In [None]:
# On Windows, this cell searches the default Tesseract installation directory for tesseract.exe and points
# pytesseract at it so that the OCR engine can be found. If Tesseract is installed elsewhere, set
# pytesseract.pytesseract.tesseract_cmd to the full path of tesseract.exe.

# https://stackoverflow.com/questions/50655738/how-do-i-resolve-a-tesseractnotfounderror

import sys

if sys.platform == "win32":
    from pathlib import Path

    OCR_INSTALL_DIR = r"C:/Program Files/Tesseract-OCR"
    ocr_path = sorted(list(Path(OCR_INSTALL_DIR).glob("**/tesseract.exe")))
    if len(ocr_path) == 0:
        print("Cannot find the Tesseract OCR executable. On Windows, this notebook requires setting "
              "pytesseract.pytesseract.tesseract_cmd to the full path of tesseract.exe.")
    else:
        pytesseract.pytesseract.tesseract_cmd=str(ocr_path[-1])
        print(f"Successfully set pytesseract.pytesseract.tesseract_cmd to path {ocr_path[-1]}\n")
        
        # Test converting sample JPG image into string using pytesseract OCR
        print('Extracted OCR text:')
        img = cv2.imread('data/sample.jpg')
        print(pytesseract.image_to_string(img))
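
On Linux and macOS the cell above does nothing, so Tesseract has to be reachable through `PATH` already. The following is a minimal sketch of a platform-independent check, assuming the executable is named `tesseract`; `find_tesseract` is a hypothetical helper that is not part of the notebook.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract was not found on PATH; install it or set pytesseract.pytesseract.tesseract_cmd.")
else:
    print(f"Found Tesseract at {tesseract_path}")
```

If the check finds a path, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` in the same way as in the Windows branch.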

## The model

### Download the model

We use `omz_downloader`, which is a command-line tool from the `openvino-dev` package. `omz_downloader` automatically creates a directory structure and downloads the selected model. If the model is already downloaded, this step is skipped.

You can download and use any of the following models by changing the model name below: `bert-large-uncased-whole-word-masking-squad-0001`, `bert-large-uncased-whole-word-masking-squad-int8-0001`, `bert-small-uncased-whole-word-masking-squad-0001`, `bert-small-uncased-whole-word-masking-squad-0002` and `bert-small-uncased-whole-word-masking-squad-int8-0002`. All of these models are already converted to OpenVINO Intermediate Representation (IR), so there is no need to use `omz_converter`.

In [None]:
# directory where model will be downloaded
base_model_dir = "model"

# desired precision
precision = "FP16-INT8"

# model name as named in Open Model Zoo
model_name = "bert-small-uncased-whole-word-masking-squad-int8-0002"

model_path = f"{base_model_dir}/intel/{model_name}/{precision}/{model_name}.xml"
model_weights_path = f"{base_model_dir}/intel/{model_name}/{precision}/{model_name}.bin"

download_command = f"omz_downloader " \
                   f"--name {model_name} " \
                   f"--precision {precision} " \
                   f"--output_dir {base_model_dir} " \
                   f"--cache_dir {base_model_dir}"
! $download_command
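
Before loading the model, it can help to confirm that the download produced the files at the expected location. The sketch below rebuilds the `intel/<model name>/<precision>/` layout used above with `pathlib` and reports whether each file exists; `expected_model_files` is a hypothetical helper introduced only for this illustration.

```python
from pathlib import Path


def expected_model_files(base_dir, model_name, precision):
    """Build the .xml and .bin paths that the download step is expected to produce."""
    model_dir = Path(base_dir) / "intel" / model_name / precision
    return model_dir / f"{model_name}.xml", model_dir / f"{model_name}.bin"


xml_path, bin_path = expected_model_files(
    "model", "bert-small-uncased-whole-word-masking-squad-int8-0002", "FP16-INT8"
)
for path in (xml_path, bin_path):
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```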

### Load the model

Downloaded models are located in a fixed structure, which indicates vendor, model name and precision. Only a few lines of code are required to run the model. First, we initialize OpenVINO. Then we read the network architecture and model weights from the .xml and .bin files. Finally, we compile the network for the desired device. You can choose `CPU` or `GPU` in the case of this model.

In [None]:
# initialize inference engine
core = Core()
# read the network and corresponding weights from file
model = core.read_model(model=model_path, weights=model_weights_path)
# load the model on the CPU (you can use GPU as well)
compiled_model = core.compile_model(model=model, device_name="CPU")

# get the input and output ports of the compiled model (their names are printed below)
input_keys = list(compiled_model.inputs)
output_keys = list(compiled_model.outputs)

# get network input size
input_size = compiled_model.input(0).shape[1]

Input keys are the names of the input nodes and output keys contain names of output nodes of the network. In the case of the BERT-large-like model, we have four inputs and two outputs.

In [None]:
[i.any_name for i in input_keys], [o.any_name for o in output_keys]

## Processing

NLP models usually take a list of tokens as input. A token is a word or a piece of a word mapped to an integer. To provide the proper input, we need the vocabulary for this mapping. We also define special tokens such as separators and padding, and a function to load the content. Content can be passed in as plain text or extracted from a JPG image (with OCR) or from a PDF file. Other formats such as HTML, XML or PNG could be supported in the same way.

In [None]:
# path to vocabulary file
vocab_file_path = "data/vocab.txt"

# create dictionary with words and their indices
vocab = tokens.load_vocab_file(vocab_file_path)

# define special tokens
cls_token = vocab["[CLS]"]
pad_token = vocab["[PAD]"]
sep_token = vocab["[SEP]"]

# Create custom class and handle tesseract error
class TesseractNotFoundError(Exception):
    pass

# function to load context text string or perform document OCR
def load_context(source, source_format="text"):
    if source_format == "document_jpg":
        img = cv2.imread(source)

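        try:
            ocr_text = pytesseract.image_to_string(img)
        except pytesseract.pytesseract.TesseractNotFoundError as err:
            # re-raise only the "Tesseract is not installed" case as the custom error
            raise TesseractNotFoundError from err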
        
        print("Extracted OCR text:\n", ocr_text)
        return ocr_text
    elif source_format == "document_pdf":
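        with pdfplumber.open(source) as pdf:
            first_page = pdf.pages[0]
            # extract_text() reads the embedded text layer (no OCR); fall back to "" if nothing is returned
            pdf_text = first_page.extract_text() or ""
            print("Extracted PDF text:\n", pdf_text)
            return pdf_text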
    return source
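
The implementation of `tokens.text_to_tokens` is not shown in this notebook. As an illustration of how a BERT-style vocabulary maps text to integers, here is a minimal sketch of greedy longest-match WordPiece splitting. The toy vocabulary and the `wordpiece` function are assumptions made for this example and are much simpler than a real tokenizer.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# toy vocabulary for illustration only; the notebook loads the real one from data/vocab.txt
toy_vocab = {"[UNK]": 0, "medic": 1, "##ation": 2, "dosage": 3}
pieces = wordpiece("medication", toy_vocab)
ids = [toy_vocab[p] for p in pieces]
print(pieces, ids)  # ['medic', '##ation'] [1, 2]
```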


### Preprocessing

The input size in this case is 384 tokens. The main input (`input_ids`) to the BERT model consists of two parts: entity tokens and context tokens, separated by special tokens. If entity + context is shorter than 384 tokens, padding tokens are added. If entity + context is longer than 384 tokens, the context must be split into parts, and the entity is fed to the network once with each part. Neighboring parts overlap by half the size of a context part (if a context part is 300 tokens long, neighboring parts share 150 tokens). We also need to provide:

- `attention_mask`: a sequence of integer values representing the mask of valid values in the input.
- `token_type_ids`: a sequence of integer values representing the segmentation of `input_ids` into entity and context.
- `position_ids`: a sequence of integer values from 0 to 383 representing the position index of each input token.

For more details about the inputs, see the [model documentation](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/intel/bert-small-uncased-whole-word-masking-squad-int8-0002#input).

In [None]:
# generator of a sequence of inputs
def prepare_input(entity_tokens, context_tokens):
    # length of entity in tokens
    entity_len = len(entity_tokens)
    # context part size
    context_len = input_size - entity_len - 3

    if context_len < 16:
        raise RuntimeError("Entity is too long in comparison to input size. No space for context")

    # take parts of context with overlapping by 0.5; the stop value keeps generating
    # windows until the last one reaches the end of the context
    step = context_len // 2
    for start in range(0, max(1, len(context_tokens) - step), step):
        # part of context
        part_context_tokens = context_tokens[start:start + context_len]
        # input: entity and context separated by special tokens
        input_ids = [cls_token] + entity_tokens + [sep_token] + part_context_tokens + [sep_token]
        # 1 for any index if there is no padding token, 0 otherwise
        attention_mask = [1] * len(input_ids)
        # 0 for entity tokens, 1 for context part
        token_type_ids = [0] * (entity_len + 2) + [1] * (len(part_context_tokens) + 1)

        # add padding at the end
        (input_ids, attention_mask, token_type_ids), pad_number = pad(input_ids=input_ids,
                                                                      attention_mask=attention_mask,
                                                                      token_type_ids=token_type_ids)

        # create input to feed the model
        input_dict = {
            "input_ids": np.array([input_ids], dtype=np.int32),
            "attention_mask": np.array([attention_mask], dtype=np.int32),
            "token_type_ids": np.array([token_type_ids], dtype=np.int32),
        }

        # some models require additional position_ids
        if "position_ids" in [i_key.any_name for i_key in input_keys]:
            position_ids = np.arange(len(input_ids))
            input_dict["position_ids"] = np.array([position_ids], dtype=np.int32)

        yield input_dict, pad_number, start


# function to add padding
def pad(input_ids, attention_mask, token_type_ids):
    # how many padding tokens
    diff_input_size = input_size - len(input_ids)

    if diff_input_size > 0:
        # add padding to all inputs
        input_ids = input_ids + [pad_token] * diff_input_size
        attention_mask = attention_mask + [0] * diff_input_size
        token_type_ids = token_type_ids + [0] * diff_input_size

    return (input_ids, attention_mask, token_type_ids), diff_input_size
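
To make the input layout concrete, the sketch below builds the three sequences for a toy input size of 10 instead of 384. The token IDs are illustrative placeholders (the notebook reads the real ones from `vocab.txt`), and `build_inputs` is a hypothetical helper that condenses the layout and padding steps into one function.

```python
CLS, SEP, PAD = 101, 102, 0  # illustrative IDs; the notebook reads the real ones from vocab.txt


def build_inputs(entity_ids, context_ids, input_size):
    """Lay out one model input: [CLS] entity [SEP] context [SEP], then padding."""
    input_ids = [CLS] + entity_ids + [SEP] + context_ids + [SEP]
    attention_mask = [1] * len(input_ids)
    token_type_ids = [0] * (len(entity_ids) + 2) + [1] * (len(context_ids) + 1)
    pad_number = input_size - len(input_ids)
    if pad_number < 0:
        raise ValueError("entity and context do not fit into input_size")
    input_ids += [PAD] * pad_number
    attention_mask += [0] * pad_number
    token_type_ids += [0] * pad_number
    return input_ids, attention_mask, token_type_ids, pad_number


ids, mask, types, pad_number = build_inputs([7, 8], [20, 21, 22], input_size=10)
print(ids)    # [101, 7, 8, 102, 20, 21, 22, 102, 0, 0]
print(mask)   # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(types)  # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
```

The first context token sits at index `len(entity_ids) + 2`, which is the same offset the postprocessing step uses to locate the context inside the output tensors.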

### Postprocessing

The results from the network are raw logits. We apply the softmax function to get a probability distribution. Then we look for the best entity in the current part of the context (the span with the highest score) and return the score together with the character range of the extracted entity in the context.

In [None]:
# based on https://github.com/openvinotoolkit/open_model_zoo/blob/bf03f505a650bafe8da03d2747a8b55c5cb2ef16/demos/common/python/openvino/model_zoo/model_api/models/bert.py#L163
def postprocess(output_start, output_end, entity_tokens, context_tokens_start_end, padding, start_idx):

    def get_score(logits):
        out = np.exp(logits)
        return out / out.sum(axis=-1)

    # get start-end scores for context
    score_start = get_score(output_start)
    score_end = get_score(output_end)

    # index of first context token in tensor
    context_start_idx = len(entity_tokens) + 2
    # index of last+1 context token in tensor
    context_end_idx = input_size - padding - 1

    # find product of all start-end combinations to find the best one
    max_score, max_start, max_end = find_best_entity_window(start_score=score_start,
                                                            end_score=score_end,
                                                            context_start_idx=context_start_idx,
                                                            context_end_idx=context_end_idx)

    # convert to context text start-end index
    max_start = context_tokens_start_end[max_start + start_idx][0]
    max_end = context_tokens_start_end[max_end + start_idx][1]

    return max_score, max_start, max_end


# based on https://github.com/openvinotoolkit/open_model_zoo/blob/bf03f505a650bafe8da03d2747a8b55c5cb2ef16/demos/common/python/openvino/model_zoo/model_api/models/bert.py#L188
def find_best_entity_window(start_score, end_score, context_start_idx, context_end_idx):
    context_len = context_end_idx - context_start_idx
    score_mat = np.matmul(
        start_score[context_start_idx:context_end_idx].reshape((context_len, 1)),
        end_score[context_start_idx:context_end_idx].reshape((1, context_len)),
    )
    # reset candidates with end before start
    score_mat = np.triu(score_mat)
    # reset long candidates (end more than 16 tokens after start)
    score_mat = np.tril(score_mat, 16)
    # find the best start-end pair
    max_s, max_e = divmod(score_mat.flatten().argmax(), score_mat.shape[1])
    max_score = score_mat[max_s, max_e]

    return max_score, max_s, max_e
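
The span search can also be written with plain loops, which makes the two constraints (end not before start, end at most 16 tokens after start) easier to see. This is a NumPy-free sketch on toy logits; `softmax` and `best_span` are hypothetical names, and the loops stand in for the `np.triu` / `np.tril` masking used above.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities (shifted by the maximum for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_probs, end_probs, max_offset=16):
    """Return (score, start, end) maximizing start_probs[s] * end_probs[e] with s <= e <= s + max_offset."""
    best = (0.0, 0, 0)
    for s, p_start in enumerate(start_probs):
        last = min(len(end_probs), s + max_offset + 1)
        for e in range(s, last):
            score = p_start * end_probs[e]
            if score > best[0]:
                best = (score, s, e)
    return best


# toy logits: the most likely end (index 0) lies before the most likely start (index 1)
start_probs = softmax([0.0, 4.0, 0.0, 0.0])
end_probs = softmax([5.0, 0.0, 3.0, 0.0])
score, start, end = best_span(start_probs, end_probs)
print(start, end)  # 1 2
```

In this example the highest end probability is at index 0, but that position is rejected because it precedes the best start, so the search settles on the span from token 1 to token 2.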

First, we create lists of tokens from the context and the entity. Then we look for the best entity by running the model on each part of the context. The best extracted entity is the one with the highest score.

In [None]:
def get_best_entity(entity, context):
    # convert context string to tokens
    context_tokens, context_tokens_start_end = tokens.text_to_tokens(text=context.lower(),
                                                                     vocab=vocab)
    # convert entity string to tokens
    entity_tokens, _ = tokens.text_to_tokens(text=entity.lower(), vocab=vocab)

    results = []
    # iterate through different parts of context
    for network_input, padding, start_idx in prepare_input(entity_tokens=entity_tokens,
                                                           context_tokens=context_tokens):
        # get output layers
        output_start_key = compiled_model.output("output_s")
        output_end_key = compiled_model.output("output_e")

        # openvino inference
        result = compiled_model(network_input)
        # postprocess the result getting the score and context range for the answer
        score_start_end = postprocess(output_start=result[output_start_key][0],
                                      output_end=result[output_end_key][0],
                                      entity_tokens=entity_tokens,
                                      context_tokens_start_end=context_tokens_start_end,
                                      padding=padding,
                                      start_idx=start_idx)
        results.append(score_start_end)

    # find the highest score
    answer = max(results, key=operator.itemgetter(0))
    # return the part of the context that is the answer, together with its score
    return context[answer[1]:answer[2]], answer[0]
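
The answer is cut out of the original context string by character offsets, which is why the tokenizer returns a `(start, end)` pair for every token. The sketch below shows the idea with a simplified regular-expression tokenizer; it is an assumption for illustration and not the tokenization performed by `tokens.text_to_tokens`.

```python
import re


def tokenize_with_offsets(text):
    """Split on word characters and punctuation, keeping (start, end) character offsets."""
    found, offsets = [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        found.append(match.group())
        offsets.append((match.start(), match.end()))
    return found, offsets


context = "Mr. Smith is a 63-year-old gentleman."
token_list, offsets = tokenize_with_offsets(context.lower())

# suppose the model picked tokens 5..9 as the best span
first, last = 5, 9
answer = context[offsets[first][0]:offsets[last][1]]
print(answer)  # 63-year-old
```

Because the slice is taken from `context` rather than from the lowercased copy, the returned entity keeps the original capitalization.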

### Main Processing Function

Run entity extraction on a specific knowledge base by iterating through the template entities. The final output is a JSON object with two fields: `Extraction` (a list of items, each with `Entity`, `Type` and `Score`) and `Time` (the overall processing time). Currently, the application supports only a few entities from the healthcare domain; more templates and entities can be added.

In [None]:
healthcare_template = ["name", "age", "medical condition", "medication", "dosage", "dosage unit"]

def run_analyze_entities(source, source_type = "text"):
    print(f"Context: {source}\n", flush=True)

    try:
        context = load_context(source, source_type)
    except TesseractNotFoundError:
        print("OCR Error: To extract entities from image you need to install Tesseract-OCR engine. "
              "Follow this link to resolve \nhttps://stackoverflow.com/questions/50655738/how-do-i-resolve-a-tesseractnotfounderror")
        return

    if len(context) == 0:
        print("Error: Empty context or outside paragraphs")
        return

    # measure processing time
    start_time = time.perf_counter()
    extract = []
    for field in healthcare_template:
        entity_to_find = field + "?"
        entity, score = get_best_entity(entity=entity_to_find, context=context)
        extract.append({"Entity": entity, "Type": field, "Score": f"{score:.2f}"})
    end_time = time.perf_counter()
    res = {"Extraction": extract, "Time": f"{end_time - start_time:.2f}s"}
    print("\nJSON Output:")
    print(json.dumps(res, sort_keys=False, indent=4))
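
One possible way to support more than one template is to keep the templates in a dictionary and build the queries from the selected entry. This is only a sketch: `TEMPLATES`, `build_queries` and the `invoice` template are hypothetical and not part of the notebook, while the healthcare fields and the trailing `?` follow the cell above.

```python
TEMPLATES = {
    "healthcare": ["name", "age", "medical condition", "medication", "dosage", "dosage unit"],
    # hypothetical second template, not part of the notebook
    "invoice": ["vendor", "invoice number", "total amount"],
}


def build_queries(template_name, templates=TEMPLATES):
    """Return (field, query) pairs; each query is the field name followed by '?'."""
    if template_name not in templates:
        raise KeyError(f"Unknown template: {template_name}")
    return [(field, field + "?") for field in templates[template_name]]


for field, query in build_queries("invoice"):
    print(f"{field!r} -> {query!r}")
```

Each query would then be passed to `get_best_entity` in place of `entity_to_find`. Whether the model returns useful spans for a new domain has to be checked on real documents.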

## Run

### Run on local paragraphs

Change the source to your own medical-domain text that the template supports to perform entity extraction. Only one input text is supported at a time. Extraction usually takes a few seconds, but the longer the context, the longer the wait. The model is limited and sensitive to the input and to the predefined template. The quality of an answer depends on whether the entity is supported by the template. The model tries to extract entities even when they are not supported, so in that case you may see random results.

Sample source: Medical Named Entity and Relationship Extraction Paragraph (from [here](https://aws.amazon.com/comprehend/medical/features/?nc=sn&loc=2))

Sample entities supported by the application healthcare-domain template:
- Name, Age, Medical Condition, Medication, Dosage and Dosage Units.

In [None]:
source_text = "Mr. Smith is a 63-year-old gentleman with coronary artery disease and hypertension. " \
               "CURRENT MEDICATIONS: taking a dosage of LIPITOR 20 mg once daily."

run_analyze_entities(source_text)

### Run on JPG Image

You can also provide a sample JPG image. Note that the context (knowledge base) is built from the text available in the image. If some information is outside the paragraphs or not supported by the predefined template, the algorithm won't be able to perform entity extraction correctly.

Sample source: Paragraph Converted into JPG format image file (from [here](https://arxiv.org/pdf/1910.07419.pdf))

In [None]:
source_document = "data/sample.jpg"
run_analyze_entities(source_document, source_type="document_jpg")

### Run on PDF File

You can also provide a sample PDF file. Note that the context (knowledge base) is built from the text available in the PDF. If some information is outside the paragraphs or not supported by the predefined template, the algorithm won't be able to perform entity extraction correctly.

Sample source: Paragraph Converted into PDF format file (from [here](https://biotext.berkeley.edu/dis_treat_data.html))

In [None]:
source_document = "data/sample.pdf"
run_analyze_entities(source_document, source_type="document_pdf")