# Typo Detector with OpenVINO

Typo detection in AI is a process of identifying and correcting typographical errors in text data using machine learning algorithms. The goal of typo detection is to improve the accuracy, readability, and usability of text by identifying and correcting mistakes made during the writing process.

A typo detector takes a sentence as input and identifies all typographical errors, such as misspellings and homophone errors.

This tutorial shows how to use the [Typo Detector](https://huggingface.co/m3hrdadfi/typo-detector-distilbert-en) from the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library to perform this task.

### Imports

In [16]:
from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification, pipeline
from openvino.runtime import Core
from pathlib import Path
import numpy as np
import torch
import re

### Methods

There are two ways to use the typo detection model with OpenVINO. This tutorial covers both.

##### 1. Using the [Hugging Face Optimum](https://huggingface.co/docs/optimum/index) library
The Hugging Face Optimum API is a high-level API that allows us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format.

##### 2. Converting the model to ONNX and then to OpenVINO IR
First the PyTorch model is converted to the ONNX format, and then the [Model Optimizer](https://docs.openvino.ai/latest/openvino_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html) tool is used to convert it to the OpenVINO IR format. This method gives more insight into the OpenVINO environment and its applications.

### 1. Hugging Face Optimum library

For this method, we need to install the Hugging Face Optimum library accelerated by OpenVINO integration.

Optimum Intel can be used to load optimized models from the [Hugging Face Hub](https://huggingface.co/docs/optimum/intel/hf.co/models) and create pipelines that run inference with OpenVINO Runtime through the Hugging Face APIs. The Optimum Inference models are API compatible with Hugging Face Transformers models. This means we only need to replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.

In [None]:
!pip install optimum[openvino]

Import required class

In [6]:
from optimum.intel.openvino import OVModelForTokenClassification

##### Load the model

We load the relevant pre-trained model through the `OVModelForTokenClassification` class. To load a Transformers model and convert it to the OpenVINO format on the fly, we set `export=True` when loading the model.

In [8]:
# The pretrained model we are using
model_id = "m3hrdadfi/typo-detector-distilbert-en"

model_dir = Path("model")

# Save the model to the path if not existing
if model_dir.exists():
    model = OVModelForTokenClassification.from_pretrained(model_dir)
else:
    model = OVModelForTokenClassification.from_pretrained(model_id, export=True)
    model.save_pretrained(model_dir)

##### Load the tokenizer

Text Preprocessing cleans the text-based input data so it can be fed into the model. Tokenization splits paragraphs and sentences into smaller units that can be more easily assigned meaning. It involves cleaning the data and assigning tokens or IDs to the words, so they are represented in a vector space where similar words have similar vectors. This helps the model understand the context of a sentence. We're making use of an [AutoTokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) from Hugging Face, which is essentially a pretrained tokenizer.

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
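To make the tokenization step more concrete, here is a minimal sketch of greedy longest-match subword splitting, the idea behind WordPiece-style tokenizers. It assumes the model's tokenizer follows this style, which the `##` continuation prefix used later in this tutorial suggests. The function `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical and exist only for this illustration; the real tokenizer has its own vocabulary and extra normalization steps.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a '##' prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched, so the whole word is treated as unknown
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary, chosen only for this illustration
toy_vocab = {"he", "had", "also", "st", "##gr", "##ug", "##gled", "with"}

print(wordpiece_tokenize("also", toy_vocab))
print(wordpiece_tokenize("stgruggled", toy_vocab))
```

A correctly spelled word that is in the vocabulary stays whole, while a misspelled word tends to break into several pieces.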

Then we use the inference pipeline for the `token-classification` task. You can find more information about using Hugging Face inference pipelines in this [tutorial](https://huggingface.co/docs/transformers/pipeline_tutorial).

In [9]:
nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="average")
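With an aggregation strategy set, the token-classification pipeline returns one dictionary per detected span, carrying `entity_group`, `score`, `word`, `start`, and `end` keys. The sketch below uses a hand-written stand-in for the pipeline output, so the label name `TYPO` and the score are assumptions rather than real model output. The helper `extract_spans` is hypothetical and only shows how the character offsets map back to the input sentence.

```python
sentence = "He had also stgruggled with addiction ."

# Hand-written stand-in for nlp(sentence); label and score are illustrative only
mock_results = [
    {"entity_group": "TYPO", "score": 0.99, "word": "stgruggled", "start": 12, "end": 22},
]


def extract_spans(sentence, results):
    """Return the substrings of `sentence` covered by each result's offsets."""
    return [sentence[r["start"]:r["end"]] for r in results]


print(extract_spans(sentence, mock_results))
```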

Function to find typos in a sentence and write them to the terminal

In [37]:
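def show_typos(sentence):
    # Each result carries the character offsets of one detected typo
    results = nlp(sentence)

    # Insert tags from right to left so that earlier offsets stay valid.
    # Slicing by offset avoids tagging other occurrences of the same substring.
    detected = sentence
    for r in sorted(results, key=lambda r: r["start"], reverse=True):
        start, end = r["start"], r["end"]
        detected = f'{detected[:start]}<i>{detected[start:end]}</i>{detected[end:]}'

    print("[Input]: ", sentence)
    print("[Detected]: ", detected)
    print("-" * 130)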

Demo

In [None]:
sentences = [
    "He had also stgruggled with addiction during his time in Congress .",
    "The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .",
    "Letterma also apologized two his staff for the satyation .",
    "Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .",
    "It is left to the directors to figure out hpw to bring the stry across to tye audience .",
]

for sentence in sentences:
    show_typos(sentence)

### 2. Converting the model to ONNX and then to OpenVINO IR

##### Load the PyTorch model

Use the `AutoModelForTokenClassification` class to load the pretrained PyTorch model.

In [5]:
model_id = "m3hrdadfi/typo-detector-distilbert-en"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_dir = Path("model")
config = AutoConfig.from_pretrained(model_id)

# Save the model to the path if not existing
if model_dir.exists():
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
else:
    model = AutoModelForTokenClassification.from_pretrained(model_id, config=config)
    model.save_pretrained(model_dir)

##### Converting to ONNX

`ONNX` is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers. We need to convert our model from PyTorch to ONNX. In order to perform the operation, we use the torch.onnx.export function to [convert a Hugging Face model](https://huggingface.co/blog/convert-transformers-to-onnx#export-with-torchonnx-low-level) to its respective ONNX format.

In [None]:
onnx_model = "typo_detect.onnx"
MODEL_DIR = "model/"

# Path of the exported ONNX file inside the model directory
onnx_model_path = Path(MODEL_DIR) / onnx_model

print(onnx_model_path)

dummy_model_input = tokenizer("This is a sample", return_tensors="pt")

torch.onnx.export(
    model,
    tuple(dummy_model_input.values()),
    f=onnx_model_path,
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'},
                  'attention_mask': {0: 'batch_size', 1: 'sequence'},
                  'logits': {0: 'batch_size', 1: 'sequence'}},
)

##### Model Optimizer

[Model Optimizer](https://docs.openvino.ai/latest/openvino_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html) is a cross-platform command-line tool that facilitates the transition between training and deployment environments, performs static model analysis, and adjusts deep learning models for optimal execution on end-point target devices.
Model Optimizer converts the model to the OpenVINO Intermediate Representation format (IR), which you can infer later with [OpenVINO runtime](https://docs.openvino.ai/latest/openvino_docs_OV_UG_OV_Runtime_User_Guide.html#doxid-openvino-docs-o-v-u-g-o-v-runtime-user-guide).

In [None]:
optimizer_command = f'mo \
    --input_model {onnx_model_path} \
    --output_dir {MODEL_DIR} \
    --model_name {model_id} \
    --input input_ids,attention_mask \
    '
! $optimizer_command
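Outside a notebook, the same conversion can be launched from plain Python. The sketch below builds the command as an argument list, which avoids shell quoting issues, and only prints it. It reuses the flags from the cell above; the variable name `mo_args` is hypothetical, and actually running the command assumes the `mo` tool is installed and on the `PATH`.

```python
import shlex
from pathlib import Path

model_id = "m3hrdadfi/typo-detector-distilbert-en"
model_dir = Path("model")
onnx_model_path = model_dir / "typo_detect.onnx"

# Same flags as the notebook cell, expressed as a list of arguments
mo_args = [
    "mo",
    "--input_model", str(onnx_model_path),
    "--output_dir", str(model_dir),
    "--model_name", model_id,
    "--input", "input_ids,attention_mask",
]

# Show the command that would be executed
print(shlex.join(mo_args))

# To run it for real (requires the `mo` tool):
# import subprocess
# subprocess.run(mo_args, check=True)
```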

OpenVINO™ Runtime uses the [Infer Request](https://docs.openvino.ai/latest/openvino_docs_OV_UG_Infer_request.html) mechanism, which allows running models on different devices either synchronously or asynchronously. The model graph is passed as an argument to the OpenVINO API and an inference request is created. The default inference mode is AUTO, but it can be changed according to the requirements and the available hardware. You can explore the different inference modes and their usage [in the documentation](https://docs.openvino.ai/latest/openvino_docs_Runtime_Inference_Modes_Overview.html).

In [8]:
ie = Core()
ir_model_xml = str((Path(MODEL_DIR) / model_id).with_suffix(".xml"))
compiled_model = ie.compile_model(ir_model_xml)
infer_request = compiled_model.create_infer_request()

### Helper Functions

In [9]:
def token_to_words(tokens):
    """
    Map each token to the index of the word it belongs to in the original text.
    Relies on the convention that a token starting with '##' continues
    the word of the previous token.
    """
    word_count = -1
    map_to_words = []
    for token in tokens:
        if token.startswith('##'):
            # Continuation piece: same word as the previous token
            map_to_words.append(word_count)
            continue
        word_count += 1
        map_to_words.append(word_count)
    return map_to_words
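As a quick check of the mapping, the sketch below runs the helper on a hand-written token list. The function is repeated in compact form so the snippet runs on its own, and the token list is illustrative rather than real tokenizer output.

```python
def token_to_words(tokens):
    """Map each token to the index of the word it belongs to."""
    word_count = -1
    map_to_words = []
    for token in tokens:
        if not token.startswith("##"):
            word_count += 1  # a new word starts here
        map_to_words.append(word_count)
    return map_to_words


# Illustrative tokens: the misspelled word is split into four pieces
tokens = ["he", "had", "also", "st", "##gr", "##ug", "##gled", "with"]
print(token_to_words(tokens))  # [0, 1, 2, 3, 3, 3, 3, 4]
```

All four pieces of the misspelled word share word index 3, so a typo flagged on any of them points to the same word.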

In [11]:
def infer(input_text):
    """Tokenize the input text and run inference on it."""
    tokens = tokenizer(
        input_text,
        return_tensors="np",
    )

    inputs = dict(tokens)

    result = infer_request.infer(inputs=inputs)

    return result

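In [1]:
def get_typo_indexes(result, map_to_words):
    """
    Given the inference results and the token-to-word map,
    return the indexes of the words that contain typos.
    """
    wrong_words = []
    logits = list(result.values())[0][0]
    # The tokenizer adds [CLS] and [SEP] around the sentence, so drop the first
    # and last positions to align the logits with map_to_words
    for token_idx, token_logits in enumerate(logits[1:-1]):
        label = np.argmax(token_logits)
        if label == 1:
            word_idx = map_to_words[token_idx]
            if word_idx not in wrong_words:
                wrong_words.append(word_idx)
    return wrong_words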
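The inference result holds one row of logits per token position. With BERT-style tokenizers and default settings, that includes the `[CLS]` and `[SEP]` positions added around the sentence, which is the assumption made in this sketch. The example below decodes made-up logits into a label and a probability for each real token; `softmax`, `decode_labels`, and `toy_logits` are hypothetical names, and label 1 stands for a typo as in the helper above.

```python
import math


def softmax(row):
    """Turn one row of logits into probabilities."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def decode_labels(logits):
    """Return (label, probability) for each real token, skipping [CLS] and [SEP]."""
    decoded = []
    for row in logits[1:-1]:
        probs = softmax(row)
        label = max(range(len(probs)), key=probs.__getitem__)
        decoded.append((label, probs[label]))
    return decoded


# Made-up logits for the positions "[CLS] he st ##gr [SEP]" with two classes
toy_logits = [[4.0, -4.0], [3.0, -3.0], [-2.0, 2.0], [-1.0, 1.0], [4.0, -4.0]]

labels = [label for label, _ in decode_labels(toy_logits)]
print(labels)  # [0, 1, 1]
```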

In [2]:
def sentence_split(sentence):
    """Split the sentence into words and punctuation characters."""
    splitted = re.split("([',. ])", sentence)
    # Drop the spaces and the empty strings left between adjacent delimiters
    splitted = [x for x in splitted if x != " " and x != ""]
    return splitted
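The capturing group in the pattern keeps the delimiters in the output, so apostrophes, commas, and periods become items of their own while spaces are filtered out. The sketch below repeats the helper in compact form so it runs on its own and applies it to a shortened demo sentence.

```python
import re


def sentence_split(sentence):
    """Split a sentence into words and punctuation characters."""
    parts = re.split("([',. ])", sentence)
    return [p for p in parts if p not in (" ", "")]


print(sentence_split("Vincent Jay won France 's first gold ."))
# ['Vincent', 'Jay', 'won', 'France', "'", 's', 'first', 'gold', '.']
```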

In [3]:
def show_typos(sentence):
    """
    Detect typos in the given sentence.
    Prints both the original input and the tagged version.
    """
    tokens = tokenizer.tokenize(sentence)
    map_to_words = token_to_words(tokens)
    result = infer(sentence)
    typo_indexes = get_typo_indexes(result, map_to_words)

    sentence_words = sentence_split(sentence)

    typos = [sentence_words[i] for i in typo_indexes]

    detected = sentence
    for typo in typos:
        detected = detected.replace(typo, f'<i>{typo}</i>')

    print("   [Input]: ", sentence)
    print("[Detected]: ", detected)
    print("-" * 130)
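One caveat of tagging with `str.replace` is that it marks every occurrence of the typo string, including occurrences inside other words. A possible alternative, sketched below, tags words by their index and character span. The names `WORD_PATTERN` and `tag_words` are hypothetical; the pattern is written to produce the same items as the splitting rule used in this tutorial (runs of non-delimiter characters, plus each `'`, `,`, or `.` on its own).

```python
import re

# Matches a single delimiter character or a run of non-delimiter characters
WORD_PATTERN = re.compile(r"[',.]|[^',. ]+")


def tag_words(sentence, typo_indexes):
    """Wrap the words at `typo_indexes` in <i> tags using their character spans."""
    spans = [m.span() for m in WORD_PATTERN.finditer(sentence)]
    tagged = sentence
    # Work from right to left so earlier character offsets stay valid
    for i in sorted(set(typo_indexes), reverse=True):
        start, end = spans[i]
        tagged = f"{tagged[:start]}<i>{tagged[start:end]}</i>{tagged[end:]}"
    return tagged


example = "Go to te theater ."
print(tag_words(example, [2]))              # only the word at index 2 is tagged
print(example.replace("te", "<i>te</i>"))   # also tags "te" inside "theater"
```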

### Demo

In [None]:
sentences = [
    "He had also stgruggled with addiction during his time in Congress .",
    "The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .",
    "Letterma also apologized two his staff for the satyation .",
    "Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .",
    "It is left to the directors to figure out hpw to bring the stry across to tye audience .",
]   

for sentence in sentences:
    show_typos(sentence)