# Typo Detector with OpenVINO
Typo detection is the task of identifying typographical errors in text with machine learning. The goal is to improve the accuracy, readability, and usability of text by flagging mistakes made during writing. AI-based typo detectors draw on techniques from natural language processing (NLP), machine learning (ML), and deep learning (DL).

A typo detector takes a sentence as input and identifies all typographical errors in it, such as misspellings and homophone errors.

This tutorial shows how to use a typo detector model hosted on Hugging Face, loaded through the Transformers library, in the OpenVINO environment to perform this task.

The model detects typos in a given text with high accuracy. Its reported scores are:

- Precision score of 0.9923
- Recall score of 0.9859
- F1-score of 0.9891

Model page: https://huggingface.co/m3hrdadfi/typo-detector-distilbert-en

These metrics indicate that the model rarely flags correct words as typos (high precision, few false positives) and rarely misses real typos (high recall, few false negatives).
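
The F1-score listed above is the harmonic mean of precision and recall. As a quick sanity check, the short sketch below recomputes it from the two reported values; it uses only the numbers quoted in this tutorial and no model code.

```python
# Precision and recall as reported for the model in this tutorial
precision = 0.9923
recall = 0.9859

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1-score: {f1:.4f}")  # F1-score: 0.9891
```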

The model has been pretrained on the NeuSpell dataset (https://github.com/neuspell/neuspell).

# Pip packages

In [1]:
#%pip install -q "diffusers>=0.17.1" "openvino>=2023.1.0" "nncf>=2.5.0" "gradio>=4.19" "onnx>=1.11.0,<1.16.2" "transformers>=4.39.0" "torch>=2.1,<2.4" "torchvision<0.19.0" --extra-index-url https://download.pytorch.org/whl/cpu
#%pip install -q "git+https://github.com/huggingface/optimum-intel.git"

In [2]:
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForTokenClassification,
    pipeline,
)
from pathlib import Path
import numpy as np
import re
from typing import List, Dict
import time

In [3]:
from notebook_utils import device_widget

device = device_widget()

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

# Using Hugging Face Optimum Intel library

In [4]:
from optimum.intel.openvino import OVModelForTokenClassification
# The pretrained model we are using
model_id = "m3hrdadfi/typo-detector-distilbert-en"

model_dir = Path("typo_detector")

# Load the OpenVINO model from disk if it was already exported; otherwise export it and save it
if model_dir.exists():
    model = OVModelForTokenClassification.from_pretrained(model_dir, device=device.value)
else:
    model = OVModelForTokenClassification.from_pretrained(model_id, export=True, device=device.value)
    model.save_pretrained(model_dir)

Framework not specified. Using pt to export the model.
Using framework PyTorch: 2.4.0+cu121
Compiling the model to AUTO ...

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="average",
)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
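
The pipeline above is created with aggregation_strategy="average". With this strategy, the scores of the sub-word tokens that make up one word are averaged first, and the label with the highest averaged score is then assigned to the whole word. The sketch below illustrates the idea in plain Python. It is not the library's implementation, and the label names ("O", "TYPO") and all scores are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def average_word_scores(token_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the per-label scores over the sub-word tokens of one word."""
    labels = token_scores[0].keys()
    return {label: sum(s[label] for s in token_scores) / len(token_scores) for label in labels}


# Hypothetical scores for the three sub-word tokens of a single misspelled word
token_scores = [
    {"O": 0.40, "TYPO": 0.60},
    {"O": 0.10, "TYPO": 0.90},
    {"O": 0.70, "TYPO": 0.30},
]

word_scores = average_word_scores(token_scores)
# The word receives the label with the highest averaged score
word_label = max(word_scores, key=word_scores.get)
print(word_label, round(word_scores[word_label], 2))  # TYPO 0.6
```

Note that the third token on its own would vote for "O"; averaging lets the stronger evidence from the other two tokens decide the label for the whole word.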


In [39]:
def show_typos(sentence: str):
    """
    Detect typos from the given sentence.
    Writes both the original input and typo-tagged version to the terminal.

    Arguments:
    sentence -- Sentence to be evaluated (string)
    """

    typos = [sentence[r["start"] : r["end"]] for r in nlp(sentence)]

    detected = "\033[1;30m"+sentence
    for typo in typos:
        detected = detected.replace(typo, f"\033[1;31;47m <i>{typo}</i>\033[0m\033[1;30m")

    print("\033[1;30m[Input]: ",  sentence)
    print("[Detected]: ", detected)
    print("-" * 130)
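
One caveat with the function above: str.replace highlights every occurrence of a flagged string, so a word that is a typo in one position and correct in another is marked twice. Since the pipeline results already carry "start" and "end" character offsets, the highlighting can instead be driven by those offsets. The sketch below is one possible way to do that; highlight_spans is a hypothetical helper name, the spans are made-up values, and the spans are assumed not to overlap.

```python
from typing import List, Tuple


def highlight_spans(
    sentence: str,
    spans: List[Tuple[int, int]],
    open_tag: str = "<i>",
    close_tag: str = "</i>",
) -> str:
    """Wrap each (start, end) character span of the sentence in the given tags."""
    out = sentence
    # Work from right to left so inserted tags do not shift offsets still to be processed
    for start, end in sorted(spans, reverse=True):
        out = out[:start] + open_tag + out[start:end] + close_tag + out[end:]
    return out


sentence = "I went two the zoo with two friends ."
# Hypothetical detection: only the first "two" (characters 7 to 10) is flagged
print(highlight_spans(sentence, [(7, 10)]))
# I went <i>two</i> the zoo with two friends .
```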

In [47]:
sentences = [
    "He had also stgruggled with addiction during his time in Congress .",
    "The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .",
    "Letterma also apologized two his staff for the satyation .",
    "Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .",
    "It is left to the directors to figure out hpw to bring the stry across to tye audience .",
    "I wnet to the park yestreday to play foorball with my fiends, but it statred to rain very hevaily and we had to stop.",
    "My faorite restuarant servs the best spahgetti in the town, but they are always so buzy that you have to make a resrvation in advnace.",
    "I was goig to watch a mvoie on Netflx last night, but the straming was so slow that I decided to cancled my subscrpition.",
    "My freind and I went campign in the forest last weekend and saw a beutiful sunst that was so amzing it took our breth away.",
    "I  have been stuying for my math exam all week, but I'm stil not very confidet that I will pass it, because there are so many formuals to remeber.",
]

start = time.time()

for sentence in sentences:
    show_typos(sentence)

print(f"Time elapsed: {time.time() - start}")

   [Input]:  He had also stgruggled with addiction during his time in Congress .
[Detected]:  He had also  <i>stgruggled</i> with addiction during his time in Congress .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .
[Detected]:  The review  <i>thoroughla</i> assessed all aspects of JLENS SuR and CPG  <i>esign</i>  <i>maturit</i> and confidence .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  Letterma also apologized two his staff for the satyation .
[Detected]:   <i>Letterma</i> also apologized  <i>two</i> his staff for the  <

# Converting the model to OpenVINO IR
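Use the AutoModelForTokenClassification class to load the pretrained PyTorch model, then convert it to OpenVINO Intermediate Representation (IR) with ov.convert_model and save it with ov.save_model.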



In [42]:
model_id = "m3hrdadfi/typo-detector-distilbert-en"
model_dir = Path("pytorch_model")

tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

# Load the PyTorch model from disk if it was already saved; otherwise download it and save it
if model_dir.exists():
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
else:
    model = AutoModelForTokenClassification.from_pretrained(model_id, config=config)
    model.save_pretrained(model_dir)

In [43]:
import openvino as ov

ov_model_path = Path(model_dir) / "typo_detect.xml"

dummy_model_input = tokenizer("This is a sample", return_tensors="pt")
ov_model = ov.convert_model(model, example_input=dict(dummy_model_input))
ov.save_model(ov_model, ov_model_path)

In [44]:
core = ov.Core()

compiled_model = core.compile_model(ov_model, device.value)
output_layer = compiled_model.output(0)

# Helper Functions

In [45]:
def token_to_words(tokens: List[str]) -> List[int]:
    """
    Maps each token to the index of the word it belongs to in the original text.
    Relies on the convention that a token starting with '##' continues the word of the previous token.
    The mapping is positional, so a token string that occurs several times keeps the correct word index each time.

    Arguments:
    tokens -- List of tokens

    Returns:
    map_to_words -- List where entry i is the word index of tokens[i]
    """

    word_count = -1
    map_to_words = []
    for token in tokens:
        # A '##' token stays in the current word; any other token starts a new word
        if not token.startswith("##"):
            word_count += 1
        map_to_words.append(word_count)
    return map_to_words

def infer(input_text: str) -> np.ndarray:
    """
    Generic inference function: tokenizes the input text and runs the compiled model on it.

    Arguments:
    input_text -- The text to run inference on (string)

    Returns:
    result -- Per-token model output from the compiled model (array)
    """

    tokens = tokenizer(
        input_text,
        return_tensors="np",
    )
    inputs = dict(tokens)
    result = compiled_model(inputs)[output_layer]
    return result

def get_typo_indexes(
    result: np.ndarray,
    map_to_words: List[int],
    tokens: List[str],
) -> List[int]:
    """
    Given results from the inference and the token-to-word mapping, identifies the indexes of the words with typos.

    Arguments:
    result -- Result from inference (array)
    map_to_words -- Word index for each token position (list)
    tokens -- List of tokens (kept for compatibility; the mapping is positional)

    Returns:
    wrong_words -- List of indexes of words with typos
    """

    wrong_words = []
    # Skip the outputs for the special tokens at the start and end of the sequence
    result_list = result[0][1:-1]
    for c, scores in enumerate(result_list):
        # Class 1 marks a token as part of a typo
        label = np.argmax(scores)
        if label == 1 and map_to_words[c] not in wrong_words:
            wrong_words.append(map_to_words[c])
    return wrong_words

def sentence_split(sentence: str) -> List[str]:
    """
    Split the sentence into words and punctuation characters

    Arguments:
    sentence -- Sentence to be split (string)

    Returns:
    splitted -- List of words and punctuation characters
    """

    # The capture group keeps the delimiters (', comma, period) as separate items
    splitted = re.split("([',. ])", sentence)
    splitted = [x for x in splitted if x != " " and x != ""]
    return splitted


def show_typos(sentence: str):
    """
    Detect typos from the given sentence.
    Writes both the original input and typo-tagged version to the terminal.

    Arguments:
    sentence -- Sentence to be evaluated (string)
    """

    tokens = tokenizer.tokenize(sentence)
    map_to_words = token_to_words(tokens)
    result = infer(sentence)
    typo_indexes = get_typo_indexes(result, map_to_words, tokens)

    sentence_words = sentence_split(sentence)

    typos = [sentence_words[i] for i in typo_indexes]

    detected = "\033[1;30m"+sentence
    for typo in typos:
        detected = detected.replace(typo, f"\033[1;31;47m <i>{typo}</i>\033[0m\033[1;30m")

    print("\033[1;30m   [Input]: ", sentence)
    print("[Detected]: ", detected)
    print("-" * 130)
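
To make the data flow through these helpers easier to follow, the sketch below walks through the same steps without a model: sub-word tokens are mapped to word positions, the class with the highest score is picked per token, and the flagged word positions are looked up in the split sentence. The tokens and scores are hypothetical values chosen for illustration, and word_index_per_token and softmax are names introduced only for this sketch. The softmax step is optional and only shows how a raw score pair could be turned into a confidence value.

```python
import math
import re
from typing import List


def word_index_per_token(tokens: List[str]) -> List[int]:
    """Return, for each token position, the index of the word it belongs to."""
    word_ids, current = [], -1
    for token in tokens:
        # Tokens starting with '##' continue the current word
        if not token.startswith("##"):
            current += 1
        word_ids.append(current)
    return word_ids


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


sentence = "He had stgruggled ."
# Hypothetical sub-word tokens and per-token scores in the order [no typo, typo]
tokens = ["he", "had", "st", "##gr", "##uggled", "."]
scores = [[4.0, -3.0], [3.5, -2.5], [-1.0, 2.0], [-2.0, 3.0], [0.5, 0.2], [5.0, -4.0]]

word_ids = word_index_per_token(tokens)
# A word is flagged as soon as one of its tokens has class 1 as the top score
typo_words = sorted({word_ids[i] for i, row in enumerate(scores) if row.index(max(row)) == 1})
words = [w for w in re.split("([',. ])", sentence) if w not in (" ", "")]

print(word_ids)                              # [0, 1, 2, 2, 2, 3]
print([words[i] for i in typo_words])        # ['stgruggled']
print(round(softmax(scores[2])[1], 3))       # 0.953
```

The lookup in the last step only works when the word positions derived from the tokens line up with the items produced by splitting the sentence, which is why punctuation is kept as separate items in both.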




In [49]:
sentences = [
    "He had also stgruggled with addiction during his time in Congress .",
    "The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .",
    "Letterma also apologized two his staff for the satyation .",
    "Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .",
    "It is left to the directors to figure out hpw to bring the stry across to tye audience .",
    "I wnet to the park yestreday to play foorball with my fiends, but it statred to rain very hevaily and we had to stop.",
    "My faorite restuarant servs the best spahgetti in the town, but they are always so buzy that you have to make a resrvation in advnace.",
    "I was goig to watch a mvoie on Netflx last night, but the straming was so slow that I decided to cancled my subscrpition.",
    "My freind and I went campign in the forest last weekend and saw a beutiful sunst that was so amzing it took our breth away.",
    "I  have been stuying for my math exam all week, but I'm stil not very confidet that I will pass it, because there are so many formuals to remeber.",
]

start = time.time()

for sentence in sentences:
    show_typos(sentence)

print(f"Time elapsed: {time.time() - start}")

   [Input]:  He had also stgruggled with addiction during his time in Congress .
[Detected]:  He had also  <i>stgruggled</i> with addiction during his time in Congress .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .
[Detected]:  The review  <i>thoroughla</i> assessed all aspects of JLENS SuR and CPG  <i>esign</i>  <i>maturit</i> and confidence .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  Letterma also apologized two his staff for the satyation .
[Detected]:   <i>Letterma</i> also apologized  <i>two</i> his staff for the  <