# WhisperNormalizer Base Module

> OpenAI's non-English basic text normalization module

## What does this module do?

This module follows the text normalization/standardization approach described in Appendix C (p. 21) of the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf). The `BasicTextNormalizer` performs the following steps:

1. Remove any phrases between matching brackets ([, ]).
2. Remove any phrases between matching parentheses ((, )).
3. Replace any markers, symbols, and punctuation characters with a space, i.e. when the Unicode category of each
character in the NFKC-normalized string starts with M, S, or P.
4. Make the text lowercase.
5. Replace any successive whitespace characters with a single space.
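
The five steps above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the function name `basic_normalize_sketch` is hypothetical, and it applies the steps in the order they are listed.

```python
import re
import unicodedata


def basic_normalize_sketch(text: str) -> str:
    """Hypothetical helper that mirrors the five listed steps."""
    text = re.sub(r"\[[^\]]*\]", "", text)  # 1. drop [bracketed] phrases
    text = re.sub(r"\([^)]*\)", "", text)   # 2. drop (parenthesized) phrases
    text = "".join(                         # 3. categories M, S, P become a space
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", text)
    )
    text = text.lower()                     # 4. lowercase
    return re.sub(r"\s+", " ", text)        # 5. collapse runs of whitespace


print(repr(basic_normalize_sketch("Hello, World! [applause] (laughs)")))
```

Note that none of the listed steps strips leading or trailing whitespace, so a single trailing space can remain when the input ends in punctuation.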

In [None]:
# | default_exp basic

In [None]:
# | hide
from nbdev.showdoc import show_doc

In [None]:
# | export
# This code is from OpenAI Whisper Repository: https://github.com/openai/whisper/tree/main/whisper/normalizers
import re
import unicodedata

import regex

# from fastcore.foundation import add_docs


# non-ASCII letters that are not separated by "NFKD" normalization
ADDITIONAL_DIACRITICS = {
    "œ": "oe",
    "Œ": "OE",
    "ø": "o",
    "Ø": "O",
    "æ": "ae",
    "Æ": "AE",
    "ß": "ss",
    "ẞ": "SS",
    "đ": "d",
    "Đ": "D",
    "ð": "d",
    "Ð": "D",
    "þ": "th",
    "Þ": "th",
    "ł": "l",
    "Ł": "L",
}


def remove_symbols_and_diacritics(s: str, keep=""):
    """
    Replace any other markers, symbols, and punctuations with a space,
    and drop any diacritics (category 'Mn' and some manual mappings)
    """
    return "".join(
        (
            c
            if c in keep
            else (
                ADDITIONAL_DIACRITICS[c]
                if c in ADDITIONAL_DIACRITICS
                else (
                    ""
                    if unicodedata.category(c) == "Mn"
                    else " "
                    if unicodedata.category(c)[0] in "MSP"
                    else c
                )
            )
        )
        for c in unicodedata.normalize("NFKD", s)
    )


def remove_symbols(s: str):
    """
    Replace any other markers, symbols, punctuations with a space, keeping diacritics
    """
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )
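
The two helpers above differ mainly in the normalization form they use. The sketch below reproduces that difference with simplified, hypothetical stand-ins (`keep_diacritics`, `drop_diacritics`, and a three-entry `EXTRA` table instead of the full `ADDITIONAL_DIACRITICS` mapping), so it can be run on its own.

```python
import unicodedata

# Small subset of the manual mappings, for illustration only.
EXTRA = {"ø": "o", "ß": "ss", "æ": "ae"}


def keep_diacritics(s: str) -> str:
    # NFKC keeps an accented letter as one code point of category "L*",
    # so it is not treated as a mark and survives.
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


def drop_diacritics(s: str) -> str:
    # NFKD splits an accented letter into base letter + combining mark ("Mn"),
    # and the combining mark is then dropped.
    out = []
    for c in unicodedata.normalize("NFKD", s):
        if c in EXTRA:
            out.append(EXTRA[c])
        elif unicodedata.category(c) == "Mn":
            continue
        elif unicodedata.category(c)[0] in "MSP":
            out.append(" ")
        else:
            out.append(c)
    return "".join(out)


print(repr(keep_diacritics("caf\u00e9!")))   # accent kept, "!" becomes a space
print(repr(drop_diacritics("caf\u00e9!")))   # accent dropped, "!" becomes a space
print(repr(drop_diacritics("stra\u00dfe")))  # "ß" handled by the manual table
```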

In [None]:
# | export
class BasicTextNormalizer:
    """Follows the text normalization/standardization approach described in Appendix C (p. 21) of the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf). The `BasicTextNormalizer` performs the following steps:

        1. Remove any phrases between matching brackets ([, ]).
        2. Remove any phrases between matching parentheses ((, )).
        3. Replace any markers, symbols, and punctuation characters with a space, i.e. when the Unicode category of each
        character in the NFKC-normalized string starts with M, S, or P.
        4. Make the text lowercase.
        5. Replace any successive whitespace characters with a single space.

    Note: It's not recommended to use this function for non-English languages because it may remove vowels in some languages, as identified by [Kavya in this tweet](https://twitter.com/kavya_manohar/status/1752574864618365059).
    """

    def __init__(
        self,
        remove_diacritics: bool = False,
        split_letters: bool = False,
    ):
        r"""
        remove_diacritics - Replace any other markers, symbols, and punctuation with a space and drop any diacritics
        split_letters - Use the regular expression \X to find all Unicode extended grapheme clusters in the string s and join them with spaces
        """
        self.clean = (
            remove_symbols_and_diacritics if remove_diacritics else remove_symbols
        )
        self.split_letters = split_letters

    def __call__(self, s: str):
        s = s.lower()
        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # remove words between brackets
        s = re.sub(r"\(([^)]+?)\)", "", s)  # remove words between parentheses
        s = self.clean(s).lower()

        if self.split_letters:
            s = " ".join(regex.findall(r"\X", s, regex.U))

        s = re.sub(
            r"\s+", " ", s
        )  # replace any successive whitespace characters with a space

        return s

In [None]:
# #| export
# add_docs(BasicTextNormalizer, "Initialize BasicTextNormalizer",
#          # remove_diacritics="Replace any other markers, symbols, and punctuations with a space and drop any diacritics",
#          # split_letters="It uses a regular expression \X to find all Unicode graphemes (extended grapheme clusters) in the string s and join them together by space",
#          __call__="Call string s and apply normalizer with `BasicTextNormalizer`"
#         )

In [None]:
# |hide
show_doc(BasicTextNormalizer)

---

[source](https://github.com/kurianbenoy/whisper_normalizer/blob/main/whisper_normalizer/basic.py#L62){target="_blank" style="float:right; font-size:smaller"}

### BasicTextNormalizer

>      BasicTextNormalizer (remove_diacritics:bool=False,
>                           split_letters:bool=False)

Follows the text normalization/standardization approach described in Appendix C (p. 21) of the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf). The `BasicTextNormalizer` performs the following steps:

    1. Remove any phrases between matching brackets ([, ]).
    2. Remove any phrases between matching parentheses ((, )).
    3. Replace any markers, symbols, and punctuation characters with a space, i.e. when the Unicode category of each
    character in the NFKC-normalized string starts with M, S, or P.
    4. Make the text lowercase.
    5. Replace any successive whitespace characters with a single space.

Note: It's not recommended to use this function for non-English languages because it may remove vowels in some languages, as identified by [Kavya in this tweet](https://twitter.com/kavya_manohar/status/1752574864618365059).

## Testing Basic Normalizer

In [None]:
normalizer = BasicTextNormalizer()
normalizer("എന്റെ കമ്പ്യൂട്ടറിനു് എന്റെ ഭാഷ")

'എന റ കമ പ യ ട ടറ ന എന റ ഭ ഷ'
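
The output above has lost its vowel signs and viramas because those characters belong to Unicode category M, which the normalizer replaces with a space. The sketch below inspects the first word of the input, written out by code point. The helper name `replace_marks` is hypothetical and only reproduces the category rule, not the full normalizer.

```python
import unicodedata

# The first word of the test string, spelled out by code point:
# letter E, letter NA, virama, letter RRA, vowel sign E.
word = "\u0d0e\u0d28\u0d4d\u0d31\u0d46"

# Show the Unicode category and name of each character.
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


def replace_marks(s: str) -> str:
    # Same rule as the normalizer: categories starting with M, S, or P become a space.
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


print(repr(replace_marks(word)))
```

The virama and the vowel sign both fall in category M, so only the three base letters remain, separated by spaces. This is why the warning in the class docstring applies to scripts that write vowels as combining marks.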

In [None]:
article_text = """Language is like a map that we use to navigate the world, but it’s also like a prison that keeps us from seeing what’s beyond the walls.

But what if there was a way to break out of this prison, to expand our map, to explore new worlds with new words? This is the possibility and the challenge offered by instruction tuned language models like GPT 4, a cutting-edge technology that uses artificial neural networks to generate natural language texts based on user inputs.

GPT 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want. It can even write things that don’t exist yet, things that no human has ever thought of or said before.

As Wittgenstein’s quote suggests, language is a source of limitation and liberation. GPT 4 pushes this idea to the extreme by giving us access to unlimited language.

This could be the most significant new technology in modern history because it has the potential to change many domains and industries. From education to entertainment, from journalism to justice, from science to art, these models could enable new forms of learning, storytelling, reporting, reasoning, discovery, and creation.

They could also create new ethical, social, and cultural challenges that require careful reflection and regulation. How we use this technology will depend on how we recognize its implications for ourselves and others.

This technology is a form of “Artificial Intelligence”. The word “intelligence” derives from inter- (“between”) and legere (“to choose, pick out, read”). To be intelligent, then, is to be able to choose between things, to pick out what matters, to read what is written. Intelligence is not just a quantity or a quality; it is an activity, a process, a practice. It is something that we do with our minds and our words.

But when we let GPT 4 do this for us, are we not abdicating our intelligence? Are we not letting go of our ability to choose, to pick out, to read? Are we not becoming passive consumers of language instead of active producers?
"""
normalizer(article_text)

'language is like a map that we use to navigate the world but it s also like a prison that keeps us from seeing what s beyond the walls but what if there was a way to break out of this prison to expand our map to explore new worlds with new words this is the possibility and the challenge offered by instruction tuned language models like gpt 4 a cutting edge technology that uses artificial neural networks to generate natural language texts based on user inputs gpt 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want it can even write things that don t exist yet things that no human has ever thought of or said before as wittgenstein s quote suggests language is a source of limitation and liberation gpt 4 pushes this idea to the extreme by giving us access to unlimited language this could be the most significant new technology in modern history because it has the potential to change many domains and industries from educat

In [None]:
normalizer = BasicTextNormalizer(remove_diacritics=True)

article_text = """Language is like a map that we use to navigate the world, but it’s also like a prison that keeps us from seeing what’s beyond the walls.

But what if there was a way to break out of this prison, to expand our map, to explore new worlds with new words? This is the possibility and the challenge offered by instruction tuned language models like GPT 4, a cutting-edge technology that uses artificial neural networks to generate natural language texts based on user inputs.

GPT 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want. It can even write things that don’t exist yet, things that no human has ever thought of or said before.

As Wittgenstein’s quote suggests, language is a source of limitation and liberation. GPT 4 pushes this idea to the extreme by giving us access to unlimited language.

This could be the most significant new technology in modern history because it has the potential to change many domains and industries. From education to entertainment, from journalism to justice, from science to art, these models could enable new forms of learning, storytelling, reporting, reasoning, discovery, and creation.

They could also create new ethical, social, and cultural challenges that require careful reflection and regulation. How we use this technology will depend on how we recognize its implications for ourselves and others.

This technology is a form of “Artificial Intelligence”. The word “intelligence” derives from inter- (“between”) and legere (“to choose, pick out, read”). To be intelligent, then, is to be able to choose between things, to pick out what matters, to read what is written. Intelligence is not just a quantity or a quality; it is an activity, a process, a practice. It is something that we do with our minds and our words.

But when we let GPT 4 do this for us, are we not abdicating our intelligence? Are we not letting go of our ability to choose, to pick out, to read? Are we not becoming passive consumers of language instead of active producers?
"""
normalizer(article_text)

'language is like a map that we use to navigate the world but it s also like a prison that keeps us from seeing what s beyond the walls but what if there was a way to break out of this prison to expand our map to explore new worlds with new words this is the possibility and the challenge offered by instruction tuned language models like gpt 4 a cutting edge technology that uses artificial neural networks to generate natural language texts based on user inputs gpt 4 can write anything from essays to novels to poems to tweets to code to recipes to jokes to lyrics to whatever you want it can even write things that don t exist yet things that no human has ever thought of or said before as wittgenstein s quote suggests language is a source of limitation and liberation gpt 4 pushes this idea to the extreme by giving us access to unlimited language this could be the most significant new technology in modern history because it has the potential to change many domains and industries from educat

In [None]:
# | hide
import nbdev

nbdev.nbdev_export()