# English Text Normalization Test Cases

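This notebook exercises the English text normalization grammars in NeMo-text-processing: each section runs representative inputs through `Normalizer` and prints the resulting spoken form.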

## Test Categories:
1. Cardinal Numbers
2. Decimal Numbers
3. Fractions
4. Dates
5. Time
6. Money
7. Measurements
8. Ordinal Numbers
9. Telephone Numbers
10. Whitelist/Abbreviations
11. Mixed Content
12. Batch Testing


In [1]:
import sys
import os

# Make the local NeMo-text-processing checkout importable.
# This assumes the notebook is run from the repository root.
sys.path.insert(0, os.path.abspath('.'))

from nemo_text_processing.text_normalization.normalize import Normalizer

print("Imports successful!")


Imports successful!


In [2]:
# Initialize English normalizer
normalizer_en = Normalizer(
    input_case='cased',
    lang='en',
    cache_dir=None,  # Set to a directory path if you want to cache .far files
    overwrite_cache=False,
    post_process=True
)

print("English Text Normalizer initialized successfully!")
print(f"Language: {normalizer_en.lang}")


 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.


English Text Normalizer initialized successfully!
Language: en


## 1. Cardinal Numbers Test


In [13]:
cardinal_tests = [
    "155",
    "1040",
    "15034",
    "123456",
    "1234567",
    "12345678",
    "-123",
    "-120",
    "1000",
    "100000",  # one hundred thousand
    "1000000000",  # one billion
]

print("=" * 60)
print("CARDINAL NUMBERS TEST")
print("=" * 60)
for test in cardinal_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test:15} -> Output: {result}")


CARDINAL NUMBERS TEST
Input:  155             -> Output: one hundred and fifty five
Input:  1040            -> Output: ten forty
Input:  15034           -> Output: fifteen thousand and thirty four
Input:  123456          -> Output: one hundred twenty three thousand four hundred and fifty six
Input:  1234567         -> Output: one two three four five six seven
Input:  12345678        -> Output: one two three four five six seven eight
Input:  -123            -> Output: minus one hundred and twenty three
Input:  -120            -> Output: minus one hundred and twenty
Input:  1000            -> Output: one thousand
Input:  100000          -> Output: one hundred thousand
Input:  1000000000      -> Output: one billion


## 2. Decimal Numbers Test


In [10]:
decimal_tests = [
    "12.34",
    "123.456",
    "0.5",
    "-12.34",
    "12.3456",
    "0.001",
    "100.5",
]

print("=" * 60)
print("DECIMAL NUMBERS TEST")
print("=" * 60)
for test in decimal_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test:15} -> Output: {result}")


DECIMAL NUMBERS TEST
Input:  12.34           -> Output: twelve point three four
Input:  123.456         -> Output: one hundred and twenty three point four five six
Input:  0.5             -> Output: zero point five
Input:  -12.34          -> Output: minus twelve point three four
Input:  12.3456         -> Output: twelve point three four five six
Input:  0.001           -> Output: zero point zero zero one
Input:  100.5           -> Output: one hundred point five


## 3. Fractions Test


In [3]:
fraction_tests = [
    "3/4",
    "1/2",
    "1/4",
    "12 3/4",
    "-1/2",
    "5/8",
    "2/3",
]

print("=" * 60)
print("FRACTIONS TEST")
print("=" * 60)
for test in fraction_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test:15} -> Output: {result}")


FRACTIONS TEST
Input:  3/4             -> Output: three quarters
Input:  1/2             -> Output: one half
Input:  1/4             -> Output: one quarter
Input:  12 3/4          -> Output: twelve and three quarters
Input:  -1/2            -> Output: - one half
Input:  5/8             -> Output: five eighths
Input:  2/3             -> Output: two thirds


## 4. Dates Test


In [6]:
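date_tests = [
    "01-04-2024",         # ambiguous order; read month-first in the output below
    "15-06-2024",         # day-first (15 cannot be a month)
    "2024-01-15",         # year-first, hyphen-separated
    "15/06/2024",         # day-first, slash-separated
    "04-01-2024",         # ambiguous order; read month-first in the output below
    "January 15, 2024",   # month name, day, year
    "15th January 2024",  # ordinal day, month name, year
    "2024/01/15",         # year-first, slash-separated
]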

print("=" * 60)
print("DATES TEST")
print("=" * 60)
for test in date_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test:20} -> Output: {result}")


DATES TEST
Input:  01-04-2024           -> Output: january fourth twenty twenty four
Input:  15-06-2024           -> Output: the fifteenth of june twenty twenty four
Input:  2024-01-15           -> Output: january fifteenth twenty twenty four
Input:  15/06/2024           -> Output: the fifteenth of june twenty twenty four
Input:  04-01-2024           -> Output: april first twenty twenty four
Input:  January 15, 2024     -> Output: january fifteenth, twenty twenty four
Input:  15th January 2024    -> Output: the fifteenth of january twenty twenty four
Input:  2024/01/15           -> Output: january fifteenth twenty twenty four
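
The outputs above suggest that a purely numeric date is read month-first when its first field can be a month (`01-04-2024` becomes January 4th) and day-first otherwise (`15-06-2024`). The sketch below mirrors that observed behavior with the standard library so that ambiguous inputs can be flagged before they reach the normalizer. It is inferred from this notebook's outputs only and is not NeMo's actual grammar; `interpret_numeric_date` and `is_ambiguous` are hypothetical helper names.

```python
from datetime import date, datetime


def interpret_numeric_date(text: str) -> date:
    """Parse 'AA-BB-YYYY' or 'AA/BB/YYYY', trying month-first, then day-first."""
    cleaned = text.replace("/", "-")
    for fmt in ("%m-%d-%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # this field order does not fit; try the next one
    raise ValueError(f"not a numeric date: {text!r}")


def is_ambiguous(text: str) -> bool:
    """True when both leading fields could be a month and they differ."""
    first, second, _year = text.replace("/", "-").split("-")
    a, b = int(first), int(second)
    return 1 <= a <= 12 and 1 <= b <= 12 and a != b


for sample in ("01-04-2024", "15-06-2024", "04-01-2024"):
    parsed = interpret_numeric_date(sample)
    flag = "ambiguous" if is_ambiguous(sample) else "unambiguous"
    print(f"{sample} -> {parsed.isoformat()} ({flag})")
```

If the source data is known to be day-first, it may be safer to rewrite such dates into an unambiguous form (for example with a month name) before normalization.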


## 5. Time Test


In [7]:
time_tests = [
    "12:30",
    "1:40",
    "12:00",
    "12:30:45",
    "09:15",
    "23:59",
    "9:15",
    "12:30 PM",
    "1:40 AM",
]

print("=" * 60)
print("TIME TEST")
print("=" * 60)
for test in time_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test:15} -> Output: {result}")


TIME TEST
Input:  12:30           -> Output: twelve thirty
Input:  1:40            -> Output: one forty
Input:  12:00           -> Output: twelve o'clock
Input:  12:30:45        -> Output: twelve hours thirty minutes and forty five seconds
Input:  09:15           -> Output: nine fifteen
Input:  23:59           -> Output: twenty three fifty nine
Input:  9:15            -> Output: nine fifteen
Input:  12:30 PM        -> Output: twelve thirty PM
Input:  1:40 AM         -> Output: one forty AM


## 6. Money Test


In [8]:
money_tests = [
    "$100",
    "$1234",
    "$50.50",
    "$0.50",
    "$1000",
    "$500",
    "£100",
    "€100",
    "USD 100",
]

print("=" * 60)
print("MONEY TEST")
print("=" * 60)
for test in money_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test:15} -> Output: {result}")


MONEY TEST
Input:  $100            -> Output: one hundred dollars
Input:  $1234           -> Output: one thousand two hundred and thirty four dollars
Input:  $50.50          -> Output: fifty dollars fifty cents
Input:  $0.50           -> Output: fifty cents
Input:  $1000           -> Output: one thousand dollars
Input:  $500            -> Output: five hundred dollars
Input:  £100            -> Output: one hundred pounds
Input:  €100            -> Output: one hundred euros
Input:  USD 100         -> Output: USD one hundred


## 7. Measurements Test


In [9]:
measure_tests = [
    "12 kg",
    "125 kg",
    "100 m",
    "5 km",
    "12.34 cm",
    "10 lbs",
    "5.5 ft",
    "100 mph",
]

print("=" * 60)
print("MEASUREMENTS TEST")
print("=" * 60)
for test in measure_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test:15} -> Output: {result}")


MEASUREMENTS TEST
Input:  12 kg           -> Output: twelve kilograms
Input:  125 kg          -> Output: one hundred and twenty five kilograms
Input:  100 m           -> Output: one hundred M
Input:  5 km            -> Output: five kilometers
Input:  12.34 cm        -> Output: twelve point three four centimeters
Input:  10 lbs          -> Output: ten pounds
Input:  5.5 ft          -> Output: five point five feet
Input:  100 mph         -> Output: one hundred miles per hour


## 8. Ordinal Numbers Test


In [3]:
ordinal_tests = [
    "1st",
    "2nd",
    "3rd",
    "4th",
    "10th",
    "21st",
    "100th",
    "1st place",
    "2nd place",
]

print("=" * 60)
print("ORDINAL NUMBERS TEST")
print("=" * 60)
for test in ordinal_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test:15} -> Output: {result}")


ORDINAL NUMBERS TEST
Input:  1st             -> Output: first
Input:  2nd             -> Output: second
Input:  3rd             -> Output: third
Input:  4th             -> Output: fourth
Input:  10th            -> Output: tenth
Input:  21st            -> Output: twenty first
Input:  100th           -> Output: one hundredth
Input:  1st place       -> Output: first place
Input:  2nd place       -> Output: second place


## 9. Telephone Numbers Test


In [27]:
telephone_tests = [
    "+1-555-123-4567",
    "+1 5551234567",
    "5551234",
    "94544369",
    "9943206292",
    "(555) 123-4567",
]

print("=" * 60)
print("TELEPHONE NUMBERS TEST")
print("=" * 60)
for test in telephone_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test:20} -> Output: {result}")


TELEPHONE NUMBERS TEST
Input:  +1-555-123-4567      -> Output: plus one, five five five, one two three, four five six seven
Input:  +1 5551234567        -> Output: plus one five five five one two three four five six seven
Input:  5551234              -> Output: five five five one two three four
Input:  94544369             -> Output: nine four five four four three six nine
Input:  9943206292           -> Output: nine nine four three two zero six two nine two
Input:  (555) 123-4567       -> Output: five five five, one two three, four five six seven


## 10. Whitelist/Abbreviations Test


In [28]:
whitelist_tests = [
    "Dr.",
    "Prof.",
    "Mr.",
    "Mrs.",
    "km",
    "m",
    "kg",
    "etc.",
]

print("=" * 60)
print("WHITELIST/ABBREVIATIONS TEST")
print("=" * 60)
for test in whitelist_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test:15} -> Output: {result}")


WHITELIST/ABBREVIATIONS TEST
Input:  Dr.             -> Output: doctor
Input:  Prof.           -> Output: Prof.
Input:  Mr.             -> Output: mister
Input:  Mrs.            -> Output: misses
Input:  km              -> Output: KM
Input:  m               -> Output: m
Input:  kg              -> Output: kg
Input:  etc.            -> Output: etcetera.


## 11. Mixed Content Test


In [29]:
mixed_tests = [
    "The meeting is on 15-06-2024 at 12:30 PM.",
    "$1000 and $500 together make $1500.",
    "123 kg weight and 50 km distance.",
    "1st place and 2nd place.",
    "Call me at +1-555-123-4567 on January 15th.",
]

print("=" * 60)
print("MIXED CONTENT TEST")
print("=" * 60)
for test in mixed_tests:
    result = normalizer_en.normalize(test)
    print(f"Input:  {test}")
    print(f"Output: {result}")
    print("-" * 60)


MIXED CONTENT TEST
Input:  The meeting is on 15-06-2024 at 12:30 PM.
Output: The meeting is on the fifteenth of june twenty twenty four at twelve thirty PM.
------------------------------------------------------------
Input:  $1000 and $500 together make $1500.
Output: one thousand dollars and five hundred dollars together make one thousand five hundred dollars.
------------------------------------------------------------
Input:  123 kg weight and 50 km distance.
Output: one hundred and twenty three kilograms weight and fifty kilometers distance.
------------------------------------------------------------
Input:  1st place and 2nd place.
Output: first place and second place.
------------------------------------------------------------
Input:  Call me at +1-555-123-4567 on January 15th.
Output: Call me at plus one, five five five, one two three, four five six seven on january fifteenth.
------------------------------------------------------------


## 12. Batch Testing


In [30]:
# Normalize several inputs with a single normalize_list() call
batch_tests = [
    "123",
    "12.34",
    "12:30",
    "$100000000",
    "15-06-2024",
]

print("=" * 60)
print("BATCH TESTING")
print("=" * 60)
results = normalizer_en.normalize_list(batch_tests)
for input_text, output_text in zip(batch_tests, results):
    print(f"Input:  {input_text:15} -> Output: {output_text}")


BATCH TESTING
Input:  123             -> Output: one hundred and twenty three
Input:  12.34           -> Output: twelve point three four
Input:  12:30           -> Output: twelve thirty
Input:  $100000000      -> Output: one hundred million dollars
Input:  15-06-2024      -> Output: the fifteenth of june twenty twenty four





## Summary

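All test cases above have been executed. Review the outputs category by category to confirm that the normalization matches what you expect. A few results may be unintended and deserve a closer look: `1234567` is read digit by digit, `-1/2` becomes `- one half`, `100 m` becomes `one hundred M`, `km` becomes `KM`, `Prof.` is left unchanged, and `USD 100` becomes `USD one hundred`.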
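
Checking the printed outputs by eye can be replaced with a small comparison helper. The sketch below is one possible shape for it: `find_mismatches` is a hypothetical helper, the expected strings are copied from outputs recorded in this notebook, and `fake_normalize` is a stand-in so the example runs without NeMo installed.

```python
from typing import Callable, Dict, List, Tuple


def find_mismatches(
    normalize_fn: Callable[[str], str],
    expected: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every case whose output differs."""
    mismatches = []
    for text, want in expected.items():
        got = normalize_fn(text)
        if got != want:
            mismatches.append((text, want, got))
    return mismatches


# Expected values copied from the outputs recorded in this notebook.
EXPECTED = {
    "1000": "one thousand",
    "0.5": "zero point five",
    "$100": "one hundred dollars",
    "1st": "first",
}


def fake_normalize(text: str) -> str:
    """Stand-in for normalizer_en.normalize; deliberately gets '1st' wrong."""
    canned = dict(EXPECTED)
    canned["1st"] = "one st"
    return canned[text]


for text, want, got in find_mismatches(fake_normalize, EXPECTED):
    print(f"MISMATCH {text!r}: expected {want!r}, got {got!r}")
```

With the real normalizer, pass `normalizer_en.normalize` as `normalize_fn` and extend `EXPECTED` with the cases you care about.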

### Usage Tips:
- Run each cell sequentially (Shift+Enter)
- Modify test cases in any cell to test your own inputs
- Pass `verbose=True` to `normalizer_en.normalize()` to print detailed processing information
- Set `cache_dir` to a directory path so the compiled `.far` grammar files are cached, which speeds up subsequent runs
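
The last two tips can be combined as in the sketch below. It reuses only the constructor arguments shown in the initialization cell; `prepare_cache_dir`, `build_cached_normalizer`, and the directory name `nemo_tn_cache` are hypothetical choices made for illustration. The NeMo import is deferred so that the directory helper can be tried without the package installed.

```python
import tempfile
from pathlib import Path


def prepare_cache_dir(base: Path) -> str:
    """Create (if needed) and return a directory for cached grammar files."""
    cache = base / "nemo_tn_cache"  # hypothetical directory name
    cache.mkdir(parents=True, exist_ok=True)
    return str(cache)


def build_cached_normalizer(cache_dir: str):
    """Build the English normalizer as in the cell above, plus a cache directory."""
    # Imported lazily so this sketch loads even without NeMo installed.
    from nemo_text_processing.text_normalization.normalize import Normalizer

    return Normalizer(
        input_case="cased",
        lang="en",
        cache_dir=cache_dir,
        overwrite_cache=False,
        post_process=True,
    )


# The system temp directory keeps the demo self-contained;
# a persistent location is the better choice in practice.
demo_dir = prepare_cache_dir(Path(tempfile.gettempdir()))
print(f"Cache directory ready: {demo_dir}")

# Usage (requires nemo_text_processing):
# normalizer = build_cached_normalizer(demo_dir)
# print(normalizer.normalize("12:30 PM", verbose=True))
```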
