# numerizer Preprocessing
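The numerizer package converts numbers written out in natural language (for example 'forty two') into digits, handling both integers and decimals. `numerize` returns a string, so the result still has to be parsed if an actual int or float is needed. The package's documentation is sparse, but the API is simple to use.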

See also:

https://github.com/jaidevd/numerizer

## numerizer Acting on Strings

In [2]:
from numerizer import numerize

print(numerize('forty two'))
print(numerize('forty-two'))
print(numerize('four hundred and sixty two'))
print(numerize('one fifty'))
print(numerize('twelve hundred'))
print(numerize('twenty one thousand four hundred and seventy three'))
print(numerize('one billion and one'))
print(numerize('nine and three quarters'))
print(numerize('platform nine and three quarters'))

42
42
462
150
1200
21473
1000000001
9.75
platform 9.75
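
Because `numerize` returns text, a value like `42` above is really the string `'42'`. The sketch below shows one way to turn a fully numerized string into a Python number. The helper name `to_number` is my own and not part of numerizer; the inputs are copied from the output above, so the snippet runs without the package installed.

```python
from typing import Optional, Union

Number = Union[int, float]


def to_number(numerized: str) -> Optional[Number]:
    """Parse a numerized string such as '42' or '9.75'.

    Returns None when both int() and float() reject the string,
    for example when words are left over ('platform 9.75').
    """
    token = numerized.strip()
    try:
        return int(token)
    except ValueError:
        pass
    try:
        return float(token)
    except ValueError:
        return None


# Strings copied from the numerize output above.
print(to_number("42"))             # 42
print(to_number("9.75"))           # 9.75
print(to_number("platform 9.75"))  # None
```

Trying `int` before `float` keeps whole numbers as ints instead of turning `'42'` into `42.0`.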


## numerizer Optional Arguments
`numerize` accepts two optional arguments, `ignore` and `bias`:

`ignore` - a list of words that should be left unconverted

`bias` - a string, one of 'ordinal', 'fractional', or 'fractionals'

In [3]:
text = (
        "squash a bug\n"
        "crash a car\n"
        "An\n"
        "an\n"
        "A\n"
        "first\n"
        "two\n"
        "second\n"
        "forty-second\n"
        "two thirds\n"
        "one fourth\n"
        "one quarter\n"
        "one half\n"
        "nine and three quarters\n"
        "I saw A BIRD"
)

In [4]:
text_list = text.split('\n')
no = numerize(text)
no_list = no.split('\n')
ignore = numerize(text, ignore=['second'])
ignore_list = ignore.split('\n')
ordinal = numerize(text, bias='ordinal')
ordinal_list = ordinal.split('\n')
fractional = numerize(text, bias='fractional')
fractional_list = fractional.split('\n')
fractionals = numerize(text, bias='fractionals')
fractionals_list = fractionals.split('\n')
print("Original                | No args          | ignore=['second'] | bias='ordinal'   | bias='fractional' | bias='fractionals' |")
print("============================================================================================================================")
for orig, non, ig, ordi, frac, fracs in zip(text_list, no_list, ignore_list, ordinal_list, fractional_list, fractionals_list):
    print(f"{orig:23} | {non:16} | {ig:17} | {ordi:16} | {frac:17} | {fracs:18} |")

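Original                | No args          | ignore=['second'] | bias='ordinal'   | bias='fractional' | bias='fractionals' |
============================================================================================================================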
squash a bug            | squash 1 bug     | squash 1 bug      | squash 1 bug     | squash 1 bug      | squash 1 bug       |
crash a car             | crash a car      | crash a car       | crash a car      | crash a car       | crash a car        |
An                      | An               | An                | An               | An                | An                 |
an                      | an               | an                | an               | an                | an                 |
A                       | A                | A                 | A                | A                 | A                  |
first                   | 1st              | 1st               | 1st              | 1st               | first              |
two                     | 2                | 2                 | 2                | 2                 | 2                  |


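Note that in the rows shown, only the first 'a' in the text (in 'squash a bug') is changed to '1'. The later 'a' in 'crash a car' and the standalone 'An', 'an', and 'A' lines are left unchanged.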
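
The comparison cell above repeats the same call-then-split pattern for every variant. A possible generalization is sketched below: each variant is a callable that takes the whole text and returns the converted text. The names `comparison_rows` and `format_table` are hypothetical helpers, and the stand-in converters (`str.upper`, `str.title`) are only there so the sketch runs without numerizer installed.

```python
from typing import Callable, Dict, List


def comparison_rows(text: str, variants: Dict[str, Callable[[str], str]]) -> List[List[str]]:
    """Convert the whole text once per variant and line the results up per input line."""
    columns = [text.split("\n")]
    for convert in variants.values():
        columns.append(convert(text).split("\n"))
    # zip stops at the shortest column, as in the original cell.
    return [list(row) for row in zip(*columns)]


def format_table(headers: List[str], rows: List[List[str]]) -> str:
    """Left-align every column to the width of its longest cell."""
    widths = [max([len(h)] + [len(r[i]) for r in rows]) for i, h in enumerate(headers)]

    def fmt(cells: List[str]) -> str:
        return " | ".join(c.ljust(w) for c, w in zip(cells, widths))

    header_line = fmt(headers)
    lines = [header_line, "=" * len(header_line)]
    lines.extend(fmt(r) for r in rows)
    return "\n".join(lines)


# Stand-in converters (assumption: used only to keep this runnable anywhere).
# With numerizer available, the variants could instead be, for example:
#   {"No args": numerize, "bias='ordinal'": functools.partial(numerize, bias="ordinal")}
variants = {"upper": str.upper, "title": str.title}
sample = "squash a bug\nforty-second\none quarter"
rows = comparison_rows(sample, variants)
print(format_table(["Original"] + list(variants), rows))
```

Passing the whole text to each converter, instead of one line at a time, matches the original cell. That matters here because the handling of 'a' appears to depend on where it occurs in the full text.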

## numerizer as a spaCy Extension

In [5]:
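import spacy
import numerizer  # the ._.numerize extension comes from numerizer, so it must be imported before use

nlp = spacy.load('en_core_web_sm')
doc = nlp('The projected revenue for the next quarter is over two million dollars.')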

In [6]:
doc._.numerize()

{the next quarter: 'the next1/4', two million dollars: '2000000 dollars'}

In [7]:
doc._.numerize(labels=['MONEY'])

{two million dollars: '2000000 dollars'}

In [8]:
two_million = doc[-4:-2]
two_million._.numerize()

'2000000'

In [9]:
quarter = doc[6]
quarter._.numerized

'1/4'
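
The dictionary returned by `doc._.numerize()` maps spans to replacement strings but does not rebuild the sentence. The sketch below shows one way to splice such replacements back into the original text by character offset. `apply_replacements` is a hypothetical helper, and it assumes the spans do not overlap. With spaCy, the offsets could be taken from each span's `start_char` and `end_char` attributes, assuming the dictionary keys are `Span` objects as the output above suggests. Here the offsets are computed with `str.find` so the snippet runs without spaCy.

```python
from typing import Iterable, Tuple

Replacement = Tuple[int, int, str]  # (start_char, end_char, new_text)


def apply_replacements(text: str, replacements: Iterable[Replacement]) -> str:
    """Splice replacements into text, right to left so earlier offsets stay valid.

    Assumes the (start_char, end_char) ranges do not overlap.
    """
    result = text
    for start, end, new_text in sorted(replacements, reverse=True):
        result = result[:start] + new_text + result[end:]
    return result


text = "The projected revenue for the next quarter is over two million dollars."
phrase = "two million dollars"
start = text.find(phrase)
end = start + len(phrase)

# With spaCy this list could be built as (assumption, see lead-in):
#   [(s.start_char, s.end_char, v) for s, v in doc._.numerize(labels=['MONEY']).items()]
print(apply_replacements(text, [(start, end, "2000000 dollars")]))
# The projected revenue for the next quarter is over 2000000 dollars.
```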

In [21]:
# These imports are commented out because they currently fail with:
# ImportError: cannot import name 'UNICODE_EMOJI' from 'emoji'

# from recognizers_text import Culture, ModelResult
# from recognizers_number import NumberRecognizer
# from recognizers_number_with_unit import NumberWithUnitRecognizer 
# from recognizers_date_time import DateTimeRecognizer 
# from recognizers_sequence import SequenceRecognizer 