# Normalizer

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/normalizer](https://github.com/huseinzol05/Malaya/tree/master/example/normalizer).
    
</div>

In [1]:
%%time
import malaya

CPU times: user 5.37 s, sys: 1.04 s, total: 6.41 s
Wall time: 7.5 s


In [2]:
string1 = 'xjdi ke, y u xsuke makan HUSEIN kt situ tmpt, i hate it. pelikle, pada'
string2 = 'i mmg2 xske mknn HUSEIN kampng tmpat, i love them. pelikle saye'
string3 = 'perdana menteri ke11 sgt suka makn ayam, harganya cuma rm15.50'
string4 = 'pada 10/4, kementerian mengumumkan, 1/100'
string5 = 'Husein Zolkepli dapat tempat ke-12 lumba lari hari ni'
string6 = 'Husein Zolkepli (2011 - 2019) adalah ketua kampng di kedah sekolah King Edward ke-IV'
string7 = '2jam 30 minit aku tunggu kau, 60.1 kg kau ni, suhu harini 31.2c, aku dahaga minum 600ml'

### Load normalizer

The normalizer accepts any spelling correction model, e.g., `malaya.spell.probability` or `malaya.spell.transformer`.

In [8]:
corrector = malaya.spell.probability()
normalizer = malaya.normalize.normalizer(corrector)

#### normalize

```python
def normalize(
    self, string: str, check_english: bool = True, normalize_entity = True
):
    """
    Normalize a string

    Parameters
    ----------
    string : str
    check_english: bool, (default=True)
        check a word in english dictionary.
    normalize_entity: bool, (default=True)
        normalize entities; this only affects `date`, `datetime`, `time` and `money` pattern strings.

    Returns
    -------
    string: normalized string
    """
```

In [4]:
string = 'boleh dtg 8pagi esok tak atau minggu depan? 2 oktober 2019 2pm, tlong bayar rm 3.2k sekali tau'

In [5]:
normalizer.normalize(string)

{'normalize': 'boleh datang lapan pagi esok tidak atau minggu depan ? 02/10/2019 14:00:00 , tolong bayar tiga ribu dua ratus ringgit sekali tahu',
 'date': {'minggu depan': datetime.datetime(2020, 12, 6, 21, 0, 15, 285198),
  '8 AM esok': datetime.datetime(2020, 11, 30, 8, 0),
  '2 oktober 2019 2pm': datetime.datetime(2019, 10, 2, 14, 0)},
 'money': {'rm 3.2k': 'RM3200.0'}}

In [6]:
normalizer.normalize(string, normalize_entity = False)

{'normalize': 'boleh datang lapan pagi esok tidak atau minggu depan ? 02/10/2019 14:00:00 , tolong bayar tiga ribu dua ratus ringgit sekali tahu',
 'date': {},
 'money': {}}

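As the first output shows, the Malaya normalizer parses `minggu depan` into a `datetime` object (under the `date` key) and `rm 3.2k` into `RM3200.0` (under the `money` key). With `normalize_entity = False`, both dictionaries stay empty.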

In [7]:
print(normalizer.normalize(string1))
print(normalizer.normalize(string2))
print(normalizer.normalize(string3))
print(normalizer.normalize(string4))
print(normalizer.normalize(string5))
print(normalizer.normalize(string6))
print(normalizer.normalize(string7))

{'normalize': 'tak jadi ke , kenapa awak tak suka makan HUSEIN kat situ tempat , saya hate it . pelik lah , pada', 'date': {}, 'money': {}}
{'normalize': 'saya memang - memang tak suka makan HUSEIN kampung tempat , saya love them . pelik lah saya', 'date': {}, 'money': {}}
{'normalize': 'perdana menteri kesebelas sangat suka makan ayam , harganya cuma lima belas ringgit lima puluh sen', 'date': {}, 'money': {'rm15.50': 'RM15.50'}}
{'normalize': 'pada sepuluh hari bulan empat , kementerian mengumumkan , satu per seratus', 'date': {}, 'money': {}}
{'normalize': 'Husein Zolkepli dapat tempat kedua belas lumba lari hari ini', 'date': {}, 'money': {}}
{'normalize': 'Husein Zolkepli ( dua ribu sebelas hingga dua ribu sembilan belas ) adalah ketua kampung di kedah sekolah King Edward keempat', 'date': {}, 'money': {}}
{'normalize': 'dua jam tiga puluh minit aku tunggu kamu , enam puluh perpuluhan satu kilogram kamu ini , suhu hari ini tiga puluh satu perpuluhan dua celcius , aku dahaga minum 

### Skip spelling correction

To skip spelling correction, pass `None` as the `speller` argument of `malaya.normalize.normalizer`. `None` is also the default.

In [10]:
normalizer = malaya.normalize.normalizer(corrector)
without_corrector_normalizer = malaya.normalize.normalizer(None)

In [13]:
normalizer.normalize(string2)

{'normalize': 'saya memang - memang tak suka makan HUSEIN kampung tempat , saya love them . pelik lah saya',
 'date': {},
 'money': {}}

In [14]:
without_corrector_normalizer.normalize(string2)

{'normalize': 'saya memang - memang tak suka mknn HUSEIN kampng tmpat , saya love them . pelik lah saya',
 'date': {},
 'money': {}}

### Pass kwargs preprocessing

Suppose you want to skip normalizing date patterns. You can pass kwargs to the normalizer; see the original tokenizer implementation at https://github.com/huseinzol05/Malaya/blob/master/malaya/preprocessing.py#L103

In [15]:
normalizer = malaya.normalize.normalizer(corrector)
skip_date_normalizer = malaya.normalize.normalizer(corrector, date = False)

In [16]:
normalizer.normalize('tarikh program tersebut 14 mei')

{'normalize': 'tarikh program tersebut 14/05/2020',
 'date': {'14 mei': datetime.datetime(2020, 5, 14, 0, 0)},
 'money': {}}

In [17]:
skip_date_normalizer.normalize('tarikh program tersebut 14 mei')

{'normalize': 'tarikh program tersebut empat belas mei',
 'date': {'14 mei': datetime.datetime(2020, 5, 14, 0, 0)},
 'money': {}}

### Normalize url

Suppose you have a URL token, for example `https://huseinhouse.com`. The `normalize_url` parameter will:

1. replace `://` with a space.
2. replace `.` with ` dot `.

To enable it, call `normalizer.normalize(string, normalize_url = True)`; the default is `False`.

In [24]:
normalizer = malaya.normalize.normalizer()
normalizer.normalize('web saya ialah https://huseinhouse.com')

{'normalize': 'web saya ialah https://huseinhouse.com',
 'date': {},
 'money': {}}

In [25]:
normalizer.normalize('web saya ialah https://huseinhouse.com', normalize_url = True)

{'normalize': 'web saya ialah https huseinhouse dot com',
 'date': {},
 'money': {}}

In [26]:
normalizer.normalize('web saya ialah https://huseinhouse02934.com', normalize_url = True)

{'normalize': 'web saya ialah https huseinhouse 02934 dot com',
 'date': {},
 'money': {}}
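
The URL steps above can be approximated in plain Python. This is a minimal sketch that reproduces the outputs shown, not Malaya's actual implementation; `spell_out_url` is a hypothetical name, and the digit-splitting step is inferred from the `huseinhouse 02934` output rather than documented.

```python
import re

def spell_out_url(url: str) -> str:
    """Rough, hypothetical approximation of URL normalization."""
    # Replace the scheme separator with a space and spell out dots.
    text = url.replace('://', ' ').replace('.', ' dot ')
    # Separate letters from adjacent digits (assumption based on the output above).
    text = re.sub(r'([A-Za-z])(\d)', r'\1 \2', text)
    text = re.sub(r'(\d)([A-Za-z])', r'\1 \2', text)
    # Collapse repeated whitespace.
    return ' '.join(text.split())

print(spell_out_url('https://huseinhouse.com'))
print(spell_out_url('https://huseinhouse02934.com'))
```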

### Normalize email

Suppose you have an email token, for example `husein.zol05@gmail.com`. The `normalize_email` parameter will:

1. replace `://` with empty string.
2. replace `.` with ` dot `.
3. replace `@` with ` di `.

To enable it, call `normalizer.normalize(string, normalize_email = True)`; the default is `False`.

In [28]:
normalizer = malaya.normalize.normalizer()
normalizer.normalize('email saya ialah husein.zol05@gmail.com')

{'normalize': 'email saya ialah husein.zol05@gmail.com',
 'date': {},
 'money': {}}

In [29]:
normalizer = malaya.normalize.normalizer()
normalizer.normalize('email saya ialah husein.zol05@gmail.com', normalize_email = True)

{'normalize': 'email saya ialah husein dot zol 05 di gmail dot com',
 'date': {},
 'money': {}}
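
The email steps can be sketched the same way. This is an illustration under assumptions, not the library's code: `spell_out_email` is a hypothetical helper, and splitting `zol05` into `zol 05` is inferred from the output above.

```python
import re

def spell_out_email(email: str) -> str:
    """Rough, hypothetical approximation of email normalization."""
    # Spell out dots and replace '@' with the Malay word 'di'.
    text = email.replace('.', ' dot ').replace('@', ' di ')
    # Separate letters from adjacent digits (assumption based on the output above).
    text = re.sub(r'([A-Za-z])(\d)', r'\1 \2', text)
    text = re.sub(r'(\d)([A-Za-z])', r'\1 \2', text)
    # Collapse repeated whitespace.
    return ' '.join(text.split())

print(spell_out_email('husein.zol05@gmail.com'))
```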

### Normalizing rules

**All these rules are skipped if the first letter of the word is capitalized.**

#### 1. normalize title

```python

{
    'dr': 'Doktor',
    'yb': 'Yang Berhormat',
    'hj': 'Haji',
    'ybm': 'Yang Berhormat Mulia',
    'tyt': 'Tuan Yang Terutama',
    'yab': 'Yang Berhormat',
    'yabhg': 'Yang Amat Berbahagia',
    'ybhg': 'Yang Berbahagia',
    'miss': 'Cik',
}

```

In [8]:
normalizer.normalize('Dr yahaya')

{'normalize': 'Doktor yahaya', 'date': {}, 'money': {}}
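
Conceptually, the title rule is a case-insensitive dictionary lookup per token. The sketch below assumes simple whitespace tokenization and uses only a subset of the mapping; `TITLES` and `expand_titles` are hypothetical names, not part of Malaya.

```python
# Hypothetical subset of the title mapping listed above.
TITLES = {
    'dr': 'Doktor',
    'yb': 'Yang Berhormat',
    'hj': 'Haji',
    'miss': 'Cik',
}

def expand_titles(text: str) -> str:
    """Replace each token that matches a known title, ignoring case."""
    tokens = text.split()
    return ' '.join(TITLES.get(token.lower(), token) for token in tokens)

print(expand_titles('Dr yahaya'))
```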

#### 2. expand `x`

In [9]:
normalizer.normalize('xtahu')

{'normalize': 'tak tahu', 'date': {}, 'money': {}}

#### 3. normalize `ke -`

In [10]:
normalizer.normalize('ke-12')

{'normalize': 'kedua belas', 'date': {}, 'money': {}}

In [11]:
normalizer.normalize('ke - 12')

{'normalize': 'kedua belas', 'date': {}, 'money': {}}

#### 4. normalize `ke - roman`

In [12]:
normalizer.normalize('ke-XXI')

{'normalize': 'kedua puluh satu', 'date': {}, 'money': {}}

In [13]:
normalizer.normalize('ke - XXI')

{'normalize': 'kedua puluh satu', 'date': {}, 'money': {}}
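
Handling `ke - roman` requires converting the Roman numeral to an integer before spelling it out. A minimal sketch of that conversion step, assuming well-formed numerals; `roman_to_int` is a hypothetical helper and may differ from what Malaya does internally.

```python
ROMAN_VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}

def roman_to_int(roman: str) -> int:
    """Convert a well-formed Roman numeral such as 'XXI' to an integer."""
    values = [ROMAN_VALUES[ch] for ch in roman.upper()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value placed before a larger one is subtracted (e.g. IV = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total

print(roman_to_int('XXI'))  # 21
print(roman_to_int('IV'))   # 4
```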

#### 5. normalize `NUM - NUM`

In [14]:
normalizer.normalize('2011 - 2019')

{'normalize': 'dua ribu sebelas hingga dua ribu sembilan belas',
 'date': {},
 'money': {}}

In [15]:
normalizer.normalize('2011.01-2019')

{'normalize': 'dua ribu sebelas perpuluhan kosong satu hingga dua ribu sembilan belas',
 'date': {},
 'money': {}}

#### 6. normalize `pada NUM (/ | -) NUM`

In [16]:
normalizer.normalize('pada 10/4')

{'normalize': 'pada sepuluh hari bulan empat', 'date': {}, 'money': {}}

In [17]:
normalizer.normalize('PADA 10 -4')

{'normalize': 'pada sepuluh hari bulan empat', 'date': {}, 'money': {}}

#### 7. normalize `NUM / NUM`

In [18]:
normalizer.normalize('10 /4')

{'normalize': 'sepuluh per empat', 'date': {}, 'money': {}}

#### 8. normalize `rm NUM`

In [19]:
normalizer.normalize('RM10.5')

{'normalize': 'sepuluh ringgit lima puluh sen',
 'date': {},
 'money': {'rm10.5': 'RM10.5'}}

#### 9. normalize `rm NUM sen`

In [20]:
normalizer.normalize('rm 10.5 sen')

{'normalize': 'sepuluh ringgit lima puluh sen',
 'date': {},
 'money': {'rm 10.5': 'RM10.5'}}

#### 10. normalize `NUM sen`

In [21]:
normalizer.normalize('1015 sen')

{'normalize': 'sepuluh ringgit lima belas sen',
 'date': {},
 'money': {'1015 sen': 'RM10.15'}}

#### 11. normalize money

In [22]:
normalizer.normalize('rm10.4m')

{'normalize': 'sepuluh juta empat ratus ribu ringgit',
 'date': {},
 'money': {'rm10.4m': 'RM10400000.0'}}

In [23]:
normalizer.normalize('$10.4K')

{'normalize': 'sepuluh ribu empat ratus dollar',
 'date': {},
 'money': {'$10.4k': '$10400.0'}}
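
The money rules combine a currency marker, a number, and an optional `k` or `m` multiplier. The sketch below illustrates only that parsing step; `parse_money` is a hypothetical helper, and it uses `Decimal` to avoid floating-point rounding, which may differ from how Malaya computes the amounts.

```python
import re
from decimal import Decimal

# Multipliers for the optional suffix (assumed: k = thousand, m = million).
MULTIPLIERS = {'': Decimal(1), 'k': Decimal(1000), 'm': Decimal(1000000)}
SYMBOLS = {'rm': 'RM', '$': '$'}

def parse_money(text: str):
    """Return (currency symbol, amount) or None if the text is not money."""
    pattern = r'(rm|\$)\s*(\d+(?:\.\d+)?)\s*([km]?)'
    match = re.fullmatch(pattern, text.strip().lower())
    if match is None:
        return None
    symbol, number, suffix = match.groups()
    return SYMBOLS[symbol], Decimal(number) * MULTIPLIERS[suffix]

print(parse_money('rm10.4m'))
print(parse_money('$10.4K'))
```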

#### 12. normalize cardinal

In [24]:
normalizer.normalize('123')

{'normalize': 'seratus dua puluh tiga', 'date': {}, 'money': {}}

#### 13. normalize ordinal

In [25]:
normalizer.normalize('ke123')

{'normalize': 'keseratus dua puluh tiga', 'date': {}, 'money': {}}
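
The cardinal and ordinal rules both rely on spelling a number in Malay words. The following is a small sketch for integers from 0 to 9999 only; `to_cardinal` and `to_ordinal` are hypothetical names, and the ordinal simply prefixes `ke`, as the outputs above suggest.

```python
UNITS = ['kosong', 'satu', 'dua', 'tiga', 'empat',
         'lima', 'enam', 'tujuh', 'lapan', 'sembilan']

def to_cardinal(n: int) -> str:
    """Spell an integer in the range 0..9999 in Malay words."""
    if not 0 <= n < 10000:
        raise ValueError('this sketch only supports 0..9999')
    if n < 10:
        return UNITS[n]
    if n == 10:
        return 'sepuluh'
    if n == 11:
        return 'sebelas'
    if n < 20:
        return UNITS[n - 10] + ' belas'
    if n < 100:
        lead, rest = divmod(n, 10)
        head = UNITS[lead] + ' puluh'
    elif n < 1000:
        lead, rest = divmod(n, 100)
        # 'se-' replaces 'satu' for exactly one hundred.
        head = 'seratus' if lead == 1 else UNITS[lead] + ' ratus'
    else:
        lead, rest = divmod(n, 1000)
        # 'se-' replaces 'satu' for exactly one thousand.
        head = 'seribu' if lead == 1 else UNITS[lead] + ' ribu'
    return head if rest == 0 else head + ' ' + to_cardinal(rest)

def to_ordinal(n: int) -> str:
    """Prefix the cardinal with 'ke' (no special-casing in this sketch)."""
    return 'ke' + to_cardinal(n)

print(to_cardinal(123))  # seratus dua puluh tiga
print(to_ordinal(12))    # kedua belas
```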

#### 14. normalize date / time / datetime string to datetime.datetime

In [26]:
normalizer.normalize('2 hari lepas')

{'normalize': 'dua hari lepas',
 'date': {'2 hari lalu': datetime.datetime(2020, 7, 30, 22, 55, 24, 921050)},
 'money': {}}

In [27]:
normalizer.normalize('esok')

{'normalize': 'esok',
 'date': {'esok': datetime.datetime(2020, 8, 2, 22, 55, 24, 930259)},
 'money': {}}

In [28]:
normalizer.normalize('okt 2019')

{'normalize': '01/10/2019',
 'date': {'okt 2019': datetime.datetime(2019, 10, 1, 0, 0)},
 'money': {}}

In [29]:
normalizer.normalize('2pgi')

{'normalize': 'dua pagi',
 'date': {'2 AM': datetime.datetime(2020, 8, 1, 2, 0)},
 'money': {}}

In [30]:
normalizer.normalize('pukul 8 malam')

{'normalize': 'pukul lapan malam',
 'date': {'pukul 8': datetime.datetime(2020, 8, 8, 0, 0)},
 'money': {}}

In [31]:
normalizer.normalize('jan 2 2019 12:01pm')

{'normalize': '02/01/2019 12:01:00',
 'date': {'jan 2 2019 12:01pm': datetime.datetime(2019, 1, 2, 12, 1)},
 'money': {}}

In [32]:
normalizer.normalize('2 ptg jan 2 2019')

{'normalize': 'dua petang 02/01/2019',
 'date': {'2 PM jan 2 2019': datetime.datetime(2019, 1, 2, 14, 0)},
 'money': {}}

#### 15. normalize money string to string number representation

In [33]:
normalizer.normalize('50 sen')

{'normalize': 'lima puluh sen', 'date': {}, 'money': {'50 sen': 'RM0.5'}}

In [34]:
normalizer.normalize('20.5 ringgit')

{'normalize': 'dua puluh ringgit lima puluh sen',
 'date': {},
 'money': {'20.5 ringgit': 'RM20.5'}}

In [35]:
normalizer.normalize('20m ringgit')

{'normalize': 'dua puluh juta ringgit',
 'date': {},
 'money': {'20m ringgit': 'RM20000000.0'}}

In [36]:
normalizer.normalize('22.5123334k ringgit')

{'normalize': 'dua puluh dua ribu lima ratus dua belas ringgit tiga ratus tiga puluh empat sen',
 'date': {},
 'money': {'22.512334k ringgit': 'RM22512.334'}}

#### 16. normalize date string to %d/%m/%Y

In [37]:
normalizer.normalize('1 nov 2019')

{'normalize': '01/11/2019',
 'date': {'1 nov 2019': datetime.datetime(2019, 11, 1, 0, 0)},
 'money': {}}

In [38]:
normalizer.normalize('januari 1 1996')

{'normalize': '01/01/1996',
 'date': {'januari 1 1996': datetime.datetime(1996, 1, 1, 0, 0)},
 'money': {}}

In [39]:
normalizer.normalize('januari 2019')

{'normalize': '01/01/2019',
 'date': {'januari 2019': datetime.datetime(2019, 1, 1, 0, 0)},
 'money': {}}

#### 17. normalize time string to %H:%M:%S

In [40]:
normalizer.normalize('2pm')

{'normalize': '14:00:00',
 'date': {'2pm': datetime.datetime(2020, 8, 1, 14, 0)},
 'money': {}}

In [41]:
normalizer.normalize('2:01pm')

{'normalize': '14:01:00',
 'date': {'2:01pm': datetime.datetime(2020, 8, 1, 14, 1)},
 'money': {}}

In [42]:
normalizer.normalize('2AM')

{'normalize': '02:00:00',
 'date': {'2am': datetime.datetime(2020, 8, 1, 2, 0)},
 'money': {}}
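
The date and time formats in rules 16 and 17 correspond to standard `strftime` directives. A minimal sketch using only the standard library shows how a parsed `datetime` maps to the strings seen in the `normalize` field; `format_date` and `format_time` are hypothetical helpers, not Malaya functions.

```python
from datetime import datetime

def format_date(value: datetime) -> str:
    """Day/month/four-digit year, as in rule 16."""
    return value.strftime('%d/%m/%Y')

def format_time(value: datetime) -> str:
    """24-hour hour:minute:second, as in rule 17."""
    return value.strftime('%H:%M:%S')

parsed = datetime(2019, 10, 2, 14, 0)
print(format_date(parsed))  # 02/10/2019
print(format_time(parsed))  # 14:00:00
```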

#### 18. expand repetition shortform

In [43]:
normalizer.normalize('skit2')

{'normalize': 'sakit - sakit', 'date': {}, 'money': {}}

In [44]:
normalizer.normalize('xskit2')

{'normalize': 'tak sakit - sakit', 'date': {}, 'money': {}}

In [45]:
normalizer.normalize('xjdi2')

{'normalize': 'tak jadi - jadi', 'date': {}, 'money': {}}

In [46]:
normalizer.normalize('xjdi4')

{'normalize': 'tak jadi - jadi - jadi - jadi', 'date': {}, 'money': {}}

In [47]:
normalizer.normalize('xjdi0')

{'normalize': 'tak jadi', 'date': {}, 'money': {}}

In [48]:
normalizer.normalize('xjdi')

{'normalize': 'tak jadi', 'date': {}, 'money': {}}
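
The expansion behaviour above (a leading `x` becomes `tak`, and a trailing digit repeats the word) can be sketched with one regular expression. This is an illustration only: `expand_repetition` is a hypothetical helper, and it skips the spelling correction step, so it expects already-corrected words such as `jadi` rather than `jdi`.

```python
import re

def expand_repetition(token: str) -> str:
    """Expand 'x' negation and a trailing repetition digit in one token."""
    match = re.fullmatch(r'(x?)([a-z]+)(\d*)', token.lower())
    if match is None:
        return token
    negation, word, digits = match.groups()
    # A missing digit or 0 leaves a single copy of the word.
    count = max(int(digits), 1) if digits else 1
    expanded = ' - '.join([word] * count)
    return f'tak {expanded}' if negation else expanded

print(expand_repetition('jadi2'))   # jadi - jadi
print(expand_repetition('xjadi4'))  # tak jadi - jadi - jadi - jadi
```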

#### 19. normalize `NUM SI-UNIT`

In [49]:
normalizer.normalize('61.2 kg')

{'normalize': 'enam puluh satu perpuluhan dua kilogram',
 'date': {},
 'money': {}}

In [50]:
normalizer.normalize('61.2kg')

{'normalize': 'enam puluh satu perpuluhan dua kilogram',
 'date': {},
 'money': {}}

In [51]:
normalizer.normalize('61kg')

{'normalize': 'enam puluh satu kilogram', 'date': {}, 'money': {}}

In [52]:
normalizer.normalize('61ml')

{'normalize': 'enam puluh satu milliliter', 'date': {}, 'money': {}}

In [53]:
normalizer.normalize('61m')

{'normalize': 'enam puluh satu meter', 'date': {}, 'money': {}}

In [54]:
normalizer.normalize('61.3434km')

{'normalize': 'enam puluh satu perpuluhan tiga empat tiga empat kilometer',
 'date': {},
 'money': {}}

In [55]:
normalizer.normalize('61.3434c')

{'normalize': 'enam puluh satu perpuluhan tiga empat tiga empat celcius',
 'date': {},
 'money': {}}

In [56]:
normalizer.normalize('61.3434 c')

{'normalize': 'enam puluh satu perpuluhan tiga empat tiga empat celcius',
 'date': {},
 'money': {}}
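
The `NUM SI-UNIT` rule first has to separate the number from the unit, with or without a space between them. A minimal sketch of that split, assuming only the five units that appear in the examples above; `UNIT_NAMES` and `split_measurement` are hypothetical, and the unit spellings are copied from the outputs.

```python
import re

# Unit abbreviations seen in the examples above, with the names from the outputs.
UNIT_NAMES = {
    'kg': 'kilogram',
    'ml': 'milliliter',
    'km': 'kilometer',
    'm': 'meter',
    'c': 'celcius',
}

def split_measurement(text: str):
    """Return (number string, unit name) or None if no known unit is found."""
    match = re.fullmatch(r'(\d+(?:\.\d+)?)\s*([a-z]+)', text.strip().lower())
    if match is None or match.group(2) not in UNIT_NAMES:
        return None
    return match.group(1), UNIT_NAMES[match.group(2)]

print(split_measurement('61.2kg'))
print(split_measurement('61.3434 c'))
```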

#### 20. normalize `laughing` pattern

In [57]:
normalizer.normalize('dia sakai wkwkwkawkw')

{'normalize': 'dia sakai haha', 'date': {}, 'money': {}}

In [58]:
normalizer.normalize('dia sakai hhihihu')

{'normalize': 'dia sakai haha', 'date': {}, 'money': {}}

#### 21. normalize `mengeluh` (sighing) pattern

In [4]:
normalizer.normalize('Haih apa lah si yusuff ni . Mama cari rupanya celah ni')

{'normalize': 'Aduh apa lah si yusuf ini . Mama cari rupanya celah ini',
 'date': {},
 'money': {}}

In [60]:
normalizer.normalize('hais sorrylah syazzz')

{'normalize': 'aduh maaf lah syazz', 'date': {}, 'money': {}}