# Preprocessing

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/preprocessing](https://github.com/huseinzol05/Malaya/tree/master/example/preprocessing).
    
</div>

In [1]:
%%time
import malaya

CPU times: user 6.65 s, sys: 1.01 s, total: 7.67 s
Wall time: 7.24 s


### Available rules

Social media text from Twitter, Facebook and Instagram is very noisy, and we want to clean it as much as possible so that our models can understand sentence structure better. Malaya standardizes text preprocessing as follows:

1. Malaya can replace special words with tokens to reduce the curse of dimensionality, e.g. `rm10k` becomes `<money>`.
2. Malaya can tag special words, e.g. `#drmahathir` becomes `<hashtag> drmahathir </hashtag>`.
3. Malaya can expand English contractions.
4. Malaya can translate EN words into MS words; this requires a translator callable.
5. Malaya can stem and lemmatize; this requires a stemmer callable.
6. Malaya can normalize elongated words; this requires a Malaya speller callable.
7. Malaya can expand hashtags, e.g. `#drmahathir` becomes `dr mahathir`; this requires a segmentation callable.

#### normalize

Supported `normalize` values:

1. hashtag
2. cashtag
3. tag
4. user
5. emphasis
6. censored
7. acronym
8. eastern_emoticons
9. rest_emoticons
10. emoji
11. quotes
12. percent
13. repeat_puncts
14. money
15. email
16. phone
17. number
18. allcaps
19. url
20. date
21. time

You can get the full list of supported values from `malaya.preprocessing.get_normalize()`.

For example, if you set `money` and `number` and the input string is `RM10k`, the output is `<money>`.
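
To make the idea of replacing special words with tokens concrete, here is a minimal sketch using only the standard library. The patterns and the `normalize_token` helper are hypothetical and much simpler than whatever Malaya uses internally; they only illustrate how an enabled `normalize` name can map a matching token to `<name>`.

```python
import re

# Hypothetical, simplified patterns for illustration only;
# they are not Malaya's actual regular expressions.
PATTERNS = {
    'money': re.compile(r'^rm\d+(?:\.\d+)?[kmb]?$', re.IGNORECASE),
    'number': re.compile(r'^\d+(?:\.\d+)?$'),
}


def normalize_token(token, normalize):
    """Return `<name>` for the first enabled pattern that matches, else the token."""
    for name, pattern in PATTERNS.items():
        if name in normalize and pattern.match(token):
            return f'<{name}>'
    return token


tokens = ['harga', 'RM10k', '25']
result = [normalize_token(t, normalize=['money', 'number']) for t in tokens]
print(result)  # ['harga', '<money>', '<number>']
```

Because `money` is checked before `number` in this sketch, `RM10k` is replaced as a whole by `<money>` rather than being split into a currency prefix and a number.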

#### annotate

Supported `annotate` values:

1. hashtag
2. allcaps
3. elongated
4. repeated
5. emphasis
6. censored

For example, if you set `hashtag` and the input string is `#drmahathir`, the output is `<hashtag> drmahathir </hashtag>`.
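
Annotation differs from normalization in that the original word is kept and wrapped in an opening and closing tag. The sketch below illustrates this for hashtags only; `annotate_hashtag` is a hypothetical helper written for this example, not a Malaya function.

```python
def annotate_hashtag(token):
    """Wrap a hashtag token in <hashtag> ... </hashtag>; leave other tokens alone."""
    if token.startswith('#') and len(token) > 1:
        return ['<hashtag>', token[1:], '</hashtag>']
    return [token]


tokens = ['#drmahathir', 'hebat']
# Flatten the per-token lists into one token sequence.
annotated = [piece for t in tokens for piece in annotate_hashtag(t)]
print(' '.join(annotated))  # <hashtag> drmahathir </hashtag> hebat
```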

In [2]:
string_1 = 'CANT WAIT for the new season of #mahathirmohamad ＼(^o^)／!!! #davidlynch #tvseries :))), TAAAK SAAABAAR!!!'
string_2 = 'kecewanya #johndoe movie and it suuuuucks!!! WASTED RM10... rm10 #badmovies :/'
string_3 = "@husein:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! :-D http://sentimentsymposium.com/."
string_4 = 'aahhh, malasnye nak pegi keje harini #mondayblues'
string_5 = '#drmahathir #najibrazak #1malaysia #mahathirnajib'

### Preprocessing Interface

```python
def preprocessing(
    normalize: List[str] = [
        'url',
        'email',
        'percent',
        'money',
        'phone',
        'user',
        'time',
        'date',
        'number',
    ],
    annotate: List[str] = [
        'allcaps',
        'elongated',
        'repeated',
        'emphasis',
        'censored',
        'hashtag',
    ],
    lowercase: bool = True,
    fix_unidecode: bool = True,
    expand_english_contractions: bool = True,
    translator: Callable = None,
    segmenter: Callable = None,
    stemmer: Callable = None,
    speller: Callable = None,
    **kwargs,
):
    """
    Load Preprocessing class.

    Parameters
    ----------
    normalize: List[str], optional (default=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'])
        normalizing tokens, can check all supported normalizing at `malaya.preprocessing.get_normalize()`.
    annotate: List[str], optional (default=['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'])
        annotate tokens <open></open>,
        only accept ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].
    lowercase: bool, optional (default=True)
    fix_unidecode: bool, optional (default=True)
    expand_english_contractions: bool
        expand english contractions
    translator: Callable, optional (default=None)
        function to translate EN word to MS word.
    segmenter: Callable, optional (default=None)
        function to segmentize word.
        If provide, it will expand hashtags, #mondayblues == monday blues
    stemmer: Callable, optional (default=None)
        function to stem word.
    speller: object
        spelling correction object, need to have a method `correct` or `normalize_elongated`

    Returns
    -------
    result : malaya.preprocessing.Preprocessing class
    """
```

### Load default parameters

The default parameters can already translate some common English words into Bahasa Malaysia; in the outputs below, for example, `for` becomes `untuk`.

In [3]:
%%time
preprocessing = malaya.preprocessing.preprocessing()

CPU times: user 105 ms, sys: 3.94 ms, total: 109 ms
Wall time: 109 ms


In [4]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 3.94 ms, sys: 61 µs, total: 4 ms
Wall time: 4.06 ms


'<allcaps> tak boleh wait </allcaps> untuk the new season of <hashtag> mahathirmohamad </hashtag> \\(^o^)/ <repeated> ! </repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> <elongated> taak </elongated> <elongated> saabaar </elongated> </allcaps> <repeated> ! </repeated>'

In [5]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 608 µs, sys: 0 ns, total: 608 µs
Wall time: 615 µs


'kecewanya <hashtag> johndoe </hashtag> movie and it <elongated> suucks </elongated> <repeated> ! </repeated> <allcaps> wasted </allcaps> <money> <repeated> . </repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'

In [6]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 758 µs, sys: 27 µs, total: 785 µs
Wall time: 820 µs


'<user> : can not wait untuk the <date> <hashtag> sentiment </hashtag> talks ! <allcaps> <elongated> yaay </elongated> </allcaps> <repeated> ! </repeated> :-d <url>'

In [7]:
%%time
' '.join(preprocessing.process(string_4))

CPU times: user 443 µs, sys: 1 µs, total: 444 µs
Wall time: 450 µs


'<elongated> aahh </elongated> , malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'

In [8]:
%%time
' '.join(preprocessing.process(string_5))

CPU times: user 462 µs, sys: 7 µs, total: 469 µs
Wall time: 479 µs


'<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathirnajib </hashtag>'
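
The `<elongated>` outputs above (`taak`, `saabaar`, `suucks`, `yaay`, `aahh`) are consistent with collapsing every run of three or more identical characters down to two. The sketch below reproduces that behaviour with a single regular expression. This is an assumption inferred from the outputs, not a description of Malaya's implementation, and `collapse_elongated` is a hypothetical name.

```python
import re

# Matches any character repeated three or more times in a row.
ELONGATED = re.compile(r'(.)\1{2,}')


def collapse_elongated(word):
    """Collapse runs of 3+ identical characters down to exactly 2."""
    return ELONGATED.sub(r'\1\1', word)


words = ['taaak', 'saaabaar', 'suuuuucks', 'yaaaaaay', 'aahhh']
collapsed = [collapse_elongated(w) for w in words]
print(collapsed)  # ['taak', 'saabaar', 'suucks', 'yaay', 'aahh']
```

Collapsing alone cannot recover the dictionary form of a word (`taak` is still not `tak`), which is why a further correction step is needed to get back to real words.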

### Load default parameters with spelling correction to normalize elongated words

We saw that `taak`, `saabaar` and other elongated words are not restored to their original forms, so we can use spelling correction to normalize them.

In [9]:
corrector = malaya.spell.probability()

In [10]:
%%time
preprocessing = malaya.preprocessing.preprocessing(speller = corrector)

CPU times: user 162 µs, sys: 13 µs, total: 175 µs
Wall time: 181 µs


In [11]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 957 µs, sys: 13 µs, total: 970 µs
Wall time: 991 µs


'<allcaps> tak boleh wait </allcaps> untuk the new season of <hashtag> mahathirmohamad </hashtag> \\(^o^)/ <repeated> ! </repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> <elongated> tidak </elongated> <elongated> sabar </elongated> </allcaps> <repeated> ! </repeated>'

In [12]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 652 µs, sys: 4 µs, total: 656 µs
Wall time: 662 µs


'kecewanya <hashtag> johndoe </hashtag> movie and it <elongated> sucks </elongated> <repeated> ! </repeated> <allcaps> wasted </allcaps> <money> <repeated> . </repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'

In [13]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 635 µs, sys: 1e+03 ns, total: 636 µs
Wall time: 641 µs


'<user> : can not wait untuk the <date> <hashtag> sentiment </hashtag> talks ! <allcaps> <elongated> yay </elongated> </allcaps> <repeated> ! </repeated> :-d <url>'

In [14]:
%%time
' '.join(preprocessing.process(string_4))

CPU times: user 477 µs, sys: 1 µs, total: 478 µs
Wall time: 485 µs


'<elongated> ah </elongated> , malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'

In [15]:
%%time
' '.join(preprocessing.process(string_5))

CPU times: user 565 µs, sys: 14 µs, total: 579 µs
Wall time: 604 µs


'<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathirnajib </hashtag>'

### Load default parameters with segmenter to expand hashtags

We saw `<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag>`; we want to expand these to `dr mahathir` and `najib razak`.

In [16]:
segmenter = malaya.segmentation.transformer(model = 'small', quantized = True)

Load quantized model will cause accuracy drop.


In [17]:
segmenter_func = lambda x: segmenter.greedy_decoder([x])[0]
segmenter_func('hellosuka')

'hello suka'
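
As `segmenter_func` shows, the `segmenter` argument is a plain callable that takes one string and returns the segmented string. Assuming that contract, any function with the same shape could stand in for the transformer model, for example during testing. The toy greedy longest-match segmenter below is a hypothetical stand-in with a made-up vocabulary; it does not reflect how the Malaya model works.

```python
# Hypothetical toy vocabulary; a real segmenter does not rely on a
# fixed word list like this.
VOCAB = {'dr', 'mahathir', 'najib', 'razak', 'monday', 'blues'}


def toy_segmenter(text):
    """Greedy longest-match segmentation; unknown characters are emitted one by one."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                words.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary word starts here: emit one character and move on.
            words.append(text[i])
            i += 1
    return ' '.join(words)


print(toy_segmenter('drmahathir'))   # dr mahathir
print(toy_segmenter('mondayblues'))  # monday blues
```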

In [18]:
%%time
preprocessing = malaya.preprocessing.preprocessing(segmenter = segmenter_func)

CPU times: user 135 µs, sys: 12 µs, total: 147 µs
Wall time: 151 µs


In [19]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 415 ms, sys: 77 ms, total: 492 ms
Wall time: 196 ms


'<allcaps> tak boleh wait </allcaps> untuk the new season of <hashtag> mahathir mohamad </hashtag> \\(^o^)/ <repeated> ! </repeated> <hashtag> davidlynch </hashtag> <hashtag> tv series </hashtag> <happy> , <allcaps> <elongated> taak </elongated> <elongated> saabaar </elongated> </allcaps> <repeated> ! </repeated>'

In [20]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 260 ms, sys: 47.4 ms, total: 307 ms
Wall time: 129 ms


'kecewanya <hashtag> johndoe </hashtag> movie and it <elongated> suucks </elongated> <repeated> ! </repeated> <allcaps> wasted </allcaps> <money> <repeated> . </repeated> <money> <hashtag> bad movies </hashtag> <annoyed>'

In [21]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 117 ms, sys: 21.2 ms, total: 138 ms
Wall time: 52.8 ms


'<user> : can not wait untuk the <date> <hashtag> sentiment </hashtag> talks ! <allcaps> <elongated> yaay </elongated> </allcaps> <repeated> ! </repeated> :-d <url>'

In [22]:
%%time
' '.join(preprocessing.process(string_4))

CPU times: user 195 ms, sys: 35.3 ms, total: 231 ms
Wall time: 98.6 ms


'<elongated> aahh </elongated> , malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'

In [23]:
%%time
' '.join(preprocessing.process(string_5))

CPU times: user 473 ms, sys: 85.1 ms, total: 558 ms
Wall time: 214 ms


'<hashtag> dr mahathir </hashtag> <hashtag> najib razak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathir najib </hashtag>'

### Load default parameters with stemming and lemmatization

In [24]:
sastrawi = malaya.stem.sastrawi()
stemmer_func = lambda x: sastrawi.stem(x)

In [25]:
stemmer_func('sukakan')

'suka'

In [26]:
%%time
preprocessing = malaya.preprocessing.preprocessing(stemmer = stemmer_func)

CPU times: user 149 µs, sys: 17 µs, total: 166 µs
Wall time: 171 µs


In [27]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 15.7 ms, sys: 701 µs, total: 16.4 ms
Wall time: 15.9 ms


'<allcaps> tak boleh wait </allcaps> untuk the new season of <hashtag> mahathirmohamad </hashtag> o <repeated> </repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> <allcaps> <elongated> taak </elongated> <elongated> saabaar </elongated> </allcaps> <repeated> </repeated>'

In [28]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 2.45 ms, sys: 16 µs, total: 2.47 ms
Wall time: 2.49 ms


'kecewa <hashtag> johndoe </hashtag> movie and it <elongated> suucks </elongated> <repeated> </repeated> <allcaps> wasted </allcaps> <money> <repeated> </repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'

In [29]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 2.81 ms, sys: 62 µs, total: 2.87 ms
Wall time: 2.97 ms


'<user> can not wait untuk the <date> <hashtag> sentiment </hashtag> talks <allcaps> <elongated> yaay </elongated> </allcaps> <repeated> </repeated> -d <url>'

In [30]:
preprocessing.process(string_3)

['<user>',
 'can',
 'not',
 'wait',
 'untuk',
 'the',
 '<date>',
 '<hashtag>',
 'sentiment',
 '</hashtag>',
 'talks',
 '<allcaps>',
 '<elongated>',
 'yaay',
 '</elongated>',
 '</allcaps>',
 '<repeated>',
 '</repeated>',
 '-d',
 '<url>']

In [31]:
%%time
' '.join(preprocessing.process(string_4))

CPU times: user 2.3 ms, sys: 7 µs, total: 2.31 ms
Wall time: 2.34 ms


'<elongated> aahh </elongated> malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'

In [32]:
%%time
' '.join(preprocessing.process(string_5))

CPU times: user 2.81 ms, sys: 26 µs, total: 2.84 ms
Wall time: 2.87 ms


'<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathirnajib </hashtag>'

In [33]:
%%time
' '.join(preprocessing.process('saya disini berjalan pergi ke putrajaya, #masjidbesi'))

CPU times: user 2.8 ms, sys: 18 µs, total: 2.82 ms
Wall time: 2.85 ms


'saya sini jalan pergi ke putrajaya <hashtag> masjidbesi </hashtag>'

### Load translation

In [37]:
en_ms_vocab = malaya.translation.en_ms.dictionary()
translator = lambda x: en_ms_vocab.get(x, x)
translator('pain'), translator('aduh')

('kesakitan', 'aduh')

In [38]:
%%time
preprocessing = malaya.preprocessing.preprocessing(translator = translator)

CPU times: user 147 µs, sys: 13 µs, total: 160 µs
Wall time: 166 µs


In [39]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 1.29 ms, sys: 24 µs, total: 1.32 ms
Wall time: 1.36 ms


'<allcaps> tak boleh tunggu </allcaps> untuk yang baru musim daripada <hashtag> mahathirmohamad </hashtag> \\(^o^)/ <repeated> ! </repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> <elongated> taak </elongated> <elongated> saabaar </elongated> </allcaps> <repeated> ! </repeated>'

In [40]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 762 µs, sys: 1 µs, total: 763 µs
Wall time: 770 µs


'kecewanya <hashtag> johndoe </hashtag> filem dan ia <elongated> suucks </elongated> <repeated> ! </repeated> <allcaps> dibazirkan </allcaps> <money> <repeated> . </repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'

In [41]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 645 µs, sys: 1 µs, total: 646 µs
Wall time: 652 µs


'<user> : boleh tidak tunggu untuk yang <date> <hashtag> sentimen </hashtag> talks ! <allcaps> <elongated> yaay </elongated> </allcaps> <repeated> ! </repeated> :-d <url>'

#### Use Neural Machine Translation

The problem with the dictionary-based approach is that a word that does not exist in the dictionary is left untranslated:

In [42]:
translator('love'), translator('them'), translator('pain')

('love', 'them', 'kesakitan')

In [43]:
nmt = malaya.translation.en_ms.transformer(model = 'small')
nmt_func = lambda x: nmt.greedy_decoder([x])[0]

In [44]:
nmt_func('love'), nmt_func('them'), nmt_func('pain')

('cinta', 'mereka', 'kesakitan')
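
Since the translator is just a callable, one option is to try the fast dictionary lookup first and call the slower model only on a miss. The sketch below shows that composition with stand-ins: `make_translator`, `toy_vocab`, `fake_model` and `toy_fallback` are all hypothetical names, and the fake model is an ordinary dictionary rather than a real neural translator.

```python
from typing import Callable, Dict


def make_translator(
    vocab: Dict[str, str], fallback: Callable[[str], str]
) -> Callable[[str], str]:
    """Look the word up in `vocab` first; call `fallback` only on a miss."""

    def translate(word: str) -> str:
        if word in vocab:
            return vocab[word]
        return fallback(word)

    return translate


# Stand-ins for illustration only: a tiny dictionary and a fake "model".
toy_vocab = {'pain': 'kesakitan'}
fake_model = {'love': 'cinta', 'them': 'mereka'}
calls = []


def toy_fallback(word: str) -> str:
    calls.append(word)  # record how often the slow path is taken
    return fake_model.get(word, word)


translate = make_translator(toy_vocab, toy_fallback)
print(translate('pain'), translate('love'), translate('aduh'))  # kesakitan cinta aduh
print(calls)  # ['love', 'aduh']
```

In real use, the dictionary lookup and the model's decoder would take the place of `toy_vocab` and `toy_fallback`; whether this trade-off is worthwhile depends on how many words miss the dictionary.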

In [45]:
%%time
preprocessing = malaya.preprocessing.preprocessing(translator = nmt_func)

CPU times: user 171 µs, sys: 1e+03 ns, total: 172 µs
Wall time: 178 µs


In [46]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 359 ms, sys: 41 ms, total: 400 ms
Wall time: 251 ms


'<allcaps> tak boleh tunggu </allcaps> untuk baru musim <hashtag> mahathirmohamad </hashtag> \\(^o^)/ <repeated> ! </repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> <elongated> taak </elongated> <elongated> abaar </elongated> </allcaps> <repeated> ! </repeated>'

In [47]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 207 ms, sys: 25.4 ms, total: 233 ms
Wall time: 148 ms


'kecewanya <hashtag> johndoe </hashtag> filem dan ia <elongated> bernasib baik </elongated> <repeated> ! </repeated> <allcaps> disia </allcaps> <money> <repeated> . </repeated> <money> <hashtag> badmovie </hashtag> <annoyed>'

In [48]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 206 ms, sys: 28.2 ms, total: 234 ms
Wall time: 152 ms


'<user> : boleh tidak tunggu untuk <date> <hashtag> sentimen </hashtag> ceramah ! <allcaps> <elongated> yaay </elongated> </allcaps> <repeated> ! </repeated> :-d <url>'

### Tokenizer

The tokenizer is able to tokenize using multiple regex pipelines; you can check the list with `malaya.preprocessing.get_normalize()`.

In [35]:
tokenizer = malaya.preprocessing.TOKENIZER().tokenize

In [36]:
tokenizer(string_1)

['CANT',
 'WAIT',
 'for',
 'the',
 'new',
 'season',
 'of',
 '#mahathirmohamad',
 '＼(^o^)／',
 '!',
 '!',
 '!',
 '#davidlynch',
 '#tvseries',
 ':)))',
 ',',
 'TAAAK',
 'SAAABAAR',
 '!',
 '!',
 '!']

In [37]:
tokenizer(string_2)

['kecewanya',
 '#johndoe',
 'movie',
 'and',
 'it',
 'suuuuucks',
 '!',
 '!',
 '!',
 'WASTED',
 'RM10',
 '.',
 '.',
 '.',
 'rm10',
 '#badmovies',
 ':/']

In [38]:
tokenizer(string_3)

['@husein',
 ':',
 'can',
 "'",
 't',
 'wait',
 'for',
 'the',
 'Nov 9',
 '#Sentiment',
 'talks',
 '!',
 'YAAAAAAY',
 '!',
 '!',
 '!',
 ':-D',
 'http://sentimentsymposium.com/.']

In [39]:
tokenizer('saya nak makan ayam harga rm10k')

['saya', 'nak', 'makan', 'ayam', 'harga', 'rm10k']
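
The general idea of tokenizing with several regex pipelines can be sketched as one ordered alternation, where the first pattern that matches at a position wins. The patterns below are hypothetical and heavily simplified. Unlike the Malaya tokenizer shown above, this sketch does not keep dates such as `Nov 9` or emoticons together as single tokens.

```python
import re

# Hypothetical, simplified patterns. Order matters: the first
# alternative that matches at a given position wins.
TOKEN_PATTERNS = [
    r'https?://\S+',  # url
    r'[#@]\w+',       # hashtag or user mention
    r'\w+',           # word or number
    r'[^\w\s]',       # any other single non-space character
]
TOKEN_RE = re.compile('|'.join(f'(?:{p})' for p in TOKEN_PATTERNS))


def toy_tokenize(text):
    """Return all non-overlapping matches, scanning left to right."""
    return TOKEN_RE.findall(text)


toy_tokens = toy_tokenize('kecewanya #johndoe movie!!! rm10 http://example.com')
print(toy_tokens)
# ['kecewanya', '#johndoe', 'movie', '!', '!', '!', 'rm10', 'http://example.com']
```

Putting the URL and hashtag patterns before the generic word pattern keeps `#johndoe` and the URL intact instead of splitting them at the punctuation.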