# Preprocessing

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/preprocessing](https://github.com/huseinzol05/Malaya/tree/master/example/preprocessing).
    
</div>

In [1]:
%%time
import malaya

CPU times: user 6.56 s, sys: 1.39 s, total: 7.95 s
Wall time: 9.68 s


### Available rules

Social media text from Twitter, Facebook and Instagram is very noisy, and we want to clean it as much as possible so that models can better understand sentence structure. Malaya standardizes text preprocessing as follows:

1. Malaya can replace special words with tokens to reduce the curse of dimensionality, e.g. `rm10k` becomes `<money>`.
2. Malaya can wrap special words in tags, e.g. `#drmahathir` becomes `<hashtag> drmahathir </hashtag>`.
3. Malaya can expand English contractions.
4. Malaya can translate EN words into MS words; this requires a translator callable.
5. Stemming and lemmatizing; this requires a stemmer callable.
6. Normalizing elongated words; this requires a Malaya speller.
7. Expanding hashtags, e.g. `#drmahathir` becomes `dr mahathir`; this requires a segmentation callable.
8. Malaya can add emoji tags if a `demoji` object is provided.
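
To make the difference between replacing a word with a token and wrapping it in tags concrete, here is a minimal regex sketch. It is not Malaya's implementation: the patterns `MONEY` and `HASHTAG` and both helper functions are hypothetical and cover only the two examples from the list above.

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only;
# Malaya's own rules cover many more cases.
MONEY = re.compile(r'\brm\s?\d+(?:\.\d+)?[km]?\b', re.IGNORECASE)
HASHTAG = re.compile(r'#(\w+)')


def normalize_money(text: str) -> str:
    # Normalizing discards the original surface form entirely.
    return MONEY.sub('<money>', text)


def annotate_hashtags(text: str) -> str:
    # Annotating keeps the word and wraps it in open/close tags.
    return HASHTAG.sub(r'<hashtag> \1 </hashtag>', text)


print(normalize_money('rm10k'))
print(annotate_hashtags('#drmahathir'))
```

Normalizing shrinks the vocabulary because every amount maps to the same token, while annotating preserves the word for downstream models.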

#### normalize

Supported `normalize` values:

1. hashtag
2. cashtag
3. tag
4. user
5. emphasis
6. censored
7. acronym
8. eastern_emoticons
9. rest_emoticons
10. emoji
11. quotes
12. percent
13. repeat_puncts
14. money
15. email
16. phone
17. number
18. allcaps
19. url
20. date
21. time

You can check the full supported list with `malaya.preprocessing.get_normalize()`.

For example, if you set `money` and `number` and the input string is `RM10k`, the output is `<money>`.

#### annotate

Supported `annotate` values:

1. hashtag
2. allcaps
3. elongated
4. repeated
5. emphasis
6. censored

For example, if you set `hashtag` and the input string is `#drmahathir`, the output is `<hashtag> drmahathir </hashtag>`.

In [2]:
string_1 = 'CANT WAIT for the new season of #mahathirmohamad ＼(^o^)／!!! #davidlynch #tvseries :))), TAAAK SAAABAAR!!!'
string_2 = 'kecewanya #johndoe movie and it suuuuucks!!! WASTED RM10... rm10 #badmovies :/'
string_3 = "@husein:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! :-D http://sentimentsymposium.com/."
string_4 = 'aahhh, malasnye nak pegi keje harini #mondayblues'
string_5 = '#drmahathir #najibrazak #1malaysia #mahathirnajib'

### Preprocessing Interface

```python
def preprocessing(
    normalize: List[str] = [
        'url',
        'email',
        'percent',
        'money',
        'phone',
        'user',
        'time',
        'date',
        'number',
    ],
    annotate: List[str] = [
        'allcaps',
        'elongated',
        'repeated',
        'emphasis',
        'censored',
        'hashtag',
    ],
    lowercase: bool = True,
    fix_unidecode: bool = True,
    expand_english_contractions: bool = True,
    translator: Callable = None,
    segmenter: Callable = None,
    stemmer: Callable = None,
    speller: Callable = None,
    demoji: Callable = None,
    **kwargs,
):
    """
    Load Preprocessing class.

    Parameters
    ----------
    normalize: List[str], optional (default=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'])
        normalizing tokens, can check all supported normalizing at `malaya.preprocessing.get_normalize()`.
    annotate: List[str], optional (default=['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'])
        annotate tokens with <open></open> tags,
        only accept ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].
    lowercase: bool, optional (default=True)
    fix_unidecode: bool, optional (default=True)
        fix unidecode using `ftfy.fix_text`.
    expand_english_contractions: bool, optional (default=True)
        expand english contractions.
    translator: Callable, optional (default=None)
        function to translate EN word to MS word.
    segmenter: Callable, optional (default=None)
        function to segmentize word.
        If provided, it will expand hashtags, e.g. #mondayblues -> monday blues
    stemmer: Callable, optional (default=None)
        function to stem word.
    speller: object
        spelling correction object, need to have a method `correct` or `normalize_elongated`.
    demoji: object
        demoji object, need to have a method `demoji`.

    Returns
    -------
    result : malaya.preprocessing.Preprocessing class
    """
```

### Load default parameters

Even with the default parameters, some common English words are translated into Bahasa Malaysia, as the outputs below show.

In [3]:
%%time
preprocessing = malaya.preprocessing.preprocessing()

CPU times: user 211 ms, sys: 3.78 ms, total: 215 ms
Wall time: 217 ms


In [4]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 2.85 ms, sys: 55 µs, total: 2.91 ms
Wall time: 2.95 ms


'<allcaps> tak boleh wait </allcaps> untuk the new season of <hashtag> mahathirmohamad </hashtag> \\(^o^)/ <repeated> ! </repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> <elongated> taak </elongated> <elongated> saabaar </elongated> </allcaps> <repeated> ! </repeated>'

In [5]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 532 µs, sys: 6 µs, total: 538 µs
Wall time: 552 µs


'kecewanya <hashtag> johndoe </hashtag> movie and it <elongated> suucks </elongated> <repeated> ! </repeated> <allcaps> wasted </allcaps> <money> <repeated> . </repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'

In [6]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 595 µs, sys: 9 µs, total: 604 µs
Wall time: 619 µs


'<user> : can not wait untuk the <date> <hashtag> sentiment </hashtag> talks ! <allcaps> <elongated> yaay </elongated> </allcaps> <repeated> ! </repeated> :-d <url>'

In [7]:
%%time
' '.join(preprocessing.process(string_4))

CPU times: user 452 µs, sys: 13 µs, total: 465 µs
Wall time: 491 µs


'<elongated> aahh </elongated> , malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'

In [8]:
%%time
' '.join(preprocessing.process(string_5))

CPU times: user 396 µs, sys: 1e+03 ns, total: 397 µs
Wall time: 401 µs


'<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathirnajib </hashtag>'

### Load default parameters with spelling correction to normalize elongated words

We saw that `taak`, `saabaar` and other elongated words are not restored to their original forms, so we can use spelling correction to normalize them.
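
The intermediate forms seen so far (`taak`, `saabaar`, `suucks`, `yaay`, `aahh`) are consistent with collapsing every run of three or more identical characters down to two. The sketch below reproduces that behaviour with a single regex. This is an assumption inferred from the outputs, not taken from Malaya's source, and `shorten_elongated` is a hypothetical helper.

```python
import re

# Matches any character repeated three or more times in a row.
_ELONGATED = re.compile(r'(.)\1{2,}')


def shorten_elongated(word: str) -> str:
    # Keep two copies of the repeated character; a speller is still
    # needed afterwards to decide between e.g. 'taak' and 'tak'.
    return _ELONGATED.sub(r'\1\1', word)


for word in ['taaak', 'saaabaar', 'suuuuucks', 'yaaaaaay', 'aahhh']:
    print(word, '->', shorten_elongated(word))
```

Collapsing alone cannot recover the dictionary form, which is why a spelling corrector is plugged in for the final step.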

In [49]:
corrector = malaya.spell.probability()

In [10]:
%%time
preprocessing = malaya.preprocessing.preprocessing(speller = corrector)

CPU times: user 181 µs, sys: 19 µs, total: 200 µs
Wall time: 220 µs


In [11]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 909 µs, sys: 16 µs, total: 925 µs
Wall time: 937 µs


'<allcaps> tak boleh wait </allcaps> untuk the new season of <hashtag> mahathirmohamad </hashtag> \\(^o^)/ <repeated> ! </repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> <elongated> tidak </elongated> <elongated> sabar </elongated> </allcaps> <repeated> ! </repeated>'

In [12]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 701 µs, sys: 9 µs, total: 710 µs
Wall time: 719 µs


'kecewanya <hashtag> johndoe </hashtag> movie and it <elongated> sucks </elongated> <repeated> ! </repeated> <allcaps> wasted </allcaps> <money> <repeated> . </repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'

In [13]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 560 µs, sys: 7 µs, total: 567 µs
Wall time: 575 µs


'<user> : can not wait untuk the <date> <hashtag> sentiment </hashtag> talks ! <allcaps> <elongated> yay </elongated> </allcaps> <repeated> ! </repeated> :-d <url>'

In [14]:
%%time
' '.join(preprocessing.process(string_4))

CPU times: user 456 µs, sys: 1 µs, total: 457 µs
Wall time: 463 µs


'<elongated> ah </elongated> , malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'

In [15]:
%%time
' '.join(preprocessing.process(string_5))

CPU times: user 353 µs, sys: 1 µs, total: 354 µs
Wall time: 357 µs


'<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathirnajib </hashtag>'

### Load default parameters with segmenter to expand hashtags

We saw `<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag>`; we want these expanded to `dr mahathir` and `najib razak`.

In [48]:
segmenter = malaya.segmentation.transformer(model = 'small', quantized = True)

In [17]:
segmenter_func = lambda x: segmenter.greedy_decoder([x])[0]
segmenter_func('hellosuka')

'hello suka'
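
The `segmenter` argument only has to be a callable that takes one string and returns a space-separated string, so a transformer model is not strictly required. As a rough sketch of that contract, the toy segmenter below splits text against a tiny hand-written vocabulary using dynamic programming. Both `make_dictionary_segmenter` and the vocabulary are hypothetical and far weaker than a trained model.

```python
from typing import Callable, List, Optional, Set


def make_dictionary_segmenter(vocab: Set[str]) -> Callable[[str], str]:
    def segment(text: str) -> str:
        n = len(text)
        # best[i] is the fewest-words split of text[:i], or None if
        # text[:i] cannot be built from vocabulary words.
        best: List[Optional[List[str]]] = [None] * (n + 1)
        best[0] = []
        for end in range(1, n + 1):
            for start in range(end):
                piece = text[start:end]
                if best[start] is not None and piece in vocab:
                    candidate = best[start] + [piece]
                    if best[end] is None or len(candidate) < len(best[end]):
                        best[end] = candidate
        # Fall back to the unsegmented input when no full split exists.
        return ' '.join(best[n]) if best[n] is not None else text

    return segment


# Hypothetical toy vocabulary, for illustration only.
toy_segmenter = make_dictionary_segmenter(
    {'dr', 'mahathir', 'najib', 'razak', 'monday', 'blues'}
)
print(toy_segmenter('drmahathir'))
print(toy_segmenter('johndoe'))
```

Because it has the same string-in, string-out shape as `segmenter_func`, a function like this could in principle be passed as `segmenter`, although its quality depends entirely on the vocabulary.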

In [18]:
%%time
preprocessing = malaya.preprocessing.preprocessing(segmenter = segmenter_func)

CPU times: user 179 µs, sys: 6 µs, total: 185 µs
Wall time: 194 µs


In [19]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 336 ms, sys: 65.9 ms, total: 402 ms
Wall time: 170 ms


'<allcaps> tak boleh wait </allcaps> untuk the new season of <hashtag> mahathir mohamad </hashtag> \\(^o^)/ <repeated> ! </repeated> <hashtag> davidlynch </hashtag> <hashtag> tv series </hashtag> <happy> , <allcaps> <elongated> taak </elongated> <elongated> saabaar </elongated> </allcaps> <repeated> ! </repeated>'

In [20]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 189 ms, sys: 36.4 ms, total: 225 ms
Wall time: 86.9 ms


'kecewanya <hashtag> johndoe </hashtag> movie and it <elongated> suucks </elongated> <repeated> ! </repeated> <allcaps> wasted </allcaps> <money> <repeated> . </repeated> <money> <hashtag> bad movies </hashtag> <annoyed>'

In [21]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 82.2 ms, sys: 20.4 ms, total: 103 ms
Wall time: 45.9 ms


'<user> : can not wait untuk the <date> <hashtag> sentiment </hashtag> talks ! <allcaps> <elongated> yaay </elongated> </allcaps> <repeated> ! </repeated> :-d <url>'

In [22]:
%%time
' '.join(preprocessing.process(string_4))

CPU times: user 133 ms, sys: 29.9 ms, total: 163 ms
Wall time: 69.1 ms


'<elongated> aahh </elongated> , malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'

In [23]:
%%time
' '.join(preprocessing.process(string_5))

CPU times: user 373 ms, sys: 73.7 ms, total: 447 ms
Wall time: 177 ms


'<hashtag> dr mahathir </hashtag> <hashtag> najib razak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathir najib </hashtag>'

### Load default parameters with stemming and lemmatization

In [24]:
sastrawi = malaya.stem.sastrawi()
stemmer_func = lambda x: sastrawi.stem(x)

In [25]:
stemmer_func('sukakan')

'suka'

In [26]:
%%time
preprocessing = malaya.preprocessing.preprocessing(stemmer = stemmer_func)

CPU times: user 210 µs, sys: 11 µs, total: 221 µs
Wall time: 227 µs


In [27]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 9.15 ms, sys: 190 µs, total: 9.34 ms
Wall time: 9.39 ms


'<allcaps> tak boleh wait </allcaps> untuk the new season of <hashtag> mahathirmohamad </hashtag> o <repeated> </repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> <allcaps> <elongated> taak </elongated> <elongated> saabaar </elongated> </allcaps> <repeated> </repeated>'

In [28]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 1.85 ms, sys: 1e+03 ns, total: 1.85 ms
Wall time: 1.86 ms


'kecewa <hashtag> johndoe </hashtag> movie and it <elongated> suucks </elongated> <repeated> </repeated> <allcaps> wasted </allcaps> <money> <repeated> </repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'

In [29]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 1.66 ms, sys: 1e+03 ns, total: 1.66 ms
Wall time: 1.67 ms


'<user> can not wait untuk the <date> <hashtag> sentiment </hashtag> talks <allcaps> <elongated> yaay </elongated> </allcaps> <repeated> </repeated> -d <url>'

In [30]:
preprocessing.process(string_3)

['<user>',
 'can',
 'not',
 'wait',
 'untuk',
 'the',
 '<date>',
 '<hashtag>',
 'sentiment',
 '</hashtag>',
 'talks',
 '<allcaps>',
 '<elongated>',
 'yaay',
 '</elongated>',
 '</allcaps>',
 '<repeated>',
 '</repeated>',
 '-d',
 '<url>']
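
Since `process` returns a plain list of tokens, the annotation tags are easy to filter out when a downstream step only needs the words. The sketch below drops open and close tags for the six `annotate` names while keeping placeholder tokens such as `<user>` and `<url>`. `strip_annotations` is a hypothetical helper, not part of Malaya.

```python
from typing import List

# The six annotate names listed earlier in this tutorial.
ANNOTATE = ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored']
ANNOTATION_TAGS = {f'<{name}>' for name in ANNOTATE} | {f'</{name}>' for name in ANNOTATE}


def strip_annotations(tokens: List[str]) -> List[str]:
    # Placeholders like <user>, <date>, <url> are not in ANNOTATION_TAGS,
    # so they are kept.
    return [token for token in tokens if token not in ANNOTATION_TAGS]


tokens = [
    '<user>', 'can', 'not', 'wait', 'untuk', 'the', '<date>',
    '<hashtag>', 'sentiment', '</hashtag>', 'talks',
    '<allcaps>', '<elongated>', 'yaay', '</elongated>', '</allcaps>',
    '<repeated>', '</repeated>', '-d', '<url>',
]
print(strip_annotations(tokens))
```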

In [31]:
%%time
' '.join(preprocessing.process(string_4))

CPU times: user 1.78 ms, sys: 15 µs, total: 1.8 ms
Wall time: 1.82 ms


'<elongated> aahh </elongated> malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'

In [32]:
%%time
' '.join(preprocessing.process(string_5))

CPU times: user 2.01 ms, sys: 11 µs, total: 2.02 ms
Wall time: 2.03 ms


'<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathirnajib </hashtag>'

In [33]:
%%time
' '.join(preprocessing.process('saya disini berjalan pergi ke putrajaya, #masjidbesi'))

CPU times: user 2.18 ms, sys: 20 µs, total: 2.2 ms
Wall time: 2.23 ms


'saya sini jalan pergi ke putrajaya <hashtag> masjidbesi </hashtag>'

### Load translation

In [34]:
en_ms_vocab = malaya.translation.en_ms.dictionary()
translator = lambda x: en_ms_vocab.get(x, x)
translator('pain'), translator('aduh')

('kesakitan', 'aduh')

In [35]:
%%time
preprocessing = malaya.preprocessing.preprocessing(translator = translator)

CPU times: user 121 µs, sys: 10 µs, total: 131 µs
Wall time: 135 µs


In [36]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 1.09 ms, sys: 63 µs, total: 1.15 ms
Wall time: 1.51 ms


'<allcaps> tak boleh tunggu </allcaps> untuk yang baru musim daripada <hashtag> mahathirmohamad </hashtag> \\(^o^)/ <repeated> ! </repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> <elongated> taak </elongated> <elongated> saabaar </elongated> </allcaps> <repeated> ! </repeated>'

In [37]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 550 µs, sys: 8 µs, total: 558 µs
Wall time: 563 µs


'kecewanya <hashtag> johndoe </hashtag> filem dan ia <elongated> suucks </elongated> <repeated> ! </repeated> <allcaps> dibazirkan </allcaps> <money> <repeated> . </repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'

In [38]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 690 µs, sys: 22 µs, total: 712 µs
Wall time: 759 µs


'<user> : boleh tidak tunggu untuk yang <date> <hashtag> sentimen </hashtag> talks ! <allcaps> <elongated> yaay </elongated> </allcaps> <repeated> ! </repeated> :-d <url>'

#### Use neural machine translation

The problem with the dictionary-based approach is that a word missing from the dictionary is left untranslated:

In [39]:
translator('love'), translator('them'), translator('pain')

('love', 'them', 'kesakitan')

In [47]:
nmt = malaya.translation.en_ms.transformer(model = 'small')
nmt_func = lambda x: nmt.greedy_decoder([x])[0]

In [41]:
nmt_func('love'), nmt_func('them'), nmt_func('pain')

('cinta', 'mereka', 'kesakitan')
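
The timings in this tutorial suggest that a dictionary lookup is far cheaper than a model call, so one possible design is to try the dictionary first and only fall back to the model for unknown words, caching the model's answers. The sketch below uses toy stand-ins: `make_translator`, `toy_vocab` and `toy_model` are all hypothetical, and in practice the fallback would be something like `nmt_func`.

```python
from functools import lru_cache
from typing import Callable, Dict, List


def make_translator(
    vocab: Dict[str, str],
    fallback: Callable[[str], str],
    cache_size: int = 10000,
) -> Callable[[str], str]:
    # Cache fallback results so each unknown word hits the model once.
    cached_fallback = lru_cache(maxsize=cache_size)(fallback)

    def translate(word: str) -> str:
        if word in vocab:
            return vocab[word]
        return cached_fallback(word)

    return translate


# Toy stand-ins for the dictionary and the neural model.
toy_vocab = {'pain': 'kesakitan'}
calls: List[str] = []


def toy_model(word: str) -> str:
    calls.append(word)  # record how often the "model" is really invoked
    return {'love': 'cinta', 'them': 'mereka'}.get(word, word)


hybrid = make_translator(toy_vocab, toy_model)
print(hybrid('pain'), hybrid('love'), hybrid('love'))
print(calls)  # the second 'love' is served from the cache
```

Whether this trade-off is worthwhile depends on how often the dictionary and the model disagree, which this sketch does not measure.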

In [42]:
%%time
preprocessing = malaya.preprocessing.preprocessing(translator = nmt_func)

CPU times: user 112 µs, sys: 4 µs, total: 116 µs
Wall time: 119 µs


In [43]:
%%time
' '.join(preprocessing.process(string_1))

CPU times: user 277 ms, sys: 27.5 ms, total: 305 ms
Wall time: 194 ms


'<allcaps> tak boleh tunggu </allcaps> untuk baru musim <hashtag> mahathirmohamad </hashtag> \\(^o^)/ <repeated> ! </repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> <elongated> taak </elongated> <elongated> abaar </elongated> </allcaps> <repeated> ! </repeated>'

In [44]:
%%time
' '.join(preprocessing.process(string_2))

CPU times: user 149 ms, sys: 14.7 ms, total: 163 ms
Wall time: 105 ms


'kecewanya <hashtag> johndoe </hashtag> filem dan ia <elongated> bernasib baik </elongated> <repeated> ! </repeated> <allcaps> disia </allcaps> <money> <repeated> . </repeated> <money> <hashtag> badmovie </hashtag> <annoyed>'

In [45]:
%%time
' '.join(preprocessing.process(string_3))

CPU times: user 130 ms, sys: 10.6 ms, total: 141 ms
Wall time: 97.9 ms


'<user> : boleh tidak tunggu untuk <date> <hashtag> sentimen </hashtag> ceramah ! <allcaps> <elongated> yaay </elongated> </allcaps> <repeated> ! </repeated> :-d <url>'