# Language Detection

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/language-detection](https://github.com/huseinzol05/Malaya/tree/master/example/language-detection).
    
</div>

<div class="alert alert-info">

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.
    
</div>

In [1]:
%%time
import malaya
import fasttext

CPU times: user 3.13 s, sys: 2.83 s, total: 5.96 s
Wall time: 2.71 s




### Labels supported

Default labels for the language detection module, listed with per-label precision, recall and f1-score for each model.

In [2]:
for k, v in malaya.language_detection.metrics.items():
    print(k, v)

deep-model 
              precision    recall  f1-score   support

         eng    0.96760   0.97401   0.97080    553739
         ind    0.97635   0.96131   0.96877    576059
       malay    0.96985   0.98498   0.97736   1800649
    manglish    0.98036   0.96569   0.97297    181442
       other    0.99641   0.99627   0.99634   1428083
       rojak    0.94221   0.84302   0.88986    189678

    accuracy                        0.97779   4729650
   macro avg    0.97213   0.95421   0.96268   4729650
weighted avg    0.97769   0.97779   0.97760   4729650

mesolitica/fasttext-language-detection-v1 
              precision    recall  f1-score   support

         eng    0.94014   0.96750   0.95362    553739
         ind    0.97290   0.97316   0.97303    576059
       malay    0.98674   0.95262   0.96938   1800649
    manglish    0.96595   0.98417   0.97498    181442
       other    0.98454   0.99698   0.99072   1428083
       rojak    0.81149   0.91650   0.86080    189678

    accuracy          
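
As a quick sanity check on the reports above, the `f1-score` column can be recomputed from the `precision` and `recall` columns. This is a minimal sketch that assumes the report follows the usual classification-report convention, where F1 is the harmonic mean of precision and recall. The helper name `f1_score` is only illustrative, and the numbers are copied from the `deep-model` rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the deep-model report.
rows = {
    'eng': (0.96760, 0.97401, 0.97080),
    'rojak': (0.94221, 0.84302, 0.88986),
}

for label, (p, r, reported) in rows.items():
    # Small differences are expected because the inputs are already rounded.
    print(label, f1_score(p, r), reported)
```

The gap between precision and recall for `rojak` is what pulls its F1 below the other labels.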

Different models support different languages.

## List available language detection models

In [3]:
malaya.language_detection.available_fasttext

{'mesolitica/fasttext-language-detection-v1': {'Size (MB)': 353,
  'Quantized Size (MB)': 31.1,
  'dim': 16,
  'Label': {0: 'eng',
   1: 'ind',
   2: 'malay',
   3: 'manglish',
   4: 'other',
   5: 'rojak'}},
 'mesolitica/fasttext-language-detection-v2': {'Size (MB)': 1840,
  'Quantized Size (MB)': 227,
  'dim': 16,
  'Label': {0: 'standard-english',
   1: 'local-english',
   2: 'manglish',
   3: 'standard-indonesian',
   4: 'socialmedia-indonesian',
   5: 'standard-malay',
   6: 'local-malay',
   7: 'standard-mandarin',
   8: 'local-mandarin',
   9: 'other'}},
 'mesolitica/fasttext-language-detection-ms-id': {'Size (MB)': 537,
  'Quantized Size (MB)': 62.5,
  'dim': 16,
  'Label': {0: 'standard-indonesian',
   1: 'socialmedia-indonesian',
   2: 'standard-malay',
   3: 'local-malay',
   4: 'other'}},
 'mesolitica/fasttext-language-detection-bahasa-en': {'Size (MB)': 537,
  'Quantized Size (MB)': 62.5,
  'dim': 16,
  'Label': {0: 'bahasa', 1: 'english', 2: 'other'}},
 'mesolitica/fastte
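
Because each model ships its own label set, it can help to look up which models know a given label before choosing one. The sketch below works on a plain dictionary shaped like the registry printed above. The `available` dictionary is an abbreviated, hand-copied subset (only the `'Label'` field, three models), and `models_with_label` is a hypothetical helper, not part of Malaya.

```python
from typing import Dict, List

# Abbreviated copy of the registry output above; only 'Label' is kept.
available: Dict[str, dict] = {
    'mesolitica/fasttext-language-detection-v1': {
        'Label': {0: 'eng', 1: 'ind', 2: 'malay',
                  3: 'manglish', 4: 'other', 5: 'rojak'},
    },
    'mesolitica/fasttext-language-detection-ms-id': {
        'Label': {0: 'standard-indonesian', 1: 'socialmedia-indonesian',
                  2: 'standard-malay', 3: 'local-malay', 4: 'other'},
    },
    'mesolitica/fasttext-language-detection-bahasa-en': {
        'Label': {0: 'bahasa', 1: 'english', 2: 'other'},
    },
}


def models_with_label(registry: Dict[str, dict], label: str) -> List[str]:
    """Return the names of models whose label set contains `label`."""
    return [
        name for name, info in registry.items()
        if label in info['Label'].values()
    ]


print(models_with_label(available, 'local-malay'))
print(models_with_label(available, 'other'))
```

With the real registry, the same function could be called on `malaya.language_detection.available_fasttext` directly, assuming it keeps the structure shown above.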

In [4]:
chinese_text = '今天是６月１８号，也是Muiriel的生日！'
english_text = 'i totally love it man'
indon_text = 'menjabat saleh perombakan menjabat periode komisi energi fraksi partai pengurus partai periode periode partai terpilih periode menjabat komisi perdagangan investasi persatuan periode'
malay_text = 'beliau berkata program Inisitif Peduli Rakyat (IPR) yang diperkenalkan oleh kerajaan negeri Selangor lebih besar sumbangannya'
socialmedia_malay_text = 'nti aku tengok dulu tiket dari kl pukul berapa ada nahh'
socialmedia_indon_text = 'saking kangen papanya pas vc anakku nangis'
rojak_text = 'jadi aku tadi bikin ini gengs dan dijual haha salad only k dan haha drinks only k'
manglish_text = 'power lah even shopback come to edmw riao'

### Load fastText model

Make sure fastText is already installed; if not, simply run:

```bash
pip install fasttext
```

```python
def fasttext(quantized: bool = True, **kwargs):

    """
    Load Fasttext language detection model.
    Original size is 353MB, Quantized size 31.1MB.
    
    Parameters
    ----------
    quantized: bool, optional (default=True)
        if True, load quantized fasttext model. Else, load original fasttext model.

    Returns
    -------
    result : malaya.model.ml.LanguageDetection class
    """
```

In this example, I compare Malaya's model with the pretrained fastText language identification model from Facebook: https://fasttext.cc/docs/en/language-identification.html

Simply download the pretrained model:

```bash
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
```

In [5]:
# !wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz

In [22]:
model = fasttext.load_model('lid.176.ftz')
fast_text = malaya.language_detection.fasttext()

**Language detection in Malaya does not try to cover every language in the world; it focuses on hyperlocal languages.**

In [7]:
model.predict(['suka makan ayam dan daging'])

([['__label__id']], [array([0.6334154], dtype=float32)])

In [8]:
fast_text.predict_proba(['suka makan ayam dan daging'])

[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.50445783,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]

In [9]:
model.predict(malay_text)

(('__label__ms',), array([0.57101035]))
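
The Facebook model returns labels with a `__label__` prefix, while Malaya returns plain label names, so comparing the two side by side is easier once the prefix is removed. This is a minimal sketch; `strip_label_prefix` is a hypothetical helper, and the sample tuple is copied from the output above, with the score array written as a plain list.

```python
PREFIX = '__label__'


def strip_label_prefix(label: str) -> str:
    """Remove the fastText '__label__' prefix if it is present."""
    return label[len(PREFIX):] if label.startswith(PREFIX) else label


# Shape of a single-string prediction: (labels, scores).
labels, scores = (('__label__ms',), [0.57101035])

pairs = [
    (strip_label_prefix(label), float(score))
    for label, score in zip(labels, scores)
]
print(pairs)
```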

In [10]:
fast_text.predict_proba([malay_text])

[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.9099521,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]

In [11]:
model.predict(socialmedia_malay_text)

(('__label__id',), array([0.7870034]))

In [12]:
fast_text.predict_proba([socialmedia_malay_text])

[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.0,
  'local-malay': 0.9976433,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]

In [13]:
model.predict(socialmedia_indon_text)

(('__label__fr',), array([0.2912012]))

In [14]:
fast_text.predict_proba([socialmedia_indon_text])

[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 1.00003,
  'standard-malay': 0.0,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]

In [15]:
model.predict(rojak_text)

(('__label__id',), array([0.87948251]))

In [16]:
fast_text.predict_proba([rojak_text])

[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.0,
  'local-malay': 0.9569701,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]

In [17]:
model.predict(manglish_text)

(('__label__en',), array([0.89707506]))

In [18]:
fast_text.predict_proba([manglish_text])

[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.99997073,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.0,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]

In [19]:
model.predict(chinese_text)

(('__label__zh',), array([0.97311586]))

In [20]:
fast_text.predict_proba([chinese_text])

[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.0,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.5823944,
  'other': 0.0}]

In [21]:
fast_text.predict_proba([indon_text,malay_text])

[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.9755073,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.0,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0},
 {'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.9099521,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]
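
`predict_proba` returns one dictionary per input text, so a common follow-up is to reduce each dictionary to a single label. The sketch below picks the highest-scoring label and returns `None` when the score falls below a cutoff. The helper `top_label` and the `0.5` threshold are assumptions for illustration, not part of Malaya, and the sample dictionaries are abbreviated from the output above.

```python
from typing import Dict, Optional, Tuple


def top_label(
    probs: Dict[str, float], threshold: float = 0.5
) -> Tuple[Optional[str], float]:
    """Return (label, score) for the best label, or (None, score) if too low."""
    label, score = max(probs.items(), key=lambda item: item[1])
    if score < threshold:
        return None, score
    return label, score


# Abbreviated copies of the two dictionaries printed above.
outputs = [
    {'standard-indonesian': 0.9755073, 'standard-malay': 0.0, 'other': 0.0},
    {'standard-indonesian': 0.0, 'standard-malay': 0.9099521, 'other': 0.0},
]

results = [top_label(probs) for probs in outputs]
print(results)
```

Returning `None` below the threshold keeps low-confidence predictions, such as the `0.50445783` score seen earlier for a short sentence, easy to flag for manual review if a stricter cutoff is chosen.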