# Classification

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/zeroshot-classification](https://github.com/huseinzol05/Malaya/tree/master/example/zeroshot-classification).
    
</div>

<div class="alert alert-info">

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.
    
</div>

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

In [2]:
%%time
import malaya

CPU times: user 3.03 s, sys: 2.5 s, total: 5.53 s
Wall time: 2.74 s



### What is zero-shot classification

We usually train a supervised machine learning model on a fixed set of labels, such as negative / positive for sentiment, or anger / happy / sadness for emotion. Such a model cannot tell us how much 'jealous' a text expresses, because its supported labels are only {anger, happy, sadness}. Asking it to identify a label it has never seen, such as 'jealous', is impossible. **Zero-shot classification tries to solve this problem.**

Zero-shot learning refers to the process by which a machine learns to recognize objects (image, text, any features) without any labeled training data to help in the classification.

[Yin et al. (2019)](https://arxiv.org/abs/1909.00161) stated in their paper that any pretrained language model finetuned on text similarity can act as an out-of-the-box zero-shot text classifier.

So, we are going to use transformer models from `malaya.similarity.semantic.huggingface` with a few tweaks.
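The idea of reusing a text-similarity model can be sketched by turning every candidate label into a short hypothesis sentence and pairing it with the input text. The sketch below only builds those pairs. `build_pairs` is a hypothetical helper, not part of Malaya, and the prefix is the default of `predict_proba` documented later in this tutorial. How the library builds its inputs internally is an assumption here.

```python
from typing import List, Tuple


def build_pairs(
    text: str,
    labels: List[str],
    prefix: str = 'ayat ini berkaitan tentang',
) -> List[Tuple[str, str]]:
    """Pair the input text with one hypothesis sentence per candidate label."""
    # Hypothetical helper: each label becomes "<prefix> <label>".
    return [(text, f'{prefix} {label}') for label in labels]


pairs = build_pairs('tolong order foodpanda jab, lapar', ['makan', 'novel'])
for premise, hypothesis in pairs:
    print(premise, '->', hypothesis)
```

A similarity model would then score each pair, and the score of a pair is read as the score of its label.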

### List available HuggingFace models

In [3]:
malaya.zero_shot.classification.available_huggingface

{'mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased': {'Size (MB)': 50.7,
  'macro precision': 0.74562,
  'macro recall': 0.74574,
  'macro f1-score': 0.74501},
 'mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased': {'Size (MB)': 139,
  'macro precision': 0.76584,
  'macro recall': 0.76565,
  'macro f1-score': 0.76542},
 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased': {'Size (MB)': 242,
  'macro precision': 0.78067,
  'macro recall': 0.78063,
  'macro f1-score': 0.7801},
 'mesolitica/finetune-mnli-t5-base-standard-bahasa-cased': {'Size (MB)': 892,
  'macro precision': 0.78903,
  'macro recall': 0.79064,
  'macro f1-score': 0.78918}}
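Since the listing is a plain dictionary, a model can be chosen programmatically. The following is a minimal sketch that picks the highest macro F1 under a size budget. The dictionary is copied from the output above, and `best_model_within` is a hypothetical helper, not part of Malaya.

```python
from typing import Dict, Optional

# Values copied from the listing above.
available = {
    'mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased': {
        'Size (MB)': 50.7, 'macro f1-score': 0.74501},
    'mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased': {
        'Size (MB)': 139, 'macro f1-score': 0.76542},
    'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased': {
        'Size (MB)': 242, 'macro f1-score': 0.7801},
    'mesolitica/finetune-mnli-t5-base-standard-bahasa-cased': {
        'Size (MB)': 892, 'macro f1-score': 0.78918},
}


def best_model_within(models: Dict[str, dict], max_size_mb: float) -> Optional[str]:
    """Return the model with the best macro F1 whose size fits the budget."""
    candidates = {
        name: info for name, info in models.items()
        if info['Size (MB)'] <= max_size_mb
    }
    if not candidates:
        return None
    return max(candidates, key=lambda name: candidates[name]['macro f1-score'])


chosen = best_model_within(available, max_size_mb=250)
print(chosen)
```

The chosen name can then be passed as the `model` argument of the loader described in the next section.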

### Load HuggingFace model

```python
def huggingface(
    model: str = 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to zeroshot text classification.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')
        Check available models at `malaya.zero_shot.classification.available_huggingface()`.
    force_check: bool, optional (default=True)
        Force check model one of malaya model.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.huggingface.ZeroShotClassification
    """
```

In [4]:
model = malaya.zero_shot.classification.huggingface()


### Predict batch

```python
def predict_proba(
    self,
    strings: List[str],
    labels: List[str],
    prefix: str = 'ayat ini berkaitan tentang',
    multilabel: bool = True,
):
    """
    Classify a list of strings and return the probability of each label.

    Parameters
    ----------
    strings: List[str]
    labels: List[str]
    prefix: str, optional (default='ayat ini berkaitan tentang')
        prefix of labels to zero shot. Playing around with prefix can get better results.
    multilabel: bool, optional (default=True)
        if True, labels are scored independently, so the probabilities can sum to more than 1.0.

    Returns
    -------
    result: List[Dict[str, float]]
    """
```

Because this is zero-shot classification, we need to supply the candidate labels to the model.

In [5]:
# copied from Twitter

string = 'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai'

In [6]:
model.predict_proba([string], labels = ['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])



[{'najib razak': 0.47769544,
  'mahathir': 0.49602416,
  'kerajaan': 0.49770266,
  'PRU': 0.5020965,
  'anarki': 0.47935393}]
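The result is a list with one dictionary per input string, so ranking the labels is a plain sort. Here is a minimal sketch using the values printed above. `top_labels` is a hypothetical helper, not part of Malaya.

```python
from typing import Dict, List, Tuple


def top_labels(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, probability) pairs."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Output copied from the cell above.
result = [{'najib razak': 0.47769544,
           'mahathir': 0.49602416,
           'kerajaan': 0.49770266,
           'PRU': 0.5020965,
           'anarki': 0.47935393}]

ranked = top_labels(result[0], k=2)
print(ranked)
```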

In [7]:
string = 'tolong order foodpanda jab, lapar'

In [8]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery'])

[{'makan': 0.49923217,
  'makanan': 0.50025105,
  'novel': 0.50996864,
  'buku': 0.5179709,
  'kerajaan': 0.52829444,
  'food delivery': 0.5014325}]

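Ideally the model should relate `order foodpanda` to `makan`, `makanan` and `food delivery`. In the output above, however, every score is close to 0.5 and `kerajaan` scores highest, so the separation between labels is weak for this input.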

In [9]:
string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'

In [10]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat'])

[{'makan': 0.4649984,
  'makanan': 0.4640362,
  'novel': 0.513372,
  'buku': 0.50357056,
  'kerajaan': 0.52359533,
  'food delivery': 0.49170837,
  'kerajaan jahat': 0.51742524,
  'kerajaan prihatin': 0.5301894,
  'bantuan rakyat': 0.5329738}]

### Able to infer for mixed MS and EN

In [11]:
string = 'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:'

In [12]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',
                                       'biskut', 'very helpful', 'sharing experiences',
                                       'sharing session'])

[{'makan': 0.446774,
  'makanan': 0.44988176,
  'novel': 0.46197286,
  'buku': 0.46333978,
  'kerajaan': 0.47140214,
  'food delivery': 0.4580575,
  'kerajaan jahat': 0.45878497,
  'kerajaan prihatin': 0.46423724,
  'bantuan rakyat': 0.44994086,
  'biskut': 0.4561887,
  'very helpful': 0.4278154,
  'sharing experiences': 0.458337,
  'sharing session': 0.4593555}]

In [13]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',
                                       'biskut', 'very helpful', 'sharing experiences',
                                       'sharing session'],
                   prefix = 'teks ini berkaitan tentang')

[{'makan': 0.44845203,
  'makanan': 0.45382816,
  'novel': 0.46462435,
  'buku': 0.4645531,
  'kerajaan': 0.47166649,
  'food delivery': 0.460771,
  'kerajaan jahat': 0.46060804,
  'kerajaan prihatin': 0.46640876,
  'bantuan rakyat': 0.4555358,
  'biskut': 0.45959476,
  'very helpful': 0.4333261,
  'sharing experiences': 0.4627727,
  'sharing session': 0.4624747}]
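To see how much a prefix change moves each score, the two outputs above can be compared label by label. This sketch copies three values from each output. `score_deltas` is a hypothetical helper, not part of Malaya.

```python
from typing import Dict


def score_deltas(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-label change in probability between two runs."""
    return {label: after[label] - before[label] for label in before if label in after}


# Three labels copied from the two outputs above.
ayat = {'biskut': 0.4561887, 'sharing experiences': 0.458337, 'kerajaan': 0.47140214}
teks = {'biskut': 0.45959476, 'sharing experiences': 0.4627727, 'kerajaan': 0.47166649}

deltas = score_deltas(ayat, teks)
for label, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f'{label}: {delta:+.5f}')
```

For these three labels the shifts are all below 0.005, so the prefix change has only a small effect on this input.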

### Multiclass but not multilabel

To make the probabilities sum to 1.0, set `multilabel=False`.

In [14]:
string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'

model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',
                                       'biskut', 'very helpful', 'sharing experiences',
                                       'sharing session'], multilabel = False)

[{'makan': 0.07241088,
  'makanan': 0.07458564,
  'novel': 0.07798509,
  'buku': 0.07774425,
  'kerajaan': 0.07856386,
  'food delivery': 0.07746716,
  'kerajaan jahat': 0.076677,
  'kerajaan prihatin': 0.07793205,
  'bantuan rakyat': 0.08176315,
  'biskut': 0.07559921,
  'very helpful': 0.07805643,
  'sharing experiences': 0.07687527,
  'sharing session': 0.07434013}]
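The difference between the two modes can be sketched with plain Python. This assumes that multilabel scoring squashes each label's raw score independently, while `multilabel=False` normalizes across labels with a softmax. The exact computation inside Malaya is not shown in this tutorial, and the raw scores below are made up for illustration.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw scores, one per label (assumption, not model output).
raw: Dict[str, float] = {'kerajaan': 0.2, 'makan': -0.1, 'buku': 0.0}

# Multilabel style: each label is scored on its own, so the sum is unconstrained.
independent = {label: sigmoid(score) for label, score in raw.items()}

# Multiclass style: labels compete, so the probabilities sum to 1.0.
normalized = dict(zip(raw, softmax(list(raw.values()))))

print(sum(independent.values()))
print(sum(normalized.values()))
```

With many labels and similar raw scores, the normalized probabilities all end up close to `1 / len(labels)`, which matches the pattern in the output above.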

### Stacking models

For more information, see https://malaya.readthedocs.io/en/latest/Stack.html

If you want to stack zero-shot classification models, you need to pass the labels as a keyword parameter:

```python
malaya.stack.predict_stack([model1, model2], List[str], labels = List[str])
```

`labels` is passed through to each model as `**kwargs`.

In [15]:
string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'
labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery', 
 'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat', 'comel', 'kerajaan syg sgt kepada rakyat']
malaya.stack.predict_stack([model, model, model], [string], 
                           labels = labels)

[{'makan': 0.46499833,
  'makanan': 0.4640362,
  'novel': 0.513372,
  'buku': 0.50357056,
  'kerajaan': 0.52359533,
  'food delivery': 0.49170837,
  'kerajaan jahat': 0.51742524,
  'kerajaan prihatin': 0.53018934,
  'bantuan rakyat': 0.53297377,
  'comel': 0.49191207,
  'kerajaan syg sgt kepada rakyat': 0.5212374}]
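Stacking the same model three times returns almost exactly the single-model scores, which is what an averaging-style aggregate would produce. The sketch below assumes a geometric mean across models. The aggregate that `predict_stack` actually uses is documented on the Stack page linked above, and `gmean_stack` is a hypothetical helper, not part of Malaya.

```python
import math
from typing import Dict, List


def gmean_stack(predictions: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-model label probabilities with a geometric mean."""
    count = len(predictions)
    return {
        label: math.exp(sum(math.log(p[label]) for p in predictions) / count)
        for label in predictions[0]
    }


# Two values copied from the single-model output earlier in this tutorial.
single = {'kerajaan': 0.52359533, 'makan': 0.4649984}

stacked = gmean_stack([single, single, single])
print(stacked)
```

Stacking identical predictions reproduces the inputs up to floating-point error, so stacking is only useful when the models differ.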