# Zero-shot Classification HuggingFace

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/zeroshot-classification-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/zeroshot-classification-huggingface).
    
</div>

<div class="alert alert-info">

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.
    
</div>

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

In [2]:
import logging

logging.basicConfig(level=logging.INFO)

In [3]:
%%time
import malaya

CPU times: user 3.12 s, sys: 3.59 s, total: 6.72 s
Wall time: 2.17 s




### What is zero-shot classification

Commonly, we train a machine learning model with supervision on a fixed set of labels: negative / positive for sentiment, anger / happy / sadness for emotion, and so on. Such a model cannot tell us how 'jealous' a text is, because the only labels it supports are {anger, happy, sadness}. It has never seen a single example labelled 'jealous', so it has no output for that label. **Zero-shot classification tries to solve this problem.**

Zero-shot learning refers to the process by which a model learns to recognize classes (of images, text, or any other features) without any labeled training examples of those classes.

[Yin et al. (2019)](https://arxiv.org/abs/1909.00161) showed that a pretrained language model fine-tuned on a sentence-pair task, such as textual entailment or text similarity, can act as an out-of-the-box zero-shot text classifier.

So, we are going to use the transformer models from `malaya.similarity.semantic.huggingface` with a few tweaks.
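
One way to picture the sentence-pair approach: every candidate label is turned into a short hypothesis sentence by prepending a prefix, and the model then scores each (text, hypothesis) pair. The sketch below only builds those pairs. `build_pairs` is a hypothetical helper written for illustration, not part of Malaya, and the default prefix is simply borrowed from the `predict_proba` signature documented in this tutorial; how Malaya assembles its inputs internally is an assumption here.

```python
from typing import List, Tuple


def build_pairs(
    strings: List[str],
    labels: List[str],
    prefix: str = 'ayat ini berkaitan tentang',
) -> List[Tuple[str, str]]:
    """Hypothetical helper: pair every string with one hypothesis per label."""
    pairs = []
    for text in strings:
        for label in labels:
            # The hypothesis is just the prefix followed by the candidate label.
            pairs.append((text, f'{prefix} {label}'))
    return pairs


pairs = build_pairs(['tolong order foodpanda jab, lapar'], ['makanan', 'buku'])
for text, hypothesis in pairs:
    print(text, '->', hypothesis)
```

Each pair would then be scored independently by the sentence-pair model, which is why changing the prefix can change the results.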

### List available HuggingFace models

In [4]:
malaya.zero_shot.classification.available_huggingface()

INFO:malaya.similarity.semantic:tested on matched dev set translated MNLI, https://huggingface.co/datasets/mesolitica/translated-MNLI


| Model | Size (MB) | macro precision | macro recall | macro f1-score |
|---|---|---|---|---|
| mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased | 50.7 | 0.88756 | 0.887 | 0.88727 |
| mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased | 139.0 | 0.88756 | 0.887 | 0.88727 |
| mesolitica/finetune-mnli-t5-small-standard-bahasa-cased | 242.0 | 0.88756 | 0.887 | 0.88727 |
| mesolitica/finetune-mnli-t5-base-standard-bahasa-cased | 892.0 | 0.88756 | 0.887 | 0.88727 |


### Load HuggingFace model

```python
def huggingface(model: str = 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased', **kwargs):
    """
    Load HuggingFace model to zeroshot text classification.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')
        Check available models at `malaya.zero_shot.classification.available_huggingface()`.

    Returns
    -------
    result: malaya.torch_model.huggingface.ZeroShotClassification
    """
```

In [5]:
model = malaya.zero_shot.classification.huggingface()

#### predict batch

```python
def predict_proba(
    self,
    strings: List[str],
    labels: List[str],
    prefix: str = 'ayat ini berkaitan tentang'
):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings: List[str]
    labels: List[str]
    prefix: str
        prefix of labels to zero shot. Playing around with prefix can get better results.

    Returns
    -------
    list: List[Dict[str, float]]
    """
```

Because this is zero-shot classification, we need to supply the candidate labels to the model ourselves.

In [6]:
# copied from Twitter

string = 'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai'

In [7]:
model.predict_proba([string], labels = ['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])

[{'najib razak': 0.6651765,
  'mahathir': 0.987833,
  'kerajaan': 0.9912515,
  'PRU': 0.9841426,
  'anarki': 0.45587578}]

In [8]:
string = 'tolong order foodpanda jab, lapar'

In [9]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery'])

[{'makan': 0.9698464,
  'makanan': 0.9735605,
  'novel': 0.19823082,
  'buku': 0.00313239,
  'kerajaan': 0.12976034,
  'food delivery': 0.99331254}]

The model understood that `order foodpanda` is closely related to `makan`, `makanan` and `food delivery`.
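
Note that the scores in the output above do not sum to 1, so each label appears to be scored on its own rather than competing with the others. A minimal sketch of post-processing such a result follows, using the scores copied from that output. `rank_labels` is a hypothetical helper and the 0.5 cut-off is an arbitrary assumption, not a value recommended by Malaya.

```python
from typing import Dict, List, Tuple

# Scores copied from the predict_proba output above.
result = {
    'makan': 0.9698464,
    'makanan': 0.9735605,
    'novel': 0.19823082,
    'buku': 0.00313239,
    'kerajaan': 0.12976034,
    'food delivery': 0.99331254,
}


def rank_labels(
    scores: Dict[str, float], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Hypothetical helper: keep labels at or above threshold, highest first."""
    kept = [(label, p) for label, p in scores.items() if p >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


ranked = rank_labels(result)
print(ranked)
# [('food delivery', 0.99331254), ('makanan', 0.9735605), ('makan', 0.9698464)]
```

Because several labels can pass the threshold at once, this treats the task as multi-label; taking only `ranked[0]` would give a single best label instead.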

In [10]:
string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'

In [11]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat'])

[{'makan': 0.0004689095,
  'makanan': 0.0026079589,
  'novel': 0.29850212,
  'buku': 0.025044106,
  'kerajaan': 0.76523817,
  'food delivery': 0.0044676424,
  'kerajaan jahat': 0.0023713536,
  'kerajaan prihatin': 0.9468328,
  'bantuan rakyat': 0.9923975}]

### Stacking models

For more information, see https://malaya.readthedocs.io/en/latest/Stack.html

If you want to stack zero-shot classification models, you need to pass the labels as a keyword parameter:

```python
malaya.stack.predict_stack([model1, model2], List[str], labels = List[str])
```

`labels` is passed on to each model as `**kwargs`.

In [15]:
string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'
labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery', 
 'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat', 'comel', 'kerajaan syg sgt kepada rakyat']
malaya.stack.predict_stack([model, model, model], [string], 
                           labels = labels)

[{'makan': 0.00046890916,
  'makanan': 0.0026079628,
  'novel': 0.29850233,
  'buku': 0.02504399,
  'kerajaan': 0.7652382,
  'food delivery': 0.004467653,
  'kerajaan jahat': 0.0023713524,
  'kerajaan prihatin': 0.9468329,
  'bantuan rakyat': 0.99239755,
  'comel': 0.00077307917,
  'kerajaan syg sgt kepada rakyat': 0.9818335}]
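
The stacked scores above are almost identical to the single-model scores, which is what one would expect when the same model is stacked three times. The sketch below illustrates why, assuming the per-label scores are combined with a geometric mean; that aggregation is an assumption here, so check the Stack documentation linked above for the actual default. `gmean_stack` is a hypothetical helper, not Malaya's implementation.

```python
import math
from typing import Dict, List


def gmean_stack(predictions: List[Dict[str, float]]) -> Dict[str, float]:
    """Hypothetical helper: geometric mean of per-label scores across models."""
    n = len(predictions)
    labels = predictions[0].keys()
    return {
        # n-th root of the product of the n scores for this label
        label: math.prod(p[label] for p in predictions) ** (1.0 / n)
        for label in labels
    }


# Same prediction three times, mimicking [model, model, model].
single = {'kerajaan prihatin': 0.9468328, 'kerajaan jahat': 0.0023713536}
stacked = gmean_stack([single, single, single])
print(stacked)

# Two made-up predictions that disagree: the result sits between them.
mixed = gmean_stack([{'kerajaan': 0.9}, {'kerajaan': 0.4}])
print(mixed)
```

Under this assumption, stacking only changes the scores when the stacked models actually disagree, and a label that any one model scores near zero is pulled strongly towards zero.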