# Prefix Generator

Give the model an initial sentence (a prefix), and it will continue generating text from there.

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/example/prefix-generator](https://github.com/huseinzol05/Malaya/tree/master/example/prefix-generator).
    
</div>

In [1]:
%%time
import malaya
from pprint import pprint

CPU times: user 4.89 s, sys: 684 ms, total: 5.57 s
Wall time: 4.64 s


### Load GPT2

Malaya provides a pretrained GPT2 model specific to Malay, called GPT2-Bahasa. This interface does not support custom training.

GPT2-Bahasa was pretrained on ~1.2 billion words.

If you want to download the pretrained GPT2-Bahasa model and use it for custom transfer learning, it is available at https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/gpt2, along with some notebooks to help you get started.

#### List available GPT2

In [2]:
malaya.generator.available_gpt2()

INFO:root:calculate perplexity on never seen malay karangan.


| Model | Size (MB) | Quantized Size (MB) | Perplexity |
|-------|-----------|---------------------|------------|
| 117M  | 499.0     | 126.0               | 6.232461   |
| 345M  | 1420.0    | 357.0               | 6.104012   |
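
The perplexity figures are reported on unseen Malay essays (karangan), per the log line above. The source does not show how they were computed. As a minimal sketch, assuming the standard definition (the exponential of the mean negative log-likelihood per token), the calculation could look like this. The function name `perplexity` and the probabilities in `example_probs` are made up for illustration and are not taken from GPT2-Bahasa.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Return exp(-mean(log p_i)) over the probabilities assigned to each true token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities, for illustration only.
example_probs = [0.25, 0.5, 0.125, 0.25]
print(perplexity(example_probs))
```

Lower perplexity means the model assigned higher probability to the held-out text, which is consistent with the larger 345M model scoring slightly lower than 117M in the table.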


#### Load model

```python
def gpt2(model: str = '345M', quantized: bool = False, **kwargs):
    """
    Load GPT2 model to generate a string given a prefix string.

    Parameters
    ----------
    model : str, optional (default='345M')
        Model architecture supported. Allowed values:

        * ``'117M'`` - GPT2 117M parameters.
        * ``'345M'`` - GPT2 345M parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster; it depends on the machine.

    Returns
    -------
    result: malaya.model.tf.GPT2 class
    """
```

In [3]:
model = malaya.generator.gpt2(model = '117M')

INFO:root:running gpt2/117M using device /device:CPU:0


In [4]:
model_quantized = malaya.generator.gpt2(model = '117M', quantized = True)

INFO:root:running gpt2/117M-quantized using device /device:CPU:0


In [5]:
string = 'ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, '

#### generate

```python
"""
    Generate text given an initial string.

    Parameters
    ----------
    string : str
    maxlen : int, optional (default=256)
        length of sentence to generate.
    n_samples : int, optional (default=1)
        number of samples to return.
    temperature : float, optional (default=1.0)
        temperature value, should be between 0 and 1.
    top_k : int, optional (default=0)
        k for top-k sampling.
    top_p : float, optional (default=0.0)
        p for nucleus (top-p) sampling, should be between 0 and 1.
        if top_p == 0, will use top_k.
        if top_p == 0 and top_k == 0, use greedy decoder.

    Returns
    -------
    result: List[str]
    """
```
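
To make the interaction between `temperature`, `top_k` and `top_p` concrete, here is a minimal pure-Python sketch of choosing a single next token according to the rules in the docstring above. It is not Malaya's implementation: `sample_next_token` is a hypothetical helper, the logits are made up, and dividing the logits by the temperature is an assumption based on the usual GPT-2 convention.

```python
import math
import random
from typing import List, Optional


def sample_next_token(
    logits: List[float],
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 0.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick one token id from raw logits (hypothetical helper, not Malaya's code)."""
    rng = rng or random.Random()

    # Both filters disabled: greedy decoding, i.e. take the most likely token.
    if top_k == 0 and top_p == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax (assumed convention: logits / temperature).
    # A lower temperature sharpens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate ids sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_p > 0.0:
        # Nucleus (top-p): keep the smallest prefix whose mass reaches top_p.
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
    else:
        # top_p == 0: fall back to top-k and keep the k most likely tokens.
        kept = order[:top_k]

    # Sample among the kept tokens; random.choices renormalises the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Made-up logits for a three-token vocabulary.
print(sample_next_token([1.0, 3.0, 2.0], temperature=0.1, top_p=0.8))
```

With a low temperature such as 0.1, almost all probability mass sits on the top token, so the nucleus usually contains a single candidate and the output is close to greedy decoding.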

In [6]:
print(model.generate(string, temperature = 0.1))

["ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, ia adalah rancangan anak-anak yang aku bawa balik sampanye.\nSekali lagi aku hanya akan meminta dia jadi pembenci, dan memanggil supaya aku boleh berkata.\nKata-kata itu hanya diberikan pada adik aku; selepas itu pada aku terus terang, saudara-saudara aku hanya masu melihat betapa serngah aku mencipta naluri berkenaan.\nDia memakai kacamata untuk berjaga dan berwuduk dan menggesel dirinya hampir untuk berjalan.\nDia mempersembahkan rambutnya, daun berasal dari keduanya dan berselerak gila.\n'Tahi' seorang lelaki dunia pun tak berapa malah pakaiannya hanya menyapu ke bibir dan seluruh kulit aku menjadi panas.\nDengar, sop"]


In [7]:
print(model.generate(string, temperature = 0.1, top_p = 0.8))

['ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, iaitu cerita pertama yang menjadi viral, walaupun nama saya tidak kedengaran keras.\n"Aku pun memang tak rasa duduk di sini sambil mencari makanan, tengok kami tak boleh pergi mana-mana-mana ke sana (yang aku tidak mahu pergi).\nKalau nak buka bersama, aku tak boleh.\n"Tapi sebab aku tak nampak jelas, ramai orang menyebut perkataan yang kita tidak suka, mereka suka mengecam," katanya ketika ditemui pada majlis rasmi sebuah televisyen yang disiarkan secara langsung oleh Malaysiakini di Media, baru-baru ini.\nPada masa sama, Perdana Menteri, Tun Dr Mahathir Mohamad mengucapkan takziah atas kematian Adun Aman, Al-Ihsan dan Hajar Tahir, man']


### Using Babble method

We can also generate text in the style of GPT2 using Transformer-Bahasa models. Currently only BERT, ALBERT and ELECTRA are supported.

```python
def babble(
    string: str,
    model,
    generate_length: int = 30,
    leed_out_len: int = 1,
    temperature: float = 1.0,
    top_k: int = 100,
    burnin: int = 15,
    batch_size: int = 5,
):
    """
    Use pretrained transformer models to generate a string given a prefix string.
    https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094

    Parameters
    ----------
    string: str
    model: object
        transformer interface object. Currently only BERT, ALBERT and ELECTRA are supported.
    generate_length : int, optional (default=30)
        length of sentence to generate.
    leed_out_len : int, optional (default=1)
        length of extra masks for each iteration. 
    temperature: float, optional (default=1.0)
        logits * temperature.
    top_k: int, optional (default=100)
        k for top-k sampling.
    burnin: int, optional (default=15)
        for the first burnin steps, sample from the entire next word distribution, instead of top_k.
    batch_size: int, optional (default=5)
        number of sentences to generate.

    Returns
    -------
    result: List[str]
    """
```
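
The parameters above describe a loop that fills in one masked position at a time. The following is a simplified, left-to-right sketch of that idea and not Malaya's actual implementation: `babble_sketch`, `toy_predict` and the `[MASK]` string are hypothetical, temperature and batching are omitted, and `leed_out_len` is treated as the number of additional masks after the position being filled, which is only one reading of the docstring.

```python
import random
from typing import Callable, Dict, List, Optional

MASK = "[MASK]"  # hypothetical mask token


def babble_sketch(
    prefix: List[str],
    predict: Callable[[List[str], int], Dict[str, float]],
    generate_length: int = 5,
    leed_out_len: int = 1,
    top_k: int = 2,
    burnin: int = 1,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Extend `prefix` one token at a time using a masked-LM style predictor."""
    rng = rng or random.Random()
    tokens = list(prefix)
    for step in range(generate_length):
        # One mask for the position to fill, plus extra masks as right context.
        context = tokens + [MASK] * (1 + leed_out_len)
        probs = predict(context, len(tokens))
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        # During burn-in, sample from the whole distribution; afterwards top-k only.
        if step >= burnin:
            ranked = ranked[:top_k]
        words = [word for word, _ in ranked]
        weights = [weight for _, weight in ranked]
        tokens.append(rng.choices(words, weights=weights, k=1)[0])
    return tokens


def toy_predict(context: List[str], position: int) -> Dict[str, float]:
    # Hypothetical stand-in for a masked language model: a fixed distribution.
    return {"aku": 0.5, "baca": 0.3, "cerita": 0.2}


print(babble_sketch(["tetiba", "aku"], toy_predict, generate_length=4))
```

In a real setting `predict` would run the transformer on the masked context and return the distribution at `position`; the fixed distribution here only keeps the sketch runnable.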

Make sure `tensorflow-probability` is already installed:

```bash
pip3 install tensorflow-probability==0.7.0
```

In [10]:
# !pip3 install tensorflow-probability==0.7.0

In [3]:
electra = malaya.transformer.load(model = 'electra')


INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt


In [11]:
malaya.generator.babble(string, electra)

['ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , terseksa juga hidup di sekeliling aku . Diorang tak tahu sebab diorang tahu titik hitam yang mana kita tengok dari mana kita sendiri nampak cerita ke . Haih .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , tengah baca benda besar pasal bumbung bilik . Rasanya sejuk macam pulau harapan . So aku baca cerita seram pelik . Jadi sedih juga dengar cerita seram seram ni .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , lalu ibu ambil pusing bagi buku sejarah . Dah baca marsh pastu aku dah buat thread seram , ada dalam masa terdekat baru bangun . Sedih , hidup lagi',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , mesti seram sampai aku ikut takdir Allah bagi betul2 aib kita kembali menulis mengenai kisah cinta ak

### ngrams

You can generate n-grams easily using this interface:

```python
def ngrams(
    sequence,
    n: int,
    pad_left = False,
    pad_right = False,
    left_pad_symbol = None,
    right_pad_symbol = None,
):
    """
    generate ngrams.

    Parameters
    ----------
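    sequence : List[str]
        list of tokenized words.
    n : int
        ngram size.
    pad_left : bool, optional (default=False)
        if True, pad the start of the sequence with left_pad_symbol.
    pad_right : bool, optional (default=False)
        if True, pad the end of the sequence with right_pad_symbol.
    left_pad_symbol : optional (default=None)
        symbol used for left padding.
    right_pad_symbol : optional (default=None)
        symbol used for right padding.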

    Returns
    -------
    ngram: list
    """
```

In [6]:
string = 'saya suka makan ayam'

list(malaya.generator.ngrams(string.split(), n = 2))

[('saya', 'suka'), ('suka', 'makan'), ('makan', 'ayam')]

In [7]:
list(malaya.generator.ngrams(string.split(), n = 2, pad_left = True, pad_right = True))

[(None, 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', None)]

In [8]:
list(malaya.generator.ngrams(string.split(), n = 2, pad_left = True, pad_right = True,
                            left_pad_symbol = 'START'))

[('START', 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', None)]

In [8]:
list(malaya.generator.ngrams(string.split(), n = 2, pad_left = True, pad_right = True,
                            left_pad_symbol = 'START', right_pad_symbol = 'END'))

[('START', 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', 'END')]
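
The outputs above can be reproduced with a short pure-Python sketch of a sliding window over the padded token list. `ngrams_sketch` is a hypothetical name, and the assumption that each enabled side receives `n - 1` padding symbols follows the convention used by NLTK's `ngrams`; it matches the bigram outputs shown here, but the source does not confirm it for larger `n`.

```python
from typing import Any, Iterator, Sequence, Tuple


def ngrams_sketch(
    sequence: Sequence[Any],
    n: int,
    pad_left: bool = False,
    pad_right: bool = False,
    left_pad_symbol: Any = None,
    right_pad_symbol: Any = None,
) -> Iterator[Tuple[Any, ...]]:
    """Yield n-grams from `sequence`, optionally padding either end."""
    items = list(sequence)
    # Assumption: pad with n - 1 symbols on each enabled side.
    if pad_left:
        items = [left_pad_symbol] * (n - 1) + items
    if pad_right:
        items = items + [right_pad_symbol] * (n - 1)
    # Slide a window of width n across the (possibly padded) tokens.
    for start in range(len(items) - n + 1):
        yield tuple(items[start:start + n])


tokens = 'saya suka makan ayam'.split()
print(list(ngrams_sketch(tokens, n = 2, pad_left = True, pad_right = True,
                         left_pad_symbol = 'START', right_pad_symbol = 'END')))
```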