## CS310 Natural Language Processing
## Lab 10: Explore Natural Language Generation

In [1]:
import torch
import random

### T1. Explore Pretrained GPT-2 Model

In this task, you will explore the GPT-2 model using the `transformers` library.

Just like in the previous lab, you will need to download the pretrained model and unzip it to `./gpt2zh`. Note that this is not the original version of GPT-2 provided by OpenAI (https://huggingface.co/openai-community/gpt2), but rather a fine-tuned version for Chinese text generation.

In [2]:
from transformers import AutoTokenizer, GPT2LMHeadModel

gpt2_tokenizer = AutoTokenizer.from_pretrained("./gpt2zh")
gpt2_model = GPT2LMHeadModel.from_pretrained("./gpt2zh")
# Evaluation mode
gpt2_model.eval()

print('vocab size:', gpt2_tokenizer.vocab_size)
print(f'special token {gpt2_tokenizer.sep_token}:', gpt2_tokenizer.sep_token_id)
print(f'special token {gpt2_tokenizer.cls_token}:', gpt2_tokenizer.cls_token_id)
print(f'special token {gpt2_tokenizer.pad_token}:', gpt2_tokenizer.pad_token_id)

# Use [SEP] as end-of-sentence token
gpt2_model.config.eos_token_id = gpt2_tokenizer.sep_token_id



vocab size: 21128
special token [SEP]: 102
special token [CLS]: 101
special token [PAD]: 0


The tokenizer can return the token IDs and the attention mask that indicates which tokens are padding tokens (`1` for real tokens, `0` for padding tokens).

Since we only have one sentence in the "batch", there is no padding used, and thus no `0` in the attention mask.
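To see where a `0` would appear, here is a minimal pure-Python sketch of padding a two-sentence batch. The helper `pad_batch` is hypothetical and only mimics the kind of output a tokenizer produces when padding is enabled; right-padding is an assumption, the pad ID `0` matches the `[PAD]` ID printed above, and the token IDs are illustrative.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every ID list to the longest length and build the matching mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Two "sentences" of different lengths (illustrative token IDs)
batch = [[101, 2110, 5445, 3198, 739, 102], [101, 2110, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The shorter sentence is the only one that receives `0` entries in its mask, so the model can ignore those positions.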

In [3]:
input_text = '学而时习之，不亦说乎！'
input_encoded = gpt2_tokenizer(input_text, return_tensors="pt")

print('input ids:', input_encoded['input_ids'])
print('input attention mask:', input_encoded['attention_mask'])

# Map token ids back to tokens
print('input tokens:', gpt2_tokenizer.convert_ids_to_tokens(input_encoded['input_ids'][0]))

input ids: tensor([[ 101, 2110, 5445, 3198,  739,  722, 8024,  679,  771, 6432,  725, 8013,
          102]])
input attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
input tokens: ['[CLS]', '学', '而', '时', '习', '之', '，', '不', '亦', '说', '乎', '！', '[SEP]']


It's easy to directly use the `generate` method to generate some sentences:

In [4]:
input_text = "子曰：人"
input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
n_outputs = 5

output = gpt2_model.generate(**input_encoded, 
                                 max_length=20, 
                                 num_return_sequences=n_outputs,
                                 do_sample=True, 
                                 top_k=50, 
                                 top_p=0.95, 
                                 temperature=0.7,
                                 pad_token_id=0,
                                 )
# print(type(output))
# print(output.shape)

for i in range(n_outputs):
    output_text = gpt2_tokenizer.decode(output[i], skip_special_tokens=True)
    print(output_text)

子 曰 ： 人 民 之 所 以 为 国 ， 民 之 所 以 为 民 ， 民 之
子 曰 ： 人 皆 谤 谤 谤 谤 谤 谤 谤 谤 谤 谤 谤 谤 谤 谤 谤
子 曰 ： 人 之 所 以 为 天 ， 人 之 所 以 为 地 ， 地 之 所
子 曰 ： 人 民 不 应 该 向 他 们 求 助 ， 因 为 人 民 是 天
子 曰 ： 人 道 之 善 ， 而 不 可 为 之 。 （ 原 文 ）


The output is far from perfect: even with sampling enabled, the model can still fall into long repetitions, as in the second sample.
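The `temperature=0.7` argument above rescales the logits before sampling. As a rough illustration (a pure-Python sketch with made-up logits, not the library's internal code), dividing the logits by a temperature below 1 sharpens the softmax distribution, while a temperature above 1 flattens it. The helper name `softmax_with_temperature` is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, subtracting the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up values for three candidate tokens
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

The lower the temperature, the more probability mass the top candidate receives, which moves sampling closer to greedy search.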

---

### T2. Implement Top-k Sampling Algorithms Manually

Let's first try greedy search, i.e., top-1 sampling.

In [6]:
input_text = "今天天气"
input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
print('input size:', input_encoded.input_ids.shape[1])

output = gpt2_model(input_encoded.input_ids, 
                    attention_mask=input_encoded.attention_mask)
logits = output.logits
print(logits.shape)

### START YOUR CODE ###
# Get the logits (unnormalized scores) predicted at the last token's position
last_token_logits = logits[0, -1, :]

# Get the most likely token id; argmax over logits equals argmax over softmax probabilities
most_likely_token_id = torch.argmax(last_token_logits).item()
### END YOUR CODE ###

# Convert the token id to a token
most_likely_token = gpt2_tokenizer.convert_ids_to_tokens(most_likely_token_id)
print(most_likely_token)

# You should expect to see the following output:
# input size: 4
# torch.Size([1, 4, 21128])
# 预

input size: 4
torch.Size([1, 4, 21128])
预


Once the code above works, you can implement the full generation loop: at each iteration, select the most likely token, append it to the end of the input, and feed the new input to the model to predict the next token.

The loop continues until `max_gen_len` is reached, or a `"[SEP]"` token is generated.

**Note**: 
- Use `torch.cat` to append elements to input IDs
- The `attn_mask` also needs to be updated at each iteration.
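The loop described above can be sketched without the model. In this minimal pure-Python version, `TOY_SCORES` is a hypothetical stand-in for the model (the next-token scores depend only on the last token), `SEP_ID` stands in for the `[SEP]` ID, and plain lists replace tensors, so `append` plays the role of `torch.cat`.

```python
SEP_ID = 0  # stands in for the [SEP] token ID

# Hypothetical toy "model": scores for the next token depend only on the last token.
# Each entry holds the scores over a 4-token vocabulary given the last token.
TOY_SCORES = {
    1: [0.1, 0.2, 2.0, 0.3],  # after token 1, token 2 is most likely
    2: [0.1, 0.2, 0.3, 2.0],  # after token 2, token 3 is most likely
    3: [2.0, 0.2, 0.3, 0.1],  # after token 3, SEP is most likely
}


def greedy_generate(prompt_ids, max_gen_len=10):
    input_ids = list(prompt_ids)
    attn_mask = [1] * len(input_ids)
    for _ in range(max_gen_len):
        scores = TOY_SCORES[input_ids[-1]]
        # Greedy choice: index of the highest score
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == SEP_ID:
            break  # stop before appending the end-of-sentence token
        input_ids.append(next_id)
        attn_mask.append(1)  # every generated token is a real token
    return input_ids, attn_mask


ids, mask = greedy_generate([1])
print(ids, mask)
```

Because the choice is deterministic, changing the scores after token 3 so that token 1 wins would make the output cycle through 1, 2, 3 until `max_gen_len` is reached.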

In [7]:
max_gen_len = 50

input_text = "今天天气"
input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
input_ids = input_encoded.input_ids
attn_mask = input_encoded.attention_mask

count = 0
while count < max_gen_len:
    output = gpt2_model(input_ids, attention_mask=attn_mask)
    logits = output.logits

    ### START YOUR CODE ###
    last_token_logits = logits[0, -1, :]
    sampled_token_id = torch.argmax(last_token_logits).item()
    if sampled_token_id == gpt2_tokenizer.sep_token_id:
        break
    input_ids = torch.cat([input_ids, torch.tensor([[sampled_token_id]], dtype=torch.long)], dim=1) 
    attn_mask = torch.cat([attn_mask, torch.tensor([[1]], dtype=torch.long)], dim=1)
    ### END YOUR CODE ###

    count += 1


# Test
special_token_ids = set([gpt2_tokenizer.sep_token_id, 
                         gpt2_tokenizer.cls_token_id, 
                         gpt2_tokenizer.pad_token_id,
                         100]) # 100 for [UNK]

# Decode the generated tokens ids
for i in range(input_ids.shape[1]):
    tok_id = input_ids[0, i].item()
    # Skip the special tokens
    if tok_id not in special_token_ids:
        print(gpt2_tokenizer.convert_ids_to_tokens(input_ids[0, i].item()), end='')

# You should expect to see the following output:
# 今天天气预报：今天白天，我市阴天有小雨，气温：小雨转多云，气温：小雨转多云，气温：小雨转多云，气温：小雨转多

今天天气预报：今天白天，我市阴天有小雨，气温：小雨转多云，气温：小雨转多云，气温：小雨转多云，气温：小雨转多

As you can see, greedy search results in very repetitive text.

Now, let's implement a `top-k` sampling algorithm.

The idea is to **uniformly** sample from the top-k most likely next tokens. PyTorch tensors provide a `topk` method that returns the top-k values and their indices.
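As a rough picture of what a top-k selection returns, the pure-Python sketch below picks the `k` largest values and their positions from a made-up list of logits. The list and the helper name `top_k` are illustrative only; this is not how PyTorch implements it.

```python
import heapq


def top_k(values, k):
    """Return (top values, their indices), sorted from largest to smallest."""
    top = heapq.nlargest(k, enumerate(values), key=lambda pair: pair[1])
    indices = [i for i, _ in top]
    top_values = [v for _, v in top]
    return top_values, indices


toy_logits = [0.3, 2.5, -1.0, 1.7, 0.9, 2.1]  # made-up logits for a 6-token vocabulary
values, indices = top_k(toy_logits, k=3)
print(values)   # the three largest logits
print(indices)  # their positions, i.e. the candidate token IDs
```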

In the following example, you can check the top 5 most likely words following the sentence "今天天气":

In [8]:
k = 5
input_text = "今天天气"
input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
input_ids = input_encoded.input_ids
attn_mask = input_encoded.attention_mask

output = gpt2_model(input_ids, attention_mask=attn_mask)
logits = output.logits
### START YOUR CODE ###
last_token_logits = logits[0, -1, :]
top_k_logits, top_k_indices = torch.topk(last_token_logits, k)
### END YOUR CODE ###

# Test
print(top_k_logits)
print(top_k_indices)

for i in range(k):
    tok_id = top_k_indices[i].item()
    print(gpt2_tokenizer.convert_ids_to_tokens(tok_id), end=' ')

# You should expect to see the following output:
# tensor([7.8924, 7.8550, 7.5893, 7.3502, 7.3069], grad_fn=<TopkBackward0>)
# tensor([7564, 2523,  679, 1962, 6820])
# 预 很 不 好 还 

tensor([7.8924, 7.8550, 7.5893, 7.3502, 7.3069], grad_fn=<TopkBackward0>)
tensor([7564, 2523,  679, 1962, 6820])
预 很 不 好 还 

Next, let's integrate the top-k sampling algorithm into the generation process. The uniform sampling can be implemented by calling `random.choices` on the top-k indices.
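A small pure-Python sketch of that uniform draw, using made-up candidate IDs: `random.choices` without a `weights` argument ignores how likely the model considers each candidate, so every candidate is picked about equally often.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

top_k_ids = [11, 22, 33, 44, 55]  # made-up top-k token IDs, best candidate first

n_draws = 10_000
# No weights are passed, so the model's scores play no role in the draw
counts = Counter(random.choices(top_k_ids, k=n_draws))
for tok_id in top_k_ids:
    print(tok_id, counts[tok_id] / n_draws)
```

Each frequency should come out close to `1 / k`, regardless of the candidate's rank.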

In [9]:
def generate_topk(input_text, k=5, max_gen_len=50):
    input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
    input_ids = input_encoded.input_ids
    attn_mask = input_encoded.attention_mask

    count = 0
    while count < max_gen_len:
        output = gpt2_model(input_ids, attention_mask=attn_mask)
        logits = output.logits

        ### START YOUR CODE ###
        last_token_logits = logits[0, -1, :]
        top_k_logits, top_k_indices = torch.topk(last_token_logits, k)
        sampled_token_id = random.choices(top_k_indices, k=1)[0].item()
        if sampled_token_id == gpt2_tokenizer.sep_token_id:
            break
        input_ids = torch.cat([input_ids, torch.tensor([[sampled_token_id]], dtype=torch.long)], dim=1)
        attn_mask = torch.cat([attn_mask, torch.tensor([[1]], dtype=torch.long)], dim=1)
        
        ### END YOUR CODE ###

        count += 1
    
    special_token_ids = set([gpt2_tokenizer.sep_token_id, 
                         gpt2_tokenizer.cls_token_id, 
                         gpt2_tokenizer.pad_token_id,
                         100]) # 100 for [UNK]
    
    generated_text = ''
    for i in range(input_ids.shape[1]):
        tok_id = input_ids[0, i].item()
        if tok_id not in special_token_ids:
            generated_text += gpt2_tokenizer.convert_ids_to_tokens(tok_id)
    
    return generated_text

In [10]:
# Test
input_text = "今天天气"
print(generate_topk(input_text, k=50))

input_text = "子曰：人"
print(generate_topk(input_text, k=50))

今天天气有不佳风的状的关云山不用谢一只会跑过日式风味，比不鸟好喝到哭有逼感！▼「牛羊河的早晚黑米粥店-这些
子曰：人应在于对恶事事成有评分（比法之成见得不止五处是错处.我对过往）及大概想的只能去猜得了如不思反感以免在


Although uniform top-k sampling removes the repetition, it produces *extremely incoherent* text, because a low-probability candidate is as likely to be picked as the best one. We can remedy this by sampling each of the top-k tokens in proportion to its probability instead of uniformly.

There are several ways to implement proportional sampling. You can either:
- Create a list of cumulative relative probabilities of the top-k tokens. For instance, if the relative probabilities of $k=5$ tokens are $0.1$, $0.2$, $0.5$, $0.1$, and $0.1$, then your cumulative probability list is `cum_prob = [0.1, 0.3, 0.8, 0.9, 1.0]`. Then draw a random number $r$ from the uniform distribution on $[0,1)$ with `random.random()`, and pick the token whose bin of `cum_prob` contains $r$.
- Or use the `torch.multinomial()` function to perform the same kind of sampling. *Note* that the input weights passed to `torch.multinomial` should be the relative probabilities of the top $k$ tokens, which can be obtained by applying softmax to the top-k logits.
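The first option can be sketched in pure Python with the example numbers above. `sample_from_bins` is a hypothetical helper name; `itertools.accumulate` builds `cum_prob`, and `bisect.bisect_left` finds the first bin whose upper edge is at least $r$.

```python
import bisect
import itertools
import random


def sample_from_bins(probs, r):
    """Return the index of the first cumulative bin whose upper edge is >= r."""
    cum_prob = list(itertools.accumulate(probs))
    idx = bisect.bisect_left(cum_prob, r)
    # Guard against rounding that leaves the last edge slightly below 1.0
    return min(idx, len(probs) - 1)


probs = [0.1, 0.2, 0.5, 0.1, 0.1]  # relative probabilities of the k = 5 tokens
print(sample_from_bins(probs, 0.05))  # falls into the first bin
print(sample_from_bins(probs, 0.45))  # falls into the third bin
print(sample_from_bins(probs, 0.95))  # falls into the last bin

# Drawing r at random picks each token in proportion to its probability
random.seed(0)
draws = [sample_from_bins(probs, random.random()) for _ in range(10_000)]
print([round(draws.count(i) / len(draws), 2) for i in range(len(probs))])
```

The empirical frequencies in the last line should track `probs`, which is exactly what distinguishes proportional sampling from the uniform draw.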

In [27]:
import torch.nn.functional as F  

def generate_topk_prop(input_text, k=50, max_gen_len=50):
    input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
    input_ids = input_encoded.input_ids
    attn_mask = input_encoded.attention_mask

    count = 0
    while count < max_gen_len:
        output = gpt2_model(input_ids, attention_mask=attn_mask)
        logits = output.logits

        ### START YOUR CODE ###
        top_k_logits, top_k_indices = torch.topk(logits[0, -1, :], k)
        top_k_probabilities = F.softmax(top_k_logits, dim=0)
        cum_probabilities = torch.cumsum(top_k_probabilities, dim=0)
        random_value = random.random()
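        # Index of the first bin whose cumulative probability reaches random_value
        hits = (cum_probabilities >= random_value).nonzero()
        # If float rounding leaves the last cumulative value just below 1.0, use the last bin
        sampled_token_idx = hits[0].item() if hits.numel() > 0 else k - 1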
        sampled_token_id = top_k_indices[sampled_token_idx].item()
        if sampled_token_id == gpt2_tokenizer.sep_token_id:
            break
        
        input_ids = torch.cat([input_ids, torch.tensor([[sampled_token_id]], dtype=torch.long)], dim=1)
        attn_mask = torch.cat([attn_mask, torch.tensor([[1]], dtype=torch.long)], dim=1)
       
        ### END YOUR CODE ###

        count += 1
    
    special_token_ids = set([gpt2_tokenizer.sep_token_id, 
                         gpt2_tokenizer.cls_token_id, 
                         gpt2_tokenizer.pad_token_id,
                         100]) # 100 for [UNK]
    
    generated_text = ''
    for i in range(input_ids.shape[1]):
        tok_id = input_ids[0, i].item()
        if tok_id not in special_token_ids:
            generated_text += gpt2_tokenizer.convert_ids_to_tokens(tok_id)
    
    return generated_text

In [28]:
# Test
input_text = "今天天气"
print(generate_topk_prop(input_text, k=50))

input_text = "子曰：人"
print(generate_topk_prop(input_text, k=50))

今天天气预报是这样的：小风吹着，空气湿度加大，但是我们的心里却有一种别样的感受：它能让天地交错，风一吹，就会
子曰：人生如此，顺理成章。无论如何，你的生命是美丽的，而不是单一的、模糊的其实，这句话的意思在于，它是


Do you think the proportional sampling produces better text?

Have fun sampling! :)