Run the code cell below to install the transformers and datasets libraries on your Colab instance (remove the leading `#` first if the line is commented out). PyTorch is already installed on Colab instances.

In [1]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,

Check that the installed Transformers version is at least 4.11.0.

In [2]:
import transformers

print(transformers.__version__)

4.25.1
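
If you would rather check the minimum version in code than by eye, the comparison can be sketched as below. The helper name `version_at_least` is hypothetical and introduced only for illustration, and the sketch assumes a plain dotted numeric version string such as `4.25.1` (no `dev` or `rc` suffix).

```python
def version_at_least(version: str, minimum: str) -> bool:
    """Compare dotted numeric versions such as '4.25.1' and '4.11.0'."""
    def as_tuple(v: str):
        # '4.25.1' -> (4, 25, 1), so the comparison is numeric rather than lexical.
        return tuple(int(part) for part in v.split("."))
    return as_tuple(version) >= as_tuple(minimum)


print(version_at_least("4.25.1", "4.11.0"))
print(version_at_least("4.9.2", "4.11.0"))
```

Comparing integer tuples avoids the trap of string comparison, where `"4.9.2"` would sort after `"4.11.0"`.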


# Fine-tuning a language model

In this tutorial, you will learn how to fine-tune a [🤗 Transformers](https://github.com/huggingface/transformers) model on a language modeling task. Language modeling comes in two main variants:

- Causal language modeling: the model has to predict the next token in the sentence. To keep the model from simply reading the answer off its input, an attention mask hides every token after position i when the model predicts token i+1.

![Widget inference representing the causal language modeling task](https://github.com/huggingface/notebooks/blob/main/examples/images/causal_language_modeling.png?raw=1)

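- Masked language modeling: the model has to recover the parts of the input that were replaced with the `[MASK]` token. It can read the whole sentence, so unlike causal language modeling it sees the tokens after the masked position as well as those before it.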

![Widget inference representing the masked language modeling task](https://github.com/huggingface/notebooks/blob/main/examples/images/masked_language_modeling.png?raw=1)
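
The causal attention mask from the first bullet can be pictured as a lower-triangular matrix. The following is a minimal pure-Python sketch for illustration, not the code that Transformers runs internally; the function name `causal_mask` is an assumption made here.

```python
def causal_mask(seq_len: int):
    """Entry [i][j] is 1 if position i may attend to position j, else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
```

Each row contains ones only up to its own position, so no position can look at the tokens that follow it. A masked language model would use a matrix of all ones instead.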

This session covers causal language modeling only. The original notebook also covers masked language modeling.

## Preparing the dataset

We use the Wikitext 2 dataset for training and load it with the Hugging Face Datasets library.

You can load the data for tomorrow's AI generation task in the same way. (Example code that includes this step will be provided tomorrow.)

In [3]:
from datasets import load_dataset
# Load the wikitext-2-raw-v1 subset of the wikitext dataset.
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

Downloading and preparing dataset wikitext/wikitext-2-raw-v1 to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...

Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.

Besides Wikitext 2, you can use any other dataset hosted on the [hub](https://huggingface.co/datasets).
Not all datasets share the same format, though, so the preprocessing has to be adapted to each dataset's format, its actual contents, and the task you have in mind.

To load a different dataset, uncomment the code below and change the dataset name.

In [4]:
# Load the English subset of the OSCAR dataset.
# dataset = load_dataset("oscar", "unshuffled_deduplicated_en") 

You can also load arbitrary .json or .csv files. That is somewhat outside today's topic, so we skip it here.

For details, see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files).

To access an example, index the dataset first by split name (for example train or test) and then by row index.

In [5]:
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

Run the code below to look at a few random samples from the dataset.

In [6]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [7]:
show_random_elements(datasets["train"])

Unnamed: 0,text
0,"Oldham ’ s museum and gallery service dates back to 1883 . Since then it has established itself as a cultural focus for Oldham and has developed one of the largest and most varied permanent collections in North West England . The current collection includes over 12 @,@ 000 social and industrial history items , more than 2 @,@ 000 works of art , about 1 @,@ 000 items of decorative art , more than 80 @,@ 000 natural history specimens , over 1 @,@ 000 geological specimens , about 3 @,@ 000 archaeological artefacts , 15 @,@ 000 photographs and a large number of books , pamphlets and documents . \n"
1,"Banai does not appear in the Malhari Mahatmya originating from the Brahmin ( high @-@ priest caste ) tradition , which glorifies Khandoba as Shiva and de @-@ emphasizes his earthly connections . In contrast , Banai occupies the central position in the Dhangar folk narrative and Mhalsa 's marriage to Khandoba is reduced to a passing mention ; Marathas and other settled castes give more importance to Mhalsa . \n"
2,"In Australia , Daydream was certified five @-@ times platinum by the Australian Recording Industry Association ( ARIA ) , denoting shipments of 350 @,@ 000 copies . The album finished ninth on the ARIA End of Year Charts in both 1995 and 1996 . In Japan , the album debuted at number one on the Oricon charts . According to the Oricon , Daydream made the top five of the best @-@ selling albums in Japan by a non @-@ Asian artist , with 2 @.@ 5 million copies sold . Daydream remains one of the best @-@ selling albums of all time , with sales of 25 million copies worldwide . \n"
3,"During the era of log floating , logjams sometimes occurred when logs struck an obstacle . Log rafts floating down the West Branch had to pass through chutes in canal dams . The rafts were commonly 28 feet ( 9 m ) wide — narrow enough to pass through the chutes — and 150 feet ( 46 m ) to 200 feet ( 61 m ) long . In 1874 , a large raft got wedged in the chute of the Dunnstown Dam and caused a jam that blocked the channel from bank to bank with a pile of logs 16 feet ( 5 m ) high . The jam eventually trapped another 200 log rafts , and 2 canal boats , The Mammoth of Newport and The Sarah Dunbar . \n"
4,"The relationship between John of Brienne and Hugh I of Cyprus was tense . Hugh ordered the imprisonment of John 's supporters in Cyprus , releasing them only at Pope Innocent 's command . During the War of the Antiochene Succession John sided with Bohemond IV of Antioch and the Templars against Raymond @-@ Roupen of Antioch and Leo I , King of Cilician Armenia , who were supported by Hugh and the Hospitallers . However , John sent only 50 knights to fight the Armenians in Antiochia in 1213 . Leo I concluded a peace treaty with the Knights Templar late that year , and he and John reconciled . John married Leo 's oldest daughter , Stephanie ( also known as Rita ) , in 1214 and Stephanie received a dowry of 30 @,@ 000 bezants . Quarrels among John , Leo I , Hugh I and Bohemond IV are documented by Pope Innocent 's letters urging them to reconcile their differences before the Fifth Crusade reached the Holy Land . \n"
5,"Some original costumes were found for the Klingons while others were made from patterns created by Robert Blackman . Greg Jein created new models of the Enterprise as well as Deep Space Station K7 and the Klingon cruiser , while 1 @,@ 400 tribbles were purchased from a company owned by Majel Barrett . Charlie Brill returned to Star Trek to appear once more as Arne Darvin , and Deidre L. Imershein was cast in part due to her being friends with one of the production crew members . Walter Koenig , who portrayed Ensign Pavel Chekov in The Original Series , showed the Deep Space Nine cast how to work the consoles on the Enterprise sets . \n"
6,
7,= = = Ottoman era = = = \n
8,"Gerard was appointed Lord Chancellor of England in 1085 , and was present at William I 's deathbed in 1087 . He continued as Chancellor to William Rufus until 1092 ; what precipitated his loss of office is unclear . He retained the king 's trust , for Rufus employed him in 1095 along with William Warelwast on a diplomatic mission to Pope Urban II regarding Archbishop Anselm receiving the pallium , the sign of an archbishop 's authority . Rufus offered to recognise Urban as pope rather than the Antipope Clement III in return for Anselm 's deposition and the delivery of Anselm 's pallium into Rufus ' custody , to dispose of as he saw fit . The mission departed for Rome in February 1095 and returned by Whitsun with a papal legate , Walter the Cardinal Bishop of Albano , who had Anselm 's pallium . The legate secured Rufus ' recognition of Urban , but subsequently refused to consider Anselm 's deposition . Rufus resigned himself to Anselm 's position as archbishop , and at the king 's court at Windsor he consented to Anselm being given the pallium . \n"
9,


As you can see, the samples vary: some are full paragraphs from Wikipedia articles, some are just section titles, and some are empty.

## Causal Language modeling

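Causal language modeling needs text chunks of a fixed length for training. To build them, we tokenize every text in the dataset and concatenate the results into one long sequence.
We then split that sequence into chunks of a predefined length. The resulting samples look like this: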
```
part of text 1
```
or 
```
end of text 1 [BOS_TOKEN] beginning of text 2
```

We could also train on each text separately, but here we lay all texts end to end and cut them at fixed intervals because that makes batching simpler.

In both cases, the training objective is to predict, for each token, the token that comes next.
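
The next-token objective can be made concrete by pairing every position in a chunk with the token that follows it. This is a minimal sketch under assumptions: the helper `next_token_pairs` is a hypothetical name, and the token IDs are arbitrary values chosen for illustration.

```python
def next_token_pairs(token_ids):
    """Pair each position with the token that follows it."""
    inputs = token_ids[:-1]   # every token except the last one
    targets = token_ids[1:]   # the same sequence shifted left by one
    return list(zip(inputs, targets))


chunk = [796, 569, 18354, 7496]  # token IDs chosen only for illustration
pairs = next_token_pairs(chunk)
print(pairs)
```

A chunk of n tokens therefore yields n-1 prediction targets, one for every position that has a successor.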

Here we use the [GPT-2](https://huggingface.co/gpt2) model. You can pick any other [causal language model](https://huggingface.co/models?filter=causal-lm) from the hub instead.

In [8]:
model_checkpoint = "gpt2"

To tokenize the training data with the same vocabulary mapping that was used when the model was pretrained, we download the pretrained tokenizer. The `AutoTokenizer` class takes care of this:

In [9]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


We can now apply the tokenizer to all of the loaded texts. The [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method of the Hugging Face Datasets library applies a preprocessing function to the whole dataset in one call.

Here we define a preprocessing function that tokenizes the input text with the tokenizer:

In [10]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

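The `Wikitext2` dataset is just a collection of plain text, so the function above is all we need. For other datasets, you can write a preprocessing function that picks out the fields you need and converts them into the required form.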

In [11]:
# As an example, this variant keeps the raw text in the output alongside the tokenizer fields.
# def tokenize_function_with_rawtext(examples):
#     return {**tokenizer(examples["text"]),"text":examples["text"]}

We then apply this preprocessing function to every split in the `datasets` object. Setting `batched=True` and `num_proc=4` speeds up processing.

If this step raises an error, reduce the number of processes or set `batched=False` and try again.

The raw text is no longer needed after tokenization, so we drop that column with `remove_columns=["text"]`.

In [12]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

        


The result shows that each text has been converted into the corresponding `input_ids`.

In [13]:
tokenized_datasets["train"][1]

{'input_ids': [796, 569, 18354, 7496, 17740, 6711, 796, 220, 198],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Once tokenization is done, we concatenate all the texts and split them into chunks of `block_size` tokens. We use the `map` method with `batched=True` once more for this.

With `batched=True`, the function may return a different number of examples than it received, so it can build an entirely new set of samples from each batch.

`block_size` is usually set to `tokenizer.model_max_length` (1024 for GPT-2), but limited GPU memory may make that impractical. (On a Colab notebook, setting it to `tokenizer.model_max_length` runs out of memory.)

If you still run into GPU memory issues, set `block_size` to 128 or an even smaller value and try again.

In [14]:
#block_size = tokenizer.model_max_length
block_size = 512

The function that builds the chunks is written as follows:

In [15]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # Use a copy of input_ids as labels. Causal LM models in the Transformers library shift the labels by one position internally when computing the loss, so no manual shift is needed here.
    result["labels"] = result["input_ids"].copy()
    return result

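The default value of the `batch_size` parameter of `map` is 1,000. This means that for every batch of 1,000 examples, the leftover tokens that do not fill a complete `block_size` chunk are discarded.
You can change `batch_size` if needed.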
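
To see the concatenate-and-split step and the dropped remainder on a small scale, here is a toy version of `group_texts` that takes `block_size` as a parameter. The name `group_texts_toy` and the input values are assumptions made for illustration; the logic mirrors the function above but runs on plain Python lists without the Datasets library.

```python
def group_texts_toy(examples, block_size):
    """Concatenate the lists under each key, drop the remainder, split into blocks."""
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Round down to a multiple of block_size; the leftover tokens are discarded.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


toy_batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}
grouped = group_texts_toy(toy_batch, block_size=4)
print(grouped["input_ids"])
```

Three input texts with ten tokens in total become two blocks of four tokens, and the last two tokens are dropped. This also shows why a batched `map` call can return a different number of rows than it received.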


In [16]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

        


Let's look at the transformed data. Each sample now consists of exactly `block_size` tokens, and a single sample may span several of the original texts.

In [17]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' depending on an individual player\'s approach : when one option is selected, the other is sealed off to the player. Outside missions, the player characters rest in a camp, where units can be customized and character growth occurs. Alongside the main story missions are character @-@ specific sub missions relating to different squad members. After the game\'s completion, additional episodes are unlocked, some of them having a higher difficulty than those found in the rest of the game. There are also love simulation elements related to the game\'s two main heroines, although they take a very minor role. \n The game\'s battle system, the BliTZ system, is carried over directly from Valkyira Chronicles. During missions, players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected, the player moves the character around the battlefield in third @-@ person. A character can only act once per @-@ turn, but characters can be granted multiple tur

With the data ready, we can set up the `Trainer` object.
Before that, we instantiate the causal LM model:

In [18]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)


We set the hyperparameters for training through `TrainingArguments`.

See the [API documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) for the full list of available arguments.

In [19]:
from transformers import Trainer, TrainingArguments

In [55]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=2,
    push_to_hub=False,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


We pass the model, the arguments, and the datasets to the `Trainer` when creating it.

In [58]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

Start training:

In [None]:
trainer.train()

***** Running training *****
  Num examples = 4651
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1164
  Number of trainable parameters = 124439808


Epoch,Training Loss,Validation Loss


After training finishes, you can evaluate the model with the perplexity metric as shown below:

In [23]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 481
  Batch size = 8


Perplexity: 23.21
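
Perplexity is the exponential of the average negative log-likelihood per token, which is why the cell above applies `math.exp` to `eval_loss`. The sketch below computes it directly from per-token probabilities; the function name `perplexity` and the probability values are assumptions made for illustration.

```python
import math


def perplexity(token_probs):
    """Exponentiated mean negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that assigns probability 1/4 to every observed token is as
# uncertain as a uniform choice among four options.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(round(uniform, 6))
```

Read in the other direction, the reported perplexity of 23.21 corresponds to an evaluation loss of roughly 3.14, since the loss is the natural logarithm of the perplexity.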


## Enhance Quality of Generated Text

This section describes decoding techniques that can improve (at least a little) the quality of the text generated by the model you just trained.

None of them modify the model's weights.

### Greedy Search

True to the basic formulation of language modeling, greedy search picks the token with the highest probability at every generation step.

In [None]:
# add the EOS token as PAD token to avoid warnings
# To use the pretrained model without fine-tuning, uncomment the line below.
# model =  AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [37]:
# encode context the generation is conditioned on
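# place the inputs on the same device as the model (GPU after training, CPU for a freshly loaded model)
input_ids = tokenizer.encode('What is AI?', return_tensors='pt').to(model.device)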

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
What is AI? 
 AI is a concept that is often used to describe the human mind. It is a concept that is often used to describe the human brain. It is a concept that is often used to describe the human brain. It is
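
The repetition in the output above is typical of greedy search: once the most likely continuation leads back to an earlier context, the same phrase repeats. The toy example below reproduces the effect with a hand-written next-token table. The table `NEXT_TOKEN_PROBS` and its probabilities are invented for illustration and do not come from GPT-2.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "AI": {"is": 0.9, "was": 0.1},
    "is": {"a": 0.6, "AI": 0.4},
    "a": {"concept": 0.7, "tool": 0.3},
    "concept": {"that": 0.8, ".": 0.2},
    "that": {"is": 0.9, "was": 0.1},
}


def greedy_decode(start, steps):
    """Always append the single most probable next token."""
    tokens = [start]
    for _ in range(steps):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # no known continuation for this token
        tokens.append(max(candidates, key=candidates.get))
    return tokens


greedy_tokens = greedy_decode("AI", 8)
print(" ".join(greedy_tokens))
```

Because the choice is deterministic, the decoder falls into the cycle "is a concept that" and never leaves it.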


### Beam Search
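Greedy search always commits to the locally best token, so it misses high-probability tokens that sit behind a lower-probability one. Beam search is one way to address this. It keeps the `num_beams` most likely partial sequences at each step and finally returns the one with the highest overall probability: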

In [38]:
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))



Output:
----------------------------------------------------------------------------------------------------
What is AI? 
 AI is an artificial intelligence ( AI ) technology that can be used to solve problems in a variety of ways. It can be used to solve problems in a variety of ways. It can be used to solve problems in a
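
The difference between greedy search and beam search can be seen on a tiny hand-made probability tree. The sketch below is a simplified beam search over a lookup table, not the implementation used by `model.generate`; the table `TOY_PROBS` and its values are assumptions made for illustration.

```python
import math

# Hypothetical conditional probabilities, keyed by the full prefix.
TOY_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def beam_search(prefix, steps, num_beams):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(0.0, tuple(prefix))]  # (log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            for word, p in TOY_PROBS.get(tokens, {}).items():
                candidates.append((score + math.log(p), tokens + (word,)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams


greedy = beam_search(["The"], steps=2, num_beams=1)[0]  # one beam equals greedy search
beam = beam_search(["The"], steps=2, num_beams=2)[0]
print(greedy[1], round(math.exp(greedy[0]), 2))
print(beam[1], round(math.exp(beam[0]), 2))
```

With a single beam the search commits to "nice" and ends at a sequence probability of 0.2. With two beams it keeps "dog" alive long enough to find "The dog has", whose probability is 0.36.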


In addition, let's generate again with an option that prevents any 2-gram from appearing more than once:

In [39]:
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))




Output:
----------------------------------------------------------------------------------------------------
What is AI? 
 AI is an artificial intelligence ( AI ) technology that can be used to solve problems in a variety of ways. AI has been used in many industries, including medicine, engineering, and mathematics. It has also been applied to
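
The idea behind `no_repeat_ngram_size=2` can be sketched as follows: before each step, find every token that would complete an n-gram already present in the generated text, and exclude it from the candidates. This is an illustrative pure-Python sketch, not the library's implementation, and the helper name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the start of the next n-gram.
    if ngram_size > 1:
        prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    else:
        prefix = ()
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = ["it", "can", "be", "used", "it"]
print(banned_next_tokens(history, ngram_size=2))
```

Here the text ends in "it" and the 2-gram "it can" has already appeared, so "can" is excluded at the next step. A blanket ban like this should be used with care, because it also blocks legitimate repetitions such as a name that occurs several times.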


You can also change the number of returned sequences. The top-scoring beams can be returned all at once:

Make sure that `num_return_sequences <= num_beams`.

In [40]:
# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))



Output:
----------------------------------------------------------------------------------------------------
0: What is AI? 
 AI is an artificial intelligence ( AI ) technology that can be used to solve problems in a variety of ways. AI has been used in many industries, including medicine, engineering, and mathematics. It has also been applied to
1: What is AI? 
 AI is an artificial intelligence ( AI ) technology that can be used to solve problems in a variety of ways. AI has been used in many industries, including medicine, engineering, and agriculture. It has also been applied to
2: What is AI? 
 AI is an artificial intelligence ( AI ) technology that can be used to solve problems in a variety of ways. AI has been used in many industries, including medicine, engineering, and mathematics. It has also been employed in
3: What is AI? 
 AI is an artificial intelligence ( AI ) technology that can be used to solve problems in a variety of ways. AI has been used in many industries, includin

### Sampling
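Sampling means randomly picking the next word according to a given probability distribution.

For example, at each step in the figure below we compute the conditional probability of every word and sample one word from that distribution.

![Sampling](https://huggingface.co/blog/assets/02_how-to-generate/sampling_search.png)

Unlike the previous methods, the word is no longer chosen deterministically in order of conditional probability.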
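
Sampling from a next-word distribution can be sketched with the standard library alone. The distribution `toy_probs` and the helper `sample_next_token` are assumptions made for illustration; a real model would produce the probabilities at every step.

```python
import random


def sample_next_token(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed for reproducibility
toy_probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}
draws = [sample_next_token(toy_probs, rng) for _ in range(10_000)]
counts = {t: draws.count(t) for t in toy_probs}
print(counts)
```

Over many draws the counts approach the underlying probabilities, but any single draw can land on the unlikely word "car", which is exactly what makes sampled text less predictable than greedy or beam output.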



In [48]:
# Fix the seed for reproducibility.
import torch
torch.manual_seed(42)

# Enable sampling, and set top_k to 0 to disable top-k sampling (described later).
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Output:
----------------------------------------------------------------------------------------------------
What is AI? What you learn that makes it better than other AIs is hard to answer because it depends on where you are at — all AIs have the same brain activity and each has different manners, personalities and opinions of what you want and


The drawback of plain sampling is that low-probability words can still be picked occasionally, so an out-of-place word may show up in the output.

The following techniques adjust the sampling distribution to reduce this risk.


### Temperature (of the [softmax](https://en.wikipedia.org/wiki/Softmax_function#Smooth_arg_max))

To make the probability distribution $P(w|w_{1:t-1})$ sharper, we lower the `temperature` of the softmax function.

The adjusted distribution looks like this:

![Softmax_temp](https://huggingface.co/blog/assets/02_how-to-generate/sampling_search_with_temp.png)
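
Temperature scaling divides the logits by a temperature $T$ before the softmax, so each probability becomes $\exp(z_i / T) / \sum_j \exp(z_j / T)$. The sketch below shows the effect on a small set of logits; the function name `softmax_with_temperature` and the logit values are assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
base = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.7)
print([round(p, 3) for p in base])
print([round(p, 3) for p in sharp])
```

A temperature below 1 moves probability mass toward the most likely word and away from the tail, while a temperature above 1 flattens the distribution. As the temperature approaches 0, sampling behaves more and more like greedy search.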




In [49]:
# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0, 
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Output:
----------------------------------------------------------------------------------------------------
What is AI? 
 AI is an emerging field of artificial intelligence that seeks to solve problems in human service. It has recently been described as " a technology that can think about in terms of the world around it and make decisions about where to go


### Top-K Sampling

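Like the beam search described above, top-K sampling only considers a limited number of the most likely candidates at each step.

Only the K most probable words are kept, and the next word is sampled from among them.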

![Top-K](https://huggingface.co/blog/assets/02_how-to-generate/top_k_sampling.png)
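
Top-K filtering can be sketched as keeping the K most probable words and renormalizing their probabilities so they sum to 1 again. The helper `top_k_filter` and the distribution `toy_probs` are hypothetical and serve only as an illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


toy_probs = {"nice": 0.5, "dog": 0.3, "car": 0.15, "zebra": 0.05}
filtered = top_k_filter(toy_probs, k=2)
print(filtered)
```

The next word is then sampled from the filtered distribution. Because K is fixed, the filter keeps the same number of candidates whether the distribution is sharply peaked or nearly flat.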

In [50]:
# set top_k to 50
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Output:
----------------------------------------------------------------------------------------------------
What is AI? Well, it is an artificial intelligence that can solve many types of problems without human knowledge. Here is a list. 
 AI is not necessarily a threat or a threat to human interests. Instead, it has developed methods to overcome


### Top-p (nucleus) sampling

Sample only from the smallest set of words whose cumulative probability reaches p.
![Top-p](https://huggingface.co/blog/assets/02_how-to-generate/top_p_sampling.png)
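
Nucleus filtering can be sketched as adding words in order of decreasing probability until the running total reaches p, then renormalizing the words that were kept. The helper `top_p_filter` and the distribution `toy_probs` are hypothetical and serve only as an illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


toy_probs = {"nice": 0.5, "dog": 0.3, "car": 0.15, "zebra": 0.05}
nucleus = top_p_filter(toy_probs, p=0.9)
print(nucleus)
```

Unlike a fixed K, the size of the kept set adapts to the distribution: a peaked distribution keeps only a few words, while a flat one keeps many.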

In [51]:
# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Output:
----------------------------------------------------------------------------------------------------
What is AI? How do you design your processes for you and your clients? What is your favorite emotions, does it interact with people and what lessons are your clients learning from your experiences? 
 
 One element of AI @-@ driving


Top-p and top-K sampling can also be combined.

In [53]:
# set top_k = 50, top_p = 0.95, and num_return_sequences = 5
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))




Output:
----------------------------------------------------------------------------------------------------
0: What is AI? How did it come about, and what did it accomplish to help us get where we are today?" Her recent book, Artificial Intelligence is a guidebook to understanding the foundations of intelligent machines, the technologies that will eventually enable them to
1: What is AI? When it was first suggested, many researchers thought AI was an evolutionary concept, but many different kinds of information technologies have been proposed. It was first suggested in 1978 by Professor William Kavlicek of Harvard University. The AI
2: What is AI? Is it real or do you think it is real?'He concluded. He compared it to the " artificial intelligence " which he had developed in the late 1950s. However, his book was a long and expensive affair, and
3: What is AI? This is a broad category, encompassing things such as artificial intelligence, AI and " the new " scientific paradigm in which h