<img src="https://www.e4ds.com/news_photo/U77C53G6CP8ASEHUJ5B7.png">

# Encoder Task

In [1]:
from transformers import FillMaskPipeline, AutoModelForMaskedLM, AutoTokenizer, BertTokenizer 

model_ckpt = 'bert-base-uncased'

unmasker = FillMaskPipeline(
    tokenizer = AutoTokenizer.from_pretrained(model_ckpt),
    model = AutoModelForMaskedLM.from_pretrained(model_ckpt)
)

unmasker('Hello, Mr. Bert! How is it [MASK]')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


[{'score': 0.9879509806632996,
  'token': 1029,
  'token_str': '?',
  'sequence': 'hello, mr. bert! how is it?'},
 {'score': 0.011153294704854488,
  'token': 999,
  'token_str': '!',
  'sequence': 'hello, mr. bert! how is it!'},
 {'score': 0.0007006392115727067,
  'token': 1012,
  'token_str': '.',
  'sequence': 'hello, mr. bert! how is it.'},
 {'score': 0.00018348416779190302,
  'token': 1025,
  'token_str': ';',
  'sequence': 'hello, mr. bert! how is it ;'},
 {'score': 5.2711493481183425e-06,
  'token': 2133,
  'token_str': '...',
  'sequence': 'hello, mr. bert! how is it...'}]

## 1. The Token Prediction Task Head

Transformer encoder models such as BERT are base models that produce `contextual embeddings`. On their own, they only encode the meaning of each token in its context and do not perform any task directly.

To solve a downstream task, a task-specific head (for example, a classifier or a regressor) must be added on top of the BERT base model.

### Masked Language Modeling 
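- The pretraining objective: some tokens are masked, and the model predicts the original token at each masked position.
- The head applies a Linear layer followed by a softmax over the vocabulary (one score per vocabulary entry) at every token position.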
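
The per-token "Linear + softmax" head can be sketched in plain Python. The sizes here (hidden size 3, a vocabulary of 4 tokens) and all weights are toy values chosen for illustration, not BERT's real parameters. The real head in `transformers` also applies a small transform before the final projection, which is omitted in this sketch.

```python
import math

def linear(x, weight, bias):
    """Compute weight @ x + bias for a single vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and toy head parameters (assumptions for illustration only).
vocab = ["it", "yes", "no", "!"]
weight = [
    [1.0, 0.0, 0.0],     # row for "it"
    [0.0, 1.0, 0.0],     # row for "yes"
    [0.0, 0.0, 1.0],     # row for "no"
    [-1.0, -1.0, -1.0],  # row for "!"
]
bias = [0.0, 0.0, 0.0, 0.0]

# One contextual embedding per input token: shape (seq_len, hidden_size).
hidden_states = [
    [0.1, 0.2, 0.3],
    [2.0, 0.5, 0.1],  # pretend this is the [MASK] position
]

# The head is applied independently at every position.
probs_per_token = [softmax(linear(h, weight, bias)) for h in hidden_states]

mask_idx = 1
mask_probs = probs_per_token[mask_idx]
best = max(range(len(vocab)), key=lambda i: mask_probs[i])
print(vocab[best], round(mask_probs[best], 3))
```

Each position gets its own probability distribution over the whole vocabulary; only the distribution at the mask position is used to fill in the blank.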

In [None]:
from transformers import BertTokenizer, BertModel, FillMaskPipeline, AutoModelForMaskedLM
import torch

class MyFillMaskModel(FillMaskPipeline):
    def __init__(self):
        super().__init__(
            tokenizer = BertTokenizer.from_pretrained("bert-base-uncased"),
            model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").to('cuda')
        )

    def __call__(self, string):
        
        input_tensors = self.preprocess(string)
        input_tensors = input_tensors.to('cuda')
        # Locate the [MASK] position (token id 103 in bert-base-uncased)
        mask_idx = (input_tensors['input_ids'] == self.tokenizer.mask_token_id).nonzero()[0][1].item()
        print("\nStatistics From Input:")
        print(" > Input Indices:", input_tensors['input_ids'])
        print(" > Input Decoding:", [self.tokenizer.decode(token_index) for token_index in input_tensors['input_ids'][0]])
        print(" > Mask Index:", mask_idx)

        ## Step through the model's forward pass manually: embeddings -> encoder -> MLM head
        inputs_1 = {'input_ids' : input_tensors['input_ids']}
        inputs_2 = {'attention_mask' : input_tensors['attention_mask'].bool()}
        embed_out = self.model.bert.embeddings.forward(**inputs_1)
        bert_out = self.model.bert.encoder.forward(embed_out, **inputs_2)['last_hidden_state']
        y = self.model.cls.forward(bert_out)
        print("\nStatistics From Forward Pass:")
        print("> Input Into BERT Encoder:", embed_out.shape)
        print("> Input Into Classifier:  ", bert_out.shape)
        print("> Output From Classifier: ", y.shape) # last dimension is BERT's vocab_size

        ## The following statistics are generic outputs from the BERT differentiable pipeline
        pdfs = torch.softmax(y[0], -1)
        print("\nStatistics From BERT Output:")
        print(" > Most-Likely Index:", torch.tensor([torch.argmax(pdf).item() for pdf in pdfs]))
        print(" > Most-Likely Probs:", torch.tensor([torch.max(pdf).item() for pdf in pdfs]))
        # Decode the highest-probability predicted token at each position
        print(" > Most Likely Words:", [self.tokenizer.decode(torch.argmax(pdf).item()) for pdf in pdfs])

        k = 5
        mask_top_probs = torch.topk(pdfs[mask_idx], k) # topk returns the k largest values and their indices
        mask_best_words = [self.tokenizer.decode(index) for index in mask_top_probs.indices]
        print(f"\nStatistics From Postprocessing (Top {k}):")
        print(" > Most Likely Mask Index:", mask_top_probs.indices)
        print(" > Most Likely Mask Probs:", mask_top_probs.values.detach())
        print(" > Most Likely Mask Words:", mask_best_words, "\n")

        # Move the inputs to the CPU before postprocess converts them to NumPy
        output = self.postprocess({**input_tensors.to('cpu'), 'logits' : y})
        return output


unmasker = MyFillMaskModel()
unmasker("Say [MASK]!")[0]




Statistics From Input:
 > Input Indices: tensor([[ 101, 2360,  103,  999,  102]], device='cuda:0')
 > Input Decoding: ['[CLS]', 'say', '[MASK]', '!', '[SEP]']
 > Mask Index: 2

Statistics From Forward Pass:
> Input Into BERT Encoder: torch.Size([1, 5, 768])
> Input Into Classifier:   torch.Size([1, 5, 768])
> Output From Classifier:  torch.Size([1, 5, 30522])

Statistics From BERT Output:
 > Most-Likely Index: tensor([1012, 2360, 2009,  999, 1012])
 > Most-Likely Probs: tensor([0.0358, 0.8322, 0.3312, 0.9999, 0.9998])
 > Most Likely Words: ['.', 'say', 'it', '!', '.']

Statistics From Postprocessing (Top 5):
 > Most Likely Mask Index: tensor([2009, 2748, 2242, 2053, 7592], device='cuda:0')
 > Most Likely Mask Probs: tensor([0.3312, 0.1745, 0.1557, 0.0509, 0.0452], device='cuda:0')
 > Most Likely Mask Words: ['it', 'yes', 'something', 'no', 'hello'] 



{'score': 0.33123713731765747,
 'token': 2009,
 'token_str': 'it',
 'sequence': 'say it!'}

## 2. SQuAD (Stanford Question Answering Dataset)

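SQuAD is a widely used Extractive Question Answering dataset in natural language processing.

- Format
    - `context`: a single paragraph
    - `question`: the question
    - `answer`: a **span** that appears verbatim in the paragraph (`context`)

- Output: the model predicts the start and end positions of the answer within `context` and returns that text span.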

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Pre-trained RoBERTa model + QA head 
model_name = "deepset/roberta-base-squad2"

nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
nlp(QA_input)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Device set to use cuda:0


{'score': 0.21171413362026215,
 'start': 59,
 'end': 84,
 'answer': 'gives freedom to the user'}
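
The `start` and `end` values in this output are character offsets into `context`, so slicing the context with them recovers the answer. A quick check using the values printed above (the `result` dict here is just a hand-copied subset of that output):

```python
context = (
    "The option to convert models between FARM and transformers gives freedom "
    "to the user and let people easily switch between frameworks."
)

# Values copied from the pipeline output above.
result = {"start": 59, "end": 84, "answer": "gives freedom to the user"}

# Python slicing is end-exclusive, which matches the pipeline's convention.
span = context[result["start"]:result["end"]]
print(span)
assert span == result["answer"]
```

This is what "extractive" means in practice: the answer is never generated, only located inside the given paragraph.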

In [19]:
nlp.model

RobertaForQuestionAnswering(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              

In [None]:
nlp.model.qa_outputs
# The two out_features correspond to start and end:
# each token's 768-dimensional vector is mapped to a start score and an end score,
# and the span that best answers the question is selected from those scores.

Linear(in_features=768, out_features=2, bias=True)
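
Because `qa_outputs` maps each token's vector to two numbers, the model ends up with one start logit and one end logit per token. A minimal sketch of turning those into a span, assuming made-up logits for a five-token input and a simple exhaustive search. The tokens, logits, `best_span` helper, and `max_answer_len` value are illustrative assumptions; the real pipeline does additional work, such as excluding question tokens and mapping tokens back to character offsets.

```python
# Hypothetical per-token outputs of a 2-unit linear layer: [start_logit, end_logit].
tokens = ["gives", "freedom", "to", "the", "user"]
token_logits = [
    [4.0, -1.0],
    [0.5, 0.2],
    [-1.0, -0.5],
    [-0.5, 0.0],
    [-2.0, 3.5],
]

# Split the two output features into separate start and end score lists.
start_logits = [pair[0] for pair in token_logits]
end_logits = [pair[1] for pair in token_logits]

def best_span(start_logits, end_logits, max_answer_len=10):
    """Return (start, end) maximizing start_logit + end_logit with start <= end."""
    best, best_score = None, float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)
```

The constraint `start <= end` matters: taking the two argmaxes independently could produce an end position that comes before the start.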

## 3. RoBERTa Sentiment Classifier code

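This approach places a classifier head on top of BERT or another encoder-based model to solve a downstream task. Unlike SQuAD, **sentiment classification** looks at the whole sentence and predicts a single label, such as positive, negative, or neutral, and the model is fine-tuned for that task.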

RoBERTa classifier source code: https://github.com/huggingface/transformers/blob/f26099e7b5cf579f99a42bab6ddd371bf2c8d548/src/transformers/models/roberta/modeling_roberta.py#L1510

In [20]:
from transformers import AutoModelForSequenceClassification

emo_model = pipeline('sentiment-analysis', 'SamLowe/roberta-base-go_emotions')

print(emo_model("I love my old pillow?"))
print(emo_model("Why is it that every plant I touch dies within a few days?"))
print(emo_model("I'm so conflicted about these new instructions..."))

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cuda:0


[{'label': 'love', 'score': 0.9212924242019653}]
[{'label': 'curiosity', 'score': 0.38752278685569763}]
[{'label': 'confusion', 'score': 0.7724317312240601}]


In [21]:
emo_model.model.classifier

RobertaClassificationHead(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (out_proj): Linear(in_features=768, out_features=28, bias=True)
)
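
The printed head is a small feed-forward network: `dense` (768 → 768), dropout, then `out_proj` (768 → 28, one output per emotion label). In the linked RoBERTa source it is applied to the representation of the first token, with a `tanh` between the two linear layers. A scaled-down sketch in plain Python, assuming toy dimensions, toy weights, and a hypothetical three-label subset (dropout is left out because it is inactive at inference time):

```python
import math

def linear(x, weight, bias):
    """Compute weight @ x + bias for a single vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

# Toy sizes: hidden size 2 instead of 768, 3 labels instead of 28.
labels = ["love", "curiosity", "confusion"]  # hypothetical subset of labels
dense_w, dense_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
out_w, out_b = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]

# Encoder output for one sentence: one vector per token; index 0 is the first token.
sequence_output = [[1.5, -0.5], [0.3, 0.9], [-0.2, 0.4]]

x = sequence_output[0]                                    # first-token representation
x = [math.tanh(v) for v in linear(x, dense_w, dense_b)]   # dense + tanh
logits = linear(x, out_w, out_b)                          # one logit per label

pred = labels[max(range(len(labels)), key=lambda i: logits[i])]
print(pred)
```

Unlike the token-level heads above, this head produces a single vector of label scores for the whole input sequence.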

## 4. Zero-shot Classification

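1. Zero-shot learning
    - The model solves a task using its general knowledge even though the training data contained no examples of that task or its labels.
    - Example: a model that was never trained on sentiment analysis is given only the natural-language instruction "Classify this movie review as positive or negative" and asked to classify.

2. Few-shot learning
    - The model solves a new task after seeing only **a small number of examples (a few samples)**.
    - Example: 2 to 5 sentiment-analysis examples are shown, and the model then predicts the sentiment of similar sentences.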
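
For a text-generation model, the difference between the two settings often comes down to what goes into the prompt. The sketch below builds both variants; the `build_prompt` helper, the prompt wording, and the example reviews are all made up for illustration. Note that the `facebook/bart-large-mnli` cell that follows uses a different, NLI-based mechanism rather than prompting.

```python
def build_prompt(examples, query):
    """Build a sentiment-classification prompt from (text, label) examples."""
    lines = ["Classify each review as positive or negative.", ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

query = "The plot was dull and far too long."

# Zero-shot: only the task description and the query.
zero_shot_prompt = build_prompt([], query)

# Few-shot: the same prompt preceded by a few labeled examples.
few_shot_examples = [
    ("I loved every minute of it.", "positive"),
    ("A complete waste of time.", "negative"),
]
few_shot_prompt = build_prompt(few_shot_examples, query)

print(few_shot_prompt)
```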

facebook/bart-large-mnli: https://huggingface.co/facebook/bart-large-mnli

In [22]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)



{'sequence': 'one day I will see the world',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9938651323318481, 0.0032737762667238712, 0.00286104460246861]}
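
The zero-shot pipeline reuses an NLI model: each candidate label is turned into a hypothesis (the default template is `"This example is {}."`), the model scores whether the input entails it, and in the single-label setting the entailment logits are normalized across labels with a softmax. A sketch of that scoring step, with made-up entailment logits standing in for real model outputs:

```python
import math

sequence = "one day I will see the world"
candidate_labels = ["travel", "cooking", "dancing"]
hypothesis_template = "This example is {}."

# Each (premise, hypothesis) pair would be fed to the NLI model.
pairs = [(sequence, hypothesis_template.format(label)) for label in candidate_labels]

# Hypothetical entailment logits, one per pair (not real model outputs).
entailment_logits = [4.2, -1.6, -1.5]

# Softmax across labels so the scores sum to 1.
m = max(entailment_logits)
exps = [math.exp(v - m) for v in entailment_logits]
total = sum(exps)
scores = [e / total for e in exps]

# Sort labels by score, highest first, as in the pipeline output.
ranked = sorted(zip(candidate_labels, scores), key=lambda p: p[1], reverse=True)
print(ranked)
```

Because the labels only appear inside the hypothesis text, any label set can be used at inference time without retraining the model.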