# Warming Up - BERT Inference Input/Output Structure and an Introduction to Sentence BERT

---


## References
- Jay Alammar, [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](http://jalammar.github.io/illustrated-bert/)
- How the Embedding Layers in BERT Were Implemented
    - Shows intuitively how the input is converted into embeddings
    - https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a
- [자연어처리_BERT 기초개념(완전 초보용)](https://han-py.tistory.com/252) (in Korean; BERT basics for complete beginners)
- [SentenceTransformers Documentation](https://www.sbert.net/index.html)

# 1. Background

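## BERT Inference Input/Output Structure
- Example input sentence:
    - '아 배고픈데 짜장면 먹고싶다' ("Ah, I'm hungry, I want to eat jajangmyeon")
    - The BERT tokenizer splits it as follows
        - Example input: 8 tokens in total
            - ['아', '배고픈', '##데', '짜장면', '먹', '##고', '##싶', '##다']
    - Each token is then converted to a token_id (an integer)
        - Token
            - ['아', '배고픈', '##데', '짜장면', '먹', '##고', '##싶', '##다'] is mapped to token ids as shown below.
        - token_id
            - [2, 3079, 31420, 4244, 26766, 2654, 4219, 4451, 4176, 3]
            - 2: [CLS], '아': 3079, '배고픈': 31420, ..., 3: [SEP]
- Feeding the prepared token_ids into the BERT model produces a sequence of vectors as output, as shown in the figure below
    - Output: Contextual Representation of Token (one output vector per input position)
        - 10 vectors of dimension 768 (the BERT-base hidden size): the [CLS] vector (CLS stands for classification), 8 token vectors, and the [SEP] vector
        - `[CLS] vector`, `token vector for 아`, `token vector for 배고픈`, ..., `token vector for ##다`, `[SEP] vector`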
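
The token-to-id step above can be sketched in plain Python. This is a minimal illustration, not the tokenizer's actual implementation: `toy_vocab` and `encode_tokens` are hypothetical names, and the vocabulary holds only the ids listed above. The ids for the tokens after '배고픈' are read off the id list by position, which is an assumption.

```python
# Toy vocabulary holding only the ids listed above (hypothetical, not the real vocab file).
toy_vocab = {
    "[CLS]": 2, "[SEP]": 3,
    "아": 3079, "배고픈": 31420, "##데": 4244, "짜장면": 26766,
    "먹": 2654, "##고": 4219, "##싶": 4451, "##다": 4176,
}

def encode_tokens(tokens, vocab):
    """Wrap the tokens in [CLS] ... [SEP] and look up each token's id."""
    wrapped = ["[CLS]"] + list(tokens) + ["[SEP]"]
    return [vocab[token] for token in wrapped]

tokens = ["아", "배고픈", "##데", "짜장면", "먹", "##고", "##싶", "##다"]
token_ids = encode_tokens(tokens, toy_vocab)

# 8 tokens become 10 ids once [CLS] and [SEP] are added
print(len(tokens), len(token_ids))
print(token_ids)
```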



### How BERT Tokens Are Transformed
![BERT-Embedding.png](img/BERT-Embedding.png)
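
As a rough sketch of the input side of BERT: the input vector at each position is the sum of a token embedding, a segment (token type) embedding, and a position embedding. The tables below are tiny random stand-ins with a hypothetical dimension of 4 (BERT-base uses 768), the ids are toy values, and the layer normalization and dropout that follow in the real model are omitted.

```python
import random

random.seed(0)
DIM = 4  # hypothetical toy dimension; BERT-base uses 768

def random_table(num_rows, dim):
    """Build a stand-in embedding table filled with random values."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_rows)]

token_ids = [2, 7, 5, 3]       # toy ids: [CLS], two tokens, [SEP]
token_type_ids = [0, 0, 0, 0]  # a single sentence, so every segment id is 0

token_table = random_table(10, DIM)     # one row per vocabulary entry
segment_table = random_table(2, DIM)    # one row per segment (sentence A / B)
position_table = random_table(16, DIM)  # one row per position

input_embeddings = []
for position, (tok, seg) in enumerate(zip(token_ids, token_type_ids)):
    # Element-wise sum of the three embeddings for this position
    summed = [
        t + s + p
        for t, s, p in zip(token_table[tok], segment_table[seg], position_table[position])
    ]
    input_embeddings.append(summed)

# One DIM-sized vector per input position
print(len(input_embeddings), len(input_embeddings[0]))
```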

    

## BERT Structure

![BERT_Structure.png](img/BERT_Structure.png)

# 2. Inspecting BERT Inference Input

In [2]:
from transformers import ElectraTokenizer

# KoELECTRA checkpoint; its tokenizer produces the same BERT-style input format
tokenizer_id = 'monologg/koelectra-small-v3-discriminator'

tokenizer = ElectraTokenizer.from_pretrained(tokenizer_id)



In [3]:
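# Example sentence: "Ah, I'm hungry, I want to eat jajangmyeon"
doc = '아 배고픈데 짜장면 먹고싶다'

# Split the sentence into subword tokens
tokenizer.tokenize(doc)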

['아', '배고픈', '##데', '짜장면', '먹', '##고', '##싶', '##다']

In [4]:
# Convert the sentence to token ids; [CLS] (2) and [SEP] (3) are added at both ends
tokenizer.encode(doc)

[2, 3079, 31420, 4244, 26766, 2654, 4219, 4451, 4176, 3]

In [5]:
# Calling the tokenizer directly returns every field the model expects
tokenizer(doc)

{'input_ids': [2, 3079, 31420, 4244, 26766, 2654, 4219, 4451, 4176, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
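
The three fields in the output above can be sketched by hand. This is an illustrative sketch rather than the tokenizer's implementation: `pad_batch` is a hypothetical helper and the padding id of 0 is an assumption. For a single sentence `token_type_ids` is all zeros, and `attention_mask` is 1 for real tokens and 0 for padding, which only appears when sequences of different lengths are batched together.

```python
PAD_ID = 0  # assumed padding id, for illustration only

def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    encoded = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        encoded["input_ids"].append(ids + [pad_id] * n_pad)
        encoded["token_type_ids"].append([0] * max_len)  # single-sentence inputs
        encoded["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return encoded

batch = [
    [2, 3079, 31420, 4244, 26766, 2654, 4219, 4451, 4176, 3],  # ids from the cell above
    [2, 3079, 3],                                              # shorter toy input: [CLS] 아 [SEP]
]
encoded = pad_batch(batch)
print(encoded["input_ids"][1])
print(encoded["attention_mask"][1])
```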

# 3. Sentence BERT

## 3.1. Displaying Sentence Embeddings
- [SentenceTransformers Quickstart](https://www.sbert.net/docs/quickstart.html)

BERT (and other transformer networks) output one embedding for each token in the input text. To create a fixed-size sentence embedding from these, the model applies mean pooling: the output embeddings of all tokens are averaged to yield a single fixed-size vector.
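
Mean pooling can be sketched in a few lines. This is a simplified illustration with made-up 3-dimensional token vectors, not the library's code: `mean_pool` is a hypothetical helper, and using the attention mask to skip padding positions is the common convention assumed here.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                totals[k] += vector[k]
    return [total / count for total in totals]

token_embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],  # padding position, ignored through the mask
]
attention_mask = [1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [2.0, 3.0, 4.0]
```

Whatever the number of tokens, the result has the same length as a single token vector, which is what makes it a fixed-size sentence embedding.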

In [6]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence']

# Encode the sentences by calling model.encode()
sentence_embeddings = model.encode(sentences)

# Print each sentence with its embedding
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("embedding shape: ", embedding.shape)
    print("Embedding:", embedding)
    print("")

Sentence: This framework generates embeddings for each input sentence
embedding shape:  (384,)
Embedding: [-1.37173478e-02 -4.28515859e-02 -1.56286210e-02  1.40537620e-02
  3.95537540e-02  1.21796317e-01  2.94333659e-02 -3.17523889e-02
  3.54959592e-02 -7.93140158e-02  1.75878238e-02 -4.04369682e-02
  4.97259647e-02  2.54912749e-02 -7.18699768e-02  8.14968348e-02
  1.47072237e-03  4.79627214e-02 -4.50335667e-02 -9.92174894e-02
 -2.81769093e-02  6.45045787e-02  4.44670357e-02 -4.76217344e-02
 -3.52952369e-02  4.38671671e-02 -5.28565906e-02  4.33052715e-04
  1.01921484e-01  1.64072420e-02  3.26996557e-02 -3.45986970e-02
  1.21339411e-02  7.94871226e-02  4.58340487e-03  1.57778393e-02
 -9.68207326e-03  2.87626293e-02 -5.05806282e-02 -1.55793820e-02
 -2.87907142e-02 -9.62280110e-03  3.15556377e-02  2.27349531e-02
  8.71449560e-02 -3.85027714e-02 -8.84718820e-02 -8.75496119e-03
 -2.12343633e-02  2.08924245e-02 -9.02078301e-02 -5.25732152e-02
  -1.05638672e-02  2.88311355e-02 -1.61454827e-02 ...]

## 3.2. Displaying Sentence Similarity

In [7]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode each sentence by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")  # 384-dimensional vector
emb2 = model.encode("Have you seen my red cat?")  # 384-dimensional vector


cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.6153]])
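
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on small made-up vectors; `cosine_similarity` is a hypothetical helper, not part of `sentence_transformers`.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

The value ranges from -1 (opposite directions) to 1 (same direction), so a score of about 0.62 indicates two sentences that are related but not paraphrases.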


Next, find the most similar pairs among the sentences below.

In [8]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.'
          ]

# Encode all sentences
embeddings = model.encode(sentences)

# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

# Collect every unordered pair (i < j) together with its cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

# Sort the pairs from highest to lowest cosine similarity
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))

Top-5 most similar pairs:
A man is eating food. 	 A man is eating a piece of bread. 	 0.7553
A man is riding a horse. 	 A man is riding a white horse on an enclosed ground. 	 0.7369
A monkey is playing drums. 	 Someone in a gorilla costume is playing a set of drums. 	 0.6433
A woman is playing violin. 	 Someone in a gorilla costume is playing a set of drums. 	 0.2564
A man is eating food. 	 A man is riding a horse. 	 0.2474


## 3.3. Semantic Search

In [9]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

query_embedding = model.encode('How big is London')
passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census',
                                  'London is known for its finacial district'])

print("Similarity:", util.dot_score(query_embedding, passage_embedding))

Similarity: tensor([[0.5472, 0.6330]])
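
`util.dot_score` computes plain dot products. If the model returns unit-length embeddings (an assumption worth checking on the model card), the dot product and the cosine similarity coincide. The sketch below shows this on made-up vectors; `dot` and `normalize` are hypothetical helpers.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

query = [3.0, 4.0]
passage = [1.0, 2.0]

cosine = dot(query, passage) / (math.sqrt(dot(query, query)) * math.sqrt(dot(passage, passage)))
dot_of_unit_vectors = dot(normalize(query), normalize(passage))

# The two values agree up to floating-point rounding
print(cosine, dot_of_unit_vectors)
```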


In [10]:
from sentence_transformers import CrossEncoder
model = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2-v2', max_length=512)
# Each input is a (query, passage) pair; predict() returns one relevance score per pair
scores = model.predict([('Query1', 'Paragraph1'), ('Query1', 'Paragraph2')])

# For example
scores = model.predict([('How many people live in Berlin?', 'Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'), 
                        ('How many people live in Berlin?', 'Berlin is well known for its museums.')])
print("scores: \n", scores)

scores: 
 [ 7.152368 -6.287043]
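
The scores above are unbounded, and a higher value means the passage is judged more relevant to the query. If values between 0 and 1 are easier to read, one option is to pass them through a sigmoid. This is an illustrative post-processing step applied to the numbers printed above, not something the cell does.

```python
import math

def sigmoid(x):
    """Map an unbounded score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is well known for its museums.",
]
scores = [7.152368, -6.287043]  # copied from the output above

# Rank passages from most to least relevant
ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
for score, passage in ranked:
    print("{:.4f}\t{:.4f}\t{}".format(score, sigmoid(score), passage))
```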


## 3.4. Using Hugging Face models

In [11]:
from sentence_transformers import SentenceTransformer, util

question = "<Q>How many models can I host on HuggingFace?"
answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."

model = SentenceTransformer('clips/mfaq')
query_embedding = model.encode(question)
corpus_embeddings = model.encode([answer_1, answer_2, answer_3])

print(util.semantic_search(query_embedding, corpus_embeddings))



[[{'corpus_id': 0, 'score': 0.5646325945854187}, {'corpus_id': 2, 'score': 0.5142339468002319}, {'corpus_id': 1, 'score': 0.4730038344860077}]]
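
`util.semantic_search` returns one list of hits per query, each hit holding a `corpus_id` and a `score`, sorted by decreasing score. Below is a small sketch of mapping the hits printed above back to the answer texts. The scores are copied from that output; nothing is recomputed here.

```python
answers = [
    "<A>All plans come with unlimited private models and datasets.",
    "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem.",
    "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job.",
]

# Result copied from the output above: one inner list per query
hits = [[{'corpus_id': 0, 'score': 0.5646325945854187},
         {'corpus_id': 2, 'score': 0.5142339468002319},
         {'corpus_id': 1, 'score': 0.4730038344860077}]]

best = hits[0][0]  # hits for the first (and only) query, highest score first
print("Best answer:", answers[best['corpus_id']])

for hit in hits[0]:
    print("{:.4f}\t{}".format(hit['score'], answers[hit['corpus_id']]))
```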


# 4. Kernel Restart


- Kernel restarts are covered in the notebook linked below. Open it and see "3.커널 리스타팅" (3. Kernel Restarting) at the very bottom.
    - [Restart details](https://github.com/gonsoomoon-ml/NLP-HuggingFace-On-SageMaker/blob/main/1_NSMC-Classification/2_WarmingUp/0.1.warming_up_yelp_review.ipynb)

In [12]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}