<a href="https://colab.research.google.com/github/monya-9/deep-learning-practice/blob/main/12_huggingface_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT with Hugging Face Transformers

In [1]:
# 1. Install the libraries
!pip install transformers torch



In [2]:
# 2. Load the tokenizer and model
from transformers import BertTokenizer, BertModel
import torch

# Use the bert-base-uncased checkpoint
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

- Loads the BERT-base (uncased) model and its tokenizer.

In [3]:
# 3. Prepare a sentence
sentence = "Hugging Face BERT practice is interesting!"

In [4]:
# 4. Tokenize and encode
inputs = tokenizer(sentence, return_tensors="pt")  # return PyTorch tensors
print("Input token IDs:", inputs["input_ids"])
print("Attention mask:", inputs["attention_mask"])

Input token IDs: tensor([[  101, 17662,  2227, 14324,  3218,  2003,  5875,   999,   102]])
Attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])


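- The tokenizer splits the sentence into tokens and converts each token to an integer ID.
- The special tokens [CLS] (ID 101) and [SEP] (ID 102) are added at the start and end of the sequence.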
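
To make the two notes above concrete, here is a minimal pure-Python sketch of what the encoding step produces. It is not the real WordPiece tokenizer: `TOY_VOCAB` and `toy_encode` are hypothetical names, the vocabulary holds only the IDs visible in the output above (paired with words by position, assuming no word was split into subwords), and the splitting rule is a simple regex. It also pads a shorter sentence with [PAD] (ID 0) to show where zeros in an attention mask come from.

```python
import re

# Hypothetical toy vocabulary: IDs copied from the output above and paired
# with words by position (assumes no word was split into subword pieces).
TOY_VOCAB = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "hugging": 17662, "face": 2227, "bert": 14324, "practice": 3218,
    "is": 2003, "interesting": 5875, "!": 999,
}

def toy_encode(sentences):
    """Lowercase, split into words/punctuation, add [CLS]/[SEP], pad to equal length.

    Raises KeyError for any word outside TOY_VOCAB; the real tokenizer
    would fall back to subword pieces instead.
    """
    batch = []
    for text in sentences:
        tokens = re.findall(r"\w+|[^\w\s]", text.lower())
        batch.append(["[CLS]"] + tokens + ["[SEP]"])

    max_len = max(len(tokens) for tokens in batch)
    input_ids, attention_mask = [], []
    for tokens in batch:
        pad = max_len - len(tokens)
        input_ids.append([TOY_VOCAB[t] for t in tokens] + [TOY_VOCAB["[PAD]"]] * pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(tokens) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = toy_encode([
    "Hugging Face BERT practice is interesting!",
    "BERT is interesting!",
])
print(encoded["input_ids"][0])
print(encoded["attention_mask"][1])
```

With the real tokenizer, passing a list of sentences together with `padding=True` should give the same kind of padded batch.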

In [5]:
# 5. Run the model (forward pass)
with torch.no_grad():  # inference only, no gradient tracking
    outputs = model(**inputs)

- Feeds the inputs to BERT, which returns per-token hidden states (last_hidden_state) and a sentence-level embedding (pooler_output).

In [6]:
# 6. Inspect the output
print("\n=== Output ===")
print("last_hidden_state:", outputs.last_hidden_state.shape)
print("pooler_output:", outputs.pooler_output.shape)


=== Output ===
last_hidden_state: torch.Size([1, 9, 768])
pooler_output: torch.Size([1, 768])


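- outputs.last_hidden_state: the hidden-state vector of every token (batch size x number of tokens x 768 dimensions); here 1 x 9 x 768.
- outputs.pooler_output: a sentence vector based on the [CLS] token, that is, the [CLS] hidden state passed through a linear layer and tanh (batch size x 768 dimensions).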
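
The relationship between the two outputs can be illustrated with a toy sketch. In BERT, the pooler takes the hidden state of the first token ([CLS]), applies a linear layer, and then tanh, which is why every value in pooler_output lies between -1 and 1. The sketch below uses plain Python lists with a hidden size of 4 instead of 768; `toy_pooler`, the weights, and the hidden states are made-up values for illustration, not the pretrained parameters.

```python
import math

def toy_pooler(last_hidden_state, weight, bias):
    """Dense layer + tanh applied to the first ([CLS]) token of each sequence."""
    pooled = []
    for sequence in last_hidden_state:      # loop over the batch
        cls_vector = sequence[0]            # hidden state at position 0
        row = []
        for w_row, b in zip(weight, bias):  # one output unit per weight row
            z = sum(w * h for w, h in zip(w_row, cls_vector)) + b
            row.append(math.tanh(z))
        pooled.append(row)
    return pooled

# Made-up hidden states: batch of 1, 3 tokens, hidden size 4
last_hidden_state = [[[0.5, -1.0, 2.0, 0.0],
                      [1.0, 1.0, 1.0, 1.0],
                      [0.0, 0.0, 0.0, 0.0]]]

# Made-up 4x4 weight matrix and bias vector
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.5, 0.5, 0.5, 0.5]]
bias = [0.0, 0.0, 0.0, 0.1]

pooler_output = toy_pooler(last_hidden_state, weight, bias)
print(len(pooler_output), len(pooler_output[0]))  # batch size, hidden size
print(pooler_output[0])
```

Only the first token's vector is used, so the token dimension disappears and the result has one vector per sentence.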

In [7]:
# 7. Sentence embedding (using pooler_output)
sentence_embedding = outputs.pooler_output
print("\nSentence embedding vector shape:", sentence_embedding.shape)
print(sentence_embedding)


Sentence embedding vector shape: torch.Size([1, 768])
tensor([[-7.6616e-01, -1.7362e-01,  5.9557e-01,  3.8121e-01, -3.1588e-01,
         -1.5283e-02,  8.0492e-01,  1.5867e-01,  4.2045e-01, -9.9902e-01,
          3.5151e-01, -1.3399e-01,  9.6092e-01, -3.8096e-01,  8.9829e-01,
         -3.4242e-01,  1.6761e-01, -4.5077e-01,  2.4046e-01, -4.9891e-01,
          4.7402e-01,  6.4965e-01,  6.6592e-01,  1.8362e-01,  2.5549e-01,
         -1.1114e-01, -4.4155e-01,  8.8547e-01,  9.0434e-01,  5.5801e-01,
         -5.0125e-01,  1.1161e-01, -9.6717e-01, -1.0528e-01,  5.6874e-01,
         -9.5737e-01,  4.2823e-02, -6.5959e-01,  6.3764e-02,  1.1802e-01,
         -8.3111e-01,  1.2409e-01,  9.8176e-01, -4.9814e-01, -2.4154e-01,
         -2.6183e-01, -9.9652e-01,  1.3074e-01, -7.7726e-01, -5.4344e-01,
         -4.2838e-01, -6.4672e-01, -2.5718e-02,  2.3404e-01,  2.3145e-01,
          4.7355e-01, -2.0612e-01,  3.4768e-02, -4.7149e-02, -4.1010e-01,
         -5.3218e-01,  1.2007e-01,  2.1638e-01, -8.2388e-01, -4.7633e-01,
  

- outputs.pooler_output can be used as a sentence representation (embedding).
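
One common way to use such sentence vectors is to compare them with cosine similarity. The sketch below is a minimal pure-Python version; `cosine_similarity` is a helper written for this example, and the three short vectors are made-up stand-ins for 768-dimensional embeddings, not real model outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional stand-ins for sentence embeddings
emb_a = [0.8, -0.2, 0.5, 0.1]
emb_b = [0.7, -0.1, 0.6, 0.0]   # points in nearly the same direction as emb_a
emb_c = [-0.6, 0.9, -0.3, 0.4]  # points in a very different direction

print("a vs b:", cosine_similarity(emb_a, emb_b))
print("a vs c:", cosine_similarity(emb_a, emb_c))
```

With real tensors, `torch.nn.functional.cosine_similarity` performs the same computation on the pooler_output of two sentences.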