# [Module 0.1] English Scratch

This notebook is a hands-on walkthrough of the Hugging Face - Getting Started guide linked below.

The main steps are:
- 1. Warm-up
- 2. Training and inference with the Yelp dataset

---
### References

[Hugging Face - Getting Started](https://huggingface.co/docs/transformers/tasks/sequence_classification)

# 1. Warm-up

### Create the tokenizer

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")



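### Tokenizer example
The fields in the output below are:
- input_ids: the sentence is split into tokens, and each token is represented by its corresponding integer id.
- token_type_ids: depending on the task, BERT accepts either a single sentence or a pair of sentences, and this field tells the two apart. With a single sentence every value is 0; when a second sentence is supplied, its tokens are marked 1.
- attention_mask: sequences are padded up to a target length. Positions holding real tokens are 1 and padding positions are 0, so a sequence that fills the whole length is all 1s.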
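
To make the three fields concrete, here is a minimal plain-Python sketch that assembles them by hand for a sentence pair. It is an illustration only, not the tokenizer's actual implementation: `build_bert_inputs` is a hypothetical helper, the token ids in the example are arbitrary, and the special-token ids (101, 102, 0) are taken from the `bert-base-cased` outputs shown in this notebook.

```python
# Special-token ids as they appear in the bert-base-cased outputs in this notebook.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def build_bert_inputs(ids_a, ids_b=None, max_length=None):
    """Hypothetical helper: assemble the three BERT input lists by hand."""
    input_ids = [CLS_ID] + list(ids_a) + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # first sentence -> 0
    if ids_b is not None:
        second = list(ids_b) + [SEP_ID]
        input_ids += second
        token_type_ids += [1] * len(second)  # second sentence -> 1
    attention_mask = [1] * len(input_ids)  # real tokens -> 1
    if max_length is not None and len(input_ids) < max_length:
        pad = max_length - len(input_ids)
        input_ids += [PAD_ID] * pad
        token_type_ids += [0] * pad
        attention_mask += [0] * pad  # padding -> 0
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Arbitrary example ids: a 2-token first sentence and a 3-token second sentence.
example = build_bert_inputs([1130, 170], [1103, 1747, 119], max_length=10)
for name, values in example.items():
    print(name, values)
```

With a real tokenizer, passing two sentences (for example `tokenizer(sentence_a, sentence_b)`) should produce the same 0/1 pattern in `token_type_ids`.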

#### Reference
- [Hugging Face basics (in Korean)](https://www.ohsuz.dev/22f4e8e7-64a3-4789-9dd2-171913883733)

In [2]:
sequence = "In a hole in the ground there lived a hobbit."
encoded_input = tokenizer(sequence)
print(encoded_input)

{'input_ids': [101, 1130, 170, 4569, 1107, 1103, 1747, 1175, 2077, 170, 16358, 13834, 2875, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


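### Decoding with the tokenizer
- The `input_ids` above, [101, 1130, 170, 4569, 1107, 1103, 1747, 1175, 2077, 170, 16358, 13834, 2875, 119, 102], decode back to the original sentence as shown below, with the special tokens `[CLS]` and `[SEP]` around it.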

In [3]:
tokenizer.decode(encoded_input["input_ids"])

'[CLS] In a hole in the ground there lived a hobbit. [SEP]'

### Tokenizer encoding example
- Encoding a batch of three sentences

In [4]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
# padding=True pads every sentence to the longest one in the batch,
# truncation=True cuts inputs that exceed the model's maximum length,
# and return_tensors="pt" returns PyTorch tensors instead of Python lists.
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)

{'input_ids': tensor([[  101,  1252,  1184,  1164,  1248,  6462,   136,   102,     0,     0,
             0,     0,     0,     0,     0],
        [  101,  1790,   112,   189,  1341,  1119,  3520,  1164,  1248,  6462,
           117, 21902,  1643,   119,   102],
        [  101,  1327,  1164,  5450, 23434,   136,   102,     0,     0,     0,
             0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
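
The padding pattern in the output above can be reproduced with a short plain-Python sketch. This is only an illustration of what padding to the longest sequence means, not the library's code; `pad_to_longest` is a hypothetical helper, and the unpadded ids are copied from the output above.

```python
PAD_ID = 0  # padding id seen in the output above


def pad_to_longest(batch_ids):
    """Hypothetical helper: pad every sequence to the longest one in the batch."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        pad = longest - len(ids)
        input_ids.append(list(ids) + [PAD_ID] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return input_ids, attention_mask


# Unpadded ids of the three sentences, taken from the output above.
unpadded = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]
padded_ids, mask = pad_to_longest(unpadded)
print([len(row) for row in padded_ids])  # every row has the same length
print([sum(row) for row in mask])        # number of real tokens per row
```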


# 2. Training and inference with the Yelp dataset

### Load the Yelp dataset

In [5]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][100]

Reusing dataset yelp_review_full (/home/ec2-user/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)


  0%|          | 0/2 [00:00<?, ?it/s]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

In [6]:
print("type of dataset : ", type(dataset))

type of dataset :  <class 'datasets.dataset_dict.DatasetDict'>


### Define the tokenize function and encode the dataset

In [7]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at /home/ec2-user/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-abceb9d59c7d7796.arrow


  0%|          | 0/50 [00:00<?, ?ba/s]
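
As a rough picture of what `padding="max_length"` together with `truncation=True` does to a single review, the sketch below pads or truncates a list of token ids to a fixed length. It is a simplified illustration under stated assumptions: `encode_to_max_length` is a hypothetical helper, the default of 512 is assumed to be the maximum length for `bert-base-cased`, and the real tokenizer handles more cases (sentence pairs, truncation strategies).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def encode_to_max_length(body_ids, max_length=512):
    """Hypothetical helper: fixed-length encoding of one sequence."""
    # Keep room for [CLS] and [SEP], dropping tokens that do not fit.
    kept = list(body_ids)[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill the remainder with padding, masked out with 0.
    pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * pad
    attention_mask += [0] * pad
    return input_ids, attention_mask


short_ids, short_mask = encode_to_max_length([11, 12, 13, 14, 15], max_length=8)
long_ids, long_mask = encode_to_max_length(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is why the batches printed later in this notebook end in long runs of 0.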

### Create the train and eval dataloaders

In [8]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

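The `input_ids`, `token_type_ids`, and `attention_mask` columns that BERT needs have already been created, so drop the raw `text` column and rename `label` to `labels`, the argument name the model expects for its targets.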

In [9]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

Create smaller datasets from a 1,000-example subset of the train and test splits.

In [10]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Loading cached shuffled indices for dataset at /home/ec2-user/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-f62f2ff8b5f7db66.arrow
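
The idea behind `shuffle(seed=42).select(range(1000))` is to permute the row indices reproducibly and keep the first 1,000. The sketch below shows that idea with the standard library only. `shuffled_subset` is a hypothetical helper, and it will not reproduce the exact rows that `datasets` selects, because the library uses its own random generator.

```python
import random


def shuffled_subset(num_rows, subset_size, seed):
    """Hypothetical helper: indices of a reproducible random subset."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return indices[:subset_size]


subset = shuffled_subset(num_rows=10_000, subset_size=1000, seed=42)
print(len(subset), subset[:5])
```

Fixing the seed means the same subset is drawn on every run, which keeps results comparable between experiments.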


Create PyTorch `DataLoader`s with batch_size=8.

In [11]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

Inspect the values of one batch (8 records).

In [12]:
for record in train_dataloader:
    print("labels: \n", record['labels'])
    print("input_ids: \n", record['input_ids'])
    print("token_type_ids: \n", record['token_type_ids'])
    print("attention_mask: \n", record['attention_mask'])

    break

labels: 
 tensor([0, 2, 3, 0, 3, 3, 0, 1])
input_ids: 
 tensor([[  101,  1188,  1282,  ...,     0,     0,     0],
        [  101,  1422,  3143,  ...,     0,     0,     0],
        [  101, 23158,  1204,  ...,     0,     0,     0],
        ...,
        [  101,   146,   112,  ...,     0,     0,     0],
        [  101,  1573,  1292,  ...,     0,     0,     0],
        [  101,   146,  1932,  ...,     0,     0,     0]])
token_type_ids: 
 tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]])
attention_mask: 
 tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
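
A dataloader with `batch_size=8` over 1,000 examples yields 125 batches per epoch. The sketch below shows the batching arithmetic in plain Python; `iter_batches` is a hypothetical stand-in for the slicing a `DataLoader` performs and leaves out shuffling and tensor collation.

```python
import math


def iter_batches(items, batch_size):
    """Hypothetical helper: yield consecutive slices of batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


examples = list(range(1000))  # stand-in for the 1,000 selected examples
batches = list(iter_batches(examples, batch_size=8))
print(len(batches), math.ceil(len(examples) / 8))
```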


### Load the pre-trained model
- Download the pre-trained "bert-base-cased" model from the Hugging Face Hub

In [13]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

### Create the optimizer and learning rate scheduler

In [14]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)


In [15]:
from transformers import get_scheduler

num_epochs = 1
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
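
With `name="linear"` and `num_warmup_steps=0`, the learning rate is expected to decay linearly from its initial value to 0 over the training steps. The sketch below writes that schedule out as a plain function for intuition. `linear_lr` is a hypothetical helper, not the scheduler object itself, and its defaults (5e-5 and 125 steps) are taken from this notebook's settings.

```python
def linear_lr(step, base_lr=5e-5, num_warmup_steps=0, num_training_steps=125):
    """Hypothetical helper: learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warm-up from 0 to base_lr (unused here, since warm-up is 0).
        factor = step / max(1, num_warmup_steps)
    else:
        # Linear decay from base_lr down to 0 at the last step.
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


for step in (0, 25, 100, 125):
    print(step, linear_lr(step))
```

Calling `lr_scheduler.step()` once per batch, as the training loop does, advances `step` by one each time.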

### Move the pre-trained model to the device
- If a GPU is available, the pre-trained model is loaded onto the GPU
- Otherwise, the pre-trained model is loaded onto the CPU

In [16]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

### Model training loop
- Run the training loop for the model.
    - For each of the num_epochs epochs below, fetch batches from train_dataloader and train the model on them.

In [17]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/125 [00:00<?, ?it/s]

### Model evaluation
- Create the metric object used to evaluate the model

In [18]:
import numpy as np
from datasets import load_metric

# Newer releases of `datasets` moved metrics into the separate `evaluate` library.
metric = load_metric("accuracy")

### Model inference and evaluation
- Input batches come from eval_dataloader
- Run inference with outputs = model(**batch)
- Derive the predictions from the inference outputs:
```
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
```    
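
For intuition, taking the argmax over the last dimension just picks, for each example, the index of the largest logit, and that index is the predicted class. Below is a plain-Python sketch of that step. `argmax_last_dim` is a hypothetical helper, and the logits are made-up numbers for two examples and five classes, not outputs of the model.

```python
def argmax_last_dim(logits):
    """Hypothetical helper: index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Made-up logits: 2 examples x 5 classes (star ratings 0..4).
fake_logits = [
    [0.1, 2.3, -1.0, 0.4, 0.0],
    [1.5, -0.2, 0.3, 0.9, 3.1],
]
fake_predictions = argmax_last_dim(fake_logits)
print(fake_predictions)
```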



#### Inspect the results for a single batch

In [19]:
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    print("batch: \n\n", batch)
    with torch.no_grad():
        outputs = model(**batch)
        print("\n\n outputs: \n\n", outputs)

    logits = outputs.logits
    print("\n\n logits: \n\n", logits)
    predictions = torch.argmax(logits, dim=-1)
    print("\n\n predictions: \n\n", predictions)
    break


batch: 

 {'labels': tensor([2, 4, 1, 4, 3, 4, 2, 3], device='cuda:0'), 'input_ids': tensor([[  101, 14812, 16442,  ...,     0,     0,     0],
        [  101, 19383,  1303,  ...,     0,     0,     0],
        [  101, 12008, 27788,  ...,     0,     0,     0],
        ...,
        [  101,  3930, 13991,  ...,     0,     0,     0],
        [  101,  1284,  3523,  ...,     0,     0,     0],
        [  101,  6682,  3537,  ...,     0,     0,     0]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}


 outputs: 

 SequenceClas

#### Evaluate the model on all batches

In [20]:
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.509}
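
The `add_batch` / `compute` pattern above accumulates predictions batch by batch and reports a single score at the end. The sketch below mimics that pattern for accuracy in plain Python. `RunningAccuracy` is a hypothetical class written for illustration, not the metric implementation from the library, and the predictions and labels are made up.

```python
class RunningAccuracy:
    """Hypothetical stand-in for an accuracy metric with add_batch/compute."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Count matching prediction/label pairs in this batch.
        self.correct += sum(int(p == r) for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}


running = RunningAccuracy()
running.add_batch(predictions=[0, 2, 3, 1], references=[0, 2, 1, 1])  # 3 of 4 correct
running.add_batch(predictions=[4, 4], references=[4, 0])              # 1 of 2 correct
result = running.compute()
print(result)
```

Read this way, an accuracy of 0.509 on the 1,000 evaluation examples corresponds to 509 correct predictions.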

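# 3. Restarting the kernel

- After the notebook above has run to completion, its process still holds GPU memory, as shown in the figure below. (Run `nvidia-smi` in a terminal to check.)
![before-nvidia-smi.png](img/before-nvidia-smi.png)

- Running the cell below restarts this notebook's kernel, after which you can confirm that the memory has been released.
![after-nvidia-smi.png](img/after-nvidia-smi.png)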

In [21]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}