## 003. IMDB
- A sentiment analysis dataset for classifying movie reviews as positive or negative
- Consists of 25,000 training examples and 25,000 test examples
- Download the data with torchtext

``` Python
!pip install torchtext==0.15.2  # NLP datasets and utilities
!pip install portalocker==2.7.0  # file locking, prevents conflicts when several processes access the same file
!pip install accelerate -U  # speeds up training and evaluation of PyTorch models
```

In [1]:
# Restart the runtime after running the install cell above
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')
test_iter = IMDB(split='test')



In [2]:
import random

random.seed(6)

# Convert each xxx_iter into a list
train_list = list(train_iter)
test_list = list(test_iter)

# Randomly sample 1,000 examples from each list
train_list_small = random.sample(train_list, 1000)
test_list_small = random.sample(test_list, 1000)

# Print the first element of each sample
print(train_list_small[0])
print(test_list_small[0])

(2, "I LOVED this movie! I am biased seeing as I am a huge Disney fan, but I really enjoyed myself. The action takes off running in the beginning of the film and just keeps going! This is a bit of a departure for Disney, they don't spend quite as much time on character development (my husband pointed this out)and there are no musical numbers. It is strictly action adventure. I thoroughly enjoyed it and recommend it to anyone who loves Disney, be they young or old.")
(1, 'This was an abysmal show. In short it was about this kid called Doug who guilt-tripped a lot. Seriously he could feel guilty over killing a fly then feeling guilty over feeling guilty for killing the fly and so forth. The animation was grating and unpleasant and the jokes cheap. <br /><br />It aired here in Sweden as a part of the "Disney time" show and i remember liking it some what but then i turned 13.<br /><br />I never got why some of the characters were green and purple too. What was up with that? <br /><br />Tru

## 005. Label encoding
Each example in the dataset above is a (label, text) tuple, where label 2 means positive and label 1 means negative. Re-encode positive as 1 and negative as 0, and store the results in train_texts, train_labels, test_texts, and test_labels.

In [3]:
train_texts  = []
train_labels = []
test_texts = []
test_labels = []

In [4]:
for label, text in train_list_small:
    train_labels.append(1 if label==2 else 0)
    train_texts.append(text)

In [5]:
for label, text in test_list_small:
    test_labels.append(1 if label==2 else 0)
    test_texts.append(text)

In [6]:
print(train_labels[0])
print(train_texts[0])

1
I LOVED this movie! I am biased seeing as I am a huge Disney fan, but I really enjoyed myself. The action takes off running in the beginning of the film and just keeps going! This is a bit of a departure for Disney, they don't spend quite as much time on character development (my husband pointed this out)and there are no musical numbers. It is strictly action adventure. I thoroughly enjoyed it and recommend it to anyone who loves Disney, be they young or old.


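## 006. Splitting training and validation data
Split the 1,000 sampled training examples into training and validation sets at an 8:2 ratio.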

In [7]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=3)

print(len(train_labels))
print(len(train_texts))

800
800
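
To make the 8:2 split concrete, here is a minimal standard-library sketch of a hold-out split. It is not how `train_test_split` is implemented; the helper name `hold_out_split` and the toy data are assumptions made for illustration, and the resulting shuffle order differs from scikit-learn's.

```python
import random


def hold_out_split(texts, labels, test_size=0.2, seed=3):
    """Shuffle the indices once, then cut them into a validation part and a training part."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(indices) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    # Texts and labels are selected with the same indices, so pairs stay aligned.
    return (
        [texts[i] for i in train_idx],
        [texts[i] for i in val_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in val_idx],
    )


# Toy data standing in for the 1,000 sampled reviews (hypothetical values).
toy_texts = [f"review {i}" for i in range(1000)]
toy_labels = [i % 2 for i in range(1000)]

tr_x, va_x, tr_y, va_y = hold_out_split(toy_texts, toy_labels)
print(len(tr_x), len(va_x))
```

The important property is that each text keeps its own label after shuffling, which is also what `train_test_split` guarantees when texts and labels are passed together.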


## 007. Tokenizing and encoding
Encode the train, val, and test data extracted above with a tokenizer so that they can be fed into the pretrained distilbert-base-uncased model.

In [8]:
import transformers
transformers.__version__



'4.37.1'

In [9]:
# Load the fast tokenizer that matches the pretrained checkpoint
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [10]:
# Run the tokenizer
train_encodings = tokenizer(train_texts, truncation=True, padding=True) # truncation: cuts off the part of each input that exceeds the model's default max_length
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

print(train_encodings['input_ids'][0][:5])

[101, 4937, 11350, 2038, 2048]


In [11]:
print(train_encodings.keys())

dict_keys(['input_ids', 'attention_mask'])


In [12]:
print(train_encodings['input_ids'][0][:5])
print(tokenizer.decode(train_encodings['input_ids'][0][:5]))

[101, 4937, 11350, 2038, 2048]
[CLS] cat soup has two


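## 008. Creating a dataset class
- Write a class named IMDBDataset that inherits from torch.utils.data.Dataset
- Instantiate the class with the train/val/test encodings produced from the IMDB data in problem 007

cf. `__init__` is the constructor, not a generator. It is called automatically when an object is created from the class and initializes the object with the values passed in. It is also where the instance attributes used by the other methods are defined.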
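
The role of `__init__` can be seen in a tiny class that needs no PyTorch. This is only a sketch: the class name `ToyPairs` and its data are hypothetical, chosen to mirror the `__init__` / `__len__` / `__getitem__` trio that a map-style dataset defines.

```python
class ToyPairs:
    """Hypothetical minimal container that pairs texts with labels."""

    def __init__(self, texts, labels):
        # Runs automatically on ToyPairs(...) and stores the arguments as instance attributes.
        if len(texts) != len(labels):
            raise ValueError("texts and labels must have the same length")
        self.texts = texts
        self.labels = labels

    def __len__(self):
        # Lets len(obj) report the number of examples.
        return len(self.labels)

    def __getitem__(self, idx):
        # Lets obj[idx] return one example as a dictionary.
        return {"text": self.texts[idx], "label": self.labels[idx]}


pairs = ToyPairs(["great film", "dull plot"], [1, 0])
print(len(pairs))
print(pairs[1])
```

No method is called explicitly here: creating the object triggers `__init__`, `len(pairs)` triggers `__len__`, and `pairs[1]` triggers `__getitem__`.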


In [13]:
import torch

class IMDBDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        # Build a {'key': value} dictionary for one example
        # Convert val[idx] and labels[idx] into PyTorch tensors
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

In [14]:
train_dataset = IMDBDataset(train_encodings, train_labels)
val_dataset = IMDBDataset(val_encodings, val_labels)
test_dataset = IMDBDataset(test_encodings, test_labels)

test_dataset

<__main__.IMDBDataset at 0x1ac7da133e0>

In [15]:
# Inspect the dataset: it supports indexing and len(), so it can be iterated with a for loop
for i in train_dataset:
    print(i.keys())
    break


dict_keys(['input_ids', 'attention_mask', 'labels'])


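- input_ids: token IDs (indices into the tokenizer vocabulary)
- attention_mask: marks padding positions (1 for real input tokens, 0 for padding tokens)
- labels: the sentiment label (1 = positive, 0 = negative)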
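
How `input_ids` and `attention_mask` relate under `padding=True` and `truncation=True` can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: the helper `pad_batch` and the small `max_length` are assumptions, and the real tokenizer also keeps the closing `[SEP]` token when it truncates. The IDs 101, 102, and 0 are `[CLS]`, `[SEP]`, and `[PAD]` in the BERT uncased vocabulary.

```python
def pad_batch(batch_ids, max_length=8, pad_id=0):
    """Truncate each sequence to max_length, then pad all of them to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths (token IDs other than 101/102 are arbitrary).
toy_batch = [[101, 4937, 11350, 102], [101, 2023, 102]]
encoded = pad_batch(toy_batch)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The shorter sequence receives one padding ID, and its mask has a 0 in that position, which is what lets the model skip padded positions when it computes attention.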

## Loading the pretrained model

In [16]:
from transformers import DistilBertForSequenceClassification

# 1. Load the pretrained model
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [17]:
# 2. Set the training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./result',
    num_train_epochs = 8,
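    per_device_train_batch_size = 16, # mini-batch size per device during training
    per_device_eval_batch_size = 64, # mini-batch size per device during evaluation
    warmup_steps = 500, # number of warmup steps for the learning-rate scheduler
    weight_decay=0.01, # weight decay strength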
    logging_dir = './logs',
    logging_steps=10,
)

In [18]:
# 3. Move the model to the GPU
import torch

torch.cuda.is_available()


True

In [19]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

In [20]:
!nvidia-smi

Thu Apr  4 17:09:18 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.42                 Driver Version: 537.42       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 Ti   WDDM  | 00000000:07:00.0  On |                  N/A |
|  0%   38C    P2              24W / 220W |   2018MiB /  8192MiB |     22%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [21]:
## Sentiment prediction with the model before fine-tuning
texts = [
    "I feel fantastic",
    "My life is going something wrong",
    "I have not figured out what the chosen title has to do with the movie."
]

# Tokenize
input_tokens = tokenizer(texts, truncation=True, padding=True)
print(input_tokens.keys())


dict_keys(['input_ids', 'attention_mask'])


In [22]:

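# Move the token IDs to the same device as the model and run a forward pass
# Note: attention_mask is not passed here, which triggers the warning below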
outputs = model(torch.tensor(input_tokens['input_ids']).to(device))

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


In [23]:
outputs.keys()

odict_keys(['logits'])

In [24]:
# Build the label dictionary; it must match the encoding from problem 005 (1 = positive, 0 = negative)
label_dict = {0: 'negative', 1: 'positive'}

print([label_dict[i] for i in torch.argmax(outputs['logits'], axis=1).cpu().numpy()])

['positive', 'positive', 'positive']
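
The step from logits to a label can be illustrated without a model. The sketch below assumes the encoding from problem 005 (1 = positive, 0 = negative). The logit values are made up, and `softmax` and `predict` are hypothetical helpers written for this example.

```python
import math

# Index-to-label mapping that follows the encoding used for training (assumption stated above).
ID2LABEL = {0: "negative", 1: "positive"}


def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max keeps exp() numerically stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(batch_logits):
    """Return (label, probability) for the highest-scoring class of each example."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((ID2LABEL[best], probs[best]))
    return results


toy_logits = [[-1.2, 2.3], [0.8, -0.4]]  # made-up values, one row per input text
for label, prob in predict(toy_logits):
    print(label, round(prob, 3))
```

Taking the argmax of the logits gives the same class as taking the argmax of the softmax probabilities, so the softmax step is only needed when a confidence value is wanted.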


In [25]:
## Fine-tuning with Trainer.train
from transformers import Trainer

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()

  2%|▎         | 10/400 [00:08<05:25,  1.20it/s]

{'loss': 0.6915, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.2}


  5%|▌         | 20/400 [00:17<05:15,  1.21it/s]

{'loss': 0.6868, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.4}


  8%|▊         | 30/400 [00:25<05:05,  1.21it/s]

{'loss': 0.6891, 'learning_rate': 3e-06, 'epoch': 0.6}


 10%|█         | 40/400 [00:33<04:57,  1.21it/s]

{'loss': 0.683, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.8}


 12%|█▎        | 50/400 [00:41<04:49,  1.21it/s]

{'loss': 0.6831, 'learning_rate': 5e-06, 'epoch': 1.0}


 15%|█▌        | 60/400 [00:50<04:40,  1.21it/s]

{'loss': 0.6701, 'learning_rate': 6e-06, 'epoch': 1.2}


 18%|█▊        | 70/400 [00:58<04:32,  1.21it/s]

{'loss': 0.6619, 'learning_rate': 7.000000000000001e-06, 'epoch': 1.4}


 20%|██        | 80/400 [01:06<04:25,  1.21it/s]

{'loss': 0.63, 'learning_rate': 8.000000000000001e-06, 'epoch': 1.6}


 22%|██▎       | 90/400 [01:14<04:15,  1.21it/s]

{'loss': 0.5692, 'learning_rate': 9e-06, 'epoch': 1.8}


 25%|██▌       | 100/400 [01:23<04:08,  1.21it/s]

{'loss': 0.4975, 'learning_rate': 1e-05, 'epoch': 2.0}


 28%|██▊       | 110/400 [01:31<03:59,  1.21it/s]

{'loss': 0.4562, 'learning_rate': 1.1000000000000001e-05, 'epoch': 2.2}


 30%|███       | 120/400 [01:39<03:51,  1.21it/s]

{'loss': 0.3588, 'learning_rate': 1.2e-05, 'epoch': 2.4}


 32%|███▎      | 130/400 [01:48<03:46,  1.19it/s]

{'loss': 0.3524, 'learning_rate': 1.3000000000000001e-05, 'epoch': 2.6}


 35%|███▌      | 140/400 [01:56<03:36,  1.20it/s]

{'loss': 0.3749, 'learning_rate': 1.4000000000000001e-05, 'epoch': 2.8}


 38%|███▊      | 150/400 [02:04<03:30,  1.19it/s]

{'loss': 0.3253, 'learning_rate': 1.5e-05, 'epoch': 3.0}


 40%|████      | 160/400 [02:13<03:19,  1.20it/s]

{'loss': 0.2744, 'learning_rate': 1.6000000000000003e-05, 'epoch': 3.2}


 42%|████▎     | 170/400 [02:21<03:14,  1.19it/s]

{'loss': 0.2606, 'learning_rate': 1.7000000000000003e-05, 'epoch': 3.4}


 45%|████▌     | 180/400 [02:30<03:04,  1.19it/s]

{'loss': 0.2421, 'learning_rate': 1.8e-05, 'epoch': 3.6}


 48%|████▊     | 190/400 [02:38<02:54,  1.20it/s]

{'loss': 0.2188, 'learning_rate': 1.9e-05, 'epoch': 3.8}


 50%|█████     | 200/400 [02:46<02:46,  1.20it/s]

{'loss': 0.1909, 'learning_rate': 2e-05, 'epoch': 4.0}


 52%|█████▎    | 210/400 [02:55<02:40,  1.18it/s]

{'loss': 0.0819, 'learning_rate': 2.1e-05, 'epoch': 4.2}


 55%|█████▌    | 220/400 [03:03<02:31,  1.19it/s]

{'loss': 0.1242, 'learning_rate': 2.2000000000000003e-05, 'epoch': 4.4}


 57%|█████▊    | 230/400 [03:11<02:22,  1.19it/s]

{'loss': 0.0571, 'learning_rate': 2.3000000000000003e-05, 'epoch': 4.6}


 60%|██████    | 240/400 [03:20<02:14,  1.19it/s]

{'loss': 0.1319, 'learning_rate': 2.4e-05, 'epoch': 4.8}


 62%|██████▎   | 250/400 [03:28<02:04,  1.20it/s]

{'loss': 0.0911, 'learning_rate': 2.5e-05, 'epoch': 5.0}


 65%|██████▌   | 260/400 [03:36<01:55,  1.21it/s]

{'loss': 0.0392, 'learning_rate': 2.6000000000000002e-05, 'epoch': 5.2}


 68%|██████▊   | 270/400 [03:45<01:47,  1.21it/s]

{'loss': 0.0604, 'learning_rate': 2.7000000000000002e-05, 'epoch': 5.4}


 70%|███████   | 280/400 [03:53<01:39,  1.21it/s]

{'loss': 0.0375, 'learning_rate': 2.8000000000000003e-05, 'epoch': 5.6}


 72%|███████▎  | 290/400 [04:01<01:31,  1.21it/s]

{'loss': 0.0565, 'learning_rate': 2.9e-05, 'epoch': 5.8}


 75%|███████▌  | 300/400 [04:10<01:22,  1.21it/s]

{'loss': 0.0921, 'learning_rate': 3e-05, 'epoch': 6.0}


 78%|███████▊  | 310/400 [04:18<01:14,  1.21it/s]

{'loss': 0.0704, 'learning_rate': 3.1e-05, 'epoch': 6.2}


 80%|████████  | 320/400 [04:26<01:06,  1.20it/s]

{'loss': 0.0037, 'learning_rate': 3.2000000000000005e-05, 'epoch': 6.4}


 82%|████████▎ | 330/400 [04:34<00:57,  1.21it/s]

{'loss': 0.0267, 'learning_rate': 3.3e-05, 'epoch': 6.6}


 85%|████████▌ | 340/400 [04:43<00:50,  1.20it/s]

{'loss': 0.0792, 'learning_rate': 3.4000000000000007e-05, 'epoch': 6.8}


 88%|████████▊ | 350/400 [04:51<00:41,  1.21it/s]

{'loss': 0.0317, 'learning_rate': 3.5e-05, 'epoch': 7.0}


 90%|█████████ | 360/400 [04:59<00:33,  1.20it/s]

{'loss': 0.188, 'learning_rate': 3.6e-05, 'epoch': 7.2}


 92%|█████████▎| 370/400 [05:08<00:25,  1.17it/s]

{'loss': 0.0374, 'learning_rate': 3.7e-05, 'epoch': 7.4}


 95%|█████████▌| 380/400 [05:16<00:16,  1.19it/s]

{'loss': 0.0149, 'learning_rate': 3.8e-05, 'epoch': 7.6}


 98%|█████████▊| 390/400 [05:25<00:08,  1.19it/s]

{'loss': 0.0334, 'learning_rate': 3.9000000000000006e-05, 'epoch': 7.8}


100%|██████████| 400/400 [05:33<00:00,  1.20it/s]

{'loss': 0.0956, 'learning_rate': 4e-05, 'epoch': 8.0}
{'train_runtime': 333.5862, 'train_samples_per_second': 19.185, 'train_steps_per_second': 1.199, 'train_loss': 0.27173578286543487, 'epoch': 8.0}





TrainOutput(global_step=400, training_loss=0.27173578286543487, metrics={'train_runtime': 333.5862, 'train_samples_per_second': 19.185, 'train_steps_per_second': 1.199, 'train_loss': 0.27173578286543487, 'epoch': 8.0})
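
The logged numbers above can be checked by hand. With 800 training examples, a batch size of 16, and 8 epochs, there are 400 optimizer steps, which is fewer than `warmup_steps = 500`. The run therefore ends while the learning rate is still warming up. The sketch below reproduces only the warmup phase of a linear schedule. It assumes the default `TrainingArguments` learning rate of 5e-5, a single device, and no gradient accumulation; both function names are hypothetical.

```python
import math


def total_steps(n_samples, batch_size, epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    return math.ceil(n_samples / batch_size) * epochs


def warmup_lr(step, base_lr=5e-5, warmup_steps=500):
    """Learning rate during linear warmup: rises from 0 to base_lr over warmup_steps."""
    return base_lr * min(step, warmup_steps) / warmup_steps


steps = total_steps(n_samples=800, batch_size=16, epochs=8)
print(steps)
for step in (10, 200, 400):
    print(step, warmup_lr(step))
```

Steps 10, 200, and 400 give roughly 1e-06, 2e-05, and 4e-05, in line with the `learning_rate` values in the log. Setting `warmup_steps` below the total step count would let the schedule reach its peak and then decay.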

In [26]:
# Predictions after fine-tuning
texts = [
    "I feel fantastic",
    "My life is going something wrong",
    "I have not figured out what the chosen title has to do with the movie."
]

input_tokens = tokenizer(texts, truncation=True, padding=True)
outputs = model(torch.tensor(input_tokens['input_ids']).to(device))
label_dict = {0: 'negative', 1: 'positive'}  # matches the encoding from problem 005 (1 = positive, 0 = negative)

print([label_dict[i] for i in torch.argmax(outputs['logits'], axis=1).cpu().numpy()])


['positive', 'negative', 'negative']


### Fine-tuning with PyTorch

In [None]:
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from transformers import DistilBertForSequenceClassification
from transformers import DistilBertTokenizerFast

# 1) Load the pretrained model
