In [1]:
!pip install transformers 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling P

## Step 1. Load the data with pandas ✨
Download the data from DACON, then change the first argument of read_csv to your own path to load it.

In [2]:
import pandas as pd 
from transformers import AutoModel, AutoTokenizer 


train = pd.read_csv('drive/MyDrive/dacon_shop_review/train.csv', encoding='utf-8-sig')

train.head(5)

Unnamed: 0,id,reviews,target
0,0,조아요 처음구입 싸게햇어요,2
1,1,생각보다 잘 안돼요 매지 바른지 하루밖에 안됐는데ㅠㅠ 25천원가량 주고 사기 너무 ...,1
2,2,디자인은괜찮은데 상품이 금이가서 교환했는데 두번째받은상품도 까져있고 안쪽에 금이가져...,2
3,3,기전에 이 제품말고 이마트 트레이더스에서만 팔던 프리미엄 제품을 사용했었습니다. 샘...,2
4,4,튼튼하고 손목을 잘 받쳐주네요~,5


In [3]:
test = pd.read_csv('drive/MyDrive/dacon_shop_review/test.csv', encoding='utf-8-sig')

test.head(10)

Unnamed: 0,id,reviews
0,0,채소가 약간 시들어 있어요
1,1,발톱 두껍고 단단한 분들 써도 소용없어요 이 테이프 물렁거리고 힘이없어서 들어 올리...
2,2,부들부들 좋네요 입어보고 시원하면 또 살게요
3,3,이런 1. 8 골드 주라니깐 파란개 오네 회사전화걸어도 받지도 않고 머하자는거임?
4,4,검수도 없이 보내구 불량 배송비 5000원 청구하네요 완전별로 별하나도 아까워요
5,5,흠 마무리가 넘 안좋아요 가격대비 그냥써봅니다
6,6,조금 찌거러져서 왔지만 그냥 써야죠 뭐.. 신경 좀 써주세요.
7,7,잘 빠져요. 새다리들만 쓸수 있을듯해요.그냥 tv볼때 요거 위에다 다리올려놓고 봅니다.
8,8,재구매 가격저령하구요 상품질도 좋으네요~ 또이용하겠습니다
9,9,재구매 아이가 너무 좋아합니다 배송도 빠르고 사은품도 너무 좋네요~~


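### Note: unlike train_data, test_data has no 'target' column.
Because of this, once the model has predicted labels for test_data, you have to write those predictions into sample_submission.csv. Submitting sample_submission.csv to DACON gets the accuracy of your code scored. For this kind of data analysis, Spyder is more convenient than Colab.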
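
A minimal sketch of that submission step. It assumes sample_submission.csv has `id` and `target` columns (an assumption; check the actual file) and that the classifier outputs class indices 0-3 that must be mapped back to the ratings 1, 2, 4, 5. The predicted values here are placeholders, not real model output.

```python
import io
import pandas as pd

# Stand-in for pd.read_csv('sample_submission.csv'); the column names are assumed.
sample_submission = pd.read_csv(io.StringIO("id,target\n0,0\n1,0\n2,0\n"))

# Placeholder class indices (0-3), in the form a 4-class classifier would output.
predicted_classes = [3, 0, 2]

# Map class indices back to the original ratings (1, 2, 4, 5).
index_to_rating = {0: 1, 1: 2, 2: 4, 3: 5}
sample_submission["target"] = [index_to_rating[c] for c in predicted_classes]

# index=False keeps pandas from writing an extra "Unnamed: 0" column.
submission_csv = sample_submission.to_csv(index=False)
print(submission_csv)
```

Passing a file path to `to_csv` instead of leaving it empty would write the file to upload.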

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       25000 non-null  int64 
 1   reviews  25000 non-null  object
 2   target   25000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 586.1+ KB


In [5]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       25000 non-null  int64 
 1   reviews  25000 non-null  object
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


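## Step 2. Load the BERT model parameters with the Transformers library ✨

The Hugging Face Transformers library lets you load model parameters that have already been pretrained.

**[ Tip ]**

1) In your code, start by typing `from transformers import`.
2) Go to https://huggingface.co/models and find the model you want to use.
3) Open the model page and click the Use in Transformers button at the top right.
4) Copy the snippet it shows, like the one below, to load the model.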


```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
```



### Model used: klue/bert-base
https://huggingface.co/klue/bert-base

In [6]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")

model = AutoModel.from_pretrained("klue/bert-base")


Some weights of the model checkpoint at klue/bert-base were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


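### Tokenizer encoding: the encode_plus() function
You could write the code that converts data into BERT's input format yourself, but the Hugging Face tokenizer makes it easier and faster to tokenize BERT inputs. The `encode_plus` function not only converts the data into the input format BERT expects, it also pads the input sentence to the maximum length and returns the result as a dictionary. (Reference: https://han-py.tistory.com/267)

It is usually wrapped in a custom function, as below.

**Official documentation for encode_plus()**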
https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/tokenizer

- **add_special_tokens** (bool, optional, defaults to True) – Whether or not to encode the sequences with the special tokens relative to their model.

- **pad_to_max_length** (bool, optional, defaults to False) – If set to True, the returned sequences will be padded according to the model’s padding side and padding index, up to their max length. If no max length is specified, the padding is done up to the model’s max length. The tokenizer padding sides are handled by the class attribute padding_side which can be set to the following strings: 
  - ’left’: pads on the left of the sequences
  - ’right’: pads on the right of the sequences
  - Defaults to False: no padding.





In [7]:
def bert_tokenizer(text, MAX_LEN):
    # Encode one review into fixed-length BERT inputs.
    # Older examples pass pad_to_max_length=True; that argument is deprecated,
    # so padding='max_length' is used instead.
    encoding = tokenizer.encode_plus(
        str(text),
        add_special_tokens=True,   # add [CLS] and [SEP]
        truncation=True,           # cut inputs longer than MAX_LEN
        max_length=MAX_LEN,
        padding='max_length',      # pad shorter inputs up to MAX_LEN
        return_tensors="pt",       # "pt" = PyTorch tensors
    )

    input_id = encoding["input_ids"]
    attention_mask = encoding["attention_mask"]
    token_type_id = encoding["token_type_ids"]

    return input_id, attention_mask, token_type_id

### Guide to Input IDs, Attention Mask, and Token Type IDs, the concepts that come with BERT's WordPiece tokenizer
- https://huggingface.co/docs/transformers/glossary#attention-mask

In [8]:
sample = train.iloc[33]['reviews']

print(sample)

인터넷으로 사니 저렴하네요 좋습니다


In [9]:
bert_tokenizer(sample, 32)

(tensor([[   2, 4254, 6233, 1233, 2209, 6206, 2205, 2203, 2182, 1560, 2219, 3606,
             3,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
             0,    0,    0,    0,    0,    0,    0,    0]]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0]]),
 tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0]]))
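
The three tensors above can be reproduced by hand to see what each one means. This is a toy sketch with a hypothetical helper `encode`, not the tokenizer's actual implementation; the special-token IDs (2 for [CLS], 3 for [SEP], 0 for padding) are read off the output above.

```python
def encode(token_ids, max_len, cls_id=2, sep_id=3, pad_id=0):
    """Toy version of truncation plus padding='max_length' for one sentence."""
    # Reserve two positions for [CLS] and [SEP], then truncate the rest.
    body = token_ids[: max_len - 2]
    input_ids = [cls_id] + body + [sep_id]
    n_real = len(input_ids)
    n_pad = max_len - n_real
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * n_real + [0] * n_pad
    input_ids = input_ids + [pad_id] * n_pad
    # Single-sentence input: every position belongs to segment 0.
    token_type_ids = [0] * max_len
    return input_ids, attention_mask, token_type_ids

# The 11 word-piece IDs from the output above, without special tokens.
pieces = [4254, 6233, 1233, 2209, 6206, 2205, 2203, 2182, 1560, 2219, 3606]
input_ids, attention_mask, token_type_ids = encode(pieces, 32)
print(input_ids)
print(attention_mask)
```

The 13 ones in the attention mask correspond to [CLS], the 11 word pieces, and [SEP]; the remaining 19 positions are padding.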

In [10]:
print(model)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(32000, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

## Train (=Fine-tuning BERT)

Before training, it helps to study Dataset and DataLoader, PyTorch's tools for serving a dataset in mini-batches.
Reference: https://didu-story.tistory.com/85
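
As a minimal sketch of how the two fit together: a Dataset only has to report its length and return one item by index, and a DataLoader groups those items into mini-batches. The class `ToyReviews` and its contents are hypothetical stand-ins, not the review data.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyReviews(Dataset):
    """Hypothetical dataset: each item is a (features, label) pair."""

    def __init__(self, n_items=10, n_features=4):
        self.x = torch.arange(n_items * n_features, dtype=torch.float32).reshape(n_items, n_features)
        self.y = torch.arange(n_items) % 4   # four classes, like the rating task

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

# 10 items with batch_size=4 give batches of 4, 4, and 2 items.
loader = DataLoader(ToyReviews(), batch_size=4, shuffle=False)
batch_shapes = [(xb.shape, yb.shape) for xb, yb in loader]
print(batch_shapes)
```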

### Dataset Class

Label information given on DACON:

Using the provided shopping-mall review dataset,

classify the product ratings (1, 2, 4, or 5 points)!

In [11]:


train['target'].value_counts()

5    10000
2     8000
1     4500
4     2500
Name: target, dtype: int64
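
The counts above are imbalanced, which gives a useful reference point: a model that always predicts the most frequent rating already gets a fixed share of the rows right. A short worked calculation using only the counts printed above:

```python
# Class counts copied from the value_counts() output above.
counts = {5: 10000, 2: 8000, 1: 4500, 4: 2500}

total = sum(counts.values())
shares = {rating: n / total for rating, n in counts.items()}

# Accuracy of always predicting the most frequent rating.
majority_rating = max(counts, key=counts.get)
baseline_accuracy = counts[majority_rating] / total
print(shares, majority_rating, baseline_accuracy)
```

Any fine-tuned model should be judged against this majority-class baseline rather than against zero.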

In [12]:
import torch
from torch.utils.data import Dataset 
import numpy as np


labels = {1:0,
          2:1,
          4:2,
          5:3
          }

class Dataset(Dataset):

    def __init__(self, df):

        self.labels = [labels[label] for label in df['target']]
        self.texts = [tokenizer(text,padding='max_length', max_length = 128, truncation=True, return_tensors="pt") for text in df['reviews']]

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        # Fetch a batch of labels
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        # Fetch a batch of inputs
        return self.texts[idx]

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y

In [13]:

#SPLIT: Train, Validation 
import numpy as np
df_train, df_val = np.split(train.sample(frac=1, random_state=42),[int(.85*len(train))])
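
The split above shuffles the rows and cuts them at the 85% mark. The same idea can be written with pandas alone, which may be preferable because calling `np.split` on a DataFrame can emit a deprecation warning on newer pandas versions. A sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame standing in for the training data (hypothetical values).
df = pd.DataFrame({"id": range(20), "target": [1, 2, 4, 5] * 5})

shuffled = df.sample(frac=1, random_state=42)   # shuffle all rows reproducibly
cut = int(0.85 * len(shuffled))                 # index where the validation part starts
toy_train, toy_val = shuffled.iloc[:cut], shuffled.iloc[cut:]
print(len(toy_train), len(toy_val))
```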



In [14]:


from torch import nn #neural network 
#from transformers import BertModel 


class BERT(nn.Module):

  def __init__(self, dropout=0.1): 

    super(BERT, self).__init__() # run the __init__() inherited from nn.Module

    self.bert = model 
    self.dropout = nn.Dropout(dropout)
    self.linear = nn.Linear(768, 4) # 768 = embedding vector size, 4 = number of output labels
    self.relu = nn.ReLU() # ReLU is the baseline activation function; another one can be used instead

  def forward(self, input_id, mask):

    _, pooled_output = self.bert(input_ids = input_id, attention_mask=mask, return_dict=False)
    dropout_output = self.dropout(pooled_output)
    linear_output = self.linear(dropout_output)
    final_layer = self.relu(linear_output)

    return final_layer
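
To check the shapes in this head without loading BERT, the pooled output can be replaced by a random tensor. This is only a sketch: `pooled_output` is a stand-in for what `self.bert` returns, and the values are random.

```python
import torch
from torch import nn

torch.manual_seed(0)
pooled_output = torch.randn(16, 768)   # stand-in for BERT's pooled output, (batch, hidden)

# Same layer order as the class above: dropout -> linear -> ReLU.
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(768, 4), nn.ReLU())

logits = head(pooled_output)                   # (batch, 4): one score per rating class
labels = torch.randint(0, 4, (16,))            # class indices 0-3
loss_value = nn.CrossEntropyLoss()(logits, labels)
predictions = logits.argmax(dim=1)             # predicted class index per row
print(logits.shape, loss_value.item(), predictions.shape)
```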


In [15]:


from torch.optim import Adam 
from tqdm import tqdm # Work-progress bar on Python

def train(model, train_data, val_data, learning_rate, epochs):

  train, val = Dataset(train_data), Dataset(val_data)

  train_dataloader = torch.utils.data.DataLoader(train, batch_size=16, shuffle=True)
  val_dataloader = torch.utils.data.DataLoader(val, batch_size=16)

  use_cuda = torch.cuda.is_available() 
  device = torch.device("cuda" if use_cuda else "cpu")

  loss = nn.CrossEntropyLoss() 
  optimizer = Adam(model.parameters(), lr = learning_rate)

  if use_cuda:

    model = model.cuda() 
    loss = loss.cuda() 

  for epoch_num in range(epochs):
    model.train()  # enable dropout for the training pass
    total_acc_train = 0 
    total_loss_train = 0 

    for train_input, train_label in tqdm(train_dataloader):

      train_label = train_label.to(device)
      mask = train_input['attention_mask'].to(device)
      input_id = train_input['input_ids'].squeeze(1).to(device)

      output = model(input_id, mask)

      batch_loss = loss(output, train_label)
      total_loss_train += batch_loss.item() 

      acc = (output.argmax(dim=1) == train_label).sum().item() 
      total_acc_train += acc 

      model.zero_grad()
      batch_loss.backward()
      optimizer.step() 

    total_acc_val = 0
    total_loss_val = 0 

    model.eval()  # disable dropout while validating
    with torch.no_grad():


      for val_input, val_label in val_dataloader:

        val_label = val_label.to(device)
        mask = val_input['attention_mask'].to(device)
        input_id = val_input['input_ids'].squeeze(1).to(device)

        output = model(input_id, mask)

        batch_loss = loss(output, val_label)
        total_loss_val += batch_loss.item() 

        acc = (output.argmax(dim=1) == val_label).sum().item() 
        total_acc_val += acc 
      
      print(
          f'Epochs: {epoch_num +1} | Train Loss: {total_loss_train / len(train_data): .3f}\
          | Train Accuracy: {total_acc_train / len(train_data): .3f} \
          | Val Loss: {total_loss_val / len(val_data): .3f} \
          | Val Accuracy: {total_acc_val / len(val_data): .3f}'
      )


In [None]:

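EPOCHS = 5
# BERT() wraps the pretrained `model` loaded in In [6]. The next line rebinds `model`
# to the classifier, so re-run In [6] before re-running this cell.
model = BERT() 
LR = 3e-5 

train(model, df_train, df_val, LR, EPOCHS)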


100%|██████████| 1329/1329 [04:33<00:00,  4.87it/s]


Epochs: 1 | Train Loss:  0.049          | Train Accuracy:  0.673           | Val Loss:  0.047           | Val Accuracy:  0.683


100%|██████████| 1329/1329 [04:26<00:00,  4.99it/s]


Epochs: 2 | Train Loss:  0.039          | Train Accuracy:  0.733           | Val Loss:  0.048           | Val Accuracy:  0.681


  7%|▋         | 97/1329 [00:19<04:06,  5.00it/s]
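
A note on reading the printed losses: `total_loss_train` adds up one mean loss per batch, but the print statement divides by the number of rows, so the printed value is roughly the per-batch cross-entropy divided by the batch size. A short worked calculation, using only numbers that appear above (the printed loss is itself rounded, so the result is approximate):

```python
n_train = int(0.85 * 25000)    # rows in df_train
n_batches = 1329               # batches per epoch, from the progress bar above
printed_train_loss = 0.049     # epoch 1 train loss as printed above

# Undo the division by rows and divide by the number of batches instead.
mean_batch_loss = printed_train_loss * n_train / n_batches
print(round(mean_batch_loss, 3))
```

This puts the epoch 1 cross-entropy at about 0.78 per batch.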

In [None]:
def evaluate(model, test_data): 

   test = Dataset(test_data)

   test_dataloader = torch.utils.data.DataLoader(test, batch_size =16)

   use_cuda = torch.cuda.is_available() 
   device = torch.device("cuda" if use_cuda else "cpu")

   if use_cuda: 

     model = model.cuda()

   model.eval()  # disable dropout for evaluation
   total_acc_test = 0 
   with torch.no_grad():

     for test_input, test_label in test_dataloader: 

       test_label = test_label.to(device)
       mask = test_input['attention_mask'].to(device) 
       input_id = test_input['input_ids'].squeeze(1).to(device) 

       output = model(input_id, mask)

       acc = (output.argmax(dim=1) == test_label).sum().item() 
       total_acc_test += acc
   print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')

In [None]:
# `test` has no 'target' column, so evaluate() can only score labeled data such as df_val
evaluate(model, df_val)
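
Since the DACON test set has no labels, what it needs is prediction rather than accuracy. Below is a sketch of such a step with a hypothetical helper `predict`; `DummyClassifier` is a stand-in that only shares the `(input_id, mask)` call signature of the classifier above, so the block runs without BERT.

```python
import torch
from torch import nn

def predict(model, input_ids, mask, batch_size=16):
    """Return predicted class indices for unlabeled inputs (hypothetical helper)."""
    model.eval()                      # turn off dropout
    predictions = []
    with torch.no_grad():             # no gradients needed for inference
        for start in range(0, len(input_ids), batch_size):
            ids = input_ids[start:start + batch_size]
            m = mask[start:start + batch_size]
            logits = model(ids, m)
            predictions.extend(logits.argmax(dim=1).tolist())
    return predictions

class DummyClassifier(nn.Module):
    """Stand-in model: takes (input_id, mask) and returns 4 scores per row."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 4)

    def forward(self, input_id, mask):
        return self.linear(input_id.float() * mask)

torch.manual_seed(0)
toy_ids = torch.randint(0, 100, (40, 8))   # 40 fake sequences of length 8
toy_mask = torch.ones(40, 8)
toy_predictions = predict(DummyClassifier(), toy_ids, toy_mask)
print(len(toy_predictions))
```

The returned indices are in the 0-3 range and still have to be mapped back to the ratings 1, 2, 4, 5 before submission.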