<a href="https://colab.research.google.com/github/krooner/how-to-pytorch/blob/main/sequence_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing Input Data for Language Model Training

> I want to feed non-text data into a language model. How should I go about it?


__Import__

In [1]:
import torch

import pandas as pd
import numpy as np
import os
import pickle

from matplotlib import pyplot as plt
import seaborn as sns

__Input Data__
- Each sequence, like a sentence, consists of custom-defined tokens separated by whitespace
- For now, the example uses plain text

In [2]:
text = [
  "Don't speak ill of others.",
  "To speak ill of others is a great crime.",
  "Rather rectify yourself through self-criticism.",
  "In this way, if you rectify yourself, others will follow you.",
  "To speak ill of others gives rise to more problems.",
  "This does not do any good to society.",
  "More than 80 percent people of our country live in villages.",
  "Most of them are poor and illiterate.",
  "Illiteracy is one of the causes of their poverty.",
  "Many of the villagers are landless cultivators.",
  "They cultivate the lands of other people throughout the year.",
  "They get a very small portion of the crops.",
  "They provide all of us with food.",
  "But in want they only starve.",
  "They suffer most.",
  "The situation needs to be changed.",
  "We live in the age of science.",
  "We can see the influence of science everywhere.",
  "Science is a constant companion of our life.",
  "We have made the impossible things possible with the help of science.",
  "Modern civilization is a contribution of science.",
  "Science should be devoted to the greater welfare of mankind.",
  "Rabindranath Tagore got the Nobel Prize in 1913 which is 98 years ago from today.",
  "He was awarded this prize for the translation of the Bengali 'Gitanjali' into English.",
  "This excellent rendering was the poet's own.",
  "In the English version of Gitanjali there are 103 songs."
]

## Processing Order

1. Sentence → List of Tokens: `torchtext.data.utils.get_tokenizer`
2. List of List of Tokens → Vocab: `torchtext.vocab.build_vocab_from_iterator`
3. List of Tokens → List of Indices: `torchtext.transforms.{VocabTransform, AddToken, ToTensor, PadTransform}`

### 1
---
#### Tokenizer - using [`torchtext.data.utils.get_tokenizer`](https://pytorch.org/text/stable/data_utils.html)

> Parameter
> - tokenizer – the name of tokenizer function. __If None, it returns split() function, which splits the string sentence by space.__

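For non-text data, turning the data into sequences already requires two things:
1. Defining what a token is
2. Converting the data into those tokens

Both will have been done in an earlier step, so the text tokenizer's functionality is not needed.  
(For the vocab, we can likewise assume that there are no unseen tokens, i.e. no `[UNK]`.)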
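
As a minimal sketch of what that earlier step might look like, assume a hypothetical series of numeric sensor readings that is binned into three made-up tokens (`LOW`, `MID`, `HIGH`). The function name `readings_to_sequence` and the thresholds are illustrative assumptions, not part of this notebook. The result is a whitespace-separated string, so a plain whitespace split (the behavior the quoted documentation describes for `get_tokenizer(None)`) is enough to recover the token list.

```python
def readings_to_sequence(readings, low=10.0, high=20.0):
    """Map each numeric reading to a custom token and join with spaces."""
    tokens = []
    for value in readings:
        if value < low:
            tokens.append("LOW")
        elif value < high:
            tokens.append("MID")
        else:
            tokens.append("HIGH")
    return " ".join(tokens)

sequence = readings_to_sequence([3.2, 14.8, 25.1, 9.9])
tokens = sequence.split()  # splitting on whitespace recovers the token list
print(sequence)
print(tokens)
```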

### 2
---

#### Vocab - using [`torchtext.vocab.build_vocab_from_iterator`](https://pytorch.org/text/stable/vocab.html#build-vocab-from-iterator)

> Parameter
> - iterator – Iterator used to build Vocab. __Must yield list or iterator of tokens.__

Once the tokenizer has turned each sentence into a list of tokens, wrap those lists in an iterator and pass it in (`yield_tokens(text)`).
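
To make the iterator-to-vocab step concrete, here is a rough plain-Python stand-in for what a vocab does conceptually: count tokens from an iterator of token lists, reserve the first indices for special tokens, and map each remaining token to an integer. The helper `build_simple_vocab` is hypothetical, and its frequency-then-alphabetical ordering is an assumption chosen for illustration, not a statement about how `torchtext` orders its entries.

```python
from collections import Counter

def build_simple_vocab(token_lists, specials=("[PAD]", "[CLS]", "[SEP]", "[MASK]")):
    """Return a token -> index dict with specials first, then tokens by frequency."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Sort by descending frequency, breaking ties alphabetically (illustrative choice).
    ordered = sorted(counter, key=lambda tok: (-counter[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

sentences = ["To speak ill of others", "speak ill of no one"]
# A generator of token lists plays the role of yield_tokens(text).
simple_vocab = build_simple_vocab(s.split() for s in sentences)
indices = [simple_vocab[tok] for tok in "speak ill of others".split()]
print(simple_vocab)
print(indices)
```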

The input format we ultimately want to produce is:  
`[CLS] Token Token Token [SEP] [PAD] [PAD] ... `
- `[CLS]`: short for Classification; __marks the start of a sequence__
- `[SEP]`: short for Separator; __marks the end of a sequence__
- `[PAD]`: short for Padding; __a filler token inserted to equalize lengths when a sequence is shorter than `max_length`__
- `[MASK]`
  - The token used for Masked Language Modeling (MLM)
  - A fixed fraction of tokens is randomly replaced with `[MASK]`, and the model is trained to predict the masked tokens using context from both directions
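
The layout and the masking step described above can be sketched in plain Python. This is a simplified illustration only: the helper names `format_sequence` and `mask_tokens`, the default masking rate, and the fixed seed are assumptions made for the example, and practical MLM recipes usually add further rules on top of plain replacement.

```python
import random

SPECIALS = {"[PAD]", "[CLS]", "[SEP]", "[MASK]"}

def format_sequence(tokens, max_length):
    """Wrap tokens as [CLS] ... [SEP] and right-pad with [PAD] up to max_length."""
    seq = ["[CLS]"] + list(tokens) + ["[SEP]"]
    return seq + ["[PAD]"] * (max_length - len(seq))

def mask_tokens(seq, mask_prob=0.15, seed=0):
    """Replace a random subset of non-special tokens with [MASK].

    Returns the masked sequence and a label list that holds the original
    token at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in seq:
        if tok not in SPECIALS and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

seq = format_sequence("We live in the age of science.".split(), max_length=12)
masked, labels = mask_tokens(seq, mask_prob=0.3)
print(seq)
print(masked)
print(labels)
```

The label list is what a training loop would compare predictions against: only the masked positions carry a target.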

In [4]:
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer(None)

def yield_tokens(data_iter):
  for t in data_iter:
    yield tokenizer(t)

# Alternative: register special tokens up front via `specials=[...]`.
# vocab = build_vocab_from_iterator(yield_tokens(text), specials=["[PAD]"])
# vocab.set_default_index(vocab["[PAD]"])
vocab = build_vocab_from_iterator(yield_tokens(text))
# Reserve indices 0-3 for the special tokens.
vocab.insert_token('[PAD]', 0)
vocab.insert_token('[CLS]', 1)
vocab.insert_token('[SEP]', 2)
vocab.insert_token('[MASK]', 3)

pad_idx, cls_idx, sep_idx, msk_idx = 0, 1, 2, 3

# Longest sequence in the corpus, counted in tokens (special tokens excluded).
max_seq_len = max(len(tokenizer(item)) for item in text)

### 3
---
#### Transforms - using [`torchtext.transforms`](https://pytorch.org/text/stable/transforms.html)

Just as `torchvision` lets you apply transforms such as Crop and Normalize to image data, `torchtext` lets you apply transforms to text data in the same way.

- [`VocabTransform`](https://pytorch.org/text/stable/transforms.html#vocabtransform): List of Tokens → List of Indices (based on vocab)
- [`AddToken`](https://pytorch.org/text/stable/transforms.html#addtoken): List of Indices → `[CLS]` List of Indices `[SEP]`
- [`ToTensor`](https://pytorch.org/text/stable/transforms.html#totensor): `[CLS] ... [SEP]` → `[CLS] ... [SEP] [PAD] ...` (as a tensor)
  - > padding_value (Optional[int]) – Pad value to make each input in the batch of __length equal to the longest sequence in the batch.__
  - Pads each sequence to the longest sequence length within the batch
- [`PadTransform`](https://pytorch.org/text/stable/transforms.html#padtransform): `[CLS] ... [SEP] [PAD] ...` → pads every sequence to the globally longest length (`max_length`)
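
The difference between the two padding steps can be illustrated without `torchtext`. The sketch below uses two hypothetical helpers, `pad_to_batch_longest` and `pad_to_length`, that mimic the behavior described above on plain lists of indices; they are stand-ins for illustration, not the library's implementation.

```python
def pad_to_batch_longest(batch, pad_value=0):
    """Pad every sequence to the longest length found in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (longest - len(seq)) for seq in batch]

def pad_to_length(batch, max_length, pad_value=0):
    """Pad every sequence to a fixed, globally chosen length."""
    return [seq + [pad_value] * (max_length - len(seq)) for seq in batch]

# Two sequences already wrapped as [CLS]=1 ... [SEP]=2.
batch = [[1, 36, 16, 2], [1, 5, 2]]
batch_padded = pad_to_batch_longest(batch)                  # width = batch maximum (4)
global_padded = pad_to_length(batch_padded, max_length=6)   # width = fixed length (6)
print(batch_padded)
print(global_padded)
```

With batch-level padding alone, the width varies from batch to batch; the fixed-length step gives every batch the same width.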

In [5]:
from torchtext.transforms import VocabTransform, PadTransform, ToTensor, AddToken

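input_ids_transform = torchtext.transforms.Sequential(
    VocabTransform(vocab),                 # tokens -> indices
    AddToken(token=cls_idx, begin=True),   # prepend [CLS]
    AddToken(token=sep_idx, begin=False),  # append [SEP]
    ToTensor(padding_value=pad_idx),       # pad to the longest sequence in the batch
    PadTransform(max_length=max_seq_len + 2, pad_value=pad_idx),  # +2 for [CLS] and [SEP]
)

def data_collate_fn(dataset_samples_list):
    # Split each raw string into whitespace-separated tokens.
    arr = [tokenizer(item) for item in dataset_samples_list]

    input_ids = input_ids_transform(arr)
    # Single-segment input: all zeros, with the same integer dtype as input_ids.
    token_type_ids = torch.zeros_like(input_ids)
    # 1 for real tokens, 0 for [PAD] positions.
    attention_mask = (input_ids != pad_idx).to(torch.int32)

    return {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": attention_mask}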

#### Dataset and DataLoader

In [10]:
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, src, tokenizer):
        self.src = src
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx):
        return self.src[idx]

dataset = MyDataset(text, tokenizer)
dataloader = DataLoader(dataset, batch_size=1, collate_fn=data_collate_fn)

In [12]:
print(text[0])
for item in dataloader:
  print(item['input_ids'])
  print(vocab.lookup_tokens(item['input_ids'].squeeze(0)[:7].tolist()))
  break

Don't speak ill of others.
tensor([[  1,  36,  16,  13,   4, 104,   2,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0]])
['[CLS]', "Don't", 'speak', 'ill', 'of', 'others.', '[SEP]']
