<a href="https://colab.research.google.com/github/jdasam/aat3020-2023/blob/main/notebooks/4_Machine_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Translation

In [1]:
'''
Download dataset (originally from NIA AI-Hub)
'''

!gdown 1CpsqOuuuB3I_PG5DbuqH1ssCFVerU46g

Downloading...
From: https://drive.google.com/uc?id=1CpsqOuuuB3I_PG5DbuqH1ssCFVerU46g
To: /content/nia-aihub-korean-english.zip
100% 290M/290M [00:06<00:00, 44.0MB/s]


In [2]:
!unzip -q nia-aihub-korean-english.zip

In [4]:
from pathlib import Path
import pandas as pd

dataset_dir = Path('nia_korean_english')
data_list = sorted(list(dataset_dir.glob('*.xlsx')))
for path in data_list[:1]:
  df = pd.read_excel(path)
  kor_text_path = path.parent / (path.stem+'_kor.txt') 
  eng_text_path = path.parent / (path.stem+'_eng.txt') 
  with open(kor_text_path, 'w', encoding='utf8') as f:
      f.write('\n'.join(df['원문']))
  with open(eng_text_path, 'w', encoding='utf8') as f:
      f.write('\n'.join(df['번역문']))

KeyboardInterrupt: ignored

In [5]:
data_list

[PosixPath('nia_korean_english/1_구어체(1).xlsx')]

In [6]:
import pandas as pd

dfs = [pd.read_excel(path) for path in data_list[:1]]

In [7]:
df = pd.concat(dfs, axis=0)

In [16]:
with open("nia_korean_english/1_구어체(1)_kor.txt", 'w', encoding='utf8') as f:
    f.write('\n'.join(df['원문']))


In [10]:
df['원문'][10000:10050], df['번역문'][10000:10050]

(10000          개, 돌고래류, 원숭이, 앵무새 일련의 음성 명령 또는 단어를 배울 수 있다.
 10001                     개가 계속 딸꾹질을 한다면 문제가 있는 것이 틀림없습니다.
 10002                                      개가 그걸 좋아할 것 같아.
 10003                                      개가 정말 예쁘게 생겼네요.
 10004                            개가 지저분하게 해 놓은 것을 청소할 거예요.
 10005                           개가 짖는 소리 때문에 나는 방금 잠에서 깼어.
 10006    개가 하는 행동의 의미를 알고 있는 것은 반려견과 서로를 이해할 수 있는 하나의 방...
 10007         개개인들이 가까이 서로 붙어있으면, 그들은 서로 소통하고 많은 정보를 나눕니다.
 10008    개개인의 사망 시점을 예측하기는 어려워도 어느 집단에서 일정 기간의 평균 사망자 수...
 10009      개개인의 존재가 존중된 후에 제대로 된 society가 이루어질 수 있다고 생각해요.
 10010    개교 이래 100년 만에 DJ들과 빠른 비트의 EDM 음악, 현란한 조명들은 저에게...
 10011                     개구리는 뱀이 개구리를 먹는다는 것을 알고 어떻게 했나요?
 10012                         개구리로 변한 그는 사람들로부터 괴롭힘을 당했어요.
 10013    개구쟁이 성격을 지닌 캐릭터로 연령대가 10대 후반에서 20대 초반으로 사춘기를 겪...
 10014                 개구쟁이처럼만 지낼 것 같아서 걱정이었는데 사진 보니 자랑스러워.
 10015                                개국 기념일 날에는 불꾳 놀이를 해요.
 10016         개나 소나 말고, 보고 싶은 것만 보고 보여주

## Huggingface Tokenizer

In [18]:
!pip install transformers tokenizers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m56.8 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m30.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 transformers-4.28.1


In [17]:
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer(strip_accents=False, lowercase=False)

corpus_file   =  [str(path.parent / (path.stem + '_kor.txt')) for path in data_list[:2]]
# output_dir   = Path('hugging_kor_%d'%(vocab_size))
en_corpus_file   =  [str(path.parent / (path.stem + '_eng.txt')) for path in data_list[:2]]
# output_dir   = Path('hugging_eng_%d'%(vocab_size))

vocab_size    = 32000  # Number of maximum size of the vocabulary
limit_alphabet= 6000   
output_dir    = Path('hugging_kor_%d'%(vocab_size))
en_output_dir = Path('hugging_eng_%d'%(vocab_size))
output_dir.mkdir(exist_ok=True)
en_output_dir.mkdir(exist_ok=True)
min_frequency = 5 

tokenizer.train(files=corpus_file,
               vocab_size=vocab_size,
               min_frequency=min_frequency,
               limit_alphabet=limit_alphabet, 
               show_progress=True)

tokenizer.save_model(str(output_dir))

en_tokenizer = BertWordPieceTokenizer(strip_accents=False, lowercase=False)
en_tokenizer.train(files=en_corpus_file,
                vocab_size=vocab_size,
                min_frequency=min_frequency,
                limit_alphabet=limit_alphabet,
                show_progress=True)
en_tokenizer.save_model(str(en_output_dir))



['hugging_eng_32000/vocab.txt']

In [14]:
corpus_file

['nia_korean_english/1_구어체(1)_kor.txt']

In [19]:
from transformers import BertTokenizerFast

tokenizer_src = BertTokenizerFast.from_pretrained('hugging_kor_32000',
                                                       strip_accents=False,
                                                       lowercase=False) 
tokenizer_tgt = BertTokenizerFast.from_pretrained('hugging_eng_32000',
                                                       strip_accents=False,
                                                       lowercase=False) 

tokenized_data = tokenizer_src(df['원문'].iloc[10])
print(tokenizer_src.decode(tokenized_data['input_ids']))

[CLS] 나는 친구에게 그 철학자의 책을 선물해 주겠다고 말했습니다. [SEP]


In [21]:
tokenized_data['input_ids']

[2, 3390, 5817, 245, 12734, 4135, 3972, 15943, 26023, 6168, 15, 3]

In [23]:
tokenized_ids = tokenizer_src("나는 서강대학교에 다닙니다")['input_ids']
tokenized_ids

[2, 3390, 974, 2346, 11185, 10758, 3]

In [24]:
tokenizer_src.decode(tokenized_ids)

'[CLS] 나는 서강대학교에 다닙니다 [SEP]'

## Divide Train / Validate/ Test Set
- using `np.random.choice`
    - To always get the same random shuffling result, you have to use `np.random.seed()`

In [25]:
df

Unnamed: 0,SID,원문,번역문
0,1,'Bible Coloring'은 성경의 아름다운 이야기를 체험 할 수 있는 컬러링 ...,Bible Coloring' is a coloring application that...
1,2,씨티은행에서 일하세요?,Do you work at a City bank?
2,3,푸리토의 베스트셀러는 해외에서 입소문만으로 4차 완판을 기록하였다.,"PURITO's bestseller, which recorded 4th rough ..."
3,4,11장에서는 예수님이 이번엔 나사로를 무덤에서 불러내어 죽은 자 가운데서 살리셨습니다.,In Chapter 11 Jesus called Lazarus from the to...
4,5,"6.5, 7, 8 사이즈가 몇 개나 더 재입고 될지 제게 알려주시면 감사하겠습니다.",I would feel grateful to know how many stocks ...
...,...,...,...
199995,199996,나는 먼저 청소기로 바닥을 밀었어요.,"First of all, I vacuumed the floor."
199996,199997,나는 먼저 팀 과제를 하고 놀러 갔어요.,I did the team assignment first and went out t...
199997,199998,나는 비 같은 멋진 연예인을 좋아해요.,I like cool entertainer like Rain.
199998,199999,나는 멋진 자연 경치를 보고 눈물을 흘렸어.,I cried seeing the amazing scenery.


In [26]:
len(df)

200000

In [27]:
df.iloc[1]

SID                              2
원문                    씨티은행에서 일하세요?
번역문    Do you work at a City bank?
Name: 1, dtype: object

In [32]:
class Dataset:
  def __init__(self, df, src_tokenizer, tgt_tokenizer):
    self.data = df
    self.src_tokenizer = src_tokenizer
    self.tgt_tokenizer = tgt_tokenizer
  
  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    selected_row = self.data.iloc[idx]
    source = selected_row['원문']
    target = selected_row['번역문']

    source_enc = self.src_tokenizer(source)['input_ids']
    target_enc = self.tgt_tokenizer(target)['input_ids']

    return source_enc, target_enc[:-1], target_enc[1:]
  
dataset = Dataset(df, tokenizer_src, tokenizer_tgt)

dataset[1]

([2, 29361, 3393, 1274, 3621, 32, 3],
 [2, 336, 267, 425, 339, 68, 1619, 1601, 34],
 [336, 267, 425, 339, 68, 1619, 1601, 34, 3])

## Define Dataset
- Each datasample has to return source sentence and target sentence
- You need a Tokenizer to get the tokenized result


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

## Define Collate function
- After implementing Dataset, we have to declare DataLoader that groups several training samples as a single batch
- However, we cannot batchify the melodies in straightforward way, because the length of each melody is different
- In this problem, you will learn about how to handle sequences of different length as a batch


In [None]:

'''
This cell will make error, because the length of each sample is different to each other
'''

train_loader = DataLoader(train_set, batch_size=8)
batch = next(iter(train_loader))


In [None]:
'''
To handle that problem, you have to make your collate function 
'''
def your_collate_function(raw_batch):
  '''
  You can make your own function to handle the batch
  '''
  
  return raw_batch[0] # This returns the first melody of each batch. So it will avoid the error, but it doesn't do proper batchifying

batch_size = 8
raw_batch = [train_set[i] for i in range(batch_size)] # This is the input for the collate function
batch = your_collate_function(raw_batch)

'''
This is what the 'collate_fn' does in DataLoader
'''

train_loader = DataLoader(train_set, batch_size=batch_size, collate_fn=your_collate_function)
batch_by_loader = next(iter(train_loader))

#### Pad Sequence and Pack Sequence
In PyTorch, there are two ways to batchify a group of sequence with different length.
- `torch.nn.utils.rnn.pad_sequence`
    - This function takes list of tensors with different length and return padded sequence
    - Padding is adding some constant number as a PAD token to match the length of short sequence to the maximum length
        - e.g. If there are sequence of length (3,7,4), we can add 4 zeros to sequence with length 3, 3 zeros to sequence with length 4 to make them length 7
    - In default, we use 0 for padding (zero padding)
    - The result 
- `torch.nn.utils.rnn.pack_sequence`
    - pad_sequence 

Cells below show the example of `pad_sequence`

In [None]:
from torch.nn.utils.rnn import pad_sequence, pack_sequence, PackedSequence
short = torch.arange(3, -1, -1).float() # [3, 2, 1, 0]
long = torch.arange(27,19, -1).float()
middle = torch.arange(15,9, -1).float()

pad_sequence([short, long, middle], batch_first=False)  # T x N 

In [None]:
# Default value of batch_first in pad_sequence is False.
# So you have to always be careful not to miss batch_first=True in pad_sequence, if you use batch_first=True for your RNN layer.
pad_sequence([short, long, middle], batch_first=True)  # N x T 

1) However, the problem is that you can't figure out whether the 0 at the end of each sequence is a padded one, or was included in the original sequence
- e.g. `[2, 3, 4, 3, 0]` becomes `[ 2,  3,  4,  5,  0,  0,  0]`. Now we don't know how many zeros were added for padding

2) Also, if you run RNN for this padded sequence, RNN will calculate for the padded part also.
- RNN doesn't know whether it is padded data, or existing data
- This makes computation slower

3) If you want to use bi-directional, which also reads the sequence from backward, paddings can make the result different.

To solve this issue, we use PackedSequence, by using `pack_sequence`/

In [None]:
packed_sequence = pack_sequence([short, long, middle], enforce_sorted=False)
packed_sequence

`PackedSequence` has `data` and `batch_sizes`
- `data` contains the flattened value of given batch
    - To optimize the computation, the sequences have to be sorted by descending of length
- `batch_sizes` represents how many valid batch sample exists for each time step
    - `[3, 3, 3, 2, 2, 1, 1]` means that there are 3 sequences for first three time steps, and then 2 sequences for next two steps, and then only 1 sequence for next two steps.
- `sorted_indices` shows how the sorted sequences can be converted to original order.
    - `[1,2,0]` means that 
        - the 0th sequence in the sorted sequences (the longest one) was indexed as 1 in the original input batch
        - the 1st sequence in the sorted sequences (`middle`) was indexed as 2 in the original input batch
        - the 2nd sequence in the sorted sequences (`short`) was index as 0 in the original input batch
- `unsorted_indices` shows how the original sequences are sorted.
    - `[2,0,1]` means that
        - the 0th sequence in the original input was sorted as 2nd in the sorted sequences

In [None]:
rnn_layer = nn.GRU(1, 1)
packed_sequence = pack_sequence([short.unsqueeze(1), long.unsqueeze(1), middle.unsqueeze(1)], enforce_sorted=False)
out, last_hidden = rnn_layer(packed_sequence)

print(f"Type of output of RNN for PackedSequence: {type(out)}")
print(f"Type of last_hidden of RNN for PackedSequence: {type(last_hidden)}")

- RNN or its family of PyTorch can automatically handle `PackedSequence`
- However, other layers like `nn.Embedding` or `nn.Linear` cannot take `PackedSequence` as its input
- There are two ways to feed `PackedSequence` to these layers
    - First, convert PackedSequence to ordinary torch.Tensor by `torch.nn.utils.rnn.pad_packed_sequence`
        - This will convert PackedSequence to a tensor of sequneces with same length but different padding
    - The other way is to feed only PackedSequence.data, and then declaring new PackedSequence with the output as `data`

In [None]:
'''
This will make error, because other layers cannot handle PackedSequence
'''
test_linear_layer = nn.Linear(in_features=1, out_features=2)
test_linear_layer(packed_sequence)

In [None]:
'''
One way to to this is using torch.nn.utils.rnn.pad_packed_sequence to convert PackedSequence to ordinary tensor
'''

from torch.nn.utils.rnn import pad_packed_sequence
padded_sequence, batch_lengths = pad_packed_sequence(packed_sequence)
print(f'The padded sequence generated from packed sequence (squeezed for printing): \n {padded_sequence.squeeze()}')
print(f'"pad_packed_sequence" also returns "batch_lengths", to clarify the original length before the padding: \n {batch_lengths}')



In [None]:
'''
Now you can feed padded sequence to linear layer.
'''

linear_output = test_linear_layer(padded_sequence)
print(f"Output of feeding padded_sequence to a linear layer: {linear_output}")
print("Caution that it returns non-zero values for timestep with zero padding, because linear layer has a bias")

In [None]:
'''
You can make the output as a PackedSequence, by using torch.nn.utils.rnn.pack_padded_sequence
'''
from torch.nn.utils.rnn import pack_padded_sequence
re_packed_sequence = pack_padded_sequence(linear_output, batch_lengths, enforce_sorted=False)
re_packed_sequence

In [None]:
'''
Another way to do it is using PackedSequence.data
'''

linear_out_pack = test_linear_layer(packed_sequence.data)
packed_sequence_after_linear = PackedSequence(linear_out_pack, packed_sequence.batch_sizes, packed_sequence.sorted_indices, packed_sequence.unsorted_indices)
packed_sequence_after_linear

## Define Model
![image](https://raw.githubusercontent.com/tensorflow/nmt/master/nmt/g3doc/img/seq2seq.jpg)

## Define Trainer

In [None]:
from tqdm import tqdm

class Trainer:
  def __init__(self, model, optimizer, loss_fn, train_loader, valid_loader, device):
    self.model = model
    self.optimizer = optimizer
    self.loss_fn = loss_fn
    self.train_loader = train_loader
    self.valid_loader = valid_loader
    
    self.model.to(device)
    
    self.best_valid_accuracy = 0
    self.device = device
    
    self.training_loss = []
    self.validation_loss = []
    self.validation_acc = []

  def save_model(self, path='imdb_sentiment_model.pt'):
    torch.save({'model':self.model.state_dict(), 'optim':self.optimizer.state_dict()}, path)
    
  def train_by_num_epoch(self, num_epochs):
    for epoch in tqdm(range(num_epochs)):
      self.model.train()
      for batch in self.train_loader:
        loss_value = self._train_by_single_batch(batch)
        self.training_loss.append(loss_value)
      self.model.eval()
      validation_loss, validation_acc = self.validate()
      self.validation_loss.append(validation_loss)
      self.validation_acc.append(validation_acc)
      
      if validation_acc > self.best_valid_accuracy:
        print(f"Saving the model with best validation accuracy: Epoch {epoch+1}, Acc: {validation_acc:.4f} ")
        self.save_model('imdb_sentiment_model_best.pt')
      else:
        self.save_model('imdb_sentiment_model_last.pt')
      self.best_valid_accuracy = max(validation_acc, self.best_valid_accuracy)

      
  def _train_by_single_batch(self, batch):
    '''
    This method updates self.model's parameter with a given batch
    
    batch (tuple): (batch_of_input_text, batch_of_label)
    
    You have to use variables below:
    
    self.model (SentimentModel/torch.nn.Module): A neural network model
    self.optimizer (torch.optim.adam.Adam): Adam optimizer that optimizes model's parameter
    self.loss_fn (function): function for calculating BCE loss for a given prediction and target
    self.device (str): 'cuda' or 'cpu'

    output: loss (float): Mean binary cross entropy value for every sample in the training batch
    The model's parameters, optimizer's steps has to be updated inside this method

    TODO: Complete this method 
    '''

    
    return

    
  def validate(self, external_loader=None):
    '''
    This method calculates accuracy and loss for given data loader.
    It can be used for validation step, or to get test set result
    
    input:
      data_loader: If there is no data_loader given, use self.valid_loader as default.
      
    
    output: 
      validation_loss (float): Mean Binary Cross Entropy value for every sample in validation set
      validation_accuracy (float): Mean Accuracy value for every sample in validation set
      
    TODO: Complete this method 

    '''
    
    ### Don't change this part
    if external_loader and isinstance(external_loader, DataLoader):
      loader = external_loader
      print('An arbitrary loader is used instead of Validation loader')
    else:
      loader = self.valid_loader
      
    self.model.eval()
    
    '''
    Write your code from here, using loader, self.model, self.loss_fn.
    '''