<a href="https://colab.research.google.com/github/yuufong/English-Japanes-Short-Trans/blob/main/EN_JA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#English-Japanese Translation Model

## 1. Installing necessary packages

In [1]:
!pip install datasets transformers sentencepiece 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.6.1-py3-none-any.whl (441 kB)
[K     |████████████████████████████████| 441 kB 14.0 MB/s 
[?25hCollecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
[K     |████████████████████████████████| 5.5 MB 52.8 MB/s 
[?25hCollecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[K     |████████████████████████████████| 1.3 MB 43.7 MB/s 
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[K     |████████████████████████████████| 212 kB 71.2 MB/s 
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
[K     |████████████████████████████████| 115 kB 65.0 MB/s 
Collecting dill<0.3.6
  Downloading dill-0.3.5.1-py2.py3-none-any.whl (95 kB)
[K     |█████████████

## 2. Preparing dataset

In [2]:
from datasets import load_dataset

dataset = load_dataset("snow_simplified_japanese_corpus", "snow_t15")

Downloading builder script:   0%|          | 0.00/6.95k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/8.04k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.83k [00:00<?, ?B/s]

Downloading and preparing dataset snow_simplified_japanese_corpus/snow_t15 (download: 3.47 MiB, generated: 6.88 MiB, post-processed: Unknown size, total: 10.35 MiB) to /root/.cache/huggingface/datasets/snow_simplified_japanese_corpus/snow_t15/1.1.0/3d2b3ae03002b35ba284fd81fe825917859e4365c825af4ab3f10273074c81f6...


Downloading data:   0%|          | 0.00/3.63M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset snow_simplified_japanese_corpus downloaded and prepared to /root/.cache/huggingface/datasets/snow_simplified_japanese_corpus/snow_t15/1.1.0/3d2b3ae03002b35ba284fd81fe825917859e4365c825af4ab3f10273074c81f6. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'original_ja', 'simplified_ja', 'original_en'],
        num_rows: 50000
    })
})

### Split this dataset into a train & a test set

In [4]:
dataset = dataset['train'].train_test_split(0.2)

In [5]:
dataset['train'][0]

{'ID': '11153',
 'original_ja': '辞書でその言葉をみつけなさい。',
 'simplified_ja': '辞書でその言葉を見つけろ。',
 'original_en': 'look up the word in the dictionary .'}

## 2. Tokenizing

Here, I will try to use the tokenizer and pretrained model from MBart large

In [6]:
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ja_XX")

Downloading:   0%|          | 0.00/531 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/649 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

I will create a function to map so that we can tokenize our whole dataset. 

In [7]:
def tokenize_data(batch):
  inputs =[ex for ex in batch['original_en']]
  targets = [ex for ex in batch['simplified_ja']]
  model_inputs = tokenizer(inputs, text_target=targets, max_length = 128, truncation=True)
  return model_inputs

In [8]:
tokenized_dataset = dataset.map(tokenize_data, batched=True, remove_columns=['original_ja','ID'])

  0%|          | 0/40 [00:00<?, ?ba/s]

  0%|          | 0/10 [00:00<?, ?ba/s]

In [9]:
tokenized_dataset 

DatasetDict({
    train: Dataset({
        features: ['simplified_ja', 'original_en', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 40000
    })
    test: Dataset({
        features: ['simplified_ja', 'original_en', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 10000
    })
})

In [10]:
tokenized_dataset['train'][20]

{'simplified_ja': '私の部屋にはテレビがある。',
 'original_en': 'there is a television in my room .',
 'input_ids': [250004, 2685, 83, 10, 113976, 23, 759, 17155, 6, 5, 2],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [250012, 6, 27491, 49218, 2880, 48418, 7794, 30, 2]}

### 3. Creating a performance benchmark

In [12]:
class PerformanceBenchmark:
    def __init__(self, pipeline, dataset, optim_type="facebook/mbart-large-50"):
        self.pipeline = pipeline
        self.dataset = dataset
        self.optim_type = optim_type
        
    def compute_accuracy(self):
        # We'll define this later
        pass    

    def compute_size(self):
        # We'll define this later
        pass

    def time_pipeline(self):
        # We'll define this later
        pass
    
    def run_benchmark(self):
        metrics = {}
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics

In [22]:
import numpy as np
from time import perf_counter

def time_pipeline(self, query="What is the pin number for my account?"):
    """This overrides the PerformanceBenchmark.time_pipeline() method"""
    latencies = []
    # Warmup
    for _ in range(10):
        _ = self.pipeline(query)
    # Timed run
    for _ in range(100):
        start_time = perf_counter()
        _ = self.pipeline(query)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    print(f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}")
    return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}

PerformanceBenchmark.time_pipeline = time_pipeline

In [15]:
import torch
from pathlib import Path

def compute_size(self):
    """This overrides the PerformanceBenchmark.compute_size() method"""
    state_dict = self.pipeline.model.state_dict()
    tmp_path = Path("model.pt")
    torch.save(state_dict, tmp_path)
    # Calculate size in megabytes
    size_mb = Path(tmp_path).stat().st_size / (1024 * 1024)
    # Delete temporary file
    tmp_path.unlink()
    print(f"Model size (MB) - {size_mb:.2f}")
    return {"size_mb": size_mb}

PerformanceBenchmark.compute_size = compute_size

In [28]:
!pip install evaluate sacrebleu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sacrebleu
  Downloading sacrebleu-2.3.1-py3-none-any.whl (118 kB)
[K     |████████████████████████████████| 118 kB 11.1 MB/s 
Collecting portalocker
  Downloading portalocker-2.6.0-py2.py3-none-any.whl (15 kB)
Collecting colorama
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-2.6.0 sacrebleu-2.3.1


In [30]:
import evaluate
#hide_output
sacrebleu = evaluate.load("sacrebleu")

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]

In [31]:
def compute_accuracy(self):
    """This overrides the PerformanceBenchmark.compute_accuracy() method"""
    preds, labels = [], []
    for example in self.dataset:
        pred = self.pipeline(example["original_en"])['translation_text']
        label = example["labels"]
    results = sacrebleu.compute(predictions=preds, references=labels)
    print(f"Sacrebleu score on test set - {results['score']:.3f}")
    return results

PerformanceBenchmark.compute_accuracy = compute_accuracy

### 4. Training
We obtained our inputs as ids for input and label, now let's train it

First we will iniate our model from the pretrained mbart-large-50

In [18]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

Downloading:   0%|          | 0.00/2.44G [00:00<?, ?B/s]

Use DataCollatorForSeq2Seq to create a batch of examples. It will also dynamically pad your text and labels to the length of the longest element in its batch, so they are a uniform length. 

In [19]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

Log in to the hub to upload your model after training

In [20]:
!huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in yo

In [14]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

model_name = "EN-JA_Translation_with_MBart"
training_args = Seq2SeqTrainingArguments(
    output_dir=model_name,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=2,
    disable_tqdm=False,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
/content/EN-JA_Translation_with_MBart is already a clone of https://huggingface.co/yuufong/EN-JA_Translation_with_MBart. Make sure you pull the latest changes with `repo.git_pull()`.


OSError: ignored

### 5. Testing model

In [None]:
!git lfs

git-lfs/2.3.4 (GitHub; linux amd64; go 1.8.3)
Sorry, no usage text found for "git-lfs"


In [None]:
from transformers import pipeline

translator = pipeline("translation", "https://huggingface.co/yuufong/EN-JA_Translation_with_MBart")

  f"Using `from_pretrained` with the url of a file (here {url}) is deprecated and won't be possible anymore in"


Downloading: 0.00B [00:00, ?B/s]

OSError: ignored