<a href="https://colab.research.google.com/github/noodlepopllc/LearnVietnamese/blob/main/Colab/Translate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Imports**

First, import the modules needed:
 - torch https://pytorch.org/
 - transformers https://github.com/huggingface/transformers

The model we are going to use is based on T5 https://huggingface.co/docs/transformers/model_doc/t5

In [10]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

##**Get Device**

Check whether CUDA is available for hardware acceleration; otherwise, use the CPU for inference.

In [11]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

cpu


##**Load the Model**

The model we are using comes from VietAI. It is an open-source T5 translator between English and Vietnamese. It is quite fast and runs well even on CPU.

https://huggingface.co/VietAI/envit5-translation



In [12]:
model_name = "VietAI/envit5-translation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

##**Translation**

First we need a list of inputs. This model can handle both English to Vietnamese and Vietnamese to English. Each item in the list must start with either vi: or en: to tell the model the input language.

Next, the tokenizer takes the inputs and converts them into PyTorch tensors, which are then copied to the device. The tokens must be in the same memory as the model. Finally, the opposite has to take place: the output tokens must be decoded back into words. The generated output has the same form as the input, prefixed with vi: or en: depending on the language.
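
The prefix convention described above can be handled with two small helpers. This is only a sketch: `add_prefix` and `strip_prefix` are hypothetical names introduced here for illustration, not part of the model or of transformers.

```python
def add_prefix(texts, lang):
    """Prepend the language tag ('en' or 'vi') that the model expects on each input."""
    if lang not in ("en", "vi"):
        raise ValueError(f"unsupported language tag: {lang!r}")
    return [f"{lang}: {text}" for text in texts]


def strip_prefix(text):
    """Split a tagged string such as 'en: Hello world' into (tag, sentence)."""
    tag, sep, rest = text.partition(": ")
    if sep and tag in ("en", "vi"):
        return tag, rest
    return None, text  # no recognised tag found, return the text unchanged


print(add_prefix(["Hello world"], "en"))       # ['en: Hello world']
print(strip_prefix("vi: Xin chào thế giới"))   # ('vi', 'Xin chào thế giới')
```

The list returned by `add_prefix` can be passed straight to the tokenizer, and `strip_prefix` can be applied to each decoded output to drop the tag.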


##**Example of Vietnamese to English**

In [13]:
inputs = ['vi: Xin chào thế giới']
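# Tokenize, move the token IDs to the model's device, then generate
input_ids = tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to(device)
outputs = model.generate(input_ids, max_length=512)
# Decode the output tokens back into text
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))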

['en: Hello world']


##**Example of English to Vietnamese**

In [14]:
inputs = ["en: Hello world"]
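# Tokenize, move the token IDs to the model's device, then generate
input_ids = tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to(device)
outputs = model.generate(input_ids, max_length=512)
# Decode the output tokens back into text
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))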

['vi: Xin chào thế giới']
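
Both example cells repeat the same three steps, so they could be wrapped in one function. The sketch below is an assumption about how one might do that, and `translate` is a hypothetical name. Unlike the cells above, it also passes the attention mask to `generate`, which matters once a batch holds sentences of different lengths and the tokenizer has to pad the shorter ones.

```python
def translate(texts, model, tokenizer, device="cpu", max_length=512):
    """Translate a list of 'en: ...' / 'vi: ...' strings in one batch.

    `model` and `tokenizer` are the objects loaded earlier in the notebook.
    """
    # Tokenize the batch and move every tensor to the model's device
    encoded = tokenizer(texts, return_tensors="pt", padding=True).to(device)
    # The attention mask tells the model which positions are padding
    outputs = model.generate(
        input_ids=encoded.input_ids,
        attention_mask=encoded.attention_mask,
        max_length=max_length,
    )
    # Convert the generated token IDs back into text
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the model loaded, it could be called as `translate(["en: Hello world", "vi: Xin chào thế giới"], model, tokenizer, device)`.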


##**Free Memory**

Import the garbage collector, delete the model and tokenizer, run the garbage collector to free memory, and empty the CUDA cache if it is in use.

It is not clear whether model.cpu() is actually necessary.

In [15]:
import gc

model.cpu()
del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()
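
The reason for `del` followed by `gc.collect()` can be illustrated without the real model. Python only frees an object once nothing refers to it, and an object caught in a reference cycle waits for the garbage collector. The sketch below uses a hypothetical stand-in class, `FakeModel`, and a weak reference to watch the object disappear. It says nothing about GPU memory, which is what `torch.cuda.empty_cache()` addresses.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large model object (not the real model)."""

    def __init__(self):
        # A reference cycle: reference counting alone cannot free this object
        self.self_ref = self


def release_demo():
    obj = FakeModel()
    probe = weakref.ref(obj)  # lets us observe the object without keeping it alive
    del obj                   # drop our only name for it
    alive_before = probe() is not None
    gc.collect()              # the collector breaks the cycle and frees the object
    alive_after = probe() is not None
    return alive_before, alive_after


before, after = release_demo()
print("alive after del:", before)
print("alive after gc.collect():", after)
```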