<a href="https://colab.research.google.com/github/noodlepopllc/LearnVietnamese/blob/main/Colab/Translate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Imports**

First, import the modules needed:
 - torch https://pytorch.org/
 - transformers https://github.com/huggingface/transformers

The model used here is based on T5 https://huggingface.co/docs/transformers/model_doc/t5

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

##**Get Device**

Check whether CUDA is available for hardware acceleration; otherwise fall back to the CPU for inference.

In [2]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

cpu


##**Load the Model**

The model comes from VietAI. It is an open-source English-Vietnamese T5 translator that is quite fast and runs well even on a CPU.

https://huggingface.co/VietAI/envit5-translation



In [3]:
model_name = "VietAI/envit5-translation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

##**Translation**

First, build a list of inputs. This model handles both English to Vietnamese and Vietnamese to English, so each item in the list must start with either vi: or en: to tell the model the input language.

Next, the tokenizer converts the inputs into PyTorch tensors, which are then copied to the device; the tokens must be in the same memory as the model. Finally, the opposite has to take place: the output tokens are decoded back into words. The output has the same format as the input, prefixed with vi: or en: depending on its language.
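
The prefix convention described above can be sketched with two small helpers. The names `add_prefix` and `strip_prefix` are hypothetical and are not part of the model or of transformers; they only show one way to tag inputs and to split the tag off an output line.

```python
# Hypothetical helpers illustrating the "vi: " / "en: " prefix convention.
LANGS = ("vi", "en")

def add_prefix(lines, lang):
    """Tag each non-blank line with the language prefix the model expects."""
    if lang not in LANGS:
        raise ValueError(f"lang must be one of {LANGS}, got {lang!r}")
    return [f"{lang}: {line.strip()}" for line in lines if line.strip()]

def strip_prefix(line):
    """Split an output line into (language, text); language is None if untagged."""
    for lang in LANGS:
        tag = f"{lang}: "
        if line.startswith(tag):
            return lang, line[len(tag):]
    return None, line

tagged = add_prefix(["Xin chào thế giới"], "vi")
print(tagged)                           # ['vi: Xin chào thế giới']
print(strip_prefix("en: Hello world"))  # ('en', 'Hello world')
```

The tagged list has the same shape as the `inputs` lists used in the examples that follow.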


##**Example of Vietnamese to English**

In [4]:
inputs = ['vi: Xin chào thế giới']
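# Tokenize to PyTorch tensors and move the token ids to the model's device
input_ids = tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to(device)
outputs = model.generate(input_ids, max_length=512)
# Decode the generated token ids back into text
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))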

['en: Hello world']


##**Example of English to Vietnamese**

In [5]:
inputs = ["en: Hello world"]
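# Tokenize to PyTorch tensors and move the token ids to the model's device
input_ids = tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to(device)
outputs = model.generate(input_ids, max_length=512)
# Decode the generated token ids back into text
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))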

['vi: Xin chào thế giới']


##**Free Memory**

Import the garbage collector, delete the model and tokenizer, run the garbage collector to free memory, and empty the CUDA cache if it is in use.

It is not clear whether model.cpu() is actually necessary before deleting the model.

In [6]:
import gc

model.cpu()
del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()
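
Why every reference has to be deleted before the memory can come back is easier to see with a stand-in object. The sketch below uses a hypothetical `FakeModel` class instead of the real model and checks with a weak reference whether the object is still alive; it assumes the CPython interpreter.

```python
import gc
import weakref

class FakeModel:
    """Hypothetical stand-in for a large model, used only for illustration."""
    def __init__(self):
        self.weights = [0.0] * 1000

fake = FakeModel()
alias = fake                 # a second strong reference to the same object
probe = weakref.ref(fake)    # a weak reference does not keep the object alive

del fake                     # one strong reference is gone, but alias remains
gc.collect()
still_alive = probe() is not None

del alias                    # the last strong reference is gone
gc.collect()
freed = probe() is None

print(still_alive, freed)    # True True
```

In a notebook, extra references can also hide in other variables or in the output history, so memory may stay in use until those are cleared as well.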

##**Creating UI Elements**

Next, we create a class that makes the model a bit easier to use, so a new translation no longer requires rerunning a cell.


In [7]:
import ipywidgets as widgets
from ipywidgets import Layout
from IPython.display import display

class VietTranslate(object):
    def __init__(self):
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.model_name = "VietAI/envit5-translation"
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_name)\
            .to(self.device)

    def display(self,description='',button_description='Translate',text=''):
        self.textarea = widgets.Textarea(
            value=text,
            description= f'{description}:',
            disabled=False,
            layout=Layout(width='80%', height="100px"))
        self.output = widgets.Output(layout=Layout(width='80%', height='100px'))
        self.button = widgets.Button(description=button_description)
        self.button.on_click(self.button_pressed)
        display(self.textarea,self.button,self.output)

    def translate(self, t):
        # One input per non-blank line; each must start with 'vi:' or 'en:'
        inputs = [x for x in t.split('\n') if x.strip()]
        if not inputs:
            return []
        input_ids = self.tokenizer(inputs, return_tensors="pt",
            padding=True).input_ids.to(self.device)
        outputs = self.model.generate(input_ids, max_length=512)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

    def button_pressed(self, b):
        self.output.clear_output()
        with self.output:
            for line in self.translate(self.textarea.value):
                print(line)
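
The `translate` method sends every line of the text box to the model as a single batch. For long texts it may help to split the lines into smaller groups first. The following is a minimal sketch of that idea in plain Python; the helper names `non_empty_lines` and `batches` and the `batch_size` parameter are assumptions for illustration, not part of the class above.

```python
def non_empty_lines(text):
    """Split text on newlines, dropping blank or whitespace-only lines."""
    return [line.strip() for line in text.split("\n") if line.strip()]

def batches(items, batch_size=8):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

text = "\nvi: một\n\nvi: hai\nvi: ba\n"
lines = non_empty_lines(text)
groups = list(batches(lines, batch_size=2))
print(groups)  # [['vi: một', 'vi: hai'], ['vi: ba']]
```

Each group could then be joined with newlines and passed to `translate` in turn, which keeps the size of any single call to the model bounded.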


## Change the text in the box below and press the button to get a translation, without rerunning the cells

In [8]:
translator = VietTranslate()
translator.display('Vietnamese','English',text='''
vi: VietAI là tổ chức phi lợi nhuận với sứ mệnh ươm mầm tài năng về trí tuệ nhân tạo và xây dựng một cộng đồng các chuyên gia trong lĩnh vực trí tuệ nhân tạo đẳng cấp quốc tế tại Việt Nam.

vi: Theo báo cáo mới nhất của Linkedin về danh sách việc làm triển vọng với mức lương hấp dẫn năm 2020, các chức danh công việc liên quan đến AI như Chuyên gia AI (Artificial Intelligence Specialist), Kỹ sư ML (Machine Learning Engineer) đều xếp thứ hạng cao.
''')

In [9]:
translator2 = VietTranslate()
translator2.display('English','Vietnamese',text= '''
en: Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

en: We're on a journey to advance and democratize artificial intelligence through open source and open science.
''')
