<a href="https://colab.research.google.com/github/noodlepopllc/LearnVietnamese/blob/main/Colab/Text2Speech2Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports

Import `torch` to select the device and release GPU resources, `gc` (the garbage collector) to clean up memory after deleting references, and the `transformers` classes used to load and run the models.

*Overview of transformers below*

[Tasks transformers solves](https://huggingface.co/docs/transformers/main/en/tasks_explained)

In [1]:
import torch, gc
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, \
    T5ForConditionalGeneration, T5Tokenizer, VitsModel, pipeline

## Get device

On Colab, the only GPU acceleration available is CUDA; on a CPU-only instance the code falls back to the CPU.


In [2]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

cpu


## English to Vietnamese

This is an example of how to use a model to translate text. The actual translation later in the notebook uses a different model, since this one does not do a good job on this sentence. Various models will need to be compared to pick the best one.

*The model we are using is linked below.*

[T5-EN-VI-SMALL:Pretraining Text-To-Text Transfer Transformer for English Vietnamese Translation](https://huggingface.co/NlpHUST/t5-en-vi-small)

In [3]:
model = T5ForConditionalGeneration.from_pretrained("NlpHUST/t5-en-vi-small")
tokenizer = T5Tokenizer.from_pretrained("NlpHUST/t5-en-vi-small")
model.to(device)

src = "I hope this model is capable of completely translating this sentence, let us see if it can."
tokenized_text = tokenizer.encode(src, return_tensors="pt").to(device)
model.eval()
summary_ids = model.generate(
                    tokenized_text,
                    max_length=128,
                    num_beams=5,
                    repetition_penalty=2.5,
                    length_penalty=1.0,
                    early_stopping=True
                )
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(output)


Tôi hy vọng mô hình này có thể dịch toàn bộ câu này, để xem nó có thể không.


## Release the memory that holds the model before loading a new one

In [4]:
model.cpu()
del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()
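
The order of operations in the cell above matters: `gc.collect()` can only reclaim an object once no name refers to it any longer, which is why `del` comes first. The following is a minimal sketch of that behavior using only the standard library. `FakeModel` is a hypothetical stand-in for a real model, and its self-reference is an assumption added to create a reference cycle, the case where the garbage collector (rather than plain reference counting) does the work.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large model object."""

    def __init__(self):
        self.weights = [0.0] * 1000
        # Assumed reference cycle: the object points at itself, so
        # reference counting alone can never free it.
        self.owner = self


model = FakeModel()
probe = weakref.ref(model)  # observes the object without keeping it alive

del model                   # drop the only external reference
gc.collect()                # the cycle collector can now reclaim the object

print(probe() is None)      # True once the object has been freed
```

`torch.cuda.empty_cache()` then hands the freed blocks held by PyTorch's caching allocator back to the GPU driver; on a CPU-only runtime it has no effect.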

## Similar to above, use a Vietnamese-to-English translator to see how well the translation worked

*Model is linked below*

[Machine translation for vietnamese](https://huggingface.co/NlpHUST/t5-vi-en-small)

In [5]:

model = T5ForConditionalGeneration.from_pretrained("NlpHUST/t5-vi-en-small")
tokenizer = T5Tokenizer.from_pretrained("NlpHUST/t5-vi-en-small")
model.to(device)

tokenized_text = tokenizer.encode(output, return_tensors="pt").to(device)
model.eval()
summary_ids = model.generate(
                    tokenized_text,
                    max_length=256,
                    num_beams=5,
                    repetition_penalty=2.5,
                    length_penalty=1.0,
                    early_stopping=True
                )
output2 = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(output2)


I hope this model can translate the whole thing, see if it couldn't.


## Unload the memory when done before loading next model

In [6]:
model.cpu()
del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()

## This model can translate both to and from Vietnamese

First, translate to Vietnamese.

*Model linked below*

[EnViT5 Translation](https://huggingface.co/VietAI/envit5-translation)

In [7]:
model_name = "VietAI/envit5-translation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
inputs = [f"en: {src}"]
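# Tokenize the prefixed input and generate the translation
input_ids = tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to(device)
outputs = model.generate(input_ids, max_length=512)
# Decode the first (only) result and drop the "vi: " language tag
output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].split('vi: ')[1]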
print(output)



Tôi hy vọng mô hình này có khả năng dịch hoàn toàn câu này, hãy xem nó có thể không.
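
The cell above extracts the translation with `.split('vi: ')[1]`, which raises an `IndexError` if the decoded text ever lacks the language tag. A minimal sketch of a more tolerant alternative follows; `strip_lang_prefix` is a hypothetical helper name, not part of `transformers` or the model's own tooling.

```python
def strip_lang_prefix(text: str, lang: str) -> str:
    """Remove a leading '<lang>: ' tag if present; otherwise return the text unchanged."""
    prefix = f"{lang}: "
    text = text.strip()
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


print(strip_lang_prefix("vi: Xin chào", "vi"))  # Xin chào
print(strip_lang_prefix("Xin chào", "vi"))      # Xin chào
```

Only a tag at the very start of the string is removed, so a `vi: ` that happens to appear later in the sentence is left alone.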


Now translate back to English to verify the translation. Keep in mind that the translation to Vietnamese may be good while the translation back from Vietnamese is the one introducing errors. Further testing is needed to find the best model combination.

*Model linked below*

[EnViT5 Translation](https://huggingface.co/VietAI/envit5-translation)

In [8]:

model_name = "VietAI/envit5-translation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
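# Tokenize the prefixed Vietnamese text and generate the English translation
input_ids = tokenizer([f'vi: {output}'], return_tensors="pt", padding=True).input_ids.to(device)
outputs = model.generate(input_ids, max_length=512)
# Decode the first (only) result and drop the "en: " language tag
test = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].split('en: ')[1]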
print(test)

I hope this model is able to completely translate this sentence, let's see if it can.
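
To put a rough number on how closely the round trip matches the source sentence, one option is a character-level similarity ratio from the standard library. This is only a sketch of a surface comparison: `similarity` and `round_trip` are hypothetical names, and a score like this does not measure meaning. Translation metrics such as BLEU or chrF would be the more usual choice for a serious model comparison.

```python
from difflib import SequenceMatcher

# The source sentence and the back-translated sentence printed above
src = "I hope this model is capable of completely translating this sentence, let us see if it can."
round_trip = "I hope this model is able to completely translate this sentence, let's see if it can."


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(f"{similarity(src, round_trip):.2f}")
```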


## Clean up memory; note we are only deleting the model and tokenizer

In [9]:
model.cpu()
del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()


## This model is very popular and there are versions for most languages

This works well but sounds a bit robotic and has no female or dialect variations. Other models will be evaluated later.

scipy is used only for writing the audio file so it can be transcribed later.

The `Audio` and `display` functions from IPython give us a very easy way to play back the generated audio in the browser.

*Model linked below*

[Massively Multilingual Speech (MMS): Vietnamese Text-to-Speech](https://huggingface.co/facebook/mms-tts-vie)

In [10]:
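import scipy.io.wavfile  # import the submodule explicitly; a bare `import scipy` does not load it on older SciPy versions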
from IPython.display import Audio, display

model = VitsModel.from_pretrained("facebook/mms-tts-vie")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-vie")
inputs = tokenizer(output, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs).waveform
    sample = out.numpy()[0]
    rate = model.config.sampling_rate
    scipy.io.wavfile.write("out.wav", rate=rate, data=sample)
    audio = Audio(data=sample, rate = rate, autoplay=False)
    display(audio)
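
`scipy.io.wavfile.write` stores the array in its own dtype, so a float32 waveform is written as a 32-bit floating-point WAV. If a tool further down the line expects 16-bit PCM instead, the conversion can be sketched with the standard library alone. Everything here is illustrative: the 16000 Hz rate is an assumption (the notebook reads the real value from `model.config.sampling_rate`), the sine tone is a stand-in for the model's waveform, and `to_pcm16` and `out_pcm16.wav` are hypothetical names.

```python
import math
import struct
import wave

rate = 16000  # assumed sample rate, see the note above
# Hypothetical stand-in for the model output: 0.5 s of a 440 Hz tone in [-1, 1]
sample = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 2)]


def to_pcm16(values):
    """Clip floats to [-1, 1] and scale them to signed 16-bit integers."""
    return [int(max(-1.0, min(1.0, v)) * 32767) for v in values]


pcm = to_pcm16(sample)
with wave.open("out_pcm16.wav", "wb") as wav_file:
    wav_file.setnchannels(1)     # mono
    wav_file.setsampwidth(2)     # 2 bytes per sample = 16 bit
    wav_file.setframerate(rate)
    wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```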

## As always when switching models, make sure memory is freed

In [11]:
model.cpu()
del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()

## Whisper is used for speech-to-text transcription

This version is fine-tuned specifically for Vietnamese. The full Whisper model can transcribe many languages and can also translate speech directly into English.

*Model linked below*

[PhoWhisper: Automatic Speech Recognition for Vietnamese](https://huggingface.co/vinai/PhoWhisper-small)

In [12]:
transcriber = pipeline("automatic-speech-recognition", model="vinai/PhoWhisper-small")
output = transcriber("out.wav")['text']
print(output)



tôi hy vọng mô hình này có khả năng dịch hoàn toàn câu này hãy xem nó có thể không.
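
The transcript differs from the text that was synthesized only in casing and punctuation. One way to check this automatically is a word error rate computed after normalizing both strings. The sketch below uses only the standard library; `normalize` and `word_error_rate` are hypothetical helper names, and the normalization rules (lowercasing, dropping punctuation) are an assumption about what should count as a match.

```python
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split into words."""
    text = unicodedata.normalize("NFC", text).lower()
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return kept.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # previous[j] is the edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / max(len(ref), 1)


spoken = "Tôi hy vọng mô hình này có khả năng dịch hoàn toàn câu này, hãy xem nó có thể không."
heard = "tôi hy vọng mô hình này có khả năng dịch hoàn toàn câu này hãy xem nó có thể không."
print(word_error_rate(spoken, heard))  # 0.0
```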


## Release whisper model from memory when done

In [13]:
del transcriber
gc.collect()
torch.cuda.empty_cache()

## Translate

This repeats the earlier Vietnamese-to-English translation cell, with the input now coming directly from Whisper. For this particular sentence the translation worked well.

*Model linked below*

[EnViT5 Translation](https://huggingface.co/VietAI/envit5-translation)

In [14]:
model_name = "VietAI/envit5-translation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
inputs = [f"vi: {output}"]
print(inputs)
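# Tokenize the prefixed transcript and generate the English translation
input_ids = tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to(device)
outputs = model.generate(input_ids, max_length=512)
# Decode the first (only) result and drop the "en: " language tag
output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].split('en: ')[1]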
print(output)

['vi: tôi hy vọng mô hình này có khả năng dịch hoàn toàn câu này hãy xem nó có thể không.']
I hope this model is able to completely translate this sentence. Let's see if it can.
