NLLB trunc translation #24169
Comments
Hi @FranPuentes
Below is the script I played with (4-bit model, after installing bitsandbytes):
# pip install bitsandbytes
import sys, os
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

NLLB_MODEL = "facebook/nllb-200-3.3B"
# NLLB_MODEL = "facebook/nllb-200-distilled-600M"
# NLLB_MODEL = "facebook/nllb-200-distilled-1.3B"

tokenizer = AutoTokenizer.from_pretrained(NLLB_MODEL)
# Load the model quantized to 4 bits (requires bitsandbytes)
model = AutoModelForSeq2SeqLM.from_pretrained(NLLB_MODEL, torch_dtype=torch.float16, load_in_4bit=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model = model.to(device)

def translate(lang: str, text: str):
    inputs = tokenizer(text, return_tensors="pt").to(device)
    # forced_bos_token_id selects the target language
    tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id[lang], max_new_tokens=4096)
    texts = tokenizer.batch_decode(tokens, skip_special_tokens=True)
    return texts[0]
if __name__ == "__main__":
    import readline
    # Original text: four sentences separated by periods
    text = """La Voz de Galicia es el cuarto periódico generalista de España. Posee una audiencia de 492.000 lectores en todo el país, según datos de la primera oleada del Estudio General de Medios de 2020. En el territorio gallego es la cabecera hegemónica. Su edición digital es la primera web informativa de la comunidad."""
    text = translate("eng_Latn", text)
    print(text)
    # >>> La Voz de Galicia is the fourth generalist newspaper in Spain. It has an audience of 492,000 readers in the whole country, according to data from the first wave of the Estudio General de Medios de 2020.
    # Same text with the last two sentence breaks replaced by commas
    text = """La Voz de Galicia es el cuarto periódico generalista de España. Posee una audiencia de 492.000 lectores en todo el país, según datos de la primera oleada del Estudio General de Medios de 2020, en el territorio gallego es la cabecera hegemónica, su edición digital es la primera web informativa de la comunidad."""
    text = translate("eng_Latn", text)
    print(text)
    # >>> La Voz de Galicia is the fourth generalist newspaper in Spain. It has an audience of 492,000 readers in the whole country, according to data from the first wave of the Estudio General de Medios de 2020, in the Galician territory it is the hegemonic headquarters, its digital edition is the first informative web of the community.

I am really not sure about this behaviour; it may be related to the way NLLB models were trained.
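The two runs above suggest the truncation is tied to sentence boundaries, so one possible workaround is to translate one sentence at a time and join the results. The following is only a minimal sketch under that assumption, not a confirmed fix. `split_sentences` and `translate_by_sentence` are hypothetical helper names, and the regex splitter is deliberately naive: it breaks after `.`, `!` or `?` followed by whitespace, so it will mis-split abbreviations.

```python
import re
from typing import Callable, List

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
# Numbers such as "492.000" are left intact because no whitespace follows the dot.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def translate_by_sentence(text: str, translate_fn: Callable[[str], str]) -> str:
    """Apply translate_fn to each sentence separately and join the results."""
    return " ".join(translate_fn(s) for s in split_sentences(text))
```

With the script above, this could be called as `translate_by_sentence(text, lambda s: translate("eng_Latn", s))`. Translating sentences independently loses cross-sentence context, so the output may differ from a single-pass translation.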
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.28.1
Who can help?
@ArthurZucker @younesbelkada
Reproduction
Its output is "La Voz de Galicia is the fourth generalist newspaper in Spain. It has an audience of 492,000 readers in the whole country, according to data from the first wave of the Estudio General de Medios of 2020." which is only the first and second lines. The output is the same when the carriage returns are removed.
Expected behavior
Translate all the text, not only the first and second line, for example:
"La Voz de Galicia is the fourth largest generalist newspaper in Spain. It has a readership of 492,000 readers throughout the country, according to data from the first wave of the General Media Study 2020. In Galicia, it is the leading newspaper.
Its digital edition is the leading news website in the region."