# English-To-Spanish Translation

This notebook demonstrates how to perform English-to-Spanish translation using Hugging Face transformer models. It is written for both beginners and advanced users and shows multiple approaches, device handling (CPU/GPU), batching, and how to translate different text types (single sentences, paragraphs, conversations).

---


## Install requirements

Run this cell in a fresh environment.

```bash
!pip install --upgrade pip
!pip install transformers[sentencepiece] accelerate torch
!pip install safetensors 
!pip install datasets tqdm huggingface-hub 
```

---


In [9]:
# Detect the best available device: CUDA GPU, Apple Silicon (MPS), or CPU

from typing import List, Iterable

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm.auto import tqdm


def detect_device():
    if torch.cuda.is_available():
        return torch.device('cuda')
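    # is_built() only reports whether PyTorch was compiled with MPS support;
    # is_available() also checks that this machine can actually use it.
    if torch.backends.mps.is_available():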
        return torch.device('mps')
    return torch.device('cpu')

DEVICE = detect_device()
print(f"Using device: {DEVICE}")


Using device: cpu


## Quick approach: `pipeline` (recommended for beginners)

The `pipeline` API wraps tokenization, model loading, and generation in a single, easy-to-use object. It is well suited to quick experiments and small-to-medium workloads. The examples here are short, simple sentences, so the approach is easy to follow. More involved approaches appear later in this notebook.

---


In [2]:
model_name = "Helsinki-NLP/opus-mt-en-es"  
translator = pipeline("translation", model=model_name, device=0 if str(DEVICE).startswith('cuda') else -1)

examples = [
    "Hello, how are you doing?",
    "God's word is transforming the world."
]

for out in translator(examples, max_length=256):
    print(out['translation_text'])


Device set to use cpu


Hola, ¿cómo estás?
La palabra de Dios está transformando el mundo.


## Manual model loading (advanced users)

Manually loading the tokenizer and model gives you full control over generation parameters (beam search, sampling, length penalties, etc.) and allows batching strategies that may be more efficient for large workloads. This approach is useful when the defaults do not suit the task and the parameters need to be tuned for a specific use case; in our case, translating English text (sermons) into Spanish.

---


In [3]:
# Load tokenizer and model manually
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.to(DEVICE)
model.eval()

def translate_batch_manual(texts: List[str], batch_device: torch.device = DEVICE, max_length: int = 256, num_beams: int = 4) -> List[str]:
    if len(texts) == 0:
        return []

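    # With truncation=True and no explicit max_length, inputs are truncated to the
    # tokenizer's own limit (tokenizer.model_max_length) instead of a hard-coded 1024,
    # which may exceed what the model accepts.
    encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)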
    encoded = {k: v.to(batch_device) for k, v in encoded.items()}

    with torch.no_grad():
        outputs = model.generate(**encoded, max_length=max_length, num_beams=num_beams)

    decoded = [tokenizer.decode(t, skip_special_tokens=True) for t in outputs]
    return decoded

print(translate_batch_manual([" In the beginning was the Word, and the Word was with God, and the Word was God."]))


['En el principio era el Verbo, y el Verbo estaba con Dios, y el Verbo era Dios.']
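
The generation parameters mentioned above (beam search, sampling, length penalties) are passed to `model.generate` as keyword arguments. The sketch below groups a few of them into presets. The preset names and values are illustrative assumptions, not tuned settings, and `generation_kwargs` is a hypothetical helper that is not part of the notebook; only the keyword names themselves (`num_beams`, `length_penalty`, `early_stopping`, `do_sample`, `top_p`, `temperature`, `max_length`) are standard `generate` arguments.

```python
from typing import Any, Dict

# Hypothetical presets: the names and values are illustrative, not tuned settings.
GENERATION_PRESETS: Dict[str, Dict[str, Any]] = {
    # Pick the single most likely token at every step.
    "greedy": {"num_beams": 1, "do_sample": False},
    # Keep several candidate translations alive and return the best one.
    "beam": {"num_beams": 4, "length_penalty": 1.0, "early_stopping": True, "do_sample": False},
    # Sample from the model's distribution for more varied output.
    "sampling": {"num_beams": 1, "do_sample": True, "top_p": 0.9, "temperature": 0.7},
}


def generation_kwargs(preset: str = "beam", max_length: int = 256, **overrides: Any) -> Dict[str, Any]:
    """Return a copy of a preset with max_length set and any overrides applied."""
    if preset not in GENERATION_PRESETS:
        raise ValueError(f"Unknown preset: {preset!r}")
    kwargs = dict(GENERATION_PRESETS[preset])  # copy so the preset itself is not mutated
    kwargs["max_length"] = max_length
    kwargs.update(overrides)
    return kwargs


print(generation_kwargs("beam", num_beams=8))
```

The resulting dictionary could then be unpacked into a call such as `model.generate(**encoded, **generation_kwargs("beam"))`.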


## Batch translation for large datasets

When translating many sentences, divide them into batches (chunks) that fit in memory. Real workloads can contain thousands of sentences, so processing them in pieces keeps memory use bounded. Each individual input must also stay within the model's maximum sequence length; anything longer is truncated and that content is lost.

---


In [4]:
from itertools import islice

def chunked_iterable(iterable: Iterable, chunk_size: int):
    """Yield successive lists of up to `chunk_size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        yield chunk

def batch_translate(texts: List[str], chunk_size: int = 16, **generate_kwargs) -> List[str]:
    translations = []
    for chunk in tqdm(list(chunked_iterable(texts, chunk_size)), desc="Translating chunks"):
        translations.extend(translate_batch_manual(chunk, **generate_kwargs))
    return translations

sample_texts = [f"This is sentence number {i}." for i in range(1, 31)]
translated_sample = batch_translate(sample_texts, chunk_size=8)
print(translated_sample[:5])


Translating chunks: 100%|██████████| 4/4 [00:14<00:00,  3.50s/it]

['Esta es la frase número 1.', 'Esta es la frase número 2.', 'Esta es la frase número 3.', 'Esta es la frase número 4.', 'Esta es la frase número 5.']
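
One batching strategy that can reduce wasted padding is to group texts of similar length together, then restore the original order afterwards. The sketch below is a minimal illustration under two assumptions: character count is used as a rough proxy for token count, and `translate_fn` is a hypothetical callable that maps a list of strings to a list of translations (in this notebook, `translate_batch_manual` would fit that shape). A stand-in function is used here so the example runs without a model.

```python
from typing import Callable, List


def translate_sorted_by_length(
    texts: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    chunk_size: int = 16,
) -> List[str]:
    """Translate texts in length-sorted batches and return results in input order."""
    # Indices of the inputs, ordered from shortest to longest text.
    # Character length is only an approximation of token length.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    results: List[str] = [""] * len(texts)
    for start in range(0, len(order), chunk_size):
        batch_indices = order[start:start + chunk_size]
        outputs = translate_fn([texts[i] for i in batch_indices])
        # Write each output back to the position of its original input.
        for i, out in zip(batch_indices, outputs):
            results[i] = out
    return results


def fake_translate(batch: List[str]) -> List[str]:
    """Stand-in for a real translator: upper-cases the text."""
    return [t.upper() for t in batch]


demo = ["a much longer sentence here", "short", "medium length"]
print(translate_sorted_by_length(demo, fake_translate, chunk_size=2))
```

Whether this helps in practice depends on how much the input lengths vary; for uniformly short sentences like the sample above, the gain is likely small.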





## Handling paragraphs and longer inputs

Transformer models accept only a limited number of input tokens, and the tokenizer truncates anything beyond that limit. For very long paragraphs, split the text into sentences and translate them in batches.

---


In [5]:
import re

def naive_sentence_split(paragraph: str) -> List[str]:
    pieces = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    pieces = [p.strip() for p in pieces if p.strip()]
    return pieces

def translate_paragraph(paragraph: str, chunk_size: int = 8) -> str:
    sentences = naive_sentence_split(paragraph)
    translations = batch_translate(sentences, chunk_size=chunk_size)
    return ' '.join(translations)

paragraph = (
    "God uses our trials to sharpen us for the future blessings He is going to provide for us. "
    "In the present we do not understand this, but with time it all makes sense. "
    "He is always there for us, and will always be"
)
print(translate_paragraph(paragraph))


Translating chunks: 100%|██████████| 1/1 [00:07<00:00,  7.29s/it]

Dios usa nuestras pruebas para afinarnos para las bendiciones futuras que Él va a proveer para nosotros. En el presente no entendemos esto, pero con el tiempo todo tiene sentido.Él siempre está ahí para nosotros, y siempre estará
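
Translating one sentence at a time removes the surrounding context. A possible middle ground is to pack consecutive sentences into chunks that stay under a size budget. The sketch below assumes a hypothetical helper `pack_sentences` and, by default, counts whitespace-separated words as a rough stand-in for tokens; with a real tokenizer, something like `lambda s: len(tokenizer(s)["input_ids"])` could be passed as `count_units` instead. The budget of 400 is an arbitrary illustrative value.

```python
from typing import Callable, List


def pack_sentences(
    sentences: List[str],
    max_units: int = 400,
    count_units: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily join consecutive sentences into chunks of at most `max_units` units."""
    chunks: List[str] = []
    current: List[str] = []
    current_units = 0
    for sentence in sentences:
        units = count_units(sentence)
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_units + units > max_units:
            chunks.append(" ".join(current))
            current, current_units = [], 0
        current.append(sentence)
        current_units += units
    if current:
        chunks.append(" ".join(current))
    return chunks


print(pack_sentences(["a b", "c d", "e"], max_units=4))
```

A single sentence that exceeds the budget on its own still becomes one chunk, so it may be truncated by the tokenizer.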





## Conversations / multi-turn contexts

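There are two options: translate each turn independently, or concatenate earlier turns as context. The example below translates each turn independently and prefixes it with the speaker's name. As the output shows, the model may keep or drop that prefix, so the result is not always consistent.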

---


In [6]:
def translate_conversation(turns: List[dict], chunk_size: int = 8) -> List[dict]:
    texts = [f"{t['speaker']}: {t['text']}" for t in turns]
    translations = batch_translate(texts, chunk_size=chunk_size)
    out = []
    for t, tr in zip(turns, translations):
        new = t.copy()
        new['translation'] = tr
        out.append(new)
    return out

conv = [
    {"speaker": "Alice", "text": "Hey, are you coming to the bible study?"},
    {"speaker": "Bob", "text": "Yes! I'll be there at 7 pm."},
    {"speaker": "Alice", "text": "Great — see you then."}
]
print(translate_conversation(conv))


Translating chunks: 100%|██████████| 1/1 [00:03<00:00,  3.85s/it]

[{'speaker': 'Alice', 'text': 'Hey, are you coming to the bible study?', 'translation': 'Oye, ¿vienes al estudio de la Biblia?'}, {'speaker': 'Bob', 'text': "Yes! I'll be there at 7 pm.", 'translation': 'Bob: ¡Sí! Estaré allí a las 7 pm.'}, {'speaker': 'Alice', 'text': 'Great — see you then.', 'translation': 'Alice: Genial, nos vemos entonces.'}]
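
In the output above, the first translation lost its `Alice:` prefix while the other two kept theirs, because the speaker label was sent through the model along with the text. One way to avoid this is to translate only the `text` field and leave the speaker label untouched. The sketch below assumes a hypothetical `translate_fn` that maps a list of strings to a list of translations (in this notebook, `batch_translate` would fit); a stand-in is used so the example runs without a model.

```python
from typing import Callable, Dict, List


def translate_turn_texts(
    turns: List[Dict[str, str]],
    translate_fn: Callable[[List[str]], List[str]],
) -> List[Dict[str, str]]:
    """Translate only each turn's text; the speaker label never reaches the model."""
    texts = [t["text"] for t in turns]
    translations = translate_fn(texts)
    # Build new dicts so the input turns are left unmodified.
    return [{**t, "translation": tr} for t, tr in zip(turns, translations)]


def fake_translate(batch: List[str]) -> List[str]:
    """Stand-in for a real translator: wraps the text in angle brackets."""
    return [f"<{t}>" for t in batch]


demo_turns = [
    {"speaker": "Alice", "text": "Hello"},
    {"speaker": "Bob", "text": "Hi"},
]
for turn in translate_turn_texts(demo_turns, fake_translate):
    print(f"{turn['speaker']}: {turn['translation']}")
```

The speaker name can then be added back when the conversation is displayed, as the final loop does.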





## Saving translations

---


In [10]:
def save_translations(inputs, translations, out_txt: str = "es_translations.txt", out_tsv: str = "es_translations.tsv"):
    with open(out_txt, 'w', encoding='utf-8') as f_txt, open(out_tsv, 'w', encoding='utf-8') as f_tsv:
        for inp, tr in zip(inputs, translations):
            f_txt.write(tr + "\n")
            f_tsv.write(inp.replace('\t',' ') + "\t" + tr.replace('\t',' ') + "\n")
    print(f"Saved {len(translations)} translations to {out_txt} and {out_tsv}")

save_translations(sample_texts, translated_sample, out_txt="es.txt", out_tsv="eng.tsv")


Saved 30 translations to es.txt and eng.tsv
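
The function above replaces tabs with spaces, but a newline inside a text would still break the one-pair-per-line layout of the TSV file. As an alternative sketch, Python's standard `csv` module can write tab-delimited files and quote any field that contains a tab or newline. The helper name `save_tsv`, the default file name, and the header row are assumptions made for illustration.

```python
import csv
import os
import tempfile
from typing import Iterable


def save_tsv(inputs: Iterable[str], translations: Iterable[str], out_tsv: str = "translations.tsv") -> int:
    """Write source/translation pairs as tab-separated rows; return the number of pairs written."""
    written = 0
    # newline="" lets the csv module control line endings itself.
    with open(out_tsv, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["source", "translation"])  # header row (an assumption)
        for src, tr in zip(inputs, translations):
            writer.writerow([src, tr])  # fields with tabs or newlines are quoted automatically
            written += 1
    return written


# Round-trip check in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv")
written = save_tsv(
    ["Hello\tworld", "Line one\nline two"],
    ["Hola\tmundo", "Línea uno\nlínea dos"],
    demo_path,
)
with open(demo_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
print(written, len(rows))
```

Reading the file back with `csv.reader` and the same delimiter recovers the original strings, including embedded tabs and newlines.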
