# Continued Pre-training: Teaching LLMs a New Language

This notebook demonstrates how to continue pre-training a small LLM on text in a new language using Unsloth. We'll use TinyLlama (1.1B parameters) because it is small enough to train on a single Tesla T4 GPU.

In [1]:
# Install required packages
!pip install -q unsloth
!pip install -q datasets
# Quote version specifiers so the shell does not treat ">" as an output redirect
!pip install -q "accelerate>=0.24.1"
!pip install -q "bitsandbytes>=0.41.1"
!pip install -q "peft>=0.6.0"
!pip install -q "trl>=0.7.6"

# Verify GPU availability
import torch
print("CUDA available:", torch.cuda.is_available())
print("CUDA device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU Memory:", torch.cuda.get_device_properties(0).total_memory / 1e9, "GB")


## Setting Up for Continued Pre-training in a New Language

Continued pre-training teaches an existing model a new language by training it further on raw text in that language. We'll target Spanish, using 1,000 examples from the Spanish portion of the MLQA question-answering dataset as a small text corpus.

In [5]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset

# Set a small sequence length to reduce memory requirements
max_seq_length = 512

# Load the TinyLlama chat checkpoint with 4-bit quantized weights as our starting model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_seq_length=max_seq_length,
    dtype=torch.float16,
    load_in_4bit=True
)

print(f"Base model loaded with fp16 precision and max sequence length of {max_seq_length}")

# Load the first 1,000 test examples of MLQA with Spanish contexts and Spanish questions
dataset = load_dataset("mlqa", "mlqa.es.es", split="test[:1000]", trust_remote_code=True)
print(f"Dataset loaded with {len(dataset)} examples")

# Extract text for pretraining
def prepare_spanish_text(example):
    # Combine context and question for more text
    return {"text": example["context"] + " " + example["question"]}

spanish_dataset = dataset.map(prepare_spanish_text)

# Preview a sample
print("Sample Spanish text:")
sample_text = spanish_dataset[0]["text"]
print(sample_text[:100] + "..." if len(sample_text) > 100 else sample_text)

==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Base model loaded with fp16 precision and max sequence length of 512


Dataset loaded with 1000 examples


Sample Spanish text:
Tras la erupción, las emisiones de material piroclástico que se produjeron desde la brecha creada po...
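
To give a rough sense of why the model is loaded with `load_in_4bit=True`, here is a back-of-the-envelope estimate of weight storage at two precisions. The parameter count is an assumed round figure of 1.1 billion, and the estimate ignores quantization metadata, any layers kept in higher precision, activations, and optimizer state, so treat it as a lower bound rather than a measurement.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumed round parameter count for TinyLlama-1.1B
NUM_PARAMS = 1_100_000_000

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{int4_gb:.2f} GB")
```

Either figure fits in the T4's memory; the 4-bit option mainly leaves more headroom for activations and gradients during training.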


## Preparing Dataset for Pre-training

Now we'll format our Spanish dataset for language model pre-training. This involves tokenizing the text and preparing it for causal language modeling, where the model learns to predict the next token in a sequence.
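
As a small illustration of what "predict the next token" means, the sketch below pairs each token with the one that follows it. The token ids are made up for illustration. With Hugging Face causal language models this shift is applied inside the model, which is why the labels can simply be a copy of `input_ids`.

```python
def next_token_pairs(token_ids):
    """Pair each input token with the token the model should predict next."""
    inputs = token_ids[:-1]   # every token except the last
    targets = token_ids[1:]   # the same sequence shifted left by one position
    return list(zip(inputs, targets))

# Hypothetical token ids standing in for a short tokenized Spanish sentence
toy_ids = [1, 4103, 425, 604, 2]

pairs = next_token_pairs(toy_ids)
for seen, target in pairs:
    print(f"after token {seen} -> predict token {target}")
```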

In [6]:
# Prepare the dataset for pre-training
def tokenize_function(examples):
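    # Tokenize a batch of texts, padding or truncating each one to max_seq_length.
    # The Llama tokenizer prepends a BOS token by default; it does not append EOS.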
    texts = examples["text"]
    result = tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=max_seq_length,
        return_tensors="pt"
    )
    # For causal language modeling, labels start as a copy of input_ids (the model shifts them internally)
    result["labels"] = result["input_ids"].clone()
    return result

# Apply tokenization to our dataset
tokenized_dataset = spanish_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=spanish_dataset.column_names
)

print(f"Tokenized dataset prepared with {len(tokenized_dataset)} examples")
print(f"Dataset features: {tokenized_dataset.features}")

Tokenized dataset prepared with 1000 examples
Dataset features: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
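
Because every example is padded to `max_seq_length`, many positions hold only padding and should not contribute to the loss. The sketch below shows the usual remedy: set the label at those positions to -100, the value PyTorch's cross-entropy loss ignores by default. It is a simplified illustration that keys off the attention mask; `DataCollatorForLanguageModeling` with `mlm=False`, used later in this notebook, instead masks positions whose id equals the pad token id. The token ids and the pad id of 0 are made up.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Hypothetical sequence padded to length 8 (0 stands in for the pad token id)
input_ids = [1, 4103, 425, 604, 2, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 0, 0, 0]

labels = mask_padding(input_ids, attention_mask)
print(labels)
```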


## Configure Training Parameters

Now we'll set up the training configuration for continued pre-training. We'll use Unsloth's `FastLanguageModel.get_peft_model` function to add LoRA (Low-Rank Adaptation) adapters to our model, so that only a small set of added weights is trained instead of the full model.

In [7]:
# Set up the training configuration
from unsloth import FastLanguageModel

# Add LoRA adapters to our model
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Rank of the LoRA adapters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none"
)

# Print the trainable-parameter summary (the method prints it itself and returns None)
model.print_trainable_parameters()

# Set up training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"  # Disable wandb reporting
)

print("Training configuration set up successfully!")

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.4.7 patched 22 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


trainable params: 12,615,680 || all params: 1,112,664,064 || trainable%: 1.1338
Training configuration set up successfully!
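
The trainable-parameter count reported above can be reproduced by hand. Each LoRA adapter adds two matrices to a projection: A with shape `r x in_features` and B with shape `out_features x r`. The sketch below sums these over the seven target modules and 22 layers. The layer dimensions are assumptions taken from the published TinyLlama-1.1B configuration, not values printed by this notebook.

```python
# TinyLlama-1.1B dimensions (assumed from the published model config)
HIDDEN = 2048         # hidden_size
INTERMEDIATE = 5632   # MLP intermediate_size
KV_DIM = 256          # 4 key/value heads x head dimension 64 (grouped-query attention)
NUM_LAYERS = 22
RANK = 16             # r passed to get_peft_model

# (in_features, out_features) for each adapted projection in one layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """Parameters in one adapter: A is rank x in_features, B is out_features x rank."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 12,615,680
```

The result matches the `trainable params` figure in the output, which suggests the assumed dimensions are the ones in use.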


## Train the Model

Now let's set up the trainer and start the continued pre-training process. We'll use Hugging Face's `Trainer` class, which Unsloth has optimized for faster training.

In [9]:
# Set up the trainer with reduced optimizations
from transformers import Trainer, DataCollatorForLanguageModeling
import os

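# Environment variables intended to work around Triton compiler issues.
# Note: CUDA_VISIBLE_DEVICES only takes effect if it is set before CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TRITON_DISABLE_LINE_INFO"] = "1"  # May help avoid a Triton compiler error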

# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We're doing causal language modeling, not masked language modeling
)

# Update training arguments to be more conservative
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,  # Reduced batch size
    gradient_accumulation_steps=8,  # Increased gradient accumulation
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="linear",  # Simpler scheduler
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",  # Disable wandb reporting
    # Disable advanced optimizations that might cause issues
    optim="adamw_torch",
    ddp_find_unused_parameters=False,
    disable_tqdm=False
)

# Create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

# Start training with error handling
print("Starting continued pre-training...")
try:
    trainer.train()
    print("Training complete!")

    # Save the model
    model_path = "./spanish_tinyllama"
    trainer.save_model(model_path)
    print(f"Model saved to {model_path}")
except Exception as e:
    print(f"Training encountered an error: {e}")
    raise  # Re-raise so a failed run is not mistaken for a successful one

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 3 | Total steps = 186
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 12,615,680/4,000,000,000 (0.32% trained)


Starting continued pre-training...
Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 10   | 10.5634       |
| 20   | 9.3944        |
| 30   | 6.9964        |
| 40   | 3.9647        |
| 50   | 3.0922        |
| 60   | 2.6849        |
| 70   | 2.3972        |
| 80   | 2.3199        |
| 90   | 2.1875        |
| 100  | 2.1101        |


Training complete!
Model saved to ./spanish_tinyllama
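
The batch size and step count in the training banner follow from the training arguments. The sketch below redoes that arithmetic. It assumes the number of optimizer steps per epoch is rounded down and the warmup length is rounded up; the first assumption is consistent with the 186 total steps reported in this run, but other library versions may round differently.

```python
import math

num_examples = 1000
per_device_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1
num_epochs = 3
warmup_ratio = 0.03

# Examples consumed per optimizer step
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Mini-batches the dataloader yields in one epoch
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))

# Assumption: optimizer steps per epoch are floored
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_epochs

# Assumption: warmup steps are rounded up
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch_size)  # 16
print(total_steps)           # 186
print(warmup_steps)          # 6
```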


## Testing the Spanish-Enhanced Model

Now that we have trained our model on Spanish data, let's test its capabilities in generating Spanish text. We'll load the model and generate some sample Spanish text to evaluate its performance.
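
The generation call in the next cell samples with `temperature`, `top_k`, and `top_p`. The sketch below shows, on made-up logits, how these three settings narrow the set of candidate next tokens before one is drawn. It is a simplified pure-Python illustration of the idea, not the implementation used by `transformers`, and the function name is hypothetical.

```python
import math
import random

def filter_next_token_probs(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = {i: math.exp(x - peak) for i, x in enumerate(scaled)}

    # Top-k: keep only the k highest-scoring tokens
    ranked = sorted(weights, key=weights.get, reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    total = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens
    kept_total = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_total for i in kept}

# Made-up logits for a five-token vocabulary
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = filter_next_token_probs(toy_logits, temperature=0.7, top_k=3, top_p=0.8)
print(probs)

# Draw one token index from the filtered distribution
choice = random.choices(list(probs), weights=list(probs.values()))[0]
print("sampled token index:", choice)
```

With these toy values only the two most likely tokens survive the filters, so the sample is always one of them.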

In [11]:
# Load the trained model
from unsloth import FastLanguageModel
import torch

# Load our fine-tuned model
model_path = "./spanish_tinyllama"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,
    max_seq_length=512,
    dtype=torch.float16,
    load_in_4bit=True
)

# Set up text generation parameters
from transformers import TextGenerationPipeline

# No device argument is passed because the model is already on the GPU
generator = TextGenerationPipeline(
    model=model,
    tokenizer=tokenizer
)

# Test with some Spanish prompts
spanish_prompts = [
    "Hola, mi nombre es",
    "El idioma español es muy",
    "La inteligencia artificial puede",
    "En mi opinión, el futuro de la tecnología"
]

print("Testing the model with Spanish prompts:")
for prompt in spanish_prompts:
    response = generator(
        prompt,
        max_length=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        num_return_sequences=1
    )
    print(f"\nPrompt: {prompt}")
    print(f"Response: {response[0]['generated_text']}")

==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Device set to use cuda:0
The model 'PeftModelForCausalLM' is not supported for . Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', ...

Testing the model with Spanish prompts:

Prompt: Hola, mi nombre es
Response: Hola, mi nombre es Juan. ¿Dónde te gusta hacer el bachata? ¿Qué lugar se considera el más hermoso? ¿Qué lugar es el que más gusta? ¿En qué lugar se considera el más hermoso? ¿Cuál es el lugar más hermoso? ¿De qué lugar se decía el hermoso? ¿Cuál era la respuesta? ¿Cuál es el lugar donde se consider

Prompt: El idioma español es muy
Response: El idioma español es muy diverso entre las comunidades hispánicas de América y eso, por lo tanto, se suele hablar de español como español de América, español de América del Norte, español de América del Sur, español de España y el español de los territorios anexos de España en América. El español de los Estados Unidos es el español hablado en los Estados Unidos, Puerto Rico, Guam y las Islas Marianas.

Prompt: La inteligencia artificial puede
Response: La inteligencia artificial puede ser más inteligente que el hombre, ya que puede resolver problemas de manera más riguros