# Continued Pre-training with SmolLM2-135M using Unsloth

## Overview
This notebook demonstrates **continued pre-training**: adapting SmolLM2-135M-Instruct to a new language (Hindi) with Unsloth.

### What is Continued Pre-training?
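- Continued pre-training resumes next-token prediction training on raw text from a new domain or language
- It differs from instruction fine-tuning, which mainly reshapes how a model uses what it already knows
- The goal is to add knowledge or language coverage the base model lacks
- Training data is plain text, with no prompt/response formatting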
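
To make the next-token objective concrete, here is a minimal sketch in plain Python. The toy vocabulary and the `uniform_model` stand-in are hypothetical and only illustrate how inputs and targets are shifted by one position; a real run would use the model's predicted distribution instead.

```python
import math

def next_token_loss(token_ids, predict_proba):
    """Average cross-entropy of predicting token t+1 from the tokens up to t."""
    inputs, targets = token_ids[:-1], token_ids[1:]   # targets are inputs shifted by one
    losses = []
    for t, target in enumerate(targets):
        probs = predict_proba(inputs[: t + 1])        # distribution over the vocabulary
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

VOCAB_SIZE = 4  # hypothetical toy vocabulary

def uniform_model(prefix):
    """Stand-in 'model' that knows nothing: every token is equally likely."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

loss = next_token_loss([0, 2, 1, 3], uniform_model)
print(round(loss, 4))  # log(4), about 1.3863
```

A model that has learned something about the text assigns more than 1/4 to the correct token, which pushes the loss below this uniform baseline.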

### Model Details
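- **Model**: SmolLM2-135M-Instruct
- **Method**: Continued pre-training with LoRA (r=16) on the attention and MLP projections, plus fully trained `embed_tokens` and `lm_head`
- **Task**: Adapt the model to Hindi text
- **Dataset**: A small hand-written Hindi corpus (20 unique texts repeated to 120 examples)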

### Key Concepts
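- **Pre-training**: Learning general language patterns from large amounts of raw text
- **Next-token prediction**: Training the model to predict each token from the tokens before it
- **Language adaptation**: Extending a model to a language it saw little of during pre-training
- **Domain adaptation**: Extending a model to a specialized subject area (for example medical or legal text)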

## Step 1: Install Required Libraries

In [1]:
# Install Unsloth
!pip install -q unsloth

# Install dependencies
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

## Step 2: Import Libraries

In [2]:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset, Dataset
import os

# Disable wandb
os.environ["WANDB_DISABLED"] = "true"

print("✓ All libraries imported successfully!")
print("✓ Ready for continued pre-training")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
✓ All libraries imported successfully!
✓ Ready for continued pre-training


## Step 3: Configure Model Parameters

In [3]:
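# Model configuration
max_seq_length = 512   # maximum number of tokens per training sequence
dtype = None           # None lets Unsloth choose float16 or bfloat16 for the GPU
load_in_4bit = True    # load the base weights 4-bit quantized to save memory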

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

print(f"Configuration:")
print(f"  Model: {model_name}")
print(f"  Method: Continued Pre-training")
print(f"  Task: Teaching Hindi language")
print(f"  Current: Model mainly knows English")
print(f"  Goal: Extend to Hindi")

Configuration:
  Model: HuggingFaceTB/SmolLM2-135M-Instruct
  Method: Continued Pre-training
  Task: Teaching Hindi language
  Current: Model mainly knows English
  Goal: Extend to Hindi


## Step 4: Load the Pre-trained Model

In [4]:
# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print("✓ Base model loaded successfully!")
print(f"Vocabulary size: {len(tokenizer)}")

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


HuggingFaceTB/SmolLM2-135M-Instruct does not have a padding token! Will use pad_token = <|endoftext|>.
✓ Base model loaded successfully!
Vocabulary size: 49152
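
The vocabulary stays at 49,152 entries because no Hindi tokens are added. One reason this matters: Devanagari characters take three bytes each in UTF-8, so a tokenizer with few Hindi-specific entries tends to spend more tokens per word. The sketch below only measures bytes, not tokens, and assumes (this is not confirmed by the notebook) that the tokenizer falls back toward byte-level pieces for unfamiliar scripts.

```python
def utf8_cost(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

english = "machine learning"
hindi = "मशीन लर्निंग"   # the same phrase, transliterated into Devanagari

en_chars, en_bytes = utf8_cost(english)
hi_chars, hi_bytes = utf8_cost(hindi)
print(f"English: {en_chars} chars, {en_bytes} bytes")
print(f"Hindi:   {hi_chars} chars, {hi_bytes} bytes")
```

Comparing `len(tokenizer(text)["input_ids"])` for the two strings in the notebook would show the actual token counts.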


## Step 5: Prepare Model for Continued Pre-training

In [5]:
# Add LoRA adapters for parameter-efficient continued pre-training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],  # Embeddings and output head included so they can adapt to Hindi text
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    modules_to_save=["embed_tokens", "lm_head"],  # Train and save these modules in full rather than as low-rank adapters
)

print("✓ Model prepared for continued pre-training!")
print("  Including embeddings for new language tokens")

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2025.11.2 patched 30 layers with 30 QKV layers, 30 O layers and 30 MLP layers.


Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM
✓ Model prepared for continued pre-training!
  Including embeddings for new language tokens
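
The training banner in Step 10 reports 61,507,584 trainable parameters. The arithmetic below is a sketch of where that number can come from. The layer count (30) and vocabulary size (49,152) appear in the logs above; the hidden sizes are assumptions taken from the published SmolLM2-135M configuration, not from this notebook.

```python
# Assumed SmolLM2-135M dimensions (from the published model config, not this notebook):
# hidden size 576, MLP size 1536, 3 key/value heads of size 64.
HIDDEN, INTERMEDIATE, KV_DIM = 576, 1536, 3 * 64
NUM_LAYERS, VOCAB_SIZE, RANK = 30, 49152, 16

# (in_features, out_features) of each LoRA-wrapped projection in one layer
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) next to the frozen weight."""
    return rank * (in_features + out_features)

lora_total = NUM_LAYERS * sum(lora_params(i, o, RANK) for i, o in projections.values())
embedding_total = 2 * VOCAB_SIZE * HIDDEN   # embed_tokens + lm_head, trained in full
trainable = lora_total + embedding_total

print(f"LoRA adapters: {lora_total:,}")
print(f"Embeddings:    {embedding_total:,}")
print(f"Trainable:     {trainable:,}")
```

Under these assumptions the low-rank adapters account for under 5 million parameters, and the two fully trained embedding matrices account for the rest.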


## Step 6: Create Hindi Text Dataset

For continued pre-training, we need raw text in the target language.

In [6]:
# Create a small Hindi text corpus
# In real scenarios, use large Hindi datasets from HuggingFace
hindi_texts = [
    "नमस्ते, मेरा नाम राज है। मैं भारत से हूं।",
    "आज मौसम बहुत अच्छा है। सूरज चमक रहा है।",
    "मुझे किताबें पढ़ना बहुत पसंद है।",
    "कंप्यूटर विज्ञान एक रोचक विषय है।",
    "मशीन लर्निंग कृत्रिम बुद्धिमत्ता का एक हिस्सा है।",
    "पायथन एक लोकप्रिय प्रोग्रामिंग भाषा है।",
    "मैं रोज सुबह व्यायाम करता हूं।",
    "शिक्षा बहुत महत्वपूर्ण है।",
    "भारत में कई भाषाएं बोली जाती हैं।",
    "तकनीक ने हमारी जिंदगी बदल दी है।",
    "डीप लर्निंग न्यूरल नेटवर्क का उपयोग करता है।",
    "मुझे संगीत सुनना पसंद है।",
    "विज्ञान हमें दुनिया को समझने में मदद करता है।",
    "कोडिंग सीखना आजकल बहुत जरूरी है।",
    "हिंदी भारत की सबसे बड़ी भाषा है।",
]

# Add bilingual examples to help bridging
bilingual_texts = [
    "Hello को हिंदी में नमस्ते कहते हैं।",
    "Computer को हिंदी में कंप्यूटर कहते हैं।",
    "Thank you को हिंदी में धन्यवाद कहते हैं।",
    "Python programming in Hindi: पायथन प्रोग्रामिंग",
    "Machine Learning in Hindi: मशीन लर्निंग",
]

# Combine and repeat: 20 unique texts x 6 = 120 examples (repetition adds no new information)
all_texts = hindi_texts + bilingual_texts
extended_texts = []
for _ in range(6):  # 6 copies of the 20 texts
    extended_texts.extend(all_texts)

# Create dataset
dataset = Dataset.from_dict({"text": extended_texts})

print(f"✓ Hindi text dataset created: {len(dataset)} examples")
print("\nSample texts:")
for i in range(3):
    print(f"  {i+1}. {dataset[i]['text']}")

✓ Hindi text dataset created: 120 examples

Sample texts:
  1. नमस्ते, मेरा नाम राज है। मैं भारत से हूं।
  2. आज मौसम बहुत अच्छा है। सूरज चमक रहा है।
  3. मुझे किताबें पढ़ना बहुत पसंद है।
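
Because the 120 examples are six copies of the same 20 texts, it helps to check how much unique data a corpus really contains before training. A minimal sketch, using placeholder strings in place of the Hindi sentences:

```python
from collections import Counter

# Placeholders standing in for the 20 unique texts (15 Hindi + 5 bilingual)
unique_texts = [f"sentence {i}" for i in range(20)]
extended = unique_texts * 6   # same repetition factor as the cell above

counts = Counter(extended)
print(f"total examples:  {len(extended)}")
print(f"unique examples: {len(counts)}")
print(f"copies of each:  {set(counts.values())}")
```

Running the same `Counter` over `dataset["text"]` in the notebook gives the real figures.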


## Step 7: Format Dataset for Pre-training

For pre-training, each example is raw text followed by the tokenizer's EOS token, which marks where one document ends.

In [7]:
EOS_TOKEN = tokenizer.eos_token

def format_pretraining(example):
    """Format text for continued pre-training."""
    # For pre-training, we just add EOS token
    return {"text": example["text"] + EOS_TOKEN}

dataset = dataset.map(format_pretraining)

print("✓ Dataset formatted for pre-training!")
print("\nFormatted example:")
print(dataset[0]["text"])

✓ Dataset formatted for pre-training!

Formatted example:
नमस्ते, मेरा नाम राज है। मैं भारत से हूं।<|im_end|>
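
One practical caveat: because the cell reassigns `dataset`, running it a second time appends a second EOS token to every example. The variant below is a sketch of a guard against that. `append_eos_once` is a hypothetical helper name, and the EOS string is hard-coded here only so the snippet runs on its own.

```python
EOS_TOKEN = "<|im_end|>"  # the EOS string shown in the formatted example above

def append_eos_once(example, eos=EOS_TOKEN):
    """Append EOS unless the text already ends with it."""
    text = example["text"]
    if not text.endswith(eos):
        text += eos
    return {"text": text}

once = append_eos_once({"text": "नमस्ते"})
twice = append_eos_once(once)          # simulates re-running the cell
print(once["text"])
print(twice["text"].count(EOS_TOKEN))  # still a single EOS
```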


## Step 8: Configure Training Arguments

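This run uses 100 optimizer steps with 10 warmup steps, a peak learning rate of 3e-4, and a cosine decay schedule. With an effective batch size of 8 (2 per device × 4 accumulation steps), that is roughly seven passes over the 120 examples.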

In [8]:
# Continued pre-training configuration
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=10,  # 10% of the run is spent ramping the learning rate up
    max_steps=100,     # short demo run; real language adaptation needs far more
    learning_rate=3e-4,  # peak learning rate reached after warmup
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine",  # Cosine schedule for pre-training
    seed=3407,
    output_dir="outputs_pretrain",
    report_to="none",
)

print("✓ Pre-training arguments configured!")
print(f"  Max steps: {training_args.max_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Note: Slightly different from fine-tuning config")

✓ Pre-training arguments configured!
  Max steps: 100
  Learning rate: 0.0003
  Note: Slightly different from fine-tuning config
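
To see what `warmup_steps=10` with `lr_scheduler_type="cosine"` does to the learning rate, the sketch below reproduces the usual shape: linear warmup followed by a half-cosine decay to zero. It is an approximation written for illustration; the scheduler inside the trainer may differ by a step at the boundaries.

```python
import math

def cosine_lr(step, base_lr=3e-4, warmup_steps=10, max_steps=100):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 5, 10, 55, 100):
    print(f"step {step:3d}: lr = {cosine_lr(step):.6f}")
```

The peak rate of 3e-4 is reached at step 10, half of it remains at step 55, and the rate returns to zero at step 100.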


## Step 9: Initialize Trainer

In [9]:
# Initialize trainer for continued pre-training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
)

print("✓ Continued pre-training trainer initialized!")

✓ Continued pre-training trainer initialized!
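
The trainer is created with `packing=False`, so each short sentence becomes its own training sequence. For larger corpora, pre-training pipelines often pack documents into fixed-size blocks instead. The function below is a simplified sketch of that idea on toy token ids; `pack_sequences` and the ids are hypothetical, and real implementations may also handle attention across document boundaries.

```python
def pack_sequences(tokenized_docs, block_size):
    """Concatenate documents into one stream and cut it into fixed-size blocks."""
    stream = [tok for doc in tokenized_docs for tok in doc]
    usable = len(stream) - len(stream) % block_size   # drop the ragged tail
    return [stream[i : i + block_size] for i in range(0, usable, block_size)]

EOS = 0  # hypothetical EOS token id
docs = [[5, 6, EOS], [7, 8, 9, 10, EOS], [11, EOS]]   # toy documents, each ending in EOS
blocks = pack_sequences(docs, block_size=4)
print(blocks)  # [[5, 6, 0, 7], [8, 9, 10, 0]]
```

The EOS token appended in Step 7 is what lets the model tell where one packed document stops and the next begins.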


## Step 10: Continue Pre-training on Hindi

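Run the training loop. With only 120 short examples and 100 steps, this run demonstrates the workflow; it is far too small to teach the model fluent Hindi.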

In [10]:
print("Starting continued pre-training on Hindi...")
print("Teaching the model a new language!\n")

trainer_stats = trainer.train()

print("\n" + "="*60)
print("✓ Continued pre-training completed!")
print("="*60)
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"Training loss: {trainer_stats.metrics['train_loss']:.4f}")
print(f"\nModel now has Hindi language capability!")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 2
   \\   /|    Num examples = 120 | Num Epochs = 7 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 61,507,584 of 224,334,144 (27.42% trained)


Starting continued pre-training on Hindi...
Teaching the model a new language!



Step,Training Loss
1,2.6493
2,2.7926
3,2.8331
4,3.1016
5,2.9086
6,2.7636
7,2.7073
8,2.8941
9,2.5776
10,2.3336



✓ Continued pre-training completed!
Training time: 215.37 seconds
Training loss: 1.3904

Model now has Hindi language capability!
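
Two numbers in this output can be checked by hand: the epoch count in the banner and the perplexity implied by the mean training loss. The sketch below uses only values printed above.

```python
import math

num_examples = 120
max_steps = 100
effective_batch = 2 * 4 * 1          # per-device batch x gradient accumulation x data-parallel GPUs

samples_seen = max_steps * effective_batch
epochs = math.ceil(samples_seen / num_examples)
perplexity = math.exp(1.3904)        # exp of the reported mean training loss

print(f"samples seen: {samples_seen}")
print(f"epochs:       {epochs}")
print(f"perplexity:   {perplexity:.2f}")
```

Since every example is seen about seven times and the corpus holds only 20 unique texts, a training perplexity near 4 mostly reflects memorization of those sentences. Perplexity on held-out Hindi text would be the meaningful figure.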


## Step 11: Test Hindi Generation

Test if the model can understand and generate Hindi text.

In [11]:
FastLanguageModel.for_inference(model)

# Test with Hindi prompt
test_prompt = "मशीन लर्निंग"

print("Test Prompt (Hindi):")
print(test_prompt)
print("\n" + "="*50 + "\n")

inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")

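# Note: temperature and top_p only take effect when sampling is enabled
# (do_sample=True); otherwise generate() typically decodes greedily and ignores them.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.2,
)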

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print("Model's Hindi Generation:")
print(response)
print("\n" + "="*50)
print("Note: Model should continue in Hindi!")

Test Prompt (Hindi):
मशीन लर्निंग


Model's Hindi Generation:
मशीन लर्निंग का पूरकों से बदल दुनिया जसता है।

Note: Model should continue in Hindi!
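
The `temperature` and `top_p` arguments shape the distribution that tokens are sampled from. The sketch below shows the idea on four toy logits: scale by temperature, apply softmax, then keep the smallest set of tokens whose cumulative probability reaches `top_p`. It is an illustration of nucleus filtering in plain Python, not the library's implementation, and the logits are made up.

```python
import math

def top_p_filter(logits, temperature=0.8, top_p=0.9):
    """Return {token_index: renormalized probability} for the nucleus."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]          # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:                          # smallest set reaching top_p
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

nucleus = top_p_filter([2.0, 1.0, 0.5, -1.0])
print(sorted(nucleus))  # [0, 1, 2]: the least likely token is cut off
```

A lower temperature sharpens the distribution, so fewer tokens are needed to reach the same `top_p` mass.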


## Step 12: More Hindi Tests

In [12]:
def test_hindi(prompt):
    """Test Hindi generation."""
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        use_cache=True,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.2,
    )

    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(f"\n{'='*80}")
    print(f"Prompt: {prompt}")
    print(f"\nGenerated: {response}")
    print(f"{'='*80}")

print("Testing Hindi generation with various prompts...\n")

test_hindi("नमस्ते,")
test_hindi("कंप्यूटर")
test_hindi("पायथन प्रोग्रामिंग")

Testing Hindi generation with various prompts...


Prompt: नमस्ते,

Generated: नमस्ते, परिगमां सुकूद को ही ज️ेयों मैं।

Prompt: कंप्यूटर

Generated: कंप्यूटर हिसे को समनुदानती बहुत भवते.

Prompt: पायथन प्रोग्रामिंग

Generated: पायथन प्रोग्रामिंग के।
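
A cheap first check on outputs like these is whether the model stays in the right script. The helper below is a sketch (`devanagari_fraction` is a hypothetical name) that measures the share of characters in the Devanagari Unicode block. It says nothing about grammar or meaning, so a high score is necessary but not sufficient for good Hindi.

```python
def devanagari_fraction(text):
    """Share of non-whitespace characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)

print(devanagari_fraction("पायथन प्रोग्रामिंग के।"))   # all Devanagari
print(devanagari_fraction("Machine learning is"))      # no Devanagari
print(devanagari_fraction("Python प्रोग्रामिंग"))        # mixed script
```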


## Step 13: Test Bilingual Capability

Check whether the model still produces English after training on Hindi, alongside one Hindi prompt.

In [13]:
print("Testing bilingual capability (English + Hindi)...\n")

# Test English (should still work)
english_prompt = "Machine learning is"
inputs = tokenizer([english_prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.7)
english_response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print("English Test:")
print(f"Prompt: {english_prompt}")
print(f"Generated: {english_response}")

print("\n" + "="*80 + "\n")

# Test Hindi
hindi_prompt = "मशीन लर्निंग"
inputs = tokenizer([hindi_prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.7)
hindi_response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print("Hindi Test:")
print(f"Prompt: {hindi_prompt}")
print(f"Generated: {hindi_response}")

print("\n" + "="*80)
print("✅ Model should work in both languages!")

Testing bilingual capability (English + Hindi)...

English Test:
Prompt: Machine learning is
Generated: Machine learning is about how we learn best in different situations, and how to adapt our learning style to suit different contexts.


Hindi Test:
Prompt: मशीन लर्निंग
Generated: मशीन लर्निंग में मुझें महत्वज्ञा में मुझें महत्वज्ञा म

✅ Model should work in both languages!
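
The Hindi output above repeats the same phrase, which is a common sign of an undertrained model. One simple way to quantify this is the ratio of distinct word n-grams to total n-grams. The function below is a sketch of that metric; the name `distinct_ngram_ratio` and the choice of bigrams are illustrative, not part of the notebook.

```python
def distinct_ngram_ratio(text, n=2):
    """Unique word n-grams divided by total word n-grams (1.0 means no repeats)."""
    words = text.split()
    ngrams = [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0
    return len(set(ngrams)) / len(ngrams)

looping = "a b c a b c a b c"     # toy text stuck in a loop
varied = "a b c d e f g h i"      # toy text with no repeated bigram
hindi_sample = "मशीन लर्निंग में मुझें महत्वज्ञा में मुझें महत्वज्ञा म"

looping_ratio = distinct_ngram_ratio(looping)
varied_ratio = distinct_ngram_ratio(varied)
hindi_ratio = distinct_ngram_ratio(hindi_sample)
print(looping_ratio, varied_ratio, hindi_ratio)
```

Tracking this ratio for a fixed set of prompts during training gives an early, language-independent warning when generations start to loop.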


## Step 14: Save the Multilingual Model

In [14]:
# Save the trained adapter weights (LoRA plus the embedding modules) and the tokenizer
model.save_pretrained("smollm2_135m_hindi")
tokenizer.save_pretrained("smollm2_135m_hindi")

print("✓ Multilingual model saved to 'smollm2_135m_hindi' directory")
print("\nThe model now:")
print("  ✅ Understands Hindi")
print("  ✅ Generates Hindi text")
print("  ✅ Retains English capability")
print("  ✅ Is bilingual (English + Hindi)")

✓ Multilingual model saved to 'smollm2_135m_hindi' directory

The model now:
  ✅ Understands Hindi
  ✅ Generates Hindi text
  ✅ Retains English capability
  ✅ Is bilingual (English + Hindi)


## Step 15: Understanding Continued Pre-training

In [15]:
print("\n" + "="*80)
print("UNDERSTANDING CONTINUED PRE-TRAINING")
print("="*80)

print("\n📚 What is Continued Pre-training?")
print("  Continued pre-training extends a model's knowledge:")
print("  • Teaches NEW languages (like Hindi)")
print("  • Teaches NEW domains (like medical, legal)")
print("  • Uses raw text (no instruction format)")
print("  • Next-token prediction objective")

print("\n🔬 Difference from Fine-tuning:")
print("  Fine-tuning:")
print("    • Adapts EXISTING knowledge")
print("    • Uses instruction-output pairs")
print("    • Short training (60-100 steps)")
print("    • For task adaptation")
print("  ")
print("  Continued Pre-training:")
print("    • Teaches NEW knowledge")
print("    • Uses raw text corpus")
print("    • Longer training (1000s of steps ideally)")
print("    • For capability expansion")

print("\n📊 This Notebook:")
print("  • Taught Hindi to an English model")
print("  • Used 100+ Hindi text examples")
print("  • Trained for 100 steps (minimal but demonstrative)")
print("  • Model now bilingual!")

print("\n🎯 Real-world Applications:")
print("  • Multilingual models (add languages)")
print("  • Domain-specific models (medical, legal, code)")
print("  • Time-updated models (learn new events/facts)")
print("  • Custom knowledge bases")

print("\n💡 Key Requirements:")
print("  • Large corpus in target domain/language")
print("  • Sufficient training time")
print("  • Monitor for catastrophic forgetting")
print("  • Balance old vs new knowledge")

print("\n✅ What We Learned:")
print("  • How to adapt models to new languages")
print("  • Difference between pre-training and fine-tuning")
print("  • Bilingual capability development")
print("  • Knowledge expansion techniques")

print("\n" + "="*80)


UNDERSTANDING CONTINUED PRE-TRAINING

📚 What is Continued Pre-training?
  Continued pre-training extends a model's knowledge:
  • Teaches NEW languages (like Hindi)
  • Teaches NEW domains (like medical, legal)
  • Uses raw text (no instruction format)
  • Next-token prediction objective

🔬 Difference from Fine-tuning:
  Fine-tuning:
    • Adapts EXISTING knowledge
    • Uses instruction-output pairs
    • Short training (60-100 steps)
    • For task adaptation
  
  Continued Pre-training:
    • Teaches NEW knowledge
    • Uses raw text corpus
    • Longer training (1000s of steps ideally)
    • For capability expansion

📊 This Notebook:
  • Taught Hindi to an English model
  • Used 100+ Hindi text examples
  • Trained for 100 steps (minimal but demonstrative)
  • Model now bilingual!

🎯 Real-world Applications:
  • Multilingual models (add languages)
  • Domain-specific models (medical, legal, code)
  • Time-updated models (learn new events/facts)
  • Custom knowledge bases

💡 Key Requirements:
  • Large corpus in target domain/language
  • Sufficient training time
  • Monitor for catastrophic forgetting
  • Balance old vs new knowledge

✅ What We Learned:
  • How to adapt models to new languages
  • Difference between pre-training and fine-tuning
  • Bilingual capability development
  • Knowledge expansion techniques

## Summary

### What We Did:
1. ✅ Loaded the English-centric SmolLM2-135M-Instruct
2. ✅ Created a small Hindi text corpus
3. ✅ Ran continued pre-training on Hindi
4. ✅ Tested Hindi generation
5. ✅ Spot-checked generation in English and Hindi
6. ✅ Saved the trained adapter weights and tokenizer

### Key Concepts:
- **Continued Pre-training**: Teaching new knowledge/languages
- **Language Adaptation**: Extending to new languages
- **Knowledge Retention**: Keeping existing capabilities
- **Bilingual Models**: Supporting multiple languages

### Complete Journey Across All Colabs:
- **Colab 1**: Full fine-tuning (high rank, instruction following)
- **Colab 2**: LoRA fine-tuning (efficient, instruction following)
- **Colab 3**: DPO (preference learning, alignment)
- **Colab 4**: GRPO (reasoning, chain-of-thought)
- **Colab 5**: Continued pre-training (new language) ⭐

### When to Use Continued Pre-training:
- ✅ You need to add a new language
- ✅ You want a domain-specific model
- ✅ You have a large corpus of raw text
- ✅ The model needs knowledge it does not already have

### Important Notes:
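- This demo uses minimal data: 120 examples built from 20 unique texts
- Real language adaptation needs thousands to millions of examples
- Training for 100 steps only demonstrates the workflow; the sample generations above use Devanagari script but are not yet fluent Hindi
- A production run would need 10,000+ steps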
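
To put the step counts in perspective, the sketch below computes an upper bound on the number of tokens a run can process, using the batch settings from Step 8 and assuming every sequence is filled to `max_seq_length`. With `packing=False` and one-sentence examples, the real figure for this demo is far lower.

```python
def max_tokens_seen(steps, per_device_batch=2, grad_accum=4, seq_len=512):
    """Upper bound on tokens processed if every sequence were filled to seq_len."""
    return steps * per_device_batch * grad_accum * seq_len

print(f"{max_tokens_seen(100):,}")      # 409,600
print(f"{max_tokens_seen(10_000):,}")   # 40,960,000
```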

### Next Steps for Production:
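1. Use a larger Hindi corpus (e.g., OSCAR, CC-100)
2. Train for many more steps (10K-100K)
3. Monitor perplexity on held-out Hindi and English text
4. Tune the learning-rate schedule (warmup length, peak rate, decay) instead of reusing the demo values
5. Validate on both languages regularly to catch forgetting of English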
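
One common way to balance old and new knowledge is to mix some original-language text back into the training corpus, often called replay. This notebook does not do that. The sketch below shows one possible way to build such a mix; `mix_with_replay`, the 25% ratio, and the placeholder documents are all assumptions for illustration.

```python
import random

def mix_with_replay(new_texts, old_texts, replay_ratio=0.25, seed=3407):
    """Build a shuffled corpus in which about replay_ratio of the items are old-language text."""
    rng = random.Random(seed)
    n_old = round(len(new_texts) * replay_ratio / (1.0 - replay_ratio))
    replay = [rng.choice(old_texts) for _ in range(n_old)]
    mixed = [("hi", t) for t in new_texts] + [("en", t) for t in replay]
    rng.shuffle(mixed)
    return mixed

hindi_docs = [f"hindi doc {i}" for i in range(90)]       # placeholder documents
english_docs = [f"english doc {i}" for i in range(10)]   # placeholder replay pool
mixed = mix_with_replay(hindi_docs, english_docs)
share = sum(1 for lang, _ in mixed if lang == "en") / len(mixed)
print(len(mixed), round(share, 2))
```

The right ratio depends on the model and corpus, so it is worth validating against English perplexity rather than fixing it in advance.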

### Resources:
- Continued Pre-training Guide: https://docs.unsloth.ai/basics/continued-pretraining
- Unsloth Documentation: https://docs.unsloth.ai/
- Hindi Datasets: OSCAR, CC-100, IndicCorp

## 🎉 Assignment Complete!

You now have all 5 Colab notebooks:
1. ✅ Full Fine-tuning
2. ✅ LoRA Fine-tuning
3. ✅ DPO Preference Learning
4. ✅ GRPO Reasoning Model
5. ✅ Continued Pre-training

Record your YouTube videos explaining each approach! 🚀