<a href="https://colab.research.google.com/github/kat-le/cmpe255-unsloth.ai/blob/main/continued_pretraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Teach a Small LLM French with Unsloth (CPT on Wikipedia-FR)

This Colab shows how to continually pretrain (CPT) a lightweight LLM on French Wikipedia to improve its French fluency. We use Unsloth to load the model in 4-bit (low VRAM) while computing in a more stable precision, and we train LoRA adapters so the run fits on a Colab T4.

## What we’ll do
* Set up Unsloth + Hugging Face ecosystem on Colab.
* Load a compact base model (unsloth/gemma-3-1b-pt…) in 4-bit for speed/VRAM.
* Run a baseline French generation (before training).
* Prepare French Wikipedia texts with a clean template and EOS tokens.
* Continue pretraining (CPT) with LoRA adapters.
* Generate again after CPT and compare with the baseline.

## Dataset
**Source**: wikimedia/wikipedia snapshot 20231101.fr
* We keep a random 1% of the articles, then hold out 1% of that subset as a tiny eval split.

# Install Libraries
* Installs Unsloth and the Hugging Face stack plus bitsandbytes for 4-/8-bit loading and the Hub client. This sets up everything needed for lightweight fine-tuning and inference on Colab.

In [1]:
!pip -q install -U "unsloth" "unsloth_zoo" "transformers>=4.45.0" "trl>=0.11.4" "datasets" "accelerate" "peft" "bitsandbytes" "huggingface_hub"
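
Because the install command uses `-U` with loose version bounds, the resolved versions can differ between runs. A minimal sketch for recording what was actually installed, using only the standard library (the helper name `installed_versions` is hypothetical, not part of any of these packages):

```python
from importlib.metadata import PackageNotFoundError, version

# Package names copied from the pip command above.
PACKAGES = [
    "unsloth", "unsloth_zoo", "transformers", "trl", "datasets",
    "accelerate", "peft", "bitsandbytes", "huggingface_hub",
]

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name:16s} {ver or 'NOT INSTALLED'}")
```

Saving this printout next to the results makes it easier to reproduce a run later.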


# Environment & GPU check

* Import Python dependencies and print basic runtime info.
* Check whether a CUDA GPU is available and report the PyTorch/CUDA versions and the GPU name.

In [2]:
import os, torch

# Check GPU
!nvidia-smi -L || echo "No GPU detected."
if torch.cuda.is_available():
    print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda, "| Device:", torch.cuda.get_device_name(0))
else:
    print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available(), "| Using CPU")


GPU 0: Tesla T4 (UUID: GPU-df067857-72b4-b59d-e854-639288c6323d)
PyTorch: 2.8.0+cu126 | CUDA: 12.6 | Device: Tesla T4


# Global Configuration

* Define the base model, sequence length, precision, and random seed.
* Use unsloth/gemma-3-1b-pt-unsloth-bnb-4bit as the base model for CPT.

In [27]:
import torch
from unsloth import is_bfloat16_supported

BASE_MODEL = "unsloth/gemma-3-1b-pt-unsloth-bnb-4bit"
MAX_SEQ_LEN = 512
LOAD_IN_4BIT = True
DTYPE = torch.bfloat16 if is_bfloat16_supported() else torch.float16
SEED = 3407
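
The `DTYPE` line picks bfloat16 only when the GPU supports it. As a rough mental model, and assuming the check boils down to the GPU's CUDA compute capability (bfloat16 arrived with Ampere, capability 8.0), the decision can be sketched without any GPU. The function `pick_dtype_name` is hypothetical and is not how Unsloth necessarily implements `is_bfloat16_supported`:

```python
def pick_dtype_name(compute_capability_major: int) -> str:
    """Mirror the DTYPE choice above using only the compute capability.

    Assumption: bfloat16 needs compute capability 8.0 or newer.
    """
    return "bfloat16" if compute_capability_major >= 8 else "float16"

# The Tesla T4 used in this notebook reports compute capability 7.5.
print(pick_dtype_name(7))  # float16
# An Ampere-class GPU (capability 8.x) would take the other branch.
print(pick_dtype_name(8))  # bfloat16
```

On the T4 this yields float16, which is why the Unsloth log further down matters: it reports that float16 does not work for Gemma 3 and falls back to float32.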

# Load Model (4-bit)

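* Loads the model with 4-bit weights but float32 compute for stability (the T4 has no bfloat16, and Unsloth reports that float16 does not work for Gemma 3), sets a pad token if the tokenizer lacks one, and switches the model to Unsloth's inference mode, with FlashAttention-2 disabled where that option is accepted.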

In [29]:
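import torch
from unsloth import FastLanguageModel

# 4-bit weights keep VRAM low; compute runs in float32 because the T4 has no
# bfloat16 and Unsloth reports that float16 does not work for Gemma 3.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = BASE_MODEL,
    max_seq_length = MAX_SEQ_LEN,
    dtype          = torch.float32,
    load_in_4bit   = LOAD_IN_4BIT,
)

# Fall back to EOS as the pad token if the tokenizer does not define one.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# use_flash_attention_2 may not be accepted by every Unsloth version;
# retry without it if the call raises TypeError.
try:
    FastLanguageModel.for_inference(model, use_flash_attention_2=False)
except TypeError:
    FastLanguageModel.for_inference(model)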


==((====))==  Unsloth 2025.11.2: Fast Gemma3 patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
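
To see why 4-bit loading matters on a T4, here is a back-of-the-envelope estimate of weight memory at different precisions. It uses the total parameter count that the Unsloth training banner reports later in this notebook. This is only an approximation: it ignores quantization constants, any layers kept in higher precision, activations, and optimizer state.

```python
# Total parameter count as reported by the training banner in this notebook.
NUM_PARAMS = 1_017_144_448
GIB = 1024 ** 3

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "4-bit": 0.5}

def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:8s} ~{weight_memory_gib(NUM_PARAMS, nbytes):.2f} GiB")
```

The 4-bit weights need roughly an eighth of the float32 footprint, which leaves most of the T4's memory for activations and the LoRA training state.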


# Baseline Inference (before training)

* Run a quick French generation to see how the base model behaves before CPT.

In [30]:
from transformers import TextStreamer

prompt = "Complète cette phrase en bon français, de façon naturelle : La francophonie est un espace où"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    **inputs,
    max_new_tokens=80,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    streamer=streamer,
)

 se côtoient les langues, les cultures et les valeurs.

L'anglais est une langue de communication très importante dans le monde moderne. Grâce à l'anglais, les gens peuvent communiquer avec des personnes d'autres pays. Ils peuvent étudier à l'étranger, travailler dans d'autres pays et voyager à travers le monde. Les langues étrangères sont très importantes pour les gens qui


## Translation of Baseline Prompt and Response

* Prompt: Complete this sentence in good French, in a natural way: "The Francophonie is a space where"

  * Response: Languages, cultures and values coexist. English is a very important language of communication in the modern world. Thanks to English, people can communicate with people from other countries. They can study abroad, work in other countries, and travel the world. Foreign languages are very important for people who



# Enable LoRA Fine-Tuning

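* Wrap the model with PEFT/LoRA adapters on the attention and MLP projections and on lm_head.
* Unsloth's gradient checkpointing reduces memory usage; rank-stabilized LoRA (rsLoRA) scales the adapter update by alpha/sqrt(r) instead of alpha/r, which keeps the update magnitude stable across ranks.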

In [31]:
from unsloth import FastLanguageModel

# "embed_tokens" is left out of target_modules to save memory.
# If you still hit OOM, remove "lm_head" next.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj","k_proj","v_proj","o_proj",
        "gate_proj","up_proj","down_proj",
        "lm_head",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=SEED,
    use_rslora=True,
    loftq_config=None,
)



Unsloth: Making `model.base_model.model.model` require gradients
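
Two numbers follow directly from this configuration: the adapter scaling factor and the number of trainable parameters. The sketch below assumes that a LoRA adapter on a linear layer adds `r * (in_features + out_features)` weights, and it assumes Gemma 3 1B layer shapes taken from the public model config rather than from this notebook, so treat those constants as assumptions.

```python
import math

R, ALPHA = 16, 16  # r and lora_alpha from the get_peft_model call

# Scaling applied to the low-rank update B @ A.
standard_scale = ALPHA / R            # classic LoRA
rslora_scale = ALPHA / math.sqrt(R)   # rank-stabilized LoRA (use_rslora=True)
print("LoRA scale:", standard_scale, "| rsLoRA scale:", rslora_scale)

# ASSUMED Gemma 3 1B shapes (public model config, not stated in this notebook).
HIDDEN, Q_OUT, KV_OUT, MLP, VOCAB, LAYERS = 1152, 1024, 256, 6912, 262_144, 26

# (in_features, out_features) of each targeted projection in one decoder layer.
per_layer = {
    "q_proj": (HIDDEN, Q_OUT), "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT), "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, r=R):
    # A is (r x in) and B is (out x r), so the adapter adds r * (in + out) weights.
    return r * (in_features + out_features)

layer_total = sum(lora_params(i, o) for i, o in per_layer.values())
total = LAYERS * layer_total + lora_params(HIDDEN, VOCAB)  # plus the lm_head adapter
print(f"Trainable LoRA parameters: {total:,}")
```

Under these assumptions the total comes to 17,258,496, the same figure the Unsloth training banner reports as trainable parameters further down. With r=16 and alpha=16, rsLoRA applies a scale of 4.0 where classic LoRA would apply 1.0.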


# French Wikipedia Formatting

* Create a French prompt template for Wikipedia pages and a formatting_prompts_func that merges the title and article text and appends the EOS token, so the model sees where each article ends.

In [32]:
# French template used for training:
wikipedia_prompt = """Article Wikipédia
### Titre : {}

### Article :
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    titles = examples["title"]
    texts  = examples["text"]
    outputs = []
    for title, text in zip(titles, texts):
        title = title or ""
        text  = text or ""
        formatted = wikipedia_prompt.format(title, text) + EOS_TOKEN
        outputs.append(formatted)
    return {"text": outputs}
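
To check what the formatter produces without loading a tokenizer, here is a small self-contained sketch. It repeats the template, takes the EOS token as a parameter, and runs on a hypothetical two-row batch; `"<eos>"` is a placeholder, not the real value of `tokenizer.eos_token`.

```python
wikipedia_prompt = """Article Wikipédia
### Titre : {}

### Article :
{}"""

def format_batch(examples, eos_token):
    """Same logic as formatting_prompts_func, with the EOS token passed in."""
    outputs = []
    for title, text in zip(examples["title"], examples["text"]):
        # Missing titles or texts become empty strings instead of "None".
        outputs.append(wikipedia_prompt.format(title or "", text or "") + eos_token)
    return {"text": outputs}

# Hypothetical batch in the column layout that datasets.map(batched=True) passes in.
toy_batch = {
    "title": ["Paris", None],
    "text": ["Paris est la capitale de la France.", "Texte sans titre."],
}
formatted = format_batch(toy_batch, eos_token="<eos>")
print(formatted["text"][0])
```

Every formatted row ends with the EOS marker, and a missing title leaves the `### Titre :` line empty rather than printing `None`.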

# Load Data and Prepare Splits

* Load the French snapshot from wikimedia/wikipedia.
* Keep a random 1% subset of the articles (seeded for reproducibility).
* Map the formatter to produce a single text column, then create a tiny eval split (1% of that subset).

In [33]:
from datasets import load_dataset

WIKI_CONFIG = "20231101.fr"

dataset_full = load_dataset("wikimedia/wikipedia", WIKI_CONFIG, split="train")

dataset_1p = dataset_full.train_test_split(train_size=0.01, seed=SEED)["train"]

dataset_1p = dataset_1p.map(
    formatting_prompts_func,
    batched=True,
    remove_columns=dataset_1p.column_names,
    desc="Formatting (title+article+EOS)",
)

splits = dataset_1p.train_test_split(test_size=0.01, seed=SEED)
ds_train, ds_eval = splits["train"], splits["test"]

print(f"Train size (within 1%): {len(ds_train):,} | Eval size: {len(ds_eval):,}")
print("\nPreview (first 500 chars):\n", ds_train[0]["text"][:500])

Train size (within 1%): 25,389 | Eval size: 257

Preview (first 500 chars):
 Article Wikipédia
### Titre : The Ghost (film, 1913, Kirkwood)

### Article :
The Ghost est un film américain réalisé par James Kirkwood Sr., sorti en 1913.

Synopsis

Fiche technique 

 Date de sortie :
  :

Distribution 
 James Kirkwood Sr. : Jim
 Gertrude Robinson : Gertrude Howard

Voir aussi

Articles connexes 
 Films américains sortis en 1913

Liens externes 
 

Film américain sorti en 1913
Court métrage américain
Film dramatique américain
Film réalisé par James Kirkwood Sr.
Film muet amér
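
The split sizes printed above can be reproduced with simple arithmetic. This sketch assumes the test split is rounded up to a whole number of rows, as in scikit-learn-style splitting; that rounding rule is an assumption, though it matches the reported sizes.

```python
import math

train_reported, eval_reported = 25_389, 257    # sizes printed by the cell above
subset_size = train_reported + eval_reported   # rows in the 1% subset

# Assumption: the eval split is ceil(test_size * n); the rest goes to train.
eval_size = math.ceil(0.01 * subset_size)
train_size = subset_size - eval_size

print(f"1% subset: {subset_size:,} | train: {train_size:,} | eval: {eval_size:,}")
```

The eval split is therefore 1% of 1%, or about 0.01% of the snapshot, which is enough for a quick sanity check but too small for a reliable benchmark.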


# Training Setup and Run

* Configures UnslothTrainingArguments for a quick run: small batch with accumulation, max_steps=200 to cap time, cosine scheduler, 8-bit AdamW, and no external logging.
* Builds UnslothTrainer over the text field and starts training on the 1% subset. eval_dataset is None, so the held-out split is not scored during this run.
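
The learning-rate settings in this section (learning_rate=5e-5, warmup_ratio=0.1, a cosine scheduler, max_steps=200) imply a specific curve. A minimal sketch, assuming linear warmup followed by a half-cosine decay to zero; the function `lr_at` is hypothetical and only approximates what the trainer's scheduler does:

```python
import math

BASE_LR, MAX_STEPS, WARMUP_RATIO = 5e-5, 200, 0.1  # values from the training arguments
WARMUP_STEPS = round(MAX_STEPS * WARMUP_RATIO)     # assumed rounding; 20 steps here

def lr_at(step):
    """Linear warmup, then half-cosine decay to zero (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 10, 20, 110, 200):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
```

The rate climbs to its peak at step 20, is back at half the peak by step 110, and reaches zero at step 200.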

In [35]:
from unsloth import UnslothTrainer, UnslothTrainingArguments
from unsloth import is_bfloat16_supported

args = UnslothTrainingArguments(
    output_dir                  = "gemma3_fr_cpt",
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 16,
    max_steps                   = 200,
    learning_rate               = 5e-5,
    embedding_learning_rate     = 5e-6,
    warmup_ratio                = 0.1,
    lr_scheduler_type           = "cosine",
    logging_steps               = 20,
    save_strategy               = "steps",
    save_steps                  = 300,
    optim                       = "adamw_8bit",
    weight_decay                = 0.0,
    fp16                        = not is_bfloat16_supported(),
    bf16                        = is_bfloat16_supported(),
    seed                        = SEED,
    report_to                   = "none",
)

trainer = UnslothTrainer(
    model              = model,
    tokenizer          = tokenizer,
    train_dataset      = ds_train,
    eval_dataset       = None,
    dataset_text_field = "text",
    max_seq_length     = MAX_SEQ_LEN,
    dataset_num_proc   = 2,
    args               = args,
)

train_stats = trainer.train()
train_stats

Unsloth: Switching to float32 training since model cannot work with float16


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 25,389 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 16 x 1) = 16
 "-____-"     Trainable parameters = 17,258,496 of 1,017,144,448 (1.70% trained)


| Step | Training Loss |
|------|---------------|
| 20   | 2.2671        |
| 40   | 2.1094        |
| 60   | 2.1621        |
| 80   | 2.2096        |
| 100  | 2.2628        |
| 120  | 2.2586        |
| 140  | 2.2828        |
| 160  | 2.2813        |
| 180  | 2.3129        |
| 200  | 2.2969        |




TrainOutput(global_step=200, training_loss=2.2443505859375, metrics={'train_runtime': 1470.7136, 'train_samples_per_second': 2.176, 'train_steps_per_second': 0.136, 'total_flos': 4825366653615360.0, 'train_loss': 2.2443505859375, 'epoch': 0.12603883571625507})
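
The numbers in the banner and in TrainOutput can be cross-checked with a few lines of arithmetic. All inputs are copied from the outputs above. The perplexity line assumes the reported loss is a mean cross-entropy in nats, which is the usual convention but is not stated in the log.

```python
import math

# Figures copied from the Unsloth banner and the TrainOutput above.
NUM_EXAMPLES, PER_DEVICE_BATCH, GRAD_ACCUM, MAX_STEPS = 25_389, 1, 16, 200
TRAIN_LOSS, RUNTIME_S = 2.2443505859375, 1470.7136

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM   # sequences per optimizer step
sequences_seen = effective_batch * MAX_STEPS      # sequences in the whole run
epoch_fraction = sequences_seen / NUM_EXAMPLES    # share of the training split
seconds_per_step = RUNTIME_S / MAX_STEPS
samples_per_second = sequences_seen / RUNTIME_S
perplexity = math.exp(TRAIN_LOSS)                 # assumes loss is in nats

print(f"effective batch    : {effective_batch}")
print(f"sequences seen     : {sequences_seen:,}")
print(f"epoch fraction     : {epoch_fraction:.4f}")
print(f"seconds per step   : {seconds_per_step:.2f}")
print(f"samples per second : {samples_per_second:.3f}")
print(f"train perplexity   : {perplexity:.2f}")
```

The run covers about 12.6% of the training split, matching the reported epoch value. Since the logged loss stays between 2.11 and 2.31 without a clear downward trend, this short run is best read as a smoke test rather than as evidence of improved French modeling.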

# Inference Helpers (After Training)

* Define a helper that returns only the continuation (not the prompt) and a decoding preset, "sampled but guarded": moderate-temperature nucleus sampling with repetition controls.
* It then tests a few French prompts so you can assess post-CPT behavior.
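
The preset in this section uses top_p=0.9 (nucleus sampling). As an illustration of the idea only, here is a pure-Python sketch over a hypothetical next-token distribution. Transformers implements this on logit tensors and applies the other controls (temperature, repetition penalty, n-gram blocking) as well, so this is not the library's code.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1 before sampling.
    return {token: p / total for token, p in kept.items()}

# Hypothetical distribution for the token after "La francophonie est un espace où".
toy_probs = {"se": 0.55, "les": 0.25, "l'": 0.12, "chaque": 0.05, "xyz": 0.03}
nucleus = top_p_filter(toy_probs, top_p=0.9)
print(nucleus)
```

Here the three most likely tokens already cover 92% of the mass, so the two unlikely tokens are never sampled. That is the "guarded" part of the preset: sampling stays varied but avoids the low-probability tail.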

In [45]:
from transformers import StoppingCriteria, StoppingCriteriaList

def generate_only_new(prompt, **gen_cfg):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        return_dict_in_generate=True,
        **gen_cfg,
    )
    new_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

SAMPLED_GUARDED_CFG = dict(
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    max_new_tokens=80,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

prompts = [
    "Complète proprement en UNE phrase : « Une francophonie est un lieu où »\nRéponse :",
    "Réponds en exactement DEUX phrases sur Paris.\nRéponse :",
    "Continue ce récit en UNE à DEUX phrases : « Au cœur des Alpes, un petit village vivait au rythme des saisons. Chaque hiver, »\nRéponse :",
]

for p in prompts:
    print("\n=== Prompt ===\n", p)
    print(generate_only_new(p, **SAMPLED_GUARDED_CFG))



=== Prompt ===
 Complète proprement en UNE phrase : « Une francophonie est un lieu où »
Réponse :


Une francopholie, une langue française et d'autres langues françaises. Le terme de Francophonie peut être utilisé pour désigner la communauté franco-canadienne ou les communautés francophones des pays du Nouveau Monde qui parlent français comme langue officielle à l’état civil (en France), ainsi que tous ceux dans le monde qui parleraient ce même français sous forme alphasyllab

=== Prompt ===
 Réponds en exactement DEUX phrases sur Paris.
Réponse :


Paris est une ville très importante dans tous les secteurs de l'économie et des activités humaines, par exemple: la vie politique, le commerce, l’industrie, la recherche académique, l'art, l', architecture, l..., et beaucoup d'autres. Mais il y a aussi certains aspects qui peuvent être controversés ou discutables pour certaines personnes comme peut-être l'urban

=== Prompt ===
 Continue ce récit en UNE à DEUX phrases : « Au cœur des Alpes,

## Translations of Prompts and Responses:

* Prompt 1: Complete neatly in one sentence: "A Francophonie is a place where"

  * Response: A Francophonie, a French language, and other French languages. The term Francophonie can be used to refer to the Franco-Canadian community or the Francophone communities of New World countries that speak French as an official language in civil status (in France), as well as all those in the world who speak this same French in alphasyllabic form.

* Prompt 2: Respond in exactly two sentences about Paris.

  * Response: Paris is a very important city in all sectors of the economy and human activity, for example: political life, commerce, industry, academic research, art, architecture, and many others. But there are also certain aspects that can be controversial or debatable for some people, such as perhaps the urban planning.

* Prompt 3: Continue this story in one or two sentences: "In the heart of the Alps, a small village lived in harmony with the seasons. Every winter,"

  * Response: A young mountaineer recounted his stay in the village of Saint-Laurent d'Aix (Alps). He describes the life and traditions of the Alpine people. The author narrates the history of the village, which has developed since its founding around 900 AD and is now a modern Alpine resort with its magnificent modern chalets and accessible mountain pastures.