Sa prima intellighèntzia artifitziale chi chistionat in sardu ("the first artificial intelligence that speaks Sardinian").
A fine-tuned LLM for Sardinian conversation and text analysis, lightweight enough to run on consumer hardware. From data collection to training, everything open.
LLiMba (from LLM + Limba, Sardinian for "language") is an open-source project to build a language model that speaks Sardinian, a Romance language spoken in Sardinia, Italy, classified as endangered by UNESCO.
The goal is simple: take a small, capable multilingual model and teach it Sardinian well enough to hold conversations, translate text, and answer questions, while staying small enough to run on consumer hardware.
Not to be confused with the 2024 University of Cagliari paper LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models, which uses Sardinian as one of several case studies for a broader framework. Different acronym, independent project.
Sardinian has roughly 1 million speakers and almost zero presence in modern NLP. No machine translation APIs support it. No voice assistants understand it. LLiMba is a step toward changing that, and a proof that you can build a capable language model for an under-resourced language with limited data and a single consumer GPU.
After continued pretraining on Sardinian text and supervised fine-tuning on Sardinian instruction pairs:
| Prompt | Response |
|---|---|
| Cale est sa capitale de sa Sardigna? | Sa capitale de sa Sardigna est Casteddu. |
| Salude! Comente ìstas? | Bene, gràtzias. E tue comente ìstas? |
| Traduzi: «La Sardegna è una bellissima isola nel Mediterraneo.» | Sa Sardigna est una ìsula bella meda in su Mediterràneu. |
| Phase | Description | Status |
|---|---|---|
| 0. Data Collection | Gather Sardinian text from web, books, Wikipedia | ✅ Complete |
| 1. Data Preparation | Clean, deduplicate, filter, build corpus | ✅ Complete |
| 2. Continued Pretraining | Teach base model Sardinian via full fine-tuning | ✅ Complete |
| 3. SFT | Instruction-following fine-tuning | ✅ Complete |
| 4. Hugging Face Release | Publish models, datasets, and demo Space | ✅ Complete |
| 5. Export & Deploy | Quantize to GGUF, deploy to mobile/browser | 📋 Planned |
A technical report covering the data pipeline, training methodology, the five-way SFT comparison, and qualitative findings is available in docs/llimba.md.
Base model: Qwen2.5-3B-Instruct
| Property | Value |
|---|---|
| Parameters | 3B |
| CPT method | Full fine-tuning, bf16 |
| SFT method | rsLoRA, rank 256, α=256 |
| CPT training data | ~13.9M tokens (Sardinian + Romance replay) |
| SFT training data | ~12.8M tokens (~14.4K instruction pairs) |
| Target dialect | LSC (Limba Sarda Comuna) |
| Planned deployment size | ~1.8 GB (Q4_K_M quantized) |
Why Qwen2.5? Stronger multilingual coverage of the Romance languages than alternatives at this size: Qwen2.5 was explicitly trained on Italian, Spanish, Portuguese, French, and other Romance languages, which provides useful prior knowledge for adaptation to Sardinian. It is also a plain dense transformer that works cleanly with standard tooling.
Why 3B? Qwen2.5-3B produces noticeably better Sardinian than smaller models: richer vocabulary, better coherence, and fewer hallucinations. At Q4 quantization (~1.8 GB) it may even fit comfortably on mobile devices and modest VPS instances.
Why rsLoRA r256? We compared five SFT configurations on identical data and hardware: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA at rank 256 won every into-Sardinian translation direction and was the most factually grounded on biographical and cultural probes. Adapter rank turned out to matter more than the choice among LoRA variants in this regime.
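In PEFT terms, the winning configuration corresponds roughly to the sketch below. This is illustrative only: the target modules, dropout, and checkpoint path are assumptions, and the exact setup lives in scripts/sft_train_lora.py.

```python
# Rough sketch of an rsLoRA r=256 adapter with PEFT.
# Assumptions: target modules, dropout, and the base checkpoint name are illustrative,
# not the project's verified settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "lballore/llimba-3b-instruct-cpt",  # post-CPT checkpoint used as the SFT base
    torch_dtype="auto",
)

config = LoraConfig(
    r=256,                # adapter rank
    lora_alpha=256,       # alpha = rank, as in the comparison above
    use_rslora=True,      # rank-stabilized scaling: alpha / sqrt(r) instead of alpha / r
    lora_dropout=0.05,    # assumed value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed subset
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```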
LLiMba is published as a small family of repos under the lballore namespace, grouped together in the LLiMba Collection.
| Repository | Description |
|---|---|
| `lballore/llimba-3b-instruct` | The deployable model. Continued pretraining + rsLoRA SFT, merged into a single bf16 checkpoint. Start here if you want to use the model. |
| `lballore/llimba-3b-instruct-cpt` | Post-CPT intermediate checkpoint. Research artifact for users running their own SFT recipes on top. |
| Repository | Description |
|---|---|
| `lballore/llimba-corpus` | The pretraining corpus (~13.9M tokens, 18,270 documents). |
| `lballore/llimba-sft` | The supervised fine-tuning data (~14.4K instruction pairs, ~12.8M tokens). |
| `lballore/llimba-flores-srd-eval` | The 997-sentence FLORES-200 subset used for translation benchmarking. |
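All three datasets can be pulled directly with the 🤗 `datasets` library. A minimal sketch (the split name and record schema are assumptions; check each dataset card for the actual fields):

```python
# Illustrative only: split names and columns are assumptions, not the documented schema.
from datasets import load_dataset

corpus = load_dataset("lballore/llimba-corpus", split="train")
sft = load_dataset("lballore/llimba-sft", split="train")

print(len(corpus), "corpus documents")
print(sft[0])  # inspect one instruction pair
```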
🎮 lballore-llimba-demo.hf.space — interactive Gradio chat with both conversational and translation modes. No installation, no GPU, no token. Click the link, click an example prompt, get a Sardinian response.
```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="lballore/llimba-3b-instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Ses unu assistente chi chistionat in sardu."},
    {"role": "user", "content": "Salude! Comente ìstas?"},
]

out = pipe(messages, max_new_tokens=200, do_sample=False)
print(out[0]["generated_text"][-1]["content"])
# Bene, gràtzias. E tue comente ìstas?
```

For inference parameter recommendations and more usage examples, see the model card.
The CPT corpus combines every significant digital source of written Sardinian we could find. Counts below reflect the corpus after deduplication and language filtering:
| Source | Docs | Tokens | Description |
|---|---|---|---|
| Web scrape (6 sites) | 8,110 | ~4.9M | News, culture, technology in Sardinian |
| Wikipedia | 6,309 | ~2.6M | Sardinian Wikipedia articles |
| GlotCC CommonCrawl | 2,270 | ~1.8M | Bulk web crawl, filtered and deduplicated |
| Translated books (PDF/EPUB/markdown) | 409 | ~2.0M | Literary translations (Orwell, Joyce, García Márquez, Kafka, Cervantes, etc.) |
| Poetry anthologies | 436 | ~176K | Regional poetry across Sardinian provinces (1400-1900) |
| Other sources | 84 | ~39K | Bilingual texts, song lyrics, folk tales |
| Total Sardinian | 17,618 | ~11.5M | |
| Romance replay (IT/ES/CA/PT) | 652 | ~2.4M | Prevents catastrophic forgetting and Sardinian/Italian register blurring |
| Total CPT corpus | 18,270 | ~13.9M | |
The data pipeline includes deduplication (MinHash LSH), language filtering, boilerplate removal, and quality ordering. All sources are documented in the codebase.
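For context on the deduplication step, here is a minimal sketch of MinHash LSH near-duplicate detection using the `datasketch` library; the shingle size and Jaccard threshold are illustrative assumptions, not the pipeline's exact settings:

```python
# Near-duplicate filtering with MinHash + LSH (illustrative parameters).
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:  # char 5-grams
        m.update(shingle.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard threshold is an assumed value
documents = {
    "doc-a": "Sa Sardigna est una ìsula bella meda in su Mediterràneu.",
    "doc-b": "Sa Sardigna est una ìsula bella meda in su Mediterràneu!",
}
kept = []
for doc_id, text in documents.items():
    sig = signature(text)
    if not lsh.query(sig):   # nothing similar indexed yet
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
print(kept)  # doc-b is likely dropped as a near-duplicate of doc-a
```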
The SFT pool combines four buckets: machine-translated Capybara entries (NLLB-200 3.3B), parallel translation pairs, native-reviewed synthesized instructions, and song-related QA pairs. After deduplication and a 5x upsample of the native-reviewed bucket, the final pool is ~14.4K pairs (~12.8M tokens).
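The machine-translated bucket can be reproduced in spirit with NLLB-200, which covers Sardinian under the FLORES-200 code `srd_Latn`. A minimal sketch (batching, prompt handling, and post-filtering are omitted, and the exact script used for the Capybara bucket may differ):

```python
# Sketch of English -> Sardinian machine translation with NLLB-200 3.3B.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-3.3B",
    src_lang="eng_Latn",
    tgt_lang="srd_Latn",   # Sardinian in the NLLB-200 / FLORES-200 language codes
    device_map="auto",
)

out = translator("Sardinia is a beautiful island in the Mediterranean.", max_length=128)
print(out[0]["translation_text"])
```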
Sardinian is a macrolanguage that includes several dialects. Rather than targeting a single one, the corpus includes text in LSC (Limba Sarda Comuna), Logudoresu, Campidanesu, and Nugoresu. This reflects how Sardinian is actually written today and makes the model more useful to speakers of all variants. The primary output target is LSC (the standardized form), but the model tries to handle dialectal input as best it can.
Evaluated on 997 parallel sentences from FLORES-200 with greedy decoding, run through lm-evaluation-harness. Each cell reports BLEU / chrF. The deployed model is the rsLoRA r256 SFT checkpoint.
| Direction | Base | After CPT | After SFT (rsLoRA r256) |
|---|---|---|---|
| EN -> SC | 2.8 / 27.4 | 17.3 / 47.8 | 28.5 / 56.8 |
| IT -> SC | 2.2 / 27.5 | 12.7 / 44.8 | 21.3 / 52.1 |
| ES -> SC | 2.0 / 26.4 | 11.4 / 43.4 | 18.6 / 49.4 |
| SC -> EN | 11.7 / 44.6 | 33.5 / 62.8 | 41.3 / 64.6 |
| SC -> IT | 2.9 / 33.4 | 16.5 / 48.8 | 17.6 / 47.3 |
| SC -> ES | 5.7 / 37.0 | 19.3 / 47.8 | 18.6 / 46.3 |
Into-Sardinian directions improve at every stage. From-Sardinian translation into the closer Romance languages saturates at CPT, since Qwen2.5 already handles Italian and Spanish generation well; SFT adds little to those directions.
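The metrics themselves are standard BLEU and chrF (as implemented, for example, in sacrebleu). A minimal sketch of scoring hypotheses against references; the project runs these through lm-evaluation-harness rather than calling sacrebleu directly:

```python
# Corpus-level BLEU and chrF with sacrebleu (the same metrics reported in the table above).
import sacrebleu

hypotheses = ["Sa Sardigna est una ìsula bella meda in su Mediterràneu."]
references = [["Sa Sardigna est una ìsula bella in su Mediterràneu."]]  # one reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.1f}")
```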
Five SFT configurations on identical data, hardware, and evaluation:
| Configuration | EN -> SC BLEU |
|---|---|
| Full fine-tuning | 21.0 |
| LoRA r64 | 23.6 |
| rsLoRA r128 | 25.3 |
| rsLoRA r256 | 28.5 |
| DoRA r256 | 23.0 |
Full numbers across all six directions, plus chrF and bootstrap standard errors, are in docs/llimba.md.
Final loss values on the SFT data (5% reserved evaluation split):
| SFT method | Train loss | Eval loss |
|---|---|---|
| Full fine-tuning | 1.19 | 1.11 |
| DoRA r256 | 1.08 | 0.98 |
| rsLoRA r256 | 0.87 | 0.75 |
Loss ordering tracks translation BLEU on into-Sardinian directions. It does not track factual grounding or vocabulary fidelity, both of which we assessed separately on a native-speaker probe set.
All five SFT variants fabricate when asked about content absent from training data. The deployed rsLoRA r256 checkpoint is the most factually grounded of the methods we tested but is not exempt: on long open-ended generation prompts (especially "tell me about Sardinian X"-style queries) it occasionally produces phonotactically plausible but non-attested Sardinian word forms. Bounded structured queries ("list the three main causes of...", "what is the capital of...") give consistently cleaner output. See Section 7 of docs/llimba.md for details.
If you only want to use the model, see the Try it now snippet above; you don't need to clone or build anything. The instructions below are for reproducing the training pipeline from scratch.
- GPU: NVIDIA GPU with ≥24 GB VRAM (tested on RTX 4090)
- Docker with NVIDIA Container Toolkit (for devcontainer)
- VS Code with Dev Containers extension (recommended)
- Clone the repository:

  ```bash
  git clone https://github.com/lballore/LLiMba.git
  cd LLiMba
  ```

- Open in VS Code and reopen in the devcontainer (or build manually):

  ```bash
  # The devcontainer handles CUDA 12.8, Python 3.12, and all dependencies
  code .
  # Then: Ctrl+Shift+P -> "Dev Containers: Reopen in Container"
  ```

- The `post-create.sh` script installs all dependencies via `uv`.
Flash Attention 2 is recommended for training but not installed automatically; building the package and its wheels from source can take up to 180 minutes. To install:

```bash
uv pip install flash-attn --no-build-isolation --break-system-packages
```

The training scripts fall back to SDPA (PyTorch native) if Flash Attention is not available. SDPA is slower but produces identical results.
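The fallback might look like this inside a training script (a hedged sketch using the `attn_implementation` argument from transformers; the repo's actual selection logic may differ):

```python
# Choose Flash Attention 2 when the flash_attn package is importable, otherwise SDPA.
import importlib.util
from transformers import AutoModelForCausalLM

attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="auto",
    attn_implementation=attn_impl,
)
print(f"Using attention backend: {attn_impl}")
```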
Data collection (requires internet access for web scraping and Hugging Face downloads):

```bash
python scripts/data_gathering/gather_corpus_data.py --all
python scripts/data_gathering/gather_sft_data.py --all
```

Data preparation (clean, deduplicate, filter, build final corpus):

```bash
python scripts/data_preparation/prepare_corpus_data.py --all
python scripts/data_preparation/prepare_sft_data.py --all
```

Continued pretraining:

```bash
python scripts/cpt_pretrain.py --model Qwen/Qwen2.5-3B-Instruct
```

SFT, full fine-tuning:

```bash
python scripts/sft_train_full.py --model models/cpt-pretrain-qwen2.5-3b
```

SFT, LoRA / rsLoRA / DoRA:

```bash
python scripts/sft_train_lora.py --model models/cpt-pretrain-qwen2.5-3b
```

Evaluation:

```bash
# Probe test (qualitative, Sardinian prompts)
python scripts/model_evaluation/eval_probe_test.py --model-full models/cpt-pretrain-qwen2.5-3b

# BLEU/chrF benchmarks (quantitative, translation quality)
./scripts/model_evaluation/eval_bleu_chrf.sh models/cpt-pretrain-qwen2.5-3b
```

CPT runs in roughly 5.5 hours on a single RTX 4090. Each SFT run takes a few more hours, depending on the configuration.
Contributions are welcome, especially from Sardinian speakers. Areas where help is most needed:
- Preference data: Native-speaker preference pairs for DPO-style alignment, the natural next step after the current SFT
- Evaluation: Native speaker review of model outputs, ideally across multiple dialects
- SFT data: Additional high-quality instruction-response pairs in Sardinian, especially long-form
- Data sources: Additional Sardinian text (articles, books, institutional documents)
- Deployment: Quantization to GGUF, browser/mobile integration testing
If you use LLiMba in your work, please cite the technical report:
```bibtex
@misc{llimba2026,
  title         = {LLiMba: Sardinian on a Single GPU - Adapting a 3B Language Model to a Vanishing Romance Language},
  author        = {Luca Ballore},
  year          = {2026},
  eprint        = {2605.09015},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.09015}
}
```

- The Sardinian language community and the speakers who keep the language alive
- Qwen for the base model
- The authors of TibetanLLM for methodological inspiration
- All the authors, translators, and publishers who have written in Sardinian
This repository uses different licenses for different artifacts:
- Code (training scripts, data pipeline, evaluation harness): Apache 2.0.
- Model weights (the released LLiMba models on Hugging Face): Apache 2.0.
- Datasets: the LLiMba pretraining corpus and SFT dataset are released on Hugging Face under CC-BY-NC-SA-4.0; the FLORES-200 evaluation subset under CC-BY-SA-4.0, inherited from the original FLORES-200.