
LLiMba

Sa prima intellighèntzia artifitziale chi chistionat in sardu. ("The first artificial intelligence that speaks Sardinian.")

A fine-tuned LLM for Sardinian conversation and text analysis, lightweight enough to run on consumer hardware. From data collection to training, everything open.



What is LLiMba?

LLiMba (from LLM + Limba, Sardinian for "language") is an open-source project to build a language model that speaks Sardinian, a Romance language spoken in Sardinia, Italy, classified as endangered by UNESCO.

The goal is simple: take a small, capable multilingual model and teach it Sardinian well enough to hold conversations, translate text, and answer questions, while staying small enough to run on consumer hardware.

Not to be confused with the 2024 University of Cagliari paper LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models, which uses Sardinian as one of several case studies for a broader framework. Different acronym, independent project.

Why?

Sardinian has roughly 1 million speakers and almost zero presence in modern NLP. No machine translation APIs support it. No voice assistants understand it. LLiMba is a step toward changing that, and a proof that you can build a capable language model for an under-resourced language with limited data and a single consumer GPU.

What can it do?

After continued pretraining on Sardinian text and supervised fine-tuning on Sardinian instruction pairs:

Prompt Response
Cale est sa capitale de sa Sardigna? Sa capitale de sa Sardigna est Casteddu.
Salude! Comente ìstas? Bene, gràtzias. E tue comente ìstas?
Traduzi: «La Sardegna è una bellissima isola nel Mediterraneo.» Sa Sardigna est una ìsula bella meda in su Mediterràneu.

Project Status

Phase Description Status
0. Data Collection Gather Sardinian text from web, books, Wikipedia ✅ Complete
1. Data Preparation Clean, deduplicate, filter, build corpus ✅ Complete
2. Continued Pretraining Teach base model Sardinian via full fine-tuning ✅ Complete
3. SFT Instruction-following fine-tuning ✅ Complete
4. Hugging Face Release Publish models, datasets, and demo Space ✅ Complete
5. Export & Deploy Quantize to GGUF, deploy to mobile/browser 📋 Planned

A technical report covering the data pipeline, training methodology, the five-way SFT comparison, and qualitative findings is available in docs/llimba.md.


Model

Base model: Qwen2.5-3B-Instruct

Property Value
Parameters 3B
CPT method Full fine-tuning, bf16
SFT method rsLoRA, rank 256, α=256
CPT training data ~13.9M tokens (Sardinian + Romance replay)
SFT training data ~12.8M tokens (~14.4K instruction pairs)
Target dialect LSC (Limba Sarda Comuna)
Planned deployment size ~1.8 GB (Q4_K_M quantized)

Why Qwen2.5? It offers stronger multilingual coverage for Romance languages than alternatives at this size: Qwen2.5 was explicitly trained on Italian, Spanish, Portuguese, French, and other Romance languages, providing useful prior knowledge for adaptation to Sardinian. It is also a pure transformer architecture that works cleanly with standard tooling.

Why 3B? Qwen2.5-3B produces good enough Sardinian, with richer vocabulary, better coherence, and fewer hallucinations than smaller models. At Q4 quantization (~1.8 GB), it may even fit comfortably on mobile devices and modest VPS instances.

Why rsLoRA r256? We compared five SFT configurations on identical data and hardware: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA at rank 256 won every into-Sardinian translation direction and was the most factually grounded on biographical and cultural probes. Adapter rank turned out to matter more than the choice among LoRA variants in this regime.
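As background (an illustration of the general technique, not project code): the practical difference between LoRA and rsLoRA is the factor that scales the adapter update. Classic LoRA multiplies the low-rank update by α/r, which shrinks as the rank grows; rsLoRA uses α/√r, which keeps the effective update magnitude stable at high ranks. Assuming α = r, as in the r256 run above:

```python
import math

def lora_scaling(alpha: float, r: int) -> float:
    # Classic LoRA: the update B @ A is multiplied by alpha / r.
    return alpha / r

def rslora_scaling(alpha: float, r: int) -> float:
    # Rank-stabilized LoRA: alpha / sqrt(r) keeps the update
    # magnitude from collapsing as the rank grows.
    return alpha / math.sqrt(r)

for r in (64, 128, 256):
    alpha = r  # alpha = r, as in the r256 configuration above
    print(f"r={r}: LoRA scaling {lora_scaling(alpha, r):.1f}, "
          f"rsLoRA scaling {rslora_scaling(alpha, r):.2f}")
```

With α = r, LoRA's scaling is constant at 1.0 regardless of rank, while rsLoRA's grows as √r, which is one reason higher adapter ranks can pay off under rsLoRA.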


Available on Hugging Face

LLiMba is published as a small family of repos under the lballore namespace, grouped together in the LLiMba Collection.

Models

Repository Description
lballore/llimba-3b-instruct The deployable model. Continued pretraining + rsLoRA SFT, merged into a single bf16 checkpoint. Start here if you want to use the model.
lballore/llimba-3b-instruct-cpt Post-CPT intermediate checkpoint. Research artifact for users running their own SFT recipes on top.

Datasets

Repository Description
lballore/llimba-corpus The pretraining corpus (~13.9M tokens, 18,270 documents).
lballore/llimba-sft The supervised fine-tuning data (~14.4K instruction pairs, ~12.8M tokens).
lballore/llimba-flores-srd-eval The 997-sentence FLORES-200 subset used for translation benchmarking.

Live demo

🎮 lballore-llimba-demo.hf.space — interactive Gradio chat with both conversational and translation modes. No installation, no GPU, no token. Click the link, click an example prompt, get a Sardinian response.

Try it now

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="lballore/llimba-3b-instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Ses unu assistente chi chistionat in sardu."},
    {"role": "user", "content": "Salude! Comente ìstas?"},
]
out = pipe(messages, max_new_tokens=200, do_sample=False)
print(out[0]["generated_text"][-1]["content"])
# Bene, gràtzias. E tue comente ìstas?

For inference parameter recommendations and more usage examples, see the model card.


Training Data

The CPT corpus combines every significant digital source of written Sardinian we could find. Counts below reflect the corpus after deduplication and language filtering:

Source Docs Tokens Description
Web scrape (6 sites) 8,110 ~4.9M News, culture, technology in Sardinian
Wikipedia 6,309 ~2.6M Sardinian Wikipedia articles
GlotCC CommonCrawl 2,270 ~1.8M Bulk web crawl, filtered and deduplicated
Translated books (PDF/EPUB/markdown) 409 ~2.0M Literary translations (Orwell, Joyce, García Márquez, Kafka, Cervantes, etc.)
Poetry anthologies 436 ~176K Regional poetry across Sardinian provinces (1400-1900)
Other sources 84 ~39K Bilingual texts, song lyrics, folk tales
Total Sardinian 17,618 ~11.5M
Romance replay (IT/ES/CA/PT) 652 ~2.4M Prevents catastrophic forgetting and Sardinian/Italian register blurring
Total CPT corpus 18,270 ~13.9M

The data pipeline includes deduplication (MinHash LSH), language filtering, boilerplate removal, and quality ordering. All sources are documented in the codebase.
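The MinHash idea behind the deduplication step can be sketched in pure Python. This is a toy illustration, not the project's pipeline code (which would use an LSH index to avoid comparing every pair of documents):

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-grams after whitespace normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """One minimum per salted hash function."""
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "Sa Sardigna est una ìsula bella meda in su Mediterràneu."
b = "Sa Sardigna est una ìsula bella meda in su Mediterraneu."  # near-duplicate
c = "Custu documentu chistionat de una cosa diferente meda."

sa, sb, sc = (minhash_signature(shingles(t)) for t in (a, b, c))
print(estimated_jaccard(sa, sb))  # high: near-duplicates
print(estimated_jaccard(sa, sc))  # low: mostly unrelated
```

Documents whose estimated similarity exceeds a threshold are grouped and all but one copy dropped.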

The SFT pool combines four buckets: machine-translated Capybara entries (NLLB-200 3.3B), parallel translation pairs, native-reviewed synthesized instructions, and song-related QA pairs. After deduplication and a 5x upsample of the native-reviewed bucket, the final pool is ~14.4K pairs (~12.8M tokens).
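The pool assembly (deduplicate, then repeat the native-reviewed bucket 5x) can be illustrated with a toy sketch; the bucket names and contents here are hypothetical placeholders:

```python
def build_sft_pool(buckets: dict, upsample: dict) -> list:
    """Deduplicate by exact text across buckets, then repeat selected buckets."""
    seen, pool = set(), []
    for name, examples in buckets.items():
        for ex in examples:
            if ex not in seen:
                seen.add(ex)
                pool.extend([ex] * upsample.get(name, 1))
    return pool

buckets = {
    "capybara_mt": ["pair a", "pair b", "pair a"],  # contains a duplicate
    "native_reviewed": ["pair c"],
}
pool = build_sft_pool(buckets, upsample={"native_reviewed": 5})
print(len(pool))  # 2 unique capybara pairs + 5 copies of the native pair = 7
```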

Why "pan-Sardinian"?

Sardinian is a macrolanguage that includes several dialects. Rather than targeting a single one, the corpus includes text in LSC (Limba Sarda Comuna), Logudoresu, Campidanesu, and Nugoresu. This reflects how Sardinian is actually written today and makes the model more useful to speakers of all variants. The primary output target is LSC (the standardized form), but the model tries to handle dialectal input as best it can.


Results

Translation Benchmarks (BLEU / chrF)

Evaluated on 997 parallel sentences from FLORES-200, greedy decoding, run through lm-evaluation-harness. The deployed model is the rsLoRA r256 SFT checkpoint.

Direction Base After CPT After SFT (rsLoRA r256)
EN -> SC 2.8 / 27.4 17.3 / 47.8 28.5 / 56.8
IT -> SC 2.2 / 27.5 12.7 / 44.8 21.3 / 52.1
ES -> SC 2.0 / 26.4 11.4 / 43.4 18.6 / 49.4
SC -> EN 11.7 / 44.6 33.5 / 62.8 41.3 / 64.6
SC -> IT 2.9 / 33.4 16.5 / 48.8 17.6 / 47.3
SC -> ES 5.7 / 37.0 19.3 / 47.8 18.6 / 46.3

Into-Sardinian directions improve at every stage. From-Sardinian translation into the closer Romance languages saturates at CPT, since Qwen2.5 already handles Italian and Spanish generation well; SFT adds little to those directions.
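For intuition, chrF (the second number in each cell) is a character n-gram F-score. A simplified sketch (β=2, n=1..6, character n-grams only, whitespace ignored) follows; the reported numbers come from lm-evaluation-harness, not this toy:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    text = "".join(text.split())  # chrF ignores whitespace by default
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average F_beta over character 1..max_n-grams (simplified chrF)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        # F_beta weights recall beta^2 times as much as precision
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(chrf("Sa capitale de sa Sardigna est Casteddu.",
           "Sa capitale de sa Sardigna est Casteddu."))  # 100.0 for an exact match
```

Because it scores character n-grams rather than whole words, chrF gives partial credit for near-miss inflections, which is why it moves more smoothly than BLEU in the table above.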

SFT Method Comparison (EN -> SC BLEU)

Five SFT configurations on identical data, hardware, and evaluation:

Configuration EN -> SC BLEU
Full fine-tuning 21.0
LoRA r64 23.6
rsLoRA r128 25.3
rsLoRA r256 28.5
DoRA r256 23.0

Full numbers across all six directions, plus chrF and bootstrap standard errors, are in docs/llimba.md.

SFT Training and Eval Loss

Final loss values on the SFT data (5% reserved evaluation split):

SFT method Train loss Eval loss
Full fine-tuning 1.19 1.11
DoRA r256 1.08 0.98
rsLoRA r256 0.87 0.75

Loss ordering tracks translation BLEU on into-Sardinian directions. It does not track factual grounding or vocabulary fidelity, both of which we assessed separately on a native-speaker probe set.

Known limitations

All five SFT variants fabricate when asked about content absent from training data. The deployed rsLoRA r256 checkpoint is the most factually grounded of the methods we tested but is not exempt: on long open-ended generation prompts (especially "tell me about Sardinian X"-style queries) it occasionally produces phonotactically plausible but non-attested Sardinian word forms. Bounded structured queries ("list the three main causes of...", "what is the capital of...") give consistently cleaner output. See Section 7 of docs/llimba.md for details.


Quickstart

If you only want to use the model, see the Try it now snippet above; you don't need to clone or build anything. The instructions below are for reproducing the training pipeline from scratch.

Prerequisites

  • GPU: NVIDIA GPU with ≥24 GB VRAM (tested on RTX 4090)
  • Docker with NVIDIA Container Toolkit (for devcontainer)
  • VS Code with Dev Containers extension (recommended)

Setup

  1. Clone the repository:

    git clone https://github.com/lballore/LLiMba.git
    cd LLiMba
  2. Open in VS Code and reopen in the devcontainer (or build manually):

    code .
    # The devcontainer handles CUDA 12.8, Python 3.12, and all dependencies.
    # Then: Ctrl+Shift+P -> "Dev Containers: Reopen in Container"
  3. The post-create.sh script installs all dependencies via uv.

Flash Attention 2 (optional)

Flash Attention 2 is recommended for training but not installed automatically, because building the wheel from source can take up to 180 minutes. To install:

uv pip install flash-attn --no-build-isolation --break-system-packages

The training scripts fall back to SDPA (PyTorch native) if Flash Attention is not available. SDPA is slower but produces identical results.
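The fallback can be expressed as a small helper. This is a sketch of the general pattern, not necessarily the scripts' exact code; `attn_implementation` is the standard Hugging Face transformers argument name:

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Use Flash Attention 2 when the package is importable, else PyTorch SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

# Passed to transformers when loading the model, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     "lballore/llimba-3b-instruct",
#     attn_implementation=pick_attn_implementation(),
# )
print(pick_attn_implementation())
```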

Running the Pipeline

Data collection (requires internet access for web scraping and HuggingFace downloads):

python scripts/data_gathering/gather_corpus_data.py --all
python scripts/data_gathering/gather_sft_data.py --all

Data preparation (clean, deduplicate, filter, build final corpus):

python scripts/data_preparation/prepare_corpus_data.py --all
python scripts/data_preparation/prepare_sft_data.py --all

Continued pretraining:

python scripts/cpt_pretrain.py --model Qwen/Qwen2.5-3B-Instruct

SFT, full fine-tuning:

python scripts/sft_train_full.py --model models/cpt-pretrain-qwen2.5-3b

SFT, LoRA / rsLoRA / DoRA:

python scripts/sft_train_lora.py --model models/cpt-pretrain-qwen2.5-3b

Evaluation:

# Probe test (qualitative, Sardinian prompts)
python scripts/model_evaluation/eval_probe_test.py --model-full models/cpt-pretrain-qwen2.5-3b

# BLEU/chrF benchmarks (quantitative, translation quality)
./scripts/model_evaluation/eval_bleu_chrf.sh models/cpt-pretrain-qwen2.5-3b

CPT runs in roughly 5.5 hours on a single RTX 4090; each SFT run takes a few additional hours, depending on the configuration.


Contributing

Contributions are welcome, especially from Sardinian speakers. Areas where help is most needed:

  • Preference data: Native-speaker preference pairs for DPO-style alignment, the natural next step after the current SFT
  • Evaluation: Native speaker review of model outputs, ideally across multiple dialects
  • SFT data: Additional high-quality instruction-response pairs in Sardinian, especially long-form
  • Data sources: Additional Sardinian text (articles, books, institutional documents)
  • Deployment: Quantization to GGUF, browser/mobile integration testing

Citation

If you use LLiMba in your work, please cite the technical report:

@misc{llimba2026,
  title         = {LLiMba: Sardinian on a Single GPU - Adapting a 3B Language Model to a Vanishing Romance Language},
  author        = {Luca Ballore},
  year          = {2026},
  eprint        = {2605.09015},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.09015}
}

Acknowledgments

  • The Sardinian language community and the speakers who keep the language alive
  • Qwen for the base model
  • The authors of TibetanLLM for methodological inspiration
  • All the authors, translators, and publishers who have written in Sardinian

License

This repository uses two distinct licenses:

  • Code (training scripts, data pipeline, evaluation harness): Apache 2.0.
  • Model weights (the released LLiMba models on Hugging Face): Apache 2.0.

The LLiMba pretraining corpus and SFT dataset are released on Hugging Face under CC-BY-NC-SA-4.0; the FLORES-200 evaluation subset under CC-BY-SA-4.0, inherited from the original FLORES-200.
