
LLiMba

Sa prima intellighèntzia artifitziale chi chistionat in sardu. ("The first artificial intelligence that speaks Sardinian.")

A fine-tuned LLM for Sardinian conversation and text analysis, lightweight enough to run on consumer hardware. From data collection to training, everything open.



What is LLiMba?

LLiMba (from LLM + Limba, Sardinian for "language") is an open-source project to build a language model that speaks Sardinian, a Romance language spoken in Sardinia, Italy, classified as endangered by UNESCO.

The goal is simple: take a small, capable multilingual model and teach it Sardinian well enough to hold conversations, translate text, and answer questions, while staying small enough to run on consumer hardware.

Not to be confused with the 2024 University of Cagliari paper LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models, which uses Sardinian as one of several case studies for a broader framework. Different acronym, independent project.

Why?

Sardinian has roughly 1 million speakers and almost zero presence in modern NLP. No machine translation APIs support it. No voice assistants understand it. LLiMba is a step toward changing that, and a proof that you can build a capable language model for an under-resourced language with limited data and a single consumer GPU.

What can it do?

After continued pretraining on Sardinian text and supervised fine-tuning on Sardinian instruction pairs:

Prompt Response
Cale est sa capitale de sa Sardigna? Sa capitale de sa Sardigna est Casteddu.
Salude! Comente ìstas? Bene, gràtzias. E tue comente ìstas?
Traduzi: «La Sardegna è una bellissima isola nel Mediterraneo.» Sa Sardigna est una ìsula bella meda in su Mediterràneu.

Project Status

Phase Description Status
0. Data Collection Gather Sardinian text from web, books, Wikipedia ✅ Complete
1. Data Preparation Clean, deduplicate, filter, build corpus ✅ Complete
2. Continued Pretraining Teach base model Sardinian via full fine-tuning ✅ Complete
3. SFT Instruction-following fine-tuning ✅ Complete
4. Hugging Face Release Publish models, datasets, and demo Space ✅ Complete
5. Export & Deploy Quantize to GGUF, deploy to mobile/browser 📋 Planned

A technical report covering the data pipeline, training methodology, the five-way SFT comparison, and qualitative findings is available in docs/llimba.md.


Model

Base model: Qwen2.5-3B-Instruct

Property Value
Parameters 3B
CPT method Full fine-tuning, bf16
SFT method rsLoRA, rank 256, α=256
CPT training data ~13.9M tokens (Sardinian + Romance replay)
SFT training data ~12.8M tokens (~14.4K instruction pairs)
Target dialect LSC (Limba Sarda Comuna)
Planned deployment size ~1.8 GB (Q4_K_M quantized)

Why Qwen2.5? It offers stronger multilingual coverage for Romance languages than alternatives at this size: Qwen2.5 was explicitly trained on Italian, Spanish, Portuguese, French, and other Romance languages, providing useful prior knowledge for adaptation to Sardinian. It is also a pure transformer architecture that works cleanly with standard tooling.

Why 3B? Qwen2.5-3B produces good enough Sardinian, with richer vocabulary, better coherence, and fewer hallucinations than smaller models. At Q4 quantization (~1.8 GB), it may even fit comfortably on mobile devices and modest VPS instances.

Why rsLoRA r256? We compared five SFT configurations on identical data and hardware: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA at rank 256 won every into-Sardinian translation direction and was the most factually grounded on biographical and cultural probes. Adapter rank turned out to matter more than the choice among LoRA variants in this regime.
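As background (an illustration of the general technique, not project code): the practical difference between LoRA and rsLoRA is the factor that scales the adapter update. Classic LoRA multiplies the low-rank update by α/r, which shrinks as the rank grows; rsLoRA uses α/√r, which keeps the effective update magnitude stable at high ranks. Assuming α = r, as in the r256 run above:

```python
import math

def lora_scaling(alpha: float, r: int) -> float:
    # Classic LoRA: the update B @ A is multiplied by alpha / r.
    return alpha / r

def rslora_scaling(alpha: float, r: int) -> float:
    # Rank-stabilized LoRA: alpha / sqrt(r) keeps the update
    # magnitude from collapsing as the rank grows.
    return alpha / math.sqrt(r)

for r in (64, 128, 256):
    alpha = r  # alpha = r, as in the r256 configuration above
    print(f"r={r}: LoRA scaling {lora_scaling(alpha, r):.1f}, "
          f"rsLoRA scaling {rslora_scaling(alpha, r):.2f}")
```

With α = r, LoRA's scaling is constant at 1.0 regardless of rank, while rsLoRA's grows as √r, which is one reason higher adapter ranks can pay off under rsLoRA.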


Available on Hugging Face

LLiMba is published as a small family of repos under the lballore namespace, grouped together in the LLiMba Collection.

Models

Repository Description
lballore/llimba-3b-instruct The deployable model. Continued pretraining + rsLoRA SFT, merged into a single bf16 checkpoint. Start here if you want to use the model.
lballore/llimba-3b-instruct-cpt Post-CPT intermediate checkpoint. Research artifact for users running their own SFT recipes on top.

Datasets

Repository Description
lballore/llimba-corpus The pretraining corpus (~13.9M tokens, 18,270 documents).
lballore/llimba-sft The supervised fine-tuning data (~14.4K instruction pairs, ~12.8M tokens).
lballore/llimba-flores-srd-eval The 997-sentence FLORES-200 subset used for translation benchmarking.

Live demo

🎮 lballore-llimba-demo.hf.space — interactive Gradio chat with both conversational and translation modes. No installation, no GPU, no token. Click the link, click an example prompt, get a Sardinian response.

Try it now

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="lballore/llimba-3b-instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Ses unu assistente chi chistionat in sardu."},
    {"role": "user", "content": "Salude! Comente ìstas?"},
]
out = pipe(messages, max_new_tokens=200, do_sample=False)
print(out[0]["generated_text"][-1]["content"])
# Bene, gràtzias. E tue comente ìstas?

For inference parameter recommendations and more usage examples, see the model card.


Training Data

The CPT corpus combines every significant digital source of written Sardinian we could find. Counts below reflect the corpus after deduplication and language filtering:

Source Docs Tokens Description
Web scrape (6 sites) 8,110 ~4.9M News, culture, technology in Sardinian
Wikipedia 6,309 ~2.6M Sardinian Wikipedia articles
GlotCC CommonCrawl 2,270 ~1.8M Bulk web crawl, filtered and deduplicated
Translated books (PDF/EPUB/markdown) 409 ~2.0M Literary translations (Orwell, Joyce, García Márquez, Kafka, Cervantes, etc.)
Poetry anthologies 436 ~176K Regional poetry across Sardinian provinces (1400-1900)
Other sources 84 ~39K Bilingual texts, song lyrics, folk tales
Total Sardinian 17,618 ~11.5M
Romance replay (IT/ES/CA/PT) 652 ~2.4M Prevents catastrophic forgetting and Sardinian/Italian register blurring
Total CPT corpus 18,270 ~13.9M

The data pipeline includes deduplication (MinHash LSH), language filtering, boilerplate removal, and quality ordering. All sources are documented in the codebase.
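The MinHash idea behind the deduplication step can be sketched in pure Python. This is a toy illustration, not the project's pipeline code (which would use an LSH index to avoid comparing every pair of documents):

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-grams after whitespace normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """One minimum per salted hash function."""
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "Sa Sardigna est una ìsula bella meda in su Mediterràneu."
b = "Sa Sardigna est una ìsula bella meda in su Mediterraneu."  # near-duplicate
c = "Custu documentu chistionat de una cosa diferente meda."

sa, sb, sc = (minhash_signature(shingles(t)) for t in (a, b, c))
print(estimated_jaccard(sa, sb))  # high: near-duplicates
print(estimated_jaccard(sa, sc))  # low: mostly unrelated
```

Documents whose estimated similarity exceeds a threshold are grouped and all but one copy dropped.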

The SFT pool combines four buckets: machine-translated Capybara entries (NLLB-200 3.3B), parallel translation pairs, native-reviewed synthesized instructions, and song-related QA pairs. After deduplication and a 5x upsample of the native-reviewed bucket, the final pool is ~14.4K pairs (~12.8M tokens).
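The pool assembly (deduplicate, then repeat the native-reviewed bucket 5x) can be illustrated with a toy sketch; the bucket names and contents here are hypothetical placeholders:

```python
def build_sft_pool(buckets: dict, upsample: dict) -> list:
    """Deduplicate by exact text across buckets, then repeat selected buckets."""
    seen, pool = set(), []
    for name, examples in buckets.items():
        for ex in examples:
            if ex not in seen:
                seen.add(ex)
                pool.extend([ex] * upsample.get(name, 1))
    return pool

buckets = {
    "capybara_mt": ["pair a", "pair b", "pair a"],  # contains a duplicate
    "native_reviewed": ["pair c"],
}
pool = build_sft_pool(buckets, upsample={"native_reviewed": 5})
print(len(pool))  # 2 unique capybara pairs + 5 copies of the native pair = 7
```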

Why "pan-Sardinian"?

Sardinian is a macrolanguage that includes several dialects. Rather than targeting a single one, the corpus includes text in LSC (Limba Sarda Comuna), Logudoresu, Campidanesu, and Nugoresu. This reflects how Sardinian is actually written today and makes the model more useful to speakers of all variants. The primary output target is LSC (the standardized form), but the model tries to handle dialectal input as best it can.


Results

Translation Benchmarks (BLEU / chrF)

Evaluated on 997 parallel sentences from FLORES-200, greedy decoding, run through lm-evaluation-harness. The deployed model is the rsLoRA r256 SFT checkpoint.

Direction Base After CPT After SFT (rsLoRA r256)
EN -> SC 2.8 / 27.4 17.3 / 47.8 28.5 / 56.8
IT -> SC 2.2 / 27.5 12.7 / 44.8 21.3 / 52.1
ES -> SC 2.0 / 26.4 11.4 / 43.4 18.6 / 49.4
SC -> EN 11.7 / 44.6 33.5 / 62.8 41.3 / 64.6
SC -> IT 2.9 / 33.4 16.5 / 48.8 17.6 / 47.3
SC -> ES 5.7 / 37.0 19.3 / 47.8 18.6 / 46.3

Into-Sardinian directions improve at every stage. From-Sardinian translation into the closer Romance languages saturates at CPT, since Qwen2.5 already handles Italian and Spanish generation well; SFT adds little to those directions.
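For intuition, chrF (the second number in each cell) is a character n-gram F-score. A simplified sketch (β=2, n=1..6, character n-grams only, whitespace ignored) follows; the reported numbers come from lm-evaluation-harness, not this toy:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    text = "".join(text.split())  # chrF ignores whitespace by default
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average F_beta over character 1..max_n-grams (simplified chrF)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        # F_beta weights recall beta^2 times as much as precision
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(chrf("Sa capitale de sa Sardigna est Casteddu.",
           "Sa capitale de sa Sardigna est Casteddu."))  # 100.0 for an exact match
```

Because it scores character n-grams rather than whole words, chrF gives partial credit for near-miss inflections, which is why it moves more smoothly than BLEU in the table above.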

SFT Method Comparison (EN -> SC BLEU)

Five SFT configurations on identical data, hardware, and evaluation:

Configuration EN -> SC BLEU
Full fine-tuning 21.0
LoRA r64 23.6
rsLoRA r128 25.3
rsLoRA r256 28.5
DoRA r256 23.0

Full numbers across all six directions, plus chrF and bootstrap standard errors, are in docs/llimba.md.

SFT Training and Eval Loss

Final loss values on the SFT data (5% reserved evaluation split):

SFT method Train loss Eval loss
Full fine-tuning 1.19 1.11
DoRA r256 1.08 0.98
rsLoRA r256 0.87 0.75

Loss ordering tracks translation BLEU on into-Sardinian directions. It does not track factual grounding or vocabulary fidelity, both of which we assessed separately on a native-speaker probe set.

Known limitations

All five SFT variants fabricate when asked about content absent from training data. The deployed rsLoRA r256 checkpoint is the most factually grounded of the methods we tested but is not exempt: on long open-ended generation prompts (especially "tell me about Sardinian X"-style queries) it occasionally produces phonotactically plausible but non-attested Sardinian word forms. Bounded structured queries ("list the three main causes of...", "what is the capital of...") give consistently cleaner output. See Section 7 of docs/llimba.md for details.


Quickstart

If you only want to use the model, see the Try it now snippet above; you don't need to clone or build anything. The instructions below are for reproducing the training pipeline from scratch.

Prerequisites

  • GPU: NVIDIA GPU with ≥24 GB VRAM (tested on RTX 4090)
  • Docker with NVIDIA Container Toolkit (for devcontainer)
  • VS Code with Dev Containers extension (recommended)

Setup

  1. Clone the repository:

    git clone https://github.com/lballore/LLiMba.git
    cd LLiMba
  2. Open in VS Code and reopen in the devcontainer (or build manually):

    code .
    # The devcontainer handles CUDA 12.8, Python 3.12, and all dependencies.
    # Then: Ctrl+Shift+P -> "Dev Containers: Reopen in Container"
  3. The post-create.sh script installs all dependencies via uv.

Flash Attention 2 (optional)

Flash Attention 2 is recommended for training but not installed automatically, because building the wheel from source can take up to 180 minutes. To install:

uv pip install flash-attn --no-build-isolation --break-system-packages

The training scripts fall back to SDPA (PyTorch native) if Flash Attention is not available. SDPA is slower but produces identical results.
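The fallback can be expressed as a small helper. This is a sketch of the general pattern, not necessarily the scripts' exact code; `attn_implementation` is the standard Hugging Face transformers argument name:

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Use Flash Attention 2 when the package is importable, else PyTorch SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

# Passed to transformers when loading the model, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     "lballore/llimba-3b-instruct",
#     attn_implementation=pick_attn_implementation(),
# )
print(pick_attn_implementation())
```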

Running the Pipeline

Data collection (requires internet access for web scraping and HuggingFace downloads):

python scripts/data_gathering/gather_corpus_data.py --all
python scripts/data_gathering/gather_sft_data.py --all

Data preparation (clean, deduplicate, filter, build final corpus):

python scripts/data_preparation/prepare_corpus_data.py --all
python scripts/data_preparation/prepare_sft_data.py --all

Continued pretraining:

python scripts/cpt_pretrain.py --model Qwen/Qwen2.5-3B-Instruct

SFT, full fine-tuning:

python scripts/sft_train_full.py --model models/cpt-pretrain-qwen2.5-3b

SFT, LoRA / rsLoRA / DoRA:

python scripts/sft_train_lora.py --model models/cpt-pretrain-qwen2.5-3b

Evaluation:

# Probe test (qualitative, Sardinian prompts)
python scripts/model_evaluation/eval_probe_test.py --model-full models/cpt-pretrain-qwen2.5-3b

# BLEU/chrF benchmarks (quantitative, translation quality)
./scripts/model_evaluation/eval_bleu_chrf.sh models/cpt-pretrain-qwen2.5-3b

CPT runs in roughly 5.5 hours on a single RTX 4090; each SFT run takes a few additional hours, depending on the configuration.


Contributing

Contributions are welcome, especially from Sardinian speakers. Areas where help is most needed:

  • Preference data: Native-speaker preference pairs for DPO-style alignment, the natural next step after the current SFT
  • Evaluation: Native speaker review of model outputs, ideally across multiple dialects
  • SFT data: Additional high-quality instruction-response pairs in Sardinian, especially long-form
  • Data sources: Additional Sardinian text (articles, books, institutional documents)
  • Deployment: Quantization to GGUF, browser/mobile integration testing

Citation

If you use LLiMba in your work, please cite the technical report:

@misc{llimba2026,
  title         = {LLiMba: Sardinian on a Single GPU - Adapting a 3B Language Model to a Vanishing Romance Language},
  author        = {Luca Ballore},
  year          = {2026},
  eprint        = {2605.09015},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.09015}
}

Acknowledgments

  • The Sardinian language community and the speakers who keep the language alive
  • Qwen for the base model
  • The authors of TibetanLLM for methodological inspiration
  • All the authors, translators, and publishers who have written in Sardinian

License

This repository uses two distinct licenses:

  • Code (training scripts, data pipeline, evaluation harness): Apache 2.0.
  • Model weights (the released LLiMba models on Hugging Face): Apache 2.0.

The LLiMba pretraining corpus and SFT dataset are released on Hugging Face under CC-BY-NC-SA-4.0; the FLORES-200 evaluation subset under CC-BY-SA-4.0, inherited from the original FLORES-200.
