A lightweight pipeline for extracting student assignment metadata from unstructured text into a strict JSON format by fine-tuning a SmolLM2-family base model and exporting an Ollama-ready GGUF model.
- Fine-tuned checkpoint (360M variant): https://huggingface.co/nimendraai/SmolLM2-360M-Assignment-Metadata-Extractor
Note: The default training script in this repository currently uses HuggingFaceTB/SmolLM2-135M-Instruct as the base model.
- data/generate_dataset.py: Generates synthetic instruction-tuning examples.
- training/train.py: Fine-tunes HuggingFaceTB/SmolLM2-135M-Instruct using Unsloth + LoRA, then exports HF and GGUF artifacts.
- training/train.ipynb: Notebook version of the same training workflow.
- Python 3.10–3.11
- uv for environment + dependency management
- PyTorch, Hugging Face Datasets, TRL (SFTTrainer)
- Unsloth for efficient LoRA fine-tuning and GGUF export
- Python 3.10 or 3.11
- uv installed
- Recommended for training: NVIDIA GPU with CUDA (CPU training is possible but slow)
uv venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows PowerShell
uv sync
uv run python data/generate_dataset.py --size 400 --output data/dataset.json
uv run python training/train.py

Training writes two output directories:
- ./smollm-student-extractor/ (Hugging Face model/tokenizer)
- ./smollm-student-gguf/ (GGUF export for Ollama)
Create a Modelfile with the following content:
FROM hf.co/nimendraai/SmolLM2-360M-Assignment-Metadata-Extractor:Q4_K_M
# Apply the strict instruction template used during training
TEMPLATE """### Instruction:
Extract student info as JSON from the following text.
### Input:
{{ .Prompt }}
### Response:
"""
# Set the System constraints
SYSTEM """
You are a precise student assignment data extractor.
Output ONLY a valid JSON object. No explanation. No extra text. No markdown.
Return a JSON object with exactly these keys: "student_number", "student_name", and "assignment_number". All values must be strings extracted from the input text.
"""
# Turn off creativity
PARAMETER temperature 0
# Stop generating once the JSON is closed
PARAMETER stop "}"
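
The stop parameter above ends generation at the first closing brace. Ollama generally does not include the matched stop sequence in the text it returns, so the response may arrive without its final `}`; this is an assumption worth checking against your Ollama version. A minimal client-side sketch that tolerates either case and checks the three expected keys follows. The helper name `parse_extraction` is hypothetical and not part of this repository.

```python
import json

# Keys the SYSTEM prompt in the Modelfile asks the model to return.
EXPECTED_KEYS = {"student_number", "student_name", "assignment_number"}


def parse_extraction(raw: str) -> dict:
    """Parse raw model output into a dict with exactly the three string fields."""
    text = raw.strip()
    # Assumption: the stop sequence "}" may have been trimmed from the output,
    # so restore it when the text does not already end with a closing brace.
    if not text.endswith("}"):
        text += "}"
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError("output must be an object with exactly the expected keys")
    if not all(isinstance(value, str) for value in data.values()):
        raise ValueError("all values must be strings")
    return data
```

Calling `parse_extraction` on the model's reply gives a plain dict, or raises `ValueError` (of which `json.JSONDecodeError` is a subclass) when the reply is unusable, which makes failures easy to log and retry.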
Build and run with Ollama:
ollama create assignment-metadata-extractor -f Modelfile
ollama run assignment-metadata-extractor

data/generate_dataset.py creates a JSON list where each item contains:
- instruction
- input
- output (JSON string with keys: student_number, student_name, assignment_number)
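
To make the item shape concrete, here is a small sketch of one record and a check for it. The sample text and values are invented for illustration and are not taken from the generated dataset; `validate_record` is a hypothetical helper, not part of the repository.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")
OUTPUT_KEYS = {"student_number", "student_name", "assignment_number"}

# Illustrative record only; the wording and values are made up.
sample = {
    "instruction": "Extract student info as JSON from the following text.",
    "input": "Assignment 2 submitted by Jane Doe, student no. 20231234.",
    # "output" holds a JSON string, not a nested object.
    "output": json.dumps({
        "student_number": "20231234",
        "student_name": "Jane Doe",
        "assignment_number": "2",
    }),
}


def validate_record(record: dict) -> dict:
    """Check one dataset item and return its decoded output object."""
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    decoded = json.loads(record["output"])
    if not isinstance(decoded, dict) or set(decoded) != OUTPUT_KEYS:
        raise ValueError("output must contain exactly the three expected keys")
    return decoded
```

Running `validate_record` over every item after generation is a cheap way to catch malformed examples before they reach training.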
- training/train.py expects data/dataset.json to exist.
- If the dataset file is missing or empty, training exits with a clear error.
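
The same kind of pre-flight check can be reproduced in your own scripts. The sketch below is an assumption about how such a guard might look, not a copy of what training/train.py does; the function name `require_dataset` and the error messages are hypothetical.

```python
import json
import sys
from pathlib import Path


def require_dataset(path: str = "data/dataset.json") -> list:
    """Return the parsed dataset, or exit with a readable message."""
    dataset_path = Path(path)
    if not dataset_path.is_file():
        sys.exit(f"Dataset not found: {dataset_path}. Run data/generate_dataset.py first.")
    try:
        records = json.loads(dataset_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        # A zero-byte file also lands here, since "" is not valid JSON.
        sys.exit(f"Dataset is not valid JSON: {dataset_path} ({exc})")
    if not isinstance(records, list) or not records:
        sys.exit(f"Dataset is empty or not a JSON list: {dataset_path}")
    return records
```

Passing a string to `sys.exit` prints it to stderr and exits with a non-zero status, so the failure is visible both to a person at the terminal and to a calling script.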
@misc{nimendra_2026,
author = { Nimendra },
title = { SmolLM2-360M-Assignment-Metadata-Extractor (Revision 0da34e3) },
year = 2026,
url = { https://huggingface.co/nimendraai/SmolLM2-360M-Assignment-Metadata-Extractor },
doi = { 10.57967/hf/8468 },
publisher = { Hugging Face }
}