Train GPT-2 or GPT-Neo models on the WikiText dataset using PyTorch and Transformers.
Run:

    ./main.sh

This script provides an interactive setup for training, testing, and evaluation.

- Train: choose the model (GPT-2/GPT-Neo), number of epochs, and execution method (local/Docker)
- Test: run unit tests with `pytest tests/`
- Evaluate: generate text using a trained model

| Directory | Description |
|---|---|
| `src/` | Core code: `data.py`, `train.py`, `evaluate.py`, `model.py` |
| `models/` | Saved model checkpoints |
| `data/` | Tokenized datasets (generate via `python src/data.py`) |
| `docker/` | Docker configuration files |

Edit `src/data.py` to customize the dataset and preprocessing:

    from datasets import load_dataset
    from transformers import GPT2Tokenizer

    # Load dataset (change to your preferred dataset)
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

    # Load tokenizer (change for different models)
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    # GPT-2 ships without a pad token; reuse EOS so padding="max_length" works
    tokenizer.pad_token = tokenizer.eos_token

    # Tokenize function (adjust max_length for sequence length)
    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

    # Apply the tokenizer to every split
    tokenized_datasets = dataset.map(tokenize_function, batched=True)

    # Select subset for training (change range for data size)
    tokenized_datasets["train"] = tokenized_datasets["train"].select(range(1000))

- Default training runs on CPU inside Docker; update the configuration for GPU usage.
- Trained models are automatically saved in `models/` with timestamped filenames.
- Model available on Hugging Face: https://huggingface.co/harpertoken/harpertokenGPT2
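
To make the `truncation`, `padding="max_length"`, and `max_length` settings in `src/data.py` more concrete, here is a minimal sketch that mimics their effect on plain lists of token IDs. It does not call the Transformers library; `pad_or_truncate` is a hypothetical helper written for illustration, and it assumes right-side padding with GPT-2's end-of-text ID (50256) reused as the pad ID, which is a common choice because GPT-2's tokenizer has no pad token by default.

```python
from typing import Dict, List

# GPT-2's end-of-text token ID; assumed here to double as the pad ID.
GPT2_EOS_ID = 50256


def pad_or_truncate(ids: List[int], max_length: int,
                    pad_id: int = GPT2_EOS_ID) -> Dict[str, List[int]]:
    """Hypothetical helper: mimic truncation plus padding to max_length."""
    kept = ids[:max_length]         # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)  # padding: fill up to exactly max_length
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding positions
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([10, 11, 12], max_length=5)
long = pad_or_truncate(list(range(8)), max_length=5)

print(short["input_ids"])       # [10, 11, 12, 50256, 50256]
print(short["attention_mask"])  # [1, 1, 1, 0, 0]
print(long["input_ids"])        # [0, 1, 2, 3, 4]
```

Every example ends up exactly `max_length` tokens long, so lowering `max_length` from 512 shrinks each training example, while raising it keeps more of long articles at a higher memory cost.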

This project enforces Conventional Commits standards for consistency and clarity.

Enable the commit-msg hook:

    cp scripts/commit-msg .git/hooks/commit-msg
    chmod +x .git/hooks/commit-msg

Commit messages must:

- Start with a type: `feat:`, `fix:`, `docs:`, `style:`, `refactor:`, `test:`, `chore:`
- Follow with a concise lowercase description
- Keep the first line ≤ 30 characters

Example:

    feat: add new training feature

Rewrite existing commit messages:

    ./scripts/rewrite_msg.sh
    git push --force-with-lease
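
As a rough illustration of the commit message rules listed earlier, the sketch below checks a message against them in Python. It is not the repository's `scripts/commit-msg` hook, whose contents are not shown here; `check_commit_message` is a hypothetical function, and it assumes that "lowercase description" means the description contains no uppercase letters and that the type is followed by a colon and a single space.

```python
import re
from typing import List

ALLOWED_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")
MAX_FIRST_LINE = 30

# type, colon, one space, then a non-empty description (assumed layout)
PATTERN = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r"): (?P<desc>\S.*)$"
)


def check_commit_message(message: str) -> List[str]:
    """Hypothetical checker: return a list of rule violations (empty if OK)."""
    first_line = message.splitlines()[0] if message.strip() else ""
    problems = []

    match = PATTERN.match(first_line)
    if match is None:
        problems.append("must start with one of: " + ", ".join(ALLOWED_TYPES))
    elif match.group("desc") != match.group("desc").lower():
        problems.append("description must be lowercase")

    if len(first_line) > MAX_FIRST_LINE:
        problems.append(f"first line exceeds {MAX_FIRST_LINE} characters")

    return problems


print(check_commit_message("feat: add new training feature"))  # []
print(check_commit_message("feat: Add GPU support"))
print(check_commit_message("fix: handle very long commit subject lines"))
```

Returning a list of problems rather than a single boolean lets a caller report every violation at once, which is convenient when a message is both too long and missing its type.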