Releases: proger/haloop
Training Transformers
This release doubles down on transformers and introduces a training loop program, `hala`. Pretraining bidirectional models with the token denoising objective (aka masked LM) is available via `hala --objective denoise`. The first training run on the uk4b dataset is happening here: https://wandb.ai/stud76/ha/runs/tjoqx491?workspace=user-stud76

Existing causal models can now be finetuned with the conditional language modeling objective via `hala --objective cond`.
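The exact corruption recipe `hala --objective denoise` uses isn't documented here; as a point of reference, a generic BERT-style masked-LM corruption step (with the usual 80/10/10 split, sketched in numpy under those assumptions) looks like this:

```python
import numpy as np

def mask_tokens(tokens, mask_id, vocab_size, p=0.15, seed=0):
    """Token denoising: corrupt ~p of the positions and return
    (inputs, targets), where targets is -100 (ignore) at clean positions."""
    rng = np.random.default_rng(seed)
    tokens = np.asarray(tokens)
    inputs = tokens.copy()
    targets = np.full_like(tokens, -100)           # -100 = ignored by the loss
    corrupt = rng.random(tokens.shape) < p         # choose positions to predict
    targets[corrupt] = tokens[corrupt]
    roll = rng.random(tokens.shape)
    inputs[corrupt & (roll < 0.8)] = mask_id       # 80%: replace with [MASK]
    rand = corrupt & (roll >= 0.8) & (roll < 0.9)  # 10%: replace with random token
    inputs[rand] = rng.integers(0, vocab_size, rand.sum())
    return inputs, targets                         # remaining 10%: keep as-is
```

The model is then trained to reconstruct `targets` from `inputs` with cross-entropy over the corrupted positions only.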
`hat` is now a REPL for both causal and bidirectional models. The `hat` REPL now supports history thanks to readline.
![image](https://private-user-images.githubusercontent.com/66214/246396210-546217e0-9f24-4f6d-8df7-bd70beb205cb.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzMzc2MzAsIm5iZiI6MTcyMTMzNzMzMCwicGF0aCI6Ii82NjIxNC8yNDYzOTYyMTAtNTQ2MjE3ZTAtOWYyNC00ZjZkLThkZjctYmQ3MGJlYjIwNWNiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzE4VDIxMTUzMFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWY4MTdjMDIwN2VmM2U2ZGQ3YzFmZTQzMGZkNzQyNzgzODhjNTdmMWQ1MWMxZjFmNTIzNjkxMDI4NGU5ZTJkOWYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.1Dwisn4iti4oYUqU72jWDdjlKx9TYMsw332j4LpL_5U)
The RNN training program `hal` now supports training from `u16` binary datasets, like `hala` does. This allowed me to train a world model on VQ-VAE-tokenized images.
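The `u16` layout isn't spelled out in these notes; my assumption is a headerless stream of native-endian `uint16` token ids (2 bytes per token, so the vocabulary must fit in 65536 entries), which numpy can write and memory-map directly:

```python
import numpy as np
import tempfile, os

# Assumed u16 layout: a flat, headerless dump of uint16 token ids.
path = os.path.join(tempfile.mkdtemp(), "train.u16")
tokens = np.array([17, 42, 65535, 3], dtype=np.uint16)
tokens.tofile(path)  # raw native-endian bytes, no header

# Reading back is a zero-copy memmap: the corpus never has to fit in RAM.
data = np.memmap(path, dtype=np.uint16, mode="r")
```

Slicing `data` yields training batches without deserialization, which is what makes this format attractive for both `hal` and `hala`.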
New randomly initialized checkpoints can be created with the new `hai` program.
Acoustic training with words
This release enables users of `hac` to train word-level (token-level) models from manifests with txt files:

```
hac --train labels:train.tsv --eval labels:eval.tsv --vocab words:words.txt
```
TSV files are expected to be formatted as below. This format is inspired by Kaldi's `text` files, with paths instead of utterance ids.

```
path/to/utterance.wav word1 word2 word3
```

`words.txt` files contain lists of words; repeated words are ignored.
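A minimal reader for these two file formats might look like the following (the parsing details are my assumption, based only on the example line and the description above):

```python
def read_manifest(path):
    """Parse a manifest: '<wav path> <word> <word> ...' per line."""
    examples = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            examples.append((fields[0], fields[1:]))
    return examples

def read_vocab(path):
    """words.txt: one word per line; repeated words are ignored."""
    vocab = {}
    with open(path) as f:
        for line in f:
            word = line.strip()
            if word and word not in vocab:
                vocab[word] = len(vocab)  # first occurrence wins
    return vocab
```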
uk4b with LoRA
This release supports running models adapted using LoRA. New modules: `ha.lora`. New APIs: `ha.attention.load_model`.
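These notes don't show `ha.lora`'s internals; for orientation, here is a minimal numpy sketch of the LoRA idea itself: a frozen weight plus a trainable low-rank update `B @ A`, scaled by `alpha / r`, applied without ever materializing the merged matrix.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Forward pass of a LoRA-adapted linear layer.

    W: (out, in) frozen base weight; A: (r, in) and B: (out, r) are trainable.
    The adapted weight is W + (alpha / r) * B @ A, but we apply the low-rank
    factors directly instead of materializing the merged matrix.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
A = rng.standard_normal((2, 4)) * 0.01
B = np.zeros((8, 2))  # B starts at zero, so the adapter is a no-op initially
x = rng.standard_normal((3, 4))
```

Zero-initializing `B` is the standard trick that makes an adapted checkpoint start out exactly equal to the base model.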
uk4b Transformers
This release introduces a REPL for models trained for my and @dchaplinsky's paper, GPT-2 Metadata Pretraining Towards Instruction Finetuning for Ukrainian. The REPL is accessible via a new CLI program, `hat`.
To use `hat`, first install some additional dependencies and download the tokenizer and model:

```
pip install haloop --upgrade            # make sure you have at least 0.0.7
pip install bitsandbytes sentencepiece  # I opted not to install these as dependencies for now
wget https://a.wilab.org.ua/gpt/wiki.model  # sentencepiece tokenizer
wget https://a.wilab.org.ua/gpt/ckpt10m.pt  # model checkpoint for GPT-2 Large
```
Now, start the REPL:

```
hat --spm wiki.model ckpt10m.pt
```
v0.0.6: Transducer preparations
- Train acoustic models with byte targets out of the box.
- Complete batched Transducer loss implementation.
- Default to the torch implementation of CTC loss (10x faster for now).
- Add a ResNet-32 encoder as an option.
- When evaluating LMs, report bits per character (BPC).
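The BPC figure is presumably the usual conversion of average negative log-likelihood from nats to bits per character; assuming the loss is summed in nats over the evaluation stream, the conversion is a one-liner:

```python
import math

def bits_per_char(total_nll_nats, total_chars):
    """Convert a summed negative log-likelihood (in nats) to bits per character:
    BPC = NLL / (N * ln 2). A uniform model over 256 bytes scores exactly 8 BPC."""
    return total_nll_nats / (total_chars * math.log(2))
```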
License code under GPLv3.
v0.0.5: Progress update
haloop v0.0.4
Renaming the project to haloop. I finally have the name I like.