Releases: proger/haloop

Training Transformers

16 Jun 10:57
26e6da1

This release doubles down on transformers and introduces a training loop program, hala. Pretraining bidirectional models with the token denoising objective (aka masked LM) is available via hala --objective denoise. The first training run on the uk4b dataset is happening here: https://wandb.ai/stud76/ha/runs/tjoqx491?workspace=user-stud76
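
For reference, here is a minimal sketch of what a BERT-style token denoising objective does; the masking rates and the helper below are illustrative, not hala's actual implementation:

import torch

def mask_tokens(ids, mask_id, vocab_size, p=0.15):
    # pick roughly p of all positions as prediction targets
    targets = ids.clone()
    picked = torch.rand(ids.shape) < p
    targets[~picked] = -100                       # ignored by cross_entropy
    corrupted = ids.clone()
    # of the picked positions: ~80% become the mask token, ~10% random, ~10% unchanged
    replace = picked & (torch.rand(ids.shape) < 0.8)
    corrupted[replace] = mask_id
    randomize = picked & ~replace & (torch.rand(ids.shape) < 0.5)
    corrupted[randomize] = torch.randint(vocab_size, ids.shape)[randomize]
    return corrupted, targets

The bidirectional model reads corrupted and is trained with cross-entropy against targets, so only the picked positions contribute to the loss.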

Existing causal models can now be finetuned with the conditional language modeling objective $p(y|x)$, which can be used to implement classification, via hala --objective cond.
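
A sketch of the conditional objective: ordinary next-token cross-entropy with the loss masked out over the prompt x, so only the y tokens contribute (prompt_len is an assumed per-example prompt length, not hala's API):

import torch.nn.functional as F

def cond_lm_loss(logits, ids, prompt_len):
    targets = ids[:, 1:].clone()                  # shift for next-token prediction
    for i, n in enumerate(prompt_len):
        targets[i, : n - 1] = -100                # no loss on the prompt x
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)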

hat is now a REPL for both causal and bidirectional models. The REPL also supports history, thanks to readline.


The RNN training program hal now supports training from u16 binary datasets, just like hala. This allowed me to train a world model on VQ-VAE-tokenized images.
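
Assuming the u16 format is a flat array of uint16 token ids on disk (nanoGPT-style .bin files; the file name here is made up), batches can be sliced out of a numpy memmap:

import numpy as np
import torch

data = np.memmap("train.bin", dtype=np.uint16, mode="r")

def get_batch(block_size=256, batch_size=32):
    starts = np.random.randint(len(data) - block_size - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[s : s + block_size].astype(np.int64)) for s in starts])
    y = torch.stack([torch.from_numpy(data[s + 1 : s + 1 + block_size].astype(np.int64)) for s in starts])
    return x, y                                   # y is x shifted by one token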

New randomly initialized checkpoints can be created with the new hai program.

Acoustic training with words

25 May 18:16
ec28060

This release enables users of hac to train word-level (token-level) models from TSV manifests and a word-list vocabulary:

hac --train labels:train.tsv --eval labels:eval.tsv --vocab words:words.txt

TSV files are expected to be formatted as below. This format is inspired by Kaldi text files, with paths instead of utterance ids.

path/to/utterance.wav  word1 word2 word3

words.txt is a file with a list of words; repeated words are ignored.
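
A sketch of how such a manifest and vocabulary could be parsed; the helper names are illustrative, not hac's internals:

def read_manifest(path):
    # each line: path/to/utterance.wav <tab> word1 word2 ...
    examples = []
    with open(path) as f:
        for line in f:
            wav, *words = line.split()
            examples.append((wav, words))
    return examples

def read_vocab(path):
    # one word per line; repeated words keep their first id
    vocab = {}
    with open(path) as f:
        for word in map(str.strip, f):
            if word and word not in vocab:
                vocab[word] = len(vocab)
    return vocab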

uk4b with LoRA

20 May 16:11

This release supports running models adapted with LoRA. New module: ha.lora. New API: ha.attention.load_model.
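
ha.lora's internals are not documented here; as a rough sketch, a LoRA adapter wraps a frozen linear layer with a trainable low-rank update:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # y = W x + scale * (B A) x, with W frozen and A, B trainable
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale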

uk4b Transformers

07 May 15:08

This release introduces a REPL for models trained for my and @dchaplinsky's paper on GPT-2 Metadata Pretraining Towards Instruction Finetuning for Ukrainian.

The REPL is accessible via a new CLI program, hat.

To use hat, first install some additional dependencies and models:

pip install haloop --upgrade               # make sure you have at least 0.0.7
pip install bitsandbytes sentencepiece     # I opted not to install these as dependencies for now

wget https://a.wilab.org.ua/gpt/wiki.model  # sentencepiece tokenizer
wget https://a.wilab.org.ua/gpt/ckpt10m.pt  # model checkpoint for GPT-2 Large

Now, start the REPL:

hat --spm wiki.model ckpt10m.pt

v0.0.6: Transducer preparations

30 Apr 11:30

Train acoustic models with byte targets out of the box. Complete batched Transducer loss implementation. Default to the torch implementation of CTC loss (10x faster for now). Add a ResNet-32 encoder as an option. When evaluating LMs, report BPC (bits per character).
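
For context, the torch CTC call and the BPC conversion look roughly like this; the shapes, the blank at index 0, and the example loss value are assumptions:

import math
import torch
import torch.nn.functional as F

T, N, C, S = 50, 4, 257, 12                       # frames, batch, 256 bytes + blank, target length
log_probs = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, S))             # byte targets; index 0 is the blank
loss = F.ctc_loss(log_probs, targets,
                  input_lengths=torch.full((N,), T),
                  target_lengths=torch.full((N,), S),
                  blank=0)

nll = 1.4                                         # example LM cross-entropy, nats per character
bpc = nll / math.log(2)                           # bits per character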

License code under GPLv3.

v0.0.5: Progress update

16 Apr 19:51
9f76105

haloop v0.0.4

13 Apr 14:23
12cb45e

Renamed the project to haloop. I finally have a name I like.

v0.0.3

10 Apr 09:03
51eb830

This release adds customizable eval prompts, args metadata in checkpoints, and top-k sampling in hal.
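
Top-k sampling keeps only the k most likely next tokens before sampling; a minimal sketch, not hal's exact code:

import torch

def sample_top_k(logits, k=40, temperature=1.0):
    logits = logits / temperature
    v, _ = torch.topk(logits, k)
    logits = logits.masked_fill(logits < v[..., -1:], float("-inf"))  # drop everything outside the top k
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)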

v0.0.2

04 Apr 08:42
f0caac8

This PyPI release features an updated README.

v0.0.1

04 Apr 08:31

This is the first package release on PyPI. Available CLI programs are hac for acoustic model training and hal for language modeling.