Releases: proger/haloop
Training Transformers
This release doubles down on transformers and introduces a training loop program, `hala`. Pretraining bidirectional models with the token denoising objective (aka masked LM) is available via `hala --objective denoise`. The first training run on the uk4b dataset is happening here: https://wandb.ai/stud76/ha/runs/tjoqx491?workspace=user-stud76

Existing causal models can now be finetuned with the conditional language modeling objective via `hala --objective cond`.
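The exact corruption recipe `hala --objective denoise` uses isn't documented here; as a point of reference, a generic BERT-style masked-LM corruption step (with the usual 80/10/10 split, sketched in numpy under those assumptions) looks like this:

```python
import numpy as np

def mask_tokens(tokens, mask_id, vocab_size, p=0.15, seed=0):
    """Token denoising: corrupt ~p of the positions and return
    (inputs, targets), where targets is -100 (ignore) at clean positions."""
    rng = np.random.default_rng(seed)
    tokens = np.asarray(tokens)
    inputs = tokens.copy()
    targets = np.full_like(tokens, -100)           # -100 = ignored by the loss
    corrupt = rng.random(tokens.shape) < p         # choose positions to predict
    targets[corrupt] = tokens[corrupt]
    roll = rng.random(tokens.shape)
    inputs[corrupt & (roll < 0.8)] = mask_id       # 80%: replace with [MASK]
    rand = corrupt & (roll >= 0.8) & (roll < 0.9)  # 10%: replace with random token
    inputs[rand] = rng.integers(0, vocab_size, rand.sum())
    return inputs, targets                         # remaining 10%: keep as-is
```

The model is then trained to reconstruct `targets` from `inputs` with cross-entropy over the corrupted positions only.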
`hat` is now a REPL for both causal and bidirectional models. The `hat` REPL now supports history thanks to readline.
![image](https://private-user-images.githubusercontent.com/66214/246396210-546217e0-9f24-4f6d-8df7-bd70beb205cb.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzMzc2MzAsIm5iZiI6MTcyMTMzNzMzMCwicGF0aCI6Ii82NjIxNC8yNDYzOTYyMTAtNTQ2MjE3ZTAtOWYyNC00ZjZkLThkZjctYmQ3MGJlYjIwNWNiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzE4VDIxMTUzMFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWY4MTdjMDIwN2VmM2U2ZGQ3YzFmZTQzMGZkNzQyNzgzODhjNTdmMWQ1MWMxZjFmNTIzNjkxMDI4NGU5ZTJkOWYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.1Dwisn4iti4oYUqU72jWDdjlKx9TYMsw332j4LpL_5U)
The RNN training program `hal` now supports training from `u16` binary datasets, like `hala` does. This allowed me to train a world model on VQ-VAE-tokenized images.
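The `u16` layout isn't spelled out in these notes; my assumption is a headerless stream of native-endian `uint16` token ids (2 bytes per token, so the vocabulary must fit in 65536 entries), which numpy can write and memory-map directly:

```python
import numpy as np
import tempfile, os

# Assumed u16 layout: a flat, headerless dump of uint16 token ids.
path = os.path.join(tempfile.mkdtemp(), "train.u16")
tokens = np.array([17, 42, 65535, 3], dtype=np.uint16)
tokens.tofile(path)  # raw native-endian bytes, no header

# Reading back is a zero-copy memmap: the corpus never has to fit in RAM.
data = np.memmap(path, dtype=np.uint16, mode="r")
```

Slicing `data` yields training batches without deserialization, which is what makes this format attractive for both `hal` and `hala`.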
New randomly initialized checkpoints can be created with the new `hai` program.
Acoustic training with words
This release enables users of `hac` to train word-level (token-level) models from manifests with txt files:

```
hac --train labels:train.tsv --eval labels:eval.tsv --vocab words:words.txt
```
TSV files are expected to be formatted as below. This format is inspired by Kaldi's `text` files, with paths instead of utterance ids.

```
path/to/utterance.wav word1 word2 word3
```

`words.txt` files contain lists of words; repeated words are ignored.
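A minimal reader for these two file formats might look like the following (the parsing details are my assumption, based only on the example line and the description above):

```python
def read_manifest(path):
    """Parse a manifest: '<wav path> <word> <word> ...' per line."""
    examples = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            examples.append((fields[0], fields[1:]))
    return examples

def read_vocab(path):
    """words.txt: one word per line; repeated words are ignored."""
    vocab = {}
    with open(path) as f:
        for line in f:
            word = line.strip()
            if word and word not in vocab:
                vocab[word] = len(vocab)  # first occurrence wins
    return vocab
```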
uk4b with LoRA
This release supports running models adapted using LoRA. New modules: `ha.lora`. New APIs: `ha.attention.load_model`.
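These notes don't show `ha.lora`'s internals; for orientation, here is a minimal numpy sketch of the LoRA idea itself: a frozen weight plus a trainable low-rank update `B @ A`, scaled by `alpha / r`, applied without ever materializing the merged matrix.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Forward pass of a LoRA-adapted linear layer.

    W: (out, in) frozen base weight; A: (r, in) and B: (out, r) are trainable.
    The adapted weight is W + (alpha / r) * B @ A, but we apply the low-rank
    factors directly instead of materializing the merged matrix.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
A = rng.standard_normal((2, 4)) * 0.01
B = np.zeros((8, 2))  # B starts at zero, so the adapter is a no-op initially
x = rng.standard_normal((3, 4))
```

Zero-initializing `B` is the standard trick that makes an adapted checkpoint start out exactly equal to the base model.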
uk4b Transformers
This release introduces a REPL for models trained for my and @dchaplinsky's paper, GPT-2 Metadata Pretraining Towards Instruction Finetuning for Ukrainian. The REPL is accessible via a new CLI program, `hat`.
To use `hat`, first install some additional dependencies and download the tokenizer and model:

```
pip install haloop --upgrade            # make sure you have at least 0.0.7
pip install bitsandbytes sentencepiece  # I opted not to install these as dependencies for now
wget https://a.wilab.org.ua/gpt/wiki.model  # sentencepiece tokenizer
wget https://a.wilab.org.ua/gpt/ckpt10m.pt  # model checkpoint for GPT-2 Large
```
Now, start the REPL:

```
hat --spm wiki.model ckpt10m.pt
```
v0.0.6: Transducer preparations
- Train acoustic models with byte targets out of the box.
- Complete batched Transducer loss implementation.
- Default to the torch implementation of CTC loss (10x faster for now).
- Add a ResNet-32 encoder as an option.
- When evaluating LMs, report bits per character (BPC).
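The BPC figure is presumably the usual conversion of average negative log-likelihood from nats to bits per character; assuming the loss is summed in nats over the evaluation stream, the conversion is a one-liner:

```python
import math

def bits_per_char(total_nll_nats, total_chars):
    """Convert a summed negative log-likelihood (in nats) to bits per character:
    BPC = NLL / (N * ln 2). A uniform model over 256 bytes scores exactly 8 BPC."""
    return total_nll_nats / (total_chars * math.log(2))
```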
License code under GPLv3.
v0.0.5: Progress update
haloop v0.0.4
Renaming the project to haloop. I finally have the name I like.