nanoXLSTM

The simplest, fastest repository for training/finetuning medium-sized xLSTMs.

nanoXLSTM is a minimal codebase for experimenting with language models based on the xLSTM (Extended Long Short-Term Memory) architecture from the research paper "xLSTM: Extended Long Short-Term Memory", and it is heavily inspired by Andrej Karpathy's nanoGPT.


**Note:** Work in progress! I am still improving the quality of the generated text.

No lofty goals here - just a simple codebase for tinkering with this innovative xLSTM technology!

Contributions are more than welcome as I continue exploring this exciting research direction.

install

pip install torch numpy transformers datasets tiktoken wandb tqdm

data prep

python data/shakespeare_char/prepare.py

train

python train.py config/train_shakespeare_char.py

inference

python sample.py --out_dir=out-shakespeare-char

todos

  • Run hyperparameter sweep

changelog

Model changes:

  • Import OneCycleLR: the OneCycleLR scheduler is imported from torch.optim.lr_scheduler.
  • sLSTM class: a forget-gate bias (f_bias) and dropout are added.
  • mLSTM class: a forget-gate bias (f_bias) and dropout are added.
  • xLSTMBlock class: implemented with a configurable ratio of sLSTM and mLSTM blocks, with layer normalization applied.
  • GPT class: uses xLSTM_blocks instead of separate sLSTM and mLSTM blocks.
  • configure_optimizers method: updated in the GPT class to use the AdamW optimizer and the OneCycleLR scheduler (a rough sketch of the block and optimizer wiring follows this list).
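
To make the block layout and optimizer wiring concrete, here is a minimal, self-contained sketch. The class and method names mirror the changelog entries, but the internals are simplified placeholders (a plain nn.LSTM cell stands in for the real sLSTM/mLSTM implementations), so treat it as an illustration rather than the actual model.py code.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import OneCycleLR

class xLSTMBlock(nn.Module):
    """Sketch of a block: layer norm, a recurrent sub-layer, dropout, and a residual path."""
    def __init__(self, n_embd: int, dropout: float = 0.1, use_slstm: bool = True):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)
        # Placeholder recurrent cell; the real block would instantiate an sLSTM or mLSTM here.
        self.cell = nn.LSTM(n_embd, n_embd, batch_first=True)
        self.use_slstm = use_slstm
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y, _ = self.cell(self.ln(x))
        return x + self.dropout(y)

class SketchModel(nn.Module):
    """Stack of blocks chosen by a configurable sLSTM:mLSTM ratio."""
    def __init__(self, vocab_size: int = 65, n_embd: int = 128,
                 n_layer: int = 4, slstm_ratio: float = 0.5):
        super().__init__()
        n_slstm = int(round(n_layer * slstm_ratio))
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.blocks = nn.ModuleList(
            [xLSTMBlock(n_embd, use_slstm=(i < n_slstm)) for i in range(n_layer)]
        )
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        x = self.emb(idx)
        for block in self.blocks:
            x = block(x)
        return self.head(x)

    def configure_optimizers(self, learning_rate: float = 6e-4,
                             weight_decay: float = 0.1, total_steps: int = 5000):
        # AdamW paired with OneCycleLR, as described above; hyperparameters are illustrative.
        optimizer = torch.optim.AdamW(self.parameters(), lr=learning_rate,
                                      betas=(0.9, 0.95), weight_decay=weight_decay)
        scheduler = OneCycleLR(optimizer, max_lr=learning_rate, total_steps=total_steps)
        return optimizer, scheduler
```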

20/05/24

  • Initialize the forget-gate bias (self.f_bias) with values between 3 and 6 instead of ones, so the forget gate is effective from the start of training.
  • Add a stabilization step to avoid overflow from the exponential function: compute a stabilization factor with max and subtract it from the input- and forget-gate pre-activations before applying the exponential (see the sketch below).
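
A small, self-contained sketch of both points; the names (f_bias, stabilized_exp_gates) and shapes are illustrative, not the exact ones used in model.py.

```python
import torch
import torch.nn as nn

hidden_size = 128

# 1) Forget-gate bias initialized uniformly in [3, 6] instead of ones,
#    so the forget gate starts out mostly open.
f_bias = nn.Parameter(torch.empty(hidden_size).uniform_(3.0, 6.0))

# 2) Max-based stabilization: subtract a shared factor before exponentiating,
#    so exp() never sees a large positive argument.
def stabilized_exp_gates(i_pre: torch.Tensor, f_pre: torch.Tensor):
    m = torch.maximum(i_pre, f_pre)   # stabilization factor
    i_gate = torch.exp(i_pre - m)     # both arguments are <= 0, so exp() cannot overflow
    f_gate = torch.exp(f_pre - m)
    return i_gate, f_gate, m          # m is returned so downstream terms can stay consistent

# Example: large pre-activations that would overflow a naive exp() in float32
i_pre = torch.randn(2, hidden_size) * 50
f_pre = torch.randn(2, hidden_size) * 50 + f_bias
i_gate, f_gate, m = stabilized_exp_gates(i_pre, f_pre)
```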

Training script changes:

  • Import statement: the OneCycleLR scheduler is imported.
  • Optimizer and scheduler initialization: both are obtained from the GPT class's configure_optimizers method.
  • Loading optimizer and scheduler state: both states are restored from the checkpoint when resuming training.
  • Saving scheduler state: the scheduler state is included in the checkpoint dictionary.
  • Stepping the scheduler: scheduler.step() is called after each optimizer step.
  • Logging learning rate and MFU: both are logged with wandb (when wandb_log is enabled).
  • estimate_loss function: updated to use the ctx context manager.
  • Training loop: updated to use scaler.scale(loss).backward() and scaler.step(optimizer) for gradient scaling when training in fp16 (a self-contained sketch of this wiring follows the list).
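
For reference, here is a compact, self-contained sketch of how the scheduler and gradient scaler fit into the loop. A toy linear model and random batches stand in for the real model and get_batch(), and the checkpoint layout shown is illustrative rather than the exact one written by train.py.

```python
import contextlib
import torch
from torch.optim.lr_scheduler import OneCycleLR

device = "cuda" if torch.cuda.is_available() else "cpu"
use_fp16 = device == "cuda"
model = torch.nn.Linear(64, 64).to(device)        # stand-in for the xLSTM model
max_iters = 100

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)
scheduler = OneCycleLR(optimizer, max_lr=6e-4, total_steps=max_iters)
scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)
ctx = (torch.autocast(device_type="cuda", dtype=torch.float16)
       if use_fp16 else contextlib.nullcontext())

for it in range(max_iters):
    x = torch.randn(8, 64, device=device)         # stand-in for get_batch("train")
    with ctx:                                     # the same autocast context is reused in estimate_loss
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()                 # scaled backward pass for fp16
    scaler.step(optimizer)                        # unscales the gradients, then optimizer.step()
    scaler.update()
    scheduler.step()                              # OneCycleLR advances once per optimizer step
    optimizer.zero_grad(set_to_none=True)

# The checkpoint carries optimizer and scheduler state so training can resume cleanly.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}
# On resume: optimizer.load_state_dict(checkpoint["optimizer"]); scheduler.load_state_dict(checkpoint["scheduler"])
```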
