This repository contains a from-scratch implementation of a Transformer decoder in PyTorch. The goal of the project is to generate text based on Jules Verne's literary works, using the original Transformer model proposed in the "Attention Is All You Need" paper and its subsequent improvements. It is also an application of what I learned from Andrej Karpathy's latest YouTube series.
To use this code, first clone the repository:

```bash
git clone https://github.com/joaoflf/transformer_decoder_pytorch.git
cd transformer_decoder_pytorch
```

Next, install the dependencies:

```bash
pip install -r requirements.txt
```
The `train.py` script trains the model. It accepts the following command line arguments:

- `--iters`: Total iterations to train. Default is 5000.
- `--batch-size`: Batch size. Default is 32.
- `--lr`: Learning rate. Default is 3e-4.
- `--device`: Device to use for training. Default is "cuda" if CUDA is available, otherwise "mps".
- `--checkpoint_dir`: Directory to save the model checkpoints. Default is "checkpoints".
Example usage:

```bash
python train.py --iters 10000 --batch-size 64 --lr 1e-4 --device cuda --checkpoint_dir my_checkpoints
```

This will train the model for 10,000 iterations with a batch size of 64 and a learning rate of 1e-4, using a CUDA device for training. The model checkpoints will be saved in the `my_checkpoints` directory.
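For reference, here is a minimal sketch of how these flags could be wired up with `argparse`; the actual `train.py` may organize this differently:

```python
import argparse

import torch


def parse_args():
    parser = argparse.ArgumentParser(description="Train the transformer decoder")
    parser.add_argument("--iters", type=int, default=5000, help="total iterations to train")
    parser.add_argument("--batch-size", type=int, default=32, help="batch size")
    parser.add_argument("--lr", type=float, default=3e-4, help="learning rate")
    parser.add_argument(
        "--device",
        default="cuda" if torch.cuda.is_available() else "mps",
        help="device to use for training",
    )
    parser.add_argument("--checkpoint_dir", default="checkpoints", help="directory for model checkpoints")
    return parser.parse_args()
```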
The `generate.py` script generates new text from a trained model. It accepts the following command line arguments:

- `--checkpoint_path`: Path to the model checkpoint. This argument is required.
  - You can download the latest trained weights here
- `--num_tokens`: Number of tokens to generate. Default is 100.
Example usage:

```bash
python generate.py --checkpoint_path my_checkpoints/model_state_10000.pt --num_tokens 500
```

This will generate 500 new tokens from the model checkpoint at `my_checkpoints/model_state_10000.pt`.
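Internally, generation is an autoregressive sampling loop of roughly this shape. This is a sketch assuming the model returns logits of shape `(B, T, vocab_size)`; see `generate.py` for the actual implementation:

```python
import torch


@torch.no_grad()
def generate(model, idx, num_tokens, block_size):
    # idx is a (B, T) tensor of token ids used as the prompt/context
    model.eval()
    for _ in range(num_tokens):
        idx_cond = idx[:, -block_size:]        # crop the context to block_size
        logits = model(idx_cond)               # (B, T, vocab_size)
        logits = logits[:, -1, :]              # keep only the last time step
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next token
        idx = torch.cat([idx, next_id], dim=1)              # append and continue
    return idx
```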
- ✅ Start with a basic bigram model and a basic table-lookup embedding layer.

  iterations: 10,000, batch_size: 32

  | Metric | Value |
  | --- | --- |
  | Train Loss | 2.57 |
  | Val Loss | N/A |
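  For context, a bigram language model of this kind is just an embedding table that maps each token id directly to the logits of the next token. A minimal sketch (class and attribute names are illustrative, not the repository's exact code):

  ```python
  import torch.nn as nn


  class BigramLanguageModel(nn.Module):
      def __init__(self, vocab_size):
          super().__init__()
          # each token id indexes a row of next-token logits
          self.token_embedding = nn.Embedding(vocab_size, vocab_size)

      def forward(self, idx):
          # idx: (B, T) tensor of token ids -> logits: (B, T, vocab_size)
          return self.token_embedding(idx)
  ```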
- ✅ Add a self-attention block and introduce basic positional embeddings.

  iterations: 10,000, batch_size: 32, block_size: 8, embed_size: 256

  | Metric | Value |
  | --- | --- |
  | Train Loss | 2.4980 |
  | Val Loss | 2.5421 |
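  A minimal sketch of a single causal self-attention head, assuming token embeddings plus learned positional embeddings are summed before this layer (illustrative, not the repository's exact code):

  ```python
  import torch
  import torch.nn as nn
  import torch.nn.functional as F


  class SelfAttentionHead(nn.Module):
      def __init__(self, embed_size, head_size, block_size):
          super().__init__()
          self.key = nn.Linear(embed_size, head_size, bias=False)
          self.query = nn.Linear(embed_size, head_size, bias=False)
          self.value = nn.Linear(embed_size, head_size, bias=False)
          # causal mask: each position may only attend to itself and earlier positions
          self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

      def forward(self, x):
          B, T, C = x.shape
          k, q, v = self.key(x), self.query(x), self.value(x)
          wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T), scaled dot product
          wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
          wei = F.softmax(wei, dim=-1)
          return wei @ v                                        # (B, T, head_size)
  ```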
- ✅ Implement multihead self-attention.

  iterations: 10,000, batch_size: 32, block_size: 8, embed_size: 256, num_heads: 8

  | Metric | Value |
  | --- | --- |
  | Train Loss | 2.1 |
  | Val Loss | 2.13 |
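  Multi-head attention can be sketched as several of the heads above run in parallel and concatenated, reusing the `SelfAttentionHead` from the previous sketch (illustrative only):

  ```python
  import torch
  import torch.nn as nn


  class MultiHeadAttention(nn.Module):
      def __init__(self, num_heads, embed_size, block_size):
          super().__init__()
          head_size = embed_size // num_heads
          self.heads = nn.ModuleList(
              [SelfAttentionHead(embed_size, head_size, block_size) for _ in range(num_heads)]
          )
          self.proj = nn.Linear(embed_size, embed_size)

      def forward(self, x):
          # run the heads independently, concatenate along the channel dimension,
          # then project back to embed_size
          out = torch.cat([h(x) for h in self.heads], dim=-1)
          return self.proj(out)
  ```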
- ✅ Add a feed-forward network and stack multiple blocks of multi-head attention.

  iterations: 10,000, batch_size: 32, block_size: 8, embed_size: 256, num_heads: 8, num_blocks: 4

  | Metric | Value |
  | --- | --- |
  | Train Loss | 3.13 |
  | Val Loss | 3.17 |

  *The network is now too deep and is hurting training performance.*
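  The feed-forward part is a small position-wise MLP applied after attention in each block. A minimal sketch, with the 4x expansion following the original paper (the repository's dimensions may differ):

  ```python
  import torch.nn as nn


  class FeedForward(nn.Module):
      def __init__(self, embed_size):
          super().__init__()
          self.net = nn.Sequential(
              nn.Linear(embed_size, 4 * embed_size),  # expand
              nn.ReLU(),
              nn.Linear(4 * embed_size, embed_size),  # project back to embed_size
          )

      def forward(self, x):
          return self.net(x)
  ```

  Stacking several attention + feed-forward blocks without normalization or residual connections makes the network hard to optimize, which is consistent with the loss increase above.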
- ✅ Implement Layer Normalization and residual connections. Scale up the model.

  GPU: M1 Pro 10-core, iterations: 5,000, batch_size: 64, block_size: 256, embed_size: 384, num_heads: 6, num_blocks: 6, dropout: 0.2

  | Metric | Value |
  | --- | --- |
  | Train Loss | 1.02 |
  | Val Loss | 1.19 |

  Generated text:

  F the fact of this life appeared for its last ten to the Northern minutes which formed me a mountain number of our worthy and millions that we have made for land known of the Central Sea." "Well," said the Professor; "it is a depth of extraordinary track, their island wood." "But it is quite getting at Ned Land." At this moment, I saw the amed horizontal horrible at last would the hargonal man. I came to fain the extraordinary and excitement power on the other you."
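  A minimal sketch of the resulting decoder block, reusing the `MultiHeadAttention` and `FeedForward` sketches above. It is shown here in the pre-norm arrangement; whether the repository uses pre- or post-norm, and where dropout sits, are assumptions:

  ```python
  import torch.nn as nn


  class Block(nn.Module):
      def __init__(self, embed_size, num_heads, block_size, dropout=0.2):
          super().__init__()
          self.sa = MultiHeadAttention(num_heads, embed_size, block_size)
          self.ffwd = FeedForward(embed_size)
          self.ln1 = nn.LayerNorm(embed_size)
          self.ln2 = nn.LayerNorm(embed_size)
          self.drop = nn.Dropout(dropout)

      def forward(self, x):
          x = x + self.drop(self.sa(self.ln1(x)))    # residual around attention
          x = x + self.drop(self.ffwd(self.ln2(x)))  # residual around feed-forward
          return x
  ```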
- ✅ Replace the char-level tokenizer with tiktoken ('gpt2').

  GPU: M1 Pro 10-core, iterations: 5,000, batch_size: 64, block_size: 256, embed_size: 384, num_heads: 6, num_blocks: 6, dropout: 0.2

  | Metric | Value |
  | --- | --- |
  | Train Loss | 0.128 |
  | Val Loss | 7.09 |

  The model now overfits, as the training data is too small. With the new tokenizer the model has a vocabulary of 50k+ tokens, which increases training time by 4x (~4 it/s -> ~1 it/s on an M1 Pro 10-core). The generated text is now much more coherent and readable.
  Generated text:

  "Then," he said, "it is impossible in a contrary, your cannot be easy to the weight being about. We must put utterly at last observation to the end of this gallery." "My dear uncle," I ventured mildly to his answer. "Let the way to the old--of no means a minute or of the sentence as he did not care answer. The fartherfied forth in the high seas of the volcano. I looked around. The excellent Professor, and did not speak English with fancy a most despairing form a dull rocks. His telescope began to uncle, which his great deal of supper, appeared to be a wide thinking of steed--one that we were to discovered surrounding us on all sides point. TheHaving got over this occasion, I sought for it my head simply eating made from his making the circumstances. Our stock of my uncle partly confounded towards Hans. The Icelander gently pressed our departure, and the guide, I began to feel a powerful arms. My uncle made no longer moved myface ready. I began to think or not.