# Train a language model from scratch

txtai has a robust training pipeline that can fine-tune large language models (LLMs) for downstream tasks such as labeling text. txtai also has the ability to train language models from scratch.

The vast majority of the time, fine-tuning an LLM yields the best results. But when making significant changes to the structure of a model, training from scratch is often required.

Examples of significant changes are:

- Changing the vocabulary size
- Changing the number of hidden dimensions
- Changing the number of attention heads or layers
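
To see why these changes rule out reusing pretrained weights, one rough way to look at it is to compare the weight shapes that each configuration implies. The helper `weight_shapes` below is hypothetical and written only for illustration. The first set of values is the published `bert-base-uncased` configuration (vocabulary of 30,522, hidden size of 768, 12 attention heads), and the second set is the micromodel configuration used later in this notebook.

```python
def weight_shapes(vocab_size, hidden_size, num_attention_heads, max_positions=512):
    """Hypothetical helper: shapes of a few BERT weights for a given configuration."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")

    return {
        "word_embeddings": (vocab_size, hidden_size),
        "position_embeddings": (max_positions, hidden_size),
        "attention_query": (hidden_size, hidden_size),
        "head_size": hidden_size // num_attention_heads,
    }

pretrained = weight_shapes(30522, 768, 12)
micro = weight_shapes(500, 50, 2)

# Collect every entry that differs between the two configurations
mismatched = [name for name in pretrained if pretrained[name] != micro[name]]
print(mismatched)
```

When the shapes differ like this, the pretrained tensors cannot be loaded into the new model, which is why the weights have to be learned from scratch.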

This notebook will show how to build a new tokenizer and train a small language model (known as a micromodel) from scratch.


# Install dependencies

Install `txtai` and all dependencies.

In [1]:
%%capture
!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline-train] datasets sentence-transformers onnxruntime onnx

# Load dataset

This example will use the `ag_news` dataset, which is a collection of news article headlines.

In [2]:
from datasets import load_dataset

dataset = load_dataset("ag_news", split="train")

# Train the tokenizer

The first step is to train the tokenizer. We could use an existing tokenizer, but in this case we want a smaller vocabulary.


In [3]:
from transformers import AutoTokenizer

def stream(batch=10000):
    for x in range(0, len(dataset), batch):
        yield dataset[x: x + batch]["text"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = tokenizer.train_new_from_iterator(stream(), vocab_size=500, length=len(dataset))
tokenizer.model_max_length = 512

tokenizer.save_pretrained("bert")



('bert\\tokenizer_config.json',
 'bert\\special_tokens_map.json',
 'bert\\vocab.txt',
 'bert\\added_tokens.json',
 'bert\\tokenizer.json')

Let's test the tokenizer.

In [4]:
print(tokenizer.tokenize("Red Sox defeat Yankees 5-3"))

['re', '##d', 'so', '##x', 'de', '##f', '##e', '##at', 'y', '##ank', '##e', '##es', '5', '-', '3']


With a limited vocabulary size of 500, most words require multiple tokens. This limited vocabulary lowers the number of token representations the model needs to learn.
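
The splitting shown above can be sketched with a greedy longest-match-first lookup, which is the general idea behind WordPiece tokenization. This is a simplified illustration, not the tokenizer's actual implementation: the function `wordpiece` is hypothetical, and the toy vocabulary is an assumption that contains only the pieces seen in the output above (the trained tokenizer has 500 entries).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None

        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1

        # No piece matched, the whole word maps to the unknown token
        if piece is None:
            return [unk]

        tokens.append(piece)
        start = end

    return tokens

# Toy vocabulary (assumption): only the pieces from the example output
vocab = {"re", "##d", "so", "##x", "de", "##f", "##e", "##at", "y", "##ank", "##es"}

for word in ["red", "sox", "defeat", "yankees"]:
    print(word, wordpiece(word, vocab))
```

With a larger vocabulary, longer pieces such as whole words would match first and fewer tokens per word would be needed.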

# Train the language model

Now it's time to train the model. We'll train a micromodel, which is an extremely small language model with a limited vocabulary. Micromodels, when paired with a limited vocabulary, have the potential to work in limited compute environments like edge devices and microcontrollers.

In [5]:
!pip install --upgrade protobuf

Collecting protobuf
  Using cached protobuf-5.28.2-cp310-abi3-win_amd64.whl.metadata (592 bytes)
Using cached protobuf-5.28.2-cp310-abi3-win_amd64.whl (431 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.2
    Uninstalling protobuf-3.20.2:
      Successfully uninstalled protobuf-3.20.2
Successfully installed protobuf-5.28.2


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
onnxconverter-common 1.14.0 requires protobuf==3.20.2, but you have protobuf 5.28.2 which is incompatible.
streamlit 1.32.0 requires protobuf<5,>=3.20, but you have protobuf 5.28.2 which is incompatible.


In [6]:
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

from txtai.pipeline import HFTrainer

config = BertConfig(
    vocab_size = 500,
    hidden_size = 50,
    num_hidden_layers = 2,
    num_attention_heads = 2,
    intermediate_size = 100,
)

model = BertForMaskedLM(config)
model.save_pretrained("bert")
tokenizer = AutoTokenizer.from_pretrained("bert")

train = HFTrainer()

# Train model
train((model, tokenizer), dataset, task="language-modeling", output_dir="bert",
      fp16=True, per_device_train_batch_size=128, num_train_epochs=10,
      dataloader_num_workers=2)



{'loss': 6.2055, 'grad_norm': 0.06600596010684967, 'learning_rate': 3.704663212435233e-05, 'epoch': 2.59}
{'loss': 6.191, 'grad_norm': 0.06479847431182861, 'learning_rate': 2.4093264248704665e-05, 'epoch': 5.18}
{'loss': 6.1816, 'grad_norm': 0.06460917741060257, 'learning_rate': 1.1139896373056995e-05, 'epoch': 7.77}
{'train_runtime': 403.2445, 'train_samples_per_second': 611.713, 'train_steps_per_second': 4.786, 'train_loss': 6.189271853748381, 'epoch': 10.0}
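
The numbers in this log can be sanity-checked with a little arithmetic. The sketch below assumes the Hugging Face Trainer defaults of a 5e-5 peak learning rate with linear decay to zero and no warmup, which is consistent with the learning rates logged every 500 steps. It also compares the final loss with the cross-entropy of a model that guesses uniformly over the 500-token vocabulary. The function name `linear_lr` is hypothetical.

```python
import math

# Values read from the training log
total_steps = 1930
vocab_size = 500
final_loss = 6.189271853748381

# Assumption: default peak learning rate, linear decay, no warmup
peak_lr = 5e-5

def linear_lr(step):
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    return peak_lr * (total_steps - step) / total_steps

for step in (500, 1000, 1500):
    print(step, linear_lr(step))

# Cross-entropy of a uniform guess over the vocabulary, as a reference point
uniform_loss = math.log(vocab_size)
print(f"uniform baseline: {uniform_loss:.4f}, final loss: {final_loss:.4f}")
print(f"perplexity: {math.exp(final_loss):.1f}")
```

A final loss that sits close to the uniform baseline is a hint that the model would benefit from longer training or more data.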


(BertForMaskedLM(
   (bert): BertModel(
     (embeddings): BertEmbeddings(
       (word_embeddings): Embedding(500, 50, padding_idx=0)
       (position_embeddings): Embedding(512, 50)
       (token_type_embeddings): Embedding(2, 50)
       (LayerNorm): LayerNorm((50,), eps=1e-12, elementwise_affine=True)
       (dropout): Dropout(p=0.1, inplace=False)
     )
     (encoder): BertEncoder(
       (layer): ModuleList(
         (0-1): 2 x BertLayer(
           (attention): BertAttention(
             (self): BertSdpaSelfAttention(
               (query): Linear(in_features=50, out_features=50, bias=True)
               (key): Linear(in_features=50, out_features=50, bias=True)
               (value): Linear(in_features=50, out_features=50, bias=True)
               (dropout): Dropout(p=0.1, inplace=False)
             )
             (output): BertSelfOutput(
               (dense): Linear(in_features=50, out_features=50, bias=True)
               (LayerNorm): LayerNorm((50,), eps=1e-12, elem

# Sentence embeddings

Next let's take the language model and fine-tune it to build sentence embeddings.

In [16]:
%%capture
!python training_nli_v2.py bert
!mv output/* bert-nli
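
For context, sentence embedding models commonly turn per-token vectors into a single sentence vector by averaging them while skipping padding positions (mean pooling). The sketch below illustrates that idea in plain Python. It assumes mean pooling is the strategy in use, and the function `mean_pool` and the sample values are made up for illustration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention mask value is 1."""
    dimensions = len(token_vectors[0])
    total = [0.0] * dimensions
    count = 0

    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value

    # Guard against an all-padding input
    return [value / max(count, 1) for value in total]

# Two real tokens followed by one padding token (hypothetical values)
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```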

# Embeddings search

Now we'll build a txtai embeddings index using the fine-tuned model. We'll index the `ag_news` dataset.

In [14]:
from txtai.embeddings import Embeddings

# Get list of all text
texts = dataset["text"]

embeddings = Embeddings({"path": "bert-nli", "content": True})
embeddings.index((x, text, None) for x, text in enumerate(texts))

OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory bert-nli.

Let's run a search and see how much the model has learned.

In [12]:
embeddings.search("Boston Red Sox Cardinals World Series")


Not too bad. It's far from perfect but we can tell that it has some knowledge! This model was trained for only a few minutes, so there is certainly room for improvement by training longer and/or with a larger dataset.

The standard `bert-base-uncased` model has 110M parameters and is around 440MB. Let's see how many parameters this model has.

In [11]:
# Show number of parameters
parameters = sum(p.numel() for p in embeddings.model.model.parameters())
print(f"Number of parameters:\t\t{parameters:,}")
print(f"% of bert-base-uncased\t\t{(parameters / 110000000) * 100:.2f}%")

Number of parameters:		94,450
% of bert-base-uncased		0.09%
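
The parameter count can also be worked out by hand from the configuration. The sketch below assumes the counted model is the BERT encoder plus its pooler layer (without the masked language modeling head), with 512 positions and 2 token types. Under that assumption it reproduces the number printed above. The function `bert_parameters` is a hypothetical helper.

```python
def bert_parameters(vocab_size, hidden, layers, intermediate, max_positions=512, type_vocab=2):
    """Parameter count of a BERT encoder with pooler, derived from its configuration."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weights + bias

    layer_norm = 2 * hidden  # scale + shift

    # Word, position and token type embeddings followed by a LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections followed by a LayerNorm
    attention = 4 * linear(hidden, hidden) + layer_norm

    # Two feed-forward projections followed by a LayerNorm
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm

    pooler = linear(hidden, hidden)

    return embeddings + layers * (attention + feed_forward) + pooler

micro = bert_parameters(500, 50, 2, 100)
base = bert_parameters(30522, 768, 12, 3072)

print(f"micromodel: {micro:,}")
print(f"bert-base-uncased: {base:,}")
```

More than half of the micromodel's parameters sit in the embedding tables, which is why the vocabulary size matters so much at this scale.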


In [1]:
!ls -lh bert/pytorch_model.bin

'ls' is not recognized as an internal or external command,
operable program or batch file.


This model is 386KB and has only about 0.1% of the parameters of `bert-base-uncased`. With proper vocabulary selection, a small language model has potential.
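
Since `ls` isn't available in every environment, the file size can also be read with Python's standard library, and the result can be compared with a rough estimate of 4 bytes per float32 parameter. The helpers `estimate_size_kb` and `file_size_kb` below are hypothetical. The example measures a throwaway temporary file rather than the real model weights.

```python
import os
import tempfile

def estimate_size_kb(parameters, bytes_per_parameter=4):
    """Rough size of the weights alone in kilobytes (1 KB = 1000 bytes)."""
    return parameters * bytes_per_parameter / 1000

def file_size_kb(path):
    """Size of a file on disk in kilobytes, without shelling out to ls."""
    return os.path.getsize(path) / 1000

# 94,450 float32 parameters, as counted earlier
print(f"float32 estimate: {estimate_size_kb(94450):.0f}KB")

# Demonstrate on a temporary file; point this at the saved weights in practice
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\0" * 2000)

print(f"temporary file: {file_size_kb(handle.name):.1f}KB")
os.remove(handle.name)
```

The estimate lands close to the reported file size, with the difference presumably coming from file format overhead.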

# Quantization

If 386KB isn't small enough, we can quantize the model to get it down even further.

In [10]:
from txtai.pipeline import HFOnnx

onnx = HFOnnx()
onnx("bert-nli", task="pooling", output="bert-nli.onnx", quantize=True)

OSError: bert-nli does not appear to have a file named config.json. Checkout 'https://huggingface.co/bert-nli/tree/None' for available files.

In [11]:
embeddings = Embeddings({"path": "bert-nli.onnx", "tokenizer": "bert-nli", "content": True})
embeddings.index((x, text, None) for x, text in enumerate(texts))
embeddings.search("Boston Red Sox Cardinals World Series")

RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-66fbbf2a-217a8a154964825c4b9ed51e;65d918b8-af63-40ee-8dcb-b9eadb4ab9c0)

Repository Not Found for url: https://huggingface.co/bert-nli.onnx/resolve/main/1_Pooling/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

In [None]:
!ls -lh bert-nli.onnx

We're down to 187KB with a quantized model!
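
The space savings come from storing weights as 8-bit integers rather than 32-bit floats. The sketch below shows the general idea of linear quantization with a shared scale and zero point. It is an illustration only and not necessarily how `HFOnnx` performs quantization internally. The functions `quantize` and `dequantize` and the sample weights are hypothetical.

```python
def quantize(values, bits=8):
    """Map floats to unsigned integers using a shared scale and zero point."""
    levels = 2 ** bits - 1
    low, high = min(values), max(values)

    scale = (high - low) / levels
    if scale == 0:
        scale = 1.0  # constant input, any scale works

    zero_point = round(-low / scale)

    # Round to the nearest level and clamp to the valid integer range
    quantized = [min(levels, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the integers."""
    return [(q - zero_point) * scale for q in quantized]

weights = [-0.5, -0.1, 0.0, 0.2, 0.5]
quantized, scale, zero_point = quantize(weights)
restored = dequantize(quantized, scale, zero_point)

print(quantized)
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error bounded by the scale.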


# Train on BERT dataset

The [BERT paper](https://arxiv.org/abs/1810.04805) has all the information regarding training parameters and datasets used. Hugging Face Datasets hosts the `bookcorpus` and `wikipedia` datasets.

Training on a dataset of this size is out of scope for this notebook, but example code showing how to build the BERT dataset is below.

```python
from datasets import concatenate_datasets, load_dataset

bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])
dataset = concatenate_datasets([bookcorpus, wiki])
```

Then the same steps to train the tokenizer and model can be run. The dataset is 25GB compressed, so it will take some space and time to process!

# Wrapping up

This notebook covered how to build micromodels from scratch with txtai. Micromodels can be fully rebuilt in hours using the most up-to-date knowledge available. If properly constructed, prepared and trained, micromodels have the potential to be a viable choice for limited resource environments. They can also help when realtime response is more important than having the highest accuracy scores.

It's our hope that further research and exploration into micromodels leads to productive and useful models.

In [6]:
from datasets import concatenate_datasets
from datasets import load_dataset

In [11]:
import fsspec

# Increase timeout in fsspec
fs = fsspec.filesystem('http', timeout=6000)

In [2]:
import datasets

# The datasets cache location is exposed as HF_DATASETS_CACHE (there is no CACHE_DIR attribute)
print(datasets.config.HF_DATASETS_CACHE)

In [12]:
bookcorpus = load_dataset("bookcorpus", split="train", trust_remote_code=True)
wiki = load_dataset("wikipedia", "20220301.en", split="train", trust_remote_code=True)
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])
dataset = concatenate_datasets([bookcorpus, wiki])


In [13]:
from concurrent.futures import ThreadPoolExecutor, as_completed

from tokenizers import Tokenizer, decoders, models, normalizers, processors
from tokenizers.normalizers import NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Use ThreadPoolExecutor for parallel data streaming
def stream(batch_size=10000):
    def fetch_batch(x):
        return dataset[x: x + batch_size]["text"]
    
    # Parallel streaming using threads
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(fetch_batch, x) for x in range(0, len(dataset), batch_size)]
        for future in as_completed(futures):
            yield future.result()

# Initialize a WordPiece tokenizer
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# Set normalization and pre-tokenization rules
tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()

# Configure the WordPiece trainer with a smaller batch size
trainer = WordPieceTrainer(
    vocab_size=500, 
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

# Train tokenizer on the dataset stream with faster parallel execution
tokenizer.train_from_iterator(stream(batch_size=5000), trainer=trainer, length=len(dataset))  # Smaller batch size

# Post-processing: wrap each sequence with the [CLS] and [SEP] special tokens
tokenizer.post_processor = processors.BertProcessing(
    ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ("[CLS]", tokenizer.token_to_id("[CLS]")),
)
tokenizer.decoder = decoders.WordPiece()

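# Save tokenizer.json inside the bert-prolang directory (Tokenizer.save expects a file path)
import os

os.makedirs("bert-prolang", exist_ok=True)
tokenizer.save("bert-prolang/tokenizer.json")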
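
One thing to keep in mind with the threaded `stream` above is that it submits every batch up front and yields results in completion order, so many batches can be held in memory at once. A possible alternative, sketched below, keeps only a small window of batches in flight and yields them in their original order. The names `stream_bounded`, `fetch` and `window` are hypothetical, and a plain list stands in for the dataset.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def stream_bounded(fetch, starts, window=4, workers=4):
    """Yield fetch(start) for each start, in order, with at most `window` batches in flight."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        pending = deque()

        for start in starts:
            pending.append(executor.submit(fetch, start))

            # Once the window is full, wait for the oldest batch before submitting more
            if len(pending) >= window:
                yield pending.popleft().result()

        # Drain the remaining batches
        while pending:
            yield pending.popleft().result()

# Stand-in for the dataset (assumption): a list of 25 text rows
rows = [f"row {i}" for i in range(25)]

def fetch(start, batch_size=10):
    return rows[start: start + batch_size]

batches = list(stream_bounded(fetch, range(0, len(rows), 10), window=2))
print([len(batch) for batch in batches])
```

With the real dataset, `fetch` would slice `dataset[x: x + batch_size]["text"]` as in the cell above.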


In [5]:
from transformers import BertConfig, BertForMaskedLM
from txtai.pipeline import HFTrainer
from tokenizers import Tokenizer
from transformers import BertTokenizerFast

# Load the custom tokenizer from "bert-prolang"
tokenizer = Tokenizer.from_file("bert-prolang/tokenizer.json")

# Create a BertTokenizerFast using the saved tokenizer
tokenizer_fast = BertTokenizerFast(tokenizer_object=tokenizer, model_max_length=512)

# Define the model configuration
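config = BertConfig(
    vocab_size=500,
    hidden_size=50,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=100,
    # Match the custom tokenizer, where [PAD] is not at the default id of 0
    pad_token_id=tokenizer_fast.pad_token_id,
)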

# Initialize the model
model = BertForMaskedLM(config)
model.save_pretrained("bert-prolang")  # Save the model to the same directory as the tokenizer

# Use the custom fast tokenizer for training
train = HFTrainer()

# Train the model using the custom tokenizer and dataset
train(
    (model, tokenizer_fast), 
    dataset, 
    task="language-modeling", 
    output_dir="bert-prolang",  # Specify the output directory
    fp16=True, 
    per_device_train_batch_size=128, 
    num_train_epochs=10,
    dataloader_num_workers=2
)

