# 3. Create a new tokenizer from scratch for the Mutopia Guitar Dataset

**This notebook is based on the Hugging Face course - [Chapter 6: Building a tokenizer, block by block](https://huggingface.co/course/chapter6/8?fw=tf)**

Now that you have the [Mutopia guitar dataset](https://huggingface.co/datasets/juancopi81/mutopia_guitar_dataset) converted into text using the representation proposed in the paper [MMM: Exploring Conditional Multi-Track Music Generation with the Transformer](https://arxiv.org/abs/2008.06048), you can continue working on your causal language model. Remember that you'll be training a GPT-2 model from scratch in these notebooks.

The next step is to train a new tokenizer. This is important because, even though the text representation of your music pieces uses English words, your corpus is very different from the text that the original GPT-2 tokenizer and model were trained on.

Let's start by installing the libraries you need to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!apt install git-lfs


You will need to set up git. Replace the email and name in the following cell with your own.

In [None]:
!git config --global user.email "juancopi_81@hotmail.com"
!git config --global user.name "Juan Carlos Piñeros"

To push your new tokenizer to the hub, you need to log in to Hugging Face.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


## 3.1 Preparing the data - the Mutopia Guitar Dataset

### Download the data

To download and cache the [Mutopia guitar dataset](https://huggingface.co/datasets/juancopi81/mutopia_guitar_dataset), you'll use the `load_dataset` function from the `datasets` library.

In [None]:
from datasets import load_dataset, concatenate_datasets

# Change the path passed to load_dataset to use your own dataset
raw_datasets = load_dataset("juancopi81/mutopia_guitar_dataset")



Downloading and preparing dataset text/juancopi81--mutopia_guitar_dataset to /root/.cache/huggingface/datasets/juancopi81___text/juancopi81--mutopia_guitar_dataset-65227e04c08f0443/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad...



Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/juancopi81___text/juancopi81--mutopia_guitar_dataset-65227e04c08f0443/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad. Subsequent calls will reuse this data.



You can inspect `raw_datasets` to see how many rows each split has and the names of its columns.

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 7325
    })
    test: Dataset({
        features: ['text'],
        num_rows: 74
    })
})

### Explore the data

Let's now look at a random sample of the training dataset. You can run the cell many times to see the text representation of different guitar pieces.

In [None]:
import random

# randrange excludes the upper bound, so the index is always valid
sample_num = random.randrange(len(raw_datasets["train"]))
sample = raw_datasets["train"][sample_num]["text"]
print("The sample " + str(sample_num) + " contains the following text: ")
print(sample)

The sample 609 contains the following text: 
PIECE_START TIME_SIGNATURE=12_8 BPM=90 TRACK_START INST=0 DENSITY=0 BAR_START NOTE_ON=52 TIME_DELTA=2.0 NOTE_ON=55 TIME_DELTA=1.8916666666666666 NOTE_OFF=55 TIME_DELTA=0.10833333333333339 NOTE_ON=59 TIME_DELTA=1.8916666666666666 NOTE_OFF=59 TIME_DELTA=0.10833333333333339 NOTE_ON=64 TIME_DELTA=1.8916666666666666 NOTE_OFF=64 TIME_DELTA=0.10833333333333339 NOTE_ON=59 TIME_DELTA=1.8916666666666657 NOTE_OFF=59 TIME_DELTA=0.10833333333333428 NOTE_ON=55 TIME_DELTA=1.3916666666666675 NOTE_OFF=52 TIME_DELTA=0.4999999999999982 NOTE_OFF=55 TIME_DELTA=0.10833333333333428 NOTE_ON=52 TIME_DELTA=2.0 NOTE_ON=55 TIME_DELTA=1.8916666666666657 NOTE_OFF=55 TIME_DELTA=0.10833333333333428 NOTE_ON=59 TIME_DELTA=1.6916666666666664 NOTE_OFF=52 TIME_DELTA=0.1999999999999993 NOTE_OFF=59 TIME_DELTA=0.10833333333333428 NOTE_ON=47 TIME_DELTA=2.0 NOTE_ON=55 TIME_DELTA=1.8916666666666657 NOTE_OFF=55 TIME_DELTA=0.10833333333333428 NOTE_ON=59 TIME_DELTA=1.6916666666666664 NO

Note that although the representation uses English words, it is a very specialized language (music!) with its own vocabulary. That's why you need to train a new tokenizer.
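To make the structure of this representation more concrete, here is a minimal sketch that splits a piece into `(name, value)` events. The helper `parse_events` is hypothetical and not part of the dataset or of any library; it only assumes that events are separated by spaces and that a value, when present, follows an `=` sign, as in the sample above.

```python
def parse_events(text):
    """Split a piece into (name, value) pairs; value is None for bare events."""
    events = []
    for token in text.split():
        # partition returns ("NAME", "=", "VALUE") or ("NAME", "", "") if there is no "="
        name, _, value = token.partition("=")
        events.append((name, value or None))
    return events


events = parse_events("PIECE_START TIME_SIGNATURE=12_8 BPM=90 NOTE_ON=52 TIME_DELTA=2.0")
for name, value in events:
    print(name, value)
```

Each event is either a marker such as `PIECE_START` or a name with a value such as `NOTE_ON=52`, which is quite different from ordinary English sentences.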

### Create an iterator for your data

Now you need to create an iterator from your dataset. Doing this allows the tokenizer to run faster (using batches instead of individual samples) and avoids loading everything into memory at once.

In [None]:
# Let's merge the train and test splits into one dataset to train our tokenizer
# on all the available data
datasets = concatenate_datasets([raw_datasets["train"], raw_datasets["test"]])
datasets

Dataset({
    features: ['text'],
    num_rows: 7399
})

In [None]:
def get_training_corpus():
    for i in range(0, len(datasets), 100):
        yield datasets[i : i + 100]["text"]
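To see what this generator produces, here is a small standalone sketch of the same batching pattern applied to a plain Python list. The names `batched_texts` and `toy_corpus` are made up for illustration; the real function above slices a 🤗 `Dataset` instead of a list.

```python
from typing import Iterator, List


def batched_texts(texts: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


# A toy corpus of 250 short "pieces" (hypothetical data)
toy_corpus = [f"PIECE_START NOTE_ON={n}" for n in range(250)]

batch_sizes = [len(batch) for batch in batched_texts(toy_corpus)]
print(batch_sizes)  # [100, 100, 50]
```

The last batch is simply shorter when the corpus size is not a multiple of the batch size. Keep in mind that a generator can only be consumed once, which is why `get_training_corpus()` is called again each time a fresh iterator is needed.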

## 3.2 Building a new tokenizer (WordLevel) for the Mutopia Guitar Dataset

You already have your Mutopia Guitar Dataset as a series of tokens. The encoding presented in the paper ["MMM : Exploring Conditional Multi-Track Music Generation with the Transformer"](https://arxiv.org/abs/2008.06048) was designed to avoid a prohibitively large token vocabulary.

Because each event is already a meaningful unit, there is no need to split it into subwords. So, to build your tokenizer, you will use the Word-Level algorithm as your tokenization model.

Let's start by importing the necessary libraries.



In [None]:
from tokenizers import (
    decoders,
    models,
    pre_tokenizers,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))

As explained in the [🤗 course](https://huggingface.co/course/chapter6/8?fw=tf): "We have to specify the `unk_token` so the model knows what to return when it encounters characters it hasn’t seen before."

You would now typically implement a normalization process (for instance, by removing accents, using only lowercase, etc.). Still, the Mutopia Guitar Dataset does not need this, so you can skip this step and go directly to the pre-tokenization step.

For this, you will use the whitespace pre-tokenizer, which splits on whitespace and punctuation:

In [None]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

Let's check our progress.

In [None]:
tokenizer.pre_tokenizer.pre_tokenize_str(sample)

[('PIECE_START', (0, 11)),
 ('TIME_SIGNATURE', (12, 26)),
 ('=', (26, 27)),
 ('12_8', (27, 31)),
 ('BPM', (32, 35)),
 ('=', (35, 36)),
 ('90', (36, 38)),
 ('TRACK_START', (39, 50)),
 ('INST', (51, 55)),
 ('=', (55, 56)),
 ('0', (56, 57)),
 ('DENSITY', (58, 65)),
 ('=', (65, 66)),
 ('0', (66, 67)),
 ('BAR_START', (68, 77)),
 ('NOTE_ON', (78, 85)),
 ('=', (85, 86)),
 ('52', (86, 88)),
 ('TIME_DELTA', (89, 99)),
 ('=', (99, 100)),
 ('2', (100, 101)),
 ('.', (101, 102)),
 ('0', (102, 103)),
 ('NOTE_ON', (104, 111)),
 ('=', (111, 112)),
 ('55', (112, 114)),
 ('TIME_DELTA', (115, 125)),
 ('=', (125, 126)),
 ('1', (126, 127)),
 ('.', (127, 128)),
 ('8916666666666666', (128, 144)),
 ('NOTE_OFF', (145, 153)),
 ('=', (153, 154)),
 ('55', (154, 156)),
 ('TIME_DELTA', (157, 167)),
 ('=', (167, 168)),
 ('0', (168, 169)),
 ('.', (169, 170)),
 ('10833333333333339', (170, 187)),
 ('NOTE_ON', (188, 195)),
 ('=', (195, 196)),
 ('59', (196, 198)),
 ('TIME_DELTA', (199, 209)),
 ('=', (209, 210)),
 ('1', (
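The split shown above can be approximated with a regular expression. The sketch below uses the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation) with Python's `re` module. It is only an illustration of the behaviour, not the library's implementation, and the helper name `pre_tokenize` is hypothetical.

```python
import re

# Runs of word characters (letters, digits, underscore) OR runs of punctuation
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Return (token, (start, end)) pairs, similar in shape to pre_tokenize_str."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


pieces = pre_tokenize("PIECE_START TIME_SIGNATURE=12_8 BPM=90")
for piece in pieces:
    print(piece)
```

Because the underscore counts as a word character, `PIECE_START` and `12_8` stay whole, while `=` and `.` are split off as separate tokens.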

Finally, you can train the tokenizer. As explained in the [🤗 course](https://huggingface.co/course/chapter6/8?fw=tf):

"The next step in the tokenization pipeline is running the inputs through the model. We already specified our model in the initialization, but we still need to train it, which will require a WordPieceTrainer. The main thing to remember when instantiating a trainer in 🤗 Tokenizers is that you need to pass it all the special tokens you intend to use — otherwise it won’t add them to the vocabulary, since they are not in the training corpus:"

The difference here is that you are using the `WordLevelTrainer`, not the `WordPieceTrainer`, because your tokenization model is `WordLevel`. The `vocab_size` argument is optional, so you can leave it at the trainer's default.

You will train a GPT-2 model. For GPT-2, the only special token is the end-of-text token:

In [None]:
special_tokens = ["<|endoftext|>"]
trainer = trainers.WordLevelTrainer(special_tokens=special_tokens)

You are now ready to start training your tokenizer. For this, you will use the iterator defined earlier.

In [None]:
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
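Conceptually, word-level training amounts to counting the pre-tokenized words and giving each one an ID. The following toy version in plain Python illustrates the idea; it is a sketch, not the 🤗 Tokenizers implementation. The functions `train_word_level` and `encode` are hypothetical, the tie-breaking order is an arbitrary choice, and registering `[UNK]` as a special token is an assumption of this sketch so that the fallback always has an ID.

```python
from collections import Counter


def train_word_level(corpus, special_tokens):
    """Build a word -> id mapping: special tokens first, then words by frequency."""
    counts = Counter(word for text in corpus for word in text.split())
    vocab = {token: idx for idx, token in enumerate(special_tokens)}
    # Most frequent words first; ties broken alphabetically (a choice of this sketch)
    for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, unk_token="[UNK]"):
    """Map each whitespace-separated word to its id, falling back to the unknown token."""
    return [vocab.get(word, vocab[unk_token]) for word in text.split()]


toy_corpus = ["NOTE_ON=52 NOTE_OFF=52", "NOTE_ON=52 NOTE_ON=55"]
vocab = train_word_level(toy_corpus, special_tokens=["<|endoftext|>", "[UNK]"])
ids = encode("NOTE_ON=52 NOTE_ON=99", vocab)

print(vocab)
print(ids)  # [2, 1]
```

Here `NOTE_ON=99` never appears in the toy corpus, so it is mapped to the ID of `[UNK]`.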

Let's see how your tokenizer is working now:

In [None]:
encoding = tokenizer.encode(sample)
print(encoding.tokens)

['PIECE_START', 'TIME_SIGNATURE', '=', '12_8', 'BPM', '=', '90', 'TRACK_START', 'INST', '=', '0', 'DENSITY', '=', '0', 'BAR_START', 'NOTE_ON', '=', '52', 'TIME_DELTA', '=', '2', '.', '0', 'NOTE_ON', '=', '55', 'TIME_DELTA', '=', '1', '.', '8916666666666666', 'NOTE_OFF', '=', '55', 'TIME_DELTA', '=', '0', '.', '10833333333333339', 'NOTE_ON', '=', '59', 'TIME_DELTA', '=', '1', '.', '8916666666666666', 'NOTE_OFF', '=', '59', 'TIME_DELTA', '=', '0', '.', '10833333333333339', 'NOTE_ON', '=', '64', 'TIME_DELTA', '=', '1', '.', '8916666666666666', 'NOTE_OFF', '=', '64', 'TIME_DELTA', '=', '0', '.', '10833333333333339', 'NOTE_ON', '=', '59', 'TIME_DELTA', '=', '1', '.', '8916666666666657', 'NOTE_OFF', '=', '59', 'TIME_DELTA', '=', '0', '.', '10833333333333428', 'NOTE_ON', '=', '55', 'TIME_DELTA', '=', '1', '.', '3916666666666675', 'NOTE_OFF', '=', '52', 'TIME_DELTA', '=', '0', '.', '4999999999999982', 'NOTE_OFF', '=', '55', 'TIME_DELTA', '=', '0', '.', '10833333333333428', 'NOTE_ON', '=', '52'

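Finally, let's try training the tokenizer directly from a text file instead of an iterator. This version also uses the `WhitespaceSplit` pre-tokenizer, which splits on whitespace only, so you can compare the resulting tokens.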

In [None]:
# Create txt file 
with open("mutopia-guitar-dataset.txt", "w", encoding="utf-8") as f:
    for i in range(len(datasets)):
        f.write(datasets[i]["text"] + "\n")

In [None]:
tokenizer_txt = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer_txt.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
trainer_txt = trainers.WordLevelTrainer(special_tokens=special_tokens)
# Use the same relative path the file was written to in the previous cell
tokenizer_txt.train(files=["mutopia-guitar-dataset.txt"], trainer=trainer_txt)

In [None]:
encoding = tokenizer_txt.encode(sample)
print(encoding.tokens)

['PIECE_START', 'TIME_SIGNATURE=12_8', 'BPM=90', 'TRACK_START', 'INST=0', 'DENSITY=0', 'BAR_START', 'NOTE_ON=52', 'TIME_DELTA=2.0', 'NOTE_ON=55', 'TIME_DELTA=1.8916666666666666', 'NOTE_OFF=55', 'TIME_DELTA=0.10833333333333339', 'NOTE_ON=59', 'TIME_DELTA=1.8916666666666666', 'NOTE_OFF=59', 'TIME_DELTA=0.10833333333333339', 'NOTE_ON=64', 'TIME_DELTA=1.8916666666666666', 'NOTE_OFF=64', 'TIME_DELTA=0.10833333333333339', 'NOTE_ON=59', 'TIME_DELTA=1.8916666666666657', 'NOTE_OFF=59', 'TIME_DELTA=0.10833333333333428', 'NOTE_ON=55', 'TIME_DELTA=1.3916666666666675', 'NOTE_OFF=52', 'TIME_DELTA=0.4999999999999982', 'NOTE_OFF=55', 'TIME_DELTA=0.10833333333333428', 'NOTE_ON=52', 'TIME_DELTA=2.0', 'NOTE_ON=55', 'TIME_DELTA=1.8916666666666657', 'NOTE_OFF=55', 'TIME_DELTA=0.10833333333333428', 'NOTE_ON=59', 'TIME_DELTA=1.6916666666666664', 'NOTE_OFF=52', 'TIME_DELTA=0.1999999999999993', 'NOTE_OFF=59', 'TIME_DELTA=0.10833333333333428', 'NOTE_ON=47', 'TIME_DELTA=2.0', 'NOTE_ON=55', 'TIME_DELTA=1.89166666

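This version produces better tokens, but the improvement comes from the `WhitespaceSplit` pre-tokenizer rather than from reading a text file: `WhitespaceSplit` splits on whitespace only, so each musical event stays whole. For example, `TIME_SIGNATURE=12_8` is now a single token, which is what you want, instead of the 3 tokens produced earlier (`TIME_SIGNATURE`, `=`, `12_8`).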

So you'll use this version to push it to the 🤗 hub.
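The effect of the two splitting strategies can be compared on a short string with plain Python. This is a rough sketch: `str.split()` stands in for splitting on whitespace only, the regular expression stands in for splitting on whitespace and punctuation, and the `sample_events` string is made up for illustration.

```python
import re

sample_events = (
    "NOTE_ON=52 TIME_DELTA=2.0 NOTE_OFF=52 TIME_DELTA=0.5 NOTE_ON=55 TIME_DELTA=2.0"
)

# Whitespace only: every musical event stays in one piece
whitespace_only = sample_events.split()

# Whitespace and punctuation: "=" and "." become separate tokens
whitespace_and_punct = re.findall(r"\w+|[^\w\s]+", sample_events)

print(len(whitespace_only), len(set(whitespace_only)))  # 6 5
print(len(whitespace_and_punct), len(set(whitespace_and_punct)))  # 24 10
```

Splitting on whitespace only makes the sequence four times shorter in this example, and each token corresponds to exactly one event. The trade-off is that every distinct value, such as each different `TIME_DELTA`, becomes its own vocabulary entry.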

You do not use any post-processing here because the tokens are ready to use as they are now. There is no need to add a template with the special tokens you want to use. So you can go ahead and just save the tokenizer.

## 3.3 Save and upload the Mutopia Guitar Dataset tokenizer

In [None]:
tokenizer_txt.save("mutopia_guitar_dataset_tokenizer")

As explained in the [🤗 course](https://huggingface.co/course/chapter6/8?fw=tf), to use this tokenizer in 🤗 Transformers, you have to wrap it in a PreTrainedTokenizerFast:

"*To wrap the tokenizer in a PreTrainedTokenizerFast, we can either pass the tokenizer we built as a tokenizer_object or pass the tokenizer file we saved as tokenizer_file. The key thing to remember is that we have to manually set all the special tokens, since that class can’t infer from the tokenizer object which token is the mask token, the [CLS] token, etc.:*"

In [None]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer_txt,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

You are now ready to push your tokenizer to the 🤗 hub:

"*You can then use this tokenizer like any other 🤗 Transformers tokenizer. You can save it with the save_pretrained() method, or upload it to the Hub with the push_to_hub() method.*"

In [None]:
wrapped_tokenizer.push_to_hub("mutopia_guitar_dataset_tokenizer", use_temp_dir=True)

Cloning https://huggingface.co/juancopi81/mutopia_guitar_dataset_tokenizer into local empty directory.
To https://huggingface.co/juancopi81/mutopia_guitar_dataset_tokenizer
   7fbf6f8..5bf268e  main -> main




'https://huggingface.co/juancopi81/mutopia_guitar_dataset_tokenizer/commit/5bf268ea43daff742db79857a624892893eb0185'

And that's it. You should now be able to see your tokenizer on the hub. You'll use this tokenizer to train your Music Generative GPT-2 model! 

**Congratulations on training and uploading your tokenizer for the Mutopia Guitar Dataset using** 🤗.