**This notebook is based on the Hugging Face course - [Chapter 7: Main NLP tasks, Training a causal language model from scratch.](https://huggingface.co/course/chapter7/6?fw=tf)**

Before moving on, let's recap what you've done so far:
*   First of all, you collected a guitar dataset from the Mutopia Project.
*   Then, you used [Dr. Tristan Behrens's](https://www.linkedin.com/in/dr-tristan-behrens-734967a2/) [implementation](https://github.com/AI-Guru/MMM-JSB) of the paper ["MMM: Exploring Conditional Multi-Track Music Generation with the Transformer"](https://arxiv.org/abs/2008.06048) to encode the MIDI files into text tokens.
*   Finally, you created a "fast" tokenizer with Hugging Face using this dataset.

Great job so far!

You are now ready to start training your music generation model. Instead of using a pre-trained model, you'll train your model from scratch because the data you collected differs from the pretraining data used for the available models. Usually, you would want to train a generative model with much more data, but the Mutopia Guitar Dataset is enough for learning purposes.

Let's now install the libraries you need to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!apt install git-lfs


You will need to set up git. Please adapt your email and name in the following cell.

In [None]:
# Change these values
!git config --global user.email "me@hotmail.com"
!git config --global user.name "My name"

To push your new model to the hub, you need to log in to Hugging Face.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


## 4.1 Gathering the data

### Download the data

To download and cache the [Mutopia guitar dataset](https://huggingface.co/datasets/juancopi81/mutopia_guitar_dataset), you'll use the `load_dataset` function from the `datasets` library.

In [None]:
from datasets import load_dataset

# You can change here the path of load_dataset to use your own dataset
raw_datasets = load_dataset("juancopi81/mutopia_guitar_dataset")




You can inspect `raw_datasets` to see how many rows each split has and the names of the columns.

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 7325
    })
    test: Dataset({
        features: ['text'],
        num_rows: 74
    })
})

By now, you should be familiar with the text lines of the datasets. Still, just as a refresher, let's print the first characters of a music piece. You can run this cell many times to see different encodings of the MIDI files.

In [None]:
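import random

# randrange excludes the upper bound, so the index is always valid
# (random.randint includes it and could go one past the last row)
sample_num = random.randrange(len(raw_datasets["train"]))
print(raw_datasets["train"][sample_num]["text"][:200])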

PIECE_START TIME_SIGNATURE=2_4 BPM=100 TRACK_START INST=0 DENSITY=2 BAR_START NOTE_ON=44 TIME_DELTA=1.0 NOTE_ON=48 TIME_DELTA=1.0 NOTE_OFF=48 NOTE_ON=51 TIME_DELTA=1.0 NOTE_OFF=51 NOTE_ON=56 TIME_DELT
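
To make the token format more concrete, here is a minimal sketch that splits such a line into (name, value) events and counts the event types. It uses the prefix printed above, trimmed to complete tokens; the helper `parse_token` is a hypothetical name introduced only for this illustration and is not part of the MMM code.

```python
from collections import Counter

# Prefix of the sample printed above, trimmed to complete tokens.
sample = (
    "PIECE_START TIME_SIGNATURE=2_4 BPM=100 TRACK_START INST=0 DENSITY=2 "
    "BAR_START NOTE_ON=44 TIME_DELTA=1.0 NOTE_ON=48 TIME_DELTA=1.0 "
    "NOTE_OFF=48 NOTE_ON=51 TIME_DELTA=1.0 NOTE_OFF=51 NOTE_ON=56"
)


def parse_token(token):
    """Split 'NAME=VALUE' into (name, value); bare tokens get value None."""
    name, sep, value = token.partition("=")
    return name, (value if sep else None)


events = [parse_token(tok) for tok in sample.split()]
event_counts = Counter(name for name, _ in events)

print(events[:3])
print(event_counts.most_common(3))
```

Each whitespace-separated word is one event, which is why a word-level tokenizer is a natural fit for this dataset.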


Excellent! You have your dataset loaded, and it is time to prepare your data for your model.

## 4.2 Preparing the dataset

You'll start preparing the data using the tokenizer you created in the last notebook. In the tokenizer, you'll also define the size of each sample you'll use to feed into the model (the context size).

There is a trade-off here: a longer context lets the model see more of each piece at once, but it needs more resources, so you'll have a larger GPU memory footprint. On the other hand, a small context size helps your model train faster and use less memory.

Some of the pieces are very short, and some of them are longer. You have to take this into account when defining the context size. Let's try something intermediate first (256), and see the results.

Run the next cell to see this process on the first two samples of the `test` split.

In [None]:
from transformers import AutoTokenizer

context_length = 256

# You can change the URL to use the tokenizer you trained
tokenizer = AutoTokenizer.from_pretrained("juancopi81/mutopia_guitar_dataset_tokenizer")

outputs = tokenizer(
    raw_datasets["test"][:2]["text"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")


Input IDs length: 5
Input chunk lengths: [256, 256, 206, 256, 202]
Chunk mapping: [0, 0, 0, 1, 1]


You can see five segments in total from those two examples. Looking at the chunk lengths, you can see that the last chunk of each document has fewer than 256 tokens (206 and 202, respectively). For this notebook, you'll use the same technique as the one explained in the 🤗 course and throw these shorter chunks away. With the `overflow_to_sample_mapping` field, you can reconstruct which chunks belonged to which input samples.
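
The chunking itself can be imitated in plain Python. The following is only a sketch of the idea, not the tokenizer's actual implementation: it assumes no special tokens are added and no overlap (stride) between chunks, and `chunk_with_mapping` is a hypothetical helper name. The two stand-in documents are sized so that they reproduce the lengths printed above.

```python
def chunk_with_mapping(samples, context_length):
    """Split each token list into chunks of at most context_length tokens."""
    chunks, mapping = [], []
    for sample_idx, token_ids in enumerate(samples):
        for start in range(0, len(token_ids), context_length):
            chunks.append(token_ids[start:start + context_length])
            mapping.append(sample_idx)  # remember which sample the chunk came from
    return chunks, mapping


# Stand-in documents: 256 + 256 + 206 = 718 tokens and 256 + 202 = 458 tokens.
samples = [list(range(718)), list(range(458))]
chunks, mapping = chunk_with_mapping(samples, context_length=256)

lengths = [len(chunk) for chunk in chunks]
kept = [chunk for chunk in chunks if len(chunk) == 256]  # drop the short tails

print(f"Chunk lengths: {lengths}")
print(f"Chunk mapping: {mapping}")
print(f"Chunks kept: {len(kept)}")
```

Only three of the five chunks survive the length filter, one short tail being dropped per document.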

You'll now apply the tokenizer to the whole dataset to create the `tokenized_datasets`. The explanation on the [🤗 course](https://huggingface.co/course/chapter7/6?fw=tf) about this step is excellent: "With this operation we're using a handy feature of the Dataset.map() function in 🤗 Datasets, which is that it does not require one-to-one maps; as we saw in section 3, we can create batches with more or fewer elements than the input batch. This is useful when doing operations like data augmentation or data filtering that change the number of elements. In our case, when tokenizing each element into chunks of the specified context size, we create many samples from each document. We just need to make sure to delete the existing columns, since they have a conflicting size. If we wanted to keep them, we could repeat them appropriately and return them within the Dataset.map() call."

There are two differences here from the code shown in the course:

*   A counter that tracks how many chunks are removed because they are shorter than the context size.
*   An `if` statement that removes the chunks that contain an unknown token. Removing these chunks is essential because you do not want to have anonymous notes when generating music. You should not expect any unknown tokens if you correctly followed the last notebook, where you trained the tokenizer.

In [None]:
def tokenize(element):
    removed_elements_counter = 0
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
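    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length != context_length:
            # Drop trailing chunks that are shorter than the context size
            removed_elements_counter += 1
        elif tokenizer.unk_token_id in input_ids:
            # Drop chunks that contain an unknown token (none are expected)
            continue
        else:
            input_batch.append(input_ids)
    print(f"Removed chunks with size less than context_size: {removed_elements_counter}")
    return {"input_ids": input_batch}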

tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)

tokenized_datasets

  0%|          | 0/8 [00:00<?, ?ba/s]

Removed chunks with size less than context_size: 1000
Removed chunks with size less than context_size: 1000
Removed chunks with size less than context_size: 1000
Removed chunks with size less than context_size: 1000
Removed chunks with size less than context_size: 1000
Removed chunks with size less than context_size: 1000
Removed chunks with size less than context_size: 1000
Removed chunks with size less than context_size: 325


  0%|          | 0/1 [00:00<?, ?ba/s]

Removed chunks with size less than context_size: 74


DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 26900
    })
    test: Dataset({
        features: ['input_ids'],
        num_rows: 241
    })
})

It worked fine! Remember that you trained the tokenizer on both the `train` and `test` datasets, so there should not be any unknown tokens.

You also confirmed that you removed one chunk for each line in the train and test datasets: 7,325 for the train set and 74 for the test set. **Great job!** You now have 26,900 training examples with 256 tokens each, corresponding to 6,886,400 tokens.
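
The arithmetic behind these numbers can be checked with a few lines of Python. This sketch only uses the row counts reported above; the upper bound on discarded tokens follows from the fact that every dropped chunk holds fewer than 256 tokens.

```python
context_length = 256
train_chunks = 26_900   # rows in the train split after tokenization
train_pieces = 7_325    # rows in the train split before tokenization

# Every kept chunk has exactly context_length tokens.
train_tokens = train_chunks * context_length
print(f"{train_tokens:,} training tokens")

# One short trailing chunk was dropped per piece, each with at most
# context_length - 1 tokens.
max_dropped = train_pieces * (context_length - 1)
print(f"At most {max_dropped:,} tokens discarded")

# Average number of full chunks contributed by each piece.
chunks_per_piece = train_chunks / train_pieces
print(f"{chunks_per_piece:.2f} chunks per piece on average")
```

If the discarded share looks too large for your own dataset, a smaller context size or concatenating pieces before chunking are two options to consider.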

Your dataset is ready; the next step is to set up your model!

## 4.3 Initializing a new model

You'll now initialize a new GPT-2 model using the pre-trained configuration of the small GPT-2 model. It is essential to:

*   Ensure that the model vocabulary size matches the tokenizer size.
*   Add the bos and eos (beginning and end of sequence) token IDs.

In [None]:
from transformers import AutoTokenizer, TFGPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

Downloading config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Let's initialize a new model from this configuration. Only the architecture comes from GPT-2; the weights are randomly initialized, not pre-trained.

In [None]:
model = TFGPT2LMHeadModel(config)
model(model.dummy_inputs)  # Builds the model
model.summary()

Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 86294016  
 r)                                                              
                                                                 
Total params: 86,294,016
Trainable params: 86,294,016
Non-trainable params: 0
_________________________________________________________________
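
To see where the parameter total comes from, here is a rough sketch that counts the weights of a GPT-2 style model from its dimensions. It assumes the standard small GPT-2 layout (12 blocks, hidden size 768, 1,024 positions, output layer tied to the token embedding); `gpt2_param_count` is a hypothetical helper. The vocabulary size of 588 is not stated in this notebook: it is the value that reproduces the total above under these assumptions, so check it against `len(tokenizer)`.

```python
def gpt2_param_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Count GPT-2 parameters, with the output layer tied to the token embedding."""
    # Token and position embedding tables
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Attention: fused query/key/value projection plus output projection
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward network: expand to 4 * n_embd, then project back
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two LayerNorms per block, each with a scale and a bias vector
    layer_norms = 2 * (2 * n_embd)
    block = attention + mlp + layer_norms
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * block + final_layer_norm


print(f"{gpt2_param_count(50_257):,}")  # original GPT-2 small vocabulary
print(f"{gpt2_param_count(588):,}")     # vocabulary size inferred from the summary
```

Almost all of the parameters sit in the transformer blocks, so shrinking the vocabulary only removes the embedding rows.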


Your model has 86M parameters to train. You need to set up a data collator to create the batches for training the model. Let's see what the [🤗 course](https://huggingface.co/course/chapter7/6?fw=tf) explains about this step:

"We can use the `DataCollatorForLanguageModeling` collator, which is designed specifically for language modeling (as the name subtly suggests). Besides stacking and padding batches, it also takes care of creating the language model labels — in causal language modeling the inputs serve as labels too (just shifted by one element), and this data collator creates them on the fly during training so we don’t need to duplicate the input_ids.

Note that `DataCollatorForLanguageModeling` supports both masked language modeling (MLM) and causal language modeling (CLM). By default it prepares data for MLM, but we can switch to CLM by setting the argument `mlm=False`:"

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False, return_tensors="tf")

In [None]:
# Example of the data collator
out = data_collator([tokenized_datasets["test"][i] for i in range(5)])

for key in out:
  print(f"{key} shape: {out[key].shape}")

input_ids shape: (5, 256)
attention_mask shape: (5, 256)
labels shape: (5, 256)
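
The relation between inputs and labels mentioned in the quote can be illustrated without any library. This is a simplified sketch, not the collator's code: it assumes that labels are a copy of `input_ids` with padding positions replaced by -100 (the value the loss ignores), and that the shift by one is applied when the loss is computed. Both helper names are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def make_clm_labels(input_ids, pad_token_id=None):
    """Copy the inputs as labels, masking padding positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def next_token_pairs(input_ids, labels):
    """Pair each prefix with the label one step ahead (the 'shift by one')."""
    return [
        (input_ids[: i + 1], labels[i + 1])
        for i in range(len(input_ids) - 1)
        if labels[i + 1] != IGNORE_INDEX
    ]


input_ids = [11, 42, 7, 99, 0, 0]  # 0 plays the role of the padding token
labels = make_clm_labels(input_ids, pad_token_id=0)
pairs = next_token_pairs(input_ids, labels)

print(labels)
for context, target in pairs:
    print(context, "->", target)
```

In this notebook every chunk has exactly 256 tokens, so no padding is needed and the labels simply mirror the inputs, which matches the identical shapes printed above.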


In [None]:
print(tokenizer.decode(out["input_ids"][0]))

PIECE_START TIME_SIGNATURE=2_4 BPM=60 TRACK_START INST=0 DENSITY=2 BAR_START NOTE_ON=69 TIME_DELTA=0.5 NOTE_OFF=69 NOTE_ON=68 TIME_DELTA=1.5 NOTE_OFF=68 NOTE_ON=67 TIME_DELTA=0.5 NOTE_OFF=67 NOTE_ON=66 TIME_DELTA=2.0 NOTE_OFF=66 NOTE_ON=50 NOTE_ON=66 NOTE_ON=62 TIME_DELTA=2.0 NOTE_OFF=50 NOTE_OFF=66 NOTE_OFF=62 NOTE_ON=62 NOTE_ON=66 NOTE_ON=50 TIME_DELTA=1.0 NOTE_OFF=62 NOTE_OFF=66 NOTE_OFF=50 BAR_END BAR_START NOTE_ON=69 TIME_DELTA=0.5 NOTE_OFF=69 NOTE_ON=67 TIME_DELTA=1.5 NOTE_OFF=67 NOTE_ON=66 TIME_DELTA=0.5 NOTE_OFF=66 NOTE_ON=64 TIME_DELTA=2.0 NOTE_OFF=64 NOTE_ON=61 NOTE_ON=64 NOTE_ON=45 TIME_DELTA=2.0 NOTE_OFF=61 NOTE_OFF=64 NOTE_OFF=45 NOTE_ON=61 NOTE_ON=64 NOTE_ON=45 TIME_DELTA=1.0 NOTE_OFF=61 NOTE_OFF=64 NOTE_OFF=45 BAR_END BAR_START NOTE_ON=67 TIME_DELTA=0.5 NOTE_OFF=67 NOTE_ON=66 TIME_DELTA=1.5 NOTE_OFF=66 NOTE_ON=64 TIME_DELTA=0.5 NOTE_OFF=64 NOTE_ON=62 NOTE_ON=54 TIME_DELTA=2.0 NOTE_OFF=62 NOTE_OFF=54 NOTE_ON=62 NOTE_ON=54 TIME_DELTA=2.0 NOTE_OFF=62 NOTE_OFF=54 NOTE_ON=64 

Let's now use the `to_tf_dataset()` method to convert the datasets to TensorFlow datasets with the data collator created above:

In [None]:
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=8
)

tf_eval_dataset = tokenized_datasets["test"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=8
)

tf_train_dataset

<PrefetchDataset element_spec={'input_ids': TensorSpec(shape=(None, 256), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, 256), dtype=tf.int64, name=None), 'labels': TensorSpec(shape=(None, 256), dtype=tf.int64, name=None)}>

You'll now use the `create_optimizer` function from 🤗 Transformers. With this function, you can set up an [AdamW](https://paperswithcode.com/method/adamw) optimizer. Its hyperparameters (weight decay, learning rate decay, and so on) are worth tuning: with a good choice, you can improve your model's performance compared to the built-in Adam optimizer.

Here, you'll use the same hyperparameters as the [🤗 course: Chapter 7 - TensorFlow version](https://huggingface.co/course/chapter7/6?fw=tf). As explained there, using a learning rate schedule with some warmup improves the stability of training.

In [None]:
from transformers import create_optimizer
import tensorflow as tf

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_epochs = 10
num_train_steps = len(tf_train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=1_000,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

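# Train in mixed-precision float16.
# Note: Keras layers pick up the dtype policy when they are created, so to be
# sure it takes effect, set the policy before the model is instantiated.
tf.keras.mixed_precision.set_global_policy("mixed_float16")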

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
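
To get a feeling for what a warmup schedule does, here is a plain-Python sketch of the learning rate at each step. It assumes linear warmup followed by linear decay to zero, which is my understanding of the default behavior of `create_optimizer`, so treat it as an approximation. The function `lr_at_step` is hypothetical, and the total of 33,620 steps is an illustrative value (26,900 // 8 = 3,362 batches per epoch, times 10 epochs).

```python
def lr_at_step(step, init_lr=5e-5, num_warmup_steps=1_000, num_train_steps=33_620):
    """Linear warmup to init_lr, then linear decay to zero (assumed schedule)."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to init_lr
        return init_lr * step / num_warmup_steps
    # Decay: go down linearly from init_lr to 0 over the remaining steps
    decay_steps = num_train_steps - num_warmup_steps
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)


for step in (0, 500, 1_000, 17_310, 33_620):
    print(f"step {step:>6}: lr = {lr_at_step(step):.2e}")
```

The small learning rate during the first steps keeps the randomly initialized model from taking large, unstable updates before the optimizer statistics settle.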


Let's now train the model and push it to the 🤗 hub!

In [None]:
from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(output_dir="juancopi81/mutopia_guitar_mmm", tokenizer=tokenizer)

model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_epochs, callbacks=[callback])

/content/juancopi81/mutopia_guitar_mmm is already a clone of https://huggingface.co/juancopi81/mutopia_guitar_mmm. Make sure you pull the latest changes with `repo.git_pull()`.




Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file tf_model.h5:   0%|          | 3.34k/329M [00:00<?, ?B/s]

remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/juancopi81/mutopia_guitar_mmm
   518045e..8acf259  main -> main




<keras.callbacks.History at 0x7f889dd123d0>

Great work, and congratulations! In the next step, you will create a Gradio demo for your model. See you there.