After finishing this tutorial's first notebook, you should now have around 380 MIDI files of guitar solo music. In this notebook, you will convert your MIDI files into text token representations and upload your new text dataset to the Hugging Face hub. Let's dive in!

## 2.1 Representing MIDI files as text tokens

Before uploading your dataset to the Hugging Face Hub, you'll convert your MIDI files to text tokens. Doing this will allow you to use state-of-the-art causal language models (GPT-2 for this tutorial).

There are many ways to do this, and this is a field in constant evolution. Let's use the text representations of our MIDI files as described in the "MMM: Exploring Conditional Multi-Track Music Generation with the Transformer" [paper](https://arxiv.org/abs/2008.06048). I highly encourage you to read this paper. It has a lot of details that we won't cover here, and it is not a difficult read. 

In this representation, you convert music events to text tokens such as:

*   **"PIECE_START" and "PIECE_END"**: Represent your piece's beginning and end.
*   **"TRACK_START" and "TRACK_END"**: Cover the duration of each track. 
*   **"BAR_START" and "BAR_END"**: Represent a musical bar's span.

Inside pieces, you have tracks. Inside tracks, you have bars. And inside bars, you have notes with pitches and durations, as shown in this diagram from the paper:

![picture](https://drive.google.com/uc?id=1lvUIwFH3ZPhvhy6LUN4vVS7ZvlfszQvC)

Here are some examples of the tokens used to represent musical events inside bars:
*   **"NOTE_ON=60"**: Represents pressing the note with pitch C4 (60 is the MIDI number for C4).
*   **"TIME_DELTA=2"**: Duration in pulses since the last event.
*   **"NOTE_OFF=60"**: Represents releasing the note with pitch C4.
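
To make these tokens easier to read, here is a minimal sketch that turns a token into a short description. It assumes the naming convention stated above, where MIDI number 60 is C4, and spells accidentals as sharps. The helper names `midi_to_note_name` and `describe_token` are hypothetical and are not part of the MMM code.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_note_name(midi_number: int) -> str:
    """Return a note name for a MIDI number, assuming 60 is C4."""
    if not 0 <= midi_number <= 127:
        raise ValueError(f"MIDI numbers range from 0 to 127, got {midi_number}")
    octave = midi_number // 12 - 1
    return f"{NOTE_NAMES[midi_number % 12]}{octave}"


def describe_token(token: str) -> str:
    """Describe a NOTE_ON, NOTE_OFF, or TIME_DELTA token in plain words."""
    name, _, value = token.partition("=")
    if name == "NOTE_ON":
        return f"press {midi_to_note_name(int(value))}"
    if name == "NOTE_OFF":
        return f"release {midi_to_note_name(int(value))}"
    if name == "TIME_DELTA":
        return f"wait {float(value):g} pulses"
    # Any other token (for example BAR_START) is returned unchanged.
    return token


for token in ["NOTE_ON=60", "TIME_DELTA=2", "NOTE_OFF=60"]:
    print(token, "->", describe_token(token))
```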

To illustrate this better, consider the following bar from Op. 50 No. 04 of Giuliani:

![picture](https://drive.google.com/uc?id=1x_napL4BM3EOuXiwtAKBj0pta6vAGlCY)

The tokenization process will represent this bar with the following tokens:
* PIECE_START 
* TIME_SIGNATURE=6_8 
* BPM=108
  * TRACK_START 
  * INST=0 
  * DENSITY=4 
  * BAR_START 
    * NOTE_ON=74 TIME_DELTA=2.0 NOTE_OFF=74 NOTE_ON=74 NOTE_ON=55 TIME_DELTA=2.0 NOTE_OFF=55 NOTE_ON=59 TIME_DELTA=2.0 NOTE_OFF=59 NOTE_ON=62 TIME_DELTA=2.0 NOTE_OFF=74 NOTE_ON=71 NOTE_OFF=62 NOTE_ON=55 TIME_DELTA=2.0 NOTE_OFF=55 NOTE_ON=59 TIME_DELTA=2.0 NOTE_OFF=71 NOTE_OFF=59 
  * BAR_END 

Notice the use of the tokens **TIME_SIGNATURE=** and **BPM=** to denote the meter and tempo of the bar, respectively. The token **INST=** represents the instrument in MIDI notation, and **DENSITY=** indicates the number of notes inside a bar: more density means more notes. I always use INST=0 (piano) because the dataset contains only guitar music. You'll change this later when creating the sound output of the model.
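
As a quick sanity check on the example bar, the sketch below walks through its tokens, adds up the `TIME_DELTA` values, and checks that every `NOTE_ON` is matched by a `NOTE_OFF`. The function name `summarize_bar` is hypothetical; it only illustrates how the tokens can be read and is not the tokenizer from the repo.

```python
from collections import Counter

# The tokens between BAR_START and BAR_END in the example above.
BAR_TOKENS = (
    "NOTE_ON=74 TIME_DELTA=2.0 NOTE_OFF=74 NOTE_ON=74 NOTE_ON=55 "
    "TIME_DELTA=2.0 NOTE_OFF=55 NOTE_ON=59 TIME_DELTA=2.0 NOTE_OFF=59 "
    "NOTE_ON=62 TIME_DELTA=2.0 NOTE_OFF=74 NOTE_ON=71 NOTE_OFF=62 "
    "NOTE_ON=55 TIME_DELTA=2.0 NOTE_OFF=55 NOTE_ON=59 TIME_DELTA=2.0 "
    "NOTE_OFF=71 NOTE_OFF=59"
).split()


def summarize_bar(tokens):
    """Return the total elapsed time and the pitches still sounding at the end."""
    elapsed = 0.0
    sounding = Counter()
    for token in tokens:
        name, _, value = token.partition("=")
        if name == "TIME_DELTA":
            elapsed += float(value)
        elif name == "NOTE_ON":
            sounding[int(value)] += 1
        elif name == "NOTE_OFF":
            sounding[int(value)] -= 1
    still_on = sorted(pitch for pitch, count in sounding.items() if count > 0)
    return elapsed, still_on


elapsed, still_on = summarize_bar(BAR_TOKENS)
print(elapsed)   # 12.0
print(still_on)  # []
```

The six deltas of 2.0 add up to 12.0, and no note is left sounding when the bar ends.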

Converting the MIDI files into this token representation is not a trivial task, but fortunately [Dr. Tristan Behrens](https://www.linkedin.com/in/dr-tristan-behrens-734967a2/) open-sourced excellent code for doing this. I tweaked the code a bit so that it fits our needs for the Mutopia Project.

So to convert your MIDI files to text tokens, please do the following:

*   Go to the repo [MMM Tokenization](https://github.com/juancopi81/MMM_Tokenization)
*   Clone the repo to your local computer
*   Add the MIDI files to the folder `/datasets/custom_midi_dataset/`.
*   Create the MMMTrack datasets with `python create_dataset_mmm.py`.

After doing that, you should have two new text files inside the folder `/datasets/custom_dataset_mmmtrack/`:

*   `token_sequences_train.txt`
*   `token_sequences_valid.txt`

Feel free to use the files that are already in the repo instead. These are the files you'll upload to Hugging Face as your dataset.
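
Before uploading, it can help to take a quick look at the generated files. The sketch below counts the non-empty lines of each file and how many of them begin with `PIECE_START`. It assumes that each line holds one token sequence and that you run it from the root of the cloned repo; `summarize_token_file` is a hypothetical helper name.

```python
from pathlib import Path


def summarize_token_file(path):
    """Count non-empty lines and how many of them start with PIECE_START."""
    text = Path(path).read_text(encoding="utf-8")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    starts = sum(line.startswith("PIECE_START") for line in lines)
    return {"sequences": len(lines), "start_with_piece_start": starts}


# Assumed location, relative to the root of the cloned repo.
dataset_dir = Path("datasets/custom_dataset_mmmtrack")
for filename in ["token_sequences_train.txt", "token_sequences_valid.txt"]:
    file_path = dataset_dir / filename
    if file_path.exists():
        print(filename, summarize_token_file(file_path))
    else:
        print(filename, "not found")
```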

## 2.2 Uploading your dataset to Hugging Face

If you are working in Colab, for this part of the notebook you need to upload the `token_sequences_train.txt` and `token_sequences_valid.txt` files to the folder `/content/mutopia_guitar_dataset_files/`. You also need a Hugging Face account to upload your dataset. With all that set, you can follow the next instructions; I won't go into much detail here since most of the steps are self-explanatory.

Install the Transformers and Datasets libraries, along with Git LFS.

In [None]:
!pip install datasets transformers[sentencepiece]
!apt install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)

Set up Git with your own email and name.

In [None]:
!git config --global user.email "me@hotmail.com"
!git config --global user.name "My name"

Log in to the Hugging Face Hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


Create the dataset repository on the 🤗 Hub.

In [None]:
from huggingface_hub import create_repo

# `repo_id` replaces the older `name` argument of `create_repo`.
repo_url = create_repo(repo_id="mutopia_guitar_dataset", repo_type="dataset")
repo_url



'https://huggingface.co/datasets/juancopi81/mutopia_guitar_dataset'

Clone the repository from the Hub to your local machine and copy the dataset files into it.

In [None]:
from huggingface_hub import Repository

repo = Repository(local_dir="mutopia_guitar_dataset", clone_from=repo_url)

Cloning https://huggingface.co/datasets/juancopi81/mutopia_guitar_dataset into local empty directory.


In [None]:
!cp /content/mutopia_guitar_dataset_files/* /content/mutopia_guitar_dataset/

Add the `*.txt` pattern to the `.gitattributes` file so that Git LFS tracks these files.

In [None]:
repo.lfs_track("*.txt")
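
If you want to confirm that the pattern was registered, you can inspect `.gitattributes` in the cloned folder. Git LFS records a tracked pattern as a line such as `*.txt filter=lfs diff=lfs merge=lfs -text`. The following is a minimal sketch of such a check; `lfs_tracked_patterns` is a hypothetical helper, and the path assumes the clone lives in `mutopia_guitar_dataset` under the current working directory.

```python
from pathlib import Path


def lfs_tracked_patterns(gitattributes_path):
    """Return the patterns in a .gitattributes file that are routed through Git LFS."""
    path = Path(gitattributes_path)
    if not path.exists():
        return []
    patterns = []
    for line in path.read_text(encoding="utf-8").splitlines():
        parts = line.split()
        # A tracked entry looks like: *.txt filter=lfs diff=lfs merge=lfs -text
        if len(parts) > 1 and "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


# Assumed location of the local clone created above.
print(lfs_tracked_patterns("mutopia_guitar_dataset/.gitattributes"))
```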

Push the dataset to the Hub:

In [None]:
repo.push_to_hub()

Upload file train.txt:   0%|          | 3.34k/94.9M [00:00<?, ?B/s]

Upload file test.txt:   0%|          | 3.34k/882k [00:00<?, ?B/s]

To https://huggingface.co/datasets/juancopi81/mutopia_guitar_dataset
   48c662e..0a3f9a8  main -> main



'https://huggingface.co/datasets/juancopi81/mutopia_guitar_dataset/commit/0a3f9a8575c56bfa63dc94a9a869294e47b94043'
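
The steps above go through a local Git clone. As an alternative, here is a hedged sketch that sends each file over HTTP with `HfApi.upload_file` from `huggingface_hub`. The helpers `plan_uploads` and `upload_dataset_files` are hypothetical names, the repo id in the usage comment is a placeholder, and the sketch assumes you are already logged in and that the dataset repository exists.

```python
from pathlib import Path


def plan_uploads(folder, pattern="*.txt"):
    """List (local_path, path_in_repo) pairs for the files to upload, sorted by name."""
    folder = Path(folder)
    return [(str(path), path.name) for path in sorted(folder.glob(pattern))]


def upload_dataset_files(folder, repo_id):
    """Upload every matching file to the root of a dataset repository on the Hub."""
    # Imported here so that plan_uploads works even without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    for local_path, path_in_repo in plan_uploads(folder):
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type="dataset",
        )


# Example call (placeholder repo id, not executed here):
# upload_dataset_files("/content/mutopia_guitar_dataset_files",
#                      "your-username/mutopia_guitar_dataset")
```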

Load the dataset from the hub.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("juancopi81/mutopia_guitar_dataset")
raw_datasets

Using custom data configuration juancopi81--mutopia_guitar_dataset-7b7b35ea85ad98f4
Reusing dataset text (/root/.cache/huggingface/datasets/juancopi81___text/juancopi81--mutopia_guitar_dataset-7b7b35ea85ad98f4/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08)


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 7325
    })
    test: Dataset({
        features: ['text'],
        num_rows: 74
    })
})
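
With the dataset loaded, you may want a rough idea of how long the token sequences are. The sketch below computes simple statistics over any iterable of rows that have a `text` field, which is how the rows of `raw_datasets["train"]` look when you iterate over them. The name `sequence_length_stats` is hypothetical, and tokens are assumed to be separated by whitespace.

```python
def sequence_length_stats(rows):
    """Return count, min, max, and mean number of whitespace-separated tokens per row."""
    lengths = [len(row["text"].split()) for row in rows]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


# Small stand-in rows; with the real dataset you could pass raw_datasets["train"].
example_rows = [
    {"text": "PIECE_START TRACK_START INST=0 BAR_START BAR_END TRACK_END"},
    {"text": "PIECE_START TRACK_START"},
]
print(sequence_length_stats(example_rows))
```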

Congratulations! You now have your dataset on the Hugging Face Hub. In the next notebook, you'll create the tokenizer for the model you will train. See you there.