# Dataset tokenization

Tokenization converts the input text into the numerical inputs a model expects. This is done with a tokenizer; the simplest possible tokenizer just splits the text into words.
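As a rough illustration of the word-splitting idea, here is a minimal sketch of a whitespace tokenizer that maps each distinct chord to an integer ID. The helper names `whitespace_tokenize` and `build_vocab` are hypothetical and are not used elsewhere in this notebook.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split text on any run of whitespace (spaces, tabs, newlines)."""
    return text.split()


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab: dict[str, int] = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


chords = "G G/B C G Em C G"
tokens = whitespace_tokenize(chords)
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]

print(tokens)  # ['G', 'G/B', 'C', 'G', 'Em', 'C', 'G']
print(ids)     # [0, 1, 2, 0, 3, 2, 0]
```

A word-level vocabulary like this cannot encode a chord it has never seen, which is one reason subword tokenizers such as the one used by GPT2 are preferred.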

Here we use the tokenizer of OpenAI's GPT2 model and also train a new one on the corpus of our dataset.

The goal is to see if the new tokenizer is capable of tokenizing song chord strings better.

## Resources

* [Training a tokenizer](https://huggingface.co/learn/nlp-course/en/chapter6/2)

## Dataset

In [1]:
from datasets import load_dataset
import pandas as pd

In [2]:
dataset = load_dataset("lluccardoner/melodyGPT-song-chords-text-1", split="train")

In [3]:
dataset

Dataset({
    features: ['genres', 'artist_name', 'song_name', 'chords_str'],
    num_rows: 135783
})

In [4]:
df = dataset.to_pandas()

In [5]:
df = df.dropna()

In [6]:
df.head()

| | genres | artist_name | song_name | chords_str |
|---|---|---|---|---|
| 0 | ['canadian pop', 'pop', 'post-teen pop'] | Justin Bieber | 10,000 Hours | G G/B C G G G/B C G G Em C G G Em C G G Em C G... |
| 1 | ['canadian pop', 'pop', 'post-teen pop'] | Justin Bieber | 2 Much | Intro: F#m7 D2 F#m7 D2 F#m7 D2 E F#m7 A/C# E D... |
| 2 | ['canadian pop', 'pop', 'post-teen pop'] | Justin Bieber | 2u (feat. David Guetta) | Em D C C D Em Em D C C D Em Em D C Am D Em G C... |
| 3 | ['canadian pop', 'pop', 'post-teen pop'] | Justin Bieber | All Around The World | Intro: Em Bm Am C (2x) Em Bm Am C Em Bm Am C ... |
| 4 | ['canadian pop', 'pop', 'post-teen pop'] | Justin Bieber | All Around The World (acoustic) | Intro: Gm - Dm - C - C x2 Gm Dm C C Gm Dm C C ... |


In [7]:
example = df.sample(1)
example_chords = example["chords_str"].iloc[0]
example_chords

'\n\t \t\tIntro : D D Dm D B|------10----10--12^13-12--10----10---10--12--10-------10--| G|-9/11----11--------------------------------------9/11-----| D|----------------------------------------------------------| A|----------------------------------------------------------| E|----------------------------------------------------------| B|------10---------------------| G|-9/11----11-9-9/11-11\\7-7/9--| D|-----------------------------| A|-----------------------------| E|-----------------------------| D\t Bm \tBb\t D \t\t\tBm \t Bb\t\tD \t\t\tBm Bb\t\t\tD \t\t\t Bm Bb\t\t\tD A\t Bm F#m\t G A\t Bm F#m\t G A\t Bm F#m\t\tD A\t Bm F#m\t\tD Intro : D D Dm D B|------10----10--12^13-12--10----10---10--12--10-------10--| G|-9/11----11--------------------------------------9/11-----| D|----------------------------------------------------------| A|----------------------------------------------------------| E|----------------------------------------------------------| B|------10---------------------| 

In [8]:
example_chords = "Intro: Adim G7/13 Em Bb (4x) G#dim Bm/C F#m Ab|---------------------------------| (Bridge) C G Em7 Asus4"

## GPT2 tokenizer

In [9]:
from transformers import AutoTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


The GPT2 tokenizer has a vocabulary of 50000 + 256 + 1 = 50257 tokens: 50,000 learned BPE merges, 256 base byte tokens, and 1 special end-of-text token.

In [10]:
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(gpt2_tokenizer.vocab_size)

50257
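The 256 in that sum comes from the byte-level design: every string can be written as a sequence of UTF-8 bytes, each in the range 0 to 255, so no input is ever out of vocabulary. The sketch below shows the bytes of one chord and repeats the arithmetic; the variable names are only illustrative.

```python
# Byte-level BPE starts from the 256 possible byte values, so any string
# can be encoded without needing an "unknown" token.
chord = "F#m7"
byte_values = list(chord.encode("utf-8"))
print(byte_values)  # [70, 35, 109, 55]

# Breakdown of the GPT2 vocabulary size.
n_merges = 50_000  # tokens learned through BPE merges
n_bytes = 256      # base byte tokens
n_special = 1      # the end-of-text token
total = n_merges + n_bytes + n_special
print(total)  # 50257
```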


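We can observe that the original GPT2 tokenizer does a reasonable job of separating chords from their alterations, although it breaks some chord names in odd places (for example, "Adim" becomes "ĠAd" + "im" and "Bb" becomes "ĠB" + "b").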

In [11]:
tokens = gpt2_tokenizer.tokenize(example_chords)
print(tokens)

['Int', 'ro', ':', 'ĠAd', 'im', 'ĠG', '7', '/', '13', 'ĠEm', 'ĠB', 'b', 'Ġ(', '4', 'x', ')', 'ĠG', '#', 'dim', 'ĠB', 'm', '/', 'C', 'ĠF', '#', 'm', 'ĠAb', '|', '--------------------------------', '-|', 'Ġ(', 'Bridge', ')', 'ĠC', 'ĠG', 'ĠEm', '7', 'ĠAsus', '4']
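The `Ġ` prefix in this output is how GPT2's byte-level BPE marks a space that precedes the token. As a small sketch, the first 27 tokens of the output (copied by hand, up to and including `ĠAb`) can be turned back into text by joining them and replacing `Ġ` with a space. With a real tokenizer object, `convert_tokens_to_string` should give the same result.

```python
# First 27 tokens of the GPT2 output, up to and including "ĠAb".
tokens = ['Int', 'ro', ':', 'ĠAd', 'im', 'ĠG', '7', '/', '13', 'ĠEm', 'ĠB', 'b',
          'Ġ(', '4', 'x', ')', 'ĠG', '#', 'dim', 'ĠB', 'm', '/', 'C', 'ĠF', '#',
          'm', 'ĠAb']

# "Ġ" stands for a space in front of the token, so joining the tokens and
# swapping the marker for a space recovers the original text.
text = "".join(tokens).replace("Ġ", " ")
print(text)  # Intro: Adim G7/13 Em Bb (4x) G#dim Bm/C F#m Ab
```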


In [12]:
ids = gpt2_tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[5317, 305, 25, 1215, 320, 402, 22, 14, 1485, 2295, 347, 65, 357, 19, 87, 8, 402, 2, 27740, 347, 76, 14, 34, 376, 2, 76, 2275, 91, 3880, 22831, 357, 37385, 8, 327, 402, 2295, 22, 46301, 19]


## Train GPT2 tokenizer

In [13]:
def get_training_corpus(step: int = 1000):
    return (
        df["chords_str"][i : i + step].values.tolist()
        for i in range(0, len(df["chords_str"]), step)
    )
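The function above returns a generator of batches, which keeps memory use low but also means the corpus can only be iterated once. The sketch below shows the same batching pattern on a plain list; `iter_batches` and the `songs` list are hypothetical stand-ins for `get_training_corpus` and `df["chords_str"]`.

```python
def iter_batches(items: list[str], step: int = 1000):
    """Return a generator over consecutive slices of at most `step` items."""
    return (items[i : i + step] for i in range(0, len(items), step))


# Hypothetical stand-in for df["chords_str"].
songs = ["G C D", "Em C G D", "Am F C G", "D A Bm G", "C G Am F"]

corpus = iter_batches(songs, step=2)
batches = list(corpus)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [2, 2, 1]

# A generator is exhausted after one pass, so a second pass yields nothing.
second_pass = list(corpus)
print(second_pass)  # []
```

For the same reason, the training corpus should be created again with a fresh call to `get_training_corpus()` if the tokenizer training cell is rerun.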

In [14]:
training_corpus = get_training_corpus()

In [15]:
# vocab_size is the target size of the new vocabulary.
chords_gpt2_tokenizer = gpt2_tokenizer.train_new_from_iterator(training_corpus, vocab_size=50000)

We can see cases where the new tokenizer performs slightly better. For example, diminished chords are no longer split into two tokens:
* Original GPT2 tokenizer: "Gdim" -> ["Gd", "im"]
* New GPT2 tokenizer: "Gdim" -> ["Gdim"]
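To give some intuition for why retraining on chord data produces tokens such as "Gdim", here is a toy sketch of the BPE idea: repeatedly merge the most frequent pair of adjacent symbols. It is a simplified, character-level illustration on a made-up word list, not the byte-level implementation that `train_new_from_iterator` actually runs, and all function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Fuse every adjacent occurrence of `pair` in `symbols` into one symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple]:
    """Learn up to `num_merges` merges from a list of words."""
    # Each distinct word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Ties are broken by first occurrence.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_merges(word: str, merges: list[tuple]) -> list[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Made-up toy corpus in which diminished chords are frequent.
words = ["Gdim"] * 3 + ["Adim"] * 2 + ["G"] * 5 + ["Am"] * 4

two_merges = learn_bpe_merges(words, num_merges=2)
four_merges = learn_bpe_merges(words, num_merges=4)

print(apply_merges("Gdim", []))           # ['G', 'd', 'i', 'm']
print(apply_merges("Gdim", two_merges))   # ['G', 'dim']
print(apply_merges("Gdim", four_merges))  # ['Gdim']
```

Because "dim" is so common in this toy corpus, it becomes a single symbol after two merges, and the full chord follows once more merges are allowed.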

In [16]:
new_tokens = chords_gpt2_tokenizer.tokenize(example_chords)
print(new_tokens)

['Intro', ':', 'ĠAdim', 'ĠG', '7', '/', '13', 'ĠEm', 'ĠBb', 'Ġ(', '4', 'x', ')', 'ĠG', '#', 'dim', 'ĠBm', '/', 'C', 'ĠF', '#', 'm', 'ĠAb', '|---------------------------------|', 'Ġ(', 'Bridge', ')', 'ĠC', 'ĠG', 'ĠEm', '7', 'ĠAsus', '4']


In [17]:
new_ids = chords_gpt2_tokenizer.convert_tokens_to_ids(new_tokens)
print(new_ids)

[287, 26, 659, 260, 23, 15, 303, 268, 272, 279, 20, 88, 9, 260, 3, 294, 270, 15, 35, 264, 3, 77, 284, 613, 279, 719, 9, 262, 260, 268, 23, 319, 20]
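One simple way to compare the two tokenizers on this example is to count how many tokens each one needs. The sketch below copies the two ID lists printed in this notebook; it measures a single hand-written string, so it is only an indication and not a corpus-wide evaluation.

```python
# IDs produced by the original GPT2 tokenizer for the example string.
gpt2_ids = [5317, 305, 25, 1215, 320, 402, 22, 14, 1485, 2295, 347, 65, 357,
            19, 87, 8, 402, 2, 27740, 347, 76, 14, 34, 376, 2, 76, 2275, 91,
            3880, 22831, 357, 37385, 8, 327, 402, 2295, 22, 46301, 19]

# IDs produced by the tokenizer retrained on the chords corpus.
chords_ids = [287, 26, 659, 260, 23, 15, 303, 268, 272, 279, 20, 88, 9, 260,
              3, 294, 270, 15, 35, 264, 3, 77, 284, 613, 279, 719, 9, 262,
              260, 268, 23, 319, 20]

print(len(gpt2_ids), len(chords_ids))  # 39 33

# Relative reduction in sequence length for this one example.
reduction = 1 - len(chords_ids) / len(gpt2_ids)
print(f"{reduction:.1%}")  # 15.4%
```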


Save the newly trained tokenizer and reload it from the Hub:

In [18]:
#chords_gpt2_tokenizer.push_to_hub("lluccardoner/melodyGPT-song-chords-tokenizer-1")

In [19]:
t = AutoTokenizer.from_pretrained("lluccardoner/melodyGPT-song-chords-tokenizer-1")
print(t.vocab_size)

19972
