This project uses transformer encoder blocks to predict the next note in a sequence. The model was trained on Pokémon music from generations I and IV and was able to both recreate existing Pokémon music and generate original pieces.

The music sequences were MIDI files, parsed with the "pretty_midi" library. The note-parsing approach was inspired by TensorFlow's documentation on generating music with an LSTM.
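As a rough illustration of the parsing step, here is a minimal sketch that turns notes into (pitch, step, duration) triples. It assumes that "step" means the time since the previous note started, as in the TensorFlow material, and that only the first instrument track is read. The function names load_notes and notes_to_features are hypothetical and not taken from this project.

```python
def notes_to_features(notes):
    """Convert (start, end, pitch) tuples into (pitch, step, duration) triples.

    Assumption: step = time since the previous note's start,
    duration = end - start, both in seconds.
    """
    ordered = sorted(notes, key=lambda n: n[0])  # sort by start time
    features = []
    prev_start = ordered[0][0] if ordered else 0.0
    for start, end, pitch in ordered:
        features.append((pitch, start - prev_start, end - start))
        prev_start = start
    return features


def load_notes(path):
    """Read (start, end, pitch) tuples from the first instrument of a MIDI file."""
    import pretty_midi  # third-party; imported here so the helper above runs without it

    midi = pretty_midi.PrettyMIDI(path)
    instrument = midi.instruments[0]  # assumption: a single track is enough
    return [(n.start, n.end, n.pitch) for n in instrument.notes]
```

With a MIDI file on disk, notes_to_features(load_notes("song.mid")) would give the raw features that still need to be bucketed before tokenization; "song.mid" is a placeholder path.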

To make the model work, I built a vocabulary covering every combination of 128 note pitches, 8 durations, and 8 note steps, for a total of 128 × 8 × 8 = 8192 tokens. Each note in a sequence was then mapped to one of these tokens so the network could learn an embedding for each of the 8192 tokens.
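One way to realize such a vocabulary is to pack the three values into a single integer. The sketch below is an assumption about the layout: the ordering of pitch, duration bin, and step bin inside the token ID is arbitrary, and the duration and step values are assumed to have already been bucketed into bin indices 0 to 7. The names encode and decode are hypothetical.

```python
NUM_PITCHES = 128    # MIDI pitch range 0..127
NUM_DURATIONS = 8    # assumed: duration already bucketed into 8 bins
NUM_STEPS = 8        # assumed: step already bucketed into 8 bins
VOCAB_SIZE = NUM_PITCHES * NUM_DURATIONS * NUM_STEPS  # 8192


def encode(pitch, duration_bin, step_bin):
    """Pack (pitch, duration_bin, step_bin) into one token ID in [0, VOCAB_SIZE)."""
    if not 0 <= pitch < NUM_PITCHES:
        raise ValueError("pitch out of range")
    if not 0 <= duration_bin < NUM_DURATIONS:
        raise ValueError("duration_bin out of range")
    if not 0 <= step_bin < NUM_STEPS:
        raise ValueError("step_bin out of range")
    return (pitch * NUM_DURATIONS + duration_bin) * NUM_STEPS + step_bin


def decode(token):
    """Invert encode(): recover (pitch, duration_bin, step_bin) from a token ID."""
    rest, step_bin = divmod(token, NUM_STEPS)
    pitch, duration_bin = divmod(rest, NUM_DURATIONS)
    return pitch, duration_bin, step_bin
```

Because the mapping is a bijection, generated token IDs can be decoded back into notes and written out as MIDI.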

The transformer itself used just a single encoder block, with a feed-forward network size of 1024/2048, because a larger network was too computationally expensive to train on my computer. It also uses the Noam learning rate scheduler from the "Attention Is All You Need" paper with warmup_steps = 4000.
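For reference, the Noam schedule from that paper increases the learning rate linearly during warmup and then decays it with the inverse square root of the step. The sketch below is framework-independent; the d_model value of 256 is only a placeholder, since the embedding size is not stated here, and the function name noam_lr is hypothetical.

```python
def noam_lr(step, d_model=256, warmup_steps=4000):
    """Learning rate at a given training step under the Noam schedule.

    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    d_model=256 is an assumed placeholder, not this project's actual value.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate peaks at step == warmup_steps, so with warmup_steps = 4000 the first 4000 updates ramp up and every later update uses a gradually smaller rate.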

The largest limitation of the network was that most Pokémon music, especially from the later generations, features many different instruments, while the model only used an acoustic grand piano. It would have been better to have the network learn which notes are played on which instruments, but that would have greatly increased the training time.