## Preparation
Make sure to select GPU as the hardware accelerator when configuring the runtime (Runtime -> Change runtime type).

#### Installing the dependencies

In [0]:
!pip install pypianoroll
!pip install --upgrade torch
import pypianoroll  # this will install ffmpeg

#### Getting the code

In [0]:
# create ssh keys (the public key is installed on gitlab)
import os, stat
os.makedirs("/root/.ssh", exist_ok=True)  # does not fail if the cell is run twice
with open("/root/.ssh/id_ed25519.pub", "w") as pubkey, open("/root/.ssh/id_ed25519", "w") as privkey:
    pubkey.write("ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOaO3EXB/jrEp4cYg5DBj/9yWh3W/X7/xro2iDMOy0ht root@6147081228e3\n")
    privkey.write("-----BEGIN OPENSSH PRIVATE KEY-----\n")
    privkey.write("b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW\n")
    privkey.write("QyNTUxOQAAACDmjtxFwf46xKeHGIOQwY//clod1v1+/8a6NogzDstIbQAAAJhPsrlvT7K5\n")
    privkey.write("bwAAAAtzc2gtZWQyNTUxOQAAACDmjtxFwf46xKeHGIOQwY//clod1v1+/8a6NogzDstIbQ\n")
    privkey.write("AAAECFjHPM4aI4nSseqXCjUJkomo3uOZTx6A2DLJW+e1FI1uaO3EXB/jrEp4cYg5DBj/9y\n")
    privkey.write("Wh3W/X7/xro2iDMOy0htAAAAEXJvb3RANjE0NzA4MTIyOGUzAQIDBA==\n")
    privkey.write("-----END OPENSSH PRIVATE KEY-----\n")
os.chmod("/root/.ssh/id_ed25519", stat.S_IREAD|stat.S_IWRITE)
# add the domain to the known hosts
!ssh-keyscan -t ed25519 gitlab.com > ~/.ssh/known_hosts
!git clone git@gitlab.com:agct_music/music-autoencoder.git
os.chdir("./music-autoencoder")

## Training the model

#### Creating a model
We create the model described in the concept, with default parameters. To use a model that is more like the one in the tutorial, create an instance of ```model.autoencoder.Autoencoder``` instead (without ```_SE```).


In [0]:
import model.autoencoder
autoencoder = model.autoencoder.Autoencoder_SE()

#### Loading training data
We can use the small dataset ```small.zip``` for testing and debugging. The archive should contain only ```.json``` (or ```.txt```) files, where each file represents one *song*. The word *song* is used to refer to a 3-tuple containing:
0. The name of the song (```str```).
1. The duration of a time tick (```float```).
2. The stream of tokens (a list of tuples, each containing two ```int``` values).

The dataset is split into a training set and an evaluation set with a ratio of 0.8 to 0.2. Both ```trainset``` and ```evalset``` will be Python lists of songs. Because the songs are not shuffled, the first 80% of the songs (in the order they appear in the archive) will belong to the training set.
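
The unshuffled split described above can be sketched in a few lines. This is only an illustration, not the code of ```util.archive.load```: the helper ```split_songs``` and the example songs are hypothetical, and truncating the split index is an assumption about how the boundary is rounded.

```python
def split_songs(songs, ratio):
    """Split an ordered list of songs into (train, eval) without shuffling."""
    # Assumption: the boundary index is truncated towards zero.
    cut = int(len(songs) * ratio)
    return songs[:cut], songs[cut:]

# Hypothetical songs: (name, duration of a time tick, token stream).
# The token values are placeholders without musical meaning.
songs = [("song_%d" % i, 0.05, [(0, 60), (1, 4)]) for i in range(10)]

train_part, eval_part = split_songs(songs, 0.8)
print(len(train_part), len(eval_part))
```

Because the order is preserved, the evaluation set always consists of the last songs in the archive.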




In [0]:
import util.archive
trainset, evalset = util.archive.load("small.zip", 0.8)

#### Training the model
We train the model on the training set for 7 epochs. During training, the accuracy without teacher forcing (TF; first value in brackets) and with teacher forcing (second value in brackets) is shown. Non-TF predictions sometimes show a higher accuracy because the first prediction, which is never teacher-forced, is often correct, especially when the input sequences are read in reverse.

After each epoch, the loss and accuracy of each song in the evaluation set are computed and displayed together with the name of the song.

All output can be sent to a file, which can later be read back to plot the results. The code takes about 6 minutes to run.

In [0]:
import train
train.train(autoencoder, 7, trainset, evalset)
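
One possible way to send the training output to a file is ```contextlib.redirect_stdout``` from the standard library. The sketch below assumes that the training code reports progress with plain ```print``` calls. The helper ```run_logged``` and the stand-in ```fake_train``` are hypothetical and not part of the repository.

```python
import contextlib
import os
import tempfile

def run_logged(func, log_path, *args, **kwargs):
    """Call func while everything it prints to stdout is written to log_path."""
    with open(log_path, "w") as log, contextlib.redirect_stdout(log):
        return func(*args, **kwargs)

def fake_train(epochs):
    # Hypothetical stand-in for train.train; it only prints one line per epoch.
    for epoch in range(epochs):
        print("epoch", epoch)

log_path = os.path.join(tempfile.mkdtemp(), "train.log")
run_logged(fake_train, log_path, 3)

# Read the log back to check what was captured.
with open(log_path) as log:
    lines = log.read().splitlines()
```

Under the same assumption, the real call would look like ```run_logged(train.train, "train.log", autoencoder, 7, trainset, evalset)```. Output written by subprocesses or native code is not captured this way.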

#### Saving the model weights
Each model from ```model.autoencoder``` provides a ```save``` function that saves the model parameters and weights to a directory. The directory is created if it does not already exist. The parameters are saved in a ```.json``` file, so that you can tell which model is which. (Only the constructor parameters are saved.)

We also save the names of all the songs we used for training, so that we can later reconstruct the training set (if needed).

In [0]:
dir = os.path.join("demo", "model")
autoencoder.save(dir)
# save the name of the training and eval songs
with open(os.path.join(dir, "train.txt"), "w") as t, open(os.path.join(dir, "eval.txt"), "w") as e:
    for name, _, _ in trainset:
        t.write(name+"\n")
    for name, _, _ in evalset:
        e.write(name+"\n")
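
The saved name lists can later be used to reconstruct the training set from a larger collection of songs. The following is a minimal sketch, assuming that song names are unique and contain no line breaks. The helpers ```read_names``` and ```select_songs``` are hypothetical and not part of the repository.

```python
import os
import tempfile

def read_names(path):
    """Return the song names stored one per line, as in train.txt."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f if line.strip()]

def select_songs(songs, names):
    """Keep the songs whose name (first tuple element) is listed, in their original order."""
    wanted = set(names)
    return [song for song in songs if song[0] in wanted]

# Hypothetical songs and a hypothetical name list containing "a" and "c".
all_songs = [("a", 0.05, [(0, 1)]), ("b", 0.05, [(2, 3)]), ("c", 0.05, [(4, 5)])]
names_path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(names_path, "w") as f:
    f.write("a\nc\n")

restored = select_songs(all_songs, read_names(names_path))
```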

## Working with a model

#### Loading a saved model
We provide a pre-trained model. It was trained on the first 1000 songs from ```all.zip``` for 7 epochs by running the file ```main.py``` as is.

The file ```params.json``` in the model directory shows what the model parameters were.

In [0]:
dir = os.path.join("demo", "pretrained")
autoencoder = model.autoencoder.load(dir)
# also show params.json
with open(os.path.join(dir, "params.json")) as params:
    print(params.read())

#### Visualizing the training accuracy
The output generated during training was saved to ```demo/pretrained/train.log```. It can be visualized with ```util.readlog```.

In [0]:
import util.readlog
util.readlog.plot(os.path.join(dir, "train.log"))

#### Loading MIDI files from a directory
Some MIDI files that were excluded from the training set are provided. These can be loaded using ```util.load```. Converting a MIDI file to a token representation takes a while, so it is usually faster to save and load the token representation than to work with the MIDI file directly.

Google Drive can also be very slow. Instead of opening hundreds of files, it is often a good idea to store multiple songs in one archive, as was done for the training set.

In [0]:
import util.load
songs = util.load.midi(os.path.join(dir, "midi"))
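
Storing many songs in a single archive can be sketched with the standard library modules ```zipfile``` and ```json```. The JSON layout used here (a list holding name, tick duration and tokens) is an assumption for illustration only and may differ from the format that ```util.archive``` expects. The helpers ```pack_songs``` and ```unpack_songs``` are hypothetical.

```python
import json
import os
import tempfile
import zipfile

def pack_songs(songs, archive_path):
    """Write every song as one .json member of a single zip archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, tick, tokens in songs:
            # Assumed layout: [name, tick duration, list of token pairs].
            archive.writestr(name + ".json", json.dumps([name, tick, tokens]))

def unpack_songs(archive_path):
    """Read all songs back, in the order they appear in the archive."""
    songs = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.namelist():
            name, tick, tokens = json.loads(archive.read(member))
            # JSON has no tuples, so the token pairs are converted back.
            songs.append((name, tick, [tuple(pair) for pair in tokens]))
    return songs

# Hypothetical songs with placeholder token values.
example = [("first", 0.05, [(0, 60), (1, 4)]), ("second", 0.1, [(0, 62)])]
archive_path = os.path.join(tempfile.mkdtemp(), "songs.zip")
pack_songs(example, archive_path)
restored = unpack_songs(archive_path)
```

Opening one archive instead of hundreds of small files means only a single file has to be fetched from the slow storage.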

#### Evaluating and saving
The model can be used to encode and decode a set of songs. We can then save the reconstructions to listen to the results.

In [0]:
import eval, util.save
pred = eval.eval(autoencoder, songs)
# save songs with -pred attached to their names
util.save.midi(os.path.join(dir, "midi"), [(name+"-pred", dur, toks) for name, dur, toks in pred])

#### Creating the AGCT datasets
We can extract the code vectors of the songs and save them in a format that AGCT can read and process.

To keep the gene names short, samples from the first song will be named ```A_1```, ```A_2```, ..., samples from the second song will be called ```B_1```, ```B_2```, ..., and so on.
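
The naming scheme can be illustrated with a small sketch. The helper ```gene_names``` is hypothetical and not the code used by ```util.save.code```. How more than 26 songs are named is not stated here, so the sketch simply rejects that case.

```python
import string

def gene_names(samples_per_song):
    """Return names like A_1, A_2, ..., B_1, ... for the given sample counts."""
    if len(samples_per_song) > len(string.ascii_uppercase):
        # Assumption: one uppercase letter per song, so at most 26 songs.
        raise ValueError("this sketch supports at most 26 songs")
    names = []
    for letter, count in zip(string.ascii_uppercase, samples_per_song):
        # Samples are numbered starting from 1 within each song.
        names.extend("%s_%d" % (letter, i) for i in range(1, count + 1))
    return names

# Two songs: the first with 2 samples, the second with 3.
print(gene_names([2, 3]))
```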

In [0]:
import encode
c = encode.encode(autoencoder, songs[0:5])
util.save.code(os.path.join(dir, "agct_dataset.txt"), c)