In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# from fastai.model import fit
from fastai.dataset import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle

## Music modeling

### Data

The data comes from the [IraKorshunova repository](https://github.com/IraKorshunova/folk-rnn/tree/master/data), whose author cleaned, parsed and tokenised the [thesession.org dataset](https://github.com/adactio/TheSession-data). We used version 3 of this dataset, [allabcwrepeats_parsed_wot](allabcwrepeats_parsed_wot), which has more than 46,000 transcriptions.

The **music generation task**

We tried to create a *music model* inspired by the *language model* that Jeremy Howard uses in [fast.ai course, lesson 4](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson4-imdb.ipynb), where he builds a model that predicts the next word in a sentence and then uses it to classify the sentiment of texts.

Because our model first needs to understand the structure of the music files, we decided to use text files in [abc format](https://en.wikipedia.org/wiki/ABC_notation) instead of MIDI files, which are matrices of 4 columns.

[Ex:](http://abcnotation.com/)
```
X:1
T:Speed the Plough
M:4/4
C:Trad.
K:G
|:GABc dedB|dedB dedB|c2ec B2dB|c2A2 A2BA|
  GABc dedB|dedB dedB|c2ec B2dB|A2F2 G4:|
|:g2gf gdBd|g2f2 e2d2|c2ec B2dB|c2A2 A2df|
  g2gf g2Bd|g2f2 e2d2|c2ec B2dB|A2F2 G4:|
```
  
![Speed the Plough](https://github.com/alessaww/fastai_ws/blob/master/SpeedThePlough.png?raw=true "Speed the Plough")

There are no good pretrained music models available to download for use in PyTorch, so we need to create our own.

We split the data into 95% for training and 5% for validation.
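
A split like this could be produced along the following lines. This is only a sketch: it assumes the tunes are already available as a list of strings, and `split_tunes` with its `valid_frac` and `seed` parameters are hypothetical names, not the code that was used to create `wot_train` and `wot_valid`.

```python
import random

def split_tunes(tunes, valid_frac=0.05, seed=42):
    """Shuffle the tunes and hold out `valid_frac` of them for validation."""
    tunes = list(tunes)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(tunes)   # seeded so the split is reproducible
    n_valid = int(len(tunes) * valid_frac)
    return tunes[n_valid:], tunes[:n_valid]   # (train, valid)

# toy stand-in for the real transcriptions
toy_tunes = [f"tune {i}" for i in range(100)]
train, valid = split_tunes(toy_tunes)
print(len(train), len(valid))
```

Shuffling before splitting avoids holding out only the tunes that happen to come last in the file.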

In [2]:
PATH='data/musichack/thesession/'

In [3]:
%ls {PATH}

allabcwrepeats_parsed_wot  models/  tmp/  wot_train  wot_valid


Let's look at an example from the validation dataset:

In [4]:
review = !cat {PATH}wot_valid
review[:3]

['M:9/8',
 'K:maj',
 '=G =E =E =E 2 =D =E =D =C | =G =E =E =E =F =G =A =B =c | =G =E =E =E 2 =D =E =D =C | =A =D =D =G =E =C =D 2 =A | =G =E =E =E 2 =D =E =D =C | =G =E =E =E =F =G =A =B =c | =G =E =E =E 2 =D =E =D =C | =A =D =D =G =E =C =D 2 =D | =E =D =E =c 2 =A =B =A =G | =E =D =E =A /2 =B /2 =c =A =B 2 =D | =E =D =E =c 2 =A =B =A =G | =A =D =D =D =E =G =A 2 =D | =E =D =E =c 2 =A =B =A =G | =E =D =E =A /2 =B /2 =c =A =B 2 =B | =G =A =B =c =B =A =B =A =G | =A =D =D =D =E =G =A =B =c |']

Before we can analyze text, it must first be *tokenized*. In the language world, this refers to the process of splitting a sentence into an array of words (or more generally, into an array of tokens).

Sturm et al. describe how they tokenize the music dataset in the paper ["Music transcription modelling and composition using deep learning"](https://arxiv.org/pdf/1604.08723.pdf). Here are some of the tokens used for this dataset:

1. meter "M:9/8"
1. key: "K:maj"
1. duration "/2" and "2"
1. measure: ":|" and "|1"
1. pitch: "C" and "^c'"
1. grouping: "(3"
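
To make the categories above concrete, the sketch below splits a line on whitespace and labels each token. The regular expressions are hypothetical: they were written only to match the example tokens in this list and are not the tokenizer from the paper.

```python
import re

# Hypothetical patterns, written only to cover the example tokens listed above.
TOKEN_KINDS = [
    ("meter",    re.compile(r"^M:\d+/\d+$")),
    ("key",      re.compile(r"^K:[a-z]+$")),
    ("grouping", re.compile(r"^\(\d+$")),
    ("measure",  re.compile(r"^(\|:|:\||\|\d?)$")),
    ("pitch",    re.compile(r"^[=^]?[A-Ga-g][,']*$")),
    ("duration", re.compile(r"^(\d+|/\d+|\d+/\d+)$")),
]

def token_kind(tok):
    """Return the first category whose pattern matches, or 'other'."""
    for kind, pattern in TOKEN_KINDS:
        if pattern.match(tok):
            return kind
    return "other"

line = "M:9/8 K:maj |: =G =E 2 (3 =A /2 ^c' :| |1"
for tok in line.split():
    print(tok, "->", token_kind(tok))
```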

First, we create a torchtext field, which describes how to preprocess a piece of text. 

In [5]:
TEXT = data.Field(lower=False)

[fastai](https://github.com/fastai/fastai) works closely with torchtext. We create a ModelData object for language modeling by taking advantage of `LanguageModelData`, passing it our torchtext field object and the paths to our training, test, and validation sets. In this case, we don't have a separate test set, so we'll just use the validation file (`wot_valid`) for that too.

As well as the usual `bs` (batch size) parameter, we now also have `bptt`; this defines how many tokens are processed at a time in each row of the mini-batch. More importantly, it defines how many 'layers' we will backprop through. Making this number higher will increase time and memory requirements, but will improve the model's ability to handle long sequences.

In [6]:
bs,bptt = 64,70

In [7]:
FILES = dict(train='wot_train', validation='wot_valid', test='wot_valid')
md = LanguageModelData.from_text_files(f'{PATH}', TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

Building our `ModelData` object automatically fills the `TEXT` object with a very important attribute: `TEXT.vocab`. This is a *vocabulary*, which stores which words (or *tokens*) have been seen in the text, and how each word will be mapped to a unique integer id. We'll need to use this information again later, so we save it.

*(Technical note: Python's standard `pickle` library can't handle this correctly, so at the top of this notebook we imported the `dill` library instead, under the name `pickle`.)*

In [8]:
with open(f'{PATH}models/TEXT.pkl','wb') as f:
    pickle.dump(TEXT, f)
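
What the vocabulary holds can be pictured with a few lines of plain Python. This is a simplified sketch and not torchtext's implementation: `build_vocab` is a hypothetical helper, and the ordering rule (most frequent first, ties broken alphabetically, special tokens at the front) is an assumption made for the illustration.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for tokens that occur at least `min_freq` times."""
    counts = Counter(tokens)
    # most frequent first; ties broken alphabetically
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept                    # id -> token
    stoi = {tok: i for i, tok in enumerate(itos)}   # token -> id
    return itos, stoi

tokens = "| =c 2 =c | =c =F 2 | (7".split()
itos, stoi = build_vocab(tokens, min_freq=2)
print(itos)
# tokens below min_freq are not in the vocab and fall back to <unk>
print(stoi.get("(7", stoi["<unk>"]))
```

With `min_freq=10`, as in the call above, any token seen fewer than 10 times in the training text would be treated the same way as `(7` is here.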

Here are the number of batches, the number of unique tokens in the vocab, the number of items in the training dataset, and the number of tokens in the training set:

In [9]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(1668, 97, 1, 7481251)

This is the start of the mapping from integer IDs to unique tokens.

In [11]:
TEXT.vocab.itos[:100]

['<unk>',
 '<pad>',
 '|',
 '2',
 '=c',
 '=F',
 '^c',
 '^G',
 '=G',
 '=f',
 '^A',
 '=A',
 '^d',
 '=E',
 '=d',
 '=C',
 '=e',
 '^D',
 '/2',
 '^C',
 '=D',
 '=B',
 '>',
 '^F',
 '3',
 '^g',
 '=g',
 '^f',
 ':|',
 '(3',
 '^a',
 '=a',
 '|:',
 "=c'",
 '^A,',
 '4',
 '=A,',
 '^G,',
 '=B,',
 '=G,',
 'K:maj',
 '|1',
 '|2',
 "^c'",
 'M:4/4',
 '<',
 '-',
 '=b',
 'z',
 'M:6/8',
 '6',
 ']',
 '[',
 '=F,',
 "^d'",
 "=d'",
 'K:min',
 '3/2',
 'K:dor',
 '=E,',
 '^F,',
 'M:3/4',
 'M:2/4',
 'K:mix',
 "=f'",
 "=e'",
 '^D,',
 '=D,',
 'M:9/8',
 '^C,',
 '=C,',
 '8',
 '/4',
 'M:12/8',
 '/2>',
 '/2<',
 '2>',
 'M:3/2',
 "^f'",
 '5',
 '3/4',
 '(4',
 "^g'",
 '(2',
 "=g'",
 '/3',
 "^a'",
 "=a'",
 '7',
 '2<',
 '12',
 '7/2',
 '9',
 '(5',
 '/8',
 '16',
 '5/2']

In [12]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['=c']

4

Note that in a `LanguageModelData` object there is only one item in each dataset: all the words of the text joined together.

In [13]:
md.trn_ds[0].text[:12]

['M:4/4', 'K:maj', '|:', '=g', '=f', '=e', '=c', '=d', '2', '=g', '=f', '|']

torchtext will handle turning these words into integer IDs for us automatically.

In [14]:
TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
   44
   40
   32
   26
    9
   16
    4
   14
    3
   26
    9
    2
[torch.cuda.LongTensor of size 12x1 (GPU 0)]
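
The ids in this output can be checked by hand against the `TEXT.vocab.itos` listing above. The sketch below copies only the entries needed for these 12 tokens into a plain dictionary and performs the same lookup without torchtext.

```python
# ids copied from the TEXT.vocab.itos listing above (only the tokens needed here)
stoi = {"<unk>": 0, "|": 2, "2": 3, "=c": 4, "=f": 9, "=d": 14,
        "=e": 16, "=g": 26, "|:": 32, "K:maj": 40, "M:4/4": 44}

tokens = ['M:4/4', 'K:maj', '|:', '=g', '=f', '=e', '=c', '=d', '2', '=g', '=f', '|']

# unknown tokens would map to the id of <unk>
ids = [stoi.get(tok, stoi["<unk>"]) for tok in tokens]
print(ids)
```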

Our `LanguageModelData` object will create batches with 64 columns (that's our batch size), and varying sequence lengths of around 70 tokens (that's our `bptt` parameter - *backprop through time*).

Each batch also contains the exact same data as labels, but one word later in the text - since we're trying to always predict the next word. The labels are flattened into a 1d array.

In [15]:
next(iter(md.trn_dl))

(Variable containing:
    44    39    14  ...     17     7     7
    40    36    16  ...     18     6     8
    32    15     9  ...      5     4    10
        ...          ⋱          ...       
    30    20    14  ...      6    12    69
    31    13     4  ...     22     9    66
     9     8     3  ...     10    27    53
 [torch.cuda.LongTensor of size 77x64 (GPU 0)], Variable containing:
  40
  36
  16
  ⋮ 
   7
   2
   2
 [torch.cuda.LongTensor of size 4928 (GPU 0)])
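
The layout of such a batch can be sketched in plain Python. `batchify` and `get_batch` are hypothetical helpers written for this illustration, and the sequence length is fixed here, whereas the real loader varies it (77 rows in the batch above, with `bptt` set to 70).

```python
def batchify(ids, bs):
    """Split one long id sequence into `bs` equal-length columns."""
    n = len(ids) // bs                          # tokens per column; the remainder is dropped
    cols = [ids[i * n:(i + 1) * n] for i in range(bs)]
    return [[col[r] for col in cols] for r in range(n)]   # rows are time steps

def get_batch(rows, start, seq_len):
    """Return inputs (seq_len x bs) and the flattened next-token targets."""
    seq_len = min(seq_len, len(rows) - 1 - start)
    x = rows[start:start + seq_len]
    # targets are the same positions shifted one step later, flattened row by row
    y = [tok for row in rows[start + 1:start + 1 + seq_len] for tok in row]
    return x, y

ids = list(range(26))
rows = batchify(ids, bs=4)          # 6 rows x 4 columns, 2 tokens dropped
x, y = get_batch(rows, start=0, seq_len=3)
print(x)
print(y)
```

The targets have `seq_len * bs` entries, which is the same relationship as 77 x 64 = 4928 in the output above.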

### Train

In [16]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

Researchers have found that large amounts of *momentum* don't work well with these kinds of *RNN* models, so we create a version of the *Adam* optimizer with less momentum than its default of `0.9`.

In [17]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
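
To get a feel for what lowering the first `betas` value does, the sketch below tracks only Adam's running average of the gradient, without bias correction or the second-moment term. `ema` and `steps_to_flip` are hypothetical helpers, and the gradient sequence is made up for the illustration.

```python
def ema(grads, beta):
    """Running average m = beta*m + (1-beta)*g, as in Adam's first moment."""
    m, out = 0.0, []
    for g in grads:
        m = beta * m + (1 - beta) * g
        out.append(m)
    return out

def steps_to_flip(beta, n_before=10, n_after=10):
    """Steps needed for the average to turn negative after the gradient flips sign."""
    grads = [1.0] * n_before + [-1.0] * n_after
    m = ema(grads, beta)
    return next(i for i, v in enumerate(m[n_before:], start=1) if v < 0)

for beta in (0.9, 0.7):
    print(f"beta1={beta}: average turns negative after {steps_to_flip(beta)} step(s)")
```

A smaller value makes the average forget old gradients sooner, which is what "less momentum" means here.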

fastai uses a variant of the state-of-the-art [AWD LSTM Language Model](https://arxiv.org/abs/1708.02182) developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through [Dropout](https://en.wikipedia.org/wiki/Convolutional_neural_network#Dropout). There is no simple known way (yet!) to find the best values of the dropout parameters below - you just have to experiment...

However, the other parameters (`alpha`, `beta`, and `clip`) shouldn't generally need tuning.

In [19]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
    dropout=0.05, dropouth=0.1, dropouti=0.05, dropoute=0.02, wdrop=0.2)

learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

In [20]:
%%time
learner.fit(3e-3, 1, wds=1e-6)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

[0.      1.44371 1.47092]                                     

CPU times: user 1min 14s, sys: 10.7 s, total: 1min 24s
Wall time: 1min 26s


In [21]:
learner.save_encoder('adam2_enc_l0')

In [22]:
%%time
learner.fit(3e-3, 3, wds=1e-6, cycle_len=1, cycle_mult=2)

HBox(children=(IntProgress(value=0, description='Epoch', max=7), HTML(value='')))

[0.      1.33481 1.32608]                                     
[1.      1.29181 1.28717]                                     
[2.      1.2362  1.23456]                                     
[3.      1.25848 1.26215]                                     
[4.      1.19748 1.20987]                                     
[5.      1.15654 1.16454]                                     
[6.      1.12861 1.15472]                                     

CPU times: user 8min 47s, sys: 1min 14s, total: 10min 2s
Wall time: 10min 3s
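
The progress bar for the call above shows `max=7` epochs even though only 3 cycles were requested. A plausible reading, sketched below, is that each cycle is `cycle_mult` times as long as the previous one, so the cycle lengths are 1, 2 and 4 epochs. `total_epochs` is a hypothetical helper written for this illustration, not a fastai function.

```python
def total_epochs(n_cycle, cycle_len=1, cycle_mult=1):
    """Sum of cycle lengths when each cycle is `cycle_mult` times longer than the last."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycle))

print(total_epochs(3, cycle_len=1, cycle_mult=2))   # the call above: 1 + 2 + 4
print(total_epochs(10, cycle_len=5))                # 10 cycles of 5 epochs each
```

The same arithmetic gives 50 epochs for the later call with 10 cycles of `cycle_len=5`, which matches the `max=50` shown in its progress bar.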


In [23]:
learner.save_encoder('adam2_enc')

In [None]:
learner.fit(3e-3, 10, wds=1e-6, cycle_len=5, cycle_save_name='adam3_10')

HBox(children=(IntProgress(value=0, description='Epoch', max=50), HTML(value='')))

[0.      1.19797 1.21638]                                     
[1.      1.16598 1.17822]                                     
[2.      1.11193 1.14258]                                     
[3.      1.09525 1.11355]                                     
[4.      1.07502 1.10745]                                     
[5.      1.15748 1.17669]                                     
[6.      1.11343 1.14069]                                     
[7.      1.08742 1.10299]                                     
[8.      1.05936 1.07688]                                     
[9.      1.04496 1.07082]                                     
[10.       1.12561  1.13999]                                  
[11.       1.09792  1.10975]                                  
[12.       1.06397  1.07107]                                  
[13.       1.02753  1.0458 ]                                  
[14.       1.01059  1.04052]                                   
[15.       1.11865  1.11324]                          

In [23]:
learner.save_encoder('adam3_10_enc')

In [None]:
learner.fit(3e-3, 8, wds=1e-6, cycle_len=10, cycle_save_name='adam3_5')

epoch      trn_loss   val_loss                                 
    0      1.006562   1.043507  
    1      1.000142   1.037049                                 
    2      0.996996   1.027742                                 
    3      0.981878   1.012572                                 
    4      0.968718   0.995868                                 
    5      0.949195   0.981612                                 
    6      0.927161   0.9698                                   
    7      0.903767   0.961762                                 
    8      0.902105   0.955928                                 
    9      0.92872    0.954165                                 
    10     1.023135   1.030608                                 
    11     1.014453   1.026594                                 
    12     0.998357   1.014735                                 
    13     0.974121   1.00059                                  
    14     0.966552   0.984266                                 
    15 

[0.88200098]

In [None]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=20, cycle_save_name='adam3_20')

epoch      trn_loss   val_loss                                 
    0      0.829702   0.881273  
    1      0.951012   0.970356                                 
    2      0.967439   0.971461                                 
    3      0.947978   0.969625                                 
    4      0.967008   0.96368                                  
    5      0.930247   0.955776                                 
    6      0.915778   0.954348                                 
    7      0.931548   0.939423                                 
    8      0.916852   0.930579                                 
    9      0.90054    0.918163                                 
    10     0.87822    0.910958                                 
    11     0.89871    0.899078                                 
    12     0.861625   0.891122                                 
    13     0.857402   0.883632                                 
    14     0.866496   0.875259                                 
    15 

In [None]:
learner.save_encoder('adam3_20_enc')

In [None]:
learner.save('adam3_20')

Language modeling accuracy is generally measured using the metric *perplexity*, which is simply `exp()` of the loss function we used.

In [42]:
math.exp(4.165)

64.3926824434624
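
As a sketch of the same calculation on a number from this notebook, the snippet below takes a validation loss from the training logs above (0.881273, the first row of the last log) and exponentiates it. The helper name `perplexity` is only for illustration, and this assumes the logged loss is the mean cross-entropy measured in nats.

```python
import math

def perplexity(loss):
    """Perplexity is exp() of the average cross-entropy loss (natural log)."""
    return math.exp(loss)

val_loss = 0.881273   # validation loss copied from the training log above
print(round(perplexity(val_loss), 3))
```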

In [43]:
with open(f'{PATH}models/TEXT.pkl','wb') as f:
    pickle.dump(TEXT, f)

### Test

We can play around with our language model a bit to check it seems to be working OK. First, let's create a short bit of text to 'prime' a set of predictions. We'll use our torchtext field to numericalize it so we can feed it to our language model.

In [13]:
def proc_str(s): return TEXT.preprocess(TEXT.tokenize(s))
def num_str(s): return TEXT.numericalize([proc_str(s)])

In [14]:
m=learner.model

Let's see if our model can generate a bit more text all by itself!

In [19]:
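def sample_model(m, s, l=250):
    t = num_str(s)          # numericalize the priming string
    m[0].bs=1               # generate one sequence at a time
    m.eval()                # turn off dropout
    m.reset()               # reset the hidden state
    res,*_ = m(t)
    print('...', end='')

    for i in range(l):
        # take the two most likely next tokens...
        n=res[-1].topk(2)[1]
        # ...and fall back to the second one if the first is <unk> (id 0)
        n = n[1] if n.data[0]==0 else n[0]
        word = TEXT.vocab.itos[n.data[0]]
        print(word, end=' ')
        if word=='<eos>': break
        # feed the chosen token back in to predict the next one
        res,*_ = m(n[0].unsqueeze(0))

    m[0].bs=bs              # restore the training batch size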
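
The token-selection rule inside `sample_model` can be isolated in a few lines of plain Python. `pick_next` is a hypothetical helper that mirrors the `topk(2)` step on an ordinary list of scores, and the scores below are made up.

```python
def pick_next(scores, unk_id=0):
    """Return the index of the best score, or the runner-up if the best is `unk_id`."""
    top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
    return top2[1] if top2[0] == unk_id else top2[0]

print(pick_next([0.1, 0.0, 0.7, 0.2]))   # best index is 2, so it is used
print(pick_next([0.6, 0.0, 0.3, 0.1]))   # best index is 0 (<unk>), so the runner-up is used
```

Because the rule always takes the highest-scoring allowed token, there is no randomness in the choice itself.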

In [20]:
sample_model(m,"M:4/4")

...k:maj |: (3 =g, =a, =b, | =c 2 =e > =c =g > =c =e > =c | =d > =c =b, > =c =d > =e =f > =d | =c 2 =e > =c =g > =c =e > =c | =d > =c =b, > =a, =g, 2 (3 =g, =a, =b, | =c 2 =e > =c =g > =c =e > =c | =d > =c =b, > =c =d > =e =f > =d | =g > =c =b > =a =g > =f =e > =d | =c 2 =c 2 =c 2 :| |: (3 =g =a =b | =c 2 =e > =c =g > =c =e > =c | =d > =c =b > =a =g > =f =e > =d | =c 2 =e > =c =g > =c =e > =c | =d > =c =b > =a =g > =f =e > =d | =c 2 =e > =c =g > =c =e > =c | =d > =c =b > =a =g > =f =e > =d | =c > =e =g > =c =a > =f =d > =b, | =c 2 =e 2 =c 2 :| m:4/4 k:maj |: (3 =g, =a, =b, | =c 2 =e > =c =g > =c =e > =c | =d > =c =b, > =c =d > =e =f > =d | =c 2 =e > =c =g > =c =e > =c | =d > 