# Building Model 1
The first architecture is based on a language model. <br>
- words input - receives sequences of 10 words each, embeds them, and concatenates the embeddings with the melody input
- melody input - receives a vector summarizing the melody, extracted from a pretty_midi object <br>
**Output** - a vector of probabilities for the next word
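
The data flow described above can be traced with a shape-only NumPy sketch. This is an illustration under assumptions, not the architecture built in Model1Base.py: the melody vector is assumed to be repeated at every time step before concatenation, and a mean over time steps stands in for whatever recurrent layers the real model uses. The sizes (10 words, 300-dimensional embeddings, 297 melody features, 7475 embedding rows) are the ones used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, EMB_DIM, MELODY_DIM, VOCAB_ROWS = 10, 300, 297, 7475

embedding = rng.normal(size=(VOCAB_ROWS, EMB_DIM))   # random stand-in for pretrained weights
word_ids = rng.integers(1, VOCAB_ROWS, size=SEQ_LEN)  # one input sequence of 10 word ids
melody = rng.normal(size=MELODY_DIM)                  # one summarized melody vector

embedded = embedding[word_ids]                        # (10, 300)
melody_tiled = np.tile(melody, (SEQ_LEN, 1))          # (10, 297), assumption: repeat per step
features = np.concatenate([embedded, melody_tiled], axis=1)  # (10, 597)

# Placeholder for the recurrent part: average over time steps.
pooled = features.mean(axis=0)                        # (597,)
W = rng.normal(scale=0.01, size=(features.shape[1], VOCAB_ROWS))
logits = pooled @ W

# Softmax turns the logits into a probability vector over the vocabulary rows.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(features.shape, probs.shape)
```

The point of the sketch is only the shapes: each time step carries 300 word features plus 297 melody features, and the output has one probability per vocabulary row.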

In [2]:
import pandas as pd
import numpy as np
import Model1Base as mb
from nltk.tokenize import RegexpTokenizer

Using TensorFlow backend.


Loading the training data

In [3]:
df = pd.read_csv("data/lyrics_train_set.csv",header=None)
df = df.fillna('')
df[2] = df[2] + df[3] + df[4] + df[5] + df[6] 
df=df.drop([3,4,5,6],axis=1)
df.columns=['singer','song','lyrics']

## Cleaning the text and adding tokens

In [4]:
df['clean_lyrics'] = df.apply(lambda row: mb.clean_text(row.lyrics),axis=1)
df['singer_song']= df.apply(lambda row: mb.clean_singer_song(row['singer'],row['song']),axis=1)
tokenizer = RegexpTokenizer(r'\w+|&+')
df["tokens"] = df["clean_lyrics"].apply(tokenizer.tokenize)
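
To see what the pattern `\w+|&+` keeps, the sketch below applies it with the standard `re` module. In its default (non-gap) mode `RegexpTokenizer` returns the regex matches, so `re.findall` should behave the same way for this pattern. The sample sentence is made up for illustration.

```python
import re

# Same pattern as passed to RegexpTokenizer above: runs of word
# characters, or runs of ampersands.
PATTERN = r"\w+|&+"

sample = "hello & goodbye, don't go"  # illustrative text, not from the dataset
tokens = re.findall(PATTERN, sample)
print(tokens)
```

This prints `['hello', '&', 'goodbye', 'don', 't', 'go']`: punctuation is dropped, `&` survives as its own token, and the apostrophe splits "don't" into two tokens.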

Loading previously created MIDI vectors
<br>*The function that creates a MIDI vector is defined in Model1Base.py

In [5]:
midi_df = pd.read_pickle("data/melody_df.pkl")
df_concat=pd.merge(df,midi_df,how='inner', left_on='singer_song', right_on='filename')

## Creating Vocabulary

In [6]:
all_words = [word for tokens in df_concat["tokens"] for word in tokens]
sentence_lengths = [len(tokens) for tokens in df_concat["tokens"]]
VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(VOCAB)))
print("Max sentence length is %s" % max(sentence_lengths))

176682 words total, with a vocabulary size of 7474
Max sentence length is 1481


Setting parameters for the model

In [7]:
EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 10
VOCAB_SIZE = len(VOCAB)
VALIDATION_SPLIT=.2

**Creating word sequences for input**

In [8]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(df_concat["clean_lyrics"].tolist())
sequences = tokenizer.texts_to_sequences(df_concat["clean_lyrics"].tolist())
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 7474 unique tokens.
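
The indexing convention matters for the steps that follow. In the Keras `Tokenizer`, `word_index` starts at 1 and is ordered by descending frequency, index 0 is left free for padding, and `num_words=N` keeps only words whose index is below `N`. The pure-Python sketch below mimics that convention on made-up text; `build_word_index` and `texts_to_ids` are hypothetical helpers written for this illustration, not Keras functions.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count; ties are broken alphabetically here,
    # a simplification to keep the toy output deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index, num_words=None):
    """Convert texts to id lists, dropping ids >= num_words when it is given."""
    result = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        result.append(ids)
    return result

toy_texts = ["la la la love", "love me do"]  # toy lyrics
toy_index = build_word_index(toy_texts)
toy_ids = texts_to_ids(toy_texts, toy_index, num_words=len(toy_index))
print(toy_index)
print(toy_ids)
```

With four distinct words and `num_words=4`, the word ranked 4 ("me") is dropped from the second sequence. Under this convention, passing `num_words=len(vocabulary) + 1` would keep every word.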


## Creating Song sequences

In [9]:
song_index=[]
sequences_list=[]
for i,seq in enumerate(sequences):
    for j in range(1, len(seq)):
        for z in range(MAX_SEQUENCE_LENGTH):
            sequence = seq[j:j+z+2]
            sequences_list.append(np.array(sequence))
            song_index.append(i)
print('Total Sequences: %d' % len(sequences_list))

Total Sequences: 1760980
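
The nested loop is easier to follow on a toy song. The sketch below copies the same slicing logic into a hypothetical helper, `make_windows`, and runs it with a window limit of 3 instead of `MAX_SEQUENCE_LENGTH = 10`.

```python
def make_windows(seq, max_len):
    """Replicate the slicing loop above for a single song."""
    windows = []
    for j in range(1, len(seq)):          # start positions 1 .. len(seq) - 1
        for z in range(max_len):          # window lengths 2 .. max_len + 1
            windows.append(seq[j:j + z + 2])
    return windows

toy_seq = [1, 2, 3, 4, 5]
toy_windows = make_windows(toy_seq, max_len=3)
for window in toy_windows:
    print(window)
print(len(toy_windows))
```

Two properties of the loop show up here. Start positions begin at 1, so the first token of a song never appears in any window. Near the end of the song the slice is cut short, so the same truncated window is appended several times (the last three windows are all `[5]`). Each song of length n contributes `max_len * (n - 1)` windows, 12 in this example.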


Padding sequences to max_length

In [10]:
from keras.preprocessing.sequence import pad_sequences
max_length = max([len(seq) for seq in sequences_list])
sequences_pad = pad_sequences(sequences_list, maxlen=max_length, padding='pre')

Splitting the padded data into model inputs (X) and next-word targets (Y)

In [11]:
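data = np.array(sequences_pad)
song_index = np.array(song_index)
X = data[:, :-1]  # every token except the last is model input
Y = data[:, -1]   # the last token of each window is the target
midi_data = df_concat[list(range(297))].values  # melody features (columns 0-296)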
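
Pre-padding is what makes the `[:, :-1]` / `[:, -1]` split work: because zeros are added on the left, the last real token of every window always lands in the final column. The NumPy sketch below reproduces that behaviour on toy windows; `pad_pre` is a hypothetical stand-in for `pad_sequences(..., padding='pre')`, not the Keras function itself.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad each sequence with zeros to maxlen, keeping its last maxlen items."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]
        out[row, maxlen - len(trimmed):] = trimmed
    return out

toy_seqs = [[2, 3], [2, 3, 4], [2, 3, 4, 5]]
toy_pad = pad_pre(toy_seqs, maxlen=4)

toy_X = toy_pad[:, :-1]  # inputs: everything but the last column
toy_Y = toy_pad[:, -1]   # targets: the last column
print(toy_pad)
print(toy_X)
print(toy_Y)
```

Here the targets come out as `[3 4 5]`, the final token of each window, while shorter windows are filled with leading zeros in the input.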

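Using pretrained 300-dimensional word vectors (loaded in word2vec text format) to embed all words; words missing from the pretrained vocabulary get random vectors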

In [12]:
from gensim.models import KeyedVectors

word2vec = KeyedVectors.load_word2vec_format('data/wiki-news-300d-1M.vec')
embedding_weights = np.zeros((len(word_index)+1, EMBEDDING_DIM))
for word,index in word_index.items():
    embedding_weights[index,:] = word2vec[word] if word in word2vec else np.random.uniform(-1,1,EMBEDDING_DIM)
print(embedding_weights.shape)

(7475, 300)
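
The extra row in the shape `(7475, 300)` comes from `len(word_index) + 1`: word indices start at 1, so row 0 stays all zeros and lines up with the padding value. A small sketch of the same construction follows, with a plain dictionary as a hypothetical stand-in for the pretrained vectors and a 4-dimensional embedding to keep it readable.

```python
import numpy as np

EMB_DIM = 4  # small illustrative dimension (the notebook uses 300)
rng = np.random.default_rng(0)

# Hypothetical stand-in for the pretrained KeyedVectors lookup.
pretrained = {"love": np.ones(EMB_DIM), "me": np.full(EMB_DIM, 0.5)}
toy_word_index = {"love": 1, "me": 2, "zzyzx": 3}

weights = np.zeros((len(toy_word_index) + 1, EMB_DIM))  # row 0 reserved for padding
oov_words = []
for word, idx in toy_word_index.items():
    if word in pretrained:
        weights[idx] = pretrained[word]
    else:
        # Same fallback as above: a random vector drawn uniformly from [-1, 1).
        weights[idx] = rng.uniform(-1, 1, EMB_DIM)
        oov_words.append(word)

print(weights.shape)
print(oov_words)
```

Collecting the out-of-vocabulary words, as `oov_words` does here, is an optional addition that makes it easy to check how much of the vocabulary ends up with random vectors.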


In [13]:
x_train, midi_train, y_train, x_test, midi_test, y_test = mb.create_training_data(song_index, midi_data, X, Y)
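
`mb.create_training_data` is not shown in this notebook, so its logic is unknown here. Since it receives `song_index`, one plausible approach, assumed purely for illustration, is to hold out whole songs so that windows from the same song never appear in both training and validation data. The helper `split_by_song` below is hypothetical.

```python
import numpy as np

def split_by_song(song_index, validation_split, seed=0):
    """Return boolean masks (train, val) that keep each song in a single split."""
    rng = np.random.default_rng(seed)
    songs = np.unique(song_index)
    n_val = max(1, int(round(len(songs) * validation_split)))
    val_songs = rng.choice(songs, size=n_val, replace=False)
    val_mask = np.isin(song_index, val_songs)
    return ~val_mask, val_mask

# One entry per window, giving the song it came from (toy data).
toy_song_index = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 4])
train_mask, val_mask = split_by_song(toy_song_index, validation_split=0.2)

print(np.unique(toy_song_index[train_mask]))
print(np.unique(toy_song_index[val_mask]))
```

The same masks could then index the word inputs, the targets, and the per-window melody rows, so all three stay aligned.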

## Model Creation
*The full function is in Model1Base.py

In [None]:
model = mb.build_model(word_index, embedding_weights)

**Model Training** (first iteration example)

In [15]:
history = model.fit(
    [x_train, midi_train],
    y_train,
    batch_size=256,
    epochs=10,
    validation_data=([x_test, midi_test], y_test),
)  # optional: callbacks=[early_stopping_monitor, checkpoint]


Train on 1734290 samples, validate on 26690 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Saving the model

In [None]:
import datetime
now = datetime.datetime.now()
datestr=now.strftime("%Y_%m_%d__%H%M")

name='model_'+datestr
mb.save_model(model,name)