# Generate Embeddings

This notebook contains code snippets that outline how to generate embeddings for several fields in songs.csv: song, artist, composer, lyricist, etc. The method is to take one of the columns as input and train a model to predict the other three columns.

There are two ways to encode the individual rows of the input column: (1) a char-rnn that reads each value character by character, or (2) a one-hot encoding in which every unique input element is treated as distinct from all the others. The advantage of (1) is that it captures text-level similarity between names, whereas (2) is faster to train and avoids capturing misleading features.
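
To make the contrast between the two encodings concrete, here is a minimal sketch on three made-up names (they are not taken from songs.csv). The helpers `char_encode` and `one_hot_encode` are hypothetical names used only for this illustration. Under the character view two similarly spelled names share most of their positions, while under the one-hot view they have nothing in common.

```python
# Toy names; not taken from songs.csv.
names = ["John Smith", "John Smyth", "Adele"]

# (1) Character-level view: each name becomes a sequence of character ids,
# which is the kind of input a char-rnn would consume.
char_to_id = {c: i for i, c in enumerate(sorted(set("".join(names))))}

def char_encode(name):
    return [char_to_id[c] for c in name]

# (2) One-hot view: each distinct name gets a single index of its own.
entity_to_id = {name: i for i, name in enumerate(names)}

def one_hot_encode(name):
    vec = [0] * len(entity_to_id)
    vec[entity_to_id[name]] = 1
    return vec

a, b = char_encode(names[0]), char_encode(names[1])
# Number of positions at which the two character sequences agree.
shared_positions = sum(x == y for x, y in zip(a, b))
# Dot product of the two one-hot vectors; zero means no shared information.
overlap = sum(x * y for x, y in zip(one_hot_encode(names[0]), one_hot_encode(names[1])))

print(shared_positions, "of", len(a), "character positions match")
print("one-hot overlap:", overlap)
```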

In [1]:
import keras
from keras.models import Model
from keras.layers import Input, Dense, Dropout
import pandas as pd
import numpy as np
import scipy.sparse

Using TensorFlow backend.


## Loading Data

In [2]:
data = pd.read_csv('data/songs.csv').sample(n=1000).fillna('')
print("Data Loaded")

Data Loaded


Let's start by looking at the number of distinct characters and the number of distinct entities in each column. This will (hopefully) help in deciding which of the two approaches to use.

In [3]:
def get_unique_chars(data, column):
    char_set = set([c for (i, row) in data.iterrows() for c in str(row[column])])
    return len(char_set)

# Some of the rows corresponding to a column have multiple values separated by '|'
# character. We need to split and separate these multiple values

def get_unique_entities(data, column):
    unique = set([name.strip() for (i, row) in data.iterrows() for name in str(row[column]).split('|')])
    return unique
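
As a quick check of the splitting rule described in the comment above, the sketch below applies the same `split('|')` plus `strip()` logic to single cells. `split_entities` is a hypothetical helper and the cell values are made up. Note that an empty cell (which is what `fillna('')` produces) still yields one entity, the empty string.

```python
def split_entities(cell):
    # Same rule as get_unique_entities: split on '|' and trim whitespace.
    return [name.strip() for name in str(cell).split('|')]

# A cell holding two names separated by '|'
print(split_entities("Artist A| Artist B"))
# An empty cell still produces a single (empty) entity
print(split_entities(""))
```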

In [4]:
#num_chars_artist_name = get_unique_chars(data, 'artist_name')
#num_chars_composer = get_unique_chars(data, 'composer')
#num_chars_lyricist = get_unique_chars(data, 'lyricist')
#num_chars_song_id = get_unique_chars(data, 'song_id')

In [5]:
#unique_artists = get_unique_entities(data, 'artist_name')
#unique_composers = get_unique_entities(data, 'composer')
#unique_lyricists = get_unique_entities(data, 'lyricist')
#unique_songs = get_unique_entities(data, 'song_id')
#print("Unique elements identified")

In [23]:
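def to_one_hot(batch_rows, mappers):
    # batch_rows: DataFrame whose columns line up with `mappers` (input column first)
    batch_size = batch_rows.shape[0]
    # One (batch_size, vocabulary size) matrix per column
    one_hot = [np.zeros((batch_size, len(mappers[i])))
               for i in range(batch_rows.shape[1])]

    for row_num, (_, row) in enumerate(batch_rows.iterrows()):
        for (i, element) in enumerate(row):
            # A cell may hold several '|'-separated values; set one bit per value
            parts = [p.strip() for p in str(element).split('|')]
            for p in parts:
                # Values missing from the mapper fall back to the '<unk>' index
                one_hot[i][row_num][mappers[i].get(p, mappers[i]['<unk>'])] = 1

    # (input matrix, list of target matrices)
    return (one_hot[0], one_hot[1:])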

def generate_mapper(data, column):
    unique_elements = get_unique_entities(data, column)
    mapper = dict()
    mapper['<unk>'] = 0
    for u in unique_elements:
        mapper[u] = len(mapper)
    return mapper
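
The following is a small worked example of the mapper and the resulting encoding, using made-up composer names. `build_mapper` and `encode_cell` are hypothetical stand-ins that mirror `generate_mapper` and one cell of `to_one_hot`. Two things here are assumptions of the sketch rather than the notebook's behaviour: the entities are sorted so that the indices are reproducible (the notebook iterates over a set, whose order is arbitrary), and unseen values fall back to `<unk>`.

```python
def build_mapper(entities):
    # Index 0 is reserved for unknown values, as in generate_mapper.
    mapper = {'<unk>': 0}
    # sorted() is an addition so that index assignment is deterministic.
    for e in sorted(entities):
        mapper[e] = len(mapper)
    return mapper

def encode_cell(cell, mapper):
    # One position per known entity; a cell with several values sets several bits.
    vec = [0] * len(mapper)
    for part in str(cell).split('|'):
        vec[mapper.get(part.strip(), mapper['<unk>'])] = 1
    return vec

composer_toy = build_mapper({"Composer A", "Composer B", "Composer C"})
print(composer_toy)
print(encode_cell("Composer A| Composer C", composer_toy))
print(encode_cell("Someone Else", composer_toy))
```

A cell with two names produces a vector with two ones, so the targets are multi-hot rather than strictly one-hot whenever a row lists several composers or lyricists.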

In [7]:
artist_mapper = generate_mapper(data, 'artist_name')
composer_mapper = generate_mapper(data, 'composer')
lyricist_mapper = generate_mapper(data, 'lyricist')
song_mapper = generate_mapper(data, 'song_id')
mappers = [artist_mapper, composer_mapper, lyricist_mapper, song_mapper]

In [8]:
#oh_artist = to_one_hot(data.artist_name, artist_mapper)
#oh_composer = to_one_hot(data.composer, composer_mapper)
#oh_lyricist = to_one_hot(data.lyricist, lyricist_mapper)
#oh_song = to_one_hot(data.song_id, song_mapper)
#print("Input-output matrices generated")

## Creating the model


We will start by creating a simple MLP model with one hidden layer. This corresponds to idea (2).

Changeable parameters:

* `num_hidden_units`
* `hidden_activation`
* `dropout`
* `batch_size`

In [24]:
def batch_generator(data, input_columns, target_columns, mappers, batch_size):
    num_rows = data.shape[0]
    all_columns = input_columns+target_columns
    permutation = np.random.permutation(num_rows)
    count = 0
    while True:
        start = count*batch_size
        if start >= num_rows:
            # Epoch finished: reshuffle the rows and start again from the top
            permutation = np.random.permutation(num_rows)
            count, start = 0, 0
        batch_indices = permutation[start:min(start+batch_size, num_rows)]
        batch = data[all_columns].iloc[batch_indices]
        # Advance to the next slice so successive calls yield different batches
        count += 1
        yield to_one_hot(batch, mappers)
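
The batching logic can be checked in isolation from pandas and Keras. The sketch below is a hypothetical `batch_index_generator` that yields only row indices: it walks through a shuffled order in slices of `batch_size` and reshuffles once every row has been seen. The names, the seed argument, and the toy sizes are assumptions made for illustration.

```python
import math
import random

def batch_index_generator(num_rows, batch_size, seed=0):
    rng = random.Random(seed)
    order = list(range(num_rows))
    rng.shuffle(order)
    start = 0
    while True:
        if start >= num_rows:
            # Every row has been yielded once: reshuffle for the next epoch.
            rng.shuffle(order)
            start = 0
        # The last slice of an epoch may be shorter than batch_size.
        yield order[start:start + batch_size]
        start += batch_size

num_rows, batch_size = 10, 4
# Number of batches needed to cover all rows once.
steps_per_epoch = math.ceil(num_rows / batch_size)
gen = batch_index_generator(num_rows, batch_size)
first_epoch = [next(gen) for _ in range(steps_per_epoch)]
print([len(b) for b in first_epoch])
```

Rounding the number of steps up, rather than dividing directly, keeps it an integer and ensures the final partial batch is included in each epoch.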

In [25]:
batch_size = 64
input_generator = batch_generator(data, ['artist_name'],
                    ['composer', 'lyricist', 'song_id'],
                    mappers, batch_size)

In [17]:
input_col = 'artist_name'
input_shape = len(mappers[0])
output_shapes = [len(mappers[1]), len(mappers[2]), len(mappers[3])]
num_hidden_units = 100
hidden_activation = 'relu'
dropout = 0.0  # fraction of hidden units to drop (Keras `rate`); 0.0 disables dropout
batch_size = 64

input_features = Input(shape = (input_shape,))
hidden = Dropout(dropout)(
    Dense(num_hidden_units,activation=hidden_activation)(input_features))
output_0 = Dense(output_shapes[0], activation='softmax')(hidden)
output_1 = Dense(output_shapes[1], activation='softmax')(hidden)
output_2 = Dense(output_shapes[2], activation='softmax')(hidden)

model = keras.models.Model(inputs = [input_features],
                           outputs = [output_0, output_1, output_2])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print("model compiled")

model compiled
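
The notebook does not show how an embedding is read out of the trained model. One common reading, assumed here, is that row k of the hidden layer's weight matrix serves as the embedding of entity k, because multiplying a one-hot vector by a matrix selects a single row. The NumPy sketch below demonstrates that identity with a random stand-in matrix `W`; these are not trained weights, and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_hidden_units = 5, 3

# Stand-in for the hidden layer's kernel; real values would come from training.
W = rng.normal(size=(vocab_size, num_hidden_units))

# One-hot vector for the entity with index 2.
entity_index = 2
one_hot = np.zeros(vocab_size)
one_hot[entity_index] = 1.0

# Multiplying a one-hot row vector by W picks out a single row of W.
projected = one_hot @ W
print(np.allclose(projected, W[entity_index]))
```

In Keras, a `Dense` layer's `get_weights()` returns the kernel, shaped `(input_dim, units)`, followed by the bias, so the kernel plays the role of `W` here. Whether to use the raw row or the hidden activation (after bias and `relu`) as the embedding is a design choice the notebook leaves open.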


In [12]:
import IPython.display
from keras.utils import plot_model
plot_model(model, to_file='./model.png')

![model-visualization](./model.png)

In [26]:
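# steps_per_epoch must be a whole number of batches; round up to include the last partial batch
model.fit_generator(input_generator, steps_per_epoch=int(np.ceil(data.shape[0]/batch_size)), epochs=3)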

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x2b73a56d8c88>

### Todo
* Make the code efficient at large scale
* Store the one-hot matrices as sparse CSR matrices (`scipy.sparse`) instead of dense arrays
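
As a starting point for the CSR item, here is a sketch of how one column could be encoded in compressed sparse row form without allocating a dense matrix. `to_csr_parts` is a hypothetical helper, the mapper and cells are toy values, and the fallback to `<unk>` is an assumption. The three lists it returns follow the standard CSR layout: `indptr[r]:indptr[r+1]` marks the slice of `indices` holding the non-zero columns of row `r`.

```python
def to_csr_parts(cells, mapper):
    values, indices, indptr = [], [], [0]
    for cell in cells:
        # Column indices of the non-zero entries in this row, without duplicates.
        cols = sorted({mapper.get(p.strip(), mapper['<unk>'])
                       for p in str(cell).split('|')})
        indices.extend(cols)
        values.extend([1] * len(cols))
        # Running total of non-zeros marks where the next row starts.
        indptr.append(len(indices))
    return values, indices, indptr

toy_mapper = {'<unk>': 0, 'A': 1, 'B': 2, 'C': 3}
values, indices, indptr = to_csr_parts(["A|C", "B", "Z"], toy_mapper)
print(values)
print(indices)
print(indptr)
```

These lists can be passed to `scipy.sparse.csr_matrix((values, indices, indptr), shape=(num_rows, len(mapper)))` to obtain a sparse matrix. Memory then grows with the number of non-zero entries rather than with rows times vocabulary size.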