## Load data

In [1]:
with open("../data/train/lyrics/lyrics.txt", "r") as f:
    lyrics = f.read()
with open("../data/train/lyrics/labels.txt", "r") as f:
    labels = f.read()

In [2]:
print(lyrics[:2000])
print()
print(labels[:20])

Gently hold our hands Gently hold our heads on high  Aimless time in fear new hide Overthrow the plan Confusion lies in all my words Mad is the soul  We barricade ourselves in holes of temperament This is the dawning of a new age A heart that beats the wrong way Insanity's crescendo  Windcolour, second sight A touch of silence and the violence of dark Illusion span, the aroma of time Shadowlife and the scent of nothingness  Infinite fall of instinct Order of one spells deceit Infin
We are the Sun We are the dead stars We are the black sky Invading your room We are the candle The only light We are the machines of the past Forever victims and murderers of your joy  We are Death The ancient knowledge The source of origin The red and white sacred hatred Enthroned, materialized The wrath of heaven and hell united in one Worship us, be faithful Beautiful great and cursed  Vexilla regis prodeunt, fulget crucis mysterium Vexilla regis prodeunt inferni  We are the... Mother of suffering Bringer

## Data pre-processing

In [55]:
from string import punctuation

print(punctuation)
# lowercase the text, then remove punctuation
lyrics = lyrics.lower()
all_text = ''.join([c for c in lyrics if c not in punctuation])

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [56]:
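# one lyric per line; strip trailing newlines so no empty final entry is left
# (list.remove('') raises ValueError when the file has no trailing newline)
lyrics_split = all_text.rstrip('\n').split('\n')
all_text = ' '.join(lyrics_split)
words = all_text.split()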
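The punctuation stripping above can also be written with `str.translate`, which avoids the per-character Python loop. This is a minimal sketch on a made-up sample string (the names `sample`, `table` and `cleaned` are illustrative and not part of the notebook), not a replacement for the cells above:

```python
from string import punctuation

# Hypothetical sample text, used only to illustrate the technique
sample = "Insanity's crescendo  Windcolour, second sight"

# Translation table that deletes every punctuation character
table = str.maketrans("", "", punctuation)

cleaned = sample.lower().translate(table)
print(cleaned)
```

Apostrophes are deleted rather than replaced by a space, so "insanity's" becomes "insanitys", which matches what the list-comprehension version does.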

## Encoding 

In [57]:
# encoding lyrics
from collections import Counter
# map words to integers, most frequent first; indices start at 1 so 0 stays free for padding
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}
# tokenize each lyric
lyrics_ints = []
for lyric in lyrics_split:
    lyrics_ints.append([vocab_to_int[word] for word in lyric.split()])

In [58]:
print(len(vocab_to_int))
print()
print(lyrics_ints[:1])

3704

[[1154, 153, 105, 193, 1154, 153, 105, 1155, 15, 201, 1695, 48, 9, 278, 279, 255, 1696, 3, 501, 1156, 308, 9, 16, 8, 354, 717, 18, 3, 309, 31, 1697, 885, 9, 1157, 11, 1698, 27, 18, 3, 1158, 11, 7, 279, 1699, 7, 91, 12, 886, 3, 256, 70, 1700, 1701, 1702, 887, 454, 7, 194, 11, 718, 4, 3, 1703, 11, 310, 1704, 1705, 3, 1706, 11, 48, 1707, 4, 3, 1708, 11, 1709, 1710, 166, 11, 1711, 1712, 11, 42, 1159, 1713, 1714]]
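To make the mapping easier to follow, here is the same frequency-ranked encoding applied to a tiny made-up corpus. The `toy_*` names and the three example lines are assumptions for illustration only; they do not come from the dataset.

```python
from collections import Counter

# Hypothetical toy corpus standing in for lyrics_split
toy_lyrics = ["we are the sun", "we are the dead stars", "the sun"]

toy_words = " ".join(toy_lyrics).split()
toy_counts = Counter(toy_words)

# Most frequent word gets index 1; index 0 is never assigned to a word
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

# Encode each lyric as a list of integer ids
toy_ints = [[toy_vocab_to_int[w] for w in lyric.split()] for lyric in toy_lyrics]

print(toy_vocab_to_int)
print(toy_ints)
```

Because "the" occurs most often in the toy corpus, it receives id 1. Words with equal counts keep their first-seen order, since Python's sort is stable.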


In [60]:
# encoding labels
import numpy as np

# one label per line; strip trailing newlines so no empty final entry is left
labels_split = labels.rstrip('\n').split('\n')
encoded_labels = np.array([int(label) for label in labels_split])

## Removing Outliers 

In [63]:
# lyric length stats, used to spot outliers
lyrics_lens = Counter([len(x) for x in lyrics_ints])
print("Zero-length lyrics: {}".format(lyrics_lens[0]))
print("Maximum lyrics length: {}".format(max(lyrics_lens)))

Zero-length lyrics: 0
Maximum lyrics length: 265


There are no zero-length lyrics, so no entries need to be removed. The longest lyric has 265 tokens.
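If a dataset did contain empty lyrics, they would have to be dropped together with their labels so the two stay aligned. The following is a sketch on made-up data (`toy_ints`, `toy_labels` and `keep_idx` are hypothetical names), not a step this notebook needs to run:

```python
import numpy as np

# Hypothetical toy data: the second lyric is empty
toy_ints = [[2, 3, 1], [], [1, 4]]
toy_labels = np.array([1, 0, 1])

# Indices of lyrics that contain at least one token
keep_idx = [i for i, lyric in enumerate(toy_ints) if len(lyric) > 0]

# Filter lyrics and labels with the same indices so they stay aligned
toy_ints = [toy_ints[i] for i in keep_idx]
toy_labels = toy_labels[keep_idx]

print(len(toy_ints), toy_labels)
```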

## Padding sequences

In [65]:
def pad_features(lyrics_ints, seq_length):
    ''' Return features of lyrics_ints, where each lyric is left-padded with 0's 
        or truncated to the input seq_length.
    '''
    # one row per lyric, seq_length columns, initialised to the padding value 0
    features = np.zeros((len(lyrics_ints), seq_length), dtype=int)
    
    for i, row in enumerate(lyrics_ints):
        row = row[:seq_length]  # keep at most seq_length tokens
        if len(row) > 0:  # an empty lyric stays all zeros
            features[i, -len(row):] = row
    
    return features

In [86]:
seq_length = 200

features = pad_features(lyrics_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features)==len(lyrics_ints), "Your features should have as many rows as lyrics."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print the first 50 values of the first 30 rows
print(features[:30,:50])

[[   0    0    0 ...    0    0    0]
 [   0    0    0 ...    0    0    0]
 [   0    0    0 ...    0    0    0]
 ...
 [   0    0    0 ...    0   24    1]
 [   0    0    0 ... 1883    4 1884]
 [   0    0    0 ...    0    0    0]]
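The printed matrix is mostly zeros because short lyrics are padded on the left. A small standalone demo makes the padding and truncation behaviour visible. The helper `pad_left` below is a hypothetical, self-contained stand-in for `pad_features`, written so the snippet runs on its own, and the input rows are made up:

```python
import numpy as np

def pad_left(rows, seq_length):
    """Left-pad each row with 0's or truncate it to seq_length tokens."""
    out = np.zeros((len(rows), seq_length), dtype=int)
    for i, row in enumerate(rows):
        kept = row[:seq_length]        # keep at most seq_length tokens
        if len(kept) > 0:              # empty rows stay all zeros
            out[i, -len(kept):] = kept
    return out

# One short row, one row longer than seq_length, one empty row
demo = pad_left([[5, 6], [1, 2, 3, 4, 5, 6], []], seq_length=4)
print(demo)
```

The short row ends up right-aligned behind two zeros, the long row keeps only its first four tokens, and the empty row is all padding.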


## Train, Validation, Test