# In this notebook I build and train the final rap lyric model.

I'll start by importing the necessary packages.

In [None]:
import pandas as pd
import numpy as np

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional, BatchNormalization, GRU
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku
import string, os
import warnings
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")
# category must be a Warning class, not a string
warnings.simplefilter(action='ignore', category=FutureWarning)

Now I'll import the (mostly) cleaned csv file containing the rap lyrics.

In [None]:
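import ast

# literal_eval parses the stored list literals without executing arbitrary code, unlike eval
rap_df = pd.read_csv('/content/drive/MyDrive/rap_df.csv', converters={'lyrics': ast.literal_eval})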
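
The `lyrics` column is stored in the CSV as the text of a Python list, so a converter is needed to turn each cell back into a real list. Here is a minimal sketch of that conversion on one made-up cell value, using `ast.literal_eval` from the standard library, which only parses literals and so is a safer choice than `eval`. The sample string is an assumption for illustration, not a row from the dataset.

```python
import ast

# Hypothetical cell value, shaped like the text stored in the 'lyrics' column
raw_cell = "['Awww yeah!', 'We in the place tonight']"

# literal_eval turns the text back into a list of strings
parsed = ast.literal_eval(raw_cell)

print(type(parsed).__name__, len(parsed))
```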

In [None]:
rap_df.head(5)

Unnamed: 0.1,Unnamed: 0,lyrics,song,artist,lyrics_string
0,0,[Awww yeah! We in the motherfuckin' place toni...,moonstruck,actionbronson,"[Intro:],Awww yeah! We in the motherfuckin' pl..."
1,1,"[""His opponent from St. Petersburg, Florida"", ...",barryhorowitz,actionbronson,"""His opponent from St. Petersburg, Florida"",""T..."
2,2,"[Hey yo you ready? Yeah I'm ready right, the f...",themadness,actionbronson,"Hey yo you ready? Yeah I'm ready right, the fa..."
3,3,"[Bronsolino, Fuck that sitting-down rap type s...",larrycsonka,actionbronson,"[Intro:],Bronsolino,Fuck that sitting-down rap..."
4,4,"[When I'm alone, Smoking weed, sitting by the ...",ronniecoleman,actionbronson,"When I'm alone,Smoking weed, sitting by the wi..."


I'll drop the extra index column, which is unnecessary.

In [None]:
rap_df.drop('Unnamed: 0', axis=1, inplace=True)
rap_df.head(1)

Unnamed: 0,lyrics,song,artist,lyrics_string
0,[Awww yeah! We in the motherfuckin' place toni...,moonstruck,actionbronson,"[Intro:],Awww yeah! We in the motherfuckin' pl..."


I am going to sample the lyrics because the full dataset is too large to model in a reasonable time. This is largely due to the massive vocabulary of the rap lyric dataset (~70,000 unique words before sampling). I'll take a random 40% of the songs.

In [None]:
rap_df = rap_df.sample(frac=0.4)

In [None]:
rap_df['lyrics']

3222     [Gucci Mane's a G, G, I'm tryna sell a P, P's,...
7684     [M.I.A. Lyrics, , "Super Tight", , Got my shit...
11028    [Exclusive swave, Swavey, , Every day stuntin'...
10561    [, Dogg Pound, Don Colion, whatever, whatever,...
8742     [[E-Dub] Redman.. Method Man.. Lady Luck.. Def...
                               ...                        
1512     [Would've came back for you, I just needed tim...
2616     ['Cause I do, 'Cause I do, 'Cause I do, Keep w...
5409     [I told my nigga don't tell my nigga for real ...
10334    [Yeah, yeah, yeah, yeah, come on, Yeah, my nig...
2966     [Imma run out front to see me when you walked ...
Name: lyrics, Length: 4832, dtype: object

Now I'll append the lyrics to a list called all_lyrics.

In [None]:
all_lyrics = []

for i in rap_df.lyrics:
  all_lyrics.extend(i)

all_lyrics[0]

"Gucci Mane's a G, G"

Now I will define the clean text function to remove punctuation and capitalisation.

In [None]:
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt  
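
To make the effect of `clean_text` concrete, here is a small sketch that applies the same function to two made-up strings. The inputs are assumptions chosen to show the three steps: punctuation is stripped, text is lowercased, and non-ASCII characters are dropped.

```python
import string

def clean_text(txt):
    # remove punctuation characters and lowercase the rest
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    # drop any character that cannot be represented in ASCII
    txt = txt.encode("utf8").decode("ascii", 'ignore')
    return txt

cleaned_plain = clean_text("Hello, World!")
cleaned_accent = clean_text("Café, Baby")

print(cleaned_plain)
print(cleaned_accent)
```

Note that accented letters are removed rather than converted to their plain equivalents, so a word like "café" loses its last letter.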

Now I'll remove items from the lyric list that are empty. I also noticed that there are still some lingering data quality issues, so I'll again filter out lines with potentially offensive language and non-lyric content such as bracketed section markers.

In [None]:
# Build a new list instead of calling remove() while iterating,
# which skips the element after each removal and needs repeated passes.
all_lyrics = [
    line for line in all_lyrics
    if line != ''                             # drop empty lines
    and 'X' not in line                       # X is subbed in for a potentially offensive word
    and '[' not in line and ']' not in line   # drop non-lyric markers such as [Intro:]
]
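
One thing to watch when filtering a list: calling `remove()` inside a `for` loop over the same list skips the element right after each removal, so adjacent matches can survive a single pass. Building a new list avoids this. Here is a small sketch on made-up lines (the toy data is an assumption, not taken from the dataset):

```python
toy_lines = ['a', '', '', 'b X', 'c X', '[Hook]', 'd']

# Removing while iterating: the second empty string is skipped
buggy = list(toy_lines)
for item in buggy:
    if item == '':
        buggy.remove(item)

# Building a new list checks every element exactly once
filtered = [
    line for line in toy_lines
    if line != '' and 'X' not in line and '[' not in line and ']' not in line
]

print(buggy)
print(filtered)
```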

In [None]:
corpus = [clean_text(x) for x in all_lyrics]
len(corpus)

292357

We have 292,357 lines of lyrics! Now I'll fit the word-level tokenizer on the corpus.

In [None]:
# Note: char_level is False now
rap_tokenizer = Tokenizer(char_level=False) 
rap_tokenizer.fit_on_texts(corpus)

Now I'll save the tokenizer so it can be reused later in the web application, where the text generation function requires it.

In [None]:
import pickle
# saving
with open('rap_tokenizer.pkl', 'wb') as handle:
    pickle.dump(rap_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
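
To use the tokenizer elsewhere, the pickle file has to be read back with `pickle.load`. Here is a minimal sketch of the round trip, using a plain dictionary as a hypothetical stand-in for the fitted tokenizer and a temporary directory in place of the notebook's working directory:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted tokenizer object
stand_in_tokenizer = {"word_index": {"the": 1, "block": 2}}

path = os.path.join(tempfile.mkdtemp(), "rap_tokenizer.pkl")

# saving
with open(path, "wb") as handle:
    pickle.dump(stand_in_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading, as the consuming application would do
with open(path, "rb") as handle:
    restored_tokenizer = pickle.load(handle)

print(restored_tokenizer == stand_in_tokenizer)
```

Unpickling a real tokenizer also requires the same library to be importable in the loading environment, and pickle files should only be loaded from trusted sources.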

Let's check the vocabulary size of the corpus...

In [None]:
word_to_number = rap_tokenizer.word_index
number_to_word = rap_tokenizer.index_word

all_words = list(word_to_number.keys())

print(f"Vocabulary size: {len(all_words)}")

Vocabulary size: 47323


47,323 words! Wow! That is why I needed to sample the data: before sampling there were more than 70,000 words and the computation times were too long.

Now I'll transform the tokenized corpus into sequences.

In [None]:
dataset = rap_tokenizer.texts_to_sequences(corpus)
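
As a rough picture of what the word-level tokenizer produces, here is a simplified pure-Python stand-in (not the Keras implementation). It ranks words by frequency, assigns indices starting at 1, and then maps each line to a list of indices. The helper names and the toy corpus are assumptions for illustration.

```python
from collections import Counter

def build_word_index(lines):
    # count every whitespace-separated word across all lines
    counts = Counter(word for line in lines for word in line.split())
    # most frequent word gets index 1; index 0 is left unused
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def lines_to_sequences(lines, word_index):
    # replace each known word with its integer index
    return [[word_index[w] for w in line.split() if w in word_index] for line in lines]

toy_corpus = ["on the block", "the block is hot"]
toy_index = build_word_index(toy_corpus)
toy_sequences = lines_to_sequences(toy_corpus, toy_index)

print(toy_index)
print(toy_sequences)
```

Because index 0 is reserved and never assigned to a word, a model predicting word indices needs one more output class than there are words in the vocabulary.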

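I'll define the sliding window length, which determines the shapes of X and y. Each window of 5 consecutive word indices becomes one input row, and the word that follows it becomes the target.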

In [None]:
# sliding window
SEQUENCE_LENGTH = 5

X = []
y = []

for song in dataset:
    for window_start_idx in range(len(song)-SEQUENCE_LENGTH):
        window_end_idx = window_start_idx + SEQUENCE_LENGTH
        X.append(song[window_start_idx: window_end_idx])
        y.append(song[window_end_idx])

X = np.array(X)
y = np.array(y)

# Let's look at the shapes
print(X.shape)
print(y.shape)

(754760, 5)
(754760,)
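
The windowing is easier to see on a tiny example. Here is a minimal sketch of the same loop wrapped in a hypothetical helper, `make_windows`, applied to two made-up sequences:

```python
def make_windows(sequences, sequence_length):
    inputs, targets = [], []
    for seq in sequences:
        # a sequence of n tokens yields n - sequence_length windows (none if n is too small)
        for start in range(len(seq) - sequence_length):
            end = start + sequence_length
            inputs.append(seq[start:end])   # the window of input tokens
            targets.append(seq[end])        # the token right after the window
    return inputs, targets

toy_X, toy_y = make_windows([[1, 2, 3, 4, 5, 6, 7], [8, 9, 10]], 5)

print(toy_X)
print(toy_y)
```

The second toy sequence has only three tokens, so it contributes no samples. In the same way, any lyric line of five words or fewer adds nothing to X and y.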


Now I will set up the model architecture. It was tuned through earlier experimentation and is the same as the optimized models for the folk and pop lyrics.

In [None]:
number_of_classes = len(all_words)+1

rap_lyric_model = Sequential()
rap_lyric_model.add(Embedding(number_of_classes, 5))


rap_lyric_model.add(LSTM(700, activation='tanh', return_sequences=True))
rap_lyric_model.add(BatchNormalization())
rap_lyric_model.add(Dropout(0.2))


rap_lyric_model.add(LSTM(350, activation='tanh', return_sequences=False))
rap_lyric_model.add(BatchNormalization())
rap_lyric_model.add(Dropout(0.2))

rap_lyric_model.add(Dense(175, activation='relu'))
rap_lyric_model.add(BatchNormalization())
rap_lyric_model.add(Dropout(0.2))

rap_lyric_model.add(Dense(number_of_classes, activation='softmax'))

In [None]:
# Compile model
rap_lyric_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
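
The sparse variant of the loss takes the integer word index as the target directly, so y does not need to be one-hot encoded. Here is a rough sketch of the per-sample loss and of how large a one-hot y would be. The probabilities are made up, and the 4 bytes per value assumes float32 storage.

```python
import math

def sparse_categorical_crossentropy(probabilities, target_index):
    # negative log of the probability assigned to the true class
    return -math.log(probabilities[target_index])

toy_probabilities = [0.1, 0.7, 0.2]
toy_loss = sparse_categorical_crossentropy(toy_probabilities, 1)

# Size of a one-hot y: samples x classes x 4 bytes (float32, an assumption)
samples = 754760
classes = 47323 + 1
one_hot_bytes = samples * classes * 4

print(round(toy_loss, 4))
print(round(one_hot_bytes / 1e9, 1), "GB")
```

At roughly 143 GB, a one-hot target matrix would not fit in memory, which is a practical reason to keep y as integer indices.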

In [None]:
rap_lyric_model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 5)           236620    
_________________________________________________________________
lstm_4 (LSTM)                (None, None, 700)         1976800   
_________________________________________________________________
batch_normalization_6 (Batch (None, None, 700)         2800      
_________________________________________________________________
dropout_6 (Dropout)          (None, None, 700)         0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 350)               1471400   
_________________________________________________________________
batch_normalization_7 (Batch (None, 350)               1400      
_________________________________________________________________
dropout_7 (Dropout)          (None, 350)              
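
The parameter counts in the summary can be checked by hand. Here is a short sketch using the standard formulas for embedding, LSTM, and batch normalization layers. The helper names are hypothetical, and only the layers visible in the summary above are checked.

```python
def embedding_params(vocab_size, embedding_dim):
    # one vector of length embedding_dim per class
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def batch_norm_params(features):
    # gamma, beta, moving mean and moving variance per feature
    return 4 * features

number_of_classes = 47323 + 1

counts = {
    "embedding": embedding_params(number_of_classes, 5),
    "lstm_700": lstm_params(5, 700),
    "batch_norm_700": batch_norm_params(700),
    "lstm_350": lstm_params(700, 350),
    "batch_norm_350": batch_norm_params(350),
}

for name, value in counts.items():
    print(name, value)
```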

Because this model is more computationally intensive, I decided to reduce the number of epochs by 50, to 200. I found I still got excellent results at this number of epochs.

In [None]:
history = rap_lyric_model.fit(X, y,
        batch_size=1024,
        epochs=200)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

In [None]:
# saving the model
rap_lyric_model.save('/content/drive/MyDrive/rap_lyric_model.h5') 

I'll define the generate_text function (same function as in the other model notebooks).

In [None]:
def generate_text(input_phrase, next_words, model):
    # convert the seed phrase into a list of word indices
    processed_phrase = rap_tokenizer.texts_to_sequences([input_phrase])[0]
    for _ in range(next_words):
        # feed the whole phrase so far as a batch containing one sequence
        network_input = np.array(processed_phrase, dtype=np.float32).reshape((1, -1))

        # the RNN gives the probability of each word as the next one
        predict_proba = model.predict(network_input)[0].astype(np.float64)
        # renormalise so float32 rounding cannot make np.random.choice reject the distribution
        predict_proba /= predict_proba.sum()

        # sample one word using these chances
        predicted_index = np.random.choice(number_of_classes, p=predict_proba)

        # add new index at the end of our list
        processed_phrase.append(int(predicted_index))

    # indices mapped to words - the method expects a list of lists so we need the extra bracket
    output_phrase = rap_tokenizer.sequences_to_texts([processed_phrase])[0]

    return output_phrase
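
Because the function samples each next word in proportion to the predicted probabilities rather than always taking the most likely word, repeated calls with the same seed phrase can return different lyrics. Here is a small sketch of the difference. It uses the standard library `random` module as a stand-in for `np.random.choice`, and the toy distribution is made up.

```python
import random

toy_probabilities = [0.2, 0.5, 0.3]   # made-up next-word distribution over 3 classes
rng = random.Random(0)                # seeded only so the sketch is repeatable

# Sampling: each class is drawn in proportion to its probability
sampled = [
    rng.choices(range(len(toy_probabilities)), weights=toy_probabilities, k=1)[0]
    for _ in range(1000)
]

# Greedy decoding: always pick the single most likely class
greedy = max(range(len(toy_probabilities)), key=lambda i: toy_probabilities[i])

print(sorted(set(sampled)))
print(greedy)
```

Greedy decoding would return the same continuation every time, while sampling trades some coherence for variety.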

Let's test it out!

In [None]:
generate_text('the mountains', 10, rap_lyric_model)

'the mountains on the block just like his mother to em in'

In [None]:
generate_text('my homie', 15, rap_lyric_model)

'my homie is never on the beat of the underworld in our crib pop in the sink'

In [None]:
generate_text('the wind', 15, rap_lyric_model)

'the wind take a shot to the money burn it in ya wallet and you stuck slow'

In [None]:
generate_text('the wind', 15, rap_lyric_model)

'the wind up with your body when you do appreciate your business then i speak with the'

Works great! There are some really interesting ideas here.