# The Lyricist for Lovers

Goal of this project
- Create a text generator with lyrics textfiles dataset.
- Perform Data cleaning in proper ways.
- Get an acceptable validation loss of the text generator model, lower then 2.2.

# Data Preparation

In [1]:
lyrics_path = '/content/drive/MyDrive/aiffel/EXP_4_data/lyrics/'

In [2]:
import os, re, glob
import numpy as np
import tensorflow as tf

# open the file in read mode
# read the data as a list, line by line
file_paths = glob.glob(lyrics_path + '*.txt')
raw_corpus = []
for textfile in file_paths:
  with open(textfile, "r") as f:
      raw_corpus.extend(f.read().splitlines())

print(raw_corpus[:9])
print(len(raw_corpus))

['Looking for some education', 'Made my way into the night', 'All that bullshit conversation', "Baby, can't you read the signs? I won't bore you with the details, baby", "I don't even wanna waste your time", "Let's just say that maybe", 'You could help me ease my mind', "I ain't Mr. Right But if you're looking for fast love", "If that's love in your eyes"]
187088


# Clean the Text Data

## Remove the special characters

1. Convert to lowercase, remove spaces on both sides
2. Put a space on either side of the special character
3. Replace multiple spaces with a single space
4. Replace all characters other than a-zA-Z?.!,¿ with a single space
5. Erase both spaces again
6. Add <start> at the beginning of the statement and <end> at the end


In [3]:
def preprocess_sentence(sentence):
    sentence = sentence.lower().strip() # 1
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence) # 2
    sentence = re.sub(r'[" "]+', " ", sentence) # 3
    sentence = re.sub(r"[^a-zA-Z?.!,¿]+", " ", sentence) # 4
    sentence = sentence.strip() # 5
    sentence = '<start> ' + sentence + ' <end>' # 6
    return sentence

print(preprocess_sentence("This @_is ;;;sample        sentence."))

<start> this is sample sentence . <end>


We need to clean the texts to put them in the traning model.

In [4]:
def preprocess_add(raw_corpus):
  corpus = []
  for idx, sentence in enumerate(raw_corpus):
    if len(sentence) == 0: continue   # skip if the length is 0
    if sentence[-1] == ":": continue  # skip if the text ends with ":"
  
    corpus.append(preprocess_sentence(sentence))
  return corpus

    # if idx > 9: break   
        
    # print(sentence)

In [5]:
corpus = preprocess_add(raw_corpus)
print(len(corpus))
print(corpus[:9])

175749
['<start> looking for some education <end>', '<start> made my way into the night <end>', '<start> all that bullshit conversation <end>', '<start> baby , can t you read the signs ? i won t bore you with the details , baby <end>', '<start> i don t even wanna waste your time <end>', '<start> let s just say that maybe <end>', '<start> you could help me ease my mind <end>', '<start> i ain t mr . right but if you re looking for fast love <end>', '<start> if that s love in your eyes <end>']


All the special characeters are gone, it looks clean now.

Additionally, remove lines if the number of words is bigger then 15

In [6]:
reduced_corpus = [line for line in corpus if len(line.split(' ')) <= 15]
print(len(reduced_corpus))


156013


## Tokenization

In [7]:
def tokenize(corpus):
    # create a tokenizer that remembers 7000 words.
    # no need filters any more since we already did it above.
    # replace with '<unk>' if the words not inluded in the 7000
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=7000, 
        filters=' ',
        oov_token="<unk>"
    )
    # cinokete dictionary of tokenizer using corpus
    tokenizer.fit_on_texts(corpus)
    # Transofrm corpus to Tensor using the tokenizer
    tensor = tokenizer.texts_to_sequences(corpus)   

    # Set the sequence length of the input data to be constant
    # If the sequence is short, add padding to the end of the sentence to match the length.
    # Use padding='pre' if you want to add padding to the front of the sentence to match the length
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post', maxlen=15)  # set maxlen to 15 to set the length of the sequence.
    
    print(tensor,tokenizer)
    return tensor, tokenizer

tensor, tokenizer = tokenize(reduced_corpus)

[[  2 290  28 ...   0   0   0]
 [  2 219  13 ...   0   0   0]
 [  2  25  15 ...   0   0   0]
 ...
 [  2  21  77 ...   0   0   0]
 [  2  41  26 ...   0   0   0]
 [  2  21  77 ...   0   0   0]] <keras_preprocessing.text.Tokenizer object at 0x7f21a1743e50>


Check how the dictionary is formatted

In [8]:
for idx in tokenizer.index_word:
    print(idx, ":", tokenizer.index_word[idx])

    if idx >= 10: break

1 : <unk>
2 : <start>
3 : <end>
4 : i
5 : ,
6 : the
7 : you
8 : and
9 : a
10 : to


Check the generated source and target sentences for the first sentence within corpus.

In [9]:
# tensor에서 마지막 토큰을 잘라내서 소스 문장을 생성합니다
# 마지막 토큰은 <end>가 아니라 <pad>일 가능성이 높습니다.
src_input = tensor[:, :-1]  
# tensor에서 <start>를 잘라내서 타겟 문장을 생성합니다.
tgt_input = tensor[:, 1:]    

print(src_input[0])
print(tgt_input[0])

[   2  290   28   94 4486    3    0    0    0    0    0    0    0    0]
[ 290   28   94 4486    3    0    0    0    0    0    0    0    0    0]


The source is filled with 0 (\<pad>) starting at 2 (\<start>) and ending at 3 (\<end>). However, the target doesn't start with a 2 and has the source shifted one space to the left.

# Train Dataset and Test Dataset

In [10]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(src_input,tgt_input,test_size = 0.2, random_state = 15)

In [11]:
print("Source Train:", x_train.shape)
print("Target Train:", y_train.shape)

Source Train: (124810, 14)
Target Train: (124810, 14)


# Create Model

I added the `Dense` parameter to have `activation='softmax'`, and also added Dropout 0.2

In [31]:
class TextGenerator(tf.keras.Model):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_size)
        self.rnn_1 = tf.keras.layers.LSTM(hidden_size, return_sequences=True)
        self.rnn_2 = tf.keras.layers.LSTM(hidden_size, return_sequences=True)
        self.linear = tf.keras.layers.Dense(vocab_size,  activation='softmax')  # added activation softmax
        
    def call(self, x):
        out = self.embedding(x)
        out = self.rnn_1(out)
        out = self.rnn_2(out)
        out = self.linear(out)
        
        return out
    
embedding_size = 256
hidden_size = 2048
model = TextGenerator(tokenizer.num_words + 1, embedding_size , hidden_size)

# Try Dropout to simplify data

In [27]:
from keras.layers import Dropout

model.add(Dropout(0.2))

# Train the dataset

Instead of setting the `batch_size` separately with `tf.data.Dataset.from_tensor_slices()` and `dataset.batch`, I set `batch_size` to 256 as a parameter in the `model.fit()` function.

Also, I tried to add `validation_split = 0.25` here to evaluate the loss and check the accuracy at the end of each epoch. The train / test data is already splitted and I put them iinto `validation_data`. But this data is not shuffled.

I added `metrics=['accuracy']` to check the accuracy at each epoch, too.

In [15]:
epochs = 10
batch_size = 256

optimizer = tf.keras.optimizers.Adam()

loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction='none'
)

model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=batch_size, validation_data=(x_test, y_test), epochs=epochs, validation_split=0.25)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


TT.. the val_loss is 2.2587, slightly greater then my goal which is 2.2.

**This training is taking a lot of time. Literally, too slow:** it took more than 40 mins. I was curious about how I can reduce the slowness. (I added the number of hidden layers and I know this will make the train slow, but I want to improve this as possible.)   


## Define Checkpoints
And I found that I can **define a checkpoint to record all of the network weights to file each time an improvement in loss** is observed at the end of the epoch. (This means we can save some time if the computation is already done in the previous epoch!)   
I will use the **best set of weights (lowest loss) **to the model.

In [32]:
from keras.callbacks import ModelCheckpoint

# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# Run the model again

In [15]:
epochs = 10

optimizer = tf.keras.optimizers.Adam()

loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction='none'
)

model.compile(loss=loss, optimizer=optimizer)

history = model.fit(x_train, y_train, batch_size=256, validation_data=(x_test, y_test), epochs=epochs,)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Okay, wih the `hidden_size` with 2048, I was able to got the `val_loss` of 2.1068.

# Overfitting?

But it looks like that there is an **overfitting**. The `val_loss` was dropped to 2.0883 at the epoch 9/10 and then incremented to 2.1068 at the epoch 10/10   
To reduce the overfitting, there are 3 options that I can think of.   
1. Put more data into the train dataset.   
2. Data Augmentation.   
3. Simplify data with `dropout` to rduce overfittng by decreasiing the complexity of the model to prevent overfit.   

However it seems like I need to work only with the given lyrics dataset, so I want to try the `dropout` method.

# Generate lyrics

The model is ready now, let's generate lyrics with the model!

In [37]:
def generate_text(model, tokenizer, init_sentence="<start>", max_len=20):
    # Convert input init_sentence to tensor for testing purposes. 
    test_input = tokenizer.texts_to_sequences([init_sentence])
    test_tensor = tf.convert_to_tensor(test_input, dtype=tf.int64)
    end_token = tokenizer.word_index["<end>"]

    # Create a sentence by predicting a word
    #    1. 입력받은 문장의 텐서를 입력합니다
    #    2. 예측된 값 중 가장 높은 확률인 word index를 뽑아냅니다
    #    3. 2에서 예측된 word index를 문장 뒤에 붙입니다
    #    4. 모델이 <end>를 예측했거나, max_len에 도달했다면 문장 생성을 마칩니다
    while True:
        # 1
        predict = model(test_tensor) 
        # 2
        predict_word = tf.argmax(tf.nn.softmax(predict, axis=-1), axis=-1)[:, -1] 
        # 3 
        test_tensor = tf.concat([test_tensor, tf.expand_dims(predict_word, axis=0)], axis=-1)
        # 4
        if predict_word.numpy()[0] == end_token: break
        if test_tensor.shape[1] >= max_len: break

    generated = ""
    # tokenizer를 이용해 word index를 단어로 하나씩 변환합니다 
    for word_index in test_tensor[0].numpy():
        generated += tokenizer.index_word[word_index] + " "

    return generated

In [38]:
generate_text(model, tokenizer, init_sentence="<start> i love", max_len=20)

'<start> i love you and the breeze that <unk> around you <end> '

How sweet...

In [39]:
generate_text(model, tokenizer, init_sentence="<start> New York", max_len=20)

'<start> new york city <end> '

In [40]:
generate_text(model, tokenizer, init_sentence="<start> i want", max_len=20)

'<start> i want to get in the zone <end> '

In [41]:
generate_text(model, tokenizer, init_sentence="<start> What can I", max_len=20)

'<start> what can i do for you ? <end> '

In [43]:
generate_text(model, tokenizer, init_sentence="<start> What should I", max_len=20)

'<start> what should i do , babe ? <end> '

This is so sweet..!!! 

In [44]:
generate_text(model, tokenizer, init_sentence="<start> Love", max_len=20)

'<start> love me like you do <end> '

In [45]:
generate_text(model, tokenizer, init_sentence="<start> You love", max_len=20)

'<start> you love when i whine it <end> '

In [46]:
generate_text(model, tokenizer, init_sentence="<start> i hate", max_len=20)

'<start> i hate the headlines and the weather <end> '

In [48]:
generate_text(model, tokenizer, init_sentence="<start> This evening", max_len=20)

'<start> this evening s too heavy , <end> '

The lyrics that the model generated are so poetic and romantic. I guess this is because we trained the model with love songs, so our lyricist model is full of **LOVE**.

# Conclusion

# What I've learned and tried
- I learned how to preprocess the text data to organize them with predictors. 
- Build a RNN model using LSTM.   
- Fit the train data with embedding size and hidden layers size.
- Tried adding **Checkpoints** to reduce the time to train the model.
- Got an acceptable validation loss of the text generator model, lower then 2.2.
    
# Things that I learned - RNNs & LSTM networks
- It works with **Sequential** data
- It works greatly with **Short Contexts**. Simple RNN's prediction is dependent on all previous predictions and information learned from them. Though its effective for those short contexts.
- The reason for the above dependency is becuase of **Vanishing Gradient**. RNN remembers things for a small duration of time. This is because for a conventional feed-forward neural netowrk, the weight updating that is applied on a particular layer is a multiple of the learning rate.
- That is why we used RNN with **LSTM**, (Long Short-Term Memory Networks).
- Disadvantage of LSTM: it takes to much time to train this simple model. (Hardware Constraint)

# Further Ideas to Improve the Model
1. Add dropout to the visible input layer and consider tuning the dropout percentage.
2. Add `Bidirectional` LSTM layer
3. Tune the batch size, try a batch size of 1 as a (very slow) baseline and larger sizes from there.
4. Try a one hot encoded for the input sequences.
5. Apply `ModelCheckpoint` appropriately. ( I was failed to apply this because I thought I had to change the model type to `Seuquential` to do this.)

# References
- I got a hint from this article to use `Bidirection al LSTM`: https://www.programmersought.com/article/67438889091/