### LSTM Networks
LSTM (Long Short-Term Memory) networks are a type of recurrent neural network (RNN) that can learn order dependence in sequence prediction problems. Unlike standard feedforward networks, an LSTM has recurrent (feedback) connections, so it can process entire sequences of data rather than only single data points.

### Predictors and Labels in LSTM
In the context of text generation using LSTMs:

- **Predictors** are the input sequences to the model. These are portions of text (or tokens) that the model will use to make its predictions. The model learns to predict the next token in a sequence based on these inputs.

- **Labels** are the actual outcomes the model is trying to predict. In text generation, a label is typically the next token in the sequence that follows the input sequence.

### Example:
Imagine you have a sentence: "The quick brown fox jumps"

If you break this sentence into sequences of words for training an LSTM, your predictors (input sequences) and labels might look like this:

- Predictor: "The", Label: "quick"
- Predictor: "The quick", Label: "brown"
- Predictor: "The quick brown", Label: "fox"
- Predictor: "The quick brown fox", Label: "jumps"
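
The pairs listed above can be generated mechanically. The following is a minimal sketch in plain Python; `make_pairs` is a hypothetical helper introduced here for illustration, and it is not part of the notebook code.

```python
def make_pairs(sentence):
    """Return (predictor, label) pairs, one for each prefix of the sentence."""
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        predictor = " ".join(words[:i])  # all words before position i
        label = words[i]                 # the word that follows them
        pairs.append((predictor, label))
    return pairs


pairs = make_pairs("The quick brown fox jumps")
for predictor, label in pairs:
    print(f"Predictor: {predictor!r}, Label: {label!r}")
```

A sentence of five words yields four pairs, because the first word has no preceding context to serve as a predictor.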

### In the Code:
- **Tokenization**: The text data (lyrics, in your case) is converted into tokens. Each unique word is given a unique integer (token).

- **Sequence Creation**: Sequences of tokens are created. For each lyric line, every prefix of two or more tokens becomes one training sequence (an n-gram), so a line of five words yields four sequences.

- **Padding**: Sequences are padded with leading zeros (`padding='pre'`) so that they all have the length of the longest sequence, which keeps the label in the last position.

- **Predictors**: The predictors are all the tokens in a sequence except the last one.

- **Label**: The label is the last token in the sequence. This is what the model tries to predict.
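
The steps above can be sketched without Keras to make the data layout visible. This is an illustrative approximation, not the notebook's implementation: `build_vocab`, `ngram_sequences`, and `pad_pre` are hypothetical helpers, and the ids are assigned in order of first appearance, whereas the Keras `Tokenizer` orders them by word frequency.

```python
def build_vocab(lines):
    """Give each unique word an integer id starting at 1 (0 is reserved for padding)."""
    vocab = {}
    for line in lines:
        for word in line.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def ngram_sequences(lines, vocab):
    """For each line, emit every prefix that contains at least two tokens."""
    seqs = []
    for line in lines:
        tokens = [vocab[w] for w in line.lower().split()]
        for i in range(1, len(tokens)):
            seqs.append(tokens[: i + 1])
    return seqs


def pad_pre(seqs, maxlen):
    """Left-pad each sequence with zeros up to maxlen."""
    return [[0] * (maxlen - len(s)) + s for s in seqs]


lines = ["the quick brown fox", "the fox jumps"]  # toy data, not the lyrics
vocab = build_vocab(lines)
seqs = ngram_sequences(lines, vocab)
max_len = max(len(s) for s in seqs)
padded = pad_pre(seqs, max_len)

# Predictors are all tokens but the last; the label is the last token.
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

for p, l in zip(predictors, labels):
    print(p, "->", l)
```

Padding on the left rather than the right means the label always sits in the final column, so a single slice separates predictors from labels.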

In the model training process, the LSTM network learns to predict the label based on the predictors. For example, given the sequence of words "The quick brown", it learns to predict the next word "fox". This training process involves adjusting the neural network's weights through backpropagation based on the error between the predicted word and the actual next word in the sequence.
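
To make the "error" concrete, here is a small sketch of the cross-entropy loss for a single prediction, along with the perplexity derived from it. The four-word vocabulary and the raw scores are made-up values for illustration only; in the real model the scores come from the final `Dense` layer.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct next word."""
    return -math.log(probs[target_index])


vocab = ["the", "quick", "brown", "fox"]  # hypothetical toy vocabulary
target = vocab.index("fox")               # true next word after "The quick brown"

# An untrained model that scores every word equally.
uniform_probs = softmax([0.0, 0.0, 0.0, 0.0])
uniform_loss = cross_entropy(uniform_probs, target)
uniform_perplexity = math.exp(uniform_loss)

# A model that has learned to favor "fox" (scores are invented).
trained_probs = softmax([1.0, 0.5, 0.2, 2.0])
trained_loss = cross_entropy(trained_probs, target)

print(f"uniform loss: {uniform_loss:.3f}, perplexity: {uniform_perplexity:.1f}")
print(f"trained loss: {trained_loss:.3f}")
```

With uniform scores the perplexity equals the vocabulary size, which is a useful baseline: a trained model should score well below it.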


In [16]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('../data/taylor_swift_lyrics.csv', encoding = "latin1")

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import Callback



In [19]:
df = pd.read_csv('../data/cleaned.csv')
df

Unnamed: 0,lyric_clean
0,he said the way my blue eyes shined
1,put those georgia stars to shame that night
2,i said that is a lie
3,just a boy in a chevy truck
4,that had a tendency of gettin' stuck
...,...
4857,hold on to the memories they will hold on to you
4858,please do not ever become a stranger
4859,hold on to the memories they will hold on to you
4860,whose laugh i could recognize anywhere


In [33]:
# Fit the tokenizer on the full corpus so every word gets an integer id
combined_lyrics = ' '.join(df['lyric_clean'].dropna())
tokenizer = Tokenizer()
tokenizer.fit_on_texts([combined_lyrics])
total_words = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

# Build n-gram sequences: every prefix of each line with at least two tokens
input_sequences = []
for line in df['lyric_clean'].dropna():
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In [39]:
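# Left-pad every sequence to the length of the longest one
max_sequence_len = max(len(x) for x in input_sequences)
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
# Predictors: all tokens but the last; label: the last token
predictors, label = input_sequences[:,:-1], input_sequences[:,-1]
# One-hot encode the labels to match the softmax output layer
label = to_categorical(label, num_classes=total_words)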

# Splitting Data
X_train, X_val, y_train, y_val = train_test_split(predictors, label, test_size=0.2, random_state=42)


In [None]:
# Building the Model
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(LSTM(150, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Perplexity Calculation
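class Perplexity(Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Perplexity is exp(cross-entropy); 'loss' here is the training loss
        cross_entropy = (logs or {}).get('loss')
        if cross_entropy is not None:
            perplexity = np.exp(cross_entropy)
            print(f' - perplexity: {perplexity}')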

# Model Training
history = model.fit(X_train, y_train, epochs=100, verbose=1, validation_data=(X_val, y_val), callbacks=[Perplexity()])

# Visualizing Training and Validation Metrics
plt.figure(figsize=(12, 6))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()