# Yoda Language Model Training

In this notebook, we will:
1. Import the yoda-corpus dataset.
2. Train a tokenizer.
3. Generate tokenized n-grams for each line in the corpus.
4. Split the data into features and labels.
5. Add padding to the sequences.
6. Construct and train a neural network model.
7. Implement a simple form of top-k sampling.
8. Train a model with masked padding tokens.
9. Test different models' performances.

## Introduction
In this exercise, we'll be using a corpus of sentences styled after the unique manner of speaking of the Star Wars character Master Yoda.  

The dataset was generated using the large language model [Claude](https://claude.ai/). The initial prompt was:  
"Create 10 short sentences that mimic the style of Yoda."  

Several further batches of 200 examples each were then generated with the prompt "Generate 200 more short sentences."
In a further step, examples that contained full stops, question marks, or exclamation marks were split into separate examples (some "short" sentences were not that short and actually contained more than one sentence). Finally, the examples were deduplicated using fuzzy comparison: of all examples with a similarity of more than 95%, only one was kept.
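
The fuzzy deduplication step can be sketched as follows. The tool actually used to build the corpus is not stated, so this is only a minimal illustration that assumes Python's standard `difflib.SequenceMatcher` as the similarity measure. The helper name `deduplicate` and the example sentences are made up for this sketch.

```python
from difflib import SequenceMatcher


def deduplicate(sentences, threshold=0.95):
    """Keep only the first sentence of every group of near-duplicates.

    Hypothetical helper: two sentences count as duplicates when their
    case-insensitive similarity ratio is above `threshold`.
    """
    kept = []
    for candidate in sentences:
        is_duplicate = any(
            SequenceMatcher(None, candidate.lower(), existing.lower()).ratio()
            > threshold
            for existing in kept
        )
        if not is_duplicate:
            kept.append(candidate)
    return kept


# Made-up examples: the first two differ only in capitalization
examples = [
    "Patience you must have, my young padawan",
    "Patience you must have, my young Padawan",
    "Much to learn, you still have",
]
print(deduplicate(examples))
```

Note that this pairwise comparison scales quadratically with the number of sentences, which is fine for a few thousand short lines but not for large corpora.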

You can download the `yoda-corpus.txt` file [here](https://github.com/opencampus-sh/course-material/blob/main/machine-learning-with-tensorflow/week-06/yoda-corpus.txt).

## Importing the Yoda-Corpus Dataset
Before proceeding, make sure that you have downloaded `yoda-corpus.txt` and uploaded it to a folder of your choice in your Google Drive.


In [None]:
# Mount your Google Drive

from google.colab import drive
drive.mount("/content/drive")

In [None]:
# Import the data into a pandas dataframe.

import pandas as pd

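# Read the file as plain text, since it is not formatted as CSV.
# Strip the trailing newline from each line and skip empty lines.
with open("/content/drive/MyDrive/path_to_your_file", "r", encoding="utf-8") as f:
    data = [line.strip() for line in f if line.strip()]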

# Convert the data to a DataFrame
df = pd.DataFrame(data, columns=["text"])

df.head()

### Descriptive Statistics

In [None]:
# Count the words in each text once, then print summary statistics.
# split() without arguments splits on any whitespace and ignores the trailing newline.
word_counts = df["text"].apply(lambda x: len(x.split()))

print("Average number of words in each text: ", word_counts.mean())
print("Shortest text (in words): ", word_counts.min())
print("Longest text (in words): ", word_counts.max())

## Tokenization

First, we define and add special tokens to our training data, then we train the Tokenizer on the corpus.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

# initialize tokenizer
tokenizer = Tokenizer(oov_token="<OOV>")

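# Append the end-of-sequence marker "<EOS>" to each sentence.
# The tokenizer's default filters remove "<" and ">" and lowercase the text,
# so this token ends up in the vocabulary as "eos".
df["text"] = df["text"] + " <EOS>"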

# train tokenizer on text
tokenizer.fit_on_texts(df["text"])

# get vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# print vocabulary size
print("Vocabulary size: ", vocab_size)

Now we can use the tokenizer to convert the texts into sequences of integers. Additionally, the following code creates all prefix n-grams from each tokenized sequence, that is, every prefix of the sentence that contains at least two tokens:

In [None]:
# create all n-grams from a single sentence (row)
def create_ngrams_from_sentence(tokenized_text):
    sentence_ngrams = []
    for i in range(1, len(tokenized_text)):
        sentence_ngrams.append(tokenized_text[: i + 1])
    return sentence_ngrams


# collect all n-grams from all sentences of the corpus in a single list
corpus_ngrams = []
for row in df["text"]:
    tokenized_text = tokenizer.texts_to_sequences([row])[0]
    corpus_ngrams.extend(create_ngrams_from_sentence(tokenized_text))

# print the first 20 n-grams
corpus_ngrams[:20]

## Splitting Data into Features and Labels
The last element of each sequence will be our label.

In [None]:
import numpy as np

# All tokens except the last one are the features; the last token is the label
features = [ngram[:-1] for ngram in corpus_ngrams]
labels = [ngram[-1] for ngram in corpus_ngrams]

# print the first 10 features and labels
print("Features: ", features[:10])
print("Labels: ", labels[:10])

The feature sequences need to be padded so that they all have the same length. Features and labels are then converted into NumPy arrays, as required by TensorFlow.

In [None]:
from keras.preprocessing.sequence import pad_sequences

max_len = XXX  # Replace with the maximum sequence length your model will be trained with (context window size)
numpy_features = np.array(pad_sequences(features, padding="post", maxlen=max_len))
numpy_labels = np.array(labels)
# print the first 10 padded features and labels in numpy format
print("Features: ", numpy_features[:10])
print("Labels: ", numpy_labels[:10])
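
To see what post-padding with a fixed `maxlen` does to sequences of different lengths, here is a small pure-Python sketch. The helper `pad_post` is hypothetical and only mimics the behavior of `pad_sequences(..., padding="post")` with its default front truncation. The token IDs are made up.

```python
def pad_post(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(..., padding="post").

    Sequences longer than maxlen lose tokens from the front, which matches
    the default truncating="pre" behavior; shorter ones are filled with
    `value` at the end.
    """
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep only the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# Made-up token IDs for three feature sequences of different lengths
toy_features = [[5], [5, 12], [5, 12, 7, 3, 9]]
print(pad_post(toy_features, maxlen=4))
# [[5, 0, 0, 0], [5, 12, 0, 0], [12, 7, 3, 9]]
```

The last row shows why the choice of `max_len` matters: any context longer than the window is cut off at the front before it reaches the model.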

## Model Construction and Training

Insert the correct value for the input dimension of the embedding layer.

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

# Model parameters
input_dim = XXX  # Replace with the correct value
output_dim = 128  # the dimensionality of the embedding vectors
input_length = max_len  # the maximum sequence length your model will be trained with (context window size)

# Define model architecture
model = Sequential()
model.add(Embedding(input_dim=input_dim, output_dim=output_dim, input_length=input_length))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(input_dim, activation="softmax"))  # Output layer

# Compile model
model.compile(
    optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

# Train model
history = model.fit(numpy_features, numpy_labels, epochs=50)

Visualize the loss and accuracy curves to review the model performance.

In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="Training Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.figure()
plt.plot(history.history["accuracy"], label="Training Accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

## Testing the Model
Let's see our model in action.

In [None]:
import numpy as np


def generate_yoda_speak(seed_text, max_words):
    for _ in range(max_words):
        # Tokenize the sentence
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        # Pad the sentence to the maximum length
        token_list = pad_sequences([token_list], maxlen=max_len, padding="post")
        # Get the predicted probabilities for each word
        predicted_probs = model.predict(token_list, verbose=0)
        # Get the index of the most probable word
        predicted = np.argmax(predicted_probs, axis=-1).item()
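        # Convert index to word (index 0 is the padding index and has no word)
        output_word = tokenizer.index_word.get(predicted)
        # Stop if the padding index or the end-of-sequence token was generated
        # ("<EOS>" is stored as "eos" after the tokenizer's filtering and lowercasing)
        if output_word is None or output_word == "eos":
            # terminate generation
            break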
        # Add the predicted word to the seed text
        seed_text += " " + output_word
    return seed_text


prompt = "Seek"
max_tokens = 15

# Example usage
print(generate_yoda_speak(prompt, max_tokens))

## Including Randomness in the Prediction (Top-K Sampling)  

The inference function implemented above follows a so-called "greedy" approach, that is, the predicted next word is always the one with the highest probability. In practice, however, models often produce better output if you include some randomness in the selection of the predicted word.

Implement an inference function that randomly selects the predicted word from the 3 most probable words and compare the results with the ones from above.
Vary the group size and compare the differences.
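
As a starting point, the selection step can be sketched in isolation on a made-up probability vector, without any model. This is one possible reading of top-k sampling, in which the k candidates are drawn in proportion to their probabilities. Choosing uniformly among the k candidates is an equally valid interpretation of the task. The helper name `sample_top_k` and the numbers are assumptions for illustration.

```python
import random


def sample_top_k(probs, k, rng=random):
    """Pick an index at random from the k most probable entries of `probs`.

    Hypothetical helper: candidates are weighted by their probability, so
    the most likely index is still chosen most often.
    """
    # Indices of the k largest probabilities, most probable first
    top_indices = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_weights = [probs[i] for i in top_indices]
    # random.choices treats the weights as relative, so no renormalization is needed
    return rng.choices(top_indices, weights=top_weights, k=1)[0]


# Made-up next-word distribution over a five-word vocabulary
toy_probs = [0.05, 0.40, 0.10, 0.30, 0.15]
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_top_k(toy_probs, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

Only the indices 1, 3, and 4 can ever be drawn here, because they hold the three largest probabilities. With `k=1` the function reduces to the greedy choice.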

In [None]:
import numpy as np


def generate_yoda_speak(prompt, max_words, k):
    # INCLUDE YOUR CODE HERE
    # (Hint: You can use the generate_yoda_speak function from above and just adjust one line of code)

    return prompt

# Example usage

prompt = "Seek"
max_tokens = 15
k = 3

print(generate_yoda_speak(prompt, max_tokens, k))

## Masking Padding Tokens

Take the model definition from above and add the argument `mask_zero=True` to the definition of the embedding layer.  
Train the model. Do you notice a difference?  

Find out what changes during training when `mask_zero` is set to `True`.
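
As a hint, the idea behind masking can be illustrated without Keras. The sketch below is a toy model, not an LSTM: `run_toy_rnn` is a made-up recurrence that only shows the principle of skipping masked time steps and carrying the previous state forward. The mask marks every position whose token ID is not the padding index 0.

```python
def padding_mask(token_ids):
    """True for real tokens, False for padding (ID 0)."""
    return [token_id != 0 for token_id in token_ids]


def run_toy_rnn(token_ids, mask=None):
    """Made-up recurrence standing in for a recurrent layer.

    With a mask, padded time steps are skipped and the state is carried
    over unchanged; without one, the padding zeros are processed like
    any other input.
    """
    state = 0.0
    for step, token_id in enumerate(token_ids):
        if mask is not None and not mask[step]:
            continue  # masked step: keep the previous state
        state = 0.5 * state + token_id
    return state


unpadded = [5, 12]
padded = [5, 12, 0, 0]  # the same sequence after post-padding to length 4
mask = padding_mask(padded)

print(mask)                       # [True, True, False, False]
print(run_toy_rnn(unpadded))      # 14.5
print(run_toy_rnn(padded))        # 3.625, the padding changed the result
print(run_toy_rnn(padded, mask))  # 14.5, same as the unpadded sequence
```

In this toy setting the masked run reproduces the result of the unpadded sequence, while the unmasked run is distorted by the trailing zeros.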

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

# Define model architecture
# INCLUDE YOUR CODE HERE

# Compile model
model.compile(
    optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

# Train model
history = model.fit(numpy_features, numpy_labels, epochs=50)

Visualize the loss and accuracy curves to review the model performance.

In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="Training Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.figure()
plt.plot(history.history["accuracy"], label="Training Accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()