# RNN (Graded)

Welcome to your RNN (required) programming assignment! You will build a **Recurrent Neural Network (RNN)** for a **text completion** task. You will use the [WikiText language modeling dataset](https://huggingface.co/datasets/Salesforce/wikitext), which contains millions of tokens extracted from the set of verified Good and Featured articles on Wikipedia.

**Instructions:**
* Do not modify any of the existing code.
* Only write code when prompted. For example, in some sections you will find the following:
```
# TODO
```
Only modify those sections of the code, and follow the instructions in the code cell where you need to write code.

**You will learn to:**
* Explore the [WikiText language modeling dataset](https://huggingface.co/datasets/Salesforce/wikitext).
* Clean the dataset before using it for training.
* Build a robust text completion model using just `SimpleRNN()`.
* Build the general architecture of an RNN, including:
  * Initializing parameters
  * Calculating the cost function and its gradient
  * Using an optimization algorithm (gradient descent)
* Gather all three functions above into a main model function, in the right order.

In [None]:
import random
import numpy as np
import tensorflow as tf

from helpers import *
from tests import *

# Loading and Visualizing the dataset

In [None]:
from datasets import load_dataset

train_dataset = load_dataset("iohadrubin/wikitext-103-raw-v1", split="train")
valid_dataset = load_dataset("iohadrubin/wikitext-103-raw-v1", split="validation")
test_dataset  = load_dataset("iohadrubin/wikitext-103-raw-v1", split="test")

In [None]:
print(f"Train dataset size: {len(train_dataset)}")
print(f"Valid dataset size: {len(valid_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")

In [None]:
train_dataset['text'][0]

<center>
<h1>Preparing Dataset for Text Completion</h1>
</center>

There are essentially three steps to prepare the data for text completion.


### 1. Start with a sentence
Take your sentence and break it into tokens (or words), for example:
```
["The", "cat", "sat", "on", "the", "mat"]
```
### 2. Create progressive input output pairs
* Start with just the first word and make the next word the "output" you're trying to predict:
```
Input: ["The"] → Output: "cat"
```
* Now, add one more word to the input and make the next word the output:
```
Input: ["The", "cat"] → Output: "sat"
```
* Keep adding more words to the input until you've gone through the sentence:
```
Input: ["The", "cat", "sat"] → Output: "on"
Input: ["The", "cat", "sat", "on"] → Output: "the"
Input: ["The", "cat", "sat", "on", "the"] → Output: "mat"
```

### 3. Pad the input to a fixed length
To make all the inputs the same length (because some sentences might be shorter or longer), you "pad" the inputs by adding zeros at the beginning. For example, if you want every input to be 4 words long, the inputs would look like this:
```
Input: [0, 0, 0, "The"] → Output: "cat"
Input: [0, 0, "The", "cat"] → Output: "sat"
```
This padding makes sure every input is the same size, which is important for training the model.

### In Summary:
* **Input**: You take a growing part of the sentence, starting small and getting bigger.
* **Output**: The next word in the sentence.
* **Padding**: If the input is too short, you add zeros at the start to make all inputs the same length.

By doing this for each sentence in your dataset, you create many input-output pairs for the model to learn from.
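
The three steps above can be sketched in a few lines of plain Python. This is only an illustration: `make_padded_pairs` and its parameters are hypothetical names, not the course's `create_input_output_pairs` helper, and keeping only the most recent tokens when the context grows beyond `input_length` is an assumption.

```python
def make_padded_pairs(tokens, input_length, pad_value=0):
    """Build left-padded (input, output) pairs from one tokenized sentence."""
    pairs = []
    for i in range(1, len(tokens)):
        # Context is everything before position i, capped at input_length
        # (assumption: when the context is too long, keep the latest tokens)
        context = tokens[max(0, i - input_length):i]
        # Pad with zeros at the beginning so every input has the same length
        padding = [pad_value] * (input_length - len(context))
        pairs.append((padding + context, tokens[i]))
    return pairs


sentence = ["The", "cat", "sat", "on", "the", "mat"]
pairs = make_padded_pairs(sentence, input_length=4)
for inputs, output in pairs:
    print(inputs, "->", output)
```

A sentence of six words produces five pairs here, and the first two match the padded examples shown above.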

# Preprocessing the dataset

## Data Cleaning

As you can see, the raw text contains a lot of noise, such as header markup and non-Latin characters:
```
"= Valkyria Chronicles III =\nSenjō, "戦場のヴァルキュリア3".
```
It's important to clean them up.<br>
Complete the following `clean_text()` function to clean the texts.

In [None]:
# TODO

import re

# Function to clean the dataset
def clean_text(texts):
    """
    Cleans the input text by performing necessary preprocessing steps like lowercasing,
    removing special characters, etc.
    """
    cleaned_texts = []
    for text in texts:
        # Lowercase text
        text = text.lower()
        
        # Remove headers/formatting (e.g., '= Valkyria Chronicles III =')
        text = re.sub(r'=.*=', '', text)
        
        # Remove non-alphabetic characters (punctuation, numbers, special characters)
        text = re.sub(r'[^a-z\s]', '', text)
        
        # Remove extra spaces
        text = ' '.join(text.split())
        
        # Only keep non-empty cleaned lines
        if text:
            cleaned_texts.append(text)

    return cleaned_texts
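
To see what each cleaning step does, the sketch below applies the same four operations to a single made-up sample string, one at a time. The sample text is illustrative, not taken verbatim from the dataset. Note that the character filter also drops accented letters, so "senjō" becomes "senj".

```python
import re

# Illustrative sample: a header line followed by a sentence containing a
# non-ASCII letter (\u014d is "ō"), a digit and punctuation
sample = "= Valkyria Chronicles III =\nSenj\u014d no Valkyria 3 : Unrecorded Chronicles"

lowered = sample.lower()                             # 1. lowercase
no_headers = re.sub(r'=.*=', '', lowered)            # 2. drop '= ... =' header lines
letters_only = re.sub(r'[^a-z\s]', '', no_headers)   # 3. keep only a-z and whitespace
cleaned = ' '.join(letters_only.split())             # 4. collapse whitespace

print(repr(cleaned))  # 'senj no valkyria unrecorded chronicles'
```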


## Train, val, test split
Out of the 23k training records, you are asked to **use at least 5k records for training**.

* **Training samples:** ~5000

* **Valid samples:** 60

* **Test samples:** 62

Adjust the following `train_samples` variable to select the number of training samples.

In [None]:
# TODO

# Adjust this variable to select the number of training samples
train_samples = 5000
valid_samples = 60   
test_samples = 62  
clean_train_dataset = clean_text(train_dataset['text'][:train_samples])
clean_valid_dataset = clean_text(valid_dataset['text'][:valid_samples])
clean_test_dataset = clean_text(test_dataset['text'][:test_samples])

## Tokenizing and Padding



In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer


# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(clean_train_dataset)

Apply tokenization and convert all texts to sequences in train, valid and test datasets.

In [None]:
# TODO

train_sequences = tokenizer.texts_to_sequences(clean_train_dataset)
valid_sequences = tokenizer.texts_to_sequences(clean_valid_dataset)
test_sequences = tokenizer.texts_to_sequences(clean_test_dataset)
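
As a rough mental model of what fitting and applying the tokenizer does, here is a simplified pure-Python stand-in. It is not the Keras implementation: `fit_word_index` and `texts_to_ids` are hypothetical names, and the sketch only mirrors the behaviour that more frequent words get smaller indices, index 0 is left free for padding, and words not seen during fitting are skipped.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign indices 1, 2, 3, ... to words, most frequent first."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so ties keep their first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}


def texts_to_ids(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]


corpus = ["the cat sat on the mat", "the dog sat"]
word_index = fit_word_index(corpus)
sequences = texts_to_ids(["the cat barked"], word_index)

print(word_index)           # {'the': 1, 'sat': 2, 'cat': 3, 'on': 4, 'mat': 5, 'dog': 6}
print(sequences)            # [[1, 3]]
print(len(word_index) + 1)  # 7, the vocabulary size including the padding index 0
```

Because the index is built from the training texts only, words that appear only in the validation or test texts have no index and are dropped from those sequences.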

`max_seq_length` is one of the crucial parameters for training. Adjust it to set the maximum number of words kept from each sentence.

In [None]:
# TODO

# Set maximum sequence length (you can adjust this)
max_seq_length = 500

# Truncate each sequence to at most max_seq_length tokens
train_trunc_sequences = truncate_sequences(train_sequences, max_seq_length)
valid_trunc_sequences = truncate_sequences(valid_sequences, max_seq_length)
test_trunc_sequences = truncate_sequences(test_sequences, max_seq_length)

In [None]:
sequence_lengths = [len(seq) for seq in train_trunc_sequences]
print(f"Max sequence length: {max(sequence_lengths)}")
print(f"Average sequence length: {sum(sequence_lengths) / len(sequence_lengths)}")
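
The `truncate_sequences` helper comes from `helpers` and its code is not shown here. The sketch below is a hypothetical stand-in that assumes it keeps the first `max_seq_length` tokens of each sequence. It also shows why this parameter matters: a sequence of n tokens yields n - 1 input-output pairs, so truncation directly limits the number of training examples.

```python
def truncate_sequences_sketch(sequences, max_seq_length):
    """Hypothetical stand-in: keep only the first max_seq_length tokens."""
    return [seq[:max_seq_length] for seq in sequences]


def count_pairs(sequences):
    """A sequence of n tokens yields n - 1 (prefix -> next token) pairs."""
    return sum(max(len(seq) - 1, 0) for seq in sequences)


toy_sequences = [[5, 2, 9, 4, 7, 1], [3, 8], [6]]
truncated = truncate_sequences_sketch(toy_sequences, max_seq_length=4)

print(truncated)                   # [[5, 2, 9, 4], [3, 8], [6]]
print(count_pairs(toy_sequences))  # 6
print(count_pairs(truncated))      # 4
```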


## Creating Input Output pairs
The following function is used to create the input-output pairs.
```
def create_input_output_pairs(sequences, max_seq_length):
    """
    Creates input-output pairs from the tokenized sequences. The input will be
    subsequences of the original sequence (up to max_seq_length), and the output
    will be the next token in the sequence.
    
    Args:
        sequences (List[List[int]]): A list of tokenized sequences.
        max_seq_length (int): The maximum sequence length for truncation.

    Returns:
        np.array: Array of padded input sequences.
        np.array: Array of output words (next token in the sequence).
    """
```

In [None]:
# Create input-output pairs for training

train_inputs, train_outputs = create_input_output_pairs(train_trunc_sequences, max_seq_length)
valid_inputs, valid_outputs = create_input_output_pairs(valid_trunc_sequences, max_seq_length)
test_inputs, test_outputs = create_input_output_pairs(test_trunc_sequences, max_seq_length)
print('complete')

# Model Training and Evaluation





## Model Building
Build a `SimpleRNN` model, add hidden layers and an output layer.

In [None]:
# TODO

# Import necessary libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import SimpleRNN, Dense

def build_rnn_model(vocab_size, max_seq_length):
    """
    Builds an RNN model.

    Args:
        vocab_size (int): Size of the vocabulary (number of unique tokens).
        max_seq_length (int): Maximum input sequence length.

    Returns:
        keras.Model: Uncompiled RNN model (it is compiled later, in main()).
    """

    # Define RNN model
    model = Sequential([
        # Embedding layer
        Embedding(input_dim=vocab_size, output_dim=100),
        # Hidden Layers with Dropout
        SimpleRNN(128, return_sequences=False),
        Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.5),  # Dropout layer
        # Dense layer for output
        Dense(vocab_size, activation='softmax')
    ])

    test_model_structure(model, vocab_size)

    return model
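
To make concrete what a simple recurrent layer computes, here is a minimal pure-Python sketch of the recurrence h_t = tanh(W_x x_t + W_h h_(t-1) + b), returning only the last hidden state (the behaviour selected by `return_sequences=False`). The function name, weight layout and toy numbers are assumptions for illustration. Keras stores and initializes its weights differently, so this is a sketch of the idea, not of the library's code.

```python
import math


def rnn_forward(inputs, W_x, W_h, b):
    """Run a tanh recurrence over a sequence and return the final hidden state.

    inputs: list of input vectors, one per time step
    W_x:    input weights, one row per hidden unit
    W_h:    recurrent weights, one row per hidden unit
    b:      bias, one value per hidden unit
    """
    hidden = [0.0] * len(b)  # the hidden state starts at zero
    for x in inputs:
        new_hidden = []
        for j in range(len(b)):
            total = b[j]
            total += sum(W_x[j][k] * x[k] for k in range(len(x)))
            total += sum(W_h[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden  # the new state feeds into the next time step
    return hidden


# Toy setup: 2-dimensional inputs (think: embeddings) and 2 hidden units
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.0, 0.4], [-0.3, 0.0]]
b = [0.0, 0.1]

one_step = rnn_forward([[1.0, 0.0]], W_x, W_h, b)
final_state = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W_x, W_h, b)
print(one_step)
print(final_state)
```

In `build_rnn_model`, the vectors fed to the recurrent layer are the outputs of the `Embedding` layer, and the final hidden state is what the `Dense` layers receive.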

### Improvement Strategies

Here are some techniques to make the training process more robust and improve performance:

* Increase the model capacity by stacking multiple **SimpleRNN** layers.
* Change the Embedding size.
* Consider adding `Dropout` or `BatchNormalization` layers to converge faster and generalize well.
* Increase the dataset size or add more training data. (**Remember:** You can add up to 23k training records)
  * You can also consider increasing the `max_seq_length`.
* Try different optimization techniques.
* Increase the number of epochs.

**Test case:** Achieve at least 25% validation accuracy in order to pass this test.

In [None]:
# TODO
def main():
  import tensorflow as tf
  tf.keras.backend.clear_session()   
  
  vocab_size = len(tokenizer.word_index) + 1
  model = build_rnn_model(vocab_size, max_seq_length)
  
  # Compile
  model.compile(loss='sparse_categorical_crossentropy', 
                optimizer='adam', 
                metrics=['accuracy'])
  
  # Train with progress
  history = model.fit(train_inputs, train_outputs, 
                      epochs=10,   
                      batch_size=512, 
                      validation_data=(valid_inputs, valid_outputs),  
                      verbose=1)   
  
  # Summary
  model.summary()
  
  # Evaluate
  loss, accuracy = model.evaluate(test_inputs, test_outputs, verbose=0)
  print(f'\nTest Accuracy: {accuracy}')
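  test_validation_accuracy(history)
  # Return the trained model so later cells (e.g. the BLEU evaluation) can use it
  return model

if __name__=="__main__":
  model = main()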


## BLEU Score Evaluation

BLEU (Bilingual Evaluation Understudy) is a metric used to evaluate how well a machine-generated text matches a human-written reference text. It's commonly used in tasks like machine translation and text generation. <br> The score ranges from 0 to 1.
* 1 means the generated text perfectly matches the reference.
* 0 means there's no similarity at all.

It checks for both word matches and the correct sequence of words (through n-gram overlap), while penalizing texts that are too short through a brevity penalty.
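
To give a feel for the mechanics, here is a deliberately simplified sketch that uses single-word (unigram) matches only, combined with the brevity penalty. Full BLEU also combines longer n-grams, and the course's `calculate_bleu` helper may be implemented differently. `unigram_bleu` is a hypothetical name used only for this illustration.

```python
import math
from collections import Counter


def unigram_bleu(candidate, reference):
    """Clipped unigram precision multiplied by the brevity penalty."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # A candidate word is credited at most as often as it occurs in the reference
    matches = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    precision = matches / len(candidate)
    # Brevity penalty: candidates shorter than the reference are scaled down
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * precision


reference = "the cat sat on the mat".split()

perfect = unigram_bleu("the cat sat on the mat".split(), reference)
repeated = unigram_bleu("the the the the the the".split(), reference)
too_short = unigram_bleu("the cat".split(), reference)

print(perfect)    # 1.0
print(repeated)   # 2 of 6 words credited, about 0.333
print(too_short)  # all words match, but exp(1 - 6/2) scales it to about 0.135
```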

To perform this evaluation, we will:
1. Convert the token IDs back to words to get the reference words.
2. Generate predictions on the input tokens to get the predicted words.
3. Calculate the BLEU score from the reference and predicted words.

**Test case**: Achieve a BLEU score of at least 15% to pass this test.



In [None]:
# Convert test outputs to reference words
reference_words = convert_token_ids_to_words(test_outputs, tokenizer)

# Generate predictions
predicted_words = generate_predictions(model, test_inputs, tokenizer)

# Calculate BLEU score
calculate_bleu(predicted_words, reference_words)