<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Sequence to Sequence Models</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Recurrent Neural Networks for Language Modeling with Keras)</span></div>

## Table of Contents

1. [Introduction to Sequence Architectures](#section-1)
2. [Text Generation: Concepts & Examples](#section-2)
3. [The Text Generating Function](#section-3)
4. [Probability Scaling (Temperature)](#section-4)
5. [Text Generation Model Architecture](#section-5)
6. [Neural Machine Translation (NMT) Overview](#section-6)
7. [NMT: Encoder Architecture](#section-7)
8. [NMT: Decoder Architecture](#section-8)
9. [Data Preparation for NMT](#section-9)
10. [Training and Evaluation](#section-10)
11. [Course Wrap-up and RNN Pitfalls](#section-11)
12. [Conclusion](#section-12)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. Introduction to Sequence Architectures</span><br>

Recurrent Neural Networks (RNNs) can map an input sequence either to a single output or to another sequence. Sequence-to-sequence models cover the second case, and the output sequence does not have to be the same length as the input.

### Possible Architectures

There are two primary architectural patterns discussed in this module:

| Architecture Type | Description | Examples |
| :--- | :--- | :--- |
| **Many-to-One** | Many inputs mapped to a single output. | • Sentiment Analysis<br>• Classification |
| **Many-to-Many** | Many inputs mapped to many outputs. | • Text Generation<br>• Neural Machine Translation (NMT) |

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> In this notebook, we focus on <b>Many-to-Many</b> architectures, specifically for Text Generation and Neural Machine Translation. </div>
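
To make the two patterns concrete, here is a minimal NumPy sketch of a plain tanh RNN. It is not the Keras implementation; the helper `simple_rnn_states` and all weight names are assumptions made for illustration. The same recurrent pass serves both patterns: many-to-many keeps every hidden state, while many-to-one keeps only the last one.

```python
import numpy as np

def simple_rnn_states(x, W_x, W_h, b):
    """Run a plain tanh RNN over x (time_steps, features); return all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:
        # Each step mixes the current input with the previous hidden state
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
time_steps, features, units = 6, 4, 3
x = rng.normal(size=(time_steps, features))
W_x = rng.normal(size=(features, units))
W_h = rng.normal(size=(units, units))
b = np.zeros(units)

all_states = simple_rnn_states(x, W_x, W_h, b)

many_to_many = all_states      # one vector per input step, shape (6, 3)
many_to_one = all_states[-1]   # only the final state, shape (3,)
print(many_to_many.shape, many_to_one.shape)
```

In Keras terms, this roughly corresponds to the difference between `return_sequences=True` and `return_sequences=False` on a recurrent layer.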

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Text Generation: Concepts & Examples</span><br>

Text generation involves training a model to predict the next token in a sequence. This can be used to mimic the style of a specific author or character.

### Example: Mimicking Sheldon Cooper
Using a pre-trained model, we can generate phrases in the style of Sheldon Cooper (from *The Big Bang Theory*).



In [None]:
# Conceptual example of using a pre-trained model method
class MockModel:
    def generate_sheldon_phrase(self):
        return "'knock knock. penny. do you have an epost is part in your expert, too bealie to play the tariment with last night.'"

model = MockModel()
print(model.generate_sheldon_phrase())



### Modeling Decisions
When building text generation models, you must decide on the granularity of your tokens:

1.  **Word-level Tokens**:
    *   **Pros**: Generates coherent words.
    *   **Cons**: Requires massive datasets (hundreds of millions of sentences) to learn a robust vocabulary.
2.  **Character-level Tokens**:
    *   **Pros**: Can be trained faster; smaller vocabulary size.
    *   **Cons**: Can generate typos or non-existent words.

### Workflow
1.  **Decide Token Type**: Characters vs. Words.
2.  **Prepare Data**: Build training samples with `(past_tokens, next_token)` pairs.
3.  **Design Architecture**: Embedding layers, LSTM layers, etc.
4.  **Train & Experiment**: Iterate on hyperparameters.
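
Step 2 of the workflow can be sketched in plain Python. The helper name `build_training_pairs` and the `window` and `step` parameters are assumptions made for this example, not names from a library: a fixed-size window slides over the text, and each window is paired with the character that follows it.

```python
def build_training_pairs(text, window=5, step=1):
    """Slide a fixed-size window over text; each window predicts the next character."""
    pairs = []
    for start in range(0, len(text) - window, step):
        past_tokens = text[start:start + window]
        next_token = text[start + window]
        pairs.append((past_tokens, next_token))
    return pairs

pairs = build_training_pairs("knock knock. penny.", window=5, step=3)
for past, nxt in pairs:
    print(repr(past), "->", repr(nxt))
```

A larger `step` produces fewer, less overlapping samples. Before training, each character would still need to be converted to a numerical (for example one-hot) representation.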

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. The Text Generating Function</span><br>

To generate sentences, we feed the model a seed sequence, predict the next character, append it to the sequence, and repeat.

### Sentence Structure
*   Sentence boundaries are determined by punctuation (e.g., `.`, `!`, `?`).
*   Punctuation marks must be included in the vocabulary.
*   Special tokens like `<SENT>` and `</SENT>` can mark the beginning and end of sentences.

### Generation Logic
The following code demonstrates the logic loop for generating a sentence character by character.



In [None]:
import numpy as np

# Mocking necessary components for the example to run
class MockKerasModel:
    def predict(self, X):
        # Returns a dummy probability distribution for 5 characters
        return np.array([[0.1, 0.1, 0.1, 0.6, 0.1]])

def index_to_char(index):
    mapping = {0: 'a', 1: 'b', 2: 'c', 3: '.', 4: ' '}
    return mapping.get(index, '')

# Initialize variables
model = MockKerasModel()
sentence = ''
next_char = ''
X = np.zeros((1, 10, 5)) # Dummy input shape

# Loop until end of sentence (period is detected)
# Note: In a real scenario, you would update X with the new char in every iteration
counter = 0 
while next_char != '.' and counter < 10: # Added counter to prevent infinite loop in this mock
    # Predict next char: Get pred array in position 0
    pred = model.predict(X)[0]
    
    # Get index of highest probability
    char_index = np.argmax(pred)
    
    # Convert index to character
    next_char = index_to_char(char_index)
    
    # Concatenate to sentence
    sentence = sentence + next_char
    
    counter += 1

print(f"Generated sentence segment: {sentence}")
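
The loop above keeps `X` fixed. A fuller sketch rebuilds the input window from the most recent characters on every iteration. Everything here is illustrative: `toy_predict` is a hypothetical stand-in for `model.predict` that simply favours the next character in the vocabulary, and `one_hot_window` and `generate` are helper names assumed for this example.

```python
import numpy as np

vocab = ['a', 'b', 'c', '.', ' ']
char_to_index = {ch: i for i, ch in enumerate(vocab)}

def one_hot_window(text, window):
    """Encode the last `window` characters as an array of shape (1, window, n_vocab)."""
    X = np.zeros((1, window, len(vocab)))
    chars = text[-window:]
    offset = window - len(chars)  # right-align seeds shorter than the window
    for t, ch in enumerate(chars):
        X[0, offset + t, char_to_index[ch]] = 1.0
    return X

def toy_predict(X):
    """Hypothetical stand-in for model.predict: favours the character after the last one."""
    last_index = int(np.argmax(X[0, -1]))
    probs = np.full(len(vocab), 0.05)
    probs[(last_index + 1) % len(vocab)] = 0.8
    return probs[np.newaxis, :]

def generate(seed, window=3, max_chars=20):
    sentence = seed
    while not sentence.endswith('.') and len(sentence) < max_chars:
        X = one_hot_window(sentence, window)  # rebuild the input from the latest characters
        pred = toy_predict(X)[0]
        sentence += vocab[int(np.argmax(pred))]
    return sentence

print(generate("aab"))
```

With a trained model, `toy_predict(X)` would be replaced by `model.predict(X)`; the `max_chars` guard plays the same role as the counter in the loop above.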



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Probability Scaling (Temperature)</span><br>

Instead of always picking the character with the absolute highest probability (which can be repetitive), we can sample from the probability distribution. **Temperature** is a hyperparameter that scales this distribution to control "creativity."

### Temperature Effects
*   **Small values (< 1.0)**: Makes the distribution "sharper." The model becomes more confident and conservative.
*   **Value = 1.0**: No scaling.
*   **Higher values (> 1.0)**: Flattens the distribution. The model becomes more random and "creative," but potentially incoherent.
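
These three cases follow from one formula. Using notation introduced here for illustration ($p_i$ for the softmax probability of token $i$ and $T$ for the temperature), the rescaled distribution is:

$$
p_i^{(T)} = \frac{\exp\left(\log p_i / T\right)}{\sum_j \exp\left(\log p_j / T\right)} = \frac{p_i^{1/T}}{\sum_j p_j^{1/T}}
$$

With $T = 1$ the exponent is 1 and the distribution is unchanged. As $T \to 0$ the largest probability dominates the sum, and as $T \to \infty$ every term approaches 1, so the distribution approaches uniform.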

### Implementation
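The function below rescales the softmax output by dividing its log-probabilities by the temperature, exponentiating, and renormalizing. It then samples one class index from the rescaled distribution.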



In [None]:
import numpy as np

def scale_softmax(softmax_pred, temperature=1.0):
    # Cast to float64 so the normalized values sum to 1 within the tolerance
    # np.random.multinomial expects (float32 model outputs can exceed it)
    probs = np.asarray(softmax_pred, dtype=np.float64)

    # Take the logarithm and divide by the temperature
    scaled_pred = np.log(probs) / temperature

    # Re-apply the exponential
    scaled_pred = np.exp(scaled_pred)

    # Build probability distribution (normalize)
    scaled_pred = scaled_pred / np.sum(scaled_pred)

    # Draw one sample from the new distribution
    # np.random.multinomial returns a one-hot array of shape (1, n_classes)
    sample = np.random.multinomial(1, scaled_pred, 1)

    # Return the sampled class index
    return np.argmax(sample)

# Example Usage
dummy_preds = np.array([0.1, 0.2, 0.6, 0.1]) # High confidence in index 2
print(f"Original Max: {np.argmax(dummy_preds)}")

# Low temp (conservative)
print(f"Low Temp (0.1): {scale_softmax(dummy_preds, temperature=0.1)}")

# High temp (creative/random)
print(f"High Temp (2.0): {scale_softmax(dummy_preds, temperature=2.0)}")
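
Because the function above samples, its printed index can change from run to run. To inspect the effect of temperature deterministically, the sketch below returns the rescaled distribution itself instead of a sample. The helper name `apply_temperature` is an assumption made for this example.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector by temperature and renormalize (no sampling)."""
    probs = np.asarray(probs, dtype=np.float64)
    scaled = np.exp(np.log(probs) / temperature)
    return scaled / scaled.sum()

probs = np.array([0.1, 0.2, 0.6, 0.1])
for t in (0.1, 1.0, 2.0):
    # Low temperature concentrates mass on index 2; high temperature spreads it out
    print(t, np.round(apply_temperature(probs, t), 3))
```

At `t = 1.0` the input distribution comes back unchanged, which is a quick sanity check for any temperature implementation.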



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Text Generation Model Architecture</span><br>

The architecture for text generation is essentially a multi-class classifier over the vocabulary: given a window of past characters, the model predicts which character comes next.

### Key Characteristics
*   **Classes**: The vocabulary size acts as the number of classes.
*   **Output Layer**: Softmax activation with units equal to vocabulary size.
*   **Loss Function**: `categorical_crossentropy`.
*   **Evaluation**: We monitor loss, but "accuracy" is less relevant. Humans evaluate the quality of the generated text.
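
To see what the loss measures, here is a minimal NumPy sketch of categorical cross-entropy for one-hot targets. It is written for illustration and is not the Keras implementation; the `eps` clipping value is an assumption added to avoid `log(0)`. For a one-hot target, the loss reduces to the negative log of the probability assigned to the correct next character.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted distributions."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

n_vocab = 4
y_true = np.eye(n_vocab)[[2, 0]]             # correct next characters: index 2, then index 0
y_pred = np.array([[0.1, 0.2, 0.6, 0.1],     # 0.6 assigned to the correct class
                   [0.7, 0.1, 0.1, 0.1]])    # 0.7 assigned to the correct class

loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))
```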

### Keras Implementation
Here is a standard LSTM-based architecture for text generation.



In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Hyperparameters (Example values)
units = 128
chars_window = 50  # Length of input sequence
n_vocab = 40       # Size of vocabulary
dropout = 0.15

model = Sequential()

# First LSTM Layer: Must return sequences to feed the next LSTM layer
model.add(LSTM(units, input_shape=(chars_window, n_vocab),
               dropout=dropout, recurrent_dropout=dropout, return_sequences=True))

# Second LSTM Layer: Returns only the last output (return_sequences=False by default)
model.add(LSTM(units, dropout=dropout, recurrent_dropout=dropout,
               return_sequences=False))

# Output Layer: Predicts probability of next character
model.add(Dense(n_vocab, activation='softmax'))

# Compile
model.compile(loss='categorical_crossentropy', optimizer='adam')

model.summary()



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> If the results are not good, try training for more epochs or adding complexity (more memory cells/units, more layers). </div>

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Neural Machine Translation (NMT) Overview</span><br>

Neural Machine Translation (NMT) is a "Many-to-Many" problem where we map a sequence of words in one language to a sequence of words in another.

### Example
*   **Input**: "Vamos jogar futebol?" (Portuguese)
*   **Output**: "Let's go play soccer?" (English)



In [None]:
# Conceptual usage
# model.translate("Vamos jogar futebol?")
# Output: 'Let's go play soccer?'



### The Encoder-Decoder Architecture
NMT models typically use two distinct parts:
1.  **Encoder**: Processes the input sequence and compresses the information into a context vector (the internal state).
2.  **Decoder**: Takes the context vector and generates the output sequence step-by-step.

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. NMT: Encoder Architecture</span><br>

The Encoder is responsible for reading the input language. It uses an Embedding layer to handle word vectors and an LSTM layer to capture sequential dependencies.

### Key Component: RepeatVector
The output of the Encoder is a single vector (the final state). However, the Decoder expects a sequence as input, with one vector for each time step of the output sentence. `RepeatVector` bridges this gap by repeating the Encoder's output vector `N` times, where `N` is the length of the output sequence.
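
The effect of `RepeatVector` on shapes can be reproduced in NumPy. This is an illustrative equivalent rather than the Keras layer itself, and the variable names are assumptions made for this example: a batch of context vectors with shape `(batch, units)` becomes `(batch, N, units)`, with the same vector copied into every time step.

```python
import numpy as np

batch_size, units, output_len = 2, 4, 3

# Stand-in for the encoder output: one context vector per sample
context = np.arange(batch_size * units, dtype=float).reshape(batch_size, units)

# Insert a time axis, then copy the vector along it
repeated = np.repeat(context[:, np.newaxis, :], output_len, axis=1)

print(context.shape, "->", repeated.shape)
```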



In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector

# Dimensions
input_language_size = 1000
input_wordvec_dim = 64
input_language_len = 20
output_language_len = 20

# Instantiate the model
model = Sequential()

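# Embedding layer for input language
# mask_zero=True ignores 0-padded values
# Note: input_length is accepted by older tf.keras versions; Keras 3 deprecates it,
# so drop the argument if your version warns about it
model.add(Embedding(input_language_size, input_wordvec_dim,
                    input_length=input_language_len, mask_zero=True))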

# Add LSTM layer (The Encoder)
model.add(LSTM(128))

# Repeat the last vector to match the length of the output sequence
model.add(RepeatVector(output_language_len))

print("Encoder built successfully.")



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. NMT: Decoder Architecture</span><br>

The Decoder takes the repeated context vectors and generates the translated sentence.

### Key Component: TimeDistributed
We want to apply a Dense layer to *every* time step of the sequence independently to predict the word for that specific position. `TimeDistributed` wraps the Dense layer to apply it to every slice of the temporal input.
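
The idea behind `TimeDistributed(Dense(...))` can be sketched in NumPy: one shared weight matrix is applied to each time step separately. The code below is an illustration under assumed shapes and names, not the Keras implementation. It checks that an explicit per-step loop gives the same result as a single vectorized matrix product.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch, steps, units, vocab_size = 2, 5, 8, 6

decoder_out = rng.normal(size=(batch, steps, units))  # stand-in for the decoder LSTM output
W = rng.normal(size=(units, vocab_size))              # one weight matrix shared by all steps
b = np.zeros(vocab_size)

# Explicit loop: apply the same dense transform to each time step
per_step = np.stack(
    [softmax(decoder_out[:, t, :] @ W + b) for t in range(steps)], axis=1
)

# Vectorized: matmul broadcasts the same weights over the time axis
vectorized = softmax(decoder_out @ W + b)

print(per_step.shape, np.allclose(per_step, vectorized))
```

Each position in the output holds a probability distribution over the target vocabulary, which is why the result has shape `(batch, steps, vocab_size)`.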



In [None]:
from tensorflow.keras.layers import TimeDistributed, Dense

# Continuing from the previous model...
eng_vocab_size = 2000

# Add LSTM layer (The Decoder)
# return_sequences=True is mandatory here because we need an output for every time step
model.add(LSTM(128, return_sequences=True))

# Add Time Distributed Dense layer
# Applies the Dense layer to every time step to predict the word index
model.add(TimeDistributed(Dense(eng_vocab_size, activation='softmax')))

model.summary()



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 9. Data Preparation for NMT</span><br>

Data preparation for NMT involves tokenizing both the input language (e.g., Portuguese) and the output language (e.g., English), padding them to fixed lengths, and one-hot encoding the output for the classification task.

### 9.1 Tokenize Input Language
We convert text sentences into sequences of integers.



In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Dummy Data
input_texts_list = ["Vamos jogar futebol", "Bom dia"]
length = 20

# Use the Tokenizer class
tokenizer = Tokenizer()
tokenizer.fit_on_texts(input_texts_list)

# Text to sequence of numerical indexes
X = tokenizer.texts_to_sequences(input_texts_list)

# Pad sequences (padding='post' adds zeros at the end)
X = pad_sequences(X, maxlen=length, padding='post')

print(f"Input Shape: {X.shape}")
print(X)



### 9.2 Tokenize Output Language
We repeat the same steps for the target language, using a separate tokenizer because each language has its own vocabulary.



In [None]:
# Dummy Data
output_texts_list = ["Let's play soccer", "Good morning"]

# Use the Tokenizer class
tokenizer_out = Tokenizer()
tokenizer_out.fit_on_texts(output_texts_list)

# Text to sequence of numerical indexes
Y = tokenizer_out.texts_to_sequences(output_texts_list)

# Pad sequences
Y = pad_sequences(Y, maxlen=length, padding='post')

print(f"Output Sequence Shape: {Y.shape}")



### 9.3 One-Hot Encode Output
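The output `Y` needs to be one-hot encoded because the final layer predicts a softmax distribution over the vocabulary at each position, and a loss such as `categorical_crossentropy` compares that distribution against one-hot targets.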



In [None]:
from tensorflow.keras.utils import to_categorical
import numpy as np

# Get vocab size from tokenizer
vocab_size = len(tokenizer_out.word_index) + 1 

# Instantiate a temporary variable
ylist = list()

# Loop over the sequence of numerical indexes
for sequence in Y:
    # One-hot encode each index on current sentence
    encoded = to_categorical(sequence, num_classes=vocab_size)
    
    # Append one-hot encoded values to the list
    ylist.append(encoded)

# Transform to np.array and reshape
# Shape: (Number of samples, Sequence Length, Vocab Size)
Y = np.array(ylist).reshape(Y.shape[0], Y.shape[1], vocab_size)

print(f"Final One-Hot Output Shape: {Y.shape}")
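
The reverse direction is needed at prediction time: the model outputs one probability vector per position, and we take the `argmax` to recover word indexes and map them back to words. The sketch below is illustrative. `decode_sequence` is a helper name assumed for this example, and `index_to_word` is a hand-written mapping standing in for the tokenizer's index-to-word dictionary.

```python
import numpy as np

# Hypothetical mapping for illustration; index 0 is reserved for padding
index_to_word = {1: "let's", 2: "play", 3: "soccer", 4: "good", 5: "morning"}

def decode_sequence(probabilities, index_to_word):
    """Turn per-position probability vectors into a sentence, skipping padding (index 0)."""
    indices = np.argmax(probabilities, axis=-1)
    words = [index_to_word[int(i)] for i in indices if int(i) != 0]
    return " ".join(words)

vocab_size = 6
padded_indices = [1, 2, 3, 0, 0]             # a padded target sequence
one_hot = np.eye(vocab_size)[padded_indices]  # shape (5, 6), like one sample of Y

print(decode_sequence(one_hot, index_to_word))
```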



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 10. Training and Evaluation</span><br>

### Training
Training is performed using the standard Keras `fit` method.



In [None]:
# N is the number of epochs
N = 5
# model.fit(X, Y, epochs=N) 
print("Model training command: model.fit(X, Y, epochs=N)")



### Evaluation: BLEU Score
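Token-level accuracy is a poor metric for translation, because the same sentence can be translated correctly in several ways. We use **BLEU (Bilingual Evaluation Understudy)** instead, which measures n-gram overlap between the machine-generated translation and one or more human reference translations.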
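
One ingredient of BLEU is the clipped (modified) n-gram precision. The sketch below computes it against a single reference in plain Python. It is a simplification for illustration, not the NLTK implementation: full BLEU also combines the precisions for several n-gram sizes and applies a brevity penalty. The function names are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    if not hyp_counts:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

reference = "let's go play soccer".split()
hypothesis = "let's play soccer soccer".split()

p1 = modified_precision(reference, hypothesis, 1)
p2 = modified_precision(reference, hypothesis, 2)
print(p1, p2)
```

The clipping step is what stops the repeated word "soccer" from being counted twice in the unigram precision.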



In [None]:
import nltk

# Example of how to access BLEU score in NLTK
# nltk.translate.bleu_score.sentence_bleu(references, hypothesis)
print("Evaluation metric: nltk.translate.bleu_score")



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 11. Course Wrap-up and RNN Pitfalls</span><br>

This section summarizes the entire course on Recurrent Neural Networks for Language Modeling.

### Course Summary
1.  **Introduction to Language Tasks**:
    *   Sentiment Classification.
    *   Multi-class Classification.
    *   Text Generation.
    *   Neural Machine Translation.
2.  **Sequence to Sequence Models**: Understanding Many-to-Many architectures.
3.  **Implementation**: Using Keras `Sequential`, `LSTM`, `Embedding`, and `TimeDistributed` layers.

### RNN Pitfalls & Solutions
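*   **Vanishing/Exploding Gradients**: Standard RNNs struggle with long sequences because gradients are multiplied repeatedly during backpropagation through time, so they can shrink toward zero or grow without bound.
*   **Solution**: Use **GRU** (Gated Recurrent Unit) or **LSTM** (Long Short-Term Memory) cells, which have internal mechanisms (gates) to regulate information flow and mainly address vanishing gradients. Exploding gradients are usually handled by clipping the gradient norm.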
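
A scalar toy model shows why sequence length matters. If the gradient is multiplied by the same factor at every time step, it either collapses or blows up exponentially. The sketch below is a deliberate simplification (real RNNs multiply by matrices, not a single number), and both helper names are assumptions made for this example. It also shows gradient clipping by norm, a commonly used remedy for the exploding case.

```python
import numpy as np

def repeated_factor(weight, time_steps):
    """Magnitude of a gradient multiplied by the same factor at every time step."""
    return abs(weight) ** time_steps

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

print(repeated_factor(0.5, 50))   # factor below 1: the gradient vanishes
print(repeated_factor(1.5, 50))   # factor above 1: the gradient explodes

exploding = np.array([300.0, -400.0])        # L2 norm is 500
print(clip_by_norm(exploding, max_norm=5.0))  # direction kept, norm reduced to 5
```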

### Other Applications
*   **Name Creation**: Baby names, star names.
*   **Marked Text Generation**: LaTeX, Markdown, XML, Code.
*   **Chatbots**: Conversational agents.

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 12. Conclusion</span><br>

In this notebook, we explored the advanced capabilities of Recurrent Neural Networks, specifically focusing on **Sequence-to-Sequence** models.

**Key Takeaways:**
1.  **Text Generation**: We learned how to generate text character-by-character using LSTM networks and how to control the creativity of the output using **Temperature**.
2.  **NMT Architecture**: We built a complete Encoder-Decoder architecture for Machine Translation using `RepeatVector` to bridge the gap between the encoder's summary and the decoder's sequence generation.
3.  **Data Prep**: We walked through the preprocessing pipeline required for NMT, including separate tokenization of the input and output languages and one-hot encoding of the target sequence.

**Next Steps:**
*   Experiment with different **Temperature** values to see how they affect text generation.
*   Try replacing `LSTM` layers with `GRU` layers to compare training speed and performance.
*   Apply the NMT architecture to a real-world dataset (e.g., the Anki project datasets) to build a functional translator.
