<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Transformer/4_sequence_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence Models

In [26]:
# Install required packages (if not already installed)
%pip install torch
# Import necessary libraries
from collections import defaultdict, Counter
import random
import torch
import torch.nn as nn

Note: you may need to restart the kernel to use updated packages.


### 1. Traditional Sequence Models: n-grams and Markov Models

**Objective**: Model word sequences using fixed-length contexts, then generate text by sampling from the estimated next-word probabilities.

**Explanation**: 
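- **n-grams**: A sequence of `n` consecutive items from text. For example, a trigram model (n=3) estimates the probability of each word from how often it follows the two preceding words in the training text. Because the context is fixed at n-1 words, n-gram models cannot capture dependencies that span a longer distance.
- **Markov Models**: Treat text as transitions between states, predicting the next word from the current state alone (e.g., the previous word). A first-order Markov model over words is equivalent to a bigram model.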
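
The counting idea behind n-grams can be sketched for any `n`, not only bigrams. The snippet below is a minimal illustration, assuming whitespace tokenization and plain maximum-likelihood estimates with no smoothing; the helper names `ngram_counts` and `next_token_probs` and the toy sentence are hypothetical and not part of the notebook.

```python
from collections import Counter, defaultdict

def ngram_counts(tokens, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_token_probs(counts, context):
    """Maximum-likelihood estimate of P(next token | context) from raw counts."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    # An unseen context has no estimate at all (no smoothing in this sketch)
    return {tok: c / total for tok, c in followers.items()} if total else {}

# Toy corpus (hypothetical), tokenized on whitespace
tokens = "the cat sat on the mat and the cat ran".split()
trigram_counts = ngram_counts(tokens, n=3)

# "the cat" is followed once by "sat" and once by "ran"
probs = next_token_probs(trigram_counts, ["the", "cat"])
print(probs)  # {'sat': 0.5, 'ran': 0.5}
```

A context that never occurred in the corpus returns an empty dictionary here, which shows the data-sparsity problem that grows as `n` increases.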

In [27]:
# Sample text for Markov chain
text = "The sun sets over the distant hills as a gentle breeze rustles through the leaves, carrying with it the fragrance of blooming flowers and the promise of a peaceful evening."
# Generate bigrams and compute probabilities
def generate_bigrams(text):
    words = text.split()
    bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
    return bigrams

def markov_chain(bigrams):
    chain = defaultdict(Counter)
    for word1, word2 in bigrams:
        chain[word1][word2] += 1
    # Normalize counts to probabilities
    for word1 in chain:
        total_count = float(sum(chain[word1].values()))
        for word2 in chain[word1]:
            chain[word1][word2] /= total_count
    return chain

bigrams = generate_bigrams(text)
chain = markov_chain(bigrams)

# Generate text based on the Markov chain
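def generate_text(chain, start_word, length=10):
    word = start_word
    generated = [word]  # separate name, so the module-level `text` is not shadowed
    for _ in range(length - 1):
        # Stop early at a dead end: a word that is never the first word of a bigram
        if word not in chain:
            break
        next_words = list(chain[word].keys())
        probabilities = list(chain[word].values())
        # Sample the next word in proportion to its transition probability
        word = random.choices(next_words, weights=probabilities)[0]
        generated.append(word)
    return ' '.join(generated)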

print("Generated Text:", generate_text(chain, start_word="The"))

Generated Text: The sun sets over the leaves, carrying with it the


### 2. Recurrent Neural Networks (RNNs)

**Objective**: Model sequences by retaining information from previous steps, enabling context across time steps.

**Explanation**: 
- RNNs process sequences element by element, updating a hidden state that summarizes the steps seen so far. They struggle with long sequences because gradients propagated back through many time steps tend to vanish (or explode), which makes long-range dependencies hard to learn.
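
The element-by-element update can be written out by hand. The following is a minimal sketch, assuming PyTorch's default single-layer `nn.RNN` with a `tanh` nonlinearity and a zero initial hidden state; the tensor sizes and the names `manual_out` and `h` are illustrative choices, not part of the notebook's cells.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, input_size, hidden_size = 2, 5, 4, 3
rnn = nn.RNN(input_size, hidden_size, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)

# Manual unroll: h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh)
h = torch.zeros(batch_size, hidden_size)  # initial hidden state (zeros by default)
steps = []
for t in range(seq_len):
    h = torch.tanh(
        x[:, t, :] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
        + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
    )
    steps.append(h)
manual_out = torch.stack(steps, dim=1)  # (batch, seq_len, hidden)

# Reference: the built-in layer run on the same input
out, h_n = rnn(x)
print("Shapes:", out.shape, h_n.shape)
print("Manual unroll matches nn.RNN:", torch.allclose(out, manual_out, atol=1e-5))
```

Each hidden state depends on the previous one, so the loop cannot be parallelized across time steps, and gradients for early inputs have to pass through every later `tanh` and `weight_hh_l0` multiplication.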

In [28]:
# Sample RNN
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        out, hidden = self.rnn(x)
        out = self.fc(out[:, -1, :])  # Using the last output for prediction
        return out

# Example usage
input_size = 10
hidden_size = 20
output_size = 1

rnn = SimpleRNN(input_size, hidden_size, output_size)
inputs = torch.randn(5, 3, input_size)  # (batch=5, sequence length=3, features=input_size)
outputs = rnn(inputs)
print("RNN Outputs:", outputs)

RNN Outputs: tensor([[ 0.1883],
        [ 0.1001],
        [-0.0622],
        [ 0.0175],
        [ 0.1789]], grad_fn=<AddmmBackward0>)


### 3. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)

**Objective**: Overcome RNN limitations with gates to selectively retain or forget information.

**Explanation**:
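- **LSTM**: Adds a cell state together with input, forget, and output gates that control what is written, kept, and exposed at each step, which mitigates the vanishing gradient problem.
- **GRU**: A simplified variant with two gates (update and reset) and no separate cell state. It often performs comparably to an LSTM at lower computational cost.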
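
One concrete way to see the cost difference is to count parameters. The sketch below assumes single-layer, unidirectional PyTorch modules with the same sizes used elsewhere in this notebook (`input_size=10`, `hidden_size=20`); the helper `count_parameters` is a hypothetical name introduced for illustration.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 10, 20
layers = {
    "RNN": nn.RNN(input_size, hidden_size, batch_first=True),
    "LSTM": nn.LSTM(input_size, hidden_size, batch_first=True),
    "GRU": nn.GRU(input_size, hidden_size, batch_first=True),
}

param_counts = {name: count_parameters(layer) for name, layer in layers.items()}
for name, n in param_counts.items():
    ratio = n // param_counts["RNN"]
    print(f"{name}: {n} parameters ({ratio}x the plain RNN)")
```

The plain RNN has one set of input and recurrent weights, the GRU stacks three such sets (reset gate, update gate, candidate state), and the LSTM stacks four (input, forget, and output gates plus the candidate cell state), so the counts scale by 3 and 4 respectively.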

In [29]:
# Sample LSTM
class SimpleLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        out, (hidden, cell) = self.lstm(x)
        out = self.fc(out[:, -1, :])
        return out

# Example usage
lstm = SimpleLSTM(input_size, hidden_size, output_size)
outputs = lstm(inputs)
print("LSTM Outputs:", outputs)

LSTM Outputs: tensor([[ 0.0953],
        [-0.0282],
        [ 0.0621],
        [ 0.0555],
        [ 0.0997]], grad_fn=<AddmmBackward0>)


In [30]:
# Sample GRU
class SimpleGRU(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleGRU, self).__init__()
        self.hidden_size = hidden_size
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        out, hidden = self.gru(x)
        out = self.fc(out[:, -1, :])
        return out

# Example usage
gru = SimpleGRU(input_size, hidden_size, output_size)
outputs = gru(inputs)
print("GRU Outputs:", outputs)

GRU Outputs: tensor([[0.0555],
        [0.0071],
        [0.2153],
        [0.2888],
        [0.2065]], grad_fn=<AddmmBackward0>)


### 4. Introduction to the Attention Mechanism

**Objective**: Enable models to focus on the relevant parts of an input sequence, which helps with long-range dependencies.

**Explanation**: 
- Attention allows a model to “attend” to specific input parts when generating each output. It’s especially useful for tasks requiring selective referencing of input tokens (e.g., translation, summarization).
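
Attention can be read as a soft lookup: a query is compared with every key, and the output is a weighted average of the values. The toy example below is a sketch of scaled dot-product attention with hand-picked numbers; the function name `toy_attention` and all tensor values are assumptions chosen so the result is easy to inspect.

```python
import math
import torch
import torch.nn.functional as F

def toy_attention(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = key.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)  # each query's weights sum to 1
    return weights @ value, weights

# Three keys pointing along different axes, each paired with a scalar value
key = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
value = torch.tensor([[10.0], [20.0], [30.0]])

# A single query aligned with the second key
query = torch.tensor([[0.0, 8.0, 0.0, 0.0]])

output, weights = toy_attention(query, key, value)
print("Weights:", weights)  # most of the weight falls on the second key
print("Output:", output)    # close to the second value, 20
```

Because the query matches the second key, the softmax concentrates almost all of the weight there and the output is pulled toward that key's value. With a less sharply aligned query, the output would be a more even blend of all three values.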

In [31]:
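# Simple (scaled dot-product) attention mechanism
def attention(query, key, value):
    d_k = key.size(-1)  # feature dimension shared by queries and keys
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    # Normalize over the key axis so each query's weights sum to 1
    attn_weights = torch.nn.functional.softmax(scores, dim=-1)
    # Weighted sum of the values
    output = torch.matmul(attn_weights, value)
    return output, attn_weights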

# Example inputs with shape (batch=5, sequence length=3, feature dimension=8)
query = torch.randn(5, 3, 8)
key = torch.randn(5, 3, 8)
value = torch.randn(5, 3, 8)

# Apply attention
output, weights = attention(query, key, value)
print("Attention Output:", output)
print("Attention Weights:", weights)

Attention Output: tensor([[[-0.8891,  0.4020,  0.7994,  0.3439,  0.3556,  0.6809,  0.4139,
          -0.3774],
         [ 0.0418,  0.3143, -0.2218,  0.9471, -0.6442,  1.3846, -0.3155,
          -0.4482],
         [-0.7372,  0.4270,  0.6818,  0.3993,  0.2088,  0.7749,  0.3235,
          -0.3409]],

        [[-0.0585,  1.3033, -0.3759, -0.8232,  0.3286,  0.4324, -0.0348,
          -1.2970],
         [ 0.0905,  1.4962, -0.4545, -0.9702,  0.6231,  0.5169, -0.0101,
          -1.1888],
         [-1.1606, -0.0238,  0.0884,  0.2125, -1.5161, -0.0946, -0.1975,
          -1.9829]],

        [[ 0.5571,  0.2656,  0.8578,  0.1008, -0.2550,  0.8173,  0.3074,
          -0.0349],
         [-0.5294,  0.3932,  0.7593,  0.8918, -0.2605,  0.7131,  0.5405,
           0.2413],
         [-0.1079,  0.3202,  0.7876,  0.8017, -0.1750,  0.3296,  0.3486,
           0.0338]],

        [[-0.3111, -0.6088, -0.7258, -0.8805,  0.3606,  0.5548, -0.2823,
          -0.1240],
         [-0.3587, -0.1623, -0.3428, -1.3550, 

**Note**: The examples in this notebook (Markov models, RNNs, LSTMs, GRUs, and attention) illustrate common approaches to sequence modeling. Each addresses a specific challenge: n-gram and Markov models capture short, fixed-length contexts; RNNs carry context forward through a hidden state; LSTMs and GRUs use gates to retain information over longer spans; and attention lets a model focus on the relevant parts of the input.