<a href="https://colab.research.google.com/github/ochaudha/sample/blob/main/RNN1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Explanation of the Code:

Configuration: Sets up constants like HIDDEN_SIZE, LEARNING_RATE, NUM_EPOCHS, and selects the device (cpu by default for broad compatibility). MAX_LENGTH is crucial for fixed-size tensors.

Data Preparation (Lang Class):

The Lang class helps build character-to-index (char2idx) and index-to-character (idx2char) mappings for both English (Roman) and Hindi (Devanagari) alphabets.

It includes special tokens: <SOS> (Start Of Sequence), <EOS> (End Of Sequence), and <PAD> (Padding) for sequence handling.

tensor_from_text: A helper function to convert a string into a padded PyTorch tensor of character indices, adding the <EOS> token.

Model Architecture:

EncoderRNN:

Takes an input character index, converts it to an embedding vector using nn.Embedding.

Processes the embedded character using a nn.GRU (Gated Recurrent Unit), which outputs an output (per time step) and a hidden state (the context of the sequence so far).

init_hidden() provides an initial zero hidden state.

AttnDecoderRNN:

Also uses nn.Embedding for output characters.

Attention (self.attn, self.attn_combine): This is the core of the attention mechanism.

It calculates attn_weights by looking at the current decoder input embedding and its hidden state, and comparing them against all encoder outputs. F.softmax normalizes these weights.

attn_applied is the weighted sum of encoder outputs, allowing the decoder to focus on relevant input parts.

This attn_applied context vector is concatenated with the current embedded input before being fed into the GRU.

nn.GRU: Processes the combined input and attention context.

nn.Linear: Projects the GRU's output to the size of the output vocabulary.

F.log_softmax: Converts the linear output into log-probabilities over the next possible characters.

Training Function (train):

Initializes encoder's hidden state.

Zeroes gradients for both optimizers.

Encoder Loop: Iterates through the input characters, feeding each into the encoder to get encoder_outputs (all encoder hidden states, which attention will use) and the final encoder_hidden state (the context vector for the decoder).

Decoder Loop:

Starts with <SOS> token as input and the encoder_hidden state.

Teacher Forcing: A technique where, during training, the decoder is sometimes fed the actual next character from the target sequence instead of its own prediction. This helps the model learn faster and more stably in early stages. TEACHER_FORCING_RATIO controls how often this happens.

At each step, it predicts the next character, calculates loss against the true target character.

Continues until <EOS> is predicted or MAX_LENGTH is reached.

loss.backward(): Computes gradients.

optimizer.step(): Updates model weights.

Returns the average loss per target character.

Evaluation/Inference Function (evaluate):

Sets models to eval() mode (with torch.no_grad()) to disable gradient calculation and dropout.

Similar to the training decoder loop, but it always feeds the decoder's own predicted character as the next input (no teacher forcing).

Collects decoded characters until <EOS> or MAX_LENGTH is reached.

Main Execution (if __name__ == "__main__":)

Initializes encoder and decoder models, optimizers, and the nn.NLLLoss criterion (Negative Log Likelihood Loss, which works with LogSoftmax output). ignore_index=PAD_token ensures padding tokens don't contribute to the loss.

Runs the training loop for NUM_EPOCHS, randomly selecting a training_pair for each epoch.

Prints loss and sample translations periodically to monitor progress.

After training, it tests the model on a few specific names, including "Omveer" and "NonExistent" to see how it generalizes (or fails to).

This sample provides a foundational understanding of how a character-level sequence-to-sequence model can be built in PyTorch for tasks like transliteration. For production-ready systems, you would need much larger datasets, more complex architectures (e.g., attention variants, deeper LSTMs/Transformers), and more sophisticated training techniques.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import random

# --- 1. Configuration ---
# You can uncomment and modify these if you have a GPU
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cpu") # For broader compatibility

# Hyperparameters
HIDDEN_SIZE = 256
EMBEDDING_DIM = 64
LEARNING_RATE = 0.005
NUM_EPOCHS = 3000
MAX_LENGTH = 15 # Max characters in a name (e.g., "Omveer" is 6, "Rahul" is 5)
TEACHER_FORCING_RATIO = 0.5 # For training stability

# --- 2. Data Preparation ---

# Tiny dataset of Roman script names and their Hindi transliterations
# In a real scenario, this would be a much larger dataset.
training_pairs = [
    ("Omveer", "ओमवीर"),
    ("Rahul", "राहुल"),
    ("Priya", "प्रिया"),
    ("Amit", "अमित"),
    ("Saurabh", "सौरभ"),
    ("Deepak", "दीपक"),
    ("Anjali", "अंजलि"),
    ("Kavita", "कविता"),
    ("Nitin", "नितिन"),
    ("Sneha", "स्नेहा"),
    ("Vivek", "विवेक"),
    ("Pooja", "पूजा"),
    ("Mohan", "मोहन"),
    ("Ritu", "ऋतु"),
    ("Gaurav", "गौरव"),
    ("Preeti", "प्रीति"),
    ("Rakesh", "राकेश"),
    ("Seema", "सीमा"),
    ("Vijay", "विजय"),
    ("Sarita", "सरिता"),
    ("Omveer Singh", "ओमवीर सिंह") # A slightly longer example
]

# Special tokens
SOS_token = 0  # Start Of Sequence
EOS_token = 1  # End Of Sequence
PAD_token = 2  # Padding

class Lang:
    def __init__(self, name):
        self.name = name
        self.char2idx = {}
        self.idx2char = {0: "<SOS>", 1: "<EOS>", 2: "<PAD>"}
        self.n_chars = 3  # Count SOS, EOS, PAD

    def add_sentence(self, sentence):
        for char in sentence:
            self.add_char(char)

    def add_char(self, char):
        if char not in self.char2idx:
            self.char2idx[char] = self.n_chars
            self.idx2char[self.n_chars] = char
            self.n_chars += 1

# Build separate language objects for input (English) and output (Hindi)
input_lang = Lang('eng')
output_lang = Lang('hin')

for eng, hin in training_pairs:
    input_lang.add_sentence(eng)
    output_lang.add_sentence(hin)

print(f"Input vocabulary size: {input_lang.n_chars}")
print(f"Output vocabulary size: {output_lang.n_chars}")

# Helper to convert text to indices tensor, with padding
def tensor_from_text(lang, text, max_length=MAX_LENGTH):
    indices = [lang.char2idx[char] for char in text]
    indices.append(EOS_token) # Add EOS token
    if len(indices) > max_length: # Truncate if too long
        indices = indices[:max_length-1] + [EOS_token]
    padded_indices = indices + [PAD_token] * (max_length - len(indices)) # Pad
    return torch.tensor(padded_indices, dtype=torch.long, device=device).view(-1, 1)

# --- 3. Model Architecture ---

# Encoder
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, embedding_dim):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

# Decoder with Attention
class AttnDecoderRNN(nn.Module):
    def __init__(self, output_size, hidden_size, embedding_dim, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, embedding_dim)
        self.attn = nn.Linear(embedding_dim + hidden_size, self.max_length) # Attention weights
        self.attn_combine = nn.Linear(embedding_dim + hidden_size, hidden_size) # Combine attended context and embedding
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        # Attention mechanism
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        # Combine embedded input and attended context
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


# --- 4. Training Function ---

def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.init_hidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    # Encoder pass
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)
    decoder_hidden = encoder_hidden

    # Teacher forcing: Use the real target as next input
    use_teacher_forcing = True if random.random() < TEACHER_FORCING_RATIO else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing
    else:
        # Without teacher forcing: Use its own prediction as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # Detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length

# --- 5. Evaluation / Inference Function ---

def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensor_from_text(input_lang, sentence)
        input_length = input_tensor.size(0)
        encoder_hidden = encoder.init_hidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] = encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS
        decoder_hidden = encoder_hidden

        decoded_chars = []

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_chars.append('<EOS>')
                break
            else:
                decoded_chars.append(output_lang.idx2char[topi.item()])

            decoder_input = topi.squeeze().detach()

        return ''.join(decoded_chars)

# --- 6. Main Execution ---

if __name__ == "__main__":
    # Initialize models
    encoder = EncoderRNN(input_lang.n_chars, HIDDEN_SIZE, EMBEDDING_DIM).to(device)
    decoder = AttnDecoderRNN(output_lang.n_chars, HIDDEN_SIZE, EMBEDDING_DIM, dropout_p=0.1, max_length=MAX_LENGTH).to(device)

    # Optimizers
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=LEARNING_RATE)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=LEARNING_RATE)

    # Loss function
    criterion = nn.NLLLoss(ignore_index=PAD_token) # NLLLoss with LogSoftmax output, ignore padding

    print("Starting training...")
    for epoch in range(1, NUM_EPOCHS + 1):
        # Pick a random training pair for simplicity
        input_text, target_text = random.choice(training_pairs)

        input_tensor = tensor_from_text(input_lang, input_text).to(device)
        target_tensor = tensor_from_text(output_lang, target_text).to(device)

        loss = train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)

        if epoch % 500 == 0:
            print(f"Epoch {epoch}/{NUM_EPOCHS}, Loss: {loss:.4f}")
            # Evaluate some examples during training
            print(f"  Input: {input_text} -> Predicted: {evaluate(encoder, decoder, input_text)}")
            print(f"  Input: Omveer -> Predicted: {evaluate(encoder, decoder, 'Omveer')}")
            print(f"  Input: Rahul -> Predicted: {evaluate(encoder, decoder, 'Rahul')}")
            print("-" * 20)

    print("\nTraining complete! Testing specific examples:")
    test_names = ["Omveer", "Rahul", "Priya", "Saurabh", "Anjali", "Omveer Singh", "NonExistent"]
    for name in test_names:
        print(f"'{name}' -> '{evaluate(encoder, decoder, name)}'")

Input vocabulary size: 34
Output vocabulary size: 33
Starting training...
Epoch 500/3000, Loss: nan
  Input: Pooja -> Predicted: पूजा<EOS>
  Input: Omveer -> Predicted: दममी<EOS>
  Input: Rahul -> Predicted: राजु<EOS>
--------------------
Epoch 1000/3000, Loss: 0.0008
  Input: Sneha -> Predicted: स्नेहा<EOS>
  Input: Omveer -> Predicted: ओमवीर<EOS>
  Input: Rahul -> Predicted: राहुल<EOS>
--------------------
Epoch 1500/3000, Loss: 0.0003
  Input: Rahul -> Predicted: राहुल<EOS>
  Input: Omveer -> Predicted: ओमवीर<EOS>
  Input: Rahul -> Predicted: राहुल<EOS>
--------------------
Epoch 2000/3000, Loss: nan
  Input: Vivek -> Predicted: विवेक<EOS>
  Input: Omveer -> Predicted: ओमवीर<EOS>
  Input: Rahul -> Predicted: राहुल<EOS>
--------------------
Epoch 2500/3000, Loss: 0.0001
  Input: Gaurav -> Predicted: गौरव<EOS>
  Input: Omveer -> Predicted: ओमवीर<EOS>
  Input: Rahul -> Predicted: राहुल<EOS>
--------------------
Epoch 3000/3000, Loss: nan
  Input: Preeti -> Predicted: प्रीति<EOS>
  Inpu

KeyError: 'E'

https://hussainwali.medium.com/guide-to-pytorchs-embedding-layer-68913ee53d2e

In [None]:
import torch
import torch.nn as nn

# Define the vocabulary size (number of unique words/items)
vocab_size = 1000  # Example: 1000 unique words

# Define the embedding dimension (size of each embedding vector)
embedding_dim = 128 # Example: Each word will be represented by a 128-dimensional vector

# Create the embedding layer
# nn.Embedding(num_embeddings, embedding_dim)
# num_embeddings: size of the dictionary of embeddings (vocab_size)
# embedding_dim: the size of each embedding vector
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Example input: a tensor of indices representing words/items
# These indices would typically come from a pre-processed dataset
# For example, a sequence of word IDs in a sentence
input_indices = torch.tensor([
    [10, 25, 50, 75],  # First sequence of indices
    [100, 200, 300, 400] # Second sequence of indices
])

# Pass the input indices through the embedding layer
# This will perform a lookup and return the corresponding embedding vectors
output_embeddings = embedding_layer(input_indices)

# Print the shape of the output embeddings
# Expected shape: (batch_size, sequence_length, embedding_dim)
print(f"Shape of input indices: {input_indices.shape}")
print(f"Shape of output embeddings: {output_embeddings.shape}")

# Print a sample of the output embeddings (e.g., the first embedding)
print(f"First embedding vector from the output: {output_embeddings[0, 0]}")

Shape of input indices: torch.Size([2, 4])
Shape of output embeddings: torch.Size([2, 4, 128])
First embedding vector from the output: tensor([ 1.1230e+00, -1.2377e+00, -3.6551e-02, -4.4483e-01,  1.5836e+00,
        -9.2762e-01,  3.1391e-01, -4.5577e-01,  3.2637e-01, -1.9478e+00,
         2.4414e+00, -4.4207e-01,  3.4708e-01,  1.1208e+00, -5.7387e-01,
        -1.4290e+00,  6.9519e-02,  1.1232e+00,  3.0263e-01,  9.6875e-01,
         1.3296e-01, -5.5316e-01,  1.3394e+00,  3.4011e-01,  8.0473e-01,
        -7.0787e-01,  2.1581e-01, -1.6603e-02, -6.2092e-01, -1.1744e+00,
         2.9277e-01, -4.7665e-02,  2.6018e+00, -5.9178e-02, -6.7986e-01,
         1.0847e+00,  1.9285e+00,  6.7120e-02, -4.3000e-01, -9.1484e-01,
        -2.9730e-01, -1.5690e-01, -3.4910e-01, -3.6099e-01, -9.6690e-03,
        -3.6325e-01,  3.0076e-04,  1.3215e-01,  8.7606e-01, -4.3771e-01,
        -1.1194e-02, -3.1891e-01,  4.4592e-02, -1.9853e-01,  5.2782e-01,
         6.0708e-01, -6.0074e-01,  9.7206e-01, -6.4579e-01,  4

In [None]:
import torch
import torch.nn as nn

torch.manual_seed(1)
#creating the dictionary
word_to_ix = {"geeks": 0, "for": 1, "code":2}
#creating embedding layer - 3 words in vocab, 5-dimensional embeddings
embeds = nn.Embedding(2, 5)

#converting to tensor
lookup_tensor = torch.tensor([word_to_ix["geeks"]], dtype=torch.long)
#accessing the embeddings of the word "geeks"
pytorch_embed = embeds(lookup_tensor)

#print the embeddings
print(pytorch_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


Explanation:

Data Preparation:

We define a small vocab (vocabulary) to map words to unique integer indices. This is crucial for nn.Embedding, which takes integer indices as input.

training_data consists of pairs: (word_index, category_label). This is our tiny dataset.

SimpleWordClassifier Model:

__init__(...):

self.embedding = nn.Embedding(vocab_size, embedding_dim): This creates our embedding table. It means we'll have vocab_size (5) rows, and each row will be an embedding_dim (10) dimensional vector. PyTorch initializes these vectors randomly.

self.linear = nn.Linear(embedding_dim, num_categories): This is a standard linear layer. It takes the 10-dimensional word embedding as input and outputs 2 values (logits), one for "Category A" and one for "Category B".

forward(self, word_idx):

embedded_word = self.embedding(word_idx): When you pass a word_idx (e.g., tensor([0]) for 'apple'), the embedding layer looks up the corresponding 10-dimensional vector for 'apple' from its internal table.

logits = self.linear(embedded_word): This takes the 10-dimensional word embedding and transforms it into 2-dimensional logits, which are then used by the loss function.

Loss and Optimizer:

nn.CrossEntropyLoss(): This is the standard loss for multi-class classification. It expects raw logits from the model and integer labels for the correct class. It internally applies a softmax function to the logits to get probabilities, then calculates the negative log-likelihood.

optim.Adam(): An optimization algorithm that adjusts the model's parameters (the embedding vectors and the linear layer's weights/biases) to minimize the loss.

Training Loop:

The loop iterates for a fixed number of num_epochs.

For each training example:

optimizer.zero_grad(): Clears old gradients.

model(input_word_idx): Performs the forward pass to get predictions.

criterion(...): Calculates the difference between predictions and true labels.

loss.backward(): Computes gradients for all parameters.

optimizer.step(): Updates parameters using the gradients.

The loss is printed periodically to show training progress. As training proceeds, the loss should ideally decrease.

Inference / Testing:

model.eval() and torch.no_grad(): These are important for testing. eval() sets the model to evaluation mode (e.g., disables dropout layers if present), and no_grad() disables gradient computation, which saves memory and speeds up inference.

We iterate through our vocabulary, get predictions for each word, and determine the predicted category based on the highest logit.

Finally, we print the model.embedding.weight. This is the learned embedding matrix. Each row is the 10-dimensional vector that the model has learned for each word. After training, words in "Category A" should ideally have similar embeddings, and words in "Category B" should also cluster together, but words across categories should be distinct.

This example provides a clear illustration of how nn.Embedding and nn.Linear work together in a simple neural network.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# --- 1. Data Preparation ---

# Define a tiny vocabulary
# Let's say we have 5 unique words
vocab = {
    'apple': 0,
    'banana': 1,
    'cat': 2,
    'dog': 3,
    'orange': 4
}
idx_to_word = {v: k for k, v in vocab.items()}
vocab_size = len(vocab)

# Define a simple "classification" task
# Words 0, 1, 4 belong to Category A (index 0)
# Words 2, 3 belong to Category B (index 1)
# This is our "ground truth" for training
training_data = [
    (vocab['apple'], 0),    # apple -> Category A
    (vocab['banana'], 0),   # banana -> Category A
    (vocab['cat'], 1),      # cat -> Category B
    (vocab['dog'], 1),      # dog -> Category B
    (vocab['orange'], 0)    # orange -> Category A
]

# Convert data to PyTorch tensors
# Inputs are just word indices
# Targets are category indices (0 or 1)
word_indices = torch.tensor([data[0] for data in training_data], dtype=torch.long)
category_labels = torch.tensor([data[1] for data in training_data], dtype=torch.long)

# --- 2. Define the Model ---

class SimpleWordClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_categories):
        super(SimpleWordClassifier, self).__init__()

        # Layer 1: Embedding Layer
        # Takes word indices and outputs dense vectors
        # vocab_size: total number of unique words
        # embedding_dim: desired size of each word vector
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # Layer 2: Linear Layer
        # Takes the word embedding (embedding_dim features)
        # and outputs logits for each category (num_categories features)
        self.linear = nn.Linear(embedding_dim, num_categories)

    def forward(self, word_idx):
        # Step 1: Get the embedding for the input word index
        # word_idx is typically a tensor like [0] or [1, 4] etc.
        # self.embedding(word_idx) performs a lookup in the embedding table
        # Output shape: (batch_size, embedding_dim) or (1, embedding_dim) if word_idx is scalar
        embedded_word = self.embedding(word_idx)

        # Step 2: Pass the embedding through the linear layer
        # This transforms the embedding into raw scores (logits) for each category
        # Output shape: (batch_size, num_categories) or (1, num_categories)
        logits = self.linear(embedded_word)

        return logits

# --- 3. Model, Loss, and Optimizer Initialization ---

# Define model parameters
embedding_dim = 10  # Each word will be represented by a 10-dimensional vector
num_categories = 2  # We have two categories: A and B

# Instantiate the model
model = SimpleWordClassifier(vocab_size, embedding_dim, num_categories)

# Define Loss Function
# CrossEntropyLoss is good for classification tasks.
# It internally applies Softmax to the logits and then calculates negative log likelihood.
criterion = nn.CrossEntropyLoss()

# Define Optimizer
# Adam is a popular choice for many tasks
optimizer = optim.Adam(model.parameters(), lr=0.01)

print("Model Architecture:")
print(model)

# --- 4. Training Loop ---

num_epochs = 200 # Train for 200 epochs (iterations over the entire dataset)

print("\nStarting Training...")
for epoch in range(num_epochs):
    total_loss = 0

    # Iterate through each word in our small training data
    for i in range(len(word_indices)):
        input_word_idx = word_indices[i].unsqueeze(0) # unsqueeze to make it a batch of 1
        target_category_label = category_labels[i].unsqueeze(0) # unsqueeze for consistency

        # 1. Zero the gradients
        # Clear gradients from the previous step before computing new ones
        optimizer.zero_grad()

        # 2. Forward pass
        # Get the predicted logits from the model
        predicted_logits = model(input_word_idx)

        # 3. Calculate the loss
        # Compare predicted logits with the true category label
        loss = criterion(predicted_logits, target_category_label)

        # 4. Backward pass
        # Compute gradients of the loss with respect to model parameters
        loss.backward()

        # 5. Optimizer step
        # Update model parameters using the computed gradients
        optimizer.step()

        total_loss += loss.item()

    # Print loss every few epochs
    if (epoch + 1) % 20 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss/len(training_data):.4f}")

print("\nTraining Finished.")

# --- 5. Inference / Testing ---

print("\nTesting the trained model:")
model.eval() # Set the model to evaluation mode (disables dropout, etc.)

with torch.no_grad(): # Disable gradient calculation for inference
    for word_name, word_idx in vocab.items():
        input_word_idx = torch.tensor([word_idx], dtype=torch.long)
        predicted_logits = model(input_word_idx)

        # Get the predicted category by finding the index with the highest logit
        predicted_category_idx = torch.argmax(predicted_logits).item()

        # Map the category index back to a readable label
        category_map = {0: "Category A", 1: "Category B"}
        predicted_category_label = category_map[predicted_category_idx]

        print(f"Word: '{word_name}' (Index: {word_idx}) -> Predicted Category: {predicted_category_label}")

# You can also inspect the learned embeddings
print("\nLearned Embeddings:")
# Access the embedding weights
# Each row corresponds to the embedding of a word
# For example, model.embedding.weight[0] is the embedding for 'apple'
print(model.embedding.weight)

Model Architecture:
SimpleWordClassifier(
  (embedding): Embedding(5, 10)
  (linear): Linear(in_features=10, out_features=2, bias=True)
)

Starting Training...
Epoch [20/200], Loss: 0.0447
Epoch [40/200], Loss: 0.0096
Epoch [60/200], Loss: 0.0042
Epoch [80/200], Loss: 0.0024
Epoch [100/200], Loss: 0.0015
Epoch [120/200], Loss: 0.0011
Epoch [140/200], Loss: 0.0008
Epoch [160/200], Loss: 0.0006
Epoch [180/200], Loss: 0.0005
Epoch [200/200], Loss: 0.0004

Training Finished.

Testing the trained model:
Word: 'apple' (Index: 0) -> Predicted Category: Category A
Word: 'banana' (Index: 1) -> Predicted Category: Category A
Word: 'cat' (Index: 2) -> Predicted Category: Category B
Word: 'dog' (Index: 3) -> Predicted Category: Category B
Word: 'orange' (Index: 4) -> Predicted Category: Category A

Learned Embeddings:
Parameter containing:
tensor([[ 0.6656,  0.7644, -1.4641,  1.2887, -0.1995,  1.9459,  1.0940, -1.1723,
         -0.7733,  0.2999],
        [-0.9799,  0.2903, -1.5163,  0.7261, -0.263

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence # For padding sequences

# --- 1. Define our small dataset ---
# Each tuple: (email_text, label)
# Label: 0 for Not Spam (Ham), 1 for Spam
raw_data = [
    ("Hi, meeting tomorrow at 10 AM.", 0),
    ("Claim your free prize now!", 1),
    ("Project update for next week.", 0),
    ("You won a lottery! Click here.", 1),
    ("Can we reschedule the call?", 0),
    ("Urgent: Your account is suspended.", 1),
    ("Regarding your recent order.", 0),
    ("Congratulations, get rich quick!", 1),
    ("Reminder: Team sync at 2 PM.", 0),
    ("Exclusive offer just for you!", 1),
    ("Hello, how are you?", 0),
    ("Win iPhone, act fast!", 1)
]

# --- 2. Text Preprocessing: Vocabulary and Tokenization ---

# Special tokens
PAD_TOKEN = '<pad>'
UNK_TOKEN = '<unk>'

word_to_idx = {PAD_TOKEN: 0, UNK_TOKEN: 1} # Start with padding and unknown tokens
idx_to_word = {0: PAD_TOKEN, 1: UNK_TOKEN}
vocab_size = 2 # Start counting from 2

def build_vocabulary(data):
    global vocab_size
    for text, _ in data:
        for word in text.lower().split(): # Simple space tokenization
            if word not in word_to_idx:
                word_to_idx[word] = vocab_size
                idx_to_word[vocab_size] = word
                vocab_size += 1

def text_to_sequence(text, word_to_idx_map):
    return [word_to_idx_map.get(word, word_to_idx_map[UNK_TOKEN]) for word in text.lower().split()]

# Build vocabulary from our raw data
build_vocabulary(raw_data)

print(f"Vocabulary size: {vocab_size}")
# print("Word to Index mapping:", word_to_idx)

# Prepare sequences and labels for PyTorch
sequences = []
labels = []
for text, label in raw_data:
    sequences.append(torch.tensor(text_to_sequence(text, word_to_idx), dtype=torch.long))
    labels.append(torch.tensor([label], dtype=torch.float)) # Labels need to be float for BCEWithLogitsLoss

# Pad sequences to the same length within a batch
# For this small example, we'll just pad all sequences to the longest one
padded_sequences = pad_sequence(sequences, batch_first=True, padding_value=word_to_idx[PAD_TOKEN])
labels_tensor = torch.cat(labels).float() # Concatenate labels into a single tensor

print(f"Example padded sequence (indices): {padded_sequences[0]}")
print(f"Example label: {labels_tensor[0]}")
print(f"Shape of padded sequences: {padded_sequences.shape}") # (num_samples, max_seq_len)

# --- 3. Define the Spam Classifier Model ---

class SpamClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(SpamClassifier, self).__init__()

        # Embedding Layer: Converts word indices to dense vectors
        # vocab_size: total number of unique words
        # embedding_dim: size of each word vector (e.g., 100)
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=word_to_idx[PAD_TOKEN])

        # LSTM Layer: Processes the sequence of embeddings
        # embedding_dim: input size (from embedding layer)
        # hidden_dim: size of the LSTM's hidden state
        # batch_first=True: input/output tensors will have (batch_size, seq_len, features)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        # Linear Layer: Maps the LSTM's output to the final classification logit
        # hidden_dim: input size (from LSTM's last hidden state)
        # output_dim: 1 for binary classification (spam/not spam)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text_sequence):
        # 1. Pass through Embedding Layer
        # Input: (batch_size, seq_len) -> Output: (batch_size, seq_len, embedding_dim)
        embedded = self.embedding(text_sequence)

        # 2. Pass through LSTM Layer
        # lstm_out: (batch_size, seq_len, hidden_dim * num_directions)
        # (h_n, c_n): tuple of (num_layers * num_directions, batch_size, hidden_dim)
        # We only need the output from the last time step for classification
        lstm_out, (h_n, c_n) = self.lstm(embedded)

        # For classification, we often use the last hidden state of the LSTM
        # h_n[-1, :, :] gets the last layer's hidden state for all batches
        final_hidden_state = h_n[-1] # Shape: (batch_size, hidden_dim)

        # 3. Pass through Linear Layer
        # Output: (batch_size, output_dim)
        output = self.fc(final_hidden_state)

        return output # Returns logits

# --- 4. Model, Loss, and Optimizer Initialization ---

# Hyperparameters for the model
EMBEDDING_DIM = 100  # Size of word vectors
HIDDEN_DIM = 128     # Size of LSTM's hidden state
OUTPUT_DIM = 1       # 1 for binary classification (logit)

model = SpamClassifier(vocab_size, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

# Loss Function: Binary Cross-Entropy with Logits
# This loss function is ideal for binary classification as it combines a sigmoid activation
# and binary cross-entropy loss in one stable step.
criterion = nn.BCEWithLogitsLoss()

# Optimizer: Adam is a good general-purpose optimizer
LEARNING_RATE = 0.001
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

print("\nModel Architecture:")
print(model)

# --- 5. Training Loop ---

NUM_EPOCHS = 100 # Number of times to iterate over the entire dataset
BATCH_SIZE = 4   # Number of samples per update

# Create a DataLoader for easier batching and shuffling
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(padded_sequences, labels_tensor)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

print("\nStarting Training...")
model.train() # Set the model to training mode

for epoch in range(1, NUM_EPOCHS + 1):
    total_loss = 0
    correct_predictions = 0
    total_samples = 0

    for inputs, labels in dataloader:
        optimizer.zero_grad() # Clear gradients from previous step

        # Forward pass
        outputs = model(inputs).squeeze(1) # Squeeze to make labels (batch_size,) vs outputs (batch_size, 1)

        # Calculate loss
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        # Calculate accuracy for this batch (optional, but good for monitoring)
        predictions = torch.round(torch.sigmoid(outputs)) # Apply sigmoid and round to get binary predictions
        correct_predictions += (predictions == labels).sum().item()
        total_samples += labels.size(0)

    avg_loss = total_loss / len(dataloader)
    accuracy = correct_predictions / total_samples

    if epoch % 10 == 0:
        print(f"Epoch [{epoch}/{NUM_EPOCHS}], Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}")

print("\nTraining Finished.")

# --- 6. Inference / Prediction ---

def predict_spam(model, text, word_to_idx_map, max_len=padded_sequences.shape[1]):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculations
        # Convert text to sequence
        seq = text_to_sequence(text, word_to_idx_map)

        # Pad sequence to max_len
        if len(seq) < max_len:
            seq += [word_to_idx_map[PAD_TOKEN]] * (max_len - len(seq))
        else: # Truncate if longer than max_len
            seq = seq[:max_len]

        input_tensor = torch.tensor([seq], dtype=torch.long) # Add batch dimension

        # Get raw logit output
        output_logit = model(input_tensor).item()

        # Apply sigmoid to get probability
        probability = torch.sigmoid(torch.tensor(output_logit)).item()

        # Classify based on a threshold (e.g., 0.5)
        prediction = 1 if probability >= 0.5 else 0

        return "Spam" if prediction == 1 else "Not Spam", probability

print("\n--- Testing Predictions ---")
test_emails = [
    "Your account is compromised, click link now!", # Expected: Spam
    "Hi, just checking in on the report.",         # Expected: Not Spam
    "Exclusive offer to win cash prize!",         # Expected: Spam
    "Let's catch up next week.",                  # Expected: Not Spam
    "New lottery winner notification!"            # Expected: Spam (even if slightly new phrasing)
]

for email in test_emails:
    pred, prob = predict_spam(model, email, word_to_idx)
    print(f"Email: '{email}'\n  -> Predicted: {pred} (Probability: {prob:.4f})\n")

Vocabulary size: 57
Example padded sequence (indices): tensor([2, 3, 4, 5, 6, 7])
Example label: 0.0
Shape of padded sequences: torch.Size([12, 6])

Model Architecture:
SpamClassifier(
  (embedding): Embedding(57, 100, padding_idx=0)
  (lstm): LSTM(100, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=1, bias=True)
)

Starting Training...
Epoch [10/100], Loss: 0.1980, Accuracy: 1.0000
Epoch [20/100], Loss: 0.0057, Accuracy: 1.0000
Epoch [30/100], Loss: 0.0020, Accuracy: 1.0000
Epoch [40/100], Loss: 0.0013, Accuracy: 1.0000
Epoch [50/100], Loss: 0.0010, Accuracy: 1.0000
Epoch [60/100], Loss: 0.0008, Accuracy: 1.0000
Epoch [70/100], Loss: 0.0006, Accuracy: 1.0000
Epoch [80/100], Loss: 0.0005, Accuracy: 1.0000
Epoch [90/100], Loss: 0.0004, Accuracy: 1.0000
Epoch [100/100], Loss: 0.0004, Accuracy: 1.0000

Training Finished.

--- Testing Predictions ---
Email: 'Your account is compromised, click link now!'
  -> Predicted: Spam (Probability: 0.9997)

Email: 'Hi, just check

Explanation:

Data Preparation:

raw_data: Our very small dataset of email strings and their corresponding binary labels (0 for "Not Spam", 1 for "Spam").

word_to_idx and idx_to_word: Dictionaries to map words to unique integer IDs and vice-versa. Special tokens (<pad>, <unk>) are included.

build_vocabulary and text_to_sequence: Functions to process the text: lowercasing, splitting into words, and converting words to their numerical IDs. Unknown words are mapped to <unk>.

pad_sequence: Since RNNs usually expect fixed-length inputs in a batch, pad_sequence adds <pad> tokens to shorter sequences to match the length of the longest sequence in the batch.

labels_tensor: Converted to float as nn.BCEWithLogitsLoss expects float targets.

SpamClassifier Model (nn.Module):

nn.Embedding(vocab_size, embedding_dim, padding_idx):

Takes the integer word IDs.

Converts each ID into a dense embedding_dim-dimensional vector. These vectors are learned during training.

padding_idx: Crucially, this tells the embedding layer not to update the embedding for the <pad> token, and its embedding will remain all zeros.

nn.LSTM(embedding_dim, hidden_dim, batch_first=True):

A Long Short-Term Memory (LSTM) layer, a type of RNN. It's excellent for processing sequences and capturing long-range dependencies in text.

embedding_dim: The size of the input features at each time step (the size of our word embeddings).

hidden_dim: The size of the LSTM's hidden state. This determines the complexity of the information it can learn to carry through the sequence.

batch_first=True: Ensures the input and output tensors are in the format (batch_size, sequence_length, features), which is generally more convenient.

We extract h_n[-1] (the hidden state of the last LSTM layer for the last time step) as the fixed-size representation of the entire email sequence.

nn.Linear(hidden_dim, output_dim):

Takes the hidden_dim-sized output from the LSTM.

Transforms it into a single output_dim (which is 1) value. This single value is a logit, representing the raw score before activation.

Loss and Optimizer:

nn.BCEWithLogitsLoss(): This is perfect for binary classification. It internally applies a sigmoid activation to the model's output logits and then calculates the Binary Cross-Entropy loss. This combination is numerically more stable than applying sigmoid explicitly and then using BCELoss.

optim.Adam(): An adaptive optimizer that efficiently updates the model's weights during training.

Training Loop:

TensorDataset and DataLoader: Used to organize our data into batches and handle shuffling for more robust training.

model.train(): Sets the model to training mode (important for layers like dropout or batch normalization if they were present).

Standard training steps:

optimizer.zero_grad(): Clears gradients from the previous iteration.

model(inputs): Performs the forward pass to get predictions.

criterion(...): Calculates the loss based on predictions and true labels.

loss.backward(): Computes gradients for all trainable parameters.

optimizer.step(): Updates model parameters using the calculated gradients.

The loss and accuracy are printed periodically to monitor how well the model is learning.

Inference/Prediction:

model.eval() and with torch.no_grad(): Sets the model to evaluation mode and disables gradient calculations, which saves memory and speeds up prediction.

Input text is converted to a padded numerical sequence.

The model makes a prediction (a logit).

torch.sigmoid(): Applied to the logit to convert it into a probability between 0 and 1.

A threshold (0.5) is used to make the final "Spam" or "Not Spam" decision.

This example provides a foundational understanding of how to build a simple text classifier in PyTorch. For real-world applications, you'd work with much larger datasets, potentially more complex models (e.g., bidirectional LSTMs, Transformers), and more sophisticated text preprocessing.

===========4
Core Idea:

Data Preparation: Convert text emails into numerical sequences (indices) that PyTorch can understand.

Embedding Layer: Convert these numerical indices into dense vector representations (embeddings).

Recurrent Layer (LSTM): Process the sequence of embeddings to capture patterns and context within the email.

Linear Layer: Take the output of the LSTM and map it to a single value (logit) representing the likelihood of being spam.

Sigmoid Activation (Implicit in Loss): Convert the logit into a probability.

Binary Cross-Entropy Loss: Measure how well our model's predictions match the true spam/not-spam labels.