# Simple Neural Probabilistic Language Model (NPLM)

This notebook implements a simplified version of the Neural Probabilistic Language Model originally proposed by Bengio et al. (2003).

## Overview

The NPLM predicts the next word in a sequence by:
1. Looking at a fixed window of previous words (context)
2. Embedding each word into a continuous vector space
3. Concatenating these embeddings
4. Passing them through a feed-forward neural network
5. Outputting a probability distribution over the vocabulary
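
The steps above can be written compactly. The following is a sketch for the simplified model built in this notebook; the symbols ($n$ context words, embedding lookup $C$, weights $W_1, W_2$, biases $b_1, b_2$) are notation chosen here for illustration.

```latex
\begin{aligned}
x &= [\,C(w_{t-n});\ \dots;\ C(w_{t-1})\,] \\
h &= \tanh(W_1 x + b_1) \\
P(w_t = i \mid w_{t-n}, \dots, w_{t-1}) &= \operatorname{softmax}(W_2 h + b_2)_i
\end{aligned}
```

Bengio et al. (2003) additionally allow optional direct connections from $x$ to the output layer; this simplified version leaves them out.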

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

## 1. Simple Text Dataset

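We'll use a tiny hand-written corpus for demonstration; in practice you would train on a much larger one. Note that `text.split()` flattens the sentences into a single token stream, so the context windows built later span sentence boundaries.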

In [None]:
# Sample text data
text = """
the cat sat on the mat
the dog sat on the log
the cat and the dog played together
cats and dogs are great pets
the mat is on the floor
dogs like to play outside
cats like to sleep inside
""".lower().strip()

# Tokenize
tokens = text.split()
print(f"Total tokens: {len(tokens)}")
print(f"First 20 tokens: {tokens[:20]}")

## 2. Build Vocabulary

In [None]:
# Build vocabulary
vocab = ['<PAD>', '<UNK>'] + sorted(set(tokens))
vocab_size = len(vocab)
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

print(f"Vocabulary size: {vocab_size}")
print(f"Vocabulary: {vocab}")

## 3. Create Training Data

We'll create (context, target) pairs by sliding a window over the token stream: each context is the `context_size` words immediately preceding the target word.
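
As a minimal sketch of the windowing idea on plain strings (before any index lookup), the helper below is a hypothetical stand-in named `sliding_windows`, not a function used elsewhere in this notebook:

```python
def sliding_windows(words, context_size):
    """Return (context, target) pairs from a flat list of words."""
    pairs = []
    for i in range(context_size, len(words)):
        # The context is the context_size words right before position i
        pairs.append((words[i - context_size:i], words[i]))
    return pairs

toy = "the cat sat on the mat".split()
for context, target in sliding_windows(toy, 3):
    print(context, "->", target)
```

A sequence of `len(words)` tokens yields `len(words) - context_size` pairs, so the six-word toy sentence gives three examples.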

In [None]:
def create_training_data(tokens, context_size=3):
    """
    Create training data with fixed context window.
    
    Args:
        tokens: List of tokens
        context_size: Number of previous words to use as context
    
    Returns:
        contexts: LongTensor of shape (num_examples, context_size) holding word indices
        targets: LongTensor of shape (num_examples,) holding target word indices
    """
    contexts = []
    targets = []
    
    for i in range(context_size, len(tokens)):
        context = tokens[i-context_size:i]
        target = tokens[i]
        contexts.append([word2idx[w] for w in context])
        targets.append(word2idx[target])
    
    return torch.tensor(contexts, dtype=torch.long), torch.tensor(targets, dtype=torch.long)

# Create training data
CONTEXT_SIZE = 3
X_train, y_train = create_training_data(tokens, CONTEXT_SIZE)

print(f"Number of training examples: {len(X_train)}")
print(f"\nFirst 5 training examples:")
for i in range(5):
    context_words = [idx2word[idx.item()] for idx in X_train[i]]
    target_word = idx2word[y_train[i].item()]
    print(f"Context: {context_words} -> Target: {target_word}")

## 4. Neural Probabilistic Language Model

The model architecture:
- **Embedding Layer**: Maps each word to a dense vector
- **Concatenation**: Concatenates embeddings of context words
- **Hidden Layer**: Feed-forward layer with tanh activation
- **Output Layer**: Projects the hidden state to one logit per vocabulary word (softmax is applied later, by the loss function or at prediction time)
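
To make the data flow through these layers concrete, here is a small shape trace. It is only a sketch: the function name `trace_shapes` and the sizes (`V`, `D`, `N`, `H`, `B`) are illustrative assumptions, not this notebook's hyperparameters.

```python
import torch
import torch.nn as nn

def trace_shapes(V=30, D=4, N=3, H=8, B=2):
    """Return the tensor shape after each stage for illustrative sizes.

    V: vocabulary size, D: embedding dim, N: context size,
    H: hidden dim, B: batch size (all made-up values).
    """
    emb = nn.Embedding(V, D)
    fc1 = nn.Linear(N * D, H)
    fc2 = nn.Linear(H, V)

    ctx = torch.randint(0, V, (B, N))  # word indices
    e = emb(ctx)                       # one D-dim vector per context word
    x = e.view(B, -1)                  # concatenate the N vectors per example
    h = torch.tanh(fc1(x))             # hidden layer
    logits = fc2(h)                    # one score per vocabulary word
    return {
        "context": tuple(ctx.shape),
        "embeddings": tuple(e.shape),
        "concatenated": tuple(x.shape),
        "hidden": tuple(h.shape),
        "logits": tuple(logits.shape),
    }

for name, shape in trace_shapes().items():
    print(f"{name:14s}{shape}")
```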

In [None]:
class NPLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, hidden_dim):
        """
        Neural Probabilistic Language Model
        
        Args:
            vocab_size: Size of vocabulary
            embedding_dim: Dimension of word embeddings
            context_size: Number of context words
            hidden_dim: Dimension of hidden layer
        """
        super().__init__()
        
        self.embedding_dim = embedding_dim
        self.context_size = context_size
        
        # Embedding layer
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # Feed-forward network
        # Input: concatenated embeddings (context_size * embedding_dim)
        self.fc1 = nn.Linear(context_size * embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)
        
    def forward(self, context):
        """
        Forward pass
        
        Args:
            context: Tensor of shape (batch_size, context_size)
        
        Returns:
            logits: Tensor of shape (batch_size, vocab_size)
        """
        # Embed each word in the context
        embeds = self.embeddings(context)  # (batch_size, context_size, embedding_dim)
        
        # Concatenate the context embeddings into one vector per example
        embeds = embeds.view(embeds.size(0), -1)  # (batch_size, context_size * embedding_dim)
        
        # Pass through feed-forward network
        hidden = torch.tanh(self.fc1(embeds))
        logits = self.fc2(hidden)
        
        return logits

## 5. Initialize Model and Training Setup

In [None]:
# Hyperparameters
EMBEDDING_DIM = 10
HIDDEN_DIM = 128
LEARNING_RATE = 0.01
NUM_EPOCHS = 100

# Initialize model
model = NPLM(vocab_size, EMBEDDING_DIM, CONTEXT_SIZE, HIDDEN_DIM)
print(model)

# Count parameters
num_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {num_params:,}")

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
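
The parameter total printed above can be checked by hand: the embedding table contributes `vocab_size * embedding_dim` weights, and each linear layer contributes `in_features * out_features` weights plus `out_features` biases. The sketch below uses a hypothetical helper, `nplm_param_count`, and a made-up vocabulary size of 30 rather than this notebook's actual vocabulary.

```python
def nplm_param_count(vocab_size, embedding_dim, context_size, hidden_dim):
    """Count the parameters of an embedding + two-linear-layer NPLM."""
    embedding = vocab_size * embedding_dim
    fc1 = context_size * embedding_dim * hidden_dim + hidden_dim  # weights + biases
    fc2 = hidden_dim * vocab_size + vocab_size                    # weights + biases
    return embedding + fc1 + fc2

# Illustrative sizes: 300 + 3968 + 3870 = 8138
print(nplm_param_count(vocab_size=30, embedding_dim=10, context_size=3, hidden_dim=128))
```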

## 6. Training Loop

In [None]:
# Full-batch training: each epoch is a single gradient step on all examples
losses = []

for epoch in range(NUM_EPOCHS):
    model.train()
    
    # Forward pass
    logits = model(X_train)
    loss = criterion(logits, y_train)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    losses.append(loss.item())
    
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{NUM_EPOCHS}], Loss: {loss.item():.4f}')

print("\nTraining complete!")

## 7. Visualize Training Loss

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss over Time')
plt.grid(True)
plt.show()

## 8. Evaluation and Prediction

Let's test the model by predicting the next word given a context.

In [None]:
def predict_next_word(model, context_words):
    """
    Predict the next word given context words.
    
    Args:
        model: Trained NPLM model
        context_words: List of context words
    
    Returns:
        predicted_word: The most likely next word
        predictions: List of up to 5 (word, probability) tuples, most likely first
    """
    model.eval()
    
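    # The model expects exactly context_size words
    if len(context_words) != model.context_size:
        raise ValueError(f"Expected {model.context_size} context words, got {len(context_words)}")

    # Convert words to indices (unknown words map to <UNK>)
    context_indices = [word2idx.get(w, word2idx['<UNK>']) for w in context_words]
    context_tensor = torch.tensor([context_indices], dtype=torch.long)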
    
    # Get predictions
    with torch.no_grad():
        logits = model(context_tensor)
        probs = F.softmax(logits, dim=1)
    
    # Get the top-5 predictions (fewer if the vocabulary has under 5 words)
    top_probs, top_indices = torch.topk(probs[0], k=min(5, probs.size(1)))
    
    predictions = []
    for prob, idx in zip(top_probs, top_indices):
        word = idx2word[idx.item()]
        predictions.append((word, prob.item()))
    
    return predictions[0][0], predictions

# Test predictions
test_contexts = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['cats', 'and', 'dogs'],
    ['like', 'to', 'play']
]

print("Model Predictions:\n")
for context in test_contexts:
    predicted, top_preds = predict_next_word(model, context)
    print(f"Context: {' '.join(context)}")
    print(f"Predicted: {predicted}")
    print("Top 5 predictions:")
    for word, prob in top_preds:
        print(f"  {word}: {prob:.4f}")
    print()

## 9. Calculate Perplexity

In [None]:
def calculate_perplexity(model, X, y):
    """
    Calculate perplexity on the dataset.
    
    Args:
        model: Trained NPLM model
        X: Context sequences
        y: Target words
    
    Returns:
        perplexity: The perplexity score
    """
    model.eval()
    
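    with torch.no_grad():
        logits = model(X)
        # Mean cross-entropy in nats, computed locally rather than via the global `criterion`
        loss = F.cross_entropy(logits, y)
        # Perplexity is the exponential of the mean negative log-likelihood
        perplexity = torch.exp(loss)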
    
    return perplexity.item()

perplexity = calculate_perplexity(model, X_train, y_train)
print(f"Training Perplexity: {perplexity:.4f}")
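
Perplexity is the exponential of the mean negative log-probability the model assigns to the true targets. As a rough sanity check, a model that spreads probability uniformly over V words has perplexity V. The sketch below works on plain probabilities instead of model outputs; the function name `perplexity_from_probs` and the numbers are illustrative assumptions.

```python
import math

def perplexity_from_probs(target_probs):
    """exp of the mean negative log-probability given to the true targets."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))

# Uniform guess over 25 words: perplexity is (up to rounding) 25
print(perplexity_from_probs([1 / 25] * 4))

# A confident model scores much lower
print(perplexity_from_probs([0.9, 0.8, 0.95]))
```

Lower is better, and a value of 1 would mean the model assigns probability 1 to every target.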

## 10. Generate Text

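Generate text greedily: repeatedly predict the most likely next word, append it to the output, and slide the context window forward. Because greedy decoding is deterministic, the output can fall into repeating loops.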
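
One possible alternative to always taking the top word is to sample from the predicted distribution. The sketch below is not part of this notebook's pipeline: `sample_word` is a hypothetical helper that takes a list of (word, probability) pairs, the same shape as the second return value of `predict_next_word`, and the probabilities shown are made up.

```python
import random

def sample_word(predictions, rng=random):
    """Sample one word from a list of (word, probability) pairs."""
    words = [w for w, _ in predictions]
    weights = [p for _, p in predictions]
    # random.choices treats weights as relative, so a truncated
    # top-k list that does not sum to 1 still works
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded for repeatable runs
preds = [("on", 0.6), ("and", 0.3), ("played", 0.1)]  # made-up probabilities
print([sample_word(preds, rng) for _ in range(5)])
```

Sampling trades some coherence for variety, which may help avoid the loops that deterministic decoding tends to produce.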

In [None]:
def generate_text(model, start_context, max_length=20):
    """
    Generate text by iteratively predicting next words.
    
    Args:
        model: Trained NPLM model
        start_context: Initial context words (list)
        max_length: Maximum number of words to generate
    
    Returns:
        generated_text: The generated text as a string
    """
    model.eval()
    context = start_context.copy()
    generated = start_context.copy()
    
    for _ in range(max_length):
        # Get the last context_size words
        current_context = context[-CONTEXT_SIZE:]
        
        # Predict next word
        predicted_word, _ = predict_next_word(model, current_context)
        
        # Add to generated text
        generated.append(predicted_word)
        context.append(predicted_word)
    
    return ' '.join(generated)

# Generate text from different starting contexts
print("Generated Text:\n")

start_contexts = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'like'],
    ['cats', 'and', 'dogs']
]

for start in start_contexts:
    generated = generate_text(model, start, max_length=10)
    print(f"Start: {' '.join(start)}")
    print(f"Generated: {generated}")
    print()

## 11. Visualize Word Embeddings

We can visualize the learned word embeddings in 2D using PCA (t-SNE is a common alternative).

In [None]:
from sklearn.decomposition import PCA

# Get embeddings as a NumPy array (moved to CPU in case the model lives on a GPU)
embeddings = model.embeddings.weight.detach().cpu().numpy()

# Reduce to 2D using PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Plot
plt.figure(figsize=(12, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.5)

# Add labels for each word
for i, word in enumerate(vocab):
    plt.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]), 
                 fontsize=9, alpha=0.7)

plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Word Embeddings Visualization (PCA)')
plt.grid(True, alpha=0.3)
plt.show()

## Summary

This notebook demonstrates a simple implementation of the Neural Probabilistic Language Model (NPLM). Key concepts:

1. **Word Embeddings**: Each word is mapped to a continuous vector space
2. **Context Window**: Fixed-size window of previous words used for prediction
3. **Feed-Forward Network**: Simple neural network processes concatenated embeddings
4. **Next Word Prediction**: Model outputs probability distribution over vocabulary

### Limitations of this Simple Model:
- Fixed context size (words further back than `CONTEXT_SIZE` positions cannot influence the prediction)
- Small training corpus (would need much more data for real applications)
- No regularization techniques (dropout, weight decay, etc.)
- Single hidden layer over concatenated embeddings (modern models use recurrent or attention-based architectures)

### Extensions:
- Use larger datasets (e.g., WikiText, Penn Treebank)
- Add dropout for regularization
- Implement the distant context aggregation from the paper
- Compare with Transformer-based models
- Add proper train/validation/test splits
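
As a starting point for the last extension, here is a minimal sketch of a contiguous train/validation/test split over the token stream. The helper name `split_tokens` and the 80/10/10 fractions are assumptions chosen for illustration.

```python
def split_tokens(tokens, train_frac=0.8, val_frac=0.1):
    """Split a token list into contiguous train/validation/test parts."""
    n = len(tokens)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return tokens[:train_end], tokens[train_end:val_end], tokens[val_end:]

demo_tokens = [f"w{i}" for i in range(10)]  # placeholder tokens
train, val, test = split_tokens(demo_tokens)
print(len(train), len(val), len(test))
```

With a split like this, the vocabulary would typically be built from the training part only, with unseen validation and test words mapped to `<UNK>`.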