


# Language Modeling
Language modeling is a fundamental task in Natural Language Processing (NLP): assigning a probability to a sequence of words, usually by predicting each word from the words around it. Models trained this way on large amounts of text can both score and generate human language.
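
One common way to make this concrete, given here as background rather than as something this notebook derives, is the chain rule of probability. It factors the probability of a whole sequence into a product of next-word predictions:

$$
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
$$

The symbols $w_t$ (the word at position $t$) and $T$ (the sequence length) are notation introduced only for this illustration.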

# **Language Modeling Using N-gram Model**

---



An N-gram model predicts the next word from the previous n-1 words. The implementation below is a bigram model (n=2), which conditions on only the single preceding word:

**Approach:**
- Condition each prediction on only the previous n-1 words, ignoring anything earlier in the sentence.
- Estimate the probabilities from how often each word sequence occurs in the training data.
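
As a rough illustration of the frequency-based estimate described above, the sketch below computes P(next | previous) as count(previous, next) / count(previous) for a bigram model. The function name `bigram_probabilities` and the variable `corpus` are hypothetical names chosen for this example.

```python
from collections import Counter, defaultdict


def bigram_probabilities(sentences):
    """Estimate P(next | previous) from relative bigram frequencies."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Pair every token with the token that follows it.
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1

    probs = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        probs[prev] = {word: c / total for word, c in followers.items()}
    return probs


corpus = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the cat chased the mouse",
]
probs = bigram_probabilities(corpus)
print(probs["the"]["cat"])  # 3 of the 6 words following "the" are "cat" -> 0.5
```

Each inner dictionary sums to 1, so it can be read directly as a next-word distribution for that context.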

**Pros:**
- Simple and easy to implement.
- Efficient for small datasets and limited computational resources.
- Can capture local dependencies and short-term context.

**Cons:**
- Struggle to capture long-range dependencies and broader context.
- Suffer from data sparsity, especially for higher values of n, because the number of possible n-grams grows exponentially with n.
- Assign zero probability to word sequences never seen in training unless smoothing is applied.
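
One standard remedy for unseen sequences is add-one (Laplace) smoothing, which this notebook's bigram model does not use. The sketch below is a minimal version under that assumption; `train_counts`, `smoothed_prob`, and the `alpha` parameter are hypothetical names introduced for the example.

```python
from collections import Counter, defaultdict


def train_counts(sentences):
    """Collect bigram counts and the vocabulary from whitespace-tokenized text."""
    counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        tokens = sentence.split()
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts, vocab


def smoothed_prob(counts, vocab, prev, nxt, alpha=1.0):
    """Add-alpha estimate: every vocabulary word gets alpha pseudo-counts."""
    followers = counts.get(prev, {})
    total = sum(followers.values())
    return (followers.get(nxt, 0) + alpha) / (total + alpha * len(vocab))


corpus = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the cat chased the mouse",
]
counts, vocab = train_counts(corpus)
print(smoothed_prob(counts, vocab, "the", "cat"))  # seen bigram: (3 + 1) / (6 + 10) = 0.25
print(smoothed_prob(counts, vocab, "the", "sat"))  # unseen bigram: (0 + 1) / (6 + 10) = 0.0625
```

The unseen pair now gets a small non-zero probability, at the cost of taking some probability mass away from pairs that were actually observed.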

In [None]:
from collections import defaultdict, Counter
import random

class BigramModel:
    def __init__(self):
        self.bigrams = defaultdict(Counter)

    def train(self, sentences):
        for sentence in sentences:
            tokens = sentence.split()
            for i in range(len(tokens) - 1):
                self.bigrams[tokens[i]][tokens[i + 1]] += 1

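    def predict_next_word(self, word):
        """Sample a next word in proportion to how often it followed `word`."""
        next_words = self.bigrams.get(word)
        if not next_words:
            return None  # `word` never appeared as a context in training
        candidates = list(next_words.keys())
        weights = list(next_words.values())
        # Weighted random choice, so the result can differ between runs
        return random.choices(candidates, weights=weights, k=1)[0]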


sentences = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the cat chased the mouse"
]

model = BigramModel()
model.train(sentences)
print(model.predict_next_word("the"))

dog


# **Language Modeling Using Neural Language Model**

A neural language model uses a neural network to predict the next word in a sequence. The implementation below uses a recurrent neural network (RNN) in PyTorch:

**Approach:**
- Use neural networks, such as feedforward networks or recurrent neural networks (RNNs), to model the probability distribution of sequences of words.
- Learn continuous representations of words (embeddings) and capture complex patterns in the data.
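
To make "model the probability distribution" concrete: a neural language model typically outputs one raw score (logit) per vocabulary word, a softmax turns those scores into probabilities, and training minimizes the cross-entropy of the correct next word. The sketch below shows that arithmetic in plain Python. The three-word `vocab` and the `logits` values are made up for illustration and do not come from a trained model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct next word."""
    return -math.log(softmax(logits)[target_index])


vocab = ["cat", "dog", "mat"]   # hypothetical toy vocabulary
logits = [2.0, 0.5, -1.0]       # hypothetical network outputs for one position

probs = softmax(logits)
loss = cross_entropy(logits, vocab.index("cat"))
print(dict(zip(vocab, probs)))
print(loss)
```

PyTorch's `nn.CrossEntropyLoss` performs the same two steps (log-softmax followed by negative log-likelihood) on a batch of logits.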

**Pros:**
- Can capture longer-range dependencies compared to N-gram models.
- Learn rich, dense representations of words that capture semantic similarities.
- Adaptable to various NLP tasks with transfer learning.

**Cons:**
- Require large amounts of data and computational resources for training.
- Can be difficult to interpret and understand the internal representations.
- May struggle with rare words or out-of-vocabulary items without proper handling.

In [11]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Sample data
sentences = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the cat chased the mouse"
]

# Tokenize the text
tokenizer = lambda x: x.split()
vocab = sorted(set(word for sentence in sentences for word in tokenizer(sentence)))  # Sorted so word indices are reproducible across runs
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

# Prepare input-output pairs
input_seq = []
output_seq = []
for sentence in sentences:
    tokens = tokenizer(sentence)
    for i in range(len(tokens) - 1):
        input_seq.append(word2idx[tokens[i]])
        output_seq.append(word2idx[tokens[i + 1]])

input_seq = np.array(input_seq)
output_seq = np.array(output_seq)

# Convert to PyTorch tensors
input_tensor = torch.from_numpy(input_seq).long()
output_tensor = torch.from_numpy(output_seq).long()

# Define the RNN model
class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.hidden_dim = hidden_dim

    def forward(self, x, hidden):
        x = self.embedding(x)
        x = x.view(x.size(0), 1, -1)  # Shape (batch, seq_len=1, embedding_dim), as batch_first=True expects
        out, hidden = self.rnn(x, hidden)
        out = self.fc(out.squeeze(1))
        return out, hidden

    def init_hidden(self, batch_size=1):
        return torch.zeros(1, batch_size, self.hidden_dim)  # Shape (num_layers, batch, hidden_dim)

# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 64
hidden_dim = 128
learning_rate = 0.01
num_epochs = 100

# Initialize the model, loss function, and optimizer
model = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
for epoch in range(num_epochs):
    total_loss = 0
    hidden = model.init_hidden()  # Reset once per epoch; the state is then carried across all word pairs, including sentence boundaries

    for i in range(len(input_tensor)):
        optimizer.zero_grad()

        input_word = input_tensor[i].unsqueeze(0)  # Add batch dimension
        target = output_tensor[i].unsqueeze(0)  # Add batch dimension

        output, hidden = model(input_word, hidden)
        hidden = hidden.detach()  # Truncate backpropagation so gradients do not flow into earlier steps

        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {total_loss/len(input_tensor):.4f}')

# Generate text
def generate_text(model, start_word, length=10):
    model.eval()
    with torch.no_grad():  # No gradients are needed at inference time
        hidden = model.init_hidden()
        input_word = torch.tensor([word2idx[start_word]]).long()
        generated_words = [start_word]

        for _ in range(length):
            output, hidden = model(input_word, hidden)
            output_word_idx = torch.argmax(output, dim=1).item()  # Greedy decoding: pick the most likely next word
            output_word = idx2word[output_word_idx]
            generated_words.append(output_word)
            input_word = torch.tensor([output_word_idx]).long()

    return ' '.join(generated_words)

print(generate_text(model, "the"))

Epoch [10/100], Loss: 0.4509
Epoch [20/100], Loss: 0.0084
Epoch [30/100], Loss: 0.0041
Epoch [40/100], Loss: 0.0025
Epoch [50/100], Loss: 0.0017
Epoch [60/100], Loss: 0.0013
Epoch [70/100], Loss: 0.0010
Epoch [80/100], Loss: 0.0008
Epoch [90/100], Loss: 0.0006
Epoch [100/100], Loss: 0.0005
the cat sat on the mat dog barked at the cat


# **Language Modeling Using Transformer Model**

**Approach:**
- Use self-attention mechanisms to weigh the importance of input words relative to each other, regardless of their distance.
- Models like BERT and T5 have achieved state-of-the-art performance on various NLP tasks. BERT, used in the example below, is trained to fill in masked tokens rather than to predict the next word.
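
The self-attention step described above can be sketched as scaled dot-product attention. This is a simplified, plain-Python illustration and not how BERT is implemented: it assumes a single attention head and uses the input vectors directly as queries, keys, and values, whereas real Transformers apply learned projections first. The function names and the toy vectors in `x` are hypothetical.

```python
import math


def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(wj * v[d] for wj, v in zip(w, values))
            for d in range(len(values[0]))
        ])
    return outputs, weights


# Three toy 2-dimensional token vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Every token attends to every other token in a single step, regardless of distance, which is why the weight matrix has one row and one column per token.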

**Pros:**
- Excellent at capturing long-range dependencies and contextual information.
- Highly parallelizable, leading to faster training times compared to RNNs.
- Can be fine-tuned for specific tasks with transfer learning.

**Cons:**
- Require massive amounts of data and computational resources for pre-training.
- Can be memory-intensive, because the attention weights grow quadratically with sequence length.
- May require careful tuning and adaptation for domain-specific tasks.

In [13]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Example sentence
sentence = "The cat sat on the [MASK]."
inputs = tokenizer(sentence, return_tensors='pt')

# Predict the masked token
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

# Get the highest-scoring vocabulary entry at the [MASK] position
mask_positions = inputs['input_ids'][0] == tokenizer.mask_token_id
predicted_token_id = predictions[0, mask_positions].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


floor
