# Named Entity Recognition (NER) and N-gram Language Models
## Using Popular NLP Libraries

This notebook demonstrates:
1. **Named Entity Recognition (NER)**
   - Dictionary-based NER with spaCy's PhraseMatcher
   - CRF-based NER with sklearn-crfsuite
   - Pre-trained NER with spaCy
2. **N-gram Language Models**
   - NLTK's N-gram utilities
   - Smoothing techniques (Laplace add-one, Kneser-Ney), compared against an unsmoothed MLE baseline
   - Lower-order (backoff-style) scoring
   - Perplexity evaluation

In [None]:
# Install required packages (uncomment to install)
# !pip install spacy nltk sklearn-crfsuite pandas numpy matplotlib seaborn
# !python -m spacy download en_core_web_sm

In [None]:
# Import libraries
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

import nltk
from nltk import ngrams, FreqDist
from nltk.lm import MLE, Laplace, KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends

import sklearn_crfsuite
from sklearn_crfsuite import metrics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
import warnings
warnings.filterwarnings('ignore')

# Download required NLTK data
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
    nltk.download('maxent_ne_chunker')
    nltk.download('words')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

---
## Part 1: Named Entity Recognition (NER)

### 1.1 Dictionary-based NER using spaCy's PhraseMatcher

spaCy's `PhraseMatcher` allows efficient dictionary-based entity matching with support for:
- Fast pattern matching
- Multi-word entities
- Matching on token attributes via the `attr` argument (e.g. `LOWER` for case-insensitive matching, `LEMMA` for lemmas)
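
To make the matching idea concrete without relying on spaCy internals, here is a minimal pure-Python sketch. It lowercases whitespace-split tokens and prefers the longest dictionary phrase at each position. The function `match_phrases` and the toy dictionary format are hypothetical names for illustration only; `PhraseMatcher` works on spaCy's own tokenization and is implemented differently.

```python
def match_phrases(tokens, phrase_dict):
    """Return (start, end, label) spans, preferring the longest match at each position.

    tokens: list of str; phrase_dict: {label: [phrase, ...]} (hypothetical toy format).
    """
    # Index lowercased token tuples by label for case-insensitive lookup
    index = {}
    for label, phrases in phrase_dict.items():
        for phrase in phrases:
            index[tuple(phrase.lower().split())] = label
    max_len = max((len(key) for key in index), default=0)

    lowered = [t.lower() for t in tokens]
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate window first
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            label = index.get(tuple(lowered[i:i + length]))
            if label is not None:
                spans.append((i, i + length, label))
                i += length
                break
        else:
            i += 1  # no dictionary phrase starts at this token
    return spans


tokens = "bill gates founded Microsoft".split()
spans = match_phrases(
    tokens,
    {"TECH_PERSON": ["Bill Gates"], "TECH_COMPANY": ["Microsoft"]},
)
print(spans)  # [(0, 2, 'TECH_PERSON'), (3, 4, 'TECH_COMPANY')]
```

Span ends are exclusive, mirroring the `start`/`end` token offsets that spaCy matchers return.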

In [None]:
# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading spaCy model...")
    # Use spaCy's own download helper instead of shelling out via os.system
    from spacy.cli import download
    download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

print(f"Loaded spaCy model: {nlp.meta['name']} v{nlp.meta['version']}")

In [None]:
class DictionaryNER:
    """Dictionary-based NER using spaCy's PhraseMatcher"""
    
    def __init__(self, nlp):
        self.nlp = nlp
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
        self.entity_patterns = {}
    
    def add_entities(self, entity_type, entity_list):
        """Add entities to the matcher"""
        patterns = [self.nlp.make_doc(entity) for entity in entity_list]
        self.matcher.add(entity_type, patterns)
        self.entity_patterns[entity_type] = entity_list
    
    def __call__(self, doc):
        """Label dictionary matches on doc, replacing any entities already set"""
        matches = self.matcher(doc)
        spans = []
        
        for match_id, start, end in matches:
            entity_type = self.nlp.vocab.strings[match_id]
            span = Span(doc, start, end, label=entity_type)
            spans.append(span)
        
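        # Resolve overlaps: filter_spans keeps the longest span (the first one on ties)
        filtered_spans = spacy.util.filter_spans(spans)
        # Assigning doc.ents overwrites entities set by the pre-trained NER component
        doc.ents = filtered_spans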
        return doc
    
    def visualize(self, text):
        """Visualize entities in text"""
        doc = self.nlp(text)
        doc = self(doc)
        
        # Display using spaCy's displacy
        from spacy import displacy
        displacy.render(doc, style="ent", jupyter=True)
        
        return doc

In [None]:
# Create dictionary-based NER
dict_ner = DictionaryNER(nlp)

# Add custom entity dictionaries
dict_ner.add_entities("TECH_COMPANY", [
    "Google", "Microsoft", "Apple", "Amazon", "Tesla", 
    "Meta", "Netflix", "OpenAI", "NVIDIA", "Intel"
])

dict_ner.add_entities("TECH_PERSON", [
    "Bill Gates", "Steve Jobs", "Elon Musk", "Jeff Bezos",
    "Mark Zuckerberg", "Sundar Pichai", "Tim Cook", "Sam Altman"
])

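# Caveat: with attr="LOWER", short names such as "Go" or "Swift" also match
# the ordinary words "go" and "swift"
dict_ner.add_entities("PROGRAMMING_LANGUAGE", [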
    "Python", "Java", "JavaScript", "C++", "Go", "Rust",
    "TypeScript", "Swift", "Kotlin", "Ruby"
])

print("Dictionary-based NER initialized with custom entities:")
for entity_type, entities in dict_ner.entity_patterns.items():
    print(f"  {entity_type}: {len(entities)} entities")

In [None]:
# Test dictionary-based NER
test_texts = [
    "Bill Gates founded Microsoft and revolutionized personal computing.",
    "Elon Musk leads Tesla and SpaceX, pushing boundaries in electric vehicles and space exploration.",
    "Python and JavaScript are popular programming languages used at Google and Amazon.",
    "Sam Altman is the CEO of OpenAI, which developed ChatGPT using advanced AI."
]

print("Dictionary-based NER Results:\n")
for i, text in enumerate(test_texts, 1):
    print(f"Example {i}: {text}")
    doc = nlp(text)
    doc = dict_ner(doc)
    
    if doc.ents:
        print("\nEntities found:")
        for ent in doc.ents:
            print(f"  - {ent.text:25s} => {ent.label_}")
    else:
        print("  No entities found.")
    print("-" * 80)

In [None]:
# Visualize entities (works in Jupyter)
sample_text = "Elon Musk uses Python at Tesla and works with OpenAI on AI research."
print("Entity Visualization:\n")
doc = dict_ner.visualize(sample_text)

### 1.2 Pre-trained NER with spaCy

spaCy comes with pre-trained NER models that can recognize standard entity types:
- PERSON, ORG, GPE (Geopolitical Entity), LOC, DATE, TIME, MONEY, etc.

In [None]:
# Test spaCy's pre-trained NER
def analyze_with_spacy_ner(text):
    """Analyze text with spaCy's pre-trained NER"""
    doc = nlp(text)
    
    print(f"Text: {text}\n")
    print("Entities found:")
    for ent in doc.ents:
        print(f"  - {ent.text:25s} => {ent.label_:10s} ({spacy.explain(ent.label_)})")
    print()
    
    return doc

# Test examples
examples = [
    "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976.",
    "Google acquired YouTube for $1.65 billion in October 2006.",
    "The United Nations was established in New York on October 24, 1945."
]

print("spaCy Pre-trained NER Results:\n")
for text in examples:
    analyze_with_spacy_ner(text)
    print("-" * 80)

### 1.3 CRF-based NER with sklearn-crfsuite

Conditional Random Fields (CRFs) are a popular sequence-labeling model for NER.
We'll train on a small hand-labeled toy dataset of (word, POS tag, NER tag) triples that uses CoNLL-style BIO tags.
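
Since BIO tags label individual tokens, it can help to see how they map back to entity spans. The sketch below is one lenient way to do the conversion; `bio_to_spans` is a hypothetical helper, not part of sklearn-crfsuite, and it treats a stray `I-` tag as the start of a new entity (an assumption, since conventions differ).

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end, entity_type) spans (end exclusive)."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        # An I- tag continues the open span only if the entity type matches
        if prefix == "I" and current == etype:
            continue
        # Anything else closes the open span, if there is one
        if current is not None:
            spans.append((start, i, current))
            start, current = None, None
        # B- opens a new span; a stray I- is treated like B- (lenient convention)
        if prefix in ("B", "I"):
            start, current = i, etype
    if current is not None:
        spans.append((start, len(tags), current))
    return spans


tags = ["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O", "B-LOC", "O", "B-LOC"]
print(bio_to_spans(tags))
# [(0, 2, 'ORG'), (3, 5, 'PER'), (6, 7, 'LOC'), (8, 9, 'LOC')]
```

Comparing predicted and gold spans produced this way gives entity-level scores, which are stricter than the per-token scores reported later in this section.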

In [None]:
# Feature extraction for CRF
def word2features(sent, i):
    """Extract features for word at position i"""
    word = sent[i][0]
    postag = sent[i][1]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    
    # Features from previous word
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
    
    # Features from next word
    if i < len(sent) - 1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
    
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [None]:
# Create training data (BIO format)
# Format: [(word, POS_tag, NER_tag), ...]

train_sentences = [
    [('Apple', 'NNP', 'B-ORG'), ('Inc.', 'NNP', 'I-ORG'), ('was', 'VBD', 'O'),
     ('founded', 'VBN', 'O'), ('by', 'IN', 'O'), ('Steve', 'NNP', 'B-PER'),
     ('Jobs', 'NNP', 'I-PER'), ('in', 'IN', 'O'), ('Cupertino', 'NNP', 'B-LOC'),
     (',', ',', 'O'), ('California', 'NNP', 'B-LOC'), ('.', '.', 'O')],
    
    [('Microsoft', 'NNP', 'B-ORG'), ('Corporation', 'NNP', 'I-ORG'), ('is', 'VBZ', 'O'),
     ('based', 'VBN', 'O'), ('in', 'IN', 'O'), ('Redmond', 'NNP', 'B-LOC'),
     (',', ',', 'O'), ('Washington', 'NNP', 'B-LOC'), ('.', '.', 'O')],
    
    [('Bill', 'NNP', 'B-PER'), ('Gates', 'NNP', 'I-PER'), ('co-founded', 'VBD', 'O'),
     ('Microsoft', 'NNP', 'B-ORG'), ('with', 'IN', 'O'), ('Paul', 'NNP', 'B-PER'),
     ('Allen', 'NNP', 'I-PER'), ('.', '.', 'O')],
    
    [('Google', 'NNP', 'B-ORG'), ('was', 'VBD', 'O'), ('founded', 'VBN', 'O'),
     ('by', 'IN', 'O'), ('Larry', 'NNP', 'B-PER'), ('Page', 'NNP', 'I-PER'),
     ('and', 'CC', 'O'), ('Sergey', 'NNP', 'B-PER'), ('Brin', 'NNP', 'I-PER'),
     ('in', 'IN', 'O'), ('California', 'NNP', 'B-LOC'), ('.', '.', 'O')],
    
    [('Tesla', 'NNP', 'B-ORG'), ('Motors', 'NNP', 'I-ORG'), ('is', 'VBZ', 'O'),
     ('led', 'VBN', 'O'), ('by', 'IN', 'O'), ('Elon', 'NNP', 'B-PER'),
     ('Musk', 'NNP', 'I-PER'), ('.', '.', 'O')],
    
    [('Amazon', 'NNP', 'B-ORG'), ('is', 'VBZ', 'O'), ('headquartered', 'VBN', 'O'),
     ('in', 'IN', 'O'), ('Seattle', 'NNP', 'B-LOC'), (',', ',', 'O'),
     ('Washington', 'NNP', 'B-LOC'), ('.', '.', 'O')],
    
    [('Mark', 'NNP', 'B-PER'), ('Zuckerberg', 'NNP', 'I-PER'), ('founded', 'VBD', 'O'),
     ('Facebook', 'NNP', 'B-ORG'), ('in', 'IN', 'O'), ('2004', 'CD', 'B-DATE'), ('.', '.', 'O')],
    
    [('Tim', 'NNP', 'B-PER'), ('Cook', 'NNP', 'I-PER'), ('is', 'VBZ', 'O'),
     ('the', 'DT', 'O'), ('CEO', 'NN', 'O'), ('of', 'IN', 'O'),
     ('Apple', 'NNP', 'B-ORG'), ('.', '.', 'O')],
]

test_sentences = [
    [('Jeff', 'NNP', 'B-PER'), ('Bezos', 'NNP', 'I-PER'), ('founded', 'VBD', 'O'),
     ('Amazon', 'NNP', 'B-ORG'), ('in', 'IN', 'O'), ('1994', 'CD', 'B-DATE'), ('.', '.', 'O')],
    
    [('Sundar', 'NNP', 'B-PER'), ('Pichai', 'NNP', 'I-PER'), ('leads', 'VBZ', 'O'),
     ('Google', 'NNP', 'B-ORG'), ('and', 'CC', 'O'), ('Alphabet', 'NNP', 'B-ORG'), ('.', '.', 'O')],
    
    [('Netflix', 'NNP', 'B-ORG'), ('operates', 'VBZ', 'O'), ('from', 'IN', 'O'),
     ('Los', 'NNP', 'B-LOC'), ('Gatos', 'NNP', 'I-LOC'), (',', ',', 'O'),
     ('California', 'NNP', 'B-LOC'), ('.', '.', 'O')],
]

print(f"Training data: {len(train_sentences)} sentences")
print(f"Test data: {len(test_sentences)} sentences")

In [None]:
# Prepare data for CRF
X_train = [sent2features(s) for s in train_sentences]
y_train = [sent2labels(s) for s in train_sentences]

X_test = [sent2features(s) for s in test_sentences]
y_test = [sent2labels(s) for s in test_sentences]

# Train CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,          # L1 regularization coefficient
    c2=0.1,          # L2 regularization coefficient
    max_iterations=100,
    all_possible_transitions=True,
    verbose=True
)

print("Training CRF model...\n")
crf.fit(X_train, y_train)
print("\nTraining completed!")

In [None]:
# Make predictions
y_pred = crf.predict(X_test)

# Display results
print("CRF-based NER Predictions:\n")
for i, sent in enumerate(test_sentences):
    print(f"Sentence {i+1}:")
    tokens = sent2tokens(sent)
    print(" ".join(tokens))
    print()
    print(f"{'Token':<20} {'True Label':<15} {'Predicted':<15} {'Match'}")
    print("-" * 65)
    
    for j, token in enumerate(tokens):
        true_label = y_test[i][j]
        pred_label = y_pred[i][j]
        match = "✓" if true_label == pred_label else "✗"
        print(f"{token:<20} {true_label:<15} {pred_label:<15} {match}")
    print("\n" + "=" * 80 + "\n")

In [None]:
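# Evaluate CRF model (token-level scores, not entity-level)
# Note: crf.classes_ contains only labels seen during training, so the
# test-only label I-LOC ('Gatos') is not included in this report
labels = list(crf.classes_)
labels.remove('O')  # Remove 'O' label for entity-focused evaluation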

print("CRF Model Performance Report:\n")
print(metrics.flat_classification_report(
    y_test, y_pred, labels=labels, digits=3
))

In [None]:
# Inspect most important features
def print_transitions(trans_features, top_n=10):
    """Print top transition features"""
    print(f"\nTop {top_n} Transition Features:")
    for (label_from, label_to), weight in trans_features[:top_n]:
        print(f"  {label_from:10s} -> {label_to:10s}  {weight:>8.3f}")

def print_state_features(state_features, top_n=20):
    """Print top state features"""
    print(f"\nTop {top_n} State Features:")
    for (attr, label), weight in state_features[:top_n]:
        print(f"  {attr:40s} => {label:10s}  {weight:>8.3f}")

# Get feature importance
print("Feature Importance Analysis:")
print("=" * 80)

# Transition features
trans_features = Counter(crf.transition_features_).most_common(10)
print_transitions(trans_features)

# State features
state_features = Counter(crf.state_features_).most_common(20)
print_state_features(state_features)

---
## Part 2: N-gram Language Models with NLTK

NLTK provides robust tools for building and evaluating n-gram language models with various smoothing techniques.

### 2.1 Building N-gram Models with NLTK

In [None]:
# Sample corpus for language modeling
train_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat sat on the log",
    "the dog sat on the mat",
    "a cat is a pet",
    "a dog is a pet",
    "the cat likes the mat",
    "the dog likes the log",
    "cats and dogs are pets",
    "the mat is soft",
    "the log is hard",
    "pets are nice",
    "the cat sleeps on the mat",
    "the dog runs to the log",
]

test_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat is a good pet",
    "the mat is very soft",
]

# Tokenize corpus
train_tokenized = [sent.split() for sent in train_corpus]
test_tokenized = [sent.split() for sent in test_corpus]

print(f"Training corpus: {len(train_tokenized)} sentences")
print(f"Test corpus: {len(test_tokenized)} sentences")
print(f"\nSample sentences:")
for sent in train_tokenized[:3]:
    print(f"  {' '.join(sent)}")

In [None]:
# Build vocabulary and n-grams
n = 3  # trigram model

# padded_everygram_pipeline pads each sentence and returns lazy iterators that
# fit() consumes, so build a fresh (n-grams, vocabulary) pair for each model
train_data_mle, vocab_mle = padded_everygram_pipeline(n, train_tokenized)
train_data_laplace, vocab_laplace = padded_everygram_pipeline(n, train_tokenized)
train_data_kn, vocab_kn = padded_everygram_pipeline(n, train_tokenized)

print(f"Building {n}-gram models...")

### 2.2 Smoothing Techniques

#### 2.2.1 Maximum Likelihood Estimation (MLE) - No Smoothing
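
As a reference point for what the MLE model estimates, the sketch below computes bigram probabilities directly from counts: P(word | prev) = count(prev, word) / count(prev). The three-sentence corpus and the helper `mle_prob` are made up for illustration, and sentence padding is omitted to keep it short.

```python
from collections import Counter

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

bigram_counts = Counter()
context_counts = Counter()
for sent in sentences:
    for prev, word in zip(sent, sent[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def mle_prob(word, prev):
    """P(word | prev) = count(prev, word) / count(prev); 0.0 if the context is unseen."""
    total = context_counts[prev]
    return bigram_counts[(prev, word)] / total if total else 0.0


print(mle_prob("cat", "the"))   # 2 of the 3 continuations of "the"
print(mle_prob("bird", "the"))  # an unseen bigram gets probability 0.0
```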

In [None]:
# MLE model (no smoothing)
lm_mle = MLE(n)
lm_mle.fit(train_data_mle, vocab_mle)

print("MLE Model (No Smoothing)")
print(f"Vocabulary size: {len(lm_mle.vocab)}")
print(f"N-gram order: {lm_mle.order}")

#### 2.2.2 Laplace Smoothing (Add-One)
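
Laplace smoothing is the k = 1 case of add-k smoothing, P(word | context) = (count + k) / (total + k·V), where V is the vocabulary size. The sketch below works through the arithmetic on invented counts; `add_k_prob`, the counts, and V = 5 are assumptions for illustration. NLTK offers the same idea with a configurable constant as `nltk.lm.Lidstone`.

```python
from collections import Counter


def add_k_prob(word, context_counter, vocab_size, k=1.0):
    """P(word | context) = (count + k) / (total + k * V)."""
    total = sum(context_counter.values())
    return (context_counter[word] + k) / (total + k * vocab_size)


# Hypothetical counts of words that follow the context "the"
after_the = Counter({"cat": 4, "dog": 4, "mat": 2})
V = 5  # assumed vocabulary size, including two words never seen after "the"

print(add_k_prob("cat", after_the, V))         # (4 + 1) / (10 + 5)
print(add_k_prob("log", after_the, V))         # (0 + 1) / (10 + 5)
print(add_k_prob("log", after_the, V, k=0.1))  # (0 + 0.1) / (10 + 0.5)
```

A smaller k moves less probability mass from seen to unseen words, which is why add-one is often described as over-smoothing on sparse data.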

In [None]:
# Laplace smoothing model
lm_laplace = Laplace(n)
lm_laplace.fit(train_data_laplace, vocab_laplace)

print("Laplace Smoothing Model")
print(f"Vocabulary size: {len(lm_laplace.vocab)}")
print(f"N-gram order: {lm_laplace.order}")

#### 2.2.3 Kneser-Ney Smoothing (Advanced)
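
One ingredient of Kneser-Ney is the continuation probability: instead of asking how often a word occurs, it asks in how many distinct contexts the word appears. The sketch below covers only that lower-order piece on invented bigrams; `continuation_prob` is a hypothetical helper, and the full method also discounts higher-order counts and interpolates.

```python
from collections import defaultdict

bigrams_seen = [
    ("san", "francisco"), ("san", "francisco"), ("san", "francisco"),
    ("the", "cat"), ("a", "cat"), ("my", "cat"),
]

# For each word, collect the distinct contexts that precede it
contexts_of = defaultdict(set)
for prev, word in bigrams_seen:
    contexts_of[word].add(prev)

num_bigram_types = sum(len(ctxs) for ctxs in contexts_of.values())


def continuation_prob(word):
    """Fraction of distinct bigram types that end in `word`."""
    return len(contexts_of.get(word, ())) / num_bigram_types


print(continuation_prob("francisco"))  # 1 of 4 bigram types -> 0.25
print(continuation_prob("cat"))        # 3 of 4 bigram types -> 0.75
```

Both words occur three times, yet "francisco" only ever follows "san", so it gets a much lower continuation probability than "cat".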

In [None]:
# Kneser-Ney smoothing model
lm_kn = KneserNeyInterpolated(n)
lm_kn.fit(train_data_kn, vocab_kn)

print("Kneser-Ney Interpolated Smoothing Model")
print(f"Vocabulary size: {len(lm_kn.vocab)}")
print(f"N-gram order: {lm_kn.order}")

### 2.3 Comparing Probabilities with Different Smoothing

In [None]:
# Test various n-grams
test_ngrams = [
    ('cat',),                    # Unigram (seen)
    ('elephant',),               # Unigram (unseen)
    ('the', 'cat'),              # Bigram (seen)
    ('the', 'elephant'),         # Bigram (unseen)
    ('cat', 'sat', 'on'),        # Trigram (seen)
    ('cat', 'jumped', 'over'),   # Trigram (unseen)
]

# Compare probabilities
results = []
for ngram in test_ngrams:
    # Get context (all but last word; an empty tuple for unigrams)
    context = ngram[:-1]
    word = ngram[-1]
    
    # Calculate probabilities with different models
    prob_mle = lm_mle.score(word, context)
    prob_laplace = lm_laplace.score(word, context)
    prob_kn = lm_kn.score(word, context)
    
    results.append({
        'N-gram': ' '.join(ngram),
        'Order': len(ngram),
        'MLE': prob_mle,
        'Laplace': prob_laplace,
        'Kneser-Ney': prob_kn
    })

# Display as DataFrame
df_probs = pd.DataFrame(results)
print("\nProbability Comparison Across Smoothing Methods:\n")
print(df_probs.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(14, 6))
x = np.arange(len(df_probs))
width = 0.25

ax.bar(x - width, df_probs['MLE'], width, label='MLE', alpha=0.8)
ax.bar(x, df_probs['Laplace'], width, label='Laplace', alpha=0.8)
ax.bar(x + width, df_probs['Kneser-Ney'], width, label='Kneser-Ney', alpha=0.8)

ax.set_xlabel('N-grams', fontsize=12)
ax.set_ylabel('Probability', fontsize=12)
ax.set_title('N-gram Probabilities with Different Smoothing Methods', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(df_probs['N-gram'], rotation=45, ha='right')
ax.legend()
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

### 2.4 Backoff Demonstration
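
For contrast with interpolated models, here is a minimal sketch of a strict backoff scheme in the style of "stupid backoff": use the relative frequency at the longest context that has been seen with the word, and multiply by a fixed factor for every step backed off. The helper names, the tiny corpus, and the factor 0.4 are assumptions for illustration; the scores are not normalized probabilities. NLTK ships its own version as `nltk.lm.StupidBackoff`.

```python
from collections import Counter


def count_ngrams(sentences, max_order):
    """Count all n-grams of order 1..max_order (no padding, for brevity)."""
    counts = Counter()
    for sent in sentences:
        for order in range(1, max_order + 1):
            for i in range(len(sent) - order + 1):
                counts[tuple(sent[i:i + order])] += 1
    return counts


def stupid_backoff_score(word, context, counts, total_tokens, alpha=0.4):
    """Relative frequency at the longest seen context, scaled by alpha per backoff step."""
    context = tuple(context)
    penalty = 1.0
    while context:
        if counts[context + (word,)] > 0:
            return penalty * counts[context + (word,)] / counts[context]
        context = context[1:]  # drop the oldest word and back off
        penalty *= alpha
    return penalty * counts[(word,)] / total_tokens


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = count_ngrams(sentences, max_order=3)
total_tokens = sum(len(s) for s in sentences)

print(stupid_backoff_score("sat", ("the", "cat"), counts, total_tokens))  # trigram seen
print(stupid_backoff_score("sat", ("a", "cat"), counts, total_tokens))    # backs off once
print(stupid_backoff_score("sat", ("a", "bird"), counts, total_tokens))   # backs off twice
```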

In [None]:
def demonstrate_backoff(lm, ngram):
    """Score the last word under progressively shorter contexts (trigram, bigram, unigram)"""
    print(f"\nN-gram: {' '.join(ngram)}")
    print("-" * 50)
    
    # Try different orders
    for order in range(len(ngram), 0, -1):
        current_ngram = ngram[-order:]
        context = current_ngram[:-1] if len(current_ngram) > 1 else ()
        word = current_ngram[-1]
        
        prob = lm.score(word, context)
        print(f"  {order}-gram: P({word}|{' '.join(context) if context else 'ε'}) = {prob:.6f}")

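# KneserNeyInterpolated mixes in lower-order estimates rather than strictly
# backing off; the loop above shows what each lower order contributes
print("Lower-Order Scores with Kneser-Ney (Interpolated) Smoothing:")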
print("=" * 50)

backoff_examples = [
    ('the', 'cat', 'sat'),
    ('the', 'cat', 'jumped'),
    ('the', 'elephant', 'danced'),
]

for example in backoff_examples:
    demonstrate_backoff(lm_kn, example)

### 2.5 Perplexity Evaluation

Perplexity measures how well a language model predicts a held-out test corpus.
- Lower perplexity = better model
- A perplexity of K means the model is, on average, as uncertain as if it were choosing uniformly among K words at each position
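
The definition can be checked by hand: perplexity is 2 raised to the average negative log2 probability of the predicted tokens. The sketch below uses a hypothetical `perplexity` helper on hand-picked probabilities rather than a trained model.

```python
import math


def perplexity(probabilities):
    """Perplexity = 2 ** (-(1/N) * sum(log2 p_i)) over the N predicted tokens."""
    if any(p == 0.0 for p in probabilities):
        return math.inf  # a single zero-probability token makes perplexity infinite
    cross_entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** cross_entropy


print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 choices -> 4.0
print(perplexity([0.5, 0.25, 0.125]))        # average of 1, 2, 3 bits -> 4.0
print(perplexity([0.5, 0.0]))                # inf
```

The last case is why an unsmoothed model can report infinite perplexity on test text that contains even one unseen n-gram.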

In [None]:
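# Prepare test data: perplexity() expects an iterable of n-gram tuples,
# so pad each sentence and slide an n-gram window over it
test_data = [
    ngram
    for sent in test_tokenized
    for ngram in ngrams(pad_both_ends(sent, n=n), n)
]

# Calculate perplexity for each model
# A model that assigns zero probability to any test n-gram (unsmoothed MLE
# does so for every unseen n-gram) reports infinite perplexity
perplexities = []

for model_name, model in [("MLE", lm_mle), ("Laplace", lm_laplace), ("Kneser-Ney", lm_kn)]:
    perplexity = model.perplexity(test_data)
    perplexities.append({
        'Model': model_name,
        'Perplexity': perplexity
    })
    print(f"{model_name:15s} Perplexity: {perplexity:.4f}")

# Visualize perplexity (infinite values cannot be drawn, so keep finite ones)
df_perplexity = pd.DataFrame(perplexities)
df_perplexity = df_perplexity[np.isfinite(df_perplexity['Perplexity'])]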

plt.figure(figsize=(10, 6))
bars = plt.bar(df_perplexity['Model'], df_perplexity['Perplexity'], 
               color=['#ff9999', '#66b3ff', '#99ff99'], alpha=0.8)
plt.xlabel('Smoothing Method', fontsize=12)
plt.ylabel('Perplexity', fontsize=12)
plt.title('Model Perplexity Comparison (Lower is Better)', fontsize=14)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.2f}',
             ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

### 2.6 Text Generation with N-gram Models
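
Generation from an n-gram model amounts to repeatedly sampling the next word in proportion to how often it followed the current context. The sketch below does this for bigrams in plain Python; the corpus, `next_counts`, and `sample_sentence` are hypothetical names, and passing an explicit `random.Random` instance is what makes a run reproducible.

```python
import random
from collections import Counter, defaultdict

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

# next_counts[prev][word] = how often `word` follows `prev`
next_counts = defaultdict(Counter)
for sent in sentences:
    padded = ["<s>"] + sent + ["</s>"]
    for prev, word in zip(padded, padded[1:]):
        next_counts[prev][word] += 1


def sample_sentence(rng, max_words=10):
    """Sample words from the bigram counts until </s> or max_words is reached."""
    words, prev = [], "<s>"
    for _ in range(max_words):
        candidates = sorted(next_counts[prev])  # sorted for reproducibility
        weights = [next_counts[prev][w] for w in candidates]
        word = rng.choices(candidates, weights=weights, k=1)[0]
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return " ".join(words)


print(sample_sentence(random.Random(42)))
```

Two generators created with the same seed produce the same sentence, while reusing one generator across calls yields varied samples.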

In [None]:
def generate_text(lm, num_words=10, text_seed=None, random_seed=42):
    """Generate text using n-gram language model"""
    import random
    # lm.generate samples with Python's random module, so seeding NumPy has no
    # effect; one shared Random instance keeps successive draws reproducible
    rng = random.Random(random_seed)
    
    # Start with padding or seed
    if text_seed:
        content = text_seed.split()
    else:
        content = ['<s>'] * (lm.order - 1)
    
    for _ in range(num_words):
        # Get context
        context = tuple(content[-(lm.order - 1):])
        
        # Generate next word
        next_word = lm.generate(1, text_seed=context, random_seed=rng)
        
        if next_word == '</s>':
            break
        
        content.append(next_word)
    
    # Remove padding tokens
    generated = [w for w in content if w not in ['<s>', '</s>']]
    return ' '.join(generated)

# Generate text with different models
print("Generated Text Samples:\n")
print("=" * 80)

for model_name, model in [("Laplace", lm_laplace), ("Kneser-Ney", lm_kn)]:
    print(f"\n{model_name} Model:")
    print("-" * 40)
    for i in range(5):
        text = generate_text(model, num_words=8, random_seed=42+i)
        print(f"  {i+1}. {text}")

In [None]:
# Generate text with seed
print("\nText Generation with Seed:\n")
print("=" * 80)

seeds = ["the cat", "the dog", "a pet"]

for seed in seeds:
    print(f"\nSeed: '{seed}'")
    text = generate_text(lm_kn, num_words=6, text_seed=seed)
    print(f"Generated: {text}")

### 2.7 Entropy and Cross-Entropy

In [None]:
def calculate_entropy(lm, test_data):
    """Calculate cross-entropy of language model on test data"""
    return lm.entropy(test_data)

print("Cross-Entropy Analysis:\n")
print("=" * 80)

for model_name, model in [("MLE", lm_mle), ("Laplace", lm_laplace), ("Kneser-Ney", lm_kn)]:
    try:
        entropy = calculate_entropy(model, test_data)
        perplexity = 2 ** entropy
        print(f"{model_name:15s} - Entropy: {entropy:.4f}, Perplexity: {perplexity:.4f}")
    except Exception as e:
        print(f"{model_name:15s} - Error: {e}")

### 2.8 N-gram Analysis and Visualization

In [None]:
# Analyze n-gram frequencies
from nltk import bigrams, trigrams

# Build n-grams sentence by sentence so none span a sentence boundary
bigram_list = [bg for sent in train_tokenized for bg in bigrams(sent)]
trigram_list = [tg for sent in train_tokenized for tg in trigrams(sent)]

# Count frequencies
bigram_freq = FreqDist(bigram_list)
trigram_freq = FreqDist(trigram_list)

print("Top 10 Bigrams:")
print("-" * 40)
for (w1, w2), count in bigram_freq.most_common(10):
    print(f"  ({w1}, {w2}): {count}")

print("\nTop 10 Trigrams:")
print("-" * 40)
for (w1, w2, w3), count in trigram_freq.most_common(10):
    print(f"  ({w1}, {w2}, {w3}): {count}")

In [None]:
# Visualize bigram frequencies
top_bigrams = bigram_freq.most_common(15)
bigram_labels = [f"{w1} {w2}" for (w1, w2), _ in top_bigrams]
bigram_counts = [count for _, count in top_bigrams]

plt.figure(figsize=(12, 6))
plt.barh(bigram_labels, bigram_counts, color='steelblue', alpha=0.8)
plt.xlabel('Frequency', fontsize=12)
plt.ylabel('Bigrams', fontsize=12)
plt.title('Top 15 Most Frequent Bigrams', fontsize=14)
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

---
## Summary and Key Takeaways

### Named Entity Recognition (NER)

1. **Dictionary-based NER**
   - ✅ Fast and precise for known, unambiguous entities
   - ✅ Easy to implement with spaCy's PhraseMatcher
   - ❌ Cannot handle unseen entities
   - ❌ Requires manual dictionary maintenance

2. **CRF-based NER**
   - ✅ Learns from training data
   - ✅ Can generalize to unseen entities
   - ✅ Considers context and features
   - ❌ Requires labeled training data
   - ❌ More computationally intensive than dictionary lookup

3. **Pre-trained NER (spaCy)**
   - ✅ Ready to use out-of-the-box
   - ✅ Good performance on standard entities
   - ✅ Supports multiple languages

### N-gram Language Models

1. **Smoothing Techniques**
   - **MLE**: No smoothing, zero probability for unseen n-grams
   - **Laplace**: Simple add-one smoothing, tends to over-smooth
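   - **Kneser-Ney**: Discounts observed counts and interpolates with lower-order continuation probabilities; typically the strongest of the three on realistic corpora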

2. **Backoff**
   - Falls back to lower-order n-grams when data is sparse
   - Improves robustness and generalization

3. **Perplexity**
   - Primary evaluation metric for language models
   - Lower perplexity indicates better model
   - Related to cross-entropy: Perplexity = 2^(cross-entropy), with cross-entropy measured in bits

### Best Practices

- Use **dictionary-based NER** for domain-specific, well-defined entities
- Use **CRF-based NER** when you have training data and need generalization
- Use **pre-trained models** as a baseline before building custom models
- For language models, **Kneser-Ney smoothing** generally performs best
- Higher-order n-grams (trigrams, 4-grams) capture more context but require more data
- Always evaluate on held-out test data

### Libraries Used

- **spaCy**: Industrial-strength NLP with pre-trained models
- **NLTK**: Comprehensive NLP toolkit with language modeling utilities
- **sklearn-crfsuite**: Efficient CRF implementation for sequence labeling
- **pandas/matplotlib/seaborn**: Data analysis and visualization

---
## Exercises

1. **NER Exercise**: Create a custom entity recognizer for your domain (e.g., medical terms, product names)

2. **CRF Features**: Experiment with different feature sets (word shapes, character n-grams, etc.)

3. **Language Model**: Train n-gram models on a larger corpus (e.g., news articles, Wikipedia)

4. **Smoothing Comparison**: Compare different smoothing techniques on various corpus sizes

5. **Text Generation**: Build an autocomplete system using n-gram models

6. **Evaluation**: Implement additional metrics (F1-score for NER, BLEU score for generation)