# Token Classification with Hugging Face 🏷️

Token classification assigns a label to each token (word or subword) in a sequence. Unlike text classification, which assigns one label to an entire document, it works at the token level to identify entities, parts of speech, or other token-specific information.

## What is Token Classification?

**Token Classification** assigns labels to individual tokens:
- **Input**: Sequence of tokens (words/subwords)
- **Output**: Label for each token with confidence scores
- **Examples**: Named Entity Recognition (NER), Part-of-Speech tagging, keyword extraction

## Learning Objectives

By the end of this notebook, you'll know how to:
1. Use pre-trained NER models for entity extraction
2. Handle different entity types and tagging schemes
3. Process and aggregate token-level predictions
4. Prepare custom token classification data in IOB format
5. Evaluate token classification performance
6. Handle real-world text processing challenges

Let's start extracting meaningful information from text! 🚀

In [None]:
# Import essential libraries
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
from transformers import TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import Dataset, load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

print("Libraries loaded successfully!")

## 1. Named Entity Recognition (NER)

Let's start with pre-trained NER models to extract entities from text:

In [None]:
# Named Entity Recognition with pre-trained models
print("🏷️ Named Entity Recognition")
print("=" * 28)

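# Load NER pipeline, pinning the checkpoint explicitly instead of relying on the task default
ner_pipeline = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
)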

# Test texts with various entities
test_texts = [
    "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976.",
    "The meeting with Microsoft is scheduled for next Tuesday at 3 PM in New York.",
    "Dr. Sarah Johnson from Harvard University published a paper about AI in Nature.",
    "Amazon reported $469 billion in revenue for 2021, up from $386 billion in 2020.",
    "The concert featuring Taylor Swift will be held at Madison Square Garden on December 15th."
]

# Emoji markers for each entity type (used when printing results)
entity_colors = {
    'PER': '👤',  # Person
    'ORG': '🏢',  # Organization  
    'LOC': '📍',  # Location
    'MISC': '🏷️'  # Miscellaneous
}

for i, text in enumerate(test_texts, 1):
    print(f"\n{i}. Text: '{text}'")
    entities = ner_pipeline(text)
    
    if entities:
        print("   Entities found:")
        for entity in entities:
            emoji = entity_colors.get(entity['entity_group'], '🔍')
            print(f"     {emoji} {entity['word']} ({entity['entity_group']}) - Score: {entity['score']:.3f}")
    else:
        print("   No entities found")
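
The `aggregation_strategy="simple"` argument used above merges token-level predictions into whole entities. The sketch below illustrates that idea in plain Python. It is a simplified approximation rather than the library's actual implementation: it assumes each raw prediction is a dict with `entity`, `score`, `start`, and `end` keys, it assumes the group score is the mean of the token scores, and the sample predictions are hand-written for illustration, not real model output.

```python
from statistics import mean


def group_token_predictions(text, token_preds):
    """Merge consecutive token predictions of the same entity type into groups."""
    groups = []
    current = None
    for pred in token_preds:
        prefix, _, entity_type = pred["entity"].partition("-")
        if prefix == "O":
            current = None  # an 'O' token closes any open group
            continue
        if current is not None and prefix == "I" and current["entity_group"] == entity_type:
            # Same type and not a 'B-' tag: extend the open group
            current["scores"].append(pred["score"])
            current["end"] = pred["end"]
        else:
            # 'B-' tag, a type change, or no open group: start a new group
            current = {
                "entity_group": entity_type,
                "scores": [pred["score"]],
                "start": pred["start"],
                "end": pred["end"],
            }
            groups.append(current)
    return [
        {
            "entity_group": g["entity_group"],
            "score": mean(g["scores"]),  # assumption: group score = mean token score
            "word": text[g["start"]:g["end"]],
            "start": g["start"],
            "end": g["end"],
        }
        for g in groups
    ]


# Hand-written token predictions (illustrative values, not real model output)
text = "Steve Jobs founded Apple"
token_preds = [
    {"entity": "B-PER", "score": 0.99, "start": 0, "end": 5},
    {"entity": "I-PER", "score": 0.98, "start": 6, "end": 10},
    {"entity": "O", "score": 0.99, "start": 11, "end": 18},
    {"entity": "B-ORG", "score": 0.97, "start": 19, "end": 24},
]

grouped = group_token_predictions(text, token_preds)
for group in grouped:
    print(group["entity_group"], repr(group["word"]), round(group["score"], 3))
```

Slicing the original text with the character offsets keeps the exact surface form, which avoids reassembling subword pieces by hand.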

## 2. Advanced Entity Extraction

Let's compare a general-purpose NER model with a domain-specific one on the same texts:

In [None]:
# Advanced entity extraction with different models
print("🔍 Advanced Entity Extraction")
print("=" * 32)

# Load different NER models
models = {
    "General NER": "dbmdz/bert-large-cased-finetuned-conll03-english",
    "Biomedical NER": "d4data/biomedical-ner-all"
}

# Sample texts for different domains
sample_texts = {
    "Business": "Elon Musk announced Tesla's quarterly earnings at the SpaceX facility in Hawthorne.",
    "Medical": "The patient was diagnosed with diabetes and prescribed metformin by Dr. Smith."
}

# Load each model once, rather than reloading it for every text
ner_models = {}
for model_name, model_path in models.items():
    try:
        ner_models[model_name] = pipeline(
            "ner", model=model_path, tokenizer=model_path, aggregation_strategy="simple"
        )
    except Exception as e:
        # Loading can fail when offline or when the checkpoint is unavailable
        print(f"   {model_name}: Model not available ({type(e).__name__}: {str(e)[:80]})\n")

for domain, text in sample_texts.items():
    print(f"\n📋 {domain} Text: '{text}'\n")

    for model_name, ner in ner_models.items():
        entities = ner(text)

        print(f"   {model_name}:")
        if entities:
            for entity in entities:
                print(f"     • {entity['word']} → {entity['entity_group']} ({entity['score']:.3f})")
        else:
            print("     No entities detected")
        print()
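
When several models run over the same text, their predicted spans can overlap. One simple way to reconcile them is to keep the highest-scoring span in each overlapping group. The sketch below assumes spans are dicts with `start`, `end`, `entity_group`, and `score` keys, as in the pipeline output; the `source` key and the sample spans are hypothetical and hand-written for illustration.

```python
def resolve_overlaps(spans):
    """Greedily keep the highest-scoring spans that do not overlap each other."""
    kept = []
    for span in sorted(spans, key=lambda s: s["score"], reverse=True):
        # Two half-open ranges overlap when each starts before the other ends
        overlaps = any(
            span["start"] < k["end"] and k["start"] < span["end"] for k in kept
        )
        if not overlaps:
            kept.append(span)
    # Return the survivors in reading order
    return sorted(kept, key=lambda s: s["start"])


# Hand-written spans (illustrative only), e.g. from two models on the same text
spans = [
    {"start": 0, "end": 9, "entity_group": "PER", "score": 0.95, "source": "model_a"},
    {"start": 5, "end": 9, "entity_group": "ORG", "score": 0.60, "source": "model_b"},
    {"start": 20, "end": 27, "entity_group": "ORG", "score": 0.90, "source": "model_a"},
]

resolved = resolve_overlaps(spans)
for span in resolved:
    print(span["start"], span["end"], span["entity_group"], span["source"])
```

Scores from different models are not necessarily calibrated against each other, so a greedy choice by score is only a starting point; preferring the domain-specific model for its own label set is another reasonable policy.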

## 3. Custom Token Classification

Create a custom dataset and understand the token classification process:

In [None]:
# Custom token classification dataset
print("🛠️ Custom Token Classification")
print("=" * 33)

# Sample data in IOB format (Inside-Outside-Beginning)
sample_sentences = [
    {
        "tokens": ["John", "works", "at", "Google", "in", "Mountain", "View"],
        "labels": ["B-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
    },
    {
        "tokens": ["Apple", "Inc", "was", "founded", "in", "Cupertino"],
        "labels": ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC"]
    },
    {
        "tokens": ["Microsoft", "CEO", "Satya", "Nadella", "announced", "new", "products"],
        "labels": ["B-ORG", "O", "B-PER", "I-PER", "O", "O", "O"]
    }
]

# Create label mappings
unique_labels = set()
for sentence in sample_sentences:
    unique_labels.update(sentence["labels"])

label_to_id = {label: idx for idx, label in enumerate(sorted(unique_labels))}
id_to_label = {idx: label for label, idx in label_to_id.items()}

print("Label mappings:")
for label, idx in sorted(label_to_id.items()):
    print(f"  {idx}: {label}")

# Show sample data structure
print("\nSample training data:")
for i, sentence in enumerate(sample_sentences):
    print(f"\nSentence {i+1}:")
    print("Tokens: ", sentence["tokens"])
    print("Labels: ", sentence["labels"])
    
    # Show token-label pairs
    print("Pairs:  ", end="")
    for token, label in zip(sentence["tokens"], sentence["labels"]):
        print(f"({token}/{label}) ", end="")
    print()
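
The word-level labels above cannot be fed to a subword model directly, because one word may be split into several tokens. A common approach is to label only the first subword of each word and mark special tokens and later subwords with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below assumes you already have a `word_ids` list mapping each token to its word index, such as the one a fast tokenizer returns from `tokenizer(tokens, is_split_into_words=True).word_ids()`. The `word_ids` values here are hypothetical and pretend that "Google" was split into two pieces.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def align_labels_with_word_ids(word_ids, word_labels, label_to_id):
    """Map word-level labels onto subword tokens, labeling only first subwords."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token such as [CLS] or [SEP]
        elif word_id != previous_word:
            aligned.append(label_to_id[word_labels[word_id]])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # later subword of the same word
        previous_word = word_id
    return aligned


word_labels = ["B-PER", "O", "O", "B-ORG"]        # John works at Google
label_ids = {"B-ORG": 0, "B-PER": 1, "O": 2}      # small mapping for this example
word_ids = [None, 0, 1, 2, 3, 3, None]            # hypothetical tokenizer output

aligned = align_labels_with_word_ids(word_ids, word_labels, label_ids)
print(aligned)
```

Labeling only the first subword keeps one prediction per word at evaluation time. Labeling every subword is an alternative that some training setups prefer.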

## 4. Entity Extraction and Aggregation

Process token-level predictions and extract complete entities:

In [None]:
# Entity extraction and aggregation
print("📊 Entity Extraction and Aggregation")
print("=" * 38)

def extract_entities_from_tokens(tokens, labels):
    """Extract complete entities from token-label pairs using IOB format"""
    entities = []
    current_entity = None
    
    for token, label in zip(tokens, labels):
        if label.startswith('B-'):  # Beginning of entity
            if current_entity:  # Save previous entity
                entities.append(current_entity)
            current_entity = {
                'text': token,
                'label': label[2:],  # Remove 'B-' prefix
                'tokens': [token]
            }
        elif (label.startswith('I-') and current_entity
              and label[2:] == current_entity['label']):  # Continues the current entity
            current_entity['text'] += ' ' + token
            current_entity['tokens'].append(token)
        else:  # 'O', or an 'I-' tag that does not continue the current entity
            if current_entity:
                entities.append(current_entity)
                current_entity = None
    
    # Don't forget the last entity
    if current_entity:
        entities.append(current_entity)
    
    return entities

# Test entity extraction
print("Testing entity extraction:")
for i, sentence in enumerate(sample_sentences):
    entities = extract_entities_from_tokens(sentence["tokens"], sentence["labels"])
    
    print(f"\nSentence {i+1}: {' '.join(sentence['tokens'])}")
    print("Extracted entities:")
    
    if entities:
        for entity in entities:
            emoji = {'PER': '👤', 'ORG': '🏢', 'LOC': '📍'}.get(entity['label'], '🏷️')
            print(f"  {emoji} '{entity['text']}' → {entity['label']} (tokens: {len(entity['tokens'])})")
    else:
        print("  No entities found")

# Demonstrate real NER pipeline with aggregation
print("\n" + "="*50)
print("Real NER Pipeline Comparison:")

# Reuse the ner_pipeline created in Section 1 instead of loading the model a second time
test_text = "Steve Jobs founded Apple Inc. in Cupertino, California."

print(f"\nText: '{test_text}'")
entities = ner_pipeline(test_text)

print("Pipeline results:")
for entity in entities:
    emoji = {'PER': '👤', 'ORG': '🏢', 'LOC': '📍'}.get(entity['entity_group'], '🏷️')
    print(f"  {emoji} '{entity['word']}' → {entity['entity_group']} (score: {entity['score']:.3f})")
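
Once tokens are grouped into entities, predictions can be scored at the entity level: a predicted entity counts as correct only if its boundaries and type both match a gold entity. The sketch below is a minimal, dependency-free version of that idea, not a replacement for a dedicated evaluation library. The gold labels come from the first sample sentence; the predicted labels are hypothetical, and treating an orphan `I-` tag as the start of a new entity is a choice made here.

```python
def spans_from_iob(labels):
    """Return the set of (start, end, type) entity spans in an IOB label sequence."""
    spans = set()
    start, entity_type = None, None
    # A trailing 'O' sentinel closes an entity that runs to the end of the sequence
    for i, label in enumerate(list(labels) + ["O"]):
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and name == entity_type
        if not continues:
            if entity_type is not None:
                spans.add((start, i, entity_type))
            if prefix in ("B", "I"):  # an orphan 'I-' tag starts a new entity here
                start, entity_type = i, name
            else:
                start, entity_type = None, None
    return spans


def entity_prf(gold_sequences, pred_sequences):
    """Compute entity-level precision, recall, and F1 over parallel sequences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = spans_from_iob(gold), spans_from_iob(pred)
        tp += len(gold_spans & pred_spans)  # exact boundary and type match
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [["B-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]]
pred = [["B-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]]  # hypothetical: misses 'View'

scores = entity_prf(gold, pred)
print({name: round(value, 3) for name, value in scores.items()})
```

Here the truncated location counts as both a false positive and a false negative, which is why entity-level scores are usually stricter than token-level accuracy.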

## 5. Real-world Information Extraction System

Build a comprehensive information extraction system:

In [None]:
# Real-world information extraction system
print("🌍 Real-world Information Extraction")
print("=" * 39)

class InformationExtractor:
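    def __init__(self, model_name="dbmdz/bert-large-cased-finetuned-conll03-english"):
        # Pin the checkpoint so results do not depend on the task default
        self.ner_pipeline = pipeline("ner", model=model_name, aggregation_strategy="simple")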
        
        # Entity type mappings
        self.entity_types = {
            'PER': {'name': 'People', 'emoji': '👤', 'color': 'blue'},
            'ORG': {'name': 'Organizations', 'emoji': '🏢', 'color': 'green'},
            'LOC': {'name': 'Locations', 'emoji': '📍', 'color': 'red'},
            'MISC': {'name': 'Miscellaneous', 'emoji': '🏷️', 'color': 'orange'}
        }
    
    def extract_entities(self, text):
        """Extract and organize entities from text"""
        entities = self.ner_pipeline(text)
        
        # Organize by entity type
        organized = {entity_type: [] for entity_type in self.entity_types.keys()}
        
        for entity in entities:
            entity_type = entity['entity_group']
            if entity_type in organized:
                organized[entity_type].append({
                    'text': entity['word'],
                    'score': entity['score'],
                    'start': entity['start'],
                    'end': entity['end']
                })
        
        return organized
    
    def analyze_text(self, text, min_confidence=0.5):
        """Comprehensive text analysis"""
        entities = self.extract_entities(text)
        
        analysis = {
            'text': text,
            'total_entities': 0,
            'entities_by_type': {},
            'high_confidence_entities': [],
            'summary': {}
        }
        
        for entity_type, entity_list in entities.items():
            if entity_list:
                filtered_entities = [e for e in entity_list if e['score'] >= min_confidence]
                analysis['entities_by_type'][entity_type] = filtered_entities
                analysis['total_entities'] += len(filtered_entities)
                
                # Add to high confidence list
                for entity in filtered_entities:
                    if entity['score'] >= 0.9:
                        analysis['high_confidence_entities'].append({
                            'text': entity['text'],
                            'type': entity_type,
                            'score': entity['score']
                        })
        
        # Generate summary
        for entity_type, entity_list in analysis['entities_by_type'].items():
            if entity_list:
                type_info = self.entity_types[entity_type]
                analysis['summary'][type_info['name']] = len(entity_list)
        
        return analysis

# Initialize extractor
extractor = InformationExtractor()
print("Information extractor initialized!")

# Test comprehensive information extraction
news_articles = [
    """
    Tesla CEO Elon Musk announced during a press conference in Austin, Texas, that the company 
    will be expanding its Gigafactory operations to include battery production for SpaceX rockets. 
    The announcement was made alongside CFO Zachary Kirkhorn and followed Tesla's record quarterly 
    earnings reported last week. The expansion is expected to create 500 new jobs in the Austin area.
    """,
    """
    Apple Inc. reported strong iPhone sales growth in China and India during the third quarter. 
    CEO Tim Cook praised the performance of Apple stores in Shanghai and Mumbai, noting increased 
    demand for the iPhone 15 Pro. The company's services revenue also grew significantly, driven 
    by App Store sales and Apple Music subscriptions across Asian markets.
    """
]

print("\n📰 News Article Analysis")
print("=" * 25)

for i, article in enumerate(news_articles, 1):
    article = article.strip()
    print(f"\n📋 Article {i}:")
    print(f"'{article[:100]}...'\n")
    
    analysis = extractor.analyze_text(article)
    
    print(f"🔍 Analysis Results:")
    print(f"   Total entities found: {analysis['total_entities']}")
    
    # Show summary
    if analysis['summary']:
        print(f"   Entity breakdown:")
        for entity_type, count in analysis['summary'].items():
            emoji = {'People': '👤', 'Organizations': '🏢', 'Locations': '📍', 'Miscellaneous': '🏷️'}.get(entity_type, '🏷️')
            print(f"     {emoji} {entity_type}: {count}")
    
    # Show high confidence entities
    if analysis['high_confidence_entities']:
        print(f"\n   🎯 High Confidence Entities (score >= 0.9):")
        for entity in analysis['high_confidence_entities']:
            type_info = extractor.entity_types[entity['type']]
            print(f"     {type_info['emoji']} {entity['text']} ({entity['score']:.3f})")
    
    # Show detailed entities by type
    print(f"\n   📊 Detailed Entity List:")
    for entity_type, entities in analysis['entities_by_type'].items():
        if entities:
            type_info = extractor.entity_types[entity_type]
            print(f"     {type_info['emoji']} {type_info['name']}:")
            for entity in entities:
                print(f"        • {entity['text']} (confidence: {entity['score']:.3f})")
    
    print("-" * 60)
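
The `start` and `end` offsets stored by the extractor can also be used to mark entities directly in the source text. The sketch below wraps each entity as `[surface](TYPE)`. It assumes non-overlapping entities given as dicts with `start`, `end`, and a `type` key (the `type` key and the `annotate` helper are hypothetical names for this example), and the offsets are hand-written for illustration.

```python
def annotate(text, entities):
    """Return the text with each entity wrapped as [surface](TYPE)."""
    annotated = text
    # Work right to left so earlier offsets stay valid after each insertion
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        surface = text[entity["start"]:entity["end"]]
        marked = f"[{surface}]({entity['type']})"
        annotated = annotated[:entity["start"]] + marked + annotated[entity["end"]:]
    return annotated


# Hand-written offsets (illustrative only)
sample = "Tim Cook leads Apple."
sample_entities = [
    {"start": 0, "end": 8, "type": "PER"},
    {"start": 15, "end": 20, "type": "ORG"},
]

annotated_text = annotate(sample, sample_entities)
print(annotated_text)
```

Because the surface form is sliced from the original string, the output keeps the original casing and spacing even when the pipeline's `word` field differs slightly.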

## 🎯 Key Takeaways

**What you've learned about token classification:**

✅ **Named Entity Recognition**: Extract people, organizations, and locations from text  
✅ **Token-level Processing**: Understand how models work at the token (word or subword) level  
✅ **Entity Aggregation**: Combine token predictions into complete entities  
✅ **IOB Tagging**: Work with Inside-Outside-Beginning annotation schemes  
✅ **Multiple Models**: Compare different NER models for various domains  
✅ **Real-world Applications**: Build comprehensive information extraction systems  
✅ **Performance Analysis**: Evaluate and filter entity predictions by confidence  

## 🔧 Best Practices

1. **Use aggregation strategies** to combine subword tokens into complete entities
2. **Set confidence thresholds** to filter out low-quality predictions
3. **Choose domain-specific models** for better performance (e.g., biomedical NER)
4. **Handle overlapping entities** carefully in complex text
5. **Post-process results** to clean and organize extracted information
6. **Validate entities** against known databases or rules when possible
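
As a rough illustration of the last practice, extracted entities can be checked against a reference list. This is a minimal sketch under stated assumptions: `KNOWN_ORGS` is a hypothetical, hand-written gazetteer, `validate_entities` is a helper name invented for this example, and entities are assumed to be dicts with `text` and `type` keys.

```python
KNOWN_ORGS = {"apple inc.", "tesla", "microsoft"}  # hypothetical gazetteer


def validate_entities(entities, known_names, entity_type="ORG"):
    """Flag entities of one type as validated if they appear in a reference set."""
    validated = []
    for entity in entities:
        checked = dict(entity)  # copy so the input is not modified
        if entity["type"] == entity_type:
            # Case-insensitive exact match against the reference names
            checked["validated"] = entity["text"].strip().lower() in known_names
        else:
            checked["validated"] = None  # no reference list for this type
        validated.append(checked)
    return validated


# Hand-written entities (illustrative only)
candidates = [
    {"text": "Tesla", "type": "ORG"},
    {"text": "Gigafactory", "type": "ORG"},
    {"text": "Austin", "type": "LOC"},
]

checked_entities = validate_entities(candidates, KNOWN_ORGS)
for entity in checked_entities:
    print(entity["text"], entity["type"], entity["validated"])
```

Exact matching is deliberately strict; real reference data usually needs alias handling or fuzzy matching on top of this.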

## 🚀 Next Steps

Ready for the next challenge?

**Continue to**: `07_text_generation.ipynb` - Learn about generating text with language models!

**Practice**: Try extracting entities from documents in your domain (legal, medical, financial)!

Excellent work mastering token classification! 🏆