# NER Testing and Fine-tuning Notebook

This notebook provides:
1. Testing different NER models (spaCy, Transformers, etc.)
2. Evaluation metrics for organization extraction
3. Fine-tuning pipelines for custom models
4. Comparison of model performance
5. Data preparation utilities

## Installation Requirements
```bash
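pip install spacy transformers datasets torch evaluate seqeval
pip install spacy-transformers
pip install pandas numpy matplotlib seaborn scikit-learn
# Likely needed for the SentencePiece-based DeBERTa-v3 tokenizer
pip install sentencepiece
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
python -m spacy download en_core_web_trf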
```
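
spaCy pipelines installed with `spacy download` are ordinary Python packages, so one way to see which of them still need installing is to look each name up with the standard library. The sketch below assumes that packaging convention; `missing_packages` is a hypothetical helper, not part of spaCy.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found as importable top-level packages."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# spaCy pipelines referenced in this notebook
SPACY_MODELS = ["en_core_web_sm", "en_core_web_md", "en_core_web_lg", "en_core_web_trf"]

for name in missing_packages(SPACY_MODELS):
    print(f"Missing: python -m spacy download {name}")
```

Running this before the model-loading cell gives one list of outstanding downloads up front, rather than a separate message for each failed load.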
"""

In [3]:
import spacy
import pandas as pd
import numpy as np
import re
import json
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import warnings
warnings.filterwarnings('ignore')

# Transformers and datasets
from transformers import (
    AutoTokenizer, AutoModelForTokenClassification,
    TrainingArguments, Trainer, DataCollatorForTokenClassification,
    pipeline
)
from datasets import Dataset, DatasetDict
import torch
from torch.utils.data import DataLoader

# Evaluation
from seqeval.metrics import classification_report, f1_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix

# spaCy training
from spacy.training.example import Example
from spacy.util import minibatch, compounding

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Setup complete!")


Setup complete!


In [5]:
def create_sample_data():
    """Create sample data for testing and training"""
    
    # Sample texts with organizations
    sample_texts = [
        "Microsoft Corporation announced a partnership with OpenAI to integrate AI capabilities.",
        "Apple Inc. reported record quarterly earnings, outperforming Google LLC and Meta Platforms Inc.",
        "JPMorgan Chase & Co. is the largest bank in the United States, followed by Bank of America Corp.",
        "Tesla Inc. and General Motors Company are competing in the electric vehicle market.",
        "Amazon.com Inc. acquired Whole Foods Market for $13.7 billion.",
        "Salesforce Inc. is a leading cloud computing company based in San Francisco.",
        "Oracle Corporation provides database software and technology solutions.",
        "Netflix Inc. competes with Disney+ and HBO Max in the streaming market.",
        "IBM Corporation has been a technology leader for over a century.",
        "NVIDIA Corporation is known for its graphics processing units and AI chips.",
        "Cisco Systems Inc. provides networking hardware and software solutions.",
        "Intel Corporation manufactures semiconductors and microprocessors.",
        "Adobe Inc. develops creative software including Photoshop and Illustrator.",
        "PayPal Holdings Inc. facilitates online payments for millions of users.",
        "Goldman Sachs Group Inc. is a leading investment banking firm.",
        "Morgan Stanley provides wealth management and investment services.",
        "Wells Fargo & Company is one of the largest banks in the United States.",
        "Berkshire Hathaway Inc. is Warren Buffett's investment company.",
        "Johnson & Johnson develops pharmaceuticals and medical devices.",
        "Pfizer Inc. is a global pharmaceutical corporation.",
        "The Federal Reserve announced changes to interest rates affecting all major banks.",
        "Stanford University researchers collaborated with MIT on artificial intelligence.",
        "Harvard Business School published a study on corporate governance.",
        "The New York Stock Exchange saw heavy trading in technology stocks.",
        "BlackRock Inc. is the world's largest asset management firm."
    ]
    
    return sample_texts

def create_training_data():
    """Create training data in spaCy format"""
    
    training_data = [
        ("Microsoft Corporation announced a partnership with OpenAI.", 
         {"entities": [(0, 21, "ORG"), (51, 57, "ORG")]}),
        
        ("Apple Inc. reported record quarterly earnings, outperforming Google LLC.", 
         {"entities": [(0, 10, "ORG"), (61, 71, "ORG")]}),
        
        ("JPMorgan Chase & Co. is the largest bank in the United States.", 
         {"entities": [(0, 20, "ORG")]}),
        
        ("Tesla Inc. and General Motors Company are competing in the market.", 
         {"entities": [(0, 10, "ORG"), (15, 37, "ORG")]}),
        
        ("Amazon.com Inc. acquired Whole Foods Market for $13.7 billion.", 
         {"entities": [(0, 15, "ORG"), (25, 43, "ORG")]}),
        
        ("Salesforce Inc. is a leading cloud computing company.", 
         {"entities": [(0, 15, "ORG")]}),
        
        ("Oracle Corporation provides database software solutions.", 
         {"entities": [(0, 18, "ORG")]}),
        
        ("Netflix Inc. competes with Disney+ and HBO Max in streaming.", 
         {"entities": [(0, 12, "ORG"), (27, 34, "ORG"), (39, 46, "ORG")]}),
        
        ("IBM Corporation has been a technology leader for decades.", 
         {"entities": [(0, 15, "ORG")]}),
        
        ("NVIDIA Corporation is known for its graphics processing units.", 
         {"entities": [(0, 18, "ORG")]}),
    ]
    
    return training_data

# Create sample data
sample_texts = create_sample_data()
training_data = create_training_data()

print(f"Created {len(sample_texts)} sample texts")
print(f"Created {len(training_data)} training examples")


Created 25 sample texts
Created 10 training examples
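
Hand-counted character offsets are easy to get wrong, and a span that cuts through a word cannot be aligned to token boundaries during spaCy training. A minimal sketch of two safeguards follows: deriving offsets with `str.find` instead of counting, and flagging spans that are padded with whitespace or split a word. Both `make_example` and `misaligned_spans` are hypothetical helpers introduced here for illustration, and the word-boundary test is a simple alphanumeric heuristic rather than spaCy's own tokenization.

```python
def make_example(text, phrases, label="ORG"):
    """Build a spaCy-style (text, {"entities": [...]}) tuple by locating each phrase.

    Phrases must be listed in the order they appear in the text.
    """
    entities = []
    cursor = 0
    for phrase in phrases:
        start = text.find(phrase, cursor)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r}")
        end = start + len(phrase)
        entities.append((start, end, label))
        cursor = end  # continue searching after this phrase
    return (text, {"entities": entities})


def misaligned_spans(example):
    """Return (start, end, span_text) for spans that look wrongly delimited."""
    text, annotations = example
    bad = []
    for start, end, _label in annotations["entities"]:
        span = text[start:end]
        # A span "cuts" a word if alphanumeric characters sit on both sides of an edge.
        cuts_left = start > 0 and text[start - 1].isalnum() and span[:1].isalnum()
        cuts_right = end < len(text) and text[end].isalnum() and span[-1:].isalnum()
        if not span or span != span.strip() or cuts_left or cuts_right:
            bad.append((start, end, span))
    return bad


example = make_example(
    "Microsoft Corporation announced a partnership with OpenAI.",
    ["Microsoft Corporation", "OpenAI"],
)
print(example[1])
print(misaligned_spans(example))
```

Passing each entry of `training_data` through `misaligned_spans` before training is a cheap way to catch offset errors early.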


In [6]:
class NERModelTester:
    """Class to test different NER models"""
    
    def __init__(self):
        self.models = {}
        self.results = {}
    
    def load_spacy_models(self):
        """Load available spaCy models"""
        spacy_models = {
            'en_core_web_sm': 'en_core_web_sm',
            'en_core_web_md': 'en_core_web_md', 
            'en_core_web_lg': 'en_core_web_lg',
            'en_core_web_trf': 'en_core_web_trf'
        }
        
        for name, model_name in spacy_models.items():
            try:
                self.models[name] = spacy.load(model_name)
                print(f"Loaded {name}")
            except OSError:
                print(f"Model {name} not available. Install with: python -m spacy download {model_name}")
    
    def load_transformer_models(self):
        """Load transformer-based NER models"""
        transformer_models = {
            'dbmdz-bert': 'dbmdz/bert-large-cased-finetuned-conll03-english',
            'dslim-bert': 'dslim/bert-base-NER',
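            # NOTE: base checkpoint with no NER fine-tuning; its token-classification
            # head would be randomly initialized, so its predictions are not meaningful.
            'microsoft-deberta': 'microsoft/deberta-v3-base'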
        }
        
        for name, model_name in transformer_models.items():
            try:
                tokenizer = AutoTokenizer.from_pretrained(model_name)
                model = AutoModelForTokenClassification.from_pretrained(model_name)
                self.models[name] = pipeline("ner", 
                                            model=model, 
                                            tokenizer=tokenizer, 
                                            aggregation_strategy="simple")
                print(f"Loaded {name}")
            except Exception as e:
                print(f"Failed to load {name}: {str(e)}")
    
    def extract_organizations_spacy(self, text, model):
        """Extract organizations using spaCy model"""
        doc = model(text)
        organizations = []
        
        for ent in doc.ents:
            if ent.label_ == "ORG":
                organizations.append({
                    'text': ent.text,
                    'start': ent.start_char,
                    'end': ent.end_char,
                    'confidence': 1.0  # placeholder: doc.ents carries no per-entity confidence score
                })
        
        return organizations
    
    def extract_organizations_transformer(self, text, model):
        """Extract organizations using transformer model"""
        results = model(text)
        organizations = []
        
        for result in results:
            if 'ORG' in result['entity_group']:  # substring test, so it also matches 'ORGANIZATION'
                organizations.append({
                    'text': result['word'],
                    'start': result['start'],
                    'end': result['end'],
                    'confidence': result['score']
                })
        
        return organizations
    
    def test_model_on_texts(self, model_name, texts):
        """Test a specific model on sample texts"""
        model = self.models[model_name]
        results = []
        
        for text in texts:
            # spaCy pipelines are Language instances; anything else is a Transformers pipeline
            if isinstance(model, spacy.language.Language):
                orgs = self.extract_organizations_spacy(text, model)
            else:
                orgs = self.extract_organizations_transformer(text, model)
            
            results.append({
                'text': text,
                'organizations': orgs,
                'org_count': len(orgs)
            })
        
        return results
    
    def run_comprehensive_test(self, texts):
        """Run tests on all available models"""
        print("Running comprehensive NER tests...\n")
        
        for model_name in self.models.keys():
            print(f"Testing {model_name}...")
            try:
                results = self.test_model_on_texts(model_name, texts[:5])  # Test on first 5 texts
                self.results[model_name] = results
                
                # Print summary
                total_orgs = sum([r['org_count'] for r in results])
                print(f"  Total organizations found: {total_orgs}")
                
                # Show examples
                for i, result in enumerate(results[:2]):
                    print(f"  Text {i+1}: {result['text'][:50]}...")
                    for org in result['organizations']:
                        print(f"    - {org['text']} (confidence: {org['confidence']:.3f})")
                
                print()
                
            except Exception as e:
                print(f"  Error testing {model_name}: {str(e)}\n")

# Initialize and test models
tester = NERModelTester()
tester.load_spacy_models()
tester.load_transformer_models()

# Run tests
tester.run_comprehensive_test(sample_texts)

Loaded en_core_web_sm
Model en_core_web_md not available. Install with: python -m spacy download en_core_web_md
Model en_core_web_lg not available. Install with: python -m spacy download en_core_web_lg
Model en_core_web_trf not available. Install with: python -m spacy download en_core_web_trf


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Loaded dbmdz-bert


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Loaded dslim-bert



Failed to load microsoft-deberta: Converting from SentencePiece and Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast converters: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Q