# Model-based Named Entity Recognition with spaCy

In this notebook, we explore **model-based Named Entity Recognition (NER)** using spaCy, a widely used open-source NLP library.

## What is Model-based NER?

Unlike lexicon-based approaches that use predefined word lists, model-based NER uses **machine learning models** trained on large annotated text corpora to recognize entities based on:
- **Context**: Words surrounding the entity
- **Grammar**: Syntactic patterns
- **Learned patterns**: Statistical patterns from training data

## Learning Objectives

After completing this notebook, you will:
- Understand how spaCy's NER model works
- Apply spaCy to extract locations from EU Parliament speeches
- Analyze the model's strengths and weaknesses
- Compare model results with lexicon-based approaches

## Structure

0. Setup and data loading
1. Understanding spaCy's NER model
2. Applying spaCy to EU Parliament speeches
3. Analyzing results
4. Error analysis
5. Comparison with a lexicon-based approach
6. Conclusions and reflection

## 0. Setup and Installation

In [6]:
# Install packages (only needed the first time)
!pip install spacy pandas tqdm matplotlib
# !python -m spacy download en_core_web_lg



In [7]:
# Import libraries
import spacy
import pandas as pd
from collections import Counter
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("✓ All libraries successfully loaded!")

✓ All libraries successfully loaded!


## 1. Load Data

In [8]:
# Load EU Parliament speeches
df = pd.read_csv('eu_speeches_2024_english.csv')

print(f"Number of speeches: {len(df)}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst rows:")
df.head()

Number of speeches: 1828

Columns: ['speaker', 'text', 'party', 'date', 'agenda', 'speechnumber', 'procedure_ID', 'partyfacts_ID', 'period', 'chair', 'MEP', 'commission', 'written', 'multispeaker', 'link', 'translationInSpeech', 'translatedText']

First rows:


Unnamed: 0,speaker,text,party,date,agenda,speechnumber,procedure_ID,partyfacts_ID,period,chair,MEP,commission,written,multispeaker,link,translationInSpeech,translatedText
0,President,I wish you an excellent good morning on this l...,-,2024-04-25,2. Interinstitutional Body for Ethical Standar...,1,,,9,True,False,False,False,False,https://www.europarl.europa.eu/doceo/document/...,,
1,President,"Dear colleagues, today marks the 50th annivers...",-,2024-04-25,6. Statements by the President,1,,,9,True,False,False,False,False,https://www.europarl.europa.eu/doceo/document/...,,
2,Virginijus Sinkevičius,"Mr President, honourable Members, dear rapport...",-,2024-04-25,4. Framework of measures for strengthening Eur...,7,bill_165_ID bill_195_ID bill_165_ID bill_195_ID,,9,False,False,True,False,False,https://www.europarl.europa.eu/doceo/document/...,,
3,Anna Deparnay-Grunenberg,"Mr President, I came to this House to be a voi...",Greens/EFA,2024-04-25,4. Framework of measures for strengthening Eur...,9,bill_165_ID bill_195_ID bill_165_ID bill_195_ID,6403.0,9,False,True,False,False,False,https://www.europarl.europa.eu/doceo/document/...,,
4,Seán Kelly,"A Uachtaráin, teastaíonn uaim mo thacaíocht io...",PPE,2024-04-25,4. Framework of measures for strengthening Eur...,24,bill_165_ID bill_195_ID bill_165_ID bill_195_ID,6398.0,9,False,True,False,False,False,https://www.europarl.europa.eu/doceo/document/...,,


---

# Part 1: Understanding spaCy's NER Model

## How does spaCy NER work?

spaCy uses a **neural network** trained on large annotated text corpora. The model:

1. **Tokenizes** the text (splits it into tokens such as words and punctuation)
2. **Analyzes context** (looks at surrounding words)
3. **Applies learned patterns** (from training data)
4. **Predicts entity types** (PERSON, GPE, LOC, ORG, etc.)

### Key Features:

- ✓ **Context-aware**: Understands "Paris Hilton" (person) vs "Paris, France" (location)
- ✓ **Generalizable**: Can recognize entities not in training data
- ✓ **Multi-entity**: Recognizes many entity types simultaneously
- ✗ **Black box**: Hard to understand why specific decisions were made
- ✗ **Slower**: Requires more computational resources
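
The prediction step can be made concrete. spaCy exposes its result per token through `token.ent_iob_` (`B` = begins an entity, `I` = inside one, `O` = outside) and `token.ent_type_`. The sketch below is not spaCy's internal algorithm; it only shows, in plain Python with hand-written tags, how such token-level labels group into entity spans. The function name `spans_from_iob` and the example tags are assumptions made for illustration.

```python
def spans_from_iob(tokens, iob_tags, ent_types):
    """Group token-level IOB tags into (entity_text, label) spans."""
    spans = []
    current_tokens, current_label = [], None

    for token, iob, label in zip(tokens, iob_tags, ent_types):
        # An "I" tag with the same label extends the entity that is currently open
        if iob == "I" and current_tokens and label == current_label:
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        # "B" (or a stray "I") opens a new entity; "O" opens nothing
        if iob in ("B", "I"):
            current_tokens, current_label = [token], label

    # Close an entity that runs to the end of the text
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hand-written tags for illustration (not actual model output)
tokens = ["Tim", "Cook", "met", "Olaf", "Scholz", "in", "Munich", "."]
iob_tags = ["B", "I", "O", "B", "I", "O", "B", "O"]
ent_types = ["PERSON", "PERSON", "", "PERSON", "PERSON", "", "GPE", ""]

spans = spans_from_iob(tokens, iob_tags, ent_types)
print(spans)
```

With a loaded pipeline, the same three lists could be read from a processed document as `[t.text for t in doc]`, `[t.ent_iob_ for t in doc]`, and `[t.ent_type_ for t in doc]`.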

## 1.1 Load spaCy Model

In [9]:
# Load the English language model
# Available models:
# - en_core_web_sm: Small, fast, 12 MB
# - en_core_web_md: Medium, balanced, 40 MB
# - en_core_web_lg: Large, accurate, 560 MB

MODEL_NAME = 'en_core_web_lg'

print(f"Loading spaCy model '{MODEL_NAME}'...")
nlp = spacy.load(MODEL_NAME)
print("✓ Model loaded!")

# Model information
print(f"\nModel information:")
print(f"  Language: {nlp.meta['lang']}")
print(f"  Pipeline: {nlp.pipe_names}")
print(f"  Vectors: {nlp.meta.get('vectors', {}).get('keys', 0)} word vectors")

Loading spaCy model 'en_core_web_lg'...
✓ Model loaded!

Model information:
  Language: en
  Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
  Vectors: 684830 word vectors


## 1.2 Entity Types in spaCy

spaCy's English pipelines (`en_core_web_*`) recognize 18 entity types, following the OntoNotes 5 annotation scheme:

In [10]:
# Display all entity types
entity_types = {
    'PERSON': 'People, including fictional',
    'NORP': 'Nationalities or religious or political groups',
    'FAC': 'Buildings, airports, highways, bridges, etc.',
    'ORG': 'Companies, agencies, institutions, etc.',
    'GPE': 'Countries, cities, states (Geopolitical Entity)',
    'LOC': 'Non-GPE locations, mountain ranges, bodies of water',
    'PRODUCT': 'Objects, vehicles, foods, etc.',
    'EVENT': 'Named hurricanes, battles, wars, sports events, etc.',
    'WORK_OF_ART': 'Titles of books, songs, etc.',
    'LAW': 'Named documents made into laws',
    'LANGUAGE': 'Any named language',
    'DATE': 'Absolute or relative dates or periods',
    'TIME': 'Times smaller than a day',
    'PERCENT': 'Percentage',
    'MONEY': 'Monetary values',
    'QUANTITY': 'Measurements',
    'ORDINAL': '"first", "second", etc.',
    'CARDINAL': 'Numerals that do not fall under another type'
}

print("Entity types relevant for location extraction:\n")
print(f"  GPE: {entity_types['GPE']}")
print(f"  LOC: {entity_types['LOC']}")
print(f"\nWe'll focus on these two types for finding locations!")

Entity types relevant for location extraction:

  GPE: Countries, cities, states (Geopolitical Entity)
  LOC: Non-GPE locations, mountain ranges, bodies of water

We'll focus on these two types for finding locations!


## 1.3 Testing the Model

Let's see how spaCy performs on example sentences:

In [11]:
# Test sentences
test_sentences = [
    "In Paris and Berlin, we are discussing the future of Europe with partners from Madrid and Rome.",
    "Apple CEO Tim Cook met with German Chancellor Olaf Scholz in Munich.",
    "The Amazon rainforest and the Amazon company are both important topics.",
    "President Biden visited Brussels for a NATO summit before traveling to Warsaw."
]

print("Testing spaCy NER on example sentences:\n")

for i, sentence in enumerate(test_sentences, 1):
    doc = nlp(sentence)
    
    print(f"Sentence {i}: {sentence}")
    print(f"\nFound entities:")
    
    for ent in doc.ents:
        print(f"  '{ent.text}' → {ent.label_} ({entity_types.get(ent.label_, 'Unknown')})")
    
    print(f"\nLocations only (GPE + LOC):")
    locations = [ent.text for ent in doc.ents if ent.label_ in ['GPE', 'LOC']]
    print(f"  {locations}")
    print("\n" + "─" * 80 + "\n")

print("Observations:")
print("  - The model uses context (e.g., 'Apple' is tagged as an organization, not a location)")
print("  - It distinguishes between different entity types")
print("  - It can handle multi-word entities ('Tim Cook', 'Olaf Scholz')")

Testing spaCy NER on example sentences:

Sentence 1: In Paris and Berlin, we are discussing the future of Europe with partners from Madrid and Rome.

Found entities:
  'Paris' → GPE (Countries, cities, states (Geopolitical Entity))
  'Berlin' → GPE (Countries, cities, states (Geopolitical Entity))
  'Europe' → LOC (Non-GPE locations, mountain ranges, bodies of water)
  'Madrid' → GPE (Countries, cities, states (Geopolitical Entity))
  'Rome' → GPE (Countries, cities, states (Geopolitical Entity))

Locations only (GPE + LOC):
  ['Paris', 'Berlin', 'Europe', 'Madrid', 'Rome']

────────────────────────────────────────────────────────────────────────────────

Sentence 2: Apple CEO Tim Cook met with German Chancellor Olaf Scholz in Munich.

Found entities:
  'Apple' → ORG (Companies, agencies, institutions, etc.)
  'Tim Cook' → PERSON (People, including fictional)
  'German' → NORP (Nationalities or religious or political groups)
  'Olaf Scholz' → PERSON (People, including fictional)
  'Mun

---

# Part 2: Applying spaCy to EU Parliament Speeches

Now let's extract locations from real political speeches.

## 2.1 Define Location Extraction Function

In [12]:
def find_locations_spacy(text, nlp_model, include_loc=True):
    """
    Finds locations in a text using spaCy.
    
    Parameters:
        text: The text to analyze
        nlp_model: The loaded spaCy model
        include_loc: Whether to include LOC entities (non-GPE locations)
    
    Returns:
        List of (location_text, entity_label) tuples
    """
    # Process text through model
    doc = nlp_model(text)
    
    # Extract location entities
    if include_loc:
        locations = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ['GPE', 'LOC']]
    else:
        locations = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE']
    
    return locations

def get_location_texts(text, nlp_model):
    """
    Returns just the location names (without types).
    """
    locations = find_locations_spacy(text, nlp_model)
    return [loc[0] for loc in locations]

# Test the function
test_text = "The European Commission in Brussels works with member states like France, Germany, and Poland to address challenges facing the EU."

print("Testing extraction function:\n")
print(f"Text: {test_text}\n")

locations_with_types = find_locations_spacy(test_text, nlp)
print("Locations with types:")
for loc, typ in locations_with_types:
    print(f"  {loc} ({typ})")

print(f"\nLocation names only:")
print(f"  {get_location_texts(test_text, nlp)}")

Testing extraction function:

Text: The European Commission in Brussels works with member states like France, Germany, and Poland to address challenges facing the EU.

Locations with types:
  Brussels (GPE)
  France (GPE)
  Germany (GPE)
  Poland (GPE)

Location names only:
  ['Brussels', 'France', 'Germany', 'Poland']


## 2.2 Extract Locations from All Speeches

**Note**: Runtime depends on dataset size and hardware. The run recorded below processed 1,828 speeches in about 45 seconds; on slower machines it can take several minutes.
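
One way to reduce runtime is to stream the texts through `nlp.pipe` in batches instead of calling the model once per speech, and to switch off pipeline components that location extraction does not use. The following is a minimal sketch, not the method used in this notebook. The helper name `extract_locations_batched` is hypothetical, and the commented usage assumes a spaCy v3 pipeline in which `ner` does not depend on the disabled components, so it is worth checking on a few speeches that the entities are unchanged.

```python
def extract_locations_batched(texts, nlp_model, labels=("GPE", "LOC"), batch_size=64):
    """Return one list of (entity_text, entity_label) tuples per input text."""
    results = []
    # nlp.pipe processes the texts in batches rather than one call per text
    for doc in nlp_model.pipe(texts, batch_size=batch_size):
        results.append(
            [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels]
        )
    return results


# Possible usage with the objects defined in this notebook (not executed here):
# with nlp.select_pipes(disable=["tagger", "parser", "attribute_ruler", "lemmatizer"]):
#     df['ner_spacy_detailed'] = extract_locations_batched(df['text'].tolist(), nlp)
```

The function returns results in input order, so the list can be assigned directly to a DataFrame column.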

In [13]:
print("Extracting locations from all speeches with spaCy...")
print(f"Processing {len(df)} speeches (this may take 2-10 minutes)\n")

# Apply extraction to all speeches
tqdm.pandas(desc="Processing speeches")

df['ner_spacy_detailed'] = df['text'].progress_apply(
    lambda text: find_locations_spacy(text, nlp)
)

# Extract just the location names
df['ner_spacy'] = df['ner_spacy_detailed'].apply(
    lambda locs: [loc[0] for loc in locs]
)

# Count locations per speech
df['ner_spacy_count'] = df['ner_spacy'].apply(len)

print("\n✓ Extraction completed!")
print(f"\nStatistics:")
print(f"  Speeches with at least one location: {(df['ner_spacy_count'] > 0).sum()} of {len(df)}")
print(f"  Average locations per speech: {df['ner_spacy_count'].mean():.2f}")
print(f"  Maximum locations in one speech: {df['ner_spacy_count'].max()}")
print(f"  Total unique locations found: {len(set([loc for locs in df['ner_spacy'] for loc in locs]))}")

Extracting locations from all speeches with spaCy...
Processing 1828 speeches (this may take 2-10 minutes)



Processing speeches: 100%|██████████| 1828/1828 [00:45<00:00, 40.61it/s]


✓ Extraction completed!

Statistics:
  Speeches with at least one location: 1297 of 1828
  Average locations per speech: 3.69
  Maximum locations in one speech: 54
  Total unique locations found: 580





---

# Part 3: Analyzing Results

Let's explore what the model found.

## 3.1 Most Frequently Mentioned Locations

In [14]:
# Collect all locations
all_locations_spacy = [loc for locs in df['ner_spacy'] for loc in locs]
location_counts_spacy = Counter(all_locations_spacy)

print("Top 20 most mentioned locations:\n")
for i, (location, count) in enumerate(location_counts_spacy.most_common(20), 1):
    print(f"  {i:2d}. {location:25s} ({count:3d} mentions)")

Top 20 most mentioned locations:

   1. Europe                    (1153 mentions)
   2. Ukraine                   (758 mentions)
   3. Russia                    (501 mentions)
   4. Israel                    (280 mentions)
   5. Gaza                      (271 mentions)
   6. Ireland                   (162 mentions)
   7. US                        (121 mentions)
   8. Iran                      (111 mentions)
   9. Hungary                   (106 mentions)
  10. China                     ( 98 mentions)
  11. EUR                       ( 87 mentions)
  12. the Member States         ( 74 mentions)
  13. Armenia                   ( 71 mentions)
  14. Finland                   ( 66 mentions)
  15. Leyen                     ( 61 mentions)
  16. Germany                   ( 56 mentions)
  17. Serbia                    ( 56 mentions)
  18. Romania                   ( 56 mentions)
  19. Brussels                  ( 55 mentions)
  20. Greece                    ( 55 mentions)


## 3.2 Distribution by Entity Type

Let's see the split between GPE (countries/cities) and LOC (other locations):

In [15]:
# Collect all entities with types
all_entities_detailed = [ent for ents in df['ner_spacy_detailed'] for ent in ents]

# Count by type
gpe_entities = [ent[0] for ent in all_entities_detailed if ent[1] == 'GPE']
loc_entities = [ent[0] for ent in all_entities_detailed if ent[1] == 'LOC']

print("Distribution by entity type:\n")
print(f"  GPE (Geopolitical Entity): {len(gpe_entities)} mentions")
print(f"    Examples: {list(Counter(gpe_entities).most_common(5))}")
print()
print(f"  LOC (Other Locations): {len(loc_entities)} mentions")
print(f"    Examples: {list(Counter(loc_entities).most_common(5))}")
print()
print(f"Observation: GPE typically includes countries and cities,")
print(f"   while LOC includes regions, bodies of water, etc.")

Distribution by entity type:

  GPE (Geopolitical Entity): 5194 mentions
    Examples: [('Ukraine', 758), ('Russia', 501), ('Israel', 280), ('Gaza', 271), ('Ireland', 162)]

  LOC (Other Locations): 1549 mentions
    Examples: [('Europe', 1153), ('the Middle East', 30), ('the Western Balkans', 25), ('Africa', 21), ('Horizon Europe', 21)]

Observation: GPE typically includes countries and cities,
   while LOC includes regions, bodies of water, etc.


## 3.3 Example Speeches

Let's look at specific speeches in detail:

In [16]:
# Find speech with most locations
max_locations_idx = df['ner_spacy_count'].idxmax()
max_locations_speech = df.loc[max_locations_idx]

print("="*80)
print("SPEECH WITH MOST LOCATION MENTIONS")
print("="*80)
print(f"Speaker: {max_locations_speech['speaker']}")
print(f"Party: {max_locations_speech['party']}")
print(f"Date: {max_locations_speech['date']}")
print(f"Number of locations: {max_locations_speech['ner_spacy_count']}")
print(f"\nLocations found: {max_locations_speech['ner_spacy']}")
print(f"\nText (first 500 characters):")
print(max_locations_speech['text'][:500] + "...")
print("="*80)

SPEECH WITH MOST LOCATION MENTIONS
Speaker: Ursula von der Leyen
Party: -
Date: 2024-03-12
Number of locations: 54

Locations found: ['Ukraine', 'the Middle East', 'the Middle East', 'Cyprus', 'Gaza', 'Cyprus', 'the United\xa0Arab Emirates', 'the United States', 'the United Kingdom', 'Larnaca', 'Gaza', 'Gaza', 'Gaza', 'Gaza', 'Gaza', 'The United States', 'The United\xa0Arab Emirates', 'Cyprus', 'Larnaca', 'Cyprus', 'Jordan', 'Gaza', 'Gaza', 'EUR', 'Strip', 'Gaza', 'EUR', 'Israel', 'Gaza', 'Israel', 'Lebanon', 'Iran', 'Yemen', 'Russia', 'Ukraine', 'Iran', 'Russia', 'Gaza', 'the Member States', 'Gaza', 'the Western Balkans', 'the Western Balkans', 'the Western Balkans', 'Albania', 'North\xa0Macedonia', 'Bosnia and Herzegovina', 'Bosnia and Herzegovina', 'Bosnia and Herzegovina', 'Bosnia and Herzegovina', 'Yugoslavia', 'Bosnia and Herzegovina', 'Bosnia and Herzegovina', 'Bosnia and Herzegovina', 'Europe']

Text (first 500 characters):
Madam President, dear Roberta, Minister, dear Hadja, h
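
The list above contains surface variants of the same place: `'the United States'` and `'The United States'`, and names with a non-breaking space (`\xa0`) such as `'the United\xa0Arab Emirates'`. Counted as raw strings, these are separate entries. A small normalization step before counting could merge them. The rules below are illustrative assumptions, not part of the original pipeline, and the helper name `normalize_location` is hypothetical. Note that stripping a leading "the" would also shorten names where the article belongs to the name (for example "The Hague").

```python
import re
from collections import Counter


def normalize_location(name):
    """Normalize surface variants of a location string (illustrative rules)."""
    name = name.replace("\xa0", " ")              # non-breaking space -> regular space
    name = re.sub(r"\s+", " ", name).strip()      # collapse repeated whitespace
    name = re.sub(r"^the\s+", "", name, flags=re.IGNORECASE)  # drop a leading article
    return name


# Strings taken from the output above
raw_names = [
    "the United\xa0Arab Emirates",
    "The United\xa0Arab Emirates",
    "the United States",
    "The United States",
    "Gaza",
]

normalized_counts = Counter(normalize_location(n) for n in raw_names)
print(normalized_counts)
```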

In [17]:
# Show one speech (selected by index) with detailed entity information
# ADJUSTMENT: Change this number to see different speeches
EXAMPLE_INDEX = 20

example_speech = df.iloc[EXAMPLE_INDEX]

print("="*80)
print(f"DETAILED ANALYSIS: SPEECH #{EXAMPLE_INDEX}")
print("="*80)
print(f"Speaker: {example_speech['speaker']}")
print(f"Party: {example_speech['party']}")
print(f"Date: {example_speech['date']}")
print(f"\nText (first 400 characters):")
print(example_speech['text'][:400] + "...")
print(f"\n{'─'*80}")
print(f"\nLocations found by spaCy:")
for loc, typ in example_speech['ner_spacy_detailed']:
    print(f"  • {loc:30s} [{typ}]")
print("="*80)

DETAILED ANALYSIS: SPEECH #20
Speaker: Mick Wallace
Party: The Left
Date: 2024-04-25

Text (first 400 characters):
Madam President, when we talk about EU enlargement, including Ukrainian accession, the people of Europe and especially Ireland should be aware of exactly what it will mean. The cost of Ukrainian accession will be monumental. It will cost trillions to rebuild Ukraine, not the EUR 50 billion. That’s before we talk about the actual cost of enlargement. Ukraine and Moldova are three times poorer than ...

────────────────────────────────────────────────────────────────────────────────

Locations found by spaCy:
  • Europe                         [LOC]
  • Ireland                        [GPE]
  • Ukraine                        [GPE]
  • Ukraine                        [GPE]
  • Moldova                        [GPE]
  • Bulgaria                       [GPE]
  • Ireland                        [GPE]
  • Ireland                        [GPE]


---

# Part 4: Error Analysis

No model is perfect. Let's examine potential errors and limitations.

## Common Error Patterns

Let's identify common types of errors:

In [18]:
print("Common types of NER errors:\n")

print("1. AMBIGUOUS WORDS:")
print("   Words that can be locations OR other things:")
ambiguous_examples = ['Union', 'Parliament', 'Council', 'Commission', 'Court']
for word in ambiguous_examples:
    if word in location_counts_spacy:
        print(f"     - '{word}': {location_counts_spacy[word]} mentions")
        print(f"       (Could be: location, organization, or institution)")

print("\n2. PARTIAL NAMES:")
print("   Sometimes only part of a location name is extracted:")
print("     - 'North' instead of 'North Korea'")
print("     - 'United' instead of 'United States'")

print("\n3. ADJECTIVES AS LOCATIONS:")
print("   Adjectives derived from place names:")
print("     - 'European' (adjective) vs 'Europe' (location)")
print("     - 'American' (adjective) vs 'America' (location)")

print("\nThese errors occur because the model:")
print("   - Relies on statistical patterns, not perfect understanding")
print("   - Trained on general text, not specifically on political speeches")
print("   - Cannot always distinguish between different meanings")

Common types of NER errors:

1. AMBIGUOUS WORDS:
   Words that can be locations OR other things:
     - 'Council': 1 mentions
       (Could be: location, organization, or institution)

2. PARTIAL NAMES:
   Sometimes only part of a location name is extracted:
     - 'North' instead of 'North Korea'
     - 'United' instead of 'United States'

3. ADJECTIVES AS LOCATIONS:
   Adjectives derived from place names:
     - 'European' (adjective) vs 'Europe' (location)
     - 'American' (adjective) vs 'America' (location)

These errors occur because the model:
   - Relies on statistical patterns, not perfect understanding
   - Trained on general text, not specifically on political speeches
   - Cannot always distinguish between different meanings
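
The results earlier in this notebook already contain concrete candidates for review: the top-20 list includes `'EUR'`, `'Leyen'`, and `'the Member States'`, and the speech with the most mentions includes `'Strip'`. A quick way to judge such cases is to look at the surrounding text. The sketch below is a plain keyword-in-context helper; the name `keyword_in_context` and its parameters are assumptions for illustration, and the second example sentence is invented.

```python
import re


def keyword_in_context(texts, term, window=40, max_hits=5):
    """Return up to max_hits snippets showing `term` with surrounding characters."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    snippets = []
    for text in texts:
        for match in pattern.finditer(text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            snippets.append(text[start:end].replace("\n", " "))
            if len(snippets) >= max_hits:
                return snippets
    return snippets


example_texts = [
    "It will cost trillions to rebuild Ukraine, not the EUR 50 billion.",
    "We thank President von der Leyen for her statement.",  # invented sentence
]

for term in ["EUR", "Leyen"]:
    for snippet in keyword_in_context(example_texts, term, window=20):
        print(f"{term}: ...{snippet}...")

# Possible usage on the dataset (not executed here):
# keyword_in_context(df['text'], 'Leyen')
```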


---

# Part 5: Comparison with Lexicon-based Approach

Let's compare spaCy with a simple lexicon-based approach.

## 5.1 Create Simple Lexicon

In [21]:
# Create a simple location lexicon
european_locations = {
    # EU countries (the Czech Republic is not included in this list)
    'Germany', 'France', 'Italy', 'Spain', 'Poland', 'Romania',
    'Netherlands', 'Belgium', 'Greece', 'Portugal', 'Sweden',
    'Hungary', 'Austria', 'Bulgaria', 'Denmark', 'Finland',
    'Slovakia', 'Ireland', 'Croatia', 'Slovenia', 'Lithuania',
    'Latvia', 'Estonia', 'Cyprus', 'Malta', 'Luxembourg',
    
    # Major capitals
    'Paris', 'Berlin', 'Rome', 'Madrid', 'Warsaw', 'Brussels',
    'Vienna', 'Athens', 'Lisbon', 'Budapest', 'Prague', 'Stockholm',
    'Amsterdam', 'Copenhagen', 'Dublin', 'Helsinki', 'Bucharest',
    
    # Other important cities
    'Barcelona', 'Milan', 'Munich', 'Hamburg', 'Lyon', 'Marseille',
    'Manchester', 'Birmingham', 'Glasgow', 'Edinburgh',
    
    # Regions
    'Europe', 'Catalonia', 'Bavaria', 'Andalusia', 'Tuscany',
    'Flanders', 'Scotland', 'Wales',
    
    # Non-EU but commonly mentioned
    'UK', 'United Kingdom', 'Britain', 'England', 'London',
    'Ukraine', 'Russia', 'Moscow', 'Kyiv', 'Switzerland',
    'Norway', 'Turkey', 'USA', 'United States', 'America', 'China'
}

locations_lower = {loc.lower(): loc for loc in european_locations}

print(f"✓ Created lexicon with {len(european_locations)} locations")
print(f"\nExamples: {list(european_locations)[:10]}")

✓ Created lexicon with 77 locations

Examples: ['Germany', 'Belgium', 'Bavaria', 'United Kingdom', 'Dublin', 'Hamburg', 'London', 'Latvia', 'Amsterdam', 'Britain']


In [22]:
# Lexicon-based extraction function
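def find_locations_lexicon(text, location_dict):
    """
    Simple lexicon-based location extraction.

    Limitation: the text is split on whitespace, so multi-word entries
    such as 'United Kingdom' and possessive forms such as "Ukraine's"
    are never matched.
    """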
    found_locations = []
    words = text.split()
    
    for word in words:
        word_clean = word.strip('.,;:!?()[]"\'')
        if word_clean.lower() in location_dict:
            found_locations.append(location_dict[word_clean.lower()])
    
    return found_locations

# Apply lexicon approach
print("Applying lexicon-based approach...")
df['ner_lexicon'] = df['text'].apply(
    lambda text: find_locations_lexicon(text, locations_lower)
)
df['ner_lexicon_count'] = df['ner_lexicon'].apply(len)

print("✓ Lexicon-based extraction completed!")

Applying lexicon-based approach...
✓ Lexicon-based extraction completed!
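
Because `find_locations_lexicon` looks up one whitespace-separated word at a time, the multi-word lexicon entries (`'United Kingdom'`, `'United States'`) cannot match, and a word like `Ukraine's` is missed because `strip` only removes characters at the ends. A regex-based matcher is one possible alternative. The sketch below is an assumption-laden variant, not the function used for the results in this notebook; the names `build_lexicon_pattern` and `find_locations_lexicon_regex` are hypothetical. It keeps the original's case-insensitive matching.

```python
import re


def build_lexicon_pattern(locations):
    """Compile one regex that matches any lexicon entry as a whole word or phrase."""
    # Longest entries first, so a longer name wins over a shorter one it contains
    ordered = sorted(locations, key=len, reverse=True)
    # Allow any whitespace (including non-breaking spaces) between the words of an entry
    alternatives = [
        r"\s+".join(re.escape(word) for word in loc.split()) for loc in ordered
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", flags=re.IGNORECASE)


def find_locations_lexicon_regex(text, pattern, canonical):
    """Return the canonical lexicon name for every match in the text."""
    found = []
    for match in pattern.finditer(text):
        key = " ".join(match.group(0).lower().split())  # normalize case and whitespace
        found.append(canonical[key])
    return found


# Small illustrative lexicon (a subset of the one defined above)
lexicon = {"Ukraine", "United Kingdom", "United States", "Brussels"}
canonical = {loc.lower(): loc for loc in lexicon}
pattern = build_lexicon_pattern(lexicon)

text = "Ukraine's allies in the United Kingdom and the United\xa0States met in Brussels."
matches = find_locations_lexicon_regex(text, pattern, canonical)
print(matches)
```

On this sentence the whitespace-splitting version would return only `Brussels`, so switching matchers would change the lexicon counts reported in the comparison.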


## 5.2 Direct Comparison

In [23]:
# Collect all locations from both methods
all_locations_lexicon = [loc for locs in df['ner_lexicon'] for loc in locs]

# Calculate overlaps
set_spacy = set(all_locations_spacy)
set_lexicon = set(all_locations_lexicon)

only_spacy = set_spacy - set_lexicon
only_lexicon = set_lexicon - set_spacy
both = set_spacy & set_lexicon

print("="*80)
print("COMPARISON: spaCy vs Lexicon")
print("="*80)

print(f"\nQuantitative Comparison:")
print(f"  Lexicon approach:")
print(f"    - Average locations per speech: {df['ner_lexicon_count'].mean():.2f}")
print(f"    - Total unique locations: {len(set_lexicon)}")
print(f"    - Total mentions: {len(all_locations_lexicon)}")

print(f"\n  spaCy approach:")
print(f"    - Average locations per speech: {df['ner_spacy_count'].mean():.2f}")
print(f"    - Total unique locations: {len(set_spacy)}")
print(f"    - Total mentions: {len(all_locations_spacy)}")

print(f"\nOverlap Analysis:")
print(f"  Locations found by both methods: {len(both)}")
print(f"  Only found by spaCy: {len(only_spacy)}")
print(f"  Only found by Lexicon: {len(only_lexicon)}")

print(f"\nOverlap percentage: {len(both) / len(set_spacy | set_lexicon) * 100:.1f}%")

COMPARISON: spaCy vs Lexicon

Quantitative Comparison:
  Lexicon approach:
    - Average locations per speech: 1.88
    - Total unique locations: 63
    - Total mentions: 3445

  spaCy approach:
    - Average locations per speech: 3.69
    - Total unique locations: 580
    - Total mentions: 6743

Overlap Analysis:
  Locations found by both methods: 61
  Only found by spaCy: 519
  Only found by Lexicon: 2

Overlap percentage: 10.5%
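
The overlap percentage is the Jaccard index: the size of the intersection divided by the size of the union of the two sets of unique location strings. With the counts above, that is 61 / (61 + 519 + 2) = 61 / 582 ≈ 10.5%. The sketch below reproduces the arithmetic with placeholder sets of the same sizes; the integers stand in for location names and the function name `jaccard` is chosen here for illustration.

```python
def jaccard(set_a, set_b):
    """Jaccard index: |A ∩ B| / |A ∪ B| (0.0 if both sets are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Placeholder sets with the same sizes as in the comparison above
shared = set(range(61))                        # found by both methods
spacy_set = shared | set(range(100, 619))      # plus 519 found only by spaCy
lexicon_set = shared | {1000, 1001}            # plus 2 found only by the lexicon

overlap_pct = jaccard(spacy_set, lexicon_set) * 100
print(f"Overlap percentage: {overlap_pct:.1f}%")
```

The comparison uses exact string matches, so variants such as `'the United States'` and `'United States'` count as different locations and lower the overlap.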


In [24]:
# What did spaCy find that the lexicon didn't?
print("Locations found ONLY by spaCy (sample of 30):\n")
sample_spacy_unique = list(only_spacy)[:30]
for i, loc in enumerate(sorted(sample_spacy_unique), 1):
    count = location_counts_spacy[loc]
    print(f"  {i:2d}. {loc:30s} ({count} mentions)")

print(f"\nThese might be:")
print(f"   ✓ Real locations not in our lexicon (shows model's generalization)")
print(f"   ✗ False positives (shows model's errors)")

Locations found ONLY by spaCy (sample of 30):

   1. Alabama                        (1 mentions)
   2. Americas                       (2 mentions)
   3. Bangladesh                     (1 mentions)
   4. Bosnia-Herzegovina             (2 mentions)
   5. Christ                         (1 mentions)
   6. ENVI                           (1 mentions)
   7. Horizon Europe                 (21 mentions)
   8. Ilan                           (1 mentions)
   9. Iohannis                       (2 mentions)
  10. Iran                           (111 mentions)
  11. Lebanon                        (8 mentions)
  12. Moldova                        (19 mentions)
  13. Mosul                          (4 mentions)
  14. North Macedonia                (3 mentions)
  15. Pirates                        (1 mentions)
  16. Sie                            (1 mentions)
  17. Svenja                         (1 mentions)
  18. Tibet                          (2 mentions)
  19. Uganda                         (3 mentions)

In [25]:
# What did the lexicon find that spaCy didn't?
location_counts_lexicon = Counter(all_locations_lexicon)

print("Locations found ONLY by Lexicon:\n")
for i, loc in enumerate(sorted(only_lexicon), 1):
    count = location_counts_lexicon[loc]
    print(f"  {i:2d}. {loc:30s} ({count} mentions)")

print(f"\nThese show:")
print(f"   - Locations the model missed (false negatives)")
print(f"   - Places where lexicon's simplicity is an advantage")

Locations found ONLY by Lexicon:

   1. Barcelona                      (1 mentions)
   2. England                        (1 mentions)

These show:
   - Locations the model missed (false negatives)
   - Places where lexicon's simplicity is an advantage


## 5.3 Example-by-Example Comparison

In [None]:
# Compare on a specific speech
# ADJUSTMENT: Change this number to see different speeches
COMPARE_INDEX = 5

compare_speech = df.iloc[COMPARE_INDEX]

print("="*80)
print(f"SIDE-BY-SIDE COMPARISON: SPEECH #{COMPARE_INDEX}")
print("="*80)
print(f"Speaker: {compare_speech['speaker']}")
print(f"Party: {compare_speech['party']}")
print(f"\nText (first 500 characters):")
print(compare_speech['text'][:500] + "...")

print(f"\n{'─'*80}")
print(f"\nLexicon approach found {compare_speech['ner_lexicon_count']} locations:")
print(f"   {compare_speech['ner_lexicon']}")

print(f"\nspaCy approach found {compare_speech['ner_spacy_count']} locations:")
print(f"   {compare_speech['ner_spacy']}")

# Find differences
set_lex = set(compare_speech['ner_lexicon'])
set_spa = set(compare_speech['ner_spacy'])

print(f"\nDifferences:")
print(f"   Found by both: {set_lex & set_spa}")
print(f"   Only by Lexicon: {set_lex - set_spa}")
print(f"   Only by spaCy: {set_spa - set_lex}")
print("="*80)

---

# Part 6: Conclusions and Reflection

## Key Findings

### Advantages of Model-based NER (spaCy):

- ✅ **Generalization**: Finds locations not in any predefined list
- ✅ **Context awareness**: Distinguishes "Paris Hilton" from "Paris, France"
- ✅ **No manual curation**: No need to maintain location lists
- ✅ **Multi-word entities**: Handles "United Kingdom", "New York" correctly
- ✅ **Multiple entity types**: Can extract persons, organizations, dates simultaneously

### Disadvantages of Model-based NER:

- ❌ **False positives**: May identify non-locations as locations
- ❌ **Opaque**: Hard to understand why specific decisions were made
- ❌ **Computationally expensive**: Slower than lexicon lookup
- ❌ **Domain sensitivity**: Trained on general text, may miss domain-specific terms
- ❌ **No confidence scores by default**: The standard entity output carries no per-entity probability, so uncertain predictions are hard to filter

### When to Use Model-based NER:

- ✓ When you need to find entities you haven't anticipated
- ✓ When context is important for disambiguation
- ✓ When you have diverse, natural language text
- ✓ When you can tolerate some errors
- ✓ When you have computational resources

### When to Use Lexicon-based NER:

- ✓ When you have a well-defined, limited domain
- ✓ When transparency is critical
- ✓ When speed is essential
- ✓ When you need very high precision on known, unambiguous terms
- ✓ When computational resources are limited

## Best Practice: Hybrid Approach

In practice, the best solution is often a **hybrid approach**:

1. Use spaCy to find candidate locations
2. Filter results using a lexicon (keep only known locations)
3. Review rare/uncertain entities manually
4. Continuously update the lexicon based on findings

This combines the **recall** (finding many entities) of models with the **precision** (avoiding errors) of lexicons.
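
Steps 1 to 3 can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a definitive implementation: `hybrid_filter` is a hypothetical helper, the candidate tuples and their labels are written by hand rather than taken from model output, and the normalization (whitespace cleanup, dropping a leading "the") is one possible choice.

```python
def hybrid_filter(candidates, known_locations):
    """Split model candidates into lexicon-confirmed hits and items for manual review."""
    known_lower = {loc.lower() for loc in known_locations}
    accepted, review = [], []
    for text, label in candidates:
        cleaned = " ".join(text.split())      # normalizes whitespace, incl. non-breaking spaces
        if cleaned.lower().startswith("the "):
            cleaned = cleaned[4:]             # drop a leading article
        if cleaned.lower() in known_lower:
            accepted.append((cleaned, label))
        else:
            review.append((cleaned, label))
    return accepted, review


# Hand-written candidates for illustration (labels are not actual model output)
candidates = [
    ("Ukraine", "GPE"),
    ("Leyen", "GPE"),
    ("the United\xa0States", "GPE"),
    ("Moldova", "GPE"),
]
known = {"Ukraine", "United States"}

accepted, review = hybrid_filter(candidates, known)
print("Accepted:", accepted)
print("Needs review:", review)
```

In this example the review list holds both a false positive (`Leyen`) and a real location missing from the lexicon (`Moldova`), which is where steps 3 and 4 come in: reject the former, add the latter to the lexicon.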

## Discussion Questions

1. **Precision vs. Recall**: Would you prefer a method that finds 90% of locations (recall) with 80% precision, or one that finds 60% of locations with 95% precision?

2. **Domain Adaptation**: How could we improve spaCy's performance on EU Parliament speeches specifically?

3. **Error Impact**: In what applications would false positives be worse than false negatives, and vice versa?

4. **Explainability**: How important is it to know WHY a model made a specific decision?

5. **Future Directions**: How might large language models (like GPT) change NER in the future?