# Analysis of EU Parliament Speeches: Comparing Lexicon-based and Model-based Approaches

In this notebook, we analyze speeches from the European Parliament and compare two fundamentally different approaches to text analysis:

1. **Lexicon-based approaches**: Work with predefined dictionaries and rules
2. **Model-based approaches**: Use machine learning and pre-trained AI models

We will apply both approaches to two tasks:
- **Named Entity Recognition (NER)**: Finding location names in texts
- **Sentiment Analysis**: Determining the emotional tone (positive/negative/neutral)

## Learning Objectives

After completing this notebook, you will be able to:
- Understand the differences between rule-based and ML-based approaches
- Apply both methods practically
- Assess the advantages and disadvantages of both approaches
- Compare and interpret results

## Structure

1. Setup and Data Loading
2. **Lexicon-based Approaches**
   - Named Entity Recognition with location dictionary
   - Sentiment Analysis with VADER lexicon
3. **Model-based Approaches**
   - Named Entity Recognition with spaCy
4. Comparison and Reflection

## 0. Setup and Installation

First, we will install all required packages. You only need to run this once.

In [16]:
# Install packages (only needed the first time)
!pip install tqdm pandas



In [17]:
# Import libraries
import pandas as pd
import re
import random
from tqdm import tqdm
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

print("✓ All libraries successfully loaded!")

✓ All libraries successfully loaded!


## 1. Load Data

We load the EU Parliament speeches from the CSV file.

In [18]:
# Load CSV file
df = pd.read_csv('eu_speeches_2024_english.csv')

print(f"Number of speeches: {len(df)}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst rows:")
df.head()

Number of speeches: 1828

Columns: ['speaker', 'text', 'party', 'date', 'agenda', 'speechnumber', 'procedure_ID', 'partyfacts_ID', 'period', 'chair', 'MEP', 'commission', 'written', 'multispeaker', 'link', 'translationInSpeech', 'translatedText']

First rows:


| | speaker | text | party | date | agenda | speechnumber | procedure_ID | partyfacts_ID | period | chair | MEP | commission | written | multispeaker | link | translationInSpeech | translatedText |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | President | I wish you an excellent good morning on this l... | - | 2024-04-25 | 2. Interinstitutional Body for Ethical Standar... | 1 | | | 9 | True | False | False | False | False | https://www.europarl.europa.eu/doceo/document/... | | |
| 1 | President | Dear colleagues, today marks the 50th annivers... | - | 2024-04-25 | 6. Statements by the President | 1 | | | 9 | True | False | False | False | False | https://www.europarl.europa.eu/doceo/document/... | | |
| 2 | Virginijus Sinkevičius | Mr President, honourable Members, dear rapport... | - | 2024-04-25 | 4. Framework of measures for strengthening Eur... | 7 | bill_165_ID bill_195_ID bill_165_ID bill_195_ID | | 9 | False | False | True | False | False | https://www.europarl.europa.eu/doceo/document/... | | |
| 3 | Anna Deparnay-Grunenberg | Mr President, I came to this House to be a voi... | Greens/EFA | 2024-04-25 | 4. Framework of measures for strengthening Eur... | 9 | bill_165_ID bill_195_ID bill_165_ID bill_195_ID | 6403.0 | 9 | False | True | False | False | False | https://www.europarl.europa.eu/doceo/document/... | | |
| 4 | Seán Kelly | A Uachtaráin, teastaíonn uaim mo thacaíocht io... | PPE | 2024-04-25 | 4. Framework of measures for strengthening Eur... | 24 | bill_165_ID bill_195_ID bill_165_ID bill_195_ID | 6398.0 | 9 | False | True | False | False | False | https://www.europarl.europa.eu/doceo/document/... | | |


In [19]:
# Brief overview of the data
print("Distribution by party:")
print(df['party'].value_counts().head(10))

print(f"\nDate range: {df['date'].min()} to {df['date'].max()}")
print(f"Average length of a speech: {df['text'].str.len().mean():.0f} characters")

Distribution by party:
party
-             665
PPE           290
The Left      223
Renew         202
Greens/EFA    193
S&D           152
ECR            68
NI             26
ID              9
Name: count, dtype: int64

Date range: 2024-01-15 to 2024-04-25
Average length of a speech: 1530 characters


---

# Part 2: Lexicon-based Approaches

In this section, we use **predefined dictionaries and word lists** to analyze texts. These approaches are:
- ✓ **Transparent**: Every decision is traceable
- ✓ **Fast**: Very quick processing
- ✓ **Controllable**: You can modify the lexicons
- ✗ **Limited**: Only finds what's in the dictionary
- ✗ **No context understanding**: A plain word lookup cannot interpret negations or irony; handling them requires extra rules

We will implement two lexicon-based approaches:
1. **NER**: Finding place names using a location dictionary
2. **Sentiment Analysis**: Analyzing emotional tone using the VADER lexicon

## 2.1 Lexicon-based NER: Finding Place Names

### How it works:

1. We create a list of European locations (cities, countries, regions)
2. We split each text into words
3. For each word, we check: Is it in our location dictionary?
4. If yes, we mark it as a location

This is the **simplest form of Named Entity Recognition**!

### Step 1: Create the location lexicon

In [20]:
# Base lexicon: European cities, countries, and regions
# ADJUSTMENT: You can add or remove locations here!

european_locations = {
    # Major capitals
    'Paris', 'Berlin', 'Rome', 'Madrid', 'Warsaw', 'Brussels',
    'Vienna', 'Athens', 'Lisbon', 'Budapest', 'Prague', 'Stockholm',
    'Amsterdam', 'Copenhagen', 'Dublin', 'Helsinki', 'Bucharest',
    'Sofia', 'Zagreb', 'Ljubljana', 'Bratislava', 'Tallinn', 'Riga',
    'Vilnius', 'Valletta', 'Luxembourg', 'Nicosia',
    
    # Other major cities
    'Barcelona', 'Milan', 'Munich', 'Hamburg', 'Lyon', 'Marseille',
    'Krakow', 'Gdansk', 'Porto', 'Valencia', 'Turin', 'Naples',
    'Rotterdam', 'Frankfurt', 'Cologne', 'Stuttgart', 'Seville',
    'Bilbao', 'Malaga', 'Manchester', 'Birmingham', 'Leeds', 'Liverpool',
    'Glasgow', 'Edinburgh', 'Belfast', 'Cardiff', 'Bristol',
    
    # EU Countries
    'Germany', 'France', 'Italy', 'Spain', 'Poland', 'Romania',
    'Netherlands', 'Belgium', 'Greece', 'Portugal', 'Sweden',
    'Hungary', 'Austria', 'Bulgaria', 'Denmark', 'Finland',
    'Slovakia', 'Ireland', 'Croatia', 'Slovenia', 'Lithuania',
    'Latvia', 'Estonia', 'Cyprus', 'Malta', 'Luxembourg',
    
    # Regions
    'Catalonia', 'Bavaria', 'Andalusia', 'Tuscany', 'Brittany',
    'Flanders', 'Scotland', 'Wales', 'Corsica', 'Sicily',
    'Lombardy', 'Yorkshire', 'Lancashire', 'Saxony', 'Brandenburg',
    
    # Non-EU but commonly mentioned
    'London', 'UK', 'United Kingdom', 'Britain', 'England',
    'Ukraine', 'Russia', 'Switzerland', 'Norway', 'Turkey'
}

# For searching: map each lowercase spelling to its original spelling (case-insensitive lookup)
locations_lower = {loc.lower(): loc for loc in european_locations}

print(f"Location lexicon created with {len(european_locations)} entries")
print(f"\nExamples: {list(european_locations)[:10]}")

Location lexicon created with 105 entries

Examples: ['Yorkshire', 'Leeds', 'Turin', 'Malaga', 'Rome', 'Austria', 'Lithuania', 'Copenhagen', 'Norway', 'France']


### Step 2: Implement the search function

Now we program the function that searches for locations in a text.

In [21]:
def find_locations_lexicon(text, location_dict):
    """
    Finds locations in a text based on a lexicon.
    
    How it works (very simple):
    1. Split the text into individual words (at spaces)
    2. For each word: Check if it's in our location lexicon
    3. If yes: Save it
    
    This is the simplest form of Named Entity Recognition!
    
    Parameters:
        text: The text to analyze
        location_dict: Dictionary with locations (lowercase -> Original)
    
    Return:
        List of found locations
    """
    found_locations = []
    
    # Split text into words (at whitespace).
    # Note: multi-word entries such as 'United Kingdom' can therefore never match.
    words = text.split()
    
    # Go through each word
    for word in words:
        # Remove punctuation (e.g., "Paris," -> "Paris")
        word_clean = word.strip('.,;:!?()[]"\'')
        
        # Check if the word (in lowercase) is in our lexicon
        if word_clean.lower() in location_dict:
            # Add the original spelling
            found_locations.append(location_dict[word_clean.lower()])
    
    return found_locations

# Test with example texts
test_texts = [
    "In Paris and Berlin, we are discussing the future of Europe with partners from Madrid and Rome.",
    "Poland, Hungary, and the Czech Republic have different views than France and Germany.",
    "The crisis affects Greece, Italy, Spain, and Portugal, but also Ireland and Cyprus."
]

print("Function tests:\n")
for i, test_text in enumerate(test_texts, 1):
    test_result = find_locations_lexicon(test_text, locations_lower)
    print(f"Test {i}:")
    print(f"  Text: {test_text}")
    print(f"  Found locations: {test_result}")
    print(f"  Count: {len(test_result)}\n")

print("Observation:")
print("The function only finds locations that")
print("  1. Are in the lexicon AND")
print("  2. Appear exactly like that in the text")
print("\nExperiment: Add a new location to the lexicon above")
print("   and run both cells again. Will it be found now?")

Function tests:

Test 1:
  Text: In Paris and Berlin, we are discussing the future of Europe with partners from Madrid and Rome.
  Found locations: ['Paris', 'Berlin', 'Madrid', 'Rome']
  Count: 4

Test 2:
  Text: Poland, Hungary, and the Czech Republic have different views than France and Germany.
  Found locations: ['Poland', 'Hungary', 'France', 'Germany']
  Count: 4

Test 3:
  Text: The crisis affects Greece, Italy, Spain, and Portugal, but also Ireland and Cyprus.
  Found locations: ['Greece', 'Italy', 'Spain', 'Portugal', 'Ireland', 'Cyprus']
  Count: 6

Observation:
The function only finds locations that
  1. Are in the lexicon AND
  2. Appear exactly like that in the text

Experiment: Add a new location to the lexicon above
   and run both cells again. Will it be found now?


### Step 3: Apply to all speeches

In [22]:
# Apply the function to all speeches
print("Analyzing speeches with lexicon-based NER...")

df['ner_lexicon'] = df['text'].apply(
    lambda text: find_locations_lexicon(text, locations_lower)
)

# Number of found locations per speech
df['ner_lexicon_count'] = df['ner_lexicon'].apply(len)

print(f"✓ Analysis completed!")
print(f"\nStatistics:")
print(f"- Speeches with at least one location: {(df['ner_lexicon_count'] > 0).sum()} of {len(df)}")
print(f"- Average {df['ner_lexicon_count'].mean():.2f} locations per speech")
print(f"- Maximum: {df['ner_lexicon_count'].max()} locations in one speech")

Analyzing speeches with lexicon-based NER...
✓ Analysis completed!

Statistics:
- Speeches with at least one location: 693 of 1828
- Average 1.21 locations per speech
- Maximum: 33 locations in one speech


In [23]:
# Which locations are mentioned most frequently?
all_locations_lexicon = [loc for locs in df['ner_lexicon'] for loc in locs]
location_counts_lexicon = Counter(all_locations_lexicon)

print("Top 15 most mentioned locations (lexicon method):")
for location, count in location_counts_lexicon.most_common(15):
    print(f"  {location}: {count}x")

Top 15 most mentioned locations (lexicon method):
  Ukraine: 747x
  Russia: 394x
  Ireland: 169x
  Hungary: 99x
  Finland: 58x
  Germany: 55x
  Greece: 54x
  Brussels: 53x
  Romania: 52x
  Poland: 39x
  Spain: 37x
  France: 37x
  Italy: 32x
  Sweden: 29x
  Slovakia: 28x


### Reflection: Lexicon-based NER

**What did we observe?**
- The method only finds locations that are in our lexicon
- The results are 100% traceable
- Smaller cities or regions are not found (unless we add them)

**Advantages:**
- Very transparent and understandable
- Fast processing
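- Few false positives, as long as lexicon entries are unambiguous (names such as "Sofia" or "Turkey" can also refer to a person or a bird, and the lookup ignores capitalization)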
- Easy to customize

**Disadvantages:**
- Only finds known locations
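- Cannot handle variations (e.g., possessive forms such as "Ukraine's", or multi-word names such as "United Kingdom", which the word-by-word lookup never matches)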
- Requires manual maintenance
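
One way to address the multi-word limitation is to search for lexicon entries with a single regular expression instead of splitting the text into words. The following is a minimal sketch, not part of the original analysis: the function names `build_location_pattern` and `find_locations_regex` and the mini-lexicon are hypothetical, and unlike `find_locations_lexicon` the matching here is case-sensitive.

```python
import re

def build_location_pattern(locations):
    """Compile one regex that matches any lexicon entry as a whole word or phrase."""
    # Longest entries first, so a longer name wins over a shorter overlapping one
    ordered = sorted(locations, key=len, reverse=True)
    alternatives = "|".join(re.escape(loc) for loc in ordered)
    # \b anchors keep "Rome" from matching inside "Romeo"
    return re.compile(r"\b(?:" + alternatives + r")\b")

def find_locations_regex(text, pattern):
    """Return every lexicon entry found in the text, in order of appearance."""
    return pattern.findall(text)

# Hypothetical mini-lexicon for illustration only
sample_locations = {"United Kingdom", "Czech Republic", "Poland", "Rome"}
sample_pattern = build_location_pattern(sample_locations)

sample_text = "Poland, the Czech Republic and the United Kingdom met Romeo in Rome."
print(find_locations_regex(sample_text, sample_pattern))
```

Because the pattern is case-sensitive, a lowercase word such as "turkey" would not be matched against a lexicon entry "Turkey". The trade-off is that locations written in all caps or all lowercase are missed.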

---

## 2.2 Lexicon-based Sentiment Analysis with VADER

### How it works:

**VADER** (Valence Aware Dictionary and sEntiment Reasoner) is an English, lexicon- and rule-based sentiment analysis tool that was originally designed for social media text. Its lexicon contains approximately 7,500 entries (words, emoticons, and abbreviations), each with a sentiment score between -4 and +4.

**The algorithm:**
1. Split text into words
2. For each word: Look up the score in the lexicon
   - Positive words have scores > 0 (e.g., "good" = +1.9)
   - Negative words have scores < 0 (e.g., "bad" = -2.5)
3. Apply rules for:
   - Punctuation ("!!!" increases intensity)
   - Capitalization ("GREAT" is stronger than "great")
   - Negations ("not good" is negative)
4. Calculate compound score (-1 to +1)
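
The four steps above can be sketched in a few lines. This is a heavily simplified illustration and not the VADER library itself: the toy lexicon, the word lists, the function name `rule_based_score`, and the numeric constants are assumptions modeled on the reference implementation, and the real library applies more rules (for example, distance-dependent intensifiers and special handling of "but").

```python
import math
import re

# Toy lexicon for illustration; scores use the -4..+4 scale of the VADER lexicon
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5}
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very": 0.293, "extremely": 0.293}  # assumed intensifier increments
NEGATION_SCALAR = -0.74   # assumed: flips and dampens a negated score
CAPS_INCREMENT = 0.733    # assumed: bonus for ALL-CAPS words
EXCLAMATION_INCREMENT = 0.292  # assumed: bonus per "!" (at most 4 counted)
ALPHA = 15                # assumed normalization constant for the compound score

def rule_based_score(text):
    """Return a compound score in (-1, +1) for the given text."""
    tokens = re.findall(r"[A-Za-z]+|!", text)
    total = 0.0
    for i, token in enumerate(tokens):
        word = token.lower()
        if word not in TOY_LEXICON:
            continue
        score = TOY_LEXICON[word]
        # Capitalization rule: an ALL-CAPS word counts as more intense
        if token.isupper() and len(token) > 1:
            score += math.copysign(CAPS_INCREMENT, score)
        # Inspect up to three preceding tokens for boosters and negations
        window = [t.lower() for t in tokens[max(0, i - 3):i]]
        for prev in window:
            if prev in BOOSTERS:
                score += math.copysign(BOOSTERS[prev], score)
        if any(prev in NEGATIONS for prev in window):
            score *= NEGATION_SCALAR
        total += score
    # Punctuation rule: exclamation marks push the total further from zero
    exclamations = min(tokens.count("!"), 4)
    if total != 0:
        total += math.copysign(exclamations * EXCLAMATION_INCREMENT, total)
    # Squash the unbounded sum into the open interval (-1, +1)
    return total / math.sqrt(total * total + ALPHA)

for example in ["good", "not good", "very good", "GREAT", "good!!!"]:
    print(f"{example!r}: {rule_based_score(example):+.4f}")
```

The final division is what keeps the compound score between -1 and +1 regardless of how many sentiment words a text contains.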

### Why VADER?

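- Designed for social media text, but widely used as a general-purpose English sentiment lexicon
- Handles negations better than simple lexicons
- Understands intensifiers ("very good", "extremely bad")
- Serves as a simple, transparent baseline for EU Parliament debates, although it was not tuned for formal political language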

### Step 1: Load VADER

In [24]:
def load_vader_lexicon(lexicon_file):
    """
    Loads the VADER lexicon from a text file.
    
    File format:
    word    score
    e.g.: good    1.9
    
    Return:
        Dictionary: {word: score}
    """
    lexicon = {}
    
    try:
        with open(lexicon_file, 'r', encoding='utf-8') as f:
            for line in f:
                # Skip comments and empty lines
                if line.startswith('#') or not line.strip():
                    continue
                
                parts = line.strip().split('\t')
                if len(parts) == 2:
                    word = parts[0]
                    score = float(parts[1])
                    lexicon[word.lower()] = score
                    
    except FileNotFoundError:
        print(f"File not found: {lexicon_file}")
        print("Please run the extraction script first to create vader_lexicon.txt")
        return None
    
    return lexicon

# Load VADER lexicon
vader_lex = load_vader_lexicon('vader_lexicon.txt')

if vader_lex:
    print(f"✓ VADER lexicon successfully loaded!")
    print(f"Lexicon contains {len(vader_lex)} words\n")
    
    # Show examples
    print("Examples of positive words:")
    positive_words = {k: v for k, v in vader_lex.items() if v > 0}
    positive_items = list(positive_words.items())
    random.shuffle(positive_items)
    for word, score in positive_items[:5]:
        print(f"  {word}: {score:+.3f}")
    
    print("\nExamples of negative words:")
    negative_words = {k: v for k, v in vader_lex.items() if v < 0}
    negative_items = list(negative_words.items())
    random.shuffle(negative_items)
    for word, score in negative_items[:5]:
        print(f"  {word}: {score:+.3f}")
    
    print(f"\nStatistics:")
    print(f"  Positive words: {len(positive_words)}")
    print(f"  Negative words: {len(negative_words)}")
    print(f"  Total: {len(vader_lex)}")

✓ VADER lexicon successfully loaded!
Lexicon contains 7494 words

Examples of positive words:
  freewheel: +0.500
  vitalities: +1.200
  beneficial: +1.900
  hhok: +0.900
  (':: +2.300

Examples of negative words:
  stupids: -2.300
  bribe: -0.800
  coward: -2.000
  v.v: -2.900
  inferiority: -1.100

Statistics:
  Positive words: 3329
  Negative words: 4165
  Total: 7494
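
The extraction script mentioned in the error message is not shown in this notebook. A minimal sketch of what it might do is given below, under the assumption that the source file is tab-separated with the token in the first field and its mean sentiment rating in the second (followed by further fields that are dropped). The function name `extract_word_scores`, the sample lines, and the file paths in the comments are hypothetical.

```python
def extract_word_scores(lines):
    """
    Reduce lexicon lines to 'word<TAB>score'.

    Assumes each input line has at least two tab-separated fields:
    the token first and its mean sentiment rating second.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, score = parts[0], parts[1]
        try:
            float(score)
        except ValueError:
            continue  # skip lines whose second field is not a number
        yield f"{word}\t{score}"

# Made-up sample lines in the assumed multi-column layout
sample_lines = [
    "good\t1.9\t0.9\t[2, 2, 1]\n",
    "bad\t-2.5\t0.7\t[-3, -2, -2]\n",
    "\n",
]
converted = list(extract_word_scores(sample_lines))
print(converted)

# To write an actual file (both paths are placeholders):
# with open("vader_lexicon_full.txt", encoding="utf-8") as src, \
#      open("vader_lexicon.txt", "w", encoding="utf-8") as dst:
#     dst.writelines(row + "\n" for row in extract_word_scores(src))
```

The resulting two-column lines match the format that `load_vader_lexicon` expects.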


### Step 2: Implement sentiment analysis function

In [25]:
def analyze_sentiment_lexicon(text, lexicon):
    """
    Analyzes the sentiment of a text using a lexicon.
    
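    Simplified version: sums the scores of all matched words and divides
    by the total number of words in the text.
    (The VADER library additionally applies rules for punctuation, capitalization, negation, etc.)
    
    Algorithm:
    1. Split text into words
    2. For each word: Look up the score in the lexicon
    3. Divide the sum of the scores by the total word count and scale to -1..+1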
    
    Parameters:
        text: Text to analyze
        lexicon: VADER dictionary
    
    Return:
        (score, details)
    """
    # Split text into words (only letters)
    words = re.findall(r'\b[a-z]+\b', text.lower())
    
    # Collect scores
    scores = []
    sentiment_words = []
    
    for word in words:
        if word in lexicon:
            score = lexicon[word]
            scores.append(score)
            sentiment_words.append((word, score))
    
    # No sentiment words found: return a neutral score
    if not scores:
        return 0.0, {
            'total_words': len(words),
            'sentiment_words': 0,
            'top_positive': [],
            'top_negative': []
        }
    
    # Divide by ALL words (not only the matched ones), so speeches with few
    # sentiment words score close to 0; then scale from -4..+4 to -1..+1
    avg_score = sum(scores) / len(words)
    normalized_score = avg_score / 4.0
    
    details = {
        'total_words': len(words),
        'sentiment_words': len(scores),
        'top_positive': sorted([w for w in sentiment_words if w[1] > 0],
                               key=lambda x: x[1], reverse=True)[:3],
        'top_negative': sorted([w for w in sentiment_words if w[1] < 0],
                               key=lambda x: x[1])[:3]
    }
    
    return normalized_score, details
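
The choice of denominator strongly affects the scale of the result. The sketch below compares dividing by all words (as `analyze_sentiment_lexicon` does) with dividing only by the matched sentiment words. The toy lexicon, the example sentences, and the helper `score_two_ways` are hypothetical and serve only to illustrate the effect.

```python
import re

# Toy lexicon on the VADER scale (-4..+4); values are illustrative
toy_lexicon = {"good": 1.9, "bad": -2.5}

def score_two_ways(text, lexicon):
    """Return (per_all_words, per_sentiment_words), both scaled to -1..+1."""
    words = re.findall(r"\b[a-z]+\b", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return 0.0, 0.0
    per_all_words = sum(scores) / len(words) / 4.0
    per_sentiment_words = sum(scores) / len(scores) / 4.0
    return per_all_words, per_sentiment_words

short_text = "good"
long_text = "this is a good proposal for the committee to consider today"

print(score_two_ways(short_text, toy_lexicon))
print(score_two_ways(long_text, toy_lexicon))
```

With the per-all-words variant, the same single sentiment word yields a much smaller score in the longer sentence, so long speeches end up close to zero. The per-sentiment-words variant is independent of text length but ignores how densely sentiment words occur.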

### Step 3: Apply to all speeches

In [26]:
print("Analyzing sentiment with VADER...\n")

results_vader = []

for idx, row in tqdm(df.iterrows(), total=len(df), desc="Processing speeches"):
    score, details = analyze_sentiment_lexicon(row['text'], vader_lex)
    results_vader.append({
        'score': score,
        'details': details
    })

df['sentiment_vader_score'] = [r['score'] for r in results_vader]

print("\n✓ Analysis completed!")
print(f"\nAverage sentiment score: {df['sentiment_vader_score'].mean():+.4f}")
print(f"Score range: {df['sentiment_vader_score'].min():+.4f} to {df['sentiment_vader_score'].max():+.4f}")

Analyzing sentiment with VADER...



Processing speeches: 100%|██████████| 1828/1828 [00:00<00:00, 6899.16it/s]


✓ Analysis completed!

Average sentiment score: +0.0076
Score range: -0.0716 to +0.0866





In [27]:
# Sentiment by party group
# Note: '-' marks speakers without a party group (e.g., the President or Commissioners)
sentiment_by_party = df.groupby('party')['sentiment_vader_score'].agg(['mean', 'count'])
sentiment_by_party = sentiment_by_party.sort_values('mean')

print("\nAverage sentiment by party:")
print(sentiment_by_party)

print("\nObservations:")
print(f"- Most positive party: {sentiment_by_party.index[-1]} ({sentiment_by_party['mean'].iloc[-1]:+.4f})")
print(f"- Most negative party: {sentiment_by_party.index[0]} ({sentiment_by_party['mean'].iloc[0]:+.4f})")

# Most positive speech
most_positive_idx = df['sentiment_vader_score'].idxmax()
most_positive = df.loc[most_positive_idx]

print("\n" + "="*80)
print("MOST POSITIVE SPEECH")
print("="*80)
print(f"Speaker: {most_positive['speaker']} ({most_positive['party']})")
print(f"Date: {most_positive['date']}")
print(f"Score: {most_positive['sentiment_vader_score']:+.4f}")
print(f"\nText (first 500 characters):")
print(most_positive['text'][:500] + "...")
print("="*80)

# Most negative speech
most_negative_idx = df['sentiment_vader_score'].idxmin()
most_negative = df.loc[most_negative_idx]

print("\n" + "="*80)
print("MOST NEGATIVE SPEECH")
print("="*80)
print(f"Speaker: {most_negative['speaker']} ({most_negative['party']})")
print(f"Date: {most_negative['date']}")
print(f"Score: {most_negative['sentiment_vader_score']:+.4f}")
print(f"\nText (first 500 characters):")
print(most_negative['text'][:500] + "...")
print("="*80)


Average sentiment by party:
                mean  count
party                      
The Left   -0.004680    223
NI          0.002666     26
ID          0.002724      9
Greens/EFA  0.005032    193
ECR         0.005845     68
Renew       0.006121    202
S&D         0.006675    152
PPE         0.009348    290
-           0.012810    665

Observations:
- Most positive party: - (+0.0128)
- Most negative party: The Left (-0.0047)

MOST POSITIVE SPEECH
Speaker: Reinhard Bütikofer (Greens/EFA)
Date: 2024-02-06
Score: +0.0866

Text (first 500 characters):
Colleague, as you just expressed your strong support for helping Ukraine to help us defend the European security architecture and our freedom, would you support the Estonian proposal that every EU Member State should dedicate 0.25 % of their GDP to military support for Ukraine?...

MOST NEGATIVE SPEECH
Speaker: President (-)
Date: 2024-03-13
Score: -0.0716

Text (first 500 characters):
The next item is the debate on the Commission statement o

### Reflection: Lexicon-based Sentiment Analysis

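**Observations:**
- Scores cluster tightly around zero (mean +0.0076, range -0.0716 to +0.0866), because the summed word scores are divided by the total number of words in each speech
- Differences between party groups are very small (from -0.0047 for The Left to +0.0128 for speakers without a party group)
- The lowest-scoring speech is a procedural announcement by the President, which suggests that extreme scores do not necessarily reflect emotional tone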

**Advantages:**
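- Transparent: every score can be traced back to individual lexicon words
- Fast processing (all 1,828 speeches in under a second)
- The full VADER library additionally handles negations ("not good" = negative) and intensifiers ("very", "extremely"); the simplified function above does not
- The lexicon was built for social media text, yet it covers many general English words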

**Limitations:**
- Still limited context understanding
- May miss subtle political rhetoric
- Complex irony not always detected
- Domain-specific political terminology may not be in lexicon
   
→ These limitations motivate the model-based approaches in the next part.