# Analysis of German Bundestag Speeches: Comparing Lexicon-based and Model-based Approaches

In this notebook, we analyze speeches from the German Bundestag and compare two fundamentally different approaches to text analysis:

1. **Lexicon-based approaches**: Work with predefined dictionaries and rules
2. **Model-based approaches**: Use machine learning and pre-trained AI models

We will apply both approaches to two tasks:
- **Named Entity Recognition (NER)**: Finding location names in texts
- **Sentiment Analysis**: Determining the emotional tone (positive/negative/neutral)

## Learning Objectives

After completing this notebook, you will be able to:
- Understand the differences between rule-based and ML-based approaches
- Apply both methods practically
- Assess the advantages and disadvantages of both approaches
- Compare and interpret results

## Structure

1. Setup and Data Loading
2. **Lexicon-based Approaches**
   - Named Entity Recognition with location dictionary
   - Sentiment Analysis with SentiWS lexicon
3. **Model-based Approaches**
   - Named Entity Recognition with spaCy
   - Sentiment Analysis with Transformers
4. Comparison and Reflection

## 0. Setup and Installation

First, we will install all required packages. You only need to run this once.

In [None]:
# Install packages (only needed the first time)
!pip install spacy transformers pandas torch tqdm
!python -m spacy download de_core_news_sm

In [None]:
# Import libraries
import pandas as pd
import re
import random
from tqdm import tqdm
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

print("✓ All libraries successfully loaded!")

## 1. Load Data

We load the already extracted Bundestag speeches from the CSV file.

In [None]:
# Load CSV file
df = pd.read_csv('reden.csv')

print(f"Number of speeches: {len(df)}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst rows:")
df.head()

In [None]:
# Brief overview of the data
print("Distribution by faction:")
print(df['fraktion'].value_counts())

print(f"\nAverage length of a speech: {df['fliesstext'].str.len().mean():.0f} characters")

---

# Part 2: Lexicon-based Approaches

In this section, we use **predefined dictionaries and word lists** to analyze texts. These approaches are:
- ✓ **Transparent**: Every decision is traceable
- ✓ **Fast**: Very quick processing
- ✓ **Controllable**: You can modify the lexicons
- ✗ **Limited**: Only finds what's in the dictionary
- ✗ **No context understanding**: Cannot understand negations or irony

We will implement two lexicon-based approaches:
1. **NER**: Finding place names using a location dictionary
2. **Sentiment Analysis**: Analyzing emotional tone using the SentiWS lexicon

## 2.1 Lexicon-based NER: Finding Place Names

### How it works:

1. We create a list of German locations (cities, states, regions)
2. We split each text into words
3. For each word, we check: Is it in our location dictionary?
4. If yes, we mark it as a location

This is the **simplest form of Named Entity Recognition**!

### Step 1: Create the location lexicon

In [None]:
# Base lexicon: German cities and federal states
# ADJUSTMENT: You can add or remove locations here (e.g., add 'Erfurt' or 'Thüringen')

german_locations = {
    # Major cities
    'Berlin', 'Hamburg', 'München', 'Köln', 'Frankfurt',
    'Stuttgart', 'Düsseldorf', 'Dortmund', 'Essen', 'Leipzig',
    'Bremen', 'Dresden', 'Hannover', 'Nürnberg', 'Duisburg',
    'Bochum', 'Wuppertal', 'Bielefeld', 'Bonn', 'Münster',
    'Karlsruhe', 'Mannheim', 'Augsburg', 'Wiesbaden', 'Gelsenkirchen',
    'Mönchengladbach', 'Braunschweig', 'Chemnitz', 'Kiel', 'Aachen',
    'Halle', 'Magdeburg', 'Freiburg', 'Krefeld', 'Lübeck',
    'Oberhausen', 'Mainz', 'Rostock', 'Kassel',
    
    # Federal states
    'Baden-Württemberg', 'Bayern', 'Brandenburg', 'Hessen',
    'Mecklenburg-Vorpommern', 'Niedersachsen', 'Nordrhein-Westfalen',
    'Rheinland-Pfalz', 'Saarland', 'Sachsen', 'Sachsen-Anhalt',
    'Schleswig-Holstein',
    
    # Regions
    'Ruhrgebiet', 'Rheinland', 'Franken', 'Schwaben', 'Pfalz',
    'Ostfriesland', 'Allgäu', 'Schwarzwald'
}

# For searching: map each lowercase form to its original spelling
locations_lower = {loc.lower(): loc for loc in german_locations}

print(f"Location lexicon created with {len(german_locations)} entries")
print(f"\nExamples: {list(german_locations)[:10]}")

### Step 2: Implement the search function

Now we program the function that searches for locations in a text.

In [None]:
def find_locations_lexicon(text, location_dict):
    """
    Finds locations in a text based on a lexicon.
    
    How it works (very simple):
    1. Split the text into individual words (at spaces)
    2. For each word: Check if it's in our location lexicon
    3. If yes: Save it
    
    This is the simplest form of Named Entity Recognition!
    
    Parameters:
        text: The text to analyze
        location_dict: Dictionary with locations (lowercase -> Original)
    
    Return:
        List of found locations
    """
    found_locations = []
    
    # Split text into words (at spaces)
    words = text.split()
    
    # Go through each word
    for word in words:
        # Remove punctuation (e.g., "Berlin," -> "Berlin")
        # This is important because "Berlin," is not in our lexicon
        word_clean = word.strip('.,;:!?()[]"\'')
        
        # Check if the word (in lowercase) is in our lexicon
        # We use .lower() because our lexicon is lowercase
        if word_clean.lower() in location_dict:
            # Add the original spelling (e.g., "Berlin" instead of "berlin")
            found_locations.append(location_dict[word_clean.lower()])
    
    return found_locations

# Test with example texts
test_texts = [
    "In Berlin und München wird viel über die Politik in Bayern und Sachsen diskutiert.",
    "Hamburg, Bremen und Kiel sind wichtige Hafenstädte.",
    "Die Bundesregierung tagt in Berlin, die thüringische Landesregierung in Erfurt."
]

print("Function tests:\n")
for i, test_text in enumerate(test_texts, 1):
    test_result = find_locations_lexicon(test_text, locations_lower)
    print(f"Test {i}:")
    print(f"  Text: {test_text}")
    print(f"  Found locations: {test_result}")
    print(f"  Count: {len(test_result)}\n")

print("Observation:")
print("The function only finds locations that")
print("  1. Are in the lexicon AND")
print("  2. Appear in the text as a standalone word (case and surrounding punctuation are ignored)")
print("\nExperiment: Add a new location to the lexicon above")
print("   and run the cell again. Will it be found now?")

#### Observation
The function only finds locations that
1. are in the lexicon AND
2. appear in the text as a standalone word (case and surrounding punctuation are ignored)

**Experiment:** Add 'Erfurt' to the lexicon in Step 1 and run both cells again. Is it found in the third test text now?

### Step 3: Apply to all speeches

In [None]:
# Apply the function to all speeches
print("Analyzing speeches with lexicon-based NER...")

df['ner_lexicon'] = df['fliesstext'].apply(
    lambda text: find_locations_lexicon(text, locations_lower)
)

# Number of found locations per speech
df['ner_lexicon_count'] = df['ner_lexicon'].apply(len)

print(f"✓ Analysis completed!")
print(f"\nStatistics:")
print(f"- Speeches with at least one location: {(df['ner_lexicon_count'] > 0).sum()} of {len(df)}")
print(f"- Average {df['ner_lexicon_count'].mean():.2f} locations per speech")
print(f"- Maximum: {df['ner_lexicon_count'].max()} locations in one speech")

In [None]:
# Which locations are mentioned most frequently?
all_locations_lexicon = [loc for locs in df['ner_lexicon'] for loc in locs]
location_counts_lexicon = Counter(all_locations_lexicon)

print("Top 15 most mentioned locations (lexicon method):")
for location, count in location_counts_lexicon.most_common(15):
    print(f"  {location}: {count}x")

### Reflection: Lexicon-based NER

**What did we observe?**
- The method only finds locations that are in our lexicon
- The results are 100% traceable
- Small towns or districts are not found (unless we add them)

**Advantages:**
- Very transparent and understandable
- Fast processing
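- Few false positives as long as the entries are unambiguous; homographs such as "Essen" (city vs. "essen", to eat) or "Halle" (city vs. hall) are still matched, especially because the lookup ignores case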
- Easy to customize

**Disadvantages:**
- Only finds known locations
- Cannot handle variations (e.g., "Münchner" for "München")
- Requires manual maintenance
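
One caveat of the dictionary lookup is that it lowercases every word, so the verb "essen" (to eat) is matched as the city "Essen". The following is a minimal sketch of a case-sensitive variant, not a replacement for the notebook's function. The names `LOCATIONS`, `find_locations_lowercase`, and `find_locations_case_sensitive` are hypothetical, and the small lexicon is an illustrative subset.

```python
import re

# Illustrative subset of the location lexicon (assumption, not the full list)
LOCATIONS = {"Berlin", "Essen", "Halle", "Baden-Württemberg"}

# Tokens are runs of letters, optionally joined by hyphens ("Baden-Württemberg")
TOKEN_PATTERN = r"[A-Za-zÄÖÜäöüß]+(?:-[A-Za-zÄÖÜäöüß]+)*"

def find_locations_lowercase(text, locations):
    """Case-insensitive lookup, mirroring the approach used in this notebook."""
    lowered = {loc.lower(): loc for loc in locations}
    tokens = re.findall(TOKEN_PATTERN, text)
    return [lowered[tok.lower()] for tok in tokens if tok.lower() in lowered]

def find_locations_case_sensitive(text, locations):
    """Only accept tokens whose spelling matches a lexicon entry exactly."""
    tokens = re.findall(TOKEN_PATTERN, text)
    return [tok for tok in tokens if tok in locations]

example = "Wir essen heute in Essen und fahren morgen nach Berlin."
print(find_locations_lowercase(example, LOCATIONS))       # ['Essen', 'Essen', 'Berlin']
print(find_locations_case_sensitive(example, LOCATIONS))  # ['Essen', 'Berlin']
```

Case-sensitive matching removes the lowercase verb but not every homograph: a sentence that starts with "Essen ist fertig" would still produce a match, because the first word of a sentence is capitalized as well.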

---

## 2.2 Lexicon-based Sentiment Analysis with SentiWS

### How it works:

**SentiWS** is a German sentiment lexicon from the University of Leipzig. Version 2.0 contains roughly 1,650 positive and 1,800 negative base forms, plus their inflected forms. Each entry carries a weight between -1 and +1.

**The algorithm:**
1. Split text into words
2. For each word: Look up the score in the lexicon
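   - Positive words have scores > 0 (e.g., "gut" = +0.5; values here are illustrative)
   - Negative words have scores < 0 (e.g., "schlecht" = -0.5)
   - Neutral words are not in the lexicon (score = 0)
3. Sum the scores and divide by the total number of words in the text, so words without a lexicon entry count as 0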
4. Classify based on a threshold:
   - Score > threshold → positive
   - Score < -threshold → negative
   - Otherwise → neutral
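
The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook's actual implementation: `TOY_LEXICON` is a hypothetical four-word lexicon with made-up scores (not real SentiWS values), `score_text` and `classify_sentiment` are hypothetical names, and the threshold of 0.05 is an arbitrary choice. As in the notebook's code, the sum of scores is divided by the total number of words.

```python
import re

# Hypothetical toy lexicon; the scores are illustrative, not real SentiWS values
TOY_LEXICON = {"gut": 0.5, "hervorragend": 0.8, "schlecht": -0.5, "katastrophal": -0.9}

def score_text(text, lexicon):
    """Sum the scores of all lexicon words and divide by the total word count."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    if not words:
        return 0.0
    return sum(lexicon.get(word, 0.0) for word in words) / len(words)

def classify_sentiment(score, threshold=0.05):
    """Map a numeric score to a label; the threshold is an assumed example value."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

for sentence in ["Das ist gut", "Das ist schlecht", "Der Antrag wurde zur Kenntnis genommen"]:
    score = score_text(sentence, TOY_LEXICON)
    print(f"{sentence!r}: {score:+.3f} -> {classify_sentiment(score)}")
```

Because the score is divided by all words, long texts with few sentiment words end up close to 0. A threshold that works for single sentences is therefore likely too large for whole speeches and would need to be tuned on the data.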

### Why is this interesting?

- Do factions differ in their emotional language?
- Are opposition speeches more negative than government speeches?
- How objective or emotional are political debates?

### Step 1: Load SentiWS lexicon

**Important**: Make sure the two SentiWS files (`SentiWS_v2.0_Positive.txt` and `SentiWS_v2.0_Negative.txt`) are in the same folder as this notebook!

(The files were originally downloaded from: https://wortschatz.uni-leipzig.de/de/download)

In [None]:
def load_sentiws(positive_file, negative_file):
    """
    Loads the SentiWS lexicon.
    
    File format:
    Word|POS    Score    Inflections
    e.g.: Abbau|NN    -0.058    Abbaus,Abbaues,Abbaue,Abbauten
    
    We extract:
    - The base word ("Abbau")
    - The score (-0.058)
    - All inflections (declined forms)
    
    Return:
        Dictionary: {word: score}
    """
    lexicon = {}
    
    # Load positive words
    try:
        with open(positive_file, 'r', encoding='utf-8') as f:
            for line in f:
                parts = line.strip().split('\t')
                if len(parts) >= 2:
                    # Extract base word (before the |)
                    word = parts[0].split('|')[0]
                    score = float(parts[1])
                    lexicon[word.lower()] = score
                    
                    # Add inflections (if present)
                    if len(parts) >= 3 and parts[2]:
                        inflections = parts[2].split(',')
                        for infl in inflections:
                            lexicon[infl.lower()] = score
    except FileNotFoundError:
        print(f"File not found: {positive_file}")
        print("Please download SentiWS from:")
        print("https://wortschatz.uni-leipzig.de/de/download")
        return None
    
    # Load negative words
    try:
        with open(negative_file, 'r', encoding='utf-8') as f:
            for line in f:
                parts = line.strip().split('\t')
                if len(parts) >= 2:
                    word = parts[0].split('|')[0]
                    score = float(parts[1])
                    lexicon[word.lower()] = score
                    
                    if len(parts) >= 3 and parts[2]:
                        inflections = parts[2].split(',')
                        for infl in inflections:
                            lexicon[infl.lower()] = score
    except FileNotFoundError:
        print(f"File not found: {negative_file}")
        return None
    
    return lexicon

# Load SentiWS
# Make sure the files are in the same folder!
sentiws = load_sentiws(
    'SentiWS_v2.0_Positive.txt',
    'SentiWS_v2.0_Negative.txt'
)

if sentiws:
    print(f"✓ SentiWS successfully loaded!")
    print(f"Lexicon contains {len(sentiws)} words\n")
    
    # Show examples
    print("Examples of positive words:")
    # Filter for positive words first, then take examples
    positive_words = {k: v for k, v in sentiws.items() if v > 0}
    positive_items = list(positive_words.items())
    random.shuffle(positive_items)
    for word, score in positive_items[:5]:
        print(f"  {word}: {score:+.3f}")
    
    print("\nExamples of negative words:")
    # Filter for negative words first, then take examples
    negative_words = {k: v for k, v in sentiws.items() if v < 0}
    negative_items = list(negative_words.items())
    random.shuffle(negative_items)
    for word, score in negative_items[:5]:
        print(f"  {word}: {score:+.3f}")
    
    # Additional statistics
    print(f"\nStatistics:")
    print(f"  Positive words: {len(positive_words)}")
    print(f"  Negative words: {len(negative_words)}")
    print(f"  Total: {len(sentiws)}")
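
To see what the loader does with a single line of the file, the parsing step can be isolated. The following is a small sketch that assumes the tab-separated format described in the docstring above. The helper name `parse_sentiws_line` is hypothetical and not part of SentiWS or any library.

```python
def parse_sentiws_line(line):
    """Parse one SentiWS line into (score, lowercase word forms), or None if malformed."""
    parts = line.strip().split("\t")
    if len(parts) < 2:
        return None
    # The first column is "Word|POS"; keep only the word
    base_word = parts[0].split("|")[0]
    score = float(parts[1])
    # The third column (optional) lists the inflected forms, comma-separated
    inflections = parts[2].split(",") if len(parts) >= 3 and parts[2] else []
    forms = [base_word] + inflections
    return score, [form.lower() for form in forms]

sample = "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbaue,Abbauten"
score, forms = parse_sentiws_line(sample)
print(score)  # -0.058
print(forms)  # ['abbau', 'abbaus', 'abbaues', 'abbaue', 'abbauten']
```

Every form receives the same score as its base word. Since all forms are lowercased, words that differ only in capitalization or part of speech share one dictionary key, and the entry read last wins.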

### Step 2: Implement sentiment analysis function

In [None]:
def analyze_sentiment_lexicon(text, lexicon):
    """
    Analyzes the sentiment of a text using a lexicon.
    
    Algorithm:
    1. Split text into words
    2. For each word: Look up the score in the lexicon
    3. Sum the scores and divide by the total number of words
    
    Parameters:
        text: Text to analyze
        lexicon: SentiWS dictionary
    
    Return:
        (score, details)
    """
    # Split text into words (only letters)
    words = re.findall(r'\b[a-zäöüß]+\b', text.lower())
    
    # Collect scores
    scores = []
    sentiment_words = []  # Which words contribute to sentiment?
    
    for word in words:
        if word in lexicon:
            score = lexicon[word]
            scores.append(score)
            # Save ALL sentiment words
            sentiment_words.append((word, score))
    
    # Calculate average
    if not scores:
        # No sentiment words found
        return 0.0, {'total_words': len(words), 'sentiment_words': 0, 'top_positive': [], 'top_negative': []}
    
    avg_score = sum(scores) / len(words)  # Average over ALL words
    
    # Details for transparency
    details = {
        'total_words': len(words),
        'sentiment_words': len(scores),
        'top_positive': sorted([w for w in sentiment_words if w[1] > 0], 
                               key=lambda x: x[1], reverse=True)[:3],
        'top_negative': sorted([w for w in sentiment_words if w[1] < 0], 
                               key=lambda x: x[1])[:3]
    }
    
    return avg_score, details

# Test with example texts
test_texts = [
    "Das ist eine hervorragende und fantastische Lösung!",
    "Die Situation ist katastrophal und inakzeptabel.",
    "Der Antrag wurde zur Kenntnis genommen.",
    "Der Winter ist zwar schön, aber auch ziemlich kalt."
]

if sentiws:
    print("Sentiment function tests:\n")
    for i, text in enumerate(test_texts, 1):
        score, details = analyze_sentiment_lexicon(text, sentiws)
        print(f"Text {i}: {text}")
        print(f"  → Score: {score:+.4f}")
        print(f"  → Positive words: {details['top_positive']}")
        print(f"  → Negative words: {details['top_negative']}")
        print()

### Step 3: Apply to all speeches

In [None]:
if sentiws:
    print("Analyzing sentiment with SentiWS lexicon...\n")
    
    results_lexicon = []
    
    
    for idx, row in tqdm(df.iterrows(), total=len(df), desc="Processing speeches"):
        score, details = analyze_sentiment_lexicon(
            row['fliesstext'], 
            sentiws
        )
        results_lexicon.append({
            'score': score,
            'details': details
        })
    
    df['sentiment_lexicon_score'] = [r['score'] for r in results_lexicon]
    
    print("\n✓ Analysis completed!")
    print(f"\nAverage sentiment score: {df['sentiment_lexicon_score'].mean():+.4f}")
    print(f"Score range: {df['sentiment_lexicon_score'].min():+.4f} to {df['sentiment_lexicon_score'].max():+.4f}")

In [None]:
# Sentiment by faction
if sentiws:
    sentiment_by_faction = df.groupby('fraktion')['sentiment_lexicon_score'].agg(['mean', 'count'])
    sentiment_by_faction = sentiment_by_faction.sort_values('mean')
    
    print("\nAverage sentiment by faction:")
    print(sentiment_by_faction)
    
    print("\nObservations:")
    print(f"- Most positive faction: {sentiment_by_faction.index[-1]} ({sentiment_by_faction['mean'].iloc[-1]:+.4f})")
    print(f"- Most negative faction: {sentiment_by_faction.index[0]} ({sentiment_by_faction['mean'].iloc[0]:+.4f})")
    
    # Most positive speech
    most_positive_idx = df['sentiment_lexicon_score'].idxmax()
    most_positive = df.loc[most_positive_idx]
    
    print("\n" + "="*80)
    print("MOST POSITIVE SPEECH")
    print("="*80)
    print(f"Speaker: {most_positive['redner_vorname']} {most_positive['redner_nachname']} ({most_positive['fraktion']})")
    print(f"Score: {most_positive['sentiment_lexicon_score']:+.4f}")
    print(f"\nText (first 500 characters):")
    print(most_positive['fliesstext'][:500] + "...")
    print("="*80)
    
    # Most negative speech
    most_negative_idx = df['sentiment_lexicon_score'].idxmin()
    most_negative = df.loc[most_negative_idx]
    
    print("\n" + "="*80)
    print("MOST NEGATIVE SPEECH")
    print("="*80)
    print(f"Speaker: {most_negative['redner_vorname']} {most_negative['redner_nachname']} ({most_negative['fraktion']})")
    print(f"Score: {most_negative['sentiment_lexicon_score']:+.4f}")
    print(f"\nText (first 500 characters):")
    print(most_negative['fliesstext'][:500] + "...")
    print("="*80)

### Reflection: Lexicon-based Sentiment Analysis

**Observations:**

**Advantages:**
- Very transparent: You can see exactly which words contribute
- Fast processing
- Easy to understand and explain
- Adjustable threshold

**Limitations:**
- No context understanding: each word is scored in isolation
- Negations are not handled: "nicht gut" (not good) is scored as positive because "gut" is positive
- Irony is not recognized
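
The negation problem can be reduced with a simple rule. The sketch below is a common heuristic, not part of SentiWS or of the function used in this notebook: it flips the sign of a sentiment word when the word directly before it is a negation. `TOY_LEXICON`, `NEGATIONS`, and `score_with_negation` are hypothetical names, and the scores are illustrative.

```python
import re

# Hypothetical toy lexicon with illustrative scores (not real SentiWS values)
TOY_LEXICON = {"gut": 0.5, "schlecht": -0.5}

# Assumption: a small hand-picked list of German negation words
NEGATIONS = {"nicht", "kein", "keine", "nie"}

def score_with_negation(text, lexicon, negations=NEGATIONS):
    """Lexicon score where a directly preceding negation flips the word's sign."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    if not words:
        return 0.0
    total = 0.0
    for i, word in enumerate(words):
        if word in lexicon:
            score = lexicon[word]
            # Flip the sign if the previous word is a negation
            if i > 0 and words[i - 1] in negations:
                score = -score
            total += score
    return total / len(words)

print(f"{score_with_negation('Das ist gut', TOY_LEXICON):+.3f}")
print(f"{score_with_negation('Das ist nicht gut', TOY_LEXICON):+.3f}")
```

This only covers negations that stand immediately before the sentiment word. German often places the negation elsewhere in the sentence ("Das gefällt mir nicht"), so the rule remains a rough approximation.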
   

---

# Part 3: Model-based Approaches

In this section, we use **pre-trained machine learning models** to analyze texts. These approaches are:
- ✓ **Context-aware**: Can take negations, grammar, and surrounding words into account
- ✓ **Generalizable**: Can handle unknown words
- ✓ **State-of-the-art**: Often better results
- ✗ **Opaque**: "Black box"; it is hard to see why the model made a given decision
- ✗ **Computationally intensive**: Slower, requires more resources

In this section, we implement model-based **NER**: finding place names using spaCy.

## 3.1 Model-based NER with spaCy

### How it works:

**spaCy** is a popular NLP library that ships pre-trained pipelines for many languages. The German pipeline used here (`de_core_news_sm`) was trained on annotated German text, and its NER component learned to rely on:
- Context (the words around a token)
- Grammar (e.g., "in Berlin" → Berlin is probably a place)
- Patterns generalized from many annotated examples

The model can recognize:
- **PER**: Persons
- **LOC**: Locations (what we're interested in!)
- **ORG**: Organizations
- **MISC**: Miscellaneous

### Step 1: Load spaCy model

In [None]:
import spacy

# Load the German language model
MODEL_NAME = 'de_core_news_sm'

print(f"Loading spaCy model '{MODEL_NAME}'...")
nlp = spacy.load(MODEL_NAME)
print("✓ Model loaded!")

# What can the model recognize?
print("\nThe model recognizes the following entity types:")
print("- PER: Persons")
print("- LOC: Locations (what we're interested in!)")
print("- ORG: Organizations")
print("- MISC: Miscellaneous")

### Step 2: Implement function

In [None]:
def find_locations_spacy(text, nlp_model):
    """
    Finds locations in a text with spaCy.
    
    How it works:
    1. The model analyzes the entire text
    2. It recognizes "entities" (named entities) of various types
    3. We filter out only the locations (LOC = Location)
    
    The model uses:
    - Context (surrounding words)
    - Grammar (e.g., "in Berlin" -> Berlin is probably a place)
    - Patterns generalized from many annotated examples
    
    Parameters:
        text: The text to analyze
        nlp_model: The loaded spaCy model
    
    Return:
        List of found locations
    """
    # Run text through the model
    doc = nlp_model(text)
    
    # Extract only entities of type 'LOC' (Location)
    locations = [ent.text for ent in doc.ents if ent.label_ == 'LOC']
    
    return locations

# Test with the same example texts as before
test_texts = [
    "In Berlin und München wird viel über die Politik in Bayern und Sachsen diskutiert.",
    "Hamburg, Bremen und Kiel sind wichtige Hafenstädte.",
    "Die Bundesregierung tagt in Berlin, die thüringische Landesregierung in Erfurt."
]

print("Function tests:\n")
for i, text in enumerate(test_texts, 1):
    result_spacy = find_locations_spacy(text, nlp)
    result_lexicon = find_locations_lexicon(text, locations_lower)
    
    print(f"Test {i}: {text}")
    print(f"  Found locations (spaCy):   {result_spacy}")
    print(f"  Found locations (Lexicon): {result_lexicon}")
    print(f"  Comparison: Does spaCy find the same locations as our lexicon?\n")

### Step 3: Apply to all speeches

**Note**: This can take a few minutes as the model analyzes each text individually.

In [None]:
print("Analyzing speeches with spaCy model...")
print("(This can take 2-5 minutes)\n")

from tqdm import tqdm
tqdm.pandas()

df['ner_spacy'] = df['fliesstext'].progress_apply(
    lambda text: find_locations_spacy(text, nlp)
)

df['ner_spacy_count'] = df['ner_spacy'].apply(len)

print(f"\n✓ Analysis completed!")
print(f"\nStatistics:")
print(f"- Speeches with at least one location: {(df['ner_spacy_count'] > 0).sum()} of {len(df)}")
print(f"- Average {df['ner_spacy_count'].mean():.2f} locations per speech")
print(f"- Maximum: {df['ner_spacy_count'].max()} locations in one speech")

In [None]:
# Which locations does spaCy find most frequently?
all_locations_spacy = [loc for locs in df['ner_spacy'] for loc in locs]
location_counts_spacy = Counter(all_locations_spacy)

print("Top 15 most mentioned locations (spaCy method):")
for location, count in location_counts_spacy.most_common(15):
    print(f"  {location}: {count}x")

### Step 4: Compare Lexicon vs. spaCy

In [None]:
# Which locations does spaCy find that the lexicon doesn't?
spacy_unique = set(all_locations_spacy) - set(all_locations_lexicon)
lexicon_unique = set(all_locations_lexicon) - set(all_locations_spacy)
both = set(all_locations_spacy) & set(all_locations_lexicon)

print("Differences in found locations:\n")
print(f"Only found by spaCy: {len(spacy_unique)} locations")
print(f"Examples: {list(spacy_unique)[:10]}\n")

print(f"Only found by Lexicon: {len(lexicon_unique)} locations")
print(f"Examples: {list(lexicon_unique)[:10]}\n")

print(f"Found by both: {len(both)} locations")
print(f"Examples: {list(both)[:10]}")

print("\nSummary:")
print(f"- Lexicon finds on average {df['ner_lexicon_count'].mean():.2f} locations per speech")
print(f"- spaCy finds on average {df['ner_spacy_count'].mean():.2f} locations per speech")
print(f"- Difference: {abs(df['ner_lexicon_count'].mean() - df['ner_spacy_count'].mean()):.2f} locations")
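
The set comparison above ignores how often a location is mentioned. As a complement, the overlap can also be measured on counts. The following is a minimal sketch using `collections.Counter`. The function name `compare_mentions` and the two example lists are hypothetical, and the multiset Jaccard index is one possible choice of overlap measure.

```python
from collections import Counter

def compare_mentions(lexicon_hits, model_hits):
    """Compare two lists of location mentions, taking frequencies into account."""
    lex = Counter(lexicon_hits)
    mod = Counter(model_hits)
    shared = sum((lex & mod).values())  # minimum count per location
    union = sum((lex | mod).values())   # maximum count per location
    return {
        "shared": shared,
        "only_lexicon": sum((lex - mod).values()),
        "only_model": sum((mod - lex).values()),
        # Multiset Jaccard index: 1.0 means identical mention counts
        "jaccard": shared / union if union else 1.0,
    }

# Hypothetical results for a single speech
lexicon_hits = ["Berlin", "Berlin", "Bayern"]
model_hits = ["Berlin", "Erfurt", "Bayern"]
print(compare_mentions(lexicon_hits, model_hits))
# {'shared': 2, 'only_lexicon': 1, 'only_model': 1, 'jaccard': 0.5}
```

Applied per speech (for example to `ner_lexicon` and `ner_spacy`), this shows in which speeches the two methods disagree most. Keep in mind that the lexicon returns canonical spellings while spaCy returns the text span as written, so small surface differences count as disagreement.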

In [None]:
# Detailed analysis: Look at an example speech
# ADJUSTMENT: Change the number to analyze a different speech
EXAMPLE_SPEECH_INDEX = 0

example = df.iloc[EXAMPLE_SPEECH_INDEX]

print("=" * 80)
print(f"EXAMPLE SPEECH #{EXAMPLE_SPEECH_INDEX}")
print("=" * 80)
print(f"Speaker: {example['redner_vorname']} {example['redner_nachname']} ({example['fraktion']})")
print(f"\nText (first 500 characters):")
print(example['fliesstext'][:500] + "...")
print(f"\n{'─' * 80}")
print(f"Found locations (Lexicon): {example['ner_lexicon']}")
print(f"Found locations (spaCy):   {example['ner_spacy']}")
print("=" * 80)

### Reflection: Model-based vs. Lexicon-based NER

**Observations:**

1. **Quantity**: Which method finds more locations? Why?

2. **Quality**: 
   - Does spaCy also find locations not in our lexicon?
   - Does spaCy make errors (False Positives = recognizes something as a location that isn't)?
   
3. **Overlap**: How large is the intersection? Do both methods often agree?

**Typical differences:**
- Lexicon: Only finds known locations; errors come mainly from homographs (e.g., "Essen" the city vs. "essen", to eat)
- Model: Also finds unknown locations, but can also make errors

**Advantages of the model:**
- Can recognize unknown locations
- Understands context
- Can handle variations

**Disadvantages of the model:**
- Less transparent
- Can make mistakes
- Computationally intensive