# Word Extraction Strategy

This notebook outlines the strategy for extracting words from the NOS Dutch news articles dataset and building a clean word list, categorized by part of speech and tracked by yearly frequency.

## Strategy Overview

The word extraction process involves several key steps:
1. **Text Preprocessing**: Clean HTML content and prepare text for analysis
2. **Language Processing**: Use spaCy for tokenization, POS tagging, and lemmatization
3. **Word Filtering**: Remove unwanted tokens and apply quality filters
4. **Frequency Analysis**: Calculate word frequencies by year and overall
5. **Database Storage**: Store results in SQLite with proper categorization
6. **Quality Control**: Validate and clean the final word list

## Key Challenges and Solutions

### Challenge 1: HTML Content Cleaning
- **Problem**: The 'content' field contains HTML markup that needs to be stripped
- **Solution**: Use BeautifulSoup to parse HTML and extract clean text

### Challenge 2: Dutch Language Processing
- **Problem**: Accurate POS tagging requires a proper Dutch language model
- **Solution**: Use spaCy's Dutch model (nl_core_news_sm) for linguistic analysis

### Challenge 3: Text Quality and Noise
- **Problem**: News articles may contain URLs, email addresses, special characters, and formatting artifacts
- **Solution**: Implement a text cleaning pipeline that strips URLs, email addresses, and redundant whitespace before tokenization

### Challenge 4: Memory Efficiency
- **Problem**: The 295k articles occupy about 1.36 GB in memory and require efficient processing
- **Solution**: Process articles in batches to manage memory usage
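
The batching idea can be sketched with the standard library alone. The helper name `iter_batches` is hypothetical and not part of this notebook; the pipeline further down slices a DataFrame with `iloc` instead, so this is only an illustration of the principle that at most `batch_size` items are held at once.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists holding at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice consumes only the next batch_size items from the iterator
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Placeholder article IDs stand in for real articles.
batches = list(iter_batches(range(7), batch_size=3))
print(batches)
```

Because the helper accepts any iterable, it would also work on a lazily read source, where the full dataset never has to sit in memory.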

## Import Required Libraries

Import pandas for data manipulation and other necessary libraries for the word extraction pipeline.

In [1]:
import pandas as pd
import numpy as np
import os
from datetime import datetime

# Display settings for better data exploration
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("Basic libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Basic libraries imported successfully!
Pandas version: 2.3.1
NumPy version: 2.3.2


## Load the Dataset

Load the NOS_NL_articles_2015_mar_2025.feather file for word extraction processing.

In [2]:
# Load the feather dataset
file_path = "data/NOS_NL_articles_2015_mar_2025.feather"

print(f"Loading dataset from: {file_path}")
print(f"File exists: {os.path.exists(file_path)}")

if os.path.exists(file_path):
    # Get file size
    file_size = os.path.getsize(file_path)
    print(f"File size: {file_size / (1024**2):.2f} MB")
    
    # Load the dataset
    df = pd.read_feather(file_path)
    
    print(f"\nDataset loaded successfully!")
    print(f"Shape: {df.shape} (rows, columns)")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / (1024**2):.2f} MB")
else:
    print("File not found! Please check the file path.")

Loading dataset from: data/NOS_NL_articles_2015_mar_2025.feather
File exists: True
File size: 503.98 MB

Dataset loaded successfully!
Shape: (295259, 11) (rows, columns)
Memory usage: 1361.08 MB


## Step 1: Install and Import Text Processing Libraries

Install the required libraries for text processing, including spaCy for Dutch language processing and BeautifulSoup for HTML cleaning.

In [3]:
# Install required packages for text processing
import subprocess
import sys

def install_package(package):
    """Install a package using pip"""
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"✓ {package} installed successfully")
    except subprocess.CalledProcessError as e:
        print(f"✗ Failed to install {package}: {e}")

# Install required packages
packages = [
    "spacy",
    "beautifulsoup4", 
    "lxml",
    "html5lib",
    "tqdm",  # for progress bars
]

print("Installing required packages...")
for package in packages:
    install_package(package)

print("\nDownloading Dutch language model for spaCy...")
try:
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "nl_core_news_sm"])
    print("✓ Dutch language model downloaded successfully")
except subprocess.CalledProcessError as e:
    print(f"✗ Failed to download Dutch model: {e}")
    print("You may need to run: python -m spacy download nl_core_news_sm")

Installing required packages...
✓ spacy installed successfully
✓ beautifulsoup4 installed successfully
✓ lxml installed successfully
✓ html5lib installed successfully
✓ tqdm installed successfully

Downloading Dutch language model for spaCy...
✓ Dutch language model downloaded successfully


In [4]:
# Import text processing libraries
import spacy
from bs4 import BeautifulSoup
import sqlite3
import re
from collections import Counter, defaultdict
from tqdm import tqdm
import string
from datetime import datetime

# Load Dutch language model
print("Loading Dutch language model...")
try:
    nlp = spacy.load("nl_core_news_sm")
    print("✓ Dutch language model loaded successfully")
    print(f"Model info: {nlp.meta['name']} v{nlp.meta['version']}")
except OSError as e:
    print(f"✗ Failed to load Dutch model: {e}")
    print("Please install the Dutch model: python -m spacy download nl_core_news_sm")
    nlp = None

# Test the model with a sample Dutch sentence
if nlp:
    test_sentence = "Dit is een test van de Nederlandse taalverwerking."
    doc = nlp(test_sentence)
    print(f"\nTest sentence: '{test_sentence}'")
    print("Tokens and POS tags:")
    for token in doc:
        print(f"  {token.text} -> {token.pos_} ({token.lemma_})")
else:
    print("Cannot test model - please install Dutch language model first")

Loading Dutch language model...
✓ Dutch language model loaded successfully
Model info: core_news_sm v3.8.0

Test sentence: 'Dit is een test van de Nederlandse taalverwerking.'
Tokens and POS tags:
  Dit -> PRON (dit)
  is -> AUX (zijn)
  een -> DET (een)
  test -> NOUN (test)
  van -> ADP (van)
  de -> DET (de)
  Nederlandse -> ADJ (Nederlands)
  taalverwerking -> NOUN (taalverwerking)
  . -> PUNCT (.)


## Step 2: HTML Content Cleaning

Create functions to clean HTML content from the articles and prepare clean text for processing.

In [5]:
def clean_html_content(html_content):
    """
    Clean HTML content and extract readable text for spaCy processing.
    
    Args:
        html_content (str): Raw HTML content from articles
        
    Returns:
        str: Clean text ready for spaCy processing
    """
    if pd.isna(html_content) or not html_content:
        return ""
    
    try:
        # Parse HTML with BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Remove script and style elements
        for script in soup(["script", "style"]):
            script.decompose()
        
        # Get text content
        text = soup.get_text()
        
        # Clean up whitespace
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = ' '.join(chunk for chunk in chunks if chunk)
        
        return text
    
    except Exception as e:
        print(f"Error cleaning HTML: {e}")
        return str(html_content)  # Return original if cleaning fails

def preprocess_text(text):
    """
    Additional text preprocessing before spaCy analysis.
    
    Args:
        text (str): Clean text from HTML cleaning
        
    Returns:
        str: Preprocessed text ready for spaCy
    """
    if not text:
        return ""
    
    # Remove URLs (everything from the scheme up to the next whitespace)
    text = re.sub(r'https?://\S+', '', text)
    
    # Remove email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Remove very short texts (likely not meaningful)
    if len(text.strip()) < 10:
        return ""
    
    return text

# Test the cleaning functions
print("Testing HTML cleaning functions...")
test_html = """
<div class="article-content">
    <h1>Test Artikel Titel</h1>
    <p>Dit is een <strong>test</strong> artikel met <a href="https://example.com">links</a>.</p>
    <script>alert('test');</script>
    <p>Meer tekst hier.</p>
</div>
"""

cleaned = clean_html_content(test_html)
preprocessed = preprocess_text(cleaned)

print(f"Original HTML: {test_html}")
print(f"Cleaned text: {cleaned}")
print(f"Preprocessed text: {preprocessed}")

Testing HTML cleaning functions...
Original HTML: 
<div class="article-content">
    <h1>Test Artikel Titel</h1>
    <p>Dit is een <strong>test</strong> artikel met <a href="https://example.com">links</a>.</p>
    <script>alert('test');</script>
    <p>Meer tekst hier.</p>
</div>

Cleaned text: Test Artikel Titel Dit is een test artikel met links. Meer tekst hier.
Preprocessed text: Test Artikel Titel Dit is een test artikel met links. Meer tekst hier.


## Step 3: Word Extraction and Processing

Create functions to extract and process words using spaCy for POS tagging, lemmatization, and filtering.

In [6]:
def extract_words_from_text(text, nlp_model):
    """
    Extract and categorize words from cleaned text using spaCy.
    
    Args:
        text (str): Clean text ready for processing
        nlp_model: Loaded spaCy model
        
    Returns:
        list: List of word dictionaries with metadata
    """
    if not text or not nlp_model:
        return []
    
    try:
        # Process text with spaCy
        doc = nlp_model(text)
        
        words = []
        for token in doc:
            # Filter out unwanted tokens
            if should_include_token(token):
                word_info = {
                    'word': token.text.lower(),
                    'lemma': token.lemma_.lower(),
                    'pos': token.pos_,
                    'tag': token.tag_,
                    'is_alpha': token.is_alpha,
                    'is_stop': token.is_stop,
                    'length': len(token.text)
                }
                words.append(word_info)
        
        return words
    
    except Exception as e:
        print(f"Error processing text: {e}")
        return []

def should_include_token(token):
    """
    Determine if a token should be included in the word list.
    
    Args:
        token: spaCy token object
        
    Returns:
        bool: True if token should be included
    """
    # Basic filters
    if not token.text or len(token.text.strip()) == 0:
        return False
    
    # Must be alphabetic (no numbers, punctuation only)
    if not token.is_alpha:
        return False
    
    # Minimum length (drop single-letter tokens)
    if len(token.text) < 2:
        return False
    
    # Maximum length (avoid very long words that might be errors)
    if len(token.text) > 25:
        return False
    
    # Skip certain POS tags
    excluded_pos = {'PUNCT', 'SPACE', 'X'}  # X = other (often errors)
    if token.pos_ in excluded_pos:
        return False
    
    # Skip if it's all uppercase (likely acronyms/abbreviations)
    if token.text.isupper() and len(token.text) > 3:
        return False
    
    return True

def get_pos_category(pos_tag):
    """
    Categorize POS tags into broader categories for easier analysis.
    
    Args:
        pos_tag (str): spaCy POS tag
        
    Returns:
        str: Broader category
    """
    pos_mapping = {
        'NOUN': 'noun',
        'PROPN': 'proper_noun',
        'VERB': 'verb',
        'ADJ': 'adjective',
        'ADV': 'adverb',
        'PRON': 'pronoun',
        'DET': 'determiner',
        'ADP': 'preposition',
        'CONJ': 'conjunction',
        'CCONJ': 'conjunction',
        'SCONJ': 'conjunction',
        'NUM': 'number',
        'PART': 'particle',
        'INTJ': 'interjection',
        'AUX': 'auxiliary'
    }
    return pos_mapping.get(pos_tag, 'other')

# Test the word extraction functions
print("Testing word extraction functions...")
if nlp:
    test_text = "Dit is een mooie Nederlandse zin met verschillende woorden en woordsoorten."
    words = extract_words_from_text(test_text, nlp)
    
    print(f"Test text: {test_text}")
    print("Extracted words:")
    for word in words:
        category = get_pos_category(word['pos'])
        print(f"  {word['word']} -> {word['lemma']} ({word['pos']}, {category})")
else:
    print("Cannot test - spaCy model not loaded")

Testing word extraction functions...
Test text: Dit is een mooie Nederlandse zin met verschillende woorden en woordsoorten.
Extracted words:
  dit -> dit (PRON, pronoun)
  is -> zijn (AUX, auxiliary)
  een -> een (DET, determiner)
  mooie -> mooi (ADJ, adjective)
  nederlandse -> nederlands (ADJ, adjective)
  zin -> zin (NOUN, noun)
  met -> met (ADP, preposition)
  verschillende -> verschillend (ADJ, adjective)
  woorden -> woord (NOUN, noun)
  en -> en (CCONJ, conjunction)
  woordsoorten -> woordsoort (NOUN, noun)


## Step 4: Database Setup

Create SQLite database structure to store words with their frequencies, POS tags, and yearly data.

In [7]:
def setup_database(db_path="words_database.sqlite"):
    """
    Create SQLite database with proper schema for storing word data.
    
    Args:
        db_path (str): Path to SQLite database file
        
    Returns:
        sqlite3.Connection: Database connection
    """
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    
    # Create words table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS words (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            word TEXT NOT NULL,
            lemma TEXT NOT NULL,
            pos_tag TEXT NOT NULL,
            pos_category TEXT NOT NULL,
            total_frequency INTEGER DEFAULT 0,
            first_seen DATE,
            last_seen DATE,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            UNIQUE(word, lemma, pos_tag)
        )
    ''')
    
    # Create word frequencies by year table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS word_frequencies (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            word_id INTEGER,
            year INTEGER,
            frequency INTEGER DEFAULT 0,
            FOREIGN KEY (word_id) REFERENCES words (id),
            UNIQUE(word_id, year)
        )
    ''')
    
    # Create processing log table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS processing_log (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            articles_processed INTEGER,
            words_extracted INTEGER,
            processing_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            notes TEXT
        )
    ''')
    
    # Create indexes for better performance
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_word_lemma ON words (word, lemma)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_pos_category ON words (pos_category)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_frequency_year ON word_frequencies (year)')
    
    conn.commit()
    print(f"Database setup complete: {db_path}")
    
    return conn

def insert_word_data(conn, word_data, year):
    """
    Insert word data into the database with frequency tracking.
    
    Args:
        conn: SQLite connection
        word_data (list): List of word dictionaries
        year (int): Year of the article
    """
    cursor = conn.cursor()
    
    for word_info in word_data:
        pos_category = get_pos_category(word_info['pos'])
        
        # Insert or update word
        cursor.execute('''
            INSERT OR IGNORE INTO words (word, lemma, pos_tag, pos_category, first_seen, last_seen)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            word_info['word'],
            word_info['lemma'], 
            word_info['pos'],
            pos_category,
            f"{year}-01-01",
            f"{year}-12-31"
        ))
        
        # Widen the first_seen/last_seen range if the word already exists
        # (articles are not guaranteed to arrive in chronological order)
        cursor.execute('''
            UPDATE words 
            SET first_seen = MIN(first_seen, ?),
                last_seen = MAX(last_seen, ?)
            WHERE word = ? AND lemma = ? AND pos_tag = ?
        ''', (
            f"{year}-01-01",
            f"{year}-12-31",
            word_info['word'],
            word_info['lemma'],
            word_info['pos']
        ))
        
        # Get word ID
        cursor.execute('''
            SELECT id FROM words 
            WHERE word = ? AND lemma = ? AND pos_tag = ?
        ''', (word_info['word'], word_info['lemma'], word_info['pos']))
        
        word_id = cursor.fetchone()[0]
        
        # Insert or update frequency
        cursor.execute('''
            INSERT OR IGNORE INTO word_frequencies (word_id, year, frequency)
            VALUES (?, ?, 0)
        ''', (word_id, year))
        
        cursor.execute('''
            UPDATE word_frequencies 
            SET frequency = frequency + 1
            WHERE word_id = ? AND year = ?
        ''', (word_id, year))
        
        # Update total frequency
        cursor.execute('''
            UPDATE words 
            SET total_frequency = total_frequency + 1
            WHERE id = ?
        ''', (word_id,))
    
    conn.commit()

# Test database setup
print("Setting up test database...")
test_conn = setup_database("test_words.sqlite")

# Test with sample data
if nlp:
    sample_words = extract_words_from_text("Dit is een test van de database functionaliteit.", nlp)
    insert_word_data(test_conn, sample_words, 2023)
    
    # Query results
    cursor = test_conn.cursor()
    cursor.execute('SELECT * FROM words')
    results = cursor.fetchall()
    print(f"Sample words inserted: {len(results)}")
    for row in results[:3]:
        print(f"  {row}")

test_conn.close()

Setting up test database...
Database setup complete: test_words.sqlite
Sample words inserted: 8
  (1, 'dit', 'dit', 'PRON', 'pronoun', 1, '2023-01-01', '2023-12-31', '2025-07-31 19:18:59')
  (2, 'is', 'zijn', 'AUX', 'auxiliary', 1, '2023-01-01', '2023-12-31', '2025-07-31 19:18:59')
  (3, 'een', 'een', 'DET', 'determiner', 1, '2023-01-01', '2023-12-31', '2025-07-31 19:18:59')
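
`insert_word_data` issues several SQL statements per token. One possible alternative, sketched below, is to aggregate tokens with a `Counter` first and then write one upsert per distinct word. The single-table schema `yearly_counts` and the helper `upsert_counts` are simplifications invented for this example, not the notebook's schema, and the `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or newer.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE yearly_counts (
        word TEXT NOT NULL,
        lemma TEXT NOT NULL,
        pos_tag TEXT NOT NULL,
        year INTEGER NOT NULL,
        frequency INTEGER NOT NULL,
        UNIQUE(word, lemma, pos_tag, year)
    )
    """
)


def upsert_counts(conn, tokens, year):
    """Aggregate (word, lemma, pos) tuples and add their counts for one year."""
    counts = Counter(tokens)
    rows = [(w, l, p, year, n) for (w, l, p), n in counts.items()]
    conn.executemany(
        """
        INSERT INTO yearly_counts (word, lemma, pos_tag, year, frequency)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(word, lemma, pos_tag, year)
        DO UPDATE SET frequency = frequency + excluded.frequency
        """,
        rows,
    )
    conn.commit()


# Hand-written tokens standing in for one article's extracted words.
tokens = [("de", "de", "DET"), ("de", "de", "DET"), ("jaar", "jaar", "NOUN")]
upsert_counts(conn, tokens, 2015)
upsert_counts(conn, tokens, 2015)  # a second article with the same tokens
result = dict(conn.execute("SELECT word, frequency FROM yearly_counts"))
print(result)
```

Aggregating before writing reduces the number of statements from one group per token to one per distinct word, which tends to matter most for high-frequency function words.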


## Step 5: Main Processing Pipeline

Create the main pipeline to process all articles in batches and extract words efficiently.

In [8]:
def process_articles_pipeline(df, nlp_model, db_path="words_database.sqlite", batch_size=1000):
    """
    Main pipeline to process all articles and extract words.
    
    Args:
        df: DataFrame with articles
        nlp_model: Loaded spaCy model
        db_path (str): Path to SQLite database
        batch_size (int): Number of articles to process in each batch
    """
    if not nlp_model:
        print("Error: spaCy model not loaded")
        return
    
    # Setup database
    conn = setup_database(db_path)
    
    # Prepare progress tracking
    total_articles = len(df)
    total_words_extracted = 0
    articles_processed = 0
    
    print(f"Starting processing of {total_articles:,} articles...")
    print(f"Batch size: {batch_size}")
    
    # Process in batches
    for start_idx in tqdm(range(0, total_articles, batch_size), desc="Processing batches"):
        end_idx = min(start_idx + batch_size, total_articles)
        batch_df = df.iloc[start_idx:end_idx]
        
        batch_words = 0
        
        for idx, row in batch_df.iterrows():
            try:
                # Extract year from published_time
                if pd.notna(row['published_time']):
                    if isinstance(row['published_time'], str):
                        year = pd.to_datetime(row['published_time']).year
                    else:
                        year = row['published_time'].year
                else:
                    year = 2020  # Fallback for undated articles; note that this inflates the 2020 counts
                
                # Process different text fields
                text_fields = ['title', 'description', 'content']
                all_text = []
                
                for field in text_fields:
                    if field in row and pd.notna(row[field]):
                        if field == 'content':
                            # Clean HTML from content
                            clean_text = clean_html_content(row[field])
                        else:
                            clean_text = str(row[field])
                        
                        preprocessed = preprocess_text(clean_text)
                        if preprocessed:
                            all_text.append(preprocessed)
                
                # Combine all text
                combined_text = ' '.join(all_text)
                
                if combined_text:
                    # Extract words
                    words = extract_words_from_text(combined_text, nlp_model)
                    
                    if words:
                        # Insert into database
                        insert_word_data(conn, words, year)
                        batch_words += len(words)
                
                articles_processed += 1
                
            except Exception as e:
                print(f"Error processing article {idx}: {e}")
                continue
        
        total_words_extracted += batch_words
        
        # Log progress every 10 batches
        if (start_idx // batch_size) % 10 == 0:
            print(f"Processed {articles_processed:,}/{total_articles:,} articles, "
                  f"extracted {total_words_extracted:,} words")
    
    # Log final results
    cursor = conn.cursor()
    cursor.execute('INSERT INTO processing_log (articles_processed, words_extracted, notes) VALUES (?, ?, ?)',
                   (articles_processed, total_words_extracted, f"Batch processing complete - batch size {batch_size}"))
    conn.commit()
    
    print(f"\n=== PROCESSING COMPLETE ===")
    print(f"Total articles processed: {articles_processed:,}")
    print(f"Total words extracted: {total_words_extracted:,}")
    print(f"Database saved to: {db_path}")
    
    # Get final statistics
    cursor.execute('SELECT COUNT(*) FROM words')
    unique_words = cursor.fetchone()[0]
    
    cursor.execute('SELECT COUNT(DISTINCT pos_category) FROM words')
    pos_categories = cursor.fetchone()[0]
    
    cursor.execute('SELECT year, COUNT(*) FROM word_frequencies GROUP BY year ORDER BY year')
    yearly_stats = cursor.fetchall()
    
    print(f"Unique words in database: {unique_words:,}")
    print(f"POS categories found: {pos_categories}")
    # COUNT(*) per year counts distinct word rows, not total occurrences
    print(f"Yearly distribution:")
    for year, count in yearly_stats:
        print(f"  {year}: {count:,} unique words")
    
    conn.close()
    
    return {
        'articles_processed': articles_processed,
        'words_extracted': total_words_extracted,
        'unique_words': unique_words,
        'database_path': db_path
    }

# Note: The actual processing will be run in the next step
print("Processing pipeline function defined. Ready to process articles.")

Processing pipeline function defined. Ready to process articles.


## Step 6: Run the Processing (Execute with Caution)

**WARNING**: This step will process all 295k+ articles and may take several hours. Only run when ready!

In [9]:
# SAFETY CHECK: Only run if you're ready to process all articles
RUN_FULL_PROCESSING = False  # Set to True when ready to run

if RUN_FULL_PROCESSING:
    print("🚀 Starting full processing of all articles...")
    print("This may take several hours depending on your system.")
    print("You can monitor progress and stop if needed.")
    
    # Load the full dataset if not already loaded
    if 'df' not in locals() or df is None:
        print("Loading dataset...")
        df = pd.read_feather("data/NOS_NL_articles_2015_mar_2025.feather")
        print(f"Dataset loaded: {df.shape}")
    
    # Check if spaCy model is loaded
    if nlp is None:
        print("Error: Dutch spaCy model not loaded. Please run the installation steps first.")
    else:
        # Run the processing pipeline
        results = process_articles_pipeline(
            df=df,
            nlp_model=nlp,
            db_path="dutch_words_database.sqlite",
            batch_size=500  # Smaller batches for better memory management
        )
        
        print("✅ Full processing complete!")
        print(f"Results: {results}")

else:
    print("⚠️  Full processing is disabled for safety.")
    print("Set RUN_FULL_PROCESSING = True to enable full processing.")
    print("\n📊 Alternative: Run a test with a small sample:")
    
    # Test with a small sample instead
    if 'df' in locals() and df is not None:
        sample_size = 100
        sample_df = df.head(sample_size)
        
        print(f"\n🧪 Running test with {sample_size} articles...")
        
        if nlp is not None:
            test_results = process_articles_pipeline(
                df=sample_df,
                nlp_model=nlp,
                db_path="test_dutch_words.sqlite",
                batch_size=50
            )
            print(f"Test results: {test_results}")
        else:
            print("Cannot run test - spaCy model not loaded")
    else:
        print("Dataset not loaded. Please run the data loading cells first.")

⚠️  Full processing is disabled for safety.
Set RUN_FULL_PROCESSING = True to enable full processing.

📊 Alternative: Run a test with a small sample:

🧪 Running test with 100 articles...
Database setup complete: test_dutch_words.sqlite
Starting processing of 100 articles...
Batch size: 50


Processing batches:  50%|█████     | 1/2 [00:03<00:03,  3.39s/it]

Processed 50/100 articles, extracted 13,013 words


Processing batches: 100%|██████████| 2/2 [00:07<00:00,  3.82s/it]


=== PROCESSING COMPLETE ===
Total articles processed: 100
Total words extracted: 29,696
Database saved to: test_dutch_words.sqlite
Unique words in database: 6,195
POS categories found: 13
Yearly distribution:
  2015: 6,195 word instances
Test results: {'articles_processed': 100, 'words_extracted': 29696, 'unique_words': 6195, 'database_path': 'test_dutch_words.sqlite'}
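
The sample run gives a rough basis for estimating the full run. The arithmetic below extrapolates linearly from the tqdm timing of the second batch (about 3.82 s per 50-article batch) to all 295,259 articles. It assumes later articles are similar in length to the first 100 and that database writes do not slow down as the tables grow, so treat the figure as a rough estimate.

```python
sample_articles = 100
sample_seconds = 2 * 3.82  # two batches at about 3.82 s each (from the tqdm output)
total_articles = 295_259   # row count reported when the dataset was loaded

seconds_per_article = sample_seconds / sample_articles
estimated_hours = total_articles * seconds_per_article / 3600
print(f"Estimated full run: about {estimated_hours:.1f} hours")
```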





## Step 7: Analysis and Export

Analyze the extracted words and create various exports for different use cases.
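
As an illustration of what the two-table layout allows, the sketch below sums yearly frequencies over all surface forms of one lemma. It builds a reduced copy of the `words` and `word_frequencies` tables in memory and fills them with made-up rows. The helper `lemma_trend` is hypothetical, and the numbers are not results from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE words (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        word TEXT NOT NULL,
        lemma TEXT NOT NULL,
        pos_tag TEXT NOT NULL
    );
    CREATE TABLE word_frequencies (
        word_id INTEGER REFERENCES words (id),
        year INTEGER,
        frequency INTEGER DEFAULT 0,
        UNIQUE(word_id, year)
    );
    -- Made-up demo rows, not results from the dataset.
    INSERT INTO words (word, lemma, pos_tag) VALUES
        ('woord', 'woord', 'NOUN'), ('woorden', 'woord', 'NOUN');
    INSERT INTO word_frequencies (word_id, year, frequency) VALUES
        (1, 2015, 3), (2, 2015, 2), (1, 2016, 4);
    """
)


def lemma_trend(conn, lemma):
    """Return (year, summed frequency) pairs across all forms of a lemma."""
    return conn.execute(
        """
        SELECT f.year, SUM(f.frequency)
        FROM word_frequencies AS f
        JOIN words AS w ON w.id = f.word_id
        WHERE w.lemma = ?
        GROUP BY f.year
        ORDER BY f.year
        """,
        (lemma,),
    ).fetchall()


trend = lemma_trend(conn, "woord")
print(trend)
```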

In [10]:
def analyze_word_database(db_path="dutch_words_database.sqlite"):
    """
    Analyze the word database and generate statistics.
    
    Args:
        db_path (str): Path to the SQLite database
    """
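    # sqlite3.connect() silently creates a missing file, so check for it first
    if not os.path.exists(db_path):
        print(f"Database file not found: {db_path}")
        return
    
    try:
        conn = sqlite3.connect(db_path)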
        
        print(f"=== WORD DATABASE ANALYSIS ===")
        print(f"Database: {db_path}")
        
        # Basic statistics
        cursor = conn.cursor()
        
        cursor.execute('SELECT COUNT(*) FROM words')
        total_unique_words = cursor.fetchone()[0]
        
        cursor.execute('SELECT SUM(total_frequency) FROM words')
        total_word_instances = cursor.fetchone()[0]
        
        cursor.execute('SELECT COUNT(DISTINCT pos_category) FROM words')
        pos_categories = cursor.fetchone()[0]
        
        print(f"\nBasic Statistics:")
        print(f"  Unique words: {total_unique_words:,}")
        print(f"  Total word instances: {total_word_instances:,}")
        print(f"  POS categories: {pos_categories}")
        
        # Top words by frequency
        cursor.execute('''
            SELECT word, lemma, pos_category, total_frequency 
            FROM words 
            ORDER BY total_frequency DESC 
            LIMIT 20
        ''')
        top_words = cursor.fetchall()
        
        print(f"\nTop 20 Most Frequent Words:")
        for i, (word, lemma, pos, freq) in enumerate(top_words, 1):
            print(f"  {i:2d}. {word} ({lemma}) [{pos}] - {freq:,} times")
        
        # Words by POS category
        cursor.execute('''
            SELECT pos_category, COUNT(*) as count, AVG(total_frequency) as avg_freq
            FROM words 
            GROUP BY pos_category 
            ORDER BY count DESC
        ''')
        pos_stats = cursor.fetchall()
        
        print(f"\nWords by POS Category:")
        for pos, count, avg_freq in pos_stats:
            print(f"  {pos}: {count:,} words (avg freq: {avg_freq:.1f})")
        
        # Yearly trends
        cursor.execute('''
            SELECT year, COUNT(*) as word_count, SUM(frequency) as total_freq
            FROM word_frequencies 
            GROUP BY year 
            ORDER BY year
        ''')
        yearly_trends = cursor.fetchall()
        
        print(f"\nYearly Word Trends:")
        for year, word_count, total_freq in yearly_trends:
            print(f"  {year}: {word_count:,} unique words, {total_freq:,} total instances")
        
        conn.close()
        
    except sqlite3.Error as e:
        print(f"Database error: {e}")
    except FileNotFoundError:
        print(f"Database file not found: {db_path}")

def export_word_lists(db_path="dutch_words_database.sqlite", output_dir="exports"):
    """
    Export word lists in various formats for different use cases.
    
    Args:
        db_path (str): Path to the SQLite database
        output_dir (str): Directory to save exports
    """
    import os
    
    try:
        os.makedirs(output_dir, exist_ok=True)
        conn = sqlite3.connect(db_path)
        
        print(f"=== EXPORTING WORD LISTS ===")
        print(f"Output directory: {output_dir}")
        
        # 1. All words list (for general use)
        print("\n1. Exporting all words list...")
        cursor = conn.cursor()
        cursor.execute('SELECT DISTINCT word FROM words ORDER BY word')
        all_words = [row[0] for row in cursor.fetchall()]
        
        with open(f"{output_dir}/all_words.txt", 'w', encoding='utf-8') as f:
            for word in all_words:
                f.write(word + '\n')
        print(f"   Exported {len(all_words):,} words to all_words.txt")
        
        # 2. Common words (frequency >= 10)
        print("\n2. Exporting common words (frequency >= 10)...")
        cursor.execute('SELECT word, total_frequency FROM words WHERE total_frequency >= 10 ORDER BY total_frequency DESC')
        common_words = cursor.fetchall()
        
        with open(f"{output_dir}/common_words.txt", 'w', encoding='utf-8') as f:
            for word, freq in common_words:
                f.write(f"{word}\t{freq}\n")
        print(f"   Exported {len(common_words):,} words to common_words.txt")
        
        # 3. Words by POS category
        print("\n3. Exporting words by POS category...")
        pos_categories = ['noun', 'verb', 'adjective', 'adverb']
        
        for pos in pos_categories:
            cursor.execute('''
                SELECT word, total_frequency 
                FROM words 
                WHERE pos_category = ? 
                ORDER BY total_frequency DESC
            ''', (pos,))
            pos_words = cursor.fetchall()
            
            with open(f"{output_dir}/{pos}_words.txt", 'w', encoding='utf-8') as f:
                for word, freq in pos_words:
                    f.write(f"{word}\t{freq}\n")
            print(f"   Exported {len(pos_words):,} {pos} words to {pos}_words.txt")
        
        # 4. CSV export with full data
        print("\n4. Exporting full data to CSV...")
        cursor.execute('''
            SELECT word, lemma, pos_tag, pos_category, total_frequency, first_seen, last_seen
            FROM words 
            ORDER BY total_frequency DESC
        ''')
        
        import csv
        with open(f"{output_dir}/words_full_data.csv", 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['word', 'lemma', 'pos_tag', 'pos_category', 'total_frequency', 'first_seen', 'last_seen'])
            writer.writerows(cursor.fetchall())
        print(f"   Exported full data to words_full_data.csv")
        
        # 5. Game-friendly word list (4-8 letters, common words)
        print("\n5. Exporting game-friendly word list...")
        cursor.execute('''
            SELECT word, total_frequency 
            FROM words 
            WHERE LENGTH(word) BETWEEN 4 AND 8 
            AND total_frequency >= 5
            AND pos_category IN ('noun', 'verb', 'adjective')
            ORDER BY total_frequency DESC
        ''')
        game_words = cursor.fetchall()
        
        with open(f"{output_dir}/game_words.txt", 'w', encoding='utf-8') as f:
            for word, freq in game_words:
                f.write(word + '\n')
        print(f"   Exported {len(game_words):,} words to game_words.txt")
        
        conn.close()
        print(f"\n✅ All exports completed in: {output_dir}")
        
    except Exception as e:
        print(f"Export error: {e}")

# Test analysis (will work with existing test database)
print("Testing analysis functions...")
if os.path.exists("test_dutch_words.sqlite"):
    analyze_word_database("test_dutch_words.sqlite")
    export_word_lists("test_dutch_words.sqlite", "test_exports")
else:
    print("No test database found. Run the processing steps first.")

Testing analysis functions...
=== WORD DATABASE ANALYSIS ===
Database: test_dutch_words.sqlite

Basic Statistics:
  Unique words: 6,195
  Total word instances: 29,696
  POS categories: 13

Top 20 Most Frequent Words:
   1. de (de) [determiner] - 1,862 times
   2. in (in) [preposition] - 882 times
   3. van (van) [preposition] - 852 times
   4. een (een) [determiner] - 780 times
   5. het (het) [determiner] - 692 times
   6. en (en) [conjunction] - 516 times
   7. is (zijn) [auxiliary] - 430 times
   8. op (op) [preposition] - 392 times
   9. met (met) [preposition] - 278 times
  10. voor (voor) [preposition] - 248 times
  11. er (er) [adverb] - 231 times
  12. het (het) [pronoun] - 198 times
  13. dat (dat) [conjunction] - 193 times
  14. te (te) [preposition] - 193 times
  15. niet (niet) [adverb] - 191 times
  16. zijn (zijn) [auxiliary] - 186 times
  17. hij (hij) [pronoun] - 184 times
  18. bij (bij) [preposition] - 180 times
  19. jaar (jaar) [noun] - 179 times
  20. die (die) [pr