# Unique Words Per Year Analysis

This notebook analyzes the pre-calculated SQLite database to find words that are used **relatively more** often in specific years than their usage in other years would predict. It normalizes for the total word count per year and flags words whose frequency in a year deviates strongly from their own baseline.

## Methodology

1. **Relative Frequency Analysis**: Calculate normalized frequencies accounting for total word counts per year
2. **Statistical Significance**: Identify words with statistically significant frequency spikes
3. **Bias Correction**: Account for varying article counts and total word volumes per year
4. **Temporal Uniqueness**: Find words that are distinctively associated with specific years
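
Step 3 is the reason raw counts are not compared directly. The following minimal sketch uses made-up counts (none of the numbers come from the database) to show that the year with the most raw occurrences of a word need not be the year in which the word is relatively most frequent:

```python
# Illustrative counts only; none of these numbers come from the database.
raw_counts = {2019: 120, 2020: 150, 2021: 90}                      # one word's occurrences
total_words = {2019: 1_000_000, 2020: 3_000_000, 2021: 600_000}    # corpus size per year

# Relative frequency = occurrences of the word / all words in that year
relative = {year: raw_counts[year] / total_words[year] for year in raw_counts}

peak_raw = max(raw_counts, key=raw_counts.get)
peak_relative = max(relative, key=relative.get)

print(f"Peak by raw count:          {peak_raw}")
print(f"Peak by relative frequency: {peak_relative}")
```

In this toy case the two peaks fall in different years, because the year with the most occurrences also has by far the largest corpus.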

## Key Metrics Calculated

- **Relative Frequency**: Word frequency in year / Total words in year
- **Z-Score**: How many standard deviations a word's frequency in a given year lies above or below that word's average across years
- **Lift**: Ratio of a word's frequency in a given year to its mean frequency across years
- **Temporal Specificity**: How concentrated a word's spikes are in specific years (1 / number of years in which the word spikes)
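
As a rough illustration of these definitions, the sketch below computes the Z-score and lift for a single word in plain Python rather than pandas, so each step is visible. The function name `year_metrics` and all input values are hypothetical; the sample standard deviation (n - 1) is used because that is the pandas default for `.std()`.

```python
from statistics import mean, stdev

def year_metrics(rel_freq_by_year):
    """Return {year: (z_score, lift)} for one word's relative frequencies."""
    values = list(rel_freq_by_year.values())
    avg = mean(values)
    # Sample standard deviation (n - 1), matching the pandas default for .std()
    sd = stdev(values) if len(values) > 1 else 0.0
    result = {}
    for year, value in rel_freq_by_year.items():
        z = (value - avg) / sd if sd > 0 else 0.0
        lift = value / avg if avg > 0 else 1.0
        result[year] = (z, lift)
    return result

# Hypothetical relative frequencies for a single word
rel_freq = {2018: 1e-5, 2019: 1e-5, 2020: 5e-5, 2021: 1e-5}
metrics = year_metrics(rel_freq)

# Temporal specificity as used later in this notebook: 1 / number of spike years.
# The threshold of 1.0 is chosen only so that this tiny example produces a spike.
spike_years = [year for year, (z, _) in metrics.items() if z >= 1.0]
specificity = 1 / len(spike_years) if spike_years else 0.0

for year, (z, lift) in metrics.items():
    print(f"{year}: z={z:+.2f}, lift={lift:.2f}")
print(f"Spike years: {spike_years}, temporal specificity: {specificity:.2f}")
```

The fallbacks (Z-score 0 when the standard deviation is zero, lift 1 when the mean is zero) mirror the `np.where` guards used in the analysis cell further down.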

## Setup and Database Connection

Load required libraries and connect to the pre-calculated SQLite database.
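
For orientation, the sketch below builds a tiny in-memory database with the two tables the notebook expects and runs the same kind of validation queries. The schema is inferred from the queries used in this notebook and is an assumption: the real database may have more columns, indexes, or tables. The sample row is invented.

```python
import sqlite3

# Schema inferred from the queries used in this notebook; the real database
# may contain additional columns, indexes, or tables.
SCHEMA = """
CREATE TABLE words (
    id INTEGER PRIMARY KEY,
    word TEXT NOT NULL,
    lemma TEXT,
    pos_category TEXT,
    total_frequency INTEGER
);
CREATE TABLE word_frequencies (
    word_id INTEGER REFERENCES words(id),
    year INTEGER,
    frequency INTEGER
);
"""

demo = sqlite3.connect(":memory:")
demo.executescript(SCHEMA)
# Invented sample data, for illustration only
demo.execute("INSERT INTO words VALUES (1, 'lockdown', 'lockdown', 'NOUN', 45)")
demo.executemany(
    "INSERT INTO word_frequencies VALUES (?, ?, ?)",
    [(1, 2019, 5), (1, 2020, 40)],
)

# Same checks the connection cell performs: required tables and year range
tables = {row[0] for row in demo.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
has_required = {"words", "word_frequencies"} <= tables
year_range = demo.execute(
    "SELECT MIN(year), MAX(year) FROM word_frequencies").fetchone()

print(f"Required tables present: {has_required}")
print(f"Years: {year_range[0]} to {year_range[1]}")
demo.close()
```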

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
from collections import defaultdict
import os

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✓ Libraries imported successfully")
print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ NumPy version: {np.__version__}")

In [None]:
# Database connection and validation
def connect_to_database(db_path="output/words_database.sqlite"):
    """
    Connect to the pre-calculated words database and validate its structure.
    
    Args:
        db_path (str): Path to the SQLite database
        
    Returns:
        tuple: (sqlite3.Connection, str) with the open connection and the
            path of the database file that was found
    """
    # Try multiple possible database locations
    possible_paths = [
        db_path,
        "output/dutch_words_full.sqlite",
        "output/test_dutch_words.sqlite",
        "output/test_words.sqlite"
    ]
    
    for path in possible_paths:
        if os.path.exists(path):
            print(f"Found database: {path}")
            conn = sqlite3.connect(path)
            
            # Validate database structure
            cursor = conn.cursor()
            
            # Check required tables exist
            cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
            tables = [row[0] for row in cursor.fetchall()]
            
            required_tables = ['words', 'word_frequencies']
            if all(table in tables for table in required_tables):
                print(f"✓ Database validated with tables: {tables}")
                
                # Get basic statistics
                cursor.execute("SELECT COUNT(*) FROM words")
                word_count = cursor.fetchone()[0]
                
                cursor.execute("SELECT COUNT(*) FROM word_frequencies")
                freq_count = cursor.fetchone()[0]
                
                cursor.execute("SELECT MIN(year), MAX(year) FROM word_frequencies")
                year_range = cursor.fetchone()
                
                print(f"✓ Database contains:")
                print(f"  - {word_count:,} unique words")
                print(f"  - {freq_count:,} word-year frequency records")
                print(f"  - Years: {year_range[0]} to {year_range[1]}")
                
                return conn, path
            else:
                print(f"✗ Database {path} missing required tables")
                conn.close()
    
    raise FileNotFoundError("No valid word database found. Please run word_extraction_strategy.ipynb first.")

# Connect to database
try:
    conn, db_path = connect_to_database()
    print(f"\n✓ Successfully connected to database: {db_path}")
except FileNotFoundError as e:
    print(f"✗ Database connection failed: {e}")
    print("Please run the word_extraction_strategy.ipynb notebook first to generate the database.")
    conn = None

## Data Loading and Preparation

Load the word frequency data and prepare it for relative frequency analysis.
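
One detail worth checking when preparing the denominators: the loading cell below sums yearly totals from rows that have already passed the `total_frequency >= 10` filter, so rare words do not contribute to the per-year total. A possible alternative, sketched here on an invented in-memory table, is to compute the totals in SQL from all rows of `word_frequencies`. Whether that table itself covers every token depends on how the database was built, which this notebook does not show.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT, total_frequency INTEGER);
CREATE TABLE word_frequencies (word_id INTEGER, year INTEGER, frequency INTEGER);
-- Invented rows: one frequent word and one rare word
INSERT INTO words VALUES (1, 'de', 900), (2, 'zeldzaam', 4);
INSERT INTO word_frequencies VALUES
    (1, 2020, 400), (1, 2021, 500), (2, 2020, 1), (2, 2021, 3);
""")

# Denominator from every row, independent of the frequency filter
all_totals = dict(demo.execute(
    "SELECT year, SUM(frequency) FROM word_frequencies GROUP BY year"))

# Denominator from filtered rows only (what a post-filter groupby yields)
filtered_totals = dict(demo.execute("""
    SELECT wf.year, SUM(wf.frequency)
    FROM words w JOIN word_frequencies wf ON w.id = wf.word_id
    WHERE w.total_frequency >= 10
    GROUP BY wf.year
"""))

print("All rows:     ", all_totals)
print("Filtered rows:", filtered_totals)
demo.close()
```

The gap between the two denominators grows with the share of rare words in a year, so it can differ from year to year.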

In [None]:
def load_word_frequency_data(conn):
    """
    Load and prepare word frequency data for analysis.
    
    Args:
        conn: SQLite database connection
        
    Returns:
        tuple: (word_freq_df, yearly_totals, word_info_df)
    """
    if conn is None:
        return None, None, None
    
    print("Loading word frequency data...")
    
    # Load word frequencies with word information
    query = """
    SELECT 
        w.word,
        w.lemma,
        w.pos_category,
        w.total_frequency,
        wf.year,
        wf.frequency
    FROM words w
    JOIN word_frequencies wf ON w.id = wf.word_id
    WHERE w.total_frequency >= 10  -- Only include words with reasonable frequency
    ORDER BY w.word, wf.year
    """
    
    word_freq_df = pd.read_sql_query(query, conn)
    print(f"✓ Loaded {len(word_freq_df):,} word-year frequency records")
    
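    # Calculate yearly totals for normalization.
    # Note: these totals only include words that passed the
    # total_frequency >= 10 filter above, so they undercount the
    # true number of words per year.
    yearly_totals = word_freq_df.groupby('year')['frequency'].sum().reset_index()
    yearly_totals.columns = ['year', 'total_words']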
    print(f"✓ Calculated yearly totals for {len(yearly_totals)} years")
    
    # Load word information
    word_info_query = """
    SELECT word, lemma, pos_category, total_frequency
    FROM words
    WHERE total_frequency >= 10
    """
    
    word_info_df = pd.read_sql_query(word_info_query, conn)
    print(f"✓ Loaded information for {len(word_info_df):,} unique words")
    
    return word_freq_df, yearly_totals, word_info_df

# Load the data
word_freq_df, yearly_totals, word_info_df = load_word_frequency_data(conn)

if word_freq_df is not None:
    print("\n📊 Data Overview:")
    print(f"Years covered: {word_freq_df['year'].min()} to {word_freq_df['year'].max()}")
    print(f"Unique words: {word_freq_df['word'].nunique():,}")
    print(f"POS categories: {word_freq_df['pos_category'].nunique()}")
    
    print("\n📈 Yearly word counts:")
    print(yearly_totals.to_string(index=False))
else:
    print("❌ Failed to load data. Cannot proceed with analysis.")

## Relative Frequency Analysis

Calculate relative frequencies and identify words with significant yearly variations.
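
A word's baseline depends on which years enter its mean and standard deviation. If the database stores no row for years in which a word does not occur (an assumption; the extraction notebook determines this), the baseline is computed over observed years only. The sketch below, with hypothetical values and a hypothetical helper `z_scores`, compares that with counting absent years as zero:

```python
from statistics import mean, stdev

def z_scores(values):
    """Z-score of each value against the sample mean and sample std (n - 1)."""
    avg, sd = mean(values), stdev(values)
    return [(v - avg) / sd if sd > 0 else 0.0 for v in values]

all_years = [2017, 2018, 2019, 2020, 2021, 2022]
observed = {2020: 4e-5, 2021: 1e-5}   # hypothetical: word absent in other years

# Baseline from observed years only
z_observed = dict(zip(observed, z_scores(list(observed.values()))))

# Baseline with absent years counted as zero relative frequency
filled = {year: observed.get(year, 0.0) for year in all_years}
z_filled = dict(zip(filled, z_scores(list(filled.values()))))

print(f"2020, observed years only: z = {z_observed[2020]:.2f}")
print(f"2020, zero-filled:         z = {z_filled[2020]:.2f}")
```

With only two observed years the spike looks modest, while against a zero-filled baseline the same year stands out much more. Which treatment is appropriate depends on whether an absent row really means zero occurrences.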

In [None]:
def calculate_relative_frequencies(word_freq_df, yearly_totals):
    """
    Calculate relative frequencies and statistical measures for temporal uniqueness.
    
    Args:
        word_freq_df: DataFrame with word frequencies by year
        yearly_totals: DataFrame with total word counts per year
        
    Returns:
        DataFrame: Enhanced frequency data with relative measures
    """
    if word_freq_df is None or yearly_totals is None:
        return None
    
    print("Calculating relative frequencies and statistical measures...")
    
    # Merge with yearly totals
    df = word_freq_df.merge(yearly_totals, on='year')
    
    # Calculate relative frequency (normalized by year)
    df['relative_frequency'] = df['frequency'] / df['total_words']
    
    # Calculate expected frequency based on overall distribution
    word_stats = df.groupby('word').agg({
        'frequency': ['sum', 'mean', 'std'],
        'relative_frequency': ['mean', 'std'],
        'total_frequency': 'first',
        'pos_category': 'first',
        'lemma': 'first'
    }).reset_index()
    
    # Flatten column names
    word_stats.columns = ['word', 'total_freq', 'mean_freq', 'std_freq', 
                         'mean_rel_freq', 'std_rel_freq', 'total_frequency', 
                         'pos_category', 'lemma']
    
    # Merge back with main dataframe
    df = df.merge(word_stats[['word', 'mean_freq', 'std_freq', 'mean_rel_freq', 'std_rel_freq']], 
                  on='word')
    
    # Calculate Z-score for each word-year combination
    df['frequency_z_score'] = np.where(
        df['std_freq'] > 0,
        (df['frequency'] - df['mean_freq']) / df['std_freq'],
        0
    )
    
    # Calculate relative frequency Z-score
    df['rel_freq_z_score'] = np.where(
        df['std_rel_freq'] > 0,
        (df['relative_frequency'] - df['mean_rel_freq']) / df['std_rel_freq'],
        0
    )
    
    # Calculate lift (ratio to expected)
    df['frequency_lift'] = np.where(
        df['mean_freq'] > 0,
        df['frequency'] / df['mean_freq'],
        1
    )
    
    df['rel_freq_lift'] = np.where(
        df['mean_rel_freq'] > 0,
        df['relative_frequency'] / df['mean_rel_freq'],
        1
    )
    
    print(f"✓ Calculated relative frequencies for {len(df):,} word-year combinations")
    
    return df

# Calculate relative frequencies
enhanced_df = calculate_relative_frequencies(word_freq_df, yearly_totals)

if enhanced_df is not None:
    print("\n📊 Sample of enhanced frequency data:")
    sample_cols = ['word', 'year', 'frequency', 'relative_frequency', 'rel_freq_z_score', 'rel_freq_lift']
    print(enhanced_df[sample_cols].head(10).to_string(index=False))
    
    print("\n📈 Summary statistics:")
    print(f"Mean relative frequency Z-score: {enhanced_df['rel_freq_z_score'].mean():.3f}")
    print(f"Max relative frequency Z-score: {enhanced_df['rel_freq_z_score'].max():.3f}")
    print(f"Word-year combinations with Z-score > 2: {(enhanced_df['rel_freq_z_score'] > 2).sum():,}")
    print(f"Word-year combinations with Z-score > 3: {(enhanced_df['rel_freq_z_score'] > 3).sum():,}")

## Identify Unique Words Per Year

Find words that are significantly more frequent in specific years compared to their baseline usage.
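
When choosing a Z-score threshold, it helps to know that the number of years limits how large a Z-score can get. With the sample standard deviation (n - 1 in the denominator, the pandas default), no single observation out of n can score higher than (n - 1) / sqrt(n). The sketch below checks this bound against the extreme case of a word that occurs in one year only; the function names are hypothetical.

```python
import math
from statistics import mean, stdev

def max_attainable_z(n_years):
    """Largest Z-score any single year can reach with n_years observations."""
    return (n_years - 1) / math.sqrt(n_years)

def single_spike_z(n_years):
    """Z-score of the spike year when a word occurs in one year only."""
    values = [1.0] + [0.0] * (n_years - 1)
    return (values[0] - mean(values)) / stdev(values)

for n in (3, 5, 6, 10):
    print(f"{n:2d} years: bound = {max_attainable_z(n):.3f}, "
          f"single-spike z = {single_spike_z(n):.3f}")
```

A practical consequence: a word with rows in five or fewer years can never reach a threshold of 2.0, and six years are needed before the bound exceeds it.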

In [None]:
def identify_unique_words_per_year(enhanced_df, z_threshold=2.0, min_frequency=20):
    """
    Identify words that are uniquely frequent in specific years.
    
    Args:
        enhanced_df: DataFrame with relative frequency calculations
        z_threshold: Minimum Z-score for considering a word "unique" to a year
        min_frequency: Minimum absolute frequency to avoid noise
        
    Returns:
        tuple: (unique_words, top_per_year, word_temporal_stats), or
            (None, None, None) if no input data is available
    """
    if enhanced_df is None:
        return None, None, None
    
    print(f"Identifying unique words per year (Z-score >= {z_threshold}, frequency >= {min_frequency})...")
    
    # Filter for significant frequency spikes
    unique_words = enhanced_df[
        (enhanced_df['rel_freq_z_score'] >= z_threshold) & 
        (enhanced_df['frequency'] >= min_frequency)
    ].copy()
    
    # Sort by significance (Z-score)
    unique_words = unique_words.sort_values('rel_freq_z_score', ascending=False)
    
    print(f"✓ Found {len(unique_words):,} word-year combinations with significant frequency spikes")
    
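    # Top 20 words per year, ordered by year and then by descending Z-score.
    # groupby().head() keeps the 'year' column, which groupby().apply() may
    # drop in newer pandas versions.
    top_per_year = (
        unique_words
        .sort_values(['year', 'rel_freq_z_score'], ascending=[True, False])
        .groupby('year')
        .head(20)
        .reset_index(drop=True)
    )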
    
    # Calculate temporal specificity for each word
    word_temporal_stats = unique_words.groupby('word').agg({
        'year': ['count', 'nunique'],
        'rel_freq_z_score': ['max', 'mean'],
        'frequency': 'max',
        'pos_category': 'first',
        'lemma': 'first'
    }).reset_index()
    
    # Flatten column names
    word_temporal_stats.columns = ['word', 'spike_count', 'unique_years', 
                                  'max_z_score', 'mean_z_score', 'max_frequency',
                                  'pos_category', 'lemma']
    
    # Calculate temporal specificity: 1 / number of years with a spike,
    # so 1.0 means the word spikes in exactly one year
    word_temporal_stats['temporal_specificity'] = 1 / word_temporal_stats['unique_years']
    
    return unique_words, top_per_year, word_temporal_stats

# Identify unique words
unique_words, top_per_year, word_temporal_stats = identify_unique_words_per_year(enhanced_df)

if unique_words is not None:
    print("\n🎯 Top 10 most temporally unique words overall:")
    top_overall = unique_words.nlargest(10, 'rel_freq_z_score')
    display_cols = ['word', 'year', 'frequency', 'rel_freq_z_score', 'rel_freq_lift', 'pos_category']
    print(top_overall[display_cols].to_string(index=False))
    
    print("\n📅 Summary by year:")
    yearly_summary = unique_words.groupby('year').agg({
        'word': 'count',
        'rel_freq_z_score': 'mean'
    }).round(2)
    yearly_summary.columns = ['unique_words_count', 'avg_z_score']
    print(yearly_summary.to_string())

## Visualizations

Create visualizations to show temporal word usage patterns.

In [None]:
def create_temporal_visualizations(enhanced_df, unique_words, yearly_totals):
    """
    Create visualizations for temporal word usage patterns.
    
    Args:
        enhanced_df: DataFrame with all word frequency data
        unique_words: DataFrame with unique words per year
        yearly_totals: DataFrame with yearly totals
    """
    if enhanced_df is None or unique_words is None:
        print("Cannot create visualizations - data not available")
        return
    
    print("Creating temporal visualizations...")
    
    # Create output directory
    os.makedirs('output/visualizations', exist_ok=True)
    
    # 1. Yearly distribution of unique words
    plt.figure(figsize=(14, 8))
    
    plt.subplot(2, 2, 1)
    yearly_unique_counts = unique_words.groupby('year').size()
    plt.bar(yearly_unique_counts.index, yearly_unique_counts.values, alpha=0.7)
    plt.title('Number of Unique Words per Year\n(Z-score >= 2.0)')
    plt.xlabel('Year')
    plt.ylabel('Count of Unique Words')
    plt.xticks(rotation=45)
    
    # 2. Total word volume by year
    plt.subplot(2, 2, 2)
    plt.plot(yearly_totals['year'], yearly_totals['total_words'], marker='o', linewidth=2)
    plt.title('Total Word Volume by Year')
    plt.xlabel('Year')
    plt.ylabel('Total Words')
    plt.xticks(rotation=45)
    
    # 3. Z-score distribution
    plt.subplot(2, 2, 3)
    plt.hist(unique_words['rel_freq_z_score'], bins=30, alpha=0.7, edgecolor='black')
    plt.title('Distribution of Z-scores\n(Unique Words Only)')
    plt.xlabel('Relative Frequency Z-score')
    plt.ylabel('Count')
    
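    # 4. Top words by temporal specificity
    # Derived from unique_words so the function does not depend on the
    # global word_temporal_stats defined in an earlier cell. Fewest spike
    # years is equivalent to highest temporal specificity (1 / spike years).
    plt.subplot(2, 2, 4)
    spike_stats = unique_words.groupby('word')['rel_freq_z_score'].agg(
        spike_years='count', max_z_score='max'
    ).reset_index()
    top_specific = spike_stats.nsmallest(15, 'spike_years')
    plt.barh(range(len(top_specific)), top_specific['max_z_score'], alpha=0.7)
    plt.yticks(range(len(top_specific)), top_specific['word'])
    plt.title('Most Temporally Specific Words\n(Max Z-score)')
    plt.xlabel('Maximum Z-score')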
    
    plt.tight_layout()
    plt.savefig('output/visualizations/temporal_overview.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # 5. Detailed timeline for top words
    print("\nCreating detailed timeline for top unique words...")
    
    # Words behind the 8 highest Z-scores (fewer than 8 distinct words
    # if the same word appears more than once among them)
    top_words = unique_words.nlargest(8, 'rel_freq_z_score')['word'].unique()
    
    plt.figure(figsize=(16, 10))
    
    for i, word in enumerate(top_words):
        plt.subplot(2, 4, i+1)
        
        # Get all data for this word
        word_data = enhanced_df[enhanced_df['word'] == word].sort_values('year')
        
        # Plot relative frequency over time
        plt.plot(word_data['year'], word_data['relative_frequency'], 
                marker='o', linewidth=2, markersize=6)
        
        # Highlight years with high Z-scores
        unique_years = word_data[word_data['rel_freq_z_score'] >= 2.0]
        if len(unique_years) > 0:
            plt.scatter(unique_years['year'], unique_years['relative_frequency'], 
                       color='red', s=100, alpha=0.7, zorder=5)
        
        plt.title(f"'{word}'\n(POS: {word_data.iloc[0]['pos_category']})")
        plt.xlabel('Year')
        plt.ylabel('Relative Frequency')
        plt.xticks(rotation=45)
        
        # Format y-axis in scientific notation if very small
        if word_data['relative_frequency'].max() < 0.001:
            plt.ticklabel_format(axis='y', style='scientific', scilimits=(0,0))
    
    plt.tight_layout()
    plt.savefig('output/visualizations/top_words_timeline.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("✓ Visualizations saved to output/visualizations/")

# Create visualizations
create_temporal_visualizations(enhanced_df, unique_words, yearly_totals)

## Export Results

Export the unique words analysis results in various formats.
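
The per-year word lists written below are tab-separated text files with a `#` header line, a blank line, and then one `word<TAB>frequency<TAB>z-score` row per word. As a hedged sketch of how such a file could be read back for further processing, the following writes a small file in that layout to a temporary directory and parses it again. The rows are invented and the file name only mimics the export's naming pattern.

```python
import csv
import os
import tempfile

rows = [("lockdown", 40, 2.31), ("avondklok", 25, 2.05)]  # hypothetical values

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "unique_words_2020.txt")

    # Write a file in the same layout as the per-year export
    with open(path, "w", encoding="utf-8") as f:
        f.write("# Unique words for 2020 (Top 50 by Z-score)\n\n")
        for word, freq, z in rows:
            f.write(f"{word}\t{freq}\t{z:.2f}\n")

    # Read it back, skipping the comment header and blank lines
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(
            (line for line in f if line.strip() and not line.startswith("#")),
            delimiter="\t",
            quoting=csv.QUOTE_NONE,  # words may contain quote characters
        )
        parsed = [(word, int(freq), float(z)) for word, freq, z in reader]

print(parsed)
```

Note that a word that itself starts with `#` would be skipped by this reader; the CSV exports do not have that limitation.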

In [None]:
def export_unique_words_analysis(unique_words, top_per_year, word_temporal_stats, enhanced_df):
    """
    Export the unique words analysis results.
    
    Args:
        unique_words: DataFrame with all unique word-year combinations
        top_per_year: DataFrame with top unique words per year
        word_temporal_stats: DataFrame with temporal statistics per word
        enhanced_df: Complete enhanced frequency data
    """
    if unique_words is None:
        print("Cannot export - analysis data not available")
        return
    
    print("Exporting unique words analysis results...")
    
    # Create output directory
    output_dir = 'output/unique_words_analysis'
    os.makedirs(output_dir, exist_ok=True)
    
    # 1. Complete unique words dataset
    print("\n1. Exporting complete unique words dataset...")
    export_cols = ['word', 'lemma', 'pos_category', 'year', 'frequency', 
                   'relative_frequency', 'rel_freq_z_score', 'rel_freq_lift']
    unique_words[export_cols].to_csv(
        f'{output_dir}/unique_words_complete.csv', 
        index=False, encoding='utf-8'
    )
    print(f"   Exported {len(unique_words):,} unique word-year combinations")
    
    # 2. Top unique words per year
    print("\n2. Exporting top unique words per year...")
    top_per_year[export_cols].to_csv(
        f'{output_dir}/top_unique_words_per_year.csv', 
        index=False, encoding='utf-8'
    )
    print(f"   Exported top 20 unique words for each year")
    
    # 3. Word temporal statistics
    print("\n3. Exporting word temporal statistics...")
    word_temporal_stats.to_csv(
        f'{output_dir}/word_temporal_statistics.csv', 
        index=False, encoding='utf-8'
    )
    print(f"   Exported temporal statistics for {len(word_temporal_stats):,} words")
    
    # 4. Summary by year
    print("\n4. Creating yearly summary...")
    yearly_summary = unique_words.groupby('year').agg({
        'word': 'count',
        'rel_freq_z_score': ['mean', 'max'],
        'frequency': 'sum'
    }).round(3)
    
    yearly_summary.columns = ['unique_words_count', 'avg_z_score', 'max_z_score', 'total_frequency']
    yearly_summary.to_csv(f'{output_dir}/yearly_summary.csv', encoding='utf-8')
    print(f"   Exported yearly summary for {len(yearly_summary)} years")
    
    # 5. Simple text lists for each year
    print("\n5. Creating simple word lists per year...")
    year_lists_dir = f'{output_dir}/word_lists_by_year'
    os.makedirs(year_lists_dir, exist_ok=True)
    
    for year in sorted(unique_words['year'].unique()):
        year_words = unique_words[unique_words['year'] == year].nlargest(50, 'rel_freq_z_score')
        
        with open(f'{year_lists_dir}/unique_words_{year}.txt', 'w', encoding='utf-8') as f:
            f.write(f"# Unique words for {year} (Top 50 by Z-score)\n\n")
            for _, row in year_words.iterrows():
                f.write(f"{row['word']}\t{row['frequency']}\t{row['rel_freq_z_score']:.2f}\n")
    
    print(f"   Created word lists for {len(unique_words['year'].unique())} years")
    
    # 6. Analysis summary report
    print("\n6. Creating analysis summary report...")
    with open(f'{output_dir}/analysis_summary.txt', 'w', encoding='utf-8') as f:
        f.write("# Unique Words Per Year Analysis Summary\n\n")
        f.write(f"Analysis Date: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"Database: {db_path}\n\n")
        
        f.write("## Methodology\n")
        f.write("- Relative frequency analysis accounting for yearly word count variations\n")
        f.write("- Z-score calculation to identify statistically significant frequency spikes\n")
        f.write("- Minimum Z-score threshold: 2.0 (2 standard deviations above mean)\n")
        f.write("- Minimum absolute frequency: 20 occurrences\n\n")
        
        f.write("## Key Statistics\n")
        f.write(f"- Total unique word-year combinations analyzed: {len(enhanced_df):,}\n")
        f.write(f"- Significant temporal spikes identified: {len(unique_words):,}\n")
        f.write(f"- Unique words with temporal spikes: {unique_words['word'].nunique():,}\n")
        f.write(f"- Years covered: {unique_words['year'].min()} to {unique_words['year'].max()}\n")
        f.write(f"- Average Z-score of significant spikes: {unique_words['rel_freq_z_score'].mean():.2f}\n")
        f.write(f"- Maximum Z-score observed: {unique_words['rel_freq_z_score'].max():.2f}\n\n")
        
        f.write("## Top 10 Most Temporally Unique Words\n")
        top_10 = unique_words.nlargest(10, 'rel_freq_z_score')
        for i, (_, row) in enumerate(top_10.iterrows(), 1):
            f.write(f"{i:2d}. {row['word']} ({row['year']}) - Z-score: {row['rel_freq_z_score']:.2f}\n")
    
    print(f"   Created analysis summary report")
    
    print(f"\n✅ All exports completed successfully in: {output_dir}")
    print(f"\n📁 Generated files:")
    print(f"   - unique_words_complete.csv: Complete dataset")
    print(f"   - top_unique_words_per_year.csv: Top words per year")
    print(f"   - word_temporal_statistics.csv: Temporal statistics")
    print(f"   - yearly_summary.csv: Summary by year")
    print(f"   - word_lists_by_year/: Simple text lists per year")
    print(f"   - analysis_summary.txt: Methodology and key findings")

# Export results
export_unique_words_analysis(unique_words, top_per_year, word_temporal_stats, enhanced_df)

## Key Findings Summary

Summarize the key findings from the unique words per year analysis.

In [None]:
def generate_key_findings(unique_words, word_temporal_stats, yearly_totals):
    """
    Generate and display key findings from the analysis.
    
    Args:
        unique_words: DataFrame with unique word-year combinations
        word_temporal_stats: DataFrame with temporal statistics
        yearly_totals: DataFrame with yearly totals
    """
    if unique_words is None:
        print("Cannot generate findings - analysis data not available")
        return
    
    print("🔍 KEY FINDINGS: Unique Words Per Year Analysis")
    print("=" * 60)
    
    # 1. Overall statistics
    print("\n📊 OVERALL STATISTICS:")
    print(f"   • Significant temporal word spikes identified: {len(unique_words):,}")
    print(f"   • Unique words with temporal patterns: {unique_words['word'].nunique():,}")
    print(f"   • Average Z-score of spikes: {unique_words['rel_freq_z_score'].mean():.2f}")
    print(f"   • Strongest temporal spike (Z-score): {unique_words['rel_freq_z_score'].max():.2f}")
    
    # 2. Year with most unique words
    yearly_counts = unique_words.groupby('year').size()
    peak_year = yearly_counts.idxmax()
    peak_count = yearly_counts.max()
    
    print(f"\n📅 TEMPORAL PATTERNS:")
    print(f"   • Year with most unique words: {peak_year} ({peak_count} words)")
    print(f"   • Years analyzed: {unique_words['year'].min()} to {unique_words['year'].max()}")
    
    # Word volume correlation
    if yearly_totals is not None:
        corr_data = yearly_counts.to_frame('unique_count').merge(
            yearly_totals.set_index('year'), left_index=True, right_index=True
        )
        correlation = corr_data['unique_count'].corr(corr_data['total_words'])
        print(f"   • Correlation between unique words and total volume: {correlation:.3f}")
    
    # 3. Top temporally specific words
    print(f"\n🎯 MOST TEMPORALLY SPECIFIC WORDS:")
    most_specific = word_temporal_stats.nlargest(5, 'temporal_specificity')
    for i, (_, row) in enumerate(most_specific.iterrows(), 1):
        print(f"   {i}. '{row['word']}' (POS: {row['pos_category']}) - "
              f"Max Z-score: {row['max_z_score']:.2f}")
    
    # 4. Most frequent unique words
    print(f"\n📈 HIGHEST FREQUENCY UNIQUE WORDS:")
    high_freq = unique_words.nlargest(5, 'frequency')
    for i, (_, row) in enumerate(high_freq.iterrows(), 1):
        print(f"   {i}. '{row['word']}' in {row['year']} - "
              f"Frequency: {row['frequency']:,}, Z-score: {row['rel_freq_z_score']:.2f}")
    
    # 5. POS category distribution
    print(f"\n📝 PART-OF-SPEECH DISTRIBUTION:")
    pos_dist = unique_words['pos_category'].value_counts().head(5)
    total_unique = len(unique_words)
    for pos, count in pos_dist.items():
        percentage = (count / total_unique) * 100
        print(f"   • {pos}: {count:,} ({percentage:.1f}%)")
    
    # 6. Sample unique words by year
    print(f"\n📋 SAMPLE UNIQUE WORDS BY YEAR:")
    for year in sorted(unique_words['year'].unique())[-3:]:  # Last 3 years
        year_top = unique_words[unique_words['year'] == year].nlargest(3, 'rel_freq_z_score')
        words_list = ", ".join([f"'{row['word']}'" for _, row in year_top.iterrows()])
        print(f"   • {year}: {words_list}")
    
    print(f"\n" + "=" * 60)
    print(f"📁 Detailed results exported to: output/unique_words_analysis/")
    print(f"📈 Visualizations saved to: output/visualizations/")

# Generate key findings
generate_key_findings(unique_words, word_temporal_stats, yearly_totals)

# Close database connection
if conn:
    conn.close()
    print("\n✓ Database connection closed")

## Conclusion

This analysis identifies words that are used relatively more in specific years than their baseline usage across the entire corpus would suggest. The methodology accounts for:

1. **Yearly volume bias**: Normalizes frequencies by total word count per year
2. **Statistical significance**: Uses Z-scores (default threshold 2.0) to flag unusually high usage; this is a heuristic cutoff rather than a formal significance test
3. **Temporal specificity**: Measures how concentrated a word's usage is in particular time periods

### Key Outputs Generated:

- **Complete dataset** of words with significant temporal spikes
- **Year-by-year analysis** showing unique words for each time period
- **Statistical measures** including Z-scores and lift ratios
- **Visualizations** showing temporal patterns and trends
- **Export files** in CSV and text formats for further analysis

### Applications:

- **Historical analysis**: Understanding language evolution and current events
- **Content categorization**: Identifying time-specific terminology
- **Trend detection**: Spotting emerging or declining word usage
- **Research**: Supporting linguistic and sociological studies

This analysis provides a starting point for understanding how language usage varies over time and which words serve as temporal markers in the Dutch news corpus.