# **Notebook 3: Text Analysis for Historians**

Welcome to text analysis! In this notebook, you'll learn to analyze large text collections using modern computational tools. We'll explore how religious discourse has evolved in US Presidential inaugural addresses from 1789 to 2021.

**What you'll learn:**
- Modern text processing with SpaCy
- Document comparison and analysis
- N-gram analysis for phrase patterns
- Temporal visualization of text trends
- Professional text analysis workflows

**Why this matters for historians:**
These skills let you analyze thousands of documents, track language changes over time, and discover patterns that would be impossible to see manually. You'll be able to ask questions like: "How has presidential religious language changed since Washington?"

**Our research question:**
How has religious discourse in US Presidential inaugural addresses evolved from 1789 to 2021?

In [None]:
# Install and import our modern text analysis libraries
print("📚 Installing modern text analysis libraries...")
!pip install spacy scikit-learn matplotlib seaborn pandas --quiet
!python -m spacy download en_core_web_sm --quiet
print("✅ Installation complete!")

# Import the libraries we'll use
import spacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from collections import Counter
import re
import requests
from io import StringIO

# Load SpaCy's English model
nlp = spacy.load("en_core_web_sm")

print("✅ Libraries imported successfully!")
print("🔤 SpaCy: Modern text processing")
print("📊 Scikit-learn: Document analysis and comparison")
print("📈 Matplotlib/Seaborn: Data visualization")
print("🐼 Pandas: Data manipulation")

## Step 1: Loading the Presidential Inaugural Corpus

We'll use the US Presidential Inaugural Address corpus, which contains 59 speeches from Washington (1789) to Biden (2021). This is perfect for analyzing how language has changed over more than two centuries.

**Step 1a: Download the corpus data**

First, let's get the inaugural address corpus. For this demonstration we build a small sample with the same structure as the full corpus, which includes both the text and metadata for each speech.

In [None]:
# Load the inaugural address corpus
# We'll create a sample dataset that matches the quanteda corpus structure

def load_inaugural_corpus():
    """Load US Presidential Inaugural Address corpus with metadata"""
    
    # Sample data structure - in a real implementation, this would load from CSV or API
    # This represents the key speeches across different eras for demonstration
    corpus_data = [
        {
            'Year': 1789, 'President': 'Washington', 'FirstName': 'George', 'Party': 'Nonpartisan',
            'text': 'Almighty Being who rules over the universe divine Providence has honored the American people divine Author of every good and perfect gift divine blessing divine guidance religious obligations'
        },
        {
            'Year': 1861, 'President': 'Lincoln', 'FirstName': 'Abraham', 'Party': 'Republican',
            'text': 'Almighty has His own purposes divine attributes justice of the Almighty that God gives to both North and South this terrible war religious duty under God'
        },
        {
            'Year': 1933, 'President': 'Roosevelt', 'FirstName': 'Franklin D.', 'Party': 'Democratic',
            'text': 'with the help of God nation asks for action under the guidance of Almighty God social justice divine providence blessed with natural resources'
        },
        {
            'Year': 1961, 'President': 'Kennedy', 'FirstName': 'John F.', 'Party': 'Democratic',
            'text': 'for God and country God willing responsibility to God and man divine power which has lighted the world God bless America almighty God'
        },
        {
            'Year': 2021, 'President': 'Biden', 'FirstName': 'Joseph R.', 'Party': 'Democratic',
            'text': 'may God bless America and may God protect our troops prayer for our country God willing we will overcome God bless you all'
        }
    ]
    
    # In practice, you would load the full corpus like this:
    # corpus_url = "https://raw.githubusercontent.com/quanteda/quanteda.corpora/master/data-raw/data_corpus_inaugural.csv"
    # df = pd.read_csv(corpus_url)
    
    # For this demo, we'll use our sample data
    df = pd.DataFrame(corpus_data)
    print(f"📚 Loaded {len(df)} inaugural addresses")
    print(f"📅 Date range: {df['Year'].min()} to {df['Year'].max()}")
    print(f"🏛️ Presidents included: {', '.join(df['President'].tolist())}")
    
    return df

# Load the corpus
inaugural_df = load_inaugural_corpus()

# Display basic information
print(f"\n📊 Corpus Overview:")
print(inaugural_df[['Year', 'President', 'Party']].to_string(index=False))

print(f"\n💡 Note: This is a sample for demonstration. The full corpus contains 59 speeches!")
print(f"   In practice, you'd load the complete dataset with all presidents.")
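
Before switching to the full corpus, it can help to confirm that a downloaded file has the columns this notebook relies on. The following is a minimal sketch using only the standard library. The helper name `check_corpus_columns` and the inline one-row CSV are illustrative assumptions, not part of the quanteda dataset.

```python
import csv
from io import StringIO

# Columns the rest of this notebook reads from the corpus
REQUIRED_COLUMNS = {"Year", "President", "FirstName", "Party", "text"}

def check_corpus_columns(csv_file):
    """Return the set of required columns missing from a CSV file object."""
    reader = csv.DictReader(csv_file)
    present = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - present

# Hypothetical stand-in for a downloaded file (not real corpus data)
sample_csv = StringIO(
    "Year,President,FirstName,Party,text\n"
    "1789,Washington,George,Nonpartisan,sample text\n"
)
missing = check_corpus_columns(sample_csv)
print("Missing columns:", sorted(missing) if missing else "none")
```

If the check reports missing columns, rename or map them before running the later cells, since those cells index the DataFrame by these exact names.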

## Step 2: Modern Text Processing with SpaCy

SpaCy is a widely used open-source library for natural language processing. It provides production-grade tokenization, lemmatization, and part-of-speech tagging, which goes well beyond simple word splitting.

**Step 2a: Understanding SpaCy's capabilities**

Let's see what SpaCy can tell us about a presidential text:

In [None]:
# Demonstrate SpaCy's text processing capabilities
sample_text = "Almighty God has blessed America with divine providence and religious freedom"

# Process the text with SpaCy
doc = nlp(sample_text)

print("🔤 SpaCy Text Analysis Demonstration:")
print("=" * 50)
print(f"Original text: {sample_text}")
print()

# Show what SpaCy extracts
print("📝 Token Analysis:")
for token in doc:
    print(f"  '{token.text}' -> Lemma: '{token.lemma_}', POS: {token.pos_}, Stop: {token.is_stop}")

print(f"\n🏷️ Named Entities Found:")
for ent in doc.ents:
    print(f"  '{ent.text}' -> {ent.label_} ({spacy.explain(ent.label_)})")

print(f"\n💡 Key SpaCy Features:")
print("  - Lemmatization: Converts words to root form (blessed -> bless)")
print("  - POS tagging: Identifies parts of speech")
print("  - Stop word detection: Identifies common words to filter")
print("  - Named entity recognition: Finds people, places, organizations")
print("  - Much more accurate than simple .split() and .lower() approaches!")
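
To see what the simple `.split()` and `.lower()` approach mentioned above actually produces, here is a small baseline sketch using only the standard library. The function name `naive_tokens` and the example sentence are illustrative assumptions, not part of the notebook's pipeline.

```python
import string

def naive_tokens(text):
    """Baseline tokenizer: lowercase, split on whitespace, strip punctuation."""
    tokens = []
    for raw in text.lower().split():
        cleaned = raw.strip(string.punctuation)
        if cleaned:
            tokens.append(cleaned)
    return tokens

# Illustrative sentence (not from the corpus)
example = "God's blessings have blessed America, and God blesses us still."
tokens = naive_tokens(example)
print(tokens)

# The baseline treats these as three unrelated words; a lemmatizer would
# typically map the verb forms to a shared root such as 'bless'
bless_forms = sorted(t for t in tokens if t.startswith("bless"))
print(bless_forms)
```

The baseline also keeps "god's" and "god" as separate tokens, which is the kind of inconsistency that lemmatization and proper tokenization are meant to reduce.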

**Step 2b: Create a text processing function**

Now let's create a professional text processing function using SpaCy:

## Summary: Your Text Analysis Toolkit

🎉 **Congratulations!** You've mastered modern computational text analysis for historical research. You now have professional-level skills that can handle thousands of documents and reveal patterns invisible to traditional methods.

**Technical Skills Mastered:**
- ✅ **SpaCy processing**: Industrial-strength text preprocessing
- ✅ **Scikit-learn analysis**: Document comparison and n-gram extraction
- ✅ **Data visualization**: Temporal trend analysis with matplotlib/seaborn
- ✅ **Corpus analysis**: Large-scale text collection processing
- ✅ **Statistical analysis**: Quantitative historical research methods

**Historical Research Skills:**
- ✅ **Targeted vocabulary analysis**: Tracking specific themes over time
- ✅ **N-gram analysis**: Finding phrase patterns and evolution
- ✅ **Temporal analysis**: Understanding how language changes across periods
- ✅ **Comparative methods**: Analyzing differences between groups/eras
- ✅ **Quantitative interpretation**: Drawing historical conclusions from data

**Research Applications:**
- ✅ **Religious discourse analysis**: Ready methodology for similar studies
- ✅ **Political rhetoric evolution**: Framework for analyzing political language
- ✅ **Cross-temporal comparison**: Skills for studying long-term changes
- ✅ **Cross-national analysis**: Foundation for comparative historical studies

**Next Steps for Advanced Research:**
1. **Scale up**: Apply these methods to larger corpora (thousands of documents)
2. **Specialize**: Focus on specific historical themes (nationalism, democracy, etc.)
3. **Compare**: Build comparative studies across countries or institutions
4. **Innovate**: Develop new metrics and visualizations for historical questions
5. **Publish**: Share your findings with digital humanities communities

**Key Takeaway:**
You've learned to ask questions that would be impossible without computational methods: "How has religious language evolved across 230+ years of American political rhetoric?" These skills let you discover patterns, test hypotheses, and generate insights that transform historical understanding.

**The Digital Historian's Advantage:**
- Process vast amounts of text systematically
- Identify subtle patterns across long time periods  
- Quantify changes that seem intuitive but need proof
- Compare multiple dimensions simultaneously
- Generate reproducible, evidence-based conclusions

You're now equipped to tackle original digital history research projects!

In [None]:
# Bonus Challenge: Planning a Canadian Throne Speech Corpus

print("🍁 Bonus Project: Canadian Throne Speech Corpus")
print("=" * 60)

print("🎯 Project Goal:")
print("Create a corpus of Canadian Throne Speeches to compare with US Presidential inaugurals")

print(f"\n📚 Data Sources to Explore:")
sources = [
    {
        'name': 'Poltext Canadian Throne Speeches',
        'url': 'https://www.poltext.org/en/part-1-electronic-political-texts/canadian-throne-speeches',
        'description': 'Academic corpus with structured data',
        'advantages': ['Professional curation', 'Standardized format', 'Metadata included'],
        'approach': 'Download CSV/XML files, parse with pandas'
    },
    {
        'name': 'Parliament of Canada Archives',
        'url': 'https://www.parl.ca/DocumentViewer/en/house/sitting-hansard',
        'description': 'Official government transcripts',
        'advantages': ['Authoritative source', 'Complete coverage', 'Multiple formats'],
        'approach': 'Web scraping with Beautiful Soup + requests'
    },
    {
        'name': 'Library and Archives Canada',
        'url': 'https://www.bac-lac.gc.ca/',
        'description': 'National archives with digitized documents',
        'advantages': ['Historical depth', 'Original documents', 'Rich metadata'],
        'approach': 'API access or Internet Archive integration'
    }
]

for i, source in enumerate(sources, 1):
    print(f"\n{i}. {source['name']}")
    print(f"   URL: {source['url']}")
    print(f"   Description: {source['description']}")
    print(f"   Advantages: {', '.join(source['advantages'])}")
    print(f"   Technical approach: {source['approach']}")

print(f"\n🔧 Technical Implementation Plan:")
implementation_steps = [
    "1. Data Collection: Use web scraping or API to gather throne speeches",
    "2. Data Cleaning: Extract text, dates, and metadata using SpaCy",
    "3. Corpus Structure: Create pandas DataFrame similar to inaugural corpus", 
    "4. Analysis Pipeline: Apply same religious discourse analysis techniques",
    "5. Comparative Study: Compare Canadian vs. US religious political rhetoric",
    "6. Visualization: Create charts showing differences and similarities",
    "7. Historical Context: Interpret findings in light of different political systems"
]

for step in implementation_steps:
    print(f"  {step}")

print(f"\n🔍 Research Questions for Canadian Analysis:")
canadian_questions = [
    "How does religious language in Throne Speeches compare to US inaugurals?",
    "Do Canadian speeches show different temporal patterns?",
    "How do different Prime Ministers vary in religious rhetoric?",
    "Does the Westminster system influence religious language differently?",
    "How do Quebec/French Canadian influences affect religious discourse?"
]

for i, question in enumerate(canadian_questions, 1):
    print(f"  {i}. {question}")

print(f"\n💻 Code Template for Canadian Corpus:")
print("=" * 40)

# Template code structure
template_code = '''
# Step 1: Data collection function
def collect_throne_speeches():
    """Collect Canadian throne speeches from online sources"""
    # Your web scraping or API code here
    pass

# Step 2: Process Canadian texts  
def process_canadian_text(text):
    """Process Canadian political text with SpaCy"""
    # Apply same processing as US inaugurals
    # Consider bilingual content (English/French)
    pass

# Step 3: Comparative analysis
def compare_us_canada_discourse(us_data, canadian_data):
    """Compare religious discourse between countries"""
    # Statistical comparison
    # Visualization of differences
    # Historical interpretation
    pass

# Step 4: Bilingual analysis (advanced)
def analyze_french_english_differences():
    """Analyze differences between French and English throne speeches"""
    # Requires French SpaCy model: python -m spacy download fr_core_news_sm
    pass
'''

print(template_code)

print(f"\n🚀 Next Steps for Ambitious Students:")
next_steps = [
    "1. Choose a data source and examine its structure",
    "2. Write a simple web scraper or data downloader", 
    "3. Adapt the US inaugural analysis code for Canadian data",
    "4. Create comparative visualizations",
    "5. Write up findings as a research paper or blog post",
    "6. Share your corpus with other digital historians!"
]

for step in next_steps:
    print(f"  {step}")

print(f"\n💡 This project combines:")
print("  ✅ All the web scraping skills from Notebook 2")
print("  ✅ All the text analysis skills from Notebook 3")
print("  ✅ Original historical research")
print("  ✅ Cross-national comparative analysis")
print("  🇨🇦 Contributing to Canadian digital humanities!")

print(f"\n📧 If you build this corpus, consider sharing it with:")
print("  - Canadian political science researchers")
print("  - Digital humanities communities") 
print("  - The Programming Historian")
print("  - Government of Canada open data initiatives")

## 🍁 Bonus Challenge: Building a Canadian Corpus

Ready for an advanced project? Let's plan how to create your own corpus of Canadian political texts using the skills you've learned.

**Goal**: Build a corpus of Canadian Throne Speeches for comparative analysis with US inaugurals.

In [None]:
# Final Challenge: Your Historical Research Project

print("🔬 Final Challenge: Comparative Historical Analysis")
print("=" * 60)

# Research question suggestions
research_questions = [
    "How does religious language differ between Republican and Democratic presidents?",
    "Do crisis periods (wars, depressions) correlate with increased religious rhetoric?",
    "Which religious concepts (divine, God, blessing) are most common across eras?",
    "How has the formality of religious language changed over time?",
    "Do longer inaugurals contain proportionally more religious content?"
]

print("🎯 Suggested Research Questions:")
for i, question in enumerate(research_questions, 1):
    print(f"  {i}. {question}")

print(f"\n📋 Your Task:")
print("1. Choose a research question (or create your own)")
print("2. Use the analysis techniques from this notebook")
print("3. Create visualizations to support your findings")
print("4. Write a brief historical interpretation")

# Example analysis: Party comparison
print(f"\n📊 Example Analysis: Religious Language by Political Party")
print("=" * 50)

# Filter for speeches with party data (excluding Washington who was nonpartisan)
party_data = inaugural_df[inaugural_df['Party'] != 'Nonpartisan'].copy()

if len(party_data) > 0:
    party_comparison = party_data.groupby('Party').agg({
        'religious_density': ['mean', 'std', 'count'],
        'religious_count': 'mean'
    }).round(2)
    
    print("Religious density by party:")
    print(party_comparison)
    
    # Simple party comparison visualization
    plt.figure(figsize=(10, 6))
    
    # Box plot comparing parties
    plt.subplot(1, 2, 1)
    party_groups = [group['religious_density'].values for name, group in party_data.groupby('Party')]
    party_names = list(party_data.groupby('Party').groups.keys())
    
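    # Label the boxes via xticks (boxplot positions start at 1) so this works
    # across matplotlib versions; the `labels` keyword is deprecated in 3.9+
    plt.boxplot(party_groups)
    plt.title('Religious Density Distribution by Party')
    plt.ylabel('Religious Density (%)')
    plt.xticks(range(1, len(party_names) + 1), party_names, rotation=45)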
    
    # Time series by party
    plt.subplot(1, 2, 2)
    for party in party_names:
        party_subset = party_data[party_data['Party'] == party]
        plt.plot(party_subset['Year'], party_subset['religious_density'], 
                'o-', label=party, alpha=0.7)
    
    plt.title('Religious Density Over Time by Party')
    plt.xlabel('Year')
    plt.ylabel('Religious Density (%)')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Your turn - add your own analysis here!
print(f"\n🚀 Your Analysis Space:")
print("=" * 30)
print("# Customize this section for your research question")
print("# Use the functions and techniques from earlier cells")
print("# Examples:")
print("# - Compare different time periods")
print("# - Analyze specific religious themes")
print("# - Track evolution of particular phrases")
print("# - Examine correlation with historical events")

# Template for student analysis
your_research_question = "Your research question here"
print(f"\n📝 Research Question: {your_research_question}")

# Add your analysis code here:
# your_data = inaugural_df[some_filter]
# your_results = some_analysis(your_data)  
# create_your_visualization(your_results)

print(f"\n📚 Research Findings:")
print("1. [Your first finding]")
print("2. [Your second finding]") 
print("3. [Your interpretation of the historical significance]")
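
The party comparison above groups speeches and reports mean, standard deviation, and count. As a rough illustration of what that aggregation computes, here is a standard-library sketch. The `records` values are hypothetical numbers chosen for easy arithmetic, not results from the corpus.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (party, density) pairs for illustration only
records = [
    ("Democratic", 30.0),
    ("Democratic", 40.0),
    ("Democratic", 50.0),
    ("Republican", 35.0),
]

# Group densities by party
by_party = defaultdict(list)
for party, density in records:
    by_party[party].append(density)

summary = {}
for party, values in by_party.items():
    # Sample standard deviation needs at least two observations
    spread = stdev(values) if len(values) > 1 else None
    summary[party] = {"mean": mean(values), "std": spread, "count": len(values)}

for party, stats in summary.items():
    print(party, stats)
```

With the five-speech sample, only one Republican address remains after filtering, so pandas reports `NaN` for that group's standard deviation. Treat any party comparison on so few speeches as a demonstration of the method rather than a finding.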

## Step 6: Final Challenge - Comparative Historical Analysis

Now combine all your skills to conduct a comprehensive analysis comparing different historical periods or presidents.

**Your research project:**
Use the techniques you've learned to answer a historical question about religious discourse in presidential inaugurals.

In [None]:
# Create visualizations of religious discourse trends
plt.style.use('default')  # Clean, professional style
plt.figure(figsize=(12, 8))

# Plot 1: Religious density over time
plt.subplot(2, 2, 1)
plt.plot(inaugural_df['Year'], inaugural_df['religious_density'], 'o-', linewidth=2, markersize=6)
plt.title('Religious Discourse Density Over Time', fontsize=12, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('Religious Density (%)')
plt.grid(True, alpha=0.3)

# Annotate some key points
for idx, row in inaugural_df.iterrows():
    if row['religious_density'] > inaugural_df['religious_density'].mean() + 5:  # High points
        plt.annotate(f"{row['President']}\n{row['religious_density']:.1f}%", 
                    (row['Year'], row['religious_density']),
                    xytext=(10, 10), textcoords='offset points',
                    fontsize=8, ha='left')

# Plot 2: Raw religious word counts
plt.subplot(2, 2, 2)
plt.bar(inaugural_df['Year'], inaugural_df['religious_count'], alpha=0.7)
plt.title('Number of Religious Words per Speech', fontsize=12, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('Religious Word Count')
plt.xticks(rotation=45)

# Plot 3: Religious words vs. speech length
plt.subplot(2, 2, 3)
plt.scatter(inaugural_df['token_count'], inaugural_df['religious_count'], 
           s=60, alpha=0.7, c=inaugural_df['Year'], cmap='viridis')
plt.title('Religious Words vs. Speech Length', fontsize=12, fontweight='bold')
plt.xlabel('Total Words in Speech')
plt.ylabel('Religious Words')
plt.colorbar(label='Year')

# Add trend line
z = np.polyfit(inaugural_df['token_count'], inaugural_df['religious_count'], 1)
p = np.poly1d(z)
plt.plot(inaugural_df['token_count'], p(inaugural_df['token_count']), "r--", alpha=0.8)

# Plot 4: Historical periods comparison
plt.subplot(2, 2, 4)
# Create era categories
inaugural_df['Era'] = pd.cut(inaugural_df['Year'], 
                            bins=[1780, 1850, 1900, 1950, 2030],
                            labels=['Early Republic\n(1789-1850)', 'Civil War Era\n(1851-1900)', 
                                   'Modern Era\n(1901-1950)', 'Contemporary\n(1951-2021)'])

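# observed=False keeps all four eras on the axis and avoids the pandas
# FutureWarning about the changing default for categorical groupers
era_avg = inaugural_df.groupby('Era', observed=False)['religious_density'].mean()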
plt.bar(range(len(era_avg)), era_avg.values, alpha=0.7)
plt.title('Average Religious Density by Era', fontsize=12, fontweight='bold')
plt.xlabel('Historical Era')
plt.ylabel('Average Religious Density (%)')
plt.xticks(range(len(era_avg)), era_avg.index, rotation=45)

plt.tight_layout()
plt.show()

print("📊 Visualization Insights:")
print("=" * 40)
print("📈 Top-left: Religious density trend over time")
print("📊 Top-right: Raw counts of religious words per speech")
print("🔍 Bottom-left: Relationship between speech length and religious content")
print("📅 Bottom-right: Comparison across historical eras")
print(f"\n💡 Key patterns to notice:")
print(f"  - Do religious references increase or decrease over time?")
print(f"  - Are longer speeches more religious?")
print(f"  - Which historical eras had the most religious rhetoric?")
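
The era chart relies on `pd.cut`, whose default bins are closed on the right, so a speech from 1850 falls in the first era and one from 1851 in the second. The sketch below mimics that rule with the standard library to make the boundary behavior explicit. The function name `era_for` is an illustrative assumption, and the labels are shortened versions of the ones used in the plot.

```python
from bisect import bisect_left

# Same bin edges as the pd.cut call above
EDGES = [1780, 1850, 1900, 1950, 2030]
LABELS = ["Early Republic", "Civil War Era", "Modern Era", "Contemporary"]

def era_for(year):
    """Map a year to an era using right-closed bins, e.g. (1780, 1850]."""
    idx = bisect_left(EDGES, year) - 1
    if 0 <= idx < len(LABELS):
        return LABELS[idx]
    return None  # outside all bins

for year in (1789, 1850, 1851, 1961, 2021):
    print(year, era_for(year))
```

Checking boundary years this way is useful when you redefine the eras for your own research question, because an off-by-one edge can move a speech into a different period.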

## Step 5: Visualizing Temporal Trends

Data visualization helps us see patterns that are hard to spot in tables. Let's create charts showing how religious discourse has changed over time.

**Step 5a: Religious density over time**

In [None]:
# Your exercise: Track specific religious phrases over time
def track_phrase_over_time(df, phrase):
    """Track occurrences of a specific phrase across speeches"""
    results = []
    
    for idx, row in df.iterrows():
        text_lower = row['text'].lower()
        phrase_count = text_lower.count(phrase.lower())
        
        results.append({
            'Year': row['Year'],
            'President': row['President'],
            'Phrase': phrase,
            'Count': phrase_count,
            'Present': phrase_count > 0
        })
    
    return pd.DataFrame(results)

# Track some key religious phrases
phrases_to_track = ['god bless', 'divine providence', 'almighty god']

print("🎯 Tracking Specific Religious Phrases")
print("=" * 50)

for phrase in phrases_to_track:
    phrase_data = track_phrase_over_time(inaugural_df, phrase)
    total_uses = phrase_data['Count'].sum()
    presidents_using = phrase_data[phrase_data['Present']]['President'].tolist()
    
    print(f"\n📝 Phrase: '{phrase}'")
    print(f"   Total uses: {total_uses}")
    print(f"   Presidents using it: {', '.join(presidents_using) if presidents_using else 'None'}")
    
    # Show specific occurrences
    for idx, row in phrase_data.iterrows():
        if row['Count'] > 0:
            print(f"     {row['Year']} {row['President']}: {row['Count']} time(s)")

# Try tracking your own phrase:
# my_phrase = "under god"  # Example
# my_results = track_phrase_over_time(inaugural_df, my_phrase)
# print(f"\nYour phrase '{my_phrase}' analysis:")
# print(my_results[my_results['Count'] > 0])
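
One caveat with `str.count`, as used in `track_phrase_over_time`, is that it matches substrings, so searching for "god" also counts "godly". If whole-word matches matter for your phrase, a regular expression with word boundaries is one option. This is a minimal sketch; `count_phrase` is a hypothetical helper and the example text is invented.

```python
import re

def count_phrase(text, phrase):
    """Count whole-word, case-insensitive occurrences of a phrase."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Illustrative text (not from the corpus)
text = "Under God we stand. A godly people thank God."
substring_count = text.lower().count("god")
whole_word_count = count_phrase(text, "god")

print("substring count:", substring_count)
print("whole-word count:", whole_word_count)
```

For multi-word phrases such as "god bless" the two approaches usually agree, so the difference matters mostly for short single words.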

### 🔄 **Your Turn: Track Specific Religious Phrases**

Choose a religious phrase and track its usage across different time periods:

In [None]:
# N-gram analysis for religious phrases
from sklearn.feature_extraction.text import CountVectorizer

def extract_ngrams(text, n=2):
    """Extract n-grams from text using scikit-learn"""
    # Use CountVectorizer to extract n-grams
    vectorizer = CountVectorizer(
        ngram_range=(n, n),  # Only n-grams of length n
        stop_words='english',
        lowercase=True,
        token_pattern=r'[a-zA-Z]+',  # Only alphabetic tokens
        min_df=1  # Minimum document frequency
    )
    
    # Fit and transform the text
    ngram_matrix = vectorizer.fit_transform([text])
    
    # Get the n-grams and their counts
    feature_names = vectorizer.get_feature_names_out()
    counts = ngram_matrix.toarray()[0]
    
    # Create list of (ngram, count) tuples
    ngrams_with_counts = list(zip(feature_names, counts))
    
    # Sort by count (descending)
    ngrams_with_counts.sort(key=lambda x: x[1], reverse=True)
    
    return ngrams_with_counts

def find_religious_ngrams(ngrams_list, religious_vocab=all_religious_words):
    """Filter n-grams that contain religious vocabulary"""
    religious_ngrams = []
    
    for ngram, count in ngrams_list:
        # Check if any word in the n-gram is religious
        words_in_ngram = ngram.split()
        if any(word in religious_vocab for word in words_in_ngram):
            religious_ngrams.append((ngram, count))
    
    return religious_ngrams

print("📝 N-gram Analysis: Finding Religious Phrases")
print("=" * 60)

# Analyze bigrams (2-word phrases) across all speeches
print("🔍 Analyzing 2-grams (bigrams)...")

# Combine all speech texts for corpus-wide analysis
all_texts = ' '.join(inaugural_df['text'].tolist())

# Extract bigrams
bigrams = extract_ngrams(all_texts, n=2)
religious_bigrams = find_religious_ngrams(bigrams)

print(f"\nTop religious bigrams:")
for bigram, count in religious_bigrams[:10]:
    print(f"  '{bigram}': {count} occurrences")

# Extract trigrams (3-word phrases)
print(f"\n🔍 Analyzing 3-grams (trigrams)...")
trigrams = extract_ngrams(all_texts, n=3)
religious_trigrams = find_religious_ngrams(trigrams)

print(f"\nTop religious trigrams:")
for trigram, count in religious_trigrams[:8]:
    print(f"  '{trigram}': {count} occurrences")

print(f"\n💡 N-gram insights:")
print(f"  - Bigrams reveal common religious phrases")
print(f"  - Trigrams show complete religious expressions")
print(f"  - Frequency indicates which phrases are most traditional")
print(f"  - Perfect for tracking phrase evolution over time!")
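
To make the idea of an n-gram concrete, here is a sliding-window sketch in plain Python. It is not how `CountVectorizer` is implemented internally, just an illustration of the concept. The function name `ngrams` and the token list are assumptions for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token windows as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative tokens (not from the corpus)
tokens = "may god bless america and may god bless you".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(3))
print(trigram_counts.most_common(3))
```

Note one difference from the cell above: with `stop_words='english'`, scikit-learn removes stop words before forming n-grams, so a reported bigram can join two words that were not adjacent in the original speech. Joining all speeches into one string can likewise create n-grams that span the boundary between two addresses.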

## Step 4: N-gram Analysis - Finding Religious Phrases

Individual words tell part of the story, but phrases reveal deeper patterns. Let's analyze 2-grams (bigrams) and 3-grams (trigrams) to find religious phrases like "God bless America" or "divine providence."

**Step 4a: Extract n-grams from speeches**

In [None]:
# Analyze religious discourse across all speeches
print("🔍 Analyzing religious discourse across all inaugurals...")

# Add religious analysis columns
inaugural_df['religious_count'] = inaugural_df['processed_tokens'].apply(
    lambda tokens: count_religious_words(tokens)[0]
)

inaugural_df['religious_words'] = inaugural_df['processed_tokens'].apply(
    lambda tokens: count_religious_words(tokens)[1]
)

inaugural_df['religious_density'] = (
    inaugural_df['religious_count'] / inaugural_df['token_count'] * 100
).round(1)

# Display results
print(f"\n📊 Religious Discourse Analysis Results:")
print("=" * 70)
for idx, row in inaugural_df.iterrows():
    print(f"{row['Year']} {row['President']:<12}: {row['religious_count']:2d} religious words "
          f"({row['religious_density']:4.1f}% density)")
    
    # Show specific religious words found
    unique_religious = list(set(row['religious_words']))
    if unique_religious:
        print(f"{'':26} Words: {', '.join(unique_religious[:6])}")
        if len(unique_religious) > 6:
            print(f"{'':26} + {len(unique_religious) - 6} more...")
    print()

# Calculate summary statistics
avg_density = inaugural_df['religious_density'].mean()
max_religious = inaugural_df.loc[inaugural_df['religious_density'].idxmax()]
min_religious = inaugural_df.loc[inaugural_df['religious_density'].idxmin()]

print(f"📈 Summary Statistics:")
print(f"  Average religious density: {avg_density:.1f}%")
print(f"  Highest religious content: {max_religious['President']} ({max_religious['Year']}) - {max_religious['religious_density']:.1f}%")
print(f"  Lowest religious content: {min_religious['President']} ({min_religious['Year']}) - {min_religious['religious_density']:.1f}%")

**Step 3b: Analyze religious discourse across all speeches**

Now let's apply this analysis to all inaugural addresses to see temporal patterns:

In [None]:
# Define religious vocabulary categories
religious_vocabulary = {
    'Divine References': ['god', 'almighty', 'divine', 'lord', 'providence', 'creator', 'heaven', 'holy'],
    'Religious Actions': ['pray', 'prayer', 'bless', 'blessing', 'worship', 'faith', 'believe'],
    'Religious Concepts': ['religious', 'sacred', 'holy', 'spiritual', 'righteous', 'moral', 'virtue'],
    'Biblical/Christian': ['jesus', 'christ', 'christian', 'bible', 'scripture', 'gospel', 'salvation']
}

# Flatten the vocabulary for easy searching; dict.fromkeys drops duplicates
# (e.g. 'holy' appears in two categories) while preserving order
all_religious_words = list(dict.fromkeys(
    word for words in religious_vocabulary.values() for word in words
))

print("🙏 Religious Vocabulary Analysis")
print("=" * 50)
print("📖 Religious word categories:")
for category, words in religious_vocabulary.items():
    print(f"  {category}: {', '.join(words)}")

print(f"\n📊 Total religious terms tracked: {len(all_religious_words)}")

# Function to count religious words in a text
def count_religious_words(tokens, religious_vocab=all_religious_words):
    """Count religious words in processed tokens"""
    religious_count = 0
    found_words = []
    
    for token in tokens:
        if token in religious_vocab:
            religious_count += 1
            found_words.append(token)
    
    return religious_count, found_words

# Test with Washington's speech
washington_tokens = inaugural_df[inaugural_df['President'] == 'Washington']['processed_tokens'].iloc[0]
rel_count, rel_words = count_religious_words(washington_tokens)

print(f"\n🔍 Test with Washington (1789):")
print(f"  Religious words found: {rel_count}")
print(f"  Specific words: {', '.join(set(rel_words))}")
print(f"  Religious density: {rel_count/len(washington_tokens)*100:.1f}% of all words")
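
The density figure printed above is the number of matching tokens divided by the total number of tokens, times 100. The short sketch below works through that arithmetic on a hand-made token list. The tokens and the four-word vocabulary are illustrative assumptions, not actual SpaCy output.

```python
# Illustrative lemmatized tokens (not real corpus output)
tokens = ["almighty", "rule", "universe", "divine", "providence",
          "honor", "american", "people", "divine", "blessing"]

# A set gives constant-time membership checks
religious_terms = {"almighty", "divine", "providence", "blessing"}

matches = [t for t in tokens if t in religious_terms]

# Guard against empty documents to avoid dividing by zero
density = len(matches) / len(tokens) * 100 if tokens else 0.0

print(f"{len(matches)} of {len(tokens)} tokens -> {density:.1f}% density")
```

Because stop words are removed before counting, these densities are percentages of content words, not of all words in the speech, and will be higher than a figure computed on the raw text.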

## Step 3: Analyzing Religious Discourse

Now let's focus on our research question: How has religious language evolved in presidential inaugurals? We'll create a targeted analysis of religious vocabulary.

**Step 3a: Define religious vocabulary**

First, we need to identify what constitutes "religious" language:

In [None]:
# Your exercise: Process all inaugural addresses
print("🔄 Processing all inaugural addresses with SpaCy...")

# Add a column for processed tokens
inaugural_df['processed_tokens'] = inaugural_df['text'].apply(
    lambda text: process_text_with_spacy(text, remove_stop_words=True, lemmatize=True)
)

# Add a column for token count
inaugural_df['token_count'] = inaugural_df['processed_tokens'].apply(len)

# Display results
print(f"\n📊 Processing Results:")
print("=" * 60)
for idx, row in inaugural_df.iterrows():
    print(f"{row['Year']} {row['President']}: {row['token_count']} processed tokens")
    print(f"  Sample tokens: {row['processed_tokens'][:8]}...")
    print()

print(f"✅ Successfully processed {len(inaugural_df)} inaugural addresses!")
print(f"📈 Total unique vocabulary across all speeches: {len(set([token for tokens in inaugural_df['processed_tokens'] for token in tokens]))}")

# Try with your own processing settings:
# inaugural_df['tokens_no_lemma'] = inaugural_df['text'].apply(
#     lambda text: process_text_with_spacy(text, remove_stop_words=True, lemmatize=False)
# )
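
The vocabulary count in the cell above packs a nested comprehension into one line. An equivalent and arguably more readable way to express it is a set union over the per-speech token lists, sketched here on made-up data standing in for `inaugural_df['processed_tokens']`.

```python
# Illustrative token lists standing in for inaugural_df['processed_tokens']
token_lists = [
    ["god", "bless", "america"],
    ["divine", "providence", "bless"],
    ["god", "nation"],
]

# Union of all per-speech token lists = corpus vocabulary
vocabulary = set().union(*token_lists)

print(len(vocabulary), sorted(vocabulary))
```

On the real DataFrame the same idea would be `set().union(*inaugural_df['processed_tokens'])`, assuming that column holds lists of strings as produced earlier.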

### 🔄 **Your Turn: Process the Entire Corpus**

Now apply SpaCy processing to all inaugural addresses in our corpus:

In [None]:
def process_text_with_spacy(text, remove_stop_words=True, lemmatize=True):
    """
    Process text using SpaCy for professional text analysis
    
    Parameters:
    - text: Input text string
    - remove_stop_words: Whether to filter out common words
    - lemmatize: Whether to convert words to root forms
    
    Returns:
    - List of processed tokens
    """
    # Process with SpaCy
    doc = nlp(text)
    
    processed_tokens = []
    
    for token in doc:
        # Skip punctuation and whitespace
        if token.is_punct or token.is_space:
            continue
            
        # Skip stop words if requested
        if remove_stop_words and token.is_stop:
            continue
            
        # Skip very short tokens
        if len(token.text) < 2:
            continue
            
        # Use lemma (root form) if requested, otherwise use original text
        if lemmatize:
            word = token.lemma_.lower()
        else:
            word = token.text.lower()
            
        # Only keep alphabetic tokens
        if word.isalpha():
            processed_tokens.append(word)
    
    return processed_tokens

# Test our function with Washington's sample text
washington_text = inaugural_df[inaugural_df['President'] == 'Washington']['text'].iloc[0]

print("🔍 Testing our SpaCy processing function:")
print("=" * 60)
print(f"Original text: {washington_text}")
print()

# Process with different settings
tokens_basic = process_text_with_spacy(washington_text, remove_stop_words=False, lemmatize=False)
tokens_full = process_text_with_spacy(washington_text, remove_stop_words=True, lemmatize=True)

print(f"Basic processing (no filtering): {len(tokens_basic)} tokens")
print(f"  {tokens_basic[:10]}...")

print(f"Full processing (stop words removed, lemmatized): {len(tokens_full)} tokens")
print(f"  {tokens_full[:10]}...")

print(f"\n✅ Notice how SpaCy gives us much cleaner, more meaningful tokens!")
print(f"   - Removes common words like 'the', 'who', 'has'")
print(f"   - Converts words to root forms (e.g., 'honored' -> 'honor')")
print(f"   - Filters out punctuation automatically")