# Text Style Analysis: Readability Features and Embeddings

This notebook demonstrates comprehensive **text style analysis** techniques, focusing on extracting quantitative features that capture different aspects of writing style. The analysis combines traditional readability metrics with modern embedding approaches, making it particularly valuable for Digital Humanities research into literary style, authorship, and textual complexity.

## What is Text Style Analysis?

**Text style analysis** examines the quantifiable characteristics of writing that reflect an author's unique voice, the complexity of the text, and various linguistic patterns. Unlike content analysis (which focuses on *what* is said), style analysis focuses on *how* it is said.

Key dimensions of style include:
- **Readability**: How easy or difficult a text is to understand
- **Lexical Richness**: Vocabulary diversity and sophistication  
- **Syntactic Complexity**: Sentence structure and grammatical patterns
- **Semantic Density**: Information density and conceptual complexity

## What You'll Learn

1. **Traditional Readability Metrics**: Computing established formulas like Flesch-Kincaid, SMOG, etc.
2. **Advanced Lexical Features**: Type-token ratios, hapax legomena, entropy measures
3. **Modern Embeddings**: Word2Vec, BERT, and OpenAI embeddings for style representation
4. **Feature Integration**: Combining multiple approaches for comprehensive style profiles

## Applications in Digital Humanities

### Literary Analysis
- **Authorship Attribution**: Distinguish between different authors based on style
- **Period Analysis**: Track stylistic changes across historical periods
- **Genre Classification**: Identify characteristics of different literary genres
- **Translation Studies**: Compare stylistic features across languages

### Educational Research  
- **Text Difficulty Assessment**: Evaluate reading level for pedagogical materials
- **Writing Development**: Track changes in student writing complexity over time
- **Curriculum Design**: Match texts to appropriate reading levels

### Historical Studies
- **Document Dating**: Use stylistic features to estimate text composition dates
- **Authenticity Verification**: Detect potentially forged or misattributed documents
- **Social Analysis**: Examine how writing style reflects social class or education

## Prerequisites

Before running this notebook, ensure you have:
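- Required Python packages: `nltk`, `textstat`, `transformers`, `torch`, `numpy`, `openai`, `gensim`, `python-dotenv`
- Pre-trained embeddings (optional): Google News Word2Vec model
- OpenAI API key (only needed for the OpenAI embeddings)
- NLTK data packages (`punkt`, `averaged_perceptron_tagger`; newer NLTK releases may ask for `punkt_tab` and `averaged_perceptron_tagger_eng` instead)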

## Key Concepts

- **Readability Formulas**: Mathematical models predicting text difficulty
- **Type-Token Ratio (TTR)**: Vocabulary diversity measure
- **Hapax Legomena**: Words appearing only once in a text
- **Entropy**: Information-theoretic measure of vocabulary distribution
- **Embeddings**: Dense vector representations capturing semantic/stylistic properties

# TCR Feature Extraction: Readability Formulas

This module provides functions to compute standard readability metrics for a given text using the `textstat` library. These metrics estimate how easy or difficult a text is to read, often expressed as a U.S. grade level or a score.

**Features computed:**
- SMOG Index
- Automated Readability Index (ARI)
- Dale-Chall Readability Score
- Linsear Write Formula
- Gunning-Fog Index
- Coleman-Liau Index
- Flesch Reading Ease
- Flesch Kincaid Grade Level

**Usage:**

In [4]:
import nltk
import textstat
import math
import statistics
from collections import Counter

# Ensure necessary NLTK data is available
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def get_readability_scores(text):
    """
    Compute core readability metrics.
    """
    return {
        'SMOG': textstat.smog_index(text),
        'ARI': textstat.automated_readability_index(text),
        'Dale-Chall': textstat.dale_chall_readability_score(text),
        'Linsear Write': textstat.linsear_write_formula(text),
        'Gunning-Fog': textstat.gunning_fog(text),
        'Coleman-Liau': textstat.coleman_liau_index(text),
        'Flesch Reading Ease': textstat.flesch_reading_ease(text),
        'Flesch-Kincaid Grade': textstat.flesch_kincaid_grade(text)
    }

def get_length_stats(text):
    """
    Average word length, sentence length, and their standard deviations.
    """
    words = nltk.word_tokenize(text)
    sentences = nltk.sent_tokenize(text)
    chars_per_word = [len(w) for w in words]
    words_per_sent = [len(nltk.word_tokenize(s)) for s in sentences]
    
    return {
        'AvgChars/Word': statistics.mean(chars_per_word),
        'StdChars/Word': statistics.pstdev(chars_per_word),
        'AvgWords/Sentence': statistics.mean(words_per_sent),
        'StdWords/Sentence': statistics.pstdev(words_per_sent)
    }

def get_hapax_dislegomena(text):
    """
    Count words occurring exactly once (hapax) or twice (dis legomena).
    """
    tokens = nltk.word_tokenize(text.lower())
    freqs = Counter(tokens)
    return {
        'HapaxLegomena': sum(1 for c in freqs.values() if c == 1),
        'DisLegomena': sum(1 for c in freqs.values() if c == 2)
    }

def get_entropy_perplexity(text):
    """
    Shannon entropy and derived perplexity of the token distribution.
    """
    tokens = nltk.word_tokenize(text.lower())
    freqs = Counter(tokens)
    total = len(tokens)
    entropy = -sum((c/total) * math.log2(c/total) for c in freqs.values())
    perplexity = 2**entropy
    return {
        'Entropy': entropy,
        'Perplexity': perplexity
    }

def get_lexical_diversity(text, window_size=100):
    """
    Compute TTR variations (MATTR) and MTLD.
    """
    tokens = nltk.word_tokenize(text.lower())
    types = set(tokens)
    ttr = len(types)/len(tokens)

    # Moving-Average TTR (MATTR)
    mattr_values = []
    for i in range(len(tokens) - window_size + 1):
        window = tokens[i:i+window_size]
        mattr_values.append(len(set(window))/window_size)
    mattr = statistics.mean(mattr_values) if mattr_values else 0

    # MTLD (approximate)
    def mtld_calc(tokens, threshold=0.72):
        factors = 0
        types_set = set()
        token_count = 0
        for t in tokens:
            types_set.add(t)
            token_count += 1
            if len(types_set)/token_count <= threshold:
                factors += 1
                types_set.clear()
                token_count = 0
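        if token_count > 0:
            # Partial factor for the leftover tokens. Note: the published MTLD
            # formula uses (1 - TTR) / (1 - threshold) here; this expression
            # differs, so values are not comparable with reference implementations.
            factors += (1 - (len(types_set)/token_count - threshold)) / (1 - threshold)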
        return len(tokens)/factors if factors else 0

    mtld = mtld_calc(tokens)
    
    return {
        'TTR': ttr,
        'MATTR': mattr,
        'MTLD': mtld
    }

def get_functional_diversity(text):
    """
    Ratio of content words to function words via POS tags.
    """
    tokens = nltk.word_tokenize(text)
    tags = nltk.pos_tag(tokens)
    function_tags = {'DT','IN','CC','TO','PRP','PRP$','MD','RB','WRB','WP','WP$'}
    func = sum(1 for _, t in tags if t in function_tags)
    content = sum(1 for _, t in tags if t not in function_tags)
    return {'FunctionalDiversity': content/(func if func else 1)}

def extract_tcr_features(text):
    """
    Gather all TCR features into a single dictionary.
    """
    features = {}
    features.update(get_readability_scores(text))
    features.update(get_length_stats(text))
    features.update(get_hapax_dislegomena(text))
    features.update(get_entropy_perplexity(text))
    features.update(get_lexical_diversity(text))
    features.update(get_functional_diversity(text))
    return features

# Example usage
sample_text = "The quick brown fox jumps over the lazy dog. It served as a pangram widely used for testing fonts."
features = extract_tcr_features(sample_text)
print(features)


{'SMOG': 3.1291, 'ARI': 3.151578947368421, 'Dale-Chall': 9.094015789473683, 'Linsear Write': 3.75, 'Gunning-Fog': 3.8000000000000003, 'Coleman-Liau': 4.894736842105264, 'Flesch Reading Ease': 90.32934210526317, 'Flesch-Kincaid Grade': 3.0202631578947354, 'AvgChars/Word': 3.8095238095238093, 'StdChars/Word': 1.8157786633273922, 'AvgWords/Sentence': 10.5, 'StdWords/Sentence': 0.5, 'HapaxLegomena': 17, 'DisLegomena': 2, 'Entropy': 4.20184123230257, 'Perplexity': 18.402644982465112, 'TTR': 0.9047619047619048, 'MATTR': 0, 'MTLD': 7.212616822429907, 'FunctionalDiversity': 1.625}




## Traditional Readability and Lexical Complexity Features

This section implements a comprehensive suite of functions to extract traditional text complexity features. These metrics have been developed over decades of research in linguistics, education, and psychology.

### Core Readability Metrics

#### `get_readability_scores(text)`
Computes eight established readability formulas, each designed to predict reading difficulty:

- **SMOG Index**: Estimates years of education needed to understand text (based on polysyllabic words)
- **Automated Readability Index (ARI)**: Uses characters per word and words per sentence, so it needs no syllable counting
- **Dale-Chall**: Based on vocabulary difficulty using a list of common words
- **Linsear Write**: Designed for technical writing assessment
- **Gunning-Fog Index**: Combines average sentence length with the share of words that have three or more syllables
- **Coleman-Liau Index**: Uses letters and sentences per 100 words; like the other formulas, it was calibrated on English text
- **Flesch Reading Ease**: Scale from 0 (very difficult) to 100 (very easy)
- **Flesch-Kincaid Grade**: U.S. grade level equivalent

**For Digital Humanities**: These metrics help compare text difficulty across authors, time periods, or genres systematically.
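
To make the formulas less of a black box, the two Flesch scores can be computed by hand. The sketch below applies the published Flesch Reading Ease and Flesch-Kincaid Grade coefficients to naive counts. The helper names `count_syllables` and `flesch_scores` are illustrative, and the vowel-group syllable counter is a rough assumption, so the results will not match `textstat` exactly.

```python
import re

def count_syllables(word):
    """Very rough syllable estimate: count runs of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_scores(text):
    """Return (reading_ease, kincaid_grade) computed from naive counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Published coefficients for the two Flesch formulas
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    kincaid_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, kincaid_grade

ease, grade = flesch_scores("The cat sat on the mat. It was a sunny day.")
print(round(ease, 1), round(grade, 1))
```

Both scores depend only on sentence length and syllables per word, which is why short sentences of short words can score above 100 on Reading Ease or below grade 0.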

### Statistical Text Features

#### `get_length_stats(text)`
Computes distributional statistics for word and sentence lengths:
- **Average Characters per Word**: A rough proxy for vocabulary sophistication
- **Standard Deviation of Word Length**: Measures word length variability
- **Average Words per Sentence**: Reflects syntactic complexity
- **Standard Deviation of Sentence Length**: Shows structural variability

**Research Applications**: A low standard deviation points to uniform word or sentence lengths, a high one to more varied construction. Note that `nltk.word_tokenize` returns punctuation marks as tokens, so they are included in both the word-level and the sentence-level counts.

#### `get_hapax_dislegomena(text)`
Counts rare word usage patterns:
- **Hapax Legomena**: Word types appearing exactly once (a sign of vocabulary breadth)
- **Dis Legomena**: Word types appearing exactly twice

**Literary Significance**: A high share of hapax legomena suggests a rich vocabulary, and the pattern can help distinguish authors. The function returns raw counts, which grow with text length, so normalize them (for example by token or type count) before comparing texts of different sizes.
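
Because raw hapax counts depend on text length, ratios are usually easier to compare across texts. The following is a minimal sketch; `hapax_ratios` is a hypothetical helper, and it uses a simple regex tokenizer rather than NLTK to stay self-contained.

```python
import re
from collections import Counter

def hapax_ratios(text):
    """Hapax and dis legomena as shares of tokens and types."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"hapax_per_token": 0.0, "hapax_per_type": 0.0, "dis_per_type": 0.0}
    freqs = Counter(tokens)
    n_types = len(freqs)
    hapax = sum(1 for c in freqs.values() if c == 1)
    dis = sum(1 for c in freqs.values() if c == 2)
    return {
        "hapax_per_token": hapax / len(tokens),
        "hapax_per_type": hapax / n_types,
        "dis_per_type": dis / n_types,
    }

ratios = hapax_ratios("the cat saw the dog and the dog saw a bird")
print(ratios)
```

In the example sentence, 4 of the 7 word types occur once and 2 occur twice.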

### Information-Theoretic Measures

#### `get_entropy_perplexity(text)`
Applies information theory to text analysis:
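- **Shannon Entropy**: Measures unpredictability of word choices, in bits, from the relative frequency of each lowercased token
- **Perplexity**: Intuitive measure of vocabulary "surprise" (2^entropy), readable as the effective number of equally likely word choices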

**Interpretation**: 
- Higher entropy = more varied vocabulary usage
- Higher perplexity = less predictable word choices
- Useful for comparing stylistic diversity across texts
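
A small worked example may help with interpretation. With relative frequencies p_i, entropy is H = -Σ p_i log2(p_i) and perplexity is 2^H. The sketch below uses a hypothetical helper, `unigram_entropy`, on two four-token toy sequences: one uniform, one dominated by a single word.

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy (bits) of the token frequency distribution."""
    total = len(tokens)
    freqs = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in freqs.values())

uniform = ["a", "b", "c", "d"]   # every token equally likely
skewed = ["a", "a", "a", "b"]    # one token dominates

h_uniform = unigram_entropy(uniform)   # 2.0 bits, perplexity 4.0
h_skewed = unigram_entropy(skewed)     # lower entropy, lower perplexity

print(h_uniform, 2 ** h_uniform)
print(h_skewed, 2 ** h_skewed)
```

The uniform sequence reaches the maximum for four types (2 bits, perplexity 4), while the skewed one falls below it because one word is far more predictable.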

### Advanced Lexical Diversity

#### `get_lexical_diversity(text, window_size=100)`
Implements sophisticated vocabulary diversity measures:
- **Type-Token Ratio (TTR)**: Basic vocabulary diversity (unique words / total words)
- **Moving-Average TTR (MATTR)**: TTR averaged over sliding windows of `window_size` tokens (more stable); the function returns 0 when the text is shorter than one window
- **Measure of Textual Lexical Diversity (MTLD)**: Advanced metric accounting for text length (implemented here as a single forward-pass approximation)

**Why Multiple Measures**: TTR decreases with text length; MATTR and MTLD are far less sensitive to it.
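
The length effect is easy to reproduce on a toy example. In this sketch the same ten-token sentence is repeated twenty times: plain TTR collapses, while a windowed average stays put. The helpers `ttr` and `mattr` are illustrative, and falling back to plain TTR for short inputs is an assumption that differs from the notebook function, which returns 0 in that case.

```python
def ttr(tokens):
    """Type-token ratio: unique tokens divided by all tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window_size=10):
    """Mean TTR over all sliding windows of window_size tokens."""
    if len(tokens) < window_size:
        return ttr(tokens)  # assumption: fall back to plain TTR for short texts
    values = [
        len(set(tokens[i:i + window_size])) / window_size
        for i in range(len(tokens) - window_size + 1)
    ]
    return sum(values) / len(values)

base = "the quick brown fox jumps over the lazy dog again".split()  # 10 tokens, 9 types
short_text = base
long_text = base * 20  # same vocabulary, 200 tokens

print("TTR  :", ttr(short_text), ttr(long_text))
print("MATTR:", mattr(short_text), mattr(long_text))
```

Repetition is an extreme case, but the same drift appears whenever a longer text reuses its vocabulary, which is why TTR should only be compared between texts of similar length.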

### Grammatical Complexity

#### `get_functional_diversity(text)`
Analyzes grammatical structure through part-of-speech patterns:
- **Functional Diversity**: Ratio of content-word tokens to function-word tokens
- **Content Words**: Nouns, verbs, adjectives (carry meaning)
- **Function Words**: Determiners, prepositions, conjunctions, pronouns, modals (provide structure)

**Stylistic Insights**: Higher ratios suggest denser, content-heavy writing; lower ratios indicate prose that leans more on grammatical scaffolding. In this implementation the function-word tag set also contains adverbs (`RB`, `WRB`), and every other tag, punctuation included, counts as content.

### Integration Function

#### `extract_tcr_features(text)`
Combines all metrics into a comprehensive feature vector suitable for:
- **Machine Learning**: Classification or clustering of texts
- **Statistical Analysis**: Correlation studies or regression models
- **Comparative Studies**: Systematic comparison across authors or time periods
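
Before feeding these features to a classifier or clustering method, it usually helps to put them on a common scale, since grade levels, ratios, and raw counts have very different ranges. The sketch below z-scores each feature across texts. `standardize` is a hypothetical helper, and the three profiles are made-up values for illustration, not measurements.

```python
import statistics

def standardize(feature_dicts):
    """Turn a list of feature dicts into z-scored rows (one row per text)."""
    names = sorted(feature_dicts[0])
    rows = [[] for _ in feature_dicts]
    for name in names:
        column = [d[name] for d in feature_dicts]
        mean = statistics.mean(column)
        sd = statistics.pstdev(column)
        for i, value in enumerate(column):
            # A constant feature carries no information; map it to 0
            rows[i].append((value - mean) / sd if sd else 0.0)
    return names, rows

# Illustrative values only
profiles = [
    {"TTR": 0.50, "AvgWords/Sentence": 12.0},
    {"TTR": 0.60, "AvgWords/Sentence": 20.0},
    {"TTR": 0.70, "AvgWords/Sentence": 28.0},
]

names, matrix = standardize(profiles)
print(names)
for row in matrix:
    print([round(v, 3) for v in row])
```

In practice each dict would come from `extract_tcr_features(text)`, with one dict per text in the corpus.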

### Usage Guidelines for Digital Humanities

1. **Corpus Preparation**: Clean texts consistently (remove headers, footnotes)
2. **Comparative Analysis**: Use identical preprocessing across compared texts
3. **Statistical Significance**: Test feature differences with appropriate statistical methods
4. **Interpretation**: Consider domain knowledge when interpreting numerical results
5. **Visualization**: Plot feature distributions to identify patterns and outliers
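
For the statistical-significance step, a permutation test is one option that makes few distributional assumptions. The sketch below tests whether the mean of one feature differs between two groups of texts. `permutation_test` is a hypothetical helper, and the sentence-length values are invented for illustration.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign texts to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero
    return (extreme + 1) / (n_permutations + 1)

# Invented average sentence lengths for five texts per author
author_a = [14.2, 15.1, 13.8, 16.0, 14.9]
author_b = [21.3, 19.8, 22.5, 20.1, 23.0]

p_value = permutation_test(author_a, author_b)
print("p =", p_value)
```

When many features are tested at once, some correction for multiple comparisons is advisable before reading individual p-values as evidence.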

In [1]:
# Install these first if you don’t have them:
# pip install gensim transformers openai torch numpy python-dotenv
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
import openai
from gensim.models import KeyedVectors
import os

from dotenv import load_dotenv
load_dotenv()  # Load environment variables from .env file


# 1) Word2Vec (via gensim)
def get_word2vec_embedding(text, w2v_path="../embedding-models/GoogleNews-vectors-negative300.bin", size=300):
    """
    - text: string
    - w2v_path: path to a .bin or .kv model file (e.g. GoogleNews-vectors-negative300.bin)
    Returns the mean of the token embeddings.
    """
    # load once (outside function in real code)
    w2v = KeyedVectors.load_word2vec_format(w2v_path, binary=True)
    tokens = text.lower().split()
    vecs = [w2v[word] for word in tokens if word in w2v]
    if not vecs:
        return np.zeros(size)
    return np.mean(vecs, axis=0)


# 2) BERT (via HuggingFace Transformers)

def get_bert_embedding(text, model_name='bert-base-uncased', layer=-2):
    """
    Returns the mean over tokens of the hidden states from one BERT layer
    (default: second-to-last). Input is truncated to 512 tokens.
    """
    # Loaded on every call here for simplicity; load once in real code
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tok(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        # hidden_states is a tuple: the embedding output plus one tensor per layer
        hidden = outputs.hidden_states[layer]  # e.g. second-to-last
        # [batch, seq_len, hidden_dim] -> mean over seq_len
        return hidden.mean(dim=1).squeeze().numpy()


# 3) OpenAI embeddings

def get_openai_embedding(text, model="text-embedding-3-small"):
    """
    Returns the OpenAI embedding for the whole text.
    """
    
    client = openai.Client(api_key=os.getenv('OPENAI_API_KEY'))
    resp = client.embeddings.create(
        input=text,
        model=model
    )
    return np.array(resp.data[0].embedding)  


# # ===== Example usage =====


sample = "The quick brown fox jumps over the lazy dog."

# Word2Vec (download Google News binary, set path)
# w2v_vec = get_word2vec_embedding(sample, '/path/to/GoogleNews-vectors-negative300.bin')

# GloVe (convert glove txt to word2vec format or use no_header=True)
# glove_vec = get_word2vec_embedding(sample)
# print("GloVe embedding shape:", glove_vec.shape)

# # BERT
# bert_vec = get_bert_embedding(sample)
# print("BERT embedding shape:", bert_vec.shape)

# OpenAI (set your OPENAI_API_KEY env var or pass directly)
openai_vec = get_openai_embedding(sample)
print("OpenAI embedding length:", len(openai_vec))


OpenAI embedding length: 1536


## Modern Embedding Approaches for Style Analysis

This section demonstrates how to extract dense vector representations of text using state-of-the-art embedding models. These embeddings capture subtle semantic and stylistic patterns that traditional metrics might miss.

### Why Use Embeddings for Style Analysis?

Traditional readability metrics focus on surface-level features (word length, sentence complexity), while embeddings capture:
- **Semantic Relationships**: How words relate in meaning
- **Contextual Information**: How words behave in different contexts  
- **Stylistic Patterns**: Subtle patterns in word usage and combination
- **Cross-linguistic Features**: Comparable representations across languages, provided a multilingual model is used

### Embedding Models Implemented

#### `get_word2vec_embedding(text, w2v_path, size=300)`
- **Purpose**: Uses pre-trained Word2Vec embeddings (e.g., Google News)
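- **Process**: Lowercases the text, splits it on whitespace, and averages the vectors of the tokens found in the model's vocabulary (returns a zero vector if none are found)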
- **Advantages**: Fast, well-established, good semantic representations
- **Best For**: Vocabulary-based style analysis, semantic similarity tasks
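- **Note**: Requires downloading large pre-trained models (3+ GB); the function as written reloads the model on every call, so load it once outside the function for real workloads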

**Digital Humanities Applications**:
- Compare vocabulary usage across historical periods
- Identify semantic themes in literary works
- Detect stylistic similarities between authors
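
The averaging step is sensitive to tokenization: a whitespace split leaves punctuation attached (`"dog."`), and such tokens are silently dropped as out-of-vocabulary. The sketch below shows mean pooling with a regex tokenizer and reports how many tokens were actually covered. `toy_vectors` is a made-up three-dimensional stand-in for a real `KeyedVectors` model, and `mean_pool` is a hypothetical helper.

```python
import re

# Hypothetical 3-d vectors, for illustration only
toy_vectors = {
    "quick": [1.0, 0.0, 0.0],
    "fox": [0.0, 1.0, 0.0],
    "dog": [0.0, 0.0, 1.0],
}

def mean_pool(text, vectors, size=3):
    """Average the vectors of in-vocabulary tokens; also return coverage."""
    tokens = re.findall(r"[A-Za-z']+", text)  # strips punctuation
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * size, 0.0
    pooled = [sum(dim) / len(found) for dim in zip(*found)]
    coverage = len(found) / len(tokens)
    return pooled, coverage

pooled, coverage = mean_pool("The quick fox saw the dog.", toy_vectors)
print(pooled, coverage)

# With a plain whitespace split, the last token would be "dog." and be skipped
print("dog." in toy_vectors)
```

Reporting coverage alongside the vector makes it easier to spot texts, such as historical spellings, where most tokens fall outside the model's vocabulary.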

#### `get_bert_embedding(text, model_name='bert-base-uncased', layer=-2)`
- **Purpose**: Extracts contextualized embeddings from BERT models
- **Process**: Runs the text through the model (truncated to 512 tokens) and averages the token vectors of the chosen hidden layer
- **Parameters**:
  - `model_name`: Choice of BERT model (base, large, multilingual)
  - `layer`: Which transformer layer to extract (-1 = last, -2 = second-to-last)
- **Advantages**: Context-sensitive, captures syntactic patterns, high quality
- **Best For**: Subtle stylistic differences, grammatical pattern analysis

**Research Applications**:
- Authorship attribution
- Genre classification based on writing style
- Detection of stylistic evolution in an author's work

#### `get_openai_embedding(text, model="text-embedding-3-small")`
- **Purpose**: Uses OpenAI's hosted embedding models via API
- **Process**: Sends text to OpenAI's servers for processing
- **Advantages**: High-quality general-purpose vectors, no local model to download or run, multilingual
- **Models Available**:
  - `text-embedding-3-small`: Efficient, good performance
  - `text-embedding-3-large`: Highest quality, more expensive
- **Best For**: Production applications, highest-quality analysis

**Considerations**:
- Requires internet connection and API key
- Usage is billed per token
- Texts leave your machine, which may matter for unpublished or rights-restricted material

### Technical Implementation Details

#### Environment Setup
The code uses `dotenv` to securely manage API keys:
```python
from dotenv import load_dotenv
load_dotenv()  # Loads OPENAI_API_KEY from .env file
```

#### Error Handling
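The functions above do no error handling of their own. For real corpora, plan for:
- **Missing Models**: `get_word2vec_embedding` raises an error if the model file is absent, so check the path or skip that embedding
- **API Failures**: Wrap calls to `get_openai_embedding` in retry logic with a fallback
- **Memory Management**: Load each model once and process long texts in chunks; `get_bert_embedding` silently truncates input beyond 512 tokens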
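
Retry logic for API failures can be sketched as a small wrapper with exponential backoff. `with_retries` is a hypothetical helper, not part of the `openai` package; the `sleep` parameter exists only so the example can run without actually waiting, and `flaky` simulates a call that fails twice.

```python
import time

def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay, 2*base_delay, ... and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))

# Simulated unreliable call: fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"

delays = []
result = with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)

# In the notebook this might wrap the API call, e.g.:
# vec = with_retries(lambda: get_openai_embedding(sample))
```

Catching every `Exception` keeps the sketch short; in practice it is safer to retry only on errors that are plausibly transient, such as timeouts or rate limits.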

#### Vector Operations
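- **Averaging**: Simple but effective aggregation for document-level representations
- **Normalization**: Not applied by the functions above; L2-normalize vectors before comparing them if magnitude should not matter
- **Dimensionality**: Different models produce different vector sizes (300 for Google News Word2Vec, 768 for `bert-base-uncased`, 1536 for `text-embedding-3-small`)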
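
Normalization and comparison can be sketched in a few lines. The helpers below (`l2_normalize`, `cosine_similarity`) are illustrative and use plain Python lists; with NumPy arrays the same arithmetic applies. Vectors from different models have different sizes and cannot be compared with each other directly.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must come from the same model (same size)")
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

print(l2_normalize([3.0, 4.0]))
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```

Cosine similarity ignores vector length, so two documents pointing in the same direction score as identical even if one embedding has a larger magnitude.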

### Comparative Analysis Framework

For comprehensive style analysis, consider using multiple embedding types:

1. **Word2Vec**: Traditional semantic similarity
2. **BERT**: Contextual and syntactic patterns  
3. **OpenAI**: Latest advances and multilingual capabilities

Compare results across models to identify:
- **Consistent Patterns**: Features detected by multiple models
- **Model-Specific Insights**: Unique capabilities of each approach
- **Complementary Information**: How different models capture different aspects

### Practical Usage Guidelines

#### For Digital Humanities Research:

1. **Start Simple**: Begin with OpenAI embeddings for ease of use
2. **Scale Appropriately**: Use BERT for medium corpora, Word2Vec for large collections
3. **Validate Results**: Compare embedding-based findings with traditional metrics
4. **Consider Context**: Choose models appropriate for your historical period/language
5. **Document Decisions**: Record which models and parameters you used for reproducibility

#### Performance Considerations:

- **Word2Vec**: Fastest, requires local model storage
- **BERT**: Medium speed, high memory usage
- **OpenAI**: Dependent on internet speed, usage limits

#### Cost Considerations:

- **Word2Vec**: One-time download cost
- **BERT**: Computational resources (GPU recommended)
- **OpenAI**: Per-token pricing (very affordable for most research)