# Part 1b: Data Integration and Predictor Engineering

## From Raw Corpus to Powerful Predictors

**Learning Objectives:**
- **Process Raw Text**: Take a large, unstructured text corpus and turn it into a clean, tokenized list of words.
- **Calculate Frequency**: Compute raw word frequency counts from the tokenized text.
- **Integrate External Data**: Merge the LLM-derived frequencies with established psycholinguistic datasets (ECP, SUBTLEX) to create a rich, comparative dataset.
- **Apply Transformations**: Convert raw frequencies into meaningful, psycholinguistically-validated scales (Schepens, Zipf).
- **Export for Analysis**: Save the final, merged data into a single file, ready for statistical analysis in Notebook 2.

---

💡 **Research Context:** A raw text file isn't useful for statistical modeling. We need to "engineer" predictors from it. This notebook automates the critical pipeline from text to data. We will calculate our own `llm_frequency` and then place it alongside well-known, human-validated measures. This allows us to directly compare our LLM-based predictor with the "gold standards" in the field.

# Prepare Predictors from Corpus

This notebook processes the large corpus text file to extract word frequencies and prepare predictor data for analysis. The output will be used by `notebook2_corpus_analysis.ipynb` to compare different frequency measures.

## 1. Setup and Configuration

First, we'll import the necessary libraries and define the file paths for our input (the raw text corpus) and our output (the final CSV file with all predictors).

In [44]:
import pandas as pd
import re
from collections import Counter
import os
import numpy as np

# File paths
corpus_file_path = '../output/large_corpus.txt'
output_csv_path = '../output/generated_corpus_with_predictors.csv'

print("Starting corpus processing...")
print(f"Corpus file: {corpus_file_path}")
print(f"Output file: {output_csv_path}")

Starting corpus processing...
Corpus file: ../output/large_corpus.txt
Output file: ../output/generated_corpus_with_predictors.csv


## 2. Load and Validate the Corpus

Before we can process the text, we need to load it into memory. We'll also perform a quick check to ensure the file exists and display the first few characters to confirm it has loaded correctly.
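
As a minimal sketch of the same check, the loader below raises an exception instead of printing a message, so that later cells cannot run against a missing corpus. The helper name `load_corpus` and the demo file are hypothetical and not part of this notebook.

```python
from pathlib import Path
import tempfile


def load_corpus(path):
    """Read a UTF-8 text file, failing early if it is missing or empty.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Corpus file not found at {path}")
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        raise ValueError(f"Corpus file at {path} is empty")
    return text


missing_error = None
with tempfile.TemporaryDirectory() as tmp:
    # Write a throwaway file so the sketch runs without the real corpus
    demo_path = Path(tmp) / "demo_corpus.txt"
    demo_path.write_text("In today\u2019s digital world.", encoding="utf-8")

    demo_text = load_corpus(demo_path)
    print(f"Corpus loaded: {len(demo_text):,} characters")

    # A missing file stops execution instead of silently continuing
    try:
        load_corpus(Path(tmp) / "missing.txt")
    except FileNotFoundError as err:
        missing_error = err
        print(err)
```

Raising is a design choice: in a notebook, a printed error is easy to miss when running all cells at once.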

In [45]:
# Check if corpus file exists
if not os.path.exists(corpus_file_path):
    print(f"ERROR: Corpus file not found at {corpus_file_path}")
    print("Please ensure the large_corpus.txt file exists in the output directory.")
else:
    print("✓ Corpus file found")
    
    # Read the corpus
    print("Reading corpus...")
    with open(corpus_file_path, 'r', encoding='utf-8') as f:
        corpus_text = f.read()
    
    print(f"Corpus loaded: {len(corpus_text):,} characters")
    print(f"First 200 characters: {corpus_text[:200]}...")

✓ Corpus file found
Reading corpus...
Corpus loaded: 16,404,292 characters
First 200 characters: **How to Create a Reliable Backup System for Your Digital Life**

In today’s digital world, our lives are increasingly stored on computers, smartphones, and cloud services. From family photos and impo...


## 3. Text Preprocessing and Tokenization

This is a critical step in turning unstructured text into data. We will:
1.  **Clean the Text**: Remove any metadata comments that were added during the generation process.
2.  **Lowercase**: Convert all text to lowercase to ensure that words like "The" and "the" are treated as the same token.
3.  **Tokenize**: Use a regular expression (`regex`) to find all sequences of letters, effectively splitting the text into a list of words (tokens).
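
The three steps can be tried on a toy string first. This is a sketch under assumptions: the sample text and the JSON inside the metadata comment are invented for illustration. It also shows one edge case worth knowing about: the pattern only recognizes the straight apostrophe, so a curly apostrophe (’) splits a contraction in two unless it is normalized first.

```python
import re

# Invented sample: a multiline metadata comment followed by two sentences
sample = (
    '<!-- Story Metadata: {"id": 1,\n "topic": "backup"} -->\n'
    "The backup isn't optional. In today\u2019s world, the Backup matters."
)

# 1. Clean: drop the metadata comment (DOTALL lets '.' span the newline)
cleaned = re.sub(r"<!-- Story Metadata:.*?-->", "", sample, flags=re.DOTALL)

# 2. Lowercase so "Backup" and "backup" count as the same token
lowered = cleaned.lower()

# 3. Tokenize: letters, optionally followed by a straight-apostrophe suffix
pattern = r"\b[a-z]+(?:'[a-z]+)?\b"
tokens = re.findall(pattern, lowered)
print(tokens)

# Optional extra step (not in the notebook): map the curly apostrophe
# U+2019 to a straight one so that "today’s" stays a single token
normalized_tokens = re.findall(pattern, lowered.replace("\u2019", "'"))
print(normalized_tokens)
```

Whether to normalize apostrophes depends on the reference data: if the external word lists do not contain contractions, the split tokens simply fail to match during merging.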

In [46]:
# Text preprocessing and tokenization
print("Cleaning and tokenizing text...")

# Remove metadata comments before processing
# The DOTALL flag is crucial for multiline JSON
cleaned_text = re.sub(r'<!-- Story Metadata:.*?-->', '', corpus_text, flags=re.DOTALL)

# Convert to lowercase and extract words using regex
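# This pattern matches sequences of letters, plus contractions written with a
# straight apostrophe ('). A curly apostrophe (’) is not matched, so "today’s"
# splits into 'today' and 's', as the sample tokens in the output show.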
words = re.findall(r"\b[a-z]+(?:'[a-z]+)?\b", cleaned_text.lower())

print(f"Total tokens extracted: {len(words):,}")
print(f"Sample tokens: {words[:20]}")

Cleaning and tokenizing text...
Total tokens extracted: 2,050,703
Sample tokens: ['how', 'to', 'create', 'a', 'reliable', 'backup', 'system', 'for', 'your', 'digital', 'life', 'in', 'today', 's', 'digital', 'world', 'our', 'lives', 'are', 'increasingly']


## 4. Calculate Word Frequencies

Now that we have a clean list of tokens, we can calculate the frequency of each word. We'll use Python's `Counter` to create a dictionary where keys are words and values are their raw counts.

This raw count is the simplest form of a frequency predictor.
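
A small illustration of how `Counter` behaves, using an invented token list. It also shows the difference between the token count (corpus size) and the type count (unique words), both of which are needed later for the Zipf transformation.

```python
from collections import Counter

# Invented token list for illustration
toy_tokens = ["the", "backup", "protects", "the", "photos", "and", "the", "files"]
toy_counts = Counter(toy_tokens)

toy_total = len(toy_tokens)   # number of tokens (corpus size)
toy_types = len(toy_counts)   # number of distinct words (types)

print(toy_counts.most_common(1))
print(f"Tokens: {toy_total}, types: {toy_types}")

# A Counter returns 0 for unseen words instead of raising KeyError
print(toy_counts["cloud"])
```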

In [47]:
# Count word frequencies
print("Counting word frequencies...")
word_counts = Counter(words)
total_words = len(words)
unique_words = len(word_counts)

print(f"Unique words: {unique_words:,}")
print(f"Most common words: {word_counts.most_common(10)}")

Counting word frequencies...
Unique words: 46,483
Most common words: [('the', 113290), ('a', 85845), ('and', 44435), ('of', 43403), ('to', 34483), ('in', 33797), ('it', 32853), ('s', 28349), ('like', 17852), ('that', 16510)]


## 5. Create a Structured DataFrame

To make the data easier to work with, we'll convert our word counts into a `pandas` DataFrame. This structure allows for powerful data manipulation, filtering, and merging. We'll also add a `word_length` column, which is another simple but powerful predictor of reading time.

In [48]:
# Create DataFrame with word frequency data
print("Creating DataFrame...")

# Convert word counts to DataFrame
df_words = pd.DataFrame(word_counts.items(), columns=['word', 'llm_frequency_raw'])
df_words['word_length'] = df_words['word'].apply(len)

print(f"DataFrame created with {len(df_words)} words.")

Creating DataFrame...
DataFrame created with 46483 words.


## 6. Data Integration: Merging with External Datasets

**This is where we create a rich dataset for comparison.** Our LLM-generated frequency is interesting, but its true value is only revealed when compared against established, human-derived measures.

We will merge our `llm_frequency_raw` data with two key external datasets:
1.  **The English Crowdsourcing Project (ECP)**: This dataset provides human reading time data and contains several pre-computed frequency measures (`SUBTLEX`, `Multilex`) and even a familiarity rating derived from GPT (`GPT`).
2.  **SUBTLEX-US**: While ECP gives us the final *Zipf-scaled* SUBTLEX value, it doesn't give us the *raw frequency count*. We load the original SUBTLEX-US corpus data to get this raw count, which is essential for applying our own transformations consistently.

By merging these, we can place our new predictor (`llm_frequency`) in the same rows as the established ones, setting the stage for a direct comparison in Notebook 2.
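
The merge logic can be sketched on two tiny tables. All words and values below are invented; only the column names `word`, `llm_frequency_raw`, `spelling`, and `SUBTLEX` mirror this notebook. The `indicator=True` option is an addition for illustration: it records which rows found a match, which gives a quick coverage figure.

```python
import pandas as pd

# Invented toy data: our counts on the left, a reference table on the right
llm_toy = pd.DataFrame({
    "word": ["the", "backup", "zorbix"],
    "llm_frequency_raw": [3, 1, 1],
})
ref_toy = pd.DataFrame({
    "spelling": ["the", "backup", "cat"],
    "SUBTLEX": [7.0, 4.0, 5.0],  # made-up values, not real norms
})

# A left merge keeps every word from our corpus; words missing from the
# reference table get NaN in the reference columns
merged_toy = pd.merge(
    llm_toy, ref_toy,
    left_on="word", right_on="spelling",
    how="left", indicator=True,
)
merged_toy = merged_toy.drop(columns=["spelling"])

# Share of our words that have a reference value
coverage = (merged_toy["_merge"] == "both").mean()
print(merged_toy)
print(f"Coverage: {coverage:.0%}")
```

Reference words that never occur in our corpus ("cat" here) are dropped by the left merge, so the merged table describes our vocabulary rather than the reference vocabulary.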

In [49]:
# --- Data Integration ---
print("\nIntegrating with ECP reference data...")
try:
    ecp_df = pd.read_csv('../data/lexicaldecision/ecp/English Crowdsourcing Project All Native Speakers.csv')
    print(f"✅ Loaded {len(ecp_df)} records from ECP dataset")
    
    # Load SUBTLEX-US data to get raw frequency counts
    print("Loading SUBTLEX-US data for raw frequency counts...")
    subtlex_us_path = '../data/frequency/subtlex-us/SUBTLEXus74286wordstextversion.txt'
    subtlex_df = pd.read_csv(subtlex_us_path, sep='\t')
    print(f"✅ Loaded {len(subtlex_df)} records from SUBTLEX-US dataset")
    
    # Rename columns for consistency
    subtlex_df = subtlex_df.rename(columns={
        'Word': 'word',
        'FREQcount': 'subtlex_freq_raw'
    })
    
    # Merge ECP data with SUBTLEX-US data to get raw frequency counts
    ecp_with_subtlex = pd.merge(ecp_df, subtlex_df[['word', 'subtlex_freq_raw']], 
                               left_on='spelling' if 'spelling' in ecp_df.columns else 'Word', 
                               right_on='word', how='left')
    
    # Define word column and predictors to merge
    word_col = 'spelling' if 'spelling' in ecp_with_subtlex.columns else 'Word'
    # We need the raw SUBTLEX frequency count (subtlex_freq_raw) and the total corpus size for the Schepens transform.
    # The 'SUBTLEX' column is the Zipf scale, which we'll also use.
    ref_cols = ['SUBTLEX', 'subtlex_freq_raw', 'Multilex', 'GPT']
    cols_to_merge = [word_col] + [col for col in ref_cols if col in ecp_with_subtlex.columns]
    
    # Merge generated frequencies with reference data
    merged_df = pd.merge(df_words, ecp_with_subtlex[cols_to_merge], left_on='word', right_on=word_col, how='left')
    
    # Rename columns for clarity
    merged_df = merged_df.rename(columns={
        'SUBTLEX': 'subtlex_zipf',    # Pre-computed Zipf scale from ECP
        'Multilex': 'multilex_zipf',
        'GPT': 'gpt_familiarity'
        # 'subtlex_freq_raw' (raw SUBTLEX-US count) already has its final name
    })
    if word_col != 'word':
        merged_df = merged_df.drop(columns=[word_col])
        
    print("✅ Merged generated data with ECP reference measures and SUBTLEX-US raw frequencies.")
    
except FileNotFoundError as e:
    print(f"⚠️ Data file not found: {e}. Proceeding without reference measures.")
    merged_df = df_words.copy()
except Exception as e:
    print(f"⚠️ Error loading reference data: {e}. Proceeding without reference measures.")
    merged_df = df_words.copy()


Integrating with ECP reference data...
✅ Loaded 61851 records from ECP dataset
Loading SUBTLEX-US data for raw frequency counts...
✅ Loaded 74286 records from SUBTLEX-US dataset
✅ Merged generated data with ECP reference measures and SUBTLEX-US raw frequencies.


## 7. Logarithmic Transformations: Creating Comparable Predictors

**Why transform raw frequencies?** Raw frequency counts are heavily skewed (a few words like "the" appear millions of times, while most words appear very rarely). This skew violates the assumptions of many statistical models (like linear regression). Logarithmic transformations compress the scale, making the distribution more normal and better behaved for statistical analysis.

We will apply two important transformations to both our LLM frequencies and the SUBTLEX frequencies:

1.  **Schepens et al. Transformation**: `log( (1 + frequency_raw) * 1e6 / corpus_size )`
    - This is the formula used in the foundational paper for this project. Here `log` is the natural logarithm (the implementation in this notebook uses `np.log`). The formula expresses each count as occurrences per million words, so corpora of different sizes become comparable.

2.  **Van Heuven et al. (Zipf) Transformation**: `log10((raw_frequency + 1) / (corpus_M + types_M)) + 3`
    - This is a widely used standard in psycholinguistics (used by the ECP). `corpus_M` is the total number of words (tokens) in millions and `types_M` is the number of unique words (types) in millions, so the scale accounts for both.

By applying these formulas to both our LLM corpus and the SUBTLEX corpus, we create four key predictors that can be directly compared:
- `llm_freq_schepens`
- `llm_freq_zipf`
- `subtlex_schepens` (calculated by us from raw SUBTLEX counts)
- `subtlex_zipf` (the original value from ECP, which we confirmed uses the Van Heuven formula)
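
As a worked example, both formulas can be applied to the single word "the", using the counts reported earlier in this notebook (113,290 occurrences, 2,050,703 tokens, 46,483 types). The scalar helpers `schepens_scalar` and `zipf_scalar` are hypothetical names used only for this sketch; they take plain numbers rather than pandas columns.

```python
import math


def schepens_scalar(freq_raw, corpus_size):
    """Natural log of the smoothed count per million words."""
    return math.log((1 + freq_raw) * 1_000_000 / corpus_size)


def zipf_scalar(freq_raw, corpus_size, word_types):
    """Zipf scale: log10 of the smoothed count over (tokens + types) in millions, plus 3."""
    corpus_m = corpus_size / 1_000_000
    types_m = word_types / 1_000_000
    return math.log10((freq_raw + 1) / (corpus_m + types_m)) + 3


# Counts reported by the earlier cells of this notebook
the_count = 113_290
llm_tokens = 2_050_703
llm_types = 46_483

the_schepens = schepens_scalar(the_count, llm_tokens)
the_zipf = zipf_scalar(the_count, llm_tokens, llm_types)
print(f"'the': Schepens = {the_schepens:.2f}, Zipf = {the_zipf:.2f}")

# A word seen only once sits near the bottom of the scale for this corpus
hapax_zipf = zipf_scalar(1, llm_tokens, llm_types)
print(f"word seen once: Zipf = {hapax_zipf:.2f}")
```

The two scales differ in log base and offset, so their values are not interchangeable even though both are monotonic in the raw count.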

In [50]:
# --- Logarithmic Transformations ---
print("\nApplying logarithmic transformations for direct comparison...")

# Define corpus sizes for Schepens calculation
# SUBTLEX-US corpus size is approximately 51 million words (using the same as UK for consistency)
SUBTLEX_US_SIZE = 51_000_000
llm_corpus_size = total_words
print(f"LLM Corpus Size: {llm_corpus_size:,} words")
print(f"SUBTLEX-US Corpus Size: {SUBTLEX_US_SIZE:,} words")

# Define transformation functions
def schepens_log(freq_series_raw, corpus_size):
    """log( (1 + frequency_raw) * 1e6 / corpus_size )"""
    # Using log1p is more numerically stable for log(1 + x)
    return np.log1p(freq_series_raw) + np.log(1_000_000 / corpus_size)

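# Not used below: this variant omits the type-count term and the +3 offset,
# so it does not produce the Van Heuven Zipf scale. See van_heuven_zipf.
def van_heuven_zipf_wrong(freq_series_raw, corpus_size):
    """log10(frequency_per_million + 1)"""
    freq_per_million = (freq_series_raw / corpus_size) * 1_000_000
    return np.log10(freq_per_million + 1)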

def van_heuven_zipf(freq_series_raw, corpus_size, word_types):
    """log10((raw_frequency + 1) / (corpus_size_in_millions + word_types_in_millions)) + 3"""
    corpus_size_millions = corpus_size / 1_000_000
    word_types_millions = word_types / 1_000_000
    
    # Add 1 to frequency to handle 0 values
    numerator = freq_series_raw + 1
    denominator = corpus_size_millions + word_types_millions
    
    # Avoid division by zero if denominator is 0
    if denominator == 0:
        return np.nan
        
    return np.log10(numerator / denominator) + 3

# --- Apply transformations to create the four target measures ---

# 1. LLM-derived frequencies
merged_df['llm_freq_schepens'] = schepens_log(merged_df['llm_frequency_raw'], llm_corpus_size)
merged_df['llm_freq_zipf'] = van_heuven_zipf(merged_df['llm_frequency_raw'], llm_corpus_size, unique_words)
print("   ✓ Calculated Schepens and Zipf scales for LLM frequency")

# 2. SUBTLEX frequencies
if 'subtlex_freq_raw' in merged_df.columns:
    # We calculate the Schepens scale from the raw SUBTLEX frequency.
    merged_df['subtlex_schepens'] = schepens_log(merged_df['subtlex_freq_raw'], SUBTLEX_US_SIZE)
    
    # Also calculate Van Heuven Zipf scale for SUBTLEX using proper SUBTLEX parameters
    SUBTLEX_WORD_TYPES = 74286  # Number of word types in SUBTLEX-US dataset
    merged_df['subtlex_zipf_vanheuven'] = van_heuven_zipf(merged_df['subtlex_freq_raw'], SUBTLEX_US_SIZE, SUBTLEX_WORD_TYPES)
    
    print("   ✓ Calculated Schepens scale for SUBTLEX frequency")
    print("   ✓ Calculated Van Heuven Zipf scale for SUBTLEX frequency")
else:
    print("   ⚠️ 'subtlex_freq_raw' column not found. Cannot calculate subtlex_schepens.")
    merged_df['subtlex_schepens'] = np.nan
    merged_df['subtlex_zipf_vanheuven'] = np.nan

# Sort by raw frequency
merged_df = merged_df.sort_values('llm_frequency_raw', ascending=False).reset_index(drop=True)

print("\n✅ Transformations complete.")
print("\nFirst 10 rows of the processed data:")
display_cols = ['word', 'llm_freq_schepens', 'llm_freq_zipf', 'subtlex_zipf', 'subtlex_schepens', 'subtlex_zipf_vanheuven']
available_cols = [col for col in display_cols if col in merged_df.columns]
print(merged_df[available_cols].head(10))


Applying logarithmic transformations for direct comparison...
LLM Corpus Size: 2,050,703 words
SUBTLEX-US Corpus Size: 51,000,000 words
   ✓ Calculated Schepens and Zipf scales for LLM frequency
   ✓ Calculated Schepens scale for SUBTLEX frequency
   ✓ Calculated Van Heuven Zipf scale for SUBTLEX frequency

✅ Transformations complete.

First 10 rows of the processed data:
   word  llm_freq_schepens  llm_freq_zipf  subtlex_zipf  subtlex_schepens  \
0   the          10.919532       7.732558      7.468478         10.290422   
1     a          10.642128       7.612083      7.309360          9.924040   
2   and           9.983623       7.326098      7.126116          9.502104   
3    of           9.960124       7.315893      7.063010          9.356798   
4    to           9.730068       7.215981      7.355006         10.029145   
5    in           9.709974       7.207254      6.989451          9.187423   
6    it           9.681646       7.194951      7.275782          9.846723   
7     s 

## 8. Export for Analysis

In [51]:
# --- Save Processed Data ---
# Note: this path differs from the output_csv_path defined in Section 1.
output_path = '../output/merged_predictors.csv'
merged_df.to_csv(output_path, index=False)
print(f"\n✅ Processed data saved to {output_path}")


✅ Processed data saved to ../output/merged_predictors.csv
