# Test Suite for Refactored `prepare_transcripts.py`

This notebook tests the refactored preprocessing module with CHILDES sample data and verifies compatibility with alignment analysis scripts.

## Prerequisites

Before running this notebook, ensure you've:
1. Installed the package in editable mode: `pip install -e .`
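2. Installed spaCy (needed only for TEST 2, which is skipped if spaCy is missing): `pip install spacy`
3. Downloaded the spaCy model: `python -m spacy download en_core_web_sm` (optional; per the TEST 2 availability check, `prepare_transcripts()` auto-downloads the model if needed)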

## Data Location

Test files: `/Users/ndd697/Desktop/Github-Projects/llm-linguistic-alignment/src/align_test/data/CHILDES/`

---
## Setup: Import Libraries and Configure Paths

In [44]:
import os 
import pandas as pd
import ast

# Import the refactored preprocessing module
from align_test.prepare_transcripts_refactored import prepare_transcripts

print("✓ Imports successful")


✓ Imports successful


In [2]:
# ============================================================
# CONFIGURATION: Set your data directories
# ============================================================

# Input: CHILDES data directory
CHILDES_DATA_DIR = '/Users/ndd697/Desktop/Github-Projects/llm-linguistic-alignment/src/align_test/data/CHILDES/'

# Output: Test output directories
OUTPUT_DIR_BASIC = './test_output_basic'
OUTPUT_DIR_SPACY = './test_output_spacy'
OUTPUT_DIR_ALIGNMENT = './test_alignment_results'

# Create output directories
for dir_path in [OUTPUT_DIR_BASIC, OUTPUT_DIR_SPACY, OUTPUT_DIR_ALIGNMENT]:
    os.makedirs(dir_path, exist_ok=True)
    print(f"✓ Created directory: {dir_path}")

✓ Created directory: ./test_output_basic
✓ Created directory: ./test_output_spacy
✓ Created directory: ./test_alignment_results


---
## Verify Input Data

Check that the CHILDES directory exists and contains our test files.

In [3]:
# Check if CHILDES directory exists
print(f"CHILDES Directory: {CHILDES_DATA_DIR}")
print(f"Exists: {os.path.exists(CHILDES_DATA_DIR)}")

if not os.path.exists(CHILDES_DATA_DIR):
    print("\n✗ Directory not found! Please update CHILDES_DATA_DIR above.")
else:
    # List files in directory
    files = [f for f in os.listdir(CHILDES_DATA_DIR) if f.endswith('.txt')]
    print(f"\n✓ Found {len(files)} .txt files:")
    for f in files:
        file_path = os.path.join(CHILDES_DATA_DIR, f)
        size_kb = os.path.getsize(file_path) / 1024
        print(f"  - {f} ({size_kb:.1f} KB)")

CHILDES Directory: /Users/ndd697/Desktop/Github-Projects/llm-linguistic-alignment/src/align_test/data/CHILDES/
Exists: True

✓ Found 20 .txt files:
  - time197-cond1.txt (4.4 KB)
  - time202-cond1.txt (6.2 KB)
  - time191-cond1.txt (5.5 KB)
  - time209-cond1.txt (6.4 KB)
  - time210-cond1.txt (5.5 KB)
  - time204-cond1.txt (8.8 KB)
  - time196-cond1.txt (4.3 KB)
  - time203-cond1.txt (5.3 KB)
  - time208-cond1.txt (6.5 KB)
  - time205-cond1.txt (7.5 KB)
  - time195-cond1.txt (5.6 KB)
  - time198-cond1.txt (5.3 KB)
  - time200-cond1.txt (5.5 KB)
  - time193-cond1.txt (5.7 KB)
  - time206-cond1.txt (6.2 KB)
  - time194-cond1.txt (5.3 KB)
  - time199-cond1.txt (5.9 KB)
  - time201-cond1.txt (5.6 KB)
  - time192-cond1.txt (4.1 KB)
  - time207-cond1.txt (6.5 KB)
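
The listing above appears in whatever order `os.listdir` returns, which is not guaranteed to be sorted. A minimal sketch of a sorted alternative using `pathlib` follows. The helper name `list_transcripts` is hypothetical and is not part of the `align_test` package.

```python
from pathlib import Path


def list_transcripts(data_dir):
    """Return (filename, size in KB) pairs for .txt files, sorted by name.

    Hypothetical helper for illustration; not part of align_test.
    """
    directory = Path(data_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"Directory not found: {directory}")
    return [
        (path.name, path.stat().st_size / 1024)
        for path in sorted(directory.glob("*.txt"))
    ]
```

In this notebook it could be called as `list_transcripts(CHILDES_DATA_DIR)`, which would make the file order reproducible across machines.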


---
## Inspect Raw Input Files

Let's look at the structure of the raw input files before preprocessing.

In [4]:
# Load and display a sample input file
sample_file = os.path.join(CHILDES_DATA_DIR, 'time200-cond1.txt')

print(f"Reading: {os.path.basename(sample_file)}\n")

raw_df = pd.read_csv(sample_file, sep='\t', encoding='utf-8')

print(f"Columns: {raw_df.columns.tolist()}")
print(f"Rows: {len(raw_df)}")
print(f"\nFirst 5 rows:")
raw_df.head()

Reading: time200-cond1.txt

Columns: ['participant', 'content']
Rows: 134

First 5 rows:


| | participant | content |
|---|---|---|
| 0 | cgv | well hurry Abe it's time to eat are you ready. |
| 1 | kid | is it time. |
| 2 | cgv | yeah. |
| 3 | kid | I'm almost done okay Mom now you can come over... |
| 4 | cgv | okay. |


In [5]:
# Show some sample content
print("Sample utterances:\n")
for i in range(min(5, len(raw_df))):
    print(f"{raw_df['participant'].iloc[i]}: {raw_df['content'].iloc[i]}")

Sample utterances:

cgv: well hurry Abe it's time to eat are you ready.
kid: is it time.
cgv: yeah.
kid: I'm almost done okay Mom now you can come over here and look.
cgv: okay.


---
## TEST 1: Basic Preprocessing (NLTK Only)

Run preprocessing with only the default NLTK POS tagger (the fastest option).

In [6]:
print("="*60)
print("TEST 1: Basic Preprocessing (NLTK only)")
print("="*60)

# Run preprocessing with minimal options
results_basic = prepare_transcripts(
    input_files=CHILDES_DATA_DIR,
    output_file_directory=OUTPUT_DIR_BASIC,
    run_spell_check=False,  # Disable for faster testing
    minwords=2,
    add_stanford_tags=False,  # NLTK only
    input_as_directory=True
)

print(f"\n✓ Preprocessing complete!")
print(f"Total utterances processed: {len(results_basic)}")

TEST 1: Basic Preprocessing (NLTK only)
Downloading required NLTK resources...
  - Downloading wordnet...
    ✓ wordnet downloaded successfully
  - Downloading omw-1.4...
    ✓ omw-1.4 downloaded successfully
NLTK resources ready!


Found 20 files to process
Output directory: ./test_output_basic

Processing: time197-cond1.txt
  1. Cleaning text...
  2. Merging adjacent turns...
  3. Tokenizing...
  4. Lemmatizing...
  5. Applying POS tagging...
Converting tokens and lemmas to string representations...
Applying NLTK POS tagging...
NLTK POS tagging complete
  6. Saved: time197-cond1.txt
     Rows: 76
Processing: time202-cond1.txt
  1. Cleaning text...
  2. Merging adjacent turns...
  3. Tokenizing...
  4. Lemmatizing...
  5. Applying POS tagging...
Converting tokens and lemmas to string representations...
Applying NLTK POS tagging...
NLTK POS tagging complete
  6. Saved: time202-cond1.txt
     Rows: 92
Processing: time191-cond1.txt
  1. Cleaning text...
  2. Merging adjacent turns...

In [7]:
# Examine the output
print("Output columns:")
for col in results_basic.columns:
    print(f"  - {col}")

print(f"\nDataFrame shape: {results_basic.shape}")
results_basic.head(3)

Output columns:
  - participant
  - content
  - token
  - lemma
  - tagged_token
  - tagged_lemma
  - file

DataFrame shape: (1832, 7)


| | participant | content | token | lemma | tagged_token | tagged_lemma | file |
|---|---|---|---|---|---|---|---|
| 0 | cgv | that was fun | ['that', 'was', 'fun'] | ['that', 'be', 'fun'] | [('that', 'DT'), ('was', 'VBD'), ('fun', 'NN')] | [('that', 'DT'), ('be', 'VB'), ('fun', 'NN')] | time197-cond1.txt |
| 1 | kid | dad you should have climbed the cliffs with us | ['dad', 'you', 'should', 'have', 'climbed', 't... | ['dad', 'you', 'should', 'have', 'climb', 'the... | [('dad', 'NN'), ('you', 'PRP'), ('should', 'MD... | [('dad', 'NN'), ('you', 'PRP'), ('should', 'MD... | time197-cond1.txt |
| 2 | cgv | next time i will | ['next', 'time', 'i', 'will'] | ['next', 'time', 'i', 'will'] | [('next', 'JJ'), ('time', 'NN'), ('i', 'NN'), ... | [('next', 'JJ'), ('time', 'NN'), ('i', 'NN'), ... | time197-cond1.txt |


In [8]:
# Examine a single row in detail
print("Sample processed utterance:\n")
sample_row = results_basic.iloc[0]

print(f"Participant: {sample_row['participant']}")
print(f"Content: {sample_row['content']}")
print(f"\nToken (string): {sample_row['token'][:100]}...")
print(f"Type: {type(sample_row['token'])}")

# Parse and display
tokens = ast.literal_eval(sample_row['token'])
print(f"\nToken (parsed): {tokens}")
print(f"Type after parsing: {type(tokens)}")

Sample processed utterance:

Participant: cgv
Content: that was fun

Token (string): ['that', 'was', 'fun']...
Type: <class 'str'>

Token (parsed): ['that', 'was', 'fun']
Type after parsing: <class 'list'>
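
The output above shows that list columns are stored as strings and recovered with `ast.literal_eval`. As a sketch of the same round trip done at load time, the example below writes a toy one-row frame to an in-memory TSV and reads it back using the `converters` argument of `pd.read_csv`. The toy frame is an assumption for illustration; it only borrows values from the sample row above.

```python
import ast
import io

import pandas as pd

# Toy stand-in for one preprocessed row (values borrowed from the sample above).
df = pd.DataFrame({
    "participant": ["cgv"],
    "token": [str(["that", "was", "fun"])],
    "tagged_token": [str([("that", "DT"), ("was", "VBD"), ("fun", "NN")])],
})

# Write to an in-memory TSV, then read it back, parsing list columns on load.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

list_columns = ["token", "tagged_token"]
parsed_df = pd.read_csv(
    buffer,
    sep="\t",
    converters={col: ast.literal_eval for col in list_columns},
)

print(type(parsed_df["token"].iloc[0]))
print(parsed_df["tagged_token"].iloc[0][0])
```

Parsing on load avoids repeated `ast.literal_eval` calls later, at the cost of failing immediately if any cell is malformed.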


### Validate Output Format

Check that the output format is compatible with alignment analysis scripts.
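
The checks in this subsection inspect only the first row of one output file. A stricter variant could validate every cell in a column. The sketch below assumes a hypothetical helper, `find_bad_cells`, that is not part of the package; it takes any iterable of cell values, so it works on plain lists as well as a pandas column.

```python
import ast


def find_bad_cells(values, expect_pairs=False):
    """Return (position, reason) for every cell that fails the format check.

    Hypothetical helper for illustration; not part of align_test.
    Each value should be a string that ast.literal_eval parses to a list;
    with expect_pairs=True, every list item must be a (word, tag) 2-tuple.
    """
    problems = []
    for position, value in enumerate(values):
        if not isinstance(value, str):
            problems.append((position, "not a string"))
            continue
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError) as exc:
            problems.append((position, f"parse failed: {exc}"))
            continue
        if not isinstance(parsed, list):
            problems.append((position, "not a list"))
        elif expect_pairs and not all(
            isinstance(item, tuple) and len(item) == 2 for item in parsed
        ):
            problems.append((position, "items are not (word, tag) pairs"))
    return problems


good = ["[('that', 'DT'), ('was', 'VBD')]", "[]"]
bad = ["[('that', 'DT')", "['that', 'was']", float("nan")]
print(find_bad_cells(good, expect_pairs=True))
print(find_bad_cells(bad, expect_pairs=True))
```

With a loaded output file, a call such as `find_bad_cells(test_df['tagged_token'], expect_pairs=True)` would cover the whole column rather than the first row.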

In [9]:
# Load one of the saved output files
output_files = [f for f in os.listdir(OUTPUT_DIR_BASIC) 
                if f.endswith('.txt') and 'concatenated' not in f]

print(f"Output files created: {output_files}")

# Load the first file
test_file_path = os.path.join(OUTPUT_DIR_BASIC, output_files[0])
print(f"\nLoading: {output_files[0]}")

test_df = pd.read_csv(test_file_path, sep='\t', encoding='utf-8')
print(f"Rows loaded: {len(test_df)}")
print(f"Columns: {test_df.columns.tolist()}")

Output files created: ['time197-cond1.txt', 'time202-cond1.txt', 'time191-cond1.txt', 'time209-cond1.txt', 'time210-cond1.txt', 'time204-cond1.txt', 'time192-cond1-bs.txt', 'time196-cond1.txt', 'time203-cond1.txt', 'time208-cond1.txt', 'time205-cond1.txt', 'time195-cond1.txt', 'time198-cond1.txt', 'time200-cond1.txt', 'time193-cond1.txt', 'time206-cond1.txt', 'time194-cond1.txt', 'time199-cond1.txt', 'time201-cond1.txt', 'time192-cond1.txt', 'time207-cond1.txt']

Loading: time197-cond1.txt
Rows loaded: 76
Columns: ['participant', 'content', 'token', 'lemma', 'tagged_token', 'tagged_lemma', 'file']


In [10]:
# Test 1: Verify all columns are present
required_cols = ['participant', 'content', 'token', 'lemma', 'tagged_token', 'tagged_lemma', 'file']

print("Test 1: Required Columns")
print("-" * 40)
for col in required_cols:
    present = col in test_df.columns
    status = "✓" if present else "✗"
    print(f"{status} {col}")

all_present = all(col in test_df.columns for col in required_cols)
print(f"\nResult: {'✓ PASSED' if all_present else '✗ FAILED'}")

Test 1: Required Columns
----------------------------------------
✓ participant
✓ content
✓ token
✓ lemma
✓ tagged_token
✓ tagged_lemma
✓ file

Result: ✓ PASSED


In [11]:
# Test 2: Verify data types (should be strings)
list_columns = ['token', 'lemma', 'tagged_token', 'tagged_lemma']

print("Test 2: Data Types (should be strings)")
print("-" * 40)

all_strings = True
for col in list_columns:
    if col in test_df.columns:
        first_val = test_df[col].iloc[0]
        is_string = isinstance(first_val, str)
        status = "✓" if is_string else "✗"
        print(f"{status} {col}: {type(first_val).__name__}")
        if not is_string:
            all_strings = False

print(f"\nResult: {'✓ PASSED' if all_strings else '✗ FAILED'}")

Test 2: Data Types (should be strings)
----------------------------------------
✓ token: str
✓ lemma: str
✓ tagged_token: str
✓ tagged_lemma: str

Result: ✓ PASSED


In [12]:
# Test 3: Verify ast.literal_eval compatibility
print("Test 3: ast.literal_eval Compatibility")
print("-" * 40)

all_parseable = True
for col in list_columns:
    if col in test_df.columns:
        try:
            parsed = ast.literal_eval(test_df[col].iloc[0])
            print(f"✓ {col}: Parses to {type(parsed).__name__}")
            print(f"  Sample: {str(parsed)[:60]}...")
        except Exception as e:
            print(f"✗ {col}: Parse failed - {e}")
            all_parseable = False

print(f"\nResult: {'✓ PASSED' if all_parseable else '✗ FAILED'}")

Test 3: ast.literal_eval Compatibility
----------------------------------------
✓ token: Parses to list
  Sample: ['that', 'was', 'fun']...
✓ lemma: Parses to list
  Sample: ['that', 'be', 'fun']...
✓ tagged_token: Parses to list
  Sample: [('that', 'DT'), ('was', 'VBD'), ('fun', 'NN')]...
✓ tagged_lemma: Parses to list
  Sample: [('that', 'DT'), ('be', 'VB'), ('fun', 'NN')]...

Result: ✓ PASSED


In [13]:
# Test 4: Verify POS tag tuple format
print("Test 4: POS Tag Format")
print("-" * 40)

correct_format = True
for col in ['tagged_token', 'tagged_lemma']:
    if col in test_df.columns:
        try:
            parsed = ast.literal_eval(test_df[col].iloc[0])
            if parsed:
                is_tuple = isinstance(parsed[0], tuple)
                correct_length = len(parsed[0]) == 2 if is_tuple else False
                
                if is_tuple and correct_length:
                    print(f"✓ {col}: Correct format")
                    print(f"  Sample: {parsed[0]} (word, POS)")
                else:
                    print(f"✗ {col}: Incorrect format")
                    correct_format = False
        except Exception as e:
            print(f"✗ {col}: Format check failed - {e}")
            correct_format = False

print(f"\nResult: {'✓ PASSED' if correct_format else '✗ FAILED'}")

Test 4: POS Tag Format
----------------------------------------
✓ tagged_token: Correct format
  Sample: ('that', 'DT') (word, POS)
✓ tagged_lemma: Correct format
  Sample: ('that', 'DT') (word, POS)

Result: ✓ PASSED


### TEST 1 Summary

In [38]:
print("="*60)
print("TEST 1 SUMMARY: Basic Preprocessing")
print("="*60)

test1_passed = all_present and all_strings and all_parseable and correct_format

if test1_passed:
    print("\n✓ TEST 1 PASSED: Basic preprocessing works correctly!")
    print("\nOutput format is compatible with alignment analysis.")
else:
    print("\n✗ TEST 1 FAILED: Some checks did not pass.")
    print("Please review the test results above.")

TEST 1 SUMMARY: Basic Preprocessing

✓ TEST 1 PASSED: Basic preprocessing works correctly!

Output format is compatible with alignment analysis.


---
## TEST 2: Preprocessing with spaCy

Test preprocessing with the spaCy POS tagger, which this notebook estimates to be roughly 100x faster than the Stanford tagger.

In [15]:
# Check if spaCy is available
try:
    import spacy
    print("✓ spaCy is installed")
    print("Note: Model will be auto-downloaded by prepare_transcripts() if needed")
    spacy_available = True
except ImportError:
    print("✗ spaCy not installed")
    print("Install with: pip install spacy")
    print("Will skip spaCy tests")
    spacy_available = False

✓ spaCy is installed
Note: Model will be auto-downloaded by prepare_transcripts() if needed


In [None]:
# Only run if spaCy is available
if spacy_available:
    print("="*60)
    print("TEST 2: Preprocessing with spaCy")
    print("="*60)
    
    # Run preprocessing with spaCy
    results_spacy = prepare_transcripts(
        input_files=CHILDES_DATA_DIR,
        output_file_directory=OUTPUT_DIR_SPACY,
        run_spell_check=False,
        minwords=2,
        add_stanford_tags=True,
        stanford_tagger_type='spacy',  # Use spaCy
        input_as_directory=True
    )
    
    print(f"\n✓ Preprocessing with spaCy complete!")
    print(f"Total utterances processed: {len(results_spacy)}")
else:
    print("\nSkipping spaCy test (spaCy not available)")
    results_spacy = None

In [17]:
# Examine spaCy output (if available)
if results_spacy is not None:
    print("Output columns:")
    for col in results_spacy.columns:
        print(f"  - {col}")
    
    # Check for spaCy-specific columns
    has_spacy_cols = 'tagged_stan_token' in results_spacy.columns and 'tagged_stan_lemma' in results_spacy.columns
    
    if has_spacy_cols:
        print("\n✓ spaCy tagging columns present (tagged_stan_token, tagged_stan_lemma)")
        
        # Show sample spaCy tags
        sample_spacy_tag = ast.literal_eval(results_spacy['tagged_stan_token'].iloc[0])
        print(f"\nSample spaCy tags:")
        for i, (word, tag) in enumerate(sample_spacy_tag[:5]):
            print(f"  {i+1}. ('{word}', '{tag}')")
    else:
        print("\n✗ spaCy tagging columns missing!")
    
    results_spacy.head(3)

Output columns:
  - participant
  - content
  - token
  - lemma
  - tagged_token
  - tagged_lemma
  - tagged_stan_token
  - tagged_stan_lemma
  - file

✓ spaCy tagging columns present (tagged_stan_token, tagged_stan_lemma)

Sample spaCy tags:
  1. ('that', 'DT')
  2. ('was', 'VBD')
  3. ('fun', 'JJ')


In [18]:
# Compare NLTK tags vs spaCy tags for same utterance
if results_spacy is not None:
    print("Comparison: NLTK vs spaCy POS Tags")
    print("="*60)
    
    sample_row = results_spacy.iloc[0]
    
    nltk_tags = ast.literal_eval(sample_row['tagged_token'])
    spacy_tags = ast.literal_eval(sample_row['tagged_stan_token'])
    
    print(f"Utterance: {sample_row['content']}\n")
    print(f"{'Word':<15} {'NLTK Tag':<10} {'spaCy Tag':<10} {'Same?':<10}")
    print("-" * 50)
    
    for (word_n, tag_n), (word_s, tag_s) in zip(nltk_tags, spacy_tags):
        same = "✓" if tag_n == tag_s else "✗"
        print(f"{word_n:<15} {tag_n:<10} {tag_s:<10} {same:<10}")
    
    # Calculate agreement
    agreements = sum(1 for (_, t1), (_, t2) in zip(nltk_tags, spacy_tags) if t1 == t2)
    total = len(nltk_tags)
    agreement_pct = (agreements / total * 100) if total > 0 else 0
    
    print(f"\nAgreement: {agreements}/{total} ({agreement_pct:.1f}%)")



Comparison: NLTK vs spaCy POS Tags
Utterance: that was fun

Word            NLTK Tag   spaCy Tag  Same?     
--------------------------------------------------
that            DT         DT         ✓
was             VBD        VBD        ✓
fun             NN         JJ         ✗

Agreement: 2/3 (66.7%)
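
The agreement figure above is the number of position-wise tag matches divided by the number of tokens. A minimal sketch of that computation as a reusable function follows; `tag_agreement` is a hypothetical name, and the two tag lists are copied from the comparison above.

```python
def tag_agreement(tags_a, tags_b):
    """Return (matches, total) for two equal-length lists of (word, tag) pairs.

    Hypothetical helper for illustration. Returns None when the lengths
    differ, since a position-wise comparison is not meaningful if the two
    taggers split the utterance into different tokens.
    """
    if len(tags_a) != len(tags_b):
        return None
    matches = sum(1 for (_, a), (_, b) in zip(tags_a, tags_b) if a == b)
    return matches, len(tags_a)


nltk_tags = [("that", "DT"), ("was", "VBD"), ("fun", "NN")]
spacy_tags = [("that", "DT"), ("was", "VBD"), ("fun", "JJ")]
matches, total = tag_agreement(nltk_tags, spacy_tags)
print(f"Agreement: {matches}/{total} ({matches / total:.1%})")
```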


### TEST 2 Summary

In [19]:
# Compare NLTK tags vs spaCy tags across ALL utterances
if results_spacy is not None:
    print("="*60)
    print("COMPREHENSIVE COMPARISON: NLTK vs spaCy POS Tags")
    print("="*60)
    
    total_agreements = 0
    total_tokens = 0
    per_utterance_agreements = []
    
    # Calculate agreement across all utterances
    for idx in range(len(results_spacy)):
        sample_row = results_spacy.iloc[idx]
        
        nltk_tags = ast.literal_eval(sample_row['tagged_token'])
        spacy_tags = ast.literal_eval(sample_row['tagged_stan_token'])
        
        if nltk_tags and spacy_tags and len(nltk_tags) == len(spacy_tags):
            agreements = sum(1 for (_, t1), (_, t2) in zip(nltk_tags, spacy_tags) if t1 == t2)
            total_agreements += agreements
            total_tokens += len(nltk_tags)
            
            # Track per-utterance agreement
            utterance_pct = (agreements / len(nltk_tags)) * 100
            per_utterance_agreements.append(utterance_pct)
    
    # Overall statistics
    overall_agreement_pct = (total_agreements / total_tokens * 100) if total_tokens > 0 else 0
    
    print(f"\n📊 OVERALL STATISTICS:")
    print(f"   Total tokens compared: {total_tokens}")
    print(f"   Agreements: {total_agreements}")
    print(f"   Disagreements: {total_tokens - total_agreements}")
    print(f"   Overall Agreement: {overall_agreement_pct:.1f}%")
    
    if per_utterance_agreements:
        import numpy as np
        print(f"\n   Per-utterance agreement:")
        print(f"      Mean: {np.mean(per_utterance_agreements):.1f}%")
        print(f"      Median: {np.median(per_utterance_agreements):.1f}%")
        print(f"      Min: {np.min(per_utterance_agreements):.1f}%")
        print(f"      Max: {np.max(per_utterance_agreements):.1f}%")
    
    # Show detailed examples
    print("\n" + "="*60)
    print("DETAILED EXAMPLES (First 3 utterances with >5 words)")
    print("="*60)
    
    examples_shown = 0
    for idx in range(len(results_spacy)):
        if examples_shown >= 3:
            break
            
        sample_row = results_spacy.iloc[idx]
        nltk_tags = ast.literal_eval(sample_row['tagged_token'])
        spacy_tags = ast.literal_eval(sample_row['tagged_stan_token'])
        
        if len(nltk_tags) > 5 and len(spacy_tags) > 5:
            examples_shown += 1
            
            print(f"\n--- Example {examples_shown} ---")
            print(f"Source: {sample_row.get('file', 'unknown')}")
            print(f"Participant: {sample_row.get('participant', 'unknown')}")
            print(f"Utterance: {sample_row['content']}\n")
            print(f"{'Word':<15} {'NLTK':<10} {'spaCy':<10} {'Match':<8}")
            print("-" * 48)
            
            agreements = 0
            disagreements = []
            
            for (word_n, tag_n), (word_s, tag_s) in zip(nltk_tags, spacy_tags):
                match = "✓" if tag_n == tag_s else "✗"
                print(f"{word_n:<15} {tag_n:<10} {tag_s:<10} {match:<8}")
                
                if tag_n == tag_s:
                    agreements += 1
                else:
                    disagreements.append((word_n, tag_n, tag_s))
            
            total = len(nltk_tags)
            print(f"\nAgreement: {agreements}/{total} ({100*agreements/total:.1f}%)")
            
            if disagreements:
                print(f"Disagreements: {len(disagreements)}")
                for word, nltk_tag, spacy_tag in disagreements[:3]:  # Show first 3
                    print(f"  • '{word}': NLTK={nltk_tag}, spaCy={spacy_tag}")
    
    # Identify most common disagreements
    print("\n" + "="*60)
    print("MOST COMMON TAG DISAGREEMENTS")
    print("="*60)
    
    disagreement_counts = {}
    for idx in range(len(results_spacy)):
        sample_row = results_spacy.iloc[idx]
        nltk_tags = ast.literal_eval(sample_row['tagged_token'])
        spacy_tags = ast.literal_eval(sample_row['tagged_stan_token'])
        
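        # Skip length mismatches (as in the overall statistics) so that
        # tags are only compared position by position
        if nltk_tags and spacy_tags and len(nltk_tags) == len(spacy_tags):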
            for (word, t1), (_, t2) in zip(nltk_tags, spacy_tags):
                if t1 != t2:
                    key = f"NLTK:{t1} vs spaCy:{t2}"
                    if key not in disagreement_counts:
                        disagreement_counts[key] = []
                    disagreement_counts[key].append(word)
    
    # Show top 10 disagreements
    if disagreement_counts:
        sorted_disagreements = sorted(disagreement_counts.items(), 
                                     key=lambda x: len(x[1]), 
                                     reverse=True)
        
        print("\nTop 10 tag disagreement patterns:")
        for i, (pattern, words) in enumerate(sorted_disagreements[:10], 1):
            example_words = ', '.join(list(set(words))[:3])  # Show up to 3 unique examples
            print(f"{i:2}. {pattern:<30} (n={len(words):3}) Examples: {example_words}")
    else:
        print("\n✓ Perfect agreement! No disagreements found.")
    
    print("\n" + "="*60)

COMPREHENSIVE COMPARISON: NLTK vs spaCy POS Tags

📊 OVERALL STATISTICS:
   Total tokens compared: 21927
   Agreements: 17950
   Disagreements: 3977
   Overall Agreement: 81.9%

   Per-utterance agreement:
      Mean: 81.5%
      Median: 83.3%
      Min: 0.0%
      Max: 100.0%

DETAILED EXAMPLES (First 3 utterances with >5 words)

--- Example 1 ---
Source: time197-cond1.txt
Participant: kid
Utterance: dad you should have climbed the cliffs with us

Word            NLTK       spaCy      Match   
------------------------------------------------
dad             NN         NN         ✓
you             PRP        PRP        ✓
should          MD         MD         ✓
have            VB         VB         ✓
climbed         VBD        VBN        ✗
the             DT         DT         ✓
cliffs          NNS        NNS        ✓
with            IN         IN         ✓
us              PRP        PRP        ✓

Agreeme
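
The disagreement tally in the cell above builds a dictionary of word lists by hand. As a sketch of a more compact alternative, the same counts could be kept in a `collections.Counter`. The helper `count_disagreements` is hypothetical, and the two toy rows reuse tag pairs that appear in the outputs above.

```python
from collections import Counter


def count_disagreements(rows):
    """Tally (tag_a, tag_b) mismatches over an iterable of (tags_a, tags_b).

    Hypothetical helper for illustration. Rows whose two tag lists differ
    in length are skipped so tags are only compared position by position.
    """
    counts = Counter()
    for tags_a, tags_b in rows:
        if len(tags_a) != len(tags_b):
            continue
        for (_, a), (_, b) in zip(tags_a, tags_b):
            if a != b:
                counts[(a, b)] += 1
    return counts


rows = [
    ([("that", "DT"), ("was", "VBD"), ("fun", "NN")],
     [("that", "DT"), ("was", "VBD"), ("fun", "JJ")]),
    ([("have", "VB"), ("climbed", "VBD")],
     [("have", "VB"), ("climbed", "VBN")]),
]
for (nltk_tag, spacy_tag), n in count_disagreements(rows).most_common(10):
    print(f"NLTK:{nltk_tag} vs spaCy:{spacy_tag} (n={n})")
```

`Counter.most_common(10)` replaces the manual sort, though this version drops the example words that the original cell keeps for each pattern.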

In [20]:
print("="*60)
print("TEST 2 SUMMARY: spaCy Preprocessing")
print("="*60)

if results_spacy is not None:
    test2_passed = has_spacy_cols and len(results_spacy) > 0
    
    if test2_passed:
        print("\n✓ TEST 2 PASSED: spaCy preprocessing works correctly!")
        print("\nspaCy tags are being generated and stored properly.")
    else:
        print("\n✗ TEST 2 FAILED: spaCy preprocessing had issues.")
else:
    print("\n⊘ TEST 2 SKIPPED: spaCy not available")
    test2_passed = None

TEST 2 SUMMARY: spaCy Preprocessing

✓ TEST 2 PASSED: spaCy preprocessing works correctly!

spaCy tags are being generated and stored properly.


---
## TEST 3: Preprocessing with Stanford

In [41]:
import subprocess

# Step 1: Check if Java is installed
print("\n1. Checking Java installation...")
print("-" * 40)

java_available = False
try:
    result = subprocess.run(['java', '-version'], 
                          capture_output=True, 
                          text=True, 
                          timeout=5)
    
    # Check both stderr and stdout for Java version info
    output = result.stderr + result.stdout
    
    # Java typically outputs to stderr, and should contain "version"
    # Check return code AND output content
    if result.returncode == 0 and ('version' in output.lower() or 'openjdk' in output.lower()):
        # Extract version line (usually first line)
        version_lines = [line for line in output.split('\n') if line.strip()]
        if version_lines:
            java_version = version_lines[0]
            # Double-check it's not an error message
            if 'unable to locate' not in java_version.lower() and 'not found' not in java_version.lower():
                print(f"✓ Java is installed: {java_version}")
                java_available = True
            else:
                print("✗ Java not found")
                print(f"  Error: {java_version}")
    else:
        print("✗ Java not found or not working properly")
        if output.strip():
            print(f"  Output: {output.strip()[:100]}")
        
except FileNotFoundError:
    print("✗ Java command not found")
    print("  Java is not installed or not in PATH")
except subprocess.TimeoutExpired:
    print("✗ Java check timed out")
except Exception as e:
    print(f"✗ Error checking Java: {e}")

if not java_available:
    print("\n  Stanford POS Tagger requires Java to run")
    print("  Install Java from:")
    print("    - macOS: https://www.java.com/en/download/")
    print("    - macOS (alternative): brew install openjdk")
    print("    - Linux: sudo apt-get install default-jdk")
    print("    - Windows: https://www.java.com/en/download/")


1. Checking Java installation...
----------------------------------------
✓ Java is installed: java version "1.8.0_471"
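
The Java check above can be condensed. The sketch below assumes a hypothetical helper, `java_version_line`, that first consults `shutil.which` so no process is spawned when the command is not on `PATH`, and then falls back to the same `-version` call used above.

```python
import shutil
import subprocess


def java_version_line(executable="java"):
    """Return the first line of `<executable> -version` output, or None.

    Hypothetical helper for illustration; not part of align_test.
    """
    if shutil.which(executable) is None:
        return None
    try:
        result = subprocess.run(
            [executable, "-version"],
            capture_output=True,
            text=True,
            timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # `java -version` typically writes to stderr, so check both streams.
    lines = (result.stderr + result.stdout).strip().splitlines()
    return lines[0] if lines else None


print(java_version_line() or "Java not found")
```

Relying on the return code rather than scanning the output for error phrases keeps the check short, at the cost of less detailed diagnostics than the cell above prints.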


In [42]:
# Step 2: Check for Stanford POS Tagger files
print("\n2. Checking Stanford POS Tagger files...")
print("-" * 40)

stanford_available = False
stanford_pos_path = None
stanford_language_path = None

if java_available:
    # Common locations where users might have Stanford tagger
    common_locations = [
        os.path.expanduser("~/stanford-postagger"),
        os.path.expanduser("~/Downloads/stanford-postagger-full-2020-11-17"),
        "/usr/local/stanford-postagger",
        "./stanford-postagger",
    ]
    
    # Check if any common location exists
    for location in common_locations:
        if os.path.exists(location):
            # Check for required files
            jar_path = os.path.join(location, "stanford-postagger.jar")
            model_path = os.path.join(location, "models/english-left3words-distsim.tagger")
            
            if os.path.exists(jar_path) and os.path.exists(model_path):
                stanford_pos_path = location
                stanford_language_path = "models/english-left3words-distsim.tagger"
                stanford_available = True
                print(f"✓ Found Stanford tagger at: {location}")
                print(f"  JAR: {jar_path}")
                print(f"  Model: {model_path}")
                break
    
    if not stanford_available:
        print("✗ Stanford POS Tagger not found in common locations")
        print("\nTo use Stanford tagger:")
        print("  1. Download from: https://nlp.stanford.edu/software/tagger.shtml#Download")
        print("  2. Extract to a known location")
        print("  3. Update the paths below")
        print("\nCommon locations checked:")
        for loc in common_locations:
            print(f"  - {loc}")
else:
    print("⊘ Skipping (Java not available)")


2. Checking Stanford POS Tagger files...
----------------------------------------
✗ Stanford POS Tagger not found in common locations

To use Stanford tagger:
  1. Download from: https://nlp.stanford.edu/software/tagger.shtml#Download
  2. Extract to a known location
  3. Update the paths below

Common locations checked:
  - /Users/ndd697/stanford-postagger
  - /Users/ndd697/Downloads/stanford-postagger-full-2020-11-17
  - /usr/local/stanford-postagger
  - ./stanford-postagger
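
The search above looks for a directory that contains both the JAR and the model file. A sketch of that lookup as a function follows; `find_stanford_tagger` is a hypothetical name, and the default file names are the ones checked in the cell above.

```python
import os


def find_stanford_tagger(
    candidates,
    jar_name="stanford-postagger.jar",
    model_rel_path=os.path.join("models", "english-left3words-distsim.tagger"),
):
    """Return the first candidate directory holding both the JAR and the model.

    Hypothetical helper for illustration; returns None if none qualifies.
    """
    for location in candidates:
        base = os.path.expanduser(location)
        jar_path = os.path.join(base, jar_name)
        model_path = os.path.join(base, model_rel_path)
        if os.path.isfile(jar_path) and os.path.isfile(model_path):
            return base
    return None


print(find_stanford_tagger(["~/stanford-postagger", "./stanford-postagger"]))
```

Returning the directory, or `None`, lets the caller decide what to print and makes the lookup easy to exercise against a temporary directory.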


In [43]:
# Step 3: Manual path configuration (if not auto-detected)
print("\n3. Path Configuration")
print("-" * 40)

if java_available and not stanford_available:
    print("\n⚠️  Stanford tagger not auto-detected.")
    print("If you have Stanford tagger installed, specify paths below:")
    print("\nExample paths:")
    print("  stanford_pos_path = '/Users/yourname/stanford-postagger-full-2020-11-17'")
    print("  stanford_language_path = 'models/english-left3words-distsim.tagger'")
    
    # Uncomment and update these lines if you have Stanford tagger installed:
    # stanford_pos_path = "/path/to/your/stanford-postagger"
    # stanford_language_path = "models/english-left3words-distsim.tagger"
    # stanford_available = True
    
    if stanford_pos_path and stanford_language_path:
        # Validate the paths
        jar_path = os.path.join(stanford_pos_path, "stanford-postagger.jar")
        model_path = os.path.join(stanford_pos_path, stanford_language_path)
        
        if os.path.exists(jar_path) and os.path.exists(model_path):
            stanford_available = True
            print(f"✓ Manual configuration successful")
        else:
            print(f"✗ Invalid paths:")
            if not os.path.exists(jar_path):
                print(f"  JAR not found: {jar_path}")
            if not os.path.exists(model_path):
                print(f"  Model not found: {model_path}")
            stanford_available = False
elif stanford_available:
    print(f"✓ Using auto-detected paths:")
    print(f"  Base: {stanford_pos_path}")
    print(f"  Model: {stanford_language_path}")


3. Path Configuration
----------------------------------------

⚠️  Stanford tagger not auto-detected.
If you have Stanford tagger installed, specify paths below:

Example paths:
  stanford_pos_path = '/Users/yourname/stanford-postagger-full-2020-11-17'
  stanford_language_path = 'models/english-left3words-distsim.tagger'
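
The example paths above pair a base directory with a model path relative to it, and Step 4 also accepts an absolute model path. A minimal sketch of resolving either form follows; `resolve_model_path` is a hypothetical helper and the example directory is made up.

```python
import os


def resolve_model_path(base_dir, model_path):
    """Return the model file location for a relative or absolute model_path.

    Hypothetical helper for illustration: an absolute model path is used as
    given; a relative one is interpreted relative to base_dir.
    """
    base_dir = os.path.normpath(os.path.expanduser(base_dir))
    model_path = os.path.normpath(model_path)
    if os.path.isabs(model_path):
        return model_path
    return os.path.join(base_dir, model_path)


# Made-up directory, for illustration only.
print(resolve_model_path(
    "/opt/stanford-postagger",
    "models/english-left3words-distsim.tagger",
))
```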


In [48]:
stanford_pos_path

'stanford-postagger-full-2020-11-17'

In [None]:
# Manual configuration: auto-detection failed above, so set the Stanford tagger paths explicitly
stanford_pos_path = '/Users/ndd697/Desktop/Github-Projects/llm-linguistic-alignment/sandbox-prepare/stanford-postagger-full-2020-11-17'
stanford_language_path = '/Users/ndd697/Desktop/Github-Projects/llm-linguistic-alignment/sandbox-prepare/stanford-postagger-full-2020-11-17/models/english-left3words-distsim.tagger'

# Step 4: Test Stanford tagging (if available)
print("\n4. Test Results")
print("-" * 40)

# RE-VALIDATE: Check if paths were manually set after Step 3
if not stanford_available and 'stanford_pos_path' in locals() and 'stanford_language_path' in locals():
    if stanford_pos_path is not None and stanford_language_path is not None:
        print("\n🔄 Detected manually configured paths. Validating...")
        
        # Normalize paths
        stanford_pos_path = os.path.normpath(os.path.expanduser(stanford_pos_path))
        stanford_language_path = os.path.normpath(stanford_language_path)
        
        # Check if stanford_language_path is absolute or relative
        if os.path.isabs(stanford_language_path):
            # It's an absolute path, use it directly
            model_path = stanford_language_path
        else:
            # It's relative to stanford_pos_path
            model_path = os.path.join(stanford_pos_path, stanford_language_path)
        
        jar_path = os.path.join(stanford_pos_path, "stanford-postagger.jar")
        
        print(f"  Checking JAR: {jar_path}")
        print(f"  Checking Model: {model_path}")
        
        if os.path.exists(jar_path) and os.path.exists(model_path):
            stanford_available = True
            print(f"  ✓ Manual configuration validated!")
            print(f"  ✓ Found JAR: {os.path.basename(jar_path)}")
            print(f"  ✓ Found Model: {os.path.basename(model_path)}")
            
            # Update stanford_language_path to be relative if it was given as absolute
            if os.path.isabs(stanford_language_path):
                # Convert to relative path from stanford_pos_path
                try:
                    stanford_language_path = os.path.relpath(model_path, stanford_pos_path)
                    print(f"  ℹ️  Converted to relative path: {stanford_language_path}")
                except ValueError:
                    # Can't make relative (e.g., different drives on Windows)
                    print(f"  ℹ️  Using absolute model path")
        else:
            print(f"  ✗ Validation failed:")
            if not os.path.exists(jar_path):
                print(f"    - JAR not found: {jar_path}")
            if not os.path.exists(model_path):
                print(f"    - Model not found: {model_path}")
            print(f"\n  💡 Tips:")
            print(f"    - stanford_pos_path should point to the Stanford tagger directory")
            print(f"    - stanford_language_path can be either:")
            print(f"      • Relative: 'models/english-left3words-distsim.tagger'")
            print(f"      • Absolute: '/full/path/to/models/english-left3words-distsim.tagger'")

# Now proceed with the test
if not java_available:
    print("\n⊘ TEST 3 SKIPPED: Java not installed")
    print("Stanford tagger requires Java to run.")
    stanford_test_passed = None
    
elif not stanford_available:
    print("\n⊘ TEST 3 SKIPPED: Stanford tagger not configured")
    print("Stanford tagger files not found or paths not specified.")
    print("\n💡 To configure manually:")
    print("   1. Make sure Java is installed")
    print("   2. Download Stanford POS Tagger from:")
    print("      https://nlp.stanford.edu/software/tagger.shtml#Download")
    print("   3. In Step 3 above, set:")
    print("      stanford_pos_path = '/path/to/stanford-postagger-full-2020-11-17'")
    print("      stanford_language_path = 'models/english-left3words-distsim.tagger'")
    print("   4. Re-run this cell (Step 4)")
    stanford_test_passed = None
    
else:
    print("\n✓ All prerequisites met. Running Stanford tagger test...")
    print(f"\nThis may take several minutes (Stanford tagger is ~100x slower than spaCy)")
    print(f"Processing {len([f for f in os.listdir(CHILDES_DATA_DIR) if f.endswith('.txt')])} files...")
    
    # Create output directory for Stanford test
    OUTPUT_DIR_STANFORD = './test_output_stanford'
    os.makedirs(OUTPUT_DIR_STANFORD, exist_ok=True)
    
    try:
        import time
        start_time = time.time()
        
        results_stanford = prepare_transcripts(
            input_files=CHILDES_DATA_DIR,
            output_file_directory=OUTPUT_DIR_STANFORD,
            run_spell_check=False,
            minwords=2,
            add_stanford_tags=True,
            stanford_tagger_type='stanford',  # Use Stanford
            stanford_pos_path=stanford_pos_path,
            stanford_language_path=stanford_language_path,
            stanford_batch_size=50,  # Process in batches for better performance
            input_as_directory=True
        )
        
        end_time = time.time()
        processing_time = end_time - start_time
        
        print(f"\n✓ Stanford preprocessing complete!")
        print(f"  Time taken: {processing_time:.1f} seconds ({processing_time/60:.1f} minutes)")
        print(f"  Total utterances processed: {len(results_stanford)}")
        
        # Check if Stanford tags were actually created
        sample_stanford_tags = ast.literal_eval(results_stanford['tagged_stan_token'].iloc[0])
        if sample_stanford_tags:
            print(f"  ✓ Stanford tags successfully generated")
            stanford_test_passed = True
        else:
            print(f"  ✗ Stanford tags are empty")
            stanford_test_passed = False
            
    except Exception as e:
        print(f"\n✗ Stanford preprocessing failed: {e}")
        import traceback
        traceback.print_exc()
        stanford_test_passed = False
        results_stanford = None


4. Test Results
----------------------------------------

🔄 Detected manually configured paths. Validating...
  Checking JAR: stanford-postagger-full-2020-11-17/stanford-postagger.jar
  Checking Model: stanford-postagger-full-2020-11-17/stanford-postagger-full-2020-11-17/models/english-left3words-distsim.tagger
  ✗ Validation failed:
    - Model not found: stanford-postagger-full-2020-11-17/stanford-postagger-full-2020-11-17/models/english-left3words-distsim.tagger

  💡 Tips:
    - stanford_pos_path should point to the Stanford tagger directory
    - stanford_language_path can be either:
      • Relative: 'models/english-left3words-distsim.tagger'
      • Absolute: '/full/path/to/models/english-left3words-distsim.tagger'

⊘ TEST 3 SKIPPED: Stanford tagger not configured
Stanford tagger files not found or paths not specified.

💡 To configure manually:
   1. Make sure Java is installed
   2. Download Stanford POS Tagger from:
      https://nlp.stanford.edu/software/tagg
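
The doubled directory in the "Checking Model" line above is consistent with what `os.path.join` produces when a relative model path already begins with the tagger directory. The sketch below reproduces that behaviour and shows one way to normalize it. `resolve_model_path` is a hypothetical helper written for illustration, not part of `prepare_transcripts`; the directory names are copied from the printed output.

```python
import os


def resolve_model_path(tagger_dir, model_path):
    """Return model_path relative to tagger_dir when it lies inside it.

    Hypothetical helper: handles a path relative to tagger_dir, a path
    that already starts with tagger_dir, and an unrelated absolute path
    (returned unchanged).
    """
    tagger_dir = os.path.normpath(os.path.expanduser(tagger_dir))
    model_path = os.path.normpath(os.path.expanduser(model_path))
    prefix = tagger_dir + os.sep
    if model_path.startswith(prefix):
        # The model path already includes the tagger directory: strip it
        return model_path[len(prefix):]
    return model_path


tagger_dir = "stanford-postagger-full-2020-11-17"
model_path = "stanford-postagger-full-2020-11-17/models/english-left3words-distsim.tagger"

# Naive join: the tagger directory appears twice, as in the output above
print(os.path.join(tagger_dir, model_path))

# Normalized join: the tagger directory appears once
print(os.path.join(tagger_dir, resolve_model_path(tagger_dir, model_path)))
```

With this normalization, both `'models/english-left3words-distsim.tagger'` and a path that repeats the tagger directory resolve to the same file.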

---
## TEST 3: Integration with Alignment Analysis

Test that preprocessed files work correctly with the alignment analysis scripts.

In [29]:
print("="*60)
print("TEST 3: Integration with Alignment Analysis")
print("="*60)

# Import alignment analyzer
from align_test.alignment import LinguisticAlignment

print("\n✓ Successfully imported LinguisticAlignment")

TEST 3: Integration with Alignment Analysis

✓ Successfully imported LinguisticAlignment


In [30]:
# Initialize alignment analyzer
print("Initializing alignment analyzer...")

analyzer = LinguisticAlignment(
    alignment_type="lexsyn",
    cache_dir=os.path.join(OUTPUT_DIR_ALIGNMENT, "cache")
)

print("✓ Analyzer initialized")

Initializing alignment analyzer...
✓ Analyzer initialized


In [31]:
# Run alignment analysis on preprocessed data
print("\nRunning alignment analysis...")
print(f"Input folder: {OUTPUT_DIR_BASIC}")
print(f"Output folder: {OUTPUT_DIR_ALIGNMENT}")

alignment_results = analyzer.analyze_folder(
    folder_path=OUTPUT_DIR_BASIC,
    output_directory=OUTPUT_DIR_ALIGNMENT,
    lag=1,
    max_ngram=2,
    ignore_duplicates=True,
    add_stanford_tags=False  # Using NLTK-only preprocessed data
)

print(f"\n✓ Alignment analysis complete!")
print(f"Utterance pairs analyzed: {len(alignment_results)}")


Running alignment analysis...
Input folder: ./test_output_basic
Output folder: ./test_alignment_results
ANALYZE_FOLDER: Processing data from folder: ./test_output_basic with lag=1
Found 22 files to process with lag 1


Processing time197-cond1.txt: 100%|██████████| 76/76 [00:00<00:00, 5002.62it/s]
Processing time202-cond1.txt: 100%|██████████| 92/92 [00:00<00:00, 5438.09it/s]
Processing time191-cond1.txt: 100%|██████████| 99/99 [00:00<00:00, 6079.68it/s]
Processing time209-cond1.txt: 100%|██████████| 98/98 [00:00<00:00, 6470.45it/s]
Processing time210-cond1.txt: 100%|██████████| 100/100 [00:00<00:00, 6922.21it/s]
Processing time204-cond1.txt: 100%|██████████| 143/143 [00:00<00:00, 6600.48it/s]
Processing time192-cond1-bs.txt: 100%|██████████| 67/67 [00:00<00:00, 6590.02it/s]
Processing time196-cond1.txt: 100%|██████████| 66/66 [00:00<00:00, 6578.36it/s]
Processing time203-cond1.txt: 100%|██████████| 90/90 [00:00<00:00, 6541.68it/s]
Processing time208-cond1.txt: 100%|██████████| 86/86 [00:00<00:00, 6091.5

Successfully processed 22 out of 22 files
Results saved to ./test_alignment_results/lexsyn/lexsyn_alignment_ngram2_lag1_noDups_noStan.csv

✓ Alignment analysis complete!
Utterance pairs analyzed: 3731


In [32]:
# Examine alignment results
print("Alignment Results:")
print(f"Shape: {alignment_results.shape}")
print(f"\nColumns: {alignment_results.columns.tolist()}")

alignment_results.head()

Alignment Results:
Shape: (3731, 24)

Columns: ['time', 'source_file', 'participant', 'content', 'token', 'lemma', 'tagged_token', 'tagged_lemma', 'lag', 'utter_order', 'content1', 'content2', 'utterance_length1', 'utterance_length2', 'lexical_tok1_cosine', 'lexical_lem1_cosine', 'pos_tok1_cosine', 'pos_lem1_cosine', 'lexical_tok2_cosine', 'lexical_lem2_cosine', 'pos_tok2_cosine', 'pos_lem2_cosine', 'lexical_master_cosine', 'syntactic_master_cosine']


Unnamed: 0,time,source_file,participant,content,token,lemma,tagged_token,tagged_lemma,lag,utter_order,...,lexical_tok1_cosine,lexical_lem1_cosine,pos_tok1_cosine,pos_lem1_cosine,lexical_tok2_cosine,lexical_lem2_cosine,pos_tok2_cosine,pos_lem2_cosine,lexical_master_cosine,syntactic_master_cosine
0,1,time197-cond1.txt,cgv,that was fun,"[that, was, fun]","[that, be, fun]","[(that, DT), (was, VBD), (fun, NN)]","[(that, DT), (be, VB), (fun, NN)]",1,cgv kid,...,0.0,0.0,0.522233,0.696311,0.0,0.0,0.0,0.0,0.0,0.0
1,2,time197-cond1.txt,kid,dad you should have climbed the cliffs with us,"[dad, you, should, have, climbed, the, cliffs,...","[dad, you, should, have, climb, the, cliff, wi...","[(dad, NN), (you, PRP), (should, MD), (have, V...","[(dad, NN), (you, PRP), (should, MD), (have, V...",1,kid cgv,...,0.0,0.0,0.369274,0.738549,0.0,0.0,0.0,0.0,0.0,0.0
2,3,time197-cond1.txt,cgv,next time i will,"[next, time, i, will]","[next, time, i, will]","[(next, JJ), (time, NN), (i, NN), (will, MD)]","[(next, JJ), (time, NN), (i, NN), (will, MD)]",1,cgv kid,...,0.27735,0.27735,0.0,0.0,0.0,0.0,0.154303,0.0,0.138675,0.077152
3,4,time197-cond1.txt,kid,did you have fun fishing i hope that we go the...,"[did, you, have, fun, fishing, i, hope, that, ...","[do, you, have, fun, fishing, i, hope, that, w...","[(did, VBD), (you, PRP), (have, VBP), (fun, VB...","[(do, VBP), (you, PRP), (have, VB), (fun, VBN)...",1,kid cgv,...,0.27735,0.27735,0.353553,0.258199,0.0,0.0,0.0,0.0,0.138675,0.0
4,5,time197-cond1.txt,cgv,i bet we will,"[i, bet, we, will]","[i, bet, we, will]","[(i, JJ), (bet, NN), (we, PRP), (will, MD)]","[(i, JJ), (bet, NN), (we, PRP), (will, MD)]",1,cgv kid,...,0.176777,0.166667,0.371391,0.533745,0.0,0.0,0.111111,0.09245,0.085861,0.101781


In [33]:
# Check for expected alignment metrics
expected_metrics = [
    'lexical_tok1_cosine',
    'lexical_lem1_cosine', 
    'pos_tok1_cosine',
    'pos_lem1_cosine',
    'lexical_master_cosine',
    'syntactic_master_cosine'
]

print("Expected Alignment Metrics:")
print("-" * 40)

found_metrics = []
for metric in expected_metrics:
    present = metric in alignment_results.columns
    status = "✓" if present else "✗"
    print(f"{status} {metric}")
    if present:
        found_metrics.append(metric)

print(f"\nFound {len(found_metrics)}/{len(expected_metrics)} expected metrics")

Expected Alignment Metrics:
----------------------------------------
✓ lexical_tok1_cosine
✓ lexical_lem1_cosine
✓ pos_tok1_cosine
✓ pos_lem1_cosine
✓ lexical_master_cosine
✓ syntactic_master_cosine

Found 6/6 expected metrics


In [34]:
# Show sample alignment scores
if found_metrics:
    print("\nSample Alignment Scores:")
    print("="*60)
    
    sample = alignment_results.iloc[0]
    
    print(f"Source: {sample['source_file']}")
    print(f"Participant: {sample['participant']}")
    print(f"Content: {sample['content']}")
    print(f"\nAlignment Scores:")
    
    for metric in found_metrics:
        if metric in sample:
            value = sample[metric]
            print(f"  {metric}: {value:.4f}" if pd.notna(value) else f"  {metric}: NaN")


Sample Alignment Scores:
Source: time197-cond1.txt
Participant: cgv
Content: that was fun

Alignment Scores:
  lexical_tok1_cosine: 0.0000
  lexical_lem1_cosine: 0.0000
  pos_tok1_cosine: 0.5222
  pos_lem1_cosine: 0.6963
  lexical_master_cosine: 0.0000
  syntactic_master_cosine: 0.0000
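
To make a score such as `lexical_tok1_cosine: 0.0000` easier to read, here is a minimal sketch of cosine similarity over n-gram counts. It assumes that each `*_cosine` column compares an utterance with the next turn at `lag=1` and that the number in the column name is the n-gram size. The library's actual computation is not shown in this notebook and may differ; `ngrams` and `ngram_cosine` are illustrative names.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_cosine(tokens_a, tokens_b, n=1):
    """Cosine similarity between the n-gram count vectors of two utterances."""
    counts_a = Counter(ngrams(tokens_a, n))
    counts_b = Counter(ngrams(tokens_b, n))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[g] * counts_b[g] for g in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # one utterance has no n-grams of this size
    return dot / (norm_a * norm_b)


# Tokens of the first two turns in time197-cond1.txt, from the results table
turn_1 = ["that", "was", "fun"]
turn_2 = ["dad", "you", "should", "have", "climbed", "the", "cliffs", "with", "us"]

# The two turns share no words, so the unigram cosine is zero
print(ngram_cosine(turn_1, turn_2, n=1))

# An utterance compared with itself scores 1 (up to floating-point rounding)
print(ngram_cosine(turn_1, turn_1, n=2))
```

Under these assumptions, a zero lexical score with a non-zero POS score means the two turns share part-of-speech patterns but no words.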


In [35]:
# Visualize alignment scores distribution
# Requires matplotlib, which is not among the prerequisites listed above: pip install matplotlib
import matplotlib.pyplot as plt

if 'lexical_master_cosine' in alignment_results.columns:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Lexical alignment distribution
    alignment_results['lexical_master_cosine'].hist(ax=axes[0], bins=20)
    axes[0].set_title('Lexical Alignment Distribution')
    axes[0].set_xlabel('Lexical Master Cosine')
    axes[0].set_ylabel('Frequency')
    
    # Syntactic alignment distribution
    if 'syntactic_master_cosine' in alignment_results.columns:
        alignment_results['syntactic_master_cosine'].hist(ax=axes[1], bins=20)
        axes[1].set_title('Syntactic Alignment Distribution')
        axes[1].set_xlabel('Syntactic Master Cosine')
        axes[1].set_ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()
    
    # Print summary statistics
    print("\nAlignment Score Statistics:")
    print("="*60)
    print(alignment_results[['lexical_master_cosine', 'syntactic_master_cosine']].describe())

ModuleNotFoundError: No module named 'matplotlib'
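
The error above means the plotting dependency is missing from the environment. One way to avoid a hard failure is to check for the module before importing it. The sketch below uses the standard library's `importlib.util.find_spec`; `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be imported."""
    return importlib.util.find_spec(name) is not None


if module_available("matplotlib"):
    print("matplotlib found: the histograms can be drawn")
else:
    print("matplotlib missing: run `pip install matplotlib`, "
          "or skip the plots and print only the describe() summary")
```

The summary statistics do not depend on matplotlib, so they can still be printed when the check fails.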

### TEST 3 Summary

In [None]:
print("="*60)
print("TEST 3 SUMMARY: Alignment Integration")
print("="*60)

test3_passed = len(alignment_results) > 0 and len(found_metrics) >= 4

if test3_passed:
    print("\n✓ TEST 3 PASSED: Integration with alignment analysis works!")
    print("\nPreprocessed files are fully compatible with alignment analysis.")
    print(f"Successfully analyzed {len(alignment_results)} utterance pairs.")
else:
    print("\n✗ TEST 3 FAILED: Integration issues detected.")
    print("Please review the test results above.")

---
## TEST 4: Check Output Files

Verify that the saved output files on disk are correct.
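
The reason list columns need parsing after loading can be shown with a small round trip. This is a sketch using only the standard library, and it assumes the preprocessed files are tab-separated with list-valued columns stored as Python literals, as the `sep='\t'` and `ast.literal_eval` calls in this test suggest. The sample row is taken from the results shown earlier.

```python
import ast
import csv
import io

rows = [
    {"participant": "cgv", "content": "that was fun",
     "token": ["that", "was", "fun"]},
]

# Write: a list-valued cell is stored as its string representation
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["participant", "content", "token"],
                        delimiter="\t")
writer.writeheader()
for row in rows:
    writer.writerow({**row, "token": str(row["token"])})

# Read back: every cell is a plain string until it is parsed
buffer.seek(0)
loaded = list(csv.DictReader(buffer, delimiter="\t"))
raw_token = loaded[0]["token"]
parsed_token = ast.literal_eval(raw_token)

print(type(raw_token).__name__, raw_token)
print(type(parsed_token).__name__, parsed_token)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for cells read from disk.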

In [None]:
print("="*60)
print("TEST 4: Output Files on Disk")
print("="*60)

# Check basic output directory
print(f"\nBasic output directory: {OUTPUT_DIR_BASIC}")
basic_files = [f for f in os.listdir(OUTPUT_DIR_BASIC) if f.endswith('.txt')]
print(f"Files created: {len(basic_files)}")
for f in basic_files:
    size_kb = os.path.getsize(os.path.join(OUTPUT_DIR_BASIC, f)) / 1024
    print(f"  - {f} ({size_kb:.1f} KB)")

In [None]:
# Load and verify a saved file
if basic_files:
    test_file = os.path.join(OUTPUT_DIR_BASIC, basic_files[0])
    print(f"\nVerifying saved file: {basic_files[0]}")
    
    # Load from disk
    saved_df = pd.read_csv(test_file, sep='\t', encoding='utf-8')
    print(f"✓ Loaded {len(saved_df)} rows from disk")
    
    # Quick format check
    token_str = saved_df['token'].iloc[0]
    print(f"\nToken column type: {type(token_str)}")
    print(f"Token value: {token_str[:80]}...")
    
    # Parse check
    try:
        token_list = ast.literal_eval(token_str)
        print(f"✓ Successfully parsed to: {type(token_list).__name__}")
        print(f"  Contents: {token_list}")
    except Exception as e:
        print(f"✗ Parse failed: {e}")

---
## Final Summary

Overall test results and next steps.
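
The tests in this notebook record three outcomes: `True` (passed), `False` (failed), and `None` (skipped, as with `stanford_test_passed` when prerequisites are missing). A minimal sketch of counting all three follows; `summarize` and the example outcomes are hypothetical and for illustration only.

```python
def summarize(results):
    """Count passed, failed and skipped tests.

    `results` maps a test name to True (passed), False (failed)
    or None (skipped).
    """
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    for outcome in results.values():
        if outcome is None:
            counts["skipped"] += 1
        elif outcome:
            counts["passed"] += 1
        else:
            counts["failed"] += 1
    return counts


# Hypothetical outcomes, not the results of this notebook
example = {
    "basic preprocessing": True,
    "optional tagger": None,  # prerequisites missing, so the test did not run
    "alignment integration": True,
}
print(summarize(example))
```

Reporting skipped tests separately keeps a run with missing optional dependencies from reading as a full pass.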

In [None]:
print("\n" + "="*60)
print("FINAL TEST SUMMARY")
print("="*60)

# Collect results
test_results = {
    "TEST 1: Basic Preprocessing (NLTK)": test1_passed,
    "TEST 2: spaCy Integration": test2_passed if test2_passed is not None else "SKIPPED",
    "TEST 3: Alignment Integration": test3_passed
}

# Print results
for test_name, result in test_results.items():
    if result == "SKIPPED":
        print(f"⊘ {test_name}: SKIPPED")
    elif result:
        print(f"✓ {test_name}: PASSED")
    else:
        print(f"✗ {test_name}: FAILED")

# Overall assessment
passed_tests = [r for r in test_results.values() if r is True]
failed_tests = [r for r in test_results.values() if r is False]

print(f"\nResults: {len(passed_tests)} passed, {len(failed_tests)} failed")

if len(failed_tests) == 0:
    print("\n" + "="*60)
    print("🎉 ALL TESTS PASSED!")
    print("="*60)
    print("\nThe refactored prepare_transcripts.py is working correctly!")
    print("\nYou can now:")
    print("  1. Use prepare_transcripts with your own data")
    print("  2. Run alignment analysis on preprocessed output")
    print("  3. Generate baseline comparisons with surrogates")
else:
    print("\n" + "="*60)
    print("⚠️  SOME TESTS FAILED")
    print("="*60)
    print("\nPlease review the failed tests above.")

---
## Bonus: Quick Preprocessing Example

Once tests pass, here's how to preprocess your own data.

In [None]:
# Example: Preprocess with spaCy (recommended)
# Uncomment and modify paths for your own data

# from align_test.prepare_transcripts_refactored import prepare_transcripts

# my_results = prepare_transcripts(
#     input_files="/path/to/my/raw/transcripts",
#     output_file_directory="/path/to/my/preprocessed/output",
#     run_spell_check=True,
#     minwords=2,
#     add_stanford_tags=True,
#     stanford_tagger_type='spacy',  # Recommended: fast and accurate
#     save_concatenated_dataframe=True
# )

# print(f"Preprocessed {len(my_results)} utterances!")

---
## Bonus: Full Pipeline Example

Complete workflow from raw data to alignment results.

In [None]:
# Example: Complete pipeline
# Uncomment to run on your own data

# # Step 1: Preprocess
# preprocessed = prepare_transcripts(
#     input_files="./my_raw_data",
#     output_file_directory="./my_preprocessed",
#     add_stanford_tags=True,
#     stanford_tagger_type='spacy'
# )

# # Step 2: Analyze alignment
# from align_test.alignment import LinguisticAlignment

# analyzer = LinguisticAlignment(alignment_types=["lexsyn", "fasttext"])
# results = analyzer.analyze_folder(
#     folder_path="./my_preprocessed",
#     output_directory="./my_results",
#     lag=1,
#     max_ngram=2,
#     add_stanford_tags=True  # Use spaCy tags from preprocessing
# )

# # Step 3: Generate baseline
# baseline = analyzer.analyze_baseline(
#     input_files="./my_preprocessed",
#     output_directory="./my_results",
#     lag=1,
#     max_ngram=2,
#     add_stanford_tags=True,
#     id_separator="-",
#     condition_label="cond",
#     dyad_label="dyad"
# )

# print("Complete pipeline finished!")
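
The `id_separator`, `condition_label` and `dyad_label` arguments suggest that the baseline step reads dyad and condition identifiers from file names. The sketch below shows how such names could be split, assuming they follow the pattern `<dyad_label><id><id_separator><condition_label><id>`. How `analyze_baseline` actually parses names is not shown in this notebook, and `parse_transcript_name` is a hypothetical helper, not part of `align_test`.

```python
import os


def parse_transcript_name(filename, id_separator="-",
                          dyad_label="dyad", condition_label="cond"):
    """Split a name such as 'dyad12-cond1.txt' into dyad and condition IDs.

    Hypothetical helper for illustration only.
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    ids = {"dyad": None, "condition": None}
    for part in stem.split(id_separator):
        if part.startswith(condition_label):
            ids["condition"] = part[len(condition_label):]
        elif part.startswith(dyad_label):
            ids["dyad"] = part[len(dyad_label):]
    return ids


print(parse_transcript_name("dyad12-cond1.txt"))

# The CHILDES sample files in this notebook start with "time", not "dyad"
print(parse_transcript_name("time197-cond1.txt", dyad_label="time"))
```

Under this assumption, the labels passed to the baseline step need to match the prefixes used in your own file names.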