# Tutorial 3: Baseline Alignment with Surrogate Conversations

This tutorial demonstrates how to establish **baseline alignment levels** using surrogate conversation pairs.

## What You'll Learn

- What surrogate conversations are and why they matter
- How to generate surrogate pairs from real conversations
- Computing baseline alignment in surrogate data
- Comparing real vs. baseline alignment
- Interpreting statistical significance of alignment

## Why Baseline Analysis Matters

When you find alignment in real conversations, you need to ask: **"Is this alignment meaningful, or could it occur by chance?"**

### The Problem:
Some alignment occurs naturally just from:
- Speaking the same language
- Discussing the same topic
- Using common grammatical structures

### The Solution: Surrogate Pairs
Create "fake" conversations by pairing speakers who **never actually talked to each other**:
- Same experimental condition
- Same number of turns
- But different dyads (pairs of people)

**Example**:
- Real conversation: Person A talks with Person B
- Surrogate: Person A's turns paired with Person C's turns (who never met)

### The Test:
If **real alignment > baseline alignment** and the difference is statistically reliable, the alignment likely reflects genuine interaction rather than chance.

## Prerequisites

Before starting, you should have:
1. Completed Tutorial 1 (Preprocessing)
2. Completed Tutorial 2 (Alignment Analysis)
3. Saved the real alignment results from Tutorial 2

---
## Step 1: Import and Configure

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Import the alignment analyzer
from align_test.alignment import LinguisticAlignment

print("✓ Imports successful")

In [None]:
# ============================================================
# Configure Paths
# ============================================================

# INPUT: Choose which preprocessed output to use
# This should match what you used in Tutorial 2 for real alignment!
#
# Options:
#   './tutorial_output/preprocessed_nltk'     - NLTK tags only (fastest)
#   './tutorial_output/preprocessed_spacy'    - NLTK + spaCy tags
#   './tutorial_output/preprocessed_stanford' - NLTK + Stanford tags
#
# ⚠️ IMPORTANT: Use the SAME preprocessing output you used in Tutorial 2
# so that baseline results are directly comparable to real results!

INPUT_DIR = './tutorial_output/preprocessed_nltk'  # ← Change this if needed

# OUTPUT: Where to save baseline results
OUTPUT_DIR = './tutorial_output/baseline_results'

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Using preprocessed files from: {INPUT_DIR}")
print(f"Baseline results will be saved to: {OUTPUT_DIR}")

# Verify input data exists
files = []  # stays empty if the input directory is missing, so later cells do not raise NameError
if os.path.exists(INPUT_DIR):
    files = [f for f in os.listdir(INPUT_DIR) if f.endswith('.txt') and 'concatenated' not in f]
    print(f"\n✓ Found {len(files)} preprocessed conversation files")
    print("\nSample filenames:")
    for f in files[:3]:
        print(f"  - {f}")
else:
    print("\n✗ Preprocessed data not found!")
    print("Please run Tutorial 1 (Preprocessing) first.")


---
## Step 2: Identify Filename Patterns

The surrogate algorithm needs to identify **which dyad** and **which condition** each file belongs to. It does this by looking at the **filename patterns** from your preprocessing output.

**Important**: You're **NOT** changing your filenames! The filenames come from Tutorial 1 preprocessing and are already set. Your job is to **tell the algorithm how to read them**.

### What the Algorithm Needs to Know:

Looking at the CHILDES sample filenames (`time197-cond1.txt`, `time202-cond1.txt`, etc.):
- **Dyad identifier**: `time197`, `time202` (the part that identifies which pair of people)
- **Condition identifier**: `cond1` (the experimental condition)
- **Separator**: `-` (the character between components)

### Your Task: Configure These Parameters

In the code below, you'll specify:
- `DYAD_LABEL`: The text that **precedes** the dyad ID (e.g., `'time'` in `time197`)
- `CONDITION_LABEL`: The text that **precedes** the condition ID (e.g., `'cond'` in `cond1`)
- `ID_SEPARATOR`: The character that separates parts (e.g., `'-'` in `time197-cond1`)

### More Examples:

| Filename | DYAD_LABEL | CONDITION_LABEL | ID_SEPARATOR |
|----------|------------|-----------------|---------------|
| `time197-cond1.txt` | `'time'` | `'cond'` | `'-'` |
| `dyad05-condition2.txt` | `'dyad'` | `'condition'` | `'-'` |
| `pair_A_exp_1.txt` | `'pair'` | `'exp'` | `'_'` |

**The algorithm will then:**
1. Extract dyad and condition IDs from each filename
2. Group files by condition (e.g., all `cond1` files together)
3. Within each condition, pair different dyads to create surrogates
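
The three steps above can be sketched in a few lines. This is only an illustration of the idea, not the package's internal parser: `parse_ids` is a hypothetical helper, and it assumes each label is immediately followed by its ID (as in `time197-cond1.txt`).

```python
import os
from collections import defaultdict

def parse_ids(filename, dyad_label, condition_label, id_separator):
    """Return (dyad_id, condition_id) parsed from a filename (hypothetical helper)."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    dyad_id = condition_id = None
    for part in stem.split(id_separator):
        if part.startswith(dyad_label):
            dyad_id = part[len(dyad_label):]
        elif part.startswith(condition_label):
            condition_id = part[len(condition_label):]
    return dyad_id, condition_id

# Group toy filenames by condition; only dyads that share a condition
# would later be paired into surrogates.
by_condition = defaultdict(list)
for name in ["time197-cond1.txt", "time202-cond1.txt", "time209-cond2.txt"]:
    dyad_id, condition_id = parse_ids(name, "time", "cond", "-")
    by_condition[condition_id].append(dyad_id)

print(dict(by_condition))  # {'1': ['197', '202'], '2': ['209']}
```

If a filename returns `None` for either ID in a sketch like this, the labels or separator probably do not match your naming convention.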

In [None]:
# ============================================================
# Identify the Filename Pattern in YOUR Data
# ============================================================
# Look at your actual filenames and configure these parameters
# to match YOUR naming convention:

ID_SEPARATOR = '-'        # ← Character separating parts (e.g., '-' in 'time197-cond1')
DYAD_LABEL = 'time'       # ← Text before dyad ID (e.g., 'time' in 'time197-cond1')
CONDITION_LABEL = 'cond'  # ← Text before condition ID (e.g., 'cond' in 'time197-cond1')

print(f"Your Filename Pattern Configuration:")
print("="*60)
print(f"  Dyad label: '{DYAD_LABEL}'")
print(f"  Condition label: '{CONDITION_LABEL}'")
print(f"  Separator: '{ID_SEPARATOR}'")
print(f"\n💡 These parameters tell the algorithm how to READ your existing filenames.")
print(f"   You are NOT renaming files - just identifying the pattern!")

# Validate that your filenames match this pattern
print(f"\nValidating filenames against your configuration:\n")

for filename in files[:5]:
    has_dyad = DYAD_LABEL in filename
    has_cond = CONDITION_LABEL in filename
    has_sep = ID_SEPARATOR in filename
    
    status = "✓" if (has_dyad and has_cond and has_sep) else "✗"
    print(f"{status} {filename}")
    
    if not (has_dyad and has_cond and has_sep):
        missing = []
        if not has_dyad: missing.append(f"'{DYAD_LABEL}'")
        if not has_cond: missing.append(f"'{CONDITION_LABEL}'")
        if not has_sep: missing.append(f"'{ID_SEPARATOR}'")
        print(f"   ⚠️  Missing: {', '.join(missing)}")
        print(f"   → Update the configuration parameters above to match your filenames")

print("\n" + "="*60)
print("If all files show ✓, you're ready to proceed!")
print("If any show ✗, update DYAD_LABEL, CONDITION_LABEL, or ID_SEPARATOR above.")
print("="*60)

---
## Step 3: Generate Surrogate Conversation Pairs

The algorithm will:
1. Group files by condition (e.g., all `cond1` files together)
2. Create all possible pairings of different dyads within each condition
3. For each pairing, create 2 surrogate conversations by interleaving turns

### Example:
**Original conversations:**
- File 1: Dyad A (Person 1 + Person 2)
- File 2: Dyad B (Person 3 + Person 4)

**Surrogates created:**
- Surrogate 1: Person 1's turns + Person 3's turns
- Surrogate 2: Person 2's turns + Person 4's turns

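**Result**: The number of surrogates grows quickly. With 20 files in a single condition there are 190 possible dyad pairings (20 × 19 / 2), and two surrogate conversations per pairing gives up to 380 surrogate conversations.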
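
To make the pairing logic concrete, here is a minimal sketch on toy data. It follows the example above (first speakers paired together, second speakers paired together) and interleaves turns in their original order. `make_surrogates` is a hypothetical helper written for illustration, and truncating to the shorter speaker is an assumption; the package's own generator may differ in its details.

```python
from itertools import combinations

def make_surrogates(turns_a, turns_b):
    """Build two surrogate turn lists from two dyads (hypothetical helper).

    Each conversation is a list of (speaker, utterance) tuples.
    """
    def speaker_turns(turns, index):
        # Speakers in order of first appearance; pick the one at `index`
        speakers = list(dict.fromkeys(s for s, _ in turns))
        return [t for t in turns if t[0] == speakers[index]]

    surrogates = []
    for index in (0, 1):
        a = speaker_turns(turns_a, index)
        b = speaker_turns(turns_b, index)
        n = min(len(a), len(b))  # assumption: truncate to the shorter speaker
        surrogates.append([turn for pair in zip(a[:n], b[:n]) for turn in pair])
    return surrogates

dyad_a = [("P1", "hi"), ("P2", "hello"), ("P1", "how are you"), ("P2", "fine")]
dyad_b = [("P3", "look"), ("P4", "what"), ("P3", "a dog"), ("P4", "oh")]
s1, s2 = make_surrogates(dyad_a, dyad_b)
print(s1)  # [('P1', 'hi'), ('P3', 'look'), ('P1', 'how are you'), ('P3', 'a dog')]

# Counting pairings for 20 dyads in one condition
n_pairings = len(list(combinations(range(20), 2)))
print(n_pairings, n_pairings * 2)  # 190 380
```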

In [None]:
# Initialize analyzer for baseline analysis
print("Initializing analyzer for baseline analysis...\n")

analyzer_baseline = LinguisticAlignment(
    alignment_type="lexsyn"
)

print("✓ Analyzer ready for baseline computation")

In [None]:
# Generate surrogates and analyze baseline alignment
print("Generating surrogate pairs and computing baseline...\n")
print("⚠️  This may take several minutes depending on the number of files...\n")

baseline_results = analyzer_baseline.analyze_baseline(
    input_files=INPUT_DIR,
    output_directory=OUTPUT_DIR,
    lag=1,
    max_ngram=2,
    ignore_duplicates=True,
    all_surrogates=True,              # Generate all possible pairings
    keep_original_turn_order=True,    # Maintain temporal order
    id_separator=ID_SEPARATOR,
    dyad_label=DYAD_LABEL,
    condition_label=CONDITION_LABEL
)

print(f"\n✓ Baseline analysis complete!")
print(f"Rows in baseline results: {len(baseline_results)}")

In [None]:
# Examine what was created
import glob

surrogate_dir = os.path.join(OUTPUT_DIR, 'surrogates')
surrogate_runs = [d for d in os.listdir(surrogate_dir) if d.startswith('surrogate_run-')]

if surrogate_runs:
    latest_run = sorted(surrogate_runs)[-1]
    surrogate_files = glob.glob(os.path.join(surrogate_dir, latest_run, '*.txt'))
    
    print(f"Surrogate Generation Summary:")
    print("="*60)
    print(f"  Original conversations: {len(files)}")
    print(f"  Surrogate pairs created: {len(surrogate_files)}")
    print(f"  Location: {surrogate_dir}/{latest_run}/")
    
    print(f"\n  Sample surrogate filenames:")
    for f in surrogate_files[:3]:
        print(f"    - {os.path.basename(f)}")

---
## Step 4: Load Real Alignment Results

Load the real conversation alignment computed in Tutorial 2 for comparison.

In [None]:
# Load real alignment results from Tutorial 2
real_results_path = './tutorial_output/alignment_results/lexsyn/lexsyn_alignment_ngram2_lag1_noDups_noAdd.csv'

if os.path.exists(real_results_path):
    real_results = pd.read_csv(real_results_path)
    print(f"✓ Loaded real alignment results")
    print(f"  Utterance pairs: {len(real_results)}")
else:
    print("✗ Real alignment results not found!")
    print("Please run Tutorial 2 first to generate real alignment results.")
    print(f"Expected file: {real_results_path}")

---
## Step 5: Compare Real vs. Baseline Alignment

Now we can see if real conversations show more alignment than surrogate pairs.

In [None]:
# Compare statistics
print("Alignment Comparison: Real vs. Baseline")
print("="*60)

metrics = ['lexical_master_cosine', 'syntactic_master_cosine']

for metric in metrics:
    real_mean = real_results[metric].mean()
    baseline_mean = baseline_results[metric].mean()
    difference = real_mean - baseline_mean
    percent_increase = (difference / baseline_mean * 100) if baseline_mean > 0 else 0
    
    print(f"\n{metric}:")
    print(f"  Real conversations:  {real_mean:.4f}")
    print(f"  Baseline (surrogates): {baseline_mean:.4f}")
    print(f"  Difference: {difference:.4f} ({percent_increase:+.1f}%)")
    
    if difference > 0:
        print(f"  → Real conversations show MORE alignment ✓")
    else:
        print(f"  → No additional alignment in real conversations")

### Visualize Real vs. Baseline Distributions

In [None]:
# Create side-by-side boxplots for clearer comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Lexical alignment comparison
data_lexical = [
    real_results['lexical_master_cosine'].dropna(),
    baseline_results['lexical_master_cosine'].dropna()
]
bp1 = axes[0].boxplot(data_lexical, patch_artist=True)
axes[0].set_xticklabels(['Real', 'Baseline'])  # avoids boxplot's `labels` argument, deprecated in Matplotlib 3.9
bp1['boxes'][0].set_facecolor('steelblue')
bp1['boxes'][1].set_facecolor('lightgray')
axes[0].set_title('Lexical Alignment: Real vs. Baseline', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Alignment Score', fontsize=12)
axes[0].grid(alpha=0.3, axis='y')

# Add mean markers
means_lex = [d.mean() for d in data_lexical]
axes[0].scatter([1, 2], means_lex, color='red', s=100, zorder=3, label='Mean', marker='D')
axes[0].legend()

# Syntactic alignment comparison
data_syntactic = [
    real_results['syntactic_master_cosine'].dropna(),
    baseline_results['syntactic_master_cosine'].dropna()
]
bp2 = axes[1].boxplot(data_syntactic, patch_artist=True)
axes[1].set_xticklabels(['Real', 'Baseline'])  # avoids boxplot's `labels` argument, deprecated in Matplotlib 3.9
bp2['boxes'][0].set_facecolor('coral')
bp2['boxes'][1].set_facecolor('lightgray')
axes[1].set_title('Syntactic Alignment: Real vs. Baseline', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Alignment Score', fontsize=12)
axes[1].grid(alpha=0.3, axis='y')

# Add mean markers
means_syn = [d.mean() for d in data_syntactic]
axes[1].scatter([1, 2], means_syn, color='red', s=100, zorder=3, label='Mean', marker='D')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n📊 Interpretation Guide:")
print("  - Box: Middle 50% of data (25th-75th percentile)")
print("  - Line in box: Median")
print("  - Red diamond: Mean")
print("  - Whiskers: Range of data (excluding outliers)")
print("  - Circles: Outliers\n")
print("If Real box/mean is higher than Baseline → alignment is above chance!")

---
## Step 6: Statistical Testing

Perform statistical tests to determine if the difference between real and baseline is significant.
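
One caveat worth keeping in mind: the surrogate set is usually much larger than the real set, and rows from the same conversation are not independent of each other. A more conservative variant, sketched below on toy data, first averages within each conversation and then applies Welch's t-test (`equal_var=False`), which does not assume equal variances. The `source_file` column is a hypothetical name for whatever column identifies the conversation in your results, and the numbers are randomly generated for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Toy stand-ins for the real and baseline tables (synthetic values).
# `source_file` is a hypothetical column naming each row's conversation.
real = pd.DataFrame({
    "source_file": np.repeat([f"real_{i}" for i in range(10)], 30),
    "lexical_master_cosine": rng.normal(0.30, 0.10, 300),
})
baseline = pd.DataFrame({
    "source_file": np.repeat([f"surr_{i}" for i in range(90)], 30),
    "lexical_master_cosine": rng.normal(0.25, 0.10, 2700),
})

# One mean per conversation, so each conversation is a single observation
real_means = real.groupby("source_file")["lexical_master_cosine"].mean()
base_means = baseline.groupby("source_file")["lexical_master_cosine"].mean()

# Welch's t-test on the per-conversation means
t_stat, p_value = stats.ttest_ind(real_means, base_means, equal_var=False)
print(f"n_real={len(real_means)}, n_baseline={len(base_means)}")
print(f"t={t_stat:.3f}, p={p_value:.4g}")
```

Aggregating first trades statistical power for a test whose independence assumption is easier to defend.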

In [None]:
print("Statistical Significance Testing")
print("="*60)

for metric in metrics:
    # Independent samples t-test
    real_values = real_results[metric].dropna()
    baseline_values = baseline_results[metric].dropna()
    
    t_stat, p_value = stats.ttest_ind(real_values, baseline_values)
    
    print(f"\n{metric}:")
    print(f"  t-statistic: {t_stat:.4f}")
    print(f"  p-value: {p_value:.6f}")
    
    if p_value < 0.001:
        print(f"  → Highly significant (p < 0.001) ***")
    elif p_value < 0.01:
        print(f"  → Very significant (p < 0.01) **")
    elif p_value < 0.05:
        print(f"  → Significant (p < 0.05) *")
    else:
        print(f"  → Not significant (p >= 0.05)")
    
    # Effect size (Cohen's d)
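    # Pooled SD weighted by group size, since the two groups usually differ greatly in n
    n_real, n_base = len(real_values), len(baseline_values)
    pooled_std = np.sqrt(((n_real - 1) * real_values.var() + (n_base - 1) * baseline_values.var()) / (n_real + n_base - 2))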
    cohens_d = (real_values.mean() - baseline_values.mean()) / pooled_std
    print(f"  Cohen's d: {cohens_d:.4f}", end="")
    
    if abs(cohens_d) > 0.8:
        print(" (large effect)")
    elif abs(cohens_d) > 0.5:
        print(" (medium effect)")
    elif abs(cohens_d) > 0.2:
        print(" (small effect)")
    else:
        print(" (negligible effect)")

---
## Step 7: Interpretation Guide

### Understanding the Results:

#### If Real > Baseline (Statistically Significant):
- ✅ **Alignment is meaningful**: Speakers genuinely adapt to each other
- ✅ **Not just chance**: The alignment exceeds what random pairing would produce

#### If Real ≈ Baseline (Not Significant):
- ⚠️ **Alignment may be spurious**: Could be due to topic/language constraints
- ⚠️ **Need more data**: Or the effect is too subtle to detect
- ⚠️ **Reconsider analysis**: Try different parameters or alignment types

#### If Real < Baseline (Rare):
- 🤔 **Anti-alignment?**: Speakers may be deliberately differentiating
- 🤔 **Check your data**: Ensure preprocessing was correct
- 🤔 **Unusual pattern**: Worth investigating further
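
The three cases can also be expressed as a small decision rule. The sketch below is one possible encoding, with a hypothetical `interpret` helper and a conventional `alpha` of 0.05 as an assumption; treat its labels as prompts for further checking rather than conclusions.

```python
def interpret(difference, p_value, alpha=0.05):
    """Map a real-minus-baseline difference and p-value to one of the three cases."""
    if p_value >= alpha:
        return "real ≈ baseline: not significant"
    if difference > 0:
        return "real > baseline: alignment exceeds the surrogate baseline"
    return "real < baseline: check preprocessing; possible differentiation"

print(interpret(0.04, 0.002))   # real > baseline: alignment exceeds the surrogate baseline
print(interpret(0.01, 0.40))    # real ≈ baseline: not significant
print(interpret(-0.03, 0.01))   # real < baseline: check preprocessing; possible differentiation
```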


---
## Step 8: Save Comparison Data

Create a summary dataframe for further analysis.

In [None]:
# Create comparison summary
comparison_data = []

for metric in metrics:
    real_values = real_results[metric].dropna()
    baseline_values = baseline_results[metric].dropna()
    
    t_stat, p_value = stats.ttest_ind(real_values, baseline_values)
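    # Pooled SD weighted by group size, since the two groups usually differ greatly in n
    n_real, n_base = len(real_values), len(baseline_values)
    pooled_std = np.sqrt(((n_real - 1) * real_values.var() + (n_base - 1) * baseline_values.var()) / (n_real + n_base - 2))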
    cohens_d = (real_values.mean() - baseline_values.mean()) / pooled_std
    
    comparison_data.append({
        'metric': metric,
        'real_mean': real_values.mean(),
        'real_std': real_values.std(),
        'baseline_mean': baseline_values.mean(),
        'baseline_std': baseline_values.std(),
        'difference': real_values.mean() - baseline_values.mean(),
        't_statistic': t_stat,
        'p_value': p_value,
        'cohens_d': cohens_d,
        'real_n': len(real_values),
        'baseline_n': len(baseline_values)
    })

comparison_df = pd.DataFrame(comparison_data)

# Save comparison
comparison_dir = os.path.join(OUTPUT_DIR, 'comparison')
os.makedirs(comparison_dir, exist_ok=True)
comparison_path = os.path.join(comparison_dir, 'alignment_comparison_lexsyn.csv')
comparison_df.to_csv(comparison_path, index=False)

print("Comparison Summary:")
print("="*60)
print(comparison_df.to_string(index=False))

print(f"\n✓ Comparison saved to: {comparison_path}")

---
## Step 9: Review All Output Files

In [None]:
print("📁 Baseline Analysis Output Files:\n")
print("="*60)

# Show directory structure
for root, dirs, files_list in os.walk(OUTPUT_DIR):
    level = root.replace(OUTPUT_DIR, '').count(os.sep)
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    
    subindent = ' ' * 2 * (level + 1)
    for file in files_list[:5]:  # Show first 5 files per directory
        size_kb = os.path.getsize(os.path.join(root, file)) / 1024
        print(f"{subindent}{file} ({size_kb:.1f} KB)")
    
    if len(files_list) > 5:
        print(f"{subindent}... and {len(files_list) - 5} more files")

print("\n" + "="*60)

---
## Summary

Congratulations! You've completed the baseline analysis tutorial.

### What You've Learned:

1. ✓ **Surrogate Concept**: Why baseline comparison matters for research
2. ✓ **Surrogate Generation**: Creating fake conversation pairs from real data
3. ✓ **Baseline Computation**: Analyzing alignment in surrogate data
4. ✓ **Statistical Testing**: Testing significance of real vs. baseline
5. ✓ **Interpretation**: Understanding what your results mean

### Key Findings to Report:

For each alignment metric, you now have:
- **Real alignment mean and SD**
- **Baseline alignment mean and SD**
- **Statistical test results** (t-test, p-value)
- **Effect size** (Cohen's d)
- **Interpretation** (is the alignment meaningful?)

### Next Steps:

- **Export results**: Load comparison CSV into R, SPSS, or Excel for publication
- **Visualize in papers**: Use the plots generated here
- **Try different analyzers**: Run baseline with FastText or BERT
- **Adjust parameters**: Test different lag values, n-gram sizes
- **Use your own data**: Apply to your research conversations

### Advanced Options:

#### Use Existing Surrogates:
If you've already generated surrogates, reuse them:
```python
baseline_results = analyzer_baseline.analyze_baseline(
    input_files=INPUT_DIR,
    use_existing_surrogates='./path/to/surrogates/surrogate_run-123456/'
)
```

#### Sample Fewer Surrogates:
For faster testing, generate fewer pairs:
```python
baseline_results = analyzer_baseline.analyze_baseline(
    input_files=INPUT_DIR,
    all_surrogates=False  # Generate ~50% of possible pairs
)
```

#### Randomize Turn Order:
Break temporal structure in surrogates:
```python
baseline_results = analyzer_baseline.analyze_baseline(
    input_files=INPUT_DIR,
    keep_original_turn_order=False  # Shuffle turns
)
```

---
## ✅ Tutorial 3 Complete!

You now have the complete workflow for rigorous alignment analysis:

1. **Tutorial 1**: Preprocess raw conversations
2. **Tutorial 2**: Compute alignment metrics
3. **Tutorial 3**: Establish baseline and test significance ← You are here!

---

## 🎓 Congratulations!

You've mastered the complete ALIGN package workflow and are ready to conduct publication-quality linguistic alignment research.

For questions or support, please visit the GitHub repository.