# Tutorial 2: Analyzing Linguistic Alignment

This tutorial demonstrates how to analyze linguistic alignment in conversational data using the ALIGN package.

## What You'll Learn

- Computing **lexical-syntactic alignment** (word and grammar similarity)
- Computing **semantic alignment** using FastText embeddings
- Computing **semantic alignment** using BERT embeddings
- Comparing different POS taggers (NLTK, spaCy, Stanford)
- Using multiple analyzers together for comprehensive analysis

## Prerequisites

Before starting, you should have:
1. Completed Tutorial 1 (Preprocessing)
2. Preprocessed files available in `./tutorial_output/preprocessed_nltk/`

---
## Step 1: Import and Configure

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the alignment analyzer
from align_test.alignment import LinguisticAlignment

print("✓ Imports successful")

In [None]:
# Configure paths
# Input: Preprocessed data from Tutorial 1
INPUT_DIR_NLTK = './tutorial_output/preprocessed_nltk'
INPUT_DIR_SPACY = './tutorial_output/preprocessed_spacy'
INPUT_DIR_STANFORD = './tutorial_output/preprocessed_stanford'

# Output: Where to save alignment results
OUTPUT_DIR = './tutorial_output/alignment_results'

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Output directory: {OUTPUT_DIR}")

# Verify input data exists
if os.path.exists(INPUT_DIR_NLTK):
    files = [f for f in os.listdir(INPUT_DIR_NLTK) if f.endswith('.txt')]
    print(f"✓ Found {len(files)} preprocessed files")
else:
    print("✗ Preprocessed data not found!")
    print("Please run Tutorial 1 (Preprocessing) first.")

---
## Step 2: Lexical-Syntactic Alignment

This analyzes how speakers align in their **word choices** (lexical) and **grammar patterns** (syntactic).

### Key Parameters:
- `lag=1`: Compare each utterance with the next one (turn-by-turn)
- `max_ngram=2`: Analyze both unigrams (single words) and bigrams (word pairs)
- `ignore_duplicates=True`: Ignore repeated n-grams when computing syntactic alignment
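
To make the `lag` parameter concrete, here is a minimal sketch of how lagged utterance pairs could be formed. The helper `lagged_pairs` and the toy turns are hypothetical and are not part of the ALIGN API; the package's own pairing logic may differ in its details.

```python
def lagged_pairs(utterances, lag=1):
    """Pair each utterance with the one `lag` turns later (hypothetical helper)."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    # zip stops at the shorter sequence, so the last `lag` turns have no partner
    return list(zip(utterances, utterances[lag:]))

# Toy conversation, one string per turn
turns = ["hi there", "hello", "how are you", "fine thanks"]
pairs = lagged_pairs(turns, lag=1)

for first, second in pairs:
    print(f"{first!r} -> {second!r}")
```

With `lag=1` a conversation of four turns yields three pairs; a larger lag compares turns that are further apart and yields fewer pairs.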

### Metrics Computed:
- **Lexical alignment**: Word overlap between speakers
- **Syntactic alignment**: Grammar pattern (POS tag) similarity
- **Master scores**: Averaged alignment across n-gram sizes
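
As an illustration of these metrics, the sketch below computes a cosine similarity over n-gram counts and averages it across n-gram sizes. It is a simplified stand-in, not ALIGN's implementation: the function names `ngrams`, `cosine`, and `master_cosine` are hypothetical, and the package may weight or filter n-grams differently.

```python
from collections import Counter
from math import sqrt

def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine(counts_a, counts_b):
    """Cosine similarity between two n-gram count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[k] * counts_b[k] for k in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty utterance shares nothing with its partner
    return dot / (norm_a * norm_b)

def master_cosine(tokens_a, tokens_b, max_ngram=2):
    """Average the per-size cosine scores for n = 1..max_ngram."""
    scores = [
        cosine(Counter(ngrams(tokens_a, n)), Counter(ngrams(tokens_b, n)))
        for n in range(1, max_ngram + 1)
    ]
    return sum(scores) / len(scores)

a = "i like the red car".split()
b = "i like the blue car".split()
print(f"{master_cosine(a, b):.4f}")  # unigram 0.8 and bigram 0.5 average to 0.6500
```

Passing word tokens gives a lexical score; passing the corresponding POS tags instead would give a syntactic score computed the same way.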

In [None]:
# Initialize the lexical-syntactic analyzer
print("Initializing analyzer...\n")

analyzer_lexsyn = LinguisticAlignment(
    alignment_type="lexsyn"
)

print("✓ Analyzer ready")

In [None]:
# Run alignment analysis
print("Analyzing lexical-syntactic alignment...\n")

results_lexsyn = analyzer_lexsyn.analyze_folder(
    folder_path=INPUT_DIR_NLTK,
    output_directory=OUTPUT_DIR,
    lag=1,
    max_ngram=2,
    ignore_duplicates=True,
    add_additional_tags=False  # Using NLTK tags only
)

print(f"\n✓ Analysis complete!")
print(f"Analyzed {len(results_lexsyn)} utterance pairs")

### Examine Results

Let's look at what alignment metrics were computed:

In [None]:
# Show all alignment metrics
alignment_metrics = [col for col in results_lexsyn.columns if 'cosine' in col]

print("Alignment Metrics Computed:\n")
for metric in alignment_metrics:
    print(f"  - {metric}")

print(f"\nSample alignment scores (utterance pair at row index 10):")
sample = results_lexsyn.iloc[10]
print(f"\nParticipants: {sample['utter_order']}")
print(f"Content 1: {sample['content1']}")
print(f"Content 2: {sample['content2']}")
print(f"\nLexical alignment: {sample['lexical_master_cosine']:.4f}")
print(f"Syntactic alignment: {sample['syntactic_master_cosine']:.4f}")

### Visualize Alignment Distributions

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Lexical alignment
results_lexsyn['lexical_master_cosine'].hist(
    ax=axes[0], bins=30, edgecolor='black', alpha=0.7, color='steelblue'
)
axes[0].set_title('Lexical Alignment Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Lexical Alignment Score', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].axvline(results_lexsyn['lexical_master_cosine'].mean(), 
                color='red', linestyle='--', linewidth=2, label='Mean')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Syntactic alignment
results_lexsyn['syntactic_master_cosine'].hist(
    ax=axes[1], bins=30, edgecolor='black', alpha=0.7, color='coral'
)
axes[1].set_title('Syntactic Alignment Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Syntactic Alignment Score', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].axvline(results_lexsyn['syntactic_master_cosine'].mean(), 
                color='red', linestyle='--', linewidth=2, label='Mean')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Print statistics
print("\nAlignment Statistics:")
print("="*60)
print(results_lexsyn[['lexical_master_cosine', 'syntactic_master_cosine']].describe())

---
## Step 3: Comparing Different POS Taggers (Optional)

If you preprocessed with spaCy or Stanford in Tutorial 1, you can compare how different taggers affect alignment scores.
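
To see why the choice of tag set can change syntactic alignment, consider this small sketch. The tags are assigned by hand to illustrate a fine-grained and a coarse-grained tag set; they are not the output of any tagger, and `tag_cosine` is a hypothetical helper rather than an ALIGN function.

```python
from collections import Counter
from math import sqrt

def tag_cosine(tags_a, tags_b):
    """Cosine similarity between two POS-tag count vectors."""
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm = sqrt(sum(v * v for v in counts_a.values())) * sqrt(
        sum(v * v for v in counts_b.values())
    )
    return dot / norm if norm else 0.0

# Hand-assigned tags for "she runs" vs. "they ran"
fine_a, fine_b = ["PRP", "VBZ"], ["PRP", "VBD"]          # tense is distinguished
coarse_a, coarse_b = ["PRON", "VERB"], ["PRON", "VERB"]  # tense is collapsed

fine_score = tag_cosine(fine_a, fine_b)
coarse_score = tag_cosine(coarse_a, coarse_b)
print(f"fine-grained tags:   {fine_score:.2f}")
print(f"coarse-grained tags: {coarse_score:.2f}")
```

The same two utterances look only partly aligned under the fine-grained tags and fully aligned under the coarse ones, which is one reason scores from different taggers are not directly interchangeable.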

In [None]:
# Check if spaCy preprocessing is available
if os.path.exists(INPUT_DIR_SPACY):
    print("Analyzing with spaCy tags...\n")
    
    results_spacy = analyzer_lexsyn.analyze_folder(
        folder_path=INPUT_DIR_SPACY,
        output_directory=OUTPUT_DIR,
        lag=1,
        max_ngram=2,
        ignore_duplicates=True,
        add_additional_tags=True,
        additional_tagger_type='spacy'
    )
    
    print(f"✓ spaCy analysis complete!")
    
    # Compare syntactic master scores
    print(f"\nSyntactic Alignment Comparison:")
    print(f"  NLTK only:  {results_lexsyn['syntactic_master_cosine'].mean():.4f}")
    print(f"  With spaCy: {results_spacy['syntactic_master_cosine'].mean():.4f}")
    print(f"\nNote: spaCy scores include both NLTK and spaCy POS tags (averaged)")
else:
    print("⊘ spaCy preprocessing not available")
    print("Run Tutorial 1 with spaCy option to enable this comparison")

---
## Step 4: Semantic Alignment with FastText

FastText embeddings capture **semantic similarity**: whether speakers use words with similar meanings, even if the exact words differ.

### First Run:
- Downloads FastText model (~1-2 GB)
- May take several minutes
- Model is cached for future use

### What It Does:
- Converts words to 300-dimensional vectors
- Compares vector similarity between utterances
- Filters vocabulary to focus on content words
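
The idea behind these steps can be sketched with toy vectors. The example below assumes that an utterance is represented by the average of its word vectors, which is a common approach but may not match ALIGN's exact aggregation. The three-dimensional vectors and the helper names are hypothetical; real FastText vectors have 300 dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for 300-dimensional ones
toy_vectors = {
    "car": np.array([0.9, 0.1, 0.0]),
    "automobile": np.array([0.8, 0.2, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}

def utterance_vector(tokens, vectors):
    """Average the vectors of the tokens found in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # nothing in the utterance has an embedding
    return np.mean(known, axis=0)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_related = cosine_similarity(
    utterance_vector(["car"], toy_vectors),
    utterance_vector(["automobile"], toy_vectors),
)
sim_unrelated = cosine_similarity(
    utterance_vector(["car"], toy_vectors),
    utterance_vector(["banana"], toy_vectors),
)
print(f"car vs. automobile: {sim_related:.3f}")
print(f"car vs. banana:     {sim_unrelated:.3f}")
```

The two utterances share no words, yet the related pair scores far higher than the unrelated one, which is what lexical overlap alone cannot capture.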

In [None]:
# Initialize FastText analyzer
print("Initializing FastText analyzer...\n")
print("\n⚠️  Note: First run will download FastText model (~1-2 GB). This may take several minutes...\n")

analyzer_fasttext = LinguisticAlignment(
    alignment_type="fasttext",
    cache_dir=os.path.join(OUTPUT_DIR, "cache")
)

print("\n✓ Analyzer ready. Next run will be much faster since the model is cached.\n")

In [None]:
# Run FastText semantic alignment
print("\nAnalyzing semantic alignment with FastText...\n")

results_fasttext = analyzer_fasttext.analyze_folder(
    folder_path=INPUT_DIR_NLTK,
    output_directory=OUTPUT_DIR,
    lag=1,
    high_sd_cutoff=3,  # Exclude very common words
    low_n_cutoff=1     # Exclude very rare words
)

print(f"\n✓ FastText analysis complete!")
print(f"Analyzed {len(results_fasttext)} utterance pairs")
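
One plausible reading of the two cutoffs used above is sketched below: drop words whose frequency lies more than `high_sd_cutoff` standard deviations above the mean, and drop words that occur `low_n_cutoff` times or fewer. This interpretation is an assumption, and `filter_vocabulary` is a hypothetical helper; check the package documentation for the exact rule ALIGN applies.

```python
from collections import Counter

import numpy as np

def filter_vocabulary(tokens, high_sd_cutoff=3, low_n_cutoff=1):
    """Keep words that are neither too frequent nor too rare (assumed rule)."""
    counts = Counter(tokens)
    freqs = np.array(list(counts.values()), dtype=float)
    # Upper bound: mean frequency plus a multiple of the (population) SD
    upper = freqs.mean() + high_sd_cutoff * freqs.std()
    return {word for word, n in counts.items() if low_n_cutoff < n <= upper}

# Toy corpus: one very frequent word and one word that occurs only once
tokens = ["the"] * 10 + ["cat"] * 3 + ["dog"] * 3 + ["sat"] * 2 + ["zyx"]

# A cutoff of 1 SD is used here because the toy vocabulary is tiny
kept = filter_vocabulary(tokens, high_sd_cutoff=1, low_n_cutoff=1)
print(sorted(kept))
```

In this toy corpus the filter removes `the` as too frequent and `zyx` as too rare, leaving the mid-frequency words that carry most of the content.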

In [None]:
# Examine FastText metrics
fasttext_metrics = [col for col in results_fasttext.columns if 'fasttext' in col and 'cosine' in col]

print("FastText Metrics Computed:\n")
for metric in fasttext_metrics:
    print(f"  - {metric}")

# Visualize semantic alignment
master_metric = [m for m in fasttext_metrics if 'master' in m.lower()][0]

plt.figure(figsize=(10, 6))
results_fasttext[master_metric].hist(bins=30, edgecolor='black', alpha=0.7, color='forestgreen')
plt.title('FastText Semantic Alignment Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Semantic Similarity Score', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.axvline(results_fasttext[master_metric].mean(), 
            color='red', linestyle='--', linewidth=2, label='Mean')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nMean semantic similarity: {results_fasttext[master_metric].mean():.4f}")

---
## Step 5: Semantic Alignment with BERT (Optional)

BERT provides **contextual semantic analysis**: it represents each word's meaning based on the surrounding words.

### Prerequisites:

#### Get a Hugging Face Token:
1. Go to: https://huggingface.co/settings/tokens
2. Click 'New token' and copy the generated token
3. Set environment variable:

**macOS/Linux:**
```bash
# Add to ~/.zshrc or ~/.bash_profile
export HUGGINGFACE_TOKEN='your_token_here'
```

**Windows:**
```bash
setx HUGGINGFACE_TOKEN "your_token_here"
```

4. Restart Jupyter for changes to take effect

In [None]:
# Check for Hugging Face token
# Treat an empty value the same as a missing variable
token_available = bool(os.environ.get('HUGGINGFACE_TOKEN'))

if token_available:
    print("✓ Hugging Face token found")
    print("Ready to use BERT!")
else:
    print("✗ Hugging Face token not found")
    print("\nPlease see the setup instructions above.")
    print("After setting the token, restart Jupyter and re-run this cell.")

In [None]:
if token_available:
    print("Initializing BERT analyzer...\n")
    
    analyzer_bert = LinguisticAlignment(
        alignment_type="bert",
        model_name="bert-base-uncased",
        token=os.environ.get('HUGGINGFACE_TOKEN')
    )
    
    print("Analyzing semantic alignment with BERT...\n")
    
    results_bert = analyzer_bert.analyze_folder(
        folder_path=INPUT_DIR_NLTK,
        output_directory=OUTPUT_DIR,
        lag=1
    )
    
    print(f"\n✓ BERT analysis complete!")
    print(f"Mean semantic similarity: {results_bert['bert-base-uncased_cosine_similarity'].mean():.4f}")
else:
    print("⊘ Skipping BERT analysis (token not available)")

---
## Step 6: Multi-Analyzer Comprehensive Analysis

You can run **multiple analyzers simultaneously** and get merged results with all metrics in one dataframe.

This is useful for:
- Comparing lexical vs. semantic alignment
- Getting a comprehensive view of all alignment types
- Analyzing correlations between different alignment measures
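
Conceptually, a merged result is a join of per-analyzer tables on the columns that identify an utterance pair. The sketch below shows that idea with pandas on toy data. The key column `source_file` and the column `semantic_cosine` are hypothetical names chosen for illustration, and the join is an assumption about what merging involves, not a description of ALIGN's internals.

```python
import pandas as pd

# Hypothetical columns that together identify one utterance pair
keys = ["source_file", "utter_order"]

lexsyn = pd.DataFrame({
    "source_file": ["a.txt", "a.txt"],
    "utter_order": ["P1 P2", "P2 P1"],
    "lexical_master_cosine": [0.40, 0.20],
})
semantic = pd.DataFrame({
    "source_file": ["a.txt", "a.txt"],
    "utter_order": ["P1 P2", "P2 P1"],
    "semantic_cosine": [0.70, 0.50],
})

# validate="one_to_one" raises an error if a key appears more than once on either side
merged = lexsyn.merge(semantic, on=keys, how="inner", validate="one_to_one")
print(merged)
```

An inner join keeps only the pairs that every analyzer scored, so each remaining row has a complete set of metrics.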

In [None]:
# Initialize combined analyzer with multiple types
print("Initializing multi-analyzer (LexSyn + FastText)...\n")

analyzer_combined = LinguisticAlignment(
    alignment_types=["lexsyn", "fasttext"],  # List of types
    cache_dir=os.path.join(OUTPUT_DIR, "cache")
)

print("✓ Multi-analyzer ready")
print(f"Will compute both lexical-syntactic AND semantic alignment")

In [None]:
# Run combined analysis
print("\nRunning comprehensive multi-analyzer analysis...\n")

results_combined = analyzer_combined.analyze_folder(
    folder_path=INPUT_DIR_NLTK,
    output_directory=OUTPUT_DIR,
    lag=1,
    max_ngram=2,              # For LexSyn
    ignore_duplicates=True,   # For LexSyn
    high_sd_cutoff=3,         # For FastText
    low_n_cutoff=1            # For FastText
)

print(f"\n✓ Multi-analyzer analysis complete!")
print(f"Combined results: {results_combined.shape[0]} rows × {results_combined.shape[1]} columns")

In [None]:
# Show all metrics in combined results
print("\nAll Metrics in Combined Results:\n")

# Lexical-syntactic metrics
lexsyn_metrics = [col for col in results_combined.columns 
                  if any(x in col for x in ['lexical_', 'syntactic_', 'pos_'])]
print("Lexical-Syntactic Metrics:")
for m in lexsyn_metrics:
    print(f"  - {m}")

# Semantic metrics
semantic_metrics = [col for col in results_combined.columns if 'fasttext' in col]
print(f"\nSemantic (FastText) Metrics:")
for m in semantic_metrics:
    print(f"  - {m}")

# Show sample scores
print("\n" + "="*60)
print("Sample Comprehensive Alignment Scores (Row Index 10)")
print("="*60)
sample = results_combined.iloc[10]
print(f"\nLexical alignment:   {sample['lexical_master_cosine']:.4f}")
print(f"Syntactic alignment: {sample['syntactic_master_cosine']:.4f}")
if 'master_fasttext-wiki-news-300_cosine_similarity' in sample:
    print(f"Semantic alignment:  {sample['master_fasttext-wiki-news-300_cosine_similarity']:.4f}")

### Compare All Alignment Types

In [None]:
# Create comparison visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Lexical
results_combined['lexical_master_cosine'].hist(
    ax=axes[0], bins=30, edgecolor='black', alpha=0.7, color='steelblue'
)
axes[0].set_title('Lexical Alignment', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Score', fontsize=12)
axes[0].axvline(results_combined['lexical_master_cosine'].mean(), 
                color='red', linestyle='--', linewidth=2)
axes[0].grid(alpha=0.3)

# Syntactic
results_combined['syntactic_master_cosine'].hist(
    ax=axes[1], bins=30, edgecolor='black', alpha=0.7, color='coral'
)
axes[1].set_title('Syntactic Alignment', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Score', fontsize=12)
axes[1].axvline(results_combined['syntactic_master_cosine'].mean(), 
                color='red', linestyle='--', linewidth=2)
axes[1].grid(alpha=0.3)

# Semantic (FastText)
semantic_col = [c for c in results_combined.columns if 'master' in c and 'fasttext' in c][0]
results_combined[semantic_col].hist(
    ax=axes[2], bins=30, edgecolor='black', alpha=0.7, color='forestgreen'
)
axes[2].set_title('Semantic Alignment (FastText)', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Score', fontsize=12)
axes[2].axvline(results_combined[semantic_col].mean(), 
                color='red', linestyle='--', linewidth=2)
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Print summary statistics
print("\nComparative Statistics:")
print("="*60)
summary_cols = ['lexical_master_cosine', 'syntactic_master_cosine', semantic_col]
print(results_combined[summary_cols].describe())

### Analyze Correlations Between Alignment Types

In [None]:
# Compute correlations
correlation_cols = ['lexical_master_cosine', 'syntactic_master_cosine', semantic_col]
correlations = results_combined[correlation_cols].corr()

print("Correlations Between Alignment Types:")
print("="*60)
print(correlations)

# Interpretation
print("\nInterpretation:")
print(f"Lexical-Syntactic correlation: {correlations.iloc[0,1]:.3f}")
if abs(correlations.iloc[0,1]) > 0.5:
    print("  → Strong relationship: Word similarity often accompanies grammar similarity")
elif abs(correlations.iloc[0,1]) > 0.3:
    print("  → Moderate relationship: Some connection between word and grammar patterns")
else:
    print("  → Weak relationship: Lexical and syntactic alignment are relatively independent")
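
The same thresholds (0.5 and 0.3) can be applied to every pair of measures rather than only the lexical-syntactic one. The sketch below does this on toy scores; `strength_label` is a hypothetical helper, the column `semantic_cosine` is a placeholder name, and the numbers are made up for illustration.

```python
from itertools import combinations

import pandas as pd

def strength_label(r):
    """Label a correlation using the thresholds from the cell above."""
    size = abs(r)
    if size > 0.5:
        return "strong"
    if size > 0.3:
        return "moderate"
    return "weak"

# Toy scores; in the tutorial these would be columns of results_combined
scores = pd.DataFrame({
    "lexical_master_cosine": [0.1, 0.2, 0.3, 0.4],
    "syntactic_master_cosine": [0.2, 0.4, 0.6, 0.8],
    "semantic_cosine": [0.6, 0.4, 0.4, 0.6],
})

correlations = scores.corr()  # Pearson correlation by default
labels = {
    (a, b): strength_label(correlations.loc[a, b])
    for a, b in combinations(scores.columns, 2)
}
for (a, b), label in labels.items():
    print(f"{a} vs. {b}: {correlations.loc[a, b]:.3f} ({label})")
```

Looking up pairs by column name with `.loc` avoids relying on column order, which positional indexing does.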

---
## Step 7: Review Output Files

All alignment results are saved as CSV files for further analysis.

In [None]:
import glob

print("📁 Output Files Created:\n")
print("="*60)

# Find all output files
output_files = glob.glob(os.path.join(OUTPUT_DIR, '**/*.csv'), recursive=True)

for file_path in sorted(output_files):
    # Get relative path and file size
    rel_path = os.path.relpath(file_path, OUTPUT_DIR)
    size_kb = os.path.getsize(file_path) / 1024
    
    # Determine analyzer type from path
    if 'lexsyn' in rel_path:
        analyzer = "LexSyn"
    elif 'fasttext' in rel_path:
        analyzer = "FastText"
    elif 'bert' in rel_path:
        analyzer = "BERT"
    elif 'merged' in rel_path:
        analyzer = "Combined"
    else:
        analyzer = "Other"
    
    print(f"{analyzer:10} {rel_path:60} ({size_kb:7.1f} KB)")

print("\n" + "="*60)

---
## Summary

Congratulations! You've completed the alignment analysis tutorial.

### What You've Learned:

1. ✓ **Lexical-Syntactic Alignment**: Measuring word and grammar similarity
2. ✓ **Semantic Alignment (FastText)**: Analyzing meaning similarity
3. ✓ **Semantic Alignment (BERT)**: Contextual semantic analysis
4. ✓ **Multi-Analyzer Analysis**: Running multiple analyzers together
5. ✓ **Comparative Analysis**: Understanding relationships between alignment types

### Next Steps:

- **Use your own data**: Replace input paths with your preprocessed conversations
- **Adjust parameters**: Experiment with different `lag`, `max_ngram`, and filtering settings
- **Generate baselines**: Compare real conversations to surrogate pairs (see documentation)
- **Statistical analysis**: Load the CSV files into R, Python, or your preferred tool
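
To give a feel for the baseline idea, the sketch below compares real utterance pairs with shuffled (surrogate) pairs. It is a deliberately simplified illustration: it uses Jaccard word overlap instead of ALIGN's cosine metrics, it shuffles replies within a single toy list, and the helper names are hypothetical. The package's own surrogate procedure is described in its documentation and may work differently.

```python
import random

def word_overlap(a, b):
    """Jaccard overlap between the word sets of two utterances."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def mean_overlap(pairs):
    """Average word overlap across utterance pairs."""
    return sum(word_overlap(a, b) for a, b in pairs) / len(pairs)

def surrogate_pairs(pairs, seed=0):
    """Re-pair each first utterance with a randomly shuffled reply."""
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    replies = [b for _, b in pairs]
    rng.shuffle(replies)
    return [(a, b) for (a, _), b in zip(pairs, replies)]

real = [
    ("the red car", "the red truck"),
    ("a cold day", "a cold night"),
    ("my old dog", "my old cat"),
]
surrogate = surrogate_pairs(real)

print(f"Real pairs:      {mean_overlap(real):.3f}")
print(f"Surrogate pairs: {mean_overlap(surrogate):.3f}")
```

If real pairs score higher than surrogate pairs, the difference is evidence that the alignment reflects the interaction itself rather than overlap that would occur by chance.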

### Key Metrics Reference:

**Lexical-Syntactic:**
- `lexical_master_cosine`: Overall word similarity (0-1, higher = more similar)
- `syntactic_master_cosine`: Overall grammar similarity (0-1, higher = more similar)

**Semantic:**
- `master_fasttext-wiki-news-300_cosine_similarity`: Overall meaning similarity
- `bert-base-uncased_cosine_similarity`: Contextual semantic similarity

### Understanding the Scores:

These bands are a rough guide rather than fixed thresholds:

- **0.0**: No alignment (completely different)
- **0.3-0.5**: Moderate alignment (some similarity)
- **0.7-0.9**: High alignment (strong similarity)
- **1.0**: Perfect alignment (identical)
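
The endpoints of this scale follow from the definition of cosine similarity, as the short check below shows. Count-based vectors (words or POS tags) have no negative entries, so their cosine stays between 0 and 1. Embedding vectors can have negative components, so embedding-based scores can in principle fall below 0. The helper name is hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same = cosine_similarity([1, 0], [1, 0])        # identical direction
unrelated = cosine_similarity([1, 0], [0, 1])   # orthogonal, nothing shared
opposite = cosine_similarity([1, 0], [-1, 0])   # only possible with negative entries

print(same, unrelated, opposite)
```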

---
## ✅ Tutorial Complete!

You now have all the tools to analyze linguistic alignment in conversational data.

For more information:
- See the README.md for detailed documentation
- Check example scripts in the `examples/` folder
- Visit the GitHub repository for updates and support