# Tutorial 1: Preprocessing Conversational Transcripts

This tutorial demonstrates how to use the ALIGN package to preprocess conversational data for linguistic alignment analysis.

## What You'll Learn

- How to prepare raw conversational transcripts for alignment analysis
- How to use different POS taggers (NLTK, spaCy, Stanford)
- How to read the preprocessing output format
- How to validate that outputs are ready for Phase 2 (alignment analysis)

## Prerequisites

You should have already:
1. Cloned this repository
2. Installed the package: `pip install -e .`
3. Confirmed that the sample data is available in `src/align_test/data/CHILDES/`

---
## Step 1: Import and Configure

In [None]:
import os
import pandas as pd
import ast

# Import the preprocessing function
from align_test.prepare_transcripts import prepare_transcripts

print("✓ Imports successful")

In [None]:
# Configure paths
# Input: Sample CHILDES data (included in the package)
INPUT_DIR = '../src/align_test/data/CHILDES/'

# Output: Where to save preprocessed files
OUTPUT_DIR_BASIC = './tutorial_output/preprocessed_nltk'
OUTPUT_DIR_SPACY = './tutorial_output/preprocessed_spacy'
OUTPUT_DIR_STANFORD = './tutorial_output/preprocessed_stanford'

# Create output directories
for dir_path in [OUTPUT_DIR_BASIC, OUTPUT_DIR_SPACY, OUTPUT_DIR_STANFORD]:
    os.makedirs(dir_path, exist_ok=True)
    
print(f"Input directory: {INPUT_DIR}")
print("Output directories created ✓")

---
## Step 2: Inspect Sample Input Data

Let's look at what raw conversation data looks like before preprocessing.

In [None]:
# Check available files (sorted, because os.listdir returns entries in arbitrary order)
files = sorted(f for f in os.listdir(INPUT_DIR) if f.endswith('.txt'))
print(f"Found {len(files)} conversation files\n")

# Load and display a sample file
sample_file = os.path.join(INPUT_DIR, files[0])
df_sample = pd.read_csv(sample_file, sep='\t', encoding='utf-8')

print(f"Sample file: {files[0]}")
print(f"Columns: {df_sample.columns.tolist()}")
print(f"Rows: {len(df_sample)}\n")
print("First 5 utterances:")
df_sample.head()

---
## Step 3: Basic Preprocessing (NLTK Only)

This is the **fastest option** and requires no additional setup. The NLTK resources it needs are downloaded automatically if they are missing.

### What This Does:
1. Cleans text (removes non-letter characters and fillers such as "um" and "uh")
2. Merges adjacent turns by the same speaker
3. Tokenizes and lemmatizes words
4. Adds POS (Part-of-Speech) tags using NLTK
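
Steps 1 and 2 above can be sketched in plain Python. This is only an illustration of the idea, not the package's actual implementation: the filler list, the lowercasing, and the helper names `clean_text` and `merge_adjacent_turns` are assumptions made for this example.

```python
import re
from itertools import groupby

# Illustrative filler list (assumption); the package's own list may differ.
FILLERS = {"um", "uh"}

def clean_text(text):
    """Lowercase, replace non-letter characters with spaces, drop fillers."""
    letters_only = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in letters_only.split() if w not in FILLERS)

def merge_adjacent_turns(turns):
    """Merge consecutive (participant, content) turns by the same speaker."""
    merged = []
    for speaker, group in groupby(turns, key=lambda turn: turn[0]):
        merged.append((speaker, " ".join(content for _, content in group)))
    return merged

raw_turns = [
    ("MOT", "Um, look at the doggie!"),
    ("MOT", "He is running."),
    ("CHI", "uh doggie run"),
]
cleaned = [(speaker, clean_text(content)) for speaker, content in raw_turns]
merged_turns = merge_adjacent_turns(cleaned)
print(merged_turns)
# [('MOT', 'look at the doggie he is running'), ('CHI', 'doggie run')]
```

The two `MOT` turns collapse into one because they are adjacent; a later `MOT` turn that follows `CHI` would stay separate.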

In [None]:
print("Starting NLTK-only preprocessing...\n")

results_nltk = prepare_transcripts(
    input_files=INPUT_DIR,
    output_file_directory=OUTPUT_DIR_BASIC,
    run_spell_check=False,          # Disable for speed (optional)
    minwords=2,                      # Minimum words per utterance
    add_additional_tags=False        # NLTK only
)

print("\n✓ Preprocessing complete!")
print(f"Total utterances processed: {len(results_nltk)}")

### Examine Preprocessed Output

Let's see what the preprocessed data looks like:

In [None]:
print("Output columns:")
for i, col in enumerate(results_nltk.columns, 1):
    print(f"  {i}. {col}")

print(f"\nSample row:")
sample = results_nltk.iloc[0]
print(f"Participant: {sample['participant']}")
print(f"Content: {sample['content']}")
print(f"Tokens: {sample['token']}")
print(f"Lemmas: {sample['lemma']}")
print(f"Tagged: {sample['tagged_token'][:100]}...")

### Validate Output Format

The preprocessing step stores lists and tuples as strings that can be parsed back into Python objects. This matters for Phase 2 (alignment analysis), which needs the parsed objects rather than the raw strings.
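
As a minimal illustration of that round trip, the sketch below converts a list to its string form (which is what a text file ends up holding) and parses it back with `ast.literal_eval()`. The tokens and POS tags are hand-written for the example, not real tagger output.

```python
import ast

# Hand-written example values (not produced by a tagger).
tokens = ["look", "at", "the", "doggie"]
tagged = [("look", "VB"), ("at", "IN"), ("the", "DT"), ("doggie", "NN")]

# Saving to a text file keeps only the string form of each cell.
token_cell = str(tokens)
tagged_cell = str(tagged)

# literal_eval safely rebuilds the Python objects from those strings.
parsed_tokens = ast.literal_eval(token_cell)
parsed_tagged = ast.literal_eval(tagged_cell)

print(type(token_cell).__name__, "->", type(parsed_tokens).__name__)
print(parsed_tagged[0])
```

Unlike `eval()`, `ast.literal_eval()` only accepts Python literals, so it will not execute arbitrary code found in a file.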

In [None]:
# Test parsing
sample_row = results_nltk.iloc[0]

# Parse token string back to list
tokens = ast.literal_eval(sample_row['token'])
print(f"Token type after parsing: {type(tokens)}")
print(f"Tokens: {tokens}")

# Parse tagged tokens
tagged = ast.literal_eval(sample_row['tagged_token'])
print(f"\nTagged token type: {type(tagged)}")
print(f"Tagged tokens (first 3): {tagged[:3]}")
print("\n✓ Format validation successful!")

---
## Step 4: Preprocessing with spaCy (Optional)

spaCy tags parts of speech **100-200x faster** than the Stanford tagger, with only minimal differences in accuracy.
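
Speed differences depend on hardware and data, so it can be worth measuring throughput yourself. The sketch below is a generic timing harness, not part of the package: `utterances_per_second` and `placeholder_tagger` are hypothetical names, and the placeholder should be swapped for a real tagging function before drawing any conclusions.

```python
import time

def utterances_per_second(tag_fn, utterances, repeats=3):
    """Return the best observed throughput of tag_fn over the utterances."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for tokens in utterances:
            tag_fn(tokens)
        best = min(best, time.perf_counter() - start)
    return len(utterances) / best if best > 0 else float("inf")

def placeholder_tagger(tokens):
    """Stand-in that labels every token 'X'; replace with a real tagger."""
    return [(token, "X") for token in tokens]

utterances = [["look", "at", "the", "doggie"], ["he", "is", "running"]] * 500
rate = utterances_per_second(placeholder_tagger, utterances)
print(f"{rate:.0f} utterances/second")
```

Taking the best of several repeats reduces the effect of one-off delays such as model loading or disk caching.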


In [None]:
# Check if spaCy is available
try:
    import spacy
    print("✓ spaCy is installed")
    spacy_available = True
except ImportError:
    print("✗ spaCy not installed")
    print("\nTo install: pip install spacy")
    print("Then run: python -m spacy download en_core_web_sm")
    spacy_available = False

In [None]:
if spacy_available:
    print("Starting preprocessing with spaCy...\n")
    
    results_spacy = prepare_transcripts(
        input_files=INPUT_DIR,
        output_file_directory=OUTPUT_DIR_SPACY,
        run_spell_check=False,
        minwords=2,
        add_additional_tags=True,       # Add spaCy tags
        tagger_type='spacy'             # Specify spaCy
    )
    
    print("\n✓ spaCy preprocessing complete!")
    
    # Show additional columns
    spacy_cols = [c for c in results_spacy.columns if c not in results_nltk.columns]
    print(f"\nAdditional columns with spaCy:")
    for col in spacy_cols:
        print(f"  - {col}")
else:
    print("⊘ Skipping spaCy preprocessing (not installed)")

---
## Step 5: Preprocessing with Stanford (Optional)

The Stanford POS Tagger provides the **highest accuracy** but is roughly **100-200x slower** than spaCy.

### Prerequisites:

#### 1. Install Java
```bash
# macOS
brew install openjdk

# Linux
sudo apt-get install default-jdk

# Windows
# Download from: https://www.java.com/en/download/
```

#### 2. Download Stanford POS Tagger
- Download from: https://nlp.stanford.edu/software/tagger.shtml#Download
- Extract to a known location (e.g., `~/stanford-postagger/`)
- Update the paths in the cell below

In [None]:
import subprocess

# Check Java (note: `java -version` writes its output to stderr)
try:
    result = subprocess.run(['java', '-version'], capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        version = result.stderr.split('\n')[0]
        print(f"✓ Java installed: {version}")
        java_available = True
    else:
        print("✗ Java not working properly")
        java_available = False
except (OSError, subprocess.TimeoutExpired):
    # OSError covers a missing `java` executable; TimeoutExpired covers a hung call
    print("✗ Java not found")
    print("\nPlease install Java first (see instructions above)")
    java_available = False

# Configure the Stanford tagger paths (UPDATE THESE FOR YOUR SYSTEM)
if java_available:
    STANFORD_PATH = os.path.expanduser('~/stanford-postagger-full-2020-11-17') # Update this path
    STANFORD_MODEL = 'models/english-left3words-distsim.tagger'
    
    # Check if Stanford tagger exists
    jar_path = os.path.join(STANFORD_PATH, 'stanford-postagger.jar')
    model_path = os.path.join(STANFORD_PATH, STANFORD_MODEL)
    
    if os.path.exists(jar_path) and os.path.exists(model_path):
        print(f"✓ Stanford tagger found at: {STANFORD_PATH}")
        stanford_available = True
    else:
        print(f"✗ Stanford tagger not found at: {STANFORD_PATH}")
        print("\nPlease:")
        print("  1. Download from: https://nlp.stanford.edu/software/tagger.shtml#Download")
        print("  2. Extract and update STANFORD_PATH above")
        stanford_available = False
else:
    stanford_available = False

In [None]:
if stanford_available:
    print("Starting preprocessing with Stanford...")
    print("⚠️  This may take several minutes...\n")
    
    results_stanford = prepare_transcripts(
        input_files=INPUT_DIR,
        output_file_directory=OUTPUT_DIR_STANFORD,
        run_spell_check=False,
        minwords=2,
        add_additional_tags=True,
        tagger_type='stanford',
        stanford_pos_path=STANFORD_PATH,
        stanford_language_path=STANFORD_MODEL,
        stanford_batch_size=50  # Process in batches for speed
    )
    
    print("\n✓ Stanford preprocessing complete!")
    
    # Show additional columns
    stanford_cols = [c for c in results_stanford.columns if c not in results_nltk.columns]
    print(f"\nAdditional columns with Stanford:")
    for col in stanford_cols:
        print(f"  - {col}")
else:
    print("⊘ Skipping Stanford preprocessing (not available)")

---
## Summary: Output Files

Your preprocessed files are now ready for alignment analysis!

In [None]:
print("📁 Preprocessed Files Created:\n")

for label, path in [("NLTK-only", OUTPUT_DIR_BASIC),
                    ("NLTK + spaCy", OUTPUT_DIR_SPACY),
                    ("NLTK + Stanford", OUTPUT_DIR_STANFORD)]:
    if os.path.exists(path):
        files = [f for f in os.listdir(path) if f.endswith('.txt')]
        if files:
            print(f"✓ {label}: {len(files)} files in {path}")

print("\n" + "="*60)
print("✅ TUTORIAL 1 COMPLETE")
print("="*60)
print("\nNext Step: Run Tutorial 2 (Alignment Analysis)")
print("This will analyze the preprocessed files to compute alignment metrics.")

---
## Understanding the Output Format

Each preprocessed file contains:

### Core Columns (always present):
- `participant`: Speaker identifier
- `content`: Utterance text after cleaning
- `token`: List of tokenized words (as string)
- `lemma`: List of lemmatized words (as string)
- `tagged_token`: List of (word, POS) tuples from NLTK (as string)
- `tagged_lemma`: List of (lemma, POS) tuples from NLTK (as string)
- `file`: Source filename

### Additional Columns (when using extra taggers):
- `tagged_spacy_token`: POS tags from spaCy
- `tagged_spacy_lemma`: POS tags for lemmas from spaCy
- `tagged_stan_token`: POS tags from Stanford
- `tagged_stan_lemma`: POS tags for lemmas from Stanford

### Important Notes:
- All list/tuple columns are stored as **strings**
- Use `ast.literal_eval()` to convert strings back to Python objects
- This format ensures compatibility with the alignment analysis phase
- A concatenated file with all conversations is also saved for batch processing
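
The notes above suggest a simple pre-flight check before Phase 2. The sketch below reads a tab-delimited file with the standard library, parses each list column with `ast.literal_eval()`, and reports rows that fail. The helper name `validate_rows` is hypothetical, the sample rows are hand-written, and the rule that all four list columns should have equal length within a row is an assumption rather than a documented guarantee of the package.

```python
import ast
import csv
import io

LIST_COLUMNS = ["token", "lemma", "tagged_token", "tagged_lemma"]

def validate_rows(handle):
    """Return (row_number, problem) pairs found in a tab-delimited file."""
    problems = []
    reader = csv.DictReader(handle, delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        parsed = {}
        for column in LIST_COLUMNS:
            try:
                value = ast.literal_eval(row[column])
            except (KeyError, ValueError, SyntaxError, TypeError):
                value = None
            if isinstance(value, list):
                parsed[column] = value
            else:
                problems.append((row_number, f"{column} is not a parseable list"))
        # Assumed invariant: one entry per token in every list column.
        if len(parsed) == len(LIST_COLUMNS):
            if len({len(value) for value in parsed.values()}) != 1:
                problems.append((row_number, "list columns differ in length"))
    return problems

# Hand-written sample: row 1 is well formed, row 2 has a truncated lemma cell.
sample = (
    "participant\tcontent\ttoken\tlemma\ttagged_token\ttagged_lemma\tfile\n"
    "MOT\tlook at that\t['look', 'at', 'that']\t['look', 'at', 'that']\t"
    "[('look', 'VB'), ('at', 'IN'), ('that', 'DT')]\t"
    "[('look', 'VB'), ('at', 'IN'), ('that', 'DT')]\tsample.txt\n"
    "CHI\tbig doggie\t['big', 'doggie']\t['big'\t"
    "[('big', 'JJ'), ('doggie', 'NN')]\t[('big', 'JJ'), ('doggie', 'NN')]\tsample.txt\n"
)
problems = validate_rows(io.StringIO(sample))
print(problems)
# [(2, 'lemma is not a parseable list')]
```

To check a real output file, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, newline='', encoding='utf-8')`.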