# ClauseMate: German Clause Mate Analysis Demo

This notebook demonstrates how ClauseMate analyzes the relationships between pronouns and their clause mates in German linguistic data.

## What is ClauseMate?

ClauseMate is a research tool that investigates whether pronouns in German discourse appear at more consistent linear positions when clause mates are present than when they are absent.

### Key Features:

- **94.4% antecedent detection** across sentence boundaries
- **Cross-sentence coreference tracking** with chain analysis  
- **German-specific pronoun classification** (3rd person, D-pronouns, demonstratives)
- **WebAnno TSV 3.3 format** support for linguistic annotations
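
To make the three pronoun classes concrete, here is a minimal sketch of a surface-form lookup. The word lists and the `classify_pronoun` helper are illustrative assumptions, not ClauseMate's actual inventory or API. A lookup like this cannot tell a D-pronoun from a homophonous article (`der`, `die`, `das`), so a real classifier would also need the annotation layers.

```python
from typing import Optional

# Illustrative word lists only (assumption); ClauseMate's real inventory may differ.
THIRD_PERSON = {"er", "sie", "es", "ihn", "ihm", "ihr", "ihnen"}
D_PRONOUNS = {"der", "die", "das", "den", "dem", "dessen", "deren", "denen"}
DEMONSTRATIVES = {"dieser", "diese", "dieses", "diesen", "diesem", "jener", "jene", "jenes"}


def classify_pronoun(token: str) -> Optional[str]:
    """Return a coarse pronoun class for a token, or None if it is not listed."""
    form = token.lower()
    if form in THIRD_PERSON:
        return "third_person"
    if form in D_PRONOUNS:
        return "d_pronoun"
    if form in DEMONSTRATIVES:
        return "demonstrative"
    return None


for word in ["Er", "dem", "Dieses", "Haus"]:
    print(word, "->", classify_pronoun(word))
```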

In [0]:
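# Install ClauseMate in Binder environment
import os
import subprocess
import sys

# Change to project root directory (parent of notebooks/)
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
print(f"Current directory: {os.getcwd()}")
print(f"Project root: {project_root}")

# Check if we have the project files
required_files = ["pyproject.toml", "src", "requirements.txt"]
missing_files = [
    f for f in required_files if not os.path.exists(os.path.join(project_root, f))
]

if missing_files:
    print(f"❌ Missing required files in {project_root}: {missing_files}")
    print("Make sure you're running in a ClauseMate repository")
else:
    # Change to project root and install
    os.chdir(project_root)
    print(f"Changed to project root: {os.getcwd()}")

    try:
        # Install the package in editable mode
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", "."])
        print("✓ ClauseMate installed successfully!")
    except subprocess.CalledProcessError as e:
        print(f"❌ Installation failed: {e}")
        print("Trying alternative installation...")
        try:
            # Try installing requirements directly
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"]
            )
            print("✓ Requirements installed successfully!")
        except subprocess.CalledProcessError as e2:
            print(f"❌ Alternative installation also failed: {e2}")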

In [0]:
# Import ClauseMate modules
try:
    from src.main import main
    from src.config import FilePaths, TSVColumns
    from src.data.models import SentenceContext, Token

    print("✓ ClauseMate modules imported successfully!")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("Make sure you're running from the project root directory.")

## Demo Analysis

Let's run ClauseMate on the sample data to demonstrate its linguistic analysis capabilities.

In [0]:
# Check available sample data
import os
from pathlib import Path

data_dir = Path("data/input/gotofiles")
if data_dir.exists():
    tsv_files = list(data_dir.glob("*.tsv"))
    print(f"Found {len(tsv_files)} TSV files for analysis:")
    for file in tsv_files[:3]:  # Show first 3
        print(f"  - {file.name}")

    if tsv_files:
        sample_file = tsv_files[0]
        print(f"\nUsing sample file: {sample_file.name}")
    else:
        print("❌ No TSV files found in data/input/gotofiles/")
else:
    print("❌ Data directory not found. Binder environment may need data setup.")

In [0]:
# Run Phase 2 analysis (if data available)
import os
import subprocess
import sys
from pathlib import Path

# Ensure we're in the right directory
current_dir = Path.cwd()
project_root = current_dir.parent if current_dir.name == "notebooks" else current_dir

# Check for data availability
data_locations = [
    project_root / "data" / "input" / "gotofiles",
    Path("../data/input/gotofiles"),
    Path("data/input/gotofiles"),
]

has_data = any(loc.exists() and list(loc.glob("*.tsv")) for loc in data_locations)

# Remember the starting directory up front so both branches can restore it
original_cwd = os.getcwd()

if has_data:
    try:
        # Change to project root for execution
        os.chdir(project_root)
        print(f"Running analysis from: {os.getcwd()}")

        # Run the modular Phase 2 analysis
        result = subprocess.run(
            [sys.executable, "-m", "src.main"],
            capture_output=True,
            text=True,
            timeout=60,
        )

        # Restore original directory
        os.chdir(original_cwd)

        if result.returncode == 0:
            print("✓ Phase 2 analysis completed successfully!")
            print("\nOutput preview:")
            print(result.stdout[-500:] if result.stdout else "No output captured")
        else:
            print(f"❌ Analysis failed with return code {result.returncode}")
            print(f"Error: {result.stderr}")
            if "ModuleNotFoundError" in result.stderr:
                print("\n💡 Try running the installation cell above first")

    except subprocess.TimeoutExpired:
        print("⏱️ Analysis timed out (60s limit in demo)")
        os.chdir(original_cwd)  # Restore directory even on timeout
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        os.chdir(original_cwd)  # Restore directory even on error
else:
    print("⚠️ Skipping analysis - no sample data available in Binder environment")
    print("To run full analysis, upload TSV files to data/input/gotofiles/")
    print("\n💡 You can still explore the code structure and documentation!")

    # Show some basic information about the project
    try:
        os.chdir(project_root)
        from src.config import TSVColumns

        print("\n📁 Project structure overview:")
        print("  - Config loaded successfully")
        print(
            f"  - TSV columns defined: {len([attr for attr in dir(TSVColumns) if not attr.startswith('_')])}"
        )
        os.chdir(original_cwd)
    except Exception as e:
        print(f"Could not load project info: {e}")
        os.chdir(original_cwd)

## Understanding the Output

ClauseMate generates CSV files with detailed linguistic relationships:

### Key Columns:

- **pronoun_text**: The critical pronoun being analyzed
- **clause_mate_count**: Number of referential clause mates in the same sentence  
- **most_recent_antecedent_distance**: Linear distance to nearest mention in coreference chain
- **first_antecedent_distance**: Distance to chain's initial mention
- **givenness**: `neu` (first mention) vs `bekannt` (subsequent)
- **animacy**: `anim` vs `inanim` coreference layers
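
As a rough illustration of how the two distance columns and `givenness` relate to a coreference chain, the sketch below derives them from token positions. The function names are hypothetical, and measuring distance in tokens is an assumption; ClauseMate may count distance differently.

```python
from typing import List, Optional, Tuple


def antecedent_distances(
    chain_positions: List[int], pronoun_position: int
) -> Tuple[Optional[int], Optional[int]]:
    """Return (most_recent_distance, first_distance), measured in tokens.

    chain_positions: token indices of all mentions in one coreference chain.
    pronoun_position: token index of the pronoun under analysis.
    """
    # Only mentions that precede the pronoun can serve as antecedents
    earlier = sorted(p for p in chain_positions if p < pronoun_position)
    if not earlier:
        return None, None  # the pronoun opens the chain, so it has no antecedent
    return pronoun_position - earlier[-1], pronoun_position - earlier[0]


def givenness(chain_positions: List[int], mention_position: int) -> str:
    """'neu' for the chain's first mention, 'bekannt' for every later one."""
    return "neu" if mention_position == min(chain_positions) else "bekannt"


chain = [3, 17, 42]  # made-up token indices for one chain
print(antecedent_distances(chain, 42))  # (25, 39)
print(givenness(chain, 3), givenness(chain, 42))  # neu bekannt
```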

### Analysis Focus:

The tool investigates whether pronouns in German discourse occupy more consistent linear positions when clause mates are present than when they are absent.
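
One simple way to operationalize "consistency" is to compare the spread of pronoun positions between the two groups. The sketch below does this with the standard deviation. Both the `position_spread` helper and the idea of a per-row pronoun position value are assumptions for illustration, since no position column is listed among the key columns above. The data is invented.

```python
from statistics import pstdev
from typing import Dict, List, Tuple


def position_spread(rows: List[Tuple[int, int]]) -> Dict[str, float]:
    """Compare the spread of pronoun positions with and without clause mates.

    rows: (clause_mate_count, pronoun_position) pairs.
    Returns the population standard deviation per group; a lower value
    means the positions are more consistent.
    """
    with_mates = [pos for count, pos in rows if count > 0]
    without_mates = [pos for count, pos in rows if count == 0]
    nan = float("nan")  # reported when a group is empty
    return {
        "with_clause_mates": pstdev(with_mates) if with_mates else nan,
        "without_clause_mates": pstdev(without_mates) if without_mates else nan,
    }


# Toy rows, not real results
toy_rows = [(2, 1), (1, 1), (3, 3), (0, 1), (0, 5), (0, 9)]
spread = position_spread(toy_rows)
print(spread)
```

A real analysis would also need a significance test and controls for sentence length. This only shows the shape of the comparison.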

In [0]:
# Show sample output structure (if available)
from pathlib import Path

import pandas as pd

# Try multiple possible output locations
current_dir = Path.cwd()
project_root = current_dir.parent if current_dir.name == "notebooks" else current_dir

output_locations = [
    project_root / "data" / "output",
    current_dir / "data" / "output",
    Path("../data/output"),
    Path("data/output"),
]

output_files = []
output_dir = None

for location in output_locations:
    if location.exists():
        files = list(location.glob("*.csv"))
        if files:
            output_files = files
            output_dir = location
            break

if output_files:
    latest_output = max(output_files, key=lambda p: p.stat().st_mtime)
    print(f"✓ Found output files in: {output_dir}")
    print(f"Latest output file: {latest_output.name}")

    try:
        # Show sample of results
        df = pd.read_csv(latest_output)
        print("\n📊 Dataset Analysis:")
        print(f"  - Shape: {df.shape}")
        print(f"  - Columns: {len(df.columns)}")

        print("\n📋 Column Names:")
        for i, col in enumerate(df.columns, 1):
            print(f"  {i:2d}. {col}")

        print("\n🔍 Sample Relationships:")
        # Show key columns if they exist
        key_cols = [
            "pronoun_text",
            "clause_mate_count",
            "most_recent_antecedent_distance",
            "givenness",
            "animacy",
        ]
        available_cols = [col for col in key_cols if col in df.columns]

        if available_cols:
            sample_df = df[available_cols].head(3)
            print(sample_df.to_string(index=False))
        else:
            print(df.head(3).to_string(index=False, max_cols=5))

        # Basic statistics
        print("\n📊 Quick Statistics:")
        print(f"  - Total relationships: {len(df):,}")
        if "pronoun_text" in df.columns:
            print(f"  - Unique pronouns: {df['pronoun_text'].nunique()}")
        if "clause_mate_count" in df.columns:
            print(f"  - Avg clause mates: {df['clause_mate_count'].mean():.1f}")
        if "most_recent_antecedent_distance" in df.columns:
            non_null = df["most_recent_antecedent_distance"].dropna()
            if len(non_null) > 0:
                print(f"  - Avg antecedent distance: {non_null.mean():.1f}")

    except Exception as e:
        print(f"❌ Error reading output file: {e}")

else:
    print("📁 No output files found in any expected location:")
    for location in output_locations:
        print(f"  - {location} (exists: {location.exists()})")
    print("\n💡 Run the analysis cell above first to generate output files")
    print("Or upload your own CSV results to data/output/")

## Next Steps

To use ClauseMate with your own data:

1. **Prepare TSV files** in WebAnno TSV 3.3 format with coreference annotations
2. **Upload to `data/input/gotofiles/`** directory
3. **Run analysis** using `python -m src.main` or the analysis cell above
4. **Examine results** in `data/output/` CSV files
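
For step 1, it can help to check that a file is laid out as expected before running the full analysis. The sketch below reads token rows from WebAnno TSV text, assuming the usual layout: `#`-prefixed header and `#Text=` lines, then tab-separated token rows whose first three fields are the sentence-token ID, the character offsets, and the token text. `read_webanno_tokens` is a hypothetical helper, not part of ClauseMate, and the meaning of the remaining annotation columns depends on the exported layers.

```python
from typing import Dict, List


def read_webanno_tokens(text: str) -> List[Dict[str, object]]:
    """Extract token rows from WebAnno TSV text, skipping headers and comments."""
    tokens: List[Dict[str, object]] = []
    for line in text.splitlines():
        # Header lines (#FORMAT=..., #T_SP=...), sentence text (#Text=...)
        # and blank separator lines carry no token data
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a well-formed token row
        sentence_id, _, _ = fields[0].partition("-")
        tokens.append(
            {
                "id": fields[0],  # e.g. "1-2" = sentence 1, token 2
                "sentence": int(sentence_id),
                "offsets": fields[1],
                "text": fields[2],
                "annotations": fields[3:],  # layer-dependent columns
            }
        )
    return tokens


sample = (
    "#FORMAT=WebAnno TSV 3.3\n"
    "\n"
    "#Text=Er schläft .\n"
    "1-1\t0-2\tEr\t_\n"
    "1-2\t3-10\tschläft\t_\n"
    "1-3\t11-12\t.\t_\n"
)
tokens = read_webanno_tokens(sample)
print([t["text"] for t in tokens])
```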

### Development Environment

For local development, use:
```bash
# Install dependencies
pip install -e ".[dev]"   # quotes keep shells such as zsh from expanding the brackets

# Run with nox task runner
nox                    # lint + test
nox -s test           # pytest only
nox -s format         # format code

# Manual execution
python -m src.main    # Phase 2 (preferred)
python src/run_phase2.py
```

### Research Applications

ClauseMate supports German linguistic research on:
- Pronoun resolution strategies
- Discourse coherence patterns  
- Referential accessibility hierarchies
- Cross-sentence coreference tracking