# Quote Extraction Pipeline Visualization

This notebook loads the JSONL output of each stage of the quote extraction pipeline, combines the stages into a single DataFrame, and plots how many items remain after each stage, which speakers appear in the final output, and how the final scores are distributed.

In [None]:
import os
import sys
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Add project root to sys.path to allow for local module imports
project_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

print(f"pandas version: {pd.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

sns.set_theme(style="whitegrid")

## 1. Configuration & Parameters

Define paths to the data files and set visualization parameters.

In [None]:
from pathlib import Path

# The `run_extraction.py` script saves the data in a `data` directory at the project root.
DATA_DIR = Path("../../data")
FILES = {
    0: DATA_DIR / "stage0_raw.jsonl",
    1: DATA_DIR / "stage1_candidates.jsonl",
    2: DATA_DIR / "stage2_attributed.jsonl",
    3: DATA_DIR / "stage3_final.jsonl"
}

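# Visualization settings
# Sentinel for missing scores. The `score >= 0` checks below rely on real scores being non-negative.
SCORE_NULL_PLACEHOLDER = -1.0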
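
Before loading, it can help to see which stage files are actually present. The following is a minimal sketch, assuming a hypothetical helper `missing_stage_files` (not part of the project) and a temporary directory standing in for the real data directory:

```python
import tempfile
from pathlib import Path


def missing_stage_files(files: dict[int, Path]) -> list[int]:
    """Return the stage numbers whose files are absent on disk (hypothetical helper)."""
    return [stage for stage, path in files.items() if not path.exists()]


# Demo in a throwaway directory: only the stage 0 file is created.
with tempfile.TemporaryDirectory() as tmp:
    demo_files = {
        0: Path(tmp) / "stage0_raw.jsonl",
        1: Path(tmp) / "stage1_candidates.jsonl",
    }
    demo_files[0].write_text('{"stage": 0}\n')
    demo_missing = missing_stage_files(demo_files)

print("Missing stages:", demo_missing)
```

In the notebook itself, the same helper could be called as `missing_stage_files(FILES)`.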

## 2. Utility Functions

Helper functions to load and prepare the data for analysis.

In [None]:
def load_stage(stage: int) -> pd.DataFrame:
    """Loads a single stage's JSONL file into a DataFrame."""
    file_path = FILES[stage]
    if not file_path.exists():
        print(f"Warning: File not found for stage {stage} at {file_path}. Please run `make run` first.")
        return pd.DataFrame()
    return pd.read_json(file_path, lines=True)

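def prepare_dataframe(dfs: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenates, cleans, and sorts the DataFrames from all stages."""
    # Skip empty frames (missing stage files) so they do not take part in the concat.
    non_empty = [df for df in dfs if not df.empty]
    if not non_empty:
        return pd.DataFrame()
    full = pd.concat(non_empty, ignore_index=True)
    # After a partial run, no loaded stage may carry these columns; add them as missing.
    if "speaker" not in full.columns:
        full["speaker"] = None
    if "score" not in full.columns:
        full["score"] = float("nan")
    full["speaker"] = full["speaker"].fillna("<none>")
    full["score"] = full["score"].fillna(SCORE_NULL_PLACEHOLDER)
    return full.sort_values(["stage", "speaker", "score"], ascending=[True, True, False])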
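
To make the cleaning step concrete, here is a minimal sketch of the same concatenate, fill, and sort logic on made-up records. It assumes each record carries `stage`, `doc_id`, `speaker`, `score`, and `text` fields; the toy rows and the `toy_*` names are hypothetical, not real pipeline output.

```python
import pandas as pd

SCORE_NULL_PLACEHOLDER = -1.0  # same sentinel as in the configuration cell

# Hypothetical stage 1 record: no speaker or score assigned yet.
toy_stage1 = pd.DataFrame([
    {"stage": 1, "doc_id": "d1", "speaker": None, "score": float("nan"),
     "text": "A candidate quote."},
])
# Hypothetical stage 3 records: attributed and scored.
toy_stage3 = pd.DataFrame([
    {"stage": 3, "doc_id": "d1", "speaker": "Alice", "score": 0.42,
     "text": "Low-scoring quote."},
    {"stage": 3, "doc_id": "d1", "speaker": "Alice", "score": 0.91,
     "text": "High-scoring quote."},
])

toy_full = pd.concat([toy_stage1, toy_stage3], ignore_index=True)
# Missing speakers become "<none>", missing scores become the sentinel.
toy_full["speaker"] = toy_full["speaker"].fillna("<none>")
toy_full["score"] = toy_full["score"].fillna(SCORE_NULL_PLACEHOLDER)
# Stage and speaker ascending, score descending within each speaker.
toy_full = toy_full.sort_values(
    ["stage", "speaker", "score"], ascending=[True, True, False]
)

print(toy_full[["stage", "speaker", "score"]].to_string(index=False))
```

The unscored stage 1 row sorts first with the sentinel score, and the two stage 3 rows follow with the higher score ahead of the lower one.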

## 3. Data Loading & Preparation

Load data from all four stages into a single pandas DataFrame.

In [None]:
dfs = [load_stage(s) for s in FILES]
full_df = prepare_dataframe(dfs)

if not full_df.empty:
    print("Combined DataFrame shape:", full_df.shape)
    display(full_df.head(10))
else:
    print("DataFrame is empty. Please run `make run` first to generate the data.")

## 4. Stage-By-Stage Inspection

Let's examine the output of each pipeline stage. For every stage, the rows are grouped by speaker, and up to five rows per speaker are printed, highest score first, with the text truncated to 120 characters.

In [None]:
def print_stage_overview(df: pd.DataFrame) -> None:
    """Prints, per stage and speaker, the five highest-scoring rows as truncated one-liners."""
    if df.empty:
        print("No data to display.")
        return
    
    for stage, stage_df in df.groupby("stage"):
        print(f"\n\n{'='*10} Stage {stage} {'='*10}")
        for speaker, grp in stage_df.groupby("speaker"):
            print(f"\n-- Speaker: {speaker!r} ({len(grp)} rows) --")
            # Sort by score for display purposes and show top 5
            for _, row in grp.sort_values('score', ascending=False).head(5).iterrows():
                score_str = f"[{row.score:.2f}]" if row.score >= 0 else "[---]"
                # Collapse newlines and truncate long text (e.g. raw documents) to a snippet
                display_text = str(row.text).replace('\n', ' ')
                if len(display_text) > 120:
                    display_text = display_text[:117] + '...'
                print(f"  • {score_str} {display_text}")

print_stage_overview(full_df)

## 5. Cross-Stage Comparison & Visualizations

Now let's visualize the filtering process across stages.

### 5.1 Quote Survival Rate by Stage

In [None]:
if not full_df.empty:
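    stage_labels = {0: '0: Raw Docs', 1: '1: Candidates', 2: '2: Attributed', 3: '3: Final'}
    stage_counts = full_df.groupby('stage')['doc_id'].nunique()
    # Label only the stages that are present; a fixed 4-item list fails if a stage file is missing.
    stage_counts.index = [stage_labels.get(s, str(s)) for s in stage_counts.index]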
    
    plt.figure(figsize=(10, 6))
    ax = sns.barplot(x=stage_counts.index, y=stage_counts.values)
    ax.set_title('Number of Unique Documents (doc_id) at Each Stage')
    ax.set_ylabel('Count (Log Scale)')
    ax.set_yscale('log')
    ax.bar_label(ax.containers[0])
    plt.show()
else:
    print("No data to plot.")
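
The bar chart shows absolute counts; a survival rate additionally needs each stage's count divided by a reference count. A minimal sketch, assuming made-up per-stage counts (the numbers and the `toy_counts` and `survival` names are hypothetical):

```python
import pandas as pd

# Hypothetical per-stage counts, indexed by stage number.
toy_counts = pd.Series({0: 200, 1: 80, 2: 40, 3: 10}, name="count")

survival = pd.DataFrame({
    "count": toy_counts,
    # Share of the first stage's count that is still present.
    "vs_first_stage": toy_counts / toy_counts.iloc[0],
    # Share of the previous stage's count that is still present
    # (undefined, i.e. NaN, for the first stage).
    "vs_previous_stage": toy_counts / toy_counts.shift(1),
})

print(survival)
```

With the notebook's data, `toy_counts` could be replaced by `full_df.groupby('stage')['doc_id'].nunique()`. Note that this counts distinct documents per stage rather than individual quotes.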

### 5.2 Speaker Distribution in Final Stage

In [None]:
if not full_df.empty:
    final_df = full_df[full_df.stage == 3]
    if not final_df.empty:
        plt.figure(figsize=(12, 8))
        top_speakers = final_df['speaker'].value_counts().nlargest(20).index
        speaker_df = final_df[final_df['speaker'].isin(top_speakers)]
        ax = sns.countplot(data=speaker_df, y='speaker', order=top_speakers, hue='speaker', legend=False)
        ax.set_title('Stage 3: Final Quote Count per Speaker (Top 20)')
        ax.set_xlabel('Number of Quotes')
        ax.set_ylabel('Speaker')
        plt.tight_layout()
        plt.show()
    else:
        print("No quotes survived to the final stage.")
else:
    print("No data to plot.")

### 5.3 Score Distribution in Final Stage

In [None]:
if not full_df.empty:
    final_df_with_scores = full_df[(full_df.stage == 3) & (full_df.score >= 0)]
    if not final_df_with_scores.empty:
        plt.figure(figsize=(10, 6))
        ax = sns.histplot(final_df_with_scores['score'], bins=20, kde=True)
        ax.set_title('Stage 3: Distribution of Semantic Scores for Final Quotes')
        ax.set_xlabel('Similarity Score')
        ax.set_ylabel('Frequency')
        plt.show()
    else:
        print("No scored quotes in the final stage to plot.")
else:
    print("No data to plot.")
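
A histogram does not show which speakers account for the high or low scores. As a numeric companion, a per-speaker summary can be sketched as follows, assuming made-up final-stage rows (`toy_final` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical scored rows from the final stage.
toy_final = pd.DataFrame({
    "speaker": ["Alice", "Alice", "Bob", "Bob", "Bob"],
    "score": [0.9, 0.7, 0.4, 0.6, 0.5],
})

# Count, mean, min, and max score per speaker, best mean first.
score_summary = (
    toy_final.groupby("speaker")["score"]
    .agg(["count", "mean", "min", "max"])
    .sort_values("mean", ascending=False)
)

print(score_summary)
```

In the notebook, `toy_final` could be swapped for `final_df_with_scores`, which already excludes rows carrying the placeholder score.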

## 6. Conclusion

This notebook walks through the quote extraction pipeline stage by stage. The plots show how many items remain after each stage, from raw text to the final, semantically scored quotes, and how those final quotes are distributed across speakers and scores.