# t-SNE Visualization Analysis

This notebook displays results from the automated t-SNE analysis pipeline. The pipeline performs all computations (and is invoked below if its outputs are missing); this notebook focuses on narrative and visualization.

In [None]:
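from core.config import initialize_notebook
import pickle
import matplotlib.pyplot as plt
from pathlib import Path
# Imported here so every display cell works even when the QC plot is missing
from IPython.display import Image, display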

# Initialize environment
env = initialize_notebook(regenerate_run_id=False)

# Display active research question
research_question = env.configs.run['run_name']
research_group_col = env.configs.data["columns"]["mapping"]["research_group"]

print(f"\n{'='*50}")
print(f"Research Question: {research_question.upper()}")
print(f"Group Column: {research_group_col}")
print(f"{'='*50}\n")

run_cfg = env.configs.run
data_dir = env.repo_root / "outputs" / run_cfg["run_name"] / run_cfg["run_id"] / f"seed_{run_cfg['seed']}"
embeddings_dir = data_dir / "tsne_embeddings"
plots_dir = embeddings_dir / "plots"

print(f"Loading t-SNE results from: {embeddings_dir}")

In [None]:
# Run t-SNE analysis if not already computed.
# plots_dir lives inside embeddings_dir, so checking the plot file covers both.
qc_plot_name = f"qc_comparison_complexity{env.configs.tsne['complexity']}.png"
if not (plots_dir / qc_plot_name).exists():
    print("t-SNE results not found. Running t-SNE analysis pipeline...")
    from core.tsne.pipeline import run_tsne_analysis
    results = run_tsne_analysis(env)
    print("t-SNE analysis complete!")
else:
    print("t-SNE results found, loading existing analysis")
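
The check above looks only at the QC comparison plot. A stricter variant, sketched here as an assumption rather than as part of the pipeline, would test for every plot file that the later cells try to display. The helper name `missing_plots` and the placeholder values `"diagnosis"` and `30` are hypothetical; the filename patterns are the ones used in this notebook.

```python
from pathlib import Path
import tempfile

def missing_plots(plots_dir, research_question, complexity):
    """Return the expected plot filenames that are absent from plots_dir."""
    expected = [
        f"qc_comparison_complexity{complexity}.png",
        f"harmonization_{research_question}_complexity{complexity}.png",
        f"harmonization_scanner_complexity{complexity}.png",
        f"demographics_age_complexity{complexity}.png",
        f"demographics_sex_complexity{complexity}.png",
    ]
    return [name for name in expected if not (Path(plots_dir) / name).exists()]

# Demonstration in a throwaway directory with placeholder values:
# only the QC plot exists, so the other four are reported as missing.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "qc_comparison_complexity30.png").touch()
    still_missing = missing_plots(tmp, "diagnosis", 30)

print(still_missing)
```

In the notebook this could be called as `missing_plots(plots_dir, research_question, env.configs.tsne["complexity"])`, rerunning the pipeline whenever the returned list is non-empty.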

## Quality Control Impact

The first analysis shows how quality control (QC) filtering changes the overall data distribution in t-SNE space. Outliers with many surface topology defects should be present before QC and absent afterwards.

In [None]:
# Display QC comparison plot
complexity = env.configs.tsne["complexity"]
qc_plot = plots_dir / f"qc_comparison_complexity{complexity}.png"

if qc_plot.exists():
    from IPython.display import Image, display
    display(Image(str(qc_plot)))
else:
    print("QC comparison plot not found. Run the full pipeline first.")

## Harmonization Effectiveness

These plots show how harmonization reduces site (scanner) effects while preserving the biological signal of interest (research groups).
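
The visual impression of reduced scanner clustering can be backed by a simple number. The following is a minimal sketch, not something the pipeline is known to compute: for each point, it measures the fraction of its k nearest neighbours in the embedding that share the same label. Values near 1 mean the label forms tight clusters; lower values mean the labels are mixed. The function name `knn_label_purity` and the toy coordinates are made up for illustration.

```python
import math

def knn_label_purity(points, labels, k=3):
    """Mean fraction of each point's k nearest neighbours that share its label."""
    if not 0 < k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    total = 0.0
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to p and keep the k closest.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        total += sum(labels[j] == labels[i] for j in neighbours) / k
    return total / len(points)

# Toy 2-D embedding (made-up numbers) with two well-separated clusters.
embedding = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]

# Case 1: each cluster contains a single scanner (strong site effect).
scanners_separated = ["A", "A", "A", "A", "B", "B", "B", "B"]
purity_separated = knn_label_purity(embedding, scanners_separated, k=3)

# Case 2: scanners are interleaved within each cluster (site effect removed).
scanners_mixed = ["A", "B", "A", "B", "A", "B", "A", "B"]
purity_mixed = knn_label_purity(embedding, scanners_mixed, k=3)

print(f"separated: {purity_separated:.2f}, mixed: {purity_mixed:.2f}")
```

Applied to the pre- and post-harmonization embeddings, a drop in purity for the scanner label alongside a stable purity for the research group label would be consistent with the claim above. Because t-SNE preserves local neighbourhoods better than global distances, a neighbour-based measure is a more natural fit than one based on cluster centroids.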

In [None]:
# Display harmonization impact on research groups
group_plot = plots_dir / f"harmonization_{research_question}_complexity{complexity}.png"

if group_plot.exists():
    display(Image(str(group_plot)))
else:
    print(f"{research_question.title()} harmonization plot not found. Run the full pipeline first.")

In [None]:
# Display harmonization impact on scanner effects
scanner_plot = plots_dir / f"harmonization_scanner_complexity{complexity}.png"

if scanner_plot.exists():
    display(Image(str(scanner_plot)))
else:
    print("Scanner harmonization plot not found. Run the full pipeline first.")

## Demographics Distribution

These plots show how demographic variables (age, sex) are distributed in the t-SNE space before and after harmonization.

In [None]:
# Display age distribution
age_plot = plots_dir / f"demographics_age_complexity{complexity}.png"

if age_plot.exists():
    display(Image(str(age_plot)))
else:
    print("Age demographics plot not found. Run the full pipeline first.")

In [None]:
# Display sex distribution
sex_plot = plots_dir / f"demographics_sex_complexity{complexity}.png"

if sex_plot.exists():
    display(Image(str(sex_plot)))
else:
    print("Sex demographics plot not found. Run the full pipeline first.")

## Summary Statistics

Load and display key statistics from the analysis.

In [None]:
# Load metadata to show summary stats
import pandas as pd

metadata_path = embeddings_dir / "metadata.pkl"

if metadata_path.exists():
    with open(metadata_path, 'rb') as f:
        metadata = pickle.load(f)

    # Compute the sample counts once and reuse them below
    n_preqc = len(metadata['preqc'][research_question])
    n_postqc = len(metadata['postqc'][research_question])

    print("=== t-SNE Analysis Summary ===")
    print(f"Pre-QC samples: {n_preqc:,}")
    print(f"Post-QC samples: {n_postqc:,}")
    print(f"QC removal: {n_preqc - n_postqc:,} samples")
    print(f"t-SNE complexity: {complexity}")

    # Show research group distribution
    group_counts = pd.Series(metadata['postqc'][research_question]).value_counts()
    print(f"\n{research_question.title()} group distribution: {group_counts.to_dict()}")
else:
    print("Metadata not found. Run the full pipeline first.")
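
To make the summary arithmetic concrete, here is a small worked example on hand-written data. The dictionary below is a hypothetical stand-in for `metadata.pkl`: it assumes, as the cell above does, that `metadata['preqc']` and `metadata['postqc']` map the run name to a sequence of group labels. The real file may hold more fields or a different container type. The removal percentage is an addition for illustration.

```python
from collections import Counter

# Hypothetical stand-in for metadata.pkl; labels and run name are made up.
research_question = "diagnosis"
metadata = {
    "preqc": {research_question: ["control", "patient", "control", "patient", "control"]},
    "postqc": {research_question: ["control", "patient", "control", "control"]},
}

n_pre = len(metadata["preqc"][research_question])
n_post = len(metadata["postqc"][research_question])
removed = n_pre - n_post
# Guard against an empty pre-QC set to avoid division by zero.
removed_pct = 100 * removed / n_pre if n_pre else 0.0

# Counter gives the same per-group tally as pandas value_counts().
group_counts = dict(Counter(metadata["postqc"][research_question]))

print(f"QC removal: {removed:,} of {n_pre:,} samples ({removed_pct:.1f}%)")
print(f"Group distribution: {group_counts}")
```

Reporting the removal as a percentage as well as a count makes it easier to compare QC impact across runs with different sample sizes.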

## Conclusion

Provided the plots above match expectations, this analysis demonstrates:

1. **Quality Control**: Effective removal of poor-quality samples with high surface topology defects
2. **Harmonization**: Reduction of scanner-related clustering while preserving research group patterns
3. **Demographics**: Consistent distribution of age and sex across the harmonized data

The t-SNE embeddings and plots are saved for further analysis and publication.