# Notebook 06: Therapeutic Target Prioritization

**Kairos Therapeutics ML Prototype V0.2**

---

## ⚠️ Public Repository Notice

**This notebook demonstrates Kairos Therapeutics' target prioritization methodology using synthetic data.**

All gene names, scores, and rankings shown are **illustrative placeholders** designed to demonstrate our computational approach. Actual therapeutic targets, priority scores, and strategic decisions are maintained in our internal pipeline.

---

## Strategic Framework

### MSC Delivery Constraints

Mesenchymal stem/stromal cells (MSCs) can **secrete proteins** into the joint space. They cannot:
- Edit host genes
- Knock out genes in cartilage
- Deliver intracellular factors

### Therapeutic Direction Categories

| Category | Biological Meaning | MSC Strategy |
|----------|-------------------|---------------|
| `overexpress` | Beneficial factor (protective, regenerative, anti-inflammatory) | Secrete this factor |
| `inhibit_direct` | Pathogenic factor with known protein inhibitor | Secrete the inhibitor |
| `inhibit_indirect` | Pathogenic factor without direct inhibitor | Secrete pathway modulator |
| `caution` | Context-dependent biology | Requires validation |
| `biomarker_only` | Disease marker but not MSC-targetable | Exclude from combinations |
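
The table can be read as a small decision procedure. The following is a minimal sketch, assuming four boolean annotations per gene (`beneficial`, `msc_targetable`, `context_dependent`, `has_direct_inhibitor`). The function name, the flags, and the order of precedence are hypothetical; the table does not specify them.

```python
def classify_direction(beneficial, msc_targetable, context_dependent, has_direct_inhibitor):
    """Map gene annotations to one of the five direction labels.

    The precedence (targetability, then caution, then benefit) is an
    assumption made for this sketch; the table itself does not define it.
    """
    if not msc_targetable:
        return "biomarker_only"      # disease marker, excluded from combinations
    if context_dependent:
        return "caution"             # requires validation first
    if beneficial:
        return "overexpress"         # MSC secretes the factor itself
    # Pathogenic factor: choose by availability of a protein inhibitor
    return "inhibit_direct" if has_direct_inhibitor else "inhibit_indirect"


print(classify_direction(beneficial=False, msc_targetable=True,
                         context_dependent=False, has_direct_inhibitor=True))
```

In practice these annotations would come from curation, so a rule like this is mainly useful as a consistency check on hand-assigned labels.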

### Inhibitor Strategy

| Inhibitor Type | Mechanism | Example |
|----------------|-----------|--------|
| **Direct** | Protein binds and blocks target | Inhibitor binds protease active site |
| **Indirect (pathway)** | Modulates upstream signaling | Anti-inflammatory cytokine reduces expression |
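
Taken together with the direction categories, each label determines which protein, if any, an MSC would secrete. Below is a minimal sketch using the placeholder names from this notebook. `resolve_payload` is a hypothetical helper, and returning `None` for `caution` and `biomarker_only` is an assumption consistent with the MSC Strategy column.

```python
def resolve_payload(gene, direction, direct_inhibitor=None, indirect_inhibitor=None):
    """Return the protein an MSC would secrete for this target, or None."""
    if direction == "overexpress":
        return gene                  # secrete the protective factor itself
    if direction == "inhibit_direct":
        return direct_inhibitor      # protein that binds and blocks the target
    if direction == "inhibit_indirect":
        return indirect_inhibitor    # upstream pathway modulator
    return None                      # caution / biomarker_only: no payload


print(resolve_payload("PROTEASE_A", "inhibit_direct", "INHIBITOR_X", "CYTOKINE_1"))
print(resolve_payload("INFLAM_MARKER_1", "inhibit_indirect", None, "CYTOKINE_1"))
print(resolve_payload("CELL_CYCLE_1", "biomarker_only"))
```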

---

## Inputs
- Candidate genes from age-disease intersection analysis (replaced in this public version by the synthetic dataset generated in Cell 2)

## Outputs
- `prioritized_targets_demo.csv` (full therapeutic framework)
- `msc_deliverable_factors_demo.csv` (overexpress candidates only)
- `target_prioritization_methodology.png` (visualization figure)

---

## Author
Kairos Therapeutics | Computational Biology Team

## Version
v0.2 (Public Methodology Demo) | December 2025

---
## Cell 1: Setup and Configuration

In [None]:
import os
import warnings
from pathlib import Path
from datetime import datetime

warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (14, 10)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 11

NOTEBOOK_DIR = Path(os.getcwd())
if NOTEBOOK_DIR.name == 'notebooks':
    PROJECT_ROOT = NOTEBOOK_DIR.parent
else:
    PROJECT_ROOT = NOTEBOOK_DIR

DATA_PROCESSED = PROJECT_ROOT / 'data' / 'processed'
FIGURES_DIR = PROJECT_ROOT / 'reports' / 'figures'
REPORTS_DIR = PROJECT_ROOT / 'reports'

for d in [DATA_PROCESSED, FIGURES_DIR, REPORTS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print("="*70)
print("NOTEBOOK 06: THERAPEUTIC TARGET PRIORITIZATION")
print("PUBLIC METHODOLOGY DEMONSTRATION")
print("="*70)
print()
print("⚠️  This notebook uses SYNTHETIC DATA for demonstration purposes.")
print("    Actual targets and scores are maintained in internal pipeline.")
print(f"\nStarted: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Project root: {PROJECT_ROOT}")

---
## Cell 2: Generate Synthetic Demonstration Data

This synthetic dataset illustrates our methodology without revealing actual therapeutic targets.

In [None]:
# =============================================================================
# SYNTHETIC DEMONSTRATION DATA
# =============================================================================
# These are placeholder gene names to demonstrate the methodology.
# Actual gene targets are maintained in our internal pipeline.

np.random.seed(42)  # Kept for reproducibility; the records below are hard-coded and use no random draws

synthetic_genes = [
    # Format: (gene, log2FC, therapeutic_direction, direct_inhibitor, indirect_inhibitor, hallmark, secreted)
    
    # === PATHOGENIC PROTEASES (inhibit_direct) ===
    ('PROTEASE_A', 3.2, 'inhibit_direct', 'INHIBITOR_X', 'CYTOKINE_1', 'ECM Degradation', True),
    ('PROTEASE_B', 2.1, 'inhibit_direct', 'INHIBITOR_X', 'CYTOKINE_1', 'ECM Degradation', True),
    ('PROTEASE_C', 1.8, 'inhibit_direct', 'INHIBITOR_Y', 'CYTOKINE_2', 'ECM Degradation', True),
    
    # === PROTECTIVE FACTORS (overexpress) ===
    ('INHIBITOR_X', 1.0, 'overexpress', None, None, 'ECM Protection', True),
    ('GROWTH_FACTOR_1', -1.5, 'overexpress', None, None, 'Anti-Senescence', True),
    ('GROWTH_FACTOR_2', 1.2, 'overexpress', None, None, 'Growth Factor Modulation', True),
    ('CYTOKINE_3', -1.8, 'overexpress', None, None, 'Stem Cell Maintenance', True),
    ('ANTAGONIST_1', -1.2, 'overexpress', None, None, 'Anti-Fibrosis', True),
    ('SERPIN_1', 2.0, 'overexpress', None, None, 'Proteostasis', True),
    ('MATRIX_PROTEIN_1', 2.5, 'overexpress', None, None, 'Anti-Angiogenesis', True),
    
    # === INFLAMMATORY MEDIATORS (inhibit_indirect) ===
    ('INFLAM_MARKER_1', 2.5, 'inhibit_indirect', None, 'CYTOKINE_1', 'Chronic Inflammation', True),
    ('INFLAM_MARKER_2', 1.6, 'inhibit_indirect', None, 'CYTOKINE_1', 'Chronic Inflammation', False),
    ('ADHESION_1', 1.5, 'inhibit_indirect', None, 'CYTOKINE_1', 'Chronic Inflammation', False),
    ('ADHESION_2', -1.3, 'inhibit_indirect', None, 'CYTOKINE_1', 'Chronic Inflammation', False),
    ('SENESCENCE_1', 2.1, 'inhibit_indirect', None, 'GROWTH_FACTOR_1', 'Cellular Senescence', True),
    ('COX_ENZYME', -1.8, 'inhibit_indirect', None, 'CYTOKINE_1', 'Inflammation', False),
    ('NF_KB_SUBUNIT', -1.3, 'inhibit_indirect', None, 'CYTOKINE_1', 'Inflammation Master Regulator', False),
    
    # === CONTEXT-DEPENDENT (caution) ===
    ('ANGIO_FACTOR_1', -2.6, 'caution', 'MATRIX_PROTEIN_1', None, 'Angiogenesis', True),
    ('ANGIO_FACTOR_2', 1.0, 'caution', None, None, 'Lymphangiogenesis', True),
    ('SIGNALING_1', 1.3, 'caution', 'DECOY_RECEPTOR_1', None, 'Signaling', True),
    ('FIBROSIS_FACTOR', 1.3, 'caution', None, 'ANTAGONIST_1', 'Fibrosis Risk', True),
    ('METABOLIC_1', -2.7, 'caution', None, None, 'Metabolism', True),
    ('CHEMOKINE_1', 2.0, 'caution', None, None, 'Inflammation / Immune', True),
    
    # === BIOMARKER ONLY (intracellular, not MSC-targetable) ===
    ('CELL_CYCLE_1', -3.0, 'biomarker_only', None, None, 'Cellular Senescence', False),
    ('CELL_CYCLE_2', 2.0, 'biomarker_only', None, None, 'Cell Cycle', False),
    ('CELL_CYCLE_3', 1.7, 'biomarker_only', None, None, 'Cell Cycle', False),
    ('CELL_CYCLE_4', 1.4, 'biomarker_only', None, None, 'Cellular Senescence', False),
    ('RECEPTOR_1', 2.4, 'biomarker_only', None, None, 'Signaling', False),
    ('SIGNALING_2', -2.0, 'biomarker_only', None, None, 'Metabolism', False),
    ('ANTIOXIDANT_1', 1.1, 'biomarker_only', None, None, 'Oxidative Stress', False),
    ('ANTIOXIDANT_2', 1.0, 'biomarker_only', None, None, 'Oxidative Stress', False),
    ('CHROMATIN_1', -1.8, 'biomarker_only', None, None, 'Chromatin / Alarmin', False),
    ('TF_LONGEVITY', -1.1, 'biomarker_only', None, None, 'Longevity', False),
    ('SENESCENCE_MARKER', 1.1, 'biomarker_only', None, None, 'Cellular Senescence', False),
]

# Create DataFrame
candidates_df = pd.DataFrame(synthetic_genes, 
    columns=['gene', 'log2FC', 'therapeutic_direction', 'direct_inhibitor', 
             'indirect_inhibitor', 'hallmark', 'secreted'])

# Add expression direction
candidates_df['expression_in_oa'] = candidates_df['log2FC'].apply(
    lambda x: 'Upregulated' if x > 0 else 'Downregulated'
)

print(f"Synthetic demonstration dataset: {len(candidates_df)} genes")
print(f"\nTherapeutic Direction Distribution:")
print(candidates_df['therapeutic_direction'].value_counts())
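
Several inhibitor names in the records above (for example `CYTOKINE_1` and `INHIBITOR_Y`) do not appear as candidate genes themselves. A quick consistency check can list them. This is a sketch on a three-row excerpt of the data, and `unlisted_inhibitors` is a hypothetical helper that is not part of the notebook.

```python
def unlisted_inhibitors(rows):
    """Return inhibitor names referenced in `rows` that are not candidate genes.

    Each row follows the notebook's tuple layout:
    (gene, log2FC, direction, direct_inhibitor, indirect_inhibitor, hallmark, secreted)
    """
    genes = {row[0] for row in rows}
    referenced = {name for row in rows for name in (row[3], row[4]) if name is not None}
    return sorted(referenced - genes)


excerpt = [
    ('PROTEASE_A', 3.2, 'inhibit_direct', 'INHIBITOR_X', 'CYTOKINE_1', 'ECM Degradation', True),
    ('PROTEASE_C', 1.8, 'inhibit_direct', 'INHIBITOR_Y', 'CYTOKINE_2', 'ECM Degradation', True),
    ('INHIBITOR_X', 1.0, 'overexpress', None, None, 'ECM Protection', True),
]
print(unlisted_inhibitors(excerpt))
```

On this excerpt the check reports `CYTOKINE_1`, `CYTOKINE_2`, and `INHIBITOR_Y`, while `INHIBITOR_X` is omitted because it is itself a candidate. The same function could be applied to the full `synthetic_genes` list.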

---
## Cell 3: Display Therapeutic Classification Framework

In [None]:
print("THERAPEUTIC CLASSIFICATION FRAMEWORK")
print("="*70)
print()

# Overexpress
overexpress = candidates_df[candidates_df['therapeutic_direction'] == 'overexpress']
print(f"1. OVEREXPRESS ({len(overexpress)} genes)")
print("   MSC Strategy: Engineer MSCs to secrete these protective factors")
print("-"*50)
for _, row in overexpress.iterrows():
    print(f"   + {row['gene']}: {row['hallmark']}")
print()

# Inhibit Direct
inhibit_direct = candidates_df[candidates_df['therapeutic_direction'] == 'inhibit_direct']
print(f"2. INHIBIT DIRECT ({len(inhibit_direct)} genes)")
print("   MSC Strategy: Deliver protein-level inhibitors")
print("-"*50)
for _, row in inhibit_direct.iterrows():
    print(f"   - {row['gene']} -> Deliver: {row['direct_inhibitor']}")
print()

# Inhibit Indirect
inhibit_indirect = candidates_df[candidates_df['therapeutic_direction'] == 'inhibit_indirect']
print(f"3. INHIBIT INDIRECT ({len(inhibit_indirect)} genes)")
print("   MSC Strategy: Deliver pathway modulators")
print("-"*50)
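# Show at most five examples; report a remainder only when one exists
n_shown = min(5, len(inhibit_indirect))
for _, row in inhibit_indirect.head(n_shown).iterrows():
    if pd.notna(row['indirect_inhibitor']):
        print(f"   - {row['gene']} -> Modulate via: {row['indirect_inhibitor']}")
if len(inhibit_indirect) > n_shown:
    print(f"   ... and {len(inhibit_indirect) - n_shown} more")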
print()

# Caution
caution = candidates_df[candidates_df['therapeutic_direction'] == 'caution']
print(f"4. CAUTION ({len(caution)} genes)")
print("   Status: Requires additional validation before inclusion")
print("-"*50)
for _, row in caution.iterrows():
    print(f"   ? {row['gene']}: {row['hallmark']}")
print()

# Biomarker Only
biomarker = candidates_df[candidates_df['therapeutic_direction'] == 'biomarker_only']
print(f"5. BIOMARKER ONLY ({len(biomarker)} genes)")
print("   Status: Intracellular/membrane - not MSC-targetable")
print("-"*50)
print(f"   {len(biomarker)} genes excluded from therapeutic combinations")

print()
print("="*70)
print("KEY: Only 'overexpress' + secreted genes proceed to combination generation")
print("="*70)

---
## Cell 4: Identify MSC-Deliverable Factors

In [None]:
print("MSC-DELIVERABLE THERAPEUTIC FACTORS")
print("="*70)
print()

# Filter to overexpress + secreted
msc_deliverable = candidates_df[
    (candidates_df['therapeutic_direction'] == 'overexpress') &
    candidates_df['secreted']
].copy()

print(f"MSC-Deliverable Factors: {len(msc_deliverable)}")
print("-"*50)
for _, row in msc_deliverable.iterrows():
    print(f"  + {row['gene']}: {row['hallmark']}")

print()
print(f"These {len(msc_deliverable)} factors will proceed to combination generation.")
print(f"All other genes are excluded from MSC therapeutic combinations.")

---
## Cell 5: Apply Priority Scoring

In [None]:
# Hallmark priority scores; hallmarks not listed here default to 50 (see fillna below)
priority_map = {
    'ECM Protection': 100,
    'ECM Degradation': 90,
    'Anti-Senescence': 85,
    'Stem Cell Maintenance': 80,
    'Anti-Fibrosis': 75,
    'Proteostasis': 70,
    'Anti-Angiogenesis': 65,
    'Chronic Inflammation': 60,
    'Cellular Senescence': 55,
}

candidates_df['hallmark_score'] = candidates_df['hallmark'].map(priority_map).fillna(50)

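# Effect size score: min-max scaling of |log2FC| to roughly 0-100.
# The +0.01 in the denominator guards against division by zero when all
# |log2FC| values are equal; as a side effect the maximum falls slightly below 100.
abs_log2fc = candidates_df['log2FC'].abs()
candidates_df['effect_size_score'] = 100 * (abs_log2fc - abs_log2fc.min()) / (abs_log2fc.max() - abs_log2fc.min() + 0.01)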

# Secreted bonus
candidates_df['secreted_bonus'] = candidates_df['secreted'].apply(lambda x: 20 if x else 0)

# Direction bonus (overexpress is most actionable)
direction_bonus = {
    'overexpress': 30,
    'inhibit_direct': 20,
    'inhibit_indirect': 10,
    'caution': 0,
    'biomarker_only': -10,
}
candidates_df['direction_bonus'] = candidates_df['therapeutic_direction'].map(direction_bonus)

# Composite score: weighted hallmark and effect-size terms plus unweighted bonuses
candidates_df['composite_score'] = (
    candidates_df['hallmark_score'] * 0.35 +
    candidates_df['effect_size_score'] * 0.25 +
    candidates_df['secreted_bonus'] +
    candidates_df['direction_bonus']
)

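# Assign tiers
# Note: an 'overexpress' gene that is not secreted falls through to Tier 4,
# since only secreted proteins are MSC-deliverable.
def assign_tier(row):
    if row['therapeutic_direction'] == 'overexpress' and row['secreted']: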
        return 'Tier 1: Lead'
    elif row['therapeutic_direction'] in ['inhibit_direct', 'inhibit_indirect']:
        return 'Tier 2: Development'
    elif row['therapeutic_direction'] == 'caution':
        return 'Tier 3: Validation Required'
    else:
        return 'Tier 4: Biomarker Only'

candidates_df['tier'] = candidates_df.apply(assign_tier, axis=1)

# Sort and rank
candidates_df = candidates_df.sort_values('composite_score', ascending=False).reset_index(drop=True)
candidates_df['rank'] = range(1, len(candidates_df) + 1)

print("PRIORITY SCORING COMPLETE")
print("="*70)
print()
print("Tier Distribution:")
print(candidates_df['tier'].value_counts())
print()
print("Top 10 Candidates:")
print("-"*70)
display_cols = ['rank', 'gene', 'therapeutic_direction', 'composite_score', 'tier']
print(candidates_df[display_cols].head(10).to_string(index=False))
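
The composite score can be checked by hand for a single gene. The sketch below re-implements this cell's arithmetic in plain Python. `composite_score` is a hypothetical helper, and the default bounds 1.0 and 3.2 are the smallest and largest |log2FC| in the synthetic dataset.

```python
HALLMARK_SCORE = {
    'ECM Protection': 100, 'ECM Degradation': 90, 'Anti-Senescence': 85,
    'Stem Cell Maintenance': 80, 'Anti-Fibrosis': 75, 'Proteostasis': 70,
    'Anti-Angiogenesis': 65, 'Chronic Inflammation': 60, 'Cellular Senescence': 55,
}
DIRECTION_BONUS = {'overexpress': 30, 'inhibit_direct': 20, 'inhibit_indirect': 10,
                   'caution': 0, 'biomarker_only': -10}


def composite_score(hallmark, log2fc, secreted, direction, abs_min=1.0, abs_max=3.2):
    """Recompute the Cell 5 composite score for one gene."""
    hallmark_score = HALLMARK_SCORE.get(hallmark, 50)   # unmapped hallmarks default to 50
    effect_size = 100 * (abs(log2fc) - abs_min) / (abs_max - abs_min + 0.01)
    secreted_bonus = 20 if secreted else 0
    return (0.35 * hallmark_score + 0.25 * effect_size
            + secreted_bonus + DIRECTION_BONUS[direction])


print(round(composite_score('ECM Protection', 1.0, True, 'overexpress'), 2))      # INHIBITOR_X
print(round(composite_score('ECM Degradation', 3.2, True, 'inhibit_direct'), 2))  # PROTEASE_A
```

For INHIBITOR_X the terms are 35 (hallmark) + 0 (effect size) + 20 + 30 = 85. PROTEASE_A gets 31.5 + about 24.89 + 20 + 20, roughly 96.39. Under these weights a strongly upregulated protease therefore outranks the inhibitor that targets it, even though only the inhibitor lands in Tier 1, which is why rank and tier are best read together.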

---
## Cell 6: Visualization

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Color scheme
direction_colors = {
    'overexpress': '#27AE60',
    'inhibit_direct': '#E74C3C',
    'inhibit_indirect': '#E67E22',
    'caution': '#F1C40F',
    'biomarker_only': '#95A5A6'
}

# Panel A: Therapeutic Direction by Expression
ax1 = axes[0, 0]
for direction in direction_colors.keys():
    subset = candidates_df[candidates_df['therapeutic_direction'] == direction]
    ax1.scatter(subset['log2FC'], subset['composite_score'], 
                c=direction_colors[direction], label=direction, s=100, alpha=0.7,
                edgecolors='black', linewidth=0.5)
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax1.set_xlabel('log2FC (Positive = Upregulated in Disease)', fontsize=11)
ax1.set_ylabel('Composite Score', fontsize=11)
ax1.set_title('Panel A: Therapeutic Direction by Expression Pattern', fontweight='bold')
ax1.legend(loc='lower right', fontsize=8)

# Panel B: Therapeutic Direction Distribution
ax2 = axes[0, 1]
direction_counts = candidates_df['therapeutic_direction'].value_counts()
colors = [direction_colors[d] for d in direction_counts.index]
ax2.pie(direction_counts, labels=direction_counts.index, autopct='%1.0f%%', colors=colors)
ax2.set_title('Panel B: Therapeutic Direction Distribution', fontweight='bold')

# Panel C: MSC Delivery Strategy
ax3 = axes[1, 0]
strategy_data = {
    'Overexpress\n(Protective)': len(candidates_df[(candidates_df['therapeutic_direction'] == 'overexpress') & candidates_df['secreted']]),
    'Inhibit Direct\n(Deliver Inhibitor)': len(candidates_df[candidates_df['therapeutic_direction'] == 'inhibit_direct']),
    'Inhibit Indirect\n(Pathway Mod)': len(candidates_df[candidates_df['therapeutic_direction'] == 'inhibit_indirect']),
    'Caution\n(Validate)': len(candidates_df[candidates_df['therapeutic_direction'] == 'caution']),
    'Not Targetable\n(Intracellular)': len(candidates_df[candidates_df['therapeutic_direction'] == 'biomarker_only']),
}
bars = ax3.bar(strategy_data.keys(), strategy_data.values(), 
               color=['#27AE60', '#E74C3C', '#E67E22', '#F1C40F', '#95A5A6'])
ax3.set_ylabel('Number of Genes', fontsize=11)
ax3.set_title('Panel C: MSC Delivery Strategy Summary', fontweight='bold')
ax3.tick_params(axis='x', rotation=15)

for bar in bars:
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height,
             f'{int(height)}', ha='center', va='bottom', fontweight='bold')

# Panel D: Top MSC-Deliverable Factors
ax4 = axes[1, 1]
msc_top = candidates_df[
    (candidates_df['therapeutic_direction'] == 'overexpress') & 
    candidates_df['secreted']
].head(7)

if len(msc_top) > 0:
    colors = plt.cm.Greens(np.linspace(0.3, 0.9, len(msc_top)))
    ax4.barh(msc_top['gene'], msc_top['composite_score'], color=colors)
    ax4.set_xlabel('Composite Score', fontsize=11)
    ax4.set_title('Panel D: Top MSC-Deliverable Factors', fontweight='bold')
    ax4.invert_yaxis()

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'target_prioritization_methodology.png', dpi=150, 
            bbox_inches='tight', facecolor='white')
plt.show()

print(f"\nSaved: {FIGURES_DIR / 'target_prioritization_methodology.png'}")

---
## Cell 7: Export Demonstration Results

In [None]:
# Export columns
export_cols = [
    'rank', 'gene', 'log2FC', 'expression_in_oa',
    'therapeutic_direction', 'direct_inhibitor', 'indirect_inhibitor',
    'hallmark', 'secreted', 'composite_score', 'tier'
]

# Save full dataset
candidates_df[export_cols].to_csv(DATA_PROCESSED / 'prioritized_targets_demo.csv', index=False)
print(f"Saved: {DATA_PROCESSED / 'prioritized_targets_demo.csv'}")

# Save MSC-deliverable only
msc_deliverable = candidates_df[
    (candidates_df['therapeutic_direction'] == 'overexpress') & 
    candidates_df['secreted']
]
msc_deliverable[export_cols].to_csv(DATA_PROCESSED / 'msc_deliverable_factors_demo.csv', index=False)
print(f"Saved: {DATA_PROCESSED / 'msc_deliverable_factors_demo.csv'} ({len(msc_deliverable)} factors)")

print("\nVERIFICATION:")
print(f"  'therapeutic_direction' values: {candidates_df['therapeutic_direction'].unique().tolist()}")
print(f"  'direct_inhibitor' populated: {candidates_df['direct_inhibitor'].notna().sum()} genes")
print(f"  'indirect_inhibitor' populated: {candidates_df['indirect_inhibitor'].notna().sum()} genes")

---
## Cell 8: Summary

In [None]:
print("="*70)
print("NOTEBOOK 06 COMPLETE | V0.2 (Public Demo)")
print("="*70)
print()

print("METHODOLOGY DEMONSTRATION SUMMARY:")
print("-"*50)
print(f"  Total synthetic genes: {len(candidates_df)}")
print()

for direction in ['overexpress', 'inhibit_direct', 'inhibit_indirect', 'caution', 'biomarker_only']:
    count = len(candidates_df[candidates_df['therapeutic_direction'] == direction])
    print(f"  {direction}: {count}")

print()
print("MSC ENGINEERING STRATEGY:")
print("-"*50)
print(f"  Factors for MSC overexpression: {len(msc_deliverable)}")
print(f"  These proceed to combination generation in downstream notebooks")

print()
print("KEY OUTPUTS:")
print("-"*50)
print("  prioritized_targets_demo.csv (full synthetic dataset)")
print("  msc_deliverable_factors_demo.csv (overexpress candidates)")
print("  target_prioritization_methodology.png (visualization)")

print()
print("⚠️  REMINDER: This notebook uses synthetic data for demonstration.")
print("    Actual therapeutic targets are in our internal pipeline.")

print(f"\nCompleted: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*70)

---
## End of Notebook 06 (Public)

### Methodology Demonstrated

1. **Five Therapeutic Direction Categories**:
   - `overexpress`: Protective factors for MSC secretion
   - `inhibit_direct`: Pathogenic factors with protein inhibitors
   - `inhibit_indirect`: Pathogenic factors with pathway modulators
   - `caution`: Context-dependent, requires validation
   - `biomarker_only`: Not MSC-targetable

2. **Dual Inhibitor Mapping**:
   - Direct inhibitors: Protein-level blockers
   - Indirect inhibitors: Pathway modulators

3. **MSC Delivery Filter**:
   - Only `overexpress` + `secreted` genes proceed to combinations
   - Pathogenic factors excluded from MSC payload
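
The dual inhibitor mapping also indicates which secreted factor would act on the most pathogenic targets. The following is a minimal sketch that counts targets per inhibitor on a small excerpt of the synthetic records. `targets_per_inhibitor` is a hypothetical helper, and counting only the inhibitor that matches each gene's direction label is a simplifying assumption.

```python
from collections import Counter


def targets_per_inhibitor(rows):
    """Count pathogenic targets addressed by each inhibitor.

    Rows are (gene, direction, direct_inhibitor, indirect_inhibitor) tuples.
    A direct inhibitor is counted for 'inhibit_direct' genes and an indirect
    one for 'inhibit_indirect' genes; other directions are ignored.
    """
    counts = Counter()
    for gene, direction, direct, indirect in rows:
        if direction == 'inhibit_direct' and direct is not None:
            counts[direct] += 1
        elif direction == 'inhibit_indirect' and indirect is not None:
            counts[indirect] += 1
    return counts


excerpt = [
    ('PROTEASE_A', 'inhibit_direct', 'INHIBITOR_X', 'CYTOKINE_1'),
    ('PROTEASE_B', 'inhibit_direct', 'INHIBITOR_X', 'CYTOKINE_1'),
    ('INFLAM_MARKER_1', 'inhibit_indirect', None, 'CYTOKINE_1'),
    ('CELL_CYCLE_1', 'biomarker_only', None, None),
]
print(targets_per_inhibitor(excerpt).most_common())
```

On this excerpt `INHIBITOR_X` covers two proteases and `CYTOKINE_1` covers one inflammatory marker. Such a tally is only a rough guide, since it ignores effect size and hallmark priority.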

---

*Kairos Therapeutics | Computational Biology Platform | V0.2 (Public)*