Create a Jupyter notebook demo that illustrates a minimal, reproducible workflow
for integrating NGS variant data and structured clinical data to predict therapy response.

Constraints:
- Use synthetic/mock CSV data
- Focus on logic and feasibility, not performance
- Use simple, interpretable models only
- Keep code readable and well-commented

Required Sections (in this order):
1. Overview (Markdown)
2. Load Libraries and Data
3. Patient-Level Data Alignment
4. Feature Engineering (gene-level encoding)
5. Simple Interpretable Model (Logistic Regression or Random Forest)
6. Basic Interpretation (feature importance or coefficients)
7. Summary and Limitations (Markdown)

Use Python with pandas and scikit-learn only.
Avoid deep learning and external APIs.

# Demo Presentation Structure (10 min talk + 20 min Q&A)

## 10-Minute Demo Presentation Outline
1. **Introduction (1 min)**
   - Briefly introduce yourself and the project’s goal.
2. **Problem Statement & Motivation (1 min)**
   - What problem are you solving? Why is it important?
3. **Data Overview (1 min)**
   - Describe the synthetic NGS and clinical data used.
4. **Approach & Methods (2 min)**
   - Summarize feature engineering and modeling approach.
   - Mention key scripts and notebook sections.
5. **Demo Walkthrough (4 min)**
   - Show notebook sections:
     - Data loading
     - Feature engineering
     - Model training
     - Results/outputs
   - Highlight visualizations or key findings.
6. **Conclusion (1 min)**
   - Recap main achievements and next steps.

## 20-Minute Q&A Session
- Invite questions from the audience.
- Be ready to discuss:
  - Technical details (code, algorithms, data handling)
  - Challenges faced and solutions
  - Potential improvements or future work
  - Broader impact or applications
- Have backup slides or code cells ready for deeper dives if needed.

---
**Tip:** Time yourself during practice to stay within 10 minutes. Keep slides/cells concise and focused.

# Slide 1: Introduction
**Content:**
- Project: Integrating NGS variant data and clinical data to predict therapy response
- Presenter: [Your Name]

**Script:**
"Hello, my name is [Your Name]. Today, I’ll demonstrate a minimal workflow for integrating next-generation sequencing (NGS) variant data with structured clinical data to predict therapy response. This demo uses synthetic data and focuses on interpretability and reproducibility."

---

# Slide 2: Problem Statement & Motivation
**Content:**
- Challenge: Predicting therapy response using genomics and clinical data
- Importance: Personalized medicine, better outcomes

**Script:**
"The challenge we address is predicting how patients respond to therapy by combining genomics and clinical data. This is crucial for personalized medicine, helping clinicians make better treatment decisions and improving patient outcomes."

---

# Slide 3: Data Overview
**Content:**
- Synthetic NGS variant data (gene-level)
- Structured clinical data (age, sex, response)

**Script:**
"We use two synthetic datasets: one with NGS variant data at the gene level, and another with clinical features like age, sex, and therapy response. All data are mock, designed for demonstration and reproducibility."

---

# Slide 4: Approach & Methods
**Content:**
- Data alignment and merging
- Feature engineering (gene encoding, clinical features)
- Simple, interpretable models (Logistic Regression, Random Forest)

**Script:**
"Our approach involves aligning patient data, engineering features from both genomics and clinical sources, and applying simple, interpretable models such as logistic regression and random forest. The focus is on clarity and feasibility, not performance."

---

# Slide 5: Demo Walkthrough
**Content:**
- Data loading
- Feature engineering
- Model training
- Results/outputs

**Script:**
"Let’s walk through the notebook. First, we load and display the data. Next, we engineer features, including gene-level encodings and clinical variables. We then train two models and evaluate their performance. Finally, we interpret the results using model coefficients and feature importances."

---

# Slide 6: Conclusion
**Content:**
- Achievements: Demonstrated workflow, interpretable results
- Next steps: Apply to real data, refine features

**Script:**
"In summary, we’ve shown a minimal, reproducible workflow for integrating NGS and clinical data to predict therapy response. The results are interpretable and the process is transparent. Next steps include applying this workflow to real data and refining the feature engineering. Thank you!"

# Overview

This notebook demonstrates a minimal, reproducible workflow for integrating synthetic next-generation sequencing (NGS) variant data and structured clinical data to predict therapy response. The workflow uses interpretable machine learning models and emphasizes logic, feasibility, and clarity over performance. All data are synthetic and for demonstration purposes only.
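
The notebook reads the variant table from `data/ngs_mock.csv` but does not show how that file is produced. A minimal sketch of one way to create it is given here, with rows copied from the NGS table printed in the first cell's output. The leading comment line and the names `ngs_mock`, `csv_text`, and `round_trip` are assumptions made for illustration.

```python
import io

import pandas as pd

# Rows copied from the synthetic NGS table printed in this notebook.
rows = [
    ("P1", "TP53", "R175H"), ("P1", "EGFR", "L858R"),
    ("P2", "BRCA1", "185delAG"), ("P2", "KRAS", "G12D"),
    ("P3", "ALK", "F1174L"), ("P3", "PIK3CA", "E545K"),
    ("P4", "EGFR", "T790M"),
    ("P5", "TP53", "R273C"), ("P5", "BRCA2", "6174delT"),
    ("P6", "KRAS", "G13D"),
    ("P7", "ALK", "R1275Q"),
    ("P8", "PIK3CA", "H1047R"),
    ("P9", "BRCA1", "5382insC"),
    ("P10", "TP53", "R248Q"), ("P10", "EGFR", "G719S"),
]
ngs_mock = pd.DataFrame(rows, columns=["patient_id", "gene", "variant"])

# Hypothetical header comment; the notebook skips such lines with comment='#'.
csv_text = "# synthetic NGS variants, demo only\n" + ngs_mock.to_csv(index=False)

# Read the text back the same way the notebook reads the file.
round_trip = pd.read_csv(io.StringIO(csv_text), comment="#")
print(round_trip.shape)
```

To write the file itself, `ngs_mock.to_csv("data/ngs_mock.csv", index=False)` should be enough, provided the `data/` directory already exists.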

In [1]:
# Load Libraries and Data
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load synthetic NGS variant data, skipping comment lines
ngs_variant_data = pd.read_csv('data/ngs_mock.csv', comment='#')

# Create synthetic clinical data
clinical_data = pd.DataFrame({
    'patient_id': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10'],
    'age': [65, 50, 70, 45, 60, 55, 62, 48, 67, 53],
    'sex': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F'],
    'therapy_response': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1: response, 0: no response
})

# Display synthetic data
print('NGS Variant Data:')
print(ngs_variant_data)
print('\nClinical Data:')
print(clinical_data)

NGS Variant Data:
   patient_id    gene   variant
0          P1    TP53     R175H
1          P1    EGFR     L858R
2          P2   BRCA1  185delAG
3          P2    KRAS      G12D
4          P3     ALK    F1174L
5          P3  PIK3CA     E545K
6          P4    EGFR     T790M
7          P5    TP53     R273C
8          P5   BRCA2  6174delT
9          P6    KRAS      G13D
10         P7     ALK    R1275Q
11         P8  PIK3CA    H1047R
12         P9   BRCA1  5382insC
13        P10    TP53     R248Q
14        P10    EGFR     G719S

Clinical Data:
  patient_id  age sex  therapy_response
0         P1   65   F                 1
1         P2   50   M                 0
2         P3   70   F                 1
3         P4   45   M                 0
4         P5   60   F                 1
5         P6   55   M                 0
6         P7   62   F                 1
7         P8   48   M                 0
8         P9   67   F                 1
9        P10   53   F                 0


In [2]:
# Patient-Level Data Alignment
# Pivot NGS variant data to gene-level binary encoding
pivot = ngs_variant_data.pivot_table(index='patient_id', columns='gene', values='variant', aggfunc='count', fill_value=0)
pivot = (pivot > 0).astype(int)

# Merge with clinical data (default inner join: a patient missing from either table would be dropped)
merged_data = pd.merge(clinical_data, pivot, left_on='patient_id', right_index=True)

print('Merged Data:')
print(merged_data)

Merged Data:
  patient_id  age sex  therapy_response  ALK  BRCA1  BRCA2  EGFR  KRAS  \
0         P1   65   F                 1    0      0      0     1     0   
1         P2   50   M                 0    0      1      0     0     1   
2         P3   70   F                 1    1      0      0     0     0   
3         P4   45   M                 0    0      0      0     1     0   
4         P5   60   F                 1    0      0      1     0     0   
5         P6   55   M                 0    0      0      0     0     1   
6         P7   62   F                 1    1      0      0     0     0   
7         P8   48   M                 0    0      0      0     0     0   
8         P9   67   F                 1    0      1      0     0     0   
9        P10   53   F                 0    0      0      0     1     0   

   PIK3CA  TP53  
0       0     1  
1       0     0  
2       1     0  
3       0     0  
4       0     1  
5       0     0  
6       0     0  
7       1     0  
8       0     0  
9       0     1  
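
The merge above uses the default inner join. That is harmless here because every mock patient has at least one variant, but in general a patient with no reported variants would silently disappear. The sketch below shows one way to make the alignment explicit, using a small hypothetical pair of tables in which P3 has no variants and P4 has no clinical record. The tables and the names `gene_flags`, `no_variants`, and `no_clinical` are illustrative, not part of the notebook.

```python
import pandas as pd

# Hypothetical clinical table with three patients.
clinical = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "age": [65, 50, 70],
})
# Hypothetical variant table: P3 has no variants, P4 has no clinical record.
variants = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P4"],
    "gene": ["TP53", "EGFR", "KRAS", "ALK"],
})

# One row per patient, one 0/1 column per gene.
gene_flags = pd.crosstab(variants["patient_id"], variants["gene"]).clip(upper=1)
gene_cols = list(gene_flags.columns)

# Left join keeps every clinical patient; validate guards against duplicate IDs.
merged = clinical.merge(
    gene_flags,
    how="left",
    left_on="patient_id",
    right_index=True,
    validate="one_to_one",
    indicator=True,
)

# Patients present only in the clinical table have no variants on record.
no_variants = merged.loc[merged["_merge"] == "left_only", "patient_id"].tolist()
# Patients present only in the variant table are reported, not silently lost.
no_clinical = sorted(set(gene_flags.index) - set(clinical["patient_id"]))

# Missing gene flags mean "no variant observed", so fill them with 0.
merged[gene_cols] = merged[gene_cols].fillna(0).astype(int)
merged = merged.drop(columns="_merge")

print("No variants:", no_variants)
print("No clinical record:", no_clinical)
print(merged)
```

Whether a patient with no variants should be kept as all zeros or excluded is a study design decision. The point of the sketch is only that the choice is made visibly.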

In [3]:
# Feature Engineering (gene-level encoding)
# The gene-level 0/1 encoding comes from the pivot in the previous cell;
# here we encode the categorical clinical feature (sex)
encoded_data = merged_data.copy()
encoded_data['sex_F'] = (encoded_data['sex'] == 'F').astype(int)
encoded_data['sex_M'] = (encoded_data['sex'] == 'M').astype(int)

# Select features for modeling
feature_cols = list(pivot.columns) + ['age', 'sex_F', 'sex_M']
X = encoded_data[feature_cols]
y = encoded_data['therapy_response']

print('Feature Matrix (X):')
print(X)
print('\nTarget (y):')
print(y)

Feature Matrix (X):
   ALK  BRCA1  BRCA2  EGFR  KRAS  PIK3CA  TP53  age  sex_F  sex_M
0    0      0      0     1     0       0     1   65      1      0
1    0      1      0     0     1       0     0   50      0      1
2    1      0      0     0     0       1     0   70      1      0
3    0      0      0     1     0       0     0   45      0      1
4    0      0      1     0     0       0     1   60      1      0
5    0      0      0     0     1       0     0   55      0      1
6    1      0      0     0     0       0     0   62      1      0
7    0      0      0     0     0       1     0   48      0      1
8    0      1      0     0     0       0     0   67      1      0
9    0      0      0     1     0       0     1   53      1      0

Target (y):
0    1
1    0
2    1
3    0
4    1
5    0
6    1
7    0
8    1
9    0
Name: therapy_response, dtype: int64
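
Because `sex_F` and `sex_M` always sum to 1, one of the two columns is redundant. The same information can be carried by a single indicator. A minimal sketch using `pd.get_dummies` with `drop_first=True` follows, on a hypothetical four-patient excerpt of the clinical table.

```python
import pandas as pd

# Hypothetical excerpt of the clinical table (first four mock patients).
clinical = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3", "P4"],
    "age": [65, 50, 70, 45],
    "sex": ["F", "M", "F", "M"],
})

# drop_first=True drops the first category ('F'), leaving a single sex_M column.
encoded = pd.get_dummies(clinical, columns=["sex"], drop_first=True, dtype=int)
print(encoded)
```

With a single indicator, the coefficient for `sex_M` can be read as the effect of male relative to female, instead of being split across two mirrored columns.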


In [4]:
# Simple Interpretable Model (Logistic Regression and Random Forest)
# Split data for demonstration (note: with 10 synthetic patients, test_size=0.4 holds out only 4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
print('Logistic Regression Classification Report:')
print(classification_report(y_test, y_pred_logreg))

# Random Forest
rf = RandomForestClassifier(n_estimators=10, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print('Random Forest Classification Report:')
print(classification_report(y_test, y_pred_rf))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         2

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4

Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       1.00      0.50      0.67         2

    accuracy                           0.75         4
   macro avg       0.83      0.75      0.73         4
weighted avg       0.83      0.75      0.73         4
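
With only 4 held-out patients, each misclassification moves accuracy by 25 points, so a single split says little. One common alternative for very small datasets is leave-one-out cross-validation, sketched below on the feature matrix printed earlier. This is an illustration of the technique, not part of the original workflow. `max_iter=1000` is an added assumption to give the solver room to converge on unscaled features.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Feature matrix and target copied from the printed output of the notebook.
columns = ["ALK", "BRCA1", "BRCA2", "EGFR", "KRAS", "PIK3CA", "TP53",
           "age", "sex_F", "sex_M"]
rows = [
    [0, 0, 0, 1, 0, 0, 1, 65, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 50, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 70, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 45, 0, 1],
    [0, 0, 1, 0, 0, 0, 1, 60, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 55, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 62, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 48, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 67, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 53, 1, 0],
]
X = pd.DataFrame(rows, columns=columns)
y = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], name="therapy_response")

# Each patient is predicted by a model trained on the other nine.
loo_pred = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut()
)
loo_accuracy = accuracy_score(y, loo_pred)
print(f"Leave-one-out accuracy: {loo_accuracy:.2f}")
```

Every patient contributes one out-of-sample prediction this way, although with 10 synthetic patients the estimate remains illustrative only.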



In [5]:
# Basic Interpretation (feature importance or coefficients)
# Logistic Regression coefficients
# (features are unscaled: age is per year, gene and sex flags are per 0/1 change)
print('Logistic Regression Coefficients:')
for feature, coef in zip(feature_cols, logreg.coef_[0]):
    print(f'{feature}: {coef:.3f}')

# Random Forest feature importances
print('\nRandom Forest Feature Importances:')
for feature, importance in zip(feature_cols, rf.feature_importances_):
    print(f'{feature}: {importance:.3f}')

Logistic Regression Coefficients:
ALK: 0.019
BRCA1: 0.000
BRCA2: 0.070
EGFR: -0.086
KRAS: 0.000
PIK3CA: -0.003
TP53: -0.016
age: 0.685
sex_F: 0.004
sex_M: -0.004

Random Forest Feature Importances:
ALK: 0.200
BRCA1: 0.000
BRCA2: 0.000
EGFR: 0.225
KRAS: 0.000
PIK3CA: 0.025
TP53: 0.000
age: 0.500
sex_F: 0.050
sex_M: 0.000
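
The logistic regression coefficients above are on different scales: `age` is per year, while the gene and sex flags are per 0/1 change, so their magnitudes are not directly comparable. A minimal sketch of one way to put them on a common footing is to standardize the features inside a pipeline before fitting. The four-column subset used here is an assumption for brevity, with values copied from the printed feature matrix.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Subset of the printed feature matrix (chosen for brevity, not by the notebook).
X = pd.DataFrame({
    "EGFR": [1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    "TP53": [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
    "age": [65, 50, 70, 45, 60, 55, 62, 48, 67, 53],
    "sex_M": [0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
})
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# StandardScaler rescales every column to zero mean and unit variance,
# so each coefficient is expressed per standard deviation of its feature.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

coefs = pd.Series(
    model.named_steps["logisticregression"].coef_[0], index=X.columns
)
print(coefs.round(3))
```

Standardized coefficients are easier to rank, but they still reflect only this tiny synthetic sample and should not be read as biological effect sizes.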


# Summary and Limitations

This notebook presented a minimal, reproducible workflow for integrating synthetic NGS variant data and structured clinical data to predict therapy response using interpretable models. Key steps included patient-level data alignment, gene-level binary encoding, model training, and model interpretation. All data were synthetic and results are illustrative only. Limitations include the use of mock data, a very small sample (10 patients, with only 4 in the test split, so the reported metrics are not stable estimates), and simplified feature engineering (presence or absence of a variant per gene, ignoring the specific variant). This workflow is intended to demonstrate feasibility and logic, not clinical validity or performance.