# Project 3: Data Exploration

**Objective:** Explore and prepare network databases and patient data for heterogeneous GNN construction.

## Outline
1. Load and explore STRING PPI network
2. Load and explore Reactome pathways
3. Load and explore ENCODE TF-target relationships
4. Load and explore GTEx co-expression
5. Examine ClinVar patient-variant data
6. Basic network statistics

In [None]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path

# Project paths
PROJECT_ROOT = Path("..")
DATA_DIR = PROJECT_ROOT / "data"
NETWORK_DIR = DATA_DIR / "networks"
PATIENT_DIR = DATA_DIR / "patients"

## 1. STRING PPI Network

Download from: https://string-db.org/
Filter to: Human (NCBI taxonomy ID 9606), combined_score >= 700 (STRING's high-confidence cutoff on its 0-1000 scale)

In [None]:
# TODO: Download and process STRING
# string_url = "https://stringdb-downloads.org/download/protein.links.v12.0/9606.protein.links.v12.0.txt.gz"

# Expected columns (space-delimited file): protein1, protein2, combined_score
# Filter: combined_score >= 700
# Map: Ensembl protein IDs (ENSP, prefixed with "9606." in STRING) to gene symbols via BioMart
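
The filtering step above can be sketched as follows. This is a minimal sketch, not the project's loader: the function name `load_string_links` is hypothetical, the toy rows are made up, and the gene-symbol mapping is left out because it depends on which BioMart export is used.

```python
import io

import pandas as pd


def load_string_links(source, min_score=700):
    """Read a STRING protein.links table and keep edges at or above min_score."""
    # The links file is space-delimited with a single header row.
    links = pd.read_csv(source, sep=" ")
    links = links[links["combined_score"] >= min_score].copy()
    # STRING IDs carry a taxon prefix ("9606.ENSP..."); drop it so the
    # remaining Ensembl protein ID can be mapped to a gene symbol later.
    for col in ("protein1", "protein2"):
        links[col] = links[col].str.split(".", n=1).str[1]
    return links.reset_index(drop=True)


# Toy rows in the STRING layout; the IDs and scores are made up.
toy_links = io.StringIO(
    "protein1 protein2 combined_score\n"
    "9606.ENSP00000000001 9606.ENSP00000000002 950\n"
    "9606.ENSP00000000001 9606.ENSP00000000003 400\n"
    "9606.ENSP00000000002 9606.ENSP00000000003 700\n"
)
string_edges = load_string_links(toy_links)
print(string_edges)
```

For the real download, `source` can be the path to the `.txt.gz` file; pandas infers gzip compression from the extension.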

## 2. Reactome Pathways

Download from: https://reactome.org/download-data

In [None]:
# TODO: Download and process Reactome
# Expected output: gene_symbol, pathway_id, pathway_name
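
One way to reach the expected long format is to start from a gene-set file in GMT layout. The sketch below assumes each line is tab-separated as pathway name, pathway ID, then gene symbols; that layout should be checked against the file actually downloaded. The function name `gmt_to_long` and the toy pathways are hypothetical.

```python
import pandas as pd


def gmt_to_long(lines):
    """Turn GMT-style lines into one row per (gene, pathway) pair."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # Assumed column order: name, ID, then the member genes.
        pathway_name, pathway_id, genes = fields[0], fields[1], fields[2:]
        for gene in genes:
            if gene:
                records.append((gene, pathway_id, pathway_name))
    return pd.DataFrame(
        records, columns=["gene_symbol", "pathway_id", "pathway_name"]
    )


# Toy lines; the pathway IDs and gene names are made up.
toy_gmt = [
    "Toy pathway one\tR-HSA-0000001\tGENE_A\tGENE_B\tGENE_C\n",
    "Toy pathway two\tR-HSA-0000002\tGENE_B\tGENE_D\n",
]
reactome_long = gmt_to_long(toy_gmt)
print(reactome_long)
```

Keeping one row per gene-pathway pair makes it straightforward to add gene-pathway edges to a heterogeneous graph later.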

## 3. ENCODE TF-Target Relationships

Source: ENCODE ChIP-seq peaks (TF binding sites near gene promoters)

In [None]:
# TODO: Query ENCODE API or use pre-processed TF-target DB
# Expected output: tf_gene, target_gene, cell_type, score
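
If peaks are used directly, one simple rule is to link a TF to every gene whose transcription start site (TSS) lies within a fixed window of a peak midpoint. The sketch below assumes two already-loaded tables with hypothetical column names (`chrom`, `start`, `end`, `tf`, `cell_type`, `signal` for peaks; `chrom`, `tss`, `gene` for TSS annotations) and an arbitrary 1 kb window. It is not the ENCODE API, and all values are made up.

```python
import pandas as pd


def peaks_to_tf_targets(peaks, tss, window=1000):
    """Link each TF to genes whose TSS is within `window` bp of a peak midpoint."""
    # Pair every peak with every TSS on the same chromosome.
    merged = peaks.merge(tss, on="chrom")
    midpoint = (merged["start"] + merged["end"]) // 2
    near = merged[(midpoint - merged["tss"]).abs() <= window]
    # Keep the strongest peak per (TF, gene, cell type) as the edge score.
    edges = (
        near.groupby(["tf", "gene", "cell_type"], as_index=False)["signal"]
        .max()
        .rename(columns={"tf": "tf_gene", "gene": "target_gene", "signal": "score"})
    )
    return edges[["tf_gene", "target_gene", "cell_type", "score"]]


# Toy tables; coordinates, names, and signals are made up.
toy_peaks = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr2"],
    "start": [1000, 1100, 50000, 500],
    "end": [1200, 1300, 50200, 700],
    "tf": ["TF_A", "TF_A", "TF_A", "TF_B"],
    "cell_type": ["cellX"] * 4,
    "signal": [5.0, 8.0, 3.0, 4.0],
})
toy_tss = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "tss": [1500, 90000, 600],
    "gene": ["GENE1", "GENE2", "GENE3"],
})
tf_targets = peaks_to_tf_targets(toy_peaks, toy_tss)
print(tf_targets)
```

The per-chromosome merge grows with peaks times genes, so it suits exploration on small subsets; an interval-overlap tool is likely a better fit for genome-wide peak sets.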

## 4. GTEx Co-expression

Source: GTEx v8 expression data; compute pairwise Pearson correlations between genes within each tissue

In [None]:
# TODO: Download GTEx, compute correlations
# Filter: |correlation| >= 0.7
# Expected output: gene1, gene2, tissue, correlation
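
The correlation and threshold step can be sketched as below, assuming a genes-by-samples expression matrix for a single tissue has already been extracted. The function name `coexpression_edges` and the toy values are hypothetical; any normalization or log transform is assumed to happen beforehand.

```python
import numpy as np
import pandas as pd


def coexpression_edges(expr, tissue, min_abs_r=0.7):
    """Return gene pairs whose Pearson |r| across samples is >= min_abs_r."""
    # np.corrcoef treats each row (gene) as one variable.
    r = np.corrcoef(expr.to_numpy(dtype=float))
    # The upper triangle without the diagonal lists each pair once.
    i, j = np.triu_indices_from(r, k=1)
    keep = np.abs(r[i, j]) >= min_abs_r
    genes = expr.index.to_numpy()
    return pd.DataFrame({
        "gene1": genes[i[keep]],
        "gene2": genes[j[keep]],
        "tissue": tissue,
        "correlation": r[i, j][keep],
    })


# Toy matrix: rows are genes, columns are samples; values are made up.
toy_expr = pd.DataFrame(
    [
        [1, 2, 3, 4, 5],
        [2, 4, 6, 8, 10],
        [5, 4, 3, 2, 1],
        [1, 3, 2, 3, 1],
    ],
    index=["A", "B", "C", "D"],
    columns=["s1", "s2", "s3", "s4", "s5"],
)
gtex_edges = coexpression_edges(toy_expr, tissue="toy_tissue")
print(gtex_edges)
```

A dense correlation matrix over all genes is large, so on real data it may be necessary to restrict to expressed or variable genes first.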

## 5. ClinVar Patient-Variant Data

Source: ClinVar VCF with pathogenic/likely pathogenic variants

In [None]:
# TODO: Parse ClinVar VCF
# Extract: variant_id (VCF ID column), gene (GENEINFO), disease (CLNDN), significance (CLNSIG)
# Filter: CLNSIG is Pathogenic or Likely_pathogenic
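
The extraction and filter can be sketched without a VCF library by splitting the INFO column by hand. This assumes the INFO keys `CLNSIG`, `CLNDN`, and `GENEINFO` and the significance labels in `KEEP_SIGNIFICANCE`; both should be checked against the header of the downloaded file. The function name `parse_clinvar_lines` and the toy records are hypothetical.

```python
import pandas as pd

# Assumed CLNSIG labels to keep; verify against the actual file.
KEEP_SIGNIFICANCE = {
    "Pathogenic",
    "Likely_pathogenic",
    "Pathogenic/Likely_pathogenic",
}


def parse_clinvar_lines(lines):
    """Extract pathogenic and likely pathogenic records from VCF text lines."""
    records = []
    for line in lines:
        if line.startswith("#"):
            continue  # header and column-name lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        # INFO is a ';'-separated list of KEY=VALUE entries.
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        significance = info.get("CLNSIG", "")
        if significance not in KEEP_SIGNIFICANCE:
            continue
        # GENEINFO looks like "SYMBOL:id|SYMBOL2:id2"; keep the first symbol.
        gene = info.get("GENEINFO", "").split("|")[0].split(":")[0]
        records.append({
            "variant_id": fields[2],
            "gene": gene,
            "disease": info.get("CLNDN", ""),
            "significance": significance,
        })
    return pd.DataFrame(
        records, columns=["variant_id", "gene", "disease", "significance"]
    )


# Toy VCF lines; IDs, genes, and diseases are made up.
toy_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t11111\tA\tG\t.\t.\tCLNDN=Toy_disease_1;CLNSIG=Pathogenic;GENEINFO=GENE_A:1\n",
    "1\t2000\t22222\tC\tT\t.\t.\tCLNDN=Toy_disease_2;CLNSIG=Benign;GENEINFO=GENE_B:2\n",
    "2\t3000\t33333\tG\tA\t.\t.\tCLNDN=Toy_disease_3;CLNSIG=Likely_pathogenic;GENEINFO=GENE_C:3|GENE_D:4\n",
]
clinvar_variants = parse_clinvar_lines(toy_vcf)
print(clinvar_variants)
```

Taking only the first `GENEINFO` symbol is a simplification; variants that overlap several genes could instead be expanded into one row per gene.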

## 6. Network Statistics

In [None]:
# TODO: Compute and visualize
# - Node count per network
# - Edge count per network
# - Degree distribution
# - Network overlap (genes in multiple networks)
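
The four statistics above can be sketched with networkx, assuming each network has been reduced to a list of undirected gene-gene edges. The function name `summarize_networks` and the toy edge lists are hypothetical.

```python
from collections import Counter

import networkx as nx


def summarize_networks(edge_lists):
    """Per-network node/edge counts and degree histogram, plus shared genes."""
    graphs = {name: nx.Graph(edges) for name, edges in edge_lists.items()}
    summary = {
        name: {
            "nodes": g.number_of_nodes(),
            "edges": g.number_of_edges(),
            # degree_hist[d] is the number of nodes with degree d.
            "degree_hist": nx.degree_histogram(g),
        }
        for name, g in graphs.items()
    }
    # Count in how many networks each gene appears.
    membership = Counter(gene for g in graphs.values() for gene in g.nodes)
    shared = sorted(gene for gene, count in membership.items() if count > 1)
    return summary, shared


# Toy edge lists; gene names are placeholders.
toy_networks = {
    "ppi": [("A", "B"), ("B", "C"), ("A", "C")],
    "coexpression": [("A", "D"), ("D", "E")],
}
network_summary, shared_genes = summarize_networks(toy_networks)
print(network_summary)
print(shared_genes)
```

Each `degree_hist` list can be passed to `plt.bar` for a quick look at the degree distribution.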

## Next Steps

After data exploration:
1. Run `02_graph_construction.ipynb` to build patient graphs
2. See `../src/graph/builder.py` for the PatientGraphBuilder class