# Copyright and License

© 2026, Isabel Bejerano Blazquez  

This Jupyter Notebook is licensed under the **MIT License**.  

## Disclaimer

- This notebook is provided *“as is”*, without warranty of any kind, express or implied.  
- The author assumes no responsibility or liability for any errors, omissions, or outcomes resulting from the use of this notebook or its contents.  
- All analyses and interpretations are for **educational and research purposes only** and do not constitute medical or clinical advice.  

## Dataset Note

- The analyses presented in this notebook are based on the **GSE2034 microarray dataset** from the **NCBI Gene Expression Omnibus (GEO)**, focusing on breast cancer patients.  
- The dataset is publicly available and contains **gene expression measurements** along with **clinical outcome metadata**, including relapse-free survival information.  
- The dataset is fully **de-identified** and intended for research and educational use.  

---

# Predictive Pipeline: Survival-Based Stratification of Breast Cancer Patients Using Microarray Data (GSE2034)

**Dataset:** [GSE2034: Gene Expression Profiles of Breast Cancer Patients](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2034)  

**Local path:** `../datasets`  

**License:** Publicly available from NCBI GEO; fully **de-identified** and intended for **research and educational purposes**.  

## Abstract

Molecular profiling of breast cancer patients is a powerful approach for understanding tumor biology and predicting clinical outcomes. This notebook presents a reproducible, end-to-end predictive modeling pipeline applied to the **GSE2034 microarray dataset** for the unsupervised clustering and survival-based stratification of breast cancer patients.

The analysis focuses on unsupervised clustering techniques to identify distinct patient subgroups based on gene expression data, followed by survival analysis to assess relapse-free survival outcomes. The pipeline includes data preprocessing, normalization, clustering algorithms (e.g., hierarchical clustering, k-means), and integration of clinical metadata (e.g., survival status). The prognostic value of the identified clusters is evaluated with survival analysis methods, including Kaplan–Meier curves and Cox proportional hazards models.

All findings are interpreted within a biological and clinical framework, considering the molecular heterogeneity of breast cancer and the potential for personalized treatment strategies. The notebook emphasizes transparency, reproducibility, and methodological rigor, making it suitable for educational and research applications, including exploratory risk stratification.


## Dataset Relevance

The **GSE2034 microarray dataset** is a widely used resource for breast cancer research, containing gene expression data from 286 breast cancer samples measured on the Affymetrix HG-U133A platform. The dataset is accompanied by clinical metadata, including relapse-free survival status, making it highly relevant for **predictive modeling, unsupervised clustering, and survival analysis** in cancer research.

With 286 tumor samples and genome-wide expression measurements, the dataset provides a rich resource for studying the molecular heterogeneity of breast cancer and identifying distinct subgroups of patients. While GSE2034 does not include detailed treatment information or long-term follow-up data, its integration of **clinical outcomes** (such as relapse-free survival) allows for **prognostic modeling** and risk stratification.

Because the dataset is publicly available and **fully de-identified**, it supports **ethical use, reproducibility**, and **transparency** in cancer research workflows. Its accessibility makes it an ideal resource for educational purposes, exploratory analysis, and the development of tools aimed at improving clinical decision-making and personalized treatment strategies.


## Relevance of Survival-Based Stratification as an Analytical Outcome

Survival outcomes, particularly relapse-free survival, are central endpoints in breast cancer research due to their direct clinical relevance and strong association with molecular tumor heterogeneity. Stratifying patients using gene expression–derived subgroups enables the identification of distinct risk strata that are not captured by conventional clinicopathological variables alone.

In benchmark datasets such as GSE2034, relapse-free survival is systematically annotated and integrated with genome-wide expression profiles, making it well suited for unsupervised survival-based stratification. Although treatment information is limited, the dataset is widely used for methodological evaluation and exploratory prognostic analyses in cancer genomics.

From a methodological standpoint, this framework supports risk group discovery rather than individual-level prediction, allowing survival distributions across molecular clusters to be compared using Kaplan–Meier estimators and Cox proportional hazards models. This stratification-oriented approach reflects real-world prognostic heterogeneity while remaining appropriate for educational and research-focused analyses.
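
To make the comparison of survival distributions concrete, the following is a minimal pure-Python sketch of the Kaplan–Meier product-limit estimator, S(t) = ∏ (1 − d_i / n_i), where d_i is the number of relapses at time t_i and n_i the number of patients still at risk. The function name `kaplan_meier` and the two toy clusters are hypothetical illustrations and are not taken from GSE2034; a full analysis would more likely rely on a dedicated survival library.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up durations (e.g. months to relapse or last follow-up)
    events : 1 if relapse was observed, 0 if the patient was censored
    """
    relapses = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = relapses.get(t, 0)
        if d > 0:
            # Product-limit update: multiply by the conditional survival at t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # remove both relapsed and censored patients
    return curve


# Hypothetical toy data for two clusters (invented values)
cluster_a = kaplan_meier([5, 8, 12, 20, 24], [1, 1, 0, 1, 0])
cluster_b = kaplan_meier([10, 15, 30, 36, 40], [0, 1, 0, 0, 0])

for name, curve in (("A", cluster_a), ("B", cluster_b)):
    print(name, [(t, round(s, 3)) for t, s in curve])
```

Censored patients do not lower the curve themselves; they only shrink the risk set used at later event times, which is why the two toy clusters end at different survival levels despite having the same size.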


## Analytical Framework and Research Design

This study adopts a **quantitative, observational, exploratory research design** based on publicly available molecular and clinical outcome data from the **GSE2034 breast cancer microarray dataset** hosted by the NCBI Gene Expression Omnibus (GEO). The primary objective is to develop a **reproducible end-to-end analytical pipeline** for **unsupervised patient stratification** using gene expression profiles, followed by survival-based evaluation of the resulting molecular subgroups.

The analytical framework is explicitly **exploratory and prognostic rather than causal or predictive at the individual level**. No attempt is made to infer causal mechanisms, treatment effects, or clinical decision rules. Model outputs, cluster assignments, and survival estimates are interpreted descriptively, with the goal of assessing whether unsupervised molecular stratification yields clinically meaningful separation in relapse-free survival outcomes.

The analysis proceeds through the following structured stages:

1. **Data ingestion and structural inspection**  
   Importation of gene expression and clinical metadata, verification of sample alignment, dimensionality, and outcome availability.

2. **Preprocessing and normalization**  
   Quality control, filtering, and normalization of high-dimensional microarray expression data to ensure comparability across samples.

3. **Feature selection and dimensionality reduction**  
   Selection of informative genes or transformation of the expression space to mitigate noise and high-dimensional effects.

4. **Unsupervised clustering**  
   Application of unsupervised learning algorithms (e.g., hierarchical clustering, k-means) to identify molecularly defined patient subgroups.

5. **Cluster characterization**  
   Examination of cluster composition, stability, and expression patterns to assess biological plausibility.

6. **Survival data integration**  
   Linking cluster assignments with relapse-free survival outcomes and censoring information.

7. **Survival-based stratification analysis**  
   Comparison of survival distributions across clusters using Kaplan–Meier estimators and Cox proportional hazards models.

8. **Interpretation and limitations**  
   Contextual interpretation of stratification results, with explicit discussion of data limitations, lack of treatment information, and constraints on clinical generalizability.

Although survival analysis techniques are employed, the primary objective of this study is **risk group discovery and methodological transparency**, not individual-level outcome prediction or clinical inference. The notebook is intended to demonstrate how unsupervised molecular stratification can be evaluated using survival endpoints in an ethical, reproducible, and research-focused analytical workflow.
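
Stages 2 to 4 above can be sketched roughly as follows. This is an illustrative outline on synthetic data, not the notebook's actual implementation: the log2 transform assumes unlogged intensities, the variance filter size `top_k`, the number of clusters, and the small `kmeans` helper are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an expression matrix (samples x probes); NOT GSE2034 data
n_per_group, n_probes = 20, 200
base = rng.lognormal(mean=6.0, sigma=0.3, size=(2 * n_per_group, n_probes))
base[n_per_group:, :10] *= 8.0  # second group over-expresses the first 10 probes

# Stage 2: log2 transform to stabilise variance (assumes unlogged intensities)
log_expr = np.log2(base + 1.0)

# Stage 3: keep the most variable probes, then z-score each probe
top_k = 20  # assumed filter size for this toy example
keep = np.argsort(log_expr.var(axis=0))[-top_k:]
X = log_expr[:, keep]
X = (X - X.mean(axis=0)) / X.std(axis=0)


# Stage 4: plain k-means (Lloyd's algorithm) written out for transparency
def kmeans(X, k, n_iter=50, seed=0):
    local_rng = np.random.default_rng(seed)
    centers = X[local_rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


labels = kmeans(X, k=2)
print("cluster sizes:", np.bincount(labels, minlength=2))
```

The resulting `labels` vector is what would later be joined to relapse-free survival outcomes in stages 6 and 7; no outcome information enters the clustering itself.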


### Outcome-Proximal Variables, Predictor Eligibility, and Leakage Control

To prevent semantic target leakage and ensure meaningful unsupervised stratification, predictor eligibility was determined based on **conceptual proximity to the outcome** rather than purely statistical association. In the GSE2034 breast cancer dataset, special care was taken to exclude variables that directly encode **relapse occurrence** or reflect **post-relapse information**.

The following variables were identified as **outcome-proximal** and **excluded from stratification features**:

| Variable | Description | Reason for Exclusion |
|----------|-------------|--------------------|
| `relapse` | Binary indicator of distant metastasis occurrence | Directly encodes the endpoint; including it would leak the outcome |
| `months.to.relapse.or.last.followup` | Time to relapse or last follow-up | Directly tied to the endpoint; post-outcome information |
| `Brain.relapses` | Indicator of brain-specific relapse | Post-relapse outcome information; would introduce circularity |

The remaining variables were retained, either as stratification features or for data linkage. Examples include:

- **Gene expression profiles (Affymetrix U133A microarray)** – baseline molecular features suitable for unsupervised analysis.  
- **`ER.Status`** – estrogen receptor status at baseline, considered a pre-treatment clinical attribute.  
- **Sample identifiers (`PID`, GEO accession)** – retained for data linkage but not used as predictive features.

By filtering variables in this way, stratification clusters are based solely on **baseline molecular and clinical characteristics**, avoiding circular definitions that would artificially inflate apparent differences in relapse outcomes. Outcome variables are reserved for **post-hoc evaluation** of cluster prognostic separation.
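
One way to enforce this separation in code is sketched below. The column names are those listed in the table above, but the table contents, the placeholder sample accessions, and the constant names `OUTCOME_PROXIMAL` and `IDENTIFIERS` are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

# Hypothetical clinical table using the column names listed above; values are invented
clinical = pd.DataFrame(
    {
        "PID": [101, 102, 103],
        "ER.Status": ["ER+", "ER-", "ER+"],
        "relapse": [0, 1, 0],
        "months.to.relapse.or.last.followup": [101, 24, 88],
        "Brain.relapses": [0, 1, 0],
    },
    index=["GSM_A", "GSM_B", "GSM_C"],  # placeholder sample accessions
)

OUTCOME_PROXIMAL = ["relapse", "months.to.relapse.or.last.followup", "Brain.relapses"]
IDENTIFIERS = ["PID"]

# Outcomes are set aside and used only for post-hoc survival evaluation
outcomes = clinical[OUTCOME_PROXIMAL]

# Candidate baseline features: neither outcome-proximal nor identifiers
features = clinical.drop(columns=OUTCOME_PROXIMAL + IDENTIFIERS)

# Guard against accidental leakage if the column lists change later
leaked = set(features.columns) & set(OUTCOME_PROXIMAL)
assert not leaked, f"outcome-proximal columns in features: {leaked}"

print(features.columns.tolist())
```

Keeping the exclusion list in one named constant makes the leakage rule explicit and easy to audit whenever new clinical columns are added.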


## Data Source and Ethical Considerations

This analysis uses the **GSE2034 breast cancer dataset** from **NCBI GEO**, containing de-identified gene expression profiles and clinical annotations for 286 primary breast tumor samples. The dataset is publicly available under standard GEO data usage policies.  

Ethical considerations:

- All samples and annotations are fully de-identified; no personal identifiers are included.  
- No attempts are made to re-identify patients.  
- Clustering features are restricted to baseline tumor molecular and clinical characteristics.  
- Outcome-proximal variables (e.g., relapse indicators, time to relapse, post-treatment response) are excluded from the clustering features to prevent leakage and ensure clusters reflect **pre-treatment characteristics** rather than post-outcome information; they are used only for post-hoc survival evaluation.  


## Dataset Loading and Verification

In [1]:
import pandas as pd

# Load the GEO Series Matrix, skipping metadata lines starting with '!'
# For tab-delimited GEO matrix files
filename = '../datasets/GSE2034_series_matrix.txt'

df = pd.read_csv(
    filename,
    sep='\t',
    comment='!',  # skip metadata lines
    header=0,
    engine='python',
    skipinitialspace=True
)

# Inspect dataset
print(f"Shape: {df.shape}")
print("Columns:", df.columns.tolist())
df.head()


Shape: (22283, 287)
Columns: ['ID_REF', 'GSM36777', 'GSM36778', 'GSM36779', 'GSM36780', 'GSM36781', 'GSM36782', 'GSM36783', 'GSM36784', 'GSM36785', 'GSM36786', 'GSM36787', 'GSM36788', 'GSM36789', 'GSM36790', 'GSM36791', 'GSM36792', 'GSM36793', 'GSM36794', 'GSM36795', 'GSM36796', 'GSM36797', 'GSM36798', 'GSM36799', 'GSM36800', 'GSM36801', 'GSM36802', 'GSM36803', 'GSM36804', 'GSM36805', 'GSM36806', 'GSM36807', 'GSM36808', 'GSM36809', 'GSM36810', 'GSM36811', 'GSM36812', 'GSM36813', 'GSM36814', 'GSM36815', 'GSM36816', 'GSM36817', 'GSM36818', 'GSM36819', 'GSM36820', 'GSM36821', 'GSM36822', 'GSM36823', 'GSM36824', 'GSM36825', 'GSM36826', 'GSM36827', 'GSM36828', 'GSM36829', 'GSM36830', 'GSM36831', 'GSM36832', 'GSM36833', 'GSM36834', 'GSM36835', 'GSM36836', 'GSM36837', 'GSM36838', 'GSM36839', 'GSM36840', 'GSM36841', 'GSM36842', 'GSM36843', 'GSM36844', 'GSM36845', 'GSM36846', 'GSM36847', 'GSM36848', 'GSM36849', 'GSM36850', 'GSM36851', 'GSM36852', 'GSM36853', 'GSM36854', 'GSM36855', 'GSM36856', 

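|   | ID_REF | GSM36777 | GSM36778 | GSM36779 | GSM36780 | GSM36781 | GSM36782 | GSM36783 | GSM36784 | GSM36785 | ... | GSM37053 | GSM37054 | GSM37055 | GSM37056 | GSM37057 | GSM37058 | GSM37059 | GSM37060 | GSM37061 | GSM37062 |
|---|--------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| 0 | 1007_s_at | 3848.1 | 6520.9 | 5285.7 | 4043.7 | 4263.6 | 2949.8 | 5498.9 | 3863.1 | 3370.4 | ... | 4058.2 | 4017.6 | 2841.0 | 2914.2 | 3681.0 | 3066.9 | 2773.0 | 2984.3 | 3540.0 | 2620.0 |
| 1 | 1053_at | 228.9 | 112.5 | 178.4 | 398.7 | 417.7 | 221.2 | 280.4 | 198.2 | 304.7 | ... | 183.4 | 356.1 | 234.6 | 169.4 | 94.5 | 265.5 | 209.8 | 160.0 | 285.7 | 180.5 |
| 2 | 117_at | 213.1 | 189.8 | 269.7 | 312.4 | 327.1 | 225.0 | 243.5 | 244.4 | 348.5 | ... | 326.6 | 234.9 | 369.6 | 149.5 | 236.4 | 347.9 | 226.7 | 252.9 | 135.1 | 191.8 |
| 3 | 121_at | 1009.4 | 2083.3 | 1203.4 | 1104.4 | 1043.3 | 1117.6 | 1085.4 | 1423.1 | 1196.4 | ... | 1041.3 | 1195.6 | 751.5 | 1117.8 | 1022.4 | 1127.4 | 1071.8 | 1178.5 | 1256.7 | 1284.6 |
| 4 | 1255_g_at | 31.8 | 145.8 | 42.5 | 108.2 | 69.2 | 47.4 | 84.3 | 102.0 | 22.8 | ... | 143.5 | 32.7 | 62.6 | 43.0 | 100.5 | 47.0 | 45.1 | 146.3 | 75.9 | 87.4 |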
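
The structural checks described under "Data ingestion and structural inspection" could look roughly like the sketch below. It runs on a tiny in-memory stand-in for a series matrix file (the probe names, sample accessions, and values are invented) and reuses the same `comment='!'` convention as the loading cell; the specific checks shown are one reasonable choice, not a prescribed procedure.

```python
import io

import pandas as pd

# Tiny hypothetical stand-in for a GEO series matrix file (values invented)
raw = (
    '!Series_title\t"toy example"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM_A"\t"GSM_B"\t"GSM_C"\n'
    '"probe_1"\t10.5\t11.0\t9.8\n'
    '"probe_2"\t200.1\t180.4\t210.0\n'
    '!series_matrix_table_end\n'
)

# Lines starting with '!' are metadata and are skipped entirely
toy = pd.read_csv(io.StringIO(raw), sep='\t', comment='!', header=0)

# Use probe IDs as the index, then transpose to a samples x probes layout
expr = toy.set_index('ID_REF').T

# Basic structural checks before any modelling
assert expr.index.is_unique, "duplicate sample accessions"
assert expr.columns.is_unique, "duplicate probe IDs"
n_missing = int(expr.isna().sum().sum())

print("samples x probes:", expr.shape)
print("missing values:", n_missing)
```

Applied to the real matrix, the same transpose would turn the 22283 x 287 frame (one `ID_REF` column plus 286 sample columns) into 286 samples by 22283 probes.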




In [None]:
# Access sample-level metadata
import GEOparse
import pandas as pd

# Download the GSE2034 SOFT family file from GEO into destdir and parse it
gse = GEOparse.get_GEO("GSE2034", destdir="./")

# Collect the metadata dictionary of every sample (GSM), keyed by accession
metadata = {gsm_id: gsm.metadata for gsm_id, gsm in gse.gsms.items()}

# Example: look at the metadata for the first sample
sample_id = next(iter(metadata))
print(metadata[sample_id])

# Flatten metadata into a DataFrame with one row per sample
clinical_data = pd.DataFrame.from_dict(metadata, orient='index')

# Show first 5 rows
print(clinical_data.head())
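
Before cluster labels are linked to outcomes, the expression matrix and the clinical table need to refer to the same samples in the same order. The sketch below shows one possible alignment check on small invented frames; `expr`, `clinical_toy`, and the `GSM_*` accessions are placeholders, and the real tables may require additional cleaning.

```python
import pandas as pd

# Hypothetical stand-ins: expression is samples x probes, clinical has one row per sample
expr = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    index=["GSM_A", "GSM_B", "GSM_C"],
    columns=["probe_1", "probe_2"],
)
clinical_toy = pd.DataFrame(
    {"title": ["s3", "s1", "s4"]},
    index=["GSM_C", "GSM_A", "GSM_D"],
)

# Samples present in both tables, kept in the order of the expression matrix
shared = expr.index[expr.index.isin(clinical_toy.index)]
expr_aligned = expr.loc[shared]
clinical_aligned = clinical_toy.loc[shared]

# Report samples that would be dropped by the alignment
only_expr = expr.index.difference(clinical_toy.index).tolist()
only_clin = clinical_toy.index.difference(expr.index).tolist()

print("shared:", shared.tolist())
print("expression only:", only_expr)
print("clinical only:", only_clin)
```

Reporting the unmatched accessions, rather than dropping them silently, keeps the sample count traceable against the 286 samples expected for this dataset.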