# Copyright and License

© 2026, Isabel Bejerano Blazquez  

This Jupyter Notebook is licensed under the **MIT License**.  

## Disclaimer

- This notebook is provided *“as is”*, without warranty of any kind, express or implied.  
- The author assumes no responsibility or liability for any errors, omissions, or outcomes resulting from the use of this notebook or its contents.  
- All analyses and interpretations are for **educational and research purposes only** and do not constitute medical or clinical advice.  

## Dataset Note

- The analyses presented in this notebook are based on the **GSE2034 microarray dataset** from the **NCBI Gene Expression Omnibus (GEO)**, focusing on breast cancer patients.  
- The dataset is publicly available and contains **gene expression measurements** along with **clinical outcome metadata**, including relapse-free survival information.  
- The dataset is fully **de-identified** and intended for research and educational use.  

---

# Predictive Pipeline: Survival-Based Stratification of Breast Cancer Patients Using Microarray Data (GSE2034)

**Dataset:** [GSE2034: Gene Expression Profiles of Breast Cancer Patients](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2034)  

**Local path:** `../datasets`  

**License:** Publicly available from NCBI GEO; fully **de-identified** and intended for **research and educational purposes**.  

## Abstract

Molecular profiling of breast cancer patients is a powerful approach for understanding tumor biology and predicting clinical outcomes. This notebook presents a reproducible, end-to-end predictive modeling pipeline applied to the **GSE2034 microarray dataset** for the unsupervised clustering and survival-based stratification of breast cancer patients.

The analysis uses unsupervised clustering to identify distinct patient subgroups from gene expression data, followed by survival analysis of relapse-free survival outcomes. The pipeline includes data preprocessing, normalization, clustering (e.g., hierarchical clustering, k-means), and integration of clinical metadata (e.g., survival status). The prognostic value of the identified clusters is assessed with survival analysis methods, including Kaplan-Meier curves and Cox proportional hazards models.
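To make the survival step concrete, here is a minimal sketch of a per-cluster Kaplan-Meier (product-limit) estimate in plain Python. It is an illustration under stated assumptions, not the notebook's actual implementation: the function names `kaplan_meier` and `km_by_cluster` are hypothetical, the cluster labels are assumed to come from an earlier clustering step, and the eight patients are synthetic toy data rather than GSE2034 records. In practice a dedicated survival analysis library would likely be used instead.

```python
from collections import defaultdict


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) pairs.

    times  -- follow-up durations (any consistent unit)
    events -- 1 if relapse was observed, 0 if the patient was censored
    """
    relapses = defaultdict(int)  # observed relapses at each time point
    exits = defaultdict(int)     # patients leaving the risk set (relapse or censoring)
    for t, e in zip(times, events):
        exits[t] += 1
        if e:
            relapses[t] += 1

    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = relapses[t]
        if d:
            # Product-limit step: multiply by the fraction surviving this event time
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def km_by_cluster(labels, times, events):
    """Compute one Kaplan-Meier curve per cluster label."""
    grouped = defaultdict(lambda: ([], []))
    for label, t, e in zip(labels, times, events):
        grouped[label][0].append(t)
        grouped[label][1].append(e)
    return {label: kaplan_meier(ts, es) for label, (ts, es) in grouped.items()}


# Synthetic toy data (NOT from GSE2034): cluster label, months to relapse or
# last follow-up, and relapse indicator for eight hypothetical patients.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
times = [12, 30, 45, 60, 5, 9, 14, 40]
events = [0, 1, 0, 0, 1, 1, 1, 0]

curves = km_by_cluster(labels, times, events)
for label in sorted(curves):
    print(label, [(t, round(s, 3)) for t, s in curves[label]])
```

Each curve records the estimated relapse-free fraction only at times where a relapse occurred. Censored patients lower the number at risk without changing the estimate, which is why the cluster with three censored patients has a single step.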

All findings are interpreted within a biological and clinical framework, considering the molecular heterogeneity of breast cancer and the potential for personalized treatment strategies. The notebook emphasizes transparency, reproducibility, and methodological rigor, making it suitable for educational and research use, including exploratory work on clinical risk stratification.


## Dataset Relevance

The **GSE2034 microarray dataset** is a widely used resource for breast cancer research, containing gene expression data from 286 breast cancer samples measured on the Affymetrix HG-U133A platform. The dataset is accompanied by clinical metadata, including relapse-free survival status, making it highly relevant for **predictive modeling, unsupervised clustering, and survival analysis** in cancer research.

With 286 samples and array-wide gene expression measurements, the dataset provides a rich resource for studying the molecular heterogeneity of breast cancer and identifying distinct subgroups of patients. While GSE2034 does not include detailed treatment information or long-term follow-up data, its integration of **clinical outcomes** (such as relapse-free survival) allows for **prognostic modeling** and risk stratification.

Because the dataset is publicly available and **fully de-identified**, it supports **ethical use, reproducibility**, and **transparency** in cancer research workflows. Its accessibility makes it an ideal resource for educational purposes, exploratory analysis, and the development of tools aimed at improving clinical decision-making and personalized treatment strategies.
