
1. Introduction and Problem Statement

Cancer is not a single disease but a collection of biologically distinct subtypes.
In breast cancer, molecular subtypes such as Luminal A, Luminal B, HER2-enriched, and Basal-like show different prognoses and treatment responses.

RNA sequencing measures gene expression at genome scale and captures underlying tumor biology. Machine learning models can learn patterns across thousands of genes to distinguish cancer subtypes.

Goal:
Predict breast cancer molecular subtypes from TCGA RNA-seq data and extract interpretable biological insights.

2. Dataset Description

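We use TCGA Breast Cancer (BRCA) RNA-seq data, which is available from the UCSC Xena public hub. The code below loads it through the Kaggle dataset orvile/gene-expression-profiles-of-breast-cancer, which bundles a BC-TCGA subset alongside GSE2034, GSE25066, and simulation data.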

Samples: ~1,000 tumor samples

Features: ~20,000 genes

Labels: PAM50 molecular subtypes

Challenges:

Extreme dimensionality

Noise and batch effects

Class imbalance

Limited sample size relative to the number of genes
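
One common way to approach the challenges listed above is to chain a variance filter, standardization, PCA, and a class-weighted classifier. The sketch below does this on synthetic data, not on the TCGA matrix. The array shapes, class sizes, component count, and names such as X_demo and demo_pipeline are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for an expression matrix (NOT TCGA data):
# 200 samples x 2,000 "genes", four deliberately imbalanced classes.
class_sizes = [100, 50, 30, 20]
y_demo = np.repeat(np.arange(4), class_sizes)
X_demo = rng.normal(size=(y_demo.size, 2000))

# Give each class a shifted mean on its own block of 25 features.
for k in range(4):
    X_demo[y_demo == k, k * 25:(k + 1) * 25] += 2.0

# Make the last 500 features constant, mimicking genes with no variation.
X_demo[:, 1500:] = 0.0

# Stratify so that the rare classes appear in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

demo_pipeline = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),   # drop constant features
    ("scale", StandardScaler()),                      # per-feature z-scores
    ("pca", PCA(n_components=20, random_state=42)),   # reduce dimensionality
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
demo_pipeline.fit(X_tr, y_tr)

n_kept = int(demo_pipeline.named_steps["variance"].get_support().sum())
print("Features kept after variance filter:", n_kept)
print("Held-out accuracy:", demo_pipeline.score(X_te, y_te))
```

Fitting every step inside one Pipeline on the training split keeps the filter, scaler, and PCA from seeing held-out samples, which matters when samples are scarce relative to features.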


In [8]:
import numpy as np
import pandas as pd
import os
import zipfile

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

import torch
import torch.nn as nn
import torch.optim as optim

import kagglehub

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)


<torch._C.Generator at 0x78aecb6df550>

In [11]:
# -------------------------------
# 2. Download and Extract Dataset
# -------------------------------
dataset_path = kagglehub.dataset_download("orvile/gene-expression-profiles-of-breast-cancer")
print("Dataset downloaded to:", dataset_path)

# Extract all ZIP files in the dataset folder
for f in os.listdir(dataset_path):
    if f.endswith(".zip"):
        with zipfile.ZipFile(os.path.join(dataset_path, f), 'r') as zip_ref:
            zip_ref.extractall(dataset_path)

print("Files after extraction:", os.listdir(dataset_path))


Using Colab cache for faster access to the 'gene-expression-profiles-of-breast-cancer' dataset.
Dataset downloaded to: /kaggle/input/gene-expression-profiles-of-breast-cancer
Files after extraction: ['GSE2034', 'BC-TCGA', 'Simulation-Data', 'GSE25066']
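
The listing above shows only directories at the top level, so the actual data files sit one level (or more) down. A minimal sketch for inspecting them is given here. The helper name list_dataset_files and its max_per_dir parameter are hypothetical, introduced only for illustration.

```python
import os


def list_dataset_files(root_dir, max_per_dir=10):
    """Map each directory under root_dir (as a relative path) to its file names.

    At most max_per_dir names are kept per directory to keep the output short.
    """
    listing = {}
    for current_dir, _subdirs, file_names in os.walk(root_dir):
        rel = os.path.relpath(current_dir, root_dir)
        listing[rel] = sorted(file_names)[:max_per_dir]
    return listing


# Example call inside the notebook, where `dataset_path` is set by the
# download cell:
# for folder, files in list_dataset_files(dataset_path).items():
#     print(folder, "->", files)
```

Checking the real file names and extensions this way before hard-coding a pattern avoids guessing how the files in each subdirectory are named.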


In [13]:
# -------------------------------
# 3. Select Dataset
# -------------------------------
# Options (directory names from the listing above):
# "BC-TCGA", "GSE2034", "GSE25066", "Simulation-Data"
DATASET = "BC-TCGA"

# The archive unpacks into one subdirectory per dataset, so search recursively;
# os.listdir(dataset_path) alone only sees those directories, not the files in them.
csv_files = [
    os.path.join(root, name)
    for root, _, names in os.walk(dataset_path)
    for name in names
    if name.endswith(".csv")
]

def find_file(kind):
    """Return the first CSV whose name contains f"{DATASET}_{kind}", or fail clearly."""
    matches = [p for p in csv_files if f"{DATASET}_{kind}" in os.path.basename(p)]
    if not matches:
        raise FileNotFoundError(
            f"No CSV matching '{DATASET}_{kind}' under {dataset_path}; found: {csv_files}"
        )
    return matches[0]

# The "<DATASET>_expression" / "<DATASET>_labels" naming is this notebook's assumption
expression_file = find_file("expression")
labels_file = find_file("labels")

# Load CSVs
expression_df = pd.read_csv(expression_file, index_col=0)
labels_df = pd.read_csv(labels_file)
print(f"Expression shape: {expression_df.shape}, Labels shape: {labels_df.shape}")