msancheza/biolattice

# microCube (Bio-Lattice)

Converts raw breast MRI volumes (DICOM) into highly compact 32×32×32 tensors with 3 channels: post-contrast structure (adaptive max pool), local heterogeneity (pooled variance E[X²]−E[X]² per micro-cell—related to texture but without explicit GLCM or LBP), and kinetics (pooled post − pre). This codebase trains a custom 3D-ResNet on those tensors for a binary target described in clinical shorthand as benign vs. malignant—but operationally defined from the Duke spreadsheet (see Training labels below).
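The three channels can be sketched with standard PyTorch pooling ops. This is a hedged re-implementation of the weave described above, not the repo's exact code (the authoritative logic is `weave_4d_micro_cube` in `main.py`; the function name below is illustrative):

```python
import torch
import torch.nn.functional as F

def weave_micro_cube(pre: torch.Tensor, post: torch.Tensor, size: int = 32) -> torch.Tensor:
    """Condense a pre/post ROI pair into a [3, size, size, size] tensor."""
    # Add batch/channel dims for the 3D pooling ops: [1, 1, D, H, W]
    post5d = post[None, None].float()
    pre5d = pre[None, None].float()

    # If pre and post ROIs differ in shape, trilinearly resize pre to match post
    # (grid alignment only; no anatomical registration, as noted above).
    if pre.shape != post.shape:
        pre5d = F.interpolate(pre5d, size=tuple(post.shape),
                              mode="trilinear", align_corners=False)

    out = (size, size, size)
    # Channel 1: structure — adaptive max pooling of the post-contrast ROI.
    structure = F.adaptive_max_pool3d(post5d, out)
    # Channel 2: local heterogeneity — pooled variance via E[X^2] - E[X]^2.
    mean = F.adaptive_avg_pool3d(post5d, out)
    mean_sq = F.adaptive_avg_pool3d(post5d ** 2, out)
    variance = (mean_sq - mean ** 2).clamp_min(0.0)
    # Channel 3: kinetics — pooled (post - pre) enhancement.
    kinetics = F.adaptive_avg_pool3d(post5d - pre5d, out)

    return torch.cat([structure, variance, kinetics], dim=1)[0]
```

Per-cube Z-score normalization, as noted below, is left to the training side.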

The micro-cube itself is a powerful input representation: with different clinical labels and a modified classification head (e.g., multi-class molecular subtypes), it could be adapted to other diagnostic tasks, subject to cohort size and data availability.
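As a sketch of that adaptation: swapping the binary head for a K-way classifier only touches the final layer. The `backbone` below stands in for the repo's 3D-ResNet trunk; all names are illustrative, not the repo's actual identifiers:

```python
import torch.nn as nn

def make_subtype_head(backbone: nn.Module, feat_dim: int, num_classes: int) -> nn.Module:
    """Attach a multi-class head (e.g. molecular subtypes) to a feature trunk.

    Hypothetical sketch: train with CrossEntropyLoss instead of the binary
    BCEWithLogitsLoss used in this repo.
    """
    return nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))
```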

## Training labels (ground truth)

Training and evaluation use `datasets/Clinical_and_Other_Features.xlsx` (loaded with `header=1`, as in the Duke TCIA companion table). Only rows with a non-null `Mol Subtype` value are kept; patients without that field never enter the training set.

| Code label | Rule in `train.py` | Meaning |
| --- | --- | --- |
| 0 | `Mol Subtype` ≤ 0 | “Negative” class in experiments |
| 1 | `Mol Subtype` > 0 | “Positive” class in experiments |

This is a pragmatic proxy: it maps a numeric molecular subtype column to a binary target. It is not wired to a free-text pathology or imaging report inside this repo. Metrics (AUC, accuracy, etc.) therefore measure separability under this rule, not an abstract gold standard for every possible definition of malignancy. For other institutions or papers, swap in labels from your own approved reference (e.g., biopsy-confirmed pathology, BI-RADS, pT stage) and adjust `BioLatticeDataset` accordingly.
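The label rule above takes only a few lines of pandas. This is a sketch consistent with the table, not a copy of the repo's code (the authoritative version is in `train.py`; the function name is illustrative):

```python
import pandas as pd

def binarize_mol_subtype(df: pd.DataFrame) -> pd.Series:
    """Apply the train.py label rule to a Duke-style clinical table."""
    kept = df[df["Mol Subtype"].notna()]        # drop patients without the field
    return (kept["Mol Subtype"] > 0).astype(int)

# In the repo the table would come from:
#   df = pd.read_excel("datasets/Clinical_and_Other_Features.xlsx", header=1)
```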

## 🌱 Green AI & Computational Efficiency

Instead of training massive, energy-hungry 3D Convolutional Networks directly on gigabyte-scale DICOMs, Bio-Lattice mathematically condenses clinical data into microscopic 4D tensors prior to deep learning. This allows the core 3D-ResNet to train natively on consumer-grade hardware (e.g., Apple Silicon) in minutes rather than days on cloud GPUs. This architecture drastically reduces the operational carbon footprint and cloud computing costs, democratizing high-tier medical research without sacrificing diagnostic sensitivity.

## Requirements

- Python 3.10+ (the project is developed locally on Python 3.13)
- Duke-cohort-style data: `datasets/raw_data/<PatientID>/...`, `datasets/Annotation_Boxes.xlsx`, `datasets/Clinical_and_Other_Features.xlsx`

Configuration: tunable paths, Duke series keywords, training hyperparameters, inference threshold, and model widths are centralized in `config.py` (repository root). Edit that file to adjust behavior without hunting through `main.py` / `train.py` / `predict.py`.
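For orientation, a hypothetical sketch of the kind of values such a file centralizes. These names are illustrative only, not the repo's actual identifiers in `config.py`:

```python
# Illustrative config fragment — names and values are assumptions,
# not the repo's real config.py contents.
RAW_DATA_ROOT = "datasets/raw_data"        # DICOM input tree
MICRO_CUBE_DIR = "datasets/micro_cubos"    # generated .pt tensors
PRE_KEYWORDS = ("pre",)                    # Duke series-name heuristics
POST_KEYWORDS = ("1st", "ph1")
CUBE_SIZE = 32                             # output grid per axis
LEARNING_RATE = 1e-3
PREDICT_THRESHOLD = 0.5                    # inference cut-off
```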

## Installation

```shell
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

PyTorch: If you need a specific hardware variant (CPU/CUDA), follow the official installation guide.

## Dataset & Reproducibility (Duke Breast Cancer MRI)

Testing used the public Duke Breast Cancer MRI dataset from The Cancer Imaging Archive (TCIA).

Since the raw DICOM MRI sequences and the generated 4D tensors weigh hundreds of gigabytes, they are not included in this repository; cloning stays fast.

- **To test the pipeline out-of-the-box:** download the pre-compiled tensors directly from Hugging Face Datasets.
- **Alternative quick test:** use the few `.pt` files provided in the `datasets/examples_microcubos/` folder. Copy them into `datasets/micro_cubos/` to run inference (`predict.py`) or the Streamlit dashboard on patients `Breast_MRI_001` through `007`.
- **To reproduce the full research:** download the native Duke cohort from TCIA, extract the DICOMs into `datasets/raw_data/<PatientID>/`, and run the data-extraction module (`main.py`) to generate your own tensors.

## Usage (Core Pipeline)

1. `python main.py` — Builds `datasets/micro_cubos/<PatientID>_lattice.pt` (see below for exact behavior and limits).

### What main.py actually does

This pipeline is optimized for a Duke-style TCIA layout and raw pydicom loading. It is not a full clinical preprocessing stack.

| Step | Reality |
| --- | --- |
| Series choice | Walks each patient folder, reads one DICOM per subfolder for `SeriesDescription`, then picks pre vs. post with substring rules tuned to Duke naming (`pre`, `dyn` without “phase”, vs. `1st` / `ph1` / etc.). If several folders match, the last match in the walk wins; there is no tie-break or UI. Other sites will usually need new rules or manual series mapping. |
| 3D stack | Slices are sorted by `ImagePositionPatient[2]` and stacked as `pixel_array` (float32). `RescaleSlope` / `RescaleIntercept` are not applied here; downstream training applies per-cube Z-score, not absolute intensity calibration. |
| ROI | Box comes from `Annotation_Boxes.xlsx`: Start Slice/Row/Column are treated as 1-based and shifted by −1 for NumPy; End Slice/Row/Column feed Python slice ends as in the current code (verify against your TCIA column definitions if you change cohorts). The same integer box is applied to both phases before any resize. |
| Pre vs. post geometry | There is no rigid or deformable registration. After crop, if pre and post tensors differ in shape, `weave_4d_micro_cube` only trilinearly resizes the pre ROI to match the post ROI volume (see `main.py`). That aligns grid size, not guaranteed anatomy. Channel 3 (post − pre) assumes phases are already comparable within the ROI (as in many research datasets with fixed protocols). |
| Output | A single `[3, 32, 32, 32]` tensor per patient: pooled post-contrast structure, pooled local variance on post, and pooled post − pre kinetics. |
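The 3D-stack and ROI steps above can be sketched in a few lines of NumPy. These are hedged re-implementations with illustrative helper names; the repo's actual code lives in `main.py`:

```python
import numpy as np

def stack_series(slices) -> np.ndarray:
    """Sort pydicom-style datasets by z position and stack as float32.

    Mirrors the table above: slices ordered by ImagePositionPatient[2],
    no RescaleSlope / RescaleIntercept applied.
    """
    ordered = sorted(slices, key=lambda ds: float(ds.ImagePositionPatient[2]))
    return np.stack([ds.pixel_array for ds in ordered]).astype(np.float32)

def crop_roi(volume: np.ndarray,
             start_slice: int, end_slice: int,
             start_row: int, end_row: int,
             start_col: int, end_col: int) -> np.ndarray:
    """Apply an Annotation_Boxes.xlsx box: Start* are 1-based (shifted by -1),
    End* feed Python slice ends directly, as in the current code."""
    return volume[start_slice - 1:end_slice,
                  start_row - 1:end_row,
                  start_col - 1:end_col]
```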
```mermaid
graph TD
    %% Global Styles
    classDef raw fill:#eef2f5,stroke:#93a1a1,stroke-width:2px,color:#2c3e50;
    classDef process fill:#fef9e7,stroke:#d4ac0d,stroke-width:2px,color:#7d6608;
    classDef tensor fill:#e8f8f5,stroke:#28b463,stroke-width:2px,color:#145a32,font-weight:bold;
    classDef net fill:#fdf2e9,stroke:#e74c3c,stroke-width:2px,color:#78281f;

    %% 1. Raw DICOM Sequences
    subgraph phase1 ["Phase 1 - Duke MRI dataset"]
        PRE["Pre-Contrast Phase <br/> V_pre"]:::raw
        POST["Post-Contrast Phase <br/> V_post"]:::raw
    end

    %% 2. Pairing + ROI (no rigid registration in code)
    subgraph phase2 ["Phase 2 - Phase pairing and ROI"]
        SEL["PRE/POST folders <br/> SeriesDescription heuristics (Duke)"]:::process
        ROI["Same 3D box + ~20% halo <br/> from Annotation_Boxes.xlsx"]:::process
    end

    PRE --> SEL
    POST --> SEL
    SEL --> ROI

    %% 3. The Tensor Weaver
    subgraph phase3 ["Phase 3 - Weave to 32 x 32 x 32"]
        C1["Channel 1: Structure <br/> Adaptive Max Pooling"]:::process
        C2["Channel 2: Local heterogeneity <br/> pooled var E[X^2] - E[X]^2"]:::process
        C3["Channel 3: Kinetics <br/> pooled post - pre (pre resized if shape mismatch)"]:::process
    end

    ROI --> C1
    ROI --> C2
    ROI --> C3

    %% 4. Final Output
    CUB{"(Bio-Lattice Tensor) <br/> 3 ch x 32 x 32 x 32"}:::tensor
    C1 --> CUB
    C2 --> CUB
    C3 --> CUB

    %% 5. Downstream Inference
    RES(["3D-ResNet Architecture"]):::net
    LOSS("Supervision: Duke Mol Subtype <br/> binary rule + BCEWithLogitsLoss")

    CUB ==> RES
    RES -.-> LOSS
```
2. `python train.py` — Trains the `BioLattice3DResNet` residual classifier natively and saves the best model weights to `datasets/modelo/biolattice_3dresnet_binary.pth`.
3. `python predict.py` — Performs Virtual Biopsy inference for a specific Patient ID.
4. `streamlit run dashboard/app.py` — Launches the interactive UI orchestrator to run the full pipeline and dataset evaluations visually.
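The inference step can be sketched as a sigmoid over the model's single logit plus the configurable threshold. This is a hedged outline, not the repo's `predict.py`; the function name and signature are illustrative:

```python
import torch

def virtual_biopsy(model: torch.nn.Module, cube: torch.Tensor,
                   threshold: float = 0.5) -> tuple[float, int]:
    """One micro-cube in, a probability and a thresholded binary call out.

    The threshold is centralized in config.py in the repo; 0.5 here is
    only a placeholder default.
    """
    model.eval()
    with torch.no_grad():
        logit = model(cube[None])            # [1, 3, 32, 32, 32] -> [1, 1]
        prob = torch.sigmoid(logit).item()   # BCEWithLogitsLoss training => sigmoid at inference
    return prob, int(prob >= threshold)
```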

## Why this direction matters (potential & iteration)

The core bet of Bio-Lattice is to separate two problems: (1) turning large, multi-phase breast MRI into a small, task-aware tensor that still carries structure, heterogeneity, and enhancement dynamics, and (2) training a light 3D model that can iterate quickly on consumer hardware. That split keeps the research loop cheap: you can revisit labels, augmentations, or heads without always paying for full-volume training.

The repo today is a deliberately minimal slice of that idea (Duke heuristics, simple ROI logic, no full registration). Improving each phase over time is the main lever to strengthen the project—not rewriting everything at once, but tightening the weakest link:

| Area | How further work could help |
| --- | --- |
| Phase pairing & geometry | Rigid/deformable registration, confidence when multiple series match, or explicit series UID config for new cohorts would make the kinetics channel more trustworthy across sites. |
| Intensities & physics | Applying DICOM rescale where appropriate, or normalization tied to acquisition parameters, could improve cross-scanner robustness before or after the weave. |
| Weave design | Different resolutions (e.g. 48³), extra channels (true radiomics blocks, T1w context), or learnable downsampling could raise the ceiling without abandoning the “micro-cube” philosophy. |
| Labels & evaluation | Pathology-aligned targets, external validation, and patient-level splits documented in the repo would align claims with clinical meaning. |
| Training | Architecture search, calibration of thresholds, or uncertainty estimates could sit on top of the same tensors without changing the green-AI story. |

None of that is required to run or extend this prototype; it is a roadmap-shaped note so contributors know where effort pays off next.

## Medical Disclaimer

This is strictly a Research Prototype, not a certified medical device. Do not use for final clinical decisions or standalone patient diagnosis.
