Converts raw breast MRI volumes (DICOM) into highly compact 32×32×32 tensors with 3 channels: post-contrast structure (adaptive max pool), local heterogeneity (pooled variance E[X²]−E[X]² per micro-cell—related to texture but without explicit GLCM or LBP), and kinetics (pooled post − pre). This codebase trains a custom 3D-ResNet on those tensors for a binary target described in clinical shorthand as benign vs. malignant—but operationally defined from the Duke spreadsheet (see Training labels below).
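The three channels described above can be sketched with standard PyTorch pooling ops. This is an illustrative reconstruction, not the repo's exact `weave_4d_micro_cube`; the function name and signature here are assumptions:

```python
import torch
import torch.nn.functional as F

def weave_micro_cube(post: torch.Tensor, pre: torch.Tensor,
                     size=(32, 32, 32)) -> torch.Tensor:
    """Sketch of the three-channel weave (hypothetical helper, not repo code).

    post, pre: float32 volumes shaped [D, H, W], already cropped to the ROI.
    """
    p = post[None, None]                        # [1, 1, D, H, W] for pooling ops
    # Channel 1: structure via adaptive max pooling of the post-contrast volume
    c1 = F.adaptive_max_pool3d(p, size)
    # Channel 2: local heterogeneity as pooled variance E[X^2] - E[X]^2
    mean = F.adaptive_avg_pool3d(p, size)
    c2 = F.adaptive_avg_pool3d(p * p, size) - mean * mean
    # Channel 3: kinetics as pooled (post - pre); resize pre if shapes differ
    q = pre[None, None]
    if q.shape != p.shape:
        q = F.interpolate(q, size=p.shape[2:], mode="trilinear",
                          align_corners=False)
    c3 = F.adaptive_avg_pool3d(p - q, size)
    return torch.cat([c1, c2, c3], dim=1)[0]    # [3, 32, 32, 32]
```

Note that Channel 2 computes variance per pooled micro-cell directly from the two pooled moments, which is what makes it cheap compared to explicit GLCM/LBP texture features.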
The micro-cube itself is a powerful input representation: with different clinical labels and a modified classification head (e.g., multi-class molecular subtypes), it could be adapted to other diagnostic tasks, subject to cohort size and data availability.
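Adapting the head is a one-layer change. A minimal sketch, assuming a 128-dimensional feature vector before the classifier (the actual width lives in `config.py`):

```python
import torch.nn as nn

# Binary head, paired with BCEWithLogitsLoss as in this repo:
head_binary = nn.Linear(128, 1)
# Multi-class head, e.g. 4 molecular subtypes with CrossEntropyLoss:
head_subtypes = nn.Linear(128, 4)
```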
Training and evaluation use datasets/Clinical_and_Other_Features.xlsx (loaded with header=1, as in the Duke TCIA companion table). Only rows with a non-null Mol Subtype value are kept; patients without that field never enter the training set.
| Code label | Rule in train.py | Meaning |
|---|---|---|
| 0 | `Mol Subtype` ≤ 0 | "Negative" class in experiments |
| 1 | `Mol Subtype` > 0 | "Positive" class in experiments |
This is a pragmatic proxy: it maps a numeric molecular subtype column to a binary target. It is not wired to a free-text pathology or imaging report inside this repo. Metrics (AUC, accuracy, etc.) therefore measure separability under this rule, not an abstract gold standard for every possible definition of malignancy. For other institutions or papers, swap in labels from your own approved reference (e.g., biopsy-confirmed BIRADS, pT stage) and adjust BioLatticeDataset accordingly.
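The labeling rule above boils down to a few lines of pandas. A minimal sketch (the function name is hypothetical; the column name and `header=1` convention follow the Duke spreadsheet as described above):

```python
import pandas as pd

def make_binary_labels(df: pd.DataFrame) -> pd.Series:
    """Binary proxy target from the Duke companion table.

    df: result of pd.read_excel("datasets/Clinical_and_Other_Features.xlsx",
        header=1). Rows with a null Mol Subtype are dropped, matching train.py.
    """
    kept = df[df["Mol Subtype"].notna()]
    return (kept["Mol Subtype"] > 0).astype(int)  # 0 → negative, >0 → positive
```

Swapping in a different reference standard means replacing only this mapping, not the tensor pipeline.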
Instead of training massive, energy-hungry 3D Convolutional Networks directly on gigabyte-scale DICOMs, Bio-Lattice mathematically condenses clinical data into microscopic 4D tensors prior to deep learning. This allows the core 3D-ResNet to train natively on consumer-grade hardware (e.g., Apple Silicon) in minutes rather than days on cloud GPUs. This architecture drastically reduces the operational carbon footprint and cloud computing costs, democratizing high-tier medical research without sacrificing diagnostic sensitivity.
- Python 3.10+ (the project is developed locally on Python 3.13)
- Duke-cohort data:
  `datasets/raw_data/<PatientID>/...`, `datasets/Annotation_Boxes.xlsx`, `datasets/Clinical_and_Other_Features.xlsx`
Configuration: Tunable paths, Duke series keywords, training hyperparameters, inference threshold, and model widths are centralized in config.py (repository root). Edit that file to adjust behavior without hunting through main.py / train.py / predict.py.
```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

PyTorch: if you need a specific hardware variant (CPU/CUDA), follow the official installation guide.
Testing used the public Duke Breast Cancer MRI dataset from The Cancer Imaging Archive (TCIA).
Since raw DICOM MRI sequences and the generated 4D tensors weigh hundreds of gigabytes, they are not included in this repository, keeping clones lightweight and fast.
- To test the pipeline out-of-the-box: download the pre-compiled tensors directly from Hugging Face Datasets.
- Alternative quick test: use the few `.pt` files provided in the `datasets/examples_microcubos/` folder. Just copy them into `datasets/micro_cubos/` to instantly run Predict (`predict.py`) or test the Streamlit dashboard on patients `Breast_MRI_001` through `007`.
- To reproduce the full research: download the native Duke cohort from TCIA, extract the DICOMs into `datasets/raw_data/<PatientID>/`, and run the Data Extraction Module (`main.py`) to generate your own tensors.
`python main.py` — Builds `datasets/micro_cubos/<PatientID>_lattice.pt` (see below for exact behavior and limits).
This pipeline is optimized for a Duke-style TCIA layout and raw pydicom loading. It is not a full clinical preprocessing stack.
| Step | Reality |
|---|---|
| Series choice | Walks each patient folder, reads one DICOM per subfolder for SeriesDescription, then picks pre vs post with substring rules tuned to Duke naming (pre, dyn without “phase”, vs 1st / ph1 / etc.). If several folders match, the last match in the walk wins—there is no tie-break or UI. Other sites will usually need new rules or manual series mapping. |
| 3D stack | Slices are sorted by ImagePositionPatient[2] and stacked as pixel_array (float32). RescaleSlope / RescaleIntercept are not applied here; downstream training applies per-cube Z-score, not absolute intensity calibration. |
| ROI | Box comes from Annotation_Boxes.xlsx: Start Slice/Row/Column are treated as 1-based and shifted by −1 for NumPy; End Slice/Row/Column feed Python slice ends as in the current code (verify against your TCIA column definitions if you change cohorts). The same integer box is applied to both phases before any resize. |
| Pre vs post geometry | There is no rigid or deformable registration. After crop, if pre and post tensors differ in shape, weave_4d_micro_cube only trilinearly resizes the pre ROI to match the post ROI volume (see main.py). That aligns grid size, not guaranteed anatomy. Channel 3 (post − pre) assumes phases are already comparable within the ROI (as in many research datasets with fixed protocols). |
| Output | A single [3, 32, 32, 32] tensor per patient: pooled post-contrast structure, pooled local variance on post, and pooled post − pre kinetics. |
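The "3D stack" and "ROI" rows above can be sketched as follows. These helpers are illustrative assumptions, not the repo's actual functions; note that, matching the table, no `RescaleSlope`/`RescaleIntercept` is applied:

```python
import numpy as np

def stack_slices(datasets):
    """datasets: pydicom Dataset objects (or anything exposing
    ImagePositionPatient and pixel_array). Sorts by the z component and
    stacks raw pixel values as float32 — no rescale slope/intercept."""
    ordered = sorted(datasets, key=lambda ds: float(ds.ImagePositionPatient[2]))
    return np.stack([ds.pixel_array for ds in ordered]).astype(np.float32)

def crop_roi(volume, start, end):
    """Apply a 1-based (Start/End Slice/Row/Column) box from
    Annotation_Boxes.xlsx: starts shifted by -1, ends used as slice ends."""
    (s0, s1, s2), (e0, e1, e2) = start, end
    return volume[s0 - 1:e0, s1 - 1:e1, s2 - 1:e2]
```

The same integer box would be applied to both phases before any resize, as the table states.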
```mermaid
graph TD
    %% Global Styles
    classDef raw fill:#eef2f5,stroke:#93a1a1,stroke-width:2px,color:#2c3e50;
    classDef process fill:#fef9e7,stroke:#d4ac0d,stroke-width:2px,color:#7d6608;
    classDef tensor fill:#e8f8f5,stroke:#28b463,stroke-width:2px,color:#145a32,font-weight:bold;
    classDef net fill:#fdf2e9,stroke:#e74c3c,stroke-width:2px,color:#78281f;
    %% 1. Raw DICOM Sequences
    subgraph phase1 ["Phase 1 - Duke MRI dataset"]
        PRE["Pre-Contrast Phase <br/> V_pre"]:::raw
        POST["Post-Contrast Phase <br/> V_post"]:::raw
    end
    %% 2. Pairing + ROI (no rigid registration in code)
    subgraph phase2 ["Phase 2 - Phase pairing and ROI"]
        SEL["PRE/POST folders <br/> SeriesDescription heuristics (Duke)"]:::process
        ROI["Same 3D box + ~20% halo <br/> from Annotation_Boxes.xlsx"]:::process
    end
    PRE --> SEL
    POST --> SEL
    SEL --> ROI
    %% 3. The Tensor Weaver
    subgraph phase3 ["Phase 3 - Weave to 32 x 32 x 32"]
        C1["Channel 1: Structure <br/> Adaptive Max Pooling"]:::process
        C2["Channel 2: Local heterogeneity <br/> pooled var E[X^2] - E[X]^2"]:::process
        C3["Channel 3: Kinetics <br/> pooled post - pre (pre resized if shape mismatch)"]:::process
    end
    ROI --> C1
    ROI --> C2
    ROI --> C3
    %% 4. Final Output
    CUB{"(Bio-Lattice Tensor) <br/> 3 ch x 32 x 32 x 32"}:::tensor
    C1 --> CUB
    C2 --> CUB
    C3 --> CUB
    %% 5. Downstream Inference
    RES(["3D-ResNet Architecture"]):::net
    LOSS("Supervision: Duke Mol Subtype <br/> binary rule + BCEWithLogitsLoss")
    CUB ==> RES
    RES -.-> LOSS
```
- `python train.py` — Trains the `BioLattice3DResNet` residual classifier natively and saves the optimal model weights to `datasets/modelo/biolattice_3dresnet_binary.pth`.
- `python predict.py` — Performs Virtual Biopsy inference for a specific `Patient ID`.
- `streamlit run dashboard/app.py` — Launches the interactive UI orchestrator to handle the full pipeline and dataset evaluations visually.
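A single training step on micro-cubes is small enough to sketch end to end. The `nn.Sequential` model below is a stand-in, not the repo's `BioLattice3DResNet`; the `BCEWithLogitsLoss` supervision and the `[3, 32, 32, 32]` input shape come from the source:

```python
import torch
import torch.nn as nn

# Stand-in model (architecture illustrative only, not BioLattice3DResNet):
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 1),
)
criterion = nn.BCEWithLogitsLoss()                  # binary target, as in train.py
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

cubes = torch.rand(4, 3, 32, 32, 32)                # a batch of micro-cubes
labels = torch.tensor([0., 1., 1., 0.])             # Mol Subtype > 0 rule

logits = model(cubes).squeeze(1)                    # [4] raw logits
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the inputs are 32³ cubes rather than full volumes, one such step fits comfortably in consumer-grade memory, which is the point of the weave.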
The core bet of Bio-Lattice is to separate two problems: (1) turning large, multi-phase breast MRI into a small, task-aware tensor that still carries structure, heterogeneity, and enhancement dynamics, and (2) training a light 3D model that can iterate quickly on consumer hardware. That split keeps the research loop cheap: you can revisit labels, augmentations, or heads without always paying for full-volume training.
The repo today is a deliberately minimal slice of that idea (Duke heuristics, simple ROI logic, no full registration). Improving each phase over time is the main lever to strengthen the project—not rewriting everything at once, but tightening the weakest link:
| Area | How further work could help |
|---|---|
| Phase pairing & geometry | Rigid/deformable registration, confidence when multiple series match, or explicit series UID config for new cohorts would make the kinetics channel more trustworthy across sites. |
| Intensities & physics | Applying DICOM rescale where appropriate, or normalization tied to acquisition parameters, could improve cross-scanner robustness before or after the weave. |
| Weave design | Different resolutions (e.g. 48³), extra channels (true radiomics blocks, T1w context), or learnable downsampling could raise the ceiling without abandoning the “micro-cube” philosophy. |
| Labels & evaluation | Pathology-aligned targets, external validation, and patient-level splits documented in the repo would align claims with clinical meaning. |
| Training | Architecture search, calibration of thresholds, or uncertainty estimates could sit on top of the same tensors without changing the green-AI story. |
None of that is required to run or extend this prototype; it is a roadmap-shaped note so contributors know where effort pays off next.
This is strictly a Research Prototype, not a certified medical device. Do not use for final clinical decisions or standalone patient diagnosis.