# 00_raw_data_qc.ipynb — Raw Data Quality Check

- Confirm that the raw TCGA-BRCA expression and phenotype files exist, and compute an MD5 "fingerprint" for each to lock provenance before preprocessing.

- Load just enough to inspect each dataset's structure. The Xena "HiSeqV2" matrix typically has rows = genes and columns = samples, with the gene symbol/ID in the first column. The phenotype (clinical) file is samples × features (sample identifiers, molecular subtypes such as PAM50, ...).

- Check data structure, duplicate column names, and missing data.
- Detect sample-ID (`TCGA-XX-XXXX`) and subtype columns to ensure labels exist.
- Compare TCGA sample IDs between the expression and phenotype tables and examine sample types (tumor 01, normal 11, etc.).

- Examine the available molecular subtype columns (PAM50) and their label distributions to confirm valid categories.

- Save small preview snippets of the raw data to inspect structure before preprocessing.

**Inputs**
- data/raw/TCGA-BRCA.HiSeqV2.gz
- data/raw/TCGA-BRCA.GDC_phenotype.tsv

**Outputs**
- data/interim/expr_preview.tsv
- data/interim/pheno_preview.tsv
- reports/data_audit.txt


> This notebook is **read-only**: it inspects the raw files without cleaning or transforming them.

## 0. Notebook Setup & Paths

In [None]:
# Standard imports
import json, os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re, hashlib, textwrap, gzip
from IPython.display import display


In [2]:
# Project paths
PROJECT_ROOT = Path.cwd().parents[0] if Path.cwd().name == "notebooks" else Path.cwd()
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_INTERIM = PROJECT_ROOT / "data" / "interim"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
REPORTS = PROJECT_ROOT / "reports"
FIGS = PROJECT_ROOT / "reports" / "figures"
MODELS = PROJECT_ROOT / "models"


for d in (DATA_INTERIM, DATA_PROCESSED, REPORTS, FIGS, MODELS):
    d.mkdir(parents=True, exist_ok=True)

print("✓ Paths ready")
print("ROOT =", PROJECT_ROOT)

✓ Paths ready
ROOT = c:\Projects\BRCATranstypia


In [3]:
# Pretty display (these set options; they don't print)
pd.set_option("display.max_columns", 120)
sns.set(context="notebook", style="whitegrid", rc={"figure.figsize": (6,4)})

## 1. File Validation and Integrity

In [4]:
# File paths
expr_path  = DATA_RAW / "TCGA-BRCA.HiSeqV2.gz"
pheno_path = DATA_RAW / "TCGA-BRCA.GDC_phenotype.tsv"

assert expr_path.exists(), f"Missing file: {expr_path}"
assert pheno_path.exists(), f"Missing file: {pheno_path}"

def md5(path: Path, chunk=1<<20) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            b = f.read(chunk)
            if not b: break
            h.update(b)
    return h.hexdigest()

print("Expression file:", expr_path.name, "MD5:", md5(expr_path))
print("Phenotype file: ", pheno_path.name, "MD5:", md5(pheno_path))


Expression file: TCGA-BRCA.HiSeqV2.gz MD5: 3ea92ab7e8fcb6102ddca52a44d45002
Phenotype file:  TCGA-BRCA.GDC_phenotype.tsv MD5: 2192a1772da7df9e9599d005d3a0a483
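
The fingerprints only lock provenance if they are stored and compared on later runs. A minimal sketch of that idea follows; the helpers `md5_of`, `write_manifest`, and `check_manifest`, as well as the JSON manifest itself, are hypothetical and not part of this notebook.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file in chunks and return its MD5 hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def write_manifest(paths, manifest_path) -> dict:
    """Record {file name: md5} as JSON (hypothetical helper)."""
    manifest = {Path(p).name: md5_of(Path(p)) for p in paths}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


def check_manifest(paths, manifest_path) -> dict:
    """Return {file name: bool} telling whether each file still matches its record."""
    recorded = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    return {Path(p).name: recorded.get(Path(p).name) == md5_of(Path(p)) for p in paths}
```

In this notebook the calls would take `[expr_path, pheno_path]` plus a manifest location of the project's choosing; a `False` entry would indicate that a raw file changed since the fingerprint was recorded.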


## 2. Lightweight Raw Data Load (without transforms)

In [5]:
# Expression: handle .gz or .tsv seamlessly
if expr_path.suffix == ".gz":
    with gzip.open(expr_path, "rt") as fh:
        expr = pd.read_csv(fh, sep="\t", low_memory=False)
else:
    expr = pd.read_csv(expr_path, sep="\t", low_memory=False)

pheno = pd.read_csv(pheno_path, sep="\t", dtype=str, low_memory=False)


In [6]:
display(expr.iloc[:6, :6])
display(pheno.head(3))

Unnamed: 0,sample,TCGA-AR-A5QQ-01,TCGA-D8-A1JA-01,TCGA-BH-A0BQ-01,TCGA-BH-A0BT-01,TCGA-A8-A06X-01
0,ARHGEF10L,9.5074,7.4346,9.3216,9.0198,9.6417
1,HIF3A,1.5787,3.6607,2.7224,1.3414,0.5819
2,RNF17,0.0,0.6245,0.5526,0.0,0.0
3,RNF10,11.3676,11.9181,11.9665,13.1881,12.0036
4,RNF11,11.1292,13.5273,11.4105,11.0911,11.2545
5,RNF13,9.9722,10.8702,10.4406,10.4244,10.148


Unnamed: 0,sampleID,AJCC_Stage_nature2012,Age_at_Initial_Pathologic_Diagnosis_nature2012,CN_Clusters_nature2012,Converted_Stage_nature2012,Days_to_Date_of_Last_Contact_nature2012,Days_to_date_of_Death_nature2012,ER_Status_nature2012,Gender_nature2012,HER2_Final_Status_nature2012,Integrated_Clusters_no_exp__nature2012,Integrated_Clusters_unsup_exp__nature2012,Integrated_Clusters_with_PAM50__nature2012,Metastasis_Coded_nature2012,Metastasis_nature2012,Node_Coded_nature2012,Node_nature2012,OS_Time_nature2012,OS_event_nature2012,PAM50Call_RNAseq,PAM50_mRNA_nature2012,PR_Status_nature2012,RPPA_Clusters_nature2012,SigClust_Intrinsic_mRNA_nature2012,SigClust_Unsupervised_mRNA_nature2012,Survival_Data_Form_nature2012,Tumor_T1_Coded_nature2012,Tumor_nature2012,Vital_Status_nature2012,_INTEGRATION,_PANCAN_CNA_PANCAN_K8,_PANCAN_Cluster_Cluster_PANCAN,_PANCAN_DNAMethyl_BRCA,_PANCAN_DNAMethyl_PANCAN,_PANCAN_RPPA_PANCAN_K8,_PANCAN_UNC_RNAseq_PANCAN_K16,_PANCAN_miRNA_PANCAN,_PANCAN_mirna_BRCA,_PANCAN_mutation_PANCAN,_PATIENT,_cohort,_primary_disease,_primary_site,additional_pharmaceutical_therapy,additional_radiation_therapy,additional_surgery_locoregional_procedure,additional_surgery_metastatic_procedure,age_at_initial_pathologic_diagnosis,anatomic_neoplasm_subdivision,axillary_lymph_node_stage_method_type,axillary_lymph_node_stage_other_method_descriptive_text,bcr_followup_barcode,bcr_patient_barcode,bcr_sample_barcode,breast_cancer_surgery_margin_status,breast_carcinoma_estrogen_receptor_status,breast_carcinoma_immunohistochemistry_er_pos_finding_scale,breast_carcinoma_immunohistochemistry_pos_cell_score,breast_carcinoma_immunohistochemistry_prgstrn_rcptr_ps_fndng_scl,breast_carcinoma_primary_surgical_procedure_name,...,new_neoplasm_event_occurrence_anatomic_site,new_neoplasm_event_type,new_neoplasm_occurrence_anatomic_site_text,new_tumor_event_additional_surgery_procedure,new_tumor_event_after_initial_treatment,number_of_lymphnodes_positive_by_he,number_of_lymphnodes_positive_
by_ihc,oct_embedded,other_dx,pathologic_M,pathologic_N,pathologic_T,pathologic_stage,pathology_report_file_name,patient_id,person_neoplasm_cancer_status,pgr_detection_method_text,pos_finding_her2_erbb2_other_measurement_scale_text,pos_finding_metastatic_brst_crcnm_strgn_rcptr_thr_msrmnt_scl_txt,pos_finding_progesterone_receptor_other_measurement_scale_text,positive_finding_estrogen_receptor_other_measurement_scale_text,postoperative_rx_tx,primary_lymph_node_presentation_assessment,progesterone_receptor_level_cell_percent_category,project_code,radiation_therapy,sample_type,sample_type_id,surgical_procedure_purpose_other_text,system_version,targeted_molecular_therapy,tissue_prospective_collection_indicator,tissue_retrospective_collection_indicator,tissue_source_site,tumor_tissue_site,vial_number,vital_status,year_of_initial_pathologic_diagnosis,_GENOMIC_ID_TCGA_BRCA_exp_HiSeqV2_exon,_GENOMIC_ID_TCGA_BRCA_exp_HiSeqV2_PANCAN,_GENOMIC_ID_TCGA_BRCA_RPPA_RBN,_GENOMIC_ID_TCGA_BRCA_mutation,_GENOMIC_ID_TCGA_BRCA_PDMRNAseq,_GENOMIC_ID_TCGA_BRCA_hMethyl450,_GENOMIC_ID_TCGA_BRCA_RPPA,_GENOMIC_ID_TCGA_BRCA_PDMRNAseqCNV,_GENOMIC_ID_TCGA_BRCA_mutation_curated_wustl_gene,_GENOMIC_ID_TCGA_BRCA_hMethyl27,_GENOMIC_ID_TCGA_BRCA_PDMarrayCNV,_GENOMIC_ID_TCGA_BRCA_miRNA_HiSeq,_GENOMIC_ID_TCGA_BRCA_mutation_wustl_gene,_GENOMIC_ID_TCGA_BRCA_miRNA_GA,_GENOMIC_ID_TCGA_BRCA_exp_HiSeqV2_percentile,_GENOMIC_ID_data/public/TCGA/BRCA/miRNA_GA_gene,_GENOMIC_ID_TCGA_BRCA_gistic2thd,_GENOMIC_ID_data/public/TCGA/BRCA/miRNA_HiSeq_gene,_GENOMIC_ID_TCGA_BRCA_G4502A_07_3,_GENOMIC_ID_TCGA_BRCA_exp_HiSeqV2,_GENOMIC_ID_TCGA_BRCA_gistic2,_GENOMIC_ID_TCGA_BRCA_PDMarray
0,TCGA-3C-AAAU-01,,,,,,,,,,,,,,,,,,,,,,,,,,,,,TCGA-3C-AAAU-01,,,,,,,,,,TCGA-3C-AAAU,TCGA Breast Cancer (BRCA),breast invasive carcinoma,Breast,,,,,55,Left Lower Outer Quadrant,Sentinel lymph node biopsy plus axillary disse...,,TCGA-3C-AAAU-F68069,TCGA-3C-AAAU,TCGA-3C-AAAU-01A,,Positive,,,,,...,Lung,Distant Metastasis,,,NO,4,,True,No,MX,NX,TX,Stage X,TCGA-3C-AAAU.0CD23E1B-3FA3-4A43-AE6E-C8E7B5125...,AAAU,WITH TUMOR,,,,,,NO,YES,50-59%,,NO,Primary Tumor,1,,6th,,NO,YES,3C,Breast,A,LIVING,2004,6ef883fc-81f3-4089-95e0-86904ffc0d38,6ef883fc-81f3-4089-95e0-86904ffc0d38,,,TCGA-3C-AAAU-01,TCGA-3C-AAAU-01A-11D-A41Q-05,,TCGA-3C-AAAU-01,,,,TCGA-3C-AAAU-01,,,6ef883fc-81f3-4089-95e0-86904ffc0d38,,TCGA-3C-AAAU-01A-11D-A41E-01,TCGA-3C-AAAU-01,,6ef883fc-81f3-4089-95e0-86904ffc0d38,TCGA-3C-AAAU-01A-11D-A41E-01,
1,TCGA-3C-AALI-01,,,,,,,,,,,,,,,,,,,,,,,,,,,,,TCGA-3C-AALI-01,,,,,,,,,,TCGA-3C-AALI,TCGA Breast Cancer (BRCA),breast invasive carcinoma,Breast,,,,,50,Right Upper Outer Quadrant,Sentinel lymph node biopsy plus axillary disse...,,TCGA-3C-AALI-F68057,TCGA-3C-AALI,TCGA-3C-AALI-01A,,Positive,,,,,...,,,,,NO,1,,True,No,M0,N1a,T2,Stage IIB,TCGA-3C-AALI.84E6A935-1A49-4BC1-9669-3DEA161CF...,AALI,TUMOR FREE,,,,,,YES,YES,<10%,,YES,Primary Tumor,1,,6th,,NO,YES,3C,Breast,A,LIVING,2003,dd8d3665-ec9d-45be-b7b9-a85dac3585e2,dd8d3665-ec9d-45be-b7b9-a85dac3585e2,,,TCGA-3C-AALI-01,TCGA-3C-AALI-01A-11D-A41Q-05,1B37DE89-DEB6-4049-A2D4-450A4A1DE5D3,TCGA-3C-AALI-01,,,,TCGA-3C-AALI-01,,,dd8d3665-ec9d-45be-b7b9-a85dac3585e2,,TCGA-3C-AALI-01A-11D-A41E-01,TCGA-3C-AALI-01,,dd8d3665-ec9d-45be-b7b9-a85dac3585e2,TCGA-3C-AALI-01A-11D-A41E-01,
2,TCGA-3C-AALJ-01,,,,,,,,,,,,,,,,,,,,,,,,,,,,,TCGA-3C-AALJ-01,,,,,,,,,,TCGA-3C-AALJ,TCGA Breast Cancer (BRCA),breast invasive carcinoma,Breast,,,,,62,Right,,,TCGA-3C-AALJ-F71675,TCGA-3C-AALJ,,,Positive,,,,,...,,,,,NO,1,,,No,M0,N1a,T2,Stage IIB,,AALJ,TUMOR FREE,,,,,,NO,YES,30-39%,,NO,Primary Tumor,1,,7th,,NO,YES,3C,Breast,,LIVING,2011,c924c2a8-ab41-4499-bb30-79705cc17d45,c924c2a8-ab41-4499-bb30-79705cc17d45,,,TCGA-3C-AALJ-01,TCGA-3C-AALJ-01A-31D-A41Q-05,,TCGA-3C-AALJ-01,,,,TCGA-3C-AALJ-01,,,c924c2a8-ab41-4499-bb30-79705cc17d45,,TCGA-3C-AALJ-01A-31D-A41E-01,TCGA-3C-AALJ-01,,c924c2a8-ab41-4499-bb30-79705cc17d45,TCGA-3C-AALJ-01A-31D-A41E-01,


## 3. Schema Probe (Shape, Columns, Data Types, NA Summary)

In [7]:
print("Expression shape (raw):", expr.shape)
print("Phenotype  shape (raw):", pheno.shape)

Expression shape (raw): (20530, 1219)
Phenotype  shape (raw): (1247, 194)
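
The shapes suggest the expected genes × samples orientation for the expression matrix, but this can also be checked from the header. The sketch below is one possible heuristic, assuming sample columns carry TCGA-style names; `looks_genes_by_samples`, the regular expression, and the 0.9 threshold are illustrative choices, and the toy frames are made up.

```python
import re

import pandas as pd

# Sample-level TCGA barcode prefix, e.g. TCGA-AR-A5QQ-01 (assumed pattern).
TCGA_RE = re.compile(r"^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}-\d{2}")


def looks_genes_by_samples(df: pd.DataFrame, min_frac: float = 0.9) -> bool:
    """True if most columns after the first look like TCGA sample barcodes."""
    names = [str(c) for c in df.columns[1:]]
    if not names:
        return False
    hits = sum(bool(TCGA_RE.match(n)) for n in names)
    return hits / len(names) >= min_frac


# Toy frames: one in genes x samples layout, one in samples x features layout.
genes_by_samples = pd.DataFrame(
    {"sample": ["G1"], "TCGA-AR-A5QQ-01": [1.0], "TCGA-D8-A1JA-01": [2.0]}
)
samples_by_features = pd.DataFrame(
    {"sampleID": ["TCGA-AR-A5QQ-01"], "G1": [1.0], "G2": [2.0]}
)
print(looks_genes_by_samples(genes_by_samples))     # True
print(looks_genes_by_samples(samples_by_features))  # False
```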


In [8]:
# data types
print("🧬 Expression matrix data types")
display(expr.dtypes.value_counts().to_frame("count"))

# --- Phenotype data types ---
print("\n🧫 Phenotype data types")
display(pheno.dtypes.value_counts().to_frame("count"))


🧬 Expression matrix data types


Unnamed: 0,count
float64,1218
object,1



🧫 Phenotype data types


Unnamed: 0,count
object,194


In [9]:
# Header sanity: duplicate column names
dups = pd.Series(expr.columns).duplicated(keep=False)
if dups.any():
    print("⚠️ Duplicate column names detected in expression header:")
    display(pd.Series(expr.columns)[dups].value_counts())
else:
    print("✓ No duplicate column names in expression header.")

✓ No duplicate column names in expression header.
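
The header check covers duplicate sample columns only. The same idea can be applied to the gene identifiers in the first column. This is a sketch under the assumption that the first column holds gene IDs; `duplicated_ids` is a hypothetical helper and the toy frame contains made-up values.

```python
import pandas as pd


def duplicated_ids(df: pd.DataFrame, id_col: str) -> pd.Series:
    """Return counts for identifiers that occur more than once in id_col."""
    counts = df[id_col].value_counts()
    return counts[counts > 1]


# Toy frame standing in for the expression table.
toy = pd.DataFrame({"sample": ["GENE_A", "GENE_B", "GENE_A"], "S1": [1.0, 2.0, 3.0]})
dup_genes = duplicated_ids(toy, "sample")
print(dup_genes.to_dict())  # {'GENE_A': 2}
```

Against the real table the call would be `duplicated_ids(expr, expr.columns[0])`; an empty result means every gene ID is unique.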


In [10]:
# Expression: first column should be gene id; check a few values
expr_first_col = expr.columns[0]
print("\n- first expression column (expected gene id column):", expr_first_col)
display(expr.iloc[:5, :5])


- first expression column (expected gene id column): sample


Unnamed: 0,sample,TCGA-AR-A5QQ-01,TCGA-D8-A1JA-01,TCGA-BH-A0BQ-01,TCGA-BH-A0BT-01
0,ARHGEF10L,9.5074,7.4346,9.3216,9.0198
1,HIF3A,1.5787,3.6607,2.7224,1.3414
2,RNF17,0.0,0.6245,0.5526,0.0
3,RNF10,11.3676,11.9181,11.9665,13.1881
4,RNF11,11.1292,13.5273,11.4105,11.0911


In [11]:
# Show the first 7 column names to check the TCGA sample ID structure.
display(pd.Series(expr.columns[:7]))

0             sample
1    TCGA-AR-A5QQ-01
2    TCGA-D8-A1JA-01
3    TCGA-BH-A0BQ-01
4    TCGA-BH-A0BT-01
5    TCGA-A8-A06X-01
6    TCGA-A8-A096-01
dtype: object

In [12]:
# Quick numeric overview for phenotype (if any numeric columns exist).
# Note: pheno was read with dtype=str, so no numeric columns are expected here.
num_cols = pheno.select_dtypes(include=[np.number]).columns
if len(num_cols):
    print("\n- numeric summary (phenotype; first 8 numeric columns):")
    display(pheno[num_cols[:8]].describe().T)
else:
    print("\n- phenotype appears mostly categorical (no numeric columns detected).")


- phenotype appears mostly categorical (no numeric columns detected).


In [13]:
# Missingness overview (phenotype)
print("\n- phenotype columns with >50% missing:")
na_ratio = pheno.isna().mean().sort_values(ascending=False)
display(na_ratio[na_ratio > 0.50].to_frame("NA_ratio"))


- phenotype columns with >50% missing:


Unnamed: 0,NA_ratio
mtsttc_brst_crcnm_hr2_rbb_ps_fndng_flrscnc_n_st_hybrdztn_clcltn,0.999198
days_to_additional_surgery_locoregional_procedure,0.998396
hr2_n_nd_cntrmr_17_cpy_nmbr_mtsttc_brst_crcnm_nlyss_npt_ttl_nmbr,0.998396
metastatic_breast_carcinoma_pos_finding_hr2_rbb2_thr_msr_scl_txt,0.998396
days_to_last_known_alive,0.998396
...,...
SigClust_Unsupervised_mRNA_nature2012,0.581395
SigClust_Intrinsic_mRNA_nature2012,0.581395
_GENOMIC_ID_TCGA_BRCA_PDMarray,0.578188
followup_case_report_form_submission_reason,0.549318


## 4. Candidate Sample-ID & Subtype Columns

In [14]:
# Which columns look like TCGA sample IDs?
sample_id_cols = [c for c in pheno.columns
                  if pheno[c].astype(str).str.contains(r"^TCGA-", na=False).any()]
print("candidate sample ID columns:", sample_id_cols)

candidate sample ID columns: ['sampleID', '_INTEGRATION', '_PATIENT', 'bcr_followup_barcode', 'bcr_patient_barcode', 'bcr_sample_barcode', 'pathology_report_file_name', '_GENOMIC_ID_TCGA_BRCA_RPPA_RBN', '_GENOMIC_ID_TCGA_BRCA_mutation', '_GENOMIC_ID_TCGA_BRCA_PDMRNAseq', '_GENOMIC_ID_TCGA_BRCA_hMethyl450', '_GENOMIC_ID_TCGA_BRCA_PDMRNAseqCNV', '_GENOMIC_ID_TCGA_BRCA_mutation_curated_wustl_gene', '_GENOMIC_ID_TCGA_BRCA_hMethyl27', '_GENOMIC_ID_TCGA_BRCA_PDMarrayCNV', '_GENOMIC_ID_TCGA_BRCA_miRNA_HiSeq', '_GENOMIC_ID_TCGA_BRCA_mutation_wustl_gene', '_GENOMIC_ID_TCGA_BRCA_miRNA_GA', '_GENOMIC_ID_data/public/TCGA/BRCA/miRNA_GA_gene', '_GENOMIC_ID_TCGA_BRCA_gistic2thd', '_GENOMIC_ID_data/public/TCGA/BRCA/miRNA_HiSeq_gene', '_GENOMIC_ID_TCGA_BRCA_G4502A_07_3', '_GENOMIC_ID_TCGA_BRCA_gistic2', '_GENOMIC_ID_TCGA_BRCA_PDMarray']


In [15]:
# Which columns look like subtypes?
subtype_cols = [c for c in pheno.columns
                if re.search(r"(pam50|subtype)", c, flags=re.I)]
print("candidate subtype columns:", subtype_cols if subtype_cols else "None found")

candidate subtype columns: ['Integrated_Clusters_with_PAM50__nature2012', 'PAM50Call_RNAseq', 'PAM50_mRNA_nature2012']


In [16]:
# Peek at a few values for each candidate
for c in sample_id_cols[:1]:
    print(f"\nPreview of sample IDs [{c}]:")
    display(pheno[c].dropna().astype(str).head(3).to_frame())

for c in subtype_cols[:2]:
    print(f"\nSubtype value counts [{c}] (top 3):")
    display(pheno[c].value_counts(dropna=False).head(3).to_frame("count"))


Preview of sample IDs [sampleID]:


Unnamed: 0,sampleID
0,TCGA-3C-AAAU-01
1,TCGA-3C-AALI-01
2,TCGA-3C-AALJ-01



Subtype value counts [Integrated_Clusters_with_PAM50__nature2012] (top 3):


Unnamed: 0_level_0,count
Integrated_Clusters_with_PAM50__nature2012,Unnamed: 1_level_1
,899
3.0,161
4.0,76



Subtype value counts [PAM50Call_RNAseq] (top 3):


Unnamed: 0_level_0,count
PAM50Call_RNAseq,Unnamed: 1_level_1
LumA,434
,291
LumB,194


## 5. TCGA Barcode Probe (barcode16) & Sample Type

In [17]:
def barcode16(b: str) -> str:
    b = str(b).strip()
    b = re.split(r"[^\w-]+", b)[0]  # remove trailing junk
    return b[:16] if b.startswith("TCGA-") else b

In [18]:
# Automatically pick the best sample-ID column from previous section
sample_priority = ["bcr_sample_barcode", "sample", "sampleID"]
pheno_sample_col = next((c for c in sample_priority if c in sample_id_cols), None)

print("Chosen phenotype sample column for barcode QC:", pheno_sample_col)

Chosen phenotype sample column for barcode QC: bcr_sample_barcode


In [19]:
# Expression barcodes
expr_barcode_examples = [barcode16(c) for c in expr.columns[1:6]]
print("Expression barcode16 examples:", expr_barcode_examples)

# Phenotype barcodes
if pheno_sample_col:
    pheno_barcode_examples = pheno[pheno_sample_col].dropna().astype(str).head(6).map(barcode16).tolist()
    print("Phenotype barcode16 examples:", pheno_barcode_examples)
else:
    print("⚠️ No sample-ID column selected from Section 4.")

Expression barcode16 examples: ['TCGA-AR-A5QQ-01', 'TCGA-D8-A1JA-01', 'TCGA-BH-A0BQ-01', 'TCGA-BH-A0BT-01', 'TCGA-A8-A06X-01']
Phenotype barcode16 examples: ['TCGA-3C-AAAU-01A', 'TCGA-3C-AALI-01A', 'TCGA-3C-AALK-01A', 'TCGA-4H-AAAK-01A', 'TCGA-5L-AAT0-01A', 'TCGA-5L-AAT1-01A']
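
The examples differ in length: the expression headers stop at the two-digit sample type (15 characters), while `bcr_sample_barcode` also carries the vial letter (16 characters). Comparing the two tables therefore needs a common truncation. The sketch below shows one way to do that; `sample15` and `id_overlap` are hypothetical helpers, and the IDs are a small illustrative selection rather than real overlap results.

```python
def sample15(barcode: str) -> str:
    """Truncate a TCGA barcode to the 15-character sample level (TCGA-XX-XXXX-NN)."""
    b = str(barcode).strip()
    return b[:15] if b.startswith("TCGA-") else b


def id_overlap(expr_ids, pheno_ids) -> dict:
    """Count shared and unshared sample-level IDs between two collections."""
    e = {sample15(x) for x in expr_ids}
    p = {sample15(x) for x in pheno_ids}
    return {"shared": len(e & p), "expr_only": len(e - p), "pheno_only": len(p - e)}


# Two IDs in expression-header style, two in bcr_sample_barcode style.
expr_ids = ["TCGA-3C-AAAU-01", "TCGA-AR-A5QQ-01"]
pheno_ids = ["TCGA-3C-AAAU-01A", "TCGA-3C-AALI-01A"]
print(id_overlap(expr_ids, pheno_ids))
# {'shared': 1, 'expr_only': 1, 'pheno_only': 1}
```

In the notebook this would be called with `expr.columns[1:]` and the non-null values of the chosen phenotype column.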


In [20]:
# (optional tiny check for sample-type code)
def sample_type_from_barcode16(b16: str) -> str:
    # TCGA-XX-XXXX-01A  -> '01'
    try:
        return b16.split("-")[3][:2]
    except Exception:
        return ""

if pheno_sample_col:
    tmp = pheno[pheno_sample_col].dropna().astype(str).map(barcode16).to_frame("barcode16")
    tmp["sample_type"] = tmp["barcode16"].map(sample_type_from_barcode16)
    type_counts = tmp["sample_type"].value_counts(dropna=False)
    print("Sample type code counts (e.g., '01' = primary tumor):")
    display(type_counts.to_frame("count"))


Sample type code counts (e.g., '01' = primary tumor):


Unnamed: 0_level_0,count
sample_type,Unnamed: 1_level_1
01,932
11,117
06,5
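
The two-digit codes are easier to audit with their labels attached. The sketch below maps a subset of the TCGA sample-type codes to names; `SAMPLE_TYPE_LABELS` and `label_sample_types` are illustrative names, only the codes relevant here are listed, and codes are zero-padded in case the leading zero was lost in display or export.

```python
import pandas as pd

# Subset of the TCGA sample-type code table; anything else is reported as "Other".
SAMPLE_TYPE_LABELS = {
    "01": "Primary Solid Tumor",
    "06": "Metastatic",
    "11": "Solid Tissue Normal",
}


def label_sample_types(codes: pd.Series) -> pd.Series:
    """Map two-digit codes to labels; unknown codes stay visible as 'Other (<code>)'."""
    padded = codes.astype(str).str.zfill(2)
    return padded.map(lambda c: SAMPLE_TYPE_LABELS.get(c, f"Other ({c})"))


codes = pd.Series(["01", "11", "6", "20"])
labels = label_sample_types(codes)
print(labels.tolist())
# ['Primary Solid Tumor', 'Solid Tissue Normal', 'Metastatic', 'Other (20)']
```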


## 6. Subtype Value Exploration 

In [21]:
subtype_counts = {}
for c in subtype_cols:
    vc = pheno[c].value_counts(dropna=False)
    subtype_counts[c] = vc.head(12)

for c, vc in subtype_counts.items():
    print(f"\nSubtype column [{c}] top values:")
    display(vc.to_frame("count"))



Subtype column [Integrated_Clusters_with_PAM50__nature2012] top values:


Unnamed: 0_level_0,count
Integrated_Clusters_with_PAM50__nature2012,Unnamed: 1_level_1
,899
3.0,161
4.0,76
2.0,72
1.0,39



Subtype column [PAM50Call_RNAseq] top values:


Unnamed: 0_level_0,count
PAM50Call_RNAseq,Unnamed: 1_level_1
LumA,434
,291
LumB,194
Basal,142
Normal,119
Her2,67



Subtype column [PAM50_mRNA_nature2012] top values:


Unnamed: 0_level_0,count
PAM50_mRNA_nature2012,Unnamed: 1_level_1
,725
Luminal A,231
Luminal B,127
Basal-like,98
HER2-enriched,58
Normal-like,8
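
The two PAM50 columns spell the same five classes differently (`LumA` versus `Luminal A`, and so on), so they cannot be compared directly. The sketch below maps the long spellings onto the short ones for comparison only; the alias table and `harmonize_pam50` are assumptions based on the value counts above, and nothing is written back to the raw data.

```python
import pandas as pd

# Assumed mapping from long-form labels to the short labels seen in PAM50Call_RNAseq.
PAM50_ALIASES = {
    "Luminal A": "LumA",
    "Luminal B": "LumB",
    "Basal-like": "Basal",
    "HER2-enriched": "Her2",
    "Normal-like": "Normal",
}


def harmonize_pam50(labels: pd.Series) -> pd.Series:
    """Return a copy with long-form labels replaced; other values are left unchanged."""
    return labels.replace(PAM50_ALIASES)


# Toy label vectors (not real samples).
short_form = pd.Series(["LumA", "Basal", "LumB"])
long_form = pd.Series(["Luminal A", "Basal-like", "HER2-enriched"])
agree = harmonize_pam50(short_form) == harmonize_pam50(long_form)
print(agree.tolist())  # [True, True, False]
```

Applied to `PAM50Call_RNAseq` and `PAM50_mRNA_nature2012` on rows where both are present, the mean of `agree` would give a rough concordance rate between the two calls.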


## 7. Save Small Previews 

In [22]:
# Expression preview: 6 genes x 6 samples (raw, unmodified)
expr_preview = expr.iloc[:6, :6]
expr_preview.to_csv(DATA_INTERIM / "expr_preview.tsv", sep="\t", index=False)

# Phenotype preview: first 8 rows of relevant columns
# De-duplicate while preserving column order (set() would make the order arbitrary)
keep_cols = list(dict.fromkeys((sample_id_cols[:1] if sample_id_cols else []) + subtype_cols[:3]))
pheno_preview = pheno[keep_cols].head(8) if keep_cols else pheno.head(8)
pheno_preview.to_csv(DATA_INTERIM / "pheno_preview.tsv", sep="\t", index=False)

print("✓ Saved:", DATA_INTERIM / "expr_preview.tsv")
print("✓ Saved:", DATA_INTERIM / "pheno_preview.tsv")


✓ Saved: c:\Projects\BRCATranstypia\data\interim\expr_preview.tsv
✓ Saved: c:\Projects\BRCATranstypia\data\interim\pheno_preview.tsv


## 8. Write Human-Readable Audit Report

In [None]:
# Fallback md5 (in case it wasn't defined earlier)
if 'md5' not in locals():
    def md5(path: Path, chunk=1<<20) -> str:
        h = hashlib.md5()
        with open(path, "rb") as f:
            while True:
                b = f.read(chunk)
                if not b: break
                h.update(b)
        return h.hexdigest()

# Summaries used in the report
expr_first_col   = expr.columns[0]
expr_sample_cols = expr.columns[1:6].tolist()  # first few sample columns

# ---- Ensure variables from §4/§5 exist (safe fallbacks) ----
sample_id_cols   = locals().get('sample_id_cols', [])
subtype_cols     = locals().get('subtype_cols', [])
pheno_sample_col = locals().get('pheno_sample_col', None)

# Safe init for pheno_subtype_col (avoid referencing before defined)
pheno_subtype_col = locals().get('pheno_subtype_col', None)
if not pheno_subtype_col and subtype_cols:
    pheno_subtype_col = subtype_cols[0]

# Helper to pretty-print lists, tuples, or Index objects inside f-strings
def _fmt_list(x):
    return ", ".join(map(str, x)) if isinstance(x, (list, tuple, pd.Index)) else str(x)

# Build the report text
report = f"""
RAW FILES
- Expression: {expr_path.name} | shape={expr.shape} | md5={md5(expr_path)}
- Phenotype:  {pheno_path.name} | shape={pheno.shape} | md5={md5(pheno_path)}

EXPRESSION
- First column (gene id): {expr_first_col}
- First few sample columns: {_fmt_list(expr_sample_cols)}
- Example barcode16 from expression: {_fmt_list(expr_barcode_examples) if 'expr_barcode_examples' in locals() else 'N/A'}

PHENOTYPE
- Candidate sample ID columns: {_fmt_list(sample_id_cols)}
- Candidate subtype columns:  {_fmt_list(subtype_cols)}
- Chosen sample-ID column: {pheno_sample_col if pheno_sample_col else 'N/A'}
- Chosen subtype column:   {pheno_subtype_col if pheno_subtype_col else 'N/A'}
- Example barcode16 from phenotype ({pheno_sample_col}): {_fmt_list(pheno_barcode_examples) if 'pheno_barcode_examples' in locals() else 'N/A'}

SAMPLE TYPE COUNTS
{type_counts.to_string() if 'type_counts' in locals() else 'N/A'}

SUBTYPE VALUE COUNTS (top 10 each)
"""

# Append top values for each detected/selected subtype column
candidates = subtype_cols if subtype_cols else ([pheno_subtype_col] if pheno_subtype_col else [])
for c in candidates:
    vc = pheno[c].value_counts(dropna=False).head(10)
    report += f"\n[{c}]\n{vc.to_string()}\n"

# Note for future preprocessing rename
report += "\nNOTE: The first column in the expression file is named 'sample' but contains gene IDs (to be renamed in preprocessing)."

# Write the report file
out_path = REPORTS / "data_audit.txt"
out_path.write_text(textwrap.dedent(report).strip(), encoding="utf-8")
print("✅ Wrote report:", out_path)


✅ Wrote report: c:\Projects\BRCATranstypia\reports\data_audit.txt


## 9. QC Checklist (manual confirmations)

- [ ] Expression first column is gene id (not a sample).
- [ ] Phenotype has at least one **sample-ID** column with `TCGA-` prefixes.
- [ ] Phenotype has a **subtype** column with values like LumA/LumB/Her2/Basal (or close variants).
- [ ] Most samples have sample type **'01'** (primary tumor), as required for the target analysis.
- [ ] Tiny previews saved under `data/interim/`.
- [ ] Text report written to `reports/data_audit.txt`.

> If any box is unchecked, fix (or note) before proceeding to `02_preprocess_align.ipynb`.
