# LOGS and data prep for SANDBOMICS

In [4]:
!pip install scanpy
import scanpy as sc
import pandas as pd
import numpy as np
from typing import Dict



My earlier data (h5ad matrices) carry more information about the cells than the RECODE output does (IDs, molecular subtypes, etc.), and I would not like to lose it. I would like to add this information to the CSV files generated by the RECODE analysis.
Example on smallest TN sample:
- Old object: adata_triplenegative_epithelial_improved.h5ad
    - 7561 cells × 33514 genes, ~93.5% sparse.
    - adata.obs["molecular_subtype"], sample_name, etc. already defined.

- New RECODE CSV: TN_RECODE_sig_genes_cellsxgenes_forWGCNA.csv
    - 7561 rows × 16380 genes (after dropping the index column).
    - Same cells, fewer genes, denoised counts.

- So for clustering/UMAP in the SANBOMICS notebook, best is:
    - Build a new AnnData from the RECODE expression.
    - Copy the obs metadata (including molecular_subtype) from the old adata onto this new one, matched by cell ID.
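
A minimal sketch of the "matched by cell ID" step, assuming both tables are indexed by the same string cell IDs. The toy frames and the helper `align_obs_to_expression` are hypothetical and only illustrate the alignment; they are not taken from the real files.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the RECODE matrix and the old adata.obs
expr = pd.DataFrame(
    {"GENE_A": [1.0, 0.0, 3.0], "GENE_B": [0.0, 2.0, 1.0]},
    index=pd.Index(["0", "2", "4"], name="cell_id"),
)
old_obs = pd.DataFrame(
    {"molecular_subtype": ["Normal", "Normal", "LumA"]},
    index=pd.Index(["4", "0", "2"], name="cell_id"),  # different row order on purpose
)


def align_obs_to_expression(expr: pd.DataFrame, obs: pd.DataFrame) -> pd.DataFrame:
    """Return obs rows reordered to match expr rows, failing loudly on mismatches."""
    if not expr.index.is_unique or not obs.index.is_unique:
        raise ValueError("cell IDs must be unique before matching")
    missing = expr.index.difference(obs.index)
    if len(missing) > 0:
        raise KeyError(f"{len(missing)} cells have no metadata")
    return obs.loc[expr.index]


aligned_obs = align_obs_to_expression(expr, old_obs)
# aligned_obs and expr now share the same row order, so they could be passed as
# obs=aligned_obs and X=expr.to_numpy() when constructing the new AnnData.
print(aligned_obs)
```

Reordering by label rather than by position means a shuffled or filtered metadata table cannot silently attach a subtype to the wrong cell.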

Okay, this seems a bit more complicated than I expected: the kernel keeps crashing because of the large amount of data. I tried to loop and automate, which didn't work; I also tried working with each sample separately without a loop, which didn't work either.

So now I will try to:
extract the columns I need from the big h5ad dataset; this should give a smaller dataset than the one I had initially, so it should not crash.
Then I am going to take the CSV file I created with RECODE, which is already smaller, and merge it with the file I created by extracting the metadata from the dataset.
What I could do is first try it with the smallest TN sample: save the extracted metadata from the h5ad dataset and run a check first (read the rows and columns and check that they match expectations); if the check is good, merge with the existing RECODE CSV file, then check again that everything is as I want.
If it works, I could create a class and loop over the 5 other samples so it is automated; in that case it won't be necessary to save the metadata-only file, since it would take up space.

In [1]:
import scanpy as sc
import pandas as pd

#Load original TN AnnData
adata_tn = sc.read_h5ad(
    "/triumvirate/home/alexarol/breast_cancer_analysis/results/adata_triplenegative_epithelial_improved.h5ad"
)
adata_tn.obs_names_make_unique()

#Select obs columns you care about
obs_cols = [
    "sample_name",
    "sample_type",
    "molecular_subtype",
    "cell_type",
    "epithelial_score",
    "immune_score",
    "geo_id",
]

meta_tn = adata_tn.obs[obs_cols].copy()
meta_tn["cell_id"] = meta_tn.index.astype(str)

#Save metadata to a small CSV
meta_tn_path = "/triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/tn/TN_metadata_only.csv"
meta_tn.to_csv(meta_tn_path, index=False)

print("Saved TN metadata to:", meta_tn_path)
print(meta_tn.head())
print(meta_tn.shape)

Saved TN metadata to: /triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/tn/TN_metadata_only.csv
   sample_name     sample_type molecular_subtype   cell_type  \
0      TN-0126  TripleNegative            Normal  Epithelial   
2      TN-0126  TripleNegative              LumA  Epithelial   
4      TN-0126  TripleNegative            Normal  Epithelial   
9      TN-0126  TripleNegative            Normal  Epithelial   
15     TN-0126  TripleNegative            Normal  Epithelial   

    epithelial_score  immune_score      geo_id cell_id  
0               26.0      7.250000  GSM4909281       0  
2                5.0      0.923077  GSM4909281       2  
4               20.0      1.750000  GSM4909281       4  
9                3.6      1.375000  GSM4909281       9  
15              11.8      8.428571  GSM4909281      15  
(7561, 8)


  utils.warn_names_duplicates("obs")


In [2]:
#Load RECODE expression CSV
expr_path = "/triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/tn/TN_RECODE_sig_genes_cellsxgenes_forWGCNA.csv"
expr_df = pd.read_csv(expr_path, dtype={0: str})

expr_df.rename(columns={expr_df.columns[0]: "cell_id"}, inplace=True)

#Load TN metadata
meta_tn_path = "/triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/tn/TN_metadata_only.csv"
meta_tn = pd.read_csv(meta_tn_path, dtype={"cell_id": str})

#Check that cell IDs overlap as expected
print("Expr rows:", expr_df.shape[0])
print("Meta rows:", meta_tn.shape[0])

common_ids = set(expr_df["cell_id"]) & set(meta_tn["cell_id"])
print("Common cell_ids:", len(common_ids))

#Merge metadata + expression (metadata on the left)
merged_tn = meta_tn.merge(expr_df, on="cell_id", how="inner")

print("Merged shape:", merged_tn.shape)
print(merged_tn.head())

Expr rows: 7561
Meta rows: 7561
Common cell_ids: 7561
Merged shape: (7561, 16388)
  sample_name     sample_type molecular_subtype   cell_type  epithelial_score  \
0     TN-0126  TripleNegative            Normal  Epithelial              26.0   
1     TN-0126  TripleNegative              LumA  Epithelial               5.0   
2     TN-0126  TripleNegative            Normal  Epithelial              20.0   
3     TN-0126  TripleNegative            Normal  Epithelial               3.6   
4     TN-0126  TripleNegative            Normal  Epithelial              11.8   

   immune_score      geo_id cell_id  AL627309.1  AL669831.5  ...  MT-ND4L  \
0      7.250000  GSM4909281       0           0           0  ...        5   
1      0.923077  GSM4909281       2           0           0  ...        0   
2      1.750000  GSM4909281       4           0           0  ...        2   
3      1.375000  GSM4909281       9           0           0  ...        2   
4      8.428571  GSM4909281      15           

- Expr rows: 7561, Meta rows: 7561, Common cell_ids: 7561 → all cells match; nothing lost.
- Merged shape: (7561, 16388) →
    - 7 metadata columns (sample_name, sample_type, molecular_subtype, cell_type, epithelial_score, immune_score, geo_id),
    - 1 cell_id column,
    - 16380 gene columns,
    - 7 + 1 + 16380 = 16388 columns total.
- The head shows exactly what i want: each row has subtype (Normal, LumA, etc.), sample, scores, plus all gene counts.
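
The column arithmetic above can also be checked in code rather than by eye. This is a small sketch on toy frames (hypothetical values); `validate="one_to_one"` is a standard `DataFrame.merge` option that raises if a cell ID is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for the metadata CSV and the RECODE expression CSV
meta = pd.DataFrame({
    "cell_id": ["0", "2", "4"],
    "molecular_subtype": ["Normal", "LumA", "Normal"],
    "geo_id": ["GSM4909281"] * 3,
})
expr = pd.DataFrame({
    "cell_id": ["4", "0", "2"],
    "GENE_A": [3, 1, 0],
    "GENE_B": [1, 0, 2],
})

# Raises pandas.errors.MergeError if cell_id is not unique in both tables
merged = meta.merge(expr, on="cell_id", how="inner", validate="one_to_one")

n_meta_cols = meta.shape[1] - 1   # metadata columns without cell_id
n_gene_cols = expr.shape[1] - 1   # gene columns without cell_id
expected_cols = n_meta_cols + 1 + n_gene_cols

# No cells lost and no unexpected columns
assert merged.shape == (len(meta), expected_cols)
print("Merged shape OK:", merged.shape)
```

With the real TN files the same check would read 7 + 1 + 16380 = 16388 columns.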

In [3]:
merged_out = "/triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/tn/TN_RECODE_with_metadata.csv"
merged_tn.to_csv(merged_out, index=False)
print("Saved merged TN RECODE+metadata CSV to:", merged_out)

Saved merged TN RECODE+metadata CSV to: /triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/tn/TN_RECODE_with_metadata.csv


I will try looping it over the other 5 samples. This didn't work, it crashes; I will try just saving the metadata to CSV files without merging.

In [2]:
def export_metadata_from_h5ad(
    orig_h5ad_path: str,
    out_meta_csv_path: str,
    obs_cols=None,
):
    """
    Load a full h5ad, extract selected obs columns + cell_id,
    and save as a small metadata CSV.
    """
    if obs_cols is None:
        obs_cols = [
            "sample_name",
            "sample_type",
            "molecular_subtype",
            "cell_type",
            "epithelial_score",
            "immune_score",
            "geo_id",
        ]

    adata = sc.read_h5ad(orig_h5ad_path)
    adata.obs_names_make_unique()

    meta = adata.obs[obs_cols].copy()
    meta["cell_id"] = meta.index.astype(str)

    meta.to_csv(out_meta_csv_path, index=False)
    print(f"Saved metadata to: {out_meta_csv_path}  (rows: {meta.shape[0]})")

In [4]:
base = "/triumvirate/home/alexarol/breast_cancer_analysis"

meta_targets = {
    "Normal": {
        "orig_h5ad": f"{base}/results/adata_normal_epithelial_improved.h5ad",
        "out_meta": f"{base}/results/recode_outputs/normal/Normal_metadata_only.csv",
    },
    "ER_Positive": {
        "orig_h5ad": f"{base}/results/adata_er_positive_epithelial_improved.h5ad",
        "out_meta": f"{base}/results/recode_outputs/er_positive/ER_Positive_metadata_only.csv",
    },
    "HER2_Positive": {
        "orig_h5ad": f"{base}/results/adata_her2_positive_epithelial_improved.h5ad",
        "out_meta": f"{base}/results/recode_outputs/her2_positive/HER2_Positive_metadata_only.csv",
    },
    "TN_BRCA1": {
        "orig_h5ad": f"{base}/results/adata_triplenegative_brca1_epithelial_improved.h5ad",
        "out_meta": f"{base}/results/recode_outputs/tn_brca1/TN_BRCA1_metadata_only.csv",
    },
    "BRCA1_preneoplastic": {
        "orig_h5ad": f"{base}/results/adata_brca1_preneoplastic_epithelial_improved.h5ad",
        "out_meta": f"{base}/results/recode_outputs/preneoplastic/BRCA1_preneoplastic_metadata_only.csv",
    },
}

for label, paths in meta_targets.items():
    print(f"Exporting metadata for {label}...")
    export_metadata_from_h5ad(
        orig_h5ad_path=paths["orig_h5ad"],
        out_meta_csv_path=paths["out_meta"],
    )

Exporting metadata for Normal...


  utils.warn_names_duplicates("obs")


Saved metadata to: /triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/normal/Normal_metadata_only.csv  (rows: 83522)
Exporting metadata for ER_Positive...


  utils.warn_names_duplicates("obs")


Saved metadata to: /triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/er_positive/ER_Positive_metadata_only.csv  (rows: 91908)
Exporting metadata for HER2_Positive...


  utils.warn_names_duplicates("obs")


Saved metadata to: /triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/her2_positive/HER2_Positive_metadata_only.csv  (rows: 19693)
Exporting metadata for TN_BRCA1...
Saved metadata to: /triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/tn_brca1/TN_BRCA1_metadata_only.csv  (rows: 14186)
Exporting metadata for BRCA1_preneoplastic...
Saved metadata to: /triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/preneoplastic/BRCA1_preneoplastic_metadata_only.csv  (rows: 7644)


  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")


Now I will proceed with SANDBOMICS before building the networks.

# SANDBOMICS

Working per “analysis type” and looping over samples: this is a good moment to decide how to handle HVGs inside that SANBOMICS-style workflow.

1. Using the SANBOMICS notebook with RECODE
- Start each analysis section from a RECODE expression matrix (per sample), loaded as an AnnData.
- Skip parts that assume raw sequencing counts (alignment, raw QC, very aggressive filtering).
- Keep:
    - library-size normalization (for visualization),
    - log1p transform,
    - PCA, neighbors, UMAP, Leiden, marker plots, etc.
- Run these steps in loops over the 6 RECODE AnnData objects inside each “chapter” (QC, UMAP, clustering, etc.).
- This way, SANBOMICS is a visualization + exploration layer on top of RECODE, not a second denoising pipeline.

2. Where to put HVG selection
- Effectively, I have two layers of HVG logic now:
    - RECODE already gave me sets like *_sig_genes_atleast2 and the *_cellsxgenes_forWGCNA CSVs, which are biologically enriched and denoising-aware.
    - Scanpy/SANBOMICS normally calls sc.pp.highly_variable_genes on the raw (or normalized) data to pick, e.g., 2000 HVGs for clustering.
    - Given that the RECODE paper explicitly recommends using the denoised variance as the basis for HVGs, and the new HVG method is designed exactly for that, I am already doing the “hard” selection in a principled way.

- For networks and final analyses (correlations, WGCNA‑like, etc.):
    - I will use RECODE-based HVGs and gene sets that I control (e.g. 3000 HVGs per condition, from my variance ranks).
- For clustering/UMAP only in SANBOMICS:
    - I can either:
        - reuse the same RECODE HVG subset (e.g. limit .var to my chosen HVGs and run PCA/UMAP on those), or
        - let Scanpy pick a local HVG set for clustering only, but keep that separate from the network HVGs.

I already have a nice TN example with 3000 HVGs and reasonable sparsity, so I'd keep the HVG limit under my control rather than fully delegating it to Scanpy:
- Decide per condition: e.g. 3000–4000 HVGs for both clustering and networks.
- Implement that as a simple variance‑based ranking on the RECODE matrix per sample.

my thoughts: fix a core HVG set now, use it for Scanpy visualization to verify data quality, then use the same genes for correlation and networks, and later expand HVG size for sensitivity analyses

Then I should take the 3000 as a baseline and do the SANDBOMICS (Scanpy) steps,
then I have the visualization of the data (the visualization should show me the quality of the data; I hope it is alright, since I have run RECODE and will now choose the core 3000 highly variable genes, which will always stay as the core: those are the most important genes),
and I can run the correlation; once the correlation is done for those genes I will have a matrix ready to build the networks.
And yes, for networks I will use the approach where I start with a combined analysis to identify shared and genotype-associated modules
    (I need to think about this: I have 1 healthy group, 1 precancerous (for BRCA1) and 4 cancerous,
    so if I abbreviate all the data this way:
        Normal = N, 
        ER_Positive = E, 
        HER2_Positive = H, 
        TripleNegative = TN, 
        TripleNegative_BRCA1 = TNB, 
        BRCA1_preneoplastic = PreB; 
    i could maybe do these pairs: 
        N + PreB, 
        PreB+TNB,
        N+E, 
        N+H, 
        N+TN, 
        N+TNB; 
    and then i do group specific approach and i build networks for:
        N, 
        E, 
        H, 
        TN, 
        TNB, 
        PreB 
            (maybe even specify it for molecular subtypes as well?)
        and I don't know, but maybe I can somehow create a combined version of N+PreB+TNB?)

then we build networks, analyze them and probably repeat the same process few more times with different limit for HVGs?

CORE: For each: RECODE → 3000 HVGs → correlation across cells → network → modules + hubs
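
A toy sketch of the last steps of that chain (correlation across cells → network → hubs). Everything here is an assumption for illustration: random data instead of RECODE output, Pearson correlation, and a hard threshold of 0.5 on |r| as a stand-in for whatever network construction is finally chosen (WGCNA-style soft thresholding would differ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matrix: 200 cells x 5 genes; genes 0-2 share a latent signal, genes 3-4 are noise
latent = rng.normal(size=200)
expr = rng.normal(size=(200, 5))
expr[:, :3] += 3.0 * latent[:, None]

# Gene x gene Pearson correlation computed across cells (columns are genes)
corr = np.corrcoef(expr, rowvar=False)
np.fill_diagonal(corr, 0.0)  # ignore self-correlation

# Hypothetical hard threshold turning correlations into an unweighted network
adjacency = np.abs(corr) >= 0.5

# Degree = number of neighbours; the highest-degree genes are hub candidates
degree = adjacency.sum(axis=1)
hub_order = np.argsort(-degree)
print("degree per gene:", degree)
print("genes ranked by degree:", hub_order)
```

In the real analysis the 5 toy genes would be the 3000 core HVGs, so the correlation matrix would be 3000 × 3000 regardless of how many cells a group has.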

ISSUE: Because my groups differ in size (e.g. TN very small vs Normal huge in cell count), for combined analyses I may want to downsample cells per group to balance representation and avoid one group dominating correlations
- Balancing HVGs and balancing cells solve different issues; using the same number of HVGs does not fix the imbalance in cell counts between groups
- HVG choice controls which genes I analyze.
- Downsampling controls how many cells per group contribute to the correlation estimates.

In a combined analysis (e.g. N + TN):
- If Normal has 80k cells and TN has 7k, most pairwise correlations will be driven by patterns in the Normal cells simply because they dominate the sample size, even if you use the same 3000 HVGs.
- Downsampling (e.g. take 7k cells from Normal as well) gives each group more equal “weight” in the correlation structure, so genotype‑associated modules are easier to see and interpret.

So:
- For per‑group networks (N only, TN only, etc.), no downsampling is needed; I just use all cells for that group.
- For combined networks (N + TN, N + PreB, etc.), it is still wise to consider per‑group downsampling to avoid one group dominating, even though the HVG set is the same.
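
A minimal sketch of per-group downsampling for a combined analysis, assuming a table with one row per cell and a group label. The column names, the helper `downsample_to_smallest`, and the choice to shrink every group to the size of the smallest one are illustrative assumptions, not a fixed decision.

```python
import pandas as pd

# Toy cell table: 9 "N" cells and 3 "TN" cells (hypothetical IDs)
cells = pd.DataFrame({
    "cell_id": [f"c{i}" for i in range(12)],
    "group": ["N"] * 9 + ["TN"] * 3,
})


def downsample_to_smallest(cells: pd.DataFrame, group_col: str = "group",
                           seed: int = 0) -> pd.DataFrame:
    """Randomly keep the same number of cells per group (size of the smallest group)."""
    n_min = cells[group_col].value_counts().min()
    # random_state makes the draw reproducible between runs
    return cells.groupby(group_col).sample(n=n_min, random_state=seed)


balanced = downsample_to_smallest(cells)
print(balanced["group"].value_counts())
```

Repeating the draw with a few different seeds would show whether modules found in the combined network depend on which Normal cells happened to be sampled.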

## Setting up SANDBOMICS

Define:
- A dictionary of my 6 RECODE CSV paths.
- A function to load one CSV into AnnData (no metadata for now).
- A function to select HVGs (e.g. top 3000 by variance) per adata.

In [2]:
base = "/triumvirate/home/alexarol/breast_cancer_analysis"

recode_csvs = {
    "TN": f"{base}/results/recode_outputs/tn/TN_RECODE_sig_genes_cellsxgenes_forWGCNA.csv",
    "Normal": f"{base}/results/recode_outputs/normal/Normal_RECODE_sig_genes_cellsxgenes_forWGCNA.csv",
    "ER_Positive": f"{base}/results/recode_outputs/er_positive/ER_Positive_RECODE_sig_genes_cellsxgenes_forWGCNA.csv",
    "HER2_Positive": f"{base}/results/recode_outputs/her2_positive/HER2_Positive_RECODE_sig_genes_cellsxgenes_forWGCNA.csv",
    "TN_BRCA1": f"{base}/results/recode_outputs/tn_brca1/TN_BRCA1_RECODE_sig_genes_cellsxgenes_forWGCNA.csv",
    "BRCA1_preneoplastic": f"{base}/results/recode_outputs/preneoplastic/BRCA1_preneoplastic_RECODE_sig_genes_cellsxgenes_forWGCNA.csv",
}

I will not do doublet detection and removal via scVI/SOLO (explained on paper).

## Flagging mitochondrial genes
- What: Identify mitochondrial genes (MT- prefix) and quantify mitochondrial counts per cell.
- Why: High mitochondrial fraction is a classic sign of low‑quality or stressed cells; i want to see these metrics and later use them for QC filtering and plots.
- Input: RECODE‑denoised expression matrices (cells × genes) loaded as AnnData, one per condition.
- Output: For each adata:
    - adata.var['mt'] marking mitochondrial genes.
    - adata.obs['total_counts'], adata.obs['n_genes_by_counts'], adata.obs['pct_counts_mt'] (and other QC metrics).
- Usage later: i will use these QC metrics to visualize data quality (violin plots, UMAP coloring) and to define cell‑level filters before network construction.
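
The same metrics can be sketched directly on a cells × genes DataFrame, which avoids building an AnnData just to look at them. This is a pandas-only illustration on toy values; the column names mirror those that `sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)` writes to `adata.obs`, but the code below does not call Scanpy.

```python
import pandas as pd

# Toy cells x genes matrix (hypothetical counts)
expr = pd.DataFrame(
    {"MT-ND4L": [5.0, 0.0], "MT-CO1": [5.0, 0.0], "EPCAM": [10.0, 4.0]},
    index=pd.Index(["0", "2"], name="cell_id"),
)

# Flag mitochondrial genes by the MT- prefix (human gene symbols)
is_mt = expr.columns.str.startswith("MT-")

total_counts = expr.sum(axis=1)
mt_counts = expr.loc[:, is_mt].sum(axis=1)

qc = pd.DataFrame({
    "total_counts": total_counts,
    "n_genes_by_counts": (expr > 0).sum(axis=1),
    # Cells with total_counts == 0 would give NaN here
    "pct_counts_mt": 100.0 * mt_counts / total_counts,
})
print(qc)
```

One caveat: if HVG preselection drops most MT- genes, `pct_counts_mt` computed on the HVG subset is no longer the fraction of the cell's full profile, so it may be safer to compute it before subsetting.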

The kernel crashes on big samples, so I need a different approach.
The memory issue comes from adata = sc.AnnData(X=expr.values) on very large expr, not from adding the mt column.

Because of this, I need to change approach. Here is what I am thinking about:
- I need to select HVGs before proceeding; there is no other way I am aware of. But if I select HVGs for one of the samples, I should do it for all of them and only then proceed further.
- It may not be a good idea to cut everything down to as few as 3000 genes right away; I could reduce further later.
- Since I know that this flagging worked with 7561 cells and 16380 genes, could I use this size as a starting maximum?
- It is probably not the number of genes that is the problem but the number of cells, and even if I choose the most highly variable genes the number of cells would not go down, no?

Two separate things matter now for memory and for biology:
- Memory / crash risk
- Statistical meaning (cells vs genes)

The kernel crashes because creating AnnData(X=expr.values) for very large samples allocates a big dense array in RAM: roughly n_cells × n_genes.

Reducing genes (HVGs) shrinks this product.

Reducing cells would shrink it too, but that does change the biology because you’d lose cells.

So:
- Yes, HVG selection reduces memory.
- No, HVGs do not reduce the number of cells; they only cut columns. this is my logic
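
A rough worked estimate of that n_cells × n_genes product, as a sketch. It only counts one dense array; the pandas DataFrame read from the CSV and any intermediate copies come on top, so real peak usage can be a multiple of this. The cell counts are the ones printed earlier in this notebook (7561 for TN, 83522 rows in the Normal metadata export); 6000 is the HVG cap considered here.

```python
import numpy as np


def dense_bytes(n_cells: int, n_genes: int, dtype=np.float64) -> int:
    """Bytes needed for one dense n_cells x n_genes array of the given dtype."""
    return n_cells * n_genes * np.dtype(dtype).itemsize


def to_gib(n_bytes: int) -> float:
    return n_bytes / 2**30


tn_full = dense_bytes(7561, 16380)                     # TN, all RECODE genes, float64
normal_hvg = dense_bytes(83522, 6000)                  # Normal, 6000 HVGs, float64
normal_hvg32 = dense_bytes(83522, 6000, np.float32)    # same, stored as float32

for name, b in [("TN full", tn_full), ("Normal 6000 HVGs", normal_hvg),
                ("Normal 6000 HVGs float32", normal_hvg32)]:
    print(f"{name}: {to_gib(b):.2f} GiB")
```

Storing the matrix as float32 halves the footprint compared with float64, which is a separate lever from cutting genes.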

Chunking big samples would not work now, because it would:
- Break per‑sample QC summaries (e.g. global distributions of pct_counts_mt).
- Make it hard to compare cells across the whole sample in UMAP / clustering.

So i should not split the cell dimension for analysis; if i need to process data in pieces for intermediate computations (like computing variances in chunks), that’s fine, but the final AnnData per sample should still contain all cells.
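
Computing variances in chunks, as mentioned above, could look like the sketch below. It streams the CSV with `pd.read_csv(..., chunksize=...)` and accumulates per-gene sums and sums of squares, so only one chunk of cells is in memory at a time. The function name and chunk size are hypothetical, and the sum-of-squares formula is the simple textbook one (less numerically robust than Welford's method), which is an assumption that should be fine for ranking genes.

```python
import io

import numpy as np
import pandas as pd


def chunked_gene_variance(csv_source, chunksize: int = 1000) -> pd.Series:
    """Per-gene sample variance (ddof=1) from a cells x genes CSV, read in chunks.

    The first column is assumed to hold the cell IDs and is skipped.
    """
    n = 0
    total = None
    total_sq = None
    genes = None
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        values = chunk.iloc[:, 1:].to_numpy(dtype=np.float64)
        if total is None:
            genes = chunk.columns[1:]
            total = np.zeros(values.shape[1])
            total_sq = np.zeros(values.shape[1])
        n += values.shape[0]
        total += values.sum(axis=0)
        total_sq += (values ** 2).sum(axis=0)
    var = (total_sq - total ** 2 / n) / (n - 1)
    return pd.Series(var, index=genes)


# Toy check against the in-memory pandas result
toy = pd.DataFrame({
    "cell_id": ["a", "b", "c", "d", "e"],
    "G1": [0, 1, 2, 3, 4],
    "G2": [5, 5, 5, 5, 5],
    "G3": [0, 10, 0, 10, 0],
})
toy_var = chunked_gene_variance(io.StringIO(toy.to_csv(index=False)), chunksize=2)
print(toy_var.sort_values(ascending=False))
```

The final AnnData per sample still contains all cells; only the intermediate variance computation is done in pieces.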

THUS, now i will: 
1: HVG selection per sample directly from the RECODE CSVs
    For each sample:
    - Read the CSV in pandas.
    - Compute per‑gene variance (or RECODE HVG score) across cells.
    - Choose a relatively generous number, e.g. 6000 HVGs per sample as an upper bound (since flagging worked with the TN sample); if everything runs fast enough I could even raise the number.
    - Save the list of HVGs per sample (plain text or CSV of gene names).

2: Load only HVGs into AnnData for SANBOMICS
    For each sample:    
    - Read the CSV again, but immediately subset columns to ["cell_id"] + hvg_genes.
    - Now AnnData.X is n_cells × n_HVGs (e.g. 80k × 6000 for Normal instead of 80k × 16k).
    - Run mito flagging + QC + UMAP on these HVGs.

3 (later): network HVGs
- From those 6000 per sample, define a 3000‑gene core (top by variance) for correlation and networks.
- Use the same core for both clustering and network analysis per condition if you want.

This keeps:
- Cell count intact (no downsampling yet).
- A consistent HVG pipeline across all 6 samples.
- Memory under control by cutting the gene dimension before building dense AnnData objects.
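
Steps 1 and 2 above (save the HVG list, then read only those columns) might be sketched like this. `load_hvg_columns` is a hypothetical helper; it looks up column positions from the header and passes them to `usecols`, so the non-HVG gene columns are not kept in the resulting DataFrame. The toy file and gene names are made up for the demonstration.

```python
import os
import tempfile

import pandas as pd


def load_hvg_columns(csv_path: str, hvg_genes: list) -> pd.DataFrame:
    """Read only the cell-ID column (first column) plus the requested genes."""
    header = pd.read_csv(csv_path, nrows=0).columns  # header only, no data rows
    positions = [0] + [header.get_loc(g) for g in hvg_genes]
    df = pd.read_csv(csv_path, usecols=positions)
    first = df.columns[0]
    # Cast IDs to text after reading (zero-padded numeric IDs would lose padding)
    df[first] = df[first].astype(str)
    df = df.rename(columns={first: "cell_id"}).set_index("cell_id")
    return df[list(hvg_genes)]  # usecols keeps file order, so reorder explicitly


# Toy demonstration with a temporary CSV and a saved HVG list
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_recode.csv")
pd.DataFrame({"cell_id": ["0", "2"], "G1": [1, 2], "G2": [3, 4], "G3": [5, 6]}).to_csv(
    toy_path, index=False
)

hvg_list_path = os.path.join(tmp_dir, "toy_hvgs.txt")
pd.Series(["G3", "G1"]).to_csv(hvg_list_path, index=False, header=False)  # one gene per line

hvgs = pd.read_csv(hvg_list_path, header=None)[0].tolist()
subset = load_hvg_columns(toy_path, hvgs)
print(subset)
```

Saving the gene list to disk also means the separate per-sample notebooks can reuse exactly the same HVGs without recomputing variances.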

### per‑sample HVG preselection from RECODE CSVs
- What: For each sample, compute per‑gene variance directly from the RECODE CSV and keep the top n_top genes.
- Why: This reduces the gene dimension before building AnnData, so large samples (Normal, ER+) fit in memory, while keeping the most informative genes.
- Input: RECODE CSV (cells × genes, first column = cell_id).
- Output: A list of HVG gene names per sample, which i then use to load a smaller AnnData and run mito/QC/UMAP.

I will start with n_top = 6000 and later try 7000, 8000, etc., by changing the parameter.

Yes, this would have worked, but the kernel keeps crashing if I loop,
so I will run the analysis on each sample separately.

In [5]:
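def get_top_hvgs_from_csv(csv_path: str, n_top: int = 6000) -> pd.Index:
    """Return the n_top genes with the highest variance across cells."""
    # First column holds the cell IDs; read it as text so IDs are not parsed as numbers
    df = pd.read_csv(csv_path, dtype={0: str})
    expr = df.iloc[:, 1:]
    # Per-gene sample variance across cells (pandas default, ddof=1)
    var = expr.var(axis=0)
    top_genes = var.sort_values(ascending=False).head(n_top).index
    return top_genes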

n_top_hvg = 6000
hvg_genes: Dict[str, pd.Index] = {}

for label, path in recode_csvs.items():
    print(f"Selecting HVGs for {label} from {path} ...")
    top_genes = get_top_hvgs_from_csv(path, n_top=n_top_hvg)
    hvg_genes[label] = top_genes
    print(f"  {label}: selected {len(top_genes)} HVGs")

Selecting HVGs for TN from /triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/tn/TN_RECODE_sig_genes_cellsxgenes_forWGCNA.csv ...
  TN: selected 6000 HVGs
Selecting HVGs for Normal from /triumvirate/home/alexarol/breast_cancer_analysis/results/recode_outputs/normal/Normal_RECODE_sig_genes_cellsxgenes_forWGCNA.csv ...



At this point the kernel crashes too much; I will do the SANDBOMICS steps for each sample in a separate notebook named: 06_RECODE_SANDBOMICS_[sample_name].ipynb